6.1.3. Train, Validation, and Test Split#
One of the most fundamental concepts in machine learning, and one where beginners make the most costly mistakes, is how to properly split your data. Understanding this concept isn’t just important; it’s critical to building models that work in the real world.
6.1.3.1. The Problem: Why Not Train on All Data?#
Imagine you’re studying for an exam. You have a textbook with practice problems and answers. Here’s the question:
Should you memorize the answers to those specific problems?
Of course not! The exam will have different problems. You need to understand the concepts, not memorize specific answers.
This same principle applies to machine learning. If we train a model on all available data and then test it on that same data, we’re essentially asking: “Can you memorize?” instead of “Can you generalize?”
Warning
Critical Mistake: Training and testing on the same data
If you train and test on the same data, you’re measuring memorization, not generalization. Your model will appear to perform perfectly in development but fail miserably in production.
6.1.3.2. The Three-Way Split#
To properly evaluate and tune a machine learning model, we typically split data into three sets:
Training Set: Used to train the model (learn parameters)
Validation Set: Used to tune hyperparameters and select models
Test Set: Used for final, unbiased evaluation
Think of it like this:
Training set: The textbook you study from
Validation set: Practice exams to check your preparation
Test set: The actual final exam
Example#
Let’s say we have a Customer Purchase Prediction dataset with 1,000 samples.
It has three features:
Age
Income (thousands)
Months as Customer
The label (target) is Will Purchase (0=No, 1=Yes)
Data Split:
Training set: 600 samples (60%)
Validation set: 200 samples (20%)
Test set: 200 samples (20%)
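The 60/20/20 split above can be sketched with scikit-learn by calling `train_test_split` twice: first to carve off the test set, then to carve the validation set out of what remains. The data here is a synthetic stand-in for the hypothetical customer dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1,000 samples, 3 features (age, income,
# months as customer) and a binary "will purchase" label.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = rng.integers(0, 2, size=1000)

# Split off the test set first (20% of the total), then carve the
# validation set out of the remainder (25% of 80% = 20% of the total).
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```

Note that the second `test_size` is relative to the remaining 800 samples, which is why 0.25 (not 0.20) yields a 20% validation set overall.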
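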
6.1.3.3. Training Set: Where Learning Happens#
The training set is where your model learns patterns. This is the only data the model sees during training.
Purpose:
Fit model parameters (weights, coefficients)
Learn the underlying patterns
Build the internal representation
Size: Typically 60-80% of your data
High training accuracy doesn’t guarantee good real-world performance. We need validation/test sets to know if the model generalizes.
Note
Training Set Rules:
Model sees this data during training
Used to learn parameters
Largest portion of your data
Cannot be used for evaluation (biased!)
6.1.3.4. Validation Set: For Hyperparameter Tuning#
The validation set is used to make decisions about your model during development.
Purpose:
Tune hyperparameters (learning rate, tree depth, etc.)
Select between different models
Decide when to stop training (early stopping)
Make architectural choices
Size: Typically 10-20% of your data
Important
Why We Need a Validation Set:
Without a validation set, you would tune hyperparameters using the test set, which leads to:
Overfitting to the test set
Overly optimistic performance estimates
Models that fail in production
The validation set gives you a “practice exam” to tune your approach.
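In practice, hyperparameter tuning with a validation set looks like this: fit one model per candidate value on the training set, score each on the validation set, and keep the best. The sketch below uses synthetic data and an arbitrary grid of `C` values for logistic regression; the specific values are illustrative, not a recommendation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data; in practice X_train/X_val come from your split.
rng = np.random.default_rng(1)
X_train, y_train = rng.normal(size=(600, 3)), rng.integers(0, 2, 600)
X_val, y_val = rng.normal(size=(200, 3)), rng.integers(0, 2, 200)

candidate_Cs = [0.001, 0.01, 0.1, 1.0, 10.0]
best_C, best_score = None, -1.0
for C in candidate_Cs:
    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(X_train, y_train)        # parameters learned on training set
    score = model.score(X_val, y_val)  # hyperparameter judged on validation set
    if score > best_score:
        best_C, best_score = C, score

print(f"best C = {best_C}, validation accuracy = {best_score:.3f}")
```

The test set plays no role in this loop; it stays untouched until all such decisions are final.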
6.1.3.5. Test Set: The Final Honest Evaluation#
The test set is your one-time, final evaluation. It’s like the real exam - you only get one shot.
Purpose:
Provide an unbiased estimate of model performance
Simulate real-world performance
Final “go/no-go” decision
Size: Typically 10-20% of your data
CRITICAL RULE: Never look at the test set until you’re completely done with development!
The test accuracy is our honest estimate of real-world performance. This is the number we report and trust for production deployment.
Warning
Never Touch the Test Set Until the End!
The test set must remain locked away until you have:
Finished selecting models
Finished tuning hyperparameters
Finished making all development decisions
If you use the test set multiple times, it becomes part of your development process and loses its value as an unbiased estimator.
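Once development is frozen, the final evaluation is a single call on the held-out test set. A common (but optional) practice, sketched below with synthetic stand-in data and an assumed already-chosen hyperparameter, is to refit the winning model on training plus validation data before that one evaluation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for a 600/200/200 split.
rng = np.random.default_rng(2)
X_train, y_train = rng.normal(size=(600, 3)), rng.integers(0, 2, 600)
X_val, y_val = rng.normal(size=(200, 3)), rng.integers(0, 2, 200)
X_test, y_test = rng.normal(size=(200, 3)), rng.integers(0, 2, 200)

# All tuning is done; C=0.1 stands in for the value chosen on validation.
# Optionally refit on train + validation, then evaluate ONCE on test.
final_model = LogisticRegression(C=0.1, max_iter=1000)
final_model.fit(np.vstack([X_train, X_val]),
                np.concatenate([y_train, y_val]))
test_accuracy = final_model.score(X_test, y_test)  # the number you report
```

If this number disappoints, resist the urge to go back and retune against it; that would turn the test set into a second validation set.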
6.1.3.6. Common Split Ratios#
Different problems call for different split ratios. Here are common approaches:
| Split Ratio | Use Case |
|---|---|
| 60/20/20 | Standard for medium datasets (1,000-100,000 samples) |
| 70/15/15 | When you need more training data |
| 80/10/10 | Large datasets where even 10% is substantial |
| 70/30 (no val) | Simple models with few hyperparameters |
6.1.3.7. Information Leakage: The #1 Beginner Mistake#
Information leakage occurs when information from the training set “leaks” into your validation or test sets, giving you an unrealistically optimistic view of model performance.
What is Information Leakage?#
Leakage happens when you use information from the entire dataset before splitting. The most common forms:
Preprocessing before splitting (scaling, normalization, imputation)
Feature selection on all data
Oversampling before splitting
Using test data statistics
The Wrong Way: Leakage Example#
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
# WRONG: Fit the scaler on ALL the data before splitting (causes data leakage)
scaler_wrong = StandardScaler()
X_scaled_wrong = scaler_wrong.fit_transform(X)
# Only then split the already-scaled data
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled_wrong, y, test_size=300, random_state=42)
# Train model
model_wrong = LogisticRegression(max_iter=1000, C=0.001)
model_wrong.fit(X_train, y_train)
train_score_wrong = model_wrong.score(X_train, y_train)
test_score_wrong = model_wrong.score(X_test, y_test)
Problem: Scaler fit on ALL 1300 samples (including test data!)
Model performance (WRONG way):
Training accuracy: 92.5%
Test accuracy: 54.7%
What’s wrong with this:
Scaler saw test data statistics (means and standard deviations)
Test set normalized using information it shouldn’t have access to
This LEAKS information from test into training
Results are overly optimistic and won’t generalize to production
In production, you won’t have future data to scale with!
The Right Way: No Leakage#
# Split the data FIRST: 1000 train, 300 test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=300, random_state=42)
# Fit scaler only on training data
scaler_right = StandardScaler()
scaler_right.fit(X_train)
# Transform both sets using training statistics
X_train_scaled_right = scaler_right.transform(X_train)
X_test_scaled_right = scaler_right.transform(X_test)
# Train model
model_right = LogisticRegression(max_iter=1000, C=0.001)
model_right.fit(X_train_scaled_right, y_train)
train_score_right = model_right.score(X_train_scaled_right, y_train)
test_score_right = model_right.score(X_test_scaled_right, y_test)
Split data first: 1000 train, 300 test
Scaler fit on 1000 training samples only
Test set transformed using only training data statistics
Model performance (RIGHT way):
Training accuracy: 95.6%
Test accuracy: 49.0%
Why this is correct:
Scaler fit ONLY on training data
Test set treated as truly unseen data
Results are honest - reflects true performance on new data
Simulates real-world deployment scenario
Comparison: Wrong vs Right
Wrong way test accuracy: 54.7% (overly optimistic - used test data statistics)
Right way test accuracy: 49.0% (honest estimate - only used training statistics)
Difference of 5.7 percentage points represents the bias introduced by data leakage
Key insight: Even a routine step like scaling features that sit on different scales can leak information if done before splitting, significantly inflating your performance estimates!
The Golden Rule of Data Splitting#
Important
The Golden Rule:
SPLIT FIRST, THEN PREPROCESS
Split your data into training, validation, and test sets.
Fit preprocessing ONLY on training data.
Transform all sets using training statistics.
Never let information from val/test set influence preprocessing.
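One way to make the golden rule hard to violate is scikit-learn's `Pipeline`: when preprocessing and model are bundled together, `fit` restricts the scaler to the training data automatically, and `score` applies the learned training statistics to the test set. A minimal sketch with synthetic stand-in data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data.
rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 3))
y = rng.integers(0, 2, size=1000)

# Split first, then let the pipeline handle preprocessing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# fit() scales using training statistics only; score() reuses those same
# statistics on the test set, so no test information leaks into training.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_train, y_train)
test_score = pipe.score(X_test, y_test)
```

The same pipeline object can also be passed to cross-validation utilities, which refit the scaler inside every fold, keeping the golden rule intact there too.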