6.1.3. Train, Validation, and Test Split#

One of the most fundamental concepts in machine learning - and one where beginners make the most costly mistakes - is how to split your data properly. Understanding this concept isn’t just important; it’s critical to building models that work in the real world.

6.1.3.1. The Problem: Why Not Train on All Data?#

Imagine you’re studying for an exam. You have a textbook with practice problems and answers. Here’s the question:

Should you memorize the answers to those specific problems?

Of course not! The exam will have different problems. You need to understand the concepts, not memorize specific answers.

This same principle applies to machine learning. If we train a model on all available data and then test it on that same data, we’re essentially asking: “Can you memorize?” instead of “Can you generalize?”

Warning

Critical Mistake: Training and testing on the same data

If you train and test on the same data, you’re measuring memorization, not generalization. Your model will appear to perform perfectly in development but fail miserably in production.

6.1.3.2. The Three-Way Split#

To properly evaluate and tune a machine learning model, we typically split data into three sets:

  1. Training Set: Used to train the model (learn parameters)

  2. Validation Set: Used to tune hyperparameters and select models

  3. Test Set: Used for final, unbiased evaluation

Think of it like this:

  • Training set: The textbook you study from

  • Validation set: Practice exams to check your preparation

  • Test set: The actual final exam

Example#

Let’s say we have a Customer Purchase Prediction dataset with 1,000 samples.

It has three features:

  • Age

  • Income (thousands)

  • Months as Customer

The label (target) is Will Purchase (0=No, 1=Yes).

Data Split (a code sketch follows the list):

  • Training set: 600 samples (60%)

  • Validation set: 200 samples (20%)

  • Test set: 200 samples (20%)
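Here is a minimal sketch of producing that 60/20/20 split with scikit-learn’s train_test_split. The synthetic features and random seeds below are illustrative stand-ins for the customer table:

import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative stand-in for the customer table: 1000 samples, 3 features
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.integers(18, 80, 1000),   # Age
    rng.normal(50, 15, 1000),     # Income (thousands)
    rng.integers(1, 120, 1000),   # Months as Customer
])
y = rng.integers(0, 2, 1000)      # Will Purchase (0=No, 1=Yes)

# First carve off the 20% test set, then split the remaining 80%
# into train and validation (0.25 of 80% = 20% of the total)
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200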

6.1.3.3. Training Set: Where Learning Happens#

The training set is where your model learns patterns. This is the only data the model sees during training.

Purpose:

  • Fit model parameters (weights, coefficients)

  • Learn the underlying patterns

  • Build the internal representation

Size: Typically 60-80% of your data

High training accuracy doesn’t guarantee good real-world performance. We need validation/test sets to know if the model generalizes.

Note

Training Set Rules:

  • Model sees this data during training

  • Used to learn parameters

  • Largest portion of your data

  • Cannot be used for evaluation (biased!) - see the sketch below
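To see why that last rule matters, here is a small sketch (synthetic data; exact numbers will vary): an unconstrained decision tree can memorize its training set almost perfectly while doing far worse on data it has never seen.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
# Noisy labels: only partly predictable from the first feature
y = (X[:, 0] + rng.normal(scale=2.0, size=1000) > 0).astype(int)

X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.3, random_state=42)

tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print("train accuracy:   ", tree.score(X_train, y_train))      # near 1.0 (memorization)
print("held-out accuracy:", tree.score(X_holdout, y_holdout))  # much lower (generalization)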

6.1.3.4. Validation Set: For Hyperparameter Tuning#

The validation set is used to make decisions about your model during development.

Purpose:

  • Tune hyperparameters (learning rate, tree depth, etc.)

  • Select between different models

  • Decide when to stop training (early stopping)

  • Make architectural choices

Size: Typically 10-20% of your data

Important

Why We Need a Validation Set:

Without a validation set, you would tune hyperparameters using the test set, which leads to:

  • Overfitting to the test set

  • Overly optimistic performance estimates

  • Models that fail in production

The validation set gives you a “practice exam” to tune your approach. A sketch of this tuning loop follows.
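Here is a hedged sketch of that loop on synthetic data (the candidate depths and seeds are illustrative): each hyperparameter setting is fit on the training set and scored on the validation set, and the test set stays untouched.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] - X[:, 1] + rng.normal(scale=1.0, size=1000) > 0).astype(int)

# 60/20/20 split as before
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42)

# Fit on train, score on validation, keep the best setting
best_depth, best_score = None, -1.0
for depth in (1, 2, 3, 5, 8, 12, None):
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    score = model.score(X_val, y_val)
    if score > best_score:
        best_depth, best_score = depth, score

print("best max_depth:", best_depth, "| validation accuracy:", round(best_score, 3))
# Note: the test set has not been touched yet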

6.1.3.5. Test Set: The Final Honest Evaluation#

The test set is your one-time, final evaluation. It’s like the real exam - you only get one shot.

Purpose:

  • Provide an unbiased estimate of model performance

  • Simulate real-world performance

  • Final “go/no-go” decision

Size: Typically 10-20% of your data

CRITICAL RULE: Never look at the test set until you’re completely done with development!

The test accuracy is our honest estimate of real-world performance. This is the number we report and trust for production deployment.

Warning

Never Touch the Test Set Until the End!

The test set must remain locked away until you have:

  • Finished selecting models

  • Finished tuning hyperparameters

  • Finished making all development decisions

If you use the test set multiple times, it becomes part of your development process and loses its value as an unbiased estimator.
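Continuing the tuning sketch from the previous section (same variable names, which are illustrative), the test set comes out exactly once, at the very end. Refitting the chosen model on train + validation before that single evaluation is a common, optional final step:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Only after every development decision is final:
final_model = DecisionTreeClassifier(max_depth=best_depth, random_state=42)
final_model.fit(np.vstack([X_train, X_val]), np.concatenate([y_train, y_val]))

# One shot: this number is what you report
print("final test accuracy:", final_model.score(X_test, y_test))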

6.1.3.6. Common Split Ratios#

Different problems call for different split ratios. Here are common approaches:

| Split Ratio | Use Case |
| --- | --- |
| 60/20/20 | Standard for medium datasets (1,000-100,000 samples) |
| 70/15/15 | When you need more training data |
| 80/10/10 | Large datasets where even 10% is substantial |
| 70/30 (no validation set) | Simple models with few hyperparameters |

6.1.3.7. Information Leakage: The #1 Beginner Mistake#

Information leakage occurs when information from your validation or test sets “leaks” into training or preprocessing, giving you an unrealistically optimistic view of model performance.

What is Information Leakage?#

Leakage happens when you compute anything from the full dataset before splitting (scalers, feature rankings, imputation values). The most common forms:

  1. Preprocessing before splitting (scaling, normalization, imputation)

  2. Feature selection on all data

  3. Oversampling before splitting

  4. Using test-set statistics (means, scales, class frequencies) anywhere in training

The Wrong Way: Leakage Example#

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

np.random.seed(42)

# --- Training distribution ---
n1 = 1000
f1_train = np.random.normal(0, 1, n1)         # informative feature (small scale)
f2_train = np.random.normal(0, 5000, n1)      # large-magnitude noise feature
y_train = (f1_train > 0).astype(int)
X_train = np.column_stack([f1_train, f2_train])

# --- Test distribution: same signal, but shifted ---
n2 = 300
f1_test = np.random.normal(4, 1, n2)          # signal mean shifted to 4
f2_test = np.random.normal(50000, 20000, n2)  # massive scale + mean shift in noise
y_test = (f1_test > 4).astype(int)
X_test = np.column_stack([f1_test, f2_test])

# WRONG: fit the scaler on ALL 1300 samples (train + test together)
X = np.vstack([X_train, X_test])
y = np.concatenate([y_train, y_test])
scaler_wrong = StandardScaler()
X_scaled_wrong = scaler_wrong.fit_transform(X)

# Split the (leakily) scaled data back into its train/test portions
X_train_scaled_wrong = X_scaled_wrong[:n1]
X_test_scaled_wrong = X_scaled_wrong[n1:]

# Train and evaluate on the leaked preprocessing
model_wrong = LogisticRegression(max_iter=1000, C=0.001)
model_wrong.fit(X_train_scaled_wrong, y_train)

train_score_wrong = model_wrong.score(X_train_scaled_wrong, y_train)
test_score_wrong = model_wrong.score(X_test_scaled_wrong, y_test)

Problem: Scaler fit on ALL 1300 samples (including test data!)

Model performance (WRONG way):

  • Training accuracy: 92.5%

  • Test accuracy: 54.7%

What’s wrong with this:

  • Scaler saw test data statistics (means and standard deviations)

  • Test set normalized using information it shouldn’t have access to

  • This LEAKS information from test into training

  • Results are overly optimistic and won’t generalize to production

  • In production, you won’t have future data to scale with!

The Right Way: No Leakage#

# Fit scaler only on training data
scaler_right = StandardScaler()
scaler_right.fit(X_train)

# Transform both sets using training statistics
X_train_scaled_right = scaler_right.transform(X_train)
X_test_scaled_right = scaler_right.transform(X_test)

# Train model
model_right = LogisticRegression(max_iter=1000, C=0.001)
model_right.fit(X_train_scaled_right, y_train)
train_score_right = model_right.score(X_train_scaled_right, y_train)
test_score_right = model_right.score(X_test_scaled_right, y_test)

  • Split first: 1000 training samples, 300 test samples

  • Scaler fit on the 1000 training samples only

  • Test set transformed using only training-data statistics

Model performance (RIGHT way):

  • Training accuracy: 95.6%

  • Test accuracy: 49.0%

Why this is correct:

  • Scaler fit ONLY on training data

  • Test set treated as truly unseen data

  • Results are honest - reflects true performance on new data

  • Simulates real-world deployment scenario

Comparison: Wrong vs Right

  • Wrong way test accuracy: 54.7% (overly optimistic - used test data statistics)

  • Right way test accuracy: 49.0% (honest estimate - only used training statistics)

  • The 5.7-percentage-point difference is the bias introduced by data leakage

Key insight: even a routine preprocessing step like scaling can meaningfully inflate your performance estimates when it is fit on data that includes the test set!

The Golden Rule of Data Splitting#

Important

The Golden Rule:

SPLIT FIRST, THEN PREPROCESS

  1. Split your data into training, validation, and test sets.

  2. Fit preprocessing ONLY on training data.

  3. Transform all sets using training statistics.

  4. Never let information from the validation or test sets influence preprocessing (a pipeline sketch follows).
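A practical way to make the golden rule hard to break is scikit-learn’s Pipeline: because the scaler lives inside the pipeline, it is refit on whatever data .fit() receives - here, only the training set. A minimal sketch with synthetic data (names and parameters are illustrative):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = (X[:, 0] > 0).astype(int)

# Step 1: split first
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Steps 2-3: the pipeline fits the scaler on the training data only,
# then reuses those training statistics when transforming the test set
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_train, y_train)
print("test accuracy:", pipe.score(X_test, y_test))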