6.4.2. Cross-Validation#
You trained a model and tested it on a hold-out set. Accuracy: 92%.
But is that number trustworthy? What if you got lucky with that particular split?
Cross-validation gives you a more honest estimate of performance by evaluating the model on multiple different train/test splits and averaging the results.
The core idea: Instead of betting everything on one split, use several — and report the average.
6.4.2.1. The Problem with a Single Train-Test Split#
When you split data once, the result depends heavily on which examples ended up in the test set. A single split can be misleading.
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from myst_nb import glue
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, KFold, StratifiedKFold, cross_val_score

np.random.seed(42)

# Load data
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target

# Try five different random splits
rows = []
for seed in range(5):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = DecisionTreeClassifier(random_state=42)
    model.fit(X_train, y_train)
    acc = model.score(X_test, y_test)
    rows.append({'Split': f'Split {seed+1}', 'Accuracy': round(acc, 3)})

scores = [r['Accuracy'] for r in rows]
glue('split-range', round((max(scores) - min(scores)) * 100, 1), display=False)
display(pd.DataFrame(rows))
```
|   | Split | Accuracy |
|---|---|---|
| 0 | Split 1 | 0.912 |
| 1 | Split 2 | 0.956 |
| 2 | Split 3 | 0.912 |
| 3 | Split 4 | 0.886 |
| 4 | Split 5 | 0.912 |
The same model, on the same dataset, with five different splits gives noticeably different numbers. This is the problem cross-validation solves.
6.4.2.2. What is K-Fold Cross-Validation?#
K-Fold CV splits the data into K equal parts (folds). It then trains K separate models, each time holding out one fold as the test set and training on the remaining K-1 folds.
```text
Data:   [Fold 1] [Fold 2] [Fold 3] [Fold 4] [Fold 5]

Run 1:    TEST    TRAIN    TRAIN    TRAIN    TRAIN   → Score 1
Run 2:    TRAIN   TEST     TRAIN    TRAIN    TRAIN   → Score 2
Run 3:    TRAIN   TRAIN    TEST     TRAIN    TRAIN   → Score 3
Run 4:    TRAIN   TRAIN    TRAIN    TEST     TRAIN   → Score 4
Run 5:    TRAIN   TRAIN    TRAIN    TRAIN    TEST    → Score 5

Final score = mean(Score 1 ... Score 5)
```
Every example ends up in the test set exactly once. Nothing is wasted.
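The "exactly once" guarantee is easy to verify by hand. A minimal sketch in pure NumPy (10 samples, K=5; the variable names are illustrative):

```python
import numpy as np

n_samples, k = 10, 5
indices = np.arange(n_samples)

# Partition the indices into K contiguous folds
folds = np.array_split(indices, k)

# Each run tests on one fold and trains on the rest
seen_in_test = []
for i, test_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    seen_in_test.extend(test_idx.tolist())
    print(f"Run {i+1}: train={train_idx.tolist()} test={test_idx.tolist()}")

# Every index appears in a test set exactly once
assert sorted(seen_in_test) == list(range(n_samples))
```

Real splitters like `KFold` additionally support shuffling before partitioning, but the bookkeeping is the same.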
```python
model = DecisionTreeClassifier(random_state=42)
cv_scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')

glue('cv-mean', round(float(cv_scores.mean()), 3), display=False)
glue('cv-std', round(float(cv_scores.std()), 3), display=False)

fold_df = pd.DataFrame({
    'Fold': [f'Fold {i+1}' for i in range(len(cv_scores))],
    'Accuracy': cv_scores.round(3),
})
fold_df.loc[len(fold_df)] = ['Mean ± Std', f"{cv_scores.mean():.3f} ± {cv_scores.std():.3f}"]
display(fold_df)
```
|   | Fold | Accuracy |
|---|---|---|
| 0 | Fold 1 | 0.912 |
| 1 | Fold 2 | 0.904 |
| 2 | Fold 3 | 0.930 |
| 3 | Fold 4 | 0.956 |
| 4 | Fold 5 | 0.885 |
| 5 | Mean ± Std | 0.917 ± 0.024 |
The mean gives the expected performance. The standard deviation tells you how stable that estimate is — a high std means performance varies a lot across data subsets (a warning sign).
6.4.2.3. Under the Hood: How the Splits are Made#
```python
kf = KFold(n_splits=5, shuffle=True, random_state=42)

fold_rows = []
for fold_num, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    fold_rows.append({'Fold': fold_num, 'Train samples': len(train_idx), 'Test samples': len(test_idx)})

display(pd.DataFrame(fold_rows))
```
|   | Fold | Train samples | Test samples |
|---|---|---|---|
| 0 | 1 | 455 | 114 |
| 1 | 2 | 455 | 114 |
| 2 | 3 | 455 | 114 |
| 3 | 4 | 455 | 114 |
| 4 | 5 | 456 | 113 |
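These same fold indices can drive a manual cross-validation loop. The sketch below reproduces what `cross_val_score` does internally: one fresh model per fold, fit on the training indices, scored on the held-out indices.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import KFold

X, y = load_breast_cancer(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

scores = []
for train_idx, test_idx in kf.split(X):
    # Fit a fresh model on the training folds ...
    model = DecisionTreeClassifier(random_state=42)
    model.fit(X[train_idx], y[train_idx])
    # ... and score it on the held-out fold
    scores.append(model.score(X[test_idx], y[test_idx]))

print(f"Manual 5-fold accuracy: {np.mean(scores):.3f} ± {np.std(scores):.3f}")
```

Writing the loop out once makes it clear why CV costs K model fits, and why a fresh model per fold matters: reusing a fitted model would leak information between folds.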
6.4.2.4. Stratified K-Fold: Preserving Class Balance#
Standard K-Fold splits data randomly. On imbalanced datasets — where one class is rare — some folds might end up with very few (or zero) examples of the minority class.
Stratified K-Fold ensures each fold has roughly the same proportion of each class as the whole dataset. This is almost always what you want for classification.
```python
# Artificially create an imbalanced version
np.random.seed(42)
minority_mask = (y == 0)
majority_mask = (y == 1)
keep_minority = np.random.choice(np.where(minority_mask)[0], size=40, replace=False)
keep_majority = np.where(majority_mask)[0]
keep_idx = np.sort(np.concatenate([keep_minority, keep_majority]))
X_imb = X[keep_idx]
y_imb = y[keep_idx]

# Standard K-Fold
kf = KFold(n_splits=5, shuffle=True, random_state=42)
std_rows = []
for fold, (tr, te) in enumerate(kf.split(X_imb, y_imb), 1):
    prop = sum(y_imb[te] == 0) / len(te)
    std_rows.append({'Fold': fold, 'Method': 'Standard KFold',
                     'Class-0 %': f'{prop:.1%}', 'Class-0 count': sum(y_imb[te] == 0)})

# Stratified K-Fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
strat_rows = []
for fold, (tr, te) in enumerate(skf.split(X_imb, y_imb), 1):
    prop = sum(y_imb[te] == 0) / len(te)
    strat_rows.append({'Fold': fold, 'Method': 'StratifiedKFold',
                       'Class-0 %': f'{prop:.1%}', 'Class-0 count': sum(y_imb[te] == 0)})

display(pd.DataFrame(std_rows + strat_rows))
```
|   | Fold | Method | Class-0 % | Class-0 count |
|---|---|---|---|---|
| 0 | 1 | Standard KFold | 11.2% | 9 |
| 1 | 2 | Standard KFold | 12.5% | 10 |
| 2 | 3 | Standard KFold | 11.4% | 9 |
| 3 | 4 | Standard KFold | 6.3% | 5 |
| 4 | 5 | Standard KFold | 8.9% | 7 |
| 5 | 1 | StratifiedKFold | 10.0% | 8 |
| 6 | 2 | StratifiedKFold | 10.0% | 8 |
| 7 | 3 | StratifiedKFold | 10.1% | 8 |
| 8 | 4 | StratifiedKFold | 10.1% | 8 |
| 9 | 5 | StratifiedKFold | 10.1% | 8 |
Tip
Always use StratifiedKFold (or pass stratify=y to train_test_split) for classification problems. Standard KFold is sufficient for regression.
In cross_val_score, when y is categorical, scikit-learn automatically uses Stratified K-Fold behind the scenes.
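The automatic stratification uses no shuffling, though. To control shuffling and the random seed yourself, pass a splitter object as `cv` instead of an integer. A sketch on the same data and model as above:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = DecisionTreeClassifier(random_state=42)

# An integer cv uses unshuffled (Stratified)KFold; a splitter object
# makes the shuffling and random_state explicit and reproducible.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf, scoring='accuracy')
print(f"Stratified 5-fold accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
```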
6.4.2.5. Choosing K: How Many Folds?#
The most common choice is K=5 or K=10. Here is the tradeoff:
| K | Pros | Cons |
|---|---|---|
| Small (3-5) | Faster to run | Higher bias, noisier estimate |
| Large (10) | Lower bias | Slower, each test set is small |
| K = N (Leave-One-Out) | Uses all data | Very slow on large datasets |
```python
model = DecisionTreeClassifier(random_state=42)

results = []
for k in [2, 3, 5, 10, 20]:
    scores = cross_val_score(model, X, y, cv=k, scoring='accuracy')
    results.append({
        'K': k,
        'Mean Accuracy': round(scores.mean(), 4),
        'Std': round(scores.std(), 4),
    })

display(pd.DataFrame(results))
```
|   | K | Mean Accuracy | Std |
|---|---|---|---|
| 0 | 2 | 0.9279 | 0.0019 |
| 1 | 3 | 0.9121 | 0.0253 |
| 2 | 5 | 0.9173 | 0.0242 |
| 3 | 10 | 0.9280 | 0.0327 |
| 4 | 20 | 0.9244 | 0.0533 |
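The extreme case, Leave-One-Out, is just K-Fold with K equal to the number of samples, so each "test set" is a single example. A sketch on a small subset (the subset size of 80 is an arbitrary choice, just to keep the N model fits cheap):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_breast_cancer(return_X_y=True)
X_small, y_small = X[:80], y[:80]   # 80 samples -> 80 model fits

loo = LeaveOneOut()
scores = cross_val_score(DecisionTreeClassifier(random_state=42),
                         X_small, y_small, cv=loo)

# Each fold scores a single example, so every per-fold score is 0 or 1;
# only the mean over all N fits is meaningful.
print(f"LOO accuracy: {scores.mean():.3f} over {len(scores)} fits")
```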
6.4.2.6. Cross-Validation is Not Just for Evaluation#
Cross-validation plays two distinct roles that are easy to confuse:
Role 1 — Model Selection: Comparing algorithms or preprocessing strategies. Which model family performs best? Use CV to compare options, then train the winner on all your data.
Role 2 — Hyperparameter Tuning: Finding the best settings for a model. GridSearchCV and RandomizedSearchCV run cross-validation automatically for each candidate. (Covered in the next section.)
```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

candidates = {
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Logistic Regression': Pipeline([
        ('scaler', StandardScaler()),
        ('lr', LogisticRegression(max_iter=1000, random_state=42))
    ]),
}

comp_rows = []
for name, mdl in candidates.items():
    scores = cross_val_score(mdl, X, y, cv=5, scoring='accuracy')
    comp_rows.append({'Model': name, 'CV Accuracy (mean ± std)': f"{scores.mean():.3f} ± {scores.std():.3f}"})

display(pd.DataFrame(comp_rows))
```
|   | Model | CV Accuracy (mean ± std) |
|---|---|---|
| 0 | Decision Tree | 0.917 ± 0.024 |
| 1 | Random Forest | 0.956 ± 0.023 |
| 2 | Logistic Regression | 0.981 ± 0.007 |
6.4.2.7. The Golden Rule: Never Touch the Test Set#
Cross-validation is for development decisions (model choice, tuning). Your held-out test set should be used once, at the very end, to report final performance.
```text
Full Dataset
│
├── Test Set (set aside immediately, do not touch)
│
└── Development Set
    │
    ├── Fold 1 (train/val)
    ├── Fold 2 (train/val)   ← Cross-validation happens here
    ├── Fold 3 (train/val)
    └── ...
```
Warning
If you tune your model by repeatedly checking the test set, you are effectively training on it. The test score will be optimistic and will not reflect true generalization to new data. Use cross-validation on the development set instead.
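The workflow the diagram describes can be sketched end to end (the 20% test fraction is an arbitrary choice):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# 1. Set the test set aside immediately
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# 2. All development decisions use CV on the development set only
model = DecisionTreeClassifier(random_state=42)
cv_scores = cross_val_score(model, X_dev, y_dev, cv=5)
print(f"Dev CV accuracy: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")

# 3. Once development is finished: retrain on the full development set,
#    then evaluate on the test set exactly once
model.fit(X_dev, y_dev)
print(f"Final test accuracy: {model.score(X_test, y_test):.3f}")
```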
6.4.2.8. Summary#
| Technique | When to Use |
|---|---|
| `KFold` | Regression, or when classes are balanced |
| `StratifiedKFold` | Classification — preserves class proportions |
| `cross_val_score` | Quick evaluation of a model |
| `cross_validate` | When you need multiple metrics at once |
| Leave-One-Out | Very small datasets (< 100 samples) |
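For multiple metrics at once, `cross_validate` accepts a list of scorer names and reports per-fold fit and score times alongside the test scores. A sketch on the same data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_validate

X, y = load_breast_cancer(return_X_y=True)

results = cross_validate(
    DecisionTreeClassifier(random_state=42), X, y, cv=5,
    scoring=['accuracy', 'precision', 'recall'],
)

# One array of 5 fold scores per metric, plus fit_time / score_time
for key in ['test_accuracy', 'test_precision', 'test_recall']:
    print(f"{key}: {results[key].mean():.3f}")
```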
Cross-validation is the foundation of reliable model development. Almost everything that follows — hyperparameter tuning, pipeline evaluation, model comparison — builds on top of it.