6.4.2. Cross-Validation#
You trained a model and tested it on a hold-out set. Accuracy: 92%.
But is that number trustworthy? What if you got lucky with that particular split?
Cross-validation gives you a more honest estimate of performance by evaluating the model on multiple different train/test splits and averaging the results.
The core idea: Instead of betting everything on one split, use several — and report the average.
6.4.2.1. The Problem with a Single Train-Test Split#
When you split data once, the result depends heavily on which examples ended up in the test set. A single split can be misleading.
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from myst_nb import glue
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, KFold, StratifiedKFold, cross_val_score

np.random.seed(42)

# Load data
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target

# Try five different random splits
rows = []
for seed in range(5):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = DecisionTreeClassifier(random_state=42)
    model.fit(X_train, y_train)
    acc = model.score(X_test, y_test)
    rows.append({'Split': f'Split {seed+1}', 'Accuracy': round(acc, 3)})

scores = [r['Accuracy'] for r in rows]
glue('split-range', round((max(scores) - min(scores)) * 100, 1), display=False)
display(pd.DataFrame(rows))
```
|   | Split | Accuracy |
|---|---|---|
| 0 | Split 1 | 0.912 |
| 1 | Split 2 | 0.956 |
| 2 | Split 3 | 0.912 |
| 3 | Split 4 | 0.886 |
| 4 | Split 5 | 0.912 |
The same model, on the same dataset, with five different splits gives noticeably different numbers. This is the problem cross-validation solves.
6.4.2.2. What is K-Fold Cross-Validation?#
K-Fold CV splits the data into K equal parts (folds). It then trains K separate models, each time holding out one fold as the test set and training on the remaining K-1 folds.
```text
Data:   [Fold 1] [Fold 2] [Fold 3] [Fold 4] [Fold 5]

Run 1:    TEST    TRAIN    TRAIN    TRAIN    TRAIN   → Score 1
Run 2:    TRAIN   TEST     TRAIN    TRAIN    TRAIN   → Score 2
Run 3:    TRAIN   TRAIN    TEST     TRAIN    TRAIN   → Score 3
Run 4:    TRAIN   TRAIN    TRAIN    TEST     TRAIN   → Score 4
Run 5:    TRAIN   TRAIN    TRAIN    TRAIN    TEST    → Score 5

Final score = mean(Score 1 ... Score 5)
```
Every example ends up in the test set exactly once. Nothing is wasted.
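The "exactly once" guarantee is easy to verify by hand. A minimal sketch in pure NumPy (10 samples, K=5; the variable names are illustrative):

```python
import numpy as np

n_samples, k = 10, 5
indices = np.arange(n_samples)

# Partition the indices into K contiguous folds
folds = np.array_split(indices, k)

# Each run tests on one fold and trains on the rest
seen_in_test = []
for i, test_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    seen_in_test.extend(test_idx.tolist())
    print(f"Run {i+1}: train={train_idx.tolist()} test={test_idx.tolist()}")

# Every index appears in a test set exactly once
assert sorted(seen_in_test) == list(range(n_samples))
```

Real splitters like `KFold` additionally support shuffling before partitioning, but the bookkeeping is the same.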
```python
model = DecisionTreeClassifier(random_state=42)
cv_scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')

glue('cv-mean', round(float(cv_scores.mean()), 3), display=False)
glue('cv-std', round(float(cv_scores.std()), 3), display=False)

fold_df = pd.DataFrame({
    'Fold': [f'Fold {i+1}' for i in range(len(cv_scores))],
    'Accuracy': cv_scores.round(3),
})
fold_df.loc[len(fold_df)] = ['Mean ± Std', f"{cv_scores.mean():.3f} ± {cv_scores.std():.3f}"]
display(fold_df)
```
|   | Fold | Accuracy |
|---|---|---|
| 0 | Fold 1 | 0.912 |
| 1 | Fold 2 | 0.904 |
| 2 | Fold 3 | 0.930 |
| 3 | Fold 4 | 0.956 |
| 4 | Fold 5 | 0.885 |
| 5 | Mean ± Std | 0.917 ± 0.024 |
The mean gives the expected performance. The standard deviation tells you how stable that estimate is — a high std means performance varies a lot across data subsets (a warning sign).
6.4.2.3. Under the Hood: How the Splits are Made#
```python
kf = KFold(n_splits=5, shuffle=True, random_state=42)

fold_rows = []
for fold_num, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    fold_rows.append({'Fold': fold_num, 'Train samples': len(train_idx), 'Test samples': len(test_idx)})

display(pd.DataFrame(fold_rows))
```
|   | Fold | Train samples | Test samples |
|---|---|---|---|
| 0 | 1 | 455 | 114 |
| 1 | 2 | 455 | 114 |
| 2 | 3 | 455 | 114 |
| 3 | 4 | 455 | 114 |
| 4 | 5 | 456 | 113 |
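These same fold indices can drive a manual cross-validation loop. The sketch below reproduces what `cross_val_score` does internally: one fresh model per fold, fit on the training indices, scored on the held-out indices.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import KFold

X, y = load_breast_cancer(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

scores = []
for train_idx, test_idx in kf.split(X):
    # Fit a fresh model on the training folds ...
    model = DecisionTreeClassifier(random_state=42)
    model.fit(X[train_idx], y[train_idx])
    # ... and score it on the held-out fold
    scores.append(model.score(X[test_idx], y[test_idx]))

print(f"Manual 5-fold accuracy: {np.mean(scores):.3f} ± {np.std(scores):.3f}")
```

Writing the loop out once makes it clear why CV costs K model fits, and why a fresh model per fold matters: reusing a fitted model would leak information between folds.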
6.4.2.4. Stratified K-Fold: Preserving Class Balance#
Standard K-Fold splits data randomly. On imbalanced datasets — where one class is rare — some folds might end up with very few (or zero) examples of the minority class.
Stratified K-Fold ensures each fold has roughly the same proportion of each class as the whole dataset. This is almost always what you want for classification.
```python
# Artificially create an imbalanced version
np.random.seed(42)
minority_mask = (y == 0)
majority_mask = (y == 1)
keep_minority = np.random.choice(np.where(minority_mask)[0], size=40, replace=False)
keep_majority = np.where(majority_mask)[0]
keep_idx = np.sort(np.concatenate([keep_minority, keep_majority]))
X_imb = X[keep_idx]
y_imb = y[keep_idx]

# Standard K-Fold
kf = KFold(n_splits=5, shuffle=True, random_state=42)
std_rows = []
for fold, (tr, te) in enumerate(kf.split(X_imb, y_imb), 1):
    prop = sum(y_imb[te] == 0) / len(te)
    std_rows.append({'Fold': fold, 'Method': 'Standard KFold',
                     'Class-0 %': f'{prop:.1%}', 'Class-0 count': sum(y_imb[te] == 0)})

# Stratified K-Fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
strat_rows = []
for fold, (tr, te) in enumerate(skf.split(X_imb, y_imb), 1):
    prop = sum(y_imb[te] == 0) / len(te)
    strat_rows.append({'Fold': fold, 'Method': 'StratifiedKFold',
                       'Class-0 %': f'{prop:.1%}', 'Class-0 count': sum(y_imb[te] == 0)})

display(pd.DataFrame(std_rows + strat_rows))
```
|   | Fold | Method | Class-0 % | Class-0 count |
|---|---|---|---|---|
| 0 | 1 | Standard KFold | 11.2% | 9 |
| 1 | 2 | Standard KFold | 12.5% | 10 |
| 2 | 3 | Standard KFold | 11.4% | 9 |
| 3 | 4 | Standard KFold | 6.3% | 5 |
| 4 | 5 | Standard KFold | 8.9% | 7 |
| 5 | 1 | StratifiedKFold | 10.0% | 8 |
| 6 | 2 | StratifiedKFold | 10.0% | 8 |
| 7 | 3 | StratifiedKFold | 10.1% | 8 |
| 8 | 4 | StratifiedKFold | 10.1% | 8 |
| 9 | 5 | StratifiedKFold | 10.1% | 8 |
Tip
Always use StratifiedKFold (or pass stratify=y to train_test_split) for classification problems. Standard KFold is sufficient for regression.
In cross_val_score, when y is categorical, scikit-learn automatically uses Stratified K-Fold behind the scenes.
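The automatic stratification uses no shuffling, though. To control shuffling and the random seed yourself, pass a splitter object as `cv` instead of an integer. A sketch on the same data and model as above:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = DecisionTreeClassifier(random_state=42)

# An integer cv uses unshuffled (Stratified)KFold; a splitter object
# makes the shuffling and random_state explicit and reproducible.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf, scoring='accuracy')
print(f"Stratified 5-fold accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
```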
6.4.2.5. Choosing K: How Many Folds?#
The most common choice is K=5 or K=10. Here is the tradeoff:
| K | Pros | Cons |
|---|---|---|
| Small (3-5) | Faster to run | Higher bias, noisier estimate |
| Large (10) | Lower bias | Slower, each test set is small |
| K = N (Leave-One-Out) | Uses all data | Very slow on large datasets |
```python
model = DecisionTreeClassifier(random_state=42)

results = []
for k in [2, 3, 5, 10, 20]:
    scores = cross_val_score(model, X, y, cv=k, scoring='accuracy')
    results.append({
        'K': k,
        'Mean Accuracy': round(scores.mean(), 4),
        'Std': round(scores.std(), 4),
    })

display(pd.DataFrame(results))
```
|   | K | Mean Accuracy | Std |
|---|---|---|---|
| 0 | 2 | 0.9279 | 0.0019 |
| 1 | 3 | 0.9121 | 0.0253 |
| 2 | 5 | 0.9173 | 0.0242 |
| 3 | 10 | 0.9280 | 0.0327 |
| 4 | 20 | 0.9244 | 0.0533 |
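The extreme case, Leave-One-Out, is just K-Fold with K equal to the number of samples, so each "test set" is a single example. A sketch on a small subset (the subset size of 80 is an arbitrary choice, just to keep the N model fits cheap):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_breast_cancer(return_X_y=True)
X_small, y_small = X[:80], y[:80]   # 80 samples -> 80 model fits

loo = LeaveOneOut()
scores = cross_val_score(DecisionTreeClassifier(random_state=42),
                         X_small, y_small, cv=loo)

# Each fold scores a single example, so every per-fold score is 0 or 1;
# only the mean over all N fits is meaningful.
print(f"LOO accuracy: {scores.mean():.3f} over {len(scores)} fits")
```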
6.4.2.6. Cross-Validation is Not Just for Evaluation#
Cross-validation plays two distinct roles that are easy to confuse:
Role 1 — Model Selection: Comparing algorithms or preprocessing strategies. Which model family performs best? Use CV to compare options, then train the winner on all your data.
Role 2 — Hyperparameter Tuning: Finding the best settings for a model. GridSearchCV and RandomizedSearchCV run cross-validation automatically for each candidate. (Covered in the next section.)
```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

candidates = {
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Logistic Regression': Pipeline([
        ('scaler', StandardScaler()),
        ('lr', LogisticRegression(max_iter=1000, random_state=42))
    ]),
}

comp_rows = []
for name, mdl in candidates.items():
    scores = cross_val_score(mdl, X, y, cv=5, scoring='accuracy')
    comp_rows.append({'Model': name, 'CV Accuracy (mean ± std)': f"{scores.mean():.3f} ± {scores.std():.3f}"})

display(pd.DataFrame(comp_rows))
```
|   | Model | CV Accuracy (mean ± std) |
|---|---|---|
| 0 | Decision Tree | 0.917 ± 0.024 |
| 1 | Random Forest | 0.956 ± 0.023 |
| 2 | Logistic Regression | 0.981 ± 0.007 |
6.4.2.7. The Golden Rule: Never Touch the Test Set#
Cross-validation is for development decisions (model choice, tuning). Your held-out test set should be used once, at the very end, to report final performance.
```text
Full Dataset
│
├── Test Set (set aside immediately, do not touch)
│
└── Development Set
    │
    ├── Fold 1 (train/val)
    ├── Fold 2 (train/val)   ← Cross-validation happens here
    ├── Fold 3 (train/val)
    └── ...
```
Warning
If you tune your model by repeatedly checking the test set, you are effectively training on it. The test score will be optimistic and will not reflect true generalization to new data. Use cross-validation on the development set instead.
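The workflow the diagram describes can be sketched end to end (the 20% test fraction is an arbitrary choice):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# 1. Set the test set aside immediately
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# 2. All development decisions use CV on the development set only
model = DecisionTreeClassifier(random_state=42)
cv_scores = cross_val_score(model, X_dev, y_dev, cv=5)
print(f"Dev CV accuracy: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")

# 3. Once development is finished: retrain on the full development set,
#    then evaluate on the test set exactly once
model.fit(X_dev, y_dev)
print(f"Final test accuracy: {model.score(X_test, y_test):.3f}")
```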
6.4.2.8. Summary#
| Technique | When to Use |
|---|---|
| `KFold` | Regression, or when classes are balanced |
| `StratifiedKFold` | Classification — preserves class proportions |
| `cross_val_score` | Quick evaluation of a model |
| `cross_validate` | When you need multiple metrics at once |
| Leave-One-Out | Very small datasets (< 100 samples) |
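For multiple metrics at once, `cross_validate` accepts a list of scorer names and reports per-fold fit and score times alongside the test scores. A sketch on the same data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_validate

X, y = load_breast_cancer(return_X_y=True)

results = cross_validate(
    DecisionTreeClassifier(random_state=42), X, y, cv=5,
    scoring=['accuracy', 'precision', 'recall'],
)

# One array of 5 fold scores per metric, plus fit_time / score_time
for key in ['test_accuracy', 'test_precision', 'test_recall']:
    print(f"{key}: {results[key].mean():.3f}")
```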
Cross-validation is the foundation of reliable model development. Almost everything that follows — hyperparameter tuning, pipeline evaluation, model comparison — builds on top of it.