6.4.1. Baseline Models#

Before you train a sophisticated model, you need something to beat.

A baseline model makes predictions using the simplest possible rule — no learning, no patterns, no features. Its only job is to answer the question: what can you achieve by being completely naive?

If your trained model cannot outperform a baseline, something is wrong: the features carry no signal, the labels are corrupted, or there is a bug in your pipeline. The baseline is not a competitor — it is a sanity check.

6.4.1.1. Why You Need a Baseline#

It is easy to see 85% accuracy and think the model is working well. But if always predicting the majority class also gives 85%, the model has learned nothing at all.

On the 900 / 100 split below, a rule that always predicts class 0 achieves 90% accuracy. Any trained model that cannot beat that number has learned nothing useful: it is doing no better than ignoring the features entirely.

import numpy as np
import pandas as pd
from myst_nb import glue
from sklearn.datasets import load_breast_cancer, load_diabetes
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.dummy import DummyClassifier, DummyRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)

# Imbalanced dataset: 90% class 0, 10% class 1
y_imbalanced = np.array([0] * 900 + [1] * 100)
X_imbalanced = np.random.randn(1000, 10)

always_zero_acc = (y_imbalanced == 0).mean()
glue('always-zero-acc', f'{always_zero_acc:.0%}', display=False)

6.4.1.2. Baseline Options for Classification#

Scikit-learn’s DummyClassifier implements the common baseline rules. The strategy parameter selects which one:

cancer = load_breast_cancer()
X, y = cancer.data, cancer.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

strategies = {
    'most_frequent': 'Always predict the most common class',
    'stratified':    'Sample randomly, respecting class proportions',
    'uniform':       'Sample uniformly at random from all classes',
    'prior':         'Always predict the class with highest prior',
}

rows = []
for strategy, description in strategies.items():
    dummy = DummyClassifier(strategy=strategy, random_state=42)
    dummy.fit(X_train, y_train)
    acc = dummy.score(X_test, y_test)
    rows.append({'Strategy': strategy, 'Accuracy': round(acc, 3), 'Rule': description})

# Compare to a real model
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
rows.append({'Strategy': 'RandomForest', 'Accuracy': round(rf.score(X_test, y_test), 3), 'Rule': 'Trained model'})

display(pd.DataFrame(rows))
Strategy Accuracy Rule
0 most_frequent 0.632 Always predict the most common class
1 stratified 0.596 Sample randomly, respecting class proportions
2 uniform 0.553 Sample uniformly at random from all classes
3 prior 0.632 Always predict the class with highest prior
4 RandomForest 0.956 Trained model

Which strategy to use as your reference?

  • most_frequent — the most common baseline for classification. Directly answers: “what if we always predict the majority?” Essential on imbalanced datasets.

  • stratified — useful when you care about performance across classes, not just overall accuracy.

  • prior — predicts exactly the same classes as most_frequent, including on multi-class problems; the two differ only in predict_proba, where prior returns the empirical class distribution instead of a one-hot vector.
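The distinction between most_frequent and prior is easiest to see directly. A minimal check on a hypothetical six-sample toy dataset (the class counts 4 / 2 are chosen only for illustration):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy dataset: 4 samples of class 0, 2 of class 1
X = np.zeros((6, 1))
y = np.array([0, 0, 0, 0, 1, 1])

mf = DummyClassifier(strategy='most_frequent').fit(X, y)
pr = DummyClassifier(strategy='prior').fit(X, y)

# Both strategies predict the majority class for every input
print(mf.predict(X))           # [0 0 0 0 0 0]
print(pr.predict(X))           # [0 0 0 0 0 0]

# They differ only in predict_proba:
print(mf.predict_proba(X)[0])  # one-hot: [1. 0.]
print(pr.predict_proba(X)[0])  # empirical priors: 4/6 and 2/6
```

The probability difference matters only for metrics that consume scores rather than hard labels, such as log loss; on plain accuracy the two baselines are interchangeable.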

Tip

For imbalanced classification, also check baseline F1 score and AUC, not just accuracy. A model that always predicts the majority class scores 0.0 F1 on the minority — which often reflects the actual problem better than accuracy does.

from sklearn.metrics import f1_score, roc_auc_score

# Simulate imbalanced data
np.random.seed(42)
y_imb = np.array([0] * 450 + [1] * 50)
X_imb = np.random.randn(500, 10)
X_tr, X_te, y_tr, y_te = train_test_split(X_imb, y_imb, test_size=0.2, stratify=y_imb, random_state=42)

dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_tr, y_tr)
y_pred_dummy = dummy.predict(X_te)

rf2 = RandomForestClassifier(n_estimators=100, random_state=42)
rf2.fit(X_tr, y_tr)
y_pred_rf = rf2.predict(X_te)

metrics_df = pd.DataFrame({
    'Metric':       ['Accuracy', 'F1 (minority class)'],
    'Baseline':     [round((y_pred_dummy == y_te).mean(), 3),
                     round(f1_score(y_te, y_pred_dummy, pos_label=1), 3)],
    'RandomForest': [round((y_pred_rf == y_te).mean(), 3),
                     round(f1_score(y_te, y_pred_rf, pos_label=1), 3)],
})
display(metrics_df)
Metric Baseline RandomForest
0 Accuracy 0.9 0.9
1 F1 (minority class) 0.0 0.0
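Notice that the RandomForest ties the baseline on both metrics here: the features are pure noise, so a trained model cannot beat naive guessing. The Tip above also mentions AUC, and for a constant predictor that value can be reasoned out in advance: a score that never varies cannot rank any positive above any negative, so AUC sits at chance level, 0.5. A small self-contained sketch (regenerating the data rather than reusing the variables above):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
y = np.array([0] * 450 + [1] * 50)   # 90 / 10 imbalance
X = rng.standard_normal((500, 10))

dummy = DummyClassifier(strategy='most_frequent').fit(X, y)
scores = dummy.predict_proba(X)[:, 1]  # the same score for every sample

# All scores tie, so the ranking is uninformative: AUC is chance level
print(roc_auc_score(y, scores))        # 0.5
```

An AUC of 0.5 is therefore the floor to beat for any probabilistic classifier, regardless of how imbalanced the classes are.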

6.4.1.3. Baseline Options for Regression#

For regression, the baseline predicts a single constant, computed from the training targets, for every input.

diabetes = load_diabetes()
X_reg, y_reg = diabetes.data, diabetes.target

X_tr_r, X_te_r, y_tr_r, y_te_r = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42
)

from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

strategies_reg = {
    'mean':   'Predict the training mean for every sample',
    'median': 'Predict the training median for every sample',
}

reg_rows = []
for strategy, description in strategies_reg.items():
    dummy = DummyRegressor(strategy=strategy)
    dummy.fit(X_tr_r, y_tr_r)
    y_pred = dummy.predict(X_te_r)
    reg_rows.append({
        'Model':       strategy,
        'RMSE':        round(np.sqrt(mean_squared_error(y_te_r, y_pred)), 1),
        'MAE':         round(mean_absolute_error(y_te_r, y_pred), 1),
        'R²':          round(r2_score(y_te_r, y_pred), 3),
        'Rule':        description,
    })

lr = LinearRegression()
lr.fit(X_tr_r, y_tr_r)
y_pred_lr = lr.predict(X_te_r)
reg_rows.append({
    'Model':  'LinearRegression',
    'RMSE':   round(np.sqrt(mean_squared_error(y_te_r, y_pred_lr)), 1),
    'MAE':    round(mean_absolute_error(y_te_r, y_pred_lr), 1),
    'R²':     round(r2_score(y_te_r, y_pred_lr), 3),
    'Rule':   'Trained model',
})

display(pd.DataFrame(reg_rows))
Model RMSE MAE R² Rule
0 mean 73.2 64.0 -0.012 Predict the training mean for every sample
1 median 72.9 62.7 -0.003 Predict the training median for every sample
2 LinearRegression 53.9 42.8 0.453 Trained model

Which strategy for regression?

  • mean — the standard baseline. Predicting the mean gives \(R^2 = 0\) by definition on the data whose mean is used; on a held-out test set the baseline carries the training mean, which rarely matches the test mean exactly, so the table shows a value near zero (-0.012) rather than exactly zero. Any clearly positive \(R^2\) means your model explains variance beyond the mean.

  • median — preferred when outliers are present; more robust than the mean.
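The robustness claim is easy to demonstrate. In this sketch (the target values are hypothetical toy numbers), a single extreme outlier in the training targets drags the mean far from the bulk of the data, while the median is unaffected:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

# Training targets with one extreme outlier (500.0)
X_train = np.zeros((5, 1))
y_train = np.array([10.0, 11.0, 12.0, 13.0, 500.0])
X_test = np.zeros((2, 1))
y_test = np.array([10.0, 12.0])

mean_base = DummyRegressor(strategy='mean').fit(X_train, y_train)
median_base = DummyRegressor(strategy='median').fit(X_train, y_train)

print(mean_base.predict(X_test)[0])    # 109.2, dragged by the outlier
print(median_base.predict(X_test)[0])  # 12.0, unaffected

print(mean_absolute_error(y_test, mean_base.predict(X_test)))    # ~98.2
print(mean_absolute_error(y_test, median_base.predict(X_test)))  # 1.0
```

One corrupted value makes the mean baseline useless here, while the median baseline still tracks the typical target. The same reasoning applies to real data with heavy-tailed targets.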

Note

\(R^2 = 0\) for the mean-predicting baseline is not a coincidence. \(R^2\) is defined as \(1 - \frac{SS_{res}}{SS_{tot}}\), where \(SS_{tot}\) is the sum of squared deviations of the targets from their mean. Predicting the mean of the evaluated data makes \(SS_{res} = SS_{tot}\), so \(R^2 = 0\) exactly. A negative \(R^2\) means your model is worse than always guessing the mean — a serious warning sign.
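The identity can be verified numerically on a hypothetical toy target vector: computing \(1 - SS_{res}/SS_{tot}\) by hand for a mean-predicting baseline agrees with scikit-learn's r2_score:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.full_like(y_true, y_true.mean())  # mean-predicting baseline

ss_res = ((y_true - y_pred) ** 2).sum()       # residual sum of squares
ss_tot = ((y_true - y_true.mean()) ** 2).sum()  # total sum of squares

print(1 - ss_res / ss_tot)       # 0.0
print(r2_score(y_true, y_pred))  # 0.0
```

The identity holds exactly only when the prediction equals the mean of the data being scored; a DummyRegressor scored on a test set uses the training mean instead, which is why its test \(R^2\) lands near zero rather than at zero.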

6.4.1.4. Using Baselines in Practice#

Establishing a baseline is the first step in any modelling workflow. Run it once, record the number, and treat it as the minimum bar your model must clear.

# Step 1: establish baseline
baseline = DummyClassifier(strategy='most_frequent', random_state=42)
baseline_scores = cross_val_score(baseline, X, y, cv=5, scoring='accuracy')

# Step 2: first real model
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_scores = cross_val_score(rf, X, y, cv=5, scoring='accuracy')

lift = rf_scores.mean() - baseline_scores.mean()

glue('baseline-acc',  f"{baseline_scores.mean():.3f} ± {baseline_scores.std():.3f}", display=False)
glue('rf-acc',        f"{rf_scores.mean():.3f} ± {rf_scores.std():.3f}",        display=False)
glue('lift-pp',       f"{lift*100:.1f}",                                          display=False)

workflow_df = pd.DataFrame({
    'Step':    ['Baseline (most_frequent)', 'RandomForest'],
    'CV Accuracy (mean ± std)': [
        f"{baseline_scores.mean():.3f} ± {baseline_scores.std():.3f}",
        f"{rf_scores.mean():.3f} ± {rf_scores.std():.3f}",
    ],
})
display(workflow_df)
Step CV Accuracy (mean ± std)
0 Baseline (most_frequent) 0.627 ± 0.004
1 RandomForest 0.956 ± 0.023

RandomForest lifts cross-validated accuracy by 32.9 percentage points over the majority-class baseline. The lift, not the raw accuracy, is the real measure of progress.
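Per-fold lift is worth inspecting too: scoring both estimators on the same splits gives a paired comparison, so you can see whether the model beats the baseline on every fold rather than only on average. A sketch on synthetic data (make_classification stands in for your dataset; the split and estimator settings are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=5, random_state=42)

# One splitter object, passed to both calls: identical folds for both models
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

baseline_scores = cross_val_score(
    DummyClassifier(strategy='most_frequent'), X, y, cv=cv)
model_scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=42), X, y, cv=cv)

fold_lift = model_scores - baseline_scores
print(fold_lift.round(3))
print(f'mean lift: {fold_lift.mean():.3f}')
```

A model that wins on average but loses on some folds deserves a closer look: the lift may be driven by a few lucky splits rather than genuine signal.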

6.4.1.5. Summary#

| Problem type                | Recommended baseline          | Key metric to compare      |
|-----------------------------|-------------------------------|----------------------------|
| Binary classification       | most_frequent                 | F1 on minority class, AUC  |
| Multi-class classification  | most_frequent or stratified   | Macro F1                   |
| Regression                  | mean                          | RMSE, R²                   |
| Regression with outliers    | median                        | MAE                        |

Every trained model you build in this chapter will be measured against a baseline first. If it cannot beat a dummy that predicts the mean or the majority class, there is no point tuning it.