6.1.5. Underfitting and Overfitting#

Among all the ideas in machine learning, few are as fundamental as this one: a model must generalize.

Training data is only a sample of the world. The real test of a model is not how well it performs on what it has already seen, but how well it performs on what it has never seen before.

A model that only memorizes is useless in practice. A model that captures structure is valuable.

You can think of it in human terms:

  • A student who memorizes practice problems often fails the exam.

  • A student who understands principles can solve new problems.

Machine learning faces the same tension.


6.1.5.1. The Central Problem#

Every supervised learning task involves two goals that pull in opposite directions:

  1. Fit the training data well.

  2. Perform well on new data.

If we push too hard on the first goal, we risk memorization. If we do not push hard enough, we risk missing the signal entirely.

To make this concrete, we generate data from a curved function and split it into training and test sets.


import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, learning_curve
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

np.random.seed(42)
X = np.sort(np.random.uniform(0, 10, 50)).reshape(-1, 1)
true_function = lambda x: 10 + 2*x - 0.3*x**2
y = true_function(X.ravel()) + np.random.normal(0, 3, 50)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

This dataset contains a real quadratic pattern plus noise. The question is: how complex should our model be?


6.1.5.2. Underfitting#

Underfitting occurs when a model is too simple to capture the underlying structure of the data.

In this case, we fit a linear model to data that is clearly curved.

model_underfit = make_pipeline(PolynomialFeatures(1), LinearRegression())
model_underfit.fit(X_train, y_train)

train_score = model_underfit.score(X_train, y_train)
test_score = model_underfit.score(X_test, y_test)
train_mse = mean_squared_error(y_train, model_underfit.predict(X_train))
test_mse = mean_squared_error(y_test, model_underfit.predict(X_test))

X_plot = np.linspace(0, 10, 200).reshape(-1, 1)
y_pred_plot = model_underfit.predict(X_plot)
y_true_plot = true_function(X_plot.ravel())

Training R²: 0.308 Test R²: 0.334


plt.figure(figsize=(12, 5))
plt.scatter(X_train, y_train, s=60, alpha=0.6, label='Training data')
plt.scatter(X_test, y_test, s=100, marker='s', alpha=0.7, label='Test data')
plt.plot(X_plot, y_true_plot, 'k--', linewidth=2, alpha=0.5, label='True function')
plt.plot(X_plot, y_pred_plot, linewidth=3, label='Model (too simple)')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Underfitting: Model Too Simple')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
[Figure: training and test points with the true quadratic function (dashed) and the too-simple linear fit]

The straight line cannot represent curvature. Both training and test performance are poor.

Hallmarks of Underfitting#

  • High training error

  • High test error

  • Small gap between training and test

  • Model systematically misses the pattern

Underfitting is a capacity problem. The model simply lacks the expressive power to represent the true relationship.

Common causes include:

  • Insufficient model complexity

  • Missing or weak features

  • Excessive regularization

  • Inadequate training
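The last cause is worth a demonstration, since it surprises many newcomers: a model with plenty of capacity can still underfit if it is over-regularized. The sketch below (it uses scikit-learn's `Ridge`, which this section has not otherwise introduced) fits the same degree-2 pipeline twice, once with a light L2 penalty and once with a crushing one:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge

# Quadratic-plus-noise data, same shape as in this section
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 10, 50)).reshape(-1, 1)
y = 10 + 2 * X.ravel() - 0.3 * X.ravel() ** 2 + rng.normal(0, 3, 50)

# Identical capacity, very different penalty strengths
light = make_pipeline(PolynomialFeatures(2), StandardScaler(), Ridge(alpha=1e-3))
heavy = make_pipeline(PolynomialFeatures(2), StandardScaler(), Ridge(alpha=1e6))
light.fit(X, y)
heavy.fit(X, y)

print(f"light penalty R²: {light.score(X, y):.3f}")
print(f"heavy penalty R²: {heavy.score(X, y):.3f}")
```

The heavy penalty shrinks every coefficient toward zero, so the model collapses toward predicting the mean of `y` and its R² falls to roughly zero, even though the pipeline could represent the true curve exactly.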


6.1.5.3. Overfitting#

Overfitting occurs when a model is too complex and begins to model noise rather than structure.

Now we fit a degree-15 polynomial.

model_overfit = make_pipeline(PolynomialFeatures(15), LinearRegression())
model_overfit.fit(X_train, y_train)

train_score_over = model_overfit.score(X_train, y_train)
test_score_over = model_overfit.score(X_test, y_test)

y_pred_plot_over = model_overfit.predict(X_plot)

Training R²: 0.614 Test R²: 0.704


plt.figure(figsize=(12, 5))
plt.scatter(X_train, y_train, s=60, alpha=0.6, label='Training data')
plt.scatter(X_test, y_test, s=100, marker='s', alpha=0.7, label='Test data')
plt.plot(X_plot, y_true_plot, 'k--', linewidth=2, alpha=0.5, label='True function')
plt.plot(X_plot, y_pred_plot_over, linewidth=3, label='Model (too complex)')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Overfitting: Model Too Complex')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
[Figure: training and test points with the true quadratic function (dashed) and the wildly oscillating degree-15 fit]

The curve bends to chase individual training points, oscillating between them. It models the noise in this particular sample rather than the structure shared with new data, so its predictions between and beyond the training points cannot be trusted.

Hallmarks of Overfitting#

  • Very low training error

  • High test error

  • Large gap between training and test

  • Unstable or erratic predictions

Overfitting is a variance problem. The model adapts too closely to random fluctuations in the training data.

Common causes include:

  • Excessive model complexity

  • Too little training data

  • Lack of regularization

  • Training for too long


6.1.5.4. The Sweet Spot#

The goal is neither simplicity nor complexity for its own sake. The goal is alignment between model capacity and data structure.

Here we fit a quadratic model, matching the true function.

model_justright = make_pipeline(PolynomialFeatures(2), LinearRegression())
model_justright.fit(X_train, y_train)

train_score_jr = model_justright.score(X_train, y_train)
test_score_jr = model_justright.score(X_test, y_test)

Training R²: 0.544 Test R²: 0.693 Gap: 0.150

Both scores are strong. The gap is small. The model captures structure without memorizing noise.

This is the balance we seek.


6.1.5.5. The Diagnostic Curve#

If we increase model complexity gradually and track performance, a clear pattern emerges.


degrees = range(1, 16)
train_scores = []
test_scores = []

for degree in degrees:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)

    train_scores.append(model.score(X_train, y_train))
    test_scores.append(model.score(X_test, y_test))

plt.figure(figsize=(12, 6))
plt.plot(degrees, train_scores, 'o-', label='Training Score')
plt.plot(degrees, test_scores, 's-', label='Test Score')

plt.xlabel('Model Complexity (Polynomial Degree)')
plt.ylabel('R² Score')
plt.title('The Underfitting–Overfitting Trade-off')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
[Figure: training and test R² versus polynomial degree, showing the underfitting–overfitting trade-off]

Two universal patterns appear:

  1. Training performance improves monotonically with complexity.

  2. Test performance improves, peaks, and then declines.

The peak of the test curve marks the optimal level of complexity.
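The peak can also be located programmatically. This sketch regenerates the same kind of data (a quadratic plus noise, with its own random seed, so the exact numbers will differ from the figures), repeats the sweep, and takes the argmax of the test scores:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)
X = np.sort(rng.uniform(0, 10, 50)).reshape(-1, 1)
y = 10 + 2 * X.ravel() - 0.3 * X.ravel() ** 2 + rng.normal(0, 3, 50)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# Sweep the degrees and record test performance
degrees = np.arange(1, 16)
test_scores = []
for d in degrees:
    m = make_pipeline(PolynomialFeatures(d), LinearRegression()).fit(X_tr, y_tr)
    test_scores.append(m.score(X_te, y_te))

best = int(degrees[np.argmax(test_scores)])
print(f"degree with the highest test R²: {best}")
```

The winning degree lands near the true complexity of the data, not at either extreme of the sweep.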

6.1.5.6. Learning Curves: Another Diagnostic Tool#

So far, we have studied performance as a function of model complexity. There is another equally powerful perspective: performance as a function of data size.

A learning curve shows how training and validation performance change as we provide the model with more data. Instead of asking, “Is my model too complex?”, we ask, “Do I have enough data?”


from sklearn.model_selection import learning_curve, ShuffleSplit

fig, axes = plt.subplots(1, 3, figsize=(18, 5))

cv = ShuffleSplit(n_splits=20, test_size=0.3, random_state=42)

max_train_size = int((1 - 0.3) * len(X))  # 35

train_size_range = np.linspace(10, max_train_size, 8, dtype=int)

models_lc = [
    (make_pipeline(PolynomialFeatures(1), LinearRegression()), 'Underfitting (degree=1)'),
    (make_pipeline(PolynomialFeatures(2), LinearRegression()), 'Good Fit (degree=2)'),
    (make_pipeline(PolynomialFeatures(4), LinearRegression()), 'Overfitting (degree=4)')
]

for ax, (model, title) in zip(axes, models_lc):

    train_sizes, train_scores_lc, val_scores_lc = learning_curve(
        model,
        X,
        y,
        cv=cv,
        train_sizes=train_size_range,
        scoring='r2',
        n_jobs=-1
    )

    train_mean = train_scores_lc.mean(axis=1)
    train_std = train_scores_lc.std(axis=1)
    val_mean = val_scores_lc.mean(axis=1)
    val_std = val_scores_lc.std(axis=1)

    ax.plot(train_sizes, train_mean, 'o-', linewidth=2, label='Training')
    ax.fill_between(train_sizes,
                    train_mean - train_std,
                    train_mean + train_std,
                    alpha=0.2)

    ax.plot(train_sizes, val_mean, 's-', linewidth=2, label='Validation')
    ax.fill_between(train_sizes,
                    val_mean - val_std,
                    val_mean + val_std,
                    alpha=0.2)

    ax.set_xlabel('Training Set Size')
    ax.set_ylabel('R² Score')
    ax.set_title(title)
    ax.legend()
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
[Figure: learning curves for the degree-1 (underfitting), degree-2 (good fit), and degree-4 (overfitting) models]

A learning curve reveals not only performance, but also why performance behaves the way it does.

Interpreting Learning Curves#

Underfitting

  • Training and validation scores are both low

  • The curves converge quickly

  • Adding more data does not significantly improve performance

The model lacks sufficient capacity. More data cannot fix a model that is fundamentally too simple.


Good Fit

  • Both curves converge to a high score

  • The gap between them is small

  • Performance improves gradually with more data

The model is well matched to the problem. Additional data may still yield incremental gains.


Overfitting

  • Training score is high

  • Validation score is substantially lower

  • The gap narrows as more data is added

This pattern signals high variance. The model is too sensitive to individual training examples. In this case, more data can genuinely help.
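This claim can be checked directly. The sketch below fits a degree-6 polynomial (an arbitrary high-variance choice for illustration) on a small and a large sample from the same process, and compares the train–test gap; the helper `train_test_gap` is hypothetical, defined only for this demonstration:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

def train_test_gap(n_train, seed=0):
    """Train R² minus fresh-data R² for a degree-6 polynomial fit."""
    rng = np.random.default_rng(seed)
    f = lambda x: 10 + 2 * x - 0.3 * x ** 2
    X_tr = rng.uniform(0, 10, n_train).reshape(-1, 1)
    y_tr = f(X_tr.ravel()) + rng.normal(0, 3, n_train)
    X_te = rng.uniform(0, 10, 2000).reshape(-1, 1)
    y_te = f(X_te.ravel()) + rng.normal(0, 3, 2000)
    m = make_pipeline(PolynomialFeatures(6), LinearRegression()).fit(X_tr, y_tr)
    return m.score(X_tr, y_tr) - m.score(X_te, y_te)

print(f"gap with   30 training samples: {train_test_gap(30):.3f}")
print(f"gap with 3000 training samples: {train_test_gap(3000):.3f}")
```

With more data, the noise the model could previously memorize averages out, and the gap between training and fresh-data performance nearly closes.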

Learning curves therefore answer an important practical question: Is my problem due to model complexity, or data quantity?


6.1.5.7. Practical Example: Real Dataset#

To see that these principles extend beyond synthetic examples, consider a real regression problem using the California Housing dataset.

from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
import pandas as pd

housing = fetch_california_housing()
X_house = housing.data[:400]
y_house = housing.target[:400]

X_h_train, X_h_test, y_h_train, y_h_test = train_test_split(
    X_house, y_house, test_size=0.3, random_state=42
)

We train decision trees with increasing depth. Tree depth directly controls complexity: shallow trees are simple, deep trees are expressive.

depths = [1, 3, 5, 10, 20, None]
results = []

for depth in depths:
    model = DecisionTreeRegressor(max_depth=depth, random_state=42)
    model.fit(X_h_train, y_h_train)

    train_score = model.score(X_h_train, y_h_train)
    test_score = model.score(X_h_test, y_h_test)
    gap = abs(train_score - test_score)

    depth_str = str(depth) if depth is not None else "∞"
    status = "Underfit" if train_score < 0.7 and test_score < 0.7 else \
             "Overfit" if gap > 0.15 else "Good"

    results.append({
        'depth': depth_str,
        'train': train_score,
        'test': test_score,
        'gap': gap,
        'status': status
    })

df_results = pd.DataFrame(results)
df_results
   depth     train      test       gap    status
0      1  0.530951  0.446194  0.084757  Underfit
1      3  0.786806  0.654873  0.131933      Good
2      5  0.930228  0.667421  0.262807   Overfit
3     10  0.993562  0.684156  0.309406   Overfit
4     20  1.000000  0.655484  0.344516   Overfit
5      ∞  1.000000  0.655484  0.344516   Overfit

As depth increases, the familiar pattern reappears:

  • Very shallow trees underfit

  • Extremely deep trees overfit

  • Intermediate depths generalize best

The phenomenon is not tied to polynomial regression. It is structural to supervised learning itself.
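As a quick check with yet another model family, consider k-nearest-neighbour regression, where the complexity knob runs in the opposite direction: small k means high complexity (k=1 memorizes the training set) while large k means a very smooth, potentially underfit predictor. A minimal sketch on synthetic quadratic data, with the k values chosen arbitrarily for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, 200).reshape(-1, 1)
y = 10 + 2 * X.ravel() - 0.3 * X.ravel() ** 2 + rng.normal(0, 3, 200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# k=1 memorizes; k=100 averages most of the training set
scores = {}
for k in (1, 10, 100):
    m = KNeighborsRegressor(n_neighbors=k).fit(X_tr, y_tr)
    scores[k] = (m.score(X_tr, y_tr), m.score(X_te, y_te))
    print(f"k={k:3d}  train R²={scores[k][0]:.3f}  test R²={scores[k][1]:.3f}")
```

The k=1 model scores perfectly on training data (every point is its own nearest neighbour) while generalizing poorly; a moderate k balances the two; a very large k smooths the signal away entirely.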


6.1.5.8. How to Fix Underfitting and Overfitting#

When you diagnose underfitting, consider:

  • Increasing model complexity

  • Adding more informative features

  • Reducing regularization

  • Training longer if optimization is incomplete

When you diagnose overfitting, consider:

  • Reducing model complexity

  • Adding regularization

  • Gathering more data

  • Using early stopping

  • Performing feature selection

Each intervention adjusts either model capacity or effective variance.
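As one concrete instance of these interventions, the sketch below adds an L2 penalty (`Ridge`, with an arbitrarily chosen `alpha=1.0`) and feature scaling to the degree-15 pipeline and compares the train–test gap against plain least squares:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(42)
X = np.sort(rng.uniform(0, 10, 50)).reshape(-1, 1)
y = 10 + 2 * X.ravel() - 0.3 * X.ravel() ** 2 + rng.normal(0, 3, 50)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# Same degree-15 features; the only difference is the penalty
results = {}
for name, reg in (('plain least squares', LinearRegression()),
                  ('ridge (alpha=1.0)', Ridge(alpha=1.0))):
    model = make_pipeline(PolynomialFeatures(15), StandardScaler(), reg)
    model.fit(X_tr, y_tr)
    tr, te = model.score(X_tr, y_tr), model.score(X_te, y_te)
    results[name] = (tr, te)
    print(f"{name:20s} train R²={tr:.3f}  test R²={te:.3f}  gap={tr - te:.3f}")
```

The penalized model gives up some training fit in exchange for a smaller gap: a direct trade of capacity for variance, which is exactly the exchange every item on the list above makes in one form or another.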