6.1.6. The Bias-Variance Tradeoff#

In the previous section, we explored underfitting and overfitting at a practical level. Now we’ll dig deeper into why this tradeoff exists. The answer lies in one of the most elegant concepts in machine learning: the bias-variance tradeoff.

Understanding this concept will transform you from someone who can train models to someone who truly understands what’s happening under the hood. This is where intuition meets theory!

6.1.6.1. The Fundamental Question#

When your model makes a prediction error, where does that error come from? Is it because:

  • Your model is too simple? (bias)

  • Your model is too sensitive to training data? (variance)

  • The data itself is noisy? (irreducible error)

It turns out all three contribute to your total error, and there’s a beautiful mathematical relationship between them.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from myst_nb import glue

# Set random seed for reproducibility
np.random.seed(42)

Total Error = Bias² + Variance + Irreducible Error

Think of it as:

  • Bias: How far off is your model on average?

  • Variance: How much does your model jump around?

  • Irreducible: Noise we can’t eliminate
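This decomposition can be checked empirically. Here is a minimal sketch (a toy setup of my own, not the chapter's example): fit a deliberately too-simple model on many resampled training sets, and compare the measured error at one point against Bias² + Variance + Noise².

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    """True function -- known only because this is a simulation."""
    return 2.0 * x

x0, noise_sd = 2.5, 1.0
preds, sq_errors = [], []

# Repeatedly draw a fresh training set, fit a deliberately biased model
# (a constant that ignores x), and score it on a fresh noisy observation at x0.
for _ in range(20000):
    X = rng.uniform(0, 3, 20)
    y = f(X) + rng.normal(0, noise_sd, 20)
    pred = y.mean()                      # constant model: high bias at x0
    preds.append(pred)
    sq_errors.append((pred - (f(x0) + rng.normal(0, noise_sd)))**2)

preds = np.array(preds)
bias_sq = (preds.mean() - f(x0))**2      # ~4.0: systematic miss
variance = preds.var()                   # ~0.2: spread across training sets
total = np.mean(sq_errors)               # measured expected squared error

print(f"Bias² + Variance + Noise² = {bias_sq + variance + noise_sd**2:.2f}")
print(f"Measured expected error   = {total:.2f}")
```

The two printed numbers agree (up to Monte Carlo noise): the measured error really is the sum of the three components.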

6.1.6.2. Bias: Error from Wrong Assumptions#

Bias is the error introduced by approximating a complex real-world problem with a simplified model.

Think of it this way:

  • You’re trying to catch a baseball

  • But you assume it travels in a straight line (ignoring gravity)

  • Bias = How far off you are on average due to this wrong assumption

In machine learning:

  • High bias = model is too simple = underfitting

  • The model makes systematic errors because it can’t capture the true pattern

def true_function(x):
    """The true underlying pattern (unknown in real problems!)"""
    return 5 + 2*x - 0.3*x**2

# Generate data
X = np.linspace(0, 10, 100).reshape(-1, 1)
y_true = true_function(X.ravel())
y_noisy = y_true + np.random.normal(0, 2, 100)

# Train high-bias model
model_high_bias = LinearRegression()
model_high_bias.fit(X, y_noisy)
y_pred_bias = model_high_bias.predict(X)


# Calculate bias visually
plt.figure(figsize=(12, 5))
plt.scatter(X, y_noisy, alpha=0.3, label='Noisy data', s=30)
plt.plot(X, y_true, 'k--', linewidth=3, label='True function', alpha=0.7)
plt.plot(X, y_pred_bias, 'r-', linewidth=2, label='Linear model (high bias)')

sample_points = [20, 50, 80]
for i in sample_points:
    plt.plot([X[i], X[i]], [y_true[i], y_pred_bias[i]],
            'b-', linewidth=2, alpha=0.5)

plt.xlabel('X', fontsize=12)
plt.ylabel('y', fontsize=12)
plt.title('HIGH BIAS: Model Systematically Misses the Pattern', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

Bias² (squared average error): 5.25

Notice: the linear model systematically overshoots at the edges and undershoots around the peak of the curve. This repeatable, systematic error is BIAS.

Characteristics of High Bias#

High Bias Symptoms:

  • Model too simple for the problem

  • Underfits both training and test data

  • High training error

  • High test error

  • Model makes systematic, predictable errors

  • Adding more data doesn’t help much

Examples:

  • Linear regression on non-linear data

  • Shallow decision tree on complex data

  • Simple rule-based system for complex task
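These symptoms are easy to check in code. A quick sketch, reusing the chapter's quadratic true function: a linear model scores poorly on both splits, with both errors well above the noise floor of σ² = 4 and no train/test gap.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, 200).reshape(-1, 1)
y = 5 + 2 * X.ravel() - 0.3 * X.ravel()**2 + rng.normal(0, 2, 200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = LinearRegression().fit(X_tr, y_tr)
train_mse = mean_squared_error(y_tr, model.predict(X_tr))
test_mse = mean_squared_error(y_te, model.predict(X_te))

# High-bias signature: BOTH errors are high, and close to each other.
print(f"Train MSE: {train_mse:.2f}")   # well above the noise floor of 4
print(f"Test MSE:  {test_mse:.2f}")    # roughly the same -- no overfitting gap
```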

6.1.6.3. Variance: Error from Sensitivity to Training Data#

Variance is the error from the model being too sensitive to small fluctuations in the training data.

Think of it this way:

  • You’re trying to draw the “average” face

  • But you only saw 3 people, and one had a huge nose

  • Now your “average” face has a huge nose!

  • Variance = How much your answer changes based on which examples you saw

In machine learning:

  • High variance = model is too flexible = overfitting

  • The model changes dramatically with different training sets


np.random.seed(42)

n_datasets = 10
models = []
predictions = []

fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Train multiple models on different datasets
for i in range(n_datasets):
    X_sample = np.sort(np.random.uniform(0, 10, 30)).reshape(-1, 1)
    y_sample = true_function(X_sample.ravel()) + np.random.normal(0, 2, 30)

    model_var = make_pipeline(PolynomialFeatures(9), LinearRegression())
    model_var.fit(X_sample, y_sample)

    X_plot = np.linspace(0, 10, 200).reshape(-1, 1)
    y_pred_var = model_var.predict(X_plot)

    predictions.append(y_pred_var)

    axes[0].plot(X_plot, y_pred_var, alpha=0.3, linewidth=1)
    axes[0].scatter(X_sample, y_sample, s=10, alpha=0.3)

axes[0].plot(X_plot, true_function(X_plot.ravel()), 'k--', linewidth=3, label='True function')
axes[0].set_xlabel('X', fontsize=12)
axes[0].set_ylabel('y', fontsize=12)
axes[0].set_title('HIGH VARIANCE: Predictions Jump Around', fontsize=13, fontweight='bold')
axes[0].legend(fontsize=9)  # only the true function carries a label
axes[0].grid(True, alpha=0.3)
axes[0].set_ylim([-10, 30])

predictions = np.array(predictions)
mean_prediction = predictions.mean(axis=0)
variance = predictions.var(axis=0)

# Plot variance
axes[1].plot(X_plot, mean_prediction, 'b-', linewidth=2, label='Average prediction')
axes[1].fill_between(X_plot.ravel(),
                     mean_prediction - np.sqrt(variance),
                     mean_prediction + np.sqrt(variance),
                     alpha=0.3, label='±1 std (variance)')
axes[1].plot(X_plot, true_function(X_plot.ravel()), 'k--', linewidth=2, label='True function')
axes[1].set_xlabel('X', fontsize=12)
axes[1].set_ylabel('y', fontsize=12)
axes[1].set_title('Variance: How Much Predictions Fluctuate', fontsize=13, fontweight='bold')
axes[1].legend(fontsize=10)
axes[1].grid(True, alpha=0.3)
axes[1].set_ylim([-10, 30])

plt.tight_layout()
plt.show()

Average variance across predictions: 237.15

Notice: Different training sets → wildly different predictions. This sensitivity is VARIANCE.

Characteristics of High Variance#

High Variance Symptoms:

  • Model too complex/flexible for the data

  • Overfits training data

  • Low training error (fits training data perfectly)

  • High test error (fails on new data)

  • Predictions change drastically with different training sets

  • Large gap between train and test performance

Examples:

  • High-degree polynomial on small dataset

  • Deep decision tree without pruning

  • Neural network with too many parameters
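One of these symptoms points to its own remedy: because variance comes from sensitivity to the particular sample, more data tames it. A sketch reusing the degree-9 setup from above: measure how much the fitted curve fluctuates across resampled training sets of 30 versus 300 points.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
true_f = lambda x: 5 + 2*x - 0.3*x**2
X_eval = np.linspace(0, 10, 50).reshape(-1, 1)

def prediction_variance(n_samples, n_trials=30, degree=9):
    """Mean variance of a degree-9 fit's predictions across fresh training sets."""
    preds = []
    for _ in range(n_trials):
        X = np.sort(rng.uniform(0, 10, n_samples)).reshape(-1, 1)
        y = true_f(X.ravel()) + rng.normal(0, 2, n_samples)
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        model.fit(X, y)
        preds.append(model.predict(X_eval))
    return np.array(preds).var(axis=0).mean()

v_small = prediction_variance(30)
v_large = prediction_variance(300)
print(f"Variance, n=30:  {v_small:.2f}")
print(f"Variance, n=300: {v_large:.2f}")   # much smaller: same model, more data
```

The model is identical in both runs; only the sample size changes, yet the prediction variance drops by roughly an order of magnitude.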

6.1.6.4. The Tradeoff: You Can’t Minimize Both#

Here’s the fundamental insight: You cannot simultaneously minimize both bias and variance.

Why? Because they pull in opposite directions:

  • Reducing bias → Add complexity → Increases variance

  • Reducing variance → Simplify model → Increases bias


degrees = range(1, 16)
n_trials = 20

bias_squared_values = []
variance_values = []
total_error_values = []

# Systematically evaluate bias and variance across complexity levels
for degree in degrees:
    predictions_per_degree = []

    for trial in range(n_trials):
        X_train = np.sort(np.random.uniform(0, 10, 30)).reshape(-1, 1)
        y_train = true_function(X_train.ravel()) + np.random.normal(0, 2, 30)

        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        model.fit(X_train, y_train)

        X_test = np.linspace(0, 10, 50).reshape(-1, 1)
        y_pred = model.predict(X_test)
        predictions_per_degree.append(y_pred)

    predictions_per_degree = np.array(predictions_per_degree)

    y_true_test = true_function(X_test.ravel())

    # Calculate bias and variance
    mean_prediction = predictions_per_degree.mean(axis=0)
    bias_squared = np.mean((mean_prediction - y_true_test)**2)
    variance = np.mean(predictions_per_degree.var(axis=0))

    irreducible_error = 4.0

    total_error = bias_squared + variance + irreducible_error

    bias_squared_values.append(bias_squared)
    variance_values.append(variance)
    total_error_values.append(total_error)

plt.figure(figsize=(12, 6))

plt.plot(degrees, bias_squared_values, 'b-o', linewidth=2, markersize=8, label='Bias²')
plt.plot(degrees, variance_values, 'r-s', linewidth=2, markersize=8, label='Variance')
plt.plot(degrees, total_error_values, 'k-^', linewidth=3, markersize=8, label='Total Error')
plt.axhline(y=irreducible_error, color='gray', linestyle='--', alpha=0.5, label='Irreducible Error')

optimal_idx = np.argmin(total_error_values)
optimal_degree = degrees[optimal_idx]
plt.axvline(x=optimal_degree, color='green', linestyle='--', alpha=0.7, linewidth=2)
plt.scatter([optimal_degree], [total_error_values[optimal_idx]],
           s=300, c='green', marker='*', edgecolors='black', linewidths=2,
           label=f'Optimal (degree={optimal_degree})', zorder=5)

plt.text(2, max(total_error_values)*0.9, 'High Bias\nLow Variance',
        fontsize=11, ha='center', bbox=dict(boxstyle='round', facecolor='lightblue', alpha=0.7))
plt.text(13, max(total_error_values)*0.9, 'Low Bias\nHigh Variance',
        fontsize=11, ha='center', bbox=dict(boxstyle='round', facecolor='lightcoral', alpha=0.7))

plt.xlabel('Model Complexity (Polynomial Degree)', fontsize=13)
plt.ylabel('Error', fontsize=13)
plt.title('The Bias-Variance Tradeoff', fontsize=15, fontweight='bold')
plt.legend(fontsize=11, loc='upper right')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

Key Observations:

  1. As complexity increases, bias² decreases (better fit)

  2. As complexity increases, variance increases (less stable)

  3. Total error = Bias² + Variance + Irreducible

  4. Optimal complexity: 2 (minimizes total error)

  5. Can’t reduce both simultaneously - it’s a TRADEOFF

The Mathematical Decomposition#

Important

Error Decomposition Formula:

For a model’s prediction \(\hat{f}(x)\) compared to true value \(y\):

\[\text{Expected Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}\]

Where:

  • Bias² = \((E[\hat{f}(x)] - f(x))^2\)

    • How far is the average prediction from the truth?

  • Variance = \(E[(\hat{f}(x) - E[\hat{f}(x)])^2]\)

    • How much do predictions vary across training sets?

  • Irreducible Error = Noise in the data itself (cannot be reduced)


np.random.seed(42)
x_point = 5.0
y_true_point = true_function(np.array([x_point]))[0]

simple_predictions = []
optimal_predictions = []
complex_predictions = []

# Evaluate predictions at a single point across multiple training sets
for trial in range(100):
    X_train = np.sort(np.random.uniform(0, 10, 30)).reshape(-1, 1)
    y_train = true_function(X_train.ravel()) + np.random.normal(0, 2, 30)

    for degree, pred_list in [(1, simple_predictions),
                               (2, optimal_predictions),
                               (10, complex_predictions)]:
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        model.fit(X_train, y_train)
        pred = model.predict([[x_point]])[0]
        pred_list.append(pred)

fig, axes = plt.subplots(1, 3, figsize=(16, 4))

for ax, preds, title, degree in zip(axes,
                                     [simple_predictions, optimal_predictions, complex_predictions],
                                     ['Simple (High Bias)', 'Just Right', 'Complex (High Variance)'],
                                     [1, 2, 10]):
    ax.hist(preds, bins=20, alpha=0.7, color='skyblue', edgecolor='black')
    ax.axvline(y_true_point, color='red', linewidth=3, label='True value', linestyle='--')
    ax.axvline(np.mean(preds), color='blue', linewidth=3, label='Mean prediction', linestyle='-')

    bias = np.mean(preds) - y_true_point
    variance = np.var(preds)

    ax.set_xlabel('Predicted Value', fontsize=11)
    ax.set_ylabel('Frequency', fontsize=11)
    ax.set_title(f'{title}\nBias: {bias:.2f}, Variance: {variance:.2f}',
                fontsize=12, fontweight='bold')
    ax.legend(fontsize=9)
    ax.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

At x = 5:

True value: 7.50

Simple model (degree 1):

  • Mean prediction: 5.08

  • Bias: -2.42

  • Variance: 0.21

Optimal model (degree 2):

  • Mean prediction: 7.46

  • Bias: -0.04

  • Variance: 0.23

Complex model (degree 10):

  • Mean prediction: 7.54

  • Bias: 0.04

  • Variance: 1.59

6.1.6.5. Connection to Model Complexity#

The bias-variance tradeoff is intimately connected to model complexity:

| Model Type | Complexity | Bias | Variance | Best For |
|---|---|---|---|---|
| Simple (linear) | Low | High | Low | Simple patterns, small data |
| Medium (quadratic) | Medium | Medium | Medium | Moderate patterns, medium data |
| Complex (high-degree) | High | Low | High | Complex patterns, large data |


# Visualize predictions from multiple models

fig, axes = plt.subplots(2, 3, figsize=(16, 8))

degrees_to_show = [1, 3, 10]
titles = ['Simple Model\n(High Bias, Low Variance)',
         'Balanced Model\n(Medium Bias, Medium Variance)',
         'Complex Model\n(Low Bias, High Variance)']

for col, (degree, title) in enumerate(zip(degrees_to_show, titles)):
    ax_top = axes[0, col]

    # Train multiple models and plot predictions
    for trial in range(10):
        X_sample = np.sort(np.random.uniform(0, 10, 30)).reshape(-1, 1)
        y_sample = true_function(X_sample.ravel()) + np.random.normal(0, 2, 30)

        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        model.fit(X_sample, y_sample)

        X_plot = np.linspace(0, 10, 200).reshape(-1, 1)
        y_pred = model.predict(X_plot)

        ax_top.plot(X_plot, y_pred, alpha=0.3, linewidth=1, color='blue')

    ax_top.plot(X_plot, true_function(X_plot.ravel()), 'r--', linewidth=2, label='True')
    ax_top.set_title(title, fontsize=11, fontweight='bold')
    ax_top.set_ylim([-5, 25])
    ax_top.grid(True, alpha=0.3)
    ax_top.legend(fontsize=9)

    ax_bottom = axes[1, col]

    # Calculate bias and variance
    predictions = []
    for trial in range(50):
        X_sample = np.sort(np.random.uniform(0, 10, 30)).reshape(-1, 1)
        y_sample = true_function(X_sample.ravel()) + np.random.normal(0, 2, 30)

        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        model.fit(X_sample, y_sample)

        X_test = np.linspace(0, 10, 50).reshape(-1, 1)
        y_pred = model.predict(X_test)
        predictions.append(y_pred)

    predictions = np.array(predictions)
    mean_pred = predictions.mean(axis=0)
    y_true_test = true_function(X_test.ravel())

    bias_sq = np.mean((mean_pred - y_true_test)**2)
    var = np.mean(predictions.var(axis=0))
    irreducible = 4.0
    total = bias_sq + var + irreducible

    components = [bias_sq, var, irreducible]
    labels = ['Bias²', 'Variance', 'Irreducible']
    colors = ['#FF6B6B', '#4ECDC4', '#95A5A6']

    ax_bottom.bar(labels, components, color=colors, edgecolor='black', linewidth=2)
    ax_bottom.set_ylabel('Error', fontsize=11)
    ax_bottom.set_title(f'Total Error: {total:.2f}', fontsize=10)
    ax_bottom.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

6.1.6.6. The No Free Lunch Theorem#

The bias-variance tradeoff leads to an important realization: There is no universally best model.

The No Free Lunch Theorem states: Averaged over all possible problems, every algorithm performs equally well.

What this means practically:

  • Simple models win on simple problems

  • Complex models win on complex problems

  • You need to match model complexity to problem complexity

  • Always validate on your specific data!


from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR

problems = [
    {
        'name': 'Simple Linear',
        'func': lambda x: 2*x + 3,
        'best_model': 'Linear Regression'
    },
    {
        'name': 'Complex Non-Linear',
        'func': lambda x: 10*np.sin(x) + 0.5*x**2,
        'best_model': 'Random Forest'
    }
]

models = {
    'Linear': LinearRegression(),
    'Tree': DecisionTreeRegressor(max_depth=5, random_state=42),
    'Forest': RandomForestRegressor(n_estimators=50, max_depth=10, random_state=42)
}

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

for prob_idx, problem in enumerate(problems):
    # Generate data for each problem
    X = np.sort(np.random.uniform(0, 10, 100)).reshape(-1, 1)
    y = problem['func'](X.ravel()) + np.random.normal(0, 1, 100)

    # Shuffle before splitting: X is sorted, so a sequential split would
    # make the test set pure extrapolation beyond the training range.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42)

    # Plot data
    ax_data = axes[prob_idx, 0]
    ax_data.scatter(X_train, y_train, alpha=0.6, s=40, label='Train')
    ax_data.scatter(X_test, y_test, alpha=0.6, s=60, marker='s', label='Test')
    ax_data.plot(X, problem['func'](X.ravel()), 'k--', linewidth=2, label='True')
    ax_data.set_title(f"Problem: {problem['name']}", fontsize=12, fontweight='bold')
    ax_data.legend()
    ax_data.grid(True, alpha=0.3)

    ax_results = axes[prob_idx, 1]
    model_names = []
    test_errors = []

    # Train and evaluate each model
    for name, model in models.items():
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        mse = np.mean((y_test - y_pred)**2)

        model_names.append(name)
        test_errors.append(mse)

    # Highlight the expected winner ('Linear' in 'Linear Regression',
    # 'Forest' in 'Random Forest')
    colors = ['green' if name in problem['best_model'] else 'gray'
             for name in model_names]

    ax_results.bar(model_names, test_errors, color=colors, edgecolor='black', linewidth=2)
    ax_results.set_ylabel('Test MSE', fontsize=11)
    ax_results.set_title(f'Best: {problem["best_model"]}', fontsize=11, fontweight='bold')
    ax_results.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

No Free Lunch Theorem in Action:

  • Linear problem → Linear model wins

  • Complex problem → Complex model wins

  • No single ‘best’ model for all problems

  • Must validate on YOUR specific data!

6.1.6.7. Practical Implications#

Tip

How to Navigate the Bias-Variance Tradeoff:

  1. Start simple - Begin with low-complexity models

  2. Monitor both - Track training AND validation error

  3. Increase complexity gradually - Add just enough

  4. Use validation - Catch overfitting early

  5. Regularization helps - Controls variance without sacrificing too much bias

  6. More data helps variance - Large datasets allow complex models

  7. Feature engineering helps bias - Better features reduce need for complexity
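Point 5 can be demonstrated directly. A sketch using scikit-learn's Ridge, which this chapter hasn't formally introduced (the choice of alpha=1.0 here is arbitrary, for illustration only): the same degree-10 polynomial with and without an L2 penalty, on standardized features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(42)
X = np.sort(rng.uniform(0, 10, 60)).reshape(-1, 1)
y = 5 + 2*X.ravel() - 0.3*X.ravel()**2 + rng.normal(0, 2, 60)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

results = {}
for name, reg in [("plain", LinearRegression()), ("ridge", Ridge(alpha=1.0))]:
    model = make_pipeline(PolynomialFeatures(10, include_bias=False),
                          StandardScaler(), reg)
    model.fit(X_tr, y_tr)
    mse = mean_squared_error(y_te, model.predict(X_te))
    coef_norm = np.linalg.norm(model[-1].coef_)   # size of the fitted weights
    results[name] = (mse, coef_norm)
    print(f"{name:6s} test MSE {mse:7.2f}   ||w|| = {coef_norm:.1f}")

# Ridge shrinks the weight vector, which is exactly what reins in variance.
```

The penalty trades a small increase in bias for a large reduction in weight magnitude, and hence in sensitivity to the particular training sample.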

Practical Decision Guide:

Situation 1: High training error + High test error

  • High BIAS (underfitting)

  • Solution: Increase model complexity

  • Example: Try polynomial features, deeper tree, more layers

Situation 2: Low training error + High test error

  • High VARIANCE (overfitting)

  • Solution: Reduce complexity OR get more data

  • Example: Add regularization, prune tree, dropout, early stopping

Situation 3: Large gap between training and test

  • High VARIANCE

  • Model is memorizing rather than learning

Situation 4: Both errors decreasing with more data

  • You’re on the right track!

  • More data will help

Situation 5: Both errors plateau despite more data

  • High BIAS

  • Need more model capacity, not more data
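These situations can be rolled into a rough rule-of-thumb helper. A sketch only: the function name and thresholds are my own, and a real diagnosis should also look at learning curves.

```python
def diagnose(train_error, test_error, baseline, gap_ratio=1.5):
    """Crude bias/variance diagnosis from train and test error.

    baseline:  an estimate of acceptable error (e.g. the noise level).
    gap_ratio: how much larger test error may get before we blame variance.
    Both thresholds are illustrative, not standard values.
    """
    if train_error > baseline and test_error > baseline:
        return "high bias (underfitting): increase model capacity"
    if test_error > gap_ratio * max(train_error, 1e-12):
        return "high variance (overfitting): regularize, simplify, or add data"
    return "reasonable fit: errors are low and close together"

print(diagnose(train_error=9.0, test_error=9.5, baseline=5.0))  # Situation 1
print(diagnose(train_error=0.5, test_error=8.0, baseline=5.0))  # Situation 2
```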

Warning

Don’t Oversimplify!

While the bias-variance tradeoff is a powerful mental model, remember:

  • Real models have many sources of error

  • The tradeoff isn’t always perfect (regularization can help both!)

  • Always validate empirically on your data