6.4.3. Hyperparameter Tuning: Finding Optimal Settings

You’ve trained a Random Forest with default parameters. Performance: 85%. But what if n_estimators=200 gives 92%?

Hyperparameters are settings you choose before training (unlike parameters learned during training). Examples:

  • Tree depth in Decision Trees

  • Number of neighbors in KNN

  • Learning rate in Neural Networks

  • Regularization strength

Problem: How to find the best settings?

Let’s explore systematic tuning strategies from simple to advanced!

6.4.3.1. Hyperparameters vs Parameters

Key distinction:

  • Parameters: Learned from data (e.g., weights, coefficients)

  • Hyperparameters: Set before training (e.g., learning rate, tree depth)
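To make the distinction concrete, here is a minimal sketch using a logistic regression on the breast cancer data. The choice of model and the value `C=0.5` are assumptions picked for illustration; the point is only that `C` is fixed by us before `fit`, while `coef_` exists only after training.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)  # scaling helps the solver converge

# Hyperparameter: C (inverse regularization strength) is chosen before training.
model = LogisticRegression(C=0.5, max_iter=1000)
hyperparams_before = model.get_params()
has_coef_before = hasattr(model, "coef_")  # no learned weights exist yet

model.fit(X_scaled, y)

# Parameters: the weights in coef_ are learned from the data during fit.
learned_weights = model.coef_
hyperparams_after = model.get_params()

print("C before and after fit:", hyperparams_before["C"], hyperparams_after["C"])
print("Learned weight matrix shape:", learned_weights.shape)
```

Training leaves the hyperparameters untouched and creates the learned weights, one per input feature.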

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from myst_nb import glue
from sklearn.datasets import load_breast_cancer, load_digits
from sklearn.model_selection import (train_test_split, GridSearchCV,
                                      RandomizedSearchCV, cross_val_score)
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from scipy.stats import uniform, randint
import time

np.random.seed(42)

6.4.3.2. Example: Impact of Hyperparameters

Let’s see how much hyperparameters matter!

# Load data
cancer = load_breast_cancer()
X_cancer = cancer.data
y_cancer = cancer.target

X_train, X_test, y_train, y_test = train_test_split(
    X_cancer, y_cancer, test_size=0.3, random_state=42, stratify=y_cancer
)

# Try different max_depth values
depths = [1, 2, 3, 5, 10, 20, None]
results = []

for depth in depths:
    dt = DecisionTreeClassifier(max_depth=depth, random_state=42)
    dt.fit(X_train, y_train)
    results.append({
        'max_depth': str(depth),
        'Train Accuracy': round(dt.score(X_train, y_train), 3),
        'Test Accuracy':  round(dt.score(X_test,  y_test),  3),
    })

df_results = pd.DataFrame(results)
display(df_results)

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
depth_labels = [str(d) for d in depths]
x_pos = range(len(depth_labels))

axes[0].plot(x_pos, df_results['Train Accuracy'], 'o-', linewidth=2, markersize=8, label='Training')
axes[0].plot(x_pos, df_results['Test Accuracy'],  's-', linewidth=2, markersize=8, label='Test')
axes[0].set_xticks(x_pos)
axes[0].set_xticklabels(depth_labels)
axes[0].set_xlabel('max_depth', fontsize=12)
axes[0].set_ylabel('Accuracy', fontsize=12)
axes[0].set_title('Hyperparameter Impact\nmax_depth affects performance!', fontsize=13, fontweight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

gaps = df_results['Train Accuracy'] - df_results['Test Accuracy']
axes[1].bar(x_pos, gaps, alpha=0.7, edgecolor='black')
axes[1].set_xticks(x_pos)
axes[1].set_xticklabels(depth_labels)
axes[1].set_xlabel('max_depth', fontsize=12)
axes[1].set_ylabel('Train-Test Gap', fontsize=12)
axes[1].set_title('Overfitting vs Depth\nLarge gap = overfitting', fontsize=13, fontweight='bold')
axes[1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

   max_depth  Train Accuracy  Test Accuracy
0  1          0.927           0.912
1  2          0.965           0.918
2  3          0.980           0.924
3  5          0.995           0.930
4  10         1.000           0.918
5  20         1.000           0.918
6  None       1.000           0.918

[Figure: two panels. Left: "Hyperparameter Impact", training and test accuracy vs max_depth. Right: "Overfitting vs Depth", train-test gap vs max_depth.]

6.4.3.4. Visualizing Grid Search Results
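The plotting code in this subsection reads `cv_results` and `grid_search`, and later cells use `dt_base`, `param_grid`, and `elapsed`. None of these is defined in this section, so here is a minimal sketch of one way they could be produced. The grid values are assumptions chosen for illustration, not the original settings; integer depths are used so that the sorting in the heatmap code works.

```python
import time

import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

dt_base = DecisionTreeClassifier(random_state=42)

# Hypothetical grid (illustrative values): 4 * 3 * 3 = 36 combinations
param_grid = {
    'max_depth': [3, 5, 7, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}

start = time.time()
grid_search = GridSearchCV(dt_base, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)  # fits every combination on every fold
elapsed = time.time() - start

# One row per combination, with columns such as param_max_depth and mean_test_score
cv_results = pd.DataFrame(grid_search.cv_results_)

print("Combinations tried:", len(cv_results))
print("Best params:", grid_search.best_params_)
```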

results_subset = cv_results[cv_results['param_min_samples_leaf'] == 1]

depths_unique = sorted(results_subset['param_max_depth'].unique())
splits_unique = sorted(results_subset['param_min_samples_split'].unique())

# Create heatmap matrix
heatmap_data = np.zeros((len(splits_unique), len(depths_unique)))

for i, split in enumerate(splits_unique):
    for j, depth in enumerate(depths_unique):
        mask = ((results_subset['param_max_depth'] == depth) &
                (results_subset['param_min_samples_split'] == split))
        score = results_subset[mask]['mean_test_score'].values[0]
        heatmap_data[i, j] = score

# Plot
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Heatmap
im = axes[0].imshow(heatmap_data, cmap='RdYlGn', aspect='auto',
                    vmin=0.9, vmax=1.0)
axes[0].set_xticks(range(len(depths_unique)))
axes[0].set_yticks(range(len(splits_unique)))
axes[0].set_xticklabels(depths_unique)
axes[0].set_yticklabels(splits_unique)
axes[0].set_xlabel('max_depth', fontsize=12)
axes[0].set_ylabel('min_samples_split', fontsize=12)
axes[0].set_title('Grid Search Heatmap\n(Darker green = better)',
                  fontsize=13, fontweight='bold')

# Add text annotations
for i in range(len(splits_unique)):
    for j in range(len(depths_unique)):
        text = axes[0].text(j, i, f'{heatmap_data[i, j]:.3f}',
                          ha="center", va="center", color="black",
                          fontsize=9)

plt.colorbar(im, ax=axes[0], label='CV Accuracy')

# Score distribution
all_scores = cv_results['mean_test_score']

axes[1].hist(all_scores, bins=20, alpha=0.7, edgecolor='black')
axes[1].axvline(grid_search.best_score_, color='red', linestyle='--',
               linewidth=2, label=f'Best: {grid_search.best_score_:.3f}')
axes[1].set_xlabel('CV Accuracy', fontsize=12)
axes[1].set_ylabel('Frequency', fontsize=12)
axes[1].set_title('Distribution of Hyperparameter Combinations',
                  fontsize=13, fontweight='bold')
axes[1].legend()
axes[1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

6.4.3.5. Random Search: Efficient Alternative

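Random Search: Sample a fixed number of random combinations from a distribution over each hyperparameter, instead of trying every point on a grid

Why it works:

  • Hyperparameters are not all equally important; often only a few strongly affect the score

  • For the same number of fits, random search tries more distinct values of each important hyperparameter than a grid does

  • It often finds a good solution faster than grid search

When to use: Large search space, limited computation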
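The "more distinct values" argument can be checked with a small NumPy-only sketch. The two hyperparameters `a` and `b` and the budget of 9 fits are hypothetical; no model is trained, we only count how many different values of `a` each strategy would try.

```python
import numpy as np

rng = np.random.default_rng(42)
budget = 9  # total number of model fits we can afford

# Grid search: a 3 x 3 grid over two hyperparameters a and b
grid_a = np.linspace(0.0, 1.0, 3)
grid_b = np.linspace(0.0, 1.0, 3)
grid_points = np.array([(a, b) for a in grid_a for b in grid_b])

# Random search: 9 independent draws from the same ranges
random_points = rng.uniform(0.0, 1.0, size=(budget, 2))

# How many different values of a does each strategy try?
grid_distinct_a = len(np.unique(grid_points[:, 0]))
random_distinct_a = len(np.unique(random_points[:, 0]))

print("Grid search tries", grid_distinct_a, "distinct values of a")
print("Random search tries", random_distinct_a, "distinct values of a")
```

If only `a` matters, the grid spends its 9 fits on just 3 values of it, while random search covers 9.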

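param_distributions = {
    'max_depth':          randint(3, 30),     # integers 3..29 (upper bound is exclusive)
    'min_samples_split':  randint(2, 20),     # integers 2..19
    'min_samples_leaf':   randint(1, 10),     # integers 1..9
    'max_features':       uniform(0.1, 0.9),  # uniform(loc, scale): fractions in [0.1, 1.0]
}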

start_rs = time.time()
random_search = RandomizedSearchCV(
    dt_base, param_distributions, n_iter=50,
    cv=5, scoring='accuracy', n_jobs=-1, random_state=42, return_train_score=True
)
random_search.fit(X_train, y_train)
elapsed_random = time.time() - start_rs

best_params_df = pd.DataFrame([
    {'Hyperparameter': k, 'Best Value': round(v, 3) if isinstance(v, float) else v}
    for k, v in random_search.best_params_.items()
])
display(best_params_df)

comparison_df = pd.DataFrame([
    {'Method': 'Grid Search',   'Time (s)': round(elapsed, 2),        'Best CV Score': round(grid_search.best_score_,   3)},
    {'Method': 'Random Search', 'Time (s)': round(elapsed_random, 2), 'Best CV Score': round(random_search.best_score_, 3)},
])
display(comparison_df)

6.4.3.6. Nested Cross-Validation: Unbiased Evaluation

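Problem: The best CV score reported by GridSearchCV is optimistically biased, because the same validation folds are used both to choose the hyperparameters and to score them

Solution: Nested CV

  • Outer loop: Estimate performance on folds that were never used for tuning

  • Inner loop: Tune hyperparameters within each outer training fold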

# Non-nested (optimistically biased): best_score_ is measured on the same
# folds that were used to select the hyperparameters
grid_search_inner = GridSearchCV(dt_base, param_grid, cv=5)
grid_search_inner.fit(X_train, y_train)
non_nested_score = grid_search_inner.best_score_
non_nested_std = grid_search_inner.cv_results_['std_test_score'][grid_search_inner.best_index_]

# Nested CV: passing a GridSearchCV to cross_val_score reruns the inner search
# on each outer training fold, so every outer test fold is unseen during tuning
nested_scores = cross_val_score(grid_search_inner, X_train, y_train, cv=5)

nested_df = pd.DataFrame([
    {'Approach': 'Non-nested (best_score_)', 'Mean CV Accuracy': round(non_nested_score, 3),     'Std': round(non_nested_std, 3),      'Note': 'Optimistic bias'},
    {'Approach': 'Nested CV',                'Mean CV Accuracy': round(nested_scores.mean(), 3), 'Std': round(nested_scores.std(), 3), 'Note': 'Outer loop = performance, inner loop = tuning'},
])
display(nested_df)

Warning

Never report GridSearchCV.best_score_ as final performance — it is the score on the same folds used to select the hyperparameters. Use a held-out test set or proper nested CV instead.

6.4.3.7. Real-World Example: Random Forest Tuning

Let’s tune a Random Forest with multiple hyperparameters!

digits = load_digits()
X_digits, y_digits = digits.data, digits.target

X_train_rf, X_test_rf, y_train_rf, y_test_rf = train_test_split(
    X_digits, y_digits, test_size=0.3, random_state=42, stratify=y_digits
)

rf_default = RandomForestClassifier(random_state=42)
rf_default.fit(X_train_rf, y_train_rf)
default_score = rf_default.score(X_test_rf, y_test_rf)

param_dist_rf = {
    'n_estimators': [50, 100, 200],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2', None]
}

random_search_rf = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_dist_rf, n_iter=100, cv=3, scoring='accuracy', n_jobs=-1, random_state=42
)
random_search_rf.fit(X_train_rf, y_train_rf)
tuned_score = random_search_rf.score(X_test_rf, y_test_rf)

glue('rf-default-score', round(default_score, 3), display=False)
glue('rf-tuned-score',   round(tuned_score,   3), display=False)

display(pd.DataFrame([
    {'Model': 'Default Random Forest', 'Test Accuracy': round(default_score, 3)},
    {'Model': 'Tuned Random Forest',   'Test Accuracy': round(tuned_score,   3)},
    {'Model': 'Best params',           'Test Accuracy': str(random_search_rf.best_params_)},
]))

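Compare the two test accuracies in the table above: the default model's score (default_score) against the tuned model's score (tuned_score).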

6.4.3.8. Key Takeaways

Important

Remember These Points:

  1. Hyperparameters Matter

    • Can shift accuracy noticeably (the decision tree's test accuracy ranged from 0.912 to 0.930 across max_depth values)

    • Default rarely optimal

    • Always tune systematically

  2. Grid Search

    • Exhaustive search

    • Use for small grids (< 100 combos)

    • Guaranteed to find the best CV score within the grid (not necessarily the best possible setting)

  3. Random Search

    • Sample random combinations

    • More efficient for large spaces

    • Often finds good solution faster

  4. Search Space Design

    • Start broad, then narrow

    • Use domain knowledge

    • Log scale for learning rates

  5. Validation Strategy

    • Use CV (stratified if imbalanced)

    • Nested CV for unbiased estimates

    • Never tune on test set!

  6. Computational Efficiency

    • Parallel execution (n_jobs=-1)

    • Random Search for large spaces

    • Reduce CV folds if needed

  7. Reporting

    • Report test set performance

    • Document best hyperparameters

    • Include search time
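The "log scale for learning rates" advice from point 4 can be illustrated with a short sketch. It assumes a hypothetical learning-rate range of 1e-4 to 1e-1 and compares `scipy.stats.loguniform` with a plain `uniform` over the same range; neither the range nor the sample size comes from the examples above.

```python
import numpy as np
from scipy.stats import loguniform, uniform

n = 10_000
low, high = 1e-4, 1e-1  # hypothetical learning-rate range (three decades)

# Log-uniform: every decade (1e-4..1e-3, 1e-3..1e-2, 1e-2..1e-1) is equally likely
log_samples = loguniform(low, high).rvs(size=n, random_state=42)

# Linear uniform over the same range: uniform(loc, scale) covers [loc, loc + scale]
lin_samples = uniform(low, high - low).rvs(size=n, random_state=42)

# Fraction of samples that land in the lowest decade (below 1e-3)
frac_log = np.mean(log_samples < 1e-3)
frac_lin = np.mean(lin_samples < 1e-3)

print(f"Log-uniform samples below 1e-3: {frac_log:.1%}")
print(f"Uniform samples below 1e-3:     {frac_lin:.1%}")
```

On a linear scale almost every sample falls in the top decade, so small learning rates are barely explored; sampling on a log scale spreads the budget evenly across orders of magnitude. A `loguniform` object can be passed directly as a value in the `param_distributions` dict of RandomizedSearchCV.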