Regularization: Taming Complexity

6.1.7. Regularization: Taming Complexity

In the previous sections, we explored overfitting and the bias–variance tradeoff. We saw how models that are too flexible can memorize noise instead of learning structure. Regularization is the natural next step. It provides a principled way to control complexity and guide models toward better generalization.

You can think of regularization as adding guardrails to a powerful model. A flexible model is capable of fitting almost anything, including randomness. Regularization keeps that flexibility in check without removing it entirely.

6.1.7.1. The Core Idea

At its heart, regularization introduces a simple but powerful modification to the learning objective:

Important

Regularization = Penalizing Complexity

Instead of just minimizing error, we minimize: \( \text{Total Loss} = \text{Error} + \lambda \times \text{Complexity Penalty}\)

Where:

Error: How well the model fits training data
Complexity Penalty: How complex/flexible the model is
λ (lambda): Controls the tradeoff (hyperparameter)

Rather than asking only, “How well does the model fit the training data?”, we also ask, “How complex is this model?” The hyperparameter λ determines how strongly we care about simplicity. A larger value places more weight on the penalty, encouraging simpler models.

In practice:

Without regularization, the model minimizes only error.
With regularization, the model minimizes error plus a penalty.
The result is often better generalization to unseen data.
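As a minimal sketch of the objective above, the snippet below computes error plus penalty for a linear model. It assumes mean squared error as the error term and the sum of squared weights as the complexity penalty; both choices, along with the names `total_loss` and `lam`, are illustrative assumptions, and other penalties fit the same template.

```python
import numpy as np

def total_loss(w, X, y, lam):
    """Return MSE plus lam times the sum of squared weights."""
    error = np.mean((X @ w - y) ** 2)
    penalty = np.sum(w ** 2)
    return error + lam * penalty

# Tiny illustrative dataset: y = 2 * x exactly
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])
w_exact = np.array([2.0])   # fits the data perfectly
w_small = np.array([1.5])   # fits slightly worse, but has a smaller weight

for lam in (0.0, 1.0):
    print(lam, total_loss(w_exact, X, y, lam), total_loss(w_small, X, y, lam))
```

With `lam = 0` the perfect fit has the lower loss; with `lam = 1` the smaller weight wins, which is the tradeoff that λ controls.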

This forces the model to find a balance between fitting the data and staying simple.

The following experiment illustrates this effect concretely.

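import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Generate synthetic data: a noisy quadratic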
X = np.sort(np.random.uniform(0, 10, 30)).reshape(-1, 1)
true_function = lambda x: 5 + 2*x - 0.3*x**2
y = true_function(X.ravel()) + np.random.normal(0, 3, 30)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create polynomial features
poly = PolynomialFeatures(10, include_bias=False)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

# Train models with and without regularization
model_unreg = LinearRegression()
model_unreg.fit(X_train_poly, y_train)

model_reg = Ridge(alpha=10.0)
model_reg.fit(X_train_poly, y_train)

import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error

fig, axes = plt.subplots(1, 3, figsize=(16, 4))

# Plot 1: Coefficients comparison
axes[0].bar(range(len(model_unreg.coef_)), model_unreg.coef_,
           alpha=0.6, label='Without regularization', color='red')
axes[0].bar(range(len(model_reg.coef_)), model_reg.coef_,
           alpha=0.6, label='With regularization', color='blue')
axes[0].set_xlabel('Feature Index', fontsize=11)
axes[0].set_ylabel('Coefficient Value', fontsize=11)
axes[0].set_title('Regularization Shrinks Coefficients', fontsize=12, fontweight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3, axis='y')

# Plot 2: Fitted curves against the data and the true function
X_plot = np.linspace(0, 10, 200).reshape(-1, 1)
X_plot_poly = poly.transform(X_plot)
y_pred_unreg = model_unreg.predict(X_plot_poly)
y_pred_reg = model_reg.predict(X_plot_poly)

axes[1].scatter(X_train, y_train, alpha=0.6, s=60, label='Training data', color='blue')
axes[1].scatter(X_test, y_test, alpha=0.6, s=100, marker='s', label='Test data', color='green')
axes[1].plot(X_plot, y_pred_unreg, 'r-', linewidth=2, label='Without regularization', alpha=0.7)
axes[1].plot(X_plot, y_pred_reg, 'b-', linewidth=2, label='With regularization')
axes[1].plot(X_plot, true_function(X_plot.ravel()), 'k--', linewidth=2, alpha=0.5, label='True function')
axes[1].set_xlabel('X', fontsize=11)
axes[1].set_ylabel('y', fontsize=11)
axes[1].set_title('Regularization Prevents Wild Predictions', fontsize=12, fontweight='bold')
axes[1].legend(fontsize=9)
axes[1].grid(True, alpha=0.3)
axes[1].set_ylim([y.min()-10, y.max()+10])

# Plot 3: Train and test error for both models
train_error_unreg = mean_squared_error(y_train, model_unreg.predict(X_train_poly))
test_error_unreg = mean_squared_error(y_test, model_unreg.predict(X_test_poly))
train_error_reg = mean_squared_error(y_train, model_reg.predict(X_train_poly))
test_error_reg = mean_squared_error(y_test, model_reg.predict(X_test_poly))

x_pos = [0, 1]
width = 0.35
axes[2].bar([p - width/2 for p in x_pos], [train_error_unreg, test_error_unreg],
           width, label='Without regularization', color='red', alpha=0.7)
axes[2].bar([p + width/2 for p in x_pos], [train_error_reg, test_error_reg],
           width, label='With regularization', color='blue', alpha=0.7)
axes[2].set_xticks(x_pos)
axes[2].set_xticklabels(['Train Error', 'Test Error'])
axes[2].set_ylabel('MSE', fontsize=11)
axes[2].set_title('Regularization Improves Test Error', fontsize=12, fontweight='bold')
axes[2].legend()
axes[2].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

[Figure: three panels titled "Regularization Shrinks Coefficients", "Regularization Prevents Wild Predictions", and "Regularization Improves Test Error"]

After training both models, we observe a clear pattern:

Without regularization:

Max coefficient magnitude: 226.94
Train MSE: 3.84
Test MSE: 880.95

With regularization:

Max coefficient magnitude: 0.22
Train MSE: 6.35
Test MSE: 11.12

Regularization reduces coefficient magnitudes and improves test performance. The model becomes less extreme, and as a result, more reliable.
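Degree-10 polynomial features of inputs in [0, 10] span many orders of magnitude, which can make the fit numerically ill-conditioned. One common remedy, sketched here as an assumption rather than as the setup behind the numbers above, is to standardize the expanded features inside a pipeline. The seed and the pipeline layout are illustrative choices, so the printed values will differ from the results reported above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)  # assumed seed, for repeatability only
X = np.sort(rng.uniform(0, 10, 30)).reshape(-1, 1)
y = 5 + 2 * X.ravel() - 0.3 * X.ravel() ** 2 + rng.normal(0, 3, 30)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

def make_model(estimator):
    # Expand to degree-10 polynomial terms, then put them on a common scale
    return make_pipeline(
        PolynomialFeatures(10, include_bias=False), StandardScaler(), estimator
    )

results = {}
for name, est in [("unregularized", LinearRegression()), ("ridge", Ridge(alpha=10.0))]:
    model = make_model(est).fit(X_train, y_train)
    coef = model[-1].coef_  # coefficients of the final estimator in the pipeline
    results[name] = {
        "coef_norm": float(np.linalg.norm(coef)),
        "train_mse": mean_squared_error(y_train, model.predict(X_train)),
        "test_mse": mean_squared_error(y_test, model.predict(X_test)),
    }
    print(name, results[name])
```

Because the penalty acts on standardized features here, `alpha=10.0` does not correspond to the same amount of shrinkage as in the unscaled experiment.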

6.1.7.2. Why Regularization Works

Overfitting arises when a model is too flexible relative to the amount of available data. Flexibility usually comes from having many parameters, and with limited data those parameters can take extreme values to match noise perfectly.

Regularization counteracts this tendency.

The mechanism is intuitive:

Complex models contain many parameters.
With limited data, parameters may grow very large to fit small fluctuations.
Regularization penalizes large parameter values.
The model still captures structure but avoids chasing noise.

The penalty acts as a soft constraint. It does not forbid complexity outright. Instead, it makes complexity costly.

6.1.7.3. L2 Regularization (Ridge): Shrinking Weights

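Ridge regression introduces an L2 penalty, which is the sum of squared coefficients. The regularization strength, written λ above, is written α from here on to match scikit-learn's alpha parameter.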

\[ \text{Loss}_{\text{Ridge}} = \text{MSE} + \alpha \sum_{i=1}^{n} w_i^2 \]

Because the penalty squares the coefficients, larger values are penalized more heavily. The effect is smooth shrinkage: coefficients move toward zero but rarely become exactly zero.
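Ridge has a closed-form solution, which makes the shrinkage easy to see. The sketch below assumes no intercept and uses the sum-of-squared-errors form of the objective, ‖y − Xw‖² + α‖w‖², which is the form scikit-learn's `Ridge` documents; it differs from the MSE form above only by rescaling α by the number of samples. The helper name `ridge_closed_form` and the synthetic data are hypothetical.

```python
import numpy as np

def ridge_closed_form(X, y, alpha):
    """Solve (X^T X + alpha * I) w = X^T y for w (no intercept)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = X @ np.array([5.0, -3.0, 2.0, 0.0, 0.0]) + rng.normal(scale=2.0, size=40)

norms = []
for alpha in [0.0, 1.0, 10.0, 100.0]:
    w = ridge_closed_form(X, y, alpha)
    norms.append(np.linalg.norm(w))
    print(f"alpha={alpha:>5}: ||w|| = {norms[-1]:.3f}")
```

Adding α to the diagonal of XᵀX is what pulls the solution toward zero: the norm of the weight vector decreases as α grows, yet individual weights do not generally reach exactly zero.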

Intuition

Large coefficients become expensive.
All coefficients shrink toward zero, though not necessarily by the same factor.
Variance decreases.
Overfitting is reduced.

The next example demonstrates how increasing α affects both coefficients and error.

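import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Generate synthetic data: only the first 3 of 20 features are informative
np.random.seed(42)
n_samples, n_features = 50, 20

X = np.random.randn(n_samples, n_features)
true_coef = np.zeros(n_features)
true_coef[:3] = [5, -3, 2]
y = X @ true_coef + np.random.randn(n_samples) * 2

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

alphas = [0, 0.1, 1, 10, 100]
models = {}

# Fit one Ridge model per alpha and report coefficient size and test error
for alpha in alphas:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    models[alpha] = model
    test_mse = np.mean((y_test - model.predict(X_test)) ** 2)
    print(f"alpha={alpha}: max |coef|={np.abs(model.coef_).max():.3f}, test MSE={test_mse:.3f}")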

[Figure: Ridge coefficients and error as α increases]

From the experiment:

Key Observations:

α = 0 produces large coefficients and overfitting.
Small α introduces mild shrinkage.
Moderate α achieves the best test performance.
Large α forces all coefficients near zero, causing underfitting.

Optimal α for this data: ~1

Ridge regression is particularly useful when features are correlated. Instead of eliminating variables, it distributes weight across them more conservatively.
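To illustrate the point about correlated features, the following sketch builds two nearly identical features (an artificial setup chosen for this example) and compares ordinary least squares with Ridge. The data, seed, and α value are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)   # almost a copy of x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.5, size=n)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# OLS may split the effect erratically between the two columns;
# Ridge tends to share it almost evenly.
print("OLS coefficients:  ", ols.coef_)
print("Ridge coefficients:", ridge.coef_)
```

The two Ridge coefficients should be close to each other and sum to roughly the true effect of 3, whereas the unregularized coefficients can be large and of opposite sign while still summing to about the same value.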

Tip

Use Ridge (L2) when:

You have many correlated features
You want to keep all features but reduce their impact
You have more features than samples
Features are on different scales (standardize them first, for example with StandardScaler, because the penalty depends on feature scale)

6.1.7.4. L1 Regularization (Lasso): Sparse Solutions

Lasso regression replaces the squared penalty with an absolute value penalty:

\[ \text{Loss}_{\text{Lasso}} = \text{MSE} + \alpha \sum_{i=1}^{n} |w_i| \]

This small mathematical change has a dramatic effect. The L1 penalty encourages exact zeros in the coefficient vector.

Intuition

Any nonzero coefficient incurs a cost.
Many coefficients collapse to exactly zero.
The model performs automatic feature selection.

In contrast to Ridge, which shrinks everything smoothly, Lasso simplifies the model structurally.
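One way to see why L1 produces exact zeros while L2 does not is the special case of orthonormal features (XᵀX = I), where both problems have closed-form solutions in terms of the least-squares coefficients z = Xᵀy. Under the assumed objectives ½‖y − Xw‖² + α‖w‖₁ for Lasso and ½‖y − Xw‖² + (α/2)‖w‖² for Ridge, Lasso applies soft-thresholding and Ridge applies a uniform rescaling. The function names and the example values are illustrative.

```python
import numpy as np

def soft_threshold(z, alpha):
    """Lasso solution per coordinate: shrink by alpha, clip at zero."""
    return np.sign(z) * np.maximum(np.abs(z) - alpha, 0.0)

def ridge_shrink(z, alpha):
    """Ridge solution per coordinate: rescale by 1 / (1 + alpha)."""
    return z / (1.0 + alpha)

z = np.array([3.0, -0.5, 0.2, -2.0])     # least-squares coefficients
lasso_w = soft_threshold(z, alpha=1.0)   # entries with |z| <= alpha become 0
ridge_w = ridge_shrink(z, alpha=1.0)     # every entry is halved, none is 0

print("nonzero after L1:", np.count_nonzero(lasso_w))
print("nonzero after L2:", np.count_nonzero(ridge_w))
```

Any coefficient whose magnitude falls below the threshold is removed outright by the L1 rule, while the L2 rule only scales it down. Outside the orthonormal case the formulas are more involved, but the qualitative difference remains.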

Lasso Feature Selection:

α=0.01: Selected features: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14, 15, 17, 18, 19], Number: 18/20
α=0.1: Selected features: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 14, 15, 17], Number: 14/20
α=0.5: Selected features: [0, 1, 2, 3, 9], Number: 5/20 (includes the 3 true features plus 2 spurious ones)
α=1.0: Selected features: [0, 1, 2], Number: 3/20 ✓ Exactly the 3 true features
α=5.0: Selected features: [0], Number: 1/20

Lasso is especially attractive when interpretability matters. A sparse model is easier to understand and communicate.
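A feature-selection report like the one above can be produced by checking which Lasso coefficients are nonzero. This sketch assumes the same synthetic setup as the Ridge example (seed 42, three informative features out of 20); the exact selections may vary with library versions and with any random state consumed earlier in a session.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

np.random.seed(42)
n_samples, n_features = 50, 20
X = np.random.randn(n_samples, n_features)
true_coef = np.zeros(n_features)
true_coef[:3] = [5, -3, 2]
y = X @ true_coef + np.random.randn(n_samples) * 2
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

selected = {}
for alpha in [0.01, 0.1, 0.5, 1.0, 5.0]:
    lasso = Lasso(alpha=alpha, max_iter=10000).fit(X_train, y_train)
    # Indices of features whose coefficient is not exactly zero
    selected[alpha] = np.flatnonzero(lasso.coef_).tolist()
    print(f"alpha={alpha}: {selected[alpha]} ({len(selected[alpha])}/{n_features})")
```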

6.1.7.5. Ridge vs Lasso

When comparing Ridge and Lasso directly, their philosophical difference becomes clear.

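import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso, Ridge

alpha_compare = 1.0

fig, axes = plt.subplots(2, 2, figsize=(14, 8))

# Train Ridge and Lasso at the same alpha for comparison.
# Note: scikit-learn scales the two objectives differently, so equal alpha
# values do not imply equal penalty strength.
ridge = Ridge(alpha=alpha_compare)
ridge.fit(X_train, y_train)

lasso = Lasso(alpha=alpha_compare, max_iter=10000)
lasso.fit(X_train, y_train)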

axes[0, 0].stem(range(n_features), ridge.coef_, linefmt='b-', markerfmt='bo', label='Ridge', basefmt=' ')
axes[0, 0].plot(range(n_features), true_coef, 'k*', markersize=12, label='True')
axes[0, 0].axhline(y=0, color='gray', linestyle='--', alpha=0.5)
axes[0, 0].set_xlabel('Feature Index', fontsize=11)
axes[0, 0].set_ylabel('Coefficient', fontsize=11)
axes[0, 0].set_title(f'Ridge (α={alpha_compare}): All Non-Zero', fontsize=12, fontweight='bold')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3, axis='y')

axes[0, 1].stem(range(n_features), lasso.coef_, linefmt='r-', markerfmt='ro', label='Lasso', basefmt=' ')
axes[0, 1].plot(range(n_features), true_coef, 'k*', markersize=12, label='True')
axes[0, 1].axhline(y=0, color='gray', linestyle='--', alpha=0.5)
axes[0, 1].set_xlabel('Feature Index', fontsize=11)
axes[0, 1].set_ylabel('Coefficient', fontsize=11)
axes[0, 1].set_title(f'Lasso (α={alpha_compare}): Sparse', fontsize=12, fontweight='bold')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3, axis='y')

coef_mag_ridge = np.abs(ridge.coef_)
coef_mag_lasso = np.abs(lasso.coef_)

axes[1, 0].bar(range(n_features), coef_mag_ridge, color='blue', alpha=0.6, label='Ridge')
axes[1, 0].set_xlabel('Feature Index', fontsize=11)
axes[1, 0].set_ylabel('|Coefficient|', fontsize=11)
axes[1, 0].set_title('Ridge: Small But Non-Zero', fontsize=12, fontweight='bold')
axes[1, 0].grid(True, alpha=0.3, axis='y')

axes[1, 1].bar(range(n_features), coef_mag_lasso, color='red', alpha=0.6, label='Lasso')
axes[1, 1].set_xlabel('Feature Index', fontsize=11)
axes[1, 1].set_ylabel('|Coefficient|', fontsize=11)
axes[1, 1].set_title('Lasso: Many Exactly Zero', fontsize=12, fontweight='bold')
axes[1, 1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

[Figure: four panels titled "Ridge (α=1.0): All Non-Zero", "Lasso (α=1.0): Sparse", "Ridge: Small But Non-Zero", and "Lasso: Many Exactly Zero"]

Ridge vs Lasso:

Ridge (L2):

Non-zero coefficients: 20/20
Max coefficient: 5.296
Test MSE: 6.634

Lasso (L1):

Non-zero coefficients: 3/20
Max coefficient: 4.161
Test MSE: 11.113

Aspect	Ridge (L2)	Lasso (L1)
Penalty	Sum of squared coefficients	Sum of absolute coefficients
Effect on coefficients	Shrinks all toward zero	Sets many to exactly zero
Feature selection	No	Yes (automatic)
When to use	Many correlated features	Suspect many features irrelevant
Interpretability	All features contribute	Sparse, easy to interpret
Computational	Has closed-form solution	Requires iterative optimization

Ridge preserves all features with reduced magnitude. Lasso removes many features entirely.

The choice depends on whether you value stability across correlated features or sparsity and interpretability.

6.1.7.6. Elastic Net: Best of Both Worlds

Elastic Net combines both penalties:

\[ \text{Loss}_{\text{ElasticNet}} = \text{MSE} + \alpha \rho \sum_{i=1}^{n} |w_i| + \alpha (1-\rho) \sum_{i=1}^{n} w_i^2 \]

Here:

α controls overall regularization strength.
ρ controls the balance between L1 and L2 (scikit-learn calls this parameter l1_ratio).

When ρ = 0, the model behaves like Ridge. When ρ = 1, it behaves like Lasso.

Elastic Net is particularly useful when features are highly correlated and sparsity is still desirable. It tends to be more stable than pure Lasso while still producing simpler models.
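In scikit-learn, the mixing weight is exposed as the `l1_ratio` parameter of `ElasticNet`. The library's documented objective scales the terms slightly differently from the formula above (the squared-error term is divided by 2n and the L2 term carries a factor of ½), so α values are not directly interchangeable with the simplified formula. The data and parameter values below are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(42)
X = rng.normal(size=(50, 20))
true_coef = np.zeros(20)
true_coef[:3] = [5, -3, 2]
y = X @ true_coef + rng.normal(scale=2.0, size=50)

# Equal mix of L1 and L2 penalties
enet = ElasticNet(alpha=0.5, l1_ratio=0.5, max_iter=10000).fit(X, y)
print("Elastic Net non-zero coefficients:", np.count_nonzero(enet.coef_))

# With l1_ratio=1.0 the L2 term vanishes and the fit matches Lasso
enet_l1 = ElasticNet(alpha=0.5, l1_ratio=1.0, max_iter=10000).fit(X, y)
lasso = Lasso(alpha=0.5, max_iter=10000).fit(X, y)
print("Matches Lasso:", np.allclose(enet_l1.coef_, lasso.coef_))
```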

Results:

Ridge: 20 non-zero coefficients
Lasso: 5 non-zero coefficients
Elastic Net: 11 non-zero coefficients

Elastic Net balances sparsity and stability.

6.1.7.7. Choosing the Regularization Strength α

The penalty form matters, but the strength of the penalty often matters even more.

α = 0 means no regularization.
Small α introduces mild control.
Large α forces simplicity and risks underfitting.

Selecting α should not rely on guesswork. Cross-validation provides a systematic approach.

How to find the optimal α? Cross-validation!

Finding Optimal Regularization Strength

Ridge best α: 2.6827
Lasso best α: 0.2812

[Figure: cross-validation error as a function of α for Ridge and Lasso]

Cross-validation curves make the tradeoff visible: too little regularization increases variance, too much increases bias. The minimum point balances both.
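A search of this kind can be sketched with `RidgeCV` and `LassoCV`. The grid of 50 log-spaced values between 10⁻³ and 10³ with 5-fold cross-validation mirrors the settings used in this section; the synthetic data and seed are assumptions carried over from the earlier examples, so the selected values may not match the reported ones exactly.

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.model_selection import train_test_split

np.random.seed(42)
X = np.random.randn(50, 20)
true_coef = np.zeros(20)
true_coef[:3] = [5, -3, 2]
y = X @ true_coef + np.random.randn(50) * 2
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

alphas = np.logspace(-3, 3, 50)  # wide logarithmic grid

# Each estimator picks the alpha with the best 5-fold cross-validated score
ridge_cv = RidgeCV(alphas=alphas, cv=5).fit(X_train, y_train)
lasso_cv = LassoCV(alphas=alphas, cv=5, max_iter=10000).fit(X_train, y_train)

print(f"Ridge best alpha: {ridge_cv.alpha_:.4f}")
print(f"Lasso best alpha: {lasso_cv.alpha_:.4f}")
```

Both estimators refit on the full training set with the selected value, so `ridge_cv` and `lasso_cv` can be used directly for prediction afterwards.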

Optimal models found via cross-validation:

Ridge test MSE: 6.989
Lasso test MSE: 5.981

Tip

Best Practice for Choosing α:

Start with a wide logarithmic range.
Use cross-validation tools such as RidgeCV or LassoCV.
For Elastic Net, tune both α and l1_ratio.
Visualize the validation curve.
Evaluate the final choice on a held-out test set.
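For Elastic Net, `ElasticNetCV` can search over α and `l1_ratio` together. The candidate `l1_ratio` values, the α grid, and the synthetic data below are illustrative assumptions, not recommendations.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 20))
true_coef = np.zeros(20)
true_coef[:3] = [5, -3, 2]
y = X @ true_coef + rng.normal(scale=2.0, size=80)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

l1_ratios = [0.1, 0.5, 0.9, 1.0]
model = ElasticNetCV(
    l1_ratio=l1_ratios, alphas=np.logspace(-3, 1, 30), cv=5, max_iter=10000
)
model.fit(X_train, y_train)

# The held-out test set is used once, after tuning is finished
test_mse = mean_squared_error(y_test, model.predict(X_test))
print(model.alpha_, model.l1_ratio_, test_mse)
```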

6.1.7.8. Beyond Linear Models

Regularization is not limited to linear regression. The principle of controlling complexity appears across machine learning.

Dropout

In neural networks, dropout randomly disables neurons during training. This prevents units from becoming overly dependent on one another and improves generalization.

Dropout: Randomly “drop” (set to zero) neurons during training

Effect: Prevents co-adaptation of neurons, reduces overfitting

# Conceptual example (PyTorch style)
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(100, 50),
    nn.ReLU(),
    nn.Dropout(p=0.5),  # zero each activation with probability 0.5 during training
    nn.Linear(50, 10),
)
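To make the mechanism concrete, here is a minimal NumPy sketch of "inverted" dropout, the variant in which surviving activations are rescaled by 1/(1 − p) during training so that no rescaling is needed at inference (the convention `nn.Dropout` follows). The function name and signature are hypothetical.

```python
import numpy as np

def dropout(x, p, rng, training=True):
    """Zero each element with probability p; rescale survivors by 1 / (1 - p)."""
    if not training or p == 0.0:
        return x
    keep_mask = rng.random(x.shape) >= p
    return x * keep_mask / (1.0 - p)

rng = np.random.default_rng(0)
activations = np.ones((4, 5))

train_out = dropout(activations, p=0.5, rng=rng)                  # entries are 0.0 or 2.0
eval_out = dropout(activations, p=0.5, rng=rng, training=False)   # unchanged
print(train_out)
```

The rescaling keeps the expected value of each activation the same in training and evaluation, which is why the layer can simply be switched off at inference time.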

Early Stopping

Another form of regularization does not modify the objective function at all. Instead, it limits training duration.

As training proceeds, training error often decreases steadily. Validation error, however, eventually begins to increase. Stopping at the minimum validation error prevents the model from overfitting.

Early Stopping: Stop training when validation error stops improving

Effect: Prevents model from overfitting by not training too long

[Figure: training and validation error over training epochs]

Early Stopping:

Best validation error at epoch: 30
If we continued to epoch 100:
- Training error would be: 0.57 (keeps improving)
- Validation error would be: 5.80 (gets worse!)
- Early stopping prevents overfitting without modifying the model
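A common way to implement this is a patience rule: stop once validation error has failed to improve for a fixed number of consecutive epochs, and keep the best epoch seen so far. The sketch below operates on a precomputed list of validation errors; the function name, the `patience` parameter, and the numbers are assumptions for illustration.

```python
def early_stopping_epoch(val_errors, patience=3):
    """Return (best_epoch, stop_epoch) under a simple patience rule."""
    best_epoch, best_error = 0, float("inf")
    for epoch, error in enumerate(val_errors):
        if error < best_error:
            best_epoch, best_error = epoch, error
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch  # no improvement for `patience` epochs
    return best_epoch, len(val_errors) - 1

# Validation error falls, then rises as the model starts to overfit
val_errors = [5.0, 3.0, 2.0, 1.8, 1.9, 2.1, 2.4, 3.0]
best, stopped = early_stopping_epoch(val_errors, patience=3)
print(best, stopped)  # prints "3 6"
```

In a real training loop the model weights from the best epoch would be saved and restored when training stops.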

Data Augmentation

Data augmentation increases the effective size of the dataset by generating new examples from existing ones. By exposing the model to more variation, it reduces overfitting without changing the model structure.

Examples include:

Images: rotation, flipping, cropping, color jitter
Text: synonym replacement, back-translation
Time series: time warping, noise injection
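As a small illustration using NumPy only, the sketch below augments a batch of image-like arrays with horizontal flips and a time series with noise injection. The array shapes, the noise scale, and the function names are assumptions; real pipelines usually rely on a dedicated augmentation library.

```python
import numpy as np

def flip_horizontal(images):
    """Mirror a batch of images shaped (n, height, width) left to right."""
    return images[:, :, ::-1]

def add_noise(series, scale, rng):
    """Return a noisy copy of a 1-D series (Gaussian noise injection)."""
    return series + rng.normal(scale=scale, size=series.shape)

rng = np.random.default_rng(0)
images = rng.random((8, 4, 4))
# Original images followed by their mirrored copies: 16 images in total
augmented_images = np.concatenate([images, flip_horizontal(images)])

series = np.sin(np.linspace(0, 2 * np.pi, 50))
noisy_series = add_noise(series, scale=0.05, rng=rng)
print(augmented_images.shape, noisy_series.shape)
```

Augmentations should preserve the label: a horizontal flip is harmless for many object categories but would be wrong for, say, recognizing handwritten digits or text.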

6.1.7.9. Practical Guidelines

Note

When to Use Which Regularization:

Use Ridge (L2) when:

All features may be relevant
Features are correlated
Smooth shrinkage is desired

Use Lasso (L1) when:

Many features are likely irrelevant
Interpretability is important
Automatic feature selection is needed

Use Elastic Net when:

You want both stability and sparsity
Correlated feature groups exist
Features outnumber samples

Use Dropout when:

Training deep neural networks

Use Early Stopping when:

Training iterative models
Validation error begins to increase
Computational resources are limited
