6.2.1.3. Regularized Regression#

When a linear model has many features - especially when they are correlated or only weakly relevant - Ordinary Least Squares tends to overfit: it assigns large, noisy coefficients to chase training-set variation that does not generalise.

Regularization fixes this by adding a penalty on the magnitude of the coefficients directly to the loss function. The optimiser is now forced to balance fitting the data and keeping the coefficients small. The result is a model that is slightly biased but has much lower variance, and usually generalises much better.
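A small demonstration of the effect (a synthetic sketch, not this chapter's dataset: two nearly collinear features carry the signal, the remaining eight are pure noise):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 200
base = rng.normal(size=n)
# Columns 0 and 1 are near-duplicates of the same signal; columns 2-9 are noise
X = np.column_stack(
    [base, base + 0.01 * rng.normal(size=n)]
    + [rng.normal(size=n) for _ in range(8)]
)
y = 3.0 * base + rng.normal(scale=0.5, size=n)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# OLS typically splits the shared signal into large, opposite-sign weights;
# Ridge divides it roughly evenly between the two collinear columns.
print("OLS  first two coefs:", ols.coef_[:2])
print("Ridge first two coefs:", ridge.coef_[:2])
```

The two Ridge coefficients each land near 1.5 (half the true weight of 3), which is exactly the stability that the L2 penalty buys on correlated features.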

There are three main variants:

| Variant | Penalty | Key behaviour |
|---|---|---|
| Ridge (L2) | Sum of squared coefficients | Shrinks all coefficients toward zero uniformly; none become exactly zero |
| Lasso (L1) | Sum of absolute coefficients | Drives some coefficients to exactly zero - automatic feature selection |
| Elastic Net | Weighted L1 + L2 | Combines both: sparse like Lasso, stable with correlated features like Ridge |


The Math#

All three regularized models minimise the same base MSE loss (as in Linear Regression), plus a penalty term controlled by the hyperparameter \(\alpha\).

Ridge (L2)#

\[\text{Loss}_{\text{Ridge}} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \alpha \sum_{j=1}^{p} \beta_j^2\]

The L2 penalty creates a smooth penalty landscape - no direction is treated differently, so all coefficients shrink together. The solution remains unique even when features are correlated.
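Because the penalised loss stays quadratic, Ridge even has a closed-form solution. One caveat: scikit-learn's Ridge penalises the unaveraged squared error, \(\lVert y - Xw \rVert^2 + \alpha \lVert w \rVert^2\), so its \(\alpha\) corresponds to \(n\alpha\) in the averaged-MSE formula above. A minimal check against that objective (hypothetical data, intercept disabled so the comparison is exact):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=50)

alpha = 1.0
# Closed form for scikit-learn's objective ||y - Xw||^2 + alpha * ||w||^2
w_closed = np.linalg.solve(X.T @ X + alpha * np.eye(4), X.T @ y)

w_sklearn = Ridge(alpha=alpha, fit_intercept=False).fit(X, y).coef_
print(np.allclose(w_closed, w_sklearn))
```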

Lasso (L1)#

\[\text{Loss}_{\text{Lasso}} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \alpha \sum_{j=1}^{p} |\beta_j|\]

The L1 penalty has sharp corners at zero in each dimension. This geometric property means the optimum frequently lands exactly on zero for some coefficients, producing sparse solutions.
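To see why the corner produces exact zeros, consider the one-dimensional case: minimising \(\frac{1}{2}(z - \beta)^2 + t\,|\beta|\) yields the soft-thresholding operator. A sketch (the function name is ours, not scikit-learn's):

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t; values within t of zero become exactly zero."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

shrunk = soft_threshold(np.array([-3.0, -0.4, 0.2, 2.5]), 0.5)
print(shrunk)  # entries within ±0.5 of zero are zeroed; the rest move 0.5 toward zero
```

An L2 penalty in the same setting only rescales \(z\) by \(1/(1+t)\), which never reaches zero - that is the geometric root of the Ridge/Lasso difference.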

Elastic Net#

\[\text{Loss}_{\text{EN}} = \text{MSE} + \alpha \left[ \rho \sum_{j=1}^{p}|\beta_j| + \frac{1-\rho}{2} \sum_{j=1}^{p}\beta_j^2 \right]\]

The l1_ratio parameter \(\rho\) controls the blend: \(\rho = 1\) recovers Lasso, \(\rho = 0\) recovers Ridge, and an intermediate value such as \(0.5\) weights the two penalties equally.
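A quick sanity check of the \(\rho = 1\) limit (synthetic data; the variable names are illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso

X, y = make_regression(n_samples=100, n_features=5, noise=10, random_state=0)

# With l1_ratio=1 the Elastic Net objective reduces exactly to the Lasso objective
enet = ElasticNet(alpha=0.5, l1_ratio=1.0, max_iter=5000).fit(X, y)
lasso = Lasso(alpha=0.5, max_iter=5000).fit(X, y)
print(np.allclose(enet.coef_, lasso.coef_))
```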


In scikit-learn#

from sklearn.linear_model import Ridge, Lasso, ElasticNet

ridge = Ridge(alpha=1.0)
lasso = Lasso(alpha=0.5, max_iter=5000)
enet  = ElasticNet(alpha=0.5, l1_ratio=0.5, max_iter=5000)

Key hyperparameters:

  • alpha - regularization strength; higher → stronger shrinkage (use RidgeCV / LassoCV to tune)

  • l1_ratio (Elastic Net only) - blend between L1 and L2


Example#


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from myst_nb import glue
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.datasets import make_regression

np.random.seed(42)

X, y = make_regression(n_samples=300, n_features=10, n_informative=6,
                        noise=25, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

def test_stats(model):
    model.fit(X_train, y_train)
    r2   = r2_score(y_test, model.predict(X_test))
    rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
    return round(r2, 3), round(rmse, 1)
ridge = Ridge(alpha=1.0)
lasso = Lasso(alpha=0.5, max_iter=5000)
enet  = ElasticNet(alpha=0.5, l1_ratio=0.5, max_iter=5000)

ridge_r2, ridge_rmse = test_stats(ridge)
lasso_r2, lasso_rmse = test_stats(lasso)
enet_r2,  enet_rmse  = test_stats(enet)

pd.DataFrame({
    "Model":     ["Ridge (α=1.0)", "Lasso (α=0.5)", "Elastic Net (α=0.5, ρ=0.5)"],
    "Test R²":   [ridge_r2, lasso_r2, enet_r2],
    "Test RMSE": [ridge_rmse, lasso_rmse, enet_rmse],
})
| Model | Test R² | Test RMSE |
|---|---|---|
| Ridge (α=1.0) | 0.978 | 26.3 |
| Lasso (α=0.5) | 0.978 | 26.6 |
| Elastic Net (α=0.5, ρ=0.5) | 0.924 | 49.1 |

Ridge achieves \(R^2\) = 0.978, Lasso 0.978, and Elastic Net 0.924. The dataset has only 6 informative features out of 10, so Lasso’s feature-selection behaviour is particularly relevant here.

Coefficient Paths - How \(\alpha\) Changes the Solution#

As \(\alpha\) increases, more regularization is applied and all coefficients shrink. For Lasso, they hit zero one by one:


alphas = np.logspace(-2, 3, 60)
ridge_coefs = np.array([Ridge(alpha=a).fit(X_train, y_train).coef_ for a in alphas])
lasso_coefs = np.array([Lasso(alpha=a, max_iter=10000).fit(X_train, y_train).coef_ for a in alphas])

fig, axes = plt.subplots(1, 2, figsize=(14, 5))
for i in range(X_train.shape[1]):
    axes[0].plot(np.log10(alphas), ridge_coefs[:, i], linewidth=1.4)
    axes[1].plot(np.log10(alphas), lasso_coefs[:, i], linewidth=1.4)

for ax, title in zip(axes, ["Ridge - Coefficient Paths (L2)", "Lasso - Coefficient Paths (L1)"]):
    ax.axhline(0, color="black", linestyle="--", linewidth=0.8)
    ax.set_xlabel("log₁₀(α)", fontsize=12)
    ax.set_ylabel("Coefficient value", fontsize=12)
    ax.set_title(title, fontsize=12, fontweight="bold")
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
[Figure: Ridge (L2) and Lasso (L1) coefficient paths against log₁₀(α)]

In the Ridge plot, every line smoothly approaches zero but never fully reaches it. In the Lasso plot, lines hit zero and stay there - those are features being excluded from the model entirely.
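The extreme-\(\alpha\) end of those paths can be checked directly (a sketch that regenerates the same synthetic data so it runs standalone):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=300, n_features=10, n_informative=6,
                       noise=25, random_state=42)

# At a very large alpha, Lasso zeroes every coefficient...
lasso_big = Lasso(alpha=1e4, max_iter=10000).fit(X, y)
print("Lasso non-zero coefs:", np.count_nonzero(lasso_big.coef_))

# ...while Ridge coefficients become tiny but stay non-zero
ridge_big = Ridge(alpha=1e4).fit(X, y)
print("Ridge non-zero coefs:", np.count_nonzero(ridge_big.coef_))
```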

How Many Features Does Lasso Keep?#

non_zero_by_alpha = [(np.sum(lasso_coefs[i] != 0)) for i in range(len(alphas))]

plt.figure(figsize=(9, 4))
plt.plot(np.log10(alphas), non_zero_by_alpha, "o-", markersize=4, linewidth=2, color="steelblue")
plt.xlabel("log₁₀(α)", fontsize=12)
plt.ylabel("Non-zero coefficients", fontsize=12)
plt.title("Lasso - Feature Selection as α Increases", fontsize=13, fontweight="bold")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
[Figure: Lasso - Feature Selection as α Increases]
# Count non-zero coefficients in the fitted Lasso (α=0.5)
n_selected = int(np.sum(lasso.coef_ != 0))
glue("lasso-n-selected", n_selected, display=False)

With \(\alpha = 0.5\), Lasso retains 9 of the 10 features: this penalty is only strong enough to discard one of the four non-informative features, and a larger \(\alpha\) would be needed to remove the rest.

Ridge vs Lasso - Side-by-Side Coefficient Comparison#


features = [f"Feature {i}" for i in range(X.shape[1])]
x_pos = np.arange(len(features))
width = 0.35

fig, ax = plt.subplots(figsize=(11, 5))
ax.bar(x_pos - width/2, ridge.coef_, width, label="Ridge (α=1.0)",
       alpha=0.85, edgecolor="black", linewidth=0.6, color="steelblue")
ax.bar(x_pos + width/2, lasso.coef_, width, label="Lasso (α=0.5)",
       alpha=0.85, edgecolor="black", linewidth=0.6, color="tomato")
ax.axhline(0, color="black", linewidth=0.8)
ax.set_xticks(x_pos)
ax.set_xticklabels(features, rotation=45, ha="right", fontsize=10)
ax.set_ylabel("Coefficient value", fontsize=12)
ax.set_title("Ridge vs Lasso - Coefficient Comparison", fontsize=13, fontweight="bold")
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3, axis="y")
plt.tight_layout()
plt.show()
[Figure: Ridge vs Lasso - Coefficient Comparison]

Choosing Between Ridge, Lasso, and Elastic Net#

| Situation | Recommended |
|---|---|
| Many features, each contributes a little | Ridge - keeps all features, stable |
| Few features truly matter (sparse signal) | Lasso - zeroes out irrelevant features |
| Correlated features + you want sparsity | Elastic Net - handles correlations better than pure Lasso |
| Unsure | Elastic Net - safe default that interpolates between both |

Tip

Use RidgeCV, LassoCV, or ElasticNetCV to automatically select alpha via cross-validation instead of guessing.
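A minimal sketch of that workflow on the same kind of synthetic data (the alpha grid below is an arbitrary choice, not a recommendation):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, RidgeCV

X, y = make_regression(n_samples=300, n_features=10, n_informative=6,
                       noise=25, random_state=42)

# RidgeCV scores each candidate alpha by (generalized) cross-validation
ridge_cv = RidgeCV(alphas=np.logspace(-2, 3, 30)).fit(X, y)

# LassoCV builds its own alpha path and picks the best by k-fold CV
lasso_cv = LassoCV(cv=5, max_iter=10000, random_state=42).fit(X, y)

print("Best Ridge alpha:", ridge_cv.alpha_)
print("Best Lasso alpha:", lasso_cv.alpha_)
```

The selected values live in the fitted estimators' alpha_ attribute, and the estimators can then be used for prediction like any other fitted model.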