Metrics and Loss Functions

6.2.1.1. Metrics and Loss Functions#

Before training any regression model, we need a way to answer a fundamental question: how good is a prediction? Two numbers are involved - the true value \(y\) and the predicted value \(\hat{y}\) - and we need a single number that captures how far apart they are.

This single number is called a loss (or cost) when used to guide training, and a metric when used to evaluate a trained model. The distinction matters:

Loss functions need to be mathematically convenient for optimisation (e.g. differentiable). The model minimises the loss during training.
Evaluation metrics are chosen to be interpretable and meaningful for the problem at hand. They are computed after training to communicate model quality.

Loss Functions#

Mean Squared Error (MSE)#

\[\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\]

MSE is the workhorse of regression. Squaring the errors has two effects: it makes the result always non-negative, and it penalises large errors disproportionately - a prediction that is 10 units off contributes 100 times as much as one that is 1 unit off. This makes MSE sensitive to outliers.

Properties:

Differentiable everywhere → easy to optimise with gradient descent
Strongly penalises large deviations
Units are squared (e.g. dollars²) - less intuitive

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

pd.DataFrame({
    "Predictions": ["Good", "Poor", "With one outlier"],
    "MSE": [mse(y_true, y_pred_good), mse(y_true, y_pred_poor), mse(y_true, y_pred_outlier)],
}).round(2)

	Predictions	MSE
0	Good	6.04
1	Poor	256.19
2	With one outlier	271.80

On the example data, good predictions yield an MSE of 6.04; introducing a single outlier inflates this to 271.8 - a dramatic increase caused by squaring one large residual.

Root Mean Squared Error (RMSE)#

\[\text{RMSE} = \sqrt{\text{MSE}} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}\]

RMSE is simply the square root of MSE, restoring the original units. It is the most commonly reported regression error because it is on the same scale as the target variable and still penalises large errors.

Properties:

Same units as \(y\) → directly interpretable
Still sensitive to outliers (inherits MSE’s squared penalty)
Lower is better; RMSE = 0 means perfect predictions

def rmse(y_true, y_pred):
    return np.sqrt(mse(y_true, y_pred))

pd.DataFrame({
    "Predictions": ["Good", "Poor", "With one outlier"],
    "RMSE": [rmse(y_true, y_pred_good), rmse(y_true, y_pred_poor), rmse(y_true, y_pred_outlier)],
}).round(2)

	Predictions	RMSE
0	Good	2.46
1	Poor	16.01
2	With one outlier	16.49

Good predictions give an RMSE of 2.46; one outlier pushes it to 16.49 - inheriting the same sensitivity as MSE.

Mean Absolute Error (MAE)#

\[\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|\]

MAE measures the average absolute deviation between predictions and ground truth. Every error, large or small, is penalised in direct proportion to its magnitude.

Properties:

Robust to outliers - one extreme error does not dominate the average
Same units as \(y\)
Not differentiable at zero (requires subgradient methods to optimise)

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))


pd.DataFrame({
    "Predictions": ["Good", "Poor", "With one outlier"],
    "MAE": [mae(y_true, y_pred_good), mae(y_true, y_pred_poor), mae(y_true, y_pred_outlier)],
}).round(2)

	Predictions	MAE
0	Good	1.99
1	Poor	13.53
2	With one outlier	13.79

With one outlier, MAE rises to 13.79 - compare this with RMSE = 16.49 under the same conditions, illustrating MAE’s greater robustness.

Huber Loss#

The Huber loss blends MSE and MAE: it behaves like MSE for small errors (where differentiability is useful) and like MAE for large errors (where robustness matters). The threshold \(\delta\) controls the boundary.

\[\begin{split}\mathcal{L}_\delta(y, \hat{y}) = \begin{cases} \frac{1}{2}(y - \hat{y})^2 & \text{if } |y - \hat{y}| \leq \delta \\[6pt] \delta \left(|y - \hat{y}| - \frac{\delta}{2}\right) & \text{otherwise} \end{cases}\end{split}\]

Key idea:

Small residuals → MSE-like: smooth gradient, precise optimisation
Large residuals → MAE-like: bounded gradient, robust to outliers

def huber_loss(y_true, y_pred, delta=1.0):
    residuals = np.abs(y_true - y_pred)
    return np.where(
        residuals <= delta,
        0.5 * residuals**2,
        delta * (residuals - 0.5 * delta)
    ).mean()

../../../../_images/415d33d79156c4158c4ecbb105e4263f25e19d83e56cc0ed54a5cc54db05c00d.png

Evaluation Metrics#

Evaluation metrics are used after training to communicate model quality, often to stakeholders. They need to be interpretable, not necessarily optimisable.

R² - Coefficient of Determination#

\[R^2 = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}\]

\(R^2\) answers: “What fraction of the variance in \(y\) does the model explain?”

\(R^2\) value	Interpretation
1.0	Perfect predictions
0.8	Model explains 80% of variance
0.0	Model is no better than predicting the mean
< 0	Model is worse than predicting the mean

r2_good    = r2_score(y_true, y_pred_good)
r2_poor    = r2_score(y_true, y_pred_poor)
r2_outlier = r2_score(y_true, y_pred_outlier)

pd.DataFrame({
    "Predictions": ["Good", "Poor", "With one outlier"],
    "R²": [r2_good, r2_poor, r2_outlier],
}).round(3)

	Predictions	R²
0	Good	0.993
1	Poor	0.689
2	With one outlier	0.671

The good-prediction model explains 0.993 of the variance in \(y\) (\(R^2 = \) 0.993). Introducing a single outlier drops this to 0.671, showing how \(R^2\) can be pulled down by individual extreme errors.

RMSE vs MAE - When to Use Which#

	Dataset	RMSE	MAE	RMSE/MAE
0	Clean (no outliers)	5.05	4.05	1.25
1	With 5 outliers	15.69	7.17	2.19

On clean data the RMSE/MAE ratio is 1.25 (close to 1, indicating no large outliers); injecting five large outliers raises it to 2.19. Use this ratio as a quick diagnostic: if RMSE ≫ MAE, outliers are likely distorting your error estimate.

MAPE - Mean Absolute Percentage Error#

\[\text{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|\]

MAPE expresses error as a percentage of the true value, making it scale-independent and easy to communicate (“predictions are off by 8% on average”).

Caution: MAPE is undefined when \(y_i = 0\) and is asymmetric - it penalises under-predictions more than over-predictions.

mape_good = mean_absolute_percentage_error(y_true, y_pred_good) * 100
mape_poor = mean_absolute_percentage_error(y_true, y_pred_poor) * 100

pd.DataFrame({
    "Predictions": ["Good", "Poor"],
    "MAPE (%)": [mape_good, mape_poor],
}).round(1)

	Predictions	MAPE (%)
0	Good	5.0
1	Poor	31.8

Good predictions are off by 5.0% on average; poor predictions by 31.8%.

Choosing the Right Metric#

Metric	Formula	Use when…
MSE	\(\frac{1}{n}\Sigma(y-\hat{y})^2\)	Optimising models; large errors are especially bad
RMSE	\(\sqrt{\text{MSE}}\)	Reporting error in original units; standard choice
MAE	\(\frac{1}{n}\Sigma\|y-\hat{y}\|\)	Data has outliers; all errors equally important
Huber	Piecewise MSE/MAE	Training when outliers exist but smooth gradients needed
R²	\(1 - SS_{res}/SS_{tot}\)	Communicating how much variance the model explains
MAPE	\(\frac{100\%}{n}\Sigma\|e_i/y_i\|\)	Scale-free comparison; stakeholder reporting

Tip

A large gap between RMSE and MAE signals the presence of outliers. Investigate before choosing which metric to optimise.

Metrics and Loss Functions

Contents

6.2.1.1. Metrics and Loss Functions#

Loss Functions#

Mean Squared Error (MSE)#

Root Mean Squared Error (RMSE)#

Mean Absolute Error (MAE)#

Huber Loss#

Evaluation Metrics#

R² - Coefficient of Determination#

RMSE vs MAE - When to Use Which#

MAPE - Mean Absolute Percentage Error#

Choosing the Right Metric#