6.2.2.1. Metrics and Loss Functions#

Before training any classifier, we need a way to measure how good its predictions are. Unlike regression, where a real-valued error is natural, classification outputs discrete labels - so we need different tools.

The distinction between loss functions and evaluation metrics applies here too:

  • Loss functions drive optimisation during training (e.g. binary cross-entropy is differentiable and guides gradient descent).

  • Evaluation metrics communicate model quality after training in a task-meaningful way (e.g. precision and recall for imbalanced classes).


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
from myst_nb import glue
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, confusion_matrix, roc_curve, precision_recall_curve,
    ConfusionMatrixDisplay, log_loss
)

np.random.seed(42)

# Shared dataset used across all Classification pages
data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc  = scaler.transform(X_test)

# Fit a simple model to get predictions for demonstration
clf = LogisticRegression(max_iter=1000, random_state=42)
clf.fit(X_train_sc, y_train)
y_pred      = clf.predict(X_test_sc)
y_prob      = clf.predict_proba(X_test_sc)[:, 1]

Loss Functions#

Binary Cross-Entropy (Log Loss)#

The standard loss for binary classification is binary cross-entropy, also called log loss. It measures the average negative log-probability assigned to the correct class:

\[\mathcal{L} = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i \log \hat{p}_i + (1 - y_i)\log(1 - \hat{p}_i)\right]\]

where \(\hat{p}_i = P(y_i = 1 \mid \mathbf{x}_i)\) is the model's predicted probability that example \(i\) belongs to the positive class.

  • A perfect model assigning probability 1 to the correct class every time yields \(\mathcal{L} = 0\).

  • A model that is confidently wrong is penalised very heavily because \(-\log(\varepsilon) \to \infty\) as \(\varepsilon \to 0\).

logloss = log_loss(y_test, y_prob)

The model achieves a log loss of 0.0679 on the held-out test set.
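To make the formula concrete, here is a minimal sketch with toy labels and probabilities (illustrative numbers, not from the dataset above) that implements the sum by hand and checks it against scikit-learn's `log_loss`:

```python
import numpy as np
from sklearn.metrics import log_loss

# Toy labels and predicted positive-class probabilities
y_true = np.array([1, 0, 1, 1, 0])
p_hat  = np.array([0.9, 0.2, 0.7, 0.6, 0.1])

# Direct implementation of the binary cross-entropy formula above
bce = -np.mean(y_true * np.log(p_hat) + (1 - y_true) * np.log(1 - p_hat))

print(f"manual BCE:       {bce:.6f}")
print(f"sklearn log_loss: {log_loss(y_true, p_hat):.6f}")  # same value
```

Each example contributes exactly one \(-\log\) term: the probability the model assigned to whichever class was correct.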

Categorical Cross-Entropy#

For \(K\)-class problems the loss generalises to:

\[\mathcal{L} = -\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K} \mathbf{1}[y_i = k]\,\log \hat{p}_{ik}\]
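The indicator \(\mathbf{1}[y_i = k]\) selects a single term per example, so the multiclass loss reduces to averaging \(-\log \hat{p}_{i, y_i}\). A minimal NumPy sketch with toy numbers (not from the dataset above):

```python
import numpy as np

# Toy 3-class problem: integer labels and a row-stochastic probability matrix
y = np.array([0, 2, 1])
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.3, 0.6],
              [0.2, 0.5, 0.3]])

# The one-hot indicator picks out log p_{i, y_i} for each row
cce = -np.mean(np.log(P[np.arange(len(y)), y]))
print(f"categorical cross-entropy: {cce:.4f}")
```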

Evaluation Metrics#

The Confusion Matrix#

Everything starts here. For binary classification, the confusion matrix counts the four possible outcomes:

|                     | Predicted Positive | Predicted Negative |
|---------------------|--------------------|--------------------|
| **Actual Positive** | TP (True Positive) | FN (False Negative) |
| **Actual Negative** | FP (False Positive) | TN (True Negative) |
cm = confusion_matrix(y_test, y_pred)

fig, ax = plt.subplots(figsize=(5, 4))
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                               display_labels=data.target_names)
disp.plot(ax=ax, colorbar=False, cmap='Blues')
ax.set_title("Confusion Matrix - Breast Cancer Test Set",
             fontsize=12, fontweight='bold', pad=10)
plt.tight_layout()
plt.show()

Accuracy#

\[\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}\]

The most intuitive metric, but misleading on imbalanced datasets. A model that always predicts the majority class achieves high accuracy without learning anything useful.

acc = accuracy_score(y_test, y_pred)
print(f"Accuracy: {acc:.3f}")
Accuracy: 0.986
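The imbalance pitfall is easy to demonstrate with synthetic labels: a classifier that always predicts the majority class scores 99% accuracy while catching zero positives.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical imbalanced data: 99 negatives, 1 positive
y_true = np.array([0] * 99 + [1])
y_pred = np.zeros(100, dtype=int)   # always predict the majority class

print(accuracy_score(y_true, y_pred))                 # 0.99 - looks great
print(recall_score(y_true, y_pred, zero_division=0))  # 0.0 - misses every positive
```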

Precision and Recall#

These two metrics capture the trade-off between false positives and false negatives - a balance that is crucial in most real-world problems.

\[\text{Precision} = \frac{TP}{TP + FP} \qquad \text{Recall} = \frac{TP}{TP + FN}\]
  • Precision: Of all positive predictions, how many were correct? Relevant when false positives are costly (e.g. flagging legitimate emails as spam).

  • Recall (Sensitivity): Of all actual positives, how many did we catch? Relevant when false negatives are costly (e.g. missing a cancer diagnosis).

prec   = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
print(f"Precision: {prec:.3f}")
print(f"Recall:    {recall:.3f}")
Precision: 0.989
Recall:    0.989

F1 Score#

The F1 score is the harmonic mean of precision and recall. It gives a single number that balances both concerns, penalising extreme imbalances between them:

\[F_1 = 2 \cdot \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}\]
f1 = f1_score(y_test, y_pred)
print(f"F1 Score: {f1:.3f}")
F1 Score: 0.989
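The choice of harmonic rather than arithmetic mean is what makes F1 punish lopsided precision/recall pairs. A quick illustration with made-up values:

```python
# A model with high precision but terrible recall
prec, rec = 0.9, 0.1

f1 = 2 * prec * rec / (prec + rec)

print(f"arithmetic mean: {(prec + rec) / 2:.2f}")  # 0.50 - hides the problem
print(f"F1 (harmonic):   {f1:.2f}")                # 0.18 - exposes it
```

The harmonic mean is dominated by the smaller of the two values, so both must be high for F1 to be high.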

ROC Curve and AUC#

The ROC (Receiver Operating Characteristic) curve sweeps the decision threshold from 0 to 1 and plots the True Positive Rate (Recall) against the False Positive Rate (1 − Specificity). A perfect classifier hugs the top-left corner; a random classifier lies on the diagonal.

The area under the ROC curve (AUC-ROC) summarises this into a single number in \([0, 1]\):

  • AUC = 1.0: perfect separation

  • AUC = 0.5: no better than random chance

\[\text{FPR} = \frac{FP}{FP + TN} \qquad \text{TPR} = \frac{TP}{TP + FN}\]
auc = roc_auc_score(y_test, y_prob)
fpr, tpr, _ = roc_curve(y_test, y_prob)

fig, axes = plt.subplots(1, 2, figsize=(13, 5))

# ROC Curve
axes[0].plot(fpr, tpr, color='steelblue', lw=2,
             label=f'ROC curve (AUC = {auc:.3f})')
axes[0].plot([0, 1], [0, 1], 'k--', lw=1, label='Random (AUC = 0.5)')
axes[0].set_xlabel('False Positive Rate')
axes[0].set_ylabel('True Positive Rate (Recall)')
axes[0].set_title('ROC Curve', fontweight='bold')
axes[0].legend(loc='lower right')
axes[0].grid(True, alpha=0.3)

# Precision-Recall Curve
prec_arr, rec_arr, _ = precision_recall_curve(y_test, y_prob)
axes[1].plot(rec_arr, prec_arr, color='darkorange', lw=2)
axes[1].set_xlabel('Recall')
axes[1].set_ylabel('Precision')
axes[1].set_title('Precision-Recall Curve', fontweight='bold')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"AUC-ROC: {auc:.3f}")
AUC-ROC: 0.998
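A useful way to interpret AUC: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one (ties counting one half). A sketch with synthetic scores checking this equivalence against `roc_auc_score`:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
# Synthetic scores: positives tend to score higher than negatives
pos = rng.normal(1.0, 1.0, 200)
neg = rng.normal(0.0, 1.0, 200)
y = np.r_[np.ones(200), np.zeros(200)]
s = np.r_[pos, neg]

# AUC = P(score of random positive > score of random negative), ties count 1/2
pairwise = ((pos[:, None] > neg[None, :]).mean()
            + 0.5 * (pos[:, None] == neg[None, :]).mean())

print(np.isclose(pairwise, roc_auc_score(y, s)))  # True
```

This rank-based view also explains why AUC is invariant to any monotonic rescaling of the scores.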

Key takeaway: No single metric tells the whole story. Use accuracy for rough benchmarking, F1 when classes are imbalanced, AUC-ROC to compare models across thresholds, and always inspect the confusion matrix to understand the failure modes.