6.3.3. Anomaly Detection

Most data points in a dataset are normal. A small fraction deviate significantly from the rest — these are anomalies (also called outliers). Finding them matters enormously in practice:

  • Fraudulent credit card transactions among millions of legitimate ones

  • Faulty sensors in an industrial monitoring system

  • Rare disease cases in a medical dataset

Unlike clustering or dimensionality reduction, anomaly detection does not group points — it separates the standard from the strange. Because anomalies are rare and often unlabelled, most practical methods are unsupervised.


6.3.3.1. The Core Idea

The unifying principle across almost all anomaly detection methods is:

Normal points live in dense, well-connected regions of the feature space. Anomalies are isolated, sparse, or geometrically extreme.

Methods differ in how they measure “isolation”:

| Method | How it measures isolation |
|---|---|
| Z-Score / IQR | Distance from the mean / median in standard-deviation units |
| Isolation Forest | Anomalies are easier to isolate with random cuts in the feature space |
| Local Outlier Factor (LOF) | A point’s density compared to its neighbours |


6.3.3.2. Isolation Forest

Isolation Forest is the most practical general-purpose algorithm for tabular data. The core insight is counterintuitive: anomalies are easier to isolate than normal points.

Build a random tree by repeatedly choosing a random feature and a random split value. Normal points, hidden deep in dense regions, require many splits to be isolated. Anomalies, being extreme or sparse, get isolated in just a few splits.
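To make this concrete, here is a toy sketch of a single isolation path: repeated random axis-aligned cuts until one point stands alone. The real algorithm adds subsampling and a depth cap, so treat this as illustration rather than an implementation:

```python
import numpy as np

def isolation_path_length(x, X, rng, depth=0, max_depth=50):
    """Count random axis-aligned splits until x is the only point left."""
    if len(X) <= 1 or depth >= max_depth:
        return depth
    f = rng.integers(X.shape[1])                  # pick a random feature
    lo, hi = X[:, f].min(), X[:, f].max()
    if lo == hi:
        return depth
    split = rng.uniform(lo, hi)                   # pick a random split value
    keep = X[:, f] < split if x[f] < split else X[:, f] >= split
    return isolation_path_length(x, X[keep], rng, depth + 1)

data_rng = np.random.default_rng(0)
X = np.vstack([data_rng.normal(0, 1, size=(256, 2)),   # dense normal cloud
               [[6.0, 6.0]]])                          # one extreme point

def mean_path(x, trials=200):
    return np.mean([isolation_path_length(x, X, np.random.default_rng(t))
                    for t in range(trials)])

print(mean_path(X[-1]), mean_path(X[0]))   # the extreme point isolates in fewer splits
```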

The anomaly score is computed from the average path length across many trees:

\[s(\mathbf{x}, n) = 2^{-\frac{\mathbb{E}[h(\mathbf{x})]}{c(n)}}\]

where \(h(\mathbf{x})\) is the path length needed to isolate \(\mathbf{x}\) in a single tree, \(\mathbb{E}[h(\mathbf{x})]\) is its average over all trees, and \(c(n)\) is the expected path length for a random point in a dataset of size \(n\). A score near 1 indicates a likely anomaly; a score near 0.5 indicates a typical point.
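The normalising constant has a standard closed form, \(c(n) = 2H(n-1) - 2(n-1)/n\) with \(H(i) \approx \ln i + \gamma\) (Euler–Mascheroni constant). A small sketch shows how the score behaves at the two extremes:

```python
import numpy as np

def c(n):
    """Expected path length of an unsuccessful search in a tree of n points."""
    if n <= 1:
        return 0.0
    harmonic = np.log(n - 1) + np.euler_gamma    # H(n-1) ≈ ln(n-1) + γ
    return 2 * harmonic - 2 * (n - 1) / n

def anomaly_score(avg_path, n):
    """s(x, n) = 2^(-E[h(x)] / c(n))."""
    return 2 ** (-avg_path / c(n))

n = 256
print(anomaly_score(3.0, n))    # short path: score well above 0.5 (anomalous)
print(anomaly_score(c(n), n))   # average path: score exactly 0.5 (typical)
```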

```python
from sklearn.ensemble import IsolationForest

iso = IsolationForest(contamination=0.05, random_state=42)
iso.fit(X_train)
scores = iso.decision_function(X_test)   # higher → more normal
labels = iso.predict(X_test)             # +1 = normal, -1 = anomaly
```

6.3.3.3. Local Outlier Factor

LOF compares the local density of a point to the densities of its \(k\) nearest neighbours. If a point is much less dense than its neighbours, it is likely an outlier.

\[\text{LOF}_k(\mathbf{x}) = \frac{\text{mean local density of neighbours}}{\text{local density of } \mathbf{x}}\]

LOF \(\approx 1\) → normal. LOF \(\gg 1\) → anomaly.

LOF is particularly good at detecting contextual anomalies — points that are technically within the range of the data as a whole but are anomalous relative to their local neighbourhood.

```python
from sklearn.neighbors import LocalOutlierFactor

lof    = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
labels = lof.fit_predict(X)              # +1 = normal, -1 = anomaly
scores = -lof.negative_outlier_factor_   # higher → more anomalous
```
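To illustrate the contextual case, the sketch below places a point between a tight cluster and a loose one. Globally it sits well inside the data's range, but relative to its dense neighbourhood it is sparse, so its LOF score lands far above 1. The cluster positions and sizes here are arbitrary:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
dense  = rng.normal(0, 0.1, size=(100, 2))    # tight cluster around (0, 0)
sparse = rng.normal(5, 1.0, size=(100, 2))    # loose cluster around (5, 5)
odd    = np.array([[1.0, 1.0]])               # inside the global range,
                                              # but isolated for its neighbourhood
X = np.vstack([dense, sparse, odd])

lof = LocalOutlierFactor(n_neighbors=20)
lof.fit_predict(X)
scores = -lof.negative_outlier_factor_        # higher -> more anomalous

print(scores[:100].mean(), scores[-1])        # cluster points near 1, odd point far above
```

A global method such as the Z-score would not flag this point, since it lies between the two clusters rather than beyond them.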

6.3.3.4. Example


```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Patch
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import f1_score, precision_score

np.random.seed(42)

# Normal data + injected anomalies
X_normal, _ = make_blobs(n_samples=300, centers=2,
                         cluster_std=0.6, random_state=42)
X_outliers  = np.random.uniform(-6, 6, size=(20, 2))
X           = np.vstack([X_normal, X_outliers])
y_true      = np.array([1]*300 + [-1]*20)

scaler = StandardScaler()
X_sc   = scaler.fit_transform(X)

iso = IsolationForest(contamination=0.06, random_state=42)
iso_labels = iso.fit_predict(X_sc)

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.06)
lof_labels = lof.fit_predict(X_sc)

fig, axes = plt.subplots(1, 3, figsize=(18, 5))
datasets  = [
    (y_true,     'True Labels'),
    (iso_labels, 'Isolation Forest'),
    (lof_labels, 'Local Outlier Factor'),
]

for ax, (labels, title) in zip(axes, datasets):
    colors = np.where(labels == -1, 'tomato', 'steelblue')
    ax.scatter(X_sc[:, 0], X_sc[:, 1], c=colors,
               edgecolors='k', linewidths=0.4, s=40, alpha=0.8)
    n_detected = (labels == -1).sum()
    ax.set_title(f'{title}\n(detected anomalies = {n_detected})',
                 fontsize=12, fontweight='bold')
    ax.grid(True, alpha=0.3)
    ax.legend(handles=[Patch(color='steelblue', label='Normal'),
                       Patch(color='tomato', label='Anomaly')],
              fontsize=9)

plt.suptitle('Anomaly Detection Methods Comparison', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

# Anomaly-class metrics (pos_label=-1 treats anomalies as the positive class)
for name, labels in [('Isolation Forest', iso_labels), ('LOF', lof_labels)]:
    f1   = f1_score(y_true, labels, pos_label=-1)
    prec = precision_score(y_true, labels, pos_label=-1)
    print(f"{name:20s}  Anomaly F1 = {f1:.2f}  |  Precision = {prec:.2f}")
```
Isolation Forest      Anomaly F1 = 0.85  |  Precision = 0.85
LOF                   Anomaly F1 = 0.90  |  Precision = 0.90

Both methods recover most of the injected anomalies. The remaining misclassified points are either anomalies that landed inside a dense cluster by chance, or normal points near the edges of their clusters that look isolated.


6.3.3.5. Choosing a Method

| Situation | Recommendation |
|---|---|
| General tabular data | Isolation Forest — fast, scalable, robust |
| Anomalies relative to local neighbourhood | LOF |
| Univariate feature, Gaussian distribution | Z-Score |
| Explicit covariance structure known | Elliptic Envelope |


Tip

The contamination parameter controls the fraction of data the model labels as anomalies. Setting it too low means anomalies go undetected; too high means many normal points get flagged. Calibrate it using domain knowledge about the expected anomaly rate.
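To see this directly, the sketch below sweeps contamination on purely normal synthetic data and counts the flagged points (the data and values are illustrative):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 2))        # purely "normal" data, no real anomalies

counts = {}
for contamination in (0.01, 0.05, 0.10):
    iso = IsolationForest(contamination=contamination, random_state=0).fit(X)
    counts[contamination] = int((iso.predict(X) == -1).sum())
    print(f"contamination={contamination:.2f} -> flagged {counts[contamination]} of 1000")
```

Roughly contamination × n points are flagged even though nothing here is anomalous: the parameter sets a threshold on the score distribution, it does not discover the anomaly rate for you.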