6.3.3. Anomaly Detection

Most data points in a dataset are normal. A small fraction deviate significantly from the rest — these are anomalies (also called outliers). Finding them matters enormously in practice:

  • Fraudulent credit card transactions among millions of legitimate ones

  • Faulty sensors in an industrial monitoring system

  • Rare disease cases in a medical dataset

Unlike clustering or dimensionality reduction, anomaly detection does not group points — it separates the standard from the strange. Because anomalies are rare and often unlabelled, most practical methods are unsupervised.


6.3.3.1. The Core Idea

The unifying principle across almost all anomaly detection methods is:

Normal points live in dense, well-connected regions of the feature space. Anomalies are isolated, sparse, or geometrically extreme.

Methods differ in how they measure “isolation”:

| Method | How it measures isolation |
|---|---|
| Z-Score / IQR | Distance from the mean / median in standard-deviation units |
| Isolation Forest | Anomalies are easier to isolate with random cuts in the feature space |
| Local Outlier Factor (LOF) | A point’s density compared to its neighbours |


6.3.3.2. Isolation Forest

Isolation Forest is the most practical general-purpose algorithm for tabular data. The core insight is counterintuitive: anomalies are easier to isolate than normal points.

Build a random tree by repeatedly choosing a random feature and a random split value. Normal points, hidden deep in dense regions, require many splits to be isolated. Anomalies, being extreme or sparse, get isolated in just a few splits.
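To make this concrete, here is a toy sketch of a single isolation path: repeated random axis-aligned cuts until one point stands alone. The real algorithm adds subsampling and a depth cap, so treat this as illustration rather than an implementation:

```python
import numpy as np

def isolation_path_length(x, X, rng, depth=0, max_depth=50):
    """Count random axis-aligned splits until x is the only point left."""
    if len(X) <= 1 or depth >= max_depth:
        return depth
    f = rng.integers(X.shape[1])                  # pick a random feature
    lo, hi = X[:, f].min(), X[:, f].max()
    if lo == hi:
        return depth
    split = rng.uniform(lo, hi)                   # pick a random split value
    keep = X[:, f] < split if x[f] < split else X[:, f] >= split
    return isolation_path_length(x, X[keep], rng, depth + 1)

data_rng = np.random.default_rng(0)
X = np.vstack([data_rng.normal(0, 1, size=(256, 2)),   # dense normal cloud
               [[6.0, 6.0]]])                          # one extreme point

def mean_path(x, trials=200):
    return np.mean([isolation_path_length(x, X, np.random.default_rng(t))
                    for t in range(trials)])

print(mean_path(X[-1]), mean_path(X[0]))   # the extreme point isolates in fewer splits
```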

The anomaly score is computed from the average path length across many trees:

\[s(\mathbf{x}, n) = 2^{-\frac{\mathbb{E}[h(\mathbf{x})]}{c(n)}}\]

where \(h(\mathbf{x})\) is the path length needed to isolate \(\mathbf{x}\) in a single tree, \(\mathbb{E}[h(\mathbf{x})]\) is its average over all trees, and \(c(n)\) is the expected path length for a random point in a dataset of size \(n\). A score near 1 indicates a likely anomaly; a score near 0.5 indicates a typical point.
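The normalising constant has a standard closed form, \(c(n) = 2H(n-1) - 2(n-1)/n\) with \(H(i) \approx \ln i + \gamma\) (Euler–Mascheroni constant). A small sketch shows how the score behaves at the two extremes:

```python
import numpy as np

def c(n):
    """Expected path length of an unsuccessful search in a tree of n points."""
    if n <= 1:
        return 0.0
    harmonic = np.log(n - 1) + np.euler_gamma    # H(n-1) ≈ ln(n-1) + γ
    return 2 * harmonic - 2 * (n - 1) / n

def anomaly_score(avg_path, n):
    """s(x, n) = 2^(-E[h(x)] / c(n))."""
    return 2 ** (-avg_path / c(n))

n = 256
print(anomaly_score(3.0, n))    # short path: score well above 0.5 (anomalous)
print(anomaly_score(c(n), n))   # average path: score exactly 0.5 (typical)
```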

```python
from sklearn.ensemble import IsolationForest

iso = IsolationForest(contamination=0.05, random_state=42)
iso.fit(X_train)
scores = iso.decision_function(X_test)   # higher → more normal
labels = iso.predict(X_test)             # +1 = normal, -1 = anomaly
```

6.3.3.3. Local Outlier Factor

LOF compares the local density of a point to the densities of its \(k\) nearest neighbours. If a point is much less dense than its neighbours, it is likely an outlier.

\[\text{LOF}_k(\mathbf{x}) = \frac{\text{mean local density of neighbours}}{\text{local density of } \mathbf{x}}\]

LOF \(\approx 1\) → normal. LOF \(\gg 1\) → anomaly.

LOF is particularly good at detecting contextual anomalies — points that are technically within the range of the data as a whole but are anomalous relative to their local neighbourhood.

```python
from sklearn.neighbors import LocalOutlierFactor

lof    = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
labels = lof.fit_predict(X)              # +1 = normal, -1 = anomaly
scores = -lof.negative_outlier_factor_   # higher → more anomalous
```
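To illustrate the contextual case, the sketch below places a point between a tight cluster and a loose one. Globally it sits well inside the data's range, but relative to its dense neighbourhood it is sparse, so its LOF score lands far above 1. The cluster positions and sizes here are arbitrary:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
dense  = rng.normal(0, 0.1, size=(100, 2))    # tight cluster around (0, 0)
sparse = rng.normal(5, 1.0, size=(100, 2))    # loose cluster around (5, 5)
odd    = np.array([[1.0, 1.0]])               # inside the global range,
                                              # but isolated for its neighbourhood
X = np.vstack([dense, sparse, odd])

lof = LocalOutlierFactor(n_neighbors=20)
lof.fit_predict(X)
scores = -lof.negative_outlier_factor_        # higher -> more anomalous

print(scores[:100].mean(), scores[-1])        # cluster points near 1, odd point far above
```

A global method such as the Z-score would not flag this point, since it lies between the two clusters rather than beyond them.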

6.3.3.4. Example


```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Patch
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import f1_score, precision_score

np.random.seed(42)

# Normal data + injected anomalies
X_normal, _ = make_blobs(n_samples=300, centers=2,
                         cluster_std=0.6, random_state=42)
X_outliers  = np.random.uniform(-6, 6, size=(20, 2))
X           = np.vstack([X_normal, X_outliers])
y_true      = np.array([1]*300 + [-1]*20)

scaler = StandardScaler()
X_sc   = scaler.fit_transform(X)

iso = IsolationForest(contamination=0.06, random_state=42)
iso_labels = iso.fit_predict(X_sc)

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.06)
lof_labels = lof.fit_predict(X_sc)

fig, axes = plt.subplots(1, 3, figsize=(18, 5))
datasets  = [
    (y_true,     'True Labels'),
    (iso_labels, 'Isolation Forest'),
    (lof_labels, 'Local Outlier Factor'),
]

for ax, (labels, title) in zip(axes, datasets):
    colors = np.where(labels == -1, 'tomato', 'steelblue')
    ax.scatter(X_sc[:, 0], X_sc[:, 1], c=colors,
               edgecolors='k', linewidths=0.4, s=40, alpha=0.8)
    n_detected = (labels == -1).sum()
    ax.set_title(f'{title}\n(detected anomalies = {n_detected})',
                 fontsize=12, fontweight='bold')
    ax.grid(True, alpha=0.3)
    ax.legend(handles=[Patch(color='steelblue', label='Normal'),
                       Patch(color='tomato', label='Anomaly')],
              fontsize=9)

plt.suptitle('Anomaly Detection Methods Comparison', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

# Anomaly-class metrics (pos_label=-1 treats anomalies as the positive class)
for name, labels in [('Isolation Forest', iso_labels), ('LOF', lof_labels)]:
    f1   = f1_score(y_true, labels, pos_label=-1)
    prec = precision_score(y_true, labels, pos_label=-1)
    print(f"{name:20s}  Anomaly F1 = {f1:.2f}  |  Precision = {prec:.2f}")
```
Isolation Forest      Anomaly F1 = 0.85  |  Precision = 0.85
LOF                   Anomaly F1 = 0.90  |  Precision = 0.90

Both methods recover most of the injected anomalies. The remaining misclassified points are either anomalies that landed inside a dense cluster by chance, or normal points near the edges of their clusters that look isolated.


6.3.3.5. Choosing a Method

| Situation | Recommendation |
|---|---|
| General tabular data | Isolation Forest — fast, scalable, robust |
| Anomalies relative to local neighbourhood | LOF |
| Univariate feature, Gaussian distribution | Z-Score |
| Explicit covariance structure known | Elliptic Envelope |


Tip

The contamination parameter controls the fraction of data the model labels as anomalies. Setting it too low means anomalies go undetected; too high means many normal points get flagged. Calibrate it using domain knowledge about the expected anomaly rate.
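To see this directly, the sketch below sweeps contamination on purely normal synthetic data and counts the flagged points (the data and values are illustrative):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 2))        # purely "normal" data, no real anomalies

counts = {}
for contamination in (0.01, 0.05, 0.10):
    iso = IsolationForest(contamination=contamination, random_state=0).fit(X)
    counts[contamination] = int((iso.predict(X) == -1).sum())
    print(f"contamination={contamination:.2f} -> flagged {counts[contamination]} of 1000")
```

Roughly contamination × n points are flagged even though nothing here is anomalous: the parameter sets a threshold on the score distribution, it does not discover the anomaly rate for you.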