6.2.2.6. Support Vector Machines#

A Support Vector Machine (SVM) finds the maximum-margin hyperplane: the decision boundary that lies as far as possible from the nearest training example of each class. Those critical training examples are called support vectors, and they alone determine the boundary; every other training point could be removed without changing it.

The parallel with Support Vector Regression is direct: SVR places an \(\varepsilon\)-tube around the regression function; SVM places a margin around the classification boundary and maximises its width. Both use the kernel trick to extend the approach to non-linear problems without explicitly computing a high-dimensional feature transformation.

SVMs excel when:

  • The number of features is large relative to the number of samples

  • There is a clear margin of separation in the data

  • A non-linear kernel is appropriate for the problem geometry


The Math#

For binary hard-margin SVM (linearly separable data), the optimal hyperplane \(\mathbf{w}^\top\mathbf{x} + b = 0\) solves:

\[\min_{\mathbf{w}, b}\; \frac{1}{2}\|\mathbf{w}\|^2 \qquad \text{subject to}\quad y_i(\mathbf{w}^\top\mathbf{x}_i + b) \geq 1\;\;\forall i\]

The margin width is \(\frac{2}{\|\mathbf{w}\|}\), so minimising \(\|\mathbf{w}\|^2\) is equivalent to maximising the margin.
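The width claim follows from the point-to-hyperplane distance formula: a point \(\mathbf{x}_0\) lies at distance \(|\mathbf{w}^\top\mathbf{x}_0 + b| / \|\mathbf{w}\|\) from the boundary, and the two margin planes satisfy \(\mathbf{w}^\top\mathbf{x} + b = \pm 1\), so each sits at distance \(\frac{1}{\|\mathbf{w}\|}\) on either side:

\[\text{margin} = \frac{|{+1}|}{\|\mathbf{w}\|} + \frac{|{-1}|}{\|\mathbf{w}\|} = \frac{2}{\|\mathbf{w}\|}\]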

The soft-margin SVM (which tolerates some margin violations) introduces slack variables \(\xi_i \geq 0\):

\[\min_{\mathbf{w}, b, \xi}\; \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i}\xi_i \qquad \text{subject to}\quad y_i(\mathbf{w}^\top\mathbf{x}_i + b) \geq 1 - \xi_i,\;\; \xi_i \geq 0\;\;\forall i\]
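At the optimum each slack variable takes the value \(\xi_i = \max\bigl(0,\; 1 - y_i(\mathbf{w}^\top\mathbf{x}_i + b)\bigr)\), so the soft-margin problem can equivalently be written as unconstrained hinge-loss minimisation:

\[\min_{\mathbf{w}, b}\; \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i}\max\!\bigl(0,\; 1 - y_i(\mathbf{w}^\top\mathbf{x}_i + b)\bigr)\]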

The RBF kernel maps inputs into an infinite-dimensional space, enabling non-linear decision boundaries:

\[K(\mathbf{x}, \mathbf{x}') = \exp\!\left(-\gamma\|\mathbf{x} - \mathbf{x}'\|^2\right)\]
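As a sanity check, the kernel can be evaluated by hand and compared against scikit-learn's `rbf_kernel`; the points and \(\gamma\) below are arbitrary illustrative values:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Two arbitrary 2-D points and an illustrative gamma
x  = np.array([[0.0, 1.0]])
xp = np.array([[1.0, 3.0]])
gamma = 0.5

# Direct evaluation of K(x, x') = exp(-gamma * ||x - x'||^2)
manual = np.exp(-gamma * np.sum((x - xp) ** 2))

# scikit-learn's implementation of the same kernel
lib = rbf_kernel(x, xp, gamma=gamma)[0, 0]

print(manual, lib)  # both equal exp(-0.5 * 5) ≈ 0.0821
```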

Key hyperparameters:

| Hyperparameter | Role |
| --- | --- |
| `C` | Regularisation trade-off: high `C` → narrow margin, fewer errors on training data; low `C` → wider margin, more tolerant of misclassification |
| `kernel` | Feature transformation: `'linear'`, `'rbf'`, `'poly'` |
| `gamma` | RBF bandwidth: high `gamma` → complex boundary, risk of overfitting; low `gamma` → smoother boundary |


In scikit-learn#

from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

svm = Pipeline([
    ('scaler', StandardScaler()),
    ('svc',    SVC(kernel='rbf', C=10, gamma='scale',
                   probability=True, random_state=42))
])
svm.fit(X_train, y_train)
y_pred = svm.predict(X_test)
y_prob = svm.predict_proba(X_test)[:, 1]

Feature scaling is essential: SVM optimises a distance-based objective and is sensitive to feature magnitudes. Set `probability=True` to enable probability estimates (these use Platt scaling, which adds cross-validated fitting overhead).
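When calibrated probabilities are not needed, `decision_function` returns the signed distance to the boundary without the Platt-scaling overhead. A minimal sketch, using a synthetic dataset purely for illustration (not the chapter's shared data):

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Illustrative synthetic binary-classification data
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

clf = Pipeline([
    ('scaler', StandardScaler()),
    ('svc',    SVC(kernel='rbf', C=10, gamma='scale'))  # no probability=True
])
clf.fit(X, y)

# Signed margin distance: positive -> class 1, negative -> class 0
scores = clf.decision_function(X[:5])
preds  = clf.predict(X[:5])
print(scores, preds)
```

Ranking metrics such as AUC-ROC accept these raw scores directly, so `probability=True` is often unnecessary.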


Example#


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from myst_nb import glue
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score

np.random.seed(42)

# Shared dataset used throughout all Classification Algorithm pages
data = load_breast_cancer()
X, y = data.data, data.target   # 0 = malignant, 1 = benign
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc  = scaler.transform(X_test)

svc = SVC(kernel='rbf', C=10, gamma='scale', probability=True, random_state=42)
svc.fit(X_train_sc, y_train)

train_acc = accuracy_score(y_train, svc.predict(X_train_sc))
test_acc  = accuracy_score(y_test,  svc.predict(X_test_sc))
test_auc  = roc_auc_score(y_test,   svc.predict_proba(X_test_sc)[:, 1])

The RBF-kernel SVM achieves a test accuracy of 0.979 and an AUC-ROC of 0.996. Train accuracy (0.993) is close, indicating good generalisation. Only a small subset of the training points become support vectors; the rest have no influence on the decision boundary.
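The support-vector count can be verified directly from the fitted model's `n_support_` attribute (one entry per class). A self-contained sketch refitting the same configuration on the same split:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

X_train_sc = StandardScaler().fit_transform(X_train)

svc = SVC(kernel='rbf', C=10, gamma='scale', random_state=42)
svc.fit(X_train_sc, y_train)

# n_support_ counts support vectors per class; their sum is the total
n_sv = svc.n_support_.sum()
print(f"{n_sv} of {len(X_train)} training points are support vectors")
```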

Effect of C and gamma on the Decision Boundary#

configs = [
    {"C": 0.1,  "gamma": "scale", "label": "C=0.1 (strongly regularised)"},
    {"C": 1.0,  "gamma": "scale", "label": "C=1.0"},
    {"C": 10,   "gamma": "scale", "label": "C=10  (as above)"},
    {"C": 100,  "gamma": "scale", "label": "C=100 (overfits)"},
    {"C": 10,   "gamma": 0.001,   "label": "C=10, gamma=0.001 (smooth)"},
    {"C": 10,   "gamma": 1.0,     "label": "C=10, gamma=1.0  (jagged)"},
]

rows = []
for cfg in configs:
    m = SVC(kernel='rbf', C=cfg["C"], gamma=cfg["gamma"],
            probability=True, random_state=42)
    m.fit(X_train_sc, y_train)
    rows.append({
        "Config":          cfg["label"],
        "Train Accuracy":  round(accuracy_score(y_train, m.predict(X_train_sc)), 3),
        "Test Accuracy":   round(accuracy_score(y_test,  m.predict(X_test_sc)),  3),
        "Test AUC":        round(roc_auc_score(y_test,   m.predict_proba(X_test_sc)[:, 1]), 3),
    })

pd.DataFrame(rows)

| Config | Train Accuracy | Test Accuracy | Test AUC |
| --- | --- | --- | --- |
| C=0.1 (strongly regularised) | 0.962 | 0.937 | 0.989 |
| C=1.0 | 0.979 | 0.979 | 0.997 |
| C=10 (as above) | 0.993 | 0.979 | 0.996 |
| C=100 (overfits) | 1.000 | 0.944 | 0.990 |
| C=10, gamma=0.001 (smooth) | 0.979 | 0.979 | 0.996 |
| C=10, gamma=1.0 (jagged) | 1.000 | 0.636 | 0.959 |

High \(C\) or high \(\gamma\) creates overly complex boundaries that memorise the training set. The optimal region balances margin width and classification error.
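In practice the balance is found by cross-validated search over `C` and `gamma`. A sketch using `GridSearchCV`, with an illustrative (not exhaustive) grid:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

# Scaling inside the pipeline so each CV fold is scaled on its own training part
pipe = Pipeline([('scaler', StandardScaler()),
                 ('svc',    SVC(kernel='rbf'))])

# Illustrative grid; parameter names are prefixed by the pipeline step name
grid = GridSearchCV(pipe,
                    {'svc__C':     [0.1, 1, 10, 100],
                     'svc__gamma': ['scale', 0.001, 0.01, 0.1]},
                    cv=5, scoring='roc_auc')
grid.fit(X_train, y_train)
print(grid.best_params_, round(grid.best_score_, 3))
```

Putting the scaler inside the pipeline matters: scaling the full training set before cross-validation would leak fold statistics into the validation folds.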