6.2.2.6. Support Vector Machines#
A Support Vector Machine (SVM) finds the maximum-margin hyperplane - the decision boundary that is as far as possible from the nearest training example of each class. Those critical training examples are called support vectors, and they alone determine the boundary; every other training point could be removed without changing it.
The parallel with Support Vector Regression is direct: SVR places an \(\varepsilon\)-tube around the regression function; SVM places a margin around the classification boundary and maximises its width. Both use the kernel trick to extend the approach to non-linear problems without explicitly computing a high-dimensional feature transformation.
SVMs excel when:
The number of features is large relative to the number of samples
There is a clear margin of separation in the data
A non-linear kernel is appropriate for the problem geometry
The Math#
For binary hard-margin SVM (linearly separable data), the optimal hyperplane \(\mathbf{w}^\top\mathbf{x} + b = 0\) solves:

$$
\min_{\mathbf{w},\,b}\ \frac{1}{2}\|\mathbf{w}\|^2
\quad\text{subject to}\quad
y_i(\mathbf{w}^\top\mathbf{x}_i + b) \geq 1 \quad \forall i
$$
The margin width is \(\frac{2}{\|\mathbf{w}\|}\), so minimising \(\|\mathbf{w}\|^2\) is equivalent to maximising the margin.
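The margin-width identity can be checked numerically. The sketch below fits a linear SVM with a very large `C` (approximating the hard-margin case) on a small hypothetical set of separable points, then computes \(2/\|\mathbf{w}\|\) from the fitted coefficients and confirms it matches the geometric gap between the two classes:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical linearly separable toy data: the closest opposing
# points are (1, 1) and (3, 3), a gap of 2*sqrt(2) along the diagonal
X = np.array([[0, 0], [1, 1], [0, 1], [3, 3], [4, 4], [4, 3]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates the hard-margin SVM
clf = SVC(kernel='linear', C=1e6).fit(X, y)

w = clf.coef_[0]
margin = 2 / np.linalg.norm(w)   # margin width = 2 / ||w||
print(margin)                    # ≈ 2.828, i.e. 2*sqrt(2)
```

The support vectors `(1, 1)` and `(3, 3)` sit exactly on the margin, where the decision function evaluates to \(\pm 1\).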
Soft-margin SVM (allows some misclassification) introduces slack variables \(\xi_i \geq 0\):

$$
\min_{\mathbf{w},\,b,\,\boldsymbol{\xi}}\ \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{n}\xi_i
\quad\text{subject to}\quad
y_i(\mathbf{w}^\top\mathbf{x}_i + b) \geq 1 - \xi_i,\quad \xi_i \geq 0
$$
The RBF kernel maps inputs into an infinite-dimensional space, enabling non-linear decision boundaries:

$$
K(\mathbf{x}, \mathbf{z}) = \exp\!\left(-\gamma\,\|\mathbf{x} - \mathbf{z}\|^2\right)
$$
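As a quick sanity check, the RBF kernel value \(\exp(-\gamma\|\mathbf{x}-\mathbf{z}\|^2)\) can be computed by hand and compared against scikit-learn's `rbf_kernel` helper (the specific points and \(\gamma\) below are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

x = np.array([[1.0, 2.0]])
z = np.array([[2.0, 0.0]])
gamma = 0.5

# K(x, z) = exp(-gamma * ||x - z||^2), here exp(-0.5 * 5)
manual = np.exp(-gamma * np.sum((x - z) ** 2))
library = rbf_kernel(x, z, gamma=gamma)[0, 0]
print(np.isclose(manual, library))  # → True
```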
Key hyperparameters:
| Hyperparameter | Role |
|---|---|
| `C` | Regularisation: high C → narrow margin, fewer training errors; low C → wider margin, more tolerant |
| `kernel` | Feature transformation: `'linear'`, `'poly'`, `'rbf'`, `'sigmoid'` |
| `gamma` | RBF bandwidth: high → complex boundary, risk of overfitting |
In scikit-learn#
```python
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

svm = Pipeline([
    ('scaler', StandardScaler()),
    ('svc', SVC(kernel='rbf', C=10, gamma='scale',
                probability=True, random_state=42))
])
svm.fit(X_train, y_train)
y_pred = svm.predict(X_test)
y_prob = svm.predict_proba(X_test)[:, 1]
```
Feature scaling is essential - SVM optimises a distance-based objective and is sensitive to feature magnitudes. Set `probability=True` to enable probability estimates (uses Platt scaling, which adds overhead).
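When only a ranking score is needed (e.g. for AUC-ROC), the Platt-scaling overhead can be avoided entirely: `decision_function` returns the signed distance to the hyperplane, which works directly as a score. A minimal self-contained sketch on synthetic data (the dataset here is a hypothetical stand-in):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.svm import SVC

# Hypothetical synthetic data for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# No probability=True needed - we never call predict_proba
clf = SVC(kernel='rbf', gamma='scale').fit(X, y)

# Signed distance to the boundary serves as a ranking score for AUC
scores = clf.decision_function(X)
print(round(roc_auc_score(y, scores), 3))
```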
Example#
```python
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, roc_auc_score

svc = SVC(kernel='rbf', C=10, gamma='scale', probability=True, random_state=42)
svc.fit(X_train_sc, y_train)

train_acc = accuracy_score(y_train, svc.predict(X_train_sc))
test_acc = accuracy_score(y_test, svc.predict(X_test_sc))
test_auc = roc_auc_score(y_test, svc.predict_proba(X_test_sc)[:, 1])
```
The RBF-kernel SVM achieves a test accuracy of 0.979 and AUC-ROC of 0.996. The train accuracy of 0.993 is close to the test accuracy, confirming good generalisation. Only a small subset of training points become support vectors - the rest do not affect the decision boundary.
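The sparsity claim is easy to verify: a fitted `SVC` exposes the per-class support-vector counts via `n_support_`. The sketch below uses a synthetic stand-in for the dataset above (hypothetical sizes and parameters):

```python
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical synthetic data standing in for the dataset above
X, y = make_classification(n_samples=400, n_features=10, random_state=42)
X_sc = StandardScaler().fit_transform(X)

svc = SVC(kernel='rbf', C=10, gamma='scale').fit(X_sc, y)

# n_support_ holds the number of support vectors per class;
# typically far fewer than the full training set
print(svc.n_support_, "->", svc.n_support_.sum(), "of", len(X_sc))
```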
Effect of C and gamma on the Decision Boundary#
```python
import pandas as pd
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.svm import SVC

configs = [
    {"C": 0.1, "gamma": "scale", "label": "C=0.1 (under-regularised)"},
    {"C": 1.0, "gamma": "scale", "label": "C=1.0"},
    {"C": 10, "gamma": "scale", "label": "C=10 (default)"},
    {"C": 100, "gamma": "scale", "label": "C=100 (over-fitted)"},
    {"C": 10, "gamma": 0.001, "label": "C=10, gamma=0.001 (smooth)"},
    {"C": 10, "gamma": 1.0, "label": "C=10, gamma=1.0 (jagged)"},
]

rows = []
for cfg in configs:
    m = SVC(kernel='rbf', C=cfg["C"], gamma=cfg["gamma"],
            probability=True, random_state=42)
    m.fit(X_train_sc, y_train)
    rows.append({
        "Config": cfg["label"],
        "Train Accuracy": round(accuracy_score(y_train, m.predict(X_train_sc)), 3),
        "Test Accuracy": round(accuracy_score(y_test, m.predict(X_test_sc)), 3),
        "Test AUC": round(roc_auc_score(y_test, m.predict_proba(X_test_sc)[:, 1]), 3),
    })
pd.DataFrame(rows)
```
|   | Config | Train Accuracy | Test Accuracy | Test AUC |
|---|---|---|---|---|
| 0 | C=0.1 (under-regularised) | 0.962 | 0.937 | 0.989 |
| 1 | C=1.0 | 0.979 | 0.979 | 0.997 |
| 2 | C=10 (default) | 0.993 | 0.979 | 0.996 |
| 3 | C=100 (over-fitted) | 1.000 | 0.944 | 0.990 |
| 4 | C=10, gamma=0.001 (smooth) | 0.979 | 0.979 | 0.996 |
| 5 | C=10, gamma=1.0 (jagged) | 1.000 | 0.636 | 0.959 |
High \(C\) or high \(\gamma\) creates overly complex boundaries that memorise the training set. The optimal region balances margin width and classification error.
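In practice that balance is usually found by searching over `C` and `gamma` jointly with cross-validation rather than inspecting configs one at a time. A minimal sketch with `GridSearchCV`, on a hypothetical synthetic dataset (the grid values below are illustrative, not prescriptive):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical synthetic data for illustration
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

# Scaling inside the pipeline keeps the CV folds leakage-free
pipe = Pipeline([('scaler', StandardScaler()), ('svc', SVC(kernel='rbf'))])
grid = GridSearchCV(
    pipe,
    param_grid={'svc__C': [0.1, 1, 10, 100],
                'svc__gamma': ['scale', 0.001, 0.01, 0.1, 1.0]},
    cv=5, scoring='roc_auc',
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```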