6.2.2.6. Support Vector Machines#
A Support Vector Machine (SVM) finds the maximum-margin hyperplane - the decision boundary that is as far as possible from the nearest training example of each class. Those critical training examples are called support vectors, and they alone determine the boundary; every other training point could be removed without changing it.
The parallel with Support Vector Regression is direct: SVR places an \(\varepsilon\)-tube around the regression function; SVM places a margin around the classification boundary and maximises its width. Both use the kernel trick to extend the approach to non-linear problems without explicitly computing a high-dimensional feature transformation.
SVMs excel when:
The number of features is large relative to the number of samples
There is a clear margin of separation in the data
A non-linear kernel is appropriate for the problem geometry
The Math#
For binary hard-margin SVM (linearly separable data), the optimal hyperplane \(\mathbf{w}^\top\mathbf{x} + b = 0\) solves:

$$
\min_{\mathbf{w},\,b}\ \frac{1}{2}\|\mathbf{w}\|^2
\quad\text{subject to}\quad
y_i(\mathbf{w}^\top\mathbf{x}_i + b) \geq 1 \quad \forall i
$$
The margin width is \(\frac{2}{\|\mathbf{w}\|}\), so minimising \(\|\mathbf{w}\|^2\) is equivalent to maximising the margin.
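The margin-width identity can be checked numerically. The sketch below fits a linear SVM with a very large `C` (approximating the hard-margin case) on a small hypothetical set of separable points, then computes \(2/\|\mathbf{w}\|\) from the fitted coefficients and confirms it matches the geometric gap between the two classes:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical linearly separable toy data: the closest opposing
# points are (1, 1) and (3, 3), a gap of 2*sqrt(2) along the diagonal
X = np.array([[0, 0], [1, 1], [0, 1], [3, 3], [4, 4], [4, 3]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates the hard-margin SVM
clf = SVC(kernel='linear', C=1e6).fit(X, y)

w = clf.coef_[0]
margin = 2 / np.linalg.norm(w)   # margin width = 2 / ||w||
print(margin)                    # ≈ 2.828, i.e. 2*sqrt(2)
```

The support vectors `(1, 1)` and `(3, 3)` sit exactly on the margin, where the decision function evaluates to \(\pm 1\).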
Soft-margin SVM (allows some misclassification) introduces slack variables \(\xi_i \geq 0\):

$$
\min_{\mathbf{w},\,b,\,\boldsymbol{\xi}}\ \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{n}\xi_i
\quad\text{subject to}\quad
y_i(\mathbf{w}^\top\mathbf{x}_i + b) \geq 1 - \xi_i,\quad \xi_i \geq 0
$$
The RBF kernel maps inputs into an infinite-dimensional space, enabling non-linear decision boundaries:

$$
K(\mathbf{x}, \mathbf{z}) = \exp\!\left(-\gamma\,\|\mathbf{x} - \mathbf{z}\|^2\right)
$$
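As a quick sanity check, the RBF kernel value \(\exp(-\gamma\|\mathbf{x}-\mathbf{z}\|^2)\) can be computed by hand and compared against scikit-learn's `rbf_kernel` helper (the specific points and \(\gamma\) below are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

x = np.array([[1.0, 2.0]])
z = np.array([[2.0, 0.0]])
gamma = 0.5

# K(x, z) = exp(-gamma * ||x - z||^2), here exp(-0.5 * 5)
manual = np.exp(-gamma * np.sum((x - z) ** 2))
library = rbf_kernel(x, z, gamma=gamma)[0, 0]
print(np.isclose(manual, library))  # → True
```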
Key hyperparameters:
| Hyperparameter | Role |
|---|---|
| `C` | Regularisation: high C → narrow margin, fewer training errors; low C → wider margin, more tolerant |
| `kernel` | Feature transformation: `'linear'`, `'poly'`, `'rbf'`, `'sigmoid'` |
| `gamma` | RBF bandwidth: high → complex boundary, risk of overfitting |
In scikit-learn#
```python
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

svm = Pipeline([
    ('scaler', StandardScaler()),
    ('svc', SVC(kernel='rbf', C=10, gamma='scale',
                probability=True, random_state=42))
])
svm.fit(X_train, y_train)
y_pred = svm.predict(X_test)
y_prob = svm.predict_proba(X_test)[:, 1]
```
Feature scaling is essential - SVM optimises a distance-based objective and is sensitive to feature magnitudes. Set `probability=True` to enable probability estimates (uses Platt scaling, which adds overhead).
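When only a ranking score is needed (e.g. for AUC-ROC), the Platt-scaling overhead can be avoided entirely: `decision_function` returns the signed distance to the hyperplane, which works directly as a score. A minimal self-contained sketch on synthetic data (the dataset here is a hypothetical stand-in):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.svm import SVC

# Hypothetical synthetic data for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# No probability=True needed - we never call predict_proba
clf = SVC(kernel='rbf', gamma='scale').fit(X, y)

# Signed distance to the boundary serves as a ranking score for AUC
scores = clf.decision_function(X)
print(round(roc_auc_score(y, scores), 3))
```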
Example#
```python
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, roc_auc_score

svc = SVC(kernel='rbf', C=10, gamma='scale', probability=True, random_state=42)
svc.fit(X_train_sc, y_train)

train_acc = accuracy_score(y_train, svc.predict(X_train_sc))
test_acc = accuracy_score(y_test, svc.predict(X_test_sc))
test_auc = roc_auc_score(y_test, svc.predict_proba(X_test_sc)[:, 1])
```
The RBF-kernel SVM achieves a test accuracy of 0.979 and AUC-ROC of 0.996. The train accuracy of 0.993 is close to the test accuracy, confirming good generalisation. Only a small subset of training points become support vectors - the rest do not affect the decision boundary.
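The sparsity claim is easy to verify: a fitted `SVC` exposes the per-class support-vector counts via `n_support_`. The sketch below uses a synthetic stand-in for the dataset above (hypothetical sizes and parameters):

```python
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical synthetic data standing in for the dataset above
X, y = make_classification(n_samples=400, n_features=10, random_state=42)
X_sc = StandardScaler().fit_transform(X)

svc = SVC(kernel='rbf', C=10, gamma='scale').fit(X_sc, y)

# n_support_ holds the number of support vectors per class;
# typically far fewer than the full training set
print(svc.n_support_, "->", svc.n_support_.sum(), "of", len(X_sc))
```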
Effect of C and gamma on the Decision Boundary#
```python
import pandas as pd
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.svm import SVC

configs = [
    {"C": 0.1, "gamma": "scale", "label": "C=0.1 (under-regularised)"},
    {"C": 1.0, "gamma": "scale", "label": "C=1.0"},
    {"C": 10, "gamma": "scale", "label": "C=10 (default)"},
    {"C": 100, "gamma": "scale", "label": "C=100 (over-fitted)"},
    {"C": 10, "gamma": 0.001, "label": "C=10, gamma=0.001 (smooth)"},
    {"C": 10, "gamma": 1.0, "label": "C=10, gamma=1.0 (jagged)"},
]

rows = []
for cfg in configs:
    m = SVC(kernel='rbf', C=cfg["C"], gamma=cfg["gamma"],
            probability=True, random_state=42)
    m.fit(X_train_sc, y_train)
    rows.append({
        "Config": cfg["label"],
        "Train Accuracy": round(accuracy_score(y_train, m.predict(X_train_sc)), 3),
        "Test Accuracy": round(accuracy_score(y_test, m.predict(X_test_sc)), 3),
        "Test AUC": round(roc_auc_score(y_test, m.predict_proba(X_test_sc)[:, 1]), 3),
    })
pd.DataFrame(rows)
```
|   | Config | Train Accuracy | Test Accuracy | Test AUC |
|---|---|---|---|---|
| 0 | C=0.1 (under-regularised) | 0.962 | 0.937 | 0.989 |
| 1 | C=1.0 | 0.979 | 0.979 | 0.997 |
| 2 | C=10 (default) | 0.993 | 0.979 | 0.996 |
| 3 | C=100 (over-fitted) | 1.000 | 0.944 | 0.990 |
| 4 | C=10, gamma=0.001 (smooth) | 0.979 | 0.979 | 0.996 |
| 5 | C=10, gamma=1.0 (jagged) | 1.000 | 0.636 | 0.959 |
High \(C\) or high \(\gamma\) creates overly complex boundaries that memorise the training set. The optimal region balances margin width and classification error.
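In practice that balance is usually found by searching over `C` and `gamma` jointly with cross-validation rather than inspecting configs one at a time. A minimal sketch with `GridSearchCV`, on a hypothetical synthetic dataset (the grid values below are illustrative, not prescriptive):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical synthetic data for illustration
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

# Scaling inside the pipeline keeps the CV folds leakage-free
pipe = Pipeline([('scaler', StandardScaler()), ('svc', SVC(kernel='rbf'))])
grid = GridSearchCV(
    pipe,
    param_grid={'svc__C': [0.1, 1, 10, 100],
                'svc__gamma': ['scale', 0.001, 0.01, 0.1, 1.0]},
    cv=5, scoring='roc_auc',
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```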