6.2.3.3. Neural Networks#

A single The Perceptron can only learn linearly separable patterns. The moment you add one or more hidden layers between input and output, the network can represent any continuous function - a remarkable result known as the Universal Approximation Theorem.

This architecture - Input Layer → Hidden Layer(s) → Output Layer - is called a Multi-Layer Perceptron (MLP) or feedforward neural network. It is the foundational architecture for all of modern deep learning.


The Core Idea#

Hidden layers transform the input step by step. Each layer learns a new representation of the data - increasingly abstract features built on top of previous ones:

  • Layer 1: learns simple combinations of raw features

  • Layer 2: learns combinations of Layer 1’s outputs

  • Layer L: hands a transformed representation to the output layer

The output layer then applies a task-appropriate activation:

  • Regression: no activation (linear output)

  • Binary classification: sigmoid — squashes a single score into a probability in \((0,1)\)

  • Multi-class: softmax — converts \(K\) raw scores into a probability distribution that sums to 1

The hidden layers nearly always use ReLU (or a variant). The output activation is chosen to match the task — sigmoid for binary, softmax for multi-class, nothing for regression.


The Math#

Forward Pass#

For a network with \(L\) layers, each layer \(\ell\) transforms its input \(\mathbf{a}^{(\ell-1)}\) (where \(\mathbf{a}^{(0)} = \mathbf{x}\)):

\[\mathbf{z}^{(\ell)} = W^{(\ell)}\,\mathbf{a}^{(\ell-1)} + \mathbf{b}^{(\ell)}\]
\[\mathbf{a}^{(\ell)} = f\!\left(\mathbf{z}^{(\ell)}\right)\]

where \(W^{(\ell)}\) is the weight matrix for layer \(\ell\) and \(f\) is the activation function applied element-wise.

Output Activations#

Sigmoid (binary classification output)

The sigmoid function takes a single raw score \(z\) and maps it to a probability:

\[\sigma(z) = \frac{1}{1 + e^{-z}} \in (0, 1)\]

The network outputs one neuron; if \(\sigma(z) \geq 0.5\) the prediction is class 1, otherwise class 0. The loss is binary cross-entropy:

\[\mathcal{L} = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i \log \hat{p}_i + (1-y_i)\log(1-\hat{p}_i)\right]\]

Softmax (multi-class classification output)

For \(K\) classes the network outputs \(K\) neurons with raw scores \(z_1, \ldots, z_K\). Softmax converts them into a probability distribution:

\[\text{softmax}(\mathbf{z})_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}, \qquad \sum_{k=1}^{K}\text{softmax}(\mathbf{z})_k = 1\]

Taking the exponential ensures all outputs are positive; dividing by the sum normalises them. The class with the highest score still gets the largest probability — softmax preserves ordering while making the outputs interpretable as probabilities.

The loss minimised during training is categorical cross-entropy:

\[\mathcal{L} = -\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K} \mathbf{1}[y_i=k]\,\log \hat{p}_{ik}\]

Because only the true-class term survives the indicator \(\mathbf{1}[y_i=k]\), this simplifies to: minimise the negative log-probability of the correct class — the higher the probability assigned to the right answer, the lower the loss.

Backpropagation#

Backpropagation is the algorithm that computes \(\nabla_W \mathcal{L}\) efficiently by applying the chain rule backwards through the network:

\[\frac{\partial \mathcal{L}}{\partial W^{(\ell)}} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{(\ell)}} \cdot \left(\mathbf{a}^{(\ell-1)}\right)^\top\]

The gradient at each layer is computed from the gradient at the next layer, passed backwards - hence the name. This is just the chain rule of calculus, applied systematically.

Key hyperparameters:

Hyperparameter

Role

hidden_layer_sizes

Number of hidden layers and neurons per layer

activation

'relu' (default), 'tanh', 'logistic'

solver

Optimizer - see Optimizers

alpha

L2 regularisation strength

learning_rate_init

Starting step size

batch_size

Mini-batch size for SGD/Adam


In scikit-learn#

from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

mlp = Pipeline([
    ('scaler', StandardScaler()),
    ('mlp', MLPClassifier(
        hidden_layer_sizes=(128, 64),  # two hidden layers
        activation='relu',
        solver='adam',
        alpha=1e-4,                    # L2 regularisation
        learning_rate_init=0.001,
        max_iter=300,
        random_state=42
    ))
])
mlp.fit(X_train, y_train)

Always scale features before training an MLP - gradient descent is highly sensitive to feature magnitudes. A Pipeline ensures the scaler is never accidentally applied to the test set during evaluation.


Example#

Hide code cell source

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from myst_nb import glue
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, roc_auc_score, ConfusionMatrixDisplay, confusion_matrix

np.random.seed(42)

# Shared dataset used throughout all Neural Network pages
data = load_digits()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc  = scaler.transform(X_test)
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(
    hidden_layer_sizes=(128, 64),
    activation='relu',
    solver='adam',
    alpha=1e-4,
    learning_rate_init=0.001,
    max_iter=300,
    random_state=42
)
mlp.fit(X_train_sc, y_train)

train_acc = accuracy_score(y_train, mlp.predict(X_train_sc))
test_acc  = accuracy_score(y_test,  mlp.predict(X_test_sc))

Two hidden layers (128 → 64 → 10) achieves a test accuracy of 0.973, a dramatic improvement over the single-layer The Perceptron. Train accuracy of 1.0 is slightly higher but the gap is small, indicating the L2 regularisation is working.

Training Loss Curve#

Monitoring the loss curve is the first thing to do after training. A smooth descent that plateaus is healthy; oscillations suggest the learning rate is too high; a loss that barely moves suggests it is too low.

fig, ax = plt.subplots(figsize=(9, 4))
ax.plot(mlp.loss_curve_, color='steelblue', lw=2)
ax.set_xlabel("Iterations")
ax.set_ylabel("Cross-entropy loss")
ax.set_title("MLP Training Loss Curve -  Digits (128→64 hidden units, ReLU, Adam)",
             fontweight='bold')
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
../../../../_images/4f54a17cf04b5fce6b8fc840bd660dadbada87bc713fc900a6bc8ce5b40ed571.png

Confusion Matrix#

fig, ax = plt.subplots(figsize=(7, 6))
ConfusionMatrixDisplay.from_predictions(
    y_test, mlp.predict(X_test_sc),
    ax=ax, colorbar=False, cmap='Blues'
)
ax.set_title("MLP -  Confusion Matrix (Digits Test Set)", fontweight='bold')
plt.tight_layout()
plt.show()
../../../../_images/2ac90840928a6784bf3265be141e8963b259893814174fc73986dc3b7dad307a.png

Effect of Architecture Depth#

Adding hidden layers increases model capacity. Here we compare architectures from no hidden layers (a linear model) up to three hidden layers.

architectures = {
    "No hidden (linear)":   (),
    "1 layer (64)":         (64,),
    "2 layers (128, 64)":   (128, 64),
    "3 layers (256,128,64)":(256, 128, 64),
}

rows = []
for name, size in architectures.items():
    m = MLPClassifier(hidden_layer_sizes=size, activation='relu',
                      solver='adam', alpha=1e-4,
                      max_iter=400, random_state=42)
    m.fit(X_train_sc, y_train)
    rows.append({
        "Architecture":   name,
        "Train Accuracy": round(accuracy_score(y_train, m.predict(X_train_sc)), 3),
        "Test Accuracy":  round(accuracy_score(y_test,  m.predict(X_test_sc)),  3),
        "Iterations":     m.n_iter_,
    })

pd.DataFrame(rows)
/home/runner/work/datasciencethenovel/datasciencethenovel/.venv/lib/python3.13/site-packages/sklearn/neural_network/_multilayer_perceptron.py:781: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (400) reached and the optimization hasn't converged yet.
  warnings.warn(
Architecture Train Accuracy Test Accuracy Iterations
0 No hidden (linear) 0.997 0.964 400
1 1 layer (64) 1.000 0.980 164
2 2 layers (128, 64) 1.000 0.973 80
3 3 layers (256,128,64) 1.000 0.982 48

Each additional layer captures more abstract features from the digit images, translating into higher accuracy - until the model is large enough that we hit diminishing returns or would need regularisation to prevent overfitting.