6.3.2.3. UMAP: Uniform Manifold Approximation and Projection#

t-SNE produces beautiful cluster visualisations but comes with a practical cost: it is slow, non-deterministic, cannot project new points without re-fitting, and discards global structure. UMAP, published in 2018, addresses all four limitations simultaneously.

UMAP is grounded in Riemannian geometry and algebraic topology, but its practical behaviour is easy to summarise: it builds a graph that captures the local neighbourhood structure of the data, then finds a low-dimensional layout that preserves that graph as faithfully as possible — while also keeping large-scale structure intact far better than t-SNE.

In practice, UMAP is:

  • Much faster — typically 10–100× quicker than t-SNE on the same dataset.

  • More consistent — largely deterministic across runs (with a fixed seed).

  • Better at global structure — the relative positions of clusters carry more meaning.

  • Versatile — supports out-of-sample projection, making it usable inside ML pipelines.


The Intuition#

UMAP proceeds in two stages:

  1. Graph construction — build a fuzzy, weighted nearest-neighbour graph in the high-dimensional space. Each edge weight reflects how confident we are that two points are truly neighbours (it falls off smoothly with distance, and is calibrated so that each point has a roughly equal “local scale”).

  2. Layout optimisation — find a low-dimensional embedding that minimises the cross-entropy between the high-D fuzzy graph and a similarly constructed low-D graph. This is done via stochastic gradient descent, which is far faster than t-SNE’s force-directed approach.

The “fuzzy” definition of the graph is what allows UMAP to handle manifolds of varying density intelligently — a common failure mode for t-SNE.


The Math (at a glance)#

For each point \(\mathbf{x}_i\) and its \(k\)-nearest neighbour \(\mathbf{x}_{j}\), define a membership strength:

\[w_{ij} = \exp\!\left(-\frac{d(\mathbf{x}_i, \mathbf{x}_j) - \rho_i}{\sigma_i}\right)\]

where \(\rho_i\) is the distance to the nearest neighbour (ensures continuity of the manifold) and \(\sigma_i\) is calibrated to achieve a target “effective connectivity” of \(\log_2 k\).

The symmetrised graph weight is:

\[\bar{w}_{ij} = w_{ij} + w_{ji} - w_{ij} \cdot w_{ji}\]

In the low-D space, membership is modelled with a smooth family of curves:

\[q_{ij} = \left(1 + a\, \|\mathbf{y}_i - \mathbf{y}_j\|^{2b}\right)^{-1}\]

The parameters \(a\) and \(b\) are determined by min_dist. The embedding is optimised by minimising the binary cross-entropy:

\[\mathcal{L} = \sum_{(i,j) \in E} \left[ \bar{w}_{ij} \log q_{ij} + (1 - \bar{w}_{ij}) \log(1 - q_{ij}) \right]\]

In practice#

# Install: pip install umap-learn
import umap

reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=42)
X_umap  = reducer.fit_transform(X_sc)

# Out-of-sample projection (unlike t-SNE, this works!)
X_new_umap = reducer.transform(X_new_sc)

Key parameters:

Parameter

Meaning

n_neighbors

Size of local neighbourhood; small → fine local structure; large → more global

min_dist

Minimum distance between points in the embedding; small → tighter clusters

n_components

Output dimensionality (2 for visualisation; higher for pre-processing)

metric

Distance metric ('euclidean', 'cosine', etc.)


Example#

Hide code cell source

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler

np.random.seed(42)

data  = load_digits()
X, y  = data.data, data.target

scaler = StandardScaler()
X_sc   = scaler.fit_transform(X)

import umap as umap_lib

UMAP on the Digits Dataset#

import time
t0 = time.time()
reducer = umap_lib.UMAP(n_components=2, n_neighbors=15,
                        min_dist=0.1, random_state=42)
X_umap = reducer.fit_transform(X_sc)
print(f"UMAP runtime: {time.time() - t0:.1f} s")

fig, ax = plt.subplots(figsize=(10, 7))
sc = ax.scatter(X_umap[:, 0], X_umap[:, 1],
                c=y, cmap='tab10', alpha=0.75,
                edgecolors='k', linewidths=0.2, s=20)
cbar = plt.colorbar(sc, ax=ax, ticks=range(10))
cbar.set_label('Digit class', fontsize=11)
ax.set_xlabel('UMAP Dimension 1', fontsize=12)
ax.set_ylabel('UMAP Dimension 2', fontsize=12)
ax.set_title('UMAP Projection of Digits (64 → 2 dims)',
                fontsize=14, fontweight='bold')
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
/home/runner/work/datasciencethenovel/datasciencethenovel/.venv/lib/python3.13/site-packages/umap/umap_.py:1952: UserWarning: n_jobs value 1 overridden to 1 by setting random_state. Use no seed for parallelism.
  warn(
UMAP runtime: 11.0 s
../../../../_images/296779014913b88618be2ba5640803c57f92a35ef573bc2caf8e9f3a1c9ee318.png

Effect of n_neighbors#

neighbor_vals = [5, 15, 50]
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

for ax, nn in zip(axes, neighbor_vals):
    emb = umap_lib.UMAP(n_components=2, n_neighbors=nn,
                        min_dist=0.1, random_state=42).fit_transform(X_sc)
    ax.scatter(emb[:, 0], emb[:, 1], c=y, cmap='tab10',
                alpha=0.65, edgecolors='k', linewidths=0.15, s=16)
    ax.set_title(f'n_neighbors = {nn}', fontsize=12, fontweight='bold')
    ax.set_xticks([]); ax.set_yticks([])
    ax.grid(True, alpha=0.2)

plt.suptitle('UMAP n_neighbors Sensitivity — Digits Dataset',
                fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
/home/runner/work/datasciencethenovel/datasciencethenovel/.venv/lib/python3.13/site-packages/umap/umap_.py:1952: UserWarning: n_jobs value 1 overridden to 1 by setting random_state. Use no seed for parallelism.
  warn(
/home/runner/work/datasciencethenovel/datasciencethenovel/.venv/lib/python3.13/site-packages/umap/umap_.py:1952: UserWarning: n_jobs value 1 overridden to 1 by setting random_state. Use no seed for parallelism.
  warn(
/home/runner/work/datasciencethenovel/datasciencethenovel/.venv/lib/python3.13/site-packages/umap/umap_.py:1952: UserWarning: n_jobs value 1 overridden to 1 by setting random_state. Use no seed for parallelism.
  warn(
../../../../_images/ab9c4ff9ef41f5966438405df853db5ce96a5da68e93622b135eade840fec54d.png

Small n_neighbors emphasises fine local clusters at the cost of global coherence. Large n_neighbors produces a more continuous landscape where the relative positions of digit classes become visible — something t-SNE rarely achieves.


Strengths and Weaknesses#

Strengths

Much faster than t-SNE; preserves global structure; supports out-of-sample projection; largely deterministic; works with non-Euclidean metrics

Weaknesses

External dependency (umap-learn, not in scikit-learn); sensitive to n_neighbors and min_dist; theoretical guarantees are approximate; interprets axes no more than t-SNE

Tip

UMAP is the preferred choice when you need a visualisation tool you can reuse (i.e., project new points onto an existing layout), or when your dataset is large enough that t-SNE’s \(O(n^2)\) cost is prohibitive.