Hierarchical Clustering

6.3.1.2. Hierarchical Clustering#

K-Means forces you to choose the number of clusters before you see the data. Hierarchical clustering takes a different approach: it builds a complete tree of groupings — from each point in its own cluster all the way up to one single cluster — and lets you decide where to cut.

The result is a dendrogram: a tree diagram that shows, at a glance, how similar groups merge and at what scale structure appears. This makes hierarchical clustering the preferred tool when you want to explore your data before committing to a particular $k$.

The Intuition#

The most common variant is agglomerative (bottom-up) clustering:

Start with $n$ clusters — one per data point.
Find the two closest clusters.
Merge them into a single cluster.
Repeat steps 2–3 until one cluster remains.

At each step the algorithm records which clusters merged and at what distance. That history is the dendrogram. To obtain any desired number of clusters, you simply draw a horizontal line across the dendrogram and count how many vertical lines it cuts.

The Math#

The one degree of freedom is how you measure the distance between two clusters (not individual points). This is the linkage criterion:

Linkage	Distance between clusters $A$ and $B$	Behaviour
Single	$\min_{a \in A, b \in B} d(a, b)$	Chaining effect; finds elongated clusters
Complete	$\max_{a \in A, b \in B} d(a, b)$	Compact clusters; sensitive to outliers
Average	$\frac{1}{	A
Ward	Increase in total WCSS after merge	Minimises variance; produces similar-sized clusters

Ward’s linkage is usually the best default. It produces compact, similarly-sized clusters and aligns most closely with K-Means’ objective.

The agglomerative process has time complexity $O(n^2 \log n)$ and space complexity $O(n^2)$, making it impractical for very large datasets (> ~10 000 points).

In scikit-learn#

from sklearn.cluster import AgglomerativeClustering

model = AgglomerativeClustering(n_clusters=3, linkage='ward')
labels = model.fit_predict(X_sc)

To visualise the dendrogram, use scipy:

from scipy.cluster.hierarchy import dendrogram, linkage

Z = linkage(X_sc, method='ward')
dendrogram(Z)

Key parameters:

Parameter	Meaning
`n_clusters`	Number of clusters to cut the tree at
`linkage`	`'ward'`, `'complete'`, `'average'`, `'single'`
`metric`	Distance metric (default `'euclidean'`; not used with `'ward'`)

Example#

Reading the Dendrogram#

Z = linkage(X_sc, method='ward')

fig, ax = plt.subplots(figsize=(14, 5))
dendrogram(Z, ax=ax, truncate_mode='lastp', p=20,
           leaf_rotation=45, leaf_font_size=10,
           color_threshold=6.5, above_threshold_color='grey')
ax.axhline(y=6.5, color='tomato', linestyle='--', linewidth=1.5, label='Cut at 3 clusters')
ax.set_xlabel('Sample index (or cluster size)', fontsize=12)
ax.set_ylabel('Ward distance', fontsize=12)
ax.set_title('Dendrogram — Iris Dataset (Ward linkage)', fontsize=13, fontweight='bold')
ax.legend()
ax.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

../../../../_images/b39793f719b14ab13036385cbaf28cee8508b2b6fb9610e46a89e731507cd3b7.png

The dendrogram shows a clear split into three main branches. The large vertical jump just below the red line tells us the three-cluster solution has exceptionally high between-cluster distance — exactly where you want to cut.

Fitted Clusters vs True Labels#

from sklearn.decomposition import PCA

model = AgglomerativeClustering(n_clusters=3, linkage='ward')
labels_pred = model.fit_predict(X_sc)

pca  = PCA(n_components=2)
X_2d = pca.fit_transform(X_sc)

fig, axes = plt.subplots(1, 2, figsize=(13, 5))
titles      = ['Agglomerative Clusters (predicted)', 'True Iris Species']
label_sets  = [labels_pred, y]
class_names = [['Cluster 0', 'Cluster 1', 'Cluster 2'], data.target_names.tolist()]

for ax, labs, title, names in zip(axes, label_sets, titles, class_names):
    for cls in np.unique(labs):
        mask = labs == cls
        ax.scatter(X_2d[mask, 0], X_2d[mask, 1],
                   label=names[cls], alpha=0.75, edgecolors='k', linewidths=0.3, s=55)
    ax.set_xlabel('PC 1', fontsize=11)
    ax.set_ylabel('PC 2', fontsize=11)
    ax.set_title(title, fontsize=12, fontweight='bold')
    ax.legend(fontsize=9)
    ax.grid(True, alpha=0.3)

plt.suptitle('Hierarchical Clustering on Iris (2-D PCA projection)', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

../../../../_images/625940d02a76c7de27244689a8e58c82e1742bed5fe886996ce2447d04dc6f68.png

Ward linkage cleanly separates setosa from the other two species and draws a reasonable boundary between versicolor and virginica — without ever being told the labels.

Effect of Linkage Criteria#

linkages  = ['ward', 'complete', 'average', 'single']
fig, axes = plt.subplots(1, 4, figsize=(18, 4))

for ax, lnk in zip(axes, linkages):
    try:
        m = AgglomerativeClustering(n_clusters=3, linkage=lnk)
        lbl = m.fit_predict(X_sc)
    except Exception:
        ax.set_title(f'{lnk}\n(incompatible)', fontsize=11)
        continue
    for cls in np.unique(lbl):
        mask = lbl == cls
        ax.scatter(X_2d[mask, 0], X_2d[mask, 1], alpha=0.75,
                   edgecolors='k', linewidths=0.3, s=45)
    ax.set_title(f'Linkage: {lnk}', fontsize=11, fontweight='bold')
    ax.set_xlabel('PC 1'); ax.set_ylabel('PC 2')
    ax.grid(True, alpha=0.3)

plt.suptitle('Iris — Effect of Linkage Criteria', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

../../../../_images/a56fa4b6b4eb0db16463d96930225d9ad5e0574a336049d3a92db5542081bb01.png

Ward and complete linkage produce the most compact, well-separated clusters. Single linkage suffers from the chaining effect, where a string of close points merges two large groups prematurely.

Strengths and Weaknesses#


Strengths	No need to pre-specify $k$; dendrogram reveals structure at every scale; deterministic (no random initialisation)
Weaknesses	$O(n^2)$ memory — impractical for large datasets; merges are irreversible (greedy); sensitive to noise and outliers

Tip

Use the dendrogram to choose $k$ by looking for the largest vertical gap before the first horizontal line crossing the cut threshold. That gap indicates a stable, well-separated partition.

Linkage	Distance between clusters \(A\) and \(B\)	Behaviour
Single	\(\min_{a \in A, b \in B} d(a, b)\)	Chaining effect; finds elongated clusters
Complete	\(\max_{a \in A, b \in B} d(a, b)\)	Compact clusters; sensitive to outliers
Average	$\frac{1}{	A
Ward	Increase in total WCSS after merge	Minimises variance; produces similar-sized clusters


Strengths	No need to pre-specify \(k\); dendrogram reveals structure at every scale; deterministic (no random initialisation)
Weaknesses	\(O(n^2)\) memory — impractical for large datasets; merges are irreversible (greedy); sensitive to noise and outliers