6.3.1.4. Cluster Evaluation: Measuring Quality#

You have learned how to perform clustering using K-Means Clustering, Hierarchical Clustering, and DBSCAN: Density-Based Clustering. The natural next question is: how do we know whether the clusters we obtained are actually good?

Unlike supervised learning, clustering does not usually come with ground truth labels. There is no direct accuracy measure. Instead, we must evaluate clusters indirectly by measuring their internal structure.

A good clustering should satisfy two key properties:

  1. Cohesion: Points within the same cluster should be similar to one another.

  2. Separation: Points in different clusters should be clearly distinct.

Cluster evaluation metrics are designed to quantify these ideas.


The Challenge: No Ground Truth#

In supervised learning, we compare predictions against known labels. In clustering, we typically have no such labels. This makes evaluation inherently more subtle.

There are two broad categories of evaluation metrics:

1. Internal Metrics#

These rely only on the feature data (X). They assess how well the clustering structure fits the data itself.

  • Silhouette Score

  • Davies-Bouldin Index

  • Calinski-Harabasz Index

2. External Metrics#

These require true labels. They are used when ground truth is available for validation purposes.

  • Adjusted Rand Index (ARI)

  • Normalized Mutual Information (NMI)

In most real applications, internal metrics are the primary tools, since labels are rarely available.


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from myst_nb import glue
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.datasets import make_blobs, load_iris
from sklearn.metrics import (silhouette_score, silhouette_samples,
                             davies_bouldin_score, calinski_harabasz_score,
                             adjusted_rand_score, normalized_mutual_info_score)
from sklearn.preprocessing import StandardScaler

np.random.seed(42)

X_sil, y_true = make_blobs(n_samples=300, centers=3,
                            cluster_std=0.7, random_state=42)

Silhouette Score: The Most Widely Used Metric#

The Silhouette Score evaluates how well each point fits within its assigned cluster compared to other clusters.

For each point \(i\):

  • \(a(i)\): the average distance to other points in the same cluster

  • \(b(i)\): the average distance to points in the nearest neighboring cluster

  • \(s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}\)

Interpretation of \(s(i)\):

  • +1: The point is well matched to its own cluster and far from others

  • 0: The point lies near a cluster boundary

  • -1: The point may be assigned to the wrong cluster

The overall Silhouette Score is simply the average of \(s(i)\) across all data points.
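To make the definition concrete, here is a from-scratch sketch of \(s(i)\) on a tiny two-cluster dataset, checked against scikit-learn's `silhouette_samples`. The helper `silhouette_by_hand` is illustrative, not part of any library.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Toy data: two tight 1-D clusters far apart
X = np.array([[0.0], [0.5], [1.0], [10.0], [10.5], [11.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

def silhouette_by_hand(X, labels, i):
    """Compute s(i) directly from the definition."""
    same = labels == labels[i]
    same[i] = False  # exclude the point itself from a(i)
    a = np.mean(np.linalg.norm(X[same] - X[i], axis=1))
    # b(i): smallest mean distance to any *other* cluster
    b = np.inf
    for c in np.unique(labels):
        if c != labels[i]:
            b = min(b, np.mean(np.linalg.norm(X[labels == c] - X[i], axis=1)))
    return (b - a) / max(a, b)

manual = np.array([silhouette_by_hand(X, labels, i) for i in range(len(X))])
print(np.allclose(manual, silhouette_samples(X, labels)))  # True
```

Because the two clusters are far apart, every \(s(i)\) here is close to 1.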

The following experiment compares silhouette scores for different values of \(k\).

# Try different k values
k_values = [2, 3, 4, 5]

fig, axes = plt.subplots(2, 2, figsize=(14, 12))
axes = axes.ravel()

for idx, k in enumerate(k_values):
    kmeans = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
    labels = kmeans.fit_predict(X_sil)

    # Compute silhouette
    sil_score = silhouette_score(X_sil, labels)
    sil_samples = silhouette_samples(X_sil, labels)

    # Plot clusters
    colors = ['red', 'blue', 'green', 'orange', 'purple']
    for cluster_id in range(k):
        cluster_mask = labels == cluster_id
        axes[idx].scatter(X_sil[cluster_mask, 0], X_sil[cluster_mask, 1],
                         s=60, c=colors[cluster_id], alpha=0.6,
                         edgecolors='black', linewidths=0.5,
                         label=f'C{cluster_id} (sil={sil_samples[cluster_mask].mean():.2f})')

    axes[idx].scatter(kmeans.cluster_centers_[:, 0],
                     kmeans.cluster_centers_[:, 1],
                     s=300, c='yellow', marker='*',
                     edgecolors='black', linewidths=2)

    axes[idx].set_xlabel('Feature 1', fontsize=11)
    axes[idx].set_ylabel('Feature 2', fontsize=11)
    axes[idx].set_title(f'k={k}\nSilhouette Score: {sil_score:.3f}',
                       fontsize=12, fontweight='bold')
    axes[idx].legend(loc='best', fontsize=8)
    axes[idx].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

As you examine the plots, notice how both cluster structure and silhouette score change with \(k\). The optimal number of clusters typically corresponds to a high average silhouette value and visually well-separated groups.

Davies-Bouldin Index: Measuring Cluster Similarity#

The Davies-Bouldin Index quantifies how similar each cluster is to its most similar neighboring cluster.

  • It compares within-cluster scatter to between-cluster separation.

  • It penalizes clusters that overlap or are poorly separated.

  • Lower values indicate better clustering.

  • A value close to 0 represents strong separation.

Unlike the silhouette score, which is bounded between -1 and 1, the Davies-Bouldin Index has no fixed upper bound. Interpretation relies on relative comparison across models.
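The index can be spelled out in a short sketch: for each cluster, measure its scatter \(S_i\) (mean distance of points to the centroid), compare each pair via \((S_i + S_j)\) over the centroid distance, and average the worst-case ratio. The helper `db_by_hand` below is ours for illustration, verified against scikit-learn.

```python
import numpy as np
from sklearn.metrics import davies_bouldin_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
labels = np.repeat([0, 1], 50)

def db_by_hand(X, labels):
    ks = np.unique(labels)
    cents = np.array([X[labels == k].mean(axis=0) for k in ks])
    # S_j: average distance of each cluster's points to its own centroid
    S = np.array([np.mean(np.linalg.norm(X[labels == k] - cents[j], axis=1))
                  for j, k in enumerate(ks)])
    total = 0.0
    for i in range(len(ks)):
        # R_ij = (S_i + S_j) / distance between centroids; take the worst neighbor
        ratios = [(S[i] + S[j]) / np.linalg.norm(cents[i] - cents[j])
                  for j in range(len(ks)) if j != i]
        total += max(ratios)
    return total / len(ks)

print(np.isclose(db_by_hand(X, labels), davies_bouldin_score(X, labels)))  # True
```

Shrinking the gap between the two blobs increases the ratio, which is exactly the "lower is better" behavior described above.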

# Compare different k values
k_range = range(2, 8)
db_scores = []
sil_scores = []

for k in k_range:
    kmeans = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
    labels = kmeans.fit_predict(X_sil)

    db_score = davies_bouldin_score(X_sil, labels)
    sil_score = silhouette_score(X_sil, labels)

    db_scores.append(db_score)
    sil_scores.append(sil_score)

# Plot
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Davies-Bouldin (lower better)
axes[0].plot(k_range, db_scores, 'o-', linewidth=2, markersize=8, color='red')
axes[0].axvline(x=3, color='green', linestyle='--', linewidth=2,
               label='True k=3')
axes[0].set_xlabel('Number of Clusters (k)', fontsize=12)
axes[0].set_ylabel('Davies-Bouldin Index', fontsize=12)
axes[0].set_title('Davies-Bouldin Index\n(Lower is Better)',
                  fontsize=13, fontweight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Silhouette (higher better)
axes[1].plot(k_range, sil_scores, 'o-', linewidth=2, markersize=8, color='blue')
axes[1].axvline(x=3, color='green', linestyle='--', linewidth=2,
               label='True k=3')
axes[1].set_xlabel('Number of Clusters (k)', fontsize=12)
axes[1].set_ylabel('Silhouette Score', fontsize=12)
axes[1].set_title('Silhouette Score\n(Higher is Better)',
                  fontsize=13, fontweight='bold')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

When comparing different values of \(k\), look for the configuration that minimizes the Davies-Bouldin score while maintaining interpretability.

Calinski-Harabasz Index: The Variance Ratio#

The Calinski-Harabasz Index, also known as the Variance Ratio Criterion, evaluates the ratio of:

  • Between-cluster dispersion

  • Within-cluster dispersion

Intuitively, good clustering should maximize separation between clusters while minimizing variation inside clusters.

  • Higher values indicate better-defined clusters.
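The ratio above can be written out directly: between-cluster dispersion \(B\) (size-weighted squared distances of centroids to the overall mean) over within-cluster dispersion \(W\) (squared distances of points to their centroid), each normalized by its degrees of freedom. The `ch_by_hand` helper is an illustrative sketch, checked against scikit-learn.

```python
import numpy as np
from sklearn.metrics import calinski_harabasz_score

rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 1, (40, 2)),
               rng.normal([5, 0], 1, (40, 2)),
               rng.normal([0, 5], 1, (40, 2))])
labels = np.repeat([0, 1, 2], 40)

def ch_by_hand(X, labels):
    n, ks = len(X), np.unique(labels)
    k = len(ks)
    overall = X.mean(axis=0)
    B = W = 0.0
    for c in ks:
        pts = X[labels == c]
        cent = pts.mean(axis=0)
        B += len(pts) * np.sum((cent - overall) ** 2)  # between-cluster dispersion
        W += np.sum((pts - cent) ** 2)                 # within-cluster dispersion
    return (B / (k - 1)) / (W / (n - k))

print(np.isclose(ch_by_hand(X, labels), calinski_harabasz_score(X, labels)))  # True
```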

# Compute for different k
ch_scores = []

for k in k_range:
    kmeans = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
    labels = kmeans.fit_predict(X_sil)
    ch_score = calinski_harabasz_score(X_sil, labels)
    ch_scores.append(ch_score)

# Plot all three metrics together
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# Silhouette
axes[0].plot(k_range, sil_scores, 'o-', linewidth=2, markersize=8, color='blue')
axes[0].axvline(x=3, color='red', linestyle='--', linewidth=2, label='Best k')
axes[0].set_xlabel('k', fontsize=12)
axes[0].set_ylabel('Silhouette Score', fontsize=12)
axes[0].set_title('Silhouette\n(Higher Better)', fontsize=12, fontweight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Davies-Bouldin
axes[1].plot(k_range, db_scores, 'o-', linewidth=2, markersize=8, color='red')
axes[1].axvline(x=3, color='green', linestyle='--', linewidth=2, label='Best k')
axes[1].set_xlabel('k', fontsize=12)
axes[1].set_ylabel('Davies-Bouldin Index', fontsize=12)
axes[1].set_title('Davies-Bouldin\n(Lower Better)', fontsize=12, fontweight='bold')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

# Calinski-Harabasz
axes[2].plot(k_range, ch_scores, 'o-', linewidth=2, markersize=8, color='green')
axes[2].axvline(x=3, color='red', linestyle='--', linewidth=2, label='Best k')
axes[2].set_xlabel('k', fontsize=12)
axes[2].set_ylabel('Calinski-Harabasz Score', fontsize=12)
axes[2].set_title('Calinski-Harabasz\n(Higher Better)', fontsize=12, fontweight='bold')
axes[2].legend()
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

When multiple metrics agree on the same value of \(k\), confidence in that clustering choice increases.

Comparing Algorithms#

Evaluation metrics are also useful for comparing different clustering algorithms on the same dataset.

Here we compare:

  • K-Means

  • Hierarchical Clustering

  • DBSCAN

All models are applied to standardized data to ensure fair comparison. Distance-based metrics are highly sensitive to feature scale, so standardization is essential.

# Scale data
scaler = StandardScaler()
X_sil_scaled = scaler.fit_transform(X_sil)

# Try different algorithms
algorithms = {
    'K-Means': KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=42),
    'Hierarchical': AgglomerativeClustering(n_clusters=3, linkage='ward'),
    'DBSCAN': DBSCAN(eps=0.5, min_samples=5)
}

results = []

fig, axes = plt.subplots(1, 3, figsize=(16, 5))

for idx, (name, algo) in enumerate(algorithms.items()):
    labels = algo.fit_predict(X_sil_scaled)

    # Skip DBSCAN if only noise
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)

    if n_clusters > 1:
        sil = silhouette_score(X_sil_scaled, labels)
        db = davies_bouldin_score(X_sil_scaled, labels)
        ch = calinski_harabasz_score(X_sil_scaled, labels)
    else:
        sil, db, ch = np.nan, np.nan, np.nan

    results.append({
        'Algorithm': name,
        'Clusters': n_clusters,
        'Silhouette': sil,
        'Davies-Bouldin': db,
        'Calinski-Harabasz': ch
    })

    # Plot
    unique_labels = set(labels)
    colors = ['red', 'blue', 'green', 'orange']

    for k in unique_labels:
        if k == -1:  # Noise
            class_mask = labels == k
            axes[idx].scatter(X_sil_scaled[class_mask, 0],
                            X_sil_scaled[class_mask, 1],
                            s=50, c='gray', marker='x', alpha=0.5)
        else:
            class_mask = labels == k
            axes[idx].scatter(X_sil_scaled[class_mask, 0],
                            X_sil_scaled[class_mask, 1],
                            s=50, c=colors[k % len(colors)], alpha=0.6,
                            edgecolors='black', linewidths=0.5)

    axes[idx].set_xlabel('Feature 1 (scaled)', fontsize=11)
    axes[idx].set_ylabel('Feature 2 (scaled)', fontsize=11)
    if not np.isnan(sil):
        axes[idx].set_title(f'{name}\nSilhouette: {sil:.3f}',
                           fontsize=12, fontweight='bold')
    else:
        axes[idx].set_title(f'{name}\n(Evaluation failed)',
                           fontsize=12, fontweight='bold')
    axes[idx].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

results_df = pd.DataFrame(results)
display(results_df)
| Algorithm    | Clusters | Silhouette | Davies-Bouldin | Calinski-Harabasz |
|--------------|----------|------------|----------------|-------------------|
| K-Means      | 3        | 0.893      | 0.148          | 9577.26           |
| Hierarchical | 3        | 0.893      | 0.148          | 9577.26           |
| DBSCAN       | 3        | 0.893      | 0.148          | 9577.26           |

On this well-separated dataset, all three algorithms recover the same three clusters, so their scores coincide exactly. On harder data, different algorithms may produce different numbers of clusters and different scores. Metrics help quantify quality, but visualization remains critical: a strong metric score does not automatically imply meaningful structure.

External Metrics: When Labels Are Available#

Occasionally, you may have ground truth labels available for validation. In such cases, external metrics provide a direct way to measure agreement between predicted clusters and true classes.

Two common external metrics:

  • Adjusted Rand Index (ARI)

    • Range: [-1, 1]

    • 1 indicates perfect agreement

  • Normalized Mutual Information (NMI)

    • Range: [0, 1]

    • 1 indicates perfect agreement

Unlike internal metrics, external metrics evaluate alignment with known labels rather than structural properties alone.
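A quick sketch shows why external metrics compare partitions rather than label names: permuting cluster IDs leaves ARI and NMI unchanged, and ARI is chance-adjusted, so an uninformative labeling scores at or below zero.

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]

# Same partition, cluster IDs permuted: external metrics ignore label names
y_perm = [2, 2, 2, 0, 0, 0, 1, 1, 1]
print(adjusted_rand_score(y_true, y_perm))            # 1.0
print(normalized_mutual_info_score(y_true, y_perm))   # 1.0

# A labeling that splits every true cluster evenly carries no information
y_uninformative = [0, 1, 2, 0, 1, 2, 0, 1, 2]
print(adjusted_rand_score(y_true, y_uninformative))   # negative: below chance
```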


# Use Iris with known species labels
iris = load_iris()
X_iris = iris.data
y_iris = iris.target

X_iris_scaled = scaler.fit_transform(X_iris)

kmeans_iris = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=42)
labels_pred = kmeans_iris.fit_predict(X_iris_scaled)

ari = adjusted_rand_score(y_iris, labels_pred)
nmi = normalized_mutual_info_score(y_iris, labels_pred)
sil = silhouette_score(X_iris_scaled, labels_pred)
db  = davies_bouldin_score(X_iris_scaled, labels_pred)

glue('iris-ari', round(ari, 3), display=False)
glue('iris-nmi', round(nmi, 3), display=False)
glue('iris-sil', round(sil, 3), display=False)
glue('iris-db',  round(db,  3), display=False)
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

for ax, labels_plot, title in zip(
        axes,
        [y_iris, labels_pred],
        ['True Species (Ground Truth)',
         f'K-Means  |  ARI = {ari:.3f}, NMI = {nmi:.3f}']):
    for cls in np.unique(labels_plot):
        mask = labels_plot == cls
        ax.scatter(X_iris_scaled[mask, 0], X_iris_scaled[mask, 2],
                   s=60, alpha=0.7, edgecolors='black', linewidths=0.5,
                   label=f'Class {cls}')
    ax.set_xlabel('Sepal length (scaled)', fontsize=12)
    ax.set_ylabel('Petal length (scaled)', fontsize=12)
    ax.set_title(title, fontsize=13, fontweight='bold')
    ax.legend(); ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

K-Means achieves an ARI of 0.62 and an NMI of 0.659, with an internal silhouette score of 0.46 and a Davies-Bouldin index of 0.834. These moderate external scores indicate substantial but imperfect agreement with the true species: setosa is recovered cleanly, while versicolor and virginica overlap in feature space and are partially confused.

External metrics are useful for validation, but they should not be used for selecting models in purely unsupervised settings.

Practical Recommendations#

Choosing the Number of Clusters#

Use the Silhouette Score as your primary indicator. Combine it with the elbow method for K-Means when appropriate. Davies-Bouldin and Calinski-Harabasz provide complementary perspectives.

Look for agreement across multiple metrics rather than relying on a single number.
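The "look for agreement" heuristic can be turned into a small sketch: each metric casts a vote for its preferred \(k\), and unanimity raises confidence. The `vote_for_k` helper is purely illustrative, not a standard API.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.7, random_state=42)

def vote_for_k(X, k_range):
    """Each internal metric 'votes' for its best k."""
    scores = {}
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
        scores[k] = {'sil': silhouette_score(X, labels),
                     'db': davies_bouldin_score(X, labels),
                     'ch': calinski_harabasz_score(X, labels)}
    return [
        max(k_range, key=lambda k: scores[k]['sil']),  # higher is better
        min(k_range, key=lambda k: scores[k]['db']),   # lower is better
        max(k_range, key=lambda k: scores[k]['ch']),   # higher is better
    ]

print(vote_for_k(X, range(2, 8)))
```

On well-separated blobs like these, all three votes should land on the true \(k = 3\); disagreement between the votes is itself a useful warning sign.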


Comparing Algorithms#

  • Always compute metrics on the same standardized dataset.

  • Visualize clusters alongside metrics.

  • Be cautious with non-spherical or density-based structures where certain metrics may be misleading.

Metrics quantify structure, but interpretation requires context.


Silhouette Score Guidelines#

| Range       | Interpretation           |
|-------------|--------------------------|
| 0.70 – 1.00 | Strong structure         |
| 0.50 – 0.70 | Reasonable structure     |
| 0.25 – 0.50 | Weak structure           |
| < 0.25      | No substantial structure |

These ranges provide general intuition, not strict rules.
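If you want these bands in code, a tiny (purely illustrative) helper suffices:

```python
def interpret_silhouette(score):
    """Map an average silhouette score onto the rough bands above."""
    if score >= 0.70:
        return 'Strong structure'
    if score >= 0.50:
        return 'Reasonable structure'
    if score >= 0.25:
        return 'Weak structure'
    return 'No substantial structure'

print(interpret_silhouette(0.62))  # Reasonable structure
```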


Common Pitfalls#

  • Optimizing a metric too aggressively can produce artificial solutions.

  • Failing to standardize features makes distance-based metrics unreliable.

  • High metric scores do not guarantee business relevance.

Domain knowledge often overrides small numerical differences. A cluster structure that supports meaningful decisions is more valuable than a marginal improvement in silhouette score.


Key Takeaways#

Important

Remember These Points:

  1. Internal Metrics (No Labels)

    • Silhouette Score: [-1, 1], higher better

    • Davies-Bouldin: [0, ∞), lower better

    • Calinski-Harabasz: [0, ∞), higher better

  2. External Metrics (Need Labels)

    • Adjusted Rand Index: [-1, 1], 1 = perfect match

    • Normalized Mutual Info: [0, 1], 1 = perfect match

    • Use for validation, not primary model selection

  3. Silhouette Score (Most Used)

    • Measures cohesion and separation

    • Provides per-sample and overall insight

    • Helpful for choosing k

  4. Practical Workflow

    • Compute multiple internal metrics

    • Look for agreement across metrics

    • Visualize clusters

    • Apply domain knowledge

    • Iterate

  5. Limitations

    • Metrics may disagree

    • No single metric is perfect

    • Visualization remains essential

  6. Best Practice

    • Use Silhouette as primary metric

    • Davies-Bouldin as secondary check

    • Always visualize results

    • Validate with domain experts

Cluster evaluation is not about finding a perfect score. It is about combining quantitative metrics, visual inspection, and domain understanding to arrive at meaningful structure in data.