This is the web page for Introduction to Data Science at the University of Florida.
This demo explores key clustering techniques including K-Means, Hierarchical Clustering, DBSCAN, the Elbow Method, and Evaluation Metrics.
Clustering is an unsupervised learning technique used to group similar objects without predefined labels.
Before running the script, install the required dependencies:
pip install numpy pandas scikit-learn scipy seaborn matplotlib
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import DBSCAN
We start by creating a synthetic dataset for clustering using make_blobs:
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)
plt.figure(figsize=(6, 6))
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis', s=50, edgecolors='k', alpha=0.7)
plt.title("Original Data Distribution", fontsize=16)
plt.xlabel("Feature 1", fontsize=14)
plt.ylabel("Feature 2", fontsize=14)
plt.grid(True, linestyle='--', alpha=0.5)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.tight_layout()
plt.savefig("data.png", dpi=300)
Expected Outcome: The dataset contains 4 distinct clusters.
make_blobs is a function in scikit-learn used to generate synthetic data for clustering. It creates a specified number of blobs (clusters) of points with a Gaussian distribution, which can be used to test machine learning algorithms, especially clustering algorithms.
K-Means is a partitioning-based clustering method that minimizes variance within clusters.
K-Means will always create exactly the number of clusters (k) you choose. If k is too small, it merges distinct groups; if it is too large, it overfits by splitting natural groups apart. So picking the right k is key for good clustering. Let's try k = 2:
kmeans = KMeans(n_clusters=2, random_state=42, n_init='auto')
labels_kmeans = kmeans.fit_predict(X)
plt.figure(figsize=(5, 5))
plt.scatter(X[:, 0], X[:, 1], c=labels_kmeans, cmap='viridis', s=50)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=200, c='red', marker="X", label="Centroids")
plt.title("K-Means Clustering Results")
plt.legend()
plt.savefig("k-means.png")
The Elbow Method helps determine the best k by examining the within-cluster variance (inertia).
inertia = []
K_range = range(1, 10)
for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init='auto').fit(X)
    inertia.append(kmeans.inertia_)
plt.figure(figsize=(6, 6))
plt.plot(K_range, inertia, marker='o')
plt.title("Elbow Method", fontsize=16)
plt.xlabel("Number of Clusters (k)", fontsize=14)
plt.ylabel("Inertia", fontsize=14)
plt.grid(True, linestyle='--', alpha=0.5)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.tight_layout()
plt.savefig("elbow.png", dpi=300)
Hierarchical clustering builds a dendrogram that represents cluster relationships.
linked = linkage(X, method="single")
plt.figure(figsize=(10, 6))
dendrogram(linked)
plt.title("Hierarchical Clustering Dendrogram", fontsize=16)
plt.xlabel("Data Points", fontsize=14)
plt.ylabel("Distance", fontsize=14)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.grid(True, linestyle='--', alpha=0.5)
plt.tight_layout()
plt.savefig("dendogram_single.png", dpi=300)
The linkage method determines how the distance between two clusters is measured when merging:
Single: Merges clusters based on the shortest distance between points.
Complete: Merges based on the longest distance between points.
Average: Merges using the average distance between all pairs of points.
Ward: Minimizes the variance within clusters, aiming for compact, spherical clusters.
The loop below generates a dendrogram for each of these linkage methods.
linkage_methods = ["single", "complete", "average", "ward"]
for method in linkage_methods:
    linked = linkage(X, method=method)
    plt.figure(figsize=(10, 6))
    dendrogram(linked)
    plt.title(f"Hierarchical Clustering Dendrogram ({method})", fontsize=16)
    plt.xlabel("Data Points", fontsize=14)
    plt.ylabel("Distance", fontsize=14)
    plt.xticks(fontsize=12)
    plt.yticks(fontsize=12)
    plt.grid(True, linestyle='--', alpha=0.5)
    plt.tight_layout()
    plt.savefig(f"dendogram_{method}.png", dpi=300)
    plt.show()
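The dendrograms show how clusters merge, but you often also want flat cluster labels. As a minimal sketch (not part of the original demo), you can cut the ward-linkage tree into four clusters with scipy's fcluster:
from scipy.cluster.hierarchy import fcluster
# Cut the ward-linkage tree so that at most 4 flat clusters remain.
linked_ward = linkage(X, method="ward")
labels_hier = fcluster(linked_ward, t=4, criterion="maxclust")
print(f"Hierarchical cluster sizes: {np.bincount(labels_hier)[1:]}")  # fcluster labels start at 1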
Unlike K-Means, DBSCAN detects clusters based on local point density rather than distance to a centroid, and it labels low-density points as noise (-1).
eps_values = [0.1, 0.2, 0.3, 0.5, 0.8, 1.0]
plt.figure(figsize=(15, 10))
for i, eps in enumerate(eps_values):
    dbscan = DBSCAN(eps=eps, min_samples=5)
    labels_dbscan = dbscan.fit_predict(X)
    plt.subplot(2, 3, i + 1)
    plt.scatter(X[:, 0], X[:, 1], c=labels_dbscan, cmap='plasma', s=50, edgecolors='k', alpha=0.7)
    plt.title(f"DBSCAN with eps={eps}", fontsize=14)
    plt.xlabel("Feature 1", fontsize=12)
    plt.ylabel("Feature 2", fontsize=12)
    plt.xticks(fontsize=10)
    plt.yticks(fontsize=10)
    plt.grid(True, linestyle='--', alpha=0.5)
plt.tight_layout()
plt.savefig("dbscan_multiple_eps.png", dpi=300)
In DBSCAN, eps (epsilon) defines the maximum distance at which two points are considered neighbors. Points within eps of each other can end up in the same cluster, provided the min_samples density requirement is met. A smaller eps creates more, smaller clusters (and more noise points), while a larger eps can merge clusters together.
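To quantify this effect, a small loop (assuming X, eps_values, and the imports from above) can count how many clusters and noise points DBSCAN finds for each eps:
# Count clusters and noise points (label -1) for each eps value.
for eps in eps_values:
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int(np.sum(labels == -1))
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points")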
We assess clustering quality using the Silhouette Score and the Davies-Bouldin Index:
from sklearn.metrics import silhouette_score, davies_bouldin_score
silhouette_kmeans = silhouette_score(X, labels_kmeans)
davies_bouldin_kmeans = davies_bouldin_score(X, labels_kmeans)
print(f"Silhouette Score (K-Means): {silhouette_kmeans:.3f}") # Higher is better
print(f"Davies-Bouldin Index (K-Means): {davies_bouldin_kmeans:.3f}") # Lower is better
In clustering, the choices you make—like the number of clusters or the method used—have a big impact on the results. Since it’s unsupervised learning, monitoring and interpreting the outcomes is key. You need to make sure the clusters reflect the data well and adjust as needed.