This is the web page for Introduction to Data Science at the University of Florida.
This demo explores key clustering techniques including K-Means, Hierarchical Clustering, DBSCAN, the Elbow Method, and Evaluation Metrics.
Clustering is an unsupervised learning technique used to group similar objects without predefined labels.
Before running the script, install the required dependencies:
pip install numpy pandas scikit-learn scipy seaborn matplotlib
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import DBSCAN
We start by creating a synthetic dataset for clustering using make_blobs:
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)
plt.figure(figsize=(6, 6))
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis', s=50, edgecolors='k', alpha=0.7)
plt.title("Original Data Distribution", fontsize=16)
plt.xlabel("Feature 1", fontsize=14)
plt.ylabel("Feature 2", fontsize=14)
plt.grid(True, linestyle='--', alpha=0.5)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.tight_layout()
plt.savefig("data.png", dpi=300)
Expected Outcome: The dataset contains 4 distinct clusters.
make_blobs is a function in scikit-learn used to generate synthetic data for clustering. It creates a specified number of blobs (clusters) of points with a Gaussian distribution, which can be used to test machine learning algorithms, especially clustering algorithms.
K-Means is a partitioning-based clustering method that minimizes variance within clusters.
K-Means will always create exactly the number of clusters (k) you choose. If k is too small, it merges distinct groups; if it is too large, it overfits by splitting natural groups apart. So picking the right k is key for good clustering. Let's try k = 2:
kmeans = KMeans(n_clusters=2, random_state=42, n_init='auto')
labels_kmeans = kmeans.fit_predict(X)
plt.figure(figsize=(5, 5))
plt.scatter(X[:, 0], X[:, 1], c=labels_kmeans, cmap='viridis', s=50)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=200, c='red', marker="X", label="Centroids")
plt.title("K-Means Clustering Results")
plt.legend()
plt.savefig("k-means.png")
The Elbow Method helps determine the best k by examining the within-cluster variance (inertia).
inertia = []
K_range = range(1, 10)
for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init='auto').fit(X)
    inertia.append(kmeans.inertia_)
plt.figure(figsize=(6, 6))
plt.plot(K_range, inertia, marker='o')
plt.title("Elbow Method", fontsize=16)
plt.xlabel("Number of Clusters (k)", fontsize=14)
plt.ylabel("Inertia", fontsize=14)
plt.grid(True, linestyle='--', alpha=0.5)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.tight_layout()
plt.savefig("elbow.png", dpi=300)
Hierarchical clustering builds a dendrogram that represents cluster relationships.
linked = linkage(X, method="single")
plt.figure(figsize=(10, 6))
dendrogram(linked)
plt.title("Hierarchical Clustering Dendrogram", fontsize=16)
plt.xlabel("Data Points", fontsize=14)
plt.ylabel("Distance", fontsize=14)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.grid(True, linestyle='--', alpha=0.5)
plt.tight_layout()
plt.savefig("dendogram_single.png", dpi=300)
The linkage method determines how the distance between two clusters is measured when merging:
Single: Merges clusters based on the shortest distance between points.
Complete: Merges based on the longest distance between points.
Average: Merges using the average distance between all pairs of points.
Ward: Minimizes the variance within clusters, aiming for compact, spherical clusters.
The loop below generates a dendrogram for each of these linkage methods.
linkage_methods = ["single", "complete", "average", "ward"]
for method in linkage_methods:
    linked = linkage(X, method=method)
    plt.figure(figsize=(10, 6))
    dendrogram(linked)
    plt.title(f"Hierarchical Clustering Dendrogram ({method})", fontsize=16)
    plt.xlabel("Data Points", fontsize=14)
    plt.ylabel("Distance", fontsize=14)
    plt.xticks(fontsize=12)
    plt.yticks(fontsize=12)
    plt.grid(True, linestyle='--', alpha=0.5)
    plt.tight_layout()
    plt.savefig(f"dendogram_{method}.png", dpi=300)
    plt.show()
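The dendrograms show how clusters merge, but you often also want flat cluster labels. As a minimal sketch (not part of the original demo), you can cut the ward-linkage tree into four clusters with scipy's fcluster:
from scipy.cluster.hierarchy import fcluster
# Cut the ward-linkage tree so that at most 4 flat clusters remain.
linked_ward = linkage(X, method="ward")
labels_hier = fcluster(linked_ward, t=4, criterion="maxclust")
print(f"Hierarchical cluster sizes: {np.bincount(labels_hier)[1:]}")  # fcluster labels start at 1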
Unlike K-Means, DBSCAN detects clusters based on local point density rather than distance to a centroid, and it labels low-density points as noise (-1).
eps_values = [0.1, 0.2, 0.3, 0.5, 0.8, 1.0]
plt.figure(figsize=(15, 10))
for i, eps in enumerate(eps_values):
    dbscan = DBSCAN(eps=eps, min_samples=5)
    labels_dbscan = dbscan.fit_predict(X)
    plt.subplot(2, 3, i + 1)
    plt.scatter(X[:, 0], X[:, 1], c=labels_dbscan, cmap='plasma', s=50, edgecolors='k', alpha=0.7)
    plt.title(f"DBSCAN with eps={eps}", fontsize=14)
    plt.xlabel("Feature 1", fontsize=12)
    plt.ylabel("Feature 2", fontsize=12)
    plt.xticks(fontsize=10)
    plt.yticks(fontsize=10)
    plt.grid(True, linestyle='--', alpha=0.5)
plt.tight_layout()
plt.savefig("dbscan_multiple_eps.png", dpi=300)
In DBSCAN, eps (epsilon) defines the maximum distance at which two points are considered neighbors. Points within eps of each other can end up in the same cluster, provided the min_samples density requirement is met. A smaller eps creates more, smaller clusters (and more noise points), while a larger eps can merge clusters together.
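To quantify this effect, a small loop (assuming X, eps_values, and the imports from above) can count how many clusters and noise points DBSCAN finds for each eps:
# Count clusters and noise points (label -1) for each eps value.
for eps in eps_values:
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int(np.sum(labels == -1))
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points")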
We assess clustering quality using the Silhouette Score and the Davies-Bouldin Index:
from sklearn.metrics import silhouette_score, davies_bouldin_score
silhouette_kmeans = silhouette_score(X, labels_kmeans)
davies_bouldin_kmeans = davies_bouldin_score(X, labels_kmeans)
print(f"Silhouette Score (K-Means): {silhouette_kmeans:.3f}") # Higher is better
print(f"Davies-Bouldin Index (K-Means): {davies_bouldin_kmeans:.3f}") # Lower is better
In clustering, the choices you make—like the number of clusters or the method used—have a big impact on the results. Since it’s unsupervised learning, monitoring and interpreting the outcomes is key. You need to make sure the clusters reflect the data well and adjust as needed.