6.3.1. Clustering#

Clustering is the task of discovering natural groups hidden inside unlabeled data. Given a set of data points and no class labels, the algorithm organises observations so that points within the same group are more similar to each other than to points in other groups.

Unlike supervised learning, the goal is never handed to us. We do not tell the algorithm that customers should split into “bargain hunters” and “premium buyers”. We ask it to find whatever structure exists in the data, and we interpret the result ourselves.

Why Clustering?

  • Explore data before you build any predictive model—understand what categories exist.

  • Segment customers, users, documents, genes, or images into meaningful groups.

  • Summarise large datasets into a small set of representative prototypes.

  • Pre-process for supervised learning by adding cluster membership as a feature.
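The last use case can be sketched in a few lines. This is a minimal illustration, assuming scikit-learn is available; the data, number of clusters, and random seeds are arbitrary choices, not part of the text above.

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative unlabeled data: 200 points in 2-D.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))

# Fit K-Means and read off each point's cluster membership.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_  # one integer cluster id per point

# Append the membership as an extra column, ready to feed a
# downstream supervised model alongside the original features.
X_augmented = np.column_stack([X, labels])
print(X_augmented.shape)  # (200, 3)
```

In practice one would often one-hot encode the cluster id rather than treat it as an ordinal number, since cluster labels carry no inherent order.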


Three Families of Clustering Algorithms

Clustering algorithms differ in what they consider a cluster and how they find it.

  • Partition-based (e.g. K-Means): Divide data into k non-overlapping regions by minimising an objective function. Fast and scalable; assumes roughly spherical, similarly sized clusters.

  • Hierarchical (e.g. Agglomerative): Build a nested tree of merges or splits. Produces a dendrogram you can cut at any level; no need to choose k in advance.

  • Density-based (e.g. DBSCAN): Define clusters as dense regions separated by sparse space. Finds arbitrarily shaped clusters; identifies noise points automatically.
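The contrast between the families becomes concrete on non-spherical data. Below is a hedged sketch, assuming scikit-learn; the two-moons dataset and all parameter values (`eps`, `min_samples`, the seeds) are illustrative choices, not prescriptions.

```python
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.datasets import make_moons

# Two interlocking crescents: a classic non-spherical shape.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# One representative per family, each asked to find structure.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
agg = AgglomerativeClustering(n_clusters=2).fit_predict(X)
db = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)  # label -1 marks noise

# K-Means imposes convex, roughly spherical regions and tends to cut
# each crescent in half; DBSCAN follows the dense regions instead.
print(sorted(set(km)), sorted(set(agg)), sorted(set(db) - {-1}))
```

Plotting the three label assignments side by side (e.g. with matplotlib scatter plots coloured by label) makes the difference immediately visible: only the density-based result traces the two crescents.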