01 · First principlesWhat structure are we exploiting?
Clustering exists because real data is not uniform: points arrive in lumps, because they were generated by a small number of distinct processes (customer types, cell types, topics). With no labels available, the only usable evidence of those processes is geometry — points from the same process tend to sit near each other. Clustering is the bet that proximity in feature space is a proxy for shared origin (a bet that fails exactly when the features are scaled badly or irrelevant — the same disease that afflicts k-NN).
This is the canonical unsupervised task: there is no answer key, so every algorithm must first decide what "a group" even means, and each decides differently.
02 · k-meansThe objective, then the algorithm
k-means defines a group as a blob around a centre. Choose k centroids μ1…μk and an assignment of each point to one centroid, to minimise the within-cluster variance:
J = Σj=1..kΣx ∈ Cj ‖x − μj‖²
over clusterspoints in cluster j
Minimising J jointly over assignments and centroids is NP-hard, so we do coordinate descent: hold one block fixed, optimise the other, alternate.
Assign: with centroids fixed, each point's optimal move is to its nearest centroid (each term of J is minimised independently).
Update: with assignments fixed, the optimal centroid is the mean of its points (the mean minimises summed squared distance).
Repeat. Each step can only lower J, and J is bounded below, so the loop converges — but only to a local minimum.
The point worth remembering: k-means is not a heuristic that happens to work. It is exact coordinate descent on a stated objective. Knowing the objective tells you precisely when it must fail.
03 · Failure modesWhat the objective cannot see
Every blind spot of k-means is readable straight off J:
Spherical clusters only. ‖x − μ‖² is isotropic, so k-means carves space into Voronoi cells with straight boundaries. Elongated, ring-shaped, or differently-sized clusters get sliced wrong, however long you iterate.
k is an input, not an output. The objective falls monotonically as k grows (k = n gives J = 0), so J itself cannot choose k.
Initialisation sensitivity. Coordinate descent stops at the nearest local minimum, and bad starting centroids find bad minima. k-means++ fixes most of this: pick initial centroids one at a time, each new one sampled with probability proportional to its squared distance from the nearest existing centroid — spread out by construction, with a provable O(log k) approximation bound.
Hard assignments. A point lying between two centres is forced wholly into one cluster, and the centroids lurch as it flips.
Isotropic squared distance prefers round blobs; two parallel elongated clusters get cut across, not along.
04 · The friendsGMM, DBSCAN, hierarchical
Gaussian mixture models are k-means with the hard edges softened. Each cluster becomes a Gaussian with its own covariance, each point gets a responsibility (a probability of membership), and EM alternates exactly as k-means does:
E-step: rij = p(cluster j | xi) · M-step: refit each (πj, μj, Σj) weighted by rij
k-means is the limit of EM on a GMM as the covariances shrink to εI: responsibilities harden into nearest-centroid assignments. The covariances are what buy you elliptical clusters.
DBSCAN changes the definition of a group: a cluster is a connected region of high density, grown outward from "core" points that have at least minPts neighbours within radius ε. It finds arbitrary shapes (rings, spirals), discovers k by itself, and labels sparse points as noise — at the cost of two new hyperparameters and trouble when clusters have very different densities.
Hierarchical (agglomerative) clustering refuses to pick k at all: start with every point as its own cluster, repeatedly merge the closest pair, and record the merge tree. You cut the dendrogram at whatever level the application needs. The natural choice when the data genuinely has a taxonomy (species, organisational structure).
05 · Choosing kHonestly
Two standard tools, neither of which is an oracle:
Elbow: plot J against k and look for the bend where adding a cluster stops paying. In practice the bend is often a gentle slope and the "elbow" is wherever the reader wants it to be.
Silhouette: for each point, s = (b − a)/max(a, b), where a is mean distance to its own cluster and b to the nearest other cluster. Averaged over points it rewards tight, well-separated clusters and actually peaks at a specific k — but it inherits the spherical bias of distance-based scoring.
Honesty clause: if neither plot shows a clear winner, the data likely does not contain k crisp clusters, and the most defensible answer is "the clustering is a useful quantisation, not a discovered truth." Validate against a downstream task whenever one exists.
Mental Model
Clustering bets that proximity means shared origin; every algorithm is a different formalisation of "group."
k-means = coordinate descent on within-cluster variance: assign-to-nearest, recompute means, repeat; guaranteed to converge, only locally.
Its blind spots (round clusters, fixed k, init sensitivity) are all visible in the objective; k-means++ fixes the init.
GMM is soft k-means with covariances; DBSCAN redefines clusters as dense regions; hierarchical builds the whole merge tree.
Elbow and silhouette suggest k; nothing guarantees the data has one.