General ML

Unsupervised vs Supervised

What signal does the learner actually get?

01 · First principles · One question splits the field

Before naming paradigms, ask the only question that matters: what signal does the learner receive, and where does it come from? Everything else (algorithms, losses, architectures) is downstream of the answer.

In supervised learning the signal is a label y attached to each input x by something outside the data — a human annotator, a sensor, a future event. The task is function approximation: find f with f(x) ≈ y on points you have not seen.

In unsupervised learning there is no y at all. The only thing available is the geometry and density of the inputs themselves: which points sit near which, which directions carry variance, which regions are crowded. The task is to expose that structure — as clusters, as a low-dimensional map, as a density estimate.

supervised: learn p(y | x) · unsupervised: learn p(x) (or its shape)
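One common way to make the line above precise is the maximum-likelihood view. The parameter vector θ and the sample size n are notation introduced here for illustration, not taken from the text, and other losses (squared error, margin losses) follow the same pattern.

```latex
\begin{aligned}
\text{supervised:}\quad   & \hat\theta = \arg\max_\theta \frac{1}{n}\sum_{i=1}^{n} \log p_\theta(y_i \mid x_i) \\
\text{unsupervised:}\quad & \hat\theta = \arg\max_\theta \frac{1}{n}\sum_{i=1}^{n} \log p_\theta(x_i)
\end{aligned}
```

The two objectives differ only in whether an externally supplied y_i appears inside the sum.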

The gist: supervised learning is told what is true; unsupervised learning must decide what is interesting. The first has an objective answer key; the second has only internal consistency.
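As a rough illustration of the contrast, the sketch below runs both kinds of learner on the same toy 1-D inputs. The data, the labels, and the function names (fit_supervised, two_means, predict) are all made up for this example; the supervised learner is a nearest-class-mean classifier and the unsupervised one is plain two-cluster k-means, chosen only because they fit in a few lines.

```python
# Toy 1-D inputs forming two groups. The labels are hypothetical and are
# shown only to the supervised learner.
xs = [1.0, 1.2, 0.8, 1.1, 5.0, 5.3, 4.9, 5.1]
ys = [0, 0, 0, 0, 1, 1, 1, 1]


def mean(values):
    return sum(values) / len(values)


def fit_supervised(xs, ys):
    """Nearest-class-mean classifier: the labels decide which points share a mean."""
    return {label: mean([x for x, y in zip(xs, ys) if y == label])
            for label in set(ys)}


def two_means(xs, iters=10):
    """Two-cluster k-means in 1-D: sees only xs; cluster ids carry no meaning."""
    centroids = [min(xs), max(xs)]  # simple deterministic initialisation
    for _ in range(iters):
        groups = {i: [] for i in range(len(centroids))}
        for x in xs:
            nearest = min(groups, key=lambda i: abs(x - centroids[i]))
            groups[nearest].append(x)
        # keep the old centroid if a cluster ends up empty
        centroids = [mean(g) if g else centroids[i] for i, g in groups.items()]
    return dict(enumerate(centroids))


def predict(means, x):
    """Return the key (class label or cluster id) whose mean is closest to x."""
    return min(means, key=lambda k: abs(x - means[k]))


class_means = fit_supervised(xs, ys)   # keys are the labels someone supplied
cluster_means = two_means(xs)          # keys are arbitrary cluster ids
print(predict(class_means, 4.0), predict(cluster_means, 4.0))
```

On data this clean both learners find the same two means. The difference is what the keys mean: the supervised keys are the labels from the answer key, while the cluster ids are an accident of initialisation and say nothing about what each group is.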

02 · Failure first · Why both must exist

The naive plan is to label everything and stay supervised forever. It breaks on cost: labels require human time, and human time does not scale with data. The internet contains trillions of tokens of text and billions of images, almost none of them annotated for your task. A radiologist labelling scans can cost hundreds of dollars per hour; the unlabelled scans are cheap by comparison.

The opposite naive plan — stay fully unsupervised — breaks on ambiguity. Without a target, "structure" is underdetermined: the same customer data clusters one way by spending and another way by geography, and no loss function arbitrates. Unsupervised results need a downstream purpose to be judged against.
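The ambiguity can be shown with four made-up customers. In the sketch below, the customer names, the two features (annual spend and longitude), and the helper two_groups are all hypothetical; the grouping rule is a largest-gap split along one feature, used here as a simple stand-in for a clustering algorithm.

```python
# Hypothetical customers: name -> (annual_spend, longitude)
customers = {
    "ana":   (100.0, -120.0),
    "ben":   (120.0,   10.0),
    "chloe": (900.0, -118.0),
    "dev":   (950.0,   12.0),
}

SPEND, LONGITUDE = 0, 1


def two_groups(customers, feature):
    """Split names into two groups at the largest gap along one feature."""
    ordered = sorted(customers, key=lambda name: customers[name][feature])
    values = [customers[name][feature] for name in ordered]
    gaps = [values[i + 1] - values[i] for i in range(len(values) - 1)]
    cut = gaps.index(max(gaps)) + 1  # first index of the second group
    return set(ordered[:cut]), set(ordered[cut:])


by_spend = two_groups(customers, SPEND)          # low spenders vs high spenders
by_region = two_groups(customers, LONGITUDE)     # west vs east
print(by_spend)
print(by_region)
```

Both partitions are internally consistent, and nothing in the data says which one is correct. Only a downstream purpose (pricing, logistics) can choose between them.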

So the field's real history is the search for ways to get supervised-style training signal without paying for labels. That search produced the spectrum below.

03 · The spectrum · Between the two poles

A spectrum, not a binary: the axis is where the training signal comes from and what it costs.

Self-supervised learning is the trick that built modern AI: manufacture the labels from the data itself. Hide the next word and predict it; mask a patch and reconstruct it; corrupt an input and denoise it (the same move as denoising autoencoders). The objective is supervised in form, with a concrete target and a loss, but no human ever wrote a label. Every document yields one free (x, y) pair per token position, so an internet-scale corpus yields trillions of them. This is how LLMs are pretrained (see LLM vs RNN vs S4).
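To make "manufacture the labels from the data" concrete, here is a minimal sketch that turns one raw sentence into training pairs. Whitespace tokenisation, the "[MASK]" string, and the function names are assumptions for illustration; real systems use subword tokenisers and typically mask a random subset of positions rather than each position in turn.

```python
def next_token_pairs(tokens):
    """Each prefix is an input x; the token that follows it is the target y."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def masked_pairs(tokens, mask="[MASK]"):
    """Hide one position at a time; the hidden token is the target y."""
    pairs = []
    for i, target in enumerate(tokens):
        corrupted = tokens[:i] + [mask] + tokens[i + 1:]
        pairs.append((corrupted, target))
    return pairs


tokens = "the cat sat on the mat".split()

for x, y in next_token_pairs(tokens):
    print(x, "->", y)

for x, y in masked_pairs(tokens)[:2]:
    print(x, "->", y)
```

A six-token sentence gives five next-token pairs and six masked pairs, all without any annotation step.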

Semi-supervised learning uses a small labelled set plus a large unlabelled one, on the assumption that p(x) tells you something about p(y|x) — decision boundaries should pass through low-density regions, not through the middle of a cluster.
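A minimal pseudo-labelling sketch, assuming 1-D inputs, a nearest-class-mean classifier, and a hand-picked confidence margin (the data, the margin value, and the function names are all illustrative): start from two labelled points, label only the unlabelled points the current model is confident about, and refit.

```python
def mean(values):
    return sum(values) / len(values)


def fit_means(labelled):
    """One mean per class, computed from (x, y) pairs."""
    labels = {y for _, y in labelled}
    return {lab: mean([x for x, y in labelled if y == lab]) for lab in labels}


def pseudo_label(labelled, unlabelled, margin=1.0, rounds=3):
    """Grow the labelled set with confident predictions, then refit."""
    labelled, remaining = list(labelled), list(unlabelled)
    for _ in range(rounds):
        means = fit_means(labelled)
        confident, unsure = [], []
        for x in remaining:
            dists = sorted((abs(x - m), lab) for lab, m in means.items())
            # confident when the closest class mean wins by at least `margin`
            if dists[1][0] - dists[0][0] >= margin:
                confident.append((x, dists[0][1]))
            else:
                unsure.append(x)
        if not confident:
            break
        labelled += confident
        remaining = unsure
    return fit_means(labelled), remaining


labelled = [(1.0, 0), (5.0, 1)]                      # the few labels we paid for
unlabelled = [0.5, 1.5, 0.9, 4.6, 5.4, 5.2, 3.1]     # the many free inputs
means, still_unlabelled = pseudo_label(labelled, unlabelled)
print(means, still_unlabelled)
```

The point at 3.1 lies in the sparse gap between the two groups and never clears the margin, so it is left unlabelled. The resulting boundary, midway between the two refitted means, sits in that low-density gap.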

Reinforcement learning is a third axis, not a midpoint. The signal is evaluative rather than instructive: a reward says how good the chosen action was, never what the correct action would have been, and it is often delayed and sparse. Supervised learning corrects you; RL only grades you.
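The evaluative nature of the signal shows up even in the simplest RL setting, a multi-armed bandit. The sketch below is an epsilon-greedy learner under simplifying assumptions: the reward values are invented, rewards are deterministic rather than noisy, and a bandit has no delay, so it illustrates "graded, not corrected" but not the delayed, sparse part.

```python
import random

TRUE_REWARDS = [0.2, 0.8, 0.5]  # hypothetical values, hidden from the learner


def pull(arm):
    """Environment: returns a scalar reward for the chosen arm only."""
    return TRUE_REWARDS[arm]


def epsilon_greedy(n_arms, steps=200, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    estimates = [0.0] * n_arms
    counts = [0] * n_arms
    for t in range(steps):
        if t < n_arms:
            arm = t                                  # try every arm once
        elif rng.random() < epsilon:
            arm = rng.randrange(n_arms)              # explore
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])  # exploit
        reward = pull(arm)                           # a grade, not a correction
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]  # running mean
    return estimates, counts


estimates, counts = epsilon_greedy(len(TRUE_REWARDS))
best_arm = max(range(len(estimates)), key=lambda a: estimates[a])
print(estimates, counts, best_arm)
```

The learner is never told that arm 1 is the right answer. It has to discover that by spending pulls on the other arms, which is the cost a supervised learner with labels would not pay.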

04 · The map · Paradigm → canonical algorithms

Paradigm	Signal	Canonical algorithms	Output
Supervised	(x, y) pairs	linear/logistic regression, trees, SVMs, k-NN, neural nets	predictor f(x)
Unsupervised	x alone	k-means, GMM, DBSCAN, PCA, autoencoders	groups, manifolds, densities
Self-supervised	x predicting parts of x	next-token LMs, masked LMs, contrastive (SimCLR, CLIP)	representations to transfer
Semi-supervised	few (x, y) + many x	pseudo-labelling, consistency regularisation	predictor, label-efficient
Reinforcement	delayed scalar reward	Q-learning, policy gradient, PPO	policy π(a | s)

05 · Practice · When labels are expensive

The practical decision is almost always economic. If labels are cheap and plentiful, plain supervised learning wins; it is the most direct route from data to the quantity you care about. When labels are expensive, the modern playbook is:

Pretrain a model with self-supervision on abundant raw data, or borrow one that is already pretrained; the representation is learned without paying for labels.
Fine-tune or probe on the small labelled set you can afford (see transfer learning).
If even that is too few, lean on few-shot or zero-shot methods, which import priors instead of examples.

Pure unsupervised methods keep a separate, permanent role: exploration (what segments exist in these users?), compression, and anomaly detection, where labels often cannot exist because the interesting events have not happened yet.
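For the anomaly-detection case, a minimal sketch under strong assumptions: a single numeric feature, roughly Gaussian normal behaviour, and a conventional 3-sigma threshold. The readings and function names are invented for illustration. The model is fitted only on unlabelled history and flags whatever falls far outside it.

```python
import statistics


def fit_normal_profile(values):
    """Summarise unlabelled history by its mean and sample standard deviation."""
    return statistics.mean(values), statistics.stdev(values)


def is_anomaly(x, profile, threshold=3.0):
    """Flag x when it lies more than `threshold` standard deviations from the mean."""
    mu, sigma = profile
    return abs(x - mu) / sigma > threshold


# Hypothetical sensor readings observed during normal operation; no labels.
history = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 9.7, 10.0]
profile = fit_normal_profile(history)

print(is_anomaly(10.4, profile))  # within the usual spread
print(is_anomaly(14.0, profile))  # far outside anything seen so far
```

No example of a failure was needed to fit the profile, which is the property that makes the unsupervised approach usable before the interesting events have occurred.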

Mental Model

Ask "where does the training signal come from" — that one question locates any method on the map.
Supervised = approximate p(y|x) with an answer key; unsupervised = expose the shape of p(x) with no key.
Self-supervision manufactures the key from the data itself; that single trick is why LLMs could be trained at internet scale.
RL sits on a different axis entirely: evaluative, delayed signal instead of instructive labels.
In practice the choice is economics: labels cost human time, raw data is nearly free, so pretrain broad and fine-tune narrow.