What signal does the learner actually get?
Before naming paradigms, ask the only question that matters: what signal does the learner receive, and where does it come from? Everything else (algorithms, losses, architectures) is downstream of the answer.
In supervised learning the signal is a label y attached to each input x by something outside the data — a human annotator, a sensor, a future event. The task is function approximation: find f with f(x) ≈ y on points you have not seen.
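As a minimal sketch of "find f with f(x) ≈ y on points you have not seen", the snippet below fits a straight line by ordinary least squares and checks it on a held-out point. The data, the helper name `fit_line`, and the choice of a linear model are all assumptions made for illustration; any of the supervised algorithms would fill the same role.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y ≈ a * x + b in one dimension."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b


# Training pairs: each y was supplied from outside the inputs
# (toy data generated from y = 2x + 1).
train_x = [0.0, 1.0, 2.0, 3.0]
train_y = [1.0, 3.0, 5.0, 7.0]

a, b = fit_line(train_x, train_y)


def f(x):
    """The learned predictor."""
    return a * x + b


# The real test is a point the learner never saw during fitting.
held_out_x, held_out_y = 10.0, 21.0
held_out_error = abs(f(held_out_x) - held_out_y)
print(a, b, held_out_error)
```

The held-out error, not the training error, is the quantity that says whether the approximation generalises.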
In unsupervised learning there is no y at all. The only thing available is the geometry and density of the inputs themselves: which points sit near which, which directions carry variance, which regions are crowded. The task is to expose that structure — as clusters, as a low-dimensional map, as a density estimate.
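To make "expose that structure" concrete, here is a small sketch of k-means (Lloyd's algorithm) on one-dimensional points. The data, the starting centres, and the helper names `assign` and `kmeans_1d` are illustrative assumptions; note that no label appears anywhere in the code.

```python
def assign(points, centres):
    """Put each point in the group of its nearest centre."""
    groups = [[] for _ in centres]
    for p in points:
        nearest = min(range(len(centres)), key=lambda k: abs(p - centres[k]))
        groups[nearest].append(p)
    return groups


def kmeans_1d(points, centres, iterations=10):
    """Lloyd's algorithm in one dimension from the given starting centres."""
    centres = list(centres)
    for _ in range(iterations):
        groups = assign(points, centres)
        # Move each centre to the mean of its group; keep it if the group is empty.
        centres = [sum(g) / len(g) if g else c for g, c in zip(groups, centres)]
    return centres, assign(points, centres)


# Only inputs are available: two crowded regions, no y.
points = [1.0, 1.5, 2.0, 8.0, 8.5, 9.0]
centres, groups = kmeans_1d(points, centres=[0.0, 5.0])
print(centres)
print(groups)
```

The algorithm recovers the two crowded regions from proximity alone, but the number of clusters and the starting centres were choices made by the caller, not facts learned from the data.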
The naive plan is to label everything and stay supervised forever. It breaks on cost: labels require human time, and human time does not scale with data. The internet contains trillions of tokens of text and billions of images, almost none of them annotated for your task. A radiologist labelling scans costs hundreds of dollars per hour; the scans themselves are nearly free.
The opposite naive plan — stay fully unsupervised — breaks on ambiguity. Without a target, "structure" is underdetermined: the same customer data clusters one way by spending and another way by geography, and no loss function arbitrates. Unsupervised results need a downstream purpose to be judged against.
So the field's real history is the search for ways to get supervised-style training signal without paying for labels. That search produced the spectrum below.
A spectrum, not a binary: the axis is where the training signal comes from and what it costs.
Self-supervised learning is the trick that built modern AI: manufacture the labels from the data itself. Hide the next word and predict it; mask a patch and reconstruct it; corrupt an input and denoise it (the same move as denoising autoencoders). The objective is supervised in form, with a concrete target and a loss, but no human ever wrote a label. Every document yields one free (x, y) pair per token position, so a web-scale corpus supplies trillions of them. This is exactly how LLMs are pretrained (see LLM vs RNN vs S4).
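The "hide the next word and predict it" move can be sketched in a few lines. The function name `next_token_pairs`, the whitespace tokenisation, and the three-token context window are assumptions chosen for brevity; real language models use learned subword tokenisers and far longer contexts.

```python
def next_token_pairs(text, context_size=3):
    """Turn raw text into (context, target) pairs with no human labelling."""
    tokens = text.split()  # whitespace tokenisation, purely for illustration
    pairs = []
    for i in range(1, len(tokens)):
        # x is the preceding window of tokens, y is the token that was hidden.
        context = tuple(tokens[max(0, i - context_size):i])
        pairs.append((context, tokens[i]))
    return pairs


pairs = next_token_pairs("the cat sat on the mat")
for context, target in pairs:
    print(context, "->", target)
```

A six-token sentence produces five training pairs; the same procedure applied to a large corpus produces supervised-style examples at the scale of the corpus itself.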
Semi-supervised learning uses a small labelled set plus a large unlabelled one, on the assumption that p(x) tells you something about p(y|x): in particular, that decision boundaries should pass through low-density regions, not through the middle of a cluster.
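One common way to act on that assumption is pseudo-labelling: train on the few labels, label the unlabelled points the model is confident about, and retrain. The sketch below uses a one-dimensional nearest-centroid classifier; the data, the helper names, and the `min_margin` confidence threshold are illustrative assumptions rather than a reference implementation.

```python
def fit_centroids(points, labels):
    """Nearest-centroid classifier: one mean per class."""
    sums, counts = {}, {}
    for x, y in zip(points, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict(centroids, x):
    """Return (label, margin); margin is how much closer the best centroid is."""
    ranked = sorted(centroids, key=lambda y: abs(x - centroids[y]))
    best, runner_up = ranked[0], ranked[1]
    margin = abs(x - centroids[runner_up]) - abs(x - centroids[best])
    return best, margin


def pseudo_label(labelled_x, labelled_y, unlabelled_x, min_margin=1.0):
    """One round of pseudo-labelling followed by a refit."""
    centroids = fit_centroids(labelled_x, labelled_y)
    xs, ys = list(labelled_x), list(labelled_y)
    for x in unlabelled_x:
        label, margin = predict(centroids, x)
        if margin >= min_margin:  # skip points that sit near the boundary
            xs.append(x)
            ys.append(label)
    return fit_centroids(xs, ys)


# Two human labels, six unlabelled points; 5.1 lies in the sparse gap.
refined = pseudo_label([0.0, 10.0], ["a", "b"], [1.0, 2.0, 3.0, 5.1, 8.0, 9.0])
print(refined)
```

The ambiguous point at 5.1 is left out, so the refit centroids move toward the dense regions and the implied boundary stays in the sparse gap between them.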
| Paradigm | Signal | Canonical algorithms | Output |
|---|---|---|---|
| Supervised | (x, y) pairs | linear/logistic regression, trees, SVMs, k-NN, neural nets | predictor f(x) |
| Unsupervised | x alone | k-means, GMM, DBSCAN, PCA, autoencoders | groups, manifolds, densities |
| Self-supervised | x predicting parts of x | next-token LMs, masked LMs, contrastive (SimCLR, CLIP) | representations to transfer |
| Semi-supervised | few (x, y) + many x | pseudo-labelling, consistency regularisation | predictor, label-efficient |
| Reinforcement | delayed scalar reward | Q-learning, policy gradient, PPO | policy π(a \| s) |
The practical decision is almost always economic. If labels are cheap and plentiful, plain supervised learning wins; it is the most direct route from data to the quantity you care about. When labels are expensive, the modern playbook is:
Pure unsupervised methods keep a separate, permanent role: exploration (what segments exist in these users?), compression, and anomaly detection, where no label could exist because the interesting events have not happened yet.