Learning from almost nothing, by bringing almost everything
Classic supervised learning needs hundreds of examples per class. A person needs one ("this animal is an okapi") or zero ("a zebra-legged giraffe-like animal"). The difference is not a better learning rule; it is that the person arrives with a rich prior — a representation of animals, stripes, and legs — and the new class only needs to be placed within it, not learned from raw pixels.
Few-shot and zero-shot methods are all answers to one question: where does the prior come from, and how does a new class plug into it?
Zero-shot learning requires the unseen class to be describable, typically in natural language. The mechanism: embed inputs and class descriptions into the same space, then classify by similarity.
CLIP is the canonical version: an image encoder f and a text encoder g are trained contrastively on image–caption pairs, so that the embedding of "a photo of an okapi" lands near the embeddings of okapi photos. A new class costs one sentence and zero gradient steps. The same pattern powers zero-shot retrieval, and it carries over to zero-shot classification with LLMs, where the "description" is simply the task instruction.
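The embed-and-compare step can be sketched in a few lines. This is a minimal illustration, not CLIP's actual code: the vectors below are made-up stand-ins for the outputs of f (image encoder) and g (text encoder), and the helper names `l2_normalize` and `zero_shot_classify` are hypothetical.

```python
import numpy as np

def l2_normalize(v: np.ndarray) -> np.ndarray:
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def zero_shot_classify(image_emb, text_embs, class_names):
    """Pick the class whose description embedding is closest to the image embedding."""
    sims = l2_normalize(text_embs) @ l2_normalize(image_emb)  # one cosine score per class
    return class_names[int(np.argmax(sims))], sims

# Assumption: toy 3-d embeddings standing in for g("a photo of a <class>").
class_names = ["okapi", "zebra", "giraffe"]
text_embs = np.array([
    [0.9, 0.1, 0.4],   # g("a photo of an okapi")
    [0.1, 0.9, 0.2],   # g("a photo of a zebra")
    [0.2, 0.1, 0.9],   # g("a photo of a giraffe")
])

# Assumption: toy embedding standing in for f(image) of an okapi photo.
image_emb = np.array([0.8, 0.2, 0.5])

label, sims = zero_shot_classify(image_emb, text_embs, class_names)
print(label, np.round(sims, 3))
```

Adding a class here means appending one name and one row to `text_embs`; nothing is retrained, which is the "one sentence, zero gradient steps" property in miniature.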
The cost: performance is bounded by how well language captures the visual (or task) distinction. Classes that are easy to see but hard to describe, such as fine-grained species or subtle texture differences, transfer poorly.
With a handful of examples, the older answer was meta-learning: train across many small tasks so the model learns how to adapt — either a metric space where new classes form clusters from a few points (prototypical networks), or an initialisation that finetunes well in a few steps (MAML).
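The metric-space variant is simple enough to sketch. The example below follows the prototypical-network recipe at inference time only: average the k support embeddings per class into a prototype, then assign each query to the nearest prototype. The embeddings are toy values assumed to come from an already meta-trained encoder, and the function names are hypothetical.

```python
import numpy as np

def class_prototypes(support_emb, support_labels):
    """Average the support embeddings of each class into one prototype."""
    classes = np.unique(support_labels)
    protos = np.stack([support_emb[support_labels == c].mean(axis=0) for c in classes])
    return classes, protos

def nearest_prototype(query_emb, classes, protos):
    """Label each query with the class of its closest prototype."""
    # Squared Euclidean distance from every query to every prototype.
    d2 = ((query_emb[:, None, :] - protos[None, :, :]) ** 2).sum(axis=-1)
    return classes[np.argmin(d2, axis=1)]

# Assumption: a 2-way, 3-shot episode with made-up 2-d embeddings.
support_emb = np.array([
    [1.0, 0.1], [0.9, -0.1], [1.1, 0.0],     # class 0
    [-1.0, 0.2], [-0.8, 0.0], [-1.2, -0.2],  # class 1
])
support_labels = np.array([0, 0, 0, 1, 1, 1])
query_emb = np.array([[0.7, 0.3], [-0.9, 0.1]])

classes, protos = class_prototypes(support_emb, support_labels)
pred = nearest_prototype(query_emb, classes, protos)
print(protos, pred)
```

All the learning happens earlier, when the encoder is trained across many such episodes; adapting to a new class is just one mean and one distance computation.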
Then LLMs quietly absorbed the problem. In-context learning places the k examples in the prompt; the forward pass itself performs the adaptation, with no weight updates at all. The pretrained sequence model acts as a general-purpose learner conditioned on a tiny dataset.
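In code, "placing the k examples in the prompt" amounts to string formatting. The sketch below builds such a prompt; the `Input:`/`Label:` template and the function name `build_few_shot_prompt` are illustrative assumptions, and the call to an actual model is deliberately left out because it depends on the provider.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Format an instruction, k labelled examples, and one unlabelled query as a prompt."""
    lines = [instruction, ""]
    for text, label in examples:
        lines.append(f"Input: {text}")
        lines.append(f"Label: {label}")
        lines.append("")  # blank line between demonstrations
    lines.append(f"Input: {query}")
    lines.append("Label:")  # the model is expected to continue from here
    return "\n".join(lines)

# Assumption: a toy 2-shot sentiment task.
examples = [
    ("The okapi exhibit was wonderful.", "positive"),
    ("The queue was endless and the cafe was closed.", "negative"),
]
prompt = build_few_shot_prompt(
    "Classify the sentiment of each input as positive or negative.",
    examples,
    "The guide knew everything about giraffes.",
)
print(prompt)
```

The "dataset" is the `examples` list and the "training run" is a single forward pass over `prompt`. Changing the task means changing the strings, not the weights.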
| Approach | Prior lives in | Adaptation step | Status |
|---|---|---|---|
| Metric meta-learning | Learned embedding space | Average k embeddings per class | Niche (vision) |
| MAML-style | Initialisation | Few gradient steps | Largely superseded |
| In-context learning | Pretrained LLM weights | None — examples in the prompt | The modern default |
This is transfer learning taken to its limit: adaptation shrinks from retraining, to finetuning, to a prompt.
Zero-shot claims deserve suspicion proportional to the size of the pretraining corpus. A model pretrained on the open web has plausibly seen most "unseen" classes, benchmarks included. Honest evaluation asks: is the model generalising from a description, or remembering an example? Contamination checks and genuinely novel, post-cutoff test sets are the only clean answers (and they are rare).
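One common form of contamination check is n-gram overlap between test items and the pretraining corpus. The sketch below is a toy version under stated assumptions: the corpus fits in memory, tokens are lowercase word matches, and the window size `n=5` is an arbitrary illustrative choice. Real checks run at corpus scale and miss paraphrases, so a clean result is weak evidence, not proof.

```python
import re

def ngrams(text, n):
    """Return the set of word-level n-grams in a lowercased text."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_rate(test_items, corpus_docs, n=5):
    """Flag test items that share at least one n-gram with any corpus document."""
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    flagged = [item for item in test_items if ngrams(item, n) & corpus_grams]
    rate = len(flagged) / len(test_items) if test_items else 0.0
    return rate, flagged

# Assumption: tiny made-up corpus and test set, for illustration only.
corpus_docs = [
    "The okapi is a forest mammal native to the Congo basin.",
    "Zebras have black and white stripes.",
]
test_items = [
    "Is the okapi a forest mammal native to central Africa?",
    "Which animal has spotted fur and climbs trees?",
]

rate, flagged = contamination_rate(test_items, corpus_docs)
print(rate, flagged)
```

Here the first test item is flagged because it shares the 5-gram "a forest mammal native to" with the corpus, while the second shares none.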