General ML

Few-Shot / Zero-Shot Learning

Learning from almost nothing, by bringing almost everything

01 · First principles · The gap the terms name

Classic supervised learning typically needs hundreds of labelled examples per class. A person needs one ("this animal is an okapi") or zero ("a zebra-legged, giraffe-like animal"). The difference is not a better learning rule; it is that the person arrives with a rich prior (a representation of animals, stripes, and legs), so the new class only needs to be placed within that representation, not learned from raw pixels.

Few-shot and zero-shot methods are all answers to one question: where does the prior come from, and how does a new class plug into it?

02 · Zero-shot · A shared semantic space

Zero-shot needs the unseen class to be describable. The mechanism: embed inputs and class descriptions into the same space, and classify by similarity.

class(x) = argmax_c sim(f(x), g(text_c))

CLIP is the canonical version: an image encoder f and a text encoder g are trained contrastively on image–caption pairs, so that the embedding of "a photo of an okapi" lands near the embeddings of okapi photos, with sim taken as cosine similarity. A new class costs one sentence and zero gradient steps. The same idea carries over to zero-shot retrieval, and by analogy to zero-shot use of LLMs, where the "description" is simply the task instruction.
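The classification rule above can be sketched in a few lines. This is a minimal illustration, not CLIP itself: the three-dimensional vectors are hand-written stand-ins for encoder outputs, and the names `cosine` and `zero_shot_classify` are hypothetical helpers introduced here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_classify(input_embedding, class_text_embeddings):
    """Return argmax_c sim(f(x), g(text_c)).

    input_embedding: f(x), assumed to come from some image encoder.
    class_text_embeddings: dict mapping class name -> g(text_c).
    """
    return max(
        class_text_embeddings,
        key=lambda c: cosine(input_embedding, class_text_embeddings[c]),
    )

# Toy vectors standing in for real encoder outputs (assumption, not real data).
text_embeddings = {
    "okapi": [0.9, 0.1, 0.3],
    "zebra": [0.2, 0.9, 0.1],
    "giraffe": [0.1, 0.2, 0.9],
}
photo = [0.8, 0.2, 0.4]  # pretend f(x) for an okapi photo
predicted = zero_shot_classify(photo, text_embeddings)
```

Adding a class here means adding one entry to `text_embeddings`; nothing is retrained, which is the point of the shared space.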

The cost: performance is bounded by how well language captures the visual (or task) distinction. Classes that are easy to see but hard to describe transfer poorly.

03 · Few-shot · From meta-learning to in-context learning

With a handful of examples, the older answer was meta-learning: train across many small tasks so the model learns how to adapt — either a metric space where new classes form clusters from a few points (prototypical networks), or an initialisation that finetunes well in a few steps (MAML).
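The metric-space idea reduces to a small amount of arithmetic once embeddings exist. The sketch below assumes the embeddings already come from a meta-trained encoder (here they are hand-written toy vectors); `class_prototypes` and `nearest_prototype` are hypothetical names, and Euclidean distance is one common choice of metric.

```python
import math
from collections import defaultdict

def class_prototypes(support):
    """Average the support embeddings of each class into one prototype.

    support: list of (embedding, label) pairs, k per class.
    """
    grouped = defaultdict(list)
    for embedding, label in support:
        grouped[label].append(embedding)
    return {
        label: [sum(dim) / len(vectors) for dim in zip(*vectors)]
        for label, vectors in grouped.items()
    }

def nearest_prototype(query, prototypes):
    """Classify a query embedding by Euclidean distance to each prototype."""
    return min(prototypes, key=lambda label: math.dist(query, prototypes[label]))

# Two classes, k = 2 support examples each (toy embeddings).
support_set = [
    ([1.0, 0.0], "okapi"), ([0.8, 0.2], "okapi"),
    ([0.0, 1.0], "zebra"), ([0.2, 0.8], "zebra"),
]
prototypes = class_prototypes(support_set)
label = nearest_prototype([0.7, 0.1], prototypes)
```

All of the learning effort sits in the encoder that produced the embeddings; the adaptation step itself is just a mean and a nearest-neighbour lookup.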

Then LLMs quietly absorbed the problem. In-context learning places the k examples in the prompt; the forward pass itself performs the adaptation, with no weight updates at all. The pretrained sequence model acts as a general-purpose learner conditioned on a tiny dataset.
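In practice, "placing the k examples in the prompt" is plain string formatting. The sketch below shows one possible layout; the `Input:`/`Label:` template and the function name `build_few_shot_prompt` are assumptions for illustration, and the call to an actual model is deliberately left out.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Format k labelled examples plus one unlabelled query as a single prompt."""
    lines = [instruction, ""]
    for text, label in examples:
        lines.append(f"Input: {text}")
        lines.append(f"Label: {label}")
        lines.append("")
    # The query is left unlabelled; the model is expected to continue the pattern.
    lines.append(f"Input: {query}")
    lines.append("Label:")
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "Classify the sentiment of each review as positive or negative.",
    [("Loved every minute.", "positive"), ("A waste of two hours.", "negative")],
    "Surprisingly good.",
)
```

The examples play the role of a training set, but they only ever appear as input tokens; swapping tasks means swapping the string.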

Approach	Prior lives in	Adaptation step	Status
Metric meta-learning	Learned embedding space	Average k embeddings per class	Niche (vision)
MAML-style	Initialisation	Few gradient steps	Largely superseded
In-context learning	Pretrained LLM weights	None — examples in the prompt	The modern default
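For the MAML-style row, a toy version makes "the prior lives in the initialisation" concrete. The sketch below assumes a scalar linear model y = w * x with squared error, chosen so that the gradient through the inner step can be written by hand; real implementations differentiate through the inner loop with an autodiff library. All function names, task slopes, and learning rates are illustrative.

```python
def mse(w, xs, ys):
    """Mean squared error of the scalar model y_hat = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def mse_grad(w, xs, ys):
    """Derivative of the mean squared error with respect to w."""
    return 2.0 * sum(x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)

def inner_adapt(w, support_x, support_y, inner_lr):
    """One gradient step on a task's support set (the adaptation step)."""
    return w - inner_lr * mse_grad(w, support_x, support_y)

def maml_meta_gradient(w, task, inner_lr):
    """Derivative of the post-adaptation query loss w.r.t. the initialisation w."""
    sx, sy, qx, qy = task
    w_adapted = inner_adapt(w, sx, sy, inner_lr)
    # d(w_adapted)/dw = 1 - inner_lr * (second derivative of the support loss).
    support_curvature = 2.0 * sum(x * x for x in sx) / len(sx)
    return (1.0 - inner_lr * support_curvature) * mse_grad(w_adapted, qx, qy)

def make_task(slope):
    """A task is a line y = slope * x, split into support and query points."""
    support_x, query_x = [1.0, 2.0], [0.5, 1.5]
    return (support_x, [slope * x for x in support_x],
            query_x, [slope * x for x in query_x])

tasks = [make_task(s) for s in (1.0, 2.0, 3.0)]
w, inner_lr, outer_lr = 0.0, 0.05, 0.05

def meta_loss(w_init):
    """Average query loss after one inner step from w_init, over all tasks."""
    return sum(
        mse(inner_adapt(w_init, t[0], t[1], inner_lr), t[2], t[3]) for t in tasks
    ) / len(tasks)

loss_before = meta_loss(w)
for _ in range(200):
    w -= outer_lr * sum(maml_meta_gradient(w, t, inner_lr) for t in tasks) / len(tasks)
loss_after = meta_loss(w)
```

The outer loop never tries to fit any single task; it moves the starting point so that one inner step lands closer to each task's solution.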

This is transfer learning taken to its limit: adaptation shrinks from retraining, to finetuning, to a prompt.

04 · The caveat · Was it really unseen?

Zero-shot claims deserve suspicion proportional to the size of the pretraining corpus. A model pretrained on the open web has plausibly seen most "unseen" classes, benchmarks included. Honest evaluation asks: is the model generalising from a description, or remembering an example? Contamination checks and genuinely novel, post-cutoff test sets are the only clean answers (and they are rare).
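One simple form of contamination check is to measure how many word n-grams of a test item already appear in a sample of the pretraining corpus. The sketch below is a rough heuristic under stated assumptions: whitespace tokenisation, n = 5, and hypothetical helper names. It only catches near-verbatim overlap, not paraphrases, and it presumes access to the corpus, which is often unavailable.

```python
def ngrams(text, n):
    """Set of word-level n-grams in a lowercased, whitespace-tokenised text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_fraction(test_item, corpus_docs, n=5):
    """Fraction of the test item's n-grams that appear anywhere in the corpus sample."""
    item_grams = ngrams(test_item, n)
    if not item_grams:
        return 0.0  # item shorter than n tokens: nothing to compare
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)

# Toy corpus sample and two test items (invented for illustration).
corpus_sample = [
    "quiz answers the okapi is the only living relative of the giraffe and lives in the congo",
]
leaked = overlap_fraction("The okapi is the only living relative of the giraffe", corpus_sample)
fresh = overlap_fraction("Which forest animal has striped legs and a long dark tongue", corpus_sample)
```

A high fraction suggests the item may have been seen during pretraining; a low fraction is weaker evidence, since rephrased copies pass unnoticed.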

Mental Model

Few-shot ability is never about the few shots; it is about the prior the model brings to them.
Zero-shot = classify by similarity to a description in a shared embedding space; the description replaces the data.
Few-shot's modern form is in-context learning: the prompt is the training set, the forward pass is the learner.
Skepticism rule: the bigger the pretraining corpus, the weaker the meaning of "unseen".