Domain Adaptation

The model is fine; the world moved

01 · First principles · The assumption that breaks silently

Every supervised guarantee rests on one premise: training and deployment data come from the same distribution. Nothing enforces it. Cameras change, users drift, hospitals differ, slang evolves. The model keeps emitting confident predictions on inputs its training data never represented, and unlike a crash, this failure produces no error message. Accuracy decays quietly until someone looks.

Domain adaptation is the study of what to do when the source distribution (where labels are) and the target distribution (where the model runs) differ.

02 · The taxonomy · Three ways a distribution shifts

Write the joint as p(x, y) = p(y|x)·p(x) = p(x|y)·p(y). One factor can move while its partner stays fixed, and the remedy depends on which one did:

Shift: Covariate
What changes: p(x); p(y|x) intact
Example: New camera, new demographic
Diagnostic: Train a classifier to tell source inputs from target inputs; if it can, p(x) moved
Remedy: Importance weighting; finetune on target inputs

Shift: Label
What changes: p(y); p(x|y) intact
Example: Disease prevalence rises
Diagnostic: Compare predicted label frequencies to target reality
Remedy: Reweight / recalibrate the prior

Shift: Concept
What changes: p(y|x) itself
Example: "Spam" changes meaning; fraud adapts
Diagnostic: Labelled target data only; nothing cheaper detects it
Remedy: Retrain; no reweighting can save a changed truth
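
The covariate-shift diagnostic and its importance-weighting remedy can be sketched in a few lines. This is a minimal illustration, not a production recipe: it assumes one-dimensional inputs and a hand-rolled logistic regression, and every function name here (`fit_domain_classifier`, `domain_accuracy`, `importance_weight`) is hypothetical. The weight uses the identity p_target(x) / p_source(x) = (n_source / n_target) × odds that x is from the target.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_domain_classifier(source, target, lr=0.5, epochs=500):
    """Fit P(domain = target | x) by full-batch gradient descent (1-D inputs)."""
    data = [(x, 0.0) for x in source] + [(x, 1.0) for x in target]
    w, b = 0.0, 0.0
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, d in data:
            err = sigmoid(w * x + b) - d  # gradient of the log-loss w.r.t. the logit
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / len(data)
        b -= lr * grad_b / len(data)
    return w, b


def domain_accuracy(w, b, source, target):
    """Held-out accuracy; a value near 0.5 means no shift was detected."""
    hits = sum(sigmoid(w * x + b) < 0.5 for x in source)
    hits += sum(sigmoid(w * x + b) >= 0.5 for x in target)
    return hits / (len(source) + len(target))


def importance_weight(w, b, x, n_source, n_target):
    """Estimate p_target(x) / p_source(x) from the classifier's odds."""
    odds = math.exp(w * x + b)  # P(target | x) / P(source | x)
    return (n_source / n_target) * odds


if __name__ == "__main__":
    rng = random.Random(0)
    source = [rng.gauss(0.0, 1.0) for _ in range(600)]
    target = [rng.gauss(2.0, 1.0) for _ in range(600)]  # synthetic covariate shift
    w, b = fit_domain_classifier(source[:300], target[:300])
    print("held-out domain accuracy:", domain_accuracy(w, b, source[300:], target[300:]))
```

Accuracy is measured on held-out halves so that a classifier that merely memorised noise does not raise a false alarm. In real use the inputs would be feature vectors and the classifier whatever model is convenient; the logic is unchanged.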

The table rewards memorising: the first two shifts can in principle be corrected without new labels; concept shift cannot, because the function being learned is no longer the same function.
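
To make "corrected without new labels" concrete for label shift, here is a rough sketch under two stated assumptions: p(x|y) is unchanged, and the source-trained classifier outputs calibrated probabilities. With those, the target posterior is proportional to the source posterior times p_target(y) / p_source(y), and the unknown target prior can be estimated from unlabelled predictions by an EM-style fixed-point iteration. The function names are hypothetical.

```python
def correct_posterior(probs, source_prior, target_prior):
    """Rescale source-trained class probabilities to a new class prior."""
    scaled = [p * t / s for p, s, t in zip(probs, source_prior, target_prior)]
    total = sum(scaled)
    return [v / total for v in scaled]


def estimate_target_prior(posteriors, source_prior, iterations=100):
    """EM-style estimate of the target class prior from unlabelled predictions.

    posteriors: one list of class probabilities per unlabelled target input,
    as produced by the source-trained classifier.
    """
    prior = list(source_prior)  # start from the source prior
    for _ in range(iterations):
        # E-step: re-express every prediction under the current prior estimate.
        corrected = [correct_posterior(p, source_prior, prior) for p in posteriors]
        # M-step: the new prior is the average corrected posterior per class.
        prior = [sum(col) / len(corrected) for col in zip(*corrected)]
    return prior
```

A typical use would call `estimate_target_prior` on a batch of recent predictions, then pass the result to `correct_posterior` at inference time. If the classifier is poorly calibrated, the estimate inherits that error, so calibration on source data comes first.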

03 · The classic mechanism · Aligning feature distributions

When the target has inputs but no labels (unsupervised DA), the classic idea is to make the learned features domain-invariant: if a discriminator cannot tell which domain a feature vector came from, a classifier trained on source features should transfer.

DANN (domain-adversarial neural network) implements this adversarially with a gradient-reversal layer: a domain classifier tries to separate the domains, while the feature extractor receives that classifier's gradient negated, so it learns to mix the domains together while still serving the label classifier.
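
The gradient-reversal idea is small enough to show with hand-written backpropagation. The sketch below is a toy, not DANN itself: it assumes a scalar feature f = w·x and a one-parameter domain head, omits the label classifier entirely, and uses hypothetical names (`GradientReversal`, `adversarial_step`). In an autodiff framework the same layer is usually written as a custom-gradient operation.

```python
import math


class GradientReversal:
    """Identity on the forward pass; scales the gradient by -lam on the way back."""

    def __init__(self, lam=1.0):
        self.lam = lam  # strength of the adversarial signal (hypothetical default)

    def forward(self, features):
        return list(features)

    def backward(self, grad):
        return [-self.lam * g for g in grad]


def domain_loss(w, v, x, d):
    """Binary cross-entropy of a toy domain head on the scalar feature f = w * x."""
    p = 1.0 / (1.0 + math.exp(-v * (w * x)))
    return -(d * math.log(p) + (1 - d) * math.log(1 - p))


def adversarial_step(w, v, x, d, grl, lr=0.1):
    """One update of the feature weight w and the domain-head weight v."""
    f = grl.forward([w * x])[0]             # features pass through unchanged
    p = 1.0 / (1.0 + math.exp(-v * f))      # domain head's P(target | f)
    dlogit = p - d                          # d(loss) / d(logit)
    grad_v = dlogit * f                     # domain head descends normally
    grad_f = grl.backward([dlogit * v])[0]  # extractor sees the negated gradient
    grad_w = grad_f * x
    return w - lr * grad_w, v - lr * grad_v
```

After one step, the head's new `v` lowers the domain loss while the extractor's new `w` raises it: both parameters follow plain gradient descent, and the reversal alone turns the extractor into the head's adversary.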

The catch: invariance is not free. Discarding domain information can also discard class-relevant information (aligning marginals does not align p(y|x)), and adversarial training inherits GAN-style instability. These methods help most when the shift is mostly stylistic (synthetic → real) rather than semantic.

04 · Operational reality · What actually ships

In production, the boring pipeline usually beats the clever algorithm. The sequence that works: monitor (track input statistics and prediction distributions, alert on drift), collect (label a small stream of recent target data), retrain (regularly, on a window that reflects today's distribution). Pretrained foundation models made this cheaper still: adapting a robust general representation with a little target data (see transfer learning) routinely outperforms purpose-built adaptation machinery.
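
As one possible shape for the "monitor" step, the sketch below compares a reference window of a single numeric signal (an input feature or the model's output score) against a recent window using the two-sample Kolmogorov-Smirnov statistic, the largest gap between the two empirical CDFs. The function names and the 0.2 alert threshold are assumptions for illustration; a real threshold would be tuned per signal and window size.

```python
from bisect import bisect_right


def ks_statistic(reference, recent):
    """Largest absolute gap between the empirical CDFs of two samples."""
    ref = sorted(reference)
    rec = sorted(recent)
    gap = 0.0
    # The gap between two step functions can only change at observed values.
    for v in ref + rec:
        cdf_ref = bisect_right(ref, v) / len(ref)
        cdf_rec = bisect_right(rec, v) / len(rec)
        gap = max(gap, abs(cdf_ref - cdf_rec))
    return gap


def drift_alert(reference, recent, threshold=0.2):
    """True when the recent window has drifted past the (hypothetical) threshold."""
    return ks_statistic(reference, recent) > threshold
```

Running this per feature and on the prediction distribution covers both halves of the monitoring advice above. It flags that something moved, not which factor of p(x, y) moved; that still takes the diagnostics from the taxonomy.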

The durable skill is not any single method. It is asking, whenever a deployed model degrades: which factor of p(x, y) moved?
