Low-rank adaptation: finetune the update, not the weights
The naive way to adapt a pretrained model is to keep training it. Every weight gets a gradient; every weight moves. The model has the capacity, the recipe is known — so what breaks?
Memory does, and not where people first guess. The weights themselves are the small part. With Adam, every trainable parameter drags along its gradient plus two optimizer moments. For a 7B model in mixed precision, the rough bill comes to about 16 bytes per parameter.
That is ~112 GB for 7B parameters before a single activation is counted (a catastrophe for a single GPU). And the bill repeats per task: ten finetuned variants means ten full copies of the model on disk and no way to serve them from one set of weights.
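The ~112 GB figure corresponds to 16 bytes per parameter. One accounting that produces that total is sketched below; the split into fp16 and fp32 components is an assumption made for illustration (a common mixed-precision Adam setup), not something the text specifies.

```python
# Assumed mixed-precision Adam accounting, in bytes per trainable parameter.
# Only the 16-byte total is implied by the 112 GB figure; the split is illustrative.
BYTES_PER_PARAM = {
    "fp16 weights": 2,
    "fp16 gradients": 2,
    "fp32 master weights": 4,
    "fp32 Adam first moment": 4,
    "fp32 Adam second moment": 4,
}


def training_memory_gb(n_params: int) -> float:
    """Weight + gradient + optimizer memory in GB (1 GB = 1e9 bytes), activations excluded."""
    return n_params * sum(BYTES_PER_PARAM.values()) / 1e9


for name, size in BYTES_PER_PARAM.items():
    print(f"{name:>24}: {size} bytes")
print(f"total per parameter: {sum(BYTES_PER_PARAM.values())} bytes")
print(f"7B parameters: {training_memory_gb(7_000_000_000):.0f} GB")
```

Under this split, everything except the 2-byte fp16 weights (14 of the 16 bytes) exists only because the parameter is trainable.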
The fix begins with an empirical fact about the update, not the model. Measure ΔW = W_finetuned − W_pretrained after a successful full finetune and you find its spectrum is dominated by a handful of directions. The adaptation lives on a low-dimensional manifold inside the full parameter space: the "intrinsic dimension" of the task is tiny compared to d² (Aghajanyan et al. measured hundreds, not millions).
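To make "dominated by a handful of directions" concrete, here is a minimal sketch of the measurement using a singular value decomposition. The matrices are synthetic stand-ins (a planted rank-4 update plus small noise), so this shows the procedure, not a result about any real model; `spectral_energy_fraction` is a hypothetical helper name.

```python
import numpy as np


def spectral_energy_fraction(delta_w: np.ndarray, k: int) -> float:
    """Fraction of the squared Frobenius norm carried by the top-k singular values."""
    s = np.linalg.svd(delta_w, compute_uv=False)  # singular values, sorted descending
    energy = s ** 2
    return float(energy[:k].sum() / energy.sum())


rng = np.random.default_rng(0)
d, planted_rank = 256, 4

# Synthetic stand-ins for a pretrained and a finetuned weight matrix.
w_pretrained = rng.normal(size=(d, d))
low_rank_update = rng.normal(size=(d, planted_rank)) @ rng.normal(size=(planted_rank, d))
w_finetuned = w_pretrained + low_rank_update + 0.01 * rng.normal(size=(d, d))

delta_w = w_finetuned - w_pretrained
frac_update = spectral_energy_fraction(delta_w, k=planted_rank)
frac_random = spectral_energy_fraction(rng.normal(size=(d, d)), k=planted_rank)

print(f"top-{planted_rank} energy in the update:        {frac_update:.4f}")
print(f"top-{planted_rank} energy in a random matrix: {frac_random:.4f}")
```

For the planted update, nearly all of the energy sits in the first four singular values, while an unstructured random matrix of the same size spreads it across all 256.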
Intuition: pretraining already built the features. Finetuning does not need to carve new circuitry; it needs to re-weight and re-route what exists. Re-routing is a low-rank operation. The analogy: you are not rebuilding the piano, you are adjusting a few dozen tuning pins.
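So we stop training W and instead train a low-rank parameterisation of its change. Freeze W ∈ ℝ^{d×d} entirely, and add a bypass: h = Wx + (α/r)·BAx, where B ∈ ℝ^{d×r} and A ∈ ℝ^{r×d} are the only trainable matrices and r ≪ d, so that ΔW = (α/r)·BA.
The frozen weight and the low-rank bypass run in parallel; only the bypass trains.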
Two small but load-bearing details. B is initialised to zero, so ΔW = 0 at step zero and training starts exactly at the pretrained model — no random perturbation to recover from. And the update is scaled by α/r, so changing the rank does not silently change the effective learning rate.
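A minimal forward-pass sketch of such a layer follows, assuming the usual LoRA factorisation ΔW = (α/r)·BA with B of shape d×r and A of shape r×d. The class name `LoRALinear`, the small Gaussian initialisation of A, and the NumPy setting are illustrative assumptions; training code is omitted.

```python
import numpy as np


class LoRALinear:
    """Frozen weight W plus a rank-r bypass scaled by alpha / r (forward pass only)."""

    def __init__(self, w: np.ndarray, r: int, alpha: float, seed: int = 0):
        d_out, d_in = w.shape
        rng = np.random.default_rng(seed)
        self.w = w                                        # frozen, never updated
        self.a = rng.normal(scale=0.01, size=(r, d_in))   # trainable; small random init (assumed)
        self.b = np.zeros((d_out, r))                     # trainable; zero init, so delta W = 0 at step zero
        self.scale = alpha / r                            # keeps update magnitude comparable across ranks

    def delta_w(self) -> np.ndarray:
        """The effective weight update represented by the bypass."""
        return self.scale * (self.b @ self.a)

    def __call__(self, x: np.ndarray) -> np.ndarray:
        # x has shape (batch, d_in). The frozen path and the bypass run in parallel and are summed.
        return x @ self.w.T + self.scale * (x @ self.a.T) @ self.b.T


rng = np.random.default_rng(1)
d, r = 64, 4
w = rng.normal(size=(d, d))
layer = LoRALinear(w, r=r, alpha=8.0)
x = rng.normal(size=(3, d))

# With B = 0 the layer reproduces the pretrained output exactly.
print(np.allclose(layer(x), x @ w.T))
print(layer.a.size + layer.b.size, "trainable values vs", w.size, "frozen")
```

The bypass is computed as two thin matrix products, so the d×d update is never materialised during training.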
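Adapter methods before LoRA (bottleneck adapters, prefix tuning) inserted extra modules in series, so every forward pass paid extra latency forever. LoRA's bypass is purely additive and linear, which means at deployment time it can disappear: fold the trained update into the frozen weight once, W_merged = W + ΔW (with the α/r scale already included in ΔW), and drop the extra path.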
After the merge, the network is architecturally identical to the original — same shapes, same kernels, zero added latency. And because adapters are megabytes rather than gigabytes, one base model can serve many tasks by keeping the adapters unmerged and swapping them per request.
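The equivalence between the merged and unmerged forms can be checked numerically. The sketch below assumes the factorisation ΔW = (α/r)·BA with B of shape d×r and A of shape r×d; the adapter names and the random "trained" factors are hypothetical placeholders.

```python
import numpy as np


def merge(w: np.ndarray, a: np.ndarray, b: np.ndarray, alpha: float) -> np.ndarray:
    """Fold the bypass into the weight: W + (alpha / r) * B @ A. Same shape as W."""
    r = a.shape[0]
    return w + (alpha / r) * (b @ a)


def forward_unmerged(x, w, a, b, alpha):
    """Base path plus bypass, computed separately (the per-request swapping mode)."""
    r = a.shape[0]
    return x @ w.T + (alpha / r) * (x @ a.T) @ b.T


rng = np.random.default_rng(0)
d, r, alpha = 32, 4, 8.0
w = rng.normal(size=(d, d))  # one shared base weight

# Hypothetical per-task adapters, each stored as an (A, B) pair of random stand-ins.
adapters = {
    "task_a": (rng.normal(size=(r, d)), rng.normal(size=(d, r))),
    "task_b": (rng.normal(size=(r, d)), rng.normal(size=(d, r))),
}
x = rng.normal(size=(5, d))

results = {}
for name, (a, b) in adapters.items():
    y_bypass = forward_unmerged(x, w, a, b, alpha)  # adapter kept separate
    y_merged = x @ merge(w, a, b, alpha).T          # adapter folded in, plain matmul
    results[name] = np.allclose(y_bypass, y_merged)
    print(name, "merged == unmerged:", results[name])
```

Merging suits a single-task deployment, since inference is then one ordinary matrix multiply. Keeping the pairs unmerged leaves `w` untouched, so any adapter can be selected per request.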
| Dimension | Full finetune | LoRA |
|---|---|---|
| Trainable params | 100% | ~0.1–1% |
| Optimizer memory | ~14 bytes / param | negligible |
| Inference latency | baseline | baseline (after merge) |
| Storage per task | full model copy | MBs |
| Expressivity | unconstrained | rank-r updates only |
| Catastrophic forgetting | easy to induce | bounded by construction |
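As a rough check on the "Trainable params" row, the sketch below counts parameters for a single square d×d matrix with a rank-r adapter, assuming the adapter consists of two factors of shapes d×r and r×d. The hidden size 4096 is chosen purely for illustration; whole-model fractions depend on which matrices receive adapters.

```python
def lora_fraction(d: int, r: int) -> float:
    """Adapter parameters (d*r + r*d) divided by the d*d parameters of the full matrix."""
    return (2 * d * r) / (d * d)


d = 4096  # illustrative hidden size, not taken from the text
for r in (4, 8, 16):
    print(f"d={d}, r={r:>2}: {100 * lora_fraction(d, r):.2f}% of the full matrix")
```

The ratio simplifies to 2r/d, so at a fixed rank the trainable fraction shrinks as the model gets wider.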
The honest cost is the rank constraint. For style, format, instruction-following and domain adaptation, low-rank is empirically enough — those are re-weighting tasks. For injecting genuinely new knowledge or large behavioural shifts, full finetuning (or higher r) still wins; see Finetuning for when each is the right tool. The constraint is also a feature: a rank-r update simply cannot rewrite the model wholesale, which is a structural guard against forgetting.