LLMs

LoRA

Low-rank adaptation: finetune the update, not the weights

01 · First principlesWhat does full finetuning actually cost?

The naive way to adapt a pretrained model is to keep training it. Every weight gets a gradient; every weight moves. The model has the capacity, the recipe is known — so what breaks?

Memory does, and not where people first guess. The weights themselves are the small part. With Adam, every trainable parameter drags along its gradient plus two optimizer moments. For a 7B model in mixed precision, the rough bill per parameter:

memory ≈ 2 (weight) + 2 (gradient) + 8 (Adam m, v in fp32)  + fp32 master copy ≈ 16 bytes / param
store backprop optimizer state

That is ~112 GB for 7B parameters before a single activation is counted (a catastrophe for a single GPU). And the bill repeats per task: ten finetuned variants means ten full copies of the model on disk and no way to serve them from one set of weights.

02 · The observationFinetuning updates are low-rank

The fix begins with an empirical fact about the update, not the model. Measure ΔW = Wfinetuned − Wpretrained after a successful full finetune and you find its spectrum is dominated by a handful of directions. The adaptation lives on a low-dimensional manifold inside the full parameter space — the "intrinsic dimension" of the task is tiny compared to d² (Aghajanyan et al. measured hundreds, not millions).

Intuition: pretraining already built the features. Finetuning does not need to carve new circuitry; it needs to re-weight and re-route what exists. Re-routing is a low-rank operation. The analogy: you are not rebuilding the piano, you are adjusting a few dozen tuning pins.

03 · The mechanismΔW = BA

So we stop training W and instead train a low-rank parameterisation of its change. Freeze W ∈ ℝd×d entirely, and add a bypass:

h = W x + ΔW x = W x + B A x,    B ∈ ℝd×r, A ∈ ℝr×d,   r ≪ d
x W (d × d) FROZEN — no grads, no Adam state A (r × d) init gaussian B (d × r) init zero + h trainable params: 2dr instead of d² (r = 8, d = 4096 → 0.4%)

The frozen weight and the low-rank bypass run in parallel; only the terracotta path trains.

Two small but load-bearing details. B is initialised to zero, so ΔW = 0 at step zero and training starts exactly at the pretrained model — no random perturbation to recover from. And the update is scaled by α/r, so changing the rank does not silently change the effective learning rate.

Why the memory collapses: gradients and Adam moments exist only for A and B. At r = 8 on a 4096-wide layer, that is roughly 0.4% of the parameters, so the 14-bytes-per-param overhead applies to almost nothing. The frozen W can even be quantised to 4 bits — that single extra move is QLoRA.

04 · InferenceThe merge trick

Adapter methods before LoRA (bottleneck adapters, prefix tuning) inserted extra modules in series, so every forward pass paid extra latency forever. LoRA's bypass is purely additive and linear, which means at deployment time it can disappear:

W' = W + B A   (computed once, offline)

After the merge, the network is architecturally identical to the original — same shapes, same kernels, zero added latency. And because adapters are megabytes rather than gigabytes, one base model can serve many tasks by keeping the adapters unmerged and swapping them per request.

05 · The priceTradeoffs

DimensionFull finetuneLoRA
Trainable params100%~0.1–1%
Optimizer memory~14 bytes / paramnegligible
Inference latencybaselinebaseline (after merge)
Storage per taskfull model copyMBs
Expressivityunconstrainedrank-r updates only
Catastrophic forgettingeasy to inducebounded by construction

The honest cost is the rank constraint. For style, format, instruction-following and domain adaptation, low-rank is empirically enough — those are re-weighting tasks. For injecting genuinely new knowledge or large behavioural shifts, full finetuning (or higher r) still wins; see Finetuning for when each is the right tool. The constraint is also a feature: a rank-r update simply cannot rewrite the model wholesale, which is a structural guard against forgetting.

Mental Model