LoRA

Low-rank adaptation: finetune the update, not the weights

01 · First principles: What does full finetuning actually cost?

The naive way to adapt a pretrained model is to keep training it. Every weight gets a gradient; every weight moves. The model has the capacity, the recipe is known — so what breaks?

Memory does, and not where people first guess. The weights themselves are the small part. With Adam, every trainable parameter drags along its gradient plus two optimizer moments. For a 7B model in mixed precision, the rough bill per parameter:

memory ≈ 2 (fp16 weight) + 2 (fp16 gradient) + 8 (Adam m, v in fp32) + 4 (fp32 master copy) = 16 bytes / param

That is ~112 GB for 7B parameters before a single activation is counted, more than a single 80 GB GPU can hold. And the bill repeats per task: ten finetuned variants mean ten full copies of the model on disk and no way to serve them from one set of weights.
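The arithmetic above can be sketched in a few lines. This is a back-of-envelope helper, not a profiler: the function name `training_memory_gb` is made up for illustration, GB is taken as 10^9 bytes, and activations are excluded exactly as in the text.

```python
# Bytes per trainable parameter under mixed-precision Adam, as itemised above.
BYTES_PER_PARAM = {
    "fp16 weight": 2,
    "fp16 gradient": 2,
    "Adam m and v (fp32)": 8,
    "fp32 master copy": 4,
}


def training_memory_gb(n_params: float) -> float:
    """Weight, gradient and optimizer memory in GB (10^9 bytes); activations excluded."""
    return n_params * sum(BYTES_PER_PARAM.values()) / 1e9


print(training_memory_gb(7e9))  # 112.0
```

Changing one entry of the dictionary (say, dropping the master copy) shows immediately how much each term contributes to the total.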

02 · The observation: Finetuning updates are low-rank

The fix begins with an empirical fact about the update, not the model. Measure ΔW = W_finetuned − W_pretrained after a successful full finetune and you find its spectrum is dominated by a handful of directions. The adaptation lives in a low-dimensional subspace of the full parameter space: the "intrinsic dimension" of the task is tiny compared to the d² entries of even one d×d weight matrix (Aghajanyan et al. measured hundreds, not millions).

Intuition: pretraining already built the features. Finetuning does not need to carve new circuitry; it needs to re-weight and re-route what exists. Re-routing is a low-rank operation. The analogy: you are not rebuilding the piano, you are adjusting a few dozen tuning pins.
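What "dominated by a handful of directions" means can be made concrete with a singular value decomposition. The sketch below uses a synthetic ΔW (a rank-4 signal plus small noise), which is an assumption for illustration and not a measurement from a real finetune; `spectral_energy` is a hypothetical helper name.

```python
import numpy as np


def spectral_energy(delta_w: np.ndarray, k: int) -> float:
    """Fraction of the squared Frobenius norm captured by the top-k singular values."""
    s = np.linalg.svd(delta_w, compute_uv=False)  # returned in descending order
    return float(np.sum(s[:k] ** 2) / np.sum(s ** 2))


rng = np.random.default_rng(0)
d, true_rank = 256, 4

# Synthetic stand-in for W_finetuned - W_pretrained:
# a rank-4 signal plus small dense noise. Real updates must be measured, not assumed.
delta_w = rng.normal(size=(d, true_rank)) @ rng.normal(size=(true_rank, d))
delta_w += 0.01 * rng.normal(size=(d, d))

energy_top4 = spectral_energy(delta_w, k=4)
print(f"top 4 of {d} directions hold {energy_top4:.4f} of the energy")
```

Running the same function on a real pair of checkpoints, layer by layer, is how one would check the low-rank claim for a particular task.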

03 · The mechanism: ΔW = BA

So we stop training W and instead train a low-rank parameterisation of its change. Freeze W ∈ ℝ^(d×d) entirely, and add a bypass:

h = W x + ΔW x = W x + B A x,   where B ∈ ℝ^(d×r), A ∈ ℝ^(r×d), r ≪ d

Figure: the frozen weight W and the low-rank bypass BA run in parallel; only the bypass trains.

Two small but load-bearing details. B is initialised to zero (A gets a small random initialisation), so ΔW = BA = 0 at step zero and training starts exactly at the pretrained model, with no random perturbation to recover from. And the update is scaled by α/r, giving h = W x + (α/r) B A x with α a fixed constant, so changing the rank does not silently change the effective learning rate.
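A minimal forward-pass sketch of the bypass, in NumPy rather than a training framework. The class name `LoRALinear`, the 0.01 init scale for A, and the values r = 8, α = 16 are assumptions chosen for illustration; only the zero init of B and the α/r scaling come from the text.

```python
import numpy as np


class LoRALinear:
    """Frozen weight W plus a low-rank bypass (alpha / r) * B @ A."""

    def __init__(self, w: np.ndarray, r: int, alpha: float, seed: int = 0):
        d_out, d_in = w.shape
        rng = np.random.default_rng(seed)
        self.w = w                                       # frozen: never updated
        self.a = rng.normal(scale=0.01, size=(r, d_in))  # trainable; init scale is an assumption
        self.b = np.zeros((d_out, r))                    # trainable; zero init makes delta W = 0
        self.scale = alpha / r

    def __call__(self, x: np.ndarray) -> np.ndarray:
        # h = W x + (alpha / r) * B (A x); the d x d product B @ A is never formed.
        return self.w @ x + self.scale * (self.b @ (self.a @ x))


rng = np.random.default_rng(1)
d = 64
w = rng.normal(size=(d, d))
x = rng.normal(size=d)

layer = LoRALinear(w, r=8, alpha=16.0)
# Step zero: the bypass contributes nothing, so the output is the pretrained output.
assert np.array_equal(layer(x), w @ x)
```

Computing B (A x) instead of (B A) x keeps the extra work at 2·d·r multiply-adds per token rather than d².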

Why the memory collapses: gradients and Adam moments exist only for A and B. At r = 8 on a 4096-wide layer, A and B together hold 2 · 4096 · 8 = 65,536 parameters against 4096² ≈ 16.8M in W, roughly 0.4%, so the 14-bytes-per-param overhead applies to almost nothing. The frozen W can even be quantised to 4 bits; that extra move is the core of QLoRA.
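The 0.4% figure, and what it does to the per-layer bill, can be checked directly. This sketch assumes adapter parameters pay the same 16 bytes each as in section 01 and that the frozen W pays only its 2-byte weight; the function names are hypothetical.

```python
def lora_trainable_params(d: int, r: int) -> int:
    """Parameters in A (r x d) and B (d x r) for one d x d layer."""
    return 2 * d * r


def trainable_fraction(d: int, r: int) -> float:
    """Adapter parameters as a fraction of the frozen d x d weight."""
    return lora_trainable_params(d, r) / (d * d)


d, r = 4096, 8
MIB = 2**20

# Assumption: 16 bytes per trainable parameter, 2 bytes per frozen fp16 parameter.
full_state_mib = d * d * 16 / MIB
lora_state_mib = (lora_trainable_params(d, r) * 16 + d * d * 2) / MIB

print(f"trainable fraction: {trainable_fraction(d, r):.2%}")  # 0.39%
print(f"full finetune: {full_state_mib:.0f} MiB per layer")    # 256 MiB
print(f"LoRA:          {lora_state_mib:.0f} MiB per layer")    # 33 MiB
```

Of the 33 MiB, 32 MiB is the frozen weight itself, which is the part a 4-bit base shrinks further.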

04 · Inference: The merge trick

Parameter-efficient methods before LoRA charged rent at inference. Bottleneck adapters insert extra modules in series, so every forward pass pays extra latency forever; prefix tuning instead spends part of the sequence length on trainable prefix tokens. LoRA's bypass is purely additive and linear, which means at deployment time it can disappear:

W' = W + B A (computed once, offline)

After the merge, the network is architecturally identical to the original — same shapes, same kernels, zero added latency. And because adapters are megabytes rather than gigabytes, one base model can serve many tasks by keeping the adapters unmerged and swapping them per request.
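The merge is a single matrix addition, and it is easy to check that nothing changes numerically. In this sketch the adapter factors are random stand-ins for trained values, and `merge` is a hypothetical helper whose optional `scale` argument carries the α/r factor from section 03.

```python
import numpy as np


def merge(w: np.ndarray, a: np.ndarray, b: np.ndarray, scale: float = 1.0) -> np.ndarray:
    """Fold the bypass into the weight once, offline: W' = W + scale * B A."""
    return w + scale * (b @ a)


rng = np.random.default_rng(0)
d, r, alpha = 64, 8, 16.0
w = rng.normal(size=(d, d))  # frozen pretrained weight
a = rng.normal(size=(r, d))  # adapter factors (random stand-ins for trained values)
b = rng.normal(size=(d, r))
x = rng.normal(size=d)

unmerged = w @ x + (alpha / r) * (b @ (a @ x))  # two paths, as during training
w_merged = merge(w, a, b, scale=alpha / r)
merged = w_merged @ x                           # one matmul, as at deployment

assert w_merged.shape == w.shape     # same shape, so same kernels
assert np.allclose(unmerged, merged)  # same output up to float rounding
```

Serving many tasks from one base model is the opposite choice: keep `w` untouched in memory and apply whichever (a, b) pair the request calls for.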

05 · The price: Tradeoffs

Dimension	Full finetune	LoRA
Trainable params	100%	~0.1–1%
Optimizer memory	~14 bytes / param	negligible
Inference latency	baseline	baseline (after merge)
Storage per task	full model copy	MBs
Expressivity	unconstrained	rank-r updates only
Catastrophic forgetting	easy to induce	limited by the rank constraint

The honest cost is the rank constraint. For style, format, instruction-following and domain adaptation, low-rank is empirically enough: those are re-weighting tasks. For injecting genuinely new knowledge or large behavioural shifts, full finetuning (or higher r) still wins; see Finetuning for when each is the right tool. The constraint is also a feature: a rank-r update can only move each weight matrix along r directions, which limits how much of the pretrained behaviour it can overwrite. That is a structural brake on forgetting rather than a guarantee.
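The "rank-r updates only" row is a property of the parameterisation, whatever values training finds. A small check, with random factors standing in for trained ones and dimensions picked arbitrarily for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 128, 8
b = rng.normal(size=(d, r))
a = rng.normal(size=(r, d))

delta_w = b @ a                              # a full d x d grid of numbers...
rank = int(np.linalg.matrix_rank(delta_w))   # ...that spans at most r directions

assert delta_w.shape == (d, d)
assert rank <= r
print(f"delta W is {d} x {d}, numerical rank {rank}")
```

Raising r widens the set of reachable updates at a linear cost in adapter parameters, which is the "higher r" lever mentioned above.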

Mental Model

The cost of full finetuning is optimizer state, not weights; freeze W and the Adam bill vanishes.
Task adaptation is intrinsically low-dimensional, so parameterise the change as BA with r ≪ d.
B starts at zero: training begins exactly at the pretrained model.
BA is additive and linear, so it merges into W — zero inference latency, unlike serial adapters.
QLoRA = LoRA + 4-bit frozen base; same idea, smaller desk.