Applied ML

Data Parallelism (DDP)

Replicate the model, shard the batch, average the gradients

01 · First principlesThe wall: one GPU is too slow

The model fits on a single GPU; training it just takes too long. The operational fact that makes data parallelism possible is a property of the loss, not of the hardware: the loss over a batch is a mean over examples, and the gradient of a mean is the mean of per-example gradients.

∇L(θ) = (1/B) Σ_i=1..B ∇ℓ_i(θ) = (1/N) Σ_k=1..N ∇L_k(θ)

mean of N per-shard gradients, B/N examples each

So the batch can be split across N workers, each computing a gradient on its shard, and the average of those shard gradients is exactly the gradient of the full batch. No approximation is involved.

02 · The mechanismReplicate, shard, allreduce

Every rank holds a full, identical copy of the model and the optimizer states.
Each step, the global batch is split into N shards; each rank runs forward and backward on its shard only.
Gradients are averaged across ranks with an allreduce.
Every rank applies the same averaged gradient with the same optimizer step, so the replicas stay bit-identical without ever exchanging weights.

The key invariant: after the allreduce, training is mathematically identical to single-GPU training with the large global batch. DDP does not change the optimization problem; it changes who computes which term of a sum.

The one genuine optimization-side consequence is that you are now training at batch size B×N, not B. Large-batch effects (the need to retune or warm up the learning rate, flatter-minima folklore) come from the batch size, not from the parallelism itself.

03 · Failure firstThe naive version wastes the network

The naive implementation finishes the entire backward pass, then allreduces all gradients in one blocking call. During backward the network sits idle; during the allreduce the GPUs sit idle. The two phases serialize, and at scale the communication phase is a noticeable fraction of the step.

The fix exploits an ordering fact: backward produces gradients layer by layer, last layer first. A gradient is final the moment its layer's backward completes, so its allreduce can start immediately while earlier layers are still computing. DDP groups parameters into buckets (roughly 25 MB by default in PyTorch) and fires an asynchronous allreduce per bucket as it fills.

Bucketing lets the allreduce of late-layer gradients run under the backward compute of early layers; only the last bucket's communication is left exposed.

The cost of bucketing is mild: a copy into flat bucket buffers, sensitivity to bucket size (too small means many launches, too large means less overlap), and the requirement that every rank produce gradients for the same parameters in the same step (conditional execution of submodules needs find_unused_parameters=True, which adds a graph walk per iteration).

04 · Scaling behaviourWhat it costs as N grows

Per step, each rank communicates roughly 2 bytes-on-the-wire per gradient byte (a ring allreduce moves about 2·size·(N−1)/N bytes per rank; see communication primitives). That volume is almost independent of N, which is why DDP scales gracefully: compute per rank stays fixed, communication per rank stays fixed, and only latency and stragglers erode efficiency.

Two practical erosions to expect:

Stragglers. The allreduce is a barrier; the step runs at the speed of the slowest rank. One slow GPU, one bad data shard, one noisy neighbour, and everyone waits.
Small per-rank batches. Scaling N at fixed global batch shrinks the per-rank batch until kernels stop saturating the GPU, and BatchNorm statistics get noisy (SyncBatchNorm exists, at the price of extra communication).

05 · The hard limitWhen DDP cannot help

DDP's premise is that everything — parameters, gradients, optimizer states, activations — fits on each single GPU, because each rank holds a full replica. In mixed precision with Adam that replica costs about 16 bytes per parameter before activations (the accounting is in the FSDP note). A 7B-parameter model needs roughly 112 GB of state per rank, which no single accelerator holds.

At that point the replication itself is the problem, and the answer is to shard the model state too: FSDP / ZeRO shards parameters, gradients, and optimizer states across the same ranks, or tensor and pipeline parallelism split the model's computation itself.

Mental Model

DDP is large-batch training in disguise: shard gradients averaged by allreduce equal the full-batch gradient exactly.
Replicas never exchange weights; they stay identical because they apply identical updates.
Bucketed, asynchronous allreduce hides communication under backward compute — only the last bucket is exposed.
Per-rank communication volume is roughly constant in N; stragglers and shrinking per-rank batches are what actually hurt.
The moment one replica does not fit on one GPU, DDP is the wrong tool — shard the state (FSDP) or the computation (TP/PP).