Replicate the model, shard the batch, average the gradients
The model fits on a single GPU; training it just takes too long. The operational fact that makes data parallelism possible is a property of the loss, not of the hardware: the loss over a batch is a mean over examples, and the gradient of a mean is the mean of per-example gradients.
So the batch can be split across N workers, each computing a gradient on its shard, and the average of those shard gradients is exactly the gradient of the full batch. No approximation is involved.
The one genuine optimization-side consequence is that you are now training at batch size B×N, not B. Large-batch effects (the need to retune or warm up the learning rate, flatter-minima folklore) come from the batch size, not from the parallelism itself.
The naive implementation finishes the entire backward pass, then allreduces all gradients in one blocking call. During backward the network sits idle; during the allreduce the GPUs sit idle. The two phases serialize, and at scale the communication phase is a noticeable fraction of the step.
The fix exploits an ordering fact: backward produces gradients layer by layer, last layer first. A gradient is final the moment its layer's backward completes, so its allreduce can start immediately while earlier layers are still computing. DDP groups parameters into buckets (roughly 25 MB by default in PyTorch) and fires an asynchronous allreduce per bucket as it fills.
Bucketing lets the allreduce of late-layer gradients run under the backward compute of early layers; only the last bucket's communication is left exposed.
The cost of bucketing is mild: a copy into flat bucket buffers, sensitivity to bucket size (too small means many launches, too large means less overlap), and the requirement that every rank produce gradients for the same parameters in the same step (conditional execution of submodules needs find_unused_parameters=True, which adds a graph walk per iteration).
Per step, each rank communicates roughly 2 bytes-on-the-wire per gradient byte (a ring allreduce moves about 2·size·(N−1)/N bytes per rank; see communication primitives). That volume is almost independent of N, which is why DDP scales gracefully: compute per rank stays fixed, communication per rank stays fixed, and only latency and stragglers erode efficiency.
Two practical erosions to expect:
DDP's premise is that everything — parameters, gradients, optimizer states, activations — fits on each single GPU, because each rank holds a full replica. In mixed precision with Adam that replica costs about 16 bytes per parameter before activations (the accounting is in the FSDP note). A 7B-parameter model needs roughly 112 GB of state per rank, which no single accelerator holds.
At that point the replication itself is the problem, and the answer is to shard the model state too: FSDP / ZeRO shards parameters, gradients, and optimizer states across the same ranks, or tensor and pipeline parallelism split the model's computation itself.