The eight verbs distributed training compiles to
The moment training spans more than one GPU, some tensor on one device is needed on another, and the interconnect — not the arithmetic — becomes a thing you must budget for. NVLink moves hundreds of GB/s; cross-node Ethernet often moves tens. A 7B model's gradients are about 14 GB in bf16; shipping them naively, every step, is the kind of bill that quietly halves your throughput.
Rather than reasoning about raw sends and receives, the communication layers under the major frameworks (NCCL underneath PyTorch, the XLA collectives underneath JAX) expose a fixed vocabulary of collectives: operations in which every rank in a group participates at once. DDP, FSDP, and tensor parallelism are, at the wire level, just different sentences built from these eight verbs; pipeline parallelism adds plain point-to-point sends and receives between adjacent stages.
| Primitive | What moves | Who has what after | Where ML uses it |
|---|---|---|---|
| broadcast | One rank's tensor to all | Everyone has the same full copy | Initial weight sync in DDP |
| scatter | One rank's tensor, split into N slices | Each rank has a different slice | Distributing data shards |
| gather | Each rank's slice to one rank | One rank has the concatenation | Collecting eval metrics to rank 0 |
| allgather | Each rank's slice to every rank | Everyone has the full concatenation | FSDP parameter materialisation |
| reduce | Everyone's tensor, summed, to one rank | One rank has the sum | Rare in training loops |
| reduce-scatter | Everyone's tensor, summed, sliced | Each rank has a different 1/N of the sum | FSDP / ZeRO gradient sync |
| allreduce | Everyone's tensor, summed, to all | Everyone has the full sum | DDP gradient sync, TP activations |
| all-to-all | Slice (i,j) goes from rank i to rank j | Everyone holds one slice from everyone | MoE expert routing |
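The table's "who has what after" column can be read as ordinary list manipulation. The following is a minimal sketch of reference semantics for five of the eight verbs, assuming the ranks are simulated as entries in a Python list; the function names are illustrative, not any library's API, and nothing here moves data between real devices.

```python
# Reference semantics only: "ranks" are entries in a Python list, not devices.
# bufs[r] is the buffer held by rank r before the collective runs.

def broadcast(bufs, root=0):
    # Every rank ends up with a copy of the root's buffer.
    return [list(bufs[root]) for _ in bufs]

def allgather(bufs):
    # Every rank ends up with the concatenation of all ranks' slices.
    full = [x for buf in bufs for x in buf]
    return [list(full) for _ in bufs]

def reduce_scatter(bufs):
    # Sum elementwise across ranks, then hand rank r the r-th 1/N slice.
    n = len(bufs)
    total = [sum(col) for col in zip(*bufs)]
    k = len(total) // n  # assumes the length divides evenly by N
    return [total[r * k:(r + 1) * k] for r in range(n)]

def allreduce(bufs):
    # Every rank ends up with the full elementwise sum.
    total = [sum(col) for col in zip(*bufs)]
    return [list(total) for _ in bufs]

def all_to_all(bufs):
    # bufs[i][j] is the slice rank i addresses to rank j: a transpose.
    n = len(bufs)
    return [[bufs[i][j] for i in range(n)] for j in range(n)]

bufs = [[1, 2, 3, 4], [10, 20, 30, 40]]  # two ranks, four elements each
print(reduce_scatter(bufs))  # [[11, 22], [33, 44]]
print(allreduce(bufs))       # [[11, 22, 33, 44], [11, 22, 33, 44]]
print(all_to_all([["a0", "a1"], ["b0", "b1"]]))  # [['a0', 'b0'], ['a1', 'b1']]
```

The remaining three (scatter, gather, reduce) are the one-rank versions of reduce-scatter's slicing, allgather's concatenation, and allreduce's sum.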
One algebraic fact carries most of modern memory-efficient training. An allreduce can be performed in two phases: first a reduce-scatter, after which each rank holds a fully summed 1/N slice; then an allgather, which reassembles the full summed tensor on every rank.
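The two-phase decomposition is easy to check on simulated ranks. This is a small sketch, assuming ranks are entries in a Python list and that the tensor length divides evenly by N; the helper names are illustrative. Integers are used so the comparison is exact.

```python
import random

def reduce_scatter(bufs):
    # Phase 1: rank r ends up with the r-th 1/N slice of the elementwise sum.
    n = len(bufs)
    total = [sum(col) for col in zip(*bufs)]
    k = len(total) // n
    return [total[r * k:(r + 1) * k] for r in range(n)]

def allgather(bufs):
    # Phase 2: every rank ends up with all slices, concatenated in rank order.
    full = [x for buf in bufs for x in buf]
    return [list(full) for _ in bufs]

def allreduce(bufs):
    # One-shot version: every rank ends up with the full elementwise sum.
    total = [sum(col) for col in zip(*bufs)]
    return [list(total) for _ in bufs]

random.seed(0)
n, length = 4, 8
bufs = [[random.randint(-5, 5) for _ in range(length)] for _ in range(n)]

two_phase = allgather(reduce_scatter(bufs))
assert two_phase == allreduce(bufs)  # same result on every rank
```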
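This is not just an implementation detail; it is the seam that ZeRO cuts along. DDP runs both phases back to back inside one call. ZeRO stops after the first phase, lets each rank update only the slice it owns, and defers (or repurposes) the second phase as an allgather of the updated parameters. Sharding gradients and optimizer state (ZeRO stages 1 and 2) therefore pays essentially the same communication volume DDP already paid: the redundancy was in the storage, not the wire. Sharding the parameters as well (ZeRO stage 3, FSDP) adds one more allgather per step, for roughly 1.5× DDP's volume.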
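To make the seam concrete, here is a minimal sketch comparing a DDP-style step with a ZeRO-style step on simulated ranks. It assumes plain SGD on averaged gradients and a parameter count divisible by N; real ZeRO shards optimizer state such as Adam moments, which this toy omits. The names `ddp_step` and `zero_style_step` are hypothetical.

```python
def ddp_step(params, grads, lr):
    n = len(grads)
    # Allreduce: every rank receives the full gradient sum...
    total = [sum(col) for col in zip(*grads)]
    # ...and every rank redundantly applies the whole update.
    new = [p - lr * g / n for p, g in zip(params, total)]
    return [list(new) for _ in range(n)]

def zero_style_step(params, grads, lr):
    n = len(grads)
    k = len(params) // n  # assumes the parameters split evenly across ranks
    total = [sum(col) for col in zip(*grads)]
    # Reduce-scatter: rank r keeps only its 1/N slice of the summed gradient.
    owned = [total[r * k:(r + 1) * k] for r in range(n)]
    # Each rank updates only the parameter slice it owns.
    updated = [
        [p - lr * g / n for p, g in zip(params[r * k:(r + 1) * k], owned[r])]
        for r in range(n)
    ]
    # Allgather: the deferred second phase now carries updated parameters.
    full = [x for piece in updated for x in piece]
    return [list(full) for _ in range(n)]

params = [1.0, 2.0, 3.0, 4.0]
grads = [[0.5, 1.0, 1.5, 2.0], [1.5, 1.0, 0.5, 0.0]]  # one gradient per rank
assert ddp_step(params, grads, lr=0.1) == zero_style_step(params, grads, lr=0.1)
```

Both paths use one reduce-scatter and one allgather; what changes is that the update runs on a 1/N slice between them rather than on the full tensor after them.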
The naive allreduce, in which every rank sends its whole tensor to every other rank, makes each rank send N−1 full copies, so per-rank traffic grows linearly with N and melts at scale. The ring algorithm arranges the N ranks in a circle and splits the tensor into N chunks. In the reduce-scatter phase, each rank repeatedly sends one chunk to its right neighbour and adds the chunk arriving from its left; after N−1 steps every rank holds one fully summed chunk. The allgather phase circulates those finished chunks for another N−1 steps.
Every link carries size/N bytes per step, and all links are busy simultaneously. After 2(N−1) steps the allreduce is done.
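The ring schedule described above can be simulated in a few lines. This is a sketch under stated assumptions: ranks are list entries, the tensor length divides evenly by N, and the chunk-indexing convention (rank r sends chunk (r − s) mod N at step s) is one common choice rather than the only one. `ring_allreduce` is an illustrative name, not a library call.

```python
def ring_allreduce(bufs):
    n = len(bufs)
    k = len(bufs[0]) // n  # chunk length; assumes an even split into N chunks
    # data[r][c] is rank r's current copy of chunk c.
    data = [[list(buf[c * k:(c + 1) * k]) for c in range(n)] for buf in bufs]
    sent = [0] * n  # elements sent by each rank
    steps = 0

    def exchange(chunk_to_send, combine):
        nonlocal steps
        # Snapshot every outgoing message first: all links fire in the same step.
        msgs = [(chunk_to_send(r), list(data[r][chunk_to_send(r)])) for r in range(n)]
        for r, (c, payload) in enumerate(msgs):
            dst = (r + 1) % n  # right neighbour on the ring
            data[dst][c] = combine(data[dst][c], payload)
            sent[r] += len(payload)
        steps += 1

    # Reduce-scatter: add the arriving chunk into the local copy.
    for s in range(n - 1):
        exchange(lambda r: (r - s) % n,
                 lambda mine, got: [a + b for a, b in zip(mine, got)])
    # Allgather: overwrite the local copy with the finished chunk.
    for s in range(n - 1):
        exchange(lambda r: (r + 1 - s) % n,
                 lambda mine, got: got)

    out = [[x for chunk in data[r] for x in chunk] for r in range(n)]
    return out, sent, steps

bufs = [[100 * r + i for i in range(8)] for r in range(4)]  # 4 ranks, 8 elements
out, sent, steps = ring_allreduce(bufs)
expected = [sum(col) for col in zip(*bufs)]
assert all(row == expected for row in out)
print(steps, sent)  # 6 [12, 12, 12, 12]
```

With 4 ranks and 8 elements, each rank sends 12 elements in 6 steps: 3 chunks of 2 elements in each phase.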
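Count the traffic. Each rank sends (N−1) chunks of size/N in each of the two phases, so total bytes sent per rank:

$$\text{bytes sent per rank} = 2 \cdot (N-1) \cdot \frac{\text{size}}{N} = 2 \cdot \text{size} \cdot \frac{N-1}{N}$$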
The factor (N−1)/N approaches 1, so per-rank traffic is roughly 2× the tensor size regardless of how many GPUs participate. That near-independence from N is the entire reason data parallelism scales: adding ranks adds compute without adding per-rank gradient traffic. The cost that does grow with N is latency — 2(N−1) sequential steps — which is why rings hurt for small tensors and why tree and hierarchical variants exist for large clusters.
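A back-of-envelope calculator makes the bandwidth and latency trade-off visible. The sketch below assumes a simple cost model in which each of the 2(N−1) steps pays a fixed latency plus the time to push one size/N chunk over a link; the 100 GB/s bandwidth and 10 µs step latency are made-up illustrative figures, not measurements of any interconnect, and `ring_allreduce_cost` is a hypothetical helper.

```python
def ring_allreduce_cost(size_bytes, n, link_bytes_per_s, step_latency_s):
    steps = 2 * (n - 1)                          # sequential ring steps
    bytes_per_rank = 2 * size_bytes * (n - 1) / n
    # Each step: fixed latency plus one size/N chunk over the link.
    time_s = steps * step_latency_s + bytes_per_rank / link_bytes_per_s
    return steps, bytes_per_rank, time_s

LINK = 100e9      # assumed link bandwidth in bytes/s (illustrative)
LATENCY = 10e-6   # assumed per-step latency in seconds (illustrative)

for label, size in [("14 GB gradients", 14e9), ("1 MB tensor", 1e6)]:
    for n in (2, 8, 64, 1024):
        steps, sent, t = ring_allreduce_cost(size, n, LINK, LATENCY)
        print(f"{label:>16}  N={n:<5} steps={steps:<5} "
              f"sent/rank={sent / 1e9:8.4f} GB  time={t * 1e3:10.3f} ms")
```

Under these assumptions the large tensor's bytes per rank level off just under 2× its size as N grows, while the small tensor's time is increasingly dominated by the step count.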