Applied ML

Communication Primitives

The eight verbs distributed training compiles to

01 · First principlesWhy a vocabulary at all

The moment training spans more than one GPU, some tensor on one device is needed on another, and the interconnect — not the arithmetic — becomes a thing you must budget for. NVLink moves hundreds of GB/s; cross-node Ethernet often moves tens. A 7B model's gradients are about 14 GB in bf16; shipping them naively, every step, is the kind of bill that quietly halves your throughput.

Rather than reasoning about raw sends and receives, every framework (NCCL underneath PyTorch, the XLA collectives underneath JAX) speaks a fixed vocabulary of collectives: operations in which every rank participates at once. DDP, FSDP, tensor parallelism, and pipeline parallelism are, at the wire level, just different sentences built from these eight verbs.

02 · The vocabularyWhat moves, who ends up with what

Primitive	What moves	Who has what after	Where ML uses it
broadcast	One rank's tensor to all	Everyone has the same full copy	Initial weight sync in DDP
scatter	One rank's tensor, split into N slices	Each rank has a different slice	Distributing data shards
gather	Each rank's slice to one rank	One rank has the concatenation	Collecting eval metrics to rank 0
allgather	Each rank's slice to every rank	Everyone has the full concatenation	FSDP parameter materialisation
reduce	Everyone's tensor, summed, to one rank	One rank has the sum	Rare in training loops
reduce-scatter	Everyone's tensor, summed, sliced	Each rank has a different 1/N of the sum	FSDP / ZeRO gradient sync
allreduce	Everyone's tensor, summed, to all	Everyone has the full sum	DDP gradient sync, TP activations
all-to-all	Slice (i,j) goes from rank i to rank j	Everyone holds one slice from everyone	MoE expert routing

Reading the table: the "all" prefix means every rank ends up with the result; without it, one root rank does. The pairs (gather, allgather) and (reduce, allreduce) differ only in that final fan-out.

03 · The identityAllreduce = reduce-scatter + allgather

One algebraic fact carries most of modern memory-efficient training. An allreduce can be performed in two phases: first a reduce-scatter, after which each rank holds a fully summed 1/N slice; then an allgather, which reassembles the full summed tensor on every rank.

allreduce(g) = reduce_scatter(g) → allgather(·)

each rank: summed slice each rank: full sum

This is not just an implementation detail; it is the seam that ZeRO cuts along. DDP runs both phases back to back inside one call. ZeRO stops after the first phase, lets each rank update only the slice it owns, and defers (or repurposes) the second phase. Sharded training therefore pays essentially the same communication volume DDP already paid — the redundancy was in the storage, not the wire.

04 · The mechanismRing allreduce

The naive allreduce — everyone sends their whole tensor to everyone — moves N× the data and melts at scale. The ring algorithm arranges the N ranks in a circle and splits the tensor into N chunks. In the reduce-scatter phase, each rank repeatedly sends one chunk to its right neighbour and adds the chunk arriving from its left; after N−1 steps every rank holds one fully summed chunk. The allgather phase circulates those finished chunks for another N−1 steps.

Every link carries size/N bytes per step, all links busy simultaneously. After 2(N−1) steps the allreduce is done.

Count the traffic. Each rank sends (N−1) chunks of size/N in each phase, so total bytes sent per rank:

bytes/rank = 2 · size · (N−1)/N ≈ 2 · size

The factor (N−1)/N approaches 1, so per-rank traffic is roughly 2× the tensor size regardless of how many GPUs participate. That near-independence from N is the entire reason data parallelism scales: adding ranks adds compute without adding per-rank gradient traffic. The cost that does grow with N is latency — 2(N−1) sequential steps — which is why rings hurt for small tensors and why tree and hierarchical variants exist for large clusters.

05 · Reading systemsParallelism strategies as sentences

DDP: broadcast once at start, then one allreduce of gradients per step.
FSDP / ZeRO-3: allgather parameters per layer (forward and backward), reduce-scatter gradients — the split allreduce, with an optimizer step wedged into the seam.
Tensor parallelism: allreduce (or allgather) of activations inside every layer, which is why it demands NVLink-class bandwidth.
Mixture-of-experts: all-to-all twice per MoE layer, to route tokens to experts and back — the one common workload where all-to-all dominates.

Habit worth building: when a new distributed scheme appears, translate it into this vocabulary first. The memory and bandwidth bills follow mechanically from which verbs it uses, on which tensors, how often.

Mental Model

Eight verbs; the "all" prefix means everyone gets the result instead of one root rank.
allreduce = reduce-scatter + allgather, and ZeRO is what happens when you stop halfway and work on the slices.
Ring allreduce: each rank moves ≈ 2·size bytes no matter how many ranks join — bandwidth-optimal, but latency grows with N.
Every parallelism strategy is a sentence in this vocabulary; its cost is legible from the verbs alone.