Tensor Parallelism

Split the matmul itself, and pick the split so the syncs cancel

01 · First principles · When a single layer is the problem

FSDP shards storage but still executes each layer whole, on one GPU. Two situations break that:

Tensor parallelism answers by splitting the weight matrices of a single layer across devices, so every device computes a piece of every matmul. The whole game is choosing the split so that synchronization happens rarely.

02 · Failure first · The naive split syncs after every matmul

Split a weight matrix any way you like and the matmul still works, but the outputs land in pieces that must be reconciled before the next operation. Done carelessly, every single matmul ends in a collective. A transformer block has four big matmuls (the QKV and output projections in attention, and the two MLP layers); four allreduces per block per forward pass would drown the compute. The nonlinearity also removes the option of deferring a reduction: GeLU(x₁+x₂) ≠ GeLU(x₁) + GeLU(x₂), so partial sums must be reduced before any elementwise nonlinearity that follows them.
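The inequality can be checked numerically. The sketch below is a minimal single-process NumPy illustration, not a distributed implementation: it assumes the tanh approximation of GeLU, made-up shapes, and a row split of the weight matrix so that each simulated rank holds a partial sum.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GeLU (an assumption; the exact erf form behaves the same way here)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))   # activations: (batch, hidden), illustrative sizes
A = rng.standard_normal((8, 6))   # weight matrix

# Row split of A (with the matching column split of X):
# each simulated rank ends up holding a PARTIAL SUM of X @ A.
X1, X2 = X[:, :4], X[:, 4:]
A1, A2 = A[:4, :], A[4:, :]
partial_1 = X1 @ A1
partial_2 = X2 @ A2

reference = gelu(X @ A)
reduce_then_gelu = gelu(partial_1 + partial_2)        # sum first, then nonlinearity
gelu_then_reduce = gelu(partial_1) + gelu(partial_2)  # nonlinearity first, then sum

print("reduce, then GeLU matches:", np.allclose(reference, reduce_then_gelu))
print("GeLU, then reduce matches:", np.allclose(reference, gelu_then_reduce))
```

Only the reduce-first ordering reproduces the unsplit result, which is why a split that leaves partial sums in front of a nonlinearity forces a collective at that point.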

The design question is not "can we split" (we always can) but "which sequence of splits needs the fewest reductions, given where the nonlinearities sit".

03 · The mechanism · The Megatron pair: column then row

The MLP block computes Y = GeLU(XA), Z = YB. Megatron's trick is to split A by columns and B by rows:

A = [A₁ | A₂]  (columns)     B = [B₁ ; B₂]  (rows)
Z  =  GeLU(XA₁)B₁  +  GeLU(XA₂)B₂
rank 1 computes term 1 · rank 2 computes term 2 · one allreduce sums them

Why this works: a column split of A produces complete columns of XA — full values, not partial sums — so GeLU can be applied locally with no communication. Each GeLU output slice then meets exactly the rows of B it multiplies, producing a full-shape partial result. The two matmuls and the nonlinearity all run without talking; a single allreduce at the end sums the partials.

[Figure: Megatron MLP, two matmuls + GeLU, one allreduce. X is replicated on GPU 0 and GPU 1; A is column-split (A₁, A₂); GeLU runs locally with no comms; B is row-split (B₁, B₂); a single allreduce forms Z = Z₁ + Z₂, one sync per block half.]

Column split makes the nonlinearity local; row split makes the second matmul consume exactly what the first produced. The pair needs one allreduce in forward (and one in backward).
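The column-then-row pairing can be simulated in one process to confirm that it reproduces the unsplit MLP. This is a sketch under stated assumptions: two simulated ranks, the tanh approximation of GeLU, illustrative sizes, and a plain `+` standing in for the allreduce.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GeLU (an assumption for this sketch)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

rng = np.random.default_rng(0)
batch, hidden, ffn = 4, 8, 32            # illustrative sizes
X = rng.standard_normal((batch, hidden))  # replicated on both ranks
A = rng.standard_normal((hidden, ffn))
B = rng.standard_normal((ffn, hidden))

A1, A2 = np.split(A, 2, axis=1)  # column split: each rank owns complete columns of XA
B1, B2 = np.split(B, 2, axis=0)  # row split: rows matching that rank's GeLU output slice

# Each simulated rank runs matmul -> GeLU -> matmul with no communication.
Z1 = gelu(X @ A1) @ B1   # rank 1: full-shape partial result
Z2 = gelu(X @ A2) @ B2   # rank 2: full-shape partial result

Z = Z1 + Z2              # the single allreduce, simulated as a sum

Z_ref = gelu(X @ A) @ B  # unsplit reference
print("max abs difference:", np.abs(Z - Z_ref).max())
```

Both `Z1` and `Z2` already have the shape of the final output, so the only thing left to communicate is their sum.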

Attention gets the same treatment for free: heads are already independent matmuls, so splitting Q, K, V by heads (a column split of the projections) and the output projection by rows reproduces the pattern exactly. A transformer block ends up with two allreduces per forward pass — one after the attention block, one after the MLP — instead of four or more.
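The same check works for attention. The sketch below assumes a small unmasked multi-head self-attention with hypothetical sizes, projection columns laid out head by head, and two simulated ranks that each own half of the heads plus the matching rows of the output projection.

```python
import numpy as np

def softmax(s):
    s = s - s.max(axis=-1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

def attention_heads(X, Wq, Wk, Wv, head_dim):
    """Run attention for the heads contained in the given projection columns."""
    seq = X.shape[0]
    n_heads = Wq.shape[1] // head_dim
    # (seq, n_heads * head_dim) -> (n_heads, seq, head_dim)
    Q = (X @ Wq).reshape(seq, n_heads, head_dim).transpose(1, 0, 2)
    K = (X @ Wk).reshape(seq, n_heads, head_dim).transpose(1, 0, 2)
    V = (X @ Wv).reshape(seq, n_heads, head_dim).transpose(1, 0, 2)
    weights = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(head_dim))
    out = weights @ V                                  # (n_heads, seq, head_dim)
    return out.transpose(1, 0, 2).reshape(seq, n_heads * head_dim)

rng = np.random.default_rng(0)
seq, hidden, head_dim = 5, 8, 2                        # 4 heads, illustrative sizes
X = rng.standard_normal((seq, hidden))
Wq, Wk, Wv, Wo = (rng.standard_normal((hidden, hidden)) for _ in range(4))

reference = attention_heads(X, Wq, Wk, Wv, head_dim) @ Wo

# Two simulated ranks: a column split of Q/K/V by heads, a row split of Wo.
half = hidden // 2                                     # 2 heads per rank
partials = []
for cols in (slice(0, half), slice(half, hidden)):
    local = attention_heads(X, Wq[:, cols], Wk[:, cols], Wv[:, cols], head_dim)
    partials.append(local @ Wo[cols, :])               # full-shape partial result

Z = partials[0] + partials[1]                          # the single allreduce, simulated
print("max abs difference:", np.abs(Z - reference).max())
```

The softmax stays local because each head's scores depend only on that head's own Q and K columns, so no rank needs another rank's heads before the output projection.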

04 · The cost · Why TP stays inside a node

The cost structure is unforgiving. The allreduces carry activations (batch × sequence × hidden), they sit on the critical path of every block, and they cannot be overlapped the way DDP hides gradient comms, because the next operation literally needs the reduced value. Per-GPU matmuls also shrink by the TP degree, so each GPU does less arithmetic between syncs: arithmetic intensity drops while the syncs arrive more often. That combination is why TP is kept on the fastest links available, inside a node.
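A back-of-the-envelope estimate makes the size of these allreduces concrete. The sketch below uses hypothetical model dimensions and assumes 2-byte activations and two allreduces per block in the forward pass; it counts only the tensor being reduced, not the traffic pattern of any particular allreduce algorithm.

```python
def tp_allreduce_bytes(batch, seq, hidden, n_layers,
                       bytes_per_elem=2, allreduces_per_block=2):
    """Return (bytes per allreduce, total reduced bytes per forward pass)."""
    payload = batch * seq * hidden * bytes_per_elem      # one activation tensor
    forward_total = payload * allreduces_per_block * n_layers
    return payload, forward_total

# Hypothetical configuration, chosen only for illustration.
payload, forward_total = tp_allreduce_bytes(batch=8, seq=4096, hidden=8192, n_layers=32)

print(f"per allreduce: {payload / 2**20:.0f} MiB")
print(f"per forward pass: {forward_total / 2**30:.0f} GiB")
```

Under these assumed numbers every allreduce moves a 512 MiB tensor and a forward pass reduces 32 GiB in total, none of which can be hidden behind compute.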

05 · Placement · Where TP sits in the 3D stack

Large training runs compose the three parallelisms by their communication appetites: TP (chattiest) inside the node over NVLink, PP across nodes with its sparse point-to-point activations, and data parallelism (DDP or FSDP) across the remaining axis with its once-per-step gradient sync. Each tool covers the regime the others cannot.
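One way to picture the placement is as a mapping from a global rank to coordinates on the three axes. The function below is a hypothetical sketch, not any framework's API: it makes TP the fastest-varying axis and assumes the launcher assigns consecutive ranks to the same node, with 8 GPUs per node.

```python
def rank_to_coords(rank, tp, pp):
    """Map a global rank to (dp_idx, pp_idx, tp_idx), with TP varying fastest."""
    tp_idx = rank % tp
    pp_idx = (rank // tp) % pp
    dp_idx = rank // (tp * pp)
    return dp_idx, pp_idx, tp_idx

gpus_per_node = 8            # assumption: consecutive ranks share a node
tp, pp, dp = 8, 4, 2         # hypothetical degrees
world_size = tp * pp * dp

# Collect the ranks of each TP group (same dp and pp coordinates).
tp_groups = {}
for rank in range(world_size):
    dp_idx, pp_idx, tp_idx = rank_to_coords(rank, tp, pp)
    tp_groups.setdefault((dp_idx, pp_idx), []).append(rank)

# Every TP group should live on exactly one node.
tp_stays_in_node = all(
    len({r // gpus_per_node for r in ranks}) == 1
    for ranks in tp_groups.values()
)
print("TP groups confined to one node:", tp_stays_in_node)
```

Making TP the innermost axis is what keeps its allreduces on intra-node links; the order of the two outer axes is a separate choice that this sketch fixes arbitrarily.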

Mental Model