Applied ML

Tensor Parallelism

Split the matmul itself, and pick the split so the syncs cancel

01 · First principles · When a single layer is the problem

FSDP shards storage but still executes each layer whole, on one GPU. Two situations break that:

One layer's working set (weights for the gathered layer, plus activations) does not fit on a single device — large hidden sizes make a single MLP block enormous.
Even when the layer fits, one GPU computes it too slowly, and neither deeper pipelining nor a larger batch can recover the latency target, since both raise throughput rather than shorten a single layer (inference cares about this most).

Tensor parallelism answers by splitting the weight matrices of a single layer across devices, so every device computes a piece of every matmul. The whole game is choosing the split so that synchronization happens rarely.

02 · Failure first · The naive split syncs after every matmul

Split a weight matrix any way you like and the matmul still works, but the outputs land in pieces that must be reconciled before the next operation. Done carelessly, every matmul ends in a collective. A transformer block has four big matmuls (the QKV projection, the attention output projection, and the two MLP matmuls), and four allreduces per block per forward pass would drown the compute. The nonlinearity makes it worse: GeLU(x₁+x₂) ≠ GeLU(x₁) + GeLU(x₂), so partial sums must be reduced before any elementwise nonlinearity that follows them.
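The inequality is easy to check numerically. The sketch below uses the exact erf form of GeLU and two made-up partial sums (x1 and x2 are illustrative values, not taken from any model) to show that applying the nonlinearity before the reduction gives a different number than applying it after.

```python
import math


def gelu(x: float) -> float:
    """Exact GeLU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


# Two partial sums that two ranks might hold for the same output element
# (illustrative values only).
x1, x2 = 1.5, -1.0

reduced_first = gelu(x1 + x2)        # correct: reduce, then apply GeLU
applied_first = gelu(x1) + gelu(x2)  # wrong: GeLU on partials, then sum

print(f"GeLU(x1 + x2)       = {reduced_first:.4f}")
print(f"GeLU(x1) + GeLU(x2) = {applied_first:.4f}")
```

Any split that leaves partial sums in front of a GeLU therefore forces a reduction at that point.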

The design question is not "can we split" (we always can) but "which sequence of splits needs the fewest reductions, given where the nonlinearities sit".

03 · The mechanism · The Megatron pair: column then row

The MLP block computes Y = GeLU(XA), Z = YB. Megatron's trick is to split A by columns and B by rows:

A = [A₁ | A₂] (split by columns)
B = [B₁ ; B₂] (split by rows)
Z = GeLU(XA₁)B₁ + GeLU(XA₂)B₂

Rank 1 computes term 1; rank 2 computes term 2; one allreduce sums them.

Why this works: a column split of A produces complete columns of XA — full values, not partial sums — so GeLU can be applied locally with no communication. Each GeLU output slice then meets exactly the rows of B it multiplies, producing a full-shape partial result. The two matmuls and the nonlinearity all run without talking; a single allreduce at the end sums the partials.

Column split makes the nonlinearity local; row split makes the second matmul consume exactly what the first produced. The pair needs one allreduce in forward (and one in backward).
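A single-process NumPy sketch can confirm the algebra. It simulates the ranks with a Python list and stands in for the allreduce with a plain sum; the sizes, the tanh approximation of GeLU, and all variable names are assumptions chosen for illustration, not Megatron's code. The last few lines also try the opposite order (A split by rows) to show that it leaves partial sums in front of the GeLU.

```python
import numpy as np


def gelu(x):
    # tanh approximation of GeLU; any elementwise nonlinearity behaves the same here
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


rng = np.random.default_rng(0)
tokens, hidden, ffn, tp = 4, 8, 32, 2  # illustrative sizes; tp = simulated ranks

X = rng.standard_normal((tokens, hidden))
A = rng.standard_normal((hidden, ffn))
B = rng.standard_normal((ffn, hidden))

# Unsharded reference: Z = GeLU(XA) B
Z_ref = gelu(X @ A) @ B

# Column split of A, row split of B: one shard of each per simulated rank.
A_shards = np.split(A, tp, axis=1)
B_shards = np.split(B, tp, axis=0)

# Each rank runs matmul -> GeLU -> matmul with no communication.
partials = [gelu(X @ A_r) @ B_r for A_r, B_r in zip(A_shards, B_shards)]

# The single allreduce, simulated here as a sum over ranks.
Z_tp = np.sum(partials, axis=0)
assert np.allclose(Z_ref, Z_tp)

# Contrast: a row split of A (with X split by columns) yields partial sums of XA,
# so applying GeLU before reducing gives the wrong activations.
X_cols = np.split(X, tp, axis=1)
A_rows = np.split(A, tp, axis=0)
Y_wrong = np.sum([gelu(X_c @ A_r) for X_c, A_r in zip(X_cols, A_rows)], axis=0)
assert not np.allclose(gelu(X @ A), Y_wrong)
```

In a real multi-GPU run the sum would be a collective such as an allreduce over the TP group; the arithmetic per rank is unchanged.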

Attention gets the same treatment for free: heads are already independent matmuls, so splitting Q, K, V by heads (a column split of the projections) and the output projection by rows reproduces the pattern exactly. A transformer block ends up with two allreduces per forward pass — one after the attention block, one after the MLP — instead of four or more.
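The same check works for attention. The sketch below is a minimal, unmasked, bias-free multi-head self-attention in NumPy, written only for illustration (the function names and sizes are assumptions). Each simulated rank receives the Q, K, V projection columns for its own heads and the matching rows of the output projection, and a sum over ranks plays the role of the allreduce.

```python
import numpy as np


def softmax(s):
    s = s - s.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)


def attention(X, Wq, Wk, Wv, Wo, n_heads):
    """Multi-head self-attention for one sequence (no mask, no bias)."""
    seq = X.shape[0]
    d_head = Wq.shape[1] // n_heads

    def heads(W):
        # (seq, n_heads * d_head) -> (n_heads, seq, d_head)
        return (X @ W).reshape(seq, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = heads(Wq), heads(Wk), heads(Wv)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    ctx = softmax(scores) @ V                                 # (n_heads, seq, d_head)
    ctx = ctx.transpose(1, 0, 2).reshape(seq, n_heads * d_head)
    return ctx @ Wo


rng = np.random.default_rng(0)
seq, hidden, n_heads, tp = 5, 16, 4, 2  # illustrative sizes
X = rng.standard_normal((seq, hidden))
Wq, Wk, Wv, Wo = (rng.standard_normal((hidden, hidden)) for _ in range(4))

Z_ref = attention(X, Wq, Wk, Wv, Wo, n_heads)

partials = []
for r in range(tp):
    # Columns of Wq/Wk/Wv for this rank's heads; matching rows of Wo.
    cols = slice(r * hidden // tp, (r + 1) * hidden // tp)
    partials.append(
        attention(X, Wq[:, cols], Wk[:, cols], Wv[:, cols], Wo[cols, :], n_heads // tp)
    )

Z_tp = np.sum(partials, axis=0)  # the one allreduce after the attention block
assert np.allclose(Z_ref, Z_tp)
```

The softmax never crosses ranks because each head's scores depend only on that head's slices of Q and K.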

04 · The cost · Why TP stays inside a node

The cost structure is unforgiving. The allreduces carry activations (batch × sequence × hidden), they sit on the critical path of every block, and they cannot be overlapped the way DDP hides gradient comms, because the next operation needs the reduced value. Per-GPU matmuls also shrink by the TP degree while the allreduce payload stays activation-sized, so arithmetic intensity drops and communication takes a growing share of each block.
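A back-of-envelope estimate makes the trend concrete. Everything in the sketch below is an assumption made for illustration: a bandwidth-only ring allreduce model in which each GPU sends 2(t-1)/t of the buffer, the common count of 24 · batch · seq · hidden² FLOPs for the weight matmuls of one block's forward pass (standard multi-head attention, MLP width 4 × hidden, attention-score FLOPs ignored), and made-up link and compute rates. It is not a measurement of any real system.

```python
def tp_block_forward_estimate(batch, seq, hidden, tp,
                              link_bytes_per_s, flops_per_s, bytes_per_elem=2):
    """Rough (comm_seconds, compute_seconds) for one block's forward pass under TP.

    Bandwidth-only model: ignores latency, attention-score FLOPs, and kernel
    efficiency. All hardware rates passed in are assumptions.
    """
    act_bytes = batch * seq * hidden * bytes_per_elem
    # Ring allreduce: each GPU sends about 2 * (tp - 1) / tp of the buffer.
    sent_per_allreduce = 2 * (tp - 1) / tp * act_bytes
    comm_s = 2 * sent_per_allreduce / link_bytes_per_s  # two allreduces per block
    # Weight-matmul FLOPs of one block, divided across the TP ranks.
    matmul_flops = 24 * batch * seq * hidden**2 / tp
    compute_s = matmul_flops / flops_per_s
    return comm_s, compute_s


# Hypothetical link speeds in bytes/s; hypothetical sustained FLOP/s per GPU.
for name, bandwidth in [("fast intra-node link", 300e9), ("slow inter-node link", 25e9)]:
    for tp in (2, 4, 8):
        comm_s, compute_s = tp_block_forward_estimate(
            batch=8, seq=2048, hidden=8192, tp=tp,
            link_bytes_per_s=bandwidth, flops_per_s=300e12,
        )
        print(f"{name:22s} tp={tp}: comm/compute = {comm_s / compute_s:.2f}")
```

Under these assumptions the communication-to-compute ratio grows with the TP degree and grows further as the link gets slower.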

TP therefore demands an NVLink-class interconnect (hundreds of GB/s, microsecond latency) and in practice stays within a single node, with degree 2–8.
Across nodes, over ordinary Ethernet or even InfiniBand, the per-block allreduces dominate the step and efficiency collapses; that regime belongs to pipeline parallelism and FSDP, which communicate less often.
Sequence parallelism is the usual companion: the norms and dropouts between blocks, which TP leaves replicated, get sharded along the sequence dimension, and the allreduce decomposes into reduce-scatter + allgather around them at no extra volume.
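The reduce-scatter + allgather decomposition can also be simulated in one process. In the sketch below (illustrative names and sizes; a scale-free layer norm stands in for the norms and dropouts between blocks), each simulated rank ends up owning one sequence chunk of the summed activations, normalizes only that chunk, and an allgather reassembles the full sequence. The result matches allreduce followed by a replicated norm because the norm acts on each token independently.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    # Per-token normalization over the hidden dimension (no learned scale or shift).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


rng = np.random.default_rng(0)
tp, seq, hidden = 4, 8, 6  # illustrative sizes; seq must be divisible by tp

# Full-shape partial results held by each rank at the end of a TP block.
partials = [rng.standard_normal((seq, hidden)) for _ in range(tp)]

# Plain TP: allreduce, then every rank runs the same norm on the full sequence.
ref = layer_norm(np.sum(partials, axis=0))

# Sequence parallelism: reduce-scatter along the sequence dimension...
chunks = [np.split(p, tp, axis=0) for p in partials]  # chunks[src_rank][dst_rank]
scattered = [sum(chunks[src][dst] for src in range(tp)) for dst in range(tp)]

# ...each rank normalizes only its own sequence chunk...
normed = [layer_norm(c) for c in scattered]

# ...and an allgather rebuilds the full sequence for the next block.
gathered = np.concatenate(normed, axis=0)

assert np.allclose(ref, gathered)
```

Each rank holds and normalizes only seq / tp tokens between blocks.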

05 · Placement · Where TP sits in the 3D stack

Large training runs compose the three parallelisms by their communication appetites: TP (chattiest) inside the node over NVLink, PP across nodes with its sparse point-to-point activations, and data parallelism (DDP or FSDP) across the remaining axis with its once-per-step gradient sync. Each tool covers the regime the others cannot.
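One way to picture the placement is as a mapping from global rank to grid coordinates. The sketch below assumes one possible convention, not any particular framework's API: ranks are assigned to nodes consecutively, and the TP index varies fastest so that each TP group occupies consecutive ranks. The function name, the ordering of the PP and DP axes, and the cluster shape are all hypothetical.

```python
def rank_to_coords(rank, tp, pp, dp):
    """Map a global rank to (dp_idx, pp_idx, tp_idx), with TP varying fastest."""
    assert 0 <= rank < tp * pp * dp
    tp_idx = rank % tp
    pp_idx = (rank // tp) % pp
    dp_idx = rank // (tp * pp)
    return dp_idx, pp_idx, tp_idx


# Hypothetical cluster: 8 GPUs per node, 64 GPUs in total.
tp, pp, dp, gpus_per_node = 8, 4, 2, 8

# Collect the ranks of every TP group (same dp and pp index, varying tp index).
tp_groups = {}
for rank in range(tp * pp * dp):
    dp_idx, pp_idx, tp_idx = rank_to_coords(rank, tp, pp, dp)
    tp_groups.setdefault((dp_idx, pp_idx), []).append(rank)

# With consecutive rank-to-node assignment, each TP group sits on a single node.
for ranks in tp_groups.values():
    nodes = {r // gpus_per_node for r in ranks}
    assert len(nodes) == 1
```

With this layout the activation allreduces of a TP group never leave the node, while pipeline and data-parallel neighbors sit on other nodes.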

Mental Model

Any split computes the right answer; a good split is one whose partial results stay locally usable until a single cheap sync.
Column-then-row is exactly that: complete values through the nonlinearity, partial sums only at the very end.
Attention is born tensor-parallel — heads are the split.
TP's communication is per-block, activation-sized, and unhideable, so it lives on NVLink and stays inside a node.
Two allreduces per transformer block per forward pass is the target; more means the split is wrong.