Applied ML

Profiling

Find the wall before you push on it

01 · First principles · Never optimise unmeasured

The standard failure is not slow code; it is a week spent making a fast part faster. Someone hand-fuses a kernel that was 2% of the step, while the data loader starves the GPU for 30% of every iteration. Intuition about where time goes in a GPU program is poor, partly because execution is asynchronous: the Python line that "takes long" is often just the first one forced to wait for a queue of earlier kernels. Measurement is not a refinement of optimisation; it is the precondition for it.

The discipline: measure → attribute the time to one of three walls → apply only the fixes that move that wall → measure again. Every fix in the systems-ML toolbox addresses exactly one wall, which is why misdiagnosis wastes the whole effort.

02 · The map · Three regimes

Regime | The wall | Telltale signs | What actually helps
Compute-bound | FLOP/s of the chip | Big matmuls dominate the trace; tensor-core utilisation high; achieved TFLOPS near spec | Lower precision, better kernels, a smaller model. You are at the happy wall.
Bandwidth-bound | HBM GB/s | Trace full of short pointwise/norm kernels; SMs idle waiting on memory | Fusion (torch.compile), fewer/larger ops, 16-bit activations
Comms / overhead-bound | Interconnect, CPU, launch latency | Gaps between kernels; NCCL kernels on the critical path; GPU idle while Python or the dataloader works | Overlap comms with compute, no_sync, CUDA graphs / compile, more dataloader workers

03 · The model · Arithmetic intensity and the roofline

Which wall a kernel hits is predictable before you run it. Define arithmetic intensity as the work done per byte moved to and from memory:

AI = FLOPs / bytes moved
attainable FLOP/s = min(peak FLOP/s, AI × bandwidth)

The chip has a fixed ratio too: an A100 offers roughly 312 bf16 TFLOPS against roughly 2 TB/s of HBM, a ridge point near 150 FLOPs/byte. Kernels below that intensity cannot be compute-bound no matter how clever the code; the memory system simply cannot feed the ALUs fast enough.
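The roofline arithmetic above is small enough to check by hand. The following is a minimal sketch, not a profiler: the function names are made up for illustration, the A100 figures are the rounded ones quoted in the text, and the matmul intensity assumes the idealised case where each operand and the output cross HBM exactly once.

```python
# Rounded A100 figures quoted in the text (assumptions, not measured values).
A100_PEAK_FLOPS = 312e12   # bf16 tensor-core FLOP/s
A100_HBM_BW = 2e12         # bytes/s


def ridge_point(peak_flops, bandwidth):
    """Arithmetic intensity (FLOPs/byte) where the two roofs meet."""
    return peak_flops / bandwidth


def attainable_flops(ai, peak_flops, bandwidth):
    """Roofline bound: min(peak FLOP/s, AI x bandwidth)."""
    return min(peak_flops, ai * bandwidth)


def matmul_intensity(m, k, n, bytes_per_element=2):
    """Idealised AI of an (m x k) @ (k x n) matmul.

    Assumes A and B are each read once and C is written once,
    which real kernels only approach with good tiling.
    """
    flops = 2 * m * k * n
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved


ridge = ridge_point(A100_PEAK_FLOPS, A100_HBM_BW)  # 156 FLOPs/byte
for name, ai in [("pointwise op", 0.1),
                 ("4096^3 matmul", matmul_intensity(4096, 4096, 4096))]:
    bound = attainable_flops(ai, A100_PEAK_FLOPS, A100_HBM_BW)
    side = "bandwidth-bound" if ai < ridge else "compute-bound"
    print(f"{name}: AI={ai:.2f}, bound={bound / 1e12:.1f} TFLOP/s, {side}")
```

Swapping in another chip's peak and bandwidth moves the ridge; the classification logic stays the same.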

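[Figure: roofline plot. x-axis: arithmetic intensity, FLOPs/byte (log); y-axis: FLOP/s (log). The slanted roof has slope equal to HBM bandwidth and meets the flat roof (peak tensor-core FLOP/s) at the ridge, ~150 FLOPs/byte. Marked points: pointwise add (AI ≈ 0.2), softmax and layernorm, large matmul (AI in the hundreds).]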

Under the slanted roof, only moving fewer bytes helps. On the flat roof, only doing fewer FLOPs (or lower precision) helps.

This single picture explains the modern kernel agenda. A pointwise op does 1 FLOP per element while moving 8–12 bytes, so its AI is around 0.1: hopelessly bandwidth-bound, and capped by the roofline at roughly 0.1 × 2 TB/s ≈ 0.2 TFLOP/s, well under 1% of the A100's tensor-core peak. A chain of such ops (bias, GeLU, dropout, residual) re-reads and re-writes the same tensor over and over. Fusion merges the chain into one kernel that reads once, computes everything in registers, and writes once. The FLOPs are unchanged and the kernel still finishes several times faster, because FLOPs were never the cost. FlashAttention is the same logic applied to attention's memory traffic.
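The fusion argument can be made concrete with back-of-envelope traffic arithmetic. This is a simplified sketch with hypothetical names: it treats every op in the chain as unary (one tensor read, one tensor written) and ignores extra operands such as the bias vector, the dropout mask, or the residual input, so the real ratio will differ somewhat.

```python
def chain_traffic_bytes(n_elements, n_ops, bytes_per_element=2, fused=False):
    """HBM bytes moved by a chain of pointwise ops over one tensor.

    Unfused: every op reads its input and writes its output (2 passes per op).
    Fused: one read at the start, one write at the end, intermediates in registers.
    """
    passes = 2 if fused else 2 * n_ops
    return passes * n_elements * bytes_per_element


def bandwidth_bound_time_s(bytes_moved, bandwidth=2e12):
    """Lower bound on runtime when memory traffic is the only cost."""
    return bytes_moved / bandwidth


n = 8192 * 4096          # elements in a hypothetical activation tensor
ops = 4                  # bias, GeLU, dropout, residual
unfused = chain_traffic_bytes(n, ops)
fused = chain_traffic_bytes(n, ops, fused=True)
print(f"unfused: {unfused / 1e6:.0f} MB, {bandwidth_bound_time_s(unfused) * 1e6:.0f} us")
print(f"fused:   {fused / 1e6:.0f} MB, {bandwidth_bound_time_s(fused) * 1e6:.0f} us")
print(f"traffic ratio: {unfused / fused:.1f}x")
```

Under these assumptions the speedup ceiling equals the number of ops in the chain, which is why longer pointwise chains gain the most from fusion.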

04 · The tools · From humble timer to full trace

Tool | What it shows | Reach for it when
Timer + torch.cuda.synchronize() | Wall time of a region, honestly | Always first; one number, no setup. Without the synchronize you are timing kernel launches, not kernels
torch.profiler | Per-op CPU and CUDA time, exportable Chrome trace, stacks | Attributing a step's time to ops; spotting gaps and launch overhead in the timeline
Nsight Systems / Compute | Whole-system timeline (CPU, GPU, NCCL, dataloader); per-kernel hardware counters | Cross-process and comms problems; confirming a specific kernel's achieved bandwidth or occupancy
torch.cuda.memory._record_memory_history snapshot | Every allocation with stack traces, as an interactive timeline | OOMs and mystery memory growth; seeing what actually peaks (usually activations; see checkpointing)

Two habits prevent most measurement lies: discard the first iterations (compile, autotune, and allocator warmup pollute them — see JIT), and measure several steps of the real workload rather than a microbenchmark with the dataloader removed and caches hot.
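The humble timer from the table, combined with the warmup habit, might look like the sketch below. The helper name and its parameters are hypothetical. The `sync` argument is a stand-in so the code runs anywhere: on a GPU with PyTorch one would pass `torch.cuda.synchronize`, which blocks until all queued kernels have finished.

```python
import time


def mean_step_time(step_fn, n_warmup=3, n_measure=10, sync=lambda: None):
    """Average wall time per step, excluding warmup iterations.

    sync is called before the clock starts and before it stops, so that
    queued asynchronous work is counted inside the timed region and
    nothing from warmup leaks into it.
    """
    for _ in range(n_warmup):      # compile, autotune, allocator warmup
        step_fn()
    sync()                         # drain warmup work before timing
    start = time.perf_counter()
    for _ in range(n_measure):
        step_fn()
    sync()                         # wait for the timed work to finish
    return (time.perf_counter() - start) / n_measure


# CPU stand-in for a training step; with PyTorch on a GPU the call would be
# mean_step_time(train_step, sync=torch.cuda.synchronize).
print(f"{mean_step_time(lambda: sum(range(100_000))) * 1e3:.3f} ms per step")
```

Synchronising once around the whole measured block, rather than after every step, keeps the CPU free to run ahead of the GPU as it does in a real run.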

05 · Reading results · The one number worth reporting

For training, the cleanest top-line metric is MFU (model FLOPs utilisation): the model's theoretical FLOPs per step divided by step time, as a fraction of the chip's peak. Large transformer runs commonly land around 35–50%; well below that, the roofline says you are paying one of the other two walls, and the trace tells you which. Throughput in tokens/s is what you ship; MFU is what tells you how much is left on the table.
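As a worked example of the MFU definition, here is a small sketch. The function names are hypothetical, and the FLOP count uses the common 6 × parameters × tokens approximation for a dense transformer's forward and backward pass; that rule is an assumption brought in for illustration (it ignores attention FLOPs) and is not part of the definition above.

```python
def transformer_train_flops(n_params, n_tokens):
    """Approximate model FLOPs for one training step.

    Assumption: 6 FLOPs per parameter per token (forward + backward),
    a widely used estimate that omits the attention score computation.
    """
    return 6 * n_params * n_tokens


def mfu(model_flops_per_step, step_time_s, peak_flops_per_device, n_devices=1):
    """Model FLOPs utilisation: achieved model FLOP/s over aggregate peak."""
    achieved = model_flops_per_step / step_time_s
    return achieved / (peak_flops_per_device * n_devices)


# Hypothetical run: 1.3B parameters, 65,536 tokens per step, one A100
# (312e12 bf16 FLOP/s as quoted earlier), measured step time of 4 s.
flops = transformer_train_flops(1.3e9, 65_536)
print(f"MFU = {mfu(flops, 4.0, 312e12):.1%}")
```

Because the numerator counts only the FLOPs the model needs in theory, work such as activation recomputation lengthens the step time and lowers MFU instead of inflating it.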

Order of inspection: step time stable? → GPU busy (gaps = overhead/comms)? → busy time in matmuls (else bandwidth)? → matmuls near peak (else kernel/precision)? Four questions, asked in order, classify almost every slow training job.
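The four questions can be written down as a short decision function. This is only a sketch of the ordering: the function name, the inputs, and every threshold are hypothetical placeholders chosen for illustration, and the fractions would have to come from a real trace.

```python
from statistics import mean, pstdev


def diagnose(step_times, gpu_busy_frac, matmul_frac, matmul_peak_frac,
             max_jitter=0.05, min_busy=0.9, min_matmul=0.7, min_peak=0.6):
    """Ask the four questions in order and return the first wall that is hit.

    step_times:       per-step wall times in seconds
    gpu_busy_frac:    fraction of the step with a kernel running
    matmul_frac:      fraction of GPU busy time spent in matmuls
    matmul_peak_frac: achieved matmul FLOP/s as a fraction of peak
    All thresholds are illustrative assumptions, not recommended values.
    """
    jitter = pstdev(step_times) / mean(step_times)
    if jitter > max_jitter:
        return "unstable step time"
    if gpu_busy_frac < min_busy:
        return "overhead/comms-bound"
    if matmul_frac < min_matmul:
        return "bandwidth-bound"
    if matmul_peak_frac < min_peak:
        return "kernel/precision"
    return "compute-bound"


print(diagnose([1.00, 1.01, 0.99], gpu_busy_frac=0.65,
               matmul_frac=0.8, matmul_peak_frac=0.7))
```

The early returns encode the ordering: a later question is only meaningful once the earlier ones have passed.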