Applied ML

Floating Point Representation

A fixed budget of bits against an infinite number line

01 · First principles: Base-2 scientific notation

There are uncountably many reals and only 2³² patterns in 32 bits, so any representation is a choice about which numbers to keep. Fixed-point keeps evenly spaced ones and fails at both ends of the scale: small values lose all their significant digits and large ones overflow. Floating point keeps numbers that are evenly spaced in relative terms (dense near zero, sparse far away), which fits computation, where what usually matters is relative error.
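The "evenly spaced in relative terms" claim can be checked directly. The sketch below uses Python's own floats, which are fp64 rather than fp32, and assumes Python 3.9+ for `math.ulp`; the helper name `relative_spacing` is made up for illustration.

```python
import math

def relative_spacing(x: float) -> float:
    """Gap from x to the next float above it, divided by x (x > 0)."""
    return math.ulp(x) / x

magnitudes = [1e-6, 1.0, 1e3, 1e8, 1e15]
for x in magnitudes:
    # The absolute gap grows with x; the relative gap stays within a factor of 2.
    print(f"x={x:<8g} gap={math.ulp(x):.3e}  gap/x={relative_spacing(x):.3e}")
```

For fp64 the relative gap always lies between 2⁻⁵³ and 2⁻⁵²; the same pattern holds for the narrower formats with their own mantissa widths.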

The format is base-2 scientific notation packed into bits: sign s, exponent e (stored with a bias), mantissa m (with an implicit leading 1 for normal numbers):

$$x = (-1)^{s} \cdot 1.m \cdot 2^{\,e - \text{bias}}$$

Exponent bits buy range; mantissa bits buy precision.

Every format below is just a different split of one budget. Exponent bits decide how far from zero you can go; mantissa bits decide how many significant digits you carry. You cannot have both in 16 bits, which is the entire fp16-vs-bf16 story in mixed precision.
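To make the sign/exponent/mantissa split concrete, here is a minimal sketch that pulls the three fields out of an fp32 bit pattern and rebuilds the value from the formula. The helper names `decode_fp32` and `rebuild_normal` are hypothetical, and the rebuild only covers normal numbers (it ignores zero, subnormals, infinities, and NaN).

```python
import struct

def decode_fp32(x: float) -> tuple[int, int, int]:
    """Round x to fp32 and return its (sign, stored exponent, mantissa) fields."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, stored with bias 127
    mantissa = bits & 0x7FFFFF        # 23 bits, implicit leading 1 not stored
    return sign, exponent, mantissa

def rebuild_normal(sign: int, exponent: int, mantissa: int) -> float:
    """Apply (-1)^s * 1.m * 2^(e - bias) with bias 127 (normal numbers only)."""
    return (-1.0) ** sign * (1.0 + mantissa / 2**23) * 2.0 ** (exponent - 127)

s, e, m = decode_fp32(-6.5)           # -6.5 = -1.625 * 2^2
print(s, e, f"{m:023b}")              # 1 129 10100000000000000000000
print(rebuild_normal(s, e, m))        # -6.5
```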

02 · The zoo: Layouts that matter in ML

[Figure: bit layouts (sign, exponent, mantissa) for each format, drawn to scale. FP32. TF32: 19 bits, lives only inside tensor cores. BF16: fp32's range, ~2–3 digits. FP16: more digits, range tops out at 65504. FP8 E4M3: weights/activations, precision-leaning. FP8 E5M2: gradients, range-leaning, fp16's exponent.]

One budget, different splits. Note bf16 and tf32 share fp32's 8-bit exponent; fp16 and e5m2 share a 5-bit one.

| Format | Bits (s/e/m) | Max normal | Machine eps (≈ rel. error) | Role |
|---|---|---|---|---|
| fp32 | 1/8/23 | ~3.4×10³⁸ | ~1.2×10⁻⁷ | Master weights, optimizer state, reductions |
| tf32 | 1/8/10 | ~3.4×10³⁸ | ~9.8×10⁻⁴ | What "fp32 matmuls" can silently become on Ampere+ when TF32 is enabled |
| fp16 | 1/5/10 | 65504 | ~9.8×10⁻⁴ | Inference; training with loss scaling |
| bf16 | 1/8/7 | ~3.4×10³⁸ | ~7.8×10⁻³ | Default training compute type |
| fp8 e4m3 | 1/4/3 | 448 | ~1.3×10⁻¹ | FP8 weights/activations (per-tensor scaling required) |
| fp8 e5m2 | 1/5/2 | 57344 | ~2.5×10⁻¹ | FP8 gradients |
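The "Max normal" column follows from the bit split alone. The sketch below is one way to derive it, assuming the usual bias of 2^(e−1) − 1. The `ieee_like` flag is a hypothetical switch for E4M3, which (as I understand the common FP8 convention) has no infinities and reserves only a single all-ones pattern for NaN, so its top exponent code still holds normal numbers; that assumption is what reproduces the 448 in the table.

```python
def max_normal(exp_bits: int, man_bits: int, ieee_like: bool = True) -> float:
    """Largest finite value for a sign/exponent/mantissa split."""
    bias = 2 ** (exp_bits - 1) - 1
    if ieee_like:
        # Top exponent code is reserved for inf/NaN; mantissa can be all ones.
        top_exp = (2**exp_bits - 2) - bias
        top_frac = 2.0 - 2.0**-man_bits
    else:
        # E4M3-style assumption: top exponent code is usable, but the
        # all-ones mantissa there encodes NaN, so stop one step short.
        top_exp = (2**exp_bits - 1) - bias
        top_frac = 2.0 - 2.0 ** (1 - man_bits)
    return top_frac * 2.0**top_exp

formats = {
    "fp32": (8, 23, True), "tf32": (8, 10, True), "fp16": (5, 10, True),
    "bf16": (8, 7, True), "fp8 e4m3": (4, 3, False), "fp8 e5m2": (5, 2, True),
}
for name, (e, m, ieee) in formats.items():
    print(f"{name:9s} max normal = {max_normal(e, m, ieee):.4g}")
```

Note how the three 8-bit-exponent formats all land near 3.4×10³⁸ regardless of mantissa width: range is set almost entirely by the exponent field.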

Machine epsilon is the spacing between 1.0 and the next representable number, $2^{-m}$ for $m$ stored mantissa bits; it bounds the relative error of a single rounding (with round-to-nearest the error is at most half of it). Every individual float operation is exact-then-rounded: barring overflow and underflow, the IEEE guarantee is fl(a∘b) = (a∘b)(1+δ) with |δ| ≤ eps. One operation is fine. The trouble is composition.
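Both statements can be checked with the standard library. This sketch measures the gap above 1.0 for fp32 and fp16 by incrementing the bit pattern, then verifies the rounding bound on one fp32 product using exact rationals. The helper names are hypothetical, and fp32 arithmetic is emulated by computing in fp64 and rounding once, which is exact for a single product of two fp32 values.

```python
import struct
from fractions import Fraction

def to_fp32(x: float) -> float:
    """Round a Python float (fp64) to the nearest fp32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def spacing_at_one(float_code: str, int_code: str) -> float:
    """Gap between 1.0 and the next float, for a struct float format code."""
    (bits,) = struct.unpack("<" + int_code, struct.pack("<" + float_code, 1.0))
    (nxt,) = struct.unpack("<" + float_code, struct.pack("<" + int_code, bits + 1))
    return nxt - 1.0

eps32 = spacing_at_one("f", "I")   # 2**-23, about 1.19e-07
eps16 = spacing_at_one("e", "H")   # 2**-10, about 9.77e-04

# One rounded product: fl(a*b) = (a*b)(1 + delta)
a, b = to_fp32(0.1), to_fp32(0.3)
exact = Fraction(a) * Fraction(b)          # exact rational product
rounded = Fraction(to_fp32(a * b))         # what fp32 would store
delta = abs(rounded - exact) / exact
print(float(delta) <= eps32 / 2)           # True
```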

03 · The consequence: Non-associativity

Because every add rounds, (a + b) + c ≠ a + (b + c) in general. A two-line demonstration:

# fp32: floats near 1e8 are 8 apart, so any addend below 4 rounds away
(1e8 + 3.0) + 3.0   # → 100000000  (each +3 is lost separately)
1e8 + (3.0 + 3.0)   # → 100000008  (the 6 registers, rounded to the nearest float)

Now scale that up: a sum over a million gradient elements has a different value for every ordering, and a GPU reduction's ordering depends on how the kernel split the work — which depends on block scheduling, which is not deterministic. This is why the same training script on the same data can produce bitwise-different losses run to run, why torch.use_deterministic_algorithms(True) costs speed (it forces fixed reduction orders), and why an allreduce across a different number of ranks gives bitwise-different gradients. Not a bug; the arithmetic itself is order-sensitive. (When the differences are large, you have a conditioning problem — see precision tricks.)
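The ordering effect is easy to reproduce on a CPU. The sketch below emulates fp32 accumulation by rounding to fp32 after every add (the helpers `to_fp32` and `fp32_sum` are hypothetical, and a real GPU kernel would use a tree reduction rather than a left-to-right loop), then sums the same values in two orders.

```python
import struct

def to_fp32(x: float) -> float:
    """Round a Python float (fp64) to the nearest fp32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def fp32_sum(values) -> float:
    """Left-to-right sum, rounding to fp32 after every addition."""
    total = 0.0
    for v in values:
        total = to_fp32(total + to_fp32(v))
    return total

# One large element and a thousand small ones; the exact sum is 100003000.
data = [1e8] + [3.0] * 1000

forward = fp32_sum(data)             # large first: every +3 rounds away
backward = fp32_sum(reversed(data))  # small first: they accumulate to 3000

print(forward)    # 100000000.0
print(backward)   # 100003000.0
```

Same data, same arithmetic, two answers that differ by 3000. Real gradient sums rarely have one element this dominant, so the run-to-run differences are usually in the last few bits, but the mechanism is the same.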

04 · The edges: Subnormals, and the values that are not numbers

Why this note exists: every numerics topic nearby — mixed precision, stability tricks, nondeterminism, FP8 scaling — reduces to three facts: exponent bits are range, mantissa bits are relative precision, and every operation rounds. Hold those and the rest is derivable.