
Activation Functions

The bend that keeps depth from collapsing

01 · First principles: Why nonlinearity must exist

Stack linear layers without anything between them and the stack folds flat:

W₃(W₂(W₁x)) = (W₃W₂W₁)x = Wx

A hundred linear layers equal one linear layer; depth buys nothing. Every interesting function — XOR, a decision boundary that curves, anything compositional — is out of reach. The activation function is the bend inserted between layers so that composition compounds instead of collapsing. Its requirements are modest: nonlinear, cheap, and differentiable almost everywhere so backprop can pass through. The history of activations is the history of getting the third requirement's quality right.
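The collapse is easy to check numerically. The following is a minimal sketch in plain Python; the 2×2 matrices and the input vector are arbitrary illustrative values, not taken from any real model. It applies three linear maps one after another, then applies their single product matrix, and finally inserts a ReLU-style bend between the layers to show that the single matrix no longer reproduces the stack.

```python
# Minimal sketch: three stacked linear maps versus their single product matrix.
# All numeric values are arbitrary and chosen only for illustration.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def matvec(a, x):
    """Apply matrix a to vector x."""
    return [sum(row[k] * x[k] for k in range(len(x))) for row in a]

def bend(v):
    """Elementwise max(0, t): the nonlinearity placed between layers."""
    return [max(0.0, t) for t in v]

W1 = [[1.0, 2.0], [0.0, 1.0]]
W2 = [[0.5, -1.0], [1.0, 1.0]]
W3 = [[2.0, 0.0], [1.0, -1.0]]
x = [3.0, -1.0]

# Layer by layer, with nothing in between
stacked = matvec(W3, matvec(W2, matvec(W1, x)))

# One matrix W = W3 W2 W1, applied once
W = matmul(W3, matmul(W2, W1))
collapsed = matvec(W, x)

# Same three layers, with a bend after the first two
bent = matvec(W3, bend(matvec(W2, bend(matvec(W1, x)))))

print(stacked, collapsed, bent)
```

With these values, stacked and collapsed are both [3.0, 1.5], while bent is [1.0, -0.5]: once a nonlinearity sits between the layers, no single matrix stands in for the whole stack.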

02 · First generation: Sigmoid and tanh saturate

The early choices were smooth squashers: sigmoid, σ(x) = 1/(1 + e⁻ˣ), mapping to (0, 1), and tanh, mapping to (−1, 1). The breakage is in their tails. Once |x| is large the curve is flat, so the local derivative is ≈ 0: the neuron is saturated. Backprop multiplies local derivatives along depth, and sigmoid's derivative is at most 0.25 even at its best point (x = 0); a 20-layer product of factors ≤ 0.25 is at most 0.25²⁰ ≈ 9 × 10⁻¹³, below 10⁻¹². Gradients vanish, and early layers stop learning. (Tanh is the lesser evil: zero-centred, slope 1 at the origin, but the saturating tails remain.)
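The arithmetic behind that bound can be reproduced in a few lines. This is a sketch that looks only at the activation slopes and ignores the weight matrices that also enter the backprop product, which is a simplifying assumption; the function names are illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x))
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    # d/dx tanh(x) = 1 - tanh(x)^2
    return 1.0 - math.tanh(x) ** 2

peak = sigmoid_grad(0.0)       # the largest slope sigmoid ever has
tail = sigmoid_grad(10.0)      # slope deep in the saturated tail
bound_20_layers = peak ** 20   # best-case product of 20 sigmoid slopes

print(peak, tail, bound_20_layers, tanh_grad(0.0))
```

The peak slope comes out as 0.25 and the 20-layer product as roughly 9 × 10⁻¹³, and that is the best case: any unit sitting in a tail contributes a far smaller factor.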

03 · The fix: ReLU, and its own failure mode

ReLU(x) = max(0, x) abandons smoothness for gradient hygiene. On the positive half the derivative is exactly 1 — no saturation, no shrinking product, regardless of depth. It costs one comparison, and its induced sparsity (about half the units silent per input) turned out to be harmless or helpful. ReLU is most of why post-2012 depth became trainable.

Its failure mode is the other half: a unit whose pre-activation goes negative for every input has derivative exactly 0 everywhere it operates — a dead neuron, unrevivable by gradient descent because no gradient ever flows in. A large step or an unlucky init can kill a noticeable fraction of a layer permanently.
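A dead unit can be illustrated with a single hypothetical neuron, z = w·x + b, whose bias has been pushed far negative. The weight, bias, and inputs below are made-up values for illustration, and the derivative at exactly z = 0 is set to 0 by convention.

```python
def relu(z):
    return max(0.0, z)

def relu_grad(z):
    # Slope is 1 on the active side and 0 otherwise (0 at z == 0 by convention)
    return 1.0 if z > 0.0 else 0.0

# Hypothetical unit whose bias has drifted far negative
w, b = 0.5, -10.0
inputs = [-3.0, -1.0, 0.0, 2.0, 4.0]

pre_activations = [w * x + b for x in inputs]

# Chain rule: d(output)/dw = relu_grad(z) * x and d(output)/db = relu_grad(z)
grads_w = [relu_grad(z) * x for z, x in zip(pre_activations, inputs)]
grads_b = [relu_grad(z) for z in pre_activations]

# Every pre-activation is negative, so every gradient is zero
is_dead = all(g == 0.0 for g in grads_w + grads_b)
print(pre_activations, is_dead)
```

Because every gradient with respect to w and b is zero on these inputs, a gradient step leaves the unit exactly where it is, which is why it stays dead.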

04 · Refinements: Leaky, GELU, SiLU

The refinements each patch the dead zone while keeping the non-saturating right half. Leaky ReLU is the blunt patch: a small slope α ≈ 0.01 on the negative side, so no input region is ever gradient-free. GELU is the principled one: multiply x by the probability that a standard Gaussian stays below it, x·Φ(x), which is the expected output of a stochastic gate that passes x with probability Φ(x) and zeroes it otherwise. The hard gate becomes a soft, smooth one: large negatives still go to zero, but the approach is differentiable and slightly non-monotonic, with a shallow dip that bottoms out around x ≈ −0.75. SiLU, x·σ(x), is a near-twin; it was popularised under the name Swish after an automated search over activation functions.
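The three refinements are short enough to write out directly. This sketch uses the exact GELU form via math.erf rather than the tanh approximation some libraries offer; the grid search at the end is only an illustrative way to locate the bottom of each dip.

```python
import math

def leaky_relu(x, alpha=0.01):
    # Small slope alpha on the negative side instead of a flat zero
    return x if x > 0.0 else alpha * x

def gelu(x):
    # x * Phi(x), with Phi the standard normal CDF expressed through erf
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def silu(x):
    # x * sigmoid(x)
    return x / (1.0 + math.exp(-x))

# Coarse grid over [-4, 0] in steps of 0.001 to find where each curve bottoms out
grid = [i / 1000.0 for i in range(-4000, 1)]
gelu_argmin = min(grid, key=gelu)
silu_argmin = min(grid, key=silu)

print(gelu_argmin, gelu(gelu_argmin))
print(silu_argmin, silu(silu_argmin))
```

On this grid the GELU dip bottoms out near x ≈ −0.75 at about −0.17, and the SiLU dip is a little deeper and further left, near x ≈ −1.28. Both curves return towards zero for large negative inputs without ever having an exactly flat region.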

[Figure: σ, tanh, ReLU and GELU plotted against x. Saturating (σ, tanh) vs non-saturating (ReLU, GELU). Annotation on GELU: soft dip, never exactly flat.]

σ and tanh flatten in both tails; ReLU is exactly linear on the right and exactly dead on the left; GELU rounds the corner with a small dip.

Why transformers picked GELU (and increasingly SiLU): at the scale and depth of transformers, smoothness near zero matters — the kink in ReLU sits exactly where normalised pre-activations concentrate, and a smooth gate there gives better-behaved curvature for the optimiser. Empirically GELU trained better in BERT/GPT-era ablations and the default stuck. (Gated variants such as SwiGLU, two linear projections with a SiLU gate, are the current refinement in LLM feed-forward blocks.)
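The gating idea can be sketched without any framework. Below, one projection is passed through SiLU and multiplied elementwise with a second projection. The names w_gate and w_up and the toy weights are hypothetical, and a full feed-forward block would normally follow this with an output projection, omitted here.

```python
import math

def silu(z):
    return z / (1.0 + math.exp(-z))

def matvec(w, x):
    """Apply matrix w (list of rows) to vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def swiglu(x, w_gate, w_up):
    """SiLU(w_gate x) multiplied elementwise with (w_up x)."""
    gate = [silu(z) for z in matvec(w_gate, x)]
    value = matvec(w_up, x)
    return [g * v for g, v in zip(gate, value)]

# Hypothetical toy weights: 2 inputs, 2 hidden units
w_gate = [[1.0, 0.0], [0.0, 1.0]]
w_up = [[2.0, 0.0], [0.0, 3.0]]

h = swiglu([0.0, 1.0], w_gate, w_up)
print(h)
```

The design point is that the gate is itself a learned function of the input, so the network decides per unit how much of the second projection to let through, rather than applying one fixed curve to a single projection.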

05 · Choosing: The short table

| Function | Form | Strength | Weakness |
|---|---|---|---|
| Sigmoid | 1/(1 + e⁻ˣ) | output is a probability; fine as a final gate | saturates in both tails; never use in hidden layers |
| Tanh | 2σ(2x) − 1 | zero-centred; survives in RNN gates | still saturates |
| ReLU | max(0, x) | cheapest; gradient exactly 1 when active; CNN default | dead neurons |
| Leaky ReLU | max(αx, x) | no dead zone | one more hyperparameter, marginal gains |
| GELU / SiLU | x·Φ(x) / x·σ(x) | smooth ReLU; transformer default | slightly costlier; needs the scale to pay off |

Default rule: ReLU for convnets and anything small (see CNNs), GELU/SiLU for transformers, sigmoid/tanh only where a bounded gate is the point. Match your init to the choice (He for the ReLU family).
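For the init remark, He initialisation draws weights with variance 2/fan_in, which compensates for ReLU zeroing roughly half of its inputs. The sketch below uses only the standard library; the function name he_normal, the layer sizes, and the seed are illustrative choices rather than any library's API.

```python
import math
import random

def he_normal(fan_in, fan_out, rng):
    """Draw a fan_out x fan_in weight matrix with std = sqrt(2 / fan_in)."""
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)]
            for _ in range(fan_out)]

rng = random.Random(0)          # fixed seed so the run is repeatable
fan_in, fan_out = 256, 64       # hypothetical layer sizes
weights = he_normal(fan_in, fan_out, rng)

# Empirical variance of the drawn weights, to compare with the 2 / fan_in target
flat = [w for row in weights for w in row]
mean = sum(flat) / len(flat)
variance = sum((w - mean) ** 2 for w in flat) / len(flat)

print(variance, 2.0 / fan_in)
```

The empirical variance should land close to 2/256 ≈ 0.0078. Deep-learning frameworks ship their own initialisers, and those are the better choice in real code.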
Mental Model