Activation Functions

The bend that keeps depth from collapsing

01 · First principles · Why nonlinearity must exist

Stack linear layers without anything between them and the stack folds flat:

W₃(W₂(W₁x)) = (W₃W₂W₁)x = Wx

A hundred linear layers equal one linear layer; depth buys nothing. Every interesting function — XOR, a decision boundary that curves, anything compositional — is out of reach. The activation function is the bend inserted between layers so that composition compounds instead of collapsing. Its requirements are modest: nonlinear, cheap, and differentiable almost everywhere so backprop can pass through. The history of activations is the history of getting the third requirement's quality right.
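The collapse is easy to check numerically. The following is a minimal sketch in plain Python; the three 2×2 matrices and the input vector are arbitrary illustrative values, not taken from any real network, and the helper names `matmul`, `matvec`, and `relu_vec` are made up for this example.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def matvec(m, v):
    """Apply matrix m to vector v."""
    return [sum(row[k] * v[k] for k in range(len(v))) for row in m]

def relu_vec(v):
    """Elementwise max(0, t): the 'bend' between layers."""
    return [max(0.0, t) for t in v]

# Three arbitrary 2x2 "layers" (illustrative values only).
W1 = [[1.0, -2.0], [0.5, 1.0]]
W2 = [[0.0, 1.0], [-1.0, 2.0]]
W3 = [[2.0, 0.0], [1.0, -1.0]]

# Fold the whole stack into a single matrix W = W3 W2 W1.
W = matmul(W3, matmul(W2, W1))

x = [3.0, -1.0]
stacked = matvec(W3, matvec(W2, matvec(W1, x)))   # layer by layer
collapsed = matvec(W, x)                          # one matrix, same answer

# With a ReLU between layers, the single-matrix shortcut stops working.
bent = matvec(W3, relu_vec(matvec(W2, relu_vec(matvec(W1, x)))))

print(stacked, collapsed, bent)
```

For this input the layer-by-layer result and the single-matrix result coincide, while the version with ReLU in between differs, which is the whole point of the bend.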

02 · First generation · Sigmoid and tanh saturate

The early choices were smooth squashers: sigmoid σ(x) = 1/(1+e^(−x)), mapping to (0,1), and tanh, mapping to (−1,1). The breakage is in their tails. Once |x| is large the curve is flat, so the local derivative is ≈ 0: the neuron is saturated. Backprop multiplies local derivatives along depth, and sigmoid's derivative σ(x)(1−σ(x)) is at most 0.25, reached only at x = 0; a 20-layer product of such factors is at most 0.25²⁰ ≈ 9×10⁻¹³, below 10⁻¹² before the weights are even counted. Gradients vanish, and early layers stop learning. (Tanh is the lesser evil: zero-centred, slope 1 at the origin, but the saturating tails remain.)
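The arithmetic behind the vanishing product can be reproduced in a few lines. This is a sketch that looks only at the activation derivatives; a real backward pass also multiplies by the weights, which can shrink or grow the product further. The function names are chosen for this example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    """d/dx tanh(x) = 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

peak = sigmoid_grad(0.0)        # steepest point of the sigmoid
tail = sigmoid_grad(10.0)       # deep in the saturated region
bound_20_layers = peak ** 20    # best case for 20 stacked sigmoid derivatives

print(peak, tail, bound_20_layers, tanh_grad(0.0))
```

Even in the best case, with every unit sitting exactly at its steepest point, the 20-layer product of sigmoid slopes falls below 10⁻¹²; tanh starts from slope 1 at the origin but shrinks in the same way once inputs drift into the tails.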

03 · The fix · ReLU, and its own failure mode

ReLU(x) = max(0, x) abandons smoothness for gradient hygiene. On the positive half the derivative is exactly 1: no saturation and no shrinking product from the activation, regardless of depth. It costs one comparison, and its induced sparsity (roughly half the units silent per input) turned out to be harmless or helpful. ReLU is a large part of why post-2012 depth became trainable.

Its failure mode is the other half: a unit whose pre-activation goes negative for every input has derivative exactly 0 everywhere it operates — a dead neuron, unrevivable by gradient descent because no gradient ever flows in. A large step or an unlucky init can kill a noticeable fraction of a layer permanently.
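A dead unit can be illustrated with a single hypothetical neuron. In the sketch below the weight, bias, and input values are invented for the example; the bias is set far negative to mimic the state after one oversized update.

```python
def relu(z):
    return max(0.0, z)

def relu_grad(z):
    """Derivative of ReLU; the value at z == 0 is taken as 0 by convention."""
    return 1.0 if z > 0 else 0.0

# Hypothetical unit after a bad step: the bias has been pushed far negative.
w, b = 0.5, -10.0
inputs = [-2.0, -0.5, 0.0, 1.0, 3.0]

pre_activations = [w * x + b for x in inputs]
outputs = [relu(z) for z in pre_activations]

# Gradients of the unit's output with respect to its own parameters.
grads_w = [relu_grad(z) * x for z, x in zip(pre_activations, inputs)]
grads_b = [relu_grad(z) for z in pre_activations]

# No input in the batch produces a nonzero gradient, so w and b never move.
is_dead = all(g == 0.0 for g in grads_w + grads_b)
print(outputs, is_dead)
```

Because every pre-activation is negative, both the output and every parameter gradient are zero, so gradient descent has nothing to push on and the unit stays where it is.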

04 · Refinements · Leaky, GELU, SiLU

The refinements each patch the dead zone while keeping the non-saturating right half. Leaky ReLU is the blunt patch: a small slope α ≈ 0.01 on the negative side, so no input region is ever gradient-free. GELU is the principled one: multiply x by the probability that a standard Gaussian stays below it, x·Φ(x). This is the expected output of a stochastic gate that passes x when x plus standard Gaussian noise is positive and outputs zero otherwise, in other words a ReLU whose on/off decision is made under noise. The hard gate becomes a soft, smooth one: large negatives still go to zero, but the approach is differentiable and slightly non-monotonic, with a shallow dip that bottoms out near x ≈ −0.75. SiLU/Swish, x·σ(x), is a near-twin; the Swish form was found by automated search over activation functions.
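The three refinements are short enough to write out directly. This sketch uses the exact erf-based form of GELU (many libraries also offer a tanh approximation, not shown here); the grid scan at the end is only a crude way to locate the dip and is an addition for illustration.

```python
import math

def leaky_relu(x, alpha=0.01):
    """Identity on the right, small slope alpha on the left."""
    return x if x > 0 else alpha * x

def gelu(x):
    """x * Phi(x), with Phi the standard normal CDF written via erf."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def silu(x):
    """x * sigmoid(x), also known as Swish."""
    return x / (1.0 + math.exp(-x))

# Coarse scan of [-4, 0] in steps of 0.001 to find where each curve dips lowest.
grid = [i / 1000.0 for i in range(-4000, 1)]
gelu_min_x = min(grid, key=gelu)
silu_min_x = min(grid, key=silu)

print(gelu_min_x, gelu(gelu_min_x))   # dip near x = -0.75
print(silu_min_x, silu(silu_min_x))   # SiLU's dip sits further left and is deeper
print(leaky_relu(-1.0), gelu(3.0), silu(3.0))
```

Both smooth gates go slightly negative before returning to zero, which is the non-monotonic dip; for large positive inputs they are nearly indistinguishable from the identity, like ReLU.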

σ and tanh flatten in both tails; ReLU is exactly linear on the right and exactly dead on the left; GELU rounds the corner with a small dip.

Why transformers picked GELU (and increasingly SiLU): at the scale and depth of transformers, smoothness near zero matters — the kink in ReLU sits exactly where normalised pre-activations concentrate, and a smooth gate there gives better-behaved curvature for the optimiser. Empirically GELU trained better in BERT/GPT-era ablations and the default stuck. (Gated variants such as SwiGLU, two linear projections with a SiLU gate, are the current refinement in LLM feed-forward blocks.)
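A gated feed-forward block of the SwiGLU kind can be sketched in plain Python. The layout here, a SiLU-activated gate projection multiplied elementwise with a second "up" projection and followed by a "down" projection back to the model width, is an assumption based on the commonly used arrangement; the names `W_gate`, `W_up`, `W_down` and all the numbers are hypothetical.

```python
import math

def silu(x):
    return x / (1.0 + math.exp(-x))

def matvec(m, v):
    """Apply matrix m (list of rows) to vector v."""
    return [sum(row[k] * v[k] for k in range(len(v))) for row in m]

def swiglu_ffn(x, W_gate, W_up, W_down):
    """Gated feed-forward sketch: down( silu(gate(x)) * up(x) )."""
    gate = [silu(g) for g in matvec(W_gate, x)]   # smooth gate values
    up = matvec(W_up, x)                          # ungated linear branch
    hidden = [g * u for g, u in zip(gate, up)]    # elementwise gating
    return matvec(W_down, hidden)

# Tiny illustrative shapes: model width 2, hidden width 3.
x = [1.0, -1.0]
W_gate = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_up = [[1.0, 1.0], [1.0, -1.0], [0.0, 2.0]]
W_down = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]

out = swiglu_ffn(x, W_gate, W_up, W_down)
print(out)
```

The activation no longer acts on a single projection; instead one projection decides, smoothly, how much of the other gets through.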

05 · Choosing · The short table

Function	Form	Strength	Weakness
Sigmoid	1/(1+e^(−x))	output is a probability; fine as a final gate	saturates both tails; never use in hidden layers
Tanh	2σ(2x)−1	zero-centred; survives in RNN gates	still saturates
ReLU	max(0,x)	cheapest, gradient exactly 1 when active; CNN default	dead neurons
Leaky ReLU	max(αx,x)	no dead zone	one more hyperparameter, marginal gains
GELU / SiLU	x·Φ(x) / x·σ(x)	smooth ReLU; transformer default	slightly costlier; needs the scale to pay off

Default rule: ReLU for convnets and anything small (see CNNs), GELU/SiLU for transformers, sigmoid/tanh only where a bounded gate is the point. Match your init to the choice (He for the ReLU family).
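Matching the init to a ReLU-family activation can be sketched as follows. This assumes the usual He (Kaiming) normal scheme, which draws weights with standard deviation sqrt(2 / fan_in); the helper names and the layer sizes are made up for this example, and real frameworks provide their own initialisers.

```python
import math
import random

def he_std(fan_in):
    """Standard deviation for He-normal init: sqrt(2 / fan_in)."""
    return math.sqrt(2.0 / fan_in)

def he_normal(fan_out, fan_in, rng):
    """Weight matrix (list of rows) with entries drawn from N(0, 2 / fan_in)."""
    std = he_std(fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)]
            for _ in range(fan_out)]

rng = random.Random(0)            # seeded only to make the sketch repeatable
weights = he_normal(64, 256, rng)

flat = [w for row in weights for w in row]
sample_var = sum(w * w for w in flat) / len(flat)

print(he_std(256), sample_var, 2.0 / 256)
```

The factor of 2 compensates for ReLU zeroing roughly half of the pre-activations, so the signal's scale stays roughly constant from layer to layer at the start of training.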

Mental Model

Without a bend between layers, any depth of linear maps is one linear map; activations are why depth exists.
The design axis is the derivative, not the function: backprop multiplies local slopes, so flat regions are where learning dies.
Sigmoid/tanh saturate (vanishing gradients); ReLU fixes the right tail and breaks the left (dead neurons).
GELU = x·Φ(x): a ReLU whose on/off decision is averaged over Gaussian noise, giving a smooth gate where normalised activations actually live.
Choosing is boring now: ReLU for convnets, GELU/SiLU for transformers, squashers only as deliberate gates.