The shape that makes optimisation honest
Local search — which is all gradient descent is — has one existential fear: arriving at a point that is the bottom of its neighbourhood but not the bottom of the world. A local method cannot tell the difference, because it only ever sees a neighbourhood. Convexity is precisely the class of functions on which that fear is unfounded. It is a global guarantee purchasable from a local definition, which is why it sits at the centre of classical optimisation.
A function f is convex when the straight line between any two points on its graph never dips below the graph:
The shape is a bowl (possibly a flat-bottomed or asymmetric one, but never a bowl with a bump). Read the inequality as: the function's value at a blend of inputs is at most the blend of its values — averaging inputs never hurts more than averaging outputs. Applied with a probability distribution instead of a two-point blend, this is Jensen's inequality, f(E[x]) ≤ E[f(x)], the workhorse behind half of ML theory (it is what makes KL divergence non-negative).
The chord test, visually. The red curve has a basin at left that local search cannot distinguish from the true minimum.
| Problem | Loss in the weights | Convex? |
|---|---|---|
| Linear regression | ‖Xw − y‖² — a quadratic, ∇² = 2XᵀX ⪰ 0 | yes |
| Logistic regression | log-loss of a linear score; log-sum-exp is convex | yes |
| SVM | hinge (max of two affine functions) + L2 | yes |
| Lasso / ridge | convex loss + convex penalty | yes |
| Any deep network | composition through nonlinearities | no |
The pattern: the classics are convex because the model is linear in its parameters and the loss is convex in the model's output — and convexity survives that composition. Stack a second layer and it dies immediately: even permuting a hidden layer's neurons gives a distinct weight vector with identical loss, so the loss surface has multiple separated minima, which a convex function cannot have.
By the classical account, training a deep net should routinely strand us in bad local minima. Empirically it almost never does, and the partial explanations have reshaped intuition. In very high dimension, most critical points are saddles rather than minima (escapable, especially with SGD noise), and the minima that exist tend to be close in loss value. Overparameterisation widens this further: with far more parameters than needed to fit the data, solutions form large connected low-loss regions rather than isolated traps, and near typical initialisations the landscape is locally benign. The surface is non-convex in theory and forgiving in practice.
So the honest summary: convexity is the regime where optimisation is guaranteed; deep learning is the regime where it is merely reliable, for reasons that are structural (width, depth, noise) rather than theorem-shaped. The classical theory still earns its keep — every convex subproblem inside the pipeline (logistic probes, linear heads, Lasso feature selection) inherits the guarantees for free.