General ML

Convex Functions

The shape that makes optimisation honest

01 · First principlesWhat problem does convexity solve?

Local search — which is all gradient descent is — has one existential fear: arriving at a point that is the bottom of its neighbourhood but not the bottom of the world. A local method cannot tell the difference, because it only ever sees a neighbourhood. Convexity is precisely the class of functions on which that fear is unfounded. It is a global guarantee purchasable from a local definition, which is why it sits at the centre of classical optimisation.

02 · The definitionEvery chord lies above the function

A function f is convex when the straight line between any two points on its graph never dips below the graph:

f(λx + (1−λ)y) ≤ λf(x) + (1−λ)f(y)    for all x, y and λ ∈ [0,1]

The shape is a bowl (possibly a flat-bottomed or asymmetric one, but never a bowl with a bump). Read the inequality as: the function's value at a blend of inputs is at most the blend of its values — averaging inputs never hurts more than averaging outputs. Applied with a probability distribution instead of a two-point blend, this is Jensen's inequality, f(E[x]) ≤ E[f(x)], the workhorse behind half of ML theory (it is what makes KL divergence non-negative).

CONVEX · chord stays above any local min is global NON-CONVEX · chord cuts through local minima can trap descent

The chord test, visually. The red curve has a basin at left that local search cannot distinguish from the true minimum.

03 · The prizeWhat convexity buys

  1. Every local minimum is global. If a lower point existed elsewhere, the chord toward it would slope downward, contradicting local minimality. Wherever descent stops, it stopped at the answer.
  2. Gradient descent provably converges with a properly chosen step (η ≤ 1/L for L-smooth f), and with strong convexity it converges linearly. No luck involved, no restarts needed.
  3. First-order information is globally honest: the tangent at any point underestimates f everywhere, f(y) ≥ f(x) + ∇f(x)ᵀ(y−x). A convex function cannot hide anything below its tangents.
Tests in practice. Definition (chords above) for general f; ∇²f ⪰ 0 (Hessian positive semi-definite, all curvatures non-negative) when twice differentiable; and composition rules — sums, maxima, and affine substitutions of convex functions stay convex, which is how convexity of complicated losses is verified piece by piece.

04 · The censusWhich ML problems are convex

ProblemLoss in the weightsConvex?
Linear regression‖Xw − y‖² — a quadratic, ∇² = 2XᵀX ⪰ 0yes
Logistic regressionlog-loss of a linear score; log-sum-exp is convexyes
SVMhinge (max of two affine functions) + L2yes
Lasso / ridgeconvex loss + convex penaltyyes
Any deep networkcomposition through nonlinearitiesno

The pattern: the classics are convex because the model is linear in its parameters and the loss is convex in the model's output — and convexity survives that composition. Stack a second layer and it dies immediately: even permuting a hidden layer's neurons gives a distinct weight vector with identical loss, so the loss surface has multiple separated minima, which a convex function cannot have.

05 · The modern puzzleNon-convex, yet trains anyway

By the classical account, training a deep net should routinely strand us in bad local minima. Empirically it almost never does, and the partial explanations have reshaped intuition. In very high dimension, most critical points are saddles rather than minima (escapable, especially with SGD noise), and the minima that exist tend to be close in loss value. Overparameterisation widens this further: with far more parameters than needed to fit the data, solutions form large connected low-loss regions rather than isolated traps, and near typical initialisations the landscape is locally benign. The surface is non-convex in theory and forgiving in practice.

So the honest summary: convexity is the regime where optimisation is guaranteed; deep learning is the regime where it is merely reliable, for reasons that are structural (width, depth, noise) rather than theorem-shaped. The classical theory still earns its keep — every convex subproblem inside the pipeline (logistic probes, linear heads, Lasso feature selection) inherits the guarantees for free.

Mental Model