General ML

Confidence Intervals

A procedure with a guarantee — not a probability about the truth

01 · First principlesA point estimate is a guess without a width

Your model scores 84.3% on the test set. Retrain with a different seed: 83.1%. Different test split: 85.0%. The number you reported was one draw from a distribution of possible numbers, and reporting it alone hides everything about how far it might wander. The honest object is the estimate plus its wobble — and the confidence interval is the standard packaging for that wobble.

The question a CI answers: if I reran this whole experiment many times, how widely would my estimate scatter around the truth?

02 · The definitionThe guarantee belongs to the procedure

A 95% confidence interval is the output of a procedure with this property: across repeated experiments, 95% of the intervals the procedure constructs contain the true value. The randomness is in the interval — the truth is a fixed number that does not move.

This forces the famous reading that everyone resists: once your particular interval [83.9, 85.1] is computed, it either contains the truth or it does not. The statement "there is a 95% probability the truth lies in [83.9, 85.1]" is, under the frequentist rules that built the interval, not even well formed. What you may say: "this interval came from a process that captures the truth 95% of the time."

TRUE VALUE (FIXED) MISS EXPERIMENT 1 EXPERIMENT n

Each line is the interval from one repetition of the experiment. The truth never moves; the intervals do. About 1 in 20 misses.

Frequentist · confidence interval
Truth fixed, interval random. "95%" describes the procedure's long-run capture rate. No probability statement about this particular interval is licensed.
Bayesian · credible interval
Truth treated as random (a posterior, via Bayes), interval fixed. "95%" means exactly what intuition wants: 95% posterior probability the parameter is inside. The price of the natural reading is a prior.

03 · MechanicsEstimate ± z · SE, courtesy of the CLT

Where does the recipe come from? The central limit theorem: a mean of n independent samples is approximately Gaussian around the truth, with standard deviation σ/√n (variance bookkeeping from the variance note). A Gaussian keeps 95% of its mass within 1.96 standard deviations, so:

x̄ ∼ approx N(μ, σ²/n)   (CLT)
SE = s/√n,   CI95 = x̄ ± 1.96 · SE

For a test-set accuracy p̂ on n examples, each prediction is a Bernoulli trial, so SE = √(p̂(1−p̂)/n). Concretely: 90% accuracy on n = 1,000 gives SE ≈ 0.95%, hence roughly ±1.9% — your "90.0%" is statistically indistinguishable from a competitor's 88.5%. The √n in the denominator is the quiet tyrant: halving the error bar costs four times the data.

04 · No formula?Bootstrap it

The z·SE recipe needs a formula for the standard error, and for most quantities you actually report — F1, AUC, BLEU, a median, a ratio of metrics — no clean formula exists. The bootstrap replaces the formula with simulation: treat the sample as a stand-in for the population, and replay the experiment by resampling.

  1. Resample n points from your test set with replacement.
  2. Recompute the metric on the resample.
  3. Repeat ~1,000–10,000 times; collect the metric values.
  4. Take the 2.5th and 97.5th percentiles: that span is the 95% CI.

It is crude, embarrassingly parallel, assumption-light, and the default answer to "how do I get error bars on this weird metric". Its main failure modes are tiny samples and statistics dominated by extremes.

05 · ML practiceError bars or it did not happen

ClaimWithout a CIWith a CI
"Our model beats the baseline by 0.8%"Possibly seed noiseCheckable: do the intervals overlap? Better, bootstrap the difference.
"New method: 76.2 on the benchmark"One draw from the seed lotteryMean ± CI over ≥3–5 seeds; seed variance is part of the result.
"Ablation X hurts performance"Could reverse on a different splitEffect size with uncertainty, not a coin-flip ranking.
Cheap habit: a paired bootstrap on the test set (resample once, evaluate both models on the same resample) costs a few lines of NumPy and settles most "is this improvement real?" arguments before they start.
Mental Model