General ML

Precision / Recall / F1 / AUC-ROC

What to measure when accuracy starts lying

01 · The failure
The 99%-accurate test that never says cancer

A disease affects 1% of patients. Consider a classifier that always returns "healthy": no model, no features, just a constant. Its accuracy is 99%. It also detects zero cases, which was the entire job. Accuracy averages over a population in which the interesting class is a rounding error, so a model can score superbly by ignoring the problem.

The lesson generalises: any single number that pools both classes inherits the imbalance. Before trusting a metric, ask what the dumbest baseline scores on it.
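The baseline check can be run in a few lines. This is a minimal sketch: the cohort size of 10,000 and the helper name `accuracy` are assumptions made for illustration; only the 1% prevalence comes from the example above.

```python
# Hypothetical cohort: 10,000 patients at the 1% prevalence from the example.
n_patients = 10_000
n_sick = n_patients // 100                      # 100 true cases
labels = [1] * n_sick + [0] * (n_patients - n_sick)   # 1 = sick, 0 = healthy

# The "classifier": a constant that always answers healthy.
predictions = [0] * n_patients


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


baseline_accuracy = accuracy(labels, predictions)
cases_detected = sum(t == 1 and p == 1 for t, p in zip(labels, predictions))

print(f"accuracy = {baseline_accuracy:.2%}, cases detected = {cases_detected}")
# prints: accuracy = 99.00%, cases detected = 0
```

Any real model evaluated by accuracy on this cohort has to be read against that 99% floor, not against 0%.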

02 · Ground truth
The confusion matrix comes first

Every metric in this note is arithmetic on four counts. Get the counts straight and the metrics stop being vocabulary:

| | Predicted positive | Predicted negative |
|---|---|---|
| Actually positive | TP: caught it | FN: missed it (the silent failure) |
| Actually negative | FP: false alarm | TN: correct rejection |

The two error types almost never cost the same. A missed cancer (FN) and an unnecessary biopsy (FP) are different tragedies; a blocked legitimate email and a delivered scam are different annoyances. Metric choice is really a statement about which cell you fear.
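As a sketch of how the four counts are tallied and how unequal error costs enter, consider the following. The toy labels and the two cost weights are invented for illustration; they are not figures from this note.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FN, FP, TN) for binary labels where 1 is the positive class."""
    tp = fn = fp = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1          # caught it
        elif truth == 1 and pred == 0:
            fn += 1          # missed it
        elif truth == 0 and pred == 1:
            fp += 1          # false alarm
        else:
            tn += 1          # correct rejection
    return tp, fn, fp, tn


# Toy data, invented for illustration.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 1, 0, 0, 0, 0]
tp, fn, fp, tn = confusion_counts(y_true, y_pred)
print(tp, fn, fp, tn)   # prints: 2 1 1 4

# Hypothetical costs: a miss is assumed to be 50x worse than a false alarm.
COST_FN, COST_FP = 50, 1
total_cost = fn * COST_FN + fp * COST_FP
print(total_cost)       # prints: 51
```

With these assumed weights the single miss dominates the bill, which is the sense in which a metric choice encodes which cell you fear.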

03 · The pair
Precision and recall, and the threshold between them

Precision = TP / (TP + FP)
Trust in positive calls. When the model raises its hand, how often is it right? The metric of false-alarm cost: spam filters, automated actions, anything where crying wolf is expensive.
Recall = TP / (TP + FN)
Coverage of real positives. Of everything that was truly there, how much did we find? The metric of miss cost: cancer screening, fraud detection, safety filters.

A scoring classifier becomes a decision rule only when you pick a threshold, and the threshold is a lever with these two on opposite ends. Lower it and you call more things positive: recall rises, precision falls. Raise it and you speak only when sure: precision rises, recall falls. Neither number means much without the other — recall 1.0 is trivially available by flagging everyone (the inverse of section 01's scam).
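The lever can be seen on a handful of scores. A minimal sketch, assuming invented scores for four positives and six negatives and a hypothetical helper `precision_recall_at`; the rule "score >= threshold means positive" is also an assumption about how the threshold is applied.

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall when every score >= threshold is called positive."""
    tp = sum(s >= threshold and y == 1 for s, y in zip(scores, labels))
    fp = sum(s >= threshold and y == 0 for s, y in zip(scores, labels))
    fn = sum(s < threshold and y == 1 for s, y in zip(scores, labels))
    # Precision is undefined when nothing is flagged; returning 1.0 is one
    # convention among several (an assumption of this sketch).
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Invented scores: the first four items are positives, the rest negatives.
scores = [0.95, 0.80, 0.60, 0.30, 0.85, 0.50, 0.40, 0.20, 0.10, 0.05]
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

for threshold in (0.9, 0.55, 0.0):
    p, r = precision_recall_at(scores, labels, threshold)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
# prints:
# threshold=0.90  precision=1.00  recall=0.25
# threshold=0.55  precision=0.75  recall=0.75
# threshold=0.00  precision=0.40  recall=1.00
```

The last row is the flag-everyone rule: recall is perfect and precision collapses to the positive rate of the data (4 in 10).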

The F1 score compresses the pair via the harmonic mean:

F1 = 2·P·R / (P + R)

Harmonic, not arithmetic, because the harmonic mean is dragged toward the smaller value: P = 1.0 with R = 0.02 gives F1 ≈ 0.04, not the flattering 0.51 an average would report. F1 punishes lopsidedness — you cannot buy it with one good number. (It still hides which side is weak, weights both equally — see Fβ otherwise — and ignores TN entirely.)
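The arithmetic is quick to verify. The sketch below uses the standard Fβ definition, (1 + β²)·P·R / (β²·P + R), which reduces to F1 at β = 1; the function name `f_beta` and the zero-division guard are choices made for this illustration.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score: beta > 1 weights recall more, beta < 1 weights precision more."""
    if precision == 0.0 and recall == 0.0:
        return 0.0           # guard against 0/0; a convention, not a derivation
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


p, r = 1.0, 0.02             # the lopsided pair from the text
print(f"arithmetic mean = {(p + r) / 2:.2f}")      # prints: arithmetic mean = 0.51
print(f"F1              = {f_beta(p, r):.3f}")     # prints: F1              = 0.039
print(f"F2              = {f_beta(p, r, beta=2.0):.3f}")
print(f"F0.5            = {f_beta(p, r, beta=0.5):.3f}")
```

Moving β changes how harshly the weak recall is punished, but no value of β rescues a pair this lopsided.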

04 · All thresholds at once
ROC and AUC

Rather than defend one threshold, sweep them all. For each threshold plot the true-positive rate (recall) against the false-positive rate FP/(FP+TN); the sweep traces the ROC curve.

[Figure: ROC curves. x-axis: false positive rate (0 to 1); y-axis: true positive rate (0 to 1). Curves shown: coin flip (AUC 0.5), good model (AUC ≈ 0.93), weaker model (AUC ≈ 0.75). Annotation: one threshold = one point.]

Each point on a curve is one threshold. The diagonal is random guessing; better models bow toward the top-left corner.
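The sweep itself is a short loop. A minimal sketch, assuming six invented scores and a hypothetical helper `roc_points`; it uses every distinct score as a threshold and calls "score >= threshold" positive.

```python
def roc_points(scores, labels):
    """(FPR, TPR) pairs, one per distinct threshold, from strictest to loosest."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]    # a threshold above every score flags nothing
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(s >= threshold and y == 1 for s, y in zip(scores, labels))
        fp = sum(s >= threshold and y == 0 for s, y in zip(scores, labels))
        points.append((fp / n_neg, tp / n_pos))
    return points


# Invented data: three positives, three negatives, one ranking mistake.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]

points = roc_points(scores, labels)
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

The curve starts at (0, 0), ends at (1, 1), and only ever moves up or to the right as the threshold loosens; the one negative scored above a positive is what keeps it from hugging the top-left corner.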

The area under the curve has a clean probabilistic meaning, and it is the right way to remember AUC:

AUC = P( score(random positive) > score(random negative) )

It is a pure ranking metric: 0.5 means the scores carry no order information, 1.0 means every positive outranks every negative. It is threshold-free and insensitive to calibration — which is both its virtue and its blind spot.
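The probabilistic reading translates directly into code: count the positive/negative pairs that are ranked correctly. This is a sketch on invented data; counting ties as half a win is a common convention assumed here, and the quadratic loop is for clarity rather than speed.

```python
def auc_pairwise(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Invented data: three positives, three negatives, one ranking mistake.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
auc = auc_pairwise(scores, labels)
print(f"AUC = {auc:.3f}")            # 8 of 9 pairs are ordered correctly

# A monotone rescaling wrecks calibration but leaves the ranking intact.
squashed = [s / 10 for s in scores]
auc_squashed = auc_pairwise(squashed, labels)
print(f"AUC after rescaling = {auc_squashed:.3f}")
```

The rescaled scores no longer look like probabilities at all, yet the AUC is unchanged, which is the calibration blind spot in miniature.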

05 · The fine print
When PR curves beat ROC

ROC's x-axis is FPR = FP/(FP+TN), and under heavy imbalance TN is astronomical. A fraud model that fires 10,000 false alarms against 10 million legitimate transactions has FPR = 0.1% — invisible on the ROC plot — while its precision may be a catastrophic 5%. The ROC curve looks immaculate because the negatives absorb any number of false positives into the denominator.
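The two numbers in the fraud example can be reproduced as follows. The false-alarm and legitimate-transaction counts are the ones quoted above; the count of frauds caught (526) is an assumption, picked only so that precision lands near the 5% mentioned.

```python
legit_transactions = 10_000_000   # FP + TN, from the example
false_alarms = 10_000             # FP, from the example
frauds_caught = 526               # TP: assumed, chosen so precision is about 5%

fpr = false_alarms / legit_transactions
precision = frauds_caught / (frauds_caught + false_alarms)
print(f"FPR = {fpr:.2%}, precision = {precision:.1%}")
# prints: FPR = 0.10%, precision = 5.0%

# Doubling the false alarms barely moves FPR but roughly halves precision.
fpr_doubled = (2 * false_alarms) / legit_transactions
precision_doubled = frauds_caught / (frauds_caught + 2 * false_alarms)
print(f"FPR = {fpr_doubled:.2%}, precision = {precision_doubled:.1%}")
```

Same false positives, two denominators: one with ten million negatives in it, one with only the calls that were actually made.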

The precision–recall curve replaces FPR with precision, which has FP in a small denominator (TP+FP, the calls you actually made) and therefore feels every false alarm. Rule of thumb:

| Situation | Prefer | Why |
|---|---|---|
| Roughly balanced classes; both error types matter | ROC / AUC | Stable, threshold-free, comparable across datasets. |
| Heavy imbalance; the positive class is the point | PR curve / AUPRC | Precision exposes false-alarm cost that FPR hides; the chance baseline (= positive rate) keeps you honest. |
| One deployed operating point | P, R at that threshold (+ a CI) | Users experience a threshold, not a curve. |

The deeper pattern is the one from Bayes: precision is a posterior, P(truly positive | flagged), so it depends on the base rate; TPR and FPR are likelihoods, conditioned on the true class, so they do not. Choose the metric that conditions the way your deployment does.
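Bayes' rule makes the dependence explicit: precision = TPR·π / (TPR·π + FPR·(1 − π)), where π is the base rate. A minimal sketch, with a hypothetical detector (TPR 0.90, FPR 0.01) and three illustrative base rates, none of which come from this note:

```python
def precision_from_rates(tpr, fpr, base_rate):
    """P(truly positive | flagged) from the two likelihoods and the prior."""
    flagged_pos = tpr * base_rate            # P(flagged and positive)
    flagged_neg = fpr * (1.0 - base_rate)    # P(flagged and negative)
    return flagged_pos / (flagged_pos + flagged_neg)


# Hypothetical detector: its TPR and FPR stay fixed as the population changes.
tpr, fpr = 0.90, 0.01

for base_rate in (0.5, 0.01, 0.0001):
    p = precision_from_rates(tpr, fpr, base_rate)
    print(f"base rate={base_rate:<7}  precision={p:.3f}")
```

The same detector goes from almost always right to almost always wrong as positives become rare, while its ROC curve would not move at all.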

Mental Model