Research methods · The numbers

Base rates & why most findings can be false

Try this first

A lab screens 1,000 useless compounds for an effect on some disease — every one of them is genuinely inert. Each is tested at the usual p < 0.05. Roughly how many will come back looking like a significant winner, and how many of those actually work?

The trap in the hook is that p < 0.05 is not a promise that a result is real — it's an agreement to accept a 5% rate of false alarms when nothing is going on. Run that gamble 1,000 times on inert compounds and about 5% of them, roughly 50, will clear the bar by chance alone. None of the 50 works, because none of the 1,000 ever could. So every single "hit" is false. The p-value never lied; it answered a question about chance, not about truth. Whether a significant result is actually true depends on something the p-value never sees: how plausible the claim was to begin with.
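To see the 50 false alarms appear, here is a rough simulation sketch. It leans on one assumption the guide does not state: when a compound is truly inert, its p-value is equally likely to land anywhere between 0 and 1. The function name and the seed are made up for illustration.

```python
import random

ALPHA = 0.05         # significance threshold from the text
N_COMPOUNDS = 1000   # every compound is inert, as in the hook


def count_false_alarms(n, alpha, seed):
    """Simulate n tests of inert compounds and count how many reach p < alpha.

    Assumption: under a true null hypothesis the p-value is uniform on [0, 1).
    """
    rng = random.Random(seed)
    return sum(1 for _ in range(n) if rng.random() < alpha)


expected_hits = N_COMPOUNDS * ALPHA
simulated_hits = count_false_alarms(N_COMPOUNDS, ALPHA, seed=1)

print(f"expected false alarms:  {expected_hits:.0f}")
print(f"simulated false alarms: {simulated_hits}")
print("hits that actually work: 0 (nothing in the screen could work)")
```

The simulated count wobbles around 50 from seed to seed, but the number of real discoveries is zero every time.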

The one idea

A "significant" result is only as trustworthy as the prior plausibility of the claim it supports. Combine a low base rate with ordinary error rates and most of your significant findings are false positives — the share that are actually true (the positive predictive value) can sit well under 50%. This is the math behind "extraordinary claims need extraordinary evidence."
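The same idea can be written as one formula. The symbols are assumptions introduced here, not the guide's own notation: \pi is the prior (the share of tested claims that are true), 1 - \beta is the power, and \alpha is the false-positive rate.

```latex
\[
\mathrm{PPV}
  = \frac{\text{true positives}}{\text{true positives} + \text{false positives}}
  = \frac{\pi\,(1-\beta)}{\pi\,(1-\beta) + (1-\pi)\,\alpha}
\]
```

As \pi shrinks, the numerator shrinks with it while the false-positive term in the denominator approaches \alpha, so the PPV slides toward zero even though power and \alpha have not changed.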

Put numbers on it

The hook is the extreme case — a prior of zero. Real science is kinder, so let's use realistic numbers. Suppose a field tests 1,000 hypotheses where 10% are genuinely true (a reasonable guess for a field chasing real but hard effects) and the studies have 80% power — an 80% chance of catching a true effect when it's there. Set the false-positive rate at the conventional 5%. Now follow the four buckets.

Of the 1,000 hypotheses, 100 are true and 900 are false. Among the 100 true ones, 80% power detects 80 (the other 20 are real but missed — false negatives). Among the 900 false ones, a 5% false-positive rate flags 45 as significant by chance. So the pile of "significant" results is 80 + 45 = 125, and 45 of them — about 36% — are wrong. Roughly a third of your significant findings are false even under decent conditions.
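The four buckets can be sketched in a few lines of Python. The function and key names are invented for illustration; the inputs are the ones used in the paragraph above.

```python
def four_buckets(n, prior, power, alpha):
    """Split n tested hypotheses into the four outcome buckets (expected counts)."""
    n_true = n * prior            # hypotheses that are really true
    n_false = n - n_true          # hypotheses that are really false
    true_pos = n_true * power     # real effects that reach significance
    false_neg = n_true - true_pos # real effects that are missed
    false_pos = n_false * alpha   # null effects flagged by chance
    true_neg = n_false - false_pos
    return {
        "true_pos": true_pos,
        "false_neg": false_neg,
        "false_pos": false_pos,
        "true_neg": true_neg,
    }


buckets = four_buckets(n=1000, prior=0.10, power=0.80, alpha=0.05)
significant = buckets["true_pos"] + buckets["false_pos"]
false_share = buckets["false_pos"] / significant

for name, count in buckets.items():
    print(f"{name:>9}: {count:.0f}")
print(f"significant pile: {significant:.0f}, share false: {false_share:.0%}")
```

The four buckets always add back up to the number of hypotheses tested, which is a handy check on the arithmetic.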

1,000 tested → 80 true + 45 false hits → over a third of "significant" results are wrong.

The lower the prior, the worse it gets

Notice what's doing the work. Power and the p-threshold are held fixed; the prior drives everything. Drop the share of true hypotheses and the false-positive pile creeps up toward its ceiling of 50 per 1,000 while the true pile shrinks, so the false fraction balloons. Watch the same study quality applied to claims of different plausibility.

Same 80% power, same 5% false-positive rate — only the prior changes
Prior (true rate)	True positives (per 1,000)	False positives (per 1,000)	Share of hits that are false
50% (a plausible idea)	400	25	~6%
10% (our worked case)	80	45	~36%
1% (a wild claim)	8	49.5	~86%
0.1% (an "extraordinary" claim)	0.8	~50	~98%

The arithmetic is the same each time: true positives are 1000 × prior × 0.80, false positives are 1000 × (1 − prior) × 0.05. At a 1% prior you get 8 real hits against 49.5 false ones, so roughly six of every seven significant results are wrong. Push the prior down to 0.1% and a single p < 0.05 result leaves the claim with less than a 2% chance of being true (0.8 real hits out of about 50.8 significant ones).
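That arithmetic is easy to rerun for any prior. A minimal sketch, with constant and function names chosen here for illustration, that recomputes the four priors discussed in this section:

```python
N = 1000       # hypotheses tested
POWER = 0.80   # chance of detecting a true effect
ALPHA = 0.05   # false-positive rate
PRIORS = [0.50, 0.10, 0.01, 0.001]


def expected_hits(prior):
    """Return (true positives, false positives) expected at this prior."""
    true_pos = N * prior * POWER
    false_pos = N * (1 - prior) * ALPHA
    return true_pos, false_pos


rows = []
for prior in PRIORS:
    true_pos, false_pos = expected_hits(prior)
    share_false = false_pos / (true_pos + false_pos)
    rows.append((prior, true_pos, false_pos, share_false))

print(f"{'prior':>7} {'true +':>8} {'false +':>8} {'share false':>12}")
for prior, true_pos, false_pos, share_false in rows:
    print(f"{prior:>7.1%} {true_pos:>8.1f} {false_pos:>8.2f} {share_false:>12.0%}")
```

Swap in your own guesses for POWER and ALPHA to see how little they matter next to the prior.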

Work one, then finish one

Worked: Start from our case — prior 10%, power 80%, false-positive rate 5%, on 1,000 hypotheses. True positives: 1000 × 0.10 × 0.80 = 80. False positives: 1000 × 0.90 × 0.05 = 45. Significant pile: 80 + 45 = 125. Positive predictive value — the chance a significant result is real — is 80 / 125 = 0.64. So 36% of "significant" findings are false even with solid power and a respectable one-in-ten prior. Now halve the prior to 5% and redo it: true positives 1000 × 0.05 × 0.80 = 40, false positives 1000 × 0.95 × 0.05 = 47.5, PPV 40 / 87.5 ≈ 46% — now most of your hits are false. The lower the prior, the worse it gets.
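One way to see why halving the prior tips the balance is to ask where the tipping point sits. The sketch below is an add-on, not part of the original worked example: it sets true positives equal to false positives and solves for the prior, which gives prior = alpha / (alpha + power). The function names are hypothetical.

```python
def ppv(prior, power=0.80, alpha=0.05):
    """Chance that a significant result is real, given the prior."""
    true_pos = prior * power
    false_pos = (1 - prior) * alpha
    return true_pos / (true_pos + false_pos)


def breakeven_prior(power=0.80, alpha=0.05):
    """Prior at which PPV is exactly 50%.

    Derived from prior * power == (1 - prior) * alpha.
    """
    return alpha / (alpha + power)


tipping_point = breakeven_prior()

print(f"PPV at a 10% prior: {ppv(0.10):.0%}")
print(f"PPV at a  5% prior: {ppv(0.05):.0%}")
print(f"prior where PPV hits 50%: {tipping_point:.1%}")
```

With 80% power and a 5% false-positive rate the tipping point lands a little under 6%, so a 10% prior sits on the safe side of it and a 5% prior just on the wrong side.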

Your turn: A single significant study reports that a rare herb cures a common condition that nothing else has ever touched. Why should your belief barely move? (Because the prior plausibility is very low: if an herb really cured a common condition, decades of medicine would likely have found it. So even a genuine p < 0.05 sits in a high-false-positive regime, and one study can't lift a tiny prior far. The claim needs independent replication before the PPV climbs above a coin flip.)
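Here is a rough sketch of how replication moves belief. It rests on assumptions that are not in the exercise: a starting prior of 1% (picked purely for illustration), every study independent, every study run at 80% power and a 5% false-positive rate, and no publication bias. After each significant result, the PPV becomes the prior for the next study.

```python
def update(belief, power=0.80, alpha=0.05):
    """Belief that the claim is true after one more significant result."""
    true_pos = belief * power
    false_pos = (1 - belief) * alpha
    return true_pos / (true_pos + false_pos)


def belief_after(prior, n_significant, power=0.80, alpha=0.05):
    """Track belief across a run of independent significant studies."""
    beliefs = []
    belief = prior
    for _ in range(n_significant):
        belief = update(belief, power, alpha)
        beliefs.append(belief)
    return beliefs


# Hypothetical starting point: a wild claim with a 1% prior.
history = belief_after(prior=0.01, n_significant=3)

for study_number, belief in enumerate(history, start=1):
    print(f"after significant study {study_number}: {belief:.0%} likely true")
```

Under these idealized assumptions a single study leaves the claim well under a coin flip, and it takes a second independent success to push it past 50%. Real replications are rarely this clean, so treat the numbers as an upper bound on how fast belief should climb.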

Why this matters

This is the move that protects your wallet and your body from the supplement and biohacking world. The pitch is always "there's a study." And often there really is one — a single small trial with a p < 0.05 on some exotic compound, peptide, or longevity stack. But these claims start from a very low prior: most novel compounds don't do the dramatic thing claimed, and the splashy ones are exactly the ones a low base rate predicts will mostly be false. So before you spend money or swallow something, ask not just "was it significant?" but "how plausible was this before the study, and has anyone independently replicated it?" One significant study on a wild claim is weak evidence; a real effect survives replication. That single question deflates most of what gets sold to you.

Recall check · no peeking

What does positive predictive value mean here, and how does the base rate (prior) enter the calculation of whether a significant result is true?
Two studies have identical power and the same p-threshold, but one tests a plausible claim and one a wild claim. Why is the wild one's significant result far more likely to be false?
Why does screening a thousand hypotheses at p < 0.05 guarantee a pile of false positives even when nothing real is there?

Explain it back

In one plain sentence, tell a friend why a p < 0.05 result supporting a wild claim is probably still wrong.

Learn · Shawon Chowdhury · a study guide, kept rough on purpose