// learn.shawon.ch / research-methods / p-values-and-effect-size STUDY GUIDE

Research methods · The numbers

p-values, CIs, effect size & power

Try this first

A 50,000-person trial compares two diets and reports a statistically significant average weight difference of 0.3 lb after a year. Are you impressed? Before reading on, decide what the word "significant" is actually telling you here — and what it isn't.

Here's the trick the headline plays. With 50,000 people, even a difference of a third of a pound has almost no wiggle room — the measurement is so precise that you can rule out "exactly zero" with confidence. So the study is right: the effect is probably real. It's also useless. Nobody changes their diet for 0.3 lb a year. "Statistically significant" answered a question you didn't ask. It told you the effect is probably not zero. It said nothing about whether it's big enough to care about. Those are different questions, and the second one is the one your body, your wallet, and your time actually live in.
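Here is a rough sketch of the arithmetic behind that claim. The example gives no spread, so the 15 lb standard deviation and the even 25,000/25,000 split are assumptions made up for illustration. Only the 0.3 lb difference and the 50,000 total come from the example. It uses a normal approximation, not whatever test a real trial would run.

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a z statistic under a normal approximation."""
    return math.erfc(abs(z) / math.sqrt(2))

diff_lb = 0.3        # from the example
n_per_arm = 25_000   # assumption: 50,000 people split evenly
sd_lb = 15.0         # assumption: spread of one-year weight change (not given)

se = sd_lb * math.sqrt(2 / n_per_arm)   # standard error of the difference in means
z = diff_lb / se
p = two_sided_p(z)
cohens_d = diff_lb / sd_lb              # effect size in standard-deviation units

print(f"SE = {se:.3f} lb, z = {z:.2f}, p = {p:.3f}, d = {cohens_d:.2f}")
```

Under these assumptions p lands below 0.05 while the standardized effect is about 0.02: not zero, and not worth caring about.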

The one idea

Read any result in this order: (1) how big is the effect — is it enough to matter? (2) how precise is it — read the confidence interval; could the truth be nothing, or could it be huge? (3) only then, the p-value — is it probably not just chance? Significance is a yes/no about chance, not a measure of size. Lead with size and precision; the p-value is the last thing you look at, not the first.
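The reading order can be sketched as a small checklist function. This is an illustration, not a standard procedure: the function name, the messages, and the `smallest_meaningful` threshold are all hypothetical, and the threshold is a judgment call you have to bring yourself.

```python
def read_result(effect, ci_low, ci_high, p_value, smallest_meaningful, alpha=0.05):
    """Return three notes in reading order: size, then precision, then p-value."""
    notes = []

    # 1. Size: is the point estimate big enough to matter?
    if abs(effect) >= smallest_meaningful:
        notes.append("size: big enough to matter")
    else:
        notes.append("size: too small to matter")

    # 2. Precision: read the ends of the interval, not the midpoint.
    if ci_low <= 0 <= ci_high:
        notes.append("precision: interval includes zero, can't tell yet")
    elif min(abs(ci_low), abs(ci_high)) >= smallest_meaningful:
        notes.append("precision: whole interval is meaningful")
    else:
        notes.append("precision: clear of zero, but may be too small to matter")

    # 3. Only then, the p-value.
    if p_value < alpha:
        notes.append("chance: surprising if there were no effect")
    else:
        notes.append("chance: not ruled out")
    return notes

# Made-up numbers: a 1.5-point effect when only 5 points or more would matter.
verdict = read_result(1.5, 0.9, 2.1, 0.001, smallest_meaningful=5.0)
for line in verdict:
    print(line)
```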

What each number actually answers

The three numbers feel interchangeable in headlines, but each answers a separate question, and confusing them is where most bad takes come from.

Three questions, three numbers
Number | Answers | What it does NOT tell you
Effect size | How big is the difference? (e.g. 0.3 lb, or 20% fewer events) | Whether it's real or noise
Confidence interval | The range the true effect plausibly sits in: how precise the estimate is | That the single point estimate is exactly right
p-value | How surprising this result would be if there were truly no effect | How big the effect is, or that it matters

The confidence interval is the most underrated of the three, because it quietly contains the other two. A narrow interval sitting far from zero means a precise, real, sizeable effect. A wide interval that crosses zero means "we genuinely can't tell yet — the true effect could be nothing, or it could be large." Read the ends of the interval, not just the midpoint.

[Figure: two confidence-interval bars on an effect-size axis running from -10 to +30, with a "no effect" line at 0. The wide bar crossing 0 is labeled "could be nothing"; the narrow bar past 0 is labeled "clearly real."]
Read the whole bar, not the dot. Crossing zero = "can't tell yet."
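The two kinds of bar can be reproduced in a few lines. This is a minimal sketch assuming a normal approximation (estimate plus or minus 1.96 standard errors); the estimates and standard errors are made-up numbers, not read off any study.

```python
def ci95(estimate, se):
    """Approximate 95% confidence interval: estimate +/- 1.96 standard errors."""
    half_width = 1.96 * se
    return estimate - half_width, estimate + half_width

def crosses_zero(interval):
    """True when 'no effect' is still inside the plausible range."""
    low, high = interval
    return low <= 0 <= high

wide = ci95(10, 8)     # noisy estimate: large standard error
narrow = ci95(20, 2)   # precise estimate: small standard error

for name, interval in (("wide", wide), ("narrow", narrow)):
    low, high = interval
    status = "can't tell yet" if crosses_zero(interval) else "clear of zero"
    print(f"{name}: {low:.2f} to {high:.2f} ({status})")
```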

Significant is not the same as large

The p-value depends heavily on sample size. Pour in enough subjects and a trivial effect crosses the p < 0.05 line, because a huge sample shrinks the noise until even a 0.3 lb difference stands out from zero. "Statistically significant" in a 50,000-person study can mean "real and tiny." So significance is a reason to look closer at a finding, but nowhere near sufficient to act on it.
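To watch the sample size do the work, hold the effect fixed and vary only n. This is a sketch under assumptions: a 15 lb standard deviation (invented for illustration), equal arms, and a normal approximation.

```python
import math

def p_for_fixed_effect(diff, sd, n_per_arm):
    """Two-sided p-value for a fixed difference in means (normal approximation)."""
    se = sd * math.sqrt(2 / n_per_arm)
    return math.erfc(abs(diff / se) / math.sqrt(2))

DIFF_LB = 0.3   # same tiny effect every time
SD_LB = 15.0    # assumption: spread of weight change

sizes = (100, 1_000, 10_000, 25_000, 100_000)
p_values = [p_for_fixed_effect(DIFF_LB, SD_LB, n) for n in sizes]
for n, p in zip(sizes, p_values):
    print(f"n per arm = {n:>7,}: p = {p:.4f}")

# Smallest n per arm at which this effect reaches p < 0.05 (z = 1.96).
n_needed = math.ceil(2 * (1.96 * SD_LB / DIFF_LB) ** 2)
print(f"needs roughly {n_needed:,} people per arm")
```

The effect never changes; only the p-value does. Under these assumptions it takes roughly 19,200 people per arm before 0.3 lb counts as "significant."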

Small studies that win tend to lie big

The opposite failure is sneakier. Power is a study's ability to detect a real effect; small samples have low power. A low-powered study mostly misses real effects — but on the occasions it does hit significance, it can only do so by landing on an unusually large estimate, because a modest one wouldn't have cleared the bar with so few people. So the "positive" results that survive from small studies are systematically inflated. This is the winner's curse: the splashy small study that gets shared is exactly the one most likely to have overstated the effect. When the bigger replication arrives, the effect usually shrinks.
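A small simulation makes the winner's curse visible. Everything here is an assumption chosen for illustration: a true effect of 0.2 standard deviations, 20 people per arm, a known standard deviation of 1 (so a plain z-test applies), and 5,000 imaginary studies.

```python
import math
import random
import statistics

random.seed(0)

TRUE_EFFECT = 0.2   # assumption: real difference, in standard-deviation units
N_PER_ARM = 20      # assumption: a small study
SD = 1.0            # assumption: known spread, so a simple z-test is enough
N_STUDIES = 5_000

se = SD * math.sqrt(2 / N_PER_ARM)   # standard error of the difference in means
significant = []
for _ in range(N_STUDIES):
    control = [random.gauss(0.0, SD) for _ in range(N_PER_ARM)]
    treated = [random.gauss(TRUE_EFFECT, SD) for _ in range(N_PER_ARM)]
    estimate = statistics.fmean(treated) - statistics.fmean(control)
    if abs(estimate / se) > 1.96:    # two-sided p < 0.05
        significant.append(estimate)

power = len(significant) / N_STUDIES
mean_winner = statistics.fmean([abs(e) for e in significant])

print(f"true effect:                 {TRUE_EFFECT}")
print(f"share reaching significance: {power:.1%}")
print(f"mean size among the winners: {mean_winner:.2f}")
```

With these settings an estimate has to exceed about 0.62 in size to reach significance at all, so every "winner" reports at least three times the true effect of 0.2.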

Work one, then finish one

Worked: A trial of 40,000 people finds a supplement lowers a risk score by 1.5%, with a 95% CI of 0.9% to 2.1%, and p < 0.001. Walk the order. Size: 1.5%, trivial in real terms. CI: 0.9% to 2.1%, narrow and clear of zero, so the effect is precisely measured and almost certainly real. p-value: tiny, so a result like this would be very surprising if there were truly no effect. Verdict: real but practically worthless. The huge sample bought you certainty about an effect too small to change a single decision. The p-value's drama is doing all the marketing work.
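You can check that the worked numbers hang together by backing the standard error out of the interval. This is a sketch assuming the CI was built as estimate plus or minus 1.96 standard errors (a normal approximation); a real trial may have used a different method.

```python
import math

estimate, ci_low, ci_high = 1.5, 0.9, 2.1   # from the worked example, in percent

se = (ci_high - ci_low) / (2 * 1.96)        # half-width of the interval / 1.96
z = estimate / se
p = math.erfc(abs(z) / math.sqrt(2))        # two-sided p, normal approximation

print(f"SE = {se:.3f}, z = {z:.1f}, p = {p:.1e}")
```

The implied p-value comes out well under 0.001, matching the report, and it adds nothing the interval had not already told you.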

Your turn: A small study reports "20% improvement, p = 0.09, n = 18." How do you read it? (The p-value is above the usual 0.05 line, so by the conventional rule it isn't even "significant" — but with only 18 people the study is badly underpowered, so its confidence interval is very wide. The 20% could be a real and useful effect, or mostly noise; you genuinely can't tell yet. Treat it as a hint that deserves a bigger trial, not as evidence to act on — and remember that if it had hit significance, the winner's curse would make that 20% likely inflated.)
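The study did not report an interval, but a rough one can be backed out of the p-value. This is a sketch assuming a two-sided test and a normal approximation; with only 18 people a t-based interval would be somewhat wider, so treat the result as a floor on the uncertainty.

```python
from statistics import NormalDist

estimate = 20.0   # percent improvement, from the exercise
p_value = 0.09    # from the exercise

z = NormalDist().inv_cdf(1 - p_value / 2)   # z statistic implied by a two-sided p
se = estimate / z                           # implied standard error
ci_low = estimate - 1.96 * se
ci_high = estimate + 1.96 * se

print(f"implied 95% CI: {ci_low:.1f}% to {ci_high:.1f}%")
```

Under those assumptions the interval runs from slightly below zero to above 40%: anything from nothing to huge.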

Why this matters

This is the exact move that defuses supplement marketing. A brand runs a study, finds a statistically significant effect on a surrogate marker — a blood number, a self-reported energy score — and the label trumpets "clinically proven." But significance on a tiny effect, or on a marker that isn't the outcome you care about, is precisely the result that looks impressive and means little. Before you spend money on the longevity capsule or the "proven" pre-workout, ask the three questions in order: How big was the effect, in units I care about? How wide was the confidence interval — could it be nothing? And only then, was it unlikely by chance? Most of the time the honest answer is "small, uncertain, and on the wrong outcome" — and that's worth knowing before you reach for your card.

Recall check · no peeking

  1. What does "statistically significant" tell you, and what does it specifically not tell you?
  2. You see a confidence interval that is wide and crosses zero. What does that mean about what you can conclude?
  3. What is the difference between effect size and significance — and what does low statistical power do to a study's "positive" results?

Explain it back

In one plain sentence, explain to a friend why a result can be "statistically significant" and still not matter at all.

Learn · Shawon Chowdhury · a study guide, kept rough on purpose