Research methods · The literature

The replication crisis & p-hacking

Try this first

I hand a clever analyst a dataset of pure random noise — no real effects in it — and a free hand to slice it however they like. Can they hand me back a "statistically significant" result? Decide yes or no before reading on.

The answer is almost always yes. Here's why. Imagine the analyst has 100 columns of random numbers and one column they want to "explain." They try the first 20 columns as predictors — nothing. They drop a few outliers — nothing. They split the data by sex, then by age — and there it is: among men under 40, one column lines up beautifully, p = 0.03. Significant. Publishable. And completely fake, because the data was noise from the start. Nobody faked a number. The analyst just kept looking until noise coughed up a pattern, then reported the one cut that worked as if it were the plan all along.

The one idea

A single significant study is the beginning of an investigation, not the verdict. Between raw data and a published p-value sit dozens of small choices — which outcome, which exclusions, when to stop, which subgroup — and that freedom is enough to find "significance" in pure noise. So treat one splashy finding as a lead to be replicated, not a fact to act on.

Researcher degrees of freedom

No study runs itself. At every step a human makes a defensible-sounding choice: which of several outcomes counts as the result, which participants to exclude as "outliers," whether to log-transform a variable, when to stop collecting data, which subgroup to spotlight. Each choice is individually reasonable. Collectively they form a branching tree of analyses — the garden of forking paths — and the researcher, consciously or not, can walk down whichever branch ends in p < 0.05. Doing this on purpose is p-hacking; doing it while sincerely believing you followed the obvious path is the same statistics with a clearer conscience.

Two habits make it worse. HARKing — Hypothesizing After the Results are Known — dresses an after-the-fact pattern as a prediction the study set out to test, erasing the fact that you tried twenty other ideas first. Subgroup-fishing reports the one slice ("women over 50") that reached significance and quietly buries the dozen slices that didn't. Both turn an exploratory hunch into the costume of a confirmed result.

One noisy dataset, many paths — only the path that "worked" gets reported.

Why so much fails to replicate

When teams went back and re-ran large batches of published studies under stricter rules, a sobering share didn't hold up. A landmark psychology project re-ran 100 well-known experiments and reproduced only about a third to a half of the original effects, usually weaker. A cancer-biology replication effort hit similar trouble, often unable to even rerun the methods as written. The cause isn't mass fraud — it's the forking paths plus publication bias: journals print the exciting "significant" cut and the boring null results sit in a drawer, so the published literature is a highlight reel of the luckiest analyses.

Exploration dressed as confirmation
The move	What it does	The honest version
p-hacking	Try many analyses, report the one under .05	"We explored; here's a lead"
HARKing	Present a post-hoc pattern as a prediction	"We noticed this after the fact"
Subgroup-fishing	Spotlight the one slice that hit	"One of many slices was significant"
Optional stopping	Stop collecting once p dips under .05	"We set the sample size in advance"

The main fix: commit before you look

Pre-registration is the antidote. Before collecting (or unlocking) the data, the researchers write down — in a public, timestamped record — their exact hypothesis, their one primary outcome, their exclusion rules, and their planned analysis. Now there is no garden to wander: there's one path, chosen in advance. Anything the data later suggests is fine to report, but it's labelled exploratory, a lead for the next study, not a confirmed result. When you read a strong claim, the single most useful question is: was this outcome pre-specified, or found after the fact?

Work one, then finish one

Worked: the multiple-comparisons trap. A team tests 20 different outcomes, each at the usual p < 0.05 threshold. Even if nothing is real, "significant at 0.05" means roughly a 1-in-20 false alarm rate per test — so across 20 independent tests you expect about one false positive by chance alone (1 in 20 × 20 ≈ 1). The team then writes up only that one "hit" and the paper reads like a genuine discovery. The fix is to correct for all the tests you ran, and to pre-register which outcome was the real target.

Your turn: a supplement trial's headline benefit isn't in its main result — it's in a subgroup, "women over 50," that was not the pre-specified outcome. What do you suspect, and what would change your mind? (Suspect a post-hoc subgroup found by fishing — likely noise from many slices tested. It only earns belief once a new, pre-registered study confirms that specific subgroup in advance.)

Why this matters

A lot of the nutrition and psychology research that supplement brands, wellness podcasters, and Reddit threads cite comes straight from this flexible, pre-replication literature — splashy single studies, generous subgroups, outcomes that may have been picked after the fact. So when you're deciding whether to spend money and months on a magnesium protocol, a nootropic, or a "this one study proves it" diet, treat that one study as a lead, not a conclusion. Ask whether it was pre-registered, whether the headline effect was the primary outcome, and whether anyone has independently replicated it. If the whole case rests on a single significant result, you're looking at the start of an investigation that nobody finished.

Recall check · no peeking

What are "researcher degrees of freedom," and how do they let p-hacking find significance in pure noise?
What do HARKing and subgroup-fishing each do to make exploration look like confirmation?
How does pre-registration shut the garden of forking paths?

Explain it back

In one plain sentence, explain to a friend why testing enough different things almost guarantees you'll find a "significant" result even when nothing is really there.

Learn · Shawon Chowdhury · a study guide, kept rough on purpose