P-Hacking and the Reproducibility Crisis

Last updated: March 2026 · Intermediate

The phrase “statistically significant” is supposed to be a seal of quality — evidence that a finding is real, not just noise. But over the past two decades, the scientific community has discovered that the way significance is sometimes achieved, reported, and published has undermined the reliability of a staggering amount of published research. Understanding how this happens does not require advanced statistics — it requires knowing a few key concepts and developing the habit of asking the right questions.

What Is Statistical Significance?

Before we can understand how significance gets abused, we need to understand what it means.

When researchers run an experiment, they start with a null hypothesis — the assumption that there is no real effect. For a new drug, the null hypothesis is “this drug has no effect compared to placebo.” The study then collects data and calculates a p-value.

The p-value is the probability of seeing results at least as extreme as the observed data, assuming the null hypothesis is true.

\text{p-value} = P(\text{data this extreme or more} \mid \text{null hypothesis is true})

By convention, if p < 0.05, the result is declared statistically significant. The reasoning: if there really were no effect, there would be only a 5% chance of seeing data this extreme, so we reject the null hypothesis and conclude the effect is probably real.

Critical misconceptions to avoid:

  • A p-value of 0.03 does not mean “there is a 3% chance the result is wrong.” It means “if the null hypothesis were true, data this extreme would occur 3% of the time.”
  • A p-value does not tell you how large or important the effect is — only that it is unlikely to be zero.
  • The 0.05 threshold is a convention, not a law of nature. There is nothing magical about the boundary between p = 0.049 (significant!) and p = 0.051 (not significant).
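
To see the 5% error rate in action, here is a minimal simulation sketch (Python, standard library only; `welch_p` is an illustrative helper that uses a normal approximation to the t distribution, not a library function). Both groups are drawn from the same distribution, so every "significant" result is a false positive:

```python
import math
import random
import statistics

def welch_p(a, b):
    """Approximate two-sided p-value for a two-sample Welch t-test,
    using a normal approximation (reasonable for n >= 30 per group)."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    se = math.sqrt(va / len(a) + vb / len(b))
    t = (ma - mb) / se
    return 2 * (1 - 0.5 * (1 + math.erf(abs(t) / math.sqrt(2))))

random.seed(42)
n_sims, alpha, hits = 2000, 0.05, 0
for _ in range(n_sims):
    # Both groups come from the SAME distribution: the null is true.
    a = [random.gauss(0, 1) for _ in range(30)]
    b = [random.gauss(0, 1) for _ in range(30)]
    if welch_p(a, b) < alpha:
        hits += 1

print(f"False-positive rate: {hits / n_sims:.3f}")  # hovers near 0.05
```

The rate lands close to the nominal 5%, exactly as the definition of the p-value predicts.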

What Is P-Hacking?

P-hacking is the practice of manipulating data analysis — often unconsciously — until a p-value below 0.05 is found. Researchers have many choices during analysis (which variables to include, how to define groups, which statistical test to use, whether to remove outliers), and each choice can nudge the p-value up or down. If enough choices are tried, a “significant” result can almost always be found, even when no real effect exists.

Common p-hacking techniques include:

  • Testing many different outcomes: A study on a new diet might measure weight, BMI, waist circumference, blood pressure, cholesterol, blood sugar, mood, and sleep quality. With eight outcomes, the chance of finding at least one with p < 0.05 by chance alone is substantial.
  • Adding or removing data points: Including or excluding certain participants based on criteria decided after seeing the data.
  • Testing many subgroups: Does the drug work for men? Women? People over 50? People under 50? People with high baseline levels? Each additional test is another opportunity for a false positive.
  • Trying different statistical tests: A t-test, a Mann-Whitney U test, a regression with different covariates — each may produce a slightly different p-value.
  • Optional stopping: Checking the p-value after every 10 participants and stopping data collection as soon as it dips below 0.05.

The practice is also called data dredging or fishing; the analytic flexibility that makes it possible is known as “researcher degrees of freedom.”
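
Optional stopping is easy to simulate. The sketch below (Python; assumes data with known standard deviation 1 so a simple z-test applies, with `z_p` as an illustrative helper) peeks at the p-value after every 10 observations, up to 100, and stops as soon as it dips below 0.05, which inflates the error rate well above the nominal 5%:

```python
import math
import random

def z_p(xs):
    # Two-sided p-value for H0: mean = 0, assuming known sd = 1 (z-test)
    z = (sum(xs) / len(xs)) * math.sqrt(len(xs))
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

random.seed(1)
n_sims, stopped_early = 2000, 0
for _ in range(n_sims):
    xs = []
    for _ in range(10):                      # up to ten "peeks" of 10 each
        xs += [random.gauss(0, 1) for _ in range(10)]
        if z_p(xs) < 0.05:                   # stop as soon as p dips under 0.05
            stopped_early += 1
            break

print(f"False-positive rate with optional stopping: {stopped_early / n_sims:.3f}")
```

Even though every dataset comes from a true null, repeatedly checking and stopping on a "significant" result roughly quadruples the false-positive rate.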

Example 1: The Jelly Bean Study

This example illustrates the problem with vivid clarity.

A research team tests whether jelly beans cause acne. They run a study and find no link (p = 0.56). So they test each of the 20 jelly bean colors individually:

  • Red: p = 0.42
  • Blue: p = 0.71
  • Yellow: p = 0.38
  • Green: p = 0.03 (significant!)
  • …and 16 more with p > 0.05

The headline: “Study finds link between green jelly beans and acne!”

But here is the math. At α = 0.05, each test has a 5% chance of a false positive when there is no real effect. With 20 independent tests, the probability of getting at least one false positive is:

P(\text{at least one false positive}) = 1 - (1 - 0.05)^{20} = 1 - 0.95^{20}

Computing step by step: 0.95^10 ≈ 0.59874, and 0.59874^2 ≈ 0.3585. Therefore:

P(\text{at least one false positive}) = 1 - 0.3585 = 0.6415 \approx 64.2\%

There is a 64.2% chance of finding at least one “significant” result purely by chance when running 20 tests with no real effect. The green jelly bean finding is almost certainly a false positive, but it is the only result that gets published and reported.
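
The jelly bean arithmetic can be checked in a couple of lines:

```python
# Probability of at least one false positive across k independent tests,
# each at alpha = 0.05, when no real effect exists anywhere.
alpha = 0.05
k = 20
p_any = 1 - (1 - alpha) ** k
print(f"P(at least one false positive) = {p_any:.4f}")  # 0.6415
```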

The Multiple Comparisons Problem

The jelly bean example illustrates a broader issue: the multiple comparisons problem. Every additional statistical test increases the overall probability of a false positive.

Number of Tests    P(at least one false positive)
1                  5.0%
5                  22.6%
10                 40.1%
20                 64.2%
50                 92.3%
100                99.4%

The formula for each row: P = 1 - 0.95^k, where k is the number of independent tests.

At 100 tests, you are virtually guaranteed at least one false positive. This is why researchers who test many hypotheses must use correction methods — such as the Bonferroni correction, which divides the significance threshold by the number of tests:

\alpha_{\text{corrected}} = \frac{0.05}{k}

For 20 tests, the corrected threshold is 0.05/20 = 0.0025. The green jelly bean result at p = 0.03 would no longer be significant.
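
A quick check of the correction (a minimal sketch; the numbers mirror the jelly bean example):

```python
alpha, k = 0.05, 20
alpha_corrected = alpha / k
print(f"Corrected threshold: {alpha_corrected:.4f}")   # 0.0025

# The green jelly bean's p = 0.03 no longer clears the bar:
print(0.03 < alpha_corrected)                          # False

# The family-wise error rate across all 20 tests is pulled back under 5%:
familywise = 1 - (1 - alpha_corrected) ** k
print(f"Family-wise error rate: {familywise:.4f}")     # about 0.0488
```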

Publication Bias

P-hacking would be less damaging if the full picture of research were visible. But it is not — publication bias means that studies with positive, significant results are far more likely to be published than studies finding no effect.

This is called the file drawer problem: negative results end up in researchers’ file drawers, never published, never seen.

Here is how publication bias distorts knowledge:

  1. Twenty research labs independently test whether a new supplement improves memory.
  2. Nineteen labs find no significant effect. One lab — perhaps by chance, or by inadvertent p-hacking — finds p = 0.04.
  3. The nineteen null results are never published (journals reject them as “not interesting”).
  4. The one positive result is published in a reputable journal.
  5. The scientific literature now shows “one study found a significant effect” with no record of the nineteen failures.
  6. Media reports: “Scientists discover supplement that boosts memory!”

The published evidence appears to show a clear finding, but it is a mirage created by selective reporting. The true evidence — 19 out of 20 studies found nothing — tells the opposite story.
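
The file drawer scenario can be simulated directly. Under a true null, each lab's p-value is uniformly distributed on [0, 1], so the sketch below (Python, standard library) draws 20 uniform p-values per simulated "world" and counts how often at least one clears the publication bar:

```python
import random

random.seed(7)

n_worlds = 10_000
published_worlds = 0      # worlds where at least one lab "publishes"
total_published = 0       # total false positives across all worlds

for _ in range(n_worlds):
    # 20 labs test a supplement with NO real effect: p-values are uniform.
    p_values = [random.random() for _ in range(20)]
    published = [p for p in p_values if p < 0.05]   # journals take only p < 0.05
    total_published += len(published)
    if published:
        published_worlds += 1

print(f"Worlds with a published 'finding': {published_worlds / n_worlds:.3f}")
print(f"Average published false positives per world: {total_published / n_worlds:.2f}")
```

About 64% of these worlds end up with a published "positive" result, averaging one false positive each, while the nineteen null results stay in the file drawer.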

The Reproducibility Crisis

These problems are not theoretical. In recent years, large-scale efforts to replicate published findings have revealed that a troubling proportion of scientific results fail to reproduce.

Key findings from replication projects:

  • Psychology: The Open Science Collaboration (2015) attempted to replicate 100 published psychology studies. Only about 39% produced significant results in the replication attempt. The average effect size in the replications was roughly half of the original.
  • Cancer biology: Begley and Ellis (2012) reported that scientists at a biotech firm attempted to replicate 53 “landmark” cancer biology studies. Only 6 (about 11%) could be confirmed.
  • Economics: Camerer et al. (2016) replicated 18 economics studies from top journals. About 61% replicated, with effect sizes averaging 66% of the originals.

These numbers do not mean that most published research is “wrong.” Failed replications can have many causes — different populations, different conditions, or simply bad luck. But the pattern is clear: a significant portion of published research is less reliable than the confidence with which it was originally presented.

The causes are interconnected:

  • P-hacking inflates the number of false positives that get submitted for publication.
  • Publication bias filters out the null results that would provide counterbalance.
  • Career incentives reward novel, significant findings over careful replications.
  • Small sample sizes make studies underpowered, increasing the chance that significant results are inflated or false.

How to Evaluate Research Claims

You do not need a statistics degree to be a critical reader of research. Here are the key questions to ask:

Was the hypothesis specified before data collection?

Pre-registration means the researchers publicly recorded their hypothesis, methods, and analysis plan before collecting any data. This makes p-hacking much harder because the analysis decisions are locked in before the results are known. Pre-registered studies are more trustworthy than studies where the hypothesis could have been formed after looking at the data.

Has the study been replicated?

A single study, no matter how well designed, is a single data point. Replication by independent teams using different samples is the gold standard of scientific evidence. If a finding has been replicated multiple times, it is far more reliable than a one-off result.

Is the sample size large enough?

Small studies are more vulnerable to both p-hacking and genuine statistical noise. A study of 20 people that finds a dramatic effect should be treated with much more skepticism than a study of 2,000 finding a modest effect. The smaller the study, the larger the effect must be for us to take it seriously.

Were multiple comparisons corrected for?

If the study tested many outcomes or many subgroups, was a correction method applied? If the paper reports one significant finding out of twenty tests without mentioning a correction, that finding is suspect.

Is the effect size meaningful?

Statistical significance tells you the effect is probably not zero. It does not tell you the effect is large enough to matter. A study might find that a new teaching method improves test scores by 0.3 points on a 100-point scale with p = 0.01. That is statistically significant but practically meaningless — no student’s life is changed by 0.3 points.

Who funded the study?

Funding source does not automatically invalidate research, but it is relevant context. Industry-funded studies are more likely to produce results favorable to the funder. This is not necessarily due to fraud — it can result from subtle choices in study design, outcome selection, and reporting emphasis.

Statistical Significance vs. Practical Significance

This distinction is one of the most important in all of applied statistics.

Example 2: Statistically Significant but Clinically Meaningless

A pharmaceutical company tests a new blood pressure medication on 50,000 participants. Results:

  • Treatment group average: 119.5 mmHg systolic
  • Control group average: 120.0 mmHg systolic
  • Difference: 0.5 mmHg
  • p-value: 0.001

The result is highly statistically significant (p = 0.001). But the effect size — a 0.5 mmHg reduction in blood pressure — is clinically meaningless. Normal blood pressure varies by more than that from one measurement to the next. No doctor would prescribe a drug for a 0.5 mmHg benefit, regardless of the p-value.

How did such a tiny effect achieve significance? The massive sample size (n = 50,000). With enough data, even trivially small differences become statistically detectable. The p-value tells you the difference is probably not exactly zero. It does not tell you the difference matters.
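
A rough calculation reproduces this effect. The sketch below assumes a population standard deviation of 15 mmHg and an even split of 25,000 participants per arm (both values chosen for illustration), and uses a two-sample z-test:

```python
import math

def p_from_z(z):
    # Two-sided p-value from a z statistic (normal approximation)
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

diff = 0.5                 # 120.0 - 119.5 mmHg
sd = 15.0                  # assumed population SD (illustrative)
n_per_arm = 25_000         # 50,000 participants split evenly (assumed)

se = sd * math.sqrt(2 / n_per_arm)   # standard error of the mean difference
z = diff / se
print(f"z = {z:.2f}, p = {p_from_z(z):.4f}")   # tiny effect, clearly "significant"
```

A 0.5 mmHg difference that no clinician would care about produces a z statistic near 3.7 — comfortably past any conventional significance threshold, purely because of the sample size.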

The rule: Always ask two questions about any finding. First, is it statistically significant (p-value)? Second, is the effect size large enough to matter in practice? Both must be true for the finding to be actionable.

Effect Size Guidelines

While context-dependent, here are common conventions for interpreting effect sizes (Cohen’s d):

Effect Size (d)    Interpretation
0.2                Small effect — detectable but may not be practically important
0.5                Medium effect — noticeable and often meaningful
0.8                Large effect — obvious practical importance

A blood pressure reduction of 0.5 mmHg, in a population with a standard deviation of 15 mmHg, corresponds to d = 0.5/15 ≈ 0.033 — far below even a “small” effect.
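
The conversion is simple enough to script (`cohens_d` and `label` are illustrative helpers, not a standard API; the thresholds follow the table above):

```python
def cohens_d(mean_diff, pooled_sd):
    """Standardized effect size: difference in means divided by pooled SD."""
    return mean_diff / pooled_sd

def label(d):
    """Classify |d| using Cohen's conventional cutoffs."""
    d = abs(d)
    if d >= 0.8:
        return "large"
    if d >= 0.5:
        return "medium"
    if d >= 0.2:
        return "small"
    return "negligible"

d = cohens_d(0.5, 15)          # the blood pressure example from the text
print(round(d, 3), label(d))   # 0.033 negligible
```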

Real-World Application: Nursing — Reading Medical Research Critically

A journal article reports: “New wound care protocol significantly reduces infection rates (p = 0.04, n = 60).”

A nurse evaluating this claim should ask:

What was the effect size? If infection rates dropped from 12% to 10%, the absolute reduction is 2 percentage points. That might matter — but with only 60 patients, this translates to a difference of about 1 patient. One patient more or less could be random variation.

Was the study pre-registered? If not, the researchers may have measured multiple outcomes and reported only the one that achieved significance.

Was there a correction for multiple comparisons? If the study also measured healing time, pain scores, patient satisfaction, and readmission rates, finding one significant result out of five is not impressive — the jelly bean problem applies.

What was the comparison? Was the new protocol compared to the current standard of care, or to no treatment? A comparison to no treatment is less informative for clinical decision-making.

Has it been replicated? A single study of 60 patients should change clinical practice only if it is confirmed by larger, independent studies. It is a starting point for further investigation, not a definitive answer.

Who funded it? If the wound care product manufacturer funded the study, the result deserves extra scrutiny — not automatic rejection, but appropriate skepticism.

Bottom line for clinical practice: A single study with p = 0.04 and n = 60 is suggestive but not conclusive. It should inform further research, not immediately change practice. The evidence hierarchy in medicine places systematic reviews and meta-analyses of multiple studies above any single trial, no matter how elegant.

Practice Problems

Test your understanding with these problems.

Problem 1: A researcher tests 40 different dietary supplements for their effect on mood. Three show statistically significant results at α = 0.05. Should these three supplements be considered effective?

First, calculate the expected number of false positives if none of the supplements actually work:

\text{Expected false positives} = 40 \times 0.05 = 2

Finding 3 significant results out of 40 tests is very close to what you would expect by chance alone (expected 2). Without a correction for multiple comparisons, these results are not convincing.

Using Bonferroni correction, the adjusted significance threshold would be:

\alpha_{\text{corrected}} = \frac{0.05}{40} = 0.00125

Unless the three supplements have p-values below 0.00125, they would not pass the corrected threshold.

Answer: No. With 40 tests at α = 0.05, finding 2-3 “significant” results is expected by chance alone. These are likely false positives from the multiple comparisons problem, not evidence of real effects.

Problem 2: Two studies examine whether a teaching method improves math scores. Study A (n = 30) finds a 15-point improvement with p = 0.03. Study B (n = 500) finds a 2-point improvement with p = 0.04. Which study provides more useful evidence?

Both are statistically significant, but they tell very different stories:

Study A has a large effect size (15 points) but a very small sample (n = 30). The small sample means the estimate is imprecise — the true effect could be much larger or much smaller. The result is suggestive but unreliable.

Study B has a tiny effect size (2 points) but a large sample (n = 500). The large sample gives a precise estimate, but the effect is too small to matter practically. A 2-point improvement on a math test is unlikely to change any student’s outcomes.

Answer: Neither study alone provides compelling evidence for adopting the method. Study A suggests a potentially meaningful effect but needs replication with a larger sample. Study B shows the effect, if real, is too small to be practically significant. The ideal next step is a well-powered study (n = 200+) to determine whether the true effect is closer to 15 points (meaningful) or 2 points (not meaningful).

Problem 3: A news headline reads: “Scientists prove chocolate prevents heart disease.” The study surveyed 5,000 people about their chocolate consumption and heart health. People who ate chocolate daily had 12% lower rates of heart disease. What questions should you ask?

Critical questions:

  1. Correlation vs. causation: This is an observational survey, not a controlled experiment. People who eat chocolate daily may differ from those who do not in many ways — income, overall diet, exercise habits, stress levels. Any of these confounders could explain the association.

  2. Relative vs. absolute risk: A “12% lower rate” is a relative reduction. If the baseline rate is 5%, the absolute reduction is 0.12 × 5% = 0.6 percentage points (from 5.0% to 4.4%). That is a very small absolute difference.

  3. Multiple comparisons: Did the study also look at coffee, tea, fruit, vegetables, and other foods? If they tested 20 dietary variables, finding one with p < 0.05 is expected by chance.

  4. Self-reported data: Dietary surveys rely on people accurately remembering and reporting what they eat, which introduces measurement bias.

  5. Funding: Was the study funded by a chocolate company?

Answer: The word “prove” is a red flag — observational studies cannot prove causation. The 12% figure is likely a relative risk that sounds more dramatic than the absolute risk reduction. Confounding variables, self-report bias, and potential multiple comparisons all undermine the headline’s strong claim.

Problem 4: A pharmaceutical company reports that their new drug is “clinically proven” based on a study where p = 0.048 with a sample of 45 patients. The study was not pre-registered. Evaluate the strength of this evidence.

Several concerns:

  1. Barely significant p-value: p = 0.048 is just under the 0.05 threshold. With researcher degrees of freedom (choice of test, covariates, outcome definition), a p-value this close to the boundary is highly susceptible to p-hacking.

  2. Small sample size: n = 45 means the study has low statistical power. If the true effect is modest, a study this small is more likely to either miss it entirely or produce an inflated estimate of the effect size.

  3. Not pre-registered: Without pre-registration, there is no way to verify that the analysis plan was determined before the data was collected. The researchers may have tried multiple analytical approaches and reported the one that worked.

  4. “Clinically proven” is a marketing phrase, not a scientific term. One study with p = 0.048 and n = 45 does not constitute proof of anything.

Answer: This is weak evidence at best. The barely significant p-value, small sample, and lack of pre-registration raise serious concerns about reliability. A well-powered, pre-registered replication study would be needed before any confidence in this result is warranted.

Problem 5: If you run 10 independent hypothesis tests at α = 0.05 and all null hypotheses are actually true (no real effects), what is the probability of getting at least one statistically significant result? At least two?

For at least one false positive:

P(\text{at least one}) = 1 - 0.95^{10} = 1 - 0.5987 = 0.4013 \approx 40.1\%

For at least two false positives, we use the complement: subtract the probability of zero and exactly one false positive.

P(\text{exactly zero}) = 0.95^{10} = 0.5987

P(\text{exactly one}) = \binom{10}{1} \times 0.05^{1} \times 0.95^{9} = 10 \times 0.05 \times 0.6302 = 0.3151

P(\text{at least two}) = 1 - 0.5987 - 0.3151 = 0.0862 \approx 8.6\%

Answer: There is approximately a 40.1% chance of at least one false positive and approximately an 8.6% chance of at least two false positives, even when no real effects exist. This demonstrates why running many tests without correction inflates the false discovery rate dramatically.
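
The same numbers fall out of the binomial formula in a few lines (Python 3.8+ for math.comb):

```python
import math

alpha, k = 0.05, 10
p_zero = (1 - alpha) ** k                                  # P(exactly zero)
p_one = math.comb(k, 1) * alpha * (1 - alpha) ** (k - 1)   # P(exactly one)

print(f"P(at least one) = {1 - p_zero:.3f}")               # 0.401
print(f"P(at least two) = {1 - p_zero - p_one:.3f}")       # 0.086
```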

Key Takeaways

  • P-hacking is the practice of manipulating analysis choices until a significant p-value is found. It can be intentional or unconscious, and it produces false positives at alarming rates.
  • The multiple comparisons problem means that running many tests at α = 0.05 virtually guarantees some false positives. With 20 tests, there is a 64.2% chance of at least one; with 100 tests, it is 99.4%.
  • Publication bias (the file drawer problem) ensures that false positives are published while null results are hidden, creating a distorted picture of the evidence.
  • The reproducibility crisis has shown that large portions of published research fail to replicate — only about 39% of studies in one major psychology project, and a small minority of landmark cancer biology studies.
  • Statistical significance does not equal practical significance. A p-value of 0.001 with a trivially small effect size means the finding is real but irrelevant.
  • To evaluate research claims, ask: Was it pre-registered? Has it been replicated? Is the sample large enough? Were multiple comparisons corrected? Is the effect size meaningful? Who funded it?
  • A single study, no matter how significant its p-value, is the beginning of evidence — not the end. Replication is what turns a finding into a fact.


Last updated: March 29, 2026