
Simpson's Paradox

Last updated: March 2026 · Intermediate

Simpson’s paradox is one of the most surprising and counterintuitive results in all of statistics. It occurs when a trend that appears clearly in every subgroup of data reverses direction when the subgroups are combined. A treatment can be better in every individual category yet appear worse overall. A group can outperform another in every department yet look worse in the aggregate. It is not a mathematical trick or an edge case — it happens in real data, with real consequences, more often than most people realize.

Understanding Simpson’s paradox is essential for anyone who interprets data, because it reveals a fundamental truth: aggregated data can lie to you even when every individual piece is accurate.

What Is Simpson’s Paradox?

Simpson’s paradox occurs when the direction of a relationship between two variables reverses after a third variable (a confounding variable) is accounted for. The overall numbers point one way, but the subgroup numbers — broken down by the confounding variable — all point the other way.

The paradox arises because the subgroups have different sizes, and the confounding variable determines how individuals are distributed across those subgroups. When you combine the groups, the unequal distribution can overwhelm the actual trend within each group.

The Classic Example: University Admissions

This example is inspired by the famous 1973 Berkeley admissions study, one of the most widely cited cases of Simpson’s paradox in statistics.

Example 1: Gender Bias in Admissions?

A university reports the following overall admission rates:

           Applied    Admitted    Admission Rate
Men        1,000      520         52.0%
Women      1,000      270         27.0%

At first glance, this looks like clear evidence of gender bias — men are admitted at nearly twice the rate of women. But now look at the breakdown by department:

Department A (less competitive, higher admission rate):

           Applied    Admitted    Admission Rate
Men        800        480         60.0%
Women      100        70          70.0%

Department B (more competitive, lower admission rate):

           Applied    Admitted    Admission Rate
Men        200        40          20.0%
Women      900        200         22.2%

In Department A, women have a higher admission rate than men: 70.0% vs. 60.0%.

In Department B, women again have a higher admission rate: 22.2% vs. 20.0%.

Women outperform men in both departments. Yet overall, men appear to be admitted at a much higher rate (52.0% vs. 27.0%).

How Is This Possible?

Let’s verify the overall numbers:

\text{Men overall} = \frac{480 + 40}{800 + 200} = \frac{520}{1{,}000} = 52.0\%

\text{Women overall} = \frac{70 + 200}{100 + 900} = \frac{270}{1{,}000} = 27.0\%

The arithmetic is correct. The resolution lies in where each group applied:

  • Men applied overwhelmingly to Department A (800 out of 1,000), which has a high admission rate for everyone.
  • Women applied overwhelmingly to Department B (900 out of 1,000), which has a low admission rate for everyone.

The confounding variable is department choice. Women were not being discriminated against — they were disproportionately applying to the more competitive department. When this imbalance is aggregated, it reverses the trend that exists within each department.

Why It Happens

Simpson’s paradox occurs whenever:

  1. There is a confounding variable (like department choice) that influences both the grouping variable (gender) and the outcome (admission).
  2. The confounding variable creates unequal subgroup sizes — the groups are distributed differently across subgroups.
  3. The subgroups have very different baseline rates — the “easy” and “hard” departments have very different admission rates.

When you combine the data, the group that is concentrated in the harder category gets “penalized” in the aggregate, even though they outperform within every category.

Mathematically, it is a weighted average problem. The overall rate for each group is a weighted average of the department rates, but the weights are different for each group:

\text{Men overall} = \frac{800}{1000} \times 60\% + \frac{200}{1000} \times 20\% = 48\% + 4\% = 52\%

\text{Women overall} = \frac{100}{1000} \times 70\% + \frac{900}{1000} \times \frac{200}{900} = 7\% + 20\% = 27\%

Women have higher rates in both departments, but their overall average is dragged down because 90% of their “weight” comes from the harder department.
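The weighted-average arithmetic can be checked in a few lines of Python; the counts come from the department tables above:

```python
# (applied, admitted) per department, from the admissions tables above.
men = {"A": (800, 480), "B": (200, 40)}
women = {"A": (100, 70), "B": (900, 200)}

def rate(applied, admitted):
    """Admission rate for a single subgroup."""
    return admitted / applied

def overall(group):
    """Pooled admission rate across all departments."""
    applied = sum(a for a, _ in group.values())
    admitted = sum(ad for _, ad in group.values())
    return admitted / applied

# Women lead in every department...
for dept in ("A", "B"):
    assert rate(*women[dept]) > rate(*men[dept])

# ...yet trail badly once the departments are pooled.
print(f"Men: {overall(men):.1%}, Women: {overall(women):.1%}")  # Men: 52.0%, Women: 27.0%
```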

A Medical Example

Simpson’s paradox appears frequently in medical research, where it can have life-or-death implications.

Example 2: Which Treatment Is Better?

A hospital compares two treatments for kidney stones. Here are the overall results:

              Patients    Successes    Success Rate
Treatment A   350         273          78.0%
Treatment B   350         289          82.6%

Treatment B looks better overall: 82.6% vs. 78.0%. But now break it down by the severity of the kidney stones:

Small stones (easier to treat):

              Patients    Successes    Success Rate
Treatment A   87          81           93.1%
Treatment B   270         234          86.7%

Large stones (harder to treat):

              Patients    Successes    Success Rate
Treatment A   263         192          73.0%
Treatment B   80          55           68.8%

Treatment A wins in both categories — 93.1% vs. 86.7% for small stones, and 73.0% vs. 68.8% for large stones. Yet Treatment B wins overall.

The confounding variable: Stone size. Treatment A was disproportionately assigned to patients with large stones (263 out of 350), which are harder to treat and have lower success rates regardless of treatment. Treatment B was disproportionately assigned to patients with small stones (270 out of 350). When the data is combined, Treatment A’s concentration in the harder cases drags its overall rate down.

Let’s verify:

\text{Treatment A overall} = \frac{81 + 192}{87 + 263} = \frac{273}{350} = 78.0\%

\text{Treatment B overall} = \frac{234 + 55}{270 + 80} = \frac{289}{350} \approx 82.6\%

The correct conclusion: Treatment A is more effective for both small and large kidney stones. A doctor who looked only at the overall numbers would prescribe the inferior treatment.
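The same verification can be scripted; the counts are from the two severity tables above:

```python
# (patients, successes) per stone size, from the kidney-stone tables above.
treat_a = {"small": (87, 81), "large": (263, 192)}
treat_b = {"small": (270, 234), "large": (80, 55)}

def success_rate(patients, successes):
    return successes / patients

def pooled_rate(arms):
    """Success rate after combining both severity strata."""
    patients = sum(p for p, _ in arms.values())
    successes = sum(s for _, s in arms.values())
    return successes / patients

# Treatment A wins within each stratum...
for size in ("small", "large"):
    assert success_rate(*treat_a[size]) > success_rate(*treat_b[size])

# ...but trails once the strata are pooled.
print(f"A: {pooled_rate(treat_a):.1%}, B: {pooled_rate(treat_b):.1%}")  # A: 78.0%, B: 82.6%
```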

Absolute Risk vs. Relative Risk

Simpson’s paradox is closely related to another common source of confusion in statistics: the difference between relative risk and absolute risk. Both are accurate but can create very different impressions.

Example 3: “Treatment Cuts Risk in Half”

Suppose two scenarios are reported as “treatment reduces risk by 50%”:

Scenario A:

  • Baseline risk: 40% (40 out of 100 people get the disease)
  • Treated risk: 20% (20 out of 100 people get the disease)
  • Absolute reduction: 40% - 20% = 20 percentage points
  • Relative reduction: (40% - 20%) / 40% = 50%

Scenario B:

  • Baseline risk: 2% (2 out of 100 people get the disease)
  • Treated risk: 1% (1 out of 100 people get the disease)
  • Absolute reduction: 2% - 1% = 1 percentage point
  • Relative reduction: (2% - 1%) / 2% = 50%

Both scenarios have the same “50% risk reduction.” But in Scenario A, the treatment prevents 20 cases per 100 people. In Scenario B, it prevents 1 case per 100 people. The Number Needed to Treat makes this concrete:

\text{Scenario A: NNT} = \frac{1}{0.20} = 5 \quad \text{(treat 5 people to prevent 1 case)}

\text{Scenario B: NNT} = \frac{1}{0.01} = 100 \quad \text{(treat 100 people to prevent 1 case)}

Media outlets and pharmaceutical companies tend to use relative risk because “50% reduction” is more headline-worthy than “1 percentage point reduction.” Neither is wrong, but the absolute number gives a more honest picture of the practical impact.

Always ask: What is the baseline risk? What is the absolute reduction? How many people need to be treated for one person to benefit?
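All three quantities follow directly from the baseline and treated risks. A small helper (the name and layout are illustrative, not from any library) makes the comparison concrete:

```python
def risk_summary(baseline, treated):
    """Absolute reduction, relative reduction, and number needed to treat."""
    arr = baseline - treated   # absolute risk reduction
    rrr = arr / baseline       # relative risk reduction
    nnt = 1 / arr              # people treated per case prevented
    return arr, rrr, nnt

# Scenarios A and B from the example: same 50% relative reduction,
# very different absolute impact.
for name, baseline, treated in [("Scenario A", 0.40, 0.20),
                                ("Scenario B", 0.02, 0.01)]:
    arr, rrr, nnt = risk_summary(baseline, treated)
    print(f"{name}: ARR = {arr:.0%}, RRR = {rrr:.0%}, NNT = {nnt:.0f}")
```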

How to Protect Yourself

Simpson’s paradox cannot be detected from aggregate numbers alone. You need to think about whether meaningful subgroups exist in the data and whether those subgroups might be distributed unevenly.

Here is a practical framework:

  1. Always ask: are there meaningful subgroups? When comparing two groups (treatments, schools, hospitals, policies), consider whether a lurking variable could be dividing the data in a way that affects the outcome.
  2. Check if subgroup sizes are balanced. If Group A is heavily concentrated in one category and Group B in another, aggregated comparisons are unreliable.
  3. If overall and subgroup results conflict, trust the subgroup analysis — provided the subgroups are genuine and the confounding variable is real, not manufactured by data dredging.
  4. Look for confounding variables that could influence both the grouping and the outcome: severity of illness, difficulty of department, socioeconomic status, age, or any other factor that creates unequal starting conditions.
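The framework above can be sketched as a small check: given per-subgroup counts for two groups, flag the case where one group wins every subgroup but not the aggregate. The function name and data layout are illustrative, not from any library:

```python
def simpson_reversal(group1, group2):
    """True if group1 beats group2 in every subgroup but not overall.

    Each group maps a subgroup label to (total, successes).
    """
    def rate(total, successes):
        return successes / total

    def overall(group):
        total = sum(t for t, _ in group.values())
        successes = sum(s for _, s in group.values())
        return successes / total

    wins_each = all(rate(*group1[k]) > rate(*group2[k]) for k in group1)
    loses_overall = overall(group1) <= overall(group2)
    return wins_each and loses_overall

# Admissions data from Example 1: women win both departments, lose overall.
women = {"A": (100, 70), "B": (900, 200)}
men = {"A": (800, 480), "B": (200, 40)}
print(simpson_reversal(women, men))  # True
```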

Real-World Application: Nursing — Comparing Hospital Mortality Rates

Hospital rankings often rely on overall mortality rates, but Simpson’s paradox makes raw comparisons between hospitals deeply misleading.

The scenario: Hospital A has an overall mortality rate of 4.2%. Hospital B has an overall mortality rate of 3.1%. A newspaper publishes a story ranking Hospital B as “safer.”

But consider the patient mix:

Hospital A is a major trauma center and tertiary referral hospital. It receives the most severely ill patients from across the region — complex surgeries, advanced cancers, multi-organ failure.

Hospital B is a community hospital that handles routine procedures and low-acuity cases. Severely ill patients are transferred out to Hospital A.

When broken down by patient acuity:

Acuity Level     Hospital A Mortality    Hospital B Mortality
Low acuity       0.5%                    0.8%
Medium acuity    3.0%                    3.5%
High acuity      12.0%                   15.0%

Hospital A has a lower mortality rate than Hospital B at every acuity level. But because Hospital A treats a much higher proportion of high-acuity patients, its overall mortality rate is dragged up.

This is why modern hospital quality metrics use risk-adjusted mortality rates that account for patient severity. The raw overall number is not just unhelpful — it can be actively misleading, punishing hospitals that take on the hardest cases.
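One common form of risk adjustment, direct standardization, can be sketched as follows: apply each hospital's acuity-specific mortality rates to a single shared reference case mix. The rates below come from the table above; the case-mix fractions are illustrative assumptions, not figures from this article:

```python
# Acuity-specific mortality rates, from the table above.
hospital_a = {"low": 0.005, "medium": 0.030, "high": 0.120}
hospital_b = {"low": 0.008, "medium": 0.035, "high": 0.150}

# Hypothetical case mixes (fraction of each hospital's patients per acuity
# level) -- assumed for illustration only.
mix_a = {"low": 0.30, "medium": 0.35, "high": 0.35}   # trauma center: sicker patients
mix_b = {"low": 0.70, "medium": 0.25, "high": 0.05}   # community hospital

def crude_rate(rates, mix):
    """Overall mortality under the hospital's own case mix."""
    return sum(rates[k] * mix[k] for k in rates)

def standardized_rate(rates, reference_mix):
    """Direct standardization: the hospital's rates applied to a shared mix."""
    return sum(rates[k] * reference_mix[k] for k in rates)

# Shared reference mix: a simple average of the two case mixes.
reference = {k: (mix_a[k] + mix_b[k]) / 2 for k in mix_a}

# Crude rates make A look worse; standardized rates reveal A is safer.
print(f"Crude:        A {crude_rate(hospital_a, mix_a):.1%} vs B {crude_rate(hospital_b, mix_b):.1%}")
print(f"Standardized: A {standardized_rate(hospital_a, reference):.1%} vs B {standardized_rate(hospital_b, reference):.1%}")
```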

For nurses and healthcare professionals: When you see hospital comparison data, always ask whether the numbers are risk-adjusted. A hospital with a higher overall mortality rate may actually be providing better care — they are just treating sicker patients.

Practice Problems

Test your understanding with these problems. Try each one before reading the solution that follows.

Problem 1: A company has two divisions. In Division X, women earn an average of $72,000 and men earn $70,000. In Division Y, women earn $52,000 and men earn $50,000. Women earn more in both divisions. But the company-wide average shows men earning more than women. How is this possible?

This is Simpson’s paradox. The confounding variable is division assignment. If most men work in Division X (the higher-paying division) while most women work in Division Y (the lower-paying division), the company-wide average for men is pulled up by the higher-paying division.

For example, if 80% of men are in Division X and 80% of women are in Division Y:

\text{Men's average} = 0.80 \times 70{,}000 + 0.20 \times 50{,}000 = 56{,}000 + 10{,}000 = \$66{,}000

\text{Women's average} = 0.20 \times 72{,}000 + 0.80 \times 52{,}000 = 14{,}400 + 41{,}600 = \$56{,}000

Answer: Men earn more overall ($66,000 vs. $56,000) despite women earning more in both divisions, because men are disproportionately concentrated in the higher-paying division.
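A quick check of the weighted averages in this solution, using the assumed 80/20 split:

```python
# Assumed split from the worked solution: 80% of men in Division X
# (higher-paying), 80% of women in Division Y.
# Integer arithmetic avoids floating-point rounding noise.
men_avg = (80 * 70_000 + 20 * 50_000) // 100
women_avg = (20 * 72_000 + 80 * 52_000) // 100
print(men_avg, women_avg)  # 66000 56000
```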

Problem 2: A study compares two medications for headache relief. Overall, Drug A has a 70% success rate and Drug B has a 75% success rate. When the data is split by headache severity (mild vs. severe), Drug A has a higher success rate in both categories. Which drug should a doctor prescribe, and why?

The doctor should prescribe Drug A. Despite the overall numbers favoring Drug B, the subgroup analysis shows Drug A is more effective for both mild and severe headaches.

The overall numbers are misleading because of Simpson’s paradox — Drug A was likely tested more heavily on severe headaches (which have lower success rates regardless of treatment), dragging its overall rate down.

Answer: Prescribe Drug A. The subgroup analysis (controlling for severity) gives the correct picture. The overall numbers are confounded by the uneven distribution of severity between the two groups.

Problem 3: A disease has a baseline risk of 8 per 1,000 (0.8%). A new treatment reduces the risk to 4 per 1,000 (0.4%). The marketing materials say “50% risk reduction.” Calculate the absolute risk reduction and the Number Needed to Treat.

The relative risk reduction is:

\frac{0.8\% - 0.4\%}{0.8\%} = 50\%

The absolute risk reduction is:

0.8\% - 0.4\% = 0.4 \text{ percentage points}

The Number Needed to Treat is:

\text{NNT} = \frac{1}{0.004} = 250

Answer: The absolute risk reduction is 0.4 percentage points (from 0.8% to 0.4%). The NNT is 250, meaning 250 people need to be treated for 1 person to benefit. The “50% reduction” is technically true but obscures the fact that the absolute benefit is very small.

Problem 4: School A has a higher overall test pass rate than School B. But when broken down by income level (low-income and high-income students), School B has a higher pass rate in both subgroups. What is the likely confounding variable, and what should you conclude?

The confounding variable is student income distribution. School A likely has a much higher proportion of high-income students (who tend to have higher pass rates regardless of school quality). School B likely serves a higher proportion of low-income students.

When the groups are combined, School A’s overall rate is boosted by its concentration of high-income students, even though School B does a better job with both subgroups.

Answer: The confounding variable is the proportion of low-income vs. high-income students at each school. School B provides better educational outcomes for both income groups, but School A’s more affluent student body produces a higher aggregate number.

Problem 5: A study of 600 patients compares Surgery vs. Medication for a condition. Overall: Surgery succeeds 80% of the time (240/300) and Medication succeeds 83.3% of the time (250/300). Should you conclude Medication is better? What additional information would you need?

You should not conclude that Medication is better based on overall numbers alone. Simpson’s paradox could be at work — the patients assigned to Surgery might have been sicker or had more advanced cases.

Additional information needed:

  1. Patient severity/staging — were the groups comparable at baseline?
  2. How were patients assigned to each treatment — was it randomized or based on clinical judgment?
  3. Subgroup success rates — what are the success rates when broken down by disease severity?
  4. Confounding variables — age, other conditions, and other factors that could affect outcomes

Answer: Do not conclude Medication is better without examining subgroup data controlling for patient severity. If sicker patients were more likely to receive surgery, the overall comparison is confounded. This is exactly the scenario where Simpson’s paradox operates.

Key Takeaways

  • Simpson’s paradox occurs when a trend present in every subgroup reverses when the data is combined — it is a real and common phenomenon, not a mathematical curiosity.
  • The paradox arises from confounding variables that create unequal distributions across subgroups: department choice, disease severity, income level, or any factor that affects both grouping and outcome.
  • Aggregated data can be misleading even when every individual number is correct. Always ask whether meaningful subgroups exist in the data.
  • When overall and subgroup results conflict, the subgroup analysis is usually more reliable — it controls for the confounding variable.
  • Relative risk (“50% reduction!”) can be technically true but practically misleading. Always ask for the absolute risk and Number Needed to Treat.
  • In healthcare, risk-adjusted comparisons are essential — raw hospital mortality rates punish institutions that treat the sickest patients.
  • The best defense against Simpson’s paradox is the habit of asking: “What happens when I break this data down by [relevant subgroup]?”

