Statistics

Two-Sample Tests

Last updated: March 2026 · Advanced

Many real-world questions are about comparing two groups. Is a new drug more effective than the standard treatment? Do students who attend tutoring score higher than those who do not? Is there a difference in defect rates between two factories? Single-sample tests cannot answer these questions — you need a test that compares two sets of data. In this lesson, you will learn three such tests: the two-sample t-test for comparing independent means, the two-proportion z-test for comparing independent proportions, and the paired t-test for matched-pairs data. Together, these cover the vast majority of two-group comparisons you will encounter.

Two-Sample T-Test for Independent Means

Use this test when you have two independent groups and you want to compare their population means. “Independent” means the subjects in one group have no connection to the subjects in the other — for example, a treatment group and a control group with different people in each.

Test statistic (Welch’s t-test):

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$

where $\bar{x}_1, \bar{x}_2$ are the sample means, $s_1, s_2$ are the sample standard deviations, and $n_1, n_2$ are the sample sizes.

Degrees of freedom: The exact degrees of freedom for Welch’s t-test use the Welch-Satterthwaite formula, which is complex. In practice, use software to compute it. For hand calculations, a conservative approximation is $df = \min(n_1 - 1, n_2 - 1)$. Because this value is never larger than the Welch-Satterthwaite value, it gives a p-value that is larger than (never smaller than) the exact method's, which makes it a safe, conservative approach.
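For reference, the Welch-Satterthwaite approximation mentioned above is usually written as follows. This is the standard textbook form, stated here as a supplement rather than taken from the lesson itself; the symbols match those defined for the test statistic.

```latex
df \approx \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}
               {\dfrac{\left(s_1^2/n_1\right)^2}{n_1 - 1} + \dfrac{\left(s_2^2/n_2\right)^2}{n_2 - 1}}
```

The result is generally not a whole number, which is one reason software is preferred over t-tables for the exact method.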

Conditions:

  • Both samples are random and independent of each other
  • Observations within each sample are independent
  • Both populations are approximately normal, OR both sample sizes are large ($n_1 \geq 30$ and $n_2 \geq 30$)
  • No extreme outliers, especially for small samples

Example 1: Treatment vs Control

A medical researcher tests a new cognitive training program. The treatment group ($n_1 = 25$) completes the program and scores $\bar{x}_1 = 82$ with $s_1 = 10$ on a standardized test. The control group ($n_2 = 30$) receives no training and scores $\bar{x}_2 = 76$ with $s_2 = 12$. Is there evidence that the treatment group scores higher? Test at $\alpha = 0.05$.

Step 1: State the hypotheses.

$$H_0: \mu_1 - \mu_2 = 0 \quad \text{(no difference between groups)}$$

$$H_a: \mu_1 - \mu_2 > 0 \quad \text{(treatment group scores higher; one-sided right)}$$

Step 2: Choose significance level: $\alpha = 0.05$.

Step 3: Check conditions. Independent random assignment ✓. Both sample sizes are moderate (25 and 30) ✓. Test scores are typically approximately normal ✓.

Step 4: Calculate the test statistic.

$$SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} = \sqrt{\frac{10^2}{25} + \frac{12^2}{30}} = \sqrt{\frac{100}{25} + \frac{144}{30}} = \sqrt{4 + 4.8} = \sqrt{8.8} = 2.966$$

$$t = \frac{82 - 76}{2.966} = \frac{6}{2.966} = 2.023$$

Step 5: Find the p-value. Using conservative $df = \min(24, 29) = 24$:

$$\text{p-value} = P(t > 2.023 \text{ with } df = 24) \approx 0.027$$

Step 6: Make a decision. Since $0.027 < 0.05$, we reject $H_0$.

Step 7: Conclusion in context. There is statistically significant evidence that the cognitive training program improves test scores. The treatment group scored an average of 6 points higher than the control group. This difference is both statistically significant and large enough to suggest the program has a meaningful effect on cognitive performance.
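The arithmetic in Example 1 can be checked with a short script. The following is a minimal sketch using only the Python standard library; the function names `welch_summary` and `t_upper_tail` are illustrative choices, not part of any library, and the tail probability is approximated by numerically integrating the t density with Simpson's rule rather than by an exact routine.

```python
import math

def welch_summary(mean1, sd1, n1, mean2, sd2, n2):
    """Return (t, se, welch_df) for two independent samples from summary statistics."""
    v1 = sd1 ** 2 / n1          # s1^2 / n1
    v2 = sd2 ** 2 / n2          # s2^2 / n2
    se = math.sqrt(v1 + v2)     # standard error of the difference in means
    t = (mean1 - mean2) / se
    # Welch-Satterthwaite degrees of freedom (generally not an integer)
    welch_df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, se, welch_df

def t_upper_tail(t, df, steps=2000):
    """Approximate P(T > t) for Student's t using Simpson's rule (steps must be even)."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def pdf(x):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

    a = abs(t)
    h = a / steps
    total = pdf(0.0) + pdf(a)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    central = total * h / 3     # P(0 < T < |t|)
    upper = 0.5 - central       # P(T > |t|)
    return upper if t >= 0 else 1.0 - upper

# Example 1: treatment (82, 10, 25) vs control (76, 12, 30)
t_stat, se, welch_df = welch_summary(82, 10, 25, 76, 12, 30)
p_conservative = t_upper_tail(t_stat, min(25, 30) - 1)   # df = 24
p_welch = t_upper_tail(t_stat, welch_df)                 # exact Welch df

print(f"SE = {se:.3f}, t = {t_stat:.3f}")
print(f"conservative df = 24, p = {p_conservative:.3f}")
print(f"Welch df = {welch_df:.1f}, p = {p_welch:.3f}")
```

Running it should reproduce the standard error and test statistic from Step 4, and it shows that the conservative df yields the larger of the two p-values, as the degrees-of-freedom note describes.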

Why Welch’s T-Test?

You may encounter two versions of the two-sample t-test: the pooled (equal variance) version and the Welch (unequal variance) version. Welch’s t-test does not assume the two populations have equal standard deviations. Since this assumption is hard to verify and often violated, Welch’s t-test is the safer default, and many modern software packages use it by default. When the standard deviations are actually equal, both versions give nearly identical results, so very little is lost by using Welch’s test.
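For comparison, the pooled version replaces the two separate variance terms with a single pooled variance. The form below is the standard textbook statement, included here as a supplement; the symbol $s_p$ for the pooled standard deviation is not used elsewhere in this lesson.

```latex
s_p^2 = \frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2},
\qquad
t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}},
\qquad
df = n_1 + n_2 - 2
```

When $s_1 \approx s_2$ and the sample sizes are similar, this statistic is close to Welch's; the two diverge most when the variances and the sample sizes are both unequal.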

Two-Proportion Z-Test

Use this test when you want to compare proportions from two independent groups — for example, the success rates of two treatments, the defect rates of two production lines, or the pass rates of two schools.

Pooled proportion (used under $H_0: p_1 = p_2$):

$$\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}$$

where $x_1$ and $x_2$ are the number of successes in each group.

Test statistic:

$$z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$

The pooled proportion is used because under $H_0$, we assume the two population proportions are equal. The best single estimate of that common proportion combines both samples.

Conditions:

  • Both samples are independent random samples
  • Observations within each sample are independent
  • Large sample condition: $n_1\hat{p} \geq 10$, $n_1(1-\hat{p}) \geq 10$, $n_2\hat{p} \geq 10$, and $n_2(1-\hat{p}) \geq 10$ (using the pooled $\hat{p}$)

Example 2: Drug Side Effects

A pharmaceutical company is comparing the side-effect rates of two drugs. Drug A is given to 200 patients, and 18 experience side effects. Drug B is given to 250 patients, and 30 experience side effects. Is there evidence that Drug A has a lower side-effect rate? Test at $\alpha = 0.05$.

Step 1: State the hypotheses.

$$H_0: p_1 = p_2 \quad \text{(side-effect rates are the same)}$$

$$H_a: p_1 < p_2 \quad \text{(Drug A has a lower rate; one-sided left)}$$

Step 2: Choose significance level: $\alpha = 0.05$.

Step 3: Calculate the pooled proportion and check conditions.

$$\hat{p}_1 = \frac{18}{200} = 0.09 \qquad \hat{p}_2 = \frac{30}{250} = 0.12$$

$$\hat{p} = \frac{18 + 30}{200 + 250} = \frac{48}{450} = 0.1067$$

Large sample check (using pooled $\hat{p} = 0.1067$): $200 \times 0.1067 = 21.3 \geq 10$ ✓, $200 \times 0.8933 = 178.7 \geq 10$ ✓, $250 \times 0.1067 = 26.7 \geq 10$ ✓, $250 \times 0.8933 = 223.3 \geq 10$ ✓.

Step 4: Calculate the test statistic.

$$SE = \sqrt{0.1067 \times 0.8933 \times \left(\frac{1}{200} + \frac{1}{250}\right)} = \sqrt{0.09531 \times 0.009} = \sqrt{0.000858} = 0.02929$$

$$z = \frac{0.09 - 0.12}{0.02929} = \frac{-0.03}{0.02929} = -1.024$$

Step 5: Find the p-value. One-sided left:

$$\text{p-value} = P(Z < -1.024) \approx 0.153$$

Step 6: Make a decision. Since $0.153 > 0.05$, we fail to reject $H_0$.

Step 7: Conclusion in context. There is not sufficient evidence to conclude that Drug A has a lower side-effect rate than Drug B. While the sample rates differ (9% vs 12%), this 3-percentage-point difference could reasonably be due to random variation. A larger study would be needed to detect such a small difference if it truly exists.
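Example 2 can be reproduced in a few lines. This is a minimal sketch using only the standard library; `two_proportion_z` and `normal_cdf` are illustrative helper names, and the normal CDF is obtained from `math.erfc`.

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """Return (z, pooled, se) for the pooled two-proportion z-test."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)                      # common proportion under H0
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return z, pooled, se

def normal_cdf(z):
    """Standard normal CDF, P(Z < z)."""
    return 0.5 * math.erfc(-z / math.sqrt(2))

# Example 2: Drug A 18/200 vs Drug B 30/250
z, pooled, se = two_proportion_z(18, 200, 30, 250)

# Large-sample condition, checked with the pooled proportion
large_sample_ok = all(
    n * q >= 10 for n in (200, 250) for q in (pooled, 1 - pooled)
)

p_left = normal_cdf(z)   # one-sided left: H_a is p1 < p2

print(f"pooled = {pooled:.4f}, SE = {se:.5f}, z = {z:.3f}")
print(f"conditions met: {large_sample_ok}, p-value = {p_left:.3f}")
```

For a two-sided alternative, the p-value would instead be `2 * normal_cdf(-abs(z))`.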

Paired T-Test (Matched Pairs)

The paired t-test is used when you have two measurements on the same subjects — before and after a treatment, left eye and right eye, or any other naturally paired design. Because the two measurements are on the same individual, they are not independent of each other, so you cannot use a two-sample t-test. Instead, you compute the difference for each pair and then perform a one-sample t-test on those differences.

Procedure:

  1. For each pair, calculate the difference: $d_i = x_{\text{after}} - x_{\text{before}}$ (or whichever direction is relevant)
  2. Compute $\bar{d}$ (the mean of the differences) and $s_d$ (the standard deviation of the differences)
  3. Test statistic: $t = \frac{\bar{d}}{s_d / \sqrt{n}}$ with $df = n - 1$, where $n$ is the number of pairs

Conditions:

  • The pairs are randomly selected or randomly assigned
  • The differences are independent of each other (one pair does not influence another)
  • The population of differences is approximately normal, OR $n \geq 30$

Example 3: Blood Pressure Before and After Medication

A physician measures the systolic blood pressure of 12 patients before and after taking a new medication. The differences (after minus before) are:

$$-8, -5, -12, -3, -7, -10, -4, -6, -9, -1, -11, -6$$

Is there evidence the medication lowers blood pressure? Test at $\alpha = 0.01$.

Step 1: State the hypotheses. Since negative differences mean blood pressure decreased:

$$H_0: \mu_d = 0 \quad \text{(no change in blood pressure)}$$

$$H_a: \mu_d < 0 \quad \text{(blood pressure decreased; one-sided left)}$$

Step 2: Choose significance level: $\alpha = 0.01$.

Step 3: Check conditions. Random sample of patients ✓. Differences are independent (different patients) ✓. With $n = 12$, we need approximate normality of the differences: the values appear roughly symmetric with no extreme outliers ✓.

Step 4: Calculate $\bar{d}$ and $s_d$.

$$\text{Sum} = -8 + (-5) + (-12) + (-3) + (-7) + (-10) + (-4) + (-6) + (-9) + (-1) + (-11) + (-6) = -82$$

$$\bar{d} = \frac{-82}{12} = -6.833$$

To find $s_d$, compute each deviation from the mean and square it:

| $d_i$ | $d_i - \bar{d}$ | $(d_i - \bar{d})^2$ |
| --- | --- | --- |
| $-8$ | $-1.167$ | $1.362$ |
| $-5$ | $1.833$ | $3.360$ |
| $-12$ | $-5.167$ | $26.698$ |
| $-3$ | $3.833$ | $14.692$ |
| $-7$ | $-0.167$ | $0.028$ |
| $-10$ | $-3.167$ | $10.030$ |
| $-4$ | $2.833$ | $8.026$ |
| $-6$ | $0.833$ | $0.694$ |
| $-9$ | $-2.167$ | $4.696$ |
| $-1$ | $5.833$ | $34.024$ |
| $-11$ | $-4.167$ | $17.364$ |
| $-6$ | $0.833$ | $0.694$ |

$$\sum (d_i - \bar{d})^2 = 121.668$$

$$s_d^2 = \frac{121.668}{12 - 1} = \frac{121.668}{11} = 11.061$$

$$s_d = \sqrt{11.061} = 3.326$$

Now calculate the test statistic:

$$SE = \frac{s_d}{\sqrt{n}} = \frac{3.326}{\sqrt{12}} = \frac{3.326}{3.464} = 0.9602$$

$$t = \frac{-6.833}{0.9602} = -7.116$$

Step 5: Find the p-value. With $df = 11$ and a one-sided left test:

$$\text{p-value} = P(t < -7.116 \text{ with } df = 11) \approx 0.00001$$

The p-value is extremely small, well below any conventional significance level.

Step 6: Make a decision. Since $0.00001 < 0.01$, we reject $H_0$.

Step 7: Conclusion in context. There is overwhelming evidence that systolic blood pressure was lower after the medication. The average reduction of 6.8 mmHg is statistically significant, and every patient in the sample showed a decrease. Whether a drop of this size is clinically meaningful is a separate question of practical significance that the test itself does not answer.
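The hand calculation in Example 3 can be verified with the standard library's `statistics` module. This is a minimal sketch; `paired_t` is an illustrative helper name, and it returns only the statistic and degrees of freedom, leaving the p-value to a t-table or statistical software.

```python
import math
import statistics

# Differences (after minus before) from Example 3
differences = [-8, -5, -12, -3, -7, -10, -4, -6, -9, -1, -11, -6]

def paired_t(diffs):
    """One-sample t statistic on the paired differences: (t, d_bar, s_d, df)."""
    n = len(diffs)
    d_bar = statistics.mean(diffs)      # mean of the differences
    s_d = statistics.stdev(diffs)       # sample SD (divides by n - 1)
    se = s_d / math.sqrt(n)
    return d_bar / se, d_bar, s_d, n - 1

t_stat, d_bar, s_d, df = paired_t(differences)
print(f"d_bar = {d_bar:.3f}, s_d = {s_d:.3f}, t = {t_stat:.3f}, df = {df}")
```

Small differences in the last decimal place relative to the hand calculation come from rounding intermediate values.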

Why Paired Instead of Two-Sample?

You might wonder why we do not simply compare the “before” group and the “after” group with a two-sample t-test. The answer is that the paired design is far more powerful. By computing differences within each subject, you eliminate the large person-to-person variability in baseline blood pressure. Patient A might have a baseline of 160 and Patient B might have a baseline of 120, but the medication effect within each patient can be isolated. The standard deviation of the differences ($s_d = 3.326$) is much smaller than the standard deviation of the raw measurements would be, giving a much larger test statistic and a much smaller p-value.
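The effect of ignoring the pairing can be seen numerically. The sketch below is illustrative only: the lesson gives the differences but not the raw readings, so the `before` values are HYPOTHETICAL baselines invented for this demonstration, and the `after` values are built by adding the Example 3 differences to them.

```python
import math
import statistics

# Differences (after minus before) from Example 3
differences = [-8, -5, -12, -3, -7, -10, -4, -6, -9, -1, -11, -6]

# HYPOTHETICAL baseline readings (not from the lesson), chosen to vary widely
before = [160, 120, 145, 150, 138, 155, 128, 142, 165, 133, 149, 140]
after = [b + d for b, d in zip(before, differences)]

def paired_t(diffs):
    """t statistic for a one-sample test on the differences."""
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))

def welch_t(a, b):
    """Welch t statistic treating a and b as independent samples (wrong here)."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se

t_paired = paired_t(differences)
t_welch = welch_t(after, before)

print(f"paired t = {t_paired:.2f}")
print(f"Welch t (pairing ignored) = {t_welch:.2f}")
```

With these made-up baselines, the numerator (the mean difference) is the same in both statistics, but the two-sample standard error is dominated by person-to-person spread, so the Welch statistic is much closer to zero.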

Choosing the Right Two-Sample Test

| Situation | Test | Key Formula |
| --- | --- | --- |
| Comparing means of two independent groups, $\sigma$ unknown | Two-sample t-test (Welch) | $t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}$ |
| Comparing proportions of two independent groups | Two-proportion z-test | $z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})(1/n_1 + 1/n_2)}}$ |
| Comparing two measurements on the same subjects | Paired t-test | $t = \frac{\bar{d}}{s_d / \sqrt{n}}$ |

Decision process:

  1. Are you comparing means or proportions?
    • Means: go to step 2
    • Proportions: use the two-proportion z-test
  2. Are the two groups independent or paired?
    • Independent (different subjects in each group): use the two-sample t-test
    • Paired (same subjects measured twice, or naturally matched pairs): use the paired t-test

A common mistake is using a two-sample t-test when the data is actually paired. This ignores the pairing structure and loses statistical power. Always ask: “Are the two measurements connected to the same individual?” If yes, use the paired t-test.
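The decision process can be written as a small lookup function. This is only a sketch of the two questions above; the function name `choose_test` and its string arguments are illustrative choices.

```python
def choose_test(quantity, design="independent"):
    """Pick a two-sample test from the two decision questions.

    quantity: "means" or "proportions"
    design:   "independent" or "paired" (only consulted for means)
    """
    if quantity == "proportions":
        return "two-proportion z-test"
    if quantity == "means":
        if design == "independent":
            return "two-sample t-test (Welch)"
        if design == "paired":
            return "paired t-test"
    raise ValueError(f"unsupported combination: {quantity!r}, {design!r}")

print(choose_test("means", "paired"))        # before/after on the same subjects
print(choose_test("means", "independent"))   # treatment vs control groups
print(choose_test("proportions"))            # success rates in two groups
```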

Real-World Application: Nursing — Comparing Treatment Protocols

A hospital evaluates two wound-care protocols. Protocol A (the current standard) is used on 35 randomly assigned patients, with a mean healing time of $\bar{x}_1 = 14.2$ days and $s_1 = 3.8$ days. Protocol B (a new approach) is used on 40 patients, with $\bar{x}_2 = 12.5$ days and $s_2 = 4.1$ days. Is Protocol B faster? Test at $\alpha = 0.05$.

$$H_0: \mu_1 - \mu_2 = 0 \qquad H_a: \mu_1 - \mu_2 > 0$$

(Note: $H_a$ tests whether Protocol A takes longer, meaning Protocol B is faster.)

$$SE = \sqrt{\frac{3.8^2}{35} + \frac{4.1^2}{40}} = \sqrt{\frac{14.44}{35} + \frac{16.81}{40}} = \sqrt{0.4126 + 0.4203} = \sqrt{0.8329} = 0.9126$$

$$t = \frac{14.2 - 12.5}{0.9126} = \frac{1.7}{0.9126} = 1.863$$

Using conservative $df = \min(34, 39) = 34$:

$$\text{p-value} = P(t > 1.863,\ df = 34) \approx 0.036$$

Since $0.036 < 0.05$, reject $H_0$. There is evidence Protocol B produces faster healing. The 1.7-day average improvement is clinically relevant: fewer inpatient days, lower infection risk, and reduced treatment costs. The nursing team should consider transitioning to Protocol B, with continued monitoring to confirm the results hold across a broader patient population.

Practice Problems

Test your understanding with these problems.

Problem 1: A school district compares math scores between two teaching methods. Method 1 ($n_1 = 28$): $\bar{x}_1 = 78$, $s_1 = 9$. Method 2 ($n_2 = 32$): $\bar{x}_2 = 73$, $s_2 = 11$. Is there evidence that Method 1 is better? ($\alpha = 0.05$)

$H_0: \mu_1 - \mu_2 = 0$, $H_a: \mu_1 - \mu_2 > 0$ (one-sided right)

$$SE = \sqrt{\frac{9^2}{28} + \frac{11^2}{32}} = \sqrt{\frac{81}{28} + \frac{121}{32}} = \sqrt{2.893 + 3.781} = \sqrt{6.674} = 2.583$$

$$t = \frac{78 - 73}{2.583} = \frac{5}{2.583} = 1.936$$

Conservative $df = \min(27, 31) = 27$. $P(t > 1.936,\ df = 27) \approx 0.032$.

Since $0.032 < 0.05$, reject $H_0$.

Answer: There is statistically significant evidence that Method 1 produces higher math scores than Method 2. The observed average difference is 5 points.

Problem 2: A marketing team tests two ad designs. Ad A: 45 clicks out of 500 impressions. Ad B: 60 clicks out of 500 impressions. Is Ad B’s click rate higher? ($\alpha = 0.05$)

$H_0: p_1 = p_2$, $H_a: p_1 < p_2$ (one-sided, testing if Ad B is higher means Ad A is lower)

$$\hat{p}_1 = \frac{45}{500} = 0.09 \qquad \hat{p}_2 = \frac{60}{500} = 0.12$$

$$\hat{p} = \frac{45 + 60}{500 + 500} = \frac{105}{1000} = 0.105$$

$$SE = \sqrt{0.105 \times 0.895 \times \left(\frac{1}{500} + \frac{1}{500}\right)} = \sqrt{0.09398 \times 0.004} = \sqrt{0.000376} = 0.01939$$

$$z = \frac{0.09 - 0.12}{0.01939} = \frac{-0.03}{0.01939} = -1.547$$

$P(Z < -1.547) \approx 0.061$.

Since $0.061 > 0.05$, fail to reject $H_0$.

Answer: There is not sufficient evidence at the 5% level that Ad B has a higher click-through rate. The observed difference of 3 percentage points could be due to chance. A larger sample size would increase the power to detect such a difference.

Problem 3: Ten runners are timed on a 5K before and after a training program. The differences (after minus before, in seconds) are: $-15, -8, -22, 3, -12, -18, -5, -10, -7, -14$. Is there evidence the training reduces 5K time? ($\alpha = 0.05$)

$H_0: \mu_d = 0$, $H_a: \mu_d < 0$ (one-sided left)

$$\text{Sum} = -15 + (-8) + (-22) + 3 + (-12) + (-18) + (-5) + (-10) + (-7) + (-14) = -108$$

$$\bar{d} = \frac{-108}{10} = -10.8$$

Deviations from mean: $-4.2, 2.8, -11.2, 13.8, -1.2, -7.2, 5.8, 0.8, 3.8, -3.2$

Squared deviations: $17.64, 7.84, 125.44, 190.44, 1.44, 51.84, 33.64, 0.64, 14.44, 10.24$

$$\sum (d_i - \bar{d})^2 = 453.60 \qquad s_d^2 = \frac{453.60}{9} = 50.40 \qquad s_d = 7.099$$

$$SE = \frac{7.099}{\sqrt{10}} = \frac{7.099}{3.162} = 2.245$$

$$t = \frac{-10.8}{2.245} = -4.811$$

With $df = 9$, $P(t < -4.811) \approx 0.0005$.

Since $0.0005 < 0.05$, reject $H_0$.

Answer: There is strong evidence the training program reduces 5K time. The average improvement of 10.8 seconds is highly statistically significant. Nine of the ten runners improved.

Problem 4: Two hospitals compare infection rates. Hospital X: 12 infections out of 300 surgeries. Hospital Y: 20 infections out of 350 surgeries. Is there a difference in infection rates? ($\alpha = 0.05$, two-sided)

$H_0: p_1 = p_2$, $H_a: p_1 \neq p_2$

$$\hat{p}_1 = \frac{12}{300} = 0.04 \qquad \hat{p}_2 = \frac{20}{350} = 0.05714$$

$$\hat{p} = \frac{12 + 20}{300 + 350} = \frac{32}{650} = 0.04923$$

$$SE = \sqrt{0.04923 \times 0.95077 \times \left(\frac{1}{300} + \frac{1}{350}\right)} = \sqrt{0.04681 \times 0.006190} = \sqrt{0.000290} = 0.01703$$

$$z = \frac{0.04 - 0.05714}{0.01703} = \frac{-0.01714}{0.01703} = -1.006$$

Two-sided p-value $= 2 \times P(Z < -1.006) \approx 2 \times 0.157 = 0.314$.

Since $0.314 > 0.05$, fail to reject $H_0$.

Answer: There is not sufficient evidence of a difference in infection rates between the two hospitals. The observed difference (4% vs 5.7%) is not statistically significant.

Problem 5: A dietitian measures cholesterol levels of 20 patients before and after a 12-week diet program. The mean difference (after minus before) is $\bar{d} = -18.5$ mg/dL with $s_d = 22.0$ mg/dL. Test whether the diet reduces cholesterol at $\alpha = 0.05$.

$H_0: \mu_d = 0$, $H_a: \mu_d < 0$ (one-sided left)

$$SE = \frac{22.0}{\sqrt{20}} = \frac{22.0}{4.472} = 4.920$$

$$t = \frac{-18.5}{4.920} = -3.760$$

With $df = 19$, $P(t < -3.760) \approx 0.0007$.

Since $0.0007 < 0.05$, reject $H_0$.

Answer: There is strong evidence that the diet program reduces cholesterol. The average decrease of 18.5 mg/dL is both statistically significant and clinically meaningful — a reduction of this magnitude is associated with meaningful cardiovascular risk reduction.

Key Takeaways

  • The two-sample t-test (Welch’s) compares the means of two independent groups using $t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}$
  • The two-proportion z-test compares proportions from two independent groups using a pooled proportion under $H_0$
  • The paired t-test compares two measurements on the same subjects by analyzing the differences: $t = \frac{\bar{d}}{s_d/\sqrt{n}}$
  • Choosing the right test depends on two questions: (1) Are you comparing means or proportions? (2) Are the groups independent or paired?
  • Using a two-sample test when data is actually paired wastes statistical power — the pairing removes between-subject variability
  • Welch’s t-test is preferred over the pooled t-test because it does not assume equal variances
  • For the two-proportion z-test, always use the pooled proportion when calculating the standard error under $H_0$
  • In clinical settings, two-sample tests are essential for comparing treatment protocols, evaluating new interventions, and making evidence-based decisions about patient care
  • As with all hypothesis tests, consider both statistical significance and practical significance — a statistically significant difference may or may not be large enough to matter in practice


Last updated: March 29, 2026