Two-Sample Tests

Last updated: March 2026 · Advanced

Before you start

You should be comfortable with:

One-Sample T-Test, One-Sample Z-Tests

Many real-world questions are about comparing two groups. Is a new drug more effective than the standard treatment? Do students who attend tutoring score higher than those who do not? Is there a difference in defect rates between two factories? Single-sample tests cannot answer these questions — you need a test that compares two sets of data. In this lesson, you will learn three such tests: the two-sample t-test for comparing independent means, the two-proportion z-test for comparing independent proportions, and the paired t-test for matched-pairs data. Together, these cover the vast majority of two-group comparisons you will encounter.

Two-Sample T-Test for Independent Means

Use this test when you have two independent groups and you want to compare their population means. “Independent” means the subjects in one group have no connection to the subjects in the other — for example, a treatment group and a control group with different people in each.

Test statistic (Welch’s t-test):

$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$

where $\bar{x}_1, \bar{x}_2$ are the sample means, $s_1, s_2$ are the sample standard deviations, and $n_1, n_2$ are the sample sizes.

Degrees of freedom: The exact degrees of freedom for Welch’s t-test use the Welch-Satterthwaite formula, which is complex. In practice, use software to compute it. For hand calculations, a conservative approximation is $df = \min(n_1 - 1, n_2 - 1)$ , which gives a slightly larger p-value than the exact method (a safe, conservative approach).
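As a rough illustration of what software does here, the sketch below evaluates the standard Welch-Satterthwaite formula, $df = \frac{(s_1^2/n_1 + s_2^2/n_2)^2}{\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}}$, and compares it with the conservative shortcut. The function name `welch_satterthwaite_df` is made up for this example, and the inputs are the summary statistics from Example 1 later in this lesson.

```python
def welch_satterthwaite_df(s1, n1, s2, n2):
    """Approximate degrees of freedom for Welch's t-test (hypothetical helper)."""
    v1 = s1 ** 2 / n1  # estimated variance of the first sample mean
    v2 = s2 ** 2 / n2  # estimated variance of the second sample mean
    return (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))


# Summary statistics from Example 1 (treatment vs control)
df_welch = welch_satterthwaite_df(s1=10, n1=25, s2=12, n2=30)
df_conservative = min(25 - 1, 30 - 1)

print(f"Welch-Satterthwaite df: {df_welch:.1f}")  # about 53.0
print(f"Conservative df:        {df_conservative}")  # 24
```

The Welch-Satterthwaite value always falls between $\min(n_1 - 1, n_2 - 1)$ and $n_1 + n_2 - 2$, which is why the shortcut can only err on the cautious side.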

Conditions:

Both samples are random and independent of each other
Observations within each sample are independent
Both populations are approximately normal, OR both sample sizes are large ( $n_1 \geq 30$ and $n_2 \geq 30$ )
No extreme outliers, especially for small samples

Example 1: Treatment vs Control

A medical researcher tests a new cognitive training program. The treatment group ( $n_1 = 25$ ) completes the program and scores $\bar{x}_1 = 82$ with $s_1 = 10$ on a standardized test. The control group ( $n_2 = 30$ ) receives no training and scores $\bar{x}_2 = 76$ with $s_2 = 12$ . Is there evidence that the treatment group scores higher? Test at $\alpha = 0.05$ .

Step 1: State the hypotheses.

$H_0: \mu_1 - \mu_2 = 0 \quad \text{(no difference between groups)}$

$H_a: \mu_1 - \mu_2 > 0 \quad \text{(treatment group scores higher — one-sided right)}$

Step 2: Choose significance level: $\alpha = 0.05$ .

Step 3: Check conditions. Independent random assignment ✓. The treatment group ($n_1 = 25$) falls below the $n \geq 30$ guideline, so we rely on the populations being approximately normal; standardized test scores typically are ✓.

Step 4: Calculate the test statistic.

$SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} = \sqrt{\frac{10^2}{25} + \frac{12^2}{30}} = \sqrt{\frac{100}{25} + \frac{144}{30}} = \sqrt{4 + 4.8} = \sqrt{8.8} = 2.966$

$t = \frac{82 - 76}{2.966} = \frac{6}{2.966} = 2.023$

Step 5: Find the p-value. Using conservative $df = \min(24, 29) = 24$ :

$\text{p-value} = P(t > 2.023 \text{ with } df = 24) \approx 0.027$

Step 6: Make a decision. Since $0.027 < 0.05$ , we reject $H_0$ .

Step 7: Conclusion in context. There is statistically significant evidence that the cognitive training program improves test scores. The treatment group scored an average of 6 points higher than the control group. This difference is both statistically significant and large enough to suggest the program has a meaningful effect on cognitive performance.
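The arithmetic in Steps 4 and 5 can be checked with a short script. This is a minimal sketch using only the Python standard library: the helpers `t_density` and `t_upper_tail` are hypothetical names, and the p-value is obtained by numerically integrating the t density with Simpson's rule rather than by calling a statistics package.

```python
import math


def t_density(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))


def t_upper_tail(t, df, steps=2000):
    """P(T > t), integrating the density from 0 to |t| with Simpson's rule."""
    a = abs(t)
    h = a / steps  # steps must be even for Simpson's rule
    total = t_density(0.0, df) + t_density(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    central = total * h / 3  # P(0 < T < |t|)
    return 0.5 - central if t >= 0 else 0.5 + central


# Example 1 summary statistics
mean1, s1, n1 = 82, 10, 25  # treatment group
mean2, s2, n2 = 76, 12, 30  # control group

se = math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
t_stat = (mean1 - mean2) / se
df = min(n1 - 1, n2 - 1)  # conservative degrees of freedom
p_value = t_upper_tail(t_stat, df)  # one-sided right

print(f"SE = {se:.3f}, t = {t_stat:.3f}, df = {df}, p = {p_value:.3f}")
```

The helper accepts non-integer degrees of freedom, so the same function could be reused with a Welch-Satterthwaite value in place of the conservative one.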

Why Welch’s T-Test?

You may encounter two versions of the two-sample t-test: the pooled (equal variance) version and the Welch (unequal variance) version. Welch’s t-test does not assume the two populations have equal standard deviations. Since this assumption is hard to verify and often violated, Welch’s t-test is the safer default. Software defaults vary: R’s t.test function uses Welch’s version unless told otherwise, while SciPy’s ttest_ind function uses the pooled version unless you pass equal_var=False, so check which one you are running. When the standard deviations are actually equal, both versions give nearly identical results, so very little is lost by using Welch’s test.
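To see how close the two versions can be, the sketch below computes both test statistics for the Example 1 data. The pooled formula is not given in this lesson; the version used here is the standard textbook one, with pooled variance $s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$. The function names are illustrative only.

```python
import math


def welch_t(mean1, s1, n1, mean2, s2, n2):
    """Welch's t statistic: each group keeps its own variance."""
    se = math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
    return (mean1 - mean2) / se


def pooled_t(mean1, s1, n1, mean2, s2, n2):
    """Pooled t statistic: assumes both populations share one variance."""
    pooled_var = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / se


# Example 1: treatment (82, 10, 25) vs control (76, 12, 30)
t_welch = welch_t(82, 10, 25, 76, 12, 30)
t_pooled = pooled_t(82, 10, 25, 76, 12, 30)

print(f"Welch t  = {t_welch:.3f}")   # 2.023
print(f"Pooled t = {t_pooled:.3f}")  # 1.989
```

With similar standard deviations (10 and 12) and similar sample sizes, the two statistics differ only in the second decimal place. The gap tends to widen when the group with the larger spread has the smaller sample.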

Two-Proportion Z-Test

Use this test when you want to compare proportions from two independent groups — for example, the success rates of two treatments, the defect rates of two production lines, or the pass rates of two schools.

Pooled proportion (used under $H_0: p_1 = p_2$ ):

$\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}$

where $x_1$ and $x_2$ are the number of successes in each group.

Test statistic:

$z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$

The pooled proportion is used because under $H_0$ , we assume the two population proportions are equal. The best single estimate of that common proportion combines both samples.

Conditions:

Both samples are independent random samples
Observations within each sample are independent
Large sample condition: $n_1\hat{p} \geq 10$ , $n_1(1-\hat{p}) \geq 10$ , $n_2\hat{p} \geq 10$ , and $n_2(1-\hat{p}) \geq 10$ (using the pooled $\hat{p}$ )

Example 2: Drug Side Effects

A pharmaceutical company is comparing the side-effect rates of two drugs. Drug A is given to 200 patients, and 18 experience side effects. Drug B is given to 250 patients, and 30 experience side effects. Is there evidence that Drug A has a lower side-effect rate? Test at $\alpha = 0.05$ .

Step 1: State the hypotheses.

$H_0: p_1 = p_2 \quad \text{(side-effect rates are the same)}$

$H_a: p_1 < p_2 \quad \text{(Drug A has a lower rate — one-sided left)}$

Step 2: Choose significance level: $\alpha = 0.05$ .

Step 3: Calculate the pooled proportion and check conditions.

$\hat{p}_1 = \frac{18}{200} = 0.09 \qquad \hat{p}_2 = \frac{30}{250} = 0.12$

$\hat{p} = \frac{18 + 30}{200 + 250} = \frac{48}{450} = 0.1067$

Large sample check (using pooled $\hat{p} = 0.1067$ ): $200 \times 0.1067 = 21.3 \geq 10$ ✓, $200 \times 0.8933 = 178.7 \geq 10$ ✓, $250 \times 0.1067 = 26.7 \geq 10$ ✓, $250 \times 0.8933 = 223.3 \geq 10$ ✓.

Step 4: Calculate the test statistic.

$SE = \sqrt{0.1067 \times 0.8933 \times \left(\frac{1}{200} + \frac{1}{250}\right)} = \sqrt{0.09531 \times 0.009} = \sqrt{0.000858} = 0.02929$

$z = \frac{0.09 - 0.12}{0.02929} = \frac{-0.03}{0.02929} = -1.024$

Step 5: Find the p-value. One-sided left:

$\text{p-value} = P(Z < -1.024) \approx 0.153$

Step 6: Make a decision. Since $0.153 > 0.05$ , we fail to reject $H_0$ .

Step 7: Conclusion in context. There is not sufficient evidence to conclude that Drug A has a lower side-effect rate than Drug B. While the sample rates differ (9% vs 12%), this 3-percentage-point difference could reasonably be due to random variation. A larger study would be needed to detect such a small difference if it truly exists.
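The steps of this example can be reproduced with the Python standard library. The sketch below is one possible way to organize the calculation; the function name `two_proportion_z_test` and its `alternative` argument are assumptions made for illustration, not part of any particular package.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(x1, n1, x2, n2, alternative="two-sided"):
    """Pooled two-proportion z-test; returns (z, p-value)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)

    # Large-sample condition, checked with the pooled proportion
    expected = [n1 * pooled, n1 * (1 - pooled), n2 * pooled, n2 * (1 - pooled)]
    if min(expected) < 10:
        raise ValueError("large-sample condition not met")

    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    lower = NormalDist().cdf(z)  # P(Z < z) for the standard normal

    if alternative == "less":
        return z, lower
    if alternative == "greater":
        return z, 1 - lower
    if alternative == "two-sided":
        return z, 2 * min(lower, 1 - lower)
    raise ValueError("alternative must be 'less', 'greater', or 'two-sided'")


# Example 2: Drug A (18 of 200) vs Drug B (30 of 250), one-sided left
z_stat, p_value = two_proportion_z_test(18, 200, 30, 250, alternative="less")
print(f"z = {z_stat:.3f}, p-value = {p_value:.3f}")  # z = -1.024, p-value = 0.153
```

Raising an error when the large-sample condition fails is a design choice for this sketch; a gentler alternative would be to return the result along with a warning.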

Paired T-Test (Matched Pairs)

The paired t-test is used when you have two measurements on the same subjects — before and after a treatment, left eye and right eye, or any other naturally paired design. Because the two measurements are on the same individual, they are not independent of each other, so you cannot use a two-sample t-test. Instead, you compute the difference for each pair and then perform a one-sample t-test on those differences.

Procedure:

For each pair, calculate the difference: $d_i = x_{\text{after}} - x_{\text{before}}$ (or whichever direction is relevant)
Compute $\bar{d}$ (the mean of the differences) and $s_d$ (the standard deviation of the differences)
Test statistic: $t = \frac{\bar{d}}{s_d / \sqrt{n}}$ with $df = n - 1$ , where $n$ is the number of pairs

Conditions:

The pairs are randomly selected or randomly assigned
The differences are independent of each other (one pair does not influence another)
The population of differences is approximately normal, OR $n \geq 30$

Example 3: Blood Pressure Before and After Medication

A physician measures the systolic blood pressure of 12 patients before and after taking a new medication. The differences (after minus before) are:

$-8, -5, -12, -3, -7, -10, -4, -6, -9, -1, -11, -6$

Is there evidence the medication lowers blood pressure? Test at $\alpha = 0.01$ .

Step 1: State the hypotheses. Since negative differences mean blood pressure decreased:

$H_0: \mu_d = 0 \quad \text{(no change in blood pressure)}$

$H_a: \mu_d < 0 \quad \text{(blood pressure decreased — one-sided left)}$

Step 2: Choose significance level: $\alpha = 0.01$ .

Step 3: Check conditions. Random sample of patients ✓. Differences are independent (different patients) ✓. With $n = 12$ , we need approximate normality of the differences — the values appear roughly symmetric with no extreme outliers ✓.

Step 4: Calculate $\bar{d}$ and $s_d$ .

$\text{Sum} = -8 + (-5) + (-12) + (-3) + (-7) + (-10) + (-4) + (-6) + (-9) + (-1) + (-11) + (-6) = -82$

$\bar{d} = \frac{-82}{12} = -6.833$

To find $s_d$ , compute each deviation from the mean and square it:

$d_i$	$d_i - \bar{d}$	$(d_i - \bar{d})^2$
$-8$	$-1.167$	$1.362$
$-5$	$1.833$	$3.360$
$-12$	$-5.167$	$26.698$
$-3$	$3.833$	$14.692$
$-7$	$-0.167$	$0.028$
$-10$	$-3.167$	$10.030$
$-4$	$2.833$	$8.026$
$-6$	$0.833$	$0.694$
$-9$	$-2.167$	$4.696$
$-1$	$5.833$	$34.024$
$-11$	$-4.167$	$17.364$
$-6$	$0.833$	$0.694$

$\sum (d_i - \bar{d})^2 = 121.668$

$s_d^2 = \frac{121.668}{12 - 1} = \frac{121.668}{11} = 11.061$

$s_d = \sqrt{11.061} = 3.326$

Now calculate the test statistic:

$SE = \frac{s_d}{\sqrt{n}} = \frac{3.326}{\sqrt{12}} = \frac{3.326}{3.464} = 0.9602$

$t = \frac{-6.833}{0.9602} = -7.116$

Step 5: Find the p-value. With $df = 11$ and a one-sided left test:

$\text{p-value} = P(t < -7.116 \text{ with } df = 11) \approx 0.00001$

The p-value is extremely small — well below any conventional significance level.

Step 6: Make a decision. Since $0.00001 < 0.01$ , we reject $H_0$ .

Step 7: Conclusion in context. There is overwhelming evidence that the medication lowers systolic blood pressure. The average reduction of 6.8 mmHg is both statistically significant and clinically meaningful. Every single patient in the sample showed a decrease, and the magnitude of the average drop exceeds the threshold that physicians consider relevant for cardiovascular risk reduction.
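The hand calculation above can be verified from the raw differences. This is a sketch using only the standard library: `t_density` and `t_lower_tail` are hypothetical helpers that approximate the t distribution's lower tail by numerical integration (Simpson's rule), standing in for the table or software lookup in Step 5.

```python
import math
from statistics import mean, stdev


def t_density(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))


def t_lower_tail(t, df, steps=2000):
    """P(T < t), integrating the density from 0 to |t| with Simpson's rule."""
    a = abs(t)
    h = a / steps  # steps must be even for Simpson's rule
    total = t_density(0.0, df) + t_density(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    central = total * h / 3  # P(0 < T < |t|)
    return 0.5 - central if t < 0 else 0.5 + central


# Example 3: after-minus-before differences in systolic blood pressure
differences = [-8, -5, -12, -3, -7, -10, -4, -6, -9, -1, -11, -6]

n = len(differences)
d_bar = mean(differences)
s_d = stdev(differences)  # sample standard deviation (divides by n - 1)
t_stat = d_bar / (s_d / math.sqrt(n))
p_value = t_lower_tail(t_stat, df=n - 1)  # one-sided left

print(f"d_bar = {d_bar:.3f}, s_d = {s_d:.3f}, t = {t_stat:.3f}")
print(f"p-value = {p_value:.6f}")  # about 0.00001
```

Carrying full precision gives $t \approx -7.118$ rather than the $-7.116$ obtained from rounded intermediate values. The difference comes only from rounding and does not change the conclusion.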

Why Paired Instead of Two-Sample?

You might wonder why we do not simply compare the “before” group and the “after” group with a two-sample t-test. The answer is that the paired design is far more powerful. By computing differences within each subject, you eliminate the large person-to-person variability in baseline blood pressure. Patient A might have a baseline of 160 and Patient B might have a baseline of 120, but the medication effect within each patient can be isolated. The standard deviation of the differences ( $s_d = 3.326$ ) is much smaller than the standard deviation of the raw measurements would be, giving a much larger test statistic and a much smaller p-value.

Choosing the Right Two-Sample Test

Situation	Test	Key Formula
Comparing means of two independent groups, $\sigma$ unknown	Two-sample t-test (Welch)	$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}$
Comparing proportions of two independent groups	Two-proportion z-test	$z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})(1/n_1 + 1/n_2)}}$
Comparing two measurements on the same subjects	Paired t-test	$t = \frac{\bar{d}}{s_d / \sqrt{n}}$

Decision process:

1. Are you comparing means or proportions?
- Means: go to step 2
- Proportions: use the two-proportion z-test
2. Are the two groups independent or paired?
- Independent (different subjects in each group): use the two-sample t-test
- Paired (same subjects measured twice, or naturally matched pairs): use the paired t-test
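The decision process above is small enough to write down as a function. The sketch below is only an illustration of the two questions; the function name and the string labels are invented for this example.

```python
def choose_two_group_test(comparing, design="independent"):
    """Pick a two-group test from the two questions in the decision process.

    comparing: "means" or "proportions"
    design:    "independent" or "paired" (only consulted for means)
    """
    if comparing == "proportions":
        return "two-proportion z-test"
    if comparing == "means":
        if design == "independent":
            return "two-sample t-test (Welch)"
        if design == "paired":
            return "paired t-test"
        raise ValueError("design must be 'independent' or 'paired'")
    raise ValueError("comparing must be 'means' or 'proportions'")


# The three worked examples in this lesson
print(choose_two_group_test("means", "independent"))  # two-sample t-test (Welch)
print(choose_two_group_test("proportions"))           # two-proportion z-test
print(choose_two_group_test("means", "paired"))       # paired t-test
```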

A common mistake is using a two-sample t-test when the data is actually paired. This ignores the pairing structure and loses statistical power. Always ask: “Are the two measurements connected to the same individual?” If yes, use the paired t-test.

Real-World Application: Nursing — Comparing Treatment Protocols

A hospital evaluates two wound-care protocols. Protocol A (the current standard) is used on 35 randomly assigned patients, with a mean healing time of $\bar{x}_1 = 14.2$ days and $s_1 = 3.8$ days. Protocol B (a new approach) is used on 40 patients, with $\bar{x}_2 = 12.5$ days and $s_2 = 4.1$ days. Is Protocol B faster? Test at $\alpha = 0.05$ .

$H_0: \mu_1 - \mu_2 = 0 \qquad H_a: \mu_1 - \mu_2 > 0$

(Note: $H_a$ tests whether Protocol A takes longer, meaning Protocol B is faster.)

$SE = \sqrt{\frac{3.8^2}{35} + \frac{4.1^2}{40}} = \sqrt{\frac{14.44}{35} + \frac{16.81}{40}} = \sqrt{0.4126 + 0.4203} = \sqrt{0.8329} = 0.9126$

$t = \frac{14.2 - 12.5}{0.9126} = \frac{1.7}{0.9126} = 1.863$

Using conservative $df = \min(34, 39) = 34$ :

$\text{p-value} = P(t > 1.863, df = 34) \approx 0.036$

Since $0.036 < 0.05$ , reject $H_0$ . There is evidence Protocol B produces faster healing. The 1.7-day average improvement is clinically relevant: fewer inpatient days, lower infection risk, and reduced treatment costs. The nursing team should consider transitioning to Protocol B, with continued monitoring to confirm the results hold across a broader patient population.

Practice Problems

Test your understanding with these problems.

Problem 1: A school district compares math scores between two teaching methods. Method 1: $n_1 = 28$, $\bar{x}_1 = 78$, $s_1 = 9$. Method 2: $n_2 = 32$, $\bar{x}_2 = 73$, $s_2 = 11$. Is there evidence that Method 1 is better? ($\alpha = 0.05$)

$H_0: \mu_1 - \mu_2 = 0$ , $H_a: \mu_1 - \mu_2 > 0$ (one-sided right)

$SE = \sqrt{\frac{9^2}{28} + \frac{11^2}{32}} = \sqrt{\frac{81}{28} + \frac{121}{32}} = \sqrt{2.893 + 3.781} = \sqrt{6.674} = 2.583$

$t = \frac{78 - 73}{2.583} = \frac{5}{2.583} = 1.936$

Conservative $df = \min(27, 31) = 27$ . $P(t > 1.936, df = 27) \approx 0.032$ .

Since $0.032 < 0.05$ , reject $H_0$ .

Answer: There is statistically significant evidence that Method 1 produces higher math scores than Method 2, with an observed average difference of 5 points.

Problem 2: A marketing team tests two ad designs. Ad A: 45 clicks out of 500 impressions. Ad B: 60 clicks out of 500 impressions. Is Ad B’s click rate higher? ($\alpha = 0.05$)

$H_0: p_1 = p_2$, $H_a: p_1 < p_2$ (one-sided left, with Ad A as group 1: Ad B having the higher rate is the same as Ad A having the lower rate)

$\hat{p}_1 = \frac{45}{500} = 0.09 \qquad \hat{p}_2 = \frac{60}{500} = 0.12$

$\hat{p} = \frac{45 + 60}{500 + 500} = \frac{105}{1000} = 0.105$

$SE = \sqrt{0.105 \times 0.895 \times \left(\frac{1}{500} + \frac{1}{500}\right)} = \sqrt{0.09398 \times 0.004} = \sqrt{0.000376} = 0.01939$

$z = \frac{0.09 - 0.12}{0.01939} = \frac{-0.03}{0.01939} = -1.547$

$P(Z < -1.547) \approx 0.061$ .

Since $0.061 > 0.05$ , fail to reject $H_0$ .

Answer: There is not sufficient evidence at the 5% level that Ad B has a higher click-through rate. The observed difference of 3 percentage points could be due to chance. A larger sample size would increase the power to detect such a difference.

Problem 3: Ten runners are timed on a 5K before and after a training program. The differences (after minus before, in seconds) are: $-15, -8, -22, 3, -12, -18, -5, -10, -7, -14$. Is there evidence the training reduces 5K time? ($\alpha = 0.05$)

$H_0: \mu_d = 0$ , $H_a: \mu_d < 0$ (one-sided left)

$\text{Sum} = -15 + (-8) + (-22) + 3 + (-12) + (-18) + (-5) + (-10) + (-7) + (-14) = -108$

$\bar{d} = \frac{-108}{10} = -10.8$

Deviations from mean: $-4.2, 2.8, -11.2, 13.8, -1.2, -7.2, 5.8, 0.8, 3.8, -3.2$

Squared deviations: $17.64, 7.84, 125.44, 190.44, 1.44, 51.84, 33.64, 0.64, 14.44, 10.24$

$\sum = 453.60 \qquad s_d^2 = \frac{453.60}{9} = 50.40 \qquad s_d = 7.099$

$SE = \frac{7.099}{\sqrt{10}} = \frac{7.099}{3.162} = 2.245$

$t = \frac{-10.8}{2.245} = -4.811$

With $df = 9$ , $P(t < -4.811) \approx 0.0005$ .

Since $0.0005 < 0.05$ , reject $H_0$ .

Answer: There is strong evidence the training program reduces 5K time. The average improvement of 10.8 seconds is highly statistically significant. Nine of the ten runners improved.

Problem 4: Two hospitals compare infection rates. Hospital X: 12 infections out of 300 surgeries. Hospital Y: 20 infections out of 350 surgeries. Is there a difference in infection rates? ($\alpha = 0.05$, two-sided)

$H_0: p_1 = p_2$ , $H_a: p_1 \neq p_2$

$\hat{p}_1 = \frac{12}{300} = 0.04 \qquad \hat{p}_2 = \frac{20}{350} = 0.05714$

$\hat{p} = \frac{12 + 20}{300 + 350} = \frac{32}{650} = 0.04923$

$SE = \sqrt{0.04923 \times 0.95077 \times \left(\frac{1}{300} + \frac{1}{350}\right)} = \sqrt{0.04681 \times 0.006190} = \sqrt{0.000290} = 0.01703$

$z = \frac{0.04 - 0.05714}{0.01703} = \frac{-0.01714}{0.01703} = -1.006$

Two-sided p-value $= 2 \times P(Z < -1.006) \approx 2 \times 0.157 = 0.314$ .

Since $0.314 > 0.05$ , fail to reject $H_0$ .

Answer: There is not sufficient evidence of a difference in infection rates between the two hospitals. The observed difference (4% vs 5.7%) is not statistically significant.

Problem 5: A dietitian measures cholesterol levels of 20 patients before and after a 12-week diet program. The mean difference (after minus before) is $\bar{d} = -18.5$ mg/dL with $s_d = 22.0$ mg/dL. Test whether the diet reduces cholesterol at $\alpha = 0.05$.

$H_0: \mu_d = 0$ , $H_a: \mu_d < 0$ (one-sided left)

$SE = \frac{22.0}{\sqrt{20}} = \frac{22.0}{4.472} = 4.919$

$t = \frac{-18.5}{4.919} = -3.761$

With $df = 19$, $P(t < -3.761) \approx 0.0007$.

Since $0.0007 < 0.05$ , reject $H_0$ .

Answer: There is strong evidence that the diet program reduces cholesterol. The average decrease of 18.5 mg/dL is both statistically significant and clinically meaningful — a reduction of this magnitude is associated with meaningful cardiovascular risk reduction.

Key Takeaways

The two-sample t-test (Welch’s) compares the means of two independent groups using $t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}$
The two-proportion z-test compares proportions from two independent groups using a pooled proportion under $H_0$
The paired t-test compares two measurements on the same subjects by analyzing the differences: $t = \frac{\bar{d}}{s_d/\sqrt{n}}$
Choosing the right test depends on two questions: (1) Are you comparing means or proportions? (2) Are the groups independent or paired?
Using a two-sample test when data is actually paired wastes statistical power — the pairing removes between-subject variability
Welch’s t-test is preferred over the pooled t-test because it does not assume equal variances
For the two-proportion z-test, always use the pooled proportion when calculating the standard error under $H_0$
In clinical settings, two-sample tests are essential for comparing treatment protocols, evaluating new interventions, and making evidence-based decisions about patient care
As with all hypothesis tests, consider both statistical significance and practical significance — a statistically significant difference may or may not be large enough to matter in practice

Last updated: March 29, 2026