Introduction to Hypothesis Testing
Hypothesis testing is how we use data to make decisions. Is a new drug more effective than the current treatment? Is a factory’s defect rate higher than the acceptable threshold? Is there a real difference between two teaching methods, or is the observed difference just due to random chance? These questions all follow the same logical framework — and that framework is hypothesis testing. In this lesson, you will learn the core concepts: null and alternative hypotheses, test statistics, p-values, significance levels, and the types of errors that can occur.
The Logic of Hypothesis Testing
The reasoning behind a hypothesis test follows a structured sequence:
- Start with a claim — assume some default position (the null hypothesis) is true
- Collect data — draw a random sample and compute a relevant statistic
- Ask a key question — if the null hypothesis were true, how surprising is the data we observed?
- Make a decision — if the data would be very surprising under the null hypothesis, reject it. If the data is consistent with the null, fail to reject it.
A helpful analogy is the legal standard of “innocent until proven guilty.” In a trial, the default assumption (null hypothesis) is that the defendant is innocent. The prosecution must present evidence strong enough to reject that assumption beyond a reasonable doubt. If the evidence is not strong enough, the defendant is found “not guilty” — which is not the same as saying “innocent.” Similarly, failing to reject the null hypothesis does not prove it is true; it only means the data did not provide strong enough evidence against it.
Null and Alternative Hypotheses
Every hypothesis test begins with two competing statements about the population:
- $H_0$ (the null hypothesis): the default claim — typically “no effect,” “no difference,” or “the parameter equals a specific value”
- $H_a$ (the alternative hypothesis): what you suspect or want to demonstrate — typically “there is an effect,” “there is a difference,” or “the parameter differs from the null value”
The null hypothesis always contains an equals sign. The alternative hypothesis uses $<$, $>$, or $\neq$ to specify the direction of the difference (or that any difference matters).
Examples of Hypothesis Pairs
Drug effectiveness:
- $H_0$: $p = 0.5$ (the drug is no better than a placebo — 50% recovery rate)
- $H_a$: $p > 0.5$ (the drug produces a higher recovery rate)
Manufacturing quality:
- $H_0$: $\mu = \mu_0$ (the machine fills packages to the target weight $\mu_0$, in grams)
- $H_a$: $\mu \neq \mu_0$ (the machine is off target — too high or too low)
Comparing two groups:
- $H_0$: $\mu_1 = \mu_2$ (no difference between the two groups)
- $H_a$: $\mu_1 \neq \mu_2$ (the groups differ)
One-Sided vs Two-Sided Tests
- A one-sided (one-tailed) test has an alternative hypothesis with a direction: $H_a: \mu > \mu_0$ or $H_a: \mu < \mu_0$. Use this when you only care about a deviation in one direction.
- A two-sided (two-tailed) test has an alternative hypothesis without a specific direction: $H_a: \mu \neq \mu_0$ or $H_a: p \neq p_0$. Use this when a deviation in either direction would be important.
When in doubt, use a two-sided test — it is more conservative and does not assume the direction of the effect in advance.
Test Statistics
A test statistic is a single number that measures how far the sample result is from what the null hypothesis predicts. The general form is:
$$\text{test statistic} = \frac{\text{sample statistic} - \text{null value}}{\text{standard error}}$$
This is essentially a z-score or t-score: it tells you how many standard errors the observed result is from the hypothesized value. A test statistic of 0 means the data matches the null hypothesis perfectly. A large positive or negative test statistic means the data is far from what the null hypothesis would predict.
For a proportion:
$$z = \frac{\hat{p} - p_0}{\sqrt{\dfrac{p_0(1 - p_0)}{n}}}$$
Note that in hypothesis testing for proportions, the standard error uses the null value $p_0$ (not the sample proportion $\hat{p}$) because we are asking: “If $p_0$ were the true proportion, how unlikely is our sample?”
For a mean (when $\sigma$ is unknown):
$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$$
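The two test statistics above translate directly into code. Below is a minimal sketch using only the Python standard library; the function names are illustrative, not from any particular package:

```python
import math

def z_stat_proportion(p_hat: float, p0: float, n: int) -> float:
    """z-statistic for a one-sample proportion test.
    The standard error uses the null value p0, not p_hat."""
    se = math.sqrt(p0 * (1 - p0) / n)
    return (p_hat - p0) / se

def t_stat_mean(x_bar: float, mu0: float, s: float, n: int) -> float:
    """t-statistic for a one-sample mean test (sigma unknown)."""
    se = s / math.sqrt(n)
    return (x_bar - mu0) / se

# e.g., 62 heads in 100 flips against a null of p0 = 0.5:
print(round(z_stat_proportion(0.62, 0.50, 100), 2))  # 2.4
```

Both helpers have the same shape — (observed − hypothesized) / standard error — which is the point of the general formula.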
P-Values — What They Mean
The p-value is the probability of observing a result as extreme as (or more extreme than) the sample data, assuming the null hypothesis is true. It answers the question: “If $H_0$ is true, how likely are we to see data this extreme just by random chance?”
- A small p-value (e.g., 0.003) means the observed data would be very unlikely under $H_0$. This is strong evidence against the null hypothesis.
- A large p-value (e.g., 0.42) means the observed data is consistent with $H_0$. There is no strong evidence against the null hypothesis.
What the P-Value is NOT
The p-value is one of the most misinterpreted concepts in statistics. Here are common errors:
- The p-value is NOT the probability that $H_0$ is true. It is the probability of the data (or more extreme data) given that $H_0$ is true — a conditional probability, not a direct statement about $H_0$.
- The p-value is NOT the probability that the result is due to chance. It is calculated under the assumption that chance alone is operating.
- A large p-value does NOT prove $H_0$ is true. It only means the data is consistent with $H_0$ — the data might also be consistent with other hypotheses.
Example 1: Testing a Coin for Fairness
You flip a coin 100 times and get 62 heads. Is the coin fair?
Step 1: State the hypotheses.
$$H_0: p = 0.5 \qquad H_a: p \neq 0.5$$
Step 2: Calculate the test statistic. Under $H_0$, the standard error uses $p_0 = 0.5$:
$$z = \frac{\hat{p} - p_0}{\sqrt{\dfrac{p_0(1 - p_0)}{n}}} = \frac{0.62 - 0.50}{\sqrt{\dfrac{(0.5)(0.5)}{100}}} = \frac{0.12}{0.05} = 2.4$$
Step 3: Find the p-value. Since this is a two-sided test:
$$p\text{-value} = 2 \cdot P(Z \geq 2.4) = 2(0.0082) = 0.0164$$
Interpretation: If the coin were truly fair, there would be only a 1.64% chance of getting a result this extreme (62 or more heads, or 38 or fewer) out of 100 flips. This is fairly strong evidence against the coin being fair.
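The arithmetic in this example can be checked in a few lines of Python, using `statistics.NormalDist` from the standard library for the standard normal CDF:

```python
import math
from statistics import NormalDist

p_hat, p0, n = 0.62, 0.50, 100
z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)   # standard error uses p0
p_value = 2 * (1 - NormalDist().cdf(abs(z)))      # two-sided: double the tail
print(f"z = {z:.2f}, p-value = {p_value:.4f}")    # z = 2.40, p-value = 0.0164
```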
Significance Level ($\alpha$)
The significance level $\alpha$ is the threshold you set before collecting data that determines how small the p-value must be to reject $H_0$. It represents the maximum probability of making a Type I error (rejecting a true null hypothesis) that you are willing to tolerate.
Common significance levels:
| $\alpha$ | Meaning |
|---|---|
| 0.10 | 10% risk of false positive — used in exploratory studies |
| 0.05 | 5% risk — the most common standard in research |
| 0.01 | 1% risk — used when consequences of a false positive are severe |
Decision rule:
- If p-value $\leq \alpha$: reject $H_0$. The result is statistically significant.
- If p-value $> \alpha$: fail to reject $H_0$. The result is not statistically significant.
Example 1 continued: With $\alpha = 0.05$, the p-value of 0.0164 is less than 0.05. We reject $H_0$ and conclude there is statistically significant evidence that the coin is not fair.
Note the careful language: we say “fail to reject $H_0$,” not “accept $H_0$.” Absence of evidence against the null is not evidence that the null is true.
Type I and Type II Errors
Because hypothesis tests are based on sample data, there is always a chance of making the wrong decision. There are exactly two types of errors:
- Type I Error (False Positive): Rejecting $H_0$ when it is actually true. You conclude there is an effect when there really is not one. The probability of this error equals $\alpha$.
- Type II Error (False Negative): Failing to reject $H_0$ when it is actually false. You miss a real effect. The probability of this error is denoted $\beta$.
The following table summarizes all possible outcomes:
| | $H_0$ is True | $H_0$ is False |
|---|---|---|
| Reject $H_0$ | Type I Error (probability = $\alpha$) | Correct decision (probability = $1 - \beta$) |
| Fail to reject $H_0$ | Correct decision (probability = $1 - \alpha$) | Type II Error (probability = $\beta$) |
Real-world consequences:
- Type I Error in medicine: Approving a drug that does not actually work. Patients receive an ineffective treatment, and resources are wasted.
- Type II Error in medicine: Failing to approve a drug that does work. Patients miss out on an effective treatment.
There is a trade-off between the two errors: lowering $\alpha$ (making it harder to reject $H_0$) reduces the risk of Type I errors but increases the risk of Type II errors. The only way to reduce both simultaneously is to increase the sample size.
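Both error rates can be seen directly by simulation. The sketch below uses illustrative parameter choices and the normal-approximation test: it repeatedly tests a truly fair coin (so every rejection is a Type I error) and a coin with true $p = 0.62$ (so every non-rejection is a Type II error):

```python
import math
import random
from statistics import NormalDist

random.seed(1)  # fixed seed so the simulation is reproducible

def rejects(true_p: float, n: int = 100, alpha: float = 0.05) -> bool:
    """Simulate n flips and run a two-sided test of H0: p = 0.5."""
    heads = sum(random.random() < true_p for _ in range(n))
    z = (heads / n - 0.5) / math.sqrt(0.25 / n)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_value < alpha

trials = 2000
# H0 is true: the rejection rate should be close to alpha
type1 = sum(rejects(0.50) for _ in range(trials)) / trials
# H0 is false: the non-rejection rate estimates beta
type2 = 1 - sum(rejects(0.62) for _ in range(trials)) / trials
print(f"Type I rate ~ {type1:.3f}, Type II rate ~ {type2:.3f}")
```

Rerunning with a larger `n` shrinks the Type II rate while the Type I rate stays pinned near $\alpha$ — the trade-off described above.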
Statistical Power
Power is the probability of correctly rejecting a false null hypothesis:
$$\text{Power} = 1 - \beta = P(\text{reject } H_0 \mid H_0 \text{ is false})$$
A test with high power is good at detecting real effects. A test with low power may miss real effects, leading to inconclusive results.
Three main factors affect power:
- Sample size — larger samples provide more information and increase power. This is the factor researchers have the most control over.
- Effect size — a larger true difference from the null value is easier to detect. You cannot control this, but you can design studies around the smallest effect you consider practically meaningful.
- Significance level ($\alpha$) — a larger $\alpha$ (say 0.10 instead of 0.05) makes it easier to reject $H_0$, increasing power. However, this also increases the Type I error rate.
A commonly cited target for power is 0.80 (80%), meaning the test has an 80% chance of detecting a real effect if one exists. Power analysis — calculating the sample size needed to achieve a desired power — is a critical step in study planning.
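The effect of sample size on power can be computed directly. This is a sketch under stated assumptions — `power_proportion` is an illustrative helper, and the formula assumes the large-sample one-sided z-test for a single proportion:

```python
import math
from statistics import NormalDist

def power_proportion(p0: float, p_true: float, n: int,
                     alpha: float = 0.05) -> float:
    """Approximate power of a one-sided (upper-tail) z-test of
    H0: p = p0 when the true proportion is p_true."""
    z_crit = NormalDist().inv_cdf(1 - alpha)
    se0 = math.sqrt(p0 * (1 - p0) / n)            # SE under H0
    se1 = math.sqrt(p_true * (1 - p_true) / n)    # SE under the true p
    threshold = p0 + z_crit * se0                 # reject when p_hat exceeds this
    return 1 - NormalDist().cdf((threshold - p_true) / se1)

# Detecting a true p of 0.60 against H0: p = 0.50 at various sample sizes:
for n in (50, 100, 200, 400):
    print(n, round(power_proportion(0.50, 0.60, n), 2))
```

Power climbs steadily with $n$; solving for the smallest $n$ that pushes this value past 0.80 is exactly what a power analysis does.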
Statistical Significance vs Practical Significance
A result can be statistically significant (small p-value) but practically meaningless. This happens when the sample size is very large: with enough data, even a tiny, unimportant difference can produce a p-value below 0.05.
For example, suppose a study of 100,000 patients finds that a new blood pressure medication lowers systolic pressure by 0.5 mmHg compared to the standard treatment, with a p-value of 0.001. The result is statistically significant — but a 0.5 mmHg reduction has virtually no clinical impact. No doctor would change their prescribing based on such a small difference.
Always consider the effect size alongside the p-value. Ask: “Is the observed difference large enough to matter in practice?” Report confidence intervals whenever possible — they convey both the direction and the magnitude of the effect, which a p-value alone cannot do.
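As a concrete illustration of reporting magnitude, here is a 95% confidence interval for the coin example from earlier (62 heads in 100 flips), using the standard Wald (normal-approximation) interval; the helper name is illustrative:

```python
import math
from statistics import NormalDist

def wald_ci(p_hat: float, n: int, level: float = 0.95) -> tuple:
    """Wald confidence interval for a proportion.
    Note: for estimation the SE uses p_hat, not the null value p0."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return (p_hat - z * se, p_hat + z * se)

lo, hi = wald_ci(0.62, 100)
print(f"95% CI for the heads probability: ({lo:.3f}, {hi:.3f})")
```

The interval (roughly 0.52 to 0.72) says more than "p = 0.0164": the coin's bias could be tiny or substantial, and a wider study would be needed to pin down the magnitude.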
The Steps of a Hypothesis Test
Every hypothesis test follows the same seven-step framework:
- State $H_0$ and $H_a$ — define the null and alternative hypotheses in terms of a population parameter
- Choose the significance level $\alpha$ — typically 0.05 unless there is a reason to use a different threshold
- Check conditions — verify that the sample is random, observations are independent, and the appropriate distributional assumptions are met
- Calculate the test statistic — measure how far the sample result is from the null value, in standard error units
- Find the p-value — determine the probability of observing a result this extreme (or more extreme) if $H_0$ is true
- Make a decision — compare the p-value to $\alpha$ and reject or fail to reject $H_0$
- State the conclusion in context — translate the statistical decision into a plain-language statement about the original research question
Following this framework ensures that every test is conducted systematically and transparently.
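The seven steps map naturally onto code. The sketch below walks them through a hypothetical one-sided proportion test — the coupon-redemption scenario and its numbers are invented purely for illustration:

```python
import math
from statistics import NormalDist

# Hypothetical data: 30 of 80 customers (37.5%) redeem a coupon;
# the historical redemption rate is 25%. Has the rate increased?

# Step 1: H0: p = 0.25  vs  Ha: p > 0.25 (one-sided, upper tail)
p0, n, successes = 0.25, 80, 30
# Step 2: choose the significance level before looking at the data
alpha = 0.05
# Step 3: check conditions (large-sample rule: n*p0 and n*(1-p0) >= 10)
assert n * p0 >= 10 and n * (1 - p0) >= 10
# Step 4: test statistic (standard error uses the null value p0)
p_hat = successes / n
z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
# Step 5: one-sided p-value (upper tail)
p_value = 1 - NormalDist().cdf(z)
# Step 6: decision
decision = "reject H0" if p_value < alpha else "fail to reject H0"
# Step 7: conclusion in context
print(f"z = {z:.2f}, p = {p_value:.4f} -> {decision}")
```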
Real-World Application: Nursing — Evaluating ER Wait Times
A hospital administration claims that the average emergency room wait time is 30 minutes. A nursing researcher suspects the actual wait time is longer. She records the wait times for a random sample of 50 patients and finds a sample mean of $\bar{x} = 34$ minutes. Is there evidence that the average wait time exceeds 30 minutes? Use $\alpha = 0.05$.
Step 1: State the hypotheses.
$$H_0: \mu = 30 \qquad H_a: \mu > 30$$
Step 2: Significance level: $\alpha = 0.05$.
Step 3: Check conditions. Random sample ✓. Independence (50 patients is less than 10% of all ER patients) ✓. Sample size $n = 50 \geq 30$, so the CLT applies ✓.
Step 4: Calculate the test statistic from the sample mean and the sample standard deviation $s$:
$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{34 - 30}{s/\sqrt{50}}$$
Step 5: Find the p-value. With $df = n - 1 = 49$ (one-sided, upper tail), the p-value is the area under the t-distribution to the right of the observed statistic.
Step 6: Decision. Since the p-value is less than $\alpha = 0.05$, we reject $H_0$.
Step 7: Conclusion in context. There is statistically significant evidence at the $\alpha = 0.05$ level that the average ER wait time exceeds 30 minutes. The sample data suggests the true mean is around 34 minutes. The hospital should investigate causes of the longer wait times and consider operational changes.
Clinical note: This finding is also practically significant — a 4-minute increase over the claimed wait time represents a meaningful difference in patient experience and potential triage delays, especially during peak hours.
Practice Problems
Test your understanding with these problems; a worked solution follows each one.
Problem 1: A company claims that 90% of its orders ship on time. A consumer group surveys 200 recent orders and finds that 168 shipped on time. Is there evidence that the on-time rate is less than 90%? ($\alpha = 0.05$)
$H_0: p = 0.90$, $H_a: p < 0.90$ (one-sided, lower tail)
$$\hat{p} = \frac{168}{200} = 0.84 \qquad z = \frac{0.84 - 0.90}{\sqrt{\dfrac{(0.90)(0.10)}{200}}} = \frac{-0.06}{0.0212} \approx -2.83$$
$$p\text{-value} = P(Z \leq -2.83) \approx 0.0023$$
Since $0.0023 < 0.05$, reject $H_0$.
Answer: There is statistically significant evidence that the on-time shipping rate is less than 90%. The sample estimate of 84% suggests a meaningful shortfall.
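Problem 1's numbers can be verified with a short script (normal approximation, lower-tail p-value):

```python
import math
from statistics import NormalDist

p0, n = 0.90, 200
p_hat = 168 / n                                    # 0.84
z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)    # SE uses the null value
p_value = NormalDist().cdf(z)                      # lower tail only
print(f"z = {z:.2f}, p-value = {p_value:.4f}")     # z = -2.83, p-value = 0.0023
```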
Problem 2: A coin is flipped 200 times and lands heads 108 times. Test whether the coin is fair at $\alpha = 0.05$.
$H_0: p = 0.5$, $H_a: p \neq 0.5$ (two-sided)
$$\hat{p} = \frac{108}{200} = 0.54 \qquad z = \frac{0.54 - 0.50}{\sqrt{\dfrac{(0.5)(0.5)}{200}}} = \frac{0.04}{0.0354} \approx 1.13$$
$$p\text{-value} = 2 \cdot P(Z \geq 1.13) \approx 0.26$$
Since $0.26 > 0.05$, fail to reject $H_0$.
Answer: There is not sufficient evidence to conclude the coin is unfair. Getting 108 heads out of 200 flips is well within the range of normal variation for a fair coin.
Problem 3: Identify the type of error in each scenario: (a) A drug is approved based on trial data, but later turns out to be ineffective. (b) A useful teaching method is dismissed because a small study found no significant improvement.
(a) This is a Type I Error (false positive). The null hypothesis (“the drug has no effect”) was true, but it was rejected based on the sample data.
(b) This is a Type II Error (false negative). The null hypothesis (“the teaching method has no effect”) was false (the method does work), but the study failed to reject it — likely due to low power from the small sample size.
Problem 4: A nutritionist believes the average daily calorie intake of college students exceeds 2,000 calories. A random sample of $n = 35$ students yields a sample mean $\bar{x}$ and standard deviation $s$. Test at $\alpha = 0.01$.
$H_0: \mu = 2000$, $H_a: \mu > 2000$ (one-sided)
With $t = \dfrac{\bar{x} - 2000}{s/\sqrt{35}}$ and $df = 34$, the one-sided p-value falls between 0.01 and 0.05.
Since p-value $> 0.01$, fail to reject $H_0$.
Answer: At the $\alpha = 0.01$ level, there is not sufficient evidence to conclude that the mean daily calorie intake exceeds 2,000 calories. Note: at $\alpha = 0.05$, this result would be significant — the choice of significance level matters.
Problem 5: A hospital’s historical infection rate is 5%. After implementing a new hygiene protocol, a sample of 500 patients shows 16 infections. Is there evidence the rate has decreased? ($\alpha = 0.05$)
$H_0: p = 0.05$, $H_a: p < 0.05$ (one-sided, lower tail)
$$\hat{p} = \frac{16}{500} = 0.032 \qquad z = \frac{0.032 - 0.050}{\sqrt{\dfrac{(0.05)(0.95)}{500}}} = \frac{-0.018}{0.0097} \approx -1.85$$
$$p\text{-value} = P(Z \leq -1.85) \approx 0.032$$
Since $0.032 < 0.05$, reject $H_0$.
Answer: There is statistically significant evidence that the infection rate has decreased below 5% after the new protocol. The sample rate of 3.2% represents a meaningful improvement, supporting continued use of the new hygiene procedures.
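Problem 5's calculation, verified the same way (lower-tail z-test, normal approximation):

```python
import math
from statistics import NormalDist

p0, n, infections = 0.05, 500, 16
p_hat = infections / n                             # 0.032
z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)    # SE uses the null value
p_value = NormalDist().cdf(z)                      # lower tail only
print(f"z = {z:.2f}, p-value = {p_value:.4f}")     # z = -1.85, p-value = 0.0324
```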
Key Takeaways
- Hypothesis testing is a structured framework for using sample data to make decisions about population parameters
- The null hypothesis ($H_0$) represents the status quo (no effect, no difference); the alternative hypothesis ($H_a$) represents what you want to demonstrate
- The test statistic measures how far the sample result is from the null value in standard error units
- The p-value is the probability of observing data as extreme as yours if $H_0$ were true — a small p-value provides evidence against $H_0$
- Compare the p-value to your significance level $\alpha$: if p-value $\leq \alpha$, reject $H_0$; otherwise, fail to reject $H_0$
- Type I Error (false positive, probability $\alpha$) means rejecting a true $H_0$; Type II Error (false negative, probability $\beta$) means failing to reject a false $H_0$
- Power ($1 - \beta$) is the ability to detect a real effect — it increases with larger sample sizes, larger effect sizes, and larger $\alpha$
- Statistical significance does not imply practical significance — always consider the effect size and real-world context alongside the p-value
- In healthcare, hypothesis tests help evaluate new treatments, monitor quality metrics, and make evidence-based decisions that directly affect patient care
Last updated: March 29, 2026