Inference for Regression
So far, we have calculated the regression line from sample data — finding the slope, intercept, and $r^2$. But there is a deeper question: is the linear relationship real, or could it be due to chance? Even if two variables have no relationship in the population, a random sample could produce a nonzero slope just from sampling variability. Inference for regression uses hypothesis tests and confidence intervals to determine whether the observed slope reflects a genuine linear relationship in the population.
The Question: Is There a Real Linear Relationship?
The sample slope $b$ estimates the true population slope $\beta$ (the Greek letter beta). If there is no linear relationship in the population, then $\beta = 0$ — the regression line would be flat. Our goal is to test whether the data provides convincing evidence that $\beta$ is not zero.
The hypotheses are:

$$H_0: \beta = 0 \quad \text{(no linear relationship)}$$
$$H_a: \beta \neq 0 \quad \text{(a linear relationship exists)}$$
If we reject $H_0$, we have evidence that $x$ and $y$ are linearly related in the population. If we fail to reject $H_0$, the apparent trend in our sample could be due to random variation — we cannot conclude that a linear relationship exists.
Conditions for Regression Inference
Before performing inference, you must verify that four conditions are met. A useful mnemonic is LINE:
- L — Linearity: The true relationship between $x$ and $y$ is linear. Check this by examining the scatter plot and the residual plot for any curved pattern.
- I — Independence: The observations are independent of each other. This is usually satisfied when data comes from a random sample, or when the sample size is less than 10% of the population.
- N — Normality of residuals: The residuals follow an approximately normal distribution at each value of $x$. Check with a histogram or normal probability plot of the residuals. This condition is less critical for large samples due to the Central Limit Theorem.
- E — Equal variance (homoscedasticity): The spread of the residuals is roughly the same across all values of $x$. Check the residual plot — the vertical spread of dots should not fan out or narrow as $x$ increases.
If any of these conditions is seriously violated, the p-values and confidence intervals from the regression may not be trustworthy.
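The E condition is usually checked visually on the residual plot, but a rough numeric sanity check is possible. Below is a minimal sketch: it splits the residuals at the median $x$ and compares the spread of the two halves. The median split and the 2:1 ratio threshold are assumptions for illustration, not a formal test — the residual plot remains the primary diagnostic.

```python
import statistics

def rough_equal_variance_check(x, residuals, ratio_limit=2.0):
    """Crude check of the E (equal variance) condition: compare the
    residual spread on the lower and upper halves of the x range."""
    med = statistics.median(x)
    low = [e for xi, e in zip(x, residuals) if xi <= med]
    high = [e for xi, e in zip(x, residuals) if xi > med]
    s_low, s_high = statistics.pstdev(low), statistics.pstdev(high)
    return max(s_low, s_high) <= ratio_limit * min(s_low, s_high)

# Fanning residuals (spread grows with x) fail the check
fanning_ok = rough_equal_variance_check(
    [1, 2, 3, 4, 5, 6, 7, 8],
    [0.1, -0.1, 0.2, -0.2, 2.0, -2.0, 3.0, -3.0])

# Residuals with uniform spread pass it
uniform_ok = rough_equal_variance_check(
    [1, 2, 3, 4, 5, 6, 7, 8],
    [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0])
```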
The Regression Standard Error
Before we can test the slope, we need a measure of how much the data points scatter around the regression line. The regression standard error (also called the standard error of the estimate or residual standard error) is:

$$s = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n - 2}}$$
This is essentially the “typical” size of a residual. We divide by $n - 2$ (not $n - 1$) because we lose two degrees of freedom — one for estimating the slope and one for estimating the intercept.
The Standard Error of the Slope
The standard error of the slope measures how much the sample slope $b$ would vary from sample to sample:

$$SE_b = \frac{s}{\sqrt{\sum (x_i - \bar{x})^2}}$$
A smaller $SE_b$ means the slope estimate is more precise. Two things reduce $SE_b$:
- Smaller residuals (smaller $s$) — the data fits the line more tightly
- More spread in the $x$-values (larger $\sum (x_i - \bar{x})^2$) — the data covers a wider range of $x$
The degrees of freedom for this estimate are $n - 2$.
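As a quick numeric check of these two formulas, here is a minimal Python sketch using the residuals from the worked example later on this page (the data values are taken from that table):

```python
import math

# Study-hours data and residuals (y - y-hat) from the worked example
x = [2, 3, 5, 4, 6, 1]
residuals = [0.76, 0.47, -0.10, 0.18, -0.39, -0.95]
n = len(x)

# Regression standard error: typical residual size, with df = n - 2
sse = sum(e ** 2 for e in residuals)
s = math.sqrt(sse / (n - 2))                # about 0.688

# Standard error of the slope: s over the spread of the x-values
xbar = sum(x) / n
sxx = sum((xi - xbar) ** 2 for xi in x)     # 17.5
se_b = s / math.sqrt(sxx)                   # about 0.164
```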
T-Test for the Slope
The test statistic for testing $H_0: \beta = 0$ follows a $t$-distribution with $n - 2$ degrees of freedom:

$$t = \frac{b - 0}{SE_b} = \frac{b}{SE_b}$$
This has the familiar form: (estimate minus null value) divided by standard error. A large $|t|$ means the observed slope is many standard errors away from zero, which is strong evidence against $H_0$.
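The decision rule can be sketched in a few lines of Python, using the slope and standard error from the example that follows (the 2.776 critical value assumes a two-sided test at $\alpha = 0.05$ with 4 degrees of freedom):

```python
b, se_b, n = 5.286, 0.164, 6     # slope, its standard error, sample size
t = (b - 0) / se_b               # (estimate - null value) / standard error
df = n - 2                       # 4 degrees of freedom
t_crit = 2.776                   # two-sided critical value, alpha = 0.05, df = 4
reject_h0 = abs(t) > t_crit      # strong evidence against H0 when True
```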
Example 1: Testing the Study Hours Slope
Using the study hours vs test score data from the linear regression page ($\hat{y} = 53.667 + 5.286x$, $n = 6$), we need to compute the standard error of the slope.
Step 1: Gather the residuals (computed on the linear regression page).
| $x$ | $y$ | $\hat{y}$ | Residual ($y - \hat{y}$) |
|---|---|---|---|
| 2 | 65 | 64.24 | 0.76 |
| 3 | 70 | 69.53 | 0.47 |
| 5 | 80 | 80.10 | -0.10 |
| 4 | 75 | 74.82 | 0.18 |
| 6 | 85 | 85.39 | -0.39 |
| 1 | 58 | 58.95 | -0.95 |
Step 2: Compute the sum of squared residuals.

$$\sum (y - \hat{y})^2 = 0.76^2 + 0.47^2 + (-0.10)^2 + 0.18^2 + (-0.39)^2 + (-0.95)^2 = 1.8955$$
Step 3: Compute the regression standard error ($s$).

$$s = \sqrt{\frac{1.8955}{6 - 2}} = \sqrt{0.4739} \approx 0.688$$
Step 4: Compute $\sum (x - \bar{x})^2$, where $\bar{x} = 3.5$.

$$\sum (x - \bar{x})^2 = (2-3.5)^2 + (3-3.5)^2 + (5-3.5)^2 + (4-3.5)^2 + (6-3.5)^2 + (1-3.5)^2 = 17.5$$
Step 5: Compute the standard error of the slope.

$$SE_b = \frac{s}{\sqrt{\sum (x - \bar{x})^2}} = \frac{0.688}{\sqrt{17.5}} = \frac{0.688}{4.183} \approx 0.164$$
Step 6: Compute the $t$-statistic.

$$t = \frac{b}{SE_b} = \frac{5.286}{0.164} \approx 32.2$$
Step 7: Find the p-value. With $t = 32.2$ and $df = 6 - 2 = 4$, the p-value is astronomically small — essentially 0. (For reference, the critical value for a two-sided test at $\alpha = 0.05$ with $df = 4$ is $t^* = 2.776$. Our test statistic of 32.2 far exceeds this threshold.)
Conclusion: There is overwhelming evidence of a linear relationship between study hours and test scores. The extremely large $t$-statistic (and essentially zero p-value) means it is virtually impossible that a slope this steep would arise by chance if no linear relationship existed.
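In practice, the whole test can be reproduced in one call. A sketch assuming SciPy is installed — `scipy.stats.linregress` returns the slope, intercept, the two-sided p-value for $H_0: \beta = 0$, and the standard error of the slope:

```python
from scipy import stats

hours = [2, 3, 5, 4, 6, 1]
scores = [65, 70, 80, 75, 85, 58]

res = stats.linregress(hours, scores)
# res.slope     -> about 5.286
# res.intercept -> about 53.667
# res.stderr    -> standard error of the slope
# res.pvalue    -> two-sided p-value for the slope (tiny here)
```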
Confidence Interval for the Slope
A confidence interval gives a range of plausible values for the true population slope $\beta$:

$$b \pm t^* \cdot SE_b$$
where $t^*$ is the critical value from the $t$-distribution with $df = n - 2$ for the desired confidence level.
Example 2: 95% Confidence Interval for the Slope
Using the values from Example 1: $b = 5.286$, $SE_b = 0.164$, $df = 4$.

For a 95% confidence interval with $df = 4$, the critical value is $t^* = 2.776$.

$$5.286 \pm 2.776 \times 0.164 = 5.286 \pm 0.455 = (4.83, 5.74)$$
Interpretation: We are 95% confident that for each additional hour of study, the true average test score increase is between 4.83 and 5.74 points.
Notice that this interval does not contain 0. This is consistent with our hypothesis test result — since we rejected $H_0$, the confidence interval should not include 0. In general:
- If the confidence interval for $\beta$ does not contain 0 → you would reject $H_0$ at the corresponding significance level
- If the confidence interval for $\beta$ does contain 0 → you would fail to reject $H_0$
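The interval arithmetic is simple enough to script. A minimal sketch using the Example 1 numbers (2.776 assumes 95% confidence with $df = 4$):

```python
b, se_b = 5.286, 0.164
t_star = 2.776                        # 95% confidence, df = 4
margin = t_star * se_b                # about 0.455
ci = (b - margin, b + margin)         # about (4.83, 5.74)
contains_zero = ci[0] <= 0 <= ci[1]   # False: consistent with rejecting H0
```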
Reading Computer Output
In practice, you will rarely compute regression inference by hand. Software (Excel, SPSS, R, Python, TI calculators) provides a regression output table. Here is how to read a typical one:
| | Coefficient | Std Error | t-Statistic | P-value |
|---|---|---|---|---|
| Intercept | 53.667 | 0.640 | 83.8 | < 0.00001 |
| Hours ($x$) | 5.286 | 0.164 | 32.2 | 0.000004 |
Additional output:
| Statistic | Value |
|---|---|
| $R^2$ | 0.996 |
| Adjusted $R^2$ | 0.995 |
| Standard Error ($s$) | 0.688 |
| $n$ | 6 |
How to read this table:
- Coefficient column: The intercept ($a$) and slope ($b$) of the regression line
- Std Error column: The standard error of each coefficient. For the slope, $SE_b = 0.164$
- t-Statistic column: The test statistic for testing whether each coefficient equals zero. For the slope, $t = 5.286 / 0.164 = 32.2$
- P-value column: The p-value for the two-sided test $H_0: \beta = 0$. For the slope, $p = 0.000004$
- $R^2$: The coefficient of determination — proportion of variation in $y$ explained by the model
- Standard Error ($s$): The regression standard error — the typical size of a residual
The slope row is the most important line. If the p-value for the slope is small (typically below 0.05), you conclude that there is a statistically significant linear relationship.
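To demystify the software, the key numbers in the table above can be reproduced by hand in a few lines. This sketch uses only the standard library; p-values require a $t$-distribution and are left to statistical software:

```python
import math

x = [2, 3, 5, 4, 6, 1]                 # hours
y = [65, 70, 80, 75, 85, 58]           # scores
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

b = sxy / sxx                          # slope coefficient, about 5.286
a = ybar - b * xbar                    # intercept, about 53.667

sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(sse / (n - 2))           # regression standard error, about 0.69
se_b = s / math.sqrt(sxx)              # std error of the slope, about 0.165
t_slope = b / se_b                     # t-statistic, about 32
syy = sum((yi - ybar) ** 2 for yi in y)
r_squared = 1 - sse / syy              # R-squared, about 0.996
```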
Connecting Everything: The Big Picture
Here is how correlation, regression, and inference fit together:
- Scatter plot — visualize the relationship
- Correlation () — measure the strength and direction of the linear association
- Regression line () — describe the relationship with an equation and make predictions
- — quantify how well the line fits the data
- Inference (-test and CI for slope) — determine whether the relationship is real or could be due to chance
Each step builds on the previous one. You should never skip straight to inference without first examining the scatter plot and checking the LINE conditions.
Real-World Application: Nursing — Does BMI Predict Blood Pressure?
A nurse researcher wants to know whether body mass index (BMI) is linearly associated with systolic blood pressure. She collects data from a sample of patients and runs a regression. The software output shows:
| | Coefficient | Std Error | t-Statistic | P-value |
|---|---|---|---|---|
| Intercept | 82.4 | 8.7 | 9.47 | 0.0000 |
| BMI ($x$) | 1.85 | 0.32 | 5.78 | 0.0000 |
Additional output: $R^2 = 0.544$.
Interpreting the output:
- Slope: For each one-unit increase in BMI, systolic blood pressure is predicted to increase by 1.85 mmHg.
- P-value for slope: The p-value is essentially 0, providing very strong evidence that BMI and systolic BP are linearly related.
- $R^2$: About 54.4% of the variation in systolic blood pressure is explained by BMI. The remaining 45.6% is due to other factors (age, genetics, medication, diet, stress, etc.).
- 95% CI for slope: $(1.19, 2.51)$. We are 95% confident the true increase in systolic BP per unit of BMI is between 1.19 and 2.51 mmHg.
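The output values are internally consistent, which you can verify yourself. A small sketch: the $t$-statistic is just the coefficient over its standard error, and since the sample size is not reported here, the critical value implied by the confidence interval can be backed out (variable names are for illustration):

```python
b, se_b = 1.85, 0.32                   # BMI slope and its standard error
t = b / se_b                           # about 5.78, matching the t column

ci_low, ci_high = 1.19, 2.51           # reported 95% CI for the slope
implied_t_star = (ci_high - b) / se_b  # about 2.06, a plausible t* for a
                                       # moderate sample size
```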
Clinical implications: BMI is a statistically significant predictor of systolic blood pressure, but it explains only about half the variation. Nurses should consider BMI as one factor among many when assessing cardiovascular risk. A patient with a high BMI but normal blood pressure may have other protective factors, and a patient with a normal BMI but elevated blood pressure needs further evaluation.
Practice Problems
Test your understanding with these problems. Click to reveal each answer.
Problem 1: A regression analysis produces $b = 2.4$, $SE_b = 0.8$, and $n = 22$. Test whether the slope is significantly different from zero at $\alpha = 0.05$.
Step 1: State hypotheses: $H_0: \beta = 0$, $H_a: \beta \neq 0$

Step 2: Compute the test statistic.

$$t = \frac{b}{SE_b} = \frac{2.4}{0.8} = 3.0$$

Step 3: Degrees of freedom: $df = n - 2 = 20$

Step 4: Find the p-value. For $t = 3.0$ with $df = 20$ (two-sided): $p \approx 0.007$

Step 5: Since $p \approx 0.007 < \alpha = 0.05$, reject $H_0$.
Answer: There is statistically significant evidence of a linear relationship. The slope of 2.4 is significantly different from zero at the 5% significance level.
Problem 2: Using the values from Problem 1, construct a 95% confidence interval for the slope. The critical value is $t^* = 2.086$ for $df = 20$.

$$2.4 \pm 2.086 \times 0.8 = 2.4 \pm 1.67 = (0.73, 4.07)$$

Answer: We are 95% confident that the true slope is between 0.73 and 4.07. Since the interval does not contain 0, this is consistent with rejecting $H_0$.
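Both practice answers can be checked in a few lines (a sketch; 2.086 is the two-sided critical value for $df = 20$ at 95% confidence):

```python
b, se_b, n = 2.4, 0.8, 22        # values from Problem 1
t = b / se_b                     # 3.0
df = n - 2                       # 20
t_crit = 2.086
reject_h0 = abs(t) > t_crit      # True: slope differs from zero

margin = t_crit * se_b           # about 1.67
ci = (b - margin, b + margin)    # about (0.73, 4.07)
```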
Problem 3: A residual plot shows residuals that fan out — small spread on the left and large spread on the right. Which LINE condition is violated, and what does this mean?
The E (Equal variance) condition is violated. This pattern is called heteroscedasticity. It means the variability of $y$ is not constant across all values of $x$ — the predictions are more precise for some $x$-values than others.
Consequences: The standard errors, p-values, and confidence intervals from the regression may be unreliable. Possible remedies include transforming the response variable (e.g., using $\log y$ instead of $y$) or using weighted least squares regression.
Problem 4: A regression of advertising spending ($x$) vs revenue ($y$) for a sample of stores gives a positive slope estimate with $R^2 = 0.58$. (a) Test the slope at $\alpha = 0.05$. (b) Interpret $R^2$.

(a) State the hypotheses $H_0: \beta = 0$, $H_a: \beta \neq 0$, compute $t = b / SE_b$, and compare against the $t$-distribution with $df = n - 2$ (two-sided). For these data the p-value falls below 0.05.

Since $p < \alpha = 0.05$, reject $H_0$. There is significant evidence that advertising spending is linearly related to revenue.
(b) $R^2 = 0.58$ means that 58% of the variation in revenue is explained by the linear relationship with advertising spending. The remaining 42% is due to other factors such as location, competition, season, and store size.
Problem 5: A 95% confidence interval for a regression slope runs from a negative value to a positive value, so it contains 0. What would the conclusion of the corresponding hypothesis test be at $\alpha = 0.05$?
Since the confidence interval contains 0, we would fail to reject $H_0$ at the $\alpha = 0.05$ significance level.
Interpretation: There is not sufficient evidence of a linear relationship between $x$ and $y$. The true slope could plausibly be zero (no relationship), negative, or positive — the data does not allow us to distinguish.
Key connection: A confidence interval that contains 0 always corresponds to a hypothesis test that fails to reject $H_0$ at the same confidence level.
Key Takeaways
- Inference for regression determines whether the observed linear relationship in a sample reflects a real relationship in the population or could be due to chance
- The null hypothesis $H_0: \beta = 0$ states that the true population slope is zero (no linear relationship)
- Before performing inference, verify the LINE conditions: Linearity, Independence, Normal residuals, and Equal variance
- The standard error of the slope ($SE_b$) measures how much the sample slope would vary across different samples
- The $t$-test for the slope uses $t = b / SE_b$ with $df = n - 2$ — a large $|t|$ and small p-value provide evidence against $H_0$
- A confidence interval for the slope ($b \pm t^* \cdot SE_b$) gives a range of plausible values for the true population slope
- If the confidence interval for $\beta$ does not contain 0, you reject $H_0$; if it contains 0, you fail to reject $H_0$
- In computer output, the slope row is the most important — look at the coefficient, standard error, $t$-statistic, and p-value
- In clinical and applied settings, always consider both statistical significance (is the slope significantly different from zero?) and practical significance (is the effect large enough to matter?)
Return to Statistics for more topics in this section.
Last updated: March 29, 2026