
Inference for Regression

Last updated: March 2026 · Advanced

So far, we have calculated the regression line from sample data — finding the slope, intercept, and R². But there is a deeper question: is the linear relationship real, or could it be due to chance? Even if two variables have no relationship in the population, a random sample could produce a nonzero slope just from sampling variability. Inference for regression uses hypothesis tests and confidence intervals to determine whether the observed slope reflects a genuine linear relationship in the population.

The Question: Is There a Real Linear Relationship?

The sample slope b estimates the true population slope β (the Greek letter beta). If there is no linear relationship in the population, then β = 0 — the regression line would be flat. Our goal is to test whether the data provides convincing evidence that β is not zero.

The hypotheses are:

H₀: β = 0 (no linear relationship — the true slope is zero)

Hₐ: β ≠ 0 (there IS a linear relationship — the true slope is not zero)

If we reject H₀, we have evidence that x and y are linearly related in the population. If we fail to reject H₀, the apparent trend in our sample could be due to random variation — we cannot conclude that a linear relationship exists.

Conditions for Regression Inference

Before performing inference, you must verify that four conditions are met. A useful mnemonic is LINE:

  • L — Linearity: The true relationship between x and y is linear. Check this by examining the scatter plot and the residual plot for any curved pattern.
  • I — Independence: The observations are independent of each other. This is usually satisfied when the data comes from a random sample, or when the sample size is less than 10% of the population.
  • N — Normality of residuals: The residuals follow an approximately normal distribution at each value of x. Check with a histogram or normal probability plot of the residuals. This condition is less critical for large samples due to the Central Limit Theorem.
  • E — Equal variance (homoscedasticity): The spread of the residuals is roughly the same across all values of x. Check the residual plot — the vertical spread of the dots should not fan out or narrow as x increases.

If any of these conditions is seriously violated, the p-values and confidence intervals from the regression may not be trustworthy.

The Regression Standard Error

Before we can test the slope, we need a measure of how much the data points scatter around the regression line. The regression standard error (also called the standard error of the estimate or residual standard error) is:

s = √( Σ(yᵢ − ŷᵢ)² / (n − 2) )

This is essentially the “typical” size of a residual. We divide by n − 2 (not n) because we lose two degrees of freedom — one for estimating the slope and one for estimating the intercept.
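As a quick numeric sketch of this formula (the residual values below are made up purely for illustration):

```python
import numpy as np

# Hypothetical residuals from some fitted regression line
residuals = np.array([0.5, -0.3, 0.2, -0.4])
n = residuals.size

# Divide by n - 2, not n: two degrees of freedom are spent
# estimating the slope and the intercept
s = np.sqrt(np.sum(residuals**2) / (n - 2))
```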

The Standard Error of the Slope

The standard error of the slope measures how much the sample slope b would vary from sample to sample:

SE_b = s / √( Σ(xᵢ − x̄)² )

A smaller SE_b means the slope estimate is more precise. Two things reduce SE_b:

  1. Smaller residuals (smaller s) — the data fits the line more tightly
  2. More spread in the x-values (larger Σ(xᵢ − x̄)²) — the data covers a wider range of x

The degrees of freedom for this estimate are df = n − 2.
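A small sketch of the second point (the numbers here are invented): holding s fixed, spreading the x-values further apart shrinks SE_b.

```python
import numpy as np

def se_slope(s, x):
    """Standard error of the slope: s / sqrt(sum of squared x-deviations)."""
    x = np.asarray(x, dtype=float)
    return s / np.sqrt(np.sum((x - x.mean())**2))

s = 1.0  # assume the same regression standard error for both designs
se_narrow = se_slope(s, [4, 5, 6])  # x-values bunched together
se_wide = se_slope(s, [1, 5, 9])    # same center, wider spread

# The wider design yields a smaller standard error for the slope
```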

T-Test for the Slope

The test statistic for testing H₀: β = 0 follows a t-distribution:

t = (b − 0) / SE_b = b / SE_b

This has the familiar form: (estimate minus null value) divided by standard error. A large |t| means the observed slope is many standard errors away from zero, which is strong evidence against H₀.

Example 1: Testing the Study Hours Slope

Using the study hours vs. test score data from the linear regression page (n = 6, b = 5.286), we need to compute the standard error of the slope.

Step 1: Gather the residuals (computed on the linear regression page).

   x    y      ŷ     Residual (y − ŷ)
   2   65   64.24         0.76
   3   70   69.53         0.47
   5   80   80.10        −0.10
   4   75   74.82         0.18
   6   85   85.39        −0.39
   1   58   58.95        −0.95

Step 2: Compute the sum of squared residuals.

Σ(yᵢ − ŷᵢ)² = 0.76² + 0.47² + 0.10² + 0.18² + 0.39² + 0.95²

= 0.578 + 0.221 + 0.010 + 0.032 + 0.152 + 0.903 = 1.896

Step 3: Compute the regression standard error (df = n − 2 = 4).

s = √(1.896 / (6 − 2)) = √(1.896 / 4) = √0.474 = 0.688

Step 4: Compute Σ(xᵢ − x̄)², where x̄ = 3.5.

Σ(xᵢ − x̄)² = (2 − 3.5)² + (3 − 3.5)² + (5 − 3.5)² + (4 − 3.5)² + (6 − 3.5)² + (1 − 3.5)²

= 2.25 + 0.25 + 2.25 + 0.25 + 6.25 + 6.25 = 17.5

Step 5: Compute the standard error of the slope.

SE_b = 0.688 / √17.5 = 0.688 / 4.183 = 0.164

Step 6: Compute the t-statistic.

t = 5.286 / 0.164 = 32.2

Step 7: Find the p-value. With df = 4 and t = 32.2, the p-value is astronomically small — essentially 0. (For reference, the critical value for a two-sided test at α = 0.05 with df = 4 is t* = 2.776. Our test statistic of 32.2 far exceeds this threshold.)

Conclusion: There is overwhelming evidence of a linear relationship between study hours and test scores. The extremely large t-statistic (and essentially zero p-value) means it is virtually impossible that a slope this steep would arise by chance if no linear relationship existed.
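The hand calculation above can be checked with a few lines of NumPy/SciPy. Carrying full precision (instead of the rounded residuals used in Step 2) gives s ≈ 0.690 and t ≈ 32.0, in close agreement with the rounded hand values:

```python
import numpy as np
from scipy import stats

x = np.array([2, 3, 5, 4, 6, 1], dtype=float)
y = np.array([65, 70, 80, 75, 85, 58], dtype=float)

sxx = np.sum((x - x.mean())**2)                    # = 17.5
b = np.sum((x - x.mean()) * (y - y.mean())) / sxx  # slope
a = y.mean() - b * x.mean()                        # intercept
resid = y - (a + b * x)                            # residuals

n = x.size
s = np.sqrt(np.sum(resid**2) / (n - 2))          # regression standard error
se_b = s / np.sqrt(sxx)                          # standard error of the slope
t_stat = b / se_b                                # t-statistic
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)  # two-sided p-value
```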

Confidence Interval for the Slope

A confidence interval gives a range of plausible values for the true population slope β:

b ± t* · SE_b

where t* is the critical value from the t-distribution with df = n − 2 for the desired confidence level.

Example 2: 95% Confidence Interval for the Slope

Using the values from Example 1: b = 5.286, SE_b = 0.164, df = 4.

For a 95% confidence interval with df = 4, the critical value is t* = 2.776.

5.286 ± 2.776(0.164) = 5.286 ± 0.455

(4.83, 5.74)

Interpretation: We are 95% confident that for each additional hour of study, the true average test score increase is between 4.83 and 5.74 points.

Notice that this interval does not contain 0. This is consistent with our hypothesis test result — since we rejected H₀: β = 0, the confidence interval should not include 0. In general:

  • If the confidence interval for β does not contain 0 → you would reject H₀ at the corresponding significance level
  • If the confidence interval for β does contain 0 → you would fail to reject H₀
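The same interval can be reproduced in code (a sketch using the Example 1 numbers, with scipy supplying the critical value):

```python
from scipy import stats

b, se_b, df = 5.286, 0.164, 4
t_crit = stats.t.ppf(0.975, df)   # ≈ 2.776 for a 95% interval
lower = b - t_crit * se_b
upper = b + t_crit * se_b

# The interval excludes 0, consistent with rejecting H0: beta = 0
interval_excludes_zero = not (lower <= 0 <= upper)
```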

Reading Computer Output

In practice, you will rarely compute regression inference by hand. Software (Excel, SPSS, R, Python, TI calculators) provides a regression output table. Here is how to read a typical one:

             Coefficient   Std Error   t-Statistic   P-value
Intercept       53.667       0.640        83.8       < 0.00001
Hours (x)        5.286       0.164        32.2       0.000004

Additional output:

Statistic            Value
R²                   0.996
Adjusted R²          0.995
Standard Error (s)   0.688
n                    6

How to read this table:

  • Coefficient column: The intercept (a = 53.667) and slope (b = 5.286) of the regression line
  • Std Error column: The standard error of each coefficient. For the slope, SE_b = 0.164
  • t-Statistic column: The test statistic for testing whether each coefficient equals zero. For the slope, t = 32.2
  • P-value column: The p-value for the two-sided test H₀: coefficient = 0. For the slope, p = 0.000004
  • R²: The coefficient of determination — the proportion of variation in y explained by the model
  • Standard Error (s): The regression standard error — the typical size of a residual

The slope row is the most important line. If the p-value for the slope is small (typically below 0.05), you conclude that there is a statistically significant linear relationship.
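Most of the table above can be reproduced with scipy.stats.linregress (a sketch on the study-hours data; full-precision values differ slightly from the hand-rounded table, e.g. the slope's standard error comes out ≈ 0.165 rather than 0.164):

```python
from scipy.stats import linregress

x = [2, 3, 5, 4, 6, 1]
y = [65, 70, 80, 75, 85, 58]

res = linregress(x, y)
print(f"Intercept: {res.intercept:.3f}  (SE {res.intercept_stderr:.3f})")
print(f"Slope:     {res.slope:.3f}  (SE {res.stderr:.3f})")
print(f"R^2:       {res.rvalue**2:.3f}")
print(f"p-value:   {res.pvalue:.2e}")  # two-sided test of slope = 0
```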

Connecting Everything: The Big Picture

Here is how correlation, regression, and inference fit together:

  1. Scatter plot — visualize the relationship
  2. Correlation (r) — measure the strength and direction of the linear association
  3. Regression line (ŷ = a + bx) — describe the relationship with an equation and make predictions
  4. R² — quantify how well the line fits the data
  5. Inference (t-test and CI for the slope) — determine whether the relationship is real or could be due to chance

Each step builds on the previous one. You should never skip straight to inference without first examining the scatter plot and checking the LINE conditions.

Real-World Application: Nursing — Does BMI Predict Blood Pressure?

A nurse researcher wants to know whether body mass index (BMI) is linearly associated with systolic blood pressure. She collects data from n = 30 patients and runs a regression. The software output shows:

            Coefficient   Std Error   t-Statistic   P-value
Intercept      82.4          8.7         9.47       0.0000
BMI (x)         1.85         0.32        5.78       0.0000

Additional: R² = 0.544, s = 11.2, df = 28.

Interpreting the output:

  • Slope: For each one-unit increase in BMI, systolic blood pressure is predicted to increase by 1.85 mmHg.
  • P-value for slope: The p-value is essentially 0, providing very strong evidence that BMI and systolic BP are linearly related.
  • R² = 0.544: About 54.4% of the variation in systolic blood pressure is explained by BMI. The remaining 45.6% is due to other factors (age, genetics, medication, diet, stress, etc.).
  • 95% CI for slope: 1.85 ± 2.048(0.32) = 1.85 ± 0.66 = (1.19, 2.51). We are 95% confident the true increase in systolic BP per unit of BMI is between 1.19 and 2.51 mmHg.

Clinical implications: BMI is a statistically significant predictor of systolic blood pressure, but it explains only about half the variation. Nurses should consider BMI as one factor among many when assessing cardiovascular risk. A patient with a high BMI but normal blood pressure may have other protective factors, and a patient with a normal BMI but elevated blood pressure needs further evaluation.

Practice Problems

Test your understanding with these problems. A worked solution follows each one.

Problem 1: A regression analysis produces b = 2.4, SE_b = 0.8, and n = 22. Test whether the slope is significantly different from zero at α = 0.05.

Step 1: State the hypotheses: H₀: β = 0, Hₐ: β ≠ 0

Step 2: Compute the test statistic.

t = b / SE_b = 2.4 / 0.8 = 3.0

Step 3: Degrees of freedom: df = 22 − 2 = 20

Step 4: Find the p-value. For t = 3.0 with df = 20 (two-sided): p ≈ 0.007

Step 5: Since 0.007 < 0.05, reject H₀.

Answer: There is statistically significant evidence of a linear relationship. The slope of 2.4 is significantly different from zero at the 5% significance level.
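The p-value in Step 4 can be confirmed with scipy (a sketch):

```python
from scipy import stats

t_stat = 2.4 / 0.8               # = 3.0
df = 22 - 2                      # = 20
p = 2 * stats.t.sf(t_stat, df)   # two-sided p-value, approximately 0.007
```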

Problem 2: Using the values from Problem 1, construct a 95% confidence interval for the slope. The critical value is t* = 2.086 for df = 20.

b ± t* · SE_b = 2.4 ± 2.086(0.8) = 2.4 ± 1.669

(0.73, 4.07)

Answer: We are 95% confident that the true slope is between 0.73 and 4.07. Since the interval does not contain 0, this is consistent with rejecting H₀: β = 0.
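The interval in Problem 2 can be reproduced in a few lines of scipy:

```python
from scipy import stats

b, se_b, df = 2.4, 0.8, 20
t_crit = stats.t.ppf(0.975, df)   # ≈ 2.086
lower, upper = b - t_crit * se_b, b + t_crit * se_b
```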

Problem 3: A residual plot shows residuals that fan out — small spread on the left and large spread on the right. Which LINE condition is violated, and what does this mean?

The E (Equal variance) condition is violated. This pattern is called heteroscedasticity. It means the variability of y is not constant across all values of x — the predictions are more precise for some x-values than for others.

Consequences: The standard errors, p-values, and confidence intervals from the regression may be unreliable. Possible remedies include transforming the response variable (e.g., using log(y) instead of y) or using weighted least squares regression.

Problem 4: A regression of advertising spending (x) vs. revenue (y) for n = 12 stores gives: b = 4.5, SE_b = 1.2, R² = 0.58. (a) Test the slope at α = 0.05. (b) Interpret R².

(a) t = 4.5 / 1.2 = 3.75, df = 10

For t = 3.75 with df = 10 (two-sided): p ≈ 0.004

Since 0.004 < 0.05, reject H₀. There is significant evidence that advertising spending is linearly related to revenue.

(b) R² = 0.58 means that 58% of the variation in revenue is explained by the linear relationship with advertising spending. The remaining 42% is due to other factors such as location, competition, season, and store size.
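A quick scipy check of part (a):

```python
from scipy import stats

t_stat = 4.5 / 1.2               # = 3.75
p = 2 * stats.t.sf(t_stat, 10)   # two-sided p-value, df = n - 2 = 10
```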

Problem 5: A 95% confidence interval for a regression slope is (−0.8, 1.4). What would the conclusion of the corresponding hypothesis test be at α = 0.05?

Since the confidence interval contains 0, we would fail to reject H₀: β = 0 at the α = 0.05 significance level.

Interpretation: There is not sufficient evidence of a linear relationship between x and y. The true slope could plausibly be zero (no relationship), negative, or positive — the data does not allow us to distinguish.

Key connection: A confidence interval that contains 0 always corresponds to a hypothesis test that fails to reject H₀: β = 0 at the corresponding significance level.

Key Takeaways

  • Inference for regression determines whether the observed linear relationship in a sample reflects a real relationship in the population or could be due to chance
  • The null hypothesis H₀: β = 0 states that the true population slope is zero (no linear relationship)
  • Before performing inference, verify the LINE conditions: Linearity, Independence, Normal residuals, and Equal variance
  • The standard error of the slope, SE_b = s / √( Σ(xᵢ − x̄)² ), measures how much the sample slope would vary across different samples
  • The t-test for the slope uses t = b / SE_b with df = n − 2 — a large |t| and a small p-value provide evidence against H₀
  • A confidence interval for the slope (b ± t* · SE_b) gives a range of plausible values for the true population slope
  • If the confidence interval for β does not contain 0, you reject H₀; if it contains 0, you fail to reject
  • In computer output, the slope row is the most important — look at the coefficient, standard error, t-statistic, and p-value
  • In clinical and applied settings, always consider both statistical significance (is the slope significantly different from zero?) and practical significance (is the effect large enough to matter?)

Return to Statistics for more topics in this section.
