
Inference for Regression

Last updated: March 2026 · Advanced

So far, we have calculated the regression line from sample data — finding the slope, intercept, and R². But there is a deeper question: is the linear relationship real, or could it be due to chance? Even if two variables have no relationship in the population, a random sample could produce a nonzero slope just from sampling variability. Inference for regression uses hypothesis tests and confidence intervals to determine whether the observed slope reflects a genuine linear relationship in the population.

The Question: Is There a Real Linear Relationship?

The sample slope b estimates the true population slope β (the Greek letter beta). If there is no linear relationship in the population, then β = 0 — the regression line would be flat. Our goal is to test whether the data provides convincing evidence that β is not zero.

The hypotheses are:

H₀: β = 0 (no linear relationship — the true slope is zero)

Hₐ: β ≠ 0 (there IS a linear relationship — the true slope is not zero)

If we reject H₀, we have evidence that x and y are linearly related in the population. If we fail to reject H₀, the apparent trend in our sample could be due to random variation — we cannot conclude that a linear relationship exists.

Conditions for Regression Inference

Before performing inference, you must verify that four conditions are met. A useful mnemonic is LINE:

  • L — Linearity: The true relationship between x and y is linear. Check this by examining the scatter plot and the residual plot for any curved pattern.
  • I — Independence: The observations are independent of each other. This is usually satisfied when the data comes from a random sample, or when the sample size is less than 10% of the population.
  • N — Normality of residuals: The residuals follow an approximately normal distribution at each value of x. Check with a histogram or normal probability plot of the residuals. This condition is less critical for large samples due to the Central Limit Theorem.
  • E — Equal variance (homoscedasticity): The spread of the residuals is roughly the same across all values of x. Check the residual plot — the vertical spread of the dots should not fan out or narrow as x increases.

If any of these conditions is seriously violated, the p-values and confidence intervals from the regression may not be trustworthy.

The Regression Standard Error

Before we can test the slope, we need a measure of how much the data points scatter around the regression line. The regression standard error (also called the standard error of the estimate or residual standard error) is:

s = √( Σ(yᵢ − ŷᵢ)² / (n − 2) )

This is essentially the “typical” size of a residual. We divide by n − 2 (not n) because we lose two degrees of freedom — one for estimating the slope and one for estimating the intercept.
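As a quick numeric sketch of this formula (the residual values below are made up purely for illustration):

```python
import numpy as np

# Hypothetical residuals from some fitted regression line
residuals = np.array([0.5, -0.3, 0.2, -0.4])
n = residuals.size

# Divide by n - 2, not n: two degrees of freedom are spent
# estimating the slope and the intercept
s = np.sqrt(np.sum(residuals**2) / (n - 2))
```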

The Standard Error of the Slope

The standard error of the slope measures how much the sample slope b would vary from sample to sample:

SE_b = s / √( Σ(xᵢ − x̄)² )

A smaller SE_b means the slope estimate is more precise. Two things reduce SE_b:

  1. Smaller residuals (smaller s) — the data fits the line more tightly
  2. More spread in the x-values (larger Σ(xᵢ − x̄)²) — the data covers a wider range of x

The degrees of freedom for this estimate are df = n − 2.
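A small sketch of the second point (the numbers here are invented): holding s fixed, spreading the x-values further apart shrinks SE_b.

```python
import numpy as np

def se_slope(s, x):
    """Standard error of the slope: s / sqrt(sum of squared x-deviations)."""
    x = np.asarray(x, dtype=float)
    return s / np.sqrt(np.sum((x - x.mean())**2))

s = 1.0  # assume the same regression standard error for both designs
se_narrow = se_slope(s, [4, 5, 6])  # x-values bunched together
se_wide = se_slope(s, [1, 5, 9])    # same center, wider spread

# The wider design yields a smaller standard error for the slope
```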

T-Test for the Slope

The test statistic for testing H₀: β = 0 follows a t-distribution:

t = (b − 0) / SE_b = b / SE_b

This has the familiar form: (estimate minus null value) divided by standard error. A large |t| means the observed slope is many standard errors away from zero, which is strong evidence against H₀.

Example 1: Testing the Study Hours Slope

Using the study hours vs. test score data from the linear regression page (n = 6, b = 5.286), we need to compute the standard error of the slope.

Step 1: Gather the residuals (computed on the linear regression page).

   x    y      ŷ     Residual (y − ŷ)
   2   65   64.24         0.76
   3   70   69.53         0.47
   5   80   80.10        −0.10
   4   75   74.82         0.18
   6   85   85.39        −0.39
   1   58   58.95        −0.95

Step 2: Compute the sum of squared residuals.

Σ(yᵢ − ŷᵢ)² = 0.76² + 0.47² + 0.10² + 0.18² + 0.39² + 0.95²

= 0.578 + 0.221 + 0.010 + 0.032 + 0.152 + 0.903 = 1.896

Step 3: Compute the regression standard error (df = n − 2 = 4).

s = √(1.896 / (6 − 2)) = √(1.896 / 4) = √0.474 = 0.688

Step 4: Compute Σ(xᵢ − x̄)², where x̄ = 3.5.

Σ(xᵢ − x̄)² = (2 − 3.5)² + (3 − 3.5)² + (5 − 3.5)² + (4 − 3.5)² + (6 − 3.5)² + (1 − 3.5)²

= 2.25 + 0.25 + 2.25 + 0.25 + 6.25 + 6.25 = 17.5

Step 5: Compute the standard error of the slope.

SE_b = 0.688 / √17.5 = 0.688 / 4.183 = 0.164

Step 6: Compute the t-statistic.

t = 5.286 / 0.164 = 32.2

Step 7: Find the p-value. With df = 4 and t = 32.2, the p-value is astronomically small — essentially 0. (For reference, the critical value for a two-sided test at α = 0.05 with df = 4 is t* = 2.776. Our test statistic of 32.2 far exceeds this threshold.)

Conclusion: There is overwhelming evidence of a linear relationship between study hours and test scores. The extremely large t-statistic (and essentially zero p-value) means it is virtually impossible that a slope this steep would arise by chance if no linear relationship existed.
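The hand calculation above can be checked with a few lines of NumPy/SciPy. Carrying full precision (instead of the rounded residuals used in Step 2) gives s ≈ 0.690 and t ≈ 32.0, in close agreement with the rounded hand values:

```python
import numpy as np
from scipy import stats

x = np.array([2, 3, 5, 4, 6, 1], dtype=float)
y = np.array([65, 70, 80, 75, 85, 58], dtype=float)

sxx = np.sum((x - x.mean())**2)                    # = 17.5
b = np.sum((x - x.mean()) * (y - y.mean())) / sxx  # slope
a = y.mean() - b * x.mean()                        # intercept
resid = y - (a + b * x)                            # residuals

n = x.size
s = np.sqrt(np.sum(resid**2) / (n - 2))          # regression standard error
se_b = s / np.sqrt(sxx)                          # standard error of the slope
t_stat = b / se_b                                # t-statistic
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)  # two-sided p-value
```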

Confidence Interval for the Slope

A confidence interval gives a range of plausible values for the true population slope β:

b ± t* · SE_b

where t* is the critical value from the t-distribution with df = n − 2 for the desired confidence level.

Example 2: 95% Confidence Interval for the Slope

Using the values from Example 1: b = 5.286, SE_b = 0.164, df = 4.

For a 95% confidence interval with df = 4, the critical value is t* = 2.776.

5.286 ± 2.776(0.164) = 5.286 ± 0.455

(4.83, 5.74)

Interpretation: We are 95% confident that for each additional hour of study, the true average test score increase is between 4.83 and 5.74 points.

Notice that this interval does not contain 0. This is consistent with our hypothesis test result — since we rejected H₀: β = 0, the confidence interval should not include 0. In general:

  • If the confidence interval for β does not contain 0 → you would reject H₀ at the corresponding significance level
  • If the confidence interval for β does contain 0 → you would fail to reject H₀
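The same interval can be reproduced in code (a sketch using the Example 1 numbers, with scipy supplying the critical value):

```python
from scipy import stats

b, se_b, df = 5.286, 0.164, 4
t_crit = stats.t.ppf(0.975, df)   # ≈ 2.776 for a 95% interval
lower = b - t_crit * se_b
upper = b + t_crit * se_b

# The interval excludes 0, consistent with rejecting H0: beta = 0
interval_excludes_zero = not (lower <= 0 <= upper)
```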

Reading Computer Output

In practice, you will rarely compute regression inference by hand. Software (Excel, SPSS, R, Python, TI calculators) provides a regression output table. Here is how to read a typical one:

             Coefficient   Std Error   t-Statistic   P-value
Intercept       53.667       0.640        83.8       < 0.00001
Hours (x)        5.286       0.164        32.2       0.000004

Additional output:

Statistic            Value
R²                   0.996
Adjusted R²          0.995
Standard Error (s)   0.688
n                    6

How to read this table:

  • Coefficient column: The intercept (a = 53.667) and slope (b = 5.286) of the regression line
  • Std Error column: The standard error of each coefficient. For the slope, SE_b = 0.164
  • t-Statistic column: The test statistic for testing whether each coefficient equals zero. For the slope, t = 32.2
  • P-value column: The p-value for the two-sided test H₀: coefficient = 0. For the slope, p = 0.000004
  • R²: The coefficient of determination — the proportion of variation in y explained by the model
  • Standard Error (s): The regression standard error — the typical size of a residual

The slope row is the most important line. If the p-value for the slope is small (typically below 0.05), you conclude that there is a statistically significant linear relationship.
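Most of the table above can be reproduced with scipy.stats.linregress (a sketch on the study-hours data; full-precision values differ slightly from the hand-rounded table, e.g. the slope's standard error comes out ≈ 0.165 rather than 0.164):

```python
from scipy.stats import linregress

x = [2, 3, 5, 4, 6, 1]
y = [65, 70, 80, 75, 85, 58]

res = linregress(x, y)
print(f"Intercept: {res.intercept:.3f}  (SE {res.intercept_stderr:.3f})")
print(f"Slope:     {res.slope:.3f}  (SE {res.stderr:.3f})")
print(f"R^2:       {res.rvalue**2:.3f}")
print(f"p-value:   {res.pvalue:.2e}")  # two-sided test of slope = 0
```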

Connecting Everything: The Big Picture

Here is how correlation, regression, and inference fit together:

  1. Scatter plot — visualize the relationship
  2. Correlation (r) — measure the strength and direction of the linear association
  3. Regression line (ŷ = a + bx) — describe the relationship with an equation and make predictions
  4. R² — quantify how well the line fits the data
  5. Inference (t-test and CI for the slope) — determine whether the relationship is real or could be due to chance

Each step builds on the previous one. You should never skip straight to inference without first examining the scatter plot and checking the LINE conditions.

Real-World Application: Nursing — Does BMI Predict Blood Pressure?

A nurse researcher wants to know whether body mass index (BMI) is linearly associated with systolic blood pressure. She collects data from n = 30 patients and runs a regression. The software output shows:

            Coefficient   Std Error   t-Statistic   P-value
Intercept      82.4          8.7         9.47       0.0000
BMI (x)         1.85         0.32        5.78       0.0000

Additional: R² = 0.544, s = 11.2, df = 28.

Interpreting the output:

  • Slope: For each one-unit increase in BMI, systolic blood pressure is predicted to increase by 1.85 mmHg.
  • P-value for slope: The p-value is essentially 0, providing very strong evidence that BMI and systolic BP are linearly related.
  • R² = 0.544: About 54.4% of the variation in systolic blood pressure is explained by BMI. The remaining 45.6% is due to other factors (age, genetics, medication, diet, stress, etc.).
  • 95% CI for slope: 1.85 ± 2.048(0.32) = 1.85 ± 0.66 = (1.19, 2.51). We are 95% confident the true increase in systolic BP per unit of BMI is between 1.19 and 2.51 mmHg.

Clinical implications: BMI is a statistically significant predictor of systolic blood pressure, but it explains only about half the variation. Nurses should consider BMI as one factor among many when assessing cardiovascular risk. A patient with a high BMI but normal blood pressure may have other protective factors, and a patient with a normal BMI but elevated blood pressure needs further evaluation.

Practice Problems

Test your understanding with these problems. A worked solution follows each one.

Problem 1: A regression analysis produces b = 2.4, SE_b = 0.8, and n = 22. Test whether the slope is significantly different from zero at α = 0.05.

Step 1: State the hypotheses: H₀: β = 0, Hₐ: β ≠ 0

Step 2: Compute the test statistic.

t = b / SE_b = 2.4 / 0.8 = 3.0

Step 3: Degrees of freedom: df = 22 − 2 = 20

Step 4: Find the p-value. For t = 3.0 with df = 20 (two-sided): p ≈ 0.007

Step 5: Since 0.007 < 0.05, reject H₀.

Answer: There is statistically significant evidence of a linear relationship. The slope of 2.4 is significantly different from zero at the 5% significance level.
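The p-value in Step 4 can be confirmed with scipy (a sketch):

```python
from scipy import stats

t_stat = 2.4 / 0.8               # = 3.0
df = 22 - 2                      # = 20
p = 2 * stats.t.sf(t_stat, df)   # two-sided p-value, approximately 0.007
```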

Problem 2: Using the values from Problem 1, construct a 95% confidence interval for the slope. The critical value is t* = 2.086 for df = 20.

b ± t* · SE_b = 2.4 ± 2.086(0.8) = 2.4 ± 1.669

(0.73, 4.07)

Answer: We are 95% confident that the true slope is between 0.73 and 4.07. Since the interval does not contain 0, this is consistent with rejecting H₀: β = 0.
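The interval in Problem 2 can be reproduced in a few lines of scipy:

```python
from scipy import stats

b, se_b, df = 2.4, 0.8, 20
t_crit = stats.t.ppf(0.975, df)   # ≈ 2.086
lower, upper = b - t_crit * se_b, b + t_crit * se_b
```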

Problem 3: A residual plot shows residuals that fan out — small spread on the left and large spread on the right. Which LINE condition is violated, and what does this mean?

The E (Equal variance) condition is violated. This pattern is called heteroscedasticity. It means the variability of y is not constant across all values of x — the predictions are more precise for some x-values than for others.

Consequences: The standard errors, p-values, and confidence intervals from the regression may be unreliable. Possible remedies include transforming the response variable (e.g., using log(y) instead of y) or using weighted least squares regression.

Problem 4: A regression of advertising spending (x) vs. revenue (y) for n = 12 stores gives: b = 4.5, SE_b = 1.2, R² = 0.58. (a) Test the slope at α = 0.05. (b) Interpret R².

(a) t = 4.5 / 1.2 = 3.75, df = 10

For t = 3.75 with df = 10 (two-sided): p ≈ 0.004

Since 0.004 < 0.05, reject H₀. There is significant evidence that advertising spending is linearly related to revenue.

(b) R² = 0.58 means that 58% of the variation in revenue is explained by the linear relationship with advertising spending. The remaining 42% is due to other factors such as location, competition, season, and store size.
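A quick scipy check of part (a):

```python
from scipy import stats

t_stat = 4.5 / 1.2               # = 3.75
p = 2 * stats.t.sf(t_stat, 10)   # two-sided p-value, df = n - 2 = 10
```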

Problem 5: A 95% confidence interval for a regression slope is (−0.8, 1.4). What would the conclusion of the corresponding hypothesis test be at α = 0.05?

Since the confidence interval contains 0, we would fail to reject H₀: β = 0 at the α = 0.05 significance level.

Interpretation: There is not sufficient evidence of a linear relationship between x and y. The true slope could plausibly be zero (no relationship), negative, or positive — the data does not allow us to distinguish.

Key connection: A confidence interval that contains 0 always corresponds to a hypothesis test that fails to reject H₀: β = 0 at the corresponding significance level.

Key Takeaways

  • Inference for regression determines whether the observed linear relationship in a sample reflects a real relationship in the population or could be due to chance
  • The null hypothesis H₀: β = 0 states that the true population slope is zero (no linear relationship)
  • Before performing inference, verify the LINE conditions: Linearity, Independence, Normal residuals, and Equal variance
  • The standard error of the slope, SE_b = s / √( Σ(xᵢ − x̄)² ), measures how much the sample slope would vary across different samples
  • The t-test for the slope uses t = b / SE_b with df = n − 2 — a large |t| and a small p-value provide evidence against H₀
  • A confidence interval for the slope (b ± t* · SE_b) gives a range of plausible values for the true population slope
  • If the confidence interval for β does not contain 0, you reject H₀; if it contains 0, you fail to reject
  • In computer output, the slope row is the most important — look at the coefficient, standard error, t-statistic, and p-value
  • In clinical and applied settings, always consider both statistical significance (is the slope significantly different from zero?) and practical significance (is the effect large enough to matter?)

Return to Statistics for more topics in this section.
