
Linear Regression

Last updated: March 2026 · Advanced

Linear regression finds the straight line that best fits a set of data points. While correlation tells you how strong the linear relationship is, regression gives you the actual equation of the line, which lets you describe the relationship mathematically and make predictions. If you know a student studied 4.5 hours, regression tells you the predicted test score. This is one of the most widely used tools in all of statistics, from predicting sales revenue to estimating patient recovery times.

The Least-Squares Regression Line

The least-squares regression line (also called the line of best fit) is the line that minimizes the sum of the squared vertical distances from each data point to the line. In other words, it is the line where the total squared prediction error is as small as possible.

The equation of the regression line is:

\hat{y} = a + bx

Where:

  • \hat{y} ("y-hat") is the predicted value of y for a given x
  • b is the slope: the predicted change in y for each one-unit increase in x
  • a is the y-intercept: the predicted value of y when x = 0

Formulas for Slope and Intercept

The slope b is calculated from the data:

b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}

The intercept a is calculated using the slope and the means of x and y:

a = \bar{y} - b\bar{x}

Notice that the numerator of the slope formula is the same as the numerator of the correlation coefficient formula. The difference is in the denominator: the slope uses only the x-part of the denominator from the r formula.
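These two formulas translate directly into code. A minimal Python sketch, working from the summary sums (the function name and signature are illustrative, not part of any standard library):

```python
def regression_from_sums(n, sum_x, sum_y, sum_xy, sum_x2):
    """Least-squares slope and intercept from summary sums."""
    # Slope: b = (n*Sum(xy) - Sum(x)*Sum(y)) / (n*Sum(x^2) - Sum(x)^2)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # Intercept: a = ybar - b * xbar
    a = sum_y / n - b * (sum_x / n)
    return a, b

# Example with n = 5, Sum(x) = 15, Sum(y) = 50, Sum(xy) = 170, Sum(x^2) = 55
a, b = regression_from_sums(5, 15, 50, 170, 55)
print(f"y-hat = {a:.1f} + {b:.1f}x")  # y-hat = 4.0 + 2.0x
```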

Calculating the Regression Line

Example 1: Study Hours vs Test Score

Using the same dataset from the correlation page, six students who reported their study hours and test scores:

Hours (x) | Score (y)
--- | ---
2 | 65
3 | 70
5 | 80
4 | 75
6 | 85
1 | 58

From the correlation calculation, we already know: n = 6, \sum x = 21, \sum y = 433, \sum xy = 1608, \sum x^2 = 91.

Step 1: Calculate the slope.

b = \frac{6(1608) - 21(433)}{6(91) - 21^2} = \frac{9648 - 9093}{546 - 441} = \frac{555}{105} \approx 5.286

Step 2: Calculate the means.

\bar{x} = \frac{21}{6} = 3.5 \qquad \bar{y} = \frac{433}{6} \approx 72.167

Step 3: Calculate the intercept.

a = 72.167 - 5.286(3.5) = 72.167 - 18.501 = 53.666

Step 4: Write the regression equation.

\hat{y} = 53.67 + 5.29x

This equation describes the best-fitting line through the six data points.

Interpreting Slope and Intercept

Always interpret the slope and intercept in context, using the actual variable names and units from the problem.

Slope Interpretation

The slope b = 5.29 means: for each additional hour of study, the predicted test score increases by 5.29 points.

Notice the careful wording: "predicted" score, not "actual" score. The regression line gives predictions; individual students may score higher or lower than the prediction.

Intercept Interpretation

The intercept a = 53.67 means: a student who studies 0 hours is predicted to score 53.67 on the test.

However, be cautious about interpreting the intercept if x = 0 falls outside the range of your data. In this dataset, the lowest study time is 1 hour. The intercept is a mathematical necessity for defining the line, but it may not represent a meaningful real-world scenario.

Making Predictions

Once you have the regression equation, you can predict y for any value of x by substituting into the equation.

Example 2: Predict the Score for 4.5 Hours of Study

\hat{y} = 53.67 + 5.29(4.5) = 53.67 + 23.805 = 77.475 \approx 77.48

A student who studies 4.5 hours is predicted to score approximately 77.5 on the test.

Interpolation vs Extrapolation

  • Interpolation: Predicting within the range of the data (here, x between 1 and 6). These predictions are generally reliable.
  • Extrapolation: Predicting outside the range of the data (e.g., predicting the score for x = 15 hours). Extrapolation is risky because the linear pattern may not continue beyond the observed data. A student studying 15 hours might experience diminishing returns from fatigue, and the linear model cannot capture that.

Rule of thumb: Only use the regression line for predictions within (or very close to) the range of x-values in your dataset.
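That rule of thumb can be folded into a small prediction helper. A minimal sketch, assuming the fitted coefficients from the study-hours example (the function name and its warning message are illustrative):

```python
def predict_score(x, a=53.67, b=5.29, x_min=1, x_max=6):
    """Return the predicted score y-hat = a + bx, flagging extrapolation."""
    if not (x_min <= x <= x_max):
        print(f"warning: x = {x} lies outside the observed range "
              f"[{x_min}, {x_max}]; this is extrapolation")
    return a + b * x

print(predict_score(4.5))  # interpolation, within the data range: 77.475
predict_score(15)          # triggers the extrapolation warning
```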

Residuals

A residual is the difference between what actually happened and what the regression line predicted:

\text{Residual} = y - \hat{y} = \text{observed} - \text{predicted}

  • A positive residual means the actual value was higher than predicted (the point lies above the line)
  • A negative residual means the actual value was lower than predicted (the point lies below the line)
  • A residual of zero means the prediction was perfect (the point lies exactly on the line)

Example 3: Residual Table

Using \hat{y} = 53.67 + 5.29x, compute the predicted score and residual for each student:

x | y (observed) | \hat{y} (predicted) | Residual (y - \hat{y})
--- | --- | --- | ---
2 | 65 | 53.67 + 10.58 = 64.25 | 65 - 64.25 = 0.75
3 | 70 | 53.67 + 15.87 = 69.54 | 70 - 69.54 = 0.46
5 | 80 | 53.67 + 26.45 = 80.12 | 80 - 80.12 = -0.12
4 | 75 | 53.67 + 21.16 = 74.83 | 75 - 74.83 = 0.17
6 | 85 | 53.67 + 31.74 = 85.41 | 85 - 85.41 = -0.41
1 | 58 | 53.67 + 5.29 = 58.96 | 58 - 58.96 = -0.96

Notice that the residuals are a mix of positive and negative values, and they are all quite small, which makes sense given the very strong correlation (r = 0.998) for this dataset.

Key property: The sum of all residuals for a least-squares regression line always equals zero (or very close to zero due to rounding): 0.75 + 0.46 + (-0.12) + 0.17 + (-0.41) + (-0.96) \approx -0.11 (the small deviation is from rounding the slope and intercept).
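The residual table and the near-zero sum can be reproduced in a few lines of Python, using the rounded coefficients from the example:

```python
a, b = 53.67, 5.29  # rounded coefficients from Example 1
hours = [2, 3, 5, 4, 6, 1]
scores = [65, 70, 80, 75, 85, 58]

# Residual = observed - predicted, for each student
residuals = [y - (a + b * x) for x, y in zip(hours, scores)]
for x, y, r in zip(hours, scores, residuals):
    print(f"x = {x}: observed {y}, predicted {a + b * x:.2f}, residual {r:+.2f}")

# Close to zero; the leftover -0.11 comes from rounding a and b
print(f"sum of residuals: {sum(residuals):.2f}")  # -0.11
```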

Visualizing the Regression Line

The scatter plot below shows the six data points with the least-squares regression line drawn through them. Each teal dot is an observed data point, and the blue line is \hat{y} = 53.67 + 5.29x.

Study Hours vs Test Score with Regression Line

[Scatter plot: Study Hours (0 to 7) on the horizontal axis, Test Score (50 to 90) on the vertical axis, with the line y = 53.67 + 5.29x drawn through the points]

The teal dots represent observed scores, and the blue line is the least-squares fit. Notice how closely the points hug the line, reflecting the very high correlation (r = 0.998).

R-Squared: How Good Is the Fit?

The coefficient of determination R^2 tells you how well the regression line fits the data. It represents the proportion of the total variation in y that is explained by the linear relationship with x.

R^2 = r^2

For our study hours example:

R^2 = 0.998^2 = 0.996

This means 99.6% of the variation in test scores is explained by the linear relationship with study hours. Only 0.4% of the variation remains unexplained.

Interpreting R-squared Values

R^2 | Interpretation
--- | ---
0.90 and above | Excellent fit: the line captures nearly all variation
0.70 to 0.90 | Good fit: most variation is explained
0.40 to 0.70 | Moderate fit: some variation explained
Below 0.40 | Poor fit: most variation is unexplained

A low R^2 does not necessarily mean the model is useless; it means other variables (not included in the model) also affect y. A high R^2 does not prove causation; it only confirms that the linear equation describes the pattern well.

Residual Plots: Checking Model Assumptions

A residual plot graphs the residuals (vertical axis) against the x-values or predicted values (horizontal axis). It is the most important diagnostic tool for evaluating whether a linear model is appropriate.

What to look for:

  • Good pattern (linear model is appropriate): The residuals scatter randomly above and below zero with no obvious pattern, and the spread stays roughly constant across all x-values.
  • Bad pattern (curved trend): If the residuals show a U-shape or inverted-U shape, the true relationship is probably curved, not linear. A linear model is not appropriate.
  • Bad pattern (fan shape): If the residuals spread out (or narrow) as x increases, the equal-variance assumption is violated. This is called heteroscedasticity.

Key principle: Even if R^2 is high, always check the residual plot. A high R^2 does not guarantee that a linear model is the right choice; it is possible to get a deceptively high R^2 when the true relationship is curved.

Real-World Application (Nursing): Predicting Patient Recovery Time from Age

A hospital collects data on 8 patients who underwent the same knee replacement surgery, recording each patient's age and recovery time (days until discharge):

Age (x) | Recovery Days (y)
--- | ---
30 | 3
38 | 4
42 | 4
45 | 5
52 | 6
55 | 7
60 | 8
65 | 9

A nurse researcher runs a regression and obtains \hat{y} = -2.85 + 0.178x (with R^2 = 0.97).

Interpreting the slope: For each additional year of age, the predicted recovery time increases by 0.178 days. Equivalently, for every 5 to 6 additional years of age, recovery is predicted to increase by about 1 day.

Prediction: A 50-year-old patient would be predicted to need -2.85 + 0.178(50) = -2.85 + 8.90 = 6.05 days, or roughly 6 days for recovery.

Clinical use: This regression helps nurses plan staffing and discharge logistics. However, individual patients vary based on fitness, complications, and other factors that the model does not capture (the 3% unexplained variation, plus unmeasured variables).

Caution about extrapolation: Predicting recovery time for a 20-year-old (\hat{y} = -2.85 + 3.56 = 0.71 days) or a 90-year-old (\hat{y} = -2.85 + 16.02 = 13.17 days) may be unreliable because those ages fall outside the observed data range.

Practice Problems

Test your understanding with these problems. A worked solution follows each one.

Problem 1: Given n = 5, \sum x = 15, \sum y = 50, \sum xy = 170, \sum x^2 = 55. Find the equation of the least-squares regression line.

Step 1: Calculate the slope.

b = \frac{5(170) - 15(50)}{5(55) - 15^2} = \frac{850 - 750}{275 - 225} = \frac{100}{50} = 2.0

Step 2: Calculate the means.

\bar{x} = \frac{15}{5} = 3.0 \qquad \bar{y} = \frac{50}{5} = 10.0

Step 3: Calculate the intercept.

a = 10.0 - 2.0(3.0) = 10.0 - 6.0 = 4.0

Answer: \hat{y} = 4.0 + 2.0x

For each one-unit increase in x, y is predicted to increase by 2.0 units.

Problem 2: A regression equation is \hat{y} = 120 - 3.5x, where x is the number of absences and y is the final exam score. (a) Interpret the slope. (b) Predict the score for a student with 8 absences. (c) Would you trust a prediction for 40 absences?

(a) For each additional absence, the predicted final exam score decreases by 3.5 points.

(b) \hat{y} = 120 - 3.5(8) = 120 - 28 = 92. A student with 8 absences is predicted to score 92.

(c) No. Predicting for 40 absences gives \hat{y} = 120 - 3.5(40) = 120 - 140 = -20, which is a negative score and therefore impossible. This is extrapolation far beyond the likely range of the data, and the linear trend clearly breaks down. The model is only valid within the range of x-values in the original dataset.

Problem 3: A student scores 82 on an exam. The regression equation predicts \hat{y} = 78 for that student's study hours. Calculate and interpret the residual.

\text{Residual} = y - \hat{y} = 82 - 78 = 4

Interpretation: The student scored 4 points higher than the regression model predicted. The residual is positive, meaning the actual score was above the regression line.

Problem 4: A regression model has R^2 = 0.64. What does this tell you? If r is positive, what is the correlation coefficient?

R^2 = 0.64 means that 64% of the variation in the response variable is explained by the linear relationship with the explanatory variable. The remaining 36% is due to other factors.

If r is positive:

r = \sqrt{0.64} = 0.80

Answer: The correlation coefficient is r = 0.80, indicating a strong positive linear relationship.

Problem 5: A residual plot shows residuals that form a clear pattern: positive residuals on the left, negative in the middle, and positive on the right. What does this tell you about the linear model?

A U-shaped residual plot indicates that the linear model is not appropriate for this data. The true relationship between x and y is curved (likely quadratic or some other nonlinear form), not a straight line.

What to do: Consider fitting a nonlinear model (such as a quadratic regression), or transform one of the variables (such as taking the log or square root of x or y) to straighten the relationship before fitting a line.

Even if the R^2 for the linear model appears decent, the systematic pattern in the residuals tells you the model is missing an important feature of the data.

Key Takeaways

  • The least-squares regression line \hat{y} = a + bx minimizes the sum of squared residuals and provides the best linear fit to the data
  • The slope b describes the predicted change in y for each one-unit increase in x; always interpret it in context using the actual variable names
  • The intercept a is the predicted y when x = 0; it may or may not have a meaningful real-world interpretation
  • Residuals (y - \hat{y}) measure how far each observed value falls from the regression line; they always sum to approximately zero
  • Interpolation (predicting within the data range) is generally reliable; extrapolation (predicting outside the data range) is risky and should be avoided
  • R^2 tells you the proportion of variation in y explained by the linear model, but a high R^2 does not prove causation or guarantee the model is appropriate
  • Always examine a residual plot to check whether the linear model is a good fit; look for random scatter with constant spread
  • In healthcare and other applied fields, regression helps make predictions, but individual outcomes vary based on factors not captured by the model
