
Linear Regression

Last updated: March 2026 · Advanced

Linear regression finds the straight line that best fits a set of data points. While correlation tells you how strong the linear relationship is, regression gives you the actual equation of the line, which lets you describe the relationship mathematically and make predictions. If you know a student studied 4.5 hours, regression tells you the predicted test score. This is one of the most widely used tools in all of statistics, from predicting sales revenue to estimating patient recovery times.

The Least-Squares Regression Line

The least-squares regression line (also called the line of best fit) is the line that minimizes the sum of the squared vertical distances from each data point to the line. In other words, it is the line where the total squared prediction error is as small as possible.

The equation of the regression line is:

\hat{y} = a + bx

Where:

  • \hat{y} ("y-hat") is the predicted value of y for a given x
  • b is the slope: the predicted change in y for each one-unit increase in x
  • a is the y-intercept: the predicted value of y when x = 0

Formulas for Slope and Intercept

The slope b is calculated from the data:

b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}

The intercept a is calculated using the slope and the means of x and y:

a = \bar{y} - b\bar{x}

Notice that the numerator of the slope formula is the same as the numerator of the correlation coefficient formula. The difference is in the denominator: the slope uses only the x-part of the denominator from the r formula.
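These two formulas translate directly into code. A minimal Python sketch, working from the summary sums (the function name and signature are illustrative, not part of any standard library):

```python
def regression_from_sums(n, sum_x, sum_y, sum_xy, sum_x2):
    """Least-squares slope and intercept from summary sums."""
    # Slope: b = (n*Sum(xy) - Sum(x)*Sum(y)) / (n*Sum(x^2) - Sum(x)^2)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # Intercept: a = ybar - b * xbar
    a = sum_y / n - b * (sum_x / n)
    return a, b

# Example with n = 5, Sum(x) = 15, Sum(y) = 50, Sum(xy) = 170, Sum(x^2) = 55
a, b = regression_from_sums(5, 15, 50, 170, 55)
print(f"y-hat = {a:.1f} + {b:.1f}x")  # y-hat = 4.0 + 2.0x
```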

Calculating the Regression Line

Example 1: Study Hours vs Test Score

Using the same dataset from the correlation page, six students who reported their study hours and test scores:

Hours (x) | Score (y)
--- | ---
2 | 65
3 | 70
5 | 80
4 | 75
6 | 85
1 | 58

From the correlation calculation, we already know: n = 6, \sum x = 21, \sum y = 433, \sum xy = 1608, \sum x^2 = 91.

Step 1: Calculate the slope.

b = \frac{6(1608) - 21(433)}{6(91) - 21^2} = \frac{9648 - 9093}{546 - 441} = \frac{555}{105} \approx 5.286

Step 2: Calculate the means.

\bar{x} = \frac{21}{6} = 3.5 \qquad \bar{y} = \frac{433}{6} \approx 72.167

Step 3: Calculate the intercept.

a = 72.167 - 5.286(3.5) = 72.167 - 18.501 = 53.666

Step 4: Write the regression equation.

\hat{y} = 53.67 + 5.29x

This equation describes the best-fitting line through the six data points.

Interpreting Slope and Intercept

Always interpret the slope and intercept in context, using the actual variable names and units from the problem.

Slope Interpretation

The slope b = 5.29 means: for each additional hour of study, the predicted test score increases by 5.29 points.

Notice the careful wording: "predicted" score, not "actual" score. The regression line gives predictions; individual students may score higher or lower than the prediction.

Intercept Interpretation

The intercept a = 53.67 means: a student who studies 0 hours is predicted to score 53.67 on the test.

However, be cautious about interpreting the intercept if x = 0 falls outside the range of your data. In this dataset, the lowest study time is 1 hour. The intercept is a mathematical necessity for defining the line, but it may not represent a meaningful real-world scenario.

Making Predictions

Once you have the regression equation, you can predict y for any value of x by substituting into the equation.

Example 2: Predict the Score for 4.5 Hours of Study

\hat{y} = 53.67 + 5.29(4.5) = 53.67 + 23.805 = 77.475 \approx 77.48

A student who studies 4.5 hours is predicted to score approximately 77.5 on the test.

Interpolation vs Extrapolation

  • Interpolation: Predicting within the range of the data (here, x between 1 and 6). These predictions are generally reliable.
  • Extrapolation: Predicting outside the range of the data (e.g., predicting the score for x = 15 hours). Extrapolation is risky because the linear pattern may not continue beyond the observed data. A student studying 15 hours might experience diminishing returns from fatigue, and the linear model cannot capture that.

Rule of thumb: Only use the regression line for predictions within (or very close to) the range of x-values in your dataset.
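That rule of thumb can be folded into a small prediction helper. A minimal sketch, assuming the fitted coefficients from the study-hours example (the function name and its warning message are illustrative):

```python
def predict_score(x, a=53.67, b=5.29, x_min=1, x_max=6):
    """Return the predicted score y-hat = a + bx, flagging extrapolation."""
    if not (x_min <= x <= x_max):
        print(f"warning: x = {x} lies outside the observed range "
              f"[{x_min}, {x_max}]; this is extrapolation")
    return a + b * x

print(predict_score(4.5))  # interpolation, within the data range: 77.475
predict_score(15)          # triggers the extrapolation warning
```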

Residuals

A residual is the difference between what actually happened and what the regression line predicted:

\text{Residual} = y - \hat{y} = \text{observed} - \text{predicted}

  • A positive residual means the actual value was higher than predicted (the point lies above the line)
  • A negative residual means the actual value was lower than predicted (the point lies below the line)
  • A residual of zero means the prediction was perfect (the point lies exactly on the line)

Example 3: Residual Table

Using \hat{y} = 53.67 + 5.29x, compute the predicted score and residual for each student:

x | y (observed) | \hat{y} (predicted) | Residual (y - \hat{y})
--- | --- | --- | ---
2 | 65 | 53.67 + 10.58 = 64.25 | 65 - 64.25 = 0.75
3 | 70 | 53.67 + 15.87 = 69.54 | 70 - 69.54 = 0.46
5 | 80 | 53.67 + 26.45 = 80.12 | 80 - 80.12 = -0.12
4 | 75 | 53.67 + 21.16 = 74.83 | 75 - 74.83 = 0.17
6 | 85 | 53.67 + 31.74 = 85.41 | 85 - 85.41 = -0.41
1 | 58 | 53.67 + 5.29 = 58.96 | 58 - 58.96 = -0.96

Notice that the residuals are a mix of positive and negative values, and they are all quite small, which makes sense given the very strong correlation (r = 0.998) for this dataset.

Key property: The sum of all residuals for a least-squares regression line always equals zero (or very close to zero due to rounding): 0.75 + 0.46 + (-0.12) + 0.17 + (-0.41) + (-0.96) \approx -0.11 (the small deviation is from rounding the slope and intercept).
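The residual table and the near-zero sum can be reproduced in a few lines of Python, using the rounded coefficients from the example:

```python
a, b = 53.67, 5.29  # rounded coefficients from Example 1
hours = [2, 3, 5, 4, 6, 1]
scores = [65, 70, 80, 75, 85, 58]

# Residual = observed - predicted, for each student
residuals = [y - (a + b * x) for x, y in zip(hours, scores)]
for x, y, r in zip(hours, scores, residuals):
    print(f"x = {x}: observed {y}, predicted {a + b * x:.2f}, residual {r:+.2f}")

# Close to zero; the leftover -0.11 comes from rounding a and b
print(f"sum of residuals: {sum(residuals):.2f}")  # -0.11
```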

Visualizing the Regression Line

The scatter plot below shows the six data points with the least-squares regression line drawn through them. Each teal dot is an observed data point, and the blue line is \hat{y} = 53.67 + 5.29x.

Study Hours vs Test Score with Regression Line

[Scatter plot: Study Hours (0 to 7) on the horizontal axis, Test Score (50 to 90) on the vertical axis, with the line y = 53.67 + 5.29x drawn through the points]

The teal dots represent observed scores, and the blue line is the least-squares fit. Notice how closely the points hug the line, reflecting the very high correlation (r = 0.998).

R-Squared: How Good Is the Fit?

The coefficient of determination R^2 tells you how well the regression line fits the data. It represents the proportion of the total variation in y that is explained by the linear relationship with x.

R^2 = r^2

For our study hours example:

R^2 = 0.998^2 = 0.996

This means 99.6% of the variation in test scores is explained by the linear relationship with study hours. Only 0.4% of the variation remains unexplained.

Interpreting R-squared Values

R^2 | Interpretation
--- | ---
0.90 and above | Excellent fit: the line captures nearly all variation
0.70 to 0.90 | Good fit: most variation is explained
0.40 to 0.70 | Moderate fit: some variation explained
Below 0.40 | Poor fit: most variation is unexplained

A low R^2 does not necessarily mean the model is useless; it means other variables (not included in the model) also affect y. A high R^2 does not prove causation; it only confirms that the linear equation describes the pattern well.

Residual Plots: Checking Model Assumptions

A residual plot graphs the residuals (vertical axis) against the x-values or predicted values (horizontal axis). It is the most important diagnostic tool for evaluating whether a linear model is appropriate.

What to look for:

  • Good pattern (linear model is appropriate): The residuals scatter randomly above and below zero with no obvious pattern, and the spread stays roughly constant across all x-values.
  • Bad pattern (curved trend): If the residuals show a U-shape or inverted-U shape, the true relationship is probably curved, not linear. A linear model is not appropriate.
  • Bad pattern (fan shape): If the residuals spread out (or narrow) as x increases, the equal-variance assumption is violated. This is called heteroscedasticity.

Key principle: Even if R^2 is high, always check the residual plot. A high R^2 does not guarantee that a linear model is the right choice; it is possible to get a deceptively high R^2 when the true relationship is curved.

Real-World Application (Nursing): Predicting Patient Recovery Time from Age

A hospital collects data on 8 patients who underwent the same knee replacement surgery, recording each patient's age and recovery time (days until discharge):

Age (x) | Recovery Days (y)
--- | ---
30 | 3
38 | 4
42 | 4
45 | 5
52 | 6
55 | 7
60 | 8
65 | 9

A nurse researcher runs a regression and obtains \hat{y} = -2.85 + 0.178x (with R^2 = 0.97).

Interpreting the slope: For each additional year of age, the predicted recovery time increases by 0.178 days. Equivalently, for every 5 to 6 additional years of age, recovery is predicted to increase by about 1 day.

Prediction: A 50-year-old patient would be predicted to need -2.85 + 0.178(50) = -2.85 + 8.90 = 6.05 days, or roughly 6 days for recovery.

Clinical use: This regression helps nurses plan staffing and discharge logistics. However, individual patients vary based on fitness, complications, and other factors that the model does not capture (the 3% unexplained variation, plus unmeasured variables).

Caution about extrapolation: Predicting recovery time for a 20-year-old (\hat{y} = -2.85 + 3.56 = 0.71 days) or a 90-year-old (\hat{y} = -2.85 + 16.02 = 13.17 days) may be unreliable because those ages fall outside the observed data range.

Practice Problems

Test your understanding with these problems. A worked solution follows each one.

Problem 1: Given n = 5, \sum x = 15, \sum y = 50, \sum xy = 170, \sum x^2 = 55. Find the equation of the least-squares regression line.

Step 1: Calculate the slope.

b = \frac{5(170) - 15(50)}{5(55) - 15^2} = \frac{850 - 750}{275 - 225} = \frac{100}{50} = 2.0

Step 2: Calculate the means.

\bar{x} = \frac{15}{5} = 3.0 \qquad \bar{y} = \frac{50}{5} = 10.0

Step 3: Calculate the intercept.

a = 10.0 - 2.0(3.0) = 10.0 - 6.0 = 4.0

Answer: \hat{y} = 4.0 + 2.0x

For each one-unit increase in x, y is predicted to increase by 2.0 units.

Problem 2: A regression equation is \hat{y} = 120 - 3.5x, where x is the number of absences and y is the final exam score. (a) Interpret the slope. (b) Predict the score for a student with 8 absences. (c) Would you trust a prediction for 40 absences?

(a) For each additional absence, the predicted final exam score decreases by 3.5 points.

(b) \hat{y} = 120 - 3.5(8) = 120 - 28 = 92. A student with 8 absences is predicted to score 92.

(c) No. Predicting for 40 absences gives \hat{y} = 120 - 3.5(40) = 120 - 140 = -20, which is a negative score and therefore impossible. This is extrapolation far beyond the likely range of the data, and the linear trend clearly breaks down. The model is only valid within the range of x-values in the original dataset.

Problem 3: A student scores 82 on an exam. The regression equation predicts \hat{y} = 78 for that student's study hours. Calculate and interpret the residual.

\text{Residual} = y - \hat{y} = 82 - 78 = 4

Interpretation: The student scored 4 points higher than the regression model predicted. The residual is positive, meaning the actual score was above the regression line.

Problem 4: A regression model has R^2 = 0.64. What does this tell you? If r is positive, what is the correlation coefficient?

R^2 = 0.64 means that 64% of the variation in the response variable is explained by the linear relationship with the explanatory variable. The remaining 36% is due to other factors.

If r is positive:

r = \sqrt{0.64} = 0.80

Answer: The correlation coefficient is r = 0.80, indicating a strong positive linear relationship.

Problem 5: A residual plot shows residuals that form a clear pattern: positive residuals on the left, negative in the middle, and positive on the right. What does this tell you about the linear model?

A U-shaped residual plot indicates that the linear model is not appropriate for this data. The true relationship between x and y is curved (likely quadratic or some other nonlinear form), not a straight line.

What to do: Consider fitting a nonlinear model (such as a quadratic regression), or transform one of the variables (such as taking the log or square root of x or y) to straighten the relationship before fitting a line.

Even if the R^2 for the linear model appears decent, the systematic pattern in the residuals tells you the model is missing an important feature of the data.

Key Takeaways

  • The least-squares regression line \hat{y} = a + bx minimizes the sum of squared residuals and provides the best linear fit to the data
  • The slope b describes the predicted change in y for each one-unit increase in x; always interpret it in context using the actual variable names
  • The intercept a is the predicted y when x = 0; it may or may not have a meaningful real-world interpretation
  • Residuals (y - \hat{y}) measure how far each observed value falls from the regression line; they always sum to approximately zero
  • Interpolation (predicting within the data range) is generally reliable; extrapolation (predicting outside the data range) is risky and should be avoided
  • R^2 tells you the proportion of variation in y explained by the linear model, but a high R^2 does not prove causation or guarantee the model is appropriate
  • Always examine a residual plot to check whether the linear model is a good fit; look for random scatter with constant spread
  • In healthcare and other applied fields, regression helps make predictions, but individual outcomes vary based on factors not captured by the model
