Correlation

Last updated: March 2026 · Intermediate

Correlation measures the strength and direction of the linear relationship between two quantitative variables. When you look at a scatter plot and see points trending upward or downward, correlation puts a number on that pattern. The result is a single value called the correlation coefficient, denoted $r$, that tells you exactly how tightly the data follows a straight line, and in which direction.

The Correlation Coefficient (r)

The Pearson correlation coefficient $r$ is the most common measure of linear association. It ranges from $-1$ to $+1$:

  • $r = +1$: perfect positive linear relationship; every point falls exactly on a line that slopes upward
  • $r = -1$: perfect negative linear relationship; every point falls exactly on a line that slopes downward
  • $r = 0$: no linear relationship; the data shows no straight-line pattern (though there may be a curved relationship)

Values between these extremes indicate varying degrees of linear association. The closer $|r|$ is to 1, the stronger the linear relationship.

Interpreting the Strength of r

| $\vert r\vert$ Range | Interpretation |
|---|---|
| 0.0 to 0.3 | Weak linear relationship |
| 0.3 to 0.7 | Moderate linear relationship |
| 0.7 to 1.0 | Strong linear relationship |

These ranges are general guidelines, not strict cutoffs. In some fields (like physics), $r = 0.7$ might be considered weak, while in social sciences, $r = 0.5$ could be considered strong.

The Formula for r

The Pearson correlation coefficient is calculated using:

$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$$

Where $n$ is the number of data pairs, $\sum xy$ is the sum of the products of each pair, $\sum x$ and $\sum y$ are the sums of the individual variables, and $\sum x^2$ and $\sum y^2$ are the sums of the squared values.

This formula looks intimidating, but it becomes manageable when you organize your work into a table and compute each sum systematically.
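The same bookkeeping can be done in a few lines of code. This is a minimal sketch of the sum-based formula above (the function name `pearson_r` is our own choice, not from the text):

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient via the computational (sum-based) formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))       # sum of products
    sum_x2 = sum(x * x for x in xs)                    # sum of squared x
    sum_y2 = sum(y * y for y in ys)                    # sum of squared y
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Example 1 data: study hours vs test scores
hours = [2, 3, 5, 4, 6, 1]
scores = [65, 70, 80, 75, 85, 58]
print(round(pearson_r(hours, scores), 3))  # 0.998
```

Each intermediate sum in the function corresponds to one column of the worked table below.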

Calculating r Step by Step

Example 1: Study Hours vs Test Score

Six students reported their weekly study hours and their test scores:

| Student | Hours ($x$) | Score ($y$) |
|---|---|---|
| A | 2 | 65 |
| B | 3 | 70 |
| C | 5 | 80 |
| D | 4 | 75 |
| E | 6 | 85 |
| F | 1 | 58 |

Step 1: Compute the products and squares for each pair.

| Student | $x$ | $y$ | $xy$ | $x^2$ | $y^2$ |
|---|---|---|---|---|---|
| A | 2 | 65 | 130 | 4 | 4225 |
| B | 3 | 70 | 210 | 9 | 4900 |
| C | 5 | 80 | 400 | 25 | 6400 |
| D | 4 | 75 | 300 | 16 | 5625 |
| E | 6 | 85 | 510 | 36 | 7225 |
| F | 1 | 58 | 58 | 1 | 3364 |
| Sums | 21 | 433 | 1608 | 91 | 31739 |

Step 2: Verify the sums.

  • $\sum x = 2 + 3 + 5 + 4 + 6 + 1 = 21$
  • $\sum y = 65 + 70 + 80 + 75 + 85 + 58 = 433$
  • $\sum xy = 130 + 210 + 400 + 300 + 510 + 58 = 1608$
  • $\sum x^2 = 4 + 9 + 25 + 16 + 36 + 1 = 91$
  • $\sum y^2 = 4225 + 4900 + 6400 + 5625 + 7225 + 3364 = 31739$

Step 3: Plug into the formula with $n = 6$.

Numerator:

$$n\sum xy - \sum x \sum y = 6(1608) - 21(433) = 9648 - 9093 = 555$$

Denominator:

$$\sqrt{\left[6(91) - 21^2\right]\left[6(31739) - 433^2\right]}$$

$$= \sqrt{[546 - 441][190434 - 187489]}$$

$$= \sqrt{105 \times 2945}$$

$$= \sqrt{309225} \approx 556.08$$

Result:

$$r = \frac{555}{556.08} \approx 0.998$$

This is a very strong positive correlation. As study hours increase, test scores increase in a nearly perfect linear pattern.

The Coefficient of Determination (R-squared)

The coefficient of determination, written $R^2$, is simply the square of the correlation coefficient:

$$R^2 = r^2$$

$R^2$ tells you the proportion of variation in $y$ that is explained by the linear relationship with $x$. It answers the question: "How much of the ups and downs in $y$ can we account for by knowing $x$?"

Example 1 Continued

$$R^2 = 0.998^2 \approx 0.996$$

This means that 99.6% of the variation in test scores is explained by the linear relationship with study hours. Only 0.4% of the variation is due to other factors. This is an exceptionally high $R^2$; in real-world data, values this high are uncommon.

Interpreting R-squared

| $R^2$ Value | Interpretation |
|---|---|
| 0.90 and above | Excellent: nearly all variation explained |
| 0.70 to 0.90 | Strong: most variation explained |
| 0.40 to 0.70 | Moderate: a good portion explained |
| Below 0.40 | Weak: most variation is unexplained |

Properties and Limitations of r

Understanding what $r$ can and cannot tell you is just as important as knowing how to calculate it.

What r Does Well

  • Has no units. The value of $r$ is the same whether you measure height in inches or centimeters, or weight in pounds or kilograms. This makes it a standardized measure.
  • Does not change when you swap $x$ and $y$. The correlation between study hours and test scores is the same as the correlation between test scores and study hours.
  • Does not change with linear transformations. If you convert temperatures from Fahrenheit to Celsius, $r$ stays the same.

Limitations of r

  • Sensitive to outliers. A single extreme point can dramatically change $r$. One outlier can inflate a weak correlation into a strong one, or destroy a strong correlation.
  • Only detects linear patterns. If the true relationship is curved, quadratic, or exponential, $r$ will underestimate the strength of the association.
  • Does not imply causation. A high $r$ between two variables does not mean one causes the other (more on this below).
  • Requires quantitative variables. You cannot compute $r$ for categorical data like gender or favorite color.

Example 2: When r Misses the Pattern

Consider data points that form a perfect U-shape: as $x$ goes from 1 to 5, $y$ decreases, and as $x$ goes from 5 to 10, $y$ increases symmetrically. Despite this strong, clear pattern, $r$ would be approximately 0 because the relationship is not linear. Always look at the scatter plot before relying on $r$ alone.

Correlation Does Not Imply Causation

This is arguably the most important concept in statistics. Whenever you find a strong correlation, resist the urge to conclude that one variable causes the other. There are three possible explanations for any observed correlation:

1. Direct Causation

Variable $x$ actually causes changes in $y$. Example: increasing the dosage of a medication (within therapeutic range) causes blood pressure to decrease. This can only be confirmed through a controlled experiment, not observational data alone.

2. Reverse Causation

Variable $y$ actually causes changes in $x$, not the other way around. Example: you observe a positive correlation between the number of fire trucks at a scene and the amount of damage. It would be wrong to conclude that sending more trucks causes more damage; the reality is that larger fires (more damage) require more trucks.

3. Lurking (Confounding) Variable

A third variable that you did not measure causes both $x$ and $y$ to move together. This is the most common explanation for misleading correlations.

Classic examples:

  • Ice cream sales and drowning deaths are positively correlated. The lurking variable is summer heat: hot weather drives both ice cream purchases and swimming activity.
  • Shoe size and reading ability are positively correlated in children. The lurking variable is age: older children have both bigger feet and better reading skills.
  • Per capita cheese consumption and deaths by bedsheet tangling show a strikingly high correlation over time. This is pure coincidence: with enough variables and enough years of data, some pairs will correlate by chance.

The lesson: always ask "What else could explain this relationship?" before drawing causal conclusions. To establish causation, you need a randomized controlled experiment where one variable is deliberately manipulated while other factors are held constant.

Real-World Application: Nursing β€” Correlating Patient Variables

In clinical settings, nurses and researchers frequently use correlation to explore relationships between patient variables. Some examples:

Patient age and recovery time. A nurse researcher might find a moderate positive correlation ($r \approx 0.6$) between patient age and the number of days to discharge after knee surgery. This does not mean age directly causes slower recovery; the lurking variables include overall fitness, pre-existing conditions, and immune system function.

Exercise hours and resting heart rate. A wellness program might find a moderate negative correlation ($r \approx -0.5$) between weekly exercise hours and resting heart rate. More exercise is associated with a lower resting heart rate, consistent with cardiovascular conditioning.

Medication dosage and side effect severity. A pharmacist might track the correlation between dosage levels and reported side effect scores, using it to identify dosage ranges where side effects escalate rapidly.

In each case, correlation helps identify variables worth investigating further, but it never proves causation on its own. Clinical decisions require controlled trials, not just correlational evidence.

Practice Problems

Test your understanding with these problems.

Problem 1: Five data pairs are: (1, 10), (2, 20), (3, 30), (4, 40), (5, 50). Without calculating, what is $r$? Explain.

$r = +1$ (exactly).

Every point falls perfectly on the line $y = 10x$. Since all points lie exactly on a straight line with a positive slope, there is a perfect positive linear relationship with zero deviation from the line, so $r$ equals exactly $+1$.

Problem 2: A study finds $r = -0.82$ between hours of TV watched per day and GPA among college students. (a) Describe the relationship in words. (b) Can we conclude that watching TV causes lower grades?

(a) There is a strong negative linear relationship between daily TV hours and GPA. Students who watch more television tend to have lower grade point averages.

(b) No, we cannot conclude causation from correlation alone. Possible explanations include:

  • Direct causation: TV time displaces study time, leading to lower grades (plausible but not proven).
  • Reverse causation: Students struggling academically may turn to TV as an escape.
  • Lurking variable: Students with less motivation or poorer time management may both watch more TV and earn lower grades.

A randomized experiment would be needed to establish causation.

Problem 3: Calculate $r$ for these four data pairs: (1, 8), (2, 6), (3, 4), (4, 2).

Compute the required sums with $n = 4$:

  • $\sum x = 1 + 2 + 3 + 4 = 10$
  • $\sum y = 8 + 6 + 4 + 2 = 20$
  • $\sum xy = 8 + 12 + 12 + 8 = 40$
  • $\sum x^2 = 1 + 4 + 9 + 16 = 30$
  • $\sum y^2 = 64 + 36 + 16 + 4 = 120$

Numerator: $4(40) - 10(20) = 160 - 200 = -40$

Denominator: $\sqrt{[4(30) - 100][4(120) - 400]} = \sqrt{[120 - 100][480 - 400]} = \sqrt{20 \times 80} = \sqrt{1600} = 40$

$$r = \frac{-40}{40} = -1$$

Answer: $r = -1$. This is a perfect negative linear relationship; every point falls exactly on the line $y = 10 - 2x$.

Problem 4: A researcher finds $r = 0.45$ between daily coffee consumption and productivity score. Calculate $R^2$ and interpret it.

$$R^2 = 0.45^2 = 0.2025$$

Interpretation: About 20.3% of the variation in productivity scores is explained by the linear relationship with coffee consumption. The remaining 79.7% is due to other factors (sleep quality, workload, motivation, etc.).

This is a relatively modest $R^2$, meaning coffee consumption alone is not a strong predictor of productivity, even though the moderate correlation suggests some association.

Problem 5: A dataset produces $r = 0.02$. A classmate says, "These variables are not related at all." Is this statement accurate? Why or why not?

The statement is not entirely accurate. An $r$ value of 0.02 means there is essentially no linear relationship between the two variables. However, this does not rule out the possibility of a nonlinear relationship. The variables could be strongly related in a curved, cyclical, or other non-straight-line pattern.

The correct interpretation: There is no meaningful linear association, but you should examine the scatter plot to check for nonlinear patterns before concluding the variables are completely unrelated.

Key Takeaways

  • The correlation coefficient $r$ measures the strength and direction of a linear relationship between two quantitative variables, ranging from $-1$ to $+1$
  • Values of $|r|$ near 1 indicate strong linear association; values near 0 indicate weak or no linear association
  • The coefficient of determination $R^2 = r^2$ tells you the proportion of variation in $y$ explained by the linear relationship with $x$
  • $r$ measures linear relationships only; it can miss curved or nonlinear patterns entirely
  • $r$ is sensitive to outliers, has no units, and does not change when you swap $x$ and $y$
  • Correlation does not imply causation; a strong $r$ can result from direct causation, reverse causation, a lurking variable, or coincidence
  • Always examine the scatter plot alongside $r$ to check for nonlinear patterns, outliers, and other features that a single number cannot capture
  • To establish causation, you need a randomized controlled experiment, not just an observational correlation

