Correlation
Correlation measures the strength and direction of the linear relationship between two quantitative variables. When you look at a scatter plot and see points trending upward or downward, correlation puts a number on that pattern. The result is a single value called the correlation coefficient, denoted $r$, that tells you how tightly the data follow a straight line and in which direction.
The Correlation Coefficient (r)
The Pearson correlation coefficient is the most common measure of linear association. It ranges from $-1$ to $+1$:
- $r = +1$: perfect positive linear relationship; every point falls exactly on a line that slopes upward
- $r = -1$: perfect negative linear relationship; every point falls exactly on a line that slopes downward
- $r = 0$: no linear relationship; the data show no straight-line pattern (though there may be a curved relationship)
Values between these extremes indicate varying degrees of linear association. The closer $|r|$ is to 1, the stronger the linear relationship.
Interpreting the Strength of r
| Range of $\lvert r \rvert$ | Interpretation |
|---|---|
| 0.0 to 0.3 | Weak linear relationship |
| 0.3 to 0.7 | Moderate linear relationship |
| 0.7 to 1.0 | Strong linear relationship |
These ranges are general guidelines, not strict cutoffs. In some fields (like physics), a correlation that counts as strong by this table might still be considered weak, while in the social sciences a moderate value could be considered strong.
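As a rough illustration, the guideline ranges can be turned into a small helper. This is a minimal sketch: the function name `describe_strength` is hypothetical, and placing the boundary values 0.3 and 0.7 in the higher band is an assumption, since the table does not say which band they belong to.

```python
def describe_strength(r):
    """Label the size of r using the guideline ranges (not strict cutoffs)."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and +1")
    size = abs(r)  # strength depends on the magnitude, not the sign
    # Assumption: a value exactly on a boundary goes in the higher band.
    if size < 0.3:
        return "weak"
    if size < 0.7:
        return "moderate"
    return "strong"


print(describe_strength(-0.85))  # strong
print(describe_strength(0.45))   # moderate
print(describe_strength(0.1))    # weak
```

The sign of $r$ is ignored here because it describes direction, not strength.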
The Formula for r
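The Pearson correlation coefficient is calculated using:
$$r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$$
Where $n$ is the number of data pairs, $\sum xy$ is the sum of the products of each pair, $\sum x$ and $\sum y$ are the sums of the individual variables, and $\sum x^2$ and $\sum y^2$ are the sums of the squared values.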
This formula looks intimidating, but it becomes manageable when you organize your work into a table and compute each sum systematically.
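The same bookkeeping can be sketched in code: compute each sum once, then combine them. The function name `pearson_r` and the small data set are illustrative assumptions, not part of the original lesson.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson r from the raw-sum (computational) formula."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 pairs")
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    if denominator == 0:
        raise ValueError("r is undefined when x or y has no variation")
    return numerator / denominator


# Made-up data on the line y = 2x + 1, so r should be exactly +1.
print(pearson_r([1, 2, 3, 4], [3, 5, 7, 9]))  # 1.0
```

Each named sum corresponds to one column total in a hand-calculation table, which makes the code easy to check against paper work.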
Calculating r Step by Step
Example 1: Study Hours vs Test Score
Six students reported their weekly study hours and their test scores:
| Student | Hours ($x$) | Score ($y$) |
|---|---|---|
| A | 2 | 65 |
| B | 3 | 70 |
| C | 5 | 80 |
| D | 4 | 75 |
| E | 6 | 85 |
| F | 1 | 58 |
Step 1: Compute the products and squares for each pair.
| Student | $x$ | $y$ | $xy$ | $x^2$ | $y^2$ |
|---|---|---|---|---|---|
| A | 2 | 65 | 130 | 4 | 4225 |
| B | 3 | 70 | 210 | 9 | 4900 |
| C | 5 | 80 | 400 | 25 | 6400 |
| D | 4 | 75 | 300 | 16 | 5625 |
| E | 6 | 85 | 510 | 36 | 7225 |
| F | 1 | 58 | 58 | 1 | 3364 |
| Sums | 21 | 433 | 1608 | 91 | 31739 |
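Step 2: Verify the sums.
$$\sum x = 21, \quad \sum y = 433, \quad \sum xy = 1608, \quad \sum x^2 = 91, \quad \sum y^2 = 31739$$
Step 3: Plug into the formula with $n = 6$.
Numerator: $6(1608) - (21)(433) = 9648 - 9093 = 555$
Denominator: $\sqrt{[6(91) - 21^2][6(31739) - 433^2]} = \sqrt{(546 - 441)(190434 - 187489)} = \sqrt{(105)(2945)} = \sqrt{309225} \approx 556.08$
Result: $r = \dfrac{555}{556.08} \approx 0.998$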
This is a very strong positive correlation. As study hours increase, test scores increase in a nearly perfect linear pattern.
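The hand calculation for Example 1 can be double-checked with a short script. This is only a sketch that mirrors the table columns; the variable names are illustrative.

```python
from math import sqrt

hours = [2, 3, 5, 4, 6, 1]         # x for students A through F
scores = [65, 70, 80, 75, 85, 58]  # y for students A through F

n = len(hours)
sum_x = sum(hours)
sum_y = sum(scores)
sum_xy = sum(x * y for x, y in zip(hours, scores))
sum_x2 = sum(x * x for x in hours)
sum_y2 = sum(y * y for y in scores)
print(sum_x, sum_y, sum_xy, sum_x2, sum_y2)  # 21 433 1608 91 31739

numerator = n * sum_xy - sum_x * sum_y
denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator

print(numerator)    # 555
print(round(r, 3))  # 0.998
```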
The Coefficient of Determination (R-squared)
The coefficient of determination, written $R^2$, is simply the square of the correlation coefficient:
$$R^2 = r^2$$
$R^2$ tells you the proportion of variation in $y$ that is explained by the linear relationship with $x$. It answers the question: "How much of the ups and downs in $y$ can we account for by knowing $x$?"
Example 1 Continued
$$R^2 = (0.998)^2 \approx 0.996$$
This means that 99.6% of the variation in test scores is explained by the linear relationship with study hours. Only 0.4% of the variation is due to other factors. This is an exceptionally high $R^2$; in real-world data, values this high are uncommon.
Interpreting R-squared
| $R^2$ Value | Interpretation |
|---|---|
| 0.90 and above | Excellent: nearly all variation explained |
| 0.70 to 0.90 | Strong: most variation explained |
| 0.40 to 0.70 | Moderate: a good portion explained |
| Below 0.40 | Weak: most variation is unexplained |
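Squaring $r$ and reading off the table can be sketched as one helper. The name `describe_r_squared` is hypothetical, and treating a value exactly on 0.40 or 0.70 as belonging to the higher band is an assumption; the labels are guidelines rather than rules.

```python
def describe_r_squared(r):
    """Square r and label the result using the guideline table."""
    r_squared = r ** 2
    # Assumption: boundary values fall in the higher band.
    if r_squared >= 0.90:
        label = "excellent"
    elif r_squared >= 0.70:
        label = "strong"
    elif r_squared >= 0.40:
        label = "moderate"
    else:
        label = "weak"
    return r_squared, label


value, label = describe_r_squared(0.998)  # rounded r from Example 1
print(f"{value:.1%} of variation explained ({label})")
# 99.6% of variation explained (excellent)

value, label = describe_r_squared(-0.8)   # hypothetical r; the sign vanishes when squared
print(f"{value:.1%} of variation explained ({label})")
# 64.0% of variation explained (moderate)
```

Note that a negative correlation gives the same proportion as a positive one of equal size, because squaring removes the sign.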
Properties and Limitations of r
Understanding what $r$ can and cannot tell you is just as important as knowing how to calculate it.
What r Does Well
- Has no units. The value of $r$ is the same whether you measure height in inches or centimeters, weight in pounds or kilograms. This makes it a standardized measure.
- Does not change when you swap $x$ and $y$. The correlation between study hours and test scores is the same as the correlation between test scores and study hours.
- Does not change with linear changes of units (adding a constant or multiplying by a positive constant). If you convert temperatures from Fahrenheit to Celsius, $r$ stays the same.
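These properties can be checked numerically. The sketch below uses made-up temperature and sales figures (not data from this lesson) and a compact `pearson_r` helper based on the raw-sum formula.

```python
from math import isclose, sqrt


def pearson_r(xs, ys):
    """Pearson r from the raw-sum formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    return (n * sum_xy - sum_x * sum_y) / sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )


temps_f = [50, 59, 68, 77, 86]  # hypothetical daily highs in Fahrenheit
sales = [12, 18, 21, 30, 33]    # hypothetical ice cream sales per day
temps_c = [(f - 32) * 5 / 9 for f in temps_f]  # same days in Celsius

r_fahrenheit = pearson_r(temps_f, sales)
r_celsius = pearson_r(temps_c, sales)
r_swapped = pearson_r(sales, temps_f)

print(isclose(r_fahrenheit, r_celsius))  # True: changing units leaves r alone
print(isclose(r_fahrenheit, r_swapped))  # True: swapping x and y leaves r alone
```

`isclose` is used instead of `==` because floating-point rounding can introduce tiny differences that have nothing to do with the statistics.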
Limitations of r
- Sensitive to outliers. A single extreme point can dramatically change $r$. One outlier can inflate a weak correlation into a strong one, or destroy a strong correlation.
- Only detects linear patterns. If the true relationship is curved, quadratic, or exponential, $r$ will underestimate the strength of the association.
- Does not imply causation. A high $r$ between two variables does not mean one causes the other (more on this below).
- Requires quantitative variables. You cannot compute $r$ for categorical data like gender or favorite color.
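The outlier effect is easy to see with a tiny made-up data set. In this sketch, five scattered points give a weak correlation, and adding one extreme point (an assumption chosen purely for illustration) pushes $r$ close to 1.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson r from the raw-sum formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    return (n * sum_xy - sum_x * sum_y) / sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )


xs = [1, 2, 3, 4, 5]
ys = [3, 1, 4, 2, 3]  # hypothetical values with no clear trend

r_without = pearson_r(xs, ys)
r_with = pearson_r(xs + [20], ys + [20])  # add a single extreme point at (20, 20)

print(round(r_without, 2))  # 0.14
print(round(r_with, 2))     # 0.97
```

Five of the six points are unchanged, yet the summary number moves from weak to strong, which is why the scatter plot should be checked for extreme points.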
Example 2: When r Misses the Pattern
Consider data points that form a perfect U-shape: as $x$ goes from 1 to 5, $y$ decreases, and as $x$ goes from 5 to 9, $y$ increases symmetrically. Despite this strong, clear pattern, $r$ would be approximately 0 because the relationship is not linear. Always look at the scatter plot before relying on $r$ alone.
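To make this concrete, here is a minimal sketch that assumes one particular U-shape, $y = (x - 5)^2$ for whole-number $x$ from 1 to 9. That curve is an illustrative choice, not data from the lesson.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson r from the raw-sum formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    return (n * sum_xy - sum_x * sum_y) / sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )


xs = list(range(1, 10))         # 1, 2, ..., 9
ys = [(x - 5) ** 2 for x in xs]  # perfect U-shape with its lowest point at x = 5

r = pearson_r(xs, ys)
print(r)  # 0.0
```

Here $y$ is completely determined by $x$, yet $r$ is zero because the falling left half and the rising right half cancel in the numerator.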
Correlation Does Not Imply Causation
This is arguably the most important concept in statistics. Whenever you find a strong correlation, resist the urge to conclude that one variable causes the other. Apart from pure coincidence, there are three possible explanations for any observed correlation:
1. Direct Causation
Variable $x$ actually causes changes in $y$. Example: increasing the dosage of a medication (within therapeutic range) causes blood pressure to decrease. This can only be confirmed through a controlled experiment, not observational data alone.
2. Reverse Causation
Variable $y$ actually causes changes in $x$, not the other way around. Example: you observe a positive correlation between the number of fire trucks at a scene and the amount of damage. It would be wrong to conclude that sending more trucks causes more damage; the reality is that larger fires (more damage) require more trucks.
3. Lurking (Confounding) Variable
A third variable that you did not measure causes both $x$ and $y$ to move together. This is the most common explanation for misleading correlations.
Classic examples:
- Ice cream sales and drowning deaths are positively correlated. The lurking variable is summer heat: hot weather drives both ice cream purchases and swimming activity.
- Shoe size and reading ability are positively correlated in children. The lurking variable is age: older children have both bigger feet and better reading skills.
- Per capita cheese consumption and deaths by bedsheet tangling show a strikingly high correlation over time. This is pure coincidence: with enough variables and enough years of data, some pairs will correlate by chance.
The lesson: always ask "What else could explain this relationship?" before drawing causal conclusions. To establish causation, you need a randomized controlled experiment where one variable is deliberately manipulated while other factors are held constant.
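A small simulation can show how a lurking variable manufactures a correlation. Everything in this sketch is an assumption made for illustration: the variable names, the normal distributions, the noise levels, and the sample size. Neither simulated quantity affects the other; both simply depend on `heat`.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Pearson r from the raw-sum formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    return (n * sum_xy - sum_x * sum_y) / sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )


random.seed(0)  # fixed seed so the run is repeatable
heat = [random.gauss(0, 1) for _ in range(2000)]     # the lurking variable
ice_cream = [h + random.gauss(0, 0.5) for h in heat]  # driven by heat plus noise
swimming = [h + random.gauss(0, 0.5) for h in heat]   # driven by heat plus noise

r_observed = pearson_r(ice_cream, swimming)

# Subtract heat's contribution; only the unrelated noise terms remain.
ice_cream_rest = [i - h for i, h in zip(ice_cream, heat)]
swimming_rest = [s - h for s, h in zip(swimming, heat)]
r_without_heat = pearson_r(ice_cream_rest, swimming_rest)

print(round(r_observed, 2))      # strongly positive
print(round(r_without_heat, 2))  # close to zero
```

In real data the lurking variable is usually unmeasured, so it cannot be subtracted out this way; that is why a randomized experiment is needed.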
Real-World Application: Nursing (Correlating Patient Variables)
In clinical settings, nurses and researchers frequently use correlation to explore relationships between patient variables. Some examples:
Patient age and recovery time. A nurse researcher might find a moderate positive correlation between patient age and the number of days to discharge after knee surgery. This does not mean age directly causes slower recovery; the lurking variables include overall fitness, pre-existing conditions, and immune system function.
Exercise hours and resting heart rate. A wellness program might find a moderate negative correlation between weekly exercise hours and resting heart rate. More exercise is associated with a lower resting heart rate, consistent with cardiovascular conditioning.
Medication dosage and side effect severity. A pharmacist might track the correlation between dosage levels and reported side effect scores, using it to identify dosage ranges where side effects escalate rapidly.
In each case, correlation helps identify variables worth investigating further, but it never proves causation on its own. Clinical decisions require controlled trials, not just correlational evidence.
Practice Problems
Test your understanding with these problems.
Problem 1: Five data pairs are: (1, 10), (2, 20), (3, 30), (4, 40), (5, 50). Without calculating, what is $r$? Explain.
$r = +1$ (exactly).
Every point falls perfectly on the line $y = 10x$. There is a perfect positive linear relationship with zero deviation from the line. Since all points lie exactly on a straight line with a positive slope, $r$ equals exactly $+1$.
Problem 2: A study finds a strong negative correlation between hours of TV watched per day and GPA among college students. (a) Describe the relationship in words. (b) Can we conclude that watching TV causes lower grades?
(a) There is a strong negative linear relationship between daily TV hours and GPA. Students who watch more television tend to have lower grade point averages.
(b) No, we cannot conclude causation from correlation alone. Possible explanations include:
- Direct causation: TV time displaces study time, leading to lower grades (plausible but not proven).
- Reverse causation: Students struggling academically may turn to TV as an escape.
- Lurking variable: Students with less motivation or poorer time management may both watch more TV and earn lower grades.
A randomized experiment would be needed to establish causation.
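Problem 3: Calculate $r$ for these four data pairs: (1, 8), (2, 6), (3, 4), (4, 2).
Compute the required sums with $n = 4$:
$$\sum x = 10, \quad \sum y = 20, \quad \sum xy = 40, \quad \sum x^2 = 30, \quad \sum y^2 = 120$$
Numerator: $4(40) - (10)(20) = 160 - 200 = -40$
Denominator: $\sqrt{[4(30) - 10^2][4(120) - 20^2]} = \sqrt{(20)(80)} = \sqrt{1600} = 40$
Answer: $r = \dfrac{-40}{40} = -1$. This is a perfect negative linear relationship; every point falls exactly on the line $y = -2x + 10$.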
Problem 4: A researcher finds $r = 0.45$ between daily coffee consumption and productivity score. Calculate $R^2$ and interpret it.
$$R^2 = (0.45)^2 = 0.2025 \approx 0.203$$
Interpretation: About 20.3% of the variation in productivity scores is explained by the linear relationship with coffee consumption. The remaining 79.7% is due to other factors (sleep quality, workload, motivation, etc.).
This is a relatively modest $R^2$, meaning coffee consumption alone is not a strong predictor of productivity, even though the moderate correlation suggests some association.
Problem 5: A dataset produces $r = 0.02$. A classmate says, "These variables are not related at all." Is this statement accurate? Why or why not?
The statement is not entirely accurate. An $r$ value of 0.02 means there is essentially no linear relationship between the two variables. However, this does not rule out the possibility of a nonlinear relationship. The variables could be strongly related in a curved, cyclical, or other non-straight-line pattern.
The correct interpretation: There is no meaningful linear association, but you should examine the scatter plot to check for nonlinear patterns before concluding the variables are completely unrelated.
Key Takeaways
- The correlation coefficient $r$ measures the strength and direction of a linear relationship between two quantitative variables, ranging from $-1$ to $+1$
- Values of $|r|$ near 1 indicate strong linear association; values near 0 indicate weak or no linear association
- The coefficient of determination $R^2$ tells you the proportion of variation in $y$ explained by the linear relationship with $x$
- $r$ measures linear relationships only; it can miss curved or nonlinear patterns entirely
- $r$ is sensitive to outliers, has no units, and does not change when you swap $x$ and $y$
- Correlation does not imply causation: a strong $r$ can result from direct causation, reverse causation, a lurking variable, or coincidence
- Always examine the scatter plot alongside $r$ to check for nonlinear patterns, outliers, and other features that a single number cannot capture
- To establish causation, you need a randomized controlled experiment, not just an observational correlation
Last updated: March 29, 2026