Correlation

Last updated: March 2026 · Intermediate

Correlation measures the strength and direction of the linear relationship between two quantitative variables. When you look at a scatter plot and see points trending upward or downward, correlation puts a number on that pattern. The result is a single value called the correlation coefficient, denoted $r$, that tells you exactly how tightly the data follows a straight line, and in which direction.

The Correlation Coefficient (r)

The Pearson correlation coefficient $r$ is the most common measure of linear association. It ranges from $-1$ to $+1$:

  • $r = +1$: perfect positive linear relationship; every point falls exactly on a line that slopes upward
  • $r = -1$: perfect negative linear relationship; every point falls exactly on a line that slopes downward
  • $r = 0$: no linear relationship; the data shows no straight-line pattern (though there may be a curved relationship)

Values between these extremes indicate varying degrees of linear association. The closer $|r|$ is to 1, the stronger the linear relationship.

Interpreting the Strength of r

| $\vert r\vert$ Range | Interpretation |
|---|---|
| 0.0 to 0.3 | Weak linear relationship |
| 0.3 to 0.7 | Moderate linear relationship |
| 0.7 to 1.0 | Strong linear relationship |

These ranges are general guidelines, not strict cutoffs. In some fields (like physics), $r = 0.7$ might be considered weak, while in social sciences, $r = 0.5$ could be considered strong.

The Formula for r

The Pearson correlation coefficient is calculated using:

$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$$

Where $n$ is the number of data pairs, $\sum xy$ is the sum of the products of each pair, $\sum x$ and $\sum y$ are the sums of the individual variables, and $\sum x^2$ and $\sum y^2$ are the sums of the squared values.

This formula looks intimidating, but it becomes manageable when you organize your work into a table and compute each sum systematically.
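The same bookkeeping can be done in a few lines of code. This is a minimal sketch of the sum-based formula above (the function name `pearson_r` is our own choice, not from the text):

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient via the computational (sum-based) formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))       # sum of products
    sum_x2 = sum(x * x for x in xs)                    # sum of squared x
    sum_y2 = sum(y * y for y in ys)                    # sum of squared y
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Example 1 data: study hours vs test scores
hours = [2, 3, 5, 4, 6, 1]
scores = [65, 70, 80, 75, 85, 58]
print(round(pearson_r(hours, scores), 3))  # 0.998
```

Each intermediate sum in the function corresponds to one column of the worked table below.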

Calculating r Step by Step

Example 1: Study Hours vs Test Score

Six students reported their weekly study hours and their test scores:

| Student | Hours ($x$) | Score ($y$) |
|---|---|---|
| A | 2 | 65 |
| B | 3 | 70 |
| C | 5 | 80 |
| D | 4 | 75 |
| E | 6 | 85 |
| F | 1 | 58 |

Step 1: Compute the products and squares for each pair.

| Student | $x$ | $y$ | $xy$ | $x^2$ | $y^2$ |
|---|---|---|---|---|---|
| A | 2 | 65 | 130 | 4 | 4225 |
| B | 3 | 70 | 210 | 9 | 4900 |
| C | 5 | 80 | 400 | 25 | 6400 |
| D | 4 | 75 | 300 | 16 | 5625 |
| E | 6 | 85 | 510 | 36 | 7225 |
| F | 1 | 58 | 58 | 1 | 3364 |
| Sums | 21 | 433 | 1608 | 91 | 31739 |

Step 2: Verify the sums.

  • $\sum x = 2 + 3 + 5 + 4 + 6 + 1 = 21$
  • $\sum y = 65 + 70 + 80 + 75 + 85 + 58 = 433$
  • $\sum xy = 130 + 210 + 400 + 300 + 510 + 58 = 1608$
  • $\sum x^2 = 4 + 9 + 25 + 16 + 36 + 1 = 91$
  • $\sum y^2 = 4225 + 4900 + 6400 + 5625 + 7225 + 3364 = 31739$

Step 3: Plug into the formula with $n = 6$.

Numerator:

$$n\sum xy - \sum x \sum y = 6(1608) - 21(433) = 9648 - 9093 = 555$$

Denominator:

$$\sqrt{\left[6(91) - 21^2\right]\left[6(31739) - 433^2\right]}$$

$$= \sqrt{[546 - 441][190434 - 187489]}$$

$$= \sqrt{105 \times 2945}$$

$$= \sqrt{309225} \approx 556.08$$

Result:

$$r = \frac{555}{556.08} \approx 0.998$$

This is a very strong positive correlation. As study hours increase, test scores increase in a nearly perfect linear pattern.

The Coefficient of Determination (R-squared)

The coefficient of determination, written $R^2$, is simply the square of the correlation coefficient:

$$R^2 = r^2$$

$R^2$ tells you the proportion of variation in $y$ that is explained by the linear relationship with $x$. It answers the question: "How much of the ups and downs in $y$ can we account for by knowing $x$?"

Example 1 Continued

$$R^2 = 0.998^2 \approx 0.996$$

This means that 99.6% of the variation in test scores is explained by the linear relationship with study hours. Only 0.4% of the variation is due to other factors. This is an exceptionally high $R^2$; in real-world data, values this high are uncommon.

Interpreting R-squared

| $R^2$ Value | Interpretation |
|---|---|
| 0.90 and above | Excellent: nearly all variation explained |
| 0.70 to 0.90 | Strong: most variation explained |
| 0.40 to 0.70 | Moderate: a good portion explained |
| Below 0.40 | Weak: most variation is unexplained |

Properties and Limitations of r

Understanding what $r$ can and cannot tell you is just as important as knowing how to calculate it.

What r Does Well

  • Has no units. The value of $r$ is the same whether you measure height in inches or centimeters, or weight in pounds or kilograms. This makes it a standardized measure.
  • Does not change when you swap $x$ and $y$. The correlation between study hours and test scores is the same as the correlation between test scores and study hours.
  • Does not change with linear transformations. If you convert temperatures from Fahrenheit to Celsius, $r$ stays the same.

Limitations of r

  • Sensitive to outliers. A single extreme point can dramatically change $r$. One outlier can inflate a weak correlation into a strong one, or destroy a strong correlation.
  • Only detects linear patterns. If the true relationship is curved, quadratic, or exponential, $r$ will underestimate the strength of the association.
  • Does not imply causation. A high $r$ between two variables does not mean one causes the other (more on this below).
  • Requires quantitative variables. You cannot compute $r$ for categorical data like gender or favorite color.

Example 2: When r Misses the Pattern

Consider data points that form a perfect U-shape: as $x$ goes from 1 to 5, $y$ decreases, and as $x$ goes from 5 to 10, $y$ increases symmetrically. Despite this strong, clear pattern, $r$ would be approximately 0 because the relationship is not linear. Always look at the scatter plot before relying on $r$ alone.

Correlation Does Not Imply Causation

This is arguably the most important concept in statistics. Whenever you find a strong correlation, resist the urge to conclude that one variable causes the other. There are three possible explanations for any observed correlation:

1. Direct Causation

Variable $x$ actually causes changes in $y$. Example: increasing the dosage of a medication (within therapeutic range) causes blood pressure to decrease. This can only be confirmed through a controlled experiment, not observational data alone.

2. Reverse Causation

Variable $y$ actually causes changes in $x$, not the other way around. Example: you observe a positive correlation between the number of fire trucks at a scene and the amount of damage. It would be wrong to conclude that sending more trucks causes more damage; the reality is that larger fires (more damage) require more trucks.

3. Lurking (Confounding) Variable

A third variable that you did not measure causes both $x$ and $y$ to move together. This is the most common explanation for misleading correlations.

Classic examples:

  • Ice cream sales and drowning deaths are positively correlated. The lurking variable is summer heat: hot weather drives both ice cream purchases and swimming activity.
  • Shoe size and reading ability are positively correlated in children. The lurking variable is age: older children have both bigger feet and better reading skills.
  • Per capita cheese consumption and deaths by bedsheet tangling show a strikingly high correlation over time. This is pure coincidence: with enough variables and enough years of data, some pairs will correlate by chance.

The lesson: always ask "What else could explain this relationship?" before drawing causal conclusions. To establish causation, you need a randomized controlled experiment where one variable is deliberately manipulated while other factors are held constant.

Real-World Application: Nursing β€” Correlating Patient Variables

In clinical settings, nurses and researchers frequently use correlation to explore relationships between patient variables. Some examples:

Patient age and recovery time. A nurse researcher might find a moderate positive correlation ($r \approx 0.6$) between patient age and the number of days to discharge after knee surgery. This does not mean age directly causes slower recovery; the lurking variables include overall fitness, pre-existing conditions, and immune system function.

Exercise hours and resting heart rate. A wellness program might find a moderate negative correlation ($r \approx -0.5$) between weekly exercise hours and resting heart rate. More exercise is associated with a lower resting heart rate, consistent with cardiovascular conditioning.

Medication dosage and side effect severity. A pharmacist might track the correlation between dosage levels and reported side effect scores, using it to identify dosage ranges where side effects escalate rapidly.

In each case, correlation helps identify variables worth investigating further, but it never proves causation on its own. Clinical decisions require controlled trials, not just correlational evidence.

Practice Problems

Test your understanding with these problems.

Problem 1: Five data pairs are: (1, 10), (2, 20), (3, 30), (4, 40), (5, 50). Without calculating, what is $r$? Explain.

$r = +1$ (exactly).

Every point falls perfectly on the line $y = 10x$. Since all points lie exactly on a straight line with a positive slope, there is a perfect positive linear relationship with zero deviation from the line, so $r$ equals exactly $+1$.

Problem 2: A study finds $r = -0.82$ between hours of TV watched per day and GPA among college students. (a) Describe the relationship in words. (b) Can we conclude that watching TV causes lower grades?

(a) There is a strong negative linear relationship between daily TV hours and GPA. Students who watch more television tend to have lower grade point averages.

(b) No, we cannot conclude causation from correlation alone. Possible explanations include:

  • Direct causation: TV time displaces study time, leading to lower grades (plausible but not proven).
  • Reverse causation: Students struggling academically may turn to TV as an escape.
  • Lurking variable: Students with less motivation or poorer time management may both watch more TV and earn lower grades.

A randomized experiment would be needed to establish causation.

Problem 3: Calculate $r$ for these four data pairs: (1, 8), (2, 6), (3, 4), (4, 2).

Compute the required sums with $n = 4$:

  • $\sum x = 1 + 2 + 3 + 4 = 10$
  • $\sum y = 8 + 6 + 4 + 2 = 20$
  • $\sum xy = 8 + 12 + 12 + 8 = 40$
  • $\sum x^2 = 1 + 4 + 9 + 16 = 30$
  • $\sum y^2 = 64 + 36 + 16 + 4 = 120$

Numerator: $4(40) - 10(20) = 160 - 200 = -40$

Denominator: $\sqrt{[4(30) - 100][4(120) - 400]} = \sqrt{[120 - 100][480 - 400]} = \sqrt{20 \times 80} = \sqrt{1600} = 40$

$$r = \frac{-40}{40} = -1$$

Answer: $r = -1$. This is a perfect negative linear relationship; every point falls exactly on the line $y = 10 - 2x$.

Problem 4: A researcher finds $r = 0.45$ between daily coffee consumption and productivity score. Calculate $R^2$ and interpret it.

$$R^2 = 0.45^2 = 0.2025$$

Interpretation: About 20.3% of the variation in productivity scores is explained by the linear relationship with coffee consumption. The remaining 79.7% is due to other factors (sleep quality, workload, motivation, etc.).

This is a relatively modest $R^2$, meaning coffee consumption alone is not a strong predictor of productivity, even though the moderate correlation suggests some association.

Problem 5: A dataset produces $r = 0.02$. A classmate says, "These variables are not related at all." Is this statement accurate? Why or why not?

The statement is not entirely accurate. An $r$ value of 0.02 means there is essentially no linear relationship between the two variables. However, this does not rule out the possibility of a nonlinear relationship. The variables could be strongly related in a curved, cyclical, or other non-straight-line pattern.

The correct interpretation: There is no meaningful linear association, but you should examine the scatter plot to check for nonlinear patterns before concluding the variables are completely unrelated.

Key Takeaways

  • The correlation coefficient $r$ measures the strength and direction of a linear relationship between two quantitative variables, ranging from $-1$ to $+1$
  • Values of $|r|$ near 1 indicate strong linear association; values near 0 indicate weak or no linear association
  • The coefficient of determination $R^2 = r^2$ tells you the proportion of variation in $y$ explained by the linear relationship with $x$
  • $r$ measures linear relationships only; it can miss curved or nonlinear patterns entirely
  • $r$ is sensitive to outliers, has no units, and does not change when you swap $x$ and $y$
  • Correlation does not imply causation; a strong $r$ can result from direct causation, reverse causation, a lurking variable, or coincidence
  • Always examine the scatter plot alongside $r$ to check for nonlinear patterns, outliers, and other features that a single number cannot capture
  • To establish causation, you need a randomized controlled experiment, not just an observational correlation

