Scatter Plots
A scatter plot (also called a scatterplot or scatter diagram) displays the relationship between two quantitative variables. Each data point is plotted as a dot on a coordinate plane, with one variable on the horizontal axis and the other on the vertical axis. Scatter plots are one of the most powerful tools in statistics for exploring whether two variables are related and, if so, how.
What Is a Scatter Plot?
A scatter plot uses two axes to represent two different measurements for each individual or observation in a dataset:
- Horizontal axis ($x$-axis): The explanatory variable (also called the independent variable). This is the variable you think might influence or predict the other.
- Vertical axis ($y$-axis): The response variable (also called the dependent variable). This is the variable you think might be affected by the explanatory variable.
Each dot on the plot represents one observation. Its horizontal position corresponds to the $x$-value, and its vertical position corresponds to the $y$-value.
Example: If you are studying whether more hours of study lead to higher exam scores, study hours go on the $x$-axis (explanatory) and exam score goes on the $y$-axis (response).
Describing Association
When you look at a scatter plot, describe the association (relationship) between the two variables using three characteristics:
Direction
- Positive association: As $x$ increases, $y$ tends to increase. The dots trend upward from left to right.
- Negative association: As $x$ increases, $y$ tends to decrease. The dots trend downward from left to right.
- No association: There is no clear upward or downward pattern. The dots appear scattered randomly.
Form
- Linear: The dots roughly follow a straight-line pattern.
- Curved (nonlinear): The dots follow a curve, perhaps a parabola or exponential shape.
- No pattern: The dots do not follow any identifiable shape.
Strength
- Strong: The data points cluster tightly around the underlying pattern (line or curve). There is little scatter.
- Moderate: The points follow a general trend but with noticeable spread.
- Weak: The points are widely scattered, and the trend is hard to see.
When describing a scatter plot, combine all three: for example, "a strong, positive, linear association."
Creating a Scatter Plot
To create a scatter plot by hand or interpret one on a test:
- Label the axes with variable names and units.
- Choose a scale for each axis that fits all data values with some room to spare.
- Plot each data point as a dot at the intersection of its $x$ and $y$ values.
- Do not connect the dots: scatter plots show individual points, not a continuous path.
Example 1: Study Hours vs Exam Score
Ten students reported their study hours and exam scores:
| Student | Study Hours ($x$) | Exam Score ($y$) |
|---|---|---|
| 1 | 2 | 65 |
| 2 | 3 | 70 |
| 3 | 5 | 78 |
| 4 | 1 | 55 |
| 5 | 6 | 82 |
| 6 | 4 | 72 |
| 7 | 7 | 88 |
| 8 | 8 | 91 |
| 9 | 3 | 68 |
| 10 | 5 | 80 |
[Figure: scatter plot of Study Hours vs Exam Score]
Each dot represents one student. The horizontal position shows how many hours that student studied, and the vertical position shows the exam score they earned.
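The plotting steps can be sketched in code. The following is a rough text-mode rendering of the Example 1 table, not the original figure; the function name `text_scatter`, the 5-point score bands, and the 50 to 95 axis range are assumptions chosen for illustration. Each `*` marks a cell containing at least one student, and no marks are connected.

```python
# Data from the table in Example 1: (study hours, exam score) per student.
points = [(2, 65), (3, 70), (5, 78), (1, 55), (6, 82),
          (4, 72), (7, 88), (8, 91), (3, 68), (5, 80)]

def text_scatter(points, y_min=50, y_max=95, y_step=5):
    """Return a rough text scatter plot: one row per y band, one column per x."""
    x_max = max(x for x, _ in points)
    rows = []
    # Walk the y-axis from top to bottom so larger scores appear higher up.
    for band_low in range(y_max - y_step, y_min - y_step, -y_step):
        band_high = band_low + y_step
        cells = []
        for x in range(1, x_max + 1):
            # Mark the cell if any observation falls in this x column and y band.
            hit = any(px == x and band_low <= py < band_high for px, py in points)
            cells.append("*" if hit else ".")
        rows.append(f"{band_low:>3} | " + " ".join(cells))
    rows.append("    +" + "--" * x_max)
    rows.append("      " + " ".join(str(x) for x in range(1, x_max + 1)))
    return "\n".join(rows)

plot = text_scatter(points)
print(plot)
```

Two students who land in the same column and band would collapse into a single mark, so a text grid like this is only a quick look. With these ten students every observation gets its own cell.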
Interpreting Scatter Plots
Example 2: Describe the Association
Looking at the scatter plot from Example 1, we describe the association using the three characteristics:
Direction: Positive. As study hours increase (moving right), exam scores also increase (moving up). Students who studied more tended to score higher.
Form: Linear. The dots roughly follow a straight-line path from the lower left to the upper right. There is no obvious curve.
Strength: Strong. The data points cluster tightly along the linear trend. There is relatively little vertical scatter at any given number of study hours.
Full description: The scatter plot shows a strong, positive, linear association between study hours and exam scores. Students who studied more hours tended to earn higher scores on the exam.
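The visual description can be cross-checked with a number. The sketch below computes the Pearson correlation coefficient $r$ for the Example 1 data. This statistic is not defined in this section, so treat it as an added assumption: values near $+1$ or $-1$ suggest a strong linear association, and the sign gives the direction. The helper name `pearson_r` is illustrative.

```python
from math import sqrt

# Data from the Example 1 table.
hours = [2, 3, 5, 1, 6, 4, 7, 8, 3, 5]
scores = [65, 70, 78, 55, 82, 72, 88, 91, 68, 80]

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

r = pearson_r(hours, scores)
direction = "positive" if r > 0 else "negative" if r < 0 else "none"
print(f"r = {r:.3f}, direction: {direction}")
```

For these ten students $r$ is close to 1, which agrees with the "strong, positive, linear" description. A coefficient alone does not reveal form, so it complements the plot rather than replacing it.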
We can also estimate the rate of change. The student who studied 1 hour scored 55, and the student who studied 8 hours scored 91. The approximate rate of increase is:
$$\frac{91 - 55}{8 - 1} = \frac{36}{7} \approx 5.1 \text{ points per hour}$$
This means each additional hour of study is associated with roughly a 5-point increase in exam score.
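The two-point estimate uses only the students with the fewest and the most study hours. As a hedged comparison, the sketch below also computes a least-squares slope, which uses all ten students. Least squares is an assumption brought in here for illustration; the text itself relies only on the two-point estimate.

```python
# Data from the Example 1 table.
hours = [2, 3, 5, 1, 6, 4, 7, 8, 3, 5]
scores = [65, 70, 78, 55, 82, 72, 88, 91, 68, 80]

# Two-point estimate from the text: 1-hour student vs 8-hour student.
endpoint_rate = (91 - 55) / (8 - 1)

# Least-squares slope (an assumption; not the method used in the text).
n = len(hours)
mean_x = sum(hours) / n
mean_y = sum(scores) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, scores))
sxx = sum((x - mean_x) ** 2 for x in hours)
least_squares_slope = sxy / sxx

print(f"two-point estimate:  {endpoint_rate:.2f} points per hour")
print(f"least-squares slope: {least_squares_slope:.2f} points per hour")
```

Both figures land near 5 points per hour, so the quick estimate is a reasonable summary for this dataset.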
Outliers in Scatter Plots
An outlier in a scatter plot is a point that falls far from the overall pattern of the other data points. Outliers are important because they can:
- Signal a data entry error (the value was recorded incorrectly)
- Represent a genuinely unusual case that deserves investigation
- Influence calculations like the line of best fit, pulling it toward the outlier
How to spot an outlier: Look for a point that is far above, far below, or far to the side of the general cluster of dots.
Example: Suppose our dataset included an 11th student who studied 7 hours but scored only 52. That point would appear far below the cluster of other dots (every student who studied 2 or more hours scored 65 or higher). This student might have been ill during the exam or misunderstood the material despite studying, or the recorded score could be a typo.
What to do with outliers:
- Verify the data: check for recording errors first.
- Investigate the cause: is there a reason this point is different?
- Report it: note the outlier in your analysis. Do not silently delete it unless you have a clear justification.
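To see how one outlier can pull a fitted line, the sketch below fits a least-squares line to the ten students from Example 1, then refits with the hypothetical 11th student (7 hours, score 52) included. The least-squares fit and the helper name `fit_line` are assumptions for illustration; the text does not prescribe a fitting method.

```python
# Data from the Example 1 table, plus the hypothetical 11th student.
students = [(2, 65), (3, 70), (5, 78), (1, 55), (6, 82),
            (4, 72), (7, 88), (8, 91), (3, 68), (5, 80)]
suspect = (7, 52)

def fit_line(points):
    """Least-squares slope and intercept for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

slope_without, intercept_without = fit_line(students)
slope_with, _ = fit_line(students + [suspect])

# Residual = observed score minus the score the 10-student line predicts.
predicted = intercept_without + slope_without * suspect[0]
residual = suspect[1] - predicted
largest_typical = max(
    abs(y - (intercept_without + slope_without * x)) for x, y in students
)

print(f"slope without outlier: {slope_without:.2f}")
print(f"slope with outlier:    {slope_with:.2f}")
print(f"outlier residual: {residual:.1f} "
      f"(largest among the other ten: {largest_typical:.1f})")
```

Reporting both slopes, as done here, follows the advice to analyze the data with and without the outlier instead of silently deleting it.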
Correlation Does Not Imply Causation
This is one of the most important principles in all of statistics. Just because two variables show a strong association in a scatter plot does not mean that one variable causes the other to change.
Why not? There are several possibilities:
- Lurking variable (confounding factor): A third variable that you did not measure may be driving both. For example, ice cream sales and drowning deaths both increase in summer, but ice cream does not cause drowning. The lurking variable is warm weather, which causes people to both buy ice cream and go swimming.
- Reverse causation: Maybe $y$ causes $x$, not the other way around.
- Coincidence: With enough variables, some will appear correlated by pure chance.
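The lurking-variable case can be illustrated with a small simulation. Everything below is synthetic: the temperature range, the coefficients, the noise levels, and the variable names are invented for this sketch and are not real sales or drowning data. Temperature drives both series, and neither series affects the other.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

rng = random.Random(0)  # fixed seed so the run is repeatable

# Synthetic data, invented for illustration: temperature is the lurking variable.
temps = [rng.uniform(10, 35) for _ in range(200)]
ice_cream = [2.0 * t + rng.gauss(0, 5) for t in temps]  # driven by temperature
drownings = [0.5 * t + rng.gauss(0, 2) for t in temps]  # also driven by temperature

r_raw = pearson_r(ice_cream, drownings)

# Subtract the temperature effect used to generate each series, then re-check.
ice_resid = [i - 2.0 * t for i, t in zip(ice_cream, temps)]
drown_resid = [d - 0.5 * t for d, t in zip(drownings, temps)]
r_adjusted = pearson_r(ice_resid, drown_resid)

print(f"raw correlation:                 {r_raw:.2f}")
print(f"after removing temperature part: {r_adjusted:.2f}")
```

The raw correlation is high even though the simulation contains no causal link between the two series. Once the shared temperature component is removed, what remains is independent noise, so the correlation falls toward zero.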
Classic examples of misleading correlations:
- Countries that consume more chocolate tend to win more Nobel Prizes. (Lurking variable: national wealth funds both research institutions and chocolate imports.)
- Cities with more firefighters tend to have more fire damage. (Lurking variable: larger cities have more of both.)
- Students who eat breakfast tend to have higher grades. (Possible lurking variables: household stability, parental involvement, sleep habits.)
The rule: A scatter plot can show that two variables are associated. To prove that one causes the other, you need a controlled experiment where you manipulate one variable while holding all others constant.
Real-World Application: Nursing, Patient Age vs Recovery Time
A nurse researcher collects data on 8 patients who underwent the same knee surgery, recording each patient's age and the number of days until they were discharged:
| Patient | Age ($x$) | Recovery Days ($y$) |
|---|---|---|
| 1 | 30 | 3 |
| 2 | 45 | 5 |
| 3 | 52 | 6 |
| 4 | 38 | 4 |
| 5 | 60 | 8 |
| 6 | 55 | 7 |
| 7 | 42 | 4 |
| 8 | 65 | 9 |
Plotting this data would show a positive, linear association: older patients tend to have longer recovery times. The relationship appears strong, with the points clustering near a line and only a little natural variation.
Clinical implication: This association helps nurses anticipate discharge planning needs. A 60-year-old patient will likely need more recovery days (and more post-surgical support) than a 35-year-old patient undergoing the same procedure.
Caution: Age does not directly cause longer recovery. The lurking variables might include overall health, muscle mass, pre-existing conditions, and medication use, all of which tend to correlate with age. A healthy 60-year-old might recover faster than an unhealthy 40-year-old.
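The association in the patient table can be summarized with a fitted line. The sketch below is one possible approach under stated assumptions: least squares is not named in the text, and `predict_days` is a hypothetical helper. The result describes these eight patients only and says nothing about cause.

```python
# Data from the patient table: age in years, recovery time in days.
ages = [30, 45, 52, 38, 60, 55, 42, 65]
days = [3, 5, 6, 4, 8, 7, 4, 9]

n = len(ages)
mean_age = sum(ages) / n
mean_days = sum(days) / n
sxy = sum((a - mean_age) * (d - mean_days) for a, d in zip(ages, days))
sxx = sum((a - mean_age) ** 2 for a in ages)

slope = sxy / sxx                      # extra recovery days per year of age
intercept = mean_days - slope * mean_age

def predict_days(age):
    """Recovery days the fitted line suggests for a given age (descriptive only)."""
    return intercept + slope * age

print(f"slope: {slope:.3f} days per year of age")
print(f"age 35: {predict_days(35):.1f} days, age 60: {predict_days(60):.1f} days")
```

The slope works out to a little under 0.2 days per year of age, so the line suggests a 60-year-old stays about four days longer than a 35-year-old. Any individual patient can differ from the line for the reasons given in the caution.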
Practice Problems
Test your understanding with these problems.
Problem 1: A scatter plot of advertising spending ($x$) vs monthly revenue ($y$) shows dots trending upward from left to right in a roughly straight-line pattern with moderate spread. Describe the association using the three characteristics.
Direction: Positive. As advertising spending increases, revenue tends to increase.
Form: Linear. The points follow a roughly straight-line pattern.
Strength: Moderate. The points follow the trend but with noticeable spread (not tightly clustered).
Full description: There is a moderate, positive, linear association between advertising spending and monthly revenue.
Problem 2: Five data points are plotted on a scatter plot: (1, 80), (2, 75), (3, 68), (4, 60), (5, 52). Is the association positive or negative? Estimate the rate of change.
The association is negative: as $x$ increases, $y$ decreases.
Rate of change:
$$\frac{52 - 80}{5 - 1} = \frac{-28}{4} = -7$$
Answer: The association is negative, with $y$ decreasing by approximately 7 units for each 1-unit increase in $x$.
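The answer can be checked in a few lines. This sketch compares the step-by-step changes in $y$ with the overall rate between the first and last points; the variable names are illustrative.

```python
# The five points from Problem 2.
points = [(1, 80), (2, 75), (3, 68), (4, 60), (5, 52)]

# Change in y between each pair of consecutive points (x rises by 1 each time).
steps = [y2 - y1 for (_, y1), (_, y2) in zip(points, points[1:])]

# Overall rate of change between the first and last points.
(x_first, y_first), (x_last, y_last) = points[0], points[-1]
overall_rate = (y_last - y_first) / (x_last - x_first)

print("step changes:", steps)
print("overall rate:", overall_rate)
```

The individual steps are not all equal, which is why the answer says "approximately": the overall rate is the average of the step changes.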
Problem 3: A study finds a strong positive correlation between the number of hospitals in a city and the number of crimes committed. Does this mean hospitals cause crime? Explain.
No. This is a classic example of a lurking variable. The lurking variable is city population. Larger cities have both more hospitals (to serve more people) and more total crimes (simply because there are more people). The hospitals do not cause crime; both variables are driven by population size.
Key principle: Correlation does not imply causation. Always look for confounding variables before drawing causal conclusions.
Problem 4: In a scatter plot, one data point is located at (4, 95) while all other points cluster at much lower $y$-values for similar $x$-values. What should you do with this point?
This point is an outlier: it falls far above the overall pattern.
Steps to take:
- Verify the data: check whether 95 was recorded correctly. It may be a typo (e.g., the actual value was 75).
- Investigate: is there a reason this observation is different from the others? Perhaps this individual had unusual circumstances.
- Report it: include the outlier in your analysis and note its presence. Do not remove it without a documented reason.
If the value is confirmed correct, analyze the data both with and without the outlier to see how much it affects results.
Problem 5: A dataset of 10 points has the following $x$-values: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 and $y$-values: 23, 19, 30, 15, 27, 22, 18, 25, 20, 28. Would you describe this association as strong, moderate, or weak? Why?
Plotting these points would show dots scattered without a clear upward or downward trend. The $y$-values fluctuate up and down as $x$ increases, with no consistent direction.
Answer: The association is weak (and arguably shows no association). There is no clear linear pattern: the points are scattered without a consistent trend. The $y$-values do not consistently increase or decrease as $x$ increases.
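A numeric check supports this reading. The sketch below computes the Pearson correlation coefficient for the Problem 5 data. The coefficient is an added assumption, since the problem asks only for a visual judgment, and the helper name `pearson_r` is illustrative.

```python
from math import sqrt

# Data from Problem 5.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [23, 19, 30, 15, 27, 22, 18, 25, 20, 28]

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

r = pearson_r(xs, ys)
print(f"r = {r:.2f}")
```

The coefficient comes out only slightly above zero, consistent with a weak association and no clear direction.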
Key Takeaways
- A scatter plot displays the relationship between two quantitative variables, with each dot representing one observation.
- Describe associations using three characteristics: direction (positive, negative, or none), form (linear or curved), and strength (strong, moderate, or weak).
- The explanatory variable goes on the $x$-axis and the response variable goes on the $y$-axis.
- Outliers are points far from the overall pattern. Always verify them before removing.
- Correlation does not imply causation: a strong association does not prove that one variable causes changes in the other. Look for lurking variables and confounding factors.
- To establish causation, you need a controlled experiment, not just an observational scatter plot.
Last updated: March 29, 2026