Statistics

Describing Distributions

Last updated: March 2026 · Intermediate

Before you start

You should be comfortable with:

Mean, Median, and Mode
Range and Standard Deviation

Real-world applications

Nursing: Medication dosages, IV drip rates, vital monitoring

Retail & Finance: Discounts, tax, tips, profit margins

When someone hands you a dataset, the first step is to describe its distribution. Before you calculate anything, you need a clear picture of how the data is spread out. A good description tells the reader what the data looks like, where the center is, how spread out the values are, and whether anything unusual is going on. Statisticians have a simple framework for doing this consistently every time.

The SOCS Framework

SOCS is a checklist for describing any distribution:

Letter	Stands For	What to Address
S	Shape	Is the distribution symmetric, skewed, or bimodal?
O	Outliers	Are there any unusually high or low values?
C	Center	What is a typical value? (mean or median)
S	Spread	How much do the values vary? (range, IQR, standard deviation)

When you describe a distribution, address all four elements in order. This ensures you never miss an important feature of the data.

Shape of a Distribution

The shape of a distribution describes the overall pattern of the data when you look at a histogram or dot plot. There are five common shapes you should know.

Symmetric — The left side is roughly a mirror image of the right side. The mean and median are approximately equal. Example: test scores on a well-designed exam typically form a symmetric bell shape.

Skewed Right (Positively Skewed) — The bulk of the data is on the left, with a long tail stretching to the right. The mean is greater than the median because the tail pulls the mean toward higher values. Example: household income in the United States is skewed right — most households earn moderate amounts, but a few very high earners stretch the tail far to the right.

Skewed Left (Negatively Skewed) — The bulk of the data is on the right, with a long tail stretching to the left. The mean is less than the median because the tail pulls the mean toward lower values. Example: age at retirement is skewed left — most people retire around 62 to 67, but a few retire much earlier, creating a left tail.

Bimodal — The distribution has two distinct peaks. This often indicates that the data comes from two different groups mixed together. Example: the heights of a mixed male and female group often show two peaks — one near the average female height and one near the average male height.

Uniform — All values occur with roughly equal frequency. There is no peak and no tail. Example: the outcomes of rolling a fair die — each face (1 through 6) comes up about equally often.

[Figure: Three Common Distribution Shapes]

How Shape Affects Center

The shape of a distribution largely determines how the mean and the median relate to each other. This is one of the most important ideas in descriptive statistics because it tells you which measure of center to report.

Shape	Relationship	Which Measure to Report
Symmetric	Mean $\approx$ Median	Either (mean is standard)
Skewed Right	Mean $>$ Median	Median (resistant to tail)
Skewed Left	Mean $<$ Median	Median (resistant to tail)

Why does this happen? The mean is calculated using every value in the dataset, so extreme values in the tail pull it toward them. The median only depends on the position of the middle value, so it is not affected by how far away the extremes are. When data is skewed, the median is a better representation of a “typical” value.
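The effect can be sketched in a few lines of Python. The dataset and the helper name `with_largest_replaced` are made up for illustration and do not come from the examples in this lesson. As the largest value grows, the mean follows it while the median stays where it is.

```python
from statistics import mean, median

# Hypothetical data, chosen only for illustration.
base = [10, 12, 14, 16, 18]

def with_largest_replaced(data, new_max):
    """Return a copy of data with its largest value swapped for new_max."""
    ordered = sorted(data)
    return ordered[:-1] + [new_max]

for extreme in (18, 50, 500):
    sample = with_largest_replaced(base, extreme)
    print(f"max={extreme:>3}  mean={mean(sample):6.1f}  median={median(sample)}")
```

Only the position of the middle value matters to the median, so replacing 18 with 500 leaves it at 14 even though the mean climbs past 100.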

Effects of Outliers

An outlier is a value that is unusually far from the rest of the data. Outliers have a dramatic effect on some statistics and almost no effect on others.

Example 1: Salary Data

Consider the salaries (in dollars) of seven employees at a small company:

$35{,}000, \quad 38{,}000, \quad 40{,}000, \quad 42{,}000, \quad 45{,}000, \quad 48{,}000, \quad 200{,}000$

The last value ($200,000) is an outlier — likely the owner’s salary.

With the outlier included (all 7 values):

$\text{Mean} = \frac{35{,}000 + 38{,}000 + 40{,}000 + 42{,}000 + 45{,}000 + 48{,}000 + 200{,}000}{7} = \frac{448{,}000}{7} = 64{,}000$

$\text{Median} = 42{,}000 \quad \text{(the 4th value of 7)}$

Without the outlier (first 6 values only):

$\text{Mean} = \frac{35{,}000 + 38{,}000 + 40{,}000 + 42{,}000 + 45{,}000 + 48{,}000}{6} = \frac{248{,}000}{6} \approx 41{,}333$

$\text{Median} = \frac{40{,}000 + 42{,}000}{2} = 41{,}000$

Statistic	With Outlier	Without Outlier	Change
Mean	$64,000	$41,333	$22,667 shift
Median	$42,000	$41,000	$1,000 shift

The mean jumped by over $22,000 because of a single outlier. The median barely moved. This is why the median is called a resistant measure — it resists the influence of extreme values.
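As a quick check, the salary comparison can be reproduced with Python's standard `statistics` module. This is a minimal sketch; the variable names are illustrative.

```python
from statistics import mean, median

# Salaries from Example 1, in dollars.
salaries = [35_000, 38_000, 40_000, 42_000, 45_000, 48_000, 200_000]
without_outlier = salaries[:-1]  # drop the 200,000 value

mean_shift = mean(salaries) - mean(without_outlier)
median_shift = median(salaries) - median(without_outlier)

print(f"mean:   {mean(salaries):,.0f} vs {mean(without_outlier):,.0f} (shift {mean_shift:,.0f})")
print(f"median: {median(salaries):,.0f} vs {median(without_outlier):,.0f} (shift {median_shift:,.0f})")
```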

Statistic	Resistant?	Why
Mean	No	Uses every value in the calculation — outliers pull it
Median	Yes	Depends only on position, not on the magnitude of extremes
Range	No	Directly determined by the most extreme values
IQR	Yes	Based on quartiles, which ignore extremes
Standard Deviation	No	Squaring deviations magnifies the effect of outliers
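The same comparison can be made for the measures of spread. The sketch below reuses the Example 1 salaries and rests on two assumptions not stated in the text: quartiles are found by the median-of-halves method (the middle value is excluded when the count is odd), and the standard deviation is the sample version. The helper names `iqr` and `spread_summary` are illustrative.

```python
from statistics import median, stdev

def iqr(data):
    """IQR via median-of-halves (middle value excluded when n is odd); needs n >= 2."""
    ordered = sorted(data)
    half = len(ordered) // 2
    lower = ordered[:half]
    upper = ordered[-half:]
    return median(upper) - median(lower)

def spread_summary(data):
    return {
        "range": max(data) - min(data),
        "iqr": iqr(data),
        "stdev": stdev(data),  # sample standard deviation (an assumption)
    }

salaries = [35_000, 38_000, 40_000, 42_000, 45_000, 48_000, 200_000]

with_outlier = spread_summary(salaries)
without_outlier = spread_summary(salaries[:-1])
for name in with_outlier:
    print(f"{name:>5}: {without_outlier[name]:>10,.0f} -> {with_outlier[name]:>10,.0f}")
```

Under these assumptions the range and standard deviation grow by more than a factor of ten when the outlier is added, while the IQR moves only from 7,000 to 10,000.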

Putting It All Together: Describing a Distribution

Now let’s use the full SOCS framework to describe a real dataset.

Example 2: Patient Wait Times

A clinic recorded the wait times (in minutes) for 15 patients on a Monday morning:

$5, \; 8, \; 10, \; 12, \; 12, \; 15, \; 18, \; 20, \; 22, \; 25, \; 30, \; 35, \; 45, \; 50, \; 60$

Shape: The distribution is skewed right. Most patients wait between 5 and 25 minutes, but there is a long tail of longer wait times stretching up to 60 minutes.
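One way to see this shape without plotting software is a rough text histogram. The sketch below assumes a bin width of 10 minutes, which is a choice made here for illustration; a different width would change the picture somewhat.

```python
from collections import Counter

wait_times = [5, 8, 10, 12, 12, 15, 18, 20, 22, 25, 30, 35, 45, 50, 60]
BIN_WIDTH = 10  # assumed bin width in minutes

# Map each value to the lower edge of its bin, e.g. 12 -> 10 and 35 -> 30.
counts = Counter((t // BIN_WIDTH) * BIN_WIDTH for t in wait_times)

for lower in range(0, max(wait_times) + BIN_WIDTH, BIN_WIDTH):
    label = f"{lower:>2}-{lower + BIN_WIDTH - 1:<2}"
    print(f"{label} | {'#' * counts[lower]}")
```

The tallest bar is the 10 to 19 minute bin, and the bars shrink toward a thin tail on the high end, which is the pattern of a right-skewed distribution.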

Outliers: To check for outliers, we need the quartiles and the 1.5 $\times$ IQR rule.

There are 15 values, so the median is the 8th value: $Q_2 = 20$ .

Lower half (values 1 through 7): $5, 8, 10, 12, 12, 15, 18$ . The median of these 7 values is the 4th: $Q_1 = 12$ .

Upper half (values 9 through 15): $22, 25, 30, 35, 45, 50, 60$ . The median of these 7 values is the 4th: $Q_3 = 35$ .

$\text{IQR} = Q_3 - Q_1 = 35 - 12 = 23$

$\text{Upper fence} = Q_3 + 1.5 \times \text{IQR} = 35 + 1.5 \times 23 = 35 + 34.5 = 69.5$

$\text{Lower fence} = Q_1 - 1.5 \times \text{IQR} = 12 - 1.5 \times 23 = 12 - 34.5 = -22.5$

Since all values fall between $-22.5$ and $69.5$ , there are no outliers by the 1.5 $\times$ IQR rule. The values 50 and 60 are on the high end but do not exceed the upper fence.

Center: The median wait time is 20 minutes. Because the distribution is skewed right, the median is a better measure of center than the mean. (The mean is $\frac{367}{15} \approx 24.5$ minutes, pulled above the median by the long right tail.)

Spread: The range is $60 - 5 = 55$ minutes, and the IQR is 23 minutes. The middle 50% of patients waited between 12 and 35 minutes.

Full description: “The distribution of patient wait times is skewed right with no outliers. The median wait time is 20 minutes, with an IQR of 23 minutes (from 12 to 35 minutes). While most patients are seen within 25 minutes, a few patients waited considerably longer, up to 60 minutes.”

Real-World Application: Nursing — Describing Blood Glucose Levels

A nurse collects fasting blood glucose readings (mg/dL) from 12 patients in a diabetes screening clinic:

$82, \; 88, \; 91, \; 95, \; 97, \; 100, \; 105, \; 110, \; 118, \; 130, \; 145, \; 210$

Using the SOCS framework:

Shape: Skewed right — most readings cluster between 82 and 130, but there is a long tail to the right caused by the 210 reading.

Outliers: With 12 values, $Q_2 = \frac{100 + 105}{2} = 102.5$ . Lower half: $82, 88, 91, 95, 97, 100$ gives $Q_1 = \frac{91 + 95}{2} = 93$ . Upper half: $105, 110, 118, 130, 145, 210$ gives $Q_3 = \frac{118 + 130}{2} = 124$ .

$\text{IQR} = 124 - 93 = 31$

$\text{Upper fence} = 124 + 1.5 \times 31 = 124 + 46.5 = 170.5$

The value 210 exceeds 170.5, so it is an outlier. (The lower fence is $93 - 46.5 = 46.5$, and no reading falls below it.) This patient likely has uncontrolled diabetes and should be flagged for follow-up.
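Software often computes quartiles with an interpolation rule rather than the median-of-halves method used here, so its $Q_1$ and $Q_3$ may differ slightly from the hand calculation. As a hedged illustration, the sketch below runs both methods offered by Python's `statistics.quantiles` on the glucose readings. The choice of this function is an assumption made for demonstration, not something the lesson prescribes.

```python
from statistics import quantiles

glucose = [82, 88, 91, 95, 97, 100, 105, 110, 118, 130, 145, 210]

for method in ("exclusive", "inclusive"):
    q1, q2, q3 = quantiles(glucose, n=4, method=method)
    upper_fence = q3 + 1.5 * (q3 - q1)
    flagged = [x for x in glucose if x > upper_fence]
    print(f"{method}: Q1={q1}, Q3={q3}, upper fence={upper_fence}, flagged={flagged}")
```

Neither method reproduces the hand values of 93 and 124 exactly, but the reading of 210 lies above the upper fence under both, so the conclusion does not depend on the convention for this dataset.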

Center: Median = 102.5 mg/dL. The median is appropriate here because of the right skew and the outlier.

Spread: IQR = 31 mg/dL. Range = $210 - 82 = 128$ mg/dL, but the range is inflated by the outlier.

Clinical note: Normal fasting glucose is 70 to 99 mg/dL, pre-diabetes is 100 to 125 mg/dL, and diabetes is 126 mg/dL or higher. In this screening, one-third of the patients (4 of 12) fall in the pre-diabetic range and another 3 of 12 have readings of 126 mg/dL or higher, with at least one requiring urgent intervention.
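The counts behind the clinical note can be tallied with a short sketch. The cutoffs are the ones quoted in the note; the function name `classify` and the "below 70" label are assumptions added here, and the code is illustrative only, not a diagnostic tool.

```python
from collections import Counter

glucose = [82, 88, 91, 95, 97, 100, 105, 110, 118, 130, 145, 210]

def classify(reading):
    """Label a fasting glucose reading (mg/dL) using the cutoffs in the clinical note."""
    if reading >= 126:
        return "diabetes"
    if reading >= 100:
        return "pre-diabetes"
    if reading >= 70:
        return "normal"
    return "below 70"  # not covered by the note; this label is an assumption

counts = Counter(classify(r) for r in glucose)
for label, n in counts.items():
    print(f"{label}: {n} of {len(glucose)}")
```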

Practice Problems

Test your understanding with these problems.

Problem 1: A dataset of exam scores has a mean of 74 and a median of 74. What can you say about the shape of the distribution?

When the mean and median are approximately equal, the distribution is likely symmetric. The mean is not being pulled in either direction by a tail, which indicates the data is roughly evenly distributed on both sides of center.

Answer: The distribution is approximately symmetric.

Problem 2: Home sale prices in a neighborhood have a mean of $385,000 and a median of $320,000. Describe the likely shape and explain why.

The mean ($385,000) is considerably larger than the median ($320,000). This happens when a few very expensive homes pull the mean upward while the median remains anchored at the middle position.

Answer: The distribution is skewed right (positively skewed). A few high-priced homes create a long right tail, pulling the mean above the median.

Problem 3: Dataset: 10, 12, 13, 14, 14, 15, 15, 16, 80. Calculate the mean and median. Which better represents a “typical” value?

$\text{Mean} = \frac{10 + 12 + 13 + 14 + 14 + 15 + 15 + 16 + 80}{9} = \frac{189}{9} = 21$

$\text{Median} = 14 \quad \text{(the 5th of 9 values)}$

The mean of 21 is higher than 8 of the 9 values — it does not represent a typical value at all. The outlier (80) inflated it.

Answer: The median of 14 is the better measure of center. The outlier at 80 pulls the mean to 21, which is misleadingly high.

Problem 4: Describe this dataset using the SOCS framework: 2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 7, 7, 8.

Shape: Roughly symmetric, slightly mounded in the center. Values peak around 5 and taper off on both sides.

Outliers: $Q_1 = 4$ (median of first 8 values: $\frac{4+4}{2} = 4$ ), $Q_3 = 6$ (median of last 8 values: $\frac{6+6}{2} = 6$ ). IQR = $6 - 4 = 2$ . Lower fence = $4 - 3 = 1$ . Upper fence = $6 + 3 = 9$ . All values are between 1 and 9, so no outliers.

Center: Median = $\frac{5+5}{2} = 5$ . Mean = $\frac{2+3+3+4+4+4+5+5+5+5+6+6+6+7+7+8}{16} = \frac{80}{16} = 5$ . Mean and median agree, consistent with a symmetric shape.

Spread: Range = $8 - 2 = 6$ . IQR = $2$ .

Answer: The distribution is approximately symmetric with no outliers, centered at 5, with a range of 6 and an IQR of 2.

Problem 5: Why would a retail manager prefer the median over the mean when reporting “typical” daily sales, given that Black Friday and holiday sales are included in the dataset?

Black Friday and holiday sales are extreme high values (outliers) that would pull the mean upward, making the “typical” day appear to bring in far more sales than it actually does. The median is resistant to these outliers and gives a more accurate picture of what a normal day’s sales look like.

Answer: The median is preferred because holiday sales are outliers that inflate the mean. The median reflects a typical day without being distorted by a few extreme values.

Key Takeaways

Use the SOCS framework (Shape, Outliers, Center, Spread) to describe any distribution systematically
Symmetric distributions have mean $\approx$ median; skewed right distributions have mean $>$ median; skewed left distributions have mean $<$ median
Outliers strongly affect the mean, range, and standard deviation, but have little impact on the median and IQR
When data is skewed or has outliers, the median and IQR are better measures of center and spread than the mean and standard deviation
Always describe the context — a distribution is not just numbers, it’s measurements of something real (wait times, salaries, blood glucose levels)


Last updated: March 29, 2026