Describing Distributions

Shape, center, spread, and outliers (SOCS)

Introduction

Looking at a graph is just the first step. To fully understand data, we must describe what we see using precise statistical language. The framework SOCS (Shape, Outliers, Center, Spread) provides a systematic approach to describing any distribution.

Shape

Shape describes the overall pattern of the distribution.

Symmetry

Symmetric Distribution:

Left side mirrors right side
Mean ≈ Median
Balanced around center

Examples:

Normal (bell-shaped) distributions
Uniform distributions
Heights of adult males

How to identify: If you fold the distribution at the center, both sides match

Skewness

Right-Skewed (Positively Skewed):

Tail extends to the right
Mean > Median
Most data on left, few high values pull mean right

Examples:

Income (most people earn moderate amounts, few earn very high)
Home prices
Test scores when test is hard (most score low, few score high)

Visual: Peak on left, tail stretches right

Left-Skewed (Negatively Skewed):

Tail extends to the left
Mean < Median
Most data on right, few low values pull mean left

Examples:

Age at death (most live to old age, few die young)
Test scores when test is easy (most score high, few score low)

Visual: Peak on right, tail stretches left

Memory aid: Skewness direction = direction of the tail (not the peak!)
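
The link between skew and the mean/median gap can be checked on a small sample. The sketch below uses made-up income values (in thousands of dollars) chosen only for illustration:

```python
from statistics import mean, median

# Hypothetical right-skewed sample: most values are moderate, one is very large.
incomes = [30, 35, 38, 40, 42, 45, 50, 250]

income_mean = mean(incomes)      # pulled toward the long right tail
income_median = median(incomes)  # stays with the bulk of the data

print(f"mean = {income_mean}, median = {income_median}")  # mean = 66.25, median = 41.0
```

The single large value drags the mean well above the median, which is the right-skew pattern described above.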

Modality

Number of peaks (modes) in distribution:

Unimodal: One clear peak

Most common pattern
Examples: heights within one sex, standardized test scores

Bimodal: Two distinct peaks

Suggests two different groups
Examples: Heights of all adults combined (male peak and female peak, though in real data the two peaks can overlap heavily)

Multimodal: More than two peaks

Multiple distinct groups
Less common

Uniform: No peaks, all values equally likely

Flat distribution
Example: Rolling a fair die

How to determine: Count prominent "humps" in the distribution
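
Counting humps can be sketched on histogram bar heights. This is a rough illustration only: the function name count_peaks and the bar heights are invented here, and real, noisy data would need smoothing or a stricter rule for what counts as "prominent".

```python
def count_peaks(counts):
    """Count local maxima in a list of histogram bar heights.

    A bar counts as a peak when it is strictly taller than both neighbors.
    Missing neighbors at the two ends are treated as height 0.
    """
    padded = [0] + list(counts) + [0]
    return sum(
        1
        for i in range(1, len(padded) - 1)
        if padded[i] > padded[i - 1] and padded[i] > padded[i + 1]
    )

unimodal_counts = [1, 3, 7, 9, 6, 2, 1]   # hypothetical bar heights
bimodal_counts = [2, 8, 4, 1, 5, 9, 3]
uniform_counts = [5, 5, 5, 5]

print(count_peaks(unimodal_counts))  # 1
print(count_peaks(bimodal_counts))   # 2
print(count_peaks(uniform_counts))   # 0 (flat, so no bar stands above its neighbors)
```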

Special Shapes

Normal (Bell-Shaped):

Symmetric
Unimodal
Mean = Median = Mode
Most data near center, decreasing towards extremes
Follows empirical rule (68-95-99.7)
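
The 68-95-99.7 figures can be checked against the normal curve itself. A minimal sketch using NormalDist from Python's standard library; the helper name share_within is made up for this illustration:

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k):.1%}")
```

The three printed shares come out close to 68%, 95%, and 99.7%, which is where the rule gets its name.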

Uniform:

All values equally likely
Rectangular shape
No mode

Exponential:

Decreasing pattern
Extremely right-skewed
Many small values, few large values

Outliers

Outliers are observations that fall notably far from the overall pattern.

Identifying Outliers

Visual method:

Look for isolated points
Values separated from main cluster

1.5 × IQR Rule (for boxplots):

Calculate $IQR = Q3 - Q1$
Lower fence: $Q1 - 1.5 \times IQR$
Upper fence: $Q3 + 1.5 \times IQR$
Outliers fall beyond fences

Example:

Q1 = 65, Q3 = 85
IQR = 85 - 65 = 20
Lower fence: 65 - 1.5(20) = 65 - 30 = 35
Upper fence: 85 + 1.5(20) = 85 + 30 = 115
Values below 35 or above 115 are outliers
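
The fence arithmetic above can be written as a small helper. This is a sketch, not a standard library function: the name iqr_fences and the list of scores are assumptions made for illustration.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return (lower_fence, upper_fence) for the k x IQR outlier rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

lower, upper = iqr_fences(65, 85)
print(lower, upper)  # 35.0 115.0

scores = [30, 62, 70, 78, 85, 90, 120]  # hypothetical values
outliers = [x for x in scores if x < lower or x > upper]
print(outliers)  # [30, 120]
```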

Standard deviation method:

Flag values more than 2 or 3 standard deviations from the mean
Less commonly used
Appropriate for symmetric distributions
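
A sketch of the standard deviation method, with the cutoff k left as a parameter. The function name flag_by_sd is hypothetical, and the data are the same seven values used in the range example later in this guide.

```python
from statistics import mean, stdev

def flag_by_sd(values, k=2):
    """Return the values lying more than k sample standard deviations from the mean."""
    m = mean(values)
    s = stdev(values)
    return [x for x in values if abs(x - m) > k * s]

data = [12, 15, 18, 20, 22, 25, 100]

print(flag_by_sd(data, k=2))  # [100]
print(flag_by_sd(data, k=3))  # []
```

The value 100 is caught at 2 standard deviations but not at 3, because the extreme value inflates the standard deviation it is being measured against. That sensitivity is one reason the 1.5 × IQR rule is often preferred.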

Reporting Outliers

Always:

Note their presence: "There is one outlier at 150"
Give actual values if possible
Consider potential causes

Potential causes:

Measurement error: Mistake in recording
Data entry error: Typo when entering data
Legitimate extreme value: Unusual but real observation
Different population: Doesn't belong in this group

What to do:

Investigate cause if possible
Report with and without outliers (if they affect conclusions)
Don't automatically delete (unless proven error)

Center

Center describes the "typical" or "middle" value.

Mean vs. Median

When to use each:

Mean ( $\bar{x}$ ):

Symmetric distributions
No outliers
Want to use all data values
Mathematical properties needed

Median:

Skewed distributions
Presence of outliers
Want resistant measure
Ordinal data

Relationship to shape:

Symmetric: Mean ≈ Median
Right-skewed: Mean > Median (mean pulled right by tail)
Left-skewed: Mean < Median (mean pulled left by tail)

Mode

Definition: Most frequently occurring value

When reported:

Categorical data
Describing bimodal distributions
Identifying popular values

Limitations:

May not exist (all values occur once)
May not be unique (multiple modes)
Not useful for continuous data with no repeated values
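
Both limitations show up quickly in code. A minimal sketch using multimode from Python's standard library (available in Python 3.8 and later); the data values are invented:

```python
from statistics import multimode

shoe_sizes = [7, 8, 8, 9, 9, 10]   # hypothetical: two values tie for most frequent
all_unique = [1.2, 3.4, 5.6]       # hypothetical: no value repeats

print(multimode(shoe_sizes))  # [8, 9]
print(multimode(all_unique))  # [1.2, 3.4, 5.6]
```

When no value repeats, every value is returned as a "mode", which is why the mode says little about continuous measurements.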

Spread

Spread describes the variability or dispersion of data.

Range

Definition: Maximum - Minimum

Formula: $Range = Max - Min$

Advantages:

Easy to calculate
Easy to understand
Gives sense of total spread

Disadvantages:

Affected by outliers
Ignores distribution between extremes
Only uses two values

Example:

Data: 12, 15, 18, 20, 22, 25, 100
Range = 100 - 12 = 88
Dominated by outlier (100)
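
The effect of the single extreme value can be seen by computing the range with and without it. A short sketch using the data from the example:

```python
data = [12, 15, 18, 20, 22, 25, 100]

full_range = max(data) - min(data)

# Same data with the extreme value 100 set aside, for comparison only.
without_extreme = [x for x in data if x != 100]
reduced_range = max(without_extreme) - min(without_extreme)

print(full_range)     # 88
print(reduced_range)  # 13
```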

Interquartile Range (IQR)

Definition: Range of middle 50% of data

Formula: $IQR = Q3 - Q1$

Advantages:

Resistant to outliers
Focuses on middle of distribution
Useful for boxplots

Disadvantages:

Ignores lowest 25% and highest 25%
Less intuitive than range

Example:

Q1 = 65, Q3 = 85
IQR = 85 - 65 = 20
Middle 50% of data spans 20 points

Interpretation: "The middle half of the data spans [IQR] units"
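
The IQR's resistance can be illustrated with Python's statistics.quantiles. Note that textbooks and software use slightly different quartile conventions, so the quartiles below (from the function's default "exclusive" method) may differ a little from a hand calculation. The helper name iqr is an assumption for this sketch.

```python
from statistics import quantiles

def iqr(values):
    """Interquartile range using the default quartile method of statistics.quantiles."""
    q1, _, q3 = quantiles(values, n=4)
    return q3 - q1

data = [12, 15, 18, 20, 22, 25, 100]
more_extreme = [12, 15, 18, 20, 22, 25, 1000]  # same data, larger extreme value

print(iqr(data))          # 10.0
print(iqr(more_extreme))  # 10.0
```

Making the largest value ten times bigger leaves the IQR unchanged, while the range would jump from 88 to 988.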

Standard Deviation

Definition: Square root of the average squared deviation from the mean

Interpretation: Roughly the typical distance of a value from the mean

Advantages:

Uses all data values
Has important mathematical properties
Basis for many statistical methods

Disadvantages:

Affected by outliers
Less intuitive than range
Most informative for roughly symmetric distributions

When to report:

Symmetric distributions
No extreme outliers
Want to use standard statistical methods
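
A step-by-step sketch of the sample standard deviation, assuming the usual n - 1 divisor. The heights are invented, and the mean absolute deviation is included only to show that it is a different number from the standard deviation.

```python
from math import sqrt
from statistics import mean, stdev

heights = [64, 66, 68, 70, 72]  # hypothetical heights in inches

m = mean(heights)                                # 68
squared_devs = [(x - m) ** 2 for x in heights]   # 16, 4, 0, 4, 16
sample_sd = sqrt(sum(squared_devs) / (len(heights) - 1))  # sqrt(40 / 4)

# Literal "average distance from the mean", for comparison.
mean_abs_dev = mean([abs(x - m) for x in heights])  # (4 + 2 + 0 + 2 + 4) / 5

print(round(sample_sd, 2))  # 3.16
print(mean_abs_dev)         # 2.4
```

The hand calculation agrees with statistics.stdev, and the standard deviation (about 3.16 inches) is somewhat larger than the plain average distance (2.4 inches) because squaring gives extra weight to the values farthest from the mean.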

Context Matters!

Units

Always include units in descriptions:

❌ "The mean is 68"
✓ "The mean height is 68 inches"

❌ "The standard deviation is 3.5"
✓ "The standard deviation of test scores is 3.5 points"

Comparison

Describe in context of:

What you'd expect
Other groups
Previous studies

Examples:

"Students averaged 85%, which is higher than last year's 78%"
"The standard deviation of 15 points shows high variability"

Complete Description Template

A complete distribution description includes:

Shape: "The distribution of [variable] is [symmetric/right-skewed/left-skewed] and [unimodal/bimodal/etc.]"

Outliers: "There is/are [number] outlier(s) at [value(s)]" or "There are no apparent outliers"

Center: "The [mean/median] [variable] is [value with units]"

Spread: "The [variable] ranges from [min] to [max] [units]" or "The standard deviation is [value] [units]"
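
The numeric slots in this template can be gathered before writing the sentences. The sketch below is one possible helper, not a standard function: the name socs_numbers and the class scores are made up, quartiles follow the default method of statistics.quantiles, and shape still has to be judged from a graph.

```python
from statistics import mean, median, quantiles

def socs_numbers(values):
    """Collect the numbers a SOCS write-up usually cites."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "mean": mean(values),
        "median": median(values),
        "min": min(values),
        "max": max(values),
        "iqr": iqr,
        "outliers": [x for x in values if x < low_fence or x > high_fence],
    }

scores = [45, 70, 76, 78, 80, 82, 84, 86, 88, 94, 98]  # hypothetical class scores (%)
summary = socs_numbers(scores)

print(summary["median"], summary["iqr"], summary["outliers"])  # 82 12.0 [45]
```

With these values the write-up would cite a median of 82%, an IQR of 12 percentage points, scores from 45% to 98%, and one low outlier at 45%.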

Example:

Data: Test scores in AP Statistics class

"The distribution of test scores is slightly right-skewed and unimodal with one outlier at 45%. The median score is 82%, indicating that half the students scored below 82%. Scores range from 45% to 98%, with an IQR of 12 percentage points, meaning the middle 50% of students scored within a 12-point range. The outlier at 45% is notably below the main cluster of scores between 70% and 98%."

Common Patterns and Interpretations

What Shape Tells Us

Symmetric:

Process or measurement is balanced
Natural variation around center
Use mean and standard deviation

Right-skewed:

Floor effect (minimum limit)
Most values small, few very large
Use median and IQR

Left-skewed:

Ceiling effect (maximum limit)
Most values large, few very small
Use median and IQR

Bimodal:

Two distinct groups mixed together
Consider separating and analyzing separately

What Outliers Tell Us

Potential meanings:

Errors (investigate and possibly correct)
Unusual but legitimate cases
Different population mixed in
Rare but important events

Impact:

Affect mean more than median
Affect standard deviation more than IQR
Can change conclusions if not addressed

What Spread Tells Us

Large spread:

High variability
Data quite different from typical value
Less predictability

Small spread:

Low variability
Data close to typical value
More consistency, predictability

Comparing Distributions

When comparing two or more distributions:

Address each of SOCS:

Shape:

"Group A is symmetric while Group B is right-skewed"

Outliers:

"Both groups have outliers, but Group A's are more extreme"

Center:

"Group A has a higher median (75) than Group B (68)"

Spread:

"Group A shows more variability (SD = 12) than Group B (SD = 8)"

Example comparison:

"Both male and female height distributions are roughly symmetric and unimodal. Males have a higher mean height (70 inches) compared to females (64 inches), a difference of 6 inches. Both distributions have similar spreads, with standard deviations of approximately 3 inches. Neither distribution shows outliers."

Common Mistakes

❌ Confusing skewness direction (it's the tail, not the peak!)
❌ Using mean with skewed data (median is more appropriate)
❌ Reporting center without spread (both are needed!)
❌ Ignoring units (always include them)
❌ Incomplete descriptions (use full SOCS framework)
❌ Not describing in context (relate to actual situation)
❌ Confusing SD and IQR (they measure spread differently)

Quick Reference

SOCS Framework:

Shape: Symmetric? Skewed (which direction)? Unimodal/bimodal?
Outliers: Present? Where? How many?
Center: Mean or median (with units!)
Spread: Range, IQR, or SD (with units!)

Mean vs. Median:

Symmetric, no outliers → Use mean
Skewed or outliers → Use median

SD vs. IQR:

Symmetric, no outliers → Use SD
Skewed or outliers → Use IQR

Skewness:

Right-skewed: Mean > Median, tail to right
Left-skewed: Mean < Median, tail to left
Symmetric: Mean ≈ Median

Remember: A complete description tells the story of the data. Don't just report numbers — interpret them in context and explain what they mean!

📚 Practice Problems

Problem 1 (easy)

❓ Question:

Describe the shape of a distribution that is:
a) Symmetric
b) Skewed right
c) Skewed left

💡 Solution:

Step 1: Symmetric distribution

Mirror image on both sides of center
Mean ≈ Median
Example: Normal distribution, heights
Tail length equal on both sides

Step 2: Skewed right (positive skew)

Tail extends to the right
Mean > Median (pulled toward tail)
Example: Income, house prices
Most data on left, few high values

Step 3: Skewed left (negative skew)

Tail extends to the left
Mean < Median (pulled toward tail)
Example: Test scores (when easy), age at death
Most data on right, few low values

Memory trick: "The skew points where the tail points"

Visual summary:
Symmetric: <-center->
Right skew: <-center---->
Left skew: <----center->

Answer:
a) Symmetric: balanced on both sides, mean ≈ median
b) Skewed right: long right tail, mean > median
c) Skewed left: long left tail, mean < median

Problem 2 (easy)

❓ Question:

A dataset has the following properties: Mean = 75, Median = 80. What can you conclude about the shape of the distribution?

💡 Solution:

Step 1: Compare mean and median
Mean = 75
Median = 80
Mean < Median

Step 2: Recall the relationship
When Mean < Median:

Distribution is likely skewed LEFT (negative skew)
Tail points to lower values
A few low values pull the mean down

Step 3: Explain why
The mean is sensitive to extreme values
The median is resistant to outliers
If the mean is pulled below the median, the usual cause is a few low outliers or a left tail

Step 4: Visualize
Most data is clustered around 80 (median)
Some values fall well below 75
These low values drag the mean down below the median

Example: If test scores are mostly in the 70s-90s but a few students scored in the 40s-50s, the mean would be pulled down while the median stays high.

Answer: The distribution is most likely skewed LEFT (negatively skewed) because Mean < Median, which points to a longer tail toward lower values. A graph is still needed to confirm the shape.
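
To make this concrete, here is a small made-up set of scores whose mean and median match the question. It is one of many data sets that would fit, shown only as an illustration:

```python
from statistics import mean, median

scores = [40, 78, 80, 85, 92]  # hypothetical: one low score, the rest clustered high

print(mean(scores))    # 75
print(median(scores))  # 80
```

The single low score of 40 pulls the mean down to 75 while the median stays at 80.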

Problem 3 (medium)

❓ Question:

Identify whether each distribution is unimodal, bimodal, or multimodal:
a) Heights of adult humans (all genders)
b) Test scores where most students got A or F
c) Ages of people at a kids movie theater

💡 Solution:

Step 1: Understand modes
Unimodal: One clear peak
Bimodal: Two distinct peaks
Multimodal: More than two peaks

Step 2: Analyze each scenario

a) Heights of all adult humans

Women cluster around ~5'4" (163 cm)
Men cluster around ~5'9" (175 cm)
Two distinct groups
Answer: BIMODAL

b) Test scores with mostly A or F

Cluster around 90-100 (A students)
Cluster below 60 (F students)
Few in between (B, C, D)
Answer: BIMODAL

c) Ages at kids movie

Young children (ages 5-12)
Parents (ages 30-45)
Possibly grandparents (ages 60-75)
Could have 2-3 distinct groups
Answer: BIMODAL or MULTIMODAL (likely 2-3 peaks)

Answer:
a) Bimodal (male and female heights)
b) Bimodal (A and F peaks)
c) Bimodal/Multimodal (children and adults)

Problem 4 (medium)

❓ Question:

Describe this distribution using the SOCS framework (Shape, Outliers, Center, Spread): Data shows exam scores with most values between 70-85, mean=77, median=78, one score at 45, and range=40.

💡 Solution:

SOCS Framework for describing distributions:

S - SHAPE:
Mean (77) ≈ Median (78), very close
This suggests a roughly SYMMETRIC distribution
However, the low outlier (45) suggests a slight left skew
Overall: Roughly symmetric, possibly slight left skew

O - OUTLIERS:
Score of 45 is notably low
With most scores between 70 and 85, 45 sits at least 25 points below the main cluster
Should be checked with the 1.5×IQR rule, but it appears to be an outlier

C - CENTER:
Mean = 77
Median = 78
Typical exam score is around 77-78
Mean pulled slightly down by the low outlier

S - SPREAD:
Range = 40 points (from 45 to 85)
Most data in the 70-85 range (about 15 points)
Without the outlier, the spread would be smaller
IQR likely around 10-15 points

Complete SOCS description: "The distribution of exam scores is roughly symmetric with a possible slight left skew due to one low outlier at 45. The center of the distribution is around 77-78 (mean and median nearly equal). The scores spread from 45 to 85, a range of 40 points, though most scores cluster between 70-85. The score of 45 appears to be an outlier, sitting well below the main body of data."

Answer: Symmetric/slight left skew, one low outlier (45), center ~77-78, range=40 with most data in 15-point range.

Problem 5 (hard)

❓ Question:

Two distributions have the same mean (50) and the same range (from 20 to 80). Distribution A is uniform (flat), while Distribution B is normal (bell-shaped). Which distribution would have a larger standard deviation, and why?

💡 Solution:

Step 1: Visualize both distributions
Both: Mean = 50, Range = 60 (from 20 to 80)

Distribution A (Uniform):

Data spread EVENLY from 20 to 80
Every value equally likely
Flat histogram

Distribution B (Normal):

Data concentrated near mean (50)
Fewer values at extremes (20 and 80)
Bell-shaped curve

Step 2: Understand standard deviation
SD measures the typical distance of values from the mean
Larger SD = more spread out from center

Step 3: Compare spread from mean

Distribution A (Uniform):

Many values far from mean (50)
Values at 20 and 80 are 30 units from mean
Lots of data at extremes
Higher average distance from mean

Distribution B (Normal):

Most data near mean (50)
Few values at 20 and 80
Less data at extremes
Lower average distance from mean

Step 4: Make a rough estimate
Uniform: SD ≈ range/3.5 ≈ 60/3.5 ≈ 17 (exactly range/√12 for a continuous uniform distribution)
Normal: SD ≈ range/6 ≈ 60/6 = 10 (treating the range as about 3 SD on each side of the mean)
(These are approximations)

Answer: Distribution A (uniform) has LARGER standard deviation because more of its data is spread far from the mean, while Distribution B (normal) has most data clustered near the center. Even with the same range, uniform distributions have more variability than normal distributions.
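
The two estimates can be reproduced in a few lines. This sketch assumes a continuous uniform distribution on 20 to 80, whose standard deviation is range/√12, and uses the range/6 rule of thumb for the bell-shaped case, which is only an approximation:

```python
from math import sqrt

low, high = 20, 80
data_range = high - low                # 60

uniform_sd = data_range / sqrt(12)     # SD of a continuous uniform distribution
normal_sd_estimate = data_range / 6    # rule of thumb: range covers about 6 SD

print(round(uniform_sd, 2))   # 17.32
print(normal_sd_estimate)     # 10.0
```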
