Describing Distributions

Shape, outliers, center, and spread (SOCS)

Introduction

Looking at a graph is just the first step. To fully understand data, we must describe what we see using precise statistical language. The framework SOCS (Shape, Outliers, Center, Spread) provides a systematic approach to describing any distribution.

Shape

Shape describes the overall pattern of the distribution.

Symmetry

Symmetric Distribution:

  • Left side mirrors right side
  • Mean ≈ Median
  • Balanced around center

Examples:

  • Normal (bell-shaped) distributions
  • Uniform distributions
  • Heights of adult males

How to identify: If you fold the distribution at the center, both sides match

Skewness

Right-Skewed (Positively Skewed):

  • Tail extends to the right
  • Mean > Median
  • Most data on left, few high values pull mean right

Examples:

  • Income (most people earn moderate amounts, few earn very high)
  • Home prices
  • Test scores when the test is hard (most score low, few score high)

Visual: Peak on left, tail stretches right

Left-Skewed (Negatively Skewed):

  • Tail extends to the left
  • Mean < Median
  • Most data on right, few low values pull mean left

Examples:

  • Age at death (most live to old age, few die young)
  • Test scores when the test is easy (most score high, few score low)

Visual: Peak on right, tail stretches left

Memory aid: Skewness direction = direction of the tail (not the peak!)
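
A quick numeric check of the tail rule, as a minimal Python sketch. Both data sets are made up for illustration: the first has a long right tail, and the second is its mirror image.

```python
from statistics import mean, median

# Hypothetical right-skewed data: most values are small, with a long right tail.
right_skewed = [20, 22, 25, 27, 30, 32, 35, 40, 60, 120]
# Mirror every value around 140 to get a left-skewed data set.
left_skewed = [140 - x for x in right_skewed]

for name, data in [("right-skewed", right_skewed), ("left-skewed", left_skewed)]:
    print(f"{name}: mean = {mean(data):.1f}, median = {median(data):.1f}")
```

In the first set the mean (41.1) sits above the median (31). In the mirrored set the mean (98.9) sits below the median (109). In both cases the mean moves toward the tail.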

Modality

Number of peaks (modes) in distribution:

Unimodal: One clear peak

  • Most common pattern
  • Examples: heights within a single sex, standardized test scores

Bimodal: Two distinct peaks

  • Suggests two different groups
  • Example: Heights of all adults combined (a male peak and a female peak)

Multimodal: More than two peaks

  • Multiple distinct groups
  • Less common

Uniform: No peaks, all values equally likely

  • Flat distribution
  • Example: Rolling a fair die

How to determine: Count prominent "humps" in the distribution

Special Shapes

Normal (Bell-Shaped):

  • Symmetric
  • Unimodal
  • Mean = Median = Mode
  • Most data near center, decreasing towards extremes
  • Follows empirical rule (68-95-99.7)
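
The 68-95-99.7 figures can be checked with Python's standard library. This is a small sketch using `statistics.NormalDist`. The helper name `share_within` is introduced here for illustration only.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k):.1%}")
```

The three proportions come out at roughly 68%, 95%, and 99.7%, which is where the rule gets its name.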

Uniform:

  • All values equally likely
  • Rectangular shape
  • No mode

Exponential:

  • Decreasing pattern
  • Extremely right-skewed
  • Many small values, few large values

Outliers

Outliers are observations that fall notably far from the overall pattern.

Identifying Outliers

Visual method:

  • Look for isolated points
  • Values separated from main cluster

1.5 × IQR Rule (for boxplots):

  • Calculate $IQR = Q3 - Q1$
  • Lower fence: $Q1 - 1.5 \times IQR$
  • Upper fence: $Q3 + 1.5 \times IQR$
  • Outliers fall beyond fences

Example:

  • Q1 = 65, Q3 = 85
  • IQR = 85 - 65 = 20
  • Lower fence: 65 - 1.5(20) = 65 - 30 = 35
  • Upper fence: 85 + 1.5(20) = 85 + 30 = 115
  • Values below 35 or above 115 are outliers
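
The same fence arithmetic as a short Python sketch. The function name `iqr_fences` and the list of scores are hypothetical. Only the quartiles 65 and 85 come from the example above.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) outlier fences for the k x IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

lower, upper = iqr_fences(65, 85)  # quartiles from the worked example

# Hypothetical scores to screen against the fences.
scores = [30, 58, 72, 90, 115, 120]
outliers = [x for x in scores if x < lower or x > upper]

print(lower, upper, outliers)  # 35.0 115.0 [30, 120]
```

A value that lands exactly on a fence (115 here) is not flagged, because the rule only counts values beyond the fences.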

Standard deviation method:

  • Flag values more than 2 (or, more strictly, 3) standard deviations from the mean
  • Less commonly used
  • Appropriate for symmetric distributions
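
A minimal sketch of the standard deviation method, assuming the sample standard deviation is used. The helper `flag_by_sd` and the data are hypothetical.

```python
from statistics import mean, stdev

def flag_by_sd(values, threshold=2.0):
    """Return values more than `threshold` sample SDs away from the mean."""
    center = mean(values)
    spread = stdev(values)
    return [x for x in values if abs(x - center) > threshold * spread]

# Hypothetical data with one extreme value.
data = [70, 72, 74, 75, 76, 78, 80, 81, 83, 150]

print(flag_by_sd(data, threshold=2))  # [150]
print(flag_by_sd(data, threshold=3))  # []
```

The value 150 is flagged at 2 standard deviations but not at 3. The extreme value inflates the standard deviation it is measured against, which is one reason this method is used less often than the 1.5 × IQR rule.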

Reporting Outliers

Always:

  • Note their presence: "There is one outlier at 150"
  • Give actual values if possible
  • Consider potential causes

Potential causes:

  • Measurement error: Mistake in recording
  • Data entry error: Typo when entering data
  • Legitimate extreme value: Unusual but real observation
  • Different population: Doesn't belong in this group

What to do:

  • Investigate cause if possible
  • Report with and without outliers (if they affect conclusions)
  • Don't automatically delete (unless proven error)

Center

Center describes the "typical" or "middle" value.

Mean vs. Median

When to use each:

Mean ($\bar{x}$):

  • Symmetric distributions
  • No outliers
  • Want to use all data values
  • Mathematical properties needed

Median:

  • Skewed distributions
  • Presence of outliers
  • Want resistant measure
  • Ordinal data

Relationship to shape:

  • Symmetric: Mean ≈ Median
  • Right-skewed: Mean > Median (mean pulled right by tail)
  • Left-skewed: Mean < Median (mean pulled left by tail)

Mode

Definition: Most frequently occurring value

When reported:

  • Categorical data
  • Describing bimodal distributions
  • Identifying popular values

Limitations:

  • May not exist (all values occur once)
  • May not be unique (multiple modes)
  • Not useful for continuous data with no repeated values

Spread

Spread describes the variability or dispersion of data.

Range

Definition: Maximum - Minimum

Formula: $Range = Max - Min$

Advantages:

  • Easy to calculate
  • Easy to understand
  • Gives sense of total spread

Disadvantages:

  • Affected by outliers
  • Ignores distribution between extremes
  • Only uses two values

Example:

  • Data: 12, 15, 18, 20, 22, 25, 100
  • Range = 100 - 12 = 88
  • Dominated by outlier (100)
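
The effect of the single value 100 on the range can be seen by computing it twice, with and without that value. This is a small Python sketch using the data from the example.

```python
data = [12, 15, 18, 20, 22, 25, 100]

full_range = max(data) - min(data)

# Drop the extreme value to see how much of the range it accounts for.
trimmed = [x for x in data if x != 100]
trimmed_range = max(trimmed) - min(trimmed)

print(full_range, trimmed_range)  # 88 13
```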

Interquartile Range (IQR)

Definition: Range of middle 50% of data

Formula: $IQR = Q3 - Q1$

Advantages:

  • Resistant to outliers
  • Focuses on middle of distribution
  • Useful for boxplots

Disadvantages:

  • Ignores lowest 25% and highest 25%
  • Less intuitive than range

Example:

  • Q1 = 65, Q3 = 85
  • IQR = 85 - 65 = 20
  • Middle 50% of data spans 20 points

Interpretation: "The middle half of the data spans [IQR] [units]"
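
Quartiles can be computed with `statistics.quantiles` from Python's standard library. The nine scores below are hypothetical, chosen so that the "inclusive" method gives the quartiles used in the example. Quartile conventions differ between textbooks and software, so the sketch also shows the library's default "exclusive" method.

```python
from statistics import quantiles

# Hypothetical scores.
data = [55, 60, 65, 70, 75, 80, 85, 90, 95]

# "inclusive" treats the data as the whole population of interest.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
print(q1, q3, iqr)  # 65.0 85.0 20.0

# The default "exclusive" method gives slightly different quartiles.
q1_excl, _, q3_excl = quantiles(data, n=4)
print(q1_excl, q3_excl, q3_excl - q1_excl)  # 62.5 87.5 25.0
```

Because the two conventions give different results on small data sets, it helps to state which one was used when reporting an IQR.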

Standard Deviation

Definition: Square root of the average squared deviation from the mean

Interpretation: Roughly the typical distance of a value from the mean

Advantages:

  • Uses all data values
  • Has important mathematical properties
  • Basis for many statistical methods

Disadvantages:

  • Affected by outliers
  • Less intuitive than range
  • Most informative for roughly symmetric distributions

When to report:

  • Symmetric distributions
  • No extreme outliers
  • Want to use standard statistical methods
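
The standard deviation is built from squared deviations, so it is close to, but not the same as, the plain average distance from the mean. The sketch below works through the calculation by hand on a small hypothetical data set and compares it with the standard library's `pstdev` (population) and `stdev` (sample) functions.

```python
from math import sqrt
from statistics import mean, pstdev, stdev

# Hypothetical measurements with a mean of exactly 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]
center = mean(data)

squared_devs = [(x - center) ** 2 for x in data]

population_sd = sqrt(sum(squared_devs) / len(data))        # divide by n
sample_sd = sqrt(sum(squared_devs) / (len(data) - 1))      # divide by n - 1
mean_abs_dev = sum(abs(x - center) for x in data) / len(data)

print(population_sd, round(sample_sd, 3), mean_abs_dev)  # 2.0 2.138 1.5
print(pstdev(data), round(stdev(data), 3))               # 2.0 2.138
```

Here the population SD is 2.0 while the literal average distance from the mean is 1.5. Squaring gives large deviations extra weight, which is also why the SD is sensitive to outliers.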

Context Matters!

Units

Always include units in descriptions:

❌ "The mean is 68"
✓ "The mean height is 68 inches"

❌ "The standard deviation is 3.5"
✓ "The standard deviation of test scores is 3.5 points"

Comparison

Describe in context of:

  • What you'd expect
  • Other groups
  • Previous studies

Examples:

  • "Students averaged 85%, which is higher than last year's 78%"
  • "The standard deviation of 15 points shows high variability"

Complete Description Template

A complete distribution description includes:

Shape: "The distribution of [variable] is [symmetric/right-skewed/left-skewed] and [unimodal/bimodal/etc.]"

Outliers: "There is/are [number] outlier(s) at [value(s)]" or "There are no apparent outliers"

Center: "The [mean/median] [variable] is [value with units]"

Spread: "The [variable] ranges from [min] to [max] [units]" or "The standard deviation is [value] [units]"
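
The outlier, center, and spread entries of the template can be computed directly. Shape still has to be judged from a graph. The sketch below is one possible helper, not a standard function. The name `socs_numbers`, the choice of the "inclusive" quartile method, and the scores themselves are all assumptions made for illustration.

```python
from statistics import quantiles

def socs_numbers(values):
    """Compute the numeric pieces of a SOCS description (everything except shape)."""
    data = sorted(values)
    q1, q2, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "outliers": [x for x in data if x < lower or x > upper],
        "median": q2,
        "iqr": iqr,
        "min": data[0],
        "max": data[-1],
    }

# Made-up test scores in percent (not real class data).
scores = [45, 70, 74, 76, 80, 81, 82, 84, 86, 88, 90, 94, 98]
summary = socs_numbers(scores)

print(f"Outliers: {summary['outliers']}")
print(f"Center: the median score is {summary['median']}%")
print(f"Spread: scores range from {summary['min']}% to {summary['max']}%, "
      f"with an IQR of {summary['iqr']} percentage points")
```

For these scores the helper reports one outlier at 45, a median of 82, and an IQR of 12. The sentences in the template then put those numbers in context.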

Example:

Data: Test scores in AP Statistics class

"The distribution of test scores is slightly left-skewed and unimodal, with one outlier at 45%. The median score is 82%, meaning about half the students scored at or below 82%. Scores range from 45% to 98%, with an IQR of 12 percentage points, so the middle 50% of students scored within a 12-point range. The outlier at 45% sits well below the main cluster of scores between 70% and 98%."

Common Patterns and Interpretations

What Shape Tells Us

Symmetric:

  • Process or measurement is balanced
  • Natural variation around center
  • Use mean and standard deviation

Right-skewed:

  • Floor effect (minimum limit)
  • Most values small, few very large
  • Use median and IQR

Left-skewed:

  • Ceiling effect (maximum limit)
  • Most values large, few very small
  • Use median and IQR

Bimodal:

  • Two distinct groups mixed together
  • Consider separating and analyzing separately

What Outliers Tell Us

Potential meanings:

  • Errors (investigate and possibly correct)
  • Unusual but legitimate cases
  • Different population mixed in
  • Rare but important events

Impact:

  • Affect mean more than median
  • Affect standard deviation more than IQR
  • Can change conclusions if not addressed
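
To see these impacts numerically, the sketch below summarizes the seven values from the range example (12, 15, 18, 20, 22, 25, 100) with and without the value 100. The helper `summarize` and the use of the "inclusive" quartile method are assumptions made for illustration.

```python
from statistics import mean, median, quantiles, stdev

def summarize(values):
    """Return mean, median, sample SD, and IQR for a list of numbers."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "mean": mean(values),
        "median": median(values),
        "sd": stdev(values),
        "iqr": q3 - q1,
    }

with_outlier = [12, 15, 18, 20, 22, 25, 100]
without_outlier = with_outlier[:-1]  # same data with 100 removed

a, b = summarize(with_outlier), summarize(without_outlier)
for key in a:
    print(f"{key}: {a[key]:.2f} with outlier, {b[key]:.2f} without")
```

Removing the outlier shifts the mean by more than 11 units and the median by only 1. The standard deviation drops from about 31 to under 5, while the IQR changes by little more than 1.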

What Spread Tells Us

Large spread:

  • High variability
  • Data quite different from typical value
  • Less predictability

Small spread:

  • Low variability
  • Data close to typical value
  • More consistency, predictability

Comparing Distributions

When comparing two or more distributions:

Address each of SOCS:

Shape:

  • "Group A is symmetric while Group B is right-skewed"

Outliers:

  • "Both groups have outliers, but Group A's are more extreme"

Center:

  • "Group A has a higher median (75) than Group B (68)"

Spread:

  • "Group A shows more variability (SD = 12) than Group B (SD = 8)"

Example comparison:

"Both male and female height distributions are roughly symmetric and unimodal. Males have a higher mean height (70 inches) compared to females (64 inches), a difference of 6 inches. Both distributions have similar spreads, with standard deviations of approximately 3 inches. Neither distribution shows outliers."

Common Mistakes

  • Confusing skewness direction (it's the tail, not the peak!)
  • Using mean with skewed data (median is more appropriate)
  • Reporting center without spread (both are needed!)
  • Ignoring units (always include them)
  • Incomplete descriptions (use full SOCS framework)
  • Not describing in context (relate to actual situation)
  • Confusing SD and IQR (they measure spread differently)

Quick Reference

SOCS Framework:

  • Shape: Symmetric? Skewed (which direction)? Unimodal/bimodal?
  • Outliers: Present? Where? How many?
  • Center: Mean or median (with units!)
  • Spread: Range, IQR, or SD (with units!)

Mean vs. Median:

  • Symmetric, no outliers → Use mean
  • Skewed or outliers → Use median

SD vs. IQR:

  • Symmetric, no outliers → Use SD
  • Skewed or outliers → Use IQR

Skewness:

  • Right-skewed: Mean > Median, tail to right
  • Left-skewed: Mean < Median, tail to left
  • Symmetric: Mean ≈ Median

Remember: A complete description tells the story of the data. Don't just report numbers — interpret them in context and explain what they mean!

📚 Practice Problems

Problem 1 (easy)

Question:

Describe the shape of a distribution that is:

a) Symmetric
b) Skewed right
c) Skewed left

Solution:

Step 1: Symmetric distribution

  • Mirror image on both sides of center
  • Mean ≈ Median
  • Example: Normal distribution, heights
  • Tail length equal on both sides

Step 2: Skewed right (positive skew)

  • Tail extends to the right
  • Mean > Median (pulled toward tail)
  • Example: Income, house prices
  • Most data on left, few high values

Step 3: Skewed left (negative skew)

  • Tail extends to the left
  • Mean < Median (pulled toward tail)
  • Example: Test scores (when easy), age at death
  • Most data on right, few low values

Memory trick: "The skew points where the tail points"

Visual summary:

  • Symmetric: <-center->
  • Right skew: <-center---->
  • Left skew: <----center->

Answer:

a) Symmetric: balanced on both sides, mean ≈ median
b) Skewed right: long right tail, mean > median
c) Skewed left: long left tail, mean < median

Problem 2 (easy)

Question:

A dataset has the following properties: Mean = 75, Median = 80. What can you conclude about the shape of the distribution?

Solution:

Step 1: Compare mean and median

  • Mean = 75
  • Median = 80
  • Mean < Median

Step 2: Recall the relationship

When Mean < Median:

  • Distribution is skewed LEFT (negative skew)
  • Tail points to lower values
  • A few low values pull the mean down

Step 3: Explain why

  • The mean is sensitive to extreme values
  • The median is resistant to outliers
  • If the mean is pulled below the median, there must be some low outliers or a left tail

Step 4: Visualize

  • Most data is clustered around 80 (the median)
  • Some values fall well below the median
  • These low values drag the mean down to 75, below the median

Example: If test scores are mostly in 70s-90s, but a few students scored in 40s-50s, mean would be pulled down while median stays high.

Answer: The distribution is most likely skewed LEFT (negatively skewed). Mean < Median suggests a long tail toward lower values, though a graph is needed to confirm the shape.

Problem 3 (medium)

Question:

Identify whether each distribution is unimodal, bimodal, or multimodal:

a) Heights of adult humans (all genders)
b) Test scores where most students got A or F
c) Ages of people at a kids movie theater

Solution:

Step 1: Understand modes

  • Unimodal: One clear peak
  • Bimodal: Two distinct peaks
  • Multimodal: More than two peaks

Step 2: Analyze each scenario

a) Heights of all adult humans

  • Women cluster around ~5'4" (163 cm)
  • Men cluster around ~5'9" (175 cm)
  • Two distinct groups

Answer: BIMODAL

b) Test scores with mostly A or F

  • Cluster around 90-100 (A students)
  • Cluster around 0-60 (F students)
  • Few in between (B, C, D)

Answer: BIMODAL

c) Ages at kids movie

  • Young children (ages 5-12)
  • Parents (ages 30-45)
  • Possibly grandparents (ages 60-75)
  • Could have 2-3 distinct groups

Answer: BIMODAL or MULTIMODAL (likely 2-3 peaks)

Answer:

a) Bimodal (male and female heights)
b) Bimodal (A and F peaks)
c) Bimodal/Multimodal (children and adults)

Problem 4 (medium)

Question:

Describe this distribution using the SOCS framework (Shape, Outliers, Center, Spread): Data shows exam scores with most values between 70-85, mean = 77, median = 78, one score at 45, and range = 40.

Solution:

SOCS Framework for describing distributions:

S - SHAPE:

  • Mean (77) ≈ Median (78), very close
  • This suggests a roughly SYMMETRIC distribution
  • However, the presence of a low outlier (45) suggests a slight left skew
  • Overall: Roughly symmetric, possibly slight left skew

O - OUTLIERS:

  • The score of 45 is notably low
  • With most scores at 70-85 and one at 45, 45 is likely an outlier (more than 25 points below the typical score)
  • Need to check with the 1.5 × IQR rule, but it appears to be an outlier

C - CENTER:

  • Mean = 77
  • Median = 78
  • Typical exam score is around 77-78
  • Mean is pulled slightly down by the low outlier

S - SPREAD:

  • Range = 40 points (from 45 to 85)
  • Most data is in the 70-85 range (about 15 points)
  • Without the outlier, the spread would be smaller
  • IQR is likely around 10-15 points

Complete SOCS description: "The distribution of exam scores is roughly symmetric with a possible slight left skew due to one low outlier at 45. The center of the distribution is around 77-78 (mean and median nearly equal). The scores spread from 45 to 85, a range of 40 points, though most scores cluster between 70-85. The score of 45 appears to be an outlier, sitting well below the main body of data."

Answer: Symmetric/slight left skew, one low outlier (45), center ~77-78, range=40 with most data in 15-point range.

Problem 5 (hard)

Question:

Two distributions have the same mean (50) and same range (20-80). Distribution A is uniform (flat), while Distribution B is normal (bell-shaped). Which distribution would have a larger standard deviation, and why?

Solution:

Step 1: Visualize both distributions

Both: Mean = 50, Range = 60 (from 20 to 80)

Distribution A (Uniform):

  • Data spread EVENLY from 20 to 80
  • Every value equally likely
  • Flat histogram

Distribution B (Normal):

  • Data concentrated near mean (50)
  • Fewer values at extremes (20 and 80)
  • Bell-shaped curve

Step 2: Understand standard deviation

  • SD measures the typical distance from the mean
  • Larger SD = more spread out from center

Step 3: Compare spread from mean

Distribution A (Uniform):

  • Many values far from mean (50)
  • Values at 20 and 80 are 30 units from mean
  • Lots of data at extremes
  • Higher average distance from mean

Distribution B (Normal):

  • Most data near mean (50)
  • Few values at 20 and 80
  • Less data at extremes
  • Lower average distance from mean

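Step 4: Calculate mental estimate

  • Uniform: Roughly SD ≈ range/3.5 ≈ 60/3.5 ≈ 17 (for a continuous uniform distribution, SD = range/√12)
  • Normal: Roughly SD ≈ range/6 ≈ 60/6 ≈ 10 (nearly all values fall within 3 SD on either side of the mean)

(These are approximations)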

Answer: Distribution A (uniform) has LARGER standard deviation because more of its data is spread far from the mean, while Distribution B (normal) has most data clustered near the center. Even with the same range, uniform distributions have more variability than normal distributions.
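
The two estimates can be reproduced in a few lines of Python. This sketch rests on two assumptions: Distribution A is treated as a continuous uniform distribution on 20 to 80, and Distribution B's range is taken to span about 3 standard deviations on each side of the mean.

```python
from math import sqrt

low, high = 20, 80
value_range = high - low

# Continuous uniform distribution: SD = range / sqrt(12), and sqrt(12) is about 3.46.
sd_uniform = value_range / sqrt(12)

# Normal rule of thumb: the range covers roughly 6 SDs (3 on each side of the mean).
sd_normal_estimate = value_range / 6

print(f"uniform SD is about {sd_uniform:.1f}")         # about 17.3
print(f"normal SD is about {sd_normal_estimate:.1f}")  # about 10.0
```

The divisor √12 ≈ 3.46 is where the "range/3.5" shortcut comes from.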