Describing Distributions

Shape, outliers, center, and spread (SOCS)

Introduction

Looking at a graph is just the first step. To fully understand data, we must describe what we see using precise statistical language. The framework SOCS (Shape, Outliers, Center, Spread) provides a systematic approach to describing any distribution.

Shape

Shape describes the overall pattern of the distribution.

Symmetry

Symmetric Distribution:

  • Left side mirrors right side
  • Mean ≈ Median
  • Balanced around center

Examples:

  • Normal (bell-shaped) distributions
  • Uniform distributions
  • Heights of adult males

How to identify: If you fold the distribution at the center, the two sides roughly match.

Skewness

Right-Skewed (Positively Skewed):

  • Tail extends to the right
  • Mean > Median
  • Most data on left, few high values pull mean right

Examples:

  • Income (most people earn moderate amounts, few earn very high)
  • Home prices
  • Test scores when test is hard (most score low, few score high)

Visual: Peak on left, tail stretches right

Left-Skewed (Negatively Skewed):

  • Tail extends to the left
  • Mean < Median
  • Most data on right, few low values pull mean left

Examples:

  • Age at death (most live to old age, few die young)
  • Test scores when test is easy (most score high, few score low)

Visual: Peak on right, tail stretches left

Memory aid: Skewness direction = direction of the tail (not the peak!)
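The mean-versus-median relationships above can be checked numerically. The following is a minimal sketch, not a formal skewness test: `skew_hint` is a hypothetical helper, and both data sets are invented for illustration. Comparing mean and median is only a rule of thumb, so a graph should still be the final judge of shape.

```python
from statistics import mean, median

def skew_hint(values):
    """Suggest a skew direction by comparing the mean with the median."""
    m, med = mean(values), median(values)
    if m > med:
        return "right-skewed"
    if m < med:
        return "left-skewed"
    return "symmetric"

# Hypothetical incomes (thousands of dollars): two large values form a right tail.
incomes = [30, 35, 38, 40, 42, 45, 48, 50, 120, 250]
# Hypothetical exam scores: two low values form a left tail.
scores = [20, 55, 80, 85, 88, 90, 92, 94, 95, 96]

print(skew_hint(incomes))  # right-skewed (mean 69.8 > median 43.5)
print(skew_hint(scores))   # left-skewed (mean 79.5 < median 89)
```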

Modality

Number of peaks (modes) in distribution:

Unimodal: One clear peak

  • Most common pattern
  • Examples: heights of adult males, standardized test scores

Bimodal: Two distinct peaks

  • Suggests two different groups
  • Examples: Heights of adults (male peak and female peak)

Multimodal: More than two peaks

  • Multiple distinct groups
  • Less common

Uniform: No peaks, all values equally likely

  • Flat distribution
  • Example: Rolling a fair die

How to determine: Count prominent "humps" in the distribution
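Counting humps by eye can be mimicked crudely in code. The sketch below assumes the histogram has already been reduced to a list of bin counts. `count_peaks` is a hypothetical helper, not a standard function, and the counts are made up. It will overcount on noisy histograms and miss flat-topped peaks, so treat it as an illustration of the idea only.

```python
def count_peaks(counts):
    """Count bins strictly taller than both neighbors (edges are compared to 0)."""
    padded = [0] + list(counts) + [0]
    return sum(
        1
        for i in range(1, len(padded) - 1)
        if padded[i] > padded[i - 1] and padded[i] > padded[i + 1]
    )

unimodal = [1, 3, 7, 12, 7, 3, 1]   # one hump
bimodal = [2, 8, 3, 1, 4, 9, 2]     # two humps
uniform = [5, 5, 5, 5]              # flat: no bin stands above its neighbors

print(count_peaks(unimodal))  # 1
print(count_peaks(bimodal))   # 2
print(count_peaks(uniform))   # 0
```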

Special Shapes

Normal (Bell-Shaped):

  • Symmetric
  • Unimodal
  • Mean = Median = Mode
  • Most data near center, decreasing towards extremes
  • Follows empirical rule (68-95-99.7)
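The 68-95-99.7 figures can be reproduced from the standard normal distribution. This short sketch uses `NormalDist` from Python's standard library; `share_within` is a hypothetical helper name.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return z.cdf(k) - z.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k):.1%}")
# within 1 SD: 68.3%
# within 2 SD: 95.4%
# within 3 SD: 99.7%
```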

Uniform:

  • All values equally likely
  • Rectangular shape
  • No mode

Exponential:

  • Decreasing pattern
  • Extremely right-skewed
  • Many small values, few large values

Outliers

Outliers are observations that fall notably far from the overall pattern.

Identifying Outliers

Visual method:

  • Look for isolated points
  • Values separated from main cluster

1.5 × IQR Rule (for boxplots):

  • Calculate IQR = Q3 - Q1
  • Lower fence: Q1 - 1.5 × IQR
  • Upper fence: Q3 + 1.5 × IQR
  • Outliers fall beyond fences

Example:

  • Q1 = 65, Q3 = 85
  • IQR = 85 - 65 = 20
  • Lower fence: 65 - 1.5(20) = 65 - 30 = 35
  • Upper fence: 85 + 1.5(20) = 85 + 30 = 115
  • Values below 35 or above 115 are outliers
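The fence arithmetic in this example can be sketched in a few lines. `iqr_fences` and `find_outliers` are hypothetical helper names, and the list of scores is invented for illustration. The quartiles are passed in directly because different textbooks and software compute them slightly differently.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return (lower fence, upper fence) for the k × IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def find_outliers(values, q1, q3):
    """Values strictly beyond either fence."""
    lower, upper = iqr_fences(q1, q3)
    return [v for v in values if v < lower or v > upper]

print(iqr_fences(65, 85))  # (35.0, 115.0)

scores = [30, 62, 70, 75, 80, 88, 120]  # hypothetical data
print(find_outliers(scores, q1=65, q3=85))  # [30, 120]
```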

Standard deviation method:

  • Flag values more than 2 or 3 standard deviations from the mean
  • Less commonly used
  • Appropriate for symmetric distributions
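A minimal sketch of the standard deviation method follows, assuming the sample standard deviation and a cutoff `k` chosen by the analyst. `sd_outliers` is a hypothetical helper. The example also shows a weakness of the method: an extreme value inflates the standard deviation it is judged against, so in a small data set it may pass a 3 SD cutoff unnoticed.

```python
from statistics import mean, stdev

def sd_outliers(values, k=2):
    """Values more than k sample standard deviations from the mean."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) > k * s]

data = [12, 15, 18, 20, 22, 25, 100]

print(sd_outliers(data, k=2))  # [100]
print(sd_outliers(data, k=3))  # [] because 100 is only about 2.2 SDs from the mean
```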

Reporting Outliers

Always:

  • Note their presence: "There is one outlier at 150"
  • Give actual values if possible
  • Consider potential causes

Potential causes:

  • Measurement error: Mistake in recording
  • Data entry error: Typo when entering data
  • Legitimate extreme value: Unusual but real observation
  • Different population: Doesn't belong in this group

What to do:

  • Investigate cause if possible
  • Report with and without outliers (if they affect conclusions)
  • Don't automatically delete (unless proven error)

Center

Center describes the "typical" or "middle" value.

Mean vs. Median

When to use each:

Mean (x̄):

  • Symmetric distributions
  • No outliers
  • Want to use all data values
  • Mathematical properties needed

Median:

  • Skewed distributions
  • Presence of outliers
  • Want resistant measure
  • Ordinal data

Relationship to shape:

  • Symmetric: Mean ≈ Median
  • Right-skewed: Mean > Median (mean pulled right by tail)
  • Left-skewed: Mean < Median (mean pulled left by tail)
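To see why the median is called resistant, compare how far each measure moves when one extreme value is added. The home prices below are invented for illustration (thousands of dollars).

```python
from statistics import mean, median

typical = [200, 220, 240, 250, 260]   # hypothetical home prices
with_outlier = typical + [2000]       # same homes plus one mansion

mean_shift = mean(with_outlier) - mean(typical)
median_shift = median(with_outlier) - median(typical)

print(f"mean moves by {mean_shift:.1f}, median moves by {median_shift:.1f}")
# mean moves by 294.3, median moves by 5.0
```

The single high value drags the mean far to the right while the median barely changes, which is the pattern described for right-skewed data.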

Mode

Definition: Most frequently occurring value

When reported:

  • Categorical data
  • Describing bimodal distributions
  • Identifying popular values

Limitations:

  • May not exist (all values occur once)
  • May not be unique (multiple modes)
  • Not useful for continuous data with no repeated values
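These limitations are easy to see with `multimode` from Python's standard library (available in Python 3.8 and later), which returns every value tied for most frequent. The sample values are invented for illustration.

```python
from statistics import multimode

colors = ["red", "blue", "red", "green"]   # categorical data: one clear mode
ratings = [1, 1, 2, 3, 3]                  # two values tie for most frequent
weights = [2.1, 3.7, 4.4]                  # continuous data, no repeats

print(multimode(colors))   # ['red']
print(multimode(ratings))  # [1, 3]
print(multimode(weights))  # [2.1, 3.7, 4.4] since every value occurs once
```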

Spread

Spread describes the variability or dispersion of data.

Range

Definition: Maximum - Minimum

Formula: Range = Max - Min

Advantages:

  • Easy to calculate
  • Easy to understand
  • Gives sense of total spread

Disadvantages:

  • Affected by outliers
  • Ignores distribution between extremes
  • Only uses two values

Example:

  • Data: 12, 15, 18, 20, 22, 25, 100
  • Range = 100 - 12 = 88
  • Dominated by outlier (100)
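The same calculation in code, with the range recomputed after setting the outlier aside to show how much a single value contributes. `value_range` is a hypothetical helper name.

```python
def value_range(values):
    """Range = maximum - minimum."""
    return max(values) - min(values)

data = [12, 15, 18, 20, 22, 25, 100]
without_outlier = [v for v in data if v != 100]

print(value_range(data))             # 88
print(value_range(without_outlier))  # 13
```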

Interquartile Range (IQR)

Definition: Range of middle 50% of data

Formula: IQR = Q3 - Q1

Advantages:

  • Resistant to outliers
  • Focuses on middle of distribution
  • Useful for boxplots

Disadvantages:

  • Ignores lowest 25% and highest 25%
  • Less intuitive than range

Example:

  • Q1 = 65, Q3 = 85
  • IQR = 85 - 65 = 20
  • Middle 50% of data spans 20 points

Interpretation: "The middle half of the data spans [IQR] [units]"
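A short sketch of computing the IQR from raw data, using `quantiles` from Python's standard library with its default "exclusive" method. Quartile conventions differ between textbooks and software, so other tools may report slightly different values for the same data. `iqr` is a hypothetical helper name.

```python
from statistics import quantiles

def iqr(values):
    """Interquartile range, Q3 - Q1, using the default method of statistics.quantiles."""
    q1, _q2, q3 = quantiles(values, n=4)
    return q3 - q1

data = [12, 15, 18, 20, 22, 25, 100]

print(iqr(data))              # 10.0 (Q1 = 15, Q3 = 25)
print(max(data) - min(data))  # 88
```

The extreme value 100 inflates the range to 88 but leaves the IQR at 10, which illustrates the resistance to outliers listed above.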

Standard Deviation

Definition: Square root of the average squared deviation from the mean

Interpretation: Typical deviation from mean

Advantages:

  • Uses all data values
  • Has important mathematical properties
  • Basis for many statistical methods

Disadvantages:

  • Affected by outliers
  • Less intuitive than range
  • Most meaningful for roughly symmetric distributions

When to report:

  • Symmetric distributions
  • No extreme outliers
  • Want to use standard statistical methods
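A worked computation can make the standard deviation less abstract. The sketch below assumes the sample convention (dividing by n - 1), which is what `statistics.stdev` uses. `sample_sd` and `mean_abs_dev` are hypothetical helpers and the scores are invented. The second helper computes the literal average distance from the mean, which is close to the standard deviation but not the same number.

```python
from math import sqrt
from statistics import stdev

def sample_sd(values):
    """Sample standard deviation: sqrt of summed squared deviations over n - 1."""
    n = len(values)
    m = sum(values) / n
    squared_devs = [(x - m) ** 2 for x in values]
    return sqrt(sum(squared_devs) / (n - 1))

def mean_abs_dev(values):
    """Literal average distance from the mean, for comparison."""
    m = sum(values) / len(values)
    return sum(abs(x - m) for x in values) / len(values)

scores = [70, 75, 80, 85, 90]  # hypothetical test scores, in points

print(round(sample_sd(scores), 2))  # 7.91
print(mean_abs_dev(scores))         # 6.0
```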

Context Matters!

Units

Always include units in descriptions:

❌ "The mean is 68"
✓ "The mean height is 68 inches"

❌ "The standard deviation is 3.5"
✓ "The standard deviation of test scores is 3.5 points"

Comparison

Describe in context of:

  • What you'd expect
  • Other groups
  • Previous studies

Examples:

  • "Students averaged 85%, which is higher than last year's 78%"
  • "The standard deviation of 15 points shows high variability"

Complete Description Template

A complete distribution description includes:

Shape: "The distribution of [variable] is [symmetric/right-skewed/left-skewed] and [unimodal/bimodal/etc.]"

Outliers: "There is/are [number] outlier(s) at [value(s)]" or "There are no apparent outliers"

Center: "The [mean/median] [variable] is [value with units]"

Spread: "The [variable] ranges from [min] to [max] [units]" or "The standard deviation is [value] [units]"
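The numeric slots in this template (outliers, center, spread) can be filled from raw data. The following is a minimal sketch under stated assumptions: `socs_numbers` is a hypothetical helper, the scores are invented, quartiles use the default method of `statistics.quantiles`, and outliers are flagged with the 1.5 × IQR rule. Shape is left out on purpose because it is best judged from a graph.

```python
from statistics import median, quantiles

def socs_numbers(values):
    """Collect the outlier, center, and spread figures for a SOCS description."""
    q1, _q2, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "outliers": [v for v in values if v < lower or v > upper],
        "median": median(values),
        "min": min(values),
        "max": max(values),
        "iqr": iqr,
    }

scores = [45, 72, 76, 78, 80, 82, 84, 86, 88, 94, 98]  # hypothetical scores (%)
s = socs_numbers(scores)

print(f"Outliers: {s['outliers']}")  # Outliers: [45]
print(f"The median score is {s['median']}%.")  # The median score is 82%.
print(f"Scores range from {s['min']}% to {s['max']}%, "
      f"with an IQR of {s['iqr']} percentage points.")
# Scores range from 45% to 98%, with an IQR of 12.0 percentage points.
```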

Example:

Data: Test scores in AP Statistics class

"The distribution of test scores is slightly right-skewed and unimodal with one outlier at 45%. The median score is 82%, indicating that half the students scored below 82%. Scores range from 45% to 98%, with an IQR of 12 percentage points, meaning the middle 50% of students scored within a 12-point range. The outlier at 45% is notably below the main cluster of scores between 70% and 98%."

Common Patterns and Interpretations

What Shape Tells Us

Symmetric:

  • Process or measurement is balanced
  • Natural variation around center
  • Use mean and standard deviation

Right-skewed:

  • Often reflects a floor effect (values cannot fall below a minimum)
  • Most values small, few very large
  • Use median and IQR

Left-skewed:

  • Often reflects a ceiling effect (values cannot exceed a maximum)
  • Most values large, few very small
  • Use median and IQR

Bimodal:

  • Two distinct groups mixed together
  • Consider separating and analyzing separately

What Outliers Tell Us

Potential meanings:

  • Errors (investigate and possibly correct)
  • Unusual but legitimate cases
  • Different population mixed in
  • Rare but important events

Impact:

  • Affect mean more than median
  • Affect standard deviation more than IQR
  • Can change conclusions if not addressed

What Spread Tells Us

Large spread:

  • High variability
  • Data quite different from typical value
  • Less predictability

Small spread:

  • Low variability
  • Data close to typical value
  • More consistency, predictability

Comparing Distributions

When comparing two or more distributions:

Address each of SOCS:

Shape:

  • "Group A is symmetric while Group B is right-skewed"

Outliers:

  • "Both groups have outliers, but Group A's are more extreme"

Center:

  • "Group A has a higher median (75) than Group B (68)"

Spread:

  • "Group A shows more variability (SD = 12) than Group B (SD = 8)"
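The center and spread comparisons above rest on a few numbers per group. The sketch below shows one way to compute them side by side. `compare_groups` is a hypothetical helper and the heights are invented for illustration (inches).

```python
from statistics import mean, stdev

def compare_groups(a, b):
    """Difference in means plus each group's sample standard deviation."""
    return {
        "mean_a": mean(a),
        "mean_b": mean(b),
        "mean_diff": mean(a) - mean(b),
        "sd_a": stdev(a),
        "sd_b": stdev(b),
    }

males = [66, 68, 70, 72, 74]     # hypothetical heights
females = [60, 62, 64, 66, 68]   # hypothetical heights
c = compare_groups(males, females)

print(f"Mean difference: {c['mean_diff']} inches")             # Mean difference: 6 inches
print(f"SDs: {c['sd_a']:.1f} vs. {c['sd_b']:.1f} inches")      # SDs: 3.2 vs. 3.2 inches
```

The numbers alone are not the comparison. They still need to be stated in context, with units, as in the sentences above.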

Example comparison:

"Both male and female height distributions are roughly symmetric and unimodal. Males have a higher mean height (70 inches) compared to females (64 inches), a difference of 6 inches. Both distributions have similar spreads, with standard deviations of approximately 3 inches. Neither distribution shows outliers."

Common Mistakes

  • Confusing skewness direction (it's the tail, not the peak!)
  • Using mean with skewed data (median is more appropriate)
  • Reporting center without spread (both are needed!)
  • Ignoring units (always include them)
  • Incomplete descriptions (use full SOCS framework)
  • Not describing in context (relate to actual situation)
  • Confusing SD and IQR (they measure spread differently)

Quick Reference

SOCS Framework:

  • Shape: Symmetric? Skewed (which direction)? Unimodal/bimodal?
  • Outliers: Present? Where? How many?
  • Center: Mean or median (with units!)
  • Spread: Range, IQR, or SD (with units!)

Mean vs. Median:

  • Symmetric, no outliers → Use mean
  • Skewed or outliers → Use median

SD vs. IQR:

  • Symmetric, no outliers → Use SD
  • Skewed or outliers → Use IQR

Skewness:

  • Right-skewed: Mean > Median, tail to right
  • Left-skewed: Mean < Median, tail to left
  • Symmetric: Mean ≈ Median

Remember: A complete description tells the story of the data. Don't just report numbers — interpret them in context and explain what they mean!
