Describing Distributions
Shape, outliers, center, and spread (SOCS)
Introduction
Looking at a graph is just the first step. To fully understand data, we must describe what we see using precise statistical language. The framework SOCS (Shape, Outliers, Center, Spread) provides a systematic approach to describing any distribution.
Shape
Shape describes the overall pattern of the distribution.
Symmetry
Symmetric Distribution:
- Left side mirrors right side
- Mean ≈ Median
- Balanced around center
Examples:
- Normal (bell-shaped) distributions
- Uniform distributions
- Heights of adult males
How to identify: If you fold the distribution at the center, both sides match
Skewness
Right-Skewed (Positively Skewed):
- Tail extends to the right
- Mean > Median
- Most data on left, few high values pull mean right
Examples:
- Income (most people earn moderate amounts, few earn very high)
- Home prices
- Test scores when the test is hard (most score low, few score high)
Visual: Peak on left, tail stretches right
Left-Skewed (Negatively Skewed):
- Tail extends to the left
- Mean < Median
- Most data on right, few low values pull mean left
Examples:
- Age at death (most live to old age, few die young)
- Test scores when the test is easy (most score high, few score low)
Visual: Peak on right, tail stretches left
Memory aid: Skewness direction = direction of the tail (not the peak!)
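The mean-versus-median relationships above can be checked numerically. The sketch below is a rough heuristic rather than a formal skewness test, and the data values and the function name skew_hint are made up for illustration.

```python
from statistics import mean, median

# Hypothetical right-skewed data: most values are small, one is large.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20]
# Mirror image of the same data, so the long tail points left instead.
left_skewed = [21 - x for x in right_skewed]

def skew_hint(data):
    """Guess the skew direction by comparing the mean with the median."""
    m, med = mean(data), median(data)
    if m > med:
        return "right-skewed (mean > median)"
    if m < med:
        return "left-skewed (mean < median)"
    return "roughly symmetric (mean = median)"

print(skew_hint(right_skewed))
print(skew_hint(left_skewed))
```

A graph should still be the final word on shape; the comparison only hints at the direction of the tail.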
Modality
Number of peaks (modes) in distribution:
Unimodal: One clear peak
- Most common pattern
- Examples: heights within a single sex, standardized test scores
Bimodal: Two distinct peaks
- Suggests two different groups
- Examples: Heights of adults (male peak and female peak)
Multimodal: More than two peaks
- Multiple distinct groups
- Less common
Uniform: No peaks, all values equally likely
- Flat distribution
- Example: Rolling a fair die
How to determine: Count prominent "humps" in the distribution
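Counting humps can be sketched in code, assuming the data have already been grouped into histogram bins. The function below counts every local maximum in a list of bin counts; it does not judge whether a peak is "prominent", so small bumps caused by noise would be counted too. The bin counts are hypothetical.

```python
def count_peaks(counts):
    """Count bins that are strictly taller than both neighbors.

    Bins past either end of the histogram are treated as having height 0.
    """
    peaks = 0
    for i, c in enumerate(counts):
        left = counts[i - 1] if i > 0 else 0
        right = counts[i + 1] if i < len(counts) - 1 else 0
        if c > left and c > right:
            peaks += 1
    return peaks

unimodal = [1, 3, 7, 3, 1]          # one hump
bimodal = [2, 8, 3, 1, 4, 9, 2]     # two humps
uniform = [5, 5, 5, 5]              # flat, no hump

print(count_peaks(unimodal), count_peaks(bimodal), count_peaks(uniform))
```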
Special Shapes
Normal (Bell-Shaped):
- Symmetric
- Unimodal
- Mean = Median = Mode
- Most data near center, decreasing towards extremes
- Follows empirical rule (68-95-99.7)
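The 68-95-99.7 figures can be reproduced from a normal model. This is a small sketch using Python's standard library; the helper name share_within is an assumption chosen for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k):.1%}")
```

The printed proportions come out close to 68%, 95%, and 99.7%, which is where the rule gets its name.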
Uniform:
- All values equally likely
- Rectangular shape
- No mode
Exponential:
- Decreasing pattern
- Extremely right-skewed
- Many small values, few large values
Outliers
Outliers are observations that fall notably far from the overall pattern.
Identifying Outliers
Visual method:
- Look for isolated points
- Values separated from main cluster
1.5 × IQR Rule (for boxplots):
- Calculate IQR = Q3 - Q1
- Lower fence: Q1 - 1.5 × IQR
- Upper fence: Q3 + 1.5 × IQR
- Outliers fall beyond the fences (below the lower fence or above the upper fence)
Example:
- Q1 = 65, Q3 = 85
- IQR = 85 - 65 = 20
- Lower fence: 65 - 1.5(20) = 65 - 30 = 35
- Upper fence: 85 + 1.5(20) = 85 + 30 = 115
- Values below 35 or above 115 are outliers
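The fence arithmetic in this example can be written as a short function. This is a minimal sketch: the names iqr_fences and find_outliers and the list of scores are hypothetical, and the quartiles are taken as given rather than computed.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return (lower fence, upper fence) for the k x IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def find_outliers(values, q1, q3):
    """Return the values that fall outside the 1.5 x IQR fences."""
    lower, upper = iqr_fences(q1, q3)
    return [v for v in values if v < lower or v > upper]

lower, upper = iqr_fences(65, 85)
print(lower, upper)  # 35.0 115.0

scores = [30, 50, 70, 90, 120]          # hypothetical values
print(find_outliers(scores, 65, 85))    # [30, 120]
```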
Standard deviation method:
- Flag values more than 2 or 3 standard deviations from the mean as outliers
- Less commonly used
- Appropriate for symmetric distributions
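A sketch of the standard deviation method follows. It assumes the sample standard deviation (dividing by n - 1) and uses hypothetical data; the function name sd_outliers is made up for illustration.

```python
from statistics import mean, stdev

def sd_outliers(data, threshold=2.0):
    """Return values more than `threshold` sample SDs away from the mean."""
    m = mean(data)
    s = stdev(data)
    return [x for x in data if abs(x - m) > threshold * s]

data = [12, 15, 18, 20, 22, 25, 100]    # hypothetical values

print(sd_outliers(data, threshold=2))   # [100]
print(sd_outliers(data, threshold=3))   # []
```

The value 100 is flagged at 2 standard deviations but not at 3, because the extreme value itself inflates the standard deviation. That sensitivity is one reason this method is usually reserved for symmetric data.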
Reporting Outliers
Always:
- Note their presence: "There is one outlier at 150"
- Give actual values if possible
- Consider potential causes
Potential causes:
- Measurement error: Mistake in recording
- Data entry error: Typo when entering data
- Legitimate extreme value: Unusual but real observation
- Different population: Doesn't belong in this group
What to do:
- Investigate cause if possible
- Report with and without outliers (if they affect conclusions)
- Don't automatically delete (unless proven error)
Center
Center describes the "typical" or "middle" value.
Mean vs. Median
When to use each:
Mean (x̄):
- Symmetric distributions
- No outliers
- Want to use all data values
- Mathematical properties needed
Median:
- Skewed distributions
- Presence of outliers
- Want resistant measure
- Ordinal data
Relationship to shape:
- Symmetric: Mean ≈ Median
- Right-skewed: Mean > Median (mean pulled right by tail)
- Left-skewed: Mean < Median (mean pulled left by tail)
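To see why the median is called resistant, the sketch below adds one extreme value to a small hypothetical data set and measures how far each measure of center moves.

```python
from statistics import mean, median

typical = [12, 15, 18, 20, 22, 25]    # hypothetical values, no outlier
with_outlier = typical + [100]        # same values plus one extreme point

# How much each measure of center moves when the outlier is added.
mean_shift = mean(with_outlier) - mean(typical)
median_shift = median(with_outlier) - median(typical)

print(f"mean moves by {mean_shift:.2f}")
print(f"median moves by {median_shift:.2f}")
```

The mean moves by more than 11 units while the median moves by only 1, which matches the guidance to prefer the median when outliers are present.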
Mode
Definition: Most frequently occurring value
When reported:
- Categorical data
- Describing bimodal distributions
- Identifying popular values
Limitations:
- May not exist (all values occur once)
- May not be unique (multiple modes)
- Not useful for continuous data with no repeated values
Spread
Spread describes the variability or dispersion of data.
Range
Definition: Maximum - Minimum
Formula: Range = Maximum - Minimum
Advantages:
- Easy to calculate
- Easy to understand
- Gives sense of total spread
Disadvantages:
- Affected by outliers
- Ignores distribution between extremes
- Only uses two values
Example:
- Data: 12, 15, 18, 20, 22, 25, 100
- Range = 100 - 12 = 88
- Dominated by outlier (100)
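The same example in code, as a small sketch (the function name value_range is an assumption). Dropping the outlier shows how much of the range it accounts for.

```python
def value_range(data):
    """Range = maximum - minimum."""
    return max(data) - min(data)

data = [12, 15, 18, 20, 22, 25, 100]
without_outlier = [x for x in data if x != 100]

print(value_range(data))             # 88
print(value_range(without_outlier))  # 13
```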
Interquartile Range (IQR)
Definition: Range of middle 50% of data
Formula: IQR = Q3 - Q1
Advantages:
- Resistant to outliers
- Focuses on middle of distribution
- Useful for boxplots
Disadvantages:
- Ignores lowest 25% and highest 25%
- Less intuitive than range
Example:
- Q1 = 65, Q3 = 85
- IQR = 85 - 65 = 20
- Middle 50% of data spans 20 points
Interpretation: "The middle half of the data spans [IQR] [units]"
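Quartiles can also be computed from raw data. The sketch below uses statistics.quantiles from Python's standard library with its default "exclusive" method; textbooks and software use slightly different quartile conventions, so results may differ a little from hand calculations. The data values are hypothetical.

```python
from statistics import quantiles

def quartiles(data):
    """Return (Q1, median, Q3) using the default 'exclusive' method."""
    q1, q2, q3 = quantiles(data, n=4)
    return q1, q2, q3

data = [12, 15, 18, 20, 22, 25, 100]    # hypothetical values

q1, q2, q3 = quartiles(data)
iqr = q3 - q1
print(q1, q3, iqr)  # 15.0 25.0 10.0
```

The extreme value 100 has no effect on Q1 or Q3 here, which illustrates why the IQR is described as resistant.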
Standard Deviation
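Definition: Roughly the average distance of the data values from the mean (formally, the square root of the average squared deviation from the mean)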
Interpretation: Typical deviation from mean
Advantages:
- Uses all data values
- Has important mathematical properties
- Basis for many statistical methods
Disadvantages:
- Affected by outliers
- Less intuitive than range
- Most informative for roughly symmetric distributions
When to report:
- Symmetric distributions
- No extreme outliers
- Want to use standard statistical methods
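The calculation behind the standard deviation can be sketched step by step. This version assumes the sample formula (dividing by n - 1), which is what statistics.stdev also uses; the data values and the name sample_sd are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def sample_sd(data):
    """Sample standard deviation computed from its definition."""
    m = mean(data)
    squared_devs = [(x - m) ** 2 for x in data]   # squared distance from the mean
    return sqrt(sum(squared_devs) / (len(data) - 1))

scores = [2, 4, 4, 4, 5, 5, 7, 9]    # hypothetical values with mean 5

print(round(sample_sd(scores), 3))   # 2.138
print(round(stdev(scores), 3))       # same value from the library function
```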
Context Matters!
Units
Always include units in descriptions:
❌ "The mean is 68"
✓ "The mean height is 68 inches"
❌ "The standard deviation is 3.5"
✓ "The standard deviation of test scores is 3.5 points"
Comparison
Describe in context of:
- What you'd expect
- Other groups
- Previous studies
Examples:
- "Students averaged 85%, which is higher than last year's 78%"
- "The standard deviation of 15 points shows high variability"
Complete Description Template
A complete distribution description includes:
Shape: "The distribution of [variable] is [symmetric/right-skewed/left-skewed] and [unimodal/bimodal/etc.]"
Outliers: "There is/are [number] outlier(s) at [value(s)]" or "There are no apparent outliers"
Center: "The [mean/median] [variable] is [value with units]"
Spread: "The [variable] ranges from [min] to [max] [units]" or "The standard deviation is [value] [units]"
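The numbers that fill in the Outliers, Center, and Spread sentences can be gathered in one pass. This is a sketch under stated assumptions: the function name socs_numbers and the scores are hypothetical, quartiles use Python's default "exclusive" method, and outliers are flagged with the 1.5 × IQR rule. Shape still has to be judged from a graph.

```python
from statistics import median, quantiles

def socs_numbers(data):
    """Compute the values needed for the Outliers, Center, and Spread sentences."""
    q1, _, q3 = quantiles(data, n=4)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "outliers": [x for x in data if x < lower or x > upper],
        "median": median(data),
        "min": min(data),
        "max": max(data),
        "iqr": iqr,
    }

scores = [45, 72, 75, 78, 80, 82, 84, 86, 88, 92, 98]   # hypothetical test scores (%)
s = socs_numbers(scores)

print(f"Outliers: {s['outliers']}")
print(f"The median score is {s['median']}%.")
print(f"Scores range from {s['min']}% to {s['max']}%, with an IQR of {s['iqr']} points.")
```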
Example:
Data: Test scores in AP Statistics class
"The distribution of test scores is slightly right-skewed and unimodal with one outlier at 45%. The median score is 82%, indicating that about half the students scored at or below 82%. Scores range from 45% to 98%, with an IQR of 12 percentage points, meaning the middle 50% of students scored within a 12-point range. The outlier at 45% is notably below the main cluster of scores between 70% and 98%."
Common Patterns and Interpretations
What Shape Tells Us
Symmetric:
- Process or measurement is balanced
- Natural variation around center
- Use mean and standard deviation
Right-skewed:
- Floor effect (minimum limit)
- Most values small, few very large
- Use median and IQR
Left-skewed:
- Ceiling effect (maximum limit)
- Most values large, few very small
- Use median and IQR
Bimodal:
- Two distinct groups mixed together
- Consider separating and analyzing separately
What Outliers Tell Us
Potential meanings:
- Errors (investigate and possibly correct)
- Unusual but legitimate cases
- Different population mixed in
- Rare but important events
Impact:
- Affect mean more than median
- Affect standard deviation more than IQR
- Can change conclusions if not addressed
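The difference in sensitivity between the standard deviation and the IQR can be illustrated with a small experiment. The sketch below uses hypothetical data, the sample standard deviation, and Python's default quartile method, then compares how much each measure of spread grows when one extreme value is added.

```python
from statistics import quantiles, stdev

def iqr(data):
    """IQR = Q3 - Q1, using the default 'exclusive' quartile method."""
    q1, _, q3 = quantiles(data, n=4)
    return q3 - q1

typical = [12, 15, 18, 20, 22, 25]    # hypothetical values, no outlier
with_outlier = typical + [100]        # same values plus one extreme point

# Ratio of each measure after adding the outlier to its value before.
sd_growth = stdev(with_outlier) / stdev(typical)
iqr_growth = iqr(with_outlier) / iqr(typical)

print(f"SD grows by a factor of {sd_growth:.1f}")
print(f"IQR grows by a factor of {iqr_growth:.1f}")
```

With these values the standard deviation grows more than sixfold, while the IQR grows by less than 20%.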
What Spread Tells Us
Large spread:
- High variability
- Data quite different from typical value
- Less predictability
Small spread:
- Low variability
- Data close to typical value
- More consistency, predictability
Comparing Distributions
When comparing two or more distributions:
Address each of SOCS:
Shape:
- "Group A is symmetric while Group B is right-skewed"
Outliers:
- "Both groups have outliers, but Group A's are more extreme"
Center:
- "Group A has a higher median (75) than Group B (68)"
Spread:
- "Group A shows more variability (SD = 12) than Group B (SD = 8)"
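The Center and Spread comparisons can be backed by a quick calculation. In this sketch the two groups are hypothetical data sets, and the function name compare_groups is made up for illustration; Shape and Outliers would still be compared from graphs of each group.

```python
from statistics import median, stdev

def compare_groups(a, b):
    """Return the medians and sample standard deviations of two groups."""
    return {
        "median_a": median(a), "median_b": median(b),
        "sd_a": stdev(a), "sd_b": stdev(b),
    }

group_a = [60, 68, 72, 75, 78, 82, 90]   # hypothetical scores
group_b = [58, 63, 66, 68, 70, 73, 78]   # hypothetical scores

c = compare_groups(group_a, group_b)

# Ties fall to Group B in this simple sketch.
higher = "Group A" if c["median_a"] > c["median_b"] else "Group B"
more_variable = "Group A" if c["sd_a"] > c["sd_b"] else "Group B"

print(f"{higher} has the higher median ({c['median_a']} vs. {c['median_b']}).")
print(f"{more_variable} shows more variability "
      f"(SD = {c['sd_a']:.1f} vs. {c['sd_b']:.1f}).")
```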
Example comparison:
"Both male and female height distributions are roughly symmetric and unimodal. Males have a higher mean height (70 inches) compared to females (64 inches), a difference of 6 inches. Both distributions have similar spreads, with standard deviations of approximately 3 inches. Neither distribution shows outliers."
Common Mistakes
❌ Confusing skewness direction (it's the tail, not the peak!)
❌ Using mean with skewed data (median is more appropriate)
❌ Reporting center without spread (both are needed!)
❌ Ignoring units (always include them)
❌ Incomplete descriptions (use full SOCS framework)
❌ Not describing in context (relate to actual situation)
❌ Confusing SD and IQR (they measure spread differently)
Quick Reference
SOCS Framework:
- Shape: Symmetric? Skewed (which direction)? Unimodal/bimodal?
- Outliers: Present? Where? How many?
- Center: Mean or median (with units!)
- Spread: Range, IQR, or SD (with units!)
Mean vs. Median:
- Symmetric, no outliers → Use mean
- Skewed or outliers → Use median
SD vs. IQR:
- Symmetric, no outliers → Use SD
- Skewed or outliers → Use IQR
Skewness:
- Right-skewed: Mean > Median, tail to right
- Left-skewed: Mean < Median, tail to left
- Symmetric: Mean ≈ Median
Remember: A complete description tells the story of the data. Don't just report numbers — interpret them in context and explain what they mean!