Outliers in Data

Identify and analyze outliers

What is an Outlier?

An outlier is a data value that is significantly different from the other values in a data set.

Think of it as: A data point that "stands out" or "doesn't fit"

Examples:

  • Test scores: 78, 82, 85, 79, 83, 15 (15 is an outlier!)
  • Heights: 65, 67, 64, 68, 120 (120 inches is an outlier!)
  • Prices: 10, 12, 11, 13, 95 (95 is an outlier!)

Key point: Outliers are unusually high OR unusually low

Why Outliers Matter

1. Affect the mean (average)

The mean is sensitive to outliers!

Example: Salaries: 40k, 42k, 45k, 43k, 200k

  • With outlier: Mean = 74k
  • Without outlier: Mean = 42.5k

Big difference!

2. Don't affect the median much

The median is resistant to outliers

Same example:

  • With outlier: Median = 43k
  • Without outlier: Median = 42.5k

Small difference!
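The salary comparison above can be checked with a few lines of Python. This is a minimal sketch using the standard `statistics` module; the variable names are only illustrative, and salaries are written in thousands.

```python
from statistics import mean, median

# Salaries in thousands, taken from the example above.
salaries = [40, 42, 45, 43, 200]
without_outlier = [s for s in salaries if s != 200]

# The mean moves a lot when the outlier is removed...
print(f"mean:   {mean(salaries):.1f} with, {mean(without_outlier):.1f} without")
# mean:   74.0 with, 42.5 without

# ...while the median barely moves.
print(f"median: {median(salaries):.1f} with, {median(without_outlier):.1f} without")
# median: 43.0 with, 42.5 without
```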

3. Can indicate errors

  • Measurement mistakes
  • Data entry errors
  • Recording problems

4. Can reveal important information

  • Exceptional cases
  • New discoveries
  • Special circumstances

Identifying Outliers: The IQR Method

Most common method: 1.5 × IQR rule

Steps:

Step 1: Order the data, then find Q1 (the median of the lower half) and Q3 (the median of the upper half)

Step 2: Calculate IQR = Q3 - Q1

Step 3: Calculate boundaries

  • Lower boundary: Q1 - 1.5(IQR)
  • Upper boundary: Q3 + 1.5(IQR)

Step 4: Any value outside boundaries is an outlier

Example: Data: 2, 5, 7, 8, 9, 10, 12, 15, 40

The median is 9 (the 5th of 9 values), so leave it out of both halves.

Q1 = 6 (median of 2, 5, 7, 8)
Q3 = 13.5 (median of 10, 12, 15, 40)
IQR = 13.5 - 6 = 7.5

Lower boundary: 6 - 1.5(7.5) = 6 - 11.25 = -5.25
Upper boundary: 13.5 + 1.5(7.5) = 13.5 + 11.25 = 24.75

Outlier: 40 (greater than 24.75)

Example 2: Test scores: 65, 70, 72, 75, 78, 80, 82, 85, 88, 20

Order: 20, 65, 70, 72, 75, 78, 80, 82, 85, 88

Q1 = 70 (median of 20, 65, 70, 72, 75)
Q3 = 82 (median of 78, 80, 82, 85, 88)
IQR = 82 - 70 = 12

Lower: 70 - 1.5(12) = 70 - 18 = 52
Upper: 82 + 1.5(12) = 82 + 18 = 100

Outlier: 20 (less than 52)
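The four steps above can be sketched in Python. The function names `quartiles` and `iqr_outliers` are made up for this illustration, and the quartile convention is an assumption: Q1 and Q3 are taken as the medians of the lower and upper halves, with the overall median left out when the count is odd. Calculators and software that use a different convention may report slightly different quartiles.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    Assumption: when the count is odd, the overall median is left out
    of both halves. Other conventions can give slightly different values.
    """
    data = sorted(values)
    if len(data) < 2:
        raise ValueError("need at least two values")
    half = len(data) // 2
    lower_half = data[:half]
    upper_half = data[len(data) - half:]
    return median(lower_half), median(upper_half)

def iqr_outliers(values):
    """Return (lower boundary, upper boundary, outliers) for the 1.5 x IQR rule."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers

# Test scores from Example 2 (unordered on purpose; the function sorts them).
scores = [65, 70, 72, 75, 78, 80, 82, 85, 88, 20]
print(iqr_outliers(scores))  # (52.0, 100.0, [20])
```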

Why 1.5 × IQR?

The 1.5 multiplier is a convention:

  • Widely accepted in statistics
  • Balances sensitivity (finding real outliers) with specificity (not flagging too many)
  • Works well for many distributions
  • Used by box plots

Alternatives exist:

  • 2 × IQR (more conservative, fewer outliers)
  • 3 × IQR (very conservative, extreme outliers only)
  • Standard deviation method (for normal distributions)

In Algebra 1: Stick with 1.5 × IQR unless told otherwise

Visual Identification

From dot plots, histograms, box plots:

Look for values far separated from the main cluster

Example: Dot plot

Most of the data points are clustered together between 10 and 15, with a single point far away at 25. Because the point at 25 is well separated from the cluster, it is likely an outlier.

From box plots: Outliers often shown as individual points beyond whiskers

From scatter plots: Points far from the trend line or main cluster

Types of Outliers

1. Mild outliers:

  • Between 1.5 and 3 IQRs below Q1 or above Q3
  • Somewhat unusual

2. Extreme outliers:

  • More than 3 IQRs below Q1 or above Q3
  • Very unusual

Example: Q1 = 10, Q3 = 20, IQR = 10

Mild outlier range:

  • Lower: 10 - 1.5(10) to 10 - 3(10) = -5 to -20
  • Upper: 20 + 1.5(10) to 20 + 3(10) = 35 to 50

Extreme outlier:

  • Below -20 or above 50
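The mild/extreme split above can be written as a small Python function. This is only a sketch: `classify` is a hypothetical helper name, and a value that sits exactly on a boundary is treated as belonging to the less severe group.

```python
def classify(value, q1, q3):
    """Label a value as 'not an outlier', 'mild outlier', or 'extreme outlier'."""
    iqr = q3 - q1
    # How far the value lies below Q1 or above Q3.
    # Zero or negative means it is between Q1 and Q3 (or on one of them).
    distance = max(q1 - value, value - q3)
    if distance > 3 * iqr:
        return "extreme outlier"
    if distance > 1.5 * iqr:
        return "mild outlier"
    return "not an outlier"

# Same quartiles as the example: Q1 = 10, Q3 = 20, IQR = 10.
for v in (30, 40, 60, -25):
    print(v, classify(v, q1=10, q3=20))
# 30 not an outlier
# 40 mild outlier
# 60 extreme outlier
# -25 extreme outlier
```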

Effect on Measures of Center

Mean:

  • Very sensitive to outliers
  • Pulled toward outlier
  • Can be misleading with outliers

Example: 10, 12, 13, 14, 15, 100

Mean with outlier: (10+12+13+14+15+100)/6 ≈ 27.3
Mean without: (10+12+13+14+15)/5 = 12.8

Huge difference!

Median:

  • Resistant to outliers
  • Not pulled significantly
  • Better measure when outliers present

Same example:
Median with outlier: 13.5
Median without: 13

Small difference!

Mode:

  • Not affected by outliers
  • Only shows most frequent value

Effect on Measures of Spread

Range:

  • Very sensitive (uses min and max)
  • Outliers inflate range

Example: 5, 7, 8, 9, 10, 50

Range with outlier: 50 - 5 = 45
Range without: 10 - 5 = 5

IQR:

  • Resistant to outliers
  • Only uses middle 50%
  • Better measure when outliers present

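Same example:
IQR with outlier: Q3 - Q1 = 10 - 7 = 3
IQR without: Q3 - Q1 = 9.5 - 6 = 3.5 (data 5, 7, 8, 9, 10: the median 8 is left out, so Q1 is the median of 5, 7 and Q3 is the median of 9, 10)

Much smaller change than the range (3 vs 3.5, compared with 45 vs 5)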

Standard deviation:

  • Sensitive to outliers (in advanced statistics)
  • Outliers increase variability
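To see how strongly an outlier inflates the range and the standard deviation, here is a short Python sketch using the data from the range example. It assumes the population standard deviation (`statistics.pstdev`); the sample standard deviation would give somewhat different numbers but the same pattern. `value_range` is a helper name invented for this example.

```python
from statistics import pstdev

data = [5, 7, 8, 9, 10, 50]
trimmed = [x for x in data if x != 50]  # same data without the outlier

def value_range(values):
    """Range = maximum - minimum."""
    return max(values) - min(values)

print(value_range(data), value_range(trimmed))             # 45 5
print(round(pstdev(data), 1), round(pstdev(trimmed), 1))   # 15.8 1.7
```

Both measures shrink sharply once the single value 50 is removed, which is why neither is a good description of spread when outliers are present.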

Causes of Outliers

1. Measurement error:

  • Instrument malfunction
  • Human error reading/recording
  • Transcription mistake

Example: Recording 150 instead of 15.0

2. Data entry error:

  • Typo when entering data
  • Extra or missing digit
  • Wrong decimal place

Example: Typing 1000 instead of 100

3. Sampling error:

  • Wrong population sampled
  • Non-random selection

4. Natural variation:

  • True extreme value
  • Rare but real occurrence

Example: Unusually tall person, genius IQ, record temperature

5. Different population:

  • Value from different group

Example: Adult height in data of children's heights

What to Do with Outliers

Option 1: Investigate

  • Check for errors
  • Verify measurement
  • Look for explanation

Option 2: Keep

  • If legitimate data point
  • If represents true variation
  • Document its presence

Option 3: Remove

  • If proven error
  • If not from target population
  • Report that you removed it!

NEVER: Remove without reason or justification!

Best practice:

  • Analyze data both with and without outlier
  • Report both results
  • Explain any removal decision

Reporting Outliers

When writing about data:

"The data set contains one outlier (value = 95), which is more than 1.5 IQRs above Q3. This value appears to be a data entry error based on the source document, so it was excluded from further analysis."

OR:

"One outlier (150) was identified but retained because it represents a legitimate extreme value."

Be transparent!

Real-World Examples

Example 1: Income Data

Incomes: 35k, 40k, 42k, 38k, 45k, 2M (CEO)

2M is an outlier

  • Median better than mean for "typical" income
  • Outlier is real (some people earn much more)
  • Keep it, but use median for reporting

Example 2: Test Scores

Scores: 78, 82, 85, 88, 90, 15

15 is an outlier

  • Likely student left early or didn't try
  • Or answer sheet error
  • Investigate before deciding

Example 3: Product Weights

Weights (grams): 100, 101, 99, 102, 150

150 is an outlier

  • Possible production error
  • Check batch records
  • May need quality control adjustment

Example 4: Reaction Times

Times (seconds): 0.8, 0.9, 0.85, 0.82, 5.2

5.2 is an outlier

  • Person distracted?
  • Timer error?
  • Investigate before removing

Outliers in Different Contexts

Science experiments:

  • May indicate errors
  • Could be breakthrough discovery
  • Repeat to verify

Quality control:

  • Often indicate defects
  • Trigger inspection
  • May lead to process improvement

Sports statistics:

  • Record-breaking performances
  • Exceptional talent
  • Keep for historical record

Economic data:

  • Market crashes/booms
  • Unusual events
  • Important to analyze separately

Multiple Outliers

Data can have more than one!

Example: 5, 8, 10, 12, 15, 18, 20, 22, 75, 80

Q1 = 10, Q3 = 22, IQR = 12, so the upper boundary is 22 + 1.5(12) = 40

Both 75 and 80 are outliers (each is greater than 40)

Clustered outliers:

  • Multiple outliers grouped together
  • May indicate subpopulation
  • Consider separate analysis

Outliers in Box Plots

Standard representation:

  • Draw whiskers to last non-outlier
  • Mark outliers as individual points (dots)
  • Clearly visible

Example: In a box plot, an outlier would be marked as a dot beyond the whiskers, with the whiskers extending only to the last non-outlier value in the normal data range.

Benefits:

  • Quick visual identification
  • See number of outliers
  • See if high or low
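The whisker rule described above (whiskers stop at the last non-outlier, outliers are drawn as separate points) can be sketched in Python. `box_plot_parts` is a hypothetical helper, and it assumes quartiles are the medians of the lower and upper halves with the overall median left out when the count is odd; plotting software may use a different quartile convention.

```python
from statistics import median

def box_plot_parts(values):
    """Return the whisker ends and the outlier points for a box plot."""
    data = sorted(values)
    half = len(data) // 2
    q1 = median(data[:half])
    q3 = median(data[len(data) - half:])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    # Whiskers reach the most extreme values that are still inside the boundaries.
    inside = [v for v in data if lower <= v <= upper]
    outliers = [v for v in data if v < lower or v > upper]
    return {
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": outliers,
    }

print(box_plot_parts([65, 70, 72, 75, 78, 80, 82, 85, 88, 20]))
# {'whisker_low': 65, 'whisker_high': 88, 'outliers': [20]}
```

Here the lower whisker stops at 65, not at 20, and 20 would be drawn as a separate dot.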

Z-Score Method (Preview)

Alternative method using standard deviation:

z = (value - mean) / standard deviation

Rule: If |z| > 3, the value is likely an outlier (some contexts use |z| > 2)

Example: Mean = 50, SD = 5

Value = 70
z = (70 - 50) / 5 = 4

Since 4 > 3, value 70 is an outlier

Note: This is more common in advanced statistics
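As a rough illustration, the z-score rule can be written as two short Python functions. The names `z_score` and `is_outlier_by_z` are invented for this sketch, and the default threshold of 3 follows the rule stated above.

```python
def z_score(value, mean, sd):
    """Number of standard deviations the value lies from the mean."""
    return (value - mean) / sd

def is_outlier_by_z(value, mean, sd, threshold=3):
    """Flag the value when |z| is greater than the threshold."""
    return abs(z_score(value, mean, sd)) > threshold

print(z_score(70, mean=50, sd=5))          # 4.0
print(is_outlier_by_z(70, mean=50, sd=5))  # True
print(is_outlier_by_z(62, mean=50, sd=5))  # False (z = 2.4)
```

Passing `threshold=2` gives the stricter version of the rule mentioned above.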

Practice Identifying Outliers

Example 1: 12, 15, 18, 20, 22, 25, 28, 65

Order: already ordered

Q1 = 16.5, Q3 = 26.5, IQR = 10

Lower: 16.5 - 15 = 1.5
Upper: 26.5 + 15 = 41.5

Outlier: 65 (> 41.5)

Example 2: 2, 3, 5, 7, 8, 9, 10, 11, 12

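The median is 8, so leave it out of both halves.

Q1 = 4 (median of 2, 3, 5, 7), Q3 = 10.5 (median of 9, 10, 11, 12), IQR = 6.5

Lower: 4 - 9.75 = -5.75
Upper: 10.5 + 9.75 = 20.25

No outliers (all values between -5.75 and 20.25)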

Example 3: 50, 55, 60, 62, 65, 68, 70, 72, 120

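The median is 65, so leave it out of both halves.

Q1 = 57.5 (median of 50, 55, 60, 62), Q3 = 71 (median of 68, 70, 72, 120), IQR = 13.5

Lower: 57.5 - 20.25 = 37.25
Upper: 71 + 20.25 = 91.25

Outlier: 120 (> 91.25)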

Common Mistakes to Avoid

  1. Automatically removing outliers: Must investigate first!

  2. Using range instead of IQR: IQR is resistant to outliers, range is not

  3. Wrong IQR calculation: IQR = Q3 - Q1 (not max - min!)

  4. Forgetting both boundaries: Check both lower and upper limits

  5. Calculation errors with 1.5: It's 1.5 × IQR, not 1.5 + IQR!

  6. Not considering context: Is the outlier meaningful or an error?

  7. Not reporting removals: Always document if you exclude data

Outliers and Technology

Calculators:

  • Many show outliers on box plots
  • Can calculate quartiles automatically

Spreadsheets:

  • Use QUARTILE function
  • Create formulas for boundaries
  • Conditional formatting to highlight

Statistical software:

  • Automatic outlier detection
  • Multiple methods available
  • Visual displays

When Outliers Are Most Important

Quality control: Outliers indicate defects

Medical data: Unusual values may indicate health issues

Fraud detection: Unusual transactions flagged

Climate data: Extreme values important for planning

Safety analysis: Worst-case scenarios matter

Outliers vs Extreme Values

Not all extreme values are outliers!

Extreme value: At the far end of the distribution
Outlier: More than 1.5 IQRs below Q1 or above Q3

Example: Tallest person in class

Might be extreme (tallest) but not an outlier (still within 1.5 IQRs of Q3)

Example 2: Record high temperature

Extreme and probably an outlier

Quick Reference

Outlier: Data value far from others

IQR Method:

  • Lower boundary: Q1 - 1.5(IQR)
  • Upper boundary: Q3 + 1.5(IQR)
  • Outside boundaries = outlier

Effects:

  • Mean: Very sensitive
  • Median: Resistant
  • Range: Sensitive
  • IQR: Resistant

Actions:

  1. Investigate
  2. Keep if legitimate
  3. Remove if error (and report!)

Never: Remove without reason

In box plots: Shown as individual points

Practice Strategy

  • Calculate IQR carefully
  • Don't forget the 1.5 multiplier
  • Check both upper and lower boundaries
  • Consider context and cause
  • Practice with various data sets
  • Learn to identify visually from graphs
  • Understand effect on mean vs median
  • Compare statistics with and without outliers
  • Read real-world examples
  • Use technology to verify
  • Always investigate before removing
  • Document your decisions
  • Understand that outliers aren't always errors
  • Practice explaining outliers to others
  • Apply to real data from your life

Understanding outliers is crucial for accurate data analysis. They can reveal errors, exceptional cases, or important patterns. Master this skill and you'll be a more critical and careful data analyst!

📚 Practice Problems

Problem 1 (easy)

Question:

Is 2 an outlier in the data set: 12, 15, 18, 20, 22, 25, 2?

Solution:

Step 1: Arrange data in order: 2, 12, 15, 18, 20, 22, 25

Step 2: Visual inspection: 2 is much smaller than all other values (which range from 12-25). It appears to be an outlier.

Step 3: Use the IQR method to confirm. Find Q1 and Q3:
Q1 = 12 (median of lower half: 2, 12, 15)
Q2 = 18 (median overall)
Q3 = 22 (median of upper half: 20, 22, 25)

Step 4: Calculate IQR and boundaries:
IQR = 22 - 12 = 10
Lower boundary: Q1 - 1.5(IQR) = 12 - 15 = -3
Upper boundary: Q3 + 1.5(IQR) = 22 + 15 = 37

Step 5: Check if 2 is outside the boundaries:
2 > -3 and 2 < 37
2 is NOT outside the boundaries by the 1.5 × IQR rule.

Answer: By the IQR method, 2 is technically NOT an outlier, though it appears unusual visually.

Problem 2 (easy)

Question:

For the data set 5, 8, 10, 12, 15, 40, identify any outliers using the 1.5 × IQR rule.

Solution:

Step 1: Data is already in order. Find quartiles:
Q1 = 8 (median of 5, 8, 10)
Q2 = 11 (median of all: between 10 and 12)
Q3 = 15 (median of 12, 15, 40)

Step 2: Calculate IQR: IQR = Q3 - Q1 = 15 - 8 = 7

Step 3: Calculate boundaries:
Lower: Q1 - 1.5(IQR) = 8 - 1.5(7) = 8 - 10.5 = -2.5
Upper: Q3 + 1.5(IQR) = 15 + 1.5(7) = 15 + 10.5 = 25.5

Step 4: Check each value:
5 > -2.5 ✓
8, 10, 12, 15 all within boundaries ✓
40 > 25.5 ✗ (exceeds upper boundary)

Step 5: Identify outliers: 40 is an outlier

Answer: 40 is an outlier

Problem 3 (medium)

Question:

A data set has Q1 = 30, Q3 = 50. What values would be considered outliers?

Solution:

Step 1: Calculate IQR: IQR = Q3 - Q1 = 50 - 30 = 20

Step 2: Calculate lower boundary: Lower boundary = Q1 - 1.5(IQR) = 30 - 1.5(20) = 30 - 30 = 0

Step 3: Calculate upper boundary: Upper boundary = Q3 + 1.5(IQR) = 50 + 1.5(20) = 50 + 30 = 80

Step 4: Determine outlier ranges:
Any value less than 0 is a low outlier
Any value greater than 80 is a high outlier

Answer: Values below 0 or above 80 are outliers

Problem 4 (medium)

Question:

Explain how outliers affect the mean and median differently.

Solution:

Step 1: Effect on the mean (average): The mean is calculated by adding all values and dividing by the count. Outliers significantly affect the mean because they are included in the sum.

Example: Data set: 10, 12, 13, 15, 100
Without 100: mean = (10 + 12 + 13 + 15)/4 = 12.5
With 100: mean = (10 + 12 + 13 + 15 + 100)/5 = 30

The outlier (100) dramatically increases the mean from 12.5 to 30.

Step 2: Effect on the median (middle value): The median is the middle value when data is ordered. Outliers have little to no effect on the median.

Same data: 10, 12, 13, 15, 100
Median = 13 (middle value)

If we remove 100: 10, 12, 13, 15
Median = (12 + 13)/2 = 12.5

The median changed only slightly (13 to 12.5).

Step 3: Conclusion:

  • Mean is sensitive to outliers (not resistant)
  • Median is resistant to outliers
  • When outliers exist, median often better represents "typical" value

Answer: Outliers significantly affect the mean but have minimal effect on the median. The median is resistant to outliers.

Problem 5 (hard)

Question:

Test scores: 72, 75, 78, 80, 82, 85, 88, 90, 45. Is 45 an outlier? Should it be removed from the data?

Solution:

Step 1: Order the data and find quartiles: 45, 72, 75, 78, 80, 82, 85, 88, 90

Q1 = 73.5 (median of 45, 72, 75, 78)
Q2 = 80
Q3 = 86.5 (median of 82, 85, 88, 90)

Step 2: Calculate IQR and boundaries:
IQR = 86.5 - 73.5 = 13
Lower: 73.5 - 1.5(13) = 73.5 - 19.5 = 54
Upper: 86.5 + 1.5(13) = 86.5 + 19.5 = 106

Step 3: Check if 45 is an outlier: 45 < 54, so YES, 45 is an outlier

Step 4: Investigate the cause. Ask: Why is this score so different? Possible reasons:

  • Student was absent and made up test later
  • Student had an emergency during test
  • Data entry error (typed 45 instead of 85?)
  • Student genuinely struggled

Step 5: Decide whether to remove it:

  • If it's a data error: Remove or correct it
  • If it's a legitimate score: Keep it, but note it
  • Report statistics with and without the outlier
  • Use median instead of mean to reduce its impact

Step 6: Calculate both scenarios:
With 45: mean ≈ 77.2, median = 80
Without 45: mean = 81.25, median = 81

Answer: Yes, 45 is an outlier. Whether to remove it depends on why it occurred. If legitimate, keep it but use resistant measures like median. If it's an error, investigate and correct.