Outliers in Data - Complete Interactive Lesson
Part 1: What Is an Outlier?
๐ฏ Outliers in Data
Part 1 of 5 โ What Is an Outlier?
Topics in This Part
| Section |
|---|
| The Idea of an Outlier |
| Spotting Outliers by Eye |
| Why a Single Value Can Mislead |
๐ Key Concept: An outlier is a data value that lies unusually far from the rest of the data. One stray value can quietly distort an average, stretch a graph, and lead to wrong conclusions โ so learning to spot and handle outliers is a core data skill.
The Idea of an Outlier
Imagine seven students report how many minutes it took them to get to school:
Six of the values huddle between and minutes. Then there's a โ more than double the next-largest value. That lonely is an outlier: a value that doesn't fit the pattern of the rest.
Outliers can be:
- Real and important โ maybe that student walked from across town, or a sensor caught a genuine spike.
- Mistakes โ a typo ( instead of ), a broken instrument, or the wrong units.
๐ก An outlier is not automatically "bad data." It is simply a value worth a second look. The job of statistics is to flag it; the job of a thoughtful person is to investigate why it's there.
Concept Check ๐ฏ
Why a Single Value Can Mislead
Return to the commute times. Their mean (average) is:
See the Pull ๐งฎ
A small shop records the ages of customers:
1) Mean of all five ages (it divides evenly) 2) Mean of just the first four ages (the outlier removed)
Spotting by Eye Is Not Enough
In the examples above, the outliers were obvious. But what about a value that is somewhat large โ is an outlier in a data set that runs to ? Eyeballing it is risky and subjective.
We need an objective rule that any two people would apply the same way. There are two standard ones:
| Rule | Idea |
|---|---|
| rule |
First Ideas ๐ฝ
Lock in the vocabulary from Part 1.
Part 2: The 1.5 ร IQR Rule
๐ฏ Outliers in Data
Part 2 of 5 โ The Rule
๐ The Idea: Measure the spread of the middle half of the data with the IQR, then build two fences. Any value beyond a fence is officially an outlier.
Step 1 โ Find the Quartiles and the IQR
The interquartile range (IQR) is the spread of the middle of the data:
Part 3: The Standard-Deviation Rule
๐ฏ Outliers in Data
Part 3 of 5 โ The Standard-Deviation Rule
๐ A second test: Instead of quartiles, measure how many standard deviations a value sits from the mean. If it is more than about away, it is unusually far out.
Measuring Distance in Standard Deviations
The standard deviation measures the typical distance of values from the mean . We can rewrite any value as a number of standard deviations away from the mean โ called its -score:
Part 4: How Outliers Affect Statistics
๐ฏ Outliers in Data
Part 4 of 5 โ How Outliers Affect Statistics
๐ Big Idea: Some statistics get yanked around by outliers and some shrug them off. Knowing which is which tells you the right numbers to report when an outlier is present.
Resistant vs. Non-Resistant
A statistic is resistant (or robust) if a single outlier barely changes it. It is non-resistant if an outlier can move it a lot.
| Statistic | Measures | Resistant to outliers? |
|---|---|---|
| Mean | center | โ No โ gets pulled toward the outlier |
| Median | center | โ Yes โ only the middle position matters |
| Range | spread | โ No โ uses the very extremes |
| IQR | spread | โ Yes โ uses only the middle half |
| Standard deviation | spread | โ No โ squares the distance to the mean |
๐ When a data set contains an outlier, report the for center and the for spread. They describe the value honestly.
Part 5: Causes, Decisions & Mastery Check
๐ฏ Outliers in Data
Part 5 of 5 โ Causes, Decisions & Mastery Check
You can now (1) recognize an outlier, (2) flag one with the rule, (3) flag one with the -score rule, and (4) describe how outliers warp the mean, range, and standard deviation. The last skill is judgment: what to do once you've found one.
Where Outliers Come From โ and What to Do
| Cause | Example | Sensible action |
|---|---|---|
| Data-entry error | typed for |