Data Compression & Analysis - Complete Interactive Lesson
Part 1: Core Concepts
📦 Data Compression & Analysis
Part 1 of 7: Lossless vs Lossy, Compression Techniques, and Data Analysis
Why Compress Data?
Smaller files mean:
- Faster transmission over networks
- Less storage space needed
- Lower bandwidth usage
- Faster loading for users
Lossless vs Lossy Compression
| Type | Data Loss? | Quality | Smaller? | Use Cases |
|---|---|---|---|---|
| Lossless | No: original perfectly restored | Identical to original | Moderate reduction | Text, code, spreadsheets, medical images |
| Lossy | Yes: some data permanently removed | Slightly reduced | Much smaller | Photos (JPEG), music (MP3), video (MP4) |
Lossless Example: Run-Length Encoding
Original: AAABBBCCDDDDDD
Compressed: 3A3B2C6D
The original can be perfectly reconstructed from the compressed version.
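The round trip above can be sketched in a few lines of Python. This is a minimal illustration, not a production codec: the function names are made up for this example, and the decoder assumes the original text contains no digits.

```python
def rle_encode(text):
    """Encode each run as <count><char>, e.g. 'AAAB' -> '3A1B'."""
    if not text:
        return ""
    pieces = []
    run_char = text[0]
    run_length = 1
    for ch in text[1:]:
        if ch == run_char:
            run_length += 1                      # still inside the same run
        else:
            pieces.append(f"{run_length}{run_char}")
            run_char, run_length = ch, 1         # start a new run
    pieces.append(f"{run_length}{run_char}")     # flush the final run
    return "".join(pieces)


def rle_decode(encoded):
    """Reverse rle_encode. Assumes the original text had no digit characters."""
    pieces = []
    count = ""
    for ch in encoded:
        if ch.isdigit():
            count += ch                          # counts may span several digits
        else:
            pieces.append(ch * int(count))
            count = ""
    return "".join(pieces)


original = "AAABBBCCDDDDDD"
compressed = rle_encode(original)
print(compressed)                          # 3A3B2C6D
print(rle_decode(compressed) == original)  # True
```

Because decoding returns exactly the input, this counts as lossless.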
Lossy Example: JPEG Compression
A photograph has millions of color values. JPEG removes subtle color differences that human eyes cannot easily detect. The file shrinks dramatically, but the removed data cannot be recovered.
Lossless = perfect reconstruction. Lossy = smaller file but permanent data loss. Choose based on whether quality loss is acceptable.
Data Analysis and Visualization
Extracting Patterns from Data
When working with large datasets, visualization reveals patterns that raw numbers cannot.
| Visualization | Best For |
|---|---|
| Bar chart | Comparing categories |
| Line graph | Showing trends over time |
| Scatter plot | Showing relationships between two variables |
| Pie chart | Showing parts of a whole |
| Histogram | Showing frequency distributions |
Filtering and Transforming Data
// Filter: Keep only rows where score > 80
// Sort: Order by date ascending
// Aggregate: Calculate average score per student
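One possible way to carry out these three steps in Python is sketched below. The sample rows and the field names (`student`, `date`, `score`) are hypothetical, chosen only to match the comments above.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample rows; the field names are assumptions for illustration.
rows = [
    {"student": "Ana", "date": "2024-03-02", "score": 91},
    {"student": "Ben", "date": "2024-01-15", "score": 78},
    {"student": "Ana", "date": "2024-02-10", "score": 85},
    {"student": "Ben", "date": "2024-02-20", "score": 88},
]

# Filter: keep only rows where score > 80
high_scores = [row for row in rows if row["score"] > 80]

# Sort: order by date ascending (ISO dates sort correctly as plain strings)
by_date = sorted(rows, key=lambda row: row["date"])

# Aggregate: calculate average score per student
scores_by_student = defaultdict(list)
for row in rows:
    scores_by_student[row["student"]].append(row["score"])
averages = {name: mean(scores) for name, scores in scores_by_student.items()}

print(len(high_scores))       # 3
print(by_date[0]["student"])  # Ben
print(averages)               # {'Ana': 88, 'Ben': 83}
```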
Interpreting Results
- Look for trends (increasing, decreasing, stable)
- Identify outliers (values far from the norm)
- Check for clusters (groups of similar data points)
- Be cautious of bias in data collection
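As a rough sketch of the "identify outliers" step, the snippet below flags values that sit far from the mean. The threshold of 2 standard deviations and the sample scores are assumptions for illustration; real analyses pick a rule that suits the data.

```python
from statistics import mean, pstdev


def find_outliers(values, threshold=2.0):
    """Return values more than `threshold` standard deviations from the mean."""
    center = mean(values)
    spread = pstdev(values)
    if spread == 0:
        return []            # all values identical, so nothing stands out
    return [v for v in values if abs(v - center) / spread > threshold]


scores = [70, 72, 68, 71, 69, 150]   # hypothetical data with one odd value
print(find_outliers(scores))         # [150]
```

A single extreme value also pulls the mean and standard deviation toward itself, which is one reason outliers deserve a closer look rather than automatic deletion.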
Challenges with Large Datasets
- Storage: Require significant space
- Privacy: May contain sensitive personal information
- Accuracy: Errors are amplified at scale
- Bias: If collection methods are biased, conclusions will be skewed
Applied Recall
- Compression that can perfectly restore the original data is called _______ compression.
- MP3 audio files use _______ compression, permanently removing some frequencies.
- A chart that shows the relationship between two variables using dots is called a _______ plot.
AP Exam Strategy: Data Compression & Analysis
- Know the difference: lossless (ZIP, PNG) vs lossy (JPEG, MP3, MP4)
- Lossless for critical data (medical, legal, code). Lossy for media where small quality loss is acceptable
- Run-length encoding is the lossless technique the AP exam expects you to understand
- Data analysis questions test your ability to interpret visualizations and identify patterns
- Correlation does not imply causation; always consider confounding variables
- Large datasets can reveal patterns but also amplify errors and privacy concerns
Part 2: Key Processes
Compression: Same Information, Fewer Bits
Compression takes a sequence of bits and produces a shorter sequence that decodes back to the original (lossless) or to an approximation of it (lossy).
| Type | Round-trip property | Examples |
|---|---|---|
| Lossless | Decoded = original, bit-for-bit | ZIP, PNG, FLAC |
| Lossy | Decoded ≈ original | JPEG, MP3, MP4 |
Why Lossless Has A Limit
Lossless compression has an information-theoretic floor: the entropy of the data. A truly random file has no redundancy to exploit, so no lossless coder can reliably make it smaller; the output is typically the same size or slightly larger.
A simple example: a 100-character string that repeats "AB" can be encoded as "AB × 50", which is far smaller. A 100-character string of random characters has no pattern to exploit.
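The contrast can be checked with Python's standard `zlib` module. This is a small experiment, not a proof: the "random" bytes come from a seeded pseudo-random generator (an assumption made so the run is repeatable), and the entropy function only looks at single-byte frequencies, so it does not capture the repeating "AB" order that the compressor also exploits.

```python
import math
import random
import zlib
from collections import Counter


def entropy_bits_per_byte(data):
    """Shannon entropy of the byte frequencies, in bits per byte."""
    counts = Counter(data)
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


patterned = b"AB" * 50                                  # 100 bytes, very repetitive
rng = random.Random(0)                                  # fixed seed for repeatability
noisy = bytes(rng.getrandbits(8) for _ in range(100))   # 100 pseudo-random bytes

for label, data in [("patterned", patterned), ("random", noisy)]:
    packed = zlib.compress(data, 9)
    print(label, len(data), "->", len(packed), "bytes,",
          round(entropy_bits_per_byte(data), 2), "bits/byte")
```

The patterned input shrinks to a small fraction of its size, while the pseudo-random input comes out slightly larger than it went in because of format overhead.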
Why Lossy Can Go Further
Lossy compression exploits human perception:
- Audio: humans don't hear above ~20 kHz; some bands can be discarded.
- Image: small color shifts in textures are imperceptible.
- Video: most pixels barely change between frames; encode the difference.
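The core lossy move, throwing away detail that is unlikely to be noticed, can be sketched with simple quantization. This is a deliberately simplified illustration and not how JPEG or MP3 actually work internally; the sample values and the step size of 8 are assumptions.

```python
def quantize(samples, step):
    """Lossy step: map each sample to the index of its nearest multiple of `step`."""
    return [(s + step // 2) // step for s in samples]


def dequantize(levels, step):
    """Best-effort reconstruction; the detail removed by quantize is gone."""
    return [level * step for level in levels]


samples = [100, 103, 98, 101, 250, 247]   # hypothetical brightness values
step = 8

levels = quantize(samples, step)
restored = dequantize(levels, step)

print(len(set(samples)), "distinct values before,", len(set(levels)), "after")
print("largest error:", max(abs(a - b) for a, b in zip(samples, restored)))
print("exact round trip?", restored == samples)   # False
```

Fewer distinct values means fewer bits are needed, but the original samples cannot be recovered from the quantized ones.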
A Tiny Lossless Example: Run-Length Encoding
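The sketch below uses `itertools.groupby` to run-length encode two inputs. It is meant to illustrate the limit discussed earlier: RLE only helps when the data contains runs, and it can make run-free data larger. The helper names are made up for this example.

```python
from itertools import groupby


def rle_pairs(text):
    """Return (character, run_length) pairs for consecutive runs."""
    return [(ch, len(list(group))) for ch, group in groupby(text)]


def rle_string(text):
    """Render the pairs as <count><char> text, e.g. 'AAAB' -> '3A1B'."""
    return "".join(f"{count}{ch}" for ch, count in rle_pairs(text))


for sample in ["W" * 10 + "B" * 3, "ABCDEFG"]:
    encoded = rle_string(sample)
    print(sample, "->", encoded, f"({len(sample)} -> {len(encoded)} chars)")

# WWWWWWWWWWBBB -> 10W3B (13 -> 5 chars)
# ABCDEFG -> 1A1B1C1D1E1F1G (7 -> 14 chars)
```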
Part 3: Patterns & Examples
Common Compression Patterns
| Pattern | Where used |
|---|---|
| Run-length encoding | Simple lossless; runs of repeats. |
| Dictionary coding (LZ77, LZW) | ZIP, GIF, PNG: replace repeated substrings with references. |
| Huffman coding | Common values get short codes; rare values get long codes. |
| Frequency-domain transforms (DCT) | JPEG, MP3: keep big frequency components, discard small. |
| Differential encoding | Video: encode changes between frames. |
Dictionary Coding: A Mini-Walkthrough
Imagine encoding "the cat sat on the mat":
- Build a dictionary as you go.
- Replace repeated phrases with references.
After processing once, "the " might be code 0; "at" might be code 1. The compressed stream uses these short codes for repeated phrases.
This is the core idea behind ZIP, GZIP, PNG, and most modern lossless coders.
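To make the walkthrough concrete, here is a minimal sketch of LZW, one member of the dictionary-coding family. It is a teaching version under stated assumptions: it emits a list of integer codes rather than packed bits, it assumes every character has a code point below 256, and the codes it assigns will differ from the illustrative "code 0" and "code 1" above.

```python
def lzw_compress(text):
    """LZW: emit dictionary codes, learning longer phrases as they recur."""
    table = {chr(i): i for i in range(256)}   # start with every single character
    next_code = 256
    phrase = ""
    codes = []
    for ch in text:
        candidate = phrase + ch
        if candidate in table:
            phrase = candidate                # keep extending a known phrase
        else:
            codes.append(table[phrase])       # emit the longest known phrase
            table[candidate] = next_code      # learn the new, longer phrase
            next_code += 1
            phrase = ch
    if phrase:
        codes.append(table[phrase])
    return codes


def lzw_decompress(codes):
    """Rebuild the text; the decoder re-learns the same dictionary as it goes."""
    if not codes:
        return ""
    table = {i: chr(i) for i in range(256)}
    next_code = 256
    previous = table[codes[0]]
    output = [previous]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == next_code:
            entry = previous + previous[0]    # phrase being defined by this step
        else:
            raise ValueError(f"bad LZW code: {code}")
        output.append(entry)
        table[next_code] = previous + entry[0]
        next_code += 1
        previous = entry
    return "".join(output)


text = "the cat sat on the mat"
codes = lzw_compress(text)
print(len(text), len(codes))            # 22 18
print(lzw_decompress(codes) == text)    # True
```

On a sentence this short the saving is small; dictionary coders pay off as repeated phrases accumulate in longer inputs.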
Huffman Coding In One Picture
If a text uses A 50%, B 25%, C 12.5%, D 12.5%:
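For these frequencies, a Huffman tree can be built by repeatedly merging the two least frequent nodes. The sketch below uses Python's `heapq` for that; it is one possible implementation, and the exact 0/1 labels depend on tie-breaking choices (only the code lengths are determined by the frequencies).

```python
import heapq


def huffman_codes(frequencies):
    """Build prefix codes by repeatedly merging the two least frequent nodes."""
    # Heap entries: (weight, tie_breaker, {symbol: code_so_far}).
    # The unique tie_breaker keeps Python from ever comparing the dicts.
    heap = [(weight, index, {symbol: ""})
            for index, (symbol, weight) in enumerate(frequencies.items())]
    heapq.heapify(heap)
    tie_breaker = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in left.items()}
        merged.update({s: "1" + code for s, code in right.items()})
        heapq.heappush(heap, (w1 + w2, tie_breaker, merged))
        tie_breaker += 1
    return heap[0][2]


frequencies = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}
codes = huffman_codes(frequencies)
average_bits = sum(frequencies[s] * len(codes[s]) for s in frequencies)

print(codes)          # {'A': '0', 'B': '10', 'C': '110', 'D': '111'}
print(average_bits)   # 1.75
```

A fixed-length code for four symbols needs 2 bits per symbol, so the variable-length code averages 1.75 bits and saves 12.5% on text with this distribution.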
Part 4: Connections & Interactions
Compression Connects To Other Topics
| Connection | Why |
|---|---|
| Compression ↔ Internet | Smaller files = faster downloads. |
| Compression ↔ Algorithms | Each codec is an algorithm. |
| Compression ↔ Data | Compression decisions affect what analysis is possible. |
| Compression ↔ Impact | Compression enables global media, and also surveillance archives. |
Big Files Demand Streaming Algorithms
When a file doesn't fit in memory, you need algorithms that process data in a single pass with constant or sub-linear memory:
| Task | Streaming approach |
|---|---|
| Sum / mean | Running total; divide at the end. |
| Min / max | Compare each value to running extreme. |
| Count distinct (approximate) | HyperLogLog. |
| Top-K | Heap of size K. |
This connects compression-era data sizes to algorithm design.
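Three of the rows in the table (mean, min/max, and top-K) can be combined into a single pass, as in the sketch below. The `readings` generator is a hypothetical stand-in for a file too large to load at once; HyperLogLog is left out because it needs considerably more machinery.

```python
import heapq


def stream_summary(values, k=3):
    """One pass over `values`, keeping only O(k) state in memory."""
    count = 0
    total = 0.0
    smallest = None
    largest = None
    top_k = []   # min-heap holding the k largest values seen so far
    for value in values:
        count += 1
        total += value
        smallest = value if smallest is None else min(smallest, value)
        largest = value if largest is None else max(largest, value)
        if len(top_k) < k:
            heapq.heappush(top_k, value)
        elif value > top_k[0]:
            heapq.heapreplace(top_k, value)   # drop the smallest of the top k
    return {
        "count": count,
        "mean": total / count if count else None,
        "min": smallest,
        "max": largest,
        "top_k": sorted(top_k, reverse=True),
    }


def readings():
    """Hypothetical data source; a generator stands in for a huge file."""
    for value in [12, 7, 42, 3, 19, 27, 8]:
        yield value


summary = stream_summary(readings(), k=3)
print(summary["min"], summary["max"], summary["top_k"])   # 3 42 [42, 27, 19]
```

The function never stores the full input, so memory use depends on K rather than on the size of the dataset.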
Part 5: Change Over Time
How Compression Has Evolved
| Era | Defining codec |
|---|---|
| 1980s | RLE, LZW (used in GIF). |
| 1990s | JPEG for images, MP3 for audio, MPEG-2 for video. |
| 2000s | H.264 (ubiquitous video), AAC (audio). |
| 2010s | WebP, HEVC (H.265). |
| 2020s | AV1, AVIF: open, royalty-free, even better quality-per-bit. |
Codec Generations Roughly Halve File Sizes
| Codec | Era | Relative size for same quality |
|---|---|---|
| MPEG-2 | 1990s | 1× |
| H.264 | 2000s | ~0.5× |
| H.265 | 2010s | ~0.3× |
| AV1 | 2020s | ~0.2× |
Part 6: Problem-Solving Workshop
Compression Workshop
Apply compression vocabulary to estimation and design problems.
Worked: Image Sizing
| Format | Size of a 1024×1024, 24-bit image |
|---|---|
| Raw RGB | 3 MB |
| PNG (lossless) | 1–2 MB (depends on content) |
| JPEG (high quality) | ~300 KB |
| WebP / AVIF | ~150–200 KB |
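The raw figure in the table follows directly from width × height × bytes per pixel. The short calculation below shows the arithmetic; the JPEG size is simply the approximate value from the table, not a measurement.

```python
width, height = 1024, 1024
bits_per_pixel = 24                       # 8 bits each for red, green, blue

raw_bytes = width * height * bits_per_pixel // 8
raw_mib = raw_bytes / (1024 * 1024)
print(raw_bytes, raw_mib)                 # 3145728 3.0

# Approximate JPEG size taken from the table (assumed typical, not measured).
jpeg_bytes = 300 * 1024
print(round(raw_bytes / jpeg_bytes, 1))   # 10.2, i.e. roughly 10x smaller
```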
Worked: Audio Sizing
| Format | Size of 60 s of stereo audio |
|---|---|
| Raw 16-bit 44.1 kHz | ~10.6 MB |
| FLAC (lossless) | ~5–7 MB |
| MP3 192 kbps | ~1.4 MB |
| Opus 64 kbps | ~480 KB |
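The raw and constant-bitrate rows can be reproduced with a few multiplications. This sketch assumes 1 kbps means 1000 bits per second and ignores container overhead; the function names are made up for this example.

```python
def raw_audio_bytes(sample_rate_hz, bits_per_sample, channels, seconds):
    """Uncompressed PCM size: every sample of every channel is stored."""
    return sample_rate_hz * bits_per_sample * channels * seconds // 8


def bitrate_bytes(kbps, seconds):
    """Size of a constant-bitrate stream; assumes 1 kbps = 1000 bits/second."""
    return kbps * 1000 * seconds // 8


raw = raw_audio_bytes(44_100, 16, 2, 60)
mp3 = bitrate_bytes(192, 60)
opus = bitrate_bytes(64, 60)

print(raw)    # 10584000  (about 10.6 MB)
print(mp3)    # 1440000   (about 1.4 MB)
print(opus)   # 480000    (about 480 KB)
```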
Worked: Choosing A Format
| Use case | Best format |
|---|---|
Part 7: AP Review
AP Exam Recap: Compression & Analysis
Final review.
Final Vocab
| Term | Definition |
|---|---|
| Lossless | Decoded data = original. |
| Lossy | Decoded ≈ original. |
| Entropy | Information-theoretic minimum size for lossless coding. |
| RLE | Run-length encoding. |
| Dictionary coding | LZ-family; replace repeated substrings with references. |
| Huffman coding | Common symbols → short codes. |
| DCT | Discrete cosine transform; basis of JPEG and, in a modified form (MDCT), of MP3. |
| Differential encoding | Encode deltas (used in video). |
| Adaptive bitrate | Multiple quality levels for streaming. |
Common Pitfalls
- Saying "compression makes data smaller" without distinguishing lossless vs lossy.
- Encrypting before compressing (encrypted data looks random, so it no longer compresses; compress first, then encrypt).
- Treating storage as the only constraint (bandwidth often matters more).
- Using lossy formats for archival or medical records.
- Ignoring openness / licensing of codecs.