Coefficient of Determination - Complete Interactive Lesson
Part 1: Scatterplots and Correlation
📈 Scatterplots and Correlation
Part 1 of 7 — Exploring Bivariate Relationships
Describing Scatterplots
When examining a scatterplot, describe:
| Feature | Options |
| --- | --- |
| Direction | Positive, negative, or none |
| Form | Linear, curved, or no pattern |
| Strength | Strong, moderate, or weak |
| Outliers | Any points that don't fit the pattern |
Correlation Coefficient r
The correlation r measures the strength and direction of a linear relationship:
$r = \dfrac{1}{n-1}\sum\left(\dfrac{x_i-\bar{x}}{s_x}\right)\left(\dfrac{y_i-\bar{y}}{s_y}\right)$
| Value of r | Interpretation |
| --- | --- |
| $r = 1$ | Perfect positive linear |
| $r = -1$ | Perfect negative linear |
Important Properties of r
−1≤r≤1 always
r has no units
r is not affected by changes in units (e.g., inches to cm)
r measures only linear association: a strong curved relationship can have $r \approx 0$
🔑 Correlation does NOT imply causation. A strong correlation between two variables does not mean one causes the other.
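The properties listed above can be checked numerically. The following is a minimal sketch, not part of the original lesson: the height and weight values are made up for illustration, and `correlation` is a hypothetical helper that applies the z-score formula for r directly.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """r = (1/(n-1)) * sum of z_x * z_y, using sample standard deviations."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical data: heights in inches and weights in pounds.
heights_in = [60, 62, 65, 68, 70, 72]
weights_lb = [115, 120, 140, 155, 160, 180]

r_inches = correlation(heights_in, weights_lb)

# Converting inches to centimeters rescales x but should leave r unchanged.
heights_cm = [h * 2.54 for h in heights_in]
r_cm = correlation(heights_cm, weights_lb)

# A perfectly curved (quadratic) pattern that is symmetric about x = 0.
xs_curved = [-3, -2, -1, 0, 1, 2, 3]
ys_curved = [x ** 2 for x in xs_curved]
r_curved = correlation(xs_curved, ys_curved)

print(round(r_inches, 3), round(r_cm, 3), round(r_curved, 3))
```

The first two printed values should agree (no effect from the unit change), and the third should be essentially zero even though y is completely determined by x.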
Correlation Check 🎯
Correlation Practice 🧮
1) If $r = 0.72$, what is $r^2$? (Round to 2 decimal places)
2) What percentage of variation in y is explained by the linear relationship with x? (Use $r^2$ from #1, express as a whole number)
Part 2: Least-Squares Regression Line
📊 Least-Squares Regression Line
Part 2 of 7 — The LSRL
Topics in This Part
Section
📐 What the LSRL Minimizes
🧮 The Equation & Slope/Intercept
📝 Interpreting Slope and Intercept
📊 Predictions & Extrapolation
🔑 Key Concept: The least-squares regression line (LSRL) is the line that minimizes the sum of the squared residuals — the best-fit line through a scatterplot.
The LSRL Equation
$\hat{y} = a + bx$
Part 3: Residuals and Residual Plots
📊 Residuals and Residual Plots
Part 3 of 7 — Assessing the Fit of a Linear Model
Topics in This Part
Section
📐 What Is a Residual?
📊 Residual Plots
✅ Good vs. Bad Patterns
📝 Worked Example
🔑 Key Concept: A residual is the vertical distance from a data point to the regression line. Residual plots help us assess whether a linear model is appropriate.
Residual Formula
$e_i = y_i - \hat{y}_i = \text{observed} - \text{predicted}$
Part 4: Coefficient of Determination
📊 Coefficient of Determination
Part 4 of 7 — Understanding r2
Topics in This Part
Section
📐 What $r^2$ Measures
🧮 Calculating $r^2$ from $r$
Part 5: Influential Points and Outliers
📊 Influential Points and Outliers
Part 5 of 7 — Leverage, Influence, and Unusual Observations
Topics in This Part
Section
⚠️ Outliers in Regression
📐 High-Leverage Points
🔄 Influential Points
🧪 Diagnosing Unusual Points
🔑 Key Concept: Not all unusual points are equally problematic. Some change the regression line dramatically (influential), while others are just far from the pattern (outliers).
Three Types of Unusual Points
1. Outlier (in y-direction)
A point whose y-value is far from the predicted value $\hat{y}$ (large residual)
Part 6: Problem-Solving Workshop
📊 Problem-Solving Workshop
Part 6 of 7 — Full Regression Analysis Problems
Workshop Goals
Skill
📐 Compute and interpret the LSRL
📝 Interpret slope, intercept, $r$, and $r^2$ in context
📉 Analyze residuals and residual plots
⚠️ Identify unusual/influential points
🎯 Recognize the limits of the model
🔑 AP Tip: Free-response regression questions typically ask you to interpret the slope and $r^2$ in context, describe the residual plot, and discuss whether the model is appropriate.
Part 7: Review & Applications
📊 Review & Applications
Part 7 of 7 — Comprehensive Linear Regression Review
Complete Formula Reference
Concept
Formula
LSRL
$\hat{y} = a + bx$
Slope
r=0
No linear relationship
$0.8 \leq
r
$0.5 \leq
r
r≈0
r is sensitive to outliers
r2
3) If every data point falls exactly on the line y=3x+2, then r=?
$\hat{y} = a + bx$
where:
$b = r \cdot \dfrac{s_y}{s_x}$ (slope)
$a = \bar{y} - b\bar{x}$ (intercept)
The line always passes through the point $(\bar{x}, \bar{y})$
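The last fact follows directly from the intercept formula. As a short worked step, substitute $x = \bar{x}$ into the line and replace $a$ with $\bar{y} - b\bar{x}$:

```latex
\begin{aligned}
\hat{y}\,\big|_{x=\bar{x}} &= a + b\bar{x} \\
                           &= (\bar{y} - b\bar{x}) + b\bar{x} \\
                           &= \bar{y}
\end{aligned}
```

So the predicted value at the mean of x is always the mean of y, whatever the slope is.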
What LSRL Minimizes
LSRL minimizes $\sum (y_i - \hat{y}_i)^2 = \sum e_i^2$
This is the sum of squared residuals, hence "least squares."
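To see the minimizing property in action, here is a small sketch that fits the line with the slope and intercept formulas and then compares its sum of squared residuals against nearby lines. The data and the helper names `lsrl` and `sse` are assumptions for illustration, not from the lesson.

```python
from statistics import mean, stdev

def lsrl(xs, ys):
    """Return (a, b) for y-hat = a + b*x, using b = r*s_y/s_x and a = y-bar - b*x-bar."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    b = r * s_y / s_x
    a = y_bar - b * x_bar
    return a, b

def sse(a, b, xs, ys):
    """Sum of squared residuals for the line y-hat = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data: hours studied and exam scores.
hours = [1, 2, 3, 4, 5, 6]
scores = [52, 58, 61, 70, 72, 80]

a, b = lsrl(hours, scores)
best = sse(a, b, hours, scores)

# Nudging either coefficient away from the LSRL values should only raise the SSE.
nudged = [sse(a + da, b + db, hours, scores)
          for da in (-1, 0, 1) for db in (-0.5, 0, 0.5) if (da, db) != (0, 0)]

print(f"y-hat = {a:.2f} + {b:.2f}x, SSE = {best:.2f}")
print(all(s > best for s in nudged))
```

Trying a handful of nearby lines is only a spot check, not a proof, but it illustrates what "least squares" means.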
Interpreting Slope
Template: "For each additional [1 unit of x], the predicted [y variable] changes by [b units], on average."
Example: $\hat{y} = 12 + 3.5x$, where x = hours studied and y = exam score.
✅ "For each additional hour studied, the predicted exam score increases by 3.5 points, on average."
⚠️ AP Tip: Include "predicted" and "on average" for full credit.
Interpreting Intercept
Template: "When x=0, the predicted [y variable] is [a]."
Example: $a = 12$ in $\hat{y} = 12 + 3.5x$.
✅ "When a student studies 0 hours, the predicted exam score is 12 points."
⚠️ Caution: The intercept often has no practical meaning (e.g., studying 0 hours). State the interpretation but note if x=0 is outside the data range.
Predictions & Extrapolation
| Term | Definition |
| --- | --- |
| Interpolation | Predicting within the range of observed x values ✓ |
| Extrapolation | Predicting outside the range of observed x values ⚠️ |
⚠️ Extrapolation is unreliable. The linear relationship may not hold outside the data range.
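One way to make this habit concrete is to have a prediction helper report whether the requested x lies outside the observed range. This is a sketch under assumptions: `predict` is a hypothetical helper, and the observed range of 1 to 10 hours is invented for illustration. The line is the $\hat{y} = 12 + 3.5x$ example from the slope interpretation above.

```python
def predict(a, b, x, x_min, x_max):
    """Return (y_hat, is_extrapolation) for the line y-hat = a + b*x."""
    y_hat = a + b * x
    return y_hat, not (x_min <= x <= x_max)

# Line from the example: y-hat = 12 + 3.5x (x = hours studied).
# The observed range of 1 to 10 hours is an assumption for illustration.
inside = predict(12, 3.5, 6, x_min=1, x_max=10)
outside = predict(12, 3.5, 30, x_min=1, x_max=10)

print(inside)   # (33.0, False)
print(outside)  # (117.0, True)
```

If the exam is scored out of 100 (an assumption), the second prediction is impossible, which is exactly the kind of failure extrapolation can produce.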
LSRL Concepts 🎯
LSRL Calculations 🧮
Given: $\bar{x} = 10$, $\bar{y} = 25$, $s_x = 4$, $s_y = 8$, $r = 0.85$.
1) What is the slope b?
2) What is the intercept a?
3) What is $\hat{y}$ when $x = 12$?
Interpretation Practice 🔍
$\hat{y} = 50 - 0.8x$, where x = temperature (°F) and y = hot chocolate sales.
Exit Quiz — LSRL ✅
$e_i = y_i - \hat{y}_i = \text{observed} - \text{predicted}$
| Sign | Meaning |
| --- | --- |
| $e > 0$ | Point is above the line: model underestimates |
| $e < 0$ | Point is below the line: model overestimates |
| $e = 0$ | Point is exactly on the line |
Properties of Residuals
$\sum e_i = 0$ (residuals always sum to zero)
The mean of residuals = 0
$\sum e_i^2$ is minimized by the LSRL
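The zero-sum property is easy to verify on any data set. The sketch below uses made-up numbers and computes the slope as the sum of cross-deviations divided by the sum of squared x-deviations, which is algebraically the same as $r \cdot s_y / s_x$.

```python
from statistics import mean

# Hypothetical data set.
xs = [2, 4, 5, 7, 9]
ys = [11, 18, 19, 27, 30]

x_bar, y_bar = mean(xs), mean(ys)
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
     / sum((x - x_bar) ** 2 for x in xs))
a = y_bar - b * x_bar

# Residual = observed - predicted for each point.
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

print([round(e, 3) for e in residuals])
print(abs(sum(residuals)) < 1e-9)  # True, up to floating-point rounding
```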
Residual Plots
A residual plot shows the residuals (e) on the y-axis against the explanatory variable (x) or the fitted values ($\hat{y}$) on the x-axis.
Reading Residual Plots
| Pattern | Interpretation |
| --- | --- |
| Random scatter around $e = 0$ | ✅ Linear model is appropriate |
| Curved pattern (U or ∩) | ❌ Relationship is nonlinear; use a transformation |
| Fan shape (spread changes) | ❌ Non-constant variance; predictions are less reliable at some x values |
| Outliers | ⚠️ Individual points far from $e = 0$; investigate |
🔑 AP Tip: The residual plot is your most important diagnostic tool. ALWAYS examine it before trusting a regression.
Worked Example
$\hat{y} = 10 + 2x$. Data point: $(5, 23)$.
$\hat{y} = 10 + 2(5) = 20$
$e = 23 - 20 = 3$
The residual is +3: the observed value is 3 units above the predicted value.
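The same calculation can be written as a tiny function, together with the sign rule from the table of residual signs. The helper names `residual` and `describe` are hypothetical, chosen for this sketch.

```python
def residual(a, b, x, y):
    """Observed minus predicted for the line y-hat = a + b*x."""
    return y - (a + b * x)

def describe(e):
    """Translate the sign of a residual into words."""
    if e > 0:
        return "point above the line: model underestimates"
    if e < 0:
        return "point below the line: model overestimates"
    return "point exactly on the line"

e = residual(10, 2, 5, 23)   # the worked example: y-hat = 10 + 2x, point (5, 23)
print(e, "->", describe(e))  # 3 -> point above the line: model underestimates
```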
Residual Concepts 🎯
Residual Calculations 🧮
LSRL: $\hat{y} = 15 + 4x$
1) Point (3,30). Residual e=
2) Point (5,33). Residual e=
3) Point (2,23). The model ___ (enter "overestimates" or "underestimates").
Residual Plot Patterns 🔍
Exit Quiz — Residuals ✅
r2
r
📝 Interpreting $r^2$ on the AP Exam
🔗 $r$ vs. $r^2$
🔑 Key Concept: $r^2$ tells you the fraction of variability in y that is explained by the linear relationship with x.
The Definition
$r^2 = \dfrac{SSR}{SST} = 1 - \dfrac{SSE}{SST}$
where:
SST = total sum of squares = $\sum (y_i - \bar{y})^2$ (total variability in y)
SSE = sum of squared errors = $\sum (y_i - \hat{y}_i)^2$ (unexplained variability)
SSR = regression sum of squares = SST − SSE (explained variability)
Or simply: $r^2 = r \times r$ (square the correlation coefficient).
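The sketch below computes $r^2$ three ways (SSR/SST, 1 − SSE/SST, and the square of r) on a small made-up data set, to show that the definitions agree for a least-squares line with one explanatory variable. All data values are assumptions for illustration.

```python
from statistics import mean

# Hypothetical data set.
xs = [1, 2, 3, 4, 5, 6]
ys = [3, 7, 6, 10, 13, 12]

x_bar, y_bar = mean(xs), mean(ys)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
sst = sum((y - y_bar) ** 2 for y in ys)                    # total variability in y

# Least-squares slope and intercept.
b = s_xy / s_xx
a = y_bar - b * x_bar

sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))  # unexplained variability
ssr = sst - sse                                            # explained variability

# Correlation coefficient from the same sums of squares.
r = s_xy / (s_xx * sst) ** 0.5

print(round(ssr / sst, 4), round(1 - sse / sst, 4), round(r ** 2, 4))
```

All three printed values should match.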
Interpretation Template
"[$r^2 \times 100$]% of the variability in [y context] is explained by the linear relationship with [x context]."
Example: $r^2 = 0.72$, x = hours studied, y = exam score.
✅ "72% of the variability in exam scores is explained by the linear relationship with hours studied."
⚠️ AP Tip: Always say "variability in [y]" and "linear relationship with [x]." Do not say "caused by" or "due to."
$r$ vs. $r^2$

| Statistic | Measures | Range |
| --- | --- | --- |
| $r$ | Direction and strength of linear relationship | $-1 \le r \le 1$ |
| $r^2$ | Proportion of variability explained | $0 \le r^2 \le 1$ |
| $r$ | $r^2$ | Strength |
| --- | --- | --- |
| ±0.9 | 0.81 | Strong |
| ±0.7 | 0.49 | Moderate |
| ±0.5 | 0.25 | Weak |
| ±0.3 | 0.09 | Very weak |
🔑 Key Insight: Even with a "moderate" $r = 0.7$, the linear relationship explains only 49% of the variability in y. Much of the variation remains unexplained.
$r^2$ Concepts 🎯
$r^2$ Calculations 🧮
1) $r = 0.9$. What is $r^2$?
2) $r^2 = 0.49$. What percentage of variability is explained?
3) SST = 500, SSE = 125. What is $r^2$?
Interpretation Practice 🔍
Exit Quiz — r2 ✅
y^
Has an unusually large ∣residual∣
Does NOT necessarily change the regression line much
2. High-Leverage Point (in x-direction)
A point whose x-value is far from $\bar{x}$
Has the potential to influence the regression line
May or may not actually change the line — depends on where it falls
3. Influential Point
A point that, when removed, substantially changes the slope, intercept, or $r^2$
High-leverage points that are also outliers are the most influential
Test: Fit the LSRL with and without the point. If the slope, intercept, or $r^2$ changes substantially, the point is influential.
Visualizing the Distinction
| Scenario | Large residual? | Far from $\bar{x}$? | Influential? |
| --- | --- | --- | --- |
| Regular point near center | No | No | No |
| Outlier near center of x | Yes | No | Usually no |
| Point at extreme x, on the line | No | Yes | Usually no |
| Point at extreme x, off the line | Yes | Yes | Yes |
Worked Example
A researcher collects data on advertising spending (x, in thousands of dollars) and sales (y, in thousands of dollars) for 10 stores.
Most stores spend $2K–$8K. One store spent $25K (high leverage).
Scenario A: That store had $50K in sales, fitting the overall pattern → high leverage but NOT influential.
Scenario B: That store had $5K in sales, far below the trend → high leverage AND influential. Removing it would substantially change the slope.
⚠️ AP Tip: On the AP exam, "influential" specifically means removing the point changes the regression equation meaningfully. Always describe the effect on slope, intercept, or $r^2$.
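The with-and-without test can be sketched in a few lines. The numbers below are invented to resemble Scenarios A and B (they are not the researcher's data, which the lesson does not list), and `fit` and `slope_change` are hypothetical helpers.

```python
from statistics import mean

def fit(points):
    """Least-squares (intercept, slope) for a list of (x, y) pairs."""
    xs, ys = zip(*points)
    x_bar, y_bar = mean(xs), mean(ys)
    b = (sum((x - x_bar) * (y - y_bar) for x, y in points)
         / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b * x_bar, b

def slope_change(points, suspect):
    """Absolute change in slope when `suspect` is dropped from the fit."""
    _, b_all = fit(points)
    _, b_without = fit([p for p in points if p != suspect])
    return abs(b_all - b_without)

# Hypothetical stores lying near y = 2x, plus one store at an extreme x.
base = [(2, 4.1), (3, 5.9), (4, 8.2), (5, 9.8), (6, 12.1), (7, 13.9), (8, 16.0)]
on_trend = (25, 50.0)    # extreme x, consistent with the pattern (like Scenario A)
off_trend = (25, 5.0)    # extreme x, far below the pattern (like Scenario B)

change_on = slope_change(base + [on_trend], on_trend)
change_off = slope_change(base + [off_trend], off_trend)

print(round(change_on, 3), round(change_off, 3))
```

With these made-up values, dropping the on-trend point barely moves the slope, while dropping the off-trend point changes it drastically. That difference is what "influential" describes.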
What to Do with Unusual Points
Investigate — is there a data-entry error or special circumstance?
Report both analyses — with and without the point
Never silently delete data — explain your reasoning
Check the residual plot — unusual points often show up clearly
Identifying Unusual Points 🎯
Diagnosing Points 🧮
1) The LSRL is $\hat{y} = 10 + 3x$. A point has $x = 5$, $y = 35$. What is the residual?
2) $\bar{x} = 12$. A point has $x = 45$. Is this point high-leverage? (yes/no)
3) With all points: slope = 1.8. Without point A: slope = 1.7. Without point B: slope = 4.5. Which point is more influential? (A/B)
Leverage and Influence Concepts 🔍
Exit Quiz — Influential Points & Outliers ✅
r2
Worked Example 1 — Temperature and Ice Cream Sales
A manager records daily high temperature (x, °F) and ice cream sales (y, $100s) for 15 summer days.
Computer output:
| Predictor | Coef | SE Coef | T | P |
| --- | --- | --- | --- | --- |
| Constant | −3.50 | 1.12 | −3.13 | 0.008 |
| Temperature | 0.15 | 0.013 | 11.54 | <0.001 |

S = 0.96, R-sq = 91.1%
Step 1: Write the LSRL: $\hat{y} = -3.50 + 0.15x$
Step 2: Interpret the slope:
"For each additional degree Fahrenheit in daily high temperature, the predicted ice cream sales increase by $15 (0.15 hundreds of dollars), on average."
Step 3: Interpret $r^2$:
"91.1% of the variability in ice cream sales is explained by the linear relationship with daily high temperature."
Step 4: Predict at $x = 85$°F:
$\hat{y} = -3.50 + 0.15(85) = -3.50 + 12.75 = 9.25$ (925 dollars in sales)
Step 5: Check appropriateness:
Residual plot shows no obvious pattern → linear model is appropriate
$r^2 = 0.911$ → strong linear fit
No influential points observed in the residual plot
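The numbers in this example can be reproduced from the computer output. The sketch below copies the coefficients and standard errors from the table and assumes the usual convention that each T value is the coefficient divided by its standard error. The function name `predicted_sales` is hypothetical.

```python
# Values copied from the computer output in the example.
constant, se_constant = -3.50, 1.12
slope, se_slope = 0.15, 0.013

def predicted_sales(temp_f):
    """Predicted ice cream sales, in hundreds of dollars."""
    return constant + slope * temp_f

y_hat = predicted_sales(85)
print(f"{y_hat:.2f} hundred dollars = ${y_hat * 100:.0f}")  # 9.25 hundred dollars = $925

# Each T value is the coefficient divided by its standard error.
t_constant = constant / se_constant
t_slope = slope / se_slope
print(t_constant, t_slope)
```

The two ratios should agree with the reported T values of −3.13 and 11.54 up to rounding.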
Worked Example 2 — Study Hours and GPA
A sample of 30 college students. x = weekly study hours, y = GPA.
LSRL: $\hat{y} = 1.85 + 0.052x$, $r = 0.68$, $r^2 = 0.462$
One student studies 42 hours/week (most study 5–25 hours) and has a GPA of 3.9.
Analysis:
Slope interpretation: "For each additional hour of weekly studying, the predicted GPA increases by 0.052 points, on average."
$r^2$ interpretation: "46.2% of the variability in GPA is explained by the linear relationship with weekly study hours."
The 42-hour student:
$\hat{y} = 1.85 + 0.052(42) = 4.034$, so the predicted GPA is 4.034
Residual $= 3.9 - 4.034 = -0.134$, a small residual
$x = 42$ is far from $\bar{x}$ → high leverage
But the residual is small, so the point lies close to the trend line → likely not influential
Prediction for 50 hours: $\hat{y} = 1.85 + 0.052(50) = 4.45$
This is extrapolation (beyond the data range), and the prediction exceeds 4.0 (max GPA), so it is unreliable.
Common Mistakes on the AP Exam
| Mistake | Correction |
| --- | --- |
| "Temperature causes sales to increase" | Use "is associated with" or "predicts" |
| "91.1% of the data falls on the line" | "$r^2$ measures variability explained, not % of points on the line" |
| Interpreting the intercept literally when $x = 0$ is outside the data | "The intercept has no practical interpretation because $x = 0$ is outside the range of data" |
| Forgetting units in slope interpretation | "For each additional [unit of x], [y] is predicted to [increase/decrease] by [slope] [units of y]" |
Regression Analysis Practice 🎯
Computations 🧮
LSRL: $\hat{y} = 5.2 + 1.3x$, $r = 0.85$
1) Predict y when $x = 10$.
2) What is $r^2$? (two decimal places)
3) Observed $y = 22$ when $x = 10$. What is the residual?
Interpretation Decisions 🔍
Exit Quiz — Regression Workshop ✅
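$b = r \cdot \dfrac{s_y}{s_x}$
Intercept
$a = \bar{y} - b\bar{x}$
Correlation
$r = \dfrac{1}{n-1}\sum\left(\dfrac{x_i-\bar{x}}{s_x}\right)\left(\dfrac{y_i-\bar{y}}{s_y}\right)$
$r^2$
$r^2 = 1 - \dfrac{SSE}{SST}$
Residual
$e_i = y_i - \hat{y}_i$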
Interpretation Templates (AP Exam Ready)
Slope: "For each additional [1 unit of x], the predicted [y in context] [increases/decreases] by [|b|] [units of y], on average."
Intercept: "When [x in context] is 0, the predicted [y in context] is [a] [units of y]." (Only if x=0 is in the data range.)
r: "There is a [strong/moderate/weak], [positive/negative], linear association between [x] and [y]."
$r^2$: "[$r^2 \times 100$]% of the variability in [y in context] is explained by the linear relationship with [x in context]."
Residual: "The actual [y in context] was [e] [units] [above/below] the value predicted by the model."
Key Concepts Summary
| Topic | Key Takeaway |
| --- | --- |
| Scatterplot | Always plot data first; describe direction, form, strength, unusual features |
| LSRL | Minimizes $\sum e_i^2$; passes through $(\bar{x}, \bar{y})$; $\sum e_i = 0$ |
| Slope & Intercept | Slope = rate of change; intercept = starting value (if meaningful) |
| Residuals | $e = y - \hat{y}$; residual plot checks model appropriateness |
| $r$ | Direction + strength; $-1 \le r \le 1$; only for linear relationships |
| $r^2$ | Proportion of variability explained; $0 \le r^2 \le 1$ |
| Outliers | Large residual; may or may not be influential |
| High Leverage | Extreme x-value; potential to influence |
| Influential | Removing changes slope/$r^2$ substantially |
| Extrapolation | Predicting outside the data range is unreliable |
Decision Guide
Is the relationship linear?
↓
Check scatterplot → Fit LSRL → Check residual plot
↓
Random scatter → Linear model OK
Curved pattern → Nonlinear model needed
↓
Interpret: slope, $r$, $r^2$ in context
↓
Check for unusual points
↓
Make predictions (within data range only)
🔑 AP Exam Strategy: Regression appears on the exam every year. Master the interpretation templates — they earn you full credit on free-response questions.