Residuals and Residual Plots
Assessing model fit
What are Residuals?
Residual: Signed vertical distance from point to regression line, residual = y - ŷ (observed - predicted)
Positive residual: Point above line (prediction too low)
Negative residual: Point below line (prediction too high)
Sum of residuals = 0 (always, for least-squares line)
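To make these properties concrete, here is a minimal Python sketch. The data values and the helper name fit_line are made up for illustration; the fit uses the standard least-squares formulas for slope and intercept.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


# Hypothetical data, chosen only for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = fit_line(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

for x, e in zip(xs, residuals):
    side = "above" if e > 0 else "below"
    print(f"x={x}: residual={e:+.2f} (point {side} line)")

# Zero up to floating-point rounding, because the line is the least-squares fit.
print("sum of residuals:", round(sum(residuals), 10))
```

Swapping in any other line (a different slope or intercept) will generally make the sum nonzero.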
Residual Plot
Residual Plot: Scatterplot with x on horizontal axis, residuals on vertical axis
Purpose:
- Check if linear model appropriate
- Identify patterns suggesting problems
- Detect outliers
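A residual plot is just the pairs (x, residual). The sketch below builds those pairs with plain Python and, as an assumption, draws them with matplotlib if that library is installed. The data and the names residual_points and save_residual_plot are hypothetical.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b


def residual_points(xs, ys):
    """Pairs (x, residual): x on the horizontal axis, residual on the vertical."""
    a, b = fit_line(xs, ys)
    return [(x, y - (a + b * x)) for x, y in zip(xs, ys)]


def save_residual_plot(points, path="residuals.png"):
    """Draw the residual plot to a file. Requires matplotlib (assumed installed)."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.scatter([x for x, _ in points], [e for _, e in points])
    ax.axhline(0, linestyle="--")  # reference line at residual = 0
    ax.set_xlabel("x")
    ax.set_ylabel("residual (y - y-hat)")
    fig.savefig(path)
    plt.close(fig)


# Hypothetical data, chosen only for illustration.
points = residual_points([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
for x, e in points:
    print(f"x={x}  residual={e:+.2f}")
```

Calling save_residual_plot(points) writes the picture; the printed pairs are enough to sketch the plot by hand.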
Ideal Residual Plot
Good residual plot shows:
- Random scatter around horizontal line at 0
- No clear pattern (no curve, fan shape, etc.)
- Constant variability across x-values
- No outliers (points far from 0)
Interpretation: Linear model is appropriate
Patterns Indicating Problems
Curved pattern:
- Linear model inappropriate
- Relationship is nonlinear
- Solution: Transform variables or use nonlinear model
Fan shape (increasing or decreasing spread):
- Non-constant variance (heteroscedasticity)
- Predictions less reliable for some x-values
- Solution: Transform variables
Outliers:
- Points far from horizontal band
- Check for errors or unusual cases
- Consider impact on regression line
Example 1: Good Residual Plot
Random scatter around 0, no pattern
Residuals randomly scattered above and below 0 with roughly constant spread.
Conclusion: Linear model appropriate
Example 2: Curved Residual Pattern
Residuals show U-shape or inverted U
Residuals form curved pattern (like parabola or inverted parabola).
Conclusion: Relationship is nonlinear, linear model inappropriate
Action: Try quadratic or other transformation
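A small sketch of how a curved pattern arises: fitting a straight line to data that actually follow y = x² (an illustrative choice, not data from this page) leaves residuals that are positive at both ends and negative in the middle.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b


# Illustrative nonlinear data: y = x^2 for x = 0..6.
xs = list(range(7))
ys = [x ** 2 for x in xs]

a, b = fit_line(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

# The residuals trace a U-shape: high at the ends, low in the middle.
for x, e in zip(xs, residuals):
    print(f"x={x}: residual={e:+.1f}")
```

The residuals still sum to zero, which is why a sum check alone cannot detect curvature; the pattern in the plot is what reveals it.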
Example 3: Fan Shape
Spread increases (or decreases) with x
Residuals spread out more (or less) as x increases, forming fan or cone shape.
Conclusion: Non-constant variance
Action: May need transformation (e.g., log)
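One rough way to put a number on a fan shape is to compare the typical residual size for small x against large x. The sketch below does that on made-up data whose scatter grows with x. The half-and-half split and the mean absolute residual are illustrative choices, not a formal test.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b


def mean_abs(values):
    return sum(abs(v) for v in values) / len(values)


# Hypothetical data: trend 2x plus a wobble whose size grows with x.
xs = list(range(1, 11))
ys = [2 * x + 0.5 * x * (-1) ** x for x in xs]

a, b = fit_line(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

half = len(xs) // 2
spread_low = mean_abs(residuals[:half])    # smaller x-values
spread_high = mean_abs(residuals[half:])   # larger x-values

print(f"typical |residual| for small x: {spread_low:.2f}")
print(f"typical |residual| for large x: {spread_high:.2f}")
```

A much larger value for the high-x half than the low-x half matches the widening cone seen in the plot.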
Standard Deviation of Residuals
Formula: s = √( Σ(y - ŷ)² / (n - 2) )
Measures: Typical prediction error
Interpretation: "Typical distance of points from regression line is about s [y-units]."
Smaller s → better predictions (points closer to line)
Note: Denominator is n - 2 (loses 2 df for slope and intercept)
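As a sketch of the calculation, assuming the usual definition s = sqrt(Σ(y - ŷ)² / (n - 2)), the code below computes s for a small made-up data set. The helper names are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b


def residual_sd(xs, ys):
    """Standard deviation of residuals, with n - 2 in the denominator."""
    a, b = fit_line(xs, ys)
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(sse / (len(xs) - 2))


# Hypothetical data, chosen only for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

s = residual_sd(xs, ys)
print(f"s = {s:.3f} (in the units of y)")
```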
Using s for Predictions
Rough prediction interval: ŷ ± 2s
Interpretation: About 95% of actual values fall within 2s of the predicted value (assuming roughly normal residuals with constant spread)
Example: ŷ = 150 pounds, s = 10 pounds
Prediction interval ≈ 150 ± 2(10) = 150 ± 20 = (130, 170) pounds
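The arithmetic of the rough interval (predicted value plus or minus 2s) can be sketched as a one-line function. The name rough_interval is hypothetical, and the result is only a rule-of-thumb interval, not an exact prediction interval.

```python
def rough_interval(y_hat, s):
    """Rule-of-thumb interval: predicted value plus or minus two residual SDs."""
    return (y_hat - 2 * s, y_hat + 2 * s)


# Values from the example: predicted weight 150 pounds, s = 10 pounds.
low, high = rough_interval(150, 10)
print(f"roughly {low} to {high} pounds")
```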
Outliers in Residual Plot
Outlier: Residual far from 0
Investigate:
- Data entry error?
- Unusual case?
- Measurement error?
Impact:
- Can affect regression line
- May indicate different subgroup
- Consider separate analysis with/without outlier
Checking Conditions for Regression
Use residual plot to check:
1. Linearity: Random scatter (no curve)
2. Equal variance: Constant spread across x-values
3. Independence: (Can't check from plot alone, depends on data collection)
4. Normality: (Check with histogram or normal probability plot of residuals)
Acronym: LINE (Linearity, Independence, Normality, Equal variance)
Histogram of Residuals
Purpose: Check if residuals approximately normal
Look for:
- Roughly symmetric
- Bell-shaped
- No severe outliers
Note: Normality less critical for large samples (CLT)
Normal Probability Plot of Residuals
Purpose: Check normality of residuals
Good plot:
- Points follow straight line
- Little deviation from line
Bad plot:
- Strong curvature
- Many points far from line
Interpretation: If roughly linear, normality assumption reasonable
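A normal probability plot pairs the sorted residuals with the z-scores a normal sample of the same size would be expected to have. The sketch below computes those pairs' correlation as a rough straightness measure. The plotting positions (i - 0.5)/n are one common convention, assumed here, and both residual lists are made up.

```python
from statistics import NormalDist


def normal_scores(n):
    """Expected z-scores for n ordered values, using positions (i - 0.5)/n."""
    return [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]


def correlation(us, vs):
    """Pearson correlation of two equal-length lists."""
    n = len(us)
    u_bar, v_bar = sum(us) / n, sum(vs) / n
    suv = sum((u - u_bar) * (v - v_bar) for u, v in zip(us, vs))
    suu = sum((u - u_bar) ** 2 for u in us)
    svv = sum((v - v_bar) ** 2 for v in vs)
    return suv / (suu * svv) ** 0.5


def straightness(residuals):
    """Correlation between sorted residuals and their normal scores."""
    ordered = sorted(residuals)
    return correlation(normal_scores(len(ordered)), ordered)


# Hypothetical residuals: one roughly symmetric set, one strongly skewed set.
symmetric = [-1.8, -0.9, -0.4, -0.1, 0.2, 0.5, 1.0, 1.5]
skewed = [-0.5, -0.5, -0.4, -0.4, -0.3, -0.2, 0.3, 2.0]

r_symmetric = straightness(symmetric)
r_skewed = straightness(skewed)
print(f"symmetric: {r_symmetric:.3f}   skewed: {r_skewed:.3f}")
```

A value near 1 corresponds to points hugging a straight line; the skewed set scores noticeably lower, matching the "strong curvature" case.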
Influential Points
Identified in residual plot:
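- Large residual AND far from x̄ (the mean of x) in the x-direction
- Caution: a point far from x̄ can pull the line toward itself and end up with a small residual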
Test influence:
- Calculate regression with point
- Calculate regression without point
- Compare: Big change? Point is influential
Action: Report both analyses, investigate why point is unusual
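The with/without comparison can be sketched in a few lines. The data are made up: five points lying exactly on y = 2x plus one point far out in the x-direction.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b


# Hypothetical data; the last point (15, 10) is far from the others in x.
xs = [1, 2, 3, 4, 5, 15]
ys = [2, 4, 6, 8, 10, 10]

a_with, b_with = fit_line(xs, ys)
a_without, b_without = fit_line(xs[:-1], ys[:-1])

print(f"with point:    y-hat = {a_with:.2f} + {b_with:.2f}x")
print(f"without point: y-hat = {a_without:.2f} + {b_without:.2f}x")
print(f"change in slope: {b_with - b_without:+.2f}")
```

A large change in slope or intercept between the two fits is the signal that the point is influential.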
Comparing Models
Use residual plots to compare different models:
Model 1 (linear): Residuals show pattern
Model 2 (quadratic): Residuals random scatter
Conclusion: Model 2 better (quadratic fits better than linear)
Also compare: Standard deviation of residuals (s)
- Smaller s = better predictions
Calculator Methods
TI-83/84:
Get residuals:
- Run LinReg (stores residuals automatically in RESID list)
- 2nd STAT (LIST) → RESID
Plot residuals:
- STAT PLOT → Plot1
- Type: Scatterplot
- Xlist: L1, Ylist: RESID
- ZOOM → 9:ZoomStat
Common Mistakes
- Not checking residual plot (just looking at r²)
- Using linear model when residuals show curve
- Ignoring fan shape in residuals
- Not investigating outliers
- Confusing residuals with errors
Residuals vs Errors
Residual: Observed - Predicted (y - ŷ)
- Calculated from sample
- Can compute
Error: Observed - True (y - E(y))
- Theoretical (unknown)
- Can't compute (don't know true relationship)
Residuals estimate errors
Transformations
If residual plot shows problems:
For curvature:
- Try log(y), √y, or x²
- Re-fit model with transformed variable
- Check new residual plot
For fan shape:
- Try log(y) transformation
- Stabilizes variance
Goal: Residuals with no pattern and constant spread
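The re-fit-and-recheck loop can be sketched as follows. The data are made up to follow y = 2e^(0.5x) exactly, so a straight line fits log(y) perfectly. Real data would leave some scatter after the transformation.

```python
import math


def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b


def residuals_for(xs, ys):
    a, b = fit_line(xs, ys)
    return [y - (a + b * x) for x, y in zip(xs, ys)]


# Hypothetical exponential data: y = 2 * e^(0.5x) for x = 0..6.
xs = list(range(7))
ys = [2 * math.exp(0.5 * x) for x in xs]

raw = residuals_for(xs, ys)                           # line fitted to y
logged = residuals_for(xs, [math.log(y) for y in ys])  # line fitted to log(y)

print("raw residuals:   ", [round(e, 2) for e in raw])
print("log(y) residuals:", [round(e, 6) for e in logged])
```

The raw residuals are positive at both ends and negative in the middle (a curve), while the residuals after the log transformation show no pattern. Residuals from the two fits are on different scales, so compare their patterns rather than their sizes.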
Quick Reference
Residual: y - ŷ (observed - predicted)
Good residual plot:
- Random scatter around 0
- No pattern
- Constant spread
s: Typical prediction error
Check conditions: LINE (Linearity, Independence, Normality, Equal variance)
Problems to look for:
- Curved pattern → nonlinear
- Fan shape → non-constant variance
- Outliers → investigate
Remember: Always examine residual plot! It reveals whether linear model is appropriate and highlights potential problems. Don't rely on correlation alone!
Practice Problems
Problem 1 (easy)
Question:
For regression ŷ = 10 + 2x, calculate the residual for the point (5, 25).
Solution:
Step 1: Identify actual value
Point (5, 25): x = 5, y = 25 (actual)
Step 2: Calculate predicted value
ŷ = 10 + 2(5) = 10 + 10 = 20
Step 3: Calculate residual
Residual = y - ŷ = 25 - 20 = 5
Step 4: Interpret
The residual is POSITIVE (+5), meaning:
- Actual value is ABOVE predicted value
- Point is 5 units above the regression line
- Model UNDERESTIMATES by 5 units
Answer: Residual = 5 (point is above the line)
Problem 2 (medium)
Question:
A residual plot shows points scattered randomly around zero with no pattern. What does this indicate?
Solution:
Step 1: Understand what random scatter means
Good residual plot characteristics:
- Points scattered RANDOMLY
- No curved, U-shaped, or other patterns
- Roughly equal spread at all x values
- Centered around residual = 0
Step 2: What this indicates
The linear model is APPROPRIATE:
- Linear relationship is reasonable (no curved pattern)
- Constant variance (homoscedasticity)
- No systematic errors
- Independence is NOT confirmed by the plot; it depends on how the data were collected
Step 3: What to do
- Can proceed with predictions
- Can use confidence intervals, provided independence and normality also hold
- Linear model is supported by the residuals
Answer: Random scatter indicates the linear model is APPROPRIATE. A linear form is reasonable, variance is roughly constant, and there are no systematic errors.
Problem 3 (medium)
Question:
A residual plot shows a curved (U-shaped) pattern. What does this suggest and what should you do?
Solution:
Step 1: Identify the problem
U-shaped or curved residual plot means: Linear model is INAPPROPRIATE
The relationship is actually nonlinear (curved).
Step 2: Why this is a problem
- Linear model makes systematic errors
- Underestimates in middle, overestimates at extremes (or vice versa)
- Predictions will be biased
- Violates linearity assumption
Step 3: Solutions
Option 1: Transform the data
- Try log(y) vs x, or √y vs x
- Replot residuals - should become random
Option 2: Use nonlinear regression
- Quadratic: y = a + bx + cx²
- Exponential: y = ae^(bx)
Step 4: Check new model
After transformation, residual plot should show random scatter.
Answer: Curved residuals indicate NONLINEAR relationship. Transform variables (log, square root) or use nonlinear regression. Recheck residuals after adjustment.
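Problem 4 (medium)
Question:
For points (1,3), (2,5), (3,6) and the line ŷ = 2 + 1.5x, calculate the residuals and check whether they sum to zero.
Solution:
Step 1: Calculate predicted values
Point 1: ŷ₁ = 2 + 1.5(1) = 3.5
Point 2: ŷ₂ = 2 + 1.5(2) = 5
Point 3: ŷ₃ = 2 + 1.5(3) = 6.5
Step 2: Calculate residuals
Residual = y - ŷ
Point 1: e₁ = 3 - 3.5 = -0.5
Point 2: e₂ = 5 - 5 = 0
Point 3: e₃ = 6 - 6.5 = -0.5
Step 3: Sum residuals
Σ(residuals) = -0.5 + 0 + (-0.5) = -1.0
This is not zero, and the gap is far too large to be rounding error, so ŷ = 2 + 1.5x is not the least-squares line for these points.
Step 4: Find the least-squares line
x̄ = 2, ȳ = 14/3, slope = 3/2 = 1.5, intercept = 14/3 - 1.5(2) = 5/3
Least-squares line: ŷ = 5/3 + 1.5x
Residuals: 3 - 19/6 = -1/6, 5 - 14/3 = 1/3, 6 - 37/6 = -1/6
Sum: -1/6 + 1/3 - 1/6 = 0
Step 5: Why residuals sum to zero
Mathematical property: For least-squares regression (with an intercept), Σ(y - ŷ) = 0 ALWAYS
- Guaranteed by the formulas
- Positive and negative errors balance
- Line passes through (x̄, ȳ), the "middle" of the data
Answer: For ŷ = 2 + 1.5x the residuals sum to -1.0, so that line is not the least-squares fit. For the least-squares line ŷ = 5/3 + 1.5x they sum to exactly zero, as they ALWAYS do.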
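As a numerical cross-check, the sketch below sums the residuals for the stated line ŷ = 2 + 1.5x and then refits the three points by least squares, so the two residual sums can be compared directly. The helper names are hypothetical.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b


def residual_sum(a, b, xs, ys):
    """Sum of y - (a + b*x) over all points."""
    return sum(y - (a + b * x) for x, y in zip(xs, ys))


# The three points from the problem.
xs = [1, 2, 3]
ys = [3, 5, 6]

stated_sum = residual_sum(2, 1.5, xs, ys)   # line given in the question
a_ls, b_ls = fit_line(xs, ys)               # line refitted by least squares
fitted_sum = residual_sum(a_ls, b_ls, xs, ys)

print(f"stated line:        sum of residuals = {stated_sum:.4f}")
print(f"least-squares line: y-hat = {a_ls:.4f} + {b_ls:.4f}x, "
      f"sum of residuals = {fitted_sum:.4f}")
```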
Problem 5 (hard)
Question:
A residual plot shows increasing spread (fan shape) as x increases. What does this violate and what are the implications?
Solution:
Step 1: Identify the violation
Fan-shaped residuals violate: CONSTANT VARIANCE (homoscedasticity)
The spread increases with x (heteroscedasticity).
Step 2: Implications for predictions
- Predictions less reliable at high x (wide spread)
- Predictions more reliable at low x (tight spread)
- Standard errors are WRONG
- Confidence intervals misleading
Step 3: Implications for inference
- t-tests may be invalid
- p-values unreliable
- Hypothesis tests have wrong error rates
- Can't trust significance levels
Note: Estimates (slope, intercept) are still unbiased, but uncertainty measures are wrong.
Step 4: Solutions
- Transform y (try log(y) or √y)
- Use weighted least squares
- Use robust standard errors
- Report with caution
Answer: Violates CONSTANT VARIANCE assumption. Standard errors and confidence intervals unreliable. Solutions: transform y, use weighted least squares, or robust standard errors.