Residuals and Residual Plots

Assessing model fit

What are Residuals?

Residual: Vertical distance from point to regression line

$\text{residual} = y - \hat{y} = \text{observed} - \text{predicted}$

Positive residual: Point above line (prediction too low)
Negative residual: Point below line (prediction too high)

Sum of residuals = 0 (always, for least-squares line)
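The definition and the sum-to-zero property can be checked numerically. This is a minimal sketch in plain Python; the helper names `fit_line` and `residuals` and the data values are made up for illustration and do not come from the notes.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def residuals(xs, ys):
    """Residual = observed y minus predicted y-hat, one per data point."""
    a, b = fit_line(xs, ys)
    return [y - (a + b * x) for x, y in zip(xs, ys)]


# Made-up illustrative data (an assumption, not from the notes)
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

res = residuals(xs, ys)
print([round(e, 3) for e in res])   # mix of positive and negative residuals
print(abs(sum(res)) < 1e-9)         # sum is zero up to floating-point rounding
```

Positive entries are points above the line, negative entries are points below it.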

Residual Plot

Residual Plot: Scatterplot with x on horizontal axis, residuals on vertical axis

Purpose:

Check if linear model appropriate
Identify patterns suggesting problems
Detect outliers

Ideal Residual Plot

Good residual plot shows:

Random scatter around horizontal line at 0
No clear pattern (no curve, fan shape, etc.)
Constant variability across x-values
No outliers (points far from 0)

Interpretation: Linear model is appropriate

Patterns Indicating Problems

Curved pattern:

Linear model inappropriate
Relationship is nonlinear
Solution: Transform variables or use nonlinear model

Fan shape (increasing or decreasing spread):

Non-constant variance (heteroscedasticity)
Predictions less reliable for some x-values
Solution: Transform variables

Outliers:

Points far from horizontal band
Check for errors or unusual cases
Consider impact on regression line

Example 1: Good Residual Plot

Random scatter around 0, no pattern

Residuals randomly scattered above and below 0 with roughly constant spread.

Conclusion: Linear model appropriate

Example 2: Curved Residual Pattern

Residuals show U-shape or inverted U

Residuals form curved pattern (like parabola or inverted parabola).

Conclusion: Relationship is nonlinear, linear model inappropriate

Action: Try quadratic or other transformation

Example 3: Fan Shape

Spread increases (or decreases) with x

Residuals spread out more (or less) as x increases, forming fan or cone shape.

Conclusion: Non-constant variance

Action: May need transformation (e.g., log)
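A fan shape is usually judged by eye, but a rough numeric companion is to compare the spread of residuals for small x against the spread for large x. The sketch below is an informal check, not a formal test; the function name `spread_by_half`, the residual values, and the "more than double" threshold are all assumptions chosen for illustration.

```python
import statistics


def spread_by_half(xs, residuals):
    """Return (sd of residuals in the lower-x half, sd in the upper-x half)."""
    pairs = sorted(zip(xs, residuals))      # order the points by x
    mid = len(pairs) // 2
    low = [e for _, e in pairs[:mid]]
    high = [e for _, e in pairs[mid:]]
    return statistics.pstdev(low), statistics.pstdev(high)


# Hypothetical residuals that fan out as x grows
xs = [1, 2, 3, 4, 5, 6, 7, 8]
res = [0.1, -0.2, 0.3, -0.4, 1.5, -2.0, 2.5, -3.0]

low_sd, high_sd = spread_by_half(xs, res)
print(round(low_sd, 2), round(high_sd, 2))
print(high_sd > 2 * low_sd)   # arbitrary threshold: spread more than doubles
```

A large gap between the two values points toward non-constant variance; similar values are consistent with constant spread.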

Standard Deviation of Residuals

Measures: Typical prediction error

$s = \sqrt{\frac{\sum(y - \hat{y})^2}{n-2}}$

Interpretation: "Typical distance of points from regression line is about s [y-units]."

Smaller s → better predictions (points closer to line)

Note: Denominator is n-2 (loses 2 df for slope and intercept)
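The formula for s translates directly into code. A minimal sketch, assuming the predicted values are already available; the function name `residual_sd` and the numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math


def residual_sd(ys, y_hats):
    """s = sqrt( sum((y - y_hat)^2) / (n - 2) ) for simple linear regression."""
    n = len(ys)
    if n <= 2:
        raise ValueError("need more than 2 points for n - 2 degrees of freedom")
    sse = sum((y - y_hat) ** 2 for y, y_hat in zip(ys, y_hats))
    return math.sqrt(sse / (n - 2))


# Hypothetical observed and predicted values (arithmetic illustration only)
ys = [12, 15, 19, 22]
y_hats = [13, 15, 18, 22]

s = residual_sd(ys, y_hats)
print(s)  # 1.0, since the squared residuals sum to 2 and n - 2 = 2
```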

Using s for Predictions

Rough prediction interval:

$\hat{y} \pm 2s$

Interpretation: About 95% of actual values fall within 2s of the predicted value (assuming residuals are roughly normal with constant spread)

Example: $\hat{y}$ = 150 pounds, s = 10 pounds

Prediction interval ≈ 150 ± 20 = (130, 170) pounds
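The same arithmetic as a small helper. This is only a sketch of the rough ŷ ± 2s rule from this section, not a formal prediction interval; the function name `rough_prediction_interval` and the multiplier parameter `k` are assumptions.

```python
def rough_prediction_interval(y_hat, s, k=2):
    """Return (low, high) = y_hat -/+ k*s; k=2 gives the rough 95% rule."""
    return (y_hat - k * s, y_hat + k * s)


# Numbers from the example: y-hat = 150 pounds, s = 10 pounds
print(rough_prediction_interval(150, 10))  # (130, 170)
```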

Outliers in Residual Plot

Outlier: Residual far from 0

Investigate:

Data entry error?
Unusual case?
Measurement error?

Impact:

Can affect regression line
May indicate different subgroup
Consider separate analysis with/without outlier

Checking Conditions for Regression

Use residual plot to check:

1. Linearity: Random scatter (no curve)

2. Equal variance: Constant spread across x-values

3. Independence: (Can't check from plot alone, depends on data collection)

4. Normality: (Check with histogram or normal probability plot of residuals)

Acronym: LINE (Linearity, Independence, Normality, Equal variance)

Histogram of Residuals

Purpose: Check if residuals approximately normal

Look for:

Roughly symmetric
Bell-shaped
No severe outliers

Note: Normality less critical for large samples (CLT)

Normal Probability Plot of Residuals

Purpose: Check normality of residuals

Good plot:

Points follow straight line
Little deviation from line

Bad plot:

Strong curvature
Many points far from line

Interpretation: If roughly linear, normality assumption reasonable
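The coordinates of a normal probability plot can be computed by hand: sort the residuals and pair each with the standard-normal quantile for its position. A minimal sketch using Python's standard library; the function name `normal_plot_points`, the residual values, and the plotting position (i + 0.5)/n are assumptions (software packages use slightly different positions).

```python
from statistics import NormalDist


def normal_plot_points(residuals):
    """Pair each sorted residual with its matching standard-normal quantile.

    Returns a list of (z, residual) tuples, smallest residual first.
    """
    n = len(residuals)
    ordered = sorted(residuals)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # (i + 0.5) / n is one common choice of plotting position
    return [(std_normal.inv_cdf((i + 0.5) / n), e)
            for i, e in enumerate(ordered)]


# Hypothetical residuals
res = [-1.2, 0.4, -0.3, 0.9, 0.1, -0.6, 1.5, -0.8]

pts = normal_plot_points(res)
for z, e in pts:
    print(f"{z:6.2f}  {e:5.2f}")
```

Plotting the second column against the first gives the normal probability plot; points close to a straight line support the normality condition.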

Influential Points

Identified in scatterplot or residual plot:

Far from $\bar{x}$ in x-direction (high leverage); residual may be small because the point pulls the line toward itself

Test influence:

Calculate regression with point
Calculate regression without point
Compare: Big change? Point is influential

Action: Report both analyses, investigate why point is unusual
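The with/without comparison can be sketched in a few lines. The helper names `fit_line` and `influence` and the data are hypothetical; the last point is deliberately placed far from the other x-values.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def influence(xs, ys, index):
    """Fit with and without the point at `index`; return both fits."""
    full = fit_line(xs, ys)
    xs_wo = xs[:index] + xs[index + 1:]
    ys_wo = ys[:index] + ys[index + 1:]
    return full, fit_line(xs_wo, ys_wo)


# Hypothetical data: the last point sits far from the other x-values
xs = [1, 2, 3, 4, 5, 15]
ys = [2, 4, 6, 8, 10, 5]

(with_a, with_b), (wo_a, wo_b) = influence(xs, ys, 5)
print(f"with point:    y-hat = {with_a:.2f} + {with_b:.2f}x")
print(f"without point: y-hat = {wo_a:.2f} + {wo_b:.2f}x")
```

Here the slope changes from about 0.08 with the point to 2 without it, so the point is influential.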

Comparing Models

Use residual plots to compare different models:

Model 1 (linear): Residuals show pattern
Model 2 (quadratic): Residuals random scatter

Conclusion: Model 2 better (quadratic fits better than linear)

Also compare: Standard deviation of residuals (s)

Smaller s = better predictions

Calculator Methods

TI-83/84:

Get residuals:

Run LinReg (stores residuals automatically in RESID list)
2nd STAT (LIST) → RESID

Plot residuals:

STAT PLOT → Plot1
Type: Scatterplot
Xlist: L1, Ylist: RESID
ZOOM → 9:ZoomStat

Common Mistakes

❌ Not checking residual plot (just looking at r²)
❌ Using linear model when residuals show curve
❌ Ignoring fan shape in residuals
❌ Not investigating outliers
❌ Confusing residuals with errors

Residuals vs Errors

Residual: Observed - Predicted (y - $\hat{y}$ )

Calculated from sample
Can compute

Error: Observed - True (y - E(y))

Theoretical (unknown)
Can't compute (don't know true relationship)

Residuals estimate errors

Transformations

If residual plot shows problems:

For curvature:

Try log(y), √y, or x²
Re-fit model with transformed variable
Check new residual plot

For fan shape:

Try log(y) transformation
Stabilizes variance

Goal: Residuals with no pattern and constant spread
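The refit-and-recheck loop can be illustrated with data that are exactly exponential. This is a sketch under that idealized assumption (real data will not straighten perfectly); the helper names and the data-generating formula y = 2e^(0.5x) are made up for illustration.

```python
import math


def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def residuals(xs, ys):
    """Residuals (observed minus predicted) from the least-squares line."""
    a, b = fit_line(xs, ys)
    return [y - (a + b * x) for x, y in zip(xs, ys)]


# Hypothetical data that grow exponentially: y = 2 * e^(0.5 x)
xs = [0, 1, 2, 3, 4, 5, 6]
ys = [2 * math.exp(0.5 * x) for x in xs]

raw = residuals(xs, ys)                             # line fit to y
logged = residuals(xs, [math.log(y) for y in ys])   # line fit to log(y)

print([round(e, 2) for e in raw])           # positive at both ends, negative in the middle
print(max(abs(e) for e in logged) < 1e-9)   # log(y) is exactly linear in x here
```

The U-shaped residuals from the first fit disappear after the log transformation, which is the pattern to look for when rechecking the residual plot.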

Quick Reference

Residual: $y - \hat{y}$

Good residual plot:

Random scatter around 0
No pattern
Constant spread

s: Typical prediction error

Check conditions: LINE (Linearity, Independence, Normality, Equal variance)

Problems to look for:

Curved pattern → nonlinear
Fan shape → non-constant variance
Outliers → investigate

Remember: Always examine residual plot! It reveals whether linear model is appropriate and highlights potential problems. Don't rely on correlation alone!

📚 Practice Problems

Problem 1 (easy)

❓ Question:

For regression ŷ = 10 + 2x, calculate the residual for the point (5, 25).

💡 Solution:

Step 1: Identify actual value
Point (5, 25): x = 5, y = 25 (actual)

Step 2: Calculate predicted value
ŷ = 10 + 2(5) = 10 + 10 = 20

Step 3: Calculate residual
Residual = y - ŷ
Residual = 25 - 20 = 5

Step 4: Interpret
The residual is POSITIVE (+5), meaning:

Actual value is ABOVE predicted value
Point is 5 units above the regression line
Model UNDERESTIMATES by 5 units

Answer: Residual = 5 (point is above the line)

Problem 2 (medium)

❓ Question:

A residual plot shows points scattered randomly around zero with no pattern. What does this indicate?

💡 Solution:

Step 1: Understand what random scatter means
Good residual plot characteristics:
✓ Points scattered RANDOMLY
✓ No curved, U-shaped, or other patterns
✓ Roughly equal spread at all x values
✓ Centered around residual = 0

Step 2: What this indicates
The linear model is APPROPRIATE:

Linear relationship is valid (no curved pattern)
Constant variance (homoscedasticity)
No systematic errors

Note: Independence cannot be confirmed from this plot; it depends on how the data were collected.

Step 3: What to do
✓ Can proceed with predictions
✓ Confidence intervals are reasonable if independence and normality also hold
✓ Linear form of the model is supported

Answer: Random scatter indicates the linear model is APPROPRIATE. The relationship looks linear, variance is roughly constant, and there are no systematic errors.

Problem 3 (medium)

❓ Question:

A residual plot shows a curved (U-shaped) pattern. What does this suggest and what should you do?

💡 Solution:

Step 1: Identify the problem
U-shaped or curved residual plot means: Linear model is INAPPROPRIATE

The relationship is actually nonlinear (curved).

Step 2: Why this is a problem

Linear model makes systematic errors
Underestimates in middle, overestimates at extremes (or vice versa)
Predictions will be biased
Violates linearity assumption

Step 3: Solutions

Option 1: Transform the data

Try log(y) vs x, or √y vs x
Replot residuals - should become random

Option 2: Use nonlinear regression

Quadratic: y = a + bx + cx²
Exponential: y = ae^(bx)

Step 4: Check new model
After transforming or refitting, the residual plot should show random scatter.

Answer: Curved residuals indicate NONLINEAR relationship. Transform variables (log, square root) or use nonlinear regression. Recheck residuals after adjustment.

Problem 4 (medium)

❓ Question:

For points (1,3), (2,5), (3,6) with least-squares regression ŷ = 5/3 + 1.5x, verify residuals sum to zero.

💡 Solution:

Step 1: Calculate predicted values
Point 1: ŷ₁ = 5/3 + 1.5(1) ≈ 3.167
Point 2: ŷ₂ = 5/3 + 1.5(2) ≈ 4.667
Point 3: ŷ₃ = 5/3 + 1.5(3) ≈ 6.167

Step 2: Calculate residuals
Residual = y - ŷ

Point 1: e₁ = 3 - 3.167 = -0.167
Point 2: e₂ = 5 - 4.667 = 0.333
Point 3: e₃ = 6 - 6.167 = -0.167

Step 3: Sum residuals
Σ(residuals) = -0.167 + 0.333 + (-0.167) = -0.001 ≈ 0

The tiny difference from zero is rounding error. With exact fractions: -1/6 + 1/3 - 1/6 = 0.

Step 4: Why residuals sum to zero
Mathematical property: For least-squares regression, Σ(y - ŷ) = 0 ALWAYS

Guaranteed by the formulas
Positive and negative errors balance
Line goes through "middle" of data

Caution: The property holds only for the least-squares line. A different line such as ŷ = 2 + 1.5x gives residuals -0.5, 0, -0.5, which sum to -1, not 0.

Answer: Residuals sum to 0 (up to rounding). For a true least-squares line, they ALWAYS sum exactly to zero.

Problem 5 (hard)

❓ Question:

A residual plot shows increasing spread (fan shape) as x increases. What does this violate and what are the implications?

💡 Solution:

Step 1: Identify the violation
Fan-shaped residuals violate: CONSTANT VARIANCE (homoscedasticity)

The spread increases with x (heteroscedasticity).

Step 2: Implications for predictions

Predictions less reliable at high x (wide spread)
Predictions more reliable at low x (tight spread)
Standard errors are WRONG
Confidence intervals misleading

Step 3: Implications for inference

t-tests may be invalid
p-values unreliable
Hypothesis tests have wrong error rates
Can't trust significance levels

Note: Estimates (slope, intercept) are still unbiased, but uncertainty measures are wrong.

Step 4: Solutions

Transform y (try log(y) or √y)
Use weighted least squares
Use robust standard errors
Report with caution

Answer: Violates CONSTANT VARIANCE assumption. Standard errors and confidence intervals unreliable. Solutions: transform y, use weighted least squares, or robust standard errors.
