Residuals and Residual Plots

Assessing model fit

What are Residuals?

Residual: Vertical distance from point to regression line

$\text{residual} = y - \hat{y} = \text{observed} - \text{predicted}$

Positive residual: Point above line (prediction too low)
Negative residual: Point below line (prediction too high)

Sum of residuals = 0 (always, for least-squares line)
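
As a quick illustration, the sketch below fits a least-squares line to a small made-up data set and checks that the residuals sum to zero (up to floating-point rounding). The data values and the helper name `fit_line` are assumptions chosen for the example, not part of these notes.

```python
# Hypothetical data, invented for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]


def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line y-hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


intercept, slope = fit_line(xs, ys)

# residual = observed - predicted
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

for x, y, e in zip(xs, ys, residuals):
    side = "above" if e > 0 else "below"
    print(f"x={x}: observed={y}, residual={e:+.2f} ({side} the line)")

print(f"sum of residuals = {sum(residuals):.10f}")
```

Positive residuals mark points above the fitted line and negative residuals mark points below it; the printed sum should be zero apart from rounding noise.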

Residual Plot

Residual Plot: Scatterplot with x on horizontal axis, residuals on vertical axis

Purpose:

  • Check if linear model appropriate
  • Identify patterns suggesting problems
  • Detect outliers

Ideal Residual Plot

Good residual plot shows:

  1. Random scatter around horizontal line at 0
  2. No clear pattern (no curve, fan shape, etc.)
  3. Constant variability across x-values
  4. No outliers (points far from 0)

Interpretation: Linear model is appropriate

Patterns Indicating Problems

Curved pattern:

  • Linear model inappropriate
  • Relationship is nonlinear
  • Solution: Transform variables or use nonlinear model

Fan shape (increasing or decreasing spread):

  • Non-constant variance (heteroscedasticity)
  • Predictions less reliable for some x-values
  • Solution: Transform variables

Outliers:

  • Points far from horizontal band
  • Check for errors or unusual cases
  • Consider impact on regression line

Example 1: Good Residual Plot

Random scatter around 0, no pattern

Residuals randomly scattered above and below 0 with roughly constant spread.

Conclusion: Linear model appropriate

Example 2: Curved Residual Pattern

Residuals show U-shape or inverted U

Residuals form curved pattern (like parabola or inverted parabola).

Conclusion: Relationship is nonlinear, linear model inappropriate

Action: Try quadratic or other transformation
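
A small sketch of how this pattern arises: fitting a straight line to data that actually follows y = x² leaves residuals that are positive at both ends and negative in the middle. The data and the helper `fit_line` are made up for illustration.

```python
# Hypothetical data that follows y = x^2 exactly (a curved relationship).
xs = [0, 1, 2, 3, 4]
ys = [x ** 2 for x in xs]


def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line y-hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


intercept, slope = fit_line(xs, ys)
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# Crude text version of a residual plot: one row per point.
for x, e in zip(xs, residuals):
    print(f"x={x}  residual={e:+.1f}")
```

For this data the fitted line is ŷ = −2 + 4x and the residuals are +2, −1, −2, −1, +2, which trace out a U shape.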

Example 3: Fan Shape

Spread increases (or decreases) with x

Residuals spread out more (or less) as x increases, forming fan or cone shape.

Conclusion: Non-constant variance

Action: May need transformation (e.g., log)
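
One informal way to put a number on a fan shape is to compare the typical residual size for small x with that for large x. The sketch below does this with made-up residuals; the split at the middle of the x-ordering and the helper name `rms` are assumptions for illustration, not a formal test.

```python
import math

# Hypothetical (x, residual) pairs whose spread grows with x.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
residuals = [0.1, -0.2, 0.3, -0.2, 1.0, -1.5, 2.5, -2.0]


def rms(values):
    """Root-mean-square size of a list of residuals."""
    return math.sqrt(sum(v ** 2 for v in values) / len(values))


# Order the residuals by x, then split into a low-x half and a high-x half.
ordered = [e for _, e in sorted(zip(xs, residuals))]
half = len(ordered) // 2
low_spread = rms(ordered[:half])
high_spread = rms(ordered[half:])

ratio = high_spread / low_spread
print(f"low-x spread  = {low_spread:.2f}")
print(f"high-x spread = {high_spread:.2f}")
print(f"ratio         = {ratio:.1f}")
```

A ratio well above 1 suggests the spread grows with x, while a ratio near 1 is consistent with constant spread. The residual plot itself remains the main diagnostic.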

Standard Deviation of Residuals

Measures: Typical prediction error

$s = \sqrt{\frac{\sum (y - \hat{y})^2}{n-2}}$

Interpretation: "Typical distance of points from regression line is about s [y-units]."

Smaller s → better predictions (points closer to line)

Note: Denominator is n-2 (loses 2 df for slope and intercept)
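
The formula can be sketched in a few lines. The data set and the function name `residual_std` are made up for illustration.

```python
import math

# Hypothetical data, invented for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]


def residual_std(xs, ys):
    """Standard deviation of residuals, s, for the least-squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Sum of squared residuals, divided by n - 2 degrees of freedom.
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(sse / (n - 2))


s = residual_std(xs, ys)
print(f"s = {s:.3f} (in the units of y)")
```

Here s comes out near 0.19, so a typical point sits about 0.19 y-units from the fitted line.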

Using s for Predictions

Rough prediction interval:

$\hat{y} \pm 2s$

Interpretation: About 95% of actual values fall within 2s of the predicted value (a rough rule that assumes roughly normal residuals with constant spread)

Example: $\hat{y}$ = 150 pounds, s = 10 pounds

Prediction interval ≈ 150 ± 20 = (130, 170) pounds
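
The same arithmetic as a minimal sketch; the function name `rough_prediction_interval` is hypothetical, and the ±2s rule is only the rough approximation described in these notes, not an exact prediction interval.

```python
def rough_prediction_interval(y_hat, s):
    """Return (low, high) for the rough interval y-hat +/- 2s."""
    return (y_hat - 2 * s, y_hat + 2 * s)


# Values from the example: predicted weight 150 pounds, s = 10 pounds.
low, high = rough_prediction_interval(150, 10)
print(f"({low}, {high}) pounds")
```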

Outliers in Residual Plot

Outlier: Residual far from 0

Investigate:

  • Data entry error?
  • Unusual case?
  • Measurement error?

Impact:

  • Can affect regression line
  • May indicate different subgroup
  • Consider separate analysis with/without outlier

Checking Conditions for Regression

Use residual plot to check:

1. Linearity: Random scatter (no curve)

2. Equal variance: Constant spread across x-values

3. Independence: (Can't check from plot alone, depends on data collection)

4. Normality: (Check with histogram or normal probability plot of residuals)

Acronym: LINE (Linearity, Independence, Normality, Equal variance)

Histogram of Residuals

Purpose: Check if residuals approximately normal

Look for:

  • Roughly symmetric
  • Bell-shaped
  • No severe outliers

Note: Normality less critical for large samples (CLT)

Normal Probability Plot of Residuals

Purpose: Check normality of residuals

Good plot:

  • Points follow straight line
  • Little deviation from line

Bad plot:

  • Strong curvature
  • Many points far from line

Interpretation: If roughly linear, normality assumption reasonable
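
A normal probability plot pairs each sorted residual with the value a normal distribution would be expected to produce at that rank. The sketch below builds those pairs with Python's standard library and summarizes how straight the plot is with a correlation. The residuals are made up, and the plotting position (i − 0.5)/n is one common convention assumed here.

```python
import math
from statistics import NormalDist

# Hypothetical residuals from a fitted line.
residuals = [0.06, -0.13, 0.18, -0.21, 0.10]

n = len(residuals)
sorted_res = sorted(residuals)

# Expected standard-normal scores at plotting positions (i - 0.5) / n.
normal_scores = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]


def correlation(a, b):
    """Pearson correlation between two equal-length lists."""
    a_bar = sum(a) / len(a)
    b_bar = sum(b) / len(b)
    sab = sum((x - a_bar) * (y - b_bar) for x, y in zip(a, b))
    saa = sum((x - a_bar) ** 2 for x in a)
    sbb = sum((y - b_bar) ** 2 for y in b)
    return sab / math.sqrt(saa * sbb)


r = correlation(normal_scores, sorted_res)

for z, e in zip(normal_scores, sorted_res):
    print(f"normal score {z:+.2f}  residual {e:+.2f}")
print(f"correlation = {r:.3f}")
```

A correlation close to 1 means the plotted points lie near a straight line. With only a handful of points this is a rough guide at best, so the plot should still be inspected by eye.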

Influential Points

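Where to look:

  • Far from $\bar{x}$ in the x-direction (high leverage), often with a large residual
  • A high-leverage point can pull the line toward itself, so its residual may be small; comparing the fits with and without the point is the more reliable check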

Test influence:

  1. Calculate regression with point
  2. Calculate regression without point
  3. Compare: Big change? Point is influential

Action: Report both analyses, investigate why point is unusual
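
The with/without comparison can be sketched as follows. The data set, the index `suspect`, and the helper `fit_line` are all made up for illustration: five points lie exactly on y = 2x, and a sixth sits far to the right of the others.

```python
# Hypothetical data: five points on y = 2x plus one point far to the right.
xs = [1, 2, 3, 4, 5, 15]
ys = [2, 4, 6, 8, 10, 10]
suspect = 5  # index of the point being tested for influence


def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line y-hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Fit once with every point, once with the suspect point removed.
with_point = fit_line(xs, ys)
without_point = fit_line(
    xs[:suspect] + xs[suspect + 1:],
    ys[:suspect] + ys[suspect + 1:],
)

print(f"with point:    intercept={with_point[0]:.2f}, slope={with_point[1]:.2f}")
print(f"without point: intercept={without_point[0]:.2f}, slope={without_point[1]:.2f}")
```

In this made-up case the slope drops from 2 to about 0.46 when the far-right point is included, so the point would count as influential.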

Comparing Models

Use residual plots to compare different models:

Model 1 (linear): Residuals show pattern
Model 2 (quadratic): Residuals random scatter

Conclusion: Model 2 better (quadratic fits better than linear)

Also compare: Standard deviation of residuals (s)

  • Smaller s = better predictions

Calculator Methods

TI-83/84:

Get residuals:

  1. Run LinReg (stores residuals automatically in RESID list)
  2. 2nd STAT (LIST) → RESID

Plot residuals:

  1. STAT PLOT → Plot1
  2. Type: Scatterplot
  3. Xlist: L1, Ylist: RESID
  4. ZOOM → 9:ZoomStat

Common Mistakes

❌ Not checking residual plot (just looking at r²)
❌ Using linear model when residuals show curve
❌ Ignoring fan shape in residuals
❌ Not investigating outliers
❌ Confusing residuals with errors

Residuals vs Errors

Residual: Observed - Predicted ($y - \hat{y}$)

  • Calculated from sample
  • Can compute

Error: Observed - True (y - E(y))

  • Theoretical (unknown)
  • Can't compute (don't know true relationship)

Residuals estimate errors

Transformations

If residual plot shows problems:

For curvature:

  • Try log(y), √y, or x²
  • Re-fit model with transformed variable
  • Check new residual plot

For fan shape:

  • Try log(y) transformation
  • Stabilizes variance

Goal: Residuals with no pattern and constant spread
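
A minimal sketch of the re-fit-and-recheck loop, using made-up data that follows y = 2e^(0.5x) exactly. The helper names `fit_line` and `residuals_for` are assumptions for illustration.

```python
import math

# Hypothetical data following y = 2 * e^(0.5 x) exactly.
xs = [0, 1, 2, 3, 4, 5]
ys = [2 * math.exp(0.5 * x) for x in xs]


def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line y-hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def residuals_for(xs, ys):
    """Residuals (observed - predicted) from the least-squares line."""
    a, b = fit_line(xs, ys)
    return [y - (a + b * x) for x, y in zip(xs, ys)]


# Fit on the original scale, then again after transforming y to log(y).
raw_residuals = residuals_for(xs, ys)
log_residuals = residuals_for(xs, [math.log(y) for y in ys])

print("raw y: ", [round(e, 2) for e in raw_residuals])
print("log(y):", [round(e, 2) for e in log_residuals])
```

On the original scale the residuals are positive at both ends and negative in the middle (a curve). After the log transformation they are essentially zero, because log(y) is exactly linear in x for this made-up data. Real data would leave some scatter, which should then look random.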

Quick Reference

Residual: $y - \hat{y}$

Good residual plot:

  • Random scatter around 0
  • No pattern
  • Constant spread

s: Typical prediction error

Check conditions: LINE (Linearity, Independence, Normality, Equal variance)

Problems to look for:

  • Curved pattern → nonlinear
  • Fan shape → non-constant variance
  • Outliers → investigate

Remember: Always examine the residual plot! It reveals whether a linear model is appropriate and highlights potential problems. Don't rely on correlation alone!

📚 Practice Problems

Problem 1 (easy)

❓ Question:

For the regression line ŷ = 10 + 2x, calculate the residual for the point (5, 25).

💡 Solution:

Step 1: Identify the actual value
Point (5, 25): x = 5, y = 25 (actual)

Step 2: Calculate the predicted value
ŷ = 10 + 2(5) = 10 + 10 = 20

Step 3: Calculate the residual
Residual = y − ŷ = 25 − 20 = 5

Step 4: Interpret
The residual is POSITIVE (+5), meaning:

  • Actual value is ABOVE predicted value
  • Point is 5 units above the regression line
  • Model UNDERESTIMATES by 5 units

Answer: Residual = 5 (point is above the line)

Problem 2 (medium)

❓ Question:

A residual plot shows points scattered randomly around zero with no pattern. What does this indicate?

💡 Solution:

Step 1: Understand what random scatter means
Good residual plot characteristics:
✓ Points scattered RANDOMLY
✓ No curved, U-shaped, or other patterns
✓ Roughly equal spread at all x values
✓ Centered around residual = 0

Step 2: What this indicates
The linear model is APPROPRIATE:

  1. Linear relationship is reasonable (no curved pattern)
  2. Constant variance (homoscedasticity)
  3. No systematic errors

Independence cannot be confirmed from the residual plot alone; it depends on how the data were collected.

Step 3: What to do
✓ Can proceed with predictions
✓ Can trust confidence intervals (if independence and normality also hold)
✓ Linear model is supported by the residual plot

Answer: Random scatter indicates the linear model is APPROPRIATE. There is no sign of curvature, variance is roughly constant, and there are no systematic errors.

Problem 3 (medium)

❓ Question:

A residual plot shows a curved (U-shaped) pattern. What does this suggest and what should you do?

💡 Solution:

Step 1: Identify the problem
A U-shaped or curved residual plot means the linear model is INAPPROPRIATE.

The relationship is actually nonlinear (curved).

Step 2: Why this is a problem

  • Linear model makes systematic errors
  • Underestimates in middle, overestimates at extremes (or vice versa)
  • Predictions will be biased
  • Violates linearity assumption

Step 3: Solutions

Option 1: Transform the data

  • Try log(y) vs x, or √y vs x
  • Replot residuals - should become random

Option 2: Use nonlinear regression

  • Quadratic: y = a + bx + cx²
  • Exponential: y = ae^(bx)

Step 4: Check new model
After transformation, the residual plot should show random scatter.

Answer: Curved residuals indicate a NONLINEAR relationship. Transform variables (log, square root) or use nonlinear regression. Recheck residuals after adjustment.

Problem 4 (medium)

❓ Question:

For points (1,3), (2,5), (3,6) with least-squares regression line ŷ = 1.667 + 1.5x, verify that the residuals sum to zero.

💡 Solution:

Step 1: Calculate predicted values
Point 1: ŷ₁ = 1.667 + 1.5(1) = 3.167
Point 2: ŷ₂ = 1.667 + 1.5(2) = 4.667
Point 3: ŷ₃ = 1.667 + 1.5(3) = 6.167

Step 2: Calculate residuals
Residual = y − ŷ

Point 1: e₁ = 3 − 3.167 = −0.167
Point 2: e₂ = 5 − 4.667 = 0.333
Point 3: e₃ = 6 − 6.167 = −0.167

Step 3: Sum residuals
Σ(residuals) = −0.167 + 0.333 + (−0.167) = −0.001

This is zero up to rounding: with the exact intercept 5/3, the residuals are −1/6, 1/3, −1/6, which sum to exactly 0.

Note: The intercept cannot be rounded freely. The line ŷ = 2 + 1.5x is not the least-squares line for these points; its residuals (−0.5, 0, −0.5) sum to −1.0.

Step 4: Why residuals sum to zero
Mathematical property: For least-squares regression, Σ(y − ŷ) = 0 ALWAYS

  • Guaranteed by the formulas
  • Positive and negative errors balance
  • Line goes through "middle" of data

Answer: Residuals sum to approximately 0 (exactly 0 before rounding). For a true least-squares line, they ALWAYS sum exactly to zero.
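
The reason the sum is exactly zero can be sketched from the least-squares criterion, assuming the fitted line ŷ = a + bx includes an intercept a (the symbols a, b, and SSE are notation introduced here). The least-squares line minimizes the sum of squared residuals, so the derivative with respect to a must vanish:

```latex
\text{SSE}(a, b) = \sum_{i=1}^{n} (y_i - a - b x_i)^2
\quad\Longrightarrow\quad
\frac{\partial\,\text{SSE}}{\partial a} = -2 \sum_{i=1}^{n} (y_i - a - b x_i) = 0
\quad\Longrightarrow\quad
\sum_{i=1}^{n} (y_i - \hat{y}_i) = 0
```

Solving the same condition for a gives a = ȳ − b·x̄, which is why the least-squares line passes through the point (x̄, ȳ).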

Problem 5 (hard)

❓ Question:

A residual plot shows increasing spread (fan shape) as x increases. What does this violate and what are the implications?

💡 Solution:

Step 1: Identify the violation
Fan-shaped residuals violate: CONSTANT VARIANCE (homoscedasticity)

The spread increases with x (heteroscedasticity).

Step 2: Implications for predictions

  • Predictions less reliable at high x (wide spread)
  • Predictions more reliable at low x (tight spread)
  • Standard errors are WRONG
  • Confidence intervals misleading

Step 3: Implications for inference

  • t-tests may be invalid
  • p-values unreliable
  • Hypothesis tests have wrong error rates
  • Can't trust significance levels

Note: Estimates (slope, intercept) are still unbiased, but uncertainty measures are wrong.

Step 4: Solutions

  • Transform y (try log(y) or √y)
  • Use weighted least squares
  • Use robust standard errors
  • Report with caution

Answer: Violates CONSTANT VARIANCE assumption. Standard errors and confidence intervals unreliable. Solutions: transform y, use weighted least squares, or robust standard errors.