Least-Squares Regression

Finding the line of best fit

Regression Line

Purpose: Find best-fit line through scatterplot

Equation:

$\hat{y} = a + bx$

Where:

  • $\hat{y}$ = predicted value of y
  • b = slope
  • a = y-intercept
  • x = value of explanatory variable

Least-Squares Criterion

Least-squares regression line: Line minimizing sum of squared residuals

Residual: Difference between observed and predicted

$\text{residual} = y - \hat{y}$

Least-squares minimizes: $\sum (y - \hat{y})^2$

Why square? Squaring keeps positive and negative deviations from cancelling out
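
To make the criterion concrete, here is a minimal Python sketch (data made up for illustration) that evaluates the sum of squared residuals for a candidate line; the least-squares line is the (a, b) pair that makes this quantity smallest.

```python
# Hypothetical data: x = height (inches), y = weight (pounds)
x = [64, 66, 68, 70, 72]
y = [130, 142, 148, 160, 170]

def sum_squared_residuals(a, b):
    """Sum of (y - y-hat)^2 for the candidate line y-hat = a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

# Squaring keeps positive and negative residuals from cancelling:
# the flat line at y-bar has raw residuals summing to exactly 0,
# yet its squared sum is far larger than the least-squares line's.
print(sum_squared_residuals(a=-183.2, b=4.9))  # least-squares line for these data: SSE = 7.6
print(sum_squared_residuals(a=150.0, b=0.0))   # flat line at y-bar: SSE = 968
```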

Formulas for Slope and Intercept

Slope:

$b = r \dfrac{s_y}{s_x}$

Where:

  • r = correlation
  • s_y = standard deviation of y
  • s_x = standard deviation of x

y-intercept:

$a = \bar{y} - b\bar{x}$

Key insight: Line always passes through $(\bar{x}, \bar{y})$
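
One-line check of the key insight: substitute $x = \bar{x}$ and the intercept formula into the equation,

$\hat{y} = a + b\bar{x} = (\bar{y} - b\bar{x}) + b\bar{x} = \bar{y}$

so the point $(\bar{x}, \bar{y})$ always satisfies the equation.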

Example: Finding Regression Line

Data: Height (x) and weight (y) of 5 people

$\bar{x} = 68$, $s_x = 4$
$\bar{y} = 150$, $s_y = 20$
$r = 0.8$

Slope:

$b = 0.8 \times \dfrac{20}{4} = 0.8 \times 5 = 4$

Intercept:

$a = 150 - 4(68) = 150 - 272 = -122$

Equation:

$\hat{y} = -122 + 4x$

Interpretation: For each inch increase in height, predicted weight increases by 4 pounds.
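
The same arithmetic as a quick Python check, using the summary statistics from the example:

```python
r, s_x, s_y = 0.8, 4, 20
x_bar, y_bar = 68, 150

b = r * s_y / s_x       # slope: 0.8 * (20/4) = 4.0
a = y_bar - b * x_bar   # intercept: 150 - 4*68 = -122.0

print(f"y-hat = {a} + {b}x")  # y-hat = -122.0 + 4.0x
```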

Interpreting Slope

Slope b = change in $\hat{y}$ per unit increase in x

Template: "For each [1 unit] increase in [x], predicted [y] [increases/decreases] by [|b|] [y-units]."

Example: b = 4 in height/weight

"For each 1-inch increase in height, predicted weight increases by 4 pounds."

Negative slope: "decreases by..."

Interpreting y-Intercept

y-intercept a = predicted y when x = 0

Often meaningless!

  • Height = 0 → weight = -122 pounds? Nonsense!

Only interpret if x = 0 is meaningful and within data range

Example where meaningful:

  • y = test score, x = hours studied
  • a = predicted score with 0 hours studying

Making Predictions

Substitute x into equation:

$\hat{y} = a + bx$

Example: Predict weight for height = 70 inches

$\hat{y} = -122 + 4(70) = -122 + 280 = 158 \text{ pounds}$

Caution: Extrapolation (predicting outside data range) is risky!
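
A small sketch of prediction with a crude range check (the observed height range used here is an assumption, not part of the example):

```python
def predict_weight(height, a=-122, b=4, lo=62, hi=75):
    """Return y-hat = a + b*x, flagging extrapolation outside the observed x-range."""
    if not lo <= height <= hi:
        print(f"warning: height {height} is outside [{lo}, {hi}] -- extrapolating")
    return a + b * height

print(predict_weight(70))   # 158 (interpolation)
print(predict_weight(100))  # 278, printed with an extrapolation warning
```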

Extrapolation

Interpolation: Predict within range of data ✓

Extrapolation: Predict outside range of data ⚠

Problem with extrapolation:

  • Relationship may not continue
  • May become nonlinear
  • Other factors may matter

Example: Predicting weight for height = 100 inches

  • Well outside typical range
  • Relationship might not hold
  • Prediction unreliable

Calculator Method

TI-83/84:

  1. Enter data in L1 (x) and L2 (y)
  2. STAT → CALC → 8:LinReg(a+bx)
  3. Read a, b, r, r²

Result shows:

  • y = a + bx
  • r (correlation)
  • r² (coefficient of determination)
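
Outside a calculator, the same numbers come from any least-squares routine; a sketch with SciPy's `linregress` (hypothetical data):

```python
from scipy.stats import linregress

x = [64, 66, 68, 70, 72]       # heights (inches)
y = [130, 142, 148, 160, 170]  # weights (pounds)

res = linregress(x, y)
print(f"a = {res.intercept:.2f}, b = {res.slope:.2f}")     # intercept and slope
print(f"r = {res.rvalue:.3f}, r^2 = {res.rvalue**2:.3f}")  # correlation and r-squared
```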

Properties of Regression Line

1. Passes through $(\bar{x}, \bar{y})$

2. Sum of residuals = 0

  • Positive and negative balance out

3. Unique (only one least-squares line)

4. Sensitive to outliers

  • One outlier can drastically change line
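
Properties 1 and 2 are easy to verify numerically; a sketch using only the summary-statistic formulas above (hypothetical data; `statistics.correlation` needs Python 3.10+):

```python
import statistics as st

x = [64, 66, 68, 70, 72]
y = [130, 142, 148, 160, 170]

b = st.correlation(x, y) * st.stdev(y) / st.stdev(x)  # b = r * s_y / s_x
a = st.mean(y) - b * st.mean(x)                       # a = y-bar - b * x-bar

residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
print(abs(sum(residuals)) < 1e-9)                     # True: residuals sum to ~0
print(abs((a + b * st.mean(x)) - st.mean(y)) < 1e-9)  # True: line hits (x-bar, y-bar)
```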

Residuals

Residual = observed - predicted = $y - \hat{y}$

  • Positive residual: Point above line (underestimate)
  • Negative residual: Point below line (overestimate)
  • Zero residual: Point on line (exact prediction)

Example: Actual weight = 160, predicted = 158

  • Residual = 160 - 158 = 2 pounds
  • Underestimated by 2 pounds

Influential Points

Influential point: Removing it substantially changes regression line

Usually:

  • Outliers in x-direction (far from $\bar{x}$)
  • Have high leverage (pull line toward them)

Not all outliers are influential!

  • Outlier in y-direction but near $\bar{x}$ → less influential

Always identify and investigate influential points
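
A sketch of influence in action (hypothetical data): refit after adding one point far from $\bar{x}$ and watch the slope move.

```python
import statistics as st

def fit(x, y):
    """Least-squares (intercept, slope) via b = r*s_y/s_x, a = y-bar - b*x-bar."""
    b = st.correlation(x, y) * st.stdev(y) / st.stdev(x)
    return st.mean(y) - b * st.mean(x), b

x = [64, 66, 68, 70, 72]
y = [130, 142, 148, 160, 170]

print(fit(x, y))                 # original line
print(fit(x + [90], y + [140]))  # one high-leverage point drags the slope way down
```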

Regression Toward the Mean

Phenomenon: Extreme x-values tend to predict less extreme y-values

Why? Because $|r| < 1$ (the relationship is not perfect)

Example: Very tall parents tend to have children shorter than themselves (still tall, but less extreme)

Slope formula explains: $b = r(s_y/s_x)$

  • Since $|r| < 1$, the predicted y is fewer standard deviations from $\bar{y}$ than x is from $\bar{x}$ (see below)
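
Equivalently, in standardized units the regression equation is

$\dfrac{\hat{y} - \bar{y}}{s_y} = r \cdot \dfrac{x - \bar{x}}{s_x}$

so an x that is 2 standard deviations above $\bar{x}$ predicts a y only $2r$ standard deviations above $\bar{y}$.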

Switching x and y

Regression NOT symmetric!

Different lines:

  • Regression of y on x: $\hat{y} = a + bx$
  • Regression of x on y: $\hat{x} = c + dy$

These are NOT equivalent!

Use: Predict y from x → use y on x line
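
A sketch of the asymmetry (hypothetical data): the two slopes multiply to $r^2$, so the lines coincide only when $|r| = 1$.

```python
import statistics as st

x = [64, 66, 68, 70, 72]
y = [130, 142, 148, 160, 170]

r = st.correlation(x, y)
b_yx = r * st.stdev(y) / st.stdev(x)  # slope for regressing y on x
b_xy = r * st.stdev(x) / st.stdev(y)  # slope for regressing x on y

# If the two lines were the same, b_xy would equal 1/b_yx.
# Instead b_yx * b_xy = r^2 < 1 (unless |r| = 1), so they differ.
print(b_yx * b_xy, r ** 2)  # equal values, less than 1 here
```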

Common Mistakes

❌ Interpreting the y-intercept when x = 0 is meaningless
❌ Extrapolating beyond data range
❌ Confusing slope units
❌ Thinking regression proves causation
❌ Using regression when relationship nonlinear

Causation Reminder

Regression line can be used for prediction

Does NOT prove causation!

Strong relationship ≠ cause-and-effect

Need: Controlled experiment to establish causation

Quick Reference

Equation: $\hat{y} = a + bx$

Slope: $b = r(s_y/s_x)$

Intercept: $a = \bar{y} - b\bar{x}$

Line passes through: $(\bar{x}, \bar{y})$

Residual: $y - \hat{y}$

Least-squares minimizes: $\sum(y - \hat{y})^2$

Remember: Regression gives best prediction line but doesn't prove causation. Beware extrapolation! Always check for influential points.
