Least-Squares Regression

Finding the line of best fit

Regression Line

Purpose: Find best-fit line through scatterplot

Equation:

\hat{y} = a + bx

Where:

  • \hat{y} = predicted value of y
  • b = slope
  • a = y-intercept
  • x = value of explanatory variable

Least-Squares Criterion

Least-squares regression line: Line minimizing sum of squared residuals

Residual: Difference between observed and predicted

\text{residual} = y - \hat{y}

Least-squares minimizes: \sum (y - \hat{y})^2

Why square? Positive and negative deviations don't cancel
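A minimal numerical sketch of the criterion, using made-up data: the least-squares line attains a smaller sum of squared residuals than any other line.

```python
import numpy as np

# Made-up data for illustration only
x = np.array([2, 3, 4, 5, 6])
y = np.array([66, 69, 76, 79, 86])

def sse(a, b):
    """Sum of squared residuals for the line y-hat = a + b*x."""
    residuals = y - (a + b * x)
    return np.sum(residuals ** 2)

# np.polyfit(x, y, 1) returns the least-squares slope and intercept
b_hat, a_hat = np.polyfit(x, y, 1)

print(sse(a_hat, b_hat))        # the minimum achievable value
print(sse(a_hat, b_hat + 0.5))  # tilting the line increases the sum
print(sse(a_hat + 1, b_hat))    # shifting the line increases it too
```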

Formulas for Slope and Intercept

Slope:

b = r \frac{s_y}{s_x}

Where:

  • r = correlation
  • s_y = standard deviation of y
  • s_x = standard deviation of x

y-intercept:

a = \bar{y} - b\bar{x}

Key insight: Line always passes through (\bar{x}, \bar{y})
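Why the line must pass through (\bar{x}, \bar{y}): substitute x = \bar{x} and apply the intercept formula:

\hat{y} = a + b\bar{x} = (\bar{y} - b\bar{x}) + b\bar{x} = \bar{y}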

Example: Finding Regression Line

Data: Height (x) and weight (y) of 5 people

\bar{x} = 68, s_x = 4
\bar{y} = 150, s_y = 20
r = 0.8

Slope:

b = 0.8 \times \frac{20}{4} = 0.8 \times 5 = 4

Intercept:

a = 150 - 4(68) = 150 - 272 = -122

Equation:

\hat{y} = -122 + 4x

Interpretation: For each inch increase in height, predicted weight increases by 4 pounds.
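A quick check of this example in Python, plugging the summary statistics above into the slope and intercept formulas:

```python
# Summary statistics from the height/weight example
r, s_x, s_y = 0.8, 4, 20
x_bar, y_bar = 68, 150

b = r * s_y / s_x        # slope: 0.8 * (20/4) = 4.0
a = y_bar - b * x_bar    # intercept: 150 - 4*68 = -122.0

print(f"y-hat = {a} + {b}x")   # y-hat = -122.0 + 4.0x
```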

Interpreting Slope

Slope b = change in \hat{y} per unit increase in x

Template: "For each [1 unit] increase in [x], predicted [y] [increases/decreases] by [|b|] [y-units]."

Example: b = 4 in height/weight

"For each 1-inch increase in height, predicted weight increases by 4 pounds."

Negative slope: "decreases by..."

Interpreting y-Intercept

y-intercept a = predicted y when x = 0

Often meaningless!

  • Height = 0 → weight = -122 pounds? Nonsense!

Only interpret if x = 0 is meaningful and within data range

Example where meaningful:

  • y = test score, x = hours studied
  • a = predicted score with 0 hours studying

Making Predictions

Substitute x into equation:

\hat{y} = a + bx

Example: Predict weight for height = 70 inches

\hat{y} = -122 + 4(70) = -122 + 280 = 158 \ \text{pounds}

Caution: Extrapolation (predicting outside data range) is risky!
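A small sketch of prediction with an extrapolation guard. The observed height range used here is an assumption for illustration, not part of the example data:

```python
# Assumed range of heights actually observed (illustrative only)
OBSERVED_RANGE = (62, 75)

def predict_weight(height_in):
    """Apply y-hat = -122 + 4x from the example, flagging extrapolation."""
    lo, hi = OBSERVED_RANGE
    if not lo <= height_in <= hi:
        print(f"Warning: {height_in} inches is outside the observed range {OBSERVED_RANGE}")
    return -122 + 4 * height_in

print(predict_weight(70))    # 158 pounds (within range)
print(predict_weight(100))   # 278 pounds, but flagged as extrapolation
```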

Extrapolation

Interpolation: Predict within range of data ✓

Extrapolation: Predict outside range of data ⚠

Problem with extrapolation:

  • Relationship may not continue
  • May become nonlinear
  • Other factors may matter

Example: Predicting weight for height = 100 inches

  • Well outside typical range
  • Relationship might not hold
  • Prediction unreliable

Calculator Method

TI-83/84:

  1. Enter data in L1 (x) and L2 (y)
  2. STAT → CALC → 8:LinReg(a+bx)
  3. Read a, b, r, r²

Result shows:

  • y = a + bx
  • r (correlation)
  • r² (coefficient of determination)
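If you are working in Python rather than on a TI calculator, scipy.stats.linregress reports the same quantities; shown here on the small data set from Practice Problem 1 below:

```python
from scipy import stats

# Data from Practice Problem 1: hours studied (x) and test score (y)
x = [2, 3, 4, 5, 6]
y = [65, 70, 75, 80, 85]

result = stats.linregress(x, y)
print(result.intercept, result.slope)      # a = 55.0, b = 5.0
print(result.rvalue, result.rvalue ** 2)   # r and r²
```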

Properties of Regression Line

1. Passes through (\bar{x}, \bar{y})

2. Sum of residuals = 0

  • Positive and negative balance out

3. Unique (only one least-squares line)

4. Sensitive to outliers

  • One outlier can drastically change line
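A short sketch of property 4 with made-up data: a single high-leverage point (far from \bar{x}) drags the fitted slope far from its original value.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])   # perfectly linear: slope 2, intercept 0

b, a = np.polyfit(x, y, 1)
print(a, b)                      # ≈ 0.0, 2.0

# Add one outlier far out in the x-direction (high leverage)
x_out = np.append(x, 15)
y_out = np.append(y, 5)

b2, a2 = np.polyfit(x_out, y_out, 1)
print(a2, b2)                    # slope drops to roughly 0.08
```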

Residuals

Residual = observed - predicted = y - \hat{y}

Positive residual: Point above line (underestimate)
Negative residual: Point below line (overestimate)
Zero residual: Point on line (exact prediction)

Example: Actual weight = 160, predicted = 158

  • Residual = 160 - 158 = 2 pounds
  • Underestimated by 2 pounds

Influential Points

Influential point: Removing it substantially changes regression line

Usually:

  • Outliers in x-direction (far from \bar{x})
  • Have high leverage (pull line toward them)

Not all outliers are influential!

  • Outlier in y-direction but near \bar{x} → less influential

Always identify and investigate influential points

Regression Toward the Mean

Phenomenon: Extreme x-values tend to predict less extreme y-values

Why? Correlation < 1 (not perfect relationship)

Example: Very tall parents tend to have children who are shorter than themselves (still tall, but less extreme)

Slope formula explains it: b = r(s_y/s_x)

  • Since |r| < 1, the predicted change is smaller than proportional (see the standardized form below)
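In standardized units the slope formula gives the regression line as

\frac{\hat{y} - \bar{y}}{s_y} = r \cdot \frac{x - \bar{x}}{s_x}

so an x-value that is z standard deviations from \bar{x} predicts a y-value only rz standard deviations from \bar{y}.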

Switching x and y

Regression NOT symmetric!

Different lines:

  • Regression of y on x: \hat{y} = a + bx
  • Regression of x on y: \hat{x} = c + dy

These are NOT equivalent!

Use: Predict y from x → use y on x line
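A small sketch with made-up data showing the two fits really are different lines; inverting the x-on-y slope does not recover the y-on-x slope unless |r| = 1:

```python
from scipy import stats

# Made-up data for illustration
x = [2, 3, 4, 5, 7]
y = [65, 70, 74, 80, 84]

y_on_x = stats.linregress(x, y)   # predicts y from x
x_on_y = stats.linregress(y, x)   # predicts x from y

# If the two regressions were the same line, these would match.
print(y_on_x.slope)               # b in y-hat = a + bx (≈ 3.88)
print(1 / x_on_y.slope)           # ≈ 4.03, a different line
```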

Common Mistakes

❌ Interpreting y-intercept when x = 0 meaningless
❌ Extrapolating beyond data range
❌ Confusing slope units
❌ Thinking regression proves causation
❌ Using regression when relationship nonlinear

Causation Reminder

Regression line can be used for prediction

Does NOT prove causation!

Strong relationship ≠ cause-and-effect

Need: Controlled experiment to establish causation

Quick Reference

Equation: \hat{y} = a + bx

Slope: b = r(s_y/s_x)

Intercept: a = \bar{y} - b\bar{x}

Line passes through: (\bar{x}, \bar{y})

Residual: y - \hat{y}

Least-squares minimizes: \sum(y - \hat{y})^2

Remember: Regression gives best prediction line but doesn't prove causation. Beware extrapolation! Always check for influential points.

📚 Practice Problems

Problem 1 (Medium)

Question:

A study measures hours studied (x) and test scores (y) for 5 students: (2,65), (3,70), (4,75), (5,80), (6,85). Given x̄ = 4, ȳ = 75, calculate the least-squares regression line.

💡 Solution

Step 1: Calculate slope b₁

Formula: b₁ = Σ(x-x̄)(y-ȳ) / Σ(x-x̄)²

Create table:

| x | y  | (x-x̄) | (y-ȳ) | (x-x̄)(y-ȳ) | (x-x̄)² |
|---|----|--------|--------|-------------|---------|
| 2 | 65 | -2     | -10    | 20          | 4       |
| 3 | 70 | -1     | -5     | 5           | 1       |
| 4 | 75 | 0      | 0      | 0           | 0       |
| 5 | 80 | 1      | 5      | 5           | 1       |
| 6 | 85 | 2      | 10     | 20          | 4       |

Σ(x-x̄)(y-ȳ) = 50
Σ(x-x̄)² = 10

b₁ = 50/10 = 5

Step 2: Calculate y-intercept b₀

b₀ = ȳ - b₁x̄
b₀ = 75 - 5(4)
b₀ = 75 - 20 = 55

Step 3: Write equation

ŷ = 55 + 5x

Interpretation: Each additional hour studied predicts a 5-point increase in test score.

Answer: ŷ = 55 + 5x
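A quick NumPy check of the table arithmetic (same data as the problem):

```python
import numpy as np

x = np.array([2, 3, 4, 5, 6])
y = np.array([65, 70, 75, 80, 85])

dx, dy = x - x.mean(), y - y.mean()
b1 = np.sum(dx * dy) / np.sum(dx ** 2)   # 50 / 10 = 5.0
b0 = y.mean() - b1 * x.mean()            # 75 - 5(4) = 55.0
print(f"y-hat = {b0} + {b1}x")
```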

Problem 2 (Medium)

Question:

For data with Σx = 50, Σy = 120, Σx² = 350, Σxy = 720, n = 10, find the least-squares regression line.

💡 Solution

Step 1: Calculate means

x̄ = Σx/n = 50/10 = 5
ȳ = Σy/n = 120/10 = 12

Step 2: Calculate slope

Formula: b₁ = [Σxy - n(x̄)(ȳ)] / [Σx² - n(x̄)²]

Numerator: Σxy - n(x̄)(ȳ) = 720 - 10(5)(12) = 720 - 600 = 120
Denominator: Σx² - n(x̄)² = 350 - 10(5)² = 350 - 250 = 100

b₁ = 120/100 = 1.2

Step 3: Calculate intercept

b₀ = ȳ - b₁x̄ = 12 - 1.2(5) = 12 - 6 = 6

Step 4: Write equation

ŷ = 6 + 1.2x

Verification: When x = 5, ŷ = 6 + 6 = 12 = ȳ ✓

Answer: ŷ = 6 + 1.2x
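The same computation in Python, using only the summary sums given in the problem:

```python
# Summary sums from the problem statement
n = 10
sum_x, sum_y, sum_x2, sum_xy = 50, 120, 350, 720

x_bar, y_bar = sum_x / n, sum_y / n
b1 = (sum_xy - n * x_bar * y_bar) / (sum_x2 - n * x_bar ** 2)   # 120 / 100 = 1.2
b0 = y_bar - b1 * x_bar                                         # 12 - 1.2(5) = 6.0
print(f"y-hat = {b0} + {b1}x")
```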

Problem 3 (Easy)

Question:

A regression of car weight (x, in 1000s of lbs) on fuel efficiency (y, mpg) gives ŷ = 45 - 5.2x. Interpret the slope and predict mpg for a 3,500 lb car.

💡 Solution

Step 1: Interpret slope

Slope = -5.2 mpg per 1,000 lbs

Interpretation: "For each additional 1,000 pounds of car weight, fuel efficiency is predicted to DECREASE by 5.2 miles per gallon."

The negative slope makes sense: heavier cars use more fuel.

Step 2: Convert weight to correct units

Car weight = 3,500 lbs = 3.5 thousands of lbs, so x = 3.5

Step 3: Make prediction

ŷ = 45 - 5.2(3.5)
ŷ = 45 - 18.2
ŷ = 26.8 mpg

Step 4: Complete interpretation

"A car weighing 3,500 pounds is predicted to have a fuel efficiency of approximately 26.8 miles per gallon."

Answer:
Slope: each 1,000 lb increase in weight predicts a 5.2 mpg decrease.
Prediction: 26.8 mpg

Problem 4 (Medium)

Question:

Given x̄ = 15, ȳ = 240, sₓ = 4, sᵧ = 60, r = 0.75, find the regression line using b₁ = r(sᵧ/sₓ).

💡 Solution

Step 1: Calculate slope

Formula: b₁ = r(sᵧ/sₓ)

b₁ = 0.75 × (60/4)
b₁ = 0.75 × 15
b₁ = 11.25

Step 2: Calculate y-intercept

b₀ = ȳ - b₁x̄
b₀ = 240 - 11.25(15)
b₀ = 240 - 168.75
b₀ = 71.25

Step 3: Write regression equation

ŷ = 71.25 + 11.25x

Interpretation: Each 1-unit increase in x predicts an 11.25-unit increase in y.

Answer: ŷ = 71.25 + 11.25x

Problem 5 (Medium)

Question:

The regression of ice cream sales ($, y) on temperature (°F, x) is ŷ = -2 + 0.8x. Is it appropriate to predict sales when temp = 0°F? Explain.

💡 Solution

Step 1: Make the prediction

ŷ = -2 + 0.8(0) = -2

This predicts -$2 in sales, which is IMPOSSIBLE!

Step 2: Identify the problem

This is EXTRAPOLATION: predicting outside the data range.

Issues:

  1. Temperature of 0°F likely outside original data range
  2. Linear relationship may not hold at extremes
  3. Model gives nonsensical result (negative sales)
  4. Y-intercept is just a mathematical constant, not meaningful here

Step 3: Proper approach

Only use regression for INTERPOLATION (within the data range). If the data were collected at 60-100°F, only predict in that range.

Answer: No. This is inappropriate extrapolation resulting in an impossible prediction. Only use regression within the range of observed x-values.