Least-Squares Regression

Finding the line of best fit

Regression Line

Purpose: Find best-fit line through scatterplot

Equation:

$\hat{y} = a + bx$

Where:

  • $\hat{y}$ = predicted value of y
  • b = slope
  • a = y-intercept
  • x = value of explanatory variable

Least-Squares Criterion

Least-squares regression line: Line minimizing sum of squared residuals

Residual: Difference between observed and predicted

$\text{residual} = y - \hat{y}$

Least-squares minimizes: $\sum (y - \hat{y})^2$

Why square? Squaring keeps positive and negative deviations from cancelling out
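
To make the criterion concrete, here is a minimal Python sketch (data made up for illustration) that evaluates the sum of squared residuals for a candidate line; the least-squares line is the (a, b) pair that makes this quantity smallest.

```python
# Hypothetical data: x = height (inches), y = weight (pounds)
x = [64, 66, 68, 70, 72]
y = [130, 142, 148, 160, 170]

def sum_squared_residuals(a, b):
    """Sum of (y - y-hat)^2 for the candidate line y-hat = a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

# Squaring keeps positive and negative residuals from cancelling:
# the flat line at y-bar has raw residuals summing to exactly 0,
# yet its squared sum is far larger than the least-squares line's.
print(sum_squared_residuals(a=-183.2, b=4.9))  # least-squares line for these data: SSE = 7.6
print(sum_squared_residuals(a=150.0, b=0.0))   # flat line at y-bar: SSE = 968
```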

Formulas for Slope and Intercept

Slope:

$b = r \dfrac{s_y}{s_x}$

Where:

  • r = correlation
  • s_y = standard deviation of y
  • s_x = standard deviation of x

y-intercept:

$a = \bar{y} - b\bar{x}$

Key insight: Line always passes through $(\bar{x}, \bar{y})$
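
One-line check of the key insight: substitute $x = \bar{x}$ and the intercept formula into the equation,

$\hat{y} = a + b\bar{x} = (\bar{y} - b\bar{x}) + b\bar{x} = \bar{y}$

so the point $(\bar{x}, \bar{y})$ always satisfies the equation.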

Example: Finding Regression Line

Data: Height (x) and weight (y) of 5 people

$\bar{x} = 68$, $s_x = 4$
$\bar{y} = 150$, $s_y = 20$
$r = 0.8$

Slope:

$b = 0.8 \times \dfrac{20}{4} = 0.8 \times 5 = 4$

Intercept:

$a = 150 - 4(68) = 150 - 272 = -122$

Equation:

$\hat{y} = -122 + 4x$

Interpretation: For each inch increase in height, predicted weight increases by 4 pounds.
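
The same arithmetic as a quick Python check, using the summary statistics from the example:

```python
r, s_x, s_y = 0.8, 4, 20
x_bar, y_bar = 68, 150

b = r * s_y / s_x       # slope: 0.8 * (20/4) = 4.0
a = y_bar - b * x_bar   # intercept: 150 - 4*68 = -122.0

print(f"y-hat = {a} + {b}x")  # y-hat = -122.0 + 4.0x
```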

Interpreting Slope

Slope b = change in $\hat{y}$ per unit increase in x

Template: "For each [1 unit] increase in [x], predicted [y] [increases/decreases] by [|b|] [y-units]."

Example: b = 4 in height/weight

"For each 1-inch increase in height, predicted weight increases by 4 pounds."

Negative slope: "decreases by..."

Interpreting y-Intercept

y-intercept a = predicted y when x = 0

Often meaningless!

  • Height = 0 → weight = -122 pounds? Nonsense!

Only interpret if x = 0 is meaningful and within data range

Example where meaningful:

  • y = test score, x = hours studied
  • a = predicted score with 0 hours studying

Making Predictions

Substitute x into equation:

$\hat{y} = a + bx$

Example: Predict weight for height = 70 inches

$\hat{y} = -122 + 4(70) = -122 + 280 = 158 \text{ pounds}$

Caution: Extrapolation (predicting outside data range) is risky!
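
A small sketch of prediction with a crude range check (the observed height range used here is an assumption, not part of the example):

```python
def predict_weight(height, a=-122, b=4, lo=62, hi=75):
    """Return y-hat = a + b*x, flagging extrapolation outside the observed x-range."""
    if not lo <= height <= hi:
        print(f"warning: height {height} is outside [{lo}, {hi}] -- extrapolating")
    return a + b * height

print(predict_weight(70))   # 158 (interpolation)
print(predict_weight(100))  # 278, printed with an extrapolation warning
```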

Extrapolation

Interpolation: Predict within range of data ✓

Extrapolation: Predict outside range of data ⚠

Problem with extrapolation:

  • Relationship may not continue
  • May become nonlinear
  • Other factors may matter

Example: Predicting weight for height = 100 inches

  • Well outside typical range
  • Relationship might not hold
  • Prediction unreliable

Calculator Method

TI-83/84:

  1. Enter data in L1 (x) and L2 (y)
  2. STAT → CALC → 8:LinReg(a+bx)
  3. Read a, b, r, r²

Result shows:

  • y = a + bx
  • r (correlation)
  • r² (coefficient of determination)
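
Outside a calculator, the same numbers come from any least-squares routine; a sketch with SciPy's `linregress` (hypothetical data):

```python
from scipy.stats import linregress

x = [64, 66, 68, 70, 72]       # heights (inches)
y = [130, 142, 148, 160, 170]  # weights (pounds)

res = linregress(x, y)
print(f"a = {res.intercept:.2f}, b = {res.slope:.2f}")     # intercept and slope
print(f"r = {res.rvalue:.3f}, r^2 = {res.rvalue**2:.3f}")  # correlation and r-squared
```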

Properties of Regression Line

1. Passes through $(\bar{x}, \bar{y})$

2. Sum of residuals = 0

  • Positive and negative balance out

3. Unique (only one least-squares line)

4. Sensitive to outliers

  • One outlier can drastically change line
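
Properties 1 and 2 are easy to verify numerically; a sketch using only the summary-statistic formulas above (hypothetical data; `statistics.correlation` needs Python 3.10+):

```python
import statistics as st

x = [64, 66, 68, 70, 72]
y = [130, 142, 148, 160, 170]

b = st.correlation(x, y) * st.stdev(y) / st.stdev(x)  # b = r * s_y / s_x
a = st.mean(y) - b * st.mean(x)                       # a = y-bar - b * x-bar

residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
print(abs(sum(residuals)) < 1e-9)                     # True: residuals sum to ~0
print(abs((a + b * st.mean(x)) - st.mean(y)) < 1e-9)  # True: line hits (x-bar, y-bar)
```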

Residuals

Residual = observed - predicted = $y - \hat{y}$

  • Positive residual: Point above line (underestimate)
  • Negative residual: Point below line (overestimate)
  • Zero residual: Point on line (exact prediction)

Example: Actual weight = 160, predicted = 158

  • Residual = 160 - 158 = 2 pounds
  • Underestimated by 2 pounds

Influential Points

Influential point: Removing it substantially changes regression line

Usually:

  • Outliers in x-direction (far from $\bar{x}$)
  • Have high leverage (pull line toward them)

Not all outliers are influential!

  • Outlier in y-direction but near $\bar{x}$ → less influential

Always identify and investigate influential points
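
A sketch of influence in action (hypothetical data): refit after adding one point far from $\bar{x}$ and watch the slope move.

```python
import statistics as st

def fit(x, y):
    """Least-squares (intercept, slope) via b = r*s_y/s_x, a = y-bar - b*x-bar."""
    b = st.correlation(x, y) * st.stdev(y) / st.stdev(x)
    return st.mean(y) - b * st.mean(x), b

x = [64, 66, 68, 70, 72]
y = [130, 142, 148, 160, 170]

print(fit(x, y))                 # original line
print(fit(x + [90], y + [140]))  # one high-leverage point drags the slope way down
```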

Regression Toward the Mean

Phenomenon: Extreme x-values tend to predict less extreme y-values

Why? Because $|r| < 1$ (the relationship is not perfect)

Example: Very tall parents tend to have children shorter than themselves (still tall, but less extreme)

Slope formula explains: $b = r(s_y/s_x)$

  • Since $|r| < 1$, the predicted y is fewer standard deviations from $\bar{y}$ than x is from $\bar{x}$ (see below)
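
Equivalently, in standardized units the regression equation is

$\dfrac{\hat{y} - \bar{y}}{s_y} = r \cdot \dfrac{x - \bar{x}}{s_x}$

so an x that is 2 standard deviations above $\bar{x}$ predicts a y only $2r$ standard deviations above $\bar{y}$.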

Switching x and y

Regression NOT symmetric!

Different lines:

  • Regression of y on x: $\hat{y} = a + bx$
  • Regression of x on y: $\hat{x} = c + dy$

These are NOT equivalent!

Use: Predict y from x → use y on x line
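
A sketch of the asymmetry (hypothetical data): the two slopes multiply to $r^2$, so the lines coincide only when $|r| = 1$.

```python
import statistics as st

x = [64, 66, 68, 70, 72]
y = [130, 142, 148, 160, 170]

r = st.correlation(x, y)
b_yx = r * st.stdev(y) / st.stdev(x)  # slope for regressing y on x
b_xy = r * st.stdev(x) / st.stdev(y)  # slope for regressing x on y

# If the two lines were the same, b_xy would equal 1/b_yx.
# Instead b_yx * b_xy = r^2 < 1 (unless |r| = 1), so they differ.
print(b_yx * b_xy, r ** 2)  # equal values, less than 1 here
```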

Common Mistakes

❌ Interpreting the y-intercept when x = 0 is meaningless
❌ Extrapolating beyond data range
❌ Confusing slope units
❌ Thinking regression proves causation
❌ Using regression when relationship nonlinear

Causation Reminder

Regression line can be used for prediction

Does NOT prove causation!

Strong relationship ≠ cause-and-effect

Need: Controlled experiment to establish causation

Quick Reference

Equation: $\hat{y} = a + bx$

Slope: $b = r(s_y/s_x)$

Intercept: $a = \bar{y} - b\bar{x}$

Line passes through: $(\bar{x}, \bar{y})$

Residual: $y - \hat{y}$

Least-squares minimizes: $\sum(y - \hat{y})^2$

Remember: Regression gives best prediction line but doesn't prove causation. Beware extrapolation! Always check for influential points.
