Coefficient of Determination

Understanding r-squared

Coefficient of Determination (r²)

What is r²?

Coefficient of Determination (r²): Proportion of variability in y explained by linear relationship with x

Formula:

r^2 = (r)^2

Where r is the correlation coefficient

Range: 0 ≤ r² ≤ 1 (or 0% to 100%)
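
A minimal sketch of the calculation, assuming NumPy and using made-up study-hours and test-score data:

```python
import numpy as np

hours = np.array([1, 2, 3, 4, 5, 6])         # explanatory variable x (made-up)
scores = np.array([62, 68, 71, 75, 82, 85])  # response variable y (made-up)

r = np.corrcoef(hours, scores)[0, 1]  # correlation coefficient
r_squared = r ** 2                    # coefficient of determination

print(f"r  = {r:.3f}")
print(f"r² = {r_squared:.3f}")  # proportion of variability in y explained by x
```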

Interpreting r²

Template: "About [r² × 100]% of the variability in [y] is explained by the linear relationship with [x]."

Example: r² = 0.64

"About 64% of the variability in test scores is explained by the linear relationship with study hours."

Remaining variability (1 - r²):

  • Due to other variables
  • Random variation
  • Unexplained by this model

Example 1: Calculating r²

Height and weight: r = 0.8

r^2 = (0.8)^2 = 0.64

Interpretation: "About 64% of the variability in weight is explained by the linear relationship with height. The remaining 36% is due to other factors."

r² vs r

r (correlation):

  • Shows strength AND direction
  • Range: -1 to 1
  • Negative values meaningful

r² (coefficient of determination):

  • Shows strength only (no direction)
  • Range: 0 to 1
  • Always positive
  • Easier to interpret as percentage

From r² alone, you cannot tell whether the relationship is positive or negative!

  • Also report r or the slope to convey direction (see the sketch below)
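
A quick sketch (made-up data) showing why direction must be reported separately: opposite relationships can share the same r².

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y_up = np.array([2, 4, 6, 8, 10])    # perfectly increasing
y_down = np.array([10, 8, 6, 4, 2])  # perfectly decreasing

for label, y in [("positive", y_up), ("negative", y_down)]:
    r = np.corrcoef(x, y)[0, 1]
    print(f"{label}: r = {r:+.2f}, r² = {r**2:.2f}")
# Both lines print r² = 1.00; only r reveals the direction.
```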

What r² Means

r² = 0.90: Model explains 90% of variability (excellent fit)

r² = 0.70: Model explains 70% of variability (good fit)

r² = 0.50: Model explains 50% of variability (moderate fit)

r² = 0.25: Model explains 25% of variability (weak fit)

r² = 0: Model explains none of variability (no linear relationship)

Note: These are rough guidelines, context dependent!

Visualizing r²

Think of variability in y:

Total variability: How much the y-values spread out from \bar{y}

Explained variability: How much \hat{y} varies (due to the linear relationship)

Unexplained variability: How much the points deviate from the line (residuals)

\text{Total} = \text{Explained} + \text{Unexplained}

r^2 = \frac{\text{Explained}}{\text{Total}}

Formal Definition

r^2 = \frac{\sum(\hat{y} - \bar{y})^2}{\sum(y - \bar{y})^2}

Numerator: Variability in predictions
Denominator: Total variability in y

Equivalently:

r^2 = 1 - \frac{\sum(y - \hat{y})^2}{\sum(y - \bar{y})^2}

r^2 = 1 - \frac{\text{Unexplained}}{\text{Total}}
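
For simple linear regression, all of these expressions agree. A minimal NumPy sketch (made-up data) that checks the equivalence:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Least-squares line: y-hat = a + b*x
b, a = np.polyfit(x, y, 1)
y_hat = a + b * x

ss_total = np.sum((y - y.mean()) ** 2)          # total variability
ss_explained = np.sum((y_hat - y.mean()) ** 2)  # explained by the line
ss_residual = np.sum((y - y_hat) ** 2)          # unexplained (residuals)

print(ss_explained / ss_total)        # Explained / Total
print(1 - ss_residual / ss_total)     # 1 - Unexplained / Total
print(np.corrcoef(x, y)[0, 1] ** 2)   # (correlation)² -- all three match
```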

Example 2: Detailed Calculation

Data: 5 points with \bar{y} = 10

Total variability: \sum(y - \bar{y})^2 = 100

Unexplained (residuals): \sum(y - \hat{y})^2 = 25

r^2 = 1 - \frac{25}{100} = 1 - 0.25 = 0.75

Interpretation: 75% of variability explained, 25% unexplained
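
The same arithmetic as a two-line check:

```python
ss_total = 100    # Σ(y - ȳ)², from Example 2
ss_residual = 25  # Σ(y - ŷ)²

r_squared = 1 - ss_residual / ss_total
print(r_squared)  # 0.75 → 75% explained, 25% unexplained
```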

What r² Does NOT Mean

❌ r² is NOT probability

  • Not "probability model is correct"
  • Not "probability prediction is right"

❌ r² does NOT prove causation

  • High r² doesn't mean x causes y
  • Could be coincidence or confounding

❌ r² alone doesn't guarantee good model

  • Could have high r² but residuals show pattern
  • Always check residual plot!

❌ r² doesn't tell about prediction accuracy for individuals

  • Use s (standard error) for that

When is r² High?

High r² occurs when:

  1. Strong linear relationship (|r| close to 1)
  2. Points close to regression line
  3. Little unexplained variability
  4. x is good predictor of y

Does NOT require:

  • Large sample size (can have high r² with small n)
  • Causation
  • Practical importance

When is r² Low?

Low r² occurs when:

  1. Weak linear relationship
  2. Lots of scatter around line
  3. Much unexplained variability
  4. x is poor predictor of y

Possible reasons:

  • No relationship exists
  • Relationship is nonlinear
  • Other variables more important
  • High natural variability in y

Comparing Models

Use r² to compare models on same data:

Model 1: Height predicting weight, r² = 0.64
Model 2: Age predicting weight, r² = 0.45

Conclusion: Height explains more variability (better predictor)

Caution: Only compare r² for models with the same response variable!
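
A sketch of such a comparison (made-up height, age, and weight values; the resulting r² values will not match the example above):

```python
import numpy as np

weight = np.array([55, 62, 70, 78, 85, 90])        # response (kg)
height = np.array([160, 165, 172, 178, 183, 188])  # predictor 1 (cm)
age = np.array([22, 35, 28, 45, 31, 50])           # predictor 2 (years)

for name, x in [("height", height), ("age", age)]:
    r2 = np.corrcoef(x, weight)[0, 1] ** 2
    print(f"{name}: r² = {r2:.2f}")
# The predictor with the larger r² explains more of the variability in weight.
```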

Adjusted r²

For multiple regression (multiple explanatory variables)

Problem: Adding explanatory variables never decreases r² (even useless variables can raise it!)

Adjusted r²: Penalizes for number of variables

r_{adj}^2 = 1 - \frac{(1 - r^2)(n - 1)}{n - k - 1}

Where k = number of explanatory variables

Use: Compare models with different numbers of variables
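
A small sketch of the formula (the r², n, and k values are hypothetical):

```python
def adjusted_r_squared(r_squared, n, k):
    """Penalize r² for the number of explanatory variables k, given n observations."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

# Hypothetical: r² = 0.64 from n = 30 observations and k = 3 predictors
print(adjusted_r_squared(0.64, n=30, k=3))  # ≈ 0.598, slightly below the plain r²
```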

Relationship to Standard Error

Related concepts:

r²: Proportion of variability explained

s: Typical prediction error (in original units)

Both measure model fit:

  • High r² ↔ small s
  • Low r² ↔ large s

s often more useful for predictions (gives actual error magnitude)
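
A sketch (made-up data) computing both measures from the same fit:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([3.2, 4.8, 7.1, 8.9, 11.2, 12.8])

b, a = np.polyfit(x, y, 1)   # least-squares slope and intercept
residuals = y - (a + b * x)

r_squared = np.corrcoef(x, y)[0, 1] ** 2
s = np.sqrt(np.sum(residuals ** 2) / (len(x) - 2))  # standard error of the residuals

print(f"r² = {r_squared:.3f}  (proportion of variability explained)")
print(f"s  = {s:.3f}  (typical prediction error, in units of y)")
```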

Common Mistakes

❌ Saying "r² is probability"
❌ Thinking high r² proves causation
❌ Using r² alone without checking residual plot
❌ Comparing r² across different response variables
❌ Not reporting direction of relationship (r² loses sign)

Practical Significance

Statistical vs Practical:

High r² in context:

  • Social sciences: r² > 0.50 often considered good
  • Physical sciences: r² > 0.90 often expected
  • Individual predictions: Even r² = 0.90 may not be precise enough

Consider:

  • What's typical in your field?
  • What's needed for practical use?
  • What's the cost of prediction errors?

Reporting Results

Complete report includes:

  1. Correlation (r): Shows direction and strength
  2. r²: Shows percent variability explained
  3. Equation: \hat{y} = a + bx
  4. Standard error (s): Typical prediction error
  5. Residual plot: Visual check of model appropriateness

Don't report r² alone!
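
A sketch that produces each item in the checklist (made-up study-hours data; SciPy and Matplotlib assumed available):

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

hours = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
scores = np.array([58, 63, 66, 72, 74, 80, 83, 88], dtype=float)

fit = stats.linregress(hours, scores)
y_hat = fit.intercept + fit.slope * hours
residuals = scores - y_hat
s = np.sqrt(np.sum(residuals ** 2) / (len(hours) - 2))

print(f"r        = {fit.rvalue:.3f}")        # direction and strength
print(f"r²       = {fit.rvalue ** 2:.3f}")   # percent variability explained
print(f"equation : ŷ = {fit.intercept:.2f} + {fit.slope:.2f}x")
print(f"s        = {s:.2f}")                 # typical prediction error

# Residual plot: look for random scatter around zero
plt.scatter(hours, residuals)
plt.axhline(0)
plt.xlabel("study hours")
plt.ylabel("residual")
plt.show()
```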

Quick Reference

r²: Proportion of variability in y explained by x

Formula: r² = (correlation)²

Range: 0 to 1 (0% to 100%)

Interpretation: "[r² × 100]% of variability in y explained by linear relationship with x"

High r²: Good fit, points close to line
Low r²: Poor fit, much unexplained variability

Remember: r² measures how well x predicts y, but doesn't prove causation. Always check residual plot! High r² alone doesn't guarantee good model.
