Inference for Regression
Confidence intervals and tests for slope
Beyond Description
So far: Described relationship in sample data
Now: Make inferences about population relationship
- Confidence interval for slope
- Hypothesis test for slope
- Prediction intervals
Conditions for Inference (LINE)
L - Linear relationship: Check scatterplot
I - Independent observations: Random sample, n < 10%N
N - Normal distribution of residuals: Check histogram/normal plot of residuals
E - Equal variance: Check residual plot (constant spread)
Must check all before inference!
Slope as Parameter
Sample: b = slope from data
Population: β (beta) = true slope in population
Question: Is there really a relationship, or did we just see pattern by chance?
Hypothesis Test for Slope
Hypotheses:
- H₀: β = 0 (no linear relationship)
- Hₐ: β ≠ 0 (linear relationship exists)
If β = 0: x has no effect on y
Test statistic:
t = (b - 0)/SE_b = b/SE_b
df = n - 2
SE_b (standard error of slope): provided by calculator/computer output
Example 1: Test for Slope
Height (x) and weight (y), n = 25:
b = 4, SE_b = 1.2
STATE:
- β = true slope
- H₀: β = 0
- Hₐ: β ≠ 0
- α = 0.05
PLAN:
- t-test for slope
- Conditions: LINE all checked ✓
DO:
t = b/SE_b = 4/1.2 ≈ 3.33
df = 25 - 2 = 23
P-value = 2 × P(t > 3.33) ≈ 0.003
CONCLUDE: P-value < 0.05, reject H₀. Significant linear relationship between height and weight.
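These numbers are easy to verify with software. A minimal sketch in Python (assuming SciPy is available; the values are just the summary statistics from this example):

```python
from scipy import stats

b, se_b, n = 4.0, 1.2, 25               # sample slope, its standard error, sample size
t_stat = (b - 0) / se_b                  # test statistic under H0: beta = 0
df = n - 2                               # two parameters estimated (intercept and slope)
p_value = 2 * stats.t.sf(t_stat, df)     # two-sided P-value from the t distribution

print(f"t = {t_stat:.2f}, df = {df}, P-value = {p_value:.4f}")
# t = 3.33, df = 23, P-value = 0.0029
```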
Confidence Interval for Slope
Formula:
b ± t* × SE_b
df = n - 2
Interpretation: "We are C% confident the true slope is between [L] and [U]."
Meaning: For each unit increase in x, y changes by between L and U units (on average in population)
Example 2: CI for Slope
Same data: b = 4, SE_b = 1.2, n = 25
95% CI:
df = 23, t* ≈ 2.069
4 ± 2.069 × 1.2 = 4 ± 2.48 → (1.52, 6.48)
Interpretation: "We are 95% confident that for each additional inch of height, weight increases by between 1.52 and 6.48 pounds on average."
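The interval can be checked the same way; a short sketch, again assuming SciPy:

```python
from scipy import stats

b, se_b, n = 4.0, 1.2, 25
df = n - 2
t_star = stats.t.ppf(0.975, df)    # 95% confidence leaves 2.5% in each tail; t* ≈ 2.069
me = t_star * se_b                 # margin of error ≈ 2.48

print(f"95% CI: ({b - me:.2f}, {b + me:.2f})")   # (1.52, 6.48)
```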
Relationship Between Test and CI
For two-sided test at α:
Check if (1-α) CI contains 0:
- If 0 in CI → fail to reject H₀
- If 0 not in CI → reject H₀
Example: 95% CI is (1.52, 6.48)
- Doesn't contain 0
- Reject H₀: β = 0 at α = 0.05
Prediction Interval
Different from confidence interval!
Confidence interval: For mean response
Prediction interval: For individual response
Prediction interval is wider (more uncertainty predicting individual)
Formula (approximate):
ŷ ± t* × s
Where s = standard deviation of residuals
More precise formula accounts for:
- Distance of x from x̄ (farther from the mean = wider interval)
- Sample size
Example 3: Prediction Interval
Predict weight for height = 70:
ŷ = 158, s = 10, n = 25
95% prediction interval (rough):
158 ± 2.069 × 10 ≈ 158 ± 21 → (137, 179)
Interpretation: "We predict, with 95% confidence, that an individual with height 70 inches will weigh between 137 and 179 pounds."
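A sketch of this rough calculation (the exact interval from software will be a bit wider when x is far from x̄):

```python
from scipy import stats

y_hat, s, n = 158.0, 10.0, 25            # predicted value, residual SD, sample size
t_star = stats.t.ppf(0.975, n - 2)       # df = 23, t* ≈ 2.069

lo, hi = y_hat - t_star * s, y_hat + t_star * s
print(f"Rough 95% PI: ({lo:.0f}, {hi:.0f})")     # (137, 179)
```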
Standard Error of Slope
Formula:
SE_b = s / (s_x × √(n - 1))
Where s = standard deviation of residuals and s_x = standard deviation of the x-values
Factors making SE_b smaller:
- Smaller s (points closer to line)
- Larger sample size n
- More spread in x-values
Smaller SE_b → narrower CI → more precise estimate
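To see the formula in action, here is a sketch with simulated data (all numbers are invented for illustration). scipy.stats.linregress reports SE_b as its stderr attribute, and the hand formula should match it:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 25
x = rng.uniform(60, 75, size=n)                  # hypothetical heights
y = 4 * x - 120 + rng.normal(0, 10, size=n)      # hypothetical weights with noise

res = stats.linregress(x, y)                     # res.stderr is SE_b

# Hand computation: s / (s_x * sqrt(n - 1)), with s computed using df = n - 2
resid = y - (res.intercept + res.slope * x)
s = np.sqrt(np.sum(resid**2) / (n - 2))          # standard deviation of residuals
se_b = s / (np.std(x, ddof=1) * np.sqrt(n - 1))

print(res.stderr, se_b)                          # the two values agree
```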
Checking Conditions
Linearity:
- Scatterplot roughly linear
- Residual plot shows no curve
Independence:
- Random sample
- No time trends
- Each observation independent
Normality:
- Histogram of residuals roughly normal
- Normal probability plot roughly linear
- Less critical for large n (CLT)
Equal Variance:
- Residual plot shows constant spread
- No fan shape
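A sketch of the standard diagnostic plots, assuming NumPy, SciPy, and matplotlib are available (the data here are simulated just to have something to plot):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=40)
y = 2 + 3 * x + rng.normal(0, 2, size=40)

res = stats.linregress(x, y)
fitted = res.intercept + res.slope * x
resid = y - fitted

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
axes[0].scatter(fitted, resid)             # Linearity / Equal variance: no curve, no fan shape
axes[0].axhline(0, color="gray")
axes[0].set_title("Residuals vs fitted")
axes[1].hist(resid)                        # Normality: roughly bell-shaped
axes[1].set_title("Histogram of residuals")
stats.probplot(resid, plot=axes[2])        # Normality: points roughly on a straight line
plt.tight_layout()
plt.show()
```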
What if Conditions Not Met?
Nonlinear: Transform variables or use nonlinear methods
Not normal (small n): Be cautious with inference
Not equal variance: Consider transformation or weighted regression
Not independent: Use time series or other methods
Don't ignore violations! Inference may be invalid
Prediction vs Confidence Interval
Confidence Interval for Mean Response:
- "Average y for all individuals with x = x₀"
- Narrower
- Use: Policy decisions, understanding average effect
Prediction Interval for Individual:
- "Single y value for one individual with x = x₀"
- Wider (includes individual variability)
- Use: Predicting specific outcome
Always wider: Prediction interval > confidence interval
Multiple Regression Preview
So far: One explanatory variable
Multiple regression: Several explanatory variables
Can test each slope: Does this variable help predict y (controlling for others)?
Beyond AP Stats, but important to know it exists
Common Mistakes
❌ Not checking LINE conditions
❌ Using normal instead of t-distribution
❌ Confusing prediction and confidence intervals
❌ Using df = n instead of n - 2
❌ Making inference when conditions violated
Practical Significance
Statistical significance (P < 0.05) doesn't mean practical importance
Example: Slope = 0.01, P = 0.001
- Statistically significant
- But is 0.01 change per unit practically meaningful?
Consider:
- Effect size (magnitude of slope)
- Context
- Practical implications
Quick Reference
Test for slope: t = b/SE_b, df = n - 2
CI for slope: b ± t* × SE_b, df = n - 2
Conditions: LINE (Linear, Independent, Normal, Equal variance)
Prediction interval: Wider than confidence interval
0 in CI for slope? → Fail to reject H₀ (no significant linear relationship)
Remember: Check LINE conditions before inference! Inference lets us extend conclusions beyond our sample to the broader population, but only if conditions are met.
📚 Practice Problems
Problem 1 (medium)
❓ Question:
A regression of study hours (x) on test scores (y) gives slope b₁ = 5.2 with SE = 1.3, n = 20. Construct a 95% confidence interval for the true slope β₁.
💡 Solution
Step 1: Identify given information
Slope: b₁ = 5.2
Standard error: SE = 1.3
Sample size: n = 20
Confidence level: 95%
Step 2: Find degrees of freedom
df = n - 2 = 20 - 2 = 18 (use n - 2 for regression, not n - 1)
Step 3: Find t* critical value
From t-table with df = 18, 95% confidence: t* = 2.101
Step 4: Calculate margin of error
ME = t* × SE = 2.101 × 1.3 ≈ 2.73
Step 5: Construct confidence interval
CI = b₁ ± ME = 5.2 ± 2.73 = (2.47, 7.93)
Step 6: Interpret
"We are 95% confident that for each additional hour studied, the true mean increase in test score is between 2.47 and 7.93 points."
Note: Since 0 is NOT in the interval, there is significant evidence of a positive relationship (can reject H₀: β₁ = 0).
Answer: 95% CI: (2.47, 7.93) points per hour
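A quick software check of this interval (a sketch assuming SciPy):

```python
from scipy import stats

b1, se, n = 5.2, 1.3, 20
t_star = stats.t.ppf(0.975, n - 2)       # df = 18, t* ≈ 2.101
me = t_star * se                         # ≈ 2.73

print(f"95% CI: ({b1 - me:.2f}, {b1 + me:.2f})")   # (2.47, 7.93)
```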
Problem 2 (medium)
❓ Question:
Test H₀: β₁ = 0 vs Hₐ: β₁ ≠ 0 given b₁ = 3.5, SE = 1.2, n = 25, α = 0.05.
💡 Solution
Step 1: Set up hypotheses
H₀: β₁ = 0 (no relationship)
Hₐ: β₁ ≠ 0 (relationship exists)
Two-tailed test, α = 0.05
Step 2: Check conditions
LINEAR: Assume scatterplot is linear ✓
INDEPENDENT: Assume random sample, n < 10% of population ✓
NORMAL: Residuals approximately normal ✓
EQUAL VARIANCE: Residual plot shows constant spread ✓
(The LINE conditions for regression inference)
Step 3: Calculate test statistic
df = n - 2 = 25 - 2 = 23
t = (b₁ - 0)/SE = 3.5/1.2 ≈ 2.917
Step 4: Find p-value
From t-table with df = 23, two-tailed: t = 2.917 is between t = 2.807 (p = 0.01) and t = 3.767 (p = 0.001)
So: 0.001 < p-value < 0.01
More precisely: p-value ≈ 0.0077
Step 5: Make decision
p-value (0.0077) < α (0.05) → REJECT H₀
Step 6: Conclusion in context
"There is significant evidence (p ≈ 0.008) that a linear relationship exists between x and y. The slope is significantly different from zero."
Answer: t = 2.92, p-value ≈ 0.008. Reject H₀. Significant evidence of linear relationship.
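The same check for this test (a sketch assuming SciPy):

```python
from scipy import stats

b1, se, n = 3.5, 1.2, 25
t_stat = b1 / se                          # ≈ 2.917
p_value = 2 * stats.t.sf(t_stat, n - 2)   # two-tailed, df = 23

print(f"t = {t_stat:.3f}, p = {p_value:.4f}")   # t = 2.917, p = 0.0077
```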
Problem 3 (medium)
❓ Question:
What are the conditions (LINE) for inference in regression? Explain each briefly.
💡 Solution
The LINE conditions for regression inference:
L - LINEAR
Relationship between x and y is linear
Check: Scatterplot should show a linear pattern; residual plot should show no curve
I - INDEPENDENT
Observations are independent
Check: Random sampling
n < 10% of population (if sampling without replacement)
No time series or repeated measures
N - NORMAL
Residuals are approximately normally distributed
Check: Histogram or normal probability plot of residuals
Not critical if n is large (n ≥ 30); just need no strong skewness or outliers
E - EQUAL VARIANCE (also called homoscedasticity)
Variability of y is constant for all x
Check: Residual plot shows roughly equal vertical spread
No fan shape or other pattern in spread
Why these matter:
- LINEAR: For model to be appropriate
- INDEPENDENT: For formulas to be valid
- NORMAL: For t-distribution to apply (especially small samples)
- EQUAL VARIANCE: For standard errors to be correct
If violations:
- Not linear → transform or use nonlinear model
- Not independent → use different methods (time series, etc.)
- Not normal → okay if n ≥ 30; otherwise transform
- Not equal variance → transform or use weighted regression
Answer: LINE = Linear relationship, Independent observations, Normal residuals, Equal variance. Check using scatterplot, residual plot, and normal probability plot.
Problem 4 (medium)
❓ Question:
Computer output shows: b₁ = 2.4, SE(b₁) = 0.8, t = 3.0, p = 0.006, n = 22. Interpret the p-value in context.
💡 Solution
Step 1: Identify the test
Testing H₀: β₁ = 0 (no relationship) against Hₐ: β₁ ≠ 0 (relationship exists)
Given: p-value = 0.006
Step 2: What the p-value means statistically
The probability of observing a slope as extreme as 2.4 (or more extreme) IF the true slope is actually 0.
Step 3: Interpret in context
"If there were truly no linear relationship between x and y (β₁ = 0), the probability of obtaining a sample slope of 2.4 or more extreme (in either direction) is 0.006, or 0.6%."
Step 4: Practical interpretation
This is very unlikely (less than a 1% chance)!
Therefore: strong evidence AGAINST H₀. The relationship is statistically significant.
Step 5: Decision at α = 0.05
Since p-value (0.006) < α (0.05): REJECT H₀
Conclusion: "There is strong evidence of a significant linear relationship. The slope is significantly different from zero (p = 0.006)."
Step 6: What this does NOT mean
✗ Does not mean the slope is definitely 2.4
✗ Does not mean x causes y
✗ Does not mean the model fits well (could still have problems)
✓ Only means: the slope is significantly different from zero
Answer: If true slope were 0, probability of getting b₁ = 2.4 or more extreme is only 0.006. This provides strong evidence the slope is not zero - there is a significant linear relationship.
Problem 5 (hard)
❓ Question:
Why do we use t-distribution with df = n-2 for regression inference instead of df = n-1?
💡 Solution
Step 1: Compare to one-sample t-test
One-sample t-test: df = n - 1
- Estimate 1 parameter: μ
- Lose 1 df
Regression: df = n - 2
- Estimate 2 parameters: β₀ AND β₁
- Lose 2 df
Step 2: What we're estimating
In regression, we estimate:
- Intercept (β₀)
- Slope (β₁)
Both use up degrees of freedom!
Step 3: Degrees of freedom explained
Start with n observations:
- Use one to estimate β₀ (intercept)
- Use one to estimate β₁ (slope)
- Left with n - 2 for error estimation
df = n - 2
Step 4: Why it matters
Smaller df → larger t* critical values → wider CIs
Example: n = 10, 95% confidence
- One-sample (df = 9): t* = 2.262
- Regression (df = 8): t* = 2.306
Regression CI slightly wider (more uncertainty).
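Both critical values are easy to confirm (a sketch assuming SciPy):

```python
from scipy import stats

print(stats.t.ppf(0.975, 9))   # one-sample t, df = 9: 2.262...
print(stats.t.ppf(0.975, 8))   # regression, df = 8: 2.306...
```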
Step 5: As n increases
For large n, the difference is minimal:
- df = 100 vs 98 → nearly same t*
- Both approach z* = 1.96
Step 6: General pattern
Degrees of freedom = n - (number of parameters estimated)
- Mean only: n - 1
- Regression: n - 2
- Multiple regression with k predictors: n - k - 1
Answer: We estimate TWO parameters (β₀ and β₁), so we lose 2 degrees of freedom, giving df = n - 2. This accounts for the extra uncertainty from estimating both intercept and slope.