Hypothesis Testing Framework

Null and alternative hypotheses, significance level

What is Hypothesis Testing?

Hypothesis Test: Formal procedure to decide between two competing claims about a population parameter

Two hypotheses:

Null hypothesis (H₀): Status quo, no effect, no difference
Alternative hypothesis (Hₐ or H₁): What we're trying to show

Goal: Determine if data provides sufficient evidence to reject H₀ in favor of Hₐ

Setting Up Hypotheses

H₀: Always includes equality (=, ≤, ≥)

Hₐ: Can be:

Two-sided: μ ≠ μ₀ (different from)
Right-sided: μ > μ₀ (greater than)
Left-sided: μ < μ₀ (less than)

Examples:

Claim: Mean height > 68 inches

H₀: μ = 68 or μ ≤ 68
Hₐ: μ > 68

Claim: Proportion ≠ 0.5

H₀: p = 0.5
Hₐ: p ≠ 0.5

The Four-Step Process

Step 1: STATE

Parameter of interest
Hypotheses (H₀ and Hₐ)
Significance level α

Step 2: PLAN

Choose appropriate test
Check conditions

Step 3: DO

Calculate test statistic
Find P-value

Step 4: CONCLUDE

Compare P-value to α
State conclusion in context

Test Statistic

General form:

$\text{Test statistic} = \frac{\text{statistic} - \text{parameter}}{\text{standard error}}$

For means (t-test):

$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$

For proportions (z-test):

$z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}}$

Measures: How many standard errors the statistic is from hypothesized parameter
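As a minimal sketch, the two formulas above can be written as small Python functions. The function names `t_statistic` and `z_statistic_proportion` are illustrative choices, not part of the lesson; the inputs are the numbers used in the worked example and in Problem 3 on this page.

```python
import math

def t_statistic(x_bar, mu_0, s, n):
    """One-sample t statistic: (sample mean - hypothesized mean) / standard error."""
    standard_error = s / math.sqrt(n)
    return (x_bar - mu_0) / standard_error

def z_statistic_proportion(p_hat, p_0, n):
    """One-proportion z statistic; the standard error uses p_0, the value assumed under H0."""
    standard_error = math.sqrt(p_0 * (1 - p_0) / n)
    return (p_hat - p_0) / standard_error

t = t_statistic(x_bar=78, mu_0=75, s=10, n=30)
z = z_statistic_proportion(p_hat=0.34, p_0=0.40, n=200)
print(round(t, 2), round(z, 2))  # 1.64 -1.73
```

Both values read the same way: the sample mean is about 1.64 standard errors above 75, and the sample proportion is about 1.73 standard errors below 0.40.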

P-Value

P-value: Probability of getting results as extreme or more extreme than observed, assuming H₀ is true

Interpretation:

Small P-value → data inconsistent with H₀ → evidence against H₀
Large P-value → data consistent with H₀ → insufficient evidence against H₀

Finding P-value:

Two-sided: P(|test statistic| ≥ |observed|)
Right-sided: P(test statistic ≥ observed)
Left-sided: P(test statistic ≤ observed)
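The three cases above can be sketched for a z statistic with Python's standard-library `statistics.NormalDist`. The function name `z_p_value` and the labels `"two-sided"`, `"greater"`, and `"less"` are hypothetical choices for this illustration. A t statistic would follow the same logic with a t distribution in place of the normal.

```python
from statistics import NormalDist

def z_p_value(z, alternative):
    """P-value for a z statistic; alternative is 'two-sided', 'greater', or 'less'."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if alternative == "greater":      # right-sided: P(Z >= z)
        return 1 - std_normal.cdf(z)
    if alternative == "less":         # left-sided: P(Z <= z)
        return std_normal.cdf(z)
    if alternative == "two-sided":    # both tails: P(|Z| >= |z|)
        return 2 * (1 - std_normal.cdf(abs(z)))
    raise ValueError(f"unknown alternative: {alternative}")

for alt in ("two-sided", "greater", "less"):
    print(alt, round(z_p_value(-1.73, alt), 4))
```

The same observed statistic gives very different P-values depending on the direction of Hₐ, which is why the alternative must be fixed before looking at the data.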

Significance Level (α)

α: Threshold for rejecting H₀

Common values: 0.05, 0.01, 0.10

Decision rule:

If P-value ≤ α → Reject H₀
If P-value > α → Fail to reject H₀

Note: "Fail to reject" ≠ "accept" H₀ (lack of evidence against ≠ evidence for)

Example: Complete Test

Claim: Mean score exceeds 75. Sample: n = 30, $\bar{x}$ = 78, s = 10

STATE:

Parameter: μ = true mean score
H₀: μ = 75
Hₐ: μ > 75
α = 0.05

PLAN:

One-sample t-test
Conditions: Random ✓, Normal/large sample: n = 30 ≥ 30 ✓, Independence: n ≤ 10% of N ✓

DO: $t = \frac{78 - 75}{10/\sqrt{30}} = \frac{3}{1.826} \approx 1.64$

df = 29, P-value ≈ 0.056 (from tcdf)

CONCLUDE: P-value = 0.056 > 0.05, fail to reject H₀. Insufficient evidence that mean exceeds 75.
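The DO and CONCLUDE steps can be reproduced in code. Python's standard library has no t-distribution CDF, so this sketch approximates the tail area by integrating the t density numerically with Simpson's rule. That is an assumption made for self-containment; a calculator's tcdf or a statistics library would normally be used instead. The helper names `t_pdf` and `t_upper_tail` are hypothetical.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=2000):
    """Approximate P(T >= t) with Simpson's rule; steps must be even."""
    b = abs(t)
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_0_to_b = total * h / 3      # P(0 <= T <= |t|)
    tail = 0.5 - area_0_to_b         # P(T >= |t|), by symmetry about 0
    return tail if t >= 0 else 1 - tail

n, x_bar, s, mu_0, alpha = 30, 78, 10, 75, 0.05
t_stat = (x_bar - mu_0) / (s / math.sqrt(n))
p_value = t_upper_tail(t_stat, df=n - 1)   # right-sided test: Ha is mu > 75
decision = "reject H0" if p_value <= alpha else "fail to reject H0"
print(f"t = {t_stat:.2f}, P-value = {p_value:.3f}, {decision}")
```

The P-value lands just above 0.05, matching the conclusion above: the sample mean of 78 is not quite far enough from 75 to reject H₀ at the 5% level.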

One-Sided vs Two-Sided Tests

Two-sided: Looking for any difference

Hₐ: μ ≠ μ₀
P-value = 2 × P(t ≥ |observed|)

One-sided: Looking for specific direction

Hₐ: μ > μ₀ or μ < μ₀
P-value = P(t ≥ observed) or P(t ≤ observed)

Choose before seeing data! One-sided only if direction specified in advance

Statistical Significance

Statistically significant: P-value ≤ α

Interpretation: Result unlikely to occur by chance alone if H₀ true

NOT the same as practically significant!

Can have statistically significant but tiny effect
Large sample can detect trivial differences
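A short sketch of this effect, using made-up numbers: the sample mean is only 0.1 above the hypothesized value of 75, with s = 10. None of these figures come from the lesson. The P-value uses a normal approximation to the test statistic, which is an assumption that is reasonable for large n.

```python
import math
from statistics import NormalDist

def right_sided_p(x_bar, mu_0, s, n):
    """Right-sided P-value using a normal approximation to the test statistic."""
    z = (x_bar - mu_0) / (s / math.sqrt(n))
    return 1 - NormalDist().cdf(z)

# Hypothetical data: the sample mean is only 0.1 above the hypothesized value.
for n in (100, 10_000, 1_000_000):
    p = right_sided_p(x_bar=75.1, mu_0=75, s=10, n=n)
    print(f"n = {n:>9,}: P-value = {p:.4f}")
```

The difference of 0.1 never changes, yet the P-value shrinks toward 0 as n grows. Whether a 0.1-point difference matters is a practical question that the test cannot answer.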

Relationship to Confidence Intervals

For two-sided test at α = 0.05:

Equivalent to checking if (1-α) CI contains H₀ value

If μ₀ in 95% CI → P-value > 0.05
If μ₀ not in 95% CI → P-value ≤ 0.05

CI gives more information: Range of plausible values, not just yes/no
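As an illustration, the sketch below applies this to the data from the complete test example (n = 30, x̄ = 78, s = 10), but with a two-sided alternative Hₐ: μ ≠ 75, since the equivalence is stated for two-sided tests. The critical value t* = 2.045 for df = 29 is hardcoded from a standard t-table rather than computed.

```python
import math

n, x_bar, s, mu_0 = 30, 78, 10, 75
t_star = 2.045  # critical value for 95% confidence with df = 29, taken from a t-table

standard_error = s / math.sqrt(n)
margin = t_star * standard_error
ci_low, ci_high = x_bar - margin, x_bar + margin

t_stat = (x_bar - mu_0) / standard_error
reject_two_sided = abs(t_stat) >= t_star   # equivalent to two-sided P-value <= 0.05

print(f"95% CI: ({ci_low:.2f}, {ci_high:.2f})")
print("mu_0 inside CI:", ci_low <= mu_0 <= ci_high)
print("reject H0 (two-sided, alpha = 0.05):", reject_two_sided)
```

The interval contains 75, and the two-sided test does not reject H₀. The two checks agree, as the equivalence says they should.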

Common Misconceptions

❌ "P-value is probability H₀ is true"

No! It's P(data at least this extreme | H₀), not P(H₀ | data)

❌ "Fail to reject H₀ means H₀ is true"

No! Just insufficient evidence against it

❌ "Significant means important"

No! Statistically significant ≠ practically important

❌ "P-value is probability of error"

No! The probability of a Type I error (rejecting H₀ when H₀ is actually true) is α, not the P-value

Writing Conclusions

✓ Good: "We have sufficient evidence to conclude the mean exceeds 75."

✓ Good: "There is insufficient evidence that the proportion differs from 0.5."

✗ Bad: "We prove the mean is 75."

✗ Bad: "We accept H₀."

✗ Bad: "The probability H₀ is true is 0.056."

Quick Reference

Hypotheses:

H₀: includes =
Hₐ: what we're testing for

Test statistic: (statistic - parameter) / SE

P-value: P(as extreme | H₀ true)

Decision:

P ≤ α: Reject H₀
P > α: Fail to reject H₀

Remember: Hypothesis testing is about evidence, not proof. Small P-value = strong evidence against H₀, but never proves Hₐ!

📚 Practice Problems

Problem 1 (easy)

❓ Question:

A manufacturer claims their batteries last an average of 500 hours. You suspect they last less than claimed. Set up appropriate hypotheses to test this claim.

💡 Solution:

Step 1: Identify the claim Manufacturer claims: μ = 500 hours

Step 2: Identify what we suspect We suspect: μ < 500 hours (Batteries last LESS than claimed)

Step 3: Set up null hypothesis H₀ H₀: μ = 500

The null hypothesis:

Assumes the claim is true
"Status quo" or "no effect"
Equality statement
What we're testing against

Step 4: Set up alternative hypothesis Hₐ Hₐ: μ < 500

The alternative hypothesis:

What we suspect/want to show
Research hypothesis
What we seek evidence for
Inequality statement

Step 5: Determine test type This is a ONE-TAILED (left-tailed) test

Why?

Hₐ: μ < 500 (less than)
Only interested in one direction
Looking for evidence batteries last LESS
Not testing if they last MORE

Step 6: Why this setup? Burden of proof on us:

Manufacturer claims 500 hours
We must provide evidence against claim
Start assuming claim is true (H₀)
Collect data to see if claim is unreasonable

Step 7: Connection to significance We will collect sample data and calculate x̄ and s.

If x̄ is MUCH less than 500:

Evidence against H₀
Might reject H₀

If x̄ is close to 500:

Insufficient evidence against H₀
Fail to reject H₀
Can't conclude batteries last less

Answer:

H₀: μ = 500 hours (null hypothesis: claim is true)
Hₐ: μ < 500 hours (alternative: batteries last less than claimed)

This is a one-tailed (left-tailed) test because we're only testing if the mean is less than 500, not different from 500.

Problem 2 (easy)

❓ Question:

Explain the difference between null and alternative hypotheses. Why do we set them up this way?

💡 Solution:

Step 1: Null Hypothesis (H₀) Definition: Statement of "no effect" or "no difference"

Assumes status quo
Contains equality (=, ≤, ≥)
What we test against
Presumed true until evidence says otherwise

Examples:

μ = 50 (parameter equals specific value)
μ₁ = μ₂ (two means are equal)
p = 0.5 (proportion equals 0.5)

Step 2: Alternative Hypothesis (Hₐ or H₁) Definition: Statement of what we want to show

Research hypothesis
What we suspect is true
Contains inequality (<, >, ≠)
Needs evidence to support

Examples:

μ < 50 (one-tailed)
μ > 50 (one-tailed)
μ ≠ 50 (two-tailed)
μ₁ > μ₂ (one group higher)

Step 3: Why this setup? (Legal analogy) Like a trial:

H₀ = "Defendant is innocent"

Presumed true (innocent until proven guilty)
Status quo

Hₐ = "Defendant is guilty"

What prosecutor wants to show
Needs strong evidence

We don't prove innocence! We either:

Find enough evidence for guilty (reject H₀)
Don't find enough evidence (fail to reject H₀)

Step 4: Types of alternative hypotheses

TWO-TAILED (≠): Hₐ: μ ≠ 50

Parameter is different (either direction)
Don't know which way
Testing for ANY difference

ONE-TAILED, RIGHT (>): Hₐ: μ > 50

Parameter is greater
Specific direction
Only interested in increase

ONE-TAILED, LEFT (<): Hₐ: μ < 50

Parameter is less
Specific direction
Only interested in decrease

Step 5: How they work together Must be:

Complementary (cover all possibilities)
Mutually exclusive (can't both be true)

Examples:

H₀: μ = 50 and Hₐ: μ ≠ 50 ✓
H₀: μ ≥ 50 and Hₐ: μ < 50 ✓
H₀: μ ≤ 50 and Hₐ: μ > 50 ✓

Step 6: Burden of proof Null hypothesis:

Assumed true
Skeptical position
"Nothing is happening"

Alternative hypothesis:

Must provide evidence
Burden of proof on us
Need convincing data

Step 7: Decision framework After collecting data:

If evidence is strong (p-value small): → Reject H₀ → Support Hₐ → "Significant" result

If evidence is weak (p-value large): → Fail to reject H₀ → Don't support Hₐ → "Not significant"

Step 8: Why can't we "accept" H₀? We NEVER "accept" or "prove" H₀

Why?

Absence of evidence ≠ evidence of absence
Maybe we just didn't have enough data
Maybe our sample wasn't sensitive enough
Just means: insufficient evidence against H₀

Say "fail to reject H₀" not "accept H₀"

Answer: NULL HYPOTHESIS (H₀): Statement of no effect or no difference, assumed true, contains equality. Represents status quo.

ALTERNATIVE HYPOTHESIS (Hₐ): What we want to show, needs evidence, contains inequality. Represents research question.

We set them up this way to put burden of proof on the researcher - must provide convincing evidence to overturn the assumed status quo. Like "innocent until proven guilty" in law.

Problem 3 (medium)

❓ Question:

A company claims 40% of customers prefer their product. You survey 200 customers and find 68 prefer it. Test at α = 0.05 level if the true proportion differs from 40%.

💡 Solution:

Step 1: Set up hypotheses H₀: p = 0.40 (claim is true) Hₐ: p ≠ 0.40 (proportion differs)

This is TWO-TAILED (≠)

Step 2: Check conditions n = 200, p₀ = 0.40

Random: Assume random survey ✓
Normal: np₀ = 200(0.40) = 80 ≥ 10 ✓ and n(1-p₀) = 200(0.60) = 120 ≥ 10 ✓
Independent: 200 ≤ 0.10N (assume) ✓

Step 3: Calculate sample proportion p̂ = 68/200 = 0.34

Step 4: Calculate test statistic $z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}} = \frac{0.34 - 0.40}{\sqrt{\frac{0.40(0.60)}{200}}} = \frac{-0.06}{\sqrt{0.0012}} = \frac{-0.06}{0.0346} \approx -1.73$

Step 5: Find p-value (two-tailed) From z-table: P(Z < -1.73) ≈ 0.0418

Two-tailed p-value: p-value = 2 × 0.0418 = 0.0836

Step 6: Compare to α p-value = 0.0836 α = 0.05

Is 0.0836 < 0.05? NO

Step 7: Make decision Since p-value > α: FAIL TO REJECT H₀

Step 8: State conclusion At the α = 0.05 significance level, there is insufficient evidence to conclude that the true proportion differs from 40%.

The observed 34% could reasonably occur by chance if the true proportion is 40%.

Step 9: Interpret p-value p-value = 0.0836 means:

If true proportion really is 40%, there's an 8.36% chance of getting a sample proportion as extreme as 34% (or more extreme) just by random chance.

Since this is > 5%, not unusual enough to reject claim.
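The phrase "just by random chance" can be checked by simulation. The sketch below repeatedly draws samples of 200 customers from a population where H₀ (p = 0.40) is true, then counts how often the sample lands at least as far from 40% as the observed 34%. The seed and the number of trials are arbitrary choices for this illustration. The simulated value will be close to, but not exactly, the 0.0836 found above, because the z calculation is a normal approximation and the simulation itself is random.

```python
import random

random.seed(1)  # fixed seed so the run is repeatable

n, p_0 = 200, 0.40
observed_count = 68
expected_count = round(n * p_0)                       # 80 customers expected under H0
observed_gap = abs(observed_count - expected_count)   # 12 customers away from expected
trials = 10_000

as_extreme = 0
for _ in range(trials):
    # Draw one sample of 200 customers in a world where H0 (p = 0.40) is true.
    successes = sum(random.random() < p_0 for _ in range(n))
    # Two-sided: count samples at least 12 away from 80 in either direction.
    if abs(successes - expected_count) >= observed_gap:
        as_extreme += 1

simulated_p_value = as_extreme / trials
print(simulated_p_value)
```

A result in the neighborhood of 8% to 10% says the same thing as the P-value: samples this far from 40% turn up fairly often even when the claim is true.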

Answer:

Test statistic: z = -1.73
P-value: 0.084
Decision: Fail to reject H₀ at α = 0.05
Conclusion: Insufficient evidence that proportion differs from 40%
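The whole calculation for this problem can be sketched in a few lines with the standard library's `statistics.NormalDist`. Because the code keeps the unrounded z, its P-value (about 0.083) differs slightly from the table-based 0.0836, which used z rounded to -1.73. The decision is the same either way.

```python
import math
from statistics import NormalDist

n, successes, p_0, alpha = 200, 68, 0.40, 0.05

p_hat = successes / n
z = (p_hat - p_0) / math.sqrt(p_0 * (1 - p_0) / n)
p_value = 2 * NormalDist().cdf(-abs(z))   # two-sided: area in both tails
decision = "reject H0" if p_value <= alpha else "fail to reject H0"

print(f"p_hat = {p_hat:.2f}, z = {z:.2f}, P-value = {p_value:.3f}, {decision}")
```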

Problem 4 (medium)

❓ Question:

What is a p-value? Interpret a p-value of 0.032 in the context of testing H₀: μ = 100 vs Hₐ: μ > 100.

💡 Solution:

Step 1: Define p-value P-value: Probability of getting results as extreme as (or more extreme than) what we observed, ASSUMING H₀ IS TRUE.

In symbols: p-value = P(getting our data or more extreme | H₀ is true)

Step 2: What "extreme" means Depends on Hₐ:

For Hₐ: μ > 100 (right-tailed): "Extreme" = values ≥ observed

For Hₐ: μ < 100 (left-tailed): "Extreme" = values ≤ observed

For Hₐ: μ ≠ 100 (two-tailed): "Extreme" = values in both tails

Step 3: Interpret p-value = 0.032 Context: H₀: μ = 100, Hₐ: μ > 100

Interpretation: "If the true mean really is 100, there is a 3.2% chance of getting a sample mean as large as (or larger than) what we observed, just by random sampling variability."

Step 4: What this tells us p = 0.032 = 3.2% is fairly small

Means:

Our result is somewhat unusual under H₀
Would rarely happen if H₀ true
Evidence against H₀
Sample mean is higher than expected

Step 5: Making a decision Compare to significance level α

If α = 0.05: p = 0.032 < 0.05 → REJECT H₀ → Statistically significant → Evidence that μ > 100

If α = 0.01: p = 0.032 > 0.01 → FAIL TO REJECT H₀ → Not significant at 0.01 level → Insufficient evidence

Step 6: Common misconceptions P-value is NOT:

✗ Probability that H₀ is true
✗ Probability that Hₐ is true
✗ Probability results are due to chance
✗ Probability of making an error

P-value IS:

✓ Probability of data at least as extreme as ours, given H₀
✓ How surprising data is under H₀
✓ Measure of evidence against H₀

Step 7: The logic Small p-value (like 0.032): → Data unlikely if H₀ true → Either: a) H₀ is true and we got unlucky, OR b) H₀ is false → More reasonable to conclude H₀ is false → Reject H₀

Large p-value (like 0.50): → Data common if H₀ true → Consistent with H₀ → No reason to doubt H₀ → Fail to reject H₀

Step 8: Strength of evidence P-value scale (rough guideline):

p > 0.10: Little/no evidence against H₀
p = 0.05 to 0.10: Weak evidence against H₀
p = 0.01 to 0.05: Moderate evidence against H₀
p = 0.001 to 0.01: Strong evidence against H₀
p < 0.001: Very strong evidence against H₀

Our p = 0.032: Moderate evidence against H₀

Step 9: Full interpretation for our problem p-value = 0.032

"Assuming the true mean is 100, there is only a 3.2% probability of obtaining a sample mean as large as (or larger than) what we observed. Since this probability is small (less than our significance level of 0.05), we have sufficient evidence to reject the null hypothesis and conclude that the true mean is greater than 100."

Answer: A p-value is the probability of getting results as extreme as what we observed, assuming H₀ is true.

P-value = 0.032 means: If μ really equals 100, there's only a 3.2% chance of getting a sample mean as large as (or larger than) ours. This is fairly unlikely, providing moderate evidence against H₀. At α = 0.05, we would reject H₀ and conclude μ > 100.

Problem 5 (hard)

❓ Question:

A researcher finds p = 0.048 when testing H₀: μ₁ = μ₂ vs Hₐ: μ₁ ≠ μ₂. At α = 0.05, what decision is made? What if α = 0.01? Explain the relationship between p-value, α, and the decision.

💡 Solution:

Step 1: The decision rule General rule:

If p-value ≤ α → REJECT H₀
If p-value > α → FAIL TO REJECT H₀

The significance level α is our cutoff!

Step 2: Decision at α = 0.05 p-value = 0.048 α = 0.05

Is 0.048 < 0.05? YES

Decision: REJECT H₀

Conclusion: At the 0.05 significance level, there IS sufficient evidence that the means differ (μ₁ ≠ μ₂).

Step 3: Decision at α = 0.01 p-value = 0.048 α = 0.01

Is 0.048 < 0.01? NO

Decision: FAIL TO REJECT H₀

Conclusion: At the 0.01 significance level, there is NOT sufficient evidence that the means differ.

Step 4: Why different decisions? α = significance level = "how much evidence we require"

α = 0.05 (5%):

Willing to accept more risk
Less stringent standard
Easier to reject H₀

α = 0.01 (1%):

Want stronger evidence
More stringent standard
Harder to reject H₀

Our p = 0.048 (4.8%):

Strong enough for 5% standard ✓
Not strong enough for 1% standard ✗

Step 5: Understanding α α represents:

Maximum acceptable Type I error rate
How rare results must be to reject H₀
Probability of Type I error (rejecting true H₀)

Common values:

α = 0.05 (most common)
α = 0.01 (more conservative)
α = 0.10 (less conservative)

Step 6: The relationship Think of α as a threshold:

p-value = strength of evidence against H₀ α = required strength to reject H₀

If p-value ≤ α: Evidence is strong enough → reject H₀

If p-value > α: Evidence not strong enough → fail to reject H₀

Step 7: Borderline case p = 0.048 is borderline!

Just barely significant at 0.05
Not significant at 0.01

Shows importance of:

Choosing α BEFORE seeing data
Not treating 0.05 as magic cutoff
Reporting actual p-value

Better to report: "p = 0.048" than just "significant" Lets reader judge strength of evidence

Step 8: Comparing significance levels Same data, different standards:

At α = 0.10: 0.048 < 0.10 → Reject H₀ ✓
At α = 0.05: 0.048 < 0.05 → Reject H₀ ✓
At α = 0.01: 0.048 > 0.01 → Fail to reject ✗

This doesn't mean results are contradictory! Just means: evidence moderate, not overwhelming

Step 9: Practical interpretation p = 0.048 means:

About 4.8% chance of data at least this extreme if H₀ true
Moderate evidence against H₀
Results fairly unlikely under H₀
Suggests a real difference, but not conclusively

Should we be confident?

Depends on context
Depends on consequences of error
Consider practical significance too

Step 10: Fixed vs reported p-value CORRECT approach:

Choose α before collecting data
Collect data
Calculate p-value
Compare to α
Make decision

INCORRECT approach:

Collect data
Calculate p-value
Choose α to get desired result

This is p-hacking!

Answer: AT α = 0.05: REJECT H₀ (p = 0.048 < 0.05) Sufficient evidence that means differ.

AT α = 0.01: FAIL TO REJECT H₀ (p = 0.048 > 0.01) Insufficient evidence at this stricter standard.

RELATIONSHIP: α is the threshold for the decision. If p-value ≤ α, the evidence is strong enough to reject H₀. The same data can lead to different decisions depending on how stringent our evidence requirement (α) is. Always choose α before seeing data!
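The relationship can be sketched as a one-line decision function applied to the same P-value at several thresholds. The function name `decide` is a hypothetical choice for this illustration. In a real study only one α would be used, fixed before collecting data; the loop here is only for comparison.

```python
def decide(p_value, alpha):
    """Apply the decision rule: reject H0 when the P-value is at most alpha."""
    return "reject H0" if p_value <= alpha else "fail to reject H0"

p_value = 0.048
for alpha in (0.10, 0.05, 0.01):   # each alpha would be fixed before seeing the data
    print(f"alpha = {alpha:.2f}: {decide(p_value, alpha)}")
```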
