Scatter Plots and Correlation

Visualizing and measuring linear relationships

Scatterplots

Scatterplot: Graph showing relationship between two quantitative variables

  • x-axis: Explanatory variable (independent)
  • y-axis: Response variable (dependent)
  • Each point represents one individual

Purpose: Visualize relationship, identify patterns, detect outliers

Describing Scatterplots: DCFS

Direction: Positive, negative, or no association

Positive: As x increases, y tends to increase
Negative: As x increases, y tends to decrease
No association: No clear pattern

Cluster: Whether points fall into distinct groups or are spread evenly

Form: Linear or nonlinear

Linear: Points follow straight-line pattern
Nonlinear: Curved pattern (quadratic, exponential, etc.)

Strength: How closely points follow pattern

Strong: Points close to pattern
Moderate: Some scatter but clear pattern
Weak: Lots of scatter, vague pattern

Outliers: Points far from overall pattern

Correlation Coefficient (r)

Measures: Strength and direction of linear relationship

Formula:

$r = \frac{1}{n-1} \sum \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$
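The formula can be sketched directly in Python: standardize each x and y, multiply the pairs, and divide the total by n - 1. The function name `pearson_r` is an illustrative choice, and the data are the five points from the worked example later in these notes.

```python
from statistics import mean, stdev


def pearson_r(xs, ys):
    """Correlation as the sum of z-score products divided by n - 1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (divide by n - 1)
    products = [
        ((x - x_bar) / s_x) * ((y - y_bar) / s_y)
        for x, y in zip(xs, ys)
    ]
    return sum(products) / (len(xs) - 1)


r = pearson_r([1, 2, 3, 4, 5], [2, 4, 5, 7, 8])
print(round(r, 3))
```

Note that `stdev` raises an error if all values in a list are equal, which matches the fact that r is undefined when either variable has no spread.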

Properties:

  • Range: -1 ≤ r ≤ 1
  • r = 1: Perfect positive linear relationship
  • r = -1: Perfect negative linear relationship
  • r = 0: No linear relationship
  • r > 0: Positive association
  • r < 0: Negative association

Interpreting |r|

|r| = 1: Perfect linear relationship
0.8 ≤ |r| < 1: Strong linear relationship
0.5 ≤ |r| < 0.8: Moderate linear relationship
0 < |r| < 0.5: Weak linear relationship
|r| = 0: No linear relationship

Note: These are rough guidelines; context matters!
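As a rough sketch, the guidelines above can be turned into a small labeling function. The name `describe_r` is hypothetical, and placing the boundary values 0.8 and 0.5 in the stronger category is a choice made here, not a fixed rule.

```python
def describe_r(r):
    """Label r using the rough cutoffs from these notes (guidelines, not rules)."""
    if not -1 <= r <= 1:
        raise ValueError("r must lie between -1 and 1")
    if r == 0:
        return "no linear association"
    direction = "positive" if r > 0 else "negative"
    size = abs(r)
    if size == 1:
        strength = "perfect"
    elif size >= 0.8:
        strength = "strong"
    elif size >= 0.5:
        strength = "moderate"
    else:
        strength = "weak"
    return f"{strength} {direction}"


print(describe_r(0.93))   # strong positive
print(describe_r(-0.6))   # moderate negative
```

The label only summarizes r; it says nothing about form or outliers, so it does not replace looking at the scatterplot.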

Important Properties of r

1. Unitless: No units (standardized)

2. Not affected by units: Converting x or y to different units (e.g., inches to centimeters) doesn't change r

3. Not affected by which variable is x or y: Switching variables doesn't change r

4. Affected by outliers: Single outlier can dramatically change r

5. Measures linear relationship only: Can be 0 even if strong nonlinear relationship exists!

Example: Calculating r

Data: (1, 2), (2, 4), (3, 5), (4, 7), (5, 8)

$\bar{x} = 3$, $s_x \approx 1.58$
$\bar{y} = 5.2$, $s_y \approx 2.39$

$r = \frac{1}{4}\left[\left(\frac{-2}{1.58}\right)\left(\frac{-3.2}{2.39}\right) + \cdots + \left(\frac{2}{1.58}\right)\left(\frac{2.8}{2.39}\right)\right]$

The deviation products sum to 15, so $r = \frac{15}{4(1.58)(2.39)} \approx 0.993$ (strong positive)

In practice: Use calculator!

Calculator Method

TI-83/84:

  1. Enter x-values in L1, y-values in L2
  2. STAT → CALC → 8:LinReg(a+bx)
  3. r appears in the output if diagnostics are on (to turn them on: 2nd 0 [CATALOG] → DiagnosticOn → ENTER)
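For readers without a TI calculator, the quantities that LinReg(a+bx) reports can be sketched in Python. This is an illustration using the standard least-squares formulas, not the calculator's internal code; `xs` and `ys` stand in for lists L1 and L2.

```python
from math import sqrt
from statistics import mean

xs = [1, 2, 3, 4, 5]   # plays the role of L1
ys = [2, 4, 5, 7, 8]   # plays the role of L2

x_bar, y_bar = mean(xs), mean(ys)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)

b = s_xy / s_xx                 # slope
a = y_bar - b * x_bar           # intercept, so that y-hat = a + b*x
r = s_xy / sqrt(s_xx * s_yy)    # the n - 1 factors cancel, so sums suffice

print(f"a={a:.3f}  b={b:.3f}  r={r:.3f}  r^2={r * r:.3f}")
```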

Correlation vs Causation

CRITICAL: Correlation does NOT imply causation!

r = 0.9 means:

  • Strong linear relationship exists
  • x and y tend to vary together

r = 0.9 does NOT by itself show that:

  • x causes y
  • y causes x

Possible explanations for correlation:

  1. x causes y
  2. y causes x
  3. Confounding variable causes both
  4. Coincidence

Example: Spurious Correlation

Ice cream sales and drowning deaths: r ≈ 0.9

NOT because:

  • Ice cream causes drowning
  • Drowning causes ice cream sales

ACTUALLY:

  • Confounding variable: Summer/temperature
  • Both increase in summer

Outliers and Influential Points

Outlier: Point far from overall pattern

Effect on r:

  • Can increase or decrease r
  • Can change sign of r
  • Single outlier can dominate

Always: Identify outliers, consider their impact

Influential point: If removed, would substantially change r or regression line

When Correlation Is Inappropriate

Don't use r if:

  1. Relationship is nonlinear (r only measures linear!)
  2. Severe outliers present (distort r)
  3. Either variable is categorical (r requires two quantitative variables)

Always plot data first! Don't rely on r alone.

Describing Associations

Template: "There is a [strength] [direction] [form] association between [x] and [y]."

Example: "There is a strong positive linear association between study hours and test scores."

Add: "With no outliers" or "With one outlier at..."

Association vs Relationship

Association: Variables tend to vary together (correlation measures the linear case)

Relationship: Generic term (could be causal or not)

Causation: x directly causes changes in y

Always distinguish!

Quick Reference

DCFS: Direction, Cluster, Form, Strength (+ outliers)

Correlation r:

  • Range: -1 to 1
  • Measures linear relationship only
  • Unitless
  • Affected by outliers

Key: Correlation ≠ Causation

Remember: Always make a scatterplot first! r alone can be misleading. Two variables can have r ≈ 0 and still be strongly related in a nonlinear way!
