Types of Data and Sampling

Categorical vs quantitative data, sampling methods

Introduction

Statistics is the science of collecting, organizing, analyzing, and interpreting data. Understanding the different types of data and proper sampling methods is fundamental to conducting valid statistical analyses.

Types of Data

Categorical vs. Quantitative

Categorical (Qualitative) Data:

  • Describes characteristics or qualities
  • Places individuals into categories
  • Cannot be measured numerically in a meaningful way

Examples:

  • Eye color (blue, brown, green)
  • Political party (Democrat, Republican, Independent)
  • Type of car (sedan, SUV, truck)
  • Opinion rating (agree, neutral, disagree)

Quantitative (Numerical) Data:

  • Consists of numerical measurements or counts
  • Can be added, averaged, or otherwise manipulated mathematically

Examples:

  • Height (68 inches, 72 inches)
  • Test score (85, 92, 78)
  • Number of siblings (0, 1, 2, 3)
  • Temperature (72°F, 85°F)

Discrete vs. Continuous

Within quantitative data, we distinguish:

Discrete Data:

  • Countable values
  • Usually whole numbers
  • Often from counting

Examples:

  • Number of students in a class (25, 30, 18)
  • Number of cars owned (0, 1, 2, 3)
  • Number of errors on a test (2, 5, 0)

Continuous Data:

  • Can take any value in an interval
  • Usually from measuring
  • Infinite possible values between any two points

Examples:

  • Height (5.7 feet, 5.75 feet, 5.752 feet...)
  • Weight (142.3 lbs, 142.35 lbs...)
  • Time (3.2 seconds, 3.25 seconds...)

Levels of Measurement

Understanding the level of measurement helps determine appropriate statistical analyses.

Nominal

Characteristics:

  • Categories with no inherent order
  • Most basic level
  • Can only count frequencies

Examples:

  • Blood type (A, B, AB, O)
  • Gender (male, female, non-binary)
  • Favorite color (red, blue, green)

Valid operations: Count, mode

Ordinal

Characteristics:

  • Categories with meaningful order
  • Differences between ranks not necessarily equal
  • Cannot measure exact distance between values

Examples:

  • Class rank (1st, 2nd, 3rd)
  • Letter grades (A, B, C, D, F)
  • Satisfaction rating (very satisfied, satisfied, neutral, dissatisfied)

Valid operations: Count, mode, median

Interval

Characteristics:

  • Numerical scale with equal intervals
  • No true zero point
  • Zero doesn't mean "absence of" the quantity being measured

Examples:

  • Temperature in Celsius or Fahrenheit (0°F doesn't mean "no temperature")
  • IQ scores
  • Calendar years (year 0 is arbitrary)

Valid operations: Count, mode, median, mean, addition/subtraction

Ratio

Characteristics:

  • Numerical scale with equal intervals
  • Has true zero point
  • Zero means complete absence
  • Can form ratios (twice as much, half as big)

Examples:

  • Height (0 inches = no height)
  • Weight (0 lbs = no weight)
  • Age (0 years = newborn)
  • Income (0 dollars = no income)

Valid operations: All mathematical operations
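The valid operations at each level can be illustrated with a short Python sketch. The data values and variable names below are made up for illustration, and only the standard `statistics` module is assumed.

```python
import statistics

# Nominal: categories with no order, so only counts and the mode make sense.
blood_types = ["A", "O", "B", "O", "AB", "O", "A"]
nominal_mode = statistics.mode(blood_types)

# Ordinal: ordered categories. Encode the order explicitly, then take the
# median position; averaging the codes would assume equal spacing.
grade_order = ["F", "D", "C", "B", "A"]  # lowest to highest
grades = ["B", "A", "C", "B", "D"]
ranks = sorted(grade_order.index(g) for g in grades)
ordinal_median = grade_order[statistics.median_low(ranks)]

# Interval: differences and means are meaningful, ratios are not.
temps_f = [60, 70, 80]
interval_mean = statistics.mean(temps_f)
interval_diff = temps_f[2] - temps_f[0]

# Ratio: a true zero makes "twice as much" meaningful.
weights_lb = [100, 150, 200]
ratio = weights_lb[2] / weights_lb[0]

print(nominal_mode, ordinal_median, interval_mean, interval_diff, ratio)
```

Note that 80°F is not "one third hotter" than 60°F, which is why the sketch computes a difference for temperature but a ratio only for weight.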

Populations vs. Samples

Population

Definition: The entire group of individuals or items we want to study

Characteristics:

  • Complete collection
  • Often too large or expensive to study completely
  • Size denoted by $N$

Examples:

  • All students in the United States
  • All adults registered to vote in California
  • Every car manufactured by Toyota in 2024

Parameters: Numerical characteristics of populations

  • Population mean: $\mu$ (mu)
  • Population standard deviation: $\sigma$ (sigma)
  • Population proportion: $p$

Sample

Definition: A subset of the population, selected for study

Characteristics:

  • Ideally a representative portion of the population
  • Practical and economical to study
  • Size denoted by $n$

Examples:

  • 500 randomly selected U.S. students
  • 1,000 California voters surveyed
  • 100 Toyota cars tested from 2024 production

Statistics: Numerical characteristics of samples

  • Sample mean: $\bar{x}$ (x-bar)
  • Sample standard deviation: $s$
  • Sample proportion: $\hat{p}$ (p-hat)

Key relationship: We use statistics from samples to make inferences about parameters of populations.
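To make the parameter/statistic distinction concrete, here is a minimal Python sketch. The population of test scores is hypothetical (randomly generated), and the cutoff of 70 used for the proportion is an arbitrary choice for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: test scores for N = 1000 students.
population = [random.randint(50, 100) for _ in range(1000)]
N = len(population)

# Parameters describe the whole population.
mu = statistics.mean(population)                 # population mean
sigma = statistics.pstdev(population)            # population SD (divides by N)
p = sum(x >= 70 for x in population) / N         # population proportion

# Statistics describe a sample of size n drawn from that population.
sample = random.sample(population, 50)
n = len(sample)
x_bar = statistics.mean(sample)                  # sample mean
s = statistics.stdev(sample)                     # sample SD (divides by n - 1)
p_hat = sum(x >= 70 for x in sample) / n         # sample proportion

print(f"N={N}: mu={mu:.2f}, sigma={sigma:.2f}, p={p:.3f}")
print(f"n={n}: x_bar={x_bar:.2f}, s={s:.2f}, p_hat={p_hat:.3f}")
```

In practice the population values are usually unknown, so only the second group of numbers can be computed and they serve as estimates of the first.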

Sampling Methods

Random Sampling

Simple Random Sample (SRS):

  • Every individual has equal chance of selection
  • Every group of size $n$ has equal chance
  • "Gold standard" of sampling

How to obtain:

  • Assign numbers to all population members
  • Use random number generator
  • Select corresponding individuals

Example: Put all 500 student names in a hat, mix thoroughly, draw 50 names
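The same procedure can be sketched in Python, with the random number generator taking the place of the hat. The roster and the helper name `simple_random_sample` are hypothetical; `random.Random.sample` draws without replacement, so each subset of the requested size is equally likely.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct members so that every subset of size n is equally likely."""
    rng = random.Random(seed)
    return rng.sample(population, n)

# Hypothetical roster of 500 students, numbered 1 to 500.
students = [f"student_{i}" for i in range(1, 501)]
chosen = simple_random_sample(students, 50, seed=42)

print(len(chosen), chosen[:3])
```

The `seed` argument is there only so a draw can be reproduced; leave it as `None` for a fresh sample each time.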

Advantages:

  • Unbiased
  • Simple to understand
  • Known probability of selection

Disadvantages:

  • Requires complete list of population
  • May not represent subgroups well
  • Can be impractical for large populations

Stratified Random Sample

Method:

  • Divide population into homogeneous groups (strata)
  • Take SRS from each stratum
  • Combine samples

Example: Divide school by grade level (9th, 10th, 11th, 12th), randomly sample 25 students from each grade
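A minimal sketch of this example in Python, assuming a hypothetical school with 200 students per grade. The function name `stratified_sample` and the data layout (a dict mapping each stratum to its members) are illustrative choices, not a standard API.

```python
import random

def stratified_sample(strata, per_stratum, seed=None):
    """Take a simple random sample of per_stratum members from each stratum."""
    rng = random.Random(seed)
    return {name: rng.sample(members, per_stratum)
            for name, members in strata.items()}

# Hypothetical school: 200 students in each of four grades.
school = {
    grade: [f"grade{grade}_student{i}" for i in range(200)]
    for grade in (9, 10, 11, 12)
}

by_grade = stratified_sample(school, 25, seed=1)
combined = [student for group in by_grade.values() for student in group]

print({grade: len(group) for grade, group in by_grade.items()}, len(combined))
```

Keeping the result grouped by stratum makes it easy to compare grades before combining everything into one sample.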

When to use:

  • Want to ensure representation of subgroups
  • Strata are internally similar but different from each other
  • Interested in comparing groups

Advantages:

  • Guarantees representation from each stratum
  • More precise estimates
  • Can compare strata

Disadvantages:

  • Requires knowledge of population characteristics
  • More complex than SRS

Cluster Sample

Method:

  • Divide population into groups (clusters)
  • Randomly select some clusters
  • Study ALL individuals in selected clusters

Example: Divide city into neighborhoods (clusters), randomly select 5 neighborhoods, survey all households in those 5
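The neighborhood example might look like this in Python. The city data and the helper name `cluster_sample` are hypothetical; the key point is that randomness is applied to the clusters, not to the individuals inside them.

```python
import random

def cluster_sample(clusters, num_clusters, seed=None):
    """Randomly pick whole clusters, then keep every member of each one."""
    rng = random.Random(seed)
    picked = rng.sample(sorted(clusters), num_clusters)
    members = [m for name in picked for m in clusters[name]]
    return picked, members

# Hypothetical city: 20 neighborhoods with 30 households each.
city = {
    f"neighborhood_{k}": [f"n{k}_household_{h}" for h in range(30)]
    for k in range(20)
}

picked, households = cluster_sample(city, 5, seed=7)

print(picked, len(households))
```

Compare this with the stratified approach: stratified sampling takes some individuals from every group, while cluster sampling takes every individual from some groups.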

When to use:

  • No complete population list available
  • Geographically dispersed population
  • Cost-effective approach needed

Advantages:

  • Practical and economical
  • No need for complete population list
  • Reduces travel/contact costs

Disadvantages:

  • Less precise than SRS
  • Works well only if each cluster is heterogeneous (like a mini-population); clusters of similar individuals reduce precision

Systematic Sample

Method:

  • Select every $k$th individual from list
  • Random starting point
  • $k = \frac{N}{n}$ (population size / sample size)

Example: From 1,000 students, select every 10th student (random start between 1 and 10) to get a sample of 100
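A short Python sketch of the method, using hypothetical student IDs. It assumes the interval is computed with integer division, which matches the formula exactly only when the population size is a multiple of the sample size; the helper name `systematic_sample` is illustrative.

```python
import random

def systematic_sample(items, n, seed=None):
    """Select every k-th item after a random start, with k = N // n."""
    k = len(items) // n            # sampling interval
    rng = random.Random(seed)
    start = rng.randrange(k)       # random 0-based start position in [0, k)
    return items[start::k][:n]     # trim in case N is not a multiple of n

students = list(range(1, 1001))    # hypothetical IDs 1..1000
sample = systematic_sample(students, 100, seed=3)

print(sample[:5], len(sample))
```

Only the starting point is random; once it is chosen, the rest of the sample is fully determined by the order of the list.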

When to use:

  • Have organized list
  • Want easy implementation
  • List order has no cyclical (periodic) pattern

Advantages:

  • Simple to implement
  • Spreads sample across population
  • Often as good as SRS

Disadvantages:

  • Problems if list has hidden patterns
  • Not every possible sample is equally likely, so it is not an SRS

Sampling Bias

Types of Bias

Selection Bias:

  • Some individuals more likely to be selected
  • Sample not representative of population

Example: Surveying only people in a shopping mall (excludes those who don't shop there)

Voluntary Response Bias:

  • Individuals choose to participate
  • Often those with strong opinions respond

Example: Online poll where anyone can vote (those who care most will participate)

Undercoverage:

  • Some groups systematically excluded
  • Sampling frame incomplete

Example: Phone survey excludes those without phones

Nonresponse Bias:

  • Selected individuals don't respond
  • Respondents differ from non-respondents

Example: Survey with a 20% response rate (80% nonresponse), where the 20% who respond may hold different views from the 80% who do not
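A small simulation can show why bias of this kind is not cured by a larger sample. The sketch below illustrates undercoverage with an entirely made-up population: 7,000 phone owners who rate a service around 7 and 3,000 people without phones who rate it around 4. All numbers are assumptions chosen for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population of ratings on a 1-10 scale.
with_phone = [random.choice([6, 7, 8]) for _ in range(7000)]
without_phone = [random.choice([3, 4, 5]) for _ in range(3000)]
population = with_phone + without_phone
true_mean = statistics.mean(population)

# Undercoverage: the sampling frame contains phone owners only.
biased_small = statistics.mean(random.sample(with_phone, 100))
biased_large = statistics.mean(random.sample(with_phone, 5000))

# A much smaller simple random sample from the full population.
srs = statistics.mean(random.sample(population, 500))

print(f"true mean          : {true_mean:.2f}")
print(f"phone-only, n=100  : {biased_small:.2f}")
print(f"phone-only, n=5000 : {biased_large:.2f}")
print(f"SRS, n=500         : {srs:.2f}")
```

Increasing the phone-only sample from 100 to 5,000 makes the estimate more stable, but it stabilizes around the wrong value because the excluded group never has a chance to be selected.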

Best Practices

For Valid Sampling:

  • Use random selection when possible
  • Define population clearly
  • Ensure sampling frame matches population
  • Minimize nonresponse
  • Watch for sources of bias
  • Use stratification when subgroups matter
  • Make sample size adequate for precision needed

Common Mistakes to Avoid:

❌ Convenience sampling (just because it's easy)
❌ Voluntary response (self-selection bias)
❌ Assuming bigger is always better (quality > quantity)
❌ Ignoring nonresponse
❌ Using outdated sampling frame

Quick Reference

Data Type Decision Tree:

  1. Is it numerical? → Quantitative (otherwise Categorical)
  2. If quantitative, does it come from counting? → Discrete (otherwise Continuous)
  3. If quantitative, does it have a true zero? → Ratio (otherwise Interval)
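The three questions can be written as a small Python function. This is only a sketch: the function name `classify_variable` and its boolean parameters are hypothetical, and the answers to each question still have to come from someone who understands the variable.

```python
def classify_variable(is_numerical, is_counted=False, has_true_zero=False):
    """Walk the three questions of the decision tree and return the labels."""
    if not is_numerical:
        return ("categorical",)
    kind = "discrete" if is_counted else "continuous"
    level = "ratio" if has_true_zero else "interval"
    return ("quantitative", kind, level)

# Examples taken from earlier sections of this guide.
eye_color = classify_variable(is_numerical=False)
siblings = classify_variable(True, is_counted=True, has_true_zero=True)
temperature_f = classify_variable(True, is_counted=False, has_true_zero=False)
height = classify_variable(True, is_counted=False, has_true_zero=True)

print(eye_color, siblings, temperature_f, height, sep="\n")
```

The sketch stops at "categorical" for non-numerical data; deciding between nominal and ordinal would need one more question about whether the categories have a meaningful order.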

Sampling Method Selection:

  • Want simplicity and have complete list → SRS
  • Need to ensure subgroup representation → Stratified
  • Population spread out geographically → Cluster
  • Have organized list, want efficiency → Systematic

Remember: Good sampling is the foundation of valid statistical inference. A biased sample, no matter how large, leads to invalid conclusions!
