Types of Data and Sampling

Categorical vs quantitative data, sampling methods

Introduction

Statistics is the science of collecting, organizing, analyzing, and interpreting data. Understanding the different types of data and proper sampling methods is fundamental to conducting valid statistical analyses.

Types of Data

Categorical vs. Quantitative

Categorical (Qualitative) Data:

Describes characteristics or qualities
Places individuals into categories
Cannot be measured numerically in a meaningful way

Examples:

Eye color (blue, brown, green)
Political party (Democrat, Republican, Independent)
Type of car (sedan, SUV, truck)
Opinion rating (agree, neutral, disagree)

Quantitative (Numerical) Data:

Consists of numerical measurements or counts
Can be added, averaged, or otherwise manipulated mathematically

Examples:

Height (68 inches, 72 inches)
Test score (85, 92, 78)
Number of siblings (0, 1, 2, 3)
Temperature (72°F, 85°F)

Discrete vs. Continuous

Within quantitative data, we distinguish:

Discrete Data:

Countable values
Usually whole numbers
Often from counting

Examples:

Number of students in a class (25, 30, 18)
Number of cars owned (0, 1, 2, 3)
Number of errors on a test (2, 5, 0)

Continuous Data:

Can take any value in an interval
Usually from measuring
Infinite possible values between any two points

Examples:

Height (5.7 feet, 5.75 feet, 5.752 feet...)
Weight (142.3 lbs, 142.35 lbs...)
Time (3.2 seconds, 3.25 seconds...)

Levels of Measurement

Understanding the level of measurement helps determine appropriate statistical analyses.

Nominal

Characteristics:

Categories with no inherent order
Most basic level
Can only count frequencies

Examples:

Blood type (A, B, AB, O)
Gender (male, female, non-binary)
Favorite color (red, blue, green)

Valid operations: Count, mode

Ordinal

Characteristics:

Categories with meaningful order
Differences between ranks not necessarily equal
Cannot measure exact distance between values

Examples:

Class rank (1st, 2nd, 3rd)
Letter grades (A, B, C, D, F)
Satisfaction rating (very satisfied, satisfied, neutral, dissatisfied)

Valid operations: Count, mode, median

Interval

Characteristics:

Numerical scale with equal intervals
No true zero point
Zero doesn't mean "absence of"

Examples:

Temperature in Celsius or Fahrenheit (0°F doesn't mean "no temperature")
IQ scores
Calendar years (year 0 is arbitrary)

Valid operations: Count, mode, median, mean, addition/subtraction

Ratio

Characteristics:

Numerical scale with equal intervals
Has true zero point
Zero means complete absence
Can form ratios (twice as much, half as big)

Examples:

Height (0 inches = no height)
Weight (0 lbs = no weight)
Age (0 years = newborn)
Income (0 dollars = no money)

Valid operations: All mathematical operations
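To illustrate why ratios are only meaningful on a ratio scale, the sketch below compares two temperatures in Celsius (interval) and in Kelvin (ratio). Kelvin is not discussed above; it is brought in here only as a temperature scale with a true zero, and the helper name celsius_to_kelvin is illustrative.

```python
def celsius_to_kelvin(c):
    """Convert Celsius (interval scale) to Kelvin (ratio scale with a true zero)."""
    return c + 273.15

t1_c, t2_c = 10.0, 20.0

# Differences are meaningful on an interval scale: 20°C is 10 degrees warmer.
diff_c = t2_c - t1_c

# The naive Celsius ratio is 2.0, but 20°C is not "twice as hot" as 10°C,
# because 0°C is an arbitrary zero.
naive_ratio = t2_c / t1_c

# On the Kelvin scale the ratio is meaningful, and it is close to 1.
kelvin_ratio = celsius_to_kelvin(t2_c) / celsius_to_kelvin(t1_c)

print(diff_c, naive_ratio, round(kelvin_ratio, 3))
```

The Celsius ratio changes if the same temperatures are expressed in Fahrenheit, which is another sign that it carries no physical meaning.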

Populations vs. Samples

Population

Definition: The entire group of individuals or items we want to study

Characteristics:

Complete collection
Often too large or expensive to study completely
Denoted by $N$ for size

Examples:

All students in the United States
All adults registered to vote in California
Every car manufactured by Toyota in 2024

Parameters: Numerical characteristics of populations

Population mean: $\mu$ (mu)
Population standard deviation: $\sigma$ (sigma)
Population proportion: $p$

Sample

Definition: A subset of the population, selected for study

Characteristics:

Ideally a representative portion of the population
Practical and economical to study
Denoted by $n$ for size

Examples:

500 randomly selected U.S. students
1,000 California voters surveyed
100 Toyota cars tested from 2024 production

Statistics: Numerical characteristics of samples

Sample mean: $\bar{x}$ (x-bar)
Sample standard deviation: $s$
Sample proportion: $\hat{p}$ (p-hat)

Key relationship: We use statistics from samples to make inferences about parameters of populations.
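As a rough illustration of that relationship, the sketch below builds a made-up population of heights, computes its parameters, and then estimates them from a sample. The population itself (10,000 values centered near 68 inches) is an assumption for demonstration only.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: heights in inches for N = 10,000 individuals.
population = [random.gauss(68, 3) for _ in range(10_000)]
N = len(population)

# Parameters describe the whole population.
mu = statistics.fmean(population)
sigma = statistics.pstdev(population)   # population SD divides by N

# Statistics describe a sample and estimate the parameters.
n = 500
sample = random.sample(population, n)
x_bar = statistics.fmean(sample)
s = statistics.stdev(sample)            # sample SD divides by n - 1

print(f"mu    = {mu:.2f}   x_bar = {x_bar:.2f}")
print(f"sigma = {sigma:.2f}   s     = {s:.2f}")
```

In practice the population values are not available, which is exactly why $\bar{x}$ and $s$ are used in place of $\mu$ and $\sigma$.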

Sampling Methods

Random Sampling

Simple Random Sample (SRS):

Every individual has equal chance of selection
Every group of size $n$ has equal chance
"Gold standard" of sampling

How to obtain:

Assign numbers to all population members
Use random number generator
Select corresponding individuals

Example: Put all 500 student names in a hat, mix thoroughly, draw 50 names

Advantages:

Unbiased
Simple to understand
Known probability of selection

Disadvantages:

Requires complete list of population
May not represent subgroups well
Can be impractical for large populations
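The "names in a hat" example can be sketched with Python's standard random module. The function name simple_random_sample and the student labels are illustrative, not part of any library.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n members without replacement.

    random.Random.sample makes every subset of size n equally likely,
    which is the defining property of an SRS.
    """
    rng = random.Random(seed)
    return rng.sample(population, n)

# Hypothetical sampling frame: a complete list of 500 students.
students = [f"student_{i}" for i in range(1, 501)]
chosen = simple_random_sample(students, 50, seed=1)

print(len(chosen), chosen[:3])
```

Note that the function needs the full list of students up front, which mirrors the first disadvantage listed above.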

Stratified Random Sample

Method:

Divide population into homogeneous groups (strata)
Take SRS from each stratum
Combine samples

Example: Divide school by grade level (9th, 10th, 11th, 12th), randomly sample 25 students from each grade

When to use:

Want to ensure representation of subgroups
Strata are internally similar but different from each other
Interested in comparing groups

Advantages:

Guarantees representation from each stratum
More precise estimates
Can compare strata

Disadvantages:

Requires knowledge of population characteristics
More complex than SRS
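A minimal sketch of the grade-level example, assuming a roster of (student_id, grade) pairs. The roster, the function name stratified_sample, and its parameters are hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(records, stratum_of, n_per_stratum, seed=None):
    """Take an SRS of n_per_stratum records from every stratum, then combine."""
    rng = random.Random(seed)

    # Step 1: divide the population into strata.
    strata = defaultdict(list)
    for record in records:
        strata[stratum_of(record)].append(record)

    # Steps 2 and 3: SRS within each stratum, then combine the samples.
    combined = []
    for key in sorted(strata):
        combined.extend(rng.sample(strata[key], n_per_stratum))
    return combined

# Hypothetical roster: 1,200 students spread evenly over grades 9 to 12.
roster = [(i, 9 + i % 4) for i in range(1, 1201)]
picked = stratified_sample(roster, stratum_of=lambda r: r[1],
                           n_per_stratum=25, seed=7)

print(len(picked))  # 25 students from each of the 4 grades
```

This version uses equal allocation (the same number from each stratum); proportional allocation would instead scale each stratum's sample to its share of the population.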

Cluster Sample

Method:

Divide population into groups (clusters)
Randomly select some clusters
Study ALL individuals in selected clusters

Example: Divide city into neighborhoods (clusters), randomly select 5 neighborhoods, survey all households in those 5

When to use:

No complete population list available
Geographically dispersed population
Cost-effective approach needed

Advantages:

Practical and economical
No need for complete population list
Reduces travel/contact costs

Disadvantages:

Usually less precise than an SRS of the same size
Works well only if each cluster is heterogeneous (like a mini-population); internally similar clusters reduce precision
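The neighborhood example might look like the following sketch. The city dictionary (20 neighborhoods of 30 households each) and the function name cluster_sample are assumptions made for illustration.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly pick whole clusters, then keep every member of each one."""
    rng = random.Random(seed)
    chosen_names = rng.sample(sorted(clusters), n_clusters)
    members = [m for name in chosen_names for m in clusters[name]]
    return chosen_names, members

# Hypothetical city: 20 neighborhoods, each with 30 households.
city = {
    f"neighborhood_{k:02d}": [f"household_{k:02d}_{h:02d}" for h in range(30)]
    for k in range(20)
}

names, households = cluster_sample(city, 5, seed=3)
print(names)
print(len(households))  # all households in the 5 selected neighborhoods
```

Only the list of neighborhoods is needed to draw the sample; a household list is required just for the neighborhoods that end up selected.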

Systematic Sample

Method:

Select every $k$th individual from the list
Choose a random starting point among the first $k$ individuals
$k = \frac{N}{n}$ (population size / sample size)

Example: From a list of 1,000 students, pick a random start between 1 and 10, then select every 10th student from there to get a sample of 100

When to use:

Have organized list
Want easy implementation
Population not cyclical

Advantages:

Simple to implement
Spreads sample across population
Often about as good as SRS when the list order is unrelated to the variable being studied

Disadvantages:

Problems if list has hidden patterns
Not truly random
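A minimal sketch of the procedure, assuming an ordered list and using integer division for $k$ when $N$ is not an exact multiple of $n$. The function name systematic_sample is illustrative, and the random start is 0-based here rather than 1-based.

```python
import random

def systematic_sample(ordered_list, n, seed=None):
    """Select every k-th item after a random start, with k = N // n."""
    N = len(ordered_list)
    if not 0 < n <= N:
        raise ValueError("sample size must be between 1 and the list length")
    k = N // n                      # sampling interval
    rng = random.Random(seed)
    start = rng.randrange(k)        # random position among the first k items
    return [ordered_list[start + i * k] for i in range(n)]

# Hypothetical ordered list of 1,000 student ID numbers.
student_ids = list(range(1, 1001))
selected = systematic_sample(student_ids, 100, seed=5)

print(len(selected), selected[:3])
```

Because only the starting point is random, two students who are not exactly a multiple of $k$ positions apart can never appear in the same sample, which is why the method is not a true SRS.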

Sampling Bias

Types of Bias

Selection Bias:

Some individuals more likely to be selected
Sample not representative of population

Example: Surveying only people in shopping mall (excludes those who don't shop there)

Voluntary Response Bias:

Individuals choose to participate
Often those with strong opinions respond

Example: Online poll where anyone can vote (those who care most will participate)

Undercoverage:

Some groups systematically excluded
Sampling frame incomplete

Example: Phone survey excludes those without phones

Nonresponse Bias:

Selected individuals don't respond
Respondents differ from non-respondents

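Example: Survey with a 20% response rate (80% nonresponse); if the 20% who respond hold different views from the 80% who don't, the results misrepresent the population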
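To make nonresponse bias concrete, here is a small worked example with entirely hypothetical counts: 1,000 people are selected, dissatisfied people reply far more often than satisfied ones, and the overall response rate comes out at 20%.

```python
# All counts below are hypothetical, chosen to give a 20% response rate.
selected = {"dissatisfied": 300, "satisfied": 700}      # who was sampled
respondents = {"dissatisfied": 150, "satisfied": 50}    # who actually replied

total_selected = sum(selected.values())
total_respondents = sum(respondents.values())

response_rate = total_respondents / total_selected
true_share = selected["dissatisfied"] / total_selected
observed_share = respondents["dissatisfied"] / total_respondents

print(f"response rate:               {response_rate:.0%}")
print(f"dissatisfied among selected: {true_share:.0%}")
print(f"dissatisfied among replies:  {observed_share:.0%}")
```

Under these assumptions the replies suggest that 75% are dissatisfied when the true figure in the selected group is 30%, even though the original selection was sound.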

Best Practices

For Valid Sampling:

✓ Use random selection when possible
✓ Define population clearly
✓ Ensure sampling frame matches population
✓ Minimize nonresponse
✓ Watch for sources of bias
✓ Use stratification when subgroups matter
✓ Make sample size adequate for precision needed

Common Mistakes to Avoid:

❌ Convenience sampling (just because it's easy)
❌ Voluntary response (self-selection bias)
❌ Assuming bigger is always better (quality > quantity)
❌ Ignoring nonresponse
❌ Using outdated sampling frame

Quick Reference

Data Type Decision Tree:

Is it numerical? → Quantitative (otherwise Categorical)
Can it be counted? → Discrete (otherwise Continuous)
Does it have true zero? → Ratio (otherwise Interval)
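The three questions above can be written as a small function. This is only a sketch of the decision tree as stated; the function name and its boolean parameters are illustrative, and it does not distinguish nominal from ordinal categories.

```python
def classify_variable(is_numerical, is_counted=False, has_true_zero=False):
    """Walk the three-question decision tree and return a description."""
    if not is_numerical:
        return "categorical"
    kind = "discrete" if is_counted else "continuous"
    level = "ratio" if has_true_zero else "interval"
    return f"quantitative, {kind}, {level}"

print(classify_variable(False))                                       # eye color
print(classify_variable(True, is_counted=True, has_true_zero=True))   # number of siblings
print(classify_variable(True, is_counted=False, has_true_zero=False)) # temperature in °F
```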

Sampling Method Selection:

Want simplicity and have complete list → SRS
Need to ensure subgroup representation → Stratified
Population spread out geographically → Cluster
Have organized list, want efficiency → Systematic

Remember: Good sampling is the foundation of valid statistical inference. A biased sample, no matter how large, leads to invalid conclusions!
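The sketch below illustrates that point with a made-up population of 10,000 values: a biased sample of 5,000 is compared with a simple random sample of only 100. The population and the way the bias arises (only the top half is reachable) are assumptions for demonstration.

```python
import random
import statistics

# Hypothetical population: 10,000 incomes, here simply the values 1..10,000.
population = list(range(1, 10_001))
true_mean = statistics.fmean(population)           # 5000.5

# Large but biased sample: only the top half of the population is reachable.
biased_sample = sorted(population)[5_000:]         # n = 5,000
biased_mean = statistics.fmean(biased_sample)      # 7500.5

# Small simple random sample drawn from the whole population.
rng = random.Random(0)
srs = rng.sample(population, 100)                  # n = 100
srs_mean = statistics.fmean(srs)

print(f"true mean:   {true_mean}")
print(f"biased mean: {biased_mean} (error {abs(biased_mean - true_mean)})")
print(f"SRS mean:    {srs_mean} (error {abs(srs_mean - true_mean):.1f})")
```

Increasing the biased sample's size does not help here, because the error comes from who can be selected rather than from how many are selected.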

Practice Problems

Problem 1 (easy)

Question:

Classify each variable as categorical or quantitative:

a) Eye color of students
b) Number of siblings
c) Brand of smartphone
d) Height in centimeters

Solution:

Step 1: Understand the distinction

Categorical: Places individuals into groups/categories
Quantitative: Takes numerical values on which arithmetic operations are meaningful

Step 2: Analyze each variable

a) Eye color: Categories (blue, brown, green, etc.) → CATEGORICAL
b) Number of siblings: Numerical count, can calculate average → QUANTITATIVE
c) Brand of smartphone: Categories (Apple, Samsung, etc.) → CATEGORICAL
d) Height in centimeters: Numerical measurement, can calculate mean → QUANTITATIVE

Answer:

a) Categorical
b) Quantitative
c) Categorical
d) Quantitative

Problem 2 (easy)

Question:

A survey asks students: "Rate your satisfaction with the cafeteria food on a scale of 1-5." Is this categorical or quantitative? Explain.

Solution:

Step 1: Analyze the data type

Scale: 1-5 (numbers are used)

Step 2: Consider the nature of the scale

Numbers represent categories of satisfaction (very unsatisfied → very satisfied)
The numbers are ordinal (ordered categories)
Differences between numbers aren't necessarily equal
Can't meaningfully say "2 is twice as satisfied as 1"

Step 3: Classify

This is CATEGORICAL (specifically ordinal categorical data)

Even though numbers are used, they represent categories
The numbers are labels for satisfaction levels
Ratings like this are often called "Likert scale" data

Note: Some statisticians treat ordinal data as quantitative in certain contexts, but strictly speaking, it's categorical with an order.

Answer: Categorical (ordinal)

Problem 3 (medium)

Question:

Identify whether each sampling method is a Simple Random Sample (SRS), Stratified, Cluster, or Systematic:

a) Select every 10th person entering a store
b) Divide school by grade level, then randomly select students from each grade
c) Randomly select 5 classrooms and survey all students in those classrooms

Solution:

Step 1: Review sampling methods

SRS: Every individual, and every group of size $n$, has an equal chance of selection
Stratified: Divide into groups (strata), take a random sample from each
Cluster: Divide into groups, randomly select some groups, use ALL individuals from the selected groups
Systematic: Select every $k$th individual

Step 2: Classify each method

a) Every 10th person
Pattern: Select at regular intervals
This is SYSTEMATIC sampling

b) Divide by grade, sample from each
Pattern: Create homogeneous groups (grades), sample from ALL groups
This is STRATIFIED sampling

c) Select 5 classrooms, survey all students
Pattern: Groups (clusters) selected, then ALL within those groups surveyed
This is CLUSTER sampling

Answer:

a) Systematic
b) Stratified
c) Cluster

Problem 4 (medium)

Question:

A researcher wants to estimate the average income in a city. She divides the city into neighborhoods based on property values (low, medium, high), then randomly samples 50 households from each neighborhood. What type of sampling is this, and why might it be better than a simple random sample?

Solution:

Step 1: Identify the sampling method

Process:

Divide population into groups (neighborhoods by property value)
Sample from EACH group
Here the allocation is equal: 50 households from each stratum

This is STRATIFIED sampling

Step 2: Explain advantages over SRS

Why stratified is better here:

Ensures representation: Guarantees all income levels represented
Reduces variability: Within each stratum, incomes are more similar
Increases precision: Can get more accurate estimates with same sample size
Allows subgroup analysis: Can compare neighborhoods

With SRS:

Might, by chance, under-represent low-income or high-income areas
Higher chance of sampling error
Less efficient estimation

Step 3: Statistical benefit

Stratified sampling reduces the standard error of the estimate when:

Strata are homogeneous within
Strata are heterogeneous between
Income varies greatly by neighborhood (which it does!)

Answer: Stratified sampling. It's better because it ensures all income levels are represented, reduces sampling variability, and provides more precise estimates than SRS when the population has distinct subgroups.

Problem 5 (hard)

Question:

A college has 10,000 students: 6,000 freshmen, 2,500 sophomores, 1,000 juniors, and 500 seniors. Design a stratified random sample of 200 students that maintains the same proportions. How many students should be selected from each class?

Solution:

Step 1: Find the proportion of each class

Total students: 10,000

Freshmen: 6,000/10,000 = 0.60 = 60%
Sophomores: 2,500/10,000 = 0.25 = 25%
Juniors: 1,000/10,000 = 0.10 = 10%
Seniors: 500/10,000 = 0.05 = 5%

Step 2: Apply proportions to sample size

Sample size: 200 students

Freshmen: 200 × 0.60 = 120 students
Sophomores: 200 × 0.25 = 50 students
Juniors: 200 × 0.10 = 20 students
Seniors: 200 × 0.05 = 10 students

Step 3: Verify total

120 + 50 + 20 + 10 = 200 ✓

Step 4: Verify proportions maintained

Freshmen: 120/200 = 60% ✓
Sophomores: 50/200 = 25% ✓
Juniors: 20/200 = 10% ✓
Seniors: 10/200 = 5% ✓

Answer:

Freshmen: 120
Sophomores: 50
Juniors: 20
Seniors: 10
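The same proportional allocation can be checked with a few lines of Python. The function name proportional_allocation is illustrative; the rounding step is an assumption that happens to be harmless here because every product is a whole number.

```python
def proportional_allocation(stratum_sizes, n):
    """Allocate a sample of size n across strata in proportion to their sizes.

    Note: with awkward proportions, rounding can make the parts sum to
    slightly more or less than n, so the total should always be checked.
    """
    N = sum(stratum_sizes.values())
    return {name: round(n * size / N) for name, size in stratum_sizes.items()}

college = {"freshmen": 6000, "sophomores": 2500, "juniors": 1000, "seniors": 500}
allocation = proportional_allocation(college, 200)

print(allocation)
print(sum(allocation.values()))  # should equal the requested sample size
```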
