
June 10, 2008 Stat 111 - Lecture 9 - Proportions 1

Introduction to Inference

Sampling Distributions

for Counts and Proportions

Statistics 111 - Lecture 9


Administrative Notes

• Homework 3 is due on Monday, June 15th
  – Covers chapters 1-5 in textbook

• Exam on Monday, June 15th

• Review session on Thursday


Last Class

• Focused on models for continuous data: using the sample mean as our estimate of the population mean

• Sampling Distribution of the Sample Mean
  • How does the sample mean change over different samples?

[Diagram: a population with parameter $\mu$; Samples 1 through 6 (and so on), each of size n, each give a sample mean $\bar{x}$. What is the distribution of these values?]


Today’s Class

• We will now focus on count data: categorical data that takes on only two different values

  “Success” ($Y_i = 1$) or “Failure” ($Y_i = 0$)

• Goal is to estimate the population proportion:

  p = proportion of $Y_i = 1$ in the population


Examples

• Gender: our class has 83 women and 42 men
  • What is the proportion of women in the Penn student population?

• Presidential Election: out of 2000 people sampled, 1150 will vote for McCain in the upcoming election
  • What proportion of the total population will vote for McCain?

• Quality Control: inspection of a sample of 100 microchips from a large shipment shows 10 failures
  • What is the proportion of failures in all shipments?


Inference for Count Data

• Goal for count data is to estimate the population proportion p

• From a sample of size n, we can calculate two statistics:
  1. sample count Y
  2. sample proportion $\hat{p} = Y/n$

• Use the sample proportion $\hat{p}$ as our estimate of the population proportion p

• Sampling Distribution of the Sample Proportion
  • How does the sample proportion change over different samples?

[Diagram: a population with parameter p; Samples 1 through 6 (and so on), each of size n, each give a sample proportion $\hat{p}$. What is the distribution of these values?]


The Binomial Setting for Count Data

1. Fixed number n of observations (or trials)

2. Each observation is independent

3. Each observation falls into 1 of 2 categories: Success (Y = 1) or Failure (Y = 0)

4. Each observation has the same probability of success: p = P(Y = 1)


Binomial Distribution for Sample Count

• Sample count Y (number of $Y_i = 1$ in a sample of size n) has a Binomial distribution

• The binomial distribution has two parameters:
  • number of trials n and population proportion p

$P(Y = k) = \binom{n}{k} p^k (1-p)^{n-k}$

• Binomial formula accounts for
  • number of successes: $p^k$
  • number of failures: $(1-p)^{n-k}$
  • different orders of successes/failures: $\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$
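
The binomial formula can be evaluated directly. The following is a minimal sketch using only the Python standard library; the function name `binomial_pmf` is chosen here for illustration and is not from the lecture.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(Y = k) for a Binomial count with n trials and success probability p."""
    # comb(n, k) counts the orderings; p**k and (1-p)**(n-k) account for
    # the k successes and the n-k failures.
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Sanity check: the probabilities over all possible counts add up to 1.
total = sum(binomial_pmf(k, 4, 0.25) for k in range(5))
print(total)
```

The same helper is reused (redefined each time, so every snippet runs on its own) in the later sketches.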


Binomial Probability Histogram

• Can make a histogram out of these probabilities

• Can add up bars of the histogram to get any probability we want, e.g. P(Y < 4)

• Different values of n and p have different histograms, but Table C in book has probabilities for many values of n and p
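
As a rough illustration of "adding up bars", the sketch below prints a text histogram and sums the bars for counts below 4. The values n = 4 and p = 0.25 are assumptions picked for the example, not taken from this slide.

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(Y = k) for a Binomial(n, p) count."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 4, 0.25  # illustrative values (assumption)
bars = [binomial_pmf(k, n, p) for k in range(n + 1)]

# Text histogram: one row per count, bar length proportional to probability.
for k, prob in enumerate(bars):
    print(f"{k}: {'#' * round(prob * 40)} {prob:.4f}")

# P(Y < 4) is the sum of the bars for counts 0, 1, 2 and 3.
p_less_than_4 = sum(bars[:4])
print(p_less_than_4)
```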


Binomial Table


Example: Genetics

• If a couple are both carriers of a certain disease, then their children each have probability 0.25 of being born with the disease

• Suppose that the couple has 4 children
  • P(none of their children have the disease)?

$P(Y = 0) = \frac{4!}{0!\,4!} (0.25)^0 (1 - 0.25)^4 = 0.3164$

  • P(at least two children have the disease)?

$P(Y \ge 2) = P(Y = 2) + P(Y = 3) + P(Y = 4)$

= 0.2109 + 0.0469 + 0.0039 (from table)

= 0.2617
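
The genetics numbers can be checked without the table. This is a small sketch with the standard library; the variable names are illustrative.

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(Y = k) for a Binomial(n, p) count."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 4, 0.25  # 4 children, each with probability 0.25 of having the disease

p_none = binomial_pmf(0, n, p)
p_at_least_two = sum(binomial_pmf(k, n, p) for k in range(2, n + 1))

print(round(p_none, 4))          # 0.3164
print(round(p_at_least_two, 4))  # 0.2617
```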


Example: Quality Control

• A worker inspects a sample of n=20 microchips from a large shipment

• The probability of a microchip being faulty is 10% (p = 0.10)

• What is the probability that there are fewer than three failures in the sample?

P(Y < 3) = P(Y = 0) + P(Y =1) + P(Y = 2)

= 0.1216 + 0.2702 + 0.2852 (from table)

= 0.677
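
The same calculation in code, as a sketch: sum the binomial probabilities for 0, 1 and 2 failures with n = 20 and p = 0.10.

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(Y = k) for a Binomial(n, p) count."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 20, 0.10  # 20 inspected chips, 10% chance each is faulty

# P(Y < 3) = P(Y = 0) + P(Y = 1) + P(Y = 2)
p_fewer_than_three = sum(binomial_pmf(k, n, p) for k in range(3))
print(round(p_fewer_than_three, 3))  # 0.677
```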


Sample Proportions

• Usually, we are more interested in the sample proportion $\hat{p} = Y/n$ than in the sample count

$P(\hat{p} < k) = P(Y < n \cdot k)$

• Example: a worker inspects a sample of 20 microchips from a large shipment in which the probability of a microchip being faulty is 0.1

• What is the probability that our sample proportion of faulty chips is less than 0.05?

• $P(\hat{p} < 0.05) = P(Y < 0.05 \times 20) = P(Y < 1) = P(Y = 0) = 0.1216$
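
A probability about the sample proportion can be computed by translating it back to counts. The sketch below does this by keeping every count k whose proportion k/n falls below the cutoff; the helper name `prob_proportion_below` is hypothetical.

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(Y = k) for a Binomial(n, p) count."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def prob_proportion_below(cutoff, n, p):
    """P(sample proportion < cutoff), summing over counts k with k/n < cutoff."""
    return sum(binomial_pmf(k, n, p) for k in range(n + 1) if k / n < cutoff)


# Sample of 20 chips, probability 0.1 that a chip is faulty.
result = prob_proportion_below(0.05, 20, 0.1)
print(round(result, 4))  # 0.1216
```

Only k = 0 satisfies k/20 < 0.05, so the result is just P(Y = 0).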


Mean and Variance of Binomial Counts

• If our sample count Y is a random variable with a Binomial distribution, what are the mean and variance of Y across all samples?
  • Useful since we only observe the value of Y for our sample, but what are the values in other samples?

• We can calculate the mean and variance of a Binomial distribution with parameters n and p:

$\mu_Y = np$

$\sigma_Y^2 = np(1-p)$

$\sigma_Y = \sqrt{np(1-p)}$
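
One way to convince yourself of these formulas is to compute the mean and variance straight from the binomial probabilities and compare. A minimal sketch, assuming the illustrative values n = 20 and p = 0.10:

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(Y = k) for a Binomial(n, p) count."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 20, 0.10  # illustrative values (assumption)

# Mean and variance by direct enumeration over all possible counts.
mean_y = sum(k * binomial_pmf(k, n, p) for k in range(n + 1))
var_y = sum((k - mean_y) ** 2 * binomial_pmf(k, n, p) for k in range(n + 1))

print(mean_y, n * p)            # both close to 2.0
print(var_y, n * p * (1 - p))   # both close to 1.8
```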


Mean/Variance of Binomial Proportions

• Sample proportion is a linear transformation of the sample count ($\hat{p} = Y/n$)

$\mu_{\hat{p}} = \frac{1}{n} \mu_Y = \frac{1}{n} \cdot np = p$

• Mean of the sample proportion is the true probability of success p

$\sigma_{\hat{p}}^2 = \frac{1}{n^2} \sigma_Y^2 = \frac{1}{n^2} \cdot np(1-p) = \frac{p(1-p)}{n}$

• Variance of the sample proportion decreases as sample size n increases!
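
To see how quickly the spread shrinks, the sketch below evaluates the standard deviation of the sample proportion, sqrt(p(1-p)/n), for a few sample sizes. The value p = 0.5 and the sample sizes are assumptions for illustration.

```python
from math import sqrt


def proportion_sd(n, p):
    """Standard deviation of the sample proportion Y/n for a Binomial(n, p) count."""
    return sqrt(p * (1 - p) / n)


# Quadrupling the sample size halves the standard deviation.
for n in (25, 100, 400):
    print(n, proportion_sd(n, 0.5))
```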


Variance over Long-Run

• Lower variance with larger sample size means that the sample proportion will tend to be closer to the population proportion in larger samples

[Figure: long-run behaviour of two different coin tossing runs]

• Much less likely to get unexpected events in larger samples
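
The long-run behaviour of coin tossing can be imitated with a short simulation. This is only a sketch: the seeds, the number of tosses, and the function name `running_proportions` are arbitrary choices, and the exact numbers printed depend on the random number generator.

```python
import random


def running_proportions(n_tosses, p=0.5, seed=None):
    """Proportion of heads after each toss in one simulated run of coin tosses."""
    rng = random.Random(seed)
    heads = 0
    proportions = []
    for i in range(1, n_tosses + 1):
        heads += rng.random() < p  # True counts as 1 head
        proportions.append(heads / i)
    return proportions


# Two independent runs; both should settle near 0.5 as the tosses accumulate.
run_a = running_proportions(10_000, seed=1)
run_b = running_proportions(10_000, seed=2)
for n in (10, 100, 1_000, 10_000):
    print(n, run_a[n - 1], run_b[n - 1])
```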


Binomial Probabilities in Large Samples

• In large samples, it is often tedious to calculate probabilities using the binomial distribution

• Example: Gallup poll for presidential election
  • Bush has 49% of vote in population. What is the probability that Bush gets a count over 550 in a sample of 1000 people?

P(Y > 550) = P(Y = 551) + P(Y = 552) + … + P(Y = 1000)

= 450 terms to look up in the table!

• We can instead use the fact that for large samples, the Binomial distribution is closely approximated by the Normal distribution
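
What is tedious by hand is quick on a computer. The sketch below adds up the 450 exact binomial terms; it works on the log scale with `math.lgamma` to avoid overflow in the large factorials, which is an implementation choice made here and not part of the lecture.

```python
from math import exp, lgamma, log


def log_binomial_pmf(k, n, p):
    """Natural log of P(Y = k) for a Binomial(n, p) count."""
    # lgamma(m + 1) equals log(m!), so this is log of n! / (k! (n-k)!).
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_choose + k * log(p) + (n - k) * log(1 - p)


n, p = 1000, 0.49

# P(Y > 550) = P(Y = 551) + ... + P(Y = 1000)
terms = [exp(log_binomial_pmf(k, n, p)) for k in range(551, n + 1)]
p_over_550 = sum(terms)

print(len(terms))   # 450
print(p_over_550)   # a very small probability
```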


Normal Approximation to Binomial

• If count Y follows a binomial distribution with parameters n and p, then Y approximately follows a Normal distribution with mean and variance:

$\mu_Y = np$

$\sigma_Y^2 = np(1-p)$

• This approximation is only good if n is “large enough”.
  • Rule of thumb for “large enough”: $np \ge 10$ and $n(1-p) \ge 10$

• Also works for the sample proportion: $\hat{p} = Y/n$ approximately follows a Normal distribution with mean and variance

$\mu_{\hat{p}} = p$

$\sigma_{\hat{p}}^2 = \frac{p(1-p)}{n}$
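
The rule of thumb is easy to encode as a check. A minimal sketch; the function name `normal_approx_ok` and the `threshold` parameter are hypothetical.

```python
def normal_approx_ok(n, p, threshold=10):
    """Rule of thumb: both the expected successes and expected failures are at least 10."""
    return n * p >= threshold and n * (1 - p) >= threshold


print(normal_approx_ok(1000, 0.49))  # True: 490 and 510 expected
print(normal_approx_ok(20, 0.10))    # False: only 2 expected failures of the chip
```

For borderline cases, keep in mind that floating-point rounding can affect a comparison that lands exactly on the threshold.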


Example: Quality Control

• Sample of 100 microchips, where the usual 10% of microchips are faulty. What is the probability there are at least 17 bad chips in our sample?

• Using Binomial calculation/table is tedious. Instead use Normal approximation:

• Mean = $np = 100 \times 0.10 = 10$

• Var = $np(1-p) = 100 \times 0.10 \times 0.90 = 9$

$P(Y \ge 17) = P\left(\frac{Y - 10}{3} \ge \frac{17 - 10}{3}\right)$

= P(Z ≥ 2.33)

= 1 - P(Z ≤ 2.33)

= 0.01 (from table)
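
The table lookup can be replaced by the standard Normal distribution in Python's `statistics` module. A sketch of the same calculation, without a continuity correction (matching the slide):

```python
from math import sqrt
from statistics import NormalDist

n, p = 100, 0.10
mean = n * p                # 10
sd = sqrt(n * p * (1 - p))  # square root of 9, i.e. 3

# Standardize 17 and take the upper tail of the standard Normal.
z = (17 - mean) / sd
tail = 1 - NormalDist().cdf(z)

print(round(z, 2))     # 2.33
print(round(tail, 2))  # 0.01
```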


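Example: Gallup Poll

• Bush has 49% of vote in population

• What is the probability that Bush gets a sample proportion over 0.51 in a sample of size 1000?

• Use normal distribution with mean = p = 0.49 and variance $p(1-p)/n = 0.49 \times 0.51 / 1000 \approx 0.00025$

$P(\hat{p} > 0.51) = P\left(Z \ge \frac{0.51 - 0.49}{\sqrt{0.49 \times 0.51 / 1000}}\right)$

= P(Z ≥ 1.27) = 1 - P(Z ≤ 1.27)

= 0.102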
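
The poll calculation can be reproduced the same way. In this sketch the z-score is not rounded before the lookup, so the tail probability comes out slightly different from the table value based on z = 1.27.

```python
from math import sqrt
from statistics import NormalDist

n, p = 1000, 0.49
variance = p * (1 - p) / n  # about 0.00025
sd = sqrt(variance)

# Standardize the sample proportion 0.51 and take the upper tail.
z = (0.51 - p) / sd
tail = 1 - NormalDist().cdf(z)

print(variance, z, tail)
```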


Why does Normal Approximation work?

• Central Limit Theorem: in large samples, the distribution of the sample mean is approx. Normal

• Well, our count data takes on two different values: “Success” ($Y_i = 1$) or “Failure” ($Y_i = 0$)

• The sample proportion is the same as the sample mean for count data!

• So, Central Limit Theorem works for sample proportions as well!
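
The identity between the sample proportion and the sample mean is easy to check on 0/1 data. The sample below is made up purely for illustration.

```python
from statistics import mean

# Hypothetical sample of 10 observations: 1 = success, 0 = failure.
sample = [1, 0, 0, 1, 1, 0, 1, 0, 0, 0]

count = sum(sample)               # sample count Y
proportion = count / len(sample)  # sample proportion Y/n
sample_mean = mean(sample)        # ordinary sample mean of the 0/1 values

print(count, proportion, sample_mean)
```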


Next Class - Lecture 10

• Review session on Wednesday/Thursday
  – Show up with questions!