sampling distribution models

60
Sampling Distribution Models

Upload: kairos

Post on 22-Feb-2016

26 views

Category:

Documents


0 download

DESCRIPTION

Sampling Distribution Models. Ghosts. In November 2005, the Harris Poll asked 889 US adults “Do you believe in ghosts?” 40% said they did At almost the same time, CBS polled 808 US adults and asked the same question. 48% said yes. Why the difference? - PowerPoint PPT Presentation

TRANSCRIPT

Sampling Distribution Models

Sampling Distribution ModelsGhostsIn November 2005, the Harris Poll asked 889 US adults Do you believe in ghosts?

40% said they did

At almost the same time, CBS polled 808 US adults and asked the same question.

48% said yesWhy the difference?

This question lies at the heart of statistics.

GhostsTo understand the CBS poll, it is time to not only Think, Show, and Tell but to also Imagine.

Let us imagine the results from all the random samples of size 808 that the CBS News didnt take. What would the histogram of all the sample proportions look like?Start with the center:

For peoples belief in ghosts, where would you expect the center to be?

Of course, we dont know the answer to that (and probably never will). But we know that it will be at the true proportion in the population and we call that p.GhostsFor the sake of discussion here, lets suppose that 45% of all American adults believe in ghosts, so we will use p = 0.45

How about the shape of the histogram?

We dont have to just imagine that one.We can simulate a bunch of random samples that we didnt really draw.

GhostsHere is a histogram of the proportions saying they believe in ghosts for 2000 simulated independent samples of 808 adults when the true proportion is p = 0.45

Sampling DistributionsThe histogram above is a simulation of what we would get if we could see all the proportions from all possible samples

That distribution has a special name. It is called the sampling distribution of proportions.NOTE!

Up until now we have been dealing with sample distributions

Now we are working with sampling distributions.

Similar name, VERY different.Sampling Distribution ModelA sampling distribution model for how a sample proportion varies from sample to sample allows us to quantify that variation and to talk about how likely it is that wed observe a sample proportion in any particular interval.The histogram is unimodal and symmetric. In fact, it is an amazing and fortunate fact that a Normal Model is just the right one for the histogram of sample proportions.Sampling Distribution ModelWhat about standard deviation?

Usually the mean gives us no information about the standard deviation. Suppose we told you that a batch of bike helmets had a mean diameter of 26cm and asked what the SD was. If you said, I have no idea you would be exactly right.Sampling Distribution ModelWe have a special fact about proportions!

With proportions we get something for free.

Once we know the mean, p, we automatically also know the SDSampling

Sampling Distribution Models

Sampling Distribution Models Because we have a normal model we can use the 68-95-99.7 Rule to look up other probabilities using a table or technology

For example, we should not be surprised if 95% of various polls gave results that were near 45% but varied above and below that by no more than two standard deviations.How Good is the Normal Model?A histogram of the sample proportions can be modeled well by a Normal model.

The catch, though, is that the sample must be large enough.

Assumptions and ConditionsThe Independence Assumption:

The sampled values must be independent of each other

The Sample Size Assumption:

The sample size, n, must be large enoughAssumptions and ConditionsRandomization Condition

If your data come from an experiment, subjects should have been randomly assigned to treatments.

If you have a survey, your sample should be a simple random sample of the population.

If some other sampling design was used, be sure the sampling method was not biased and that the data are representative of the population.10% Condition

The sample size, n, must be no larger than 10% of the population. For national polls, the total population is usually very large, so the sample is a small fraction of the population.

Assumptions and ConditionsSuccess/Failure Condition

The sample size has to be big enough so that we expect at least 10 success and at least 10 failures.

When np and nq are at least 10, we have enough data for sound conclusions. For the CBS survey, a success might be believing in ghosts. With p = 0.45 we expect 808 * 0.45 = 364 successes and 808 * 0.65 = 444 failures.

So, we certainly expect enough successes and enough failures for the conditions to be satisfied.What Is Going On?We can think of the sample proportion as a random variable taking on a different value in each random sample

Then we can say something specific about the distribution of those valuesThis is a fundamental insight one we will use in the next four chapters

Sampling Distribution ModelNo longer is a proportion something we just compute for a set of data.

We now see it as a random variable quantity that has a probability distribution, and we can model that distribution.We call that the sampling distribution model for the proportionSampling Distribution Model for a ProportionSampling Distribution ModelWithout the sampling distribution model the rest of statistics just wouldnt exist.

Sampling models are what makes statistics work

They inform us about the amount of variation we should expect when we sample.CoinsSuppose we flip a coin 100 times in order to decide whether its fair or not.

If we get 52 heads, we probably arent surprised.

Yes, we expect 50, but 52 doesnt seem particularly unusual for a fair coin.But we would be surprised to see 90 heads.

How about 64?

Harder to say.

Here we need a sampling distribution model.Sampling ModelSampling models quantify the variability, telling us how surprising any sample proportion is.

It enables us to make informed decisions about how precise our estimate of the true proportion might be.That is exactly what we will be doing for the rest of the course.Sampling ModelsSampling distribution models act as a bridge from the real world of data to the imaginary model of the statistic and enable us to say something about the population when all we have is data from the real worldThis is the huge leap of statistics

We move from the sample we did take to imagine all the samples we might have taken and what proportions we would have seen.

This is the margin of error you hear about in surveys and polls.ExampleThe CDC reports that 22% of 18 year old women in the US have a body mass index (BMI) of 25 or more a value considered by the National Heart Lung and Blood institute to be associated with increased health risk.As a part of a routine health check at a large college the PE department usually requires students to be measured and weighed.

This year the department decided to try out a self-report system.

It asked 200 randomly selected female students to report their heights and weights.

Only 31 of these students had BMIs greater than 25

BMIIs this proportion of high BMI-students unusually small?

First, check the conditions:

BMIBMIZ = -2.24

By the 68-95-99.7 Rule, I know that values more than 2 standard deviations below the mean of a Normal model show up less than 2.5% of the time. Perhaps women at this college differ from the general population, or self-reporting may not provide accurate heights and weights.

Just CheckingBut if you imagined all the possible samples of 100 students you could draw and imagined the histogram of all the sample proportions from these samples, what shape would it have?Just CheckingWhere would the center of that histogram be?

At the actual proportion of all students who are in favor.Another ExampleSuppose that about 13% of the population is left handed.

A 200 seat school auditorium has been built with 15 lefty seats seats that have a built in desk on the left rather than the right arm of the chair.

In a class of 90 students what is the probability that there will not be enough seats for left handed students?LeftiesI want to find the probability that in a group of 90 students more than 15 will be left handed.

Since 15 out of 90 is 16.7% I need the probability of finding more than 16.7% left-handed students out of a sample of 90 if the proportion of lefties is 13%Independence: One student being left handed is independent of another student being left handed or left handed.

Randomization: The 90 students in the class can be thought of as a random sample of students.LeftiesLeftiesQuantitative DataProportions summarize categorical variables

The Normal Sampling Distribution looks very useful

Can we do something similar with quantitative data?QuantitativeWe know that when we sample at random or randomize an experiment the results we get will vary from sample to sample and from experiment to experiment.The Normal model seems an incredibly simple way to summarize all that variation.

Could something like that work for means?

It turns out means also have a sampling distribution that we can model with a Normal model.DieSay we toss a fair die 10,000 times.

What would the histogram of the number on the face of the die look like?DiceNow lets toss a pair of dice and record the average of the two.

If we repeat this (or simulate repeating it) 10,000 times, recording the average of each pair, what will the histogram of these 10,000 averages look like?Is the average of getting a 1 on two dice as likely as getting an average of 3 or 3.5?What Happens?First, we already know from the Law of Large Numbers that as the sample size gets larger each sample average is more likely to be closer to the population mean.

That means for us here the shape continues to tighten around 3.5In addition, the shape of the distribution is becoming bell shaped.

It is approaching the Normal model.

The Fundamental Theorem of StatisticsThe sampling distribution of any mean becomes more nearly Normal as the sample size grows.

All we need is for the observations to be independent and collected with randomization. We dont care about the shape of the population distribution!Fundamental TheoremEven if we sample from a skewed or bimodal population, the CLT tells us that means of repeated random samples will tend to follow a Normal model as the sample size grows.

This works better and faster the closer the population distribution is to a Normal model.It also works better for larger samples

Central Limit TheoremThe mean of a random sample is a random variable whose sampling distribution can be approximated by a Normal model. The larger the sample, the better the approximation will be.CLTAssumptions and Conditions:

Independence Assumption: The sampled values must be independent of each other

Sample Size Assumption: The sample size must be sufficiently large

Randomization Condition: The data values must be sampled randomly or the concept of a sampling distribution makes no sense.

CLT10% Condition: When the sample is drawn without replacement (as is usually the case) the sample size n should be no more than 10% of the population.

Large Enough Sample Condition: Although the CLT tells us that a Normal model is useful in thinking about the behavior of sample means when the sample is large enough, it doesnt tell us how large a sample we need.

The truth is, it depends, there is no one-size-fits-all rule.CLTWhich Normal model?

We know the Normal model is specified by its mean and standard deviation.

For proportions, the sampling distribution is centered at the population proportion.

For means, it is centered at the population mean.

What of standard deviation?CLTSampling Distribution ModelsExample Using CLT for MeansA college PE department asked a random sample of 200 female students to self-report their heights and weights but the percentage of BMI over 25 seemed suspiciously low.

One possible explanation may be that the respondents shaded their weights down a bit.The CDC reports that the mean weight of 18 year old women is 143.74lbs with a standard deviation of 51.14lbs but these randomly selected women reported a mean weight of only 140 lbs.

Based on the CLT and the 68-95-99.7 Rule, does the mean weight in this sample seem exceptionally low or might this just be sample to sample variation?Check the ConditionRandomization Condition: The women were a random sample and their weights can be assumed to be independent.

10% Condition: They sampled fewer than 10% of all women at the collegeLarge Enough Sample Condition: The distribution of womens weights is likely to be unimodal and reasonably symmetric so the CLT applies to means of even small samples; 200 is plenty.CLT for MeansThe 68-95-99.7 Rule suggests that although the reported mean weight of 140 lbs is somewhat lower than expected it does not appear to be unusual. Such variability is not all that extraordinary for samples of this size.

Working with the Sampling Distribution Model for the MeanThe CDC reports that the mean weight of adult men in the US is 190lbs with a standard deviation of 59lbs

Question: An elevator in our building has a weight limit of 10 persons or 2500 lbs.What is the probability that if 10 men get on the elevator they will overload its weight limit?

ElevatorsAsking the probability that the total weight of a sample of 10 men exceeds 2500 pounds is equivalent to asking the probability that their mean weight is greater than 250 pounds.Now we just need to first check if our assumptions and conditions are met:ElevatorsIndependence: It is reasonable to think that the weights of 10 randomly sampled men will be independent of each other (can you think of exceptions?)

Large Enough: I suspect the distribution of population weights is roughly unimodal and symmetric so my sample of 10 men seems large enough.Randomization Condition: Ill assume that the 10 men getting on the elevator are a random sample from the population

10% Condition: 10 men is surely less than 10% of the population of elevator ridersElevatorsAbout VariationBut be careful!

As you can see on the left, the standard deviation is proportional to only the square root of the sample size.

VariationWe would need a sample size of 64 to halve it once more.

Real World and the Model World We now have two distributions to deal with.

The first is the real-world distribution of the sample which we might display with a histogram for quantitative data or a bar chart for categorical data.The second is the math world sampling distribution model of the statistic, a Normal model based on the CLT. We cannot confuse the two!Real World vs. Model World Dont mistakenly think the CLT says that the data are normally distributed as long as the sample is large enough.

In fact, as the sample gets larger it will begin to more closely resemble the real population that may be symmetric, skewed, unimodal, bimodal, etc.The CLT doesnt talk about the distribution of the data from the sample. It talks about the sample means and the sample proportions of many different random samples drawn from the same population.Just CheckingHuman gestation times have a mean of about 266 days with a standard deviation of 1 days. If we record the gestation times of a sample of 100 women do we know that a histogram of the times will be modeled by a Normal model?No, this is a histogram HomeworkPg 436, #37-45 odd