Chapter 3: Producing Data • Inferential Statistics • Sampling • Designing Experiments

Post on 19-Dec-2015


Page 1: Chapter 3: Producing Data Inferential Statistics Sampling Designing Experiments

Chapter 3: Producing Data

• Inferential Statistics

• Sampling

• Designing Experiments

Inferential Statistics

• We start with a question about a group or groups.

• The group(s) we are interested in is (are) called the population(s).

• Examples

– What is the average number of car accidents for a person over 65 in the United States?

– For the entire world, is the IQ of women the same as the IQ of men?

– How many times a day should I feed my goldfish?

– Which is more effective at lowering the heart rate of mice: no drug (control), drug A, drug B, or drug C?

Inferential Statistics

• Example 1: What is the average number of car accidents for a person over 65 in the United States?

– How many populations are of interest?

• One

– What is the population of interest?

• All people in the U.S. over age 65.

Inferential Statistics

• Example 2: For the entire world, is the IQ of women the same as the IQ of men?

– How many populations are of interest?

• Two

– What are the populations of interest?

• All women and all men

Inferential Statistics

• Example 3: How many times a day should I feed my goldfish?

– How many populations are of interest?

• One

– What is the population of interest?

• All pet goldfish

Inferential Statistics

• Example 4: Which is more effective at lowering the heart rate of mice: no drug (control), drug A, drug B, or drug C?

– How many populations are of interest?

• Four

– What are the populations of interest?

• All mice taking no drug, all mice taking drug A, all mice taking drug B, all mice taking drug C

Inferential Statistics

• Suppose we have no previous information about these questions. How could we answer them?

– Census

• Advantages

– We get everyone, so we know the truth.

• Disadvantages

– Expensive, difficult to obtain, may be impossible.

– Sample

• Advantages

– Less expensive. Feasible.

• Disadvantages

– Uncertainty about the truth. Instead of certainty, we may have error.

Inferential Statistics

• Suppose we have no previous information about these questions. How could we answer them?

– If we take a census, we have everyone and we have no need for inference. We know.

– If we take a sample, we make inference from the sample to the whole population.

– For these four questions, it is not likely we can get a census. We will need to use a sample.

– Obviously, for each population we are interested in, we must get a separate sample.

Inferential Statistics

• General Idea of Inferential Statistics

– We take a sample from the whole population.

– We summarize the sample using important statistics.

– We use those summaries to make inference about the whole population.

– We realize there may be some error involved in making inference.
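
The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population is simulated (the accident counts are invented, not real data), and the population mean is computed only because a simulation lets us see the "truth" that is normally unknown.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: simulated accident counts for 100,000 people.
population = [random.choice([0, 0, 0, 1, 1, 2, 3]) for _ in range(100_000)]

# Step 1: take a sample from the whole population.
sample = random.sample(population, 500)

# Step 2: summarize the sample with a statistic.
sample_mean = statistics.mean(sample)

# Step 3: use the summary to make an inference about the population.
# The population mean is normally unknown; we have it only because we simulated the data.
population_mean = statistics.mean(population)

# Step 4: recognize that the inference carries some error.
error = sample_mean - population_mean
print(f"sample mean = {sample_mean:.3f}, "
      f"population mean = {population_mean:.3f}, error = {error:+.3f}")
```

The error is small but generally not zero, which is the price paid for using a sample instead of a census.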

Inferential Statistics

• Example (1988, the Steering Committee of the Physicians' Health Study Research Group)

– Question: Can aspirin reduce the risk of heart attack in humans?

– Sample: 22,071 male physicians between the ages of 40 and 84, randomly assigned to one of two groups. One group took an ordinary aspirin tablet every other day (headache or not). The other group, the control group, took a placebo every other day.

– Summary statistic: The rate of heart attacks in the group taking aspirin was only 55% of the rate of heart attacks in the placebo group.

– Inference to population: Taking aspirin causes a lower rate of heart attacks in humans.

Sampling a Single Population

• Basics for sampling

– Sampling should not be biased: no favoring of any individual in the population.

• Example of a biased sample

– Select goldfish from a particular store

– The selection of an individual in the population should not affect the selection of the next individual – independence.

• Example of a non-independent sample

– Choosing cards from a deck without replacement

Sampling a Single Population

• Basics for sampling

– Sampling should be large enough to adequately cover the population.

• Example of a small sample

– Suppose only 20 physicians were used in the aspirin study.

– Sampling should have the smallest variability possible.

• We know there is some error; we want to minimize it.

Sampling a Single Population

• Sampling Techniques

– Simple Random Sample (SRS): every member of the population has an equal chance of being selected.

[Figure: a sample drawn from the population]

Sampling a Single Population

• Sampling Techniques

– Simple Random Sample (SRS): every member of the population has an equal chance of being selected.

• Assign every individual a number and randomly select 30 numbers using a random number table (or computer-generated random numbers).

– Example: Obtain a list of all SSNs for individuals in the U.S. who are over 65. Using a random number table, select 50 of them.

– Table B at the back of the book is random digits.
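
As a computer-generated alternative to a random number table, a simple random sample can be drawn with Python's standard `random` module. This is a minimal sketch: the frame of ID numbers 1 to 10,000 is a hypothetical stand-in for the real list of individuals, and the seed is fixed only to make the run repeatable.

```python
import random

rng = random.Random(42)  # seeded generator so the draw can be reproduced

# Hypothetical sampling frame: ID numbers 1..10,000 stand in for the list of individuals.
frame = list(range(1, 10_001))

# Draw 50 distinct IDs without replacement; every ID is equally likely to be chosen.
srs = rng.sample(frame, 50)
print(sorted(srs)[:10])  # first few selected IDs, in order
```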

Sampling a Single Population

• Sampling Techniques

– Stratified Random Sample: Divide the population into several strata. Then take an SRS from each stratum.

[Figure: the population divided into Strata 1, 2, and 3, with part of each stratum selected into the sample]

Sampling a Single Population

• Sampling Techniques

– Stratified Random Sample

• Advantage: Each stratum is guaranteed to be randomly sampled.

• Example: Obtain a list of all SSNs for individuals in the U.S. who are over 65. Divide the SSNs by region of the country (time zones). Then randomly sample 30 from each time zone.
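
A stratified random sample along the lines of this example might be sketched as follows. The frame, the four zone names, and the stratum sizes are all invented for illustration; only the procedure (group into strata, then take an SRS of 30 within each) comes from the slide.

```python
import random

rng = random.Random(1)

# Hypothetical frame of (person_id, time_zone) pairs; zones and sizes are invented.
zones = ["Eastern", "Central", "Mountain", "Pacific"]
frame = [(i, zones[i % 4]) for i in range(4_000)]

# Group the frame into strata by time zone.
strata = {}
for person_id, zone in frame:
    strata.setdefault(zone, []).append(person_id)

# Take a separate SRS of 30 from every stratum, so no stratum is left out.
stratified_sample = {zone: rng.sample(ids, 30) for zone, ids in strata.items()}

for zone, ids in stratified_sample.items():
    print(zone, len(ids))
```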

Sampling a Single Population

• Sampling Techniques

– Cluster Sample: Divide the population into several strata or clusters. Then take an SRS of clusters, using all the observations in each selected cluster.

[Figure: the population divided into Strata 1 through 9; Strata 1, 4, and 9 are selected in full as the sample]

Sampling a Single Population

• Sampling Techniques

– Cluster Sample

• Advantage: May be the only feasible method, given resources.

• Example: Obtain a list of all SSNs for individuals in the U.S. who are over 65. Sort the SSNs by the last 4 digits, making each set of 100 a cluster. Use a random number table to pick the clusters. You may get the 4100’s, 5600’s and 8200’s, for example.
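
The cluster example can be sketched the same way. The 9-digit ID numbers below are randomly generated placeholders, not real SSNs, and choosing 3 clusters is an assumption made to match the three clusters named on the slide.

```python
import random

rng = random.Random(7)

# Hypothetical frame: 20,000 made-up 9-digit ID numbers (not real SSNs).
frame = [rng.randrange(100_000_000, 1_000_000_000) for _ in range(20_000)]

# Cluster key: last four digits grouped in blocks of 100 (4100-4199 -> cluster 41).
clusters = {}
for id_number in frame:
    key = (id_number % 10_000) // 100
    clusters.setdefault(key, []).append(id_number)

# SRS of clusters, then keep every individual in the chosen clusters.
chosen = rng.sample(sorted(clusters), 3)
cluster_sample = [id_number for key in chosen for id_number in clusters[key]]

print("chosen clusters:", [f"{key}00's" for key in chosen])
print("individuals in sample:", len(cluster_sample))
```

Note the contrast with stratified sampling: here the randomness is in which groups are picked, and everyone inside a picked group is used.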

Sampling a Single Population

• Sampling Techniques

– Multi-Stage Sample: Divide the population into several strata. Then take an SRS from a random subset of all the strata.

[Figure: the population divided into Strata 1 through 9; the sample is drawn from within Strata 1, 4, and 9]

Sampling a Single Population

• Sampling Techniques

– Multi-Stage Sample

• Advantage: May be the only feasible method, given resources.

• Example: Obtain a list of all SSNs for individuals in the U.S. who are over 65. Divide the SSNs by state (50 states). Randomly select 10 states. Then randomly sample 40 from each of the selected states.
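
A two-stage version of this example could look like the sketch below. The state labels and the 1,000 person IDs per state are hypothetical; the 10 states and 40 people per state follow the slide.

```python
import random

rng = random.Random(3)

# Hypothetical frame: 50 "states", each with 1,000 invented person IDs.
states = {
    f"state_{s:02d}": [f"{s:02d}-{i:04d}" for i in range(1_000)]
    for s in range(50)
}

# Stage 1: SRS of 10 states.
selected_states = rng.sample(sorted(states), 10)

# Stage 2: SRS of 40 people within each selected state.
multistage_sample = {state: rng.sample(states[state], 40) for state in selected_states}

total = sum(len(people) for people in multistage_sample.values())
print(total)  # 10 states x 40 people = 400
```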

Sampling a Single Population

• Sampling Problems

– Voluntary response

• Internet surveys

• Call-in surveys

– Convenience sampling

• Sampling friends

• Sampling at the mall

– Dishonesty

• Asking personal questions

• Not enough time to respond honestly

Sampling a Single Population

• Undercoverage – Some groups in the population are left out when the sample is taken

• Nonresponse – An individual chosen for the sample can’t be contacted or does not cooperate

• Response Bias – Results that are influenced by the behavior of the respondent or interviewer

– For example, the wording of questions can influence the answers

– Respondent may not want to give truthful answers to sensitive questions

Sampling More than One Population

• We sample from more than one population when we are interested in more than one variable.

• As previously discussed, one variable is chosen to be the response variable and the other is selected as the explanatory variable.

• Examples

– Comparing decibel levels of 4 different brands of speakers

– Determining time to failure of 3 different types of lightbulbs

– Comparing GRE scores for students from 5 different majors

Sampling More than One Population

• Example 1: Comparing decibel levels of 4 different brands of speakers

– What is the explanatory variable?

• Brand

– What is the response variable?

• Decibel Level

– Number of Populations?

• Four

– Number of Samples needed?

• Four

Sampling More than One Population

• Example 2: Determining time to failure of 3 different types of lightbulbs

– What is the explanatory variable?

• Type

– What is the response variable?

• Time to Failure

– Number of Populations?

• Three

– Number of Samples needed?

• Three

Sampling More than One Population

• Example 3: Comparing GRE scores for students from 5 different majors

– What is the explanatory variable?

• Major

– What is the response variable?

• GRE score

– Number of Populations?

• Five

– Number of Samples needed?

• Five

Sampling More than One Population

• Important Considerations

– Each sample should represent its corresponding population well.

– Samples from more than one population should be as close to each other as possible in every respect except for the explanatory variable. Otherwise we may have confounding variables.

– Two variables are confounded if we cannot determine which one caused the differences in the response.

Sampling More than One Population

• Important Considerations

– Examples of Confounding

• Suppose we compared the decibel levels of the four different speaker brands, each with a different measuring instrument

– We wouldn’t know if the differences were due to the different brands or different instruments.

– Brand and Instrument are then confounded.

• Suppose we compared the time to failure of the three different types of lightbulbs, each in a different light socket.

– We wouldn’t know if the differences were due to the different types of lightbulbs or different light sockets.

– Type and Socket are then confounded.

Sampling More than One Population

• Important Considerations

– Examples of Confounding

• Suppose we obtained GRE scores for each major, each from a different university.

– We wouldn’t know if the differences were due to the different majors or different universities.

– Major and University are then confounded.

– Confounding can be avoided by using good sampling techniques, which will be explained shortly.

Sampling More than One Population

• Important Considerations

– It is also possible that more than one (possibly several) explanatory variable can influence a given response variable.

– Example

• Perhaps both the type of lightbulb and the type of light socket influence the time to failure of a lightbulb.

• It is likely that different types of lightbulbs work better for different sockets.

• This concept is known as interaction.

– Interaction: The responses for the levels of one variable differ over the levels of another variable.
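
To make the definition of interaction concrete, the sketch below compares the effect of changing sockets for two bulb types. The mean times to failure are invented numbers chosen only to illustrate the idea; they are not data from any study.

```python
# Hypothetical mean times to failure (hours); the numbers are invented for illustration.
mean_hours = {
    ("bulb A", "socket 1"): 1000, ("bulb A", "socket 2"): 1100,
    ("bulb B", "socket 1"): 1200, ("bulb B", "socket 2"): 900,
}

# Effect of changing socket, computed separately for each bulb type.
socket_effect = {
    bulb: mean_hours[(bulb, "socket 2")] - mean_hours[(bulb, "socket 1")]
    for bulb in ("bulb A", "bulb B")
}

# If the socket effect is not the same for every bulb type, the two variables interact.
interaction = socket_effect["bulb A"] - socket_effect["bulb B"]
print(socket_effect)  # {'bulb A': 100, 'bulb B': -300}
print(interaction)    # 400
```

In these made-up numbers, socket 2 helps bulb A but hurts bulb B, so the response to socket type differs over the levels of bulb type.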

Sampling More than One Population

• Randomized Experiment

– The key to a randomized experiment: the treatment (explanatory variable) is randomly assigned to the experimental units or subjects.

[Figure: a simple random sample is drawn from the population; random assignment splits it into a treatment group and a control group; statistics are computed for each group and compared]

Sampling More than One Population

• Randomized Experiment

– Example: Suppose that before we test the effect of aspirin on physicians, we wish to study the effect of aspirin on mice, comparing heart rates.

• We obtain a random sample of 100 mice.

• We randomly assign 50 mice to receive a placebo.

• We randomly assign 50 mice to receive aspirin.

• After 20 days of administering the placebo and aspirin, we measure the heart rates and obtain summary statistics for comparison.
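
The random assignment step in this example can be sketched as a shuffle followed by a split. The mouse labels are hypothetical; the group sizes (50 and 50) follow the slide.

```python
import random

rng = random.Random(11)

# Hypothetical labels for the 100 sampled mice.
mice = [f"mouse_{i:03d}" for i in range(100)]

# Shuffle a copy, then split: chance alone decides which mouse gets which treatment.
shuffled = mice[:]
rng.shuffle(shuffled)
placebo_group = shuffled[:50]
aspirin_group = shuffled[50:]

print(len(placebo_group), len(aspirin_group))  # 50 50
```

Because the split depends only on the shuffle, no characteristic of a mouse (weight, age, temperament) can systematically favor one group.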

Sampling More than One Population

• Randomized Experiment

– The single greatest advantage of a randomized experiment is that we can infer causation.

– Through randomization to groups, we have controlled all other factors and eliminated the possibility of a confounding variable.

– Unfortunately, or perhaps fortunately, we cannot always use a randomized experiment.

• Often impossible or unethical, particularly with humans.

Sampling More than One Population

• Observational Study

– We are forced to select samples from different pre-existing populations.

[Figure: a simple random sample is drawn from each of pre-existing population 1 and pre-existing population 2; statistics are computed for each sample and compared]

Sampling More than One Population

• Observational Study

– Advantage: The data is much easier to obtain.

– Disadvantages

• We cannot say the explanatory variable caused the response.

• There may be lurking or confounding variables.

• Observational studies should be used more to describe the past than to predict the future.

– Case-Control Study: A study in which cases having a particular condition are compared to controls who do not. The purpose is to find out whether or not one or more explanatory variables are related to a certain disease.

• Although you can’t usually determine cause and effect, these studies are more efficient and they can reduce the potential confounding variables.

Sampling More than One Population

• Observational Study

– Example 1: Suppose we are interested in comparing GRE scores for students in five different majors.

• We cannot do a randomized experiment because we cannot randomly assign individuals to a specific major. The individuals decide that for themselves.

• Thus, we observe students from 5 different pre-existing populations: the five majors.

• We obtain a random sample of size 15 from each of the five majors.

• We calculate statistics and compare the 5 groups.

• Can we say being in a specific major causes someone to get a higher GRE score?

• What are some possible confounding variables?

• How might we reduce the effect of these confounding variables?

Sampling More than One Population

• Observational Study

– Example 2: Suppose we are interested in finding out which age group talks the most on the telephone: 0-10 years, 10-20 years, 20-30 years, or 30-40 years.

• We cannot do a randomized experiment because we cannot randomly assign individuals to an age group.

• Thus, we observe (through polling or wire tapping) individuals from 4 different pre-existing populations: the four age groups.

• We obtain a random sample of size 25 from each of the four age groups.

• We calculate statistics and compare the 4 groups.

• Can we say being in a specific age group causes someone to talk more on the telephone?

• What are some possible confounding variables?

• How might we control these confounding variables?

Inference Overview

• Recall that inference is using statistics from a sample to talk about a population.

• We need some background in how we talk about populations and how we talk about samples.

Inference Overview

• Describing a Population

– It is common practice to use Greek letters when talking about a population.

– We call the mean of a population μ.

– We call the standard deviation of a population σ and the variance σ².

– When we are talking about percentages, we call the population proportion π (pi).

– It is important to know that for a given population there is only one true mean and one true standard deviation and variance, or one true proportion.

– There is a special name for these values: parameters.

Inference Overview

• Describing a Sample

– It is common practice to use Roman letters when talking about a sample.

– We call the mean of a sample x̄.

– We call the standard deviation of a sample s and the variance s².

– When we are talking about percentages, we call the sample proportion p.

– There are many different possible samples that could be taken from a given population. For each sample there may be a different mean, standard deviation, variance, or proportion.

– There is a special name for these values: statistics.

Inference Overview

• We use sample statistics to make inference about population parameters.

Quantity: sample statistic, population parameter

Mean: x̄, μ

Standard Deviation: s, σ

Proportion: p, π

Sampling Variability

• There are many different samples that you can take from the population.

• Statistics can be computed on each sample.

• Since different members of the population are in each sample, the value of a statistic varies from sample to sample.

Sampling Distribution

• The sampling distribution of a statistic is the distribution of values taken by the statistic in all possible samples of the same size from the same population.

• We can then examine the shape, center, and spread of the sampling distribution.
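
Listing all possible samples is rarely practical, so a sampling distribution is often approximated by drawing many samples and recording the statistic each time. The sketch below does this for the sample mean. The population is simulated, and the sample size of 25 and the 2,000 repetitions are arbitrary choices for illustration.

```python
import random
import statistics

rng = random.Random(5)

# Hypothetical population of 10,000 values (simulated).
population = [rng.gauss(70, 10) for _ in range(10_000)]

# Draw many samples of the same size and record the statistic for each one.
# This approximates, but does not enumerate, all possible samples.
sample_means = [statistics.mean(rng.sample(population, 25)) for _ in range(2_000)]

# Center and spread of the approximate sampling distribution.
center = statistics.mean(sample_means)
spread = statistics.stdev(sample_means)
print(f"center = {center:.2f}, spread = {spread:.2f}, "
      f"population mean = {statistics.mean(population):.2f}")
```

A histogram of `sample_means` would show the shape; the center should sit close to the population mean.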

Bias and Variability

• Bias concerns the center of the sampling distribution. A statistic used to estimate a parameter is unbiased if the mean of the sampling distribution is equal to the true value of the parameter being estimated.

• To reduce bias, use random sampling. The values of a statistic computed from an SRS neither consistently overestimate nor consistently underestimate the value of the population parameter.

• Variability is described by the spread of the sampling distribution.

• To reduce the variability of a statistic from an SRS, use a larger sample. You can make the variability as small as you want by taking a large enough sample.
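
The effect of sample size on variability can be checked by simulation. This is a rough sketch with a simulated population; the sample sizes (10, 100, 1000) and the helper name `spread_of_sample_means` are assumptions made for the illustration.

```python
import random
import statistics

rng = random.Random(9)

# Hypothetical population of 20,000 values (simulated).
population = [rng.uniform(0, 100) for _ in range(20_000)]

def spread_of_sample_means(n, repetitions=500):
    """Standard deviation of the sample mean across repeated SRSs of size n."""
    means = [statistics.mean(rng.sample(population, n)) for _ in range(repetitions)]
    return statistics.stdev(means)

# Larger samples should give a tighter sampling distribution.
spreads = {n: spread_of_sample_means(n) for n in (10, 100, 1000)}
for n, spread in spreads.items():
    print(f"n = {n:4d}: spread of sample means = {spread:.2f}")
```

The spread should shrink as n grows, while the center of each distribution stays near the population mean, since an SRS keeps the sample mean unbiased at every sample size.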
