7. sampling & sample size determination ldr 280

SAMPLING: Process of Selecting your Observations (Masoud Hemmasi, Ph.D.)

Upload: shifa-najam

Post on 14-Apr-2017


Category:

Data & Analytics



TRANSCRIPT

Page 1: 7. sampling & sample size determination ldr 280

SAMPLING:

Process of Selecting your Observations

(Masoud Hemmasi, Ph.D.)

Page 2: 7. sampling & sample size determination ldr 280

SAMPLING: Process of Selecting your Observations

• QUESTION:

During presidential election campaigns, in a typical poll, of the potentially 100 million voters, how many would you say are contacted?

• History and Evolution of Political Polling

Page 3: 7. sampling & sample size determination ldr 280

SAMPLING: Process of Selecting your Observations

Types of Probability Sampling: Simple (Unrestricted) Random Sampling

Complex (Restricted) Probability Sampling: Sometimes offers more efficient alternatives to Simple Random Sampling

a. Systematic Sampling
b. Stratified Random Sampling
c. Cluster Sampling
d. Convenience Sampling (a nonprobability technique, listed here for comparison)
e. Double Sampling

Page 4: 7. sampling & sample size determination ldr 280

Types of Probability Sampling:

Simple Random (or Unrestricted) Sampling

A sampling procedure in which every element in the population has a known and equal chance of being selected as a subject (e.g., drawing names out of a hat).

Advantage: Has the least bias and offers the most generalizability.

Disadvantage: At times, can be inefficient/expensive.
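The "names out of a hat" idea can be sketched in a few lines of Python. This is only an illustration: the function name `simple_random_sample` and the list of 100 made-up names are assumptions, not part of the slides.

```python
import random


def simple_random_sample(population, n, seed=None):
    """Draw n distinct elements so that every element has the same chance of selection."""
    rng = random.Random(seed)  # a seed is used only to make the illustration repeatable
    return rng.sample(list(population), n)


# Hypothetical sampling frame of 100 people (illustrative data only).
names = [f"person_{i}" for i in range(1, 101)]
sample = simple_random_sample(names, 10, seed=42)
print(sample)
```

Each name can appear at most once in the sample, which mirrors drawing slips from a hat without putting them back.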

Page 5: 7. sampling & sample size determination ldr 280

Systematic Sampling

If a sample size of n is desired from a population containing N elements, we might sample one element for every N/n elements in the population.

First, we randomly select one of the first N/n elements from the population list.

We then select every (N/n)th element that follows in the population list.

This method has the properties of a simple random sample, especially if the list of the population elements is a random ordering.

Page 6: 7. sampling & sample size determination ldr 280

Systematic Sampling

Advantage: The sample usually will be easier to identify than it would be if simple random sampling were used.

Example: Selecting every 100th listing in a telephone book after the first randomly selected listing.
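The telephone-book example can be sketched as follows. This is a minimal illustration that assumes N is a whole multiple of n; the function name `systematic_sample` and the list of 1,000 numbered listings are hypothetical.

```python
import random


def systematic_sample(population, n, seed=None):
    """Pick a random start among the first N/n elements, then take every (N/n)th element."""
    items = list(population)
    if n <= 0 or n > len(items):
        raise ValueError("n must be between 1 and the population size")
    k = len(items) // n       # sampling interval N/n (integer part)
    rng = random.Random(seed)
    start = rng.randrange(k)  # random position within the first k elements
    return items[start::k][:n]


# Hypothetical list of 1,000 telephone listings, numbered 1..1000.
listings = list(range(1, 1001))
sample = systematic_sample(listings, 10, seed=1)
print(sample)
```

With N = 1,000 and n = 10 the interval is 100, so the sample consists of one randomly chosen listing among the first 100 and every 100th listing after it.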

Page 7: 7. sampling & sample size determination ldr 280

Stratified Random Sampling

The population is first divided into groups called strata with respect to salient/relevant characteristics (e.g., gender, age, race, department, location, industry, etc.)

Each element in the population belongs to one and only one stratum.

Best results are obtained when the elements within each stratum are as much alike as possible (i.e., a homogeneous group).

A simple random sample is taken from each stratum.

Advantage: If strata are homogeneous, this method is as "precise" as simple random sampling but with a smaller total sample size.
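A minimal sketch of the procedure above: group the elements into strata, then take a simple random sample inside each stratum. The function name `stratified_sample`, the equal allocation per stratum, and the made-up employee records are assumptions for illustration only.

```python
import random
from collections import defaultdict


def stratified_sample(elements, stratum_of, n_per_stratum, seed=None):
    """Take a simple random sample of n_per_stratum elements from every stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for element in elements:
        # Each element lands in exactly one stratum.
        strata[stratum_of(element)].append(element)
    sample = []
    for key in sorted(strata):
        sample.extend(rng.sample(strata[key], n_per_stratum))
    return sample


# Hypothetical population: 60 employees spread evenly over three departments.
employees = [
    {"id": i, "department": dept}
    for i, dept in enumerate(["sales", "ops", "finance"] * 20)
]
sample = stratified_sample(employees, lambda e: e["department"], 5, seed=7)
print(len(sample))
```

Equal allocation is only one option; a proportional allocation would draw from each stratum in proportion to its share of the population.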

Page 8: 7. sampling & sample size determination ldr 280

Cluster Sampling

The population is first divided into separate groups called clusters.

Ideally, each cluster would be a small-scale version (representative) of the population.

A simple random sample of the clusters is then taken.

All elements within each selected cluster will make up the final sample.

Example: A primary application is area sampling, where clusters are city blocks or other well-defined areas (neighborhoods, precincts, school districts, etc.).

Page 9: 7. sampling & sample size determination ldr 280

Cluster Sampling

Advantage: The close proximity of elements can be cost and time effective (i.e., many sample observations can be obtained in a short time).

Disadvantage: This method generally requires a larger total sample size than simple or stratified random sampling.
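Area sampling can be sketched as below: randomly choose whole clusters, then keep every element inside them. The function name `cluster_sample` and the 20 made-up city blocks of 8 households each are assumptions for illustration.

```python
import random


def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly select n_clusters clusters and return all of their elements."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)  # simple random sample of clusters
    sample = [element for name in chosen for element in clusters[name]]
    return chosen, sample


# Hypothetical city: 20 blocks, each containing 8 households.
blocks = {
    f"block_{b}": [f"household_{b}_{h}" for h in range(8)]
    for b in range(20)
}
chosen, sample = cluster_sample(blocks, 3, seed=3)
print(chosen, len(sample))
```

Note the contrast with stratified sampling: here only some groups are visited, but every element in a visited group is included.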

Page 10: 7. sampling & sample size determination ldr 280

Convenience Sampling

It is a nonprobability sampling technique. Items are included in the sample without known probabilities of being selected.

The sample is identified primarily by convenience.

Example: A professor conducting research might use student volunteers to constitute a sample.

Advantage: Sample selection and data collection are relatively easy.

Disadvantage: It is impossible to determine how representative of the population the sample is.

Page 11: 7. sampling & sample size determination ldr 280

Sample Size Determination

SAMPLING: Process of Selecting your Observations

Page 12: 7. sampling & sample size determination ldr 280

SAMPLING: Process of Selecting your Observations

Standard Deviation (\(S_x\)): What does it measure?

• Variations/differences in scores among members of a group with respect to a given characteristic (e.g., test scores for a class, income).

• Standard deviation represents the average distance of a group of numbers from their mean.

How do we calculate it? Hint: You can think of it as the average deviation from the norm/typical.

For a Population: \(\sigma_x = \sqrt{\sum (X - \mu_x)^2 / N}\)

For a Sample: \(S_x = \sqrt{\sum (X - \bar{x})^2 / (n - 1)}\)

Page 13: 7. sampling & sample size determination ldr 280

Income levels for a particular class like this:

Xs = Incomes of students in an MBA class: $6,000, $6,000, $15,000, $16,000, $39,000, $38,000, $50,000, $70,000

\(\sum X\) = $240,000

Average = \(\bar{x}\) = $240,000 / 8 = $30,000

[Slide annotations grouping the incomes: Grad Assistants; Part-Time Employed]

Page 14: 7. sampling & sample size determination ldr 280

X          X - x̄       (X - x̄)²
6,000      -24,000     576,000,000
6,000      -24,000     576,000,000
15,000     -15,000     225,000,000
16,000     -14,000     196,000,000
39,000       9,000      81,000,000
38,000       8,000      64,000,000
50,000      20,000     400,000,000
70,000      40,000   1,600,000,000
Sum (Σ)    240,000   0   3,718,000,000

Average = \(\bar{x}\) = 30,000

Variance = \(\sigma_x^2\) = 3,718,000,000 / 8 = 464,750,000

Std. Dev. = \(\sigma_x\) = $21,558.06
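The table's arithmetic can be checked with Python's standard `statistics` module. The sketch below treats the eight incomes as a whole population (dividing by 8), as the slide does; the sample version (dividing by n - 1) is included only for comparison and is not a figure from the slides.

```python
import statistics

# Incomes of the eight students from the slide.
incomes = [6000, 6000, 15000, 16000, 39000, 38000, 50000, 70000]

mean = statistics.mean(incomes)
variance = statistics.pvariance(incomes)     # population variance: divide by N
std_dev = statistics.pstdev(incomes)         # population standard deviation
sample_std_dev = statistics.stdev(incomes)   # sample version: divide by n - 1

# The slide reports 30,000, 464,750,000 and $21,558.06 for the first three.
print(mean, variance, round(std_dev, 2))
print(round(sample_std_dev, 2))
```

The sample standard deviation comes out larger than the population one because it divides the same sum of squares by 7 instead of 8.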

Page 15: 7. sampling & sample size determination ldr 280

SAMPLING: Process of Selecting Your Observations

Suppose the frequency distribution of the life of light bulbs is normal, with \(\mu_x = 100\) hrs and \(\sigma_x = 5\) hrs.

[Figure: bell-shaped frequency distribution of X = life of light bulbs in hours (e.g., 3 bulbs lasted 108 hrs each). Horizontal axis: X = Hours, marked at 85, 90, 95, 100, 105, 110, 115. Vertical axis: Freq.]

What can we say about the expected life of a randomly selected bulb (\(x_i\))?

Formula: \(X = \mu_x \pm Z\sigma_x\) (where Z is an index that reflects the level of confidence/certainty with which we wish to estimate X.)

Life of a randomly drawn light bulb: \(100 - 5Z \le X \le 100 + 5Z\), with Z = 1 for 68% confidence, Z = 1.96 for 95% confidence, and Z = 3 for 99% confidence (rounded up; the exact value for 99% is about 2.58).
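As a quick illustration of the light-bulb formula, the sketch below evaluates the interval for each of the three Z values given on the slide. The function name `individual_interval` is hypothetical, and Z = 3 for 99% follows the slide's rounded convention.

```python
MU = 100.0    # mean bulb life in hours (from the slide)
SIGMA = 5.0   # standard deviation of bulb life in hours (from the slide)

# Z values as given on the slide; Z = 3 for 99% is the slide's rounded convention.
Z_BY_CONFIDENCE = {"68%": 1.0, "95%": 1.96, "99%": 3.0}


def individual_interval(mu, sigma, z):
    """Range mu - z*sigma .. mu + z*sigma for a single normally distributed observation."""
    return mu - z * sigma, mu + z * sigma


for level, z in Z_BY_CONFIDENCE.items():
    low, high = individual_interval(MU, SIGMA, z)
    print(f"{level} confidence: {low:.1f} to {high:.1f} hours")
```

With Z = 3 the interval spans 85 to 115 hours, which matches the range of the horizontal axis on the slide's figure.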

Page 16: 7. sampling & sample size determination ldr 280

Income Distribution for a hypothetical population

The population consists of ten people with incomes of $0, $1, $2, $3, $4, $5, $6, $7, $8, and $9.

True Population Mean = \(\mu_x = \sum x_i / N\) = 45 / 10 = $4.5

Population Standard Deviation: \(\sigma_x\) = 2.87

Income of a randomly drawn person (\(X_i\)) = ?
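The mean and standard deviation of this ten-person population can be verified directly. A minimal sketch using the standard `statistics` module; the variable names are illustrative.

```python
import statistics

# Hypothetical population from the slide: incomes of $0 through $9.
population = list(range(10))

mu = statistics.mean(population)       # population mean
sigma = statistics.pstdev(population)  # population standard deviation (divide by N)

# The slide reports a mean of $4.5 and a standard deviation of 2.87.
print(mu, round(sigma, 2))
```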

Page 17: 7. sampling & sample size determination ldr 280

SAMPLING: Process of Selecting your Observations

This formula: \(X = \mu_x \pm Z\sigma_x\) is ONLY applicable when the population distribution is NORMAL.

What is the distribution of our hypothetical population?

Page 18: 7. sampling & sample size determination ldr 280

Distribution of the Hypothetical Population

[Figure: frequency plot with the values $0 through $9 on the horizontal axis (X) and a frequency scale of 1 to 10 on the vertical axis; each income value occurs exactly once.]

Uniform Distribution

Page 19: 7. sampling & sample size determination ldr 280

SAMPLING: Process of Selecting your Observations

\(X = \mu_x \pm Z\sigma_x\)

NOTE that X is the \(\bar{x}\) of a sample of size n = 1.

What is the generic formula for the mean (\(\bar{x}\)) of samples of any size (any n)? That is, what if instead of a single observation/case (X), we draw a random sample of a particular size from the population? Can we say something about the mean of that sample, \(\bar{x}\)?

Rather than \(X = \mu_x \pm Z\sigma_x\), use \(\bar{x} = \mu_{\bar{x}} \pm Z\sigma_{\bar{x}}\), where \(\sigma_{\bar{x}}\) is the Std. Error.

• If (and only if) we know that our sample mean (\(\bar{x}\)) comes from a normally distributed population, the same formula can be modified and applied.

But, what does this statement mean?

Page 20: 7. sampling & sample size determination ldr 280

Sampling Distribution = Frequency distribution of sample means

Sampling Distribution for Samples of Size n = 2 (from our earlier population)

Sample #   SAMPLE     MEAN (x̄)
1          $0 & $1    0.5
2          $0 & $2    1.0
3          $0 & $3    1.5
. . .
10         $1 & $2    1.5
11         $1 & $3    2.0
12         $1 & $4    2.5
. . .
18         $2 & $3    2.5
19         $2 & $4    3.0
. . .
43         $7 & $8    7.5
44         $7 & $9    8.0
45         $8 & $9    8.5

45 possible samples of size n = 2, thus 45 possible sample means.

The distribution of these 45 sample means is called the Sampling Distribution! See next slide!

Mean of all the 45 sample means (\(\bar{x}\)s) = \(\mu_{\bar{x}} = \mu_x\) = 4.5 (i.e., the same as the mean of the original population).

\(\sigma_{\bar{x}}\) = Standard Error is the standard dev. of these \(\bar{x}\)s.

So, the earlier statement means: if these sample means are normally distributed, we can use the related formula.
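The full list of 45 samples is easy to enumerate, which makes it possible to check the claims about the sampling distribution directly. The sketch below assumes, as the table does, that each sample is an unordered pair of two different people; the variable names are illustrative.

```python
import statistics
from itertools import combinations

population = list(range(10))  # incomes $0 through $9

# Every unordered pair of two different people, in the same order as the table.
samples = list(combinations(population, 2))
sample_means = [sum(pair) / 2 for pair in samples]

mean_of_means = statistics.mean(sample_means)
std_error = statistics.pstdev(sample_means)  # standard deviation of the 45 sample means

print(len(samples), mean_of_means, round(std_error, 3))
```

The mean of the sample means equals the population mean of 4.5. Because these pairs are drawn without replacement from a population of only ten values, the computed standard error comes out somewhat smaller than \(\sigma_x / \sqrt{n}\); the two agree closely when the population is large relative to the sample.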

Page 21: 7. sampling & sample size determination ldr 280

Sampling Distribution of Samples of Size n = 2

[Figure: frequency plot of the 45 sample means (\(\bar{x}\)), centered at \(\mu_{\bar{x}}\). For example, \(\bar{x}\) = ($0+$1)/2 = $0.50 occurs once, while \(\bar{x}\) = $1.50 occurs twice: ($0+$3)/2 = $1.50 and ($1+$2)/2 = $1.50. The table of samples and sample means from the previous slide is repeated alongside the plot.]

Page 22: 7. sampling & sample size determination ldr 280

SAMPLING: Process of Selecting Your Observations

So, if we know that the distribution of our Sample Means (i.e., Sampling Distribution) is NORMAL, as shown below:

[Figure: bell-shaped frequency distribution (Freq) of sample means \(\bar{x}\), centered at \(\mu_{\bar{x}} = \mu_x\).]

We will be able to say the following about the mean (\(\bar{x}\)) of a randomly selected sample: \(\bar{x} = \mu_{\bar{x}} \pm Z\sigma_{\bar{x}}\)

Since \(\mu_{\bar{x}} = \mu_x\), substitute \(\mu_x\) for \(\mu_{\bar{x}}\): \(\bar{x} = \mu_x \pm Z\sigma_{\bar{x}}\)

\(\sigma_{\bar{x}}\) = Standard Error = \(\sigma_x / \sqrt{n}\)

Page 23: 7. sampling & sample size determination ldr 280

SAMPLING: Process of Selecting your Observations

QUESTION: What is the primary purpose of sampling?

Answer: To use sample characteristics (e.g., \(\bar{x}\)) as estimates of population characteristics (e.g., \(\mu_x\)).

What is the significance of this formula? \(\bar{x} = \mu_x \pm Z\sigma_{\bar{x}}\)

Answer: It shows the relationship between \(\bar{x}\) and \(\mu_x\). So, if \(\bar{x}\) comes from a normal distribution, we can rewrite the formula to estimate \(\mu_x\) based on the value of \(\bar{x}\):

\(\bar{x} = \mu_x \pm Z\sigma_{\bar{x}}\) becomes \(\mu_x = \bar{x} \pm Z\sigma_{\bar{x}}\)

Question: But, is the sampling distribution (i.e., distribution of \(\bar{x}\)) always normal (so that we can use the above formula)? Let's see it!

Page 24: 7. sampling & sample size determination ldr 280

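Distribution of Sample Means (\(\bar{x}\)s) for Different Population Distributions

[Figure: four different population distributions, each labeled (n = 1), shown together with the sampling distributions they produce. Think of the (n = 1) panels as the distribution of life of all individual light bulbs (X). Think of the other panels as the distribution of average life of samples of n light bulbs (\(\bar{x}\)).]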

Page 25: 7. sampling & sample size determination ldr 280

SAMPLING: Process of Selecting your Observations

Conclusion?

As n increases, the sampling distribution (i.e., the distribution of \(\bar{x}\)s) will more and more resemble a normal distribution, so that for n > 30 (a common rule of thumb) the sampling distribution can be treated as approximately normal, regardless of the distribution of the original population.
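A small simulation can make this conclusion concrete. The sketch below repeatedly draws samples of n = 30 from the uniform $0 to $9 population used earlier and checks how often the sample mean falls within 1.96 standard errors of the population mean. The function name `simulate_sample_means`, the number of repetitions, the seed, and the choice to sample with replacement are all assumptions made for illustration.

```python
import random
import statistics


def simulate_sample_means(population, n, repetitions, seed=0):
    """Draw `repetitions` random samples of size n (with replacement) and return their means."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.choices(population, k=n))
        for _ in range(repetitions)
    ]


population = list(range(10))           # uniform, clearly non-normal population
mu = statistics.mean(population)
sigma = statistics.pstdev(population)

n = 30
means = simulate_sample_means(population, n=n, repetitions=5000)
std_error = sigma / n ** 0.5           # sigma / sqrt(n)

# Share of sample means within 1.96 standard errors of the population mean.
within = sum(abs(m - mu) <= 1.96 * std_error for m in means) / len(means)
print(round(statistics.fmean(means), 3), round(within, 3))
```

If the sampling distribution is close to normal, the printed share should be near 95%, even though the underlying population is flat rather than bell-shaped.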

Page 26: 7. sampling & sample size determination ldr 280

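SAMPLING: Process of Selecting Your Observations

Variable of interest X is NOT normally distributed.

[Figure: a non-normal distribution of Xs, with Mean of Xs = \(\mu_x\) and Std. Dev. of Xs = \(\sigma_x\). Repeated samples of sizes \(n_1 > 30\), \(n_2 > 30\), \(n_3 > 30\), ... yield sample means \(\bar{x}_1\), \(\bar{x}_2\), \(\bar{x}_3\), ..., each with its own sample standard deviation \(S_x\).]

Distribution of \(\bar{x}\)s for all samples of the same size (Sampling Distribution):

Mean of \(\bar{x}\)s = \(\mu_{\bar{x}} = \mu_x\)

Std. Error = \(\sigma_{\bar{x}} = \sigma_x / \sqrt{n}\)

The sampling distribution can be relied on to be (approximately) normal only when n ≥ 30 is used.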

Page 27: 7. sampling & sample size determination ldr 280

SAMPLING: Process of Selecting your Observations

So, for samples of n ≥ 30:

\(\mu_x = \bar{x} \pm Z\sigma_{\bar{x}}\)

Standard Error = \(\sigma_{\bar{x}} = \sigma_x / \sqrt{n}\)

SO, \(\mu_x = \bar{x} \pm Z\sigma_x / \sqrt{n}\)

Now, let's examine the elements of this formula!

Page 28: 7. sampling & sample size determination ldr 280

SAMPLING: Process of Selecting your Observations

\(\mu_x = \bar{x} \pm Z\sigma_x / \sqrt{n}\)

1) We are interested in estimating \(\mu_x\) from \(\bar{x}\)

2) Estimation involves a margin of error, that is

3) Actual Score = Estimate ± Margin of Error

In the formula, \(\mu_x\) is the Actual Score, \(\bar{x}\) is the Estimate, and \(Z\sigma_x / \sqrt{n}\) is the Margin of Error; let's call it "E".

So, when using random samples of size n > 30, the margin of error in estimation would be:

\(E = Z\sigma_x / \sqrt{n}\)

Page 29: 7. sampling & sample size determination ldr 280

SAMPLING: Process of Selecting your Observations

\(E = Z\sigma_x / \sqrt{n}\)

Square both sides of the equation:

\(E^2 = Z^2\sigma_x^2 / n\)

Rewrite it to solve for n:

\(n = Z^2\sigma_x^2 / E^2\)

• \(\sigma_x\) (population Std. Dev.) is often unknown. \(S_x\) (Std. Dev. of a sample) is a reasonable estimate (substitute) for it.

• \(S_x\) can be estimated based on previous studies or a pilot study.

\(n = Z^2 S_x^2 / E^2\)

Page 30: 7. sampling & sample size determination ldr 280

SAMPLING: Process of Selecting your Observations

Sample size required for estimating a population mean (\(\mu_x\))*:

\(n = Z^2 S_x^2 / E^2\)

n = Sample size required
E = Margin of error we are willing/able to tolerate in estimating the population characteristic (mean)
Z = An index reflecting the degree of confidence/certainty we wish to have in achieving the level of precision/accuracy represented by E above
S = An estimate of the Std. Dev. of the characteristic being estimated/studied

* The case of n for estimating a population proportion will be covered later.
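The sample size formula translates directly into a small helper. The function name `sample_size_for_mean` is hypothetical, and the inputs in the usage lines (Z = 2, S = 16, E = 3) are just one illustrative combination; the result is rounded up because a fraction of a respondent cannot be sampled.

```python
import math


def sample_size_for_mean(z, s, e):
    """n = Z^2 * S^2 / E^2, rounded up to the next whole observation."""
    if e <= 0:
        raise ValueError("margin of error E must be positive")
    return math.ceil(z ** 2 * s ** 2 / e ** 2)


# Illustrative inputs: Z = 2, S = 16, E = 3.
n = sample_size_for_mean(z=2, s=16, e=3)
print(n)
```

Because E is squared in the denominator, halving the tolerated margin of error roughly quadruples the required sample size.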

Page 31: 7. sampling & sample size determination ldr 280

SAMPLING: Process of Selecting your Observations

An example: Suppose you were to use a random sample to estimate the average IQ of adult males. Suppose you know, from a pilot study, that the Std. Dev. of males' IQ is about 16 points. What size sample should you use if you wish to be 95% sure that your margin of error in estimating average IQ is no more than 3 points (that is, if you wish to be 95% sure that the estimate you will obtain from the sample would be within ±3 points of the actual/true average IQ of the adult male population)?

\(n = Z^2 S^2 / E^2\)

Z = ? S = ? E = ?

Z = 2 (1.96 rounded, for 95% confidence), S = 16, E = 3

\(n = 2^2 (16)^2 / 3^2 = 113.78\), round up = 114

Page 32: 7. sampling & sample size determination ldr 280

SAMPLING: Process of Selecting your Observations

Assuming worst case scenario when S is unknown:

An Example: Suppose we were to use a random sample to estimate the average IQ of adult males. Further suppose that we have absolutely no basis for determining the Std. Dev. of males' IQ. But, we know that the IQ of the overwhelming majority of adult males ranges between 80 and 120. What size sample should we use if we wish to be 99% sure that our margin of error in estimating the average IQ is no more than 2 points (that is, if we wish to be 99% sure that the estimate we will obtain from the sample would be within ±2 points of the actual/true average IQ of the adult male population)?

\(n = Z^2 S^2 / E^2\)

If no information is available on S, you can assume maximum variability by setting S = ¼ of Range.

Range = 120 – 80 = 40
S = 40/4 = 10
Z = 3
E = 2

\(n = 3^2 (10)^2 / 2^2 = 225\)
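The range-based shortcut can be sketched as a variant of the sample size calculation in which S is replaced by one quarter of the range. The function name `sample_size_from_range` is hypothetical; the inputs are the ones from the IQ example.

```python
import math


def sample_size_from_range(z, low, high, e):
    """Estimate S as one quarter of the range, then apply n = Z^2 * S^2 / E^2."""
    s = (high - low) / 4          # rule of thumb from the slide: S = 1/4 of Range
    return math.ceil((z * s / e) ** 2)


# IQ example: range 80 to 120, Z = 3, E = 2.
n = sample_size_from_range(z=3, low=80, high=120, e=2)
print(n)
```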

Page 33: 7. sampling & sample size determination ldr 280

SAMPLING: Process of Selecting your Observations

Assessing Resulting Accuracy/Precision of the Estimates, Given a Particular Sample Size:

• Suppose we used a survey with lots of 7-point scale items,
• Collected data from 225 respondents, and
• Descriptive statistics on the data show that the typical Std. Dev. on most items/variables is in the 1.3 to 1.5 range.
• What can we say about the precision/accuracy of our results, say, with 95% confidence/certainty?

\(n = Z^2 S_x^2 / E^2\)

\(E^2 = Z^2 S^2 / n\)

\(E = Z S / \sqrt{n}\)

\(E = 2 (1.5) / \sqrt{225} = 3/15 = 0.2\)

We can be 95% certain that the sample mean for a typical variable is not off from the true population mean by more than two-tenths of a point (e.g., if the reported sample mean on a given variable is 4.7, we can be 95% sure that the actual population mean is between 4.5 and 4.9).
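Going the other way, from a sample size already in hand to the resulting margin of error, is a one-line calculation. A minimal sketch; the function name `margin_of_error` is hypothetical, and the inputs are the survey figures from the slide (Z = 2, S between 1.3 and 1.5, n = 225).

```python
import math


def margin_of_error(z, s, n):
    """E = Z * S / sqrt(n): precision achieved with a sample of size n."""
    return z * s / math.sqrt(n)


# Survey example: 225 respondents, Z = 2 for roughly 95% confidence.
for s in (1.3, 1.5):
    print(f"S = {s}: E = {margin_of_error(2, s, 225):.3f}")
```

Using the larger standard deviation (1.5) gives the more conservative margin, which is why the slide works with that value.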

Page 34: 7. sampling & sample size determination ldr 280

SAMPLING: Process of Selecting your Observations

Sample size determination for estimating Proportions (p):

EXAMPLE: Projecting the percentage of people who would be voting for a particular candidate in a presidential election.

In such cases, dispersion is measured by pq (instead of variance, \(s^2\)), where
p = proportion of the population that is expected to have the attribute under study, and
q = (1 - p), the proportion of the population that is expected NOT to have that attribute.

So, the sample size formula will change to: \(n = Z^2 pq / E^2\)

Or: \(n = Z^2 p(1-p) / E^2\)

NOTE: If we have no basis for judging the expected value of p, we can assume maximum variability (i.e., err on the side of overestimating the required sample size) by setting p at p = 0.50 (see the example on the next slide).

Page 35: 7. sampling & sample size determination ldr 280

SAMPLING: Process of Selecting your Observations

Sample size determination for Estimating Proportions:

EXAMPLE: Suppose you are to project the percentage of potential voters who would be expected to vote for the Republican candidate in the upcoming presidential election. Suppose you have no basis for estimating/guessing what the percentage could possibly be. Also, suppose that you want to be 99% confident/certain that your margin of error would be 3% (i.e., 99% certain that your projection/estimate will be within ±3% of the actual number). What size sample will you need?

\(n = Z^2 p(1-p) / E^2\)

Z = 3, p = 0.50, E = 0.03

\(n = 3^2 (0.5)(0.5) / 0.03^2\)

\(n = 9 (0.25) / 0.0009 = 2500\)
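The proportion formula can be sketched the same way as the one for means. The function name `sample_size_for_proportion` is hypothetical. Exact fractions are used here as a design choice so that a result that is a whole number on paper is not pushed up by floating-point rounding before the final round-up.

```python
import math
from fractions import Fraction


def sample_size_for_proportion(z, e, p=0.5):
    """n = Z^2 * p * (1 - p) / E^2, rounded up; p = 0.5 assumes maximum variability."""
    # Convert through str() so that 0.03 becomes exactly 3/100.
    z, e, p = (Fraction(str(value)) for value in (z, e, p))
    if e <= 0 or not 0 <= p <= 1:
        raise ValueError("E must be positive and p must lie between 0 and 1")
    return math.ceil(z ** 2 * p * (1 - p) / e ** 2)


# Election example from the slide: Z = 3, p = 0.50, E = 0.03.
n = sample_size_for_proportion(z=3, e=0.03)
print(n)
```

Any value of p other than 0.5 makes p(1 - p) smaller, so it can only lower the required sample size; that is why p = 0.5 is the safe default when nothing is known in advance.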

Page 36: 7. sampling & sample size determination ldr 280

SAMPLING: Process of Selecting your Observations

QUESTIONS OR COMMENTS?