
Stat472/572 Sampling: Theory and Practice

Instructor: Yan Lu


Chapter 3: Stratified Sampling

Example: A population consists of 1000 males and 100 females.

• Take an SRS of size 55 from the population. It is possible to obtain a sample with no females.

—-Most people would not consider such a sample to be representative of the population, since men and women might respond differently on the item of interest

• With a stratified sample, we can take 50 males and 5 females

—-a sample with no or few females cannot be selected, so we are protected from the possibility of obtaining a really bad sample

—-stratification often increases the precision of the estimators

Stratified Sampling

• Divide population into H subpopulations, called strata. The

strata do not overlap and they constitute the whole population

• Each sampling unit belongs to exactly one stratum

• Draw an independent probability sample from each stratum

• Pool the information to obtain overall population estimates
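The sampling step described above can be sketched in Python. This is only an illustration under stated assumptions: the function name `stratified_srs`, the dictionary layout, and the unit labels are hypothetical and are not part of the course material.

```python
import random

def stratified_srs(strata, sample_sizes, seed=None):
    """Draw an independent SRS (without replacement) from each stratum.

    strata       : dict mapping stratum label -> list of sampling units
    sample_sizes : dict mapping stratum label -> n_h
    """
    rng = random.Random(seed)
    return {h: rng.sample(units, sample_sizes[h]) for h, units in strata.items()}

# Hypothetical frame for the opening example: 1000 males and 100 females.
population = {
    "male": [f"M{i}" for i in range(1000)],
    "female": [f"F{i}" for i in range(100)],
}

# 50 males and 5 females, so every sample contains both groups.
sample = stratified_srs(population, {"male": 50, "female": 5}, seed=472)
print({h: len(units) for h, units in sample.items()})
```

Because each stratum is sampled separately, a sample with no females cannot occur, which is the protection the example above describes.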


Figure 1: Stratification


Example 3.2: Agriculture survey (Refer to Example 2.5)

• In Example 2.5, we generated a random sample. But some areas were

overrepresented, and others not represented at all

• Part of the large variability arises because counties in the western United

States are larger, and thus tend to have larger values of y, than counties in

the eastern United States

• Taking a stratified sample can provide some balance in the sample on the

stratifying variable

• We use the four census regions of the United States: Northeast (NE), North

Central (NC), South (S), and West (W) strata, and sample about 10% of the

counties in each stratum.


Figure 2: Boxplot of data from example 3.2. The thick line for each region is the median of

the sample data from that region; the other horizontal lines in the boxes are the 25th and 75th

percentiles. The Northeast region has a relatively small median and small variance; the West

region, however, has a much higher median and variance. The distribution of farm acreage

appears to be positively skewed in each of the regions.

[Figure 2 plot: boxplots by Region (NC, NE, S, W); vertical axis: Millions of Acres, from 0.0 to 2.0]


Stratum          # of counties in stratum   # of counties in sample
Northeast                  220                        21
North Central             1054                       103
South                     1382                       135
West                       422                        41
Total                     3078                       300


Table 1: Summary statistics for each stratum

Region          Stratum size   Sample size   Average      Variance
Northeast            220            21        97,629.8      7,647,472,708
North Central       1054           103       300,504.2     29,618,183,543
South               1382           135       211,315.0     53,587,487,856
West                 422            41       662,295.5    396,185,950,266

• We took an SRS in each stratum. For the Northeast region,

\[ \hat{t}_1 = (220)(97{,}629.81) = 21{,}478{,}558.2 \]

\[ \hat{V}(\hat{t}_1) = (220)^2 \left(1 - \frac{21}{220}\right) \frac{7{,}647{,}472{,}708}{21} = 1.594316 \times 10^{13} \]


Table 2: Estimates of the total number of farm acres and estimated variance of the total for each of the four strata

Region          Estimated total     Estimated variance of the total
Northeast        21,478,558.2       1.59432 × 10^13
North Central   316,731,379.4       2.88232 × 10^14
South           292,037,390.8       6.84076 × 10^14
West            279,488,706.1       1.55365 × 10^15
Total           909,736,034.4       2.5419 × 10^15


Table 3: Comparison between SRS and stratified random sampling for agriculture data

Design            Sample size   Estimated total \(\hat{t}\)   SE
SRS                   300           916,927,110               58,169,381
Stratification        300           909,736,034               50,417,248

• Observations within many strata tend to be more homogeneous than observations in the

population as a whole. Reduction in variance in the individual strata often leads to a

reduced variance for the population estimate

• Ratio of the estimated variances:

\[ \frac{\text{estimated variance from stratified sample, with } n = 300}{\text{estimated variance from SRS, with } n = 300} = \frac{2.5419 \times 10^{15}}{3.3837 \times 10^{15}} = 0.75 \]

• If these were the population variances, we would expect that we would need only (300)(0.75) =

225 observations with a stratified sample to obtain the same precision as from an SRS of

300 observations.
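The figures in Tables 1 to 3 can be reproduced from the stratum summaries alone. The sketch below is an illustration, not the instructor's code; the variable names are assumptions, and the inputs are the stratum sizes, sample sizes, averages, and variances from Table 1 together with the SRS standard error from Table 3.

```python
from math import sqrt

# (N_h, n_h, sample mean, sample variance) for each stratum, from Table 1
strata = {
    "Northeast":     (220,  21,  97_629.8,  7_647_472_708),
    "North Central": (1054, 103, 300_504.2, 29_618_183_543),
    "South":         (1382, 135, 211_315.0, 53_587_487_856),
    "West":          (422,  41,  662_295.5, 396_185_950_266),
}

# Stratum totals N_h * ybar_h and their estimated variances
# N_h^2 * (1 - n_h/N_h) * s_h^2 / n_h
t_hat = {h: N * ybar for h, (N, n, ybar, s2) in strata.items()}
v_hat = {h: N**2 * (1 - n / N) * s2 / n for h, (N, n, ybar, s2) in strata.items()}

t_str = sum(t_hat.values())   # stratified estimate of the total
v_str = sum(v_hat.values())   # variances add because strata are sampled independently
se_str = sqrt(v_str)

se_srs = 58_169_381           # SRS standard error reported in Table 3
ratio = v_str / se_srs**2     # estimated variance ratio, stratified vs. SRS

print(f"t_str = {t_str:,.0f}, SE = {se_str:,.0f}, variance ratio = {ratio:.2f}")
```

Small differences from the tabulated totals come from the rounding of the stratum averages in Table 1.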


Comments:

• Reduce variability by eliminating possible bad samples

• May want data of known precision for subgroups

• Lower cost, convenient

• Usually reduces variability when estimating quantities for the whole population


Theory of Stratified Sampling:

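Stratum            1          2          · · ·   H
Population size    \(N_1\)    \(N_2\)    · · ·   \(N_H\)     with \(\sum_{h=1}^{H} N_h = N\)
Sample size        \(n_1\)    \(n_2\)    · · ·   \(n_H\)     with \(\sum_{h=1}^{H} n_h = n\)
Population total   \(t_1\)    \(t_2\)    · · ·   \(t_H\)

• Take an SRS of size \(n_h\) from stratum h, independently for h = 1, ..., H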


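• Estimator of the population total:

\[ \hat{t}_{str} = \hat{t}_1 + \hat{t}_2 + \cdots + \hat{t}_H = N_1\bar{y}_1 + N_2\bar{y}_2 + \cdots + N_H\bar{y}_H \]

• Estimated variance of the total:

\[ \hat{V}(\hat{t}_{str}) = \hat{V}(\hat{t}_1) + \hat{V}(\hat{t}_2) + \cdots + \hat{V}(\hat{t}_H) = \sum_{h=1}^{H} \left(1 - \frac{n_h}{N_h}\right) N_h^2 \frac{s_h^2}{n_h} \]

• Estimator of the population mean:

\[ \bar{y}_{str} = \frac{\hat{t}_{str}}{N} = \frac{\sum_{h=1}^{H} \hat{t}_h}{N} = \frac{\sum_{h=1}^{H} N_h \bar{y}_h}{N} = \sum_{h=1}^{H} \frac{N_h}{N} \bar{y}_h \]

This is a weighted average of the stratum sample means.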


• Confidence intervals for stratified samples

—If either (1) the sample sizes within each stratum are large

—or (2) the sampling design has a large number of strata,

then, by a central limit theorem for stratified samples (Krewski and Rao 1981), an approximate 100(1 − α)% confidence interval for the population mean \(\bar{y}_U\) is

\[ \bar{y}_{str} \pm z_{\alpha/2}\, \mathrm{SE}(\bar{y}_{str}) \]

Some survey software packages use the percentile of a t distribution with n − H degrees of freedom rather than the percentile of the normal distribution.


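Population quantities (left) and sample quantities (right), where \(y_{hj}\) is the value of the jth unit in stratum h and \(\mathcal{S}_h\) is the set of sampled units in stratum h:

\[ t_h = \sum_{j=1}^{N_h} y_{hj} \qquad\qquad \hat{t}_h = \frac{N_h}{n_h} \sum_{j \in \mathcal{S}_h} y_{hj} = N_h \bar{y}_h \]

\[ t = \sum_{h=1}^{H} t_h \qquad\qquad \hat{t}_{str} = \sum_{h=1}^{H} \hat{t}_h = \sum_{h=1}^{H} N_h \bar{y}_h \]

\[ \bar{y}_{hU} = \frac{1}{N_h} \sum_{j=1}^{N_h} y_{hj} \qquad\qquad \bar{y}_h = \frac{1}{n_h} \sum_{j \in \mathcal{S}_h} y_{hj} \]

\[ \bar{y}_U = \frac{t}{N} = \frac{1}{N} \sum_{h=1}^{H} \sum_{j=1}^{N_h} y_{hj} \qquad\qquad \bar{y}_{str} = \frac{\hat{t}_{str}}{N} = \sum_{h=1}^{H} \frac{N_h}{N} \bar{y}_h \]

\[ S_h^2 = \sum_{j=1}^{N_h} \frac{(y_{hj} - \bar{y}_{hU})^2}{N_h - 1} \qquad\qquad s_h^2 = \sum_{j \in \mathcal{S}_h} \frac{(y_{hj} - \bar{y}_h)^2}{n_h - 1} \]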

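\[ \hat{t}_{str} = \hat{t}_1 + \hat{t}_2 + \cdots + \hat{t}_H = N_1\bar{y}_1 + N_2\bar{y}_2 + \cdots + N_H\bar{y}_H \]

\[ \hat{V}(\hat{t}_{str}) = \hat{V}(\hat{t}_1) + \hat{V}(\hat{t}_2) + \cdots + \hat{V}(\hat{t}_H) = \sum_{h=1}^{H} \left(1 - \frac{n_h}{N_h}\right) N_h^2 \frac{s_h^2}{n_h} \]

\[ \bar{y}_{str} = \frac{\hat{t}_{str}}{N} = \sum_{h=1}^{H} \frac{N_h}{N} \bar{y}_h \]

\[ \hat{V}(\bar{y}_{str}) = \frac{1}{N^2} \hat{V}(\hat{t}_{str}) = \sum_{h=1}^{H} \left(1 - \frac{n_h}{N_h}\right) \left(\frac{N_h}{N}\right)^2 \frac{s_h^2}{n_h} \]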


Properties of the estimators:

• \(E[\hat{t}_{str}] = t\)

• \(E[\bar{y}_{str}] = \bar{y}_U\)

• \(\hat{V}(\hat{t}_{str})\) is an unbiased estimator of \(V(\hat{t}_{str})\)

• \(\hat{V}(\bar{y}_{str})\) is an unbiased estimator of \(V(\bar{y}_{str})\)


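\[ E[\hat{t}_{str}] = E\left[\sum_{h=1}^{H} N_h \bar{y}_h\right] = \sum_{h=1}^{H} N_h E(\bar{y}_h) = \sum_{h=1}^{H} N_h \bar{y}_{hU} = \sum_{h=1}^{H} t_h = t \]

\[ E[\bar{y}_{str}] = E\left[\frac{\hat{t}_{str}}{N}\right] = \frac{t}{N} = \bar{y}_U \]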


Stratified sampling for proportions

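Special case of the mean when

\[ y_i = \begin{cases} 1 & \text{if the unit has the characteristic} \\ 0 & \text{otherwise} \end{cases} \]

Then

\[ \bar{y}_h = \hat{p}_h, \qquad s_h^2 = \frac{n_h}{n_h - 1}\, \hat{p}_h (1 - \hat{p}_h) \]

\[ \hat{p}_{str} = \sum_{h=1}^{H} \frac{N_h}{N} \hat{p}_h \]

\[ \hat{V}(\hat{p}_{str}) = \sum_{h=1}^{H} \left(1 - \frac{n_h}{N_h}\right) \left(\frac{N_h}{N}\right)^2 \frac{\hat{p}_h (1 - \hat{p}_h)}{n_h - 1} \]

\[ \hat{t}_{str} = \sum_{h=1}^{H} N_h \hat{p}_h, \qquad \hat{V}(\hat{t}_{str}) = N^2\, \hat{V}(\hat{p}_{str}) \]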


Example 3.4. The American Council of Learned Societies (ACLS) used a stratified random sample of selected ACLS societies in seven disciplines to study publication patterns and computer and library use among scholars who belong to one of the member organizations of the ACLS. The data are shown in the following table.


Discipline          Membership (N_h)   # mailed   Valid returns (n_h)   Female members (%)
Literature               9100             915            636                   38
Classics                 1950             633            451                   27
Philosophy               5500             658            481                   18
History                 10850             855            611                   19
Linguistics              2100             667            493                   36
Political Science        5500             833            575                   13
Sociology                9000             824            588                   26
Totals                  44000            5385           3835

• Want to estimate the percentage and number of female members of the major societies in

those seven disciplines


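• Ignoring the nonresponse, and assuming no duplicate memberships,

\[ \hat{p}_{str} = \sum_{h=1}^{7} \frac{N_h}{N} \hat{p}_h = \frac{9100}{44000} \times .38 + \cdots + \frac{9000}{44000} \times .26 = .2465 \]

\[ \mathrm{SE}(\hat{p}_{str}) = \sqrt{\sum_{h=1}^{7} \left(1 - \frac{n_h}{N_h}\right) \left(\frac{N_h}{N}\right)^2 \frac{\hat{p}_h (1 - \hat{p}_h)}{n_h - 1}} = .0071 \]

The estimated total number of female members in the societies is

\[ \hat{t}_{str} = 44000 \times .2465 = 10847 \]

with

\[ \mathrm{SE}(\hat{t}_{str}) = 44000 \times .0071 = 312 \]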
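The arithmetic of Example 3.4 can be checked with a short script. This is a sketch for illustration; the dictionary `acls` and the variable names are assumptions, and the inputs are the membership sizes, valid returns, and percentages from the table above, with nonresponse ignored as in the example.

```python
from math import sqrt

# discipline: (N_h = membership, n_h = valid returns, proportion female)
acls = {
    "Literature":        (9100,  636, 0.38),
    "Classics":          (1950,  451, 0.27),
    "Philosophy":        (5500,  481, 0.18),
    "History":           (10850, 611, 0.19),
    "Linguistics":       (2100,  493, 0.36),
    "Political Science": (5500,  575, 0.13),
    "Sociology":         (9000,  588, 0.26),
}

N = sum(N_h for N_h, _, _ in acls.values())

# Stratified proportion: weighted average of the stratum proportions
p_str = sum(N_h / N * p for N_h, _, p in acls.values())

# Estimated variance with finite population correction in each stratum
v_p = sum(
    (1 - n_h / N_h) * (N_h / N) ** 2 * p * (1 - p) / (n_h - 1)
    for N_h, n_h, p in acls.values()
)
se_p = sqrt(v_p)

t_str = N * p_str   # estimated number of female members
se_t = N * se_p     # its standard error

print(f"p_str = {p_str:.4f}, SE = {se_p:.4f}, t_str = {t_str:.0f}, SE = {se_t:.0f}")
```

The standard error of the total computed from the unrounded SE differs slightly from the 312 shown above, which uses the rounded value .0071.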


Review: Stratified random sampling

Stratum            1          2          · · ·   H
Population size    \(N_1\)    \(N_2\)    · · ·   \(N_H\)     with \(\sum_{h=1}^{H} N_h = N\)
Sample size        \(n_1\)    \(n_2\)    · · ·   \(n_H\)     with \(\sum_{h=1}^{H} n_h = n\)
Population total   \(t_1\)    \(t_2\)    · · ·   \(t_H\)


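Population quantities (left) and sample quantities (right); \(y_{hj}\), the value of the jth unit in stratum h, is the same in both:

\[ t_h = \sum_{j=1}^{N_h} y_{hj} \qquad\qquad \hat{t}_h = \frac{N_h}{n_h} \sum_{j \in \mathcal{S}_h} y_{hj} = N_h \bar{y}_h \]

\[ t = \sum_{h=1}^{H} t_h \qquad\qquad \hat{t}_{str} = \sum_{h=1}^{H} \hat{t}_h = \sum_{h=1}^{H} N_h \bar{y}_h \]

\[ \bar{y}_{hU} = \frac{1}{N_h} \sum_{j=1}^{N_h} y_{hj} \qquad\qquad \bar{y}_h = \frac{1}{n_h} \sum_{j \in \mathcal{S}_h} y_{hj} \]

\[ \bar{y}_U = \frac{t}{N} = \frac{1}{N} \sum_{h=1}^{H} \sum_{j=1}^{N_h} y_{hj} \qquad\qquad \bar{y}_{str} = \frac{\hat{t}_{str}}{N} = \sum_{h=1}^{H} \frac{N_h}{N} \bar{y}_h \]

\[ S_h^2 = \sum_{j=1}^{N_h} \frac{(y_{hj} - \bar{y}_{hU})^2}{N_h - 1} \qquad\qquad s_h^2 = \sum_{j \in \mathcal{S}_h} \frac{(y_{hj} - \bar{y}_h)^2}{n_h - 1} \]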

Properties of the estimators:

• \(E[\hat{t}_{str}] = t\)

• \(E[\bar{y}_{str}] = \bar{y}_U\)

Confidence intervals for stratified samples

—If either (1) the sample sizes within each stratum are large

—or (2) the sampling design has a large number of strata,

then, by a central limit theorem for stratified samples (Krewski and Rao 1981), an approximate 100(1 − α)% confidence interval for the population mean \(\bar{y}_U\) is

\[ \bar{y}_{str} \pm z_{\alpha/2}\, \mathrm{SE}(\bar{y}_{str}) \]

Some survey software packages use the percentile of a t distribution with n − H degrees of freedom rather than the percentile of the normal distribution.


Using Weights

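Sampling weights: the weight \(w_{hj}\) is the number of units in the population represented by sample member (h, j), where h indexes the stratum and j the element within the stratum.

\[ \hat{t}_{str} = \sum_{h=1}^{H} N_h \bar{y}_h = \sum_{h=1}^{H} \sum_{j \in \mathcal{S}_h} \frac{N_h}{n_h}\, y_{hj} = \sum_{h=1}^{H} \sum_{j \in \mathcal{S}_h} w_{hj}\, y_{hj}, \qquad \text{where } w_{hj} = \frac{N_h}{n_h} \]

\[ \bar{y}_{str} = \frac{\sum_{h=1}^{H} \sum_{j \in \mathcal{S}_h} w_{hj}\, y_{hj}}{\sum_{h=1}^{H} \sum_{j \in \mathcal{S}_h} w_{hj}} \]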


Example: Suppose a population has 2000 units: 1600 of them are males (stratum 1) and 400 are females (stratum 2). If the sample has 400 units, 200 units from each stratum, then

\[ \pi_{1j} = \frac{200}{1600} = \frac{1}{8} \quad \text{and} \quad w_{1j} = \frac{1}{\pi_{1j}} = 8 \]

\[ \pi_{2j} = \frac{200}{400} = \frac{1}{2} \quad \text{and} \quad w_{2j} = \frac{1}{\pi_{2j}} = 2 \]

• each man in the sample represents 8 men in the population

• each woman in the sample represents 2 women in the population


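• \(\pi_{hj} = n_h / N_h\)

• \(w_{hj} = N_h / n_h\)

• \[ \hat{t}_{str} = \sum_{h=1}^{H} \hat{t}_h = \sum_{h=1}^{H} N_h \bar{y}_h = \sum_{h=1}^{H} \sum_{j \in \mathcal{S}_h} \frac{N_h}{n_h}\, y_{hj} = \sum_{h=1}^{H} \sum_{j \in \mathcal{S}_h} w_{hj}\, y_{hj} \]

• \[ V(\hat{t}_{str}) = \sum_{h=1}^{H} V(\hat{t}_h) = \sum_{h=1}^{H} N_h^2 \left(1 - \frac{n_h}{N_h}\right) \frac{S_h^2}{n_h} \]

• \[ \bar{y}_{str} = \frac{\hat{t}_{str}}{N} = \sum_{h=1}^{H} \frac{N_h}{N} \bar{y}_h = \frac{\sum_{h=1}^{H} \sum_{j \in \mathcal{S}_h} w_{hj}\, y_{hj}}{\sum_{h=1}^{H} \sum_{j \in \mathcal{S}_h} w_{hj}} \]

• \[ V(\bar{y}_{str}) = \frac{V(\hat{t}_{str})}{N^2} = \sum_{h=1}^{H} \frac{N_h^2}{N^2} \left(1 - \frac{n_h}{N_h}\right) \frac{S_h^2}{n_h} \]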


Comments:

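• Let \(\pi_{hj}\) be the probability of selecting unit j from stratum h. Then \(w_{hj} = 1/\pi_{hj} = N_h/n_h\)

• \[ \sum_{h=1}^{H} \sum_{j \in \mathcal{S}_h} w_{hj} = \sum_{h=1}^{H} \sum_{j \in \mathcal{S}_h} \frac{N_h}{n_h} = \sum_{h=1}^{H} N_h = N \]

—-The whole sample represents the entire population, and the sum of the weights is equal to the population size

• \(\hat{t}_{str} = \sum_{h=1}^{H} \sum_{j \in \mathcal{S}_h} w_{hj}\, y_{hj}\)

• \(\bar{y}_{str} = \sum_{h=1}^{H} \sum_{j \in \mathcal{S}_h} w_{hj}\, y_{hj} \Big/ \sum_{h=1}^{H} \sum_{j \in \mathcal{S}_h} w_{hj}\)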
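A minimal sketch of the weighted estimators, using the stratum sizes from the male/female example (1600 and 400, with 200 sampled from each). The measurements `y` are made-up values chosen only so that the script runs; they are not data from the course.

```python
# Stratum sizes from the example; the sampled y-values are hypothetical.
N_h = {"male": 1600, "female": 400}
sample = {
    "male": [10.0 + (j % 5) for j in range(200)],
    "female": [20.0 + (j % 3) for j in range(200)],
}

# w_hj = N_h / n_h is the same for every sampled unit in stratum h
weights = {h: N_h[h] / len(ys) for h, ys in sample.items()}

# Sum of the weights over all sampled units equals the population size N
sum_w = sum(weights[h] * len(ys) for h, ys in sample.items())

# Weighted estimators of the total and the mean
t_str = sum(weights[h] * y for h, ys in sample.items() for y in ys)
y_str = t_str / sum_w

# Same total computed as sum of N_h * ybar_h
t_check = sum(N_h[h] * sum(ys) / len(ys) for h, ys in sample.items())

print(weights, sum_w, t_str, y_str)
```

The weighted form is convenient in practice because a data file only needs to carry one weight column to reproduce the estimates.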


Back to the previous example. Suppose a population has 2000 units: 1600 of them are males (stratum 1) and 400 are females (stratum 2). If we randomly select 160 men from stratum 1 and 40 women from stratum 2,

\[ \pi_{1j} = \frac{160}{1600} = \frac{1}{10} \quad \text{and} \quad w_{1j} = \frac{1}{\pi_{1j}} = 10 \]

\[ \pi_{2j} = \frac{40}{400} = \frac{1}{10} \quad \text{and} \quad w_{2j} = \frac{1}{\pi_{2j}} = 10 \]

The number of sampled units in each stratum is proportional to the size of the stratum. We call this allocation method proportional allocation.


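Proportional Allocation: the number of sampled units in each stratum is proportional to the size of the stratum

\[ \frac{n_h}{N_h} = \frac{n}{N}, \qquad n_h = N_h \frac{n}{N} \]

\[ \pi_{hj} = \frac{n_h}{N_h} = \frac{n}{N} \quad \text{and} \quad w_{hj} = \frac{1}{\pi_{hj}} = \frac{N}{n} \]

The sample is self-weighting:

\[ \bar{y}_{str} = \sum_{h=1}^{H} \frac{N_h}{N} \bar{y}_h = \sum_{h=1}^{H} \frac{N_h}{N} \frac{\sum_{j \in \mathcal{S}_h} y_{hj}}{n_h} = \sum_{h=1}^{H} \frac{1}{n} \sum_{j \in \mathcal{S}_h} y_{hj} = \frac{1}{n} \sum_{h=1}^{H} \sum_{j \in \mathcal{S}_h} y_{hj} = \bar{y} \]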
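The self-weighting property can be verified numerically. The sketch below uses the stratum sizes from the example with n = 200; the y-values are hypothetical and serve only to make the check runnable.

```python
N_h = {"male": 1600, "female": 400}
N = sum(N_h.values())
n = 200

# Proportional allocation n_h = N_h * n / N (exact integers here; in general
# the result must be rounded)
n_h = {h: Nh * n // N for h, Nh in N_h.items()}

# Hypothetical sampled values, n_h of them in each stratum
sample = {
    "male": [10.0 + (j % 5) for j in range(n_h["male"])],
    "female": [20.0 + (j % 3) for j in range(n_h["female"])],
}

# Stratified estimator: weighted average of the stratum sample means
y_str = sum(N_h[h] / N * sum(ys) / len(ys) for h, ys in sample.items())

# Plain (unweighted) mean of all sampled values
y_plain = sum(y for ys in sample.values() for y in ys) / n

print(n_h, y_str, y_plain)
```

Under any other allocation the two quantities generally differ, and the weighted form must be used.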


Variances:

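\[ V_{prop}(\bar{y}_{str}) = \left(1 - \frac{n}{N}\right) \frac{1}{n} \sum_{h=1}^{H} \frac{N_h}{N} S_h^2 \]

\[ V_{prop}(\hat{t}_{str}) = \left(1 - \frac{n}{N}\right) \frac{N}{n} \sum_{h=1}^{H} N_h S_h^2 \]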


ANOVA Table

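Source            df       Sum of Squares

Between strata    H − 1    \( SSB = \sum_{h=1}^{H} \sum_{j=1}^{N_h} (\bar{y}_{hU} - \bar{y}_U)^2 = \sum_{h=1}^{H} N_h (\bar{y}_{hU} - \bar{y}_U)^2 \)

Within strata     N − H    \( SSW = \sum_{h=1}^{H} \sum_{j=1}^{N_h} (y_{hj} - \bar{y}_{hU})^2 = \sum_{h=1}^{H} (N_h - 1) S_h^2 \)

Total corrected   N − 1    \( SSTO = \sum_{h=1}^{H} \sum_{j=1}^{N_h} (y_{hj} - \bar{y}_U)^2 = (N - 1) S^2 \)

\( SSTO = SSB + SSW \)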

Comparison between SRS and proportional allocation

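\[ V(\hat{t}_{str}) = V\left(\sum_{h=1}^{H} N_h \bar{y}_h\right) = \sum_{h=1}^{H} N_h^2 \left(1 - \frac{n_h}{N_h}\right) \frac{S_h^2}{n_h} \]

Under proportional allocation, \(n_h = n N_h / N\), so

\[ V(\hat{t}_{str}) = \sum_{h=1}^{H} \left(1 - \frac{n}{N}\right) \frac{N}{n N_h} N_h^2 S_h^2 = \left(1 - \frac{n}{N}\right) \frac{N}{n} \sum_{h=1}^{H} N_h S_h^2 = \left(1 - \frac{n}{N}\right) \frac{N}{n} \left[ SSW + \sum_{h=1}^{H} S_h^2 \right] \]

\[ V(\hat{t}_{srs}) = \left(1 - \frac{n}{N}\right) N^2 \frac{S^2}{n} = \left(1 - \frac{n}{N}\right) \frac{N^2}{n} \frac{1}{N - 1} (SSW + SSB) \approx \left(1 - \frac{n}{N}\right) \frac{N}{n} (SSW + SSB) \]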


Proportional stratification is more efficient if

\[ \sum_{h=1}^{H} S_h^2 < SSB, \qquad \text{where } SSB = \sum_{h=1}^{H} N_h (\bar{y}_{hU} - \bar{y}_U)^2. \]

This is usually true, since the large population sizes of the strata will force \( N_h (\bar{y}_{hU} - \bar{y}_U)^2 > S_h^2 \).

Comments

• In general, the variance of the estimator of t from a stratified sample with proportional

allocation will be smaller than the variance of the estimator of t from SRS with the same

number of observations

• The more unequal the stratum means \(\bar{y}_{hU}\), and the more homogeneous the units within each stratum, the more precision you will gain by using proportional allocation.
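The decomposition SSTO = SSB + SSW and the comparison of the two variances can be illustrated on a tiny population. The population below is invented for the illustration (three strata with clearly different means); it is an assumption, not data from the course.

```python
from statistics import mean, variance

# Hypothetical population: three strata with well-separated means
strata = {
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0],
    "C": [20.0, 22.0],
}
n = 6  # total sample size; proportional allocation gives n_h = 2, 3, 1

everything = [y for ys in strata.values() for y in ys]
N = len(everything)
y_bar_U = mean(everything)

# statistics.variance divides by (size - 1), matching S_h^2 and S^2
ssb = sum(len(ys) * (mean(ys) - y_bar_U) ** 2 for ys in strata.values())
ssw = sum((len(ys) - 1) * variance(ys) for ys in strata.values())
ssto = (N - 1) * variance(everything)

fpc = 1 - n / N
# Variance of the estimated total under proportional allocation
v_prop = fpc * (N / n) * sum(len(ys) * variance(ys) for ys in strata.values())
# Variance of the estimated total under SRS with the same n
v_srs = fpc * N**2 * variance(everything) / n

print(f"SSB={ssb:.2f} SSW={ssw:.2f} SSTO={ssto:.2f} V_prop={v_prop:.2f} V_srs={v_srs:.2f}")
```

Here SSB dominates SSW, so proportional allocation gives a much smaller variance than SRS, in line with the comments above.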


Optimal Allocation

Example: Want to take a sample of American corporations to estimate the amount of trade with Europe

• The variation among large corporations would be greater than the variation among small ones

—-often, large units are more variable than small units

• Need to sample a higher percentage of the large corporations

• Proportional allocation won't work well in this situation

—-Proportional allocation uses the same sampling fraction within each stratum

—-If the variances \(S_h^2\) are similar, proportional allocation is a good choice

—-If the variances \(S_h^2\) vary substantially, we may want to take more observations from the strata with larger variances


Cost function

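\[ c = c_0 + \sum_{h=1}^{H} c_h n_h \]

where \(c_0\) is the overhead cost, such as maintaining an office, and \(c_h\) is the cost of sampling an observation in stratum h

• Want to minimize \(V(\hat{t}_{str})\) for a fixed cost c, or minimize c for a fixed \(V(\hat{t}_{str})\)

\[ V(\hat{t}_{str}) = \sum_{h=1}^{H} N_h^2 \left(1 - \frac{n_h}{N_h}\right) \frac{S_h^2}{n_h} = \sum_{h=1}^{H} N_h^2 \frac{S_h^2}{n_h} - \sum_{h=1}^{H} N_h S_h^2 \]

—–The second sum does not depend on the \(n_h\), so this is the same as minimizing \( \sum_{h=1}^{H} N_h^2 \frac{S_h^2}{n_h} \)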


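With a Lagrange multiplier λ for the cost constraint:

\[ f = \sum_{h=1}^{H} N_h^2 \frac{S_h^2}{n_h} + \lambda \left( c_0 + \sum_{h=1}^{H} c_h n_h - c \right) \]

\[ \frac{\partial f}{\partial n_h} = -\frac{N_h^2 S_h^2}{n_h^2} + \lambda c_h = 0 \quad \Longrightarrow \quad n_h = \frac{N_h S_h}{\sqrt{c_h \lambda}} \]

By the fact that \(\sum_h n_h = n\), we have

\[ \frac{1}{\sqrt{\lambda}} = \frac{n}{\sum_{l=1}^{H} N_l S_l / \sqrt{c_l}} \]

\[ n_{h,opt} = n \times \left( \frac{N_h S_h / \sqrt{c_h}}{\sum_{l=1}^{H} N_l S_l / \sqrt{c_l}} \right) \]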


\[ n_{h,opt} \propto \frac{N_h S_h}{\sqrt{c_h}} \]

We take a larger sample from stratum h if

• The stratum size \(N_h\) is large

• The variability within the stratum, measured by \(S_h\), is large

• Sampling within the stratum is inexpensive (\(c_h\) is small)
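The allocation rule can be written as a small function. This is a sketch under stated assumptions: the function name `optimal_allocation` and the numerical inputs (stratum standard deviations and costs) are hypothetical, chosen only to show the effect of each factor.

```python
from math import sqrt

def optimal_allocation(n, N_h, S_h, c_h=None):
    """Split n across strata in proportion to N_h * S_h / sqrt(c_h).

    Returns real-valued n_h that sum to n. In practice they must be rounded,
    and any n_h exceeding N_h must be capped at N_h.
    """
    if c_h is None:
        c_h = [1.0] * len(N_h)  # equal costs: this is Neyman allocation
    scores = [N * S / sqrt(c) for N, S, c in zip(N_h, S_h, c_h)]
    total = sum(scores)
    return [n * s / total for s in scores]

# Hypothetical inputs: a large, homogeneous stratum and a small, variable one
N_h = [1600, 400]
S_h = [2.0, 8.0]

neyman = optimal_allocation(100, N_h, S_h)
with_costs = optimal_allocation(100, N_h, S_h, c_h=[1.0, 4.0])
print(neyman, with_costs)
```

With equal costs the small but highly variable stratum receives half the sample; making its observations four times as expensive shifts the allocation back toward the cheaper stratum.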


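\[ n_{h,opt} = n \times \left( \frac{N_h S_h / \sqrt{c_h}}{\sum_{l=1}^{H} N_l S_l / \sqrt{c_l}} \right) \]

Neyman allocation: the \(c_h\) are all equal, so

\[ n_{h,Neyman} = n \times \left( \frac{N_h S_h}{\sum_{l=1}^{H} N_l S_l} \right) \]

Let \( a = \dfrac{n}{\sum_{l=1}^{H} N_l S_l} \), so that \( n_{h,Neyman} = a N_h S_h \). Then

\[
\begin{aligned}
V(\hat{t}_{str,Neyman}) &= \sum_{h=1}^{H} \left(1 - \frac{n_h}{N_h}\right) \frac{N_h^2 S_h^2}{n_h} \\
&= \sum_{h=1}^{H} \left(1 - \frac{a N_h S_h}{N_h}\right) \frac{N_h^2 S_h^2}{a N_h S_h} \\
&= \sum_{h=1}^{H} (1 - a S_h) \frac{N_h S_h}{a} \\
&= \sum_{h=1}^{H} \left(1 - \frac{n}{\sum_{l=1}^{H} N_l S_l} S_h\right) N_h S_h \frac{\sum_{l=1}^{H} N_l S_l}{n} \\
&= \frac{\left(\sum_{h=1}^{H} N_h S_h\right) \left(\sum_{l=1}^{H} N_l S_l\right)}{n} - \sum_{h=1}^{H} N_h S_h^2
\end{aligned}
\]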


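\[
\begin{aligned}
V(\hat{t}_{str,prop}) &= \sum_{h=1}^{H} \left(1 - \frac{n_h}{N_h}\right) \frac{N_h^2}{n_h} S_h^2 \\
&= \sum_{h=1}^{H} \left(1 - \frac{n}{N}\right) \frac{N}{n} N_h S_h^2 \\
&= \sum_{h=1}^{H} \frac{N}{n} N_h S_h^2 - \sum_{h=1}^{H} N_h S_h^2
\end{aligned}
\]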


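The two variances differ only in their leading terms. Expanding each (times n):

\[ \sum_{h=1}^{H} N_h S_h \sum_{l=1}^{H} N_l S_l = \sum_{h=1}^{H} N_h^2 S_h^2 + 2 \sum_{i=1}^{H} \sum_{j > i} N_i N_j S_i S_j \]

\[ \sum_{h=1}^{H} N N_h S_h^2 = \sum_{h=1}^{H} N_h^2 S_h^2 + \sum_{i=1}^{H} \sum_{j > i} N_i N_j (S_i^2 + S_j^2) \]

Since \( S_i^2 + S_j^2 \ge 2 S_i S_j \), the first expression is no larger than the second, so

\[ V(\hat{t}_{str,Neyman}) \le V(\hat{t}_{str,prop}) \]

Relative precision of stratification and SRS

\[ V(\hat{t}_{str,Neyman}) \le V(\hat{t}_{str,prop}) \le V_{srs}(\hat{t}) \]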
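The ordering of the Neyman and proportional variances, and the closed forms derived above, can be checked numerically. The stratum sizes and standard deviations below are hypothetical, and the helper `var_total` is an illustrative name, not a library function.

```python
def var_total(N_h, S_h, n_h):
    """V(t_str) = sum of N_h^2 * (1 - n_h/N_h) * S_h^2 / n_h over strata."""
    return sum(N**2 * (1 - n / N) * S**2 / n for N, S, n in zip(N_h, S_h, n_h))

# Hypothetical population summaries
N_h = [1600, 400]
S_h = [2.0, 8.0]
n = 100
N = sum(N_h)

# Proportional allocation: n_h = n * N_h / N
n_prop = [n * Nh / N for Nh in N_h]

# Neyman allocation: n_h = n * N_h * S_h / sum(N_l * S_l)
sum_NS = sum(Nh * Sh for Nh, Sh in zip(N_h, S_h))
n_ney = [n * Nh * Sh / sum_NS for Nh, Sh in zip(N_h, S_h)]

v_prop = var_total(N_h, S_h, n_prop)
v_ney = var_total(N_h, S_h, n_ney)

# Closed forms from the derivations above
sum_NS2 = sum(Nh * Sh**2 for Nh, Sh in zip(N_h, S_h))
v_ney_closed = sum_NS**2 / n - sum_NS2
v_prop_closed = (N / n) * sum_NS2 - sum_NS2

print(n_prop, n_ney, v_prop, v_ney)
```

The gap between the two variances grows as the stratum standard deviations become more unequal; when all S_h are equal the two allocations coincide.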


Example 3.9: Dollar stratification is often used in accounting. The recorded book amounts are used to stratify the population. When auditing the loan amounts for a financial institution:

• stratum 1 might consist of all loans of more than $1 million; \(S_h^2\) will be much larger in this stratum, so it needs a higher sampling fraction

• stratum 2 might consist of loans between $500,000 and $999,999

• · · ·

• the smallest stratum might consist of loans of less than $10,000

• Optimal allocation is often an efficient strategy for such a stratification

— If the goal of the audit is to estimate the dollar discrepancy between the audited amounts and the amounts in the institution's books, an error in the recorded amount of one of the $3,000,000 loans is likely to contribute more to the audited difference than an error in the recorded amount of one of the $3,000 loans. In a survey such as this, you may even want to use sample size \(N_1\) in stratum 1, that is, audit every loan in that stratum.


Some design issues of stratified random sampling

• Allocating observations to strata

—-Proportional allocation: \( \dfrac{n_h}{N_h} = \dfrac{n}{N} \)

—-Optimal allocation; Neyman allocation is the special case in which the \(c_h\) are all equal:

\[ n_{h,Neyman} = n\, \frac{N_h S_h}{\sum_{l=1}^{H} N_l S_l} \]

• Sample size

• Defining strata: variables and number of strata


Determining sample size

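\[ V(\hat{t}_{str}) = \sum_{h=1}^{H} N_h^2 \left(1 - \frac{n_h}{N_h}\right) \frac{S_h^2}{n_h} \le \sum_{h=1}^{H} N_h^2 \cdot \frac{S_h^2}{n_h} = \frac{1}{n} \sum_{h=1}^{H} \frac{n}{n_h} N_h^2 S_h^2 = \frac{v}{n}, \qquad \text{where } v = \sum_{h=1}^{H} \frac{n}{n_h} N_h^2 S_h^2 \]

• v depends on the stratum sizes \(N_h\), the variances \(S_h^2\), and the relative sample sizes \(n_h/n\)

• v can be thought of as the "average" variability per observation unit in a stratified random sample with the specified allocation

An approximate 100(1 − α)% CI (95% when \(z_{\alpha/2} = 1.96\)) is

\[ \hat{t}_{str} \pm z_{\alpha/2} \sqrt{v/n} \]

Setting the margin of error equal to e and solving for n:

\[ z_{\alpha/2} \sqrt{v/n} = e, \qquad n = \frac{z_{\alpha/2}^2\, v}{e^2} \]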
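The sample size formula can be sketched as follows. The function name `stratified_sample_size`, the stratum summaries, and the target margin of error are all hypothetical; `fractions` holds the relative sample sizes n_h/n of the chosen allocation.

```python
from math import ceil
from statistics import NormalDist

def stratified_sample_size(N_h, S_h, fractions, e, alpha=0.05):
    """Return (n, v) with n = z^2 * v / e^2 and v = sum (n/n_h) N_h^2 S_h^2.

    fractions[h] = n_h / n. The finite population correction is ignored,
    as in the bound above, so the result is conservative.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    v = sum(N**2 * S**2 / f for N, S, f in zip(N_h, S_h, fractions))
    return ceil(z**2 * v / e**2), v

# Hypothetical inputs with proportional allocation (n_h/n = N_h/N)
N_h = [1600, 400]
S_h = [2.0, 8.0]
fractions = [0.8, 0.2]

n, v = stratified_sample_size(N_h, S_h, fractions, e=1500.0)
print(n, v)
```

Once n is found, the stratum sample sizes follow as n_h = n × fractions[h], rounded and capped at N_h where necessary.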


Defining Strata:

1. Variables for stratification

• Highly associated with variables of interest

—–For estimating total business expenditures on advertising, we might stratify by number of employees or size of the business and by the type of product or service

—–For farm income, we might use the size of the farm as a stratifying variable, since we expect that larger farms would have higher incomes

• Known for all sampling units in the population


2. Number of strata:

• Depends upon many factors, such as the difficulty in constructing a sampling frame with stratifying information, and the cost of stratifying

• Formulas in literature

• Pilot study

• General rule: the more information you have about the population, the more strata you should use. You should use an SRS when little prior information about the target population is available.


Recall: Relative precision of stratification and SRS

\[ V(\hat{t}_{str,opt}) \le V(\hat{t}_{str,prop}) \le V_{srs}(\hat{t}) \]

1. If stratified sampling provides higher precision than SRS, why conduct an SRS?

• Stratification adds complexity to the survey, which may not be worth a small gain in precision

• Stratification requires knowing which units, and how many units, belong to each stratum

2. When is stratified sampling efficient?

• SSB is large (stratum means differ greatly)

• SSW is small (variability within strata is small)


Example: National Pesticide Survey (NPS)

Between 1988 and 1990, the US Environmental Protection Agency (EPA) sampled drinking water wells to estimate the prevalence of pesticides and nitrate.

• Want a sample that is representative of drinking water wells in the United States

• Want to guarantee that wells in the sample have a wide range of levels of pesticide use and susceptibility to ground-water pollution

• Want to study two categories of wells:

—(1) community water system (CWS) wells: systems of piped drinking water with at least 15 connections and/or 25 or more permanent residents, and with at least one working well

—(2) rural domestic wells: wells supplying occupied housing in rural areas, not on government property


1. Frame issue: how many drinking wells exist in the United States?

• For CWS, a list with addresses is in the Federal Reporting Data System (FRDS), maintained by EPA. There are approximately 51,000 CWSs.

• The 1980 census data are used to estimate the number of rural domestic wells. There are about 13 million rural domestic wells.


2. Stratification issue: EPA chose a stratified design. Which variables are used to construct the strata?

• EPA developed criteria for separating the population of CWS wells and rural domestic wells into four categories of pesticide use and three relative ground-water vulnerability measures. This design ensures that the range of variability that exists nationally with respect to the agricultural use of pesticides and ground-water vulnerability is reflected in the sample of wells.

• Pesticide use obtained from

—marketing research

—proportion of county in agricultural use

• Ground-water vulnerability measures (by DRASTIC)

• Four categories of pesticide use (high, moderate, low, uncommon) and three categories of ground-water vulnerability (high, moderate, low) give 12 strata

Table 4: Strata for National Pesticide Survey

Stratum   Pesticide use   Groundwater vulnerability    Number of counties
                          (estimated by DRASTIC)
   1      high            high                               106
   2      high            moderate                           234
   3      high            low                                129
   4      moderate        high                               110
   5      moderate        moderate                           204
   6      moderate        low                                267
   7      low             high                               193
   8      low             moderate                           375
   9      low             low                                404
  10      uncommon        high                               186
  11      uncommon        moderate                           513
  12      uncommon        low                                416


3. Design considerations

—For CWS, assume 0.5% of wells contain pesticides; choose n so that the probability of detection is 90%.

—For rural wells, there were some subgroups of particular interest; assume a 1% rate and 97% probability of detection.

—Resulting sample sizes: n = 564 public (CWS) wells and n = 734 private rural wells


4. Rural wells

—-Each county (N = 3137) is categorized according to the stratification variables.

—-Sample counties.

—-Characterize pesticide use and groundwater vulnerability for subcounty areas.

—-No subcounty area selection for CWS wells


Model-based inference for stratified sampling

• The one-way ANOVA model with fixed effects provides an underlying structure for stratified sampling.

\[ y_{hj} = \mu_h + \varepsilon_{hj} \qquad (1) \]

where the \(\varepsilon_{hj}\) are independent with mean 0 and variance \(\sigma_h^2\).

• The least squares estimator of \(\mu_h\) is \(\bar{y}_h\), the sample average in stratum h


Estimators and Properties:

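• \( T_h = \sum_{j=1}^{N_h} y_{hj} \): the total in stratum h

• \( T = \sum_{h=1}^{H} T_h \): the overall total

• Note that both \(T_h\) and \(T\) are random variables

• The best linear unbiased estimator for \(T_h\) is \( \hat{T}_h = \dfrac{N_h}{n_h} \sum_{j \in \mathcal{S}_h} y_{hj} \).

• \( E_M[\hat{T}_h - T_h] = 0 \)

• \( E_M[(\hat{T}_h - T_h)^2] = N_h^2 \left(1 - \dfrac{n_h}{N_h}\right) \dfrac{\sigma_h^2}{n_h} \)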


By the fact that observations in different strata are independent under the model

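\[
\begin{aligned}
E_M[(\hat{T} - T)^2] &= E_M\left[ \left\{ \sum_{h=1}^{H} (\hat{T}_h - T_h) \right\}^2 \right] \\
&= E_M\left[ \sum_{h=1}^{H} (\hat{T}_h - T_h)^2 + \sum_{h=1}^{H} \sum_{k \ne h} (\hat{T}_h - T_h)(\hat{T}_k - T_k) \right] \\
&= E_M\left[ \sum_{h=1}^{H} (\hat{T}_h - T_h)^2 \right] \\
&= \sum_{h=1}^{H} N_h^2 \left(1 - \frac{n_h}{N_h}\right) \frac{\sigma_h^2}{n_h}
\end{aligned}
\]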


Comments:

• The theoretical variance \(\sigma_h^2\) can be estimated by \(s_h^2\)

• Adopting the model in (1) results in the same estimator of t and the same standard error as found under randomization theory.

• If a different model is used, however, then different estimators

are obtained.
