Stat472/572 Sampling: Theory and Practice
Instructor: Yan Lu
Chapter 3: Stratified Sampling
Example: a population contains 1000 males and 100 females.
• Suppose we take an SRS of size 55 from the population. The sample may contain no females at all.
-- Most people would not consider such a sample representative of the population, since men and women might respond differently on the item of interest
• With a stratified sample, we can take 50 males and 5 females
-- a sample with no or few females cannot be selected, so we are protected from the possibility of obtaining a really bad sample
-- stratification increases the precision of the estimators
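How likely is the bad sample in this example? A minimal sketch, assuming the SRS is drawn without replacement, so that the number of females in the sample follows a hypergeometric distribution. The function name is illustrative and not part of the course material.

```python
from math import comb

def prob_no_females(n_male: int, n_female: int, n_sample: int) -> float:
    """P(an SRS without replacement of size n_sample contains no females)."""
    total = n_male + n_female
    # Samples made up entirely of males, divided by all possible samples.
    return comb(n_male, n_sample) / comb(total, n_sample)

p = prob_no_females(1000, 100, 55)
print(f"P(no females in an SRS of 55) = {p:.4f}")
```

The probability is small but not zero, whereas the stratified design with 50 males and 5 females rules the outcome out entirely.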
Stratified Sampling
• Divide population into H subpopulations, called strata. The
strata do not overlap and they constitute the whole population
• Each sampling unit belongs to exactly one stratum
• Draw an independent probability sample from each stratum
• Pool the information to obtain overall population estimates
Figure 1: Stratification
Example 3.2: Agriculture survey (Refer to Example 2.5)
• In Example 2.5, we generated a random sample, but some areas were overrepresented and others were not represented at all
• Part of the large variability arises because counties in the western United States are larger, and thus tend to have larger values of y, than counties in the eastern United States
• Taking a stratified sample can provide some balance in the sample on the
stratifying variable
• We use the four census regions of the United States: Northeast (NE), North
Central (NC), South (S), and West (W) strata, and sample about 10% of the
counties in each stratum.
Figure 2: Boxplot of data from example 3.2. The thick line for each region is the median of
the sample data from that region; the other horizontal lines in the boxes are the 25th and 75th
percentiles. The Northeast region has a relatively small median and small variance; the West
region, however, has a much higher median and variance. The distribution of farm acreage
appears to be positively skewed in each of the regions.
[Figure 2 plot: boxplots of farm acreage by region (NC, NE, S, W); x-axis: Region; y-axis: Millions of Acres, 0.0 to 2.0]
Stratum          # of counties in stratum   # of counties in sample
Northeast                  220                       21
North Central             1054                      103
South                     1382                      135
West                       422                       41
Total                     3078                      300
Table 1: Summary statistics for each stratum
Region          Stratum size   Sample size   Average (acres)   Variance
Northeast            220            21           97,629.8         7,647,472,708
North Central       1054           103          300,504.2        29,618,183,543
South               1382           135          211,315.0        53,587,487,856
West                 422            41          662,295.5       396,185,950,266
• We took an SRS in each stratum. For the Northeast region,
\[ \hat{t}_1 = (220)(97{,}629.81) = 21{,}478{,}558.2 \]
\[ \hat{V}(\hat{t}_1) = (220)^2 \left(1 - \frac{21}{220}\right) \frac{7{,}647{,}472{,}708}{21} = 1.594316 \times 10^{13} \]
Table 2: Estimates of the total number of farm acres and estimated variance of the total for each of the four strata
Region          Estimated total    Estimated variance of the total
Northeast          21,478,558.2      1.59432 × 10^13
North Central     316,731,379.4      2.88232 × 10^14
South             292,037,390.8      6.84076 × 10^14
West              279,488,706.1      1.55365 × 10^15
Total             909,736,034.4      2.5419 × 10^15
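The stratum-by-stratum arithmetic behind Table 2 can be sketched in a few lines. This is an illustrative recomputation from the rounded summary statistics in Table 1, not the instructor's code. It assumes the North Central stratum size is 1054 (the value that sums to 3078 counties) and a North Central variance of 29,618,183,543, so the totals differ from Table 2 only by rounding.

```python
from math import sqrt

# (N_h, n_h, sample mean, sample variance) for each stratum, taken from Table 1.
strata = {
    "Northeast":     (220,  21,  97_629.8,  7_647_472_708),
    "North Central": (1054, 103, 300_504.2, 29_618_183_543),
    "South":         (1382, 135, 211_315.0, 53_587_487_856),
    "West":          (422,  41,  662_295.5, 396_185_950_266),
}

t_str = 0.0   # estimated population total
v_str = 0.0   # estimated variance of the estimated total
for region, (N_h, n_h, ybar_h, s2_h) in strata.items():
    t_h = N_h * ybar_h                            # estimated stratum total
    v_h = N_h**2 * (1 - n_h / N_h) * s2_h / n_h   # estimated variance of t_h
    t_str += t_h
    v_str += v_h
    print(f"{region:14s} {t_h:16,.1f} {v_h:.5e}")

se_str = sqrt(v_str)
print(f"Total {t_str:,.1f}  variance {v_str:.4e}  SE {se_str:,.0f}")
```

Because the strata are sampled independently, the variances simply add, which is why the loop accumulates `v_h` without any covariance terms.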
Table 3: Comparison between SRS and stratified random sampling for agriculture data
                 Sample size   Estimated total   SE
SRS                  300         916,927,110     58,169,381
Stratification       300         909,736,034     50,417,248
• Observations within many strata tend to be more homogeneous than observations in the
population as a whole. Reduction in variance in the individual strata often leads to a
reduced variance for the population estimate
• \[ \frac{\text{estimated variance from stratified sample, with } n = 300}{\text{estimated variance from SRS, with } n = 300} = \frac{2.5419 \times 10^{15}}{3.3837 \times 10^{15}} = 0.75 \]
• If these were the population variances, we would expect that we would need only (300)(0.75) =
225 observations with a stratified sample to obtain the same precision as from an SRS of
300 observations.
Comments (reasons to stratify):
• Reduces variability by eliminating possible bad samples
• We may want data of known precision for subgroups
• Can lower cost and be more convenient to administer
• Usually reduces the variance of estimates for the whole population
Theory of Stratified Sampling:
Stratum             1     2     ...   H
Population size     N_1   N_2   ...   N_H     \(\sum_{h=1}^{H} N_h = N\)
Sample size         n_1   n_2   ...   n_H     \(\sum_{h=1}^{H} n_h = n\)
Population total    t_1   t_2   ...   t_H
• Take an SRS of size \(n_h\) from stratum \(h\), for \(h = 1, \dots, H\)
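• \[ \hat{t}_{str} = \hat{t}_1 + \hat{t}_2 + \cdots + \hat{t}_H = N_1\bar{y}_1 + N_2\bar{y}_2 + \cdots + N_H\bar{y}_H \]
• \[ \hat{V}(\hat{t}_{str}) = \hat{V}(\hat{t}_1) + \hat{V}(\hat{t}_2) + \cdots + \hat{V}(\hat{t}_H) = \sum_{h=1}^{H} \left(1 - \frac{n_h}{N_h}\right) N_h^2 \frac{s_h^2}{n_h} \]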
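• \[ \bar{y}_{str} = \frac{\hat{t}_{str}}{N} = \frac{\sum_{h=1}^{H} \hat{t}_h}{N} = \frac{\sum_{h=1}^{H} N_h \bar{y}_h}{N} = \sum_{h=1}^{H} \frac{N_h}{N} \bar{y}_h \]
Weighted average of stratum means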
• Confidence intervals for stratified samples
If either (1) the sample sizes within each stratum are large, or (2) the sampling design has a large number of strata, then by a central limit theorem (Krewski and Rao 1981) an approximate 100(1 − α)% confidence interval for the population mean \(\bar{y}_U\) is
\[ \bar{y}_{str} \pm z_{\alpha/2}\, SE(\bar{y}_{str}) \]
Some survey software packages use the percentile of a t distribution with n − H degrees of freedom rather than the percentile of the normal distribution
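As an illustration of the interval above, the following sketch applies the normal-percentile form to the stratified estimate of total farm acreage (909,736,034 with SE 50,417,248, from the agriculture example) and, dividing by N = 3078, to the mean per county. The helper name `normal_ci` is hypothetical. The t-based variant with n − H degrees of freedom is not shown because it needs a t quantile function from outside the standard library.

```python
from statistics import NormalDist

def normal_ci(estimate: float, se: float, conf_level: float = 0.95):
    """Approximate CI: estimate +/- z_{alpha/2} * SE, using the normal percentile."""
    alpha = 1 - conf_level
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return estimate - z * se, estimate + z * se

N = 3078
t_hat, se_t = 909_736_034, 50_417_248
lower_t, upper_t = normal_ci(t_hat, se_t)                  # CI for the total
lower_m, upper_m = normal_ci(t_hat / N, se_t / N)          # CI for the mean per county
print(f"total: ({lower_t:,.0f}, {upper_t:,.0f})")
print(f"mean:  ({lower_m:,.1f}, {upper_m:,.1f})")
```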
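Population quantities and the corresponding sample quantities (\(y_{hj}\): value of the jth unit in stratum h)
Stratum total:     \( t_h = \sum_{j=1}^{N_h} y_{hj} \),   estimated by   \( \hat{t}_h = \frac{N_h}{n_h} \sum_{j \in S_h} y_{hj} = N_h \bar{y}_h \)
Population total:  \( t = \sum_{h=1}^{H} t_h \),   estimated by   \( \hat{t}_{str} = \sum_{h=1}^{H} \hat{t}_h = \sum_{h=1}^{H} N_h \bar{y}_h \)
Stratum mean:      \( \bar{y}_{hU} = \frac{1}{N_h} \sum_{j=1}^{N_h} y_{hj} \),   estimated by   \( \bar{y}_h = \frac{1}{n_h} \sum_{j \in S_h} y_{hj} \)
Population mean:   \( \bar{y}_U = \frac{t}{N} = \frac{1}{N} \sum_{h=1}^{H} \sum_{j=1}^{N_h} y_{hj} \),   estimated by   \( \bar{y}_{str} = \frac{\hat{t}_{str}}{N} = \sum_{h=1}^{H} \frac{N_h}{N} \bar{y}_h \)
Stratum variance:  \( S_h^2 = \sum_{j=1}^{N_h} \frac{(y_{hj} - \bar{y}_{hU})^2}{N_h - 1} \),   estimated by   \( s_h^2 = \sum_{j \in S_h} \frac{(y_{hj} - \bar{y}_h)^2}{n_h - 1} \)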
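\[ \hat{t}_{str} = \hat{t}_1 + \hat{t}_2 + \cdots + \hat{t}_H = N_1\bar{y}_1 + N_2\bar{y}_2 + \cdots + N_H\bar{y}_H \]
\[ \hat{V}(\hat{t}_{str}) = \hat{V}(\hat{t}_1) + \hat{V}(\hat{t}_2) + \cdots + \hat{V}(\hat{t}_H) = \sum_{h=1}^{H} \left(1 - \frac{n_h}{N_h}\right) N_h^2 \frac{s_h^2}{n_h} \]
\[ \bar{y}_{str} = \frac{\hat{t}_{str}}{N} = \sum_{h=1}^{H} \frac{N_h}{N} \bar{y}_h \]
\[ \hat{V}(\bar{y}_{str}) = \frac{1}{N^2} \hat{V}(\hat{t}_{str}) = \sum_{h=1}^{H} \left(1 - \frac{n_h}{N_h}\right) \left(\frac{N_h}{N}\right)^2 \frac{s_h^2}{n_h} \]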
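Properties of the estimators:
• \( E[\hat{t}_{str}] = t \)
• \( E[\bar{y}_{str}] = \bar{y}_U \)
• \( \hat{V}(\hat{t}_{str}) \) is an unbiased estimator of \( V(\hat{t}_{str}) \)
• \( \hat{V}(\bar{y}_{str}) \) is an unbiased estimator of \( V(\bar{y}_{str}) \)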
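\[ E[\hat{t}_{str}] = E\left[\sum_{h=1}^{H} N_h \bar{y}_h\right] = \sum_{h=1}^{H} N_h E(\bar{y}_h) = \sum_{h=1}^{H} N_h \bar{y}_{hU} = \sum_{h=1}^{H} t_h = t \]
\[ E[\bar{y}_{str}] = E\left[\frac{\hat{t}_{str}}{N}\right] = \frac{t}{N} = \bar{y}_U \]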
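Stratified sampling for proportions
Special case of the mean when
\[ y_i = \begin{cases} 1 & \text{if the unit has the characteristic} \\ 0 & \text{otherwise} \end{cases} \]
\[ \bar{y}_h = \hat{p}_h, \qquad s_h^2 = \frac{n_h}{n_h - 1} \hat{p}_h (1 - \hat{p}_h) \]
\[ \hat{p}_{str} = \sum_{h=1}^{H} \frac{N_h}{N} \hat{p}_h \]
\[ \hat{V}(\hat{p}_{str}) = \sum_{h=1}^{H} \left(1 - \frac{n_h}{N_h}\right) \left(\frac{N_h}{N}\right)^2 \frac{\hat{p}_h (1 - \hat{p}_h)}{n_h - 1} \]
\[ \hat{t}_{str} = \sum_{h=1}^{H} N_h \hat{p}_h, \qquad \hat{V}(\hat{t}_{str}) = N^2\, \hat{V}(\hat{p}_{str}) \]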
Example 3.4. The American Council of Learned Societies (ACLS) used a stratified random sample of selected ACLS societies in seven disciplines to study publication patterns and computer and library use among scholars who belong to one of the member organizations of the ACLS. The data are shown in the following table.
Discipline          Membership (N_h)   Number mailed   Valid returns (n_h)   Female members (%)
Literature                9100              915               636                  38
Classics                  1950              633               451                  27
Philosophy                5500              658               481                  18
History                  10850              855               611                  19
Linguistics               2100              667               493                  36
Political Science         5500              833               575                  13
Sociology                 9000              824               588                  26
Totals                   44000             5385              3835
• Want to estimate the percentage and number of female members of the major societies in
those seven disciplines
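• Ignoring the nonresponse and assuming no duplicate memberships,
\[ \hat{p}_{str} = \sum_{h=1}^{7} \frac{N_h}{N} \hat{p}_h = \frac{9100}{44000}(0.38) + \cdots + \frac{9000}{44000}(0.26) = 0.2465 \]
\[ SE(\hat{p}_{str}) = \sqrt{\sum_{h=1}^{7} \left(1 - \frac{n_h}{N_h}\right) \left(\frac{N_h}{N}\right)^2 \frac{\hat{p}_h (1 - \hat{p}_h)}{n_h - 1}} = 0.0071 \]
The estimated total number of female members in the societies is
\[ \hat{t}_{str} = 44000 \times 0.2465 = 10847 \]
with
\[ SE(\hat{t}_{str}) = 44000 \times 0.0071 = 312 \]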
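The calculation in Example 3.4 can be checked with a short script. This is a sketch under one assumption worth stating: \(n_h\) is taken to be the number of valid returns in each discipline (nonresponse is ignored), which is the choice that reproduces the reported standard error of 0.0071. Variable names are illustrative.

```python
from math import sqrt

# (discipline, membership N_h, valid returns n_h, proportion female p_h)
acls = [
    ("Literature",        9100,  636, 0.38),
    ("Classics",          1950,  451, 0.27),
    ("Philosophy",        5500,  481, 0.18),
    ("History",           10850, 611, 0.19),
    ("Linguistics",       2100,  493, 0.36),
    ("Political Science", 5500,  575, 0.13),
    ("Sociology",         9000,  588, 0.26),
]

N = sum(N_h for _, N_h, _, _ in acls)

# Stratified proportion: weighted average of stratum proportions.
p_str = sum(N_h / N * p_h for _, N_h, _, p_h in acls)

# Estimated variance, with finite population correction in each stratum.
var_p = sum(
    (1 - n_h / N_h) * (N_h / N) ** 2 * p_h * (1 - p_h) / (n_h - 1)
    for _, N_h, n_h, p_h in acls
)
se_p = sqrt(var_p)

t_str = N * p_str   # estimated number of female members
se_t = N * se_p     # its standard error

print(f"p_str = {p_str:.4f}, SE = {se_p:.4f}")
print(f"t_str = {t_str:.0f}, SE = {se_t:.0f}")
```

The standard error of the total computed here uses the unrounded SE of the proportion, so it can differ by about one unit from the slide's 44000 × 0.0071.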
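Review: Stratified random sampling
Stratum             1     2     ...   H
Population size     N_1   N_2   ...   N_H     \(\sum_{h=1}^{H} N_h = N\)
Sample size         n_1   n_2   ...   n_H     \(\sum_{h=1}^{H} n_h = n\)
Population total    t_1   t_2   ...   t_H
Population quantities and the corresponding sample quantities (\(y_{hj}\): value of the jth unit in stratum h)
Stratum total:     \( t_h = \sum_{j=1}^{N_h} y_{hj} \),   estimated by   \( \hat{t}_h = \frac{N_h}{n_h} \sum_{j \in S_h} y_{hj} = N_h \bar{y}_h \)
Population total:  \( t = \sum_{h=1}^{H} t_h \),   estimated by   \( \hat{t}_{str} = \sum_{h=1}^{H} \hat{t}_h = \sum_{h=1}^{H} N_h \bar{y}_h \)
Stratum mean:      \( \bar{y}_{hU} = \frac{1}{N_h} \sum_{j=1}^{N_h} y_{hj} \),   estimated by   \( \bar{y}_h = \frac{1}{n_h} \sum_{j \in S_h} y_{hj} \)
Population mean:   \( \bar{y}_U = \frac{t}{N} = \frac{1}{N} \sum_{h=1}^{H} \sum_{j=1}^{N_h} y_{hj} \),   estimated by   \( \bar{y}_{str} = \frac{\hat{t}_{str}}{N} = \sum_{h=1}^{H} \frac{N_h}{N} \bar{y}_h \)
Stratum variance:  \( S_h^2 = \sum_{j=1}^{N_h} \frac{(y_{hj} - \bar{y}_{hU})^2}{N_h - 1} \),   estimated by   \( s_h^2 = \sum_{j \in S_h} \frac{(y_{hj} - \bar{y}_h)^2}{n_h - 1} \)
Properties of the estimators:
• \( E[\hat{t}_{str}] = t \)
• \( E[\bar{y}_{str}] = \bar{y}_U \)
Confidence intervals for stratified samples
If either (1) the sample sizes within each stratum are large, or (2) the sampling design has a large number of strata, then by a central limit theorem (Krewski and Rao 1981) an approximate 100(1 − α)% confidence interval for the population mean \(\bar{y}_U\) is
\[ \bar{y}_{str} \pm z_{\alpha/2}\, SE(\bar{y}_{str}) \]
Some survey software packages use the percentile of a t distribution with n − H degrees of freedom rather than the percentile of the normal distribution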
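Using Weights
Sampling weights: the number of units in the population represented by each sample member (h, j), where h indexes the stratum and j the element.
\[ \hat{t}_{str} = \sum_{h=1}^{H} N_h \bar{y}_h = \sum_{h=1}^{H} \sum_{j \in S_h} \frac{N_h}{n_h} y_{hj} = \sum_{h=1}^{H} \sum_{j \in S_h} w_{hj} y_{hj}, \qquad \text{where } w_{hj} = \frac{N_h}{n_h} \]
\[ \bar{y}_{str} = \frac{\sum_{h=1}^{H} \sum_{j \in S_h} w_{hj} y_{hj}}{\sum_{h=1}^{H} \sum_{j \in S_h} w_{hj}} \]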
Example: Suppose a population has 2000 units, 1600 of them males (stratum 1) and 400 females (stratum 2). If the sample has 400 units, 200 units from each stratum, then
\[ \pi_{1j} = \frac{200}{1600} = \frac{1}{8} \quad \text{and} \quad w_{1j} = \frac{1}{\pi_{1j}} = 8 \]
\[ \pi_{2j} = \frac{200}{400} = \frac{1}{2} \quad \text{and} \quad w_{2j} = \frac{1}{\pi_{2j}} = 2 \]
• each man in the sample represents 8 men in the population
• each woman in the sample represents 2 women in the population
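• \( \pi_{hj} = n_h / N_h \)
• \( w_{hj} = N_h / n_h \)
• \( \hat{t}_{str} = \sum_{h=1}^{H} \hat{t}_h = \sum_{h=1}^{H} N_h \bar{y}_h = \sum_{h=1}^{H} \sum_{j \in S_h} \frac{N_h}{n_h} y_{hj} = \sum_{h=1}^{H} \sum_{j \in S_h} w_{hj} y_{hj} \)
• \( V(\hat{t}_{str}) = \sum_{h=1}^{H} V(\hat{t}_h) = \sum_{h=1}^{H} N_h^2 \left(1 - \frac{n_h}{N_h}\right) \frac{S_h^2}{n_h} \)
• \( \bar{y}_{str} = \hat{t}_{str} / N = \sum_{h=1}^{H} \frac{N_h}{N} \bar{y}_h = \frac{\sum_{h=1}^{H} \sum_{j \in S_h} w_{hj} y_{hj}}{\sum_{h=1}^{H} \sum_{j \in S_h} w_{hj}} \)
• \( V(\bar{y}_{str}) = V(\hat{t}_{str}) / N^2 = \sum_{h=1}^{H} \frac{N_h^2}{N^2} \left(1 - \frac{n_h}{N_h}\right) \frac{S_h^2}{n_h} \)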
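Comments:
• Let \(\pi_{hj}\) be the probability of selecting unit j from stratum h. Then \( w_{hj} = 1/\pi_{hj} = N_h/n_h \)
• \( \sum_{h=1}^{H} \sum_{j \in S_h} w_{hj} = \sum_{h=1}^{H} \sum_{j \in S_h} \frac{N_h}{n_h} = \sum_{h=1}^{H} N_h = N \)
-- The whole sample represents the entire population, and the sum of the weights is equal to the population size
• \( \hat{t}_{str} = \sum_{h=1}^{H} \sum_{j \in S_h} w_{hj} y_{hj} \)
• \( \bar{y}_{str} = \sum_{h=1}^{H} \sum_{j \in S_h} w_{hj} y_{hj} \Big/ \sum_{h=1}^{H} \sum_{j \in S_h} w_{hj} \)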
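The weighted forms of the estimators can be illustrated with the male/female example. In this sketch only the stratum sizes (1600 and 400) and the sample sizes (200 each) come from the example. The response values are simulated and purely hypothetical.

```python
import random
from statistics import mean

random.seed(1)
stratum_size = {"male": 1600, "female": 400}

# Hypothetical responses for the 200 sampled units in each stratum.
sample = {
    "male":   [random.gauss(50, 10) for _ in range(200)],
    "female": [random.gauss(60, 10) for _ in range(200)],
}

# One (weight, value) record per sampled unit, with w_hj = N_h / n_h.
records = [
    (stratum_size[h] / len(ys), y)
    for h, ys in sample.items()
    for y in ys
]

sum_w = sum(w for w, _ in records)            # equals the population size N
t_str = sum(w * y for w, y in records)        # weighted estimate of the total
ybar_str = t_str / sum_w                      # weighted estimate of the mean

# Same total computed stratum by stratum as sum of N_h * ybar_h.
t_check = sum(stratum_size[h] * mean(ys) for h, ys in sample.items())
print(sum_w, round(t_str, 1), round(t_check, 1), round(ybar_str, 2))
```

Storing one weight per record means the same two sums give the total and the mean without any reference to the stratum structure, which is why survey data files usually carry a weight column.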
Back to the previous example. Suppose a population has 2000 units, 1600 of them males (stratum 1) and 400 females (stratum 2). If we randomly select 160 males from stratum 1 and 40 females from stratum 2,
\[ \pi_{1j} = \frac{160}{1600} = \frac{1}{10} \quad \text{and} \quad w_{1j} = \frac{1}{\pi_{1j}} = 10 \]
\[ \pi_{2j} = \frac{40}{400} = \frac{1}{10} \quad \text{and} \quad w_{2j} = \frac{1}{\pi_{2j}} = 10 \]
The number of sampled units in each stratum is proportional to the size of the stratum. We call this allocation method proportional allocation.
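Proportional Allocation: the number of sampled units in each stratum is proportional to the size of the stratum
\[ \frac{n_h}{N_h} = \frac{n}{N}, \qquad n_h = N_h \frac{n}{N} \]
\[ \pi_{hj} = \frac{n_h}{N_h} = \frac{n}{N} \quad \text{and} \quad w_{hj} = \frac{1}{\pi_{hj}} = \frac{N}{n} \]
The sample is self-weighting:
\[ \bar{y}_{str} = \sum_{h=1}^{H} \frac{N_h}{N} \bar{y}_h = \sum_{h=1}^{H} \frac{N_h}{N} \sum_{j \in S_h} \frac{y_{hj}}{n_h} = \sum_{h=1}^{H} \frac{1}{n} \sum_{j \in S_h} y_{hj} = \frac{1}{n} \sum_{h=1}^{H} \sum_{j \in S_h} y_{hj} = \bar{y} \]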
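Variances:
\[ V_{prop}(\bar{y}_{str}) = \left(1 - \frac{n}{N}\right) \frac{1}{n} \sum_{h} \frac{N_h}{N} S_h^2 \]
\[ V_{prop}(\hat{t}_{str}) = \left(1 - \frac{n}{N}\right) \frac{N}{n} \sum_{h} N_h S_h^2 \]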
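ANOVA Table
Source               df       Sum of squares
Between strata       H − 1    \( SSB = \sum_{h=1}^{H} \sum_{j=1}^{N_h} (\bar{y}_{hU} - \bar{y}_U)^2 = \sum_{h=1}^{H} N_h (\bar{y}_{hU} - \bar{y}_U)^2 \)
Within strata        N − H    \( SSW = \sum_{h=1}^{H} \sum_{j=1}^{N_h} (y_{hj} - \bar{y}_{hU})^2 = \sum_{h=1}^{H} (N_h - 1) S_h^2 \)
Total (corrected)    N − 1    \( SSTO = \sum_{h=1}^{H} \sum_{j=1}^{N_h} (y_{hj} - \bar{y}_U)^2 = (N - 1) S^2 \)
\[ SSTO = SSB + SSW \]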
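Comparison between SRS and proportional allocation
\[ \begin{aligned} V(\hat{t}_{str}) &= V\left(\sum_{h=1}^{H} N_h \bar{y}_h\right) = \sum_{h=1}^{H} N_h^2 \left(1 - \frac{n_h}{N_h}\right) \frac{S_h^2}{n_h} \\ &= \sum_{h=1}^{H} \left(1 - \frac{n}{N}\right) \frac{N}{n N_h} N_h^2 S_h^2 = \sum_{h=1}^{H} \left(1 - \frac{n}{N}\right) \frac{N}{n} N_h S_h^2 \\ &= \left(1 - \frac{n}{N}\right) \frac{N}{n} \left[ SSW + \sum_{h=1}^{H} S_h^2 \right] \end{aligned} \]
\[ \begin{aligned} V(\hat{t}_{srs}) &= \left(1 - \frac{n}{N}\right) N^2 \frac{S^2}{n} = \left(1 - \frac{n}{N}\right) \frac{N^2}{n} \frac{1}{N - 1} (SSW + SSB) \\ &\approx \left(1 - \frac{n}{N}\right) \frac{N}{n} (SSW + SSB) \end{aligned} \]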
Proportional stratification is more efficient than SRS if
\[ \sum_{h=1}^{H} S_h^2 < SSB, \qquad \text{where } SSB = \sum_{h=1}^{H} N_h (\bar{y}_{hU} - \bar{y}_U)^2. \]
This is usually true, since large stratum sizes \(N_h\) tend to force \( N_h (\bar{y}_{hU} - \bar{y}_U)^2 > S_h^2 \).
Comments
• In general, the variance of the estimator of t from a stratified sample with proportional allocation will be smaller than the variance of the estimator of t from an SRS with the same number of observations
• The more unequal the stratum means \(\bar{y}_{hU}\) and the more homogeneous the units within each stratum, the more precision you will gain by using proportional allocation.
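The comparison can be checked numerically on a small artificial population. In the sketch below the three strata, their sizes, and the normally distributed values are all hypothetical and chosen only so that the stratum means differ. It uses the exact SRS variance rather than the approximation with N in place of N − 1.

```python
import random
from statistics import mean, variance

random.seed(0)

# Hypothetical population: three strata with clearly different means.
population = {
    "A": [random.gauss(10, 2) for _ in range(500)],
    "B": [random.gauss(20, 2) for _ in range(300)],
    "C": [random.gauss(40, 2) for _ in range(200)],
}
all_y = [y for ys in population.values() for y in ys]
N, n = len(all_y), 100
ybar_U = mean(all_y)

# ANOVA decomposition; statistics.variance divides by (size - 1), matching S_h^2.
ssb = sum(len(ys) * (mean(ys) - ybar_U) ** 2 for ys in population.values())
ssw = sum((len(ys) - 1) * variance(ys) for ys in population.values())
ssto = (N - 1) * variance(all_y)
sum_s2 = sum(variance(ys) for ys in population.values())

fpc = 1 - n / N
v_prop = fpc * (N / n) * sum(len(ys) * variance(ys) for ys in population.values())
v_srs = fpc * N**2 * variance(all_y) / n

print(f"SSB = {ssb:.1f}, SSW = {ssw:.1f}, sum of S_h^2 = {sum_s2:.1f}")
print(f"V_prop = {v_prop:.1f}, V_srs = {v_srs:.1f}, ratio = {v_prop / v_srs:.3f}")
```

Shrinking the gaps between the three stratum means in `population` moves the ratio toward 1, which is the situation in which stratification buys little.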
Optimal Allocation
Example: We want to take a sample of American corporations to estimate the amount of trade with Europe
• The variation among large corporations would be greater than the variation among small ones
-- often, large units are more variable than small units
• Need to sample a higher percentage of the large corporations
• Proportional allocation won't work well in this situation
-- Proportional allocation uses the same sampling fraction in every stratum
-- If the variances \(S_h^2\) are similar, proportional allocation is a good choice
-- If the variances \(S_h^2\) vary substantially, we may want to sample more units from the strata with larger variances
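Cost function
\[ c = c_0 + \sum_{h=1}^{H} c_h n_h \]
where \(c_0\) is the overhead cost, such as maintaining an office, and \(c_h\) is the cost of sampling an observation in stratum h
• Want to minimize \(V(\hat{t}_{str})\) for a fixed cost c, or minimize c for a fixed \(V(\hat{t}_{str})\)
\[ V(\hat{t}_{str}) = \sum_{h=1}^{H} N_h^2 \left(1 - \frac{n_h}{N_h}\right) \frac{S_h^2}{n_h} = \sum_{h=1}^{H} N_h^2 \frac{S_h^2}{n_h} - \sum_{h=1}^{H} N_h S_h^2 \]
-- Same as minimizing \( \sum_{h=1}^{H} N_h^2 \frac{S_h^2}{n_h} \), since the second sum does not depend on the \(n_h\)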
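Using a Lagrange multiplier \(\lambda\), minimize
\[ f = \sum_{h=1}^{H} N_h^2 \frac{S_h^2}{n_h} + \lambda \left( c_0 + \sum_{h=1}^{H} c_h n_h - c \right) \]
\[ \frac{\partial f}{\partial n_h} = -\frac{N_h^2 S_h^2}{n_h^2} + \lambda c_h = 0 \quad \Longrightarrow \quad n_h = \frac{N_h S_h}{\sqrt{c_h \lambda}} \]
By the fact that \(\sum_h n_h = n\), we have
\[ \frac{1}{\sqrt{\lambda}} = \frac{n}{\sum_{l=1}^{H} N_l S_l / \sqrt{c_l}} \]
\[ n_{h,opt} = n \times \left( \frac{N_h S_h / \sqrt{c_h}}{\sum_{l=1}^{H} N_l S_l / \sqrt{c_l}} \right) \]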
\[ n_{h,opt} \propto \frac{N_h S_h}{\sqrt{c_h}} \]
We take a larger sample from stratum h if
• The stratum size \(N_h\) is large
• The variability within the stratum (standard deviation \(S_h\)) is large
• Sampling within the stratum is inexpensive (\(c_h\) is small)
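A minimal sketch of this allocation rule follows. As stand-ins for the unknown \(S_h\), it uses the square roots of the sample variances from the agriculture example, with stratum sizes 220, 1054, 1382, and 422. The cost vectors are hypothetical, and the function name is illustrative. The function returns unrounded allocations, so in practice the values would be rounded and capped at \(N_h\).

```python
from math import sqrt

def optimal_allocation(n, N, S, c):
    """Allocate n units with n_h proportional to N_h * S_h / sqrt(c_h)."""
    scores = [N_h * S_h / sqrt(c_h) for N_h, S_h, c_h in zip(N, S, c)]
    total = sum(scores)
    return [n * score / total for score in scores]

regions = ["Northeast", "North Central", "South", "West"]
N = [220, 1054, 1382, 422]
# Sample variances used as stand-ins for S_h^2 (an assumption for illustration).
S = [sqrt(v) for v in (7_647_472_708, 29_618_183_543,
                       53_587_487_856, 396_185_950_266)]

equal_cost = optimal_allocation(300, N, S, [1, 1, 1, 1])
costly_west = optimal_allocation(300, N, S, [1, 1, 1, 4])  # hypothetical: West 4x the cost

for r, a, b in zip(regions, equal_cost, costly_west):
    print(f"{r:14s} equal costs: {a:6.1f}   West costly: {b:6.1f}")
```

With equal costs the rule reduces to allocation proportional to \(N_h S_h\), and the highly variable West stratum receives far more than the 41 counties it gets under roughly proportional allocation.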
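\[ n_{h,opt} = n \times \left( \frac{N_h S_h / \sqrt{c_h}}{\sum_{l=1}^{H} N_l S_l / \sqrt{c_l}} \right) \]
Neyman allocation: the \(c_h\)'s are all equal, so
\[ n_{h,Neyman} = n \times \left( \frac{N_h S_h}{\sum_{l=1}^{H} N_l S_l} \right) \]
Let \( a = \dfrac{n}{\sum_{l=1}^{H} N_l S_l} \),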
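so that \( n_{h,Neyman} = a \times N_h S_h \)
\[ \begin{aligned} V(\hat{t}_{str,Neyman}) &= \sum_{h=1}^{H} \left(1 - \frac{n_h}{N_h}\right) N_h^2 \frac{S_h^2}{n_h} = \sum_{h=1}^{H} \left(1 - \frac{a N_h S_h}{N_h}\right) \frac{N_h^2 S_h^2}{a N_h S_h} \\ &= \sum_{h=1}^{H} (1 - a S_h) \frac{N_h S_h}{a} = \sum_{h=1}^{H} \left(1 - \frac{n S_h}{\sum_{l=1}^{H} N_l S_l}\right) N_h S_h \frac{\sum_{l=1}^{H} N_l S_l}{n} \\ &= \frac{\left(\sum_{h=1}^{H} N_h S_h\right) \left(\sum_{l=1}^{H} N_l S_l\right)}{n} - \sum_{h=1}^{H} N_h S_h^2 \end{aligned} \]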
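\[ \begin{aligned} V(\hat{t}_{str,Prop}) &= \sum_{h=1}^{H} \left(1 - \frac{n_h}{N_h}\right) \frac{N_h^2}{n_h} S_h^2 = \sum_{h=1}^{H} \left(1 - \frac{n}{N}\right) \frac{N}{n} N_h S_h^2 \\ &= \sum_{h=1}^{H} \frac{N}{n} N_h S_h^2 - \sum_{h=1}^{H} N_h S_h^2 \end{aligned} \]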
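\[ \sum_{h=1}^{H} N_h S_h \sum_{l=1}^{H} N_l S_l = \sum_{h=1}^{H} N_h^2 S_h^2 + 2 \sum_{i=1}^{H} \sum_{j>i} N_i N_j S_i S_j \]
\[ \sum_{h=1}^{H} N N_h S_h^2 = \sum_{h=1}^{H} N_h^2 S_h^2 + \sum_{i=1}^{H} \sum_{j>i} N_i N_j (S_i^2 + S_j^2) \]
Since \( S_i^2 + S_j^2 \ge 2 S_i S_j \), the second expression is at least as large as the first, so
\[ V(\hat{t}_{str,Neyman}) \le V(\hat{t}_{str,Prop}) \]
Relative precision of stratification and SRS
\[ V(\hat{t}_{str,Neyman}) \le V(\hat{t}_{str,Prop}) \le V_{srs}(\hat{t}) \]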
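The first inequality can be illustrated with the two closed-form variances derived above. This sketch again treats the sample variances from the agriculture example as if they were the population values \(S_h^2\), which is an assumption made only to have numbers to plug in.

```python
from math import sqrt

N_h = [220, 1054, 1382, 422]
# Sample variances standing in for the population variances S_h^2 (assumption).
S2_h = [7_647_472_708, 29_618_183_543, 53_587_487_856, 396_185_950_266]
n = 300
N = sum(N_h)

sum_NS = sum(Nh * sqrt(s2) for Nh, s2 in zip(N_h, S2_h))   # sum of N_h * S_h
sum_NS2 = sum(Nh * s2 for Nh, s2 in zip(N_h, S2_h))        # sum of N_h * S_h^2

v_neyman = sum_NS**2 / n - sum_NS2       # variance of the total under Neyman allocation
v_prop = (N / n) * sum_NS2 - sum_NS2     # variance of the total under proportional allocation

print(f"V_Neyman = {v_neyman:.4e}")
print(f"V_prop   = {v_prop:.4e}")
print(f"ratio    = {v_neyman / v_prop:.3f}")
```

The gap between the two is driven by how much the \(S_h\) differ across strata. If all \(S_h\) were equal, the two allocations and their variances would coincide.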
Example 3.9: Dollar stratification is often used in accounting. The recorded book amounts are used to stratify the population. When auditing the loan amounts for a financial institution:
• stratum 1 might consist of all loans of more than $1 million; \(S_h^2\) will be much larger in this stratum, so it needs a higher sampling fraction
• stratum 2 might consist of loans between $500,000 and $999,999
• and so on, down to the smallest stratum, of loans of less than $10,000
• Optimal allocation is often an efficient strategy for such a stratification
-- If the goal of the audit is to estimate the dollar discrepancy between the audited amounts and the amounts in the institution's books, an error in the recorded amount of one of the $3,000,000 loans is likely to contribute more to the audited difference than an error in the recorded amount of one of the $3,000 loans. In a survey such as this, you may even want to use sample size \(N_1\) in stratum 1, that is, audit every loan in that stratum.
Some design issues of stratified random sampling
• Allocating observations to strata
-- Proportional allocation: \( \frac{n_h}{N_h} = \frac{n}{N} \)
-- Optimal allocation; Neyman allocation is the special case in which the \(c_h\)'s are all equal:
\[ n_{h,Neyman} = n\, \frac{N_h S_h}{\sum_{l=1}^{H} N_l S_l} \]
• Sample size
• Defining strata: variables and number of strata
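Determining sample size
\[ V(\hat{t}_{str}) = \sum_{h=1}^{H} N_h^2 \left(1 - \frac{n_h}{N_h}\right) \frac{S_h^2}{n_h} \le \sum_{h=1}^{H} N_h^2 \frac{S_h^2}{n_h} = \frac{1}{n} \sum_{h=1}^{H} \frac{n}{n_h} N_h^2 S_h^2 = \frac{v}{n} \]
where \( v = \sum_{h=1}^{H} \frac{n}{n_h} N_h^2 S_h^2 \).
• v depends on the stratum sizes \(N_h\), the variances \(S_h^2\), and the relative sample sizes \(n_h/n\)
• v can be thought of as the "average" variability per observation unit in a stratified random sample with the specified allocation
95% CI: \( \hat{t}_{str} \pm z_{\alpha/2} \sqrt{v/n} \)
Setting the margin of error \( z_{\alpha/2} \sqrt{v/n} = e \) gives \( n = z_{\alpha/2}^2\, v / e^2 \)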
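The sample size formula can be sketched as a small function. Everything numeric here is illustrative: the agriculture stratum sizes with sample variances standing in for \(S_h^2\), proportional relative allocation \(n_h/n = N_h/N\), and a hypothetical margin of error e of 100 million acres. Because the finite population correction is dropped, the result is conservative.

```python
from math import ceil, sqrt
from statistics import NormalDist

def required_n(N, S2, rel_alloc, e, conf_level=0.95):
    """Smallest n with z * sqrt(v / n) <= e, where v = sum (n / n_h) N_h^2 S_h^2."""
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    # rel_alloc holds n_h / n, so dividing by it multiplies by n / n_h.
    v = sum(N_h**2 * s2 / a for N_h, s2, a in zip(N, S2, rel_alloc))
    return ceil(z**2 * v / e**2), v, z

N_h = [220, 1054, 1382, 422]
S2_h = [7_647_472_708, 29_618_183_543, 53_587_487_856, 396_185_950_266]  # stand-ins
prop = [Nh / sum(N_h) for Nh in N_h]      # proportional allocation: n_h / n = N_h / N
e = 1e8                                   # hypothetical margin of error (acres)

n_req, v, z = required_n(N_h, S2_h, prop, e)
print(f"v = {v:.4e}, required n = {n_req}")
```

Passing a different `rel_alloc`, for example the Neyman shares, changes v and therefore the required n for the same margin of error.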
Defining Strata:
1. Variables for stratification
• Highly associated with variables of interest
-- For estimating total business expenditures on advertising, we might stratify by number of employees or size of the business and by the type of product or service
-- For farm income, we might use the size of the farm as a stratifying variable, since we expect that larger farms would have higher incomes
• Known for all sampling units in the population
2. Number of strata:
• Depends upon many factors, such as the difficulty of constructing a sampling frame with stratifying information, and the cost of stratifying
• Formulas in literature
• Pilot study
• General rule: the more information you have about the population, the more strata you should use. You should use an SRS when little prior information about the target population is available.
Recall: Relative precision of stratification and SRS
\[ V(\hat{t}_{str,opt}) \le V(\hat{t}_{str,prop}) \le V_{srs}(\hat{t}) \]
1. If stratified sampling provides higher precision than SRS, why conduct an SRS?
• Stratification adds complexity to the survey, which may not be worth a small gain in precision
• We need to know which units, and how many units, belong to each stratum
2. When is stratified sampling efficient?
• SSB is large (stratum means differ greatly)
• SSW is small (variability within strata is small)
Example: National Pesticide Survey (NPS)
Between 1988 and 1990, the US Environmental Protection Agency (EPA) sampled drinking water wells to estimate the prevalence of pesticides and nitrate.
• Want a sample that is representative of drinking water wells in the United States
• Want to guarantee that wells in the sample have a wide range of levels of pesticide use and susceptibility to ground-water pollution
• Want to study two categories of wells:
(1) Community water systems (CWS): systems of piped drinking water with at least 15 connections and/or 25 or more permanent residents, with at least one working well
(2) Rural domestic wells: wells supplying occupied housing in rural areas, not on government property
1. Frame issue: how many drinking water wells exist in the United States?
• For CWS, a list with addresses is in the Federal Reporting Data System (FRDS), maintained by EPA. There are approximately 51,000 CWSs.
• The 1980 census data are used to estimate the number of rural domestic wells. There are about 13 million rural domestic wells.
2. Stratification issue: EPA chose a stratified design. Which variables are used to construct strata?
• EPA developed criteria for separating the population of CWS wells and rural domestic wells into four categories of pesticide use and three relative ground-water vulnerability measures. This design ensures that the range of variability that exists nationally with respect to the agricultural use of pesticides and ground-water vulnerability is reflected in the sample of wells.
• Pesticide use obtained from
-- marketing research
-- proportion of county in agricultural use
• Ground-water vulnerability measures (by DRASTIC)
• Four categories of pesticide use (high, moderate, low, uncommon) crossed with three categories of groundwater vulnerability (high, moderate, low) give 12 strata
Table 4: Strata for National Pesticide Survey
Stratum   Pesticide use   Groundwater vulnerability    Number of
                          (estimated by DRASTIC)       counties
   1      high            high                           106
   2      high            moderate                       234
   3      high            low                            129
   4      moderate        high                           110
   5      moderate        moderate                       204
   6      moderate        low                            267
   7      low             high                           193
   8      low             moderate                       375
   9      low             low                            404
  10      uncommon        high                           186
  11      uncommon        moderate                       513
  12      uncommon        low                            416
3. Design considerations
-- For CWS, assume 0.5% of wells contain pesticides; choose n so that the probability of detection is 90%.
-- For rural wells, there were some subgroups of particular interest; assume a 1% rate and a 97% probability of detection.
-- n = 564 public (CWS) wells and 734 private rural wells
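To see what a detection requirement of this kind implies, here is a rough sketch under a deliberately simple assumption that the slides do not state: wells are sampled independently and "detection" means at least one sampled well contains pesticides, so the probability of detection is \(1 - (1 - p)^n\). The function name is hypothetical. The survey's actual sample sizes (564 and 734) are larger than this bound, presumably because the real design also had to serve the strata and subgroups of interest.

```python
from math import ceil, log

def min_n_for_detection(rate: float, prob_detect: float) -> int:
    """Smallest n with 1 - (1 - rate)**n >= prob_detect, assuming independent draws."""
    return ceil(log(1 - prob_detect) / log(1 - rate))

n_cws = min_n_for_detection(0.005, 0.90)    # CWS: 0.5% rate, 90% detection
n_rural = min_n_for_detection(0.01, 0.97)   # rural: 1% rate, 97% detection
print(n_cws, n_rural)
```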
4. Rural wells
-- Each county (N = 3137) is categorized according to the stratification variables.
-- Sample counties.
-- Characterize pesticide use and groundwater vulnerability for subcounty areas.
-- No subcounty areas are selected for CWS wells
Model-based inference for stratified sampling
• The one-way ANOVA model with fixed effects provides an underlying structure for stratified sampling.
\[ y_{hj} = \mu_h + \varepsilon_{hj} \qquad (1) \]
where the \(\varepsilon_{hj}\) are independent with mean 0 and variance \(\sigma_h^2\).
• The least squares estimator of \(\mu_h\) is \(\bar{y}_h\), the sample average in stratum h
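Estimators and Properties:
• \( T_h = \sum_{j=1}^{N_h} y_{hj} \): the total in stratum h
• \( T = \sum_{h=1}^{H} T_h \): the overall total
• Note that both \(T_h\) and \(T\) are random variables
• The best linear unbiased estimator for \(T_h\) is \( \hat{T}_h = \frac{N_h}{n_h} \sum_{j \in S_h} y_{hj} \).
• \( E_M[\hat{T}_h - T_h] = 0 \)
• \( E_M[(\hat{T}_h - T_h)^2] = N_h^2 \left(1 - \frac{n_h}{N_h}\right) \frac{\sigma_h^2}{n_h} \)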
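Because observations in different strata are independent under the model, the cross terms have expectation zero:
\[ \begin{aligned} E_M[(\hat{T} - T)^2] &= E_M\left[ \left\{ \sum_{h=1}^{H} (\hat{T}_h - T_h) \right\}^2 \right] \\ &= E_M\left[ \sum_{h=1}^{H} (\hat{T}_h - T_h)^2 + \sum_{h=1}^{H} \sum_{k \ne h} (\hat{T}_h - T_h)(\hat{T}_k - T_k) \right] \\ &= E_M\left[ \sum_{h=1}^{H} (\hat{T}_h - T_h)^2 \right] \\ &= \sum_{h=1}^{H} N_h^2 \left(1 - \frac{n_h}{N_h}\right) \frac{\sigma_h^2}{n_h} \end{aligned} \]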
Comments:
• The theoretical variance \(\sigma_h^2\) can be estimated by \(s_h^2\)
• Adopting the model in (1) results in the same estimator of t and the same standard error as found under randomization theory.
• If a different model is used, however, then different estimators are obtained.
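The model-based mean squared error for a single stratum can be checked by simulation. This is a sketch, not part of the course material: the stratum size, sample size, mean, and standard deviation are hypothetical, the errors are taken to be normal for convenience, and the first \(n_h\) units are treated as the sample, which is harmless because under model (1) the units are exchangeable.

```python
import random

random.seed(472)
N_h, n_h = 40, 10            # hypothetical stratum and sample sizes
mu_h, sigma_h = 5.0, 2.0     # hypothetical model parameters
reps = 20_000

errors = []
for _ in range(reps):
    # Generate the whole stratum from the model y_hj = mu_h + eps_hj.
    y = [random.gauss(mu_h, sigma_h) for _ in range(N_h)]
    T_h = sum(y)                           # realized stratum total (random under the model)
    T_hat = N_h / n_h * sum(y[:n_h])       # estimator based on the n_h sampled units
    errors.append(T_hat - T_h)

bias = sum(errors) / reps
mse = sum(err * err for err in errors) / reps
theory = N_h**2 * (1 - n_h / N_h) * sigma_h**2 / n_h

print(f"simulated bias = {bias:.3f}")
print(f"simulated MSE  = {mse:.1f}, model formula = {theory:.1f}")
```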