
Complexities of Complex Survey Design Analysis

Why worry about this?

• Many government studies use these designs
– CDC National Health Interview Survey (NHIS)
– National Health and Nutrition Examination Survey (NHANES), also CDC
– National Longitudinal Survey of Youth
– Medicare Current Beneficiary Survey (MCBS)
– Almost any survey seeking a representative sample from a large population will have a complex multi-stage probability sampling methodology.

Why do we care?

• These studies make their data available to researchers at very minimal cost (sometimes even free).

• Getting free data seems great, but the analysis challenges are considerable as well.

• The studies do not always document the study design very well, so it can be difficult to understand how to deal with it.

Today’s talk

• Will not deal with all of the issues

• Will start at the basics and lead up to some of the complexities.

• Will talk about how various software deals with some of the complexities.

Usual assumptions

• Infinite populations.
– Never true, but can be “true enough”
– Most methods work under the infinite population assumption.
– This will hold if N is very, very large and n is not too big relative to N (i.e. N >> n)
– Survey design people are sort of the statistical version of numerical analysts, i.e. they ask what to do when the analysis environment is not infinite.

Background: Types of sampling

• Simple random sampling with replacement
– Easiest to deal with
– Population size N, sample size n
– On each draw, each population unit has probability 1/N of being selected into the sample.
– Drawback: each population unit can be selected multiple times (i.e. repeat information)
– If N is large, the probability of any unit being selected twice is small.

More Background

• Simple random sampling without replacement
– The selection probability changes from draw to draw, because units already drawn cannot be selected again.
– First unit selected has probability 1/N.
– Second unit selected has probability 1/(N-1).
– nth unit selected has probability 1/(N-n+1).
– If N >> n and N is large, then 1/N ≈ 1/(N-1) ≈ ... ≈ 1/(N-n+1), so this is approximately the same as simple random sampling with replacement.
– Can use the FPC (finite population correction), ((N-n)/(N-1))^(1/2). Note that if N >> n this is ≈ 1.
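As a rough illustration of how quickly the FPC approaches 1, the following Python sketch evaluates the formula above for two pairs of sizes. The function name `fpc` and both (N, n) pairs are assumptions chosen only for illustration.

```python
import math


def fpc(N: int, n: int) -> float:
    """Finite population correction factor ((N - n) / (N - 1)) ** 0.5."""
    return math.sqrt((N - n) / (N - 1))


# Illustrative sizes only (not taken from any real survey).
fpc_large = fpc(1_000_000, 1_000)  # N >> n: the factor is very close to 1
fpc_small = fpc(110, 28)           # n is a sizable share of N: factor is clearly below 1

print(round(fpc_large, 4), round(fpc_small, 4))
```

When the factor is close to 1, ignoring it changes the standard error very little, which is why the infinite population assumption is often "true enough".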

Why complex sampling

• Cost (main reason)
– Simpler and more cost effective
   • May differentially sample easy-to-sample units versus difficult-to-sample units, e.g. homeless, minorities, rural
– Harder-to-sample units
   • Want to account for inclusion difficulty of certain types of population units.

Sampling Strategies

• Strata

• Clusters

• Weights

Strata

• Strata – Fixed known groups
   • regions, groups of countries, states
– The strata themselves are not sampled; however, sampling within strata is not equal across strata.
– All strata are included

Adjusting for Strata

• Assume two strata with N1 = 100 and N2 = 10 elements.

• Take a sample of size 20 from stratum 1 and 8 from stratum 2. Assume sampling with replacement to make the math easier.

• So P = 20/100 = .2 in stratum 1 and P = 8/10 = .8 in stratum 2.

• Use inverse probability to weight analyses.

• Weights: for stratum 1, w1 = 1/.2 = 5, and for stratum 2, w2 = 1/.8 = 1.25
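The arithmetic above can be sketched in a few lines of Python. The stratum sizes and sample sizes are the ones given on the slide; the variable names are illustrative assumptions.

```python
# Population and sample sizes for the two strata, as given on the slide.
N = {1: 100, 2: 10}  # stratum population sizes
n = {1: 20, 2: 8}    # stratum sample sizes

# Selection probability and inverse-probability weight for each stratum.
p = {h: n[h] / N[h] for h in N}
w = {h: 1 / p[h] for h in N}

# Each sampled unit stands for w[h] population units, so summing the
# weights over the whole sample should recover the population size.
represented = sum(w[h] * n[h] for h in N)

print(p, w, represented)
```

The last quantity is a useful sanity check in practice: if the weights in a data file do not sum to something close to the known population size, the weights may have been rescaled or subset.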

Example

• Want to estimate job openings in a town.

• Large businesses have more job openings than small businesses. Say that you have 10 large businesses and 100 small businesses. Draw a sample of 28 businesses, with 20 small businesses and 8 large businesses. Use the probability weights from the previous slide.

• Let x be the number of job openings in each business.

Example continued

• Total job openings = Σ wi xi, where the weights are 5 if in stratum 1 (small businesses) and 1.25 if in stratum 2 (large businesses).

• Note that w1*n1 + w2*n2 = 110, the population size.

• So the idea is that businesses sampled from stratum 1 look like 5 businesses, while businesses sampled from stratum 2 look like 1.25 businesses.

• Complex survey design works on population totals and the resulting proportions.

• Note that in this case the PSU (primary sampling unit) is a business.

With no weights (assumes equal weighting)

                                Cumulative    Cumulative
open    Frequency    Percent     Frequency       Percent
--------------------------------------------------------
   0            6      21.43             6         21.43
   1            4      14.29            10         35.71
   2            3      10.71            13         46.43
   3            2       7.14            15         53.57
   4            2       7.14            17         60.71
   5            2       7.14            19         67.86
   6            1       3.57            20         71.43
  10            1       3.57            21         75.00
  13            1       3.57            22         78.57
  15            1       3.57            23         82.14
  20            1       3.57            24         85.71
  22            1       3.57            25         89.29
  25            1       3.57            26         92.86
  27            1       3.57            27         96.43
  30            1       3.57            28        100.00

Total job openings: 202 × 110/28 ≈ 793.6 (each sampled business stands for 110/28 ≈ 3.93 businesses), or 7.2 per company. This is an overestimate because it weights large companies the same as small companies.

With weights (unequal sampling)

                                Cumulative    Cumulative
open    Frequency    Percent     Frequency       Percent
--------------------------------------------------------
   0           30      27.27            30         27.27
   1           20      18.18            50         45.45
   2           15      13.64            65         59.09
   3           10       9.09            75         68.18
   4           10       9.09            85         77.27
   5           10       9.09            95         86.36
   6            5       4.55           100         90.90
  10         1.26       1.15        101.26         92.05
  13         1.25       1.14        102.51         93.18
  15         1.25       1.14        103.76         94.32
  20         1.25       1.14        105.01         95.45
  22         1.25       1.14        106.26         96.59
  25         1.25       1.14        107.51         97.73
  27         1.25       1.14        108.76         98.86
  30         1.25       1.14        110.01        100.00

Total job openings: 402.6, or roughly 403 (about 3.7 per company).
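The two totals can be reproduced from the frequency tables with a short Python sketch. It assumes, as the weighted frequencies suggest, that the 20 businesses with 0 to 6 openings are the small ones (weight 5) and the 8 businesses with 10 or more openings are the large ones (weight 1.25). The variable names are illustrative.

```python
# (job openings, number of sampled businesses), read off the unweighted table.
small = [(0, 6), (1, 4), (2, 3), (3, 2), (4, 2), (5, 2), (6, 1)]  # stratum 1
large = [(x, 1) for x in (10, 13, 15, 20, 22, 25, 27, 30)]         # stratum 2

sum_small = sum(x * f for x, f in small)        # openings seen in sampled small businesses
sum_large = sum(x * f for x, f in large)        # openings seen in sampled large businesses
n_sampled = sum(f for _, f in small + large)    # total number of sampled businesses

# Unweighted: every sampled business is treated as standing for 110/28 businesses.
naive_total = (sum_small + sum_large) * 110 / n_sampled

# Weighted: small businesses count 5 times, large ones 1.25 times.
weighted_total = 5 * sum_small + 1.25 * sum_large

print(naive_total, weighted_total, weighted_total / 110)
```

With every large-business weight set to exactly 1.25, the weighted total works out to 402.5; the 402.6 reported above reflects the one weight that the table lists as 1.26.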

Types of weights

• pweights – Inverse probability weights, also known as sampling weights: wi = 1/pi.

• fweights – Frequency weights. Used when one record represents a number of identical records.

• aweights – Analytic weights: weights that are inversely proportional to the variance of an observation (e.g. in meta-analysis).

• iweights – Importance weights: weights that indicate the "importance" of the observation in some nonstatistical sense.

Replicate Weights

• Series of weights used to correct standard errors

• Used to more securely protect the identity of the respondents

• Two common kinds
– Balanced Repeated Replicates (BRR)
– Jackknife (JK-1)

Add clustering

• Strata are fixed groups that are all used and are mutually exclusive
– e.g. big companies and small companies

• Clusters are sampled. The unit sampled is the PSU.
– E.g. strata: Region: Urban/Rural
– Cluster: zip code; sample zip codes in region (PSU)
– Sample persons residing in the zip code area.
– Unequal sampling of PSUs within strata, then unequal sampling of individuals within the zip code area. Use conditional probabilities to get weights at the various levels.
– Units within a cluster are likely to be more similar (i.e. smaller variability)
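A minimal Python sketch of the conditional-probability idea, assuming a two-stage design (zip code, then person). The helper name `multistage_weight` and all of the counts are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def multistage_weight(p_psu: float, p_person_given_psu: float) -> float:
    """Weight = 1 / overall selection probability = 1 / (P(PSU) * P(person | PSU))."""
    return 1.0 / (p_psu * p_person_given_psu)


# Hypothetical urban stratum: 2 of 10 zip codes sampled,
# then 50 of 5,000 residents sampled within a selected zip code.
w_urban = multistage_weight(2 / 10, 50 / 5_000)

# Hypothetical rural stratum: 1 of 20 zip codes sampled,
# then 25 of 1,000 residents sampled within a selected zip code.
w_rural = multistage_weight(1 / 20, 25 / 1_000)

print(w_urban, w_rural)
```

Under these made-up numbers a sampled rural resident represents more people than a sampled urban resident, even though a larger share of residents is sampled within a rural zip code, because far fewer rural zip codes are selected at the first stage.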

NHANES Sampling design (Continuous)

• The NHANES sample is designed to be nationally representative of the civilian, non-institutionalized U.S. population, in that it does not include persons residing in nursing homes, institutionalized persons, or U.S. nationals living abroad. Thus, for NHANES 1999-2010, each year's sample and any combination of samples from consecutive years comprise a nationally representative sample of the resident, non-institutionalized U.S. population.

• Stage 1: Primary sampling units (PSUs) are selected. These are mostly single counties or, in a few cases, groups of contiguous counties with probability proportional to a measure of size (PPS).

• Stage 2: The PSUs are divided up into segments (generally city blocks or their equivalent). As with each PSU, sample segments are selected with PPS.

• Stage 3: Households within each segment are listed, and a sample is randomly drawn. In geographic areas where the proportion of age, ethnic, or income groups selected for oversampling is high, the probability of selection for those groups is greater than in other areas.

• Stage 4: Individuals are chosen to participate in NHANES from a list of all persons residing in selected households. Individuals are drawn at random within designated age-sex-race/ethnicity screening subdomains. On average, 1.6 persons are selected per household.

• http://www.cdc.gov/nchs/tutorials/NHANES/SurveyDesign/SampleDesign/Info1.htm

Weight calculation

Implications of sampling design

• Strata – the documentation gives no definition of the strata, but it says two county PSUs are selected per stratum, so strata exist.

• Variables that sampling is based on
– Stage 1: PSU (county): size of county (PPS, probability proportional to size, so larger counties have a greater probability of selection)
– Stage 2: segment: size of segment (PPS, see above)
– Stage 3: household: age, ethnic, income group
– Stage 4: individual: age-sex-race/ethnicity
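To make "probability proportional to size" concrete, here is a small Python sketch of per-draw PPS selection. The county labels and population figures are invented for illustration and are not NHANES values; the sketch draws with replacement, which is a simplification of what a real design would do.

```python
import random

# Hypothetical county populations (invented for illustration only).
county_size = {"A": 500_000, "B": 300_000, "C": 150_000, "D": 50_000}
total = sum(county_size.values())

# Under PPS, the chance of picking a county on a single draw is its
# share of the total measure of size.
draw_prob = {c: s / total for c, s in county_size.items()}

# Draw two PSUs with replacement, using the sizes as selection weights.
rng = random.Random(0)  # fixed seed so the sketch is repeatable
selected = rng.choices(list(county_size), weights=list(county_size.values()), k=2)

print(draw_prob, selected)
```

Because large counties are more likely to be selected, their residents would be over-represented unless the analysis weights undo this, which is one reason the sample weights matter.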

Sample weights

• Numerical sample weight assigned to each participant
– number of people in the population represented by that particular sampled person
– includes adjustments for
   • unequal selection
   • non-response
   • control totals (make sure estimates of age, sex, and race/ethnicity categories match known population totals)

Variance Estimates

• Unequal weighting causes complications in variance estimation

• Can use:
– Taylor series (linearization) estimate
– BRR – Balanced Repeated Replicates (if weights are provided)
   • get a lot of subsample weights, calculate the estimate once per set of weights, and take the variance of these estimates.
– Jackknife (if weights are provided)
   • same idea as above
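The replicate idea can be sketched in Python for the simplest case: a delete-one (JK-1 style) jackknife of a weighted mean in an unstratified sample. This is an illustrative assumption-laden sketch, not how any particular survey builds its replicates; in real public-use files the replicate weights are supplied by the survey organization, and the deleted units are usually PSUs rather than individuals. The function names and data are hypothetical.

```python
def weighted_mean(x, w):
    """Weighted mean of x using weights w."""
    return sum(wi * xi for wi, xi in zip(w, x)) / sum(w)


def jk1_replicate_weights(w):
    """One replicate per unit: zero that unit's weight, scale the rest by n/(n-1)."""
    n = len(w)
    return [
        [0.0 if i == r else wi * n / (n - 1) for i, wi in enumerate(w)]
        for r in range(n)
    ]


def jk1_variance(x, w):
    """Re-estimate once per replicate, then combine the squared deviations."""
    n = len(x)
    theta = weighted_mean(x, w)  # full-sample estimate
    reps = [weighted_mean(x, wr) for wr in jk1_replicate_weights(w)]
    return (n - 1) / n * sum((t - theta) ** 2 for t in reps)


# Toy data with equal weights (hypothetical values).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
w = [1.0] * len(x)
var_hat = jk1_variance(x, w)
print(var_hat)
```

With equal weights this reduces to the familiar s²/n for a sample mean, which is a convenient check that the replicate machinery is wired up correctly before applying it to unequal weights.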

How?

• You can't do this on your calculator

• SUDAAN (the original)

• Stata (says it is better)

• SAS (has come out with survey procedures)

• Getting variances always seems to be the issue (although unbiased estimates are usually a good thing).

Example of SAS code

PROC SURVEYMEANS data=d.ncsdxdm3;
   strata str;          /* stratum variable */
   cluster secu;        /* cluster (PSU) variable */
   var deplt1 gadlt1;   /* analysis variables */
   weight p1fwt;        /* sampling weight */
run;

Example of STATA code

• svyset county [pw = pwvar], strata(state) fpc(fpcvar) || school, fpc(fpcvar2)

• This sets up the design

• Use svy: function
– e.g. svy: mean, svy: regress
– svy modules are listed in the Stata documentation
