Analysis of Complex Survey Data Day 5, Special topics: Developing weights and imputing data

Posted on 19-Dec-2015


Page 1: Analysis of Complex Survey Data Day 5, Special topics: Developing weights and imputing data

Analysis of Complex Survey Data

Day 5, Special topics: Developing weights and imputing data

Page 2

Part 1: Imputation using HOT DECK

Page 3

What is HOT DECK?

• This procedure is designed to perform the Cox-Iannacchione Weighted Sequential Hot Deck (WSHD) imputation that is described in both Cox (1980) and Iannacchione (1982), a methodology based on a weighted sequential sample selection algorithm developed by Chromy (1979).

• Provisions are included in this procedure for multivariate imputation (several variables imputed at the same time) and multiple imputations (several imputed versions of the same variable).

Page 4

Vocab

• Donor – An item respondent selected to provide a value for missing item nonrespondent data

• Imputation class – A user-defined group used in the imputation process. Classes fall into three categories: classes that consist of only item respondent records, classes that consist of only item nonrespondent records, and classes that contain both item respondents and item nonrespondents. For classes with both, imputation is performed and donors are selected for the missing values. For classes with only item respondents or only item nonrespondents, imputation is not performed.

Page 5

Vocab

• Imputation variable – A user-defined variable that contains some missing values on the input data file. A missing value for this variable will be populated with a donor value.

• Item nonrespondent – A record for which imputation is performed on missing data. All records on the input data file are defined as either an item respondent or an item nonrespondent. Users define the set of item respondents; item nonrespondents are the remaining records on the input file.

• Item respondent – A record from which values can be selected for imputation of missing item nonrespondent data. Users define the set of all item respondents.

Page 6

Getting started

• Prepare a dataset that includes:
– The variable(s) you want to impute
– The variable(s) that will inform the imputation
– ID and weighting variables

Page 7

How do you decide what variables to include to inform the imputation?

• IMPORTANT ASSUMPTION: imputation assumes that, for a given variable with missing data, the missing-data mechanism within each imputation class is ignorable (also known as missing at random).

• The validity of the imputed values depends on how good the measures you use to inform the imputation are. This decision is theoretical rather than statistical (but we can use statistics to inform our decision).

• Choose variables that are strongly related to the outcome of interest. You want the response to be as homogeneous as possible within groups and as heterogeneous as possible across groups
– E.g., if you are imputing depression score, you definitely want sex and age. The others depend on what you think impacts depression score: BMI? Smoking? Race? Education? Income?

Page 8

How much missing is too much?

• There is no rule that says when there is just too much missing data to use a variable
• Some people use a 10% rule
• It likely depends on how important this variable is to your analysis
• Just remember that the more missing data, the less valid the imputed variable will be

Page 9

Getting into the details

• Sorting Matters. The way in which the input file is sorted WITHIN each imputation class (defined by the IMPBY statement) will have an effect on the imputation results. The assignment of a selection probability to a potential donor, or item respondent, depends both on the donor’s weight and on the weights of nearby item nonrespondents. In other words, both the weights and the sort order of observations play a role in the selection of donors for imputation in the WSHD algorithm.
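Putting the pieces so far together, a minimal SUDAAN call might look like the sketch below. The slides themselves show no code, and statement names other than IMPBY are my recollection of the SUDAAN 11 manual (IMPVAR, IMPNAME, IDVAR), so verify against your documentation; the dataset and variable names are hypothetical.

```sas
* Hypothetical sketch of a weighted sequential hot deck run in SUDAAN. ;
* The input file must be sorted by the IMPBY class variables, and the  ;
* within-class sort order affects donor selection (see above).         ;
proc sort data=mydata;
   by gender agegrp recnos;
run;

proc hotdeck data=mydata;
   weight sampwt;          /* base sampling weight                    */
   impby  gender agegrp;   /* imputation classes                      */
   impvar depscore;        /* variable with missing values to impute  */
   impname depscore_i;     /* name for the imputed version            */
   idvar  recnos;          /* carried to the output file for merging  */
   output / filename=imputed replace;
run;
```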

Page 10

Getting into the details

Page 11

Lab: Imputation using HOT DECK

Page 12

Part 2: WTADJUST

Page 13

Why would you need to calculate new weights?

• Nonresponse adjustment
– One of the key variables in your analysis has high levels of missingness, and you don’t want to impute
– In this case, you can reestimate the sample weights taking into account factors associated with being missing on the key variable
• Post-stratification weight adjustment
– You don’t like the referent population used for calculating the sample weights in your data
– E.g., most complex surveys weight the sample to be representative of the U.S. based on the 2000 Census, which may not be desirable if your data were collected in 2009
– Post-stratification adjustment may also be useful to users who seek to create standardized weights or non-probability-based sample weights

Page 14

PROC WTADJUST

• Designed to be used to compute nonresponse and post-stratification weight adjustments

• Uses a model-based, calibration approach that is somewhat similar to what is done with PROC LOGISTIC – a generalization of the classical weighting-class approach for producing weight adjustments

Page 15

PROC WTADJUST

• In a model-based approach:
– The user can include more main effects and lower-order interactions of variables in the weight adjustment process. This can reduce bias in estimates computed using the adjusted weights
– You can estimate the statistical significance of the variables used in the adjustment process
– Unlike traditional methods, continuous variables can be incorporated

Page 16

PROC WTADJUST

• In fact, if all interaction terms are included in the weight adjustment model for a given set of categorical variables, the model-based approach is equivalent to the weighting class approach

Page 17

The weight adjustment model

The final weight adjustment for each record takes the form

  a_k = t_k × α_k,  for k in D

where:

  k is an index corresponding to each record in the domain of interest

  D is the domain of interest (SUBPOPN statement)

  a_k is the final weight adjustment for each record k in D. This is the key output variable from this procedure

  t_k is a weight trimming factor that will be computed before the B-parameters of the exponential model (i.e., the parameters of α_k) are estimated

  α_k is the nonresponse or post-stratification adjustment computed after the weight trimming step

Page 18

The weight adjustment model

The adjustment follows the generalized exponential model of Folsom and Singh (2000):

  α_k = [ℓ_k(u_k − c_k) + u_k(c_k − ℓ_k) exp(A_k x_k'β)] / [(u_k − c_k) + (c_k − ℓ_k) exp(A_k x_k'β)]

where:

  ℓ_k is the lower bound imposed on the adjustment

  u_k is the upper bound imposed on the adjustment

  c_k is the centering constant

  A_k is a constant used to control the behavior of α_k as the upper and lower bounds get closer to the centering constant

  x_k is a vector of explanatory variables

  β are the model parameters that will be estimated within the procedure

Note that α_k equals ℓ_k, c_k, or u_k as x_k'β goes to −∞, 0, or +∞, respectively, so the adjustment is always bounded between ℓ_k and u_k and centered at c_k.

Page 19

The weight adjustment model

w_k is the input weight for record k (whatever is on the WEIGHT statement).

r_k is the 0/1 dependent variable in the modeling procedure. For nonresponse adjustments, this variable should be set to one for records corresponding to eligible respondents and to zero for records corresponding to eligible nonrespondents. For post-stratification adjustments, this variable should be set to one for all records that should receive a post-stratification adjustment (if that’s everyone, just use the option _ONE_).

Page 20

Weight Trimming

• Reducing the variance in your weights will reduce the variance in your estimates (which is good!). So, you might want to ‘trim’ the weights to be within certain bounds.

• For example, the 99-year-old daily cocaine user might have a really extreme weight. We might want to rein that person in to have a weight that’s similar to a 60-year-old daily cocaine user.

Page 21

How do you decide the bounds on the weight trimming factor?

• There are many ways to do this.
• One relatively simple approach is to partition the sample into small subpopulations (e.g., by strata or by levels of some covariate of interest).
• Within each of the subpopulations, compute the interquartile range (IQR) of the input sample weights, and set the trimming bounds from it.
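The slide’s actual bound formula was not captured in this transcript. One common choice (an assumption here, not the slide’s) is to bound the weights at the quartiles plus or minus 1.5 times the IQR within each subpopulation, which in SAS could be computed like this:

```sas
* Compute IQR-based trimming bounds within each stratum.               ;
* The 1.5 multiplier is illustrative, not taken from the slides.       ;
proc means data=mydata noprint;
   class strata;
   var sampwt;
   output out=bounds q1=wt_q1 q3=wt_q3 qrange=wt_iqr;
run;

data bounds;
   set bounds;
   wtmin_v = max(wt_q1 - 1.5*wt_iqr, 0);   /* lower trimming bound */
   wtmax_v = wt_q3 + 1.5*wt_iqr;           /* upper trimming bound */
run;
```

The WTMIN and WTMAX statements in WTADJUST accept a variable, so these per-stratum bounds can be merged onto the input file and passed in directly.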

Page 22

A simple example

RECNOS – Unique record identifier
SAMPWT – Base sampling weight for each person
STRATA
PSU
GENDER
RACE
AGE
ELIG – Yes/No variable indicating whether or not the record is eligible
RESP – Yes/No variable indicating whether the record on file corresponds to a respondent

Page 23

A simple example

To compute a nonresponse adjustment that will correct the sample weights of respondents for those people that did not respond to the survey, we use the following code:
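The code itself did not survive in this transcript. Based on the statement-by-statement walkthrough on pages 24–27, it was presumably something like the following sketch; exact option spellings (e.g., ADJUST=NONRESPONSE) should be checked against the SUDAAN manual.

```sas
* Hypothetical reconstruction of the slide's nonresponse adjustment.   ;
proc wtadjust data=mydata design=wr adjust=nonresponse;
   nest    strata psu;        /* design information for variances     */
   weight  sampwt;            /* input weight w_k                     */
   subpopn elig = 1;          /* eligible records only                */
   idvar   recnos;            /* merge key on the output file         */
   wtmin   10;                /* trim input weights below 10          */
   wtmax   15000;             /* ...and above 15000                   */
   lowrbd  1.0;               /* adjustment bounded below by 1.0      */
   upperbd 3.0;               /* ...and above by 3.0                  */
   center  2.0;               /* centering constant c_k               */
   class   gender race;
   model   resp = gender race age;
   output / filename=adjust replace;
run;
```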

Page 24

The DESIGN=WR coupled with the NEST and WEIGHT statements provides the design information for WTADJUST so that the procedure can compute appropriate design-based variances of the model parameters B.

The variable SAMPWT is the input weight w_k.

A SUBPOPN elig=1 statement is used to tell SUDAAN to only consider eligible records. In this example, we seek weight adjustments that will correct the sample weights of respondents for eligible nonrespondents.

The IDVAR statement is included so that the OUTPUT file, ADJUST, contains a variable that can be used to merge the adjustments back to the original file. In this example, the merge-by variable is RECNOS.

Page 25

The WTMAX and WTMIN statements are included. These are optional statements. A fixed value can be used in these statements – in this case, the fixed value applies to all records k. Optionally, a variable can be used in these statements. One could use a variable in cases where a different WTMAX and/or WTMIN is desired for different sets of respondents. In this particular example, the user would like to truncate any weight that is less than 10 or greater than 15000 prior to computing the actual nonresponse weight adjustment.

Similarly, the UPPERBD and LOWRBD statements are included. These are also optional statements. A fixed value can be used in these statements – in this case, the fixed value applies to all records k. Optionally, a variable can be used in these statements. In this particular example, the user would like to truncate or bound the resulting weight adjustments, α_k, so that no weight adjustment falls below 1.0 or above 3.0.

Page 26

The CENTER statement is included. This is also an optional statement. A fixed value can be used in this statement – in this case, the fixed value applies to all records k. Optionally, a variable can be used in this statement. In this particular example, the value of c_k is set equal to 2.0 for each record.

Page 27

The MODEL statement tells WTADJUST that RESP is the 0/1 indicator for response status and that the user would like to use the main effects of the categorical variables GENDER and RACE in the model. If the user also wants the interaction of GENDER and RACE, then, similar to all other SUDAAN procedures, they would add the term GENDER*RACE to the right-hand side of the MODEL statement. The user is also specifying that AGE be included in the model as a continuous variable.

Page 28

The output file

• TRIMFACTOR. This is the weight trimming factor t_k.
– In our example, this variable is assigned a value that will force the trimmed weight to equal 10 for those records where the input weight is <10 and 15000 for those records with input weight >15000. For records with input weight between 10 and 15000, the value of TRIMFACTOR will be equal to 1.0.

• ADJFACTOR. This will hold the values of the weight adjustment factors α_k.

Page 29

Suppose in this example that the weighted sums of explanatory variables are as displayed above

Then, WTADJUST is designed to yield model-based weight adjustments (α_k) that will force the adjusted weighted sum of the model explanatory variables to equal the control totals displayed above. In other words, if you were to compute the weighted sum of each explanatory variable using only those records that satisfy RESP=1 and using the adjusted sample weight WTFINAL, the totals you would obtain would equal those control totals.
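This calibration property can be checked directly in SAS after merging the output weight back onto the file. The dataset and indicator names below are illustrative assumptions, not from the slides.

```sas
* Verify the calibration: among respondents, the weighted sum of each  ;
* model column under the adjusted weight should hit its control total. ;
data check;
   set adjusted;            /* file with WTFINAL merged back in       */
   one  = 1;                /* intercept column of the model          */
   male = (gender = 1);     /* example 0/1 main-effect indicator      */
run;

proc means data=check sum;
   where resp = 1;          /* respondents only                       */
   weight wtfinal;          /* adjusted weight                        */
   var one male age;        /* SUM here is the weighted sum           */
run;
```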

Page 30

Now a post-stratification example

Suppose instead that we were interested in obtaining a post-stratification adjustment that would force the nonresponse-adjusted respondent weights to equal a set of known control totals.

Let’s say we merged our nonresponse-adjusted respondent weights back into the dataset and named them WTNONADJ. Then, getting the post-stratification totals is easy:

Page 31

Now a post-stratification example

We no longer need weight trimming or upper and lower bounds.

The POSTWGT statement contains the control totals for the post-stratification adjustment. These numbers should correspond, in order, to the B model parameters. Unless the NOINT option is specified, SUDAAN always includes an intercept in the model. Consequently, the first POSTWGT value corresponds to the overall control total – in this case, that would be 116900 + 39100 = 156000. The next eight numbers in the POSTWGT statement are control totals corresponding to the GENDER*AGEGRP*RACE interaction. Note that control totals should also be supplied for reference levels associated with any explanatory variable or interaction term.
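A hedged sketch of what the slide’s post-stratification call may have looked like follows. The eight cell control totals were not captured in this transcript and are left as a comment; only the overall total 156000 is from the slide, and the ADJUST=POST option spelling should be verified against the SUDAAN manual.

```sas
* Hypothetical reconstruction of the post-stratification run.          ;
proc wtadjust data=mydata design=wr adjust=post;
   nest    strata psu;
   weight  wtnonadj;          /* nonresponse-adjusted weight          */
   idvar   recnos;
   class   gender agegrp race;
   model   _one_ = gender*agegrp*race;   /* r_k = 1 for everyone      */
   postwgt 156000             /* overall (intercept) control total    */
           /* ...followed by the eight GENDER*AGEGRP*RACE control    */
           /* totals, in model order, including reference levels     */ ;
   output / filename=adjust2 replace;
run;
```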

Page 32

Page 33

Lab 5: Calculating sample weights