-
Missing Data Imputation:
What it is and why it's not cheating

Todd D. Little
University of Kansas
Director, Quantitative Training Program
Director, Center for Research Methods and Data Analysis
Director, Undergraduate Research Methods and Data Analysis Minor
Member, Developmental Psychology Training Program
www.Quant.KU.edu
Workshop presented 02-18-2010, University of Arizona
-
Missing Data
• Learn about the different types of missing data
• Learn about ways in which the missing data process
can be recovered
• Learn why not imputing missing data is likely to lead
to errors in generalization
• Understand why imputing missing data is not cheating
• Introduce a simple method for significance testing
• Discuss imputation with large datasets
• Questions about your missing data issues?
-
Types of Missing Data
• Missing Completely at Random (MCAR)
  • No association with unobserved variables (a selective process) and no association with observed variables
• Missing at Random (MAR)
  • No association with unobserved variables, but may be related to observed variables
  • Random in the statistical sense of predictable
• Non-random (Selective) Missing (NMAR)
  • Some association with unobserved variables and maybe with observed variables
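
To make the three mechanisms concrete, here is a minimal simulation sketch (not from the workshop; the variable names and cutoffs are illustrative) written as a SAS data step:

/* Hypothetical illustration: y is complete; three copies of y lose */
/* values under the three mechanisms defined above.                 */
data mechanisms;
   call streaminit(1234);
   do i = 1 to 500;
      x = rand('normal');                /* an observed predictor    */
      y = 0.5*x + rand('normal');        /* the variable of interest */
      y_mcar = y;                        /* MCAR: pure chance        */
      if rand('uniform') < 0.3 then y_mcar = .;
      y_mar = y;                         /* MAR: depends on observed x */
      if x > 0 and rand('uniform') < 0.5 then y_mar = .;
      y_nmar = y;                        /* NMAR: depends on y itself  */
      if y > 0 and rand('uniform') < 0.5 then y_nmar = .;
      output;
   end;
run;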
-
Key considerations of missing data handling
• Recoverability
  • Is it possible to estimate what the scores would have been if they were not missing?
• Bias
  • Are the statistics (e.g., means, variances/standard deviations, and covariances/correlations) the same as they would have been had there not been any missing data?
• Power
  • Do we have the same or similar rates of power (1 – Type II error rate) as we would without missing data?
-
Effects of imputing missing data
                               No Association with         An Association with
                               Observed Variable(s)        Observed Variable(s)

No Association with            MCAR                        MAR
Unobserved/Unmeasured          • Fully recoverable         • Partly to fully recoverable
Variable(s)                    • Fully unbiased            • Less biased to unbiased

An Association with            NMAR                        MAR/NMAR
Unobserved/Unmeasured          • Unrecoverable             • Partly recoverable
Variable(s)                    • Biased (same bias as      • Same to unbiased
                                 not estimating)
-
Effects of imputing missing data

                               No Association with        An Association with        An Association with
                               ANY Observed Variable      Analyzed Variables         Unanalyzed Variables

No Association with            MCAR                       MAR                        MAR
Unobserved/Unmeasured          • Partly to fully          • Partly to fully          • Partly to fully
Variable(s)                      recoverable                recoverable                recoverable
                               • Fully unbiased           • Less biased to           • Less biased to
                                                            unbiased                   unbiased

An Association with            NMAR                       MAR/NMAR                   MAR/NMAR
Unobserved/Unmeasured          • Unrecoverable            • Partly to fully          • Partly to fully
Variable(s)                    • Biased (same bias as       recoverable                recoverable
                                 not estimating)          • Same to unbiased         • Same to unbiased

Statistical Power: Will always be greater when missing data is imputed!
-
Words to live by
• I pity the fool who does not impute – Mr. T
• If you compute you must impute – Johnnie Cochran
• Go forth and impute with impunity – me
• If math is God's poetry, then statistics are God's most elegantly reasoned prose – Bill Bukowski
-
Modern Missing Data Analysis
• In 1977, Dempster, Laird, & Rubin formalized the Expectation Maximization (EM) algorithm
  – Showed the generality of an idea that had been proposed numerous times in specialized cases over the previous half century.
• Around the same time EM was introduced, Rubin proposed Multiple Imputation as an approach especially well suited for application in large public-use databases.
  – He first suggested the concept in 1978 and developed the technique more fully in 1987.
  – Multiple Imputation estimation also includes the Markov Chain Monte Carlo (MCMC) algorithm.
• Following the development of the EM algorithm, likelihood approaches to missing data analysis have continued to develop rapidly over the past thirty years.
  – Today, Full Information Maximum Likelihood (FIML) is a widely used and very powerful tool in the treatment of incomplete data.
-
Bad Missing Data Corrections
• List-wise Deletion
  • If a single data point is missing, delete the subject
  • N is uniform but small
  • Variances biased, means biased
  • Acceptable only if power is not an issue and the incomplete data is MCAR
• Pair-wise Deletion
  • If a data point is missing, delete the paired data points when calculating the correlation
  • N varies per correlation
  • Variances biased, means biased
  • Matrix often non-positive definite
  • Acceptable only if power is not an issue and the incomplete data is MCAR
-
Bad Imputation Techniques
• Sample-wise Mean Substitution
  • Use the mean of the sample for any missing value of a given individual
  • Variances reduced
  • Correlations biased
• Subject-wise Mean Substitution
  • Use the mean score of other items for a given missing value
  • Depends on the homogeneity of the items used
  • Is like regression imputation with regression weights fixed at 1.0
-
Questionable Imputation Techniques
• Regression Imputation – Focal Item Pool
  • Regress the variable with missing data onto the other items selected for a given analysis
  • Variances reduced
  • Assumes MCAR or MAR
• Regression Imputation – Full Item Pool
  • Variances reduced
  • Attempts to account for NMAR in as much as items in the pool correlate with the unobserved variables responsible for the missingness
-
Questionable Imputation Techniques
• Stochastic Regression Imputation
  • Same as above, but a random error component is added to the imputed value to reduce the loss in variance
  • Use of the full item pool increases the likelihood of greater recovery of missingness
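
A minimal sketch of what stochastic regression imputation can look like in SAS (the dataset and variable names are hypothetical): regress the incomplete variable on the item pool, then fill each missing value with its predicted value plus a random draw scaled by the residual standard error.

proc reg data=sample noprint outest=est(keep=_rmse_);
   model y = x1 x2 x3;        /* item pool predicts the incomplete item */
   output out=pred p=yhat;    /* predicted values for every case        */
run;

data imputed;
   if _n_ = 1 then do;
      call streaminit(2020);
      set est;                /* _RMSE_ is retained across iterations   */
   end;
   set pred;
   if y = . then y = yhat + rand('normal', 0, _rmse_);  /* stochastic error */
run;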
-
Good Model-based Imputation Techniques
• (But only if variables related to missingness are included in the analysis, or missingness is MCAR)
• Multiple-group SEM
  • Each pattern of missingness is treated as a group, and equality constraints are placed across groups for all estimates for which a group contributes observations
• Full Information Maximum Likelihood
  • Sufficient statistics are estimated with the Expectation Maximization (EM) algorithm
  • Those estimates then serve as the start values for the ML model estimation
-
Good Data Imputation Techniques
• (But only if variables related to missingness are included in the analysis, or missingness is MCAR)
• EM Imputation
  • Imputes the missing data values a number of times, starting with the E step
  • The E(xpectation) step is a stochastic regression-based imputation
  • The M(aximization) step calculates a complete covariance matrix based on the estimated values
  • The E step is repeated for each variable, but the regression is now based on the covariance matrix estimated in the M step
  • The M step is repeated until the imputed estimates don't differ from one iteration to the next
• MCMC imputation is a more flexible (but computer-intensive) algorithm.
-
Good Data Imputation Techniques
• (But only if variables related to missingness are included in analysis, or missingness is MCAR)
• Multiple (EM or MCMC) Imputation
  • Impute N (say, 20) datasets
  • Each dataset is based on a resampling plan of the original sample
  • Mimics a random selection of another sample from the population
  • Run your analyses N times
  • Calculate the mean and standard deviation of the N analyses
-
Missing Data and Estimation:
Missingness by Design
• Assessing all persons, but not all variables, at each time of measurement (McArdle; Graham)
  • Have a core battery for all participants, but divide the sample into groups, and each group has additional measures
• Control entry into the study to estimate the magnitude of retesting effects
  • Randomly assign participants to their entry into a longitudinal study
• Can be introduced easily (!?) into cohort studies
• Can be key in providing unbiased estimates of growth or change
-
3-Form Intentionally Missing Design

Form   Common Variables   Variable Set A   Variable Set B   Variable Set C
1      ¼ of variables     ¼ of variables   ¼ of variables   none
2      ¼ of variables     ¼ of variables   none             ¼ of variables
3      ¼ of variables     none             ¼ of variables   ¼ of variables
-
Estimate Missing Data With SAS

Obs BADL0 BADL1 BADL3 BADL6 MMSE0 MMSE1 MMSE3 MMSE6
1 65 95 95 100 23 25 25 27
2 10 10 40 25 25 27 28 27
3 95 100 100 100 27 29 29 28
4 90 100 100 100 30 30 27 29
5 30 80 90 100 23 29 29 30
6 40 50 . . 28 27 3 3
7 40 70 100 95 29 29 30 30
8 95 100 100 100 28 30 29 30
9 50 80 75 85 26 29 27 25
10 55 100 100 100 30 30 30 30
11 50 100 100 100 30 27 30 24
12 70 95 100 100 28 28 28 29
13 100 100 100 100 30 30 30 30
14 75 90 100 100 30 30 29 30
15 0 5 10 . 3 3 3 .
-
PROC MI

PROC MI data=sample out=outmi
        seed=37851 nimpute=100;
   EM maxiter=1000;
   MCMC initial=em (maxiter=1000);
   Var BADL0 BADL1 BADL3 BADL6 MMSE0 MMSE1 MMSE3 MMSE6;
run;

• out=
  • Designates the output file for the imputed data
• nimpute=
  • Number of imputed datasets
  • Default is 5
• Var
  • Variables to use in the imputation
-
PROC MI output: Imputed dataset

Obs _Imputation_ BADL0 BADL1 BADL3 BADL6 MMSE0 MMSE1 MMSE3 MMSE6
1 1 65 95 95 100 23 25 25 27
2 1 10 10 40 25 25 27 28 27
3 1 95 100 100 100 27 29 29 28
4 1 90 100 100 100 30 30 27 29
5 1 30 80 90 100 23 29 29 30
6 1 40 50 21 12 28 27 3 3
7 1 40 70 100 95 29 29 30 30
8 1 95 100 100 100 28 30 29 30
9 1 50 80 75 85 26 29 27 25
10 1 55 100 100 100 30 30 30 30
11 1 50 100 100 100 30 27 30 24
12 1 70 95 100 100 28 28 28 29
13 1 100 100 100 100 30 30 30 30
14 1 75 90 100 100 30 30 29 30
15 1 0 5 10 8 3 3 3 2
-
Expectation Maximization Algorithm
• The EM algorithm is a two-step iterative process
  • First, estimate the expected conditional loglikelihood of the data given a function of the missing values (i.e., the sufficient statistics) in the E-step
  • Second, maximize this conditional loglikelihood as if it were the complete-data loglikelihood to find values for the parameters in the M-step
  • These steps continue until the posterior distribution of the estimates reaches stationarity.
• Formally, the E-step is

  $Q(\theta \mid \theta^{(t)}) = \int \ell(\theta \mid y)\, f(Y_{miss} \mid Y_{obs}, \theta^{(t)})\, dY_{miss}$

  and the M-step chooses $\theta^{(t+1)}$ such that

  $Q(\theta^{(t+1)} \mid \theta^{(t)}) \ge Q(\theta \mid \theta^{(t)})$ for all $\theta$.
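
As a practical aside, PROC MI can run just the EM algorithm, with no imputation, by setting NIMPUTE=0. A minimal sketch using the variables from the earlier example:

proc mi data=sample nimpute=0 seed=37851;
   em maxiter=1000 outem=emcov;   /* converged EM means and covariances */
   var BADL0 BADL1 BADL3 BADL6 MMSE0 MMSE1 MMSE3 MMSE6;
run;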
-
Full Information Maximum Likelihood
• FIML minimizes the casewise -2 loglikelihood of the available data, computing an individual mean vector and covariance matrix for every observation.
  • Since each observation's mean vector and covariance matrix are based on its own unique response pattern, there is no need to fill in the missing data.
• The individual likelihood functions are then summed to create a combined likelihood function for the whole data frame.
  • Individual likelihood functions with greater amounts of missingness are given less weight in the final combined likelihood function than those with a more complete response pattern, thus controlling for the loss of information.
• Formally, the function that FIML minimizes is

  $-2\ell_{com} = \sum_{i=1}^{N}\left[K_i + \log\lvert\Sigma_i\rvert + (y_i - \mu_i)'\,\Sigma_i^{-1}\,(y_i - \mu_i)\right]$

  where $K_i = p_i \log(2\pi)$ and $p_i$ is the number of variables observed for case $i$.
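
In SAS, FIML estimation is available in PROC CALIS via METHOD=FIML (SAS/STAT 9.22 or later). A hedged sketch, fitting an illustrative two-factor model to the example variables:

proc calis data=sample method=fiml;   /* casewise ML on available data */
   factor
      Badl ===> BADL0 BADL1 BADL3 BADL6,
      Mmse ===> MMSE0 MMSE1 MMSE3 MMSE6;
run;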
-
Multiple Imputation
• Multiple imputation involves generating m imputed datasets (usually between 20 and 100), running the analysis model on each of these datasets, and combining the m sets of results to make inferences.
  – By filling in m separate estimates for each missing value we can account for the uncertainty in that datum's true population value.
• Datasets can be generated in a number of ways, but the two most common approaches are through an MCMC simulation technique such as Tanner & Wong's (1987) Data Augmentation algorithm or through bootstrapping likelihood estimates, such as the bootstrapped EM algorithm used by Amelia II.
  – SAS uses data augmentation to pull random draws from a specified posterior distribution (i.e., the stationary distribution of the EM estimates).
• After m datasets have been created and the analysis model has been run on each separately, the resulting estimates are usually combined with Rubin's Rules (Rubin, 1987).
-
Rubin’s Rules
• Formalized in Rubin's (1987) book Multiple Imputation for Nonresponse in Surveys, Rubin's Rules are a simple set of equations that allow unbiased* hypothesis testing by taking into consideration both the between- and within-imputation sources of variance.

Point estimate:                $\bar{Q} = \frac{1}{m}\sum_{t=1}^{m} Q^{(t)}$

Within-imputation variance:    $\bar{U} = \frac{1}{m}\sum_{t=1}^{m} U^{(t)}$

Between-imputation variance:   $B = \frac{1}{m-1}\sum_{t=1}^{m}\left(Q^{(t)} - \bar{Q}\right)\left(Q^{(t)} - \bar{Q}\right)'$

Total variance:                $T = \bar{U} + (1 + m^{-1})\,B$

* See the Motivation section
-
Rubin’s Rules
• Point Estimate (Q̄): the average estimate across the m imputations.
• Within-Imputation Variance (Ū): the average of the squared standard errors of a given parameter estimate.
• Between-Imputation Variance (B): the observed variance, across imputations, in the parameter estimate.
• Total Variance (T): Ū plus B weighted by (1 + 1/m), a factor reflecting the number of imputations.
• The standard error for the parameter is the square root of T.
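
A small worked example (numbers invented for illustration): suppose m = 3 imputations give estimates $Q^{(t)} = 0.50, 0.54, 0.52$, each with standard error 0.10. Then $\bar{Q} = 0.52$, $\bar{U} = 0.10^2 = 0.01$, $B = \frac{(-0.02)^2 + (0.02)^2 + 0^2}{2} = 0.0004$, and $T = 0.01 + (1 + \tfrac{1}{3})(0.0004) \approx 0.0105$, so the pooled standard error is $\sqrt{T} \approx 0.103$, slightly larger than 0.10, reflecting the added between-imputation uncertainty.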
-
Fraction Missing
• Fraction Missing is a measure of the information lost to nonresponse that takes into account the strength of the relationships in the data.
  – By accounting for the strength of the relationships in the data, Fraction Missing is a better measure of how well MI will recapture the true population covariance matrix than is Percent Missing.
• Fraction Missing is estimated after the m imputations.
• Fraction Missing is only reliably estimated with large numbers of imputations (e.g., m = 100).

Formally, the formula for Fraction Missing is

  $\gamma = \frac{r + 2/(df + 3)}{r + 1}$, where $r = \frac{(1 + m^{-1})\,B}{\bar{U}}$

is the relative increase in variance due to nonresponse.
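
Continuing the worked example from the Rubin's Rules slide: $r = (1 + \tfrac{1}{3})(0.0004)/0.01 \approx 0.053$, and with a hypothetical $df = 100$, $\gamma = (0.053 + 2/103)/1.053 \approx 0.069$. That is, roughly 7% of the information about that parameter was lost to nonresponse, even if a much larger share of cases had missing values.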
-
Appropriate m by Fraction Missing
• Graham, Olchowski, & Gilreath (2007) offered guidance on appropriate m based on acceptable power fall-off and fraction missing (see Table 5).
[Table 5 of Graham, Olchowski, & Gilreath (2007), recommended m as a function of fraction missing (γ), not reproduced here]
-
• EM doesn’t account for uncertainty in our estimates of the missing data, so it will lead to artificially precise estimates of the variance/covariance structure. – Although there has been work done to correct for this artificial precision using
correction terms (Savalei & Bentler, 2009), it is more parsimonious to base our technique on MI which implicitly accounts for the uncertainty in the estimates.
• FIML can become cumbersome to implement when a large number of covariates are require to support the assumption of MAR missingness. – Including covariates in the imputation model is simple, and there is no need to
include every variable in the imputation model in the analysis model.
• Parameter significance depends on the method of scale setting (Gonzalez & Griffin, 2001).– Combining standard errors using Rubin’s rules may lead to biased estimates
depending on the method of identification
• Chi-squared difference tests are unbiased (Gonzalez & Griffin, 2001). – The change in chi-squared test should lead us to the most accurate test for
parameter significance.
Motivation
-
Proposed Solution

• Generate multiply imputed datasets.
• Calculate a single covariance matrix on all N*m observations.
  – By combining information from all m datasets, this matrix should represent the best estimate of the population associations.
• Run the analysis model on this single covariance matrix and use the resulting estimates as the basis for inference and hypothesis testing.
  – The fit function from this approach should be the best basis for making inferences about model fit and significance.
• Using a Monte Carlo simulation, we test the hypothesis that this approach is reasonable.
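
A hedged SAS sketch of this pipeline (the dataset and model are illustrative; resetting the nominal N to the original sample size of 15 from the earlier example, rather than N*m, is an assumption about how one would avoid treating the stacked rows as independent):

proc corr data=outmi noprint outp=pooled;   /* one matrix from all N*m rows */
   var BADL0 BADL1 BADL3 BADL6 MMSE0 MMSE1 MMSE3 MMSE6;
run;

data pooled;
   set pooled;
   array v{*} BADL0 BADL1 BADL3 BADL6 MMSE0 MMSE1 MMSE3 MMSE6;
   if _type_ = 'N' then
      do j = 1 to dim(v);
         v{j} = 15;                         /* original N, not N*m          */
      end;
   drop j;
run;

proc calis data=pooled method=ml;           /* fit the analysis model to    */
   factor                                   /* the pooled matrix            */
      Badl ===> BADL0 BADL1 BADL3 BADL6,
      Mmse ===> MMSE0 MMSE1 MMSE3 MMSE6;
run;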
-
Population Model
[Path diagram: a two-factor population model with Factor A (indicators A1–A10) and Factor B (indicators B1–B10), factor variances fixed at 1. Fully standardized loadings range from .67 to .81, residual variances from .35 to .55, and the factor correlation is .52. Note: these are fully standardized parameter estimates.]

RMSEA = .047, CFI = .967, TLI = .962, SRMR = .021
-
Absolute Fit
Naïve Approach
[Figure: distribution of chi-squared values across replications by condition (Population, 10% Missing, 30% Missing, 50% Missing)]

Condition     PRB
10% Missing   20.78%
30% Missing   80.33%
50% Missing   186.45%
-
Absolute Fit
Correlation Matrix Technique
Condition     PRB
10% Missing   8.76%
30% Missing   32.69%
50% Missing   70.46%

[Figure: distribution of chi-squared values across replications by condition (Population, 10% Missing, 30% Missing, 50% Missing)]
-
Comparative Fit Index
Correlation Matrix Technique
Condition     PRB
10% Missing   -3.22%
30% Missing   -11.51%
50% Missing   -23.05%

[Figure: distribution of CFI values across replications by condition (Population, 10% Missing, 30% Missing, 50% Missing)]
-
Standardized Root Mean Residual
Correlation Matrix Technique
[Figure: distribution of SRMR values across replications by condition (Population, 10% Missing, 30% Missing, 50% Missing)]

Condition     PRB
10% Missing   -0.42%
30% Missing   9.33%
50% Missing   23.62%
-
Standardized Root Mean Residual
Naïve Approach
[Figure: distribution of SRMR values across replications by condition (Population, 10% Missing, 30% Missing, 50% Missing)]

Condition     PRB
10% Missing   9.93%
30% Missing   34.03%
50% Missing   69.62%
-
Change in Chi Squared Test
Correlation Matrix Technique
[Figure: distribution of the change in chi-squared across replications by condition (Population, 10% Missing, 30% Missing, 50% Missing)]

Condition     PRB
10% Missing   -2.95%
30% Missing   4.39%
50% Missing   6.08%