-
Missing Data Imputation:
What it is and why it's not cheating

Todd D. Little
University of Kansas
Director, Quantitative Training Program
Director, Center for Research Methods and Data Analysis
Director, Undergraduate Research Methods and Data Analysis Minor
Member, Developmental Psychology Training Program
www.Quant.KU.edu
Workshop presented 02-18-2010, University of Arizona
-
Missing Data
• Learn about the different types of missing data
• Learn about ways in which the missing data process
can be recovered
• Learn why not imputing missing data is likely to lead
to errors in generalization
• Understand why imputing missing data is not cheating
• Introduce a simple method for significance testing
• Discuss imputation with large datasets
• Questions about your missing data issues?
-
Types of Missing Data
• Missing Completely at Random (MCAR)
  • No association with unobserved variables (a selective process) and no association with observed variables
• Missing at Random (MAR)
  • No association with unobserved variables, but may be related to observed variables
  • Random in the statistical sense of predictable
• Non-random (Selective) Missing (NMAR)
  • Some association with unobserved variables and maybe with observed variables
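
To make the three mechanisms concrete, here is a minimal simulation sketch (not from the workshop; the variable names and cutoffs are illustrative) written as a SAS data step:

/* Hypothetical illustration: y is complete; three copies of y lose */
/* values under the three mechanisms defined above.                 */
data mechanisms;
   call streaminit(1234);
   do i = 1 to 500;
      x = rand('normal');                /* an observed predictor    */
      y = 0.5*x + rand('normal');        /* the variable of interest */
      y_mcar = y;                        /* MCAR: pure chance        */
      if rand('uniform') < 0.3 then y_mcar = .;
      y_mar = y;                         /* MAR: depends on observed x */
      if x > 0 and rand('uniform') < 0.5 then y_mar = .;
      y_nmar = y;                        /* NMAR: depends on y itself  */
      if y > 0 and rand('uniform') < 0.5 then y_nmar = .;
      output;
   end;
run;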
-
Key considerations of missing data handling
• Recoverability
  • Is it possible to estimate what the scores would have been if they were not missing?
• Bias
  • Are the statistics (e.g., means, variances/standard deviations, and covariances/correlations) the same as they would have been had there not been any missing data?
• Power
  • Do we have the same or similar rates of power (1 – Type II error rate) as we would without missing data?
-
Effects of imputing missing data
                               No Association with         An Association with
                               Observed Variable(s)        Observed Variable(s)

No Association with            MCAR                        MAR
Unobserved/Unmeasured          • Fully recoverable         • Partly to fully recoverable
Variable(s)                    • Fully unbiased            • Less biased to unbiased

An Association with            NMAR                        MAR/NMAR
Unobserved/Unmeasured          • Unrecoverable             • Partly recoverable
Variable(s)                    • Biased (same bias as      • Same to unbiased
                                 not estimating)
-
Effects of imputing missing data

                               No Association with        An Association with        An Association with
                               ANY Observed Variable      Analyzed Variables         Unanalyzed Variables

No Association with            MCAR                       MAR                        MAR
Unobserved/Unmeasured          • Partly to fully          • Partly to fully          • Partly to fully
Variable(s)                      recoverable                recoverable                recoverable
                               • Fully unbiased           • Less biased to           • Less biased to
                                                            unbiased                   unbiased

An Association with            NMAR                       MAR/NMAR                   MAR/NMAR
Unobserved/Unmeasured          • Unrecoverable            • Partly to fully          • Partly to fully
Variable(s)                    • Biased (same bias as       recoverable                recoverable
                                 not estimating)          • Same to unbiased         • Same to unbiased

Statistical Power: Will always be greater when missing data is imputed!
-
Words to live by
• I pity the fool who does not impute – Mr. T
• If you compute you must impute – Johnnie Cochran
• Go forth and impute with impunity – me
• If math is God's poetry, then statistics are God's most elegantly reasoned prose – Bill Bukowski
-
Modern Missing Data Analysis
• In 1977, Dempster, Laird, & Rubin formalized the Expectation Maximization (EM) algorithm
  – Showed the generality of an idea that had been proposed numerous times in specialized cases over the previous half century.
• Around the same time EM was introduced, Rubin proposed Multiple Imputation as an approach especially well suited for application in large public-use databases.
  – He first suggested the concept in 1978 and developed the technique more fully in 1987.
  – Multiple Imputation estimation also includes the Markov Chain Monte Carlo (MCMC) algorithm.
• Following the development of the EM algorithm, likelihood approaches to missing data analysis have continued to develop rapidly over the past thirty years.
  – Today, Full Information Maximum Likelihood (FIML) is a widely used and very powerful tool in the treatment of incomplete data.
-
Bad Missing Data Corrections
• List-wise Deletion
  • If a single data point is missing, delete the subject
  • N is uniform but small
  • Variances biased, means biased
  • Acceptable only if power is not an issue and the incomplete data is MCAR
• Pair-wise Deletion
  • If a data point is missing, delete the paired data points when calculating the correlation
  • N varies per correlation
  • Variances biased, means biased
  • Matrix often non-positive definite
  • Acceptable only if power is not an issue and the incomplete data is MCAR
-
Bad Imputation Techniques
• Sample-wise Mean Substitution
  • Use the mean of the sample for any missing value of a given individual
  • Variances reduced
  • Correlations biased
• Subject-wise Mean Substitution
  • Use the mean score of other items for a given missing value
  • Depends on the homogeneity of the items used
  • Is like regression imputation with regression weights fixed at 1.0
-
Questionable Imputation Techniques
• Regression Imputation – Focal Item Pool
  • Regress the variable with missing data onto the other items selected for a given analysis
  • Variances reduced
  • Assumes MCAR or MAR
• Regression Imputation – Full Item Pool
  • Variances reduced
  • Attempts to account for NMAR in as much as items in the pool correlate with the unobserved variables responsible for the missingness
-
Questionable Imputation Techniques
• Stochastic Regression Imputation
  • Same as above, but a random error component is added to the imputed value to reduce the loss in variance
  • Use of the full item pool increases the likelihood of greater recovery of missingness
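
A minimal sketch of what stochastic regression imputation can look like in SAS (the dataset and variable names are hypothetical): regress the incomplete variable on the item pool, then fill each missing value with its predicted value plus a random draw scaled by the residual standard error.

proc reg data=sample noprint outest=est(keep=_rmse_);
   model y = x1 x2 x3;        /* item pool predicts the incomplete item */
   output out=pred p=yhat;    /* predicted values for every case        */
run;

data imputed;
   if _n_ = 1 then do;
      call streaminit(2020);
      set est;                /* _RMSE_ is retained across iterations   */
   end;
   set pred;
   if y = . then y = yhat + rand('normal', 0, _rmse_);  /* stochastic error */
run;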
-
Good Model-based Imputation Techniques
• (But only if variables related to missingness are included in the analysis, or missingness is MCAR)
• Multiple-group SEM
  • Each pattern of missingness is treated as a group, and equality constraints are placed across groups for all estimates for which a group contributes observations
• Full Information Maximum Likelihood
  • Sufficient statistics are estimated with the Expectation Maximization (EM) algorithm
  • Those estimates then serve as the start values for the ML model estimation
-
Good Data Imputation Techniques
• (But only if variables related to missingness are included in the analysis, or missingness is MCAR)
• EM Imputation
  • Imputes the missing data values a number of times, starting with the E step
  • The E(xpectation) step is a stochastic regression-based imputation
  • The M(aximization) step calculates a complete covariance matrix based on the estimated values
  • The E step is repeated for each variable, but the regression is now based on the covariance matrix estimated in the M step
  • The M step is repeated until the imputed estimates don't differ from one iteration to the next
• MCMC imputation is a more flexible (but computer-intensive) algorithm.
-
Good Data Imputation Techniques
• (But only if variables related to missingness are included in analysis, or missingness is MCAR)
• Multiple (EM or MCMC) Imputation
  • Impute N (say, 20) datasets
  • Each dataset is based on a resampling plan of the original sample
  • Mimics a random selection of another sample from the population
  • Run your analyses N times
  • Calculate the mean and standard deviation of the N analyses
-
Missing Data and Estimation:
Missingness by Design
• Assessing all persons, but not all variables, at each time of measurement (McArdle; Graham)
  • Have a core battery for all participants, but divide the sample into groups, and each group has additional measures
• Control entry into the study to estimate the magnitude of retesting effects
  • Randomly assign participants to their entry into a longitudinal study
• Can be introduced easily (!?) into cohort studies
• Can be key in providing unbiased estimates of growth or change
-
3-Form Intentionally Missing Design

Form   Common Variables   Variable Set A   Variable Set B   Variable Set C
1      ¼ of variables     ¼ of variables   ¼ of variables   none
2      ¼ of variables     ¼ of variables   none             ¼ of variables
3      ¼ of variables     none             ¼ of variables   ¼ of variables
-
Estimate Missing Data With SAS

Obs BADL0 BADL1 BADL3 BADL6 MMSE0 MMSE1 MMSE3 MMSE6
1 65 95 95 100 23 25 25 27
2 10 10 40 25 25 27 28 27
3 95 100 100 100 27 29 29 28
4 90 100 100 100 30 30 27 29
5 30 80 90 100 23 29 29 30
6 40 50 . . 28 27 3 3
7 40 70 100 95 29 29 30 30
8 95 100 100 100 28 30 29 30
9 50 80 75 85 26 29 27 25
10 55 100 100 100 30 30 30 30
11 50 100 100 100 30 27 30 24
12 70 95 100 100 28 28 28 29
13 100 100 100 100 30 30 30 30
14 75 90 100 100 30 30 29 30
15 0 5 10 . 3 3 3 .
-
PROC MI

PROC MI data=sample out=outmi
        seed=37851 nimpute=100;
   EM maxiter=1000;
   MCMC initial=em (maxiter=1000);
   Var BADL0 BADL1 BADL3 BADL6 MMSE0 MMSE1 MMSE3 MMSE6;
run;

• out=
  • Designates the output file for the imputed data
• nimpute=
  • Number of imputed datasets
  • Default is 5
• Var
  • Variables to use in the imputation
-
PROC MI output: Imputed dataset

Obs _Imputation_ BADL0 BADL1 BADL3 BADL6 MMSE0 MMSE1 MMSE3 MMSE6
1 1 65 95 95 100 23 25 25 27
2 1 10 10 40 25 25 27 28 27
3 1 95 100 100 100 27 29 29 28
4 1 90 100 100 100 30 30 27 29
5 1 30 80 90 100 23 29 29 30
6 1 40 50 21 12 28 27 3 3
7 1 40 70 100 95 29 29 30 30
8 1 95 100 100 100 28 30 29 30
9 1 50 80 75 85 26 29 27 25
10 1 55 100 100 100 30 30 30 30
11 1 50 100 100 100 30 27 30 24
12 1 70 95 100 100 28 28 28 29
13 1 100 100 100 100 30 30 30 30
14 1 75 90 100 100 30 30 29 30
15 1 0 5 10 8 3 3 3 2
-
Expectation Maximization Algorithm
• The EM algorithm is a two-step iterative process
  • First, estimate the expected conditional loglikelihood of the data given a function of the missing values (i.e., the sufficient statistics) in the E-step
  • Second, maximize this conditional loglikelihood as if it were the complete-data loglikelihood to find values for the parameters in the M-step
  • These steps continue until the posterior distribution of the estimates reaches stationarity.
• Formally, the E-step is

  $Q(\theta \mid \theta^{(t)}) = \int \ell(\theta \mid y)\, f(Y_{miss} \mid Y_{obs}, \theta^{(t)})\, dY_{miss}$

  and the M-step chooses $\theta^{(t+1)}$ such that

  $Q(\theta^{(t+1)} \mid \theta^{(t)}) \ge Q(\theta \mid \theta^{(t)})$ for all $\theta$.
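
As a practical aside, PROC MI can run just the EM algorithm, with no imputation, by setting NIMPUTE=0. A minimal sketch using the variables from the earlier example:

proc mi data=sample nimpute=0 seed=37851;
   em maxiter=1000 outem=emcov;   /* converged EM means and covariances */
   var BADL0 BADL1 BADL3 BADL6 MMSE0 MMSE1 MMSE3 MMSE6;
run;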
-
Full Information Maximum Likelihood
• FIML minimizes the casewise -2 loglikelihood of the available data, computing an individual mean vector and covariance matrix for every observation.
  • Since each observation's mean vector and covariance matrix are based on its own unique response pattern, there is no need to fill in the missing data.
• The individual likelihood functions are then summed to create a combined likelihood function for the whole data frame.
  • Individual likelihood functions with greater amounts of missingness are given less weight in the final combined likelihood function than those with a more complete response pattern, thus controlling for the loss of information.
• Formally, the function that FIML minimizes is

  $-2\ell_{com} = \sum_{i=1}^{N}\left[K_i + \log\lvert\Sigma_i\rvert + (y_i - \mu_i)'\,\Sigma_i^{-1}\,(y_i - \mu_i)\right]$

  where $K_i = p_i \log(2\pi)$ and $p_i$ is the number of variables observed for case $i$.
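
In SAS, FIML estimation is available in PROC CALIS via METHOD=FIML (SAS/STAT 9.22 or later). A hedged sketch, fitting an illustrative two-factor model to the example variables:

proc calis data=sample method=fiml;   /* casewise ML on available data */
   factor
      Badl ===> BADL0 BADL1 BADL3 BADL6,
      Mmse ===> MMSE0 MMSE1 MMSE3 MMSE6;
run;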
-
Multiple Imputation
• Multiple imputation involves generating m imputed datasets (usually between 20 and 100), running the analysis model on each of these datasets, and combining the m sets of results to make inferences.
  – By filling in m separate estimates for each missing value we can account for the uncertainty in that datum's true population value.
• Datasets can be generated in a number of ways, but the two most common approaches are through an MCMC simulation technique such as Tanner & Wong's (1987) Data Augmentation algorithm or through bootstrapping likelihood estimates, such as the bootstrapped EM algorithm used by Amelia II.
  – SAS uses data augmentation to pull random draws from a specified posterior distribution (i.e., the stationary distribution of the EM estimates).
• After m datasets have been created and the analysis model has been run on each separately, the resulting estimates are usually combined with Rubin's Rules (Rubin, 1987).
-
Rubin’s Rules
• Formalized in Rubin's (1987) book Multiple Imputation for Nonresponse in Surveys, Rubin's Rules are a simple set of equations that allow unbiased* hypothesis testing by taking into consideration both the between- and within-imputation sources of variance.

Point estimate:                $\bar{Q} = \frac{1}{m}\sum_{t=1}^{m} Q^{(t)}$

Within-imputation variance:    $\bar{U} = \frac{1}{m}\sum_{t=1}^{m} U^{(t)}$

Between-imputation variance:   $B = \frac{1}{m-1}\sum_{t=1}^{m}\left(Q^{(t)} - \bar{Q}\right)\left(Q^{(t)} - \bar{Q}\right)'$

Total variance:                $T = \bar{U} + (1 + m^{-1})\,B$

* See the Motivation section
-
Rubin’s Rules
• Point Estimate (Q̄): the average estimate across the m imputations.
• Within-Imputation Variance (Ū): the average of the squared standard errors of a given parameter estimate.
• Between-Imputation Variance (B): the observed variance, across imputations, in the parameter estimate.
• Total Variance (T): Ū plus B weighted by (1 + 1/m), a factor reflecting the number of imputations.
• The standard error for the parameter is the square root of T.
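
A small worked example (numbers invented for illustration): suppose m = 3 imputations give estimates $Q^{(t)} = 0.50, 0.54, 0.52$, each with standard error 0.10. Then $\bar{Q} = 0.52$, $\bar{U} = 0.10^2 = 0.01$, $B = \frac{(-0.02)^2 + (0.02)^2 + 0^2}{2} = 0.0004$, and $T = 0.01 + (1 + \tfrac{1}{3})(0.0004) \approx 0.0105$, so the pooled standard error is $\sqrt{T} \approx 0.103$, slightly larger than 0.10, reflecting the added between-imputation uncertainty.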
-
Fraction Missing
• Fraction Missing is a measure of the information lost to nonresponse that takes into account the strength of the relationships in the data.
  – By accounting for the strength of the relationships in the data, Fraction Missing is a better measure of how well MI will recapture the true population covariance matrix than is Percent Missing.
• Fraction Missing is estimated after the m imputations.
• Fraction Missing is only reliably estimated with large numbers of imputations (e.g., m = 100).

Formally, the formula for Fraction Missing is

  $\gamma = \frac{r + 2/(df + 3)}{r + 1}$, where $r = \frac{(1 + m^{-1})\,B}{\bar{U}}$

is the relative increase in variance due to nonresponse.
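
Continuing the worked example from the Rubin's Rules slide: $r = (1 + \tfrac{1}{3})(0.0004)/0.01 \approx 0.053$, and with a hypothetical $df = 100$, $\gamma = (0.053 + 2/103)/1.053 \approx 0.069$. That is, roughly 7% of the information about that parameter was lost to nonresponse, even if a much larger share of cases had missing values.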
-
Appropriate m by Fraction Missing
• Graham, Olchowski, & Gilreath (2007) offered guidance on appropriate m based on acceptable power fall-off and fraction missing (see Table 5).
[Table 5 of Graham, Olchowski, & Gilreath (2007), recommended m as a function of fraction missing (γ), not reproduced here]
-
• EM doesn’t account for uncertainty in our estimates of the missing data, so it will lead to artificially precise estimates of the variance/covariance structure. – Although there has been work done to correct for this artificial precision using
correction terms (Savalei & Bentler, 2009), it is more parsimonious to base our technique on MI which implicitly accounts for the uncertainty in the estimates.
• FIML can become cumbersome to implement when a large number of covariates are require to support the assumption of MAR missingness. – Including covariates in the imputation model is simple, and there is no need to
include every variable in the imputation model in the analysis model.
• Parameter significance depends on the method of scale setting (Gonzalez & Griffin, 2001).– Combining standard errors using Rubin’s rules may lead to biased estimates
depending on the method of identification
• Chi-squared difference tests are unbiased (Gonzalez & Griffin, 2001). – The change in chi-squared test should lead us to the most accurate test for
parameter significance.
Motivation
-
Proposed Solution

• Generate multiply imputed datasets.
• Calculate a single covariance matrix on all N*m observations.
  – By combining information from all m datasets, this matrix should represent the best estimate of the population associations.
• Run the analysis model on this single covariance matrix and use the resulting estimates as the basis for inference and hypothesis testing.
  – The fit function from this approach should be the best basis for making inferences about model fit and significance.
• Using a Monte Carlo simulation, we test the hypothesis that this approach is reasonable.
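
A hedged SAS sketch of this pipeline (the dataset and model are illustrative; resetting the nominal N to the original sample size of 15 from the earlier example, rather than N*m, is an assumption about how one would avoid treating the stacked rows as independent):

proc corr data=outmi noprint outp=pooled;   /* one matrix from all N*m rows */
   var BADL0 BADL1 BADL3 BADL6 MMSE0 MMSE1 MMSE3 MMSE6;
run;

data pooled;
   set pooled;
   array v{*} BADL0 BADL1 BADL3 BADL6 MMSE0 MMSE1 MMSE3 MMSE6;
   if _type_ = 'N' then
      do j = 1 to dim(v);
         v{j} = 15;                         /* original N, not N*m          */
      end;
   drop j;
run;

proc calis data=pooled method=ml;           /* fit the analysis model to    */
   factor                                   /* the pooled matrix            */
      Badl ===> BADL0 BADL1 BADL3 BADL6,
      Mmse ===> MMSE0 MMSE1 MMSE3 MMSE6;
run;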
-
Population Model
[Path diagram: a two-factor population model with Factor A (indicators A1–A10) and Factor B (indicators B1–B10), factor variances fixed at 1. Fully standardized loadings range from .67 to .81, residual variances from .35 to .55, and the factor correlation is .52. Note: these are fully standardized parameter estimates.]

RMSEA = .047, CFI = .967, TLI = .962, SRMR = .021
-
Absolute Fit
Naïve Approach
[Figure: distribution of chi-squared values across replications by condition (Population, 10% Missing, 30% Missing, 50% Missing)]

Condition     PRB
10% Missing   20.78%
30% Missing   80.33%
50% Missing   186.45%
-
Absolute Fit
Correlation Matrix Technique
Condition     PRB
10% Missing   8.76%
30% Missing   32.69%
50% Missing   70.46%

[Figure: distribution of chi-squared values across replications by condition (Population, 10% Missing, 30% Missing, 50% Missing)]
-
Comparative Fit Index
Correlation Matrix Technique
Condition     PRB
10% Missing   -3.22%
30% Missing   -11.51%
50% Missing   -23.05%

[Figure: distribution of CFI values across replications by condition (Population, 10% Missing, 30% Missing, 50% Missing)]
-
Standardized Root Mean Residual
Correlation Matrix Technique
[Figure: distribution of SRMR values across replications by condition (Population, 10% Missing, 30% Missing, 50% Missing)]

Condition     PRB
10% Missing   -0.42%
30% Missing   9.33%
50% Missing   23.62%
-
Standardized Root Mean Residual
Naïve Approach
[Figure: distribution of SRMR values across replications by condition (Population, 10% Missing, 30% Missing, 50% Missing)]

Condition     PRB
10% Missing   9.93%
30% Missing   34.03%
50% Missing   69.62%
-
Change in Chi Squared Test
Correlation Matrix Technique
[Figure: distribution of the change in chi-squared across replications by condition (Population, 10% Missing, 30% Missing, 50% Missing)]

Condition     PRB
10% Missing   -2.95%
30% Missing   4.39%
50% Missing   6.08%