
  • Slide 1

    Missing Data Imputation: What it is and why it's not cheating

    Todd D. Little
    University of Kansas
    Director, Quantitative Training Program
    Director, Center for Research Methods and Data Analysis
    Director, Undergraduate Research Methods and Data Analysis Minor
    Member, Developmental Psychology Training Program
    www.Quant.KU.edu

    Workshop presented 02-18-2010, University of Arizona

  • Slide 2

    Missing Data

    • Learn about the different types of missing data
    • Learn about ways in which the missing data process can be recovered
    • Learn why not imputing missing data is likely to lead to errors in generalization
    • Understand why imputing missing data is not cheating
    • Introduce a simple method for significance testing
    • Discuss imputation with large datasets
    • Questions about your missing data issues?

  • Slide 3

    Types of Missing Data

    • Missing Completely at Random (MCAR)
      • No association with unobserved variables (the selective process) and no association with observed variables
    • Missing at Random (MAR)
      • No association with unobserved variables, but possibly related to observed variables
      • "Random" in the statistical sense of predictable
    • Non-random (Selective) Missing (NMAR)
      • Some association with unobserved variables, and possibly with observed variables

  • Slide 4

    Key considerations in missing data handling

    • Recoverability
      • Is it possible to estimate what the scores would have been had they not been missing?
    • Bias
      • Are the statistics (e.g., means, variances/standard deviations, and covariances/correlations) the same as they would have been had there not been any missing data?
    • Power
      • Do we have the same or similar power (1 – Type II error rate) as we would without missing data?

  • Slide 5

    Effects of imputing missing data

    • No association with unobserved/unmeasured variable(s):
      • No association with observed variable(s) → MCAR: fully recoverable; fully unbiased
      • An association with observed variable(s) → MAR: partly to fully recoverable; less biased to unbiased
    • An association with unobserved/unmeasured variable(s):
      • No association with observed variable(s) → NMAR: unrecoverable; biased (same bias as not estimating)
      • An association with observed variable(s) → MAR/NMAR: partly recoverable; same to unbiased

  • Slide 6

    Effects of imputing missing data

    • No association with unobserved/unmeasured variable(s):
      • No association with ANY observed variable → MCAR: partly to fully recoverable; fully unbiased
      • An association with analyzed variables → MAR: partly to fully recoverable; less biased to unbiased
      • An association with unanalyzed variables → MAR: partly to fully recoverable; less biased to unbiased
    • An association with unobserved/unmeasured variable(s):
      • No association with ANY observed variable → NMAR: unrecoverable; biased (same bias as not estimating)
      • An association with analyzed variables → MAR/NMAR: partly to fully recoverable; same to unbiased
      • An association with unanalyzed variables → MAR/NMAR: partly to fully recoverable; same to unbiased

    Statistical Power: Will always be greater when missing data is imputed!

  • Slide 7

    Words to live by

    • "I pity the fool who does not impute" – Mr. T
    • "If you compute you must impute" – Johnny Cochran
    • "Go forth and impute with impunity" – me
    • "If math is God's poetry, then statistics are God's most elegantly reasoned prose" – Bill Bukowski

  • Slide 8

    Modern Missing Data Analysis

    • In 1977, Dempster, Laird, & Rubin formalized the Expectation Maximization (EM) algorithm
      – Showed the generality of an idea that had been proposed numerous times in specialized cases over the previous half century
    • Around the same time EM was introduced, Rubin proposed Multiple Imputation as an approach especially well suited for application in large public-use databases
      – He first suggested the concept in 1978 and developed the technique more fully in 1987
      – Multiple Imputation estimation methods also include the Markov Chain Monte Carlo (MCMC) algorithm
    • Following the development of the EM algorithm, likelihood approaches to missing data analysis have continued to develop rapidly over the past thirty years
      – Today, Full Information Maximum Likelihood (FIML) is a widely used and very powerful tool in the treatment of incomplete data

  • Slide 9

    Bad Missing Data Corrections

    • List-wise Deletion
      • If a single data point is missing, delete the subject
      • N is uniform but small
      • Variances biased, means biased
      • Acceptable only if power is not an issue and the incomplete data is MCAR
    • Pair-wise Deletion
      • If a data point is missing, delete paired data points when calculating the correlation
      • N varies per correlation
      • Variances biased, means biased
      • Matrix often non-positive definite
      • Acceptable only if power is not an issue and the incomplete data is MCAR

  • Slide 10

    Bad Imputation Techniques

    • Sample-wise Mean Substitution
      • Use the mean of the sample for any missing value of a given individual
      • Variances reduced
      • Correlations biased
    • Subject-wise Mean Substitution
      • Use the mean score of the other items for a given missing value
      • Depends on the homogeneity of the items used
      • Is like regression imputation with regression weights fixed at 1.0 (see the small worked example below)
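    A minimal worked illustration of subject-wise (person-mean) substitution, using hypothetical item scores that are not from the slides: a respondent answers three of four items on a scale and the fourth is missing.

      % Hypothetical respondent: x_1 = 4, x_2 = 5, x_3 = 3 observed; x_4 missing.
      \hat{x}_4 = \frac{x_1 + x_2 + x_3}{3} = \frac{4 + 5 + 3}{3} = 4

    The imputed value is simply the person's mean on the observed items, which ignores how strongly the items actually relate to one another.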

  • Slide 11

    Questionable Imputation Techniques

    • Regression Imputation – Focal Item Pool
      • Regress the variable with missing data onto other items selected for a given analysis
      • Variances reduced
      • Assumes MCAR and MAR
    • Regression Imputation – Full Item Pool
      • Variances reduced
      • Attempts to account for NMAR in as much as items in the pool correlate with the unobserved variables responsible for the missingness

  • Slide 12

    Questionable Imputation Techniques

    • Stochastic Regression Imputation
      • Same as above, but a random error component is added to the imputed value to reduce the loss in variance
      • Use of the full item pool increases the likelihood of greater recovery of missingness

  • Slide 13

    Good Model-based Imputation Techniques

    • (But only if variables related to missingness are included in the analysis, or missingness is MCAR)
    • Multiple-group SEM
      • Each pattern of missingness is treated as a group, and equality constraints are placed across groups for all estimates for which a group contributes observations
    • Full Information Maximum Likelihood (FIML)
      • Sufficient statistics are estimated with the Expectation Maximization (EM) algorithm
      • Those estimates then serve as the start values for the ML model estimation

  • Slide 14

    Good Data Imputation Techniques

    • (But only if variables related to missingness are included in the analysis, or missingness is MCAR)
    • EM Imputation
      • Imputes the missing data values a number of times, starting with the E step
      • The E(stimate)-step is a stochastic regression-based imputation
      • The M(aximize)-step is to calculate a complete covariance matrix based on the estimated values
      • The E-step is repeated for each variable, but the regression is now on the covariance matrix estimated from the first E-step
      • The M-step is repeated until the imputed estimates don't differ from one iteration to the next
      • MCMC imputation is a more flexible (but computer-intensive) algorithm
      • (A minimal SAS sketch of running EM to convergence follows)
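    A minimal SAS sketch of running the EM step to convergence with PROC MI and saving the EM-estimated means and covariance matrix. The dataset and variable names follow the example introduced later on Slides 18–19; NIMPUTE=0 (run EM only, impute nothing) and the EM statement's OUTEM= option are used here.

      /* Run only the EM algorithm (no imputations) and save the EM-estimated
         means and covariance matrix to WORK.EMCOV. */
      proc mi data=sample nimpute=0 seed=37851;
         em maxiter=1000 outem=emcov;
         var BADL0 BADL1 BADL3 BADL6 MMSE0 MMSE1 MMSE3 MMSE6;
      run;

    The EM estimates in EMCOV are the kind of sufficient statistics that, per Slide 13, can serve as start values for ML model estimation.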

  • Slide 15

    Good Data Imputation Techniques

    • (But only if variables related to missingness are included in the analysis, or missingness is MCAR)
    • Multiple (EM or MCMC) Imputation
      • Impute N (say 20) datasets
      • Each dataset is based on a resampling plan of the original sample
      • Mimics a random selection of another sample from the population
      • Run your analyses N times
      • Calculate the mean and standard deviation of the N analyses

  • Slide 16

    Missing Data and Estimation: Missingness by Design

    • Assessing all persons, but not all variables, at each time of measurement
      • McArdle, Graham
      • Have a core battery for all participants, but divide the sample into groups, and each group has additional measures
    • Control entry into the study to estimate the magnitude of retesting effects
      • Randomly assign participants to their entry into a longitudinal study
    • Can be introduced easily (!?) into cohort studies
    • Can be key in providing unbiased estimates of growth or change

  • Slide 17

    3-Form Intentionally Missing Design

    Form   Common Variables   Variable Set A    Variable Set B    Variable Set C
    1      ¼ of variables     ¼ of variables    ¼ of variables    none
    2      ¼ of variables     ¼ of variables    none              ¼ of variables
    3      ¼ of variables     none              ¼ of variables    ¼ of variables

    (A small SAS sketch of the missingness pattern this design produces follows.)
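    A minimal SAS sketch of the data pattern a 3-form design produces. The dataset name (survey), the form indicator, and the item names A1–A3, B1–B3, C1–C3 are hypothetical stand-ins for the variable sets in the table above; in a real administration each form simply never includes the omitted set.

      /* Simulate the planned-missingness pattern of a 3-form design:
         every form gets the common block; each form omits one variable set. */
      data survey3form;
         set survey;
         if form = 1 then call missing(of C1-C3);        /* Form 1: no Set C */
         else if form = 2 then call missing(of B1-B3);   /* Form 2: no Set B */
         else if form = 3 then call missing(of A1-A3);   /* Form 3: no Set A */
      run;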

  • Slide 18

    Estimate Missing Data With SAS

    Obs BADL0 BADL1 BADL3 BADL6 MMSE0 MMSE1 MMSE3 MMSE6

    1 65 95 95 100 23 25 25 27

    2 10 10 40 25 25 27 28 27

    3 95 100 100 100 27 29 29 28

    4 90 100 100 100 30 30 27 29

    5 30 80 90 100 23 29 29 30

    6 40 50 . . 28 27 3 3

    7 40 70 100 95 29 29 30 30

    8 95 100 100 100 28 30 29 30

    9 50 80 75 85 26 29 27 25

    10 55 100 100 100 30 30 30 30

    11 50 100 100 100 30 27 30 24

    12 70 95 100 100 28 28 28 29

    13 100 100 100 100 30 30 30 30

    14 75 90 100 100 30 30 29 30

    15 0 5 10 . 3 3 3 .

  • 19www.Quant.KU.edu

    PROC MIPROC MI data=sample out=outmi

    seed = 37851 nimpute=100

    EM maxiter = 1000;

    MCMC initial=em (maxiter=1000);

    Var BADL0 BADL1 BADL3 BADL6 MMSE0 MMSE1 MMSE3 MMSE6;

    run;

    • out=• Designates output file for

    imputed data

    • nimpute = • # of imputed datasets• Default is 5

    • Var• Variables to use in imputation
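    A minimal SAS sketch of the analyze-and-pool step that follows PROC MI. The stacked output dataset outmi and the _Imputation_ index come from the PROC MI call above; the regression of BADL6 on BADL0 and MMSE0 is a purely hypothetical analysis model chosen for illustration.

      /* Fit the analysis model separately within each imputed dataset. */
      proc reg data=outmi outest=est covout noprint;
         model BADL6 = BADL0 MMSE0;
         by _Imputation_;
      run;

      /* Combine the 100 sets of estimates and standard errors with Rubin's Rules. */
      proc mianalyze data=est;
         modeleffects Intercept BADL0 MMSE0;
      run;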

  • Slide 20

    PROC MI output: Imputed dataset

    Obs _Imputation_ BADL0 BADL1 BADL3 BADL6 MMSE0 MMSE1 MMSE3 MMSE6

    1 1 65 95 95 100 23 25 25 27

    2 1 10 10 40 25 25 27 28 27

    3 1 95 100 100 100 27 29 29 28

    4 1 90 100 100 100 30 30 27 29

    5 1 30 80 90 100 23 29 29 30

    6 1 40 50 21 12 28 27 3 3

    7 1 40 70 100 95 29 29 30 30

    8 1 95 100 100 100 28 30 29 30

    9 1 50 80 75 85 26 29 27 25

    10 1 55 100 100 100 30 30 30 30

    11 1 50 100 100 100 30 27 30 24

    12 1 70 95 100 100 28 28 28 29

    13 1 100 100 100 100 30 30 30 30

    14 1 75 90 100 100 30 30 29 30

    15 1 0 5 10 8 3 3 3 2

  • Slide 21

    Expectation Maximization Algorithm

    • The EM algorithm is a two-step iterative process
      • First, it estimates the expected conditional loglikelihood of the data given a function of the missing values (i.e., the sufficient statistics) in the E-step
      • Second, it maximizes this conditional loglikelihood as if it were the complete-data loglikelihood to find values for the parameters in the M-step
    • These steps continue until the posterior distribution of the estimates reaches stationarity
    • Formally, the E-step is

      Q(\theta \mid \theta^{(t)}) = \int \ell(\theta \mid Y)\, f(Y_{miss} \mid Y_{obs}, \theta^{(t)})\, dY_{miss}

      and the M-step chooses \theta^{(t+1)} such that

      Q(\theta^{(t+1)} \mid \theta^{(t)}) \ge Q(\theta \mid \theta^{(t)}) \quad \text{for all } \theta

  • Slide 22

    Full Information Maximum Likelihood

    • FIML maximizes the casewise -2 loglikelihood of the available data to compute an individual mean vector and covariance matrix for every observation
      • Since each observation's mean vector and covariance matrix is based on its own unique response pattern, there is no need to fill in the missing data
    • Each individual likelihood function is then summed to create a combined likelihood function for the whole data frame
      • Individual likelihood functions with greater amounts of missing data are given less weight in the final combined likelihood function than those with a more complete response pattern, thus controlling for the loss of information
    • Formally, the combined -2 loglikelihood that FIML optimizes is

      -2 \ln L_{com} = \sum_{i=1}^{N} \left[ K_i + \ln\lvert\Sigma_i\rvert + (y_i - \mu_i)' \Sigma_i^{-1} (y_i - \mu_i) \right]

      where K_i = p_i \ln(2\pi) and p_i is the number of variables observed for case i
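    A small illustration (not from the slides) of how the casewise terms use only the observed variables. Suppose the model implies a mean vector \mu and covariance matrix \Sigma for three variables, and case i observed only variables 1 and 3:

      \mu_i = \begin{pmatrix} \mu_1 \\ \mu_3 \end{pmatrix}, \qquad
      \Sigma_i = \begin{pmatrix} \sigma_{11} & \sigma_{13} \\ \sigma_{31} & \sigma_{33} \end{pmatrix}, \qquad
      p_i = 2

    Case i then contributes K_i + \ln\lvert\Sigma_i\rvert + (y_i - \mu_i)' \Sigma_i^{-1} (y_i - \mu_i) with these 2 x 2 quantities, so no missing values ever have to be filled in.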

  • Slide 23

    Multiple Imputation

    • Multiple imputation involves generating m imputed datasets (usually between 20 and 100), running the analysis model on each of these datasets, and combining the m sets of results to make inferences
      – By filling in m separate estimates for each missing value we can account for the uncertainty in that datum's true population value
    • Datasets can be generated in a number of ways, but the two most common approaches are through an MCMC simulation technique such as Tanner & Wong's (1987) Data Augmentation algorithm or through bootstrapping likelihood estimates, such as the bootstrapped EM algorithm used by Amelia II
      – SAS uses data augmentation to pull random draws from a specified posterior distribution (i.e., the stationary distribution of the EM estimates)
    • After the m datasets have been created and the analysis model has been run on each separately, the resulting estimates are usually combined with Rubin's Rules (Rubin, 1987)

  • Slide 24

    Rubin's Rules

    • Formalized in Rubin's (1987) book Multiple Imputation for Nonresponse in Surveys, Rubin's Rules are a simple set of equations that allow unbiased* hypothesis testing by taking into consideration both the between- and within-imputation sources of variance.

    Point estimate:

      \bar{Q} = \frac{1}{m} \sum_{t=1}^{m} \hat{Q}^{(t)}

    Within-imputation variance:

      \bar{U} = \frac{1}{m} \sum_{t=1}^{m} U^{(t)}

    Between-imputation variance:

      B = \frac{1}{m-1} \sum_{t=1}^{m} \left( \hat{Q}^{(t)} - \bar{Q} \right) \left( \hat{Q}^{(t)} - \bar{Q} \right)'

    Total variance:

      T = \bar{U} + \left( 1 + m^{-1} \right) B

    * See Motivation section

  • Slide 25

    Rubin's Rules

    • Point Estimate (Q-bar): the average estimate across the m imputations
    • Within-Imputation Variance (U-bar): the average of the squared standard errors of a given parameter estimate
    • Between-Imputation Variance (B): the observed variance, across imputations, in the parameter estimate
    • Total (Combined) Variance (T): U-bar plus B weighted by a factor, (1 + 1/m), that depends on the number of imputations
    • The standard error for the parameter is the square root of T
    • (A small worked example with hypothetical numbers follows)
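    A worked example of Rubin's Rules using hypothetical numbers (and far fewer imputations than recommended, purely to keep the arithmetic short). Suppose m = 3 imputations give estimates \hat{Q}^{(t)} = 0.50, 0.54, 0.46 for one parameter, each with standard error 0.10 (so U^{(t)} = 0.01):

      \bar{Q} = \tfrac{1}{3}(0.50 + 0.54 + 0.46) = 0.50

      \bar{U} = 0.01

      B = \tfrac{1}{2}\left[ (0.00)^2 + (0.04)^2 + (-0.04)^2 \right] = 0.0016

      T = \bar{U} + \left(1 + \tfrac{1}{3}\right) B = 0.01 + 0.00213 \approx 0.0121, \qquad SE = \sqrt{T} \approx 0.110

    The pooled standard error (0.110) is larger than the average within-imputation standard error (0.10) because it also carries the between-imputation uncertainty.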

  • Slide 26

    Fraction Missing

    • Fraction Missing is a measure of information lost to nonresponse that takes into account the strength of the relationships in the data
      – By accounting for the strength of the relationships in the data, Fraction Missing is a better measure of how well MI will recapture the true population covariance matrix than is Percent Missing
    • Fraction Missing is estimated after the m imputations
    • Fraction Missing is only reliably estimated with large numbers of imputations (e.g., m = 100)
    • Formally, the formula for Fraction Missing is

      \gamma = \frac{r + 2/(df + 3)}{r + 1}
      \qquad \text{where} \qquad
      r = \frac{(1 + m^{-1})\, B}{\bar{U}}

    (A continuation of the worked example above follows.)
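    Continuing the hypothetical Rubin's Rules example above (m = 3, B = 0.0016, \bar{U} = 0.01), and taking df to be the usual multiple-imputation degrees of freedom, df = (m-1)(1 + 1/r)^2 — an assumption, since the slide does not define df:

      r = \frac{(1 + \tfrac{1}{3})(0.0016)}{0.01} \approx 0.213

      df = 2\,(1 + 1/0.213)^2 \approx 64.7

      \gamma = \frac{0.213 + 2/(64.7 + 3)}{0.213 + 1} \approx \frac{0.243}{1.213} \approx 0.20

    So roughly 20% of the information about this parameter is lost to nonresponse, even though the between-imputation variance looks small in absolute terms.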

  • Slide 27

    Appropriate m by Fraction Missing

    • Graham, Olchowski, & Gilreath (2007) offered guidance on the appropriate m based on acceptable power fall-off and fraction missing (γ); see their Table 5.

    [Table of recommended m by fraction missing (γ), reproduced from Graham et al. (2007), Table 5]

  • Slide 28

    Motivation

    • EM doesn't account for uncertainty in our estimates of the missing data, so it will lead to artificially precise estimates of the variance/covariance structure
      – Although there has been work to correct for this artificial precision using correction terms (Savalei & Bentler, 2009), it is more parsimonious to base our technique on MI, which implicitly accounts for the uncertainty in the estimates
    • FIML can become cumbersome to implement when a large number of covariates are required to support the assumption of MAR missingness
      – Including covariates in the imputation model is simple, and there is no need to include every variable from the imputation model in the analysis model
    • Parameter significance depends on the method of scale setting (Gonzalez & Griffin, 2001)
      – Combining standard errors using Rubin's Rules may lead to biased estimates depending on the method of identification
    • Chi-squared difference tests are unbiased (Gonzalez & Griffin, 2001)
      – The change in chi-squared test should lead us to the most accurate test of parameter significance

  • Slide 29

    Proposed Solution

    • Generate multiply imputed datasets
    • Calculate a single covariance matrix on all N*m observations
      – By combining information from all m datasets, this matrix should represent the best estimate of the population associations
    • Run the analysis model on this single covariance matrix and use the resulting estimates as the basis for inference and hypothesis testing
      – The fit function from this approach should be the best basis for making inferences about model fit and significance
    • Using a Monte Carlo simulation, we test the hypothesis that this approach is reasonable
    • (A minimal SAS sketch of the stacking step follows)
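    A minimal SAS sketch of the stacking step described above: compute one covariance matrix from all N*m imputed observations. It assumes the PROC MI output dataset outmi from Slide 19; the _Imputation_ index is simply ignored so that all imputed records are pooled. Feeding the resulting matrix to an SEM procedure is the analysis step and is not shown here.

      /* One covariance matrix across all N*m stacked imputed observations. */
      proc corr data=outmi cov outp=stackedcov noprint;
         var BADL0 BADL1 BADL3 BADL6 MMSE0 MMSE1 MMSE3 MMSE6;
      run;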

  • Slide 30

    Population Model

    [Path diagram: a two-factor population model with Factor A (indicators A1–A10) and Factor B (indicators B1–B10), factor variances fixed at 1; the diagram lists the fully standardized loading and residual variance for each indicator and the factor correlation]

    Note: These are fully standardized parameter estimates

    RMSEA = .047, CFI = .967, TLI = .962, SRMR = .021

  • Slide 31

    Absolute Fit: Naïve Approach

    [Figure: chi-squared across replications by condition (Population, 10% Missing, 30% Missing, 50% Missing)]

    Condition      PRB
    10% Missing    20.78%
    30% Missing    80.33%
    50% Missing    186.45%

  • Slide 32

    Absolute Fit: Correlation Matrix Technique

    [Figure: chi-squared across replications by condition (Population, 10% Missing, 30% Missing, 50% Missing)]

    Condition      PRB
    10% Missing    8.76%
    30% Missing    32.69%
    50% Missing    70.46%

  • Slide 33

    Comparative Fit Index: Correlation Matrix Technique

    [Figure: CFI across replications by condition (Population, 10% Missing, 30% Missing, 50% Missing)]

    Condition      PRB
    10% Missing    -3.22%
    30% Missing    -11.51%
    50% Missing    -23.05%

  • Slide 34

    Standardized Root Mean Residual: Correlation Matrix Technique

    [Figure: SRMR across replications by condition (Population, 10% Missing, 30% Missing, 50% Missing)]

    Condition      PRB
    10% Missing    -0.42%
    30% Missing    9.33%
    50% Missing    23.62%

  • Slide 35

    Standardized Root Mean Residual: Naïve Approach

    [Figure: SRMR across replications by condition (Population, 10% Missing, 30% Missing, 50% Missing)]

    Condition      PRB
    10% Missing    9.93%
    30% Missing    34.03%
    50% Missing    69.62%

  • Slide 36

    Change in Chi-Squared Test: Correlation Matrix Technique

    [Figure: change in chi-squared across replications by condition (Population, 10% Missing, 30% Missing, 50% Missing)]

    Condition      PRB
    10% Missing    -2.95%
    30% Missing    4.39%
    50% Missing    6.08%