multiple imputation: joint and conditional modeling of missing data
TRANSCRIPT
Mul$pleImputa$on
OctaviousTalbot&KazukiYoshidaDec16,2015
BIO235FinalProjectThisdocumentwascreatedbystudentstofulfillacourserequirement.Beawareofpoten$alerrors,andcheckwiththeoriginalpapers.ThereisacorrespondingreportdocumentathPps://github.com/kaz-yos/misc/blob/master/MI_Project.Rnw.pdf
Outline
• Background• Mul$pleImputa$on– JointDistribu$on– Condi$onalDistribu$on
• Compare/Contrast• Conclusion
Background
• Missingdataisanomnipresentproblemthataffectsalmostallrealdatasets.
• MIhasbecomeoneofthemostpopularmethodstoaddressmissingdata.
• WereviewmajorMIalgorithms,includingtheirrela$vestrengthsandweaknessesandimplica$onsforhigh-dimensionaldata.
Missingdataclassifica$on
• MissingCompletelyAtRandom(MCAR)
• MissingAtRandom(MAR)
• NotMissingAtRandom(NMAR)
Approaches
• Insufficient– Completecases,indicator,singleimputa$on
Approaches
• Insufficient– Completecases,indicator,singleimputa$on
• BePer– Mul$pleimputa$on
Approaches
• Insufficient– Completecases,indicator,singleimputa$on
• BePer– Mul$pleimputa$on– Likelihood-based– Weigh$ng
Approaches
• Insufficient– Completecases,indicator,singleimputa$on
• BePer– Mul$pleimputa$on– Likelihood-based– Weigh$ng
• Best
Approaches
• Insufficient– Completecases,indicator,singleimputa$on
• BePer– Mul$pleimputa$on– Likelihood-based– Weigh$ng
• Best– Preven$on
TheorybehindMI
• Posteriordistribu$onofquan$tyofinterestQgivenobserveddataonly
• Likelihood-basedapproachessuchasfullinforma$onmaximumlikelihood(FIML)modelthisexpressionitself.Butitcanbedifficult.
TheorybehindMI
• Posteriordistribu$onofquan$tyofinterestQgivenobserveddataonly
• Decomposeintomoretractableparts.– Distribu$onofQgivencompletedata(outcomemodel)
– Distribu$onofmissingdatagivenobserveddata(missingdatamodel)
– Integra$onovermissingdatadistribu$on
OverviewofMI
vanBuuren1999
Rubin’srule
OverviewofMI
Imputebasedonmissingdatamodel
Outcomemodelusingcompletedata
“Integrate”overimputeddatasets
Whatyouget
LiPle2002
MI:Twoapproachesfor
• Jointdistribu$onMI– U$lizesassumedjointdistribu$onofmissingandobserveddatatoimputemissingvalues
• Condi$onaldistribu$onMI– Modelsthecondi$onaldistribu$onofpar$allyobservedvalues(missingdata)
Jointapproach
• Twomainapproaches– Imputa$on-Posterior(IP)algorithm– Expecta$onMaximiza$on(EM)algorithm
• UsualAssump$ons– MVNjointdistribu$onforen$redataset– MAR
Jointapproach
Samplesfromdistribu$onofMVNparametersareobtained(MCMC).Samplesarecorrelated.UsingonechainforeachMVNisasolu$on.Implementedinnorm.
Pointes$matesofMVNparametersareobtained.Es$ma$onuncertaintyislost.BootstrappingEMisasolu$onforthis.Implementedinamelia.
Imputa$on-Posterior(IP)algorithm Expecta$on-Maximiza$on(EM)algorithm
King2001
EMwithbootstrap(amelia)
Honaker2015
->VaryingMVNparameteres$mates
Condi$onalapproach
• Modelsthemissing-nesswithindis$nctvariablessepeartelyanddoesnotassumejointdistribu$on.MARs$llholds.
Condi$onalapproach
• Modelsthemissing-nesswithindis$nctvariablessepeartelyanddoesnotassumejointdistribu$on.MARs$llholds.
vanBuuren2006
Comparison• JointDistribu$on– MVNcanbeanunreasonableassump$onwhendealingwithcategoricalvariablesandrequiresmoreumph
– Robustwhendealingwithcon$nuousvariables– Guaranteesconvergence(MCMC)
• Condi$onalDistribu$on– Rela$velymoreflexible– Theore$calconvergencepimalls– Robustinsimula$on
High-dimensionaldata
• ThejointMIhasanissuewithahugecovariancematrixmanyparameters,whereasthecondi$onalMIhasanoverfinngissueforeachregressionmodel.
• Introducingstructuresforthecovariancematrix(jointMI)[1]andusingregulariza$on(condi$onalMI)[2]havebeenexamined.
• Widelyavailablesoqwareimplementa$onsarelacking.
[1]He2014;[2]Zhao2013
Rpackages
SeebelowforRcodeexampleshPp://rpubs.com/kaz_yos/mi-examples
R:miceadds(highdimensionalFCS(condi$onal)throughPLS)SASPROCMI:EMandMCMC(joint)andFCS(condi$onal)Stata:miimputemvn(joint,MCMC),ice(condi$onal),andsmcfcs(condi$onal)
Conclusion
• Thejointapproachistheore$callymoresound• Thecondi$onalapproaches$matesthejointapproachandalthoughithasbeeneffec$veinsimula$onsitisnottheore$callyguaranteed.
• Bothmethodshavedifficultywithhigh-dimensionaldatawherethenumberofcovariatesarelargerthanthenumberofobserva$ons.