IMPUTATION OF COMPLEX DATA WITH R-PACKAGE VIM: TRADITIONAL AND NEW METHODS BASED ON ROBUST ESTIMATION
Key Invited Paper
Matthias Templ1,2, Alexander Kowarik1, Peter Filzmoser2
1 Department of Methodology, Statistics Austria
2 Department of Statistics and Probability Theory, TU WIEN, Austria
UNECE Worksession on data editing, Ljubljana, May 10, 2011
Templ, et al. (STAT, TU) Robust Imputation Ljubljana, Mai 10, 2011 1 / 50
Content
1 Overview
2 Visualisation Tools
3 Robust Imputation: Motivation
4 Challenges
5 IRMI
6 Simulation results
7 Results from real-world data
8 Small walkthrough to VIM
9 Conclusion
Overview
R-package VIM
VIM = Visualization and Imputation of Missings
Univariate, bivariate, multiple and multivariate plot methods to highlight missing values in complex data sets and to learn about their structure (MCAR, MAR, MNAR). Comes with a GUI as well.
Hot-deck, k-NN and EM-based (robust) imputation methods for complex data sets. For reasons of time we mostly concentrate on EM-based imputation. For hot-deck and k-NN, please have a look at the paper.
VIM-book in 2012
Visualisation Tools
The GUI . . .
[Figure: left panel, proportion of missings (0.00 to 0.10) for the variables py010n, py035n, py050n, py070n, py080n, py090n, py100n, py110n, py120n, py130n and py140n; right panel, combinations of missing values across the same variables.]
[Figure: P033000 (axis 0 to 35, plus a separate "missing" category) against the proportion of missing/observed values in py010n (0.0 to 1.0).]
The GUI - Imputation
kNN(data, variable = colnames(data), metric = NULL, k = 5,
    dist_var = colnames(data), weights = NULL, numFun = median,
    catFun = maxCat, makeNA = NULL, NAcond = NULL, impNA = TRUE,
    donorcond = NULL, mixed = vector(), trace = FALSE,
    imp_var = TRUE, imp_suffix = "imp", addRandom = FALSE)
hotdeck(data, variable = colnames(data), ord_var = NULL,
        domain_var = NULL, makeNA = NULL, NAcond = NULL, impNA = TRUE,
        donorcond = NULL, imp_var = TRUE, imp_suffix = "imp")
irmi(x, eps = 0.01, maxit = 100, mixed = NULL,
     step = FALSE, robust = FALSE, takeAll = TRUE,
     noise = TRUE, noise.factor = 1, force = FALSE,
     robMethod = "lmrob", force.mixed = TRUE,
     mi = 1, trace = FALSE)
Robust Imputation: Motivation
Median Imputation
[Figure: scatterplot of y (-3 to 3) against x (-2 to 10). Legend: fully observed; missings in x; missings in y; imputed values.]
kNN Imputation
[Figure: scatterplot of y (-3 to 3) against x (-2 to 10). Legend: fully observed; missings in x; missings in y; imputed values.]
IVEWARE
[Figure: scatterplot of y against x. Legend: fully observed; missings in x; missings in y; imputed values.]
IRMI
[Figure: scatterplot of y against x. Legend: fully observed; missings in x; missings in y; imputed values.]
Median Imputation
[Figure: scatterplot of y (-3 to 2) against x (-2 to 10). Legend: original data; imputed data values; tolerance ellipse from original data; tolerance ellipse from imputed data.]
kNN Imputation
[Figure: scatterplot of y (-3 to 2) against x (-2 to 10). Legend: original data; imputed data values; tolerance ellipse from original data; tolerance ellipse from imputed data.]
IVEWARE
[Figure: scatterplot. Legend: original data values; imputed data values; original data; imputed data.]
IRMI
[Figure: scatterplot. Legend: original data values; imputed data values; original data; imputed data.]
Median Imputation
[Figure: scatterplot of y (-5 to 25) against x (-2 to 8). Legend: tolerance ellipse from original data; tolerance ellipse from imputed data.]
kNN Imputation
[Figure: scatterplot of y (-5 to 25) against x (-2 to 8). Legend: tolerance ellipse from original data; tolerance ellipse from imputed data.]
IVEWARE
[Figure: scatterplot. Legend: original data values; imputed data values; original data; imputed data.]
IRMI
[Figure: scatterplot. Legend: original data values; imputed data values; original data; imputed data.]
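The effect of median imputation on the multivariate structure can be illustrated on simulated data. The following is a minimal sketch in base R, not the data behind the slides: the simulated variables, the seed, the 30% MCAR rate and the use of ordinary least squares with added noise (as a non-robust stand-in for model-based imputation) are all assumptions made for illustration.

```r
set.seed(1)
n <- 200
x <- rnorm(n, mean = 4, sd = 2)
y <- 0.5 * x - 2 + rnorm(n, sd = 0.5)

# Set 30% of y to missing completely at random (MCAR)
y_mis <- y
y_mis[sample(n, 60)] <- NA
miss <- is.na(y_mis)

# Median imputation: every gap receives the same value
y_med <- y_mis
y_med[miss] <- median(y_mis, na.rm = TRUE)

# Regression imputation with noise (illustrative stand-in, not IRMI itself)
fit <- lm(y_mis ~ x)                      # fitted on the complete cases only
y_reg <- y_mis
y_reg[miss] <- predict(fit, newdata = data.frame(x = x[miss])) +
  rnorm(sum(miss), sd = summary(fit)$sigma)

# Compare how well each approach preserves the correlation between x and y
cor_orig <- cor(x, y)
cor_med  <- cor(x, y_med)
cor_reg  <- cor(x, y_reg)
print(round(c(original = cor_orig, median = cor_med, regression = cor_reg), 3))
```

In this setting the median-imputed values line up horizontally and pull the correlation towards zero, while the regression-based values stay close to the original structure.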
Challenges
Some Challenges
Mixed type of variables: some variables are nominal, some may be ordinal, and some are continuous.
Semi-continuous variables: "semi-continuous" distributions, i.e. a variable consisting of a continuous part and a certain proportion of equal values.
Far from normality: real-world data virtually always include outlying observations.
Multiple imputation: imputed values must both reflect the multivariate structure of the data and include "randomness".
MIX, MICE, MI, MITOOLS, . . .
All missing values are imputed with simulated values drawn from their predictive distribution given the observed data and the specified parameter.
−→ based on sequential regressions.
EM-based
In general, problems often arise when these methods are applied to complex data sets.
. . . and, of course, they are strongly driven by influential points, i.e. representative and non-representative outliers.
IVEWARE
Very popular software used in many applications in Official Statistics.
Similar to the previously mentioned methods.
The imputations are obtained by fitting a sequence of (Bayesian) regression models and drawing values from the corresponding predictive distributions.
Sequential imputation: in each step, one variable serves as response and certain other variables serve as predictors. Fit a certain model using the observed part of the response and estimate (update) the (formerly) missing values in the response.
Initialization loop: . . .
Second outer loop: estimates of missing values are updated sequentially, using one variable as response and all other variables as predictors, until convergence.
Since missing values are drawn from their predictive distribution given the observed data and the specified parameter, the procedure allows multiple imputation.
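The sequential updating described above can be sketched in a few lines of base R. This is only an illustration under simplifying assumptions, not IVEWARE's or IRMI's actual implementation: the function name `seq_reg_impute` is hypothetical, all variables are assumed numeric, gaps are initialised with column medians, plain least squares replaces the Bayesian or robust fits, and no random draws are made (so it yields a single imputation only).

```r
# Hypothetical helper: sequential regression imputation for numeric data frames
seq_reg_impute <- function(data, eps = 1e-4, maxit = 50) {
  miss <- is.na(data)                      # logical matrix of originally missing cells
  imp <- data
  # Initialisation (assumption): fill every gap with the column median
  for (j in seq_along(imp)) {
    imp[[j]][miss[, j]] <- median(data[[j]], na.rm = TRUE)
  }
  vars <- names(imp)[colSums(miss) > 0]    # only variables that need imputation
  for (it in seq_len(maxit)) {
    old <- imp
    for (v in vars) {
      obs <- !miss[, v]
      # One variable as response, all other variables as predictors,
      # fitted on the observed part of the response
      fit <- lm(reformulate(setdiff(names(imp), v), response = v),
                data = imp[obs, , drop = FALSE])
      # Update the formerly missing values of the response
      imp[[v]][!obs] <- predict(fit, newdata = imp[!obs, , drop = FALSE])
    }
    # Stop when the imputed values no longer change noticeably
    change <- sum((as.matrix(imp)[miss] - as.matrix(old)[miss])^2)
    if (change < eps) break
  }
  imp
}

# Usage on simulated data with MCAR gaps in two of three variables
set.seed(10)
n <- 120
d <- data.frame(a = rnorm(n))
d$b <- 1 + 2 * d$a + rnorm(n, sd = 0.3)
d$c <- -1 + d$a - d$b + rnorm(n, sd = 0.3)
holed <- d
holed$b[sample(n, 15)] <- NA
holed$c[sample(n, 15)] <- NA
imp <- seq_reg_impute(holed)
print(colSums(is.na(imp)))
```

Observed values are never altered; only the cells flagged in `miss` are updated in each pass.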
IRMI
IRMI
Only the second outer loop is used (missing values are initialised in another manner).
In contrast to IVEWARE we use quite different regression methods → robust methods (Note: a lot of problems have to be solved when using robust methods for complex data like EU-SILC).
Alternatively, stepwise model selection tools are integrated using AIC or BIC.
Multiple imputation is provided.
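Stepwise selection by AIC or BIC is available in base R through `stats::step()`. The sketch below only illustrates that idea on simulated data; how IRMI applies stepwise selection internally is not shown in the slides, and the data and variable names here are assumptions.

```r
set.seed(7)
n <- 150
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 1 + 2 * d$x1 + rnorm(n)            # only x1 carries signal

full <- lm(y ~ x1 + x2 + x3, data = d)

# AIC-based selection (penalty k = 2 is the default of step())
sel_aic <- step(full, trace = 0)

# BIC-based selection: same procedure with penalty k = log(n)
sel_bic <- step(full, k = log(n), trace = 0)

print(names(coef(sel_aic)))
print(names(coef(sel_bic)))
```

Because BIC penalises additional terms more heavily than AIC for n > 7, it tends to keep fewer predictors.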
Selection of Regression Models
If the response is
continuous, robust (IRMI) or OLS (IMI, IVEWARE) regression methods are used.
categorical, generalized linear regression is applied (IRMI: robust or non-robust).
binary, logistic regression is applied (IRMI: robust, but non-robust is preferred).
mixed, a two-stage approach is used: in the first stage, logistic regression is applied to decide whether a missing value is imputed with zero or by applying robust regression based on the continuous part of the response.
count, robust generalized linear models (family: Poisson) are used.
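The two-stage approach for a mixed (semi-continuous) response can be sketched as follows. This is a simplified illustration in base R, assuming a single predictor, a 0.5 probability cut-off for the zero/non-zero decision, and ordinary least squares in place of the robust regression used by IRMI; none of these choices are taken from the slides.

```r
set.seed(42)
n <- 300
x <- rnorm(n)
# Semi-continuous response: a point mass at zero plus a continuous part
nonzero_true <- rbinom(n, 1, plogis(0.5 + x)) == 1
y <- ifelse(nonzero_true, 10 + 2 * x + rnorm(n), 0)

y_mis <- y
y_mis[sample(n, 50)] <- NA                 # MCAR gaps in the response
obs <- !is.na(y_mis)
d <- data.frame(x = x, y = y_mis, nonzero = as.integer(y_mis != 0))

# Stage 1: logistic regression decides between zero and non-zero
stage1 <- glm(nonzero ~ x, family = binomial, data = d[obs, ])
p_nonzero <- predict(stage1, newdata = d[!obs, ], type = "response")

# Stage 2: regression on the continuous (non-zero) part only
# (lm is a non-robust stand-in for the robust fit)
pos <- obs & d$nonzero == 1
stage2 <- lm(y ~ x, data = d[pos, ])
y_cont <- predict(stage2, newdata = d[!obs, ])

# Combine: impute zero or the regression prediction (0.5 cut-off is an assumption)
y_imp <- y_mis
y_imp[!obs] <- ifelse(p_nonzero > 0.5, y_cont, 0)
print(table(imputed_zero = y_imp[!obs] == 0))
```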
Simulation results Measures of information loss
Errors from Categorical and Binary Variables
This error measure is defined as the proportion of imputed values taken from an incorrect category among all missing categorical or binary values:
\[
\mathrm{err}_c = \frac{1}{m_c} \sum_{j=1}^{p_c} \sum_{i=1}^{n} I\left(x_{ij}^{\mathrm{orig}} \neq x_{ij}^{\mathrm{imp}}\right), \qquad (1)
\]
with $I$ the indicator function, $m_c$ the number of missing values in the $p_c$ categorical variables, and $n$ the number of observations.
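The error measure for categorical and binary variables translates directly into base R. In this sketch the function name `err_c` and the argument `miss` (a logical matrix flagging the cells that were missing before imputation) are illustrative assumptions; cells that were never missing are identical in both data sets and therefore contribute nothing to the double sum, so only the flagged cells are compared.

```r
# Proportion of imputed categorical values that fall into a wrong category
err_c <- function(x_orig, x_imp, miss) {
  m_c <- sum(miss)                               # number of missing categorical values
  wrong <- as.matrix(x_orig)[miss] != as.matrix(x_imp)[miss]
  sum(wrong) / m_c
}

# Toy example: three missing cells, two of them imputed with a wrong category
orig <- data.frame(a = c("x", "y", "x", "y"), b = c("u", "u", "v", "v"))
imp  <- data.frame(a = c("x", "x", "x", "y"), b = c("u", "u", "v", "u"))
miss <- matrix(c(FALSE, TRUE, TRUE, FALSE,
                 FALSE, FALSE, FALSE, TRUE), ncol = 2)
res_c <- err_c(orig, imp, miss)
print(res_c)   # 2 of 3 imputed cells are wrong
```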
Errors from Continuous and Semi-continuous Variables
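Here we assume that the constant part of the semi-continuous variable is zero. Then, the joint error measure is
\[
\mathrm{err}_s = \frac{1}{m_s} \sum_{j=1}^{p_s} \sum_{i=1}^{n} \left[ \left| \frac{x_{ij}^{\mathrm{orig}} - x_{ij}^{\mathrm{imp}}}{x_{ij}^{\mathrm{orig}}} \right| \cdot I\left(x_{ij}^{\mathrm{orig}} \neq 0 \wedge x_{ij}^{\mathrm{imp}} \neq 0\right) + I\left(\left(x_{ij}^{\mathrm{orig}} = 0 \wedge x_{ij}^{\mathrm{imp}} \neq 0\right) \vee \left(x_{ij}^{\mathrm{orig}} \neq 0 \wedge x_{ij}^{\mathrm{imp}} = 0\right)\right) \right] \qquad (6)
\]
with $m_s$ the number of missing values in the $p_s$ continuous and semi-continuous variables.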
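The joint error measure for continuous and semi-continuous variables can be computed in the same way. As before, the function name `err_s` and the logical matrix `miss` are assumptions for illustration; the relative error is used when both values are non-zero, and a penalty of 1 is added when exactly one of the two values is zero.

```r
# Joint error for continuous and semi-continuous variables (constant part = 0)
err_s <- function(x_orig, x_imp, miss) {
  o <- as.matrix(x_orig)[miss]
  i <- as.matrix(x_imp)[miss]
  both_nonzero <- o != 0 & i != 0
  one_zero <- xor(o == 0, i == 0)          # exactly one of the two values is zero
  rel <- numeric(length(o))
  rel[both_nonzero] <- abs((o[both_nonzero] - i[both_nonzero]) / o[both_nonzero])
  sum(rel + one_zero) / length(o)          # length(o) equals m_s
}

# Toy example with four missing cells and one observed cell
orig <- cbind(v = c(10, 0, 5, 0, 7))
imp  <- cbind(v = c( 8, 0, 0, 3, 7))
miss <- cbind(v = c(TRUE, TRUE, TRUE, TRUE, FALSE))
res_s <- err_s(orig, imp, miss)
print(res_s)   # (0.2 + 0 + 1 + 1) / 4 = 0.55
```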
Simulation results Specific simulation results
Simulation Results: Varying the Correlation
[Figure: two panels showing err_c (0.0 to 0.5) and err_s (0.00 to 0.30) against the correlation (0.1 to 0.9) for IRMI, IMI and IVEWARE.]
Simulation Results: Varying the Number of Variables
[Figure: two panels showing err_c (0.00 to 0.35) and err_s (0.00 to 0.20) against the number of variables (4 to 36) for IRMI, IMI and IVEWARE.]
Including (moderate) Outliers and Varying their Amount, high Correlation
[Figure: two panels showing err_c (0.00 to 0.35) and err_s (0.00 to 0.35) against the percentage of outliers (0 to 50) for IRMI, IMI and IVEWARE.]
Including (moderate) Outliers and Varying their Amount, low Correlation
[Figure: two panels showing err_c (0.0 to 0.4) and err_s (0.00 to 0.35) against the percentage of outliers (0 to 50) for IRMI, IMI and IVEWARE.]
Results from real-world data
Imputation in EU-SILC
We considered certain HH-components, but also some nominal variables,such as household size, region and htype3.
1 Set R = 0.
2 Randomly set values in the HH-components to missing (MCAR) and increment R.
3 Impute the missing values.
4 Evaluate the imputations using certain information loss measures.
5 Go to (2) until R = 100.
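The evaluation loop above can be sketched in base R. EU-SILC data are not used here; the simulated data frame, the 10% missing rate, the mean relative error as information loss measure and the median-based `impute_median` helper (a placeholder for whichever imputation method is being evaluated) are all assumptions for illustration.

```r
set.seed(123)
n <- 100
full <- data.frame(x1 = rnorm(n, mean = 50, sd = 10))
full$x2 <- 2 * full$x1 + rnorm(n, sd = 5)

# Placeholder imputation method (hypothetical helper): column medians
impute_median <- function(d) {
  for (j in seq_along(d)) {
    d[[j]][is.na(d[[j]])] <- median(d[[j]], na.rm = TRUE)
  }
  d
}

R <- 100
errs <- numeric(R)
for (r in seq_len(R)) {
  # Step 2: set values to missing completely at random (MCAR)
  miss <- matrix(runif(n * ncol(full)) < 0.1, nrow = n)
  holed <- full
  holed[miss] <- NA
  # Step 3: impute the missing values
  imputed <- impute_median(holed)
  # Step 4: evaluate, here with the mean relative error of the imputed cells
  o <- as.matrix(full)[miss]
  i <- as.matrix(imputed)[miss]
  errs[r] <- mean(abs((o - i) / o))
}
print(summary(errs))
```

Repeating the loop for each candidate method on the same missingness patterns gives error distributions that can be compared side by side.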
Imputation in EU-SILC, Results
[Figure: err_s (20 to 100) for IRMI, IMI and IVEWARE.]
CENSUS Data - no outliers
[Figure: two panels comparing IRMI, IMI and IVEWARE: (i) error for categorical variables; (j) error for numerical variables.]
Airquality Data
[Figure: err_s (0.5 to 1.5) for IRMI, IMI and IVEWARE.]
Small walkthrough to VIM
Example Data: SBS data
[Figure: proportion of missings (0.00 to 0.05) and combinations of missing values for the SBS variables Umsatz, B31, B41, B23, a1, a2, a6, a25, e2 and Besch.]
Most important functionality for imputation
Listing 1: Hotdeck imputation.
hotdeck(x, ord_var=c("Besch","Umsatz"), imp_var=FALSE)
Listing 2: k-nearest neighbor imputation.
kNN(x)
Listing 3: Application of robust iterative model-based imputation.
imp <- irmi(x)
. . . sensible defaults!
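The three listings can be tried end to end on a small example. The sketch below assumes that the VIM package is installed and uses its bundled `sleep` data set in place of the SBS data `x`, which is not available here; argument names follow the signatures shown earlier and may differ between VIM versions. The code is skipped when VIM is not installed.

```r
if (requireNamespace("VIM", quietly = TRUE)) {
  # Example data with missing values shipped with VIM (assumption: stands in for x)
  data(sleep, package = "VIM")

  # Hot-deck imputation, without the extra indicator columns
  imp_hd <- VIM::hotdeck(sleep, imp_var = FALSE)

  # k-nearest neighbour imputation with k = 5 donors
  imp_knn <- VIM::kNN(sleep, k = 5, imp_var = FALSE)

  # Iterative model-based imputation with default settings
  imp_irmi <- VIM::irmi(sleep)

  # Missing values per variable before and after kNN imputation
  print(colSums(is.na(sleep)))
  print(colSums(is.na(imp_knn)))
}
```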
SBS data: Simulation Results
[Figure: absolute deviation (0 to 8000) for hotdeck, imi, irmi and kNN.]
Conclusion
Conclusion
We proposed the system VIM for visualization and imputation of missing values.
IRMI almost always performs best, but hot-deck methods have their advantages as well (they are very fast and easy to understand).
VIM is a free and open-source project. It can be freely downloaded at
http://cran.r-project.org/package=VIM
Joint development and contributions are warmly welcome.
References
M. Templ, A. Alfons, and A. Kowarik. VIM: Visualization and Imputation of Missing Values, 2011a. URL http://cran.r-project.org. R package version 2.0.1.
M. Templ, A. Kowarik, and P. Filzmoser. Iterative stepwise regression imputation using standard and robust methods. Computational Statistics & Data Analysis, In Press, Uncorrected Proof, 2011b. ISSN 0167-9473. doi: 10.1016/j.csda.2011.04.012. URL http://www.sciencedirect.com/science/article/B6V8V-52R7YYH-2/2/2c6c9ed7138d50c4197e991d8ffb8a1f.