imputtiona of complex data with r-package vim: … · imputtiona of complex data with r-package...

IMPUTATION OF COMPLEX DATA WITH R-PACKAGEVIM: TRADITIONAL AND NEW METHODS BASED

ON ROBUST ESTIMATION.Key Invited Paper

Matthias Templ1,2, Alexander Kowarik1, Peter Filzmoser2

1 Department of Methodology, Statistics Austria2 Department of Statistics and Probability Theory, TU WIEN, Austria

UNECE Worksession on data editingLjubljana, Mai 10, 2011

Templ, et al. (STAT, TU) Robust Imputation Ljubljana, Mai 10, 2011 1 / 50

Content

1 Overview

2 Visualisation Tools

3 Robust Imputation: Motivation

4 Challenges

5 IRMI

6 Simulation results

7 Results from real-world data

8 Small walkthrough to VIM

9 Conclusion

1 Overview

2 Visualisation Tools

3 Robust Imputation: Motivation

4 Challenges

5 IRMI

6 Simulation results

7 Results from real-world data

8 Small walkthrough to VIM

9 Conclusion


Overview

R-package VIM

VIM = Visualization and Imputation of Missings

Univariate, bivariate, multiple and multivariate plot methods tohighlight missing values in complex data sets to learn about theirstructure (MCAR, MAR, MNAR). Comes with a GUI as well.

Hot-deck, k-NN and EM-based (robust) imputation methods forcomplex data sets. Due to time reasons we mostly concentrate onEM-based imputation. For hot-deck and k-NN, please have a look atthe paper.

VIM-book in 2012


Visualisation Tools

The GUI . . .


Visualisation Tools

The GUI . . .

Pro

port

ion

of m

issi

ngs

0.00

0.02

0.04

0.06

0.08

0.10

py01

0n

py03

5n

py05

0n

py07

0n

py08

0n

py09

0n

py10

0n

py11

0n

py12

0n

py13

0n

py14

0n

Com

bina

tions

py01

0n

py03

5n

py05

0n

py07

0n

py08

0n

py09

0n

py10

0n

py11

0n

py12

0n

py13

0n

py14

0n

3827425119534832231311119865443222222211111111111


Visualisation Tools

The GUI . . .

0.0

0.2

0.4

0.6

0.8

1.0

P033000

mis

sing

/obs

erve

d in

py0

10n

0 5 10 15 20 25 35

mis

sing


Visualisation Tools

The GUI . . .


Visualisation Tools

The GUI - Imputation


Visualisation Tools


kNN(data , variable=colnames(data), metric=NULL , k=5,

dist_var=colnames(data),weights=NULL ,numFun = median ,

catFun=maxCat ,makeNA=NULL ,NAcond=NULL , impNA=TRUE ,

donorcond=NULL ,mixed=vector(),trace=FALSE ,

imp_var=TRUE ,imp_suffix="imp",addRandom=FALSE)


Visualisation Tools


hotdeck(data , variable=colnames(data), ord_var=NULL ,

domain_var=NULL ,makeNA=NULL ,NAcond=NULL ,impNA=TRUE ,

donorcond=NULL ,imp_var=TRUE ,imp_suffix="imp")


Visualisation Tools


irmi(x, eps = 0.01 , maxit = 100, mixed = NULL ,

step = FALSE , robust = FALSE , takeAll = TRUE ,

noise = TRUE , noise.factor = 1, force = FALSE ,

robMethod = "lmrob", force.mixed = TRUE ,

mi = 1, trace=FALSE)


Robust Imputation: Motivation

Median Imputation

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

● ●

●

●

●

●

−2 0 2 4 6 8 10

−3

−2

−1

01

23

x

y ●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

fully observedinclude missings in xinclude missings in yimputed values



kNN Imputation

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

● ●

●

●

●

●

−2 0 2 4 6 8 10

−3

−2

−1

01

23

x

y ●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●




IVEWARE

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

● ●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●




IRMI

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

● ●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●




Median Imputation

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

● ●

●

●

●

●

−2 0 2 4 6 8 10

−3

−2

−1

01

2

x

y

● original dataimputed data values

tolerance ellipse from original datatolerance ellipse from imputed data



kNN Imputation

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

● ●

●

●

●

●

−2 0 2 4 6 8 10

−3

−2

−1

01

2

x

y

● original dataimputed data values




IVEWARE

●

●●

●

●

●

●

●

●● ●

●●●

●

●

●

●

●

●

●

●

●

●

● ●●

●

● ●

●

●●

●

●

●●●

●

● ●● ● ●

●●

●

●

●●

●

●

●

●● ●

●●

●

●

●

●

●

●

●

●

●

● ●●

●

●●

●

● ●●

●

●

●●

●

●

●

●

●

●

●

●●

● ●

●

●●

●

●

●

● ●

● original data valuesimputed data values

original dataimputed data



IRMI

●

●●

●

●

●

●

●

●● ●

●●●

●

●

●

●

●

●

●

●

●

●

● ●●

●

● ●

●

●●

●

●

●●●

●

● ●● ● ●

●●

●

●

●●

●

●

●

●● ●

●●

●

●

●

●

●

●

●

●

●

● ●●

●

●●

●

● ●●

●

●

●●

●

●

●

●

●

●

●

●●

● ●

●

●●

●

●

●

● ●





Median Imputation

●●

●

●

●

●

●

●

●● ●

●●●

●

●

●

●

●

●

●

●

●

●

● ●●

●

● ●

●

●●

●

●

●●●

●

● ●● ● ●

● ●

●

●

●●

●

●

●

●● ●

●●

●

●

●

●

●

●

●

●

●

● ●●

●

●●

●

● ●●

●

●

●●

●

●

●

●

●

●

●

●●

● ●

●

●●

●

●

●

● ●

−2 0 2 4 6 8

−5

05

1015

2025

x

y




kNN Imputation

●●

●

●

●

●

●

●

●● ●

●●●

●

●

●

●

●

●

●

●

●

●

● ●●

●

● ●

●

●●

●

●

●●●

●

● ●● ● ●

● ●

●

●

●●

●

●

●

●● ●

●●

●

●

●

●

●

●

●

●

●

● ●●

●

●●

●

● ●●

●

●

●●

●

●

●

●

●

●

●

●●

● ●

●

●●

●

●

●

● ●

−2 0 2 4 6 8

−5

05

1015

2025

x

y


●●●●



IVEWARE

●●

●

●

●

●

●

●

●● ●

●●●

●

●

●

●

●

●

●

●

●

●

● ●●

●

● ●

●

●●

●

●

●●●

●

● ●● ● ●

● ●

●

●

●●

●

●

●

●● ●

●●

●

●

●

●

●

●

●

●

●

● ●●

●

●●

●

● ●●

●

●

●●

●

●

●

●

●

●

●

●●

● ●

●

●●

●

●

●

● ●



●●



IRMI

●

●●

●

●

●

●

●

●● ●

●●●

●

●

●

●

●

●

●

●

●

●

● ●●

●

● ●

●

●●

●

●

●●●

●

● ●● ● ●

●●

●

●

●●

●

●

●

●● ●

●●

●

●

●

●

●

●

●

●

●

● ●●

●

●●

●

● ●●

●

●

●●

●

●

●

●

●

●

●

●●

● ●

●

●●

●

●

●

● ●




Challenges

Some Challenges

Mixed type of variables: various variables being nominal scaled, somevariables might be ordinal and some variables could bedetermined to be of continuous scale.

Semi-continuous variables: �semi-continuous� distributions, i.e. a variableconsisting of a continuous scaled part and a certainproportion of equal values.

Far from normality: Virtually always outlying observations included inreal-world data.

multiple imputation: Imputated must be both, re�ect the multivariatestructure of the data and including �randomness�.


Challenges

MIX, MICE, MI, MITOOLS, . . .

All missing values imputed with simulated values drawn from theirpredictive distribution given the observed data and the speci�edparameter.

−→ based on sequential regressions.

EM-based

In general, there are often problems when applied to complex data sets.

. . . and, of course, they are highly driven by in�uencial points,representative and non-representative outliers.


Challenges

IVEWARE

Very popular software used in many applications in O�cial Statistics.

Similar to the previous mentioned methods.

The imputations are obtained by �tting a sequence of (Bayesian)regression models and drawing values from the correspondingpredictive distributions.Sequentially imputation: in each step, one variable serve asresponse and certain other variables serves as predictors. Fit acertain model using the observed part of the response and estimate(update) the (former) missing values in the response.

Initialization loop: . . .

Second outer loop:

Estimates of missing values are updated sequentially using one variable

as response and all other variables as predictors until convergency.

Since missing values are drawn from their predictive distribution giventhe observed data and the speci�ed parameter, the procedure allowsmultiple imputation.


IRMI

IRMI

Only the second outer loop is used (missing values are initialised in another manner)

In contradiction to IVEWARE we use quite di�erent regressionmethods → Robust methods (Note: a lot of problems has to besolved when using robust methods for complex data like EU-SILC).

Alternatively, stepwise model selction tools are integrated using AICor BIC.

Multiple imputation is provided.


IRMI

Selection of Regression Models

If the response is

continuous, robust (IRMI) or ols (IMI, IVEWARE) regression methodsare used.

categorical , generalized linear regression is applied (IRMI: robust ornon-robust).

binary , logistic linear regression is applied (IRMI: robust butnon-robust is prefered).

mixed , a two-stage approach is used whereas in the �rst stage logisticregression is applied in order to decide if a missing value is imputedwith zero or by applying robust regression based on the continuouspart of the response.

count, robust generalized linear models (family: Poisson) is used.


Simulation results Measures of information loss

Errors from Categorical and Binary Variables

This error measure is de�ned as the proportion of imputed values takenfrom an incorrect category on all missing categorical or binary values:

errc =1

mc

pc∑j=1

n∑i=1

I(xorigij 6= ximpij ) , (1)

with I the indicator function, mc the number of missing values in the pccategorical variables, and n the number of observations.


Simulation results Measures of information loss

Errors from Continuous and Semi-continuous Variables

Here we assume that the constant part of the semi-continuous variable iszero. Then, the joint error measure is

errs =1

ms

ps∑j=1

n∑i=1

[∣∣∣∣∣(xorigij − x

impij )

xorigij

∣∣∣∣∣ · I(xorigij 6= 0 ∧ ximpij 6= 0) +

I((xorigij = 0 ∧ ximpij 6= 0) ∨ (xorigij 6= 0 ∧ x

impij = 0))

](6)

with ms the number of missing values in the ps continuous andsemi-continuous variables.


Simulation results Speci�c simulation results

Simulation Results: Varying the Correlation0.

00.

10.

20.

30.

40.

5

Correlation

err c

0.1 0.3 0.5 0.7 0.9

●

●

●

●

●

● IRMIIMIIVEWARE

0.00

0.05

0.10

0.15

0.20

0.25

0.30

Correlation

err s

0.1 0.3 0.5 0.7 0.9

●

●

●

●

●

● IRMIIMIIVEWARE



Simulation Results: Varying the Amount of Variables0.

000.

050.

100.

150.

200.

250.

300.

35

Number of Variables

err c

4 12 20 28 36

●●

●

●●

● IRMIIMIIVEWARE

0.00

0.05

0.10

0.15

0.20

Number of Variables

err s

4 12 20 28 36

●

●

● ● ●

● IRMIIMIIVEWARE



Including (moderate) Outliers and Varying their Amount,high Correlation

0.00

0.05

0.10

0.15

0.20

0.25

0.30

0.35

Outliers [%]

err c

0 10 20 30 40 50

●

●

●●

●●

● IRMIIMIIVEWARE

0.00

0.05

0.10

0.15

0.20

0.25

0.30

0.35

Outliers [%]

err s

0 10 20 30 40 50

● ●●

●

●

●

● IRMIIMIIVEWARE



Including (moderate) Outliers and Varying their Amount,low Correlation

0.0

0.1

0.2

0.3

0.4

Outliers [%]

err c

0 10 20 30 40 50

●●

● ● ●●

● IRMIIMIIVEWARE

0.00

0.05

0.10

0.15

0.20

0.25

0.30

0.35

Outliers [%]

err s

0 10 20 30 40 50

● ●● ●

●

●

● IRMIIMIIVEWARE


Results from real-world data

Imputation in EU-SILC

We considered certain HH-components, but also some nominal variables,such as household size, region and htype3.

1 R = 0

2 Set missing values in HH-components randomly (MCAR). R ++

3 Impute the missing values.

4 Evaluate the imputations using certain information loss measures.

5 Go to (2) until R = 100.



Imputation in EU-SILC, Results

●●●●●●●●

●●●●●●

●

●

●

IRMI IMI IVEWARE

2040

6080

100

err s



CENSUS Data - no outliers

● ●

IRMI IMI IVEWARE

0.2

0.4

0.6

0.8

err s

(i) Error for categorical variables

●

●

●

●

●

●

●●

●●

●

●

●●

●

●

●

●

●

●

●

●

●●●●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●●

●●

●

●

●

●

●

●

●

●●

●

IRMI IMI IVEWARE0.

00.

20.

40.

6

err s

(j) Error for numerical variables



Airquality Data

●●

●

●●

●

●

●

●

●●

●

●●

●

●

●●●

●

●

●

●

●

●

●

●

●●●●

●

●

●

●

●

●●●

●

●

●

●

●

●

IRMI IMI IVE

0.5

1.0

1.5

err s


Small walkthrough to VIM

Example Data: SBS data

Pro

port

ion

of m

issi

ngs

0.00

0.01

0.02

0.03

0.04

0.05

Um

satz

B31

B41

B23 a1 a2 a6 a2

5 e2B

esch

Com

bina

tions

Um

satz

B31

B41

B23 a1 a2 a6 a2

5 e2B

esch



Most important functionality for imputation

Listing 1: Hotdeck imputation.

hotdeck(x, ord_var=c("Besch","Umsatz"), imp_var=FALSE)

Listing 2: k-nearest neighbor imputation.

kNN(x)

Listing 3: Application of robust iterative model-based imputation.

imp <- irmi(x)

. . . sensible defaults!



SBS data: Simulation Results

●●

●

●

●

●

●

●

●

●

hotdeck imi irmi kNN

020

0040

0060

0080

00

absolute deviation


Conclusion

Conclusion

We proposed the system VIM for visualization and imputation ofmissing values.

IRMI performs almost always best, but hot-deck methods have it'sadvantages as well (they are very fast and easy understandable)

VIM is an free and open-source project. It can be freely downloaded at

http://cran.r-project.org/package=VIM

Joint development and contributions are warmly welcome.


http://cran.r-project.org/package=VIM

Conclusion

References

M. Templ, A. Alfons, and A. Kowarik. VIM: Visualization and Imputationof Missing Values, 2011a. URL http://cran.r-project.org. Rpackage version 2.0.1.

Matthias Templ, Alexander Kowarik, and Peter Filzmoser. Iterativestepwise regression imputation using standard and robust methods.Computational Statistics Data Analysis, In Press, Uncorrected Proof,2011b. ISSN 0167-9473. doi: DOI:10.1016/j.csda.2011.04.012. URLhttp://www.sciencedirect.com/science/article/

B6V8V-52R7YYH-2/2/2c6c9ed7138d50c4197e991d8ffb8a1f.


http://cran.r-project.org

http://www.sciencedirect.com/science/article/B6V8V-52R7YYH-2/2/2c6c9ed7138d50c4197e991d8ffb8a1f

http://www.sciencedirect.com/science/article/B6V8V-52R7YYH-2/2/2c6c9ed7138d50c4197e991d8ffb8a1f

imputtiona of complex data with r-package vim: … · imputtiona of complex data with r-package...

Documents