A Robust Approach for Dealing with Missing Values in Compositional Data Karel Hron, Matthias Templ, Peter Filzmoser ICORS’08, Antalya, 8. 9. 2008



Compositional data (CoDa)

• ... D-part composition

• and contain essentially the same information

• simplex – sample space of D-part compositions

• dimensionality of D-part compositions: D-1
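
A minimal sketch of why rescaling carries no extra information, assuming the usual convention that a composition has strictly positive parts and that only the ratios between parts matter. The helper name `closure` and the example values are illustrative and not taken from the slides.

```python
def closure(x, total=1.0):
    """Rescale strictly positive parts so that they sum to `total`."""
    if any(v <= 0 for v in x):
        raise ValueError("all parts must be strictly positive")
    s = sum(x)
    return [total * v / s for v in x]


# Hypothetical 3-part composition and a rescaled copy of it.
x = [1.0, 2.0, 7.0]
cx = [25.0 * v for v in x]

# Both are mapped to the same point of the simplex.
closed_x = closure(x)
closed_cx = closure(cx)
```

Because the closed parts sum to a constant, only D-1 of them can vary freely, which is the dimensionality stated above.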

Standard statistics and CoDa

• difficulties when applying standard statistical methods (like correlation analysis and PCA)

• the results can be completely useless

• reason: the sample space of CoDa induces a different geometrical structure (Aitchison geometry)

• solution: family of logratio transformations from the simplex to real space (Aitchison, 1986)

• in case of missing values in CoDa, these transformations allow for a reasonable imputation

Isometric logratio transformations

• ilr for short (Egozcue et al., 2003); the result lies in the (D-1)-dimensional real space

• the transformed data are regular, which is necessary for robust statistical methods

• isometry
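
The isometry property can be checked numerically. The sketch below uses one particular orthonormal basis (often called pivot coordinates) as an assumption; the slides do not say which basis was shown. The function names and the two example compositions are illustrative.

```python
import math


def pivot_ilr(x):
    """ilr coordinates of x in one particular orthonormal basis (an assumption)."""
    logs = [math.log(v) for v in x]
    z = []
    for i in range(len(x) - 1):
        rest = logs[i + 1:]                      # parts that follow part i
        scale = math.sqrt(len(rest) / (len(rest) + 1))
        z.append(scale * (logs[i] - sum(rest) / len(rest)))
    return z


def aitchison_distance(x, y):
    """Aitchison distance computed from all pairwise logratios."""
    D = len(x)
    total = 0.0
    for i in range(D):
        for j in range(i + 1, D):
            total += (math.log(x[i] / x[j]) - math.log(y[i] / y[j])) ** 2
    return math.sqrt(total / D)


x = [0.2, 0.3, 0.5]
y = [0.6, 0.1, 0.3]
zx, zy = pivot_ilr(x), pivot_ilr(y)

# Euclidean distance between the ilr coordinates ...
d_euclid = math.sqrt(sum((a - b) ** 2 for a, b in zip(zx, zy)))
# ... agrees with the Aitchison distance between the compositions.
d_aitchison = aitchison_distance(x, y)
```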

Ilr and balances

• interpretation of ilr coordinates (balances) in terms of the original compositional parts is not possible

• reason: definition of CoDa

• solution: split the parts into separated groups and order balances

• this construction is provided by a special procedure called sequential binary partition

Ilr and balances

• result of a special choice of sequential binary partition (SBP)
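
As a hedged illustration of how balances follow from a sequential binary partition: each step splits a group of parts into a "+" group and a "-" group, and the balance is the normalised log of the ratio of their geometric means. The sign-table encoding, the function name `balance`, and the 4-part example are assumptions for illustration; the partition shown on the original slide is not recoverable from the transcript.

```python
import math


def balance(x, signs):
    """Balance between the parts marked +1 and the parts marked -1 (0 = not involved)."""
    plus = [math.log(v) for v, g in zip(x, signs) if g > 0]
    minus = [math.log(v) for v, g in zip(x, signs) if g < 0]
    r, s = len(plus), len(minus)
    # sqrt(r*s/(r+s)) * ln(geometric mean of "+" parts / geometric mean of "-" parts)
    return math.sqrt(r * s / (r + s)) * (sum(plus) / r - sum(minus) / s)


# Hypothetical SBP for 4 parts: {1,2} vs {3,4}, then 1 vs 2, then 3 vs 4.
sbp = [
    [+1, +1, -1, -1],
    [+1, -1, 0, 0],
    [0, 0, +1, -1],
]

x = [0.1, 0.2, 0.3, 0.4]
z = [balance(x, row) for row in sbp]              # D-1 = 3 balances
z_rescaled = [balance([10 * v for v in x], row) for row in sbp]
```

The balances of the rescaled copy equal those of `x`, so they depend only on ratios between the groups.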

Outliers and CoDa

1) caused by Aitchison geometry:

• it measures differences between compositions in a natural way, respecting their relative scale property

• it distinguishes between the following two differences within compositional parts: 0.500 and 0.501 vs. 0.001 and 0.002

• consequence: the error term in the parts is not the same for values close to the barycentre and for values close to the border of the simplex
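
The two differences above are equal in absolute terms but very different in relative terms. A small sketch, looking only at the logratio of a single part before and after the change (a simplification, since in a real composition the other parts change as well):

```python
import math


def log_ratio_change(before, after):
    """Size of a change on the relative (logratio) scale."""
    return abs(math.log(after / before))


# Same absolute difference of 0.001 in both cases.
near_centre = log_ratio_change(0.500, 0.501)   # a change of 0.2 %
near_border = log_ratio_change(0.001, 0.002)   # a doubling

ratio = near_border / near_centre
```

Here `near_border` equals ln 2, while `near_centre` is about 0.002, so the second change is several hundred times larger on the relative scale.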

Outliers and CoDa

• solution: using ilr transformation and outlier detection (Filzmoser and Hron, 2008)

Outliers and CoDa

2) caused by definition of CoDa:

• each observed composition is a member of the corresponding equivalence class

• every two compositions from the same class have zero Aitchison distance

• at the same time, low and high values of the scaling constant c can cause a high Euclidean distance
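
A minimal numerical check of this point, assuming the equivalence class of a composition consists of its positive multiples. The example values and the constant c = 100 are illustrative.

```python
import math


def clr(x):
    """Centred logratio coefficients: log parts minus their mean."""
    logs = [math.log(v) for v in x]
    mean = sum(logs) / len(logs)
    return [v - mean for v in logs]


def aitchison_distance(x, y):
    """Euclidean distance between the clr coefficients of x and y."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(clr(x), clr(y))))


def euclidean_distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


x = [1.0, 2.0, 7.0]
c = 100.0                       # hypothetical scaling constant
cx = [c * v for v in x]         # same equivalence class as x

d_a = aitchison_distance(x, cx)   # zero up to rounding
d_e = euclidean_distance(x, cx)   # large
```

An outlier detector based on the Euclidean distance would flag `cx`, although it carries the same relative information as `x`.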

Missing values in CoDa sets

• most statistical methods cannot be applied directly to data sets with missing information

• removing incomplete observations can cause an unacceptable loss of information

• most imputation methods rely on assumptions such as missing at random (MAR) and normality of the data

• outliers can have a dramatic influence on the estimation of missing values

Missing values in CoDa sets

• with robust imputation methods the estimation of missing values is based on the majority of the data

• existing robust methods may not be suited to compositional data (different geometry of the data, and hence wrong identification of outliers)

=> a more effective way of dealing with CoDa for imputation, with respect to the Aitchison geometry, is needed

Robust imputation of missing values for CoDa

• we propose an iterative procedure to estimate the missing values

• initialization of the missing values: fast kNN (Aitchison distance)

• the compositional part with the highest number of missing values is chosen and the data are transformed using a proper ilr transformation; missing values from the chosen part (x1) then appear in one ilr variable only and do not contaminate the others
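
To see why a "proper" ilr transformation keeps the chosen part out of the remaining variables, the sketch below assumes a basis in which part x1 is contrasted with all other parts in the first coordinate only. The slides do not give the formula, so this basis is an assumption, and the example values are hypothetical.

```python
import math


def pivot_ilr(x):
    """ilr coordinates in which x[0] enters the first coordinate only (assumed basis)."""
    logs = [math.log(v) for v in x]
    z = []
    for i in range(len(x) - 1):
        rest = logs[i + 1:]                      # coordinate i uses parts i, i+1, ...
        scale = math.sqrt(len(rest) / (len(rest) + 1))
        z.append(scale * (logs[i] - sum(rest) / len(rest)))
    return z


observed = [0.10, 0.20, 0.30, 0.40]
# Same row with a poor initial guess in the first part (not re-closed;
# logratios do not depend on the total).
guessed = [5.00, 0.20, 0.30, 0.40]

z_observed = pivot_ilr(observed)
z_guessed = pivot_ilr(guessed)
```

Only the first coordinate differs between the two rows; the second and third depend solely on the ratios among the other parts.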

Robust imputation of missing values for CoDa

• consequently, fast LTS regression (which can also deal with large data sets) of z1 on z2, …, zD-1 is preferred, but other robust methods can also be considered

• missing values are then imputed for each variable in turn (starting with the one with the highest number of missing values)

• the procedure is repeated iteratively until convergence
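
The loop described above can be sketched for the smallest case, D = 3, where z1 is regressed on a single coordinate z2. This is a simplified illustration under several assumptions, not the authors' implementation: a Theil-Sen line replaces fast LTS, a column median replaces the kNN initialisation, the regression is fitted on the rows where the chosen part was observed, and every part needs at least two observed rows with distinct z2. All names and the toy data are hypothetical.

```python
import math
from statistics import median


def pivot_coords(x):
    """ilr coordinates of a 3-part composition; x[0] enters z1 only."""
    z1 = math.sqrt(2.0 / 3.0) * math.log(x[0] / math.sqrt(x[1] * x[2]))
    z2 = math.sqrt(0.5) * math.log(x[1] / x[2])
    return z1, z2


def theil_sen(u, v):
    """Robust line v = a + b*u (stand-in for LTS): median slope, median intercept."""
    slopes = [(v[j] - v[i]) / (u[j] - u[i])
              for i in range(len(u)) for j in range(i + 1, len(u))
              if u[j] != u[i]]
    b = median(slopes)
    a = median([vi - b * ui for ui, vi in zip(u, v)])
    return a, b


def impute(data, tol=1e-10, max_iter=50):
    """data: rows of 3 parts, None = missing. Returns a completed copy."""
    rows = [list(r) for r in data]
    miss = [[v is None for v in r] for r in rows]
    for k in range(3):                       # crude start (stand-in for kNN)
        fill = median([r[k] for r in rows if r[k] is not None])
        for r in rows:
            if r[k] is None:
                r[k] = fill
    # parts ordered by decreasing number of missing values
    order = sorted(range(3), key=lambda k: -sum(m[k] for m in miss))
    for _ in range(max_iter):
        change = 0.0
        for k in order:
            if not any(m[k] for m in miss):
                continue
            o1, o2 = [j for j in range(3) if j != k]
            z = [pivot_coords([r[k], r[o1], r[o2]]) for r in rows]
            fit = [p for p, m in zip(z, miss) if not m[k]]
            a, b = theil_sen([p[1] for p in fit], [p[0] for p in fit])
            for r, m, p in zip(rows, miss, z):
                if m[k]:
                    z1_hat = a + b * p[1]
                    # back-transform; the other parts keep their values
                    new = math.sqrt(r[o1] * r[o2]) * math.exp(z1_hat * math.sqrt(1.5))
                    change = max(change, abs(math.log(new / r[k])))
                    r[k] = new
        if change < tol:
            break
    return rows


# Toy data with an exact linear relation z1 = 0.5 + 0.8 * z2.
complete = []
for z2 in [-1.0, -0.5, 0.0, 0.3, 0.7, 1.0, 1.5, 2.0]:
    x2 = math.exp(math.sqrt(2.0) * z2)
    x1 = math.sqrt(x2) * math.exp(math.sqrt(1.5) * (0.5 + 0.8 * z2))
    complete.append([x1, x2, 1.0])

data = [list(r) for r in complete]
data[3][0] = None                            # remove one value of the first part
done = impute(data)
```

On this noise-free toy data the removed value is recovered, and observed values are left unchanged. With real data the quality of the result depends on the robust regression and on the initialisation, which is why the slides name fast LTS and kNN.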

Simulation study

References

• Aitchison, J., 1986, The statistical analysis of compositional data. Chapman and Hall, London.

• Egozcue, J.J., Pawlowsky-Glahn, V., Mateu-Figueras, G., Barceló-Vidal, C., 2003, Isometric logratio transformations for compositional data analysis. Math. Geol., vol. 35, no. 3, p. 279-300.

• Filzmoser, P., Hron, K., 2008, Outlier detection for compositional data using robust methods. Math. Geosci., vol. 40, no. 3, p. 233-248.