A Robust Approach for Dealing with Missing Values in Compositional Data
Karel Hron, Matthias Templ, Peter Filzmoser
ICORS’08, Antalya, 8. 9. 2008
Compositional data (CoDa)
• ... D-part composition
• and contain essentially the same information
• simplex – sample space of D-part compositions
• D-1 dimensionality of compositions
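For orientation, the objects named in these bullets can be written in the usual notation of Aitchison (1986). The symbols x, c, S^D and the closure constant κ are assumed here; they are standard conventions, not taken from the slide itself.

```latex
% D-part composition: a vector of strictly positive parts
\mathbf{x} = (x_1, \dots, x_D)', \qquad x_i > 0, \quad i = 1, \dots, D

% scale invariance: x and c x carry the same relative information
\mathbf{x} \sim c\,\mathbf{x} \quad \text{for any } c > 0

% simplex: sample space of D-part compositions with constant sum \kappa
\mathcal{S}^D = \Bigl\{ \mathbf{x} = (x_1, \dots, x_D)' : x_i > 0,\ \sum_{i=1}^{D} x_i = \kappa \Bigr\}

% the constant-sum constraint removes one degree of freedom
\dim \mathcal{S}^D = D - 1
```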
Standard statistics and CoDa
• difficulties when applying standard statistical methods (like correlation analysis and PCA)
• the results can be completely useless
• reason: the sample space of CoDa induces a different geometrical structure (Aitchison geometry)
• solution: family of logratio transformations from the simplex to real space (Aitchison, 1986)
• in case of missing values in CoDa, these transformations allow for a reasonable imputation
Isometric logratio transformations
• ilr for short (Egozcue et al., 2003); they result in a (D-1)-dimensional real space
• regularity of transformed data is provided, necessary for robust statistical methods
• isometry
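The isometry property can be checked numerically. The sketch below uses one particular choice of ilr coordinates (often called pivot coordinates); the function names `clr`, `ilr` and `aitchison_distance` and the example compositions are assumptions made for illustration, not taken from the talk.

```python
import numpy as np

def clr(x):
    """Centred logratio coefficients: log parts minus their mean log."""
    logx = np.log(np.asarray(x, dtype=float))
    return logx - logx.mean()

def ilr(x):
    """One possible ilr coordinate system (pivot coordinates) of a D-part composition."""
    logx = np.log(np.asarray(x, dtype=float))
    D = logx.size
    z = np.empty(D - 1)
    for i in range(D - 1):
        k = D - i - 1  # number of parts to the right of part i
        z[i] = np.sqrt(k / (k + 1)) * (logx[i] - logx[i + 1:].mean())
    return z

def aitchison_distance(x, y):
    """Aitchison distance, computed as the Euclidean distance of clr vectors."""
    return np.linalg.norm(clr(x) - clr(y))

# two arbitrary 3-part compositions (illustrative values)
x = np.array([0.2, 0.3, 0.5])
y = np.array([0.6, 0.1, 0.3])

d_simplex = aitchison_distance(x, y)      # distance in the simplex
d_real = np.linalg.norm(ilr(x) - ilr(y))  # Euclidean distance of ilr coordinates
print(d_simplex, d_real)                  # the two values agree up to rounding
```

A D-part composition yields D-1 coordinates, and distances computed in the simplex and in the ilr space coincide, which is what allows standard (robust) methods to be applied to the transformed data.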
Ilr and balances
• interpretation of ilr coordinates (balances) in the sense of original compositional parts is not possible
• reason: definition of CoDa
• solution: split the parts into separated groups and order balances
• this construction is provided using a special procedure, called sequential binary partition
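A sequential binary partition can be encoded as a sign matrix with one row per balance: +1 for parts in the first group, -1 for parts in the second group, 0 for parts not involved in that step. The following is a minimal sketch of turning such a matrix into balances in the sense of Egozcue et al. (2003); the helper names `sbp_to_basis` and `balances` and the 4-part partition are hypothetical examples.

```python
import numpy as np

def sbp_to_basis(sbp):
    """Turn a sign matrix (rows = partition steps) into orthonormal contrast rows."""
    sbp = np.asarray(sbp, dtype=float)
    V = np.zeros_like(sbp)
    for i, row in enumerate(sbp):
        r = np.sum(row > 0)  # number of parts in the "+" group
        s = np.sum(row < 0)  # number of parts in the "-" group
        V[i, row > 0] = np.sqrt(s / (r * (r + s)))
        V[i, row < 0] = -np.sqrt(r / (s * (r + s)))
    return V

def balances(x, sbp):
    """Balances of composition x with respect to the given partition."""
    return sbp_to_basis(sbp) @ np.log(np.asarray(x, dtype=float))

# hypothetical partition of 4 parts:
# step 1: {x1, x2} vs {x3, x4}; step 2: x1 vs x2; step 3: x3 vs x4
sbp = [[+1, +1, -1, -1],
       [+1, -1,  0,  0],
       [ 0,  0, +1, -1]]

x = np.array([0.1, 0.2, 0.3, 0.4])
z = balances(x, sbp)
print(z)
```

Each balance is a normalized logratio of the geometric means of the two groups, so it can be read as the relative dominance of one group of parts over the other.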
Outliers and CoDa
1) caused by Aitchison geometry:
• it provides a measure of differences between the compositions in a natural way, respecting their relative scale property
• it distinguishes between the following two differences within compositional parts: 0.500 and 0.501 vs. 0.001 and 0.002
• consequence: the error term in the parts is not the same for values close to the barycentre or to the border of the simplex
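The contrast between the two differences can be made concrete with a small computation. The sketch assumes two-part compositions closed to 1, built around the part values quoted above; this embedding and the function name `aitchison_distance` are illustrative assumptions.

```python
import numpy as np

def aitchison_distance(x, y):
    """Aitchison distance via centred logratio coefficients."""
    lx = np.log(np.asarray(x, dtype=float))
    ly = np.log(np.asarray(y, dtype=float))
    return np.linalg.norm((lx - lx.mean()) - (ly - ly.mean()))

# pair near the barycentre of the simplex
centre_a, centre_b = [0.500, 0.500], [0.501, 0.499]
# pair near the border of the simplex
border_a, border_b = [0.001, 0.999], [0.002, 0.998]

euclid_centre = np.linalg.norm(np.subtract(centre_a, centre_b))
euclid_border = np.linalg.norm(np.subtract(border_a, border_b))
aitch_centre = aitchison_distance(centre_a, centre_b)
aitch_border = aitchison_distance(border_a, border_b)

print(euclid_centre, euclid_border)  # the same absolute difference in both pairs
print(aitch_centre, aitch_border)    # the border pair is far more distant
```

In absolute terms both pairs differ by 0.001 per part, but in relative terms the second pair doubles the small part, and the Aitchison distance reflects this.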
Outliers and CoDa
• solution: using ilr transformation and outlier detection (Filzmoser and Hron, 2008)
Outliers and CoDa
2) caused by definition of CoDa:
• each observed composition is a member of the corresponding equivalence class
• every two compositions from the same class have zero Aitchison distance
• at the same time, low and high values of the constant c can cause a high Euclidean distance
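A short numerical check of this point, assuming the equivalence class of a composition x consists of its positive multiples c·x (the composition and the values of c below are arbitrary illustrations):

```python
import numpy as np

def aitchison_distance(x, y):
    """Aitchison distance via centred logratio coefficients."""
    lx = np.log(np.asarray(x, dtype=float))
    ly = np.log(np.asarray(y, dtype=float))
    return np.linalg.norm((lx - lx.mean()) - (ly - ly.mean()))

x = np.array([1.0, 2.0, 7.0])

results = []
for c in (0.01, 100.0):  # a low and a high multiplier
    y = c * x            # same composition, different total
    results.append((c, aitchison_distance(x, y), np.linalg.norm(x - y)))

for c, d_aitchison, d_euclid in results:
    print(c, d_aitchison, d_euclid)
```

A method based on Euclidean distances would flag c·x as far away from x, although both vectors carry identical relative information.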
Missing values in CoDa sets
• most statistical methods cannot be directly applied to data sets with missing information
• removing incomplete observations can cause an unacceptable loss of information
• most imputation methods use assumptions like missing at random (MAR) and normality of the data
• outliers can have a dramatic influence on the estimation of missing values
Missing values in CoDa sets
• with robust imputation methods the estimation of missing values is based on the majority of the data
• existing robust methods may not deal properly with compositional data (different geometry of the data and hence wrong identification of outliers)
=> a more effective way of dealing with CoDa for imputation, with respect to the Aitchison geometry, is needed
Robust imputation of missing values for CoDa
• we propose an iterative procedure to estimate the missing values
• initialization of the missing values: fast kNN (Aitchison)
• the compositional part with the largest number of missing values is chosen and the data are transformed using a suitable ilr transformation; missing values from the chosen part (x1) appear in one ilr variable only and do not contaminate the others
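The isolation of the chosen part in a single ilr variable can be illustrated as follows. The sketch assumes the pivot form of the ilr transformation with the chosen part placed first; `pivot_ilr` is a hypothetical helper name and the compositions are made-up values.

```python
import numpy as np

def pivot_ilr(x):
    """ilr coordinates in which only the first coordinate involves the first part."""
    logx = np.log(np.asarray(x, dtype=float))
    D = logx.size
    z = np.empty(D - 1)
    for i in range(D - 1):
        k = D - i - 1  # number of parts to the right of part i
        z[i] = np.sqrt(k / (k + 1)) * (logx[i] - logx[i + 1:].mean())
    return z

x = np.array([0.1, 0.2, 0.3, 0.4])
x_changed = x.copy()
x_changed[0] = 0.9  # pretend x1 was unknown and filled with some other value

z = pivot_ilr(x)
z_changed = pivot_ilr(x_changed)
print(z)
print(z_changed)  # only the first coordinate differs
```

Whatever value is filled in for x1, the coordinates z2, ..., zD-1 stay the same, so they can serve as clean explanatory variables for estimating z1.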
Robust imputation of missing values for CoDa
• consequently, fast LTS regression (which can also deal with large data sets) of z1 on z2, …, zD-1 is preferred, but other robust methods can also be considered
• missing values are imputed for each variable in turn (starting with the one with the most missing values)
• the procedure is repeated iteratively until convergence
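The loop described above can be sketched in a few lines. This is a simplified illustration under loud assumptions, not the authors' implementation: ordinary least squares stands in for fast LTS regression, a column-wise geometric mean stands in for the kNN initialization, the regression is fitted only on rows where the part was observed, and the names `pivot_ilr` and `impute_coda` are hypothetical.

```python
import numpy as np

def pivot_ilr(X):
    """Row-wise ilr (pivot) coordinates; column 0 isolates the first part."""
    logX = np.log(np.asarray(X, dtype=float))
    D = logX.shape[1]
    Z = np.empty((logX.shape[0], D - 1))
    for i in range(D - 1):
        k = D - i - 1  # number of parts to the right of part i
        Z[:, i] = np.sqrt(k / (k + 1)) * (logX[:, i] - logX[:, i + 1:].mean(axis=1))
    return Z

def impute_coda(X, max_iter=50, tol=1e-8):
    """Iterative regression imputation in ilr coordinates (NaN marks a missing cell)."""
    X = np.array(X, dtype=float)
    miss = np.isnan(X)
    D = X.shape[1]
    # ASSUMPTION: simple initialization by the geometric mean of the observed
    # values of each part (the talk uses a kNN procedure instead)
    for j in range(D):
        if miss[:, j].any():
            X[miss[:, j], j] = np.exp(np.nanmean(np.log(X[:, j])))
    order = np.argsort(-miss.sum(axis=0))  # most missing values first
    for _ in range(max_iter):
        X_old = X.copy()
        for j in order:
            m = miss[:, j]
            if not m.any():
                continue
            rest = [k for k in range(D) if k != j]
            Z = pivot_ilr(X[:, [j] + rest])  # part j moved to the first position
            A = np.column_stack([np.ones(len(Z)), Z[:, 1:]])
            # ASSUMPTION: ordinary least squares replaces the robust LTS fit
            beta, *_ = np.linalg.lstsq(A[~m], Z[~m, 0], rcond=None)
            z1_hat = A[m] @ beta
            # back-transform z1 to the part itself, leaving the other parts untouched
            gm_rest = np.exp(np.log(X[m][:, rest]).mean(axis=1))
            X[m, j] = gm_rest * np.exp(np.sqrt(D / (D - 1)) * z1_hat)
        if np.max(np.abs(np.log(X / X_old))) < tol:
            break
    return X

# made-up example data with one missing cell
data = np.array([[0.20, 0.30, 0.50],
                 [0.10, 0.40, 0.50],
                 [np.nan, 0.35, 0.45],
                 [0.25, 0.25, 0.50],
                 [0.15, 0.45, 0.40]])
completed = impute_coda(data)
print(completed)
```

Only the originally missing cells are changed; observed parts keep their values, so the ratios among observed parts of each row are preserved. Replacing the least squares step with a robust regression would bring the sketch closer to the procedure described on the slide.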
References
• Aitchison, J., 1986, The statistical analysis of compositional data. Chapman and Hall, London.
• Egozcue, J.J., Pawlowsky-Glahn, V., Mateu-Figueras, G., Barceló-Vidal, C., 2003, Isometric logratio transformations for compositional data analysis. Math. Geol., vol. 35, no. 3, p. 279-300.
• Filzmoser, P., Hron, K., 2008, Outlier detection for compositional data using robust methods. Math. Geosci., vol. 40, no. 3, p. 233-248.