TRANSCRIPT
From Sparse Regression to Sparse Multiple Correspondence Analysis
Gilbert Saporta CEDRIC, CNAM, Paris
• Joint work with Anne Bernard, Ph.D. student funded by the R&D department of the Chanel cosmetics company
• Industrial context and motivation:
– relate gene expression data to skin aging measures
– n = 500, p = 800,000 SNPs, 15,000 genes
ECDA 2013, Luxemburg 2
Outline
1. Introduction
2. Regularized regression
3. Sparse regression
4. Sparse PCA
5. Sparse MCA
6. Conclusion and perspectives
1. Introduction
• High-dimensional data: p >> n
– gene expression data
– chemometrics
– etc.
• Several solutions exist for regression problems that keep all the variables, but the results are difficult to interpret
• Sparse methods provide combinations of only a few variables
• This talk:
– a survey of sparse methods for supervised (regression) and unsupervised (PCA) problems
– new proposals for the unsupervised case when variables belong to disjoint groups or blocks:
• Group Sparse PCA
• Sparse multiple correspondence analysis
2. Regularized regression
• No OLS solution when p > n: an extreme case of multicollinearity
• Usual regularized regression techniques:
– component-based: PCR, PLS
– ridge
2.1 Principal components regression
• First papers: Kendall, Hotelling (1957), Malinvaud (1964)
• At most n components when p >> n
• Select q components and regress y upon them
– orthogonal components: the fit reduces to a sum of univariate regressions
– back to the original variables:
• The principal components are computed without reference to the response variable y:
– rank the components not by their eigenvalues but by r²(y; c_j)
• Choice of q: cross-validation
C = XU   (C: the n × q components matrix, U: the p × q loadings matrix)

ŷ = Cα̂ = α̂_1 c_1 + … + α̂_q c_q = XUα̂ = Xβ̂,   with β̂ = Uα̂
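These two steps can be sketched in a few lines of numpy (a minimal illustration on simulated data; the variable names follow the slide's notation):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 50, 10, 3

X = rng.standard_normal((n, p))
X -= X.mean(axis=0)                        # centred predictors
y = X @ rng.standard_normal(p) + 0.1 * rng.standard_normal(n)

# loadings U (p x q) from the SVD of X; components C = XU (n x q)
U = np.linalg.svd(X, full_matrices=False)[2].T[:, :q]
C = X @ U

# the components are orthogonal, so the multiple regression of y on C
# reduces to q univariate regressions
alpha = (C.T @ y) / (C ** 2).sum(axis=0)

beta = U @ alpha                           # back to the original variables
y_hat = X @ beta
```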
2.2 PLS regression
• Proposed by H. Wold and S. Wold (1960s)
• Close to PCR: projection onto a set of orthogonal combinations of the predictors
• PLS components are optimised to be predictive of both the X variables and y
• Tucker's criterion: max cov²(y; Xw)
• Trade-off between maximizing the correlation between t = Xw and y (OLS) and maximizing the variance of t (PCA):
cov²(y; Xw) = r²(y; Xw) · V(Xw) · V(y)
• Easy solution:
– w_j proportional to cov(y; x_j)
– no surprising signs
• Further components are obtained by iterating on the residuals
• Stopping rule: cross-validation
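A minimal numpy sketch of the first PLS component and the deflation step (simulated data; `w` and `t` follow the slide's notation):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 40, 8
X = rng.standard_normal((n, p))
X -= X.mean(axis=0)
y = X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.standard_normal(n)
y -= y.mean()

# first PLS component: w_j proportional to cov(y, x_j), then t = Xw
w = X.T @ y
w /= np.linalg.norm(w)
t = X @ w

# deflation: further components are sought on the residuals of X on t
X1 = X - np.outer(t, (X.T @ t) / (t @ t))
```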
2.3 Ridge regression
• Hoerl & Kennard (1970)
• Several interpretations
– Tikhonov regularization:

β̂_R = (X'X + kI)⁻¹ X'y

min ‖y − Xβ‖²  with  ‖β‖² ≤ c,   or   min ‖y − Xβ‖² + λ‖β‖²
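The closed-form ridge estimator can be sketched directly (simulated data with p > n; the `ridge` helper is an illustration, not a reference implementation):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 30, 60                 # p > n: no OLS solution, but ridge is defined
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

def ridge(X, y, k):
    # beta_R = (X'X + kI)^{-1} X'y ; X'X + kI is invertible for any k > 0
    return np.linalg.solve(X.T @ X + k * np.eye(X.shape[1]), X.T @ y)

b_small = ridge(X, y, 0.1)
b_large = ridge(X, y, 100.0)  # a larger k shrinks the coefficients more
```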
– Bayesian regression
• Gaussian prior for β: N(0, ψ²I)
• Gaussian distribution of Y given β: N(Xβ, σ²I)
• Maximum a posteriori (or posterior expectation) gives an interpretation for k:

β̂ = (X'X + (σ²/ψ²) I)⁻¹ X'y

• Choice of k: cross-validation
• Shrinkage properties (Hastie et al., 2009):
– PCR discards low-variance directions
– PLS shrinks low-variance directions but can inflate high-variance directions
– Ridge shrinks all principal directions, shrinking low-variance directions more
• Lost properties: unbiasedness and scale invariance (standardised data are needed)
3. Sparse regression
• Keeping all predictors is a drawback for high dimensional data: combinations of too many variables cannot be interpreted
• Sparse methods simultaneously shrink coefficients and select variables, which also tends to improve prediction
3.1 Lasso and elastic-net
• The Lasso (Tibshirani, 1996) imposes an L1 constraint on the coefficients
• The Lasso continuously shrinks the coefficients towards zero
• Convex optimisation; no explicit solution

β̂_lasso = argmin_β ‖y − Xβ‖² + λ Σ_{j=1}^p |β_j|

equivalently, minimise ‖y − Xβ‖² subject to Σ_{j=1}^p |β_j| ≤ c
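Since there is no explicit solution, the Lasso is solved iteratively; a minimal coordinate-descent sketch (the `soft` and `lasso_cd` helpers are illustrative, with simulated data):

```python
import numpy as np

def soft(z, t):
    # soft-thresholding: the scalar solution of the L1-penalised problem
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    # cyclic coordinate descent for  min ||y - Xb||^2 + lam * sum_j |b_j|
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            r = y - X @ b + X[:, j] * b[j]          # partial residual
            b[j] = soft(X[:, j] @ r, lam / 2.0) / (X[:, j] @ X[:, j])
    return b

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
y = X[:, 0] - X[:, 1] + 0.1 * rng.standard_normal(50)
b = lasso_cd(X, y, lam=5.0)   # noise coefficients are shrunk exactly to zero
```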
• Constraints and log-priors
– Like ridge regression, the Lasso is a Bayesian regression, but with a double-exponential (Laplace) prior:

f(β_j) = (λ/2) exp(−λ|β_j|)

– |β_j| is proportional to the log-prior
• Finding the optimal parameter
– cross-validation if optimal prediction is needed
– BIC when sparsity is the main concern; a good unbiased estimate of df(λ) is the number of nonzero coefficients (Zou et al., 2007):

λ_opt = argmin_λ  ‖y − Xβ̂(λ)‖² / (nσ²) + df(λ) log(n)/n
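A sketch of this BIC-type selection in the special case of an orthonormal design, where the Lasso has a closed-form soft-threshold solution (simulated data; the grid and σ are assumptions of the sketch):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 5
X = np.linalg.qr(rng.standard_normal((n, p)))[0]    # orthonormal columns
beta_true = np.array([3.0, -2.0, 0.0, 0.0, 0.0])
sigma = 0.5
y = X @ beta_true + sigma * rng.standard_normal(n)

ols = X.T @ y                      # OLS coefficients (X'X = I)

def lasso(lam):
    # closed-form solution of min ||y - Xb||^2 + lam * sum|b_j| here
    return np.sign(ols) * np.maximum(np.abs(ols) - lam / 2.0, 0.0)

def bic(lam):
    b = lasso(lam)
    df = np.count_nonzero(b)       # unbiased df estimate (Zou et al., 2007)
    rss = np.sum((y - X @ b) ** 2)
    return rss / (n * sigma ** 2) + df * np.log(n) / n

grid = np.linspace(0.01, 4.0, 80)
lam_opt = grid[np.argmin([bic(l) for l in grid])]
```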
• A more general form:

β̂ = argmin_β ‖y − Xβ‖² + λ Σ_{j=1}^p |β_j|^q

• q = 2: ridge; q = 1: Lasso; q = 0: subset selection (counts the number of variables)
• q > 1 does not produce null coefficients (the penalty is differentiable at zero)
• The Lasso produces a sparse model, but the number of selected variables cannot exceed the number of units
• Elastic net: combines the ridge and lasso penalties to allow selecting more predictors than observations (Zou & Hastie, 2005):

β̂_en = argmin_β ‖y − Xβ‖² + λ₂‖β‖² + λ₁‖β‖₁
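A naive elastic-net coordinate-descent sketch (the `enet_cd` helper is illustrative; note that with n < p more than n coefficients can stay active):

```python
import numpy as np

def soft(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def enet_cd(X, y, lam1, lam2, n_iter=200):
    # coordinate descent for min ||y-Xb||^2 + lam2*||b||^2 + lam1*sum|b_j|
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            r = y - X @ b + X[:, j] * b[j]           # partial residual
            b[j] = soft(X[:, j] @ r, lam1 / 2.0) / (X[:, j] @ X[:, j] + lam2)
    return b

rng = np.random.default_rng(4)
X = rng.standard_normal((10, 25))    # n = 10 < p = 25
y = rng.standard_normal(10)
b = enet_cd(X, y, lam1=0.01, lam2=1.0)
```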
3.2 Group Lasso
• The X matrix is divided into J sub-matrices X_j of p_j variables each
• Group Lasso: an extension of the Lasso for selecting groups of variables (Yuan & Lin, 2007):

β̂_GL = argmin_β ‖y − Σ_{j=1}^J X_j β_j‖² + λ Σ_{j=1}^J √p_j ‖β_j‖₂

If p_j = 1 for all j, the group Lasso reduces to the Lasso.
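The group penalty acts through block-wise soft-thresholding: a whole block is either shrunk or set to zero together. A minimal sketch of that proximal operator (the `group_soft` helper is illustrative):

```python
import numpy as np

def group_soft(b, groups, t):
    # proximal operator of t * sum_j sqrt(p_j) * ||b_j||_2:
    # each block is shrunk towards zero; blocks below the threshold vanish
    out = np.zeros_like(b)
    for g in groups:
        norm = np.linalg.norm(b[g])
        thr = t * np.sqrt(len(g))
        if norm > thr:
            out[g] = b[g] * (1.0 - thr / norm)
    return out

b = np.array([3.0, 4.0, 0.1, -0.1])
res = group_soft(b, [[0, 1], [2, 3]], t=1.0)   # the second block is dropped
```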
• Drawback: no sparsity within groups
• A solution: the sparse group lasso (Simon et al., 2012)
– two tuning parameters, chosen by grid search:

min_β ‖y − Σ_{j=1}^J X_j β_j‖² + λ₁ Σ_{j=1}^J ‖β_j‖₂ + λ₂ Σ_{j=1}^J Σ_{i=1}^{p_j} |β_ij|
3.3 Other sparse regression methods
• SCAD penalty (Fan & Li, 2001), "smoothly clipped absolute deviation"
– non-convex
• Sparse PLS: several extensions
– Chun & Keles (2010)
– Le Cao et al. (2008)
4. Sparse PCA
• In PCA, each PC is a linear combination of all the original variables, which makes the results difficult to interpret
• Challenge of sparse PCA: obtain easily interpretable components (many zero loadings in the principal factors)
• Principle of sparse PCA: modify PCA by imposing lasso/elastic-net constraints, constructing modified PCs with sparse loadings
• Warning: sparse PCA does not provide a global selection of variables but a selection dimension by dimension; this differs from the regression context (Lasso, Elastic Net, …)
4.1 First attempts
• Simple PCA
– Vines (2000): integer loadings
– Rousson & Gasser (2004): loadings restricted to (+, 0, −)
• SCoTLASS (Simplified Component Technique–LASSO), Jolliffe et al. (2003): PCA with an extra L1 constraint:

max u'Vu   with   u'u = 1   and   Σ_{j=1}^p |u_j| ≤ t
SCoTLASS properties:
• Non-convex problem
• t ≥ √p: usual PCA
• t < 1: no solution
• t = 1: exactly one nonzero coefficient
• Interesting sparse solutions require 1 < t < √p
4.2 S-PCA by Zou et al. (2006)

Let the SVD of X be X = UDV', with Z = UD the principal components.

PCA can be written as a regression-type optimization problem: the loadings can be recovered by a ridge regression of the PCs on the p variables:

β̂_ridge = argmin_β ‖Z_i − Xβ‖² + λ‖β‖²

Since X'X = VD²V' with V'V = I,

β̂_{i,ridge} = (X'X + λI)⁻¹ X'X v_i = (d_i² / (d_i² + λ)) v_i  ∝  v_i
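This shrinkage identity can be checked numerically (a sketch on simulated data):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.standard_normal((20, 6))
U, d, Vt = np.linalg.svd(X, full_matrices=False)
v1 = Vt[0]                         # first PCA loading vector

lam = 2.0
z1 = X @ v1                        # first principal component
b = np.linalg.solve(X.T @ X + lam * np.eye(6), X.T @ z1)
# b = (d_1^2 / (d_1^2 + lam)) * v1 : shrunk, but proportional to v1
```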
Sparse PCA adds a Lasso (L1) penalty to produce sparse loadings:

β̂ = argmin_β ‖Z_i − Xβ‖² + λ‖β‖² + λ₁‖β‖₁

v̂_i = β̂ / ‖β̂‖ is an approximation to v_i, and Xv̂_i is the ith approximated component.

This produces sparse loadings, with zero coefficients that facilitate interpretation.
Algorithm: alternate between an elastic-net step and an SVD step.
4.3 S-PCA via regularized SVD
• Shen & Huang (2008): start from the SVD X = Σ_k d_k u_k v_k' and add a sparsity-inducing penalty h_λ (L1, SCAD, etc.) to the rank-one approximation:

min_{u,v} ‖X − uv'‖² + Σ_{j=1}^p h_λ(|v_j|)
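A minimal alternating sketch of this rank-one regularized SVD with an L1 penalty (the `spca_rsvd` helper is illustrative; the λ/2 threshold comes from the scalar L1 subproblem in v):

```python
import numpy as np

def soft(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def spca_rsvd(X, lam, n_iter=100):
    # alternate v <- soft(X'u, lam/2) and u <- Xv/||Xv|| for the rank-one
    # criterion min ||X - uv'||^2 + lam * sum_j |v_j|, with ||u|| = 1
    u = np.linalg.svd(X, full_matrices=False)[0][:, 0]
    v = X.T @ u
    for _ in range(n_iter):
        v = soft(X.T @ u, lam / 2.0)
        if not v.any():            # penalty too strong: v vanishes entirely
            break
        u = X @ v / np.linalg.norm(X @ v)
    return u, v

rng = np.random.default_rng(6)
X = rng.standard_normal((30, 8))
u, v = spca_rsvd(X, lam=4.0)       # sparse loading vector v, unit-norm u
```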
• Loss of orthogonality
– SCoTLASS: orthogonal loadings but correlated components
– S-PCA: neither the loadings nor the components are orthogonal
– hence the % of explained variance must be adjusted
4.4 Group Sparse PCA

• The data matrix X is divided into J groups X_j of p_j variables
• Group Sparse PCA: a compromise between sparse PCA and the group Lasso
• Goal: select groups of continuous variables (zero coefficients for entire blocks of variables)
• Principle: replace the penalty of the SPCA criterion

β̂ = argmin_β ‖Z − Xβ‖² + λ‖β‖² + λ₁‖β‖₁

by the group-Lasso penalty:

β̂_GL = argmin_β ‖Z − Σ_{j=1}^J X_j β_j‖² + λ Σ_{j=1}^J √p_j ‖β_j‖₂
5. Sparse MCA

In MCA, selecting one column of the original table (a categorical variable X_j) amounts to selecting the block of p_j indicator variables it generates in the complete disjunctive table.

Sparse MCA selects categorical variables, not individual categories.
Principle: a straightforward extension of Group Sparse PCA to groups of indicator variables, with the chi-square metric; it uses the s-PCA r-SVD algorithm.

[Slide figure: a categorical column X_J of the original table shown next to its block of indicator columns X_{J1}, …, X_{Jp_j} in the complete disjunctive table]
Let F be the n × q disjunctive table divided by the number of units, with margins r = F1_q and c = F'1_n, and D_r = diag(r), D_c = diag(c).

Let F̃ be the matrix of standardised residuals:

F̃ = D_r^{−1/2} (F − rc') D_c^{−1/2}

Singular value decomposition: F̃ = UΛV'
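A numpy sketch of these formulas on a toy disjunctive table (2 categorical variables observed on 4 units; here F is normalised by the grand total so that it sums to 1, an assumption of the sketch):

```python
import numpy as np

# toy disjunctive table Z: 2 categorical variables observed on 4 units
Z = np.array([[1, 0, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)

F = Z / Z.sum()                          # correspondence matrix (sums to 1)
r = F.sum(axis=1)                        # row margins    r = F 1
c = F.sum(axis=0)                        # column margins c = F'1

# standardised residuals  D_r^{-1/2} (F - rc') D_c^{-1/2}, elementwise
S = (F - np.outer(r, c)) / np.sqrt(np.outer(r, c))

U, lam, Vt = np.linalg.svd(S)            # MCA = SVD of the residual matrix
```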
Properties                | MCA                  | Sparse MCA
Uncorrelated components   | TRUE                 | FALSE
Orthogonal loadings       | TRUE                 | FALSE
Barycentric property      | TRUE                 | TRUE
% of inertia              | λ_j / tot × 100      | ‖Z_{j·1,…,j−1}‖² / Σ_{j=1}^k ‖Z_{j·1,…,j−1}‖²
Total inertia             | (1/p) Σ_j (p_j − 1)  |

where Z_{j·1,…,j−1} are the residuals of Z_j after adjusting for Z_1, …, Z_{j−1} (regression projection)
Toy example: Dogs

Data:
• n = 27 breeds of dogs
• p = 6 variables
• q = 16 (total number of columns)
• X: 27 × 6 matrix of categorical variables
• K: 27 × 16 complete disjunctive table, K = (K_1, …, K_6)
• 1 block = 1 K_j matrix
Toy example: Dogs
λ = 0.25 is a compromise between the number of variables selected and the % of variance lost.
Toy example: Comparison of the loadings
Application on genetic data: Single Nucleotide Polymorphisms

Data:
• n = 502 individuals
• p = 537 SNPs (out of the more than 800,000 SNPs and 15,000 genes of the original data base)
• q = 1554 (total number of columns)
• X: 502 × 537 matrix of qualitative variables
• K: 502 × 1554 complete disjunctive table, K = (K_1, …, K_537)
• 1 block = 1 SNP = 1 K_j matrix
λ = 0.005: CPEV = 0.32% and 174 columns selected on Comp 1
Comparison of the loadings
6. Conclusions and perspectives
• Sparse techniques provide elegant and efficient solutions to problems posed by high-dimensional data:
– a new generation of data analysis methods with few restrictive hypotheses
• Very powerful for variable selection in high-dimensional problems:
– reduce noise as well as computation time
• Two new methods for unsupervised multiblock data: Group Sparse PCA for continuous variables and Sparse MCA for categorical variables
– both methods produce sparse loading structures that make the results easier to interpret and understand
– possibility of selecting superblocks (genes)
• Research in progress:
– extension of Sparse MCA to select both groups and predictors within a group (sparsity within groups)
• sparsity at both the group and individual feature levels
• a compromise between Sparse MCA and the sparse group lasso of Simon et al. (2012)
Thanks for your attention
References
• Chun, H. and Keles, S. (2010). Sparse partial least squares for simultaneous dimension reduction and variable selection. Journal of the Royal Statistical Society, Series B, 72, 3–25.
• Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96, 1348–1360.
• Hastie, T., Tibshirani, R. and Friedman, J. (2009). The Elements of Statistical Learning, 2nd edition. Springer.
• Jolliffe, I.T., Trendafilov, N.T. and Uddin, M. (2003). A modified principal component technique based on the LASSO. Journal of Computational and Graphical Statistics, 12, 531–547.
• Rousson, V. and Gasser, T. (2004). Simple component analysis. Journal of the Royal Statistical Society, Series C (Applied Statistics), 53, 539–555.
• Shen, H. and Huang, J.Z. (2008). Sparse principal component analysis via regularized low rank matrix approximation. Journal of Multivariate Analysis, 99, 1015–1034.
• Simon, N., Friedman, J., Hastie, T. and Tibshirani, R. (2012). A sparse-group lasso. Journal of Computational and Graphical Statistics.
• Tenenhaus, M. (1998). La régression PLS. Technip.
• Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58, 267–288.
• Vines, S.K. (2000). Simple principal components. Journal of the Royal Statistical Society, Series C (Applied Statistics), 49, 441–451.
• Yuan, M. and Lin, Y. (2007). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68, 49–67.
• Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67, 301–320.
• Zou, H., Hastie, T. and Tibshirani, R. (2006). Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15, 265–286.
• Zou, H., Hastie, T. and Tibshirani, R. (2007). On the "degrees of freedom" of the lasso. The Annals of Statistics, 35(5), 2173–2192.
• Topics: PLS Regression, PLS Path Modeling and their related methods, with applications in Management, Social Sciences, Chemometrics, Sensory Analysis, Industry and Life Sciences, including genomics.
• Keynote speakers:
– Anne-Laure BOULESTEIX, LMU München, Germany
– Peter BÜHLMANN, ETH Zürich, Switzerland
– Mohamed HANAFI, ONIRIS, Nantes, France
• http://www.pls14.org/