TRANSCRIPT
From Sparse Regression to Sparse Multiple Correspondence Analysis
Gilbert Saporta CEDRIC, CNAM, Paris
• Joint work with Anne Bernard, Ph.D. student funded by the R&D department of the Chanel cosmetics company
• Industrial context and motivation:
– relate gene expression data to skin aging measures
– n = 500, p = 800,000 SNPs, 15,000 genes
ECDA 2013, Luxemburg 2
Outline
1. Introduction
2. Regularized regression
3. Sparse regression
4. Sparse PCA
5. Sparse MCA
6. Conclusion and perspectives
1. Introduction
• High-dimensional data: p >> n
– gene expression data
– chemometrics
– etc.
• Several solutions exist for regression problems that keep all the variables, but the results are difficult to interpret
• Sparse methods provide combinations of only a few variables
• This talk:
– a survey of sparse methods for supervised (regression) and unsupervised (PCA) problems
– new proposals for the unsupervised case when variables belong to disjoint groups or blocks:
• Group Sparse PCA
• Sparse multiple correspondence analysis
2. Regularized regression
• No OLS solution when p > n: an extreme case of multicollinearity
• Usual regularized regression techniques:
– component-based: PCR, PLS
– ridge
2.1 Principal components regression
• First papers: Kendall, Hotelling (1957), Malinvaud (1964)
• At most n components when p >> n
• Select q components and regress y upon them
– orthogonal components: the fit reduces to a sum of univariate regressions
– back to the original variables:
• The principal components are computed without reference to the response variable y:
– rank the components not by their eigenvalues but by r²(y; c_j)
• Choice of q: cross-validation
C = XU   (C: the n × q components matrix, U: the p × q loadings matrix)

ŷ = Cα̂ = α̂_1 c_1 + … + α̂_q c_q = XUα̂ = Xβ̂,   with β̂ = Uα̂
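These two steps can be sketched in a few lines of numpy (a minimal illustration on simulated data; the variable names follow the slide's notation):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 50, 10, 3

X = rng.standard_normal((n, p))
X -= X.mean(axis=0)                        # centred predictors
y = X @ rng.standard_normal(p) + 0.1 * rng.standard_normal(n)

# loadings U (p x q) from the SVD of X; components C = XU (n x q)
U = np.linalg.svd(X, full_matrices=False)[2].T[:, :q]
C = X @ U

# the components are orthogonal, so the multiple regression of y on C
# reduces to q univariate regressions
alpha = (C.T @ y) / (C ** 2).sum(axis=0)

beta = U @ alpha                           # back to the original variables
y_hat = X @ beta
```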
2.2 PLS regression
• Proposed by H. Wold and S. Wold (1960s)
• Close to PCR: projection onto a set of orthogonal combinations of the predictors
• PLS components are optimised to be predictive of both the X variables and y
• Tucker's criterion: max cov²(y; Xw)
• Trade-off between maximizing the correlation between t = Xw and y (OLS) and maximizing the variance of t (PCA):
cov²(y; Xw) = r²(y; Xw) · V(Xw) · V(y)
• Easy solution:
– w_j proportional to cov(y; x_j)
– no surprising signs
• Further components are obtained by iterating on the residuals
• Stopping rule: cross-validation
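A minimal numpy sketch of the first PLS component and the deflation step (simulated data; `w` and `t` follow the slide's notation):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 40, 8
X = rng.standard_normal((n, p))
X -= X.mean(axis=0)
y = X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.standard_normal(n)
y -= y.mean()

# first PLS component: w_j proportional to cov(y, x_j), then t = Xw
w = X.T @ y
w /= np.linalg.norm(w)
t = X @ w

# deflation: further components are sought on the residuals of X on t
X1 = X - np.outer(t, (X.T @ t) / (t @ t))
```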
2.3 Ridge regression
• Hoerl & Kennard (1970)
• Several interpretations
– Tikhonov regularization:

β̂_R = (X'X + kI)⁻¹ X'y

min ‖y − Xβ‖²  with  ‖β‖² ≤ c,   or   min ‖y − Xβ‖² + λ‖β‖²
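The closed-form ridge estimator can be sketched directly (simulated data with p > n; the `ridge` helper is an illustration, not a reference implementation):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 30, 60                 # p > n: no OLS solution, but ridge is defined
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

def ridge(X, y, k):
    # beta_R = (X'X + kI)^{-1} X'y ; X'X + kI is invertible for any k > 0
    return np.linalg.solve(X.T @ X + k * np.eye(X.shape[1]), X.T @ y)

b_small = ridge(X, y, 0.1)
b_large = ridge(X, y, 100.0)  # a larger k shrinks the coefficients more
```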
– Bayesian regression
• Gaussian prior for β: N(0, ψ²I)
• Gaussian distribution of Y given β: N(Xβ, σ²I)
• Maximum a posteriori (or posterior expectation) gives an interpretation for k:

β̂ = (X'X + (σ²/ψ²) I)⁻¹ X'y

• Choice of k: cross-validation
• Shrinkage properties (Hastie et al., 2009):
– PCR discards low-variance directions
– PLS shrinks low-variance directions but can inflate high-variance directions
– Ridge shrinks all principal directions, shrinking low-variance directions more
• Lost properties: unbiasedness and scale invariance (standardised data are needed)
3. Sparse regression
• Keeping all predictors is a drawback for high dimensional data: combinations of too many variables cannot be interpreted
• Sparse methods simultaneously shrink coefficients and select variables, which also tends to improve prediction
3.1 Lasso and elastic-net
• The Lasso (Tibshirani, 1996) imposes an L1 constraint on the coefficients
• The Lasso continuously shrinks the coefficients towards zero
• Convex optimisation; no explicit solution

β̂_lasso = argmin_β ‖y − Xβ‖² + λ Σ_{j=1}^p |β_j|

equivalently, minimise ‖y − Xβ‖² subject to Σ_{j=1}^p |β_j| ≤ c
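Since there is no explicit solution, the Lasso is solved iteratively; a minimal coordinate-descent sketch (the `soft` and `lasso_cd` helpers are illustrative, with simulated data):

```python
import numpy as np

def soft(z, t):
    # soft-thresholding: the scalar solution of the L1-penalised problem
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    # cyclic coordinate descent for  min ||y - Xb||^2 + lam * sum_j |b_j|
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            r = y - X @ b + X[:, j] * b[j]          # partial residual
            b[j] = soft(X[:, j] @ r, lam / 2.0) / (X[:, j] @ X[:, j])
    return b

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
y = X[:, 0] - X[:, 1] + 0.1 * rng.standard_normal(50)
b = lasso_cd(X, y, lam=5.0)   # noise coefficients are shrunk exactly to zero
```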
• Constraints and log-priors
– Like ridge regression, the Lasso is a Bayesian regression, but with a double-exponential (Laplace) prior:

f(β_j) = (λ/2) exp(−λ|β_j|)

– |β_j| is proportional to the log-prior
• Finding the optimal parameter
– cross-validation if optimal prediction is needed
– BIC when sparsity is the main concern; a good unbiased estimate of df(λ) is the number of nonzero coefficients (Zou et al., 2007):

λ_opt = argmin_λ  ‖y − Xβ̂(λ)‖² / (nσ²) + df(λ) log(n)/n
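A sketch of this BIC-type selection in the special case of an orthonormal design, where the Lasso has a closed-form soft-threshold solution (simulated data; the grid and σ are assumptions of the sketch):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 5
X = np.linalg.qr(rng.standard_normal((n, p)))[0]    # orthonormal columns
beta_true = np.array([3.0, -2.0, 0.0, 0.0, 0.0])
sigma = 0.5
y = X @ beta_true + sigma * rng.standard_normal(n)

ols = X.T @ y                      # OLS coefficients (X'X = I)

def lasso(lam):
    # closed-form solution of min ||y - Xb||^2 + lam * sum|b_j| here
    return np.sign(ols) * np.maximum(np.abs(ols) - lam / 2.0, 0.0)

def bic(lam):
    b = lasso(lam)
    df = np.count_nonzero(b)       # unbiased df estimate (Zou et al., 2007)
    rss = np.sum((y - X @ b) ** 2)
    return rss / (n * sigma ** 2) + df * np.log(n) / n

grid = np.linspace(0.01, 4.0, 80)
lam_opt = grid[np.argmin([bic(l) for l in grid])]
```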
• A more general form:

β̂ = argmin_β ‖y − Xβ‖² + λ Σ_{j=1}^p |β_j|^q

• q = 2: ridge; q = 1: Lasso; q = 0: subset selection (counts the number of variables)
• q > 1 does not produce null coefficients (the penalty is differentiable at zero)
• The Lasso produces a sparse model, but the number of selected variables cannot exceed the number of units
• Elastic net: combines the ridge and lasso penalties to allow selecting more predictors than observations (Zou & Hastie, 2005):

β̂_en = argmin_β ‖y − Xβ‖² + λ₂‖β‖² + λ₁‖β‖₁
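A naive elastic-net coordinate-descent sketch (the `enet_cd` helper is illustrative; note that with n < p more than n coefficients can stay active):

```python
import numpy as np

def soft(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def enet_cd(X, y, lam1, lam2, n_iter=200):
    # coordinate descent for min ||y-Xb||^2 + lam2*||b||^2 + lam1*sum|b_j|
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            r = y - X @ b + X[:, j] * b[j]           # partial residual
            b[j] = soft(X[:, j] @ r, lam1 / 2.0) / (X[:, j] @ X[:, j] + lam2)
    return b

rng = np.random.default_rng(4)
X = rng.standard_normal((10, 25))    # n = 10 < p = 25
y = rng.standard_normal(10)
b = enet_cd(X, y, lam1=0.01, lam2=1.0)
```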
3.2 Group Lasso
• The X matrix is divided into J sub-matrices X_j of p_j variables each
• Group Lasso: an extension of the Lasso for selecting groups of variables (Yuan & Lin, 2007):

β̂_GL = argmin_β ‖y − Σ_{j=1}^J X_j β_j‖² + λ Σ_{j=1}^J √p_j ‖β_j‖₂

If p_j = 1 for all j, the group Lasso reduces to the Lasso.
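The group penalty acts through block-wise soft-thresholding: a whole block is either shrunk or set to zero together. A minimal sketch of that proximal operator (the `group_soft` helper is illustrative):

```python
import numpy as np

def group_soft(b, groups, t):
    # proximal operator of t * sum_j sqrt(p_j) * ||b_j||_2:
    # each block is shrunk towards zero; blocks below the threshold vanish
    out = np.zeros_like(b)
    for g in groups:
        norm = np.linalg.norm(b[g])
        thr = t * np.sqrt(len(g))
        if norm > thr:
            out[g] = b[g] * (1.0 - thr / norm)
    return out

b = np.array([3.0, 4.0, 0.1, -0.1])
res = group_soft(b, [[0, 1], [2, 3]], t=1.0)   # the second block is dropped
```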
• Drawback: no sparsity within groups
• A solution: the sparse group lasso (Simon et al., 2012)
– two tuning parameters, chosen by grid search:

min_β ‖y − Σ_{j=1}^J X_j β_j‖² + λ₁ Σ_{j=1}^J ‖β_j‖₂ + λ₂ Σ_{j=1}^J Σ_{i=1}^{p_j} |β_ij|
3.3 Other sparse regression methods
• SCAD penalty (Fan & Li, 2001), "smoothly clipped absolute deviation"
– non-convex
• Sparse PLS: several extensions
– Chun & Keles (2010)
– Le Cao et al. (2008)
4. Sparse PCA
• In PCA, each PC is a linear combination of all the original variables, which makes the results difficult to interpret
• Challenge of sparse PCA: obtain easily interpretable components (many zero loadings in the principal factors)
• Principle of sparse PCA: modify PCA by imposing lasso/elastic-net constraints, constructing modified PCs with sparse loadings
• Warning: sparse PCA does not provide a global selection of variables but a selection dimension by dimension; this differs from the regression context (Lasso, Elastic Net, …)
4.1 First attempts
• Simple PCA
– Vines (2000): integer loadings
– Rousson & Gasser (2004): loadings restricted to (+, 0, −)
• SCoTLASS (Simplified Component Technique–LASSO), Jolliffe et al. (2003): PCA with an extra L1 constraint:

max u'Vu   with   u'u = 1   and   Σ_{j=1}^p |u_j| ≤ t
SCoTLASS properties:
• Non-convex problem
• t ≥ √p: usual PCA
• t < 1: no solution
• t = 1: exactly one nonzero coefficient
• Interesting sparse solutions require 1 < t < √p
4.2 S-PCA by Zou et al. (2006)

Let the SVD of X be X = UDV', with Z = UD the principal components.

PCA can be written as a regression-type optimization problem: the loadings can be recovered by a ridge regression of the PCs on the p variables:

β̂_ridge = argmin_β ‖Z_i − Xβ‖² + λ‖β‖²

Since X'X = VD²V' with V'V = I,

β̂_{i,ridge} = (X'X + λI)⁻¹ X'X v_i = (d_i² / (d_i² + λ)) v_i  ∝  v_i
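This shrinkage identity can be checked numerically (a sketch on simulated data):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.standard_normal((20, 6))
U, d, Vt = np.linalg.svd(X, full_matrices=False)
v1 = Vt[0]                         # first PCA loading vector

lam = 2.0
z1 = X @ v1                        # first principal component
b = np.linalg.solve(X.T @ X + lam * np.eye(6), X.T @ z1)
# b = (d_1^2 / (d_1^2 + lam)) * v1 : shrunk, but proportional to v1
```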
Sparse PCA adds a Lasso (L1) penalty to produce sparse loadings:

β̂ = argmin_β ‖Z_i − Xβ‖² + λ‖β‖² + λ₁‖β‖₁

v̂_i = β̂ / ‖β̂‖ is an approximation to v_i, and Xv̂_i is the ith approximated component.

This produces sparse loadings, with zero coefficients that facilitate interpretation.
Algorithm: alternate between an elastic-net step and an SVD step.
4.3 S-PCA via regularized SVD
• Shen & Huang (2008): start from the SVD X = Σ_k d_k u_k v_k' and add a sparsity-inducing penalty h_λ (L1, SCAD, etc.) to the rank-one approximation:

min_{u,v} ‖X − uv'‖² + Σ_{j=1}^p h_λ(|v_j|)
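A minimal alternating sketch of this rank-one regularized SVD with an L1 penalty (the `spca_rsvd` helper is illustrative; the λ/2 threshold comes from the scalar L1 subproblem in v):

```python
import numpy as np

def soft(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def spca_rsvd(X, lam, n_iter=100):
    # alternate v <- soft(X'u, lam/2) and u <- Xv/||Xv|| for the rank-one
    # criterion min ||X - uv'||^2 + lam * sum_j |v_j|, with ||u|| = 1
    u = np.linalg.svd(X, full_matrices=False)[0][:, 0]
    v = X.T @ u
    for _ in range(n_iter):
        v = soft(X.T @ u, lam / 2.0)
        if not v.any():            # penalty too strong: v vanishes entirely
            break
        u = X @ v / np.linalg.norm(X @ v)
    return u, v

rng = np.random.default_rng(6)
X = rng.standard_normal((30, 8))
u, v = spca_rsvd(X, lam=4.0)       # sparse loading vector v, unit-norm u
```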
• Loss of orthogonality
– SCoTLASS: orthogonal loadings but correlated components
– S-PCA: neither the loadings nor the components are orthogonal
– hence the % of explained variance must be adjusted
4.4 Group Sparse PCA

• The data matrix X is divided into J groups X_j of p_j variables
• Group Sparse PCA: a compromise between sparse PCA and the group Lasso
• Goal: select groups of continuous variables (zero coefficients for entire blocks of variables)
• Principle: replace the penalty of the SPCA criterion

β̂ = argmin_β ‖Z − Xβ‖² + λ‖β‖² + λ₁‖β‖₁

by the group-Lasso penalty:

β̂_GL = argmin_β ‖Z − Σ_{j=1}^J X_j β_j‖² + λ Σ_{j=1}^J √p_j ‖β_j‖₂
5. Sparse MCA

In MCA, selecting one column of the original table (a categorical variable X_j) amounts to selecting the block of p_j indicator variables it generates in the complete disjunctive table.

Sparse MCA selects categorical variables, not individual categories.
Principle: a straightforward extension of Group Sparse PCA to groups of indicator variables, with the chi-square metric; it uses the s-PCA r-SVD algorithm.

[Slide figure: a categorical column X_J of the original table shown next to its block of indicator columns X_{J1}, …, X_{Jp_j} in the complete disjunctive table]
Let F be the n × q disjunctive table divided by the number of units, with margins r = F1_q and c = F'1_n, and D_r = diag(r), D_c = diag(c).

Let F̃ be the matrix of standardised residuals:

F̃ = D_r^{−1/2} (F − rc') D_c^{−1/2}

Singular value decomposition: F̃ = UΛV'
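A numpy sketch of these formulas on a toy disjunctive table (2 categorical variables observed on 4 units; here F is normalised by the grand total so that it sums to 1, an assumption of the sketch):

```python
import numpy as np

# toy disjunctive table Z: 2 categorical variables observed on 4 units
Z = np.array([[1, 0, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)

F = Z / Z.sum()                          # correspondence matrix (sums to 1)
r = F.sum(axis=1)                        # row margins    r = F 1
c = F.sum(axis=0)                        # column margins c = F'1

# standardised residuals  D_r^{-1/2} (F - rc') D_c^{-1/2}, elementwise
S = (F - np.outer(r, c)) / np.sqrt(np.outer(r, c))

U, lam, Vt = np.linalg.svd(S)            # MCA = SVD of the residual matrix
```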
Properties                | MCA                  | Sparse MCA
Uncorrelated components   | TRUE                 | FALSE
Orthogonal loadings       | TRUE                 | FALSE
Barycentric property      | TRUE                 | TRUE
% of inertia              | λ_j / tot × 100      | ‖Z_{j·1,…,j−1}‖² / Σ_{j=1}^k ‖Z_{j·1,…,j−1}‖²
Total inertia             | (1/p) Σ_j (p_j − 1)  |

where Z_{j·1,…,j−1} are the residuals of Z_j after adjusting for Z_1, …, Z_{j−1} (regression projection)
Toy example: Dogs

Data:
• n = 27 breeds of dogs
• p = 6 variables
• q = 16 (total number of columns)
• X: 27 × 6 matrix of categorical variables
• K: 27 × 16 complete disjunctive table, K = (K_1, …, K_6)
• 1 block = 1 K_j matrix
Toy example: Dogs
λ = 0.25 is a compromise between the number of variables selected and the % of variance lost.
Toy example: Comparison of the loadings
Application on genetic data: Single Nucleotide Polymorphisms

Data:
• n = 502 individuals
• p = 537 SNPs (out of the more than 800,000 SNPs and 15,000 genes of the original data base)
• q = 1554 (total number of columns)
• X: 502 × 537 matrix of qualitative variables
• K: 502 × 1554 complete disjunctive table, K = (K_1, …, K_537)
• 1 block = 1 SNP = 1 K_j matrix
λ = 0.005: CPEV = 0.32% and 174 columns selected on Comp 1
Comparison of the loadings
6. Conclusions and perspectives
• Sparse techniques provide elegant and efficient solutions to problems posed by high-dimensional data:
– a new generation of data analysis methods with few restrictive hypotheses
• Very powerful for variable selection in high-dimensional problems:
– reduce noise as well as computation time
• Two new methods for unsupervised multiblock data: Group Sparse PCA for continuous variables and Sparse MCA for categorical variables
– both methods produce sparse loading structures that make the results easier to interpret and understand
– possibility of selecting superblocks (genes)
• Research in progress:
– extension of Sparse MCA to select both groups and predictors within a group (sparsity within groups)
• sparsity at both the group and individual feature levels
• a compromise between Sparse MCA and the sparse group lasso of Simon et al. (2012)
Thanks for your attention
References
• Chun, H. and Keles, S. (2010). Sparse partial least squares for simultaneous dimension reduction and variable selection. Journal of the Royal Statistical Society, Series B, 72, 3–25.
• Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96, 1348–1360.
• Hastie, T., Tibshirani, R. and Friedman, J. (2009). The Elements of Statistical Learning, 2nd edition. Springer.
• Jolliffe, I.T., Trendafilov, N.T. and Uddin, M. (2003). A modified principal component technique based on the LASSO. Journal of Computational and Graphical Statistics, 12, 531–547.
• Rousson, V. and Gasser, T. (2004). Simple component analysis. Journal of the Royal Statistical Society, Series C (Applied Statistics), 53, 539–555.
• Shen, H. and Huang, J.Z. (2008). Sparse principal component analysis via regularized low rank matrix approximation. Journal of Multivariate Analysis, 99, 1015–1034.
• Simon, N., Friedman, J., Hastie, T. and Tibshirani, R. (2012). A sparse-group lasso. Journal of Computational and Graphical Statistics.
• Tenenhaus, M. (1998). La régression PLS. Technip.
• Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58, 267–288.
• Vines, S.K. (2000). Simple principal components. Journal of the Royal Statistical Society, Series C (Applied Statistics), 49, 441–451.
• Yuan, M. and Lin, Y. (2007). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68, 49–67.
• Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67, 301–320.
• Zou, H., Hastie, T. and Tibshirani, R. (2006). Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15, 265–286.
• Zou, H., Hastie, T. and Tibshirani, R. (2007). On the "degrees of freedom" of the lasso. The Annals of Statistics, 35(5), 2173–2192.
• Topics: PLS Regression, PLS Path Modeling and their related methods, with applications in Management, Social Sciences, Chemometrics, Sensory Analysis, Industry and Life Sciences, including genomics.
• Keynote speakers:
– Anne-Laure BOULESTEIX, LMU München, Germany
– Peter BÜHLMANN, ETH Zürich, Switzerland
– Mohamed HANAFI, ONIRIS, Nantes, France
• http://www.pls14.org/