
Sparse statistical modelling

Tom Bartlett


Introduction

‘A sparse statistical model is one having only a small number of nonzero parameters or weights.’ [1]

The number of features or variables measured on a person or object can be very large (e.g., expression levels of ∼30000 genes)

These measurements are often highly correlated, i.e., contain much redundant information

This scenario is particularly relevant in the age of ‘big-data’

[1] Trevor Hastie, Robert Tibshirani, and Martin Wainwright. Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press, 2015


Outline

Sparse linear models

Sparse PCA

Sparse SVD

Sparse CCA

Sparse LDA

Sparse clustering


Sparse linear models

A linear model can be written as

$$y_i = \alpha + \sum_{j=1}^{p} x_{ij}\beta_j + \varepsilon_i = \alpha + x_i^\top \beta + \varepsilon_i, \qquad i = 1, \dots, n$$

Hence, the model can be fit by minimising the objective function

$$\underset{\alpha,\,\beta}{\text{minimise}} \left\{ \sum_{i=1}^{N} \left(y_i - \alpha - x_i^\top \beta\right)^2 \right\}$$

Adding a penalisation term to the objective function makes the solution more sparse:

$$\underset{\alpha,\,\beta}{\text{minimise}} \left\{ \frac{1}{2N} \sum_{i=1}^{N} \left(y_i - \alpha - x_i^\top \beta\right)^2 + \lambda \|\beta\|_q^q \right\}, \quad \text{where } q = 1 \text{ or } 2$$


Sparse linear models

The penalty term $\lambda\|\beta\|_q^q$ means that only the bare minimum of the information available in the $p$ predictor variables $x_{ij}$, $j = 1, \dots, p$, is used:

$$\underset{\alpha,\,\beta}{\text{minimise}} \left\{ \frac{1}{2N} \sum_{i=1}^{N} \left(y_i - \alpha - x_i^\top \beta\right)^2 + \lambda \|\beta\|_q^q \right\}$$

q is typically chosen as q = 1 or q = 2, because these produce convex problems and hence are computationally much nicer!

q = 1 is called the ‘lasso’; it tends to set as many elements of β as possible to zero

q = 2 is called ‘ridge regression’; it tends to shrink the size of all the elements of β

Penalisation is equally applicable to other types of linear models: logistic regression, generalised linear models, etc.
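As a quick illustration, both penalties are available in the R package glmnet, whose elastic-net mixing parameter alpha = 1 gives the lasso and alpha = 0 gives ridge. This is a minimal sketch on simulated stand-in data:

```r
library(glmnet)

# Simulated stand-in data: n = 100 observations of p = 20 predictors,
# with only the first three predictors truly active.
set.seed(1)
n <- 100; p <- 20
x <- matrix(rnorm(n * p), n, p)
y <- as.numeric(x[, 1:3] %*% c(2, -1, 0.5) + rnorm(n))

fit_lasso <- glmnet(x, y, alpha = 1)  # q = 1: lasso penalty
fit_ridge <- glmnet(x, y, alpha = 0)  # q = 2: ridge penalty

# Coefficient paths against the L1 norm of beta, as in the plots that follow
plot(fit_lasso, xvar = "norm")
plot(fit_ridge, xvar = "norm")
```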


Sparse linear models - simple example

[Figure: coefficient paths for the lasso (left) and ridge regression (right), plotting the coefficients of hs, college, college4, not-hs and funding against the scaled penalty bounds $\|\beta\|_1 / \max\|\beta\|_1$ and $\|\beta\|_2 / \max\|\beta\|_2$ respectively.]

Crime-rate modelled according to 5 predictors: annual police funding in dollars per resident (funding), percent of people 25 years and older with four years of high school (hs), percent of 16- to 19-year-olds not in high school and not high school graduates (not-hs), percent of 18- to 24-year-olds in college (college), and percent of people 25 years and older with at least four years of college (college4).


Sparse linear models - genomics example

Gene expression data, for p = 17280 genes, for $n_c = 530$ cancer samples + $n_h = 61$ healthy tissue samples

Fit a logistic (i.e., 2-class, cancer/healthy) lasso model using the R package glmnet, selecting λ by cross-validation (a hedged sketch of this workflow follows the table below)

Out of 17280 possible genes for prediction, lasso chooses just these 25 (shown with their fitted model coefficients)

Gene     Coef      Gene      Coef      Gene      Coef
ADAMTS5  -0.0666   HPD       -0.00679  NUP210    0.00582
ADH4     -0.165    HS3ST4    -0.0863   PAFAH1B3  0.297
CA4      -0.151    IGSF10    -0.356    TACC3     0.128
CCDC36   -0.335    LRRTM2    -0.0711   TESC      -0.0568
CDH12    -0.253    LRRC3B    -0.211    TRPM3     -1.24
CES1     -0.302    MEG3      -0.022    TSLP      -0.0841
COL10A1  0.747     MMP11     0.22      WDR51A    0.0722
DPP6     -0.107    NUAK2     0.0354    WISP1     0.14
HHATL    -0.0665

Caveat: these are not necessarily the only ‘predictive’ genes. If we removed these genes from the data-set and fitted the model again, lasso would choose an entirely new set of genes which might be almost as good at predicting!
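A minimal sketch of this workflow, assuming an expression matrix expr (591 samples in rows, 17280 genes in columns) and a cancer/healthy label vector status; both names are hypothetical stand-ins for the actual data:

```r
library(glmnet)

# expr: hypothetical 591 x 17280 gene-expression matrix;
# status: hypothetical factor with levels "healthy" and "cancer".
set.seed(1)
cvfit <- cv.glmnet(expr, status, family = "binomial", alpha = 1)

# Coefficients at the lambda selected by cross-validation
b <- as.matrix(coef(cvfit, s = "lambda.min"))
selected <- rownames(b)[b != 0]
setdiff(selected, "(Intercept)")  # the genes retained by the lasso
```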


Sparse PCA

Ordinary PCA finds v by carrying out the optimisation:

$$\underset{\|v\|_2 = 1}{\text{maximise}} \left\{ v^\top \frac{X^\top X}{n}\, v \right\},$$

with $X \in \mathbb{R}^{n \times p}$ (i.e., n samples and p variables).

With $p \gg n$, the eigenvectors of the sample covariance matrix $X^\top X / n$ are not necessarily close to those of the population covariance matrix [2].

Hence ordinary PCA can fail in this context. This motivates sparse PCA, in which many entries of v are encouraged to be zero, by finding v via the optimisation:

$$\underset{\|v\|_2 = 1}{\text{maximise}} \left\{ v^\top X^\top X v \right\}, \quad \text{subject to: } \|v\|_1 \le t.$$

In effect this discards some variables such that p is closer to n.

[2] Iain M Johnstone. “On the distribution of the largest eigenvalue in principal components analysis”. In: Annals of Statistics (2001), pp. 295–327


Sparse SVD

The SVD of a matrix $X \in \mathbb{R}^{n \times p}$, with $n > p$, can be expressed as $X = UDV^\top$, where $U \in \mathbb{R}^{n \times p}$ and $V \in \mathbb{R}^{p \times p}$ have orthonormal columns and $D \in \mathbb{R}^{p \times p}$ is diagonal. The SVD can hence be found by carrying out the optimisation:

$$\underset{U \in \mathbb{R}^{n \times p},\, V \in \mathbb{R}^{p \times p},\, D \in \mathbb{R}^{p \times p}}{\text{minimise}} \; \|X - UDV^\top\|^2.$$

Hence, a sparse SVD with rank r can be obtained by carrying out the optimisation:

$$\underset{U \in \mathbb{R}^{n \times r},\, V \in \mathbb{R}^{p \times r},\, D \in \mathbb{R}^{r \times r}}{\text{minimise}} \left\{ \|X - UDV^\top\|^2 + \lambda_1 \|U\|_1 + \lambda_2 \|V\|_1 \right\}.$$

This allows SVD to be applied to the p > n scenario.


Sparse PCA and SVD - an algorithm

SVD is a generalisation of PCA. Hence, algorithms to solve the SVD problem can be applied to the PCA problem

The sparse PCA problem can thus be re-formulated as:

$$\underset{\|u\|_2 = \|v\|_2 = 1}{\text{maximise}} \left\{ u^\top X v \right\}, \quad \text{subject to: } \|v\|_1 \le t,$$

which is biconvex in u and v and can be solved by alternating between the updates:

$$u \leftarrow \frac{Xv}{\|Xv\|_2}, \quad \text{and} \quad v \leftarrow \frac{S_\lambda(X^\top u)}{\|S_\lambda(X^\top u)\|_2}, \qquad (1)$$

where $S_\lambda$ is the soft-thresholding operator $S_\lambda(x) = \mathrm{sign}(x)\,(|x| - \lambda)_+$.
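A minimal sketch of these alternating updates for a single component, with a fixed threshold λ (implementations such as the PMA package instead search for the λ at which $\|v\|_1$ meets the bound t):

```r
# Soft-thresholding operator: S_lambda(x) = sign(x) (|x| - lambda)_+
soft_threshold <- function(x, lambda) sign(x) * pmax(abs(x) - lambda, 0)

# Rank-1 sparse SVD/PCA via the alternating updates in (1).
sparse_rank1 <- function(X, lambda, n_iter = 100, tol = 1e-8) {
  v <- svd(X, nu = 0, nv = 1)$v[, 1]      # initialise from the ordinary SVD
  for (iter in seq_len(n_iter)) {
    u <- X %*% v
    u <- u / sqrt(sum(u^2))               # u <- Xv / ||Xv||_2
    v_new <- soft_threshold(t(X) %*% u, lambda)
    if (all(v_new == 0)) stop("lambda too large: v thresholded to zero")
    v_new <- v_new / sqrt(sum(v_new^2))   # v <- S(X'u) / ||S(X'u)||_2
    if (sum((v_new - v)^2) < tol) { v <- v_new; break }
    v <- v_new
  }
  list(u = as.numeric(u), v = as.numeric(v),
       d = as.numeric(t(u) %*% X %*% v))  # corresponding singular value
}
```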


Sparse PCA - simulation study

Define Σ as a p × p block-diagonal matrix, with p = 200 and 10 blocks of 1s of size 20 × 20.

Hence, we would expect there to be 10 independent components of variation in the corresponding distribution.

Generate n samples $x \sim \text{Normal}(0, \Sigma)$

Estimate $\hat{\Sigma} = \sum_i (x_i - \bar{x})(x_i - \bar{x})^\top / n$

Correlate the eigenvectors of $\hat{\Sigma}$ with the eigenvectors of Σ

Repeat 100 times for each different value of n (a sketch of this simulation follows the figure)

[Figure: ‘Top 10 PCs’, mean eigenvector correlation against n/p.] The plot shows the means of these correlations over the 100 repetitions for different values of n.
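A minimal sketch of one draw of this simulation; the original averages over 100 repetitions per value of n, and matching sample to population eigenvectors by maximum absolute correlation is an assumption here:

```r
library(MASS)  # for mvrnorm

p <- 200; block_size <- 20; n_blocks <- 10
Sigma <- matrix(0, p, p)
for (b in seq_len(n_blocks)) {            # ten 20 x 20 diagonal blocks of 1s
  idx <- ((b - 1) * block_size + 1):(b * block_size)
  Sigma[idx, idx] <- 1
}

eig_pop <- eigen(Sigma)$vectors[, 1:10]   # top 10 population eigenvectors

n <- 100                                  # e.g. n/p = 0.5
X <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)
Xc <- scale(X, center = TRUE, scale = FALSE)
Sigma_hat <- crossprod(Xc) / n            # sample covariance matrix
eig_samp <- eigen(Sigma_hat)$vectors[, 1:10]

# For each population eigenvector, its best-matching sample eigenvector
cors <- apply(abs(cor(eig_samp, eig_pop)), 2, max)
mean(cors)
```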


Sparse PCA - simulation study

An implementation of sparse PCA is available in the R package PMA as the function SPC. It proceeds similarly to the algorithm described earlier, which is presented in more detail by Witten, Tibshirani and Hastie [3].

I applied this function to the same simulation as described on the previous slide (a sketch follows below).

The scale of the penalisation is in terms of $\|u\|_1$, with $\|u\|_1 = \sqrt{p}$ giving the minimum penalisation and $\|u\|_1 = 1$ the maximum permissible penalisation.

[Figure: ‘Top 10 PCs’, mean eigenvector correlation against n/p, using sparse PCA.] The plot shows the result with $\|u\|_1 = \sqrt{p}$.

[3] Daniela M Witten, Robert Tibshirani, and Trevor Hastie. “A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis”. In: Biostatistics (2009), kxp008
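A minimal sketch of applying this to the simulated data above; SPC's sumabsv argument is the L1 bound on each sparse loading vector (a value between 1 and $\sqrt{p}$), which corresponds to the $\|u\|_1$ bound quoted on these slides:

```r
library(PMA)

# Sparse PCA on the centred simulated data Xc from the sketch above,
# with the weakest penalisation (L1 bound sqrt(p)) and K = 10 components.
out <- SPC(Xc, sumabsv = sqrt(ncol(Xc)), K = 10)
v_sparse <- out$v                            # p x 10 matrix of sparse loadings

# Compare with the population eigenvectors, as before
apply(abs(cor(v_sparse, eig_pop)), 2, max)
```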


Sparse PCA - simulation study

[Figure: ‘Top 10 PCs’, mean eigenvector correlation against n/p.] The plot shows the result with $\|u\|_1 = \sqrt{p}/2$.

[Figure: ‘Top 10 PCs’, mean eigenvector correlation against n/p.] The plot shows the result with $\|u\|_1 = \sqrt{p}/3$.


Sparse PCA - real data example

I carried out PCA on expression levels of 10138 genes in individual cells from developing brains

There are many different cell types in the data - some mature, some immature, and some in between

Different cell-types are characterised by different gene expression profiles

We would therefore expect to be able to visualise some separation of the cell-types by dimensionality reduction to three dimensions

[Figure] The plot shows the cells plotted in terms of the top three (standard) PCA components.


Sparse PCA - real data example

[Figure] The plot shows the cells in terms of the top three sparse PCA components, with $\|u\|_1 = 0.1\sqrt{p}$ (i.e., a high level of regularisation).

[Figure] The plot shows the cells in terms of the top three sparse PCA components, with $\|u\|_1 = 0.8\sqrt{p}$ (i.e., a low level of regularisation).


Sparse CCA

In CCA, the aim is to find coefficient vectors $u \in \mathbb{R}^p$ and $v \in \mathbb{R}^q$ which project the data-matrices $X \in \mathbb{R}^{n \times p}$ and $Y \in \mathbb{R}^{n \times q}$ so as to maximise the correlations between these projections.

Whereas PCA aims to find the ‘direction’ of maximum variance in a single data-matrix, CCA aims to find the ‘directions’ in the two data-matrices in which the variances best explain each other.

The CCA problem can be solved by carrying out the optimisation:

$$\underset{u \in \mathbb{R}^p,\, v \in \mathbb{R}^q}{\text{maximise}} \; \text{Cor}(Xu, Yv)$$

This problem is not well posed for n < max(p, q), in which case u and v can be found which trivially give Cor(Xu, Yv) = 1.

Sparse CCA solves this problem by carrying out the optimisation:

$$\underset{u \in \mathbb{R}^p,\, v \in \mathbb{R}^q}{\text{maximise}} \; \text{Cor}(Xu, Yv), \quad \text{subject to } \|u\|_1 \le t_1 \text{ and } \|v\|_1 \le t_2.$$
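A minimal sketch using the PMA package's CCA function, assuming data matrices X (n × p) and Y (n × q); the penaltyx and penaltyz arguments scale the L1 bounds and lie in (0, 1]:

```r
library(PMA)

# X: hypothetical n x p matrix; Y: hypothetical n x q matrix.
# penaltyx and penaltyz control the sparsity of u and v respectively.
out <- CCA(x = X, z = Y, typex = "standard", typez = "standard",
           penaltyx = 0.3, penaltyz = 0.3, K = 1)

u <- out$u                # sparse weights on the columns of X
v <- out$v                # sparse weights on the columns of Y
cor(X %*% u, Y %*% v)     # the achieved canonical correlation
```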


Sparse CCA - real data example

‘Cell cycle’ is a biological process involved in the replication of cells

Cell-cycle can be thought of as a latent process which is not directly observable in genomics data

It is driven by a small set of genes (particularly cyclins and cyclin-dependent kinases) from which it may be inferred

It has an effect on the expression of very many genes: hence it can also tend to act as a confounding factor when modelling many other biological processes

Used CCA here as an exploratory tool, with Y the data for the cell-cycle genes, and X the data for all the other genes.


Sparse LDA

LDA assigns item i to a group G based on a corresponding data-vector $x_i$, according to the posterior probability:

$$P(G = k \mid x_i) = \frac{\pi_k f_k(x_i)}{\sum_{l=1}^{K} \pi_l f_l(x_i)}, \quad \text{with}$$

$$f_k(x_i) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (x_i - \mu_k)^\top \Sigma^{-1} (x_i - \mu_k) \right\},$$

with prior πk and mean µk for group k, and covariance Σ.

This assignment takes place by constructing ‘decision boundaries’ between classes k and l:

$$\log \frac{P(G = k \mid x_i)}{P(G = l \mid x_i)} = \log \frac{\pi_k}{\pi_l} + x_i^\top \Sigma^{-1} (\mu_k - \mu_l) - \frac{1}{2} (\mu_k + \mu_l)^\top \Sigma^{-1} (\mu_k - \mu_l)$$

Because this boundary is linear in $x_i$, we get the name LDA.


Sparse LDA

The decision boundary

$$\log \frac{P(G = k \mid x_i)}{P(G = l \mid x_i)} = \log \frac{\pi_k}{\pi_l} + x_i^\top \Sigma^{-1} (\mu_k - \mu_l) - \frac{1}{2} (\mu_k + \mu_l)^\top \Sigma^{-1} (\mu_k - \mu_l)$$

then naturally leads to the decision rule:

$$G(x_i) = \underset{k}{\text{argmax}} \left\{ \log \pi_k + x_i^\top \Sigma^{-1} \mu_k - \frac{1}{2}\, \mu_k^\top \Sigma^{-1} \mu_k \right\}.$$

By assuming Σ is diagonal, i.e., that there is no covariance between the p dimensions, this decision rule can be reduced to the nearest centroids classifier:

$$G(x_i) = \underset{k}{\text{argmin}} \left\{ \sum_{j=1}^{p} \frac{(x_{ij} - \mu_{jk})^2}{\sigma_j^2} - 2 \log \pi_k \right\}.$$

Typically, Σ (or σ) are estimated from the data as $\hat{\Sigma}$ (or $\hat{\sigma}$), and the $\mu_k$ are estimated as $\hat{\mu}_k$ whilst training the classifier.


Sparse LDA

The nearest centroids classifier

$$G(x_i) = \underset{k}{\text{argmin}} \left\{ \sum_{j=1}^{p} \frac{(x_{ij} - \mu_{jk})^2}{\sigma_j^2} - 2 \log \pi_k \right\}$$

will typically use all p variables. This is often unnecessary and can lead to overfitting in high-dimensional contexts. The nearest shrunken centroids classifier deals with this issue.

Define $\mu_k = \bar{x} + \alpha_k$, where $\bar{x}$ is the data-mean across all classes, and $\alpha_k$ is the class-specific deviation of the mean from $\bar{x}$. Then, the nearest shrunken centroids classifier proceeds with the optimisation:

$$\underset{\alpha_k \in \mathbb{R}^p,\, k \in \{1,\dots,K\}}{\text{minimise}} \left\{ \frac{1}{2n} \sum_{k=1}^{K} \sum_{i \in C_k} \sum_{j=1}^{p} \frac{(x_{ij} - \bar{x}_j - \alpha_{jk})^2}{\sigma_j^2} + \lambda \sum_{k=1}^{K} \sum_{j=1}^{p} \sqrt{\frac{n_k}{\sigma_j^2}}\, |\alpha_{jk}| \right\},$$

where $C_k$ and $n_k$ are the set and number of samples in group k.


Sparse LDA

Hence, the $\alpha_k$ estimated from the optimisation

$$\underset{\alpha_k \in \mathbb{R}^p,\, k \in \{1,\dots,K\}}{\text{minimise}} \left\{ \frac{1}{2n} \sum_{k=1}^{K} \sum_{i \in C_k} \sum_{j=1}^{p} \frac{(x_{ij} - \bar{x}_j - \alpha_{jk})^2}{\sigma_j^2} + \lambda \sum_{k=1}^{K} \sum_{j=1}^{p} \sqrt{\frac{n_k}{\sigma_j^2}}\, |\alpha_{jk}| \right\}$$

can be used to estimate the shrunken centroids $\hat{\mu}_k = \bar{x} + \hat{\alpha}_k$, thus training the classifier:

$$G(x_i) = \underset{k}{\text{argmin}} \left\{ \sum_{j=1}^{p} \frac{(x_{ij} - \hat{\mu}_{jk})^2}{\hat{\sigma}_j^2} - 2 \log \hat{\pi}_k \right\}.$$


Sparse LDA - real data example

I applied nearest (shrunken) centroids to expression data for 14349 genes, for 347 cells of different types: leukocytes (54); lymphoblastic cells (88); fetal brain cells (16wk, 26; 21wk, 24); fibroblasts (37); ductal carcinoma (22); keratinocytes (40); B lymphoblasts (17); iPS cells (24); neural progenitors (15).

Used R packages MASS and pamr [4]. Carried out 100 repetitions of 3-fold CV. Plots show normalised mutual information (NMI), adjusted Rand index (ARI) and prediction accuracy. (A sketch of the pamr workflow follows the figure.)

[Figure: NMI, ARI and prediction accuracy against sparsity threshold; quantiles (100%, 75%, 50%, 25%, 0%) over the 300 predictions shown for sparse LDA and for regular LDA.]

[4] Robert Tibshirani et al. “Class prediction by nearest shrunken centroids, with applications to DNA microarrays”. In: Statistical Science (2003), pp. 104–117
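A minimal sketch of the nearest-shrunken-centroids workflow in pamr, assuming expr is a p × n matrix (pamr expects genes in rows and cells in columns) and celltype a factor of labels; the shrinkage threshold plays the role of λ above:

```r
library(pamr)

# expr: hypothetical p x n expression matrix (genes x cells);
# celltype: hypothetical factor of cell-type labels, one per cell.
d <- list(x = expr, y = celltype)
fit <- pamr.train(d)                        # fit over a grid of thresholds
cv <- pamr.cv(fit, d)                       # cross-validate the grid
best <- cv$threshold[which.min(cv$error)]   # threshold with smallest CV error

pred <- pamr.predict(fit, d$x, threshold = best)  # predicted classes
pamr.listgenes(fit, d, threshold = best)          # genes surviving shrinkage
```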


Sparse clustering

Many clustering methods, such as hierarchical clustering, are based on a dissimilarity measure $D_{i,i'} = \sum_{j=1}^{p} d_{i,i',j}$ between samples $i$ and $i'$.

One popular choice of dissimilarity measure is the Euclidean distance.

In high dimensions, it is often unnecessary to use information from all of the p dimensions.

A weighted dissimilarity measure $D_{i,i'} = \sum_{j=1}^{p} w_j d_{i,i',j}$ can be a useful approach to this problem. The weights can be obtained by the sparse matrix decomposition:

$$\underset{u \in \mathbb{R}^{n^2},\, w \in \mathbb{R}^p}{\text{maximise}} \; u^\top \Delta w, \quad \text{subject to } \|u\|_2 \le 1,\ \|w\|_2 \le 1,\ \|w\|_1 \le t,\ \text{and } w_j \ge 0,\ j \in \{1,\dots,p\},$$

where $w$ is the vector of the weights $w_j$, $j \in \{1,\dots,p\}$, and $\Delta \in \mathbb{R}^{n^2 \times p}$ contains the dissimilarity components, arranged such that each row of Δ corresponds to the $d_{i,i',j}$, $j \in \{1,\dots,p\}$, for a pair of samples $i$, $i'$.

This weighted dissimilarity measure can then be used for sparse clustering, such as sparse hierarchical clustering.


Sparse clustering

Some clustering methods, such as K-means, need a slightly modified approach.

K-means seeks to minimise the within-cluster sum of squares

$$\sum_{k=1}^{K} \sum_{i \in C_k} \|x_i - \bar{x}_k\|_2^2 = \sum_{k=1}^{K} \frac{1}{2 n_k} \sum_{i,i' \in C_k} \|x_i - x_{i'}\|_2^2,$$

where $C_k$ is the set of samples in cluster $k$, $n_k = |C_k|$, and $\bar{x}_k$ is the corresponding centroid.

Hence, a weighted K-means could proceed according to the optimisation:

$$\underset{w \in \mathbb{R}^p}{\text{minimise}} \left\{ \sum_{j=1}^{p} w_j \sum_{k=1}^{K} \frac{1}{n_k} \sum_{i,i' \in C_k} d_{i,i',j} \right\},$$

where $d_{i,i',j} = (x_{ij} - x_{i'j})^2$, and $n_k$ is the number of samples in cluster $k$.


Sparse clustering

However, for the optimisation

$$\underset{w \in \mathbb{R}^p}{\text{minimise}} \left\{ \sum_{j=1}^{p} w_j \sum_{k=1}^{K} \frac{1}{n_k} \sum_{i,i' \in C_k} d_{i,i',j} \right\},$$

it is not possible to choose a set of constraints which guarantee a non-pathological solution as well as convexity.

Instead, the between-cluster sum of squares can be maximised:

$$\underset{w \in \mathbb{R}^p}{\text{maximise}} \left\{ \sum_{j=1}^{p} w_j \left( \frac{1}{n} \sum_{i=1}^{n} \sum_{i'=1}^{n} d_{i,i',j} - \sum_{k=1}^{K} \frac{1}{n_k} \sum_{i,i' \in C_k} d_{i,i',j} \right) \right\},$$

subject to $\|w\|_2 \le 1$, $\|w\|_1 \le t$, and $w_j \ge 0$, $j \in \{1,\dots,p\}$.


Sparse clustering - real data examples

Applied (sparse) hierarchical clustering to the same benchmark expression data-set (14349 genes, for 347 cells of different types).

Used R package sparcl [5] for the sparse clustering. Plots show normalised mutual information (NMI) and adjusted Rand index (ARI) comparing sparse with standard clustering. (A sketch of the sparcl workflow follows below.)

[Figure: NMI and ARI against L1 bound, comparing sparse hierarchical clustering with standard hierarchical clustering.]

[5] Daniela M Witten and Robert Tibshirani. “A framework for feature selection in clustering”. In: Journal of the American Statistical Association (2012)
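A minimal sketch of the sparse hierarchical clustering workflow in sparcl, assuming x is an n × p matrix (cells in rows, genes in columns); the wbound argument is the L1 bound t on the feature weights w:

```r
library(sparcl)

# x: hypothetical n x p expression matrix (cells x genes).
# Choose the L1 bound by the permutation-based gap statistic, then cluster.
perm <- HierarchicalSparseCluster.permute(x, wbounds = c(2, 5, 10, 20, 50))
fit <- HierarchicalSparseCluster(x, wbound = perm$bestw, method = "complete")

plot(fit$hc)                            # dendrogram under weighted dissimilarity
head(sort(fit$ws, decreasing = TRUE))   # genes with the largest weights
```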


Sparse clustering - real data examples

Applied (sparse) k-means to the same benchmark expression data-set (14349 genes, for 347 cells of different types).

Used R package sparcl for the sparse clustering. Plots show normalised mutual information (NMI) and adjusted Rand index (ARI) comparing sparse with standard clustering. (A sketch follows the figure.)

[Figure: NMI and ARI against L1 bound, comparing sparse k-means with standard k-means.]
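The analogous sketch for sparse K-means; K = 10 here matches the ten cell groups listed earlier (counting the two fetal-brain gestational ages separately), which is an assumption about the original analysis:

```r
library(sparcl)

# Choose the L1 bound by permutation, then run sparse K-means with K = 10.
perm <- KMeansSparseCluster.permute(x, K = 10, wbounds = seq(2, 50, by = 6))
fit <- KMeansSparseCluster(x, K = 10, wbounds = perm$bestw)[[1]]

table(fit$Cs)      # cluster sizes
sum(fit$ws > 0)    # number of genes receiving nonzero weight
```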


Sparse clustering - real data examples

Spectral clustering essentially uses k-means clustering (or similar) in a dimensionally-reduced (e.g., PCA) space.

Applied standard k-means in sparse-PCA space to the same benchmark expression data-set (14349 genes, for 347 cells of different types).

Offers computational advantages, running in 9 seconds on a 2.8GHz Macbook, compared with 19 seconds for standard k-means, and 35 seconds for sparse k-means.

[Figure: NMI and ARI against L1 bound / √n, comparing sparse spectral k-means with standard k-means.]

