Learning algorithms for sparse classification

Luis Francisco Sanchez Merchante

To cite this version: Luis Francisco Sanchez Merchante. Learning algorithms for sparse classification. Computer science. Université de Technologie de Compiègne, 2013. English. NNT: 2013COMP2084. tel-00868847.

HAL Id: tel-00868847 (https://tel.archives-ouvertes.fr/tel-00868847), submitted on 2 Oct 2013.



By Luis Francisco SANCHEZ MERCHANTE

Thesis presented for the degree of Docteur de l'UTC

Learning algorithms for sparse classification

Defended on 7 June 2013

Speciality: Technologies de l'Information et des Systèmes

D2084

Algorithmes d'estimation pour la classification parcimonieuse

Luis Francisco Sanchez Merchante, Université de Technologie de Compiègne

Compiègne, France

"You never know what you will find behind a door. Perhaps that is what life is about: turning doorknobs."

Albert Espinosa

"Be brave. Take risks. Nothing can substitute experience."

Paulo Coelho

Acknowledgements

If this thesis has fallen into your hands and you have the curiosity to read this paragraph, you must know that, even though it is a short section, there are quite a lot of people behind this volume. All of them supported me during the three years, three months and three weeks that it took me to finish this work. However, you will hardly find any names. I think it is a little sad writing people's names in a document that they will probably not see and that will be condemned to gather dust on a bookshelf. It is like losing a wallet with pictures of your beloved family and friends. It makes me feel something like melancholy.

Obviously, this does not mean that I have nothing to be grateful for. I always felt unconditional love and support from my family, and I never felt homesick since my Spanish friends did the best they could to visit me frequently. During my time in Compiègne I met wonderful people that are now friends for life. I am sure that all these people do not need to be listed in this section to know how much I love them; I thank them every time we see each other by giving them the best of myself.

I enjoyed my time in Compiègne. It was an exciting adventure and I do not regret a single thing. I am sure that I will miss these days, but this does not make me sad because, as the Beatles sang in "The End" or Jorge Drexler in "Todo se transforma", the amount that you miss people is equal to the love you gave them and received from them.

The only names I am including are my supervisors', Yves Grandvalet and Gérard Govaert. I do not think it is possible to have had better teaching and supervision, and I am sure that the reason I finished this work was not only their technical advice, but also their close support, humanity and patience.

Contents

List of Figures

List of Tables

Notation and Symbols

I Context and Foundations

1 Context

2 Regularization for Feature Selection
2.1 Motivations
2.2 Categorization of Feature Selection Techniques
2.3 Regularization
2.3.1 Important Properties
2.3.2 Pure Penalties
2.3.3 Hybrid Penalties
2.3.4 Mixed Penalties
2.3.5 Sparsity Considerations
2.3.6 Optimization Tools for Regularized Problems

II Sparse Linear Discriminant Analysis

Abstract

3 Feature Selection in Fisher Discriminant Analysis
3.1 Fisher Discriminant Analysis
3.2 Feature Selection in LDA Problems
3.2.1 Inertia Based
3.2.2 Regression Based

4 Formalizing the Objective
4.1 From Optimal Scoring to Linear Discriminant Analysis
4.1.1 Penalized Optimal Scoring Problem
4.1.2 Penalized Canonical Correlation Analysis
4.1.3 Penalized Linear Discriminant Analysis
4.1.4 Summary
4.2 Practicalities
4.2.1 Solution of the Penalized Optimal Scoring Regression
4.2.2 Distance Evaluation
4.2.3 Posterior Probability Evaluation
4.2.4 Graphical Representation
4.3 From Sparse Optimal Scoring to Sparse LDA
4.3.1 A Quadratic Variational Form
4.3.2 Group-Lasso OS as Penalized LDA

5 GLOSS Algorithm
5.1 Regression Coefficients Updates
5.1.1 Cholesky decomposition
5.1.2 Numerical Stability
5.2 Score Matrix
5.3 Optimality Conditions
5.4 Active and Inactive Sets
5.5 Penalty Parameter
5.6 Options and Variants
5.6.1 Scaling Variables
5.6.2 Sparse Variant
5.6.3 Diagonal Variant
5.6.4 Elastic net and Structured Variant

6 Experimental Results
6.1 Normalization
6.2 Decision Thresholds
6.3 Simulated Data
6.4 Gene Expression Data
6.5 Correlated Data
Discussion

III Sparse Clustering Analysis

Abstract

7 Feature Selection in Mixture Models
7.1 Mixture Models
7.1.1 Model
7.1.2 Parameter Estimation: The EM Algorithm
7.2 Feature Selection in Model-Based Clustering
7.2.1 Based on Penalized Likelihood
7.2.2 Based on Model Variants
7.2.3 Based on Model Selection

8 Theoretical Foundations
8.1 Resolving EM with Optimal Scoring
8.1.1 Relationship Between the M-Step and Linear Discriminant Analysis
8.1.2 Relationship Between Optimal Scoring and Linear Discriminant Analysis
8.1.3 Clustering Using Penalized Optimal Scoring
8.1.4 From Sparse Optimal Scoring to Sparse Linear Discriminant Analysis
8.2 Optimized Criterion
8.2.1 A Bayesian Derivation
8.2.2 Maximum a Posteriori Estimator

9 Mix-GLOSS Algorithm
9.1 Mix-GLOSS
9.1.1 Outer Loop: Whole Algorithm Repetitions
9.1.2 Penalty Parameter Loop
9.1.3 Inner Loop: EM Algorithm
9.2 Model Selection

10 Experimental Results
10.1 Tested Clustering Algorithms
10.2 Results
10.3 Discussion
Conclusions

Appendix

A Matrix Properties

B The Penalized-OS Problem is an Eigenvector Problem
B.1 How to Solve the Eigenvector Decomposition
B.2 Why the OS Problem is Solved as an Eigenvector Problem

C Solving Fisher's Discriminant Problem

D Alternative Variational Formulation for the Group-Lasso
D.1 Useful Properties
D.2 An Upper Bound on the Objective Function

E Invariance of the Group-Lasso to Unitary Transformations

F Expected Complete Likelihood and Likelihood

G Derivation of the M-Step Equations
G.1 Prior probabilities
G.2 Means
G.3 Covariance Matrix

Bibliography

List of Figures

1.1 MASH project logo
2.1 Example of relevant features
2.2 Four key steps of feature selection
2.3 Admissible sets in two dimensions for different pure norms ‖β‖_p
2.4 Two dimensional regularized problems with ‖β‖₁ and ‖β‖₂ penalties
2.5 Admissible sets for the Lasso and Group-Lasso
2.6 Sparsity patterns for an example with 8 variables characterized by 4 parameters
4.1 Graphical representation of the variational approach to Group-Lasso
5.1 GLOSS block diagram
5.2 Graph and Laplacian matrix for a 3×3 image
6.1 TPR versus FPR for all simulations
6.2 2D-representations of Nakayama and Sun datasets based on the two first discriminant vectors provided by GLOSS and SLDA
6.3 USPS digits "1" and "0"
6.4 Discriminant direction between digits "1" and "0"
6.5 Sparse discriminant direction between digits "1" and "0"
9.1 Mix-GLOSS Loops Scheme
9.2 Mix-GLOSS model selection diagram
10.1 Class mean vectors for each artificial simulation
10.2 TPR versus FPR for all simulations

List of Tables

6.1 Experimental results for simulated data, supervised classification
6.2 Average TPR and FPR for all simulations
6.3 Experimental results for gene expression data, supervised classification
10.1 Experimental results for simulated data, unsupervised clustering
10.2 Average TPR versus FPR for all clustering simulations

Notation and Symbols

Throughout this thesis, vectors are denoted by lowercase letters in bold font and matrices by uppercase letters in bold font. Unless otherwise stated, vectors are column vectors and parentheses are used to build line vectors from comma-separated lists of scalars, or to build matrices from comma-separated lists of column vectors.

Sets

N — the set of natural numbers, N = {1, 2, ...}
R — the set of reals
|A| — cardinality of a set A (for finite sets, the number of elements)
Ā — complement of set A

Data

X — input domain
x_i — input sample, x_i ∈ X
X — design matrix, X = (x₁⊤, ..., x_n⊤)⊤
x^j — column j of X
y_i — class indicator of sample i
Y — indicator matrix, Y = (y₁⊤, ..., y_n⊤)⊤
z — complete data, z = (x, y)
G_k — set of the indices of observations belonging to class k
n — number of examples
K — number of classes
p — dimension of X
i, j, k — indices running over N

Vectors, Matrices and Norms

0 — vector with all entries equal to zero
1 — vector with all entries equal to one
I — identity matrix
A⊤ — transpose of matrix A (ditto for vectors)
A⁻¹ — inverse of matrix A
tr(A) — trace of matrix A
|A| — determinant of matrix A
diag(v) — diagonal matrix with v on the diagonal
‖v‖₁ — L1 norm of vector v
‖v‖₂ — L2 norm of vector v
‖A‖_F — Frobenius norm of matrix A

Probability

E[·] — expectation of a random variable
var[·] — variance of a random variable
N(μ, σ²) — normal distribution with mean μ and variance σ²
W(W, ν) — Wishart distribution with ν degrees of freedom and scale matrix W
H(X) — entropy of random variable X
I(X; Y) — mutual information between random variables X and Y

Mixture Models

y_ik — hard membership of sample i to cluster k
f_k — distribution function for cluster k
t_ik — posterior probability of sample i belonging to cluster k
T — posterior probability matrix
π_k — prior probability or mixture proportion for cluster k
μ_k — mean vector of cluster k
Σ_k — covariance matrix of cluster k
θ_k — parameter vector for cluster k, θ_k = (μ_k, Σ_k)
θ^(t) — parameter vector at iteration t of the EM algorithm
f(X; θ) — likelihood function
L(θ; X) — log-likelihood function
L_C(θ; X, Y) — complete log-likelihood function

Optimization

J(·) — cost function
L(·) — Lagrangian
β̂ — generic notation for the solution with respect to β
β^ls — least squares solution coefficient vector
A — active set
γ — step size to update the regularization path
h — direction to update the regularization path

Penalized models

λ, λ₁, λ₂ — penalty parameters
P_λ(θ) — penalty term over a generic parameter vector
β_kj — coefficient j of discriminant vector k
β_k — kth discriminant vector, β_k = (β_k1, ..., β_kp)
B — matrix of discriminant vectors, B = (β₁, ..., β_{K−1})
β^j — jth row of B = (β^{1⊤}, ..., β^{p⊤})⊤
B_LDA — coefficient matrix in the LDA domain
B_CCA — coefficient matrix in the CCA domain
B_OS — coefficient matrix in the OS domain
X_LDA — data matrix in the LDA domain
X_CCA — data matrix in the CCA domain
X_OS — data matrix in the OS domain
θ_k — score vector k
Θ — score matrix, Θ = (θ₁, ..., θ_{K−1})
Y — label matrix
Ω — penalty matrix
L_CP(θ; X, Z) — penalized complete log-likelihood function
Σ_B — between-class covariance matrix
Σ_W — within-class covariance matrix
Σ_T — total covariance matrix
Σ̂_B — sample between-class covariance matrix
Σ̂_W — sample within-class covariance matrix
Σ̂_T — sample total covariance matrix
Λ — inverse of the covariance matrix, or precision matrix
w_j — weights
τ_j — penalty components of the variational approach

Part I

Context and Foundations


This thesis is divided in three parts. In Part I, I introduce the context in which this work has been developed, the project that funded it and the constraints that we had to obey. Generic foundations are also detailed here to introduce the models and some basic concepts that will be used along this document. The state of the art of regularization techniques for feature selection is also reviewed.

The first contribution of this thesis is explained in Part II, where I present the supervised learning algorithm GLOSS and its supporting theory, as well as some experiments to test its performance compared to other state-of-the-art mechanisms. Before describing the algorithm and the experiments, its theoretical foundations are provided.

The second contribution is described in Part III, with a structure analogous to that of Part II, but for the unsupervised domain. The clustering algorithm Mix-GLOSS adapts the supervised technique from Part II by means of a modified EM algorithm. This part is also furnished with specific theoretical foundations, an experimental section and a final discussion.


1 Context

The MASH project is a research initiative to investigate the open and collaborative design of feature extractors for the Machine Learning scientific community. The project is structured around a web platform (http://mash-project.eu) comprising collaborative tools such as wiki-documentation, forums, coding templates and an experiment center empowered with non-stop calculation servers. The applications targeted by MASH are vision and goal-planning problems, either in a 3D virtual environment or with a real robotic arm.

The MASH consortium is led by the IDIAP Research Institute in Switzerland. The other members are the University of Potsdam in Germany, the Czech Technical University of Prague, the National Institute for Research in Computer Science and Control (INRIA) in France, and the National Centre for Scientific Research (CNRS), also in France, through the laboratory of Heuristics and Diagnosis for Complex Systems (HEUDIASYC) attached to the University of Technology of Compiègne.

From the research point of view, the members of the consortium must deal with four main goals:

1. Software development of the website, framework and APIs.

2. Classification and goal-planning in high dimensional feature spaces.

3. Interfacing the platform with the 3D virtual environment and the robot arm.

4. Building tools to assist contributors with the development of the feature extractors and the configuration of the experiments.


Figure 1.1: MASH project logo


The work detailed in this text has been done in the context of goal 4. From the very beginning of the project, our role has been to provide the users with some feedback regarding the feature extractors. At the moment of writing this thesis, the number of public feature extractors reaches 75. In addition to the public ones, there are also private extractors that contributors decide not to share with the rest of the community; the last number I was aware of was about 300. Within those 375 extractors, there must be some sharing the same theoretical principles or supplying similar features. The framework of the project tests every new piece of code with some datasets of reference, in order to provide a ranking depending on the quality of the estimation. However, similar performance of two extractors on a particular dataset does not mean that both are using the same variables.

Our engagement was to provide some textual or graphical tools to discover which extractors compute features similar to other ones. Our hypothesis is that many of them use the same theoretical foundations, which should induce a grouping of similar extractors. If we succeed in discovering those groups, we will also be able to select representatives. This information can be used in several ways. For example, from the perspective of a user who develops feature extractors, it would be interesting to compare the performance of his code against the K representatives instead of against the whole database. As another example, imagine a user who wants to obtain the best prediction results for a particular dataset. Instead of selecting all the feature extractors, creating an extremely high dimensional space, he could select only the K representatives, foreseeing similar results with a faster experiment.

As there is no prior knowledge about the latent structure, we make use of unsupervised techniques. Below is a brief description of the different tools that we developed for the web platform.

• Clustering Using Mixture Models: This is a well-known technique that models the data as if it was randomly generated from a distribution function. This distribution is typically a mixture of Gaussians with unknown mixture proportions, means and covariance matrices. The number of Gaussian components matches the number of expected groups. The parameters of the model are computed using the EM algorithm and the clusters are built by maximum a posteriori estimation. For the calculation we use mixmod, a C++ library that can be interfaced with Matlab. This library allows working with high dimensional data. Further information regarding mixmod is given by Biernacki et al. (2008). All details concerning the tool implemented are given in deliverable "mash-deliverable-D71-m12" (Govaert et al. 2010).

• Sparse Clustering Using Penalized Optimal Scoring: This technique again intends to perform clustering by modelling the data as a mixture of Gaussian distributions. However, instead of using a classic EM algorithm for estimating the components' parameters, the M-step is replaced by a penalized Optimal Scoring problem. This replacement induces sparsity, improving the robustness and the interpretability of the results. Its theory will be explained later in this thesis. All details concerning the tool implemented can be found in deliverable "mash-deliverable-D72-m24" (Govaert et al. 2011).

• Table Clustering Using The RV Coefficient: This technique applies clustering methods directly to the tables computed by the feature extractors, instead of creating a single matrix. A distance in the extractors space is defined using the RV coefficient, which is a multivariate generalization of Pearson's correlation coefficient in the form of an inner product. The distance is defined for every pair i and j as RV(O_i, O_j), where O_i and O_j are operators computed from the tables returned by feature extractors i and j. Once we have a distance matrix, several standard techniques may be used to group extractors. A detailed description of this technique can be found in deliverables "mash-deliverable-D71-m12" (Govaert et al. 2010) and "mash-deliverable-D72-m24" (Govaert et al. 2011).

I am not extending this section with further explanations about the MASH project or deeper details about the theory that we used to fulfil our engagements. I simply refer to the public deliverables of the project, where everything is carefully detailed (Govaert et al. 2010, 2011).


2 Regularization for Feature Selection

With the advances in technology, data is becoming larger and larger, resulting in high dimensional ensembles of information. Genomics, textual indexation and medical images are some examples of data that can easily exceed thousands of dimensions. The first experiments aiming to cluster the data from the MASH project (see Chapter 1) intended to work with the whole dimensionality of the samples. As the number of feature extractors rose, the numerical issues also rose. Redundancy or extremely correlated features may happen if two contributors implement the same extractor with different names. When the number of features exceeded the number of samples, we started to deal with singular covariance matrices, whose inverses are not defined; many algorithms in the field of Machine Learning make use of this statistic.

2.1 Motivations

There is a quite recent effort in the direction of handling high dimensional data. Traditional techniques can be adapted, but quite often large dimensions render those techniques useless. Linear Discriminant Analysis was shown to be no better than a "random guessing" of the object labels when the dimension is larger than the sample size (Bickel and Levina 2004, Fan and Fan 2008).

As a rule of thumb, in discriminant and clustering problems, the complexity of the calculus increases with the number of objects in the database, the number of features (dimensionality) and the number of classes or clusters. One way to reduce this complexity is to reduce the number of features. This reduction induces more robust estimators, allows faster learning and predictions in supervised environments, and provides easier interpretations in the unsupervised framework. Removing features must be done wisely to avoid removing critical information.

When talking about dimensionality reduction, there are two families of techniques that could induce confusion:

• Reduction by feature transformation summarizes the dataset with fewer dimensions by creating combinations of the original attributes. These techniques are less effective when there are many irrelevant attributes (noise). Principal Component Analysis and Independent Component Analysis are two popular examples.

• Reduction by feature selection removes irrelevant dimensions, preserving the integrity of the informative features from the original dataset. The problem comes out when there is a restriction in the number of variables to preserve and discarding the exceeding dimensions leads to a loss of information. Prediction with feature selection is computationally cheaper because only relevant features are used, and the resulting models are easier to interpret. The Lasso operator is an example of this category.

Figure 2.1: Example of relevant features, from Chidlovskii and Lecerf (2008)

As a basic rule, we can use reduction techniques by feature transformation when the majority of the features are relevant and when there is a lot of redundancy or correlation. On the contrary, feature selection techniques are useful when there are plenty of useless or noisy features (irrelevant information) that need to be filtered out. In the paper of Chidlovskii and Lecerf (2008) we find a great explanation about the difference between irrelevant and redundant features. The following two paragraphs are almost exact reproductions of their text:

"Irrelevant features are those which provide negligible distinguishing information. For example, if the objects are all dogs, cats or squirrels, and it is desired to classify each new animal into one of these three classes, the feature of color may be irrelevant if each of dogs, cats and squirrels have about the same distribution of brown, black and tan fur colors. In such a case, knowing that an input animal is brown provides negligible distinguishing information for classifying the animal as a cat, dog or squirrel. Features which are irrelevant for a given classification problem are not useful, and accordingly a feature that is irrelevant can be filtered out.

Redundant features are those which provide distinguishing information but are cumulative to another feature or group of features that provide substantially the same distinguishing information. Using the previous example, consider illustrative "diet" and "domestication" features. Dogs and cats both have similar carnivorous diets, while squirrels consume nuts and so forth. Thus the "diet" feature can efficiently distinguish squirrels from dogs and cats, although it provides little information to distinguish between dogs and cats. Dogs and cats are also both typically domesticated animals, while squirrels are wild animals. Thus the "domestication" feature provides substantially the same information as the "diet" feature, namely distinguishing squirrels from dogs and cats, but not distinguishing between dogs and cats. Thus the "diet" and "domestication" features are cumulative, and one can identify one of these features as redundant, so as to be filtered out. However, unlike irrelevant features, care should be taken with redundant features to ensure that one retains enough of the redundant features to provide the relevant distinguishing information. In the foregoing example, one may wish to filter out either the "diet" feature or the "domestication" feature, but if one removes both the "diet" and the "domestication" features, then useful distinguishing information is lost."

There are some tricks to build robust estimators when the number of features exceeds the number of samples. Ignoring some of the dependencies among variables and replacing the covariance matrix by a diagonal approximation are two of them. Another popular technique, and the one chosen in this thesis, is imposing regularity conditions.

2.2 Categorization of Feature Selection Techniques

Feature selection is one of the most frequent techniques in preprocessing data, in order to remove irrelevant, redundant or noisy features. Nevertheless, the risk of removing some informative dimensions is always there, thus the relevance of the remaining subset of features must be measured.

I reproduce here the scheme that generalizes any feature selection process, as it is shown by Liu and Yu (2005). Figure 2.2 provides a very intuitive scheme with the four key steps in a feature selection algorithm.

Figure 2.2: The four key steps of feature selection, according to Liu and Yu (2005)

The classification of those algorithms can respond to different criteria. Guyon and Elisseeff (2003) propose a check list that summarizes the steps that may be taken to solve a feature selection problem, guiding the user through several techniques. Liu and Yu (2005) propose a framework that integrates supervised and unsupervised feature selection algorithms through a categorizing framework. Both references are excellent reviews to characterize feature selection techniques according to their characteristics. I propose below a framework inspired by these references; it does not cover all the possibilities, but it gives a good summary of the existing ones.

• Depending on the type of integration with the machine learning algorithm, we have:

– Filter Models: Filter models work as a preprocessing step, using an independent evaluation criterion to select a subset of variables without assistance of the mining algorithm.

– Wrapper Models: Wrapper models require a classification or clustering algorithm and use its prediction performance to assess the relevance of the subset selection. The feature selection is done in the optimization block, while the feature subset evaluation is done in a different one. Therefore the criterion to optimize and the criterion to evaluate may be different. Those algorithms are computationally expensive.

– Embedded Models: They perform variable selection inside the learning machine, with the selection being made at the training step. That means that there is only one criterion: the optimization and the evaluation form a single block, and the features are selected to optimize this unique criterion and do not need to be re-evaluated in a later phase. That makes them more efficient, since no validation or test process is needed for every variable subset investigated. However, they are less universal because they are specific to the training process of a given mining algorithm.

• Depending on the feature searching technique:

– Complete: No subsets are missed from evaluation. This involves combinatorial searches.

– Sequential: Features are added (forward searches) or removed (backward searches) one at a time.

– Random: The initial subset, or even subsequent subsets, are randomly chosen to escape local optima.

• Depending on the evaluation technique:

– Distance Measures: Choosing the features that maximize the difference in separability, divergence or discrimination measures.

– Information Measures: Choosing the features that maximize the information gain, that is, minimizing the posterior uncertainty.

– Dependency Measures: Measuring the correlation between features.

– Consistency Measures: Finding a minimum number of features that separate classes as consistently as the full set of features can.

– Predictive Accuracy: Use the selected features to predict the labels.

– Cluster Goodness: Use the selected features to perform clustering and evaluate the result (cluster compactness, scatter separability, maximum likelihood).

The distance, information, correlation and consistency measures are typical of variable ranking algorithms, commonly used in filter models. Predictive accuracy and cluster goodness allow evaluating subsets of features and can be used in wrapper and embedded models.

In this thesis we developed some algorithms following the embedded paradigm, either in the supervised or the unsupervised framework. Integrating the subset selection problem in the overall learning problem may be computationally demanding, but it is appealing from a conceptual viewpoint: there is a perfect match between the formalized goal and the process dedicated to achieving this goal, thus avoiding many problems arising in filter or wrapper methods. Practically, it is however intractable to solve exactly hard selection problems when the number of features exceeds a few tens. Regularization techniques allow providing a sensible approximate answer to the selection problem with reasonable computing resources, and their recent study has demonstrated powerful theoretical and empirical results. The following section introduces the tools that will be employed in Parts II and III.

2.3 Regularization

In the machine learning domain, the term "regularization" refers to a technique that introduces some extra assumptions or knowledge in the resolution of an optimization problem. The most popular point of view presents regularization as a mechanism to prevent overfitting, but it can also help to fix some numerical issues in ill-posed problems (like some matrix singularities when solving a linear system), besides other interesting properties like the capacity to induce sparsity, thus producing models that are easier to interpret.

An ill-posed problem violates the rules defined by Jacques Hadamard, according to whom the solution to a mathematical problem has to exist, be unique and be stable. This happens, for example, when the number of samples is smaller than their dimensionality and we try to infer some generic laws from such a small sample of the population. Regularization transforms an ill-posed problem into a well-posed one. To do that, some a priori knowledge is introduced in the solution through a regularization term that penalizes a criterion J with a penalty P. Below are the two most popular formulations:

min_β J(β) + λ P(β)    (2.1)

min_β J(β)   s.t.   P(β) ≤ t    (2.2)

In expressions (2.1) and (2.2), the parameters λ and t have a similar function, that is, to control the trade-off between fitting the data to the model according to J(β) and the effect of the penalty P(β). The set such that the constraint in (2.2) is verified, {β : P(β) ≤ t}, is called the admissible set. The penalty term can also be understood as a measure that quantifies the complexity of the model (as in the definition of Sammut and Webb, 2010). Note that regularization terms can also be interpreted in the Bayesian paradigm as prior distributions on the parameters of the model. In this thesis both views will be taken.

In this section I review pure, mixed and hybrid penalties that will be used in the following chapters to implement feature selection. I first list important properties that may pertain to any type of penalty.


Figure 2.3: Admissible sets in two dimensions for different pure norms ‖β‖_p

2.3.1 Important Properties

Penalties may have different properties that can be more or less interesting depending on the problem and the expected solution. The most important properties for our purposes here are convexity, sparsity and stability.

Convexity. Regarding optimization, convexity is a desirable property that eases finding global solutions. A convex function verifies

∀(x₁, x₂) ∈ X²,   f(t x₁ + (1 − t) x₂) ≤ t f(x₁) + (1 − t) f(x₂)    (2.3)

for any value of t ∈ [0, 1]. Replacing the inequality by a strict inequality, we obtain the definition of strict convexity. A regularized expression like (2.2) is convex if the function J(β) and the penalty P(β) are both convex.

Sparsity. Null coefficients usually furnish models that are easier to interpret. When sparsity does not harm the quality of the predictions, it is a desirable property, which moreover entails less memory usage and fewer computation resources.

Stability. There are numerous notions of stability or robustness, which measure how the solution varies when the input is perturbed by small changes, such as adding, removing or replacing a few elements in the training set. Regularization, in addition to preventing overfitting, is a means to favor the stability of the solution.

2.3.2 Pure Penalties

For pure penalties, defined as P(β) = ‖β‖_p, convexity holds for p ≥ 1. This is graphically illustrated in Figure 2.3, borrowed from Szafranski (2008), whose Chapter 3 is an excellent review of regularization techniques and of the algorithms to solve them. In this figure, the shape of the admissible sets corresponding to different pure penalties is greyed out. Since the convexity of the penalty corresponds to the convexity of the set, we see that this property is verified for p ≥ 1.

Figure 2.4: Two dimensional regularized problems with ‖β‖₁ and ‖β‖₂ penalties

Regularizing a linear model with a norm like ‖β‖_p means that the larger the component |β_j|, the more important the feature x_j in the estimation. On the contrary, the closer it is to zero, the more dispensable it is. In the limit of |β_j| = 0, x_j is not involved in the model. If many dimensions can be dismissed, then we can speak of sparsity.

A graphical interpretation of sparsity, borrowed from Marie Szafranski, is given in Figure 2.4. In a 2D problem, a solution can be considered as sparse if any of its components (β₁ or β₂) is null, that is, if the optimal β is located on one of the coordinate axes. Let us consider a search algorithm that minimizes an expression like (2.2) where J(β) is a quadratic function. When the solution to the unconstrained problem does not belong to the admissible set defined by P(β) (greyed out area), the solution to the constrained problem is as close as possible to the global minimum of the cost function inside the grey region. Depending on the shape of this region, the probability of having a sparse solution varies. A region with vertexes, as the one corresponding to an L1 penalty, has more chances of inducing sparse solutions than that of an L2 penalty. That idea is displayed in Figure 2.4, where J(β) is a quadratic function represented with three isolevel curves whose global minimum β^ls is outside the penalties' admissible regions. The closest point to this β^ls for the L1 regularization is β^l1, and for the L2 regularization it is β^l2. Solution β^l1 is sparse because its second component is zero, while both components of β^l2 are different from zero.

After reviewing the regions from Figure 2.3, we can relate the capacity of generating sparse solutions to the number and the "sharpness" of the vertexes of the greyed out area. For example, an L_{1/3} penalty has an admissible region with sharper vertexes that would induce a sparse solution even more strongly than an L1 penalty; however, the non-convex shape of the L_{1/3} region results in difficulties during optimization that will not happen with a convex shape.


To summarize, a convex problem with a sparse solution is desired. But with pure penalties, sparsity is only possible with Lp norms with p ≤ 1, because they are the only ones that have vertexes. On the other side, only norms with p ≥ 1 are convex; hence the only pure penalty that builds a convex problem with a sparse solution is the L1 penalty.

L0 Penalties. The L0 pseudo-norm of a vector β is defined as the number of entries different from zero, that is, P(β) = ‖β‖₀ = card{β_j | β_j ≠ 0}:

min_β J(β)   s.t.   ‖β‖₀ ≤ t    (2.4)

where the parameter t represents the maximum number of non-zero coefficients in vector β. The larger the value of t (or the lower the value of λ if we use the equivalent expression in (2.1)), the fewer the number of zeros induced in vector β. If t is equal to the dimensionality of the problem (or if λ = 0), then the penalty term is not effective and β is not altered. In general, the computation of the solutions relies on combinatorial optimization schemes. Their solutions are sparse but unstable.

L1 Penalties. The penalties built using L1 norms induce sparsity and stability. This penalty has been named the Lasso (Least Absolute Shrinkage and Selection Operator) by Tibshirani (1996):

min_β J(β)   s.t.   Σ_{j=1}^p |β_j| ≤ t    (2.5)
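To make the effect of this penalty concrete, here is a minimal numerical sketch (an illustration, not part of the original text) that fits a Lasso in its penalized form with scikit-learn; the data are synthetic and the penalty value is arbitrary, and alpha only plays a role analogous to λ in (2.1), up to scaling:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
n, p = 50, 20
X = rng.randn(n, p)
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -3.0, 1.5]          # only the first 3 features are informative
y = X @ beta_true + 0.1 * rng.randn(n)

# alpha is the penalty strength; its value here is arbitrary
lasso = Lasso(alpha=0.1).fit(X, y)
print("non-zero coefficients:", np.flatnonzero(lasso.coef_))
```

With a well-chosen penalty, only the informative coefficients remain non-zero, which is exactly the sparsity effect discussed above.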

Despite all the advantages of the Lasso, the choice of the right penalty is not just a question of convexity and sparsity. For example, concerning the Lasso, Osborne et al. (2000a) have shown that when the number of examples n is lower than the number of variables p, the maximum number of non-zero entries of β is n. Therefore, if there is a strong correlation between several variables, this penalty risks dismissing all but one of them, resulting in a hardly interpretable model. In a field like genomics, where n is typically some tens of individuals and p several thousands of genes, the performance of the algorithm and the interpretability of the genetic relationships are severely limited.

The Lasso is a popular tool that has been used in multiple contexts besides regression, particularly in the field of feature selection in supervised classification (Mai et al. 2012, Witten and Tibshirani 2011) and clustering (Roth and Lange 2004, Pan et al. 2006, Pan and Shen 2007, Zhou et al. 2009, Guo et al. 2010, Witten and Tibshirani 2010, Bouveyron and Brunet 2012b,a).

The consistency of the problems regularized by a Lasso penalty is also a key feature, defining consistency as the capability of always making the right choice of relevant variables when the number of individuals is infinitely large. Leng et al. (2006) have shown that when the penalty parameter (t or λ, depending on the formulation) is chosen by minimization of the prediction error, the Lasso penalty does not lead to consistent models. There is a large bibliography defining conditions under which Lasso estimators become consistent (Knight and Fu 2000, Donoho et al. 2006, Meinshausen and Buhlmann 2006, Zhao and Yu 2007, Bach 2008). In addition to those papers, some authors have introduced modifications to improve the interpretability and the consistency of the Lasso, such as the adaptive Lasso (Zou 2006).

L2 Penalties. The graphical interpretation of pure norm penalties in Figure 2.3 shows that this norm does not induce sparsity, due to its lack of vertexes. Strictly speaking, the L2 norm involves the square root of the sum of all squared components. In practice, when using L2 penalties, the square of the norm is used to avoid the square root and solve a linear system. Thus, an L2 penalized optimization problem looks like

min_β J(β) + λ ‖β‖₂²    (2.6)

The effect of this penalty is the "equalization" of the components of the parameter vector that is being penalized. To enlighten this property, let us consider a least squares problem:

min_β Σ_{i=1}^n (y_i − x_i⊤β)²    (2.7)

with solution β^ls = (X⊤X)⁻¹X⊤y. If some input variables are highly correlated, the estimator β^ls is very unstable. To fix this numerical instability, Hoerl and Kennard (1970) proposed ridge regression, which regularizes Problem (2.7) with a quadratic penalty:

min_β Σ_{i=1}^n (y_i − x_i⊤β)² + λ Σ_{j=1}^p β_j²

The solution to this problem is β^l2 = (X⊤X + λI_p)⁻¹X⊤y. All eigenvalues, in particular the small ones corresponding to the correlated dimensions, are now moved upwards by λ. This can be enough to avoid the instability induced by small eigenvalues. This "equalization" of the coefficients reduces the variability of the estimation, which may improve performances.
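As a quick illustration (synthetic data and an arbitrary λ, not taken from the thesis), the ridge solution can be computed directly from its closed form and compared with the plain least squares estimate:

```python
import numpy as np

rng = np.random.RandomState(0)
n, p = 30, 5
X = rng.randn(n, p)
X[:, 1] = X[:, 0] + 0.01 * rng.randn(n)      # two strongly correlated columns
y = X @ np.array([1.0, 1.0, 0.5, 0.0, 0.0]) + 0.1 * rng.randn(n)

lam = 1.0                                    # arbitrary penalty parameter
beta_ls = np.linalg.solve(X.T @ X, X.T @ y)                    # unstable: X'X is near singular
beta_l2 = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)  # ridge: eigenvalues shifted by lambda
print(beta_ls)
print(beta_l2)
```

The ridge coefficients of the two correlated columns are pulled towards each other, which is the "equalization" effect described above.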

As with the Lasso operator, there are several variations of ridge regression. For example, Breiman (1995) proposed the nonnegative garrotte, which looks like a ridge regression where each variable is penalized adaptively. To do that, the least squares solution is used to define the penalty parameter attached to each coefficient:

min_β Σ_{i=1}^n (y_i − x_i⊤β)² + λ Σ_{j=1}^p β_j² / (β_j^ls)²    (2.8)

The effect is an elliptic admissible set instead of the ball of ridge regression. Another example is adaptive ridge regression (Grandvalet 1998, Grandvalet and Canu 2002), where the penalty parameter differs on each component: every λ_j is optimized to penalize more or less depending on the influence of β_j in the model.

Although L2 penalized problems are stable, they are not sparse. That makes those models harder to interpret, mainly in high dimensions.

L∞ Penalties. A special case of Lp norms is the infinity norm, defined as ‖x‖_∞ = max(|x₁|, |x₂|, ..., |x_p|). The admissible region for a penalty like ‖β‖_∞ ≤ t is displayed in Figure 2.3: the greyed out region is a square containing all the β vectors whose largest coefficient is less than or equal to the value of the penalty parameter t.

This norm is not commonly used as a regularization term by itself; however, it is frequently combined in mixed penalties, as shown in Section 2.3.4. In addition, in the optimization of penalized problems there exists the concept of dual norms. Dual norms arise in the analysis of estimation bounds and in the design of algorithms that address optimization problems by solving an increasing sequence of small subproblems (working set algorithms). The dual norm plays a direct role in computing optimality conditions of sparse regularized problems. The dual norm ‖β‖* of a norm ‖β‖ is defined as

‖β‖* = max_{w ∈ R^p} β⊤w   s.t.   ‖w‖ ≤ 1

In the case of an Lq norm with q ∈ [1, +∞], the dual norm is the Lr norm such that 1/q + 1/r = 1. For example, the L2 norm is self-dual, and the dual norm of the L1 norm is the L∞ norm. This is one of the reasons why L∞ is so important even if it is not as popular as a penalty itself, because L1 is. An extensive explanation about dual norms and the algorithms that make use of them can be found in Bach et al. (2011).

2.3.3 Hybrid Penalties

There is no reason for using pure penalties in isolation. We can combine them and try to obtain different benefits from each of them. The most popular example is the Elastic net regularization (Zou and Hastie 2005), with the objective of improving the Lasso penalization when n ≤ p. As recalled in Section 2.3.2, when n ≤ p the Lasso penalty can select at most n non-null features. Thus, in situations where there are more relevant variables, the Lasso penalty risks selecting only some of them. To avoid this effect, a combination of L1 and L2 penalties has been proposed. For the least squares example (2.7) from Section 2.3.2, the Elastic net is

min_β Σ_{i=1}^n (y_i − x_i⊤β)² + λ₁ Σ_{j=1}^p |β_j| + λ₂ Σ_{j=1}^p β_j²    (2.9)

The term in λ₁ is a Lasso penalty that induces sparsity in vector β; on the other side, the term in λ₂ is a ridge regression penalty that provides universal strong consistency (De Mol et al. 2009), that is, the asymptotic capability (when n goes to infinity) of always making the right choice of relevant variables.
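As an illustration (synthetic data and arbitrary parameter values; scikit-learn reparameterizes the pair (λ₁, λ₂) of (2.9) through alpha and l1_ratio), a small sketch of the Elastic net in practice:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.RandomState(0)
n, p = 20, 100                                   # more variables than samples
X = rng.randn(n, p)
y = X[:, :10].sum(axis=1) + 0.1 * rng.randn(n)   # a toy signal with 10 relevant variables

# alpha and l1_ratio jointly encode (lambda1, lambda2) of (2.9)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print("selected variables:", np.flatnonzero(enet.coef_))
```

Unlike a pure Lasso, the L2 part allows more than n variables to enter the model when needed.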


2.3.4 Mixed Penalties

Imagine a linear regression problem where each variable is a gene. Depending on the application, several biological processes can be identified by L different groups of genes. Let us denote by G_ℓ the group of genes for the ℓth process and by d_ℓ the number of genes (variables) in each group, for all ℓ ∈ {1, ..., L}. Thus, the dimension of vector β is the sum of the number of genes of every group, dim(β) = Σ_{ℓ=1}^L d_ℓ. Mixed norms are a type of norms that take those groups into consideration. The general expression is shown below:

‖β‖_(r,s) = [ Σ_ℓ ( Σ_{j ∈ G_ℓ} |β_j|^s )^{r/s} ]^{1/r}    (2.10)

The pair (r, s) identifies the norms that are combined: an Ls norm within groups and an Lr norm between groups. The Ls norm penalizes the variables in every group G_ℓ, while the Lr norm penalizes the within-group norms. The pair (r, s) is set so as to induce different properties in the resulting β vector. Note that the outer norm is often weighted to adjust for the different cardinalities of the groups, in order to avoid favoring the selection of the largest groups.
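The following small helper (illustrative only; the group structure and the values are made up) evaluates the mixed norm (2.10) for an arbitrary pair (r, s):

```python
import numpy as np

def mixed_norm(beta, groups, r=1, s=2):
    """Mixed (r, s) norm of beta: an Ls norm within each group, an Lr norm across groups."""
    within = [np.linalg.norm(beta[g], ord=s) for g in groups]
    return np.linalg.norm(within, ord=r)

beta = np.array([0.0, 0.0, 1.5, -2.0, 0.0, 3.0])
groups = [[0, 1], [2, 3], [4, 5]]           # hypothetical grouping of the 6 variables
print(mixed_norm(beta, groups, r=1, s=2))   # the (1,2) case discussed just below
```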

Several combinations are available; the most popular is the norm ‖β‖_(1,2), known as the group-Lasso (Yuan and Lin 2006, Leng 2008, Xie et al. 2008a,b, Meier et al. 2008, Roth and Fischer 2008, Yang et al. 2010, Sanchez Merchante et al. 2012). Figure 2.5 shows the difference between the admissible sets of a pure L1 norm and a mixed L(1,2) norm. Many other mixings are possible, such as ‖β‖_(1,4/3) (Szafranski et al. 2008) or ‖β‖_(1,∞) (Wang and Zhu 2008, Kuan et al. 2010, Vogt and Roth 2010). Modifications of mixed norms have also been proposed, such as the group bridge penalty (Huang et al. 2009), the composite absolute penalties (Zhao et al. 2009), or combinations of mixed and pure norms, such as Lasso and group-Lasso (Friedman et al. 2010, Sprechmann et al. 2010) or group-Lasso and ridge penalty (Ng and Abugharbieh 2011).

2.3.5 Sparsity Considerations

In this chapter I have reviewed several possibilities to induce sparsity in the solution of optimization problems. However, having sparse solutions does not always lead to parsimonious models featurewise. For example, if we have four parameters per feature, we look for solutions where all four parameters are null for non-informative variables.

The Lasso and the other L1 penalties encourage solutions such as the one on the left of Figure 2.6. If the objective is sparsity, then the L1 norm does the job. However, if we aim at feature selection and if the number of parameters per variable exceeds one, this type of sparsity does not target the removal of variables.

To be able to dismiss some features, the sparsity pattern must encourage null values for the same variable across parameters, as shown on the right of Figure 2.6. This can be achieved with mixed penalties that define groups of features. For example, L(1,2) or L(1,∞) mixed norms with the proper definition of groups can induce sparsity patterns such as the one on the right of Figure 2.6, which displays a solution where variables 3, 5 and 8 are removed.

Figure 2.5: Admissible sets for (a) the L1 Lasso and (b) the L(1,2) group-Lasso

Figure 2.6: Sparsity patterns for an example with 8 variables characterized by 4 parameters: (a) L1 induced sparsity; (b) L(1,2) group induced sparsity

2.3.6 Optimization Tools for Regularized Problems

In Caramanis et al. (2012) there is a good collection of mathematical techniques and optimization methods to solve regularized problems. Another good reference is the thesis of Szafranski (2008), which also reviews some techniques, classified in four categories. Those techniques, even if they belong to different categories, can be used separately or combined to produce improved optimization algorithms.

In fact, the algorithm implemented in this thesis is inspired by three of those techniques. It could be defined as an algorithm of "active constraints", implemented following a regularization path that is updated by approaching the cost function with secant hyper-planes. Deeper details are given in the dedicated Chapter 5.

Subgradient Descent. Subgradient descent is a generic optimization method that can be used in penalized settings where the subgradient of the loss function, ∂J(β), and the subgradient of the regularizer, ∂P(β), can be computed efficiently. On the one hand, it is essentially blind to the problem structure. On the other hand, many iterations are needed, so the convergence is slow and the solutions are not sparse. Basically, it is a generalization of the iterative gradient descent algorithm, where the solution vector β^(t+1) is updated proportionally to the negative subgradient of the function at the current point β^(t):

β^(t+1) = β^(t) − α(s + λs′),   where s ∈ ∂J(β^(t)), s′ ∈ ∂P(β^(t))
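A bare-bones sketch of such an update for the Lasso case (squared loss plus λ‖β‖₁), with an arbitrary constant step size; sign(·) is used as a subgradient of the absolute value (illustrative code, not from the thesis):

```python
import numpy as np

def lasso_subgradient_descent(X, y, lam, alpha=1e-3, n_iter=5000):
    """Minimize ||y - X beta||^2 + lam * ||beta||_1 by plain subgradient descent."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        s = -2 * X.T @ (y - X @ beta)       # gradient of the quadratic loss
        s_prime = np.sign(beta)             # a subgradient of the L1 norm
        beta -= alpha * (s + lam * s_prime)
    return beta
```

Consistently with the remark above, the iterates become small but rarely exactly zero, so this scheme does not return sparse solutions.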

Coordinate Descent. Coordinate descent is based on the first order optimality conditions of criterion (2.1). In the case of penalties like the Lasso, setting to zero the first order derivative with respect to coefficient β_j gives

β_j = ( −λ sign(β_j) − ∂J(β)/∂β_j ) / ( 2 Σ_{i=1}^n x_ij² )

In the literature, those algorithms can also be referred to as "iterative thresholding" algorithms, because the optimization can be solved by soft-thresholding in an iterative process. As an example, Fu (1998) implements this technique, initializing every coefficient with the least squares solution β^ls and updating the values using an iterative thresholding algorithm where β_j^(t+1) = S_λ(∂J(β^(t))/∂β_j). The objective function is optimized with respect to one variable at a time, while all others are kept fixed:

S_λ(∂J(β)/∂β_j) =
  ( λ − ∂J(β)/∂β_j ) / ( 2 Σ_{i=1}^n x_ij² )     if ∂J(β)/∂β_j > λ
  ( −λ − ∂J(β)/∂β_j ) / ( 2 Σ_{i=1}^n x_ij² )    if ∂J(β)/∂β_j < −λ
  0                                              if |∂J(β)/∂β_j| ≤ λ    (2.11)

The same principles define "block-coordinate descent" algorithms. In this case, first order derivatives are applied to the equations of a group-Lasso penalty (Yuan and Lin 2006, Wu and Lange 2008).
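As an illustration of rule (2.11) for the squared loss, here is a cyclic coordinate descent loop for the Lasso (variable names and the number of sweeps are arbitrary choices, not from the thesis):

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
    """Minimize ||y - X beta||^2 + lam * ||beta||_1, one coordinate at a time."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)           # the sum_i x_ij^2 appearing in the denominators
    residual = y - X @ beta
    for _ in range(n_sweeps):
        for j in range(p):
            residual += X[:, j] * beta[j]   # partial residual, excluding coordinate j
            zj = X[:, j] @ residual
            beta[j] = soft_threshold(zj, lam / 2.0) / col_sq[j]
            residual -= X[:, j] * beta[j]
    return beta
```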

Active and Inactive Sets. Active set algorithms are also referred to as "active constraints" or "working set" methods. These algorithms define a subset of variables called the "active set", which stores the indices of the variables with non-zero β_j; it is usually denoted A. The complement of the active set is the "inactive set", noted Ā, which contains the indices of the variables whose β_j is zero. Thus the problem can be simplified to the dimensionality of A.

Osborne et al. (2000a) proposed the first of those algorithms to solve quadratic problems with Lasso penalties. Their algorithm starts from an empty active set that is updated incrementally (forward growing). There exists also a backward view, where relevant variables are allowed to leave the active set; however, the forward philosophy, which starts with an empty A, has the advantage that the first calculations are low dimensional. In addition, the forward view fits better the feature selection intuition, where few features are intended to be selected.

Working set algorithms have to deal with three main tasks. There is an optimization task, where a minimization problem has to be solved using only the variables from the active set; Osborne et al. (2000a) solve a linear approximation of the original problem to determine the descent direction of the objective function, but any other method can be considered. In general, as the solutions of successive active sets are typically close to each other, it is a good idea to use the solution of the previous iteration as the initialization of the current one (warm start). Besides the optimization task, there is a working set update task, where the active set A is augmented with the variable from the inactive set Ā that most violates the optimality conditions of Problem (2.1). Finally, there is also a task to compute the optimality conditions. Their expressions are essential in the selection of the next variable to add to the active set and in testing whether a particular vector β is a solution of Problem (2.1).

These active constraints or working set methods, even if they were originally proposed to solve L1 regularized quadratic problems, can also be adapted to generic functions and penalties: for example, linear functions and L1 penalties (Roth 2004), linear functions and L(1,2) penalties (Roth and Fischer 2008), or even logarithmic cost functions and combinations of L0, L1 and L2 penalties (Perkins et al. 2003). The algorithm developed in this work belongs to this family of solutions.
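A schematic outline of a generic forward working-set loop follows (a pseudocode-level sketch, not the GLOSS algorithm described later; the two callbacks stand for the restricted optimization and the optimality-checking tasks discussed above):

```python
import numpy as np

def working_set_loop(X, y, lam, solve_subproblem, optimality_violation, max_vars=50, tol=1e-8):
    """Grow the active set with the worst violator until the optimality conditions hold."""
    p = X.shape[1]
    active = []                     # indices of variables allowed to be non-zero
    beta = np.zeros(p)
    while len(active) < max_vars:
        violations = optimality_violation(X, y, beta, lam)   # one score per variable
        j = int(np.argmax(violations))
        if violations[j] <= tol:    # optimality conditions satisfied: beta is a solution
            break
        active.append(j)
        beta = solve_subproblem(X, y, lam, active, warm_start=beta)
    return beta, active
```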

Hyper-Planes Approximation. Hyper-plane approximations solve a regularized problem using a piecewise linear approximation of the original cost function. This convex approximation is built using several secant hyper-planes at different points, obtained from the subgradient of the cost function at these points.

This family of algorithms implements an iterative mechanism where the number of hyper-planes increases at every iteration. These techniques are useful with large populations, since the number of iterations needed to converge does not depend on the size of the dataset. On the contrary, if few hyper-planes are used, then the quality of the approximation is not good enough and the solution can be unstable.

This family of algorithms is not as popular as the previous one, but some examples can be found in the domain of Support Vector Machines (Joachims 2006, Smola et al. 2008, Franc and Sonnenburg 2008) or Multiple Kernel Learning (Sonnenburg et al. 2006).

Regularization Path. The regularization path is the set of solutions that can be reached when solving a series of optimization problems of the form (2.1) where the penalty parameter λ is varied. It is not an optimization technique per se, but it is of practical use when the exact regularization path can be easily followed. Rosset and Zhu (2007) stated that this path is piecewise linear for those problems where the cost function is piecewise quadratic and the regularization term is piecewise linear (or vice-versa).

This concept was first applied to the Lasso algorithm of Osborne et al. (2000b). However, it was after the publication of the algorithm called Least Angle Regression (LARS), developed by Efron et al. (2004), that those techniques became popular. LARS defines the regularization path using active constraint techniques.

Once an active set A^(t) and its corresponding solution β^(t) have been set, looking for the regularization path means looking for a direction h and a step size γ to update the solution as β^(t+1) = β^(t) + γh. Afterwards, the active and inactive sets A^(t+1) and Ā^(t+1) are updated. That can be done by looking for the variables that most strongly violate the optimality conditions. Hence, LARS sets the update step size and the variable that should enter the active set from the correlation with the residuals.
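For instance (an illustration with toy data, not from the thesis), the whole Lasso regularization path can be traced with scikit-learn's LARS implementation:

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.RandomState(0)
X = rng.randn(40, 10)
y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.randn(40)

# alphas: penalty values at the breakpoints; coefs[:, k]: the solution at breakpoint k
alphas, active, coefs = lars_path(X, y, method="lasso")
print(alphas.shape, coefs.shape)   # the path is piecewise linear between breakpoints
```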

Proximal Methods. Proximal methods optimize an objective function of the form (2.1), resulting from the addition of a Lipschitz-differentiable cost function J(β) and a non-differentiable penalty λP(β):

min_{β ∈ R^p}  J(β^(t)) + ∇J(β^(t))⊤(β − β^(t)) + λ P(β) + (L/2) ‖β − β^(t)‖₂²    (2.12)

They are also iterative methods, where the cost function J(β) is linearized in the proximity of the current solution β^(t), so that the problem to solve at each iteration looks like (2.12), where the parameter L > 0 should be an upper bound on the Lipschitz constant of the gradient ∇J. That can be rewritten as

min_{β ∈ R^p}  (1/2) ‖β − (β^(t) − (1/L) ∇J(β^(t)))‖₂² + (λ/L) P(β)    (2.13)

The basic algorithm makes use of the solution to (2.13) as the next value β^(t+1). However, there are faster versions that take advantage of information about previous steps, such as the ones described by Nesterov (2007) or the FISTA algorithm (Beck and Teboulle 2009). Proximal methods can be seen as generalizations of gradient updates: in fact, setting λ = 0 in equation (2.13), the standard gradient update rule comes up.
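For the Lasso penalty, the solution of (2.13) has a closed form (soft-thresholding), which gives the basic proximal gradient iteration sketched below (an ISTA-like illustration; the iteration count is arbitrary):

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_proximal_gradient(X, y, lam, n_iter=500):
    """Proximal gradient (ISTA-like) for ||y - X beta||^2 + lam * ||beta||_1."""
    L = 2 * np.linalg.norm(X, ord=2) ** 2      # Lipschitz constant of the gradient of J
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = -2 * X.T @ (y - X @ beta)       # gradient of J at the current point
        beta = soft_threshold(beta - grad / L, lam / L)   # prox of (lam/L) * ||.||_1
    return beta
```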


Part II

Sparse Linear Discriminant Analysis


Abstract

Linear discriminant analysis (LDA) aims to describe data by a linear combination of features that best separates the classes. It may be used for classifying future observations or for describing those classes.

There is a vast bibliography about sparse LDA methods, reviewed in Chapter 3. Sparsity is typically induced by regularizing the discriminant vectors or the class means with L1 penalties (see Section 2.3). Section 2.3.5 discussed why this sparsity-inducing penalty may not guarantee parsimonious models regarding variables.

In this part we develop the group-Lasso Optimal Scoring Solver (GLOSS), which addresses a sparse LDA problem globally, through a regression approach of LDA. Our analysis, presented in Chapter 4, formally relates GLOSS to Fisher's discriminant analysis and also enables deriving variants, such as LDA assuming a diagonal within-class covariance structure (Bickel and Levina 2004). The group-Lasso penalty selects the same features in all discriminant directions, leading to a more interpretable low-dimensional representation of data. The discriminant directions can be used in their totality, or the first ones may be chosen to produce a reduced rank classification. The first two or three directions can also be used to project the data to generate a graphical display of the data. The algorithm is detailed in Chapter 5, and our experimental results of Chapter 6 demonstrate that, compared to the competing approaches, the models are extremely parsimonious without compromising prediction performances. The algorithm efficiently processes medium to large numbers of variables and is thus particularly well suited to the analysis of gene expression data.


3 Feature Selection in Fisher Discriminant Analysis

3.1 Fisher Discriminant Analysis

Linear discriminant analysis (LDA) aims to describe n labeled observations belonging to K groups by a linear combination of features which characterizes or separates the classes. It is used for two main purposes: classifying future observations, or describing the essential differences between classes, either by providing a visual representation of the data or by revealing the combinations of features that discriminate between classes. There are several frameworks in which linear combinations can be derived; Friedman et al. (2009) dedicate a whole chapter to linear methods for classification. In this part we focus on Fisher's discriminant analysis, which is a standard tool for linear discriminant analysis whose formulation does not rely on posterior probabilities but rather on some inertia principles (Fisher, 1936).

We consider that the data consist of a set of n examples, with observations x_i ∈ R^p comprising p features, and labels y_i ∈ {0, 1}^K indicating the exclusive assignment of observation x_i to one of the K classes. It will be convenient to gather the observations in the n×p matrix X = (x_1^⊤, …, x_n^⊤)^⊤ and the corresponding labels in the n×K matrix Y = (y_1^⊤, …, y_n^⊤)^⊤.

Fisher's discriminant problem was first proposed for two-class problems, for the analysis of the famous iris dataset, as the maximization of the ratio of the projected between-class covariance to the projected within-class covariance:

max_{β∈R^p}  (β^⊤Σ_B β) / (β^⊤Σ_W β)        (3.1)

where β is the discriminant direction used to project the data, and Σ_B and Σ_W are the p×p between-class covariance and within-class covariance matrices respectively, defined (for a K-class problem) as

Σ_W = (1/n) Σ_{k=1}^K Σ_{i∈G_k} (x_i − μ_k)(x_i − μ_k)^⊤

Σ_B = (1/n) Σ_{k=1}^K Σ_{i∈G_k} (μ − μ_k)(μ − μ_k)^⊤

where μ is the sample mean of the whole dataset, μ_k the sample mean of class k, and G_k indexes the observations of class k.


This analysis can be extended to the multi-class framework with K groups. In this case, K − 1 discriminant vectors β_k may be computed. Such a generalization was first proposed by Rao (1948). Several formulations of the multi-class Fisher's discriminant are available, for example as the maximization of a trace ratio:

max_{B∈R^{p×(K−1)}}  tr(B^⊤Σ_B B) / tr(B^⊤Σ_W B)        (3.2)

where the B matrix is built with the discriminant directions β_k as columns. Solving the multi-class criterion (3.2) is an ill-posed problem; a better formulation is based on a series of K − 1 subproblems:

max_{β_k∈R^p}  β_k^⊤Σ_B β_k
s.t.  β_k^⊤Σ_W β_k ≤ 1,
      β_k^⊤Σ_W β_ℓ = 0, ∀ℓ < k.        (3.3)

The maximizer of subproblem k is the eigenvector of Σ_W^{−1}Σ_B associated with the kth largest eigenvalue (see Appendix C).

3.2 Feature Selection in LDA Problems

LDA is often used as a data reduction technique, where the K − 1 discriminant directions summarize the p original variables. However, all variables intervene in the definition of these discriminant directions, and this behavior may be troublesome.

Several modifications of LDA have been proposed to generate sparse discriminant directions. Sparse LDA reveals discriminant directions that only involve a few variables. The main target of this sparsity is to reduce the dimensionality of the problem (as in genetic analysis), but parsimonious classification is also motivated by the need for interpretable models, robustness in the solution, or computational constraints.

The easiest approach to sparse LDA performs variable selection before discrimination. The relevancy of each feature is usually based on univariate statistics, which are fast and convenient to compute, but whose very partial view of the overall classification problem may lead to dramatic information loss. As a result, several approaches have been devised in recent years to construct LDA with wrapper and embedded feature selection capabilities.

They can be categorized according to the LDA formulation that provides the basis to the sparsity-inducing extension, that is, either Fisher's Discriminant Analysis (variance-based) or regression-based.

3.2.1 Inertia Based

The Fisher discriminant seeks a projection maximizing the separability of classes from inertia principles: mass centers should be far away (large between-class variance) and classes should be concentrated around their mass centers (small within-class variance). This view motivates a first series of sparse LDA formulations.

Moghaddam et al. (2006) propose an algorithm for sparse LDA in binary classification, where sparsity originates in a hard cardinality constraint. The formalization is based on Fisher's discriminant (3.1), reformulated as a quadratically-constrained quadratic program (3.3). Computationally, the algorithm implements a combinatorial search, with some eigenvalue properties that are used to avoid exploring subsets of possible solutions. Extensions of this approach have been developed, with new sparsity bounds for the two-class discrimination problem and shortcuts to speed up the evaluation of eigenvalues (Moghaddam et al., 2007).

Also for binary problems, Wu et al. (2009) proposed a sparse LDA applied to gene expression data, where Fisher's discriminant (3.1) is solved as

min_{β∈R^p}  β^⊤Σ_W β
s.t.  (μ_1 − μ_2)^⊤β = 1,
      Σ_{j=1}^p |β_j| ≤ t,

where μ_1 and μ_2 are the vectors of mean gene expression values corresponding to the two groups. The expression to optimize and the first constraint match problem (3.1). The second constraint encourages parsimony.

Witten and Tibshirani (2011) describe a multi-class technique using Fisher's discriminant, rewritten in the form of K − 1 constrained and penalized maximization problems:

max_{β_k∈R^p}  β_k^⊤Σ_B^k β_k − P_k(β_k)
s.t.  β_k^⊤Σ_W β_k ≤ 1.

The term to maximize is the projected between-class covariance β_k^⊤Σ_B β_k, subject to an upper bound on the projected within-class covariance β_k^⊤Σ_W β_k. The penalty P_k(β_k) is added to avoid singularities and to induce sparsity. The authors suggest weighted versions of regular Lasso and fused Lasso penalties for general purpose data. The Lasso shrinks to zero the less informative variables, and the fused Lasso encourages a piecewise constant β_k vector. The R code is available from the website of Daniela Witten.

Cai and Liu (2011) use Fisher's discriminant to solve a binary LDA problem, but, instead of performing separate estimations of Σ_W and (μ_1 − μ_2) to obtain the optimal solution β = Σ_W^{−1}(μ_1 − μ_2), they estimate the product directly through constrained L1 minimization:

min_{β∈R^p}  ‖β‖₁
s.t.  ‖ Σβ − (μ_1 − μ_2) ‖_∞ ≤ λ.

Sparsity is encouraged by the L1 norm of the vector β, and the parameter λ is used to tune the optimization.


Most of the algorithms reviewed are conceived for binary classification. And for those that are envisaged for multi-class scenarios, the Lasso is the most popular way to induce sparsity; however, as we discussed in Section 2.3.5, the Lasso is not the best tool to encourage parsimonious models when there are multiple discriminant directions.

3.2.2 Regression Based

In binary classification, LDA has been known to be equivalent to linear regression of scaled class labels since Fisher (1936). For K > 2, many studies show that multivariate linear regression of a specific class indicator matrix can be applied as a preprocessing step for LDA. However, directly casting LDA as a least squares problem is challenging for the multi-class case (Duda et al., 2000; Friedman et al., 2009).

Predefined Indicator Matrix

Multi-class classification is usually linked with linear regression through the definition of an indicator matrix (Friedman et al., 2009). An indicator matrix Y is an n×K matrix with the class labels for all samples. There are several well-known types in the literature. For example, the binary or dummy indicator (y_ik = 1 if sample i belongs to class k and y_ik = 0 otherwise) is commonly used in linking multi-class classification with linear regression (Friedman et al., 2009). Another "popular" choice is y_ik = 1 if sample i belongs to class k and y_ik = −1/(K−1) otherwise. It was used, for example, in extending Support Vector Machines to multi-class classification (Lee et al., 2004) or for generalizing the kernel target alignment measure (Guermeur et al., 2004).

There are some efforts which propose a formulation of the least squares problem based on a new class indicator matrix (Ye, 2007). This new indicator matrix allows the definition of LS-LDA (Least Squares Linear Discriminant Analysis), which holds a rigorous equivalence with multi-class LDA under a mild condition, shown empirically to hold in many applications involving high-dimensional data.

Qiao et al. (2009) propose a discriminant analysis in the high-dimensional, low-sample setting, which incorporates variable selection in a Fisher's LDA formulated as a generalized eigenvalue problem, which is then recast as a least squares regression. Sparsity is obtained by means of a Lasso penalty on the discriminant vectors. Even if this is not mentioned in the article, their formulation looks very close in spirit to Optimal Scoring regression. Some rather clumsy steps in the developments hinder the comparison, so that further investigations are required. The lack of publicly available code also restrained an empirical test of this conjecture. If the similitude is confirmed, their formalization would be very close to the one of Clemmensen et al. (2011), reviewed in the following section.

In a recent paper, Mai et al. (2012) take advantage of the equivalence between ordinary least squares and LDA problems to propose a binary classifier solving a penalized least squares problem with a Lasso penalty. The sparse version of the projection vector β is obtained by solving

min_{β∈R^p, β_0∈R}  n^{−1} Σ_{i=1}^n (y_i − β_0 − x_i^⊤β)² + λ Σ_{j=1}^p |β_j|,

where y_i is the binary indicator of the label of pattern x_i. Even if the authors focus on the Lasso penalty, they also suggest any other generic sparsity-inducing penalty. The decision rule x^⊤β + β_0 > 0 is the LDA classifier when it is built using the resulting β vector for λ = 0, but a different intercept β_0 is required.

Optimal Scoring

In binary classification, the regression of (scaled) class indicators enables to recover exactly the LDA discriminant direction. For more than two classes, regressing predefined indicator matrices may be impaired by the masking effect, where the scores assigned to a class situated between two other ones never dominate (Hastie et al., 1994). Optimal scoring (OS) circumvents the problem by assigning "optimal scores" to the classes. This route was opened by Fisher (1936) for binary classification and pursued for more than two classes by Breiman and Ihaka (1984), in the aim of developing a non-linear extension of discriminant analysis based on additive models. They named their approach optimal scaling, for it optimizes the scaling of the indicators of classes together with the discriminant functions. Their approach was later disseminated under the name optimal scoring by Hastie et al. (1994), who proposed several extensions of LDA, either aiming at constructing more flexible discriminants (Hastie and Tibshirani, 1996) or more conservative ones (Hastie et al., 1995).

As an alternative method to solve LDA problems, Hastie et al. (1995) proposed to incorporate a smoothness prior on the discriminant directions in the OS problem, through a positive-definite penalty matrix Ω, leading to a problem expressed in compact form as

min_{Θ, B}  ‖YΘ − XB‖²_F + λ tr(B^⊤ΩB)        (3.4a)
s.t.  n^{−1} Θ^⊤Y^⊤YΘ = I_{K−1},        (3.4b)

where Θ ∈ R^{K×(K−1)} are the class scores, B ∈ R^{p×(K−1)} are the regression coefficients and ‖·‖_F is the Frobenius norm. This compact form does not render the order that arises naturally when considering the following series of K − 1 problems:

min_{θ_k∈R^K, β_k∈R^p}  ‖Yθ_k − Xβ_k‖² + β_k^⊤Ωβ_k        (3.5a)
s.t.  n^{−1} θ_k^⊤Y^⊤Yθ_k = 1,        (3.5b)
      θ_k^⊤Y^⊤Yθ_ℓ = 0, ℓ = 1, …, k − 1,        (3.5c)

where each β_k corresponds to a discriminant direction.


Several sparse LDA have been derived by introducing non-quadratic sparsity-inducing penalties in the OS regression problem (Ghosh and Chinnaiyan, 2005; Leng, 2008; Grosenick et al., 2008; Clemmensen et al., 2011). Grosenick et al. (2008) proposed a variant of the lasso-based penalized OS of Ghosh and Chinnaiyan (2005) by introducing an elastic-net penalty in binary class problems. A generalization to multi-class problems was suggested by Clemmensen et al. (2011), where the objective function (3.5a) is replaced by

min_{β_k∈R^p, θ_k∈R^K}  ‖Yθ_k − Xβ_k‖²₂ + λ_1 ‖β_k‖₁ + λ_2 β_k^⊤Ωβ_k,

where λ_1 and λ_2 are regularization parameters and Ω is a penalization matrix, often taken to be the identity for the elastic net. The code for SLDA is available from the website of Line Clemmensen.

Another generalization of the work of Ghosh and Chinnaiyan (2005) was proposed by Leng (2008), with an extension to the multi-class framework based on a group-Lasso penalty in the objective function (3.5a):

min_{β_k∈R^p, θ_k∈R^K}  Σ_{k=1}^{K−1} ‖Yθ_k − Xβ_k‖²₂ + λ Σ_{j=1}^p ( Σ_{k=1}^{K−1} β_{kj}² )^{1/2},        (3.6)

which is the criterion that was chosen in this thesis.
The following chapters present our theoretical and algorithmic contributions regarding this formulation. The proposal of Leng (2008) was heuristically driven, and his algorithm followed closely the group-Lasso algorithm of Yuan and Lin (2006), which is not very efficient (the experiments of Leng (2008) are limited to small data sets with hundreds of examples and 1000 preselected genes, and no code is provided). Here, we formally link (3.6) to penalized LDA and propose a publicly available efficient code for solving this problem.


4 Formalizing the Objective

In this chapter, we detail the rationale supporting the Group-Lasso Optimal Scoring Solver (GLOSS) algorithm. GLOSS addresses a sparse LDA problem globally, through a regression approach. Our analysis formally relates GLOSS to Fisher's discriminant analysis and also enables to derive variants, such as LDA assuming a diagonal within-class covariance structure (Bickel and Levina, 2004).

The sparsity arises from the group-Lasso penalty (3.6), due to Leng (2008), that selects the same features in all discriminant directions, thus providing an interpretable low-dimensional representation of the data. For K classes, this representation can be either complete, in dimension (K−1), or partial for a reduced-rank classification. The first two or three discriminants can also be used to display a graphical summary of the data.

The derivation of penalized LDA as a penalized optimal scoring regression is quite tedious, but it is required here since the algorithm hinges on this equivalence. The main lines have been derived in several places (Breiman and Ihaka, 1984; Hastie et al., 1994; Hastie and Tibshirani, 1996; Hastie et al., 1995) and already used before for sparsity-inducing penalties (Roth and Lange, 2004). However, the published demonstrations were quite elusive on a number of points, leading to generalizations that were not supported in a rigorous way. To our knowledge, we disclosed the first formal equivalence between the optimal scoring regression problem penalized by group-Lasso and penalized LDA (Sanchez Merchante et al., 2012).

4.1 From Optimal Scoring to Linear Discriminant Analysis

Following Hastie et al. (1995), we now show the equivalence between the series of problems encountered in penalized optimal scoring (p-OS) problems and in penalized LDA (p-LDA) problems, by going through canonical correlation analysis. We first provide some properties about the solutions of an arbitrary problem in the p-OS series (3.5).

Throughout this chapter we assume that:

• there is no empty class, that is, the diagonal matrix Y^⊤Y is full rank;

• inputs are centered, that is, X^⊤1_n = 0;

• the quadratic penalty Ω is positive-semidefinite and such that X^⊤X + Ω is full rank.


4.1.1 Penalized Optimal Scoring Problem

For the sake of simplicity, we now drop subscript k to refer to any problem in the p-OS series (3.5). First, note that Problems (3.5) are biconvex in (θ, β), that is, convex in θ for each β value and vice-versa. The problems are however non-convex. In particular, if (θ, β) is a solution, then (−θ, −β) is also a solution.

The orthogonality constraint (3.5c) inherently limits the number of possible problems in the series to K, since we assumed that there are no empty classes. Moreover, as X is centered, the K−1 first optimal scores are orthogonal to 1 (and the Kth problem would be solved by β_K = 0). All the problems considered here can be solved by a singular value decomposition of a real symmetric matrix, so that the orthogonality constraints are easily dealt with. Hence, in the sequel, we do not mention anymore these orthogonality constraints (3.5c), which apply along the route, so as to simplify all expressions. The generic problem solved is thus

min_{θ∈R^K, β∈R^p}  ‖Yθ − Xβ‖² + β^⊤Ωβ        (4.1a)
s.t.  n^{−1} θ^⊤Y^⊤Yθ = 1.        (4.1b)

For a given score vector θ, the discriminant direction β that minimizes the p-OS criterion (4.1) is the penalized least squares estimator

β_os = ( X^⊤X + Ω )^{−1} X^⊤Yθ.        (4.2)

The objective function (4.1a) is then

‖Yθ − Xβ_os‖² + β_os^⊤Ωβ_os = θ^⊤Y^⊤Yθ − 2θ^⊤Y^⊤Xβ_os + β_os^⊤( X^⊤X + Ω )β_os
                            = θ^⊤Y^⊤Yθ − θ^⊤Y^⊤X( X^⊤X + Ω )^{−1}X^⊤Yθ,

where the second line stems from the definition of β_os (4.2). Now, using the fact that the optimal θ obeys constraint (4.1b), the optimization problem is equivalent to

max_{θ: n^{−1}θ^⊤Y^⊤Yθ=1}  θ^⊤Y^⊤X( X^⊤X + Ω )^{−1}X^⊤Yθ,        (4.3)

which shows that the optimization of the p-OS problem with respect to θ_k boils down to finding the kth largest eigenvector of Y^⊤X( X^⊤X + Ω )^{−1}X^⊤Y. Indeed, Appendix C details that Problem (4.3) is solved by

(Y^⊤Y)^{−1}Y^⊤X( X^⊤X + Ω )^{−1}X^⊤Yθ = α²θ,        (4.4)


where α² is the maximal eigenvalue:¹

n^{−1}θ^⊤Y^⊤X( X^⊤X + Ω )^{−1}X^⊤Yθ = α² n^{−1}θ^⊤(Y^⊤Y)θ
n^{−1}θ^⊤Y^⊤X( X^⊤X + Ω )^{−1}X^⊤Yθ = α².        (4.5)

4.1.2 Penalized Canonical Correlation Analysis

As per Hastie et al. (1995), the penalized Canonical Correlation Analysis (p-CCA) problem between variables X and Y is defined as follows:

max_{θ∈R^K, β∈R^p}  n^{−1} θ^⊤Y^⊤Xβ        (4.6a)
s.t.  n^{−1} θ^⊤Y^⊤Yθ = 1,        (4.6b)
      n^{−1} β^⊤( X^⊤X + Ω )β = 1.        (4.6c)

The solutions to (4.6) are obtained by finding saddle points of the Lagrangian:

nL(β, θ, ν, γ) = θ^⊤Y^⊤Xβ − ν(θ^⊤Y^⊤Yθ − n) − γ(β^⊤(X^⊤X + Ω)β − n)
⇒ n ∂L(β, θ, γ, ν)/∂β = X^⊤Yθ − 2γ(X^⊤X + Ω)β
⇒ β_cca = (1/(2γ)) (X^⊤X + Ω)^{−1}X^⊤Yθ.

Then, as β_cca obeys (4.6c), we obtain

β_cca = (X^⊤X + Ω)^{−1}X^⊤Yθ / ( n^{−1}θ^⊤Y^⊤X(X^⊤X + Ω)^{−1}X^⊤Yθ )^{1/2},        (4.7)

so that the optimal objective function (4.6a) can be expressed with θ alone:

n^{−1}θ^⊤Y^⊤Xβ_cca = n^{−1}θ^⊤Y^⊤X(X^⊤X + Ω)^{−1}X^⊤Yθ / ( n^{−1}θ^⊤Y^⊤X(X^⊤X + Ω)^{−1}X^⊤Yθ )^{1/2}
                   = ( n^{−1}θ^⊤Y^⊤X(X^⊤X + Ω)^{−1}X^⊤Yθ )^{1/2},

and the optimization problem with respect to θ can be restated as

max_{θ: n^{−1}θ^⊤Y^⊤Yθ=1}  θ^⊤Y^⊤X( X^⊤X + Ω )^{−1}X^⊤Yθ.        (4.8)

Hence the p-OS and p-CCA problems produce the same optimal score vectors θ. The regression coefficients are thus proportional, as shown by (4.2) and (4.7):

β_os = α β_cca,        (4.9)

¹The awkward notation α² for the eigenvalue was chosen here to ease comparison with Hastie et al. (1995). It is easy to check that this eigenvalue is indeed non-negative (see Equation (4.5) for example).


where α is defined by (4.5).
The p-CCA optimization problem can also be written as a function of β alone, using the optimality conditions for θ:

n ∂L(β, θ, γ, ν)/∂θ = Y^⊤Xβ − 2νY^⊤Yθ
⇒ θ_cca = (1/(2ν)) (Y^⊤Y)^{−1}Y^⊤Xβ.        (4.10)

Then, as θ_cca obeys (4.6b), we obtain

θ_cca = (Y^⊤Y)^{−1}Y^⊤Xβ / ( n^{−1}β^⊤X^⊤Y(Y^⊤Y)^{−1}Y^⊤Xβ )^{1/2},        (4.11)

leading to the following expression of the optimal objective function:

n^{−1}θ_cca^⊤Y^⊤Xβ = n^{−1}β^⊤X^⊤Y(Y^⊤Y)^{−1}Y^⊤Xβ / ( n^{−1}β^⊤X^⊤Y(Y^⊤Y)^{−1}Y^⊤Xβ )^{1/2}
                   = ( n^{−1}β^⊤X^⊤Y(Y^⊤Y)^{−1}Y^⊤Xβ )^{1/2},

The p-CCA problem can thus be solved with respect to β by plugging this value in (4.6):

max_{β∈R^p}  n^{−1} β^⊤X^⊤Y(Y^⊤Y)^{−1}Y^⊤Xβ        (4.12a)
s.t.  n^{−1} β^⊤( X^⊤X + Ω )β = 1,        (4.12b)

where the positive objective function has been squared compared to (4.6). This formulation is important since it will be used to link p-CCA to p-LDA. We thus derive its solution: following the reasoning of Appendix C, β_cca verifies

n^{−1} X^⊤Y(Y^⊤Y)^{−1}Y^⊤Xβ_cca = λ ( X^⊤X + Ω )β_cca,        (4.13)

where λ is the maximal eigenvalue, shown below to be equal to α²:

n^{−1}β_cca^⊤X^⊤Y(Y^⊤Y)^{−1}Y^⊤Xβ_cca = λ
⇒ n^{−1}α^{−1}β_cca^⊤X^⊤Y(Y^⊤Y)^{−1}Y^⊤X(X^⊤X + Ω)^{−1}X^⊤Yθ = λ
⇒ n^{−1}αβ_cca^⊤X^⊤Yθ = λ
⇒ n^{−1}θ^⊤Y^⊤X(X^⊤X + Ω)^{−1}X^⊤Yθ = λ
⇒ α² = λ.

The first line is obtained by obeying constraint (4.12b), the second line by the relationship (4.7), where the denominator is α; the third line comes from (4.4); the fourth line uses again the relationship (4.7), and the last one the definition of α (4.5).


4.1.3 Penalized Linear Discriminant Analysis

Still following Hastie et al. (1995), the penalized Linear Discriminant Analysis is defined as follows:

max_{β∈R^p}  β^⊤Σ_Bβ        (4.14a)
s.t.  β^⊤( Σ_W + n^{−1}Ω )β = 1,        (4.14b)

where Σ_B and Σ_W are respectively the sample between-class and within-class covariance matrices of the original p-dimensional data. This problem may be solved by an eigenvector decomposition, as detailed in Appendix C.

As the feature matrix X is assumed to be centered, the sample total, between-class and within-class covariance matrices can be written in a simple form that is amenable to a simple matrix representation using the projection operator Y(Y^⊤Y)^{−1}Y^⊤:

Σ_T = (1/n) Σ_{i=1}^n x_i x_i^⊤ = n^{−1}X^⊤X

Σ_B = (1/n) Σ_{k=1}^K n_k μ_k μ_k^⊤ = n^{−1}X^⊤Y(Y^⊤Y)^{−1}Y^⊤X

Σ_W = (1/n) Σ_{k=1}^K Σ_{i: y_ik=1} (x_i − μ_k)(x_i − μ_k)^⊤ = n^{−1}( X^⊤X − X^⊤Y(Y^⊤Y)^{−1}Y^⊤X ).

Using these formulae, the solution to the p-LDA problem (4.14) is obtained as

X^⊤Y(Y^⊤Y)^{−1}Y^⊤Xβ_lda = λ( X^⊤X + Ω − X^⊤Y(Y^⊤Y)^{−1}Y^⊤X )β_lda
X^⊤Y(Y^⊤Y)^{−1}Y^⊤Xβ_lda = (λ/(1 − λ)) ( X^⊤X + Ω )β_lda.

The comparison of the last equation with β_cca (4.13) shows that β_lda and β_cca are proportional, and that λ/(1 − λ) = α². Using constraints (4.12b) and (4.14b), it comes that

β_lda = (1 − α²)^{−1/2} β_cca
      = α^{−1}(1 − α²)^{−1/2} β_os,

which ends the path from p-OS to p-LDA.


4.1.4 Summary

The three previous subsections considered a generic form of the kth problem in the p-OS series. The relationships unveiled above also hold for the compact notation gathering all problems (3.4), which is recalled below:

min_{Θ, B}  ‖YΘ − XB‖²_F + λ tr(B^⊤ΩB)
s.t.  n^{−1} Θ^⊤Y^⊤YΘ = I_{K−1}.

Let A represent the (K−1)×(K−1) diagonal matrix with elements α_k, α_k being the square root of the kth largest eigenvalue of Y^⊤X( X^⊤X + Ω )^{−1}X^⊤Y; we have

B_LDA = B_CCA ( I_{K−1} − A² )^{−1/2}
      = B_OS A^{−1}( I_{K−1} − A² )^{−1/2},        (4.15)

where I_{K−1} is the (K−1)×(K−1) identity matrix.
At this point, the feature matrix X, which in the input space has dimensions n×p, can be projected into the optimal scoring domain as an n×(K−1) matrix X_OS = XB_OS, or into the linear discriminant analysis space as an n×(K−1) matrix X_LDA = XB_LDA. Classification can be performed in any of those domains if the appropriate distance (penalized within-class covariance matrix) is applied.

With the aim of performing classification, the whole process can be summarized as follows:

1. Solve the p-OS problem as B_OS = ( X^⊤X + λΩ )^{−1}X^⊤YΘ, where Θ are the K−1 leading eigenvectors of Y^⊤X( X^⊤X + λΩ )^{−1}X^⊤Y.

2. Translate the data samples X into the LDA domain as X_LDA = XB_OS D, where D = A^{−1}( I_{K−1} − A² )^{−1/2}.

3. Compute the matrix M of centroids μ_k from X_LDA and Y.

4. Evaluate the distances d(x, μ_k) in the LDA domain as a function of M and X_LDA.

5. Translate distances into posterior probabilities and assign every sample i to a class k following the maximum a posteriori rule.

6. Produce a graphical representation.

A sketch of steps 1–5 is given below.
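The following Python/NumPy/SciPy sketch walks through steps 1–5 for a plain quadratic penalty Ω (the identity by default). It is an illustration of the pipeline under the assumption λ > 0 (so that the eigenvalues α_k² stay below one), not the GLOSS matlab implementation; the eigen-analysis is done directly as a small generalized eigenproblem for readability.

import numpy as np
from scipy.linalg import eigh

def penalized_os_lda(X, Y, lam, Omega=None):
    """Steps 1-2: solve the quadratically penalized OS problem and map data to the LDA domain."""
    n, p = X.shape
    K = Y.shape[1]
    Omega = np.eye(p) if Omega is None else Omega
    X = X - X.mean(axis=0)                              # inputs are assumed centered
    A_ = np.linalg.solve(X.T @ X + lam * Omega, X.T @ Y)   # (X'X + lam*Omega)^{-1} X'Y
    M = Y.T @ X @ A_                                    # Y'X (X'X + lam*Omega)^{-1} X'Y  (K x K)
    vals, Theta = eigh(M, Y.T @ Y / n)                  # scores satisfy n^{-1} Theta'Y'Y Theta = I
    order = np.argsort(vals)[::-1][:K - 1]
    Theta = Theta[:, order]
    alpha2 = vals[order] / n                            # eigenvalues alpha_k^2
    B_os = A_ @ Theta                                   # step 1: regression coefficients
    D = np.diag(1.0 / (np.sqrt(alpha2) * np.sqrt(1.0 - alpha2)))
    X_lda = X @ B_os @ D                                # step 2: discriminant variates
    return B_os, D, X_lda

def nearest_centroid(X_lda, Y, x_lda_new):
    """Steps 3-5: centroids, Euclidean distances with prior correction, MAP assignment."""
    nk = Y.sum(axis=0)
    centroids = (Y.T @ X_lda) / nk[:, None]             # class centroids mu_k in the LDA domain
    d2 = ((x_lda_new[None, :] - centroids) ** 2).sum(axis=1) - 2.0 * np.log(nk / nk.sum())
    return int(np.argmin(d2))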


The solution of the penalized optimal scoring regression and the computation of the distance and posterior matrices are detailed in Section 4.2.1, Section 4.2.2 and Section 4.2.3 respectively.

4.2 Practicalities

4.2.1 Solution of the Penalized Optimal Scoring Regression

Following Hastie et al. (1994) and Hastie et al. (1995), a quadratically penalized LDA problem can be presented as a quadratically penalized OS problem:

min_{Θ∈R^{K×(K−1)}, B∈R^{p×(K−1)}}  ‖YΘ − XB‖²_F + λ tr(B^⊤ΩB)        (4.16a)
s.t.  n^{−1} Θ^⊤Y^⊤YΘ = I_{K−1},        (4.16b)

where Θ are the class scores, B the regression coefficients and ‖·‖_F is the Frobenius norm.

Though non-convex, the OS problem is readily solved by a decomposition in Θ and B: the optimal B_OS does not intervene in the optimality conditions with respect to Θ, and the optimization with respect to B is obtained in closed form as a linear combination of the optimal scores Θ (Hastie et al., 1995). The algorithm may seem a bit tortuous considering the properties mentioned above, as it proceeds in four steps:

1. Initialize Θ to Θ⁰ such that n^{−1} Θ⁰^⊤Y^⊤YΘ⁰ = I_{K−1}.

2. Compute B = ( X^⊤X + λΩ )^{−1}X^⊤YΘ⁰.

3. Set Θ to be the K−1 leading eigenvectors of Y^⊤X( X^⊤X + λΩ )^{−1}X^⊤Y.

4. Compute the optimal regression coefficients

   B_OS = ( X^⊤X + λΩ )^{−1}X^⊤YΘ.        (4.17)

Defining Θ⁰ in Step 1, instead of using directly Θ as expressed in Step 3, drastically reduces the computational burden of the eigen-analysis: the latter is performed on Θ⁰^⊤Y^⊤X( X^⊤X + λΩ )^{−1}X^⊤YΘ⁰, which is computed as Θ⁰^⊤Y^⊤XB, thus avoiding a costly matrix inversion. The solution of the penalized optimal scoring as an eigenvector decomposition is detailed and justified in Appendix B, and a sketch of these four steps is given below.

This four-step algorithm is valid when the penalty is of the form B^⊤ΩB. However, when an L1 penalty is applied in (4.16), the optimization algorithm requires iterative updates of B and Θ. That situation is developed by Clemmensen et al. (2011), where a Lasso or an Elastic net penalty is used to induce sparsity in the OS problem. Furthermore, these Lasso and Elastic net penalties do not enjoy the equivalence with LDA problems.

4.2.2 Distance Evaluation

The simplest classification rule is the nearest centroid rule, where sample x_i is assigned to class k if x_i is closer (in terms of the shared within-class Mahalanobis distance) to centroid μ_k than to any other centroid μ_ℓ. In general, the parameters of the model are unknown and the rule is applied with the parameters estimated from training data (sample estimators μ_k and Σ_W). If μ_k are the centroids in the input space, sample x_i is assigned to class k if the distance

d(x_i, μ_k) = (x_i − μ_k)^⊤ Σ_{WΩ}^{−1} (x_i − μ_k) − 2 log(n_k/n)        (4.18)

is minimized among all k. In expression (4.18), the first term is the Mahalanobis distance in the input space, and the second term is an adjustment for unequal class sizes that estimates the prior probability of class k. Note that this is inspired by the Gaussian view of LDA, and that another definition of the adjustment term could be used (Friedman et al., 2009; Mai et al., 2012). The matrix Σ_{WΩ} used in (4.18) is the penalized within-class covariance matrix, which can be decomposed in a penalized and a non-penalized component:

Σ_{WΩ}^{−1} = ( n^{−1}(X^⊤X + λΩ) − Σ_B )^{−1}
            = ( n^{−1}X^⊤X − Σ_B + n^{−1}λΩ )^{−1}
            = ( Σ_W + n^{−1}λΩ )^{−1}.        (4.19)

Before explaining how to compute the distances, let us summarize some clarifying points:

• The solution B_OS of the p-OS problem is enough to accomplish classification.

• In the LDA domain (space of discriminant variates X_LDA), classification is based on Euclidean distances.

• Classification can be done in a reduced-rank space of dimension R < K−1 by using the first R discriminant directions {β_k}_{k=1}^R.

As a result, the expression of the distance (4.18) depends on the domain where the classification is performed. If we classify in the p-OS domain:

‖(x_i − μ_k)B_OS‖²_{Σ_{WΩ}} − 2 log(π_k),

where π_k is the estimated class prior and ‖·‖_S is the Mahalanobis distance assuming within-class covariance S. If classification is done in the p-LDA domain:

‖(x_i − μ_k)B_OS A^{−1}( I_{K−1} − A² )^{−1/2}‖²₂ − 2 log(π_k),

which is a plain Euclidean distance.


4.2.3 Posterior Probability Evaluation

Let d(x, μ_k) be a distance between x and μ_k defined as in (4.18); under the assumption that classes are Gaussian, the posterior probabilities p(y_k = 1|x) can be estimated as

p(y_k = 1|x) ∝ exp( −d(x, μ_k)/2 )
             ∝ π_k exp( −(1/2) ‖(x_i − μ_k)B_OS A^{−1}( I_{K−1} − A² )^{−1/2}‖²₂ ).        (4.20)

Those probabilities must be normalized to ensure that they sum to one. When the distances d(x, μ_k) take large values, exp(−d(x, μ_k)/2) can take extremely small values, generating underflow issues. A classical trick to fix this numerical issue is detailed below:

p(y_k = 1|x) = π_k exp( −d(x, μ_k)/2 ) / Σ_ℓ π_ℓ exp( −d(x, μ_ℓ)/2 )
             = π_k exp( (−d(x, μ_k) + d_max)/2 ) / Σ_ℓ π_ℓ exp( (−d(x, μ_ℓ) + d_max)/2 ),

where d_max = max_k d(x, μ_k).
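A short Python/NumPy sketch of this normalization; it uses the standard log-sum-exp variant, shifting by the largest unnormalized log-posterior (equivalently, the smallest distance), which serves the same anti-underflow purpose as the d_max shift above.

import numpy as np

def posteriors(d, priors):
    """Numerically safe class posteriors from squared distances d[k] and priors pi[k]."""
    logp = np.log(priors) - 0.5 * np.asarray(d)   # unnormalized log-posteriors
    logp -= logp.max()                            # shift so the largest term is exp(0) = 1
    p = np.exp(logp)
    return p / p.sum()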

4.2.4 Graphical Representation

Sometimes it can be useful to have a graphical display of the data set. Using only the two or three most discriminant directions may not provide the best separation between classes, but it can suffice to inspect the data. That can be accomplished by plotting the first two or three dimensions of the regression fits X_OS or of the discriminant variates X_LDA, depending on whether we are presenting the dataset in the OS or in the LDA domain. Other attributes, such as the centroids or the shape of the within-class variance, can also be represented.

4.3 From Sparse Optimal Scoring to Sparse LDA

The equivalence stated in Section 4.1 holds for quadratic penalties of the form β^⊤Ωβ, under the assumption that Y^⊤Y and X^⊤X + λΩ are full rank (fulfilled when there are no empty classes and Ω is positive definite). Quadratic penalties have interesting properties but, as recalled in Section 2.3, they do not induce sparsity. In this respect, L1 penalties are preferable, but they lack a connection such as the one stated by Hastie et al. (1995) between p-LDA and p-OS.

In this section, we introduce the tools used to obtain sparse models while maintaining the equivalence between p-LDA and p-OS problems. We use a group-Lasso penalty (see Section 2.3.4) that induces groups of zeros in the coefficients corresponding to the same feature in all discriminant directions, resulting in really parsimonious models. Our derivation uses a variational formulation of the group-Lasso to generalize the equivalence drawn by Hastie et al. (1995) for quadratic penalties. Therefore, we intend to show that our formulation of the group-Lasso can be written in the quadratic form B^⊤ΩB.

4.3.1 A Quadratic Variational Form

Quadratic variational forms of the Lasso and group-Lasso have been proposed shortly after the original Lasso paper of Hastie and Tibshirani (1996), as a means to address optimization issues, but also as an inspiration for generalizing the Lasso penalty (Grandvalet, 1998; Canu and Grandvalet, 1999). The algorithms based on these quadratic variational forms iteratively reweight a quadratic penalty. They are now often outperformed by more efficient strategies (Bach et al., 2012).

Our formulation of the group-Lasso is shown below:

min_{τ∈R^p} min_{B∈R^{p×(K−1)}}  J(B) + λ Σ_{j=1}^p w_j² ‖βʲ‖²₂ / τ_j        (4.21a)
s.t.  Σ_j τ_j − Σ_j w_j ‖βʲ‖₂ ≤ 0,        (4.21b)
      τ_j ≥ 0, j = 1, …, p,        (4.21c)

where B ∈ R^{p×(K−1)} is a matrix composed of row vectors βʲ ∈ R^{K−1}, B = (β^{1⊤}, …, β^{p⊤})^⊤, and the w_j are predefined nonnegative weights. The cost function J(B) in our context is the OS regression ‖YΘ − XB‖²₂; from now on, for the sake of simplicity, we keep the generic notation J(B). Here and in what follows, b/τ is defined by continuation at zero, as b/0 = +∞ if b ≠ 0 and 0/0 = 0. Note that variants of (4.21) have been proposed elsewhere (see e.g. Canu and Grandvalet, 1999; Bach et al., 2012, and references therein).

The intuition behind our approach is that, using the variational formulation, we recast a non-quadratic penalty into the convex hull of a family of quadratic penalties defined by the variables τ_j. This is graphically shown in Figure 4.1.

Let us start by proving the equivalence of our variational formulation and the standard group-Lasso (there is an alternative variational formulation, detailed and demonstrated in Appendix D).

Lemma 4.1. The quadratic penalty in βʲ in (4.21) acts as the group-Lasso penalty λ Σ_{j=1}^p w_j ‖βʲ‖₂.

Proof. The Lagrangian of Problem (4.21) is

L = J(B) + λ Σ_{j=1}^p w_j² ‖βʲ‖²₂ / τ_j + ν₀( Σ_{j=1}^p τ_j − Σ_{j=1}^p w_j ‖βʲ‖₂ ) − Σ_{j=1}^p ν_j τ_j.


Figure 4.1: Graphical representation of the variational approach to the group-Lasso.

Thus, the first-order optimality conditions for τ_j* are

∂L/∂τ_j (τ_j*) = 0 ⇔ −λ w_j² ‖βʲ‖²₂ / τ_j*² + ν₀ − ν_j = 0
               ⇔ −λ w_j² ‖βʲ‖²₂ + ν₀ τ_j*² − ν_j τ_j*² = 0
               ⇒ −λ w_j² ‖βʲ‖²₂ + ν₀ τ_j*² = 0.

The last line is obtained from complementary slackness, which implies here ν_j τ_j* = 0. Complementary slackness states that ν_j g_j(τ_j*) = 0, where ν_j is the Lagrange multiplier for constraint g_j(τ_j) ≤ 0. As a result, the optimal value of τ_j is

τ_j* = ( λ w_j² ‖βʲ‖²₂ / ν₀ )^{1/2} = (λ/ν₀)^{1/2} w_j ‖βʲ‖₂.        (4.22)

We note that ν₀ ≠ 0 if there is at least one coefficient β_jk ≠ 0; thus the inequality constraint (4.21b) is at bound (due to complementary slackness):

Σ_{j=1}^p τ_j* − Σ_{j=1}^p w_j ‖βʲ‖₂ = 0,        (4.23)

so that τ_j* = w_j ‖βʲ‖₂. Using this value in (4.21a), it is possible to conclude that Problem (4.21) is equivalent to the standard group-Lasso operator

min_{B∈R^{p×(K−1)}}  J(B) + λ Σ_{j=1}^p w_j ‖βʲ‖₂.        (4.24)

So we have presented a convex quadratic variational form of the group-Lasso and demonstrated its equivalence with the standard group-Lasso formulation.


With Lemma 4.1, we have proved that, under constraints (4.21b)–(4.21c), the quadratic problem (4.21a) is equivalent to the standard formulation of the group-Lasso (4.24). The penalty term of (4.21a) can be conveniently presented as λB^⊤ΩB, where

Ω = diag( w_1²/τ_1, w_2²/τ_2, …, w_p²/τ_p ),        (4.25)

with τ_j = w_j ‖βʲ‖₂, resulting in the diagonal components

(Ω)_jj = w_j / ‖βʲ‖₂.        (4.26)
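This reweighting suggests a simple iterative scheme for a generic quadratic loss, sketched below in Python/NumPy: Ω is rebuilt from the current row norms via (4.26) and a penalized least squares problem is re-solved. The small floor eps on the norms stands in for the +∞ entries of exactly-zero rows and the ridge initialization is an implementation convenience; both are assumptions of this sketch, not part of the derivation.

import numpy as np

def group_lasso_os_reweighted(X, YTheta, lam, w=None, n_iter=50, eps=1e-8):
    """Iteratively reweighted ridge for min_B 0.5*||YTheta - XB||_F^2 + lam * sum_j w_j ||B[j,:]||_2."""
    n, p = X.shape
    w = np.ones(p) if w is None else w
    XtX, XtY = X.T @ X, X.T @ YTheta
    B = np.linalg.solve(XtX + lam * np.eye(p), XtY)      # ridge start
    for _ in range(n_iter):
        norms = np.maximum(np.linalg.norm(B, axis=1), eps)
        Omega = np.diag(w / norms)                       # (4.26), with 1/0 capped by eps
        B = np.linalg.solve(XtX + lam * Omega, XtY)      # quadratic surrogate of the penalty
    return B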

And, as stated at the beginning of this section, the equivalence between p-LDA problems and p-OS problems is demonstrated for the variational formulation. This equivalence is crucial to the derivation of the link between sparse OS and sparse LDA; it furthermore suggests a convenient implementation. We sketch below some properties that are instrumental in the implementation of the active set described in Chapter 5.

The first property states that the quadratic formulation is convex when J is convex, thus providing an easy control of optimality and convergence.

Lemma 4.2. If J is convex, Problem (4.21) is convex.

Proof. The function g(β, τ) = ‖β‖²₂/τ, known as the perspective function of f(β) = ‖β‖²₂, is convex in (β, τ) (see e.g. Boyd and Vandenberghe, 2004, Chapter 3), and the constraints (4.21b)–(4.21c) define convex admissible sets; hence Problem (4.21) is jointly convex with respect to (B, τ).

In what follows, J will be a convex quadratic (hence smooth) function, in which case a necessary and sufficient optimality condition is that zero belongs to the subdifferential of the objective function, whose expression is provided in the following lemma.

Lemma 4.3. For all B ∈ R^{p×(K−1)}, the subdifferential of the objective function of Problem (4.24) is

{ V ∈ R^{p×(K−1)} : V = ∂J(B)/∂B + λG },        (4.27)

where G ∈ R^{p×(K−1)} is a matrix composed of row vectors gʲ ∈ R^{K−1}, G = (g^{1⊤}, …, g^{p⊤})^⊤, defined as follows. Let S(B) denote the columnwise support of B, S(B) = { j ∈ {1, …, p} : ‖βʲ‖₂ ≠ 0 }; then we have

∀j ∈ S(B),  gʲ = w_j ‖βʲ‖₂^{−1} βʲ,        (4.28)
∀j ∉ S(B),  ‖gʲ‖₂ ≤ w_j.        (4.29)


This condition results in an equality for the "active" non-zero vectors βʲ and an inequality for the other ones, which both provide essential building blocks of our algorithm.

Proof. When ‖βʲ‖₂ ≠ 0, the gradient of the penalty with respect to βʲ is

∂( λ Σ_{m=1}^p w_m ‖β^m‖₂ ) / ∂βʲ = λ w_j βʲ / ‖βʲ‖₂.        (4.30)

At ‖βʲ‖₂ = 0, the gradient of the objective function is not continuous, and the optimality conditions then make use of the subdifferential (Bach et al., 2011):

∂_{βʲ}( λ Σ_{m=1}^p w_m ‖β^m‖₂ ) = ∂_{βʲ}( λ w_j ‖βʲ‖₂ ) = { λ w_j v ∈ R^{K−1} : ‖v‖₂ ≤ 1 }.        (4.31)

That gives the expression (4.29).

Lemma 4.4. Problem (4.21) admits at least one solution, which is unique if J is strictly convex. All critical points B of the objective function verifying the following conditions are global minima:

∀j ∈ S,  ∂J(B)/∂βʲ + λ w_j ‖βʲ‖₂^{−1} βʲ = 0,        (4.32a)
∀j ∉ S,  ‖ ∂J(B)/∂βʲ ‖₂ ≤ λ w_j,        (4.32b)

where S ⊆ {1, …, p} denotes the set of non-zero row vectors βʲ and S̄ is its complement.

Lemma 4.4 provides a simple appraisal of the support of the solution, which would not be as easily handled with the direct analysis of the variational problem (4.21).

4.3.2 Group-Lasso OS as Penalized LDA

With all the previous ingredients, the group-Lasso Optimal Scoring Solver for performing sparse LDA can be introduced.

Proposition 4.1. The group-Lasso OS problem

B_OS = argmin_{B∈R^{p×(K−1)}} min_{Θ∈R^{K×(K−1)}}  (1/2) ‖YΘ − XB‖²_F + λ Σ_{j=1}^p w_j ‖βʲ‖₂
       s.t.  n^{−1} Θ^⊤Y^⊤YΘ = I_{K−1}


is equivalent to the penalized LDA problem

B_LDA = argmax_{B∈R^{p×(K−1)}}  tr( B^⊤Σ_B B )
        s.t.  B^⊤( Σ_W + n^{−1}λΩ )B = I_{K−1},

where Ω = diag( w_1²/τ_1, …, w_p²/τ_p ), with

Ω_jj = +∞ if β_os^j = 0, and Ω_jj = w_j ‖β_os^j‖₂^{−1} otherwise.        (4.33)

That is, B_LDA = B_OS diag( α_k^{−1}(1 − α_k²)^{−1/2} ), where α_k ∈ (0, 1) is the kth leading eigenvalue of

n^{−1} Y^⊤X( X^⊤X + λΩ )^{−1}X^⊤Y.

Proof. The proof simply consists in applying the result of Hastie et al. (1995), which holds for quadratic penalties, to the quadratic variational form of the group-Lasso.

The proposition applies in particular to the Lasso-based OS approaches to sparse LDA (Grosenick et al., 2008; Clemmensen et al., 2011) for K = 2, that is, for binary classification, or more generally for a single discriminant direction. Note however that it leads to a slightly different decision rule if the decision threshold is chosen a priori, according to the Gaussian assumption for the features. For more than one discriminant direction, the equivalence does not hold any more, since the Lasso penalty does not result in an equivalent quadratic penalty in the simple form tr(B^⊤ΩB).


5 GLOSS Algorithm

The efficient approaches developed for the Lasso take advantage of the sparsity of the solution by solving a series of small linear systems, whose sizes are incrementally increased/decreased (Osborne et al., 2000a). This approach was also pursued for the group-Lasso in its standard formulation (Roth and Fischer, 2008). We adapt this algorithmic framework to the variational form (4.21), with J(B) = (1/2) ‖YΘ − XB‖²₂.

The algorithm belongs to the working-set family of optimization methods (see Section 2.3.6). It starts from a sparse initial guess, say B = 0, thus defining the set A of "active" variables currently identified as non-zero. Then it iterates the three steps summarized below:

1. Update the coefficient matrix B within the current active set A, where the optimization problem is smooth. First, the quadratic penalty is updated, and then a standard penalized least squares fit is computed.

2. Check the optimality conditions (4.32) with respect to the active variables. One or more βʲ may be declared inactive when they vanish from the current solution.

3. Check the optimality conditions (4.32) with respect to the inactive variables. If they are satisfied, the algorithm returns the current solution, which is optimal. If they are not satisfied, the variable corresponding to the greatest violation is added to the active set.

This mechanism is graphically represented in Figure 5.1 as a block diagram and formalized in more detail in Algorithm 1. Note that this formulation uses the equations from the variational approach detailed in Section 4.3.1. If we want to use the alternative variational approach from Appendix D, then we have to replace Equations (4.21), (4.32a) and (4.32b) by (D.1), (D.10a) and (D.10b) respectively.

5.1 Regression Coefficients Updates

Step 1 of Algorithm 1 updates the coefficient matrix B within the current active set A. The quadratic variational form of the problem suggests a blockwise optimization strategy, consisting in solving (K−1) independent card(A)-dimensional problems instead of a single (K−1)×card(A)-dimensional problem. The interaction between the (K−1) problems is relegated to the common adaptive quadratic penalty Ω. This decomposition is especially attractive as we then solve (K−1) similar systems:

( X_A^⊤X_A + λΩ )β_k = X_A^⊤Yθ_k⁰,        (5.1)


[Figure 5.1: GLOSS block diagram. The flow chart initializes the model (λ, B) and the active set {j : ‖βʲ‖₂ > 0}, then iterates: solve the p-OS problem so that B satisfies the first optimality condition; move to the inactive set any active variable that must leave; test the second optimality condition on the inactive set and move any violating variable to the active set; when no variable moves, compute Θ, update B and stop.]


Algorithm 1 Adaptively Penalized Optimal Scoring

Input: X, Y, B, λ
Initialize: A ← { j ∈ {1, …, p} : ‖βʲ‖₂ > 0 }; Θ⁰ such that n^{−1} Θ⁰^⊤Y^⊤YΘ⁰ = I_{K−1}; convergence ← false
repeat
  Step 1: solve (4.21) in B, assuming A optimal
  repeat
    Ω ← diag(Ω_A), with ω_j ← ‖βʲ‖₂^{−1}
    B_A ← ( X_A^⊤X_A + λΩ )^{−1} X_A^⊤YΘ⁰
  until condition (4.32a) holds for all j ∈ A
  Step 2: identify inactivated variables
  for j ∈ A such that ‖βʲ‖₂ = 0 do
    if optimality condition (4.32b) holds then
      A ← A \ {j}; go back to Step 1
    end if
  end for
  Step 3: check the greatest violation of optimality condition (4.32b) in the inactive set Ā
  ĵ ← argmax_{j∈Ā} ‖∂J/∂βʲ‖₂
  if ‖∂J/∂β^ĵ‖₂ < λ then
    convergence ← true (B is optimal)
  else
    A ← A ∪ {ĵ}
  end if
until convergence
(s, V) ← eigenanalyze( Θ⁰^⊤Y^⊤X_A B ), that is, Θ⁰^⊤Y^⊤X_A B V_k = s_k V_k, k = 1, …, K−1
Θ ← Θ⁰V; B ← BV; α_k ← n^{−1/2} s_k^{1/2}, k = 1, …, K−1
Output: Θ, B, α


where X_A denotes the columns of X indexed by A, and β_k and θ_k⁰ denote the kth columns of B and Θ⁰ respectively. These linear systems only differ in their right-hand-side term, so that a single Cholesky decomposition is necessary to solve all systems, whereas a blockwise Newton-Raphson method based on the standard group-Lasso formulation would result in different "penalties" Ω for each system.

5.1.1 Cholesky Decomposition

Dropping the subscripts and considering the (K−1) systems together, (5.1) leads to

( X^⊤X + λΩ )B = X^⊤YΘ.        (5.2)

Defining the Cholesky decomposition as C^⊤C = (X^⊤X + λΩ), (5.2) is solved efficiently as follows:

C^⊤CB = X^⊤YΘ
CB = C^⊤ \ X^⊤YΘ
B = C \ ( C^⊤ \ X^⊤YΘ ),        (5.3)

where the symbol "\" is the matlab mldivide operator, which solves linear systems efficiently. The GLOSS code implements (5.3).
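The same two triangular solves can be written in Python/SciPy as below; this is a sketch of the factor-then-solve idea, not the GLOSS matlab code itself.

import numpy as np
from scipy.linalg import cho_factor, cho_solve

def solve_penalized_ls(X, Y, Theta, lam, Omega):
    """Solve (X'X + lam*Omega) B = X'Y Theta via one Cholesky factorization, as in (5.2)-(5.3)."""
    c, low = cho_factor(X.T @ X + lam * Omega)      # C such that C'C = X'X + lam*Omega
    return cho_solve((c, low), X.T @ Y @ Theta)     # two triangular solves, all columns at once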

5.1.2 Numerical Stability

The OS regression coefficients are obtained by (5.2), where the penalizer Ω is iteratively updated by (4.33). In this iterative process, when a variable is about to leave the active set, the corresponding entry of Ω reaches large values, thereby driving some OS regression coefficients to zero. These large values may cause numerical stability problems in the Cholesky decomposition of X^⊤X + λΩ. This difficulty can be avoided using the following equivalent expression:

B = Ω^{−1/2}( Ω^{−1/2}X^⊤XΩ^{−1/2} + λI )^{−1}Ω^{−1/2}X^⊤YΘ⁰,        (5.4)

where the conditioning of Ω^{−1/2}X^⊤XΩ^{−1/2} + λI is always well-behaved provided X is appropriately normalized (recall that 0 ≤ 1/ω_j ≤ 1). This stabler expression demands more computation and is thus reserved to cases with large ω_j values. Our code is otherwise based on expression (5.2).

5.2 Score Matrix

The optimal score matrix Θ is made of the K−1 leading eigenvectors of Y^⊤X( X^⊤X + Ω )^{−1}X^⊤Y. This eigen-analysis is actually solved in the form Θ^⊤Y^⊤X( X^⊤X + Ω )^{−1}X^⊤YΘ (see Section 4.2.1 and Appendix B). The latter eigenvector decomposition does not require the costly computation of ( X^⊤X + Ω )^{−1}, which

involves the inversion of a p×p matrix. Let Θ⁰ be an arbitrary K×(K−1) matrix whose range includes the K−1 leading eigenvectors of Y^⊤X( X^⊤X + Ω )^{−1}X^⊤Y.¹ Then solving the K−1 systems (5.3) provides the value of B⁰ = (X^⊤X + λΩ)^{−1}X^⊤YΘ⁰. This B⁰ matrix can be identified in the expression to eigenanalyze as

Θ⁰^⊤Y^⊤X( X^⊤X + Ω )^{−1}X^⊤YΘ⁰ = Θ⁰^⊤Y^⊤XB⁰.

Thus the solution to the penalized OS problem can be computed through the singular value decomposition of the (K−1)×(K−1) matrix Θ⁰^⊤Y^⊤XB⁰ = VΛV^⊤. Defining Θ = Θ⁰V, we have Θ^⊤Y^⊤X( X^⊤X + Ω )^{−1}X^⊤YΘ = Λ, and when Θ⁰ is chosen such that n^{−1} Θ⁰^⊤Y^⊤YΘ⁰ = I_{K−1}, we also have that n^{−1} Θ^⊤Y^⊤YΘ = I_{K−1}, so that the constraints of the p-OS problem hold. Hence, assuming that the diagonal elements of Λ are sorted in decreasing order, θ_k is an optimal solution to the p-OS problem. Finally, once Θ has been computed, the corresponding optimal regression coefficients B satisfying (5.2) are simply recovered using the mapping from Θ⁰ to Θ, that is, B = B⁰V.
Appendix E details why the computational trick described here for quadratic penalties can be applied to the group-Lasso, for which Ω is defined by a variational formulation.

5.3 Optimality Conditions

GLOSS uses an active-set optimization technique to obtain the optimal values of the coefficient matrix B and the score matrix Θ. To be a solution, the coefficient matrix must obey Lemmas 4.3 and 4.4. Optimality conditions (4.32a) and (4.32b) can be deduced from those lemmas. Both expressions require the computation of the gradient of the objective function

(1/2) ‖YΘ − XB‖²₂ + λ Σ_{j=1}^p w_j ‖βʲ‖₂.        (5.5)

Let J(B) be the data-fitting term (1/2)‖YΘ − XB‖²₂. Its gradient with respect to the jth row of B, βʲ, is the (K−1)-dimensional vector

∂J(B)/∂βʲ = x_j^⊤( XB − YΘ ),

where x_j is the jth column of X. Hence the first optimality condition (4.32a) can be computed for every variable j as

x_j^⊤( XB − YΘ ) + λ w_j βʲ / ‖βʲ‖₂.

¹As X is centered, 1_K belongs to the null space of Y^⊤X( X^⊤X + Ω )^{−1}X^⊤Y. It is thus sufficient to choose Θ⁰ orthogonal to 1_K to ensure that its range spans the leading eigenvectors of Y^⊤X( X^⊤X + Ω )^{−1}X^⊤Y. In practice, to comply with this desideratum and conditions (3.5b) and (3.5c), we set Θ⁰ = (Y^⊤Y)^{−1/2}U, where U is a K×(K−1) matrix whose columns are orthonormal vectors orthogonal to 1_K.


The second optimality condition (4.32b) can be computed for every variable j as

‖ x_j^⊤( XB − YΘ ) ‖₂ ≤ λ w_j.

5.4 Active and Inactive Sets

The feature selection mechanism embedded in GLOSS selects the variables that provide the greatest decrease in the objective function. This is accomplished by means of the optimality conditions (4.32a) and (4.32b). Let A be the active set containing the variables that have already been considered relevant. A variable j can be considered for inclusion in the active set if it violates the second optimality condition. We proceed one variable at a time, choosing the one that is expected to produce the greatest decrease in the objective function:

j* = argmax_j  max( ‖ x_j^⊤( XB − YΘ ) ‖₂ − λ w_j, 0 ).

The exclusion of a variable belonging to the active set A is considered if the norm ‖βʲ‖₂ is small and if, after setting βʲ to zero, the following optimality condition holds:

‖ x_j^⊤( XB − YΘ ) ‖₂ ≤ λ w_j.

The process continues until no variable in the active set violates the first optimality condition and no variable in the inactive set violates the second optimality condition. A sketch of these two tests is given below.
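The following Python/NumPy sketch implements only these two tests (the bookkeeping of Algorithm 1 is omitted); the gradient rows are x_j^⊤(XB − YΘ) as derived in Section 5.3, and the numerical tolerance 1e-10 and the array-valued weights w are assumptions of this sketch.

import numpy as np

def activation_tests(X, Y, Theta, B, lam, w, active):
    """Return the inactive variable with the greatest violation and the actives that may leave."""
    G = X.T @ (X @ B - Y @ Theta)                    # p x (K-1) matrix of gradients dJ/dbeta^j
    norms = np.linalg.norm(G, axis=1)
    inactive = np.setdiff1d(np.arange(X.shape[1]), active)
    # candidate for inclusion: largest positive violation of ||dJ/dbeta^j||_2 <= lam*w_j
    viol = norms[inactive] - lam * w[inactive]
    j_in = int(inactive[np.argmax(viol)]) if viol.size and viol.max() > 0 else None
    # candidates for exclusion: active rows with (numerically) zero norm satisfying the bound
    row_norms = np.linalg.norm(B[active], axis=1)
    out = [j for j, rn in zip(active, row_norms) if rn < 1e-10 and norms[j] <= lam * w[j]]
    return j_in, out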

5.5 Penalty Parameter

The penalty parameter can be specified by the user, in which case GLOSS solves the problem with this value of λ. The other strategy is to compute the solution path for several values of λ: GLOSS then looks for the maximum value of the penalty parameter, λ_max, such that B ≠ 0, and solves the p-OS problem for decreasing values of λ until a prescribed number of features are declared active.

The maximum value of the penalty parameter λ_max, corresponding to a null B matrix, is obtained by computing the optimality condition (4.32b) at B = 0:

λ_max = max_{j∈{1,…,p}}  (1/w_j) ‖ x_j^⊤YΘ⁰ ‖₂.

The algorithm then computes a series of solutions along the regularization path defined by a series of penalties λ_1 = λ_max > ⋯ > λ_t > ⋯ > λ_T = λ_min ≥ 0, by regularly decreasing the penalty, λ_{t+1} = λ_t/2, and using a warm-start strategy where the feasible initial guess for B(λ_{t+1}) is initialized with B(λ_t). The final penalty parameter λ_min is determined during the optimization process, when the maximum number of desired active variables is attained (by default the minimum of n and p). A sketch of this path strategy is given below.
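A minimal Python/NumPy sketch of the path strategy: λ_max is computed from the condition at B = 0, λ is halved at each step, and each solve is warm-started from the previous solution. The inner solver fit_gloss is a hypothetical placeholder for the active-set procedure of Algorithm 1, and the number of steps is an arbitrary assumption.

import numpy as np

def lambda_max(X, Y, Theta0, w):
    """Smallest penalty for which B = 0 is optimal: max_j ||x_j' Y Theta0||_2 / w_j."""
    G = X.T @ Y @ Theta0
    return np.max(np.linalg.norm(G, axis=1) / w)

def regularization_path(X, Y, Theta0, w, fit_gloss, max_active, n_steps=20):
    lam = lambda_max(X, Y, Theta0, w)
    B = np.zeros((X.shape[1], Theta0.shape[1]))
    path = []
    for _ in range(n_steps):
        B = fit_gloss(X, Y, Theta0, lam, w, B_init=B)    # warm start from the previous solution
        path.append((lam, B.copy()))
        if np.count_nonzero(np.linalg.norm(B, axis=1)) >= max_active:
            break
        lam /= 2.0                                       # lambda_{t+1} = lambda_t / 2
    return path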


5.6 Options and Variants

5.6.1 Scaling Variables

As most penalization schemes, GLOSS is sensitive to the scaling of the variables. It thus makes sense to normalize them before applying the algorithm or, equivalently, to accommodate weights in the penalty. This option is available in the algorithm.

5.6.2 Sparse Variant

This version replaces some matlab commands used in the standard version of GLOSS by their sparse equivalents. In addition, some mathematical structures are adapted for sparse computation.

5.6.3 Diagonal Variant

We motivated the group-Lasso penalty by sparsity requisites, but robustness considerations could also drive its usage, since LDA is known to be unstable when the number of examples is small compared to the number of variables. In this context, LDA has been experimentally observed to benefit from unrealistic assumptions on the form of the estimated within-class covariance matrix. Indeed, the diagonal approximation, which ignores correlations between genes, may lead to better classification in microarray analysis: Bickel and Levina (2004) have shown that this crude approximation provides a classifier with better worst-case performance than the LDA decision rule in small sample size regimes, even if variables are correlated.

The equivalence proof between penalized OS and penalized LDA (Hastie et al., 1995) reveals that quadratic penalties in the OS problem are equivalent to penalties on the within-class covariance matrix in the LDA formulation. This proof suggests a slight variant of penalized OS corresponding to penalized LDA with a diagonal within-class covariance matrix, where the least squares problems

min_{B∈R^{p×(K−1)}}  ‖YΘ − XB‖²_F = min_{B∈R^{p×(K−1)}}  tr( Θ^⊤Y^⊤YΘ − 2Θ^⊤Y^⊤XB + nB^⊤Σ_T B )

are replaced by

min_{B∈R^{p×(K−1)}}  tr( Θ^⊤Y^⊤YΘ − 2Θ^⊤Y^⊤XB + nB^⊤( Σ_B + diag(Σ_W) )B ).

Note that this variant only requires diag(Σ_W) + Σ_B + n^{−1}Ω to be positive definite, which is a weaker requirement than Σ_T + n^{−1}Ω positive definite.

5.6.4 Elastic Net and Structured Variant

For some learning problems, the structure of correlations between variables is partially known. Hastie et al. (1995) applied this idea to the field of handwritten digit recognition


Figure 5.2: Graph and Laplacian matrix for a 3×3 image. Pixel grid (rows from top to bottom):

    7 8 9
    4 5 6
    1 2 3

Ω_L =
    [  3 −1  0 −1 −1  0  0  0  0 ]
    [ −1  5 −1 −1 −1 −1  0  0  0 ]
    [  0 −1  3  0 −1 −1  0  0  0 ]
    [ −1 −1  0  5 −1  0 −1 −1  0 ]
    [ −1 −1 −1 −1  8 −1 −1 −1 −1 ]
    [  0 −1 −1  0 −1  5  0 −1 −1 ]
    [  0  0  0 −1 −1  0  3 −1  0 ]
    [  0  0  0 −1 −1 −1 −1  5 −1 ]
    [  0  0  0  0 −1 −1  0 −1  3 ]

for their penalized discriminant analysis model, to constrain the discriminant directions to be spatially smooth.

When an image is represented as a vector of pixels, it is reasonable to assume positive correlations between the variables corresponding to neighboring pixels. Figure 5.2 represents the neighborhood graph of the pixels of a 3×3 image, with the corresponding Laplacian matrix. The Laplacian matrix Ω_L is positive-semidefinite, and the penalty β^⊤Ω_Lβ favors, among vectors of identical L2 norm, the ones having similar coefficients in the neighborhoods of the graph. For example, this penalty is 9 for the vector (1, 1, 0, 1, 1, 0, 0, 0, 0)^⊤, which is the indicator of pixel 1 and its neighbors, and it is 21 for the vector (−1, 1, 0, 1, 1, 0, 0, 0, 0)^⊤, with a sign mismatch between pixel 1 and its neighborhood.

This smoothness penalty can be imposed jointly with the group-Lasso. From the computational point of view, GLOSS hardly needs to be modified: the smoothness penalty just has to be added to the group-Lasso penalty. As the new penalty is convex and quadratic (thus smooth), there is no additional burden in the overall algorithm. There is, however, an additional hyperparameter to be tuned.


6 Experimental Results

This section presents some comparison results between the Group-Lasso Optimal Scoring Solver algorithm and two other state-of-the-art classifiers proposed to perform sparse LDA. Those algorithms are Penalized LDA (PLDA) (Witten and Tibshirani, 2011), which applies a Lasso penalty within a Fisher's LDA framework, and Sparse Linear Discriminant Analysis (SLDA) (Clemmensen et al., 2011), which applies an Elastic net penalty to the OS problem. With the aim of testing the parsimony capacities, the latter algorithm was tested without any quadratic penalty, that is, with a Lasso penalty. The implementations of PLDA and SLDA are available from the authors' websites; PLDA is an R implementation and SLDA is coded in matlab. All the experiments used the same training, validation and test sets. Note that they differ significantly from the ones of Witten and Tibshirani (2011) in Simulation 4, for which there was a typo in their paper.

6.1 Normalization

With shrunken estimates, the scaling of features has important outcomes. For the linear discriminants considered here, the two most common normalization strategies consist in setting to ones either the diagonal of the total covariance matrix Σ_T or the diagonal of the within-class covariance matrix Σ_W. These options can be implemented either by scaling the observations accordingly prior to the analysis, or by providing penalties with weights. The latter option is implemented in our matlab package.¹

6.2 Decision Thresholds

The derivations of LDA based on the analysis of variance or on the regression of class indicators do not rely on the normality of the class-conditional distribution of the observations. Hence their applicability extends beyond the realm of Gaussian data. Based on this observation, Friedman et al. (2009, Chapter 4) suggest to investigate other decision thresholds than the ones stemming from the Gaussian mixture assumption. In particular, they propose to select the decision thresholds that empirically minimize the training error. This option was tested using validation sets or cross-validation.

¹The GLOSS matlab code can be found in the software section of www.hds.utc.fr/~grandval.


6.3 Simulated Data

We first compare the three techniques in the simulation study of Witten and Tibshirani (2011), which considers four setups with 1200 examples equally distributed between classes. They are split in a training set of size n = 100, a validation set of size 100 and a test set of size 1000. We are in the small sample regime, with p = 500 variables, out of which 100 differ between classes. Independent variables are generated for all simulations except for Simulation 2, where they are slightly correlated. In Simulations 2 and 3, classes are optimally separated by a single projection of the original variables, while the two other scenarios require three discriminant directions. The Bayes' error was estimated to be respectively 1.7%, 6.7%, 7.3% and 30.0%. The exact definition of every setup, as provided in Witten and Tibshirani (2011), is:

Simulation1 Mean shift with independent features There are four classes If samplei is in class k then xi sim N(microk I) where micro1j = 07 times 1(1lejle25) micro2j = 07 times 1(26lejle50)micro3j = 07times 1(51lejle75) micro4j = 07times 1(76lejle100)

Simulation2 Mean shift with dependent features There are two classes If samplei is in class 1 then xi sim N(0Σ) and if i is in class 2 then xi sim N(microΣ) withmicroj = 06 times 1(jle200) The covariance structure is block diagonal with 5 blocks each of

dimension 100times 100 The blocks have (j jprime) element 06|jminusjprime| This covariance structure

is intended to mimic gene expression data correlation

Simulation 3: One-dimensional mean shift with independent features. There are four classes and the features are independent. If sample i is in class k, then $X_{ij} \sim N\!\left(\frac{k-1}{3}, 1\right)$ if $j \le 100$, and $X_{ij} \sim N(0, 1)$ otherwise.

Simulation 4: Mean shift with independent features and no linear ordering. There are four classes. If sample i is in class k, then $x_i \sim N(\mu_k, I)$, with mean vectors defined as follows: $\mu_{1j} \sim N(0, 0.3^2)$ for $j \le 25$ and $\mu_{1j} = 0$ otherwise; $\mu_{2j} \sim N(0, 0.3^2)$ for $26 \le j \le 50$ and $\mu_{2j} = 0$ otherwise; $\mu_{3j} \sim N(0, 0.3^2)$ for $51 \le j \le 75$ and $\mu_{3j} = 0$ otherwise; $\mu_{4j} \sim N(0, 0.3^2)$ for $76 \le j \le 100$ and $\mu_{4j} = 0$ otherwise.

Note that this protocol is detrimental to GLOSS, as each relevant variable only affects a single class mean out of K. The setup is favorable to PLDA in the sense that most within-class covariance matrices are diagonal. We thus also tested the diagonal GLOSS variant discussed in Section 5.6.3.

The results are summarized in Table 6.1. Overall, the best predictions are performed by PLDA and GLOSS-D, which both benefit from the knowledge of the true within-class covariance structure. Then, among SLDA and GLOSS, which both ignore this structure, our proposal has a clear edge. The error rates are far away from the Bayes' error rates, but the sample size is small with regard to the number of relevant variables. Regarding sparsity, the clear overall winner is GLOSS, followed far away by SLDA, which is the only


Table 6.1: Experimental results for simulated data: averages (with standard deviations) computed over 25 repetitions of the test error rate, the number of selected variables, and the number of discriminant directions selected on the validation set.

               Err (%)        Var            Dir

Sim 1: K = 4, mean shift, independent features
  PLDA         12.6 (0.1)    411.7 (3.7)    3.0 (0.0)
  SLDA         31.9 (0.1)    228.0 (0.2)    3.0 (0.0)
  GLOSS        19.9 (0.1)    106.4 (1.3)    3.0 (0.0)
  GLOSS-D      11.2 (0.1)    251.1 (4.1)    3.0 (0.0)

Sim 2: K = 2, mean shift, dependent features
  PLDA          9.0 (0.4)    337.6 (5.7)    1.0 (0.0)
  SLDA         19.3 (0.1)     99.0 (0.0)    1.0 (0.0)
  GLOSS        15.4 (0.1)     39.8 (0.8)    1.0 (0.0)
  GLOSS-D       9.0 (0.0)    203.5 (4.0)    1.0 (0.0)

Sim 3: K = 4, 1D mean shift, independent features
  PLDA         13.8 (0.6)    161.5 (3.7)    1.0 (0.0)
  SLDA         57.8 (0.2)    152.6 (2.0)    1.9 (0.0)
  GLOSS        31.2 (0.1)    123.8 (1.8)    1.0 (0.0)
  GLOSS-D      18.5 (0.1)    357.5 (2.8)    1.0 (0.0)

Sim 4: K = 4, mean shift, independent features
  PLDA         60.3 (0.1)    336.0 (5.8)    3.0 (0.0)
  SLDA         65.9 (0.1)    208.8 (1.6)    2.7 (0.0)
  GLOSS        60.7 (0.2)     74.3 (2.2)    2.7 (0.0)
  GLOSS-D      58.8 (0.1)    162.7 (4.9)    2.9 (0.0)



Figure 6.1: TPR versus FPR (in %) for all algorithms and simulations.

Table 6.2: Average TPR and FPR (in %) computed over 25 repetitions.

             Simulation 1      Simulation 2      Simulation 3      Simulation 4
             TPR     FPR       TPR     FPR       TPR     FPR       TPR     FPR
  PLDA       99.0    78.2      96.9    60.3      98.0    15.9      74.3    65.6
  SLDA       73.9    38.5      33.8    16.3      41.6    27.8      50.7    39.5
  GLOSS      64.1    10.6      30.0     4.6      51.1    18.2      26.0    12.1
  GLOSS-D    93.5    39.4      92.1    28.1      95.6    65.5      42.9    29.9

method that does not succeed in uncovering a low-dimensional representation in Simulation 3. The adequacy of the selected features was assessed by the True Positive Rate (TPR) and the False Positive Rate (FPR). The TPR is the proportion of relevant variables that are selected; similarly, the FPR is the proportion of irrelevant variables that are selected. The best algorithm would be the one that selects all the relevant variables and rejects all the others, that is, TPR = 1 and FPR = 0 simultaneously. PLDA has the best TPR but a terrible FPR, except in Simulation 3, where it dominates all the other methods. GLOSS has by far the best FPR, with an overall TPR slightly below SLDA. Results are displayed in Figure 6.1 and in Table 6.2 (both in percentages).
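With the ground truth known (here, the first 100 variables out of 500 are relevant), both rates reduce to simple ratios; the following lines are only a small illustration of their computation.

    import numpy as np

    def tpr_fpr(selected, relevant, p):
        """TPR: fraction of relevant variables selected; FPR: fraction of irrelevant ones selected."""
        irrelevant = np.setdiff1d(np.arange(p), relevant)
        tpr = np.isin(relevant, selected).mean()
        fpr = np.isin(irrelevant, selected).mean()
        return tpr, fpr

    # example: tpr, fpr = tpr_fpr(selected_idx, np.arange(100), p=500)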

6.4 Gene Expression Data

We now compare GLOSS to PLDA and SLDA on three genomic datasets. The Nakayama² dataset contains 105 examples of 22,283 gene expressions for categorizing 10 soft tissue tumors. It was reduced to the 86 examples belonging to the 5 dominant categories (Witten and Tibshirani, 2011). The Ramaswamy³ dataset contains 198 examples

²http://www.broadinstitute.org/cancer/software/genepattern/datasets
³http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS2736


Table 6.3: Experimental results for gene expression data: averages over 10 training/test set splits, with standard deviations, of the test error rates and the number of selected variables.

            Err (%)           Var

Nakayama: n = 86, p = 22,283, K = 5
  PLDA      20.95 (1.3)      10478.7 (2116.3)
  SLDA      25.71 (1.7)        252.5 (3.1)
  GLOSS     20.48 (1.4)        129.0 (18.6)

Ramaswamy: n = 198, p = 16,063, K = 14
  PLDA      38.36 (6.0)      14873.5 (720.3)
  SLDA      —                  —
  GLOSS     20.61 (6.9)        372.4 (122.1)

Sun: n = 180, p = 54,613, K = 4
  PLDA      33.78 (5.9)      21634.8 (7443.2)
  SLDA      36.22 (6.5)        384.4 (16.5)
  GLOSS     31.77 (4.5)         93.0 (93.6)

of 16,063 gene expressions for categorizing 14 classes of cancer. Finally, the Sun⁴ dataset contains 180 examples of 54,613 gene expressions for categorizing 4 classes of tumors.

Each dataset was split into a training set and a test set with respectively 75% and 25% of the examples. Parameter tuning is performed by 10-fold cross-validation, and the test performances are then evaluated. The process is repeated 10 times, with random choices of the training and test set split.
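The protocol can be sketched as follows; scikit-learn's shrinkage LDA is used here purely as a stand-in classifier with one regularization parameter (GLOSS itself is a Matlab implementation), and the grid of values is arbitrary.

    import numpy as np
    from sklearn.model_selection import train_test_split, cross_val_score
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    def one_split(X, y, grid=np.linspace(0.0, 1.0, 11), seed=0):
        # 75% / 25% train/test split
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                                  stratify=y, random_state=seed)
        # 10-fold CV on the training set to pick the regularization level
        cv_acc = [cross_val_score(LinearDiscriminantAnalysis(solver="lsqr", shrinkage=s),
                                  X_tr, y_tr, cv=10).mean() for s in grid]
        best = grid[int(np.argmax(cv_acc))]
        model = LinearDiscriminantAnalysis(solver="lsqr", shrinkage=best).fit(X_tr, y_tr)
        return 1.0 - model.score(X_te, y_te)       # test error rate for this split

    # test_errors = [one_split(X, y, seed=r) for r in range(10)]   # 10 random splits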

Test error rates and the number of selected variables are presented in Table 6.3. The results for the PLDA algorithm are extracted from Witten and Tibshirani (2011). The three methods have comparable prediction performances on the Nakayama and Sun datasets, but GLOSS performs better on the Ramaswamy data, where the SparseLDA package failed to return a solution, due to numerical problems in the LARS-EN implementation. Regarding the number of selected variables, GLOSS is again much sparser than its competitors.

Finally, Figure 6.2 displays the projection of the observations for the Nakayama and Sun datasets in the first canonical planes estimated by GLOSS and SLDA. For the Nakayama dataset, groups 1 and 2 are well separated from the other ones in both representations, but GLOSS is more discriminant in the meta-cluster gathering groups 3 to 5. For the Sun dataset, SLDA suffers from a high colinearity of its first canonical variables, which renders the second one almost non-informative. As a result, group 1 is better separated in the first canonical plane with GLOSS.

⁴http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS1962


[Figure 6.2 contains four scatter plots (rows: Nakayama, Sun; columns: GLOSS, SLDA), each plotting the 1st discriminant versus the 2nd discriminant. Nakayama class legend: 1) Synovial sarcoma, 2) Myxoid liposarcoma, 3) Dedifferentiated liposarcoma, 4) Myxofibrosarcoma, 5) Malignant fibrous histiocytoma. Sun class legend: 1) NonTumor, 2) Astrocytomas, 3) Glioblastomas, 4) Oligodendrogliomas.]

Figure 6.2: 2D-representations of the Nakayama and Sun datasets based on the two first discriminant vectors provided by GLOSS and SLDA. The big squares represent class means.


Figure 6.3: USPS digits “1” and “0”.

6.5 Correlated Data

When the features are known to be highly correlated, the discrimination algorithm can be improved by using this information in the optimization problem. The structured variant of GLOSS presented in Section 5.6.4, S-GLOSS from now on, was conceived to easily introduce this prior knowledge.

The experiments described in this section are intended to illustrate the effect of combining the group-Lasso sparsity-inducing penalty with a quadratic penalty used as a surrogate of the unknown within-class variance matrix. This preliminary experiment does not include comparisons with other algorithms; more comprehensive experimental results have been left for future work.

For this illustration, we have used a subset of the USPS handwritten digit dataset, made of 16 × 16 pixel images representing digits from 0 to 9. For our purpose, we compare the discriminant direction that separates digits “1” and “0”, computed with GLOSS and S-GLOSS. The mean image of every digit is shown in Figure 6.3.

As in Section 5.6.4, we have encoded the pixel proximity relationships of Figure 5.2 into a penalty matrix ΩL, but this time on a 256-node graph. Introducing this new 256 × 256 Laplacian penalty matrix ΩL in the GLOSS algorithm is straightforward.
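For reference, such a Laplacian penalty can be built as L = D − A from the 4-neighbor adjacency of the 16 × 16 pixel grid; the sketch below is an illustration of this construction (it is not the thesis code).

    import numpy as np

    def grid_laplacian(rows=16, cols=16):
        """Graph Laplacian L = D - A of the grid linking horizontally/vertically adjacent pixels."""
        p = rows * cols
        A = np.zeros((p, p))
        for r in range(rows):
            for c in range(cols):
                i = r * cols + c
                if c + 1 < cols:                 # right neighbor
                    A[i, i + 1] = A[i + 1, i] = 1
                if r + 1 < rows:                 # bottom neighbor
                    j = (r + 1) * cols + c
                    A[i, j] = A[j, i] = 1
        return np.diag(A.sum(axis=1)) - A        # D - A

    Omega_L = grid_laplacian()                   # 256 x 256 penalty matrix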

The effect of this penalty is fairly evident in Figure 6.4, where the discriminant vector β resulting from a non-penalized execution of GLOSS is compared with the β resulting from a Laplace-penalized execution of S-GLOSS (without group-Lasso penalty). We perfectly distinguish the center of the digit “0” in the discriminant direction obtained by S-GLOSS, which is probably the most important element to discriminate both digits.

Figure 6.5 displays the discriminant direction β obtained by GLOSS and S-GLOSS for a non-zero group-Lasso penalty, with an identical penalization parameter (λ = 0.3). Even if both solutions are sparse, the discriminant vector from S-GLOSS keeps connected pixels that allow to detect strokes and will probably provide better prediction results.


Figure 6.4: Discriminant direction between digits “1” and “0”: β for GLOSS and β for S-GLOSS.

Figure 6.5: Sparse discriminant direction between digits “1” and “0”: β for GLOSS and β for S-GLOSS, both with λ = 0.3.


Discussion

GLOSS is an efficient algorithm that performs sparse LDA based on the regression of class indicators. Our proposal is equivalent to a penalized LDA problem. This is, to our knowledge, the first approach that enjoys this property in the multi-class setting. This relationship is also amenable to accommodate interesting constraints on the equivalent penalized LDA problem, such as imposing a diagonal structure of the within-class covariance matrix.

Computationally, GLOSS is based on an efficient active set strategy that is amenable to the processing of problems with a large number of variables. The inner optimization problem decouples the p × (K − 1)-dimensional problem into (K − 1) independent p-dimensional problems. The interaction between the (K − 1) problems is relegated to the computation of the common adaptive quadratic penalty. The algorithm presented here is highly efficient in medium to high dimensional setups, which makes it a good candidate for the analysis of gene expression data.

The experimental results confirm the relevance of the approach, which behaves well compared to its competitors, regarding either its prediction abilities or its interpretability (sparsity). Generally, compared to the competing approaches, GLOSS provides extremely parsimonious discriminants without compromising prediction performances. Employing the same features in all discriminant directions enables to generate models that are globally extremely parsimonious, with good prediction abilities. The resulting sparse discriminant directions also allow for visual inspection of the data from the low-dimensional representations that can be produced.

The approach has many potential extensions that have not yet been implemented. A first line of development is to consider a broader class of penalties. For example, plain quadratic penalties can also be added to the group penalty, to encode priors about the within-class covariance structure in the spirit of the Penalized Discriminant Analysis of Hastie et al. (1995). Also, besides the group-Lasso, our framework can be customized to any penalty that is uniformly spread within groups, and many composite or hierarchical penalties that have been proposed for structured data meet this condition.


Part III

Sparse Clustering Analysis


Abstract

Clustering can be defined as the task of grouping samples such that all the elements belonging to one cluster are more “similar” to each other than to the objects belonging to the other groups. There are similarity measures for any data structure: database records or even multimedia objects (audio, video). The similarity concept is closely related to the idea of distance, which is a specific dissimilarity.

Model-based clustering aims to describe a heterogeneous population with a probabilistic model that represents each group with its own distribution. Here, the distributions will be Gaussians, and the different populations are identified by different means and a common covariance matrix.

As in the supervised framework, traditional clustering techniques perform worse when the number of irrelevant features increases. In this part, we develop Mix-GLOSS, which builds on the supervised GLOSS algorithm to address unsupervised problems, resulting in a clustering mechanism with embedded feature selection.

Chapter 7 reviews different techniques for inducing sparsity in model-based clustering algorithms. The theory that motivates our original formulation of the EM algorithm is developed in Chapter 8, followed by the description of the algorithm in Chapter 9. Its performance is assessed and compared to other state-of-the-art model-based sparse clustering mechanisms in Chapter 10.


7 Feature Selection in Mixture Models

7.1 Mixture Models

One of the most popular clustering algorithms is K-means, which aims to partition n observations into K clusters, each observation being assigned to the cluster with the nearest mean (MacQueen, 1967). A generalization of K-means can be made through probabilistic models, which represent K subpopulations by a mixture of distributions. Since their first use by Newcomb (1886) for the detection of outlier points, and 8 years later by Pearson (1894) to identify two separate populations of crabs, finite mixtures of distributions have been employed to model a wide variety of random phenomena. These models assume that measurements are taken from a set of individuals, each of which belongs to one out of a number of different classes, while any individual's particular class is unknown. Mixture models can thus address the heterogeneity of a population and are especially well suited to the problem of clustering.

7.1.1 Model

We assume that the observed data $X = (x_1^\top, \ldots, x_n^\top)^\top$ have been drawn identically from K different subpopulations in the domain $\mathbb{R}^p$. The generative distribution is a finite mixture model, that is, the data are assumed to be generated from a compounded distribution whose density can be expressed as

$$f(x_i) = \sum_{k=1}^{K} \pi_k f_k(x_i), \quad \forall i \in \{1, \ldots, n\},$$

where K is the number of components, $f_k$ are the densities of the components and $\pi_k$ are the mixture proportions ($\pi_k \in\, ]0,1[\ \forall k$ and $\sum_k \pi_k = 1$). Mixture models transcribe that, given the proportions $\pi_k$ and the distributions $f_k$ for each class, the data are generated according to the following mechanism:

• y: each individual is allotted to a class according to a multinomial distribution with parameters $\pi_1, \ldots, \pi_K$;

• x: each $x_i$ is assumed to arise from a random vector with probability density function $f_k$.

In addition, it is usually assumed that the component densities $f_k$ belong to a parametric family of densities $\phi(\cdot\,; \theta_k)$. The density of the mixture can then be written as

$$f(x_i; \theta) = \sum_{k=1}^{K} \pi_k \phi(x_i; \theta_k), \quad \forall i \in \{1, \ldots, n\},$$


where $\theta = (\pi_1, \ldots, \pi_K, \theta_1, \ldots, \theta_K)$ is the parameter of the model.

7.1.2 Parameter Estimation: The EM Algorithm

For the estimation of the parameters of the mixture model, Pearson (1894) used the method of moments to estimate the five parameters $(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \pi)$ of a univariate Gaussian mixture model with two components. That method required him to solve polynomial equations of degree nine. There are also graphic methods, maximum likelihood methods and Bayesian approaches.

The most widely used process to estimate the parameters is the maximization of the log-likelihood using the EM algorithm. It is typically used to maximize the likelihood of models with latent variables, for which no analytical solution is available (Dempster et al., 1977).

The EM algorithm iterates two steps, called the expectation step (E) and the maximization step (M). Each expectation step involves the computation of the likelihood expectation with respect to the hidden variables, while each maximization step estimates the parameters by maximizing the E-step expected likelihood.

Under mild regularity assumptions, this mechanism converges to a local maximum of the likelihood. However, the type of problems targeted is typically characterized by the existence of several local maxima, and global convergence cannot be guaranteed. In practice, the obtained solution depends on the initialization of the algorithm.

Maximum Likelihood Definitions

The likelihood is commonly expressed in its logarithmic version:

$$\mathcal{L}(\theta; X) = \log\Big(\prod_{i=1}^{n} f(x_i; \theta)\Big) = \sum_{i=1}^{n} \log\Big(\sum_{k=1}^{K} \pi_k f_k(x_i; \theta_k)\Big), \qquad (7.1)$$

where n is the number of samples, K is the number of components of the mixture (or number of clusters) and $\pi_k$ are the mixture proportions.

To obtain maximum likelihood estimates, the EM algorithm works with the joint distribution of the observations x and the unknown latent variables y, which indicate the cluster membership of every sample. The pair z = (x, y) is called the complete data. The log-likelihood of the complete data is called the complete log-likelihood or


classification log-likelihood:

$$\mathcal{L}_C(\theta; X, Y) = \log\Big(\prod_{i=1}^{n} f(x_i, y_i; \theta)\Big) = \sum_{i=1}^{n} \log\Big(\sum_{k=1}^{K} y_{ik}\, \pi_k f_k(x_i; \theta_k)\Big) = \sum_{i=1}^{n}\sum_{k=1}^{K} y_{ik} \log\big(\pi_k f_k(x_i; \theta_k)\big). \qquad (7.2)$$

The $y_{ik}$ are the binary entries of the indicator matrix Y, with $y_{ik} = 1$ if observation i belongs to cluster k, and $y_{ik} = 0$ otherwise.

Define the soft membership $t_{ik}(\theta)$ as

$$t_{ik}(\theta) = p(Y_{ik}=1 \mid x_i; \theta) = \frac{\pi_k f_k(x_i; \theta_k)}{f(x_i; \theta)}. \qquad (7.3, 7.4)$$

To lighten notations, $t_{ik}(\theta)$ will be denoted $t_{ik}$ when the parameter $\theta$ is clear from the context. The regular (7.1) and complete (7.2) log-likelihoods are related as follows:

$$\begin{aligned}
\mathcal{L}_C(\theta; X, Y) &= \sum_{i,k} y_{ik} \log\big(\pi_k f_k(x_i; \theta_k)\big) \\
&= \sum_{i,k} y_{ik} \log\big(t_{ik}\, f(x_i; \theta)\big) \\
&= \sum_{i,k} y_{ik} \log t_{ik} + \sum_{i,k} y_{ik} \log f(x_i; \theta) \\
&= \sum_{i,k} y_{ik} \log t_{ik} + \sum_{i=1}^{n} \log f(x_i; \theta) \\
&= \sum_{i,k} y_{ik} \log t_{ik} + \mathcal{L}(\theta; X), \qquad (7.5)
\end{aligned}$$

where $\sum_{i,k} y_{ik} \log t_{ik}$ can be reformulated as

$$\sum_{i,k} y_{ik} \log t_{ik} = \sum_{i=1}^{n}\sum_{k=1}^{K} y_{ik} \log\big(p(Y_{ik}=1 \mid x_i; \theta)\big) = \sum_{i=1}^{n} \log\big(p(y_i \mid x_i; \theta)\big) = \log\big(p(Y \mid X; \theta)\big),$$

since, for each i, exactly one $y_{ik}$ equals one.

As a result, the relationship (7.5) can be rewritten as

$$\mathcal{L}(\theta; X) = \mathcal{L}_C(\theta; Z) - \log\big(p(Y \mid X; \theta)\big). \qquad (7.6)$$


Likelihood Maximization

The complete log-likelihood cannot be assessed because the variables $y_{ik}$ are unknown. However, it is possible to estimate the value of the log-likelihood by taking expectations in (7.6), conditionally on a current value of $\theta$:

$$\mathcal{L}(\theta; X) = \underbrace{\mathbb{E}_{Y \sim p(\cdot \mid X; \theta^{(t)})}\big[\mathcal{L}_C(\theta; X, Y)\big]}_{Q(\theta, \theta^{(t)})} + \underbrace{\mathbb{E}_{Y \sim p(\cdot \mid X; \theta^{(t)})}\big[-\log p(Y \mid X; \theta)\big]}_{H(\theta, \theta^{(t)})}.$$

In this expression, $H(\theta, \theta^{(t)})$ is an entropy term and $Q(\theta, \theta^{(t)})$ is the conditional expectation of the complete log-likelihood. Let us define an increment of the log-likelihood as $\Delta\mathcal{L} = \mathcal{L}(\theta^{(t+1)}; X) - \mathcal{L}(\theta^{(t)}; X)$. Then, $\theta^{(t+1)} = \arg\max_\theta Q(\theta, \theta^{(t)})$ also increases the log-likelihood:

$$\Delta\mathcal{L} = \underbrace{\big(Q(\theta^{(t+1)}, \theta^{(t)}) - Q(\theta^{(t)}, \theta^{(t)})\big)}_{\ge 0 \text{ by definition of iteration } t+1} + \underbrace{\big(H(\theta^{(t+1)}, \theta^{(t)}) - H(\theta^{(t)}, \theta^{(t)})\big)}_{\ge 0 \text{ by Jensen's inequality}}.$$

Therefore, it is possible to maximize the likelihood by optimizing $Q(\theta, \theta^{(t)})$. The relationship between $Q(\theta, \theta')$ and $\mathcal{L}(\theta; X)$ is developed in deeper detail in Appendix F, to show how the value of $\mathcal{L}(\theta; X)$ can be recovered from $Q(\theta, \theta^{(t)})$.

For the mixture model problem, $Q(\theta, \theta')$ is

$$Q(\theta, \theta') = \mathbb{E}_{Y \sim p(Y \mid X; \theta')}\big[\mathcal{L}_C(\theta; X, Y)\big] = \sum_{i,k} p(Y_{ik}=1 \mid x_i; \theta') \log\big(\pi_k f_k(x_i; \theta_k)\big) = \sum_{i=1}^{n}\sum_{k=1}^{K} t_{ik}(\theta') \log\big(\pi_k f_k(x_i; \theta_k)\big). \qquad (7.7)$$

Due to its similarity with the expression of the complete likelihood (7.2), $Q(\theta, \theta')$ is also known as the weighted likelihood. In (7.7), the weights $t_{ik}(\theta')$ are the posterior probabilities of cluster membership.

Hence, the EM algorithm sketched above results in:

• Initialization (not iterated): choice of the initial parameter $\theta^{(0)}$;

• E-Step: evaluation of $Q(\theta, \theta^{(t)})$, using $t_{ik}(\theta^{(t)})$ (7.4) in (7.7);

• M-Step: calculation of $\theta^{(t+1)} = \arg\max_\theta Q(\theta, \theta^{(t)})$.


Gaussian Model

In the particular case of a Gaussian mixture model with a common covariance matrix $\Sigma$ and different mean vectors $\mu_k$, the mixture density is

$$f(x_i; \theta) = \sum_{k=1}^{K} \pi_k f_k(x_i; \theta_k) = \sum_{k=1}^{K} \pi_k \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\Big\{-\frac{1}{2}(x_i - \mu_k)^\top \Sigma^{-1} (x_i - \mu_k)\Big\}.$$

At the E-step, the posterior probabilities $t_{ik}$ are computed as in (7.4) with the current parameters $\theta^{(t)}$; then, the M-step maximizes $Q(\theta, \theta^{(t)})$ (7.7), whose form is as follows:

$$\begin{aligned}
Q(\theta, \theta^{(t)}) &= \sum_{i,k} t_{ik} \log(\pi_k) - \sum_{i,k} t_{ik} \log\big((2\pi)^{p/2} |\Sigma|^{1/2}\big) - \frac{1}{2}\sum_{i,k} t_{ik} (x_i - \mu_k)^\top \Sigma^{-1} (x_i - \mu_k) \\
&= \sum_{k} t_k \log(\pi_k) - \underbrace{\frac{np}{2}\log(2\pi)}_{\text{constant term}} - \frac{n}{2}\log(|\Sigma|) - \frac{1}{2}\sum_{i,k} t_{ik} (x_i - \mu_k)^\top \Sigma^{-1} (x_i - \mu_k) \\
&\equiv \sum_{k} t_k \log(\pi_k) - \frac{n}{2}\log(|\Sigma|) - \sum_{i,k} t_{ik} \Big(\frac{1}{2}(x_i - \mu_k)^\top \Sigma^{-1} (x_i - \mu_k)\Big), \qquad (7.8)
\end{aligned}$$

where

$$t_k = \sum_{i=1}^{n} t_{ik}. \qquad (7.9)$$

The M-step, which maximizes this expression with respect to $\theta$, applies the following updates, defining $\theta^{(t+1)}$:

$$\pi_k^{(t+1)} = \frac{t_k}{n}, \qquad (7.10)$$
$$\mu_k^{(t+1)} = \frac{\sum_i t_{ik}\, x_i}{t_k}, \qquad (7.11)$$
$$\Sigma^{(t+1)} = \frac{1}{n}\sum_k W_k, \qquad (7.12)$$
$$\text{with } W_k = \sum_i t_{ik} (x_i - \mu_k)(x_i - \mu_k)^\top. \qquad (7.13)$$

The derivations are detailed in Appendix G.
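As a concrete illustration of (7.4) and (7.9)–(7.13), the following sketch (an illustrative Python reimplementation of the common-covariance Gaussian model, not the thesis code) performs one EM iteration.

    import numpy as np
    from scipy.stats import multivariate_normal

    def em_iteration(X, pi, mu, Sigma):
        """One E-step / M-step for a Gaussian mixture with a shared covariance matrix.

        X: (n, p) data; pi: (K,) proportions; mu: (K, p) means; Sigma: (p, p).
        """
        n, p = X.shape
        K = len(pi)
        # E-step: posterior probabilities t_ik (eq. 7.4)
        dens = np.column_stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma)
                                for k in range(K)])
        T = dens / dens.sum(axis=1, keepdims=True)
        # M-step: proportions, means and common covariance (eqs. 7.9-7.13)
        tk = T.sum(axis=0)                        # eq. 7.9
        pi_new = tk / n                           # eq. 7.10
        mu_new = (T.T @ X) / tk[:, None]          # eq. 7.11
        Sigma_new = np.zeros((p, p))
        for k in range(K):                        # W_k summed over clusters, divided by n
            R = X - mu_new[k]
            Sigma_new += (T[:, k, None] * R).T @ R
        Sigma_new /= n
        return pi_new, mu_new, Sigma_new, T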

7.2 Feature Selection in Model-Based Clustering

When common covariance matrices are assumed, Gaussian mixtures are related to LDA, with partitions defined by linear decision rules. When every cluster has its own


covariance matrix $\Sigma_k$, Gaussian mixtures are associated with quadratic discriminant analysis (QDA), with quadratic boundaries.

In the high-dimensional, low-sample setting, numerical issues appear in the estimation of the covariance matrix. To avoid those singularities, regularization may be applied. A regularized trade-off between LDA and QDA (RDA) was proposed by Friedman (1989). Bensmail and Celeux (1996) extended this algorithm by rewriting the covariance matrix in terms of its eigenvalue decomposition, $\Sigma_k = \lambda_k D_k A_k D_k^\top$ (Banfield and Raftery, 1993). These regularization schemes address singularity and stability issues, but they do not induce parsimonious models.

In this chapter, we review some techniques to induce sparsity in model-based clustering algorithms. This sparsity refers to the rule that assigns examples to classes: clustering is still performed in the original p-dimensional space, but the decision rule can be expressed with only a few coordinates of this high-dimensional space.

7.2.1 Based on Penalized Likelihood

Penalized log-likelihood maximization is a popular estimation technique for mixture models. It is typically achieved by the EM algorithm, using mixture models for which the allocation of examples is expressed as a simple function of the input features. For example, for Gaussian mixtures with a common covariance matrix, the log-ratio of posterior probabilities is a linear function of x:

$$\log\left(\frac{p(Y_k = 1 \mid x)}{p(Y_\ell = 1 \mid x)}\right) = x^\top \Sigma^{-1}(\mu_k - \mu_\ell) - \frac{1}{2}(\mu_k + \mu_\ell)^\top \Sigma^{-1}(\mu_k - \mu_\ell) + \log\frac{\pi_k}{\pi_\ell}.$$

In this model, a simple way of introducing sparsity in the discriminant vectors $\Sigma^{-1}(\mu_k - \mu_\ell)$ is to constrain $\Sigma$ to be diagonal and to favor sparse means $\mu_k$. Indeed, for Gaussian mixtures with a common diagonal covariance matrix, if all means have the same value on dimension j, then variable j is useless for class allocation and can be discarded. The means can be penalized by the $L_1$ norm,

$$\lambda \sum_{k=1}^{K} \sum_{j=1}^{p} |\mu_{kj}|,$$

as proposed by Pan et al. (2006) and Pan and Shen (2007). Zhou et al. (2009) consider more complex penalties on full covariance matrices:

$$\lambda_1 \sum_{k=1}^{K}\sum_{j=1}^{p} |\mu_{kj}| + \lambda_2 \sum_{k=1}^{K}\sum_{j=1}^{p}\sum_{m=1}^{p} \big|(\Sigma_k^{-1})_{jm}\big|.$$

In their algorithm, they make use of the graphical Lasso to estimate the covariances. Even if their formulation induces sparsity on the parameters, their combination of $L_1$ penalties does not directly target decision rules based on few variables, and thus does not guarantee parsimonious models.


Guo et al. (2010) propose a variation with a Pairwise Fusion Penalty (PFP):

$$\lambda \sum_{j=1}^{p} \sum_{1 \le k < k' \le K} |\mu_{kj} - \mu_{k'j}|.$$

This PFP regularization does not shrink the means to zero, but towards each other. If the jth components of all cluster means are driven to the same value, that variable can be considered as non-informative.

An $L_{1,\infty}$ penalty is used by Wang and Zhu (2008) and Kuan et al. (2010) to penalize the likelihood, encouraging null groups of features:

$$\lambda \sum_{j=1}^{p} \big\|(\mu_{1j}, \mu_{2j}, \ldots, \mu_{Kj})\big\|_\infty.$$

One group is defined for each variable j, as the set of the jth components of the K means, $(\mu_{1j}, \ldots, \mu_{Kj})$. The $L_{1,\infty}$ penalty forces zeros at the group level, favoring the removal of the corresponding feature. This method seems to produce parsimonious models and good partitions within a reasonable computing time; in addition, the code is publicly available. Xie et al. (2008b) apply a group-Lasso penalty. Their principle describes a vertical mean grouping (VMG, with the same groups as Xie et al. (2008a)) and a horizontal mean grouping (HMG). VMG achieves real feature selection, because it forces null values for the same variable in all cluster means:

$$\lambda \sqrt{K} \sum_{j=1}^{p} \sqrt{\sum_{k=1}^{K} \mu_{kj}^2}.$$

The clustering algorithm of VMG differs from ours, but the proposed group penalty is the same; however, no code is available on the authors' website that would allow testing.

The optimization of a penalized likelihood by means of an EM algorithm can be reformulated by rewriting the maximization expressions of the M-step as a penalized optimal scoring regression. Roth and Lange (2004) implemented it for two-cluster problems, using an $L_1$ penalty to encourage sparsity in the discriminant vector. The generalization from quadratic to non-quadratic penalties is quickly outlined in their work. We extend this work by considering an arbitrary number of clusters and by formalizing the link between penalized optimal scoring and penalized likelihood estimation.

7.2.2 Based on Model Variants

The algorithm proposed by Law et al. (2004) takes a different stance. The authors define feature relevancy in terms of conditional independence: the jth feature is presumed uninformative if its distribution is independent of the class labels. The density is expressed as


$$f(x_i \mid \phi, \pi, \theta, \nu) = \sum_{k=1}^{K} \pi_k \prod_{j=1}^{p} \big[f(x_{ij} \mid \theta_{jk})\big]^{\phi_j} \big[h(x_{ij} \mid \nu_j)\big]^{1-\phi_j},$$

where $f(\cdot \mid \theta_{jk})$ is the distribution function for relevant features and $h(\cdot \mid \nu_j)$ is the distribution function for the irrelevant ones. The binary vector $\phi = (\phi_1, \phi_2, \ldots, \phi_p)$ represents relevance, with $\phi_j = 1$ if the jth feature is informative and $\phi_j = 0$ otherwise. The saliency of variable j is then formalized as $\rho_j = P(\phi_j = 1)$, so that all $\phi_j$ must be treated as missing variables. Thus, the set of parameters is $\{\pi_k, \theta_{jk}, \nu_j, \rho_j\}$; their estimation is done by means of the EM algorithm (Law et al., 2004).

An original and recent technique is the Fisher-EM algorithm, proposed by Bouveyron and Brunet (2012b,a). Fisher-EM is a modified version of EM that runs in a latent space. This latent space is defined by an orthogonal projection matrix $U \in \mathbb{R}^{p \times K-1}$, which is updated inside the EM loop with a new step, called the Fisher step (F-step from now on), which maximizes the multi-class Fisher's criterion

$$\operatorname{tr}\Big((U^\top \Sigma_W U)^{-1} U^\top \Sigma_B U\Big), \qquad (7.14)$$

so as to maximize the separability of the data. The E-step is the standard one, computing the posterior probabilities. Then, the F-step updates the matrix that projects the data onto the latent space. Finally, the M-step estimates the parameters by maximizing the conditional expectation of the complete log-likelihood. Those parameters can be rewritten as functions of the projection matrix U and of the model parameters in the latent space, such that the U matrix enters the M-step equations.

To induce feature selection, Bouveyron and Brunet (2012a) suggest three possibilities. The first one results in the best sparse orthogonal approximation $\tilde{U}$ of the matrix U that maximizes (7.14). This sparse approximation is defined as the solution of

$$\min_{\tilde{U} \in \mathbb{R}^{p \times K-1}} \big\|X_U - X\tilde{U}\big\|_F^2 + \lambda \sum_{k=1}^{K-1} \big\|\tilde{u}_k\big\|_1,$$

where $X_U = XU$ is the input data projected onto the non-sparse space and $\tilde{u}_k$ is the kth column vector of the projection matrix $\tilde{U}$. The second possibility is inspired by Qiao et al. (2009), and reformulates the Fisher's discriminant (7.14) used to compute the projection matrix as a regression criterion penalized by a mixture of Lasso and Elastic net:

$$\min_{A, B \in \mathbb{R}^{p \times K-1}} \sum_{k=1}^{K} \big\|R_W^{-\top} H_{B,k} - A B^\top H_{B,k}\big\|_2^2 + \rho \sum_{j=1}^{K-1} \beta_j^\top \Sigma_W \beta_j + \lambda \sum_{j=1}^{K-1} \big\|\beta_j\big\|_1 \quad \text{s.t. } A^\top A = I_{K-1},$$

where $H_B \in \mathbb{R}^{p \times K}$ is a matrix defined conditionally on the posterior probabilities $t_{ik}$, satisfying $H_B H_B^\top = \Sigma_B$, and $H_{B,k}$ is the kth column of $H_B$; $R_W \in \mathbb{R}^{p \times p}$ is an upper


triangular matrix resulting from the Cholesky decomposition of $\Sigma_W$; $\Sigma_W$ and $\Sigma_B$ are the $p \times p$ within-class and between-class covariance matrices in the observation space; $A \in \mathbb{R}^{p \times K-1}$ and $B \in \mathbb{R}^{p \times K-1}$ are the solutions of the optimization problem, such that $B = [\beta_1, \ldots, \beta_{K-1}]$ is the best sparse approximation of U.

The last possibility approximates the solution of the Fisher's discriminant (7.14) by the solution of the following constrained optimization problem:

$$\min_{\tilde{U} \in \mathbb{R}^{p \times K-1}} \sum_{j=1}^{p} \big\|\Sigma_{B,j} - \tilde{U}\tilde{U}^\top \Sigma_{B,j}\big\|_2^2 \quad \text{s.t. } \tilde{U}^\top \tilde{U} = I_{K-1},$$

where $\Sigma_{B,j}$ is the jth column of the between-class covariance matrix in the observation space. This problem can be solved by the penalized version of the singular value decomposition proposed by Witten et al. (2009), resulting in a sparse approximation of U.

To comply with the constraint stating that the columns of U are orthogonal, the first and the second options must be followed by a singular value decomposition of $\tilde{U}$ to recover orthogonality. This is not necessary with the third option, since the penalized version of SVD already guarantees orthogonality.

However, there is a lack of guarantees regarding convergence. Bouveyron states: “the update of the orientation matrix U in the F-step is done by maximizing the Fisher criterion and not by directly maximizing the expected complete log-likelihood as required in the EM algorithm theory. From this point of view, the convergence of the Fisher-EM algorithm cannot therefore be guaranteed.” Immediately after this paragraph, we can read that, under certain assumptions, their algorithm converges: “the model [...] which assumes the equality and the diagonality of covariance matrices: the F-step of the Fisher-EM algorithm satisfies the convergence conditions of the EM algorithm theory and the convergence of the Fisher-EM algorithm can be guaranteed in this case. For the other discriminant latent mixture models, although the convergence of the Fisher-EM procedure cannot be guaranteed, our practical experience has shown that the Fisher-EM algorithm rarely fails to converge with these models if correctly initialized.”

7.2.3 Based on Model Selection

Some clustering algorithms recast the feature selection problem as a model selection problem. Accordingly, Raftery and Dean (2006) model the observations as a mixture of Gaussian distributions. To discover a subset of relevant features (and its superfluous complement), they define three subsets of variables:

• X(1): set of selected relevant variables;

• X(2): set of variables being considered for inclusion in or exclusion from X(1);

• X(3): set of non-relevant variables.


With those subsets, they define two different models, where Y is the partition to consider:

• M1:
$$f(X \mid Y) = f\big(X^{(1)}, X^{(2)}, X^{(3)} \mid Y\big) = f\big(X^{(3)} \mid X^{(2)}, X^{(1)}\big)\, f\big(X^{(2)} \mid X^{(1)}\big)\, f\big(X^{(1)} \mid Y\big)$$

• M2:
$$f(X \mid Y) = f\big(X^{(1)}, X^{(2)}, X^{(3)} \mid Y\big) = f\big(X^{(3)} \mid X^{(2)}, X^{(1)}\big)\, f\big(X^{(2)}, X^{(1)} \mid Y\big)$$

Model M1 means that the variables in X(2) are independent of the clustering Y; model M2 states that the variables in X(2) depend on the clustering Y. To simplify the algorithm, the subset X(2) is only updated one variable at a time. Therefore, deciding the relevance of variable X(2) amounts to a model selection between M1 and M2. The selection is done via the Bayes factor

$$B_{12} = \frac{f(X \mid M_1)}{f(X \mid M_2)},$$

where the high-dimensional factor $f\big(X^{(3)} \mid X^{(2)}, X^{(1)}\big)$ cancels from the ratio:

$$B_{12} = \frac{f\big(X^{(1)}, X^{(2)}, X^{(3)} \mid M_1\big)}{f\big(X^{(1)}, X^{(2)}, X^{(3)} \mid M_2\big)} = \frac{f\big(X^{(2)} \mid X^{(1)}, M_1\big)\, f\big(X^{(1)} \mid M_1\big)}{f\big(X^{(2)}, X^{(1)} \mid M_2\big)}.$$

This factor is approximated, since the integrated likelihoods $f\big(X^{(1)} \mid M_1\big)$ and $f\big(X^{(2)}, X^{(1)} \mid M_2\big)$ are difficult to calculate exactly; Raftery and Dean (2006) use the BIC approximation. The computation of $f\big(X^{(2)} \mid X^{(1)}, M_1\big)$, when there is only one variable in X(2), can be represented as a linear regression of variable X(2) on the variables in X(1). There is also a BIC approximation for this term.

Maugis et al. (2009a) have proposed a variation of the algorithm developed by Raftery and Dean. They define three subsets of variables: the relevant and irrelevant subsets (X(1) and X(3)) remain the same, but X(2) is reformulated as a subset of relevant variables that explains the irrelevant ones through a multidimensional regression. This algorithm also uses a backward stepwise strategy, instead of the forward stepwise search used by Raftery and Dean (2006). Their algorithm allows to define blocks of indivisible variables that, in certain situations, improve the clustering and its interpretability.

Both algorithms are well motivated and appear to produce good results; however, the quantity of computation needed to test the different subsets of variables requires a huge computation time. In practice, they cannot be used for the amount of data considered in this thesis.


8 Theoretical Foundations

In this chapter, we develop Mix-GLOSS, which uses the GLOSS algorithm, conceived for supervised classification (see Chapter 5), to solve clustering problems. The goal here is similar, that is, providing an assignment of examples to clusters based on few features.

We use a modified version of the EM algorithm whose M-step is formulated as a penalized linear regression of a scaled indicator matrix, that is, a penalized optimal scoring problem. This idea was originally proposed by Hastie and Tibshirani (1996), to derive reduced-rank decision rules using fewer than K − 1 discriminant directions. Their motivation was mainly driven by stability issues; no sparsity-inducing mechanism was introduced in the construction of the discriminant directions. Roth and Lange (2004) pursued this idea for binary clustering problems, where sparsity was introduced by a Lasso penalty applied to the OS problem. Besides extending the work of Roth and Lange (2004) to an arbitrary number of clusters, we draw links between the OS penalty and the parameters of the Gaussian model.

In the subsequent sections, we provide the principles that allow to solve the M-step as an optimal scoring problem. The feature selection technique is embedded by means of a group-Lasso penalty. We must then guarantee that the equivalence between the M-step and the OS problem holds for our penalty. As with GLOSS, this is accomplished with a variational approach to the group-Lasso. Finally, some considerations regarding the criterion that is optimized with this modified EM are provided.

8.1 Resolving EM with Optimal Scoring

In the previous chapters, EM was presented as an iterative algorithm that computes a maximum likelihood estimate through the maximization of the expected complete log-likelihood. This section explains how a penalized OS regression embedded into an EM algorithm produces a penalized likelihood estimate.

8.1.1 Relationship Between the M-Step and Linear Discriminant Analysis

LDA is typically used in a supervised learning framework for classification and dimension reduction. It looks for a projection of the data where the ratio of between-class variance to within-class variance is maximized (see Appendix C). Classification in the LDA domain is based on the Mahalanobis distance

$$d(x_i, \mu_k) = (x_i - \mu_k)^\top \Sigma_W^{-1} (x_i - \mu_k),$$

where $\mu_k$ are the p-dimensional centroids and $\Sigma_W$ is the $p \times p$ common within-class covariance matrix.


The likelihood equations of the M-step, (7.11) and (7.12), can be interpreted as the mean and covariance estimates of a weighted and augmented LDA problem (Hastie and Tibshirani, 1996), where the n observations are replicated K times and weighted by $t_{ik}$ (the posterior probabilities computed at the E-step).

Having replicated the data vectors, Hastie and Tibshirani (1996) remark that the parameters maximizing the mixture likelihood in the M-step of the EM algorithm, (7.11) and (7.12), can also be defined as the maximizers of the weighted and augmented likelihood

$$2\, l_{\text{weight}}(\mu, \Sigma) = -\sum_{i=1}^{n}\sum_{k=1}^{K} t_{ik}\, d(x_i, \mu_k) - n \log(|\Sigma_W|),$$

which arises when considering a weighted and augmented LDA problem. This viewpoint provides the basis for an alternative maximization of the penalized maximum likelihood in Gaussian mixtures.

8.1.2 Relationship Between Optimal Scoring and Linear Discriminant Analysis

The equivalence between penalized optimal scoring problems and penalized linear discriminant analysis has already been detailed in Section 4.1, in the supervised learning framework. It is a critical part of the link between the M-step of an EM algorithm and optimal scoring regression.

8.1.3 Clustering Using Penalized Optimal Scoring

The solution of the penalized optimal scoring regression in the M-step is a coefficient matrix B_OS, analytically related to the Fisher's discriminant directions B_LDA for the data (X, Y), where Y is the current (hard or soft) cluster assignment. In order to compute the posterior probabilities $t_{ik}$ in the E-step, the distance between the samples $x_i$ and the centroids $\mu_k$ must be evaluated. Depending on whether we are working in the input, OS or LDA domain, different expressions are used for the distances (see Section 4.2.2 for more details). Mix-GLOSS works in the LDA domain, based on the following expression:

$$d(x_i, \mu_k) = \big\|(x_i - \mu_k)\, B_{LDA}\big\|_2^2 - 2\log(\pi_k).$$

This distance defines the computation of the posterior probabilities $t_{ik}$ in the E-step (see Section 4.2.3). Putting together all those elements, the complete clustering algorithm can be summarized as follows:


1. Initialize the membership matrix Y (for example, by the K-means algorithm).

2. Solve the p-OS problem as
$$B_{OS} = \big(X^\top X + \lambda\Omega\big)^{-1} X^\top Y \Theta,$$
where $\Theta$ are the K − 1 leading eigenvectors of $Y^\top X \big(X^\top X + \lambda\Omega\big)^{-1} X^\top Y$ (a small sketch of this solve is given after this list).

3. Map X to the LDA domain: $X_{LDA} = X B_{OS} D$, with $D = \operatorname{diag}\big(\alpha_k^{-1}(1-\alpha_k^2)^{-\frac{1}{2}}\big)$.

4. Compute the centroids M in the LDA domain.

5. Evaluate the distances in the LDA domain.

6. Translate the distances into posterior probabilities $t_{ik}$, with
$$t_{ik} \propto \exp\left[-\frac{d(x, \mu_k) - 2\log(\pi_k)}{2}\right]. \qquad (8.1)$$

7. Update the labels using the posterior probability matrix: Y = T.

8. Go back to step 2 and iterate until the $t_{ik}$ converge.

Items 2 to 5 can be interpreted as the M-step, and Item 6 as the E-step, in this alternative view of the EM algorithm for Gaussian mixtures.
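As an illustration of the closed-form solve in step 2, the sketch below implements it in Python for a generic quadratic penalty matrix Ω (in GLOSS, the group-Lasso is handled through an adaptive quadratic penalty of this form, so only the inner quadratic problem is shown; the exact normalization of Θ follows the GLOSS chapter and is not reproduced here).

    import numpy as np

    def penalized_optimal_scoring(X, Y, Omega, lam):
        """Step 2: B_OS = (X'X + lam*Omega)^{-1} X'Y Theta, with Theta the leading eigenvectors.

        X: (n, p) centered data; Y: (n, K) hard or soft memberships; Omega: (p, p) penalty.
        """
        K = Y.shape[1]
        A = X.T @ X + lam * Omega
        XtY = X.T @ Y
        M = Y.T @ X @ np.linalg.solve(A, XtY)     # Y'X (X'X + lam*Omega)^{-1} X'Y
        eigval, eigvec = np.linalg.eigh(M)
        order = np.argsort(eigval)[::-1][:K - 1]  # K-1 leading eigenvectors
        Theta = eigvec[:, order]
        B_os = np.linalg.solve(A, XtY @ Theta)
        return B_os, Theta, eigval[order]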

8.1.4 From Sparse Optimal Scoring to Sparse Linear Discriminant Analysis

In the previous section, we sketched a clustering algorithm that replaces the M-step with a penalized OS regression. This modified version of EM holds for any quadratic penalty. We extend this equivalence to sparsity-inducing penalties through the quadratic variational approach to the group-Lasso provided in Section 4.3. We now look for a formal equivalence between this penalty and penalized maximum likelihood for Gaussian mixtures.

8.2 Optimized Criterion

In the classical EM for Gaussian mixtures, the M-step maximizes the weighted likelihood $Q(\theta, \theta')$ (7.7), so as to maximize the likelihood $\mathcal{L}(\theta)$ (see Section 7.1.2). Replacing the M-step by a penalized


optimal scoring problem is possible, and the link between the penalized optimal scoring problem and penalized LDA holds, but it remains to relate this penalized LDA problem to a penalized maximum likelihood criterion for the Gaussian mixture.

This penalized likelihood cannot be rigorously interpreted as a maximum a posteriori criterion, in particular because the penalty only operates on the covariance matrix $\Sigma$ (there is no prior on the means and proportions of the mixture). We however believe that the Bayesian interpretation provides some insight, and we detail it in what follows.

8.2.1 A Bayesian Derivation

This section sketches a Bayesian treatment of inference, limited to our present needs, where penalties are to be interpreted as prior distributions over the parameters of the probabilistic model to be estimated. Further details can be found in Bishop (2006, Section 2.3.6) and in Gelman et al. (2003, Section 3.6).

The model proposed in this thesis considers a classical maximum likelihood estimation for the means and a penalized common covariance matrix. This penalization can be interpreted as arising from a prior on this parameter.

The prior over the covariance matrix of a Gaussian variable is classically expressed through a Wishart distribution, since it is a conjugate prior:

$$f(\Sigma \mid \Lambda_0, \nu_0) = \frac{1}{2^{np/2}\, |\Lambda_0|^{n/2}\, \Gamma_p(\frac{n}{2})}\, |\Sigma^{-1}|^{\frac{\nu_0 - p - 1}{2}} \exp\Big\{-\frac{1}{2}\operatorname{tr}\big(\Lambda_0^{-1}\Sigma^{-1}\big)\Big\},$$

where $\nu_0$ is the number of degrees of freedom of the distribution, $\Lambda_0$ is a $p \times p$ scale matrix, and where $\Gamma_p$ is the multivariate gamma function, defined as

$$\Gamma_p(n/2) = \pi^{p(p-1)/4} \prod_{j=1}^{p} \Gamma\big(n/2 + (1-j)/2\big).$$

The posterior distribution can be maximized similarly to the likelihood through the


maximization of

$$\begin{aligned}
Q(\theta, \theta') &+ \log\big(f(\Sigma \mid \Lambda_0, \nu_0)\big) \\
&= \sum_{k=1}^{K} t_k \log \pi_k - \frac{(n+1)p}{2}\log 2 - \frac{n}{2}\log|\Lambda_0| - \frac{p(p+1)}{4}\log(\pi) \\
&\quad - \sum_{j=1}^{p} \log \Gamma\Big(\frac{n}{2} + \frac{1-j}{2}\Big) - \frac{\nu_n - p - 1}{2}\log|\Sigma| - \frac{1}{2}\operatorname{tr}\big(\Lambda_n^{-1}\Sigma^{-1}\big) \\
&\equiv \sum_{k=1}^{K} t_k \log \pi_k - \frac{n}{2}\log|\Lambda_0| - \frac{\nu_n - p - 1}{2}\log|\Sigma| - \frac{1}{2}\operatorname{tr}\big(\Lambda_n^{-1}\Sigma^{-1}\big), \qquad (8.2)
\end{aligned}$$

with
$$t_k = \sum_{i=1}^{n} t_{ik}, \qquad \nu_n = \nu_0 + n, \qquad \Lambda_n^{-1} = \Lambda_0^{-1} + S_0, \qquad S_0 = \sum_{i=1}^{n}\sum_{k=1}^{K} t_{ik} (x_i - \mu_k)(x_i - \mu_k)^\top.$$

Details of these calculations can be found in textbooks (for example, Bishop, 2006; Gelman et al., 2003).

8.2.2 Maximum a Posteriori Estimator

The maximization of (8.2) with respect to $\mu_k$ and $\pi_k$ is of course not affected by the additional prior term, where only the covariance $\Sigma$ intervenes. The MAP estimator for $\Sigma$ is simply obtained by differentiating (8.2) with respect to $\Sigma$. The details of the calculations follow the same lines as the ones for maximum likelihood, detailed in Appendix G. The resulting estimator for $\Sigma$ is

$$\Sigma_{MAP} = \frac{1}{\nu_0 + n - p - 1}\big(\Lambda_0^{-1} + S_0\big), \qquad (8.3)$$

where $S_0$ is the matrix defined in Equation (8.2). The maximum a posteriori estimator of the within-class covariance matrix (8.3) can thus be identified with the penalized within-class variance (4.19) resulting from the p-OS regression (4.16a), if $\nu_0$ is chosen to be p + 1 and setting $\Lambda_0^{-1} = \lambda\Omega$, where $\Omega$ is the penalty matrix from the group-Lasso regularization (4.25).


9 Mix-GLOSS Algorithm

Mix-GLOSS is an algorithm for unsupervised classification that embeds feature selection, resulting in parsimonious decision rules. It is based on the GLOSS algorithm developed in Chapter 5, which has been adapted for clustering. In this chapter, I describe the details of the implementation of Mix-GLOSS and of the model selection mechanism.

9.1 Mix-GLOSS

The implementation of Mix-GLOSS involves three nested loops, as schemed in Figure 9.1. The inner one is an EM algorithm that, for a given value of the regularization parameter λ, iterates between an M-step, where the parameters of the model are estimated, and an E-step, where the corresponding posterior probabilities are computed. The main outputs of the EM are the coefficient matrix B, which projects the input data X onto the best subspace (in Fisher's sense), and the posteriors $t_{ik}$.

When several values of the penalty parameter are tested, we give them to the algorithm in ascending order, and the algorithm is initialized with the solution found for the previous λ value. This process continues until all the penalty parameter values have been tested, if a vector of penalty parameters was provided, or until a given sparsity is achieved, as measured by the number of variables estimated to be relevant.

The outer loop implements complete repetitions of the clustering algorithm for all the penalty parameter values, with the purpose of choosing the best execution. This loop alleviates the local minima issues by resorting to multiple initializations of the partition.

9.1.1 Outer Loop: Whole Algorithm Repetitions

This loop performs a user-defined number of repetitions of the clustering algorithm. It takes as inputs:

• the centered n × p feature matrix X;

• the vector of penalty parameter values to be tried (an option is to provide an empty vector and let the algorithm set trial values automatically);

• the number of clusters K;

• the maximum number of iterations for the EM algorithm;

• the convergence tolerance for the EM algorithm;

• the number of whole repetitions of the clustering algorithm;


Figure 9.1: Mix-GLOSS loops scheme.

• a p × (K − 1) initial coefficient matrix (optional);

• an n × K initial posterior probability matrix (optional).

For each repetition of the algorithm, an initial label matrix Y is needed. This matrix may contain either hard or soft assignments. If no such matrix is available, K-means is used to initialize the process. If we have an initial guess for the coefficient matrix B, it can also be fed into Mix-GLOSS to warm-start the process.

9.1.2 Penalty Parameter Loop

The penalty parameter loop goes through all the values of the input vector λ. These values are sorted in ascending order, so that the resulting B and Y matrices can be used to warm-start the EM loop for the next value of the penalty parameter. If some λ value results in a null coefficient matrix, the algorithm halts. We have verified that the warm-start implemented here reduces the computation time by a factor of 8, with respect to using a null B matrix and a K-means execution for the initial Y label matrix.

Mix-GLOSS may be fed with an empty vector of penalty parameters, in which case a first non-penalized execution of Mix-GLOSS is done, and its resulting coefficient matrix B and posterior matrix Y are used to estimate a trial value of λ that should remove about 10% of the relevant features. This estimation is repeated until a minimum number of relevant variables is reached. The parameter that sets the estimated percentage


of variables to be removed with the next penalty parameter can be modified to make feature selection more or less aggressive.

Algorithm 2 details the implementation of the automatic selection of the penalty parameter. If the alternate variational approach from Appendix D is used, we have to replace Equation (4.32b) by (D.10b).

Algorithm 2: Automatic selection of λ

Input: X, K, λ = ∅, minVAR
Initialize:
  B ← 0
  Y ← K-means(X, K)
Run non-penalized Mix-GLOSS:
  λ ← 0
  (B, Y) ← Mix-GLOSS(X, K, B, Y, λ)
lastLAMBDA ← false
repeat
  Estimate λ:
    Compute the gradient at βj = 0:
      ∂J(B)/∂βj |_{βj=0} = x_j^T ( Σ_{m≠j} x_m β_m − YΘ )
    Compute λmax for every feature, using (4.32b):
      λmax_j = (1/w_j) ‖ ∂J(B)/∂βj |_{βj=0} ‖_2
    Choose λ so as to remove 10% of the relevant features
  Run penalized Mix-GLOSS:
    (B, Y) ← Mix-GLOSS(X, K, B, Y, λ)
  if the number of relevant variables in B > minVAR then
    lastLAMBDA ← false
  else
    lastLAMBDA ← true
  end if
until lastLAMBDA

Output: B, L(θ), t_ik, π_k, μ_k, Σ, Y for every λ in the solution path
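The λmax computation and the removal rule can be illustrated as below (illustrative Python with B = 0, so that the gradient reduces to −X_j^T YΘ; the exact rule used by Mix-GLOSS is the one stated in Algorithm 2).

    import numpy as np

    def next_lambda(X, YTheta, weights, drop_fraction=0.10):
        """Per-feature lambda_max at B = 0, and a lambda removing about drop_fraction of the features.

        X: (n, p) centered data; YTheta: (n, K-1) scaled indicator matrix Y*Theta;
        weights: (p,) group weights w_j.
        """
        grad0 = -X.T @ YTheta                              # gradient of the fit term at B = 0
        lambda_max = np.linalg.norm(grad0, axis=1) / weights
        # features whose lambda_max lies below the chosen lambda are driven to zero,
        # so the drop_fraction quantile removes roughly that share of features
        return np.quantile(lambda_max, drop_fraction), lambda_max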

9.1.3 Inner Loop: EM Algorithm

The inner loop implements the actual clustering algorithm, by means of successive maximizations of a penalized likelihood criterion. Once convergence of the posterior probabilities $t_{ik}$ is achieved, the maximum a posteriori rule is applied to classify all examples. Algorithm 3 describes this inner loop.


Algorithm 3: Mix-GLOSS for one value of λ

Input: X, K, B0, Y0, λ
Initialize:
  if (B0, Y0) available then
    B_OS ← B0, Y ← Y0
  else
    B_OS ← 0, Y ← K-means(X, K)
  end if
  convergenceEM ← false, tolEM ← 1e-3
repeat
  M-step:
    (B_OS, Θ, α) ← GLOSS(X, Y, B_OS, λ)
    X_LDA = X B_OS diag( α^{-1} (1 − α²)^{-1/2} )
    π_k, μ_k and Σ as per (7.10), (7.11) and (7.12)
  E-step:
    t_ik as per (8.1)
    L(θ) as per (8.2)
  if (1/n) Σ_i |t_ik − y_ik| < tolEM then
    convergenceEM ← true
  end if
  Y ← T
until convergenceEM
Y ← MAP(T)

Output: B_OS, Θ, L(θ), t_ik, π_k, μ_k, Σ, Y


M-Step

The M-step deals with the estimation of the model parameters, that is, the cluster means $\mu_k$, the common covariance matrix $\Sigma$ and the prior of every component $\pi_k$. In a classical M-step, this is done explicitly by maximizing the likelihood expression. Here, this maximization is implicitly performed by penalized optimal scoring (see Section 8.1). The core of this step is a GLOSS execution that regresses X on the scaled version of the label matrix, YΘ. For the first iteration of EM, if no initialization is available, Y results from a K-means execution. In subsequent iterations, Y is updated as the posterior probability matrix T resulting from the E-step.

E-Step

The E-step evaluates the posterior probability matrix T, using

$$t_{ik} \propto \exp\left[-\frac{d(x, \mu_k) - 2\log(\pi_k)}{2}\right].$$

The convergence of those $t_{ik}$ is used as the stopping criterion for EM.

9.2 Model Selection

Here, model selection refers to the choice of the penalty parameter. Up to now, we have not conducted experiments where the number of clusters has to be automatically selected.

In a first attempt, we tried a classical structure where clustering was performed several times, from different initializations, for all penalty parameter values. Then, using the log-likelihood criterion, the best repetition for every value of the penalty parameter was chosen. The definitive λ was selected by means of the stability criterion described by Lange et al. (2002). This algorithm took a lot of computing resources, since the stability selection mechanism required a certain number of repetitions that transformed Mix-GLOSS into a lengthy four-nested-loop structure.

In a second attempt, we replaced the stability-based model selection algorithm by the evaluation of a modified version of BIC (Pan and Shen, 2007). This version of BIC looks like the traditional one (Schwarz, 1978), but takes into consideration the variables that have been removed. This mechanism, even if it turned out to be faster, also required a large computation time.

The third and definitive attempt (up to now) proceeds with several executions of Mix-GLOSS for the non-penalized case (λ = 0). The execution with the best log-likelihood is chosen. The repetitions are only performed for the non-penalized problem. The coefficient matrix B and the posterior matrix T resulting from the best non-penalized execution are used to warm-start a new Mix-GLOSS execution. This second execution of Mix-GLOSS is done using the values of the penalty parameter provided by the user or computed by the automatic selection mechanism. This time, only one repetition of the algorithm is done for every value of the penalty parameter. This version has been tested


[Figure 9.2 flowchart: inputs X, K, λ, EM_ITER_MAX, REP_Mix-GLOSS → initial Mix-GLOSS (λ = 0, REP_Mix-GLOSS = 20) → use B and T from the best repetition as StartB and StartT → Mix-GLOSS(λ, StartB, StartT) → compute BIC → choose λ = argmin_λ BIC → outputs: partition, t_ik, π_k, λ_BEST, B, Θ, D, L(θ), active set.]

Figure 9.2: Mix-GLOSS model selection diagram.

with no significant differences in the quality of the clustering, but a dramatic reduction of the computation time. Figure 9.2 summarizes the mechanism that implements the model selection of the penalty parameter λ.


10 Experimental Results

The performance of Mix-GLOSS is measured here on the artificial dataset that was used in Chapter 6.

This synthetic database is interesting because it covers four different situations where feature selection can be applied. Basically, it considers four setups with 1200 examples equally distributed between classes. It is a small sample regime, with p = 500 variables, out of which 100 differ between classes. Independent variables are generated for all simulations, except for Simulation 2, where they are slightly correlated. In Simulations 2 and 3, classes are optimally separated by a single projection of the original variables, while the two other scenarios require three discriminant directions. The Bayes' error was estimated to be respectively 1.7%, 6.7%, 7.3% and 30.0%. The exact description of every setup has already been given in Section 6.3.

In our tests, we have reduced the volume of the problem, because with the original size of 1200 samples and 500 dimensions, some of the algorithms to be tested took several days (even weeks) to finish. Hence, the definitive database was chosen to approximately maintain the Bayes' error of the original one, but with five times fewer examples and dimensions (n = 240, p = 100). Figure 10.1 has been adapted from Witten and Tibshirani (2011) to the dimensionality of our experiments, and allows a better understanding of the different simulations.

The simulation protocol involves 25 repetitions of each setup, generating a different dataset for each repetition. Thus, the results of the tested algorithms are provided as the average value and the standard deviation over the 25 repetitions.

10.1 Tested Clustering Algorithms

This section compares Mix-GLOSS with the following state-of-the-art methods:

• CS general cov: This is a model-based clustering with unconstrained covariance matrices, based on the regularization of the likelihood function using L1 penalties, followed by a classical EM algorithm. Further details can be found in Zhou et al. (2009). We use the R function available on the website of Wei Pan.

• Fisher EM: This method models and clusters the data in a discriminative and low-dimensional latent subspace (Bouveyron and Brunet, 2012b,a). Feature selection is induced by means of the “sparsification” of the projection matrix (three possibilities are suggested by Bouveyron and Brunet, 2012a). The corresponding R package, FisherEM, is available from the website of Charles Bouveyron or from the Comprehensive R Archive Network.


Figure 10.1: Class mean vectors for each artificial simulation.

• SelvarClust/Clustvarsel: Implements a method of variable selection for clustering using Gaussian mixture models, as a modification of the Raftery and Dean (2006) algorithm. SelvarClust (Maugis et al., 2009b) is a software implemented in C++ that makes use of the clustering libraries of mixmod (Biernacki et al., 2008). Further information can be found in the related paper, Maugis et al. (2009a). The software can be downloaded from the SelvarClust project homepage; there is a link to the project from Cathy Maugis's website.

After several tests, this entrant was discarded, due to the amount of computing time required by the greedy selection technique, which basically involves two executions of a classical clustering algorithm (with mixmod) for every single variable whose inclusion needs to be considered.

The substitute of SelvarClust has been the algorithm that inspired it, that is, the method developed by Raftery and Dean (2006). There is an R package named Clustvarsel that can be downloaded from the website of Nema Dean or from the Comprehensive R Archive Network.

• LumiWCluster: LumiWCluster is an R package available from the homepage of Pei Fen Kuan. This algorithm is inspired by Wang and Zhu (2008), who propose a penalty for the likelihood that incorporates group information through an L1,∞ mixed norm. In Kuan et al. (2010), they introduce some slight changes in the penalty term, such as weighting parameters that are particularly important for their dataset. The LumiWCluster package allows to perform clustering using the formulation of Wang and Zhu (2008) (called LumiWCluster-Wang) or the one of Kuan et al. (2010) (called LumiWCluster-Kuan).

• Mix-GLOSS: This is the clustering algorithm implemented using GLOSS (see


Chapter 9). It makes use of an EM algorithm and of the equivalences between the M-step and an LDA problem, and between a p-LDA problem and a p-OS problem. It penalizes an OS regression with a variational approach of the group-Lasso penalty (see Section 8.1.4) that induces zeros in all discriminant directions for the same variable.

10.2 Results

Table 10.1 shows the results of the experiments for all the algorithms from Section 10.1. The parameters used to measure the performance are:

• Clustering Error (in percentage): To measure the quality of the partition with the a priori knowledge of the real classes, the clustering error is computed as explained in Wu and Schölkopf (2007) (see the sketch after this list). If the obtained partition and the real labeling are the same, then the clustering error is 0%. The way this measure is defined allows to obtain the ideal 0% clustering error even if the IDs of the clusters and of the real classes are different.

• Number of Discarded Features: This value shows the number of variables whose coefficients have been zeroed; therefore, they are not used in the partitioning. In our datasets, only the first 20 features are relevant for the discrimination; the last 80 variables can be discarded. Hence, a good result for the tested algorithms should be around 80.

• Time of execution (in hours, minutes or seconds): Finally, the time needed to execute the 25 repetitions for each simulation setup is also measured. Those algorithms tend to be more memory- and CPU-consuming as the number of variables increases. This is one of the reasons why the dimensionality of the original problem was reduced.

The adequacy of the selected features was assessed by the True Positive Rate (TPR) and the False Positive Rate (FPR). The TPR is defined as the ratio of selected variables that are actually relevant. Similarly, the FPR is the ratio of selected variables that are actually non-relevant. The best algorithm would be the one that selects all the relevant variables and rejects all the others, that is, TPR = 1 and FPR = 0 simultaneously. In order to avoid cluttered results, we compare TPR and FPR for the four simulations but only for three algorithms: CS general cov and Clustvarsel were discarded due to their high computing time and clustering error, respectively, and, as the two versions of LumiWCluster provide almost the same TPR and FPR, only one is displayed. The three remaining algorithms are Fisher EM by Bouveyron and Brunet (2012a), the version of LumiWCluster by Kuan et al. (2010), and Mix-GLOSS. An illustrative computation of these measures is sketched below.
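The sketch below is illustrative only (it is not the code used for the experiments): the clustering error is obtained after matching cluster IDs to class IDs with the Hungarian algorithm, and TPR/FPR are computed with their usual definitions from boolean masks of selected and relevant variables; the selection at the end is a hypothetical example.

import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_error(y_true, y_pred):
    """Error rate after the best one-to-one matching between cluster IDs and class IDs."""
    K = max(y_true.max(), y_pred.max()) + 1
    C = np.zeros((K, K), dtype=int)                 # contingency table
    for t, p in zip(y_true, y_pred):
        C[t, p] += 1
    rows, cols = linear_sum_assignment(-C)          # Hungarian algorithm (maximize matches)
    return 1.0 - C[rows, cols].sum() / len(y_true)

def tpr_fpr(selected, relevant):
    """selected, relevant: boolean masks over the variables."""
    tpr = (selected & relevant).sum() / relevant.sum()
    fpr = (selected & ~relevant).sum() / (~relevant).sum()
    return tpr, fpr

# Example: 100 variables, the first 20 being relevant, and a hypothetical selection of 25.
relevant = np.arange(100) < 20
selected = np.arange(100) < 25
print(tpr_fpr(selected, relevant))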

Results, in percentages, are displayed in Figure 10.2 and in Table 10.2.


Table 10.1: Experimental results for simulated data.

                        Err (%)        Var            Time
Sim 1: K = 4, mean shift, ind. features
  CS general cov        46 (15)        985 (72)       884h
  Fisher EM             58 (87)        784 (52)       1645m
  Clustvarsel           602 (107)      378 (291)      383h
  LumiWCluster-Kuan     42 (68)        779 (4)        389s
  LumiWCluster-Wang     43 (69)        784 (39)       619s
  Mix-GLOSS             32 (16)        80 (09)        15h
Sim 2: K = 2, mean shift, dependent features
  CS general cov        154 (2)        997 (09)       783h
  Fisher EM             74 (23)        809 (28)       8m
  Clustvarsel           73 (2)         334 (207)      166h
  LumiWCluster-Kuan     64 (18)        798 (04)       155s
  LumiWCluster-Wang     63 (17)        799 (03)       14s
  Mix-GLOSS             77 (2)         841 (34)       2h
Sim 3: K = 4, 1D mean shift, ind. features
  CS general cov        304 (57)       55 (468)       1317h
  Fisher EM             233 (65)       366 (55)       22m
  Clustvarsel           658 (115)      232 (291)      542h
  LumiWCluster-Kuan     323 (21)       80 (02)        83s
  LumiWCluster-Wang     308 (36)       80 (02)        1292s
  Mix-GLOSS             347 (92)       81 (88)        21h
Sim 4: K = 4, mean shift, ind. features
  CS general cov        626 (55)       999 (02)       112h
  Fisher EM             567 (104)      55 (48)        195m
  Clustvarsel           732 (4)        24 (12)        767h
  LumiWCluster-Kuan     692 (112)      99 (2)         876s
  LumiWCluster-Wang     697 (119)      991 (21)       825s
  Mix-GLOSS             669 (91)       975 (12)       11h

Table 10.2: TPR versus FPR (in %), average computed over 25 repetitions, for the best performing algorithms.

               Simulation 1      Simulation 2      Simulation 3      Simulation 4
               TPR     FPR       TPR     FPR       TPR     FPR       TPR     FPR
  MIX-GLOSS    992     015       828     335       884     67        780     12
  LUMI-KUAN    992     28        1000    02        1000    005       50      005
  FISHER-EM    986     24        888     17        838     5825      620     4075


[Figure 10.2 here: plot of TPR versus FPR (in %) for MIX-GLOSS, LUMI-KUAN and FISHER-EM on Simulations 1 to 4.]

Figure 10.2: TPR versus FPR (in %) for the best performing algorithms and simulations.

10.3 Discussion

After reviewing Tables 10.1-10.2 and Figure 10.2, we see that there is no definitive winner in all situations regarding all criteria. Depending on the objectives and constraints of the problem, the following observations deserve to be highlighted.

LumiWCluster (Wang and Zhu, 2008; Kuan et al., 2010) is by far the fastest kind of method, with good behavior regarding the other performance criteria. At the other end of this criterion, CS general cov is extremely slow, and Clustvarsel, though twice as fast, also takes very long to produce an output. Of course, the speed criterion does not say much by itself: the implementations use different programming languages and different stopping criteria, and we do not know what effort has been spent on each implementation. That being said, the slowest algorithms are not the most precise ones, so their long computation time is worth mentioning here.

The quality of the partition varies depending on the simulation and the algorithm: Mix-GLOSS has a small edge in Simulation 1, LumiWCluster (Wang and Zhu, 2008; Kuan et al., 2010) performs better in Simulation 2, while Fisher EM (Bouveyron and Brunet, 2012a) does slightly better in Simulations 3 and 4.

From the feature selection point of view, LumiWCluster (Kuan et al., 2010) and Mix-GLOSS succeed in removing irrelevant variables in all the situations, while Fisher EM (Bouveyron and Brunet, 2012a) and Mix-GLOSS discover the relevant ones. Mix-GLOSS consistently performs best, or close to the best solution, in terms of fall-out and recall.


Conclusions

Summary

The linear regression of scaled indicator matrices, or optimal scoring, is a versatile technique with applicability in many fields of the machine learning domain. An optimal scoring regression can be strengthened, by means of regularization, to be more robust, avoid overfitting, counteract ill-posed problems, or remove correlated or noisy variables.

In this thesis we have proved the utility of penalized optimal scoring in the fields of multi-class linear discrimination and clustering.

The equivalence between LDA and OS problems makes it possible to bring all the resources available for solving regression problems to the solution of linear discrimination problems. In their penalized versions, this equivalence holds under certain conditions that have not always been obeyed when OS has been used to solve LDA problems.

In Part II, we have used a variational approach to the group-Lasso penalty to preserve this equivalence, allowing the use of penalized optimal scoring regressions for the solution of linear discrimination problems. This theory has been verified with the implementation of our Group-Lasso Optimal Scoring Solver algorithm (GLOSS), which has proved its effectiveness, inducing extremely parsimonious models without renouncing any predictive capability. GLOSS has been tested on four artificial and three real datasets, outperforming other state-of-the-art algorithms in almost all situations.

In Part III, this theory has been adapted, by means of an EM algorithm, to the unsupervised domain. As for the supervised case, the theory must guarantee the equivalence between penalized LDA and penalized OS. The difficulty of this method resides in the computation of the criterion to maximize at every iteration of the EM loop, which is typically used to detect the convergence of the algorithm and to implement model selection of the penalty parameter. Also in this case, the theory has been put into practice with the implementation of Mix-GLOSS. So far, due to time constraints, only artificial datasets have been tested, with positive results.

Perspectives

Even if the preliminary results are promising, Mix-GLOSS has not been sufficiently tested. We have planned to test it at least with the same real datasets that we used with GLOSS. However, more testing would be recommended in both cases. Those algorithms are well suited for genomic data, where the number of samples is smaller than the number of variables, but other high-dimensional low-sample setting (HDLSS) domains are also possible: identification of male or female silhouettes, fungal species or fish species based on shape and texture (Clemmensen et al., 2011), and Stirling faces (Roth and Lange, 2004) are only some examples. Moreover, we are not constrained to the HDLSS domain: the USPS handwritten digits database (Roth and Lange, 2004), or the well-known Fisher's Iris dataset and six other UCI datasets (Bouveyron and Brunet, 2012a), have also been used in the literature.

At the programming level, both codes must be revisited to improve their robustness and optimize their computation, because during the prototyping phase the priority was achieving functional code. An old version of GLOSS, numerically more stable but less efficient, has been made available to the public. A better suited and documented version should be made available for GLOSS and Mix-GLOSS in the short term.

The theory developed in this thesis and the programming structure used for its implementation allow easy alterations of the algorithm by modifying the within-class covariance matrix. Diagonal versions of the model can be obtained by discarding all the elements but the diagonal of the covariance matrix. Spherical models could also be implemented easily. Prior information concerning the correlation between features can be included by adding a quadratic penalty term, such as the Laplacian that describes the relationships between variables; this can be used to implement pair-wise penalties when the dataset is formed by pixels (a small sketch of such a Laplacian construction is given below). Quadratic penalty matrices can also be added to the within-class covariance to implement Elastic net equivalent penalties. Some of those possibilities have been partially implemented, such as the diagonal version of GLOSS; however, they have not been properly tested or even updated with the last algorithmic modifications. Their equivalents for the unsupervised domain have not yet been proposed, due to the time deadlines for the publication of this thesis.
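As an illustration of the structured penalties mentioned above, the following sketch (a toy example, not the thesis code) builds the graph Laplacian of a w x h pixel grid with 4-neighbour adjacency; the quadratic form beta' L beta then penalizes differences between neighbouring pixel coefficients.

import numpy as np

def grid_laplacian(w, h):
    """Laplacian L = D - A of the 4-neighbour graph of a w x h pixel grid."""
    p = w * h
    A = np.zeros((p, p))                                     # adjacency between pixels
    idx = lambda i, j: i * w + j
    for i in range(h):
        for j in range(w):
            if j + 1 < w:                                    # right neighbour
                A[idx(i, j), idx(i, j + 1)] = A[idx(i, j + 1), idx(i, j)] = 1
            if i + 1 < h:                                    # bottom neighbour
                A[idx(i, j), idx(i + 1, j)] = A[idx(i + 1, j), idx(i, j)] = 1
    D = np.diag(A.sum(axis=1))                               # degree matrix
    return D - A

L = grid_laplacian(3, 3)     # 9 x 9 Laplacian, as for a 3 x 3 image
# beta @ L @ beta equals the sum over neighbouring pixels (i, j) of (beta_i - beta_j)^2.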

From the point of view of the supporting theory, we did not succeed in finding the exact criterion that is maximized in Mix-GLOSS. We believe it must be a kind of penalized, or even hyper-penalized, likelihood, but we decided to prioritize the experimental results due to the time constraints. Not knowing this criterion does not prevent successful simulations of Mix-GLOSS: other mechanisms that do not involve the computation of the real criterion have been used for stopping the EM algorithm and for model selection. However, further investigations must be done in this direction to assess the convergence properties of this algorithm.

At the beginning of this thesis, even if the work finally took the direction of feature selection, a big effort was devoted to the domains of outlier detection and block clustering. One of the most successful mechanisms for the detection of outliers is to model the population with a mixture model where the outliers are described by a uniform distribution. This technique does not need any prior knowledge about the number or the percentage of outliers. As the basic model of this thesis is a mixture of Gaussians, our impression is that it should not be difficult to introduce a new uniform component to gather together all those points that do not fit the Gaussian mixture. On the other hand, the application of penalized optimal scoring to block clustering looks more complex; but as block clustering is typically defined as a mixture model whose parameters are estimated by means of an EM algorithm, it could be possible to re-interpret that estimation using a penalized optimal scoring regression.


Appendix


A Matrix Properties

Property 1: By definition, $\Sigma_W$ and $\Sigma_B$ are both symmetric matrices:
\[
\Sigma_W = \frac{1}{n} \sum_{k=1}^{g} \sum_{i \in C_k} (x_i - \mu_k)(x_i - \mu_k)^\top , \qquad
\Sigma_B = \frac{1}{n} \sum_{k=1}^{g} n_k (\mu_k - \bar{x})(\mu_k - \bar{x})^\top .
\]

Property 2: $\dfrac{\partial\, x^\top a}{\partial x} = \dfrac{\partial\, a^\top x}{\partial x} = a$.

Property 3: $\dfrac{\partial\, x^\top A x}{\partial x} = (A + A^\top)\, x$.

Property 4: $\dfrac{\partial\, |X^{-1}|}{\partial X} = -|X^{-1}| \, (X^{-1})^\top$.

Property 5: $\dfrac{\partial\, a^\top X b}{\partial X} = a b^\top$.

Property 6: $\dfrac{\partial}{\partial X} \mathrm{tr}\!\left(A X^{-1} B\right) = -(X^{-1} B A X^{-1})^\top = -X^{-\top} A^\top B^\top X^{-\top}$.
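The matrix-calculus identities above are easy to check numerically. The sketch below (illustrative only, with made-up random data) verifies Properties 3 and 5 by central finite differences.

import numpy as np

rng = np.random.default_rng(0)
p, eps = 4, 1e-6

# Property 3: d(x' A x)/dx = (A + A') x
A = rng.normal(size=(p, p))
x = rng.normal(size=p)
grad_analytic = (A + A.T) @ x
grad_numeric = np.empty(p)
for j in range(p):
    e = np.zeros(p); e[j] = eps
    grad_numeric[j] = ((x + e) @ A @ (x + e) - (x - e) @ A @ (x - e)) / (2 * eps)
assert np.allclose(grad_analytic, grad_numeric, atol=1e-4)

# Property 5: d(a' X b)/dX = a b'
a, b = rng.normal(size=p), rng.normal(size=p)
X = rng.normal(size=(p, p))
grad_analytic = np.outer(a, b)
grad_numeric = np.empty((p, p))
for i in range(p):
    for j in range(p):
        E = np.zeros((p, p)); E[i, j] = eps
        grad_numeric[i, j] = (a @ (X + E) @ b - a @ (X - E) @ b) / (2 * eps)
assert np.allclose(grad_analytic, grad_numeric, atol=1e-4)
print("Properties 3 and 5 verified numerically.")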


B The Penalized-OS Problem is an Eigenvector Problem

In this appendix we answer the question of why the solution of a penalized optimal scoring regression involves the computation of an eigenvector decomposition. The p-OS problem has this form:
\[
\min_{\theta_k, \beta_k} \; \|Y\theta_k - X\beta_k\|_2^2 + \beta_k^\top \Omega_k \beta_k \tag{B.1}
\]
\[
\text{s.t.} \quad \theta_k^\top Y^\top Y \theta_k = 1 , \qquad \theta_\ell^\top Y^\top Y \theta_k = 0 \quad \forall \ell < k ,
\]
for $k = 1, \ldots, K-1$. The Lagrangian associated with Problem (B.1) is
\[
L_k(\theta_k, \beta_k, \lambda_k, \nu_k) = \|Y\theta_k - X\beta_k\|_2^2 + \beta_k^\top \Omega_k \beta_k + \lambda_k \left( \theta_k^\top Y^\top Y \theta_k - 1 \right) + \sum_{\ell < k} \nu_\ell \, \theta_\ell^\top Y^\top Y \theta_k . \tag{B.2}
\]
Setting the gradient of (B.2) with respect to $\beta_k$ to zero gives the value of the optimal $\beta_k^\star$:
\[
\beta_k^\star = (X^\top X + \Omega_k)^{-1} X^\top Y \theta_k . \tag{B.3}
\]
The objective function of (B.1) evaluated at $\beta_k^\star$ is
\[
\min_{\theta_k} \; \|Y\theta_k - X\beta_k^\star\|_2^2 + \beta_k^{\star\top} \Omega_k \beta_k^\star
= \min_{\theta_k} \; \theta_k^\top Y^\top \left( I - X(X^\top X + \Omega_k)^{-1} X^\top \right) Y \theta_k
= \max_{\theta_k} \; \theta_k^\top Y^\top X (X^\top X + \Omega_k)^{-1} X^\top Y \theta_k . \tag{B.4}
\]
If the penalty matrix $\Omega_k$ is identical for all problems, $\Omega_k = \Omega$, then (B.4) corresponds to an eigen-problem where the $k$ score vectors $\theta_k$ are the eigenvectors of $Y^\top X (X^\top X + \Omega)^{-1} X^\top Y$.

B.1 How to Solve the Eigenvector Decomposition

Making an eigen-decomposition of an expression like $Y^\top X (X^\top X + \Omega)^{-1} X^\top Y$ is not trivial due to the $p \times p$ inverse. With some datasets, $p$ can be extremely large, making this inverse intractable. In this section we show how to circumvent this issue by solving an easier eigenvector decomposition.


Let $M$ be the matrix $Y^\top X (X^\top X + \Omega)^{-1} X^\top Y$, so that we can rewrite expression (B.4) in a compact way:
\[
\max_{\Theta \in \mathbb{R}^{K \times (K-1)}} \; \mathrm{tr}\!\left(\Theta^\top M \Theta\right) \qquad \text{s.t.} \quad \Theta^\top Y^\top Y \Theta = I_{K-1} . \tag{B.5}
\]
If (B.5) is an eigenvector problem, it can be reformulated in the traditional way. Let the $(K-1) \times (K-1)$ matrix $M_\Theta$ be $\Theta^\top M \Theta$. Hence the classical eigenvector formulation associated with (B.5) is
\[
M_\Theta v = \lambda v , \tag{B.6}
\]
where $v$ is the eigenvector and $\lambda$ the associated eigenvalue of $M_\Theta$. Operating,
\[
v^\top M_\Theta v = \lambda \;\Leftrightarrow\; v^\top \Theta^\top M \Theta v = \lambda .
\]
Making the change of variable $w = \Theta v$, we obtain an alternative eigenproblem where the $w$ are the eigenvectors of $M$ and $\lambda$ the associated eigenvalue:
\[
w^\top M w = \lambda . \tag{B.7}
\]
Therefore $v$ are the eigenvectors of the eigen-decomposition of matrix $M_\Theta$, and $w$ are the eigenvectors of the eigen-decomposition of matrix $M$. Note that the only difference between the $(K-1) \times (K-1)$ matrix $M_\Theta$ and the $K \times K$ matrix $M$ is the $K \times (K-1)$ matrix $\Theta$ in the expression $M_\Theta = \Theta^\top M \Theta$. Then, to avoid the computation of the $p \times p$ inverse $(X^\top X + \Omega)^{-1}$, we can use the optimal value of the coefficient matrix $B^\star = (X^\top X + \Omega)^{-1} X^\top Y \Theta$ in $M_\Theta$:
\[
M_\Theta = \Theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \Theta = \Theta^\top Y^\top X B^\star .
\]
Thus the eigen-decomposition of the $(K-1) \times (K-1)$ matrix $M_\Theta = \Theta^\top Y^\top X B^\star$ results in the $v$ eigenvectors of (B.6). To obtain the $w$ eigenvectors of the alternative formulation (B.7), the change of variable $w = \Theta v$ needs to be undone.

To summarize, we calculate the $v$ eigenvectors from the eigen-decomposition of the tractable matrix $M_\Theta$, evaluated as $\Theta^\top Y^\top X B^\star$. Then the definitive eigenvectors $w$ are recovered by computing $w = \Theta v$. The final step is the reconstruction of the optimal score matrix $\Theta^\star$ using the vectors $w$ as its columns. At this point we understand what in the literature is called "updating the initial score matrix": multiplying the initial $\Theta$ by the eigenvector matrix $V$ from decomposition (B.6) reverses the change of variable to restore the $w$ vectors. The $B^\star$ matrix also needs to be "updated", by multiplying $B^\star$ by the same matrix of eigenvectors $V$, in order to account for the initial $\Theta$ matrix used in the first computation of $B^\star$:
\[
B^\star = (X^\top X + \Omega)^{-1} X^\top Y \Theta V = B V .
\]
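A minimal NumPy sketch of this update is given below (illustrative only, not the GLOSS implementation): the initial score matrix is an arbitrary matrix with orthonormal columns and the normalization constraint of (B.5) is ignored, the point being only to show that the small (K-1) x (K-1) eigenproblem replaces the p x p inverse.

import numpy as np

rng = np.random.default_rng(1)
n, p, K = 200, 10, 3
X = rng.normal(size=(n, p))
labels = rng.integers(K, size=n)
Y = np.eye(K)[labels]                                     # n x K indicator matrix
Omega = np.eye(p)                                         # example quadratic penalty
Theta0 = np.linalg.qr(rng.normal(size=(K, K - 1)))[0]     # initial score matrix (K x K-1)

B0 = np.linalg.solve(X.T @ X + Omega, X.T @ Y @ Theta0)   # (B.3) applied columnwise
M_Theta = Theta0.T @ Y.T @ X @ B0                         # (K-1) x (K-1), no p x p inverse
eigval, V = np.linalg.eigh(M_Theta)                       # small symmetric eigenproblem
order = np.argsort(eigval)[::-1]                          # sort eigenvalues in decreasing order
V = V[:, order]

Theta = Theta0 @ V        # "updated" score matrix (columns correspond to the w = Theta v vectors)
B = B0 @ V                # "updated" regression coefficients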


B.2 Why the OS Problem is Solved as an Eigenvector Problem

In the Optimal Scoring literature, the score matrix $\Theta^\star$ that optimizes Problem (B.1) is obtained by means of an eigenvector decomposition of the matrix $M = Y^\top X (X^\top X + \Omega)^{-1} X^\top Y$.

By definition of the eigen-decomposition, the eigenvectors of the matrix $M$ (called $w$ in (B.7)) form a basis, so that any score vector $\theta$ can be expressed as a linear combination of them:
\[
\theta_k = \sum_{m=1}^{K-1} \alpha_m w_m , \qquad \text{s.t.} \quad \theta_k^\top \theta_k = 1 . \tag{B.8}
\]
The score vector orthogonality constraint $\theta_k^\top \theta_k = 1$ can also be expressed as a function of this basis,
\[
\left( \sum_{m=1}^{K-1} \alpha_m w_m \right)^{\!\top} \left( \sum_{m=1}^{K-1} \alpha_m w_m \right) = 1 ,
\]
which, as per the eigenvector properties, can be reduced to
\[
\sum_{m=1}^{K-1} \alpha_m^2 = 1 . \tag{B.9}
\]
Let $M$ be multiplied by a score vector $\theta_k$, which can be replaced by its linear combination of eigenvectors $w_m$ (B.8):
\[
M \theta_k = M \sum_{m=1}^{K-1} \alpha_m w_m = \sum_{m=1}^{K-1} \alpha_m M w_m .
\]
As the $w_m$ are the eigenvectors of the matrix $M$, the relationship $M w_m = \lambda_m w_m$ can be used to obtain
\[
M \theta_k = \sum_{m=1}^{K-1} \alpha_m \lambda_m w_m .
\]
Multiplying on the left by $\theta_k^\top$, expressed as its linear combination of eigenvectors,
\[
\theta_k^\top M \theta_k = \left( \sum_{\ell=1}^{K-1} \alpha_\ell w_\ell \right)^{\!\top} \left( \sum_{m=1}^{K-1} \alpha_m \lambda_m w_m \right) .
\]
This equation can be simplified using the orthogonality property of eigenvectors, according to which $w_\ell^\top w_m$ is zero for any $\ell \neq m$, giving
\[
\theta_k^\top M \theta_k = \sum_{m=1}^{K-1} \alpha_m^2 \lambda_m .
\]
The optimization Problem (B.5) for discriminant direction $k$ can thus be rewritten as
\[
\max_{\theta_k \in \mathbb{R}^{K \times 1}} \; \theta_k^\top M \theta_k = \max_{\theta_k \in \mathbb{R}^{K \times 1}} \; \sum_{m=1}^{K-1} \alpha_m^2 \lambda_m ,
\qquad \text{with} \quad \theta_k = \sum_{m=1}^{K-1} \alpha_m w_m \quad \text{and} \quad \sum_{m=1}^{K-1} \alpha_m^2 = 1 . \tag{B.10}
\]
One way of maximizing Problem (B.10) is choosing $\alpha_m = 1$ for $m = k$ and $\alpha_m = 0$ otherwise. Hence, as $\theta_k = \sum_{m=1}^{K-1} \alpha_m w_m$, the resulting score vector $\theta_k$ will be equal to the $k$-th eigenvector $w_k$.

As a summary, it can be concluded that the solution to the original problem (B.1) can be obtained by an eigenvector decomposition of the matrix $M = Y^\top X (X^\top X + \Omega)^{-1} X^\top Y$.


C Solving Fisher's Discriminant Problem

The classical Fisher's discriminant problem seeks a projection that best separates the class centers while every class remains compact. This is formalized as looking for a projection such that the projected data has maximal between-class variance under a unitary constraint on the within-class variance:
\[
\max_{\beta \in \mathbb{R}^p} \; \beta^\top \Sigma_B \beta \tag{C.1a}
\]
\[
\text{s.t.} \quad \beta^\top \Sigma_W \beta = 1 , \tag{C.1b}
\]
where $\Sigma_B$ and $\Sigma_W$ are respectively the between-class variance and the within-class variance of the original $p$-dimensional data.

The Lagrangian of Problem (C.1) is
\[
L(\beta, \nu) = \beta^\top \Sigma_B \beta - \nu \left( \beta^\top \Sigma_W \beta - 1 \right) ,
\]
so that its first derivative with respect to $\beta$ is
\[
\frac{\partial L(\beta, \nu)}{\partial \beta} = 2 \Sigma_B \beta - 2 \nu \Sigma_W \beta .
\]
A necessary optimality condition for $\beta^\star$ is that this derivative is zero, that is,
\[
\Sigma_B \beta^\star = \nu \Sigma_W \beta^\star .
\]
Provided $\Sigma_W$ is full rank, we have
\[
\Sigma_W^{-1} \Sigma_B \beta^\star = \nu \beta^\star . \tag{C.2}
\]
Thus the solutions $\beta^\star$ match the definition of an eigenvector of the matrix $\Sigma_W^{-1} \Sigma_B$ with eigenvalue $\nu$. To characterize this eigenvalue, we note that the objective function (C.1a) can be expressed as follows:
\[
\beta^{\star\top} \Sigma_B \beta^\star = \beta^{\star\top} \Sigma_W \Sigma_W^{-1} \Sigma_B \beta^\star
= \nu \, \beta^{\star\top} \Sigma_W \beta^\star \quad \text{from (C.2)}
\quad = \nu \quad \text{from (C.1b)} .
\]
That is, the optimal value of the objective function to be maximized is the eigenvalue $\nu$. Hence $\nu$ is the largest eigenvalue of $\Sigma_W^{-1} \Sigma_B$ and $\beta^\star$ is any eigenvector corresponding to this maximal eigenvalue.
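Numerically, this corresponds to solving the generalized eigenproblem Sigma_B beta = nu Sigma_W beta; the sketch below (toy data, illustrative only) checks it with NumPy/SciPy.

import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
n, p, K = 300, 5, 3
means = rng.normal(scale=3, size=(K, p))
X = np.vstack([rng.normal(means[k], 1.0, size=(n, p)) for k in range(K)])
y = np.repeat(np.arange(K), n)

xbar = X.mean(axis=0)
Sw = np.zeros((p, p)); Sb = np.zeros((p, p))
for k in range(K):
    Xk = X[y == k]
    mk = Xk.mean(axis=0)
    Sw += (Xk - mk).T @ (Xk - mk)                      # within-class scatter
    Sb += len(Xk) * np.outer(mk - xbar, mk - xbar)     # between-class scatter
Sw /= len(X); Sb /= len(X)

nu, B = eigh(Sb, Sw)                                   # generalized eigenproblem Sb b = nu Sw b
beta = B[:, -1]                                        # eigenvector of the largest eigenvalue
# Check (C.2) and the normalization beta' Sw beta = 1 enforced by eigh
assert np.allclose(np.linalg.solve(Sw, Sb) @ beta, nu[-1] * beta, atol=1e-6)
print("largest eigenvalue nu =", nu[-1], " beta' Sw beta =", beta @ Sw @ beta)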


D Alternative Variational Formulation for the Group-Lasso

In this appendix, an alternative to the variational form of the group-Lasso (4.21) presented in Section 4.3.1 is proposed:
\[
\min_{\tau \in \mathbb{R}^p} \; \min_{B \in \mathbb{R}^{p \times (K-1)}} \; J(B) + \lambda \sum_{j=1}^{p} \frac{w_j^2 \, \|\beta^j\|_2^2}{\tau_j} \tag{D.1a}
\]
\[
\text{s.t.} \quad \sum_{j=1}^{p} \tau_j = 1 , \tag{D.1b}
\]
\[
\qquad \tau_j \geq 0 , \; j = 1, \ldots, p . \tag{D.1c}
\]
Following the approach detailed in Section 4.3.1, its equivalence with the standard group-Lasso formulation is demonstrated here. Let $B \in \mathbb{R}^{p \times (K-1)}$ be a matrix composed of row vectors $\beta^j \in \mathbb{R}^{K-1}$: $B = (\beta^{1\top}, \ldots, \beta^{p\top})^\top$.
\[
L(B, \tau, \lambda, \nu_0, \nu_j) = J(B) + \lambda \sum_{j=1}^{p} \frac{w_j^2 \, \|\beta^j\|_2^2}{\tau_j} + \nu_0 \left( \sum_{j=1}^{p} \tau_j - 1 \right) - \sum_{j=1}^{p} \nu_j \tau_j . \tag{D.2}
\]
The starting point is the Lagrangian (D.2), which is differentiated with respect to $\tau_j$ to get the optimal value $\tau_j^\star$:
\[
\left. \frac{\partial L(B, \tau, \lambda, \nu_0, \nu_j)}{\partial \tau_j} \right|_{\tau_j = \tau_j^\star} = 0
\;\Rightarrow\; -\lambda \frac{w_j^2 \|\beta^j\|_2^2}{\tau_j^{\star 2}} + \nu_0 - \nu_j = 0
\;\Rightarrow\; -\lambda w_j^2 \|\beta^j\|_2^2 + \nu_0 \tau_j^{\star 2} - \nu_j \tau_j^{\star 2} = 0
\;\Rightarrow\; -\lambda w_j^2 \|\beta^j\|_2^2 + \nu_0 \tau_j^{\star 2} = 0 .
\]
The last two expressions are related through one property of the Lagrange multipliers, which states that $\nu_j g_j(\tau^\star) = 0$, where $\nu_j$ is the Lagrange multiplier and $g_j(\tau)$ is the inequality Lagrange condition. Then the optimal $\tau_j^\star$ can be deduced:
\[
\tau_j^\star = \sqrt{\frac{\lambda}{\nu_0}} \; w_j \|\beta^j\|_2 .
\]
Plugging this optimal value of $\tau_j$ into constraint (D.1b),
\[
\sum_{j=1}^{p} \tau_j = 1 \;\Rightarrow\; \tau_j^\star = \frac{w_j \|\beta^j\|_2}{\sum_{j=1}^{p} w_j \|\beta^j\|_2} . \tag{D.3}
\]


With this value of $\tau_j^\star$, Problem (D.1) is equivalent to
\[
\min_{B \in \mathbb{R}^{p \times (K-1)}} \; J(B) + \lambda \left( \sum_{j=1}^{p} w_j \|\beta^j\|_2 \right)^{\!2} . \tag{D.4}
\]
This problem is a slight alteration of the standard group-Lasso, as the penalty is squared compared to the usual form. This square only affects the strength of the penalty, and the usual properties of the group-Lasso apply to the solution of Problem (D.4). In particular, its solution is expected to be sparse, with some null vectors $\beta^j$.
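A quick numerical check of this equivalence (illustrative only, with made-up data) is to plug the optimal tau of (D.3) into the penalty of (D.1a) and verify that it equals the squared group-Lasso penalty of (D.4):

import numpy as np

rng = np.random.default_rng(0)
p, K = 6, 4
B = rng.normal(size=(p, K - 1))
w = rng.uniform(0.5, 2.0, size=p)

row_norms = np.linalg.norm(B, axis=1)                 # ||beta^j||_2
tau = w * row_norms / np.sum(w * row_norms)           # optimal tau of (D.3)
variational = np.sum(w**2 * row_norms**2 / tau)       # penalty of (D.1a) at the optimal tau
group_lasso_sq = np.sum(w * row_norms) ** 2           # squared penalty of (D.4)
assert np.isclose(variational, group_lasso_sq)
print(variational, group_lasso_sq)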

The penalty term of (D.1a) can be conveniently presented as $\lambda B^\top \Omega B$, where
\[
\Omega = \mathrm{diag}\!\left( \frac{w_1^2}{\tau_1}, \frac{w_2^2}{\tau_2}, \ldots, \frac{w_p^2}{\tau_p} \right) . \tag{D.5}
\]
Using the value of $\tau_j^\star$ from (D.3), each diagonal component of $\Omega$ is
\[
(\Omega)_{jj} = \frac{w_j \sum_{l=1}^{p} w_l \|\beta^l\|_2}{\|\beta^j\|_2} . \tag{D.6}
\]
In the following paragraphs, the optimality conditions and properties developed for the quadratic variational approach detailed in Section 4.3.1 are also derived for this alternative formulation.

D.1 Useful Properties

Lemma D.1. If $J$ is convex, Problem (D.1) is convex.

In what follows, $J$ will be a convex quadratic (hence smooth) function, in which case a necessary and sufficient optimality condition is that zero belongs to the subdifferential of the objective function, whose expression is provided in the following lemma.

Lemma D.2. For all $B \in \mathbb{R}^{p \times (K-1)}$, the subdifferential of the objective function of Problem (D.4) is the set of matrices $V \in \mathbb{R}^{p \times (K-1)}$ such that
\[
V = \frac{\partial J(B)}{\partial B} + 2\lambda \left( \sum_{l=1}^{p} w_l \|\beta^l\|_2 \right) G , \tag{D.7}
\]
where $G = (g^{1\top}, \ldots, g^{p\top})^\top$ is a $p \times (K-1)$ matrix defined as follows. Let $S(B)$ denote the row-wise support of $B$, $S(B) = \{ j \in 1, \ldots, p : \|\beta^j\|_2 \neq 0 \}$; then we have
\[
\forall j \in S(B), \quad g^j = w_j \|\beta^j\|_2^{-1} \beta^j , \tag{D.8}
\]
\[
\forall j \notin S(B), \quad \|g^j\|_2 \leq w_j . \tag{D.9}
\]


This condition results in an equality for the "active" non-zero vectors $\beta^j$ and an inequality for the other ones, which both provide essential building blocks of our algorithm.

Lemma D.3. Problem (D.4) admits at least one solution, which is unique if $J(B)$ is strictly convex. All critical points $B^\star$ of the objective function verifying the following conditions are global minima. Let $S(B^\star)$ denote the row-wise support of $B^\star$, $S(B^\star) = \{ j \in 1, \ldots, p : \|\beta^{\star j}\|_2 \neq 0 \}$, and let $\bar{S}(B^\star)$ be its complement; then we have
\[
\forall j \in S(B^\star), \quad -\frac{\partial J(B^\star)}{\partial \beta^j} = 2\lambda \left( \sum_{l=1}^{p} w_l \|\beta^{\star l}\|_2 \right) w_j \|\beta^{\star j}\|_2^{-1} \beta^{\star j} , \tag{D.10a}
\]
\[
\forall j \in \bar{S}(B^\star), \quad \left\| \frac{\partial J(B^\star)}{\partial \beta^j} \right\|_2 \leq 2\lambda w_j \left( \sum_{l=1}^{p} w_l \|\beta^{\star l}\|_2 \right) . \tag{D.10b}
\]

In particular, Lemma D.3 provides a well-defined appraisal of the support of the solution, which is not easily handled from the direct analysis of the variational problem (D.1).

D.2 An Upper Bound on the Objective Function

Lemma D.4. The objective function of the variational form (D.1) is an upper bound on the group-Lasso objective function (D.4), and, for a given $B$, the gap in these objectives is null at $\tau^\star$ such that
\[
\tau_j^\star = \frac{w_j \|\beta^j\|_2}{\sum_{j=1}^{p} w_j \|\beta^j\|_2} .
\]

Proof. The objective functions of (D.1) and (D.4) only differ in their second term. Let $\tau \in \mathbb{R}^p$ be any feasible vector; we have
\[
\left( \sum_{j=1}^{p} w_j \|\beta^j\|_2 \right)^{\!2}
= \left( \sum_{j=1}^{p} \tau_j^{1/2} \, \frac{w_j \|\beta^j\|_2}{\tau_j^{1/2}} \right)^{\!2}
\leq \left( \sum_{j=1}^{p} \tau_j \right) \left( \sum_{j=1}^{p} \frac{w_j^2 \|\beta^j\|_2^2}{\tau_j} \right)
\leq \sum_{j=1}^{p} \frac{w_j^2 \|\beta^j\|_2^2}{\tau_j} ,
\]
where we used the Cauchy-Schwarz inequality in the second line and the definition of the feasibility set of $\tau$ in the last one.


This lemma only holds for the alternative variational formulation described in this appendix. It is difficult to obtain the same result for the first variational form (Section 4.3.1) because the definitions of the feasible sets of $\tau$ and $\beta$ are intertwined.


E Invariance of the Group-Lasso to Unitary Transformations

The computational trick described in Section 5.2 for quadratic penalties can be applied to the group-Lasso provided that the following holds: if the regression coefficients $B_0$ are optimal for the score values $\Theta_0$, and if the optimal scores $\Theta^\star$ are obtained by a unitary transformation of $\Theta_0$, say $\Theta^\star = \Theta_0 V$ (where $V \in \mathbb{R}^{M \times M}$ is a unitary matrix), then $B^\star = B_0 V$ is optimal conditionally on $\Theta^\star$, that is, $(\Theta^\star, B^\star)$ is a global solution corresponding to the optimal scoring problem. To show this, we use the standard group-Lasso formulation and prove the following proposition.

Proposition E.1. Let $B^\star$ be a solution of
\[
\min_{B \in \mathbb{R}^{p \times M}} \; \|Y - XB\|_F^2 + \lambda \sum_{j=1}^{p} w_j \|\beta^j\|_2 , \tag{E.1}
\]
and let $\tilde{Y} = YV$, where $V \in \mathbb{R}^{M \times M}$ is a unitary matrix. Then $\tilde{B} = B^\star V$ is a solution of
\[
\min_{B \in \mathbb{R}^{p \times M}} \; \|\tilde{Y} - XB\|_F^2 + \lambda \sum_{j=1}^{p} w_j \|\beta^j\|_2 . \tag{E.2}
\]

Proof. The first-order necessary optimality conditions for $B^\star$ are
\[
\forall j \in S(B^\star), \quad 2\, x^{j\top} \left( X B^\star - Y \right) + \lambda w_j \|\beta^{\star j}\|_2^{-1} \beta^{\star j} = 0 , \tag{E.3a}
\]
\[
\forall j \notin S(B^\star), \quad 2 \left\| x^{j\top} \left( X B^\star - Y \right) \right\|_2 \leq \lambda w_j , \tag{E.3b}
\]
where $S(B) \subseteq \{1, \ldots, p\}$ denotes the set of non-zero row vectors of $B$ and $\bar{S}(B)$ is its complement.

First, we note that, from the definition of $\tilde{B}$, we have $S(\tilde{B}) = S(B^\star)$. Then we may rewrite the above conditions as follows:
\[
\forall j \in S(\tilde{B}), \quad 2\, x^{j\top} \left( X \tilde{B} - \tilde{Y} \right) + \lambda w_j \|\tilde{\beta}^j\|_2^{-1} \tilde{\beta}^j = 0 , \tag{E.4a}
\]
\[
\forall j \notin S(\tilde{B}), \quad 2 \left\| x^{j\top} \left( X \tilde{B} - \tilde{Y} \right) \right\|_2 \leq \lambda w_j , \tag{E.4b}
\]
where (E.4a) is obtained by multiplying both sides of Equation (E.3a) by $V$, and also uses that $VV^\top = I$, so that $\forall u \in \mathbb{R}^M$, $\|u^\top\|_2 = \|u^\top V\|_2$; Equation (E.4b) is also obtained from the latter relationship. Conditions (E.4) are then recognized as the first-order necessary conditions for $\tilde{B}$ to be a solution of Problem (E.2). As the latter is convex, these conditions are sufficient, which concludes the proof.
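This invariance can also be observed numerically with any row-wise group-Lasso solver. The sketch below is illustrative only: it assumes uniform weights $w_j = 1$ and uses scikit-learn's MultiTaskLasso as an off-the-shelf solver whose penalty is the sum of the row norms of the coefficient matrix; it fits the problem with Y and with YV and compares the two solutions.

import numpy as np
from sklearn.linear_model import MultiTaskLasso

rng = np.random.default_rng(0)
n, p, M = 100, 20, 3
X = rng.normal(size=(n, p))
B_true = np.zeros((p, M)); B_true[:4] = rng.normal(size=(4, M))
Y = X @ B_true + 0.1 * rng.normal(size=(n, M))
V, _ = np.linalg.qr(rng.normal(size=(M, M)))       # a unitary (orthogonal) matrix

lasso = MultiTaskLasso(alpha=0.05, fit_intercept=False, tol=1e-10, max_iter=50000)
B_hat = lasso.fit(X, Y).coef_.T                    # p x M solution of (E.1)
B_rot = lasso.fit(X, Y @ V).coef_.T                # p x M solution of (E.2)

print(np.max(np.abs(B_rot - B_hat @ V)))           # should be close to 0: B_rot = B_hat V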


F Expected Complete Likelihood and Likelihood

Section 7.1.2 explains that, by maximizing the conditional expectation of the complete log-likelihood $Q(\theta, \theta')$ (7.7) by means of the EM algorithm, the log-likelihood (7.1) is also maximized. The value of the log-likelihood can be computed using its definition (7.1), but there is a shorter way to compute it from $Q(\theta, \theta')$ when the latter is available:
\[
L(\theta) = \sum_{i=1}^{n} \log \left( \sum_{k=1}^{K} \pi_k f_k(x_i; \theta_k) \right) , \tag{F.1}
\]
\[
Q(\theta, \theta') = \sum_{i=1}^{n} \sum_{k=1}^{K} t_{ik}(\theta') \log \left( \pi_k f_k(x_i; \theta_k) \right) , \tag{F.2}
\]
\[
\text{with} \quad t_{ik}(\theta') = \frac{\pi'_k f_k(x_i; \theta'_k)}{\sum_{\ell} \pi'_\ell f_\ell(x_i; \theta'_\ell)} . \tag{F.3}
\]
In the EM algorithm, $\theta'$ denotes the model parameters at the previous iteration, $t_{ik}(\theta')$ are the posterior probability values computed from $\theta'$ at the previous E-step, and $\theta$ without "prime" denotes the parameters of the current iteration, to be obtained by the maximization of $Q(\theta, \theta')$.

Using (F.3), we have
\[
Q(\theta, \theta') = \sum_{i,k} t_{ik}(\theta') \log \left( \pi_k f_k(x_i; \theta_k) \right)
= \sum_{i,k} t_{ik}(\theta') \log(t_{ik}(\theta)) + \sum_{i,k} t_{ik}(\theta') \log \left( \sum_{\ell} \pi_\ell f_\ell(x_i; \theta_\ell) \right)
= \sum_{i,k} t_{ik}(\theta') \log(t_{ik}(\theta)) + L(\theta) .
\]
In particular, after the evaluation of $t_{ik}$ in the E-step, where $\theta = \theta'$, the log-likelihood can be computed using the value of $Q(\theta, \theta)$ (7.7) and the entropy of the posterior probabilities:
\[
L(\theta) = Q(\theta, \theta) - \sum_{i,k} t_{ik}(\theta) \log(t_{ik}(\theta)) = Q(\theta, \theta) + H(T) .
\]
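This identity is easy to check numerically; the toy sketch below (a made-up one-dimensional Gaussian mixture, illustrative only) compares the direct evaluation of (F.1) with Q(theta, theta) plus the entropy of the posteriors.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 150), rng.normal(3, 1, 100)])
pi = np.array([0.6, 0.4]); mu = np.array([-2.0, 3.0]); sigma = np.array([1.0, 1.0])

dens = pi * norm.pdf(x[:, None], mu, sigma)        # n x K matrix of pi_k f_k(x_i)
T = dens / dens.sum(axis=1, keepdims=True)         # posterior probabilities t_ik (E-step)

loglik = np.sum(np.log(dens.sum(axis=1)))          # direct evaluation of (F.1)
Q = np.sum(T * np.log(dens))                       # Q(theta, theta) as in (F.2)
H = -np.sum(T * np.log(T))                         # entropy of the posteriors
assert np.isclose(loglik, Q + H)
print(loglik, Q + H)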


G Derivation of the M-Step Equations

This appendix shows the whole process to obtain expressions (7.10), (7.11) and (7.12) in the context of a Gaussian mixture model with a common covariance matrix. The criterion is defined as
\[
Q(\theta, \theta') = \sum_{i,k} t_{ik}(\theta') \log \left( \pi_k f_k(x_i; \theta_k) \right)
= \sum_{k} \left( \sum_{i} t_{ik} \right) \log \pi_k - \frac{np}{2} \log(2\pi) - \frac{n}{2} \log |\Sigma| - \frac{1}{2} \sum_{i,k} t_{ik} (x_i - \mu_k)^\top \Sigma^{-1} (x_i - \mu_k) ,
\]
which has to be maximized subject to $\sum_k \pi_k = 1$.

The Lagrangian of this problem is
\[
L(\theta) = Q(\theta, \theta') + \lambda \left( \sum_k \pi_k - 1 \right) .
\]
The partial derivatives of the Lagrangian are set to zero to obtain the optimal values of $\pi_k$, $\mu_k$ and $\Sigma$.

G.1 Prior probabilities

\[
\frac{\partial L(\theta)}{\partial \pi_k} = 0 \;\Leftrightarrow\; \frac{1}{\pi_k} \sum_i t_{ik} + \lambda = 0 ,
\]
where $\lambda$ is identified from the constraint, leading to
\[
\pi_k = \frac{1}{n} \sum_i t_{ik} .
\]


G.2 Means

\[
\frac{\partial L(\theta)}{\partial \mu_k} = 0 \;\Leftrightarrow\; -\frac{1}{2} \sum_i t_{ik} \, 2 \Sigma^{-1} (\mu_k - x_i) = 0
\;\Rightarrow\; \mu_k = \frac{\sum_i t_{ik} x_i}{\sum_i t_{ik}} .
\]

G.3 Covariance Matrix

\[
\frac{\partial L(\theta)}{\partial \Sigma^{-1}} = 0 \;\Leftrightarrow\;
\underbrace{\frac{n}{2} \Sigma}_{\text{as per Property 4}}
- \underbrace{\frac{1}{2} \sum_{i,k} t_{ik} (x_i - \mu_k)(x_i - \mu_k)^\top}_{\text{as per Property 5}} = 0
\;\Rightarrow\; \Sigma = \frac{1}{n} \sum_{i,k} t_{ik} (x_i - \mu_k)(x_i - \mu_k)^\top .
\]
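The resulting M-step updates can be summarized in a few lines of code; the sketch below is illustrative only (it is not the Mix-GLOSS implementation) and computes the priors, means and common covariance matrix from given responsibilities t_ik.

import numpy as np

def m_step(X, T):
    """X: (n, p) data; T: (n, K) posterior probabilities t_ik. Returns (pi, mu, Sigma)."""
    n, p = X.shape
    nk = T.sum(axis=0)                      # sum_i t_ik
    pi = nk / n                             # prior probabilities
    mu = (T.T @ X) / nk[:, None]            # class means, K x p
    Sigma = np.zeros((p, p))
    for k in range(T.shape[1]):
        D = X - mu[k]
        Sigma += (T[:, k, None] * D).T @ D  # sum_i t_ik (x_i - mu_k)(x_i - mu_k)'
    return pi, mu, Sigma / n                # common covariance matrix

# Example with random responsibilities:
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
T = rng.dirichlet(np.ones(2), size=50)
pi, mu, Sigma = m_step(X, T)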


Bibliography

F Bach R Jenatton J Mairal and G Obozinski Convex optimization with sparsity-inducing norms Optimization for Machine Learning pages 19ndash54 2011

F R Bach Bolasso model consistent lasso estimation through the bootstrap InProceedings of the 25th international conference on Machine learning ICML 2008

F R Bach R Jenatton J Mairal and G Obozinski Optimization with sparsity-inducing penalties Foundations and Trends in Machine Learning 4(1)1ndash106 2012

J D Banfield and A E Raftery Model-based Gaussian and non-Gaussian clusteringBiometrics pages 803ndash821 1993

A Beck and M Teboulle A fast iterative shrinkage-thresholding algorithm for linearinverse problems SIAM Journal on Imaging Sciences 2(1)183ndash202 2009

H Bensmail and G Celeux Regularized Gaussian discriminant analysis through eigen-value decomposition Journal of the American statistical Association 91(436)1743ndash1748 1996

P J Bickel and E Levina Some theory for Fisher's linear discriminant function 'naive Bayes' and some alternatives when there are many more variables than observations Bernoulli 10(6)989-1010 2004

C Bienarcki G Celeux G Govaert and F Langrognet MIXMOD Statistical Documentation http://www.mixmod.org 2008

C M Bishop Pattern Recognition and Machine Learning Springer New York 2006

C Bouveyron and C Brunet Discriminative variable selection for clustering with thesparse Fisher-EM algorithm Technical Report 12042067 Arxiv e-prints 2012a

C Bouveyron and C Brunet Simultaneous model-based clustering and visualization inthe Fisher discriminative subspace Statistics and Computing 22(1)301ndash324 2012b

S Boyd and L Vandenberghe Convex optimization Cambridge university press 2004

L Breiman Better subset regression using the nonnegative garrote Technometrics 37(4)373ndash384 1995

L Breiman and R Ihaka Nonlinear discriminant analysis via ACE and scaling TechnicalReport 40 University of California Berkeley 1984


T Cai and W Liu A direct estimation approach to sparse linear discriminant analysisJournal of the American Statistical Association 106(496)1566ndash1577 2011

S Canu and Y Grandvalet Outcomes of the equivalence of adaptive ridge with leastabsolute shrinkage Advances in Neural Information Processing Systems page 4451999

C Caramanis S Mannor and H Xu Robust optimization in machine learning InS Sra S Nowozin and S J Wright editors Optimization for Machine Learningpages 369ndash402 MIT Press 2012

B Chidlovskii and L Lecerf Scalable feature selection for multi-class problems InW Daelemans B Goethals and K Morik editors Machine Learning and KnowledgeDiscovery in Databases volume 5211 of Lecture Notes in Computer Science pages227ndash240 Springer 2008

L Clemmensen T Hastie D Witten and B Ersboslashll Sparse discriminant analysisTechnometrics 53(4)406ndash413 2011

C De Mol E De Vito and L Rosasco Elastic-net regularization in learning theoryJournal of Complexity 25(2)201ndash230 2009

A P Dempster N M Laird and D B Rubin Maximum likelihood from incompletedata via the em algorithm Journal of the Royal Statistical Society Series B (Method-ological) 39(1)1ndash38 1977 ISSN 0035-9246

D L Donoho M Elad and V N Temlyakov Stable recovery of sparse overcompleterepresentations in the presence of noise IEEE Transactions on Information Theory52(1)6ndash18 2006

R O Duda P E Hart and D G Stork Pattern Classification Wiley 2000

B Efron T Hastie I Johnstone and R Tibshirani Least angle regression The Annalsof statistics 32(2)407ndash499 2004

Jianqing Fan and Yingying Fan High dimensional classification using features annealedindependence rules Annals of statistics 36(6)2605 2008

R A Fisher The use of multiple measurements in taxonomic problems Annals ofHuman Genetics 7(2)179ndash188 1936

V Franc and S Sonnenburg Optimized cutting plane algorithm for support vectormachines In Proceedings of the 25th international conference on Machine learningpages 320ndash327 ACM 2008

J Friedman T Hastie and R Tibshirani The Elements of Statistical Learning DataMining Inference and Prediction Springer 2009


J Friedman T Hastie and R Tibshirani A note on the group lasso and a sparse grouplasso Technical Report 10010736 ArXiv e-prints 2010

J H Friedman Regularized discriminant analysis Journal of the American StatisticalAssociation 84(405)165ndash175 1989

W J Fu Penalized regressions the bridge versus the lasso Journal of Computationaland Graphical Statistics 7(3)397ndash416 1998

A Gelman J B Carlin H S Stern and D B Rubin Bayesian Data Analysis Chap-man amp HallCRC 2003

D Ghosh and A M Chinnaiyan Classification and selection of biomarkers in genomicdata using lasso Journal of Biomedicine and Biotechnology 2147ndash154 2005

G Govaert Y Grandvalet X Liu and L F Sanchez Merchante Implementation baseline for clustering Technical Report D71-m12 Massive Sets of Heuristics for Machine Learning https://secure.mash-project.eu/files/mash-deliverable-D71-m12.pdf 2010

G Govaert Y Grandvalet B Laval X Liu and L F Sanchez Merchante Implementations of original clustering Technical Report D72-m24 Massive Sets of Heuristics for Machine Learning https://secure.mash-project.eu/files/mash-deliverable-D72-m24.pdf 2011

Y Grandvalet Least absolute shrinkage is equivalent to quadratic penalization InPerspectives in Neural Computing volume 98 pages 201ndash206 1998

Y Grandvalet and S Canu Adaptive scaling for feature selection in svms Advances inNeural Information Processing Systems 15553ndash560 2002

L Grosenick S Greer and B Knutson Interpretable classifiers for fMRI improveprediction of purchases IEEE Transactions on Neural Systems and RehabilitationEngineering 16(6)539ndash548 2008

Y Guermeur G Pollastri A Elisseeff D Zelus H Paugam-Moisy and P Baldi Com-bining protein secondary structure prediction models with ensemble methods of opti-mal complexity Neurocomputing 56305ndash327 2004

J Guo E Levina G Michailidis and J Zhu Pairwise variable selection for high-dimensional model-based clustering Biometrics 66(3)793ndash804 2010

I Guyon and A Elisseeff An introduction to variable and feature selection Journal ofMachine Learning Research 31157ndash1182 2003

T Hastie and R Tibshirani Discriminant analysis by Gaussian mixtures Journal ofthe Royal Statistical Society Series B (Methodological) 58(1)155ndash176 1996

T Hastie R Tibshirani and A Buja Flexible discriminant analysis by optimal scoringJournal of the American Statistical Association 89(428)1255ndash1270 1994


T Hastie A Buja and R Tibshirani Penalized discriminant analysis The Annals ofStatistics 23(1)73ndash102 1995

A E Hoerl and R W Kennard Ridge regression Biased estimation for nonorthogonalproblems Technometrics 12(1)55ndash67 1970

J Huang S Ma H Xie and C H Zhang A group bridge approach for variableselection Biometrika 96(2)339ndash355 2009

T Joachims Training linear svms in linear time In Proceedings of the 12th ACMSIGKDD international conference on Knowledge discovery and data mining pages217ndash226 ACM 2006

K Knight and W Fu Asymptotics for lasso-type estimators The Annals of Statistics28(5)1356ndash1378 2000

P F Kuan S Wang X Zhou and H Chu A statistical framework for illumina DNAmethylation arrays Bioinformatics 26(22)2849ndash2855 2010

T Lange M Braun V Roth and J Buhmann Stability-based model selection Ad-vances in Neural Information Processing Systems 15617ndash624 2002

M H C Law M A T Figueiredo and A K Jain Simultaneous feature selectionand clustering using mixture models IEEE Transactions on Pattern Analysis andMachine Intelligence 26(9)1154ndash1166 2004

Y Lee Y Lin and G Wahba Multicategory support vector machines Journal of theAmerican Statistical Association 99(465)67ndash81 2004

C Leng Sparse optimal scoring for multiclass cancer diagnosis and biomarker detectionusing microarray data Computational Biology and Chemistry 32(6)417ndash425 2008

C Leng Y Lin and G Wahba A note on the lasso and related procedures in modelselection Statistica Sinica 16(4)1273 2006

H Liu and L Yu Toward integrating feature selection algorithms for classification andclustering IEEE Transactions on Knowledge and Data Engineering 17(4)491ndash5022005

J MacQueen Some methods for classification and analysis of multivariate observa-tions In Proceedings of the fifth Berkeley Symposium on Mathematical Statistics andProbability volume 1 pages 281ndash297 University of California Press 1967

Q Mai H Zou and M Yuan A direct approach to sparse discriminant analysis inultra-high dimensions Biometrika 99(1)29ndash42 2012

C Maugis G Celeux and M L Martin-Magniette Variable selection for clusteringwith Gaussian mixture models Biometrics 65(3)701ndash709 2009a


C Maugis G Celeux and M L Martin-Magniette Selvarclust software for variable selection in model-based clustering http://www.math.univ-toulouse.fr/~maugis/SelvarClustHomepage.html 2009b

L Meier S Van De Geer and P Buhlmann The group lasso for logistic regressionJournal of the Royal Statistical Society Series B (Statistical Methodology) 70(1)53ndash71 2008

N Meinshausen and P Buhlmann High-dimensional graphs and variable selection withthe lasso The Annals of Statistics 34(3)1436ndash1462 2006

B Moghaddam Y Weiss and S Avidan Generalized spectral bounds for sparse LDAIn Proceedings of the 23rd international conference on Machine learning pages 641ndash648 ACM 2006

B Moghaddam Y Weiss and S Avidan Fast pixelpart selection with sparse eigen-vectors In IEEE 11th International Conference on Computer Vision 2007 ICCV2007 pages 1ndash8 2007

Y Nesterov Gradient methods for minimizing composite functions preprint 2007

S Newcomb A generalized theory of the combination of observations so as to obtainthe best result American Journal of Mathematics 8(4)343ndash366 1886

B Ng and R Abugharbieh Generalized group sparse classifiers with application in fMRIbrain decoding In Computer Vision and Pattern Recognition (CVPR) 2011 IEEEConference on pages 1065ndash1071 IEEE 2011

M R Osborne B Presnell and B A Turlach On the lasso and its dual Journal ofComputational and Graphical statistics 9(2)319ndash337 2000a

M R Osborne B Presnell and B A Turlach A new approach to variable selection inleast squares problems IMA Journal of Numerical Analysis 20(3)389ndash403 2000b

W Pan and X Shen Penalized model-based clustering with application to variableselection Journal of Machine Learning Research 81145ndash1164 2007

W Pan X Shen A Jiang and R P Hebbel Semi-supervised learning via penalizedmixture model with application to microarray sample classification Bioinformatics22(19)2388ndash2395 2006

K Pearson Contributions to the mathematical theory of evolution Philosophical Trans-actions of the Royal Society of London 18571ndash110 1894

S Perkins K Lacker and J Theiler Grafting Fast incremental feature selection bygradient descent in function space Journal of Machine Learning Research 31333ndash1356 2003


Z Qiao L Zhou and J Huang Sparse linear discriminant analysis with applications tohigh dimensional low sample size data International Journal of Applied Mathematics39(1) 2009

A E Raftery and N Dean Variable selection for model-based clustering Journal ofthe American Statistical Association 101(473)168ndash178 2006

C R Rao The utilization of multiple measurements in problems of biological classi-fication Journal of the Royal Statistical Society Series B (Methodological) 10(2)159ndash203 1948

S Rosset and J Zhu Piecewise linear regularized solution paths The Annals of Statis-tics 35(3)1012ndash1030 2007

V Roth The generalized lasso IEEE Transactions on Neural Networks 15(1)16ndash282004

V Roth and B Fischer The group-lasso for generalized linear models uniqueness ofsolutions and efficient algorithms In W W Cohen A McCallum and S T Roweiseditors Machine Learning Proceedings of the Twenty-Fifth International Conference(ICML 2008) volume 307 of ACM International Conference Proceeding Series pages848ndash855 2008

V Roth and T Lange Feature selection in clustering problems In S Thrun L KSaul and B Scholkopf editors Advances in Neural Information Processing Systems16 pages 473ndash480 MIT Press 2004

C Sammut and G I Webb Encyclopedia of Machine Learning Springer-Verlag NewYork Inc 2010

L F Sanchez Merchante Y Grandvalet and G Govaert An efficient approach to sparselinear discriminant analysis In Proceedings of the 29th International Conference onMachine Learning ICML 2012

Gideon Schwarz Estimating the dimension of a model The annals of statistics 6(2)461ndash464 1978

A J Smola SVN Vishwanathan and Q Le Bundle methods for machine learningAdvances in Neural Information Processing Systems 201377ndash1384 2008

S Sonnenburg G Ratsch C Schafer and B Scholkopf Large scale multiple kernellearning Journal of Machine Learning Research 71531ndash1565 2006

P Sprechmann I Ramirez G Sapiro and Y Eldar Collaborative hierarchical sparsemodeling In Information Sciences and Systems (CISS) 2010 44th Annual Conferenceon pages 1ndash6 IEEE 2010

M Szafranski Penalites Hierarchiques pour lrsquoIntegration de Connaissances dans lesModeles Statistiques PhD thesis Universite de Technologie de Compiegne 2008


M Szafranski Y Grandvalet and P Morizet-Mahoudeaux Hierarchical penalizationAdvances in Neural Information Processing Systems 2008

R Tibshirani Regression shrinkage and selection via the lasso Journal of the RoyalStatistical Society Series B (Methodological) pages 267ndash288 1996

J E Vogt and V Roth The group-lasso l1 regularization versus l12 regularization InPattern Recognition 32-nd DAGM Symposium Lecture Notes in Computer Science2010

S Wang and J Zhu Variable selection for model-based high-dimensional clustering andits application to microarray data Biometrics 64(2)440ndash448 2008

D Witten and R Tibshirani Penalized classification using Fisherrsquos linear discriminantJournal of the Royal Statistical Society Series B (Statistical Methodology) 73(5)753ndash772 2011

D M Witten and R Tibshirani A framework for feature selection in clustering Journalof the American Statistical Association 105(490)713ndash726 2010

D M Witten R Tibshirani and T Hastie A penalized matrix decomposition withapplications to sparse principal components and canonical correlation analysis Bio-statistics 10(3)515ndash534 2009

M Wu and B Scholkopf A local learning approach for clustering Advances in NeuralInformation Processing Systems 191529 2007

MC Wu L Zhang Z Wang DC Christiani and X Lin Sparse linear discriminantanalysis for simultaneous testing for the significance of a gene setpathway and geneselection Bioinformatics 25(9)1145ndash1151 2009

T T Wu and K Lange Coordinate descent algorithms for lasso penalized regressionThe Annals of Applied Statistics pages 224ndash244 2008

B Xie W Pan and X Shen Penalized model-based clustering with cluster-specificdiagonal covariance matrices and grouped variables Electronic Journal of Statistics2168ndash172 2008a

B Xie W Pan and X Shen Variable selection in penalized model-based clustering viaregularization on grouped parameters Biometrics 64(3)921ndash930 2008b

C Yang X Wan Q Yang H Xue and W Yu Identifying main effects and epistaticinteractions from large-scale snp data via adaptive group lasso BMC bioinformatics11(Suppl 1)S18 2010

J Ye Least squares linear discriminant analysis In Proceedings of the 24th internationalconference on Machine learning pages 1087ndash1093 ACM 2007


M Yuan and Y Lin Model selection and estimation in regression with grouped variablesJournal of the Royal Statistical Society Series B (Statistical Methodology) 68(1)49ndash67 2006

P Zhao and B Yu On model selection consistency of lasso Journal of Machine LearningResearch 7(2)2541 2007

P Zhao G Rocha and B Yu The composite absolute penalties family for grouped andhierarchical variable selection The Annals of Statistics 37(6A)3468ndash3497 2009

H Zhou W Pan and X Shen Penalized model-based clustering with unconstrainedcovariance matrices Electronic Journal of Statistics 31473ndash1496 2009

H Zou The adaptive lasso and its oracle properties Journal of the American StatisticalAssociation 101(476)1418ndash1429 2006

H Zou and T Hastie Regularization and variable selection via the elastic net Journal ofthe Royal Statistical Society Series B (Statistical Methodology) 67(2)301ndash320 2005


  • SANCHEZ MERCHANTE PDTpdf
  • Thesis Luis Francisco Sanchez Merchantepdf
    • List of figures
    • List of tables
    • Notation and Symbols
    • Context and Foundations
      • Context
      • Regularization for Feature Selection
        • Motivations
        • Categorization of Feature Selection Techniques
        • Regularization
          • Important Properties
          • Pure Penalties
          • Hybrid Penalties
          • Mixed Penalties
          • Sparsity Considerations
          • Optimization Tools for Regularized Problems
            • Sparse Linear Discriminant Analysis
              • Abstract
              • Feature Selection in Fisher Discriminant Analysis
                • Fisher Discriminant Analysis
                • Feature Selection in LDA Problems
                  • Inertia Based
                  • Regression Based
                      • Formalizing the Objective
                        • From Optimal Scoring to Linear Discriminant Analysis
                          • Penalized Optimal Scoring Problem
                          • Penalized Canonical Correlation Analysis
                          • Penalized Linear Discriminant Analysis
                          • Summary
                            • Practicalities
                              • Solution of the Penalized Optimal Scoring Regression
                              • Distance Evaluation
                              • Posterior Probability Evaluation
                              • Graphical Representation
                                • From Sparse Optimal Scoring to Sparse LDA
                                  • A Quadratic Variational Form
                                  • Group-Lasso OS as Penalized LDA
                                      • GLOSS Algorithm
                                        • Regression Coefficients Updates
                                          • Cholesky decomposition
                                          • Numerical Stability
                                            • Score Matrix
                                            • Optimality Conditions
                                            • Active and Inactive Sets
                                            • Penalty Parameter
                                            • Options and Variants
                                              • Scaling Variables
                                              • Sparse Variant
                                              • Diagonal Variant
                                              • Elastic net and Structured Variant
                                                  • Experimental Results
                                                    • Normalization
                                                    • Decision Thresholds
                                                    • Simulated Data
                                                    • Gene Expression Data
                                                    • Correlated Data
                                                      • Discussion
                                                        • Sparse Clustering Analysis
                                                          • Abstract
                                                          • Feature Selection in Mixture Models
                                                            • Mixture Models
                                                              • Model
                                                              • Parameter Estimation The EM Algorithm
                                                                • Feature Selection in Model-Based Clustering
                                                                  • Based on Penalized Likelihood
                                                                  • Based on Model Variants
                                                                  • Based on Model Selection
                                                                      • Theoretical Foundations
                                                                        • Resolving EM with Optimal Scoring
                                                                          • Relationship Between the M-Step and Linear Discriminant Analysis
                                                                          • Relationship Between Optimal Scoring and Linear Discriminant Analysis
                                                                          • Clustering Using Penalized Optimal Scoring
                                                                          • From Sparse Optimal Scoring to Sparse Linear Discriminant Analysis
                                                                            • Optimized Criterion
                                                                              • A Bayesian Derivation
                                                                              • Maximum a Posteriori Estimator
                                                                                  • Mix-GLOSS Algorithm
                                                                                    • Mix-GLOSS
                                                                                      • Outer Loop Whole Algorithm Repetitions
                                                                                      • Penalty Parameter Loop
                                                                                      • Inner Loop EM Algorithm
                                                                                        • Model Selection
                                                                                          • Experimental Results
                                                                                            • Tested Clustering Algorithms
                                                                                            • Results
                                                                                            • Discussion
                                                                                                • Conclusions
                                                                                                • Appendix
                                                                                                  • Matrix Properties
                                                                                                  • The Penalized-OS Problem is an Eigenvector Problem
                                                                                                    • How to Solve the Eigenvector Decomposition
                                                                                                    • Why the OS Problem is Solved as an Eigenvector Problem
                                                                                                      • Solving Fishers Discriminant Problem
                                                                                                      • Alternative Variational Formulation for the Group-Lasso
                                                                                                        • Useful Properties
                                                                                                        • An Upper Bound on the Objective Function
                                                                                                          • Invariance of the Group-Lasso to Unitary Transformations
                                                                                                          • Expected Complete Likelihood and Likelihood
                                                                                                          • Derivation of the M-Step Equations
                                                                                                            • Prior probabilities
                                                                                                            • Means
                                                                                                            • Covariance Matrix
                                                                                                                • Bibliography
Page 2: Luis Francisco Sanchez Merchante To cite this version

Par Luis Francisco SANCHEZ MERCHANTE

Thegravese preacutesenteacutee pour lrsquoobtention du grade de Docteur de lrsquoUTC

Learning algorithms for sparse classification

Soutenue le 07 juin 2013

Speacutecialiteacute Technologies de lrsquoInformation et des Systegravemes

D2084

Algorithmes drsquoestimation pour laclassification parcimonieuse

Luis Francisco Sanchez MerchanteUniversity of Compiegne

CompiegneFrance

ldquoNunca se sabe que encontrara uno tras una puerta Quiza en eso consistela vida en girar pomosrdquo

Albert Espinosa

ldquoBe brave Take risks Nothing can substitute experiencerdquo

Paulo Coelho

Acknowledgements

If this thesis has fallen into your hands and you have the curiosity to read this para-graph you must know that even though it is a short section there are quite a lot ofpeople behind this volume All of them supported me during the three years threemonths and three weeks that it took me to finish this work However you will hardlyfind any names I think it is a little sad writing peoplersquos names in a document that theywill probably not see and that will be condemned to gather dust on a bookshelf It islike losing a wallet with pictures of your beloved family and friends It makes me feelsomething like melancholy

Obviously this does not mean that I have nothing to be grateful for I always feltunconditional love and support from my family and I never felt homesick since my spanishfriends did the best they could to visit me frequently During my time in CompiegneI met wonderful people that are now friends for life I am sure that all this people donot need to be listed in this section to know how much I love them I thank them everytime we see each other by giving them the best of myself

I enjoyed my time in Compiegne It was an exciting adventure and I do not regreta single thing I am sure that I will miss these days but this does not make me sadbecause as the Beatles sang in ldquoThe endrdquo or Jorge Drexler in ldquoTodo se transformardquo theamount that you miss people is equal to the love you gave them and received from them

The only names I am including are my supervisorsrsquo Yves Grandvalet and GerardGovaert I do not think it is possible to have had better teaching and supervision andI am sure that the reason I finished this work was not only thanks to their technicaladvice but also but also thanks to their close support humanity and patience

Contents

List of Figures

List of Tables

Notation and Symbols

I Context and Foundations

1 Context

2 Regularization for Feature Selection
  2.1 Motivations
  2.2 Categorization of Feature Selection Techniques
  2.3 Regularization
    2.3.1 Important Properties
    2.3.2 Pure Penalties
    2.3.3 Hybrid Penalties
    2.3.4 Mixed Penalties
    2.3.5 Sparsity Considerations
    2.3.6 Optimization Tools for Regularized Problems

II Sparse Linear Discriminant Analysis

Abstract

3 Feature Selection in Fisher Discriminant Analysis
  3.1 Fisher Discriminant Analysis
  3.2 Feature Selection in LDA Problems
    3.2.1 Inertia Based
    3.2.2 Regression Based

4 Formalizing the Objective
  4.1 From Optimal Scoring to Linear Discriminant Analysis
    4.1.1 Penalized Optimal Scoring Problem
    4.1.2 Penalized Canonical Correlation Analysis
    4.1.3 Penalized Linear Discriminant Analysis
    4.1.4 Summary
  4.2 Practicalities
    4.2.1 Solution of the Penalized Optimal Scoring Regression
    4.2.2 Distance Evaluation
    4.2.3 Posterior Probability Evaluation
    4.2.4 Graphical Representation
  4.3 From Sparse Optimal Scoring to Sparse LDA
    4.3.1 A Quadratic Variational Form
    4.3.2 Group-Lasso OS as Penalized LDA

5 GLOSS Algorithm
  5.1 Regression Coefficients Updates
    5.1.1 Cholesky decomposition
    5.1.2 Numerical Stability
  5.2 Score Matrix
  5.3 Optimality Conditions
  5.4 Active and Inactive Sets
  5.5 Penalty Parameter
  5.6 Options and Variants
    5.6.1 Scaling Variables
    5.6.2 Sparse Variant
    5.6.3 Diagonal Variant
    5.6.4 Elastic net and Structured Variant

6 Experimental Results
  6.1 Normalization
  6.2 Decision Thresholds
  6.3 Simulated Data
  6.4 Gene Expression Data
  6.5 Correlated Data
  Discussion

III Sparse Clustering Analysis

Abstract

7 Feature Selection in Mixture Models
  7.1 Mixture Models
    7.1.1 Model
    7.1.2 Parameter Estimation: The EM Algorithm
  7.2 Feature Selection in Model-Based Clustering
    7.2.1 Based on Penalized Likelihood
    7.2.2 Based on Model Variants
    7.2.3 Based on Model Selection

8 Theoretical Foundations
  8.1 Resolving EM with Optimal Scoring
    8.1.1 Relationship Between the M-Step and Linear Discriminant Analysis
    8.1.2 Relationship Between Optimal Scoring and Linear Discriminant Analysis
    8.1.3 Clustering Using Penalized Optimal Scoring
    8.1.4 From Sparse Optimal Scoring to Sparse Linear Discriminant Analysis
  8.2 Optimized Criterion
    8.2.1 A Bayesian Derivation
    8.2.2 Maximum a Posteriori Estimator

9 Mix-GLOSS Algorithm
  9.1 Mix-GLOSS
    9.1.1 Outer Loop: Whole Algorithm Repetitions
    9.1.2 Penalty Parameter Loop
    9.1.3 Inner Loop: EM Algorithm
  9.2 Model Selection

10 Experimental Results
  10.1 Tested Clustering Algorithms
  10.2 Results
  10.3 Discussion

Conclusions

Appendix

A Matrix Properties

B The Penalized-OS Problem is an Eigenvector Problem
  B.1 How to Solve the Eigenvector Decomposition
  B.2 Why the OS Problem is Solved as an Eigenvector Problem

C Solving Fisher's Discriminant Problem

D Alternative Variational Formulation for the Group-Lasso
  D.1 Useful Properties
  D.2 An Upper Bound on the Objective Function

E Invariance of the Group-Lasso to Unitary Transformations

F Expected Complete Likelihood and Likelihood

G Derivation of the M-Step Equations
  G.1 Prior probabilities
  G.2 Means
  G.3 Covariance Matrix

Bibliography

List of Figures

1.1 MASH project logo
2.1 Example of relevant features
2.2 Four key steps of feature selection
2.3 Admissible sets in two dimensions for different pure norms ||β||_p
2.4 Two dimensional regularized problems with ||β||_1 and ||β||_2 penalties
2.5 Admissible sets for the Lasso and Group-Lasso
2.6 Sparsity patterns for an example with 8 variables characterized by 4 parameters
4.1 Graphical representation of the variational approach to Group-Lasso
5.1 GLOSS block diagram
5.2 Graph and Laplacian matrix for a 3 x 3 image
6.1 TPR versus FPR for all simulations
6.2 2D-representations of Nakayama and Sun datasets based on the two first discriminant vectors provided by GLOSS and SLDA
6.3 USPS digits "1" and "0"
6.4 Discriminant direction between digits "1" and "0"
6.5 Sparse discriminant direction between digits "1" and "0"
9.1 Mix-GLOSS Loops Scheme
9.2 Mix-GLOSS model selection diagram
10.1 Class mean vectors for each artificial simulation
10.2 TPR versus FPR for all simulations

List of Tables

6.1 Experimental results for simulated data, supervised classification
6.2 Average TPR and FPR for all simulations
6.3 Experimental results for gene expression data, supervised classification
10.1 Experimental results for simulated data, unsupervised clustering
10.2 Average TPR versus FPR for all clustering simulations

Notation and Symbols

Throughout this thesis, vectors are denoted by lowercase letters in bold font and matrices by uppercase letters in bold font. Unless otherwise stated, vectors are column vectors and parentheses are used to build line vectors from comma-separated lists of scalars, or to build matrices from comma-separated lists of column vectors.

Sets

N : the set of natural numbers, N = {1, 2, ...}
R : the set of reals
|A| : cardinality of a set A (for finite sets, the number of elements)
Ā : complement of set A

Data

X : input domain
x_i : input sample, x_i ∈ X
X : design matrix, X = (x_1^T, ..., x_n^T)^T
x^j : column j of X
y_i : class indicator of sample i
Y : indicator matrix, Y = (y_1^T, ..., y_n^T)^T
z : complete data, z = (x, y)
G_k : set of the indices of observations belonging to class k
n : number of examples
K : number of classes
p : dimension of X
i, j, k : indices running over N

Vectors Matrices and Norms

0 : vector with all entries equal to zero
1 : vector with all entries equal to one
I : identity matrix
A^T : transpose of matrix A (ditto for vectors)
A^-1 : inverse of matrix A
tr(A) : trace of matrix A
|A| : determinant of matrix A
diag(v) : diagonal matrix with v on the diagonal
||v||_1 : L1 norm of vector v
||v||_2 : L2 norm of vector v
||A||_F : Frobenius norm of matrix A



Probability

E[·] : expectation of a random variable
var[·] : variance of a random variable
N(μ, σ²) : normal distribution with mean μ and variance σ²
W(W, ν) : Wishart distribution with ν degrees of freedom and W scale matrix
H(X) : entropy of random variable X
I(X, Y) : mutual information between random variables X and Y

Mixture Models

y_ik : hard membership of sample i to cluster k
f_k : distribution function for cluster k
t_ik : posterior probability of sample i to belong to cluster k
T : posterior probability matrix
π_k : prior probability or mixture proportion for cluster k
μ_k : mean vector of cluster k
Σ_k : covariance matrix of cluster k
θ_k : parameter vector for cluster k, θ_k = (μ_k, Σ_k)
θ^(t) : parameter vector at iteration t of the EM algorithm
f(X; θ) : likelihood function
L(θ; X) : log-likelihood function
L_C(θ; X, Y) : complete log-likelihood function

Optimization

J(·) : cost function
L(·) : Lagrangian
β̂ : generic notation for the solution w.r.t. β
β^ls : least squares solution coefficient vector
A : active set
γ : step size to update the regularization path
h : direction to update the regularization path



Penalized models

λ, λ1, λ2 : penalty parameters
P_λ(θ) : penalty term over a generic parameter vector
β_kj : coefficient j of discriminant vector k
β_k : kth discriminant vector, β_k = (β_k1, ..., β_kp)
B : matrix of discriminant vectors, B = (β_1, ..., β_{K-1})
β^j : jth row of B = (β^{1T}, ..., β^{pT})^T
B_LDA : coefficient matrix in the LDA domain
B_CCA : coefficient matrix in the CCA domain
B_OS : coefficient matrix in the OS domain
X_LDA : data matrix in the LDA domain
X_CCA : data matrix in the CCA domain
X_OS : data matrix in the OS domain
θ_k : score vector k
Θ : score matrix, Θ = (θ_1, ..., θ_{K-1})
Y : label matrix
Ω : penalty matrix
L_CP(θ; X, Z) : penalized complete log-likelihood function
Σ_B : between-class covariance matrix
Σ_W : within-class covariance matrix
Σ_T : total covariance matrix
Σ̂_B : sample between-class covariance matrix
Σ̂_W : sample within-class covariance matrix
Σ̂_T : sample total covariance matrix
Λ : inverse of covariance matrix, or precision matrix
w_j : weights
τ_j : penalty components of the variational approach


Part I

Context and Foundations


This thesis is divided in three parts. In Part I, I introduce the context in which this work has been developed, the project that funded it and the constraints that we had to obey. The models and some basic concepts that will be used along this document are also introduced here, and the state of the art is reviewed.

The first contribution of this thesis is explained in Part II, where I present the supervised learning algorithm GLOSS and its supporting theory, as well as some experiments to test its performance compared to other state of the art mechanisms. Before describing the algorithm and the experiments, its theoretical foundations are provided.

The second contribution is described in Part III, with a structure analogous to Part II but for the unsupervised domain. The clustering algorithm Mix-GLOSS adapts the supervised technique from Part II by means of a modified EM. This part is also furnished with specific theoretical foundations, an experimental section and a final discussion.


1 Context

The MASH project is a research initiative to investigate the open and collaborative design of feature extractors for the Machine Learning scientific community. The project is structured around a web platform (http://mash-project.eu) comprising collaborative tools such as wiki-documentation, forums, coding templates and an experiment center empowered with non-stop calculation servers. The applications targeted by MASH are vision and goal-planning problems, either in a 3D virtual environment or with a real robotic arm.

The MASH consortium is led by the IDIAP Research Institute in Switzerland. The other members are the University of Potsdam in Germany, the Czech Technical University of Prague, the National Institute for Research in Computer Science and Control (INRIA) in France, and the National Centre for Scientific Research (CNRS), also in France, through the laboratory of Heuristics and Diagnosis for Complex Systems (HEUDIASYC) attached to the University of Technology of Compiegne.

From the point of view of the research, the members of the consortium must deal with four main goals:

1. Software development of website, framework and APIs.

2. Classification and goal-planning in high dimensional feature spaces.

3. Interfacing the platform with the 3D virtual environment and the robot arm.

4. Building tools to assist contributors with the development of the feature extractors and the configuration of the experiments.


Figure 1.1: MASH project logo



The work detailed in this text has been done in the context of goal 4. From the very beginning of the project, our role is to provide the users with some feedback regarding the feature extractors. At the moment of writing this thesis, the number of public feature extractors reaches 75. In addition to the public ones, there are also private extractors that contributors decide not to share with the rest of the community. The last number I was aware of was about 300. Within those 375 extractors, there must be some of them sharing the same theoretical principles or supplying similar features. The framework of the project tests every new piece of code with some datasets of reference in order to provide a ranking depending on the quality of the estimation. However, similar performance of two extractors for a particular dataset does not mean that both are using the same variables.

Our engagement was to provide some textual or graphical tools to discover which extractors compute features similar to other ones. Our hypothesis is that many of them use the same theoretical foundations, which should induce a grouping of similar extractors. If we succeed in discovering those groups, we would also be able to select representatives. This information can be used in several ways. For example, from the perspective of a user that develops feature extractors, it would be interesting to compare the performance of his code against the K representatives instead of against the whole database. As another example, imagine a user wants to obtain the best prediction results for a particular dataset. Instead of selecting all the feature extractors, creating an extremely high dimensional space, he could select only the K representatives, foreseeing similar results with a faster experiment.

As there is no prior knowledge about the latent structure, we make use of unsupervised techniques. Below there is a brief description of the different tools that we developed for the web platform.

• Clustering Using Mixture Models: This is a well-known technique that models the data as if it was randomly generated from a distribution function. This distribution is typically a mixture of Gaussians with unknown mixture proportions, means and covariance matrices. The number of Gaussian components matches the number of expected groups. The parameters of the model are computed using the EM algorithm and the clusters are built by maximum a posteriori estimation. For the calculation we use mixmod, a C++ library that can be interfaced with Matlab. This library allows working with high dimensional data. Further information regarding mixmod is given by Biernacki et al. (2008). All details concerning the tool implemented are given in deliverable "mash-deliverable-D71-m12" (Govaert et al., 2010).

• Sparse Clustering Using Penalized Optimal Scoring: This technique intends again to perform clustering by modelling the data as a mixture of Gaussian distributions. However, instead of using a classic EM algorithm for estimating the components' parameters, the M-step is replaced by a penalized Optimal Scoring problem. This replacement induces sparsity, improving the robustness and the interpretability of the results. Its theory will be explained later in this thesis. All details concerning the tool implemented can be found in deliverable "mash-deliverable-D72-m24" (Govaert et al., 2011).

• Table Clustering Using The RV Coefficient: This technique applies clustering methods directly to the tables computed by the feature extractors, instead of creating a single matrix. A distance in the extractors space is defined using the RV coefficient, which is a multivariate generalization of Pearson's correlation coefficient in the form of an inner product. The distance is defined for every pair i and j as RV(O_i, O_j), where O_i and O_j are operators computed from the tables returned by feature extractors i and j. Once we have a distance matrix, several standard techniques may be used to group extractors; a small numerical sketch of the coefficient is given right after this list. A detailed description of this technique can be found in deliverables "mash-deliverable-D71-m12" (Govaert et al., 2010) and "mash-deliverable-D72-m24" (Govaert et al., 2011).
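For concreteness, here is a minimal numerical sketch of the RV coefficient between two centered data tables, written in Python with NumPy. It is only an illustration of the similarity measure described above, not the code deployed on the MASH platform; the function and variable names are ours.

```python
import numpy as np

def rv_coefficient(X, Y):
    """RV coefficient between two data tables X (n x p) and Y (n x q) on the same n samples."""
    Xc = X - X.mean(axis=0)          # center each feature
    Yc = Y - Y.mean(axis=0)
    Sx = Xc @ Xc.T                   # n x n configuration (inner-product) operators
    Sy = Yc @ Yc.T
    num = np.trace(Sx @ Sy)
    den = np.sqrt(np.trace(Sx @ Sx) * np.trace(Sy @ Sy))
    return num / den

# Toy usage: two extractors computing linearly related features give an RV close to 1.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 5))
B = A @ rng.normal(size=(5, 3))      # features of B are combinations of A's features
print(rv_coefficient(A, B))
```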

I am not extending this section with further explanations about the MASH project or deeper details about the theory that we used to commit our engagements. I will simply refer to the public deliverables of the project, where everything is carefully detailed (Govaert et al., 2010, 2011).


2 Regularization for Feature Selection

With the advances in technology, data is becoming larger and larger, resulting in high dimensional ensembles of information. Genomics, textual indexation and medical images are some examples of data that can easily exceed thousands of dimensions. The first experiments aiming to cluster the data from the MASH project (see Chapter 1) intended to work with the whole dimensionality of the samples. As the number of feature extractors rose, the numerical issues also rose. Redundancy or extremely correlated features may happen if two contributors implement the same extractor with different names. When the number of features exceeded the number of samples, we started to deal with singular covariance matrices, whose inverses are not defined. Many algorithms in the field of Machine Learning make use of this statistic.

2.1 Motivations

There is a quite recent effort in the direction of handling high dimensional data. Traditional techniques can be adapted, but quite often large dimensions turn those techniques useless. Linear Discriminant Analysis was shown to be no better than a "random guessing" of the object labels when the dimension is larger than the sample size (Bickel and Levina, 2004; Fan and Fan, 2008).

As a rule of thumb, in discriminant and clustering problems the complexity of calculus increases with the number of objects in the database, the number of features (dimensionality) and the number of classes or clusters. One way to reduce this complexity is to reduce the number of features. This reduction induces more robust estimators, allows faster learning and predictions in the supervised environments, and easier interpretations in the unsupervised framework. Removing features must be done wisely to avoid removing critical information.

When talking about dimensionality reduction, there are two families of techniques that could induce confusion.

• Reduction by feature transformations summarizes the dataset with fewer dimensions by creating combinations of the original attributes. These techniques are less effective when there are many irrelevant attributes (noise). Principal Component Analysis or Independent Component Analysis are two popular examples.

• Reduction by feature selection removes irrelevant dimensions, preserving the integrity of the informative features from the original dataset. The problem comes out when there is a restriction in the number of variables to preserve and discarding the exceeding dimensions leads to a loss of information. Prediction with feature selection is computationally cheaper because only relevant features are used, and the resulting models are easier to interpret. The Lasso operator is an example of this category.

Figure 2.1: Example of relevant features from Chidlovskii and Lecerf (2008)

As a basic rule, we can use the reduction techniques by feature transformation when the majority of the features are relevant and when there is a lot of redundancy or correlation. On the contrary, feature selection techniques are useful when there are plenty of useless or noisy features (irrelevant information) that need to be filtered out. In the paper of Chidlovskii and Lecerf (2008) we find a great explanation about the difference between irrelevant and redundant features. The following two paragraphs are almost exact reproductions of their text.

"Irrelevant features are those which provide negligible distinguishing information. For example, if the objects are all dogs, cats or squirrels, and it is desired to classify each new animal into one of these three classes, the feature of color may be irrelevant if each of dogs, cats and squirrels have about the same distribution of brown, black and tan fur colors. In such a case, knowing that an input animal is brown provides negligible distinguishing information for classifying the animal as a cat, dog or squirrel. Features which are irrelevant for a given classification problem are not useful, and accordingly a feature that is irrelevant can be filtered out.

Redundant features are those which provide distinguishing information but are cumulative to another feature or group of features that provide substantially the same distinguishing information. Using the previous example, consider illustrative "diet" and "domestication" features. Dogs and cats both have similar carnivorous diets, while squirrels consume nuts and so forth. Thus the "diet" feature can efficiently distinguish squirrels from dogs and cats, although it provides little information to distinguish between dogs and cats. Dogs and cats are also both typically domesticated animals, while squirrels are wild animals. Thus the "domestication" feature provides substantially the same information as the "diet" feature, namely distinguishing squirrels from dogs and cats but not distinguishing between dogs and cats. Thus the "diet" and "domestication" features are cumulative, and one can identify one of these features as redundant so as to be filtered out. However, unlike irrelevant features, care should be taken with redundant features to ensure that one retains enough of the redundant features to provide the relevant distinguishing information. In the foregoing example one may wish to filter out either the "diet" feature or the "domestication" feature, but if one removes both the "diet" and the "domestication" features, then useful distinguishing information is lost."

Figure 2.2: The four key steps of feature selection according to Liu and Yu (2005)

There are some tricks to build robust estimators when the number of features exceeds the number of samples. Ignoring some of the dependencies among variables and replacing the covariance matrix by a diagonal approximation are two of them. Another popular technique, and the one chosen in this thesis, is imposing regularity conditions.

2.2 Categorization of Feature Selection Techniques

Feature selection is one of the most frequent techniques in preprocessing data, in order to remove irrelevant, redundant or noisy features. Nevertheless, the risk of removing some informative dimensions is always there, thus the relevance of the remaining subset of features must be measured.

I am reproducing here the scheme that generalizes any feature selection process as it is shown by Liu and Yu (2005). Figure 2.2 provides a very intuitive scheme with the four key steps in a feature selection algorithm.

The classification of those algorithms can respond to different criteria. Guyon and Elisseeff (2003) propose a check list that summarizes the steps that may be taken to solve a feature selection problem, guiding the user through several techniques. Liu and Yu (2005) propose a framework that integrates supervised and unsupervised feature selection algorithms through a categorizing framework. Both references are excellent reviews to characterize feature selection techniques according to their characteristics. I am proposing a framework inspired by these references that does not cover all the possibilities, but which gives a good summary about existing possibilities.

• Depending on the type of integration with the machine learning algorithm, we have:

– Filter Models - The filter models work as a preprocessing step, using an independent evaluation criterion to select a subset of variables without assistance of the mining algorithm.

– Wrapper Models - The wrapper models require a classification or clustering algorithm and use its prediction performance to assess the relevance of the subset selection. The feature selection is done in the optimization block, while the feature subset evaluation is done in a different one. Therefore, the criterion to optimize and to evaluate may be different. Those algorithms are computationally expensive.

– Embedded Models - They perform variable selection inside the learning machine, with the selection being made at the training step. That means that there is only one criterion; the optimization and the evaluation are a single block, and the features are selected to optimize this unique criterion and do not need to be re-evaluated in a later phase. That makes them more efficient, since no validation or test process is needed for every variable subset investigated. However, they are less universal because they are specific of the training process for a given mining algorithm.

• Depending on the feature searching technique:

– Complete - No subsets are missed from evaluation. Involves combinatorial searches.

– Sequential - Features are added (forward searches) or removed (backward searches) one at a time.

– Random - The initial subset or even subsequent subsets are randomly chosen to escape local optima.

• Depending on the evaluation technique:

– Distance Measures - Choosing the features that maximize the difference in separability, divergence or discrimination measures.

– Information Measures - Choosing the features that maximize the information gain, that is, minimizing the posterior uncertainty.

– Dependency Measures - Measuring the correlation between features.

– Consistency Measures - Finding a minimum number of features that separate classes as consistently as the full set of features can.

– Predictive Accuracy - Use the selected features to predict the labels.

– Cluster Goodness - Use the selected features to perform clustering and evaluate the result (cluster compactness, scatter separability, maximum likelihood).

The distance, information, correlation and consistency measures are typical of variable ranking algorithms, commonly used in filter models. Predictive accuracy and cluster goodness allow evaluating subsets of features, and can be used in wrapper and embedded models.

In this thesis we developed some algorithms following the embedded paradigm, either in the supervised or the unsupervised framework. Integrating the subset selection problem in the overall learning problem may be computationally demanding, but it is appealing from a conceptual viewpoint: there is a perfect match between the formalized goal and the process dedicated to achieve this goal, thus avoiding many problems arising in filter or wrapper methods. Practically, it is however intractable to solve exactly hard selection problems when the number of features exceeds a few tens. Regularization techniques allow to provide a sensible approximate answer to the selection problem with reasonable computing resources, and their recent study has demonstrated powerful theoretical and empirical results. The following section introduces the tools that will be employed in Parts II and III.

2.3 Regularization

In the machine learning domain, the term "regularization" refers to a technique that introduces some extra assumptions or knowledge in the resolution of an optimization problem. The most popular point of view presents regularization as a mechanism to prevent overfitting, but it can also help to fix some numerical issues on ill-posed problems (like some matrix singularities when solving a linear system), besides other interesting properties like the capacity to induce sparsity, thus producing models that are easier to interpret.

An ill-posed problem violates the rules defined by Jacques Hadamard, according to whom the solution to a mathematical problem has to exist, be unique and stable. This happens, for example, when the number of samples is smaller than their dimensionality and we try to infer some generic laws from such a low sample of the population. Regularization transforms an ill-posed problem into a well-posed one. To do that, some a priori knowledge is introduced in the solution through a regularization term that penalizes a criterion J with a penalty P. Below are the two most popular formulations:

$$\min_{\beta}\; J(\beta) + \lambda P(\beta) \qquad (2.1)$$

$$\min_{\beta}\; J(\beta) \quad \text{s.t.}\quad P(\beta) \le t \qquad (2.2)$$

In the expressions (2.1) and (2.2), the parameters λ and t have a similar function, that is, to control the trade-off between fitting the data to the model according to J(β) and the effect of the penalty P(β). The set such that the constraint in (2.2) is verified, {β : P(β) ≤ t}, is called the admissible set. This penalty term can also be understood as a measure that quantifies the complexity of the model (as in the definition of Sammut and Webb, 2010). Note that regularization terms can also be interpreted in the Bayesian paradigm as prior distributions on the parameters of the model. In this thesis both views will be taken.

In this section I am reviewing pure, mixed and hybrid penalties that will be used in the following chapters to implement feature selection. I first list important properties that may pertain to any type of penalty.



Figure 2.3: Admissible sets in two dimensions for different pure norms ||β||_p

2.3.1 Important Properties

Penalties may have different properties that can be more or less interesting depending on the problem and the expected solution. The most important properties for our purposes here are convexity, sparsity and stability.

Convexity. Regarding optimization, convexity is a desirable property that eases finding global solutions. A convex function verifies

$$\forall (x_1, x_2) \in \mathcal{X}^2, \quad f(t x_1 + (1-t) x_2) \le t f(x_1) + (1-t) f(x_2) \qquad (2.3)$$

for any value of t ∈ [0, 1]. Replacing the inequality by a strict inequality, we obtain the definition of strict convexity. A regularized expression like (2.2) is convex if the function J(β) and the penalty P(β) are both convex.

Sparsity. Usually, null coefficients furnish models that are easier to interpret. When sparsity does not harm the quality of the predictions, it is a desirable property which moreover entails less memory usage and computation resources.

Stability. There are numerous notions of stability or robustness, which measure how the solution varies when the input is perturbed by small changes. This perturbation can be adding, removing or replacing few elements in the training set. Adding regularization, in addition to preventing overfitting, is a means to favor the stability of the solution.

2.3.2 Pure Penalties

For pure penalties, defined as P(β) = ||β||_p, convexity holds for p ≥ 1. This is graphically illustrated in Figure 2.3, borrowed from Szafranski (2008), whose Chapter 3 is an excellent review of regularization techniques and the algorithms to solve them. In this figure, the shape of the admissible sets corresponding to different pure penalties is greyed out. Since convexity of the penalty corresponds to the convexity of the set, we see that this property is verified for p ≥ 1.

Figure 2.4: Two dimensional regularized problems with ||β||_1 and ||β||_2 penalties

Regularizing a linear model with a norm like ||β||_p means that the larger the component |β_j|, the more important the feature x_j in the estimation. On the contrary, the closer to zero, the more dispensable it is. In the limit of |β_j| = 0, x_j is not involved in the model. If many dimensions can be dismissed, then we can speak of sparsity.

A graphical interpretation of sparsity, borrowed from Marie Szafranski, is given in Figure 2.4. In a 2D problem, a solution can be considered as sparse if any of its components (β_1 or β_2) is null, that is, if the optimal β is located on one of the coordinate axes. Let us consider a search algorithm that minimizes an expression like (2.2) where J(β) is a quadratic function. When the solution to the unconstrained problem does not belong to the admissible set defined by P(β) (greyed out area), the solution to the constrained problem is as close as possible to the global minimum of the cost function inside the grey region. Depending on the shape of this region, the probability of having a sparse solution varies. A region with vertexes, as the one corresponding to an L1 penalty, has more chances of inducing sparse solutions than the one of an L2 penalty. That idea is displayed in Figure 2.4, where J(β) is a quadratic function represented with three isolevel curves whose global minimum β^ls is outside the penalties' admissible region. The closest point to this β^ls for the L1 regularization is β^l1, and for the L2 regularization it is β^l2. Solution β^l1 is sparse because its second component is zero, while both components of β^l2 are different from zero.

After reviewing the regions from Figure 2.3, we can relate the capacity of generating sparse solutions to the quantity and the "sharpness" of vertexes of the greyed out area. For example, an L_{1/3} penalty has a support region with sharper vertexes that would induce a sparse solution even more strongly than an L1 penalty; however, the non-convex shape of the L_{1/3} results in difficulties during optimization that will not happen with a convex shape.



To summarize, a convex problem with a sparse solution is desired. But with pure penalties, sparsity is only possible with Lp norms with p ≤ 1, due to the fact that they are the only ones that have vertexes. On the other side, only norms with p ≥ 1 are convex, hence the only pure penalty that builds a convex problem with a sparse solution is the L1 penalty.

L0 Penalties. The L0 pseudo norm of a vector β is defined as the number of entries different from zero, that is, P(β) = ||β||_0 = card{β_j | β_j ≠ 0}:

$$\min_{\beta}\; J(\beta) \quad \text{s.t.}\quad \|\beta\|_0 \le t \qquad (2.4)$$

where the parameter t represents the maximum number of non-zero coefficients in vector β. The larger the value of t (or the lower the value of λ if we use the equivalent expression in (2.1)), the fewer the number of zeros induced in vector β. If t is equal to the dimensionality of the problem (or if λ = 0), then the penalty term is not effective and β is not altered. In general, the computation of the solutions relies on combinatorial optimization schemes. Their solutions are sparse but unstable.

L1 Penalties. The penalties built using L1 norms induce sparsity and stability. It has been named the Lasso (Least Absolute Shrinkage and Selection Operator) by Tibshirani (1996):

$$\min_{\beta}\; J(\beta) \quad \text{s.t.}\quad \sum_{j=1}^{p} |\beta_j| \le t \qquad (2.5)$$

Despite all the advantages of the Lasso, the choice of the right penalty is not so easy as a question of convexity and sparsity. For example, concerning the Lasso, Osborne et al. (2000a) have shown that when the number of examples n is lower than the number of variables p, then the maximum number of non-zero entries of β is n. Therefore, if there is a strong correlation between several variables, this penalty risks dismissing all but one, resulting in a hardly interpretable model. In a field like genomics, where n is typically some tens of individuals and p several thousands of genes, the performance of the algorithm and the interpretability of the genetic relationships are severely limited.
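This saturation effect is easy to observe numerically. The following sketch, which assumes scikit-learn is available and uses an arbitrary penalty value, fits a Lasso on a toy problem with n = 20 samples, p = 100 variables and 30 relevant ones; the number of selected variables stays bounded by the sample size (up to numerical tolerance).

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 20, 100                         # fewer samples than features
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:30] = 1.0                   # 30 relevant variables, more than n
y = X @ beta_true + 0.1 * rng.normal(size=n)

lasso = Lasso(alpha=0.05).fit(X, y)    # alpha chosen arbitrarily for illustration
print(np.sum(lasso.coef_ != 0))        # number of selected variables, never far above n = 20
```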

Lasso is a popular tool that has been used in multiple contexts beside regression, particularly in the field of feature selection in supervised classification (Mai et al., 2012; Witten and Tibshirani, 2011) and clustering (Roth and Lange, 2004; Pan et al., 2006; Pan and Shen, 2007; Zhou et al., 2009; Guo et al., 2010; Witten and Tibshirani, 2010; Bouveyron and Brunet, 2012b,a).

The consistency of the problems regularized by a Lasso penalty is also a key feature, defining consistency as the capability of always making the right choice of relevant variables when the number of individuals is infinitely large. Leng et al. (2006) have shown that when the penalty parameter (t or λ, depending on the formulation) is chosen by minimization of the prediction error, the Lasso penalty does not lead to consistent models. There is a large bibliography defining conditions where Lasso estimators become consistent (Knight and Fu, 2000; Donoho et al., 2006; Meinshausen and Buhlmann, 2006; Zhao and Yu, 2007; Bach, 2008). In addition to those papers, some authors have introduced modifications to improve the interpretability and the consistency of the Lasso, such as the adaptive Lasso (Zou, 2006).

L2 Penalties. The graphical interpretation of pure norm penalties in Figure 2.3 shows that this norm does not induce sparsity, due to its lack of vertexes. Strictly speaking, the L2 norm involves the square root of the sum of all squared components. In practice, when using L2 penalties, the square of the norm is used to avoid the square root and solve a linear system. Thus, an L2 penalized optimization problem looks like

$$\min_{\beta}\; J(\beta) + \lambda \|\beta\|_2^2 \qquad (2.6)$$

The effect of this penalty is the "equalization" of the components of the parameter that is being penalized. To enlighten this property, let us consider a least squares problem

$$\min_{\beta}\; \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 \qquad (2.7)$$

with solution β^ls = (X^T X)^-1 X^T y. If some input variables are highly correlated, the estimator β^ls is very unstable. To fix this numerical instability, Hoerl and Kennard (1970) proposed ridge regression, that regularizes Problem (2.7) with a quadratic penalty:

$$\min_{\beta}\; \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$$

The solution to this problem is β^l2 = (X^T X + λ I_p)^-1 X^T y. All eigenvalues, in particular the small ones corresponding to the correlated dimensions, are now moved upwards by λ. This can be enough to avoid the instability induced by small eigenvalues. This "equalization" in the coefficients reduces the variability of the estimation, which may improve performances.
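A small NumPy sketch illustrates both the closed-form ridge solution and the eigenvalue shift discussed above; the data and the value of λ are arbitrary, chosen only to exhibit the near-singularity.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 30, 5
X = rng.normal(size=(n, p))
X[:, 4] = X[:, 3] + 1e-3 * rng.normal(size=n)            # two nearly collinear columns
y = X @ np.array([1.0, 0.0, -1.0, 2.0, 0.0]) + 0.1 * rng.normal(size=n)

lam = 1.0
G = X.T @ X
beta_ls = np.linalg.solve(G, X.T @ y)                    # unstable: G is near-singular
beta_l2 = np.linalg.solve(G + lam * np.eye(p), X.T @ y)  # ridge: eigenvalues shifted by lambda

print(np.linalg.eigvalsh(G).min(), np.linalg.eigvalsh(G + lam * np.eye(p)).min())
print(beta_ls)
print(beta_l2)
```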

As with the Lasso operator, there are several variations of ridge regression. For example, Breiman (1995) proposed the nonnegative garrotte, that looks like a ridge regression where each variable is penalized adaptively. To do that, the least squares solution is used to define the penalty parameter attached to each coefficient:

$$\min_{\beta}\; \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} \frac{\beta_j^2}{(\beta_j^{ls})^2} \qquad (2.8)$$

The effect is an elliptic admissible set instead of the ball of ridge regression. Another example is the adaptive ridge regression (Grandvalet, 1998; Grandvalet and Canu, 2002), where the penalty parameter differs on each component. There, every λ_j is optimized to penalize more or less depending on the influence of β_j in the model.

Although the L2 penalized problems are stable, they are not sparse. That makes those models harder to interpret, mainly in high dimensions.

L∞ Penalties. A special case of Lp norms is the infinity norm, defined as ||x||_∞ = max(|x_1|, |x_2|, ..., |x_p|). The admissible region for a penalty like ||β||_∞ ≤ t is displayed in Figure 2.3. For the L∞ norm, the greyed out region fits a square containing all the β vectors whose largest coefficient is less or equal to the value of the penalty parameter t.

This norm is not commonly used as a regularization term itself; however, it is a frequent norm combined in mixed penalties, as it is shown in Section 2.3.4. In addition, in the optimization of penalized problems there exists the concept of dual norms. Dual norms arise in the analysis of estimation bounds and in the design of algorithms that address optimization problems by solving an increasing sequence of small subproblems (working set algorithms). The dual norm plays a direct role in computing optimality conditions of sparse regularized problems. The dual norm ||β||_* of a norm ||β|| is defined as

$$\|\beta\|_* = \max_{w \in \mathbb{R}^p}\; \beta^\top w \quad \text{s.t.}\quad \|w\| \le 1$$

In the case of an Lq norm with q ∈ [1, +∞], the dual norm is the Lr norm such that 1/q + 1/r = 1. For example, the L2 norm is self-dual and the dual norm of the L1 norm is the L∞ norm. Thus, this is one of the reasons why L∞ is so important, even if it is not so popular as a penalty itself, because L1 is. An extensive explanation about dual norms and the algorithms that make use of them can be found in Bach et al. (2011).

2.3.3 Hybrid Penalties

There are no reasons for using pure penalties in isolation. We can combine them and try to obtain different benefits from any of them. The most popular example is the Elastic net regularization (Zou and Hastie, 2005), with the objective of improving the Lasso penalization when n ≤ p. As recalled in Section 2.3.2, when n ≤ p the Lasso penalty can select at most n non null features. Thus, in situations where there are more relevant variables, the Lasso penalty risks selecting only some of them. To avoid this effect, a combination of L1 and L2 penalties has been proposed. For the least squares example (2.7) from Section 2.3.2, the Elastic net is

$$\min_{\beta}\; \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^2 \qquad (2.9)$$

The term in λ_1 is a Lasso penalty that induces sparsity in vector β; on the other side, the term in λ_2 is a ridge regression penalty that provides universal strong consistency (De Mol et al., 2009), that is, the asymptotical capability (when n goes to infinity) of always making the right choice of relevant variables.
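As a minimal illustration, the criterion (2.9) can be evaluated directly; the following Python function is only a sketch of the objective, not an optimization routine, and its name is ours.

```python
import numpy as np

def elastic_net_objective(beta, X, y, lam1, lam2):
    """Least squares fit plus the combined L1 (sparsity) and L2 (stability) penalties of (2.9)."""
    residual = y - X @ beta
    return (residual @ residual
            + lam1 * np.sum(np.abs(beta))
            + lam2 * np.sum(beta ** 2))
```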



2.3.4 Mixed Penalties

Imagine a linear regression problem where each variable is a gene. Depending on the application, several biological processes can be identified by L different groups of genes. Let us identify as G_ℓ the group of genes for the ℓth process and d_ℓ the number of genes (variables) in each group, ∀ℓ ∈ {1, ..., L}. Thus, the dimension of vector β will be the addition of the number of genes of every group, dim(β) = Σ_{ℓ=1}^L d_ℓ. Mixed norms are a type of norms that take into consideration those groups. The general expression is shown below:

$$\|\beta\|_{(r,s)} = \left( \sum_{\ell} \Big( \sum_{j \in G_\ell} |\beta_j|^s \Big)^{\frac{r}{s}} \right)^{\frac{1}{r}} \qquad (2.10)$$

The pair (r, s) identifies the norms that are combined: an Ls norm within groups and an Lr norm between groups. The Ls norm penalizes the variables in every group G_ℓ, while the Lr norm penalizes the within-group norms. The pair (r, s) is set so as to induce different properties in the resulting β vector. Note that the outer norm is often weighted to adjust for the different cardinalities of the groups, in order to avoid favoring the selection of the largest groups.
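A direct transcription of (2.10) helps to fix ideas; the following Python sketch computes the mixed (r, s) norm for an arbitrary partition of the coefficients into groups, with optional weights for groups of unequal size. The function name and defaults are ours.

```python
import numpy as np

def mixed_norm(beta, groups, r=1, s=2, weights=None):
    """Mixed (r, s) norm of (2.10): an Ls norm within each group, an Lr norm across groups.
    `groups` is a list of index arrays; `weights` optionally rescales the group norms."""
    if weights is None:
        weights = np.ones(len(groups))
    within = np.array([np.sum(np.abs(beta[g]) ** s) ** (1.0 / s) for g in groups])
    return np.sum(weights * within ** r) ** (1.0 / r)

# The group-Lasso penalty corresponds to (r, s) = (1, 2).
beta = np.array([0.0, 0.0, 1.0, -2.0, 0.5])
groups = [np.array([0, 1]), np.array([2, 3, 4])]
print(mixed_norm(beta, groups))   # 0 + sqrt(1 + 4 + 0.25)
```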

Several combinations are available; the most popular is the norm ||β||_{(1,2)}, known as group-Lasso (Yuan and Lin, 2006; Leng, 2008; Xie et al., 2008a,b; Meier et al., 2008; Roth and Fischer, 2008; Yang et al., 2010; Sanchez Merchante et al., 2012). Figure 2.5 shows the difference between the admissible sets of a pure L1 norm and a mixed L_{1,2} norm. Many other mixings are possible, such as ||β||_{(1,4/3)} (Szafranski et al., 2008) or ||β||_{(1,∞)} (Wang and Zhu, 2008; Kuan et al., 2010; Vogt and Roth, 2010). Modifications of mixed norms have also been proposed, such as the group bridge penalty (Huang et al., 2009), the composite absolute penalties (Zhao et al., 2009), or combinations of mixed and pure norms such as Lasso and group-Lasso (Friedman et al., 2010; Sprechmann et al., 2010) or group-Lasso and ridge penalty (Ng and Abugharbieh, 2011).

2.3.5 Sparsity Considerations

In this chapter I have reviewed several possibilities that induce sparsity in the solution of optimization problems. However, having sparse solutions does not always lead to parsimonious models featurewise. For example, if we have four parameters per feature, we look for solutions where all four parameters are null for non-informative variables.

The Lasso and the other L1 penalties encourage solutions such as the one in the left of Figure 2.6. If the objective is sparsity, then the L1 norm does the job. However, if we aim at feature selection and if the number of parameters per variable exceeds one, this type of sparsity does not target the removal of variables.

To be able to dismiss some features, the sparsity pattern must encourage null values for the same variable across parameters, as shown in the right of Figure 2.6. This can be achieved with mixed penalties that define groups of features. For example, L_{1,2} or L_{1,∞} mixed norms with the proper definition of groups can induce sparsity patterns such as the one in the right of Figure 2.6, which displays a solution where variables 3, 5 and 8 are removed.

Figure 2.5: Admissible sets for the Lasso and Group-Lasso. (a) L1 (Lasso); (b) L_{1,2} (group-Lasso).

Figure 2.6: Sparsity patterns for an example with 8 variables characterized by 4 parameters. (a) L1-induced sparsity; (b) L_{1,2} group-induced sparsity.

2.3.6 Optimization Tools for Regularized Problems

In Caramanis et al. (2012) there is a good collection of mathematical techniques and optimization methods to solve regularized problems. Another good reference is the thesis of Szafranski (2008), which also reviews some techniques, classified in four categories. Those techniques, even if they belong to different categories, can be used separately or combined to produce improved optimization algorithms.

In fact, the algorithm implemented in this thesis is inspired by three of those techniques. It could be defined as an algorithm of "active constraints" implemented following a regularization path that is updated approaching the cost function with secant hyper-planes. Deeper details are given in the dedicated Chapter 5.

Subgradient Descent. Subgradient descent is a generic optimization method that can be used for the settings of penalized problems where the subgradient of the loss function ∂J(β) and the subgradient of the regularizer ∂P(β) can be computed efficiently. On the one hand, it is essentially blind to the problem structure. On the other hand, many iterations are needed, so the convergence is slow and the solutions are not sparse. Basically, it is a generalization of the iterative gradient descent algorithm, where the solution vector β^(t+1) is updated proportionally to the negative subgradient of the function at the current point β^(t):

$$\beta^{(t+1)} = \beta^{(t)} - \alpha (s + \lambda s'), \quad \text{where } s \in \partial J(\beta^{(t)}),\; s' \in \partial P(\beta^{(t)})$$

Coordinate Descent. Coordinate descent is based on the first order optimality conditions of the criterion (2.1). In the case of penalties like the Lasso, making zero the first order derivative with respect to coefficient β_j gives

$$\beta_j = \frac{-\lambda\, \mathrm{sign}(\beta_j) - \frac{\partial J(\beta)}{\partial \beta_j}}{2 \sum_{i=1}^{n} x_{ij}^2}$$

In the literature, those algorithms can also be referred to as "iterative thresholding" algorithms, because the optimization can be solved by soft-thresholding in an iterative process. As an example, Fu (1998) implements this technique, initializing every coefficient with the least squares solution β^ls and updating their values using an iterative thresholding algorithm where β_j^(t+1) = S_λ(∂J(β^(t))/∂β_j). The objective function is optimized with respect to one variable at a time, while all others are kept fixed:

$$S_\lambda\!\left(\frac{\partial J(\beta)}{\partial \beta_j}\right) =
\begin{cases}
\dfrac{\lambda - \frac{\partial J(\beta)}{\partial \beta_j}}{2\sum_{i=1}^{n} x_{ij}^2} & \text{if } \dfrac{\partial J(\beta)}{\partial \beta_j} > \lambda \\[2ex]
\dfrac{-\lambda - \frac{\partial J(\beta)}{\partial \beta_j}}{2\sum_{i=1}^{n} x_{ij}^2} & \text{if } \dfrac{\partial J(\beta)}{\partial \beta_j} < -\lambda \\[2ex]
0 & \text{if } \left|\dfrac{\partial J(\beta)}{\partial \beta_j}\right| \le \lambda
\end{cases}
\qquad (2.11)$$

The same principles define "block-coordinate descent" algorithms. In this case, first order derivatives are applied to the equations of a group-Lasso penalty (Yuan and Lin, 2006; Wu and Lange, 2008).
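To make the mechanics concrete, here is a minimal sketch of cyclic coordinate descent with soft-thresholding for the Lasso-penalized least squares problem, in the spirit of the update (2.11). It is a didactic illustration rather than the algorithm of Fu (1998); in particular the stopping rule is simply a fixed number of sweeps.

```python
import numpy as np

def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_coordinate_descent(X, y, lam, n_sweeps=200):
    """Cyclic coordinate descent for  min_beta ||y - X beta||^2 + lam * ||beta||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = 2.0 * np.sum(X ** 2, axis=0)           # the 2 * sum_i x_ij^2 denominators of (2.11)
    for _ in range(n_sweeps):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]  # partial residual excluding variable j
            beta[j] = soft_threshold(2.0 * X[:, j] @ r_j, lam) / col_sq[j]
    return beta
```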

Active and Inactive Sets. Active set algorithms are also referred to as "active constraints" or "working set" methods. These algorithms define a subset of variables called the "active set", which stores the indices of variables with non-zero β_j. It is usually identified as set A. The complement of the active set is the "inactive set", noted Ā. In the inactive set we can find the indices of the variables whose β_j is zero. Thus, the problem can be simplified to the dimensionality of A.

Osborne et al. (2000a) proposed the first of those algorithms to solve quadratic problems with Lasso penalties. His algorithm starts from an empty active set that is updated incrementally (forward growing). There exists also a backward view where relevant variables are allowed to leave the active set; however, the forward philosophy that starts with an empty A has the advantage that the first calculations are low dimensional. In addition, the forward view fits better in the feature selection intuition, where few features are intended to be selected.

Working set algorithms have to deal with three main tasks. There is an optimization task, where a minimization problem has to be solved using only the variables from the active set; Osborne et al. (2000a) solve a linear approximation of the original problem to determine the objective function descent direction, but any other method can be considered. In general, as the solutions of successive active sets are typically close to each other, it is a good idea to use the solution of the previous iteration as the initialization of the current one (warm start). Besides the optimization task, there is a working set update task, where the active set A is augmented with the variable from the inactive set Ā that violates the most the optimality conditions of Problem (2.1). Finally, there is also a task to compute the optimality conditions. Their expressions are essential in the selection of the next variable to add to the active set and to test if a particular vector β is a solution of Problem (2.1).
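The three tasks can be summarized by the following schematic loop, written in Python for illustration only. The helpers `grad_fn` and `solve_restricted` are placeholders for the problem-specific gradient and restricted optimization, and the optimality test shown is the one of an L1 penalty; this is a sketch of the general mechanism, not of GLOSS itself.

```python
import numpy as np

def working_set_solver(grad_fn, solve_restricted, p, lam, max_iter=100, tol=1e-6):
    """Forward working-set loop: grow the active set with the variable that most violates
    the optimality conditions, then re-solve the problem restricted to that set.
    grad_fn(beta) -> gradient of J at beta; solve_restricted(active, beta0) -> minimizer
    of the penalized problem over the active coordinates, warm started at beta0."""
    beta = np.zeros(p)
    active = []                                   # indices allowed to be non-zero
    for _ in range(max_iter):
        grad = grad_fn(beta)
        violation = np.abs(grad) - lam            # KKT violation of inactive variables (L1 case)
        violation[active] = -np.inf               # already active variables are not candidates
        j = int(np.argmax(violation))
        if violation[j] <= tol:                   # no violating variable: beta is optimal
            return beta, active
        active.append(j)                          # working set update task
        beta = solve_restricted(active, beta)     # optimization task, warm started
    return beta, active
```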

These active constraints or working set methods, even if they were originally proposed to solve L1 regularized quadratic problems, can also be adapted to generic functions and penalties: for example, linear functions and L1 penalties (Roth, 2004), linear functions and L_{1,2} penalties (Roth and Fischer, 2008), or even logarithmic cost functions and combinations of L0, L1 and L2 penalties (Perkins et al., 2003). The algorithm developed in this work belongs to this family of solutions.

Hyper-Planes Approximation. Hyper-planes approximations solve a regularized problem using a piecewise linear approximation of the original cost function. This convex approximation is built using several secant hyper-planes in different points, obtained from the sub-gradient of the cost function at these points.

This family of algorithms implements an iterative mechanism where the number of hyper-planes increases at every iteration. These techniques are useful with large populations, since the number of iterations needed to converge does not depend on the size of the dataset. On the contrary, if few hyper-planes are used, then the quality of the approximation is not good enough and the solution can be unstable.

This family of algorithms is not so popular as the previous one, but some examples can be found in the domain of Support Vector Machines (Joachims, 2006; Smola et al., 2008; Franc and Sonnenburg, 2008) or Multiple Kernel Learning (Sonnenburg et al., 2006).

Regularization Path. The regularization path is the set of solutions that can be reached when solving a series of optimization problems of the form (2.1) where the penalty parameter λ is varied. It is not an optimization technique per se, but it is of practical use when the exact regularization path can be easily followed. Rosset and Zhu (2007) stated that this path is piecewise linear for those problems where the cost function is piecewise quadratic and the regularization term is piecewise linear (or vice-versa).

This concept was first applied to the Lasso algorithm of Osborne et al. (2000b). However, it was after the publication of the algorithm called Least Angle Regression (LARS), developed by Efron et al. (2004), that those techniques became popular. LARS defines the regularization path using active constraint techniques.

Once an active set A^(t) and its corresponding solution β^(t) have been set, looking for the regularization path means looking for a direction h and a step size γ to update the solution as β^(t+1) = β^(t) + γh. Afterwards, the active and inactive sets A^(t+1) and Ā^(t+1) are updated. That can be done looking for the variables that strongly violate the optimality conditions. Hence, LARS sets the update step size and which variable should enter in the active set from the correlation with the residuals.

Proximal Methods. Proximal methods optimize an objective function of the form (2.1), resulting from the addition of a Lipschitz differentiable cost function J(β) and a non-differentiable penalty λP(β):

$$\min_{\beta \in \mathbb{R}^p}\; J(\beta^{(t)}) + \nabla J(\beta^{(t)})^\top (\beta - \beta^{(t)}) + \lambda P(\beta) + \frac{L}{2} \left\| \beta - \beta^{(t)} \right\|_2^2 \qquad (2.12)$$

They are also iterative methods where the cost function J(β) is linearized in the proximity of the solution β, so that the problem to solve in each iteration looks like (2.12), where the parameter L > 0 should be an upper bound on the Lipschitz constant of the gradient ∇J. That can be rewritten as

$$\min_{\beta \in \mathbb{R}^p}\; \frac{1}{2} \left\| \beta - \Big( \beta^{(t)} - \frac{1}{L} \nabla J(\beta^{(t)}) \Big) \right\|_2^2 + \frac{\lambda}{L} P(\beta) \qquad (2.13)$$

The basic algorithm makes use of the solution to (2.13) as the next value of β^(t+1). However, there are faster versions that take advantage of information about previous steps, as the ones described by Nesterov (2007) or the FISTA algorithm (Beck and Teboulle, 2009). Proximal methods can be seen as generalizations of gradient updates; in fact, making λ = 0 in equation (2.13), the standard gradient update rule comes up.
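For the Lasso penalty, the proximal operator in (2.13) is soft-thresholding, which yields the basic ISTA iteration sketched below in Python; the choice of L as twice the largest eigenvalue of X^T X and the fixed iteration count are illustrative assumptions, not prescriptions.

```python
import numpy as np

def ista_lasso(X, y, lam, n_iter=500):
    """Basic proximal gradient (ISTA) for  min_beta ||y - X beta||^2 + lam * ||beta||_1.
    Each step solves (2.13): a gradient step on the smooth part followed by the proximal
    operator of the L1 penalty, i.e. soft-thresholding with threshold lam / L."""
    n, p = X.shape
    # Lipschitz constant of grad J(beta) = -2 X^T (y - X beta) is 2 * largest eigenvalue of X^T X
    L = 2.0 * np.linalg.eigvalsh(X.T @ X).max()
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = -2.0 * X.T @ (y - X @ beta)
        z = beta - grad / L                                          # gradient step
        beta = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)     # prox of (lam/L)*||.||_1
    return beta
```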


Part II

Sparse Linear Discriminant Analysis


Abstract

Linear discriminant analysis (LDA) aims to describe data by a linear combination of features that best separates the classes. It may be used for classifying future observations or for describing those classes.

There is a vast bibliography about sparse LDA methods, reviewed in Chapter 3. Sparsity is typically induced regularizing the discriminant vectors or the class means by L1 penalties (see Chapter 2). Section 2.3.5 discussed why this sparsity inducing penalty may not guarantee parsimonious models regarding variables.

In this part we develop the group-Lasso Optimal Scoring Solver (GLOSS) that addresses a sparse LDA problem globally, through a regression approach of LDA. Our analysis, presented in Chapter 4, formally relates GLOSS to Fisher's discriminant analysis and also enables to derive variants, such as LDA assuming a diagonal within-class covariance structure (Bickel and Levina, 2004). The group-Lasso penalty selects the same features in all discriminant directions, leading to a more interpretable low-dimensional representation of data. The discriminant directions can be used in their totality, or the first ones may be chosen to produce a reduced rank classification. The first two or three directions can also be used to project the data to generate a graphical display of the data. The algorithm is detailed in Chapter 5, and our experimental results of Chapter 6 demonstrate that, compared to the competing approaches, the models are extremely parsimonious without compromising prediction performances. The algorithm efficiently processes medium to large numbers of variables and is thus particularly well suited to the analysis of gene expression data.


3 Feature Selection in Fisher Discriminant Analysis

3.1 Fisher Discriminant Analysis

Linear discriminant analysis (LDA) aims to describe n labeled observations belonging to K groups by a linear combination of features which characterizes or separates classes. It is used for two main purposes: classifying future observations or describing the essential differences between classes, either by providing a visual representation of data or by revealing the combinations of features that discriminate between classes. There are several frameworks in which linear combinations can be derived; Friedman et al. (2009) dedicate a whole chapter to linear methods for classification. In this part we focus on Fisher's discriminant analysis, which is a standard tool for linear discriminant analysis whose formulation does not rely on posterior probabilities but rather on some inertia principles (Fisher, 1936).

We consider that the data consist of a set of n examples, with observations x_i ∈ R^p comprising p features, and label y_i ∈ {0,1}^K indicating the exclusive assignment of observation x_i to one of the K classes. It will be convenient to gather the observations in the n x p matrix X = (x_1^T, ..., x_n^T)^T and the corresponding labels in the n x K matrix Y = (y_1^T, ..., y_n^T)^T.

Fisher's discriminant problem was first proposed for two-class problems, for the analysis of the famous iris dataset, as the maximization of the ratio of the projected between-class covariance to the projected within-class covariance:

$$\max_{\beta \in \mathbb{R}^p}\; \frac{\beta^\top \Sigma_B \beta}{\beta^\top \Sigma_W \beta} \qquad (3.1)$$

where β is the discriminant direction used to project the data, and Σ_B and Σ_W are the p x p between-class covariance and within-class covariance matrices respectively, defined (for a K-class problem) as

$$\Sigma_W = \frac{1}{n} \sum_{k=1}^{K} \sum_{i \in G_k} (x_i - \mu_k)(x_i - \mu_k)^\top, \qquad \Sigma_B = \frac{1}{n} \sum_{k=1}^{K} \sum_{i \in G_k} (\mu - \mu_k)(\mu - \mu_k)^\top$$

where μ is the sample mean of the whole dataset, μ_k the sample mean of class k, and G_k indexes the observations of class k.



This analysis can be extended to the multi-class framework with K groups. In this case, K-1 discriminant vectors β_k may be computed. Such a generalization was first proposed by Rao (1948). Several formulations for the multi-class Fisher's discriminant are available, for example as the maximization of a trace ratio:

$$\max_{B \in \mathbb{R}^{p \times (K-1)}}\; \frac{\mathrm{tr}\left(B^\top \Sigma_B B\right)}{\mathrm{tr}\left(B^\top \Sigma_W B\right)} \qquad (3.2)$$

where the B matrix is built with the discriminant directions β_k as columns.

Solving the multi-class criterion (3.2) is an ill-posed problem; a better formulation is based on a series of K-1 subproblems:

$$\begin{cases} \max_{\beta_k \in \mathbb{R}^p} \; \beta_k^\top \Sigma_B \beta_k \\ \text{s.t. } \beta_k^\top \Sigma_W \beta_k \le 1, \quad \beta_k^\top \Sigma_W \beta_\ell = 0 \;\; \forall \ell < k \end{cases} \qquad (3.3)$$

The maximizer of subproblem k is the eigenvector of Σ_W^-1 Σ_B associated to the kth largest eigenvalue (see Appendix C).
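As an illustration, the discriminant directions can be obtained numerically from the sample covariance matrices through a generalized eigenproblem; the following NumPy/SciPy sketch adds a small ridge to Σ_W to keep it invertible, which is an implementation convenience of ours rather than part of the formulation above.

```python
import numpy as np
from scipy.linalg import eigh

def fisher_directions(X, y, n_directions=None, reg=1e-6):
    """Discriminant directions as the leading eigenvectors of Sigma_W^{-1} Sigma_B,
    computed through the generalized eigenproblem Sigma_B b = w * Sigma_W b."""
    n, p = X.shape
    classes = np.unique(y)
    K = len(classes)
    mu = X.mean(axis=0)
    Sw = np.zeros((p, p))
    Sb = np.zeros((p, p))
    for k in classes:
        Xk = X[y == k]
        mu_k = Xk.mean(axis=0)
        Sw += (Xk - mu_k).T @ (Xk - mu_k)                 # within-class scatter of class k
        Sb += len(Xk) * np.outer(mu - mu_k, mu - mu_k)    # between-class contribution
    Sw, Sb = Sw / n, Sb / n
    vals, vecs = eigh(Sb, Sw + reg * np.eye(p))           # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][: (n_directions or K - 1)]
    return vecs[:, order]                                 # columns are the directions beta_k
```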

3.2 Feature Selection in LDA Problems

LDA is often used as a data reduction technique, where the K-1 discriminant directions summarize the p original variables. However, all variables intervene in the definition of these discriminant directions, and this behavior may be troublesome.

Several modifications of LDA have been proposed to generate sparse discriminant directions. Sparse LDA reveals discriminant directions that only involve a few variables. This sparsity has as main target to reduce the dimensionality of the problem (as in genetic analysis), but parsimonious classification is also motivated by the need of interpretable models, robustness in the solution, or computational constraints.

The easiest approach to sparse LDA performs variable selection before discrimination. The relevancy of each feature is usually based on univariate statistics, which are fast and convenient to compute, but whose very partial view of the overall classification problem may lead to dramatic information loss. As a result, several approaches have been devised in the recent years to construct LDA with wrapper and embedded feature selection capabilities.

They can be categorized according to the LDA formulation that provides the basis to the sparsity inducing extension, that is, either Fisher's Discriminant Analysis (variance-based) or regression-based.

3.2.1 Inertia Based

The Fisher discriminant seeks a projection maximizing the separability of classes from inertia principles: mass centers should be far away (large between-class variance) and classes should be concentrated around their mass centers (small within-class variance). This view motivates a first series of Sparse LDA formulations.

Moghaddam et al. (2006) propose an algorithm for Sparse LDA in binary classification where sparsity originates in a hard cardinality constraint. The formalization is based on the Fisher's discriminant (3.1), reformulated as a quadratically-constrained quadratic program (3.3). Computationally, the algorithm implements a combinatorial search with some eigenvalue properties that are used to avoid exploring subsets of possible solutions. Extensions of this approach have been developed, with new sparsity bounds for the two-class discrimination problem and shortcuts to speed up the evaluation of eigenvalues (Moghaddam et al., 2007).

Also for binary problems, Wu et al. (2009) proposed a sparse LDA applied to gene expression data, where the Fisher's discriminant (3.1) is solved as

$$\begin{cases} \min_{\beta \in \mathbb{R}^p} \; \beta^\top \Sigma_W \beta \\ \text{s.t. } (\mu_1 - \mu_2)^\top \beta = 1, \quad \sum_{j=1}^{p} |\beta_j| \le t \end{cases}$$

where μ_1 and μ_2 are vectors of mean gene expression values corresponding to the two groups. The expression to optimize and the first constraint match problem (3.1). The second constraint encourages parsimony.

Witten and Tibshirani (2011) describe a multi-class technique using the Fisher's discriminant, rewritten in the form of K-1 constrained and penalized maximization problems:

$$\begin{cases} \max_{\beta_k \in \mathbb{R}^p} \; \beta_k^\top \Sigma_B^k \beta_k - P_k(\beta_k) \\ \text{s.t. } \beta_k^\top \Sigma_W \beta_k \le 1 \end{cases}$$

Cai and Liu (2011) use Fisher's discriminant to solve a binary LDA problem, but instead of estimating $\Sigma_W$ and $(\mu_1-\mu_2)$ separately to obtain the optimal solution $\beta = \Sigma_W^{-1}(\mu_1-\mu_2)$, they estimate the product directly through constrained $L_1$ minimization
$$\min_{\beta\in\mathbb{R}^p}\ \|\beta\|_1 \quad \text{s.t.}\quad \bigl\|\Sigma\beta - (\mu_1-\mu_2)\bigr\|_\infty \le \lambda .$$
Sparsity is encouraged by the $L_1$ norm of the vector $\beta$, and the parameter $\lambda$ is used to tune the optimization.

Most of the algorithms reviewed are conceived for binary classification. For those that are envisaged for multi-class scenarios, the Lasso is the most popular way to induce sparsity; however, as we discussed in Section 2.3.5, the Lasso is not the best tool to encourage parsimonious models when there are multiple discriminant directions.

3.2.2 Regression Based

In binary classification, LDA has been known to be equivalent to linear regression of scaled class labels since Fisher (1936). For $K>2$, many studies show that multivariate linear regression of a specific class indicator matrix can be applied as a preprocessing step for LDA. However, directly casting LDA as a least squares problem is challenging in the multi-class case (Duda et al., 2000; Friedman et al., 2009).

Predefined Indicator Matrix

Multi-class classification is usually linked with linear regression through the definition of an indicator matrix (Friedman et al., 2009). An indicator matrix $Y$ is an $n\times K$ matrix containing the class labels of all samples. There are several well-known types in the literature. For example, the binary or dummy indicator ($y_{ik}=1$ if sample $i$ belongs to class $k$, and $y_{ik}=0$ otherwise) is commonly used to link multi-class classification with linear regression (Friedman et al., 2009). Another popular choice is $y_{ik}=1$ if sample $i$ belongs to class $k$ and $y_{ik}=-1/(K-1)$ otherwise; it was used, for example, to extend Support Vector Machines to multi-class classification (Lee et al., 2004) or to generalize the kernel target alignment measure (Guermeur et al., 2004).

Some works propose a formulation of the least squares problem based on a new class indicator matrix (Ye, 2007). This new indicator matrix allows the definition of LS-LDA (Least Squares Linear Discriminant Analysis), which holds a rigorous equivalence with multi-class LDA under a mild condition that is shown empirically to hold in many applications involving high-dimensional data.

Qiao et al. (2009) propose a discriminant analysis for the high-dimensional, low-sample-size setting, which incorporates variable selection in a Fisher's LDA formulated as a generalized eigenvalue problem, which is then recast as a least squares regression. Sparsity is obtained by means of a Lasso penalty on the discriminant vectors. Even if this is not mentioned in the article, their formulation looks very close in spirit to Optimal Scoring regression. Some rather clumsy steps in the developments hinder the comparison, so that further investigations are required; the lack of publicly available code also restrained an empirical test of this conjecture. If the similitude is confirmed, their formalization would be very close to the one of Clemmensen et al. (2011), reviewed in the following section.

In a recent paper, Mai et al. (2012) take advantage of the equivalence between ordinary least squares and LDA problems to propose a binary classifier solving a penalized least squares problem with a Lasso penalty. The sparse version of the projection vector $\beta$ is obtained by solving
$$\min_{\beta\in\mathbb{R}^p,\,\beta_0\in\mathbb{R}}\ n^{-1}\sum_{i=1}^n\bigl(y_i-\beta_0-x_i^\top\beta\bigr)^2 + \lambda\sum_{j=1}^p|\beta_j| ,$$
where $y_i$ is the binary indicator of the label of pattern $x_i$. Even if the authors focus on the Lasso penalty, they also suggest any other generic sparsity-inducing penalty. The decision rule $x^\top\beta+\beta_0>0$ is the LDA classifier when it is built using the resulting $\beta$ vector for $\lambda=0$, but a different intercept $\beta_0$ is required.

Optimal Scoring

In binary classification, the regression of (scaled) class indicators enables to recover exactly the LDA discriminant direction. For more than two classes, regressing predefined indicator matrices may be impaired by the masking effect, where the scores assigned to a class situated between two other ones never dominate (Hastie et al., 1994). Optimal scoring (OS) circumvents the problem by assigning "optimal scores" to the classes. This route was opened by Fisher (1936) for binary classification and pursued for more than two classes by Breiman and Ihaka (1984), with the aim of developing a non-linear extension of discriminant analysis based on additive models. They named their approach optimal scaling, for it optimizes the scaling of the indicators of classes together with the discriminant functions. Their approach was later disseminated under the name optimal scoring by Hastie et al. (1994), who proposed several extensions of LDA, either aiming at constructing more flexible discriminants (Hastie and Tibshirani, 1996) or more conservative ones (Hastie et al., 1995).

As an alternative method to solve LDA problems, Hastie et al. (1995) proposed to incorporate a smoothness prior on the discriminant directions in the OS problem, through a positive-definite penalty matrix $\Omega$, leading to a problem expressed in compact form as
$$\min_{\Theta,\,B}\ \|Y\Theta - XB\|_F^2 + \lambda\,\mathrm{tr}\bigl(B^\top\Omega B\bigr) \qquad\text{(3.4a)}$$
$$\text{s.t.}\quad n^{-1}\,\Theta^\top Y^\top Y\Theta = I_{K-1} , \qquad\text{(3.4b)}$$
where $\Theta\in\mathbb{R}^{K\times(K-1)}$ are the class scores, $B\in\mathbb{R}^{p\times(K-1)}$ are the regression coefficients and $\|\cdot\|_F$ is the Frobenius norm. This compact form does not render the order that arises naturally when considering the following series of $K-1$ problems
$$\min_{\theta_k\in\mathbb{R}^K,\,\beta_k\in\mathbb{R}^p}\ \|Y\theta_k - X\beta_k\|^2 + \beta_k^\top\Omega\beta_k \qquad\text{(3.5a)}$$
$$\text{s.t.}\quad n^{-1}\,\theta_k^\top Y^\top Y\theta_k = 1 , \qquad\text{(3.5b)}$$
$$\qquad\;\; \theta_k^\top Y^\top Y\theta_\ell = 0 ,\quad \ell=1,\dots,k-1 , \qquad\text{(3.5c)}$$
where each $\beta_k$ corresponds to a discriminant direction.

Several sparse LDA have been derived by introducing non-quadratic sparsity-inducing penalties in the OS regression problem (Ghosh and Chinnaiyan, 2005; Leng, 2008; Grosenick et al., 2008; Clemmensen et al., 2011). Grosenick et al. (2008) proposed a variant of the lasso-based penalized OS of Ghosh and Chinnaiyan (2005), by introducing an elastic-net penalty in binary class problems. A generalization to multi-class problems was suggested by Clemmensen et al. (2011), where the objective function (3.5a) is replaced by
$$\min_{\beta_k\in\mathbb{R}^p,\,\theta_k\in\mathbb{R}^K}\ \sum_k \|Y\theta_k - X\beta_k\|_2^2 + \lambda_1\|\beta_k\|_1 + \lambda_2\,\beta_k^\top\Omega\beta_k ,$$
where $\lambda_1$ and $\lambda_2$ are regularization parameters and $\Omega$ is a penalization matrix, often taken to be the identity for the elastic net. The code for SLDA is available from the website of Line Clemmensen.

Another generalization of the work of Ghosh and Chinnaiyan (2005) was proposed by Leng (2008), with an extension to the multi-class framework based on a group-Lasso penalty in the objective function (3.5a):
$$\min_{\beta_k\in\mathbb{R}^p,\,\theta_k\in\mathbb{R}^K}\ \sum_{k=1}^{K-1}\|Y\theta_k - X\beta_k\|_2^2 + \lambda\sum_{j=1}^p\sqrt{\sum_{k=1}^{K-1}\beta_{kj}^2} , \qquad\text{(3.6)}$$
which is the criterion that was chosen in this thesis.

The following chapters present our theoretical and algorithmic contributions regarding this formulation. The proposal of Leng (2008) was heuristically driven, and his algorithm followed closely the group-lasso algorithm of Yuan and Lin (2006), which is not very efficient (the experiments of Leng (2008) are limited to small data sets with hundreds of examples and 1000 preselected genes, and no code is provided). Here, we formally link (3.6) to penalized LDA and propose a publicly available efficient code for solving this problem.

4 Formalizing the Objective

In this chapter, we detail the rationale supporting the Group-Lasso Optimal Scoring Solver (GLOSS) algorithm. GLOSS addresses a sparse LDA problem globally, through a regression approach. Our analysis formally relates GLOSS to Fisher's discriminant analysis and also enables to derive variants, such as LDA assuming a diagonal within-class covariance structure (Bickel and Levina, 2004).

The sparsity arises from the group-Lasso penalty (3.6), due to Leng (2008), that selects the same features in all discriminant directions, thus providing an interpretable low-dimensional representation of the data. For $K$ classes, this representation can either be complete, in dimension $K-1$, or partial, for a reduced-rank classification. The first two or three discriminants can also be used to display a graphical summary of the data.

The derivation of penalized LDA as a penalized optimal scoring regression is quite tedious, but it is required here since the algorithm hinges on this equivalence. The main lines have been derived in several places (Breiman and Ihaka, 1984; Hastie et al., 1994; Hastie and Tibshirani, 1996; Hastie et al., 1995) and already used before for sparsity-inducing penalties (Roth and Lange, 2004). However, the published demonstrations were quite elusive on a number of points, leading to generalizations that were not supported in a rigorous way. To our knowledge, we disclosed the first formal equivalence between the optimal scoring regression problem penalized by group-Lasso and penalized LDA (Sanchez Merchante et al., 2012).

4.1 From Optimal Scoring to Linear Discriminant Analysis

Following Hastie et al. (1995), we now show the equivalence between the series of problems encountered in penalized optimal scoring (p-OS) problems and in penalized LDA (p-LDA) problems, by going through canonical correlation analysis. We first provide some properties about the solutions of an arbitrary problem in the p-OS series (3.5).

Throughout this chapter, we assume that:

• there is no empty class, that is, the diagonal matrix $Y^\top Y$ is full rank;

• inputs are centered, that is, $X^\top 1_n = 0$;

• the quadratic penalty $\Omega$ is positive-semidefinite and such that $X^\top X + \Omega$ is full rank.

4.1.1 Penalized Optimal Scoring Problem

For the sake of simplicity, we now drop subscript $k$ to refer to any problem in the p-OS series (3.5). First, note that Problems (3.5) are biconvex in $(\theta,\beta)$, that is, convex in $\theta$ for each $\beta$ value and vice-versa. The problems are however non-convex; in particular, if $(\theta^\star,\beta^\star)$ is a solution, then $(-\theta^\star,-\beta^\star)$ is also a solution.

The orthogonality constraints (3.5c) inherently limit the number of possible problems in the series to $K$, since we assumed that there are no empty classes. Moreover, as $X$ is centered, the $K-1$ first optimal scores are orthogonal to $\mathbf{1}$ (and the $K$th problem would be solved by $\beta_K = 0$). All the problems considered here can be solved by a singular value decomposition of a real symmetric matrix, so that the orthogonality constraints are easily dealt with. Hence, in the sequel, we do not mention anymore these orthogonality constraints (3.5c), which apply along the route, so as to simplify all expressions. The generic problem solved is thus
$$\min_{\theta\in\mathbb{R}^K,\,\beta\in\mathbb{R}^p}\ \|Y\theta - X\beta\|^2 + \beta^\top\Omega\beta \qquad\text{(4.1a)}$$
$$\text{s.t.}\quad n^{-1}\,\theta^\top Y^\top Y\theta = 1 . \qquad\text{(4.1b)}$$

For a given score vector $\theta$, the discriminant direction $\beta$ that minimizes the p-OS criterion (4.1) is the penalized least squares estimator
$$\beta_{OS} = \bigl(X^\top X + \Omega\bigr)^{-1}X^\top Y\theta . \qquad\text{(4.2)}$$

The objective function (4.1a) is then
$$\|Y\theta - X\beta_{OS}\|^2 + \beta_{OS}^\top\Omega\beta_{OS} = \theta^\top Y^\top Y\theta - 2\,\theta^\top Y^\top X\beta_{OS} + \beta_{OS}^\top\bigl(X^\top X + \Omega\bigr)\beta_{OS}$$
$$= \theta^\top Y^\top Y\theta - \theta^\top Y^\top X\bigl(X^\top X + \Omega\bigr)^{-1}X^\top Y\theta ,$$
where the second line stems from the definition of $\beta_{OS}$ (4.2). Now, using the fact that the optimal $\theta$ obeys constraint (4.1b), the optimization problem is equivalent to
$$\max_{\theta:\ n^{-1}\theta^\top Y^\top Y\theta = 1}\ \theta^\top Y^\top X\bigl(X^\top X + \Omega\bigr)^{-1}X^\top Y\theta , \qquad\text{(4.3)}$$
which shows that the optimization of the p-OS problem with respect to $\theta_k$ boils down to finding the $k$th largest eigenvector of $Y^\top X\bigl(X^\top X + \Omega\bigr)^{-1}X^\top Y$. Indeed, Appendix C details that Problem (4.3) is solved by
$$\bigl(Y^\top Y\bigr)^{-1}Y^\top X\bigl(X^\top X + \Omega\bigr)^{-1}X^\top Y\theta = \alpha^2\theta , \qquad\text{(4.4)}$$
where $\alpha^2$ is the maximal eigenvalue (this awkward notation for the eigenvalue was chosen to ease comparison with Hastie et al. (1995); it is easy to check that this eigenvalue is indeed non-negative, see Equation (4.5) for example):
$$n^{-1}\theta^\top Y^\top X\bigl(X^\top X + \Omega\bigr)^{-1}X^\top Y\theta = \alpha^2\, n^{-1}\theta^\top\bigl(Y^\top Y\bigr)\theta$$
$$n^{-1}\theta^\top Y^\top X\bigl(X^\top X + \Omega\bigr)^{-1}X^\top Y\theta = \alpha^2 . \qquad\text{(4.5)}$$

4.1.2 Penalized Canonical Correlation Analysis

As per Hastie et al. (1995), the penalized Canonical Correlation Analysis (p-CCA) problem between variables $X$ and $Y$ is defined as follows:
$$\max_{\theta\in\mathbb{R}^K,\,\beta\in\mathbb{R}^p}\ n^{-1}\theta^\top Y^\top X\beta \qquad\text{(4.6a)}$$
$$\text{s.t.}\quad n^{-1}\,\theta^\top Y^\top Y\theta = 1 , \qquad\text{(4.6b)}$$
$$\qquad\;\; n^{-1}\,\beta^\top\bigl(X^\top X+\Omega\bigr)\beta = 1 . \qquad\text{(4.6c)}$$

The solutions to (4.6) are obtained by finding saddle points of the Lagrangian
$$nL(\beta,\theta,\nu,\gamma) = \theta^\top Y^\top X\beta - \nu\bigl(\theta^\top Y^\top Y\theta - n\bigr) - \gamma\bigl(\beta^\top(X^\top X+\Omega)\beta - n\bigr)$$
$$\Rightarrow\quad n\,\frac{\partial L(\beta,\theta,\gamma,\nu)}{\partial\beta} = X^\top Y\theta - 2\gamma\,(X^\top X+\Omega)\beta$$
$$\Rightarrow\quad \beta_{CCA} = \frac{1}{2\gamma}\,(X^\top X+\Omega)^{-1}X^\top Y\theta .$$
Then, as $\beta_{CCA}$ obeys (4.6c), we obtain
$$\beta_{CCA} = \frac{(X^\top X+\Omega)^{-1}X^\top Y\theta}{\sqrt{n^{-1}\theta^\top Y^\top X(X^\top X+\Omega)^{-1}X^\top Y\theta}} , \qquad\text{(4.7)}$$
so that the optimal objective function (4.6a) can be expressed with $\theta$ alone:
$$n^{-1}\theta^\top Y^\top X\beta_{CCA} = \frac{n^{-1}\theta^\top Y^\top X(X^\top X+\Omega)^{-1}X^\top Y\theta}{\sqrt{n^{-1}\theta^\top Y^\top X(X^\top X+\Omega)^{-1}X^\top Y\theta}} = \sqrt{n^{-1}\theta^\top Y^\top X(X^\top X+\Omega)^{-1}X^\top Y\theta} ,$$
and the optimization problem with respect to $\theta$ can be restated as
$$\max_{\theta:\ n^{-1}\theta^\top Y^\top Y\theta = 1}\ \theta^\top Y^\top X\bigl(X^\top X+\Omega\bigr)^{-1}X^\top Y\theta . \qquad\text{(4.8)}$$

Hence, the p-OS and p-CCA problems produce the same optimal score vectors $\theta$. The regression coefficients are thus proportional, as shown by (4.2) and (4.7):
$$\beta_{OS} = \alpha\,\beta_{CCA} , \qquad\text{(4.9)}$$



where $\alpha$ is defined by (4.5). The p-CCA optimization problem can also be written as a function of $\beta$ alone, using the optimality conditions for $\theta$:
$$n\,\frac{\partial L(\beta,\theta,\gamma,\nu)}{\partial\theta} = Y^\top X\beta - 2\nu\,Y^\top Y\theta$$
$$\Rightarrow\quad \theta_{CCA} = \frac{1}{2\nu}\,(Y^\top Y)^{-1}Y^\top X\beta . \qquad\text{(4.10)}$$
Then, as $\theta_{CCA}$ obeys (4.6b), we obtain
$$\theta_{CCA} = \frac{(Y^\top Y)^{-1}Y^\top X\beta}{\sqrt{n^{-1}\beta^\top X^\top Y(Y^\top Y)^{-1}Y^\top X\beta}} , \qquad\text{(4.11)}$$
leading to the following expression of the optimal objective function:
$$n^{-1}\theta_{CCA}^\top Y^\top X\beta = \frac{n^{-1}\beta^\top X^\top Y(Y^\top Y)^{-1}Y^\top X\beta}{\sqrt{n^{-1}\beta^\top X^\top Y(Y^\top Y)^{-1}Y^\top X\beta}} = \sqrt{n^{-1}\beta^\top X^\top Y(Y^\top Y)^{-1}Y^\top X\beta} .$$
The p-CCA problem can thus be solved with respect to $\beta$ by plugging this value in (4.6):
$$\max_{\beta\in\mathbb{R}^p}\ n^{-1}\beta^\top X^\top Y(Y^\top Y)^{-1}Y^\top X\beta \qquad\text{(4.12a)}$$
$$\text{s.t.}\quad n^{-1}\,\beta^\top\bigl(X^\top X+\Omega\bigr)\beta = 1 , \qquad\text{(4.12b)}$$
where the positive objective function has been squared compared to (4.6). This formulation is important, since it will be used to link p-CCA to p-LDA. We thus derive its solution: following the reasoning of Appendix C, $\beta_{CCA}$ verifies
$$n^{-1}X^\top Y(Y^\top Y)^{-1}Y^\top X\beta_{CCA} = \lambda\bigl(X^\top X+\Omega\bigr)\beta_{CCA} , \qquad\text{(4.13)}$$
where $\lambda$ is the maximal eigenvalue, shown below to be equal to $\alpha^2$:
$$n^{-1}\beta_{CCA}^\top X^\top Y(Y^\top Y)^{-1}Y^\top X\beta_{CCA} = \lambda$$
$$\Rightarrow\quad n^{-1}\alpha^{-1}\beta_{CCA}^\top X^\top Y(Y^\top Y)^{-1}Y^\top X(X^\top X+\Omega)^{-1}X^\top Y\theta = \lambda$$
$$\Rightarrow\quad n^{-1}\alpha\,\beta_{CCA}^\top X^\top Y\theta = \lambda$$
$$\Rightarrow\quad n^{-1}\theta^\top Y^\top X(X^\top X+\Omega)^{-1}X^\top Y\theta = \lambda$$
$$\Rightarrow\quad \alpha^2 = \lambda .$$
The first line is obtained by obeying constraint (4.12b), the second line by the relationship (4.7), where the denominator is $\alpha$, the third line comes from (4.4), the fourth line uses again the relationship (4.7), and the last one the definition of $\alpha$ (4.5).

4.1.3 Penalized Linear Discriminant Analysis

Still following Hastie et al. (1995), the penalized Linear Discriminant Analysis is defined as follows:
$$\max_{\beta\in\mathbb{R}^p}\ \beta^\top\Sigma_B\beta \qquad\text{(4.14a)}$$
$$\text{s.t.}\quad \beta^\top\bigl(\Sigma_W + n^{-1}\Omega\bigr)\beta = 1 , \qquad\text{(4.14b)}$$
where $\Sigma_B$ and $\Sigma_W$ are respectively the sample between-class and within-class variances of the original $p$-dimensional data. This problem may be solved by an eigenvector decomposition, as detailed in Appendix C.

As the feature matrix $X$ is assumed to be centered, the sample total, between-class and within-class covariance matrices can be written in a simple form that is amenable to a simple matrix representation, using the projection operator $Y\bigl(Y^\top Y\bigr)^{-1}Y^\top$:
$$\Sigma_T = \frac{1}{n}\sum_{i=1}^n x_i x_i^\top = n^{-1}X^\top X$$
$$\Sigma_B = \frac{1}{n}\sum_{k=1}^K n_k\,\mu_k\mu_k^\top = n^{-1}X^\top Y\bigl(Y^\top Y\bigr)^{-1}Y^\top X$$
$$\Sigma_W = \frac{1}{n}\sum_{k=1}^K\sum_{i:\,y_{ik}=1}\bigl(x_i-\mu_k\bigr)\bigl(x_i-\mu_k\bigr)^\top = n^{-1}\Bigl(X^\top X - X^\top Y\bigl(Y^\top Y\bigr)^{-1}Y^\top X\Bigr) .$$

Using these formulae, the solution to the p-LDA problem (4.14) is obtained as
$$X^\top Y\bigl(Y^\top Y\bigr)^{-1}Y^\top X\,\beta_{LDA} = \lambda\Bigl(X^\top X + \Omega - X^\top Y\bigl(Y^\top Y\bigr)^{-1}Y^\top X\Bigr)\beta_{LDA}$$
$$X^\top Y\bigl(Y^\top Y\bigr)^{-1}Y^\top X\,\beta_{LDA} = \frac{\lambda}{1-\lambda}\bigl(X^\top X + \Omega\bigr)\beta_{LDA} .$$
The comparison of the last equation with $\beta_{CCA}$ (4.13) shows that $\beta_{LDA}$ and $\beta_{CCA}$ are proportional, and that $\lambda/(1-\lambda) = \alpha^2$. Using constraints (4.12b) and (4.14b), it comes that
$$\beta_{LDA} = (1-\alpha^2)^{-1/2}\,\beta_{CCA} = \alpha^{-1}(1-\alpha^2)^{-1/2}\,\beta_{OS} ,$$
which ends the path from p-OS to p-LDA.

4.1.4 Summary

The three previous subsections considered a generic form of the $k$th problem in the p-OS series. The relationships unveiled above also hold for the compact notation gathering all problems (3.4), which is recalled below:
$$\min_{\Theta,\,B}\ \|Y\Theta - XB\|_F^2 + \lambda\,\mathrm{tr}\bigl(B^\top\Omega B\bigr) \quad \text{s.t.}\quad n^{-1}\,\Theta^\top Y^\top Y\Theta = I_{K-1} .$$
Let $A$ represent the $(K-1)\times(K-1)$ diagonal matrix with elements $\alpha_k$ being the square roots of the largest eigenvalues of $Y^\top X\bigl(X^\top X+\Omega\bigr)^{-1}X^\top Y$; we have
$$B_{LDA} = B_{CCA}\bigl(I_{K-1} - A^2\bigr)^{-\frac12} = B_{OS}\,A^{-1}\bigl(I_{K-1} - A^2\bigr)^{-\frac12} , \qquad\text{(4.15)}$$
where $I_{K-1}$ is the $(K-1)\times(K-1)$ identity matrix.

At this point, the feature matrix $X$, which in the input space has dimensions $n\times p$, can be projected into the optimal scoring domain as an $n\times(K-1)$ matrix $X_{OS} = XB_{OS}$, or into the linear discriminant analysis space as an $n\times(K-1)$ matrix $X_{LDA} = XB_{LDA}$. Classification can be performed in any of those domains if the appropriate distance (penalized within-class covariance matrix) is applied.
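As an illustration, the mapping (4.15) amounts to a simple rescaling of the columns of $B_{OS}$. A minimal matlab sketch, assuming that the coefficient matrix B_os and the vector alpha of eigenvalue square roots $\alpha_k$ are available (the names are ours, not the GLOSS interface), is:

    % Change of coordinates (4.15) and projections on both domains.
    D     = diag(1 ./ (alpha .* sqrt(1 - alpha.^2)));   % A^-1 * (I_{K-1} - A^2)^(-1/2)
    B_lda = B_os * D;                                   % discriminant coefficients
    X_os  = X * B_os;                                   % data in the optimal scoring domain
    X_lda = X * B_lda;                                  % data in the LDA domain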

With the aim of performing classification, the whole process can be summarized as follows:

1. Solve the p-OS problem as
$$B_{OS} = \bigl(X^\top X + \lambda\Omega\bigr)^{-1}X^\top Y\Theta ,$$
where $\Theta$ are the $K-1$ leading eigenvectors of $Y^\top X\bigl(X^\top X+\lambda\Omega\bigr)^{-1}X^\top Y$.

2. Translate the data samples $X$ into the LDA domain as $X_{LDA} = XB_{OS}D$, where $D = A^{-1}\bigl(I_{K-1} - A^2\bigr)^{-\frac12}$.

3. Compute the matrix $M$ of centroids $\mu_k$ from $X_{LDA}$ and $Y$.

4. Evaluate the distances $d(x,\mu_k)$ in the LDA domain as a function of $M$ and $X_{LDA}$.

5. Translate distances into posterior probabilities and assign every sample $i$ to a class $k$ following the maximum a posteriori rule.

6. Build a graphical representation, if needed.

The solution of the penalized optimal scoring regression and the computation of the distance and posterior matrices are detailed in Section 4.2.1, Section 4.2.2 and Section 4.2.3, respectively.

4.2 Practicalities

4.2.1 Solution of the Penalized Optimal Scoring Regression

Following Hastie et al. (1994) and Hastie et al. (1995), a quadratically penalized LDA problem can be presented as a quadratically penalized OS problem:
$$\min_{\Theta\in\mathbb{R}^{K\times(K-1)},\,B\in\mathbb{R}^{p\times(K-1)}}\ \|Y\Theta - XB\|_F^2 + \lambda\,\mathrm{tr}\bigl(B^\top\Omega B\bigr) \qquad\text{(4.16a)}$$
$$\text{s.t.}\quad n^{-1}\,\Theta^\top Y^\top Y\Theta = I_{K-1} , \qquad\text{(4.16b)}$$
where $\Theta$ are the class scores, $B$ the regression coefficients, and $\|\cdot\|_F$ is the Frobenius norm.

Though non-convex, the OS problem is readily solved by a decomposition in $\Theta$ and $B$: the optimal $B_{OS}$ does not intervene in the optimality conditions with respect to $\Theta$, and the optimization with respect to $B$ is obtained in closed form, as a linear combination of the optimal scores $\Theta$ (Hastie et al., 1995). The algorithm may seem a bit tortuous considering the properties mentioned above, as it proceeds in four steps:

1. Initialize $\Theta$ to $\Theta^0$ such that $n^{-1}\,\Theta^{0\top}Y^\top Y\Theta^0 = I_{K-1}$.

2. Compute $B = \bigl(X^\top X + \lambda\Omega\bigr)^{-1}X^\top Y\Theta^0$.

3. Set $\Theta$ to be the $K-1$ leading eigenvectors of $Y^\top X\bigl(X^\top X+\lambda\Omega\bigr)^{-1}X^\top Y$.

4. Compute the optimal regression coefficients
$$B_{OS} = \bigl(X^\top X + \lambda\Omega\bigr)^{-1}X^\top Y\Theta . \qquad\text{(4.17)}$$

Defining $\Theta^0$ in Step 1, instead of using directly $\Theta$ as expressed in Step 3, drastically reduces the computational burden of the eigen-analysis: the latter is performed on $\Theta^{0\top}Y^\top X\bigl(X^\top X+\lambda\Omega\bigr)^{-1}X^\top Y\Theta^0$, which is computed as $\Theta^{0\top}Y^\top XB$, thus avoiding a costly matrix inversion. The solution of the penalized optimal scoring as an eigenvector decomposition is detailed and justified in Appendix B.
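The following matlab sketch illustrates these four steps for a quadratic penalty; the variable names (X, Y, lambda, Omega) and the construction of $\Theta^0$ are ours and do not reflect the actual GLOSS interface.

    % Four-step solver for the quadratically penalized OS problem (4.16).
    [n, p] = size(X);                 % X: centered n-by-p feature matrix
    K      = size(Y, 2);              % Y: n-by-K 0/1 class indicator matrix
    % Step 1: Theta0 such that n^-1 * Theta0'*Y'*Y*Theta0 = I_{K-1}
    U      = null(ones(1, K));                       % K-by-(K-1), orthonormal, orthogonal to 1_K
    Theta0 = sqrt(n) * diag(1 ./ sqrt(sum(Y, 1))) * U;
    % Step 2: penalized regression of the initial scores
    B = (X' * X + lambda * Omega) \ (X' * (Y * Theta0));
    % Step 3: eigen-analysis of the small (K-1)-by-(K-1) matrix Theta0'*Y'*X*B
    M      = Theta0' * (Y' * (X * B));
    [V, S] = eig((M + M') / 2);                      % symmetrize for numerical safety
    [~, o] = sort(diag(S), 'descend');
    V      = V(:, o);
    % Step 4: optimal scores and regression coefficients (4.17)
    Theta = Theta0 * V;
    B_os  = B * V;                                   % equals (X'X + lambda*Omega) \ X'*Y*Theta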

This four-step algorithm is valid when the penalty is of the form $B^\top\Omega B$. However, when an $L_1$ penalty is applied in (4.16), the optimization algorithm requires iterative updates of $B$ and $\Theta$. That situation is developed by Clemmensen et al. (2011), where a Lasso or an Elastic net penalty is used to induce sparsity in the OS problem. Furthermore, these Lasso and Elastic net penalties do not enjoy the equivalence with LDA problems.

4.2.2 Distance Evaluation

The simplest classification rule is the Nearest Centroid rule, where sample $x_i$ is assigned to class $k$ if $x_i$ is closer (in terms of the shared within-class Mahalanobis distance) to centroid $\mu_k$ than to any other centroid $\mu_\ell$. In general, the parameters of the model are unknown and the rule is applied with the parameters estimated from training data (sample estimators $\mu_k$ and $\Sigma_W$). If $\mu_k$ are the centroids in the input space, sample $x_i$ is assigned to class $k$ if the distance
$$d(x_i,\mu_k) = (x_i-\mu_k)^\top\Sigma_{W\Omega}^{-1}(x_i-\mu_k) - 2\log\Bigl(\frac{n_k}{n}\Bigr) \qquad\text{(4.18)}$$
is minimized among all $k$. In expression (4.18), the first term is the Mahalanobis distance in the input space, and the second term is an adjustment term for unequal class sizes that estimates the prior probability of class $k$. Note that this is inspired by the Gaussian view of LDA and that another definition of the adjustment term could be used (Friedman et al., 2009; Mai et al., 2012). The matrix $\Sigma_{W\Omega}$ used in (4.18) is the penalized within-class covariance matrix, which can be decomposed into a penalized and a non-penalized component:
$$\Sigma_{W\Omega}^{-1} = \Bigl(n^{-1}(X^\top X + \lambda\Omega) - \Sigma_B\Bigr)^{-1} = \Bigl(n^{-1}X^\top X - \Sigma_B + n^{-1}\lambda\Omega\Bigr)^{-1} = \bigl(\Sigma_W + n^{-1}\lambda\Omega\bigr)^{-1} . \qquad\text{(4.19)}$$

Before explaining how to compute the distances, let us summarize some clarifying points:

• The solution $B_{OS}$ of the p-OS problem is enough to accomplish classification.

• In the LDA domain (space of discriminant variates $X_{LDA}$), classification is based on Euclidean distances.

• Classification can be done in a reduced-rank space of dimension $R < K-1$, by using the first $R$ discriminant directions $\{\beta_k\}_{k=1}^R$.

As a result, the expression of the distance (4.18) depends on the domain where the classification is performed. If we classify in the p-OS domain, the distance is
$$\bigl\|(x_i-\mu_k)B_{OS}\bigr\|^2_{\Sigma_{W\Omega}} - 2\log(\pi_k) ,$$
where $\pi_k$ is the estimated class prior and $\|\cdot\|_S$ is the Mahalanobis distance assuming within-class covariance $S$. If classification is done in the p-LDA domain, the distance is
$$\Bigl\|(x_i-\mu_k)B_{OS}A^{-1}\bigl(I_{K-1}-A^2\bigr)^{-\frac12}\Bigr\|_2^2 - 2\log(\pi_k) ,$$
which is a plain Euclidean distance.
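For concreteness, a minimal matlab sketch of the nearest-centroid rule in the LDA domain is given below; the projected training data X_lda, the indicator matrix Y and the projected test samples Xt_lda are assumed to be available (illustrative names, not the GLOSS interface).

    % Nearest-centroid classification in the LDA domain, following (4.18):
    % plain squared Euclidean distance corrected by the estimated class priors.
    nk    = sum(Y, 1)';                                  % class counts
    M     = bsxfun(@rdivide, Y' * X_lda, nk);            % K-by-(K-1) centroids from training data
    prior = nk / sum(nk);
    dist  = zeros(size(Xt_lda, 1), numel(nk));
    for k = 1:numel(nk)
        diffk      = bsxfun(@minus, Xt_lda, M(k, :));
        dist(:, k) = sum(diffk.^2, 2) - 2 * log(prior(k));
    end
    [~, yhat] = min(dist, [], 2);                        % maximum a posteriori class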

4.2.3 Posterior Probability Evaluation

Let $d(x,\mu_k)$ be a distance between $x$ and $\mu_k$ defined as in (4.18). Under the assumption that classes are Gaussian, the posterior probabilities $p(y_k=1|x)$ can be estimated as
$$p(y_k=1|x) \propto \exp\Bigl(-\frac{d(x,\mu_k)}{2}\Bigr) \propto \pi_k\exp\Bigl(-\frac12\bigl\|(x-\mu_k)B_{OS}A^{-1}(I_{K-1}-A^2)^{-\frac12}\bigr\|_2^2\Bigr) . \qquad\text{(4.20)}$$
These probabilities must be normalized to ensure that they sum to one. When the distances $d(x,\mu_k)$ take large values, $\exp(-d(x,\mu_k)/2)$ can take extremely small values, generating underflow issues. A classical trick to fix this numerical issue is detailed below:
$$p(y_k=1|x) = \frac{\pi_k\exp\bigl(-\frac{d(x,\mu_k)}{2}\bigr)}{\sum_\ell\pi_\ell\exp\bigl(-\frac{d(x,\mu_\ell)}{2}\bigr)} = \frac{\pi_k\exp\bigl(\frac{-d(x,\mu_k)+d_{\max}}{2}\bigr)}{\sum_\ell\pi_\ell\exp\bigl(\frac{-d(x,\mu_\ell)+d_{\max}}{2}\bigr)} ,$$
where $d_{\max} = \max_k d(x,\mu_k)$.
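A minimal matlab sketch of this computation, assuming the distance matrix dist (one row per sample, one column per class) and the prior vector prior are available (illustrative names), is:

    % Posterior probabilities (4.20) with the shift by d_max to avoid underflow.
    dshift = bsxfun(@minus, max(dist, [], 2), dist);    % d_max - d(x, mu_k), per sample
    unnorm = bsxfun(@times, prior(:)', exp(dshift / 2));
    post   = bsxfun(@rdivide, unnorm, sum(unnorm, 2));  % normalize so rows sum to one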

4.2.4 Graphical Representation

Sometimes it can be useful to have a graphical display of the data set. Using only the two or three most discriminant directions may not provide the best separation between classes, but it can suffice to inspect the data. That can be accomplished by plotting the first two or three dimensions of the regression fits $X_{OS}$, or of the discriminant variates $X_{LDA}$, depending on whether we are presenting the dataset in the OS or in the LDA domain. Other attributes, such as the centroids or the shape of the within-class variance, can also be represented.

4.3 From Sparse Optimal Scoring to Sparse LDA

The equivalence stated in Section 4.1 holds for quadratic penalties of the form $\beta^\top\Omega\beta$, under the assumption that $Y^\top Y$ and $X^\top X + \lambda\Omega$ are full rank (fulfilled when there are no empty classes and $\Omega$ is positive definite). Quadratic penalties have interesting properties but, as recalled in Section 2.3, they do not induce sparsity. In this respect, $L_1$ penalties are preferable, but they lack a connection between p-LDA and p-OS such as the one stated by Hastie et al. (1995).

In this section, we introduce the tools used to obtain sparse models while maintaining the equivalence between p-LDA and p-OS problems. We use a group-Lasso penalty (see Section 2.3.4) that induces groups of zeroes in the coefficients corresponding to the same feature in all discriminant directions, resulting in truly parsimonious models. Our derivation uses a variational formulation of the group-Lasso to generalize the equivalence drawn by Hastie et al. (1995) for quadratic penalties. Therefore, we intend to show that our formulation of the group-Lasso can be written in the quadratic form $B^\top\Omega B$.

4.3.1 A Quadratic Variational Form

Quadratic variational forms of the Lasso and group-Lasso have been proposed shortly after the original Lasso paper, as a means to address optimization issues, but also as an inspiration for generalizing the Lasso penalty (Grandvalet, 1998; Canu and Grandvalet, 1999). The algorithms based on these quadratic variational forms iteratively reweight a quadratic penalty. They are now often outperformed by more efficient strategies (Bach et al., 2012).

Our formulation of the group-Lasso is shown below:
$$\min_{\tau\in\mathbb{R}^p}\ \min_{B\in\mathbb{R}^{p\times(K-1)}}\ J(B) + \lambda\sum_{j=1}^p\frac{w_j^2\,\|\beta^j\|_2^2}{\tau_j} \qquad\text{(4.21a)}$$
$$\text{s.t.}\quad \sum_j\tau_j - \sum_j w_j\|\beta^j\|_2 \le 0 , \qquad\text{(4.21b)}$$
$$\qquad\;\; \tau_j\ge 0 ,\quad j=1,\dots,p , \qquad\text{(4.21c)}$$
where $B\in\mathbb{R}^{p\times(K-1)}$ is a matrix composed of row vectors $\beta^j\in\mathbb{R}^{K-1}$, $B = \bigl(\beta^{1\top},\dots,\beta^{p\top}\bigr)^\top$, and the $w_j$ are predefined nonnegative weights. The cost function $J(B)$ in our context is the OS regression loss $\|Y\Theta - XB\|_2^2$; from now on, for the sake of simplicity, we keep the generic notation $J(B)$. Here and in what follows, $b/\tau$ is defined by continuation at zero: $b/0 = +\infty$ if $b\neq 0$ and $0/0 = 0$. Note that variants of (4.21) have been proposed elsewhere (see, e.g., Canu and Grandvalet, 1999; Bach et al., 2012, and references therein).

The intuition behind our approach is that, using the variational formulation, we recast a non-quadratic expression into the convex hull of a family of quadratic penalties defined by the variables $\tau_j$. This is graphically shown in Figure 4.1.

Let us start by proving the equivalence of our variational formulation and the standard group-Lasso (an alternative variational formulation is detailed and demonstrated in Appendix D).

Lemma 4.1. The quadratic penalty in $\beta^j$ in (4.21) acts as the group-Lasso penalty $\lambda\sum_{j=1}^p w_j\|\beta^j\|_2$.

Proof. The Lagrangian of Problem (4.21) is
$$L = J(B) + \lambda\sum_{j=1}^p\frac{w_j^2\,\|\beta^j\|_2^2}{\tau_j} + \nu_0\Bigl(\sum_{j=1}^p\tau_j - \sum_{j=1}^p w_j\|\beta^j\|_2\Bigr) - \sum_{j=1}^p\nu_j\tau_j .$$

Figure 4.1: Graphical representation of the variational approach to the group-Lasso.

Thus, the first-order optimality conditions for $\tau_j$ are
$$\frac{\partial L}{\partial\tau_j}(\tau_j^\star) = 0 \;\Leftrightarrow\; -\lambda\frac{w_j^2\|\beta^j\|_2^2}{\tau_j^{\star 2}} + \nu_0 - \nu_j = 0$$
$$\Leftrightarrow\; -\lambda w_j^2\|\beta^j\|_2^2 + \nu_0\,\tau_j^{\star 2} - \nu_j\,\tau_j^{\star 2} = 0$$
$$\Rightarrow\; -\lambda w_j^2\|\beta^j\|_2^2 + \nu_0\,\tau_j^{\star 2} = 0 .$$
The last line is obtained from complementary slackness, which implies here $\nu_j\tau_j^\star = 0$ (complementary slackness states that $\nu_j g_j(\tau_j^\star) = 0$, where $\nu_j$ is the Lagrange multiplier for constraint $g_j(\tau_j)\le 0$). As a result, the optimal value of $\tau_j$ is
$$\tau_j^\star = \sqrt{\frac{\lambda w_j^2\|\beta^j\|_2^2}{\nu_0}} = \sqrt{\frac{\lambda}{\nu_0}}\,w_j\|\beta^j\|_2 . \qquad\text{(4.22)}$$

We note that $\nu_0\neq 0$ if there is at least one coefficient $\beta_{jk}\neq 0$; thus the inequality constraint (4.21b) is at bound (due to complementary slackness),
$$\sum_{j=1}^p\tau_j^\star - \sum_{j=1}^p w_j\|\beta^j\|_2 = 0 , \qquad\text{(4.23)}$$
so that $\tau_j^\star = w_j\|\beta^j\|_2$. Using this value in (4.21a), it is possible to conclude that Problem (4.21) is equivalent to the standard group-Lasso operator
$$\min_{B\in\mathbb{R}^{p\times M}}\ J(B) + \lambda\sum_{j=1}^p w_j\|\beta^j\|_2 . \qquad\text{(4.24)}$$

We have thus presented a convex quadratic variational form of the group-Lasso and demonstrated its equivalence with the standard group-Lasso formulation.

With Lemma 4.1, we have proved that, under constraints (4.21b)-(4.21c), the quadratic problem (4.21a) is equivalent to the standard formulation of the group-Lasso (4.24). The penalty term of (4.21a) can be conveniently presented as $\lambda\,B^\top\Omega B$, where
$$\Omega = \mathrm{diag}\Bigl(\frac{w_1^2}{\tau_1},\frac{w_2^2}{\tau_2},\dots,\frac{w_p^2}{\tau_p}\Bigr) , \qquad\text{(4.25)}$$
with $\tau_j = w_j\|\beta^j\|_2$, resulting in the diagonal components
$$(\Omega)_{jj} = \frac{w_j}{\|\beta^j\|_2} . \qquad\text{(4.26)}$$

As stated at the beginning of this section, the equivalence between p-LDA problems and p-OS problems is thus demonstrated for the variational formulation. This equivalence is crucial to the derivation of the link between sparse OS and sparse LDA; it furthermore suggests a convenient implementation. We sketch below some properties that are instrumental in the implementation of the active set method described in Chapter 5.

The first property states that the quadratic formulation is convex when $J$ is convex, thus providing an easy control of optimality and convergence.

Lemma 4.2. If $J$ is convex, Problem (4.21) is convex.

Proof. The function $g(\beta,\tau) = \|\beta\|_2^2/\tau$, known as the perspective function of $f(\beta)=\|\beta\|_2^2$, is convex in $(\beta,\tau)$ (see, e.g., Boyd and Vandenberghe, 2004, Chapter 3), and the constraints (4.21b)-(4.21c) define convex admissible sets; hence Problem (4.21) is jointly convex with respect to $(B,\tau)$.

In what follows, $J$ will be a convex quadratic (hence smooth) function, in which case a necessary and sufficient optimality condition is that zero belongs to the subdifferential of the objective function, whose expression is provided in the following lemma.

Lemma 4.3. For all $B\in\mathbb{R}^{p\times(K-1)}$, the subdifferential of the objective function of Problem (4.24) is
$$\Bigl\{V\in\mathbb{R}^{p\times(K-1)}:\ V = \frac{\partial J(B)}{\partial B} + \lambda G\Bigr\} , \qquad\text{(4.27)}$$
where $G\in\mathbb{R}^{p\times(K-1)}$ is a matrix composed of row vectors $g^j\in\mathbb{R}^{K-1}$, $G = \bigl(g^{1\top},\dots,g^{p\top}\bigr)^\top$, defined as follows. Let $\mathcal{S}(B)$ denote the columnwise support of $B$, $\mathcal{S}(B) = \{j\in\{1,\dots,p\}:\ \|\beta^j\|_2\neq 0\}$; then we have
$$\forall j\in\mathcal{S}(B),\quad g^j = w_j\|\beta^j\|_2^{-1}\beta^j , \qquad\text{(4.28)}$$
$$\forall j\notin\mathcal{S}(B),\quad \|g^j\|_2\le w_j . \qquad\text{(4.29)}$$

This condition results in an equality for the "active" non-zero vectors $\beta^j$ and an inequality for the other ones, which both provide essential building blocks of our algorithm.

Proof. When $\|\beta^j\|_2\neq 0$, the gradient of the penalty with respect to $\beta^j$ is
$$\frac{\partial\bigl(\lambda\sum_{m=1}^p w_m\|\beta^m\|_2\bigr)}{\partial\beta^j} = \lambda w_j\frac{\beta^j}{\|\beta^j\|_2} . \qquad\text{(4.30)}$$
At $\|\beta^j\|_2 = 0$, the gradient of the objective function is not continuous, and the optimality conditions then make use of the subdifferential (Bach et al., 2011):
$$\partial_{\beta^j}\Bigl(\lambda\sum_{m=1}^p w_m\|\beta^m\|_2\Bigr) = \partial_{\beta^j}\bigl(\lambda w_j\|\beta^j\|_2\bigr) = \bigl\{\lambda w_j v\in\mathbb{R}^{K-1}:\ \|v\|_2\le 1\bigr\} . \qquad\text{(4.31)}$$
That gives the expression (4.29).

Lemma 4.4. Problem (4.21) admits at least one solution, which is unique if $J$ is strictly convex. All critical points $B$ of the objective function verifying the following conditions are global minima:
$$\forall j\in\mathcal{S},\quad \frac{\partial J(B)}{\partial\beta^j} + \lambda w_j\|\beta^j\|_2^{-1}\beta^j = 0 , \qquad\text{(4.32a)}$$
$$\forall j\notin\mathcal{S},\quad \Bigl\|\frac{\partial J(B)}{\partial\beta^j}\Bigr\|_2\le\lambda w_j , \qquad\text{(4.32b)}$$
where $\mathcal{S}\subseteq\{1,\dots,p\}$ denotes the set of non-zero row vectors $\beta^j$ and $\bar{\mathcal{S}}$ is its complement.

Lemma 4.4 provides a simple appraisal of the support of the solution, which would not be as easily handled with the direct analysis of the variational problem (4.21).

4.3.2 Group-Lasso OS as Penalized LDA

With all the previous ingredients, the group-Lasso Optimal Scoring Solver for performing sparse LDA can be introduced.

Proposition 4.1. The group-Lasso OS problem
$$B_{OS} = \operatorname*{argmin}_{B\in\mathbb{R}^{p\times(K-1)}}\ \min_{\Theta\in\mathbb{R}^{K\times(K-1)}}\ \frac12\|Y\Theta - XB\|_F^2 + \lambda\sum_{j=1}^p w_j\|\beta^j\|_2$$
$$\text{s.t.}\quad n^{-1}\,\Theta^\top Y^\top Y\Theta = I_{K-1}$$
is equivalent to the penalized LDA problem
$$B_{LDA} = \operatorname*{argmax}_{B\in\mathbb{R}^{p\times(K-1)}}\ \mathrm{tr}\bigl(B^\top\Sigma_B B\bigr) \quad \text{s.t.}\quad B^\top\bigl(\Sigma_W + n^{-1}\lambda\Omega\bigr)B = I_{K-1} ,$$
where $\Omega = \mathrm{diag}\bigl(\frac{w_1^2}{\tau_1},\dots,\frac{w_p^2}{\tau_p}\bigr)$, with $\Omega_{jj} = +\infty$ if $\beta^j_{OS} = 0$ and $\Omega_{jj} = w_j\,\|\beta^j_{OS}\|_2^{-1}$ otherwise (4.33). That is, $B_{LDA} = B_{OS}\,\mathrm{diag}\bigl(\alpha_k^{-1}(1-\alpha_k^2)^{-1/2}\bigr)$, where $\alpha_k\in(0,1)$ is the $k$th leading eigenvalue of
$$n^{-1}\,Y^\top X\bigl(X^\top X + \lambda\Omega\bigr)^{-1}X^\top Y .$$

Proof. The proof simply consists in applying the result of Hastie et al. (1995), which holds for quadratic penalties, to the quadratic variational form of the group-Lasso.

The proposition applies in particular to the Lasso-based OS approaches to sparse LDA (Grosenick et al., 2008; Clemmensen et al., 2011) for $K=2$, that is, for binary classification, or more generally for a single discriminant direction. Note however that it leads to a slightly different decision rule if the decision threshold is chosen a priori, according to the Gaussian assumption for the features. For more than one discriminant direction, the equivalence does not hold anymore, since the Lasso penalty does not result in an equivalent quadratic penalty in the simple form $\mathrm{tr}\bigl(B^\top\Omega B\bigr)$.


5 GLOSS Algorithm

The efficient approaches developed for the Lasso take advantage of the sparsity of the solution by solving a series of small linear systems, whose sizes are incrementally increased or decreased (Osborne et al., 2000a). This approach was also pursued for the group-Lasso in its standard formulation (Roth and Fischer, 2008). We adapt this algorithmic framework to the variational form (4.21), with $J(B) = \frac12\|Y\Theta - XB\|_2^2$.

The algorithm belongs to the working set family of optimization methods (see Section 2.3.6). It starts from a sparse initial guess, say $B=0$, thus defining the set $\mathcal{A}$ of "active" variables, currently identified as non-zero. Then it iterates the three steps summarized below:

1. Update the coefficient matrix $B$ within the current active set $\mathcal{A}$, where the optimization problem is smooth. First, the quadratic penalty is updated, and then a standard penalized least squares fit is computed.

2. Check the optimality conditions (4.32) with respect to the active variables. One or more $\beta^j$ may be declared inactive when they vanish from the current solution.

3. Check the optimality conditions (4.32) with respect to the inactive variables. If they are satisfied, the algorithm returns the current solution, which is optimal. If they are not satisfied, the variable corresponding to the greatest violation is added to the active set.

This mechanism is graphically represented in Figure 5.1 as a block diagram, and formalized in more detail in Algorithm 1. Note that this formulation uses the equations from the variational approach detailed in Section 4.3.1. If we want to use the alternative variational approach from Appendix D, then we have to replace Equations (4.21), (4.32a) and (4.32b) by (D.1), (D.10a) and (D.10b), respectively.

5.1 Regression Coefficients Updates

Step 1 of Algorithm 1 updates the coefficient matrix $B$ within the current active set $\mathcal{A}$. The quadratic variational form of the problem suggests a blockwise optimization strategy, consisting in solving $K-1$ independent $\mathrm{card}(\mathcal{A})$-dimensional problems instead of a single $(K-1)\times\mathrm{card}(\mathcal{A})$-dimensional problem. The interaction between the $K-1$ problems is relegated to the common adaptive quadratic penalty $\Omega$. This decomposition is especially attractive, as we then solve $K-1$ similar systems
$$\bigl(X_\mathcal{A}^\top X_\mathcal{A} + \lambda\Omega\bigr)\beta_k = X_\mathcal{A}^\top Y\theta_k^0 , \qquad\text{(5.1)}$$

Figure 5.1: GLOSS block diagram.

Algorithm 1: Adaptively Penalized Optimal Scoring

Input: $X$, $Y$, $B$, $\lambda$
Initialize: $\mathcal{A} \leftarrow \{j\in\{1,\dots,p\}:\ \|\beta^j\|_2>0\}$; $\Theta^0$ such that $n^{-1}\,\Theta^{0\top}Y^\top Y\Theta^0 = I_{K-1}$; convergence $\leftarrow$ false
repeat
  // Step 1: solve (4.21) in $B$, assuming $\mathcal{A}$ optimal
  repeat
    $\Omega \leftarrow \mathrm{diag}(\Omega_\mathcal{A})$, with $\omega_j \leftarrow \|\beta^j\|_2^{-1}$
    $B_\mathcal{A} \leftarrow \bigl(X_\mathcal{A}^\top X_\mathcal{A} + \lambda\Omega\bigr)^{-1}X_\mathcal{A}^\top Y\Theta^0$
  until condition (4.32a) holds for all $j\in\mathcal{A}$
  // Step 2: identify inactivated variables
  for $j\in\mathcal{A}$ such that $\|\beta^j\|_2 = 0$ do
    if optimality condition (4.32b) holds then
      $\mathcal{A} \leftarrow \mathcal{A}\setminus\{j\}$; go back to Step 1
    end if
  end for
  // Step 3: check the greatest violation of optimality condition (4.32b) in the complement of $\mathcal{A}$
  $j^\star \leftarrow \operatorname*{argmax}_{j\notin\mathcal{A}}\ \|\partial J/\partial\beta^j\|_2$
  if $\|\partial J/\partial\beta^{j^\star}\|_2 < \lambda w_{j^\star}$ then
    convergence $\leftarrow$ true ($B$ is optimal)
  else
    $\mathcal{A} \leftarrow \mathcal{A}\cup\{j^\star\}$
  end if
until convergence
$(s, V) \leftarrow \mathrm{eigenanalyze}\bigl(\Theta^{0\top}Y^\top X_\mathcal{A}B\bigr)$, that is, $\Theta^{0\top}Y^\top X_\mathcal{A}B\,V_k = s_k V_k$, $k=1,\dots,K-1$
$\Theta \leftarrow \Theta^0 V$; $B \leftarrow BV$; $\alpha_k \leftarrow n^{-1/2}s_k^{1/2}$, $k=1,\dots,K-1$
Output: $\Theta$, $B$, $\alpha$

where $X_\mathcal{A}$ denotes the columns of $X$ indexed by $\mathcal{A}$, and $\beta_k$ and $\theta_k^0$ denote the $k$th columns of $B$ and $\Theta^0$, respectively. These linear systems only differ in the right-hand-side term, so that a single Cholesky decomposition is necessary to solve all systems, whereas a blockwise Newton-Raphson method based on the standard group-Lasso formulation would result in different "penalties" $\Omega$ for each system.

5.1.1 Cholesky decomposition

Dropping the subscripts and considering the $K-1$ systems together, (5.1) leads to
$$(X^\top X + \lambda\Omega)\,B = X^\top Y\Theta . \qquad\text{(5.2)}$$
Defining the Cholesky decomposition as $C^\top C = (X^\top X + \lambda\Omega)$, (5.2) is solved efficiently as follows:
$$C^\top CB = X^\top Y\Theta$$
$$CB = C^\top\backslash X^\top Y\Theta$$
$$B = C\backslash\bigl(C^\top\backslash X^\top Y\Theta\bigr) , \qquad\text{(5.3)}$$
where the symbol "\" is the matlab mldivide operator, which solves linear systems efficiently. The GLOSS code implements (5.3).
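A matlab sketch of this update on the active set is given below; XA, Y, Theta0, lambda, the weights wA and the current coefficients B_A are assumed to be available (illustrative names, not the GLOSS interface).

    % Update (5.3): one Cholesky factorization, then pairs of triangular solves.
    omega = wA(:) ./ sqrt(sum(B_A.^2, 2));          % diagonal of Omega, cf. (4.26)
    C     = chol(XA' * XA + lambda * diag(omega));  % upper triangular, C'*C = XA'*XA + lambda*Omega
    RHS   = XA' * (Y * Theta0);
    B_A   = C \ (C' \ RHS);                         % the K-1 systems share the factorization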

5.1.2 Numerical Stability

The OS regression coefficients are obtained by (5.2), where the penalizer $\Omega$ is iteratively updated by (4.33). In this iterative process, when a variable is about to leave the active set, the corresponding entry of $\Omega$ reaches large values, thereby driving some OS regression coefficients to zero. These large values may cause numerical stability problems in the Cholesky decomposition of $X^\top X + \lambda\Omega$. This difficulty can be avoided using the following equivalent expression:
$$B = \Omega^{-1/2}\bigl(\Omega^{-1/2}X^\top X\,\Omega^{-1/2} + \lambda I\bigr)^{-1}\Omega^{-1/2}X^\top Y\Theta^0 , \qquad\text{(5.4)}$$
where the conditioning of $\Omega^{-1/2}X^\top X\,\Omega^{-1/2} + \lambda I$ is always well-behaved, provided $X$ is appropriately normalized (recall that $0\le 1/\omega_j\le 1$). This stabler expression demands more computation and is thus reserved to cases with large $\omega_j$ values; our code is otherwise based on expression (5.2).
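A sketch of this stabler computation, under the same assumptions and names as in the previous sketch, is:

    % Stabler update (5.4), used when some omega_j become very large.
    Oinv = diag(1 ./ sqrt(omega));                  % Omega^(-1/2)
    G    = Oinv * (XA' * XA) * Oinv + lambda * eye(numel(omega));
    B_A  = Oinv * (G \ (Oinv * (XA' * (Y * Theta0))));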

5.2 Score Matrix

The optimal score matrix $\Theta$ is made of the $K-1$ leading eigenvectors of $Y^\top X\bigl(X^\top X+\Omega\bigr)^{-1}X^\top Y$. This eigen-analysis is actually solved in the form $\Theta^\top Y^\top X\bigl(X^\top X+\Omega\bigr)^{-1}X^\top Y\Theta$ (see Section 4.2.1 and Appendix B). The latter eigenvector decomposition does not require the costly computation of $\bigl(X^\top X+\Omega\bigr)^{-1}$, which involves the inversion of a $p\times p$ matrix. Let $\Theta^0$ be an arbitrary $K\times(K-1)$ matrix whose range includes the $K-1$ leading eigenvectors of $Y^\top X\bigl(X^\top X+\Omega\bigr)^{-1}X^\top Y$. As $X$ is centered, $1_K$ belongs to the null space of $Y^\top X\bigl(X^\top X+\Omega\bigr)^{-1}X^\top Y$; it is thus sufficient to choose $\Theta^0$ orthogonal to $1_K$ to ensure that its range spans the leading eigenvectors. In practice, to comply with this desideratum and with conditions (3.5b) and (3.5c), we set $\Theta^0 = \bigl(Y^\top Y\bigr)^{-1/2}U$, where $U$ is a $K\times(K-1)$ matrix whose columns are orthonormal vectors orthogonal to $1_K$.

Then, solving the $K-1$ systems (5.3) provides the value of $B^0 = (X^\top X+\lambda\Omega)^{-1}X^\top Y\Theta^0$. This $B^0$ matrix can be identified in the expression to eigenanalyze as
$$\Theta^{0\top}Y^\top X\bigl(X^\top X+\Omega\bigr)^{-1}X^\top Y\Theta^0 = \Theta^{0\top}Y^\top XB^0 .$$
Thus, the solution to the penalized OS problem can be computed through the singular value decomposition of the $(K-1)\times(K-1)$ matrix $\Theta^{0\top}Y^\top XB^0 = V\Lambda V^\top$. Defining $\Theta = \Theta^0 V$, we have $\Theta^\top Y^\top X\bigl(X^\top X+\Omega\bigr)^{-1}X^\top Y\Theta = \Lambda$, and when $\Theta^0$ is chosen such that $n^{-1}\,\Theta^{0\top}Y^\top Y\Theta^0 = I_{K-1}$, we also have $n^{-1}\,\Theta^\top Y^\top Y\Theta = I_{K-1}$, so that the constraints of the p-OS problem hold. Hence, assuming that the diagonal elements of $\Lambda$ are sorted in decreasing order, $\theta_k$ is an optimal solution to the p-OS problem. Finally, once $\Theta$ has been computed, the corresponding optimal regression coefficients $B$ satisfying (5.2) are simply recovered using the mapping from $\Theta^0$ to $\Theta$, that is, $B = B^0V$. Appendix E details why the computational trick described here for quadratic penalties can be applied to the group-Lasso, for which $\Omega$ is defined by a variational formulation.

5.3 Optimality Conditions

GLOSS uses an active set optimization technique to obtain the optimal values of the coefficient matrix $B$ and the score matrix $\Theta$. To be a solution, the coefficient matrix must obey Lemmas 4.3 and 4.4; optimality conditions (4.32a) and (4.32b) can be deduced from those lemmas. Both expressions require the computation of the gradient of the objective function
$$\frac12\|Y\Theta - XB\|_2^2 + \lambda\sum_{j=1}^p w_j\|\beta^j\|_2 . \qquad\text{(5.5)}$$
Let $J(B)$ be the data-fitting term $\frac12\|Y\Theta - XB\|_2^2$. Its gradient with respect to the $j$th row of $B$, $\beta^j$, is the $(K-1)$-dimensional vector
$$\frac{\partial J(B)}{\partial\beta^j} = x_j^\top(XB - Y\Theta) ,$$
where $x_j$ is the $j$th column of $X$. Hence, the first optimality condition (4.32a) can be checked for every variable $j$ through
$$x_j^\top(XB - Y\Theta) + \lambda w_j\frac{\beta^j}{\|\beta^j\|_2} = 0 .$$


The second optimality condition (4.32b) can be checked for every variable $j$ through
$$\bigl\|x_j^\top(XB - Y\Theta)\bigr\|_2\le\lambda w_j .$$
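These two checks translate directly into code. A minimal matlab sketch, with X, Y, Theta, B, lambda, the weights w and a logical vector active assumed given (illustrative names), is:

    % Optimality checks (4.32a)-(4.32b); row j of G is the gradient dJ/dbeta^j.
    G     = X' * (X * B - Y * Theta);
    normB = sqrt(sum(B.^2, 2));
    normG = sqrt(sum(G.^2, 2));
    tol   = 1e-8;
    % (4.32a) on the active set: gradient plus penalty subgradient must vanish
    R          = G(active, :) + lambda * bsxfun(@times, w(active) ./ normB(active), B(active, :));
    ok_active  = all(sqrt(sum(R.^2, 2)) < tol);
    % (4.32b) on the inactive set: gradient norm below the threshold lambda * w_j
    ok_inactive = all(normG(~active) <= lambda * w(~active) + tol);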

5.4 Active and Inactive Sets

The feature selection mechanism embedded in GLOSS selects the variables that provide the greatest decrease in the objective function. This is accomplished by means of the optimality conditions (4.32a) and (4.32b). Let $\mathcal{A}$ be the active set, containing the variables that have already been considered relevant. A variable $j$ can be considered for inclusion in the active set if it violates the second optimality condition. We proceed one variable at a time, by choosing the one that is expected to produce the greatest decrease in the objective function:
$$j^\star = \operatorname*{argmax}_{j}\ \max\Bigl\{\bigl\|x_j^\top(XB - Y\Theta)\bigr\|_2 - \lambda w_j ,\ 0\Bigr\} .$$
The exclusion of a variable belonging to the active set $\mathcal{A}$ is considered if the norm $\|\beta^j\|_2$ is small and if, after setting $\beta^j$ to zero, the following optimality condition holds:
$$\bigl\|x_j^\top(XB - Y\Theta)\bigr\|_2\le\lambda w_j .$$
The process continues until no variable in the active set violates the first optimality condition and no variable in the inactive set violates the second optimality condition.
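A sketch of the inclusion step, under the same assumptions and names as in the previous sketch, is:

    % Pick the inactive variable with the largest violation of (4.32b), if any.
    viol          = sqrt(sum((X' * (X * B - Y * Theta)).^2, 2)) - lambda * w(:);
    viol(active)  = -inf;                      % restrict the search to inactive variables
    [vmax, jstar] = max(viol);
    if vmax > 0
        active(jstar) = true;                  % add the worst violator to the active set
    end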

5.5 Penalty Parameter

The penalty parameter can be specified by the user, in which case GLOSS solves the problem for this value of $\lambda$. The other strategy is to compute the solution path for several values of $\lambda$: GLOSS then looks for the maximum value of the penalty parameter, $\lambda_{\max}$, such that $B\neq 0$, and solves the p-OS problem for decreasing values of $\lambda$, until a prescribed number of features are declared active.

The maximum value of the penalty parameter, $\lambda_{\max}$, corresponding to a null $B$ matrix, is obtained by evaluating the optimality condition (4.32b) at $B=0$:
$$\lambda_{\max} = \max_{j\in\{1,\dots,p\}}\ \frac{1}{w_j}\bigl\|x_j^\top Y\Theta^0\bigr\|_2 .$$
The algorithm then computes a series of solutions along the regularization path defined by a series of penalties $\lambda_1 = \lambda_{\max} > \dots > \lambda_t > \dots > \lambda_T = \lambda_{\min}\ge 0$, by regularly decreasing the penalty, $\lambda_{t+1} = \lambda_t/2$, and using a warm-start strategy where the feasible initial guess for $B(\lambda_{t+1})$ is initialized with $B(\lambda_t)$. The final penalty parameter $\lambda_{\min}$ is determined during the optimization process, when the maximum number of desired active variables is attained (by default, the minimum of $n$ and $p$).
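A sketch of this path strategy is shown below; gloss_fit stands for a hypothetical solver of the p-OS problem at a fixed penalty and is not the actual GLOSS interface.

    % lambda_max from condition (4.32b) at B = 0, then a halving schedule with warm starts.
    lambda_max = max(sqrt(sum((X' * (Y * Theta0)).^2, 2)) ./ w(:));
    lambda = lambda_max;
    B      = zeros(size(X, 2), size(Theta0, 2));
    path   = {};
    while nnz(any(B, 2)) < min(size(X))              % stop when enough variables are active
        [B, Theta]  = gloss_fit(X, Y, Theta0, lambda, w, B);   % warm start from previous B
        path{end+1} = struct('lambda', lambda, 'B', B);        %#ok<SAGROW>
        lambda      = lambda / 2;                    % lambda_{t+1} = lambda_t / 2
    end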

5.6 Options and Variants

5.6.1 Scaling Variables

As most penalization schemes, GLOSS is sensitive to the scaling of the variables. It thus makes sense to normalize them before applying the algorithm or, equivalently, to accommodate weights in the penalty. This option is available in the algorithm.

5.6.2 Sparse Variant

This version replaces some matlab commands used in the standard version of GLOSS by their sparse equivalents. In addition, some mathematical structures are adapted for sparse computation.

5.6.3 Diagonal Variant

We motivated the group-Lasso penalty by sparsity requisites, but robustness considerations could also drive its usage, since LDA is known to be unstable when the number of examples is small compared to the number of variables. In this context, LDA has been experimentally observed to benefit from unrealistic assumptions on the form of the estimated within-class covariance matrix. Indeed, the diagonal approximation, which ignores correlations between genes, may lead to better classification in microarray analysis: Bickel and Levina (2004) showed that this crude approximation provides a classifier with better worst-case performance than the LDA decision rule in small sample size regimes, even if variables are correlated.

The equivalence proof between penalized OS and penalized LDA (Hastie et al., 1995) reveals that quadratic penalties in the OS problem are equivalent to penalties on the within-class covariance matrix in the LDA formulation. This proof suggests a slight variant of penalized OS, corresponding to penalized LDA with a diagonal within-class covariance matrix, where the least squares problems
$$\min_{B\in\mathbb{R}^{p\times(K-1)}}\ \|Y\Theta - XB\|_F^2 = \min_{B\in\mathbb{R}^{p\times(K-1)}}\ \mathrm{tr}\Bigl(\Theta^\top Y^\top Y\Theta - 2\,\Theta^\top Y^\top XB + nB^\top\Sigma_T B\Bigr)$$
are replaced by
$$\min_{B\in\mathbb{R}^{p\times(K-1)}}\ \mathrm{tr}\Bigl(\Theta^\top Y^\top Y\Theta - 2\,\Theta^\top Y^\top XB + nB^\top\bigl(\Sigma_B + \mathrm{diag}(\Sigma_W)\bigr)B\Bigr) .$$
Note that this variant only requires $\mathrm{diag}(\Sigma_W) + \Sigma_B + n^{-1}\Omega$ to be positive definite, which is a weaker requirement than $\Sigma_T + n^{-1}\Omega$ positive definite.

5.6.4 Elastic net and Structured Variant

For some learning problems, the structure of correlations between variables is partially known. Hastie et al. (1995) applied this idea to the field of handwritten digit recognition, for their penalized discriminant analysis model, to constrain the discriminant directions to be spatially smooth.

Figure 5.2: Graph and Laplacian matrix for a 3×3 image. The pixels are numbered

    7 8 9
    4 5 6
    1 2 3

and the corresponding Laplacian matrix is
$$\Omega_L = \begin{pmatrix} 3 & -1 & 0 & -1 & -1 & 0 & 0 & 0 & 0 \\ -1 & 5 & -1 & -1 & -1 & -1 & 0 & 0 & 0 \\ 0 & -1 & 3 & 0 & -1 & -1 & 0 & 0 & 0 \\ -1 & -1 & 0 & 5 & -1 & 0 & -1 & -1 & 0 \\ -1 & -1 & -1 & -1 & 8 & -1 & -1 & -1 & -1 \\ 0 & -1 & -1 & 0 & -1 & 5 & 0 & -1 & -1 \\ 0 & 0 & 0 & -1 & -1 & 0 & 3 & -1 & 0 \\ 0 & 0 & 0 & -1 & -1 & -1 & -1 & 5 & -1 \\ 0 & 0 & 0 & 0 & -1 & -1 & 0 & -1 & 3 \end{pmatrix} .$$

When an image is represented as a vector of pixels, it is reasonable to assume positive correlations between the variables corresponding to neighboring pixels. Figure 5.2 represents the neighborhood graph of the pixels of a 3×3 image, with the corresponding Laplacian matrix. The Laplacian matrix $\Omega_L$ is positive-semidefinite, and the penalty $\beta^\top\Omega_L\beta$ favors, among vectors of identical $L_2$ norms, the ones having similar coefficients in the neighborhoods of the graph. For example, this penalty is 9 for the vector $(1,1,0,1,1,0,0,0,0)^\top$, which is the indicator of pixel 1 and its neighbors, and it is 21 for the vector $(-1,1,0,1,1,0,0,0,0)^\top$, with a sign mismatch between pixel 1 and its neighborhood.

This smoothness penalty can be imposed jointly with the group-Lasso. From the computational point of view, GLOSS hardly needs to be modified: the smoothness penalty just has to be added to the group-Lasso penalty. As the new penalty is convex and quadratic (thus smooth), there is no additional burden in the overall algorithm. There is, however, an additional hyperparameter to be tuned.
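As an illustration, a matlab sketch that builds such a Laplacian penalty for a w-by-h pixel grid with 8-neighborhoods (as in Figure 5.2, up to a renumbering of the pixels) could read as follows; the helper name grid_laplacian is ours.

    % Graph Laplacian of the pixel grid: degree matrix minus adjacency matrix.
    function OmegaL = grid_laplacian(w, h)
        idx = reshape(1:w*h, h, w);             % pixel numbering (column by column)
        A   = zeros(w*h);
        for i = 1:h
            for j = 1:w
                for di = -1:1
                    for dj = -1:1
                        if (di ~= 0 || dj ~= 0) && i+di >= 1 && i+di <= h && j+dj >= 1 && j+dj <= w
                            A(idx(i, j), idx(i+di, j+dj)) = 1;   % connect 8-neighbors
                        end
                    end
                end
            end
        end
        OmegaL = diag(sum(A, 2)) - A;
    end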


6 Experimental Results

This section presents some comparison results between the Group-Lasso Optimal Scoring Solver algorithm and two other state-of-the-art classifiers proposed to perform sparse LDA. Those algorithms are Penalized LDA (PLDA) (Witten and Tibshirani, 2011), which applies a Lasso penalty within a Fisher's LDA framework, and Sparse Linear Discriminant Analysis (SLDA) (Clemmensen et al., 2011), which applies an Elastic net penalty to the OS problem. With the aim of testing the parsimony capacities, the latter algorithm was tested without any quadratic penalty, that is, with a Lasso penalty. The implementations of PLDA and SLDA are available from the authors' websites; PLDA is an R implementation and SLDA is coded in matlab. All the experiments used the same training, validation and test sets. Note that they differ significantly from the ones of Witten and Tibshirani (2011) in Simulation 4, for which there was a typo in their paper.

6.1 Normalization

With shrunken estimates, the scaling of features has important outcomes. For the linear discriminants considered here, the two most common normalization strategies consist in setting either the diagonal of the total covariance matrix $\Sigma_T$ to ones, or the diagonal of the within-class covariance matrix $\Sigma_W$ to ones. These options can be implemented either by scaling the observations accordingly prior to the analysis, or by providing penalties with weights. The latter option is implemented in our matlab package (the GLOSS matlab code can be found in the software section of www.hds.utc.fr/~grandval).

6.2 Decision Thresholds

The derivations of LDA based on the analysis of variance or on the regression of class indicators do not rely on the normality of the class-conditional distribution of the observations. Hence, their applicability extends beyond the realm of Gaussian data. Based on this observation, Friedman et al. (2009, chapter 4) suggest to investigate other decision thresholds than the ones stemming from the Gaussian mixture assumption. In particular, they propose to select the decision thresholds that empirically minimize the training error. This option was tested using validation sets or cross-validation.

6.3 Simulated Data

We first compare the three techniques in the simulation study of Witten and Tibshirani (2011), which considers four setups with 1200 examples equally distributed between classes. They are split into a training set of size n = 100, a validation set of size 100, and a test set of size 1000. We are in the small sample regime, with p = 500 variables, out of which 100 differ between classes. Independent variables are generated for all simulations except for Simulation 2, where they are slightly correlated. In Simulations 2 and 3, classes are optimally separated by a single projection of the original variables, while the two other scenarios require three discriminant directions. The Bayes' error was estimated to be respectively 1.7%, 6.7%, 7.3% and 30.0%. The exact definition of every setup, as provided in Witten and Tibshirani (2011), is given below.

Simulation 1: Mean shift with independent features. There are four classes. If sample $i$ is in class $k$, then $x_i\sim N(\mu_k, I)$, where $\mu_{1j} = 0.7\times 1_{(1\le j\le 25)}$, $\mu_{2j} = 0.7\times 1_{(26\le j\le 50)}$, $\mu_{3j} = 0.7\times 1_{(51\le j\le 75)}$, $\mu_{4j} = 0.7\times 1_{(76\le j\le 100)}$.

Simulation 2: Mean shift with dependent features. There are two classes. If sample $i$ is in class 1, then $x_i\sim N(0,\Sigma)$, and if $i$ is in class 2, then $x_i\sim N(\mu,\Sigma)$, with $\mu_j = 0.6\times 1_{(j\le 200)}$. The covariance structure is block diagonal, with 5 blocks, each of dimension $100\times 100$; the blocks have $(j,j')$ element $0.6^{|j-j'|}$. This covariance structure is intended to mimic gene expression data correlation.

Simulation 3: One-dimensional mean shift with independent features. There are four classes and the features are independent. If sample $i$ is in class $k$, then $X_{ij}\sim N\bigl(\frac{k-1}{3}, 1\bigr)$ if $j\le 100$, and $X_{ij}\sim N(0,1)$ otherwise.

Simulation 4: Mean shift with independent features and no linear ordering. There are four classes. If sample $i$ is in class $k$, then $x_i\sim N(\mu_k, I)$, with mean vectors defined as follows: $\mu_{1j}\sim N(0, 0.3^2)$ for $j\le 25$ and $\mu_{1j}=0$ otherwise; $\mu_{2j}\sim N(0, 0.3^2)$ for $26\le j\le 50$ and $\mu_{2j}=0$ otherwise; $\mu_{3j}\sim N(0, 0.3^2)$ for $51\le j\le 75$ and $\mu_{3j}=0$ otherwise; $\mu_{4j}\sim N(0, 0.3^2)$ for $76\le j\le 100$ and $\mu_{4j}=0$ otherwise.
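For illustration, a matlab sketch of the training data generation for Simulation 1 (with our own variable names) could read:

    % Simulation 1: four classes, 25 informative features per class, unit variance.
    n = 100; p = 500; K = 4;
    mu = zeros(K, p);
    for k = 1:K
        mu(k, (k-1)*25 + (1:25)) = 0.7;
    end
    y = repmat((1:K)', n/K, 1);                 % equally distributed classes
    X = mu(y, :) + randn(n, p);                 % x_i ~ N(mu_k, I)
    Y = full(sparse((1:n)', y, 1, n, K));       % 0/1 class indicator matrix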

Note that this protocol is detrimental to GLOSS, as each relevant variable only affects a single class mean out of $K$. The setup is favorable to PLDA, in the sense that most within-class covariance matrices are diagonal. We thus also tested the diagonal GLOSS variant discussed in Section 5.6.3.

The results are summarized in Table 6.1. Overall, the best predictions are performed by PLDA and GLOSS-D, which both benefit from the knowledge of the true within-class covariance structure. Then, among SLDA and GLOSS, which both ignore this structure, our proposal has a clear edge. The error rates are far away from the Bayes' error rates, but the sample size is small with regard to the number of relevant variables. Regarding sparsity, the clear overall winner is GLOSS, followed far away by SLDA, which is the only method that does not succeed in uncovering a low-dimensional representation in Simulation 3.

Table 6.1: Experimental results for simulated data: averages (with standard deviations) computed over 25 repetitions of the test error rate, the number of selected variables, and the number of discriminant directions selected on the validation set.

                                            Err (%)        Var            Dir
  Sim 1: K = 4, mean shift, ind. features
    PLDA                                    12.6 (0.1)     411.7 (3.7)    3.0 (0.0)
    SLDA                                    31.9 (0.1)     228.0 (0.2)    3.0 (0.0)
    GLOSS                                   19.9 (0.1)     106.4 (1.3)    3.0 (0.0)
    GLOSS-D                                 11.2 (0.1)     251.1 (4.1)    3.0 (0.0)
  Sim 2: K = 2, mean shift, dep. features
    PLDA                                     9.0 (0.4)     337.6 (5.7)    1.0 (0.0)
    SLDA                                    19.3 (0.1)      99.0 (0.0)    1.0 (0.0)
    GLOSS                                   15.4 (0.1)      39.8 (0.8)    1.0 (0.0)
    GLOSS-D                                  9.0 (0.0)     203.5 (4.0)    1.0 (0.0)
  Sim 3: K = 4, 1D mean shift, ind. features
    PLDA                                    13.8 (0.6)     161.5 (3.7)    1.0 (0.0)
    SLDA                                    57.8 (0.2)     152.6 (2.0)    1.9 (0.0)
    GLOSS                                   31.2 (0.1)     123.8 (1.8)    1.0 (0.0)
    GLOSS-D                                 18.5 (0.1)     357.5 (2.8)    1.0 (0.0)
  Sim 4: K = 4, mean shift, ind. features
    PLDA                                    60.3 (0.1)     336.0 (5.8)    3.0 (0.0)
    SLDA                                    65.9 (0.1)     208.8 (1.6)    2.7 (0.0)
    GLOSS                                   60.7 (0.2)      74.3 (2.2)    2.7 (0.0)
    GLOSS-D                                 58.8 (0.1)     162.7 (4.9)    2.9 (0.0)


Figure 6.1: TPR versus FPR (in %) for all algorithms (GLOSS, GLOSS-D, SLDA, PLDA) and all four simulations.

Table 6.2: Average TPR and FPR (in %) computed over 25 repetitions.

              Simulation 1      Simulation 2      Simulation 3      Simulation 4
              TPR    FPR        TPR    FPR        TPR    FPR        TPR    FPR
  PLDA        99.0   78.2       96.9   60.3       98.0   15.9       74.3   65.6
  SLDA        73.9   38.5       33.8   16.3       41.6   27.8       50.7   39.5
  GLOSS       64.1   10.6       30.0    4.6       51.1   18.2       26.0   12.1
  GLOSS-D     93.5   39.4       92.1   28.1       95.6   65.5       42.9   29.9

The adequacy of the selected features was assessed by the True Positive Rate (TPR) and the False Positive Rate (FPR). The TPR is defined as the ratio of selected variables that are actually relevant; similarly, the FPR is the ratio of selected variables that are actually non-relevant. The best algorithm would be the one that selects all the relevant variables and rejects all the others, that is, TPR = 1 and FPR = 0 simultaneously. PLDA has the best TPR, but a terrible FPR, except in Simulation 3, where it dominates all the other methods. GLOSS has by far the best FPR, with an overall TPR slightly below SLDA. Results are displayed in Figure 6.1 (both in percentages) and in Table 6.2.

6.4 Gene Expression Data

We now compare GLOSS to PLDA and SLDA on three genomic datasets. The Nakayama dataset (http://www.broadinstitute.org/cancer/software/genepattern/datasets) contains 105 examples of 22,283 gene expressions for categorizing 10 soft tissue tumors; it was reduced to the 86 examples belonging to the 5 dominant categories (Witten and Tibshirani, 2011). The Ramaswamy dataset (http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS2736) contains 198 examples of 16,063 gene expressions for categorizing 14 classes of cancer. Finally, the Sun dataset (http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS1962) contains 180 examples of 54,613 gene expressions for categorizing 4 classes of tumors.

Table 6.3: Experimental results for gene expression data: averages over 10 training/test set splits (with standard deviations) of the test error rates and the number of selected variables.

                                          Err (%)          Var
  Nakayama, n = 86, p = 22,283, K = 5
    PLDA                                  20.95 (1.3)      10478.7 (2116.3)
    SLDA                                  25.71 (1.7)        252.5 (3.1)
    GLOSS                                 20.48 (1.4)        129.0 (18.6)
  Ramaswamy, n = 198, p = 16,063, K = 14
    PLDA                                  38.36 (6.0)      14873.5 (720.3)
    SLDA                                  —                 —
    GLOSS                                 20.61 (6.9)        372.4 (122.1)
  Sun, n = 180, p = 54,613, K = 4
    PLDA                                  33.78 (5.9)      21634.8 (7443.2)
    SLDA                                  36.22 (6.5)        384.4 (16.5)
    GLOSS                                 31.77 (4.5)         93.0 (93.6)


Each dataset was split into a training set and a test set with respectively 75% and 25% of the examples. Parameter tuning is performed by 10-fold cross-validation, and the test performances are then evaluated. The process is repeated 10 times, with random choices of the training and test set split.

Test error rates and the number of selected variables are presented in Table 6.3. The results for the PLDA algorithm are extracted from Witten and Tibshirani (2011). The three methods have comparable prediction performances on the Nakayama and Sun datasets, but GLOSS performs better on the Ramaswamy data, where the SparseLDA package failed to return a solution due to numerical problems in the LARS-EN implementation. Regarding the number of selected variables, GLOSS is again much sparser than its competitors.

Finally, Figure 6.2 displays the projection of the observations for the Nakayama and Sun datasets in the first canonical planes estimated by GLOSS and SLDA. For the Nakayama dataset, groups 1 and 2 are well-separated from the other ones in both representations, but GLOSS is more discriminant in the meta-cluster gathering groups 3 to 5. For the Sun dataset, SLDA suffers from a high colinearity of its first canonical variables, which renders the second one almost non-informative. As a result, group 1 is better separated in the first canonical plane with GLOSS.


[Figure 6.2 consists of four scatter plots arranged as two rows (Nakayama, Sun) by two columns (GLOSS, SLDA), showing the observations in the plane spanned by the 1st and 2nd discriminant directions. Nakayama classes: 1) Synovial sarcoma, 2) Myxoid liposarcoma, 3) Dedifferentiated liposarcoma, 4) Myxofibrosarcoma, 5) Malignant fibrous histiocytoma. Sun classes: 1) NonTumor, 2) Astrocytomas, 3) Glioblastomas, 4) Oligodendrogliomas.]

Figure 6.2: 2D-representations of the Nakayama and Sun datasets based on the first two discriminant vectors provided by GLOSS and SLDA. The big squares represent class means.


Figure 6.3: USPS digits "1" and "0".

6.5 Correlated Data

When the features are known to be highly correlated, the discrimination algorithm can be improved by using this information in the optimization problem. The structured variant of GLOSS presented in Section 5.6.4, S-GLOSS from now on, was conceived to easily introduce this prior knowledge.

The experiments described in this section are intended to illustrate the effect of combining the group-Lasso sparsity-inducing penalty with a quadratic penalty used as a surrogate of the unknown within-class variance matrix. This preliminary experiment does not include comparisons with other algorithms; more comprehensive experimental results are left for future work.

For this illustration we have used a subset of the USPS handwritten digit dataset, made of 16×16-pixel images representing digits from 0 to 9. For our purpose, we compare the discriminant direction that separates digits "1" and "0" computed with GLOSS and with S-GLOSS. The mean image of every digit is shown in Figure 6.3.

As in Section 5.6.4, we have encoded the pixel proximity relationships of Figure 5.2 into a penalty matrix Ω_L, but this time on a 256-node graph. Introducing this new 256×256 Laplacian penalty matrix Ω_L in the GLOSS algorithm is straightforward.
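As an illustration, the sketch below builds such a Laplacian penalty matrix for the 16×16 pixel grid, assuming a 4-neighbour adjacency; the actual graph used in Section 5.6.4 may differ.

```python
import numpy as np

def grid_laplacian(height=16, width=16):
    """Graph Laplacian of a height x width pixel grid with 4-neighbour edges."""
    n = height * width
    A = np.zeros((n, n))                      # adjacency matrix of the pixel graph
    idx = lambda r, c: r * width + c
    for r in range(height):
        for c in range(width):
            if r + 1 < height:                # vertical neighbour
                A[idx(r, c), idx(r + 1, c)] = A[idx(r + 1, c), idx(r, c)] = 1
            if c + 1 < width:                 # horizontal neighbour
                A[idx(r, c), idx(r, c + 1)] = A[idx(r, c + 1), idx(r, c)] = 1
    D = np.diag(A.sum(axis=1))                # degree matrix
    return D - A                              # positive semi-definite penalty matrix

Omega_L = grid_laplacian()                    # 256 x 256 matrix used as penalty
```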

The effect of this penalty is fairly evident in Figure 6.4, where the discriminant vector β resulting from a non-penalized execution of GLOSS is compared with the β resulting from a Laplace-penalized execution of S-GLOSS (without group-Lasso penalty). We clearly distinguish the center of the digit "0" in the discriminant direction obtained by S-GLOSS, which is probably the most important element to discriminate both digits.

Figure 6.5 displays the discriminant directions β obtained by GLOSS and S-GLOSS for a non-zero group-Lasso penalty with an identical penalization parameter (λ = 0.3). Even if both solutions are sparse, the discriminant vector from S-GLOSS keeps connected pixels, which allows strokes to be detected and will probably provide better prediction results.


Figure 6.4: Discriminant direction between digits "1" and "0": β for GLOSS (left) and β for S-GLOSS (right).

Figure 6.5: Sparse discriminant direction between digits "1" and "0": β for GLOSS (left) and β for S-GLOSS (right), both with λ = 0.3.


Discussion

GLOSS is an efficient algorithm that performs sparse LDA based on the regression of class indicators. Our proposal is equivalent to a penalized LDA problem; to our knowledge, it is the first approach that enjoys this property in the multi-class setting. This relationship also makes it possible to accommodate interesting constraints on the equivalent penalized LDA problem, such as imposing a diagonal structure on the within-class covariance matrix.

Computationally, GLOSS is based on an efficient active set strategy that is amenable to the processing of problems with a large number of variables. The inner optimization problem decouples the p × (K − 1)-dimensional problem into K − 1 independent p-dimensional problems. The interaction between the K − 1 problems is relegated to the computation of the common adaptive quadratic penalty. The algorithm presented here is highly efficient in medium to high dimensional setups, which makes it a good candidate for the analysis of gene expression data.

The experimental results confirm the relevance of the approach, which behaves well compared to its competitors regarding both its prediction abilities and its interpretability (sparsity). Generally, compared to the competing approaches, GLOSS provides extremely parsimonious discriminants without compromising prediction performance. Employing the same features in all discriminant directions enables models that are globally extremely parsimonious, with good prediction abilities. The resulting sparse discriminant directions also allow for visual inspection of the data from the low-dimensional representations that can be produced.

The approach has many potential extensions that have not yet been implemented. A first line of development is to consider a broader class of penalties. For example, plain quadratic penalties can be added to the group penalty to encode priors about the within-class covariance structure, in the spirit of the Penalized Discriminant Analysis of Hastie et al. (1995). Also, besides the group-Lasso, our framework can be customized to any penalty that is uniformly spread within groups, and many composite or hierarchical penalties that have been proposed for structured data meet this condition.


Part III

Sparse Clustering Analysis


Abstract

Clustering can be defined as the task of grouping samples such that all the elements belonging to one cluster are more "similar" to each other than to the objects belonging to the other groups. There are similarity measures for any data structure: database records or even multimedia objects (audio, video). The similarity concept is closely related to the idea of distance, which is a specific dissimilarity.

Model-based clustering aims to describe a heterogeneous population with a probabilistic model that represents each group with its own distribution. Here the distributions will be Gaussian, and the different populations are identified by different means and a common covariance matrix.

As in the supervised framework, traditional clustering techniques perform worse when the number of irrelevant features increases. In this part we develop Mix-GLOSS, which builds on the supervised GLOSS algorithm to address unsupervised problems, resulting in a clustering mechanism with embedded feature selection.

Chapter 7 reviews different techniques for inducing sparsity in model-based clustering algorithms. The theory that motivates our original formulation of the EM algorithm is developed in Chapter 8, followed by the description of the algorithm in Chapter 9. Its performance is assessed and compared to other model-based sparse clustering mechanisms at the state of the art in Chapter 10.


7 Feature Selection in Mixture Models

7.1 Mixture Models

One of the most popular clustering algorithms is K-means, which aims to partition n observations into K clusters, each observation being assigned to the cluster with the nearest mean (MacQueen, 1967). A generalization of K-means can be made through probabilistic models, which represent K subpopulations by a mixture of distributions. Since their first use by Newcomb (1886) for the detection of outlier points, and 8 years later by Pearson (1894) to identify two separate populations of crabs, finite mixtures of distributions have been employed to model a wide variety of random phenomena. These models assume that measurements are taken from a set of individuals, each of which belongs to one out of a number of different classes, while any individual's particular class is unknown. Mixture models can thus address the heterogeneity of a population and are especially well suited to the problem of clustering.

7.1.1 Model

We assume that the observed data X = (x_1^T, ..., x_n^T)^T have been drawn identically from K different subpopulations in the domain R^p. The generative distribution is a finite mixture model, that is, the data are assumed to be generated from a compounded distribution whose density can be expressed as

    f(x_i) = Σ_{k=1}^{K} π_k f_k(x_i),   ∀ i ∈ {1, ..., n},

where K is the number of components, f_k are the densities of the components and π_k are the mixture proportions (π_k ∈ ]0, 1[ for all k, and Σ_k π_k = 1). Mixture models transcribe that, given the proportions π_k and the distributions f_k for each class, the data are generated according to the following mechanism:

• y: each individual is allotted to a class according to a multinomial distribution with parameters π_1, ..., π_K;

• x: each x_i is assumed to arise from a random vector with probability density function f_k.

In addition, it is usually assumed that the component densities f_k belong to a parametric family of densities φ(·; θ_k). The density of the mixture can then be written as

    f(x_i; θ) = Σ_{k=1}^{K} π_k φ(x_i; θ_k),   ∀ i ∈ {1, ..., n},


where θ = (π_1, ..., π_K, θ_1, ..., θ_K) is the parameter of the model.

7.1.2 Parameter Estimation: The EM Algorithm

For the estimation of the parameters of the mixture model, Pearson (1894) used the method of moments to estimate the five parameters (μ_1, μ_2, σ_1^2, σ_2^2, π) of a univariate Gaussian mixture model with two components. That method required him to solve polynomial equations of degree nine. There are also graphical methods, maximum likelihood methods and Bayesian approaches.

The most widely used approach is to estimate the parameters by maximizing the log-likelihood with the EM algorithm, which is typically used to maximize the likelihood for models with latent variables, for which no analytical solution is available (Dempster et al., 1977).

The EM algorithm iterates two steps, called the expectation step (E) and the maximization step (M). Each expectation step involves the computation of the expectation of the likelihood with respect to the hidden variables, while each maximization step estimates the parameters by maximizing the expected likelihood obtained in the E-step.

Under mild regularity assumptions, this mechanism converges to a local maximum of the likelihood. However, the type of problems targeted is typically characterized by the existence of several local maxima, and global convergence cannot be guaranteed; in practice, the obtained solution depends on the initialization of the algorithm.

Maximum Likelihood Definitions

The likelihood is commonly expressed in its logarithmic version:

    L(θ; X) = log( Π_{i=1}^{n} f(x_i; θ) )
            = Σ_{i=1}^{n} log( Σ_{k=1}^{K} π_k f_k(x_i; θ_k) ),    (7.1)

where n is the number of samples, K is the number of components of the mixture (or number of clusters) and π_k are the mixture proportions.

To obtain maximum likelihood estimates, the EM algorithm works with the joint distribution of the observations x and of the unknown latent variables y, which indicate the cluster membership of every sample. The pair z = (x, y) is called the complete data. The log-likelihood of the complete data is called the complete log-likelihood or


classification log-likelihood:

    L_C(θ; X, Y) = log( Π_{i=1}^{n} f(x_i, y_i; θ) )
                 = Σ_{i=1}^{n} log( Σ_{k=1}^{K} y_ik π_k f_k(x_i; θ_k) )
                 = Σ_{i=1}^{n} Σ_{k=1}^{K} y_ik log( π_k f_k(x_i; θ_k) ).    (7.2)

The y_ik are the binary entries of the indicator matrix Y, with y_ik = 1 if observation i belongs to cluster k and y_ik = 0 otherwise.

The soft membership t_ik(θ) is defined as

    t_ik(θ) = p(Y_ik = 1 | x_i; θ)    (7.3)
            = π_k f_k(x_i; θ_k) / f(x_i; θ).    (7.4)

To lighten notations, t_ik(θ) will be denoted t_ik when the parameter θ is clear from the context. The regular (7.1) and complete (7.2) log-likelihoods are related as follows:

    L_C(θ; X, Y) = Σ_{ik} y_ik log( π_k f_k(x_i; θ_k) )
                 = Σ_{ik} y_ik log( t_ik f(x_i; θ) )
                 = Σ_{ik} y_ik log t_ik + Σ_{ik} y_ik log f(x_i; θ)
                 = Σ_{ik} y_ik log t_ik + Σ_{i=1}^{n} log f(x_i; θ)
                 = Σ_{ik} y_ik log t_ik + L(θ; X),    (7.5)

where Σ_{ik} y_ik log t_ik can be reformulated as

    Σ_{ik} y_ik log t_ik = Σ_{i=1}^{n} Σ_{k=1}^{K} y_ik log( p(Y_ik = 1 | x_i; θ) )
                         = Σ_{i=1}^{n} log( p(Y_ik = 1 | x_i; θ) )
                         = log( p(Y | X; θ) ).

As a result, the relationship (7.5) can be rewritten as

    L(θ; X) = L_C(θ; Z) − log( p(Y | X; θ) ).    (7.6)


Likelihood Maximization

The complete log-likelihood cannot be assessed because the variables y_ik are unknown. However, it is possible to estimate the value of the log-likelihood by taking expectations, conditionally on a current value of θ, in (7.6):

    L(θ; X) = E_{Y ~ p(·|X; θ^(t))}[ L_C(θ; X, Y) ] + E_{Y ~ p(·|X; θ^(t))}[ −log p(Y | X; θ) ]
            =            Q(θ, θ^(t))               +            H(θ, θ^(t)).

In this expression, H(θ, θ^(t)) is the entropy and Q(θ, θ^(t)) is the conditional expectation of the complete log-likelihood. Let us define an increment of the log-likelihood as ΔL = L(θ^(t+1); X) − L(θ^(t); X). Then θ^(t+1) = argmax_θ Q(θ, θ^(t)) also increases the log-likelihood:

    ΔL = ( Q(θ^(t+1), θ^(t)) − Q(θ^(t), θ^(t)) ) − ( H(θ^(t+1), θ^(t)) − H(θ^(t), θ^(t)) ),

where the first term is non-negative by definition of iteration t+1 and the second term is non-positive by Jensen's inequality.

Therefore, it is possible to maximize the likelihood by optimizing Q(θ, θ^(t)). The relationship between Q(θ, θ′) and L(θ; X) is developed in deeper detail in Appendix F, to show how the value of L(θ; X) can be recovered from Q(θ, θ^(t)).

For the mixture model problem, Q(θ, θ′) is

    Q(θ, θ′) = E_{Y ~ p(Y|X; θ′)}[ L_C(θ; X, Y) ]
             = Σ_{ik} p(Y_ik = 1 | x_i; θ′) log( π_k f_k(x_i; θ_k) )
             = Σ_{i=1}^{n} Σ_{k=1}^{K} t_ik(θ′) log( π_k f_k(x_i; θ_k) ).    (7.7)

Due to its similarity with the expression of the complete likelihood (7.2), Q(θ, θ′) is also known as the weighted likelihood. In (7.7), the weights t_ik(θ′) are the posterior probabilities of cluster memberships.

Hence, the EM algorithm sketched above results in:

• Initialization (not iterated): choice of the initial parameter θ^(0);

• E-Step: evaluation of Q(θ, θ^(t)), using t_ik(θ^(t)) (7.4) in (7.7);

• M-Step: calculation of θ^(t+1) = argmax_θ Q(θ, θ^(t)).


Gaussian Model

In the particular case of a Gaussian mixture model with a common covariance matrix Σ and different mean vectors μ_k, the mixture density is

    f(x_i; θ) = Σ_{k=1}^{K} π_k f_k(x_i; θ_k)
              = Σ_{k=1}^{K} π_k (2π)^{−p/2} |Σ|^{−1/2} exp( −(1/2) (x_i − μ_k)^T Σ^{−1} (x_i − μ_k) ).

At the E-step, the posterior probabilities t_ik are computed as in (7.4) with the current parameters θ^(t); then the M-step maximizes Q(θ, θ^(t)) (7.7), whose form is as follows:

    Q(θ, θ^(t)) = Σ_{ik} t_ik log(π_k) − Σ_{ik} t_ik log( (2π)^{p/2} |Σ|^{1/2} ) − (1/2) Σ_{ik} t_ik (x_i − μ_k)^T Σ^{−1} (x_i − μ_k)
                = Σ_k t_k log(π_k) − (np/2) log(2π) − (n/2) log|Σ| − (1/2) Σ_{ik} t_ik (x_i − μ_k)^T Σ^{−1} (x_i − μ_k)
                ≡ Σ_k t_k log(π_k) − (n/2) log|Σ| − Σ_{ik} t_ik ( (1/2) (x_i − μ_k)^T Σ^{−1} (x_i − μ_k) ),    (7.8)

where the constant term −(np/2) log(2π) has been dropped and

    t_k = Σ_{i=1}^{n} t_ik.    (7.9)

The M-step, which maximizes this expression with respect to θ, applies the following updates defining θ^(t+1):

    π_k^(t+1) = t_k / n,    (7.10)

    μ_k^(t+1) = Σ_i t_ik x_i / t_k,    (7.11)

    Σ^(t+1) = (1/n) Σ_k W_k,    (7.12)

    with W_k = Σ_i t_ik (x_i − μ_k)(x_i − μ_k)^T.    (7.13)

The derivations are detailed in Appendix G.
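A minimal sketch of these E and M steps is given below; the log-sum-exp trick and the small ridge added to Σ are numerical safeguards of this illustration, not part of equations (7.4) and (7.10)-(7.13).

```python
import numpy as np
from scipy.special import logsumexp

def em_common_covariance(X, K, n_iter=100, ridge=1e-6, seed=0):
    """EM for a Gaussian mixture with common covariance, following (7.4), (7.10)-(7.13)."""
    n, p = X.shape
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(n, K, replace=False)]            # initial means
    Sigma = np.cov(X.T) + ridge * np.eye(p)
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: posterior probabilities t_ik, eq. (7.4), computed in log scale
        inv = np.linalg.inv(Sigma)
        logdet = np.linalg.slogdet(Sigma)[1]
        logf = np.empty((n, K))
        for k in range(K):
            d = X - mu[k]
            maha = np.einsum('ij,jk,ik->i', d, inv, d)
            logf[:, k] = np.log(pi[k]) - 0.5 * (maha + logdet + p * np.log(2 * np.pi))
        T = np.exp(logf - logsumexp(logf, axis=1, keepdims=True))
        # M-step: eqs. (7.9)-(7.13)
        tk = T.sum(axis=0)                              # (7.9)
        pi = tk / n                                     # (7.10)
        mu = (T.T @ X) / tk[:, None]                    # (7.11)
        Sigma = sum((T[:, k, None] * (X - mu[k])).T @ (X - mu[k])   # W_k, (7.13)
                    for k in range(K)) / n + ridge * np.eye(p)      # (7.12)
    return pi, mu, Sigma, T
```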

7.2 Feature Selection in Model-Based Clustering

When common covariance matrices are assumed, Gaussian mixtures are related to LDA, with partitions defined by linear decision rules. When every cluster has its own


covariance matrix Σ_k, Gaussian mixtures are associated with quadratic discriminant analysis (QDA), with quadratic boundaries.

In the high-dimensional low-sample setting, numerical issues appear in the estimation of the covariance matrix. To avoid those singularities, regularization may be applied. A regularized trade-off between LDA and QDA (RDA) was proposed by Friedman (1989); Bensmail and Celeux (1996) extended this algorithm by rewriting the covariance matrix in terms of its eigenvalue decomposition Σ_k = λ_k D_k A_k D_k^T (Banfield and Raftery, 1993). These regularization schemes address singularity and stability issues, but they do not induce parsimonious models.

In this chapter we review some techniques to induce sparsity in model-based clustering algorithms. This sparsity refers to the rule that assigns examples to classes: clustering is still performed in the original p-dimensional space, but the decision rule can be expressed with only a few coordinates of this high-dimensional space.

7.2.1 Based on Penalized Likelihood

Penalized log-likelihood maximization is a popular estimation technique for mixture models. It is typically achieved by the EM algorithm, using mixture models for which the allocation of examples is expressed as a simple function of the input features. For example, for Gaussian mixtures with a common covariance matrix, the log-ratio of posterior probabilities is a linear function of x:

    log( p(Y_k = 1 | x) / p(Y_ℓ = 1 | x) ) = x^T Σ^{−1} (μ_k − μ_ℓ) − (1/2) (μ_k + μ_ℓ)^T Σ^{−1} (μ_k − μ_ℓ) + log( π_k / π_ℓ ).

In this model, a simple way of introducing sparsity in the discriminant vectors Σ^{−1}(μ_k − μ_ℓ) is to constrain Σ to be diagonal and to favor sparse means μ_k. Indeed, for Gaussian mixtures with a common diagonal covariance matrix, if all means have the same value on dimension j, then variable j is useless for class allocation and can be discarded. The means can be penalized by the L1 norm,

    λ Σ_{k=1}^{K} Σ_{j=1}^{p} |μ_kj|,

as proposed by Pan et al. (2006) and Pan and Shen (2007). Zhou et al. (2009) consider more complex penalties on full covariance matrices:

    λ_1 Σ_{k=1}^{K} Σ_{j=1}^{p} |μ_kj| + λ_2 Σ_{k=1}^{K} Σ_{j=1}^{p} Σ_{m=1}^{p} |(Σ_k^{−1})_{jm}|.

In their algorithm, they make use of the graphical Lasso to estimate the covariances. Even if their formulation induces sparsity on the parameters, their combination of L1 penalties does not directly target decision rules based on few variables, and thus does not guarantee parsimonious models.


Guo et al. (2010) propose a variation with a Pairwise Fusion Penalty (PFP):

    λ Σ_{j=1}^{p} Σ_{1 ≤ k ≤ k′ ≤ K} |μ_kj − μ_k′j|.

This PFP regularization does not shrink the means to zero, but towards each other. If the jth components of all the cluster means are driven to the same value, that variable can be considered as non-informative.

An L1,∞ penalty is used by Wang and Zhu (2008) and Kuan et al. (2010) to penalize the likelihood, encouraging null groups of features:

    λ Σ_{j=1}^{p} ‖(μ_1j, μ_2j, ..., μ_Kj)‖_∞.

One group is defined for each variable j, as the set of the jth components of the K means, (μ_1j, ..., μ_Kj). The L1,∞ penalty forces zeros at the group level, favoring the removal of the corresponding feature. This method seems to produce parsimonious models and good partitions within a reasonable computing time; in addition, the code is publicly available. Xie et al. (2008b) apply a group-Lasso penalty. Their principle describes a vertical mean grouping (VMG, with the same groups as Xie et al. (2008a)) and a horizontal mean grouping (HMG). VMG achieves real feature selection because it forces null values for the same variable in all cluster means:

    λ √K Σ_{j=1}^{p} sqrt( Σ_{k=1}^{K} μ_kj^2 ).

The clustering algorithm of VMG differs from ours, but the proposed group penalty is the same; however, no code is available on the authors' website that would allow testing it.
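For reference, the sketch below evaluates the four penalties reviewed in this section on a K × p matrix of cluster means; it is only meant to make the different groupings explicit.

```python
import numpy as np
from itertools import combinations

def l1_penalty(M, lam):                      # Pan et al. (2006), Pan and Shen (2007)
    return lam * np.abs(M).sum()

def pairwise_fusion_penalty(M, lam):         # Guo et al. (2010)
    return lam * sum(np.abs(M[k] - M[kp]).sum()
                     for k, kp in combinations(range(M.shape[0]), 2))

def l1_inf_penalty(M, lam):                  # Wang and Zhu (2008), Kuan et al. (2010)
    return lam * np.abs(M).max(axis=0).sum()

def group_lasso_penalty(M, lam):             # Xie et al. (2008b), VMG grouping
    K = M.shape[0]
    return lam * np.sqrt(K) * np.sqrt((M ** 2).sum(axis=0)).sum()

# M has one row per cluster and one column per variable
M = np.array([[0.0, 1.2, 0.0], [0.0, -0.7, 0.3]])
print(l1_penalty(M, 1.0), group_lasso_penalty(M, 1.0))
```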

The optimization of a penalized likelihood by means of an EM algorithm can be reformulated by rewriting the maximization expressions of the M-step as a penalized optimal scoring regression. Roth and Lange (2004) implemented it for two-cluster problems, using an L1 penalty to encourage sparsity of the discriminant vector; the generalization from quadratic to non-quadratic penalties is quickly outlined in their work. We extend this work by considering an arbitrary number of clusters and by formalizing the link between penalized optimal scoring and penalized likelihood estimation.

7.2.2 Based on Model Variants

The algorithm proposed by Law et al. (2004) takes a different stance. The authors define feature relevancy in terms of conditional independence: the jth feature is presumed uninformative if its distribution is independent of the class labels. The density is expressed as


    f(x_i | φ, π, θ, ν) = Σ_{k=1}^{K} π_k Π_{j=1}^{p} [f(x_ij | θ_jk)]^{φ_j} [h(x_ij | ν_j)]^{1−φ_j},

where f(· | θ_jk) is the distribution function for relevant features and h(· | ν_j) is the distribution function for the irrelevant ones. The binary vector φ = (φ_1, φ_2, ..., φ_p) represents relevance, with φ_j = 1 if the jth feature is informative and φ_j = 0 otherwise. The saliency of variable j is then formalized as ρ_j = P(φ_j = 1), so all the φ_j must be treated as missing variables. The set of parameters is {π_k, θ_jk, ν_j, ρ_j}; their estimation is done by means of the EM algorithm (Law et al., 2004).

An original and recent technique is the Fisher-EM algorithm proposed by Bouveyron and Brunet (2012b,a). Fisher-EM is a modified version of EM that runs in a latent space. This latent space is defined by an orthogonal projection matrix U ∈ R^{p×(K−1)}, which is updated inside the EM loop with a new step, called the Fisher step (F-step from now on), that maximizes the multi-class Fisher criterion

    tr( (U^T Σ_W U)^{−1} U^T Σ_B U ),    (7.14)

so as to maximize the separability of the data. The E-step is the standard one, computing the posterior probabilities. Then the F-step updates the projection matrix that maps the data to the latent space. Finally, the M-step estimates the parameters by maximizing the conditional expectation of the complete log-likelihood. Those parameters can be rewritten as functions of the projection matrix U and of the model parameters in the latent space, so that the matrix U enters the M-step equations.
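For intuition, the sketch below computes the unpenalized maximizer of (7.14): the K−1 leading generalized eigenvectors of the pair (Σ_B, Σ_W). It ignores the orthogonality constraint and the sparsity devices of Fisher-EM, and the small ridge is only a numerical safeguard of this illustration.

```python
import numpy as np
from scipy.linalg import eigh

def fisher_directions(Sigma_B, Sigma_W, K, ridge=1e-8):
    """Leading K-1 generalized eigenvectors of Sigma_B u = w Sigma_W u."""
    p = Sigma_W.shape[0]
    w, V = eigh(Sigma_B, Sigma_W + ridge * np.eye(p))
    order = np.argsort(w)[::-1]                # largest eigenvalues first
    return V[:, order[:K - 1]]                 # p x (K-1) projection matrix

def fisher_criterion(U, Sigma_B, Sigma_W):
    """Value of tr((U' Sigma_W U)^-1 U' Sigma_B U) for a candidate projection U."""
    return np.trace(np.linalg.solve(U.T @ Sigma_W @ U, U.T @ Sigma_B @ U))
```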

To induce feature selection, Bouveyron and Brunet (2012a) suggest three possibilities. The first one results in the best sparse orthogonal approximation Ũ of the matrix U which maximizes (7.14). This sparse approximation is defined as the solution of

    min_{Ũ ∈ R^{p×(K−1)}}  ‖X_U − X Ũ‖_F^2 + λ Σ_{k=1}^{K−1} ‖ũ_k‖_1,

where X_U = XU is the input data projected in the non-sparse space and ũ_k is the kth column vector of the projection matrix Ũ. The second possibility is inspired by Qiao et al. (2009) and reformulates Fisher's discriminant (7.14), used to compute the projection matrix, as a regression criterion penalized by a mixture of Lasso and Elastic net:

    min_{A,B ∈ R^{p×(K−1)}}  Σ_{k=1}^{K} ‖ R_W^{−T} H_{B,k} − A B^T H_{B,k} ‖_2^2 + ρ Σ_{j=1}^{K−1} β_j^T Σ_W β_j + λ Σ_{j=1}^{K−1} ‖β_j‖_1
    s.t.  A^T A = I_{K−1},

where H_B ∈ R^{p×K} is a matrix defined conditionally on the posterior probabilities t_ik, satisfying H_B H_B^T = Σ_B, and H_{B,k} is the kth column of H_B; R_W ∈ R^{p×p} is an upper


triangular matrix resulting from the Cholesky decomposition of Σ_W. Σ_W and Σ_B are the p × p within-class and between-class covariance matrices in the observation space. A ∈ R^{p×(K−1)} and B ∈ R^{p×(K−1)} are the solutions of the optimization problem, such that B = [β_1, ..., β_{K−1}] is the best sparse approximation of U.

The last possibility casts the solution of the Fisher discriminant (7.14) as the solution of the following constrained optimization problem:

    min_{U ∈ R^{p×(K−1)}}  Σ_{j=1}^{p} ‖ Σ_{B,j} − U U^T Σ_{B,j} ‖_2^2
    s.t.  U^T U = I_{K−1},

where Σ_{B,j} is the jth column of the between-class covariance matrix in the observation space. This problem can be solved by the penalized version of the singular value decomposition proposed by Witten et al. (2009), resulting in a sparse approximation of U.

To comply with the constraint stating that the columns of U are orthogonal, the first and the second options must be followed by a singular value decomposition of U to recover orthogonality. This is not necessary with the third option, since the penalized version of SVD already guarantees orthogonality.

However, there is a lack of guarantees regarding convergence. Bouveyron states: "the update of the orientation matrix U in the F-step is done by maximizing the Fisher criterion and not by directly maximizing the expected complete log-likelihood as required in the EM algorithm theory. From this point of view, the convergence of the Fisher-EM algorithm cannot therefore be guaranteed." Immediately after this paragraph, we can read that, under certain assumptions, their algorithm converges: "the model [...] which assumes the equality and the diagonality of covariance matrices, the F-step of the Fisher-EM algorithm satisfies the convergence conditions of the EM algorithm theory and the convergence of the Fisher-EM algorithm can be guaranteed in this case. For the other discriminant latent mixture models, although the convergence of the Fisher-EM procedure cannot be guaranteed, our practical experience has shown that the Fisher-EM algorithm rarely fails to converge with these models if correctly initialized."

7.2.3 Based on Model Selection

Some clustering algorithms recast the feature selection problem as a model selection problem. Following this idea, Raftery and Dean (2006) model the observations as a mixture of Gaussian distributions. To discover a subset of relevant features (and its superfluous complement), they define three subsets of variables:

• X^(1): the set of selected relevant variables;

• X^(2): the set of variables being considered for inclusion in or exclusion from X^(1);

• X^(3): the set of non-relevant variables.


With those subsets, they define two different models, where Y is the partition to consider:

• M1:  f(X | Y) = f(X^(1), X^(2), X^(3) | Y) = f(X^(3) | X^(2), X^(1)) f(X^(2) | X^(1)) f(X^(1) | Y);

• M2:  f(X | Y) = f(X^(1), X^(2), X^(3) | Y) = f(X^(3) | X^(2), X^(1)) f(X^(2), X^(1) | Y).

Model M1 means that the variables in X^(2) are independent of the clustering Y; model M2

shows that the variables in X^(2) depend on the clustering Y. To simplify the algorithm, the subset X^(2) is only updated one variable at a time. Therefore, deciding on the relevance of a variable in X^(2) amounts to a model selection between M1 and M2. The selection is done via the Bayes factor

    B_12 = f(X | M1) / f(X | M2),

where the high-dimensional f(X^(3) | X^(2), X^(1)) cancels from the ratio:

    B_12 = f(X^(1), X^(2), X^(3) | M1) / f(X^(1), X^(2), X^(3) | M2)
         = f(X^(2) | X^(1), M1) f(X^(1) | M1) / f(X^(2), X^(1) | M2).

This factor is approximated, since the integrated likelihoods f(X^(1) | M1) and f(X^(2), X^(1) | M2) are difficult to calculate exactly; Raftery and Dean (2006) use the BIC approximation. The computation of f(X^(2) | X^(1), M1), when there is only one variable in X^(2), can be represented as a linear regression of the variable in X^(2) on the variables in X^(1); there is also a BIC approximation for this term.

in X(2) can be represented as a linear regression of variable X(2) on the variables inX(1) There is also a BIC approximation for this term

Maugis et al. (2009a) have proposed a variation of the algorithm developed by Raftery and Dean. They define three subsets of variables: the relevant and irrelevant subsets (X^(1) and X^(3)) remain the same, but X^(2) is reformulated as a subset of relevant variables that explains the irrelevant ones through a multidimensional regression. This algorithm also uses a backward stepwise strategy instead of the forward stepwise search used by Raftery and Dean (2006). Their algorithm allows the definition of blocks of indivisible variables, which in certain situations improve the clustering and its interpretability.

Both algorithms are well motivated and appear to produce good results; however, the amount of computation needed to test the different subsets of variables requires a huge computation time. In practice, they cannot be used for the amount of data considered in this thesis.


8 Theoretical Foundations

In this chapter we develop Mix-GLOSS, which uses the GLOSS algorithm conceived for supervised classification (see Chapter 5) to solve clustering problems. The goal here is similar, that is, providing an assignment of examples to clusters based on few features.

We use a modified version of the EM algorithm whose M-step is formulated as a penalized linear regression of a scaled indicator matrix, that is, a penalized optimal scoring problem. This idea was originally proposed by Hastie and Tibshirani (1996) to derive reduced-rank decision rules using less than K − 1 discriminant directions. Their motivation was mainly driven by stability issues; no sparsity-inducing mechanism was introduced in the construction of the discriminant directions. Roth and Lange (2004) pursued this idea for binary clustering problems, where sparsity was introduced by a Lasso penalty applied to the OS problem. Besides extending the work of Roth and Lange (2004) to an arbitrary number of clusters, we draw links between the OS penalty and the parameters of the Gaussian model.

In the subsequent sections, we provide the principles that allow the M-step to be solved as an optimal scoring problem. The feature selection technique is embedded by means of a group-Lasso penalty. We must then guarantee that the equivalence between the M-step and the OS problem holds for our penalty; as with GLOSS, this is accomplished with a variational approach of the group-Lasso. Finally, some considerations regarding the criterion that is optimized with this modified EM are provided.

8.1 Resolving EM with Optimal Scoring

In the previous chapters, EM was presented as an iterative algorithm that computes a maximum likelihood estimate through the maximization of the expected complete log-likelihood. This section explains how a penalized OS regression embedded into an EM algorithm produces a penalized likelihood estimate.

8.1.1 Relationship Between the M-Step and Linear Discriminant Analysis

LDA is typically used in a supervised learning framework for classification and dimension reduction. It looks for a projection of the data where the ratio of between-class variance to within-class variance is maximized (see Appendix C). Classification in the LDA domain is based on the Mahalanobis distance

    d(x_i, μ_k) = (x_i − μ_k)^T Σ_W^{−1} (x_i − μ_k),

where the μ_k are the p-dimensional centroids and Σ_W is the p × p common within-class covariance matrix.


The likelihood equations of the M-step, (7.11) and (7.12), can be interpreted as the mean and covariance estimates of a weighted and augmented LDA problem (Hastie and Tibshirani, 1996), where the n observations are replicated K times and weighted by the t_ik (the posterior probabilities computed at the E-step).

Having replicated the data vectors, Hastie and Tibshirani (1996) remark that the parameters maximizing the mixture likelihood in the M-step of the EM algorithm, (7.11) and (7.12), can also be defined as the maximizers of the weighted and augmented likelihood

    2 l_weight(μ, Σ) = Σ_{i=1}^{n} Σ_{k=1}^{K} t_ik d(x_i, μ_k) − n log|Σ_W|,

which arises when considering a weighted and augmented LDA problem. This viewpoint provides the basis for an alternative maximization of the penalized maximum likelihood in Gaussian mixtures.

8.1.2 Relationship Between Optimal Scoring and Linear Discriminant Analysis

The equivalence between penalized optimal scoring problems and penalized linear discriminant analysis has already been detailed in Section 4.1 in the supervised learning framework. This is a critical part of the link between the M-step of an EM algorithm and optimal scoring regression.

8.1.3 Clustering Using Penalized Optimal Scoring

The solution of the penalized optimal scoring regression in the M-step is a coefficient matrix B_OS, analytically related to the Fisher discriminant directions B_LDA for the data (X, Y), where Y is the current (hard or soft) cluster assignment. In order to compute the posterior probabilities t_ik in the E-step, the distance between the samples x_i and the centroids μ_k must be evaluated. Depending on whether we are working in the input, OS or LDA domain, different expressions are used for the distances (see Section 4.2.2 for more details). Mix-GLOSS works in the LDA domain, based on the following expression:

    d(x_i, μ_k) = ‖(x_i − μ_k) B_LDA‖_2^2 − 2 log(π_k).

This distance defines the computation of the posterior probabilities t_ik in the E-step (see Section 4.2.3). Putting all those elements together, the complete clustering algorithm can be summarized as follows:


1. Initialize the membership matrix Y (for example with the K-means algorithm).

2. Solve the p-OS problem as

       B_OS = (X^T X + λΩ)^{−1} X^T Y Θ,

   where Θ are the K − 1 leading eigenvectors of

       Y^T X (X^T X + λΩ)^{−1} X^T Y.

3. Map X to the LDA domain: X_LDA = X B_OS D, with D = diag( α_k^{−1} (1 − α_k^2)^{−1/2} ).

4. Compute the centroids M in the LDA domain.

5. Evaluate the distances in the LDA domain.

6. Translate distances into posterior probabilities t_ik with

       t_ik ∝ exp[ −( d(x_i, μ_k) − 2 log(π_k) ) / 2 ].    (8.1)

7. Update the labels using the posterior probability matrix: Y = T.

8. Go back to step 2 and iterate until the t_ik converge.

Items 2 to 5 can be interpreted as the M-step, and item 6 as the E-step, in this alternative view of the EM algorithm for Gaussian mixtures.
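A minimal sketch of this loop is given below. The function penalized_optimal_scoring is hypothetical: it stands for the GLOSS solver of step 2 and is assumed to return the coefficient matrix B_OS, the score matrix Θ and the vector α of canonical correlations; everything else follows steps 1 and 3-8.

```python
import numpy as np
from scipy.special import logsumexp
from sklearn.cluster import KMeans

def clustering_by_penalized_os(X, K, lam, penalized_optimal_scoring,
                               n_iter=50, tol=1e-3):
    n, p = X.shape
    labels = KMeans(n_clusters=K, n_init=10).fit_predict(X)     # step 1
    T = np.eye(K)[labels]                                       # hard memberships
    for _ in range(n_iter):
        B, Theta, alpha = penalized_optimal_scoring(X, T, lam)  # step 2 (hypothetical)
        D = np.diag(1.0 / (alpha * np.sqrt(1.0 - alpha ** 2)))  # step 3
        X_lda = X @ B @ D
        tk = T.sum(axis=0)
        pi = tk / n
        M = (T.T @ X_lda) / tk[:, None]                         # step 4: centroids
        d = ((X_lda[:, None, :] - M[None, :, :]) ** 2).sum(-1)  # step 5: distances
        logpost = -0.5 * d + np.log(pi)                         # step 6, eq. (8.1)
        T_new = np.exp(logpost - logsumexp(logpost, axis=1, keepdims=True))
        if np.abs(T_new - T).mean() < tol:                      # step 8: convergence
            T = T_new
            break
        T = T_new                                               # step 7
    return T.argmax(axis=1), B, T                               # MAP labels
```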

8.1.4 From Sparse Optimal Scoring to Sparse Linear Discriminant Analysis

In the previous section, we sketched a clustering algorithm that replaces the M-step with a penalized OS regression. This modified version of EM holds for any quadratic penalty. We extend this equivalence to sparsity-inducing penalties through the quadratic variational approach to the group-Lasso provided in Section 4.3. We now look for a formal equivalence between this penalty and penalized maximum likelihood for Gaussian mixtures.

8.2 Optimized Criterion

In the classical EM for Gaussian mixtures, the M-step maximizes the weighted likelihood Q(θ, θ′) (7.7) so as to maximize the likelihood L(θ) (see Section 7.1.2). Replacing the M-step by a penalized


optimal scoring problem is possible, and the link between the penalized OS problem and penalized LDA holds, but it remains to relate this penalized LDA problem to a penalized maximum likelihood criterion for the Gaussian mixture.

This penalized likelihood cannot be rigorously interpreted as a maximum a posteriori criterion, in particular because the penalty only operates on the covariance matrix Σ (there is no prior on the means and proportions of the mixture). We however believe that the Bayesian interpretation provides some insight, and we detail it in what follows.

8.2.1 A Bayesian Derivation

This section sketches a Bayesian treatment of inference limited to our present needs, where penalties are interpreted as prior distributions over the parameters of the probabilistic model to be estimated. Further details can be found in Bishop (2006, Section 2.3.6) and in Gelman et al. (2003, Section 3.6).

The model proposed in this thesis considers a classical maximum likelihood estimation for the means and a penalized common covariance matrix. This penalization can be interpreted as arising from a prior on this parameter.

The prior over the covariance matrix of a Gaussian variable is classically expressed as a Wishart distribution, since it is a conjugate prior:

    f(Σ | Λ_0, ν_0) = 1 / ( 2^{np/2} |Λ_0|^{n/2} Γ_p(n/2) ) |Σ^{−1}|^{(ν_0 − p − 1)/2} exp( −(1/2) tr( Λ_0^{−1} Σ^{−1} ) ),

where ν_0 is the number of degrees of freedom of the distribution, Λ_0 is a p × p scale matrix, and Γ_p is the multivariate gamma function defined as

    Γ_p(n/2) = π^{p(p−1)/4} Π_{j=1}^{p} Γ( n/2 + (1 − j)/2 ).

The posterior distribution can be maximized, similarly to the likelihood, through the


maximization of

    Q(θ, θ′) + log( f(Σ | Λ_0, ν_0) )
      = Σ_{k=1}^{K} t_k log π_k − ((n+1)p/2) log 2 − (n/2) log|Λ_0| − (p(p+1)/4) log(π)
        − Σ_{j=1}^{p} log Γ( n/2 + (1 − j)/2 ) − ((ν_n − p − 1)/2) log|Σ| − (1/2) tr( Λ_n^{−1} Σ^{−1} )
      ≡ Σ_{k=1}^{K} t_k log π_k − (n/2) log|Λ_0| − ((ν_n − p − 1)/2) log|Σ| − (1/2) tr( Λ_n^{−1} Σ^{−1} ),    (8.2)

    with  t_k = Σ_{i=1}^{n} t_ik,
          ν_n = ν_0 + n,
          Λ_n^{−1} = Λ_0^{−1} + S_0,
          S_0 = Σ_{i=1}^{n} Σ_{k=1}^{K} t_ik (x_i − μ_k)(x_i − μ_k)^T.

Details of these calculations can be found in textbooks (for example, Bishop, 2006; Gelman et al., 2003).

8.2.2 Maximum a Posteriori Estimator

The maximization of (8.2) with respect to μ_k and π_k is of course not affected by the additional prior term, where only the covariance Σ intervenes. The MAP estimator for Σ is simply obtained by deriving (8.2) with respect to Σ; the details of the calculations follow the same lines as the ones for maximum likelihood detailed in Appendix G. The resulting estimator for Σ is

    Σ_MAP = ( Λ_0^{−1} + S_0 ) / ( ν_0 + n − p − 1 ),    (8.3)

where S_0 is the matrix defined in Equation (8.2). The maximum a posteriori estimator of the within-class covariance matrix (8.3) can thus be identified with the penalized within-class variance (4.19) resulting from the p-OS regression (4.16a) if ν_0 is chosen to be p + 1 and Λ_0^{−1} = λΩ, where Ω is the penalty matrix of the group-Lasso regularization (4.25).
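With these choices, (8.3) reduces to (λΩ + S_0)/n, as the short sketch below makes explicit.

```python
import numpy as np

def sigma_map(S0, Omega, lam, n):
    """MAP covariance (8.3) with nu_0 = p + 1 and Lambda_0^{-1} = lam * Omega."""
    p = S0.shape[0]
    nu0 = p + 1
    return (lam * Omega + S0) / (nu0 + n - p - 1)   # denominator simplifies to n
```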


9 Mix-GLOSS Algorithm

Mix-GLOSS is an algorithm for unsupervised classification that embeds feature selection, resulting in parsimonious decision rules. It is based on the GLOSS algorithm developed in Chapter 5, which has been adapted for clustering. In this chapter I describe the details of the implementation of Mix-GLOSS and of the model selection mechanism.

9.1 Mix-GLOSS

The implementation of Mix-GLOSS involves three nested loops, as sketched in Figure 9.1. The inner one is an EM algorithm that, for a given value of the regularization parameter λ, iterates between an M-step, where the parameters of the model are estimated, and an E-step, where the corresponding posterior probabilities are computed. The main outputs of the EM are the coefficient matrix B, which projects the input data X onto the best subspace (in Fisher's sense), and the posteriors t_ik.

When several values of the penalty parameter are tested, we give them to the algorithm in ascending order, and the algorithm is initialized with the solution found for the previous λ value. This process continues until all the penalty parameter values have been tested, if a vector of penalty parameters was provided, or until a given sparsity is achieved, as measured by the number of variables estimated to be relevant.

The outer loop implements complete repetitions of the clustering algorithm for all the penalty parameter values, with the purpose of choosing the best execution. This loop alleviates local minima issues by resorting to multiple initializations of the partition.

9.1.1 Outer Loop: Whole Algorithm Repetitions

This loop performs a user-defined number of repetitions of the clustering algorithm. It takes as inputs:

• the centered n × p feature matrix X;

• the vector of penalty parameter values to be tried (an option is to provide an empty vector and let the algorithm set trial values automatically);

• the number of clusters K;

• the maximum number of iterations for the EM algorithm;

• the convergence tolerance for the EM algorithm;

• the number of whole repetitions of the clustering algorithm;


Figure 9.1: Mix-GLOSS loops scheme.

• a p × (K − 1) initial coefficient matrix (optional);

• an n × K initial posterior probability matrix (optional).

For each repetition of the algorithm, an initial label matrix Y is needed; this matrix may contain either hard or soft assignments. If no such matrix is available, K-means is used to initialize the process. If we have an initial guess for the coefficient matrix B, it can also be fed into Mix-GLOSS to warm-start the process.

9.1.2 Penalty Parameter Loop

The penalty parameter loop goes through all the values of the input vector λ. These values are sorted in ascending order, such that the resulting B and Y matrices can be used to warm-start the EM loop for the next value of the penalty parameter. If some λ value results in a null coefficient matrix, the algorithm halts. We have verified that the implemented warm-start reduces the computation time by a factor of 8 with respect to using a null B matrix and a K-means execution for the initial Y label matrix.

Mix-GLOSS may be fed with an empty vector of penalty parameters, in which case a first non-penalized execution of Mix-GLOSS is done, and its resulting coefficient matrix B and posterior matrix Y are used to estimate a trial value of λ that should remove about 10% of the relevant features. This estimation is repeated until a minimum number of relevant variables is reached. The parameter that controls the estimated percentage


of variables to be removed with the next penalty parameter can be modified to make the feature selection more or less aggressive.

Algorithm 2 details the implementation of the automatic selection of the penalty parameter. If the alternative variational approach from Appendix D is used, Equation (4.32b) has to be replaced by (D.10b).

Algorithm 2: Automatic selection of λ

  Input: X, K, λ = ∅, minVAR
  Initialize:
    B ← 0
    Y ← K-means(X, K)
    Run non-penalized Mix-GLOSS:
      λ ← 0
      (B, Y) ← Mix-GLOSS(X, K, B, Y, λ)
    lastLAMBDA ← false
  repeat
    Estimate λ:
      Compute the gradient at β_j = 0:
        ∂J(B)/∂β_j |_{β_j = 0} = x_j^T ( Σ_{m ≠ j} x_m β_m − YΘ )
      Compute λ_max for every feature using (4.32b):
        λ_max,j = (1/w_j) ‖ ∂J(B)/∂β_j |_{β_j = 0} ‖_2
      Choose λ so as to remove 10% of the relevant features
    Run penalized Mix-GLOSS:
      (B, Y) ← Mix-GLOSS(X, K, B, Y, λ)
    if the number of relevant variables in B > minVAR then
      lastLAMBDA ← false
    else
      lastLAMBDA ← true
    end if
  until lastLAMBDA
  Output: B, L(θ), t_ik, π_k, μ_k, Σ, Y for every λ in the solution path

9.1.3 Inner Loop: EM Algorithm

The inner loop implements the actual clustering algorithm by means of successive maximizations of a penalized likelihood criterion. Once convergence of the posterior probabilities t_ik is achieved, the maximum a posteriori rule is applied to classify all examples. Algorithm 3 describes this inner loop.


Algorithm 3: Mix-GLOSS for one value of λ

  Input: X, K, B0, Y0, λ
  Initialize:
    if (B0, Y0) available then
      B_OS ← B0, Y ← Y0
    else
      B_OS ← 0, Y ← K-means(X, K)
    end if
    convergenceEM ← false, tolEM ← 1e-3
  repeat
    M-step:
      (B_OS, Θ, α) ← GLOSS(X, Y, B_OS, λ)
      X_LDA = X B_OS diag( α^{−1} (1 − α^2)^{−1/2} )
      π_k, μ_k and Σ as per (7.10), (7.11) and (7.12)
    E-step:
      t_ik as per (8.1)
      L(θ) as per (8.2)
    if (1/n) Σ_i |t_ik − y_ik| < tolEM then
      convergenceEM ← true
    end if
    Y ← T
  until convergenceEM
  Y ← MAP(T)
  Output: B_OS, Θ, L(θ), t_ik, π_k, μ_k, Σ, Y


M-Step

The M-step deals with the estimation of the model parameters, that is, the cluster means μ_k, the common covariance matrix Σ and the prior π_k of every component. In a classical M-step, this is done explicitly by maximizing the likelihood expression; here, this maximization is implicitly performed by penalized optimal scoring (see Section 8.1). The core of this step is a GLOSS execution that regresses X on the scaled version YΘ of the label matrix. For the first iteration of EM, if no initialization is available, Y results from a K-means execution; in subsequent iterations, Y is updated as the posterior probability matrix T resulting from the E-step.

E-Step

The E-step evaluates the posterior probability matrix T using

    t_ik ∝ exp[ −( d(x_i, μ_k) − 2 log(π_k) ) / 2 ].

The convergence of these t_ik is used as the stopping criterion for EM.

9.2 Model Selection

Here, model selection refers to the choice of the penalty parameter; up to now, we have not conducted experiments where the number of clusters has to be automatically selected.

In a first attempt, we tried a classical structure where clustering was performed several times, from different initializations, for all penalty parameter values. Then, using the log-likelihood criterion, the best repetition for every value of the penalty parameter was chosen, and the definitive λ was selected by means of the stability criterion described by Lange et al. (2002). This algorithm took a lot of computing resources, since the stability selection mechanism required a number of repetitions that transformed Mix-GLOSS into a lengthy structure of four nested loops.

In a second attempt, we replaced the stability-based model selection algorithm by the evaluation of a modified version of BIC (Pan and Shen, 2007). This version of BIC looks like the traditional one (Schwarz, 1978) but takes into consideration the variables that have been removed. This mechanism, even if it turned out to be faster, also required a large computation time.
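A minimal sketch of this criterion is given below, following the idea of Pan and Shen (2007) of counting only the mean coefficients that were not zeroed out; the exact bookkeeping of the covariance and proportion parameters is an assumption of this illustration.

```python
import numpy as np

def modified_bic(loglik, mu, n, K, p):
    """-2 log-likelihood + log(n) * effective number of parameters."""
    n_zero = int(np.sum(mu == 0))                        # zeroed mean coefficients
    d = (K - 1) + (K * p - n_zero) + p * (p + 1) // 2    # proportions + means + covariance
    return -2.0 * loglik + np.log(n) * d
```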

The third and definitive attempt (up to now) proceeds with several executions of Mix-GLOSS for the non-penalized case (λ = 0), and the execution with the best log-likelihood is chosen; the repetitions are only performed for the non-penalized problem. The coefficient matrix B and the posterior matrix T resulting from the best non-penalized execution are used to warm-start a new Mix-GLOSS execution. This second execution is done using the values of the penalty parameter provided by the user or computed by the automatic selection mechanism; this time, only one repetition of the algorithm is done for every value of the penalty parameter.


[Figure 9.2 depicts the model selection procedure: an initial Mix-GLOSS with λ = 0 is repeated 20 times on the inputs (X, K, λ, EMITER_MAX, REP_Mix-GLOSS); the B and T from the best repetition are used as StartB and StartT to warm-start Mix-GLOSS(λ, StartB, StartT) for each λ; the modified BIC is computed and λ is chosen as its minimizer, returning the partition, t_ik, π_k, λ_BEST, B, Θ, D, L(θ) and the active set.]

Figure 9.2: Mix-GLOSS model selection diagram.

This version has been tested with no significant differences in the quality of the clustering, while dramatically reducing the computation time. The diagram in Figure 9.2 summarizes the mechanism that implements the model selection of the penalty parameter λ.


10 Experimental Results

The performance of Mix-GLOSS is measured here with the artificial dataset that was used in Chapter 6.

This synthetic database is interesting because it covers four different situations where feature selection can be applied. Basically, it considers four setups with 1200 examples equally distributed between classes. It is a small-sample regime, with p = 500 variables, out of which 100 differ between classes. Independent variables are generated for all simulations except for Simulation 2, where they are slightly correlated. In Simulations 2 and 3, classes are optimally separated by a single projection of the original variables, while the two other scenarios require three discriminant directions. The Bayes' error was estimated to be respectively 1.7%, 6.7%, 7.3% and 30.0%. The exact description of every setup has already been given in Section 6.3.

In our tests we have reduced the volume of the problem, because with the original size of 1200 samples and 500 dimensions some of the algorithms to be tested took several days (even weeks) to finish. Hence, the definitive database was chosen to approximately maintain the Bayes' error of the original one, but with five times fewer examples and dimensions (n = 240, p = 100). Figure 10.1 has been adapted from Witten and Tibshirani (2011) to the dimensionality of our experiments and allows a better understanding of the different simulations.

The simulation protocol involves 25 repetitions of each setup, generating a different dataset for each repetition. Thus, the results of the tested algorithms are reported as the average value and the standard deviation over the 25 repetitions.

10.1 Tested Clustering Algorithms

This section compares Mix-GLOSS with the following methods from the state of the art:

• CS general cov: a model-based clustering with unconstrained covariance matrices, based on the regularization of the likelihood function using L1 penalties, followed by a classical EM algorithm. Further details can be found in Zhou et al. (2009). We use the R function available on the website of Wei Pan.

• Fisher EM: this method models and clusters the data in a discriminative and low-dimensional latent subspace (Bouveyron and Brunet, 2012b,a). Feature selection is induced by means of the "sparsification" of the projection matrix (three possibilities are suggested by Bouveyron and Brunet, 2012a). The corresponding R package "FisherEM" is available from the website of Charles Bouveyron or from the Comprehensive R Archive Network website.


Figure 10.1: Class mean vectors for each artificial simulation.

• SelvarClust/Clustvarsel: implements a method of variable selection for clustering using Gaussian mixture models, as a modification of the Raftery and Dean (2006) algorithm. SelvarClust (Maugis et al., 2009b) is a software implemented in C++ that makes use of the clustering library mixmod (Biernacki et al., 2008). Further information can be found in the related paper, Maugis et al. (2009a). The software can be downloaded from the SelvarClust project homepage; there is a link to the project from Cathy Maugis's website.

After several tests, this entrant was discarded due to the amount of computing time required by the greedy selection technique, which basically involves two executions of a classical clustering algorithm (with mixmod) for every single variable whose inclusion needs to be considered.

The substitute for SelvarClust has been the algorithm that inspired it, that is, the method developed by Raftery and Dean (2006). There is an R package named Clustvarsel that can be downloaded from the website of Nema Dean or from the Comprehensive R Archive Network website.

• LumiWCluster: an R package available from the homepage of Pei Fen Kuan. This algorithm is inspired by Wang and Zhu (2008), who propose a penalty for the likelihood that incorporates group information through an L1,∞ mixed norm. Kuan et al. (2010) introduce some slight changes in the penalty term, such as weighting parameters, that are particularly important for their dataset. The package LumiWCluster allows clustering using the expression from Wang and Zhu (2008) (called LumiWCluster-Wang) or the one from Kuan et al. (2010) (called LumiWCluster-Kuan).

• Mix-GLOSS: this is the clustering algorithm implemented using GLOSS (see


Chapter 9). It makes use of an EM algorithm and of the equivalences between the M-step and an LDA problem, and between a p-LDA problem and a p-OS problem. It penalizes an OS regression with a variational approach of the group-Lasso penalty (see Section 8.1.4) that induces zeros in all discriminant directions for the same variable.

10.2 Results

Table 10.1 shows the results of the experiments for all the algorithms of Section 10.1. The criteria used to measure the performance are:

• Clustering Error (in percentage): to measure the quality of the partition with the a priori knowledge of the real classes, the clustering error is computed as explained in Wu and Scholkopf (2007). If the obtained partition and the real labeling are the same, then the clustering error is 0%. The way this measure is defined allows the ideal 0% clustering error to be reached even if the IDs of the clusters and of the real classes differ.

• Number of Disposed Features: this value shows the number of variables whose coefficients have been zeroed, and which are therefore not used in the partitioning. In our datasets, only the first 20 features are relevant for the discrimination; the last 80 variables can be discarded. Hence, a good result for the tested algorithms should be around 80.

• Time of execution (in hours, minutes or seconds): finally, the time needed to execute the 25 repetitions of each simulation setup is also measured. These algorithms tend to be more memory- and CPU-consuming as the number of variables increases; this is one of the reasons why the dimensionality of the original problem was reduced.

The adequacy of the selected features was assessed by the True Positive Rate (TPR) and the False Positive Rate (FPR): the TPR is the fraction of relevant variables that are selected, and the FPR is the fraction of irrelevant variables that are selected. The best algorithm would be the one that selects all the relevant variables and rejects all the others, that is, TPR = 1 and FPR = 0 simultaneously. In order to avoid cluttered results, we compare TPR and FPR for the four simulations but only for three algorithms: CS general cov and Clustvarsel were discarded due to their high computing time and high clustering error, respectively, and, as the two versions of LumiWCluster provide almost the same TPR and FPR, only one of them is displayed. The three remaining algorithms are Fisher EM by Bouveyron and Brunet (2012a), the version of LumiWCluster by Kuan et al. (2010), and Mix-GLOSS.
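A minimal sketch of how these measures can be computed is shown below; matching cluster IDs to class IDs with the Hungarian algorithm is an assumption of this illustration of the criterion of Wu and Scholkopf (2007), and the TPR/FPR helper simply implements the recall and fall-out defined above.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_error(y_true, y_pred):
    """Error rate after the best one-to-one matching of clusters to classes."""
    classes, clusters = np.unique(y_true), np.unique(y_pred)
    C = np.array([[np.sum((y_true == a) & (y_pred == b)) for b in clusters]
                  for a in classes])
    row, col = linear_sum_assignment(-C)      # maximize the matched counts
    return 1.0 - C[row, col].sum() / len(y_true)

def tpr_fpr(selected, relevant, p):
    """TPR = fraction of relevant variables selected, FPR = fraction of irrelevant ones."""
    selected, relevant = set(selected), set(relevant)
    tpr = len(selected & relevant) / len(relevant)
    fpr = len(selected - relevant) / (p - len(relevant))
    return tpr, fpr
```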

Results, in percentages, are displayed in Figure 10.2 and in Table 10.2.


Table 10.1: Experimental results for simulated data: averages (standard deviations) over 25 repetitions of the clustering error, the number of disposed features and the execution time.

                            Err (%)         Var             Time
  Sim 1: K = 4, mean shift, independent features
    CS general cov          4.6 (1.5)       98.5 (7.2)      88.4 h
    Fisher EM               5.8 (8.7)       78.4 (5.2)      164.5 m
    Clustvarsel             60.2 (10.7)     37.8 (29.1)     38.3 h
    LumiWCluster-Kuan       4.2 (6.8)       77.9 (4.0)      38.9 s
    LumiWCluster-Wang       4.3 (6.9)       78.4 (3.9)      61.9 s
    Mix-GLOSS               3.2 (1.6)       80.0 (0.9)      1.5 h
  Sim 2: K = 2, mean shift, dependent features
    CS general cov          15.4 (2.0)      99.7 (0.9)      78.3 h
    Fisher EM               7.4 (2.3)       80.9 (2.8)      8 m
    Clustvarsel             7.3 (2.0)       33.4 (20.7)     16.6 h
    LumiWCluster-Kuan       6.4 (1.8)       79.8 (0.4)      15.5 s
    LumiWCluster-Wang       6.3 (1.7)       79.9 (0.3)      14 s
    Mix-GLOSS               7.7 (2.0)       84.1 (3.4)      2 h
  Sim 3: K = 4, 1D mean shift, independent features
    CS general cov          30.4 (5.7)      55.0 (46.8)     131.7 h
    Fisher EM               23.3 (6.5)      36.6 (5.5)      22 m
    Clustvarsel             65.8 (11.5)     23.2 (29.1)     54.2 h
    LumiWCluster-Kuan       32.3 (2.1)      80.0 (0.2)      83 s
    LumiWCluster-Wang       30.8 (3.6)      80.0 (0.2)      129.2 s
    Mix-GLOSS               34.7 (9.2)      81.0 (8.8)      2.1 h
  Sim 4: K = 4, mean shift, independent features
    CS general cov          62.6 (5.5)      99.9 (0.2)      11.2 h
    Fisher EM               56.7 (10.4)     55.0 (4.8)      19.5 m
    Clustvarsel             73.2 (4.0)      24.0 (1.2)      76.7 h
    LumiWCluster-Kuan       69.2 (11.2)     99.0 (2.0)      87.6 s
    LumiWCluster-Wang       69.7 (11.9)     99.1 (2.1)      82.5 s
    Mix-GLOSS               66.9 (9.1)      97.5 (1.2)      1.1 h

Table 10.2: TPR versus FPR (in %), averages computed over 25 repetitions, for the best performing algorithms.

                 Simulation 1      Simulation 2      Simulation 3      Simulation 4
                 TPR     FPR       TPR     FPR       TPR     FPR       TPR     FPR
  MIX-GLOSS      99.2    0.15      82.8    3.35      88.4    6.7       78.0    1.2
  LUMI-KUAN      99.2    2.8       100.0   0.2       100.0   0.05      50.0    0.05
  FISHER-EM      98.6    2.4       88.8    1.7       83.8    58.25     62.0    40.75


[Figure 10.2 is a scatter plot of TPR versus FPR (in %), with one marker per algorithm (MIX-GLOSS, LUMI-KUAN, FISHER-EM) and per simulation (Simulations 1 to 4).]

Figure 10.2: TPR versus FPR (in %) for the best performing algorithms and simulations.

10.3 Discussion

After reviewing Tables 10.1 and 10.2 and Figure 10.2, we see that there is no definitive winner in all situations regarding all criteria. According to the objectives and constraints of the problem, the following observations deserve to be highlighted.

LumiWCluster (Wang and Zhu, 2008; Kuan et al., 2010) is by far the fastest method, with good behavior regarding the other performance measures. At the other end of this criterion, CS general cov is extremely slow, and Clustvarsel, though twice as fast, also takes very long to produce an output. Of course, the speed criterion does not say much by itself: the implementations use different programming languages and different stopping criteria, and we do not know what effort has been spent on each implementation. That being said, the slowest algorithms are not the most precise ones, so their long computation time is worth mentioning here.

The quality of the partition varies depending on the simulation and the algorithm: Mix-GLOSS has a small edge in Simulation 1, LumiWCluster (Zhou et al., 2009) performs better in Simulation 2, while Fisher EM (Bouveyron and Brunet, 2012a) does slightly better in Simulations 3 and 4.

From the feature selection point of view, LumiWCluster (Kuan et al., 2010) and Mix-GLOSS succeed in removing the irrelevant variables in all situations, and Fisher EM (Bouveyron and Brunet, 2012a) and Mix-GLOSS discover the relevant ones. Mix-GLOSS consistently performs best, or close to the best solution, in terms of fall-out and recall.



Conclusions

Summary

The linear regression of scaled indicator matrices, or optimal scoring, is a versatile technique with applicability in many fields of the machine learning domain. By means of regularization, an optimal scoring regression can be strengthened to be more robust, avoid overfitting, counteract ill-posed problems or remove correlated or noisy variables.

In this thesis we have demonstrated the utility of penalized optimal scoring in the fields of multi-class linear discrimination and clustering.

The equivalence between LDA and OS problems allows all the resources available for solving regression problems to be brought to bear on linear discrimination. In their penalized versions, this equivalence holds under certain conditions that have not always been obeyed when OS has been used to solve LDA problems.

In Part II, we have used a variational approach of the group-Lasso penalty to preserve this equivalence, granting the use of penalized optimal scoring regressions for the solution of linear discrimination problems. This theory has been verified with the implementation of our Group-Lasso Optimal Scoring Solver (GLOSS) algorithm, which has proved its effectiveness, inducing extremely parsimonious models without renouncing any predictive capabilities. GLOSS has been tested on four artificial and three real datasets, outperforming other algorithms at the state of the art in almost all situations.

In Part III, this theory has been adapted, by means of an EM algorithm, to the unsupervised domain. As in the supervised case, the theory must guarantee the equivalence between penalized LDA and penalized OS. The difficulty of this method resides in the computation of the criterion that is maximized at every iteration of the EM loop, which is typically used to detect the convergence of the algorithm and to implement the model selection of the penalty parameter. Also in this case, the theory has been put into practice with the implementation of Mix-GLOSS. By now, due to time constraints, only artificial datasets have been tested, with positive results.

Perspectives

Even if the preliminary results are encouraging, Mix-GLOSS has not been sufficiently tested. We plan to test it at least on the same real datasets that we used with GLOSS. However, more testing would be recommended in both cases. Those algorithms are well suited for genomic data, where the number of samples is smaller than the number of variables, but other high-dimensional low-sample-size (HDLSS) domains are also possible: identification of male or female silhouettes, fungal species or fish species based on shape and texture (Clemmensen et al., 2011), or the Stirling faces (Roth and Lange, 2004) are only some examples. Moreover, we are not constrained to the HDLSS domain: the USPS handwritten digits database (Roth and Lange, 2004), or the well-known Fisher's Iris dataset and six other UCI datasets (Bouveyron and Brunet, 2012a), have also been tested in the literature.

At the programming level, both codes must be revisited to improve their robustness and optimize their computation, because during the prototyping phase the priority was achieving functional code. An old version of GLOSS, numerically more stable but less efficient, has been made available to the public. A better suited and documented version of GLOSS and Mix-GLOSS should be made available in the short term.

The theory developed in this thesis and the programming structure used for its implementation allow easy alterations of the algorithm by modifying the within-class covariance matrix. Diagonal versions of the model can be obtained by discarding all the elements but the diagonal of the covariance matrix. Spherical models could also be implemented easily. Prior information concerning the correlation between features can be included by adding a quadratic penalty term, such as the Laplacian that describes the relationships between variables; this can be used to implement pairwise penalties when the dataset is formed by pixels (see the sketch below). Quadratic penalty matrices can also be added to the within-class covariance to implement Elastic-net-like penalties. Some of those possibilities have been partially implemented, such as the diagonal version of GLOSS; however, they have not been properly tested or even updated with the last algorithmic modifications. Their equivalents for the unsupervised domain have not yet been proposed, due to the time constraints for the publication of this thesis.
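To make the structured-penalty idea above concrete, here is a minimal sketch (our own illustration, not code from the GLOSS implementation; the function name grid_laplacian and all variables are ours) that builds the graph Laplacian of a 3×3 pixel grid with 4-neighbour connectivity; such a matrix could play the role of the quadratic penalty describing relationships between pixel variables.

    import numpy as np

    def grid_laplacian(height, width):
        # Graph Laplacian L = D - A of a height x width pixel grid with
        # 4-neighbour connectivity; beta' L beta sums squared differences
        # between coefficients of neighbouring pixels.
        n = height * width
        A = np.zeros((n, n))                      # adjacency matrix
        idx = lambda r, c: r * width + c          # pixel (r, c) -> variable index
        for r in range(height):
            for c in range(width):
                if c + 1 < width:                 # horizontal neighbour
                    A[idx(r, c), idx(r, c + 1)] = A[idx(r, c + 1), idx(r, c)] = 1
                if r + 1 < height:                # vertical neighbour
                    A[idx(r, c), idx(r + 1, c)] = A[idx(r + 1, c), idx(r, c)] = 1
        D = np.diag(A.sum(axis=1))                # degree matrix
        return D - A

    Omega = grid_laplacian(3, 3)                  # 9 x 9 penalty matrix for a 3x3 image
    beta = np.random.randn(9)
    roughness = beta @ Omega @ beta               # quadratic pairwise penalty term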

From the point of view of the supporting theory, we did not succeed in finding the exact criterion that is maximized in Mix-GLOSS. We believe it must be a kind of penalized, or even hyper-penalized, likelihood, but we decided to prioritize the experimental results due to the time constraints. Not knowing this criterion does not prevent successful simulations of Mix-GLOSS: other mechanisms that do not involve the computation of the real criterion have been used for stopping the EM algorithm and for model selection. However, further investigations must be done in this direction to assess the convergence properties of this algorithm.

At the beginning of this thesis, even if the work finally took the direction of feature selection, a big effort was devoted to the domains of outlier detection and block clustering. One of the most successful mechanisms for detecting outliers models the population with a mixture model in which the outliers are described by a uniform distribution. This technique does not need any prior knowledge about the number or the percentage of outliers. As the base model of this thesis is a mixture of Gaussians, our impression is that it should not be difficult to introduce a new uniform component to gather all those points that do not fit the Gaussian mixture. On the other hand, the application of penalized optimal scoring to block clustering looks more complex; but as block clustering is typically defined as a mixture model whose parameters are estimated by means of an EM algorithm, it could be possible to re-interpret that estimation using a penalized optimal scoring regression.


Appendix


A Matrix Properties

Property 1. By definition, Σ_W and Σ_B are both symmetric matrices:
\[
\Sigma_W = \frac{1}{n}\sum_{k=1}^{g}\sum_{i\in C_k}(x_i-\mu_k)(x_i-\mu_k)^\top ,
\qquad
\Sigma_B = \frac{1}{n}\sum_{k=1}^{g} n_k(\mu_k-\bar{x})(\mu_k-\bar{x})^\top .
\]

Property 2. $\dfrac{\partial\, x^\top a}{\partial x} = \dfrac{\partial\, a^\top x}{\partial x} = a$

Property 3. $\dfrac{\partial\, x^\top A x}{\partial x} = (A + A^\top)\,x$

Property 4. $\dfrac{\partial\, |X^{-1}|}{\partial X} = -|X^{-1}|\,(X^{-1})^\top$

Property 5. $\dfrac{\partial\, a^\top X b}{\partial X} = a\,b^\top$

Property 6. $\dfrac{\partial}{\partial X}\,\mathrm{tr}\!\left(AX^{-1}B\right) = -\left(X^{-1}BAX^{-1}\right)^\top = -X^{-\top}A^\top B^\top X^{-\top}$
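As a quick numerical illustration of Property 1 (this sketch is ours, with arbitrary simulated data), the within-class and between-class covariance matrices can be computed directly from their definitions; both are symmetric, and they sum to the total covariance matrix.

    import numpy as np

    rng = np.random.default_rng(0)
    n, p, g = 90, 4, 3
    X = rng.normal(size=(n, p))
    y = rng.integers(g, size=n)                             # class labels 0..g-1

    xbar = X.mean(axis=0)
    Sigma_W = np.zeros((p, p))
    Sigma_B = np.zeros((p, p))
    for k in range(g):
        Xk = X[y == k]
        mu_k = Xk.mean(axis=0)
        Sigma_W += (Xk - mu_k).T @ (Xk - mu_k) / n          # within-class term
        Sigma_B += len(Xk) * np.outer(mu_k - xbar, mu_k - xbar) / n  # between-class term

    Sigma_T = (X - xbar).T @ (X - xbar) / n                 # total covariance

    assert np.allclose(Sigma_W, Sigma_W.T) and np.allclose(Sigma_B, Sigma_B.T)
    assert np.allclose(Sigma_T, Sigma_W + Sigma_B)          # classical decomposition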


B The Penalized-OS Problem is an Eigenvector Problem

In this appendix we answer the question why the solution of a penalized optimal scoring regression involves the computation of an eigenvector decomposition. The p-OS problem has this form:
\[
\begin{aligned}
\min_{\theta_k,\,\beta_k}\ & \|Y\theta_k - X\beta_k\|_2^2 + \beta_k^\top\Omega_k\beta_k \\
\text{s.t. }\ & \theta_k^\top Y^\top Y\theta_k = 1 , \\
& \theta_\ell^\top Y^\top Y\theta_k = 0 \quad \forall\,\ell<k ,
\end{aligned}
\tag{B.1}
\]
for k = 1, ..., K-1. The Lagrangian associated with Problem (B.1) is

\[
L_k(\theta_k,\beta_k,\lambda_k,\nu_k) =
\|Y\theta_k - X\beta_k\|_2^2 + \beta_k^\top\Omega_k\beta_k
+ \lambda_k\bigl(\theta_k^\top Y^\top Y\theta_k - 1\bigr)
+ \sum_{\ell<k}\nu_\ell\,\theta_\ell^\top Y^\top Y\theta_k .
\tag{B.2}
\]

Setting the gradient of (B.2) with respect to β_k to zero gives the value of the optimal β_k^⋆:
\[
\beta_k^\star = (X^\top X + \Omega_k)^{-1}X^\top Y\theta_k .
\tag{B.3}
\]

The objective function of (B.1) evaluated at β_k^⋆ is
\[
\begin{aligned}
\min_{\theta_k}\ \|Y\theta_k - X\beta_k^\star\|_2^2 + \beta_k^{\star\top}\Omega_k\beta_k^\star
&= \min_{\theta_k}\ \theta_k^\top Y^\top\bigl(I - X(X^\top X+\Omega_k)^{-1}X^\top\bigr)Y\theta_k \\
&= \max_{\theta_k}\ \theta_k^\top Y^\top X(X^\top X+\Omega_k)^{-1}X^\top Y\theta_k .
\end{aligned}
\tag{B.4}
\]

If the penalty matrix Ω_k is identical for all problems, Ω_k = Ω, then (B.4) corresponds to an eigen-problem where the k score vectors θ_k are then the eigenvectors of Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Y.

B.1 How to Solve the Eigenvector Decomposition

Making an eigen-decomposition of an expression like Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Y is not trivial due to the p × p inverse. With some datasets, p can be extremely large, making this inverse intractable. In this section we show how to circumvent this issue by solving an easier eigenvector decomposition.


Let M be the matrix Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Y, so that we can rewrite expression (B.4) in a compact way:
\[
\begin{aligned}
\max_{\Theta\in\mathbb{R}^{K\times(K-1)}}\ & \mathrm{tr}\bigl(\Theta^\top M\Theta\bigr) \\
\text{s.t. }\ & \Theta^\top Y^\top Y\Theta = I_{K-1} .
\end{aligned}
\tag{B.5}
\]

If (B.5) is an eigenvector problem, it can be reformulated in the traditional way. Let the (K-1)×(K-1) matrix M_Θ be Θ^⊤MΘ. Hence the classical eigenvector formulation associated with (B.5) is
\[
M_\Theta v = \lambda v ,
\tag{B.6}
\]
where v is an eigenvector and λ the associated eigenvalue of M_Θ. Operating,
\[
v^\top M_\Theta v = \lambda \iff v^\top\Theta^\top M\Theta v = \lambda .
\]
Making the change of variable w = Θv, we obtain an alternative eigenproblem where w is an eigenvector of M and λ the associated eigenvalue:
\[
w^\top M w = \lambda .
\tag{B.7}
\]

Therefore, v are the eigenvectors of the eigen-decomposition of matrix M_Θ, and w are the eigenvectors of the eigen-decomposition of matrix M. Note that the only difference between the (K-1)×(K-1) matrix M_Θ and the K×K matrix M is the K×(K-1) matrix Θ in the expression M_Θ = Θ^⊤MΘ. Then, to avoid the computation of the p×p inverse (X^⊤X+Ω)^{-1}, we can use the optimal value of the coefficient matrix B^⋆ = (X^⊤X+Ω)^{-1}X^⊤YΘ in M_Θ:
\[
\begin{aligned}
M_\Theta &= \Theta^\top Y^\top X(X^\top X + \Omega)^{-1}X^\top Y\Theta \\
&= \Theta^\top Y^\top X B^\star .
\end{aligned}
\]
Thus, the eigen-decomposition of the (K-1)×(K-1) matrix M_Θ = Θ^⊤Y^⊤XB^⋆ results in the v eigenvectors of (B.6). To obtain the w eigenvectors of the alternative formulation (B.7), the change of variable w = Θv needs to be undone.

To summarize, we compute the v eigenvectors as the eigen-decomposition of a tractable M_Θ matrix, evaluated as Θ^⊤Y^⊤XB^⋆. Then the definitive eigenvectors w are recovered by setting w = Θv. The final step is the reconstruction of the optimal score matrix Θ using the vectors w as its columns. At this point we understand what is called in the literature "updating the initial score matrix": multiplying the initial Θ by the eigenvector matrix V from decomposition (B.6) reverses the change of variable to restore the w vectors. The B^⋆ matrix also needs to be "updated" by multiplying B^⋆ by the same matrix of eigenvectors V, in order to account for the initial Θ matrix used in the first computation of B^⋆:
\[
B^\star \;=\; (X^\top X + \Omega)^{-1}X^\top Y\Theta V \;=\; B^\star V .
\]
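The procedure can be sketched in a few lines of NumPy (this is our own illustrative code, not the thesis implementation; the function name and the toy initialization are ours). It solves the small (K-1)×(K-1) eigenproblem instead of the p×p one, then "updates" Θ and B^⋆ by the matrix of eigenvectors V.

    import numpy as np

    def penalized_os_scores(X, Y, Theta0, Omega):
        # X: n x p data, Y: n x K indicators, Theta0: K x (K-1) initial scores,
        # Omega: p x p penalty matrix. Returns updated scores Theta and coefficients B.
        B = np.linalg.solve(X.T @ X + Omega, X.T @ (Y @ Theta0))   # B* = (X'X+Omega)^{-1} X'Y Theta0
        M_Theta = Theta0.T @ (Y.T @ (X @ B))                       # small (K-1) x (K-1) matrix
        eigval, V = np.linalg.eigh((M_Theta + M_Theta.T) / 2)      # symmetric eigen-decomposition
        V = V[:, np.argsort(eigval)[::-1]]                         # sort by decreasing eigenvalue
        return Theta0 @ V, B @ V                                   # undo the change of variable

    # toy usage; a proper initialization should satisfy Theta0' Y'Y Theta0 = I
    rng = np.random.default_rng(1)
    n, p, K = 60, 10, 3
    X = rng.normal(size=(n, p))
    Y = np.eye(K)[rng.integers(K, size=n)]
    Theta0 = np.linalg.qr(rng.normal(size=(K, K - 1)))[0]
    Theta, B = penalized_os_scores(X, Y, Theta0, np.eye(p))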


B.2 Why the OS Problem is Solved as an Eigenvector Problem

In the optimal scoring literature, the score matrix Θ that optimizes Problem (B.1) is obtained by means of an eigenvector decomposition of the matrix M = Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Y.

By definition of the eigen-decomposition, the eigenvectors of the M matrix (called w in (B.7)) form a basis, so that any score vector θ can be expressed as a linear combination of them:
\[
\theta_k = \sum_{m=1}^{K-1}\alpha_m w_m \quad\text{s.t.}\quad \theta_k^\top\theta_k = 1 .
\tag{B.8}
\]
The score vectors' orthogonality constraint θ_k^⊤θ_k = 1 can also be expressed as a function of this basis,
\[
\Bigl(\sum_{m=1}^{K-1}\alpha_m w_m\Bigr)^{\!\top}\Bigl(\sum_{m=1}^{K-1}\alpha_m w_m\Bigr) = 1 ,
\]
which, as per the eigenvector properties, can be reduced to
\[
\sum_{m=1}^{K-1}\alpha_m^2 = 1 .
\tag{B.9}
\]
Let M be multiplied by a score vector θ_k, which can be replaced by its linear combination of eigenvectors w_m (B.8):
\[
M\theta_k = M\sum_{m=1}^{K-1}\alpha_m w_m = \sum_{m=1}^{K-1}\alpha_m M w_m .
\]
As the w_m are eigenvectors of the M matrix, the relationship M w_m = λ_m w_m can be used to obtain
\[
M\theta_k = \sum_{m=1}^{K-1}\alpha_m\lambda_m w_m .
\]
Multiplying on the left by θ_k^⊤, written as its corresponding linear combination of eigenvectors,
\[
\theta_k^\top M\theta_k = \Bigl(\sum_{\ell=1}^{K-1}\alpha_\ell w_\ell\Bigr)^{\!\top}\Bigl(\sum_{m=1}^{K-1}\alpha_m\lambda_m w_m\Bigr) .
\]
This equation can be simplified using the orthogonality property of eigenvectors, according to which w_\ell^⊤ w_m is zero for any ℓ ≠ m, giving
\[
\theta_k^\top M\theta_k = \sum_{m=1}^{K-1}\alpha_m^2\lambda_m .
\]


The optimization Problem (B.5) for discriminant direction k can be rewritten as
\[
\max_{\theta_k\in\mathbb{R}^{K\times 1}}\ \theta_k^\top M\theta_k
\;=\;
\max_{\theta_k\in\mathbb{R}^{K\times 1}}\ \sum_{m=1}^{K-1}\alpha_m^2\lambda_m ,
\tag{B.10}
\]
with
\[
\theta_k = \sum_{m=1}^{K-1}\alpha_m w_m
\qquad\text{and}\qquad
\sum_{m=1}^{K-1}\alpha_m^2 = 1 .
\]
One way of maximizing Problem (B.10) is choosing α_m = 1 for m = k and α_m = 0 otherwise. Hence, as θ_k = Σ_{m=1}^{K-1} α_m w_m, the resulting score vector θ_k will be equal to the kth eigenvector w_k.

As a summary, it can be concluded that the solution to the original problem (B.1) can be achieved by an eigenvector decomposition of the matrix M = Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Y.


C Solving Fisher's Discriminant Problem

The classical Fisher's discriminant problem seeks a projection that best separates the class centers while every class remains compact. This is formalized as looking for a projection such that the projected data has maximal between-class variance under a unitary constraint on the within-class variance:
\[
\max_{\beta\in\mathbb{R}^p}\ \beta^\top\Sigma_B\beta
\tag{C.1a}
\]
\[
\text{s.t. }\ \beta^\top\Sigma_W\beta = 1 ,
\tag{C.1b}
\]
where Σ_B and Σ_W are respectively the between-class variance and the within-class variance of the original p-dimensional data.

The Lagrangian of Problem (C.1) is
\[
L(\beta,\nu) = \beta^\top\Sigma_B\beta - \nu\bigl(\beta^\top\Sigma_W\beta - 1\bigr) ,
\]
so that its first derivative with respect to β is
\[
\frac{\partial L(\beta,\nu)}{\partial\beta} = 2\Sigma_B\beta - 2\nu\Sigma_W\beta .
\]
A necessary optimality condition for β^⋆ is that this derivative is zero, that is,
\[
\Sigma_B\beta^\star = \nu\Sigma_W\beta^\star .
\]
Provided Σ_W is full rank, we have
\[
\Sigma_W^{-1}\Sigma_B\beta^\star = \nu\beta^\star .
\tag{C.2}
\]
Thus the solutions β^⋆ match the definition of an eigenvector of matrix Σ_W^{-1}Σ_B with eigenvalue ν. To characterize this eigenvalue, we note that the objective function (C.1a) can be expressed as follows:
\[
\begin{aligned}
\beta^{\star\top}\Sigma_B\beta^\star &= \beta^{\star\top}\Sigma_W\Sigma_W^{-1}\Sigma_B\beta^\star \\
&= \nu\,\beta^{\star\top}\Sigma_W\beta^\star && \text{from (C.2)} \\
&= \nu && \text{from (C.1b)} .
\end{aligned}
\]
That is, the optimal value of the objective function to be maximized is the eigenvalue ν. Hence ν is the largest eigenvalue of Σ_W^{-1}Σ_B, and β^⋆ is any eigenvector corresponding to this maximal eigenvalue.
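Numerically, the Fisher direction can be obtained as the leading eigenvector of Σ_W^{-1}Σ_B; the sketch below (ours, on simulated two-class data) uses SciPy's generalized symmetric eigensolver, which handles the pair (Σ_B, Σ_W) directly and normalizes the eigenvectors so that β^⊤Σ_Wβ = 1, matching constraint (C.1b).

    import numpy as np
    from scipy.linalg import eigh

    def fisher_direction(X, y):
        # Leading Fisher discriminant direction for data X (n x p) and labels y.
        n, p = X.shape
        xbar = X.mean(axis=0)
        Sigma_W = np.zeros((p, p))
        Sigma_B = np.zeros((p, p))
        for k in np.unique(y):
            Xk = X[y == k]
            mu_k = Xk.mean(axis=0)
            Sigma_W += (Xk - mu_k).T @ (Xk - mu_k) / n
            Sigma_B += len(Xk) * np.outer(mu_k - xbar, mu_k - xbar) / n
        # Generalized eigenproblem Sigma_B beta = nu Sigma_W beta (ascending eigenvalues)
        eigvals, eigvecs = eigh(Sigma_B, Sigma_W)
        return eigvecs[:, -1], eigvals[-1]        # eigenvector of the largest eigenvalue

    rng = np.random.default_rng(2)
    X = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(2, 1, (50, 3))])
    y = np.repeat([0, 1], 50)
    beta, nu = fisher_direction(X, y)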


D Alternative Variational Formulation for the Group-Lasso

In this appendix, an alternative to the variational form of the group-Lasso (4.21) presented in Section 4.3.1 is proposed:
\[
\begin{aligned}
\min_{\tau\in\mathbb{R}^p}\ \min_{B\in\mathbb{R}^{p\times(K-1)}}\ & J(B) + \lambda\sum_{j=1}^{p}\frac{w_j^2\|\beta^j\|_2^2}{\tau_j} && \text{(D.1a)} \\
\text{s.t. }\ & \sum_{j=1}^{p}\tau_j = 1 , && \text{(D.1b)} \\
& \tau_j \ge 0 ,\quad j = 1,\dots,p . && \text{(D.1c)}
\end{aligned}
\]
Following the approach detailed in Section 4.3.1, its equivalence with the standard group-Lasso formulation is demonstrated here. Let B ∈ R^{p×(K-1)} be a matrix composed of row vectors β^j ∈ R^{K-1}, B = (β^{1⊤}, ..., β^{p⊤})^⊤.

\[
L(B,\tau,\lambda,\nu_0,\nu_j) = J(B) + \lambda\sum_{j=1}^{p}\frac{w_j^2\|\beta^j\|_2^2}{\tau_j}
+ \nu_0\Bigl(\sum_{j=1}^{p}\tau_j - 1\Bigr) - \sum_{j=1}^{p}\nu_j\tau_j .
\tag{D.2}
\]
The starting point is the Lagrangian (D.2), which is differentiated with respect to τ_j to get the optimal value τ_j^⋆:
\[
\left.\frac{\partial L(B,\tau,\lambda,\nu_0,\nu_j)}{\partial\tau_j}\right|_{\tau_j=\tau_j^\star} = 0
\;\Rightarrow\; -\lambda\frac{w_j^2\|\beta^j\|_2^2}{\tau_j^{\star 2}} + \nu_0 - \nu_j = 0
\;\Rightarrow\; -\lambda w_j^2\|\beta^j\|_2^2 + \nu_0\tau_j^{\star 2} - \nu_j\tau_j^{\star 2} = 0
\;\Rightarrow\; -\lambda w_j^2\|\beta^j\|_2^2 + \nu_0\tau_j^{\star 2} = 0 .
\]
The last two expressions are related through a property of the Lagrange multipliers which states that ν_j g_j(τ^⋆) = 0, where ν_j is the Lagrange multiplier and g_j(τ) is the inequality constraint. Then the optimal τ_j^⋆ can be deduced:
\[
\tau_j^\star = \sqrt{\frac{\lambda}{\nu_0}}\, w_j\|\beta^j\|_2 .
\]
Plugging this optimal value of τ_j^⋆ into constraint (D.1b):
\[
\sum_{j=1}^{p}\tau_j^\star = 1
\;\Rightarrow\;
\tau_j^\star = \frac{w_j\|\beta^j\|_2}{\sum_{j'=1}^{p} w_{j'}\|\beta^{j'}\|_2} .
\tag{D.3}
\]


With this value of τ_j^⋆, Problem (D.1) is equivalent to
\[
\min_{B\in\mathbb{R}^{p\times(K-1)}}\ J(B) + \lambda\Bigl(\sum_{j=1}^{p} w_j\|\beta^j\|_2\Bigr)^{\!2} .
\tag{D.4}
\]
This problem is a slight alteration of the standard group-Lasso, as the penalty is squared compared to the usual form. This square only affects the strength of the penalty, and the usual properties of the group-Lasso apply to the solution of Problem (D.4). In particular, its solution is expected to be sparse, with some null vectors β^j.
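The closed form (D.3) and the equivalence with the squared group-Lasso penalty are easy to check numerically; in the sketch below (ours, with arbitrary test values), the variational penalty evaluated at τ^⋆ coincides with (Σ_j w_j‖β^j‖_2)^2, and any other feasible τ gives a larger value, in agreement with Lemma D.4.

    import numpy as np

    rng = np.random.default_rng(3)
    p, K = 6, 4
    B = rng.normal(size=(p, K - 1))
    w = rng.uniform(0.5, 2.0, size=p)              # positive weights
    row_norms = np.linalg.norm(B, axis=1)          # ||beta^j||_2 for each row

    tau_star = w * row_norms / np.sum(w * row_norms)         # optimal tau, Eq. (D.3)
    variational = np.sum(w**2 * row_norms**2 / tau_star)     # penalty of (D.1a) at tau*
    squared_group_lasso = np.sum(w * row_norms) ** 2         # penalty of (D.4)
    assert np.isclose(variational, squared_group_lasso)

    tau_other = np.full(p, 1.0 / p)                          # any other feasible tau
    assert np.sum(w**2 * row_norms**2 / tau_other) >= squared_group_lasso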

The penalty term of (D.1a) can be conveniently presented as λB^⊤ΩB, where
\[
\Omega = \mathrm{diag}\!\left(\frac{w_1^2}{\tau_1}, \frac{w_2^2}{\tau_2}, \dots, \frac{w_p^2}{\tau_p}\right) .
\tag{D.5}
\]
Using the value of τ_j^⋆ from (D.3), each diagonal component of Ω is
\[
(\Omega)_{jj} = \frac{w_j\sum_{j'=1}^{p} w_{j'}\|\beta^{j'}\|_2}{\|\beta^j\|_2} .
\tag{D.6}
\]
In the following paragraphs, the optimality conditions and properties developed for the quadratic variational approach detailed in Section 4.3.1 are also derived for this alternative formulation.

D.1 Useful Properties

Lemma D.1. If J is convex, Problem (D.1) is convex.

In what follows, J will be a convex quadratic (hence smooth) function, in which case a necessary and sufficient optimality condition is that zero belongs to the subdifferential of the objective function, whose expression is provided in the following lemma.

Lemma D.2. For all B ∈ R^{p×(K-1)}, the subdifferential of the objective function of Problem (D.4) is
\[
\Bigl\{ V\in\mathbb{R}^{p\times(K-1)} :\;
V = \frac{\partial J(B)}{\partial B} + 2\lambda\Bigl(\sum_{j'=1}^{p} w_{j'}\|\beta^{j'}\|_2\Bigr) G \Bigr\} ,
\tag{D.7}
\]
where G = (g^{1⊤}, ..., g^{p⊤})^⊤ is a p×(K-1) matrix defined as follows. Let S(B) denote the set of non-zero rows of B, S(B) = {j ∈ {1, ..., p} : ‖β^j‖_2 ≠ 0}; then we have
\[
\forall j\in S(B),\quad g^j = w_j\|\beta^j\|_2^{-1}\beta^j ,
\tag{D.8}
\]
\[
\forall j\notin S(B),\quad \|g^j\|_2 \le w_j .
\tag{D.9}
\]


This condition results in an equality for the "active" non-zero vectors β^j and an inequality for the other ones, which both provide essential building blocks of our algorithm.

Lemma D.3. Problem (D.4) admits at least one solution, which is unique if J(B) is strictly convex. All critical points B^⋆ of the objective function verifying the following conditions are global minima. Let S(B^⋆) denote the set of non-zero rows of B^⋆, S(B^⋆) = {j ∈ {1, ..., p} : ‖β^{⋆j}‖_2 ≠ 0}, and let S̄(B^⋆) be its complement; then we have
\[
\forall j\in S(B^\star),\quad
-\frac{\partial J(B^\star)}{\partial\beta^j}
= 2\lambda\Bigl(\sum_{j'=1}^{p} w_{j'}\|\beta^{\star j'}\|_2\Bigr) w_j\|\beta^{\star j}\|_2^{-1}\beta^{\star j} ,
\tag{D.10a}
\]
\[
\forall j\notin S(B^\star),\quad
\Bigl\|\frac{\partial J(B^\star)}{\partial\beta^j}\Bigr\|_2
\le 2\lambda\, w_j\Bigl(\sum_{j'=1}^{p} w_{j'}\|\beta^{\star j'}\|_2\Bigr) .
\tag{D.10b}
\]
In particular, Lemma D.3 provides a well-defined appraisal of the support of the solution, which is not easily handled from the direct analysis of the variational problem (D.1).

D.2 An Upper Bound on the Objective Function

Lemma D.4. The objective function of the variational form (D.1) is an upper bound on the group-Lasso objective function (D.4), and for a given B, the gap between these objectives is null at τ^⋆ such that
\[
\tau_j^\star = \frac{w_j\|\beta^j\|_2}{\sum_{j'=1}^{p} w_{j'}\|\beta^{j'}\|_2} .
\]

Proof. The objective functions of (D.1) and (D.4) only differ in their second term. Let τ ∈ R^p be any feasible vector; we have
\[
\Bigl(\sum_{j=1}^{p} w_j\|\beta^j\|_2\Bigr)^{\!2}
= \Bigl(\sum_{j=1}^{p}\tau_j^{1/2}\,\frac{w_j\|\beta^j\|_2}{\tau_j^{1/2}}\Bigr)^{\!2}
\le \Bigl(\sum_{j=1}^{p}\tau_j\Bigr)\Bigl(\sum_{j=1}^{p}\frac{w_j^2\|\beta^j\|_2^2}{\tau_j}\Bigr)
\le \sum_{j=1}^{p}\frac{w_j^2\|\beta^j\|_2^2}{\tau_j} ,
\]
where we used the Cauchy–Schwarz inequality in the second step and the definition of the feasibility set of τ in the last one.


This lemma only holds for the alternative variational formulation described in this appendix. It is difficult to obtain the same result for the first variational form (Section 4.3.1), because the definitions of the feasible sets of τ and β are intertwined.


E Invariance of the Group-Lasso to Unitary Transformations

The computational trick described in Section 5.2 for quadratic penalties can be applied to the group-Lasso provided that the following holds: if the regression coefficients B_0 are optimal for the score values Θ_0, and if the optimal scores Θ^⋆ are obtained by a unitary transformation of Θ_0, say Θ^⋆ = Θ_0V (where V ∈ R^{M×M} is a unitary matrix), then B^⋆ = B_0V is optimal conditionally on Θ^⋆; that is, (Θ^⋆, B^⋆) is a global solution of the corresponding optimal scoring problem. To show this, we use the standard group-Lasso formulation and prove the following proposition.

Proposition E.1. Let $\hat{B}$ be a solution of
\[
\min_{B\in\mathbb{R}^{p\times M}}\ \|Y - XB\|_F^2 + \lambda\sum_{j=1}^{p} w_j\|\beta^j\|_2 ,
\tag{E.1}
\]
and let $\tilde{Y} = YV$, where V ∈ R^{M×M} is a unitary matrix. Then $\tilde{B} = \hat{B}V$ is a solution of
\[
\min_{B\in\mathbb{R}^{p\times M}}\ \|\tilde{Y} - XB\|_F^2 + \lambda\sum_{j=1}^{p} w_j\|\beta^j\|_2 .
\tag{E.2}
\]

Proof. The first-order necessary optimality conditions for $\hat{B}$ are
\[
\forall j\in S(\hat{B}),\quad
2x^{j\top}\bigl(x^j\hat{\beta}^j - Y\bigr) + \lambda w_j\|\hat{\beta}^j\|_2^{-1}\hat{\beta}^j = 0 ,
\tag{E.3a}
\]
\[
\forall j\notin S(\hat{B}),\quad
2\bigl\|x^{j\top}\bigl(x^j\hat{\beta}^j - Y\bigr)\bigr\|_2 \le \lambda w_j ,
\tag{E.3b}
\]
where $S(\hat{B}) \subseteq \{1,\dots,p\}$ denotes the set of non-zero row vectors of $\hat{B}$ and $\bar{S}(\hat{B})$ is its complement.

First, we note that from the definition of $\tilde{B}$ we have $S(\hat{B}) = S(\tilde{B})$. Then we may rewrite the above conditions as follows:
\[
\forall j\in S(\tilde{B}),\quad
2x^{j\top}\bigl(x^j\tilde{\beta}^j - \tilde{Y}\bigr) + \lambda w_j\|\tilde{\beta}^j\|_2^{-1}\tilde{\beta}^j = 0 ,
\tag{E.4a}
\]
\[
\forall j\notin S(\tilde{B}),\quad
2\bigl\|x^{j\top}\bigl(x^j\tilde{\beta}^j - \tilde{Y}\bigr)\bigr\|_2 \le \lambda w_j ,
\tag{E.4b}
\]
where (E.4a) is obtained by multiplying both sides of Equation (E.3a) by V, and also uses that VV^⊤ = I, so that for all u ∈ R^M, ‖u^⊤‖_2 = ‖u^⊤V‖_2. Equation (E.4b) is also obtained from the latter relationship. Conditions (E.4) are then recognized as the first-order necessary conditions for $\tilde{B}$ to be a solution of Problem (E.2). As the latter is convex, these conditions are sufficient, which concludes the proof.
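The invariance can also be observed numerically: the sketch below (our own check, with random data) verifies that the group-Lasso objective takes the same value at (Y, B) and at (YV, BV) for a random orthogonal V, since the Frobenius residual and the row norms are both preserved.

    import numpy as np
    from scipy.stats import ortho_group

    rng = np.random.default_rng(4)
    n, p, M = 40, 8, 3
    X = rng.normal(size=(n, p))
    Y = rng.normal(size=(n, M))
    B = rng.normal(size=(p, M))
    w, lam = np.ones(p), 0.7

    V = ortho_group.rvs(M, random_state=5)         # random orthogonal (unitary) matrix

    def objective(Y_, B_):
        fit = np.linalg.norm(Y_ - X @ B_, "fro") ** 2
        penalty = lam * np.sum(w * np.linalg.norm(B_, axis=1))
        return fit + penalty

    assert np.isclose(objective(Y, B), objective(Y @ V, B @ V))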


F Expected Complete Likelihood and Likelihood

Section 7.1.2 explains that by maximizing the conditional expectation of the complete log-likelihood Q(θ, θ′) (7.7) by means of the EM algorithm, the log-likelihood (7.1) is also maximized. The value of the log-likelihood can be computed using its definition (7.1), but there is a shorter way to compute it from Q(θ, θ′) when the latter is available:

\[
L(\theta) = \sum_{i=1}^{n}\log\Bigl(\sum_{k=1}^{K}\pi_k f_k(x_i;\theta_k)\Bigr) ,
\tag{F.1}
\]
\[
Q(\theta,\theta') = \sum_{i=1}^{n}\sum_{k=1}^{K} t_{ik}(\theta')\log\bigl(\pi_k f_k(x_i;\theta_k)\bigr) ,
\tag{F.2}
\]
\[
\text{with}\quad
t_{ik}(\theta') = \frac{\pi'_k f_k(x_i;\theta'_k)}{\sum_{\ell}\pi'_\ell f_\ell(x_i;\theta'_\ell)} .
\tag{F.3}
\]
In the EM algorithm, θ′ denotes the model parameters at the previous iteration, the t_ik(θ′) are the posterior probability values computed from θ′ at the previous E-step, and θ without "prime" denotes the parameters of the current iteration, to be obtained by the maximization of Q(θ, θ′).

Using (F.3), we have
\[
\begin{aligned}
Q(\theta,\theta') &= \sum_{i,k} t_{ik}(\theta')\log\bigl(\pi_k f_k(x_i;\theta_k)\bigr) \\
&= \sum_{i,k} t_{ik}(\theta')\log\bigl(t_{ik}(\theta)\bigr)
+ \sum_{i,k} t_{ik}(\theta')\log\Bigl(\sum_{\ell}\pi_\ell f_\ell(x_i;\theta_\ell)\Bigr) \\
&= \sum_{i,k} t_{ik}(\theta')\log\bigl(t_{ik}(\theta)\bigr) + L(\theta) .
\end{aligned}
\]
In particular, after the evaluation of the t_ik in the E-step, where θ = θ′, the log-likelihood can be computed using the value of Q(θ, θ) (7.7) and the entropy of the posterior probabilities:
\[
\begin{aligned}
L(\theta) &= Q(\theta,\theta) - \sum_{i,k} t_{ik}(\theta)\log\bigl(t_{ik}(\theta)\bigr) \\
&= Q(\theta,\theta) + H(T) .
\end{aligned}
\]
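The identity L(θ) = Q(θ, θ) + H(T) can be checked directly; the sketch below (ours, on a toy two-component Gaussian mixture with fixed parameters) computes both sides after an E-step.

    import numpy as np
    from scipy.stats import multivariate_normal

    rng = np.random.default_rng(6)
    X = np.vstack([rng.normal(-2, 1, (30, 2)), rng.normal(2, 1, (30, 2))])
    pi = np.array([0.5, 0.5])
    mus = [np.array([-2.0, -2.0]), np.array([2.0, 2.0])]
    Sigma = np.eye(2)

    # component densities f_k(x_i) for every point (n x K), then pi_k * f_k(x_i)
    dens = np.column_stack([multivariate_normal.pdf(X, mean=m, cov=Sigma) for m in mus])
    joint = dens * pi

    T = joint / joint.sum(axis=1, keepdims=True)     # E-step posteriors, Eq. (F.3)

    log_lik = np.sum(np.log(joint.sum(axis=1)))      # L(theta), Eq. (F.1)
    Q = np.sum(T * np.log(joint))                    # Q(theta, theta), Eq. (F.2)
    entropy = -np.sum(T * np.log(T))                 # H(T)

    assert np.isclose(log_lik, Q + entropy)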


G Derivation of the M-Step Equations

This appendix shows the whole process to obtain expressions (7.10), (7.11) and (7.12) in the context of a Gaussian mixture model with a common covariance matrix. The criterion is defined as

\[
\begin{aligned}
Q(\theta,\theta') &= \sum_{i,k} t_{ik}(\theta')\log\bigl(\pi_k f_k(x_i;\theta_k)\bigr) \\
&= \sum_{k}\log(\pi_k)\sum_{i} t_{ik}
- \frac{np}{2}\log(2\pi) - \frac{n}{2}\log|\Sigma|
- \frac{1}{2}\sum_{i,k} t_{ik}(x_i-\mu_k)^\top\Sigma^{-1}(x_i-\mu_k) ,
\end{aligned}
\]
which has to be maximized subject to
\[
\sum_{k}\pi_k = 1 .
\]

The Lagrangian of this problem is
\[
L(\theta) = Q(\theta,\theta') + \lambda\Bigl(\sum_{k}\pi_k - 1\Bigr) .
\]
The partial derivatives of the Lagrangian are set to zero to obtain the optimal values of π_k, µ_k and Σ.

G.1 Prior probabilities

\[
\frac{\partial L(\theta)}{\partial\pi_k} = 0
\iff \frac{1}{\pi_k}\sum_{i} t_{ik} + \lambda = 0 ,
\]
where λ is identified from the constraint, leading to
\[
\pi_k = \frac{1}{n}\sum_{i} t_{ik} .
\]


G.2 Means

\[
\frac{\partial L(\theta)}{\partial\mu_k} = 0
\iff -\frac{1}{2}\sum_{i} t_{ik}\,2\Sigma^{-1}(\mu_k - x_i) = 0
\;\Rightarrow\;
\mu_k = \frac{\sum_i t_{ik}x_i}{\sum_i t_{ik}} .
\]

G.3 Covariance Matrix

\[
\frac{\partial L(\theta)}{\partial\Sigma^{-1}} = 0
\iff
\underbrace{\frac{n}{2}\Sigma}_{\text{as per Property 4}}
- \underbrace{\frac{1}{2}\sum_{i,k} t_{ik}(x_i-\mu_k)(x_i-\mu_k)^\top}_{\text{as per Property 5}}
= 0
\;\Rightarrow\;
\Sigma = \frac{1}{n}\sum_{i,k} t_{ik}(x_i-\mu_k)(x_i-\mu_k)^\top .
\]
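The three updates above translate directly into code. The following sketch (ours) takes the matrix of posterior probabilities T produced by an E-step and returns the M-step estimates for a Gaussian mixture with a common covariance matrix.

    import numpy as np

    def m_step(X, T):
        # X: n x p data matrix, T: n x K posterior probabilities (responsibilities).
        # Returns priors pi (K,), means mu (K x p) and the shared covariance Sigma (p x p).
        n, p = X.shape
        nk = T.sum(axis=0)                       # soft counts per component
        pi = nk / n                              # pi_k = (1/n) sum_i t_ik
        mu = (T.T @ X) / nk[:, None]             # mu_k = sum_i t_ik x_i / sum_i t_ik
        Sigma = np.zeros((p, p))
        for k in range(T.shape[1]):
            diff = X - mu[k]                     # x_i - mu_k
            Sigma += (T[:, k, None] * diff).T @ diff
        return pi, mu, Sigma / n                 # Sigma = (1/n) sum_{i,k} t_ik (.)(.)^T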


Bibliography

F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Convex optimization with sparsity-inducing norms. Optimization for Machine Learning, pages 19–54, 2011.
F. R. Bach. Bolasso: model consistent lasso estimation through the bootstrap. In Proceedings of the 25th International Conference on Machine Learning, ICML, 2008.
F. R. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1):1–106, 2012.
J. D. Banfield and A. E. Raftery. Model-based Gaussian and non-Gaussian clustering. Biometrics, pages 803–821, 1993.
A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
H. Bensmail and G. Celeux. Regularized Gaussian discriminant analysis through eigenvalue decomposition. Journal of the American Statistical Association, 91(436):1743–1748, 1996.
P. J. Bickel and E. Levina. Some theory for Fisher's linear discriminant function, 'naive Bayes', and some alternatives when there are many more variables than observations. Bernoulli, 10(6):989–1010, 2004.
C. Biernacki, G. Celeux, G. Govaert, and F. Langrognet. MIXMOD Statistical Documentation. http://www.mixmod.org, 2008.
C. M. Bishop. Pattern Recognition and Machine Learning. Springer, New York, 2006.
C. Bouveyron and C. Brunet. Discriminative variable selection for clustering with the sparse Fisher-EM algorithm. Technical Report 1204.2067, ArXiv e-prints, 2012a.
C. Bouveyron and C. Brunet. Simultaneous model-based clustering and visualization in the Fisher discriminative subspace. Statistics and Computing, 22(1):301–324, 2012b.
S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
L. Breiman. Better subset regression using the nonnegative garrote. Technometrics, 37(4):373–384, 1995.
L. Breiman and R. Ihaka. Nonlinear discriminant analysis via ACE and scaling. Technical Report 40, University of California, Berkeley, 1984.
T. Cai and W. Liu. A direct estimation approach to sparse linear discriminant analysis. Journal of the American Statistical Association, 106(496):1566–1577, 2011.
S. Canu and Y. Grandvalet. Outcomes of the equivalence of adaptive ridge with least absolute shrinkage. Advances in Neural Information Processing Systems, page 445, 1999.
C. Caramanis, S. Mannor, and H. Xu. Robust optimization in machine learning. In S. Sra, S. Nowozin, and S. J. Wright, editors, Optimization for Machine Learning, pages 369–402. MIT Press, 2012.
B. Chidlovskii and L. Lecerf. Scalable feature selection for multi-class problems. In W. Daelemans, B. Goethals, and K. Morik, editors, Machine Learning and Knowledge Discovery in Databases, volume 5211 of Lecture Notes in Computer Science, pages 227–240. Springer, 2008.
L. Clemmensen, T. Hastie, D. Witten, and B. Ersbøll. Sparse discriminant analysis. Technometrics, 53(4):406–413, 2011.
C. De Mol, E. De Vito, and L. Rosasco. Elastic-net regularization in learning theory. Journal of Complexity, 25(2):201–230, 2009.
A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1):1–38, 1977. ISSN 0035-9246.
D. L. Donoho, M. Elad, and V. N. Temlyakov. Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Transactions on Information Theory, 52(1):6–18, 2006.
R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley, 2000.
B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. The Annals of Statistics, 32(2):407–499, 2004.
J. Fan and Y. Fan. High dimensional classification using features annealed independence rules. Annals of Statistics, 36(6):2605, 2008.
R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Human Genetics, 7(2):179–188, 1936.
V. Franc and S. Sonnenburg. Optimized cutting plane algorithm for support vector machines. In Proceedings of the 25th International Conference on Machine Learning, pages 320–327. ACM, 2008.
J. Friedman, T. Hastie, and R. Tibshirani. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2009.
J. Friedman, T. Hastie, and R. Tibshirani. A note on the group lasso and a sparse group lasso. Technical Report 1001.0736, ArXiv e-prints, 2010.
J. H. Friedman. Regularized discriminant analysis. Journal of the American Statistical Association, 84(405):165–175, 1989.
W. J. Fu. Penalized regressions: the bridge versus the lasso. Journal of Computational and Graphical Statistics, 7(3):397–416, 1998.
A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis. Chapman & Hall/CRC, 2003.
D. Ghosh and A. M. Chinnaiyan. Classification and selection of biomarkers in genomic data using lasso. Journal of Biomedicine and Biotechnology, 2:147–154, 2005.
G. Govaert, Y. Grandvalet, X. Liu, and L. F. Sanchez Merchante. Implementation baseline for clustering. Technical Report D71-m12, Massive Sets of Heuristics for Machine Learning, https://secure.mash-project.eu/files/mash-deliverable-D71-m12.pdf, 2010.
G. Govaert, Y. Grandvalet, B. Laval, X. Liu, and L. F. Sanchez Merchante. Implementations of original clustering. Technical Report D72-m24, Massive Sets of Heuristics for Machine Learning, https://secure.mash-project.eu/files/mash-deliverable-D72-m24.pdf, 2011.
Y. Grandvalet. Least absolute shrinkage is equivalent to quadratic penalization. In Perspectives in Neural Computing, volume 98, pages 201–206, 1998.
Y. Grandvalet and S. Canu. Adaptive scaling for feature selection in SVMs. Advances in Neural Information Processing Systems, 15:553–560, 2002.
L. Grosenick, S. Greer, and B. Knutson. Interpretable classifiers for fMRI improve prediction of purchases. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 16(6):539–548, 2008.
Y. Guermeur, G. Pollastri, A. Elisseeff, D. Zelus, H. Paugam-Moisy, and P. Baldi. Combining protein secondary structure prediction models with ensemble methods of optimal complexity. Neurocomputing, 56:305–327, 2004.
J. Guo, E. Levina, G. Michailidis, and J. Zhu. Pairwise variable selection for high-dimensional model-based clustering. Biometrics, 66(3):793–804, 2010.
I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157–1182, 2003.
T. Hastie and R. Tibshirani. Discriminant analysis by Gaussian mixtures. Journal of the Royal Statistical Society, Series B (Methodological), 58(1):155–176, 1996.
T. Hastie, R. Tibshirani, and A. Buja. Flexible discriminant analysis by optimal scoring. Journal of the American Statistical Association, 89(428):1255–1270, 1994.
T. Hastie, A. Buja, and R. Tibshirani. Penalized discriminant analysis. The Annals of Statistics, 23(1):73–102, 1995.
A. E. Hoerl and R. W. Kennard. Ridge regression: biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.
J. Huang, S. Ma, H. Xie, and C. H. Zhang. A group bridge approach for variable selection. Biometrika, 96(2):339–355, 2009.
T. Joachims. Training linear SVMs in linear time. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 217–226. ACM, 2006.
K. Knight and W. Fu. Asymptotics for lasso-type estimators. The Annals of Statistics, 28(5):1356–1378, 2000.
P. F. Kuan, S. Wang, X. Zhou, and H. Chu. A statistical framework for Illumina DNA methylation arrays. Bioinformatics, 26(22):2849–2855, 2010.
T. Lange, M. Braun, V. Roth, and J. Buhmann. Stability-based model selection. Advances in Neural Information Processing Systems, 15:617–624, 2002.
M. H. C. Law, M. A. T. Figueiredo, and A. K. Jain. Simultaneous feature selection and clustering using mixture models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(9):1154–1166, 2004.
Y. Lee, Y. Lin, and G. Wahba. Multicategory support vector machines. Journal of the American Statistical Association, 99(465):67–81, 2004.
C. Leng. Sparse optimal scoring for multiclass cancer diagnosis and biomarker detection using microarray data. Computational Biology and Chemistry, 32(6):417–425, 2008.
C. Leng, Y. Lin, and G. Wahba. A note on the lasso and related procedures in model selection. Statistica Sinica, 16(4):1273, 2006.
H. Liu and L. Yu. Toward integrating feature selection algorithms for classification and clustering. IEEE Transactions on Knowledge and Data Engineering, 17(4):491–502, 2005.
J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281–297. University of California Press, 1967.
Q. Mai, H. Zou, and M. Yuan. A direct approach to sparse discriminant analysis in ultra-high dimensions. Biometrika, 99(1):29–42, 2012.
C. Maugis, G. Celeux, and M. L. Martin-Magniette. Variable selection for clustering with Gaussian mixture models. Biometrics, 65(3):701–709, 2009a.
C. Maugis, G. Celeux, and M. L. Martin-Magniette. SelvarClust: software for variable selection in model-based clustering. http://www.math.univ-toulouse.fr/~maugis/SelvarClustHomepage.html, 2009b.
L. Meier, S. Van De Geer, and P. Bühlmann. The group lasso for logistic regression. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 70(1):53–71, 2008.
N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34(3):1436–1462, 2006.
B. Moghaddam, Y. Weiss, and S. Avidan. Generalized spectral bounds for sparse LDA. In Proceedings of the 23rd International Conference on Machine Learning, pages 641–648. ACM, 2006.
B. Moghaddam, Y. Weiss, and S. Avidan. Fast pixel/part selection with sparse eigenvectors. In IEEE 11th International Conference on Computer Vision (ICCV 2007), pages 1–8, 2007.
Y. Nesterov. Gradient methods for minimizing composite functions. Preprint, 2007.
S. Newcomb. A generalized theory of the combination of observations so as to obtain the best result. American Journal of Mathematics, 8(4):343–366, 1886.
B. Ng and R. Abugharbieh. Generalized group sparse classifiers with application in fMRI brain decoding. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1065–1071. IEEE, 2011.
M. R. Osborne, B. Presnell, and B. A. Turlach. On the lasso and its dual. Journal of Computational and Graphical Statistics, 9(2):319–337, 2000a.
M. R. Osborne, B. Presnell, and B. A. Turlach. A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis, 20(3):389–403, 2000b.
W. Pan and X. Shen. Penalized model-based clustering with application to variable selection. Journal of Machine Learning Research, 8:1145–1164, 2007.
W. Pan, X. Shen, A. Jiang, and R. P. Hebbel. Semi-supervised learning via penalized mixture model with application to microarray sample classification. Bioinformatics, 22(19):2388–2395, 2006.
K. Pearson. Contributions to the mathematical theory of evolution. Philosophical Transactions of the Royal Society of London, 185:71–110, 1894.
S. Perkins, K. Lacker, and J. Theiler. Grafting: fast, incremental feature selection by gradient descent in function space. Journal of Machine Learning Research, 3:1333–1356, 2003.
Z. Qiao, L. Zhou, and J. Huang. Sparse linear discriminant analysis with applications to high dimensional low sample size data. International Journal of Applied Mathematics, 39(1), 2009.
A. E. Raftery and N. Dean. Variable selection for model-based clustering. Journal of the American Statistical Association, 101(473):168–178, 2006.
C. R. Rao. The utilization of multiple measurements in problems of biological classification. Journal of the Royal Statistical Society, Series B (Methodological), 10(2):159–203, 1948.
S. Rosset and J. Zhu. Piecewise linear regularized solution paths. The Annals of Statistics, 35(3):1012–1030, 2007.
V. Roth. The generalized lasso. IEEE Transactions on Neural Networks, 15(1):16–28, 2004.
V. Roth and B. Fischer. The group-lasso for generalized linear models: uniqueness of solutions and efficient algorithms. In W. W. Cohen, A. McCallum, and S. T. Roweis, editors, Machine Learning, Proceedings of the Twenty-Fifth International Conference (ICML 2008), volume 307 of ACM International Conference Proceeding Series, pages 848–855, 2008.
V. Roth and T. Lange. Feature selection in clustering problems. In S. Thrun, L. K. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16, pages 473–480. MIT Press, 2004.
C. Sammut and G. I. Webb. Encyclopedia of Machine Learning. Springer-Verlag New York Inc., 2010.
L. F. Sanchez Merchante, Y. Grandvalet, and G. Govaert. An efficient approach to sparse linear discriminant analysis. In Proceedings of the 29th International Conference on Machine Learning, ICML, 2012.
G. Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464, 1978.
A. J. Smola, S. V. N. Vishwanathan, and Q. Le. Bundle methods for machine learning. Advances in Neural Information Processing Systems, 20:1377–1384, 2008.
S. Sonnenburg, G. Rätsch, C. Schäfer, and B. Schölkopf. Large scale multiple kernel learning. Journal of Machine Learning Research, 7:1531–1565, 2006.
P. Sprechmann, I. Ramirez, G. Sapiro, and Y. Eldar. Collaborative hierarchical sparse modeling. In Information Sciences and Systems (CISS), 2010 44th Annual Conference on, pages 1–6. IEEE, 2010.
M. Szafranski. Pénalités Hiérarchiques pour l'Intégration de Connaissances dans les Modèles Statistiques. PhD thesis, Université de Technologie de Compiègne, 2008.
M. Szafranski, Y. Grandvalet, and P. Morizet-Mahoudeaux. Hierarchical penalization. Advances in Neural Information Processing Systems, 2008.
R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), pages 267–288, 1996.
J. E. Vogt and V. Roth. The group-lasso: l1,∞ regularization versus l1,2 regularization. In Pattern Recognition, 32nd DAGM Symposium, Lecture Notes in Computer Science, 2010.
S. Wang and J. Zhu. Variable selection for model-based high-dimensional clustering and its application to microarray data. Biometrics, 64(2):440–448, 2008.
D. Witten and R. Tibshirani. Penalized classification using Fisher's linear discriminant. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 73(5):753–772, 2011.
D. M. Witten and R. Tibshirani. A framework for feature selection in clustering. Journal of the American Statistical Association, 105(490):713–726, 2010.
D. M. Witten, R. Tibshirani, and T. Hastie. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3):515–534, 2009.
M. Wu and B. Schölkopf. A local learning approach for clustering. Advances in Neural Information Processing Systems, 19:1529, 2007.
M. C. Wu, L. Zhang, Z. Wang, D. C. Christiani, and X. Lin. Sparse linear discriminant analysis for simultaneous testing for the significance of a gene set/pathway and gene selection. Bioinformatics, 25(9):1145–1151, 2009.
T. T. Wu and K. Lange. Coordinate descent algorithms for lasso penalized regression. The Annals of Applied Statistics, pages 224–244, 2008.
B. Xie, W. Pan, and X. Shen. Penalized model-based clustering with cluster-specific diagonal covariance matrices and grouped variables. Electronic Journal of Statistics, 2:168–172, 2008a.
B. Xie, W. Pan, and X. Shen. Variable selection in penalized model-based clustering via regularization on grouped parameters. Biometrics, 64(3):921–930, 2008b.
C. Yang, X. Wan, Q. Yang, H. Xue, and W. Yu. Identifying main effects and epistatic interactions from large-scale SNP data via adaptive group lasso. BMC Bioinformatics, 11(Suppl 1):S18, 2010.
J. Ye. Least squares linear discriminant analysis. In Proceedings of the 24th International Conference on Machine Learning, pages 1087–1093. ACM, 2007.
M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 68(1):49–67, 2006.
P. Zhao and B. Yu. On model selection consistency of lasso. Journal of Machine Learning Research, 7(2):2541, 2007.
P. Zhao, G. Rocha, and B. Yu. The composite absolute penalties family for grouped and hierarchical variable selection. The Annals of Statistics, 37(6A):3468–3497, 2009.
H. Zhou, W. Pan, and X. Shen. Penalized model-based clustering with unconstrained covariance matrices. Electronic Journal of Statistics, 3:1473–1496, 2009.
H. Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429, 2006.
H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 67(2):301–320, 2005.

Page 3: Luis Francisco Sanchez Merchante To cite this version

Algorithmes drsquoestimation pour laclassification parcimonieuse

Luis Francisco Sanchez MerchanteUniversity of Compiegne

CompiegneFrance

ldquoNunca se sabe que encontrara uno tras una puerta Quiza en eso consistela vida en girar pomosrdquo

Albert Espinosa

ldquoBe brave Take risks Nothing can substitute experiencerdquo

Paulo Coelho

Acknowledgements

If this thesis has fallen into your hands and you have the curiosity to read this para-graph you must know that even though it is a short section there are quite a lot ofpeople behind this volume All of them supported me during the three years threemonths and three weeks that it took me to finish this work However you will hardlyfind any names I think it is a little sad writing peoplersquos names in a document that theywill probably not see and that will be condemned to gather dust on a bookshelf It islike losing a wallet with pictures of your beloved family and friends It makes me feelsomething like melancholy

Obviously this does not mean that I have nothing to be grateful for I always feltunconditional love and support from my family and I never felt homesick since my spanishfriends did the best they could to visit me frequently During my time in CompiegneI met wonderful people that are now friends for life I am sure that all this people donot need to be listed in this section to know how much I love them I thank them everytime we see each other by giving them the best of myself

I enjoyed my time in Compiegne It was an exciting adventure and I do not regreta single thing I am sure that I will miss these days but this does not make me sadbecause as the Beatles sang in ldquoThe endrdquo or Jorge Drexler in ldquoTodo se transformardquo theamount that you miss people is equal to the love you gave them and received from them

The only names I am including are my supervisorsrsquo Yves Grandvalet and GerardGovaert I do not think it is possible to have had better teaching and supervision andI am sure that the reason I finished this work was not only thanks to their technicaladvice but also but also thanks to their close support humanity and patience

Contents

List of figures v

List of tables vii

Notation and Symbols ix

I Context and Foundations 1

1 Context 5

2 Regularization for Feature Selection 921 Motivations 9

22 Categorization of Feature Selection Techniques 11

23 Regularization 13

231 Important Properties 14

232 Pure Penalties 14

233 Hybrid Penalties 18

234 Mixed Penalties 19

235 Sparsity Considerations 19

236 Optimization Tools for Regularized Problems 21

II Sparse Linear Discriminant Analysis 25

Abstract 27

3 Feature Selection in Fisher Discriminant Analysis 2931 Fisher Discriminant Analysis 29

32 Feature Selection in LDA Problems 30

321 Inertia Based 30

322 Regression Based 32

4 Formalizing the Objective 3541 From Optimal Scoring to Linear Discriminant Analysis 35

411 Penalized Optimal Scoring Problem 36

412 Penalized Canonical Correlation Analysis 37

i

Contents

413 Penalized Linear Discriminant Analysis 39

414 Summary 40

42 Practicalities 41

421 Solution of the Penalized Optimal Scoring Regression 41

422 Distance Evaluation 42

423 Posterior Probability Evaluation 43

424 Graphical Representation 43

43 From Sparse Optimal Scoring to Sparse LDA 43

431 A Quadratic Variational Form 44

432 Group-Lasso OS as Penalized LDA 47

5 GLOSS Algorithm 4951 Regression Coefficients Updates 49

511 Cholesky decomposition 52

512 Numerical Stability 52

52 Score Matrix 52

53 Optimality Conditions 53

54 Active and Inactive Sets 54

55 Penalty Parameter 54

56 Options and Variants 55

561 Scaling Variables 55

562 Sparse Variant 55

563 Diagonal Variant 55

564 Elastic net and Structured Variant 55

6 Experimental Results 5761 Normalization 57

62 Decision Thresholds 57

63 Simulated Data 58

64 Gene Expression Data 60

65 Correlated Data 63

Discussion 63

III Sparse Clustering Analysis 67

Abstract 69

7 Feature Selection in Mixture Models 7171 Mixture Models 71

711 Model 71

712 Parameter Estimation The EM Algorithm 72

ii

Contents

72 Feature Selection in Model-Based Clustering 75721 Based on Penalized Likelihood 76722 Based on Model Variants 77723 Based on Model Selection 79

8 Theoretical Foundations 8181 Resolving EM with Optimal Scoring 81

811 Relationship Between the M-Step and Linear Discriminant Analysis 81812 Relationship Between Optimal Scoring and Linear Discriminant

Analysis 82813 Clustering Using Penalized Optimal Scoring 82814 From Sparse Optimal Scoring to Sparse Linear Discriminant Analysis 83

82 Optimized Criterion 83821 A Bayesian Derivation 84822 Maximum a Posteriori Estimator 85

9 Mix-GLOSS Algorithm 8791 Mix-GLOSS 87

911 Outer Loop Whole Algorithm Repetitions 87912 Penalty Parameter Loop 88913 Inner Loop EM Algorithm 89

92 Model Selection 91

10Experimental Results 93101 Tested Clustering Algorithms 93102 Results 95103 Discussion 97

Conclusions 97

Appendix 103

A Matrix Properties 105

B The Penalized-OS Problem is an Eigenvector Problem 107B1 How to Solve the Eigenvector Decomposition 107B2 Why the OS Problem is Solved as an Eigenvector Problem 109

C Solving Fisherrsquos Discriminant Problem 111

D Alternative Variational Formulation for the Group-Lasso 113D1 Useful Properties 114D2 An Upper Bound on the Objective Function 115

iii

Contents

E Invariance of the Group-Lasso to Unitary Transformations 117

F Expected Complete Likelihood and Likelihood 119

G Derivation of the M-Step Equations 121G1 Prior probabilities 121G2 Means 122G3 Covariance Matrix 122

Bibliography 123

iv

List of Figures

11 MASH project logo 5

21 Example of relevant features 1022 Four key steps of feature selection 1123 Admissible sets in two dimensions for different pure norms ||β||p 1424 Two dimensional regularized problems with ||β||1 and ||β||2 penalties 1525 Admissible sets for the Lasso and Group-Lasso 2026 Sparsity patterns for an example with 8 variables characterized by 4 pa-

rameters 20

41 Graphical representation of the variational approach to Group-Lasso 45

51 GLOSS block diagram 5052 Graph and Laplacian matrix for a 3times 3 image 56

61 TPR versus FPR for all simulations 6062 2D-representations of Nakayama and Sun datasets based on the two first

discriminant vectors provided by GLOSS and SLDA 6263 USPS digits ldquo1rdquo and ldquo0rdquo 6364 Discriminant direction between digits ldquo1rdquo and ldquo0rdquo 6465 Sparse discriminant direction between digits ldquo1rdquo and ldquo0rdquo 64

91 Mix-GLOSS Loops Scheme 8892 Mix-GLOSS model selection diagram 92

101 Class mean vectors for each artificial simulation 94102 TPR versus FPR for all simulations 97

v

List of Tables

61 Experimental results for simulated data supervised classification 5962 Average TPR and FPR for all simulations 6063 Experimental results for gene expression data supervised classification 61

101 Experimental results for simulated data unsupervised clustering 96102 Average TPR versus FPR for all clustering simulations 96

vii

Notation and Symbols

Throughout this thesis vectors are denoted by lowercase letters in bold font andmatrices by uppercase letters in bold font Unless otherwise stated vectors are columnvectors and parentheses are used to build line vectors from comma-separated lists ofscalars or to build matrices from comma-separated lists of column vectors

Sets

N the set of natural numbers N = 1 2 R the set of reals|A| cardinality of a set A (for finite sets the number of elements)A complement of set A

Data

X input domainxi input sample xi isin XX design matrix X = (xgt1 x

gtn )gt

xj column j of Xyi class indicator of sample i

Y indicator matrix Y = (ygt1 ygtn )gt

z complete data z = (xy)Gk set of the indices of observations belonging to class kn number of examplesK number of classesp dimension of Xi j k indices running over N

Vectors Matrices and Norms

0 vector with all entries equal to zero1 vector with all entries equal to oneI identity matrixAgt transposed of matrix A (ditto for vector)Aminus1 inverse of matrix Atr(A) trace of matrix A|A| determinant of matrix Adiag(v) diagonal matrix with v on the diagonalv1 L1 norm of vector vv2 L2 norm of vector vAF Frobenius norm of matrix A

ix

Notation and Symbols

Probability

E [middot] expectation of a random variablevar [middot] variance of a random variableN (micro σ2) normal distribution with mean micro and variance σ2

W(W ν) Wishart distribution with ν degrees of freedom and W scalematrix

H (X) entropy of random variable XI (XY ) mutual information between random variables X and Y

Mixture Models

yik hard membership of sample i to cluster kfk distribution function for cluster ktik posterior probability of sample i to belong to cluster kT posterior probability matrixπk prior probability or mixture proportion for cluster kmicrok mean vector of cluster kΣk covariance matrix of cluster kθk parameter vector for cluster k θk = (microkΣk)

θ(t) parameter vector at iteration t of the EM algorithmf(Xθ) likelihood functionL(θ X) log-likelihood functionLC(θ XY) complete log-likelihood function

Optimization

J(middot) cost functionL(middot) Lagrangianβ generic notation for the solution wrt β

βls least squares solution coefficient vectorA active setγ step size to update regularization pathh direction to update regularization path

x

Notation and Symbols

Penalized models

λ λ1 λ2 penalty parametersPλ(θ) penalty term over a generic parameter vectorβkj coefficient j of discriminant vector kβk kth discriminant vector βk = (βk1 βkp)B matrix of discriminant vectors B = (β1 βKminus1)

βj jth row of B = (β1gt βpgt)gt

BLDA coefficient matrix in the LDA domainBCCA coefficient matrix in the CCA domainBOS coefficient matrix in the OS domainXLDA data matrix in the LDA domainXCCA data matrix in the CCA domainXOS data matrix in the OS domainθk score vector kΘ score matrix Θ = (θ1 θKminus1)Y label matrixΩ penalty matrixLCP (θXZ) penalized complete log-likelihood functionΣB between-class covariance matrixΣW within-class covariance matrixΣT total covariance matrix

ΣB sample between-class covariance matrix

ΣW sample within-class covariance matrix

ΣT sample total covariance matrixΛ inverse of covariance matrix or precision matrixwj weightsτj penalty components of the variational approach

xi

Part I

Context and Foundations

1

This thesis is divided in three parts In Part I I am introducing the context in whichthis work has been developed the project that funded it and the constraints that we hadto obey Generic are also detailed here to introduce the models and some basic conceptsthat will be used along this document The state of the art of is also reviewed

The first contribution of this thesis is explained in Part II where I present the super-vised learning algorithm GLOSS and its supporting theory as well as some experimentsto test its performance compared to other state of the art mechanisms Before describingthe algorithm and the experiments its theoretical foundations are provided

The second contribution is described in Part III with an analogue structure to Part IIbut for the unsupervised domain The clustering algorithm Mix-GLOSS adapts the su-pervised technique from Part II by means of a modified EM This section is also furnishedwith specific theoretical foundations an experimental section and a final discussion


1 Context

The MASH project is a research initiative to investigate the open and collaborative design of feature extractors for the Machine Learning scientific community. The project is structured around a web platform (http://mash-project.eu) comprising collaborative tools such as wiki documentation, forums, coding templates and an experiment center empowered with non-stop calculation servers. The applications targeted by MASH are vision and goal-planning problems, either in a 3D virtual environment or with a real robotic arm.

The MASH consortium is led by the IDIAP Research Institute in Switzerland. The other members are the University of Potsdam in Germany, the Czech Technical University of Prague, the National Institute for Research in Computer Science and Control (INRIA) in France, and the National Centre for Scientific Research (CNRS), also in France, through the laboratory of Heuristics and Diagnosis for Complex Systems (HEUDIASYC) attached to the University of Technology of Compiègne.

From the research point of view, the members of the consortium must deal with four main goals:

1 Software development of the website framework and APIs

2 Classification and goal-planning in high dimensional feature spaces

3 Interfacing the platform with the 3D virtual environment and the robot arm

4 Building tools to assist contributors with the development of the feature extractors and the configuration of the experiments


Figure 1.1: MASH project logo



The work detailed in this text has been done in the context of goal 4. From the very beginning of the project, our role is to provide the users with some feedback regarding the feature extractors. At the moment of writing this thesis, the number of public feature extractors reaches 75. In addition to the public ones, there are also private extractors that contributors decide not to share with the rest of the community; the last number I was aware of was about 300. Within those 375 extractors, some must share the same theoretical principles or supply similar features. The framework of the project tests every new piece of code on some reference datasets in order to provide a ranking depending on the quality of the estimation. However, similar performance of two extractors on a particular dataset does not mean that both are using the same variables.

Our engagement was to provide some textual or graphical tools to discover which extractors compute features similar to other ones. Our hypothesis is that many of them use the same theoretical foundations, which should induce a grouping of similar extractors. If we succeed in discovering those groups, we will also be able to select representatives. This information can be used in several ways. For example, from the perspective of a user who develops feature extractors, it would be interesting to compare the performance of his code against the K representatives instead of the whole database. As another example, imagine a user who wants to obtain the best prediction results for a particular dataset: instead of selecting all the feature extractors, creating an extremely high dimensional space, he could select only the K representatives, foreseeing similar results with a faster experiment.

As there is no prior knowledge about the latent structure, we make use of unsupervised techniques. Below is a brief description of the different tools that we developed for the web platform.

- Clustering Using Mixture Models. This is a well-known technique that models the data as if it were randomly generated from a distribution function. This distribution is typically a mixture of Gaussians with unknown mixture proportions, means and covariance matrices. The number of Gaussian components matches the number of expected groups. The parameters of the model are computed using the EM algorithm and the clusters are built by maximum a posteriori estimation. For the calculation we use mixmod, a C++ library that can be interfaced with MATLAB. This library allows working with high dimensional data. Further information regarding mixmod is given by Biernacki et al. (2008). All details concerning the tool implemented are given in deliverable "mash-deliverable-D7.1-m12" (Govaert et al., 2010).

- Sparse Clustering Using Penalized Optimal Scoring. This technique intends again to perform clustering by modelling the data as a mixture of Gaussian distributions. However, instead of using a classic EM algorithm for estimating the components' parameters, the M-step is replaced by a penalized Optimal Scoring problem. This replacement induces sparsity, improving the robustness and the interpretability of the results. Its theory will be explained later in this thesis.


All details concerning the tool implemented can be found in deliverable "mash-deliverable-D7.2-m24" (Govaert et al., 2011).

- Table Clustering Using The RV Coefficient. This technique applies clustering methods directly to the tables computed by the feature extractors, instead of creating a single matrix. A distance in the extractor space is defined using the RV coefficient, a multivariate generalization of Pearson's correlation coefficient in the form of an inner product. The distance is defined for every pair i and j from RV(O_i, O_j), where O_i and O_j are operators computed from the tables returned by feature extractors i and j. Once we have a distance matrix, several standard techniques may be used to group extractors. A detailed description of this technique can be found in deliverables "mash-deliverable-D7.1-m12" (Govaert et al., 2010) and "mash-deliverable-D7.2-m24" (Govaert et al., 2011). A small numerical sketch of the RV coefficient is given below.
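The project deliverables contain the actual implementation; the following minimal sketch, written only for illustration, shows one standard way of computing the RV coefficient between two extractor tables observed on the same samples. The function name and the toy data are mine, and the cross-product operators X X^T are just one possible choice for the operators O_i.

import numpy as np

def rv_coefficient(table_x, table_y):
    # Column-centre each extractor table before comparing them.
    x = table_x - table_x.mean(axis=0)
    y = table_y - table_y.mean(axis=0)
    # n x n cross-product operators, one per table (one possible choice for the O_i).
    ox, oy = x @ x.T, y @ y.T
    return np.trace(ox @ oy) / np.sqrt(np.trace(ox @ ox) * np.trace(oy @ oy))

# Toy usage: three tables computed on the same 20 samples.
rng = np.random.default_rng(0)
a = rng.normal(size=(20, 5))
b = a @ rng.normal(size=(5, 3))   # features correlated with those of table a
c = rng.normal(size=(20, 4))      # unrelated features
print(rv_coefficient(a, b), rv_coefficient(a, c))   # high similarity vs. low similarity

A distance matrix for clustering can then be derived, for instance as 1 − RV(O_i, O_j) for every pair of extractors.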

I am not extending this section with further explanations about the MASH project or deeper details about the theory that we used to fulfil our engagements. I simply refer to the public deliverables of the project, where everything is carefully detailed (Govaert et al., 2010, 2011).


2 Regularization for Feature Selection

With the advances in technology, data is becoming larger and larger, resulting in high dimensional ensembles of information. Genomics, textual indexation and medical images are some examples of data that can easily exceed thousands of dimensions. The first experiments aiming to cluster the data from the MASH project (see Chapter 1) intended to work with the whole dimensionality of the samples. As the number of feature extractors rose, the numerical issues rose as well. Redundancy or extremely correlated features may happen if two contributors implement the same extractor under different names. When the number of features exceeded the number of samples, we started to deal with singular covariance matrices, whose inverses are not defined, while many algorithms in the field of Machine Learning make use of this statistic.

2.1 Motivations

There is a quite recent effort in the direction of handling high dimensional data. Traditional techniques can be adapted, but quite often large dimensions render those techniques useless. Linear Discriminant Analysis was shown to be no better than a "random guessing" of the object labels when the dimension is larger than the sample size (Bickel and Levina, 2004; Fan and Fan, 2008).

As a rule of thumb, in discriminant and clustering problems the complexity of calculus increases with the number of objects in the database, the number of features (dimensionality) and the number of classes or clusters. One way to reduce this complexity is to reduce the number of features. This reduction induces more robust estimators, allows faster learning and predictions in the supervised environment, and easier interpretations in the unsupervised framework. Removing features must be done wisely to avoid removing critical information.

When talking about dimensionality reduction, there are two families of techniques that could induce confusion:

- Reduction by feature transformation summarizes the dataset with fewer dimensions by creating combinations of the original attributes. These techniques are less effective when there are many irrelevant attributes (noise). Principal Component Analysis and Independent Component Analysis are two popular examples.

- Reduction by feature selection removes irrelevant dimensions, preserving the integrity of the informative features from the original dataset. The problem comes out when there is a restriction on the number of variables to preserve and discarding the exceeding dimensions leads to a loss of information. Prediction with feature selection is computationally cheaper because only relevant features are used, and the resulting models are easier to interpret. The Lasso operator is an example of this category.

Figure 2.1: Example of relevant features, from Chidlovskii and Lecerf (2008)

As a basic rule, we can use reduction techniques by feature transformation when the majority of the features are relevant, and when there is a lot of redundancy or correlation. On the contrary, feature selection techniques are useful when there are plenty of useless or noisy features (irrelevant information) that need to be filtered out. In the paper of Chidlovskii and Lecerf (2008) we find a great explanation about the difference between irrelevant and redundant features. The following two paragraphs are almost exact reproductions of their text.

"Irrelevant features are those which provide negligible distinguishing information. For example, if the objects are all dogs, cats or squirrels, and it is desired to classify each new animal into one of these three classes, the feature of color may be irrelevant if each of dogs, cats and squirrels have about the same distribution of brown, black and tan fur colors. In such a case, knowing that an input animal is brown provides negligible distinguishing information for classifying the animal as a cat, dog or squirrel. Features which are irrelevant for a given classification problem are not useful, and accordingly a feature that is irrelevant can be filtered out.

Redundant features are those which provide distinguishing information but are cumulative to another feature or group of features that provide substantially the same distinguishing information. Using the previous example, consider illustrative "diet" and "domestication" features. Dogs and cats both have similar carnivorous diets, while squirrels consume nuts and so forth. Thus the "diet" feature can efficiently distinguish squirrels from dogs and cats, although it provides little information to distinguish between dogs and cats. Dogs and cats are also both typically domesticated animals, while squirrels are wild animals. Thus the "domestication" feature provides substantially the same information as the "diet" feature, namely distinguishing squirrels from dogs and cats but not distinguishing between dogs and cats. Thus the "diet" and "domestication" features are cumulative, and one can identify one of these features as redundant, so as to be filtered out. However, unlike irrelevant features, care should be taken with redundant features to ensure that one retains enough of the redundant features to provide the relevant distinguishing information. In the foregoing example, one may wish to filter out either the "diet" feature or the "domestication" feature, but if one removes both the "diet" and the "domestication" features, then useful distinguishing information is lost."

Figure 2.2: The four key steps of feature selection, according to Liu and Yu (2005)

There are some tricks to build robust estimators when the number of features exceeds the number of samples. Ignoring some of the dependencies among variables and replacing the covariance matrix by a diagonal approximation are two of them. Another popular technique, and the one chosen in this thesis, is imposing regularity conditions.

2.2 Categorization of Feature Selection Techniques

Feature selection is one of the most frequent techniques for preprocessing data, in order to remove irrelevant, redundant or noisy features. Nevertheless, the risk of removing some informative dimensions is always there, thus the relevance of the remaining subset of features must be measured.

I reproduce here the scheme that generalizes any feature selection process, as shown by Liu and Yu (2005). Figure 2.2 provides a very intuitive scheme with the four key steps of a feature selection algorithm.

The classification of those algorithms can respond to different criteria. Guyon and Elisseeff (2003) propose a check list that summarizes the steps that may be taken to solve a feature selection problem, guiding the user through several techniques. Liu and Yu (2005) propose a framework that integrates supervised and unsupervised feature selection algorithms through a categorizing framework. Both references are excellent reviews to characterize feature selection techniques according to their characteristics. I propose a framework inspired by these references that does not cover all the possibilities, but which gives a good summary of the existing ones.

- Depending on the type of integration with the machine learning algorithm, we have:

  - Filter Models - Filter models work as a preprocessing step, using an independent evaluation criterion to select a subset of variables without assistance from the mining algorithm.

  - Wrapper Models - Wrapper models require a classification or clustering algorithm and use its prediction performance to assess the relevance of the subset selection. The feature selection is done in the optimization block while the feature subset evaluation is done in a different one. Therefore, the criterion to optimize and the criterion to evaluate may be different. Those algorithms are computationally expensive.

  - Embedded Models - They perform variable selection inside the learning machine, with the selection being made at the training step. That means that there is only one criterion: the optimization and the evaluation form a single block, and the features are selected to optimize this unique criterion, without needing to be re-evaluated in a later phase. That makes them more efficient, since no validation or test process is needed for every variable subset investigated. However, they are less universal, because they are specific to the training process of a given mining algorithm.

- Depending on the feature searching technique:

  - Complete - No subsets are missed from evaluation; this involves combinatorial searches.

  - Sequential - Features are added (forward searches) or removed (backward searches) one at a time.

  - Random - The initial subset, or even the subsequent subsets, are randomly chosen to escape local optima.

- Depending on the evaluation technique:

  - Distance Measures - Choosing the features that maximize the difference in separability, divergence or discrimination measures.

  - Information Measures - Choosing the features that maximize the information gain, that is, that minimize the posterior uncertainty.

  - Dependency Measures - Measuring the correlation between features.

  - Consistency Measures - Finding a minimum number of features that separate classes as consistently as the full set of features can.

  - Predictive Accuracy - Using the selected features to predict the labels.

  - Cluster Goodness - Using the selected features to perform clustering, and evaluating the result (cluster compactness, scatter separability, maximum likelihood).

The distance, information, correlation and consistency measures are typical of variable ranking algorithms, commonly used in filter models. Predictive accuracy and cluster goodness allow the evaluation of subsets of features and can be used in wrapper and embedded models.

In this thesis we developed some algorithms following the embedded paradigm, either in the supervised or the unsupervised framework. Integrating the subset selection problem in the overall learning problem may be computationally demanding, but it is appealing from a conceptual viewpoint: there is a perfect match between the formalized goal and the process dedicated to achieving this goal, thus avoiding many problems arising in filter or wrapper methods. Practically, it is however intractable to solve hard selection problems exactly when the number of features exceeds a few tens. Regularization techniques allow to provide a sensible approximate answer to the selection problem with reasonable computing resources, and their recent study has demonstrated powerful theoretical and empirical results. The following section introduces the tools that will be employed in Parts II and III.

2.3 Regularization

In the machine learning domain, the term "regularization" refers to a technique that introduces some extra assumptions or knowledge in the resolution of an optimization problem. The most popular point of view presents regularization as a mechanism to prevent overfitting, but it can also help to fix some numerical issues in ill-posed problems (like some matrix singularities when solving a linear system), besides other interesting properties like the capacity to induce sparsity, thus producing models that are easier to interpret.

An ill-posed problem violates the rules defined by Jacques Hadamard, according to whom the solution to a mathematical problem has to exist, be unique and be stable. This happens, for example, when the number of samples is smaller than their dimensionality and we try to infer some generic laws from such a small sample of the population. Regularization transforms an ill-posed problem into a well-posed one. To do that, some a priori knowledge is introduced in the solution through a regularization term that penalizes a criterion J with a penalty P. Below are the two most popular formulations:

min_β J(β) + λ P(β)    (2.1)

min_β J(β)
s.t. P(β) ≤ t    (2.2)

In expressions (2.1) and (2.2), the parameters λ and t have a similar function, that is, to control the trade-off between fitting the data to the model according to J(β) and the effect of the penalty P(β). The set for which the constraint in (2.2) is verified, {β : P(β) ≤ t}, is called the admissible set. The penalty term can also be understood as a measure that quantifies the complexity of the model (as in the definition of Sammut and Webb, 2010). Note that regularization terms can also be interpreted in the Bayesian paradigm as prior distributions on the parameters of the model. In this thesis both views will be taken.

In this section I review pure, mixed and hybrid penalties that will be used in the following chapters to implement feature selection. I first list important properties that may pertain to any type of penalty.



Figure 2.3: Admissible sets in two dimensions for different pure norms ‖β‖_p

2.3.1 Important Properties

Penalties may have different properties that can be more or less interesting depending on the problem and the expected solution. The most important properties for our purposes here are convexity, sparsity and stability.

Convexity. Regarding optimization, convexity is a desirable property that eases finding global solutions. A convex function verifies

∀(x_1, x_2) ∈ X², f(t x_1 + (1 − t) x_2) ≤ t f(x_1) + (1 − t) f(x_2)    (2.3)

for any value of t ∈ [0, 1]. Replacing the inequality by a strict inequality, we obtain the definition of strict convexity. A regularized expression like (2.2) is convex if the function J(β) and the penalty P(β) are both convex.

Sparsity. Null coefficients usually furnish models that are easier to interpret. When sparsity does not harm the quality of the predictions, it is a desirable property, which moreover entails less memory usage and fewer computational resources.

Stability. There are numerous notions of stability or robustness, which measure how the solution varies when the input is perturbed by small changes. This perturbation can be adding, removing or replacing a few elements in the training set. Adding regularization, in addition to preventing overfitting, is a means to favor the stability of the solution.

2.3.2 Pure Penalties

For pure penalties, defined as P(β) = ‖β‖_p, convexity holds for p ≥ 1. This is graphically illustrated in Figure 2.3, borrowed from Szafranski (2008), whose Chapter 3 is an excellent review of regularization techniques and of the algorithms to solve them. In this figure, the shape of the admissible sets corresponding to different pure penalties is greyed out. Since the convexity of the penalty corresponds to the convexity of the set, we see that this property is verified for p ≥ 1.

Figure 2.4: Two-dimensional regularized problems with ‖β‖_1 and ‖β‖_2 penalties

Regularizing a linear model with a norm like ‖β‖_p means that the larger the component |β_j|, the more important the feature x_j in the estimation. On the contrary, the closer it is to zero, the more dispensable it is. In the limit, if |β_j| = 0, x_j is not involved in the model. If many dimensions can be dismissed, then we can speak of sparsity.

A graphical interpretation of sparsity, borrowed from Marie Szafranski, is given in Figure 2.4. In a 2D problem, a solution can be considered as sparse if any of its components (β_1 or β_2) is null, that is, if the optimal β is located on one of the coordinate axes. Let us consider a search algorithm that minimizes an expression like (2.2), where J(β) is a quadratic function. When the solution to the unconstrained problem does not belong to the admissible set defined by P(β) (greyed out area), the solution to the constrained problem is as close as possible to the global minimum of the cost function inside the grey region. Depending on the shape of this region, the probability of having a sparse solution varies. A region with vertexes, such as the one corresponding to an L1 penalty, has more chances of inducing sparse solutions than that of an L2 penalty. This idea is displayed in Figure 2.4, where J(β) is a quadratic function represented with three isolevel curves whose global minimum β^ls is outside the penalties' admissible region. The closest point to this β^ls for the L1 regularization is β^l1, and for the L2 regularization it is β^l2. Solution β^l1 is sparse because its second component is zero, while both components of β^l2 are different from zero.

After reviewing the regions from Figure 2.3, we can relate the capacity of generating sparse solutions to the quantity and the "sharpness" of the vertexes of the greyed out area. For example, an L_{1/3} penalty has a support region with sharper vertexes that would induce a sparse solution even more strongly than an L1 penalty; however, the non-convex shape of the L_{1/3} ball results in difficulties during optimization that will not happen with a convex shape.



To summarize, a convex problem with a sparse solution is desired. But with pure penalties, sparsity is only possible with Lp norms with p ≤ 1, since they are the only ones that have vertexes. On the other side, only norms with p ≥ 1 are convex, hence the only pure penalty that yields a convex problem with a sparse solution is the L1 penalty.

L0 Penalties. The L0 pseudo-norm of a vector β is defined as the number of entries different from zero, that is, P(β) = ‖β‖_0 = card{β_j | β_j ≠ 0}:

min_β J(β)
s.t. ‖β‖_0 ≤ t    (2.4)

where the parameter t represents the maximum number of non-zero coefficients in vector β. The larger the value of t (or the lower the value of λ, if we use the equivalent expression (2.1)), the fewer the number of zeros induced in vector β. If t is equal to the dimensionality of the problem (or if λ = 0), then the penalty term is not effective and β is not altered. In general, the computation of the solutions relies on combinatorial optimization schemes; their solutions are sparse but unstable.

L1 Penalties. Penalties built using the L1 norm induce sparsity and stability. The resulting estimator has been named the Lasso (Least Absolute Shrinkage and Selection Operator) by Tibshirani (1996):

min_β J(β)
s.t. Σ_{j=1}^p |β_j| ≤ t    (2.5)

Despite all the advantages of the Lasso, the choice of the right penalty is not just a question of convexity and sparsity. For example, concerning the Lasso, Osborne et al. (2000a) have shown that when the number of examples n is lower than the number of variables p, the maximum number of non-zero entries of β is n. Therefore, if there is a strong correlation between several variables, this penalty risks dismissing all but one, resulting in a hardly interpretable model. In a field like genomics, where n is typically some tens of individuals and p several thousands of genes, the performance of the algorithm and the interpretability of the genetic relationships are severely limited.
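The following small illustration, not taken from the thesis and assuming scikit-learn is available, shows this saturation effect empirically: with n = 20 examples and p = 100 features, of which 30 are relevant, the number of non-zero Lasso coefficients remains bounded by n whatever the penalty level.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 20, 100
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:30] = 1.0                                # 30 relevant features, more than n
y = X @ beta_true + 0.1 * rng.normal(size=n)

for alpha in (0.5, 0.1, 0.01):
    model = Lasso(alpha=alpha, max_iter=100_000).fit(X, y)
    n_selected = int(np.sum(model.coef_ != 0))
    print(alpha, n_selected)                        # stays at or below n = 20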

The Lasso is a popular tool that has been used in multiple contexts besides regression, particularly in the field of feature selection in supervised classification (Mai et al., 2012; Witten and Tibshirani, 2011) and clustering (Roth and Lange, 2004; Pan et al., 2006; Pan and Shen, 2007; Zhou et al., 2009; Guo et al., 2010; Witten and Tibshirani, 2010; Bouveyron and Brunet, 2012b,a).

The consistency of the problems regularized by a Lasso penalty is also a key feature, defining consistency as the capability of always making the right choice of relevant variables when the number of individuals is infinitely large. Leng et al. (2006) have shown that when the penalty parameter (t or λ, depending on the formulation) is chosen by minimization of the prediction error, the Lasso penalty does not lead to consistent models. There is a large bibliography defining conditions under which Lasso estimators become consistent (Knight and Fu, 2000; Donoho et al., 2006; Meinshausen and Buhlmann, 2006; Zhao and Yu, 2007; Bach, 2008). In addition to those papers, some authors have introduced modifications to improve the interpretability and the consistency of the Lasso, such as the adaptive Lasso (Zou, 2006).

L2 Penalties. The graphical interpretation of pure norm penalties in Figure 2.3 shows that this norm does not induce sparsity, due to its lack of vertexes. Strictly speaking, the L2 norm involves the square root of the sum of all squared components. In practice, when using L2 penalties, the square of the norm is used to avoid the square root and solve a linear system. Thus, an L2 penalized optimization problem looks like

min_β J(β) + λ ‖β‖²_2    (2.6)

The effect of this penalty is the "equalization" of the components of the parameter that is being penalized. To enlighten this property, let us consider a least squares problem

min_β Σ_{i=1}^n (y_i − x_i^T β)²    (2.7)

with solution β^ls = (X^T X)^{-1} X^T y. If some input variables are highly correlated, the estimator β^ls is very unstable. To fix this numerical instability, Hoerl and Kennard (1970) proposed ridge regression, which regularizes Problem (2.7) with a quadratic penalty:

min_β Σ_{i=1}^n (y_i − x_i^T β)² + λ Σ_{j=1}^p β_j²

The solution to this problem is β^l2 = (X^T X + λ I_p)^{-1} X^T y. All eigenvalues, in particular the small ones corresponding to the correlated dimensions, are now moved upwards by λ. This can be enough to avoid the instability induced by small eigenvalues. This "equalization" of the coefficients reduces the variability of the estimation, which may improve performances.

As with the Lasso operator, there are several variations of ridge regression. For example, Breiman (1995) proposed the nonnegative garrotte, which looks like a ridge regression where each variable is penalized adaptively. To do that, the least squares solution is used to define the penalty parameter attached to each coefficient:

min_β Σ_{i=1}^n (y_i − x_i^T β)² + λ Σ_{j=1}^p β_j² / (β_j^ls)²    (2.8)

The effect is an elliptic admissible set instead of the ball of ridge regression. Another example is adaptive ridge regression (Grandvalet, 1998; Grandvalet and Canu, 2002), where the penalty parameter differs for each component: every λ_j is optimized to penalize more or less, depending on the influence of β_j in the model.

Although L2 penalized problems are stable, they are not sparse. That makes those models harder to interpret, mainly in high dimensions.

L∞ Penalties. A special case of Lp norms is the infinity norm, defined as ‖x‖_∞ = max(|x_1|, |x_2|, ..., |x_p|). The admissible region for a penalty like ‖β‖_∞ ≤ t is displayed in Figure 2.3: for the L∞ norm, the greyed out region is a square containing all the β vectors whose largest coefficient is less than or equal to the value of the penalty parameter t.

This norm is not commonly used as a regularization term by itself; however, it is frequently combined in mixed penalties, as shown in Section 2.3.4. In addition, in the optimization of penalized problems there exists the concept of dual norms. Dual norms arise in the analysis of estimation bounds and in the design of algorithms that address optimization problems by solving an increasing sequence of small subproblems (working set algorithms). The dual norm plays a direct role in computing optimality conditions of sparse regularized problems. The dual norm ‖β‖_* of a norm ‖β‖ is defined as

‖β‖_* = max_{w ∈ R^p} β^T w   s.t. ‖w‖ ≤ 1

In the case of an Lq norm with q ∈ [1, +∞], the dual norm is the Lr norm such that 1/q + 1/r = 1. For example, the L2 norm is self-dual and the dual norm of the L1 norm is the L∞ norm. This is one of the reasons why L∞ is so important, even if it is not as popular as a penalty as the L1 norm is. An extensive explanation about dual norms and the algorithms that make use of them can be found in Bach et al. (2011).

2.3.3 Hybrid Penalties

There is no reason for using pure penalties in isolation. We can combine them and try to obtain different benefits from each of them. The most popular example is the Elastic net regularization (Zou and Hastie, 2005), whose objective is to improve the Lasso penalization when n ≤ p. As recalled in Section 2.3.2, when n ≤ p the Lasso penalty can select at most n non-null features. Thus, in situations where there are more relevant variables, the Lasso penalty risks selecting only some of them. To avoid this effect, a combination of L1 and L2 penalties has been proposed. For the least squares example (2.7) of Section 2.3.2, the Elastic net is

min_β Σ_{i=1}^n (y_i − x_i^T β)² + λ_1 Σ_{j=1}^p |β_j| + λ_2 Σ_{j=1}^p β_j²    (2.9)

The term in λ_1 is a Lasso penalty that induces sparsity in vector β; on the other side, the term in λ_2 is a ridge regression penalty that provides universal strong consistency (De Mol et al., 2009), that is, the asymptotic capability (when n goes to infinity) of always making the right choice of relevant variables.



2.3.4 Mixed Penalties

Imagine a linear regression problem where each variable is a gene. Depending on the application, several biological processes can be identified by L different groups of genes. Let us denote by G_ℓ the group of genes for the ℓth process and by d_ℓ the number of genes (variables) in each group, for all ℓ ∈ {1, ..., L}. Thus, the dimension of vector β will be the sum of the number of genes of every group, dim(β) = Σ_{ℓ=1}^L d_ℓ. Mixed norms are a type of norms that take those groups into consideration. The general expression is shown below:

‖β‖_(r,s) = ( Σ_ℓ ( Σ_{j ∈ G_ℓ} |β_j|^s )^{r/s} )^{1/r}    (2.10)

The pair (r, s) identifies the norms that are combined: an Ls norm within groups and an Lr norm between groups. The Ls norm penalizes the variables in every group G_ℓ, while the Lr norm penalizes the within-group norms. The pair (r, s) is set so as to induce different properties in the resulting β vector. Note that the outer norm is often weighted to adjust for the different cardinalities of the groups, in order to avoid favoring the selection of the largest groups.
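As a concrete illustration of Equation (2.10), the following sketch (the function and the toy groups are mine) computes the mixed norm for an arbitrary partition of the coefficients, with the popular (r, s) = (1, 2) group-Lasso case as the default:

import numpy as np

def mixed_norm(beta, groups, r=1.0, s=2.0):
    # groups: list of index arrays G_1, ..., G_L partitioning the coordinates of beta
    within = np.array([np.sum(np.abs(beta[g]) ** s) ** (1.0 / s) for g in groups])
    return np.sum(within ** r) ** (1.0 / r)

beta = np.array([0.0, 0.0, 1.5, -2.0, 0.3, 0.0])
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]
print(mixed_norm(beta, groups, r=1, s=2))   # group-Lasso penalty: 0 + 2.5 + 0.3 = 2.8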

Several combinations are available; the most popular is the norm ‖β‖_(1,2), known as the group-Lasso (Yuan and Lin, 2006; Leng, 2008; Xie et al., 2008a,b; Meier et al., 2008; Roth and Fischer, 2008; Yang et al., 2010; Sanchez Merchante et al., 2012). Figure 2.5 shows the difference between the admissible sets of a pure L1 norm and a mixed L_{1,2} norm. Many other mixings are possible, such as ‖β‖_(1,4/3) (Szafranski et al., 2008) or ‖β‖_(1,∞) (Wang and Zhu, 2008; Kuan et al., 2010; Vogt and Roth, 2010). Modifications of mixed norms have also been proposed, such as the group bridge penalty (Huang et al., 2009), the composite absolute penalties (Zhao et al., 2009), or combinations of mixed and pure norms such as Lasso and group-Lasso (Friedman et al., 2010; Sprechmann et al., 2010) or group-Lasso and ridge penalty (Ng and Abugharbieh, 2011).

2.3.5 Sparsity Considerations

In this chapter I have reviewed several possibilities to induce sparsity in the solution of optimization problems. However, having sparse solutions does not always lead to parsimonious models feature-wise. For example, if we have four parameters per feature, we look for solutions where all four parameters are null for non-informative variables.

The Lasso and the other L1 penalties encourage solutions such as the one on the left of Figure 2.6. If the objective is sparsity, then the L1 norm does the job. However, if we aim at feature selection and the number of parameters per variable exceeds one, this type of sparsity does not target the removal of variables.

To be able to dismiss some features, the sparsity pattern must encourage null values for the same variable across parameters, as shown on the right of Figure 2.6. This can be achieved with mixed penalties that define groups of features. For example, L_{1,2} or L_{1,∞} mixed norms with the proper definition of groups can induce sparsity patterns such as the one on the right of Figure 2.6, which displays a solution where variables 3, 5 and 8 are removed.

Figure 2.5: Admissible sets for (a) the Lasso (L1) and (b) the group-Lasso (L_{1,2})

Figure 2.6: Sparsity patterns for an example with 8 variables characterized by 4 parameters: (a) L1-induced sparsity, (b) L_{1,2} group-induced sparsity

2.3.6 Optimization Tools for Regularized Problems

In Caramanis et al. (2012) there is a good collection of mathematical techniques and optimization methods to solve regularized problems. Another good reference is the thesis of Szafranski (2008), which also reviews some techniques classified into four categories. Those techniques, even if they belong to different categories, can be used separately or combined to produce improved optimization algorithms.

In fact, the algorithm implemented in this thesis is inspired by three of those techniques. It could be defined as an algorithm of "active constraints", implemented following a regularization path that is updated by approaching the cost function with secant hyper-planes. Deeper details are given in the dedicated Chapter 5.

Subgradient Descent. Subgradient descent is a generic optimization method that can be used for penalized problems where the subgradient of the loss function, ∂J(β), and the subgradient of the regularizer, ∂P(β), can be computed efficiently. On the one hand, it is essentially blind to the problem structure. On the other hand, many iterations are needed, so the convergence is slow and the solutions are not sparse. Basically, it is a generalization of the iterative gradient descent algorithm, where the solution vector β^(t+1) is updated proportionally to the negative subgradient of the function at the current point β^(t):

β^(t+1) = β^(t) − α (s + λ s′),   where s ∈ ∂J(β^(t)), s′ ∈ ∂P(β^(t))
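A minimal sketch of this update for the Lasso criterion (my own illustration, with a fixed step size α and the least-squares loss as J) shows the typical behaviour: the iterates approach the optimum but the coefficients are rarely exactly zero.

import numpy as np

def subgradient_lasso(X, y, lam, alpha=1e-3, n_iter=5000):
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        s = 2 * X.T @ (X @ beta - y)        # gradient of the quadratic loss J
        s_prime = np.sign(beta)             # a subgradient of ||.||_1 (0 at beta_j = 0)
        beta -= alpha * (s + lam * s_prime)
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 8))
y = X[:, 0] - 2 * X[:, 1] + 0.05 * rng.normal(size=40)
print(np.round(subgradient_lasso(X, y, lam=5.0), 3))   # small, but rarely exactly zero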

Coordinate Descent. Coordinate descent is based on the first-order optimality conditions of criterion (2.1). In the case of penalties like the Lasso, setting to zero the first-order derivative with respect to coefficient β_j gives

β_j = ( −λ sign(β_j) − ∂J(β)/∂β_j ) / ( 2 Σ_{i=1}^n x_ij² )

In the literature, those algorithms can also be referred to as "iterative thresholding" algorithms, because the optimization can be solved by soft-thresholding in an iterative process. As an example, Fu (1998) implements this technique, initializing every coefficient with the least squares solution β^ls and updating their values using an iterative thresholding algorithm where β_j^(t+1) = S_λ(∂J(β^(t))/∂β_j). The objective function is optimized with respect to one variable at a time, while all others are kept fixed:

S_λ(∂J(β)/∂β_j) =
  ( λ − ∂J(β)/∂β_j ) / ( 2 Σ_{i=1}^n x_ij² )     if ∂J(β)/∂β_j > λ
  ( −λ − ∂J(β)/∂β_j ) / ( 2 Σ_{i=1}^n x_ij² )    if ∂J(β)/∂β_j < −λ
  0                                              if |∂J(β)/∂β_j| ≤ λ    (2.11)

The same principles define "block-coordinate descent" algorithms. In that case, the first-order derivatives are applied to the equations of a group-Lasso penalty (Yuan and Lin, 2006; Wu and Lange, 2008).

Active and Inactive Sets. Active set algorithms are also referred to as "active constraints" or "working set" methods. These algorithms define a subset of variables called the "active set", which stores the indices of variables with non-zero β_j; it is usually denoted A. The complement of the active set is the "inactive set", denoted Ā, which contains the indices of the variables whose β_j is zero. Thus, the problem can be simplified to the dimensionality of A.

Osborne et al. (2000a) proposed the first of those algorithms to solve quadratic problems with Lasso penalties. Their algorithm starts from an empty active set that is updated incrementally (forward growing). There also exists a backward view, where relevant variables are allowed to leave the active set; however, the forward philosophy that starts with an empty A has the advantage that the first calculations are low dimensional. In addition, the forward view fits better the feature selection intuition, where few features are intended to be selected.

Working set algorithms have to deal with three main tasks. There is an optimization task, where a minimization problem has to be solved using only the variables from the active set; Osborne et al. (2000a) solve a linear approximation of the original problem to determine the descent direction of the objective function, but any other method can be considered. In general, as the solutions of successive active sets are typically close to each other, it is a good idea to use the solution of the previous iteration as the initialization of the current one (warm start). Besides the optimization task, there is a working set update task, where the active set A is augmented with the variable from the inactive set Ā that most violates the optimality conditions of Problem (2.1). Finally, there is also a task to compute the optimality conditions. Their expressions are essential in the selection of the next variable to add to the active set and to test whether a particular vector β is a solution of Problem (2.1).
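The three tasks can be arranged in a very short loop. The sketch below is a schematic illustration of this general strategy on the Lasso (it is not the GLOSS implementation, and the inner solver is simply the coordinate descent of the previous sketch, restricted to the active variables and warm-started).

import numpy as np

def soft_threshold(z, gamma):
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def lasso_on_subset(X, y, lam, beta0, n_sweeps=500):
    # Optimization task: coordinate descent on the active variables only (warm started).
    beta = beta0.copy()
    col_sq = 2 * np.sum(X ** 2, axis=0)
    for _ in range(n_sweeps):
        for j in range(X.shape[1]):
            r = y - X @ beta + X[:, j] * beta[j]
            beta[j] = soft_threshold(2 * X[:, j] @ r, lam) / col_sq[j]
    return beta

def working_set_lasso(X, y, lam, max_outer=50, tol=1e-6):
    p = X.shape[1]
    beta = np.zeros(p)
    active = []                                   # active set A, grown forward from empty
    for _ in range(max_outer):
        grad = -2 * X.T @ (y - X @ beta)          # used by the optimality-condition task
        violation = np.abs(grad) - lam            # |grad_j| <= lam must hold when beta_j = 0
        if active:
            violation[active] = -np.inf           # inspect only the inactive set
        j = int(np.argmax(violation))
        if violation[j] <= tol:
            break                                 # no violated condition: beta is optimal
        active.append(j)                          # working-set update task
        beta[active] = lasso_on_subset(X[:, active], y, lam, beta[active])
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 30))
y = 2 * X[:, 5] - 3 * X[:, 17] + 0.1 * rng.normal(size=60)
print(np.nonzero(working_set_lasso(X, y, lam=10.0))[0])   # typically recovers features 5 and 17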

These active constraints or working set methods, even if they were originally proposed to solve L1 regularized quadratic problems, can also be adapted to generic functions and penalties: for example, linear functions and L1 penalties (Roth, 2004), linear functions and L_{1,2} penalties (Roth and Fischer, 2008), or even logarithmic cost functions and combinations of L0, L1 and L2 penalties (Perkins et al., 2003). The algorithm developed in this work belongs to this family of solutions.

Hyper-Planes Approximation. Hyper-plane approximations solve a regularized problem using a piecewise linear approximation of the original cost function. This convex approximation is built using several secant hyper-planes at different points, obtained from the subgradient of the cost function at these points.

This family of algorithms implements an iterative mechanism where the number of hyper-planes increases at every iteration. These techniques are useful with large populations, since the number of iterations needed to converge does not depend on the size of the dataset. On the contrary, if few hyper-planes are used, then the quality of the approximation is not good enough and the solution can be unstable.

This family of algorithms is not as popular as the previous one, but some examples can be found in the domain of Support Vector Machines (Joachims, 2006; Smola et al., 2008; Franc and Sonnenburg, 2008) or Multiple Kernel Learning (Sonnenburg et al., 2006).

Regularization Path. The regularization path is the set of solutions that can be reached when solving a series of optimization problems of the form (2.1) where the penalty parameter λ is varied. It is not an optimization technique per se, but it is of practical use when the exact regularization path can be easily followed. Rosset and Zhu (2007) stated that this path is piecewise linear for those problems where the cost function is piecewise quadratic and the regularization term is piecewise linear (or vice versa).

This concept was first applied to the Lasso algorithm of Osborne et al. (2000b). However, it was after the publication of the algorithm called Least Angle Regression (LARS), developed by Efron et al. (2004), that those techniques became popular. LARS defines the regularization path using active constraint techniques.

Once an active set A^(t) and its corresponding solution β^(t) have been set, looking for the regularization path means looking for a direction h and a step size γ to update the solution as β^(t+1) = β^(t) + γh. Afterwards, the active and inactive sets A^(t+1) and Ā^(t+1) are updated. This can be done by looking for the variables that most strongly violate the optimality conditions. Hence, LARS sets the update step size, and decides which variable should enter the active set, from the correlation with the residuals.
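For a quick look at the piecewise-linear Lasso path, scikit-learn's LARS implementation can be used; this is only an illustration under the assumption that scikit-learn is available, not the path-following code developed in this thesis.

import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 10))
y = 4 * X[:, 0] - 2 * X[:, 6] + 0.1 * rng.normal(size=80)

alphas, active, coefs = lars_path(X, y, method="lasso")
print(active)          # order in which variables enter the active set
print(coefs.shape)     # (p, number of kinks): coefficients at each breakpoint of the path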

Proximal Methods. Proximal methods optimize an objective function of the form (2.1), resulting from the addition of a Lipschitz-differentiable cost function J(β) and a non-differentiable penalty λP(β):

min_{β ∈ R^p} J(β^(t)) + ∇J(β^(t))^T (β − β^(t)) + λ P(β) + (L/2) ‖β − β^(t)‖²_2    (2.12)

They are also iterative methods, where the cost function J(β) is linearized in the proximity of the solution β, so that the problem to solve at each iteration looks like (2.12), where the parameter L > 0 should be an upper bound on the Lipschitz constant of the gradient ∇J. This can be rewritten as

min_{β ∈ R^p} (1/2) ‖β − (β^(t) − (1/L) ∇J(β^(t)))‖²_2 + (λ/L) P(β)    (2.13)

The basic algorithm uses the solution to (2.13) as the next value β^(t+1). However, there are faster versions that take advantage of information about previous steps, such as the ones described by Nesterov (2007) or the FISTA algorithm (Beck and Teboulle, 2009). Proximal methods can be seen as generalizations of gradient updates: in fact, setting λ = 0 in equation (2.13) recovers the standard gradient update rule.
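For the Lasso, the solution of (2.13) is a soft-thresholded gradient step, which gives the ISTA-style sketch below (my own illustration; L is taken as the exact Lipschitz constant of the gradient of the quadratic loss).

import numpy as np

def ista_lasso(X, y, lam, n_iter=500):
    beta = np.zeros(X.shape[1])
    L = 2 * np.linalg.norm(X, 2) ** 2              # Lipschitz constant of grad ||y - X b||^2
    for _ in range(n_iter):
        grad = -2 * X.T @ (y - X @ beta)
        z = beta - grad / L                         # gradient step
        beta = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)   # proximal (soft-threshold) step
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 12))
y = X[:, 2] - 2 * X[:, 9] + 0.1 * rng.normal(size=60)
print(np.round(ista_lasso(X, y, lam=10.0), 3))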


Part II

Sparse Linear Discriminant Analysis


Abstract

Linear discriminant analysis (LDA) aims to describe data by a linear combination of features that best separates the classes. It may be used for classifying future observations or for describing those classes.

There is a vast bibliography about sparse LDA methods, reviewed in Chapter 3. Sparsity is typically induced by regularizing the discriminant vectors or the class means with L1 penalties (see Chapter 2). Section 2.3.5 discussed why this sparsity-inducing penalty may not guarantee parsimonious models regarding variables.

In this part we develop the group-Lasso Optimal Scoring Solver (GLOSS), which addresses a sparse LDA problem globally, through a regression approach of LDA. Our analysis, presented in Chapter 4, formally relates GLOSS to Fisher's discriminant analysis and also enables us to derive variants, such as LDA assuming a diagonal within-class covariance structure (Bickel and Levina, 2004). The group-Lasso penalty selects the same features in all discriminant directions, leading to a more interpretable low-dimensional representation of the data. The discriminant directions can be used in their totality, or the first ones may be chosen to produce a reduced-rank classification. The first two or three directions can also be used to project the data, generating a graphical display of the data. The algorithm is detailed in Chapter 5, and our experimental results of Chapter 6 demonstrate that, compared to the competing approaches, the models are extremely parsimonious without compromising prediction performance. The algorithm efficiently processes medium to large numbers of variables and is thus particularly well suited to the analysis of gene expression data.


3 Feature Selection in Fisher Discriminant Analysis

3.1 Fisher Discriminant Analysis

Linear discriminant analysis (LDA) aims to describe n labeled observations belonging to K groups by a linear combination of features which characterizes or separates the classes. It is used for two main purposes: classifying future observations, or describing the essential differences between classes, either by providing a visual representation of the data or by revealing the combinations of features that discriminate between classes. There are several frameworks in which linear combinations can be derived; Friedman et al. (2009) dedicate a whole chapter to linear methods for classification. In this part we focus on Fisher's discriminant analysis, which is a standard tool for linear discriminant analysis whose formulation does not rely on posterior probabilities but rather on some inertia principles (Fisher, 1936).

We consider that the data consist of a set of n examples, with observations x_i ∈ R^p comprising p features, and labels y_i ∈ {0, 1}^K indicating the exclusive assignment of observation x_i to one of the K classes. It will be convenient to gather the observations in the n×p matrix X = (x_1^T, ..., x_n^T)^T and the corresponding labels in the n×K matrix Y = (y_1^T, ..., y_n^T)^T.

Fisher's discriminant problem was first proposed for two-class problems, for the analysis of the famous iris dataset, as the maximization of the ratio of the projected between-class covariance to the projected within-class covariance:

max_{β ∈ R^p} (β^T Σ_B β) / (β^T Σ_W β)    (3.1)

where β is the discriminant direction used to project the data, and Σ_B and Σ_W are the p×p between-class and within-class covariance matrices, respectively defined (for a K-class problem) as

Σ_W = (1/n) Σ_{k=1}^K Σ_{i ∈ G_k} (x_i − μ_k)(x_i − μ_k)^T

Σ_B = (1/n) Σ_{k=1}^K Σ_{i ∈ G_k} (μ − μ_k)(μ − μ_k)^T

where μ is the sample mean of the whole dataset, μ_k is the sample mean of class k, and G_k indexes the observations of class k.
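These two scatter matrices are straightforward to compute; the NumPy sketch below (my own, with hypothetical toy data) follows the definitions above directly.

import numpy as np

def class_covariances(X, y):
    # X: n x p data matrix, y: vector of class labels; returns (Sigma_W, Sigma_B).
    n, p = X.shape
    mu = X.mean(axis=0)
    sigma_w = np.zeros((p, p))
    sigma_b = np.zeros((p, p))
    for k in np.unique(y):
        Xk = X[y == k]                              # observations of class k (the set G_k)
        mu_k = Xk.mean(axis=0)
        sigma_w += (Xk - mu_k).T @ (Xk - mu_k)
        sigma_b += len(Xk) * np.outer(mu - mu_k, mu - mu_k)
    return sigma_w / n, sigma_b / n

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=m, size=(30, 4)) for m in (0.0, 2.0, 4.0)])
y = np.repeat([0, 1, 2], 30)
Sw, Sb = class_covariances(X, y)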



This analysis can be extended to the multi-class framework with K groups. In this case, K − 1 discriminant vectors β_k may be computed. Such a generalization was first proposed by Rao (1948). Several formulations of the multi-class Fisher's discriminant are available, for example as the maximization of a trace ratio:

max_{B ∈ R^{p×(K−1)}} tr(B^T Σ_B B) / tr(B^T Σ_W B)    (3.2)

where the matrix B is built with the discriminant directions β_k as columns. Solving the multi-class criterion (3.2) is an ill-posed problem; a better formulation is based on a series of K − 1 subproblems:

max_{β_k ∈ R^p} β_k^T Σ_B β_k
s.t. β_k^T Σ_W β_k ≤ 1
     β_k^T Σ_W β_ℓ = 0, ∀ℓ < k    (3.3)

The maximizer of subproblem k is the eigenvector of Σ_W^{-1} Σ_B associated to the kth largest eigenvalue (see Appendix C).

3.2 Feature Selection in LDA Problems

LDA is often used as a data reduction technique, where the K − 1 discriminant directions summarize the p original variables. However, all variables intervene in the definition of these discriminant directions, and this behavior may be troublesome.

Several modifications of LDA have been proposed to generate sparse discriminant directions. Sparse LDA reveals discriminant directions that only involve a few variables. This sparsity has as its main target the reduction of the dimensionality of the problem (as in genetic analysis), but parsimonious classification is also motivated by the need for interpretable models, robustness in the solution, or computational constraints.

The easiest approach to sparse LDA performs variable selection before discrimination. The relevance of each feature is usually based on univariate statistics, which are fast and convenient to compute, but whose very partial view of the overall classification problem may lead to dramatic information loss. As a result, several approaches have been devised in recent years to construct LDA with wrapper and embedded feature selection capabilities.

They can be categorized according to the LDA formulation that provides the basis for the sparsity-inducing extension, that is, either Fisher's discriminant analysis (variance-based) or a regression-based formulation.

3.2.1 Inertia Based

The Fisher discriminant seeks a projection maximizing the separability of classes from inertia principles: mass centers should be far away (large between-class variance) and classes should be concentrated around their mass centers (small within-class variance). This view motivates a first series of sparse LDA formulations.

Moghaddam et al. (2006) propose an algorithm for sparse LDA in binary classification, where sparsity originates from a hard cardinality constraint. The formalization is based on the Fisher's discriminant (3.1), reformulated as a quadratically-constrained quadratic program (3.3). Computationally, the algorithm implements a combinatorial search with some eigenvalue properties that are used to avoid exploring subsets of possible solutions. Extensions of this approach have been developed with new sparsity bounds for the two-class discrimination problem and shortcuts to speed up the evaluation of eigenvalues (Moghaddam et al., 2007).

Also for binary problems, Wu et al. (2009) proposed a sparse LDA applied to gene expression data, where the Fisher's discriminant (3.1) is solved as

min_{β ∈ R^p} β^T Σ_W β
s.t. (μ_1 − μ_2)^T β = 1
     Σ_{j=1}^p |β_j| ≤ t

where μ_1 and μ_2 are the vectors of mean gene expression values corresponding to the two groups. The expression to optimize and the first constraint match Problem (3.1); the second constraint encourages parsimony.

Witten and Tibshirani (2011) describe a multi-class technique using the Fisher's discriminant rewritten in the form of K − 1 constrained and penalized maximization problems:

max_{β_k ∈ R^p} β_k^T Σ_B^k β_k − P_k(β_k)
s.t. β_k^T Σ_W β_k ≤ 1

The term to maximize is the projected between-class covariance β_k^T Σ_B β_k, subject to an upper bound on the projected within-class covariance β_k^T Σ_W β_k. The penalty P_k(β_k) is added to avoid singularities and induce sparsity. The authors suggest weighted versions of the regular Lasso and fused Lasso penalties for general purpose data: the Lasso shrinks to zero the less informative variables, and the fused Lasso encourages a piecewise constant β_k vector. The R code is available from the website of Daniela Witten.

Cai and Liu (2011) use the Fisher's discriminant to solve a binary LDA problem. But instead of performing separate estimations of Σ_W and (μ_1 − μ_2) to obtain the optimal solution β = Σ_W^{-1}(μ_1 − μ_2), they estimate the product directly, through constrained L1 minimization:

min_{β ∈ R^p} ‖β‖_1
s.t. ‖Σβ − (μ_1 − μ_2)‖_∞ ≤ λ

Sparsity is encouraged by the L1 norm of vector β, and the parameter λ is used to tune the optimization.



Most of the algorithms reviewed are conceived for binary classification. And for those that are envisaged for multi-class scenarios, the Lasso is the most popular way to induce sparsity; however, as discussed in Section 2.3.5, the Lasso is not the best tool to encourage parsimonious models when there are multiple discriminant directions.

3.2.2 Regression Based

In binary classification, LDA has been known to be equivalent to linear regression of scaled class labels since Fisher (1936). For K > 2, many studies show that multivariate linear regression of a specific class indicator matrix can be applied as a preprocessing step for LDA. However, directly casting LDA as a least squares problem is challenging for the multi-class case (Duda et al., 2000; Friedman et al., 2009).

Predefined Indicator Matrix

Multi-class classification is usually linked with linear regression through the definition of an indicator matrix (Friedman et al., 2009). An indicator matrix Y is an n×K matrix with the class labels for all samples. There are several well-known types in the literature. For example, the binary or dummy indicator (y_ik = 1 if sample i belongs to class k and y_ik = 0 otherwise) is commonly used in linking multi-class classification with linear regression (Friedman et al., 2009). Another "popular" choice is y_ik = 1 if sample i belongs to class k and y_ik = −1/(K − 1) otherwise; it was used, for example, in extending Support Vector Machines to multi-class classification (Lee et al., 2004) or for generalizing the kernel target alignment measure (Guermeur et al., 2004).

There are some efforts which propose a formulation of the least squares problem based on a new class indicator matrix (Ye, 2007). This new indicator matrix allows the definition of LS-LDA (Least Squares Linear Discriminant Analysis), which holds a rigorous equivalence with multi-class LDA under a mild condition, shown empirically to hold in many applications involving high-dimensional data.

Qiao et al. (2009) propose a discriminant analysis in the high-dimensional, low-sample-size setting, which incorporates variable selection in a Fisher's LDA formulated as a generalized eigenvalue problem, then recast as a least squares regression. Sparsity is obtained by means of a Lasso penalty on the discriminant vectors. Even if this is not mentioned in the article, their formulation looks very close in spirit to Optimal Scoring regression; some rather clumsy steps in the developments hinder the comparison, so that further investigations are required. The lack of publicly available code also restrained an empirical test of this conjecture. If the similitude is confirmed, their formalization would be very close to the one of Clemmensen et al. (2011), reviewed in the following section.

In a recent paper, Mai et al. (2012) take advantage of the equivalence between ordinary least squares and LDA problems to propose a binary classifier solving a penalized least squares problem with a Lasso penalty. The sparse version of the projection vector β is obtained by solving

min_{β ∈ R^p, β_0 ∈ R} n^{-1} Σ_{i=1}^n (y_i − β_0 − x_i^T β)² + λ Σ_{j=1}^p |β_j|

where y_i is the binary indicator of the label for pattern x_i. Even if the authors focus on the Lasso penalty, they also suggest any other generic sparsity-inducing penalty. The decision rule x^T β + β_0 > 0 is the LDA classifier when it is built using the β vector resulting for λ = 0, but a different intercept β_0 is required.

Optimal Scoring

In binary classification, the regression of (scaled) class indicators enables to recover exactly the LDA discriminant direction. For more than two classes, regressing predefined indicator matrices may be impaired by the masking effect, where the scores assigned to a class situated between two other ones never dominate (Hastie et al., 1994). Optimal scoring (OS) circumvents the problem by assigning "optimal scores" to the classes. This route was opened by Fisher (1936) for binary classification and pursued for more than two classes by Breiman and Ihaka (1984), with the aim of developing a non-linear extension of discriminant analysis based on additive models. They named their approach optimal scaling, for it optimizes the scaling of the indicators of classes together with the discriminant functions. Their approach was later disseminated under the name optimal scoring by Hastie et al. (1994), who proposed several extensions of LDA, either aiming at constructing more flexible discriminants (Hastie and Tibshirani, 1996) or more conservative ones (Hastie et al., 1995).

As an alternative method to solve LDA problems, Hastie et al. (1995) proposed to incorporate a smoothness prior on the discriminant directions in the OS problem, through a positive-definite penalty matrix Ω, leading to a problem expressed in compact form as

min_{Θ, B} ‖YΘ − XB‖²_F + λ tr(B^T Ω B)    (3.4a)
s.t. n^{-1} Θ^T Y^T Y Θ = I_{K−1}    (3.4b)

where Θ ∈ R^{K×(K−1)} are the class scores, B ∈ R^{p×(K−1)} are the regression coefficients, and ‖·‖_F is the Frobenius norm. This compact form does not render the order that arises naturally when considering the following series of K − 1 problems:

min_{θ_k ∈ R^K, β_k ∈ R^p} ‖Yθ_k − Xβ_k‖² + β_k^T Ω β_k    (3.5a)
s.t. n^{-1} θ_k^T Y^T Y θ_k = 1    (3.5b)
     θ_k^T Y^T Y θ_ℓ = 0, ℓ = 1, ..., k − 1    (3.5c)

where each βk corresponds to a discriminant direction



Several sparse LDA methods have been derived by introducing non-quadratic sparsity-inducing penalties in the OS regression problem (Ghosh and Chinnaiyan, 2005; Leng, 2008; Grosenick et al., 2008; Clemmensen et al., 2011). Grosenick et al. (2008) proposed a variant of the Lasso-based penalized OS of Ghosh and Chinnaiyan (2005), by introducing an elastic-net penalty in binary class problems. A generalization to multi-class problems was suggested by Clemmensen et al. (2011), where the objective function (3.5a) is replaced by

min_{β_k ∈ R^p, θ_k ∈ R^K} Σ_k ‖Yθ_k − Xβ_k‖²_2 + λ_1 ‖β_k‖_1 + λ_2 β_k^T Ω β_k

where λ_1 and λ_2 are regularization parameters, and Ω is a penalization matrix, often taken to be the identity for the elastic net. The code for SLDA is available from the website of Line Clemmensen.

Another generalization of the work of Ghosh and Chinnaiyan (2005) was proposed by Leng (2008), with an extension to the multi-class framework based on a group-Lasso penalty in the objective function (3.5a):
\[
\min_{\beta_k \in \mathbb{R}^p,\, \theta_k \in \mathbb{R}^K} \; \sum_{k=1}^{K-1} \|Y\theta_k - X\beta_k\|_2^2 + \lambda \sum_{j=1}^{p} \Bigl( \sum_{k=1}^{K-1} \beta_{kj}^2 \Bigr)^{1/2} \;, \quad (3.6)
\]

which is the criterion that was chosen in this thesis.

The following chapters present our theoretical and algorithmic contributions regarding this formulation. The proposal of Leng (2008) was heuristically driven, and his algorithm followed closely the group-Lasso algorithm of Yuan and Lin (2006), which is not very efficient (the experiments of Leng (2008) are limited to small data sets with hundreds of examples and 1000 preselected genes, and no code is provided). Here, we formally link (3.6) to penalized LDA and propose a publicly available efficient code for solving this problem.

4 Formalizing the Objective

In this chapter, we detail the rationale supporting the Group-Lasso Optimal Scoring Solver (GLOSS) algorithm. GLOSS addresses a sparse LDA problem globally through a regression approach. Our analysis formally relates GLOSS to Fisher's discriminant analysis, and also enables to derive variants, such as an LDA assuming a diagonal within-class covariance structure (Bickel and Levina, 2004).

The sparsity arises from the group-Lasso penalty (3.6) due to Leng (2008), which selects the same features in all discriminant directions, thus providing an interpretable low-dimensional representation of data. For K classes, this representation can either be complete, in dimension (K − 1), or partial for a reduced-rank classification. The first two or three discriminants can also be used to display a graphical summary of the data.

The derivation of penalized LDA as a penalized optimal scoring regression is quite tedious, but it is required here since the algorithm hinges on this equivalence. The main lines have been derived in several places (Breiman and Ihaka, 1984; Hastie et al., 1994; Hastie and Tibshirani, 1996; Hastie et al., 1995) and already used before for sparsity-inducing penalties (Roth and Lange, 2004). However, the published demonstrations were quite elusive on a number of points, leading to generalizations that were not supported in a rigorous way. To our knowledge, we disclosed the first formal equivalence between the optimal scoring regression problem penalized by group-Lasso and penalized LDA (Sanchez Merchante et al., 2012).

4.1 From Optimal Scoring to Linear Discriminant Analysis

Following Hastie et al. (1995), we now show the equivalence between the series of problems encountered in penalized optimal scoring (p-OS) problems and in penalized LDA (p-LDA) problems, by going through canonical correlation analysis. We first provide some properties about the solutions of an arbitrary problem in the p-OS series (3.5).

Throughout this chapter, we assume that:

• there is no empty class, that is, the diagonal matrix Y⊤Y is full rank;

• inputs are centered, that is, X⊤1_n = 0;

• the quadratic penalty Ω is positive-semidefinite and such that X⊤X + Ω is full rank.

4.1.1 Penalized Optimal Scoring Problem

For the sake of simplicity, we now drop subscript k to refer to any problem in the p-OS series (3.5). First, note that Problems (3.5) are biconvex in (θ, β), that is, convex in θ for each β value and vice-versa. The problems are however non-convex: in particular, if (θ, β) is a solution, then (−θ, −β) is also a solution.

The orthogonality constraint (3.5c) inherently limits the number of possible problems in the series to K, since we assumed that there are no empty classes. Moreover, as X is centered, the K − 1 first optimal scores are orthogonal to 1 (and the Kth problem would be solved by β_K = 0). All the problems considered here can be solved by a singular value decomposition of a real symmetric matrix, so that the orthogonality constraints are easily dealt with. Hence, in the sequel, we do not mention anymore these orthogonality constraints (3.5c) that apply along the route, so as to simplify all expressions. The generic problem solved is thus

\[
\min_{\theta \in \mathbb{R}^K,\, \beta \in \mathbb{R}^p} \; \|Y\theta - X\beta\|^2 + \beta^\top \Omega \beta \quad (4.1a)
\]
\[
\text{s.t.} \quad n^{-1}\, \theta^\top Y^\top Y \theta = 1 \;. \quad (4.1b)
\]

For a given score vector θ, the discriminant direction β that minimizes the p-OS criterion (4.1) is the penalized least squares estimator
\[
\beta_{\mathrm{os}} = \left(X^\top X + \Omega\right)^{-1} X^\top Y \theta \;. \quad (4.2)
\]

The objective function (4.1a) is then
\[
\|Y\theta - X\beta_{\mathrm{os}}\|^2 + \beta_{\mathrm{os}}^\top \Omega \beta_{\mathrm{os}}
= \theta^\top Y^\top Y \theta - 2\, \theta^\top Y^\top X \beta_{\mathrm{os}} + \beta_{\mathrm{os}}^\top \left(X^\top X + \Omega\right) \beta_{\mathrm{os}}
\]
\[
\qquad\qquad = \theta^\top Y^\top Y \theta - \theta^\top Y^\top X \left(X^\top X + \Omega\right)^{-1} X^\top Y \theta \;,
\]
where the second line stems from the definition of β_os (4.2). Now, using the fact that the optimal θ obeys constraint (4.1b), the optimization problem is equivalent to
\[
\max_{\theta \,:\, n^{-1} \theta^\top Y^\top Y \theta = 1} \; \theta^\top Y^\top X \left(X^\top X + \Omega\right)^{-1} X^\top Y \theta \;, \quad (4.3)
\]

which shows that the optimization of the p-OS problem with respect to θ_k boils down to finding the kth largest eigenvector of Y⊤X(X⊤X + Ω)^{-1}X⊤Y. Indeed, Appendix C details that Problem (4.3) is solved by
\[
(Y^\top Y)^{-1} Y^\top X \left(X^\top X + \Omega\right)^{-1} X^\top Y \theta = \alpha^2 \theta \;, \quad (4.4)
\]

where α² is the maximal eigenvalue:¹
\[
n^{-1}\, \theta^\top Y^\top X \left(X^\top X + \Omega\right)^{-1} X^\top Y \theta = \alpha^2\, n^{-1}\, \theta^\top (Y^\top Y)\, \theta
\]
\[
n^{-1}\, \theta^\top Y^\top X \left(X^\top X + \Omega\right)^{-1} X^\top Y \theta = \alpha^2 \;. \quad (4.5)
\]
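A minimal MATLAB sketch of this eigen-analysis, under our own toy-data assumptions (random X, random labels, identity penalty), is:

    % Sketch: leading optimal score theta and eigenvalue alpha^2, as in (4.4)-(4.5)
    n = 60; p = 10; K = 3;
    X = randn(n, p);  X = bsxfun(@minus, X, mean(X));    % centered inputs
    y = randi(K, n, 1);
    Y = full(sparse((1:n)', y, 1, n, K));                % class indicator matrix
    Omega = eye(p);                                      % quadratic penalty (placeholder)
    M = Y'*X * ((X'*X + Omega) \ (X'*Y));  M = (M + M')/2;
    [V, D] = eig(M, Y'*Y);                               % generalized eigenproblem (4.4)
    [alpha2, imax] = max(diag(D));                       % maximal eigenvalue alpha^2
    theta = V(:, imax);
    theta = theta / sqrt(theta' * (Y'*Y) * theta / n);   % enforce n^{-1} theta' Y'Y theta = 1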

4.1.2 Penalized Canonical Correlation Analysis

As per Hastie et al. (1995), the penalized Canonical Correlation Analysis (p-CCA) problem between variables X and Y is defined as follows:
\[
\max_{\theta \in \mathbb{R}^K,\, \beta \in \mathbb{R}^p} \; n^{-1}\, \theta^\top Y^\top X \beta \quad (4.6a)
\]
\[
\text{s.t.} \quad n^{-1}\, \theta^\top Y^\top Y \theta = 1 \;, \quad (4.6b)
\]
\[
\qquad\;\; n^{-1}\, \beta^\top \left(X^\top X + \Omega\right) \beta = 1 \;. \quad (4.6c)
\]

The solutions to (4.6) are obtained by finding saddle points of the Lagrangian:
\[
n\, L(\beta, \theta, \nu, \gamma) = \theta^\top Y^\top X \beta - \nu\left(\theta^\top Y^\top Y \theta - n\right) - \gamma\left(\beta^\top (X^\top X + \Omega)\beta - n\right)
\]
\[
\Rightarrow \; n\, \frac{\partial L(\beta, \theta, \gamma, \nu)}{\partial \beta} = X^\top Y \theta - 2\gamma\, (X^\top X + \Omega)\beta
\]
\[
\Rightarrow \; \beta_{\mathrm{cca}} = \frac{1}{2\gamma}\, (X^\top X + \Omega)^{-1} X^\top Y \theta \;.
\]
Then, as β_cca obeys (4.6c), we obtain
\[
\beta_{\mathrm{cca}} = \frac{(X^\top X + \Omega)^{-1} X^\top Y \theta}{\sqrt{n^{-1}\, \theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \theta}} \;, \quad (4.7)
\]

so that the optimal objective function (4.6a) can be expressed with θ alone:
\[
n^{-1}\, \theta^\top Y^\top X \beta_{\mathrm{cca}}
= \frac{n^{-1}\, \theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \theta}{\sqrt{n^{-1}\, \theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \theta}}
= \sqrt{n^{-1}\, \theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \theta} \;,
\]
and the optimization problem with respect to θ can be restated as
\[
\max_{\theta \,:\, n^{-1} \theta^\top Y^\top Y \theta = 1} \; \theta^\top Y^\top X \left(X^\top X + \Omega\right)^{-1} X^\top Y \theta \;. \quad (4.8)
\]
Hence the p-OS and p-CCA problems produce the same optimal score vectors θ. The regression coefficients are thus proportional, as shown by (4.2) and (4.7):
\[
\beta_{\mathrm{os}} = \alpha\, \beta_{\mathrm{cca}} \;, \quad (4.9)
\]

¹ The awkward notation α² for the eigenvalue was chosen here to ease comparison with Hastie et al. (1995). It is easy to check that this eigenvalue is indeed non-negative (see Equation (4.5) for example).

where α is defined by (4.5).

The p-CCA optimization problem can also be written as a function of β alone, using the optimality conditions for θ:
\[
n\, \frac{\partial L(\beta, \theta, \gamma, \nu)}{\partial \theta} = Y^\top X \beta - 2\nu\, Y^\top Y \theta
\]
\[
\Rightarrow \; \theta_{\mathrm{cca}} = \frac{1}{2\nu}\, (Y^\top Y)^{-1} Y^\top X \beta \;. \quad (4.10)
\]
Then, as θ_cca obeys (4.6b), we obtain
\[
\theta_{\mathrm{cca}} = \frac{(Y^\top Y)^{-1} Y^\top X \beta}{\sqrt{n^{-1}\, \beta^\top X^\top Y (Y^\top Y)^{-1} Y^\top X \beta}} \;, \quad (4.11)
\]

leading to the following expression of the optimal objective function:
\[
n^{-1}\, \theta_{\mathrm{cca}}^\top Y^\top X \beta
= \frac{n^{-1}\, \beta^\top X^\top Y (Y^\top Y)^{-1} Y^\top X \beta}{\sqrt{n^{-1}\, \beta^\top X^\top Y (Y^\top Y)^{-1} Y^\top X \beta}}
= \sqrt{n^{-1}\, \beta^\top X^\top Y (Y^\top Y)^{-1} Y^\top X \beta} \;.
\]

The p-CCA problem can thus be solved with respect to β by plugging this value in (4.6):
\[
\max_{\beta \in \mathbb{R}^p} \; n^{-1}\, \beta^\top X^\top Y (Y^\top Y)^{-1} Y^\top X \beta \quad (4.12a)
\]
\[
\text{s.t.} \quad n^{-1}\, \beta^\top \left(X^\top X + \Omega\right) \beta = 1 \;, \quad (4.12b)
\]
where the positive objective function has been squared compared to (4.6). This formulation is important since it will be used to link p-CCA to p-LDA. We thus derive its solution: following the reasoning of Appendix C, β_cca verifies
\[
n^{-1}\, X^\top Y (Y^\top Y)^{-1} Y^\top X \beta_{\mathrm{cca}} = \lambda \left(X^\top X + \Omega\right) \beta_{\mathrm{cca}} \;, \quad (4.13)
\]

where λ is the maximal eigenvalue, shown below to be equal to α²:
\[
n^{-1}\, \beta_{\mathrm{cca}}^\top X^\top Y (Y^\top Y)^{-1} Y^\top X \beta_{\mathrm{cca}} = \lambda
\]
\[
\Rightarrow \; n^{-1}\, \alpha^{-1}\, \beta_{\mathrm{cca}}^\top X^\top Y (Y^\top Y)^{-1} Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \theta = \lambda
\]
\[
\Rightarrow \; n^{-1}\, \alpha\, \beta_{\mathrm{cca}}^\top X^\top Y \theta = \lambda
\]
\[
\Rightarrow \; n^{-1}\, \theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \theta = \lambda
\]
\[
\Rightarrow \; \alpha^2 = \lambda \;.
\]
The first line is obtained by obeying constraint (4.12b), the second line by the relationship (4.7) whose denominator is α, the third line comes from (4.4), the fourth line uses again the relationship (4.7), and the last one the definition of α (4.5).

4.1.3 Penalized Linear Discriminant Analysis

Still following Hastie et al. (1995), the penalized Linear Discriminant Analysis is defined as follows:
\[
\max_{\beta \in \mathbb{R}^p} \; \beta^\top \Sigma_B\, \beta \quad (4.14a)
\]
\[
\text{s.t.} \quad \beta^\top \left(\Sigma_W + n^{-1}\Omega\right) \beta = 1 \;, \quad (4.14b)
\]
where Σ_B and Σ_W are respectively the sample between-class and within-class variances of the original p-dimensional data. This problem may be solved by an eigenvector decomposition, as detailed in Appendix C.

As the feature matrix X is assumed to be centered, the sample total, between-class and within-class covariance matrices can be written in a simple form that is amenable to a simple matrix representation using the projection operator Y(Y⊤Y)^{-1}Y⊤:
\[
\Sigma_T = \frac{1}{n} \sum_{i=1}^{n} x_i x_i^\top = n^{-1}\, X^\top X \;,
\]
\[
\Sigma_B = \frac{1}{n} \sum_{k=1}^{K} n_k\, \mu_k \mu_k^\top = n^{-1}\, X^\top Y \left(Y^\top Y\right)^{-1} Y^\top X \;,
\]
\[
\Sigma_W = \frac{1}{n} \sum_{k=1}^{K} \sum_{i : y_{ik}=1} (x_i - \mu_k)(x_i - \mu_k)^\top = n^{-1} \left(X^\top X - X^\top Y \left(Y^\top Y\right)^{-1} Y^\top X\right) \;.
\]
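A small MATLAB sketch, under our own toy-data assumptions, that builds these three matrices from the indicator matrix is:

    % Sketch: total, between-class and within-class covariance matrices
    n = 90; p = 5; K = 3;
    X = randn(n, p);  X = bsxfun(@minus, X, mean(X));    % centered features
    y = randi(K, n, 1);
    Y = full(sparse((1:n)', y, 1, n, K));                % class indicator matrix
    P = Y / (Y'*Y) * Y';                                 % projection on the class indicators
    SigmaT = X'*X / n;                                   % total covariance
    SigmaB = X'*P*X / n;                                 % between-class covariance
    SigmaW = SigmaT - SigmaB;                            % within-class covariance
    Mu = (Y'*Y) \ (Y'*X);                                % class means mu_k, stacked in rows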

Using these formulae, the solution to the p-LDA problem (4.14) is obtained as
\[
X^\top Y \left(Y^\top Y\right)^{-1} Y^\top X\, \beta_{\mathrm{lda}} = \lambda \left(X^\top X + \Omega - X^\top Y \left(Y^\top Y\right)^{-1} Y^\top X\right) \beta_{\mathrm{lda}}
\]
\[
X^\top Y \left(Y^\top Y\right)^{-1} Y^\top X\, \beta_{\mathrm{lda}} = \frac{\lambda}{1+\lambda} \left(X^\top X + \Omega\right) \beta_{\mathrm{lda}} \;.
\]
The comparison of the last equation with the characterization of β_cca (4.13) shows that β_lda and β_cca are proportional, and that λ/(1 + λ) = α². Using constraints (4.12b) and (4.14b), it comes that
\[
\beta_{\mathrm{lda}} = (1 - \alpha^2)^{-1/2}\, \beta_{\mathrm{cca}} = \alpha^{-1} (1 - \alpha^2)^{-1/2}\, \beta_{\mathrm{os}} \;,
\]
which ends the path from p-OS to p-LDA.

4.1.4 Summary

The three previous subsections considered a generic form of the kth problem in the p-OS series. The relationships unveiled above also hold for the compact notation gathering all problems (3.4), which is recalled below:
\[
\min_{\Theta, B} \; \|Y\Theta - XB\|_F^2 + \lambda\, \mathrm{tr}\!\left(B^\top \Omega B\right) \quad \text{s.t.} \quad n^{-1}\, \Theta^\top Y^\top Y \Theta = I_{K-1} \;.
\]
Let A represent the (K − 1) × (K − 1) diagonal matrix with elements α_k, the square root of the kth largest eigenvalue of Y⊤X(X⊤X + Ω)^{-1}X⊤Y; we have
\[
B_{\mathrm{LDA}} = B_{\mathrm{CCA}} \left(I_{K-1} - A^2\right)^{-\frac{1}{2}} = B_{\mathrm{OS}}\, A^{-1} \left(I_{K-1} - A^2\right)^{-\frac{1}{2}} \;, \quad (4.15)
\]

where I_{K−1} is the (K − 1) × (K − 1) identity matrix.

At this point, the feature matrix X, which has dimensions n × p in the input space, can be projected into the optimal scoring domain as an n × (K − 1) matrix X_OS = XB_OS, or into the linear discriminant analysis space as an n × (K − 1) matrix X_LDA = XB_LDA. Classification can be performed in any of those domains if the appropriate distance (penalized within-class covariance matrix) is applied.

With the aim of performing classification, the whole process could be summarized as follows (a code sketch of these steps is given after the list):

1. Solve the p-OS problem as B_OS = (X⊤X + λΩ)^{-1} X⊤Y Θ, where Θ are the K − 1 leading eigenvectors of Y⊤X(X⊤X + λΩ)^{-1}X⊤Y.

2. Translate the data samples X into the LDA domain as X_LDA = X B_OS D, where D = A^{-1}(I_{K−1} − A²)^{-1/2}.

3. Compute the matrix M of centroids μ_k from X_LDA and Y.

4. Evaluate the distances d(x, μ_k) in the LDA domain as a function of M and X_LDA.

5. Translate distances into posterior probabilities and assign every sample i to a class k following the maximum a posteriori rule.

6. Optionally, display a graphical representation of the data in the discriminant subspace.
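The following MATLAB sketch walks through these steps for a plain quadratic penalty; it is an illustration under our own toy-data assumptions, not the GLOSS implementation itself.

    % Sketch: p-OS solution with a quadratic penalty, then nearest-centroid classification
    n = 150; p = 20; K = 3; lambda = 1;
    X = randn(n, p);  X = bsxfun(@minus, X, mean(X));
    y = randi(K, n, 1);
    Y = full(sparse((1:n)', y, 1, n, K));
    Omega = eye(p);
    % Step 1: scores Theta and regression coefficients B_OS
    S = X'*X + lambda*Omega;
    M = Y'*X * (S \ (X'*Y));  M = (M + M')/2;
    [V, D] = eig(M, Y'*Y);
    [~, order] = sort(diag(D), 'descend');
    Theta = V(:, order(1:K-1));
    Theta = Theta * diag(1 ./ sqrt(diag(Theta'*(Y'*Y)*Theta) / n));  % n^{-1} Theta'Y'Y Theta = I
    Bos = S \ (X'*Y*Theta);
    % Step 2: map the samples to the LDA domain with D = A^{-1}(I - A^2)^{-1/2}
    alpha2 = diag(Theta' * (Y'*X) * Bos) / n;            % squared canonical correlations
    Dmap = diag(1 ./ (sqrt(alpha2) .* sqrt(1 - alpha2)));
    Xlda = X * Bos * Dmap;
    % Steps 3-5: centroids, Euclidean distances with prior correction, MAP assignment
    Mu = (Y'*Y) \ (Y'*Xlda);
    prior = sum(Y)' / n;
    d2 = zeros(n, K);
    for k = 1:K
      dk = bsxfun(@minus, Xlda, Mu(k, :));
      d2(:, k) = sum(dk.^2, 2) - 2*log(prior(k));
    end
    [~, yhat] = min(d2, [], 2);                          % predicted classes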

The solution of the penalized optimal scoring regression and the computation of the distance and posterior matrices are detailed in Sections 4.2.1, 4.2.2 and 4.2.3, respectively.

4.2 Practicalities

4.2.1 Solution of the Penalized Optimal Scoring Regression

Following Hastie et al. (1994) and Hastie et al. (1995), a quadratically penalized LDA problem can be presented as a quadratically penalized OS problem:
\[
\min_{\Theta \in \mathbb{R}^{K \times (K-1)},\, B \in \mathbb{R}^{p \times (K-1)}} \; \|Y\Theta - XB\|_F^2 + \lambda\, \mathrm{tr}\!\left(B^\top \Omega B\right) \quad (4.16a)
\]
\[
\text{s.t.} \quad n^{-1}\, \Theta^\top Y^\top Y \Theta = I_{K-1} \;, \quad (4.16b)
\]
where Θ are the class scores, B the regression coefficients and ‖·‖_F is the Frobenius norm.

Though non-convex, the OS problem is readily solved by a decomposition in Θ and B: the optimal B_OS does not intervene in the optimality conditions with respect to Θ, and the optimization with respect to B is obtained in closed form as a linear combination of the optimal scores Θ (Hastie et al., 1995). The algorithm may seem a bit tortuous considering the properties mentioned above, as it proceeds in four steps:

1. Initialize Θ to Θ⁰ such that n^{-1} Θ⁰⊤Y⊤YΘ⁰ = I_{K−1}.

2. Compute B = (X⊤X + λΩ)^{-1} X⊤Y Θ⁰.

3. Set Θ to be the K − 1 leading eigenvectors of Y⊤X(X⊤X + λΩ)^{-1}X⊤Y.

4. Compute the optimal regression coefficients
\[
B_{\mathrm{OS}} = \left(X^\top X + \lambda\Omega\right)^{-1} X^\top Y \Theta \;. \quad (4.17)
\]

Defining Θ⁰ in Step 1, instead of using directly Θ as expressed in Step 3, drastically reduces the computational burden of the eigen-analysis: the latter is performed on Θ⁰⊤Y⊤X(X⊤X + λΩ)^{-1}X⊤YΘ⁰, which is computed as Θ⁰⊤Y⊤XB, thus avoiding a costly matrix inversion. The solution of the penalized optimal scoring as an eigenvector decomposition is detailed and justified in Appendix B.
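A compact MATLAB sketch of these four steps, using the Θ⁰ trick (toy data and a generic quadratic penalty are our own assumptions; the √n scaling of Θ⁰ is our normalization choice to satisfy the stated constraint), may look as follows:

    % Sketch: penalized OS solved through the Theta0 trick
    n = 150; p = 30; K = 4; lambda = 1;
    X = randn(n, p);  X = bsxfun(@minus, X, mean(X));
    y = randi(K, n, 1);
    Y = full(sparse((1:n)', y, 1, n, K));
    Omega = eye(p);
    % Step 1: Theta0 with n^{-1} Theta0' Y'Y Theta0 = I, orthogonal to the constant score
    U = null(ones(1, K));                            % K x (K-1) orthonormal, orthogonal to 1_K
    Theta0 = sqrt(n) * diag(1 ./ sqrt(sum(Y))) * U;  % (Y'Y)^{-1/2} U, rescaled by sqrt(n)
    % Step 2: regression coefficients for the initial scores
    S = X'*X + lambda*Omega;
    B0 = S \ (X'*Y*Theta0);
    % Step 3: eigen-analysis of the small (K-1)x(K-1) matrix Theta0' Y'X B0
    [V, L] = eig(Theta0' * (Y'*X) * B0);
    [~, order] = sort(real(diag(L)), 'descend');
    V = real(V(:, order));
    Theta = Theta0 * V;                              % optimal scores
    % Step 4: optimal regression coefficients
    Bos = B0 * V;                                    % equals (X'X + lambda*Omega)^{-1} X'Y Theta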

This four-step algorithm is valid when the penalty is of the form B⊤ΩB. However, when an L1 penalty is applied in (4.16), the optimization algorithm requires iterative updates of B and Θ. That situation is developed by Clemmensen et al. (2011), where a Lasso or an Elastic net penalty is used to induce sparsity in the OS problem. Furthermore, these Lasso and Elastic net penalties do not enjoy the equivalence with LDA problems.

4.2.2 Distance Evaluation

The simplest classification rule is the nearest centroid rule, where the sample x_i is assigned to class k if x_i is closer (in terms of the shared within-class Mahalanobis distance) to centroid μ_k than to any other centroid μ_ℓ. In general, the parameters of the model are unknown and the rule is applied with the parameters estimated from training data (sample estimators μ̂_k and Σ̂_W). If μ_k are the centroids in the input space, sample x_i is assigned to the class k if the distance
\[
d(x_i, \mu_k) = (x_i - \mu_k)^\top \Sigma_{W\Omega}^{-1} (x_i - \mu_k) - 2 \log\!\left(\frac{n_k}{n}\right) \quad (4.18)
\]

is minimized among all k. In expression (4.18), the first term is the Mahalanobis distance in the input space, and the second term is an adjustment for unequal class sizes that estimates the prior probability of class k. Note that this is inspired by the Gaussian view of LDA, and that another definition of the adjustment term could be used (Friedman et al., 2009; Mai et al., 2012). The matrix Σ_WΩ used in (4.18) is the penalized within-class covariance matrix, which can be decomposed in a penalized and a non-penalized component:
\[
\Sigma_{W\Omega}^{-1} = \left(n^{-1}(X^\top X + \lambda\Omega) - \Sigma_B\right)^{-1}
= \left(n^{-1} X^\top X - \Sigma_B + n^{-1}\lambda\Omega\right)^{-1}
= \left(\Sigma_W + n^{-1}\lambda\Omega\right)^{-1} \;. \quad (4.19)
\]

Before explaining how to compute the distances, let us summarize some clarifying points:

• the solution B_OS of the p-OS problem is enough to accomplish classification;

• in the LDA domain (space of discriminant variates X_LDA), classification is based on Euclidean distances;

• classification can be done in a reduced-rank space of dimension R < K − 1, by using the first R discriminant directions {β_k}_{k=1}^R.

As a result, the expression of the distance (4.18) depends on the domain where the classification is performed. If we classify in the p-OS domain, the distance is
\[
\|(x_i - \mu_k)\, B_{\mathrm{OS}}\|^2_{\Sigma_{W\Omega}} - 2\log(\pi_k) \;,
\]
where π_k is the estimated class prior and ‖·‖_S is the Mahalanobis distance assuming within-class covariance S. If classification is done in the p-LDA domain, the distance is
\[
\left\|(x_i - \mu_k)\, B_{\mathrm{OS}}\, A^{-1} \left(I_{K-1} - A^2\right)^{-\frac{1}{2}}\right\|_2^2 - 2\log(\pi_k) \;,
\]
which is a plain Euclidean distance.

4.2.3 Posterior Probability Evaluation

Let d(x, μ_k) be the distance between x and μ_k defined as in (4.18). Under the assumption that classes are Gaussian, the posterior probabilities p(y_k = 1|x) can be estimated as
\[
\hat{p}(y_k = 1 \mid x) \propto \exp\!\left(-\frac{d(x, \mu_k)}{2}\right) \propto \pi_k \exp\!\left(-\frac{1}{2}\left\|(x - \mu_k)\, B_{\mathrm{OS}}\, A^{-1} \left(I_{K-1} - A^2\right)^{-\frac{1}{2}}\right\|_2^2\right) \;. \quad (4.20)
\]

Those probabilities must be normalized to ensure that they sum to one. When the distances d(x, μ_k) take large values, exp(−d(x, μ_k)/2) can take extremely small values, generating underflow issues. A classical trick to fix this numerical issue is detailed below:
\[
\hat{p}(y_k = 1 \mid x) = \frac{\pi_k \exp\!\left(-\frac{d(x,\mu_k)}{2}\right)}{\sum_\ell \pi_\ell \exp\!\left(-\frac{d(x,\mu_\ell)}{2}\right)}
= \frac{\pi_k \exp\!\left(-\frac{d(x,\mu_k)}{2} + \frac{d_{\min}}{2}\right)}{\sum_\ell \pi_\ell \exp\!\left(-\frac{d(x,\mu_\ell)}{2} + \frac{d_{\min}}{2}\right)} \;,
\]
where d_min = min_k d(x, μ_k), so that the largest exponent is zero and neither underflow nor overflow can wipe out the sum.
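In MATLAB, this normalization can be written as the short fragment below (d2 is an n × K matrix of distances d(x_i, μ_k) and prior a K × 1 vector of class priors; both are placeholders standing for the outputs of the previous steps):

    % Sketch: numerically stable posterior probabilities from distances
    d2 = 1e3 * rand(5, 3);                           % example distances prone to underflow
    prior = [0.2; 0.3; 0.5];
    shift = min(d2, [], 2);                          % d_min for each sample
    E = exp(bsxfun(@minus, shift, d2) / 2);          % exp(-(d - d_min)/2), all in [0, 1]
    P = bsxfun(@times, E, prior');                   % pi_k * exp(...)
    P = bsxfun(@rdivide, P, sum(P, 2));              % normalize each row to sum to one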

4.2.4 Graphical Representation

Sometimes it can be useful to have a graphical display of the data set. Using only the two or three most discriminant directions may not provide the best separation between classes, but can suffice to inspect the data. This can be accomplished by plotting the first two or three dimensions of the regression fits X_OS or of the discriminant variates X_LDA, depending on whether we are presenting the dataset in the OS or in the LDA domain. Other attributes, such as the centroids or the shape of the within-class variance, can also be represented.

4.3 From Sparse Optimal Scoring to Sparse LDA

The equivalence stated in Section 4.1 holds for quadratic penalties of the form β⊤Ωβ, under the assumption that Y⊤Y and X⊤X + λΩ are full rank (fulfilled when there are no empty classes and Ω is positive definite). Quadratic penalties have interesting properties but, as recalled in Section 2.3, they do not induce sparsity. In this respect, L1 penalties are preferable, but they lack a connection, such as the one stated by Hastie et al. (1995), between p-LDA and p-OS.

In this section, we introduce the tools used to obtain sparse models while maintaining the equivalence between p-LDA and p-OS problems. We use a group-Lasso penalty (see Section 2.3.4) that induces groups of zeroes in the coefficients corresponding to the same feature in all discriminant directions, resulting in truly parsimonious models. Our derivation uses a variational formulation of the group-Lasso to generalize the equivalence drawn by Hastie et al. (1995) for quadratic penalties. We therefore intend to show that our formulation of the group-Lasso can be written in the quadratic form B⊤ΩB.

4.3.1 A Quadratic Variational Form

Quadratic variational forms of the Lasso and group-Lasso have been proposed shortly after the original Lasso paper of Tibshirani (1996), as a means to address optimization issues, but also as an inspiration for generalizing the Lasso penalty (Grandvalet, 1998; Canu and Grandvalet, 1999). The algorithms based on these quadratic variational forms iteratively reweight a quadratic penalty. They are now often outperformed by more efficient strategies (Bach et al., 2012).

Our formulation of the group-Lasso is shown below:
\[
\min_{\tau \in \mathbb{R}^p} \; \min_{B \in \mathbb{R}^{p \times (K-1)}} \; J(B) + \lambda \sum_{j=1}^{p} w_j^2\, \frac{\|\beta^j\|_2^2}{\tau_j} \quad (4.21a)
\]
\[
\text{s.t.} \quad \sum_j \tau_j - \sum_j w_j \|\beta^j\|_2 \le 0 \;, \quad (4.21b)
\]
\[
\qquad\;\; \tau_j \ge 0 \;, \quad j = 1, \ldots, p \;, \quad (4.21c)
\]
where B ∈ R^{p×(K−1)} is a matrix composed of row vectors β^j ∈ R^{K−1}, B = (β^{1⊤}, …, β^{p⊤})⊤, and the w_j are predefined nonnegative weights. The cost function J(B) in our context is the OS regression objective ½‖YΘ − XB‖²_2; for now, for the sake of simplicity, we keep the generic notation J(B). Here and in what follows, b/τ is defined by continuation at zero, as b/0 = +∞ if b ≠ 0 and 0/0 = 0. Note that variants of (4.21) have been proposed elsewhere (see e.g. Canu and Grandvalet, 1999; Bach et al., 2012, and references therein).

The intuition behind our approach is that, using the variational formulation, we recast a non-quadratic expression into the convex hull of a family of quadratic penalties defined by the variables τ_j. This is graphically shown in Figure 4.1.

Let us start by proving the equivalence of our variational formulation with the standard group-Lasso (there is an alternative variational formulation, detailed and demonstrated in Appendix D).

Lemma 4.1. The quadratic penalty in β^j in (4.21) acts as the group-Lasso penalty λ Σ_{j=1}^p w_j ‖β^j‖_2.

Proof. The Lagrangian of Problem (4.21) is
\[
L = J(B) + \lambda \sum_{j=1}^{p} w_j^2\, \frac{\|\beta^j\|_2^2}{\tau_j} + \nu_0 \left( \sum_{j=1}^{p} \tau_j - \sum_{j=1}^{p} w_j \|\beta^j\|_2 \right) - \sum_{j=1}^{p} \nu_j \tau_j \;.
\]

Figure 4.1: Graphical representation of the variational approach to the group-Lasso.

Thus, the first-order optimality conditions for τ_j^⋆ are
\[
\frac{\partial L}{\partial \tau_j}(\tau_j^\star) = 0
\;\Leftrightarrow\; -\lambda w_j^2\, \frac{\|\beta^j\|_2^2}{\tau_j^{\star 2}} + \nu_0 - \nu_j = 0
\;\Leftrightarrow\; -\lambda w_j^2 \|\beta^j\|_2^2 + \nu_0\, \tau_j^{\star 2} - \nu_j\, \tau_j^{\star 2} = 0
\;\Rightarrow\; -\lambda w_j^2 \|\beta^j\|_2^2 + \nu_0\, \tau_j^{\star 2} = 0 \;.
\]

The last line is obtained from complementary slackness, which implies here ν_j τ_j^⋆ = 0. Complementary slackness states that ν_j g_j(τ_j^⋆) = 0, where ν_j is the Lagrange multiplier for constraint g_j(τ_j) ≤ 0. As a result, the optimal value of τ_j is
\[
\tau_j^\star = \sqrt{\frac{\lambda w_j^2 \|\beta^j\|_2^2}{\nu_0}} = \sqrt{\frac{\lambda}{\nu_0}}\, w_j \|\beta^j\|_2 \;. \quad (4.22)
\]

We note that ν_0 ≠ 0 if there is at least one coefficient β_{jk} ≠ 0; thus the inequality constraint (4.21b) is at bound (due to complementary slackness):
\[
\sum_{j=1}^{p} \tau_j^\star - \sum_{j=1}^{p} w_j \|\beta^j\|_2 = 0 \;, \quad (4.23)
\]
so that τ_j^⋆ = w_j ‖β^j‖_2. Using this value in (4.21a), it is possible to conclude that Problem (4.21) is equivalent to the standard group-Lasso operator
\[
\min_{B \in \mathbb{R}^{p \times M}} \; J(B) + \lambda \sum_{j=1}^{p} w_j \|\beta^j\|_2 \;. \quad (4.24)
\]

We have thus presented a convex quadratic variational form of the group-Lasso and demonstrated its equivalence with the standard group-Lasso formulation.

With Lemma 4.1, we have proved that, under constraints (4.21b)-(4.21c), the quadratic problem (4.21a) is equivalent to the standard formulation of the group-Lasso (4.24). The penalty term of (4.21a) can be conveniently written as λ tr(B⊤ΩB), where
\[
\Omega = \mathrm{diag}\!\left(\frac{w_1^2}{\tau_1}, \frac{w_2^2}{\tau_2}, \ldots, \frac{w_p^2}{\tau_p}\right) \;, \quad (4.25)
\]
with τ_j = w_j ‖β^j‖_2, resulting in the diagonal components
\[
(\Omega)_{jj} = \frac{w_j}{\|\beta^j\|_2} \;. \quad (4.26)
\]
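The reweighting that this equivalence suggests is straightforward to code; the MATLAB fragment below (our illustration, with B, λ and the weights w as placeholders) computes the diagonal entries (4.26), guarding the zero-norm rows that formally receive an infinite penalty:

    % Sketch: adaptive quadratic penalty from the current coefficients B
    p = 8; Km1 = 2; lambda = 0.5;
    B = randn(p, Km1);  B(3, :) = 0;              % one inactive row, for illustration
    w = ones(p, 1);                               % penalty weights w_j
    rownorm = sqrt(sum(B.^2, 2));                 % ||beta^j||_2
    active = rownorm > 0;
    omega = Inf(p, 1);
    omega(active) = w(active) ./ rownorm(active); % (Omega)_jj = w_j / ||beta^j||_2
    % On the active set, the quadratic form equals the group-Lasso penalty:
    quad  = lambda * sum(omega(active) .* rownorm(active).^2); % lambda * tr(B_A' Omega_A B_A)
    group = lambda * sum(w(active) .* rownorm(active));        % lambda * sum_j w_j ||beta^j||_2
    % quad and group coincide, illustrating (4.25)-(4.26)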

As stated at the beginning of this section, the equivalence between p-LDA problems and p-OS problems is thus demonstrated for the variational formulation. This equivalence is crucial to the derivation of the link between sparse OS and sparse LDA; it furthermore suggests a convenient implementation. We sketch below some properties that are instrumental in the implementation of the active set method described in Chapter 5.

The first property states that the quadratic formulation is convex when J is convex, thus providing an easy control of optimality and convergence.

Lemma 4.2. If J is convex, Problem (4.21) is convex.

Proof. The function g(β, τ) = ‖β‖²_2/τ, known as the perspective function of f(β) = ‖β‖²_2, is convex in (β, τ) (see e.g. Boyd and Vandenberghe, 2004, Chapter 3), and the constraints (4.21b)-(4.21c) define convex admissible sets; hence Problem (4.21) is jointly convex with respect to (B, τ).

In what follows, J will be a convex quadratic (hence smooth) function, in which case a necessary and sufficient optimality condition is that zero belongs to the subdifferential of the objective function, whose expression is provided in the following lemma.

Lemma 4.3. For all B ∈ R^{p×(K−1)}, the subdifferential of the objective function of Problem (4.24) is
\[
\left\{ V \in \mathbb{R}^{p \times (K-1)} : V = \frac{\partial J(B)}{\partial B} + \lambda G \right\} \;, \quad (4.27)
\]
where G ∈ R^{p×(K−1)} is a matrix composed of row vectors g^j ∈ R^{K−1}, G = (g^{1⊤}, …, g^{p⊤})⊤, defined as follows. Let S(B) denote the columnwise support of B, S(B) = {j ∈ {1, …, p} : ‖β^j‖_2 ≠ 0}; then we have
\[
\forall j \in S(B), \quad g^j = w_j \|\beta^j\|_2^{-1} \beta^j \;, \quad (4.28)
\]
\[
\forall j \notin S(B), \quad \|g^j\|_2 \le w_j \;. \quad (4.29)
\]

This condition results in an equality for the "active" non-zero vectors β^j and an inequality for the other ones, which both provide essential building blocks of our algorithm.

Proof. When ‖β^j‖_2 ≠ 0, the gradient of the penalty with respect to β^j is
\[
\frac{\partial}{\partial \beta^j} \left( \lambda \sum_{m=1}^{p} w_m \|\beta^m\|_2 \right) = \lambda w_j \frac{\beta^j}{\|\beta^j\|_2} \;. \quad (4.30)
\]
At ‖β^j‖_2 = 0, the gradient of the objective function is not continuous, and the optimality conditions then make use of the subdifferential (Bach et al., 2011):
\[
\partial_{\beta^j} \left( \lambda \sum_{m=1}^{p} w_m \|\beta^m\|_2 \right) = \partial_{\beta^j} \left( \lambda w_j \|\beta^j\|_2 \right) = \left\{ \lambda w_j v \in \mathbb{R}^{K-1} : \|v\|_2 \le 1 \right\} \;, \quad (4.31)
\]
which gives the expression (4.29).

Lemma 4.4. Problem (4.21) admits at least one solution, which is unique if J is strictly convex. All critical points B of the objective function verifying the following conditions are global minima:
\[
\forall j \in S, \quad \frac{\partial J(B)}{\partial \beta^j} + \lambda w_j \|\beta^j\|_2^{-1} \beta^j = 0 \;, \quad (4.32a)
\]
\[
\forall j \notin S, \quad \left\| \frac{\partial J(B)}{\partial \beta^j} \right\|_2 \le \lambda w_j \;, \quad (4.32b)
\]
where S ⊆ {1, …, p} denotes the set of non-zero row vectors β^j and S̄ is its complement.

Lemma 4.4 provides a simple appraisal of the support of the solution, which would not be as easily handled with the direct analysis of the variational problem (4.21).

4.3.2 Group-Lasso OS as Penalized LDA

With all the previous ingredients, the group-Lasso Optimal Scoring Solver for performing sparse LDA can be introduced.

Proposition 4.1. The group-Lasso OS problem
\[
B_{\mathrm{OS}} = \operatorname*{argmin}_{B \in \mathbb{R}^{p \times (K-1)}} \; \min_{\Theta \in \mathbb{R}^{K \times (K-1)}} \; \frac{1}{2} \|Y\Theta - XB\|_F^2 + \lambda \sum_{j=1}^{p} w_j \|\beta^j\|_2
\]
\[
\text{s.t.} \quad n^{-1}\, \Theta^\top Y^\top Y \Theta = I_{K-1}
\]

is equivalent to the penalized LDA problem
\[
B_{\mathrm{LDA}} = \operatorname*{argmax}_{B \in \mathbb{R}^{p \times (K-1)}} \; \mathrm{tr}\!\left(B^\top \Sigma_B B\right) \quad \text{s.t.} \quad B^\top (\Sigma_W + n^{-1}\lambda\Omega) B = I_{K-1} \;,
\]
where Ω = diag(w_1²/τ_1, …, w_p²/τ_p), with
\[
\Omega_{jj} = \begin{cases} +\infty & \text{if } \beta^j_{\mathrm{os}} = 0 \;, \\ w_j \|\beta^j_{\mathrm{os}}\|_2^{-1} & \text{otherwise.} \end{cases} \quad (4.33)
\]
That is, B_LDA = B_OS diag(α_k^{-1}(1 − α_k²)^{-1/2}), where α_k ∈ (0, 1) is the kth leading eigenvalue of
\[
n^{-1}\, Y^\top X \left(X^\top X + \lambda\Omega\right)^{-1} X^\top Y \;.
\]

Proof. The proof simply consists in applying the result of Hastie et al. (1995), which holds for quadratic penalties, to the quadratic variational form of the group-Lasso.

The proposition applies in particular to the Lasso-based OS approaches to sparse LDA (Grosenick et al., 2008; Clemmensen et al., 2011) for K = 2, that is, for binary classification, or more generally for a single discriminant direction. Note however that it leads to a slightly different decision rule if the decision threshold is chosen a priori according to the Gaussian assumption for the features. For more than one discriminant direction, the equivalence does not hold any more, since the Lasso penalty does not result in an equivalent quadratic penalty in the simple form tr(B⊤ΩB).


5 GLOSS Algorithm

The efficient approaches developed for the Lasso take advantage of the sparsity of the solution by solving a series of small linear systems, whose sizes are incrementally increased or decreased (Osborne et al., 2000a). This approach was also pursued for the group-Lasso in its standard formulation (Roth and Fischer, 2008). We adapt this algorithmic framework to the variational form (4.21), with J(B) = ½‖YΘ − XB‖²_2.

The algorithm belongs to the working set family of optimization methods (see Section 2.3.6). It starts from a sparse initial guess, say B = 0, thus defining the set A of "active" variables, currently identified as non-zero. Then it iterates the three steps summarized below:

1. Update the coefficient matrix B within the current active set A, where the optimization problem is smooth. First the quadratic penalty is updated, and then a standard penalized least squares fit is computed.

2. Check the optimality conditions (4.32) with respect to the active variables. One or more β^j may be declared inactive when they vanish from the current solution.

3. Check the optimality conditions (4.32) with respect to inactive variables. If they are satisfied, the algorithm returns the current solution, which is optimal. If they are not satisfied, the variable corresponding to the greatest violation is added to the active set.

This mechanism is graphically represented in Figure 5.1 as a block diagram, and formalized in more detail in Algorithm 1. Note that this formulation uses the equations from the variational approach detailed in Section 4.3.1. If we want to use the alternative variational approach from Appendix D, then we have to replace Equations (4.21), (4.32a) and (4.32b) by (D.1), (D.10a) and (D.10b), respectively.

5.1 Regression Coefficients Updates

Step 1 of Algorithm 1 updates the coefficient matrix B within the current active set A. The quadratic variational form of the problem suggests a blockwise optimization strategy, consisting in solving (K − 1) independent card(A)-dimensional problems instead of a single (K − 1) × card(A)-dimensional problem. The interaction between the (K − 1) problems is relegated to the common adaptive quadratic penalty Ω. This decomposition is especially attractive as we then solve (K − 1) similar systems
\[
\left(X_{\mathcal{A}}^\top X_{\mathcal{A}} + \lambda\Omega\right) \beta_k = X_{\mathcal{A}}^\top Y \theta^0_k \;, \quad (5.1)
\]


Figure 5.1: GLOSS block diagram.


Algorithm 1 Adaptively Penalized Optimal Scoring

Input: X, Y, B, λ
Initialize: A ← {j ∈ {1, …, p} : ‖β^j‖_2 > 0}; Θ⁰ such that n^{-1} Θ⁰⊤Y⊤YΘ⁰ = I_{K−1}; convergence ← false
repeat
    {Step 1: solve (4.21) in B, assuming A optimal}
    repeat
        Ω ← diag(Ω_A), with ω_j ← ‖β^j‖_2^{-1}
        B_A ← (X_A⊤X_A + λΩ)^{-1} X_A⊤Y Θ⁰
    until condition (4.32a) holds for all j ∈ A
    {Step 2: identify inactivated variables}
    for all j ∈ A such that ‖β^j‖_2 = 0 do
        if optimality condition (4.32b) holds then
            A ← A \ {j}; go back to Step 1
        end if
    end for
    {Step 3: check the greatest violation of optimality condition (4.32b) in Ā}
    ĵ ← argmax_{j ∉ A} ‖∂J/∂β^j‖_2
    if ‖∂J/∂β^ĵ‖_2 < λ then
        convergence ← true {B is optimal}
    else
        A ← A ∪ {ĵ}
    end if
until convergence
(s, V) ← eigenanalyze(Θ⁰⊤Y⊤X_A B), that is, Θ⁰⊤Y⊤X_A B V_k = s_k V_k, k = 1, …, K−1
Θ ← Θ⁰V;  B ← BV;  α_k ← n^{-1/2} s_k^{1/2}, k = 1, …, K−1
Output: Θ, B, α

where X_A denotes the columns of X indexed by A, and β_k and θ^0_k denote the kth column of B and Θ⁰, respectively. These linear systems only differ in the right-hand-side term, so that a single Cholesky decomposition is necessary to solve all systems, whereas a blockwise Newton-Raphson method based on the standard group-Lasso formulation would result in different "penalties" Ω for each system.

5.1.1 Cholesky decomposition

Dropping the subscripts and considering the (K − 1) systems together, (5.1) leads to
\[
(X^\top X + \lambda\Omega)\, B = X^\top Y \Theta \;. \quad (5.2)
\]
Defining the Cholesky decomposition as C⊤C = (X⊤X + λΩ), (5.2) is solved efficiently as follows:
\[
C^\top C B = X^\top Y \Theta
\;\Rightarrow\; C B = C^\top \backslash \left(X^\top Y \Theta\right)
\;\Rightarrow\; B = C \backslash \left(C^\top \backslash \left(X^\top Y \Theta\right)\right) \;, \quad (5.3)
\]
where the symbol "\" is the matlab mldivide operator, which solves linear systems efficiently. The GLOSS code implements (5.3).
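As an illustration (with placeholder matrices; this is not the GLOSS source itself), the update can be written in MATLAB as:

    % Sketch: solving (X'X + lambda*Omega) B = X'Y*Theta with one Cholesky factorization
    n = 100; p = 20; Km1 = 2; lambda = 1;
    X = randn(n, p); Y = randn(n, Km1+1); Theta = randn(Km1+1, Km1);   % placeholders
    Omega = diag(1 + rand(p, 1));
    RHS = X' * Y * Theta;              % common right-hand sides, one column per discriminant
    C = chol(X'*X + lambda*Omega);     % upper triangular factor, C'*C = X'X + lambda*Omega
    B = C \ (C' \ RHS);                % two triangular solves handle all K-1 systems at once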

5.1.2 Numerical Stability

The OS regression coefficients are obtained by (5.2), where the penalizer Ω is iteratively updated by (4.33). In this iterative process, when a variable is about to leave the active set, the corresponding entry of Ω reaches large values, thereby driving some OS regression coefficients to zero. These large values may cause numerical stability problems in the Cholesky decomposition of X⊤X + λΩ. This difficulty can be avoided using the following equivalent expression:
\[
B = \Omega^{-1/2} \left(\Omega^{-1/2} X^\top X\, \Omega^{-1/2} + \lambda I\right)^{-1} \Omega^{-1/2} X^\top Y \Theta^0 \;, \quad (5.4)
\]
where the conditioning of Ω^{-1/2}X⊤XΩ^{-1/2} + λI is always well-behaved, provided X is appropriately normalized (recall that 0 ≤ 1/ω_j ≤ 1). This stabler expression demands more computation and is thus reserved to cases with large ω_j values; our code is otherwise based on expression (5.2).

5.2 Score Matrix

The optimal score matrix Θ is made of the K − 1 leading eigenvectors of Y⊤X(X⊤X + Ω)^{-1}X⊤Y. This eigen-analysis is actually solved in the form Θ⊤Y⊤X(X⊤X + Ω)^{-1}X⊤YΘ (see Section 4.2.1 and Appendix B). The latter eigenvector decomposition does not require the costly computation of (X⊤X + Ω)^{-1}, which involves the inversion of a p × p matrix. Let Θ⁰ be an arbitrary K × (K − 1) matrix whose range includes the K − 1 leading eigenvectors of Y⊤X(X⊤X + Ω)^{-1}X⊤Y.¹ Then, solving the K − 1 systems (5.3) provides the value of B⁰ = (X⊤X + λΩ)^{-1}X⊤YΘ⁰. This B⁰ matrix can be identified in the expression to eigenanalyze as
\[
\Theta^{0\top} Y^\top X \left(X^\top X + \Omega\right)^{-1} X^\top Y \Theta^0 = \Theta^{0\top} Y^\top X B^0 \;.
\]
Thus, the solution to the penalized OS problem can be computed through the singular value decomposition of the (K − 1) × (K − 1) matrix Θ^{0⊤}Y⊤XB⁰ = VΛV⊤. Defining Θ = Θ⁰V, we have Θ⊤Y⊤X(X⊤X + Ω)^{-1}X⊤YΘ = Λ, and when Θ⁰ is chosen such that n^{-1}Θ^{0⊤}Y⊤YΘ⁰ = I_{K−1}, we also have that n^{-1}Θ⊤Y⊤YΘ = I_{K−1}, so that the constraints of the p-OS problem hold. Hence, assuming that the diagonal elements of Λ are sorted in decreasing order, θ_k is an optimal solution to the p-OS problem. Finally, once Θ has been computed, the corresponding optimal regression coefficients B satisfying (5.2) are simply recovered using the mapping from Θ⁰ to Θ, that is, B = B⁰V. Appendix E details why the computational trick described here for quadratic penalties can be applied to the group-Lasso, for which Ω is defined by a variational formulation.

5.3 Optimality Conditions

GLOSS uses an active set optimization technique to obtain the optimal values of the coefficient matrix B and the score matrix Θ. To be a solution, the coefficient matrix must obey Lemmas 4.3 and 4.4. Optimality conditions (4.32a) and (4.32b) can be deduced from those lemmas. Both expressions require the computation of the gradient of the objective function
\[
\frac{1}{2} \|Y\Theta - XB\|_2^2 + \lambda \sum_{j=1}^{p} w_j \|\beta^j\|_2 \;. \quad (5.5)
\]

Let J(B) be the data-fitting term ½‖YΘ − XB‖²_2. Its gradient with respect to the jth row of B, β^j, is the (K − 1)-dimensional vector
\[
\frac{\partial J(B)}{\partial \beta^j} = x_j^\top (XB - Y\Theta) \;,
\]
where x_j is the jth column of X. Hence, the first optimality condition (4.32a) can be computed for every variable j as
\[
x_j^\top (XB - Y\Theta) + \lambda w_j \frac{\beta^j}{\|\beta^j\|_2} = 0 \;.
\]

¹ As X is centered, 1_K belongs to the null space of Y⊤X(X⊤X + Ω)^{-1}X⊤Y. It is thus sufficient to choose Θ⁰ orthogonal to 1_K to ensure that its range spans the leading eigenvectors of Y⊤X(X⊤X + Ω)^{-1}X⊤Y. In practice, to comply with this desideratum and conditions (3.5b) and (3.5c), we set Θ⁰ = (Y⊤Y)^{-1/2}U, where U is a K × (K − 1) matrix whose columns are orthonormal vectors orthogonal to 1_K.

The second optimality condition (4.32b) can be computed for every variable j as
\[
\left\| x_j^\top (XB - Y\Theta) \right\|_2 \le \lambda w_j \;.
\]
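In code, both checks reduce to row norms of the gradient matrix X⊤(XB − YΘ); the MATLAB fragment below, with placeholder inputs of our own, is one way to evaluate them:

    % Sketch: checking the optimality conditions (4.32a)-(4.32b)
    n = 80; p = 12; Km1 = 2; lambda = 0.5;
    X = randn(n, p); Y = randn(n, Km1+1); Theta = randn(Km1+1, Km1);  % placeholders
    B = randn(p, Km1);  B(5:end, :) = 0;
    w = ones(p, 1);  active = any(B, 2);
    G = X' * (X*B - Y*Theta);                              % row j of G is dJ/dbeta^j
    rownorm = sqrt(sum(B.^2, 2));
    res = G(active, :) + lambda * bsxfun(@times, w(active)./rownorm(active), B(active, :));
    ok_active   = all(sqrt(sum(res.^2, 2)) < 1e-6);        % (4.32a) on the active set
    grad_norms  = sqrt(sum(G(~active, :).^2, 2));
    ok_inactive = all(grad_norms <= lambda * w(~active));  % (4.32b) on the inactive set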

5.4 Active and Inactive Sets

The feature selection mechanism embedded in GLOSS selects the variables that provide the greatest decrease in the objective function. This is accomplished by means of the optimality conditions (4.32a) and (4.32b). Let A be the active set, with the variables that have already been considered relevant. A variable j can be considered for inclusion in the active set if it violates the second optimality condition. We proceed one variable at a time, by choosing the one that is expected to produce the greatest decrease in the objective function:
\[
j^\star = \operatorname*{argmax}_{j} \; \left\| x_j^\top (XB - Y\Theta) \right\|_2 - \lambda w_j \;.
\]
The exclusion of a variable belonging to the active set A is considered if the norm ‖β^j‖_2 is small and if, after setting β^j to zero, the following optimality condition holds:
\[
\left\| x_j^\top (XB - Y\Theta) \right\|_2 \le \lambda w_j \;.
\]
The process continues until no variable in the active set violates the first optimality condition and no variable in the inactive set violates the second optimality condition.

5.5 Penalty Parameter

The penalty parameter can be specified by the user, in which case GLOSS solves the problem with this value of λ. The other strategy is to compute the solution path for several values of λ: GLOSS then looks for the maximum value of the penalty parameter, λ_max, such that B ≠ 0, and solves the p-OS problem for decreasing values of λ until a prescribed number of features are declared active.

The maximum value of the penalty parameter λ_max corresponding to a null B matrix is obtained by computing the optimality condition (4.32b) at B = 0:
\[
\lambda_{\max} = \max_{j \in \{1, \ldots, p\}} \; \frac{1}{w_j} \left\| x_j^\top Y \Theta^0 \right\|_2 \;.
\]
The algorithm then computes a series of solutions along the regularization path, defined by a series of penalties λ_1 = λ_max > ⋯ > λ_t > ⋯ > λ_T = λ_min ≥ 0, by regularly decreasing the penalty, λ_{t+1} = λ_t/2, and using a warm-start strategy where the feasible initial guess for B(λ_{t+1}) is initialized with B(λ_t). The final penalty parameter λ_min is specified in the optimization process when the maximum number of desired active variables is attained (by default, the minimum of n and p).
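A small MATLAB illustration of this strategy (placeholder data; the inner solver is not shown, and the √n scaling of Θ⁰ is our normalization choice) computes λ_max and the geometric sequence of penalties:

    % Sketch: lambda_max at B = 0 and the decreasing sequence of penalties
    n = 100; p = 40; K = 3; T = 10;
    X = randn(n, p);  X = bsxfun(@minus, X, mean(X));
    y = randi(K, n, 1);
    Y = full(sparse((1:n)', y, 1, n, K));
    U = null(ones(1, K));
    Theta0 = sqrt(n) * diag(1 ./ sqrt(sum(Y))) * U;   % scores with n^{-1} Theta0'Y'Y Theta0 = I
    w = ones(p, 1);
    G0 = X' * Y * Theta0;                             % gradient of J at B = 0 (up to sign)
    lambda_max = max(sqrt(sum(G0.^2, 2)) ./ w);       % largest penalty keeping B = 0
    lambdas = lambda_max * 0.5.^(0:T-1);              % lambda_{t+1} = lambda_t / 2
    % for t = 1:T, warm-start the solver at the previous solution and solve for lambdas(t)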

5.6 Options and Variants

5.6.1 Scaling Variables

As most penalization schemes, GLOSS is sensitive to the scaling of variables. It thus makes sense to normalize them before applying the algorithm or, equivalently, to accommodate weights in the penalty. This option is available in the algorithm.

5.6.2 Sparse Variant

This version replaces some matlab commands used in the standard version of GLOSS by their sparse equivalents. In addition, some mathematical structures are adapted for sparse computation.

5.6.3 Diagonal Variant

We motivated the group-Lasso penalty by sparsity requisites, but robustness considerations could also drive its usage, since LDA is known to be unstable when the number of examples is small compared to the number of variables. In this context, LDA has been experimentally observed to benefit from unrealistic assumptions on the form of the estimated within-class covariance matrix. Indeed, the diagonal approximation, which ignores correlations between genes, may lead to better classification in microarray analysis: Bickel and Levina (2004) showed that this crude approximation provides a classifier with better worst-case performance than the LDA decision rule in small sample size regimes, even if variables are correlated.

The equivalence proof between penalized OS and penalized LDA (Hastie et al., 1995) reveals that quadratic penalties in the OS problem are equivalent to penalties on the within-class covariance matrix in the LDA formulation. This proof suggests a slight variant of penalized OS, corresponding to penalized LDA with a diagonal within-class covariance matrix, where the least squares problems
\[
\min_{B \in \mathbb{R}^{p \times (K-1)}} \|Y\Theta - XB\|_F^2 = \min_{B \in \mathbb{R}^{p \times (K-1)}} \mathrm{tr}\!\left(\Theta^\top Y^\top Y \Theta - 2\, \Theta^\top Y^\top X B + n\, B^\top \Sigma_T B\right)
\]
are replaced by
\[
\min_{B \in \mathbb{R}^{p \times (K-1)}} \mathrm{tr}\!\left(\Theta^\top Y^\top Y \Theta - 2\, \Theta^\top Y^\top X B + n\, B^\top \left(\Sigma_B + \mathrm{diag}(\Sigma_W)\right) B\right) \;.
\]
Note that this variant only requires diag(Σ_W) + Σ_B + n^{-1}Ω to be positive definite, which is a weaker requirement than Σ_T + n^{-1}Ω positive definite.

5.6.4 Elastic net and Structured Variant

For some learning problems, the structure of correlations between variables is partially known. Hastie et al. (1995) applied this idea to the field of handwritten digit recognition

55

5 GLOSS Algorithm

    7 8 9
    4 5 6
    1 2 3

    Ω_L = [  3 −1  0 −1 −1  0  0  0  0
            −1  5 −1 −1 −1 −1  0  0  0
             0 −1  3  0 −1 −1  0  0  0
            −1 −1  0  5 −1  0 −1 −1  0
            −1 −1 −1 −1  8 −1 −1 −1 −1
             0 −1 −1  0 −1  5  0 −1 −1
             0  0  0 −1 −1  0  3 −1  0
             0  0  0 −1 −1 −1 −1  5 −1
             0  0  0  0 −1 −1  0 −1  3 ]

Figure 5.2: Graph and Laplacian matrix for a 3 × 3 image.

for their penalized discriminant analysis model, to constrain the discriminant directions to be spatially smooth.

When an image is represented as a vector of pixels, it is reasonable to assume positive correlations between the variables corresponding to neighboring pixels. Figure 5.2 represents the neighborhood graph of the pixels of a 3 × 3 image, with the corresponding Laplacian matrix. The Laplacian matrix Ω_L is positive-semidefinite, and the penalty β⊤Ω_Lβ favors, among vectors of identical L2 norms, the ones having similar coefficients in the neighborhoods of the graph. For example, this penalty is 9 for the vector (1, 1, 0, 1, 1, 0, 0, 0, 0)⊤, which is the indicator of the neighbors of pixel 1, and it is 21 for the vector (−1, 1, 0, 1, 1, 0, 0, 0, 0)⊤, with a sign mismatch between pixel 1 and its neighborhood.

This smoothness penalty can be imposed jointly with the group-Lasso. From the computational point of view, GLOSS hardly needs to be modified: the smoothness penalty just has to be added to the group-Lasso penalty. As the new penalty is convex and quadratic (thus smooth), there is no additional burden in the overall algorithm. There is, however, an additional hyperparameter to be tuned.
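The MATLAB lines below build the Laplacian of the 3 × 3 pixel graph of Figure 5.2 from its adjacency matrix and evaluate the penalty on the two example vectors; this is our own illustration of the computation, not part of GLOSS.

    % Sketch: Laplacian smoothness penalty beta' * OmegaL * beta on a 3x3 pixel grid
    [r, c] = ndgrid(1:3, 1:3);
    coord = [r(:), c(:)];                           % pixel coordinates
    A = zeros(9);
    for i = 1:9
      for j = 1:9
        A(i, j) = i ~= j && max(abs(coord(i,:) - coord(j,:))) <= 1;   % 8-neighborhood
      end
    end
    OmegaL = diag(sum(A, 2)) - A;                   % graph Laplacian
    b1 = [1 1 0 1 1 0 0 0 0]';                      % indicator of pixel 1 and its neighbors
    b2 = [-1 1 0 1 1 0 0 0 0]';                     % same support, one sign mismatch
    p1 = b1' * OmegaL * b1;                         % smooth pattern: small penalty
    p2 = b2' * OmegaL * b2;                         % sign mismatch: larger penalty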


6 Experimental Results

This section presents some comparison results between the Group-Lasso Optimal Scoring Solver algorithm and two other state-of-the-art classifiers proposed to perform sparse LDA. Those algorithms are Penalized LDA (PLDA) (Witten and Tibshirani, 2011), which applies a Lasso penalty within a Fisher's LDA framework, and Sparse Linear Discriminant Analysis (SLDA) (Clemmensen et al., 2011), which applies an Elastic net penalty to the OS problem. With the aim of testing the parsimony capacities, the latter algorithm was tested without any quadratic penalty, that is, with a Lasso penalty. The implementation of PLDA and SLDA is available from the authors' websites; PLDA is an R implementation and SLDA is coded in matlab. All the experiments used the same training, validation and test sets. Note that they differ significantly from the ones of Witten and Tibshirani (2011) in Simulation 4, for which there was a typo in their paper.

6.1 Normalization

With shrunken estimates, the scaling of features has important outcomes. For the linear discriminants considered here, the two most common normalization strategies consist in setting either the diagonal of the total covariance matrix Σ_T to ones, or the diagonal of the within-class covariance matrix Σ_W to ones. These options can be implemented either by scaling the observations accordingly prior to the analysis, or by providing penalties with weights. The latter option is implemented in our matlab package.¹

6.2 Decision Thresholds

The derivations of LDA based on the analysis of variance or on the regression of class indicators do not rely on the normality of the class-conditional distribution of the observations. Hence, their applicability extends beyond the realm of Gaussian data. Based on this observation, Friedman et al. (2009, chapter 4) suggest to investigate other decision thresholds than the ones stemming from the Gaussian mixture assumption. In particular, they propose to select the decision thresholds that empirically minimize the training error. This option was tested using validation sets or cross-validation.

¹ The GLOSS matlab code can be found in the software section of www.hds.utc.fr/~grandval.

6.3 Simulated Data

We first compare the three techniques in the simulation study of Witten and Tibshirani (2011), which considers four setups with 1200 examples equally distributed between classes. They are split into a training set of size n = 100, a validation set of size 100, and a test set of size 1000. We are in the small sample regime, with p = 500 variables, out of which 100 differ between classes. Independent variables are generated for all simulations except for Simulation 2, where they are slightly correlated. In Simulations 2 and 3, classes are optimally separated by a single projection of the original variables, while the two other scenarios require three discriminant directions. The Bayes' error was estimated to be respectively 1.7%, 6.7%, 7.3% and 30.0%. The exact definition of every setup, as provided in Witten and Tibshirani (2011), is:

Simulation 1: mean shift with independent features. There are four classes. If sample i is in class k, then x_i ∼ N(μ_k, I), with μ_{1j} = 0.7 × 1_{(1≤j≤25)}, μ_{2j} = 0.7 × 1_{(26≤j≤50)}, μ_{3j} = 0.7 × 1_{(51≤j≤75)}, μ_{4j} = 0.7 × 1_{(76≤j≤100)}.

Simulation 2: mean shift with dependent features. There are two classes. If sample i is in class 1, then x_i ∼ N(0, Σ), and if i is in class 2, then x_i ∼ N(μ, Σ), with μ_j = 0.6 × 1_{(j≤200)}. The covariance structure is block diagonal, with 5 blocks, each of dimension 100 × 100. The blocks have (j, j′) element 0.6^{|j−j′|}. This covariance structure is intended to mimic gene expression data correlation.

Simulation 3: one-dimensional mean shift with independent features. There are four classes and the features are independent. If sample i is in class k, then X_{ij} ∼ N((k−1)/3, 1) if j ≤ 100, and X_{ij} ∼ N(0, 1) otherwise.

Simulation 4: mean shift with independent features and no linear ordering. There are four classes. If sample i is in class k, then x_i ∼ N(μ_k, I), with mean vectors defined as follows: μ_{1j} ∼ N(0, 0.3²) for j ≤ 25 and μ_{1j} = 0 otherwise; μ_{2j} ∼ N(0, 0.3²) for 26 ≤ j ≤ 50 and μ_{2j} = 0 otherwise; μ_{3j} ∼ N(0, 0.3²) for 51 ≤ j ≤ 75 and μ_{3j} = 0 otherwise; μ_{4j} ∼ N(0, 0.3²) for 76 ≤ j ≤ 100 and μ_{4j} = 0 otherwise.

Note that this protocol is detrimental to GLOSS, as each relevant variable only affects a single class mean out of K. The setup is favorable to PLDA in the sense that most within-class covariance matrices are diagonal. We thus also tested the diagonal GLOSS variant discussed in Section 5.6.3.

The results are summarized in Table 6.1. Overall, the best predictions are performed by PLDA and GLOSS-D, which both benefit from the knowledge of the true within-class covariance structure. Then, among SLDA and GLOSS, which both ignore this structure, our proposal has a clear edge. The error rates are far away from the Bayes' error rates, but the sample size is small with regard to the number of relevant variables. Regarding sparsity, the clear overall winner is GLOSS, followed far away by SLDA, which is the only


Table 6.1: Experimental results for simulated data: averages (with standard deviations), computed over 25 repetitions, of the test error rate, the number of selected variables and the number of discriminant directions selected on the validation set.

                         Err (%)        Var            Dir
  Sim 1: K = 4, mean shift, ind. features
    PLDA                 12.6 (0.1)     411.7 (3.7)    3.0 (0.0)
    SLDA                 31.9 (0.1)     228.0 (0.2)    3.0 (0.0)
    GLOSS                19.9 (0.1)     106.4 (1.3)    3.0 (0.0)
    GLOSS-D              11.2 (0.1)     251.1 (4.1)    3.0 (0.0)
  Sim 2: K = 2, mean shift, dependent features
    PLDA                  9.0 (0.4)     337.6 (5.7)    1.0 (0.0)
    SLDA                 19.3 (0.1)      99.0 (0.0)    1.0 (0.0)
    GLOSS                15.4 (0.1)      39.8 (0.8)    1.0 (0.0)
    GLOSS-D               9.0 (0.0)     203.5 (4.0)    1.0 (0.0)
  Sim 3: K = 4, 1D mean shift, ind. features
    PLDA                 13.8 (0.6)     161.5 (3.7)    1.0 (0.0)
    SLDA                 57.8 (0.2)     152.6 (2.0)    1.9 (0.0)
    GLOSS                31.2 (0.1)     123.8 (1.8)    1.0 (0.0)
    GLOSS-D              18.5 (0.1)     357.5 (2.8)    1.0 (0.0)
  Sim 4: K = 4, mean shift, ind. features
    PLDA                 60.3 (0.1)     336.0 (5.8)    3.0 (0.0)
    SLDA                 65.9 (0.1)     208.8 (1.6)    2.7 (0.0)
    GLOSS                60.7 (0.2)      74.3 (2.2)    2.7 (0.0)
    GLOSS-D              58.8 (0.1)     162.7 (4.9)    2.9 (0.0)


Figure 6.1: TPR versus FPR (in %) for all algorithms and simulations.

Table 6.2: Average TPR and FPR (in %) computed over 25 repetitions.

              Simulation 1     Simulation 2     Simulation 3     Simulation 4
              TPR    FPR       TPR    FPR       TPR    FPR       TPR    FPR
  PLDA        99.0   78.2      96.9   60.3      98.0   15.9      74.3   65.6
  SLDA        73.9   38.5      33.8   16.3      41.6   27.8      50.7   39.5
  GLOSS       64.1   10.6      30.0    4.6      51.1   18.2      26.0   12.1
  GLOSS-D     93.5   39.4      92.1   28.1      95.6   65.5      42.9   29.9

method that does not succeed in uncovering a low-dimensional representation in Simulation 3. The adequacy of the selected features was assessed by the True Positive Rate (TPR) and the False Positive Rate (FPR). The TPR is defined as the ratio of relevant variables that are selected. Similarly, the FPR is the ratio of non-relevant variables that are selected. The best algorithm would be the one that selects all the relevant variables and rejects all the others, that is, TPR = 1 and FPR = 0 simultaneously. PLDA has the best TPR, but a terrible FPR, except in Simulation 3, where it dominates all the other methods. GLOSS has by far the best FPR, with an overall TPR slightly below SLDA. Results are displayed in Figure 6.1 and in Table 6.2 (both in percentages).

6.4 Gene Expression Data

We now compare GLOSS to PLDA and SLDA on three genomic datasets. The Nakayama² dataset contains 105 examples of 22,283 gene expressions for categorizing 10 soft tissue tumors. It was reduced to the 86 examples belonging to the 5 dominant categories (Witten and Tibshirani, 2011). The Ramaswamy³ dataset contains 198 examples of 16,063 gene expressions for categorizing 14 classes of cancer.

² http://www.broadinstitute.org/cancer/software/genepattern/datasets
³ http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS2736


Table 6.3: Experimental results for gene expression data: averages over 10 training/test set splits (with standard deviations) of the test error rates and the number of selected variables.

                         Err (%)           Var
  Nakayama: n = 86, p = 22,283, K = 5
    PLDA                 20.95 (1.3)       10478.7 (2116.3)
    SLDA                 25.71 (1.7)         252.5 (3.1)
    GLOSS                20.48 (1.4)         129.0 (18.6)
  Ramaswamy: n = 198, p = 16,063, K = 14
    PLDA                 38.36 (6.0)       14873.5 (720.3)
    SLDA                    —                   —
    GLOSS                20.61 (6.9)         372.4 (122.1)
  Sun: n = 180, p = 54,613, K = 4
    PLDA                 33.78 (5.9)       21634.8 (7443.2)
    SLDA                 36.22 (6.5)         384.4 (16.5)
    GLOSS                31.77 (4.5)          93.0 (93.6)

Finally, the Sun⁴ dataset contains 180 examples of 54,613 gene expressions for categorizing 4 classes of tumors.

Each dataset was split into a training set and a test set, with respectively 75% and 25% of the examples. Parameter tuning is performed by 10-fold cross-validation, and the test performances are then evaluated. The process is repeated 10 times, with random choices of the training and test set split.

Test error rates and the number of selected variables are presented in Table 6.3. The results for the PLDA algorithm are extracted from Witten and Tibshirani (2011). The three methods have comparable prediction performances on the Nakayama and Sun datasets, but GLOSS performs better on the Ramaswamy data, where the SparseLDA package failed to return a solution, due to numerical problems in the LARS-EN implementation. Regarding the number of selected variables, GLOSS is again much sparser than its competitors.

Finally, Figure 6.2 displays the projection of the observations for the Nakayama and Sun datasets in the first canonical planes estimated by GLOSS and SLDA. For the Nakayama dataset, groups 1 and 2 are well-separated from the other ones in both representations, but GLOSS is more discriminant in the meta-cluster gathering groups 3 to 5. For the Sun dataset, SLDA suffers from a high colinearity of its first canonical variables, which renders the second one almost non-informative. As a result, group 1 is better separated in the first canonical plane with GLOSS.

⁴ http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS1962


Figure 6.2: 2D-representations of the Nakayama and Sun datasets based on the two first discriminant vectors provided by GLOSS and SLDA. The big squares represent class means.

Figure 6.3: USPS digits "1" and "0".

6.5 Correlated Data

When the features are known to be highly correlated, the discrimination algorithm can be improved by using this information in the optimization problem. The structured variant of GLOSS presented in Section 5.6.4, S-GLOSS from now on, was conceived to introduce this prior knowledge easily.

The experiments described in this section are intended to illustrate the effect of combining the group-Lasso sparsity-inducing penalty with a quadratic penalty used as a surrogate of the unknown within-class variance matrix. This preliminary experiment does not include comparisons with other algorithms; more comprehensive experimental results have been left for future work.

For this illustration, we have used a subset of the USPS handwritten digit dataset, made of 16 × 16 pixel images representing digits from 0 to 9. For our purpose, we compare the discriminant direction that separates digits "1" and "0", computed with GLOSS and S-GLOSS. The mean image of every digit is showed in Figure 6.3.

As in Section 5.6.4, we have represented the pixel proximity relationships from Figure 5.2 in a penalty matrix Ω_L, but this time in a 256-node graph. Introducing this new 256 × 256 Laplacian penalty matrix Ω_L in the GLOSS algorithm is straightforward.

The effect of this penalty is fairly evident in Figure 6.4, where the discriminant vector β resulting from a non-penalized execution of GLOSS is compared with the β resulting from a Laplace-penalized execution of S-GLOSS (without group-Lasso penalty). We perfectly distinguish the center of the digit "0" in the discriminant direction obtained by S-GLOSS, which is probably the most important element to discriminate both digits.

Figure 6.5 displays the discriminant directions β obtained by GLOSS and S-GLOSS for a non-zero group-Lasso penalty, with an identical penalization parameter (λ = 0.3). Even if both solutions are sparse, the discriminant vector from S-GLOSS keeps connected pixels that allow to detect strokes, and will probably provide better prediction results.

Figure 6.4: Discriminant direction between digits "1" and "0": β for GLOSS (left) and for S-GLOSS (right).

Figure 6.5: Sparse discriminant direction between digits "1" and "0" (λ = 0.3): β for GLOSS (left) and for S-GLOSS (right).


Discussion

GLOSS is an efficient algorithm that performs sparse LDA, based on the regression of class indicators. Our proposal is equivalent to a penalized LDA problem. This is, to our knowledge, the first approach that enjoys this property in the multi-class setting. This relationship is also amenable to accommodate interesting constraints on the equivalent penalized LDA problem, such as imposing a diagonal structure on the within-class covariance matrix.

Computationally, GLOSS is based on an efficient active set strategy that is amenable to the processing of problems with a large number of variables. The inner optimization problem decouples the p × (K − 1)-dimensional problem into (K − 1) independent p-dimensional problems. The interaction between the (K − 1) problems is relegated to the computation of the common adaptive quadratic penalty. The algorithm presented here is highly efficient in medium to high dimensional setups, which makes it a good candidate for the analysis of gene expression data.

The experimental results confirm the relevance of the approach, which behaves well compared to its competitors, either regarding its prediction abilities or its interpretability (sparsity). Generally, compared to the competing approaches, GLOSS provides extremely parsimonious discriminants without compromising prediction performances. Employing the same features in all discriminant directions enables to generate models that are globally extremely parsimonious, with good prediction abilities. The resulting sparse discriminant directions also allow for visual inspection of data from the low-dimensional representations that can be produced.

The approach has many potential extensions that have not yet been implemented. A first line of development is to consider a broader class of penalties. For example, plain quadratic penalties can also be added to the group penalty, to encode priors about the within-class covariance structure, in the spirit of the Penalized Discriminant Analysis of Hastie et al. (1995). Also, besides the group-Lasso, our framework can be customized to any penalty that is uniformly spread within groups, and many composite or hierarchical penalties that have been proposed for structured data meet this condition.


Part III

Sparse Clustering Analysis


Abstract

Clustering can be defined as the task of grouping samples such that all the elements belonging to one cluster are more "similar" to each other than to the objects belonging to the other groups. There are similarity measures for any data structure: database records or even multimedia objects (audio, video). The similarity concept is closely related to the idea of distance, which is a specific dissimilarity.

Model-based clustering aims to describe a heterogeneous population with a probabilistic model that represents each group with its own distribution. Here, the distributions will be Gaussians, and the different populations are identified with different means and a common covariance matrix.

As in the supervised framework, the traditional clustering techniques perform worse when the number of irrelevant features increases. In this part, we develop Mix-GLOSS, which builds on the supervised GLOSS algorithm to address unsupervised problems, resulting in a clustering mechanism with embedded feature selection.

Chapter 7 reviews different techniques for inducing sparsity in model-based clustering algorithms. The theory that motivates our original formulation of the EM algorithm is developed in Chapter 8, followed by the description of the algorithm in Chapter 9. Its performance is assessed and compared to other model-based sparse clustering mechanisms at the state of the art in Chapter 10.


7 Feature Selection in Mixture Models

7.1 Mixture Models

One of the most popular clustering algorithms is K-means, which aims to partition n observations into K clusters, each observation being assigned to the cluster with the nearest mean (MacQueen, 1967). A generalization of K-means can be made through probabilistic models that represent K subpopulations by a mixture of distributions. Since their first use by Newcomb (1886) for the detection of outlier points, and 8 years later by Pearson (1894) to identify two separate populations of crabs, finite mixtures of distributions have been employed to model a wide variety of random phenomena. These models assume that measurements are taken from a set of individuals, each of which belongs to one out of a number of different classes, while any individual's particular class is unknown. Mixture models can thus address the heterogeneity of a population and are especially well suited to the problem of clustering.

7.1.1 Model

We assume that the observed data $X = (x_1^\top, \dots, x_n^\top)^\top$ have been drawn identically from K different subpopulations in the domain $\mathbb{R}^p$. The generative distribution is a finite mixture model, that is, the data are assumed to be generated from a compounded distribution whose density can be expressed as
\[
  f(x_i) = \sum_{k=1}^K \pi_k f_k(x_i), \quad \forall i \in \{1,\dots,n\},
\]
where K is the number of components, $f_k$ are the densities of the components and $\pi_k$ are the mixture proportions ($\pi_k \in\, ]0,1[\ \forall k$ and $\sum_k \pi_k = 1$). Mixture models transcribe that, given the proportions $\pi_k$ and the distributions $f_k$ for each class, the data is generated according to the following mechanism (a small simulation sketch is given after the list):

• y: each individual is allotted to a class according to a multinomial distribution with parameters $\pi_1, \dots, \pi_K$;

• x: each $x_i$ is assumed to arise from a random vector with probability density function $f_k$.
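To make this generative mechanism concrete, the short sketch below simulates data from such a mixture. It is only an illustration under our own assumptions: the function name is ours, and a common covariance matrix shared by all components is chosen because it is the setting used in the rest of this part.

import numpy as np

def sample_gaussian_mixture(n, pi, means, cov, seed=0):
    """Draw n points from sum_k pi_k N(mu_k, Sigma) with a common covariance."""
    rng = np.random.default_rng(seed)
    K, p = means.shape
    # y: multinomial class allocation with parameters pi_1, ..., pi_K
    y = rng.choice(K, size=n, p=pi)
    # x: each x_i is drawn from the Gaussian component selected by y_i
    x = np.vstack([rng.multivariate_normal(means[k], cov) for k in y])
    return x, y

# small illustrative run: K = 3 components in R^2
pi = np.array([0.5, 0.3, 0.2])
means = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
X, y = sample_gaussian_mixture(200, pi, means, np.eye(2))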

In addition, it is usually assumed that the component densities $f_k$ belong to a parametric family of densities $\phi(\cdot;\theta_k)$. The density of the mixture can then be written as
\[
  f(x_i;\theta) = \sum_{k=1}^K \pi_k\,\phi(x_i;\theta_k), \quad \forall i \in \{1,\dots,n\},
\]
where $\theta = (\pi_1,\dots,\pi_K,\theta_1,\dots,\theta_K)$ is the parameter of the model.

7.1.2 Parameter Estimation: The EM Algorithm

For the estimation of the parameters of the mixture model, Pearson (1894) used the method of moments to estimate the five parameters $(\mu_1,\mu_2,\sigma_1^2,\sigma_2^2,\pi)$ of a univariate Gaussian mixture model with two components. That method required him to solve polynomial equations of degree nine. There are also graphical methods, maximum likelihood methods and Bayesian approaches.

The most widely used process to estimate the parameters is the maximization of the log-likelihood using the EM algorithm. It is typically used to maximize the likelihood for models with latent variables, for which no analytical solution is available (Dempster et al., 1977).

The EM algorithm iterates two steps, called the expectation step (E) and the maximization step (M). Each expectation step involves the computation of the likelihood expectation with respect to the hidden variables, while each maximization step estimates the parameters by maximizing the expected likelihood obtained at the E-step.

Under mild regularity assumptions, this mechanism converges to a local maximum of the likelihood. However, the type of problems targeted is typically characterized by the existence of several local maxima, and global convergence cannot be guaranteed. In practice, the obtained solution depends on the initialization of the algorithm.

Maximum Likelihood Definitions

The likelihood is commonly expressed in its logarithmic version:
\[
  L(\theta; X) = \log\Bigg(\prod_{i=1}^n f(x_i;\theta)\Bigg)
  = \sum_{i=1}^n \log\Bigg(\sum_{k=1}^K \pi_k f_k(x_i;\theta_k)\Bigg), \tag{7.1}
\]
where n is the number of samples, K is the number of components of the mixture (or number of clusters) and $\pi_k$ are the mixture proportions.

To obtain maximum likelihood estimates, the EM algorithm works with the joint distribution of the observations x and the unknown latent variables y, which indicate the cluster membership of every sample. The pair z = (x, y) is called the complete data. The log-likelihood of the complete data is called the complete log-likelihood or


classification log-likelihood:
\[
  L_C(\theta; X, Y) = \log\Bigg(\prod_{i=1}^n f(x_i,y_i;\theta)\Bigg)
  = \sum_{i=1}^n \log\Bigg(\sum_{k=1}^K y_{ik}\,\pi_k f_k(x_i;\theta_k)\Bigg)
  = \sum_{i=1}^n\sum_{k=1}^K y_{ik}\log\big(\pi_k f_k(x_i;\theta_k)\big). \tag{7.2}
\]
The $y_{ik}$ are the binary entries of the indicator matrix Y, with $y_{ik}=1$ if observation i belongs to cluster k and $y_{ik}=0$ otherwise.

The soft membership $t_{ik}(\theta)$ is defined as
\[
  t_{ik}(\theta) = p(Y_{ik}=1\,|\,x_i;\theta) = \frac{\pi_k f_k(x_i;\theta_k)}{f(x_i;\theta)}. \tag{7.3--7.4}
\]
To lighten notations, $t_{ik}(\theta)$ will be denoted $t_{ik}$ when the parameter $\theta$ is clear from context. The regular (7.1) and complete (7.2) log-likelihoods are related as follows:

\[
\begin{aligned}
  L_C(\theta; X, Y) &= \sum_{i,k} y_{ik}\log\big(\pi_k f_k(x_i;\theta_k)\big)
  = \sum_{i,k} y_{ik}\log\big(t_{ik}\, f(x_i;\theta)\big) \\
  &= \sum_{i,k} y_{ik}\log t_{ik} + \sum_{i,k} y_{ik}\log f(x_i;\theta)
  = \sum_{i,k} y_{ik}\log t_{ik} + \sum_{i=1}^n \log f(x_i;\theta) \\
  &= \sum_{i,k} y_{ik}\log t_{ik} + L(\theta; X),
\end{aligned} \tag{7.5}
\]
where $\sum_{i,k} y_{ik}\log t_{ik}$ can be reformulated as
\[
  \sum_{i,k} y_{ik}\log t_{ik}
  = \sum_{i=1}^n\sum_{k=1}^K y_{ik}\log\big(p(Y_{ik}=1\,|\,x_i;\theta)\big)
  = \sum_{i=1}^n \log\big(p(y_i\,|\,x_i;\theta)\big)
  = \log\big(p(Y\,|\,X;\theta)\big),
\]
since exactly one $y_{ik}$ equals one for each observation i. As a result, the relationship (7.5) can be rewritten as
\[
  L(\theta; X) = L_C(\theta; Z) - \log\big(p(Y\,|\,X;\theta)\big). \tag{7.6}
\]


Likelihood Maximization

The complete log-likelihood cannot be assessed because the variables $y_{ik}$ are unknown. However, it is possible to estimate the value of the log-likelihood by taking expectations in (7.6), conditionally on a current value of $\theta$:
\[
  L(\theta; X) =
  \underbrace{\mathbb{E}_{Y\sim p(\cdot|X;\theta^{(t)})}\big[L_C(\theta; X, Y)\big]}_{Q(\theta,\theta^{(t)})}
  + \underbrace{\mathbb{E}_{Y\sim p(\cdot|X;\theta^{(t)})}\big[-\log p(Y|X;\theta)\big]}_{H(\theta,\theta^{(t)})}.
\]
In this expression, $H(\theta,\theta^{(t)})$ is an entropy term and $Q(\theta,\theta^{(t)})$ is the conditional expectation of the complete log-likelihood. Let us define an increment of the log-likelihood as $\Delta L = L(\theta^{(t+1)}; X) - L(\theta^{(t)}; X)$. Then $\theta^{(t+1)} = \operatorname{argmax}_\theta Q(\theta,\theta^{(t)})$ also increases the log-likelihood:
\[
  \Delta L = \underbrace{\big(Q(\theta^{(t+1)},\theta^{(t)}) - Q(\theta^{(t)},\theta^{(t)})\big)}_{\ge 0 \text{ by definition of iteration } t+1}
  + \underbrace{\big(H(\theta^{(t+1)},\theta^{(t)}) - H(\theta^{(t)},\theta^{(t)})\big)}_{\ge 0 \text{ by Jensen's inequality}}.
\]
Therefore, it is possible to maximize the likelihood by optimizing $Q(\theta,\theta^{(t)})$. The relationship between $Q(\theta,\theta')$ and $L(\theta; X)$ is developed in deeper detail in Appendix F, to show how the value of $L(\theta; X)$ can be recovered from $Q(\theta,\theta^{(t)})$.

For the mixture model problem, $Q(\theta,\theta')$ is
\[
  Q(\theta,\theta') = \mathbb{E}_{Y\sim p(Y|X;\theta')}\big[L_C(\theta; X, Y)\big]
  = \sum_{i,k} p(Y_{ik}=1\,|\,x_i;\theta')\log\big(\pi_k f_k(x_i;\theta_k)\big)
  = \sum_{i=1}^n\sum_{k=1}^K t_{ik}(\theta')\log\big(\pi_k f_k(x_i;\theta_k)\big). \tag{7.7}
\]
Due to its similarity with the expression of the complete likelihood (7.2), $Q(\theta,\theta')$ is also known as the weighted likelihood. In (7.7), the weights $t_{ik}(\theta')$ are the posterior probabilities of cluster memberships.

Hence, the EM algorithm sketched above results in:

• Initialization (not iterated): choice of the initial parameter $\theta^{(0)}$;

• E-step: evaluation of $Q(\theta,\theta^{(t)})$, using $t_{ik}(\theta^{(t)})$ (7.4) in (7.7);

• M-step: calculation of $\theta^{(t+1)} = \operatorname{argmax}_\theta Q(\theta,\theta^{(t)})$.


Gaussian Model

In the particular case of a Gaussian mixture model with common covariance matrix $\Sigma$ and different mean vectors $\mu_k$, the mixture density is
\[
  f(x_i;\theta) = \sum_{k=1}^K \pi_k f_k(x_i;\theta_k)
  = \sum_{k=1}^K \pi_k \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}
    \exp\Big\{-\tfrac{1}{2}(x_i-\mu_k)^\top\Sigma^{-1}(x_i-\mu_k)\Big\}.
\]
At the E-step, the posterior probabilities $t_{ik}$ are computed as in (7.4) with the current $\theta^{(t)}$ parameters; then the M-step maximizes $Q(\theta,\theta^{(t)})$ (7.7), whose form is as follows:
\[
\begin{aligned}
  Q(\theta,\theta^{(t)}) &= \sum_{i,k} t_{ik}\log(\pi_k)
  - \sum_{i,k} t_{ik}\log\big((2\pi)^{p/2}|\Sigma|^{1/2}\big)
  - \frac{1}{2}\sum_{i,k} t_{ik}(x_i-\mu_k)^\top\Sigma^{-1}(x_i-\mu_k) \\
  &= \sum_{k} t_k\log(\pi_k)
  - \underbrace{\frac{np}{2}\log(2\pi)}_{\text{constant term}}
  - \frac{n}{2}\log(|\Sigma|)
  - \frac{1}{2}\sum_{i,k} t_{ik}(x_i-\mu_k)^\top\Sigma^{-1}(x_i-\mu_k) \\
  &\equiv \sum_{k} t_k\log(\pi_k) - \frac{n}{2}\log(|\Sigma|)
  - \sum_{i,k} t_{ik}\Big(\tfrac{1}{2}(x_i-\mu_k)^\top\Sigma^{-1}(x_i-\mu_k)\Big),
\end{aligned} \tag{7.8}
\]
where
\[
  t_k = \sum_{i=1}^n t_{ik}. \tag{7.9}
\]
The M-step, which maximizes this expression with respect to $\theta$, applies the following updates defining $\theta^{(t+1)}$:
\[
  \pi_k^{(t+1)} = \frac{t_k}{n}, \tag{7.10}
\]
\[
  \mu_k^{(t+1)} = \frac{\sum_i t_{ik}\,x_i}{t_k}, \tag{7.11}
\]
\[
  \Sigma^{(t+1)} = \frac{1}{n}\sum_k W_k, \quad\text{with } W_k = \sum_i t_{ik}(x_i-\mu_k)(x_i-\mu_k)^\top. \tag{7.12--7.13}
\]

The derivations are detailed in Appendix G
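As an illustration of the E-step (7.4) and of the M-step updates (7.9)–(7.13), a bare-bones EM sketch for a Gaussian mixture with common covariance is given below. It is a minimal, hedged rendition (naive initialization, no safeguard against degenerate covariances or vanishing clusters), not the Mix-GLOSS implementation described in the following chapters.

import numpy as np
from scipy.stats import multivariate_normal

def em_common_cov(X, K, n_iter=100, seed=0):
    n, p = X.shape
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(n, K, replace=False)]      # initial centroids
    Sigma = np.cov(X, rowvar=False)              # initial common covariance
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: posterior probabilities t_ik, Eq. (7.4)
        dens = np.column_stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma)
                                for k in range(K)])
        T = dens / dens.sum(axis=1, keepdims=True)
        # M-step: Eqs. (7.9)-(7.13)
        tk = T.sum(axis=0)                        # (7.9)
        pi = tk / n                               # (7.10)
        mu = (T.T @ X) / tk[:, None]              # (7.11)
        Sigma = sum((T[:, k, None] * (X - mu[k])).T @ (X - mu[k])
                    for k in range(K)) / n        # (7.12)-(7.13)
    return pi, mu, Sigma, T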

7.2 Feature Selection in Model-Based Clustering

When common covariance matrices are assumed, Gaussian mixtures are related to LDA, with partitions defined by linear decision rules. When every cluster has its own covariance matrix $\Sigma_k$, Gaussian mixtures are associated to quadratic discriminant analysis (QDA), with quadratic boundaries.

In the high-dimensional, low-sample setting, numerical issues appear in the estimation of the covariance matrix. To avoid those singularities, regularization may be applied. A regularized trade-off between LDA and QDA (RDA) was proposed by Friedman (1989). Bensmail and Celeux (1996) extended this algorithm by rewriting the covariance matrix in terms of its eigenvalue decomposition $\Sigma_k = \lambda_k D_k A_k D_k^\top$ (Banfield and Raftery, 1993). These regularization schemes address singularity and stability issues, but they do not induce parsimonious models.

In this chapter, we review some techniques to induce sparsity in model-based clustering algorithms. This sparsity refers to the rule that assigns examples to classes: clustering is still performed in the original p-dimensional space, but the decision rule can be expressed with only a few coordinates of this high-dimensional space.

7.2.1 Based on Penalized Likelihood

Penalized log-likelihood maximization is a popular estimation technique for mixture models. It is typically achieved by the EM algorithm, using mixture models for which the allocation of examples is expressed as a simple function of the input features. For example, for Gaussian mixtures with a common covariance matrix, the log-ratio of posterior probabilities is a linear function of x:
\[
  \log\bigg(\frac{p(Y_k=1\,|\,x)}{p(Y_\ell=1\,|\,x)}\bigg)
  = x^\top\Sigma^{-1}(\mu_k-\mu_\ell)
  - \frac{1}{2}(\mu_k+\mu_\ell)^\top\Sigma^{-1}(\mu_k-\mu_\ell)
  + \log\frac{\pi_k}{\pi_\ell}.
\]
In this model, a simple way of introducing sparsity in the discriminant vectors $\Sigma^{-1}(\mu_k-\mu_\ell)$ is to constrain $\Sigma$ to be diagonal and to favor sparse means $\mu_k$. Indeed, for Gaussian mixtures with a common diagonal covariance matrix, if all means have the same value on dimension j, then variable j is useless for class allocation and can be discarded. The means can be penalized by the L1 norm,
\[
  \lambda\sum_{k=1}^K\sum_{j=1}^p |\mu_{kj}|,
\]
as proposed by Pan et al. (2006); Pan and Shen (2007). Zhou et al. (2009) consider more complex penalties on full covariance matrices:
\[
  \lambda_1\sum_{k=1}^K\sum_{j=1}^p |\mu_{kj}|
  + \lambda_2\sum_{k=1}^K\sum_{j=1}^p\sum_{m=1}^p \big|(\Sigma_k^{-1})_{jm}\big|.
\]
In their algorithm, they make use of the graphical Lasso to estimate the covariances. Even if their formulation induces sparsity on the parameters, their combination of L1 penalties does not directly target decision rules based on few variables, and thus does not guarantee parsimonious models.


Guo et al. (2010) propose a variation with a Pairwise Fusion Penalty (PFP):
\[
  \lambda\sum_{j=1}^p\ \sum_{1\le k\le k'\le K} |\mu_{kj}-\mu_{k'j}|.
\]
This PFP regularization does not shrink the means to zero, but towards each other. If the jth components of all cluster means are driven to the same value, that variable can be considered as non-informative.

An $L_{1\infty}$ penalty is used by Wang and Zhu (2008) and Kuan et al. (2010) to penalize the likelihood, encouraging null groups of features:
\[
  \lambda\sum_{j=1}^p \big\|(\mu_{1j},\mu_{2j},\dots,\mu_{Kj})\big\|_\infty.
\]
One group is defined for each variable j, as the set of the jth components of the K means, $(\mu_{1j},\dots,\mu_{Kj})$. The $L_{1\infty}$ penalty forces zeros at the group level, favoring the removal of the corresponding feature. This method seems to produce parsimonious models and good partitions within a reasonable computing time; in addition, the code is publicly available. Xie et al. (2008b) apply a group-Lasso penalty. Their principle describes a vertical mean grouping (VMG, with the same groups as Xie et al. (2008a)) and a horizontal mean grouping (HMG). VMG allows to get real feature selection, because it forces null values for the same variable in all cluster means:
\[
  \lambda\sqrt{K}\sum_{j=1}^p \sqrt{\sum_{k=1}^K \mu_{kj}^2}.
\]
The clustering algorithm of VMG differs from ours, but the proposed group penalty is the same; however, no code allowing to test it is available on the authors' website. A small numerical comparison of these penalties is sketched below.
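The sketch below evaluates the raw value of each of the penalties reviewed above on a toy K×p matrix of cluster means; it is an illustration only (the weighting constants λ are omitted and the variable names are ours), meant to show why group-type penalties can zero out a whole column of means.

import numpy as np
from itertools import combinations

M = np.array([[1.0, 0.0, 2.0],        # toy K x p matrix of cluster means
              [1.0, 0.0, -2.0]])
K, p = M.shape

l1 = np.abs(M).sum()                                      # L1 on the means (Pan & Shen)
pfp = sum(np.abs(M[k] - M[kp]).sum()                      # pairwise fusion penalty (Guo et al.)
          for k, kp in combinations(range(K), 2))
linf = np.abs(M).max(axis=0).sum()                        # L1-infinity: sum_j max_k |mu_kj|
group = np.sqrt(K) * np.sqrt((M ** 2).sum(axis=0)).sum()  # group-Lasso, vertical mean grouping

# variable j = 1 (all means equal to zero) contributes nothing to the
# L1-infinity and group-Lasso terms, which is what allows it to be discarded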

The optimization of a penalized likelihood by means of an EM algorithm can be reformulated by rewriting the maximization expressions from the M-step as a penalized optimal scoring regression. Roth and Lange (2004) implemented it for two-cluster problems, using an L1 penalty to encourage sparsity on the discriminant vector. The generalization from quadratic to non-quadratic penalties is quickly outlined in this work. We extend this work by considering an arbitrary number of clusters and by formalizing the link between penalized optimal scoring and penalized likelihood estimation.

7.2.2 Based on Model Variants

The algorithm proposed by Law et al. (2004) takes a different stance. The authors define feature relevancy considering conditional independence: the jth feature is presumed uninformative if its distribution is independent of the class labels. The density is expressed as


\[
  f(x_i\,|\,\phi,\pi,\theta,\nu) = \sum_{k=1}^K \pi_k \prod_{j=1}^p
  \big[f(x_{ij}\,|\,\theta_{jk})\big]^{\phi_j}\,\big[h(x_{ij}\,|\,\nu_j)\big]^{1-\phi_j},
\]
where $f(\cdot\,|\,\theta_{jk})$ is the distribution function for relevant features and $h(\cdot\,|\,\nu_j)$ is the distribution function for the irrelevant ones. The binary vector $\phi = (\phi_1,\phi_2,\dots,\phi_p)$ represents relevance, with $\phi_j = 1$ if the jth feature is informative and $\phi_j = 0$ otherwise. The saliency for variable j is then formalized as $\rho_j = P(\phi_j = 1)$, so all $\phi_j$ must be treated as missing variables. Thus, the set of parameters is $\{\pi_k, \theta_{jk}, \nu_j, \rho_j\}$; their estimation is done by means of the EM algorithm (Law et al., 2004).

An original and recent technique is the Fisher-EM algorithm proposed by Bouveyron and Brunet (2012b,a). Fisher-EM is a modified version of EM that runs in a latent space. This latent space is defined by an orthogonal projection matrix $U \in \mathbb{R}^{p\times(K-1)}$, which is updated inside the EM loop with a new step called the Fisher step (F-step from now on), which maximizes a multi-class Fisher's criterion
\[
  \operatorname{tr}\Big((U^\top\Sigma_W U)^{-1}U^\top\Sigma_B U\Big), \tag{7.14}
\]
so as to maximize the separability of the data. The E-step is the standard one, computing the posterior probabilities. Then the F-step updates the projection matrix that projects the data to the latent space. Finally, the M-step estimates the parameters by maximizing the conditional expectation of the complete log-likelihood. Those parameters can be rewritten as a function of the projection matrix U and the model parameters in the latent space, such that the U matrix enters into the M-step equations.

To induce feature selection, Bouveyron and Brunet (2012a) suggest three possibilities. The first one results in the best sparse orthogonal approximation $\tilde U$ of the matrix U which maximizes (7.14). This sparse approximation is defined as the solution of
\[
  \min_{\tilde U\in\mathbb{R}^{p\times(K-1)}} \big\|X_U - X\tilde U\big\|_F^2 + \lambda\sum_{k=1}^{K-1}\|\tilde u_k\|_1,
\]
where $X_U = XU$ is the input data projected in the non-sparse space and $\tilde u_k$ is the kth column vector of the projection matrix $\tilde U$. The second possibility is inspired by Qiao et al. (2009) and reformulates Fisher's discriminant (7.14), used to compute the projection matrix, as a regression criterion penalized by a mixture of Lasso and Elastic net:

\[
  \min_{A,B\in\mathbb{R}^{p\times(K-1)}}\ \sum_{k=1}^{K} \big\|R_W^{-\top}H_{B,k} - AB^\top H_{B,k}\big\|_2^2
  + \rho\sum_{j=1}^{K-1}\beta_j^\top\Sigma_W\beta_j
  + \lambda\sum_{j=1}^{K-1}\|\beta_j\|_1
  \quad\text{s.t. } A^\top A = I_{K-1},
\]
where $H_B \in \mathbb{R}^{p\times K}$ is a matrix defined conditionally on the posterior probabilities $t_{ik}$, satisfying $H_B H_B^\top = \Sigma_B$, and $H_{B,k}$ is the kth column of $H_B$; $R_W \in \mathbb{R}^{p\times p}$ is an upper triangular matrix resulting from the Cholesky decomposition of $\Sigma_W$; $\Sigma_W$ and $\Sigma_B$ are the $p\times p$ within-class and between-class covariance matrices in the observation space. $A \in \mathbb{R}^{p\times(K-1)}$ and $B \in \mathbb{R}^{p\times(K-1)}$ are the solutions of the optimization problem, such that $B = [\beta_1,\dots,\beta_{K-1}]$ is the best sparse approximation of U.

The last possibility suggests the solution of Fisher's discriminant (7.14) as the solution of the following constrained optimization problem:
\[
  \min_{U\in\mathbb{R}^{p\times(K-1)}}\ \sum_{j=1}^p \big\|\Sigma_{B,j} - UU^\top\Sigma_{B,j}\big\|_2^2
  \quad\text{s.t. } U^\top U = I_{K-1},
\]
where $\Sigma_{B,j}$ is the jth column of the between-class covariance matrix in the observation space. This problem can be solved by a penalized version of the singular value decomposition proposed by Witten et al. (2009), resulting in a sparse approximation of U.

To comply with the constraint stating that the columns of U are orthogonal, the first and the second options must be followed by a singular value decomposition of U to get orthogonality. This is not necessary with the third option, since the penalized version of SVD already guarantees orthogonality.

However, there is a lack of guarantees regarding convergence. Bouveyron states: "the update of the orientation matrix U in the F-step is done by maximizing the Fisher criterion and not by directly maximizing the expected complete log-likelihood as required in the EM algorithm theory. From this point of view, the convergence of the Fisher-EM algorithm cannot therefore be guaranteed". Immediately after this paragraph, we can read that under certain assumptions their algorithm converges: "the model [...] which assumes the equality and the diagonality of covariance matrices, the F-step of the Fisher-EM algorithm satisfies the convergence conditions of the EM algorithm theory and the convergence of the Fisher-EM algorithm can be guaranteed in this case. For the other discriminant latent mixture models, although the convergence of the Fisher-EM procedure cannot be guaranteed, our practical experience has shown that the Fisher-EM algorithm rarely fails to converge with these models if correctly initialized".

7.2.3 Based on Model Selection

Some clustering algorithms recast the feature selection problem as a model selection problem. Following this idea, Raftery and Dean (2006) model the observations as a mixture of Gaussian distributions. To discover a subset of relevant features (and its superfluous complement), they define three subsets of variables:

• $X^{(1)}$: set of selected relevant variables;

• $X^{(2)}$: set of variables being considered for inclusion in or exclusion from $X^{(1)}$;

• $X^{(3)}$: set of non-relevant variables.


With those subsets, they define two different models, where Y is the partition to consider:

• M1:
\[
  f(X\,|\,Y) = f\big(X^{(1)},X^{(2)},X^{(3)}\,|\,Y\big)
  = f\big(X^{(3)}\,|\,X^{(2)},X^{(1)}\big)\, f\big(X^{(2)}\,|\,X^{(1)}\big)\, f\big(X^{(1)}\,|\,Y\big)
\]

• M2:
\[
  f(X\,|\,Y) = f\big(X^{(1)},X^{(2)},X^{(3)}\,|\,Y\big)
  = f\big(X^{(3)}\,|\,X^{(2)},X^{(1)}\big)\, f\big(X^{(2)},X^{(1)}\,|\,Y\big)
\]

Model M1 means that the variables in $X^{(2)}$ are independent of the clustering Y; model M2 states that the variables in $X^{(2)}$ depend on the clustering Y. To simplify the algorithm, the subset $X^{(2)}$ is only updated one variable at a time. Therefore, deciding the relevance of variable $X^{(2)}$ amounts to a model selection between M1 and M2. The selection is done via the Bayes factor

\[
  B_{12} = \frac{f(X\,|\,M_1)}{f(X\,|\,M_2)},
\]
where the high-dimensional $f(X^{(3)}\,|\,X^{(2)},X^{(1)})$ cancels from the ratio:
\[
  B_{12} = \frac{f\big(X^{(1)},X^{(2)},X^{(3)}\,|\,M_1\big)}{f\big(X^{(1)},X^{(2)},X^{(3)}\,|\,M_2\big)}
  = \frac{f\big(X^{(2)}\,|\,X^{(1)},M_1\big)\, f\big(X^{(1)}\,|\,M_1\big)}{f\big(X^{(2)},X^{(1)}\,|\,M_2\big)}.
\]
This factor is approximated, since the integrated likelihoods $f(X^{(1)}\,|\,M_1)$ and $f(X^{(2)},X^{(1)}\,|\,M_2)$ are difficult to calculate exactly; Raftery and Dean (2006) use the BIC approximation. The computation of $f(X^{(2)}\,|\,X^{(1)},M_1)$, if there is only one variable in $X^{(2)}$, can be represented as a linear regression of variable $X^{(2)}$ on the variables in $X^{(1)}$; there is also a BIC approximation for this term.

Maugis et al. (2009a) have proposed a variation of the algorithm developed by Raftery and Dean. They define three subsets of variables: the relevant and irrelevant subsets ($X^{(1)}$ and $X^{(3)}$) remain the same, but $X^{(2)}$ is reformulated as a subset of relevant variables that explains the irrelevance through a multidimensional regression. This algorithm also uses a backward stepwise strategy, instead of the forward stepwise search used by Raftery and Dean (2006). Their algorithm allows to define blocks of indivisible variables that, in certain situations, improve the clustering and its interpretability.

Both algorithms are well motivated and appear to produce good results; however, the quantity of computation needed to test the different subsets of variables requires a huge computation time. In practice, they cannot be used for the amount of data considered in this thesis.


8 Theoretical Foundations

In this chapter, we develop Mix-GLOSS, which uses the GLOSS algorithm conceived for supervised classification (see Chapter 5) to solve clustering problems. The goal here is similar, that is, providing an assignment of examples to clusters based on few features.

We use a modified version of the EM algorithm whose M-step is formulated as a penalized linear regression of a scaled indicator matrix, that is, a penalized optimal scoring problem. This idea was originally proposed by Hastie and Tibshirani (1996) to build reduced-rank decision rules using less than K − 1 discriminant directions. Their motivation was mainly driven by stability issues; no sparsity-inducing mechanism was introduced in the construction of the discriminant directions. Roth and Lange (2004) pursued this idea for binary clustering problems, where sparsity was introduced by a Lasso penalty applied to the OS problem. Besides extending the work of Roth and Lange (2004) to an arbitrary number of clusters, we draw links between the OS penalty and the parameters of the Gaussian model.

In the subsequent sections, we provide the principles that allow to solve the M-step as an optimal scoring problem. The feature selection technique is embedded by means of a group-Lasso penalty. We must then guarantee that the equivalence between the M-step and the OS problem holds for our penalty. As with GLOSS, this is accomplished with a variational approach of the group-Lasso. Finally, some considerations regarding the criterion that is optimized with this modified EM are provided.

8.1 Resolving EM with Optimal Scoring

In the previous chapters, EM was presented as an iterative algorithm that computes a maximum likelihood estimate through the maximization of the expected complete log-likelihood. This section explains how a penalized OS regression embedded into an EM algorithm produces a penalized likelihood estimate.

8.1.1 Relationship Between the M-Step and Linear Discriminant Analysis

LDA is typically used in a supervised learning framework for classification and dimension reduction. It looks for a projection of the data where the ratio of between-class variance to within-class variance is maximized (see Appendix C). Classification in the LDA domain is based on the Mahalanobis distance
\[
  d(x_i,\mu_k) = (x_i-\mu_k)^\top\Sigma_W^{-1}(x_i-\mu_k),
\]
where $\mu_k$ are the p-dimensional centroids and $\Sigma_W$ is the $p\times p$ common within-class covariance matrix.


The likelihood equations in the M-step, (7.11) and (7.12), can be interpreted as the mean and covariance estimates of a weighted and augmented LDA problem (Hastie and Tibshirani, 1996), where the n observations are replicated K times and weighted by $t_{ik}$ (the posterior probabilities computed at the E-step).

Having replicated the data vectors, Hastie and Tibshirani (1996) remark that the parameters maximizing the mixture likelihood in the M-step of the EM algorithm, (7.11) and (7.12), can also be defined as the maximizers of the weighted and augmented likelihood
\[
  2\,l_{\text{weight}}(\mu,\Sigma) = -\sum_{i=1}^n\sum_{k=1}^K t_{ik}\,d(x_i,\mu_k) - n\log(|\Sigma_W|),
\]
which arises when considering a weighted and augmented LDA problem. This viewpoint provides the basis for an alternative maximization of the penalized maximum likelihood in Gaussian mixtures.

8.1.2 Relationship Between Optimal Scoring and Linear Discriminant Analysis

The equivalence between penalized optimal scoring problems and penalized linear discriminant analysis has already been detailed in Section 4.1, in the supervised learning framework. This is a critical part of the link between the M-step of an EM algorithm and optimal scoring regression.

8.1.3 Clustering Using Penalized Optimal Scoring

The solution of the penalized optimal scoring regression in the M-step is a coefficient matrix $B_{\mathrm{OS}}$, analytically related to Fisher's discriminant directions $B_{\mathrm{LDA}}$ for the data (X, Y), where Y is the current (hard or soft) cluster assignment. In order to compute the posterior probabilities $t_{ik}$ in the E-step, the distance between the samples $x_i$ and the centroids $\mu_k$ must be evaluated. Depending on whether we are working in the input domain, the OS domain or the LDA domain, different expressions are used for the distances (see Section 4.2.2 for more details). Mix-GLOSS works in the LDA domain, based on the following expression:
\[
  d(x_i,\mu_k) = \big\|(x_i-\mu_k)^\top B_{\mathrm{LDA}}\big\|_2^2 - 2\log(\pi_k).
\]
This distance defines the computation of the posterior probabilities $t_{ik}$ in the E-step (see Section 4.2.3). Putting together all those elements, the complete clustering algorithm can be summarized as:


1. Initialize the membership matrix Y (for example by the K-means algorithm).

2. Solve the p-OS problem as
\[
  B_{\mathrm{OS}} = \big(X^\top X + \lambda\Omega\big)^{-1}X^\top Y\Theta,
\]
where $\Theta$ are the $K-1$ leading eigenvectors of $Y^\top X\big(X^\top X + \lambda\Omega\big)^{-1}X^\top Y$.

3. Map X to the LDA domain: $X_{\mathrm{LDA}} = XB_{\mathrm{OS}}D$, with $D = \operatorname{diag}\big(\alpha_k^{-1}(1-\alpha_k^2)^{-\frac12}\big)$.

4. Compute the centroids M in the LDA domain.

5. Evaluate distances in the LDA domain.

6. Translate distances into posterior probabilities $t_{ik}$ with
\[
  t_{ik} \propto \exp\bigg[-\frac{d(x,\mu_k) - 2\log(\pi_k)}{2}\bigg]. \tag{8.1}
\]

7. Update the labels using the posterior probabilities matrix: Y = T.

8. Go back to step 2 and iterate until the $t_{ik}$ converge.

Items 2 to 5 can be interpreted as the M-step and Item 6 as the E-step in this alternative view of the EM algorithm for Gaussian mixtures. A schematic rendition of this loop is sketched below.
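The sketch below strings steps 1–8 together. It is only a schematic illustration under simplifying assumptions of ours: the M-step uses a plain ridge penalty Ω = I instead of the group-Lasso machinery of GLOSS, the scaling matrix D is omitted, and the centroids are the posterior-weighted means in the projected space. The orthogonality constraint on Θ is handled through a generalized eigenproblem.

import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def os_em_clustering(X, K, lam=1.0, n_iter=50, seed=0):
    n, p = X.shape
    Y = np.eye(K)[KMeans(K, n_init=10, random_state=seed).fit_predict(X)]  # step 1
    for _ in range(n_iter):
        # step 2: penalized optimal scoring (ridge penalty used for illustration)
        A = np.linalg.inv(X.T @ X + lam * np.eye(p))
        M = Y.T @ X @ A @ X.T @ Y
        evals, evecs = eigh(M, Y.T @ Y)              # eigenvectors satisfy Theta' Y'Y Theta = I
        Theta = evecs[:, np.argsort(evals)[::-1][:K - 1]]
        B = A @ X.T @ Y @ Theta
        # steps 3-4: project the data and compute centroids (D omitted in this sketch)
        Z = X @ B
        pi = Y.mean(axis=0)
        mu = (Y.T @ Z) / Y.sum(axis=0)[:, None]
        # steps 5-6: distances -> posterior probabilities, Eq. (8.1)
        d = ((Z[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        logit = -0.5 * (d - 2 * np.log(pi))
        T = np.exp(logit - logit.max(axis=1, keepdims=True))
        T /= T.sum(axis=1, keepdims=True)
        if np.abs(T - Y).mean() < 1e-4:              # step 8: convergence of t_ik
            break
        Y = T                                        # step 7
    return T.argmax(axis=1), B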

8.1.4 From Sparse Optimal Scoring to Sparse Linear Discriminant Analysis

In the previous section, we schemed a clustering algorithm that replaces the M-step with a penalized OS regression. This modified version of EM holds for any quadratic penalty. We extend this equivalence to sparsity-inducing penalties through the quadratic variational approach to the group-Lasso provided in Section 4.3. We now look for a formal equivalence between this penalty and penalized maximum likelihood for Gaussian mixtures.

8.2 Optimized Criterion

In the classical EM for Gaussian mixtures, the M-step maximizes the weighted likelihood $Q(\theta,\theta')$ (7.7), so as to maximize the likelihood $L(\theta)$ (see Section 7.1.2). Replacing the M-step by a penalized optimal scoring problem is possible, and the link between the penalized optimal scoring problem and penalized LDA holds, but it remains to relate this penalized LDA problem to a penalized maximum likelihood criterion for the Gaussian mixture.

This penalized likelihood cannot be rigorously interpreted as a maximum a posteriori criterion, in particular because the penalty only operates on the covariance matrix $\Sigma$ (there is no prior on the means and proportions of the mixture). We however believe that the Bayesian interpretation provides some insight, and we detail it in what follows.

8.2.1 A Bayesian Derivation

This section sketches a Bayesian treatment of inference limited to our present needs, where penalties are to be interpreted as prior distributions over the parameters of the probabilistic model to be estimated. Further details can be found in Bishop (2006, Section 2.3.6) and in Gelman et al. (2003, Section 3.6).

The model proposed in this thesis considers a classical maximum likelihood estimation for the means and a penalized common covariance matrix. This penalization can be interpreted as arising from a prior on this parameter.

The prior over the covariance matrix of a Gaussian variable is classically expressed as a Wishart distribution, since it is a conjugate prior:
\[
  f(\Sigma\,|\,\Lambda_0,\nu_0) = \frac{1}{2^{np/2}\,|\Lambda_0|^{n/2}\,\Gamma_p(\frac n2)}\,
  |\Sigma^{-1}|^{\frac{\nu_0-p-1}{2}}
  \exp\Big\{-\frac12\operatorname{tr}\big(\Lambda_0^{-1}\Sigma^{-1}\big)\Big\},
\]
where $\nu_0$ is the number of degrees of freedom of the distribution, $\Lambda_0$ is a $p\times p$ scale matrix, and $\Gamma_p$ is the multivariate gamma function, defined as
\[
  \Gamma_p(n/2) = \pi^{p(p-1)/4}\prod_{j=1}^p \Gamma\big(n/2 + (1-j)/2\big).
\]

The posterior distribution can be maximized, similarly to the likelihood, through the maximization of
\[
\begin{aligned}
  Q(\theta,\theta') &+ \log\big(f(\Sigma\,|\,\Lambda_0,\nu_0)\big) \\
  &= \sum_{k=1}^K t_k\log\pi_k - \frac{(n+1)p}{2}\log 2 - \frac n2\log|\Lambda_0|
  - \frac{p(p+1)}{4}\log(\pi)
  - \sum_{j=1}^p \log\Gamma\Big(\frac n2 + \frac{1-j}{2}\Big)
  - \frac{\nu_n-p-1}{2}\log|\Sigma|
  - \frac12\operatorname{tr}\big(\Lambda_n^{-1}\Sigma^{-1}\big) \\
  &\equiv \sum_{k=1}^K t_k\log\pi_k - \frac n2\log|\Lambda_0|
  - \frac{\nu_n-p-1}{2}\log|\Sigma|
  - \frac12\operatorname{tr}\big(\Lambda_n^{-1}\Sigma^{-1}\big),
\end{aligned} \tag{8.2}
\]
with
\[
  t_k = \sum_{i=1}^n t_{ik}, \qquad
  \nu_n = \nu_0 + n, \qquad
  \Lambda_n^{-1} = \Lambda_0^{-1} + S_0, \qquad
  S_0 = \sum_{i=1}^n\sum_{k=1}^K t_{ik}(x_i-\mu_k)(x_i-\mu_k)^\top.
\]
Details of these calculations can be found in textbooks (for example Bishop, 2006; Gelman et al., 2003).

8.2.2 Maximum a Posteriori Estimator

The maximization of (8.2) with respect to $\mu_k$ and $\pi_k$ is of course not affected by the additional prior term, where only the covariance $\Sigma$ intervenes. The MAP estimator for $\Sigma$ is simply obtained by deriving (8.2) with respect to $\Sigma$. The details of the calculations follow the same lines as the ones for maximum likelihood, detailed in Appendix G. The resulting estimator for $\Sigma$ is
\[
  \widehat\Sigma_{\mathrm{MAP}} = \frac{1}{\nu_0+n-p-1}\big(\Lambda_0^{-1} + S_0\big), \tag{8.3}
\]
where $S_0$ is the matrix defined in Equation (8.2). The maximum a posteriori estimator of the within-class covariance matrix (8.3) can thus be identified to the penalized within-class variance (4.19) resulting from the p-OS regression (4.16a), if $\nu_0$ is chosen to be $p+1$ and setting $\Lambda_0^{-1} = \lambda\Omega$, where $\Omega$ is the penalty matrix from the group-Lasso regularization (4.25).
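As a small sanity check of (8.3) and of the identification $\nu_0 = p+1$, $\Lambda_0^{-1} = \lambda\Omega$, the estimator can be written in two lines of code; the function name is ours and the inputs are assumed to be precomputed.

import numpy as np

def sigma_map(S0, Omega, lam, n, p):
    """MAP estimate (8.3) with nu0 = p + 1 and Lambda0^{-1} = lam * Omega,
    so that the denominator nu0 + n - p - 1 reduces to n."""
    nu0 = p + 1
    return (lam * Omega + S0) / (nu0 + n - p - 1)   # i.e. (lam*Omega + S0) / n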


9 Mix-GLOSS Algorithm

Mix-GLOSS is an algorithm for unsupervised classification that embeds feature selection, resulting in parsimonious decision rules. It is based on the GLOSS algorithm developed in Chapter 5, which has been adapted for clustering. In this chapter, I describe the details of the implementation of Mix-GLOSS and of the model selection mechanism.

9.1 Mix-GLOSS

The implementation of Mix-GLOSS involves three nested loops, as schemed in Figure 9.1. The inner one is an EM algorithm that, for a given value of the regularization parameter λ, iterates between an M-step, where the parameters of the model are estimated, and an E-step, where the corresponding posterior probabilities are computed. The main outputs of the EM are the coefficient matrix B, which projects the input data X onto the best subspace (in Fisher's sense), and the posteriors $t_{ik}$.

When several values of the penalty parameter are tested, we give them to the algorithm in ascending order, and the algorithm is initialized by the solution found for the previous λ value. This process continues until all the penalty parameter values have been tested, if a vector of penalty parameters was provided, or until a given sparsity is achieved, as measured by the number of variables estimated to be relevant.

The outer loop implements complete repetitions of the clustering algorithm for all the penalty parameter values, with the purpose of choosing the best execution. This loop alleviates the local minima issues by resorting to multiple initializations of the partition.

9.1.1 Outer Loop: Whole Algorithm Repetitions

This loop performs a user-defined number of repetitions of the clustering algorithm. It takes as inputs:

• the centered n × p feature matrix X;

• the vector of penalty parameter values to be tried (an option is to provide an empty vector and let the algorithm set trial values automatically);

• the number of clusters K;

• the maximum number of iterations for the EM algorithm;

• the convergence tolerance for the EM algorithm;

• the number of whole repetitions of the clustering algorithm;


Figure 9.1: Mix-GLOSS loops scheme.

• a p × (K − 1) initial coefficient matrix (optional);

• an n × K initial posterior probability matrix (optional).

For each algorithm repetition, an initial label matrix Y is needed. This matrix may contain either hard or soft assignments. If no such matrix is available, K-means is used to initialize the process. If we have an initial guess for the coefficient matrix B, it can also be fed into Mix-GLOSS to warm-start the process.

9.1.2 Penalty Parameter Loop

The penalty parameter loop goes through all the values of the input vector λ. These values are sorted in ascending order, such that the resulting B and Y matrices can be used to warm-start the EM loop for the next value of the penalty parameter. If some λ value results in a null coefficient matrix, the algorithm halts. We have verified that the implemented warm-start reduces the computation time by a factor of 8, with respect to using a null B matrix and a K-means execution for the initial Y label matrix.

Mix-GLOSS may be fed with an empty vector of penalty parameters, in which case a first non-penalized execution of Mix-GLOSS is done, and its resulting coefficient matrix B and posterior matrix Y are used to estimate a trial value of λ that should remove about 10% of the relevant features. This estimation is repeated until a minimum number of relevant variables is reached. The parameter that controls the estimated percentage of variables to be removed with the next penalty parameter can be modified to make feature selection more or less aggressive.

Algorithm 2 details the implementation of the automatic selection of the penalty parameter. If the alternate variational approach from Appendix D is used, we have to replace Equation (4.32b) by (D.10b).

Algorithm 2 Automatic selection of λ

Input: X, K, λ = ∅, minVAR
Initialize:
  B ← 0; Y ← K-means(X, K)
Run non-penalized Mix-GLOSS:
  λ ← 0; (B, Y) ← Mix-GLOSS(X, K, B, Y, λ)
lastLAMBDA ← false
repeat
  Estimate λ:
    Compute the gradient at $\beta_j = 0$:
    \[
      \frac{\partial J(B)}{\partial\beta_j}\bigg|_{\beta_j=0} = x^{j\top}\Big(\sum_{m\ne j} x^m\beta_m - Y\Theta\Big)
    \]
    Compute $\lambda^{\max}$ for every feature using (4.32b):
    \[
      \lambda_j^{\max} = \frac{1}{w_j}\,\bigg\|\frac{\partial J(B)}{\partial\beta_j}\bigg|_{\beta_j=0}\bigg\|_2
    \]
    Choose λ so as to remove 10% of the relevant features
  Run penalized Mix-GLOSS:
    (B, Y) ← Mix-GLOSS(X, K, B, Y, λ)
  if number of relevant variables in B > minVAR then
    lastLAMBDA ← false
  else
    lastLAMBDA ← true
  end if
until lastLAMBDA

Output: B, L(θ), $t_{ik}$, $\pi_k$, $\mu_k$, Σ, Y for every λ in the solution path
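For illustration, the per-feature λmax computation at B = 0 can be written as below. This is a hedged sketch: the weights w_j come from (4.32b) in the GLOSS chapter and are assumed uniform here, and the function name is ours.

import numpy as np

def lambda_max_per_feature(X, YTheta, w=None):
    """lambda_j^max = (1/w_j) * || x_j' (sum_{m != j} x_m beta_m - Y Theta) ||_2
    evaluated at B = 0, i.e. the penalty level above which feature j stays out."""
    p = X.shape[1]
    w = np.ones(p) if w is None else w
    grad0 = -(X.T @ YTheta)                # gradient of the fit term at B = 0, one row per feature
    return np.linalg.norm(grad0, axis=1) / w

# picking the next lambda so that roughly 10% of the currently relevant features
# fall below their lambda_j^max mimics the heuristic used in Algorithm 2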

9.1.3 Inner Loop: EM Algorithm

The inner loop implements the actual clustering algorithm by means of successive maximizations of a penalized likelihood criterion. Once convergence in the posterior probabilities $t_{ik}$ is achieved, the maximum a posteriori rule is applied to classify all examples. Algorithm 3 describes this inner loop.


Algorithm 3 Mix-GLOSS for one value of λ

Input: X, K, B0, Y0, λ
Initialize:
  if (B0, Y0) available then
    B_OS ← B0; Y ← Y0
  else
    B_OS ← 0; Y ← K-means(X, K)
  end if
  convergenceEM ← false; tolEM ← 1e-3
repeat
  M-step:
    (B_OS, Θ, α) ← GLOSS(X, Y, B_OS, λ)
    X_LDA = X B_OS diag(α⁻¹(1 − α²)^(−1/2))
    π_k, μ_k and Σ as per (7.10), (7.11) and (7.12)
  E-step:
    t_ik as per (8.1)
    L(θ) as per (8.2)
  if (1/n) Σ_i |t_ik − y_ik| < tolEM then
    convergenceEM ← true
  end if
  Y ← T
until convergenceEM
Y ← MAP(T)

Output: B_OS, Θ, L(θ), t_ik, π_k, μ_k, Σ, Y


M-Step

The M-step deals with the estimation of the model parameters, that is, the clusters' means $\mu_k$, the common covariance matrix Σ and the priors of every component $\pi_k$. In a classical M-step, this is done explicitly by maximizing the likelihood expression. Here, this maximization is implicitly performed by penalized optimal scoring (see Section 8.1). The core of this step is a GLOSS execution that regresses X on the scaled version of the label matrix, YΘ. For the first iteration of EM, if no initialization is available, Y results from a K-means execution. In subsequent iterations, Y is updated as the posterior probability matrix T resulting from the E-step.

E-Step

The E-step evaluates the posterior probability matrix T, using
\[
  t_{ik} \propto \exp\bigg[-\frac{d(x,\mu_k) - 2\log(\pi_k)}{2}\bigg].
\]
The convergence of those $t_{ik}$ is used as the stopping criterion for EM. A small numerical sketch of this step is given below.
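Numerically, turning the distances into posteriors is conveniently done with a log-sum-exp normalization to avoid underflow; this is a minimal sketch with names of our own choosing.

import numpy as np

def posteriors_from_distances(d, pi):
    """t_ik proportional to exp(-(d_ik - 2*log(pi_k)) / 2), normalized over k.
    d: (n, K) matrix of distances, pi: (K,) vector of proportions."""
    logits = -0.5 * (d - 2.0 * np.log(pi))
    logits -= logits.max(axis=1, keepdims=True)   # log-sum-exp trick for stability
    T = np.exp(logits)
    return T / T.sum(axis=1, keepdims=True)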

9.2 Model Selection

Here, model selection refers to the choice of the penalty parameter. Up to now, we have not conducted experiments where the number of clusters has to be automatically selected.

In a first attempt, we tried a classical structure where clustering was performed several times from different initializations, for all penalty parameter values. Then, using the log-likelihood criterion, the best repetition for every value of the penalty parameter was chosen. The definitive λ was selected by means of the stability criterion described by Lange et al. (2002). This algorithm took a lot of computing resources, since the stability selection mechanism required a number of repetitions that transformed Mix-GLOSS into a lengthy structure of four nested loops.

In a second attempt, we replaced the stability-based model selection algorithm by the evaluation of a modified version of BIC (Pan and Shen, 2007). This version of BIC looks like the traditional one (Schwarz, 1978), but takes into consideration the variables that have been removed. This mechanism, even if it turned out to be faster, also required a large computation time.
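A sketch of such a sparsity-aware BIC is given below. The exact parameter count of Pan and Shen (2007) is not reproduced here; counting the proportions, the retained mean coordinates and a diagonal covariance is our simplifying assumption.

import numpy as np

def modified_bic(loglik, n, K, active_mask):
    """BIC = -2*loglik + log(n)*df, where df only counts parameters attached to
    the variables kept in the model (active_mask: boolean vector of length p)."""
    q = int(active_mask.sum())              # number of retained variables
    df = (K - 1) + K * q + q                # proportions + sparse means + diag. variances (assumption)
    return -2.0 * loglik + np.log(n) * df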

The third and definitive attempt (up to now) proceeds with several executions of Mix-GLOSS for the non-penalized case (λ = 0). The execution with the best log-likelihood is chosen; the repetitions are only performed for the non-penalized problem. The coefficient matrix B and the posterior matrix T resulting from the best non-penalized execution are used to warm-start a new Mix-GLOSS execution. This second execution of Mix-GLOSS is done using the values of the penalty parameter provided by the user or computed by the automatic selection mechanism; this time, only one repetition of the algorithm is done for every value of the penalty parameter. This version has been tested


Figure 9.2: Mix-GLOSS model selection diagram. The best of 20 non-penalized Mix-GLOSS repetitions (λ = 0) provides the starting B and T matrices; Mix-GLOSS is then run along the λ path, BIC is computed for each λ, and the λ minimizing BIC is chosen, yielding the final partition, the t_ik and π_k, λ_BEST, B, Θ, D, L(θ) and the active set.

with no significant differences in the quality of the clustering, but reducing dramatically the computation time. Figure 9.2 summarizes the mechanism that implements the model selection of the penalty parameter λ.


10 Experimental Results

The performance of Mix-GLOSS is measured here with the artificial dataset that has been used in Section 6.

This synthetic database is interesting because it covers four different situations where feature selection can be applied. Basically, it considers four setups with 1200 examples equally distributed between classes. It is a small-sample regime, with p = 500 variables, out of which 100 differ between classes. Independent variables are generated for all simulations, except for simulation 2, where they are slightly correlated. In simulations 2 and 3, classes are optimally separated by a single projection of the original variables, while the two other scenarios require three discriminant directions. The Bayes' error was estimated to be respectively 1.7%, 6.7%, 7.3% and 30.0%. The exact description of every setup has already been given in Section 6.3.

In our tests, we have reduced the volume of the problem, because with the original size of 1200 samples and 500 dimensions some of the algorithms to test took several days (even weeks) to finish. Hence, the definitive database was chosen to maintain approximately the Bayes' error of the original one, but with five times fewer examples and dimensions (n = 240, p = 100). Figure 10.1 has been adapted from Witten and Tibshirani (2011) to the dimensionality of our experiments and allows a better understanding of the different simulations.

The simulation protocol involves 25 repetitions of each setup, generating a different dataset for each repetition. Thus, the results of the tested algorithms are provided as the average value and the standard deviation over the 25 repetitions.

10.1 Tested Clustering Algorithms

This section compares Mix-GLOSS with the following methods from the state of the art:

• CS general cov: This is a model-based clustering with unconstrained covariance matrices, based on the regularization of the likelihood function using L1 penalties, followed by a classical EM algorithm. Further details can be found in Zhou et al. (2009). We use the R function available on the website of Wei Pan.

• Fisher EM: This method models and clusters the data in a discriminative and low-dimensional latent subspace (Bouveyron and Brunet, 2012b,a). Feature selection is induced by means of the "sparsification" of the projection matrix (three possibilities are suggested by Bouveyron and Brunet, 2012a). The corresponding R package "Fisher EM" is available from the website of Charles Bouveyron or from the Comprehensive R Archive Network website.


Figure 10.1: Class mean vectors for each artificial simulation.

• SelvarClust / Clustvarsel: Implements a method of variable selection for clustering using Gaussian mixture models, as a modification of the Raftery and Dean (2006) algorithm. SelvarClust (Maugis et al., 2009b) is a software implemented in C++ that makes use of the clustering library mixmod (Bienarcki et al., 2008). Further information can be found in the related paper, Maugis et al. (2009a). The software can be downloaded from the SelvarClust project homepage; there is a link to the project from Cathy Maugis's website.

After several tests, this entrant was discarded due to the amount of computing time required by the greedy selection technique, which basically involves two executions of a classical clustering algorithm (with mixmod) for every single variable whose inclusion needs to be considered.

The substitute for SelvarClust has been the algorithm that inspired it, that is, the method developed by Raftery and Dean (2006). There is an R package named Clustvarsel that can be downloaded from the website of Nema Dean or from the Comprehensive R Archive Network website.

• LumiWCluster: LumiWCluster is an R package available from the homepage of Pei Fen Kuan. This algorithm is inspired by Wang and Zhu (2008), who propose a penalty for the likelihood that incorporates group information through an $L_{1\infty}$ mixed norm. In Kuan et al. (2010), they introduce some slight changes in the penalty term, such as weighting parameters that are particularly important for their dataset. The package LumiWCluster allows to perform clustering using the expression from Wang and Zhu (2008) (called LumiWCluster-Wang) or the one from Kuan et al. (2010) (called LumiWCluster-Kuan).

• Mix-GLOSS: This is the clustering algorithm implemented using GLOSS (see Chapter 9). It makes use of an EM algorithm and of the equivalences between the M-step and an LDA problem, and between a p-LDA problem and a p-OS problem. It penalizes an OS regression with a variational approach of the group-Lasso penalty (see Section 8.1.4) that induces zeros in all discriminant directions for the same variable.

10.2 Results

Table 10.1 shows the results of the experiments for all the algorithms from Section 10.1. The measures used to assess the performance are:

• Clustering Error (in percentage): To measure the quality of the partition with the a priori knowledge of the real classes, the clustering error is computed as explained in Wu and Schölkopf (2007). If the obtained partition and the real labeling are the same, then the clustering error is 0%. The way this measure is defined allows to obtain the ideal 0% of clustering error even if the IDs of the clusters and of the real classes are different.

• Number of Disposed Features: This value shows the number of variables whose coefficients have been zeroed; therefore, they are not used in the partitioning. In our datasets, only the first 20 features are relevant for the discrimination; the last 80 variables can be discarded. Hence, a good result for the tested algorithms should be around 80.

• Time of execution (in hours, minutes or seconds): Finally, the time needed to execute the 25 repetitions of each simulation setup is also measured. Those algorithms tend to be more memory- and CPU-consuming as the number of variables increases; this is one of the reasons why the dimensionality of the original problem was reduced.

The adequacy of the selected features was assessed by the True Positive Rate (TPR) and the False Positive Rate (FPR). The TPR is defined as the ratio of selected variables that are actually relevant; similarly, the FPR is the ratio of selected variables that are actually non-relevant. The best algorithm would be the one that selects all the relevant variables and rejects all the others, that is, TPR = 1 and FPR = 0 simultaneously. In order to avoid cluttered results, we compare TPR and FPR for the four simulations but only for three algorithms: CS general cov and Clustvarsel were discarded due to their high computing time and clustering error, respectively, and, the two versions of LumiWCluster providing almost the same TPR and FPR, only one is displayed. The three remaining algorithms are Fisher EM by Bouveyron and Brunet (2012a), the version of LumiWCluster by Kuan et al. (2010), and Mix-GLOSS.
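For reference, the permutation-invariant clustering error and the TPR/FPR can be computed as in the sketch below, which relies on the Hungarian assignment routine of scipy; this is one possible implementation, not the exact code used for the tables, and it uses the standard definitions of TPR and FPR over boolean selection masks.

import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_error(y_true, y_pred):
    """Error after the best one-to-one matching between cluster and class IDs."""
    K = max(y_true.max(), y_pred.max()) + 1
    C = np.zeros((K, K), dtype=int)
    for t, p in zip(y_true, y_pred):
        C[t, p] += 1
    rows, cols = linear_sum_assignment(-C)     # maximize the matched counts
    return 1.0 - C[rows, cols].sum() / len(y_true)

def tpr_fpr(selected, relevant):
    """selected, relevant: boolean masks over the p variables."""
    tpr = (selected & relevant).sum() / relevant.sum()
    fpr = (selected & ~relevant).sum() / (~relevant).sum()
    return tpr, fpr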

The results, in percentages, are displayed in Figure 10.2 (or in Table 10.2).


Table 10.1: Experimental results for simulated data

                          Err (%)        Var            Time
Sim 1: K = 4, mean shift, ind. features
  CS general cov          46 (15)        985 (72)       884h
  Fisher EM               58 (87)        784 (52)       1645m
  Clustvarsel             602 (107)      378 (291)      383h
  LumiWCluster-Kuan       42 (68)        779 (4)        389s
  LumiWCluster-Wang       43 (69)        784 (39)       619s
  Mix-GLOSS               32 (16)        80 (09)        15h

Sim 2: K = 2, mean shift, dependent features
  CS general cov          154 (2)        997 (09)       783h
  Fisher EM               74 (23)        809 (28)       8m
  Clustvarsel             73 (2)         334 (207)      166h
  LumiWCluster-Kuan       64 (18)        798 (04)       155s
  LumiWCluster-Wang       63 (17)        799 (03)       14s
  Mix-GLOSS               77 (2)         841 (34)       2h

Sim 3: K = 4, 1D mean shift, ind. features
  CS general cov          304 (57)       55 (468)       1317h
  Fisher EM               233 (65)       366 (55)       22m
  Clustvarsel             658 (115)      232 (291)      542h
  LumiWCluster-Kuan       323 (21)       80 (02)        83s
  LumiWCluster-Wang       308 (36)       80 (02)        1292s
  Mix-GLOSS               347 (92)       81 (88)        21h

Sim 4: K = 4, mean shift, ind. features
  CS general cov          626 (55)       999 (02)       112h
  Fisher EM               567 (104)      55 (48)        195m
  Clustvarsel             732 (4)        24 (12)        767h
  LumiWCluster-Kuan       692 (112)      99 (2)         876s
  LumiWCluster-Wang       697 (119)      991 (21)       825s
  Mix-GLOSS               669 (91)       975 (12)       11h

Table 10.2: TPR versus FPR (in %), average computed over 25 repetitions, for the best performing algorithms

               Simulation 1     Simulation 2     Simulation 3     Simulation 4
               TPR     FPR      TPR     FPR      TPR     FPR      TPR     FPR
MIX-GLOSS      992     015      828     335      884     67       780     12
LUMI-KUAN      992     28       1000    02       1000    005      50      005
FISHER-EM      986     24       888     17       838     5825     620     4075


Figure 10.2: TPR versus FPR (in %) for the best performing algorithms (Mix-GLOSS, LumiWCluster-Kuan, Fisher EM) and the four simulations.

10.3 Discussion

After reviewing Tables 10.1–10.2 and Figure 10.2, we see that there is no definitive winner in all situations regarding all criteria. According to the objectives and constraints of the problem, the following observations deserve to be highlighted.

LumiWCluster (Wang and Zhu, 2008; Kuan et al., 2010) is by far the fastest kind of method, with good behavior regarding the other performance measures. At the other end of this criterion, CS general cov is extremely slow, and Clustvarsel, though twice as fast, is also very long to produce an output. Of course, the speed criterion does not say much by itself: the implementations use different programming languages, different stopping criteria, and we do not know what effort has been spent on implementation. That being said, the slowest algorithms are not the most precise ones, so their long computation time is worth mentioning here.

The quality of the partition varies depending on the simulation and the algorithm. Mix-GLOSS has a small edge in Simulation 1, LumiWCluster (Zhou et al., 2009) performs better in Simulation 2, while Fisher EM (Bouveyron and Brunet, 2012a) does slightly better in Simulations 3 and 4.

From the feature selection point of view, LumiWCluster (Kuan et al., 2010) and Mix-GLOSS succeed in removing irrelevant variables in all the situations; Fisher EM (Bouveyron and Brunet, 2012a) and Mix-GLOSS discover the relevant ones. Mix-GLOSS consistently performs best, or close to the best solution, in terms of fall-out and recall.



Conclusions

Summary

The linear regression of scaled indicator matrices, or optimal scoring, is a versatile technique with applicability in many fields of the machine learning domain. An optimal scoring regression can, by means of regularization, be strengthened to be more robust, avoid overfitting, counteract ill-posed problems, or remove correlated or noisy variables.

In this thesis, we have proved the utility of penalized optimal scoring in the fields of multi-class linear discrimination and clustering.

The equivalence between LDA and OS problems allows to take advantage, for the solution of linear discrimination problems, of all the resources available for the resolution of regression problems. In their penalized versions, this equivalence holds under certain conditions that have not always been obeyed when OS has been used to solve LDA problems.

In Part II, we have used a variational approach of the group-Lasso penalty to preserve this equivalence, granting the use of penalized optimal scoring regressions for the solution of linear discrimination problems. This theory has been verified with the implementation of our Group-Lasso Optimal Scoring Solver algorithm (GLOSS), which has proved its effectiveness, inducing extremely parsimonious models without renouncing any predictive capabilities. GLOSS has been tested with four artificial and three real datasets, outperforming other state-of-the-art algorithms in almost all situations.

In Part III, this theory has been adapted, by means of an EM algorithm, to the unsupervised domain. As in the supervised case, the theory must guarantee the equivalence between penalized LDA and penalized OS. The difficulty of this method resides in the computation of the criterion maximized at every iteration of the EM loop, which is typically used to detect the convergence of the algorithm and to implement the model selection of the penalty parameter. Also in this case, the theory has been put into practice with the implementation of Mix-GLOSS. By now, due to time constraints, only artificial datasets have been tested, with positive results.

Perspectives

Even if the preliminary results are optimistic, Mix-GLOSS has not been sufficiently tested. We have planned to test it at least with the same real datasets that we used with GLOSS. However, more testing would be recommended in both cases. Those algorithms are well suited for genomic data, where the number of samples is smaller than the number of variables; however, other high-dimension low-sample-size (HDLSS) domains are also possible: identification of male or female silhouettes, fungal species or fish species based on shape and texture (Clemmensen et al., 2011), or the Stirling faces (Roth and Lange, 2004), are only some examples. Moreover, we are not constrained to the HDLSS domain: the USPS handwritten digits database (Roth and Lange, 2004), or the well-known Fisher's Iris dataset and six other UCI datasets (Bouveyron and Brunet, 2012a), have also been tested in the bibliography.

At the programming level, both codes must be revisited to improve their robustness and optimize their computation, because during the prototyping phase the priority was achieving a functional code. An old version of GLOSS, numerically more stable but less efficient, has been made available to the public. A better suited and documented version should be made available for GLOSS and Mix-GLOSS in the short term.

The theory developed in this thesis and the programming structure used for its implementation allow easy alterations of the algorithm by modifying the within-class covariance matrix. Diagonal versions of the model can be obtained by discarding all the elements but the diagonal of the covariance matrix; spherical models could also be implemented easily. Prior information concerning the correlation between features can be included by adding a quadratic penalty term, such as a Laplacian that describes the relationships between variables; that can be used to implement pair-wise penalties when the dataset is formed by pixels. Quadratic penalty matrices can also be added to the within-class covariance to implement Elastic net equivalent penalties. Some of those possibilities have been partially implemented, such as the diagonal version of GLOSS; however, they have not been properly tested or even updated with the last algorithmic modifications. Their equivalents for the unsupervised domain have not yet been proposed, due to the time deadlines for the publication of this thesis.

From the point of view of the supporting theory, we did not succeed in finding the exact criterion that is maximized in Mix-GLOSS. We believe it must be a kind of penalized, or even hyper-penalized, likelihood, but we decided to prioritize the experimental results due to the time constraints. Ignoring this criterion does not prevent successful simulations of Mix-GLOSS: other mechanisms, which do not involve the computation of the real criterion, have been used for the stopping of the EM algorithm and for model selection. However, further investigations must be done in this direction to assess the convergence properties of this algorithm.

At the beginning of this thesis, even if the work finally took the direction of feature selection, a big effort was made in the domains of outlier detection and block clustering. One of the most successful mechanisms for the detection of outliers models the population with a mixture model where the outliers are described by a uniform distribution; this technique does not need any prior knowledge about the number or the percentage of outliers. As the basic model of this thesis is a mixture of Gaussians, our impression is that it should not be difficult to introduce a new uniform component to gather together all those points that do not fit the Gaussian mixture. On the other hand, the application of penalized optimal scoring to block clustering looks more complex; but as block clustering is typically defined as a mixture model whose parameters are estimated by means of an EM algorithm, it could be possible to re-interpret that estimation using a penalized optimal scoring regression.


Appendix


A Matrix Properties

Property 1. By definition, $\Sigma_W$ and $\Sigma_B$ are both symmetric matrices:
\[
  \Sigma_W = \frac{1}{n}\sum_{k=1}^g\sum_{i\in C_k}(x_i-\mu_k)(x_i-\mu_k)^\top, \qquad
  \Sigma_B = \frac{1}{n}\sum_{k=1}^g n_k(\mu_k-\bar x)(\mu_k-\bar x)^\top.
\]

Property 2. $\dfrac{\partial\, x^\top a}{\partial x} = \dfrac{\partial\, a^\top x}{\partial x} = a.$

Property 3. $\dfrac{\partial\, x^\top A x}{\partial x} = (A + A^\top)x.$

Property 4. $\dfrac{\partial\, |X^{-1}|}{\partial X} = -|X^{-1}|(X^{-1})^\top.$

Property 5. $\dfrac{\partial\, a^\top X b}{\partial X} = ab^\top.$

Property 6. $\dfrac{\partial}{\partial X}\operatorname{tr}\big(AX^{-1}B\big) = -(X^{-1}BAX^{-1})^\top = -X^{-\top}A^\top B^\top X^{-\top}.$


B The Penalized-OS Problem is an Eigenvector Problem

In this appendix, we answer the question why the solution of a penalized optimal scoring regression involves the computation of an eigenvector decomposition. The p-OS problem has this form:
\[
  \min_{\theta_k,\beta_k}\ \|Y\theta_k - X\beta_k\|_2^2 + \beta_k^\top\Omega_k\beta_k \tag{B.1}
\]
\[
  \text{s.t. } \theta_k^\top Y^\top Y\theta_k = 1, \qquad \theta_\ell^\top Y^\top Y\theta_k = 0 \quad \forall\,\ell<k,
\]
for $k = 1,\dots,K-1$. The Lagrangian associated to Problem (B.1) is

\[
  \mathcal{L}_k(\theta_k,\beta_k,\lambda_k,\nu_k) =
  \|Y\theta_k - X\beta_k\|_2^2 + \beta_k^\top\Omega_k\beta_k
  + \lambda_k\big(\theta_k^\top Y^\top Y\theta_k - 1\big)
  + \sum_{\ell<k}\nu_\ell\,\theta_\ell^\top Y^\top Y\theta_k. \tag{B.2}
\]
Setting to zero the gradient of (B.2) with respect to $\beta_k$ gives the value of the optimal $\beta_k$:
\[
  \beta_k = \big(X^\top X + \Omega_k\big)^{-1}X^\top Y\theta_k. \tag{B.3}
\]

The objective function of (B.1) evaluated at $\beta_k$ is
\[
  \min_{\theta_k}\ \|Y\theta_k - X\beta_k\|_2^2 + \beta_k^\top\Omega_k\beta_k
  = \min_{\theta_k}\ \theta_k^\top Y^\top\big(I - X(X^\top X + \Omega_k)^{-1}X^\top\big)Y\theta_k
  = \max_{\theta_k}\ \theta_k^\top Y^\top X\big(X^\top X + \Omega_k\big)^{-1}X^\top Y\theta_k. \tag{B.4}
\]
If the penalty matrix $\Omega_k$ is identical for all problems, $\Omega_k = \Omega$, then (B.4) corresponds to an eigen-problem, where the k score vectors $\theta_k$ are the eigenvectors of $Y^\top X(X^\top X+\Omega)^{-1}X^\top Y$.

B.1 How to Solve the Eigenvector Decomposition

Making an eigen-decomposition of an expression like $Y^\top X(X^\top X + \Omega)^{-1}X^\top Y$ is not trivial due to the $p\times p$ inverse. With some datasets, p can be extremely large, making this inverse intractable. In this section we show how to circumvent this issue by solving an easier eigenvector decomposition.


Let M be the matrix $Y^\top X(X^\top X + \Omega)^{-1}X^\top Y$, such that we can rewrite expression (B.4) in a compact way:
\[
  \max_{\Theta\in\mathbb{R}^{K\times(K-1)}}\ \operatorname{tr}\big(\Theta^\top M\Theta\big)
  \quad\text{s.t. } \Theta^\top Y^\top Y\Theta = I_{K-1}. \tag{B.5}
\]
If (B.5) is an eigenvector problem, it can be reformulated in the traditional way. Let the $(K-1)\times(K-1)$ matrix $M_\Theta$ be $\Theta^\top M\Theta$. Hence, the classical eigenvector formulation associated to (B.5) is

\[
  M_\Theta v = \lambda v, \tag{B.6}
\]
where v is the eigenvector and λ the associated eigenvalue of $M_\Theta$. Operating,
\[
  v^\top M_\Theta v = \lambda \iff v^\top\Theta^\top M\Theta v = \lambda.
\]
Making the variable change $w = \Theta v$, we obtain an alternative eigen-problem, where w are the eigenvectors of M and λ the associated eigenvalues:
\[
  w^\top M w = \lambda. \tag{B.7}
\]
Therefore, v are the eigenvectors of the eigen-decomposition of matrix $M_\Theta$ and w are the eigenvectors of the eigen-decomposition of matrix M. Note that the only difference between the $(K-1)\times(K-1)$ matrix $M_\Theta$ and the $K\times K$ matrix M is the $K\times(K-1)$ matrix Θ in the expression $M_\Theta = \Theta^\top M\Theta$. Then, to avoid the computation of the $p\times p$ inverse $(X^\top X+\Omega)^{-1}$, we can use the optimal value of the coefficient matrix $B = (X^\top X+\Omega)^{-1}X^\top Y\Theta$ into $M_\Theta$:

MΘ = ΘgtYgtX(XgtX + Ω)minus1XgtYΘ

= ΘgtYgtXB

Thus the eigen-decomposition of the (K minus 1) times (K minus 1) matrix MΘ = ΘgtYgtXB results in the v eigenvectors of (B6) To obtain the w eigenvectors of the alternativeformulation (B7) the variable change w = Θv needs to be undone

To summarize we calcule the v eigenvectors computed as the eigen-decomposition of atractable MΘ matrix evaluated as ΘgtYgtXB Then the definitive eigenvectors w arerecovered by doing w = Θv The final step is the reconstruction of the optimal scorematrix Θ using the vectors w as its columns At this point we understand what inthe literature is called ldquoupdating the initial score matrixrdquo Multiplying the initial Θ tothe eigenvectors matrix V from decomposition (B6) is reversing the change of variableto restore the w vectors The B matrix also needs to be ldquoupdatedrdquo by multiplying Bby the same matrix of eigenvectors V in order to affect the initial Θ matrix used in thefirst computation of B

B = (XgtX + Ω)minus1XgtYΘV = BV
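The update described above can be mimicked numerically. The sketch below (illustrative only, assuming NumPy, a plain penalty $\Omega = I$ and arbitrary dimensions) builds an initial score matrix satisfying the constraint of (B.5), forms $M_\Theta$ from the regression coefficients of (B.3) rather than from the large $Y^\top X(X^\top X+\Omega)^{-1}X^\top Y$ expression, and checks that the updated scores $\Theta V$ still satisfy the constraint and diagonalize $M_\Theta$.

```python
import numpy as np

rng = np.random.default_rng(8)
n, p, K = 60, 12, 4
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)
labels = np.arange(n) % K                      # every class is represented
Y = np.eye(K)[labels]
Omega = np.eye(p)

D = Y.T @ Y                                    # diagonal matrix of class counts
# initial scores Theta0 with Theta0' Y'Y Theta0 = I_{K-1}
A = rng.normal(size=(K, K - 1))
Q, _ = np.linalg.qr(np.sqrt(D) @ A)
Theta0 = np.linalg.solve(np.sqrt(D), Q)

B0 = np.linalg.solve(X.T @ X + Omega, X.T @ Y @ Theta0)   # coefficients from (B.3)
M_theta = Theta0.T @ Y.T @ X @ B0              # (K-1)x(K-1) matrix of (B.6)
lam, V = np.linalg.eigh(M_theta)               # small eigen-decomposition

Theta = Theta0 @ V                             # "update" of the initial score matrix
B = B0 @ V                                     # corresponding update of the coefficients

assert np.allclose(Theta.T @ D @ Theta, np.eye(K - 1))    # constraint of (B.5) still holds
assert np.allclose(Theta.T @ Y.T @ X @ B, np.diag(lam))   # updated scores diagonalize M_Theta
```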



B.2 Why the OS Problem is Solved as an Eigenvector Problem

In the Optimal Scoring literature, the score matrix $\Theta^\star$ that optimizes Problem (B.1) is obtained by means of an eigenvector decomposition of the matrix $M = Y^\top X(X^\top X+\Omega)^{-1}X^\top Y$.

By definition of the eigen-decomposition, the eigenvectors of $M$ (called $w$ in (B.7)) form a basis, so that any score vector $\theta$ can be expressed as a linear combination of them:

$$\theta_k = \sum_{m=1}^{K-1}\alpha_m w_m \qquad \text{s.t.}\quad \theta_k^\top\theta_k = 1. \qquad \text{(B.8)}$$

The score vectors' normalization constraint $\theta_k^\top\theta_k = 1$ can also be expressed as a function of this basis:

$$\Big(\sum_{m=1}^{K-1}\alpha_m w_m\Big)^{\!\top}\Big(\sum_{m=1}^{K-1}\alpha_m w_m\Big) = 1,$$

which, as per the eigenvector properties, reduces to

$$\sum_{m=1}^{K-1}\alpha_m^2 = 1. \qquad \text{(B.9)}$$

Let $M$ be multiplied by a score vector $\theta_k$, which can be replaced by its linear combination of eigenvectors $w_m$ (B.8):

$$M\theta_k = M\sum_{m=1}^{K-1}\alpha_m w_m = \sum_{m=1}^{K-1}\alpha_m M w_m.$$

As the $w_m$ are eigenvectors of $M$, the relationship $Mw_m = \lambda_m w_m$ can be used to obtain

$$M\theta_k = \sum_{m=1}^{K-1}\alpha_m\lambda_m w_m.$$

Multiplying on the left by $\theta_k^\top$, expressed through its own linear combination of eigenvectors,

$$\theta_k^\top M\theta_k = \Big(\sum_{\ell=1}^{K-1}\alpha_\ell w_\ell\Big)^{\!\top}\Big(\sum_{m=1}^{K-1}\alpha_m\lambda_m w_m\Big).$$

This equation can be simplified using the orthogonality property of eigenvectors, according to which $w_\ell^\top w_m$ is zero for any $\ell\neq m$, giving

$$\theta_k^\top M\theta_k = \sum_{m=1}^{K-1}\alpha_m^2\lambda_m.$$

The optimization problem (B.5) for discriminant direction $k$ can thus be rewritten as

$$\max_{\theta_k\in\mathbb{R}^{K}}\ \theta_k^\top M\theta_k = \max_{\theta_k\in\mathbb{R}^{K}}\ \sum_{m=1}^{K-1}\alpha_m^2\lambda_m, \qquad \text{(B.10)}$$
$$\text{with}\quad \theta_k = \sum_{m=1}^{K-1}\alpha_m w_m \quad\text{and}\quad \sum_{m=1}^{K-1}\alpha_m^2 = 1.$$

One way of maximizing Problem (B.10) is choosing $\alpha_m = 1$ for $m = k$ and $\alpha_m = 0$ otherwise. Hence, as $\theta_k = \sum_{m=1}^{K-1}\alpha_m w_m$, the resulting score vector $\theta_k$ will be equal to the $k$-th eigenvector $w_k$.

As a summary, it can be concluded that the solution to the original problem (B.1) can be obtained by an eigenvector decomposition of the matrix $M = Y^\top X(X^\top X+\Omega)^{-1}X^\top Y$.
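The last step is the classical Rayleigh-quotient argument: under the constraint $\sum_m\alpha_m^2=1$, the combination $\sum_m\alpha_m^2\lambda_m$ cannot exceed the largest eigenvalue, and it reaches it when all the weight is put on the corresponding eigenvector. A minimal NumPy check (illustrative only; a random symmetric matrix plays the role of $M$):

```python
import numpy as np

rng = np.random.default_rng(1)
K = 5
A = rng.normal(size=(K, K))
M = A @ A.T                          # symmetric PSD, playing the role of Y'X(X'X+Omega)^{-1}X'Y

lam, W = np.linalg.eigh(M)           # ascending eigenvalues, orthonormal eigenvectors
w_top = W[:, -1]                     # eigenvector of the largest eigenvalue

# for any unit-norm theta, theta' M theta = sum_m alpha_m^2 lambda_m <= lambda_max ...
for _ in range(200):
    theta = rng.normal(size=K)
    theta /= np.linalg.norm(theta)
    assert theta @ M @ theta <= lam[-1] + 1e-10

# ... and the bound is attained by the leading eigenvector
assert np.isclose(w_top @ M @ w_top, lam[-1])
```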


C Solving Fisher's Discriminant Problem

The classical Fisher's discriminant problem seeks a projection that best separates the class centers while every class remains compact. This is formalized as looking for a projection such that the projected data has maximal between-class variance under a unitary constraint on the within-class variance:

$$\max_{\beta\in\mathbb{R}^p}\ \beta^\top\Sigma_B\beta \qquad \text{(C.1a)}$$
$$\text{s.t.}\quad \beta^\top\Sigma_W\beta = 1, \qquad \text{(C.1b)}$$

where $\Sigma_B$ and $\Sigma_W$ are respectively the between-class variance and the within-class variance of the original $p$-dimensional data.

The Lagrangian of Problem (C.1) is

$$\mathcal{L}(\beta,\nu) = \beta^\top\Sigma_B\beta - \nu\left(\beta^\top\Sigma_W\beta - 1\right),$$

so that its first derivative with respect to $\beta$ is

$$\frac{\partial\mathcal{L}(\beta,\nu)}{\partial\beta} = 2\Sigma_B\beta - 2\nu\Sigma_W\beta.$$

A necessary optimality condition for $\beta^\star$ is that this derivative is zero, that is,

$$\Sigma_B\beta^\star = \nu\Sigma_W\beta^\star.$$

Provided $\Sigma_W$ is full rank, we have

$$\Sigma_W^{-1}\Sigma_B\beta^\star = \nu\beta^\star. \qquad \text{(C.2)}$$

Thus the solutions $\beta^\star$ match the definition of an eigenvector of matrix $\Sigma_W^{-1}\Sigma_B$, with eigenvalue $\nu$. To characterize this eigenvalue, we note that the objective function (C.1a) can be expressed as follows:

$$\beta^{\star\top}\Sigma_B\beta^\star = \beta^{\star\top}\Sigma_W\Sigma_W^{-1}\Sigma_B\beta^\star = \nu\,\beta^{\star\top}\Sigma_W\beta^\star \ \ \text{from (C.2)} \ = \ \nu \ \ \text{from (C.1b)}.$$

That is, the optimal value of the objective function to be maximized is the eigenvalue $\nu$. Hence $\nu$ is the largest eigenvalue of $\Sigma_W^{-1}\Sigma_B$, and $\beta^\star$ is any eigenvector corresponding to this maximal eigenvalue.
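As an illustration (not part of the original appendix), the following NumPy sketch solves (C.1) through the eigen-decomposition (C.2) on synthetic Gaussian classes, with $\Sigma_W$ and $\Sigma_B$ estimated as in Property 1 of Appendix A, and checks that the attained objective equals the largest eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, K = 300, 5, 3
means = rng.normal(scale=3.0, size=(K, p))
labels = rng.integers(K, size=n)
X = means[labels] + rng.normal(size=(n, p))

# sample within-class and between-class covariance matrices (Property 1, Appendix A)
xbar = X.mean(axis=0)
Sigma_W = np.zeros((p, p))
Sigma_B = np.zeros((p, p))
for k in range(K):
    Xk = X[labels == k]
    mu_k = Xk.mean(axis=0)
    Sigma_W += (Xk - mu_k).T @ (Xk - mu_k) / n
    Sigma_B += len(Xk) * np.outer(mu_k - xbar, mu_k - xbar) / n

# solutions of (C.1) are eigenvectors of Sigma_W^{-1} Sigma_B, see (C.2)
nu, V = np.linalg.eig(np.linalg.inv(Sigma_W) @ Sigma_B)
beta = np.real(V[:, np.argmax(np.real(nu))])
beta /= np.sqrt(beta @ Sigma_W @ beta)        # enforce the constraint (C.1b)

# the objective value equals the largest eigenvalue
assert np.isclose(beta @ Sigma_B @ beta, np.max(np.real(nu)))
```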


D Alternative Variational Formulation for the Group-Lasso

In this appendix, an alternative to the variational form of the group-Lasso (4.21) presented in Section 4.3.1 is proposed:

$$\min_{\tau\in\mathbb{R}^p}\ \min_{B\in\mathbb{R}^{p\times(K-1)}}\ J(B) + \lambda\sum_{j=1}^{p} w_j^2\,\frac{\|\beta^j\|_2^2}{\tau_j} \qquad \text{(D.1a)}$$
$$\text{s.t.}\quad \sum_{j=1}^{p}\tau_j = 1, \qquad \text{(D.1b)}$$
$$\phantom{\text{s.t.}}\quad \tau_j \ge 0,\ \ j = 1,\dots,p. \qquad \text{(D.1c)}$$

Following the approach detailed in Section 4.3.1, its equivalence with the standard group-Lasso formulation is demonstrated here. Let $B\in\mathbb{R}^{p\times(K-1)}$ be a matrix composed of row vectors $\beta^j\in\mathbb{R}^{K-1}$: $B = (\beta^{1\top},\dots,\beta^{p\top})^\top$. The starting point is the Lagrangian

$$\mathcal{L}(B,\tau,\lambda,\nu_0,\nu_j) = J(B) + \lambda\sum_{j=1}^{p} w_j^2\,\frac{\|\beta^j\|_2^2}{\tau_j} + \nu_0\Big(\sum_{j=1}^{p}\tau_j - 1\Big) - \sum_{j=1}^{p}\nu_j\tau_j, \qquad \text{(D.2)}$$

which is differentiated with respect to $\tau_j$ to get the optimal value $\tau_j^\star$:

$$\frac{\partial\mathcal{L}(B,\tau,\lambda,\nu_0,\nu_j)}{\partial\tau_j}\bigg|_{\tau_j=\tau_j^\star} = 0 \ \Rightarrow\ -\lambda w_j^2\,\frac{\|\beta^j\|_2^2}{\tau_j^{\star 2}} + \nu_0 - \nu_j = 0$$
$$\Rightarrow\ -\lambda w_j^2\|\beta^j\|_2^2 + \nu_0\,\tau_j^{\star 2} - \nu_j\,\tau_j^{\star 2} = 0$$
$$\Rightarrow\ -\lambda w_j^2\|\beta^j\|_2^2 + \nu_0\,\tau_j^{\star 2} = 0.$$

The last two expressions are related through a property of the Lagrange multipliers stating that $\nu_j\,g_j(\tau^\star) = 0$, where $\nu_j$ is the Lagrange multiplier and $g_j(\tau)$ the corresponding inequality constraint. The optimal $\tau_j^\star$ can then be deduced:

$$\tau_j^\star = \sqrt{\frac{\lambda}{\nu_0}}\; w_j\|\beta^j\|_2.$$

Plugging this optimal value of $\tau_j$ into constraint (D.1b),

$$\sum_{j=1}^{p}\tau_j = 1 \ \Rightarrow\ \tau_j^\star = \frac{w_j\|\beta^j\|_2}{\sum_{j'=1}^{p} w_{j'}\|\beta^{j'}\|_2}. \qquad \text{(D.3)}$$

With this value of $\tau_j^\star$, Problem (D.1) is equivalent to

$$\min_{B\in\mathbb{R}^{p\times(K-1)}}\ J(B) + \lambda\Big(\sum_{j=1}^{p} w_j\|\beta^j\|_2\Big)^{2}. \qquad \text{(D.4)}$$

This problem is a slight alteration of the standard group-Lasso, as the penalty is squared compared to the usual form. This square only affects the strength of the penalty, and the usual properties of the group-Lasso apply to the solution of Problem (D.4); in particular, its solution is expected to be sparse, with some null row vectors $\beta^j$.

The penalty term of (D.1a) can be conveniently presented as $\lambda\,B^\top\Omega B$, where

$$\Omega = \mathrm{diag}\Big(\frac{w_1^2}{\tau_1},\frac{w_2^2}{\tau_2},\dots,\frac{w_p^2}{\tau_p}\Big). \qquad \text{(D.5)}$$

Using the value of $\tau_j^\star$ from (D.3), each diagonal component of $\Omega$ is

$$(\Omega)_{jj} = \frac{w_j\sum_{j'=1}^{p} w_{j'}\|\beta^{j'}\|_2}{\|\beta^j\|_2}. \qquad \text{(D.6)}$$

In the following sections, the optimality conditions and properties developed for the quadratic variational approach detailed in Section 4.3.1 are also derived for this alternative formulation.
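Before stating these properties, here is a short numerical illustration of the equivalence just derived (added for convenience, not part of the original appendix; random data and arbitrary weights, assuming NumPy): at the closed-form $\tau^\star$ of (D.3), the variational penalty of (D.1a) collapses to the squared group-Lasso penalty of (D.4), and no other feasible $\tau$ does better — which is also the content of Lemma D.4 below.

```python
import numpy as np

rng = np.random.default_rng(4)
p, K = 8, 4
B = rng.normal(size=(p, K - 1))
w = rng.uniform(0.5, 2.0, size=p)
norms = np.linalg.norm(B, axis=1)               # ||beta^j||_2, row norms of B

penalty = lambda tau: np.sum(w**2 * norms**2 / tau)

tau_star = w * norms / np.sum(w * norms)        # closed form (D.3)

# at tau*, the variational penalty equals the squared group-Lasso penalty of (D.4)
assert np.isclose(penalty(tau_star), np.sum(w * norms) ** 2)

# and tau* minimizes the penalty over the simplex (the bound of Lemma D.4 is tight there)
for _ in range(1000):
    tau = rng.dirichlet(np.ones(p))             # random point of the simplex
    assert penalty(tau) >= penalty(tau_star) - 1e-9
```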

D.1 Useful Properties

Lemma D.1. If $J$ is convex, Problem (D.1) is convex.

In what follows, $J$ will be a convex quadratic (hence smooth) function, in which case a necessary and sufficient optimality condition is that zero belongs to the subdifferential of the objective function, whose expression is provided in the following lemma.

Lemma D.2. For all $B\in\mathbb{R}^{p\times(K-1)}$, the subdifferential of the objective function of Problem (D.4) is the set of matrices $V\in\mathbb{R}^{p\times(K-1)}$ of the form

$$V = \frac{\partial J(B)}{\partial B} + 2\lambda\Big(\sum_{j=1}^{p} w_j\|\beta^j\|_2\Big)\, G, \qquad \text{(D.7)}$$

where $G = (g^{1\top},\dots,g^{p\top})^\top$ is a $p\times(K-1)$ matrix defined as follows. Let $\mathcal{S}(B)$ denote the row-wise support of $B$, $\mathcal{S}(B) = \{j\in\{1,\dots,p\} : \|\beta^j\|_2\neq 0\}$; then we have

$$\forall j\in\mathcal{S}(B),\quad g^j = w_j\,\|\beta^j\|_2^{-1}\beta^j, \qquad \text{(D.8)}$$
$$\forall j\notin\mathcal{S}(B),\quad \|g^j\|_2 \le w_j. \qquad \text{(D.9)}$$

This condition results in an equality for the "active" non-zero vectors $\beta^j$ and an inequality for the other ones, which both provide essential building blocks of our algorithm.

Lemma D.3. Problem (D.4) admits at least one solution, which is unique if $J(B)$ is strictly convex. All critical points $B^\star$ of the objective function verifying the following conditions are global minima. Let $\mathcal{S}(B^\star)$ denote the row-wise support of $B^\star$, $\mathcal{S}(B^\star) = \{j\in\{1,\dots,p\} : \|\beta^{\star j}\|_2\neq 0\}$, and let $\bar{\mathcal{S}}(B^\star)$ be its complement; then we have

$$\forall j\in\mathcal{S}(B^\star),\quad -\frac{\partial J(B^\star)}{\partial\beta^j} = 2\lambda\Big(\sum_{j'=1}^{p} w_{j'}\|\beta^{\star j'}\|_2\Big)\, w_j\|\beta^{\star j}\|_2^{-1}\beta^{\star j}, \qquad \text{(D.10a)}$$
$$\forall j\in\bar{\mathcal{S}}(B^\star),\quad \Big\|\frac{\partial J(B^\star)}{\partial\beta^j}\Big\|_2 \le 2\lambda\, w_j\sum_{j'=1}^{p} w_{j'}\|\beta^{\star j'}\|_2. \qquad \text{(D.10b)}$$

In particular, Lemma D.3 provides a well-defined appraisal of the support of the solution, which is not easily handled from the direct analysis of the variational problem (D.1).

D.2 An Upper Bound on the Objective Function

Lemma D.4. The objective function of the variational form (D.1) is an upper bound on the group-Lasso objective function (D.4), and, for a given $B$, the gap between these objectives is null at $\tau^\star$ such that

$$\tau_j^\star = \frac{w_j\|\beta^j\|_2}{\sum_{j'=1}^{p} w_{j'}\|\beta^{j'}\|_2}.$$

Proof. The objective functions of (D.1) and (D.4) only differ in their second term. Let $\tau\in\mathbb{R}^p$ be any feasible vector; we have

$$\Big(\sum_{j=1}^{p} w_j\|\beta^j\|_2\Big)^{2} = \Big(\sum_{j=1}^{p}\tau_j^{1/2}\,\frac{w_j\|\beta^j\|_2}{\tau_j^{1/2}}\Big)^{2} \le \Big(\sum_{j=1}^{p}\tau_j\Big)\Big(\sum_{j=1}^{p}\frac{w_j^2\|\beta^j\|_2^2}{\tau_j}\Big) \le \sum_{j=1}^{p}\frac{w_j^2\|\beta^j\|_2^2}{\tau_j},$$

where we used the Cauchy-Schwarz inequality in the first inequality and the definition of the feasibility set of $\tau$ in the second one. A direct computation shows that both inequalities hold with equality at $\tau = \tau^\star$, so the gap is null there.

This lemma only holds for the alternative variational formulation described in this appendix. It is difficult to obtain the same result for the first variational form (Section 4.3.1), because there the definitions of the feasible sets of $\tau$ and $\beta$ are intertwined.


E Invariance of the Group-Lasso to Unitary Transformations

The computational trick described in Section 5.2 for quadratic penalties can be applied to the group-Lasso provided that the following holds: if the regression coefficients $B_0^\star$ are optimal for the score values $\Theta_0$, and if the optimal scores $\Theta^\star$ are obtained by a unitary transformation of $\Theta_0$, say $\Theta^\star = \Theta_0 V$ (where $V\in\mathbb{R}^{M\times M}$ is a unitary matrix), then $B^\star = B_0^\star V$ is optimal conditionally on $\Theta^\star$; that is, $(\Theta^\star, B^\star)$ is a global solution corresponding to the optimal scoring problem. To show this, we use the standard group-Lasso formulation and prove the following proposition.

Proposition E.1. Let $B^\star$ be a solution of

$$\min_{B\in\mathbb{R}^{p\times M}}\ \|Y - XB\|_F^2 + \lambda\sum_{j=1}^{p} w_j\|\beta^j\|_2, \qquad \text{(E.1)}$$

and let $\tilde{Y} = YV$, where $V\in\mathbb{R}^{M\times M}$ is a unitary matrix. Then $\tilde{B} = B^\star V$ is a solution of

$$\min_{B\in\mathbb{R}^{p\times M}}\ \|\tilde{Y} - XB\|_F^2 + \lambda\sum_{j=1}^{p} w_j\|\beta^j\|_2. \qquad \text{(E.2)}$$

Proof. The first-order necessary optimality conditions for $B^\star$ are

$$\forall j\in\mathcal{S}(B^\star),\quad 2\,x^{j\top}\!\left(XB^\star - Y\right) + \lambda w_j\,\|\beta^{\star j}\|_2^{-1}\beta^{\star j} = 0, \qquad \text{(E.3a)}$$
$$\forall j\notin\mathcal{S}(B^\star),\quad 2\,\big\|x^{j\top}\!\left(XB^\star - Y\right)\big\|_2 \le \lambda w_j, \qquad \text{(E.3b)}$$

where $\mathcal{S}(B^\star)\subseteq\{1,\dots,p\}$ denotes the set of non-zero row vectors of $B^\star$ and $\bar{\mathcal{S}}(B^\star)$ is its complement.

First we note that, from the definition of $\tilde{B}$, we have $\mathcal{S}(\tilde{B}) = \mathcal{S}(B^\star)$. Then we may rewrite the above conditions as follows:

$$\forall j\in\mathcal{S}(\tilde{B}),\quad 2\,x^{j\top}\!\left(X\tilde{B} - \tilde{Y}\right) + \lambda w_j\,\|\tilde{\beta}^{j}\|_2^{-1}\tilde{\beta}^{j} = 0, \qquad \text{(E.4a)}$$
$$\forall j\notin\mathcal{S}(\tilde{B}),\quad 2\,\big\|x^{j\top}\!\left(X\tilde{B} - \tilde{Y}\right)\big\|_2 \le \lambda w_j, \qquad \text{(E.4b)}$$

where (E.4a) is obtained by multiplying both sides of Equation (E.3a) by $V$, and also uses that $VV^\top = I$, so that $\forall u\in\mathbb{R}^M$, $\|u^\top\|_2 = \|u^\top V\|_2$; Equation (E.4b) is also obtained from the latter relationship. Conditions (E.4) are then recognized as the first-order necessary conditions for $\tilde{B}$ to be a solution to Problem (E.2). As the latter is convex, these conditions are sufficient, which concludes the proof.
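The key fact used in the proof — that both the data-fitting term and the row-wise penalty are invariant when $Y$ and $B$ are rotated by the same unitary matrix — can be checked directly. The sketch below is only an illustration of this invariance on random matrices (it does not solve any group-Lasso problem), assuming NumPy.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, M = 40, 7, 3
X = rng.normal(size=(n, p))
Y = rng.normal(size=(n, M))
B = rng.normal(size=(p, M))
w = rng.uniform(0.5, 2.0, size=p)
lam = 0.7

V, _ = np.linalg.qr(rng.normal(size=(M, M)))   # a random unitary (orthogonal) matrix

def objective(B, Y):
    fit = np.linalg.norm(Y - X @ B, ord="fro") ** 2
    penalty = lam * np.sum(w * np.linalg.norm(B, axis=1))   # row-wise group-Lasso penalty
    return fit + penalty

# the group-Lasso objective is unchanged when Y and B are both rotated by V:
# the Frobenius norm and the row norms of B are invariant to unitary transformations
assert np.isclose(objective(B, Y), objective(B @ V, Y @ V))
```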


F Expected Complete Likelihood and Likelihood

Section 7.1.2 explains that, by maximizing the conditional expectation of the complete log-likelihood $Q(\theta,\theta')$ (7.7) by means of the EM algorithm, the log-likelihood (7.1) is also maximized. The value of the log-likelihood can be computed using its definition (7.1), but there is a shorter way to compute it from $Q(\theta,\theta')$ when the latter is available:

$$L(\theta) = \sum_{i=1}^{n}\log\Big(\sum_{k=1}^{K}\pi_k f_k(x_i;\theta_k)\Big), \qquad \text{(F.1)}$$

$$Q(\theta,\theta') = \sum_{i=1}^{n}\sum_{k=1}^{K} t_{ik}(\theta')\,\log\big(\pi_k f_k(x_i;\theta_k)\big), \qquad \text{(F.2)}$$

$$\text{with}\quad t_{ik}(\theta') = \frac{\pi'_k f_k(x_i;\theta'_k)}{\sum_\ell\pi'_\ell f_\ell(x_i;\theta'_\ell)}. \qquad \text{(F.3)}$$

In the EM algorithm, $\theta'$ denotes the model parameters at the previous iteration, $t_{ik}(\theta')$ are the posterior probability values computed from $\theta'$ at the previous E-step, and $\theta$ (without "prime") denotes the parameters of the current iteration, to be obtained by the maximization of $Q(\theta,\theta')$.

Using (F.3), we have

$$Q(\theta,\theta') = \sum_{i,k} t_{ik}(\theta')\log\big(\pi_k f_k(x_i;\theta_k)\big) = \sum_{i,k} t_{ik}(\theta')\log\big(t_{ik}(\theta)\big) + \sum_{i,k} t_{ik}(\theta')\log\Big(\sum_\ell\pi_\ell f_\ell(x_i;\theta_\ell)\Big)$$
$$= \sum_{i,k} t_{ik}(\theta')\log\big(t_{ik}(\theta)\big) + L(\theta).$$

In particular, after the evaluation of the $t_{ik}$ in the E-step, where $\theta = \theta'$, the log-likelihood can be computed using the value of $Q(\theta,\theta)$ (7.7) and the entropy of the posterior probabilities:

$$L(\theta) = Q(\theta,\theta) - \sum_{i,k} t_{ik}(\theta)\log\big(t_{ik}(\theta)\big) = Q(\theta,\theta) + H(T).$$
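This identity is easy to check numerically. The following sketch (illustrative only; a one-dimensional Gaussian mixture with arbitrary parameters, assuming NumPy) computes the log-likelihood directly from (F.1) and recovers the same value from $Q(\theta,\theta)$ plus the entropy of the posterior probabilities.

```python
import numpy as np

rng = np.random.default_rng(6)
n, K = 200, 3
pi = np.array([0.5, 0.3, 0.2])
mu = np.array([-3.0, 0.0, 4.0])
sigma = np.array([1.0, 0.5, 1.5])
x = rng.normal(size=n)                      # any sample works for this identity

# pi_k * f_k(x_i): an n x K array of weighted Gaussian densities
dens = pi * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
T = dens / dens.sum(axis=1, keepdims=True)  # posterior probabilities t_ik (E-step)

L = np.sum(np.log(dens.sum(axis=1)))        # log-likelihood (F.1)
Q = np.sum(T * np.log(dens))                # Q(theta, theta), evaluated at theta' = theta (F.2)
H = -np.sum(T * np.log(T))                  # entropy of the posterior probabilities

assert np.isclose(L, Q + H)                 # L(theta) = Q(theta, theta) + H(T)
```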


G Derivation of the M-Step Equations

This appendix shows the whole process to obtain expressions (7.10), (7.11) and (7.12) in the context of a Gaussian mixture model with a common covariance matrix. The criterion is defined as

$$Q(\theta,\theta') = \sum_{i,k} t_{ik}(\theta')\log\big(\pi_k f_k(x_i;\theta_k)\big) = \sum_k\log(\pi_k)\sum_i t_{ik} - \frac{np}{2}\log(2\pi) - \frac{n}{2}\log|\Sigma| - \frac{1}{2}\sum_{i,k} t_{ik}(x_i-\mu_k)^\top\Sigma^{-1}(x_i-\mu_k),$$

which has to be maximized subject to $\sum_k\pi_k = 1$.

The Lagrangian of this problem is

$$\mathcal{L}(\theta) = Q(\theta,\theta') + \lambda\Big(\sum_k\pi_k - 1\Big).$$

Partial derivatives of the Lagrangian are set to zero to obtain the optimal values of $\pi_k$, $\mu_k$ and $\Sigma$.

G.1 Prior probabilities

$$\frac{\partial\mathcal{L}(\theta)}{\partial\pi_k} = 0 \iff \frac{1}{\pi_k}\sum_i t_{ik} + \lambda = 0,$$

where $\lambda$ is identified from the constraint, leading to

$$\pi_k = \frac{1}{n}\sum_i t_{ik}.$$

G.2 Means

$$\frac{\partial\mathcal{L}(\theta)}{\partial\mu_k} = 0 \iff -\frac{1}{2}\sum_i t_{ik}\,2\,\Sigma^{-1}(\mu_k - x_i) = 0 \ \Rightarrow\ \mu_k = \frac{\sum_i t_{ik}\,x_i}{\sum_i t_{ik}}.$$

G.3 Covariance Matrix

$$\frac{\partial\mathcal{L}(\theta)}{\partial\Sigma^{-1}} = 0 \iff \underbrace{\frac{n}{2}\Sigma}_{\text{as per Property 4}} - \underbrace{\frac{1}{2}\sum_{i,k} t_{ik}(x_i-\mu_k)(x_i-\mu_k)^\top}_{\text{as per Property 5}} = 0 \ \Rightarrow\ \Sigma = \frac{1}{n}\sum_{i,k} t_{ik}(x_i-\mu_k)(x_i-\mu_k)^\top.$$
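For reference, a compact NumPy implementation of these three updates is sketched below (an illustration added here, not part of the original appendix; the responsibilities $t_{ik}$ are drawn at random, since only the update formulas themselves are being exercised).

```python
import numpy as np

rng = np.random.default_rng(7)
n, p, K = 100, 4, 3
X = rng.normal(size=(n, p))
T = rng.dirichlet(np.ones(K), size=n)      # posterior probabilities t_ik, each row sums to one

def m_step(X, T):
    """M-step updates for a Gaussian mixture with a common covariance matrix."""
    n, p = X.shape
    nk = T.sum(axis=0)                     # sum_i t_ik
    pi = nk / n                            # prior probabilities
    mu = (T.T @ X) / nk[:, None]           # weighted means, one row per component
    Sigma = np.zeros((p, p))               # common covariance matrix
    for k in range(T.shape[1]):
        Xc = X - mu[k]
        Sigma += (T[:, k][:, None] * Xc).T @ Xc
    return pi, mu, Sigma / n

pi, mu, Sigma = m_step(X, T)
assert np.isclose(pi.sum(), 1.0)           # priors sum to one
assert np.allclose(Sigma, Sigma.T)         # Sigma is symmetric by construction
```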


Bibliography

F Bach R Jenatton J Mairal and G Obozinski Convex optimization with sparsity-inducing norms Optimization for Machine Learning pages 19ndash54 2011

F R Bach Bolasso model consistent lasso estimation through the bootstrap InProceedings of the 25th international conference on Machine learning ICML 2008

F R Bach R Jenatton J Mairal and G Obozinski Optimization with sparsity-inducing penalties Foundations and Trends in Machine Learning 4(1)1ndash106 2012

J D Banfield and A E Raftery Model-based Gaussian and non-Gaussian clusteringBiometrics pages 803ndash821 1993

A Beck and M Teboulle A fast iterative shrinkage-thresholding algorithm for linearinverse problems SIAM Journal on Imaging Sciences 2(1)183ndash202 2009

H Bensmail and G Celeux Regularized Gaussian discriminant analysis through eigen-value decomposition Journal of the American statistical Association 91(436)1743ndash1748 1996

P J Bickel and E Levina Some theory for Fisherrsquos linear discriminant function lsquonaiveBayesrsquo and some alternatives when there are many more variables than observationsBernoulli 10(6)989ndash1010 2004

C Bienarcki G Celeux G Govaert and F Langrognet MIXMOD Statistical Docu-mentation httpwwwmixmodorg 2008

C M Bishop Pattern Recognition and Machine Learning Springer New York 2006

C Bouveyron and C Brunet Discriminative variable selection for clustering with thesparse Fisher-EM algorithm Technical Report 12042067 Arxiv e-prints 2012a

C Bouveyron and C Brunet Simultaneous model-based clustering and visualization inthe Fisher discriminative subspace Statistics and Computing 22(1)301ndash324 2012b

S Boyd and L Vandenberghe Convex optimization Cambridge university press 2004

L Breiman Better subset regression using the nonnegative garrote Technometrics 37(4)373ndash384 1995

L Breiman and R Ihaka Nonlinear discriminant analysis via ACE and scaling TechnicalReport 40 University of California Berkeley 1984


T Cai and W Liu A direct estimation approach to sparse linear discriminant analysisJournal of the American Statistical Association 106(496)1566ndash1577 2011

S Canu and Y Grandvalet Outcomes of the equivalence of adaptive ridge with leastabsolute shrinkage Advances in Neural Information Processing Systems page 4451999

C Caramanis S Mannor and H Xu Robust optimization in machine learning InS Sra S Nowozin and S J Wright editors Optimization for Machine Learningpages 369ndash402 MIT Press 2012

B Chidlovskii and L Lecerf Scalable feature selection for multi-class problems InW Daelemans B Goethals and K Morik editors Machine Learning and KnowledgeDiscovery in Databases volume 5211 of Lecture Notes in Computer Science pages227ndash240 Springer 2008

L Clemmensen T Hastie D Witten and B Ersboslashll Sparse discriminant analysisTechnometrics 53(4)406ndash413 2011

C De Mol E De Vito and L Rosasco Elastic-net regularization in learning theoryJournal of Complexity 25(2)201ndash230 2009

A P Dempster N M Laird and D B Rubin Maximum likelihood from incompletedata via the em algorithm Journal of the Royal Statistical Society Series B (Method-ological) 39(1)1ndash38 1977 ISSN 0035-9246

D L Donoho M Elad and V N Temlyakov Stable recovery of sparse overcompleterepresentations in the presence of noise IEEE Transactions on Information Theory52(1)6ndash18 2006

R O Duda P E Hart and D G Stork Pattern Classification Wiley 2000

B Efron T Hastie I Johnstone and R Tibshirani Least angle regression The Annalsof statistics 32(2)407ndash499 2004

Jianqing Fan and Yingying Fan High dimensional classification using features annealedindependence rules Annals of statistics 36(6)2605 2008

R A Fisher The use of multiple measurements in taxonomic problems Annals ofHuman Genetics 7(2)179ndash188 1936

V Franc and S Sonnenburg Optimized cutting plane algorithm for support vectormachines In Proceedings of the 25th international conference on Machine learningpages 320ndash327 ACM 2008

J Friedman T Hastie and R Tibshirani The Elements of Statistical Learning DataMining Inference and Prediction Springer 2009


J Friedman T Hastie and R Tibshirani A note on the group lasso and a sparse grouplasso Technical Report 10010736 ArXiv e-prints 2010

J H Friedman Regularized discriminant analysis Journal of the American StatisticalAssociation 84(405)165ndash175 1989

W J Fu Penalized regressions the bridge versus the lasso Journal of Computationaland Graphical Statistics 7(3)397ndash416 1998

A Gelman J B Carlin H S Stern and D B Rubin Bayesian Data Analysis Chap-man amp HallCRC 2003

D Ghosh and A M Chinnaiyan Classification and selection of biomarkers in genomicdata using lasso Journal of Biomedicine and Biotechnology 2147ndash154 2005

G Govaert Y Grandvalet X Liu and L F Sanchez Merchante Implementation base-line for clustering Technical Report D71-m12 Massive Sets of Heuristics for MachineLearning httpssecuremash-projecteufilesmash-deliverable-D71-m12pdf 2010

G Govaert Y Grandvalet B Laval X Liu and L F Sanchez Merchante Implemen-tations of original clustering Technical Report D72-m24 Massive Sets of Heuristicsfor Machine Learning httpssecuremash-projecteufilesmash-deliverable-D72-m24pdf 2011

Y Grandvalet Least absolute shrinkage is equivalent to quadratic penalization InPerspectives in Neural Computing volume 98 pages 201ndash206 1998

Y Grandvalet and S Canu Adaptive scaling for feature selection in svms Advances inNeural Information Processing Systems 15553ndash560 2002

L Grosenick S Greer and B Knutson Interpretable classifiers for fMRI improveprediction of purchases IEEE Transactions on Neural Systems and RehabilitationEngineering 16(6)539ndash548 2008

Y Guermeur G Pollastri A Elisseeff D Zelus H Paugam-Moisy and P Baldi Com-bining protein secondary structure prediction models with ensemble methods of opti-mal complexity Neurocomputing 56305ndash327 2004

J Guo E Levina G Michailidis and J Zhu Pairwise variable selection for high-dimensional model-based clustering Biometrics 66(3)793ndash804 2010

I Guyon and A Elisseeff An introduction to variable and feature selection Journal ofMachine Learning Research 31157ndash1182 2003

T Hastie and R Tibshirani Discriminant analysis by Gaussian mixtures Journal ofthe Royal Statistical Society Series B (Methodological) 58(1)155ndash176 1996

T Hastie R Tibshirani and A Buja Flexible discriminant analysis by optimal scoringJournal of the American Statistical Association 89(428)1255ndash1270 1994


T Hastie A Buja and R Tibshirani Penalized discriminant analysis The Annals ofStatistics 23(1)73ndash102 1995

A E Hoerl and R W Kennard Ridge regression Biased estimation for nonorthogonalproblems Technometrics 12(1)55ndash67 1970

J Huang S Ma H Xie and C H Zhang A group bridge approach for variableselection Biometrika 96(2)339ndash355 2009

T Joachims Training linear svms in linear time In Proceedings of the 12th ACMSIGKDD international conference on Knowledge discovery and data mining pages217ndash226 ACM 2006

K Knight and W Fu Asymptotics for lasso-type estimators The Annals of Statistics28(5)1356ndash1378 2000

P F Kuan S Wang X Zhou and H Chu A statistical framework for illumina DNAmethylation arrays Bioinformatics 26(22)2849ndash2855 2010

T Lange M Braun V Roth and J Buhmann Stability-based model selection Ad-vances in Neural Information Processing Systems 15617ndash624 2002

M H C Law M A T Figueiredo and A K Jain Simultaneous feature selectionand clustering using mixture models IEEE Transactions on Pattern Analysis andMachine Intelligence 26(9)1154ndash1166 2004

Y Lee Y Lin and G Wahba Multicategory support vector machines Journal of theAmerican Statistical Association 99(465)67ndash81 2004

C Leng Sparse optimal scoring for multiclass cancer diagnosis and biomarker detectionusing microarray data Computational Biology and Chemistry 32(6)417ndash425 2008

C Leng Y Lin and G Wahba A note on the lasso and related procedures in modelselection Statistica Sinica 16(4)1273 2006

H Liu and L Yu Toward integrating feature selection algorithms for classification andclustering IEEE Transactions on Knowledge and Data Engineering 17(4)491ndash5022005

J MacQueen Some methods for classification and analysis of multivariate observa-tions In Proceedings of the fifth Berkeley Symposium on Mathematical Statistics andProbability volume 1 pages 281ndash297 University of California Press 1967

Q Mai H Zou and M Yuan A direct approach to sparse discriminant analysis inultra-high dimensions Biometrika 99(1)29ndash42 2012

C Maugis G Celeux and M L Martin-Magniette Variable selection for clusteringwith Gaussian mixture models Biometrics 65(3)701ndash709 2009a


C Maugis G Celeux and ML Martin-Magniette Selvarclust software for variable se-lection in model-based clustering rdquohttpwwwmathuniv-toulousefr~maugisSelvarClustHomepagehtmlrdquo 2009b

L Meier S Van De Geer and P Buhlmann The group lasso for logistic regressionJournal of the Royal Statistical Society Series B (Statistical Methodology) 70(1)53ndash71 2008

N Meinshausen and P Buhlmann High-dimensional graphs and variable selection withthe lasso The Annals of Statistics 34(3)1436ndash1462 2006

B Moghaddam Y Weiss and S Avidan Generalized spectral bounds for sparse LDAIn Proceedings of the 23rd international conference on Machine learning pages 641ndash648 ACM 2006

B Moghaddam Y Weiss and S Avidan Fast pixelpart selection with sparse eigen-vectors In IEEE 11th International Conference on Computer Vision 2007 ICCV2007 pages 1ndash8 2007

Y Nesterov Gradient methods for minimizing composite functions preprint 2007

S Newcomb A generalized theory of the combination of observations so as to obtainthe best result American Journal of Mathematics 8(4)343ndash366 1886

B Ng and R Abugharbieh Generalized group sparse classifiers with application in fMRIbrain decoding In Computer Vision and Pattern Recognition (CVPR) 2011 IEEEConference on pages 1065ndash1071 IEEE 2011

M R Osborne B Presnell and B A Turlach On the lasso and its dual Journal ofComputational and Graphical statistics 9(2)319ndash337 2000a

M R Osborne B Presnell and B A Turlach A new approach to variable selection inleast squares problems IMA Journal of Numerical Analysis 20(3)389ndash403 2000b

W Pan and X Shen Penalized model-based clustering with application to variableselection Journal of Machine Learning Research 81145ndash1164 2007

W Pan X Shen A Jiang and R P Hebbel Semi-supervised learning via penalizedmixture model with application to microarray sample classification Bioinformatics22(19)2388ndash2395 2006

K Pearson Contributions to the mathematical theory of evolution Philosophical Trans-actions of the Royal Society of London 18571ndash110 1894

S Perkins K Lacker and J Theiler Grafting Fast incremental feature selection bygradient descent in function space Journal of Machine Learning Research 31333ndash1356 2003


Z Qiao L Zhou and J Huang Sparse linear discriminant analysis with applications tohigh dimensional low sample size data International Journal of Applied Mathematics39(1) 2009

A E Raftery and N Dean Variable selection for model-based clustering Journal ofthe American Statistical Association 101(473)168ndash178 2006

C R Rao The utilization of multiple measurements in problems of biological classi-fication Journal of the Royal Statistical Society Series B (Methodological) 10(2)159ndash203 1948

S Rosset and J Zhu Piecewise linear regularized solution paths The Annals of Statis-tics 35(3)1012ndash1030 2007

V Roth The generalized lasso IEEE Transactions on Neural Networks 15(1)16ndash282004

V Roth and B Fischer The group-lasso for generalized linear models uniqueness ofsolutions and efficient algorithms In W W Cohen A McCallum and S T Roweiseditors Machine Learning Proceedings of the Twenty-Fifth International Conference(ICML 2008) volume 307 of ACM International Conference Proceeding Series pages848ndash855 2008

V Roth and T Lange Feature selection in clustering problems In S Thrun L KSaul and B Scholkopf editors Advances in Neural Information Processing Systems16 pages 473ndash480 MIT Press 2004

C Sammut and G I Webb Encyclopedia of Machine Learning Springer-Verlag NewYork Inc 2010

L F Sanchez Merchante Y Grandvalet and G Govaert An efficient approach to sparselinear discriminant analysis In Proceedings of the 29th International Conference onMachine Learning ICML 2012

Gideon Schwarz Estimating the dimension of a model The annals of statistics 6(2)461ndash464 1978

A J Smola SVN Vishwanathan and Q Le Bundle methods for machine learningAdvances in Neural Information Processing Systems 201377ndash1384 2008

S Sonnenburg G Ratsch C Schafer and B Scholkopf Large scale multiple kernellearning Journal of Machine Learning Research 71531ndash1565 2006

P Sprechmann I Ramirez G Sapiro and Y Eldar Collaborative hierarchical sparsemodeling In Information Sciences and Systems (CISS) 2010 44th Annual Conferenceon pages 1ndash6 IEEE 2010

M Szafranski Penalites Hierarchiques pour lrsquoIntegration de Connaissances dans lesModeles Statistiques PhD thesis Universite de Technologie de Compiegne 2008


M Szafranski Y Grandvalet and P Morizet-Mahoudeaux Hierarchical penalizationAdvances in Neural Information Processing Systems 2008

R Tibshirani Regression shrinkage and selection via the lasso Journal of the RoyalStatistical Society Series B (Methodological) pages 267ndash288 1996

J E Vogt and V Roth The group-lasso l1 regularization versus l12 regularization InPattern Recognition 32-nd DAGM Symposium Lecture Notes in Computer Science2010

S Wang and J Zhu Variable selection for model-based high-dimensional clustering andits application to microarray data Biometrics 64(2)440ndash448 2008

D Witten and R Tibshirani Penalized classification using Fisherrsquos linear discriminantJournal of the Royal Statistical Society Series B (Statistical Methodology) 73(5)753ndash772 2011

D M Witten and R Tibshirani A framework for feature selection in clustering Journalof the American Statistical Association 105(490)713ndash726 2010

D M Witten R Tibshirani and T Hastie A penalized matrix decomposition withapplications to sparse principal components and canonical correlation analysis Bio-statistics 10(3)515ndash534 2009

M Wu and B Scholkopf A local learning approach for clustering Advances in NeuralInformation Processing Systems 191529 2007

MC Wu L Zhang Z Wang DC Christiani and X Lin Sparse linear discriminantanalysis for simultaneous testing for the significance of a gene setpathway and geneselection Bioinformatics 25(9)1145ndash1151 2009

T T Wu and K Lange Coordinate descent algorithms for lasso penalized regressionThe Annals of Applied Statistics pages 224ndash244 2008

B Xie W Pan and X Shen Penalized model-based clustering with cluster-specificdiagonal covariance matrices and grouped variables Electronic Journal of Statistics2168ndash172 2008a

B Xie W Pan and X Shen Variable selection in penalized model-based clustering viaregularization on grouped parameters Biometrics 64(3)921ndash930 2008b

C Yang X Wan Q Yang H Xue and W Yu Identifying main effects and epistaticinteractions from large-scale snp data via adaptive group lasso BMC bioinformatics11(Suppl 1)S18 2010

J Ye Least squares linear discriminant analysis In Proceedings of the 24th internationalconference on Machine learning pages 1087ndash1093 ACM 2007


M Yuan and Y Lin Model selection and estimation in regression with grouped variablesJournal of the Royal Statistical Society Series B (Statistical Methodology) 68(1)49ndash67 2006

P Zhao and B Yu On model selection consistency of lasso Journal of Machine LearningResearch 7(2)2541 2007

P Zhao G Rocha and B Yu The composite absolute penalties family for grouped andhierarchical variable selection The Annals of Statistics 37(6A)3468ndash3497 2009

H Zhou W Pan and X Shen Penalized model-based clustering with unconstrainedcovariance matrices Electronic Journal of Statistics 31473ndash1496 2009

H Zou The adaptive lasso and its oracle properties Journal of the American StatisticalAssociation 101(476)1418ndash1429 2006

H Zou and T Hastie Regularization and variable selection via the elastic net Journal ofthe Royal Statistical Society Series B (Statistical Methodology) 67(2)301ndash320 2005



Contents

List of figures v

List of tables vii

Notation and Symbols ix

I Context and Foundations 1

1 Context 5

2 Regularization for Feature Selection 921 Motivations 9

22 Categorization of Feature Selection Techniques 11

23 Regularization 13

231 Important Properties 14

232 Pure Penalties 14

233 Hybrid Penalties 18

234 Mixed Penalties 19

235 Sparsity Considerations 19

236 Optimization Tools for Regularized Problems 21

II Sparse Linear Discriminant Analysis 25

Abstract 27

3 Feature Selection in Fisher Discriminant Analysis 2931 Fisher Discriminant Analysis 29

32 Feature Selection in LDA Problems 30

321 Inertia Based 30

322 Regression Based 32

4 Formalizing the Objective 3541 From Optimal Scoring to Linear Discriminant Analysis 35

411 Penalized Optimal Scoring Problem 36

412 Penalized Canonical Correlation Analysis 37

i

Contents

413 Penalized Linear Discriminant Analysis 39

414 Summary 40

42 Practicalities 41

421 Solution of the Penalized Optimal Scoring Regression 41

422 Distance Evaluation 42

423 Posterior Probability Evaluation 43

424 Graphical Representation 43

43 From Sparse Optimal Scoring to Sparse LDA 43

431 A Quadratic Variational Form 44

432 Group-Lasso OS as Penalized LDA 47

5 GLOSS Algorithm 4951 Regression Coefficients Updates 49

511 Cholesky decomposition 52

512 Numerical Stability 52

52 Score Matrix 52

53 Optimality Conditions 53

54 Active and Inactive Sets 54

55 Penalty Parameter 54

56 Options and Variants 55

561 Scaling Variables 55

562 Sparse Variant 55

563 Diagonal Variant 55

564 Elastic net and Structured Variant 55

6 Experimental Results 5761 Normalization 57

62 Decision Thresholds 57

63 Simulated Data 58

64 Gene Expression Data 60

65 Correlated Data 63

Discussion 63

III Sparse Clustering Analysis 67

Abstract 69

7 Feature Selection in Mixture Models 7171 Mixture Models 71

711 Model 71

712 Parameter Estimation The EM Algorithm 72

ii

Contents

72 Feature Selection in Model-Based Clustering 75721 Based on Penalized Likelihood 76722 Based on Model Variants 77723 Based on Model Selection 79

8 Theoretical Foundations 8181 Resolving EM with Optimal Scoring 81

811 Relationship Between the M-Step and Linear Discriminant Analysis 81812 Relationship Between Optimal Scoring and Linear Discriminant

Analysis 82813 Clustering Using Penalized Optimal Scoring 82814 From Sparse Optimal Scoring to Sparse Linear Discriminant Analysis 83

82 Optimized Criterion 83821 A Bayesian Derivation 84822 Maximum a Posteriori Estimator 85

9 Mix-GLOSS Algorithm 8791 Mix-GLOSS 87

911 Outer Loop Whole Algorithm Repetitions 87912 Penalty Parameter Loop 88913 Inner Loop EM Algorithm 89

92 Model Selection 91

10Experimental Results 93101 Tested Clustering Algorithms 93102 Results 95103 Discussion 97

Conclusions 97

Appendix 103

A Matrix Properties 105

B The Penalized-OS Problem is an Eigenvector Problem 107B1 How to Solve the Eigenvector Decomposition 107B2 Why the OS Problem is Solved as an Eigenvector Problem 109

C Solving Fisherrsquos Discriminant Problem 111

D Alternative Variational Formulation for the Group-Lasso 113D1 Useful Properties 114D2 An Upper Bound on the Objective Function 115

iii

Contents

E Invariance of the Group-Lasso to Unitary Transformations 117

F Expected Complete Likelihood and Likelihood 119

G Derivation of the M-Step Equations 121G1 Prior probabilities 121G2 Means 122G3 Covariance Matrix 122

Bibliography 123

iv

List of Figures

11 MASH project logo 5

21 Example of relevant features 1022 Four key steps of feature selection 1123 Admissible sets in two dimensions for different pure norms ||β||p 1424 Two dimensional regularized problems with ||β||1 and ||β||2 penalties 1525 Admissible sets for the Lasso and Group-Lasso 2026 Sparsity patterns for an example with 8 variables characterized by 4 pa-

rameters 20

41 Graphical representation of the variational approach to Group-Lasso 45

51 GLOSS block diagram 5052 Graph and Laplacian matrix for a 3times 3 image 56

61 TPR versus FPR for all simulations 6062 2D-representations of Nakayama and Sun datasets based on the two first

discriminant vectors provided by GLOSS and SLDA 6263 USPS digits ldquo1rdquo and ldquo0rdquo 6364 Discriminant direction between digits ldquo1rdquo and ldquo0rdquo 6465 Sparse discriminant direction between digits ldquo1rdquo and ldquo0rdquo 64

91 Mix-GLOSS Loops Scheme 8892 Mix-GLOSS model selection diagram 92

101 Class mean vectors for each artificial simulation 94102 TPR versus FPR for all simulations 97

v

List of Tables

61 Experimental results for simulated data supervised classification 5962 Average TPR and FPR for all simulations 6063 Experimental results for gene expression data supervised classification 61

101 Experimental results for simulated data unsupervised clustering 96102 Average TPR versus FPR for all clustering simulations 96

vii

Notation and Symbols

Throughout this thesis vectors are denoted by lowercase letters in bold font andmatrices by uppercase letters in bold font Unless otherwise stated vectors are columnvectors and parentheses are used to build line vectors from comma-separated lists ofscalars or to build matrices from comma-separated lists of column vectors

Sets

N the set of natural numbers N = 1 2 R the set of reals|A| cardinality of a set A (for finite sets the number of elements)A complement of set A

Data

X input domainxi input sample xi isin XX design matrix X = (xgt1 x

gtn )gt

xj column j of Xyi class indicator of sample i

Y indicator matrix Y = (ygt1 ygtn )gt

z complete data z = (xy)Gk set of the indices of observations belonging to class kn number of examplesK number of classesp dimension of Xi j k indices running over N

Vectors Matrices and Norms

0 vector with all entries equal to zero1 vector with all entries equal to oneI identity matrixAgt transposed of matrix A (ditto for vector)Aminus1 inverse of matrix Atr(A) trace of matrix A|A| determinant of matrix Adiag(v) diagonal matrix with v on the diagonalv1 L1 norm of vector vv2 L2 norm of vector vAF Frobenius norm of matrix A

ix

Notation and Symbols

Probability

E [middot] expectation of a random variablevar [middot] variance of a random variableN (micro σ2) normal distribution with mean micro and variance σ2

W(W ν) Wishart distribution with ν degrees of freedom and W scalematrix

H (X) entropy of random variable XI (XY ) mutual information between random variables X and Y

Mixture Models

yik hard membership of sample i to cluster kfk distribution function for cluster ktik posterior probability of sample i to belong to cluster kT posterior probability matrixπk prior probability or mixture proportion for cluster kmicrok mean vector of cluster kΣk covariance matrix of cluster kθk parameter vector for cluster k θk = (microkΣk)

θ(t) parameter vector at iteration t of the EM algorithmf(Xθ) likelihood functionL(θ X) log-likelihood functionLC(θ XY) complete log-likelihood function

Optimization

J(middot) cost functionL(middot) Lagrangianβ generic notation for the solution wrt β

βls least squares solution coefficient vectorA active setγ step size to update regularization pathh direction to update regularization path

x

Notation and Symbols

Penalized models

λ λ1 λ2 penalty parametersPλ(θ) penalty term over a generic parameter vectorβkj coefficient j of discriminant vector kβk kth discriminant vector βk = (βk1 βkp)B matrix of discriminant vectors B = (β1 βKminus1)

βj jth row of B = (β1gt βpgt)gt

BLDA coefficient matrix in the LDA domainBCCA coefficient matrix in the CCA domainBOS coefficient matrix in the OS domainXLDA data matrix in the LDA domainXCCA data matrix in the CCA domainXOS data matrix in the OS domainθk score vector kΘ score matrix Θ = (θ1 θKminus1)Y label matrixΩ penalty matrixLCP (θXZ) penalized complete log-likelihood functionΣB between-class covariance matrixΣW within-class covariance matrixΣT total covariance matrix

ΣB sample between-class covariance matrix

ΣW sample within-class covariance matrix

ΣT sample total covariance matrixΛ inverse of covariance matrix or precision matrixwj weightsτj penalty components of the variational approach

xi

Part I

Context and Foundations

1

This thesis is divided in three parts In Part I I am introducing the context in whichthis work has been developed the project that funded it and the constraints that we hadto obey Generic are also detailed here to introduce the models and some basic conceptsthat will be used along this document The state of the art of is also reviewed

The first contribution of this thesis is explained in Part II where I present the super-vised learning algorithm GLOSS and its supporting theory as well as some experimentsto test its performance compared to other state of the art mechanisms Before describingthe algorithm and the experiments its theoretical foundations are provided

The second contribution is described in Part III with an analogue structure to Part IIbut for the unsupervised domain The clustering algorithm Mix-GLOSS adapts the su-pervised technique from Part II by means of a modified EM This section is also furnishedwith specific theoretical foundations an experimental section and a final discussion

3

1 Context

The MASH project is a research initiative to investigate the open and collaborativedesign of feature extractors for the Machine Learning scientific community The project isstructured around a web platform (httpmash-projecteu) comprising collaborativetools such as wiki-documentation forums coding templates and an experiment centerempowered with non-stop calculation servers The applications targeted by MASH arevision and goal-planning problems either in a 3D virtual environment or with a realrobotic arm

The MASH consortium is led by the IDIAP Research Institute in Switzerland Theother members are the University of Potsdam in Germany the Czech Technical Uni-versity of Prague the National Institute for Research in Computer Science and Control(INRIA) in France and the National Centre for Scientific Research (CNRS) also in Francethrough the laboratory of Heuristics and Diagnosis for Complex Systems (HEUDIASYC)attached to the the University of Technology of Compiegne

From the point of view of the research the members of the consortium must deal withfour main goals

1 Software development of website framework and APIrsquos

2 Classification and goal-planning in high dimensional feature spaces

3 Interfacing the platform with the 3D virtual environment and the robot arm

4 Building tools to assist contributors with the development of the feature extractorsand the configuration of the experiments

S HM A

Figure 11 MASH project logo

5

1 Context

The work detailed in this text has been done in the context of goal 4 From the verybeginning of the project our role is to provide the users with some feedback regardingthe feature extractors At the moment of writing this thesis the number of publicfeature extractors reaches 75 In addition to the public ones there are also privateextractors that contributors decide not to share with the rest of the community Thelast number I was aware of was about 300 Within those 375 extractors there must besome of them sharing the same theoretical principles or supplying similar features Theframework of the project tests every new piece of code with some datasets of reference inorder to provide a ranking depending on the quality of the estimation However similarperformance of two extractors for a particular dataset does not mean that both are usingthe same variables

Our engagement was to provide some textual or graphical tools to discover whichextractors compute features similar to other ones Our hypothesis is that many of themuse the same theoretical foundations that should induce a grouping of similar extractorsIf we succeed discovering those groups we would also be able to select representativesThis information can be used in several ways For example from the perspective of a userthat develops feature extractors it would be interesting comparing the performance of hiscode against the K representatives instead to the whole database As another exampleimagine a user wants to obtain the best prediction results for a particular datasetInstead of selecting all the feature extractors creating an extremely high dimensionalspace he could select only the K representatives foreseeing similar results with a fasterexperiment

As there is no prior knowledge about the latent structure we make use of unsupervisedtechniques Below there is a brief description of the different tools that we developedfor the web platform

bull Clustering Using Mixture Models This is a well-known technique that mod-els the data as if it was randomly generated from a distribution function Thisdistribution is typically a mixture of Gaussian with unknown mixture proportionsmeans and covariance matrices The number of Gaussian components matchesthe number of expected groups The parameters of the model are computed usingthe EM algorithm and the clusters are built by maximum a posteriori estimationFor the calculation we use mixmod that is a c++ library that can be interfacedwith matlab This library allows working with high dimensional data Furtherinformation regarding mixmod is given by Bienarcki et al (2008) All details con-cerning the tool implemented are given in deliverable ldquomash-deliverable-D71-m12rdquo(Govaert et al 2010)

bull Sparse Clustering Using Penalized Optimal Scoring This technique in-tends again to perform clustering by modelling the data as a mixture of Gaussiandistributions However instead of using a classic EM algorithm for estimatingthe componentsrsquo parameters the M-step is replaced by a penalized Optimal Scor-ing problem This replacement induces sparsity improving the robustness and theinterpretability of the results Its theory will be explained later in this thesis

6

All details concerning the tool implemented can be found in deliverable ldquomash-deliverable-D72-m24rdquo (Govaert et al 2011)

bull Table Clustering Using The RV Coefficient This technique applies clus-tering methods directly to the tables computed by the feature extractors insteadcreating a single matrix A distance in the extractors space is defined using theRV coefficient that is a multivariate generalization of the Pearsonrsquos correlation co-efficient on the form of an inner product The distance is defined for every pair iand j as RV(OiOj) where Oi and Oj are operators computed from the tables re-turned by feature extractors i and j Once that we have a distance matrix severalstandard techniques may be used to group extractors A detailed description ofthis technique can be found in deliverables ldquomash-deliverable-D71-m12rdquo (Govaertet al 2010) and ldquomash-deliverable-D72-m24rdquo (Govaert et al 2011)

I am not extending this section with further explanations about the MASH project ordeeper details about the theory that we used to commit our engagements I will simplyrefer to the public deliverables of the project where everything is carefully detailed(Govaert et al 2010 2011)

7

2 Regularization for Feature Selection

With the advances in technology data is becoming larger and larger resulting inhigh dimensional ensembles of information Genomics textual indexation and medicalimages are some examples of data that can easily exceed thousands of dimensions Thefirst experiments aiming to cluster the data from the MASH project (see Chapter 1)intended to work with the whole dimensionality of the samples As the number of featureextractors rose the numerical issues also rose Redundancy or extremely correlatedfeatures may happen if two contributors implement the same extractor with differentnames When the number of features exceeded the number of samples we started todeal with singular covariance matrices whose inverses are not defined Many algorithmsin the field of Machine Learning make use of this statistic

21 Motivations

There is a quite recent effort in the direction of handling high dimensional dataTraditional techniques can be adapted but quite often large dimensions turn thosetechniques useless Linear Discriminant Analysis was shown to be no better than aldquorandom guessingrdquo of the object labels when the dimension is larger than the samplesize (Bickel and Levina 2004 Fan and Fan 2008)

As a rule of thumb, in discriminant and clustering problems the computational complexity increases with the number of objects in the database, the number of features (dimensionality) and the number of classes or clusters. One way to reduce this complexity is to reduce the number of features. This reduction induces more robust estimators, allows faster learning and predictions in the supervised environments, and easier interpretations in the unsupervised framework. Removing features must be done wisely to avoid removing critical information.

When talking about dimensionality reduction, there are two families of techniques that could induce confusion:

• Reduction by feature transformations summarizes the dataset with fewer dimensions by creating combinations of the original attributes. These techniques are less effective when there are many irrelevant attributes (noise). Principal Component Analysis and Independent Component Analysis are two popular examples.

• Reduction by feature selection removes irrelevant dimensions, preserving the integrity of the informative features from the original dataset. The problem comes out when there is a restriction in the number of variables to preserve, and discarding the exceeding dimensions leads to a loss of information. Prediction with feature selection is computationally cheaper because only relevant features are used, and the resulting models are easier to interpret. The Lasso operator is an example of this category.

Figure 2.1: Example of relevant features, from Chidlovskii and Lecerf (2008).

As a basic rule, we can use the reduction techniques by feature transformation when the majority of the features are relevant and when there is a lot of redundancy or correlation. On the contrary, feature selection techniques are useful when there are plenty of useless or noisy features (irrelevant information) that need to be filtered out. In the paper of Chidlovskii and Lecerf (2008) we find a great explanation about the difference between irrelevant and redundant features. The following two paragraphs are almost exact reproductions of their text.

"Irrelevant features are those which provide negligible distinguishing information. For example, if the objects are all dogs, cats or squirrels, and it is desired to classify each new animal into one of these three classes, the feature of color may be irrelevant if each of dogs, cats and squirrels have about the same distribution of brown, black and tan fur colors. In such a case, knowing that an input animal is brown provides negligible distinguishing information for classifying the animal as a cat, dog or squirrel. Features which are irrelevant for a given classification problem are not useful, and accordingly a feature that is irrelevant can be filtered out.

Redundant features are those which provide distinguishing information but are cumulative to another feature or group of features that provide substantially the same distinguishing information. Using the previous example, consider illustrative "diet" and "domestication" features. Dogs and cats both have similar carnivorous diets, while squirrels consume nuts and so forth. Thus the "diet" feature can efficiently distinguish squirrels from dogs and cats, although it provides little information to distinguish between dogs and cats. Dogs and cats are also both typically domesticated animals, while squirrels are wild animals. Thus the "domestication" feature provides substantially the same information as the "diet" feature, namely distinguishing squirrels from dogs and cats, but not distinguishing between dogs and cats. Thus the "diet" and "domestication" features are cumulative, and one can identify one of these features as redundant so as to be filtered out. However, unlike irrelevant features, care should be taken with redundant features to ensure that one retains enough of the redundant features to provide the relevant distinguishing information. In the foregoing example, one may wish to filter out either the "diet" feature or the "domestication" feature, but if one removes both the "diet" and the "domestication" features, then useful distinguishing information is lost."

Figure 2.2: The four key steps of feature selection, according to Liu and Yu (2005).

There are some tricks to build robust estimators when the number of features exceeds the number of samples. Ignoring some of the dependencies among variables and replacing the covariance matrix by a diagonal approximation are two of them. Another popular technique, and the one chosen in this thesis, is imposing regularity conditions.

2.2 Categorization of Feature Selection Techniques

Feature selection is one of the most frequent techniques for preprocessing data in order to remove irrelevant, redundant or noisy features. Nevertheless, the risk of removing some informative dimensions is always there, thus the relevance of the remaining subset of features must be measured.

I am reproducing here the scheme that generalizes any feature selection process, as it is shown by Liu and Yu (2005). Figure 2.2 provides a very intuitive scheme with the four key steps in a feature selection algorithm.

The classification of those algorithms can respond to different criteria. Guyon and Elisseeff (2003) propose a check list that summarizes the steps that may be taken to solve a feature selection problem, guiding the user through several techniques. Liu and Yu (2005) propose a framework that integrates supervised and unsupervised feature selection algorithms through a categorizing framework. Both references are excellent reviews for characterizing feature selection techniques. I am proposing a framework inspired by these references; it does not cover all the possibilities, but it gives a good summary of the existing ones.

• Depending on the type of integration with the machine learning algorithm, we have:

– Filter Models - The filter models work as a preprocessing step, using an independent evaluation criterion to select a subset of variables without assistance of the mining algorithm.

– Wrapper Models - The wrapper models require a classification or clustering algorithm and use its prediction performance to assess the relevance of the subset selection. The feature selection is done in the optimization block while the feature subset evaluation is done in a different one. Therefore the criterion to optimize and to evaluate may be different. Those algorithms are computationally expensive.

– Embedded Models - They perform variable selection inside the learning machine, with the selection being made at the training step. That means that there is only one criterion: the optimization and the evaluation are a single block, and the features are selected to optimize this unique criterion and do not need to be re-evaluated in a later phase. That makes them more efficient, since no validation or test process is needed for every variable subset investigated. However, they are less universal because they are specific to the training process of a given mining algorithm.

• Depending on the feature searching technique:

– Complete - No subsets are missed from evaluation. Involves combinatorial searches.

– Sequential - Features are added (forward searches) or removed (backward searches) one at a time.

– Random - The initial subset or even subsequent subsets are randomly chosen to escape local optima.

• Depending on the evaluation technique:

– Distance Measures - Choosing the features that maximize the difference in separability, divergence or discrimination measures.

– Information Measures - Choosing the features that maximize the information gain, that is, minimizing the posterior uncertainty.

– Dependency Measures - Measuring the correlation between features.

– Consistency Measures - Finding a minimum number of features that separate classes as consistently as the full set of features can.

– Predictive Accuracy - Use the selected features to predict the labels.

– Cluster Goodness - Use the selected features to perform clustering and evaluate the result (cluster compactness, scatter separability, maximum likelihood).

The distance, information, correlation and consistency measures are typical of variable ranking algorithms, commonly used in filter models. Predictive accuracy and cluster goodness allow evaluating subsets of features and can be used in wrapper and embedded models.

In this thesis we developed some algorithms following the embedded paradigm, either in the supervised or the unsupervised framework. Integrating the subset selection problem in the overall learning problem may be computationally demanding, but it is appealing from a conceptual viewpoint: there is a perfect match between the formalized goal and the process dedicated to achieve this goal, thus avoiding many problems arising in filter or wrapper methods. Practically, it is however intractable to solve exactly hard selection problems when the number of features exceeds a few tens. Regularization techniques allow providing a sensible approximate answer to the selection problem with reasonable computing resources, and their recent study has demonstrated powerful theoretical and empirical results. The following section introduces the tools that will be employed in Parts II and III.

2.3 Regularization

In the machine learning domain, the term "regularization" refers to a technique that introduces some extra assumptions or knowledge in the resolution of an optimization problem. The most popular point of view presents regularization as a mechanism to prevent overfitting, but it can also help to fix some numerical issues in ill-posed problems (like some matrix singularities when solving a linear system), besides other interesting properties like the capacity to induce sparsity, thus producing models that are easier to interpret.

An ill-posed problem violates the rules defined by Jacques Hadamard, according to whom the solution to a mathematical problem has to exist, be unique and be stable. This happens, for example, when the number of samples is smaller than their dimensionality and we try to infer some generic laws from such a small sample of the population. Regularization transforms an ill-posed problem into a well-posed one. To do that, some a priori knowledge is introduced in the solution through a regularization term that penalizes a criterion J with a penalty P. Below are the two most popular formulations:

$$\min_{\beta}\; J(\beta) + \lambda P(\beta) \qquad (2.1)$$

$$\min_{\beta}\; J(\beta) \quad \text{s.t.}\quad P(\beta) \le t \qquad (2.2)$$

In the expressions (2.1) and (2.2), the parameters λ and t have a similar function, that is, to control the trade-off between fitting the data to the model according to J(β) and the effect of the penalty P(β). The set for which the constraint in (2.2) is verified, {β : P(β) ≤ t}, is called the admissible set. This penalty term can also be understood as a measure that quantifies the complexity of the model (as in the definition of Sammut and Webb, 2010). Note that regularization terms can also be interpreted in the Bayesian paradigm as prior distributions on the parameters of the model. In this thesis both views will be taken.
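As an illustration of the penalized formulation (2.1), the short sketch below minimizes a toy least-squares criterion J plus a quadratic penalty P with a generic optimizer; the data, the penalty choice and the value of λ are arbitrary assumptions made only to show the mechanics.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.0, 3.0]) + 0.1 * rng.normal(size=50)

def J(beta):              # data-fitting criterion
    return np.sum((y - X @ beta) ** 2)

def P(beta):              # penalty term (here a squared L2 norm)
    return np.sum(beta ** 2)

lam = 1.0                 # trade-off parameter lambda
res = minimize(lambda b: J(b) + lam * P(b), x0=np.zeros(5))
print(res.x)              # penalized estimate of beta
```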

In this section I am reviewing pure, mixed and hybrid penalties that will be used in the following chapters to implement feature selection. I first list important properties that may pertain to any type of penalty.


Figure 2.3: Admissible sets in two dimensions for different pure norms ‖β‖p.

2.3.1 Important Properties

Penalties may have different properties that can be more or less interesting depending on the problem and the expected solution. The most important properties for our purposes here are convexity, sparsity and stability.

Convexity. Regarding optimization, convexity is a desirable property that eases finding global solutions. A convex function verifies

$$\forall (x_1, x_2) \in \mathcal{X}^2, \quad f(t x_1 + (1-t) x_2) \le t f(x_1) + (1-t) f(x_2) \qquad (2.3)$$

for any value of t ∈ [0, 1]. Replacing the inequality by a strict inequality, we obtain the definition of strict convexity. A regularized expression like (2.2) is convex if the function J(β) and the penalty P(β) are both convex.

Sparsity. Usually, null coefficients furnish models that are easier to interpret. When sparsity does not harm the quality of the predictions, it is a desirable property, which moreover entails less memory usage and fewer computational resources.

Stability. There are numerous notions of stability or robustness, which measure how the solution varies when the input is perturbed by small changes. This perturbation can be adding, removing or replacing a few elements in the training set. Adding regularization, in addition to preventing overfitting, is a means to favor the stability of the solution.

2.3.2 Pure Penalties

For pure penalties, defined as P(β) = ‖β‖p, convexity holds for p ≥ 1. This is graphically illustrated in Figure 2.3, borrowed from Szafranski (2008), whose Chapter 3 is an excellent review of regularization techniques and the algorithms to solve them. In this figure, the shape of the admissible sets corresponding to different pure penalties is greyed out. Since the convexity of the penalty corresponds to the convexity of the set, we see that this property is verified for p ≥ 1.

Figure 2.4: Two-dimensional regularized problems with ‖β‖1 and ‖β‖2 penalties.

Regularizing a linear model with a norm like ‖β‖p means that the larger the component |βj|, the more important the feature xj in the estimation. On the contrary, the closer to zero, the more dispensable it is. In the limit of |βj| = 0, xj is not involved in the model. If many dimensions can be dismissed, then we can speak of sparsity.

A graphical interpretation of sparsity, borrowed from Marie Szafranski, is given in Figure 2.4. In a 2D problem, a solution can be considered as sparse if any of its components (β1 or β2) is null, that is, if the optimal β is located on one of the coordinate axes. Let us consider a search algorithm that minimizes an expression like (2.2) where J(β) is a quadratic function. When the solution to the unconstrained problem does not belong to the admissible set defined by P(β) (greyed out area), the solution to the constrained problem is as close as possible to the global minimum of the cost function inside the grey region. Depending on the shape of this region, the probability of having a sparse solution varies. A region with vertexes, such as the one corresponding to an L1 penalty, has more chances of inducing sparse solutions than the one of an L2 penalty. That idea is displayed in Figure 2.4, where J(β) is a quadratic function represented with three isolevel curves whose global minimum β^ls is outside the penalties' admissible regions. The closest point to this β^ls for the L1 regularization is β^ℓ1, and for the L2 regularization it is β^ℓ2. Solution β^ℓ1 is sparse because its second component is zero, while both components of β^ℓ2 are different from zero.

After reviewing the regions from Figure 2.3, we can relate the capacity of generating sparse solutions to the number and the "sharpness" of the vertexes of the greyed out area. For example, an L_{1/3} penalty has an admissible region with sharper vertexes that would induce a sparse solution even more strongly than an L1 penalty; however, the non-convex shape of the L_{1/3} ball results in difficulties during optimization that do not happen with a convex shape.


To summarize, a convex problem with a sparse solution is desired. But with pure penalties, sparsity is only possible with Lp norms with p ≤ 1, due to the fact that they are the only ones that have vertexes. On the other side, only norms with p ≥ 1 are convex; hence the only pure penalty that builds a convex problem with a sparse solution is the L1 penalty.

L0 Penalties. The L0 pseudo-norm of a vector β is defined as the number of entries different from zero, that is, P(β) = ‖β‖0 = card{βj | βj ≠ 0}:

$$\min_{\beta}\; J(\beta) \quad \text{s.t.}\quad \|\beta\|_0 \le t , \qquad (2.4)$$

where the parameter t represents the maximum number of non-zero coefficients in vector β. The larger the value of t (or the lower the value of λ if we use the equivalent expression in (2.1)), the fewer the number of zeros induced in vector β. If t is equal to the dimensionality of the problem (or if λ = 0), then the penalty term is not effective and β is not altered. In general, the computation of the solutions relies on combinatorial optimization schemes. Their solutions are sparse but unstable.
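The combinatorial nature of the constraint in (2.4) can be made concrete with the brute-force sketch below, which enumerates all subsets of at most t features and keeps the best least-squares fit; this is only feasible for a handful of variables and is given purely for illustration.

```python
import itertools
import numpy as np

def best_subset(X, y, t):
    """Exhaustive L0-constrained least squares: try every support of size <= t."""
    n, p = X.shape
    best_rss, best_support, best_beta = np.inf, (), None
    for k in range(1, t + 1):
        for support in itertools.combinations(range(p), k):
            Xs = X[:, support]
            beta_s, *_ = np.linalg.lstsq(Xs, y, rcond=None)
            rss = np.sum((y - Xs @ beta_s) ** 2)
            if rss < best_rss:
                best_rss, best_support, best_beta = rss, support, beta_s
    return best_support, best_beta
```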

L1 Penalties. Penalties built using the L1 norm induce sparsity and stability. This penalty has been named the Lasso (Least Absolute Shrinkage and Selection Operator) by Tibshirani (1996):

$$\min_{\beta}\; J(\beta) \quad \text{s.t.}\quad \sum_{j=1}^{p} |\beta_j| \le t . \qquad (2.5)$$

Despite all the advantages of the Lasso, the choice of the right penalty is not simply a question of convexity and sparsity. For example, concerning the Lasso, Osborne et al. (2000a) have shown that when the number of examples n is lower than the number of variables p, then the maximum number of non-zero entries of β is n. Therefore, if there is a strong correlation between several variables, this penalty risks dismissing all but one, resulting in a hardly interpretable model. In a field like genomics, where n is typically some tens of individuals and p several thousands of genes, the performance of the algorithm and the interpretability of the genetic relationships are severely limited.

The Lasso is a popular tool that has been used in multiple contexts besides regression, particularly in the field of feature selection in supervised classification (Mai et al. 2012, Witten and Tibshirani 2011) and clustering (Roth and Lange 2004, Pan et al. 2006, Pan and Shen 2007, Zhou et al. 2009, Guo et al. 2010, Witten and Tibshirani 2010, Bouveyron and Brunet 2012b,a).

The consistency of the problems regularized by a Lasso penalty is also a key feature. Defining consistency as the capability of always making the right choice of relevant variables when the number of individuals is infinitely large, Leng et al. (2006) have shown that when the penalty parameter (t or λ, depending on the formulation) is chosen by minimization of the prediction error, the Lasso penalty does not lead to consistent models. There is a large bibliography defining conditions under which Lasso estimators become consistent (Knight and Fu 2000, Donoho et al. 2006, Meinshausen and Bühlmann 2006, Zhao and Yu 2007, Bach 2008). In addition to those papers, some authors have introduced modifications to improve the interpretability and the consistency of the Lasso, such as the adaptive Lasso (Zou 2006).

L2 Penalties. The graphical interpretation of pure norm penalties in Figure 2.3 shows that this norm does not induce sparsity, due to its lack of vertexes. Strictly speaking, the L2 norm involves the square root of the sum of all squared components. In practice, when using L2 penalties, the square of the norm is used to avoid the square root and solve a linear system. Thus an L2 penalized optimization problem looks like

$$\min_{\beta}\; J(\beta) + \lambda \|\beta\|_2^2 . \qquad (2.6)$$

The effect of this penalty is the "equalization" of the components of the parameter that is being penalized. To illustrate this property, let us consider a least squares problem

$$\min_{\beta}\; \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 , \qquad (2.7)$$

with solution β^ls = (X⊤X)^{-1}X⊤y. If some input variables are highly correlated, the estimator β^ls is very unstable. To fix this numerical instability, Hoerl and Kennard (1970) proposed ridge regression, which regularizes Problem (2.7) with a quadratic penalty:

$$\min_{\beta}\; \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 .$$

The solution to this problem is β^ℓ2 = (X⊤X + λI_p)^{-1}X⊤y. All eigenvalues, in particular the small ones corresponding to the correlated dimensions, are now moved upwards by λ. This can be enough to avoid the instability induced by small eigenvalues. This "equalization" of the coefficients reduces the variability of the estimation, which may improve performances.
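A minimal sketch of the ridge solution recalled above is given below, assuming centered data; it simply solves the regularized normal equations.

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge estimate beta = (X'X + lam * I)^{-1} X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```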

As with the Lasso operator, there are several variations of ridge regression. For example, Breiman (1995) proposed the nonnegative garrote, which looks like a ridge regression where each variable is penalized adaptively. To do that, the least squares solution is used to define the penalty parameter attached to each coefficient:

$$\min_{\beta}\; \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} \frac{\beta_j^2}{(\beta_j^{ls})^2} . \qquad (2.8)$$

The effect is an elliptic admissible set instead of the ball of ridge regression. Another example is the adaptive ridge regression (Grandvalet 1998, Grandvalet and Canu 2002), where the penalty parameter differs for each component. There, every λj is optimized to penalize more or less depending on the influence of βj in the model.

Although L2 penalized problems are stable, they are not sparse. That makes those models harder to interpret, mainly in high dimensions.

L∞ Penalties. A special case of Lp norms is the infinity norm, defined as ‖x‖∞ = max(|x1|, |x2|, ..., |xp|). The admissible region for a penalty like ‖β‖∞ ≤ t is displayed in Figure 2.3. For the L∞ norm, the greyed out region is a square containing all the β vectors whose largest coefficient is less than or equal to the value of the penalty parameter t.

This norm is not commonly used as a regularization term itself; however, it frequently appears in mixed penalties, as shown in Section 2.3.4. In addition, in the optimization of penalized problems there exists the concept of dual norms. Dual norms arise in the analysis of estimation bounds and in the design of algorithms that address optimization problems by solving an increasing sequence of small subproblems (working set algorithms). The dual norm plays a direct role in computing optimality conditions of sparse regularized problems. The dual norm ‖β‖* of a norm ‖β‖ is defined as

$$\|\beta\|_* = \max_{w \in \mathbb{R}^p}\; \beta^\top w \quad \text{s.t.}\quad \|w\| \le 1 .$$

In the case of an Lq norm with q ∈ [1, +∞], the dual norm is the Lr norm such that 1/q + 1/r = 1. For example, the L2 norm is self-dual and the dual norm of the L1 norm is the L∞ norm. This is one of the reasons why L∞ is so important, even if it is not as popular as a penalty itself, because L1 is. An extensive explanation about dual norms and the algorithms that make use of them can be found in Bach et al. (2011).
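The definition above can be checked directly for the L1 norm: the maximum of β⊤w over the unit L1 ball is attained at a vertex of the ball, which yields the L∞ norm of β. The small sketch below, with an arbitrary random vector, verifies this.

```python
import numpy as np

rng = np.random.default_rng(1)
beta = rng.normal(size=6)

# Dual norm of beta with respect to the L1 norm:
# maximize beta^T w subject to ||w||_1 <= 1.
# The maximum is attained at a vertex of the L1 ball, i.e. w = sign(beta_j) e_j
# for the coordinate j with the largest |beta_j|.
j = np.argmax(np.abs(beta))
w_star = np.zeros_like(beta)
w_star[j] = np.sign(beta[j])

print(beta @ w_star)          # value of the maximum ...
print(np.abs(beta).max())     # ... equals the L-infinity norm of beta
```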

2.3.3 Hybrid Penalties

There is no reason for using pure penalties in isolation. We can combine them and try to obtain different benefits from each of them. The most popular example is the Elastic net regularization (Zou and Hastie 2005), with the objective of improving the Lasso penalization when n ≤ p. As recalled in Section 2.3.2, when n ≤ p the Lasso penalty can select at most n non-null features. Thus, in situations where there are more relevant variables, the Lasso penalty risks selecting only some of them. To avoid this effect, a combination of L1 and L2 penalties has been proposed. For the least squares example (2.7) from Section 2.3.2, the Elastic net is

$$\min_{\beta}\; \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^2 . \qquad (2.9)$$

The term in λ1 is a Lasso penalty that induces sparsity in vector β; on the other side, the term in λ2 is a ridge regression penalty that provides universal strong consistency (De Mol et al. 2009), that is, the asymptotic capability (when n goes to infinity) of always making the right choice of relevant variables.
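A hedged usage sketch of an elastic-net fit in the spirit of (2.9) is shown below with scikit-learn's ElasticNet estimator; the data are synthetic and the values of alpha and l1_ratio (which jointly play the roles of λ1 and λ2) are arbitrary.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 100))               # an n <= p setting
beta_true = np.zeros(100)
beta_true[:10] = 1.0                         # ten relevant variables
y = X @ beta_true + 0.1 * rng.normal(size=60)

model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(np.sum(model.coef_ != 0))              # number of selected variables
```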


2.3.4 Mixed Penalties

Imagine a linear regression problem where each variable is a gene. Depending on the application, several biological processes can be identified by L different groups of genes. Let us denote by G_ℓ the group of genes for the ℓth process and by d_ℓ the number of genes (variables) in each group, for all ℓ ∈ {1, ..., L}. Thus, the dimension of vector β is the sum of the number of genes of every group: dim(β) = Σ_{ℓ=1}^{L} d_ℓ. Mixed norms are a type of norms that take those groups into consideration. The general expression is shown below:

$$\|\beta\|_{(r,s)} = \left( \sum_{\ell} \Big( \sum_{j \in G_\ell} |\beta_j|^s \Big)^{\frac{r}{s}} \right)^{\frac{1}{r}} . \qquad (2.10)$$

The pair (r, s) identifies the norms that are combined: an Ls norm within groups and an Lr norm between groups. The Ls norm penalizes the variables in every group G_ℓ, while the Lr norm penalizes the within-group norms. The pair (r, s) is set so as to induce different properties in the resulting β vector. Note that the outer norm is often weighted to adjust for the different cardinalities of the groups, in order to avoid favoring the selection of the largest groups.
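The small helper below computes the mixed norm (2.10) for a given group structure; the vector and groups used in the usage line are arbitrary examples.

```python
import numpy as np

def mixed_norm(beta, groups, r, s):
    """Mixed (r, s) norm of beta, with groups given as a list of index arrays."""
    within = np.array([np.sum(np.abs(beta[g]) ** s) ** (1.0 / s) for g in groups])
    return np.sum(within ** r) ** (1.0 / r)

beta = np.array([0.5, -1.0, 0.0, 2.0, 0.0, 0.0])
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]
print(mixed_norm(beta, groups, r=1, s=2))   # the group-Lasso norm ||beta||_(1,2)
```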

Several combinations are available; the most popular is the norm ‖β‖(1,2), known as group-Lasso (Yuan and Lin 2006, Leng 2008, Xie et al. 2008a,b, Meier et al. 2008, Roth and Fischer 2008, Yang et al. 2010, Sanchez Merchante et al. 2012). Figure 2.5 shows the difference between the admissible sets of a pure L1 norm and a mixed L(1,2) norm. Many other mixings are possible, such as ‖β‖(1,4/3) (Szafranski et al. 2008) or ‖β‖(1,∞) (Wang and Zhu 2008, Kuan et al. 2010, Vogt and Roth 2010). Modifications of mixed norms have also been proposed, such as the group bridge penalty (Huang et al. 2009), the composite absolute penalties (Zhao et al. 2009), or combinations of mixed and pure norms, such as Lasso and group-Lasso (Friedman et al. 2010, Sprechmann et al. 2010) or group-Lasso and ridge penalty (Ng and Abugharbieh 2011).

2.3.5 Sparsity Considerations

In this chapter I have reviewed several possibilities to induce sparsity in the solution of optimization problems. However, having sparse solutions does not always lead to parsimonious models featurewise. For example, if we have four parameters per feature, we look for solutions where all four parameters are null for non-informative variables.

The Lasso and the other L1 penalties encourage solutions such as the one in the left panel of Figure 2.6. If the objective is sparsity, then the L1 norm does the job. However, if we aim at feature selection and if the number of parameters per variable exceeds one, this type of sparsity does not target the removal of variables.

To be able to dismiss some features, the sparsity pattern must encourage null values for the same variable across parameters, as shown in the right panel of Figure 2.6. This can be achieved with mixed penalties that define groups of features. For example, L(1,2) or L(1,∞) mixed norms, with the proper definition of groups, can induce sparsity patterns such as the one in the right panel of Figure 2.6, which displays a solution where variables 3, 5 and 8 are removed.

Figure 2.5: Admissible sets for the Lasso (a: L1) and the group-Lasso (b: L(1,2)).

Figure 2.6: Sparsity patterns for an example with 8 variables characterized by 4 parameters (a: L1-induced sparsity; b: L(1,2) group-induced sparsity).

2.3.6 Optimization Tools for Regularized Problems

In Caramanis et al. (2012) there is a good collection of mathematical techniques and optimization methods to solve regularized problems. Another good reference is the thesis of Szafranski (2008), which also reviews some techniques, classified in four categories. Those techniques, even if they belong to different categories, can be used separately or combined to produce improved optimization algorithms.

In fact, the algorithm implemented in this thesis is inspired by three of those techniques. It could be defined as an algorithm of "active constraints" implemented following a regularization path that is updated by approaching the cost function with secant hyper-planes. Deeper details are given in the dedicated Chapter 5.

Subgradient Descent. Subgradient descent is a generic optimization method that can be used in the setting of penalized problems where the subgradient of the loss function ∂J(β) and the subgradient of the regularizer ∂P(β) can be computed efficiently. On the one hand, it is essentially blind to the problem structure. On the other hand, many iterations are needed, so the convergence is slow and the solutions are not sparse. Basically, it is a generalization of the iterative gradient descent algorithm, where the solution vector β^(t+1) is updated proportionally to the negative subgradient of the function at the current point β^(t):

$$\beta^{(t+1)} = \beta^{(t)} - \alpha\, (s + \lambda s') , \quad \text{where } s \in \partial J(\beta^{(t)}),\; s' \in \partial P(\beta^{(t)}) .$$
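A sketch of this update for a least-squares loss with a Lasso penalty is given below; the decreasing step-size schedule is a simplifying assumption and the data are left to the caller.

```python
import numpy as np

def subgradient_lasso(X, y, lam, n_iter=1000, alpha0=0.01):
    """Subgradient descent on ||y - X beta||^2 + lam * ||beta||_1."""
    beta = np.zeros(X.shape[1])
    for t in range(1, n_iter + 1):
        grad_J = -2 * X.T @ (y - X @ beta)      # gradient of the squared loss
        sub_P = np.sign(beta)                   # a subgradient of the L1 norm
        beta -= (alpha0 / np.sqrt(t)) * (grad_J + lam * sub_P)
    return beta
```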

Coordinate Descent. Coordinate descent is based on the first order optimality conditions of the criterion (2.1). In the case of penalties like the Lasso, setting to zero the first order derivative with respect to coefficient βj gives

$$\beta_j = \frac{-\lambda\, \mathrm{sign}(\beta_j) - \frac{\partial J(\beta)}{\partial \beta_j}}{2 \sum_{i=1}^{n} x_{ij}^2} .$$

In the literature, those algorithms can also be referred to as "iterative thresholding" algorithms, because the optimization can be solved by soft-thresholding in an iterative process. As an example, Fu (1998) implements this technique, initializing every coefficient with the least squares solution β^ls and updating their values using an iterative thresholding algorithm where β_j^(t+1) = S_λ(∂J(β^(t))/∂β_j). The objective function is optimized with respect


to one variable at a time, while all others are kept fixed:

$$S_\lambda\!\left(\frac{\partial J(\beta)}{\partial \beta_j}\right) = \begin{cases} \dfrac{\lambda - \frac{\partial J(\beta)}{\partial \beta_j}}{2\sum_{i=1}^{n} x_{ij}^2} & \text{if } \frac{\partial J(\beta)}{\partial \beta_j} > \lambda \\[2ex] \dfrac{-\lambda - \frac{\partial J(\beta)}{\partial \beta_j}}{2\sum_{i=1}^{n} x_{ij}^2} & \text{if } \frac{\partial J(\beta)}{\partial \beta_j} < -\lambda \\[2ex] 0 & \text{if } \left|\frac{\partial J(\beta)}{\partial \beta_j}\right| \le \lambda \end{cases} \qquad (2.11)$$

The same principles define "block-coordinate descent" algorithms. In this case, first order derivatives are applied to the equations of a group-Lasso penalty (Yuan and Lin 2006, Wu and Lange 2008).
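The sketch below implements plain coordinate descent with soft-thresholding for the Lasso, in the spirit of (2.11), for the objective (1/2)‖y − Xβ‖² + λ‖β‖1; convergence checks are omitted and the data are assumed centered.

```python
import numpy as np

def cd_lasso(X, y, lam, n_sweeps=100):
    """Coordinate descent for (1/2)||y - X beta||^2 + lam * ||beta||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_sweeps):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]      # partial residual
            rho = X[:, j] @ r_j
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return beta
```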

Active and Inactive Sets. Active set algorithms are also referred to as "active constraints" or "working set" methods. These algorithms define a subset of variables called the "active set", which stores the indices of variables with non-zero βj; it is usually identified as the set A. The complement of the active set is the "inactive set", noted Ā. In the inactive set we can find the indices of the variables whose βj is zero. Thus the problem can be simplified to the dimensionality of A.

Osborne et al. (2000a) proposed the first of those algorithms to solve quadratic problems with Lasso penalties. Their algorithm starts from an empty active set that is updated incrementally (forward growing). There also exists a backward view where relevant variables are allowed to leave the active set; however, the forward philosophy that starts with an empty A has the advantage that the first calculations are low dimensional. In addition, the forward view fits better the feature selection intuition, where few features are intended to be selected.

Working set algorithms have to deal with three main tasks. There is an optimization task, where a minimization problem has to be solved using only the variables from the active set; Osborne et al. (2000a) solve a linear approximation of the original problem to determine the objective function descent direction, but any other method can be considered. In general, as the solutions of successive active sets are typically close to each other, it is a good idea to use the solution of the previous iteration as the initialization of the current one (warm start). Besides the optimization task, there is a working set update task, where the active set A is augmented with the variable from the inactive set Ā that most violates the optimality conditions of Problem (2.1). Finally, there is also a task to compute the optimality conditions. Their expressions are essential in the selection of the next variable to add to the active set and to test whether a particular vector β is a solution of Problem (2.1).

These active constraints or working set methods, even if they were originally proposed to solve L1 regularized quadratic problems, can also be adapted to generic functions and penalties: for example, linear functions and L1 penalties (Roth 2004), linear functions and L(1,2) penalties (Roth and Fischer 2008), or even logarithmic cost functions and combinations of L0, L1 and L2 penalties (Perkins et al. 2003). The algorithm developed in this work belongs to this family of solutions; a schematic sketch of the generic loop is given below.
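The following skeleton is only a schematic rendering of the three tasks described above, not the algorithm of this thesis; the helpers solve_restricted and optimality_violation are hypothetical placeholders for the restricted optimization and the optimality-condition check.

```python
def active_set_loop(X, y, lam, solve_restricted, optimality_violation, max_iter=100):
    """Generic forward working-set loop (schematic sketch)."""
    active = []                       # indices of currently active variables
    beta = None
    for _ in range(max_iter):
        # optimization task: solve the problem restricted to the active set,
        # warm-started from the previous solution
        beta = solve_restricted(X, y, lam, active, warm_start=beta)
        # optimality task: find the worst violator among inactive variables
        j, violation = optimality_violation(X, y, lam, beta, active)
        if violation <= 0:            # optimality conditions hold on the inactive set
            break
        active.append(j)              # working set update task
    return active, beta
```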

Hyper-Planes Approximation. Hyper-plane approximations solve a regularized problem using a piecewise linear approximation of the original cost function. This convex approximation is built using several secant hyper-planes at different points, obtained from the sub-gradient of the cost function at these points.

This family of algorithms implements an iterative mechanism where the number of hyper-planes increases at every iteration. These techniques are useful with large populations, since the number of iterations needed to converge does not depend on the size of the dataset. On the contrary, if few hyper-planes are used, then the quality of the approximation is not good enough and the solution can be unstable.

This family of algorithms is not as popular as the previous one, but some examples can be found in the domain of Support Vector Machines (Joachims 2006, Smola et al. 2008, Franc and Sonnenburg 2008) or Multiple Kernel Learning (Sonnenburg et al. 2006).

Regularization Path. The regularization path is the set of solutions that can be reached when solving a series of optimization problems of the form (2.1), where the penalty parameter λ is varied. It is not an optimization technique per se, but it is of practical use when the exact regularization path can be easily followed. Rosset and Zhu (2007) stated that this path is piecewise linear for those problems where the cost function is piecewise quadratic and the regularization term is piecewise linear (or vice-versa).

This concept was first applied to the Lasso algorithm of Osborne et al. (2000b). However, it was after the publication of the Least Angle Regression (LARS) algorithm, developed by Efron et al. (2004), that those techniques became popular. LARS defines the regularization path using active constraint techniques.

Once an active set A^(t) and its corresponding solution β^(t) have been set, looking for the regularization path means looking for a direction h and a step size γ to update the solution as β^(t+1) = β^(t) + γh. Afterwards, the active and inactive sets A^(t+1) and Ā^(t+1) are updated. That can be done by looking for the variables that most strongly violate the optimality conditions. Hence LARS sets the update step size, and which variable should enter the active set, from the correlation with the residuals.

Proximal Methods. Proximal methods optimize an objective function of the form (2.1), resulting from the addition of a Lipschitz-differentiable cost function J(β) and a non-differentiable penalty λP(β):

$$\min_{\beta \in \mathbb{R}^p}\; J(\beta^{(t)}) + \nabla J(\beta^{(t)})^\top (\beta - \beta^{(t)}) + \lambda P(\beta) + \frac{L}{2} \left\|\beta - \beta^{(t)}\right\|_2^2 . \qquad (2.12)$$

They are also iterative methods, where the cost function J(β) is linearized in the proximity of the solution β, so that the problem to solve at each iteration looks like (2.12), where the parameter L > 0 should be an upper bound on the Lipschitz constant of the gradient ∇J. That can be rewritten as

$$\min_{\beta \in \mathbb{R}^p}\; \frac{1}{2} \left\|\beta - \Big(\beta^{(t)} - \frac{1}{L} \nabla J(\beta^{(t)})\Big)\right\|_2^2 + \frac{\lambda}{L} P(\beta) . \qquad (2.13)$$

The basic algorithm makes use of the solution to (2.13) as the next value of β^(t+1). However, there are faster versions that take advantage of information about previous steps, such as the ones described by Nesterov (2007) or the FISTA algorithm (Beck and Teboulle 2009). Proximal methods can be seen as generalizations of gradient updates: in fact, setting λ = 0 in equation (2.13), the standard gradient update rule comes up.
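For the Lasso, the proximal operator of the L1 penalty is the soft-thresholding function, which yields the ISTA-style sketch below, following the linearization (2.13); the Lipschitz bound L is computed from the spectral norm of X and the data are assumed given.

```python
import numpy as np

def ista_lasso(X, y, lam, n_iter=500):
    """Proximal gradient (ISTA) for ||y - X beta||^2 + lam * ||beta||_1."""
    L = 2 * np.linalg.norm(X, ord=2) ** 2        # Lipschitz constant of the gradient
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = -2 * X.T @ (y - X @ beta)         # gradient of the squared loss
        z = beta - grad / L                      # gradient step as in (2.13)
        beta = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # prox of the L1 norm
    return beta
```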


Part II

Sparse Linear Discriminant Analysis


Abstract

Linear discriminant analysis (LDA) aims to describe data by a linear combination of features that best separates the classes. It may be used for classifying future observations or for describing those classes.

There is a vast bibliography about sparse LDA methods, reviewed in Chapter 3. Sparsity is typically induced by regularizing the discriminant vectors or the class means with L1 penalties (see Chapter 2). Section 2.3.5 discussed why this sparsity-inducing penalty may not guarantee parsimonious models regarding variables.

In this part we develop the group-Lasso Optimal Scoring Solver (GLOSS), which addresses a sparse LDA problem globally through a regression approach to LDA. Our analysis, presented in Chapter 4, formally relates GLOSS to Fisher's discriminant analysis and also enables deriving variants, such as LDA assuming a diagonal within-class covariance structure (Bickel and Levina 2004). The group-Lasso penalty selects the same features in all discriminant directions, leading to a more interpretable low-dimensional representation of data. The discriminant directions can be used in their totality, or the first ones may be chosen to produce a reduced-rank classification. The first two or three directions can also be used to project the data to generate a graphical display of the data. The algorithm is detailed in Chapter 5, and our experimental results of Chapter 6 demonstrate that, compared to the competing approaches, the models are extremely parsimonious without compromising prediction performances. The algorithm efficiently processes medium to large numbers of variables and is thus particularly well suited to the analysis of gene expression data.


3 Feature Selection in Fisher Discriminant Analysis

3.1 Fisher Discriminant Analysis

Linear discriminant analysis (LDA) aims to describe n labeled observations belonging to K groups by a linear combination of features which characterizes or separates the classes. It is used for two main purposes: classifying future observations, or describing the essential differences between classes, either by providing a visual representation of the data or by revealing the combinations of features that discriminate between classes. There are several frameworks in which linear combinations can be derived; Friedman et al. (2009) dedicate a whole chapter to linear methods for classification. In this part we focus on Fisher's discriminant analysis, which is a standard tool for linear discriminant analysis whose formulation does not rely on posterior probabilities but rather on some inertia principles (Fisher 1936).

We consider that the data consist of a set of n examples, with observations $x_i \in \mathbb{R}^p$ comprising p features and labels $y_i \in \{0,1\}^K$ indicating the exclusive assignment of observation $x_i$ to one of the K classes. It will be convenient to gather the observations in the n×p matrix $\mathbf{X} = (x_1^\top, \ldots, x_n^\top)^\top$ and the corresponding labels in the n×K matrix $\mathbf{Y} = (y_1^\top, \ldots, y_n^\top)^\top$.

Fisher's discriminant problem was first proposed for two-class problems, for the analysis of the famous iris dataset, as the maximization of the ratio of the projected between-class covariance to the projected within-class covariance:

$$\max_{\beta \in \mathbb{R}^p}\; \frac{\beta^\top \Sigma_B \beta}{\beta^\top \Sigma_W \beta} , \qquad (3.1)$$

where β is the discriminant direction used to project the data, and Σ_B and Σ_W are the p×p between-class covariance and within-class covariance matrices, respectively defined (for a K-class problem) as

$$\Sigma_W = \frac{1}{n} \sum_{k=1}^{K} \sum_{i \in G_k} (x_i - \mu_k)(x_i - \mu_k)^\top , \qquad \Sigma_B = \frac{1}{n} \sum_{k=1}^{K} \sum_{i \in G_k} (\mu - \mu_k)(\mu - \mu_k)^\top ,$$

where μ is the sample mean of the whole dataset, μ_k the sample mean of class k, and G_k indexes the observations of class k.


This analysis can be extended to the multi-class framework with K groups. In this case, K − 1 discriminant vectors β_k may be computed. Such a generalization was first proposed by Rao (1948). Several formulations for the multi-class Fisher's discriminant are available, for example as the maximization of a trace ratio:

$$\max_{B \in \mathbb{R}^{p \times (K-1)}}\; \frac{\mathrm{tr}\left(B^\top \Sigma_B B\right)}{\mathrm{tr}\left(B^\top \Sigma_W B\right)} , \qquad (3.2)$$

where the matrix B is built with the discriminant directions β_k as columns. Solving the multi-class criterion (3.2) is an ill-posed problem; a better formulation is based on a series of K − 1 subproblems:

$$\begin{aligned} \max_{\beta_k \in \mathbb{R}^p}\;\; & \beta_k^\top \Sigma_B \beta_k \\ \text{s.t.}\;\; & \beta_k^\top \Sigma_W \beta_k \le 1 , \\ & \beta_k^\top \Sigma_W \beta_\ell = 0 \quad \forall \ell < k . \end{aligned} \qquad (3.3)$$

The maximizer of subproblem k is the eigenvector of $\Sigma_W^{-1} \Sigma_B$ associated with the kth largest eigenvalue (see Appendix C).
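A compact sketch of this computation is given below: it builds Σ_W and Σ_B from labeled data and extracts the K − 1 leading generalized eigenvectors; it assumes Σ_W is invertible and that class labels are encoded as integers 0, ..., K−1.

```python
import numpy as np
import scipy.linalg

def fisher_directions(X, y, K):
    """K-1 Fisher discriminant directions via the generalized eigenproblem (3.3)."""
    n, p = X.shape
    mu = X.mean(axis=0)
    Sigma_W = np.zeros((p, p))
    Sigma_B = np.zeros((p, p))
    for k in range(K):
        Xk = X[y == k]
        mu_k = Xk.mean(axis=0)
        Sigma_W += (Xk - mu_k).T @ (Xk - mu_k) / n
        Sigma_B += len(Xk) * np.outer(mu_k - mu, mu_k - mu) / n
    # generalized eigenproblem Sigma_B b = lambda Sigma_W b
    eigvals, eigvecs = scipy.linalg.eigh(Sigma_B, Sigma_W)
    order = np.argsort(eigvals)[::-1]
    return eigvecs[:, order[:K - 1]]            # the K-1 discriminant directions
```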

3.2 Feature Selection in LDA Problems

LDA is often used as a data reduction technique, where the K − 1 discriminant directions summarize the p original variables. However, all variables intervene in the definition of these discriminant directions, and this behavior may be troublesome.

Several modifications of LDA have been proposed to generate sparse discriminant directions. Sparse LDA reveals discriminant directions that only involve a few variables. The main target of this sparsity is to reduce the dimensionality of the problem (as in genetic analysis), but parsimonious classification is also motivated by the need for interpretable models, robustness in the solution, or computational constraints.

The easiest approach to sparse LDA performs variable selection before discrimination. The relevancy of each feature is usually based on univariate statistics, which are fast and convenient to compute, but whose very partial view of the overall classification problem may lead to dramatic information loss. As a result, several approaches have been devised in recent years to construct LDA with wrapper and embedded feature selection capabilities.

They can be categorized according to the LDA formulation that provides the basis for the sparsity-inducing extension, that is, either Fisher's Discriminant Analysis (variance-based) or regression-based.

3.2.1 Inertia Based

The Fisher discriminant seeks a projection maximizing the separability of classes from inertia principles: mass centers should be far away (large between-class variance) and classes should be concentrated around their mass centers (small within-class variance). This view motivates a first series of sparse LDA formulations.

Moghaddam et al. (2006) propose an algorithm for sparse LDA in binary classification, where sparsity originates in a hard cardinality constraint. The formalization is based on the Fisher's discriminant (3.1), reformulated as a quadratically-constrained quadratic program (3.3). Computationally, the algorithm implements a combinatorial search with some eigenvalue properties that are used to avoid exploring subsets of possible solutions. Extensions of this approach have been developed, with new sparsity bounds for the two-class discrimination problem and shortcuts to speed up the evaluation of eigenvalues (Moghaddam et al. 2007).

Also for binary problems, Wu et al. (2009) proposed a sparse LDA applied to gene expression data, where the Fisher's discriminant (3.1) is solved as

$$\begin{aligned} \min_{\beta \in \mathbb{R}^p}\;\; & \beta^\top \Sigma_W \beta \\ \text{s.t.}\;\; & (\mu_1 - \mu_2)^\top \beta = 1 , \\ & \textstyle\sum_{j=1}^{p} |\beta_j| \le t , \end{aligned}$$

where μ_1 and μ_2 are the vectors of mean gene expression values corresponding to the two groups. The expression to optimize and the first constraint match problem (3.1); the second constraint encourages parsimony.

Witten and Tibshirani (2011) describe a multi-class technique using the Fisher's discriminant, rewritten in the form of K − 1 constrained and penalized maximization problems:

$$\max_{\beta_k \in \mathbb{R}^p}\; \beta_k^\top \Sigma_B^k \beta_k - P_k(\beta_k) \quad \text{s.t.}\quad \beta_k^\top \Sigma_W \beta_k \le 1 .$$

The term to maximize is the projected between-class covariance matrix β_k^⊤ Σ_B^k β_k, subject to an upper bound on the projected within-class covariance matrix β_k^⊤ Σ_W β_k. The penalty P_k(β_k) is added to avoid singularities and induce sparsity. The authors suggest weighted versions of the regular Lasso and fused Lasso penalties for general purpose data. The Lasso shrinks less informative variables to zero, and the fused Lasso encourages a piecewise constant β_k vector. The R code is available from the website of Daniela Witten.

Cai and Liu (2011) use the Fisher's discriminant to solve a binary LDA problem, but instead of performing separate estimations of Σ_W and (μ_1 − μ_2) to obtain the optimal solution β = Σ_W^{-1}(μ_1 − μ_2), they estimate the product directly through constrained L1 minimization:

$$\min_{\beta \in \mathbb{R}^p}\; \|\beta\|_1 \quad \text{s.t.}\quad \left\| \Sigma \beta - (\mu_1 - \mu_2) \right\|_\infty \le \lambda .$$

Sparsity is encouraged by the L1 norm of vector β, and the parameter λ is used to tune the optimization.


Most of the algorithms reviewed are conceived for binary classification. And for those that are envisaged for multi-class scenarios, the Lasso is the most popular way to induce sparsity; however, as discussed in Section 2.3.5, the Lasso is not the best tool to encourage parsimonious models when there are multiple discriminant directions.

3.2.2 Regression Based

In binary classification, LDA has been known to be equivalent to linear regression of scaled class labels since Fisher (1936). For K > 2, many studies show that multivariate linear regression of a specific class indicator matrix can be applied as a preprocessing step for LDA. However, directly casting LDA as a least squares problem is challenging for the multi-class case (Duda et al. 2000, Friedman et al. 2009).

Predefined Indicator Matrix

Multi-class classification is usually linked with linear regression through the definition of an indicator matrix (Friedman et al. 2009). An indicator matrix Y is an n×K matrix with the class labels for all samples. There are several well-known types in the literature. For example, the binary or dummy indicator (y_ik = 1 if sample i belongs to class k and y_ik = 0 otherwise) is commonly used in linking multi-class classification with linear regression (Friedman et al. 2009). Another "popular" choice is y_ik = 1 if sample i belongs to class k and y_ik = −1/(K−1) otherwise. It was used, for example, in extending Support Vector Machines to multi-class classification (Lee et al. 2004) or for generalizing the kernel target alignment measure (Guermeur et al. 2004).

There are some efforts which propose a formulation of the least squares problem based on a new class indicator matrix (Ye 2007). This new indicator matrix allows the definition of LS-LDA (Least Squares Linear Discriminant Analysis), which holds a rigorous equivalence with multi-class LDA under a mild condition, which is shown empirically to hold in many applications involving high-dimensional data.

Qiao et al. (2009) propose a discriminant analysis in the high-dimensional, low-sample setting which incorporates variable selection in a Fisher's LDA formulated as a generalized eigenvalue problem, which is then recast as a least squares regression. Sparsity is obtained by means of a Lasso penalty on the discriminant vectors. Even if this is not mentioned in the article, their formulation looks very close in spirit to Optimal Scoring regression. Some rather clumsy steps in the developments hinder the comparison, so that further investigations are required. The lack of publicly available code also restrained an empirical test of this conjecture. If the similitude is confirmed, their formalization would be very close to the one of Clemmensen et al. (2011), reviewed in the following section.

In a recent paper, Mai et al. (2012) take advantage of the equivalence between ordinary least squares and LDA problems to propose a binary classifier solving a penalized least squares problem with a Lasso penalty. The sparse version of the projection vector β is


obtained by solving

$$\min_{\beta \in \mathbb{R}^p,\, \beta_0 \in \mathbb{R}}\; n^{-1} \sum_{i=1}^{n} (y_i - \beta_0 - x_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} |\beta_j| ,$$

where y_i is the binary indicator of the label for pattern x_i. Even if the authors focus on the Lasso penalty, they also suggest any other generic sparsity-inducing penalty. The decision rule x^⊤β + β_0 > 0 is the LDA classifier when it is built using the resulting β vector for λ = 0, but a different intercept β_0 is required.

Optimal Scoring

In binary classification, the regression of (scaled) class indicators enables recovering exactly the LDA discriminant direction. For more than two classes, regressing predefined indicator matrices may be impaired by the masking effect, where the scores assigned to a class situated between two other ones never dominate (Hastie et al. 1994). Optimal scoring (OS) circumvents the problem by assigning "optimal scores" to the classes. This route was opened by Fisher (1936) for binary classification and pursued for more than two classes by Breiman and Ihaka (1984), with the aim of developing a non-linear extension of discriminant analysis based on additive models. They named their approach optimal scaling, for it optimizes the scaling of the indicators of classes together with the discriminant functions. Their approach was later disseminated under the name optimal scoring by Hastie et al. (1994), who proposed several extensions of LDA, either aiming at constructing more flexible discriminants (Hastie and Tibshirani 1996) or more conservative ones (Hastie et al. 1995).

As an alternative method to solve LDA problems, Hastie et al. (1995) proposed to incorporate a smoothness prior on the discriminant directions in the OS problem, through a positive-definite penalty matrix Ω, leading to a problem expressed in compact form as

$$\min_{\Theta, B}\; \|\mathbf{Y}\Theta - \mathbf{X}B\|_F^2 + \lambda\, \mathrm{tr}\left(B^\top \Omega B\right) \qquad (3.4a)$$
$$\text{s.t.}\quad n^{-1}\, \Theta^\top \mathbf{Y}^\top \mathbf{Y} \Theta = \mathbf{I}_{K-1} , \qquad (3.4b)$$

where $\Theta \in \mathbb{R}^{K\times(K-1)}$ are the class scores, $B \in \mathbb{R}^{p\times(K-1)}$ are the regression coefficients, and ‖·‖_F is the Frobenius norm. This compact form does not render the order that arises naturally when considering the following series of K − 1 problems:

$$\min_{\theta_k \in \mathbb{R}^K,\, \beta_k \in \mathbb{R}^p}\; \|\mathbf{Y}\theta_k - \mathbf{X}\beta_k\|^2 + \beta_k^\top \Omega \beta_k \qquad (3.5a)$$
$$\text{s.t.}\quad n^{-1}\, \theta_k^\top \mathbf{Y}^\top \mathbf{Y} \theta_k = 1 , \qquad (3.5b)$$
$$\qquad\; \theta_k^\top \mathbf{Y}^\top \mathbf{Y} \theta_\ell = 0 , \quad \ell = 1, \ldots, k-1 , \qquad (3.5c)$$

where each β_k corresponds to a discriminant direction.


Several sparse LDA have been derived by introducing non-quadratic sparsity-inducing penalties in the OS regression problem (Ghosh and Chinnaiyan 2005, Leng 2008, Grosenick et al. 2008, Clemmensen et al. 2011). Grosenick et al. (2008) proposed a variant of the lasso-based penalized OS of Ghosh and Chinnaiyan (2005) by introducing an elastic-net penalty in binary class problems. A generalization to multi-class problems was suggested by Clemmensen et al. (2011), where the objective function (3.5a) is replaced by

$$\min_{\beta_k \in \mathbb{R}^p,\, \theta_k \in \mathbb{R}^K}\; \sum_k \|\mathbf{Y}\theta_k - \mathbf{X}\beta_k\|_2^2 + \lambda_1 \|\beta_k\|_1 + \lambda_2\, \beta_k^\top \Omega \beta_k ,$$

where λ1 and λ2 are regularization parameters and Ω is a penalization matrix, often taken to be the identity for the elastic net. The code for SLDA is available from the website of Line Clemmensen.

Another generalization of the work of Ghosh and Chinnaiyan (2005) was proposed by Leng (2008), with an extension to the multi-class framework based on a group-Lasso penalty in the objective function (3.5a):

$$\min_{\beta_k \in \mathbb{R}^p,\, \theta_k \in \mathbb{R}^K}\; \sum_{k=1}^{K-1} \|\mathbf{Y}\theta_k - \mathbf{X}\beta_k\|_2^2 + \lambda \left( \sum_{j=1}^{p} \sqrt{\sum_{k=1}^{K-1} \beta_{kj}^2} \right)^2 , \qquad (3.6)$$

which is the criterion that was chosen in this thesis. The following chapters present our theoretical and algorithmic contributions regarding this formulation. The proposal of Leng (2008) was heuristically driven, and his algorithm followed closely the group-Lasso algorithm of Yuan and Lin (2006), which is not very efficient (the experiments of Leng (2008) are limited to small data sets with hundreds of examples and 1000 preselected genes, and no code is provided). Here we formally link (3.6) to penalized LDA and propose a publicly available efficient code for solving this problem.


4 Formalizing the Objective

In this chapter we detail the rationale supporting the Group-Lasso Optimal Scoring Solver (GLOSS) algorithm. GLOSS addresses a sparse LDA problem globally, through a regression approach. Our analysis formally relates GLOSS to Fisher's discriminant analysis and also enables deriving variants, such as LDA assuming a diagonal within-class covariance structure (Bickel and Levina 2004).

The sparsity arises from the group-Lasso penalty (3.6), due to Leng (2008), which selects the same features in all discriminant directions, thus providing an interpretable low-dimensional representation of data. For K classes, this representation can be either complete, in dimension K − 1, or partial, for a reduced-rank classification. The first two or three discriminants can also be used to display a graphical summary of the data.

The derivation of penalized LDA as a penalized optimal scoring regression is quite tedious, but it is required here since the algorithm hinges on this equivalence. The main lines have been derived in several places (Breiman and Ihaka 1984, Hastie et al. 1994, Hastie and Tibshirani 1996, Hastie et al. 1995) and already used before for sparsity-inducing penalties (Roth and Lange 2004). However, the published demonstrations were quite elusive on a number of points, leading to generalizations that were not supported in a rigorous way. To our knowledge, we disclosed the first formal equivalence between the optimal scoring regression problem penalized by the group-Lasso and penalized LDA (Sanchez Merchante et al. 2012).

4.1 From Optimal Scoring to Linear Discriminant Analysis

Following Hastie et al. (1995), we now show the equivalence between the series of problems encountered in penalized optimal scoring (p-OS) and in penalized LDA (p-LDA) by going through canonical correlation analysis. We first provide some properties about the solutions of an arbitrary problem in the p-OS series (3.5).

Throughout this chapter we assume that:

• there is no empty class, that is, the diagonal matrix $\mathbf{Y}^\top\mathbf{Y}$ is full rank;

• inputs are centered, that is, $\mathbf{X}^\top\mathbf{1}_n = 0$;

• the quadratic penalty Ω is positive-semidefinite and such that $\mathbf{X}^\top\mathbf{X} + \Omega$ is full rank.


4.1.1 Penalized Optimal Scoring Problem

For the sake of simplicity, we now drop the subscript k to refer to any problem in the p-OS series (3.5). First note that Problems (3.5) are biconvex in (θ, β), that is, convex in θ for each β value and vice-versa. The problems are, however, non-convex. In particular, if (θ, β) is a solution, then (−θ, −β) is also a solution.

The orthogonality constraint (3.5c) inherently limits the number of possible problems in the series to K, since we assumed that there are no empty classes. Moreover, as X is centered, the K − 1 first optimal scores are orthogonal to 1 (and the Kth problem would be solved by β_K = 0). All the problems considered here can be solved by a singular value decomposition of a real symmetric matrix, so that the orthogonality constraints are easily dealt with. Hence, in the sequel, we do not mention these orthogonality constraints (3.5c) anymore; they apply along the route, and dropping them simplifies all expressions. The generic problem solved is thus

$$\min_{\theta \in \mathbb{R}^K,\, \beta \in \mathbb{R}^p}\; \|\mathbf{Y}\theta - \mathbf{X}\beta\|^2 + \beta^\top \Omega \beta \qquad (4.1a)$$
$$\text{s.t.}\quad n^{-1}\, \theta^\top \mathbf{Y}^\top \mathbf{Y} \theta = 1 . \qquad (4.1b)$$

For a given score vector θ, the discriminant direction β that minimizes the p-OS criterion (4.1) is the penalized least squares estimator

$$\beta_{os} = \left(\mathbf{X}^\top \mathbf{X} + \Omega\right)^{-1} \mathbf{X}^\top \mathbf{Y} \theta . \qquad (4.2)$$

The objective function (4.1a) is then

$$\begin{aligned} \|\mathbf{Y}\theta - \mathbf{X}\beta_{os}\|^2 + \beta_{os}^\top \Omega \beta_{os} &= \theta^\top \mathbf{Y}^\top \mathbf{Y} \theta - 2\, \theta^\top \mathbf{Y}^\top \mathbf{X} \beta_{os} + \beta_{os}^\top \left(\mathbf{X}^\top \mathbf{X} + \Omega\right) \beta_{os} \\ &= \theta^\top \mathbf{Y}^\top \mathbf{Y} \theta - \theta^\top \mathbf{Y}^\top \mathbf{X} \left(\mathbf{X}^\top \mathbf{X} + \Omega\right)^{-1} \mathbf{X}^\top \mathbf{Y} \theta , \end{aligned}$$

where the second line stems from the definition of β_os (4.2). Now, using the fact that the optimal θ obeys constraint (4.1b), the optimization problem is equivalent to

$$\max_{\theta:\; n^{-1} \theta^\top \mathbf{Y}^\top \mathbf{Y} \theta = 1}\; \theta^\top \mathbf{Y}^\top \mathbf{X} \left(\mathbf{X}^\top \mathbf{X} + \Omega\right)^{-1} \mathbf{X}^\top \mathbf{Y} \theta , \qquad (4.3)$$

which shows that the optimization of the p-OS problem with respect to θ_k boils down to finding the kth largest eigenvector of $\mathbf{Y}^\top\mathbf{X}\left(\mathbf{X}^\top\mathbf{X} + \Omega\right)^{-1}\mathbf{X}^\top\mathbf{Y}$. Indeed, Appendix C details that Problem (4.3) is solved by

$$(\mathbf{Y}^\top \mathbf{Y})^{-1} \mathbf{Y}^\top \mathbf{X} \left(\mathbf{X}^\top \mathbf{X} + \Omega\right)^{-1} \mathbf{X}^\top \mathbf{Y} \theta = \alpha^2 \theta , \qquad (4.4)$$


where α² is the maximal eigenvalue¹:

$$n^{-1} \theta^\top \mathbf{Y}^\top \mathbf{X} \left(\mathbf{X}^\top \mathbf{X} + \Omega\right)^{-1} \mathbf{X}^\top \mathbf{Y} \theta = \alpha^2\, n^{-1} \theta^\top (\mathbf{Y}^\top \mathbf{Y}) \theta$$
$$n^{-1} \theta^\top \mathbf{Y}^\top \mathbf{X} \left(\mathbf{X}^\top \mathbf{X} + \Omega\right)^{-1} \mathbf{X}^\top \mathbf{Y} \theta = \alpha^2 . \qquad (4.5)$$
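The sketch below mirrors this computation numerically, assuming a centered X, an indicator matrix Y and a penalty Ω are given: the score vectors θ_k are generalized eigenvectors as in (4.4), normalized according to (4.1b), and the directions follow from (4.2); the function name is an illustrative assumption.

```python
import numpy as np
import scipy.linalg

def penalized_optimal_scoring(X, Y, Omega):
    """Solve the p-OS series: scores Theta, directions B_os and eigenvalues alpha_k^2."""
    n, K = Y.shape
    A = np.linalg.solve(X.T @ X + Omega, X.T @ Y)   # (X'X + Omega)^{-1} X'Y
    M = Y.T @ X @ A                                 # Y'X (X'X + Omega)^{-1} X'Y
    # generalized eigenproblem M theta = w (Y'Y / n) theta,
    # whose eigenvectors satisfy n^{-1} theta' Y'Y theta = 1 as in (4.1b)
    w, Theta = scipy.linalg.eigh(M, Y.T @ Y / n)
    order = np.argsort(w)[::-1][:K - 1]
    Theta = Theta[:, order]
    B_os = A @ Theta                                # discriminant directions, as in (4.2)
    alpha_sq = w[order] / n                         # the eigenvalues alpha_k^2 of (4.4)
    return Theta, B_os, alpha_sq
```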

4.1.2 Penalized Canonical Correlation Analysis

As per Hastie et al. (1995), the penalized Canonical Correlation Analysis (p-CCA) problem between variables X and Y is defined as follows:

$$\max_{\theta \in \mathbb{R}^K,\, \beta \in \mathbb{R}^p}\; n^{-1} \theta^\top \mathbf{Y}^\top \mathbf{X} \beta \qquad (4.6a)$$
$$\text{s.t.}\quad n^{-1}\, \theta^\top \mathbf{Y}^\top \mathbf{Y} \theta = 1 , \qquad (4.6b)$$
$$\qquad\; n^{-1}\, \beta^\top \left(\mathbf{X}^\top \mathbf{X} + \Omega\right) \beta = 1 . \qquad (4.6c)$$

The solutions to (4.6) are obtained by finding saddle points of the Lagrangian:

$$n L(\beta, \theta, \nu, \gamma) = \theta^\top \mathbf{Y}^\top \mathbf{X} \beta - \nu\, (\theta^\top \mathbf{Y}^\top \mathbf{Y} \theta - n) - \gamma\, (\beta^\top (\mathbf{X}^\top \mathbf{X} + \Omega) \beta - n)$$
$$\Rightarrow\; n \frac{\partial L(\beta, \theta, \gamma, \nu)}{\partial \beta} = \mathbf{X}^\top \mathbf{Y} \theta - 2 \gamma (\mathbf{X}^\top \mathbf{X} + \Omega) \beta$$
$$\Rightarrow\; \beta_{cca} = \frac{1}{2\gamma} (\mathbf{X}^\top \mathbf{X} + \Omega)^{-1} \mathbf{X}^\top \mathbf{Y} \theta .$$

Then, as β_cca obeys (4.6c), we obtain

$$\beta_{cca} = \frac{(\mathbf{X}^\top \mathbf{X} + \Omega)^{-1} \mathbf{X}^\top \mathbf{Y} \theta}{\sqrt{n^{-1} \theta^\top \mathbf{Y}^\top \mathbf{X} (\mathbf{X}^\top \mathbf{X} + \Omega)^{-1} \mathbf{X}^\top \mathbf{Y} \theta}} , \qquad (4.7)$$

so that the optimal objective function (4.6a) can be expressed with θ alone:

$$n^{-1} \theta^\top \mathbf{Y}^\top \mathbf{X} \beta_{cca} = \frac{n^{-1} \theta^\top \mathbf{Y}^\top \mathbf{X} (\mathbf{X}^\top \mathbf{X} + \Omega)^{-1} \mathbf{X}^\top \mathbf{Y} \theta}{\sqrt{n^{-1} \theta^\top \mathbf{Y}^\top \mathbf{X} (\mathbf{X}^\top \mathbf{X} + \Omega)^{-1} \mathbf{X}^\top \mathbf{Y} \theta}} = \sqrt{n^{-1} \theta^\top \mathbf{Y}^\top \mathbf{X} (\mathbf{X}^\top \mathbf{X} + \Omega)^{-1} \mathbf{X}^\top \mathbf{Y} \theta} ,$$

and the optimization problem with respect to θ can be restated as

$$\max_{\theta:\; n^{-1} \theta^\top \mathbf{Y}^\top \mathbf{Y} \theta = 1}\; \theta^\top \mathbf{Y}^\top \mathbf{X} \left(\mathbf{X}^\top \mathbf{X} + \Omega\right)^{-1} \mathbf{X}^\top \mathbf{Y} \theta . \qquad (4.8)$$

Hence the p-OS and p-CCA problems produce the same optimal score vectors θ. The regression coefficients are thus proportional, as shown by (4.2) and (4.7):

$$\beta_{os} = \alpha\, \beta_{cca} , \qquad (4.9)$$

¹The awkward notation α² for the eigenvalue was chosen here to ease comparison with Hastie et al. (1995). It is easy to check that this eigenvalue is indeed non-negative (see Equation (4.5), for example).


where α is defined by (4.5). The p-CCA optimization problem can also be written as a function of β alone, using the optimality conditions for θ:

$$n \frac{\partial L(\beta, \theta, \gamma, \nu)}{\partial \theta} = \mathbf{Y}^\top \mathbf{X} \beta - 2 \nu\, \mathbf{Y}^\top \mathbf{Y} \theta$$
$$\Rightarrow\; \theta_{cca} = \frac{1}{2\nu} (\mathbf{Y}^\top \mathbf{Y})^{-1} \mathbf{Y}^\top \mathbf{X} \beta . \qquad (4.10)$$

Then, as θ_cca obeys (4.6b), we obtain

$$\theta_{cca} = \frac{(\mathbf{Y}^\top \mathbf{Y})^{-1} \mathbf{Y}^\top \mathbf{X} \beta}{\sqrt{n^{-1} \beta^\top \mathbf{X}^\top \mathbf{Y} (\mathbf{Y}^\top \mathbf{Y})^{-1} \mathbf{Y}^\top \mathbf{X} \beta}} , \qquad (4.11)$$

leading to the following expression of the optimal objective function:

$$n^{-1} \theta_{cca}^\top \mathbf{Y}^\top \mathbf{X} \beta = \frac{n^{-1} \beta^\top \mathbf{X}^\top \mathbf{Y} (\mathbf{Y}^\top \mathbf{Y})^{-1} \mathbf{Y}^\top \mathbf{X} \beta}{\sqrt{n^{-1} \beta^\top \mathbf{X}^\top \mathbf{Y} (\mathbf{Y}^\top \mathbf{Y})^{-1} \mathbf{Y}^\top \mathbf{X} \beta}} = \sqrt{n^{-1} \beta^\top \mathbf{X}^\top \mathbf{Y} (\mathbf{Y}^\top \mathbf{Y})^{-1} \mathbf{Y}^\top \mathbf{X} \beta} .$$

The p-CCA problem can thus be solved with respect to β by plugging this value in (46)

maxβisinRp

nminus1βgtXgtY(YgtY)minus1YgtXβ (412a)

s t nminus1 βgt(XgtX + Ω

)β = 1 (412b)

where the positive objective function has been squared compared to (46) This formu-lation is important since it will be used to link p-CCA to p-LDA We thus derive itssolution and following the reasoning of Appendix C βcca verifies

nminus1XgtY(YgtY)minus1YgtXβcca = λ(XgtX + Ω

)βcca (413)

where λ is the maximal eigenvalue shown below to be equal to α2

nminus1βgtccaXgtY(YgtY)minus1YgtXβcca = λ

rArr nminus1αminus1βgtccaXgtY(YgtY)minus1YgtX(XgtX + Ω)minus1XgtYθ = λ

rArr nminus1αβgtccaXgtYθ = λ

rArr nminus1θgtYgtX(XgtX + Ω)minus1XgtYθ = λ

rArr α2 = λ

The first line is obtained by obeying constraint (412b) the second line by the relation-ship (47) where the denominator is α the third line comes from (44) the fourth lineuses again the relationship (47) and the last one the definition of α (45)

38

41 From Optimal Scoring to Linear Discriminant Analysis

413 Penalized Linear Discriminant Analysis

Still following Hastie et al (1995) the penalized Linear Discriminant Analysis is de-fined as follows

maxβisinRp

βgtΣBβ (414a)

s t βgt(ΣW + nminus1Ω)β = 1 (414b)

where ΣB and ΣW are respectively the sample between-class and within-class variancesof the original p-dimensional data This problem may be solved by an eigenvector de-composition as detailed in Appendix C

As the feature matrix X is assumed to be centered the sample total between-classand within-class covariance matrices can be written in a simple form that is amenable

to a simple matrix representation using the projection operator Y(YgtY

)minus1Ygt

ΣT =1

n

nsumi=1

xixigt

= nminus1XgtX

ΣB =1

n

Ksumk=1

nk microkmicrogtk

= nminus1XgtY(YgtY

)minus1YgtX

ΣW =1

n

Ksumk=1

sumiyik=1

(xi minus microk) (xi minus microk)gt

= nminus1

(XgtXminusXgtY

(YgtY

)minus1YgtX

)

Using these formulae the solution to the p-LDA problem (414) is obtained as

XgtY(YgtY

)minus1YgtXβlda = λ

(XgtX + ΩminusXgtY

(YgtY

)minus1YgtX

)βlda

XgtY(YgtY

)minus1YgtXβlda =

λ

1minus λ

(XgtX + Ω

)βlda

The comparison of the last equation with βcca (413) shows that βlda and βcca areproportional and that λ(1minus λ) = α2 Using constraints (412b) and (414b) it comesthat

βlda = (1minus α2)minus12 βcca

= αminus1(1minus α2)minus12 βos

which ends the path from p-OS to p-LDA

39

4 Formalizing the Objective

414 Summary

The three previous subsections considered a generic form of the kth problem in the p-OS series The relationships unveiled above also hold for the compact notation gatheringall problems (34) which is recalled below

minΘ BYΘminusXB2F + λ tr

(BgtΩB

)s t nminus1 ΘgtYgtYΘ = IKminus1

Let A represent the (K minus 1) times (K minus 1) diagonal matrix with elements αk being the

square-root of the largest eigenvector of YgtX(XgtX + Ω

)minus1XgtY we have

BLDA = BCCA

(IKminus1 minusA2

)minus 12

= BOS Aminus1(IKminus1 minusA2

)minus 12 (415)

where IKminus1 is the (K minus 1)times (K minus 1) identity matrixAt this point the features matrix X that in the input space has dimensions n times p

can be projected into the optimal scoring domain as a ntimesK minus 1 matrix XOS = XBOS

or into the linear discriminant analysis space as a n timesK minus 1 matrix XLDA = XBLDAClassification can be performed in any of those domains if the appropriate distance(penalized within-class covariance matrix) is applied

With the aim of performing classification the whole process could be summarized asfollows

1 Solve the p-OS problem as

BOS =(XgtX + λΩ

)minus1XgtYΘ

where Θ are the K minus 1 leading eigenvectors of

YgtX(XgtX + λΩ

)minus1XgtY

2 Translate the data samples X into the LDA domain as XLDA = XBOSD

where D = Aminus1(IKminus1 minusA2

)minus 12

3 Compute the matrix M of centroids microk from XLDA and Y

4 Evaluate the distance d(x microk) in the LDA domain as a function of M andXLDA

5 Translate distances into posterior probabilities and affect every sample i to aclass k following the maximum a posteriori rule

6 Graphical Representation

40

42 Practicalities

The solution of the penalized optimal scoring regression and the computation of thedistance and posterior matrices are detailed in Sections 421 Section 422 and Section423 respectively

42 Practicalities

421 Solution of the Penalized Optimal Scoring Regression

Following Hastie et al (1994) and Hastie et al (1995) a quadratically penalized LDAproblem can be presented as a quadratically penalized OS problem

minΘisinRKtimesKminus1BisinRptimesKminus1

YΘminusXB2F + λ tr(BgtΩB

)(416a)

s t nminus1 ΘgtYgtYΘ = IKminus1 (416b)

where Θ are the class scores B the regression coefficients and middotF is the Frobeniusnorm

Though non-convex the OS problem is readily solved by a decomposition in Θ and Bthe optimal BOS does not intervene in the optimality conditions with respect to Θ andthe optimization with respect to B is obtained in a closed form as a linear combinationof the optimal scores Θ (Hastie et al 1995) The algorithm may seem a bit tortuousconsidering the properties mentioned above as it proceeds in four steps

1 Initialize Θ to Θ0 such that nminus1 Θ0gtYgtYΘ0 = IKminus1

2 Compute B =(XgtX + λΩ

)minus1XgtYΘ0

3 Set Θ to be the K minus 1 leading eigenvectors of YgtX(XgtX + λΩ

)minus1XgtY

4 Compute the optimal regression coefficients

BOS =(XgtX + λΩ

)minus1XgtYΘ (417)

Defining Θ0 in Step 1 instead of using directly Θ as expressed in Step 3 drasti-cally reduces the computational burden of the eigen-analysis the latter is performed on

Θ0gtYgtX(XgtX + λΩ

)minus1XgtYΘ0 which is computed as Θ0gtYgtXB thus avoiding a

costly matrix inversion The solution of the penalized optimal scoring as an eigenvectordecomposition is detailed and justified in Appendix B

This four step algorithm is valid when the penalty is on the form BgtΩBgt Howeverwhen a L1 penalty is applied in (416) the optimization algorithm requires iterativeupdates of B and Θ That situation is developed by Clemmensen et al (2011) where

41

4 Formalizing the Objective

a Lasso or an Elastic net penalty is used to induce sparsity in the OS problem Fur-thermore these Lasso and Elastic net penalties do not enjoy the equivalence with LDAproblems

422 Distance Evaluation

The simplest classification rule is the Nearest Centroid rule where the sample xi isassigned to class k if sample xi is closer (in terms of the shared within-class Mahalanobisdistance) to centroid microk than to any other centroid micro` In general the parameters of themodel are unknown and the rule is applied with the parameters estimated from trainingdata (sample estimators microk and ΣW) If microk are the centroids in the input space samplexi is assigned to the class k if the distance

d(xi microk) = (xi minus microk)gtΣminus1WΩ(xi minus microk)minus 2 log

(nkn

) (418)

is minimized among all k In expression (418) the first term is the Mahalanobis distancein the input space and the second term is an adjustment term for unequal class sizes thatestimates the prior probability of class k Note that this is inspired by the Gaussian viewof LDA and that another definition of the adjustment term could be used (Friedmanet al 2009 Mai et al 2012) The matrix ΣWΩ used in (418) is the penalized within-class covariance matrix that can be decomposed in a penalized and a non-penalizedcomponent

Σminus1WΩ =

(nminus1(XgtX + λΩ)minus ΣB

)minus1

=(nminus1XgtXminus ΣB + nminus1λΩ

)minus1

=(ΣW + nminus1λΩ

)minus1 (419)

Before explaining how to compute the distances let us summarize some clarifying points

bull The solution BOS of the p-OS problem is enough to accomplish classification

bull In the LDA domain (space of discriminant variates XLDA) classification is basedon Euclidean distances

bull Classification can be done in a reduced rank space of dimension R lt K minus 1 byusing the first R discriminant directions βkRk=1

As a result the expression of the distance (418) depends on the domain where theclassification is performed If we classify in the p-OS domain

(xi minus microk)BOS2ΣWΩminus 2 log(πk)

where πk is the estimated class prior and middotS is the Mahalanobis distance assumingwithin-class covariance S If classification is done in the p-LDA domain∥∥∥(xi minus microk)BOSAminus1

(IKminus1 minusA2

)minus 12

∥∥∥2

2minus 2 log(πk)

which is a plain Euclidean distance

42

43 From Sparse Optimal Scoring to Sparse LDA

423 Posterior Probability Evaluation

Let d(xmicrok) be a distance between xi and microk defined as in (418) under the assumptionthat classes are Gaussians the estimated posterior probabilities p(yk = 1|x) can beestimated as

p(yk = 1|x) prop exp

(minusd(xmicrok)

2

)prop πk exp

(minus1

2

∥∥∥(xi minus microk)BOSAminus1(IKminus1 minusA2

)minus 12

∥∥∥2

2

) (420)

Those probabilities must be normalized to ensure that their sum one When the dis-tances d(xmicrok) take large values expminusd(xmicrok)

2 can take extremely small values generatingunderflow issues A classical trick to fix this numerical issue is detailed below

p(yk = 1|x) =πk exp

(minusd(xmicrok)

2

)sum

` π` exp(minusd(xmicro`)

2

)=

πk exp(minusd(xmicrok)

2 + dmax2

)sum`

π` exp

(minusd(xmicro`)

2+dmax

2

)

where dmax = maxk d(xmicrok)

424 Graphical Representation

Sometimes it can be useful to have a graphical display of the data set Using onlythe two or the three most discriminant directions may not provide the best separationbetween classes but can suffice to inspect the data That can be accomplished by plottingthe first two or three dimensions of the regression fits XOS or the discriminant variatesXLDA depending if we are presenting the dataset in the OS or in the LDA domainOther attributes such as the centroids or the shape of the within-class variance can berepresented

43 From Sparse Optimal Scoring to Sparse LDA

The equivalence stated in Section 41 holds for quadratic penalties of the form βgtΩβunder the assumption that YgtY and XgtX + λΩ are full rank (fulfilled when thereare not empty classes and Ω is positive definite) Quadratic penalties have interestingproperties but as recalled in Section 23 they do not induce sparsity In this respectL1 penalties are preferable but they lack a connection such as the one stated by Hastieet al (1995) between p-LDA and p-OS stated

In this section we introduce the tools used to obtain sparse models maintaining theequivalence between p-LDA and p-OS problems We use a group-Lasso penalty (see

43

4 Formalizing the Objective

section 234) that induces groups of zeroes to the coefficients corresponding to thesame feature in all discriminant directions resulting in real parsimonious models Ourderivation uses a variational formulation of the group-Lasso to generalize the equivalencedrawn by Hastie et al (1995) for quadratic penalties Therefore we are intending toshow that our formulation of group-Lasso can be written in the quadratic form BgtΩB

431 A Quadratic Variational Form

Quadratic variational forms of the Lasso and group-Lasso have been proposed shortlyafter the original Lasso paper of Hastie and Tibshirani (1996) as a means to address opti-mization issues but also as an inspiration for generalizing the Lasso penalty (Grandvalet1998 Canu and Grandvalet 1999) The algorithms based on these quadratic variationalforms iteratively reweighs a quadratic penalty They are now often outperformed bymore efficient strategies (Bach et al 2012)

Our formulation of group-Lasso is showed below

minτisinRp

minBisinRptimesKminus1

J(B) + λ

psumj=1

w2j

∥∥βj∥∥2

2

τj(421a)

s tsum

j τj minussum

j wj∥∥βj∥∥

2le 0 (421b)

τj ge 0 j = 1 p (421c)

where B isin RptimesKminus1 is a matrix composed of row vectors βj isin RKminus1

B =(β1gt βpgt

)gtand wj are predefined nonnegative weights The cost function

J(B) in our context is the OS regression YΘ + XB22 by now on behalf of sim-plicity I leave J(B) Here and in what follows bτ is defined by continuation at zeroas b0 = +infin if b 6= 0 and 00 = 0 Note that variants of (421) have been proposedelsewhere (see eg Canu and Grandvalet 1999 Bach et al 2012 and references therein)

The intuition behind our approach is that using the variational formulation we recasta non quadratic expression into the convex hull of a family of quadratic penalties definedby variable τj That is graphically shown in Figure 41

Let us start proving the equivalence of our variational formulation and the standardgroup-Lasso (there is an alternative variational formulation detailed and demonstratedin Appendix D)

Lemma 41 The quadratic penalty in βj in (421) acts as the group-Lasso penaltyλsump

j=1wj∥∥βj∥∥

2

Proof The Lagrangian of Problem (421) is

L = J(B) + λ

psumj=1

w2j

∥∥βj∥∥2

2

τj+ ν0

( psumj=1

τj minuspsumj=1

wj∥∥βj∥∥

2

)minus

psumj=1

νjτj

44

43 From Sparse Optimal Scoring to Sparse LDA

Figure 41 Graphical representation of the variational approach to Group-Lasso

Thus the first order optimality conditions for τj are

partLpartτj

(τj ) = 0hArr minusλw2j

∥∥βj∥∥2

2

τj2 + ν0 minus νj = 0

hArr minusλw2j

∥∥βj∥∥2

2+ ν0τ

j

2 minus νjτj2 = 0

rArr minusλw2j

∥∥βj∥∥2

2+ ν0 τ

j

2 = 0

The last line is obtained from complementary slackness which implies here νjτj = 0

Complementary slackness states that νjgj(τj ) = 0 where νj is the Lagrange multiplier

for constraint gj(τj) le 0 As a result the optimal value of τj

τj =

radicλw2

j

∥∥βj∥∥2

2

ν0=

radicλ

ν0wj∥∥βj∥∥

2(422)

We note that ν0 6= 0 if there is at least one coefficient βjk 6= 0 thus the inequalityconstraint (421b) is at bound (due to complementary slackness)

psumj=1

τj minuspsumj=1

wj∥∥βj∥∥

2= 0 (423)

so that τj = wj∥∥βj∥∥

2 Using this value into (421a) it is possible to conclude that

Problem (421) is equivalent to the standard group-Lasso operator

minBisinRptimesM

J(B) + λ

psumj=1

wj∥∥βj∥∥

2 (424)

So we have presented a convex quadratic variational form of the group-Lasso anddemonstrate its equivalence with the standard group-Lasso formulation

45

4 Formalizing the Objective

With Lemma 41 we have proved that under constraints (421b)-(421c) the quadraticproblem (421a) is equivalent to the standard formulation for the group-Lasso (424) Thepenalty term of (421a) can be conveniently presented as λBgtΩB where

Ω = diag

(w2

1

τ1w2

2

τ2

w2p

τp

) (425)

with

τj = wj∥∥βj∥∥

2

resulting in Ω diagonal components

(Ω)jj =wj∥∥βj∥∥

2

(426)

And as stated at the beginning of this section the equivalence between p-LDA prob-lems and p-OS problems is demonstrated for the variational formulation This equiv-alence is crucial to the derivation of the link between sparse OS and sparse LDA itfurthermore suggests a convenient implementation We sketch below some propertiesthat are instrumental in the implementation of the active set described in Section 5

The first property states that the quadratic formulation is convex when J is convexthus providing an easy control of optimality and convergence

Lemma 42 If J is convex Problem (421) is convex

Proof The function g(β τ) = β22τ known as the perspective function of f(β) =β22 is convex in (β τ) (see eg Boyd and Vandenberghe 2004 Chapter 3) and theconstraints (421b)ndash(421c) define convex admissible sets hence Problem (421) is jointlyconvex with respect to (B τ )

In what follows J will be a convex quadratic (hence smooth) function in which casea necessary and sufficient optimality condition is that zero belongs to the subdifferentialof the objective function whose expression is provided in the following lemma

Lemma 43 For all B isin RptimesKminus1 the subdifferential of the objective function of Prob-lem (424) is

V isin RptimesKminus1 V =partJ(B)

partB+ λG

(427)

where G isin RptimesKminus1 is a matrix composed of row vectors gj isin RKminus1

G =(g1gt gpgt

)gtdefined as follows Let S(B) denote the columnwise support of

B S(B) = j isin 1 p ∥∥βj∥∥

26= 0 then we have

forallj isin S(B) gj = wj∥∥βj∥∥minus1

2βj (428)

forallj isin S(B) ∥∥gj∥∥

2le wj (429)

46

43 From Sparse Optimal Scoring to Sparse LDA

This condition results in an equality for the ldquoactiverdquo non-zero vectors βj and aninequality for the other ones which both provide essential building blocks of our algo-rithm

Proof When∥∥βj∥∥

26= 0 the gradient of the penalty with respect to βj is

part (λsump

m=1wj βm2)

partβj= λwj

βj∥∥βj∥∥2

(430)

At∥∥βj∥∥

2= 0 the gradient of the objective function is not continuous and the optimality

conditions then make use of the subdifferential (Bach et al 2011)

partβj

psumm=1

wj βm2

)= partβj

(λwj

∥∥βj∥∥2

)=λwjv isin RKminus1 v2 le 1

(431)

That gives the expression (429)

Lemma 44 Problem (421) admits at least one solution which is unique if J is strictlyconvex All critical points B of the objective function verifying the following conditionsare global minima

forallj isin S partJ(B)

partβj+ λwj

∥∥βj∥∥minus1

2βj = 0 (432a)

forallj isin S ∥∥∥∥partJ(B)

partβj

∥∥∥∥2

le λwj (432b)

where S sube 1 p denotes the set of non-zero row vectors βj and S(B) is its comple-ment

Lemma 44 provides a simple appraisal of the support of the solution which wouldnot be as easily handled with the direct analysis of the variational problem (421)

432 Group-Lasso OS as Penalized LDA

With all the previous ingredients the group-Lasso Optimal Scoring Solver for per-forming sparse LDA can be introduced

Proposition 41 The group-Lasso OS problem

BOS = argminBisinRptimesKminus1

minΘisinRKtimesKminus1

1

2YΘminusXB2F + λ

psumj=1

wj∥∥βj∥∥

2

s t nminus1 ΘgtYgtYΘ = IKminus1

47

4 Formalizing the Objective

is equivalent to the penalized LDA problem

BLDA = maxBisinRptimesKminus1

tr(BgtΣBB

)s t Bgt(ΣW + nminus1λΩ)B = IKminus1

where Ω = diag

(w2

1

τ1

w2p

τp

) with Ωjj =

+infin if βjos = 0

wj∥∥βjos

∥∥minus1

2otherwise

(433)

That is BLDA = BOS diag(αminus1k (1minus α2

k)minus12

) where αk isin (0 1) is the kth leading

eigenvalue of

nminus1YgtX(XgtX + λΩ

)minus1XgtY

Proof The proof simply consists in applying the result of Hastie et al (1995) whichholds for quadratic penalties to the quadratic variational form of the group-Lasso

The proposition applies in particular to the Lasso-based OS approaches to sparseLDA (Grosenick et al 2008 Clemmensen et al 2011) for K = 2 that is for binaryclassification or more generally for a single discriminant direction Note however thatit leads to a slightly different decision rule if the decision threshold is chosen a prioriaccording to the Gaussian assumption for the features For more than one discriminantdirection the equivalence does not hold any more since the Lasso penalty does notresult in an equivalent quadratic penalty in the simple form tr

(BgtΩB

)

48

5 GLOSS Algorithm

The efficient approaches developed for the Lasso take advantage of the sparsity ofthe solution by solving a series of small linear systems whose sizes are incrementallyincreaseddecreased (Osborne et al 2000a) This approach was also pursued for thegroup-Lasso in its standard formulation (Roth and Fischer 2008) We adapt this algo-rithmic framework to the variational form (421) with J(B) = 12 YΘminusXB22

The algorithm belongs to the working set family of optimization methods (see Sec-tion 236) It starts from a sparse initial guess say B = 0 thus defining the set Aof ldquoactiverdquo variables currently identified as non-zero Then it iterates the three stepssummarized below

1 Update the coefficient matrix B within the current active set A where the opti-mization problem is smooth First the quadratic penalty is updated and then astandard penalized least squares fit is computed

2 Check the optimality conditions (432) with respect to the active variables Oneor more βj may be declared inactive when they vanish from the current solution

3 Check the optimality conditions (432) with respect to inactive variables If theyare satisfied the algorithm returns the current solution which is optimal If theyare not satisfied the variable corresponding to the greatest violation is added tothe active set

This mechanism is graphically represented in Figure 51 as a block diagram and for-malized in more details in Algorithm 1 Note that this formulation uses the equationsfrom the variational approach detailed in Section 431 If we want to use the alterna-tive variational approach from Appendix D then we have to replace Equations (421)(432a) and (432b) by (D1) (D10a) and (D10b) respectively

51 Regression Coefficients Updates

Step 1 of Algorithm 1 updates the coefficient matrix B within the current active set AThe quadratic variational form of the problem suggests a blockwise optimization strategyconsisting in solving (K minus 1) independent card(A)-dimensional problems instead of asingle (K minus 1) times card(A)-dimensional problem The interaction between the (K minus 1)problems is relegated to the common adaptive quadratic penalty Ω This decompositionis especially attractive as we then solve (K minus 1) similar systems(

XgtAXA + λΩ)βk = XgtAYθ0

k (51)

49

5 GLOSS Algorithm

initialize modelλ B

ACTIVE SETall j st||βj ||2 gt 0

p-OS PROBLEMB must hold1st optimality

condition

any variablefrom

ACTIVE SETmust go toINACTIVE

SET

take it out ofACTIVE SET

test 2nd op-timality con-dition on the

INACTIVE SET

any variablefrom

INACTIVE SETmust go toACTIVE

SET

take it out ofINACTIVE SET

compute Θ

and update B end

yes

no

yes

no

Figure 51 GLOSS block diagram

50

51 Regression Coefficients Updates

Algorithm 1 Adaptively Penalized Optimal Scoring

Input X Y B λInitialize A larr

j isin 1 p

∥∥βj∥∥2gt 0

Θ0 nminus1 Θ0gtYgtYΘ0 = IKminus1 convergence larr falserepeat

Step 1 solve (421) in B assuming A optimalrepeat

Ωlarr diag ΩA with ωj larr∥∥βj∥∥minus1

2

BA larr(XgtAXA + λΩ

)minus1XgtAYΘ0

until condition (432a) holds for all j isin A Step 2 identify inactivated variables

for j isin A ∥∥βj∥∥

2= 0 do

if optimality condition (432b) holds thenA larr AjGo back to Step 1

end ifend for Step 3 check greatest violation of optimality condition (432b) in set Aj = argmax

jisinA

∥∥partJpartβj∥∥2

if∥∥∥partJpartβj∥∥∥

2lt λ then

convergence larr true B is optimalelseA larr Acup j

end ifuntil convergence

(sV)larreigenanalyze(Θ0gtYgtXAB) that is

Θ0gtYgtXABVk = skVk k = 1 K minus 1

Θ larr Θ0V B larr BV αk larr nminus12s12k k = 1 K minus 1

Output Θ B α

51

5 GLOSS Algorithm

where XA denotes the columns of X indexed by A and βk and θ0k denote the kth

column of B and Θ0 respectively These linear systems only differ in the right-hand-sideterm so that a single Cholesky decomposition is necessary to solve all systems whereasa blockwise Newton-Raphson method based on the standard group-Lasso formulationwould result in different ldquopenaltiesrdquo Ω for each system

511 Cholesky decomposition

Dropping the subscripts and considering the (K minus 1) systems together (51) leads to

(XgtX + λΩ)B = XgtYΘ (52)

Defining the Cholesky decomposition as CgtC = (XgtX+λΩ) (52) is solved efficientlyas follows

CgtCB = XgtYΘ

CB = CgtXgtYΘ

B = CCgtXgtYΘ (53)

where the symbol ldquordquo is the matlab mldivide operator that solves efficiently linearsystems The GLOSS code implements (53)

512 Numerical Stability

The OS regression coefficients are obtained by (52) where the penalizer Ω is iterativelyupdated by (433) In this iterative process when a variable is about to leave the activeset the corresponding entry of Ω reaches important values whereby driving some OSregression coefficients to zero These large values may cause numerical stability problemsin the Cholesky decomposition of XgtX + λΩ This difficulty can be avoided using thefollowing equivalent expression

B = Ωminus12(Ωminus12XgtXΩminus12 + λI

)minus1Ωminus12XgtYΘ0 (54)

where the conditioning of Ωminus12XgtXΩminus12 + λI is always well-behaved provided X isappropriately normalized (recall that 0 le 1ωj le 1) This stabler expression demandsmore computation and is thus reserved to cases with large ωj values Our code isotherwise based on expression (52)

52 Score Matrix

The optimal score matrix Θ is made of the K minus 1 leading eigenvectors of

YgtX(XgtX + Ω

)minus1XgtY This eigen-analysis is actually solved in the form

ΘgtYgtX(XgtX + Ω

)minus1XgtYΘ (see Section 421 and Appendix B) The latter eigen-

vector decomposition does not require the costly computation of(XgtX + Ω

)minus1that

52

53 Optimality Conditions

involves the inversion of an n times n matrix Let Θ0 be an arbitrary K times (K minus 1) ma-

trix whose range includes the Kminus1 leading eigenvectors of YgtX(XgtX + Ω

)minus1XgtY 1

Then solving the Kminus1 systems (53) provides the value of B0 = (XgtX+λΩ)minus1XgtYΘ0This B0 matrix can be identified in the expression to eigenanalyze as

Θ0gtYgtX(XgtX + Ω

)minus1XgtYΘ0 = Θ0gtYgtXB0

Thus the solution to penalized OS problem can be computed trough the singular

value decomposition of the (K minus 1)times (K minus 1) matrix Θ0gtYgtXB0 = VΛVgt Defining

Θ = Θ0V we have ΘgtYgtX(XgtX + Ω

)minus1XgtYΘ = Λ and when Θ0 is chosen such

that nminus1 Θ0gtYgtYΘ0 = IKminus1 we also have that nminus1 ΘgtYgtYΘ = IKminus1 holding theconstraints of the p-OS problem Hence assuming that the diagonal elements of Λ aresorted in decreasing order θk is an optimal solution to the p-OS problem Finally onceΘ has been computed the corresponding optimal regression coefficients B satisfying(52) are simply recovered using the mapping from Θ0 to Θ that is B = B0VAppendix E details why the computational trick described here for quadratic penaltiescan be applied to the group-Lasso for which Ω is defined by a variational formulation

53 Optimality Conditions

GLOSS uses an active set optimization technique to obtain the optimal values of thecoefficient matrix B and the score matrix Θ To be a solution the coefficient matrix mustobey Lemmas 43 and 44 Optimality conditions (432a) and (432b) can be deducedfrom those lemmas Both expressions require the computation of the gradient of theobjective function

1

2YΘminusXB22 + λ

psumj=1

wj∥∥βj∥∥

2(55)

Let J(B) be the data-fitting term 12 YΘminusXB22 Its gradient with respect to the jth

row of B βj is the (K minus 1)-dimensional vector

partJ(B)

partβj= xj

gt(XBminusYΘ)

where xj is the column j of X Hence the first optimality condition (432a) can becomputed for every variable j as

xjgt

(XBminusYΘ) + λwjβj∥∥βj∥∥

2

1 As X is centered 1K belongs to the null space of YgtX(XgtX + Ω

)minus1XgtY It is thus suffi-

cient to choose Θ0 orthogonal to 1K to ensure that its range spans the leading eigenvectors of

YgtX(XgtX + Ω

)minus1XgtY In practice to comply with this desideratum and conditions (35b) and

(35c) we set Θ0 =(YgtY

)minus12U where U is a Ktimes (Kminus1) matrix whose columns are orthonormal

vectors orthogonal to 1K

53

5 GLOSS Algorithm

The second optimality condition (432b) can be computed for every variable j as∥∥∥xjgt (XBminusYΘ)∥∥∥

2le λwj

54 Active and Inactive Sets

The feature selection mechanism embedded in GLOSS selects the variables that pro-vide the greatest decrease in the objective function This is accomplished by means ofthe optimality conditions (432a) and (432b) Let A be the active set with the variablesthat have already been considered relevant A variable j can be considered for inclusioninto the active set if it violates the second optimality condition We proceed one variableat a time by choosing the one that is expected to produce the greatest decrease in theobjective function

j = maxj

∥∥∥xjgt (XBminusYΘ)∥∥∥

2minus λwj 0

The exclusion of a variable belonging to the active set A is considered if the norm∥∥βj∥∥

2

is small and if after setting βj to zero the following optimality condition holds∥∥∥xjgt (XBminusYΘ)∥∥∥

2le λwj

The process continue until no variable in the active set violates the first optimalitycondition and no variable in the inactive set violates the second optimality condition

55 Penalty Parameter

The penalty parameter can be specified by the user in which case GLOSS solves theproblem with this value of λ The other strategy is to compute the solution path forseveral values of λ GLOSS looks then for the maximum value of the penalty parameterλmax such that B 6= 0 and solve the p-OS problem for decreasing values of λ until aprescribed number of features are declared active

The maximum value of the penalty parameter λmax corresponding to a null B matrixis obtained by computing the optimality condition (432b) at B = 0

λmax = maxjisin1p

1

wj

∥∥∥xjgtYΘ0∥∥∥

2

The algorithm then computes a series of solutions along the regularization path definedby a series of penalties λ1 = λmax gt middot middot middot gt λt gt middot middot middot gt λT = λmin ge 0 by regularlydecreasing the penalty λt+1 = λt2 and using a warm-start strategy where the feasibleinitial guess for B(λt+1) is initialized with B(λt) The final penalty parameter λmin

is specified in the optimization process when the maximum number of desired activevariables is attained (by default the minimum of n and p)

54

56 Options and Variants

56 Options and Variants

561 Scaling Variables

As most penalization schemes GLOSS is sensitive to the scaling of variables Itthus makes sense to normalize them before applying the algorithm or equivalently toaccommodate weights in the penalty This option is available in the algorithm

562 Sparse Variant

This version replaces some matlab commands used in the standard version of GLOSSby the sparse equivalents commands In addition some mathematical structures areadapted for sparse computation

563 Diagonal Variant

We motivated the group-Lasso penalty by sparsity requisites but robustness consid-erations could also drive its usage since LDA is known to be unstable when the numberof examples is small compared to the number of variables In this context LDA hasbeen experimentally observed to benefit from unrealistic assumptions on the form of theestimated within-class covariance matrix Indeed the diagonal approximation that ig-nores correlations between genes may lead to better classification in microarray analysisBickel and Levina (2004) shown that this crude approximation provides a classifier withbest worst-case performances than the LDA decision rule in small sample size regimeseven if variables are correlated

The equivalence proof between penalized OS and penalized LDA (Hastie et al 1995)reveals that quadratic penalties in the OS problem are equivalent to penalties on thewithin-class covariance matrix in the LDA formulation This proof suggests a slightvariant of penalized OS corresponding to penalized LDA with diagonal within-classcovariance matrix where the least square problems

minBisinRptimesKminus1

YΘminusXB2F = minBisinRptimesKminus1

tr(ΘgtYgtYΘminus 2ΘgtYgtXB + nBgtΣTB

)are replaced by

minBisinRptimesKminus1

tr(ΘgtYgtYΘminus 2ΘgtYgtXB + nBgt(ΣB + diag (ΣW))B

)Note that this variant only requires diag(ΣW)+ΣB +nminus1Ω to be positive definite whichis a weaker requirement than ΣT + nminus1Ω positive definite

564 Elastic net and Structured Variant

For some learning problems the structure of correlations between variables is partiallyknown Hastie et al (1995) applied this idea to the field of handwritten digits recognition

55

5 GLOSS Algorithm

7 8 9

4 5 6

1 2 3

- ΩL =

3 minus1 0 minus1 minus1 0 0 0 0minus1 5 minus1 minus1 minus1 minus1 0 0 00 minus1 3 0 minus1 minus1 0 0 0minus1 minus1 0 5 minus1 0 minus1 minus1 0minus1 minus1 minus1 minus1 8 minus1 minus1 minus1 minus10 minus1 minus1 0 minus1 5 0 minus1 minus10 0 0 minus1 minus1 0 3 minus1 00 0 0 minus1 minus1 minus1 minus1 5 minus10 0 0 0 minus1 minus1 0 minus1 3

Figure 52 Graph and Laplacian matrix for a 3times 3 image

for their penalized discriminant analysis model to constrain the discriminant directionsto be spatially smooth

When an image is represented as a vector of pixels it is reasonable to assume posi-tive correlations between the variables corresponding to neighboring pixels Figure 52represents the neighborhood graph of pixels in an 3 times 3 image with the correspondingLaplacian matrix The Laplacian matrix ΩL is semi-positive definite and the penaltyβgtΩLβ favors among vectors of identical L2 norms the ones having similar coeffi-cients in the neighborhoods of the graph For example this penalty is 9 for the vector(1 1 0 1 1 0 0 0 0)gt which is the indicator of the neighbors of pixel 1 and it is 17 forthe vector (minus1 1 0 1 1 0 0 0 0)gt with sign mismatch between pixel 1 and its neighbor-hood

This smoothness penalty can be imposed jointly with the group-Lasso From thecomputational point of view GLOSS hardly needs to be modified The smoothnesspenalty has just to be added to group-Lasso penalty As the new penalty is convex andquadratic (thus smooth) there is no additional burden in the overall algorithm Thereis however an additional hyperparameter to be tuned

56

6 Experimental Results

This section presents some comparison results between the Group Lasso Optimal Scor-ing Solver algorithm and two other classifiers at the state of the art proposed to performsparse LDA Those algorithms are Penalized LDA (PLDA) (Witten and Tibshirani 2011)which applies a Lasso penalty into a Fisherrsquos LDA framework and the Sparse LinearDiscriminant Analysis (SLDA) (Clemmensen et al 2011) which applies an Elastic netpenalty to the OS problem With the aim of testing the parsimony capacities the latteralgorithm was tested without any quadratic penalty that is with a Lasso penalty Theimplementation of PLDA and SLDA is available from the authorsrsquo website PLDA is anR implementation and SLDA is coded in matlab All the experiments used the sametraining validation and test sets Note that they differ significantly from the ones ofWitten and Tibshirani (2011) in Simulation 4 for which there was a typo in their paper

61 Normalization

With shrunken estimates the scaling of features has important outcomes For thelinear discriminants considered here the two most common normalization strategiesconsist in setting either the diagonal of the total covariance matrix ΣT to ones orthe diagonal of the within-class covariance matrix ΣW to ones These options can beimplemented either by scaling the observations accordingly prior to the analysis or byproviding penalties with weights The latter option is implemented in our matlabpackage 1

62 Decision Thresholds

The derivations of LDA based on the analysis of variance or on the regression ofclass indicators do not rely on the normality of the class-conditional distribution forthe observations Hence their applicability extends beyond the realm of Gaussian dataBased on this observation Friedman et al (2009 chapter 4) suggest to investigate otherdecision thresholds than the ones stemming from the Gaussian mixture assumptionIn particular they propose to select the decision thresholds that empirically minimizetraining error This option was tested using validation sets or cross-validation

1The GLOSS matlab code can be found in the software section of wwwhdsutcfr~grandval

57

6 Experimental Results

63 Simulated Data

We first compare the three techniques in the simulation study of Witten and Tibshirani(2011) which considers four setups with 1200 examples equally distributed betweenclasses They are split in a training set of size n = 100 a validation set of size 100 anda test set of size 1000 We are in the small sample regime with p = 500 variables out ofwhich 100 differ between classes Independent variables are generated for all simulationsexcept for Simulation 2 where they are slightly correlated In Simulations 2 and 3 classesare optimally separated by a single projection of the original variables while the twoother scenarios require three discriminant directions The Bayesrsquo error was estimatedto be respectively 17 67 73 and 300 The exact definition of every setup asprovided in Witten and Tibshirani (2011) is

Simulation1 Mean shift with independent features There are four classes If samplei is in class k then xi sim N(microk I) where micro1j = 07 times 1(1lejle25) micro2j = 07 times 1(26lejle50)micro3j = 07times 1(51lejle75) micro4j = 07times 1(76lejle100)

Simulation2 Mean shift with dependent features There are two classes If samplei is in class 1 then xi sim N(0Σ) and if i is in class 2 then xi sim N(microΣ) withmicroj = 06 times 1(jle200) The covariance structure is block diagonal with 5 blocks each of

dimension 100times 100 The blocks have (j jprime) element 06|jminusjprime| This covariance structure

is intended to mimic gene expression data correlation

Simulation3 One-dimensional mean shift with independent features There are fourclasses and the features are independent If sample i is in class k then Xij sim N(kminus1

3 1)if j le 100 and Xij sim N(0 1) otherwise

Simulation4 Mean shift with independent features and no linear ordering Thereare four classes If sample i is in class k then xi sim N(microk I) With mean vectorsdefined as follows micro1j sim N(0 032) for j le 25 and micro1j = 0 otherwise micro2j sim N(0 032)for 26 le j le 50 and micro2j = 0 otherwise micro3j sim N(0 032) for 51 le j le 75 and micro3j = 0otherwise micro4j sim N(0 032) for 76 le j le 100 and micro4j = 0 otherwise

Note that this protocol is detrimental to GLOSS as each relevant variable only affectsa single class mean out of K The setup is favorable to PLDA in the sense that mostwithin-class covariance matrix are diagonal We thus also tested the diagonal GLOSSvariant discussed in Section 563

The results are summarized in Table 61 Overall the best predictions are performedby PLDA and GLOS-D that both benefit of the knowledge of the true within-classcovariance structure Then among SLDA and GLOSS that both ignore this structureour proposal has a clear edge The error rates are far away from the Bayesrsquo error ratesbut the sample size is small with regard to the number of relevant variables Regardingsparsity the clear overall winner is GLOSS followed far away by SLDA which is the only

58

63 Simulated Data

Table 61 Experimental results for simulated data averages with standard deviationscomputed over 25 repetitions of the test error rate the number of selectedvariables and the number of discriminant directions selected on the validationset

Err () Var Dir

Sim 1 K = 4 mean shift ind features

PLDA 126 (01) 4117 (37) 30 (00)SLDA 319 (01) 2280 (02) 30 (00)GLOSS 199 (01) 1064 (13) 30 (00)GLOSS-D 112 (01) 2511 (41) 30 (00)

Sim 2 K = 2 mean shift dependent features

PLDA 90 (04) 3376 (57) 10 (00)SLDA 193 (01) 990 (00) 10 (00)GLOSS 154 (01) 398 (08) 10 (00)GLOSS-D 90 (00) 2035 (40) 10 (00)

Sim 3 K = 4 1D mean shift ind features

PLDA 138 (06) 1615 (37) 10 (00)SLDA 578 (02) 1526 (20) 19 (00)GLOSS 312 (01) 1238 (18) 10 (00)GLOSS-D 185 (01) 3575 (28) 10 (00)

Sim 4 K = 4 mean shift ind features

PLDA 603 (01) 3360 (58) 30 (00)SLDA 659 (01) 2088 (16) 27 (00)GLOSS 607 (02) 743 (22) 27 (00)GLOSS-D 588 (01) 1627 (49) 29 (00)

59

6 Experimental Results

0 10 20 30 40 50 60 70 8020

30

40

50

60

70

80

90

100TPR Vs FPR

gloss

glossd

slda

plda

Simulation1

Simulation2

Simulation3

Simulation4

Figure 61 TPR versus FPR (in ) for all algorithms and simulations

Table 62 Average TPR and FPR (in ) computed over 25 repetitions

Simulation1 Simulation2 Simulation3 Simulation4TPR FPR TPR FPR TPR FPR TPR FPR

PLDA 990 782 969 603 980 159 743 656

SLDA 739 385 338 163 416 278 507 395

GLOSS 641 106 300 46 511 182 260 121

GLOSS-D 935 394 921 281 956 655 429 299

method that do not succeed in uncovering a low-dimensional representation in Simulation3 The adequacy of the selected features was assessed by the True Positive Rate (TPR)and the False Positive Rate (FPR) The TPR is defined as the ratio of selected variablesthat are actually relevant Similarly the FPR is the ratio of selected variables that areactually non relevant The best algorithm would be the one that selects all the relevantvariables and rejects all the others That is TPR = 1 and FPR = 0 simultaneouslyPLDA has the best TPR but a terrible FPR except in simulation 3 where it dominatesall the other methods GLOSS has by far the best FPR with overall TPR slightly belowSLDA Results are displayed in Figure 61 (both in percentages) (or in Table 62 )

64 Gene Expression Data

We now compare GLOSS to PLDA and SLDA on three genomic datasets TheNakayama2 dataset contains 105 examples of 22283 gene expressions for categorizing10 soft tissue tumors It was reduced to the 86 examples belonging to the 5 dominantcategories (Witten and Tibshirani 2011) The Ramaswamy3 dataset contains 198 exam-

2httpwwwbroadinstituteorgcancersoftwaregenepatterndatasets3httpwwwncbinlmnihgovsitesGDSbrowseracc=GDS2736

60

64 Gene Expression Data

Table 63 Experimental results for gene expression data averages over 10 trainingtestsets splits with standard deviations of the test error rates and the numberof selected variables

Err () Var

Nakayama n = 86 p = 22 283 K = 5

PLDA 2095 (13) 104787 (21163)SLDA 2571 (17) 2525 (31)GLOSS 2048 (14) 1290 (186)

Ramaswamy n = 198 p = 16 063 K = 14

PLDA 3836 (60) 148735 (7203)SLDA mdash mdashGLOSS 2061 (69) 3724 (1221)

Sun n = 180 p = 54 613 K = 4

PLDA 3378 (59) 216348 (74432)SLDA 3622 (65) 3844 (165)GLOSS 3177 (45) 930 (936)

ples of 16063 gene expressions for categorizing 14 classes of cancer Finally the Sun4

dataset contains 180 examples of 54613 gene expressions for categorizing 4 classes oftumors

Each dataset was split into a training set and a test set with respectively 75 and25 of the examples Parameter tuning is performed by 10-fold cross-validation and thetest performances are then evaluated The process is repeated 10 times with randomchoices of training and test set split

Test error rates and the number of selected variables are presented in Table 63 Theresults for the PLDA algorithm are extracted from Witten and Tibshirani (2011) Thethree methods have comparable prediction performances on the Nakayama and Sundatasets but GLOSS performs better on the Ramaswamy data where the SparseLDApackage failed to return a solution due to numerical problems in the LARS-EN imple-mentation Regarding the number of selected variables GLOSS is again much sparserthan its competitors

Finally Figure 62 displays the projection of the observations for the Nakayama andSun datasets in the first canonical planes estimated by GLOSS and SLDA For theNakayama dataset groups 1 and 2 are well-separated from the other ones in both rep-resentations but GLOSS is more discriminant in the meta-cluster gathering groups 3to 5 For the Sun dataset SLDA suffers from a high colinearity of its first canonicalvariables that renders the second one almost non-informative As a result group 1 isbetter separated in the first canonical plane with GLOSS

4httpwwwncbinlmnihgovsitesGDSbrowseracc=GDS1962

61

6 Experimental Results

GLOSS SLDA

Naka

yam

a

minus25000 minus20000 minus15000 minus10000 minus5000 0 5000

minus25

minus2

minus15

minus1

minus05

0

05

1

x 104

1) Synovial sarcoma

2) Myxoid liposarcoma

3) Dedifferentiated liposarcoma

4) Myxofibrosarcoma

5) Malignant fibrous histiocytoma

2n

dd

iscr

imin

ant

minus2000 0 2000 4000 6000 8000 10000 12000 14000

2000

4000

6000

8000

10000

12000

14000

16000

1) Synovial sarcoma

2) Myxoid liposarcoma

3) Dedifferentiated liposarcoma

4) Myxofibrosarcoma

5) Malignant fibrous histiocytoma

Su

n

minus1 minus05 0 05 1 15 2

x 104

05

1

15

2

25

3

35

x 104

1) NonTumor

2) Astrocytomas

3) Glioblastomas

4) Oligodendrogliomas

1st discriminant

2n

dd

iscr

imin

ant

minus2 minus15 minus1 minus05 0

x 104

0

05

1

15

2

x 104

1) NonTumor

2) Astrocytomas

3) Glioblastomas

4) Oligodendrogliomas

1st discriminant

Figure 62 2D-representations of Nakayama and Sun datasets based on the two first dis-criminant vectors provided by GLOSS and SLDA The big squares representclass means

62

65 Correlated Data

Figure 63 USPS digits ldquo1rdquo and ldquo0rdquo

65 Correlated Data

When the features are known to be highly correlated the discrimination algorithmcan be improved by using this information in the optimization problem The structuredvariant of GLOSS presented in Section 564 S-GLOSS from now on was conceived tointroduce easily this prior knowledge

The experiments described in this section are intended to illustrate the effect of com-bining the group-Lasso sparsity inducing penalty with a quadratic penalty used as asurrogate of the unknown within-class variance matrix This preliminary experimentdoes not include comparisons with other algorithms More comprehensive experimentalresults have been left for future works

For this illustration we have used a subset of the USPS handwritten digit datasetmade of of 16times 16 pixels representing digits from 0 to 9 For our purpose we comparethe discriminant direction that separates digits ldquo1rdquo and ldquo0rdquo computed with GLOSS andS-GLOSS The mean image of every digit is showed in Figure 63

As in Section 564 we have represented the pixel proximity relationships from Figure52 into a penalty matrix ΩL but this time in a 256-nodes graph Introducing this new256times 256 Laplacian penalty matrix ΩL in the GLOSS algorithm is straightforward

The effect of this penalty is fairly evident in Figure 64 where the discriminant vectorβ resulting of a non-penalized execution of GLOSS is compared with the β resultingfrom a Laplace penalized execution of S-GLOSS (without group-Lasso penalty) Weperfectly distinguish the center of the digit ldquo0rdquo in the discriminant direction obtainedby S-GLOSS that is probably the most important element to discriminate both digits

Figure 65 display the discriminant direction β obtained by GLOSS and S-GLOSSfor a non-zero group-Lasso penalty with an identical penalization parameter (λ = 03)Even if both solutions are sparse the discriminant vector from S-GLOSS keeps connectedpixels that allow to detect strokes and will probably provide better prediction results

63

6 Experimental Results

β for GLOSS β for S-GLOSS

Figure 64 Discriminant direction between digits ldquo1rdquo and ldquo0rdquo

β for GLOSS and λ = 03 β for S-GLOSS and λ = 03

Figure 65 Sparse discriminant direction between digits ldquo1rdquo and ldquo0rdquo

64

Discussion

GLOSS is an efficient algorithm that performs sparse LDA based on the regressionof class indicators Our proposal is equivalent to a penalized LDA problem This isup to our knowledge the first approach that enjoys this property in the multi-classsetting This relationship is also amenable to accommodate interesting constraints onthe equivalent penalized LDA problem such as imposing a diagonal structure of thewithin-class covariance matrix

Computationally GLOSS is based on an efficient active set strategy that is amenableto the processing of problems with a large number of variables The inner optimizationproblem decouples the p times (K minus 1)-dimensional problem into (K minus 1) independent p-dimensional problems The interaction between the (K minus 1) problems is relegated tothe computation of the common adaptive quadratic penalty The algorithm presentedhere is highly efficient in medium to high dimensional setups which makes it a goodcandidate for the analysis of gene expression data

The experimental results confirm the relevance of the approach which behaves wellcompared to its competitors either regarding its prediction abilities or its interpretabil-ity (sparsity) Generally compared to the competing approaches GLOSS providesextremely parsimonious discriminants without compromising prediction performancesEmploying the same features in all discriminant directions enables to generate modelsthat are globally extremely parsimonious with good prediction abilities The resultingsparse discriminant directions also allow for visual inspection of data from the low-dimensional representations that can be produced

The approach has many potential extensions that have not yet been implemented Afirst line of development is to consider a broader class of penalties For example plainquadratic penalties can also be added to the group-penalty to encode priors about thewithin-class covariance structure in the spirit of the Penalized Discriminant Analysis ofHastie et al (1995) Also besides the group-Lasso our framework can be customized toany penalty that is uniformly spread within groups and many composite or hierarchicalpenalties that have been proposed for structured data meet this condition

65

Part III

Sparse Clustering Analysis

67

Abstract

Clustering can be defined as a grouping task of samples such that all the elementsbelonging to one cluster are more ldquosimilarrdquo to each other than to the objects belongingto the other groups There are similarity measures for any data structure databaserecords or even multimedia objects (audio video) The similarity concept is closelyrelated to the idea of distance which is a specific dissimilarity

Model-based clustering aims to describe an heterogeneous population with a proba-bilistic model that represent each group with a its own distribution Here the distribu-tions will be Gaussians and the different populations are identified with different meansand common covariance matrix

As in the supervised framework the traditional clustering techniques perform worsewhen the number of irrelevant features increases In this part we develop Mix-GLOSSwhich builds on the supervised GLOSS algorithm to address unsupervised problemsresulting in a clustering mechanism with embedded feature selection

Chapter 7 reviews different techniques of inducing sparsity in model-based clusteringalgorithms The theory that motivates our original formulation of the EM algorithm isdeveloped in Chapter 8 followed by the description of the algorithm in Chapter 9 Its per-formance is assessed and compared to other model-based sparse clustering mechanismsat the state of the art in Chapter 10

69

7 Feature Selection in Mixture Models

71 Mixture Models

One of the most popular clustering algorithm is K-means that aims to partition nobservations into K clusters Each observation is assigned to the cluster with the nearestmean (MacQueen 1967) A generalization of K-means can be made through probabilisticmodels which represents K subpopulations by a mixture of distributions Since their firstuse by Newcomb (1886) for the detection of outlier points and 8 years later by Pearson(1894) to identify two separate populations of crabs finite mixtures of distributions havebeen employed to model a wide variety of random phenomena These models assumethat measurements are taken from a set of individuals each of which belongs to oneout of a number of different classes while any individualrsquos particular class is unknownMixture models can thus address the heterogeneity of a population and are especiallywell suited to the problem of clustering

711 Model

We assume that the observed data X = (xgt1 xgtn )gt have been drawn identically

from K different subpopulations in the domain Rp The generative distribution is afinite mixture model that is the data are assumed to be generated from a compoundeddistribution whose density can be expressed as

f(xi) =

Ksumk=1

πkfk(xi) foralli isin 1 n

where K is the number of components fk are the densities of the components and πk arethe mixture proportions (πk isin]0 1[ forallk and

sumk πk = 1) Mixture models transcribe that

given the proportions πk and the distributions fk for each class the data is generatedaccording to the following mechanism

bull y each individual is allotted to a class according to a multinomial distributionwith parameters π1 πK

bull x each xi is assumed to arise from a random vector with probability densityfunction fk

In addition it is usually assumed that the component densities fk belong to a para-metric family of densities φ(middotθk) The density of the mixture can then be written as

f(xiθ) =

Ksumk=1

πkφ(xiθk) foralli isin 1 n

71

7 Feature Selection in Mixture Models

where θ = (π1 πK θ1 θK) is the parameter of the model

712 Parameter Estimation The EM Algorithm

For the estimation of parameters of the mixture model Pearson (1894) used themethod of moments to estimate the five parameters (micro1 micro2 σ

21 σ

22 π) of a univariate

Gaussian mixture model with two components That method required him to solvepolynomial equations of degree nine There are also graphic methods maximum likeli-hood methods and Bayesian approaches

The most widely used process to estimate the parameters is by maximizing the log-likelihood using the EM algorithm It is typically used to maximize the likelihood formodels with latent variables for which no analytical solution is available (Dempsteret al 1977)

The EM algorithm iterates two steps called the expectation step (E) and the max-imization step (M) Each expectation step involves the computation of the likelihoodexpectation with respect to the hidden variables while each maximization step esti-mates the parameters by maximizing the E-step expected likelihood

Under mild regularity assumptions this mechanism converges to a local maximumof the likelihood However the type of problems targeted is typically characterized bythe existence of several local maxima and global convergence cannot be guaranteed Inpractice the obtained solution depends on the initialization of the algorithm

Maximum Likelihood Definitions

The likelihood is is commonly expressed in its logarithmic version

L(θ X) = log

(nprodi=1

f(xiθ)

)

=nsumi=1

log

(Ksumk=1

πkfk(xiθk)

) (71)

where n in the number of samples K is the number of components of the mixture (ornumber of clusters) and πk are the mixture proportions

To obtain maximum likelihood estimates the EM algorithm works with the jointdistribution of the observations x and the unknown latent variables y which indicatethe cluster membership of every sample The pair z = (xy) is called the completedata The log-likelihood of the complete data is called the complete log-likelihood or

72

71 Mixture Models

classification log-likelihood

LC(θ XY) = log

(nprodi=1

f(xiyiθ)

)

=

nsumi=1

log

(Ksumk=1

yikπkfk(xiθk)

)

=nsumi=1

Ksumk=1

yik log (πkfk(xiθk)) (72)

The yik are the binary entries of the indicator matrix Y with yik = 1 if the observation ibelongs to the cluster k and yik = 0 otherwise

Defining the soft membership tik(θ) as

tik(θ) = p(Yik = 1|xiθ) (73)

=πkfk(xiθk)

f(xiθ) (74)

To lighten notations tik(θ) will be denoted tik when parameter θ is clear from contextThe regular (71) and complete (72) log-likelihood are related as follows

LC(θ XY) =sumik

yik log (πkfk(xiθk))

=sumik

yik log (tikf(xiθ))

=sumik

yik log tik +sumik

yik log f(xiθ)

=sumik

yik log tik +nsumi=1

log f(xiθ)

=sumik

yik log tik + L(θ X) (75)

wheresum

ik yik log tik can be reformulated as

sumik

yik log tik =nsumi=1

Ksumk=1

yik log(p(Yik = 1|xiθ))

=

nsumi=1

log(p(Yik = 1|xiθ))

= log (p(Y |Xθ))

As a result the relationship (75) can be rewritten as

L(θ X) = LC(θ Z)minus log (p(Y |Xθ)) (76)

73

7 Feature Selection in Mixture Models

Likelihood Maximization

The complete log-likelihood cannot be assessed because the variables yik are unknownHowever it is possible to estimate the value of log-likelihood taking expectations condi-tionally to a current value of θ on (76)

L(θ X) = EYsimp(middot|Xθ(t)) [LC(θ X Y ))]︸ ︷︷ ︸Q(θθ(t))

+EYsimp(middot|Xθ(t)) [minus log p(Y |Xθ)]︸ ︷︷ ︸H(θθ(t))

In this expression H(θθ(t)) is the entropy and Q(θθ(t)) is the conditional expecta-tion of the complete log-likelihood Let us define an increment of the log-likelihood as∆L = L(θ(t+1) X)minus L(θ(t) X) Then θ(t+1) = argmaxθQ(θθ(t)) also increases thelog-likelihood

∆L = (Q(θ(t+1)θ(t))minusQ(θ(t)θ(t)))︸ ︷︷ ︸ge0 by definition of iteration t+1

minus (H(θ(t+1)θ(t))minusH(θ(t)θ(t)))︸ ︷︷ ︸le0 by Jensen Inequality

Therefore it is possible to maximize the likelihood by optimizing Q(θθ(t)) The rela-tionship between Q(θθprime) and L(θ X) is developed in deeper detail in Appendix F toshow how the value of L(θ X) can be recovered from Q(θθ(t))

For the mixture model problem Q(θθprime) is

Q(θθprime) = EYsimp(Y |Xθprime) [LC(θ X Y ))]

=sumik

p(Yik = 1|xiθprime) log(πkfk(xiθk))

=nsumi=1

Ksumk=1

tik(θprime) log (πkfk(xiθk)) (77)

Q(θθprime) due to its similitude to the expression of the complete likelihood (72) is alsoknown as the weighted likelihood In (77) the weights tik(θ

prime) are the posterior proba-bilities of cluster memberships

Hence the EM algorithm sketched above results in

bull Initialization (not iterated) choice of the initial parameter θ(0)

bull E-Step Evaluation of Q(θθ(t)) using tik(θ(t)) (74) in (77)

bull M-Step Calculation of θ(t+1) = argmaxθQ(θθ(t))


Gaussian Model

In the particular case of a Gaussian mixture model with a common covariance matrix Σ and distinct mean vectors µ_k, the mixture density is
\[
f(x_i;\theta) = \sum_{k=1}^{K} \pi_k f_k(x_i;\theta_k) = \sum_{k=1}^{K} \pi_k \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}} \exp\left\{-\frac{1}{2}(x_i-\mu_k)^\top\Sigma^{-1}(x_i-\mu_k)\right\}
\]

At the E-step, the posterior probabilities t_ik are computed as in (7.4) with the current θ^(t) parameters; the M-step then maximizes Q(θ, θ^(t)) (7.7), whose form is as follows:

\[
\begin{aligned}
Q(\theta,\theta^{(t)}) &= \sum_{i,k} t_{ik}\log(\pi_k) - \sum_{i,k} t_{ik}\log\left((2\pi)^{p/2}|\Sigma|^{1/2}\right) - \frac{1}{2}\sum_{i,k} t_{ik}(x_i-\mu_k)^\top\Sigma^{-1}(x_i-\mu_k)\\
&= \sum_{k} t_k\log(\pi_k) - \underbrace{\frac{np}{2}\log(2\pi)}_{\text{constant term}} - \frac{n}{2}\log(|\Sigma|) - \frac{1}{2}\sum_{i,k} t_{ik}(x_i-\mu_k)^\top\Sigma^{-1}(x_i-\mu_k)\\
&\equiv \sum_{k} t_k\log(\pi_k) - \frac{n}{2}\log(|\Sigma|) - \sum_{i,k} t_{ik}\left(\frac{1}{2}(x_i-\mu_k)^\top\Sigma^{-1}(x_i-\mu_k)\right) \qquad (7.8)
\end{aligned}
\]

where
\[
t_k = \sum_{i=1}^{n} t_{ik} \tag{7.9}
\]

The M-step, which maximizes this expression with respect to θ, applies the following updates defining θ^(t+1):
\[
\pi_k^{(t+1)} = \frac{t_k}{n} \tag{7.10}
\]
\[
\mu_k^{(t+1)} = \frac{\sum_i t_{ik}\, x_i}{t_k} \tag{7.11}
\]
\[
\Sigma^{(t+1)} = \frac{1}{n}\sum_k W_k \tag{7.12}
\]
\[
\text{with } W_k = \sum_i t_{ik}(x_i-\mu_k)(x_i-\mu_k)^\top \tag{7.13}
\]

The derivations are detailed in Appendix G
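To make these updates concrete, here is a minimal NumPy sketch (not the thesis implementation) of the EM iterations for a Gaussian mixture with a common covariance matrix, alternating the E-step (7.4) with the M-step updates (7.10)–(7.13); function and variable names are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_common_covariance(X, K, n_iter=100, seed=0):
    """EM for a Gaussian mixture with a common covariance matrix (sketch)."""
    n, p = X.shape
    rng = np.random.default_rng(seed)
    T = np.eye(K)[rng.integers(0, K, size=n)]        # random initial soft labels t_ik
    for _ in range(n_iter):
        # M-step: updates (7.10)-(7.13)
        tk = T.sum(axis=0)                            # t_k = sum_i t_ik
        pi = tk / n                                   # mixture proportions
        mu = (T.T @ X) / tk[:, None]                  # cluster means
        Sigma = np.zeros((p, p))
        for k in range(K):
            Xc = X - mu[k]
            Sigma += (T[:, k, None] * Xc).T @ Xc      # W_k
        Sigma /= n
        # E-step: posterior probabilities (7.4)
        dens = np.column_stack([
            pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma)
            for k in range(K)])
        T = dens / dens.sum(axis=1, keepdims=True)
    return pi, mu, Sigma, T
```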

7.2 Feature Selection in Model-Based Clustering

When common covariance matrices are assumed, Gaussian mixtures are related to LDA, with partitions defined by linear decision rules. When every cluster has its own


covariance matrix Σ_k, Gaussian mixtures are associated with quadratic discriminant analysis (QDA), with quadratic boundaries.

In the high-dimensional, low-sample setting, numerical issues appear in the estimation of the covariance matrix. To avoid those singularities, regularization may be applied. A regularized trade-off between LDA and QDA (RDA) was proposed by Friedman (1989). Bensmail and Celeux (1996) extended this algorithm by rewriting the covariance matrix in terms of its eigenvalue decomposition Σ_k = λ_k D_k A_k D_kᵀ (Banfield and Raftery 1993). These regularization schemes address singularity and stability issues, but they do not induce parsimonious models.

In this chapter, we review some techniques to induce sparsity with model-based clustering algorithms. This sparsity refers to the rule that assigns examples to classes: clustering is still performed in the original p-dimensional space, but the decision rule can be expressed with only a few coordinates of this high-dimensional space.

7.2.1 Based on Penalized Likelihood

Penalized log-likelihood maximization is a popular estimation technique for mixture models. It is typically achieved by the EM algorithm, using mixture models for which the allocation of examples is expressed as a simple function of the input features. For example, for Gaussian mixtures with a common covariance matrix, the log-ratio of posterior probabilities is a linear function of x:

\[
\log\left(\frac{p(Y_k=1\mid x)}{p(Y_\ell=1\mid x)}\right) = x^\top\Sigma^{-1}(\mu_k-\mu_\ell) - \frac{1}{2}(\mu_k+\mu_\ell)^\top\Sigma^{-1}(\mu_k-\mu_\ell) + \log\frac{\pi_k}{\pi_\ell}
\]

In this model, a simple way of introducing sparsity in the discriminant vectors Σ⁻¹(µ_k − µ_ℓ) is to constrain Σ to be diagonal and to favor sparse means µ_k. Indeed, for Gaussian mixtures with a common diagonal covariance matrix, if all means have the same value on dimension j, then variable j is useless for class allocation and can be discarded. The means can be penalized by the L1 norm

\[
\lambda\sum_{k=1}^{K}\sum_{j=1}^{p} |\mu_{kj}|
\]

as proposed by Pan et al. (2006) and Pan and Shen (2007). Zhou et al. (2009) consider more complex penalties on full covariance matrices:

\[
\lambda_1\sum_{k=1}^{K}\sum_{j=1}^{p} |\mu_{kj}| + \lambda_2\sum_{k=1}^{K}\sum_{j=1}^{p}\sum_{m=1}^{p} |(\Sigma_k^{-1})_{jm}|
\]

In their algorithm, they make use of the graphical Lasso to estimate the covariances. Even if their formulation induces sparsity on the parameters, their combination of L1 penalties does not directly target decision rules based on few variables, and thus does not guarantee parsimonious models.


Guo et al (2010) propose a variation with a Pairwise Fusion Penalty (PFP)

\[
\lambda\sum_{j=1}^{p}\ \sum_{1\leq k < k' \leq K} |\mu_{kj} - \mu_{k'j}|
\]

This PFP regularization does not shrink the means to zero but towards each other. If the jth components of all cluster means are driven to the same value, that variable can be considered as non-informative.

An L1,∞ penalty is used by Wang and Zhu (2008) and Kuan et al. (2010) to penalize the likelihood, encouraging null groups of features:

\[
\lambda\sum_{j=1}^{p}\left\|(\mu_{1j},\mu_{2j},\ldots,\mu_{Kj})\right\|_\infty
\]

One group is defined for each variable j as the set of the K means' jth components (µ_1j, ..., µ_Kj). The L1,∞ penalty forces zeros at the group level, favoring the removal of the corresponding feature. This method seems to produce parsimonious models and good partitions within a reasonable computing time. In addition, the code is publicly available. Xie et al. (2008b) apply a group-Lasso penalty. Their principle describes a vertical mean grouping (VMG, with the same groups as Xie et al. (2008a)) and a horizontal mean grouping (HMG). VMG achieves real feature selection because it forces null values for the same variable in all cluster means:

\[
\lambda\sqrt{K}\sum_{j=1}^{p}\sqrt{\sum_{k=1}^{K}\mu_{kj}^2}
\]

The clustering algorithm of VMG differs from ours, but the proposed group penalty is the same; however, no code is available on the authors' website that would allow testing it.
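For reference, the sketch below (illustrative, not taken from any of the cited packages) evaluates the three mean-based penalties discussed above — the L1 penalty of Pan and Shen (2007), the L1,∞ penalty of Wang and Zhu (2008), and the VMG group-Lasso penalty of Xie et al. (2008b) — on a K × p matrix of cluster means.

```python
import numpy as np

def mean_penalties(mu, lam=1.0):
    """Sparsity-inducing penalties on a K x p matrix of cluster means (sketch).

    Column j groups the K components (mu_1j, ..., mu_Kj) of variable j."""
    K, p = mu.shape
    l1 = lam * np.abs(mu).sum()                                       # Pan and Shen (2007)
    l1_inf = lam * np.abs(mu).max(axis=0).sum()                       # Wang and Zhu (2008)
    group = lam * np.sqrt(K) * np.sqrt((mu ** 2).sum(axis=0)).sum()   # VMG group-Lasso
    return l1, l1_inf, group
```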

The optimization of a penalized likelihood by means of an EM algorithm can be reformulated by rewriting the maximization expressions from the M-step as a penalized optimal scoring regression. Roth and Lange (2004) implemented it for two-cluster problems, using an L1 penalty to encourage sparsity on the discriminant vector. The generalization from quadratic to non-quadratic penalties is quickly outlined in their work. We extend this work by considering an arbitrary number of clusters and by formalizing the link between penalized optimal scoring and penalized likelihood estimation.

7.2.2 Based on Model Variants

The algorithm proposed by Law et al. (2004) takes a different stance. The authors define feature relevancy considering conditional independence: the jth feature is presumed uninformative if its distribution is independent of the class labels. The density is expressed as


\[
f(x_i\mid\phi,\pi,\theta,\nu) = \sum_{k=1}^{K}\pi_k\prod_{j=1}^{p}\left[f(x_{ij}\mid\theta_{jk})\right]^{\phi_j}\left[h(x_{ij}\mid\nu_j)\right]^{1-\phi_j}
\]

where f(·|θ_jk) is the distribution function for relevant features and h(·|ν_j) is the distribution function for the irrelevant ones. The binary vector φ = (φ_1, φ_2, ..., φ_p) represents relevance, with φ_j = 1 if the jth feature is informative and φ_j = 0 otherwise. The saliency of variable j is then formalized as ρ_j = P(φ_j = 1), so all φ_j must be treated as missing variables. Thus, the set of parameters is {π_k, θ_jk, ν_j, ρ_j}; their estimation is done by means of the EM algorithm (Law et al. 2004).

An original and recent technique is the Fisher-EM algorithm proposed by Bouveyron and Brunet (2012b,a). Fisher-EM is a modified version of EM that runs in a latent space. This latent space is defined by an orthogonal projection matrix U ∈ R^{p×(K−1)},

which is updated inside the EM loop with a new step called the Fisher step (F-step from now on), which maximizes a multi-class Fisher's criterion

\[
\operatorname{tr}\left((U^\top\Sigma_W U)^{-1}U^\top\Sigma_B U\right) \tag{7.14}
\]

so as to maximize the separability of the data. The E-step is the standard one, computing the posterior probabilities. Then the F-step updates the projection matrix that projects the data to the latent space. Finally, the M-step estimates the parameters by maximizing the conditional expectation of the complete log-likelihood. Those parameters can be rewritten as a function of the projection matrix U and the model parameters in the latent space, such that the U matrix enters into the M-step equations.
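As an illustration, the multi-class Fisher criterion (7.14) maximized by the F-step can be evaluated as follows; this is a sketch assuming U is a p × (K − 1) orthogonal matrix, and the function name is ours.

```python
import numpy as np

def fisher_criterion(U, Sigma_W, Sigma_B):
    """Multi-class Fisher criterion (7.14): tr((U' Sigma_W U)^{-1} U' Sigma_B U)."""
    M_W = U.T @ Sigma_W @ U
    M_B = U.T @ Sigma_B @ U
    return np.trace(np.linalg.solve(M_W, M_B))
```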

To induce feature selection, Bouveyron and Brunet (2012a) suggest three possibilities. The first one results in the best sparse orthogonal approximation Ũ of the matrix U that maximizes (7.14). This sparse approximation is defined as the solution of

\[
\min_{\tilde U\in\mathbb{R}^{p\times(K-1)}}\ \left\|X_U - X\tilde U\right\|_F^2 + \lambda\sum_{k=1}^{K-1}\left\|\tilde u^k\right\|_1
\]

where X_U = XU is the input data projected in the non-sparse space and ũ^k is the kth column vector of the projection matrix Ũ. The second possibility is inspired by Qiao et al. (2009) and reformulates Fisher's discriminant (7.14), used to compute the projection matrix, as a regression criterion penalized by a mixture of Lasso and Elastic net penalties:

\[
\begin{aligned}
\min_{A,B\in\mathbb{R}^{p\times(K-1)}}\quad & \sum_{k=1}^{K}\left\|R_W^{-\top}H_{B,k} - AB^\top H_{B,k}\right\|_2^2 + \rho\sum_{j=1}^{K-1}\beta_j^\top\Sigma_W\beta_j + \lambda\sum_{j=1}^{K-1}\left\|\beta_j\right\|_1\\
\text{s.t.}\quad & A^\top A = I_{K-1}
\end{aligned}
\]

where H_B ∈ R^{p×K} is a matrix defined conditionally on the posterior probabilities t_ik, satisfying H_B H_Bᵀ = Σ_B, and H_{B,k} is the kth column of H_B. R_W ∈ R^{p×p} is an upper


triangular matrix resulting from the Cholesky decomposition of Σ_W. Σ_W and Σ_B are the p × p within-class and between-class covariance matrices in the observation space. A ∈ R^{p×(K−1)} and B ∈ R^{p×(K−1)} are the solutions of the optimization problem, such that B = [β_1, ..., β_{K−1}] is the best sparse approximation of U.

The last possibility recasts the solution of Fisher's discriminant (7.14) as the solution of the following constrained optimization problem:

\[
\begin{aligned}
\min_{U\in\mathbb{R}^{p\times(K-1)}}\quad & \sum_{j=1}^{p}\left\|\Sigma_{B,j} - UU^\top\Sigma_{B,j}\right\|_2^2\\
\text{s.t.}\quad & U^\top U = I_{K-1}
\end{aligned}
\]

where Σ_{B,j} is the jth column of the between-class covariance matrix in the observation space. This problem can be solved by the penalized version of the singular value decomposition proposed by Witten et al. (2009), resulting in a sparse approximation of U.

To comply with the constraint stating that the columns of U are orthogonal, the first and the second options must be followed by a singular value decomposition of U to recover orthogonality. This is not necessary with the third option, since the penalized version of SVD already guarantees orthogonality.

However, there is a lack of guarantees regarding convergence. Bouveyron states: "the update of the orientation matrix U in the F-step is done by maximizing the Fisher criterion and not by directly maximizing the expected complete log-likelihood as required in the EM algorithm theory. From this point of view, the convergence of the Fisher-EM algorithm cannot therefore be guaranteed". Immediately after this paragraph, we can read that under certain suppositions their algorithm converges: "the model [...] which assumes the equality and the diagonality of covariance matrices, the F-step of the Fisher-EM algorithm satisfies the convergence conditions of the EM algorithm theory and the convergence of the Fisher-EM algorithm can be guaranteed in this case. For the other discriminant latent mixture models, although the convergence of the Fisher-EM procedure cannot be guaranteed, our practical experience has shown that the Fisher-EM algorithm rarely fails to converge with these models if correctly initialized".

7.2.3 Based on Model Selection

Some clustering algorithms recast the feature selection problem as a model selection problem. Following this view, Raftery and Dean (2006) model the observations as a mixture of Gaussian distributions. To discover a subset of relevant features (and its superfluous complement), they define three subsets of variables:

• X^(1): set of selected relevant variables;

• X^(2): set of variables being considered for inclusion in or exclusion from X^(1);

• X^(3): set of non-relevant variables.


With those subsets, they define two different models, where Y is the partition to consider:

• M1:
\[
f(X\mid Y) = f\left(X^{(1)},X^{(2)},X^{(3)}\mid Y\right) = f\left(X^{(3)}\mid X^{(2)},X^{(1)}\right)f\left(X^{(2)}\mid X^{(1)}\right)f\left(X^{(1)}\mid Y\right)
\]

• M2:
\[
f(X\mid Y) = f\left(X^{(1)},X^{(2)},X^{(3)}\mid Y\right) = f\left(X^{(3)}\mid X^{(2)},X^{(1)}\right)f\left(X^{(2)},X^{(1)}\mid Y\right)
\]

Model M1 means that the variables in X^(2) are independent of the clustering Y; model M2 states that the variables in X^(2) depend on the clustering Y. To simplify the algorithm, the subset X^(2) is only updated one variable at a time. Therefore, deciding the relevance of variable X^(2) amounts to a model selection between M1 and M2. The selection is done via the Bayes factor

\[
B_{12} = \frac{f(X\mid M_1)}{f(X\mid M_2)}
\]

where the high-dimensional term f(X^(3)|X^(2), X^(1)) cancels from the ratio:

\[
B_{12} = \frac{f\left(X^{(1)},X^{(2)},X^{(3)}\mid M_1\right)}{f\left(X^{(1)},X^{(2)},X^{(3)}\mid M_2\right)} = \frac{f\left(X^{(2)}\mid X^{(1)},M_1\right)f\left(X^{(1)}\mid M_1\right)}{f\left(X^{(2)},X^{(1)}\mid M_2\right)}
\]

This factor is approximated, since the integrated likelihoods f(X^(1)|M_1) and f(X^(2), X^(1)|M_2) are difficult to calculate exactly; Raftery and Dean (2006) use the BIC approximation. The computation of f(X^(2)|X^(1), M_1), if there is only one variable in X^(2), can be represented as a linear regression of variable X^(2) on the variables in X^(1). There is also a BIC approximation for this term.

Maugis et al. (2009a) have proposed a variation of the algorithm developed by Raftery and Dean. They define three subsets of variables: the relevant and irrelevant subsets (X^(1) and X^(3)) remain the same, but X^(2) is reformulated as a subset of relevant variables that explains the irrelevance through a multidimensional regression. This algorithm also uses a backward stepwise strategy instead of the forward stepwise strategy used by Raftery and Dean (2006). Their algorithm allows defining blocks of indivisible variables that, in certain situations, improve the clustering and its interpretability.

Both algorithms are well motivated and appear to produce good results; however, the computations needed to test the different subsets of variables require a huge amount of time. In practice, they cannot be used for the amount of data considered in this thesis.


8 Theoretical Foundations

In this chapter, we develop Mix-GLOSS, which uses the GLOSS algorithm conceived for supervised classification (see Chapter 5) to solve clustering problems. The goal here is similar, that is, providing an assignment of examples to clusters based on few features.

We use a modified version of the EM algorithm whose M-step is formulated as a penalized linear regression of a scaled indicator matrix, that is, a penalized optimal scoring problem. This idea was originally proposed by Hastie and Tibshirani (1996) to build reduced-rank decision rules using fewer than K − 1 discriminant directions. Their motivation was mainly driven by stability issues; no sparsity-inducing mechanism was introduced in the construction of discriminant directions. Roth and Lange (2004) pursued this idea for binary clustering problems, where sparsity was introduced by a Lasso penalty applied to the OS problem. Besides extending the work of Roth and Lange (2004) to an arbitrary number of clusters, we draw links between the OS penalty and the parameters of the Gaussian model.

In the subsequent sections, we provide the principles that allow solving the M-step as an optimal scoring problem. The feature selection technique is embedded by means of a group-Lasso penalty. We must then guarantee that the equivalence between the M-step and the OS problem holds for our penalty; as with GLOSS, this is accomplished with a variational approach of the group-Lasso. Finally, some considerations regarding the criterion that is optimized with this modified EM are provided.

8.1 Resolving EM with Optimal Scoring

In the previous chapters, EM was presented as an iterative algorithm that computes a maximum likelihood estimate through the maximization of the expected complete log-likelihood. This section explains how a penalized OS regression embedded into an EM algorithm produces a penalized likelihood estimate.

8.1.1 Relationship Between the M-Step and Linear Discriminant Analysis

LDA is typically used in a supervised learning framework for classification and dimension reduction. It looks for a projection of the data where the ratio of between-class variance to within-class variance is maximized (see Appendix C). Classification in the LDA domain is based on the Mahalanobis distance

\[
d(x_i,\mu_k) = (x_i-\mu_k)^\top\Sigma_W^{-1}(x_i-\mu_k)
\]

where µ_k are the p-dimensional centroids and Σ_W is the p × p common within-class covariance matrix.


The likelihood equations in the M-step, (7.11) and (7.12), can be interpreted as the mean and covariance estimates of a weighted and augmented LDA problem (Hastie and Tibshirani 1996), where the n observations are replicated K times and weighted by t_ik (the posterior probabilities computed at the E-step).

Having replicated the data vectors, Hastie and Tibshirani (1996) remark that the parameters maximizing the mixture likelihood in the M-step of the EM algorithm, (7.11) and (7.12), can also be defined as the maximizers of the weighted and augmented likelihood

\[
2\,l_{\text{weight}}(\mu,\Sigma) = \sum_{i=1}^{n}\sum_{k=1}^{K} t_{ik}\,d(x_i,\mu_k) - n\log(|\Sigma_W|)
\]

which arises when considering a weighted and augmented LDA problem. This viewpoint provides the basis for an alternative maximization of the penalized likelihood in Gaussian mixtures.

8.1.2 Relationship Between Optimal Scoring and Linear Discriminant Analysis

The equivalence between penalized optimal scoring problems and penalized linear discriminant analysis has already been detailed in Section 4.1 in the supervised learning framework. This is a critical part of the link between the M-step of an EM algorithm and optimal scoring regression.

8.1.3 Clustering Using Penalized Optimal Scoring

The solution of the penalized optimal scoring regression in the M-step is a coefficient matrix B_OS, analytically related to Fisher's discriminative directions B_LDA for the data (X, Y), where Y is the current (hard or soft) cluster assignment. In order to compute the posterior probabilities t_ik in the E-step, the distance between the samples x_i and the centroids µ_k must be evaluated. Depending on whether we are working in the input domain, the OS domain or the LDA domain, different expressions are used for the distances (see Section 4.2.2 for more details). Mix-GLOSS works in the LDA domain, based on the following expression:

\[
d(x,\mu_k) = \left\|(x-\mu_k)B_{LDA}\right\|_2^2 - 2\log(\pi_k)
\]

This distance defines the computation of the posterior probabilities t_ik in the E-step (see Section 4.2.3). Putting together all those elements, the complete clustering algorithm can be summarized as follows:


1. Initialize the membership matrix Y (for example with the K-means algorithm).

2. Solve the p-OS problem as
\[
B_{OS} = \left(X^\top X + \lambda\Omega\right)^{-1}X^\top Y\Theta,
\]
where Θ are the K − 1 leading eigenvectors of
\[
Y^\top X\left(X^\top X + \lambda\Omega\right)^{-1}X^\top Y.
\]

3. Map X to the LDA domain: X_LDA = X B_OS D, with D = diag(α_k⁻¹(1 − α_k²)^{-1/2}).

4. Compute the centroids M in the LDA domain.

5. Evaluate the distances in the LDA domain.

6. Translate distances into posterior probabilities t_ik with
\[
t_{ik} \propto \exp\left[-\frac{d(x_i,\mu_k) - 2\log(\pi_k)}{2}\right] \tag{8.1}
\]

7. Update the labels using the posterior probability matrix: Y = T.

8. Go back to step 2 and iterate until the t_ik converge.

Items 2 to 5 can be interpreted as the M-step and Item 6 as the E-step in this alternative view of the EM algorithm for Gaussian mixtures.
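The following sketch shows the structure of this loop. Since the GLOSS group-Lasso solver is not reproduced here, step 2 is replaced by a ridge-penalized optimal scoring regression (Ω = I), so the example is self-contained but does not induce sparsity; X is assumed column-centered, and all names are illustrative.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.cluster.vq import kmeans2

def os_em_clustering(X, K, lam=1.0, n_iter=50, seed=0):
    """EM-by-optimal-scoring loop (sketch): ridge penalty stands in for GLOSS."""
    n, p = X.shape
    _, labels = kmeans2(X, K, minit='++', seed=seed)          # step 1: K-means init
    T = np.eye(K)[labels]                                      # soft labels t_ik
    M = X.T @ X + lam * np.eye(p)                              # X'X + lam * Omega
    for _ in range(n_iter):
        # Step 2: penalized optimal scoring, B = (X'X + lam*Omega)^{-1} X'Y Theta
        A = T.T @ X @ np.linalg.solve(M, X.T @ T)
        evals, evecs = eigh(A, T.T @ T)                        # generalized eigen-problem
        order = np.argsort(evals)[::-1][:K - 1]
        alpha2 = np.clip(evals[order], 1e-8, 1 - 1e-8)
        Theta = evecs[:, order]
        B = np.linalg.solve(M, X.T @ T @ Theta)
        # Step 3: LDA domain, D = diag(alpha_k^{-1} (1 - alpha_k^2)^{-1/2})
        X_lda = X @ B / (np.sqrt(alpha2) * np.sqrt(1.0 - alpha2))
        # Step 4: centroids in the LDA domain, weighted by the current posteriors
        tk = T.sum(axis=0)
        mu = (T.T @ X_lda) / tk[:, None]
        pi = tk / n
        # Steps 5-7: distances -> posteriors (8.1) -> label update
        d = ((X_lda[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        logit = -(d - 2.0 * np.log(pi)) / 2.0
        logit -= logit.max(axis=1, keepdims=True)              # numerical stability
        T = np.exp(logit)
        T /= T.sum(axis=1, keepdims=True)
    return T.argmax(axis=1), B, T
```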

8.1.4 From Sparse Optimal Scoring to Sparse Linear Discriminant Analysis

In the previous section, we sketched a clustering algorithm that replaces the M-step with penalized OS. This modified version of EM holds for any quadratic penalty. We extend this equivalence to sparsity-inducing penalties through the quadratic variational approach to the group-Lasso provided in Section 4.3. We now look for a formal equivalence between this penalty and penalized maximum likelihood for Gaussian mixtures.

8.2 Optimized Criterion

In the classical EM for Gaussian mixtures, the M-step maximizes the weighted likelihood Q(θ, θ′) (7.7) so as to maximize the likelihood L(θ) (see Section 7.1.2). Replacing the M-step by an optimal scoring regression is equivalent; replacing the M-step by a penalized optimal scoring problem is also possible, and the link between the penalized optimal scoring problem and penalized LDA holds, but it remains to relate this penalized LDA problem to a penalized maximum likelihood criterion for the Gaussian mixture.

This penalized likelihood cannot be rigorously interpreted as a maximum a posteriori criterion, in particular because the penalty only operates on the covariance matrix Σ (there is no prior on the means and proportions of the mixture). We however believe that the Bayesian interpretation provides some insight, and we detail it in what follows.

8.2.1 A Bayesian Derivation

This section sketches a Bayesian treatment of inference limited to our present needs, where penalties are to be interpreted as prior distributions over the parameters of the probabilistic model to be estimated. Further details can be found in Bishop (2006, Section 2.3.6) and in Gelman et al. (2003, Section 3.6).

The model proposed in this thesis considers a classical maximum likelihood estimation for the means and a penalized common covariance matrix. This penalization can be interpreted as arising from a prior on this parameter.

The prior over the covariance matrix of a Gaussian variable is classically expressed as a Wishart distribution, since it is a conjugate prior:

\[
f(\Sigma\mid\Lambda_0,\nu_0) = \frac{1}{2^{np/2}|\Lambda_0|^{n/2}\,\Gamma_p(\tfrac{n}{2})}\,|\Sigma^{-1}|^{\frac{\nu_0-p-1}{2}}\exp\left\{-\frac{1}{2}\operatorname{tr}\left(\Lambda_0^{-1}\Sigma^{-1}\right)\right\}
\]

where ν_0 is the number of degrees of freedom of the distribution, Λ_0 is a p × p scale matrix, and Γ_p is the multivariate gamma function defined as

\[
\Gamma_p(n/2) = \pi^{p(p-1)/4}\prod_{j=1}^{p}\Gamma\left(n/2 + (1-j)/2\right)
\]

The posterior distribution can be maximized similarly to the likelihood through the


maximization of

\[
\begin{aligned}
Q(\theta,\theta') &+ \log\left(f(\Sigma\mid\Lambda_0,\nu_0)\right)\\
&= \sum_{k=1}^{K} t_k\log\pi_k - \frac{(n+1)p}{2}\log 2 - \frac{n}{2}\log|\Lambda_0| - \frac{p(p+1)}{4}\log(\pi)\\
&\quad - \sum_{j=1}^{p}\log\Gamma\left(\frac{n}{2}+\frac{1-j}{2}\right) - \frac{\nu_n-p-1}{2}\log|\Sigma| - \frac{1}{2}\operatorname{tr}\left(\Lambda_n^{-1}\Sigma^{-1}\right)\\
&\equiv \sum_{k=1}^{K} t_k\log\pi_k - \frac{n}{2}\log|\Lambda_0| - \frac{\nu_n-p-1}{2}\log|\Sigma| - \frac{1}{2}\operatorname{tr}\left(\Lambda_n^{-1}\Sigma^{-1}\right) \qquad (8.2)
\end{aligned}
\]
with
\[
t_k = \sum_{i=1}^{n} t_{ik}, \qquad \nu_n = \nu_0 + n, \qquad \Lambda_n^{-1} = \Lambda_0^{-1} + S_0, \qquad S_0 = \sum_{i=1}^{n}\sum_{k=1}^{K} t_{ik}(x_i-\mu_k)(x_i-\mu_k)^\top.
\]

Details of these calculations can be found in textbooks (for example Bishop 2006; Gelman et al. 2003).

8.2.2 Maximum a Posteriori Estimator

The maximization of (8.2) with respect to µ_k and π_k is of course not affected by the additional prior term, where only the covariance Σ intervenes. The MAP estimator for Σ is simply obtained by differentiating (8.2) with respect to Σ. The details of the calculations follow the same lines as the ones for maximum likelihood detailed in Appendix G. The resulting estimator for Σ is

\[
\widehat{\Sigma}_{MAP} = \frac{1}{\nu_0 + n - p - 1}\left(\Lambda_0^{-1} + S_0\right) \tag{8.3}
\]

where S_0 is the matrix defined in Equation (8.2). The maximum a posteriori estimator of the within-class covariance matrix (8.3) can thus be identified with the penalized within-class variance (4.19) resulting from the p-OS regression (4.16a), if ν_0 is chosen to be p + 1 and Λ_0⁻¹ = λΩ, where Ω is the penalty matrix from the group-Lasso regularization (4.25).
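A minimal sketch of this estimator, assuming the soft assignments t_ik, the means µ_k and the prior parameters (Λ_0⁻¹, ν_0) are available (names are illustrative):

```python
import numpy as np

def sigma_map(X, T, mu, Lambda0_inv, nu0):
    """MAP estimate (8.3) of the common covariance under the Wishart-type prior (sketch)."""
    n, p = X.shape
    S0 = np.zeros((p, p))
    for k in range(mu.shape[0]):
        Xc = X - mu[k]
        S0 += (T[:, k, None] * Xc).T @ Xc      # weighted scatter S_0 of (8.2)
    return (Lambda0_inv + S0) / (nu0 + n - p - 1)
```

With nu0 = p + 1 and Lambda0_inv = λΩ, this reduces to the penalized within-class covariance discussed above.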


9 Mix-GLOSS Algorithm

Mix-GLOSS is an algorithm for unsupervised classification that embeds feature selection, resulting in parsimonious decision rules. It is based on the GLOSS algorithm developed in Chapter 5, which has been adapted for clustering. In this chapter, I describe the details of the implementation of Mix-GLOSS and of the model selection mechanism.

9.1 Mix-GLOSS

The implementation of Mix-GLOSS involves three nested loops, as sketched in Figure 9.1. The inner one is an EM algorithm that, for a given value of the regularization parameter λ, iterates between an M-step, where the parameters of the model are estimated, and an E-step, where the corresponding posterior probabilities are computed. The main outputs of the EM are the coefficient matrix B, which projects the input data X onto the best subspace (in Fisher's sense), and the posteriors t_ik.

When several values of the penalty parameter are tested, we give them to the algorithm in ascending order, and the algorithm is initialized with the solution found for the previous λ value. This process continues until all the penalty parameter values have been tested, if a vector of penalty parameters was provided, or until a given sparsity is achieved, as measured by the number of variables estimated to be relevant.

The outer loop implements complete repetitions of the clustering algorithm for all the penalty parameter values, with the purpose of choosing the best execution. This loop alleviates the local minima issues by resorting to multiple initializations of the partition.

9.1.1 Outer Loop: Whole Algorithm Repetitions

This loop performs a user-defined number of repetitions of the clustering algorithm. It takes as inputs:

• the centered n × p feature matrix X;

• the vector of penalty parameter values to be tried (an option is to provide an empty vector and let the algorithm set trial values automatically);

• the number of clusters K;

• the maximum number of iterations for the EM algorithm;

• the convergence tolerance for the EM algorithm;

• the number of whole repetitions of the clustering algorithm;


Figure 9.1: Mix-GLOSS loops scheme.

• a p × (K − 1) initial coefficient matrix (optional);

• an n × K initial posterior probability matrix (optional).

For each algorithm repetition, an initial label matrix Y is needed. This matrix may contain either hard or soft assignments. If no such matrix is available, K-means is used to initialize the process. If we have an initial guess for the coefficient matrix B, it can also be fed into Mix-GLOSS to warm-start the process.

9.1.2 Penalty Parameter Loop

The penalty parameter loop goes through all the values of the input vector λ. These values are sorted in ascending order, such that the resulting B and Y matrices can be used to warm-start the EM loop for the next value of the penalty parameter. If some λ value results in a null coefficient matrix, the algorithm halts. We have verified that the warm-start implemented here reduces the computation time by a factor of 8 with respect to using a null B matrix and a K-means execution for the initial Y label matrix.

Mix-GLOSS may be fed with an empty vector of penalty parameters, in which case a first non-penalized execution of Mix-GLOSS is done, and its resulting coefficient matrix B and posterior matrix Y are used to estimate a trial value of λ that should remove about 10% of the relevant features. This estimation is repeated until a minimum number of relevant variables is reached. The parameter that sets the estimated percentage of variables to be removed with the next penalty parameter can be modified to make feature selection more or less aggressive.

Algorithm 2 details the implementation of the automatic selection of the penalty parameter. If the alternative variational approach from Appendix D is used, we have to replace Equation (4.32b) by (D.10b).

Algorithm 2: Automatic selection of λ

Input: X, K, λ = ∅, minVAR
Initialize: B ← 0; Y ← K-means(X, K)
Run non-penalized Mix-GLOSS: λ ← 0; (B, Y) ← Mix-GLOSS(X, K, B, Y, λ)
lastLAMBDA ← false
repeat
  {Estimate λ}
  Compute the gradient at β_j = 0:
    ∂J(B)/∂β_j |_{β_j=0} = x_jᵀ ( ∑_{m≠j} x_m β^m − YΘ )
  Compute λ_max for every feature using (4.32b):
    λ_max^j = (1/w_j) ‖ ∂J(B)/∂β_j |_{β_j=0} ‖_2
  Choose λ so as to remove 10% of the relevant features
  {Run penalized Mix-GLOSS}
  (B, Y) ← Mix-GLOSS(X, K, B, Y, λ)
  if the number of relevant variables in B > minVAR then
    lastLAMBDA ← false
  else
    lastLAMBDA ← true
  end if
until lastLAMBDA
Output: B, L(θ), t_ik, π_k, µ_k, Σ, Y for every λ in the solution path
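The per-feature critical penalty levels used in Algorithm 2 can be computed as in the following sketch (illustrative names; the weights w_j and the current coefficient matrix B are assumed to come from the preceding non-penalized run):

```python
import numpy as np

def lambda_max_per_feature(X, YTheta, B, w):
    """Critical penalty levels of Algorithm 2 (sketch).

    grad_j = x_j' (sum_{m != j} x_m beta^m - Y Theta);  lambda_max_j = ||grad_j||_2 / w_j."""
    n, p = X.shape
    lam_max = np.zeros(p)
    fit = X @ B
    for j in range(p):
        partial_fit = fit - np.outer(X[:, j], B[j])   # fit without feature j
        grad_j = X[:, j] @ (partial_fit - YTheta)
        lam_max[j] = np.linalg.norm(grad_j) / w[j]
    return lam_max
```

The next trial λ would then be chosen among these values so that roughly 10% of the currently relevant features are driven to zero.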

9.1.3 Inner Loop: EM Algorithm

The inner loop implements the actual clustering algorithm by means of successive maximizations of a penalized likelihood criterion. Once convergence of the posterior probabilities t_ik is achieved, the maximum a posteriori rule is applied to classify all examples. Algorithm 3 describes this inner loop.


Algorithm 3: Mix-GLOSS for one value of λ

Input: X, K, B0, Y0, λ
Initialize:
  if (B0, Y0) available then B_OS ← B0; Y ← Y0
  else B_OS ← 0; Y ← K-means(X, K)
  end if
  convergenceEM ← false; tolEM ← 1e-3
repeat
  {M-step}
  (B_OS, Θ, α) ← GLOSS(X, Y, B_OS, λ)
  X_LDA = X B_OS diag(α⁻¹(1 − α²)^{-1/2})
  π_k, µ_k and Σ as per (7.10), (7.11) and (7.12)
  {E-step}
  t_ik as per (8.1)
  L(θ) as per (8.2)
  if (1/n) ∑_i |t_ik − y_ik| < tolEM then
    convergenceEM ← true
  end if
  Y ← T
until convergenceEM
Y ← MAP(T)
Output: B_OS, Θ, L(θ), t_ik, π_k, µ_k, Σ, Y


M-Step

The M-step deals with the estimation of the model parameters, that is, the cluster means µ_k, the common covariance matrix Σ and the prior of every component π_k. In a classical M-step, this is done explicitly by maximizing the likelihood expression. Here, this maximization is implicitly performed by penalized optimal scoring (see Section 8.1). The core of this step is a GLOSS execution that regresses X on the scaled version of the label matrix, YΘ. For the first iteration of EM, if no initialization is available, Y results from a K-means execution. In subsequent iterations, Y is updated as the posterior probability matrix T resulting from the E-step.

E-Step

The E-step evaluates the posterior probability matrix T using
\[
t_{ik} \propto \exp\left[-\frac{d(x_i,\mu_k) - 2\log(\pi_k)}{2}\right].
\]

The convergence of these t_ik is used as the stopping criterion for EM.

9.2 Model Selection

Here, model selection refers to the choice of the penalty parameter. Up to now, we have not conducted experiments where the number of clusters has to be automatically selected.

In a first attempt, we tried a classical structure where clustering was performed several times from different initializations for all penalty parameter values. Then, using the log-likelihood criterion, the best repetition for every value of the penalty parameter was chosen. The definitive λ was selected by means of the stability criterion described by Lange et al. (2002). This algorithm took a lot of computing resources, since the stability selection mechanism required a certain number of repetitions that transformed Mix-GLOSS into a lengthy structure of four nested loops.

In a second attempt, we replaced the stability-based model selection algorithm by the evaluation of a modified version of BIC (Pan and Shen 2007). This version of BIC looks like the traditional one (Schwarz 1978) but takes into consideration the variables that have been removed. This mechanism, even if it turned out to be faster, also required a large computation time.
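A sketch of such a modified BIC is given below; the exact parameter count used by Mix-GLOSS may differ, so this is only indicative.

```python
import numpy as np

def modified_bic(loglik, B, n):
    """BIC counting only the parameters of the variables kept in the model (sketch)."""
    active = np.any(B != 0, axis=1)          # rows of B that were not zeroed out
    d = active.sum() * B.shape[1]            # effective number of parameters
    return -2.0 * loglik + d * np.log(n)
```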

The third and definitive attempt (up to now) proceeds with several executions of Mix-GLOSS for the non-penalized case (λ = 0). The execution with the best log-likelihood is chosen. The repetitions are only performed for the non-penalized problem. The coefficient matrix B and the posterior matrix T resulting from the best non-penalized execution are used to warm-start a new Mix-GLOSS execution. This second execution of Mix-GLOSS is done using the values of the penalty parameter provided by the user or computed by the automatic selection mechanism. This time, only one repetition of the algorithm is done for every value of the penalty parameter. This version has been tested


Figure 9.2: Mix-GLOSS model selection diagram (an initial non-penalized Mix-GLOSS with λ = 0 and 20 repetitions is run on X and K; the B and T matrices from the best repetition warm-start Mix-GLOSS for each λ; BIC is computed for every λ and λ_BEST = argmin_λ BIC is chosen, yielding the partition, t_ik, π_k, B, Θ, D, L(θ) and the active set).

with no significant differences in the quality of the clustering, but dramatically reducing the computation time. Figure 9.2 summarizes the mechanism that implements the model selection of the penalty parameter λ.


10 Experimental Results

The performance of Mix-GLOSS is measured here with the artificial dataset that has been used in Section 6.

This synthetic database is interesting because it covers four different situations where feature selection can be applied. Basically, it considers four setups with 1200 examples equally distributed between classes. It is a small-sample regime, with p = 500 variables out of which 100 differ between classes. Independent variables are generated for all simulations except for simulation 2, where they are slightly correlated. In simulations 2 and 3, classes are optimally separated by a single projection of the original variables, while the two other scenarios require three discriminant directions. The Bayes' error was estimated to be respectively 1.7%, 6.7%, 7.3% and 30.0%. The exact description of every setup has already been given in Section 6.3.

In our tests, we have reduced the volume of the problem because, with the original size of 1200 samples and 500 dimensions, some of the algorithms to be tested took several days (even weeks) to finish. Hence, the definitive database was chosen to maintain approximately the Bayes' error of the original one, but with five times fewer examples and dimensions (n = 240, p = 100). Figure 10.1 has been adapted from Witten and Tibshirani (2011) to the dimensionality of our experiments and allows a better understanding of the different simulations.

The simulation protocol involves 25 repetitions of each setup, generating a different dataset for each repetition. Thus, the results of the tested algorithms are provided as the average value and the standard deviation over the 25 repetitions.

10.1 Tested Clustering Algorithms

This section compares Mix-GLOSS with the following state-of-the-art methods:

• CS general cov. This is a model-based clustering method with unconstrained covariance matrices, based on the regularization of the likelihood function using L1 penalties, followed by a classical EM algorithm. Further details can be found in Zhou et al. (2009). We use the R function available on the website of Wei Pan.

• Fisher EM. This method models and clusters the data in a discriminative and low-dimensional latent subspace (Bouveyron and Brunet 2012b,a). Feature selection is induced by means of the "sparsification" of the projection matrix (three possibilities are suggested by Bouveyron and Brunet 2012a). The corresponding R package "Fisher EM" is available from the website of Charles Bouveyron or from the Comprehensive R Archive Network website.


Figure 10.1: Class mean vectors for each artificial simulation.

• SelvarClust/Clustvarsel. This implements a method of variable selection for clustering using Gaussian mixture models, as a modification of the Raftery and Dean (2006) algorithm. SelvarClust (Maugis et al. 2009b) is a software implemented in C++ that makes use of the clustering library mixmod (Biernacki et al. 2008). Further information can be found in the related paper Maugis et al. (2009a). The software can be downloaded from the SelvarClust project homepage; there is a link to the project from Cathy Maugis's website.

After several tests, this entrant was discarded due to the amount of computing time required by the greedy selection technique, which basically involves two executions of a classical clustering algorithm (with mixmod) for every single variable whose inclusion needs to be considered.

The substitute for SelvarClust has been the algorithm that inspired it, that is, the method developed by Raftery and Dean (2006). There is an R package named Clustvarsel that can be downloaded from the website of Nema Dean or from the Comprehensive R Archive Network website.

• LumiWCluster. LumiWCluster is an R package available from the homepage of Pei Fen Kuan. This algorithm is inspired by Wang and Zhu (2008), who propose a penalty for the likelihood that incorporates group information through an L1,∞ mixed norm. In Kuan et al. (2010), slight changes are introduced in the penalty term, such as weighting parameters that are particularly important for their dataset. The package LumiWCluster allows performing clustering using the expression from Wang and Zhu (2008) (called LumiWCluster-Wang here) or the one from Kuan et al. (2010) (called LumiWCluster-Kuan).

• Mix-GLOSS. This is the clustering algorithm implemented using GLOSS (see Section 9). It makes use of an EM algorithm and the equivalences between the M-step and an LDA problem, and between a p-LDA problem and a p-OS problem. It penalizes an OS regression with a variational approach of the group-Lasso penalty (see Section 8.1.4) that induces zeros in all discriminant directions for the same variable.

10.2 Results

Table 10.1 shows the results of the experiments for all the algorithms from Section 10.1. The criteria used to measure the performance are:

• Clustering Error (in percentage). To measure the quality of the partition with the a priori knowledge of the real classes, the clustering error is computed as explained in Wu and Schölkopf (2007). If the obtained partition and the real labeling are the same, then the clustering error is 0%. The way this measure is defined allows obtaining the ideal 0% clustering error even if the IDs of the clusters and of the real classes are different.

• Number of Disposed Features. This value shows the number of variables whose coefficients have been zeroed; therefore, they are not used in the partitioning. In our datasets, only the first 20 features are relevant for the discrimination; the last 80 variables can be discarded. Hence, a good result for the tested algorithms should be around 80.

• Time of execution (in hours, minutes or seconds). Finally, the time needed to execute the 25 repetitions of each simulation setup is also measured. These algorithms tend to be more memory- and CPU-consuming as the number of variables increases; this is one of the reasons why the dimensionality of the original problem was reduced.

The adequacy of the selected features was assessed by the True Positive Rate (TPR) and the False Positive Rate (FPR). The TPR is the proportion of relevant variables that are selected; similarly, the FPR is the proportion of irrelevant variables that are selected. The best algorithm would be the one that selects all the relevant variables and rejects all the others, that is, TPR = 1 and FPR = 0 simultaneously. In order to avoid cluttered results, we compare TPR and FPR for the four simulations but only for three algorithms: CS general cov and Clustvarsel were discarded due to their high computing time and clustering error, respectively, and since the two versions of LumiWCluster provide almost the same TPR and FPR, only one is displayed. The three remaining algorithms are Fisher EM by Bouveyron and Brunet (2012a), the version of LumiWCluster by Kuan et al. (2010), and Mix-GLOSS.
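For completeness, these performance measures can be computed as in the following sketch (illustrative; the clustering error follows the best-permutation matching idea of Wu and Schölkopf (2007), and TPR/FPR use the definitions stated above).

```python
import numpy as np
from itertools import permutations

def clustering_error(y_true, y_pred, K):
    """Clustering error with the best matching of cluster IDs to class IDs (sketch)."""
    best_accuracy = 0.0
    for perm in permutations(range(K)):
        mapping = np.array(perm)
        best_accuracy = max(best_accuracy, (mapping[y_pred] == y_true).mean())
    return 1.0 - best_accuracy

def tpr_fpr(selected, relevant, p):
    """TPR: fraction of relevant variables selected; FPR: fraction of irrelevant ones selected."""
    selected, relevant = set(selected), set(relevant)
    tpr = len(selected & relevant) / len(relevant)
    fpr = len(selected - relevant) / (p - len(relevant))
    return tpr, fpr
```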

Results, in percentages, are displayed in Figure 10.2 (and in Table 10.2).


Table 10.1: Experimental results for simulated data.

                          Err (%)       Var           Time
Sim 1: K = 4, mean shift, ind. features
  CS general cov          46 (15)       985 (72)      884h
  Fisher EM               58 (87)       784 (52)      1645m
  Clustvarsel             602 (107)     378 (291)     383h
  LumiWCluster-Kuan       42 (68)       779 (4)       389s
  LumiWCluster-Wang       43 (69)       784 (39)      619s
  Mix-GLOSS               32 (16)       80 (09)       15h

Sim 2: K = 2, mean shift, dependent features
  CS general cov          154 (2)       997 (09)      783h
  Fisher EM               74 (23)       809 (28)      8m
  Clustvarsel             73 (2)        334 (207)     166h
  LumiWCluster-Kuan       64 (18)       798 (04)      155s
  LumiWCluster-Wang       63 (17)       799 (03)      14s
  Mix-GLOSS               77 (2)        841 (34)      2h

Sim 3: K = 4, 1D mean shift, ind. features
  CS general cov          304 (57)      55 (468)      1317h
  Fisher EM               233 (65)      366 (55)      22m
  Clustvarsel             658 (115)     232 (291)     542h
  LumiWCluster-Kuan       323 (21)      80 (02)       83s
  LumiWCluster-Wang       308 (36)      80 (02)       1292s
  Mix-GLOSS               347 (92)      81 (88)       21h

Sim 4: K = 4, mean shift, ind. features
  CS general cov          626 (55)      999 (02)      112h
  Fisher EM               567 (104)     55 (48)       195m
  Clustvarsel             732 (4)       24 (12)       767h
  LumiWCluster-Kuan       692 (112)     99 (2)        876s
  LumiWCluster-Wang       697 (119)     991 (21)      825s
  Mix-GLOSS               669 (91)      975 (12)      11h

Table 10.2: TPR versus FPR (in %), average computed over 25 repetitions for the best performing algorithms.

              Simulation 1    Simulation 2    Simulation 3    Simulation 4
              TPR    FPR      TPR    FPR      TPR    FPR      TPR    FPR
  MIX-GLOSS   992    015      828    335      884    67       780    12
  LUMI-KUAN   992    28       1000   02       1000   005      50     005
  FISHER-EM   986    24       888    17       838    5825     620    4075


Figure 10.2: TPR versus FPR (in %) for the best performing algorithms and simulations (MIX-GLOSS, LUMI-KUAN and FISHER-EM on Simulations 1–4).

10.3 Discussion

After reviewing Tables 10.1–10.2 and Figure 10.2, we see that there is no definitive winner in all situations regarding all criteria. According to the objectives and constraints of the problem, the following observations deserve to be highlighted.

LumiWCluster (Wang and Zhu 2008; Kuan et al. 2010) is by far the fastest kind of method, with good behavior regarding the other performance measures. At the other end of this criterion, CS general cov is extremely slow, and Clustvarsel, though twice as fast, also takes very long to produce an output. Of course, the speed criterion does not say much by itself: the implementations use different programming languages and different stopping criteria, and we do not know what effort has been spent on implementation. That being said, the slowest algorithms are not the most precise ones, so their long computation time is worth mentioning here.

The quality of the partition varies depending on the simulation and the algorithm: Mix-GLOSS has a small edge in Simulation 1, LumiWCluster (Zhou et al. 2009) performs better in Simulation 2, while Fisher EM (Bouveyron and Brunet 2012a) does slightly better in Simulations 3 and 4.

From the feature selection point of view, LumiWCluster (Kuan et al. 2010) and Mix-GLOSS succeed in removing irrelevant variables in all situations; Fisher EM (Bouveyron and Brunet 2012a) and Mix-GLOSS discover the relevant ones. Mix-GLOSS consistently performs best or close to the best solution in terms of fall-out and recall.


Conclusions

Summary

The linear regression of scaled indicator matrices, or optimal scoring, is a versatile technique with applicability in many fields of the machine learning domain. By means of regularization, an optimal scoring regression can be strengthened to be more robust, avoid overfitting, counteract ill-posed problems, or remove correlated or noisy variables.

In this thesis, we have proven the utility of penalized optimal scoring in the fields of multi-class linear discrimination and clustering.

The equivalence between LDA and OS problems allows taking advantage of all the resources available for solving regression problems when solving linear discrimination problems. In their penalized versions, this equivalence holds under certain conditions that have not always been obeyed when OS has been used to solve LDA problems.

In Part II, we have used a variational approach of the group-Lasso penalty to preserve this equivalence, granting the use of penalized optimal scoring regressions for the solution of linear discrimination problems. This theory has been verified with the implementation of our Group-Lasso Optimal Scoring Solver algorithm (GLOSS), which has proved its effectiveness, inducing extremely parsimonious models without renouncing any predictive capabilities. GLOSS has been tested with four artificial and three real datasets, outperforming other state-of-the-art algorithms in almost all situations.

In Part III, this theory has been adapted, by means of an EM algorithm, to the unsupervised domain. As in the supervised case, the theory must guarantee the equivalence between penalized LDA and penalized OS. The difficulty of this method resides in the computation of the criterion maximized at every iteration of the EM loop, which is typically used to detect the convergence of the algorithm and to implement model selection of the penalty parameter. Also in this case, the theory has been put into practice with the implementation of Mix-GLOSS. For now, due to time constraints, only artificial datasets have been tested, with positive results.

Perspectives

Even if the preliminary results are promising, Mix-GLOSS has not been sufficiently tested. We have planned to test it at least with the same real datasets that we used with GLOSS; however, more testing would be recommended in both cases. Those algorithms are well suited for genomic data, where the number of samples is smaller than the number of variables, but other high-dimensional low-sample-size (HDLSS) domains are also possible: identification of male or female silhouettes, fungal species or fish species based on shape and texture (Clemmensen et al. 2011), or the Stirling faces (Roth and Lange 2004) are only some examples. Moreover, we are not constrained to the HDLSS domain: the USPS handwritten digits database (Roth and Lange 2004), or the well-known Fisher's Iris dataset and six other UCI datasets (Bouveyron and Brunet 2012a), have also been tested in the literature.

At the programming level, both codes must be revisited to improve their robustness and optimize their computation, because during the prototyping phase the priority was achieving functional code. An old version of GLOSS, numerically more stable but less efficient, has been made available to the public. A better suited and documented version of GLOSS and Mix-GLOSS should be made available in the short term.

The theory developed in this thesis and the programming structure used for its implementation allow easy alterations of the algorithm by modifying the within-class covariance matrix. Diagonal versions of the model can be obtained by discarding all the elements but the diagonal of the covariance matrix. Spherical models could also be implemented easily. Prior information concerning the correlation between features can be included by adding a quadratic penalty term, such as the Laplacian that describes the relationships between variables; that can be used to implement pairwise penalties when the dataset is formed by pixels. Quadratic penalty matrices can also be added to the within-class covariance to implement Elastic-net-equivalent penalties. Some of those possibilities have been partially implemented, such as the diagonal version of GLOSS; however, they have not been properly tested, or even updated with the latest algorithmic modifications. Their equivalents for the unsupervised domain have not yet been proposed, due to the time deadlines for the publication of this thesis.

From the point of view of the supporting theory, we did not succeed in finding the exact criterion that is maximized in Mix-GLOSS. We believe it must be a kind of penalized, or even hyper-penalized, likelihood, but we decided to prioritize the experimental results due to the time constraints. Not knowing this criterion does not prevent successful simulations of Mix-GLOSS: other mechanisms that do not involve the computation of the real criterion have been used for stopping the EM algorithm and for model selection. However, further investigations must be done in this direction to assess the convergence properties of this algorithm.

At the beginning of this thesis, even if the work finally took the direction of feature selection, a big effort was devoted to the domains of outlier detection and block clustering. One of the most successful mechanisms for the detection of outliers consists in modelling the population with a mixture model where the outliers are described by a uniform distribution. This technique does not need any prior knowledge about the number or the percentage of outliers. As the basic model of this thesis is a mixture of Gaussians, our impression is that it should not be difficult to introduce a new uniform component to gather all those points that do not fit the Gaussian mixture. On the other hand, the application of penalized optimal scoring to block clustering looks more complex; but as block clustering is typically defined as a mixture model whose parameters are estimated by means of an EM algorithm, it could be possible to re-interpret that estimation using a penalized optimal scoring regression.


Appendix


A Matrix Properties

Property 1. By definition, Σ_W and Σ_B are both symmetric matrices:
\[
\Sigma_W = \frac{1}{n}\sum_{k=1}^{g}\sum_{i\in C_k}(x_i-\mu_k)(x_i-\mu_k)^\top, \qquad
\Sigma_B = \frac{1}{n}\sum_{k=1}^{g} n_k(\mu_k-\bar{x})(\mu_k-\bar{x})^\top.
\]

Property 2. \(\dfrac{\partial\, x^\top a}{\partial x} = \dfrac{\partial\, a^\top x}{\partial x} = a\)

Property 3. \(\dfrac{\partial\, x^\top A x}{\partial x} = (A + A^\top)x\)

Property 4. \(\dfrac{\partial |X^{-1}|}{\partial X} = -|X^{-1}|(X^{-1})^\top\)

Property 5. \(\dfrac{\partial\, a^\top X b}{\partial X} = ab^\top\)

Property 6. \(\dfrac{\partial}{\partial X}\operatorname{tr}\left(AX^{-1}B\right) = -(X^{-1}BAX^{-1})^\top = -X^{-\top}A^\top B^\top X^{-\top}\)


B The Penalized-OS Problem is an Eigenvector Problem

In this appendix, we answer the question of why the solution of a penalized optimal scoring regression involves the computation of an eigenvector decomposition. The p-OS problem has the form

\[
\begin{aligned}
\min_{\theta_k,\beta_k}\quad & \|Y\theta_k - X\beta_k\|_2^2 + \beta_k^\top\Omega_k\beta_k \qquad (B.1)\\
\text{s.t.}\quad & \theta_k^\top Y^\top Y\theta_k = 1,\\
& \theta_\ell^\top Y^\top Y\theta_k = 0 \quad \forall\,\ell < k,
\end{aligned}
\]

for k = 1, ..., K − 1. The Lagrangian associated with Problem (B.1) is

\[
\mathcal{L}_k(\theta_k,\beta_k,\lambda_k,\nu_k) = \|Y\theta_k - X\beta_k\|_2^2 + \beta_k^\top\Omega_k\beta_k + \lambda_k\left(\theta_k^\top Y^\top Y\theta_k - 1\right) + \sum_{\ell<k}\nu_\ell\,\theta_\ell^\top Y^\top Y\theta_k \tag{B.2}
\]

Setting the gradient of (B.2) with respect to β_k to zero gives the value of the optimal β_k:

\[
\beta_k = (X^\top X + \Omega_k)^{-1}X^\top Y\theta_k \tag{B.3}
\]

The objective function of (B.1) evaluated at β_k is

\[
\begin{aligned}
\min_{\theta_k}\ \|Y\theta_k - X\beta_k\|_2^2 + \beta_k^\top\Omega_k\beta_k &= \min_{\theta_k}\ \theta_k^\top Y^\top\left(I - X(X^\top X+\Omega_k)^{-1}X^\top\right)Y\theta_k\\
&= \max_{\theta_k}\ \theta_k^\top Y^\top X(X^\top X+\Omega_k)^{-1}X^\top Y\theta_k \qquad (B.4)
\end{aligned}
\]

If the penalty matrix Ω_k is identical for all problems, Ω_k = Ω, then (B.4) corresponds to an eigen-problem where the k score vectors θ_k are the eigenvectors of YᵀX(XᵀX + Ω)⁻¹XᵀY.

B.1 How to Solve the Eigenvector Decomposition

Computing an eigen-decomposition of an expression like YᵀX(XᵀX + Ω)⁻¹XᵀY is not trivial, due to the p × p inverse. With some datasets, p can be extremely large, making this inverse intractable. In this section, we show how to circumvent this issue by solving an easier eigenvector decomposition.


Let M be the matrix YᵀX(XᵀX + Ω)⁻¹XᵀY, such that we can rewrite expression (B.4) in a compact way:

\[
\begin{aligned}
\max_{\Theta\in\mathbb{R}^{K\times(K-1)}}\quad & \operatorname{tr}\left(\Theta^\top M\Theta\right) \qquad (B.5)\\
\text{s.t.}\quad & \Theta^\top Y^\top Y\Theta = I_{K-1}
\end{aligned}
\]

If (B.5) is an eigenvector problem, it can be reformulated in the traditional way. Let the (K−1) × (K−1) matrix M_Θ be ΘᵀMΘ. Hence, the classical eigenvector formulation associated with (B.5) is

\[
M_\Theta v = \lambda v \tag{B.6}
\]

where v is an eigenvector of M_Θ and λ its associated eigenvalue. Operating,

\[
v^\top M_\Theta v = \lambda \iff v^\top\Theta^\top M\Theta v = \lambda.
\]

Making the change of variable w = Θv, we obtain an alternative eigen-problem where w are the eigenvectors of M and λ the associated eigenvalues:

\[
w^\top M w = \lambda \tag{B.7}
\]

Therefore, v are the eigenvectors of the eigen-decomposition of matrix M_Θ and w are the eigenvectors of the eigen-decomposition of matrix M. Note that the only difference between the (K−1) × (K−1) matrix M_Θ and the K × K matrix M is the K × (K−1) matrix Θ in the expression M_Θ = ΘᵀMΘ. Then, to avoid the computation of the p × p inverse (XᵀX + Ω)⁻¹, we can use the optimal value of the coefficient matrix B = (XᵀX + Ω)⁻¹XᵀYΘ in M_Θ:

\[
M_\Theta = \Theta^\top Y^\top X(X^\top X + \Omega)^{-1}X^\top Y\Theta = \Theta^\top Y^\top X B.
\]

Thus, the eigen-decomposition of the (K−1) × (K−1) matrix M_Θ = ΘᵀYᵀXB results in the v eigenvectors of (B.6). To obtain the w eigenvectors of the alternative formulation (B.7), the change of variable w = Θv needs to be undone.

To summarize, we compute the v eigenvectors from the eigen-decomposition of a tractable M_Θ matrix, evaluated as ΘᵀYᵀXB. Then the definitive eigenvectors w are recovered by computing w = Θv. The final step is the reconstruction of the optimal score matrix Θ using the vectors w as its columns. At this point we understand what is called in the literature "updating the initial score matrix": multiplying the initial Θ by the eigenvector matrix V from decomposition (B.6) reverses the change of variable to restore the w vectors. The B matrix also needs to be "updated" by multiplying it by the same matrix of eigenvectors V, in order to account for the initial Θ matrix used in the first computation of B:

\[
B^* = (X^\top X + \Omega)^{-1}X^\top Y\Theta V = BV.
\]
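A sketch of this procedure, assuming an initial score matrix Θ0 satisfying the constraints is available (names are illustrative):

```python
import numpy as np

def penalized_os(X, Y, Omega, Theta0):
    """Solve the p-OS problem via the small (K-1)x(K-1) eigen-problem of Appendix B.1 (sketch)."""
    M = X.T @ X + Omega
    B0 = np.linalg.solve(M, X.T @ Y @ Theta0)     # B = (X'X + Omega)^{-1} X'Y Theta0
    M_theta = Theta0.T @ Y.T @ X @ B0             # (K-1) x (K-1) matrix M_Theta
    evals, V = np.linalg.eigh((M_theta + M_theta.T) / 2)
    order = np.argsort(evals)[::-1]
    V = V[:, order]
    return Theta0 @ V, B0 @ V, evals[order]       # updated scores, coefficients, eigenvalues
```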

B.2 Why the OS Problem is Solved as an Eigenvector Problem

In the optimal scoring literature, the score matrix Θ that optimizes Problem (B.1) is obtained by means of an eigenvector decomposition of the matrix M = YᵀX(XᵀX + Ω)⁻¹XᵀY.

By definition of the eigen-decomposition, the eigenvectors of the M matrix (called w in (B.7)) form a basis, so that any score vector θ can be expressed as a linear combination of them:

\[
\theta_k = \sum_{m=1}^{K-1}\alpha_m w_m, \qquad \text{s.t. } \theta_k^\top\theta_k = 1 \tag{B.8}
\]

The score vectors' normalization constraint θ_kᵀθ_k = 1 can also be expressed as a function of this basis:
\[
\left(\sum_{m=1}^{K-1}\alpha_m w_m\right)^\top\left(\sum_{m=1}^{K-1}\alpha_m w_m\right) = 1,
\]

which, as per the eigenvector properties, can be reduced to

\[
\sum_{m=1}^{K-1}\alpha_m^2 = 1 \tag{B.9}
\]

Let M be multiplied by a score vector θ_k, which can be replaced by its linear combination of eigenvectors w_m (B.8):

\[
M\theta_k = M\left(\sum_{m=1}^{K-1}\alpha_m w_m\right) = \sum_{m=1}^{K-1}\alpha_m M w_m.
\]

As the w_m are eigenvectors of the M matrix, the relationship Mw_m = λ_m w_m can be used to obtain

\[
M\theta_k = \sum_{m=1}^{K-1}\alpha_m\lambda_m w_m.
\]

Multiplying the left-hand side by θ_kᵀ and the right-hand side by its corresponding linear combination of eigenvectors,

\[
\theta_k^\top M\theta_k = \left(\sum_{\ell=1}^{K-1}\alpha_\ell w_\ell\right)^\top\left(\sum_{m=1}^{K-1}\alpha_m\lambda_m w_m\right).
\]

This equation can be simplified using the orthogonality property of eigenvectors, according to which w_ℓᵀw_m is zero for any ℓ ≠ m, giving

\[
\theta_k^\top M\theta_k = \sum_{m=1}^{K-1}\alpha_m^2\lambda_m.
\]


The optimization problem (B.5) for discriminant direction k can be rewritten as
\[
\max_{\theta_k\in\mathbb{R}^{K\times 1}}\ \theta_k^\top M\theta_k = \max_{\theta_k\in\mathbb{R}^{K\times 1}}\ \sum_{m=1}^{K-1}\alpha_m^2\lambda_m \qquad (B.10)
\]
\[
\text{with } \theta_k = \sum_{m=1}^{K-1}\alpha_m w_m \quad\text{and}\quad \sum_{m=1}^{K-1}\alpha_m^2 = 1.
\]

One way of maximizing Problem (B.10) is choosing α_m = 1 for m = k and α_m = 0 otherwise. Hence, as θ_k = ∑_{m=1}^{K−1} α_m w_m, the resulting score vector θ_k will be equal to the kth eigenvector w_k. As a summary, it can be concluded that the solution of the original problem (B.1) can be achieved by an eigenvector decomposition of the matrix M = YᵀX(XᵀX + Ω)⁻¹XᵀY.


C Solving Fisher's Discriminant Problem

The classical Fisher's discriminant problem seeks a projection that best separates the class centers while every class remains compact. This is formalized as looking for a projection such that the projected data has maximal between-class variance under a unitary constraint on the within-class variance:

\[
\begin{aligned}
\max_{\beta\in\mathbb{R}^p}\quad & \beta^\top\Sigma_B\beta \qquad (C.1a)\\
\text{s.t.}\quad & \beta^\top\Sigma_W\beta = 1, \qquad (C.1b)
\end{aligned}
\]

where Σ_B and Σ_W are respectively the between-class variance and the within-class variance of the original p-dimensional data.

The Lagrangian of Problem (C.1) is
\[
L(\beta,\nu) = \beta^\top\Sigma_B\beta - \nu\left(\beta^\top\Sigma_W\beta - 1\right),
\]

so that its first derivative with respect to β is
\[
\frac{\partial L(\beta,\nu)}{\partial\beta} = 2\Sigma_B\beta - 2\nu\Sigma_W\beta.
\]

A necessary optimality condition for β is that this derivative is zero, that is,
\[
\Sigma_B\beta = \nu\Sigma_W\beta.
\]

Provided Σ_W is full rank, we have
\[
\Sigma_W^{-1}\Sigma_B\beta = \nu\beta. \tag{C.2}
\]

Thus, the solutions β match the definition of an eigenvector of the matrix Σ_W⁻¹Σ_B with eigenvalue ν. To characterize this eigenvalue, we note that the objective function (C.1a) can be expressed as follows:
\[
\begin{aligned}
\beta^\top\Sigma_B\beta &= \beta^\top\Sigma_W\Sigma_W^{-1}\Sigma_B\beta\\
&= \nu\,\beta^\top\Sigma_W\beta &&\text{from (C.2)}\\
&= \nu &&\text{from (C.1b)}.
\end{aligned}
\]
That is, the optimal value of the objective function to be maximized is the eigenvalue ν. Hence, ν is the largest eigenvalue of Σ_W⁻¹Σ_B, and β is any eigenvector corresponding to this maximal eigenvalue.
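In practice, these directions can be obtained from the generalized symmetric eigen-problem Σ_B β = ν Σ_W β, as in the following sketch (illustrative names); the eigenvectors returned by this routine satisfy the normalization (C.1b).

```python
import numpy as np
from scipy.linalg import eigh

def fisher_directions(Sigma_W, Sigma_B, n_dirs):
    """Leading solutions of Sigma_B beta = nu Sigma_W beta, normalized so that
    beta' Sigma_W beta = 1 (constraint (C.1b)) -- a sketch of Appendix C."""
    evals, evecs = eigh(Sigma_B, Sigma_W)      # generalized symmetric eigen-problem
    order = np.argsort(evals)[::-1][:n_dirs]
    return evecs[:, order], evals[order]
```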


D Alternative Variational Formulation for the Group-Lasso

In this appendix, an alternative to the variational form of the group-Lasso (4.21) presented in Section 4.3.1 is proposed:

\[
\begin{aligned}
\min_{\tau\in\mathbb{R}^p}\ \min_{B\in\mathbb{R}^{p\times(K-1)}}\quad & J(B) + \lambda\sum_{j=1}^{p}\frac{w_j^2\left\|\beta^j\right\|_2^2}{\tau_j} & (D.1a)\\
\text{s.t.}\quad & \sum_{j=1}^{p}\tau_j = 1, & (D.1b)\\
& \tau_j \geq 0, \quad j = 1,\ldots,p. & (D.1c)
\end{aligned}
\]

Following the approach detailed in Section 4.3.1, its equivalence with the standard group-Lasso formulation is demonstrated here. Let B ∈ R^{p×(K−1)} be a matrix composed of row vectors β^j ∈ R^{K−1}: B = (β^{1⊤}, ..., β^{p⊤})ᵀ.

\[
L(B,\tau,\lambda,\nu_0,\nu_j) = J(B) + \lambda\sum_{j=1}^{p}\frac{w_j^2\left\|\beta^j\right\|_2^2}{\tau_j} + \nu_0\left(\sum_{j=1}^{p}\tau_j - 1\right) - \sum_{j=1}^{p}\nu_j\tau_j \tag{D.2}
\]

The starting point is the Lagrangian (D.2), which is differentiated with respect to τ_j to get the optimal value τ_j*:

\[
\left.\frac{\partial L(B,\tau,\lambda,\nu_0,\nu_j)}{\partial\tau_j}\right|_{\tau_j=\tau_j^*} = 0
\;\Rightarrow\; -\lambda\frac{w_j^2\left\|\beta^j\right\|_2^2}{\tau_j^{*2}} + \nu_0 - \nu_j = 0
\;\Rightarrow\; -\lambda w_j^2\left\|\beta^j\right\|_2^2 + \nu_0\tau_j^{*2} - \nu_j\tau_j^{*2} = 0
\;\Rightarrow\; -\lambda w_j^2\left\|\beta^j\right\|_2^2 + \nu_0\tau_j^{*2} = 0
\]

The last two expressions are related through a property of the Lagrange multipliers that states that ν_j g_j(τ*) = 0, where ν_j is the Lagrange multiplier and g_j(τ) is the inequality constraint. Then, the optimal τ_j* can be deduced:

\[
\tau_j^* = \sqrt{\frac{\lambda}{\nu_0}}\, w_j\left\|\beta^j\right\|_2.
\]

Plugging this optimal value of τ_j into constraint (D.1b),

\[
\sum_{j=1}^{p}\tau_j = 1 \;\Rightarrow\; \tau_j^* = \frac{w_j\left\|\beta^j\right\|_2}{\sum_{j=1}^{p}w_j\left\|\beta^j\right\|_2} \tag{D.3}
\]


With this value of τ_j*, Problem (D.1) is equivalent to

\[
\min_{B\in\mathbb{R}^{p\times(K-1)}}\ J(B) + \lambda\left(\sum_{j=1}^{p}w_j\left\|\beta^j\right\|_2\right)^2 \tag{D.4}
\]

This problem is a slight alteration of the standard group-Lasso, as the penalty is squared compared to the usual form. This square only affects the strength of the penalty, and the usual properties of the group-Lasso apply to the solution of Problem (D.4); in particular, its solution is expected to be sparse, with some null vectors β^j.

The penalty term of (D.1a) can be conveniently presented as λBᵀΩB, where

\[
\Omega = \operatorname{diag}\left(\frac{w_1^2}{\tau_1}, \frac{w_2^2}{\tau_2}, \ldots, \frac{w_p^2}{\tau_p}\right) \tag{D.5}
\]

Using the value of τ_j* from (D.3), each diagonal component of Ω is

\[
(\Omega)_{jj} = \frac{w_j\sum_{j'=1}^{p}w_{j'}\left\|\beta^{j'}\right\|_2}{\left\|\beta^j\right\|_2} \tag{D.6}
\]

In the following paragraphs, the optimality conditions and properties developed for the quadratic variational approach detailed in Section 4.3.1 are also derived for this alternative formulation.

D.1 Useful Properties

Lemma D.1. If J is convex, Problem (D.1) is convex.

In what follows, J will be a convex quadratic (hence smooth) function, in which case a necessary and sufficient optimality condition is that zero belongs to the subdifferential of the objective function, whose expression is provided in the following lemma.

Lemma D.2. For all B ∈ R^{p×(K−1)}, the subdifferential of the objective function of Problem (D.4) is

\Big\{ V \in \mathbb{R}^{p \times (K-1)} : V = \frac{\partial J(B)}{\partial B} + 2 \lambda \Big( \sum_{j=1}^{p} w_j \|\beta^j\|_2 \Big) G \Big\} ,    (D.7)

where G = (g^{1\top}, \ldots, g^{p\top})^\top is a p × (K−1) matrix whose rows g^j are defined as follows. Let S(B) denote the row support of B, S(B) = {j ∈ {1, …, p} : ‖β^j‖_2 ≠ 0}; then we have

\forall j \in S(B) , \; g^j = w_j \|\beta^j\|_2^{-1} \beta^j ,    (D.8)
\forall j \notin S(B) , \; \|g^j\|_2 \le w_j .    (D.9)


This condition results in an equality for the "active" non-zero vectors β^j and an inequality for the other ones, which both provide essential building blocks of our algorithm.

Lemma D.3. Problem (D.4) admits at least one solution, which is unique if J(B) is strictly convex. All critical points B of the objective function verifying the following conditions are global minima. Let S(B) denote the row support of B, S(B) = {j ∈ {1, …, p} : ‖β^j‖_2 ≠ 0}, and let S̄(B) be its complement; then we have

\forall j \in S(B) , \; -\frac{\partial J(B)}{\partial \beta^j} = 2 \lambda \Big( \sum_{j'=1}^{p} w_{j'} \|\beta^{j'}\|_2 \Big) w_j \|\beta^j\|_2^{-1} \beta^j ,    (D.10a)
\forall j \in \bar{S}(B) , \; \Big\| \frac{\partial J(B)}{\partial \beta^j} \Big\|_2 \le 2 \lambda w_j \sum_{j'=1}^{p} w_{j'} \|\beta^{j'}\|_2 .    (D.10b)

In particular, Lemma D.3 provides a well-defined appraisal of the support of the solution, which is not easily handled from the direct analysis of the variational problem (D.1).
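As an illustration of how these conditions can be exploited, the sketch below (a hypothetical helper, not the GLOSS implementation; the name check_optimality and its arguments are chosen here for illustration) tests a candidate B against (D.10a) and (D.10b), given the gradient of J with respect to each row:

# Illustrative sketch: test the optimality conditions (D.10) for a candidate B.
import numpy as np

def check_optimality(grad_J, B, w, lam, tol=1e-8):
    """grad_J: (p, K-1) array of dJ/d(beta^j); B: (p, K-1); w: (p,) positive weights."""
    row_norms = np.linalg.norm(B, axis=1)
    total = np.sum(w * row_norms)                  # sum_j' w_j' ||beta^j'||_2
    active = row_norms > tol                       # S(B), the row support
    ok_active = np.allclose(                       # condition (D.10a) on active rows
        -grad_J[active],
        2 * lam * total * (w[active] / row_norms[active])[:, None] * B[active])
    ok_inactive = np.all(                          # condition (D.10b) on inactive rows
        np.linalg.norm(grad_J[~active], axis=1) <= 2 * lam * w[~active] * total + tol)
    return ok_active and ok_inactive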

D.2 An Upper Bound on the Objective Function

Lemma D.4. The objective function of the variational form (D.1) is an upper bound on the group-Lasso objective function (D.4), and, for a given B, the gap between these objectives is null at τ* such that

\tau_j^\star = \frac{w_j \|\beta^j\|_2}{\sum_{j'=1}^{p} w_{j'} \|\beta^{j'}\|_2} .

Proof. The objective functions of (D.1) and (D.4) only differ in their second term. Let τ ∈ R^p be any feasible vector; we have

\Big( \sum_{j=1}^{p} w_j \|\beta^j\|_2 \Big)^2 = \Bigg( \sum_{j=1}^{p} \tau_j^{1/2} \, \frac{w_j \|\beta^j\|_2}{\tau_j^{1/2}} \Bigg)^2
\le \Big( \sum_{j=1}^{p} \tau_j \Big) \Bigg( \sum_{j=1}^{p} w_j^2 \, \frac{\|\beta^j\|_2^2}{\tau_j} \Bigg)
\le \sum_{j=1}^{p} w_j^2 \, \frac{\|\beta^j\|_2^2}{\tau_j} ,

where we used the Cauchy-Schwarz inequality in the second line and the definition of the feasibility set of τ in the last one. The Cauchy-Schwarz inequality holds with equality when w_j ‖β^j‖_2 / τ_j is constant across j, which is precisely the case at τ = τ*, so the bound is attained there.


This lemma only holds for the alternative variational formulation described in this appendix. It is difficult to have the same result in the first variational form (Section 4.3.1) because the definitions of the feasible sets of τ and β are intertwined.
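A quick numerical check of Lemma D.4 (illustrative only; the random matrices and weights are arbitrary) can be written as follows: the variational penalty of (D.1a) evaluated at τ* coincides with the squared group-Lasso penalty of (D.4), while any other feasible τ gives a larger value.

# Illustrative check of Lemma D.4.
import numpy as np

rng = np.random.default_rng(1)
p, K = 8, 4
B = rng.normal(size=(p, K - 1))
w = rng.uniform(0.5, 2.0, size=p)
row_norms = np.linalg.norm(B, axis=1)

def variational_penalty(tau):
    return np.sum(w ** 2 * row_norms ** 2 / tau)          # second term of (D.1a)

group_lasso_penalty = np.sum(w * row_norms) ** 2          # second term of (D.4)

tau_star = w * row_norms / np.sum(w * row_norms)          # equation (D.3)
tau_rand = rng.dirichlet(np.ones(p))                      # any feasible tau (sums to one)

assert np.isclose(variational_penalty(tau_star), group_lasso_penalty)
assert variational_penalty(tau_rand) >= group_lasso_penalty - 1e-10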


E Invariance of the Group-Lasso to Unitary Transformations

The computational trick described in Section 5.2 for quadratic penalties can be applied to the group-Lasso provided that the following holds: if the regression coefficients B^0 are optimal for the score values Θ^0, and if the optimal scores Θ are obtained by a unitary transformation of Θ^0, say Θ = Θ^0 V (where V ∈ R^{M×M} is a unitary matrix), then B = B^0 V is optimal conditionally on Θ, that is, (Θ, B) is a global solution corresponding to the optimal scoring problem. To show this, we use the standard group-Lasso formulation and show the following proposition.

Proposition E.1. Let \hat{B} be a solution of

\min_{B \in \mathbb{R}^{p \times M}} \; \|Y - XB\|_F^2 + \lambda \sum_{j=1}^{p} w_j \|\beta^j\|_2 ,    (E.1)

and let \tilde{Y} = YV, where V ∈ R^{M×M} is a unitary matrix. Then \tilde{B} = \hat{B}V is a solution of

\min_{B \in \mathbb{R}^{p \times M}} \; \|\tilde{Y} - XB\|_F^2 + \lambda \sum_{j=1}^{p} w_j \|\beta^j\|_2 .    (E.2)

Proof. The first-order necessary optimality conditions for \hat{B} are

\forall j \in S(\hat{B}) , \; 2 \, x^{j\top} \big( X\hat{B} - Y \big) + \lambda w_j \|\hat{\beta}^j\|_2^{-1} \hat{\beta}^j = 0 ,    (E.3a)
\forall j \notin S(\hat{B}) , \; 2 \, \big\| x^{j\top} \big( X\hat{B} - Y \big) \big\|_2 \le \lambda w_j ,    (E.3b)

where S(\hat{B}) ⊆ {1, …, p} denotes the set of non-zero row vectors of \hat{B} and \bar{S}(\hat{B}) is its complement.

First, we note that, from the definition of \tilde{B}, we have S(\tilde{B}) = S(\hat{B}). Then we may rewrite the above conditions as follows:

\forall j \in S(\tilde{B}) , \; 2 \, x^{j\top} \big( X\tilde{B} - \tilde{Y} \big) + \lambda w_j \|\tilde{\beta}^j\|_2^{-1} \tilde{\beta}^j = 0 ,    (E.4a)
\forall j \notin S(\tilde{B}) , \; 2 \, \big\| x^{j\top} \big( X\tilde{B} - \tilde{Y} \big) \big\|_2 \le \lambda w_j ,    (E.4b)

where (E.4a) is obtained by multiplying both sides of Equation (E.3a) by V, and also uses that VV^⊤ = I, so that, for all u ∈ R^M, ‖u^⊤‖_2 = ‖u^⊤V‖_2; Equation (E.4b) is also obtained from the latter relationship. Conditions (E.4) are then recognized as the first-order necessary conditions for \tilde{B} to be a solution to Problem (E.2). As the latter is convex, these conditions are sufficient, which concludes the proof.
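The invariance argument can also be checked numerically. The following sketch (illustrative, not the thesis code; it relies on scipy.stats.ortho_group to draw a random orthogonal matrix) verifies that the objective of (E.1) evaluated at (Y, B) equals the objective of (E.2) evaluated at (YV, BV):

# Illustrative check: the group-Lasso objective is invariant to a common unitary
# transformation of Y and B (Frobenius norm and row norms are both rotation-invariant).
import numpy as np
from scipy.stats import ortho_group

rng = np.random.default_rng(2)
n, p, M = 30, 10, 3
X = rng.normal(size=(n, p))
Y = rng.normal(size=(n, M))
B = rng.normal(size=(p, M))
w = np.ones(p)
lam = 0.1

def objective(Y_mat, B_mat):
    return (np.linalg.norm(Y_mat - X @ B_mat, 'fro') ** 2
            + lam * np.sum(w * np.linalg.norm(B_mat, axis=1)))

V = ortho_group.rvs(M, random_state=3)    # a random unitary (orthogonal) matrix
assert np.isclose(objective(Y, B), objective(Y @ V, B @ V))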


F Expected Complete Likelihood and Likelihood

Section 7.1.2 explains that, with the maximization of the conditional expectation of the complete log-likelihood Q(θ, θ′) (7.7) by means of the EM algorithm, the log-likelihood (7.1) is also maximized. The value of the log-likelihood can be computed using its definition (7.1), but there is a shorter way to compute it from Q(θ, θ′) when the latter is available:

L(\theta) = \sum_{i=1}^{n} \log \Big( \sum_{k=1}^{K} \pi_k f_k(x_i; \theta_k) \Big) ,    (F.1)

Q(\theta, \theta') = \sum_{i=1}^{n} \sum_{k=1}^{K} t_{ik}(\theta') \log \big( \pi_k f_k(x_i; \theta_k) \big) ,    (F.2)

\text{with} \quad t_{ik}(\theta') = \frac{\pi'_k f_k(x_i; \theta'_k)}{\sum_{\ell} \pi'_\ell f_\ell(x_i; \theta'_\ell)} .    (F.3)

In the EM algorithm, θ′ denotes the model parameters at the previous iteration, the t_{ik}(θ′) are the posterior probability values computed from θ′ at the previous E-step, and θ (without "prime") denotes the parameters of the current iteration, to be obtained with the maximization of Q(θ, θ′).

Using (F.3), we have

Q(\theta, \theta') = \sum_{i,k} t_{ik}(\theta') \log \big( \pi_k f_k(x_i; \theta_k) \big)
= \sum_{i,k} t_{ik}(\theta') \log\big(t_{ik}(\theta)\big) + \sum_{i,k} t_{ik}(\theta') \log \Big( \sum_{\ell} \pi_\ell f_\ell(x_i; \theta_\ell) \Big)
= \sum_{i,k} t_{ik}(\theta') \log\big(t_{ik}(\theta)\big) + L(\theta) .

In particular, after the evaluation of the t_{ik} in the E-step, where θ = θ′, the log-likelihood can be computed using the value of Q(θ, θ) (7.7) and the entropy of the posterior probabilities:

L(\theta) = Q(\theta, \theta) - \sum_{i,k} t_{ik}(\theta) \log\big(t_{ik}(\theta)\big)
          = Q(\theta, \theta) + H(T) .
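This identity can be verified numerically on a toy Gaussian mixture. In the sketch below (illustrative only, with arbitrary parameters), dens holds the terms π_k f_k(x_i; θ_k), T the posteriors (F.3), and the assertion checks L(θ) = Q(θ, θ) + H(T):

# Illustrative check of L(theta) = Q(theta, theta) + H(T) for a Gaussian mixture.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(4)
n, d, K = 200, 2, 3
X = rng.normal(size=(n, d))
pi = np.array([0.2, 0.3, 0.5])
mus = rng.normal(size=(K, d))
Sigma = np.eye(d)

dens = np.column_stack([pi[k] * multivariate_normal.pdf(X, mus[k], Sigma) for k in range(K)])
T = dens / dens.sum(axis=1, keepdims=True)     # posterior probabilities t_ik, as in (F.3)

L = np.sum(np.log(dens.sum(axis=1)))           # log-likelihood (F.1)
Q = np.sum(T * np.log(dens))                   # expected complete log-likelihood (F.2)
H = -np.sum(T * np.log(T))                     # entropy of the posteriors
assert np.isclose(L, Q + H)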


G Derivation of the M-Step Equations

This appendix shows the whole process to obtain expressions (7.10), (7.11) and (7.12) in the context of a Gaussian mixture model with a common covariance matrix. The criterion is defined as

Q(\theta, \theta') = \sum_{i,k} t_{ik}(\theta') \log\big( \pi_k f_k(x_i; \theta_k) \big)
= \sum_{k} \Big( \sum_{i} t_{ik} \Big) \log \pi_k - \frac{np}{2} \log(2\pi) - \frac{n}{2} \log |\Sigma| - \frac{1}{2} \sum_{i,k} t_{ik} \, (x_i - \mu_k)^\top \Sigma^{-1} (x_i - \mu_k) ,

which has to be maximized subject to \sum_k \pi_k = 1.

The Lagrangian of this problem is

L(\theta) = Q(\theta, \theta') + \lambda \Big( \sum_{k} \pi_k - 1 \Big) .

The partial derivatives of the Lagrangian are set to zero to obtain the optimal values of π_k, μ_k and Σ.

G.1 Prior probabilities

\frac{\partial L(\theta)}{\partial \pi_k} = 0 \;\Leftrightarrow\; \frac{1}{\pi_k} \sum_{i} t_{ik} + \lambda = 0 ,

where λ is identified from the constraint, leading to

\pi_k = \frac{1}{n} \sum_{i} t_{ik} .


G.2 Means

\frac{\partial L(\theta)}{\partial \mu_k} = 0 \;\Leftrightarrow\; -\frac{1}{2} \sum_{i} t_{ik} \, 2 \Sigma^{-1} (\mu_k - x_i) = 0
\;\Rightarrow\; \mu_k = \frac{\sum_i t_{ik} \, x_i}{\sum_i t_{ik}} .

G.3 Covariance Matrix

\frac{\partial L(\theta)}{\partial \Sigma^{-1}} = 0 \;\Leftrightarrow\; \underbrace{\frac{n}{2} \Sigma}_{\text{as per property 4}} - \underbrace{\frac{1}{2} \sum_{i,k} t_{ik} (x_i - \mu_k)(x_i - \mu_k)^\top}_{\text{as per property 5}} = 0
\;\Rightarrow\; \Sigma = \frac{1}{n} \sum_{i,k} t_{ik} (x_i - \mu_k)(x_i - \mu_k)^\top .
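Gathering the three updates, a minimal M-step can be sketched as follows (illustrative code, not the Mix-GLOSS implementation; m_step is a name chosen here and T is the matrix of responsibilities computed in the E-step):

# Illustrative M-step for a Gaussian mixture with a common covariance matrix.
import numpy as np

def m_step(X, T):
    """X: (n, p) data; T: (n, K) posterior probabilities t_ik from the E-step."""
    n, p = X.shape
    nk = T.sum(axis=0)                             # sum_i t_ik, one value per cluster
    pi = nk / n                                    # G.1: pi_k = (1/n) sum_i t_ik
    mu = (T.T @ X) / nk[:, None]                   # G.2: mu_k = sum_i t_ik x_i / sum_i t_ik
    Sigma = np.zeros((p, p))
    for k in range(T.shape[1]):
        D = X - mu[k]
        Sigma += (T[:, k][:, None] * D).T @ D      # sum_i t_ik (x_i - mu_k)(x_i - mu_k)^T
    Sigma /= n                                     # G.3: common covariance matrix
    return pi, mu, Sigma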


Bibliography

F Bach R Jenatton J Mairal and G Obozinski Convex optimization with sparsity-inducing norms Optimization for Machine Learning pages 19ndash54 2011

F R Bach Bolasso model consistent lasso estimation through the bootstrap InProceedings of the 25th international conference on Machine learning ICML 2008

F R Bach R Jenatton J Mairal and G Obozinski Optimization with sparsity-inducing penalties Foundations and Trends in Machine Learning 4(1)1ndash106 2012

J D Banfield and A E Raftery Model-based Gaussian and non-Gaussian clusteringBiometrics pages 803ndash821 1993

A Beck and M Teboulle A fast iterative shrinkage-thresholding algorithm for linearinverse problems SIAM Journal on Imaging Sciences 2(1)183ndash202 2009

H Bensmail and G Celeux Regularized Gaussian discriminant analysis through eigen-value decomposition Journal of the American statistical Association 91(436)1743ndash1748 1996

P J Bickel and E Levina Some theory for Fisher's linear discriminant function, 'naive Bayes', and some alternatives when there are many more variables than observations Bernoulli 10(6)989-1010 2004

C Biernacki G Celeux G Govaert and F Langrognet MIXMOD Statistical Documentation http://www.mixmod.org 2008

C M Bishop Pattern Recognition and Machine Learning Springer New York 2006

C Bouveyron and C Brunet Discriminative variable selection for clustering with thesparse Fisher-EM algorithm Technical Report 12042067 Arxiv e-prints 2012a

C Bouveyron and C Brunet Simultaneous model-based clustering and visualization inthe Fisher discriminative subspace Statistics and Computing 22(1)301ndash324 2012b

S Boyd and L Vandenberghe Convex optimization Cambridge university press 2004

L Breiman Better subset regression using the nonnegative garrote Technometrics 37(4)373ndash384 1995

L Breiman and R Ihaka Nonlinear discriminant analysis via ACE and scaling TechnicalReport 40 University of California Berkeley 1984


T Cai and W Liu A direct estimation approach to sparse linear discriminant analysisJournal of the American Statistical Association 106(496)1566ndash1577 2011

S Canu and Y Grandvalet Outcomes of the equivalence of adaptive ridge with leastabsolute shrinkage Advances in Neural Information Processing Systems page 4451999

C Caramanis S Mannor and H Xu Robust optimization in machine learning InS Sra S Nowozin and S J Wright editors Optimization for Machine Learningpages 369ndash402 MIT Press 2012

B Chidlovskii and L Lecerf Scalable feature selection for multi-class problems InW Daelemans B Goethals and K Morik editors Machine Learning and KnowledgeDiscovery in Databases volume 5211 of Lecture Notes in Computer Science pages227ndash240 Springer 2008

L Clemmensen T Hastie D Witten and B Ersboslashll Sparse discriminant analysisTechnometrics 53(4)406ndash413 2011

C De Mol E De Vito and L Rosasco Elastic-net regularization in learning theoryJournal of Complexity 25(2)201ndash230 2009

A P Dempster N M Laird and D B Rubin Maximum likelihood from incompletedata via the em algorithm Journal of the Royal Statistical Society Series B (Method-ological) 39(1)1ndash38 1977 ISSN 0035-9246

D L Donoho M Elad and V N Temlyakov Stable recovery of sparse overcompleterepresentations in the presence of noise IEEE Transactions on Information Theory52(1)6ndash18 2006

R O Duda P E Hart and D G Stork Pattern Classification Wiley 2000

B Efron T Hastie I Johnstone and R Tibshirani Least angle regression The Annalsof statistics 32(2)407ndash499 2004

Jianqing Fan and Yingying Fan High dimensional classification using features annealedindependence rules Annals of statistics 36(6)2605 2008

R A Fisher The use of multiple measurements in taxonomic problems Annals ofHuman Genetics 7(2)179ndash188 1936

V Franc and S Sonnenburg Optimized cutting plane algorithm for support vectormachines In Proceedings of the 25th international conference on Machine learningpages 320ndash327 ACM 2008

J Friedman T Hastie and R Tibshirani The Elements of Statistical Learning DataMining Inference and Prediction Springer 2009


J Friedman T Hastie and R Tibshirani A note on the group lasso and a sparse grouplasso Technical Report 10010736 ArXiv e-prints 2010

J H Friedman Regularized discriminant analysis Journal of the American StatisticalAssociation 84(405)165ndash175 1989

W J Fu Penalized regressions the bridge versus the lasso Journal of Computationaland Graphical Statistics 7(3)397ndash416 1998

A Gelman J B Carlin H S Stern and D B Rubin Bayesian Data Analysis Chap-man amp HallCRC 2003

D Ghosh and A M Chinnaiyan Classification and selection of biomarkers in genomicdata using lasso Journal of Biomedicine and Biotechnology 2147ndash154 2005

G Govaert Y Grandvalet X Liu and L F Sanchez Merchante Implementation baseline for clustering Technical Report D71-m12 Massive Sets of Heuristics for Machine Learning https://secure.mash-project.eu/files/mash-deliverable-D71-m12.pdf 2010

G Govaert Y Grandvalet B Laval X Liu and L F Sanchez Merchante Implementations of original clustering Technical Report D72-m24 Massive Sets of Heuristics for Machine Learning https://secure.mash-project.eu/files/mash-deliverable-D72-m24.pdf 2011

Y Grandvalet Least absolute shrinkage is equivalent to quadratic penalization InPerspectives in Neural Computing volume 98 pages 201ndash206 1998

Y Grandvalet and S Canu Adaptive scaling for feature selection in svms Advances inNeural Information Processing Systems 15553ndash560 2002

L Grosenick S Greer and B Knutson Interpretable classifiers for fMRI improveprediction of purchases IEEE Transactions on Neural Systems and RehabilitationEngineering 16(6)539ndash548 2008

Y Guermeur G Pollastri A Elisseeff D Zelus H Paugam-Moisy and P Baldi Com-bining protein secondary structure prediction models with ensemble methods of opti-mal complexity Neurocomputing 56305ndash327 2004

J Guo E Levina G Michailidis and J Zhu Pairwise variable selection for high-dimensional model-based clustering Biometrics 66(3)793ndash804 2010

I Guyon and A Elisseeff An introduction to variable and feature selection Journal ofMachine Learning Research 31157ndash1182 2003

T Hastie and R Tibshirani Discriminant analysis by Gaussian mixtures Journal ofthe Royal Statistical Society Series B (Methodological) 58(1)155ndash176 1996

T Hastie R Tibshirani and A Buja Flexible discriminant analysis by optimal scoringJournal of the American Statistical Association 89(428)1255ndash1270 1994


T Hastie A Buja and R Tibshirani Penalized discriminant analysis The Annals ofStatistics 23(1)73ndash102 1995

A E Hoerl and R W Kennard Ridge regression Biased estimation for nonorthogonalproblems Technometrics 12(1)55ndash67 1970

J Huang S Ma H Xie and C H Zhang A group bridge approach for variableselection Biometrika 96(2)339ndash355 2009

T Joachims Training linear svms in linear time In Proceedings of the 12th ACMSIGKDD international conference on Knowledge discovery and data mining pages217ndash226 ACM 2006

K Knight and W Fu Asymptotics for lasso-type estimators The Annals of Statistics28(5)1356ndash1378 2000

P F Kuan S Wang X Zhou and H Chu A statistical framework for illumina DNAmethylation arrays Bioinformatics 26(22)2849ndash2855 2010

T Lange M Braun V Roth and J Buhmann Stability-based model selection Ad-vances in Neural Information Processing Systems 15617ndash624 2002

M H C Law M A T Figueiredo and A K Jain Simultaneous feature selectionand clustering using mixture models IEEE Transactions on Pattern Analysis andMachine Intelligence 26(9)1154ndash1166 2004

Y Lee Y Lin and G Wahba Multicategory support vector machines Journal of theAmerican Statistical Association 99(465)67ndash81 2004

C Leng Sparse optimal scoring for multiclass cancer diagnosis and biomarker detectionusing microarray data Computational Biology and Chemistry 32(6)417ndash425 2008

C Leng Y Lin and G Wahba A note on the lasso and related procedures in modelselection Statistica Sinica 16(4)1273 2006

H Liu and L Yu Toward integrating feature selection algorithms for classification andclustering IEEE Transactions on Knowledge and Data Engineering 17(4)491ndash5022005

J MacQueen Some methods for classification and analysis of multivariate observa-tions In Proceedings of the fifth Berkeley Symposium on Mathematical Statistics andProbability volume 1 pages 281ndash297 University of California Press 1967

Q Mai H Zou and M Yuan A direct approach to sparse discriminant analysis inultra-high dimensions Biometrika 99(1)29ndash42 2012

C Maugis G Celeux and M L Martin-Magniette Variable selection for clusteringwith Gaussian mixture models Biometrics 65(3)701ndash709 2009a


C Maugis G Celeux and M L Martin-Magniette SelvarClust software for variable selection in model-based clustering http://www.math.univ-toulouse.fr/~maugis/SelvarClustHomepage.html 2009b

L Meier S Van De Geer and P Buhlmann The group lasso for logistic regressionJournal of the Royal Statistical Society Series B (Statistical Methodology) 70(1)53ndash71 2008

N Meinshausen and P Buhlmann High-dimensional graphs and variable selection withthe lasso The Annals of Statistics 34(3)1436ndash1462 2006

B Moghaddam Y Weiss and S Avidan Generalized spectral bounds for sparse LDAIn Proceedings of the 23rd international conference on Machine learning pages 641ndash648 ACM 2006

B Moghaddam Y Weiss and S Avidan Fast pixelpart selection with sparse eigen-vectors In IEEE 11th International Conference on Computer Vision 2007 ICCV2007 pages 1ndash8 2007

Y Nesterov Gradient methods for minimizing composite functions preprint 2007

S Newcomb A generalized theory of the combination of observations so as to obtainthe best result American Journal of Mathematics 8(4)343ndash366 1886

B Ng and R Abugharbieh Generalized group sparse classifiers with application in fMRIbrain decoding In Computer Vision and Pattern Recognition (CVPR) 2011 IEEEConference on pages 1065ndash1071 IEEE 2011

M R Osborne B Presnell and B A Turlach On the lasso and its dual Journal ofComputational and Graphical statistics 9(2)319ndash337 2000a

M R Osborne B Presnell and B A Turlach A new approach to variable selection inleast squares problems IMA Journal of Numerical Analysis 20(3)389ndash403 2000b

W Pan and X Shen Penalized model-based clustering with application to variableselection Journal of Machine Learning Research 81145ndash1164 2007

W Pan X Shen A Jiang and R P Hebbel Semi-supervised learning via penalizedmixture model with application to microarray sample classification Bioinformatics22(19)2388ndash2395 2006

K Pearson Contributions to the mathematical theory of evolution Philosophical Trans-actions of the Royal Society of London 18571ndash110 1894

S Perkins K Lacker and J Theiler Grafting Fast incremental feature selection bygradient descent in function space Journal of Machine Learning Research 31333ndash1356 2003


Z Qiao L Zhou and J Huang Sparse linear discriminant analysis with applications tohigh dimensional low sample size data International Journal of Applied Mathematics39(1) 2009

A E Raftery and N Dean Variable selection for model-based clustering Journal ofthe American Statistical Association 101(473)168ndash178 2006

C R Rao The utilization of multiple measurements in problems of biological classi-fication Journal of the Royal Statistical Society Series B (Methodological) 10(2)159ndash203 1948

S Rosset and J Zhu Piecewise linear regularized solution paths The Annals of Statis-tics 35(3)1012ndash1030 2007

V Roth The generalized lasso IEEE Transactions on Neural Networks 15(1)16ndash282004

V Roth and B Fischer The group-lasso for generalized linear models uniqueness ofsolutions and efficient algorithms In W W Cohen A McCallum and S T Roweiseditors Machine Learning Proceedings of the Twenty-Fifth International Conference(ICML 2008) volume 307 of ACM International Conference Proceeding Series pages848ndash855 2008

V Roth and T Lange Feature selection in clustering problems In S Thrun L KSaul and B Scholkopf editors Advances in Neural Information Processing Systems16 pages 473ndash480 MIT Press 2004

C Sammut and G I Webb Encyclopedia of Machine Learning Springer-Verlag NewYork Inc 2010

L F Sanchez Merchante Y Grandvalet and G Govaert An efficient approach to sparselinear discriminant analysis In Proceedings of the 29th International Conference onMachine Learning ICML 2012

Gideon Schwarz Estimating the dimension of a model The annals of statistics 6(2)461ndash464 1978

A J Smola SVN Vishwanathan and Q Le Bundle methods for machine learningAdvances in Neural Information Processing Systems 201377ndash1384 2008

S Sonnenburg G Ratsch C Schafer and B Scholkopf Large scale multiple kernellearning Journal of Machine Learning Research 71531ndash1565 2006

P Sprechmann I Ramirez G Sapiro and Y Eldar Collaborative hierarchical sparsemodeling In Information Sciences and Systems (CISS) 2010 44th Annual Conferenceon pages 1ndash6 IEEE 2010

M Szafranski Penalites Hierarchiques pour lrsquoIntegration de Connaissances dans lesModeles Statistiques PhD thesis Universite de Technologie de Compiegne 2008


M Szafranski Y Grandvalet and P Morizet-Mahoudeaux Hierarchical penalizationAdvances in Neural Information Processing Systems 2008

R Tibshirani Regression shrinkage and selection via the lasso Journal of the RoyalStatistical Society Series B (Methodological) pages 267ndash288 1996

J E Vogt and V Roth The group-lasso l1 regularization versus l12 regularization InPattern Recognition 32-nd DAGM Symposium Lecture Notes in Computer Science2010

S Wang and J Zhu Variable selection for model-based high-dimensional clustering andits application to microarray data Biometrics 64(2)440ndash448 2008

D Witten and R Tibshirani Penalized classification using Fisherrsquos linear discriminantJournal of the Royal Statistical Society Series B (Statistical Methodology) 73(5)753ndash772 2011

D M Witten and R Tibshirani A framework for feature selection in clustering Journalof the American Statistical Association 105(490)713ndash726 2010

D M Witten R Tibshirani and T Hastie A penalized matrix decomposition withapplications to sparse principal components and canonical correlation analysis Bio-statistics 10(3)515ndash534 2009

M Wu and B Scholkopf A local learning approach for clustering Advances in NeuralInformation Processing Systems 191529 2007

MC Wu L Zhang Z Wang DC Christiani and X Lin Sparse linear discriminantanalysis for simultaneous testing for the significance of a gene setpathway and geneselection Bioinformatics 25(9)1145ndash1151 2009

T T Wu and K Lange Coordinate descent algorithms for lasso penalized regressionThe Annals of Applied Statistics pages 224ndash244 2008

B Xie W Pan and X Shen Penalized model-based clustering with cluster-specificdiagonal covariance matrices and grouped variables Electronic Journal of Statistics2168ndash172 2008a

B Xie W Pan and X Shen Variable selection in penalized model-based clustering viaregularization on grouped parameters Biometrics 64(3)921ndash930 2008b

C Yang X Wan Q Yang H Xue and W Yu Identifying main effects and epistaticinteractions from large-scale snp data via adaptive group lasso BMC bioinformatics11(Suppl 1)S18 2010

J Ye Least squares linear discriminant analysis In Proceedings of the 24th internationalconference on Machine learning pages 1087ndash1093 ACM 2007


M Yuan and Y Lin Model selection and estimation in regression with grouped variablesJournal of the Royal Statistical Society Series B (Statistical Methodology) 68(1)49ndash67 2006

P Zhao and B Yu On model selection consistency of lasso Journal of Machine LearningResearch 7(2)2541 2007

P Zhao G Rocha and B Yu The composite absolute penalties family for grouped andhierarchical variable selection The Annals of Statistics 37(6A)3468ndash3497 2009

H Zhou W Pan and X Shen Penalized model-based clustering with unconstrainedcovariance matrices Electronic Journal of Statistics 31473ndash1496 2009

H Zou The adaptive lasso and its oracle properties Journal of the American StatisticalAssociation 101(476)1418ndash1429 2006

H Zou and T Hastie Regularization and variable selection via the elastic net Journal ofthe Royal Statistical Society Series B (Statistical Methodology) 67(2)301ndash320 2005



Acknowledgements

If this thesis has fallen into your hands and you have the curiosity to read this para-graph you must know that even though it is a short section there are quite a lot ofpeople behind this volume All of them supported me during the three years threemonths and three weeks that it took me to finish this work However you will hardlyfind any names I think it is a little sad writing peoplersquos names in a document that theywill probably not see and that will be condemned to gather dust on a bookshelf It islike losing a wallet with pictures of your beloved family and friends It makes me feelsomething like melancholy

Obviously this does not mean that I have nothing to be grateful for I always feltunconditional love and support from my family and I never felt homesick since my spanishfriends did the best they could to visit me frequently During my time in CompiegneI met wonderful people that are now friends for life I am sure that all this people donot need to be listed in this section to know how much I love them I thank them everytime we see each other by giving them the best of myself

I enjoyed my time in Compiègne. It was an exciting adventure and I do not regret a single thing. I am sure that I will miss these days, but this does not make me sad because, as the Beatles sang in "The End" or Jorge Drexler in "Todo se transforma", the amount that you miss people is equal to the love you gave them and received from them.

The only names I am including are my supervisors', Yves Grandvalet and Gérard Govaert. I do not think it is possible to have had better teaching and supervision, and I am sure that the reason I finished this work was not only thanks to their technical advice but also thanks to their close support, humanity and patience.

Contents

List of figures v

List of tables vii

Notation and Symbols ix

I Context and Foundations 1

1 Context 5

2 Regularization for Feature Selection 921 Motivations 9

22 Categorization of Feature Selection Techniques 11

23 Regularization 13

231 Important Properties 14

232 Pure Penalties 14

233 Hybrid Penalties 18

234 Mixed Penalties 19

235 Sparsity Considerations 19

236 Optimization Tools for Regularized Problems 21

II Sparse Linear Discriminant Analysis 25

Abstract 27

3 Feature Selection in Fisher Discriminant Analysis 2931 Fisher Discriminant Analysis 29

32 Feature Selection in LDA Problems 30

321 Inertia Based 30

322 Regression Based 32

4 Formalizing the Objective 3541 From Optimal Scoring to Linear Discriminant Analysis 35

411 Penalized Optimal Scoring Problem 36

412 Penalized Canonical Correlation Analysis 37


413 Penalized Linear Discriminant Analysis 39

414 Summary 40

42 Practicalities 41

421 Solution of the Penalized Optimal Scoring Regression 41

422 Distance Evaluation 42

423 Posterior Probability Evaluation 43

424 Graphical Representation 43

43 From Sparse Optimal Scoring to Sparse LDA 43

431 A Quadratic Variational Form 44

432 Group-Lasso OS as Penalized LDA 47

5 GLOSS Algorithm 4951 Regression Coefficients Updates 49

511 Cholesky decomposition 52

512 Numerical Stability 52

52 Score Matrix 52

53 Optimality Conditions 53

54 Active and Inactive Sets 54

55 Penalty Parameter 54

56 Options and Variants 55

561 Scaling Variables 55

562 Sparse Variant 55

563 Diagonal Variant 55

564 Elastic net and Structured Variant 55

6 Experimental Results 5761 Normalization 57

62 Decision Thresholds 57

63 Simulated Data 58

64 Gene Expression Data 60

65 Correlated Data 63

Discussion 63

III Sparse Clustering Analysis 67

Abstract 69

7 Feature Selection in Mixture Models 7171 Mixture Models 71

711 Model 71

712 Parameter Estimation The EM Algorithm 72


72 Feature Selection in Model-Based Clustering 75721 Based on Penalized Likelihood 76722 Based on Model Variants 77723 Based on Model Selection 79

8 Theoretical Foundations 8181 Resolving EM with Optimal Scoring 81

811 Relationship Between the M-Step and Linear Discriminant Analysis 81812 Relationship Between Optimal Scoring and Linear Discriminant

Analysis 82813 Clustering Using Penalized Optimal Scoring 82814 From Sparse Optimal Scoring to Sparse Linear Discriminant Analysis 83

82 Optimized Criterion 83821 A Bayesian Derivation 84822 Maximum a Posteriori Estimator 85

9 Mix-GLOSS Algorithm 8791 Mix-GLOSS 87

911 Outer Loop Whole Algorithm Repetitions 87912 Penalty Parameter Loop 88913 Inner Loop EM Algorithm 89

92 Model Selection 91

10Experimental Results 93101 Tested Clustering Algorithms 93102 Results 95103 Discussion 97

Conclusions 97

Appendix 103

A Matrix Properties 105

B The Penalized-OS Problem is an Eigenvector Problem 107B1 How to Solve the Eigenvector Decomposition 107B2 Why the OS Problem is Solved as an Eigenvector Problem 109

C Solving Fisherrsquos Discriminant Problem 111

D Alternative Variational Formulation for the Group-Lasso 113D1 Useful Properties 114D2 An Upper Bound on the Objective Function 115


E Invariance of the Group-Lasso to Unitary Transformations 117

F Expected Complete Likelihood and Likelihood 119

G Derivation of the M-Step Equations 121G1 Prior probabilities 121G2 Means 122G3 Covariance Matrix 122

Bibliography 123


List of Figures

11 MASH project logo 5

21 Example of relevant features 1022 Four key steps of feature selection 1123 Admissible sets in two dimensions for different pure norms ||β||p 1424 Two dimensional regularized problems with ||β||1 and ||β||2 penalties 1525 Admissible sets for the Lasso and Group-Lasso 2026 Sparsity patterns for an example with 8 variables characterized by 4 pa-

rameters 20

41 Graphical representation of the variational approach to Group-Lasso 45

51 GLOSS block diagram 5052 Graph and Laplacian matrix for a 3times 3 image 56

61 TPR versus FPR for all simulations 6062 2D-representations of Nakayama and Sun datasets based on the two first

discriminant vectors provided by GLOSS and SLDA 6263 USPS digits ldquo1rdquo and ldquo0rdquo 6364 Discriminant direction between digits ldquo1rdquo and ldquo0rdquo 6465 Sparse discriminant direction between digits ldquo1rdquo and ldquo0rdquo 64

91 Mix-GLOSS Loops Scheme 8892 Mix-GLOSS model selection diagram 92

101 Class mean vectors for each artificial simulation 94102 TPR versus FPR for all simulations 97


List of Tables

61 Experimental results for simulated data supervised classification 5962 Average TPR and FPR for all simulations 6063 Experimental results for gene expression data supervised classification 61

101 Experimental results for simulated data unsupervised clustering 96102 Average TPR versus FPR for all clustering simulations 96


Notation and Symbols

Throughout this thesis vectors are denoted by lowercase letters in bold font andmatrices by uppercase letters in bold font Unless otherwise stated vectors are columnvectors and parentheses are used to build line vectors from comma-separated lists ofscalars or to build matrices from comma-separated lists of column vectors

Sets

N the set of natural numbers N = 1 2 R the set of reals|A| cardinality of a set A (for finite sets the number of elements)A complement of set A

Data

X input domainxi input sample xi isin XX design matrix X = (xgt1 x

gtn )gt

xj column j of Xyi class indicator of sample i

Y indicator matrix Y = (ygt1 ygtn )gt

z complete data z = (xy)Gk set of the indices of observations belonging to class kn number of examplesK number of classesp dimension of Xi j k indices running over N

Vectors Matrices and Norms

0 vector with all entries equal to zero1 vector with all entries equal to oneI identity matrixAgt transposed of matrix A (ditto for vector)Aminus1 inverse of matrix Atr(A) trace of matrix A|A| determinant of matrix Adiag(v) diagonal matrix with v on the diagonalv1 L1 norm of vector vv2 L2 norm of vector vAF Frobenius norm of matrix A


Probability

E [middot] expectation of a random variablevar [middot] variance of a random variableN (micro σ2) normal distribution with mean micro and variance σ2

W(W ν) Wishart distribution with ν degrees of freedom and W scalematrix

H (X) entropy of random variable XI (XY ) mutual information between random variables X and Y

Mixture Models

yik hard membership of sample i to cluster kfk distribution function for cluster ktik posterior probability of sample i to belong to cluster kT posterior probability matrixπk prior probability or mixture proportion for cluster kmicrok mean vector of cluster kΣk covariance matrix of cluster kθk parameter vector for cluster k θk = (microkΣk)

θ(t) parameter vector at iteration t of the EM algorithmf(Xθ) likelihood functionL(θ X) log-likelihood functionLC(θ XY) complete log-likelihood function

Optimization

J(middot) cost functionL(middot) Lagrangianβ generic notation for the solution wrt β

βls least squares solution coefficient vectorA active setγ step size to update regularization pathh direction to update regularization path


Penalized models

λ λ1 λ2 penalty parametersPλ(θ) penalty term over a generic parameter vectorβkj coefficient j of discriminant vector kβk kth discriminant vector βk = (βk1 βkp)B matrix of discriminant vectors B = (β1 βKminus1)

βj jth row of B = (β1gt βpgt)gt

BLDA coefficient matrix in the LDA domainBCCA coefficient matrix in the CCA domainBOS coefficient matrix in the OS domainXLDA data matrix in the LDA domainXCCA data matrix in the CCA domainXOS data matrix in the OS domainθk score vector kΘ score matrix Θ = (θ1 θKminus1)Y label matrixΩ penalty matrixLCP (θXZ) penalized complete log-likelihood functionΣB between-class covariance matrixΣW within-class covariance matrixΣT total covariance matrix

ΣB sample between-class covariance matrix

ΣW sample within-class covariance matrix

ΣT sample total covariance matrixΛ inverse of covariance matrix or precision matrixwj weightsτj penalty components of the variational approach


Part I

Context and Foundations


This thesis is divided in three parts. In Part I, I am introducing the context in which this work has been developed, the project that funded it and the constraints that we had to obey. Generic are also detailed here to introduce the models and some basic concepts that will be used along this document. The state of the art of is also reviewed.

The first contribution of this thesis is explained in Part II, where I present the supervised learning algorithm GLOSS and its supporting theory, as well as some experiments to test its performance compared to other state of the art mechanisms. Before describing the algorithm and the experiments, its theoretical foundations are provided.

The second contribution is described in Part III, with a structure analogous to Part II but for the unsupervised domain. The clustering algorithm Mix-GLOSS adapts the supervised technique from Part II by means of a modified EM. This section is also furnished with specific theoretical foundations, an experimental section and a final discussion.


1 Context

The MASH project is a research initiative to investigate the open and collaborative design of feature extractors for the Machine Learning scientific community. The project is structured around a web platform (http://mash-project.eu) comprising collaborative tools such as wiki-documentation, forums, coding templates and an experiment center empowered with non-stop calculation servers. The applications targeted by MASH are vision and goal-planning problems, either in a 3D virtual environment or with a real robotic arm.

The MASH consortium is led by the IDIAP Research Institute in Switzerland. The other members are the University of Potsdam in Germany, the Czech Technical University of Prague, the National Institute for Research in Computer Science and Control (INRIA) in France, and the National Centre for Scientific Research (CNRS), also in France, through the laboratory of Heuristics and Diagnosis for Complex Systems (HEUDIASYC), attached to the University of Technology of Compiègne.

From the point of view of the research, the members of the consortium must deal with four main goals:

1. Software development of website framework and APIs

2. Classification and goal-planning in high dimensional feature spaces

3. Interfacing the platform with the 3D virtual environment and the robot arm

4. Building tools to assist contributors with the development of the feature extractors and the configuration of the experiments


Figure 1.1: MASH project logo


The work detailed in this text has been done in the context of goal 4. From the very beginning of the project, our role is to provide the users with some feedback regarding the feature extractors. At the moment of writing this thesis, the number of public feature extractors reaches 75. In addition to the public ones, there are also private extractors that contributors decide not to share with the rest of the community. The last number I was aware of was about 300. Within those 375 extractors, there must be some of them sharing the same theoretical principles or supplying similar features. The framework of the project tests every new piece of code with some datasets of reference in order to provide a ranking depending on the quality of the estimation. However, similar performance of two extractors for a particular dataset does not mean that both are using the same variables.

Our engagement was to provide some textual or graphical tools to discover which extractors compute features similar to other ones. Our hypothesis is that many of them use the same theoretical foundations, which should induce a grouping of similar extractors. If we succeed discovering those groups, we would also be able to select representatives. This information can be used in several ways. For example, from the perspective of a user that develops feature extractors, it would be interesting to compare the performance of his code against the K representatives instead of the whole database. As another example, imagine a user wants to obtain the best prediction results for a particular dataset. Instead of selecting all the feature extractors, creating an extremely high dimensional space, he could select only the K representatives, foreseeing similar results with a faster experiment.

As there is no prior knowledge about the latent structure, we make use of unsupervised techniques. Below there is a brief description of the different tools that we developed for the web platform.

• Clustering Using Mixture Models. This is a well-known technique that models the data as if it was randomly generated from a distribution function. This distribution is typically a mixture of Gaussians with unknown mixture proportions, means and covariance matrices. The number of Gaussian components matches the number of expected groups. The parameters of the model are computed using the EM algorithm and the clusters are built by maximum a posteriori estimation. For the calculation we use mixmod, a C++ library that can be interfaced with Matlab. This library allows working with high dimensional data. Further information regarding mixmod is given by Biernacki et al. (2008). All details concerning the tool implemented are given in deliverable "mash-deliverable-D71-m12" (Govaert et al., 2010).

• Sparse Clustering Using Penalized Optimal Scoring. This technique intends again to perform clustering by modelling the data as a mixture of Gaussian distributions. However, instead of using a classic EM algorithm for estimating the components' parameters, the M-step is replaced by a penalized Optimal Scoring problem. This replacement induces sparsity, improving the robustness and the interpretability of the results. Its theory will be explained later in this thesis. All details concerning the tool implemented can be found in deliverable "mash-deliverable-D72-m24" (Govaert et al., 2011).

• Table Clustering Using The RV Coefficient. This technique applies clustering methods directly to the tables computed by the feature extractors instead of creating a single matrix. A distance in the extractors space is defined using the RV coefficient, which is a multivariate generalization of Pearson's correlation coefficient in the form of an inner product. The distance is defined for every pair i and j as RV(O_i, O_j), where O_i and O_j are operators computed from the tables returned by feature extractors i and j. Once we have a distance matrix, several standard techniques may be used to group extractors (a small illustrative sketch of the RV computation is given right after this list). A detailed description of this technique can be found in deliverables "mash-deliverable-D71-m12" (Govaert et al., 2010) and "mash-deliverable-D72-m24" (Govaert et al., 2011).
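As a rough illustration of the measure used by this last tool, the snippet below computes the standard RV coefficient between two data tables observed on the same samples (a generic sketch, not the code deployed on the MASH platform, where the coefficient is applied to operators derived from the extractors' outputs):

# Illustrative sketch: RV coefficient between two tables X (n x p) and Y (n x q).
import numpy as np

def rv_coefficient(X, Y):
    """RV coefficient between two column-centred tables sharing the same n samples."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    Sx = X @ X.T                     # n x n operator associated with table X
    Sy = Y @ Y.T                     # n x n operator associated with table Y
    # trace(Sx Sy) / sqrt(trace(Sx^2) trace(Sy^2)), both operators being symmetric
    return np.sum(Sx * Sy) / np.sqrt(np.sum(Sx * Sx) * np.sum(Sy * Sy))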

I am not extending this section with further explanations about the MASH project or deeper details about the theory that we used to commit our engagements. I will simply refer to the public deliverables of the project, where everything is carefully detailed (Govaert et al., 2010, 2011).


2 Regularization for Feature Selection

With the advances in technology, data is becoming larger and larger, resulting in high dimensional ensembles of information. Genomics, textual indexation and medical images are some examples of data that can easily exceed thousands of dimensions. The first experiments aiming to cluster the data from the MASH project (see Chapter 1) intended to work with the whole dimensionality of the samples. As the number of feature extractors rose, the numerical issues also rose. Redundancy or extremely correlated features may happen if two contributors implement the same extractor with different names. When the number of features exceeded the number of samples, we started to deal with singular covariance matrices whose inverses are not defined. Many algorithms in the field of Machine Learning make use of this statistic.

2.1 Motivations

There is a quite recent effort in the direction of handling high dimensional data. Traditional techniques can be adapted, but quite often large dimensions turn those techniques useless. Linear Discriminant Analysis was shown to be no better than a "random guessing" of the object labels when the dimension is larger than the sample size (Bickel and Levina, 2004; Fan and Fan, 2008).

As a rule of thumb, in discriminant and clustering problems the complexity of calculus increases with the number of objects in the database, the number of features (dimensionality) and the number of classes or clusters. One way to reduce this complexity is to reduce the number of features. This reduction induces more robust estimators, allows faster learning and predictions in the supervised environments, and easier interpretations in the unsupervised framework. Removing features must be done wisely to avoid removing critical information.

When talking about dimensionality reduction, there are two families of techniques that could induce confusion:

• Reduction by feature transformations summarizes the dataset with fewer dimensions by creating combinations of the original attributes. These techniques are less effective when there are many irrelevant attributes (noise). Principal Component Analysis or Independent Component Analysis are two popular examples.

• Reduction by feature selection removes irrelevant dimensions, preserving the integrity of the informative features from the original dataset. The problem comes out when there is a restriction in the number of variables to preserve and discarding the exceeding dimensions leads to a loss of information. Prediction with feature selection is computationally cheaper because only relevant features are used, and the resulting models are easier to interpret. The Lasso operator is an example of this category.

Figure 2.1: Example of relevant features, from Chidlovskii and Lecerf (2008)

As a basic rule, we can use the reduction techniques by feature transformation when the majority of the features are relevant and when there is a lot of redundancy or correlation. On the contrary, feature selection techniques are useful when there are plenty of useless or noisy features (irrelevant information) that need to be filtered out. In the paper of Chidlovskii and Lecerf (2008) we find a great explanation about the difference between irrelevant and redundant features. The following two paragraphs are almost exact reproductions of their text.

"Irrelevant features are those which provide negligible distinguishing information. For example, if the objects are all dogs, cats or squirrels, and it is desired to classify each new animal into one of these three classes, the feature of color may be irrelevant if each of dogs, cats and squirrels have about the same distribution of brown, black and tan fur colors. In such a case, knowing that an input animal is brown provides negligible distinguishing information for classifying the animal as a cat, dog or squirrel. Features which are irrelevant for a given classification problem are not useful, and accordingly a feature that is irrelevant can be filtered out.

Redundant features are those which provide distinguishing information but are cumulative to another feature or group of features that provide substantially the same distinguishing information. Using the previous example, consider illustrative "diet" and "domestication" features. Dogs and cats both have similar carnivorous diets, while squirrels consume nuts and so forth. Thus, the "diet" feature can efficiently distinguish squirrels from dogs and cats, although it provides little information to distinguish between dogs and cats. Dogs and cats are also both typically domesticated animals, while squirrels are wild animals. Thus, the "domestication" feature provides substantially the same information as the "diet" feature, namely distinguishing squirrels from dogs and cats but not distinguishing between dogs and cats. Thus, the "diet" and "domestication" features are cumulative, and one can identify one of these features as redundant, so as to be filtered out. However, unlike irrelevant features, care should be taken with redundant features to ensure that one retains enough of the redundant features to provide the relevant distinguishing information. In the foregoing example, one may wish to filter out either the "diet" feature or the "domestication" feature, but if one removes both the "diet" and the "domestication" features, then useful distinguishing information is lost."

Figure 2.2: The four key steps of feature selection according to Liu and Yu (2005)

There are some tricks to build robust estimators when the number of features exceeds the number of samples. Ignoring some of the dependencies among variables and replacing the covariance matrix by a diagonal approximation are two of them. Another popular technique, and the one chosen in this thesis, is imposing regularity conditions.

2.2 Categorization of Feature Selection Techniques

Feature selection is one of the most frequent techniques in preprocessing data, in order to remove irrelevant, redundant or noisy features. Nevertheless, the risk of removing some informative dimensions is always there, thus the relevance of the remaining subset of features must be measured.

I am reproducing here the scheme that generalizes any feature selection process as it is shown by Liu and Yu (2005). Figure 2.2 provides a very intuitive scheme with the four key steps in a feature selection algorithm.

The classification of those algorithms can respond to different criteria. Guyon and Elisseeff (2003) propose a check list that summarizes the steps that may be taken to solve a feature selection problem, guiding the user through several techniques. Liu and Yu (2005) propose a framework that integrates supervised and unsupervised feature selection algorithms through a categorizing framework. Both references are excellent reviews to characterize feature selection techniques according to their characteristics. I am proposing a framework inspired by these references that does not cover all the possibilities but which gives a good summary about existing possibilities:

• Depending on the type of integration with the machine learning algorithm, we have:

  – Filter Models - The filter models work as a preprocessing step, using an independent evaluation criterion to select a subset of variables without assistance of the mining algorithm.

  – Wrapper Models - The wrapper models require a classification or clustering algorithm and use its prediction performance to assess the relevance of the subset selection. The feature selection is done in the optimization block while the feature subset evaluation is done in a different one. Therefore, the criterion to optimize and the criterion to evaluate may be different. Those algorithms are computationally expensive.

  – Embedded Models - They perform variable selection inside the learning machine, with the selection being made at the training step. That means that there is only one criterion: the optimization and the evaluation form a single block, and the features are selected to optimize this unique criterion and do not need to be re-evaluated in a later phase. That makes them more efficient, since no validation or test process is needed for every variable subset investigated. However, they are less universal because they are specific to the training process of a given mining algorithm.

bull Depending on the feature searching technique

ndash Complete - No subsets are missed from evaluation Involves combinatorialsearches

ndash Sequential - Features are added (forward searches) or removed (backwardsearches) one at a time

ndash Random - The initial subset or even subsequent subsets are randomly chosento escape local optima

bull Depending on the evaluation technique

ndash Distance Measures - Choosing the features that maximize the difference inseparability divergence or discrimination measures

ndash Information Measures - Choosing the features that maximize the informationgain that is minimizing the posterior uncertainty

ndash Dependency Measures - Measuring the correlation between features

ndash Consistency Measures - Finding a minimum number of features that separateclasses as consistently as the full set of features can

ndash Predictive Accuracy - Use the selected features to predict the labels

ndash Cluster Goodness - Use the selected features to perform clustering and eval-uate the result (cluster compactness scatter separability maximum likeli-hood)

The distance information correlation and consistency measures are typical of variableranking algorithms commonly used in filter models Predictive accuracy and clustergoodness allow to evaluate subsets of features and can be used in wrapper and embeddedmodels

In this thesis we developed some algorithms following the embedded paradigm, either in the supervised or the unsupervised framework. Integrating the subset selection problem in the overall learning problem may be computationally demanding, but it is appealing from a conceptual viewpoint: there is a perfect match between the formalized goal and the process dedicated to achieving this goal, thus avoiding many problems arising in filter or wrapper methods. Practically, it is however intractable to solve hard selection problems exactly when the number of features exceeds a few tens. Regularization techniques allow to provide a sensible approximate answer to the selection problem with reasonable computing resources, and their recent study has demonstrated powerful theoretical and empirical results. The following section introduces the tools that will be employed in Parts II and III.

2.3 Regularization

In the machine learning domain, the term "regularization" refers to a technique that introduces some extra assumptions or knowledge in the resolution of an optimization problem. The most popular point of view presents regularization as a mechanism to prevent overfitting, but it can also help to fix some numerical issues in ill-posed problems (like some matrix singularities when solving a linear system), besides other interesting properties like the capacity to induce sparsity, thus producing models that are easier to interpret.

An ill-posed problem violates the rules defined by Jacques Hadamard, according to whom the solution to a mathematical problem has to exist, be unique and be stable. This happens, for example, when the number of samples is smaller than their dimensionality and we try to infer some generic laws from such a small sample of the population. Regularization transforms an ill-posed problem into a well-posed one. To do that, some a priori knowledge is introduced in the solution through a regularization term that penalizes a criterion J with a penalty P. Below are the two most popular formulations:

\[
\min_{\beta} \; J(\beta) + \lambda P(\beta) \tag{2.1}
\]

\[
\min_{\beta} \; J(\beta) \quad \text{s.t.} \quad P(\beta) \le t \tag{2.2}
\]

In expressions (2.1) and (2.2), the parameters λ and t have a similar function, that is, to control the trade-off between fitting the data to the model according to J(β) and the effect of the penalty P(β). The set such that the constraint in (2.2) is verified, $\{\beta : P(\beta) \le t\}$, is called the admissible set. This penalty term can also be understood as a measure that quantifies the complexity of the model (as in the definition of Sammut and Webb, 2010). Note that regularization terms can also be interpreted in the Bayesian paradigm as prior distributions on the parameters of the model. In this thesis both views will be taken.

In this section I review pure, mixed and hybrid penalties that will be used in the following chapters to implement feature selection. I first list important properties that may pertain to any type of penalty.

Figure 2.3: Admissible sets in two dimensions for different pure norms $\|\beta\|_p$.

2.3.1 Important Properties

Penalties may have different properties that can be more or less interesting depending on the problem and the expected solution. The most important properties for our purposes here are convexity, sparsity and stability.

Convexity. Regarding optimization, convexity is a desirable property that eases finding global solutions. A convex function verifies

\[
\forall (x_1, x_2) \in \mathcal{X}^2, \quad f(t x_1 + (1-t) x_2) \le t f(x_1) + (1-t) f(x_2) \tag{2.3}
\]

for any value of $t \in [0, 1]$. Replacing the inequality by a strict inequality, we obtain the definition of strict convexity. A regularized expression like (2.2) is convex if the function J(β) and the penalty P(β) are both convex.

Sparsity. Usually, null coefficients furnish models that are easier to interpret. When sparsity does not harm the quality of the predictions, it is a desirable property, which moreover entails less memory usage and fewer computational resources.

Stability. There are numerous notions of stability or robustness, which measure how the solution varies when the input is perturbed by small changes. This perturbation can be adding, removing or replacing a few elements in the training set. Adding regularization, in addition to preventing overfitting, is a means to favor the stability of the solution.

2.3.2 Pure Penalties

For pure penalties, defined as $P(\beta) = \|\beta\|_p$, convexity holds for $p \ge 1$. This is graphically illustrated in Figure 2.3, borrowed from Szafranski (2008), whose Chapter 3 is an excellent review of regularization techniques and of the algorithms to solve them. In this figure, the shape of the admissible set corresponding to each pure penalty is greyed out. Since convexity of the penalty corresponds to the convexity of the set, we see that this property is verified for $p \ge 1$.

Figure 2.4: Two-dimensional regularized problems with $\|\beta\|_1$ and $\|\beta\|_2$ penalties.

Regularizing a linear model with a norm like $\|\beta\|_p$ means that the larger the component $|\beta_j|$, the more important the feature $x_j$ in the estimation. On the contrary, the closer it is to zero, the more dispensable it is. In the limit of $|\beta_j| = 0$, $x_j$ is not involved in the model. If many dimensions can be dismissed, then we can speak of sparsity.

A graphical interpretation of sparsity, borrowed from Marie Szafranski, is given in Figure 2.4. In a 2D problem, a solution can be considered as sparse if any of its components ($\beta_1$ or $\beta_2$) is null, that is, if the optimal β is located on one of the coordinate axes. Let us consider a search algorithm that minimizes an expression like (2.2) where J(β) is a quadratic function. When the solution to the unconstrained problem does not belong to the admissible set defined by P(β) (greyed-out area), the solution to the constrained problem is as close as possible to the global minimum of the cost function inside the grey region. Depending on the shape of this region, the probability of having a sparse solution varies. A region with vertexes, such as the one corresponding to an $L_1$ penalty, has more chances of inducing sparse solutions than the one of an $L_2$ penalty. That idea is displayed in Figure 2.4, where J(β) is a quadratic function represented with three isolevel curves, whose global minimum $\beta^{ls}$ is outside the penalties' admissible region. The closest point to this $\beta^{ls}$ is $\beta^{l_1}$ for the $L_1$ regularization and $\beta^{l_2}$ for the $L_2$ regularization. Solution $\beta^{l_1}$ is sparse because its second component is zero, while both components of $\beta^{l_2}$ are different from zero.

After reviewing the regions from Figure 2.3, we can relate the capacity of generating sparse solutions to the number and the "sharpness" of the vertexes of the greyed-out area. For example, an $L_{1/3}$ penalty has a support region with sharper vertexes that would induce a sparse solution even more strongly than an $L_1$ penalty; however, the non-convex shape of the $L_{1/3}$ ball results in difficulties during optimization that will not happen with a convex shape.


To summarize, a convex problem with a sparse solution is desired. But with pure penalties, sparsity is only possible with $L_p$ norms with $p \le 1$, due to the fact that they are the only ones that have vertexes. On the other side, only norms with $p \ge 1$ are convex; hence the only pure penalty that builds a convex problem with a sparse solution is the $L_1$ penalty.

$L_0$ Penalties. The $L_0$ pseudo-norm of a vector β is defined as the number of entries different from zero, that is, $P(\beta) = \|\beta\|_0 = \operatorname{card}\{\beta_j \,|\, \beta_j \neq 0\}$:

\[
\min_{\beta} \; J(\beta) \quad \text{s.t.} \quad \|\beta\|_0 \le t \;, \tag{2.4}
\]

where the parameter t represents the maximum number of non-zero coefficients in vector β. The larger the value of t (or the lower the value of λ if we use the equivalent expression (2.1)), the fewer the number of zeros induced in vector β. If t is equal to the dimensionality of the problem (or if λ = 0), then the penalty term is not effective and β is not altered. In general, the computation of the solutions relies on combinatorial optimization schemes. Their solutions are sparse but unstable.
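To make the combinatorial nature of the $L_0$-constrained problem concrete, the following sketch (an illustration added here for clarity, not part of the original text; the function name and data are mine) performs an exhaustive search over all subsets of at most t features for a least squares loss. It is only feasible for very small p, which is precisely the limitation discussed above.

```python
import itertools
import numpy as np

def best_subset(X, y, t):
    """Exhaustive search for: minimize ||y - X beta||^2  s.t.  ||beta||_0 <= t."""
    n, p = X.shape
    best_rss, best_beta = np.inf, np.zeros(p)
    for size in range(t + 1):
        for subset in itertools.combinations(range(p), size):
            beta = np.zeros(p)
            if size > 0:
                # least squares restricted to the selected columns
                coefs, *_ = np.linalg.lstsq(X[:, subset], y, rcond=None)
                beta[list(subset)] = coefs
            rss = np.sum((y - X @ beta) ** 2)
            if rss < best_rss:
                best_rss, best_beta = rss, beta
    return best_beta

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 6))
beta_true = np.array([2.0, 0.0, 0.0, -1.5, 0.0, 0.0])
y = X @ beta_true + 0.1 * rng.standard_normal(30)
print(best_subset(X, y, t=2))
```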

$L_1$ Penalties. The penalties built using $L_1$ norms induce sparsity and stability. This penalty has been named the Lasso (Least Absolute Shrinkage and Selection Operator) by Tibshirani (1996):

\[
\min_{\beta} \; J(\beta) \quad \text{s.t.} \quad \sum_{j=1}^{p} |\beta_j| \le t \;. \tag{2.5}
\]

Despite all the advantages of the Lasso, the choice of the right penalty is not just a question of convexity and sparsity. For example, concerning the Lasso, Osborne et al. (2000a) have shown that when the number of examples n is lower than the number of variables p, then the maximum number of non-zero entries of β is n. Therefore, if there is a strong correlation between several variables, this penalty risks dismissing all but one of them, resulting in a hardly interpretable model. In a field like genomics, where n is typically some tens of individuals and p several thousands of genes, the performance of the algorithm and the interpretability of the genetic relationships are severely limited.

The Lasso is a popular tool that has been used in multiple contexts besides regression, particularly in the field of feature selection in supervised classification (Mai et al., 2012; Witten and Tibshirani, 2011) and clustering (Roth and Lange, 2004; Pan et al., 2006; Pan and Shen, 2007; Zhou et al., 2009; Guo et al., 2010; Witten and Tibshirani, 2010; Bouveyron and Brunet, 2012b,a).

The consistency of problems regularized by a Lasso penalty is also a key feature, defining consistency as the capability of always making the right choice of relevant variables when the number of individuals is infinitely large. Leng et al. (2006) have shown that when the penalty parameter (t or λ, depending on the formulation) is chosen by minimization of the prediction error, the Lasso penalty does not lead to consistent models. There is a large bibliography defining conditions under which Lasso estimators become consistent (Knight and Fu, 2000; Donoho et al., 2006; Meinshausen and Buhlmann, 2006; Zhao and Yu, 2007; Bach, 2008). In addition to those papers, some authors have introduced modifications to improve the interpretability and the consistency of the Lasso, such as the adaptive Lasso (Zou, 2006).

$L_2$ Penalties. The graphical interpretation of pure norm penalties in Figure 2.3 shows that this norm does not induce sparsity, due to its lack of vertexes. Strictly speaking, the $L_2$ norm involves the square root of the sum of all squared components. In practice, when using $L_2$ penalties, the square of the norm is used to avoid the square root and to solve a linear system. Thus, an $L_2$ penalized optimization problem looks like

\[
\min_{\beta} \; J(\beta) + \lambda \|\beta\|_2^2 \;. \tag{2.6}
\]

The effect of this penalty is the "equalization" of the components of the parameter being penalized. To highlight this property, let us consider a least squares problem:

\[
\min_{\beta} \; \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 \;, \tag{2.7}
\]

with solution $\beta^{ls} = (X^\top X)^{-1} X^\top y$. If some input variables are highly correlated, the estimator $\beta^{ls}$ is very unstable. To fix this numerical instability, Hoerl and Kennard (1970) proposed ridge regression, which regularizes Problem (2.7) with a quadratic penalty:

\[
\min_{\beta} \; \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \;.
\]

The solution to this problem is $\beta^{l_2} = (X^\top X + \lambda I_p)^{-1} X^\top y$. All eigenvalues, in particular the small ones corresponding to the correlated dimensions, are now moved upwards by λ. This can be enough to avoid the instability induced by small eigenvalues. This "equalization" of the coefficients reduces the variability of the estimation, which may improve performance.

As with the Lasso operator, there are several variations of ridge regression. For example, Breiman (1995) proposed the nonnegative garrotte, which looks like a ridge regression where each variable is penalized adaptively. To do that, the least squares solution is used to define the penalty parameter attached to each coefficient:

\[
\min_{\beta} \; \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} \frac{\beta_j^2}{(\beta_j^{ls})^2} \;. \tag{2.8}
\]

The effect is an elliptic admissible set instead of the ball of ridge regression. Another example is adaptive ridge regression (Grandvalet, 1998; Grandvalet and Canu, 2002), where the penalty parameter differs for each component: every $\lambda_j$ is optimized to penalize more or less depending on the influence of $\beta_j$ in the model.

Although $L_2$ penalized problems are stable, they are not sparse. That makes those models harder to interpret, mainly in high dimensions.

$L_\infty$ Penalties. A special case of $L_p$ norms is the infinity norm, defined as $\|x\|_\infty = \max(|x_1|, |x_2|, \ldots, |x_p|)$. The admissible region for a penalty like $\|\beta\|_\infty \le t$ is displayed in Figure 2.3. For the $L_\infty$ norm, the greyed-out region is a square containing all the β vectors whose largest coefficient is less than or equal to the value of the penalty parameter t.

This norm is not commonly used as a regularization term itself; however, it frequently appears in mixed penalties, as shown in Section 2.3.4. In addition, in the optimization of penalized problems there exists the concept of dual norms. Dual norms arise in the analysis of estimation bounds and in the design of algorithms that address optimization problems by solving an increasing sequence of small subproblems (working set algorithms). The dual norm plays a direct role in computing optimality conditions of sparse regularized problems. The dual norm $\|\beta\|^*$ of a norm $\|\beta\|$ is defined as

\[
\|\beta\|^* = \max_{w \in \mathbb{R}^p} \; \beta^\top w \quad \text{s.t.} \quad \|w\| \le 1 \;.
\]

In the case of an $L_q$ norm with $q \in [1, +\infty]$, the dual norm is the $L_r$ norm such that $1/q + 1/r = 1$. For example, the $L_2$ norm is self-dual and the dual norm of the $L_1$ norm is the $L_\infty$ norm. This is one of the reasons why $L_\infty$ is so important, even if it is not as popular as a penalty itself, because $L_1$ is. An extensive explanation about dual norms and the algorithms that make use of them can be found in Bach et al. (2011).

2.3.3 Hybrid Penalties

There is no reason to use pure penalties in isolation; we can combine them and try to obtain different benefits from each of them. The most popular example is the Elastic net regularization (Zou and Hastie, 2005), with the objective of improving the Lasso penalization when $n \le p$. As recalled in Section 2.3.2, when $n \le p$ the Lasso penalty can select at most n non-null features. Thus, in situations where there are more relevant variables, the Lasso penalty risks selecting only some of them. To avoid this effect, a combination of $L_1$ and $L_2$ penalties has been proposed. For the least squares example (2.7) from Section 2.3.2, the Elastic net is

\[
\min_{\beta} \; \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^2 \;. \tag{2.9}
\]

The term in $\lambda_1$ is a Lasso penalty that induces sparsity in the vector β; on the other side, the term in $\lambda_2$ is a ridge regression penalty that provides universal strong consistency (De Mol et al., 2009), that is, the asymptotic capability (when n goes to infinity) of always making the right choice of relevant variables.


2.3.4 Mixed Penalties

Imagine a linear regression problem where each variable is a gene. Depending on the application, several biological processes can be identified by L different groups of genes. Let us denote by $G_\ell$ the group of genes for the $\ell$-th process and by $d_\ell$ the number of genes (variables) in each group, $\forall \ell \in \{1, \ldots, L\}$. Thus, the dimension of vector β is the sum of the number of genes of every group, $\dim(\beta) = \sum_{\ell=1}^{L} d_\ell$. Mixed norms are a type of norm that takes those groups into consideration. The general expression is shown below:

\[
\|\beta\|_{(r,s)} = \Bigg( \sum_{\ell} \Big( \sum_{j \in G_\ell} |\beta_j|^s \Big)^{\frac{r}{s}} \Bigg)^{\frac{1}{r}} \;. \tag{2.10}
\]

The pair (r, s) identifies the norms that are combined: an $L_s$ norm within groups and an $L_r$ norm between groups. The $L_s$ norm penalizes the variables in every group $G_\ell$, while the $L_r$ norm penalizes the within-group norms. The pair (r, s) is set so as to induce different properties in the resulting β vector. Note that the outer norm is often weighted to adjust for the different cardinalities of the groups, in order to avoid favoring the selection of the largest groups.

Several combinations are available; the most popular is the norm $\|\beta\|_{(1,2)}$, known as the group-Lasso (Yuan and Lin, 2006; Leng, 2008; Xie et al., 2008a,b; Meier et al., 2008; Roth and Fischer, 2008; Yang et al., 2010; Sanchez Merchante et al., 2012). Figure 2.5 shows the difference between the admissible sets of a pure $L_1$ norm and a mixed $L_{1,2}$ norm. Many other mixings are possible, such as $\|\beta\|_{(1,\frac{4}{3})}$ (Szafranski et al., 2008) or $\|\beta\|_{(1,\infty)}$ (Wang and Zhu, 2008; Kuan et al., 2010; Vogt and Roth, 2010). Modifications of mixed norms have also been proposed, such as the group bridge penalty (Huang et al., 2009), the composite absolute penalties (Zhao et al., 2009), or combinations of mixed and pure norms such as Lasso and group-Lasso (Friedman et al., 2010; Sprechmann et al., 2010) or group-Lasso and ridge penalty (Ng and Abugharbieh, 2011).
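As a quick numerical illustration (this snippet and its group layout are mine, added for clarity), the function below evaluates the mixed norm (2.10) for an arbitrary grouping; with r = 1 and s = 2 it gives the (unweighted) group-Lasso penalty.

```python
import numpy as np

def mixed_norm(beta, groups, r=1.0, s=2.0):
    """Evaluate ||beta||_(r,s) as in (2.10): an L_s norm within each group,
    then an L_r norm across the group norms."""
    within = np.array([np.sum(np.abs(beta[g]) ** s) ** (1.0 / s) for g in groups])
    return np.sum(within ** r) ** (1.0 / r)

beta = np.array([0.5, -1.0, 0.0, 2.0, 0.0, 0.0])
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]
print("L_(1,2) mixed norm:", mixed_norm(beta, groups, r=1, s=2))
print("plain L_1 norm    :", np.sum(np.abs(beta)))
```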

2.3.5 Sparsity Considerations

In this chapter I have reviewed several possibilities to induce sparsity in the solution of optimization problems. However, having sparse solutions does not always lead to parsimonious models feature-wise. For example, if we have four parameters per feature, we look for solutions where all four parameters are null for non-informative variables.

The Lasso and the other $L_1$ penalties encourage solutions such as the one on the left of Figure 2.6. If the objective is sparsity, then the $L_1$ norm does the job. However, if we aim at feature selection and if the number of parameters per variable exceeds one, this type of sparsity does not target the removal of variables.

To be able to dismiss some features, the sparsity pattern must encourage null values for the same variable across parameters, as shown on the right of Figure 2.6. This can be achieved with mixed penalties that define groups of features. For example, $L_{1,2}$ or $L_{1,\infty}$ mixed norms, with the proper definition of groups, can induce sparsity patterns such as the one on the right of Figure 2.6, which displays a solution where variables 3, 5 and 8 are removed.

Figure 2.5: Admissible sets for the Lasso ($L_1$) and the group-Lasso ($L_{(1,2)}$).

Figure 2.6: Sparsity patterns for an example with 8 variables characterized by 4 parameters: (a) $L_1$ induced sparsity; (b) $L_{(1,2)}$ group-induced sparsity.
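To visualize how a group penalty produces the pattern of Figure 2.6, the sketch below (an illustrative example of mine, not an algorithm from the thesis) applies block soft-thresholding, the proximal operator of the $L_{1,2}$ penalty, to a coefficient matrix whose rows are variables and whose columns are parameters: rows with a small norm are zeroed entirely, removing the corresponding variables.

```python
import numpy as np

def group_soft_threshold(B, lam):
    """Proximal operator of lam * sum_j ||B[j, :]||_2:
    shrinks each row (variable) of B as a block, zeroing weak rows."""
    norms = np.linalg.norm(B, axis=1, keepdims=True)
    scale = np.maximum(1.0 - lam / np.maximum(norms, 1e-12), 0.0)
    return B * scale

rng = np.random.default_rng(2)
B = rng.standard_normal((8, 4))          # 8 variables, 4 parameters each
B[[2, 4, 7], :] *= 0.05                  # variables 3, 5 and 8 carry little signal
B_sparse = group_soft_threshold(B, lam=0.6)
print(np.linalg.norm(B_sparse, axis=1))  # rows whose norm was below lam are exactly zero
```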

2.3.6 Optimization Tools for Regularized Problems

In Caramanis et al. (2012) there is a good collection of mathematical techniques and optimization methods to solve regularized problems. Another good reference is the thesis of Szafranski (2008), which also reviews some techniques classified in four categories. Those techniques, even if they belong to different categories, can be used separately or combined to produce improved optimization algorithms.

In fact, the algorithm implemented in this thesis is inspired by three of those techniques. It could be defined as an algorithm of "active constraints" implemented following a regularization path that is updated by approaching the cost function with secant hyper-planes. Deeper details are given in the dedicated Chapter 5.

Subgradient Descent. Subgradient descent is a generic optimization method that can be used in penalized settings where the subgradient of the loss function, $\partial J(\beta)$, and the subgradient of the regularizer, $\partial P(\beta)$, can be computed efficiently. On the one hand, it is essentially blind to the problem structure; on the other hand, many iterations are needed, so the convergence is slow and the solutions are not sparse. Basically, it is a generalization of the iterative gradient descent algorithm, where the solution vector $\beta^{(t+1)}$ is updated proportionally to the negative subgradient of the function at the current point $\beta^{(t)}$:

\[
\beta^{(t+1)} = \beta^{(t)} - \alpha (s + \lambda s') \;, \quad \text{where } s \in \partial J(\beta^{(t)}),\; s' \in \partial P(\beta^{(t)}) \;.
\]
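The sketch below (my illustration; the step size, number of iterations and data are arbitrary choices) instantiates this update for a quadratic loss $J(\beta) = \|y - X\beta\|^2$ with a Lasso regularizer, taking $s' = \operatorname{sign}(\beta)$ as a subgradient of the $L_1$ norm. As noted in the text, the iterates are typically not exactly sparse.

```python
import numpy as np

def lasso_subgradient_descent(X, y, lam, alpha=1e-3, n_iter=5000):
    """Subgradient descent on ||y - X beta||^2 + lam * ||beta||_1."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        s = -2.0 * X.T @ (y - X @ beta)   # gradient of the quadratic loss
        s_prime = np.sign(beta)           # a subgradient of the L1 norm
        beta -= alpha * (s + lam * s_prime)
    return beta

rng = np.random.default_rng(3)
X = rng.standard_normal((100, 10))
beta_true = np.zeros(10)
beta_true[:3] = [1.0, -2.0, 1.5]
y = X @ beta_true + 0.1 * rng.standard_normal(100)
print(lasso_subgradient_descent(X, y, lam=5.0).round(3))  # small but rarely exact zeros
```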

Coordinate Descent. Coordinate descent is based on the first-order optimality conditions of criterion (2.1). In the case of penalties like the Lasso, setting to zero the first-order derivative with respect to coefficient $\beta_j$ gives

\[
\beta_j = \frac{-\lambda \operatorname{sign}(\beta_j) - \frac{\partial J(\beta)}{\partial \beta_j}}{2 \sum_{i=1}^{n} x_{ij}^2} \;.
\]

In the literature those algorithms are also referred to as "iterative thresholding" algorithms, because the optimization can be solved by soft-thresholding in an iterative process. As an example, Fu (1998) implements this technique, initializing every coefficient with the least squares solution $\beta^{ls}$ and updating their values using an iterative thresholding algorithm where $\beta_j^{(t+1)} = S_\lambda\!\left(\frac{\partial J(\beta^{(t)})}{\partial \beta_j}\right)$. The objective function is optimized with respect to one variable at a time, while all others are kept fixed:

\[
S_\lambda\!\left(\frac{\partial J(\beta)}{\partial \beta_j}\right) =
\begin{cases}
\dfrac{\lambda - \frac{\partial J(\beta)}{\partial \beta_j}}{2 \sum_{i=1}^{n} x_{ij}^2} & \text{if } \frac{\partial J(\beta)}{\partial \beta_j} > \lambda \;, \\[2ex]
\dfrac{-\lambda - \frac{\partial J(\beta)}{\partial \beta_j}}{2 \sum_{i=1}^{n} x_{ij}^2} & \text{if } \frac{\partial J(\beta)}{\partial \beta_j} < -\lambda \;, \\[2ex]
0 & \text{if } \left|\frac{\partial J(\beta)}{\partial \beta_j}\right| \le \lambda \;.
\end{cases} \tag{2.11}
\]

The same principles define "block-coordinate descent" algorithms. In this case, the first-order derivatives are applied to the equations of a group-Lasso penalty (Yuan and Lin, 2006; Wu and Lange, 2008).
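A minimal cyclic coordinate descent for the Lasso, in the spirit of the iterative soft-thresholding update (2.11), is sketched below (my own simplified implementation, not the thesis code; the criterion is written as $\|y - X\beta\|^2 + \lambda\|\beta\|_1$ and convergence checks are omitted). Unlike subgradient descent, it produces exactly sparse coefficients.

```python
import numpy as np

def soft_threshold(z, thresh):
    return np.sign(z) * np.maximum(np.abs(z) - thresh, 0.0)

def lasso_coordinate_descent(X, y, lam, n_sweeps=200):
    """Cyclic coordinate descent on ||y - X beta||^2 + lam * ||beta||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = np.sum(X ** 2, axis=0)
    residual = y.copy()                            # residual = y - X beta
    for _ in range(n_sweeps):
        for j in range(p):
            residual += X[:, j] * beta[j]          # remove variable j from the fit
            rho = X[:, j] @ residual
            beta[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
            residual -= X[:, j] * beta[j]          # put the updated fit back
    return beta

rng = np.random.default_rng(4)
X = rng.standard_normal((100, 20))
beta_true = np.zeros(20)
beta_true[[0, 5, 10]] = [2.0, -1.0, 1.5]
y = X @ beta_true + 0.1 * rng.standard_normal(100)
print(lasso_coordinate_descent(X, y, lam=10.0).round(3))
```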

Active and Inactive Sets. Active set algorithms are also referred to as "active constraints" or "working set" methods. These algorithms define a subset of variables called the "active set", which stores the indices of the variables with non-zero $\beta_j$; it is usually denoted $\mathcal{A}$. The complement of the active set is the "inactive set", noted $\bar{\mathcal{A}}$, which contains the indices of the variables whose $\beta_j$ is zero. Thus the problem can be simplified to the dimensionality of $\mathcal{A}$.

Osborne et al. (2000a) proposed the first of those algorithms to solve quadratic problems with Lasso penalties. Their algorithm starts from an empty active set that is updated incrementally (forward growing). There also exists a backward view, where relevant variables are allowed to leave the active set; however, the forward philosophy that starts with an empty $\mathcal{A}$ has the advantage that the first calculations are low dimensional. In addition, the forward view fits better the feature selection intuition, where few features are intended to be selected.

Working set algorithms have to deal with three main tasks. There is an optimization task, where a minimization problem has to be solved using only the variables from the active set; Osborne et al. (2000a) solve a linear approximation of the original problem to determine the objective function descent direction, but any other method can be considered. In general, as the solutions of successive active sets are typically close to each other, it is a good idea to use the solution of the previous iteration as the initialization of the current one (warm start). Besides the optimization task, there is a working set update task, where the active set $\mathcal{A}$ is augmented with the variable from the inactive set $\bar{\mathcal{A}}$ that most violates the optimality conditions of Problem (2.1). Finally, there is also a task to compute the optimality conditions. Their expressions are essential in the selection of the next variable to add to the active set and to test whether a particular vector β is a solution of Problem (2.1).

These active constraints or working set methods, even if they were originally proposed to solve $L_1$ regularized quadratic problems, can also be adapted to generic functions and penalties: for example, linear functions and $L_1$ penalties (Roth, 2004), linear functions and $L_{1,2}$ penalties (Roth and Fischer, 2008), or even logarithmic cost functions and combinations of $L_0$, $L_1$ and $L_2$ penalties (Perkins et al., 2003). The algorithm developed in this work belongs to this family of solutions.

Hyper-Planes Approximation. Hyper-plane approximations solve a regularized problem using a piecewise linear approximation of the original cost function. This convex approximation is built using several secant hyper-planes at different points, obtained from the subgradient of the cost function at these points.

This family of algorithms implements an iterative mechanism where the number of hyper-planes increases at every iteration. These techniques are useful with large populations, since the number of iterations needed to converge does not depend on the size of the dataset. On the contrary, if few hyper-planes are used, then the quality of the approximation is not good enough and the solution can be unstable.

This family of algorithms is not as popular as the previous one, but some examples can be found in the domain of Support Vector Machines (Joachims, 2006; Smola et al., 2008; Franc and Sonnenburg, 2008) or Multiple Kernel Learning (Sonnenburg et al., 2006).

Regularization Path. The regularization path is the set of solutions that can be reached when solving a series of optimization problems of the form (2.1), where the penalty parameter λ is varied. It is not an optimization technique per se, but it is of practical use when the exact regularization path can be easily followed. Rosset and Zhu (2007) stated that this path is piecewise linear for those problems where the cost function is piecewise quadratic and the regularization term is piecewise linear (or vice versa).

This concept was first applied to the Lasso algorithm of Osborne et al. (2000b). However, it was after the publication of the algorithm called Least Angle Regression (LARS), developed by Efron et al. (2004), that those techniques became popular. LARS defines the regularization path using active constraint techniques.

Once an active set $\mathcal{A}^{(t)}$ and its corresponding solution $\beta^{(t)}$ have been set, looking for the regularization path means looking for a direction h and a step size γ to update the solution as $\beta^{(t+1)} = \beta^{(t)} + \gamma h$. Afterwards, the active and inactive sets $\mathcal{A}^{(t+1)}$ and $\bar{\mathcal{A}}^{(t+1)}$ are updated. That can be done by looking for the variables that most strongly violate the optimality conditions. Hence, LARS sets the update step size and the variable that should enter the active set from the correlation with the residuals.

Proximal Methods. Proximal methods optimize an objective function of the form (2.1), resulting from the addition of a Lipschitz-differentiable cost function J(β) and a non-differentiable penalty λP(β):

\[
\min_{\beta \in \mathbb{R}^p} \; J(\beta^{(t)}) + \nabla J(\beta^{(t)})^\top (\beta - \beta^{(t)}) + \lambda P(\beta) + \frac{L}{2} \left\| \beta - \beta^{(t)} \right\|_2^2 \;. \tag{2.12}
\]

They are also iterative methods, where the cost function J(β) is linearized in the proximity of the solution β, so that the problem to solve at each iteration looks like (2.12), where the parameter L > 0 should be an upper bound on the Lipschitz constant of the gradient ∇J. That can be rewritten as

\[
\min_{\beta \in \mathbb{R}^p} \; \frac{1}{2} \left\| \beta - \Big( \beta^{(t)} - \frac{1}{L} \nabla J(\beta^{(t)}) \Big) \right\|_2^2 + \frac{\lambda}{L} P(\beta) \;. \tag{2.13}
\]

The basic algorithm uses the solution to (2.13) as the next value $\beta^{(t+1)}$. However, there are faster versions that take advantage of information from previous steps, such as the ones described by Nesterov (2007) or the FISTA algorithm (Beck and Teboulle, 2009). Proximal methods can be seen as generalizations of gradient updates; in fact, setting λ = 0 in equation (2.13), the standard gradient update rule comes up.
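The following sketch (a minimal illustration of mine, not the thesis implementation) applies this proximal update to the Lasso, for which the proximal operator of $(\lambda/L)\|\cdot\|_1$ is soft-thresholding; L is set from the spectral norm of X, an upper bound on the Lipschitz constant of the gradient of the quadratic loss.

```python
import numpy as np

def ista_lasso(X, y, lam, n_iter=500):
    """Proximal gradient (ISTA) on J(beta) + lam * ||beta||_1, J(beta) = ||y - X beta||^2."""
    L = 2.0 * np.linalg.norm(X, 2) ** 2          # Lipschitz constant of grad J
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = -2.0 * X.T @ (y - X @ beta)
        z = beta - grad / L                      # gradient step, cf. (2.13)
        beta = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # proximal (soft-threshold) step
    return beta

rng = np.random.default_rng(5)
X = rng.standard_normal((80, 15))
beta_true = np.zeros(15)
beta_true[:2] = [1.5, -2.0]
y = X @ beta_true + 0.1 * rng.standard_normal(80)
print(ista_lasso(X, y, lam=8.0).round(3))
```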


Part II

Sparse Linear Discriminant Analysis


Abstract

Linear discriminant analysis (LDA) aims to describe data by a linear combination of features that best separates the classes. It may be used for classifying future observations or for describing those classes.

There is a vast bibliography about sparse LDA methods, reviewed in Chapter 3. Sparsity is typically induced by regularizing the discriminant vectors or the class means with $L_1$ penalties (see Section 2). Section 2.3.5 discussed why this sparsity-inducing penalty may not guarantee parsimonious models with respect to variables.

In this part we develop the group-Lasso Optimal Scoring Solver (GLOSS), which addresses a sparse LDA problem globally, through a regression approach of LDA. Our analysis, presented in Chapter 4, formally relates GLOSS to Fisher's discriminant analysis and also enables the derivation of variants, such as LDA assuming a diagonal within-class covariance structure (Bickel and Levina, 2004). The group-Lasso penalty selects the same features in all discriminant directions, leading to a more interpretable low-dimensional representation of the data. The discriminant directions can be used in their totality, or the first ones may be chosen to produce a reduced-rank classification. The first two or three directions can also be used to project the data in order to generate a graphical display of the data. The algorithm is detailed in Chapter 5, and our experimental results of Chapter 6 demonstrate that, compared to the competing approaches, the models are extremely parsimonious without compromising prediction performance. The algorithm efficiently processes medium to large numbers of variables and is thus particularly well suited to the analysis of gene expression data.


3 Feature Selection in Fisher Discriminant Analysis

3.1 Fisher Discriminant Analysis

Linear discriminant analysis (LDA) aims to describe n labeled observations belonging to K groups by a linear combination of features which characterizes or separates the classes. It is used for two main purposes: classifying future observations, or describing the essential differences between classes, either by providing a visual representation of the data or by revealing the combinations of features that discriminate between classes. There are several frameworks in which linear combinations can be derived; Friedman et al. (2009) dedicate a whole chapter to linear methods for classification. In this part we focus on Fisher's discriminant analysis, which is a standard tool for linear discriminant analysis whose formulation does not rely on posterior probabilities but rather on some inertia principles (Fisher, 1936).

We consider that the data consist of a set of n examples, with observations $x_i \in \mathbb{R}^p$ comprising p features and labels $y_i \in \{0,1\}^K$ indicating the exclusive assignment of observation $x_i$ to one of the K classes. It will be convenient to gather the observations in the $n \times p$ matrix $X = (x_1^\top, \ldots, x_n^\top)^\top$ and the corresponding labels in the $n \times K$ matrix $Y = (y_1^\top, \ldots, y_n^\top)^\top$.

Fisher's discriminant problem was first proposed for two-class problems, for the analysis of the famous iris dataset, as the maximization of the ratio of the projected between-class covariance to the projected within-class covariance:

\[
\max_{\beta \in \mathbb{R}^p} \; \frac{\beta^\top \Sigma_B \beta}{\beta^\top \Sigma_W \beta} \;, \tag{3.1}
\]

where β is the discriminant direction used to project the data, and $\Sigma_B$ and $\Sigma_W$ are the $p \times p$ between-class and within-class covariance matrices respectively, defined (for a K-class problem) as

\[
\Sigma_W = \frac{1}{n} \sum_{k=1}^{K} \sum_{i \in G_k} (x_i - \mu_k)(x_i - \mu_k)^\top \;, \qquad
\Sigma_B = \frac{1}{n} \sum_{k=1}^{K} \sum_{i \in G_k} (\mu - \mu_k)(\mu - \mu_k)^\top \;,
\]

where μ is the sample mean of the whole dataset, $\mu_k$ the sample mean of class k, and $G_k$ indexes the observations of class k.
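For concreteness, the sketch below (my illustration, not thesis code; it assumes $\Sigma_W$ is invertible, hence n > p) computes the sample within-class and between-class covariances as defined above and extracts discriminant directions as leading eigenvectors of $\Sigma_W^{-1}\Sigma_B$.

```python
import numpy as np

def fisher_lda_directions(X, y, n_directions):
    """Discriminant directions as leading eigenvectors of Sigma_W^{-1} Sigma_B.
    X: (n, p) data matrix; y: (n,) integer class labels."""
    n, p = X.shape
    mu = X.mean(axis=0)
    Sigma_W = np.zeros((p, p))
    Sigma_B = np.zeros((p, p))
    for k in np.unique(y):
        Xk = X[y == k]
        mu_k = Xk.mean(axis=0)
        Sigma_W += (Xk - mu_k).T @ (Xk - mu_k) / n
        Sigma_B += len(Xk) * np.outer(mu - mu_k, mu - mu_k) / n
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Sigma_W, Sigma_B))
    order = np.argsort(eigvals.real)[::-1]
    return eigvecs[:, order[:n_directions]].real

rng = np.random.default_rng(6)
shifts = np.array([[0, 0, 0, 0, 0], [3, 0, 0, 0, 0], [0, 3, 0, 0, 0]], dtype=float)
X = np.vstack([rng.standard_normal((40, 5)) + s for s in shifts])
y = np.repeat([0, 1, 2], 40)
print(fisher_lda_directions(X, y, n_directions=2))
```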


This analysis can be extended to the multi-class framework with K groups. In this case, K − 1 discriminant vectors $\beta_k$ may be computed. Such a generalization was first proposed by Rao (1948). Several formulations of the multi-class Fisher's discriminant are available, for example as the maximization of a trace ratio:

\[
\max_{B \in \mathbb{R}^{p \times (K-1)}} \; \frac{\operatorname{tr}\!\left(B^\top \Sigma_B B\right)}{\operatorname{tr}\!\left(B^\top \Sigma_W B\right)} \;, \tag{3.2}
\]

where the matrix B is built with the discriminant directions $\beta_k$ as columns. Solving the multi-class criterion (3.2) is an ill-posed problem; a better formulation is based on a series of K − 1 subproblems:

\[
\begin{aligned}
\max_{\beta_k \in \mathbb{R}^p} \;\; & \beta_k^\top \Sigma_B \beta_k \\
\text{s.t.} \;\; & \beta_k^\top \Sigma_W \beta_k \le 1 \;, \\
& \beta_k^\top \Sigma_W \beta_\ell = 0 \quad \forall \ell < k \;.
\end{aligned} \tag{3.3}
\]

The maximizer of subproblem k is the eigenvector of $\Sigma_W^{-1} \Sigma_B$ associated with the k-th largest eigenvalue (see Appendix C).

3.2 Feature Selection in LDA Problems

LDA is often used as a data reduction technique, where the K − 1 discriminant directions summarize the p original variables. However, all variables intervene in the definition of these discriminant directions, and this behavior may be troublesome.

Several modifications of LDA have been proposed to generate sparse discriminant directions. Sparse LDA reveals discriminant directions that only involve a few variables. This sparsity mainly targets reducing the dimensionality of the problem (as in genetic analysis), but parsimonious classification is also motivated by the need for interpretable models, robustness in the solution, or computational constraints.

The easiest approach to sparse LDA performs variable selection before discrimination. The relevance of each feature is usually assessed with univariate statistics, which are fast and convenient to compute, but whose very partial view of the overall classification problem may lead to dramatic information loss. As a result, several approaches have been devised in recent years to construct LDA with wrapper and embedded feature selection capabilities.

They can be categorized according to the LDA formulation that provides the basis for the sparsity-inducing extension, that is, either Fisher's discriminant analysis (variance-based) or a regression-based formulation.

3.2.1 Inertia Based

The Fisher discriminant seeks a projection maximizing the separability of classes from inertia principles: mass centers should be far away (large between-class variance) and classes should be concentrated around their mass centers (small within-class variance). This view motivates a first series of sparse LDA formulations.

Moghaddam et al. (2006) propose an algorithm for sparse LDA in binary classification, where sparsity originates in a hard cardinality constraint. The formalization is based on the Fisher's discriminant (3.1), reformulated as a quadratically-constrained quadratic program (3.3). Computationally, the algorithm implements a combinatorial search with some eigenvalue properties that are used to avoid exploring subsets of possible solutions. Extensions of this approach have been developed, with new sparsity bounds for the two-class discrimination problem and shortcuts to speed up the evaluation of eigenvalues (Moghaddam et al., 2007).

Also for binary problems, Wu et al. (2009) proposed a sparse LDA applied to gene expression data, where the Fisher's discriminant (3.1) is solved as

\[
\begin{aligned}
\min_{\beta \in \mathbb{R}^p} \;\; & \beta^\top \Sigma_W \beta \\
\text{s.t.} \;\; & (\mu_1 - \mu_2)^\top \beta = 1 \;, \\
& \textstyle\sum_{j=1}^{p} |\beta_j| \le t \;,
\end{aligned}
\]

where $\mu_1$ and $\mu_2$ are the vectors of mean gene expression values corresponding to the two groups. The expression to optimize and the first constraint match problem (3.1); the second constraint encourages parsimony.

Witten and Tibshirani (2011) describe a multi-class technique using the Fisher's discriminant, rewritten in the form of K − 1 constrained and penalized maximization problems:

\[
\begin{aligned}
\max_{\beta_k \in \mathbb{R}^p} \;\; & \beta_k^\top \Sigma_B^{k} \beta_k - P_k(\beta_k) \\
\text{s.t.} \;\; & \beta_k^\top \Sigma_W \beta_k \le 1 \;.
\end{aligned}
\]

The term to maximize is the projected between-class covariance $\beta_k^\top \Sigma_B \beta_k$, subject to an upper bound on the projected within-class covariance $\beta_k^\top \Sigma_W \beta_k$. The penalty $P_k(\beta_k)$ is added to avoid singularities and induce sparsity. The authors suggest weighted versions of the regular Lasso and fused Lasso penalties for general-purpose data: the Lasso shrinks less informative variables to zero, and the fused Lasso encourages a piecewise constant $\beta_k$ vector. The R code is available from the website of Daniela Witten.

Cai and Liu (2011) use the Fisher's discriminant to solve a binary LDA problem. But instead of performing separate estimations of $\Sigma_W$ and $(\mu_1 - \mu_2)$ to obtain the optimal solution $\beta = \Sigma_W^{-1}(\mu_1 - \mu_2)$, they estimate the product directly through constrained $L_1$ minimization:

\[
\begin{aligned}
\min_{\beta \in \mathbb{R}^p} \;\; & \|\beta\|_1 \\
\text{s.t.} \;\; & \left\| \Sigma \beta - (\mu_1 - \mu_2) \right\|_\infty \le \lambda \;.
\end{aligned}
\]

Sparsity is encouraged by the $L_1$ norm of vector β, and the parameter λ is used to tune the optimization.


Most of the algorithms reviewed are conceived for binary classification. And for those that are envisaged for multi-class scenarios, the Lasso is the most popular way to induce sparsity; however, as discussed in Section 2.3.5, the Lasso is not the best tool to encourage parsimonious models when there are multiple discriminant directions.

3.2.2 Regression Based

In binary classification, LDA has been known to be equivalent to linear regression of scaled class labels since Fisher (1936). For K > 2, many studies show that multivariate linear regression of a specific class indicator matrix can be applied as a preprocessing step for LDA. However, directly casting LDA as a least squares problem is challenging for the multi-class case (Duda et al., 2000; Friedman et al., 2009).

Predefined Indicator Matrix

Multi-class classification is usually linked with linear regression through the definition of an indicator matrix (Friedman et al., 2009). An indicator matrix Y is an $n \times K$ matrix containing the class labels of all samples. There are several well-known types in the literature. For example, the binary or dummy indicator ($y_{ik} = 1$ if sample i belongs to class k and $y_{ik} = 0$ otherwise) is commonly used in linking multi-class classification with linear regression (Friedman et al., 2009). Another "popular" choice is $y_{ik} = 1$ if sample i belongs to class k and $y_{ik} = -1/(K-1)$ otherwise; it was used, for example, in extending Support Vector Machines to multi-class classification (Lee et al., 2004) or for generalizing the kernel target alignment measure (Guermeur et al., 2004).

There are some works which propose a formulation of the least squares problem based on a new class indicator matrix (Ye, 2007). This new indicator matrix allows the definition of LS-LDA (Least Squares Linear Discriminant Analysis), which holds a rigorous equivalence with a multi-class LDA under a mild condition, shown empirically to hold in many applications involving high-dimensional data.

Qiao et al. (2009) propose a discriminant analysis in the high-dimensional, low-sample setting which incorporates variable selection in a Fisher's LDA formulated as a generalized eigenvalue problem, which is then recast as a least squares regression. Sparsity is obtained by means of a Lasso penalty on the discriminant vectors. Even if this is not mentioned in the article, their formulation looks very close in spirit to Optimal Scoring regression. Some rather clumsy steps in the developments hinder the comparison, so that further investigations are required. The lack of publicly available code also restrained an empirical test of this conjecture. If the similitude is confirmed, their formalization would be very close to the one of Clemmensen et al. (2011), reviewed in the following section.

In a recent paper, Mai et al. (2012) take advantage of the equivalence between ordinary least squares and LDA problems to propose a binary classifier solving a penalized least squares problem with a Lasso penalty. The sparse version of the projection vector β is obtained by solving

\[
\min_{\beta \in \mathbb{R}^p,\, \beta_0 \in \mathbb{R}} \; n^{-1} \sum_{i=1}^{n} (y_i - \beta_0 - x_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \;,
\]

where $y_i$ is the binary indicator of the label of pattern $x_i$. Even if the authors focus on the Lasso penalty, they also suggest any other generic sparsity-inducing penalty. The decision rule $x^\top\beta + \beta_0 > 0$ is the LDA classifier when it is built using the resulting β vector for λ = 0, but a different intercept $\beta_0$ is required.

Optimal Scoring

In binary classification, the regression of (scaled) class indicators enables the recovery of exactly the LDA discriminant direction. For more than two classes, regressing predefined indicator matrices may be impaired by the masking effect, where the scores assigned to a class situated between two other ones never dominate (Hastie et al., 1994). Optimal scoring (OS) circumvents the problem by assigning "optimal scores" to the classes. This route was opened by Fisher (1936) for binary classification and pursued for more than two classes by Breiman and Ihaka (1984), with the aim of developing a non-linear extension of discriminant analysis based on additive models. They named their approach optimal scaling, for it optimizes the scaling of the indicators of classes together with the discriminant functions. Their approach was later disseminated under the name optimal scoring by Hastie et al. (1994), who proposed several extensions of LDA, either aiming at constructing more flexible discriminants (Hastie and Tibshirani, 1996) or more conservative ones (Hastie et al., 1995).

As an alternative method to solve LDA problems, Hastie et al. (1995) proposed to incorporate a smoothness prior on the discriminant directions in the OS problem, through a positive-definite penalty matrix Ω, leading to a problem expressed in compact form as

\[
\begin{aligned}
\min_{\Theta,\, B} \;\; & \|Y\Theta - XB\|_F^2 + \lambda \operatorname{tr}\!\left(B^\top \Omega B\right) & \text{(3.4a)} \\
\text{s.t.} \;\; & n^{-1}\, \Theta^\top Y^\top Y \Theta = I_{K-1} \;, & \text{(3.4b)}
\end{aligned}
\]

where $\Theta \in \mathbb{R}^{K \times (K-1)}$ are the class scores, $B \in \mathbb{R}^{p \times (K-1)}$ are the regression coefficients, and $\|\cdot\|_F$ is the Frobenius norm. This compact form does not render the order that arises naturally when considering the following series of K − 1 problems:

\[
\begin{aligned}
\min_{\theta_k \in \mathbb{R}^K,\, \beta_k \in \mathbb{R}^p} \;\; & \|Y\theta_k - X\beta_k\|^2 + \beta_k^\top \Omega \beta_k & \text{(3.5a)} \\
\text{s.t.} \;\; & n^{-1}\, \theta_k^\top Y^\top Y \theta_k = 1 & \text{(3.5b)} \\
& \theta_k^\top Y^\top Y \theta_\ell = 0 \;, \quad \ell = 1, \ldots, k-1 \;, & \text{(3.5c)}
\end{aligned}
\]

where each $\beta_k$ corresponds to a discriminant direction.


Several sparse LDA methods have been derived by introducing non-quadratic sparsity-inducing penalties in the OS regression problem (Ghosh and Chinnaiyan, 2005; Leng, 2008; Grosenick et al., 2008; Clemmensen et al., 2011). Grosenick et al. (2008) proposed a variant of the lasso-based penalized OS of Ghosh and Chinnaiyan (2005) by introducing an elastic-net penalty in binary class problems. A generalization to multi-class problems was suggested by Clemmensen et al. (2011), where the objective function (3.5a) is replaced by

\[
\min_{\beta_k \in \mathbb{R}^p,\, \theta_k \in \mathbb{R}^K} \; \sum_{k} \|Y\theta_k - X\beta_k\|_2^2 + \lambda_1 \|\beta_k\|_1 + \lambda_2\, \beta_k^\top \Omega \beta_k \;,
\]

where $\lambda_1$ and $\lambda_2$ are regularization parameters and Ω is a penalization matrix, often taken to be the identity for the elastic net. The code for SLDA is available from the website of Line Clemmensen.

Another generalization of the work of Ghosh and Chinnaiyan (2005) was proposed by Leng (2008), with an extension to the multi-class framework based on a group-Lasso penalty in the objective function (3.5a):

\[
\min_{\beta_k \in \mathbb{R}^p,\, \theta_k \in \mathbb{R}^K} \; \sum_{k=1}^{K-1} \|Y\theta_k - X\beta_k\|_2^2 + \lambda \Bigg( \sum_{j=1}^{p} \sqrt{\sum_{k=1}^{K-1} \beta_{kj}^2} \Bigg)^{2} \;, \tag{3.6}
\]

which is the criterion that was chosen in this thesis. The following chapters present our theoretical and algorithmic contributions regarding this formulation. The proposal of Leng (2008) was heuristically driven, and his algorithm closely followed the group-Lasso algorithm of Yuan and Lin (2006), which is not very efficient (the experiments of Leng (2008) are limited to small data sets with hundreds of examples and 1000 preselected genes, and no code is provided). Here we formally link (3.6) to penalized LDA and propose a publicly available efficient code for solving this problem.


4 Formalizing the Objective

In this chapter we detail the rationale supporting the Group-Lasso Optimal Scoring Solver (GLOSS) algorithm. GLOSS addresses a sparse LDA problem globally, through a regression approach. Our analysis formally relates GLOSS to Fisher's discriminant analysis and also enables the derivation of variants, such as LDA assuming a diagonal within-class covariance structure (Bickel and Levina, 2004).

The sparsity arises from the group-Lasso penalty (3.6), due to Leng (2008), that selects the same features in all discriminant directions, thus providing an interpretable low-dimensional representation of the data. For K classes, this representation can be either complete, in dimension K − 1, or partial, for a reduced-rank classification. The first two or three discriminants can also be used to display a graphical summary of the data.

The derivation of penalized LDA as a penalized optimal scoring regression is quite tedious, but it is required here since the algorithm hinges on this equivalence. The main lines have been derived in several places (Breiman and Ihaka, 1984; Hastie et al., 1994; Hastie and Tibshirani, 1996; Hastie et al., 1995) and were already used before for sparsity-inducing penalties (Roth and Lange, 2004). However, the published demonstrations were quite elusive on a number of points, leading to generalizations that were not supported in a rigorous way. To our knowledge, we disclosed the first formal equivalence between the optimal scoring regression problem penalized by the group-Lasso and penalized LDA (Sanchez Merchante et al., 2012).

4.1 From Optimal Scoring to Linear Discriminant Analysis

Following Hastie et al. (1995), we now show the equivalence between the series of problems encountered in penalized optimal scoring (p-OS) problems and in penalized LDA (p-LDA) problems, by going through canonical correlation analysis. We first provide some properties of the solutions of an arbitrary problem in the p-OS series (3.5).

Throughout this chapter we assume that:

• there is no empty class, that is, the diagonal matrix $Y^\top Y$ is full rank;

• inputs are centered, that is, $X^\top 1_n = 0$;

• the quadratic penalty Ω is positive semidefinite and such that $X^\top X + \Omega$ is full rank.


4.1.1 Penalized Optimal Scoring Problem

For the sake of simplicity, we now drop subscript k to refer to any problem in the p-OS series (3.5). First note that Problems (3.5) are biconvex in (θ, β), that is, convex in θ for each β value and vice versa. The problems are however non-convex: in particular, if (θ, β) is a solution, then (−θ, −β) is also a solution.

The orthogonality constraint (3.5c) inherently limits the number of possible problems in the series to K, since we assumed that there are no empty classes. Moreover, as X is centered, the K − 1 first optimal scores are orthogonal to 1 (and the K-th problem would be solved by $\beta_K = 0$). All the problems considered here can be solved by a singular value decomposition of a real symmetric matrix, so that the orthogonality constraints are easily dealt with. Hence, in the sequel, we do not mention these orthogonality constraints (3.5c), which apply all along the route, so as to simplify all expressions. The generic problem solved is thus

\[
\begin{aligned}
\min_{\theta \in \mathbb{R}^K,\, \beta \in \mathbb{R}^p} \;\; & \|Y\theta - X\beta\|^2 + \beta^\top \Omega \beta & \text{(4.1a)} \\
\text{s.t.} \;\; & n^{-1}\, \theta^\top Y^\top Y \theta = 1 \;. & \text{(4.1b)}
\end{aligned}
\]

For a given score vector θ, the discriminant direction β that minimizes the p-OS criterion (4.1) is the penalized least squares estimator

\[
\beta_{os} = \left(X^\top X + \Omega\right)^{-1} X^\top Y \theta \;. \tag{4.2}
\]

The objective function (4.1a) is then

\[
\begin{aligned}
\|Y\theta - X\beta_{os}\|^2 + \beta_{os}^\top \Omega \beta_{os}
&= \theta^\top Y^\top Y \theta - 2\, \theta^\top Y^\top X \beta_{os} + \beta_{os}^\top \left(X^\top X + \Omega\right) \beta_{os} \\
&= \theta^\top Y^\top Y \theta - \theta^\top Y^\top X \left(X^\top X + \Omega\right)^{-1} X^\top Y \theta \;,
\end{aligned}
\]

where the second line stems from the definition of $\beta_{os}$ (4.2). Now, using the fact that the optimal θ obeys constraint (4.1b), the optimization problem is equivalent to

\[
\max_{\theta:\; n^{-1}\theta^\top Y^\top Y\theta = 1} \; \theta^\top Y^\top X \left(X^\top X + \Omega\right)^{-1} X^\top Y \theta \;, \tag{4.3}
\]

which shows that the optimization of the p-OS problem with respect to $\theta_k$ boils down to finding the k-th largest eigenvector of $Y^\top X \left(X^\top X + \Omega\right)^{-1} X^\top Y$. Indeed, Appendix C details that Problem (4.3) is solved by

\[
(Y^\top Y)^{-1} Y^\top X \left(X^\top X + \Omega\right)^{-1} X^\top Y \theta = \alpha^2 \theta \;, \tag{4.4}
\]


where $\alpha^2$ is the maximal eigenvalue¹:

\[
\begin{aligned}
n^{-1}\, \theta^\top Y^\top X \left(X^\top X + \Omega\right)^{-1} X^\top Y \theta &= \alpha^2\, n^{-1}\, \theta^\top (Y^\top Y)\, \theta \\
n^{-1}\, \theta^\top Y^\top X \left(X^\top X + \Omega\right)^{-1} X^\top Y \theta &= \alpha^2 \;. \tag{4.5}
\end{aligned}
\]

¹ The awkward notation $\alpha^2$ for the eigenvalue was chosen here to ease comparison with Hastie et al. (1995); it is easy to check that this eigenvalue is indeed non-negative (see Equation (4.5), for example).

4.1.2 Penalized Canonical Correlation Analysis

As per Hastie et al. (1995), the penalized Canonical Correlation Analysis (p-CCA) problem between variables X and Y is defined as follows:

\[
\begin{aligned}
\max_{\theta \in \mathbb{R}^K,\, \beta \in \mathbb{R}^p} \;\; & n^{-1}\, \theta^\top Y^\top X \beta & \text{(4.6a)} \\
\text{s.t.} \;\; & n^{-1}\, \theta^\top Y^\top Y \theta = 1 & \text{(4.6b)} \\
& n^{-1}\, \beta^\top \left(X^\top X + \Omega\right) \beta = 1 \;. & \text{(4.6c)}
\end{aligned}
\]

The solutions to (4.6) are obtained by finding saddle points of the Lagrangian:

\[
\begin{aligned}
n L(\beta, \theta, \nu, \gamma) &= \theta^\top Y^\top X \beta - \nu\left(\theta^\top Y^\top Y \theta - n\right) - \gamma\left(\beta^\top (X^\top X + \Omega) \beta - n\right) \\
\Rightarrow \; n \frac{\partial L(\beta, \theta, \gamma, \nu)}{\partial \beta} &= X^\top Y \theta - 2\gamma (X^\top X + \Omega) \beta \\
\Rightarrow \; \beta_{cca} &= \frac{1}{2\gamma} (X^\top X + \Omega)^{-1} X^\top Y \theta \;.
\end{aligned}
\]

Then, as $\beta_{cca}$ obeys (4.6c), we obtain

\[
\beta_{cca} = \frac{(X^\top X + \Omega)^{-1} X^\top Y \theta}{\sqrt{n^{-1}\, \theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \theta}} \;, \tag{4.7}
\]

so that the optimal objective function (4.6a) can be expressed with θ alone:

\[
n^{-1}\, \theta^\top Y^\top X \beta_{cca}
= \frac{n^{-1}\, \theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \theta}{\sqrt{n^{-1}\, \theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \theta}}
= \sqrt{n^{-1}\, \theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \theta} \;,
\]

and the optimization problem with respect to θ can be restated as

\[
\max_{\theta:\; n^{-1}\theta^\top Y^\top Y\theta = 1} \; \theta^\top Y^\top X \left(X^\top X + \Omega\right)^{-1} X^\top Y \theta \;. \tag{4.8}
\]

Hence the p-OS and p-CCA problems produce the same optimal score vectors θ. The regression coefficients are thus proportional, as shown by (4.2) and (4.7):

\[
\beta_{os} = \alpha\, \beta_{cca} \;, \tag{4.9}
\]

where α is defined by (4.5).

The p-CCA optimization problem can also be written as a function of β alone, using the optimality conditions for θ:

\[
\begin{aligned}
n \frac{\partial L(\beta, \theta, \gamma, \nu)}{\partial \theta} &= Y^\top X \beta - 2\nu\, Y^\top Y \theta \\
\Rightarrow \; \theta_{cca} &= \frac{1}{2\nu} (Y^\top Y)^{-1} Y^\top X \beta \;. \tag{4.10}
\end{aligned}
\]

Then, as $\theta_{cca}$ obeys (4.6b), we obtain

\[
\theta_{cca} = \frac{(Y^\top Y)^{-1} Y^\top X \beta}{\sqrt{n^{-1}\, \beta^\top X^\top Y (Y^\top Y)^{-1} Y^\top X \beta}} \;, \tag{4.11}
\]

leading to the following expression of the optimal objective function:

\[
n^{-1}\, \theta_{cca}^\top Y^\top X \beta
= \frac{n^{-1}\, \beta^\top X^\top Y (Y^\top Y)^{-1} Y^\top X \beta}{\sqrt{n^{-1}\, \beta^\top X^\top Y (Y^\top Y)^{-1} Y^\top X \beta}}
= \sqrt{n^{-1}\, \beta^\top X^\top Y (Y^\top Y)^{-1} Y^\top X \beta} \;.
\]

The p-CCA problem can thus be solved with respect to β by plugging this value into (4.6):

\[
\begin{aligned}
\max_{\beta \in \mathbb{R}^p} \;\; & n^{-1}\, \beta^\top X^\top Y (Y^\top Y)^{-1} Y^\top X \beta & \text{(4.12a)} \\
\text{s.t.} \;\; & n^{-1}\, \beta^\top \left(X^\top X + \Omega\right) \beta = 1 \;, & \text{(4.12b)}
\end{aligned}
\]

where the positive objective function has been squared compared to (4.6). This formulation is important since it will be used to link p-CCA to p-LDA. We thus derive its solution: following the reasoning of Appendix C, $\beta_{cca}$ verifies

\[
n^{-1}\, X^\top Y (Y^\top Y)^{-1} Y^\top X \beta_{cca} = \lambda \left(X^\top X + \Omega\right) \beta_{cca} \;, \tag{4.13}
\]

where λ is the maximal eigenvalue, shown below to be equal to $\alpha^2$:

\[
\begin{aligned}
& n^{-1}\, \beta_{cca}^\top X^\top Y (Y^\top Y)^{-1} Y^\top X \beta_{cca} = \lambda \\
\Rightarrow \;& n^{-1} \alpha^{-1}\, \beta_{cca}^\top X^\top Y (Y^\top Y)^{-1} Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \theta = \lambda \\
\Rightarrow \;& n^{-1} \alpha\, \beta_{cca}^\top X^\top Y \theta = \lambda \\
\Rightarrow \;& n^{-1}\, \theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \theta = \lambda \\
\Rightarrow \;& \alpha^2 = \lambda \;.
\end{aligned}
\]

The first line is obtained by obeying constraint (4.12b); the second line follows from the relationship (4.7), whose denominator is α; the third line comes from (4.4); the fourth line uses again the relationship (4.7); and the last one uses the definition of α (4.5).


4.1.3 Penalized Linear Discriminant Analysis

Still following Hastie et al. (1995), the penalized Linear Discriminant Analysis is defined as follows:

\[
\begin{aligned}
\max_{\beta \in \mathbb{R}^p} \;\; & \beta^\top \Sigma_B \beta & \text{(4.14a)} \\
\text{s.t.} \;\; & \beta^\top \left(\Sigma_W + n^{-1}\Omega\right) \beta = 1 \;, & \text{(4.14b)}
\end{aligned}
\]

where $\Sigma_B$ and $\Sigma_W$ are respectively the sample between-class and within-class covariance matrices of the original p-dimensional data. This problem may be solved by an eigenvector decomposition, as detailed in Appendix C.

As the feature matrix X is assumed to be centered, the sample total, between-class and within-class covariance matrices can be written in a simple matrix form, using the projection operator $Y\left(Y^\top Y\right)^{-1}Y^\top$:

\[
\begin{aligned}
\Sigma_T &= \frac{1}{n} \sum_{i=1}^{n} x_i x_i^\top = n^{-1} X^\top X \;, \\
\Sigma_B &= \frac{1}{n} \sum_{k=1}^{K} n_k\, \mu_k \mu_k^\top = n^{-1} X^\top Y \left(Y^\top Y\right)^{-1} Y^\top X \;, \\
\Sigma_W &= \frac{1}{n} \sum_{k=1}^{K} \sum_{i:\, y_{ik}=1} (x_i - \mu_k)(x_i - \mu_k)^\top = n^{-1} \left(X^\top X - X^\top Y \left(Y^\top Y\right)^{-1} Y^\top X\right) \;.
\end{aligned}
\]

Using these formulae, the solution to the p-LDA problem (4.14) is obtained as

\[
\begin{aligned}
X^\top Y \left(Y^\top Y\right)^{-1} Y^\top X \beta_{lda} &= \lambda \left(X^\top X + \Omega - X^\top Y \left(Y^\top Y\right)^{-1} Y^\top X\right) \beta_{lda} \\
X^\top Y \left(Y^\top Y\right)^{-1} Y^\top X \beta_{lda} &= \frac{\lambda}{1-\lambda} \left(X^\top X + \Omega\right) \beta_{lda} \;.
\end{aligned}
\]

The comparison of the last equation with $\beta_{cca}$ (4.13) shows that $\beta_{lda}$ and $\beta_{cca}$ are proportional, and that $\lambda/(1-\lambda) = \alpha^2$. Using constraints (4.12b) and (4.14b), it comes that

\[
\begin{aligned}
\beta_{lda} &= (1 - \alpha^2)^{-\frac{1}{2}}\, \beta_{cca} \\
&= \alpha^{-1} (1 - \alpha^2)^{-\frac{1}{2}}\, \beta_{os} \;,
\end{aligned}
\]

which ends the path from p-OS to p-LDA.


4.1.4 Summary

The three previous subsections considered a generic form of the k-th problem in the p-OS series. The relationships unveiled above also hold for the compact notation gathering all problems (3.4), which is recalled below:

\[
\begin{aligned}
\min_{\Theta,\, B} \;\; & \|Y\Theta - XB\|_F^2 + \lambda \operatorname{tr}\!\left(B^\top \Omega B\right) \\
\text{s.t.} \;\; & n^{-1}\, \Theta^\top Y^\top Y \Theta = I_{K-1} \;.
\end{aligned}
\]

Let A represent the $(K-1) \times (K-1)$ diagonal matrix whose elements $\alpha_k$ are the square roots of the largest eigenvalues of $Y^\top X \left(X^\top X + \Omega\right)^{-1} X^\top Y$; we have

\[
\begin{aligned}
B_{LDA} &= B_{CCA} \left(I_{K-1} - A^2\right)^{-\frac{1}{2}} \\
&= B_{OS}\, A^{-1} \left(I_{K-1} - A^2\right)^{-\frac{1}{2}} \;, \tag{4.15}
\end{aligned}
\]

where $I_{K-1}$ is the $(K-1) \times (K-1)$ identity matrix. At this point, the feature matrix X, which in the input space has dimensions $n \times p$, can be projected into the optimal scoring domain as an $n \times (K-1)$ matrix $X_{OS} = X B_{OS}$, or into the linear discriminant analysis space as an $n \times (K-1)$ matrix $X_{LDA} = X B_{LDA}$. Classification can be performed in any of those domains if the appropriate distance (penalized within-class covariance matrix) is applied.

With the aim of performing classification, the whole process can be summarized as follows:

1. Solve the p-OS problem as $B_{OS} = \left(X^\top X + \lambda\Omega\right)^{-1} X^\top Y \Theta$, where Θ are the K − 1 leading eigenvectors of $Y^\top X \left(X^\top X + \lambda\Omega\right)^{-1} X^\top Y$.

2. Translate the data samples X into the LDA domain as $X_{LDA} = X B_{OS} D$, where $D = A^{-1}\left(I_{K-1} - A^2\right)^{-\frac{1}{2}}$.

3. Compute the matrix M of centroids $\mu_k$ from $X_{LDA}$ and Y.

4. Evaluate the distances $d(x, \mu_k)$ in the LDA domain as a function of M and $X_{LDA}$.

5. Translate distances into posterior probabilities and assign every sample i to a class k following the maximum a posteriori rule.

6. Graphical representation.

The solution of the penalized optimal scoring regression and the computation of the distance and posterior matrices are detailed in Section 4.2.1, Section 4.2.2 and Section 4.2.3, respectively.

4.2 Practicalities

4.2.1 Solution of the Penalized Optimal Scoring Regression

Following Hastie et al. (1994) and Hastie et al. (1995), a quadratically penalized LDA problem can be presented as a quadratically penalized OS problem:

\[
\begin{aligned}
\min_{\Theta \in \mathbb{R}^{K \times (K-1)},\, B \in \mathbb{R}^{p \times (K-1)}} \;\; & \|Y\Theta - XB\|_F^2 + \lambda \operatorname{tr}\!\left(B^\top \Omega B\right) & \text{(4.16a)} \\
\text{s.t.} \;\; & n^{-1}\, \Theta^\top Y^\top Y \Theta = I_{K-1} \;, & \text{(4.16b)}
\end{aligned}
\]

where Θ are the class scores, B the regression coefficients, and $\|\cdot\|_F$ is the Frobenius norm.

Though non-convex, the OS problem is readily solved by a decomposition in Θ and B: the optimal $B_{OS}$ does not intervene in the optimality conditions with respect to Θ, and the optimization with respect to B is obtained in closed form as a linear combination of the optimal scores Θ (Hastie et al., 1995). The algorithm may seem a bit tortuous considering the properties mentioned above, as it proceeds in four steps:

1. Initialize Θ to $\Theta^0$ such that $n^{-1}\, \Theta^{0\top} Y^\top Y \Theta^0 = I_{K-1}$.

2. Compute $B = \left(X^\top X + \lambda\Omega\right)^{-1} X^\top Y \Theta^0$.

3. Set Θ to be the K − 1 leading eigenvectors of $Y^\top X \left(X^\top X + \lambda\Omega\right)^{-1} X^\top Y$.

4. Compute the optimal regression coefficients

\[
B_{OS} = \left(X^\top X + \lambda\Omega\right)^{-1} X^\top Y \Theta \;. \tag{4.17}
\]

Defining $\Theta^0$ in Step 1 instead of using directly Θ as expressed in Step 3 drastically reduces the computational burden of the eigen-analysis: the latter is performed on $\Theta^{0\top} Y^\top X \left(X^\top X + \lambda\Omega\right)^{-1} X^\top Y \Theta^0$, which is computed as $\Theta^{0\top} Y^\top X B$, thus avoiding a costly matrix inversion. The solution of the penalized optimal scoring problem as an eigenvector decomposition is detailed and justified in Appendix B.
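A compact sketch of this quadratically penalized OS solver is given below (my own illustrative transcription, with Ω = I_p; since K is small here, the eigen-analysis is carried out directly on the K × K matrix in the $Y^\top Y$ metric, which bypasses the $\Theta^0$ device described above and omits the computational refinements of GLOSS detailed in Chapter 5).

```python
import numpy as np

def penalized_optimal_scoring(X, Y, lam):
    """Quadratically penalized optimal scoring (4.16), with Omega = I_p.
    X: centered (n, p) data matrix; Y: (n, K) binary class-indicator matrix.
    Returns (B_OS, Theta)."""
    n, p = X.shape
    K = Y.shape[1]
    counts = Y.sum(axis=0)                               # class sizes, Y'Y = diag(counts)
    XtY = X.T @ Y
    B = np.linalg.solve(X.T @ X + lam * np.eye(p), XtY)  # (X'X + lam I)^{-1} X'Y
    M = XtY.T @ B                                        # Y'X (X'X + lam I)^{-1} X'Y
    d_isqrt = 1.0 / np.sqrt(counts)
    M_white = (M * d_isqrt) * d_isqrt[:, None]           # D^{-1/2} M D^{-1/2}
    eigvals, U = np.linalg.eigh(M_white)
    U = U[:, np.argsort(eigvals)[::-1][: K - 1]]         # K-1 leading eigenvectors
    Theta = np.sqrt(n) * (U * d_isqrt[:, None])          # n^{-1} Theta' Y'Y Theta = I
    B_OS = B @ Theta                                     # optimal coefficients (4.17)
    return B_OS, Theta

rng = np.random.default_rng(7)
X = rng.standard_normal((90, 6))
X -= X.mean(axis=0)                                      # inputs must be centered
labels = np.repeat([0, 1, 2], 30)
Y = np.eye(3)[labels]
B_OS, Theta = penalized_optimal_scoring(X, Y, lam=1.0)
print(B_OS.shape, Theta.shape)                           # (6, 2) (3, 2)
```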

This four-step algorithm is valid when the penalty is of the form $B^\top \Omega B$. However, when an $L_1$ penalty is applied in (4.16), the optimization algorithm requires iterative updates of B and Θ. That situation is developed by Clemmensen et al. (2011), where a Lasso or an Elastic net penalty is used to induce sparsity in the OS problem. Furthermore, these Lasso and Elastic net penalties do not enjoy the equivalence with LDA problems.

4.2.2 Distance Evaluation

The simplest classification rule is the Nearest Centroid rule where the sample xi isassigned to class k if sample xi is closer (in terms of the shared within-class Mahalanobisdistance) to centroid microk than to any other centroid micro` In general the parameters of themodel are unknown and the rule is applied with the parameters estimated from trainingdata (sample estimators microk and ΣW) If microk are the centroids in the input space samplexi is assigned to the class k if the distance

d(xi microk) = (xi minus microk)gtΣminus1WΩ(xi minus microk)minus 2 log

(nkn

) (418)

is minimized among all k In expression (418) the first term is the Mahalanobis distancein the input space and the second term is an adjustment term for unequal class sizes thatestimates the prior probability of class k Note that this is inspired by the Gaussian viewof LDA and that another definition of the adjustment term could be used (Friedmanet al 2009 Mai et al 2012) The matrix ΣWΩ used in (418) is the penalized within-class covariance matrix that can be decomposed in a penalized and a non-penalizedcomponent

$$\hat\Sigma_{W\Omega}^{-1} = \left(n^{-1}(X^\top X + \lambda\Omega) - \hat\Sigma_B\right)^{-1} = \left(n^{-1}X^\top X - \hat\Sigma_B + n^{-1}\lambda\Omega\right)^{-1} = \left(\hat\Sigma_W + n^{-1}\lambda\Omega\right)^{-1} . \qquad (4.19)$$

Before explaining how to compute the distances, let us summarize some clarifying points:

• The solution $B_{\mathrm{OS}}$ of the p-OS problem is enough to accomplish classification.

• In the LDA domain (space of discriminant variates $X_{\mathrm{LDA}}$), classification is based on Euclidean distances.

• Classification can be done in a reduced-rank space of dimension $R < K-1$, by using the first $R$ discriminant directions $\{\beta_k\}_{k=1}^{R}$.

As a result, the expression of the distance (4.18) depends on the domain where the classification is performed. If we classify in the p-OS domain,
$$\left\|(x_i - \hat\mu_k)\, B_{\mathrm{OS}}\right\|^2_{\Sigma_{W\Omega}} - 2\log(\hat\pi_k) ,$$
where $\hat\pi_k$ is the estimated class prior and $\|\cdot\|_S$ is the Mahalanobis distance assuming within-class covariance $S$. If classification is done in the p-LDA domain,
$$\left\|(x_i - \hat\mu_k)\, B_{\mathrm{OS}} A^{-1}\left(I_{K-1} - A^2\right)^{-\frac12}\right\|_2^2 - 2\log(\hat\pi_k) ,$$
which is a plain Euclidean distance.


4.2.3 Posterior Probability Evaluation

Let $d(x, \hat\mu_k)$ be the distance between sample $x$ and centroid $\hat\mu_k$, defined as in (4.18). Under the assumption that classes are Gaussian, the posterior probabilities $p(y^k = 1|x)$ can be estimated as

$$\hat p(y^k = 1|x) \propto \exp\left(-\frac{d(x, \hat\mu_k)}{2}\right) \propto \hat\pi_k \exp\left(-\frac{1}{2}\left\|(x - \hat\mu_k)\, B_{\mathrm{OS}} A^{-1}(I_{K-1} - A^2)^{-\frac12}\right\|_2^2\right) . \qquad (4.20)$$

Those probabilities must be normalized to ensure that they sum to one. When the distances $d(x, \hat\mu_k)$ take large values, $\exp\left(-d(x,\hat\mu_k)/2\right)$ can take extremely small values, generating underflow issues. A classical trick to fix this numerical issue is detailed below:

$$\hat p(y^k=1|x) = \frac{\hat\pi_k \exp\left(-\frac{d(x,\hat\mu_k)}{2}\right)}{\displaystyle\sum_{\ell} \hat\pi_\ell \exp\left(-\frac{d(x,\hat\mu_\ell)}{2}\right)} = \frac{\hat\pi_k \exp\left(\frac{-d(x,\hat\mu_k) + d_{\max}}{2}\right)}{\displaystyle\sum_{\ell} \hat\pi_\ell \exp\left(\frac{-d(x,\hat\mu_\ell) + d_{\max}}{2}\right)} ,$$
where $d_{\max} = \max_k d(x, \hat\mu_k)$.
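A minimal MATLAB sketch of this normalization trick is given below; the variable names (D for the $n\times K$ matrix of distances, prior for the estimated class proportions) are assumptions made for the illustration, not names from the GLOSS package.

```matlab
% Underflow-safe posterior probabilities from distances, following the trick above.
% Assumed inputs: D (n x K matrix of distances d(x_i, mu_k)), prior (1 x K vector of class priors).
Dshift = bsxfun(@minus, D, max(D, [], 2));      % d(x, mu_k) - d_max per row, so exponents are >= 0
P      = bsxfun(@times, prior, exp(-Dshift/2)); % unnormalized posteriors; the largest distance maps to exp(0)
P      = bsxfun(@rdivide, P, sum(P, 2));        % normalize: each row sums to one
```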

4.2.4 Graphical Representation

Sometimes it can be useful to have a graphical display of the data set. Using only the two or three most discriminant directions may not provide the best separation between classes, but it can suffice to inspect the data. That can be accomplished by plotting the first two or three dimensions of the regression fits $X_{\mathrm{OS}}$ or of the discriminant variates $X_{\mathrm{LDA}}$, depending on whether we are presenting the dataset in the OS or in the LDA domain. Other attributes, such as the centroids or the shape of the within-class variance, can also be represented.

4.3 From Sparse Optimal Scoring to Sparse LDA

The equivalence stated in Section 4.1 holds for quadratic penalties of the form $\beta^\top\Omega\beta$, under the assumption that $Y^\top Y$ and $X^\top X + \lambda\Omega$ are full rank (fulfilled when there are no empty classes and $\Omega$ is positive definite). Quadratic penalties have interesting properties but, as recalled in Section 2.3, they do not induce sparsity. In this respect, $L_1$ penalties are preferable, but they lack a connection between p-LDA and p-OS such as the one stated by Hastie et al. (1995).

In this section we introduce the tools used to obtain sparse models while maintaining the equivalence between p-LDA and p-OS problems. We use a group-Lasso penalty (see


Section 2.3.4) that induces groups of zeros on the coefficients corresponding to the same feature in all discriminant directions, resulting in truly parsimonious models. Our derivation uses a variational formulation of the group-Lasso to generalize the equivalence drawn by Hastie et al. (1995) for quadratic penalties. We therefore intend to show that our formulation of the group-Lasso can be written in the quadratic form $B^\top\Omega B$.

4.3.1 A Quadratic Variational Form

Quadratic variational forms of the Lasso and group-Lasso have been proposed shortly after the original Lasso paper of Tibshirani (1996), as a means to address optimization issues, but also as an inspiration for generalizing the Lasso penalty (Grandvalet, 1998; Canu and Grandvalet, 1999). The algorithms based on these quadratic variational forms iteratively reweight a quadratic penalty. They are now often outperformed by more efficient strategies (Bach et al., 2012).

Our formulation of the group-Lasso is shown below:

$$\min_{\tau\in\mathbb{R}^p}\ \min_{B\in\mathbb{R}^{p\times(K-1)}} \; J(B) + \lambda \sum_{j=1}^{p} \frac{w_j^2\,\|\beta^j\|_2^2}{\tau_j} \qquad (4.21\mathrm{a})$$
$$\text{s.t.}\quad \sum_j \tau_j - \sum_j w_j\|\beta^j\|_2 \le 0 \qquad (4.21\mathrm{b})$$
$$\qquad\ \ \tau_j \ge 0 ,\ j = 1,\dots,p \qquad (4.21\mathrm{c})$$

where $B\in\mathbb{R}^{p\times(K-1)}$ is a matrix composed of row vectors $\beta^j\in\mathbb{R}^{K-1}$, $B = (\beta^{1\top},\dots,\beta^{p\top})^\top$, and $w_j$ are predefined nonnegative weights. The cost function $J(B)$ in our context is the OS regression loss $\|Y\Theta - XB\|_2^2$; from now on, for the sake of simplicity, we simply write $J(B)$. Here and in what follows, $b/\tau$ is defined by continuation at zero, as $b/0 = +\infty$ if $b \ne 0$ and $0/0 = 0$. Note that variants of (4.21) have been proposed elsewhere (see e.g. Canu and Grandvalet, 1999; Bach et al., 2012, and references therein).

The intuition behind our approach is that, using the variational formulation, we recast a non-quadratic expression into the convex hull of a family of quadratic penalties defined by the variables $\tau_j$. This is graphically shown in Figure 4.1.

Let us start by proving the equivalence of our variational formulation and the standard group-Lasso (an alternative variational formulation is detailed and demonstrated in Appendix D).

Lemma 4.1. The quadratic penalty in $\beta^j$ in (4.21) acts as the group-Lasso penalty $\lambda\sum_{j=1}^p w_j\|\beta^j\|_2$.

Proof. The Lagrangian of Problem (4.21) is
$$\mathcal{L} = J(B) + \lambda\sum_{j=1}^p \frac{w_j^2\|\beta^j\|_2^2}{\tau_j} + \nu_0\left(\sum_{j=1}^p \tau_j - \sum_{j=1}^p w_j\|\beta^j\|_2\right) - \sum_{j=1}^p \nu_j\tau_j .$$


Figure 4.1: Graphical representation of the variational approach to the group-Lasso.

Thus, the first-order optimality conditions for $\tau_j$ are
$$\frac{\partial\mathcal{L}}{\partial\tau_j}(\tau_j^\star) = 0 \;\Leftrightarrow\; -\lambda\frac{w_j^2\|\beta^j\|_2^2}{\tau_j^{\star 2}} + \nu_0 - \nu_j = 0 \;\Leftrightarrow\; -\lambda w_j^2\|\beta^j\|_2^2 + \nu_0\,\tau_j^{\star 2} - \nu_j\,\tau_j^{\star 2} = 0 \;\Rightarrow\; -\lambda w_j^2\|\beta^j\|_2^2 + \nu_0\,\tau_j^{\star 2} = 0 .$$

The last line is obtained from complementary slackness, which implies here $\nu_j\tau_j^\star = 0$. Complementary slackness states that $\nu_j g_j(\tau_j^\star) = 0$, where $\nu_j$ is the Lagrange multiplier for the constraint $g_j(\tau_j) \le 0$. As a result, the optimal value of $\tau_j$ is

$$\tau_j^\star = \sqrt{\frac{\lambda\, w_j^2\|\beta^j\|_2^2}{\nu_0}} = \sqrt{\frac{\lambda}{\nu_0}}\; w_j\|\beta^j\|_2 . \qquad (4.22)$$

We note that $\nu_0 \ne 0$ if there is at least one coefficient $\beta_{jk} \ne 0$; thus the inequality constraint (4.21b) is at bound (due to complementary slackness):

$$\sum_{j=1}^p \tau_j^\star - \sum_{j=1}^p w_j\|\beta^j\|_2 = 0 , \qquad (4.23)$$

so that $\tau_j^\star = w_j\|\beta^j\|_2$. Plugging this value into (4.21a), it is possible to conclude that Problem (4.21) is equivalent to the standard group-Lasso problem

$$\min_{B\in\mathbb{R}^{p\times(K-1)}} \; J(B) + \lambda\sum_{j=1}^p w_j\|\beta^j\|_2 . \qquad (4.24)$$

We have thus presented a convex quadratic variational form of the group-Lasso and demonstrated its equivalence with the standard group-Lasso formulation.


With Lemma 4.1, we have proved that, under constraints (4.21b)-(4.21c), the quadratic problem (4.21a) is equivalent to the standard formulation of the group-Lasso (4.24). The penalty term of (4.21a) can be conveniently presented as $\lambda\,\mathrm{tr}(B^\top\Omega B)$, where

$$\Omega = \mathrm{diag}\left(\frac{w_1^2}{\tau_1}, \frac{w_2^2}{\tau_2}, \dots, \frac{w_p^2}{\tau_p}\right) , \qquad (4.25)$$

with $\tau_j = w_j\|\beta^j\|_2$, resulting in the diagonal components
$$(\Omega)_{jj} = \frac{w_j}{\|\beta^j\|_2} . \qquad (4.26)$$

Hence, as stated at the beginning of this section, the equivalence between p-LDA and p-OS problems is demonstrated for the variational formulation. This equivalence is crucial to the derivation of the link between sparse OS and sparse LDA; it furthermore suggests a convenient implementation. We sketch below some properties that are instrumental in the implementation of the active set strategy described in Section 5.

The first property states that the quadratic formulation is convex when $J$ is convex, thus providing an easy control of optimality and convergence.

Lemma 4.2. If $J$ is convex, Problem (4.21) is convex.

Proof. The function $g(\beta, \tau) = \|\beta\|_2^2/\tau$, known as the perspective function of $f(\beta) = \|\beta\|_2^2$, is convex in $(\beta, \tau)$ (see e.g. Boyd and Vandenberghe, 2004, Chapter 3), and the constraints (4.21b)-(4.21c) define convex admissible sets; hence Problem (4.21) is jointly convex with respect to $(B, \tau)$.

In what follows, $J$ will be a convex quadratic (hence smooth) function, in which case a necessary and sufficient optimality condition is that zero belongs to the subdifferential of the objective function, whose expression is provided in the following lemma.

Lemma 4.3. For all $B\in\mathbb{R}^{p\times(K-1)}$, the subdifferential of the objective function of Problem (4.24) is
$$\left\{V\in\mathbb{R}^{p\times(K-1)} :\; V = \frac{\partial J(B)}{\partial B} + \lambda G\right\} , \qquad (4.27)$$
where $G\in\mathbb{R}^{p\times(K-1)}$ is a matrix composed of row vectors $g^j\in\mathbb{R}^{K-1}$, $G = (g^{1\top},\dots,g^{p\top})^\top$, defined as follows. Let $\mathcal{S}(B)$ denote the columnwise support of $B$, $\mathcal{S}(B) = \{j\in\{1,\dots,p\} : \|\beta^j\|_2 \ne 0\}$; then we have
$$\forall j\in\mathcal{S}(B) ,\quad g^j = w_j\|\beta^j\|_2^{-1}\beta^j , \qquad (4.28)$$
$$\forall j\notin\mathcal{S}(B) ,\quad \|g^j\|_2 \le w_j . \qquad (4.29)$$


This condition results in an equality for the "active" non-zero vectors $\beta^j$ and an inequality for the other ones, which both provide essential building blocks of our algorithm.

Proof. When $\|\beta^j\|_2 \ne 0$, the gradient of the penalty with respect to $\beta^j$ is
$$\frac{\partial\left(\lambda\sum_{m=1}^p w_m\|\beta^m\|_2\right)}{\partial\beta^j} = \lambda\, w_j \frac{\beta^j}{\|\beta^j\|_2} . \qquad (4.30)$$

At $\|\beta^j\|_2 = 0$, the gradient of the objective function is not continuous, and the optimality conditions then make use of the subdifferential (Bach et al., 2011):
$$\partial_{\beta^j}\left(\lambda\sum_{m=1}^p w_m\|\beta^m\|_2\right) = \partial_{\beta^j}\left(\lambda w_j\|\beta^j\|_2\right) = \left\{\lambda w_j v :\; v\in\mathbb{R}^{K-1},\ \|v\|_2 \le 1\right\} . \qquad (4.31)$$

This gives expression (4.29).

Lemma 4.4. Problem (4.21) admits at least one solution, which is unique if $J$ is strictly convex. All critical points $B^\star$ of the objective function verifying the following conditions are global minima:
$$\forall j\in\mathcal{S} ,\quad \frac{\partial J(B^\star)}{\partial\beta^j} + \lambda w_j\|\beta^{\star j}\|_2^{-1}\beta^{\star j} = 0 , \qquad (4.32\mathrm{a})$$
$$\forall j\notin\mathcal{S} ,\quad \left\|\frac{\partial J(B^\star)}{\partial\beta^j}\right\|_2 \le \lambda w_j , \qquad (4.32\mathrm{b})$$

where $\mathcal{S}\subseteq\{1,\dots,p\}$ denotes the set of non-zero row vectors $\beta^{\star j}$, and $\bar{\mathcal{S}}$ is its complement.

Lemma 4.4 provides a simple appraisal of the support of the solution, which would not be as easily handled with the direct analysis of the variational problem (4.21).

4.3.2 Group-Lasso OS as Penalized LDA

With all the previous ingredients, the group-Lasso Optimal Scoring Solver for performing sparse LDA can be introduced.

Proposition 4.1. The group-Lasso OS problem
$$B_{\mathrm{OS}} = \mathop{\mathrm{argmin}}_{B\in\mathbb{R}^{p\times(K-1)}}\ \min_{\Theta\in\mathbb{R}^{K\times(K-1)}} \ \frac{1}{2}\|Y\Theta - XB\|_F^2 + \lambda\sum_{j=1}^p w_j\|\beta^j\|_2$$
$$\text{s.t.}\quad n^{-1}\,\Theta^\top Y^\top Y\Theta = I_{K-1}$$


is equivalent to the penalized LDA problem
$$B_{\mathrm{LDA}} = \mathop{\mathrm{argmax}}_{B\in\mathbb{R}^{p\times(K-1)}} \ \mathrm{tr}\left(B^\top\hat\Sigma_B B\right) \quad \text{s.t.}\quad B^\top\left(\hat\Sigma_W + n^{-1}\lambda\Omega\right)B = I_{K-1} ,$$
where $\Omega = \mathrm{diag}\left(\frac{w_1^2}{\tau_1},\dots,\frac{w_p^2}{\tau_p}\right)$, with
$$\Omega_{jj} = \begin{cases} +\infty & \text{if } \beta^j_{\mathrm{OS}} = 0 ,\\ w_j\|\beta^j_{\mathrm{OS}}\|_2^{-1} & \text{otherwise.} \end{cases} \qquad (4.33)$$

That is, $B_{\mathrm{LDA}} = B_{\mathrm{OS}}\,\mathrm{diag}\left(\alpha_k^{-1}(1-\alpha_k^2)^{-1/2}\right)$, where $\alpha_k\in(0,1)$ is the $k$th leading eigenvalue of
$$n^{-1}\, Y^\top X\left(X^\top X + \lambda\Omega\right)^{-1} X^\top Y .$$

Proof. The proof simply consists in applying the result of Hastie et al. (1995), which holds for quadratic penalties, to the quadratic variational form of the group-Lasso.

The proposition applies in particular to the Lasso-based OS approaches to sparse LDA (Grosenick et al., 2008; Clemmensen et al., 2011) for $K = 2$, that is, for binary classification, or more generally for a single discriminant direction. Note however that it leads to a slightly different decision rule if the decision threshold is chosen a priori according to the Gaussian assumption for the features. For more than one discriminant direction, the equivalence does not hold any more, since the Lasso penalty does not result in an equivalent quadratic penalty in the simple form $\mathrm{tr}\left(B^\top\Omega B\right)$.


5 GLOSS Algorithm

The efficient approaches developed for the Lasso take advantage of the sparsity of the solution by solving a series of small linear systems, whose sizes are incrementally increased/decreased (Osborne et al., 2000a). This approach was also pursued for the group-Lasso in its standard formulation (Roth and Fischer, 2008). We adapt this algorithmic framework to the variational form (4.21), with $J(B) = \frac{1}{2}\|Y\Theta - XB\|_F^2$.

The algorithm belongs to the working set family of optimization methods (see Section 2.3.6). It starts from a sparse initial guess, say $B = 0$, thus defining the set $\mathcal{A}$ of "active" variables currently identified as non-zero. Then it iterates the three steps summarized below.

1. Update the coefficient matrix $B$ within the current active set $\mathcal{A}$, where the optimization problem is smooth. First the quadratic penalty is updated, and then a standard penalized least squares fit is computed.

2. Check the optimality conditions (4.32) with respect to the active variables. One or more $\beta^j$ may be declared inactive when they vanish from the current solution.

3. Check the optimality conditions (4.32) with respect to inactive variables. If they are satisfied, the algorithm returns the current solution, which is optimal. If they are not satisfied, the variable corresponding to the greatest violation is added to the active set.

This mechanism is graphically represented in Figure 5.1 as a block diagram, and formalized in more detail in Algorithm 1. Note that this formulation uses the equations from the variational approach detailed in Section 4.3.1. If we want to use the alternative variational approach from Appendix D, then we have to replace Equations (4.21), (4.32a) and (4.32b) by (D.1), (D.10a) and (D.10b), respectively.

5.1 Regression Coefficients Updates

Step 1 of Algorithm 1 updates the coefficient matrix $B$ within the current active set $\mathcal{A}$. The quadratic variational form of the problem suggests a blockwise optimization strategy, consisting in solving $(K-1)$ independent $\mathrm{card}(\mathcal{A})$-dimensional problems instead of a single $(K-1)\times\mathrm{card}(\mathcal{A})$-dimensional problem. The interaction between the $(K-1)$ problems is relegated to the common adaptive quadratic penalty $\Omega$. This decomposition is especially attractive as we then solve $(K-1)$ similar systems
$$\left(X_\mathcal{A}^\top X_\mathcal{A} + \lambda\Omega\right)\beta_k = X_\mathcal{A}^\top Y\theta_k^0 , \qquad (5.1)$$


[Figure: flowchart of the active set mechanism — initialization of the model ($\lambda$, $B$) and of the active set $\{j : \|\beta^j\|_2 > 0\}$; solution of the p-OS problem so that $B$ satisfies the first optimality condition; tests of the optimality conditions to move variables between the active and inactive sets; final computation of $\Theta$ and update of $B$.]

Figure 5.1: GLOSS block diagram.


Algorithm 1 Adaptively Penalized Optimal Scoring

Input: $X$, $Y$, $B$, $\lambda$
Initialize: $\mathcal{A} \leftarrow \{j\in\{1,\dots,p\} : \|\beta^j\|_2 > 0\}$; $\Theta^0$ such that $n^{-1}\,\Theta^{0\top}Y^\top Y\Theta^0 = I_{K-1}$; convergence $\leftarrow$ false
repeat
  Step 1: solve (4.21) in $B$, assuming $\mathcal{A}$ optimal
  repeat
    $\Omega \leftarrow \mathrm{diag}(\omega_j)_{j\in\mathcal{A}}$, with $\omega_j \leftarrow \|\beta^j\|_2^{-1}$
    $B_\mathcal{A} \leftarrow \left(X_\mathcal{A}^\top X_\mathcal{A} + \lambda\Omega\right)^{-1} X_\mathcal{A}^\top Y\Theta^0$
  until condition (4.32a) holds for all $j\in\mathcal{A}$
  Step 2: identify inactivated variables
  for $j\in\mathcal{A}$ such that $\|\beta^j\|_2 = 0$ do
    if optimality condition (4.32b) holds then
      $\mathcal{A} \leftarrow \mathcal{A}\setminus\{j\}$; go back to Step 1
    end if
  end for
  Step 3: check the greatest violation of optimality condition (4.32b) in $\bar{\mathcal{A}}$
  $\hat\jmath = \mathop{\mathrm{argmax}}_{j\notin\mathcal{A}} \left\|\partial J/\partial\beta^j\right\|_2$
  if $\left\|\partial J/\partial\beta^{\hat\jmath}\right\|_2 < \lambda$ then
    convergence $\leftarrow$ true ($B$ is optimal)
  else
    $\mathcal{A} \leftarrow \mathcal{A}\cup\{\hat\jmath\}$
  end if
until convergence
$(s, V) \leftarrow$ eigenanalyze$\left(\Theta^{0\top}Y^\top X_\mathcal{A} B\right)$, that is, $\Theta^{0\top}Y^\top X_\mathcal{A} B\, V_k = s_k V_k$, $k = 1,\dots,K-1$
$\Theta \leftarrow \Theta^0 V$; $B \leftarrow B V$; $\alpha_k \leftarrow n^{-1/2}\, s_k^{1/2}$, $k = 1,\dots,K-1$
Output: $\Theta$, $B$, $\alpha$


where $X_\mathcal{A}$ denotes the columns of $X$ indexed by $\mathcal{A}$, and $\beta_k$ and $\theta_k^0$ denote the $k$th columns of $B$ and $\Theta^0$, respectively. These linear systems only differ in the right-hand-side term, so that a single Cholesky decomposition is necessary to solve all systems, whereas a blockwise Newton-Raphson method based on the standard group-Lasso formulation would result in different "penalties" $\Omega$ for each system.

5.1.1 Cholesky decomposition

Dropping the subscripts and considering the $(K-1)$ systems together, (5.1) leads to
$$(X^\top X + \lambda\Omega)\,B = X^\top Y\Theta . \qquad (5.2)$$
Defining the Cholesky decomposition as $C^\top C = (X^\top X + \lambda\Omega)$, (5.2) is solved efficiently as follows:
$$C^\top C B = X^\top Y\Theta$$
$$C B = C^\top \backslash\, X^\top Y\Theta$$
$$B = C \backslash\, C^\top \backslash\, X^\top Y\Theta , \qquad (5.3)$$
where the symbol "$\backslash$" is the matlab mldivide operator that efficiently solves linear systems. The GLOSS code implements (5.3).
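For illustration, a minimal MATLAB sketch of this update is given below; it assumes that X, Y, Theta, Omega and lambda are already available in the workspace, and it is not the code of the GLOSS package.

```matlab
% Solve (X'X + lambda*Omega) B = X'Y*Theta with a single Cholesky factorization, as in (5.3).
C   = chol(X'*X + lambda*Omega);    % upper triangular factor, C'*C = X'*X + lambda*Omega
RHS = X' * (Y*Theta);               % common right-hand sides, one column per system
B   = C \ (C' \ RHS);               % two triangular solves via mldivide
```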

5.1.2 Numerical Stability

The OS regression coefficients are obtained by (5.2), where the penalizer $\Omega$ is iteratively updated by (4.33). In this iterative process, when a variable is about to leave the active set, the corresponding entry of $\Omega$ reaches very large values, thereby driving some OS regression coefficients to zero. These large values may cause numerical stability problems in the Cholesky decomposition of $X^\top X + \lambda\Omega$. This difficulty can be avoided using the following equivalent expression:

$$B = \Omega^{-\frac12}\left(\Omega^{-\frac12}X^\top X\,\Omega^{-\frac12} + \lambda I\right)^{-1}\Omega^{-\frac12}X^\top Y\Theta^0 , \qquad (5.4)$$

where the conditioning of $\Omega^{-\frac12}X^\top X\,\Omega^{-\frac12} + \lambda I$ is always well-behaved provided $X$ is appropriately normalized (recall that $0 \le 1/\omega_j \le 1$). This more stable expression demands more computation and is thus reserved for cases with large $\omega_j$ values. Our code is otherwise based on expression (5.2).
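A possible MATLAB sketch of this stabilized update is shown below, assuming a diagonal Omega as in (4.25); it is an illustration, not the package implementation.

```matlab
% Stabilized solve (5.4): whiten by Omega^{-1/2} before factorizing.
% Assumed: Omega is diagonal (possibly with very large entries), X is n x p, lambda > 0.
Oih = diag(1 ./ sqrt(diag(Omega)));     % Omega^{-1/2} (diagonal, so large entries become safe small ones)
B   = Oih * ((Oih*(X'*X)*Oih + lambda*eye(size(X,2))) \ (Oih * (X'*(Y*Theta0))));
```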

5.2 Score Matrix

The optimal score matrix $\Theta$ is made of the $K-1$ leading eigenvectors of $Y^\top X\left(X^\top X + \lambda\Omega\right)^{-1}X^\top Y$. This eigen-analysis is actually solved in the form $\Theta^\top Y^\top X\left(X^\top X + \lambda\Omega\right)^{-1}X^\top Y\Theta$ (see Section 4.2.1 and Appendix B). The latter eigenvector decomposition does not require the costly computation of $\left(X^\top X + \lambda\Omega\right)^{-1}$, which


involves the inversion of a $p\times p$ matrix. Let $\Theta^0$ be an arbitrary $K\times(K-1)$ matrix whose range includes the $K-1$ leading eigenvectors of $Y^\top X\left(X^\top X + \lambda\Omega\right)^{-1}X^\top Y$.¹ Then, solving the $K-1$ systems (5.3) provides the value of $B^0 = (X^\top X + \lambda\Omega)^{-1}X^\top Y\Theta^0$. This $B^0$ matrix can be identified in the expression to eigen-analyze as
$$\Theta^{0\top}Y^\top X\left(X^\top X + \lambda\Omega\right)^{-1}X^\top Y\Theta^0 = \Theta^{0\top}Y^\top X B^0 .$$
Thus, the solution to the penalized OS problem can be computed through the singular value decomposition of the $(K-1)\times(K-1)$ matrix $\Theta^{0\top}Y^\top X B^0 = V\Lambda V^\top$. Defining $\Theta = \Theta^0 V$, we have $\Theta^\top Y^\top X\left(X^\top X + \lambda\Omega\right)^{-1}X^\top Y\Theta = \Lambda$, and when $\Theta^0$ is chosen such that $n^{-1}\,\Theta^{0\top}Y^\top Y\Theta^0 = I_{K-1}$, we also have that $n^{-1}\,\Theta^\top Y^\top Y\Theta = I_{K-1}$, so that the constraints of the p-OS problem hold. Hence, assuming that the diagonal elements of $\Lambda$ are sorted in decreasing order, $\theta_k$ is an optimal solution to the p-OS problem. Finally, once $\Theta$ has been computed, the corresponding optimal regression coefficients $B$ satisfying (5.2) are simply recovered using the mapping from $\Theta^0$ to $\Theta$, that is, $B = B^0 V$. Appendix E details why the computational trick described here for quadratic penalties can be applied to the group-Lasso, for which $\Omega$ is defined by a variational formulation.

5.3 Optimality Conditions

GLOSS uses an active set optimization technique to obtain the optimal values of the coefficient matrix $B$ and the score matrix $\Theta$. To be a solution, the coefficient matrix must obey Lemmas 4.3 and 4.4. Optimality conditions (4.32a) and (4.32b) can be deduced from those lemmas. Both expressions require the computation of the gradient of the objective function
$$\frac{1}{2}\|Y\Theta - XB\|_2^2 + \lambda\sum_{j=1}^p w_j\|\beta^j\|_2 . \qquad (5.5)$$

Let $J(B)$ be the data-fitting term $\frac{1}{2}\|Y\Theta - XB\|_2^2$. Its gradient with respect to the $j$th row of $B$, $\beta^j$, is the $(K-1)$-dimensional vector
$$\frac{\partial J(B)}{\partial\beta^j} = x_j^\top(XB - Y\Theta) ,$$

where $x_j$ is the $j$th column of $X$. Hence, the first optimality condition (4.32a) can be computed for every variable $j$ as
$$x_j^\top(XB - Y\Theta) + \lambda w_j\frac{\beta^j}{\|\beta^j\|_2} .$$

¹As $X$ is centered, $1_K$ belongs to the null space of $Y^\top X\left(X^\top X + \lambda\Omega\right)^{-1}X^\top Y$. It is thus sufficient to choose $\Theta^0$ orthogonal to $1_K$ to ensure that its range spans the leading eigenvectors of $Y^\top X\left(X^\top X + \lambda\Omega\right)^{-1}X^\top Y$. In practice, to comply with this desideratum and conditions (3.5b) and (3.5c), we set $\Theta^0 = \left(Y^\top Y\right)^{-\frac12}U$, where $U$ is a $K\times(K-1)$ matrix whose columns are orthonormal vectors orthogonal to $1_K$.


The second optimality condition (4.32b) can be computed for every variable $j$ as
$$\left\|x_j^\top(XB - Y\Theta)\right\|_2 \le \lambda w_j .$$
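The following MATLAB fragment sketches how these two checks can be evaluated in practice; the tolerance and variable names are illustrative assumptions, not part of the original algorithm description.

```matlab
% Optimality checks (4.32a)-(4.32b).
% Assumed: B (p x K-1), Theta, X, Y, weights w (p x 1), penalty lambda, tolerance tol.
G        = X' * (X*B - Y*Theta);                          % row j is dJ/dbeta^j
rowNorm  = sqrt(sum(B.^2, 2));
active   = rowNorm > 0;
KKT1     = G(active,:) + lambda * bsxfun(@times, w(active)./rowNorm(active), B(active,:));
okActive = isempty(KKT1) || max(abs(KKT1(:))) <= tol;     % condition (4.32a), up to tolerance
okInact  = all(sqrt(sum(G(~active,:).^2, 2)) <= lambda * w(~active));   % condition (4.32b)
```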

5.4 Active and Inactive Sets

The feature selection mechanism embedded in GLOSS selects the variables that provide the greatest decrease in the objective function. This is accomplished by means of the optimality conditions (4.32a) and (4.32b). Let $\mathcal{A}$ be the active set with the variables that have already been considered relevant. A variable $j$ can be considered for inclusion into the active set if it violates the second optimality condition. We proceed one variable at a time, by choosing the one that is expected to produce the greatest decrease in the objective function:
$$\hat\jmath = \mathop{\mathrm{argmax}}_{j}\ \max\left(\left\|x_j^\top(XB - Y\Theta)\right\|_2 - \lambda w_j ,\, 0\right) .$$

The exclusion of a variable belonging to the active set $\mathcal{A}$ is considered if the norm $\|\beta^j\|_2$ is small and if, after setting $\beta^j$ to zero, the following optimality condition holds:
$$\left\|x_j^\top(XB - Y\Theta)\right\|_2 \le \lambda w_j .$$

The process continues until no variable in the active set violates the first optimality condition and no variable in the inactive set violates the second optimality condition.

5.5 Penalty Parameter

The penalty parameter can be specified by the user, in which case GLOSS solves the problem with this value of $\lambda$. The other strategy is to compute the solution path for several values of $\lambda$: GLOSS then looks for the maximum value of the penalty parameter $\lambda_{\max}$ such that $B \ne 0$, and solves the p-OS problem for decreasing values of $\lambda$ until a prescribed number of features is declared active.

The maximum value of the penalty parameter $\lambda_{\max}$ corresponding to a null $B$ matrix is obtained by computing the optimality condition (4.32b) at $B = 0$:
$$\lambda_{\max} = \max_{j\in\{1,\dots,p\}} \frac{1}{w_j}\left\|x_j^\top Y\Theta^0\right\|_2 .$$

The algorithm then computes a series of solutions along the regularization path defined by a series of penalties $\lambda_1 = \lambda_{\max} > \dots > \lambda_t > \dots > \lambda_T = \lambda_{\min} \ge 0$, by regularly decreasing the penalty, $\lambda_{t+1} = \lambda_t/2$, and using a warm-start strategy where the feasible initial guess for $B(\lambda_{t+1})$ is initialized with $B(\lambda_t)$. The final penalty parameter $\lambda_{\min}$ is specified in the optimization process when the maximum number of desired active variables is attained (by default, the minimum of $n$ and $p$).
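The path strategy can be sketched in MATLAB as follows; gloss_fit is a hypothetical fitting routine standing for the solver of the previous sections, and the variable names and stopping rule are assumptions made for the illustration.

```matlab
% Regularization path with halved penalties and warm starts (illustrative sketch).
% Assumed: X (n x p), Y, Theta0, weights w (p x 1); gloss_fit is a hypothetical solver.
[n, p]     = size(X);
lambda_max = max(sqrt(sum((X'*(Y*Theta0)).^2, 2)) ./ w);   % condition (4.32b) at B = 0
lambda     = lambda_max;
B          = zeros(p, size(Theta0, 2));                    % B(lambda_max) = 0
path       = {};
while nnz(any(B, 2)) < min(n, p)
    lambda = lambda / 2;                                   % regularly decreasing penalty
    B      = gloss_fit(X, Y, lambda, B);                   % warm start from the previous solution
    path{end+1} = struct('lambda', lambda, 'B', B);        %#ok<AGROW>
end
```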


5.6 Options and Variants

5.6.1 Scaling Variables

As with most penalization schemes, GLOSS is sensitive to the scaling of variables. It thus makes sense to normalize them before applying the algorithm or, equivalently, to accommodate weights in the penalty. This option is available in the algorithm.

5.6.2 Sparse Variant

This version replaces some matlab commands used in the standard version of GLOSS by their sparse equivalents. In addition, some mathematical structures are adapted for sparse computation.

5.6.3 Diagonal Variant

We motivated the group-Lasso penalty by sparsity requisites, but robustness considerations could also drive its usage, since LDA is known to be unstable when the number of examples is small compared to the number of variables. In this context, LDA has been experimentally observed to benefit from unrealistic assumptions on the form of the estimated within-class covariance matrix. Indeed, the diagonal approximation, which ignores correlations between genes, may lead to better classification in microarray analysis. Bickel and Levina (2004) showed that this crude approximation provides a classifier with better worst-case performance than the LDA decision rule in small sample size regimes, even if variables are correlated.

The equivalence proof between penalized OS and penalized LDA (Hastie et al., 1995) reveals that quadratic penalties in the OS problem are equivalent to penalties on the within-class covariance matrix in the LDA formulation. This proof suggests a slight variant of penalized OS, corresponding to penalized LDA with diagonal within-class covariance matrix, where the least squares problems
$$\min_{B\in\mathbb{R}^{p\times(K-1)}} \|Y\Theta - XB\|_F^2 = \min_{B\in\mathbb{R}^{p\times(K-1)}} \mathrm{tr}\left(\Theta^\top Y^\top Y\Theta - 2\,\Theta^\top Y^\top XB + n\,B^\top\hat\Sigma_T B\right)$$
are replaced by
$$\min_{B\in\mathbb{R}^{p\times(K-1)}} \mathrm{tr}\left(\Theta^\top Y^\top Y\Theta - 2\,\Theta^\top Y^\top XB + n\,B^\top\left(\hat\Sigma_B + \mathrm{diag}\left(\hat\Sigma_W\right)\right)B\right) .$$
Note that this variant only requires $\mathrm{diag}(\hat\Sigma_W) + \hat\Sigma_B + n^{-1}\Omega$ to be positive definite, which is a weaker requirement than $\hat\Sigma_T + n^{-1}\Omega$ positive definite.

5.6.4 Elastic net and Structured Variant

For some learning problems, the structure of correlations between variables is partially known. Hastie et al. (1995) applied this idea to the field of handwritten digit recognition


Figure 5.2: Graph and Laplacian matrix for a $3\times 3$ image (pixels numbered 1 to 9, from the bottom-left corner to the top-right corner), with
$$\Omega_L = \begin{pmatrix}
3 & -1 & 0 & -1 & -1 & 0 & 0 & 0 & 0\\
-1 & 5 & -1 & -1 & -1 & -1 & 0 & 0 & 0\\
0 & -1 & 3 & 0 & -1 & -1 & 0 & 0 & 0\\
-1 & -1 & 0 & 5 & -1 & 0 & -1 & -1 & 0\\
-1 & -1 & -1 & -1 & 8 & -1 & -1 & -1 & -1\\
0 & -1 & -1 & 0 & -1 & 5 & 0 & -1 & -1\\
0 & 0 & 0 & -1 & -1 & 0 & 3 & -1 & 0\\
0 & 0 & 0 & -1 & -1 & -1 & -1 & 5 & -1\\
0 & 0 & 0 & 0 & -1 & -1 & 0 & -1 & 3
\end{pmatrix}$$

for their penalized discriminant analysis model, to constrain the discriminant directions to be spatially smooth.

When an image is represented as a vector of pixels, it is reasonable to assume positive correlations between the variables corresponding to neighboring pixels. Figure 5.2 represents the neighborhood graph of pixels in a $3\times 3$ image, with the corresponding Laplacian matrix. The Laplacian matrix $\Omega_L$ is positive semi-definite, and the penalty $\beta^\top\Omega_L\beta$ favors, among vectors of identical $L_2$ norms, the ones having similar coefficients in the neighborhoods of the graph. For example, this penalty is 9 for the vector $(1,1,0,1,1,0,0,0,0)^\top$, which is the indicator of pixel 1 and its neighbors, and it is 21 for the vector $(-1,1,0,1,1,0,0,0,0)^\top$, with a sign mismatch between pixel 1 and its neighborhood.

This smoothness penalty can be imposed jointly with the group-Lasso. From the computational point of view, GLOSS hardly needs to be modified: the smoothness penalty just has to be added to the group-Lasso penalty, as sketched below. As the new penalty is convex and quadratic (thus smooth), there is no additional burden in the overall algorithm. There is, however, an additional hyperparameter to be tuned.
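The following MATLAB sketch shows one possible way to build such a grid Laplacian and combine it with the adaptive quadratic penalty; the grid size, the 8-connected neighborhood and the hyperparameter name lambda2 are assumptions made for the illustration, not the S-GLOSS package code.

```matlab
% Laplacian of an nr x nc pixel grid with 8-connected neighborhoods, as in Figure 5.2
% (nr = nc = 3) or for the 16 x 16 USPS images used in Section 6.5.
nr = 3;  nc = 3;
[r, c] = ndgrid(1:nr, 1:nc);
coord  = [r(:), c(:)];                          % pixel coordinates (numbered columnwise here)
np     = nr * nc;
A      = zeros(np);                             % adjacency matrix
for i = 1:np
    for j = i+1:np
        if max(abs(coord(i,:) - coord(j,:))) == 1   % neighbors, including diagonals
            A(i,j) = 1;  A(j,i) = 1;
        end
    end
end
OmegaL = diag(sum(A, 2)) - A;                   % graph Laplacian
% The structured variant would then use lambda*Omega + lambda2*OmegaL as the quadratic
% penalty, e.g. in the inner systems (5.1); lambda2 is the additional hyperparameter.
```

For the $3\times 3$ grid, `v = [1 1 0 1 1 0 0 0 0]'; v'*OmegaL*v` returns 9, the penalty of the indicator of a corner pixel and its neighbors discussed above.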


6 Experimental Results

This section presents some comparison results between the Group-Lasso Optimal Scoring Solver algorithm and two other state-of-the-art classifiers proposed to perform sparse LDA. Those algorithms are Penalized LDA (PLDA) (Witten and Tibshirani, 2011), which applies a Lasso penalty within a Fisher's LDA framework, and Sparse Linear Discriminant Analysis (SLDA) (Clemmensen et al., 2011), which applies an Elastic net penalty to the OS problem. With the aim of testing the parsimony capacities, the latter algorithm was tested without any quadratic penalty, that is, with a Lasso penalty. The implementations of PLDA and SLDA are available from the authors' websites: PLDA is an R implementation and SLDA is coded in matlab. All the experiments used the same training, validation and test sets. Note that they differ significantly from the ones of Witten and Tibshirani (2011) in Simulation 4, for which there was a typo in their paper.

6.1 Normalization

With shrunken estimates, the scaling of features has important outcomes. For the linear discriminants considered here, the two most common normalization strategies consist in setting either the diagonal of the total covariance matrix $\hat\Sigma_T$ to ones, or the diagonal of the within-class covariance matrix $\hat\Sigma_W$ to ones. These options can be implemented either by scaling the observations accordingly prior to the analysis, or by providing penalties with weights. The latter option is implemented in our matlab package.¹
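As a small illustration of the equivalence between the two options (with Sw assumed to be an estimate of the within-class covariance matrix), the same effect can be obtained either by rescaling the data or by weighting the penalty:

```matlab
% Two equivalent ways to handle scaling (illustrative sketch, not the package code).
sdW = sqrt(diag(Sw));                      % within-class standard deviations of the p variables
Xs  = bsxfun(@rdivide, X, sdW');           % option 1: standardize the observations
w   = sdW;                                 % option 2: keep X and weight the penalty terms w_j ||beta^j||_2
```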

6.2 Decision Thresholds

The derivations of LDA based on the analysis of variance or on the regression of class indicators do not rely on the normality of the class-conditional distribution of the observations. Hence, their applicability extends beyond the realm of Gaussian data. Based on this observation, Friedman et al. (2009, Chapter 4) suggest to investigate other decision thresholds than the ones stemming from the Gaussian mixture assumption. In particular, they propose to select the decision thresholds that empirically minimize the training error. This option was tested using validation sets or cross-validation.

¹The GLOSS matlab code can be found in the software section of www.hds.utc.fr/~grandval.


6.3 Simulated Data

We first compare the three techniques in the simulation study of Witten and Tibshirani (2011), which considers four setups with 1200 examples equally distributed between classes. They are split in a training set of size n = 100, a validation set of size 100, and a test set of size 1000. We are in the small sample regime, with p = 500 variables, out of which 100 differ between classes. Independent variables are generated for all simulations except for Simulation 2, where they are slightly correlated. In Simulations 2 and 3, classes are optimally separated by a single projection of the original variables, while the two other scenarios require three discriminant directions. The Bayes' error was estimated to be respectively 1.7%, 6.7%, 7.3% and 30.0%. The exact definition of every setup, as provided in Witten and Tibshirani (2011), is:

Simulation 1: Mean shift with independent features. There are four classes. If sample $i$ is in class $k$, then $x_i \sim \mathcal{N}(\mu_k, I)$, where $\mu_{1j} = 0.7\times 1_{(1\le j\le 25)}$, $\mu_{2j} = 0.7\times 1_{(26\le j\le 50)}$, $\mu_{3j} = 0.7\times 1_{(51\le j\le 75)}$, $\mu_{4j} = 0.7\times 1_{(76\le j\le 100)}$.

Simulation 2: Mean shift with dependent features. There are two classes. If sample $i$ is in class 1, then $x_i \sim \mathcal{N}(0, \Sigma)$, and if $i$ is in class 2, then $x_i \sim \mathcal{N}(\mu, \Sigma)$, with $\mu_j = 0.6\times 1_{(j\le 200)}$. The covariance structure is block diagonal, with 5 blocks, each of dimension $100\times 100$. The blocks have $(j, j')$ element $0.6^{|j-j'|}$. This covariance structure is intended to mimic gene expression data correlation.

Simulation 3: One-dimensional mean shift with independent features. There are four classes, and the features are independent. If sample $i$ is in class $k$, then $X_{ij} \sim \mathcal{N}\left(\frac{k-1}{3}, 1\right)$ if $j\le 100$, and $X_{ij} \sim \mathcal{N}(0, 1)$ otherwise.

Simulation 4: Mean shift with independent features and no linear ordering. There are four classes. If sample $i$ is in class $k$, then $x_i \sim \mathcal{N}(\mu_k, I)$, with mean vectors defined as follows: $\mu_{1j} \sim \mathcal{N}(0, 0.3^2)$ for $j\le 25$ and $\mu_{1j} = 0$ otherwise; $\mu_{2j} \sim \mathcal{N}(0, 0.3^2)$ for $26\le j\le 50$ and $\mu_{2j} = 0$ otherwise; $\mu_{3j} \sim \mathcal{N}(0, 0.3^2)$ for $51\le j\le 75$ and $\mu_{3j} = 0$ otherwise; $\mu_{4j} \sim \mathcal{N}(0, 0.3^2)$ for $76\le j\le 100$ and $\mu_{4j} = 0$ otherwise.

Note that this protocol is detrimental to GLOSS, as each relevant variable only affects a single class mean out of $K$. The setup is favorable to PLDA in the sense that most within-class covariance matrices are diagonal. We thus also tested the diagonal GLOSS variant discussed in Section 5.6.3.

The results are summarized in Table 6.1. Overall, the best predictions are performed by PLDA and GLOSS-D, which both benefit from the knowledge of the true within-class covariance structure. Then, among SLDA and GLOSS, which both ignore this structure, our proposal has a clear edge. The error rates are far away from the Bayes' error rates, but the sample size is small with regard to the number of relevant variables. Regarding sparsity, the clear overall winner is GLOSS, followed far away by SLDA, which is the only


Table 6.1: Experimental results for simulated data: averages, with standard deviations, computed over 25 repetitions, of the test error rate, the number of selected variables, and the number of discriminant directions selected on the validation set.

                 Err (%)        Var             Dir
Sim 1, K = 4, mean shift, ind. features
  PLDA           12.6 (0.1)     411.7 (3.7)     3.0 (0.0)
  SLDA           31.9 (0.1)     228.0 (0.2)     3.0 (0.0)
  GLOSS          19.9 (0.1)     106.4 (1.3)     3.0 (0.0)
  GLOSS-D        11.2 (0.1)     251.1 (4.1)     3.0 (0.0)
Sim 2, K = 2, mean shift, dependent features
  PLDA            9.0 (0.4)     337.6 (5.7)     1.0 (0.0)
  SLDA           19.3 (0.1)      99.0 (0.0)     1.0 (0.0)
  GLOSS          15.4 (0.1)      39.8 (0.8)     1.0 (0.0)
  GLOSS-D         9.0 (0.0)     203.5 (4.0)     1.0 (0.0)
Sim 3, K = 4, 1D mean shift, ind. features
  PLDA           13.8 (0.6)     161.5 (3.7)     1.0 (0.0)
  SLDA           57.8 (0.2)     152.6 (2.0)     1.9 (0.0)
  GLOSS          31.2 (0.1)     123.8 (1.8)     1.0 (0.0)
  GLOSS-D        18.5 (0.1)     357.5 (2.8)     1.0 (0.0)
Sim 4, K = 4, mean shift, ind. features
  PLDA           60.3 (0.1)     336.0 (5.8)     3.0 (0.0)
  SLDA           65.9 (0.1)     208.8 (1.6)     2.7 (0.0)
  GLOSS          60.7 (0.2)      74.3 (2.2)     2.7 (0.0)
  GLOSS-D        58.8 (0.1)     162.7 (4.9)     2.9 (0.0)


[Figure: TPR plotted against FPR for GLOSS, GLOSS-D, SLDA and PLDA on Simulations 1 to 4.]

Figure 6.1: TPR versus FPR (in %) for all algorithms and simulations.

Table 6.2: Average TPR and FPR (in %) computed over 25 repetitions.

            Simulation 1      Simulation 2      Simulation 3      Simulation 4
            TPR     FPR       TPR     FPR       TPR     FPR       TPR     FPR
PLDA        99.0    78.2      96.9    60.3      98.0    15.9      74.3    65.6
SLDA        73.9    38.5      33.8    16.3      41.6    27.8      50.7    39.5
GLOSS       64.1    10.6      30.0     4.6      51.1    18.2      26.0    12.1
GLOSS-D     93.5    39.4      92.1    28.1      95.6    65.5      42.9    29.9

method that does not succeed in uncovering a low-dimensional representation in Simulation 3. The adequacy of the selected features was assessed by the True Positive Rate (TPR) and the False Positive Rate (FPR). The TPR is the proportion of relevant variables that are selected; similarly, the FPR is the proportion of irrelevant variables that are selected. The best algorithm would be the one that selects all the relevant variables and rejects all the others, that is, TPR = 1 and FPR = 0 simultaneously. PLDA has the best TPR but a terrible FPR, except in Simulation 3, where it dominates all the other methods. GLOSS has by far the best FPR, with an overall TPR slightly below SLDA. Results are displayed in Figure 6.1 and in Table 6.2 (both in percentages).
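As an illustration of these two criteria (with selected and relevant assumed to be logical indicator vectors over the p variables), the rates can be computed as:

```matlab
% True and false positive rates of a variable selection (illustrative sketch).
TPR = sum(selected & relevant)  / sum(relevant);    % share of relevant variables recovered
FPR = sum(selected & ~relevant) / sum(~relevant);   % share of irrelevant variables kept
```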

6.4 Gene Expression Data

We now compare GLOSS to PLDA and SLDA on three genomic datasets. The Nakayama² dataset contains 105 examples of 22,283 gene expressions for categorizing 10 soft tissue tumors. It was reduced to the 86 examples belonging to the 5 dominant categories (Witten and Tibshirani, 2011). The Ramaswamy³ dataset contains 198 examples

²http://www.broadinstitute.org/cancer/software/genepattern/datasets
³http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS2736


Table 6.3: Experimental results for gene expression data: averages over 10 training/test set splits, with standard deviations, of the test error rates and the number of selected variables.

                                      Err (%)         Var
Nakayama, n = 86, p = 22,283, K = 5
  PLDA                                20.95 (1.3)     10478.7 (2116.3)
  SLDA                                25.71 (1.7)       252.5 (3.1)
  GLOSS                               20.48 (1.4)       129.0 (18.6)
Ramaswamy, n = 198, p = 16,063, K = 14
  PLDA                                38.36 (6.0)     14873.5 (720.3)
  SLDA                                —               —
  GLOSS                               20.61 (6.9)       372.4 (122.1)
Sun, n = 180, p = 54,613, K = 4
  PLDA                                33.78 (5.9)     21634.8 (7443.2)
  SLDA                                36.22 (6.5)       384.4 (16.5)
  GLOSS                               31.77 (4.5)        93.0 (93.6)

of 16,063 gene expressions for categorizing 14 classes of cancer. Finally, the Sun⁴ dataset contains 180 examples of 54,613 gene expressions for categorizing 4 classes of tumors.

Each dataset was split into a training set and a test set, with respectively 75% and 25% of the examples. Parameter tuning is performed by 10-fold cross-validation, and the test performances are then evaluated. The process is repeated 10 times, with random choices of the training and test set split.

Test error rates and the number of selected variables are presented in Table 6.3. The results for the PLDA algorithm are extracted from Witten and Tibshirani (2011). The three methods have comparable prediction performances on the Nakayama and Sun datasets, but GLOSS performs better on the Ramaswamy data, where the SparseLDA package failed to return a solution due to numerical problems in the LARS-EN implementation. Regarding the number of selected variables, GLOSS is again much sparser than its competitors.

Finally, Figure 6.2 displays the projection of the observations for the Nakayama and Sun datasets in the first canonical planes estimated by GLOSS and SLDA. For the Nakayama dataset, groups 1 and 2 are well-separated from the other ones in both representations, but GLOSS is more discriminant in the meta-cluster gathering groups 3 to 5. For the Sun dataset, SLDA suffers from a high collinearity of its first canonical variables, which renders the second one almost non-informative. As a result, group 1 is better separated in the first canonical plane with GLOSS.

⁴http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS1962


[Figure: first two discriminant directions for the Nakayama (top) and Sun (bottom) datasets, estimated by GLOSS (left) and SLDA (right). Nakayama classes: 1) Synovial sarcoma, 2) Myxoid liposarcoma, 3) Dedifferentiated liposarcoma, 4) Myxofibrosarcoma, 5) Malignant fibrous histiocytoma. Sun classes: 1) NonTumor, 2) Astrocytomas, 3) Glioblastomas, 4) Oligodendrogliomas.]

Figure 6.2: 2D-representations of Nakayama and Sun datasets based on the two first discriminant vectors provided by GLOSS and SLDA. The big squares represent class means.


Figure 6.3: USPS digits "1" and "0".

6.5 Correlated Data

When the features are known to be highly correlated, the discrimination algorithm can be improved by using this information in the optimization problem. The structured variant of GLOSS presented in Section 5.6.4 (S-GLOSS from now on) was conceived to easily introduce this prior knowledge.

The experiments described in this section are intended to illustrate the effect of combining the group-Lasso sparsity-inducing penalty with a quadratic penalty used as a surrogate of the unknown within-class variance matrix. This preliminary experiment does not include comparisons with other algorithms; more comprehensive experimental results are left for future work.

For this illustration, we have used a subset of the USPS handwritten digit dataset, made of images of $16\times 16$ pixels representing digits from 0 to 9. For our purpose, we compare the discriminant direction that separates digits "1" and "0" computed with GLOSS and S-GLOSS. The mean image of every digit is shown in Figure 6.3.

As in Section 5.6.4, we have encoded the pixel proximity relationships from Figure 5.2 into a penalty matrix $\Omega_L$, but this time on a 256-node graph. Introducing this new $256\times 256$ Laplacian penalty matrix $\Omega_L$ in the GLOSS algorithm is straightforward.

The effect of this penalty is fairly evident in Figure 6.4, where the discriminant vector $\beta$ resulting from a non-penalized execution of GLOSS is compared with the $\beta$ resulting from a Laplacian-penalized execution of S-GLOSS (without group-Lasso penalty). We perfectly distinguish the center of the digit "0" in the discriminant direction obtained by S-GLOSS, which is probably the most important element to discriminate between both digits.

Figure 6.5 displays the discriminant direction $\beta$ obtained by GLOSS and S-GLOSS for a non-zero group-Lasso penalty, with an identical penalization parameter ($\lambda = 0.3$). Even if both solutions are sparse, the discriminant vector from S-GLOSS keeps connected pixels, which allows strokes to be detected and will probably provide better prediction results.


Figure 6.4: Discriminant direction between digits "1" and "0": $\beta$ for GLOSS (left) and $\beta$ for S-GLOSS (right).

Figure 6.5: Sparse discriminant direction between digits "1" and "0": $\beta$ for GLOSS (left) and $\beta$ for S-GLOSS (right), both with $\lambda = 0.3$.


Discussion

GLOSS is an efficient algorithm that performs sparse LDA, based on the regression of class indicators. Our proposal is equivalent to a penalized LDA problem. This is, to our knowledge, the first approach that enjoys this property in the multi-class setting. This relationship is also amenable to accommodate interesting constraints on the equivalent penalized LDA problem, such as imposing a diagonal structure on the within-class covariance matrix.

Computationally, GLOSS is based on an efficient active set strategy that is amenable to the processing of problems with a large number of variables. The inner optimization problem decouples the $p\times(K-1)$-dimensional problem into $(K-1)$ independent $p$-dimensional problems. The interaction between the $(K-1)$ problems is relegated to the computation of the common adaptive quadratic penalty. The algorithm presented here is highly efficient in medium to high dimensional setups, which makes it a good candidate for the analysis of gene expression data.

The experimental results confirm the relevance of the approach, which behaves well compared to its competitors, regarding either its prediction abilities or its interpretability (sparsity). Generally, compared to the competing approaches, GLOSS provides extremely parsimonious discriminants without compromising prediction performances. Employing the same features in all discriminant directions makes it possible to generate models that are globally extremely parsimonious, with good prediction abilities. The resulting sparse discriminant directions also allow for visual inspection of the data from the low-dimensional representations that can be produced.

The approach has many potential extensions that have not yet been implemented. A first line of development is to consider a broader class of penalties. For example, plain quadratic penalties can also be added to the group penalty to encode priors about the within-class covariance structure, in the spirit of the Penalized Discriminant Analysis of Hastie et al. (1995). Also, besides the group-Lasso, our framework can be customized to any penalty that is uniformly spread within groups, and many composite or hierarchical penalties that have been proposed for structured data meet this condition.


Part III

Sparse Clustering Analysis


Abstract

Clustering can be defined as the task of grouping samples such that all the elements belonging to one cluster are more "similar" to each other than to the objects belonging to the other groups. There are similarity measures for any data structure: database records or even multimedia objects (audio, video). The similarity concept is closely related to the idea of distance, which is a specific dissimilarity.

Model-based clustering aims to describe a heterogeneous population with a probabilistic model that represents each group with its own distribution. Here, the distributions will be Gaussian, and the different populations are identified by different means and a common covariance matrix.

As in the supervised framework, the traditional clustering techniques perform worse when the number of irrelevant features increases. In this part we develop Mix-GLOSS, which builds on the supervised GLOSS algorithm to address unsupervised problems, resulting in a clustering mechanism with embedded feature selection.

Chapter 7 reviews different techniques for inducing sparsity in model-based clustering algorithms. The theory that motivates our original formulation of the EM algorithm is developed in Chapter 8, followed by the description of the algorithm in Chapter 9. Its performance is assessed and compared to other state-of-the-art model-based sparse clustering mechanisms in Chapter 10.


7 Feature Selection in Mixture Models

7.1 Mixture Models

One of the most popular clustering algorithms is K-means, which aims to partition $n$ observations into $K$ clusters; each observation is assigned to the cluster with the nearest mean (MacQueen, 1967). A generalization of K-means can be made through probabilistic models, which represent $K$ subpopulations by a mixture of distributions. Since their first use by Newcomb (1886) for the detection of outlier points, and 8 years later by Pearson (1894) to identify two separate populations of crabs, finite mixtures of distributions have been employed to model a wide variety of random phenomena. These models assume that measurements are taken from a set of individuals, each of which belongs to one out of a number of different classes, while any individual's particular class is unknown. Mixture models can thus address the heterogeneity of a population and are especially well suited to the problem of clustering.

7.1.1 Model

We assume that the observed data $X = (x_1^\top,\dots,x_n^\top)^\top$ have been drawn identically from $K$ different subpopulations in the domain $\mathbb{R}^p$. The generative distribution is a finite mixture model, that is, the data are assumed to be generated from a compounded distribution whose density can be expressed as
$$f(x_i) = \sum_{k=1}^K \pi_k f_k(x_i) , \quad \forall i\in\{1,\dots,n\} ,$$

where $K$ is the number of components, $f_k$ are the densities of the components, and $\pi_k$ are the mixture proportions ($\pi_k\in\,]0,1[$ for all $k$, and $\sum_k \pi_k = 1$). Mixture models transcribe that, given the proportions $\pi_k$ and the distributions $f_k$ for each class, the data are generated according to the following mechanism:

• $y$: each individual is allotted to a class according to a multinomial distribution with parameters $\pi_1,\dots,\pi_K$;

• $x$: each $x_i$ is assumed to arise from a random vector with probability density function $f_k$.

In addition, it is usually assumed that the component densities $f_k$ belong to a parametric family of densities $\phi(\cdot;\theta_k)$. The density of the mixture can then be written as
$$f(x_i;\theta) = \sum_{k=1}^K \pi_k\,\phi(x_i;\theta_k) , \quad \forall i\in\{1,\dots,n\} ,$$


where $\theta = (\pi_1,\dots,\pi_K,\theta_1,\dots,\theta_K)$ is the parameter of the model.

7.1.2 Parameter Estimation: The EM Algorithm

For the estimation of the parameters of the mixture model, Pearson (1894) used the method of moments to estimate the five parameters $(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \pi)$ of a univariate Gaussian mixture model with two components. That method required him to solve polynomial equations of degree nine. There are also graphical methods, maximum likelihood methods and Bayesian approaches.

The most widely used process to estimate the parameters is the maximization of the log-likelihood using the EM algorithm. It is typically used to maximize the likelihood for models with latent variables, for which no analytical solution is available (Dempster et al., 1977).

The EM algorithm iterates two steps, called the expectation step (E) and the maximization step (M). Each expectation step involves the computation of the likelihood expectation with respect to the hidden variables, while each maximization step estimates the parameters by maximizing the E-step expected likelihood.

Under mild regularity assumptions, this mechanism converges to a local maximum of the likelihood. However, the type of problems targeted is typically characterized by the existence of several local maxima, and global convergence cannot be guaranteed. In practice, the obtained solution depends on the initialization of the algorithm.

Maximum Likelihood Definitions

The likelihood is commonly expressed in its logarithmic version:
$$L(\theta; X) = \log\left(\prod_{i=1}^n f(x_i;\theta)\right) = \sum_{i=1}^n \log\left(\sum_{k=1}^K \pi_k f_k(x_i;\theta_k)\right) , \qquad (7.1)$$

where $n$ is the number of samples, $K$ is the number of components of the mixture (or number of clusters), and $\pi_k$ are the mixture proportions.

To obtain maximum likelihood estimates, the EM algorithm works with the joint distribution of the observations $x$ and the unknown latent variables $y$, which indicate the cluster membership of every sample. The pair $z = (x, y)$ is called the complete data. The log-likelihood of the complete data is called the complete log-likelihood or


classification log-likelihood:
$$L_C(\theta; X, Y) = \log\left(\prod_{i=1}^n f(x_i, y_i;\theta)\right) = \sum_{i=1}^n \log\left(\sum_{k=1}^K y_{ik}\,\pi_k f_k(x_i;\theta_k)\right) = \sum_{i=1}^n\sum_{k=1}^K y_{ik}\log\left(\pi_k f_k(x_i;\theta_k)\right) . \qquad (7.2)$$

The $y_{ik}$ are the binary entries of the indicator matrix $Y$, with $y_{ik} = 1$ if observation $i$ belongs to cluster $k$, and $y_{ik} = 0$ otherwise.

We define the soft membership $t_{ik}(\theta)$ as
$$t_{ik}(\theta) = p(Y_{ik} = 1|x_i;\theta) \qquad (7.3)$$
$$\phantom{t_{ik}(\theta)} = \frac{\pi_k f_k(x_i;\theta_k)}{f(x_i;\theta)} . \qquad (7.4)$$

To lighten notations, $t_{ik}(\theta)$ will be denoted $t_{ik}$ when the parameter $\theta$ is clear from context. The regular (7.1) and complete (7.2) log-likelihoods are related as follows:

$$\begin{aligned}
L_C(\theta; X, Y) &= \sum_{i,k} y_{ik}\log\left(\pi_k f_k(x_i;\theta_k)\right)\\
&= \sum_{i,k} y_{ik}\log\left(t_{ik}\, f(x_i;\theta)\right)\\
&= \sum_{i,k} y_{ik}\log t_{ik} + \sum_{i,k} y_{ik}\log f(x_i;\theta)\\
&= \sum_{i,k} y_{ik}\log t_{ik} + \sum_{i=1}^n \log f(x_i;\theta)\\
&= \sum_{i,k} y_{ik}\log t_{ik} + L(\theta; X) , \qquad (7.5)
\end{aligned}$$

where $\sum_{i,k} y_{ik}\log t_{ik}$ can be reformulated as
$$\sum_{i,k} y_{ik}\log t_{ik} = \sum_{i=1}^n\sum_{k=1}^K y_{ik}\log\left(p(Y_{ik}=1|x_i;\theta)\right) = \sum_{i=1}^n \log\left(p(y_i|x_i;\theta)\right) = \log\left(p(Y|X;\theta)\right) .$$

As a result, the relationship (7.5) can be rewritten as
$$L(\theta; X) = L_C(\theta; Z) - \log\left(p(Y|X;\theta)\right) . \qquad (7.6)$$


Likelihood Maximization

The complete log-likelihood cannot be assessed because the variables $y_{ik}$ are unknown. However, it is possible to estimate the value of the log-likelihood by taking expectations on (7.6), conditionally on a current value $\theta^{(t)}$ of the parameter:
$$L(\theta; X) = \underbrace{\mathbb{E}_{Y\sim p(\cdot|X;\theta^{(t)})}\left[L_C(\theta; X, Y)\right]}_{Q(\theta,\theta^{(t)})} + \underbrace{\mathbb{E}_{Y\sim p(\cdot|X;\theta^{(t)})}\left[-\log p(Y|X;\theta)\right]}_{H(\theta,\theta^{(t)})} .$$

In this expression, $H(\theta,\theta^{(t)})$ is an entropy term and $Q(\theta,\theta^{(t)})$ is the conditional expectation of the complete log-likelihood. Let us define an increment of the log-likelihood as $\Delta L = L(\theta^{(t+1)}; X) - L(\theta^{(t)}; X)$. Then $\theta^{(t+1)} = \mathop{\mathrm{argmax}}_\theta Q(\theta,\theta^{(t)})$ also increases the log-likelihood:
$$\Delta L = \underbrace{\left(Q(\theta^{(t+1)},\theta^{(t)}) - Q(\theta^{(t)},\theta^{(t)})\right)}_{\ge 0 \text{ by definition of iteration } t+1} + \underbrace{\left(H(\theta^{(t+1)},\theta^{(t)}) - H(\theta^{(t)},\theta^{(t)})\right)}_{\ge 0 \text{ by Jensen's inequality}} .$$

Therefore, it is possible to maximize the likelihood by optimizing $Q(\theta,\theta^{(t)})$. The relationship between $Q(\theta,\theta')$ and $L(\theta; X)$ is developed in deeper detail in Appendix F, to show how the value of $L(\theta; X)$ can be recovered from $Q(\theta,\theta^{(t)})$.

For the mixture model problem, $Q(\theta,\theta')$ is
$$Q(\theta,\theta') = \mathbb{E}_{Y\sim p(Y|X;\theta')}\left[L_C(\theta; X, Y)\right] = \sum_{i,k} p(Y_{ik} = 1|x_i;\theta')\log\left(\pi_k f_k(x_i;\theta_k)\right) = \sum_{i=1}^n\sum_{k=1}^K t_{ik}(\theta')\log\left(\pi_k f_k(x_i;\theta_k)\right) . \qquad (7.7)$$

Due to its similarity to the expression of the complete log-likelihood (7.2), $Q(\theta,\theta')$ is also known as the weighted likelihood. In (7.7), the weights $t_{ik}(\theta')$ are the posterior probabilities of cluster membership.

Hence, the EM algorithm sketched above results in:

• Initialization (not iterated): choice of the initial parameter $\theta^{(0)}$;

• E-Step: evaluation of $Q(\theta,\theta^{(t)})$, using $t_{ik}(\theta^{(t)})$ (7.4) in (7.7);

• M-Step: calculation of $\theta^{(t+1)} = \mathop{\mathrm{argmax}}_\theta Q(\theta,\theta^{(t)})$.


Gaussian Model

In the particular case of a Gaussian mixture model with common covariance matrix $\Sigma$ and different mean vectors $\mu_k$, the mixture density is
$$f(x_i;\theta) = \sum_{k=1}^K \pi_k f_k(x_i;\theta_k) = \sum_{k=1}^K \pi_k\, \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}\, \exp\left\{-\frac12 (x_i - \mu_k)^\top\Sigma^{-1}(x_i - \mu_k)\right\} .$$

At the E-step, the posterior probabilities $t_{ik}$ are computed as in (7.4) with the current $\theta^{(t)}$ parameters; then the M-step maximizes $Q(\theta,\theta^{(t)})$ (7.7), whose form is as follows:
$$\begin{aligned}
Q(\theta,\theta^{(t)}) &= \sum_{i,k} t_{ik}\log(\pi_k) - \sum_{i,k} t_{ik}\log\left((2\pi)^{p/2}|\Sigma|^{1/2}\right) - \frac12\sum_{i,k} t_{ik}(x_i - \mu_k)^\top\Sigma^{-1}(x_i - \mu_k)\\
&= \sum_k t_k\log(\pi_k) - \underbrace{\frac{np}{2}\log(2\pi)}_{\text{constant term}} - \frac{n}{2}\log(|\Sigma|) - \frac12\sum_{i,k} t_{ik}(x_i - \mu_k)^\top\Sigma^{-1}(x_i - \mu_k)\\
&\equiv \sum_k t_k\log(\pi_k) - \frac{n}{2}\log(|\Sigma|) - \sum_{i,k} t_{ik}\left(\frac12 (x_i - \mu_k)^\top\Sigma^{-1}(x_i - \mu_k)\right) , \qquad (7.8)
\end{aligned}$$

where
$$t_k = \sum_{i=1}^n t_{ik} . \qquad (7.9)$$

The M-step, which maximizes this expression with respect to $\theta$, applies the following updates, defining $\theta^{(t+1)}$:
$$\pi_k^{(t+1)} = \frac{t_k}{n} , \qquad (7.10)$$
$$\mu_k^{(t+1)} = \frac{\sum_i t_{ik}\, x_i}{t_k} , \qquad (7.11)$$
$$\Sigma^{(t+1)} = \frac{1}{n}\sum_k W_k , \qquad (7.12)$$
$$\text{with}\quad W_k = \sum_i t_{ik}(x_i - \mu_k)(x_i - \mu_k)^\top . \qquad (7.13)$$

The derivations are detailed in Appendix G
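A compact MATLAB sketch of one EM iteration implementing (7.4) and (7.9)-(7.13) is given below; it is an illustration under stated assumptions (inputs X, pik, Mu, Sigma already initialized), not the Mix-GLOSS code.

```matlab
% One EM iteration for a Gaussian mixture with common covariance matrix.
% Assumed inputs: X (n x p), pik (1 x K proportions), Mu (K x p means), Sigma (p x p).
[n, p] = size(X);   K = numel(pik);
% E-step: posterior probabilities t_ik, Eq. (7.4), computed on the log scale
logT = zeros(n, K);
L    = chol(Sigma, 'lower');
for k = 1:K
    Z          = (X - repmat(Mu(k,:), n, 1)) / L';       % whitened residuals
    logT(:, k) = log(pik(k)) - 0.5*sum(Z.^2, 2) - 0.5*p*log(2*pi) - sum(log(diag(L)));
end
logT = bsxfun(@minus, logT, max(logT, [], 2));            % underflow-safe normalization
T    = exp(logT);   T = bsxfun(@rdivide, T, sum(T, 2));
% M-step: Eqs. (7.9)-(7.13)
tk    = sum(T, 1);                                        % (7.9)
pik   = tk / n;                                           % (7.10)
Mu    = bsxfun(@rdivide, T' * X, tk');                    % (7.11)
Sigma = zeros(p);
for k = 1:K
    Xc    = X - repmat(Mu(k,:), n, 1);
    Sigma = Sigma + Xc' * bsxfun(@times, T(:,k), Xc);     % W_k, Eq. (7.13)
end
Sigma = Sigma / n;                                        % (7.12)
```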

7.2 Feature Selection in Model-Based Clustering

When common covariance matrices are assumed, Gaussian mixtures are related to LDA, with partitions defined by linear decision rules. When every cluster has its own


covariance matrix $\Sigma_k$, Gaussian mixtures are associated to quadratic discriminant analysis (QDA), with quadratic boundaries.

In the high-dimensional, low-sample setting, numerical issues appear in the estimation of the covariance matrix. To avoid those singularities, regularization may be applied. A regularized trade-off between LDA and QDA (RDA) was proposed by Friedman (1989). Bensmail and Celeux (1996) extended this algorithm by rewriting the covariance matrix in terms of its eigenvalue decomposition $\Sigma_k = \lambda_k D_k A_k D_k^\top$ (Banfield and Raftery, 1993). These regularization schemes address singularity and stability issues, but they do not induce parsimonious models.

In this chapter, we review some techniques to induce sparsity in model-based clustering algorithms. This sparsity refers to the rule that assigns examples to classes: clustering is still performed in the original $p$-dimensional space, but the decision rule can be expressed with only a few coordinates of this high-dimensional space.

7.2.1 Based on Penalized Likelihood

Penalized log-likelihood maximization is a popular estimation technique for mixture models. It is typically achieved by the EM algorithm, using mixture models for which the allocation of examples is expressed as a simple function of the input features. For example, for Gaussian mixtures with a common covariance matrix, the log-ratio of posterior probabilities is a linear function of $x$:
$$\log\left(\frac{p(Y^k = 1|x)}{p(Y^\ell = 1|x)}\right) = x^\top\Sigma^{-1}(\mu_k - \mu_\ell) - \frac12(\mu_k + \mu_\ell)^\top\Sigma^{-1}(\mu_k - \mu_\ell) + \log\frac{\pi_k}{\pi_\ell} .$$

In this model, a simple way of introducing sparsity in the discriminant vectors $\Sigma^{-1}(\mu_k - \mu_\ell)$ is to constrain $\Sigma$ to be diagonal and to favor sparse means $\mu_k$. Indeed, for Gaussian mixtures with a common diagonal covariance matrix, if all means have the same value on dimension $j$, then variable $j$ is useless for class allocation and can be discarded. The means can be penalized by the $L_1$ norm
$$\lambda\sum_{k=1}^K\sum_{j=1}^p |\mu_{kj}| ,$$

as proposed by Pan et al. (2006) and Pan and Shen (2007). Zhou et al. (2009) consider more complex penalties on full covariance matrices:
$$\lambda_1\sum_{k=1}^K\sum_{j=1}^p |\mu_{kj}| + \lambda_2\sum_{k=1}^K\sum_{j=1}^p\sum_{m=1}^p \left|(\Sigma_k^{-1})_{jm}\right| .$$

In their algorithm, they make use of the graphical Lasso to estimate the covariances. Even if their formulation induces sparsity on the parameters, their combination of $L_1$ penalties does not directly target decision rules based on few variables, and thus does not guarantee parsimonious models.


Guo et al. (2010) propose a variation with a Pairwise Fusion Penalty (PFP):
$$\lambda\sum_{j=1}^p\sum_{1\le k\le k'\le K} |\mu_{kj} - \mu_{k'j}| .$$

This PFP regularization does not shrink the means to zero, but towards each other. If the $j$th components of all cluster means are driven to the same value, that variable can be considered as non-informative.

An $L_{1,\infty}$ penalty is used by Wang and Zhu (2008) and Kuan et al. (2010) to penalize the likelihood, encouraging null groups of features:
$$\lambda\sum_{j=1}^p \left\|(\mu_{1j}, \mu_{2j},\dots,\mu_{Kj})\right\|_\infty .$$

One group is defined for each variable $j$ as the set of the $K$ means' $j$th components $(\mu_{1j},\dots,\mu_{Kj})$. The $L_{1,\infty}$ penalty forces zeros at the group level, favoring the removal of the corresponding feature. This method seems to produce parsimonious models and good partitions within a reasonable computing time; in addition, the code is publicly available. Xie et al. (2008b) apply a group-Lasso penalty. Their principle describes a vertical mean grouping (VMG, with the same groups as Xie et al. (2008a)) and a horizontal mean grouping (HMG). VMG allows real feature selection, because it forces null values for the same variable in all cluster means:
$$\lambda\sqrt{K}\sum_{j=1}^p \sqrt{\sum_{k=1}^K \mu_{kj}^2} .$$

The clustering algorithm of VMG differs from ours, but the proposed group penalty is the same; however, no code is available on the authors' website that would allow testing it.

The optimization of a penalized likelihood by means of an EM algorithm can be reformulated by rewriting the maximization expressions of the M-step as a penalized optimal scoring regression. Roth and Lange (2004) implemented it for two-cluster problems, using an $L_1$ penalty to encourage sparsity on the discriminant vector. The generalization from quadratic to non-quadratic penalties is quickly outlined in that work. We extend this work by considering an arbitrary number of clusters and by formalizing the link between penalized optimal scoring and penalized likelihood estimation.

7.2.2 Based on Model Variants

The algorithm proposed by Law et al. (2004) takes a different stance. The authors define feature relevancy considering conditional independence; that is, the $j$th feature is presumed uninformative if its distribution is independent of the class labels. The density is expressed as


$$f(x_i|\phi, \pi, \theta, \nu) = \sum_{k=1}^K \pi_k \prod_{j=1}^p \left[f(x_{ij}|\theta_{jk})\right]^{\phi_j}\left[h(x_{ij}|\nu_j)\right]^{1-\phi_j} ,$$

where $f(\cdot|\theta_{jk})$ is the distribution function for relevant features and $h(\cdot|\nu_j)$ is the distribution function for the irrelevant ones. The binary vector $\phi = (\phi_1, \phi_2,\dots,\phi_p)$ represents relevance, with $\phi_j = 1$ if the $j$th feature is informative and $\phi_j = 0$ otherwise. The saliency of variable $j$ is then formalized as $\rho_j = P(\phi_j = 1)$, so all $\phi_j$ must be treated as missing variables. Thus the set of parameters is $\{\pi_k\}, \{\theta_{jk}\}, \{\nu_j\}, \{\rho_j\}$; their estimation is done by means of the EM algorithm (Law et al., 2004).

An original and recent technique is the Fisher-EM algorithm proposed by Bouveyron and Brunet (2012b,a). Fisher-EM is a modified version of EM that runs in a latent space. This latent space is defined by an orthogonal projection matrix $U\in\mathbb{R}^{p\times(K-1)}$, which is updated inside the EM loop with a new step, called the Fisher step (F-step from now on), which maximizes a multi-class Fisher's criterion
$$\mathrm{tr}\left((U^\top\Sigma_W U)^{-1}U^\top\Sigma_B U\right) \qquad (7.14)$$

so as to maximize the separability of the data The E-step is the standard one computingthe posterior probabilities Then the F-step updates the projection matrix that projectsthe data to the latent space Finally the M-step estimates the parameters by maximizingthe conditional expectation of the complete log-likelihood Those parameters can berewritten as a function of the projection matrix U and the model parameters in thelatent space such that the U matrix enters into the M-step equations

To induce feature selection Bouveyron and Brunet (2012a) suggest three possibilitiesThe first one results in the best sparse orthogonal approximation U of the matrix Uwhich maximizes (714) This sparse approximation is defined as the solution of

minUisinRptimesKminus1

∥∥∥XU minusXU∥∥∥2

F+ λ

Kminus1sumk=1

∥∥∥uk∥∥∥1

where XU = XU is the input data projected in the non-sparse space and uk is thekth column vector of the projection matrix U The second possibility is inspired byQiao et al (2009) and reformulates Fisherrsquos discriminant (714) used to compute theprojection matrix as a regression criterion penalized by a mixture of Lasso and Elasticnet

minABisinRptimesKminus1

Ksumk=1

∥∥∥RminusgtW HBk minusABgtHBk

∥∥∥2

2+ ρ

Kminus1sumj=1

βgtj ΣWβj + λ

Kminus1sumj=1

∥∥βj∥∥1

s t AgtA = IKminus1

where HB isin RptimesK is a matrix defined conditionally to the posterior probabilities tiksatisfying HBHgtB = ΣB and HBk is the kth column of HB RW isin Rptimesp is an upper

78

72 Feature Selection in Model-Based Clustering

triangular matrix resulting from the Cholesky decomposition of ΣW ΣW and ΣB arethe p times p within-class and between-class covariance matrices in the observations spaceA isin RptimesKminus1 and B isin RptimesKminus1 are the solutions of the optimization problem such thatB = [β1 βKminus1] is the best sparse approximation of U

The last possibility suggests the solution of the Fisherrsquos discriminant (714) as thesolution of the following constrained optimization problem

minUisinRptimesKminus1

psumj=1

∥∥∥ΣBj minus UUgtΣBj

∥∥∥2

2

s t UgtU = IKminus1

whereΣBj is the jth column of the between covariance matrix in the observations spaceThis problem can be solved by a penalized version of the singular value decompositionproposed by (Witten et al 2009) resulting in a sparse approximation of U

To comply with the constraint stating that the columns of U are orthogonal the firstand the second options must be followed by a singular vector decomposition of U to getorthogonality This is not necessary with the third option since the penalized version ofSVD already guarantees orthogonality

However there is a lack of guarantees regarding convergence Bouveyron states ldquotheupdate of the orientation matrix U in the F-step is done by maximizing the Fishercriterion and not by directly maximizing the expected complete log-likelihood as requiredin the EM algorithm theory From this point of view the convergence of the Fisher-EM algorithm cannot therefore be guaranteedrdquo Immediately after this paragraph wecan read that under certain suppositions their algorithms converge ldquothe model []which assumes the equality and the diagonality of covariance matrices the F-step of theFisher-EM algorithm satisfies the convergence conditions of the EM algorithm theoryand the convergence of the Fisher-EM algorithm can be guaranteed in this case For theother discriminant latent mixture models although the convergence of the Fisher-EMprocedure cannot be guaranteed our practical experience has shown that the Fisher-EMalgorithm rarely fails to converge with these models if correctly initializedrdquo

723 Based on Model Selection

Some clustering algorithms recast the feature selection problem as model selectionproblem According to this Raftery and Dean (2006) model the observations as amixture model of Gaussians distributions To discover a subset of relevant features (andits superfluous complementary) they define three subsets of variables

bull X(1) set of selected relevant variables

bull X(2) set of variables being considered for inclusion or exclusion of X(1)

bull X(3) set of non relevant variables

79

7 Feature Selection in Mixture Models

With those subsets they defined two different models where Y is the partition toconsider

bull M1

f (X|Y) = f(X(1)X(2)X(3)|Y

)= f

(X(3)|X(2)X(1)

)f(X(2)|X(1)

)f(X(1)|Y

)bull M2

f (X|Y) = f(X(1)X(2)X(3)|Y

)= f

(X(3)|X(2)X(1)

)f(X(2)X(1)|Y

)Model M1 means that variables in X(2) are independent on clustering Y Model M2

shows that variables in X(2) depend on clustering Y To simplify the algorithm subsetX(2) is only updated one variable at a time Therefore deciding the relevance of variableX(2) deals with a model selection between M1 and M2 The selection is done via theBayes factor

B12 =f (X|M1)

f (X|M2)

where the high-dimensional f(X(3)|X(2)X(1)) cancels from the ratio

B12 =f(X(1)X(2)X(3)|M1

)f(X(1)X(2)X(3)|M2

)=f(X(2)|X(1)M1

)f(X(1)|M1

)f(X(2)X(1)|M2

)

This factor is approximated since the integrated likelihoods f(X(1)|M1

)and

f(X(2)X(1)|M2

)are difficult to calculate exactly Raftery and Dean (2006) use the

BIC approximation The computation of f(X(2)|X(1)M1

) if there is only one variable

in X(2) can be represented as a linear regression of variable X(2) on the variables inX(1) There is also a BIC approximation for this term

Maugis et al (2009a) have proposed a variation of the algorithm developed by Rafteryand Dean They define three subsets of variables the relevant and irrelevant subsets(X(1) and X(3)) remains the same but X(2) is reformulated as a subset of relevantvariables that explains the irrelevance through a multidimensional regression This algo-rithm also uses of a backward stepwise strategy instead of the forward stepwise used byRaftery and Dean (2006) Their algorithm allows to define blocks of indivisible variablesthat in certain situations improve the clustering and its interpretability

Both algorithms are well motivated and appear to produce good results however thequantity of computation needed to test the different subset of variables requires a hugecomputation time In practice they cannot be used for the amount of data consideredin this thesis

80

8 Theoretical Foundations

In this chapter we develop Mix-GLOSS which uses the GLOSS algorithm conceivedfor supervised classification (see Section 5) to solve clustering problems The goal here issimilar that is providing an assignements of examples to clusters based on few features

We use a modified version of the EM algorithm whose M-step is formulated as apenalized linear regression of a scaled indicator matrix that is a penalized optimalscoring problem This idea was originally proposed by Hastie and Tibshirani (1996)to perform reduced-rank decision rules using less than K minus 1 discriminant directionsTheir motivation was mainly driven by stability issues no sparsity-inducing mechanismwas introduced in the construction of discriminant directions Roth and Lange (2004)pursued this idea by for binary clustering problems where sparsity was introduced bya Lasso penalty applied to the OS problem Besides extending the work of Roth andLange (2004) to an arbitrary number of clusters we draw links between the OS penaltyand the parameters of the Gaussian model

In the subsequent sections we provide the principles that allow to solve the M-stepas an optimal scoring problem The feature selection technique is embedded by meansof a group-Lasso penalty We must then guarantee that the equivalence between theM-step and the OS problem holds for our penalty As with GLOSS this is accomplishedwith a variational approach of group-Lasso Finally some considerations regarding thecriterion that is optimized with this modified EM are provided

81 Resolving EM with Optimal Scoring

In the previous chapters EM was presented as an iterative algorithm that computesa maximum likelihood estimate through the maximization of the expected complete log-likelihood This section explains how a penalized OS regression embedded into an EMalgorithm produces a penalized likelihood estimate

811 Relationship Between the M-Step and Linear Discriminant Analysis

LDA is typically used in a supervised learning framework for classification and dimen-sion reduction It looks for a projection of the data where the ratio of between-classvariance to within-class variance is maximized (see Appendix C) Classification in theLDA domain is based on the Mahalanobis distance

d(ximicrok) = (xi minus microk)gtΣminus1

W (xi minus microk)

where microk are the p-dimensional centroids and ΣW is the p times p common within-classcovariance matrix

81

8 Theoretical Foundations

The likelihood equations in the M-Step (711) and (712) can be interpreted as themean and covariance estimates of a weighted and augmented LDA problem Hastie andTibshirani (1996) where the n observations are replicated K times and weighted by tik(the posterior probabilities computed at the E-step)

Having replicated the data vectors Hastie and Tibshirani (1996) remark that the pa-rameters maximizing the mixture likelihood in the M-step of the EM algorithm (711)and (712) can also be defined as the maximizers of the weighted and augmented likeli-hood

2lweight(microΣ) =nsumi=1

Ksumk=1

tikd(ximicrok)minus n log(|ΣW|)

which arises when considering a weighted and augmented LDA problem This viewpointprovides the basis for an alternative maximization of penalized maximum likelihood inGaussian mixtures

812 Relationship Between Optimal Scoring and Linear DiscriminantAnalysis

The equivalence between penalized optimal scoring problems and a penalized lineardiscriminant analysis has already been detailed in Section 41 in the supervised learningframework This is a critical part of the link between the M-step of an EM algorithmand optimal scoring regression

813 Clustering Using Penalized Optimal Scoring

The solution of the penalized optimal scoring regression in the M-step is a coefficientmatrix BOS analytically related to the Fisherrsquos discriminative directions BLDA for thedata (XY) where Y is the current (hard or soft) cluster assignement In order tocompute the posterior probabilities tik in the E-step the distance between the samplesxi and the centroids microk must be evaluated Depending wether we are working in theinput domain OS or LDA domain different expressions are used for the distances (seeSection 422 for more details) Mix-GLOSS works in the LDA domain based on thefollowing expression

d(ximicrok) = (xminus microk)BLDA22 minus 2 log(πk)

This distance defines the computation of the posterior probabilities tik in the E-step (seeSection 423) Putting together all those elements the complete clustering algorithmcan be summarized as

82

82 Optimized Criterion

1 Initialize the membership matrix Y (for example by K-means algorithm)

2 Solve the p-OS problem as

BOS =(XgtX + λΩ

)minus1XgtYΘ

where Θ are the K minus 1 leading eigenvectors of

YgtX(XgtX + λΩ

)minus1XgtY

3 Map X to the LDA domain XLDA = XBOSD with D = diag(αminus1k (1minusα2

k)minus 1

2 )

4 Compute the centroids M in the LDA domain

5 Evaluate distances in the LDA domain

6 Translate distances into posterior probabilities tik with

tik prop exp

[minusd(x microk)minus 2 log(πk)

2

] (81)

7 Update the labels using the posterior probabilities matrix Y = T

8 Go back to step 2 and iterate until tik converge

Items 2 to 5 can be interpreted as the M-step and Item 6 as the E-step in this alter-native view of the EM algorithm for Gaussian mixtures

814 From Sparse Optimal Scoring to Sparse Linear Discriminant Analysis

In the previous section we schemed a clustering algorithm that replaces the M-stepwith penalized OS This modified version of EM holds for any quadratic penalty We ex-tend this equivalence to sparsity-inducing penalties through the a quadratic variationalapproach to the group-Lasso provided in Section 43 We now look for a formal equiva-lence between this penalty and penalized maximum likelihood for Gaussian mixtures

82 Optimized Criterion

In the classical EM for Gaussian mixtures the M-step maximizes the weighted likeli-hood Q(θθprime) (77) so as to maximize the likelihood L(θ) (see Section 712) Replacingthe M-step by an optimal scoring is equivalent replacing the M-step by a penalized

83

8 Theoretical Foundations

optimal problem is possible and the link between penalized optimal problem and pe-nalized LDA holds but it remains to relate this penalized LDA problem to a penalizedmaximum likelihood criterion for the Gaussian mixture

This penalized likelihood cannot be rigorously interpreted as a maximum a posterioricriterion in particular because the penalty only operates on the covariance matrix Σ(there is no prior on the means and proportions of the mixture) We however believethat the Bayesian interpretation provide some insight and we detail it in what follows

821 A Bayesian Derivation

This section sketches a Bayesian treatment of inference limited to our present needswhere penalties are to be interpreted as prior distributions over the parameters of theprobabilistic model to be estimated Further details can be found in Bishop (2006Section 236) and in Gelman et al (2003 Section 36)

The model proposed in this thesis considers a classical maximum likelihood estimationfor the means and a penalized common covariance matrix This penalization can beinterpreted as arising from a prior on this parameter

The prior over the covariance matrix of a Gaussian variable is classically expressed asa Wishart distribution since it is a conjugate prior

f(Σ|Λ0 ν0) =1

2np2 |Λ0|

n2 Γp(

n2 )|Σminus1|

ν0minuspminus12 exp

minus1

2tr(Λminus1

0 Σminus1)

where ν0 is the number of degrees of freedom of the distribution Λ0 is a p times p scalematrix and where Γp is the multivariate gamma function defined as

Γp(n2) = πp(pminus1)4pprodj=1

Γ (n2 + (1minus j)2)

The posterior distribution can be maximized similarly to the likelihood through the

84

82 Optimized Criterion

maximization of

Q(θθprime) + log(f(Σ|Λ0 ν0))

=Ksumk=1

tk log πk minus(n+ 1)p

2log 2minus n

2log |Λ0| minus

p(p+ 1)

4log(π)

minuspsumj=1

log

(n

2+

1minus j2

))minus νn minus pminus 1

2log |Σ| minus 1

2tr(Λminus1n Σminus1

)equiv

Ksumk=1

tk log πk minusn

2log |Λ0| minus

νn minus pminus 1

2log |Σ| minus 1

2tr(Λminus1n Σminus1

) (82)

with tk =

nsumi=1

tik

νn = ν0 + n

Λminus1n = Λminus1

0 + S0

S0 =

nsumi=1

Ksumk=1

tik(xi minus microk)(xi minus microk)gt

Details of these calculations can be found in textbooks (for example Bishop 2006 Gelmanet al 2003)

822 Maximum a Posteriori Estimator

The maximization of (82) with respect to microk and πk is of course not affected by theadditional prior term where only the covariance Σ intervenes The MAP estimator forΣ is simply obtained by deriving (82) with respect to Σ The details of the calculationsfollow the same lines as the ones for maximum likelihood detailed in Appendix G Theresulting estimator for Σ is

ΣMAP =1

ν0 + nminus pminus 1(Λminus1

0 + S0) (83)

where S0 is the matrix defined in Equation (82) The maximum a posteriori estimator ofthe within-class covariance matrix (83) can thus be identified to the penalized within-class variance (419) resulting from the p-OS regression (416a) if ν0 is chosen to bep + 1 and setting Λminus1

0 = λΩ where Ω is the penalty matrix from the group-Lassoregularization (425)

85

9 Mix-GLOSS Algorithm

Mix-GLOSS is an algorithm for unsupervised classification that embeds feature se-lection resulting in parsimonious decision rules It is based on the GLOSS algorithmdeveloped in Chapter 5 that has been adapted for clustering In this chapter I describethe details of the implementations of Mix-GLOSS and of the model selection mechanism

91 Mix-GLOSS

The implementation of Mix-GLOSS involves three nested loops as schemed in Fig-ure 91 The inner one is an EM algorithm that for a given value of the regularizationparameter λ iterates between an M-step where the parameters of the model are esti-mated and an E-step where the corresponding posterior probabilities are computedThe main outputs of the EM are the coefficient matrix B that projects the input dataX onto the best subspace (in Fisherrsquos sense) and the posteriors tik

When several values of the penalty parameter are tested we give them to the algorithmin ascending order and the algorithm is initialized by the solution found for the previousλ value This process continues until all the penalty parameter values have been testedif a vector of penalty parameter was provided or until a given sparsity is achieved asmeasured by the number of variables estimated to be relevant

The outer loop implements complete repetitions of the clustering algorithm for all thepenalty parameter values with the purpose of choosing the best execution This loopalleviates the local minima issues by resorting to multiple initializations of the partition

911 Outer Loop Whole Algorithm Repetitions

This loop performs an user defined number of repetitions of the clustering algorithmIt takes as inputs

bull the centered ntimes p feature matrix X

bull the vector of penalty parameter values to be tried An option is to provide anempty vector and let the algorithm to set trial values automatically

bull the number of clusters K

bull the maximum number of iterations for the EM algorithm

bull the convergence tolerance for the EM algorithm

bull the number of whole repetitions of the clustering algorithm

87

9 Mix-GLOSS Algorithm

Figure 91 Mix-GLOSS Loops Scheme

bull a ptimes (K minus 1) initial coefficient matrix (optional)

bull a ntimesK initial posterior probability matrix (optional)

For each algorithm repetition an initial label matrix Y is needed This matrix maycontain either hard or soft assignments If no such matrix is available K-means is usedto initialize the process If we have an initial guess for the coefficient matrix B it canalso be fed into Mix-GLOSS to warm-start the process

912 Penalty Parameter Loop

The penalty parameter loop goes through all the values of the input vector λ Thesevalues are sorted in ascending order such that the resulting B and Y matrices can beused to warm-start the EM loop for the next value of the penalty parameter If some λvalue results in a null coefficient matrix the algorithm halts We have tested that thewarm-start implemented reduce the computation time in a factor of 8 with respect tousing a null B matrix and a K-means execution for the initial Y label matrix

Mix-GLOSS may be fed with an empty vector of penalty parameters in which case afirst non-penalized execution of Mix-GLOSS is done and its resulting coefficient matrixB and posterior matrix Y are used to estimate a trial value of λ that should removeabout 10 of relevant features This estimation is repeated until a minimum numberof relevant variables is achieved The parameter that measures the estimate percentage

88

91 Mix-GLOSS

of variables that will be removed with the next penalty parameter can be modified tomake feature selection more or less aggressive

Algorithm 2 details the implementation of the automatic selection of the penaltyparameter If the alternate variational approach from Appendix D is used we have toreplace Equations (432b) by (D10b)

Algorithm 2 Automatic selection of λ

Input X K λ = empty minVARInitializeBlarr 0Y larr K-means(XK)Run non-penalized Mix-GLOSSλlarr 0(BY)larr Mix-GLOSS(X K BYλ)lastLAMBDA larr falserepeat

Estimate λ Compute gradient at βj = 0partJ(B)

partβj

∣∣∣βj=0

= xjgt

(sum

m6=j xmβm minusYΘ)

Compute λmax for every feature using (432b)

λmaxj = 1

wj

∥∥∥∥ partJ(B)

partβj

∣∣∣βj=0

∥∥∥∥2

Choose λ so as to remove 10 of relevant featuresRun penalized Mix-GLOSS(BY)larr Mix-GLOSS(X K BYλ)if number of relevant variables in B gt minVAR thenlastLAMBDA larr false

elselastLAMBDA larr true

end ifuntil lastLAMBDA

Output B L(θ) tik πk microk Σ Y for every λ in solution path

913 Inner Loop EM Algorithm

The inner loop implements the actual clustering algorithm by means of successivemaximizations of a penalized likelihood criterion Once that convergence in the posteriorprobabilities tik is achieved the maximum a posteriori rule is applied to classify allexamples Algorithm 3 describes this inner loop

89

9 Mix-GLOSS Algorithm

Algorithm 3 Mix-GLOSS for one value of λ

Input X K B0 Y0 λInitializeif (B0Y0) available then

BOS larr B0 Y larr Y0

elseBOS larr 0 Y larr kmeans(XK)

end ifconvergenceEM larr false tolEM larr 1e-3repeat

M-step(BOSΘ

α)larr GLOSS(XYBOS λ)

XLDA = XBOS diag (αminus1(1minusα2)minus12

)

πk microk and Σ as per (710)(711) and (712)E-steptik as per (81)L(θ) as per (82)if 1n

sumi |tik minus yik| lt tolEM then

convergenceEM larr trueend ifY larr T

until convergenceEMY larr MAP(T)

Output BOS ΘL(θ) tik πk microk Σ Y

90

92 Model Selection

M-Step

The M-step deals with the estimation of the model parameters that is the clusterrsquosmeans microk the common covariance matrix Σ and the priors of every component πk Ina classical M-step this is done explicitly by maximizing the likelihood expression Herethis maximization is implicitly performed by penalized optimal scoring (see Section 81)The core of this step is a GLOSS execution that regress X on the scaled version of thelabel matrix ΘY For the first iteration of EM if no initialization is available Y resultsfrom a K-means execution In subsequent iterations Y is updated as the posteriorprobability matrix T resulting from the E-step

E-Step

The E-step evaluates the posterior probability matrix T using

tik prop exp

[minusd(x microk)minus 2 log(πk)

2

]

The convergence of those tik is used as stopping criterion for EM

92 Model Selection

Here model selection refers to the choice of the penalty parameter Up to now wehave not conducted experiments where the number of clusters has to be automaticallyselected

In a first attempt we tried a classical structure where clustering was performed severaltimes from different initializations for all penalty parameter values Then using the log-likelihood criterion the best repetition for every value of the penalty parameter waschosen The definitive λ was selected by means of the stability criterion described byLange et al (2002) This algorithm took lots of computing resources since the stabilityselection mechanism required a certain number of repetitions that transformed Mix-GLOSS in a lengthy four nested loops structure

In a second attempt we replaced the stability based model selection algorithm by theevaluation of a modified version of BIC (Pan and Shen 2007) This version of BIC lookslike the traditional one (Schwarz 1978) but takes into consideration the variables thathave been removed This mechanism even if it turned out to be faster required alsolarge computation time

The third and definitive attempt (up to now) proceeds with several executions ofMix-GLOSS for the non-penalized case (λ = 0) The execution with best log-likelihoodis chosen The repetitions are only performed for the non-penalized problem Thecoefficient matrix B and the posterior matrix T resulting from the best non-penalizedexecution are used to warm-start a new Mix-GLOSS execution This second executionof Mix-GLOSS is done using the values of the penalty parameter provided by the user orcomputed by the automatic selection mechanism This time only one repetition of thealgorithm is done for every value of the penalty parameter This version has been tested

91

9 Mix-GLOSS Algorithm

Initial Mix-GLOSS (λ =0 REPMixminusGLOSS = 20)

X K λEMITER MAXREPMixminusGLOSS

Use B and T frombest repetition as

StartB and StartT

Mix-GLOSS (λStartBStartT)

Compute BIC

Chose λ = minλ BIC

Partition tikπk λBEST BΘ D L(θ)activeset

Figure 92 Mix-GLOSS model selection diagram

with no significant differences in the quality of the clustering but reducing dramaticallythe computation time Diagram 92 resumes the mechanism that implements the modelselection of the penalty parameter λ

92

10 Experimental Results

The performance of Mix-GLOSS is measured here with the artificial dataset that hasbeen used in Section 6

This synthetic database is interesting because it covers four different situations wherefeature selection can be applied Basically it considers four setups with 1200 examplesequally distributed between classes It is an small sample regime with p = 500 variablesout of which 100 differ between classes Independent variables are generated for allsimulations except for simulation 2 where they are slightly correlated In simulation 2and 3 classes are optimally separated by a single projection of the original variableswhile the two other scenarios require three discriminant directions The Bayesrsquo errorwas estimated to be respectively 17 67 73 and 300 The exact description ofevery setup has already been done in Section 63

In our tests we have reduced the volume of the problem because with the originalsize of 1200 samples and 500 dimensions some of the algorithms to test took severaldays (even weeks) to finish Hence the definitive database was chosen to maintainapproximately the Bayesrsquo error of the original one but with five time less examplesand dimensions (n = 240 p = 100) The Figure 101 has been adapted from Wittenand Tibshirani (2011) to the dimensionality of ours experiments and allows a betterunderstanding of the different simulations

The simulation protocol involves 25 repetitions of each setup generating a differentdataset for each repetition Thus the results of the tested algorithms are provided asthe average value and the standard deviation of the 25 repetitions

101 Tested Clustering Algorithms

This section compares Mix-GLOSS with the following methods in the state of the art

bull CS general cov This is a model-based clustering with unconstrained covariancematrices based on the regularization of the likelihood function using L1 penaltiesfollowed of a classical EM algorithm Further details can be found in Zhou et al(2009) We use the R function available in the website of Wei Pan

bull Fisher EM This method models and clusters the data in a discriminative andlow-dimensional latent subspace (Bouveyron and Brunet 2012ba) Feature selec-tion is induced by means of the ldquosparsificationrdquo of the projection matrix (threepossibilities are suggested by Bouveyron and Brunet 2012a) The corresponding Rpackage ldquoFisher EMrdquo is available from the web site of Charles Bouveyron or fromthe Comprehensive R Archive Network website

93

10 Experimental Results

Figure 101 Class mean vectors for each artificial simulation

bull SelvarClustClustvarsel Implements a method of variable selection for clus-tering using Gaussian mixture models as a modification of the Raftery and Dean(2006) algorithm SelvarClust (Maugis et al 2009b) is a software implemented inC++ that make use of clustering libraries mixmod (Bienarcki et al 2008) Furtherinformation can be found in the related paper Maugis et al (2009a) The softwarecan be downloaded from the SelvarClust project homepage There is a link to theproject from Cathy Maugisrsquos website

After several tests this entrant was discarded due to the amount of computing timerequired by the greedy selection technique that basically involves two executionsof a classical clustering algorithm (with mixmod) for every single variable whoseinclusion needs to be considered

The substitute of SelvarClust has been the algorithm that inspired it that is themethod developed by Raftery and Dean (2006) There is a R package namedClustvarsel that can be downloaded from the website of Nema Dean or from theComprehensive R Archive Network website

bull LumiWCluster LumiWCluster is an R package available from the homepageof Pei Fen Kuan This algorithm is inspired by Wang and Zhu (2008) who pro-pose a penalty for the likelihood that incorporates group information through aL1infin mixed norm In Kuan et al (2010) they introduce some slight changes inthe penalty term as weighting parameters that are particularly important for theirdataset The package LumiWCluster allows to perform clustering using the ex-pression from Wang and Zhu (2008) (called LumiWCluster-Wang) or the one fromKuan et al (2010) (called LumiWCluster-Kuan)

bull Mix-GLOSS This is the clustering algorithm implemented using GLOSS (see

94

102 Results

Section 9) It makes use of an EM algorithm and the equivalences between the M-step and an LDA problem and between an p-LDA problem and a p-OS problem Itpenalizes an OS regression with a variational approach of the group-Lasso penalty(see Section 814) that induces zeros in all discriminant directions for the samevariable

102 Results

In Table 101 are shown the results of the experiments for all those algorithms fromSection 101 The parameters to measure the performance are

bull Clustering Error (in percentage) To measure the quality of the partitionwith the a priori knowledge of the real classes the clustering error is computedas explained in Wu and Scholkopf (2007) If the obtained partition and the reallabeling are the same then the clustering error shows a 0 The way this measureis defined allows to obtain the ideal 0 of clustering error even if the IDs for theclusters or the real classes are different

bull Number of Disposed Features This value shows the number of variables whosecoefficients have been zeroed therefore they are not used in the partitioning Inour datasets only the first 20 features are relevant for the discrimination thelast 80 variables can be discarded Hence a good result for the tested algorithmsshould be around 80

bull Time of execution (in hours minutes or seconds) Finally the time neededto execute the 25 repetitions for each simulation setup is also measured Thosealgorithms tend to be more memory and cpu consuming as the number of variablesincreases This is one of the reasons why the dimensionality of the original problemwas reduced

The adequacy of the selected features was assessed by the True Positive Rate (TPR)and the False Positive Rate (FPR) The TPR is defined as the ratio of selected variablesthat are actually relevant Similarly the FPR is the ratio of selected variables that areactually non relevant The best algorithm would be the one that selects all the relevantvariables and rejects all the others That is TPR = 1 and FPR = 0 simultaneouslyIn order to avoid cluttered results we compare TPR and FPR for the four simulationsbut only for the three algorithms CS general cov and Clustvarsel were discarded dueto high computing time and cluster error respectively The two versions of LumiW-Cluster providing almost the same TPR and FPR only one is displayed The threeremaining algorithms are Fisher EM by Bouveyron and Brunet (2012a) the version ofLumiWCluster by Kuan et al (2010) and Mix-GLOSS

Results in percentages are displayed in Figure 102 (or in Table 102 )

95

10 Experimental Results

Table 101 Experimental results for simulated data

Err () Var Time

Sim 1 K = 4 mean shift ind features

CS general cov 46 (15) 985 (72) 884hFisher EM 58 (87) 784 (52) 1645mClustvarsel 602 (107) 378 (291) 383hLumiWCluster-Kuan 42 (68) 779 (4) 389sLumiWCluster-Wang 43 (69) 784 (39) 619sMix-GLOSS 32 (16) 80 (09) 15h

Sim 2 K = 2 mean shift dependent features

CS general cov 154 (2) 997 (09) 783hFisher EM 74 (23) 809 (28) 8mClustvarsel 73 (2) 334 (207) 166hLumiWCluster-Kuan 64 (18) 798 (04) 155sLumiWCluster-Wang 63 (17) 799 (03) 14sMix-GLOSS 77 (2) 841 (34) 2h

Sim 3 K = 4 1D mean shift ind features

CS general cov 304 (57) 55 (468) 1317hFisher EM 233 (65) 366 (55) 22mClustvarsel 658 (115) 232 (291) 542hLumiWCluster-Kuan 323 (21) 80 (02) 83sLumiWCluster-Wang 308 (36) 80 (02) 1292sMix-GLOSS 347 (92) 81 (88) 21h

Sim 4 K = 4 mean shift ind features

CS general cov 626 (55) 999 (02) 112hFisher EM 567 (104) 55 (48) 195mClustvarsel 732 (4) 24 (12) 767hLumiWCluster-Kuan 692 (112) 99 (2) 876sLumiWCluster-Wang 697 (119) 991 (21) 825sMix-GLOSS 669 (91) 975(12) 11h

Table 102 TPR versus FPR (in ) average computed over 25 repetitions for the bestperforming algorithms

Simulation1 Simulation2 Simulation3 Simulation4TPR FPR TPR FPR TPR FPR TPR FPR

MIX-GLOSS 992 015 828 335 884 67 780 12

LUMI-KUAN 992 28 1000 02 1000 005 50 005

FISHER-EM 986 24 888 17 838 5825 620 4075

96

103 Discussion

0 10 20 30 40 50 600

10

20

30

40

50

60

70

80

90

100TPR Vs FPR

MIXminusGLOSS

LUMIminusKUAN

FISHERminusEM

Simulation1

Simulation2

Simulation3

Simulation4

Figure 102 TPR versus FPR (in ) for the most performing algorithms and simula-tions

103 Discussion

After reviewing Tables 101ndash102 and Figure 102 we see that there is no definitivewinner in all situations regarding all criteria According to the objectives and constraintsof the problem the following observations deserve to be highlighted

LumiWCluster (Wang and Zhu 2008 Kuan et al 2010) is by far the fastest kind ofmethod with good behaviors regarding the other performances At the other end ofthis criterion CS general cov is extremely slow and Clustvarsel though twice as fast isalso very long to produce an output Of course the speed criterion does not say muchby itself the implementations use different programming languages different stoppingcriteria and we do not know what effort has been spent on implementation That beingsaid the slowest algorithm are not the more precise ones so their long computation timeis worth mentioning here

The quality of the partition vary depending on the simulation and the algorithm Mix-GLOSS has a small edge in Simulation 1 LumiWCluster (Zhou et al 2009) performsbetter in Simulation 2 while Fisher EM (Bouveyron and Brunet 2012a) does slightlybetter in Simulations 3 and 4

From the feature selection point of view LumiWCluster (Kuan et al 2010) and Mix-GLOSS succeed in removing irrelevant variables in all the situations Fisher EM (Bou-veyron and Brunet 2012a) and Mix-GLOSS discover the relevant ones Mix-GLOSSconsistently performs best or close to the best solution in terms of fall-out and recall

97

Conclusions

99

Conclusions

Summary

The linear regression of scaled indicator matrices or optimal scoring is a versatiletechnique with applicability in many fields of the machine learning domain An optimalscoring regression by means of regularization can be strengthen to be more robustavoid overfitting counteract ill-posed problems or remove correlated or noisy variables

In this thesis we have proved the utility of penalized optimal scoring in the fields ofmulti-class linear discrimination and clustering

The equivalence between LDA and OS problems allows to take advantage of all theresources available on the resolution of regression to the solution of linear discriminationIn their penalized versions this equivalence holds under certain conditions that have notalways been obeyed when OS has been used to solve LDA problems

In Part II we have used a variational approach of group-Lasso penalty to preserve thisequivalence granting the use of penalized optimal scoring regressions for the solutionof linear discrimination problems This theory has been verified with the implementa-tion of our Group Lasso Optimal Scoring Solver algorithm (GLOSS) that has provedits effectiveness inducing extremely parsimonious models without renouncing any pre-dicting capabilities GLOSS has been tested with four artificial and three real datasetsoutperforming other algorithms at the state of the art in almost all situations

In Part III this theory has been adapted by means of an EM algorithm to the unsu-pervised domain As for the supervised case the theory must guarantee the equivalencebetween penalized LDA and penalized OS The difficulty of this method resides in thecomputation of the criterion to maximize at every iteration of the EM loop that istypically used to detect the convergence of the algorithm and to implement model selec-tion of the penalty parameter Also in this case the theory has been put into practicewith the implementation of Mix-GLOSS By now due to time constraints only artificialdatasets have been tested with positive results

Perspectives

Even if the preliminary result are optimistic Mix-GLOSS has not been sufficientlytested We have planned to test it at least with the same real datasets that we used withGLOSS However more testing would be recommended in both cases Those algorithmsare well suited for genomic data where the number of samples is smaller than the numberof variables however other high-dimensional low-sample setting (HDLSS) domains arealso possible Identification of male or female silhouettes fungal species or fish species

101

based on shape and texture (Clemmensen et al 2011) Stirling faces (Roth and Lange2004) are only some examples Moreover we are not constrained to the HDLSS domainthe USPS handwritten digits database (Roth and Lange 2004) or the well known IrisFisherrsquos dataset and six UCIrsquos others (Bouveyron and Brunet 2012a) have also beentested in the bibliography

At the programming level both codes must be revisited to improve their robustnessand optimize their computation because during the prototyping phase the priority wasachieving a functional code An old version of GLOSS numerically more stable but lessefficient has been made available to the public A better suited and documented versionshould be made available for GLOSS and Mix-GLOSS in the short term

The theory developed in this thesis and the programming structure used for its im-plementation allow easy alterations the the algorithm by modifying the within-classcovariance matrix Diagonal versions of the model can be obtained by discarding allthe elements but the diagonal of the covariance matrix Spherical models could also beimplemented easily Prior information concerning the correlation between features canbe included by adding a quadratic penalty term such as the Laplacian that describesthe relationships between variables That can be used to implement pair-wise penaltieswhen the dataset is formed by pixels Quadratic penalty matrices can be also be addedto the within-class covariance to implement Elastic net equivalent penalties Some ofthose possibilities have been partially implemented as the diagonal version of GLOSShowever they have not been properly tested or even updated with the last algorith-mic modifications Their equivalents for the unsupervised domain have not been yetproposed due to the time deadlines for the publication of this thesis

From the point of view of the supporting theory we didnrsquot succeed finding the exactcriterion that is maximized in Mix-GLOSS We believe it must be a kind of penalizedor even hyper-penalized likelihood but we decided to prioritize the experimental resultsdue to the time constraints Ignorancing this criterion does not prevent from successfulsimulations of Mix-GLOSS Other mechanisms have been used in the stopping of theEM algorithm and in model selection that do not involve the computation of the realcriterion However further investigations must be done in this direction to assess theconvergence properties of this algorithm

At the beginning of this thesis even if finally the work took the direction of featureselection a big effort was done in the domain of outliers detection and block clusteringOne of the most succsefull mechanism in the detection of outliers is done by modelling thepopulation with a mixture model where the outliers should be described by an uniformdistribution This technique does not need any prior knowledge about the number orabout the percentage of outliers As the basis model of this thesis is a mixture ofGaussians our impression is that it should not be difficult to introduce a new uniformcomponent to gather together all those points that do not fit the Gaussian mixture Onthe other hand the application of penalized optimal scoring to block clustering looksmore complex but as block clustering is typically defined as a mixture model whoseparameters are estimated by means of an EM it could be possible to re-interpret thatestimation using a penalized optimal scoring regression

102

Appendix

103

A Matrix Properties

Property 1 By definition ΣW and ΣB are both symmetric matrices

ΣW =1

n

gsumk=1

sumiisinCk

(xi minus microk)(xi minus microk)gt

ΣB =1

n

gsumk=1

nk(microk minus x)(microk minus x)gt

Property 2 partxgtapartx = partagtx

partx = a

Property 3 partxgtAxpartx = (A + Agt)x

Property 4 part|Xminus1|partX = minus|Xminus1|(Xminus1)gt

Property 5 partagtXbpartX = abgt

Property 6 partpartXtr

(AXminus1B

)= minus(Xminus1BAXminus1)gt = XminusgtAgtBgtXminusgt

105

B The Penalized-OS Problem is anEigenvector Problem

In this appendix we answer the question why the solution of a penalized optimalscoring regression involves the computation of an eigenvector decomposition The p-OSproblem has this form

minθkβk

Yθk minusXβk22 + βgtk Ωkβk (B1)

st θgtk YgtYθk = 1

θgt` YgtYθk = 0 forall` lt k

for k = 1 K minus 1The Lagrangian associated to Problem (B1) is

Lk(θkβk λkνk) =

Yθk minusXβk22 + βgtk Ωkβk + λk(θ

gtk YgtYθk minus 1) +

sum`ltk

ν`θgt` YgtYθk (B2)

Making zero the gradient of (B2) with respect to βk gives the value of the optimal βk

βk = (XgtX + Ωk)minus1XgtYθk (B3)

The objective function of (B1) evaluated at βk is

minθk

Yθk minusXβk22 + βk

gtΩkβk = min

θk

θgtk Ygt(IminusX(XgtX + Ωk)minus1Xgt)Yθk

= maxθk

θgtk YgtX(XgtX + Ωk)minus1Xgt)Yθk (B4)

If the penalty matrix Ωk is identical for all problems Ωk = Ω then (B4) corresponds toan eigen-problem where the k score vectors θk are then the eigenvectors of YgtX(XgtX+Ω)minus1XgtY

B1 How to Solve the Eigenvector Decomposition

Making an eigen-decomposition of an expression like YgtX(XgtX + Ω)minus1XgtY is nottrivial due to the p times p inverse With some datasets p can be extremely large makingthis inverse intractable In this section we show how to circumvent this issue solving aneasier eigenvector decomposition

107

B The Penalized-OS Problem is an Eigenvector Problem

Let M be the matrix YgtX(XgtX + Ω)minus1XgtY such that we can rewrite expression(B4) in a compact way

maxΘisinRKtimes(Kminus1)

tr(ΘgtMΘ

)(B5)

st ΘgtYgtYΘ = IKminus1

If (B5) is an eigenvector problem it can be reformulated on the traditional way Letthe K minus 1timesK minus 1 matrix MΘ be ΘgtMΘ Hence the eigenvector classical formulationassociated to (B5) is

MΘv = λv (B6)

where v is the eigenvector and λ the associated eigenvalue of MΘ Operating

vgtMΘv = λhArr vgtΘgtMΘv = λ

Making the variable change w = Θv we obtain an alternative eigenproblem where ware the eigenvectors of M and λ the associated eigenvalue

wgtMw = λ (B7)

Therefore v are the eigenvectors of the eigen-decomposition of matrix MΘ and w arethe eigenvectors of the eigen-decomposition of matrix M Note that the only differencebetween the K minus 1 times K minus 1 matrix MΘ and the K times K matrix M is the K times K minus 1matrix Θ in expression MΘ = ΘgtMΘ Then to avoid the computation of the p times pinverse (XgtX+Ω)minus1 we can use the optimal value of the coefficient matrix B = (XgtX+Ω)minus1XgtYΘ into MΘ

MΘ = ΘgtYgtX(XgtX + Ω)minus1XgtYΘ

= ΘgtYgtXB

Thus the eigen-decomposition of the (K minus 1) times (K minus 1) matrix MΘ = ΘgtYgtXB results in the v eigenvectors of (B6) To obtain the w eigenvectors of the alternativeformulation (B7) the variable change w = Θv needs to be undone

To summarize we calcule the v eigenvectors computed as the eigen-decomposition of atractable MΘ matrix evaluated as ΘgtYgtXB Then the definitive eigenvectors w arerecovered by doing w = Θv The final step is the reconstruction of the optimal scorematrix Θ using the vectors w as its columns At this point we understand what inthe literature is called ldquoupdating the initial score matrixrdquo Multiplying the initial Θ tothe eigenvectors matrix V from decomposition (B6) is reversing the change of variableto restore the w vectors The B matrix also needs to be ldquoupdatedrdquo by multiplying Bby the same matrix of eigenvectors V in order to affect the initial Θ matrix used in thefirst computation of B

B = (XgtX + Ω)minus1XgtYΘV = BV

108

B2 Why the OS Problem is Solved as an Eigenvector Problem

B2 Why the OS Problem is Solved as an Eigenvector Problem

In the Optimal Scoring literature the score matrix Θ that optimizes Problem (B1)is obtained by means of a eigenvector decomposition of matrix M = YgtX(XgtX +Ω)minus1XgtY

By definition of eigen-decomposition the eigenvectors of the M matrix (called w in(B7)) form a base so that any score vector θ can be expressed as a linear combinationof them

θk =

Kminus1summ=1

αmwm s t θgtk θk = 1 (B8)

The score vectors orthogonality constraint θgtk θk = 1 can be expressed also as a functionof this base (

Kminus1summ=1

αmwm

)gt(Kminus1summ=1

αmwm

)= 1

that as per the eigenvector properties can be reduced to

Kminus1summ=1

α2m = 1 (B9)

Let M be multiplied by a score vector θk that can be replaced by its linear combinationof eigenvectors wm (B8)

Mθk = M

Kminus1summ=1

αmwm

=

Kminus1summ=1

αmMwm

As wm are the eigenvectors of the M matrix the relationship Mwm = λmwm can beused to obtain

Mθk =Kminus1summ=1

αmλmwm

Multiplying right side by θgtk and left side by its corresponding linear combination ofeigenvectors

θgtk Mθk =

(Kminus1sum`=1

α`w`

)gt(Kminus1summ=1

αmλmwm

)

This equation can be simplified using the orthogonality property of eigenvectors accord-ing to which w`wm is zero for any ` 6= m giving

θgtk Mθk =Kminus1summ=1

α2mλm

109

B The Penalized-OS Problem is an Eigenvector Problem

The optimization Problem (B5) for discriminant direction k can be rewritten as

maxθkisinRKtimes1

θgtk Mθk

= max

θkisinRKtimes1

Kminus1summ=1

α2mλm

(B10)

with θk =Kminus1summ=1

αmwm

andKminus1summ=1

α2m = 1

One way of maximizing Problem (B10) is choosing αm = 1 for m = k and αm = 0otherwise Hence as θk =

sumKminus1m=1 αmwm the resulting score vector θk will be equal to

the kth eigenvector wkAs a summary it can be concluded that the solution to the original problem (B1) can

be achieved by an eigenvector decomposition of matrix M = YgtX(XgtX + Ω)minus1XgtY

110

C Solving Fisherrsquos Discriminant Problem

The classical Fisherrsquos discriminant problem seeks a projection that better separatesthe class centers while every class remains compact This is formalized as looking fora projection such that the projected data has maximal between-class variance under aunitary constraint on the within-class variance

maxβisinRp

βgtΣBβ (C1a)

s t βgtΣWβ = 1 (C1b)

where ΣB and ΣW are respectively the between-class variance and the within-classvariance of the original p-dimensional data

The Lagrangian of Problem (C1) is

L(β ν) = βgtΣBβ minus ν(βgtΣWβ minus 1)

so that its first derivative with respect to β is

partL(β ν)

partβ= 2ΣBβ minus 2νΣWβ

A necessary optimality condition for β is that this derivative is zero that is

ΣBβ = νΣWβ

Provided ΣW is full rank we have

Σminus1W ΣBβ

= νβ (C2)

Thus the solutions β match the definition of an eigenvector of matrix Σminus1W ΣB of

eigenvalue ν To characterize this eigenvalue we note that the the objective function(C1a) can be expressed as follows

βgtΣBβ = βgtΣWΣminus1

W ΣBβ

= νβgtΣWβ from (C2)

= ν from (C1b)

That is the optimal value of the objective function to be maximized is the eigenvalue νHence ν is the largest eigenvalue of Σminus1

W ΣB and β is any eigenvector correspondingto this maximal eigenvalue

111

D Alternative Variational Formulation forthe Group-Lasso

In this appendix an alternative to the variational form of the group-Lasso (421)presented in Section 431 is proposed

minτisinRp

minBisinRptimesKminus1

J(B) + λ

psumj=1

w2j

∥∥βj∥∥2

2

τj(D1a)

s tsump

j=1 τj = 1 (D1b)

τj ge 0 j = 1 p (D1c)

Following the approach detailed in Section 431 its equivalence with the standardgroup-Lasso formulation is demonstrated here Let B isin RptimesKminus1 be a matrix composed

of row vectors βj isin RKminus1 B =(β1gt βpgt

)gt

L(B τ λ ν0 νj) = J(B) + λ

psumj=1

w2j

∥∥βj∥∥2

2

τj+ ν0

psumj=1

τj minus 1

minus psumj=1

νjτj (D2)

The starting point is the Lagrangian (D2) that is differentiated with respect to τj toget the optimal value τj

partL(B τ λ ν0 νj)

partτj

∣∣∣∣τj=τj

= 0 rArr minusλw2j

∥∥βj∥∥2

2

τj2 + ν0 minus νj = 0

rArr minusλw2j

∥∥βj∥∥2

2+ ν0τ

j

2 minus νjτj2 = 0

rArr minusλw2j

∥∥βj∥∥2

2+ ν0τ

j

2 = 0

The last two expressions are related through one property of the Lagrange multipliersthat states that νjgj(τ

) = 0 where νj is the Lagrange multiplier and gj(τ) is the

inequality Lagrange condition Then the optimal τj can be deduced

τj =

radicλ

ν0wj∥∥βj∥∥

2

Placing this optimal value of τj into constraint (D1b)

psumj=1

τj = 1rArr τj =wj∥∥βj∥∥

2sumpj=1wj

∥∥βj∥∥2

(D3)

113

D Alternative Variational Formulation for the Group-Lasso

With this value of τj Problem (D1) is equivalent to

minBisinRptimesKminus1

J(B) + λ

psumj=1

wj∥∥βj∥∥

2

2

(D4)

This problem is a slight alteration of the standard group-Lasso as the penalty is squaredcompared to the usual form This square only affects the strength of the penalty and theusual properties of the group-Lasso apply to the solution of problem D4) In particularits solution is expected to be sparse with some null vectors βj

The penalty term of (D1a) can be conveniently presented as λBgtΩB where

Ω = diag

(w2

1

τ1w2

2

τ2

w2p

τp

) (D5)

Using the value of τj from (D3) each diagonal component of Ω is

(Ω)jj =wjsump

j=1wj∥∥βj∥∥

2∥∥βj∥∥2

(D6)

In the following paragraphs the optimality conditions and properties developed forthe quadratic variational approach detailed in Section 431 are also computed here forthis alternative formulation

D1 Useful Properties

Lemma D1 If J is convex Problem (D1) is convex

In what follows J will be a convex quadratic (hence smooth) function in which casea necessary and sufficient optimality condition is that zero belongs to the subdifferentialof the objective function whose expression is provided in the following lemma

Lemma D2 For all B isin RptimesKminus1 the subdifferential of the objective function of Prob-lem (D4) is V isin RptimesKminus1 V =

partJ(B)

partB+ 2λ

Kminus1sumj=1

wj∥∥βj∥∥

2

G

(D7)

where G = (g1 gKminus1) is a ptimesK minus 1 matrix defined as follows Let S(B) denotethe columnwise support of B S(B) = j isin 1 K minus 1

∥∥βj∥∥26= 0 then we have

forallj isin S(B) gj = wj∥∥βj∥∥minus1

2βj (D8)

forallj isin S(B) ∥∥gj∥∥

2le wj (D9)

114

D2 An Upper Bound on the Objective Function

This condition results in an equality for the ldquoactiverdquo non-zero vectors βj and aninequality for the other ones which both provide essential building blocks of our algo-rithm

Lemma D3 Problem (D4) admits at least one solution which is unique if J(B)is strictly convex All critical points B of the objective function verifying the followingconditions are global minima Let S(B) denote the columnwise support of B S(B) =j isin 1 K minus 1

∥∥βj∥∥26= 0 and let S(B) be its complement then we have

forallj isin S(B) minus partJ(B)

partβj= 2λ

Kminus1sumj=1

wj∥∥βj∥∥2

wj∥∥βj∥∥minus1

2βj (D10a)

forallj isin S(B)

∥∥∥∥partJ(B)

partβj

∥∥∥∥2

le 2λwj

Kminus1sumj=1

wj∥∥βj∥∥2

(D10b)

In particular Lemma D3 provides a well-defined appraisal of the support of thesolution which is not easily handled from the direct analysis of the variational problem(D1)

D2 An Upper Bound on the Objective Function

Lemma D4 The objective function of the variational form (D1) is an upper bound onthe group-Lasso objective function (D4) and for a given B the gap in these objectivesis null at τ such that

τj =wj∥∥βj∥∥

2sumpj=1wj

∥∥βj∥∥2

Proof The objective functions of (421) and (424) only differ in their second term Letτ isin Rp be any feasible vector we have psum

j=1

wj∥∥βj∥∥

2

2

=

psumj=1

τ12j

wj∥∥βj∥∥

2

τ12j

2

le

psumj=1

τj

psumj=1

w2j

∥∥βj∥∥2

2

τj

le

psumj=1

w2j

∥∥βj∥∥2

2

τj

where we used the Cauchy-Schwarz inequality in the second line and the definition ofthe feasibility set of τ in the last one

115

D Alternative Variational Formulation for the Group-Lasso

This lemma only holds for the alternative variational formulation described in thisappendix It is difficult to have the same result in the first variational form (Section431) because the definition of the feasible sets of τ and β are intertwined

116

E Invariance of the Group-Lasso to UnitaryTransformations

The computational trick described in Section 52 for quadratic penalties can be appliedto group-Lasso provided that the following holds if the regression coefficients B0 areoptimal for the score values Θ0 and if the optimal scores Θ are obtained by a unitarytransformation of Θ0 say Θ = Θ0V (where V isin RMtimesM is a unitary matrix) thenB = B0V is optimal conditionally on Θ that is (ΘB) is a global solution corre-sponding to the optimal scoring problem To show this we use the standard group-Lassoformulation and show the following proposition

Proposition E1 Let B be a solution of

minBisinRptimesM

Y minusXB2F + λ

psumj=1

wj∥∥βj∥∥

2(E1)

and let Y = YV where V isin RMtimesM is a unitary matrix Then B = BV is a solutionof

minBisinRptimesM

∥∥∥Y minusXB∥∥∥2

F+ λ

psumj=1

wj∥∥βj∥∥

2(E2)

Proof The first-order necessary optimality conditions for B are

forallj isin S(B) 2xjgt(xjβ

j minusY)

+ λwj

∥∥∥βj∥∥∥minus1

2βj

= 0 (E3a)

forallj isin S(B) 2∥∥∥xjgt (xjβ

j minusY)∥∥∥

2le λwj (E3b)

where S(B) sube 1 p denotes the set of non-zero row vectors of B and S(B) is itscomplement

First we note that from the definition of B we have S(B) = S(B) Then we mayrewrite the above conditions as follows

forallj isin S(B) 2xjgt(xjβ

j minus Y)

+ λwj

∥∥∥βj∥∥∥minus1

2βj

= 0 (E4a)

forallj isin S(B) 2∥∥∥xjgt (xjβ

j minus Y)∥∥∥

2le λwj (E4b)

where (E4a) is obtained by multiplying both sides of Equation (E3a) by V and alsouses that VVgt = I so that forallu isin RM

∥∥ugt∥∥2

=∥∥ugtV

∥∥2 Equation (E4b) is also

117

E Invariance of the Group-Lasso to Unitary Transformations

obtained from the latter relationship Conditions (E4) are then recognized as the first-order necessary conditions for B to be a solution to Problem (E2) As the latter isconvex these conditions are sufficient which concludes the proof

118

F Expected Complete Likelihood andLikelihood

Section 712 explains that with the maximization of the conditional expectation ofthe complete log-likelihood Q(θθprime) (77) by means of the EM algorithm log-likelihood(71) is also maximized The value of the log-likelihood can be computed using itsdefinition (71) but there is a shorter way to compute it from Q(θθprime) when the latteris available

L(θ) =

nsumi=1

log

(Ksumk=1

πkfk(xiθk)

)(F1)

Q(θθprime) =nsumi=1

Ksumk=1

tik(θprime) log (πkfk(xiθk)) (F2)

with tik(θprime) =

πprimekfk(xiθprimek)sum

` πprime`f`(xiθ

prime`)

(F3)

In the EM algorithm θprime is the model parameters at previous iteration tik(θprime) are

the posterior probability values computed from θprime at the previous E-Step and θ with-out ldquoprimerdquo denotes the parameters of the current iteration to be obtained with themaximization of Q(θθprime)

Using (F3) we have

Q(θθprime) =sumik

tik(θprime) log (πkfk(xiθk))

=sumik

tik(θprime) log(tik(θ)) +

sumik

tik(θprime) log

(sum`

π`f`(xiθ`)

)=sumik

tik(θprime) log(tik(θ)) + L(θ)

In particular after the evaluation of tik in the E-step where θ = θprime the log-likelihoodcan be computed using the value of Q(θθ) (77) and the entropy of the posterior prob-abilities

L(θ) = Q(θθ)minussumik

tik(θ) log(tik(θ))

= Q(θθ) +H(T)

119

G Derivation of the M-Step Equations

This appendix shows the whole process to obtain expressions (710) (711) and (712)in the context of a Gaussian mixture model with common covariance matrices Thecriterion is defined as

Q(θθprime) = maxθ

sumik

tik(θprime) log(πkfk(xiθk))

=sumk

log

(πksumi

tik

)minus np

2log(2π)minus n

2log |Σ| minus 1

2

sumik

tik(xi minus microk)gtΣminus1(xi minus microk)

which has to be maximized subject tosumk

πk = 1

The Lagrangian of this problem is

L(θ) = Q(θθprime) + λ

(sumk

πk minus 1

)

Partial derivatives of the Lagrangian are made zero to obtain the optimal values ofπk microk and Σ

G1 Prior probabilities

partL(θ)

partπk= 0hArr 1

πk

sumi

tik + λ = 0

where λ is identified from the constraint leading to

πk =1

n

sumi

tik

121

G Derivation of the M-Step Equations

G2 Means

partL(θ)

partmicrok= 0hArr minus1

2

sumi

tik2Σminus1(microk minus xi) = 0

rArr microk =

sumi tikxisumi tik

G3 Covariance Matrix

partL(θ)

partΣminus1 = 0hArr n

2Σ︸︷︷︸

as per property 4

minus 1

2

sumik

tik(xi minus microk)(xi minus microk)gt

︸ ︷︷ ︸as per property 5

= 0

rArr Σ =1

n

sumik

tik(xi minus microk)(xi minus microk)gt

122

Bibliography

F Bach R Jenatton J Mairal and G Obozinski Convex optimization with sparsity-inducing norms Optimization for Machine Learning pages 19ndash54 2011

F R Bach Bolasso model consistent lasso estimation through the bootstrap InProceedings of the 25th international conference on Machine learning ICML 2008

F R Bach R Jenatton J Mairal and G Obozinski Optimization with sparsity-inducing penalties Foundations and Trends in Machine Learning 4(1)1ndash106 2012

J D Banfield and A E Raftery Model-based Gaussian and non-Gaussian clusteringBiometrics pages 803ndash821 1993

A Beck and M Teboulle A fast iterative shrinkage-thresholding algorithm for linearinverse problems SIAM Journal on Imaging Sciences 2(1)183ndash202 2009

H Bensmail and G Celeux Regularized Gaussian discriminant analysis through eigen-value decomposition Journal of the American statistical Association 91(436)1743ndash1748 1996

P J Bickel and E Levina Some theory for Fisher's linear discriminant function, 'naive Bayes', and some alternatives when there are many more variables than observations Bernoulli 10(6)989–1010 2004

C Biernacki G Celeux G Govaert and F Langrognet MIXMOD Statistical Documentation http://www.mixmod.org 2008

C M Bishop Pattern Recognition and Machine Learning Springer New York 2006

C Bouveyron and C Brunet Discriminative variable selection for clustering with thesparse Fisher-EM algorithm Technical Report 12042067 Arxiv e-prints 2012a

C Bouveyron and C Brunet Simultaneous model-based clustering and visualization inthe Fisher discriminative subspace Statistics and Computing 22(1)301ndash324 2012b

S Boyd and L Vandenberghe Convex optimization Cambridge university press 2004

L Breiman Better subset regression using the nonnegative garrote Technometrics 37(4)373ndash384 1995

L Breiman and R Ihaka Nonlinear discriminant analysis via ACE and scaling TechnicalReport 40 University of California Berkeley 1984


T Cai and W Liu A direct estimation approach to sparse linear discriminant analysisJournal of the American Statistical Association 106(496)1566ndash1577 2011

S Canu and Y Grandvalet Outcomes of the equivalence of adaptive ridge with leastabsolute shrinkage Advances in Neural Information Processing Systems page 4451999

C Caramanis S Mannor and H Xu Robust optimization in machine learning InS Sra S Nowozin and S J Wright editors Optimization for Machine Learningpages 369ndash402 MIT Press 2012

B Chidlovskii and L Lecerf Scalable feature selection for multi-class problems InW Daelemans B Goethals and K Morik editors Machine Learning and KnowledgeDiscovery in Databases volume 5211 of Lecture Notes in Computer Science pages227ndash240 Springer 2008

L Clemmensen T Hastie D Witten and B Ersbøll Sparse discriminant analysis Technometrics 53(4)406–413 2011

C De Mol E De Vito and L Rosasco Elastic-net regularization in learning theoryJournal of Complexity 25(2)201ndash230 2009

A P Dempster N M Laird and D B Rubin Maximum likelihood from incompletedata via the em algorithm Journal of the Royal Statistical Society Series B (Method-ological) 39(1)1ndash38 1977 ISSN 0035-9246

D L Donoho M Elad and V N Temlyakov Stable recovery of sparse overcompleterepresentations in the presence of noise IEEE Transactions on Information Theory52(1)6ndash18 2006

R O Duda P E Hart and D G Stork Pattern Classification Wiley 2000

B Efron T Hastie I Johnstone and R Tibshirani Least angle regression The Annalsof statistics 32(2)407ndash499 2004

Jianqing Fan and Yingying Fan High dimensional classification using features annealedindependence rules Annals of statistics 36(6)2605 2008

R A Fisher The use of multiple measurements in taxonomic problems Annals ofHuman Genetics 7(2)179ndash188 1936

V Franc and S Sonnenburg Optimized cutting plane algorithm for support vectormachines In Proceedings of the 25th international conference on Machine learningpages 320ndash327 ACM 2008

J Friedman T Hastie and R Tibshirani The Elements of Statistical Learning DataMining Inference and Prediction Springer 2009


J Friedman T Hastie and R Tibshirani A note on the group lasso and a sparse grouplasso Technical Report 10010736 ArXiv e-prints 2010

J H Friedman Regularized discriminant analysis Journal of the American StatisticalAssociation 84(405)165ndash175 1989

W J Fu Penalized regressions the bridge versus the lasso Journal of Computationaland Graphical Statistics 7(3)397ndash416 1998

A Gelman J B Carlin H S Stern and D B Rubin Bayesian Data Analysis Chap-man amp HallCRC 2003

D Ghosh and A M Chinnaiyan Classification and selection of biomarkers in genomicdata using lasso Journal of Biomedicine and Biotechnology 2147ndash154 2005

G Govaert Y Grandvalet X Liu and L F Sanchez Merchante Implementation baseline for clustering Technical Report D71-m12 Massive Sets of Heuristics for Machine Learning https://secure.mash-project.eu/files/mash-deliverable-D71-m12.pdf 2010

G Govaert Y Grandvalet B Laval X Liu and L F Sanchez Merchante Implementations of original clustering Technical Report D72-m24 Massive Sets of Heuristics for Machine Learning https://secure.mash-project.eu/files/mash-deliverable-D72-m24.pdf 2011

Y Grandvalet Least absolute shrinkage is equivalent to quadratic penalization InPerspectives in Neural Computing volume 98 pages 201ndash206 1998

Y Grandvalet and S Canu Adaptive scaling for feature selection in svms Advances inNeural Information Processing Systems 15553ndash560 2002

L Grosenick S Greer and B Knutson Interpretable classifiers for fMRI improveprediction of purchases IEEE Transactions on Neural Systems and RehabilitationEngineering 16(6)539ndash548 2008

Y Guermeur G Pollastri A Elisseeff D Zelus H Paugam-Moisy and P Baldi Com-bining protein secondary structure prediction models with ensemble methods of opti-mal complexity Neurocomputing 56305ndash327 2004

J Guo E Levina G Michailidis and J Zhu Pairwise variable selection for high-dimensional model-based clustering Biometrics 66(3)793ndash804 2010

I Guyon and A Elisseeff An introduction to variable and feature selection Journal ofMachine Learning Research 31157ndash1182 2003

T Hastie and R Tibshirani Discriminant analysis by Gaussian mixtures Journal ofthe Royal Statistical Society Series B (Methodological) 58(1)155ndash176 1996

T Hastie R Tibshirani and A Buja Flexible discriminant analysis by optimal scoringJournal of the American Statistical Association 89(428)1255ndash1270 1994


T Hastie A Buja and R Tibshirani Penalized discriminant analysis The Annals ofStatistics 23(1)73ndash102 1995

A E Hoerl and R W Kennard Ridge regression Biased estimation for nonorthogonalproblems Technometrics 12(1)55ndash67 1970

J Huang S Ma H Xie and C H Zhang A group bridge approach for variableselection Biometrika 96(2)339ndash355 2009

T Joachims Training linear svms in linear time In Proceedings of the 12th ACMSIGKDD international conference on Knowledge discovery and data mining pages217ndash226 ACM 2006

K Knight and W Fu Asymptotics for lasso-type estimators The Annals of Statistics28(5)1356ndash1378 2000

P F Kuan S Wang X Zhou and H Chu A statistical framework for illumina DNAmethylation arrays Bioinformatics 26(22)2849ndash2855 2010

T Lange M Braun V Roth and J Buhmann Stability-based model selection Ad-vances in Neural Information Processing Systems 15617ndash624 2002

M H C Law M A T Figueiredo and A K Jain Simultaneous feature selectionand clustering using mixture models IEEE Transactions on Pattern Analysis andMachine Intelligence 26(9)1154ndash1166 2004

Y Lee Y Lin and G Wahba Multicategory support vector machines Journal of theAmerican Statistical Association 99(465)67ndash81 2004

C Leng Sparse optimal scoring for multiclass cancer diagnosis and biomarker detectionusing microarray data Computational Biology and Chemistry 32(6)417ndash425 2008

C Leng Y Lin and G Wahba A note on the lasso and related procedures in modelselection Statistica Sinica 16(4)1273 2006

H Liu and L Yu Toward integrating feature selection algorithms for classification andclustering IEEE Transactions on Knowledge and Data Engineering 17(4)491ndash5022005

J MacQueen Some methods for classification and analysis of multivariate observa-tions In Proceedings of the fifth Berkeley Symposium on Mathematical Statistics andProbability volume 1 pages 281ndash297 University of California Press 1967

Q Mai H Zou and M Yuan A direct approach to sparse discriminant analysis inultra-high dimensions Biometrika 99(1)29ndash42 2012

C Maugis G Celeux and M L Martin-Magniette Variable selection for clusteringwith Gaussian mixture models Biometrics 65(3)701ndash709 2009a


C Maugis G Celeux and M L Martin-Magniette Selvarclust software for variable selection in model-based clustering http://www.math.univ-toulouse.fr/~maugis/SelvarClustHomepage.html 2009b

L Meier S Van De Geer and P Buhlmann The group lasso for logistic regressionJournal of the Royal Statistical Society Series B (Statistical Methodology) 70(1)53ndash71 2008

N Meinshausen and P Buhlmann High-dimensional graphs and variable selection withthe lasso The Annals of Statistics 34(3)1436ndash1462 2006

B Moghaddam Y Weiss and S Avidan Generalized spectral bounds for sparse LDAIn Proceedings of the 23rd international conference on Machine learning pages 641ndash648 ACM 2006

B Moghaddam Y Weiss and S Avidan Fast pixelpart selection with sparse eigen-vectors In IEEE 11th International Conference on Computer Vision 2007 ICCV2007 pages 1ndash8 2007

Y Nesterov Gradient methods for minimizing composite functions preprint 2007

S Newcomb A generalized theory of the combination of observations so as to obtainthe best result American Journal of Mathematics 8(4)343ndash366 1886

B Ng and R Abugharbieh Generalized group sparse classifiers with application in fMRIbrain decoding In Computer Vision and Pattern Recognition (CVPR) 2011 IEEEConference on pages 1065ndash1071 IEEE 2011

M R Osborne B Presnell and B A Turlach On the lasso and its dual Journal ofComputational and Graphical statistics 9(2)319ndash337 2000a

M R Osborne B Presnell and B A Turlach A new approach to variable selection inleast squares problems IMA Journal of Numerical Analysis 20(3)389ndash403 2000b

W Pan and X Shen Penalized model-based clustering with application to variableselection Journal of Machine Learning Research 81145ndash1164 2007

W Pan X Shen A Jiang and R P Hebbel Semi-supervised learning via penalizedmixture model with application to microarray sample classification Bioinformatics22(19)2388ndash2395 2006

K Pearson Contributions to the mathematical theory of evolution Philosophical Trans-actions of the Royal Society of London 18571ndash110 1894

S Perkins K Lacker and J Theiler Grafting Fast incremental feature selection bygradient descent in function space Journal of Machine Learning Research 31333ndash1356 2003


Z Qiao L Zhou and J Huang Sparse linear discriminant analysis with applications tohigh dimensional low sample size data International Journal of Applied Mathematics39(1) 2009

A E Raftery and N Dean Variable selection for model-based clustering Journal ofthe American Statistical Association 101(473)168ndash178 2006

C R Rao The utilization of multiple measurements in problems of biological classi-fication Journal of the Royal Statistical Society Series B (Methodological) 10(2)159ndash203 1948

S Rosset and J Zhu Piecewise linear regularized solution paths The Annals of Statis-tics 35(3)1012ndash1030 2007

V Roth The generalized lasso IEEE Transactions on Neural Networks 15(1)16ndash282004

V Roth and B Fischer The group-lasso for generalized linear models uniqueness ofsolutions and efficient algorithms In W W Cohen A McCallum and S T Roweiseditors Machine Learning Proceedings of the Twenty-Fifth International Conference(ICML 2008) volume 307 of ACM International Conference Proceeding Series pages848ndash855 2008

V Roth and T Lange Feature selection in clustering problems In S Thrun L KSaul and B Scholkopf editors Advances in Neural Information Processing Systems16 pages 473ndash480 MIT Press 2004

C Sammut and G I Webb Encyclopedia of Machine Learning Springer-Verlag NewYork Inc 2010

L F Sanchez Merchante Y Grandvalet and G Govaert An efficient approach to sparselinear discriminant analysis In Proceedings of the 29th International Conference onMachine Learning ICML 2012

Gideon Schwarz Estimating the dimension of a model The annals of statistics 6(2)461ndash464 1978

A J Smola SVN Vishwanathan and Q Le Bundle methods for machine learningAdvances in Neural Information Processing Systems 201377ndash1384 2008

S Sonnenburg G Ratsch C Schafer and B Scholkopf Large scale multiple kernellearning Journal of Machine Learning Research 71531ndash1565 2006

P Sprechmann I Ramirez G Sapiro and Y Eldar Collaborative hierarchical sparsemodeling In Information Sciences and Systems (CISS) 2010 44th Annual Conferenceon pages 1ndash6 IEEE 2010

M Szafranski Pénalités Hiérarchiques pour l'Intégration de Connaissances dans les Modèles Statistiques PhD thesis Université de Technologie de Compiègne 2008


M Szafranski Y Grandvalet and P Morizet-Mahoudeaux Hierarchical penalizationAdvances in Neural Information Processing Systems 2008

R Tibshirani Regression shrinkage and selection via the lasso Journal of the RoyalStatistical Society Series B (Methodological) pages 267ndash288 1996

J E Vogt and V Roth The group-lasso l1 regularization versus l12 regularization InPattern Recognition 32-nd DAGM Symposium Lecture Notes in Computer Science2010

S Wang and J Zhu Variable selection for model-based high-dimensional clustering andits application to microarray data Biometrics 64(2)440ndash448 2008

D Witten and R Tibshirani Penalized classification using Fisherrsquos linear discriminantJournal of the Royal Statistical Society Series B (Statistical Methodology) 73(5)753ndash772 2011

D M Witten and R Tibshirani A framework for feature selection in clustering Journalof the American Statistical Association 105(490)713ndash726 2010

D M Witten R Tibshirani and T Hastie A penalized matrix decomposition withapplications to sparse principal components and canonical correlation analysis Bio-statistics 10(3)515ndash534 2009

M Wu and B Scholkopf A local learning approach for clustering Advances in NeuralInformation Processing Systems 191529 2007

MC Wu L Zhang Z Wang DC Christiani and X Lin Sparse linear discriminantanalysis for simultaneous testing for the significance of a gene setpathway and geneselection Bioinformatics 25(9)1145ndash1151 2009

T T Wu and K Lange Coordinate descent algorithms for lasso penalized regressionThe Annals of Applied Statistics pages 224ndash244 2008

B Xie W Pan and X Shen Penalized model-based clustering with cluster-specificdiagonal covariance matrices and grouped variables Electronic Journal of Statistics2168ndash172 2008a

B Xie W Pan and X Shen Variable selection in penalized model-based clustering viaregularization on grouped parameters Biometrics 64(3)921ndash930 2008b

C Yang X Wan Q Yang H Xue and W Yu Identifying main effects and epistaticinteractions from large-scale snp data via adaptive group lasso BMC bioinformatics11(Suppl 1)S18 2010

J Ye Least squares linear discriminant analysis In Proceedings of the 24th internationalconference on Machine learning pages 1087ndash1093 ACM 2007


M Yuan and Y Lin Model selection and estimation in regression with grouped variablesJournal of the Royal Statistical Society Series B (Statistical Methodology) 68(1)49ndash67 2006

P Zhao and B Yu On model selection consistency of lasso Journal of Machine LearningResearch 7(2)2541 2007

P Zhao G Rocha and B Yu The composite absolute penalties family for grouped andhierarchical variable selection The Annals of Statistics 37(6A)3468ndash3497 2009

H Zhou W Pan and X Shen Penalized model-based clustering with unconstrainedcovariance matrices Electronic Journal of Statistics 31473ndash1496 2009

H Zou The adaptive lasso and its oracle properties Journal of the American StatisticalAssociation 101(476)1418ndash1429 2006

H Zou and T Hastie Regularization and variable selection via the elastic net Journal ofthe Royal Statistical Society Series B (Statistical Methodology) 67(2)301ndash320 2005



Contents

List of Figures
List of Tables
Notation and Symbols

I  Context and Foundations

1  Context

2  Regularization for Feature Selection
   2.1  Motivations
   2.2  Categorization of Feature Selection Techniques
   2.3  Regularization
        2.3.1  Important Properties
        2.3.2  Pure Penalties
        2.3.3  Hybrid Penalties
        2.3.4  Mixed Penalties
        2.3.5  Sparsity Considerations
        2.3.6  Optimization Tools for Regularized Problems

II  Sparse Linear Discriminant Analysis

Abstract

3  Feature Selection in Fisher Discriminant Analysis
   3.1  Fisher Discriminant Analysis
   3.2  Feature Selection in LDA Problems
        3.2.1  Inertia Based
        3.2.2  Regression Based

4  Formalizing the Objective
   4.1  From Optimal Scoring to Linear Discriminant Analysis
        4.1.1  Penalized Optimal Scoring Problem
        4.1.2  Penalized Canonical Correlation Analysis
        4.1.3  Penalized Linear Discriminant Analysis
        4.1.4  Summary
   4.2  Practicalities
        4.2.1  Solution of the Penalized Optimal Scoring Regression
        4.2.2  Distance Evaluation
        4.2.3  Posterior Probability Evaluation
        4.2.4  Graphical Representation
   4.3  From Sparse Optimal Scoring to Sparse LDA
        4.3.1  A Quadratic Variational Form
        4.3.2  Group-Lasso OS as Penalized LDA

5  GLOSS Algorithm
   5.1  Regression Coefficients Updates
        5.1.1  Cholesky Decomposition
        5.1.2  Numerical Stability
   5.2  Score Matrix
   5.3  Optimality Conditions
   5.4  Active and Inactive Sets
   5.5  Penalty Parameter
   5.6  Options and Variants
        5.6.1  Scaling Variables
        5.6.2  Sparse Variant
        5.6.3  Diagonal Variant
        5.6.4  Elastic Net and Structured Variant

6  Experimental Results
   6.1  Normalization
   6.2  Decision Thresholds
   6.3  Simulated Data
   6.4  Gene Expression Data
   6.5  Correlated Data
   Discussion

III  Sparse Clustering Analysis

Abstract

7  Feature Selection in Mixture Models
   7.1  Mixture Models
        7.1.1  Model
        7.1.2  Parameter Estimation: The EM Algorithm
   7.2  Feature Selection in Model-Based Clustering
        7.2.1  Based on Penalized Likelihood
        7.2.2  Based on Model Variants
        7.2.3  Based on Model Selection

8  Theoretical Foundations
   8.1  Resolving EM with Optimal Scoring
        8.1.1  Relationship Between the M-Step and Linear Discriminant Analysis
        8.1.2  Relationship Between Optimal Scoring and Linear Discriminant Analysis
        8.1.3  Clustering Using Penalized Optimal Scoring
        8.1.4  From Sparse Optimal Scoring to Sparse Linear Discriminant Analysis
   8.2  Optimized Criterion
        8.2.1  A Bayesian Derivation
        8.2.2  Maximum a Posteriori Estimator

9  Mix-GLOSS Algorithm
   9.1  Mix-GLOSS
        9.1.1  Outer Loop: Whole Algorithm Repetitions
        9.1.2  Penalty Parameter Loop
        9.1.3  Inner Loop: EM Algorithm
   9.2  Model Selection

10  Experimental Results
   10.1  Tested Clustering Algorithms
   10.2  Results
   10.3  Discussion

Conclusions

Appendix

A  Matrix Properties
B  The Penalized-OS Problem is an Eigenvector Problem
   B.1  How to Solve the Eigenvector Decomposition
   B.2  Why the OS Problem is Solved as an Eigenvector Problem
C  Solving Fisher's Discriminant Problem
D  Alternative Variational Formulation for the Group-Lasso
   D.1  Useful Properties
   D.2  An Upper Bound on the Objective Function
E  Invariance of the Group-Lasso to Unitary Transformations
F  Expected Complete Likelihood and Likelihood
G  Derivation of the M-Step Equations
   G.1  Prior Probabilities
   G.2  Means
   G.3  Covariance Matrix

Bibliography

List of Figures

1.1  MASH project logo
2.1  Example of relevant features
2.2  Four key steps of feature selection
2.3  Admissible sets in two dimensions for different pure norms ||β||p
2.4  Two-dimensional regularized problems with ||β||1 and ||β||2 penalties
2.5  Admissible sets for the Lasso and Group-Lasso
2.6  Sparsity patterns for an example with 8 variables characterized by 4 parameters
4.1  Graphical representation of the variational approach to the Group-Lasso
5.1  GLOSS block diagram
5.2  Graph and Laplacian matrix for a 3×3 image
6.1  TPR versus FPR for all simulations
6.2  2D representations of Nakayama and Sun datasets based on the two first discriminant vectors provided by GLOSS and SLDA
6.3  USPS digits "1" and "0"
6.4  Discriminant direction between digits "1" and "0"
6.5  Sparse discriminant direction between digits "1" and "0"
9.1  Mix-GLOSS loops scheme
9.2  Mix-GLOSS model selection diagram
10.1  Class mean vectors for each artificial simulation
10.2  TPR versus FPR for all simulations

List of Tables

6.1  Experimental results for simulated data, supervised classification
6.2  Average TPR and FPR for all simulations
6.3  Experimental results for gene expression data, supervised classification
10.1  Experimental results for simulated data, unsupervised clustering
10.2  Average TPR versus FPR for all clustering simulations

Notation and Symbols

Throughout this thesis, vectors are denoted by lowercase letters in bold font and matrices by uppercase letters in bold font. Unless otherwise stated, vectors are column vectors and parentheses are used to build line vectors from comma-separated lists of scalars, or to build matrices from comma-separated lists of column vectors.

Sets
N  the set of natural numbers, N = {1, 2, ...}
R  the set of reals
|A|  cardinality of a set A (for finite sets, the number of elements)
Ā  complement of set A

Data
X  input domain
x_i  input sample, x_i ∈ X
X  design matrix, X = (x_1^⊤, ..., x_n^⊤)^⊤
x^j  column j of X
y_i  class indicator of sample i
Y  indicator matrix, Y = (y_1^⊤, ..., y_n^⊤)^⊤
z  complete data, z = (x, y)
G_k  set of the indices of observations belonging to class k
n  number of examples
K  number of classes
p  dimension of X
i, j, k  indices running over N

Vectors, Matrices and Norms
0  vector with all entries equal to zero
1  vector with all entries equal to one
I  identity matrix
A^⊤  transpose of matrix A (ditto for vectors)
A^{-1}  inverse of matrix A
tr(A)  trace of matrix A
|A|  determinant of matrix A
diag(v)  diagonal matrix with v on the diagonal
‖v‖_1  L1 norm of vector v
‖v‖_2  L2 norm of vector v
‖A‖_F  Frobenius norm of matrix A

Probability
E[·]  expectation of a random variable
var[·]  variance of a random variable
N(μ, σ²)  normal distribution with mean μ and variance σ²
W(W, ν)  Wishart distribution with ν degrees of freedom and scale matrix W
H(X)  entropy of random variable X
I(X; Y)  mutual information between random variables X and Y

Mixture Models
y_ik  hard membership of sample i to cluster k
f_k  distribution function for cluster k
t_ik  posterior probability of sample i to belong to cluster k
T  posterior probability matrix
π_k  prior probability or mixture proportion for cluster k
μ_k  mean vector of cluster k
Σ_k  covariance matrix of cluster k
θ_k  parameter vector for cluster k, θ_k = (μ_k, Σ_k)
θ^(t)  parameter vector at iteration t of the EM algorithm
f(X; θ)  likelihood function
L(θ; X)  log-likelihood function
L_C(θ; X, Y)  complete log-likelihood function

Optimization
J(·)  cost function
L(·)  Lagrangian
β̂  generic notation for the solution with respect to β
β^ls  least squares solution coefficient vector
A  active set
γ  step size to update the regularization path
h  direction to update the regularization path

Penalized models
λ, λ1, λ2  penalty parameters
P_λ(θ)  penalty term over a generic parameter vector
β_kj  coefficient j of discriminant vector k
β_k  kth discriminant vector, β_k = (β_k1, ..., β_kp)
B  matrix of discriminant vectors, B = (β_1, ..., β_{K−1})
β^j  jth row of B = (β^{1⊤}, ..., β^{p⊤})^⊤
B_LDA  coefficient matrix in the LDA domain
B_CCA  coefficient matrix in the CCA domain
B_OS  coefficient matrix in the OS domain
X_LDA  data matrix in the LDA domain
X_CCA  data matrix in the CCA domain
X_OS  data matrix in the OS domain
θ_k  score vector k
Θ  score matrix, Θ = (θ_1, ..., θ_{K−1})
Y  label matrix
Ω  penalty matrix
L_C^P(θ; X, Z)  penalized complete log-likelihood function
Σ_B  between-class covariance matrix
Σ_W  within-class covariance matrix
Σ_T  total covariance matrix
Σ̂_B  sample between-class covariance matrix
Σ̂_W  sample within-class covariance matrix
Σ̂_T  sample total covariance matrix
Λ  inverse of the covariance matrix, or precision matrix
w_j  weights
τ_j  penalty components of the variational approach

Part I

Context and Foundations


This thesis is divided in three parts. In Part I, I introduce the context in which this work has been developed, the project that funded it, and the constraints that we had to obey. Generic foundations are also detailed here, to introduce the models and some basic concepts that will be used along this document, and the related state of the art is also reviewed.

The first contribution of this thesis is explained in Part II, where I present the supervised learning algorithm GLOSS and its supporting theory, as well as some experiments to test its performance compared to other state-of-the-art mechanisms. Before describing the algorithm and the experiments, its theoretical foundations are provided.

The second contribution is described in Part III, with a structure analogous to that of Part II but for the unsupervised domain. The clustering algorithm Mix-GLOSS adapts the supervised technique from Part II by means of a modified EM algorithm. This part is also furnished with specific theoretical foundations, an experimental section and a final discussion.


1 Context

The MASH project is a research initiative to investigate the open and collaborative design of feature extractors for the Machine Learning scientific community. The project is structured around a web platform (http://mash-project.eu) comprising collaborative tools such as wiki documentation, forums, coding templates and an experiment center empowered with non-stop calculation servers. The applications targeted by MASH are vision and goal-planning problems, either in a 3D virtual environment or with a real robotic arm.

The MASH consortium is led by the IDIAP Research Institute in Switzerland. The other members are the University of Potsdam in Germany, the Czech Technical University of Prague, the National Institute for Research in Computer Science and Control (INRIA) in France, and the National Centre for Scientific Research (CNRS), also in France, through the laboratory of Heuristics and Diagnosis for Complex Systems (HEUDIASYC) attached to the University of Technology of Compiègne.

From the point of view of the research, the members of the consortium must deal with four main goals:

1 Software development of website framework and APIs

2 Classification and goal-planning in high dimensional feature spaces

3 Interfacing the platform with the 3D virtual environment and the robot arm

4 Building tools to assist contributors with the development of the feature extractorsand the configuration of the experiments

Figure 1.1: MASH project logo


The work detailed in this text has been done in the context of goal 4. From the very beginning of the project, our role has been to provide the users with some feedback regarding the feature extractors. At the moment of writing this thesis, the number of public feature extractors reaches 75. In addition to the public ones, there are also private extractors that contributors decide not to share with the rest of the community; the last number I was aware of was about 300. Within those 375 extractors, there must be some of them sharing the same theoretical principles or supplying similar features. The framework of the project tests every new piece of code with some datasets of reference in order to provide a ranking depending on the quality of the estimation. However, similar performance of two extractors for a particular dataset does not mean that both are using the same variables.

Our engagement was to provide some textual or graphical tools to discover which extractors compute features similar to other ones. Our hypothesis is that many of them use the same theoretical foundations, which should induce a grouping of similar extractors. If we succeed in discovering those groups, we would also be able to select representatives. This information can be used in several ways. For example, from the perspective of a user who develops feature extractors, it would be interesting to compare the performance of his code against the K representatives instead of the whole database. As another example, imagine a user wants to obtain the best prediction results for a particular dataset. Instead of selecting all the feature extractors, creating an extremely high dimensional space, he could select only the K representatives, foreseeing similar results with a faster experiment.

As there is no prior knowledge about the latent structure, we make use of unsupervised techniques. Below is a brief description of the different tools that we developed for the web platform.

• Clustering Using Mixture Models. This is a well-known technique that models the data as if it was randomly generated from a distribution function. This distribution is typically a mixture of Gaussians with unknown mixture proportions, means and covariance matrices. The number of Gaussian components matches the number of expected groups. The parameters of the model are computed using the EM algorithm, and the clusters are built by maximum a posteriori estimation. For the calculation we use mixmod, a C++ library that can be interfaced with Matlab. This library allows working with high dimensional data. Further information regarding mixmod is given by Biernacki et al. (2008). All details concerning the tool implemented are given in deliverable "mash-deliverable-D71-m12" (Govaert et al., 2010).

• Sparse Clustering Using Penalized Optimal Scoring. This technique intends again to perform clustering by modelling the data as a mixture of Gaussian distributions. However, instead of using a classic EM algorithm for estimating the components' parameters, the M-step is replaced by a penalized Optimal Scoring problem. This replacement induces sparsity, improving the robustness and the interpretability of the results. Its theory will be explained later in this thesis. All details concerning the tool implemented can be found in deliverable "mash-deliverable-D72-m24" (Govaert et al., 2011).

• Table Clustering Using The RV Coefficient. This technique applies clustering methods directly to the tables computed by the feature extractors, instead of creating a single matrix. A distance in the extractors space is defined using the RV coefficient, which is a multivariate generalization of Pearson's correlation coefficient in the form of an inner product; a small sketch of this computation follows the list. The distance is defined for every pair i and j as RV(Oi, Oj), where Oi and Oj are operators computed from the tables returned by feature extractors i and j. Once we have a distance matrix, several standard techniques may be used to group extractors. A detailed description of this technique can be found in deliverables "mash-deliverable-D71-m12" (Govaert et al., 2010) and "mash-deliverable-D72-m24" (Govaert et al., 2011).
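As an illustration of the last tool, here is a minimal sketch (with illustrative variable names, assuming the common definition of the RV coefficient based on operators Wi = Xi Xi^T computed from column-centered tables) of how the similarity between two feature tables can be computed; it is not the exact implementation used in the deliverables.

```python
# A minimal sketch of the RV coefficient between two tables X1 (n x p1) and
# X2 (n x p2) computed by two feature extractors on the same n samples.
# It assumes RV = tr(W1 W2) / sqrt(tr(W1 W1) tr(W2 W2)), with Wi = Xi Xi^T.
import numpy as np

def rv_coefficient(X1, X2):
    X1 = X1 - X1.mean(axis=0)            # column-center each table
    X2 = X2 - X2.mean(axis=0)
    W1, W2 = X1 @ X1.T, X2 @ X2.T        # n x n configuration operators
    return np.trace(W1 @ W2) / np.sqrt(np.trace(W1 @ W1) * np.trace(W2 @ W2))
```

The quantity 1 − RV(Oi, Oj) can then serve as a dissimilarity to feed any standard clustering method.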

I am not extending this section with further explanations about the MASH project or deeper details about the theory that we used to meet our engagements. I will simply refer to the public deliverables of the project, where everything is carefully detailed (Govaert et al., 2010, 2011).


2 Regularization for Feature Selection

With the advances in technology, data is becoming larger and larger, resulting in high dimensional ensembles of information. Genomics, textual indexation and medical images are some examples of data that can easily exceed thousands of dimensions. The first experiments aiming to cluster the data from the MASH project (see Chapter 1) intended to work with the whole dimensionality of the samples. As the number of feature extractors rose, the numerical issues also rose. Redundancy or extremely correlated features may happen if two contributors implement the same extractor with different names. When the number of features exceeded the number of samples, we started to deal with singular covariance matrices, whose inverses are not defined; many algorithms in the field of Machine Learning make use of this statistic.

2.1 Motivations

There is a quite recent effort in the direction of handling high dimensional data. Traditional techniques can be adapted, but quite often large dimensions render those techniques useless. Linear Discriminant Analysis was shown to be no better than a "random guessing" of the object labels when the dimension is larger than the sample size (Bickel and Levina, 2004; Fan and Fan, 2008).

As a rule of thumb, in discriminant and clustering problems, the computational complexity increases with the number of objects in the database, the number of features (dimensionality) and the number of classes or clusters. One way to reduce this complexity is to reduce the number of features. This reduction induces more robust estimators, allows faster learning and predictions in the supervised environments, and easier interpretations in the unsupervised framework. Removing features must be done wisely to avoid removing critical information.

When talking about dimensionality reduction, there are two families of techniques that could induce confusion:

• Reduction by feature transformations summarizes the dataset with fewer dimensions by creating combinations of the original attributes. These techniques are less effective when there are many irrelevant attributes (noise). Principal Component Analysis or Independent Component Analysis are two popular examples.

• Reduction by feature selection removes irrelevant dimensions, preserving the integrity of the informative features from the original dataset. The problem comes out when there is a restriction in the number of variables to preserve, and discarding the exceeding dimensions leads to a loss of information. Prediction with feature selection is computationally cheaper because only relevant features are used, and the resulting models are easier to interpret. The Lasso operator is an example of this category.

Figure 2.1: Example of relevant features, from Chidlovskii and Lecerf (2008)

As a basic rule, we can use the reduction techniques by feature transformation when the majority of the features are relevant and when there is a lot of redundancy or correlation. On the contrary, feature selection techniques are useful when there are plenty of useless or noisy features (irrelevant information) that need to be filtered out. In the paper of Chidlovskii and Lecerf (2008) we find a great explanation about the difference between irrelevant and redundant features. The following two paragraphs are almost exact reproductions of their text.

"Irrelevant features are those which provide negligible distinguishing information. For example, if the objects are all dogs, cats or squirrels, and it is desired to classify each new animal into one of these three classes, the feature of color may be irrelevant if each of dogs, cats and squirrels have about the same distribution of brown, black and tan fur colors. In such a case, knowing that an input animal is brown provides negligible distinguishing information for classifying the animal as a cat, dog or squirrel. Features which are irrelevant for a given classification problem are not useful, and accordingly a feature that is irrelevant can be filtered out.

Redundant features are those which provide distinguishing information, but are cumulative to another feature or group of features that provide substantially the same distinguishing information. Using the previous example, consider illustrative "diet" and "domestication" features. Dogs and cats both have similar carnivorous diets, while squirrels consume nuts and so forth. Thus, the "diet" feature can efficiently distinguish squirrels from dogs and cats, although it provides little information to distinguish between dogs and cats. Dogs and cats are also both typically domesticated animals, while squirrels are wild animals. Thus, the "domestication" feature provides substantially the same information as the "diet" feature, namely distinguishing squirrels from dogs and cats but not distinguishing between dogs and cats. Thus, the "diet" and "domestication" features are cumulative, and one can identify one of these features as redundant, so as to be filtered out. However, unlike irrelevant features, care should be taken with redundant features to ensure that one retains enough of the redundant features to provide the relevant distinguishing information. In the foregoing example, one may wish to filter out either the "diet" feature or the "domestication" feature, but if one removes both the "diet" and the "domestication" features, then useful distinguishing information is lost."

Figure 2.2: The four key steps of feature selection, according to Liu and Yu (2005)

There are some tricks to build robust estimators when the number of features exceeds the number of samples. Ignoring some of the dependencies among variables and replacing the covariance matrix by a diagonal approximation are two of them. Another popular technique, and the one chosen in this thesis, is imposing regularity conditions.

2.2 Categorization of Feature Selection Techniques

Feature selection is one of the most frequent techniques for preprocessing data, in order to remove irrelevant, redundant or noisy features. Nevertheless, the risk of removing some informative dimensions is always there, thus the relevance of the remaining subset of features must be measured.

I am reproducing here the scheme that generalizes any feature selection process, as it is shown by Liu and Yu (2005). Figure 2.2 provides a very intuitive scheme with the four key steps in a feature selection algorithm.

The classification of those algorithms can respond to different criteria. Guyon and Elisseeff (2003) propose a check list that summarizes the steps that may be taken to solve a feature selection problem, guiding the user through several techniques. Liu and Yu (2005) propose a framework that integrates supervised and unsupervised feature selection algorithms through a categorizing framework. Both references are excellent reviews to characterize feature selection techniques according to their characteristics. I am proposing a framework inspired by these references that does not cover all the possibilities, but which gives a good summary of the existing ones.

• Depending on the type of integration with the machine learning algorithm, we have:

– Filter Models: the filter models work as a preprocessing step, using an independent evaluation criterion to select a subset of variables without assistance of the mining algorithm.

– Wrapper Models: the wrapper models require a classification or clustering algorithm and use its prediction performance to assess the relevance of the subset selection. The feature selection is done in the optimization block, while the feature subset evaluation is done in a different one. Therefore, the criterion to optimize and to evaluate may be different. Those algorithms are computationally expensive.

– Embedded Models: they perform variable selection inside the learning machine, with the selection being made at the training step. That means that there is only one criterion: the optimization and the evaluation are a single block, and the features are selected to optimize this unique criterion and do not need to be re-evaluated in a later phase. That makes them more efficient, since no validation or test process is needed for every variable subset investigated. However, they are less universal, because they are specific to the training process of a given mining algorithm.

• Depending on the feature searching technique:

– Complete: no subsets are missed from evaluation; this involves combinatorial searches.

– Sequential: features are added (forward searches) or removed (backward searches) one at a time.

– Random: the initial subset, or even subsequent subsets, are randomly chosen to escape local optima.

• Depending on the evaluation technique:

– Distance Measures: choosing the features that maximize the difference in separability, divergence or discrimination measures.

– Information Measures: choosing the features that maximize the information gain, that is, minimizing the posterior uncertainty.

– Dependency Measures: measuring the correlation between features.

– Consistency Measures: finding a minimum number of features that separate classes as consistently as the full set of features can.

– Predictive Accuracy: use the selected features to predict the labels.

– Cluster Goodness: use the selected features to perform clustering and evaluate the result (cluster compactness, scatter separability, maximum likelihood).

The distance, information, correlation and consistency measures are typical of variable ranking algorithms, commonly used in filter models. Predictive accuracy and cluster goodness allow evaluating subsets of features, and can be used in wrapper and embedded models.
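To make the filter paradigm concrete, the sketch below (purely illustrative, not taken from the thesis) ranks features with a simple dependency measure, the absolute Pearson correlation between each feature and the class labels, and keeps the top k of them.

```python
# A minimal sketch of a filter-type selector: rank the columns of X by their
# absolute Pearson correlation with the labels y and keep the k best ones.
import numpy as np

def correlation_filter(X, y, k):
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = Xc.T @ yc / (np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum()))
    return np.argsort(-np.abs(corr))[:k]    # indices of the selected features
```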

In this thesis, we developed some algorithms following the embedded paradigm, either in the supervised or the unsupervised framework. Integrating the subset selection problem in the overall learning problem may be computationally demanding, but it is appealing from a conceptual viewpoint: there is a perfect match between the formalized goal and the process dedicated to achieve this goal, thus avoiding many problems arising in filter or wrapper methods. Practically, it is however intractable to solve exactly hard selection problems when the number of features exceeds a few tens. Regularization techniques allow to provide a sensible approximate answer to the selection problem with reasonable computing resources, and their recent study has demonstrated powerful theoretical and empirical results. The following section introduces the tools that will be employed in Parts II and III.

2.3 Regularization

In the machine learning domain, the term "regularization" refers to a technique that introduces some extra assumptions or knowledge in the resolution of an optimization problem. The most popular point of view presents regularization as a mechanism to prevent overfitting, but it can also help to fix some numerical issues in ill-posed problems (like some matrix singularities when solving a linear system), besides other interesting properties like the capacity to induce sparsity, thus producing models that are easier to interpret.

An ill-posed problem violates the rules defined by Jacques Hadamard, according to whom the solution to a mathematical problem has to exist, be unique and stable; for example, when the number of samples is smaller than their dimensionality and we try to infer some generic laws from such a low sample of the population. Regularization transforms an ill-posed problem into a well-posed one. To do that, some a priori knowledge is introduced in the solution through a regularization term that penalizes a criterion J with a penalty P. Below are the two most popular formulations:

\[
\min_{\beta} \; J(\beta) + \lambda P(\beta) \tag{2.1}
\]
\[
\min_{\beta} \; J(\beta) \quad \text{s.t.} \quad P(\beta) \le t \; . \tag{2.2}
\]

In the expressions (2.1) and (2.2), the parameters λ and t have a similar function, that is, to control the trade-off between fitting the data to the model according to J(β) and the effect of the penalty P(β). The set such that the constraint in (2.2) is verified, {β : P(β) ≤ t}, is called the admissible set. This penalty term can also be understood as a measure that quantifies the complexity of the model (as in the definition of Sammut and Webb, 2010). Note that regularization terms can also be interpreted in the Bayesian paradigm as prior distributions on the parameters of the model. In this thesis both views will be taken.

In this section I am reviewing pure, mixed and hybrid penalties that will be used in the following chapters to implement feature selection. I first list important properties that may pertain to any type of penalty.

Figure 2.3: Admissible sets in two dimensions for different pure norms ||β||p

2.3.1 Important Properties

Penalties may have different properties, which can be more or less interesting depending on the problem and the expected solution. The most important properties for our purposes here are convexity, sparsity and stability.

Convexity. Regarding optimization, convexity is a desirable property that eases finding global solutions. A convex function verifies

\[
\forall (\mathbf{x}_1, \mathbf{x}_2) \in \mathcal{X}^2, \quad f(t\mathbf{x}_1 + (1-t)\mathbf{x}_2) \le t f(\mathbf{x}_1) + (1-t) f(\mathbf{x}_2) \tag{2.3}
\]

for any value of t ∈ [0, 1]. Replacing the inequality by a strict inequality, we obtain the definition of strict convexity. A regularized expression like (2.2) is convex if the function J(β) and the penalty P(β) are both convex.

Sparsity. Usually, null coefficients furnish models that are easier to interpret. When sparsity does not harm the quality of the predictions, it is a desirable property, which moreover entails less memory usage and computation resources.

Stability. There are numerous notions of stability or robustness, which measure how the solution varies when the input is perturbed by small changes. This perturbation can be adding, removing or replacing a few elements in the training set. Adding regularization, in addition to preventing overfitting, is a means to favor the stability of the solution.

2.3.2 Pure Penalties

For pure penalties, defined as P(β) = ||β||p, convexity holds for p ≥ 1. This is graphically illustrated in Figure 2.3, borrowed from Szafranski (2008), whose Chapter 3 is an excellent review of regularization techniques and of the algorithms to solve them. In this figure, the shape of the admissible sets corresponding to different pure penalties is greyed out. Since the convexity of the penalty corresponds to the convexity of the set, we see that this property is verified for p ≥ 1.

Figure 2.4: Two-dimensional regularized problems with ||β||1 and ||β||2 penalties

Regularizing a linear model with a norm like ||β||p means that the larger the component |βj|, the more important the feature xj in the estimation. On the contrary, the closer it is to zero, the more dispensable it is; in the limit of |βj| = 0, xj is not involved in the model. If many dimensions can be dismissed, then we can speak of sparsity.

A graphical interpretation of sparsity, borrowed from Marie Szafranski, is given in Figure 2.4. In a 2D problem, a solution can be considered as sparse if any of its components (β1 or β2) is null, that is, if the optimal β is located on one of the coordinate axes. Let us consider a search algorithm that minimizes an expression like (2.2), where J(β) is a quadratic function. When the solution to the unconstrained problem does not belong to the admissible set defined by P(β) (greyed out area), the solution to the constrained problem is as close as possible to the global minimum of the cost function inside the grey region. Depending on the shape of this region, the probability of having a sparse solution varies. A region with vertexes, as the one corresponding to an L1 penalty, has more chances of inducing sparse solutions than the one of an L2 penalty. That idea is displayed in Figure 2.4, where J(β) is a quadratic function represented with three isolevel curves, whose global minimum βls is outside the penalties' admissible region. The closest point to this βls for the L1 regularization is βl1, and for the L2 regularization it is βl2. Solution βl1 is sparse, because its second component is zero, while both components of βl2 are different from zero.

After reviewing the regions from Figure 2.3, we can relate the capacity of generating sparse solutions to the quantity and the "sharpness" of the vertexes of the greyed out area. For example, an L1/3 penalty has a support region with sharper vertexes that would induce a sparse solution even more strongly than an L1 penalty; however, the non-convex shape of the L1/3 ball results in difficulties during optimization that will not happen with a convex shape.


To summarize, a convex problem with a sparse solution is desired. But with pure penalties, sparsity is only possible with Lp norms with p ≤ 1, due to the fact that they are the only ones that have vertexes. On the other side, only norms with p ≥ 1 are convex; hence the only pure penalty that builds a convex problem with a sparse solution is the L1 penalty.

L0 Penalties. The L0 pseudo-norm of a vector β is defined as the number of entries different from zero, that is, P(β) = ||β||0 = card{βj | βj ≠ 0}:

\[
\min_{\beta} \; J(\beta) \quad \text{s.t.} \quad \|\beta\|_0 \le t \; , \tag{2.4}
\]

where the parameter t represents the maximum number of non-zero coefficients in vector β. The larger the value of t (or the lower the value of λ, if we use the equivalent expression in (2.1)), the fewer the number of zeros induced in vector β. If t is equal to the dimensionality of the problem (or if λ = 0), then the penalty term is not effective and β is not altered. In general, the computation of the solutions relies on combinatorial optimization schemes. Their solutions are sparse but unstable.
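To make the combinatorial nature of (2.4) concrete, here is a minimal brute-force sketch under the assumption that J is a least-squares loss (an illustrative choice, not tied to the rest of the chapter); it is only feasible for a very small number of features, which is precisely the point.

```python
# A minimal brute-force illustration of the L0-constrained problem (2.4):
# enumerate every subset of at most t variables and keep the best least-squares fit.
from itertools import combinations
import numpy as np

def best_subset(X, y, t):
    p = X.shape[1]
    best_vars, best_err = (), np.inf
    for size in range(1, t + 1):
        for subset in combinations(range(p), size):
            Xs = X[:, subset]
            beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
            err = ((y - Xs @ beta) ** 2).sum()
            if err < best_err:
                best_vars, best_err = subset, err
    return best_vars   # indices of the selected variables
```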

L1 Penalties. Penalties built using the L1 norm induce sparsity and stability. The L1-penalized regression problem has been named the Lasso (Least Absolute Shrinkage and Selection Operator) by Tibshirani (1996):

min_β J(β)   s.t.   Σ_{j=1}^p |β_j| ≤ t .                    (2.5)

Despite all the advantages of the Lasso, choosing the right penalty is not just a question of convexity and sparsity. For example, concerning the Lasso, Osborne et al. (2000a) have shown that when the number of examples n is lower than the number of variables p, the maximum number of non-zero entries of β is n. Therefore, if there is a strong correlation between several variables, this penalty risks dismissing all but one of them, resulting in a hardly interpretable model. In a field like genomics, where n is typically some tens of individuals and p several thousands of genes, the performance of the algorithm and the interpretability of the genetic relationships are severely limited.

The Lasso is a popular tool that has been used in multiple contexts besides regression, particularly for feature selection in supervised classification (Mai et al., 2012; Witten and Tibshirani, 2011) and clustering (Roth and Lange, 2004; Pan et al., 2006; Pan and Shen, 2007; Zhou et al., 2009; Guo et al., 2010; Witten and Tibshirani, 2010; Bouveyron and Brunet, 2012b,a).

The consistency of problems regularized by a Lasso penalty is also a key feature. Defining consistency as the capability of always making the right choice of relevant variables when the number of individuals is infinitely large, Leng et al. (2006) have shown that when the penalty parameter (t or λ, depending on the formulation) is chosen by minimization of the prediction error, the Lasso penalty does not lead to consistent models. There is a large bibliography defining conditions under which Lasso estimators become consistent (Knight and Fu, 2000; Donoho et al., 2006; Meinshausen and Bühlmann, 2006; Zhao and Yu, 2007; Bach, 2008). In addition to those papers, some authors have introduced modifications to improve the interpretability and the consistency of the Lasso, such as the adaptive Lasso (Zou, 2006).

L2 Penalties. The graphical interpretation of pure norm penalties in Figure 2.3 shows that this norm does not induce sparsity, due to its lack of vertexes. Strictly speaking, the L2 norm involves the square root of the sum of all squared components. In practice, when using L2 penalties, the square of the norm is used to avoid the square root and to solve a linear system. Thus, an L2-penalized optimization problem looks like

min_β J(β) + λ‖β‖²_2 .                    (2.6)

The effect of this penalty is the "equalization" of the components of the penalized parameter. To illustrate this property, let us consider a least squares problem,

min_β Σ_{i=1}^n (y_i − x_i^⊤β)² ,                    (2.7)

with solution β_ls = (X^⊤X)^{-1}X^⊤y. If some input variables are highly correlated, the estimator β_ls is very unstable. To fix this numerical instability, Hoerl and Kennard (1970) proposed ridge regression, which regularizes Problem (2.7) with a quadratic penalty:

min_β Σ_{i=1}^n (y_i − x_i^⊤β)² + λ Σ_{j=1}^p β_j² .

The solution to this problem is β_l2 = (X^⊤X + λI_p)^{-1}X^⊤y. All eigenvalues, in particular the small ones corresponding to the correlated dimensions, are now moved upwards by λ. This can be enough to avoid the instability induced by small eigenvalues. This "equalization" of the coefficients reduces the variability of the estimation, which may improve performances.
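The following minimal NumPy sketch, added here for illustration (variable names are arbitrary and not part of the original text), shows the eigenvalue shift and compares the ordinary and ridge estimates on artificially correlated inputs:

import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 50, 5, 1.0

# Build correlated inputs: the last column nearly duplicates the first one.
X = rng.normal(size=(n, p))
X[:, -1] = X[:, 0] + 0.01 * rng.normal(size=n)
y = X @ np.ones(p) + 0.1 * rng.normal(size=n)

gram = X.T @ X
print(np.linalg.eigvalsh(gram))                     # smallest eigenvalue close to zero
print(np.linalg.eigvalsh(gram + lam * np.eye(p)))   # all eigenvalues shifted upwards by lambda

beta_ls = np.linalg.lstsq(X, y, rcond=None)[0]              # unstable least squares estimate
beta_l2 = np.linalg.solve(gram + lam * np.eye(p), X.T @ y)  # ridge estimate, better conditioned
print(beta_ls, beta_l2)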

As with the Lasso operator, there are several variations of ridge regression. For example, Breiman (1995) proposed the nonnegative garrotte, which looks like a ridge regression where each variable is penalized adaptively. To do so, the least squares solution is used to define the penalty parameter attached to each coefficient:

min_β Σ_{i=1}^n (y_i − x_i^⊤β)² + λ Σ_{j=1}^p β_j² / (β_j^ls)² .                    (2.8)

The effect is an elliptic admissible set instead of the ball of ridge regression. Another example is the adaptive ridge regression (Grandvalet, 1998; Grandvalet and Canu, 2002), where the penalty parameter differs on each component: every λ_j is optimized to penalize more or less depending on the influence of β_j in the model.

Although L2-penalized problems are stable, they are not sparse, which makes those models harder to interpret, mainly in high dimensions.

L∞ Penalties. A special case of Lp norms is the infinity norm, defined as ‖x‖_∞ = max(|x_1|, |x_2|, …, |x_p|). The admissible region for a penalty like ‖β‖_∞ ≤ t is displayed in Figure 2.3: the greyed out region is a square containing all the β vectors whose largest coefficient is less than or equal to the value of the penalty parameter t.

This norm is not commonly used as a regularization term itself; however, it frequently appears in mixed penalties, as shown in Section 2.3.4. In addition, in the optimization of penalized problems, there exists the concept of dual norms. Dual norms arise in the analysis of estimation bounds and in the design of algorithms that address optimization problems by solving an increasing sequence of small subproblems (working set algorithms). The dual norm plays a direct role in computing optimality conditions of sparse regularized problems. The dual norm ‖β‖_* of a norm ‖·‖ is defined as

‖β‖_* = max_{w∈R^p} β^⊤w   s.t.   ‖w‖ ≤ 1 .

In the case of an Lq norm with q ∈ [1, +∞], the dual norm is the Lr norm such that 1/q + 1/r = 1. For example, the L2 norm is self-dual and the dual norm of the L1 norm is the L∞ norm. This is one of the reasons why L∞ is so important even if it is not as popular as a penalty itself, because L1 is. An extensive explanation about dual norms and the algorithms that make use of them can be found in Bach et al. (2011).
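As a small check of this definition, added here for illustration, the sketch below evaluates the dual norm of the L1 norm; the maximum of β^⊤w over the L1 unit ball is attained at one of its vertexes, which recovers the L∞ norm:

import numpy as np

rng = np.random.default_rng(1)
beta = rng.normal(size=6)

# Dual norm of the L1 norm, by definition: max of beta @ w subject to ||w||_1 <= 1.
# The maximum is attained at a vertex of the L1 ball, i.e. on a single coordinate.
i_star = np.argmax(np.abs(beta))
w_star = np.zeros_like(beta)
w_star[i_star] = np.sign(beta[i_star])   # vertex of the L1 ball aligned with beta

print(beta @ w_star)                     # value of the maximum ...
print(np.abs(beta).max())                # ... equals ||beta||_inf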

2.3.3 Hybrid Penalties

There is no reason to use pure penalties in isolation: we can combine them and try to obtain different benefits from each of them. The most popular example is the Elastic net regularization (Zou and Hastie, 2005), whose objective is to improve the Lasso penalization when n ≤ p. As recalled in Section 2.3.2, when n ≤ p the Lasso penalty can select at most n non-null features. Thus, in situations where there are more relevant variables, the Lasso penalty risks selecting only some of them. To avoid this effect, a combination of L1 and L2 penalties has been proposed. For the least squares example (2.7) of Section 2.3.2, the Elastic net is

min_β Σ_{i=1}^n (y_i − x_i^⊤β)² + λ_1 Σ_{j=1}^p |β_j| + λ_2 Σ_{j=1}^p β_j² .                    (2.9)

The term in λ_1 is a Lasso penalty that induces sparsity in vector β; the term in λ_2 is a ridge regression penalty that provides universal strong consistency (De Mol et al., 2009), that is, the asymptotic capability (when n goes to infinity) of always making the right choice of relevant variables.


2.3.4 Mixed Penalties

Imagine a linear regression problem where each variable is a gene. Depending on the application, several biological processes can be identified by L different groups of genes. Let us denote by G_ℓ the group of genes of the ℓth process and by d_ℓ the number of genes (variables) in each group, for all ℓ ∈ {1, …, L}. Thus, the dimension of vector β is the sum of the number of genes of every group, dim(β) = Σ_{ℓ=1}^L d_ℓ. Mixed norms are norms that take those groups into consideration. Their general expression is shown below:

‖β‖_(r,s) = ( Σ_ℓ ( Σ_{j∈G_ℓ} |β_j|^s )^{r/s} )^{1/r} .                    (2.10)

The pair (r, s) identifies the norms that are combined: an Ls norm within groups and an Lr norm between groups. The Ls norm penalizes the variables in every group G_ℓ, while the Lr norm penalizes the within-group norms. The pair (r, s) is set so as to induce different properties in the resulting β vector. Note that the outer norm is often weighted to adjust for the different cardinalities of the groups, in order to avoid favoring the selection of the largest groups.
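A short sketch of how the mixed norm (2.10) can be evaluated is given below; it is an illustrative addition, with the function name and the group structure chosen arbitrarily. With singleton groups the (1,2) mixed norm reduces to the L1 norm, which serves as a sanity check:

import numpy as np

def mixed_norm(beta, groups, r, s):
    """Mixed (r, s) norm of eq. (2.10): an Ls norm within groups, an Lr norm between groups."""
    within = np.array([np.sum(np.abs(beta[g]) ** s) ** (1.0 / s) for g in groups])
    return np.sum(within ** r) ** (1.0 / r)

beta = np.array([0.0, 0.0, 0.5, -1.5, 2.0, 0.0])
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]
print(mixed_norm(beta, groups, r=1, s=2))      # ||beta||_(1,2): the group-Lasso norm

singletons = [np.array([j]) for j in range(beta.size)]
print(mixed_norm(beta, singletons, r=1, s=2))  # with singleton groups, this is the L1 norm
print(np.abs(beta).sum())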

Several combinations are available; the most popular is the norm ‖β‖_(1,2), known as the group-Lasso (Yuan and Lin, 2006; Leng, 2008; Xie et al., 2008a,b; Meier et al., 2008; Roth and Fischer, 2008; Yang et al., 2010; Sanchez Merchante et al., 2012). Figure 2.5 shows the difference between the admissible sets of a pure L1 norm and a mixed L(1,2) norm. Many other mixings are possible, such as ‖β‖_(1,4/3) (Szafranski et al., 2008) or ‖β‖_(1,∞) (Wang and Zhu, 2008; Kuan et al., 2010; Vogt and Roth, 2010). Modifications of mixed norms have also been proposed, such as the group bridge penalty (Huang et al., 2009), the composite absolute penalties (Zhao et al., 2009), or combinations of mixed and pure norms, such as Lasso and group-Lasso (Friedman et al., 2010; Sprechmann et al., 2010) or group-Lasso and ridge penalty (Ng and Abugharbieh, 2011).

2.3.5 Sparsity Considerations

In this chapter I have reviewed several possibilities for inducing sparsity in the solution of optimization problems. However, having sparse solutions does not always lead to parsimonious models feature-wise. For example, if we have four parameters per feature, we look for solutions where all four parameters are null for non-informative variables.

The Lasso and the other L1 penalties encourage solutions such as the one on the left of Figure 2.6. If the objective is sparsity, then the L1 norm does the job. However, if we aim at feature selection and the number of parameters per variable exceeds one, this type of sparsity does not target the removal of variables.

To be able to dismiss some features, the sparsity pattern must encourage null values for the same variable across parameters, as shown on the right of Figure 2.6. This can be achieved with mixed penalties that define groups of features. For example, L(1,2) or L(1,∞) mixed norms with the proper definition of groups can induce sparsity patterns such as


Figure 2.5: Admissible sets for the Lasso (a: L1) and the group-Lasso (b: L(1,2)).

Figure 2.6: Sparsity patterns for an example with 8 variables characterized by 4 parameters: (a) L1-induced sparsity, (b) L(1,2) group-induced sparsity.


the one on the right of Figure 2.6, which displays a solution where variables 3, 5 and 8 are removed.

2.3.6 Optimization Tools for Regularized Problems

Caramanis et al. (2012) provide a good collection of mathematical techniques and optimization methods for solving regularized problems. Another good reference is the thesis of Szafranski (2008), which also reviews some techniques classified in four categories. Those techniques, even if they belong to different categories, can be used separately or combined to produce improved optimization algorithms.

In fact, the algorithm implemented in this thesis is inspired by three of those techniques. It could be described as an "active constraints" algorithm implemented along a regularization path, where the cost function is approached with secant hyper-planes. Deeper details are given in the dedicated Chapter 5.

Subgradient Descent. Subgradient descent is a generic optimization method that can be used for penalized problems where the subgradient of the loss function, ∂J(β), and the subgradient of the regularizer, ∂P(β), can be computed efficiently. On the one hand, it is essentially blind to the problem structure; on the other hand, many iterations are needed, so convergence is slow and the solutions are not sparse. Basically, it is a generalization of the iterative gradient descent algorithm, where the solution vector β^(t+1) is updated proportionally to the negative subgradient of the function at the current point β^(t):

β^(t+1) = β^(t) − α(s + λs′) ,   where s ∈ ∂J(β^(t)), s′ ∈ ∂P(β^(t)) .
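The sketch below, an illustrative addition, applies this generic update to a squared loss with an L1 regularizer; the step size and iteration count are arbitrary choices, and the iterates are small but rarely exactly zero, as noted above:

import numpy as np

def subgradient_descent_lasso(X, y, lam, alpha=1e-3, n_iter=5000):
    """Plain subgradient descent for ||y - X beta||^2 with an L1 regularizer."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        s = -2.0 * X.T @ (y - X @ beta)   # gradient of the quadratic loss
        s_prime = np.sign(beta)           # a subgradient of ||beta||_1 (0 at beta_j = 0)
        beta -= alpha * (s + lam * s_prime)
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = X[:, :3] @ np.array([2.0, -1.0, 1.5]) + 0.1 * rng.normal(size=100)
print(np.round(subgradient_descent_lasso(X, y, lam=5.0), 3))  # small values, not exact zeros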

Coordinate Descent. Coordinate descent is based on the first-order optimality conditions of criterion (2.1). In the case of penalties like the Lasso, setting to zero the first-order derivative with respect to coefficient β_j gives

β_j = ( −λ sign(β_j) − ∂J(β)/∂β_j ) / ( 2 Σ_{i=1}^n x_ij² ) .

In the literature, those algorithms are also referred to as "iterative thresholding" algorithms, because the optimization can be solved by soft-thresholding in an iterative process. As an example, Fu (1998) implements this technique, initializing every coefficient with the least squares solution β_ls and updating their values by iterative thresholding, with β_j^(t+1) = S_λ(∂J(β^(t))/∂β_j). The objective function is optimized with respect to one variable at a time while all others are kept fixed:

S_λ( ∂J(β)/∂β_j ) =
  ( λ − ∂J(β)/∂β_j ) / ( 2 Σ_{i=1}^n x_ij² )     if ∂J(β)/∂β_j > λ ,
  ( −λ − ∂J(β)/∂β_j ) / ( 2 Σ_{i=1}^n x_ij² )    if ∂J(β)/∂β_j < −λ ,
  0                                              if |∂J(β)/∂β_j| ≤ λ .                    (2.11)

The same principles define "block-coordinate descent" algorithms; in that case, the first-order derivatives are applied to the equations of a group-Lasso penalty (Yuan and Lin, 2006; Wu and Lange, 2008).
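As an illustration of the single-coordinate update with soft-thresholding, here is a minimal cyclic coordinate descent sketch for the Lasso with a squared loss; it is an addition to the text with illustrative names, not the exact implementation of Fu (1998):

import numpy as np

def soft_threshold(z, gamma):
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def coordinate_descent_lasso(X, y, lam, n_sweeps=200):
    """Cyclic coordinate descent for ||y - X beta||^2 + lam * ||beta||_1 (cf. eq. 2.11)."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)          # sum_i x_ij^2 for each column j
    residual = y.copy()
    for _ in range(n_sweeps):
        for j in range(p):
            residual += X[:, j] * beta[j]  # remove the current contribution of feature j
            rho = X[:, j] @ residual       # correlation with the partial residual
            beta[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
            residual -= X[:, j] * beta[j]  # add back the updated contribution
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = X[:, :3] @ np.array([2.0, -1.0, 1.5]) + 0.1 * rng.normal(size=100)
print(np.round(coordinate_descent_lasso(X, y, lam=5.0), 3))   # exact zeros on irrelevant features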

Active and Inactive Sets. Active set algorithms are also referred to as "active constraints" or "working set" methods. These algorithms define a subset of variables called the "active set", usually denoted A, which stores the indices of the variables with non-zero β_j. Its complement, the "inactive set" Ā, contains the indices of the variables whose β_j is zero. Thus, the problem can be reduced to the dimensionality of A.

Osborne et al. (2000a) proposed the first of those algorithms to solve quadratic problems with Lasso penalties. Their algorithm starts from an empty active set that is updated incrementally (forward growing). There also exists a backward view, where relevant variables are allowed to leave the active set; however, the forward philosophy, which starts with an empty A, has the advantage that the first calculations are low dimensional. In addition, the forward view better fits the feature selection intuition, where few features are intended to be selected.

Working set algorithms have to deal with three main tasks. There is an optimization task, where a minimization problem has to be solved using only the variables from the active set; Osborne et al. (2000a) solve a linear approximation of the original problem to determine the descent direction of the objective function, but any other method can be considered. In general, as the solutions of successive active sets are typically close to each other, it is a good idea to use the solution of the previous iteration as the initialization of the current one (warm start). Besides the optimization task, there is a working set update task, where the active set A is augmented with the variable from the inactive set Ā that most violates the optimality conditions of Problem (2.1). Finally, there is also a task to compute the optimality conditions; their expressions are essential for selecting the next variable to add to the active set and for testing whether a particular vector β is a solution of Problem (2.1).

These active constraints or working set methods, even if they were originally proposed to solve L1-regularized quadratic problems, can also be adapted to generic functions and penalties: for example, linear functions and L1 penalties (Roth, 2004), linear functions and L(1,2) penalties (Roth and Fischer, 2008), or even logarithmic cost functions and combinations of L0, L1 and L2 penalties (Perkins et al., 2003). The algorithm developed in this work belongs to this family of solutions.

Hyper-Planes Approximation. Hyper-plane approximations solve a regularized problem using a piecewise linear approximation of the original cost function. This convex approximation is built from several secant hyper-planes at different points, obtained from the subgradient of the cost function at these points.

This family of algorithms implements an iterative mechanism where the number of hyper-planes increases at every iteration. These techniques are useful with large populations, since the number of iterations needed to converge does not depend on the size of the dataset. On the contrary, if few hyper-planes are used, then the quality of the approximation is not good enough and the solution can be unstable.

This family of algorithms is not as popular as the previous one, but some examples can be found in the domain of Support Vector Machines (Joachims, 2006; Smola et al., 2008; Franc and Sonnenburg, 2008) or Multiple Kernel Learning (Sonnenburg et al., 2006).

Regularization Path. The regularization path is the set of solutions that can be reached when solving a series of optimization problems of the form (2.1) where the penalty parameter λ is varied. It is not an optimization technique per se, but it is of practical use when the exact regularization path can be easily followed. Rosset and Zhu (2007) stated that this path is piecewise linear for those problems where the cost function is piecewise quadratic and the regularization term is piecewise linear (or vice-versa).

This concept was first applied to the Lasso algorithm of Osborne et al. (2000b). However, it is after the publication of the Least Angle Regression algorithm (LARS), developed by Efron et al. (2004), that those techniques became popular. LARS defines the regularization path using active constraint techniques.

Once an active set A^(t) and its corresponding solution β^(t) have been set, looking for the regularization path means looking for a direction h and a step size γ to update the solution as β^(t+1) = β^(t) + γh. Afterwards, the active and inactive sets A^(t+1) and Ā^(t+1) are updated; that can be done by looking for the variables that most strongly violate the optimality conditions. Hence, LARS sets the update step size, and the variable that should enter the active set, from the correlation with the residuals.

Proximal Methods. Proximal methods optimize an objective function of the form (2.1), resulting from the addition of a Lipschitz-differentiable cost function J(β) and a non-differentiable penalty λP(β):

min_{β∈R^p} J(β^(t)) + ∇J(β^(t))^⊤(β − β^(t)) + λP(β) + (L/2) ‖β − β^(t)‖²_2 .                    (2.12)

They are iterative methods where the cost function J(β) is linearized in the proximity of the current solution β^(t), so that the problem to solve at each iteration looks like (2.12), where the parameter L > 0 should be an upper bound on the Lipschitz constant of the gradient ∇J. This can be rewritten as

min_{β∈R^p} (1/2) ‖β − (β^(t) − (1/L)∇J(β^(t)))‖²_2 + (λ/L) P(β) .                    (2.13)

The basic algorithm uses the solution of (2.13) as the next iterate β^(t+1). However, there are faster versions that take advantage of information from previous steps, such as the ones described by Nesterov (2007) or the FISTA algorithm (Beck and Teboulle, 2009). Proximal methods can be seen as generalizations of gradient updates: indeed, setting λ = 0 in equation (2.13) recovers the standard gradient update rule.
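A minimal sketch of this basic proximal gradient iteration, added for illustration with a squared loss and an L1 penalty (names are arbitrary), is given below; the proximal operator of the L1 norm is the soft-thresholding:

import numpy as np

def ista_lasso(X, y, lam, n_iter=500):
    """Basic proximal gradient (ISTA-like) for 0.5 * ||y - X beta||^2 + lam * ||beta||_1."""
    L = np.linalg.norm(X, 2) ** 2       # Lipschitz constant of the gradient of the smooth part
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y)     # gradient of the differentiable term J
        z = beta - grad / L             # gradient step, cf. eq. (2.13)
        beta = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # prox of (lam/L) * L1 norm
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = X[:, :3] @ np.array([2.0, -1.0, 1.5]) + 0.1 * rng.normal(size=100)
print(np.round(ista_lasso(X, y, lam=2.0), 3))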


Part II

Sparse Linear Discriminant Analysis


Abstract

Linear discriminant analysis (LDA) aims to describe data by a linear combination of features that best separates the classes. It may be used for classifying future observations or for describing those classes.

There is a vast bibliography about sparse LDA methods, reviewed in Chapter 3. Sparsity is typically induced by regularizing the discriminant vectors or the class means with L1 penalties (see Section 2). Section 2.3.5 discussed why this sparsity-inducing penalty may not guarantee parsimonious models with respect to variables.

In this part we develop the group-Lasso Optimal Scoring Solver (GLOSS), which addresses a sparse LDA problem globally, through a regression approach to LDA. Our analysis, presented in Chapter 4, formally relates GLOSS to Fisher's discriminant analysis and also enables to derive variants, such as LDA assuming a diagonal within-class covariance structure (Bickel and Levina, 2004). The group-Lasso penalty selects the same features in all discriminant directions, leading to a more interpretable low-dimensional representation of data. The discriminant directions can be used in their totality, or the first ones may be chosen to produce a reduced-rank classification. The first two or three directions can also be used to project the data and generate a graphical display of the data. The algorithm is detailed in Chapter 5, and our experimental results of Chapter 6 demonstrate that, compared to the competing approaches, the models are extremely parsimonious without compromising prediction performances. The algorithm efficiently processes medium to large numbers of variables and is thus particularly well suited to the analysis of gene expression data.


3 Feature Selection in Fisher Discriminant Analysis

3.1 Fisher Discriminant Analysis

Linear discriminant analysis (LDA) aims to describe n labeled observations belonging to K groups by a linear combination of features which characterizes or separates the classes. It is used for two main purposes: classifying future observations, or describing the essential differences between classes, either by providing a visual representation of the data or by revealing the combinations of features that discriminate between classes. There are several frameworks in which such linear combinations can be derived; Friedman et al. (2009) dedicate a whole chapter to linear methods for classification. In this part we focus on Fisher's discriminant analysis, which is a standard tool for linear discriminant analysis whose formulation does not rely on posterior probabilities but rather on some inertia principles (Fisher, 1936).

We consider that the data consist of a set of n examples, with observations x_i ∈ R^p comprising p features, and labels y_i ∈ {0, 1}^K indicating the exclusive assignment of observation x_i to one of the K classes. It will be convenient to gather the observations in the n×p matrix X whose rows are x_i^⊤, and the corresponding labels in the n×K matrix Y whose rows are y_i^⊤.

Fisher's discriminant problem was first proposed for two-class problems, for the analysis of the famous iris dataset, as the maximization of the ratio of the projected between-class covariance to the projected within-class covariance:

max_{β∈R^p} ( β^⊤Σ_B β ) / ( β^⊤Σ_W β ) ,                    (3.1)

where β is the discriminant direction used to project the data, and Σ_B and Σ_W are the p×p between-class and within-class covariance matrices, respectively, defined (for a K-class problem) as

Σ_W = (1/n) Σ_{k=1}^K Σ_{i∈G_k} (x_i − μ_k)(x_i − μ_k)^⊤ ,

Σ_B = (1/n) Σ_{k=1}^K Σ_{i∈G_k} (μ − μ_k)(μ − μ_k)^⊤ ,

where μ is the sample mean of the whole dataset, μ_k the sample mean of class k, and G_k indexes the observations of class k.


This analysis can be extended to the multi-class framework with K groups. In this case, K−1 discriminant vectors β_k may be computed. Such a generalization was first proposed by Rao (1948). Several formulations of the multi-class Fisher's discriminant are available, for example as the maximization of a trace ratio:

max_{B∈R^{p×(K−1)}} tr(B^⊤Σ_B B) / tr(B^⊤Σ_W B) ,                    (3.2)

where the matrix B is built with the discriminant directions β_k as columns. Solving the multi-class criterion (3.2) is an ill-posed problem; a better formulation is based on a series of K−1 subproblems:

max_{β_k∈R^p} β_k^⊤Σ_B β_k
s.t. β_k^⊤Σ_W β_k ≤ 1 ,
     β_k^⊤Σ_W β_ℓ = 0 , ∀ℓ < k .                    (3.3)

The maximizer of subproblem k is the eigenvector of Σ_W^{-1}Σ_B associated to the kth largest eigenvalue (see Appendix C).
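The following NumPy sketch, added for illustration, computes Σ_W and Σ_B as defined above and extracts the K−1 leading eigenvectors of Σ_W^{-1}Σ_B on synthetic data; the function name and data-generating choices are illustrative, not part of the original text:

import numpy as np

def fisher_directions(X, y, n_classes):
    """Scatter matrices Sigma_W, Sigma_B and the K-1 Fisher discriminant directions."""
    n, p = X.shape
    mu = X.mean(axis=0)
    Sw = np.zeros((p, p))
    Sb = np.zeros((p, p))
    for k in range(n_classes):
        Xk = X[y == k]
        mu_k = Xk.mean(axis=0)
        Sw += (Xk - mu_k).T @ (Xk - mu_k) / n
        Sb += len(Xk) * np.outer(mu_k - mu, mu_k - mu) / n
    # Eigenvectors of Sigma_W^{-1} Sigma_B, sorted by decreasing eigenvalue.
    eigval, eigvec = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(eigval.real)[::-1]
    return eigvec[:, order[:n_classes - 1]].real

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=m, size=(30, 4)) for m in (0.0, 2.0, 4.0)])
y = np.repeat(np.arange(3), 30)
print(fisher_directions(X, y, n_classes=3).shape)   # (4, 2): K - 1 directions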

3.2 Feature Selection in LDA Problems

LDA is often used as a data reduction technique, where the K−1 discriminant directions summarize the p original variables. However, all variables intervene in the definition of these discriminant directions, and this behavior may be troublesome.

Several modifications of LDA have been proposed to generate sparse discriminant directions. Sparse LDA reveals discriminant directions that only involve a few variables. The main target of this sparsity is to reduce the dimensionality of the problem (as in genetic analysis), but parsimonious classification is also motivated by the need for interpretable models, robustness in the solution, or computational constraints.

The easiest approach to sparse LDA performs variable selection before discrimination. The relevancy of each feature is usually based on univariate statistics, which are fast and convenient to compute, but whose very partial view of the overall classification problem may lead to dramatic information loss. As a result, several approaches have been devised in recent years to construct LDA with wrapper and embedded feature selection capabilities.

They can be categorized according to the LDA formulation that provides the basis for the sparsity-inducing extension, that is, either Fisher's discriminant analysis (variance-based) or a regression-based formulation.

3.2.1 Inertia Based

The Fisher discriminant seeks a projection maximizing the separability of classes from inertia principles: mass centers should be far away (large between-class variance) and classes should be concentrated around their mass centers (small within-class variance). This view motivates a first series of sparse LDA formulations.

Moghaddam et al. (2006) propose an algorithm for sparse LDA in binary classification where sparsity originates in a hard cardinality constraint. The formalization is based on Fisher's discriminant (3.1), reformulated as a quadratically-constrained quadratic program (3.3). Computationally, the algorithm implements a combinatorial search with some eigenvalue properties that are used to avoid exploring subsets of possible solutions. Extensions of this approach have been developed, with new sparsity bounds for the two-class discrimination problem and shortcuts to speed up the evaluation of eigenvalues (Moghaddam et al., 2007).

Also for binary problems, Wu et al. (2009) proposed a sparse LDA applied to gene expression data where Fisher's discriminant (3.1) is solved as

min_{β∈R^p} β^⊤Σ_W β
s.t. (μ_1 − μ_2)^⊤β = 1 ,  Σ_{j=1}^p |β_j| ≤ t ,

where μ_1 and μ_2 are the vectors of mean gene expression values corresponding to the two groups. The expression to optimize and the first constraint match problem (3.1); the second constraint encourages parsimony.

Witten and Tibshirani (2011) describe a multi-class technique using Fisher's discriminant, rewritten as a series of K−1 constrained and penalized maximization problems:

max_{β_k∈R^p} β_k^⊤ Σ_B^k β_k − P_k(β_k)
s.t. β_k^⊤ Σ_W β_k ≤ 1 .

The term to maximize is the projected between-class covariance β_k^⊤Σ_B^k β_k, subject to an upper bound on the projected within-class covariance β_k^⊤Σ_W β_k. The penalty P_k(β_k) is added to avoid singularities and to induce sparsity. The authors suggest weighted versions of the regular Lasso and fused Lasso penalties for general purpose data: the Lasso shrinks the less informative variables to zero, and the fused Lasso encourages a piecewise constant β_k vector. The R code is available from the website of Daniela Witten.

Cai and Liu (2011) use Fisher's discriminant to solve a binary LDA problem. But instead of performing separate estimations of Σ_W and (μ_1 − μ_2) to obtain the optimal solution β = Σ_W^{-1}(μ_1 − μ_2), they estimate the product directly through constrained L1 minimization:

min_{β∈R^p} ‖β‖_1
s.t. ‖Σβ − (μ_1 − μ_2)‖_∞ ≤ λ .

Sparsity is encouraged by the L1 norm of vector β, and the parameter λ is used to tune the optimization.


Most of the algorithms reviewed above are conceived for binary classification. And for those that are envisaged for multi-class scenarios, the Lasso is the most popular way to induce sparsity; however, as discussed in Section 2.3.5, the Lasso is not the best tool to encourage parsimonious models when there are multiple discriminant directions.

3.2.2 Regression Based

In binary classification, LDA has been known to be equivalent to linear regression of scaled class labels since Fisher (1936). For K > 2, many studies show that multivariate linear regression of a specific class indicator matrix can be applied as a preprocessing step for LDA. However, directly casting LDA as a least squares problem is challenging in the multi-class case (Duda et al., 2000; Friedman et al., 2009).

Predefined Indicator Matrix

Multi-class classification is usually linked with linear regression through the definition of an indicator matrix (Friedman et al., 2009). An indicator matrix Y is an n×K matrix containing the class labels of all samples. Several well-known types exist in the literature. For example, the binary or dummy indicator (y_ik = 1 if sample i belongs to class k and y_ik = 0 otherwise) is commonly used to link multi-class classification with linear regression (Friedman et al., 2009). Another "popular" choice is y_ik = 1 if sample i belongs to class k and y_ik = −1/(K−1) otherwise; it was used, for example, to extend Support Vector Machines to multi-class classification (Lee et al., 2004) or to generalize the kernel target alignment measure (Guermeur et al., 2004).
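A small sketch, added for illustration (names are arbitrary), constructing the dummy indicator matrix and the ±1/(K−1) variant mentioned above:

import numpy as np

def indicator_matrix(labels, n_classes, kind="dummy"):
    """Build an n-by-K class indicator matrix Y from integer labels in {0, ..., K-1}."""
    n = len(labels)
    Y = np.zeros((n, n_classes))
    Y[np.arange(n), labels] = 1.0
    if kind == "symmetric":
        # The variant y_ik = 1 for the true class and -1/(K-1) otherwise.
        Y = np.where(Y == 1.0, 1.0, -1.0 / (n_classes - 1))
    return Y

labels = np.array([0, 2, 1, 1])
print(indicator_matrix(labels, 3))
print(indicator_matrix(labels, 3, kind="symmetric"))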

Some works propose a formulation of the least squares problem based on a new class indicator matrix (Ye, 2007). This new indicator matrix allows the definition of LS-LDA (Least Squares Linear Discriminant Analysis), which holds a rigorous equivalence with multi-class LDA under a mild condition, shown empirically to hold in many applications involving high-dimensional data.

Qiao et al. (2009) propose a discriminant analysis for the high-dimensional, low-sample setting which incorporates variable selection in a Fisher's LDA formulated as a generalized eigenvalue problem, which is then recast as a least squares regression. Sparsity is obtained by means of a Lasso penalty on the discriminant vectors. Even if this is not mentioned in the article, their formulation looks very close in spirit to Optimal Scoring regression. Some rather clumsy steps in the developments hinder the comparison, so that further investigations are required; the lack of publicly available code also restrained an empirical test of this conjecture. If the similitude were confirmed, their formalization would be very close to the one of Clemmensen et al. (2011), reviewed in the following section.

In a recent paper, Mai et al. (2012) take advantage of the equivalence between ordinary least squares and LDA problems to propose a binary classifier solving a penalized least squares problem with a Lasso penalty. The sparse version of the projection vector β is obtained by solving

min_{β∈R^p, β_0∈R} n^{-1} Σ_{i=1}^n (y_i − β_0 − x_i^⊤β)² + λ Σ_{j=1}^p |β_j| ,

where y_i is the binary indicator of the label of pattern x_i. Even if the authors focus on the Lasso penalty, they also suggest any other generic sparsity-inducing penalty. The decision rule x^⊤β + β_0 > 0 is the LDA classifier when it is built using the resulting β vector for λ = 0, but a different intercept β_0 is required.

Optimal Scoring

In binary classification, the regression of (scaled) class indicators enables to recover exactly the LDA discriminant direction. For more than two classes, regressing predefined indicator matrices may be impaired by the masking effect, where the scores assigned to a class situated between two other ones never dominate (Hastie et al., 1994). Optimal scoring (OS) circumvents the problem by assigning "optimal scores" to the classes. This route was opened by Fisher (1936) for binary classification and pursued for more than two classes by Breiman and Ihaka (1984), with the aim of developing a non-linear extension of discriminant analysis based on additive models. They named their approach optimal scaling, for it optimizes the scaling of the indicators of classes together with the discriminant functions. Their approach was later disseminated under the name optimal scoring by Hastie et al. (1994), who proposed several extensions of LDA, either aiming at constructing more flexible discriminants (Hastie and Tibshirani, 1996) or more conservative ones (Hastie et al., 1995).

As an alternative method to solve LDA problems, Hastie et al. (1995) proposed to incorporate a smoothness prior on the discriminant directions in the OS problem, through a positive-definite penalty matrix Ω, leading to a problem expressed in compact form as

min_{Θ,B} ‖YΘ − XB‖²_F + λ tr(B^⊤ΩB)                    (3.4a)
s.t. n^{-1} Θ^⊤Y^⊤YΘ = I_{K−1} ,                    (3.4b)

where Θ ∈ R^{K×(K−1)} are the class scores, B ∈ R^{p×(K−1)} are the regression coefficients, and ‖·‖_F is the Frobenius norm. This compact form does not render the order that arises naturally when considering the following series of K−1 problems:

min_{θ_k∈R^K, β_k∈R^p} ‖Yθ_k − Xβ_k‖² + β_k^⊤Ωβ_k                    (3.5a)
s.t. n^{-1} θ_k^⊤Y^⊤Yθ_k = 1 ,                    (3.5b)
     θ_k^⊤Y^⊤Yθ_ℓ = 0 , ℓ = 1, …, k−1 ,                    (3.5c)

where each β_k corresponds to a discriminant direction.


Several sparse LDA have been derived by introducing non-quadratic sparsity-inducing penalties in the OS regression problem (Ghosh and Chinnaiyan, 2005; Leng, 2008; Grosenick et al., 2008; Clemmensen et al., 2011). Grosenick et al. (2008) proposed a variant of the lasso-based penalized OS of Ghosh and Chinnaiyan (2005) by introducing an elastic-net penalty in binary class problems. A generalization to multi-class problems was suggested by Clemmensen et al. (2011), where the objective function (3.5a) is replaced by

min_{β_k∈R^p, θ_k∈R^K} Σ_k ‖Yθ_k − Xβ_k‖²_2 + λ_1 ‖β_k‖_1 + λ_2 β_k^⊤Ωβ_k ,

where λ_1 and λ_2 are regularization parameters and Ω is a penalization matrix, often taken to be the identity for the elastic net. The code for SLDA is available from the website of Line Clemmensen.

Another generalization of the work of Ghosh and Chinnaiyan (2005) was proposed by Leng (2008), with an extension to the multi-class framework based on a group-Lasso penalty in the objective function (3.5a):

min_{β_k∈R^p, θ_k∈R^K} Σ_{k=1}^{K−1} ‖Yθ_k − Xβ_k‖²_2 + λ Σ_{j=1}^p ( Σ_{k=1}^{K−1} β_kj² )^{1/2} ,                    (3.6)

which is the criterion chosen in this thesis. The following chapters present our theoretical and algorithmic contributions regarding this formulation. The proposal of Leng (2008) was heuristically driven, and his algorithm followed closely the group-Lasso algorithm of Yuan and Lin (2006), which is not very efficient (the experiments of Leng (2008) are limited to small data sets with hundreds of examples and 1000 preselected genes, and no code is provided). Here, we formally link (3.6) to penalized LDA and propose a publicly available efficient code for solving this problem.


4 Formalizing the Objective

In this chapter we detail the rationale supporting the group-Lasso Optimal Scoring Solver (GLOSS) algorithm. GLOSS addresses a sparse LDA problem globally, through a regression approach. Our analysis formally relates GLOSS to Fisher's discriminant analysis and also enables to derive variants, such as LDA assuming a diagonal within-class covariance structure (Bickel and Levina, 2004).

The sparsity arises from the group-Lasso penalty (3.6), due to Leng (2008), which selects the same features in all discriminant directions, thus providing an interpretable low-dimensional representation of data. For K classes, this representation can be either complete, in dimension K−1, or partial, for a reduced-rank classification. The first two or three discriminants can also be used to display a graphical summary of the data.

The derivation of penalized LDA as a penalized optimal scoring regression is quite tedious, but it is required here since the algorithm hinges on this equivalence. The main lines have been derived in several places (Breiman and Ihaka, 1984; Hastie et al., 1994; Hastie and Tibshirani, 1996; Hastie et al., 1995) and already used before for sparsity-inducing penalties (Roth and Lange, 2004). However, the published demonstrations were quite elusive on a number of points, leading to generalizations that were not supported in a rigorous way. To our knowledge, we disclosed the first formal equivalence between the optimal scoring regression problem penalized by the group-Lasso and penalized LDA (Sanchez Merchante et al., 2012).

4.1 From Optimal Scoring to Linear Discriminant Analysis

Following Hastie et al. (1995), we now show the equivalence between the series of problems encountered in penalized optimal scoring (p-OS) and in penalized LDA (p-LDA) by going through canonical correlation analysis. We first provide some properties of the solutions of an arbitrary problem in the p-OS series (3.5).

Throughout this chapter we assume that

• there is no empty class, that is, the diagonal matrix Y^⊤Y is full rank;

• inputs are centered, that is, X^⊤1_n = 0;

• the quadratic penalty Ω is positive-semidefinite and such that X^⊤X + Ω is full rank.


4.1.1 Penalized Optimal Scoring Problem

For the sake of simplicity, we now drop subscript k to refer to any problem in the p-OS series (3.5). First note that Problems (3.5) are biconvex in (θ, β), that is, convex in θ for each β value and vice-versa; the problems are however non-convex. In particular, if (θ*, β*) is a solution, then (−θ*, −β*) is also a solution.

The orthogonality constraint (3.5c) inherently limits the number of possible problems in the series to K, since we assumed that there are no empty classes. Moreover, as X is centered, the K−1 first optimal scores are orthogonal to 1 (and the Kth problem would be solved by β_K = 0). All the problems considered here can be solved by a singular value decomposition of a real symmetric matrix, so that the orthogonality constraints are easily dealt with. Hence, in the sequel, we no longer mention these orthogonality constraints (3.5c), which apply all along the route, so as to simplify all expressions. The generic problem solved is thus

min_{θ∈R^K, β∈R^p} ‖Yθ − Xβ‖² + β^⊤Ωβ                    (4.1a)
s.t. n^{-1} θ^⊤Y^⊤Yθ = 1 .                    (4.1b)

For a given score vector θ, the discriminant direction β that minimizes the p-OS criterion (4.1) is the penalized least squares estimator

β_os = (X^⊤X + Ω)^{-1} X^⊤Yθ .                    (4.2)

The objective function (4.1a) is then

‖Yθ − Xβ_os‖² + β_os^⊤Ωβ_os = θ^⊤Y^⊤Yθ − 2θ^⊤Y^⊤Xβ_os + β_os^⊤(X^⊤X + Ω)β_os
                            = θ^⊤Y^⊤Yθ − θ^⊤Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Yθ ,

where the second line stems from the definition of β_os (4.2). Now, using the fact that the optimal θ obeys constraint (4.1b), the optimization problem is equivalent to

max_{θ : n^{-1}θ^⊤Y^⊤Yθ=1} θ^⊤Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Yθ ,                    (4.3)

which shows that the optimization of the p-OS problem with respect to θ_k boils down to finding the kth largest eigenvector of Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Y. Indeed, Appendix C details that Problem (4.3) is solved by

(Y^⊤Y)^{-1}Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Yθ = α²θ ,                    (4.4)


where α² is the maximal eigenvalue¹:

n^{-1}θ^⊤Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Yθ = α² n^{-1}θ^⊤(Y^⊤Y)θ
n^{-1}θ^⊤Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Yθ = α² .                    (4.5)

4.1.2 Penalized Canonical Correlation Analysis

As per Hastie et al. (1995), the penalized Canonical Correlation Analysis (p-CCA) problem between variables X and Y is defined as follows:

max_{θ∈R^K, β∈R^p} n^{-1} θ^⊤Y^⊤Xβ                    (4.6a)
s.t. n^{-1} θ^⊤Y^⊤Yθ = 1 ,                    (4.6b)
     n^{-1} β^⊤(X^⊤X + Ω)β = 1 .                    (4.6c)

The solutions of (4.6) are obtained by finding the saddle points of the Lagrangian:

nL(β, θ, ν, γ) = θ^⊤Y^⊤Xβ − ν(θ^⊤Y^⊤Yθ − n) − γ(β^⊤(X^⊤X + Ω)β − n)
⇒ n ∂L(β, θ, γ, ν)/∂β = X^⊤Yθ − 2γ(X^⊤X + Ω)β
⇒ β_cca = (1/2γ) (X^⊤X + Ω)^{-1}X^⊤Yθ .

Then, as β_cca obeys (4.6c), we obtain

β_cca = (X^⊤X + Ω)^{-1}X^⊤Yθ / √( n^{-1}θ^⊤Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Yθ ) ,                    (4.7)

so that the optimal objective function (4.6a) can be expressed with θ alone:

n^{-1}θ^⊤Y^⊤Xβ_cca = n^{-1}θ^⊤Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Yθ / √( n^{-1}θ^⊤Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Yθ )
                   = √( n^{-1}θ^⊤Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Yθ ) ,

and the optimization problem with respect to θ can be restated as

max_{θ : n^{-1}θ^⊤Y^⊤Yθ=1} θ^⊤Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Yθ .                    (4.8)

Hence, the p-OS and p-CCA problems produce the same optimal score vectors θ. The regression coefficients are thus proportional, as shown by (4.2) and (4.7):

β_os = α β_cca ,                    (4.9)

¹The awkward notation α² for the eigenvalue was chosen here to ease comparison with Hastie et al. (1995). It is easy to check that this eigenvalue is indeed non-negative (see Equation (4.5) for example).


where α is defined by (4.5). The p-CCA optimization problem can also be written as a function of β alone, using the optimality conditions for θ:

n ∂L(β, θ, γ, ν)/∂θ = Y^⊤Xβ − 2νY^⊤Yθ
⇒ θ_cca = (1/2ν) (Y^⊤Y)^{-1}Y^⊤Xβ .                    (4.10)

Then, as θ_cca obeys (4.6b), we obtain

θ_cca = (Y^⊤Y)^{-1}Y^⊤Xβ / √( n^{-1}β^⊤X^⊤Y(Y^⊤Y)^{-1}Y^⊤Xβ ) ,                    (4.11)

leading to the following expression of the optimal objective function:

n^{-1}θ_cca^⊤Y^⊤Xβ = n^{-1}β^⊤X^⊤Y(Y^⊤Y)^{-1}Y^⊤Xβ / √( n^{-1}β^⊤X^⊤Y(Y^⊤Y)^{-1}Y^⊤Xβ )
                   = √( n^{-1}β^⊤X^⊤Y(Y^⊤Y)^{-1}Y^⊤Xβ ) .

The p-CCA problem can thus be solved with respect to β by plugging this value in (4.6):

max_{β∈R^p} n^{-1} β^⊤X^⊤Y(Y^⊤Y)^{-1}Y^⊤Xβ                    (4.12a)
s.t. n^{-1} β^⊤(X^⊤X + Ω)β = 1 ,                    (4.12b)

where the positive objective function has been squared compared to (4.6). This formulation is important since it will be used to link p-CCA to p-LDA. We thus derive its solution: following the reasoning of Appendix C, β_cca verifies

n^{-1} X^⊤Y(Y^⊤Y)^{-1}Y^⊤Xβ_cca = λ (X^⊤X + Ω) β_cca ,                    (4.13)

where λ is the maximal eigenvalue, shown below to be equal to α²:

n^{-1} β_cca^⊤X^⊤Y(Y^⊤Y)^{-1}Y^⊤Xβ_cca = λ
⇒ n^{-1} α^{-1} β_cca^⊤X^⊤Y(Y^⊤Y)^{-1}Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Yθ = λ
⇒ n^{-1} α β_cca^⊤X^⊤Yθ = λ
⇒ n^{-1} θ^⊤Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Yθ = λ
⇒ α² = λ .

The first line is obtained from constraint (4.12b); the second line uses relationship (4.7), whose denominator is α; the third line comes from (4.4); the fourth line uses relationship (4.7) again; and the last one the definition of α (4.5).


4.1.3 Penalized Linear Discriminant Analysis

Still following Hastie et al. (1995), the penalized Linear Discriminant Analysis problem is defined as follows:

max_{β∈R^p} β^⊤Σ_B β                    (4.14a)
s.t. β^⊤(Σ_W + n^{-1}Ω)β = 1 ,                    (4.14b)

where Σ_B and Σ_W are respectively the sample between-class and within-class covariance matrices of the original p-dimensional data. This problem may be solved by an eigenvector decomposition, as detailed in Appendix C.

As the feature matrix X is assumed to be centered, the sample total, between-class and within-class covariance matrices can be written in a simple matrix form using the projection operator Y(Y^⊤Y)^{-1}Y^⊤:

Σ_T = (1/n) Σ_{i=1}^n x_i x_i^⊤ = n^{-1} X^⊤X ,

Σ_B = (1/n) Σ_{k=1}^K n_k μ_k μ_k^⊤ = n^{-1} X^⊤Y(Y^⊤Y)^{-1}Y^⊤X ,

Σ_W = (1/n) Σ_{k=1}^K Σ_{i: y_ik=1} (x_i − μ_k)(x_i − μ_k)^⊤ = n^{-1} ( X^⊤X − X^⊤Y(Y^⊤Y)^{-1}Y^⊤X ) .

Using these formulae, the solution of the p-LDA problem (4.14) is obtained as

X^⊤Y(Y^⊤Y)^{-1}Y^⊤Xβ_lda = λ ( X^⊤X + Ω − X^⊤Y(Y^⊤Y)^{-1}Y^⊤X ) β_lda
X^⊤Y(Y^⊤Y)^{-1}Y^⊤Xβ_lda = λ/(1−λ) ( X^⊤X + Ω ) β_lda .

The comparison of the last equation with the characterization (4.13) of β_cca shows that β_lda and β_cca are proportional, and that λ/(1−λ) = α². Using constraints (4.12b) and (4.14b), it comes that

β_lda = (1 − α²)^{-1/2} β_cca
      = α^{-1}(1 − α²)^{-1/2} β_os ,

which ends the path from p-OS to p-LDA.


4.1.4 Summary

The three previous subsections considered a generic form of the kth problem in the p-OS series. The relationships unveiled above also hold for the compact notation gathering all problems (3.4), which is recalled below:

min_{Θ,B} ‖YΘ − XB‖²_F + λ tr(B^⊤ΩB)
s.t. n^{-1} Θ^⊤Y^⊤YΘ = I_{K−1} .

Let A represent the (K−1)×(K−1) diagonal matrix with elements α_k, the square root of the kth largest eigenvalue of Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Y; we have

B_LDA = B_CCA ( I_{K−1} − A² )^{-1/2}
      = B_OS A^{-1} ( I_{K−1} − A² )^{-1/2} ,                    (4.15)

where I_{K−1} is the (K−1)×(K−1) identity matrix. At this point, the feature matrix X, which has dimensions n×p in the input space, can be projected into the optimal scoring domain as the n×(K−1) matrix X_OS = XB_OS, or into the linear discriminant analysis space as the n×(K−1) matrix X_LDA = XB_LDA. Classification can be performed in any of those domains if the appropriate distance (penalized within-class covariance matrix) is applied.

With the aim of performing classification, the whole process can be summarized as follows:

1. Solve the p-OS problem as B_OS = (X^⊤X + λΩ)^{-1} X^⊤YΘ, where Θ are the K−1 leading eigenvectors of Y^⊤X(X^⊤X + λΩ)^{-1}X^⊤Y.

2. Translate the data samples X into the LDA domain as X_LDA = XB_OS D, where D = A^{-1}(I_{K−1} − A²)^{-1/2}.

3. Compute the matrix M of centroids μ_k from X_LDA and Y.

4. Evaluate the distances d(x, μ_k) in the LDA domain as a function of M and X_LDA.

5. Translate distances into posterior probabilities and assign every sample i to a class k following the maximum a posteriori rule.

6. Graphical representation.


The solution of the penalized optimal scoring regression and the computation of the distance and posterior matrices are detailed in Sections 4.2.1, 4.2.2 and 4.2.3, respectively.

4.2 Practicalities

4.2.1 Solution of the Penalized Optimal Scoring Regression

Following Hastie et al. (1994) and Hastie et al. (1995), a quadratically penalized LDA problem can be presented as a quadratically penalized OS problem:

min_{Θ∈R^{K×(K−1)}, B∈R^{p×(K−1)}} ‖YΘ − XB‖²_F + λ tr(B^⊤ΩB)                    (4.16a)
s.t. n^{-1} Θ^⊤Y^⊤YΘ = I_{K−1} ,                    (4.16b)

where Θ are the class scores, B the regression coefficients, and ‖·‖_F is the Frobenius norm.

Though non-convex, the OS problem is readily solved by a decomposition in Θ and B: the optimal B_OS does not intervene in the optimality conditions with respect to Θ, and the optimization with respect to B is obtained in closed form as a linear combination of the optimal scores Θ (Hastie et al., 1995). The algorithm may seem a bit tortuous considering the properties mentioned above, as it proceeds in four steps:

1. Initialize Θ to Θ⁰ such that n^{-1} Θ⁰^⊤Y^⊤YΘ⁰ = I_{K−1}.

2. Compute B = (X^⊤X + λΩ)^{-1} X^⊤YΘ⁰.

3. Set Θ to be the K−1 leading eigenvectors of Y^⊤X(X^⊤X + λΩ)^{-1}X^⊤Y.

4. Compute the optimal regression coefficients

   B_OS = (X^⊤X + λΩ)^{-1} X^⊤YΘ .                    (4.17)

Defining Θ⁰ in Step 1, instead of directly using Θ as expressed in Step 3, drastically reduces the computational burden of the eigen-analysis: the latter is performed on Θ⁰^⊤Y^⊤X(X^⊤X + λΩ)^{-1}X^⊤YΘ⁰, which is computed as Θ⁰^⊤Y^⊤XB, thus avoiding a costly matrix inversion. The solution of the penalized optimal scoring problem as an eigenvector decomposition is detailed and justified in Appendix B.
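The sketch below, added for illustration, follows the four steps above for a quadratic penalty Ω (identity by default); the function name and the way Θ⁰ is built (class-size weighted contrasts, orthonormalized in the Y^⊤Y metric) are implementation choices of this sketch, not prescribed by the text:

import numpy as np

def penalized_os(X, Y, lam, Omega=None):
    """Sketch of the four-step penalized optimal scoring solver of Section 4.2.1."""
    n, p = X.shape
    K = Y.shape[1]
    Omega = np.eye(p) if Omega is None else Omega
    counts = Y.sum(axis=0)                 # class counts n_k (diagonal of Y'Y)
    W = np.diag(counts) / n                # Y'Y / n

    # Step 1: scores Theta0 with n^{-1} Theta0' Y'Y Theta0 = I_{K-1},
    # chosen orthogonal (in the Y'Y metric) to the trivial constant score.
    M = np.eye(K)[:, :K - 1] - np.outer(np.ones(K), counts[:K - 1] / n)
    L = np.linalg.cholesky(M.T @ W @ M)
    Theta0 = M @ np.linalg.inv(L).T

    # Step 2: B0 = (X'X + lam * Omega)^{-1} X'Y Theta0.
    A = X.T @ X + lam * Omega
    B0 = np.linalg.solve(A, X.T @ Y @ Theta0)

    # Step 3: eigen-analysis of the small matrix Theta0' Y'X B0 (no extra inversion).
    S = Theta0.T @ (Y.T @ X) @ B0
    eigval, V = np.linalg.eigh((S + S.T) / 2.0)
    V = V[:, ::-1]                         # leading eigenvectors first
    Theta = Theta0 @ V

    # Step 4: optimal regression coefficients B_OS = (X'X + lam*Omega)^{-1} X'Y Theta.
    B_os = B0 @ V
    return B_os, Theta

rng = np.random.default_rng(0)
X = rng.normal(size=(90, 6))
X -= X.mean(axis=0)                        # inputs are assumed centered
labels = np.repeat(np.arange(3), 30)
Y = np.zeros((90, 3)); Y[np.arange(90), labels] = 1.0
B_os, Theta = penalized_os(X, Y, lam=1.0)
print(B_os.shape, Theta.shape)             # (6, 2) (3, 2)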

This four-step algorithm is valid when the penalty is of the form tr(B^⊤ΩB). However, when an L1 penalty is applied in (4.16), the optimization algorithm requires iterative updates of B and Θ. That situation is developed by Clemmensen et al. (2011), where a Lasso or an Elastic net penalty is used to induce sparsity in the OS problem. Furthermore, these Lasso and Elastic net penalties do not enjoy the equivalence with LDA problems.

4.2.2 Distance Evaluation

The simplest classification rule is the nearest centroid rule, where sample x_i is assigned to class k if it is closer (in terms of the shared within-class Mahalanobis distance) to centroid μ_k than to any other centroid μ_ℓ. In general, the parameters of the model are unknown, and the rule is applied with the parameters estimated from training data (sample estimators μ̂_k and Σ̂_W). If μ_k are the centroids in the input space, sample x_i is assigned to class k if the distance

d(x_i, μ_k) = (x_i − μ_k)^⊤ Σ_WΩ^{-1} (x_i − μ_k) − 2 log(n_k/n)                    (4.18)

is minimized among all k. In expression (4.18), the first term is the Mahalanobis distance in the input space and the second term is an adjustment for unequal class sizes, which estimates the prior probability of class k. Note that this is inspired by the Gaussian view of LDA and that another definition of the adjustment term could be used (Friedman et al., 2009; Mai et al., 2012). The matrix Σ_WΩ used in (4.18) is the penalized within-class covariance matrix, which can be decomposed in a penalized and a non-penalized component:

Σ_WΩ^{-1} = ( n^{-1}(X^⊤X + λΩ) − Σ_B )^{-1}
          = ( n^{-1}X^⊤X − Σ_B + n^{-1}λΩ )^{-1}
          = ( Σ_W + n^{-1}λΩ )^{-1} .                    (4.19)

Before explaining how to compute the distances, let us summarize some clarifying points:

• the solution B_OS of the p-OS problem is enough to accomplish classification;

• in the LDA domain (the space of discriminant variates X_LDA), classification is based on Euclidean distances;

• classification can be done in a reduced-rank space of dimension R < K−1 by using the first R discriminant directions {β_k}_{k=1}^R.

As a result, the expression of the distance (4.18) depends on the domain where the classification is performed. If we classify in the p-OS domain, it reads

‖(x_i − μ_k)B_OS‖²_{Σ_WΩ} − 2 log(π_k) ,

where π_k is the estimated class prior and ‖·‖_S is the Mahalanobis distance assuming within-class covariance S. If classification is done in the p-LDA domain, it reads

‖(x_i − μ_k)B_OS A^{-1}(I_{K−1} − A²)^{-1/2}‖²_2 − 2 log(π_k) ,

which is a plain Euclidean distance.


4.2.3 Posterior Probability Evaluation

Let d(x, μ_k) be the distance between x and μ_k defined as in (4.18). Under the assumption that classes are Gaussian, the posterior probabilities p(y_k = 1|x) can be estimated as

p(y_k = 1|x) ∝ exp( −d(x, μ_k)/2 )
             ∝ π_k exp( −(1/2) ‖(x − μ_k)B_OS A^{-1}(I_{K−1} − A²)^{-1/2}‖²_2 ) .                    (4.20)

Those probabilities must be normalized to ensure that they sum to one. When the distances d(x, μ_k) take large values, exp(−d(x, μ_k)/2) can take extremely small values, generating underflow issues. A classical trick to fix this numerical issue is detailed below:

p(y_k = 1|x) = π_k exp( −d(x, μ_k)/2 ) / Σ_ℓ π_ℓ exp( −d(x, μ_ℓ)/2 )
             = π_k exp( (−d(x, μ_k) + d_max)/2 ) / Σ_ℓ π_ℓ exp( (−d(x, μ_ℓ) + d_max)/2 ) ,

where d_max = max_k d(x, μ_k).
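A minimal sketch of this normalization, added for illustration (names are arbitrary), applied to a row of distances large enough to underflow a naive implementation:

import numpy as np

def posteriors_from_distances(d, priors):
    """Posterior probabilities from distances d (n x K), with the shift trick of Section 4.2.3."""
    shift = d.max(axis=1, keepdims=True)            # d_max, as in the text
    unnorm = priors * np.exp((-d + shift) / 2.0)    # proportional to pi_k exp(-d_k / 2)
    return unnorm / unnorm.sum(axis=1, keepdims=True)

d = np.array([[900.0, 905.0, 950.0]])               # naive exp(-d/2) underflows to zero here
priors = np.array([0.3, 0.3, 0.4])
print(posteriors_from_distances(d, priors))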

4.2.4 Graphical Representation

Sometimes it can be useful to have a graphical display of the dataset. Using only the two or three most discriminant directions may not provide the best separation between classes, but it can suffice to inspect the data. This can be accomplished by plotting the first two or three dimensions of the regression fits X_OS or of the discriminant variates X_LDA, depending on whether we represent the dataset in the OS or in the LDA domain. Other attributes, such as the centroids or the shape of the within-class variance, can also be represented.

4.3 From Sparse Optimal Scoring to Sparse LDA

The equivalence stated in Section 4.1 holds for quadratic penalties of the form β^⊤Ωβ, under the assumption that Y^⊤Y and X^⊤X + λΩ are full rank (fulfilled when there are no empty classes and Ω is positive definite). Quadratic penalties have interesting properties but, as recalled in Section 2.3, they do not induce sparsity. In this respect, L1 penalties are preferable, but they lack a connection such as the one stated by Hastie et al. (1995) between p-LDA and p-OS.

In this section we introduce the tools used to obtain sparse models while maintaining the equivalence between p-LDA and p-OS problems. We use a group-Lasso penalty (see Section 2.3.4) that induces groups of zeros in the coefficients corresponding to the same feature in all discriminant directions, resulting in truly parsimonious models. Our derivation uses a variational formulation of the group-Lasso to generalize the equivalence drawn by Hastie et al. (1995) for quadratic penalties. We therefore intend to show that our formulation of the group-Lasso can be written in the quadratic form tr(B^⊤ΩB).

4.3.1 A Quadratic Variational Form

Quadratic variational forms of the Lasso and group-Lasso have been proposed shortly after the original Lasso paper of Tibshirani (1996), as a means to address optimization issues, but also as an inspiration for generalizing the Lasso penalty (Grandvalet, 1998; Canu and Grandvalet, 1999). The algorithms based on these quadratic variational forms iteratively reweight a quadratic penalty. They are now often outperformed by more efficient strategies (Bach et al., 2012).

Our formulation of the group-Lasso is shown below:

min_{τ∈R^p} min_{B∈R^{p×(K−1)}} J(B) + λ Σ_{j=1}^p w_j² ‖β^j‖²_2 / τ_j                    (4.21a)
s.t. Σ_j τ_j − Σ_j w_j ‖β^j‖_2 ≤ 0 ,                    (4.21b)
     τ_j ≥ 0 , j = 1, …, p ,                    (4.21c)

where B ∈ R^{p×(K−1)} is a matrix composed of row vectors β^j ∈ R^{K−1}, B = (β^{1⊤}, …, β^{p⊤})^⊤, and the w_j are predefined nonnegative weights. The cost function J(B) in our context is the OS regression loss ‖YΘ − XB‖²_2, but for simplicity I keep the generic notation J(B) for now. Here and in what follows, b/τ is defined by continuation at zero: b/0 = +∞ if b ≠ 0 and 0/0 = 0. Note that variants of (4.21) have been proposed elsewhere (see e.g. Canu and Grandvalet, 1999; Bach et al., 2012, and references therein).

The intuition behind our approach is that, using the variational formulation, we recast a non-quadratic penalty into the convex hull of a family of quadratic penalties indexed by the variables τ_j. This is graphically shown in Figure 4.1.

Let us start by proving the equivalence between our variational formulation and the standard group-Lasso (an alternative variational formulation is detailed and demonstrated in Appendix D).

Lemma 4.1. The quadratic penalty in β^j in (4.21) acts as the group-Lasso penalty λ Σ_{j=1}^p w_j ‖β^j‖_2.

Proof. The Lagrangian of Problem (4.21) is

L = J(B) + λ Σ_{j=1}^p w_j² ‖β^j‖²_2 / τ_j + ν_0 ( Σ_{j=1}^p τ_j − Σ_{j=1}^p w_j ‖β^j‖_2 ) − Σ_{j=1}^p ν_j τ_j .


Figure 4.1: Graphical representation of the variational approach to the group-Lasso.

Thus, the first-order optimality conditions for τ_j* are

∂L/∂τ_j (τ_j*) = 0 ⇔ −λ w_j² ‖β^j‖²_2 / τ_j*² + ν_0 − ν_j = 0
                  ⇔ −λ w_j² ‖β^j‖²_2 + ν_0 τ_j*² − ν_j τ_j*² = 0
                  ⇒ −λ w_j² ‖β^j‖²_2 + ν_0 τ_j*² = 0 .

The last line is obtained from complementary slackness, which implies here ν_j τ_j* = 0. Indeed, complementary slackness states that ν_j g_j(τ_j*) = 0, where ν_j is the Lagrange multiplier for constraint g_j(τ_j) ≤ 0. As a result, the optimal value of τ_j is

τ_j* = √( λ w_j² ‖β^j‖²_2 / ν_0 ) = √(λ/ν_0) w_j ‖β^j‖_2 .                    (4.22)

We note that ν_0 ≠ 0 if there is at least one coefficient β_jk ≠ 0; thus the inequality constraint (4.21b) is at bound (due to complementary slackness):

Σ_{j=1}^p τ_j* − Σ_{j=1}^p w_j ‖β^j‖_2 = 0 ,                    (4.23)

so that τ_j* = w_j ‖β^j‖_2. Using this value in (4.21a), it is possible to conclude that Problem (4.21) is equivalent to the standard group-Lasso operator

min_{B∈R^{p×M}} J(B) + λ Σ_{j=1}^p w_j ‖β^j‖_2 .                    (4.24)

We have thus presented a convex quadratic variational form of the group-Lasso and demonstrated its equivalence with the standard group-Lasso formulation.
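As a small numerical check of Lemma 4.1 (an addition to the text, with illustrative values), the sketch below evaluates the quadratic penalty of (4.21a) at the optimal τ_j* = w_j‖β^j‖_2 and recovers the group-Lasso penalty of (4.24), using the convention 0/0 = 0:

import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(5, 3))        # p = 5 rows beta^j, K - 1 = 3 columns
B[[1, 3]] = 0.0                    # two inactive features
w = np.ones(5)
lam = 0.7

row_norms = np.linalg.norm(B, axis=1)
group_lasso = lam * np.sum(w * row_norms)               # standard penalty of eq. (4.24)

tau = w * row_norms                                     # optimal tau_j* = w_j ||beta^j||_2
with np.errstate(divide="ignore", invalid="ignore"):
    terms = np.where(tau > 0, w**2 * row_norms**2 / tau, 0.0)  # convention 0/0 = 0
variational = lam * terms.sum()                         # quadratic penalty of eq. (4.21a) at tau*

print(group_lasso, variational)                         # identical values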


With Lemma 4.1 we have proved that, under constraints (4.21b)–(4.21c), the quadratic problem (4.21a) is equivalent to the standard formulation of the group-Lasso (4.24). The penalty term of (4.21a) can be conveniently presented as λ tr(B^⊤ΩB), where

Ω = diag( w_1²/τ_1, w_2²/τ_2, …, w_p²/τ_p ) ,                    (4.25)

with τ_j = w_j ‖β^j‖_2, resulting in the diagonal components

(Ω)_jj = w_j / ‖β^j‖_2 .                    (4.26)

As stated at the beginning of this section, the equivalence between p-LDA and p-OS problems is thus demonstrated for the variational formulation. This equivalence is crucial to the derivation of the link between sparse OS and sparse LDA; it furthermore suggests a convenient implementation. We sketch below some properties that are instrumental in the implementation of the active set algorithm described in Chapter 5.

The first property states that the quadratic formulation is convex when J is convex, thus providing an easy control of optimality and convergence.

Lemma 4.2. If J is convex, Problem (4.21) is convex.

Proof. The function g(β, τ) = ‖β‖²_2/τ, known as the perspective function of f(β) = ‖β‖²_2, is convex in (β, τ) (see e.g. Boyd and Vandenberghe, 2004, Chapter 3), and the constraints (4.21b)–(4.21c) define convex admissible sets; hence Problem (4.21) is jointly convex with respect to (B, τ).

In what follows, J will be a convex quadratic (hence smooth) function, in which case a necessary and sufficient optimality condition is that zero belongs to the subdifferential of the objective function, whose expression is provided in the following lemma.

Lemma 4.3. For all B ∈ R^{p×(K−1)}, the subdifferential of the objective function of Problem (4.24) is

{ V ∈ R^{p×(K−1)} : V = ∂J(B)/∂B + λG } ,                    (4.27)

where G ∈ R^{p×(K−1)} is a matrix composed of row vectors g^j ∈ R^{K−1}, G = (g^{1⊤}, …, g^{p⊤})^⊤, defined as follows. Let S(B) denote the support of B, that is, the set of indices of its non-zero rows, S(B) = { j ∈ {1, …, p} : ‖β^j‖_2 ≠ 0 }; then we have

∀j ∈ S(B) ,  g^j = w_j ‖β^j‖_2^{-1} β^j ,                    (4.28)
∀j ∉ S(B) ,  ‖g^j‖_2 ≤ w_j .                    (4.29)


This condition results in an equality for the "active" non-zero vectors β^j and in an inequality for the other ones, which both provide essential building blocks of our algorithm.

Proof. When ‖β^j‖_2 ≠ 0, the gradient of the penalty with respect to β^j is

∂( λ Σ_{m=1}^p w_m ‖β^m‖_2 ) / ∂β^j = λ w_j β^j / ‖β^j‖_2 .                    (4.30)

At∥∥βj∥∥

2= 0 the gradient of the objective function is not continuous and the optimality

conditions then make use of the subdifferential (Bach et al 2011)

partβj

psumm=1

wj βm2

)= partβj

(λwj

∥∥βj∥∥2

)=λwjv isin RKminus1 v2 le 1

(431)

That gives the expression (429)

Lemma 4.4. Problem (4.21) admits at least one solution, which is unique if J is strictly convex. All critical points B of the objective function verifying the following conditions are global minima:
$$\forall j \in S, \quad \frac{\partial J(B)}{\partial \beta^j} + \lambda w_j \|\beta^j\|_2^{-1} \beta^j = 0 , \qquad (4.32a)$$
$$\forall j \notin S, \quad \left\| \frac{\partial J(B)}{\partial \beta^j} \right\|_2 \le \lambda w_j , \qquad (4.32b)$$
where $S \subseteq \{1, \ldots, p\}$ denotes the set of non-zero row vectors $\beta^j$ and $\bar{S}$ is its complement.

Lemma 4.4 provides a simple appraisal of the support of the solution, which would not be as easily handled with the direct analysis of the variational problem (4.21).

4.3.2 Group-Lasso OS as Penalized LDA

With all the previous ingredients, the group-Lasso Optimal Scoring Solver for performing sparse LDA can be introduced.

Proposition 4.1. The group-Lasso OS problem
$$B_{\mathrm{OS}} = \operatorname*{argmin}_{B \in \mathbb{R}^{p \times (K-1)}} \ \min_{\Theta \in \mathbb{R}^{K \times (K-1)}} \ \frac{1}{2} \|\mathbf{Y}\Theta - \mathbf{X}B\|_F^2 + \lambda \sum_{j=1}^{p} w_j \|\beta^j\|_2$$
$$\text{s.t.} \quad n^{-1}\, \Theta^\top \mathbf{Y}^\top \mathbf{Y} \Theta = \mathbf{I}_{K-1}$$


is equivalent to the penalized LDA problem
$$B_{\mathrm{LDA}} = \operatorname*{argmax}_{B \in \mathbb{R}^{p \times (K-1)}} \ \mathrm{tr}\!\left( B^\top \Sigma_B B \right) \quad \text{s.t.} \quad B^\top \left( \Sigma_W + n^{-1}\lambda\Omega \right) B = \mathbf{I}_{K-1} ,$$
where $\Omega = \mathrm{diag}\!\left( w_1^2/\tau_1, \ldots, w_p^2/\tau_p \right)$, with
$$\Omega_{jj} = \begin{cases} +\infty & \text{if } \beta^j_{\mathrm{OS}} = 0 , \\ w_j \|\beta^j_{\mathrm{OS}}\|_2^{-1} & \text{otherwise.} \end{cases} \qquad (4.33)$$
That is, $B_{\mathrm{LDA}} = B_{\mathrm{OS}}\, \mathrm{diag}\!\left( \alpha_k^{-1} (1 - \alpha_k^2)^{-1/2} \right)$, where $\alpha_k \in (0,1)$ is the $k$th leading eigenvalue of
$$n^{-1}\, \mathbf{Y}^\top \mathbf{X} \left( \mathbf{X}^\top \mathbf{X} + \lambda\Omega \right)^{-1} \mathbf{X}^\top \mathbf{Y} .$$

Proof. The proof simply consists in applying the result of Hastie et al. (1995), which holds for quadratic penalties, to the quadratic variational form of the group-Lasso.

The proposition applies in particular to the Lasso-based OS approaches to sparse LDA (Grosenick et al., 2008; Clemmensen et al., 2011) for K = 2, that is, for binary classification, or more generally for a single discriminant direction. Note however that it leads to a slightly different decision rule if the decision threshold is chosen a priori according to the Gaussian assumption for the features. For more than one discriminant direction, the equivalence does not hold any more, since the Lasso penalty does not result in an equivalent quadratic penalty in the simple form $\mathrm{tr}\!\left( B^\top \Omega B \right)$.


5 GLOSS Algorithm

The efficient approaches developed for the Lasso take advantage of the sparsity of the solution by solving a series of small linear systems, whose sizes are incrementally increased or decreased (Osborne et al., 2000a). This approach was also pursued for the group-Lasso in its standard formulation (Roth and Fischer, 2008). We adapt this algorithmic framework to the variational form (4.21), with $J(B) = \frac{1}{2}\|\mathbf{Y}\Theta - \mathbf{X}B\|_F^2$.

The algorithm belongs to the working set family of optimization methods (see Section 2.3.6). It starts from a sparse initial guess, say B = 0, thus defining the set A of "active" variables, currently identified as non-zero. Then it iterates the three steps summarized below.

1. Update the coefficient matrix B within the current active set A, where the optimization problem is smooth. First, the quadratic penalty is updated, and then a standard penalized least squares fit is computed.

2. Check the optimality conditions (4.32) with respect to the active variables. One or more $\beta^j$ may be declared inactive when they vanish from the current solution.

3. Check the optimality conditions (4.32) with respect to the inactive variables. If they are satisfied, the algorithm returns the current solution, which is optimal. If they are not satisfied, the variable corresponding to the greatest violation is added to the active set.

This mechanism is graphically represented in Figure 5.1 as a block diagram, and formalized in more detail in Algorithm 1. Note that this formulation uses the equations from the variational approach detailed in Section 4.3.1. If we want to use the alternative variational approach from Appendix D, then we have to replace Equations (4.21), (4.32a) and (4.32b) by (D.1), (D.10a) and (D.10b), respectively.

5.1 Regression Coefficients Updates

Step 1 of Algorithm 1 updates the coefficient matrix B within the current active set A. The quadratic variational form of the problem suggests a blockwise optimization strategy, consisting in solving (K−1) independent card(A)-dimensional problems instead of a single (K−1) × card(A)-dimensional problem. The interaction between the (K−1) problems is relegated to the common adaptive quadratic penalty Ω. This decomposition is especially attractive as we then solve (K−1) similar systems
$$\left( \mathbf{X}_A^\top \mathbf{X}_A + \lambda\Omega \right) \beta_k = \mathbf{X}_A^\top \mathbf{Y} \theta_k^0 , \qquad (5.1)$$


[Figure 5.1 about here. The block diagram chains the following boxes: initialize the model (λ, B); form the active set {j : ||β^j||_2 > 0}; solve the p-OS problem so that B satisfies the first optimality condition; if some variable of the active set must become inactive, take it out of the active set and return to the p-OS problem; otherwise test the second optimality condition on the inactive set; if some inactive variable must become active, take it out of the inactive set and return to the p-OS problem; otherwise compute Θ, update B and stop.]

Figure 5.1: GLOSS block diagram.


Algorithm 1 Adaptively Penalized Optimal Scoring

Input: X, Y, B, λ
Initialize: A ← { j ∈ {1, ..., p} : ||β^j||_2 > 0 };  Θ^0 such that n^{-1} Θ^{0⊤} Y^⊤ Y Θ^0 = I_{K-1};  convergence ← false
repeat
  % Step 1: solve (4.21) in B, assuming A optimal
  repeat
    Ω ← diag(Ω_A), with ω_j ← ||β^j||_2^{-1}
    B_A ← (X_A^⊤ X_A + λΩ)^{-1} X_A^⊤ Y Θ^0
  until condition (4.32a) holds for all j ∈ A
  % Step 2: identify inactivated variables
  for j ∈ A such that ||β^j||_2 = 0 do
    if optimality condition (4.32b) holds then
      A ← A \ {j}
      go back to Step 1
    end if
  end for
  % Step 3: check the greatest violation of optimality condition (4.32b) in the inactive set
  ĵ ← argmax_{j ∉ A} ||∂J/∂β^j||_2
  if ||∂J/∂β^ĵ||_2 < λ then
    convergence ← true      % B is optimal
  else
    A ← A ∪ {ĵ}
  end if
until convergence
(s, V) ← eigenanalyze(Θ^{0⊤} Y^⊤ X_A B), that is, Θ^{0⊤} Y^⊤ X_A B V_k = s_k V_k, k = 1, ..., K−1
Θ ← Θ^0 V;  B ← B V;  α_k ← n^{-1/2} s_k^{1/2}, k = 1, ..., K−1
Output: Θ, B, α


where $\mathbf{X}_A$ denotes the columns of X indexed by A, and $\beta_k$ and $\theta_k^0$ denote the $k$th columns of B and $\Theta^0$, respectively. These linear systems only differ in the right-hand-side term, so that a single Cholesky decomposition is necessary to solve all systems, whereas a blockwise Newton–Raphson method based on the standard group-Lasso formulation would result in different "penalties" Ω for each system.

5.1.1 Cholesky Decomposition

Dropping the subscripts and considering the (K−1) systems together, (5.1) leads to
$$\left( \mathbf{X}^\top\mathbf{X} + \lambda\Omega \right) B = \mathbf{X}^\top\mathbf{Y}\Theta . \qquad (5.2)$$
Defining the Cholesky decomposition $\mathbf{C}^\top\mathbf{C} = \mathbf{X}^\top\mathbf{X} + \lambda\Omega$, (5.2) is solved efficiently as follows:
$$\mathbf{C}^\top\mathbf{C}\, B = \mathbf{X}^\top\mathbf{Y}\Theta$$
$$\mathbf{C}\, B = \mathbf{C}^\top \backslash\, \mathbf{X}^\top\mathbf{Y}\Theta$$
$$B = \mathbf{C} \backslash \left( \mathbf{C}^\top \backslash\, \mathbf{X}^\top\mathbf{Y}\Theta \right) , \qquad (5.3)$$
where the symbol "\" is the matlab mldivide operator, which solves linear systems efficiently. The GLOSS code implements (5.3).
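As an illustration, a minimal MATLAB sketch of this solve (not the distributed GLOSS code; XA, YTheta0 and omega are assumed to hold the active columns of X, the product YΘ^0 and the diagonal of Ω, respectively) is:

  G   = XA' * XA + lambda * diag(omega);  % Gram matrix plus adaptive penalty, as in (5.2)
  C   = chol(G);                          % upper triangular factor, C'*C = G
  RHS = XA' * YTheta0;                    % the (K-1) right-hand sides, all at once
  B   = C \ (C' \ RHS);                   % two triangular solves implement (5.3)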

5.1.2 Numerical Stability

The OS regression coefficients are obtained by (5.2), where the penalizer Ω is iteratively updated by (4.33). In this iterative process, when a variable is about to leave the active set, the corresponding entry of Ω reaches large values, thereby driving some OS regression coefficients to zero. These large values may cause numerical stability problems in the Cholesky decomposition of $\mathbf{X}^\top\mathbf{X} + \lambda\Omega$. This difficulty can be avoided using the following equivalent expression:
$$B = \Omega^{-1/2} \left( \Omega^{-1/2}\mathbf{X}^\top\mathbf{X}\Omega^{-1/2} + \lambda\mathbf{I} \right)^{-1} \Omega^{-1/2}\mathbf{X}^\top\mathbf{Y}\Theta^0 , \qquad (5.4)$$
where the conditioning of $\Omega^{-1/2}\mathbf{X}^\top\mathbf{X}\Omega^{-1/2} + \lambda\mathbf{I}$ is always well-behaved provided X is appropriately normalized (recall that $0 \le 1/\omega_j \le 1$). This more stable expression demands more computation and is thus reserved for cases with large $\omega_j$ values. Our code is otherwise based on expression (5.2).
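A sketch of this rescaled solve, under the same assumptions as above, is:

  Ois = diag(1 ./ sqrt(omega));                              % Omega^{-1/2}
  M   = Ois * (XA' * XA) * Ois + lambda * eye(numel(omega)); % well-conditioned system of (5.4)
  B   = Ois * (M \ (Ois * (XA' * YTheta0)));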

5.2 Score Matrix

The optimal score matrix Θ is made of the K−1 leading eigenvectors of $\mathbf{Y}^\top\mathbf{X}\left(\mathbf{X}^\top\mathbf{X} + \lambda\Omega\right)^{-1}\mathbf{X}^\top\mathbf{Y}$. This eigen-analysis is actually solved in the form $\Theta^\top\mathbf{Y}^\top\mathbf{X}\left(\mathbf{X}^\top\mathbf{X} + \lambda\Omega\right)^{-1}\mathbf{X}^\top\mathbf{Y}\Theta$ (see Section 4.2.1 and Appendix B). The latter eigenvector decomposition does not require the costly computation of $\left(\mathbf{X}^\top\mathbf{X} + \lambda\Omega\right)^{-1}$, which


involves the inversion of a p × p matrix. Let $\Theta^0$ be an arbitrary K × (K−1) matrix whose range includes the K−1 leading eigenvectors of $\mathbf{Y}^\top\mathbf{X}\left(\mathbf{X}^\top\mathbf{X} + \lambda\Omega\right)^{-1}\mathbf{X}^\top\mathbf{Y}$.¹ Then, solving the K−1 systems (5.3) provides the value of $B^0 = \left(\mathbf{X}^\top\mathbf{X} + \lambda\Omega\right)^{-1}\mathbf{X}^\top\mathbf{Y}\Theta^0$. This $B^0$ matrix can be identified in the expression to eigenanalyze as
$$\Theta^{0\top}\mathbf{Y}^\top\mathbf{X}\left(\mathbf{X}^\top\mathbf{X} + \lambda\Omega\right)^{-1}\mathbf{X}^\top\mathbf{Y}\Theta^0 = \Theta^{0\top}\mathbf{Y}^\top\mathbf{X}B^0 .$$
Thus, the solution to the penalized OS problem can be computed through the singular value decomposition of the (K−1) × (K−1) matrix $\Theta^{0\top}\mathbf{Y}^\top\mathbf{X}B^0 = \mathbf{V}\Lambda\mathbf{V}^\top$. Defining $\Theta = \Theta^0\mathbf{V}$, we have $\Theta^\top\mathbf{Y}^\top\mathbf{X}\left(\mathbf{X}^\top\mathbf{X} + \lambda\Omega\right)^{-1}\mathbf{X}^\top\mathbf{Y}\Theta = \Lambda$, and when $\Theta^0$ is chosen such that $n^{-1}\Theta^{0\top}\mathbf{Y}^\top\mathbf{Y}\Theta^0 = \mathbf{I}_{K-1}$, we also have $n^{-1}\Theta^\top\mathbf{Y}^\top\mathbf{Y}\Theta = \mathbf{I}_{K-1}$, so that the constraints of the p-OS problem hold. Hence, assuming that the diagonal elements of Λ are sorted in decreasing order, Θ is an optimal solution to the p-OS problem. Finally, once Θ has been computed, the corresponding optimal regression coefficients B satisfying (5.2) are simply recovered using the mapping from $\Theta^0$ to Θ, that is, $B = B^0\mathbf{V}$. Appendix E details why the computational trick described here for quadratic penalties can be applied to the group-Lasso, for which Ω is defined by a variational formulation.
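As an illustration, this score-matrix update can be sketched in a few lines of MATLAB (not the distributed GLOSS code; B0 is assumed to have been obtained from (5.3) for an arbitrary feasible Theta0):

  M          = Theta0' * (Y' * (X * B0));  % (K-1)x(K-1) matrix Theta0' Y' X B0
  [V, L]     = eig((M + M') / 2);          % symmetric PSD up to round-off, so eig matches the SVD
  [s, order] = sort(diag(L), 'descend');
  V          = V(:, order);
  Theta      = Theta0 * V;                 % optimal scores
  B          = B0 * V;                     % matching regression coefficients
  alpha      = sqrt(s / n);                % alpha_k = n^{-1/2} s_k^{1/2}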

5.3 Optimality Conditions

GLOSS uses an active set optimization technique to obtain the optimal values of the coefficient matrix B and the score matrix Θ. To be a solution, the coefficient matrix must obey Lemmas 4.3 and 4.4. Optimality conditions (4.32a) and (4.32b) can be deduced from those lemmas. Both expressions require the computation of the gradient of the objective function
$$\frac{1}{2}\|\mathbf{Y}\Theta - \mathbf{X}B\|_F^2 + \lambda \sum_{j=1}^{p} w_j \|\beta^j\|_2 . \qquad (5.5)$$
Let J(B) be the data-fitting term $\frac{1}{2}\|\mathbf{Y}\Theta - \mathbf{X}B\|_F^2$. Its gradient with respect to the $j$th row of B, $\beta^j$, is the (K−1)-dimensional vector
$$\frac{\partial J(B)}{\partial \beta^j} = \mathbf{x}_j^\top \left( \mathbf{X}B - \mathbf{Y}\Theta \right) ,$$
where $\mathbf{x}_j$ is the $j$th column of X. Hence, the first optimality condition (4.32a) can be computed for every variable j as
$$\mathbf{x}_j^\top \left( \mathbf{X}B - \mathbf{Y}\Theta \right) + \lambda w_j \frac{\beta^j}{\|\beta^j\|_2} = 0 .$$

1. As X is centered, $\mathbf{1}_K$ belongs to the null space of $\mathbf{Y}^\top\mathbf{X}\left(\mathbf{X}^\top\mathbf{X} + \lambda\Omega\right)^{-1}\mathbf{X}^\top\mathbf{Y}$. It is thus sufficient to choose $\Theta^0$ orthogonal to $\mathbf{1}_K$ to ensure that its range spans the leading eigenvectors of $\mathbf{Y}^\top\mathbf{X}\left(\mathbf{X}^\top\mathbf{X} + \lambda\Omega\right)^{-1}\mathbf{X}^\top\mathbf{Y}$. In practice, to comply with this desideratum and conditions (3.5b) and (3.5c), we set $\Theta^0 = \left(\mathbf{Y}^\top\mathbf{Y}\right)^{-1/2}\mathbf{U}$, where U is a K × (K−1) matrix whose columns are orthonormal vectors orthogonal to $\mathbf{1}_K$.


The second optimality condition (4.32b) can be computed for every variable j as
$$\left\| \mathbf{x}_j^\top \left( \mathbf{X}B - \mathbf{Y}\Theta \right) \right\|_2 \le \lambda w_j .$$
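Both tests are cheap to evaluate; a minimal MATLAB sketch (assuming w is the column vector of penalty weights) is:

  R      = X * B - Y * Theta;                 % residuals
  grad   = X' * R;                            % j-th row: x_j'(XB - Y*Theta) = dJ/dbeta^j
  nrmB   = sqrt(sum(B.^2, 2));                % ||beta^j||_2
  act    = nrmB > 0;                          % current support S
  viol_a = max(max(abs(grad(act,:) + lambda * diag(w(act)./nrmB(act)) * B(act,:))));  % (4.32a)
  viol_b = max(sqrt(sum(grad(~act,:).^2, 2)) - lambda * w(~act));                     % (4.32b)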

5.4 Active and Inactive Sets

The feature selection mechanism embedded in GLOSS selects the variables that provide the greatest decrease in the objective function. This is accomplished by means of the optimality conditions (4.32a) and (4.32b). Let A be the active set, containing the variables that have already been considered relevant. A variable j can be considered for inclusion in the active set if it violates the second optimality condition. We proceed one variable at a time, by choosing the one that is expected to produce the greatest decrease in the objective function:
$$j^\star = \operatorname*{argmax}_{j} \ \max\!\left( \left\| \mathbf{x}_j^\top \left( \mathbf{X}B - \mathbf{Y}\Theta \right) \right\|_2 - \lambda w_j ,\; 0 \right) .$$
The exclusion of a variable belonging to the active set A is considered if the norm $\|\beta^j\|_2$ is small and if, after setting $\beta^j$ to zero, the following optimality condition holds:
$$\left\| \mathbf{x}_j^\top \left( \mathbf{X}B - \mathbf{Y}\Theta \right) \right\|_2 \le \lambda w_j .$$
The process continues until no variable in the active set violates the first optimality condition and no variable in the inactive set violates the second optimality condition.

5.5 Penalty Parameter

The penalty parameter can be specified by the user, in which case GLOSS solves the problem with this value of λ. The other strategy is to compute the solution path for several values of λ: GLOSS then looks for the maximum value of the penalty parameter, $\lambda_{\max}$, such that $B \neq 0$, and solves the p-OS problem for decreasing values of λ until a prescribed number of features are declared active.

The maximum value of the penalty parameter, $\lambda_{\max}$, corresponding to a null B matrix, is obtained by evaluating the optimality condition (4.32b) at B = 0:
$$\lambda_{\max} = \max_{j \in \{1, \ldots, p\}} \ \frac{1}{w_j} \left\| \mathbf{x}_j^\top \mathbf{Y}\Theta^0 \right\|_2 .$$
The algorithm then computes a series of solutions along the regularization path defined by a series of penalties $\lambda_1 = \lambda_{\max} > \cdots > \lambda_t > \cdots > \lambda_T = \lambda_{\min} \ge 0$, by regularly decreasing the penalty, $\lambda_{t+1} = \lambda_t / 2$, and using a warm-start strategy, where the feasible initial guess for $B(\lambda_{t+1})$ is initialized with $B(\lambda_t)$. The final penalty parameter $\lambda_{\min}$ is determined during the optimization process, when the maximum number of desired active variables is attained (by default, the minimum of n and p).
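A minimal MATLAB sketch of this path strategy is given below; gloss_fit is a hypothetical helper standing for one run of Algorithm 1 at a fixed λ (it is not a function of the GLOSS package), and w, Theta0, n, p and K are assumed to be available:

  lambda_max = max(sqrt(sum((X' * (Y * Theta0)).^2, 2)) ./ w);  % condition (4.32b) at B = 0
  lambda = lambda_max;  B = zeros(p, K-1);
  while nnz(any(B, 2)) < min(n, p) && lambda > lambda_max / 2^20
      [B, Theta] = gloss_fit(X, Y, B, lambda);  % warm start from the previous B
      lambda = lambda / 2;                      % lambda_{t+1} = lambda_t / 2
  end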


5.6 Options and Variants

5.6.1 Scaling Variables

As with most penalization schemes, GLOSS is sensitive to the scaling of the variables. It thus makes sense to normalize them before applying the algorithm or, equivalently, to accommodate weights in the penalty. This option is available in the algorithm.

5.6.2 Sparse Variant

This version replaces some matlab commands used in the standard version of GLOSS by their sparse equivalents. In addition, some mathematical structures are adapted for sparse computation.

5.6.3 Diagonal Variant

We motivated the group-Lasso penalty by sparsity requisites, but robustness considerations could also drive its usage, since LDA is known to be unstable when the number of examples is small compared to the number of variables. In this context, LDA has been experimentally observed to benefit from unrealistic assumptions on the form of the estimated within-class covariance matrix. Indeed, the diagonal approximation, which ignores correlations between genes, may lead to better classification in microarray analysis: Bickel and Levina (2004) showed that this crude approximation provides a classifier with better worst-case performance than the LDA decision rule in small sample size regimes, even if variables are correlated.

The equivalence proof between penalized OS and penalized LDA (Hastie et al., 1995) reveals that quadratic penalties in the OS problem are equivalent to penalties on the within-class covariance matrix in the LDA formulation. This proof suggests a slight variant of penalized OS, corresponding to penalized LDA with a diagonal within-class covariance matrix, where the least squares problems
$$\min_{B \in \mathbb{R}^{p \times (K-1)}} \|\mathbf{Y}\Theta - \mathbf{X}B\|_F^2 = \min_{B \in \mathbb{R}^{p \times (K-1)}} \mathrm{tr}\!\left( \Theta^\top\mathbf{Y}^\top\mathbf{Y}\Theta - 2\Theta^\top\mathbf{Y}^\top\mathbf{X}B + n B^\top\Sigma_T B \right)$$
are replaced by
$$\min_{B \in \mathbb{R}^{p \times (K-1)}} \mathrm{tr}\!\left( \Theta^\top\mathbf{Y}^\top\mathbf{Y}\Theta - 2\Theta^\top\mathbf{Y}^\top\mathbf{X}B + n B^\top\left( \Sigma_B + \mathrm{diag}(\Sigma_W) \right) B \right) .$$
Note that this variant only requires $\mathrm{diag}(\Sigma_W) + \Sigma_B + n^{-1}\Omega$ to be positive definite, which is a weaker requirement than $\Sigma_T + n^{-1}\Omega$ positive definite.

5.6.4 Elastic Net and Structured Variant

For some learning problems, the structure of correlations between variables is partially known. Hastie et al. (1995) applied this idea to the field of handwritten digit recognition, for their penalized discriminant analysis model, to constrain the discriminant directions to be spatially smooth.


[Figure 5.2 shows a 3×3 pixel grid, numbered 1–3 on the bottom row, 4–6 on the middle row and 7–9 on the top row, together with its neighborhood graph and the corresponding Laplacian matrix:]
$$\Omega_L = \begin{pmatrix}
 3 & -1 &  0 & -1 & -1 &  0 &  0 &  0 &  0 \\
-1 &  5 & -1 & -1 & -1 & -1 &  0 &  0 &  0 \\
 0 & -1 &  3 &  0 & -1 & -1 &  0 &  0 &  0 \\
-1 & -1 &  0 &  5 & -1 &  0 & -1 & -1 &  0 \\
-1 & -1 & -1 & -1 &  8 & -1 & -1 & -1 & -1 \\
 0 & -1 & -1 &  0 & -1 &  5 &  0 & -1 & -1 \\
 0 &  0 &  0 & -1 & -1 &  0 &  3 & -1 &  0 \\
 0 &  0 &  0 & -1 & -1 & -1 & -1 &  5 & -1 \\
 0 &  0 &  0 &  0 & -1 & -1 &  0 & -1 &  3
\end{pmatrix}$$

Figure 5.2: Graph and Laplacian matrix for a 3×3 image.

When an image is represented as a vector of pixels, it is reasonable to assume positive correlations between the variables corresponding to neighboring pixels. Figure 5.2 represents the neighborhood graph of the pixels of a 3×3 image, with the corresponding Laplacian matrix. The Laplacian matrix $\Omega_L$ is positive semi-definite, and the penalty $\beta^\top\Omega_L\beta$ favors, among vectors of identical $L_2$ norm, the ones having similar coefficients in the neighborhoods of the graph. For example, this penalty is 9 for the vector $(1,1,0,1,1,0,0,0,0)^\top$, which is the indicator of the neighborhood of pixel 1, and it is 21 for the vector $(-1,1,0,1,1,0,0,0,0)^\top$, with a sign mismatch between pixel 1 and its neighbors.

This smoothness penalty can be imposed jointly with the group-Lasso. From the computational point of view, GLOSS hardly needs to be modified: the smoothness penalty has just to be added to the group-Lasso penalty. As the new penalty is convex and quadratic (thus smooth), there is no additional burden in the overall algorithm. There is, however, an additional hyperparameter to be tuned.
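As an illustration, such a Laplacian penalty for an r × c pixel grid with 8-connected neighborhoods can be built in a few lines of MATLAB (a sketch; the GLOSS package may construct it differently):

  r = 16; c = 16;                      % image size (16x16 for the USPS digits)
  idx = reshape(1:r*c, r, c);          % pixel numbering
  A = zeros(r*c);                      % adjacency matrix of the neighborhood graph
  for i = 1:r
    for j = 1:c
      for di = -1:1
        for dj = -1:1
          if (di ~= 0 || dj ~= 0) && i+di >= 1 && i+di <= r && j+dj >= 1 && j+dj <= c
            A(idx(i,j), idx(i+di,j+dj)) = 1;
          end
        end
      end
    end
  end
  Omega_L = diag(sum(A, 2)) - A;       % graph Laplacian: degrees minus adjacency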


6 Experimental Results

This chapter presents some comparison results between the Group-Lasso Optimal Scoring Solver algorithm and two other state-of-the-art classifiers proposed to perform sparse LDA. Those algorithms are penalized LDA (PLDA) (Witten and Tibshirani, 2011), which applies a Lasso penalty within Fisher's LDA framework, and Sparse Linear Discriminant Analysis (SLDA) (Clemmensen et al., 2011), which applies an Elastic net penalty to the OS problem. With the aim of testing the parsimony capabilities, the latter algorithm was tested without any quadratic penalty, that is, with a Lasso penalty. The implementations of PLDA and SLDA are available from the authors' websites; PLDA is an R implementation and SLDA is coded in matlab. All the experiments used the same training, validation and test sets. Note that they differ significantly from the ones of Witten and Tibshirani (2011) in Simulation 4, for which there was a typo in their paper.

6.1 Normalization

With shrunken estimates, the scaling of features has important outcomes. For the linear discriminants considered here, the two most common normalization strategies consist in setting to ones either the diagonal of the total covariance matrix $\Sigma_T$ or the diagonal of the within-class covariance matrix $\Sigma_W$. These options can be implemented either by scaling the observations accordingly prior to the analysis, or by providing penalties with weights. The latter option is implemented in our matlab package.¹

6.2 Decision Thresholds

The derivations of LDA based on the analysis of variance or on the regression of class indicators do not rely on the normality of the class-conditional distribution of the observations. Hence, their applicability extends beyond the realm of Gaussian data. Based on this observation, Friedman et al. (2009, Chapter 4) suggest investigating other decision thresholds than the ones stemming from the Gaussian mixture assumption. In particular, they propose to select the decision thresholds that empirically minimize the training error. This option was tested using validation sets or cross-validation.

1. The GLOSS matlab code can be found in the software section of www.hds.utc.fr/~grandval


6.3 Simulated Data

We first compare the three techniques in the simulation study of Witten and Tibshirani (2011), which considers four setups with 1200 examples equally distributed between classes. They are split into a training set of size n = 100, a validation set of size 100, and a test set of size 1000. We are in the small sample regime, with p = 500 variables, out of which 100 differ between classes. Independent variables are generated for all simulations except for Simulation 2, where they are slightly correlated. In Simulations 2 and 3, classes are optimally separated by a single projection of the original variables, while the two other scenarios require three discriminant directions. The Bayes' error was estimated to be respectively 1.7%, 6.7%, 7.3% and 30.0%. The exact definition of every setup, as provided in Witten and Tibshirani (2011), is the following.

Simulation 1: mean shift with independent features. There are four classes. If sample i is in class k, then $x_i \sim N(\mu_k, \mathbf{I})$, where $\mu_{1j} = 0.7 \times 1_{(1 \le j \le 25)}$, $\mu_{2j} = 0.7 \times 1_{(26 \le j \le 50)}$, $\mu_{3j} = 0.7 \times 1_{(51 \le j \le 75)}$, $\mu_{4j} = 0.7 \times 1_{(76 \le j \le 100)}$.

Simulation 2: mean shift with dependent features. There are two classes. If sample i is in class 1, then $x_i \sim N(\mathbf{0}, \Sigma)$, and if i is in class 2, then $x_i \sim N(\mu, \Sigma)$, with $\mu_j = 0.6 \times 1_{(j \le 200)}$. The covariance structure is block diagonal with 5 blocks, each of dimension 100 × 100. The blocks have $(j, j')$ element $0.6^{|j-j'|}$. This covariance structure is intended to mimic gene expression data correlation.

Simulation 3: one-dimensional mean shift with independent features. There are four classes and the features are independent. If sample i is in class k, then $X_{ij} \sim N\!\left(\frac{k-1}{3}, 1\right)$ if $j \le 100$, and $X_{ij} \sim N(0, 1)$ otherwise.

Simulation 4: mean shift with independent features and no linear ordering. There are four classes. If sample i is in class k, then $x_i \sim N(\mu_k, \mathbf{I})$, with mean vectors defined as follows: $\mu_{1j} \sim N(0, 0.3^2)$ for $j \le 25$ and $\mu_{1j} = 0$ otherwise; $\mu_{2j} \sim N(0, 0.3^2)$ for $26 \le j \le 50$ and $\mu_{2j} = 0$ otherwise; $\mu_{3j} \sim N(0, 0.3^2)$ for $51 \le j \le 75$ and $\mu_{3j} = 0$ otherwise; $\mu_{4j} \sim N(0, 0.3^2)$ for $76 \le j \le 100$ and $\mu_{4j} = 0$ otherwise.

Note that this protocol is detrimental to GLOSS, as each relevant variable only affects a single class mean out of K. The setup is favorable to PLDA in the sense that most within-class covariance matrices are diagonal. We thus also tested the diagonal GLOSS variant discussed in Section 5.6.3.

The results are summarized in Table 6.1. Overall, the best predictions are performed by PLDA and GLOSS-D, which both benefit from the knowledge of the true within-class covariance structure. Then, among SLDA and GLOSS, which both ignore this structure, our proposal has a clear edge. The error rates are far away from the Bayes' error rates, but the sample size is small with regard to the number of relevant variables. Regarding sparsity, the clear overall winner is GLOSS, followed far away by SLDA, which is the only


Table 6.1: Experimental results for simulated data: averages (with standard deviations) computed over 25 repetitions of the test error rate, the number of selected variables and the number of discriminant directions selected on the validation set.

                 Err (%)       Var            Dir
  Sim. 1: K = 4, mean shift, ind. features
    PLDA         12.6 (0.1)    411.7 (3.7)    3.0 (0.0)
    SLDA         31.9 (0.1)    228.0 (0.2)    3.0 (0.0)
    GLOSS        19.9 (0.1)    106.4 (1.3)    3.0 (0.0)
    GLOSS-D      11.2 (0.1)    251.1 (4.1)    3.0 (0.0)
  Sim. 2: K = 2, mean shift, dependent features
    PLDA          9.0 (0.4)    337.6 (5.7)    1.0 (0.0)
    SLDA         19.3 (0.1)     99.0 (0.0)    1.0 (0.0)
    GLOSS        15.4 (0.1)     39.8 (0.8)    1.0 (0.0)
    GLOSS-D       9.0 (0.0)    203.5 (4.0)    1.0 (0.0)
  Sim. 3: K = 4, 1D mean shift, ind. features
    PLDA         13.8 (0.6)    161.5 (3.7)    1.0 (0.0)
    SLDA         57.8 (0.2)    152.6 (2.0)    1.9 (0.0)
    GLOSS        31.2 (0.1)    123.8 (1.8)    1.0 (0.0)
    GLOSS-D      18.5 (0.1)    357.5 (2.8)    1.0 (0.0)
  Sim. 4: K = 4, mean shift, ind. features
    PLDA         60.3 (0.1)    336.0 (5.8)    3.0 (0.0)
    SLDA         65.9 (0.1)    208.8 (1.6)    2.7 (0.0)
    GLOSS        60.7 (0.2)     74.3 (2.2)    2.7 (0.0)
    GLOSS-D      58.8 (0.1)    162.7 (4.9)    2.9 (0.0)


[Figure 6.1 about here: scatter plot of TPR (y-axis, 20–100%) versus FPR (x-axis, 0–80%), with one marker per algorithm (GLOSS, GLOSS-D, SLDA, PLDA) and per simulation (Simulations 1–4).]

Figure 6.1: TPR versus FPR (in %) for all algorithms and simulations.

Table 6.2: Average TPR and FPR (in %) computed over 25 repetitions.

             Simulation 1     Simulation 2     Simulation 3     Simulation 4
             TPR    FPR       TPR    FPR       TPR    FPR       TPR    FPR
  PLDA       99.0   78.2      96.9   60.3      98.0   15.9      74.3   65.6
  SLDA       73.9   38.5      33.8   16.3      41.6   27.8      50.7   39.5
  GLOSS      64.1   10.6      30.0    4.6      51.1   18.2      26.0   12.1
  GLOSS-D    93.5   39.4      92.1   28.1      95.6   65.5      42.9   29.9

method that does not succeed in uncovering a low-dimensional representation in Simulation 3. The adequacy of the selected features was assessed by the True Positive Rate (TPR) and the False Positive Rate (FPR). The TPR is the fraction of relevant variables that are selected; similarly, the FPR is the fraction of irrelevant variables that are selected. The best algorithm would be the one that selects all the relevant variables and rejects all the others, that is, TPR = 1 and FPR = 0 simultaneously. PLDA has the best TPR but a terrible FPR, except in Simulation 3, where it dominates all the other methods. GLOSS has by far the best FPR, with an overall TPR slightly below SLDA. Results are displayed in Figure 6.1 and in Table 6.2 (both in percentages).

6.4 Gene Expression Data

We now compare GLOSS to PLDA and SLDA on three genomic datasets. The Nakayama² dataset contains 105 examples of 22,283 gene expressions for categorizing 10 soft tissue tumors; it was reduced to the 86 examples belonging to the 5 dominant categories (Witten and Tibshirani, 2011). The Ramaswamy³ dataset contains 198 examples of 16,063 gene expressions for categorizing 14 classes of cancer.

2. http://www.broadinstitute.org/cancer/software/genepattern/datasets
3. http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS2736


Table 6.3: Experimental results for gene expression data: averages over 10 training/test set splits (with standard deviations) of the test error rates and the number of selected variables.

                                        Err (%)         Var
  Nakayama: n = 86, p = 22,283, K = 5
    PLDA                                20.95 (1.3)     10478.7 (2116.3)
    SLDA                                25.71 (1.7)       252.5 (3.1)
    GLOSS                               20.48 (1.4)       129.0 (18.6)
  Ramaswamy: n = 198, p = 16,063, K = 14
    PLDA                                38.36 (6.0)     14873.5 (720.3)
    SLDA                                —               —
    GLOSS                               20.61 (6.9)       372.4 (122.1)
  Sun: n = 180, p = 54,613, K = 4
    PLDA                                33.78 (5.9)     21634.8 (7443.2)
    SLDA                                36.22 (6.5)       384.4 (16.5)
    GLOSS                               31.77 (4.5)        93.0 (93.6)

Finally, the Sun⁴ dataset contains 180 examples of 54,613 gene expressions for categorizing 4 classes of tumors.

Each dataset was split into a training set and a test set with respectively 75% and 25% of the examples. Parameter tuning is performed by 10-fold cross-validation, and the test performances are then evaluated. The process is repeated 10 times, with random choices of the training and test set split.

Test error rates and the number of selected variables are presented in Table 6.3. The results for the PLDA algorithm are extracted from Witten and Tibshirani (2011). The three methods have comparable prediction performances on the Nakayama and Sun datasets, but GLOSS performs better on the Ramaswamy data, where the SparseLDA package failed to return a solution due to numerical problems in the LARS-EN implementation. Regarding the number of selected variables, GLOSS is again much sparser than its competitors.

Finally, Figure 6.2 displays the projection of the observations of the Nakayama and Sun datasets on the first canonical planes estimated by GLOSS and SLDA. For the Nakayama dataset, groups 1 and 2 are well separated from the other ones in both representations, but GLOSS is more discriminant in the meta-cluster gathering groups 3 to 5. For the Sun dataset, SLDA suffers from a high colinearity of its first canonical variables, which renders the second one almost non-informative. As a result, group 1 is better separated in the first canonical plane with GLOSS.

4. http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS1962


[Figure 6.2 about here: a 2×2 panel of scatter plots in the plane of the first two discriminant directions ("1st discriminant" versus "2nd discriminant"), with columns GLOSS and SLDA and rows Nakayama and Sun. Nakayama classes: 1) Synovial sarcoma, 2) Myxoid liposarcoma, 3) Dedifferentiated liposarcoma, 4) Myxofibrosarcoma, 5) Malignant fibrous histiocytoma. Sun classes: 1) NonTumor, 2) Astrocytomas, 3) Glioblastomas, 4) Oligodendrogliomas.]

Figure 6.2: 2D-representations of the Nakayama and Sun datasets based on the first two discriminant vectors provided by GLOSS and SLDA. The big squares represent class means.


Figure 6.3: USPS digits "1" and "0".

6.5 Correlated Data

When the features are known to be highly correlated, the discrimination algorithm can be improved by using this information in the optimization problem. The structured variant of GLOSS presented in Section 5.6.4, S-GLOSS from now on, was conceived to easily introduce this prior knowledge.

The experiments described in this section are intended to illustrate the effect of combining the group-Lasso sparsity-inducing penalty with a quadratic penalty used as a surrogate of the unknown within-class variance matrix. This preliminary experiment does not include comparisons with other algorithms; more comprehensive experimental results have been left for future work.

For this illustration, we have used a subset of the USPS handwritten digit dataset, made of 16×16 pixel images representing digits from 0 to 9. For our purpose, we compare the discriminant direction that separates digits "1" and "0", computed with GLOSS and S-GLOSS. The mean image of each digit is shown in Figure 6.3.

As in Section 5.6.4, we encode the pixel proximity relationships of Figure 5.2 into a penalty matrix $\Omega_L$, but this time for a 256-node graph. Introducing this new 256×256 Laplacian penalty matrix $\Omega_L$ in the GLOSS algorithm is straightforward.

The effect of this penalty is fairly evident in Figure 6.4, where the discriminant vector β resulting from a non-penalized execution of GLOSS is compared with the β resulting from a Laplacian-penalized execution of S-GLOSS (without group-Lasso penalty). We perfectly distinguish the center of the digit "0" in the discriminant direction obtained by S-GLOSS, which is probably the most important element to discriminate both digits.

Figure 6.5 displays the discriminant direction β obtained by GLOSS and S-GLOSS for a non-zero group-Lasso penalty, with an identical penalization parameter (λ = 0.3). Even if both solutions are sparse, the discriminant vector from S-GLOSS keeps connected pixels, which allows the detection of strokes and will probably provide better prediction results.


[Figure panels: β for GLOSS (left) and β for S-GLOSS (right).]

Figure 6.4: Discriminant direction between digits "1" and "0".

[Figure panels: β for GLOSS with λ = 0.3 (left) and β for S-GLOSS with λ = 0.3 (right).]

Figure 6.5: Sparse discriminant direction between digits "1" and "0".


Discussion

GLOSS is an efficient algorithm that performs sparse LDA based on the regression of class indicators. Our proposal is equivalent to a penalized LDA problem. This is, to our knowledge, the first approach that enjoys this property in the multi-class setting. This relationship is also amenable to accommodate interesting constraints on the equivalent penalized LDA problem, such as imposing a diagonal structure on the within-class covariance matrix.

Computationally, GLOSS is based on an efficient active set strategy that is amenable to the processing of problems with a large number of variables. The inner optimization problem decouples the p × (K−1)-dimensional problem into (K−1) independent p-dimensional problems. The interaction between the (K−1) problems is relegated to the computation of the common adaptive quadratic penalty. The algorithm presented here is highly efficient in medium to high dimensional setups, which makes it a good candidate for the analysis of gene expression data.

The experimental results confirm the relevance of the approach, which behaves well compared to its competitors, regarding either its prediction abilities or its interpretability (sparsity). Generally, compared to the competing approaches, GLOSS provides extremely parsimonious discriminants without compromising prediction performance. Employing the same features in all discriminant directions enables the generation of models that are globally extremely parsimonious, with good prediction abilities. The resulting sparse discriminant directions also allow for visual inspection of the data, from the low-dimensional representations that can be produced.

The approach has many potential extensions that have not yet been implemented. A first line of development is to consider a broader class of penalties. For example, plain quadratic penalties can also be added to the group penalty to encode priors about the within-class covariance structure, in the spirit of the Penalized Discriminant Analysis of Hastie et al. (1995). Also, besides the group-Lasso, our framework can be customized to any penalty that is uniformly spread within groups, and many composite or hierarchical penalties that have been proposed for structured data meet this condition.


Part III

Sparse Clustering Analysis


Abstract

Clustering can be defined as the task of grouping samples such that all the elements belonging to one cluster are more "similar" to each other than to the objects belonging to the other groups. There are similarity measures for any data structure: database records, or even multimedia objects (audio, video). The similarity concept is closely related to the idea of distance, which is a specific dissimilarity.

Model-based clustering aims to describe a heterogeneous population with a probabilistic model that represents each group with its own distribution. Here, the distributions will be Gaussian, and the different populations are identified by different means and a common covariance matrix.

As in the supervised framework, traditional clustering techniques perform worse when the number of irrelevant features increases. In this part, we develop Mix-GLOSS, which builds on the supervised GLOSS algorithm to address unsupervised problems, resulting in a clustering mechanism with embedded feature selection.

Chapter 7 reviews different techniques for inducing sparsity in model-based clustering algorithms. The theory that motivates our original formulation of the EM algorithm is developed in Chapter 8, followed by the description of the algorithm in Chapter 9. Its performance is assessed and compared to other state-of-the-art model-based sparse clustering mechanisms in Chapter 10.


7 Feature Selection in Mixture Models

7.1 Mixture Models

One of the most popular clustering algorithms is K-means, which aims to partition n observations into K clusters, each observation being assigned to the cluster with the nearest mean (MacQueen, 1967). A generalization of K-means can be made through probabilistic models, which represent the K subpopulations by a mixture of distributions. Since their first use by Newcomb (1886) for the detection of outlier points, and 8 years later by Pearson (1894) to identify two separate populations of crabs, finite mixtures of distributions have been employed to model a wide variety of random phenomena. These models assume that measurements are taken from a set of individuals, each of which belongs to one out of a number of different classes, while any individual's particular class is unknown. Mixture models can thus address the heterogeneity of a population and are especially well suited to the problem of clustering.

7.1.1 Model

We assume that the observed data $\mathbf{X} = (x_1^\top, \ldots, x_n^\top)^\top$ have been drawn identically from K different subpopulations in the domain $\mathbb{R}^p$. The generative distribution is a finite mixture model, that is, the data are assumed to be generated from a compounded distribution whose density can be expressed as
$$f(x_i) = \sum_{k=1}^{K} \pi_k f_k(x_i) , \quad \forall i \in \{1, \ldots, n\} ,$$
where K is the number of components, $f_k$ are the densities of the components, and $\pi_k$ are the mixture proportions ($\pi_k \in\, ]0,1[$ for all k, and $\sum_k \pi_k = 1$). Mixture models transcribe that, given the proportions $\pi_k$ and the distributions $f_k$ for each class, the data are generated according to the following mechanism:

• y: each individual is allotted to a class according to a multinomial distribution with parameters $\pi_1, \ldots, \pi_K$;

• x: each $x_i$ is assumed to arise from a random vector with probability density function $f_k$.

In addition, it is usually assumed that the component densities $f_k$ belong to a parametric family of densities $\phi(\cdot\,; \theta_k)$. The density of the mixture can then be written as
$$f(x_i; \theta) = \sum_{k=1}^{K} \pi_k \phi(x_i; \theta_k) , \quad \forall i \in \{1, \ldots, n\} ,$$


where $\theta = (\pi_1, \ldots, \pi_K, \theta_1, \ldots, \theta_K)$ is the parameter of the model.

7.1.2 Parameter Estimation: The EM Algorithm

For the estimation of the parameters of the mixture model, Pearson (1894) used the method of moments to estimate the five parameters $(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \pi)$ of a univariate Gaussian mixture model with two components. That method required him to solve polynomial equations of degree nine. There are also graphical methods, maximum likelihood methods and Bayesian approaches.

The most widely used process to estimate the parameters is the maximization of the log-likelihood using the EM algorithm. It is typically used to maximize the likelihood of models with latent variables, for which no analytical solution is available (Dempster et al., 1977).

The EM algorithm iterates two steps, called the expectation step (E) and the maximization step (M). Each expectation step involves the computation of the likelihood expectation with respect to the hidden variables, while each maximization step estimates the parameters by maximizing the E-step expected likelihood.

Under mild regularity assumptions, this mechanism converges to a local maximum of the likelihood. However, the type of problems targeted is typically characterized by the existence of several local maxima, and global convergence cannot be guaranteed. In practice, the obtained solution depends on the initialization of the algorithm.

Maximum Likelihood Definitions

The likelihood is commonly expressed in its logarithmic version:
$$\mathcal{L}(\theta; \mathbf{X}) = \log\!\left( \prod_{i=1}^{n} f(x_i; \theta) \right) = \sum_{i=1}^{n} \log\!\left( \sum_{k=1}^{K} \pi_k f_k(x_i; \theta_k) \right) , \qquad (7.1)$$
where n is the number of samples, K is the number of components of the mixture (or number of clusters), and $\pi_k$ are the mixture proportions.

To obtain maximum likelihood estimates, the EM algorithm works with the joint distribution of the observations x and the unknown latent variables y, which indicate the cluster membership of every sample. The pair z = (x, y) is called the complete data. The log-likelihood of the complete data is called the complete log-likelihood, or


classification log-likelihood:
$$\mathcal{L}_C(\theta; \mathbf{X}, \mathbf{Y}) = \log\!\left( \prod_{i=1}^{n} f(x_i, y_i; \theta) \right) = \sum_{i=1}^{n} \log\!\left( \sum_{k=1}^{K} y_{ik}\, \pi_k f_k(x_i; \theta_k) \right) = \sum_{i=1}^{n} \sum_{k=1}^{K} y_{ik} \log\!\left( \pi_k f_k(x_i; \theta_k) \right) . \qquad (7.2)$$
The $y_{ik}$ are the binary entries of the indicator matrix Y, with $y_{ik} = 1$ if observation i belongs to cluster k, and $y_{ik} = 0$ otherwise.

We define the soft membership $t_{ik}(\theta)$ as
$$t_{ik}(\theta) = p(Y_{ik} = 1 \mid x_i; \theta) \qquad (7.3)$$
$$\phantom{t_{ik}(\theta)} = \frac{\pi_k f_k(x_i; \theta_k)}{f(x_i; \theta)} . \qquad (7.4)$$
To lighten notations, $t_{ik}(\theta)$ will be denoted $t_{ik}$ when the parameter θ is clear from the context. The regular (7.1) and complete (7.2) log-likelihoods are related as follows:
$$\mathcal{L}_C(\theta; \mathbf{X}, \mathbf{Y}) = \sum_{i,k} y_{ik} \log\!\left( \pi_k f_k(x_i; \theta_k) \right) = \sum_{i,k} y_{ik} \log\!\left( t_{ik} f(x_i; \theta) \right) = \sum_{i,k} y_{ik} \log t_{ik} + \sum_{i,k} y_{ik} \log f(x_i; \theta) = \sum_{i,k} y_{ik} \log t_{ik} + \sum_{i=1}^{n} \log f(x_i; \theta) = \sum_{i,k} y_{ik} \log t_{ik} + \mathcal{L}(\theta; \mathbf{X}) , \qquad (7.5)$$
where $\sum_{i,k} y_{ik} \log t_{ik}$ can be reformulated as
$$\sum_{i,k} y_{ik} \log t_{ik} = \sum_{i=1}^{n} \sum_{k=1}^{K} y_{ik} \log\!\left( p(Y_{ik} = 1 \mid x_i; \theta) \right) = \sum_{i=1}^{n} \log\!\left( p(y_i \mid x_i; \theta) \right) = \log\!\left( p(\mathbf{Y} \mid \mathbf{X}; \theta) \right) .$$
As a result, the relationship (7.5) can be rewritten as
$$\mathcal{L}(\theta; \mathbf{X}) = \mathcal{L}_C(\theta; \mathbf{Z}) - \log\!\left( p(\mathbf{Y} \mid \mathbf{X}; \theta) \right) . \qquad (7.6)$$


Likelihood Maximization

The complete log-likelihood cannot be assessed because the variables $y_{ik}$ are unknown. However, it is possible to estimate the value of the log-likelihood by taking expectations conditionally on a current value $\theta^{(t)}$ in (7.6):
$$\mathcal{L}(\theta; \mathbf{X}) = \underbrace{\mathbb{E}_{\mathbf{Y} \sim p(\cdot\mid\mathbf{X};\theta^{(t)})}\!\left[ \mathcal{L}_C(\theta; \mathbf{X}, \mathbf{Y}) \right]}_{Q(\theta,\theta^{(t)})} + \underbrace{\mathbb{E}_{\mathbf{Y} \sim p(\cdot\mid\mathbf{X};\theta^{(t)})}\!\left[ -\log p(\mathbf{Y}\mid\mathbf{X};\theta) \right]}_{H(\theta,\theta^{(t)})} .$$
In this expression, $H(\theta, \theta^{(t)})$ is an entropy term and $Q(\theta, \theta^{(t)})$ is the conditional expectation of the complete log-likelihood. Let us define an increment of the log-likelihood as $\Delta\mathcal{L} = \mathcal{L}(\theta^{(t+1)}; \mathbf{X}) - \mathcal{L}(\theta^{(t)}; \mathbf{X})$. Then $\theta^{(t+1)} = \operatorname{argmax}_\theta Q(\theta, \theta^{(t)})$ also increases the log-likelihood:
$$\Delta\mathcal{L} = \underbrace{\left( Q(\theta^{(t+1)}, \theta^{(t)}) - Q(\theta^{(t)}, \theta^{(t)}) \right)}_{\ge 0 \text{ by definition of iteration } t+1} + \underbrace{\left( H(\theta^{(t+1)}, \theta^{(t)}) - H(\theta^{(t)}, \theta^{(t)}) \right)}_{\ge 0 \text{ by Jensen's inequality}} .$$
Therefore, it is possible to maximize the likelihood by optimizing $Q(\theta, \theta^{(t)})$. The relationship between $Q(\theta, \theta')$ and $\mathcal{L}(\theta; \mathbf{X})$ is developed in deeper detail in Appendix F, which shows how the value of $\mathcal{L}(\theta; \mathbf{X})$ can be recovered from $Q(\theta, \theta^{(t)})$.

For the mixture model problem, $Q(\theta, \theta')$ is
$$Q(\theta, \theta') = \mathbb{E}_{\mathbf{Y} \sim p(\mathbf{Y}\mid\mathbf{X};\theta')}\!\left[ \mathcal{L}_C(\theta; \mathbf{X}, \mathbf{Y}) \right] = \sum_{i,k} p(Y_{ik} = 1 \mid x_i; \theta') \log\!\left( \pi_k f_k(x_i; \theta_k) \right) = \sum_{i=1}^{n} \sum_{k=1}^{K} t_{ik}(\theta') \log\!\left( \pi_k f_k(x_i; \theta_k) \right) . \qquad (7.7)$$
Due to its similarity with the expression of the complete likelihood (7.2), $Q(\theta, \theta')$ is also known as the weighted likelihood. In (7.7), the weights $t_{ik}(\theta')$ are the posterior probabilities of cluster memberships.

Hence, the EM algorithm sketched above results in:

• Initialization (not iterated): choice of the initial parameter $\theta^{(0)}$;

• E-step: evaluation of $Q(\theta, \theta^{(t)})$, using $t_{ik}(\theta^{(t)})$ (7.4) in (7.7);

• M-step: calculation of $\theta^{(t+1)} = \operatorname{argmax}_\theta Q(\theta, \theta^{(t)})$.


Gaussian Model

In the particular case of a Gaussian mixture model with a common covariance matrix Σ and different mean vectors $\mu_k$, the mixture density is
$$f(x_i; \theta) = \sum_{k=1}^{K} \pi_k f_k(x_i; \theta_k) = \sum_{k=1}^{K} \pi_k \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\!\left\{ -\frac{1}{2} (x_i - \mu_k)^\top \Sigma^{-1} (x_i - \mu_k) \right\} .$$
At the E-step, the posterior probabilities $t_{ik}$ are computed as in (7.4) with the current $\theta^{(t)}$ parameters; then the M-step maximizes $Q(\theta, \theta^{(t)})$ (7.7), whose form is as follows:
$$Q(\theta, \theta^{(t)}) = \sum_{i,k} t_{ik} \log \pi_k - \sum_{i,k} t_{ik} \log\!\left( (2\pi)^{p/2} |\Sigma|^{1/2} \right) - \frac{1}{2} \sum_{i,k} t_{ik} (x_i - \mu_k)^\top \Sigma^{-1} (x_i - \mu_k)$$
$$\phantom{Q(\theta, \theta^{(t)})} = \sum_{k} t_k \log \pi_k - \underbrace{\frac{np}{2} \log(2\pi)}_{\text{constant term}} - \frac{n}{2} \log |\Sigma| - \frac{1}{2} \sum_{i,k} t_{ik} (x_i - \mu_k)^\top \Sigma^{-1} (x_i - \mu_k)$$
$$\phantom{Q(\theta, \theta^{(t)})} \equiv \sum_{k} t_k \log \pi_k - \frac{n}{2} \log |\Sigma| - \sum_{i,k} t_{ik} \left( \frac{1}{2} (x_i - \mu_k)^\top \Sigma^{-1} (x_i - \mu_k) \right) , \qquad (7.8)$$
where
$$t_k = \sum_{i=1}^{n} t_{ik} . \qquad (7.9)$$
The M-step, which maximizes this expression with respect to θ, applies the following updates defining $\theta^{(t+1)}$:
$$\pi_k^{(t+1)} = \frac{t_k}{n} , \qquad (7.10)$$
$$\mu_k^{(t+1)} = \frac{\sum_i t_{ik} x_i}{t_k} , \qquad (7.11)$$
$$\Sigma^{(t+1)} = \frac{1}{n} \sum_k \mathbf{W}_k , \qquad (7.12)$$
$$\text{with} \quad \mathbf{W}_k = \sum_i t_{ik} (x_i - \mu_k)(x_i - \mu_k)^\top . \qquad (7.13)$$
The derivations are detailed in Appendix G.
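For concreteness, a minimal MATLAB sketch of one EM iteration for this model — not the Mix-GLOSS implementation — assuming X (n×p), mu (K×p), Sigma (p×p) and pi_k (1×K) hold the current parameters, is:

  [n, p] = size(X);
  % E-step: posteriors (7.4), computed with log-densities for numerical stability
  R = chol(Sigma);  logdet = 2 * sum(log(diag(R)));
  logphi = zeros(n, K);
  for k = 1:K
      D = X - repmat(mu(k, :), n, 1);
      m = sum((D / R).^2, 2);                    % Mahalanobis distances to mu_k
      logphi(:, k) = log(pi_k(k)) - 0.5 * (m + logdet + p * log(2*pi));
  end
  T = exp(logphi - repmat(max(logphi, [], 2), 1, K));
  T = T ./ repmat(sum(T, 2), 1, K);              % t_ik
  % M-step: updates (7.10)-(7.13)
  tk   = sum(T, 1);                              % t_k = sum_i t_ik
  pi_k = tk / n;                                 % (7.10)
  mu   = (T' * X) ./ repmat(tk', 1, p);          % (7.11)
  Sigma = zeros(p, p);
  for k = 1:K
      D = X - repmat(mu(k, :), n, 1);
      Sigma = Sigma + D' * (D .* repmat(T(:, k), 1, p));   % W_k of (7.13)
  end
  Sigma = Sigma / n;                             % (7.12)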

7.2 Feature Selection in Model-Based Clustering

When common covariance matrices are assumed, Gaussian mixtures are related to LDA, with partitions defined by linear decision rules. When every cluster has its own


covariance matrix $\Sigma_k$, Gaussian mixtures are associated with quadratic discriminant analysis (QDA), with quadratic boundaries.

In the high-dimensional, low-sample-size setting, numerical issues appear in the estimation of the covariance matrix. To avoid those singularities, regularization may be applied. A regularized trade-off between LDA and QDA (RDA) was proposed by Friedman (1989). Bensmail and Celeux (1996) extended this algorithm by rewriting the covariance matrix in terms of its eigenvalue decomposition, $\Sigma_k = \lambda_k \mathbf{D}_k \mathbf{A}_k \mathbf{D}_k^\top$ (Banfield and Raftery, 1993). These regularization schemes address singularity and stability issues, but they do not induce parsimonious models.

In this chapter, we review some techniques to induce sparsity in model-based clustering algorithms. This sparsity refers to the rule that assigns examples to classes: clustering is still performed in the original p-dimensional space, but the decision rule can be expressed with only a few coordinates of this high-dimensional space.

7.2.1 Based on Penalized Likelihood

Penalized log-likelihood maximization is a popular estimation technique for mixture models. It is typically achieved by the EM algorithm, using mixture models for which the allocation of examples is expressed as a simple function of the input features. For example, for Gaussian mixtures with a common covariance matrix, the log-ratio of posterior probabilities is a linear function of x:
$$\log\!\left( \frac{p(Y_k = 1 \mid x)}{p(Y_\ell = 1 \mid x)} \right) = x^\top\Sigma^{-1}(\mu_k - \mu_\ell) - \frac{1}{2}(\mu_k + \mu_\ell)^\top\Sigma^{-1}(\mu_k - \mu_\ell) + \log\frac{\pi_k}{\pi_\ell} .$$
In this model, a simple way of introducing sparsity in the discriminant vectors $\Sigma^{-1}(\mu_k - \mu_\ell)$ is to constrain Σ to be diagonal and to favor sparse means $\mu_k$. Indeed, for Gaussian mixtures with a common diagonal covariance matrix, if all means have the same value on dimension j, then variable j is useless for class allocation and can be discarded. The means can be penalized by the $L_1$ norm,
$$\lambda \sum_{k=1}^{K} \sum_{j=1}^{p} |\mu_{kj}| ,$$
as proposed by Pan et al. (2006) and Pan and Shen (2007). Zhou et al. (2009) consider more complex penalties on full covariance matrices:
$$\lambda_1 \sum_{k=1}^{K} \sum_{j=1}^{p} |\mu_{kj}| + \lambda_2 \sum_{k=1}^{K} \sum_{j=1}^{p} \sum_{m=1}^{p} \left| (\Sigma_k^{-1})_{jm} \right| .$$
In their algorithm, they make use of the graphical Lasso to estimate the covariances. Even if their formulation induces sparsity on the parameters, their combination of $L_1$ penalties does not directly target decision rules based on few variables, and thus does not guarantee parsimonious models.


Guo et al. (2010) propose a variation with a Pairwise Fusion Penalty (PFP),
$$\lambda \sum_{j=1}^{p} \sum_{1 \le k \le k' \le K} |\mu_{kj} - \mu_{k'j}| .$$
This PFP regularization does not shrink the means to zero but towards each other. If the $j$th components of all cluster means are driven to the same value, that variable can be considered as non-informative.

An $L_{1\infty}$ penalty is used by Wang and Zhu (2008) and Kuan et al. (2010) to penalize the likelihood, encouraging null groups of features:
$$\lambda \sum_{j=1}^{p} \left\| (\mu_{1j}, \mu_{2j}, \ldots, \mu_{Kj}) \right\|_\infty .$$
One group is defined for each variable j, as the set of the K means' $j$th components $(\mu_{1j}, \ldots, \mu_{Kj})$. The $L_{1\infty}$ penalty forces zeros at the group level, favoring the removal of the corresponding feature. This method seems to produce parsimonious models and good partitions within a reasonable computing time. In addition, the code is publicly available. Xie et al. (2008b) apply a group-Lasso penalty. Their principle describes a vertical mean grouping (VMG, with the same groups as Xie et al. (2008a)) and a horizontal mean grouping (HMG). VMG achieves real feature selection, because it forces null values for the same variable in all cluster means:
$$\lambda \sqrt{K} \sum_{j=1}^{p} \sqrt{ \sum_{k=1}^{K} \mu_{kj}^2 } .$$
The clustering algorithm of VMG differs from ours, but the proposed group penalty is the same; however, no code is available on the authors' website that would allow testing.

The optimization of a penalized likelihood by means of an EM algorithm can be reformulated by rewriting the maximization expressions of the M-step as a penalized optimal scoring regression. Roth and Lange (2004) implemented it for two-cluster problems, using an $L_1$ penalty to encourage sparsity of the discriminant vector. The generalization from quadratic to non-quadratic penalties is quickly outlined in their work. We extend this work by considering an arbitrary number of clusters and by formalizing the link between penalized optimal scoring and penalized likelihood estimation.

7.2.2 Based on Model Variants

The algorithm proposed by Law et al. (2004) takes a different stance. The authors define feature relevancy in terms of conditional independence: the $j$th feature is presumed uninformative if its distribution is independent of the class labels. The density is expressed as


$$f(x_i \mid \phi, \pi, \theta, \nu) = \sum_{k=1}^{K} \pi_k \prod_{j=1}^{p} \left[ f(x_{ij} \mid \theta_{jk}) \right]^{\phi_j} \left[ h(x_{ij} \mid \nu_j) \right]^{1-\phi_j} ,$$
where $f(\cdot \mid \theta_{jk})$ is the distribution function of the relevant features and $h(\cdot \mid \nu_j)$ is the distribution function of the irrelevant ones. The binary vector $\phi = (\phi_1, \phi_2, \ldots, \phi_p)$ represents relevance, with $\phi_j = 1$ if the $j$th feature is informative and $\phi_j = 0$ otherwise. The saliency of variable j is then formalized as $\rho_j = P(\phi_j = 1)$, so all the $\phi_j$ are treated as missing variables. The set of parameters is thus $\{\pi_k, \theta_{jk}, \nu_j, \rho_j\}$; their estimation is done by means of the EM algorithm (Law et al., 2004).

An original and recent technique is the Fisher-EM algorithm proposed by Bouveyron and Brunet (2012b,a). Fisher-EM is a modified version of EM that runs in a latent space. This latent space is defined by an orthogonal projection matrix $\mathbf{U} \in \mathbb{R}^{p \times (K-1)}$, which is updated inside the EM loop with a new step, called the Fisher step (F-step from now on), which maximizes a multi-class Fisher criterion
$$\mathrm{tr}\!\left( (\mathbf{U}^\top\Sigma_W\mathbf{U})^{-1}\, \mathbf{U}^\top\Sigma_B\mathbf{U} \right) , \qquad (7.14)$$
so as to maximize the separability of the data. The E-step is the standard one, computing the posterior probabilities. Then, the F-step updates the projection matrix that projects the data onto the latent space. Finally, the M-step estimates the parameters by maximizing the conditional expectation of the complete log-likelihood. Those parameters can be rewritten as functions of the projection matrix U and of the model parameters in the latent space, so that the U matrix enters the M-step equations.

To induce feature selection, Bouveyron and Brunet (2012a) suggest three possibilities. The first one computes the best sparse orthogonal approximation $\hat{\mathbf{U}}$ of the matrix U that maximizes (7.14). This sparse approximation is defined as the solution of
$$\min_{\hat{\mathbf{U}} \in \mathbb{R}^{p \times (K-1)}} \ \left\| \mathbf{X}_U - \mathbf{X}\hat{\mathbf{U}} \right\|_F^2 + \lambda \sum_{k=1}^{K-1} \left\| \hat{u}_k \right\|_1 ,$$
where $\mathbf{X}_U = \mathbf{X}\mathbf{U}$ is the input data projected in the non-sparse space and $\hat{u}_k$ is the $k$th column vector of the projection matrix $\hat{\mathbf{U}}$. The second possibility is inspired by Qiao et al. (2009) and reformulates the Fisher discriminant (7.14), used to compute the projection matrix, as a regression criterion penalized by a mixture of Lasso and Elastic net:
$$\min_{\mathbf{A}, \mathbf{B} \in \mathbb{R}^{p \times (K-1)}} \ \sum_{k=1}^{K} \left\| \mathbf{R}_W^{-\top}\mathbf{H}_{B,k} - \mathbf{A}\mathbf{B}^\top\mathbf{H}_{B,k} \right\|_2^2 + \rho \sum_{j=1}^{K-1} \beta_j^\top\Sigma_W\beta_j + \lambda \sum_{j=1}^{K-1} \left\| \beta_j \right\|_1$$
$$\text{s.t.} \quad \mathbf{A}^\top\mathbf{A} = \mathbf{I}_{K-1} ,$$
where $\mathbf{H}_B \in \mathbb{R}^{p \times K}$ is a matrix defined conditionally on the posterior probabilities $t_{ik}$, satisfying $\mathbf{H}_B\mathbf{H}_B^\top = \Sigma_B$, and $\mathbf{H}_{B,k}$ is the $k$th column of $\mathbf{H}_B$; $\mathbf{R}_W \in \mathbb{R}^{p \times p}$ is an upper


triangular matrix resulting from the Cholesky decomposition of $\Sigma_W$; $\Sigma_W$ and $\Sigma_B$ are the p × p within-class and between-class covariance matrices in the observation space; $\mathbf{A} \in \mathbb{R}^{p \times (K-1)}$ and $\mathbf{B} \in \mathbb{R}^{p \times (K-1)}$ are the solutions of the optimization problem, such that $\mathbf{B} = [\beta_1, \ldots, \beta_{K-1}]$ is the best sparse approximation of U.

The last possibility defines the solution of the Fisher discriminant (7.14) as the solution of the following constrained optimization problem:
$$\min_{\hat{\mathbf{U}} \in \mathbb{R}^{p \times (K-1)}} \ \sum_{j=1}^{p} \left\| \Sigma_{B,j} - \hat{\mathbf{U}}\hat{\mathbf{U}}^\top\Sigma_{B,j} \right\|_2^2 \quad \text{s.t.} \quad \hat{\mathbf{U}}^\top\hat{\mathbf{U}} = \mathbf{I}_{K-1} ,$$
where $\Sigma_{B,j}$ is the $j$th column of the between-class covariance matrix in the observation space. This problem can be solved by the penalized version of the singular value decomposition proposed by Witten et al. (2009), resulting in a sparse approximation of U.

To comply with the constraint stating that the columns of U are orthogonal, the first and second options must be followed by a singular value decomposition of $\hat{\mathbf{U}}$ to restore orthogonality. This is not necessary with the third option, since the penalized version of SVD already guarantees orthogonality.

However, there is a lack of guarantees regarding convergence. Bouveyron states: "the update of the orientation matrix U in the F-step is done by maximizing the Fisher criterion and not by directly maximizing the expected complete log-likelihood as required in the EM algorithm theory. From this point of view, the convergence of the Fisher-EM algorithm cannot therefore be guaranteed." Immediately after this paragraph, we can read that under certain assumptions their algorithm converges: "the model [...] which assumes the equality and the diagonality of covariance matrices, the F-step of the Fisher-EM algorithm satisfies the convergence conditions of the EM algorithm theory and the convergence of the Fisher-EM algorithm can be guaranteed in this case. For the other discriminant latent mixture models, although the convergence of the Fisher-EM procedure cannot be guaranteed, our practical experience has shown that the Fisher-EM algorithm rarely fails to converge with these models if correctly initialized."

7.2.3 Based on Model Selection

Some clustering algorithms recast the feature selection problem as a model selection problem. Following this view, Raftery and Dean (2006) model the observations as a mixture of Gaussian distributions. To discover a subset of relevant features (and its superfluous complement), they define three subsets of variables:

• X^(1): the set of selected relevant variables;

• X^(2): the set of variables being considered for inclusion in or exclusion from X^(1);

• X^(3): the set of non-relevant variables.


With those subsets, they define two different models, where Y is the partition to consider:

• M1: $f(\mathbf{X} \mid \mathbf{Y}) = f(\mathbf{X}^{(1)}, \mathbf{X}^{(2)}, \mathbf{X}^{(3)} \mid \mathbf{Y}) = f(\mathbf{X}^{(3)} \mid \mathbf{X}^{(2)}, \mathbf{X}^{(1)})\, f(\mathbf{X}^{(2)} \mid \mathbf{X}^{(1)})\, f(\mathbf{X}^{(1)} \mid \mathbf{Y})$

• M2: $f(\mathbf{X} \mid \mathbf{Y}) = f(\mathbf{X}^{(1)}, \mathbf{X}^{(2)}, \mathbf{X}^{(3)} \mid \mathbf{Y}) = f(\mathbf{X}^{(3)} \mid \mathbf{X}^{(2)}, \mathbf{X}^{(1)})\, f(\mathbf{X}^{(2)}, \mathbf{X}^{(1)} \mid \mathbf{Y})$

Model M1 means that the variables in X^(2) are independent of the clustering Y; model M2 states that the variables in X^(2) depend on the clustering Y. To simplify the algorithm, the subset X^(2) is only updated one variable at a time. Therefore, deciding the relevance of variable X^(2) amounts to a model selection between M1 and M2. The selection is done via the Bayes factor
$$B_{12} = \frac{f(\mathbf{X} \mid M1)}{f(\mathbf{X} \mid M2)} ,$$
where the high-dimensional $f(\mathbf{X}^{(3)} \mid \mathbf{X}^{(2)}, \mathbf{X}^{(1)})$ cancels from the ratio:
$$B_{12} = \frac{f(\mathbf{X}^{(1)}, \mathbf{X}^{(2)}, \mathbf{X}^{(3)} \mid M1)}{f(\mathbf{X}^{(1)}, \mathbf{X}^{(2)}, \mathbf{X}^{(3)} \mid M2)} = \frac{f(\mathbf{X}^{(2)} \mid \mathbf{X}^{(1)}, M1)\, f(\mathbf{X}^{(1)} \mid M1)}{f(\mathbf{X}^{(2)}, \mathbf{X}^{(1)} \mid M2)} .$$
This factor is approximated, since the integrated likelihoods $f(\mathbf{X}^{(1)} \mid M1)$ and $f(\mathbf{X}^{(2)}, \mathbf{X}^{(1)} \mid M2)$ are difficult to calculate exactly; Raftery and Dean (2006) use the BIC approximation. The computation of $f(\mathbf{X}^{(2)} \mid \mathbf{X}^{(1)}, M1)$, if there is only one variable in X^(2), can be represented as a linear regression of variable X^(2) on the variables in X^(1). There is also a BIC approximation for this term.

Maugis et al. (2009a) have proposed a variation of the algorithm developed by Raftery and Dean. They define three subsets of variables: the relevant and irrelevant subsets (X^(1) and X^(3)) remain the same, but X^(2) is reformulated as a subset of relevant variables that explains the irrelevance through a multidimensional regression. This algorithm also uses a backward stepwise strategy, instead of the forward stepwise search used by Raftery and Dean (2006). Their algorithm allows the definition of blocks of indivisible variables, which in certain situations improves the clustering and its interpretability.

Both algorithms are well motivated and appear to produce good results; however, the amount of computation needed to test the different subsets of variables requires a huge computation time. In practice, they cannot be used for the amount of data considered in this thesis.


8 Theoretical Foundations

In this chapter, we develop Mix-GLOSS, which uses the GLOSS algorithm conceived for supervised classification (see Chapter 5) to solve clustering problems. The goal here is similar, that is, providing an assignment of examples to clusters based on few features.

We use a modified version of the EM algorithm whose M-step is formulated as a penalized linear regression of a scaled indicator matrix, that is, a penalized optimal scoring problem. This idea was originally proposed by Hastie and Tibshirani (1996) to perform reduced-rank decision rules, using less than K−1 discriminant directions. Their motivation was mainly driven by stability issues; no sparsity-inducing mechanism was introduced in the construction of discriminant directions. Roth and Lange (2004) pursued this idea for binary clustering problems, where sparsity was introduced by a Lasso penalty applied to the OS problem. Besides extending the work of Roth and Lange (2004) to an arbitrary number of clusters, we draw links between the OS penalty and the parameters of the Gaussian model.

In the subsequent sections, we provide the principles that allow the M-step to be solved as an optimal scoring problem. The feature selection technique is embedded by means of a group-Lasso penalty. We must then guarantee that the equivalence between the M-step and the OS problem holds for our penalty. As with GLOSS, this is accomplished with a variational approach to the group-Lasso. Finally, some considerations are provided regarding the criterion that is optimized with this modified EM.

8.1 Resolving EM with Optimal Scoring

In the previous chapters, EM was presented as an iterative algorithm that computes a maximum likelihood estimate through the maximization of the expected complete log-likelihood. This section explains how a penalized OS regression embedded into an EM algorithm produces a penalized likelihood estimate.

8.1.1 Relationship Between the M-Step and Linear Discriminant Analysis

LDA is typically used in a supervised learning framework, for classification and dimension reduction. It looks for a projection of the data where the ratio of between-class variance to within-class variance is maximized (see Appendix C). Classification in the LDA domain is based on the Mahalanobis distance
$$d(x_i, \mu_k) = (x_i - \mu_k)^\top \Sigma_W^{-1} (x_i - \mu_k) ,$$
where $\mu_k$ are the p-dimensional centroids and $\Sigma_W$ is the p × p common within-class covariance matrix.


The likelihood equations in the M-Step (711) and (712) can be interpreted as themean and covariance estimates of a weighted and augmented LDA problem Hastie andTibshirani (1996) where the n observations are replicated K times and weighted by tik(the posterior probabilities computed at the E-step)

Having replicated the data vectors Hastie and Tibshirani (1996) remark that the pa-rameters maximizing the mixture likelihood in the M-step of the EM algorithm (711)and (712) can also be defined as the maximizers of the weighted and augmented likeli-hood

2\, l_{\mathrm{weight}}(\mu, \Sigma) = -\sum_{i=1}^{n}\sum_{k=1}^{K} t_{ik}\, d(x_i, \mu_k) - n\log(|\Sigma_W|),

which arises when considering a weighted and augmented LDA problem. This viewpoint provides the basis for an alternative maximization of the penalized maximum likelihood in Gaussian mixtures.

8.1.2 Relationship Between Optimal Scoring and Linear Discriminant Analysis

The equivalence between penalized optimal scoring problems and penalized linear discriminant analysis has already been detailed in Section 4.1 in the supervised learning framework. This is a critical part of the link between the M-step of an EM algorithm and optimal scoring regression.

8.1.3 Clustering Using Penalized Optimal Scoring

The solution of the penalized optimal scoring regression in the M-step is a coefficient matrix B_OS, analytically related to Fisher's discriminative directions B_LDA for the data (X, Y), where Y is the current (hard or soft) cluster assignment. In order to compute the posterior probabilities t_ik in the E-step, the distance between the samples x_i and the centroids \mu_k must be evaluated. Depending on whether we are working in the input domain, the OS domain or the LDA domain, different expressions are used for the distances (see Section 4.2.2 for more details). Mix-GLOSS works in the LDA domain, based on the following expression:

d(x_i, \mu_k) = \|(x_i - \mu_k) B_{\mathrm{LDA}}\|_2^2 - 2\log(\pi_k).

This distance defines the computation of the posterior probabilities t_ik in the E-step (see Section 4.2.3). Putting together all those elements, the complete clustering algorithm can be summarized as follows:


1. Initialize the membership matrix Y (for example by the K-means algorithm).

2. Solve the p-OS problem as
   B_{\mathrm{OS}} = (X^\top X + \lambda\Omega)^{-1} X^\top Y \Theta,
   where \Theta are the K − 1 leading eigenvectors of Y^\top X (X^\top X + \lambda\Omega)^{-1} X^\top Y.

3. Map X to the LDA domain: X_{\mathrm{LDA}} = X B_{\mathrm{OS}} D, with D = \mathrm{diag}\big(\alpha_k^{-1}(1 - \alpha_k^2)^{-\frac{1}{2}}\big).

4. Compute the centroids M in the LDA domain.

5. Evaluate distances in the LDA domain.

6. Translate distances into posterior probabilities t_ik with
   t_{ik} \propto \exp\left[-\frac{d(x_i, \mu_k) - 2\log(\pi_k)}{2}\right].   (8.1)

7. Update the labels using the posterior probability matrix: Y = T.

8. Go back to step 2 and iterate until the t_ik converge.

Items 2 to 5 can be interpreted as the M-step and Item 6 as the E-step in this alter-native view of the EM algorithm for Gaussian mixtures
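The following is a minimal numerical sketch of one such iteration, assuming a fixed quadratic penalty matrix Omega and known priors. It is illustrative only: the function and variable names are ours (not part of the Mix-GLOSS implementation), and the scaling D of step 3 is omitted, so distances are computed directly on X B_OS.

import numpy as np

def em_os_iteration(X, Y, lam, Omega, prior):
    """One EM iteration with a penalized-OS M-step (illustrative sketch)."""
    K = Y.shape[1]
    # Step 2: ridge-type p-OS solution and leading eigenvectors
    A = np.linalg.solve(X.T @ X + lam * Omega, X.T @ Y)   # (X'X + lam*Omega)^{-1} X'Y
    M = Y.T @ X @ A                                       # Y'X (X'X + lam*Omega)^{-1} X'Y
    evals, evecs = np.linalg.eigh((M + M.T) / 2)          # symmetrize for numerical stability
    Theta = evecs[:, np.argsort(evals)[::-1][:K - 1]]     # K-1 leading eigenvectors
    B_os = A @ Theta
    # Steps 3-5 (simplified): project, compute centroids and squared distances
    Z = X @ B_os
    Nk = Y.sum(axis=0)
    centroids = (Y.T @ Z) / Nk[:, None]
    d = ((Z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    # Step 6: posteriors of eq. (8.1), computed with a stabilized softmax
    logit = -0.5 * (d - 2.0 * np.log(prior))
    logit -= logit.max(axis=1, keepdims=True)
    T = np.exp(logit)
    T /= T.sum(axis=1, keepdims=True)
    return B_os, Theta, T

In an actual run, steps 7-8 amount to feeding T back as Y and iterating this function until the posteriors stabilize.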

8.1.4 From Sparse Optimal Scoring to Sparse Linear Discriminant Analysis

In the previous section we sketched a clustering algorithm that replaces the M-step with penalized OS. This modified version of EM holds for any quadratic penalty. We extend this equivalence to sparsity-inducing penalties through the quadratic variational approach to the group-Lasso provided in Section 4.3. We now look for a formal equivalence between this penalty and penalized maximum likelihood for Gaussian mixtures.

8.2 Optimized Criterion

In the classical EM for Gaussian mixtures, the M-step maximizes the weighted likelihood Q(θ, θ′) (7.7) so as to maximize the likelihood L(θ) (see Section 7.1.2). Replacing the M-step by a penalized optimal scoring problem is possible, and the link between the penalized OS problem and penalized LDA holds, but it remains to relate this penalized LDA problem to a penalized maximum likelihood criterion for the Gaussian mixture.

This penalized likelihood cannot be rigorously interpreted as a maximum a posteriori criterion, in particular because the penalty only operates on the covariance matrix Σ (there is no prior on the means and proportions of the mixture). We however believe that the Bayesian interpretation provides some insight, and we detail it in what follows.

8.2.1 A Bayesian Derivation

This section sketches a Bayesian treatment of inference limited to our present needs, where penalties are interpreted as prior distributions over the parameters of the probabilistic model to be estimated. Further details can be found in Bishop (2006, Section 2.3.6) and in Gelman et al. (2003, Section 3.6).

The model proposed in this thesis considers a classical maximum likelihood estimation for the means and a penalized common covariance matrix. This penalization can be interpreted as arising from a prior on this parameter.

The prior over the covariance matrix of a Gaussian variable is classically expressed as a Wishart distribution, since it is a conjugate prior:

f(\Sigma|\Lambda_0, \nu_0) = \frac{1}{2^{np/2}\,|\Lambda_0|^{n/2}\,\Gamma_p(n/2)}\; |\Sigma^{-1}|^{\frac{\nu_0 - p - 1}{2}} \exp\left\{-\frac{1}{2}\operatorname{tr}\left(\Lambda_0^{-1}\Sigma^{-1}\right)\right\},

where \nu_0 is the number of degrees of freedom of the distribution, \Lambda_0 is a p × p scale matrix, and \Gamma_p is the multivariate gamma function defined as

\Gamma_p(n/2) = \pi^{p(p-1)/4}\prod_{j=1}^{p}\Gamma\!\left(\frac{n}{2} + \frac{1-j}{2}\right).

The posterior distribution can be maximized, similarly to the likelihood, through the maximization of

Q(\theta,\theta') + \log f(\Sigma|\Lambda_0,\nu_0)
  = \sum_{k=1}^{K} t_k \log \pi_k - \frac{(n+1)p}{2}\log 2 - \frac{n}{2}\log|\Lambda_0| - \frac{p(p+1)}{4}\log(\pi)
    - \sum_{j=1}^{p}\log\Gamma\!\left(\frac{n}{2}+\frac{1-j}{2}\right) - \frac{\nu_n - p - 1}{2}\log|\Sigma| - \frac{1}{2}\operatorname{tr}\left(\Lambda_n^{-1}\Sigma^{-1}\right)
  \equiv \sum_{k=1}^{K} t_k \log \pi_k - \frac{n}{2}\log|\Lambda_0| - \frac{\nu_n - p - 1}{2}\log|\Sigma| - \frac{1}{2}\operatorname{tr}\left(\Lambda_n^{-1}\Sigma^{-1}\right),   (8.2)

with
  t_k = \sum_{i=1}^{n} t_{ik}, \qquad \nu_n = \nu_0 + n, \qquad \Lambda_n^{-1} = \Lambda_0^{-1} + S_0, \qquad S_0 = \sum_{i=1}^{n}\sum_{k=1}^{K} t_{ik}(x_i - \mu_k)(x_i - \mu_k)^\top.

Details of these calculations can be found in textbooks (for example Bishop, 2006; Gelman et al., 2003).

8.2.2 Maximum a Posteriori Estimator

The maximization of (8.2) with respect to \mu_k and \pi_k is of course not affected by the additional prior term, where only the covariance Σ intervenes. The MAP estimator for Σ is simply obtained by differentiating (8.2) with respect to Σ. The details of the calculations follow the same lines as the ones for maximum likelihood detailed in Appendix G. The resulting estimator for Σ is

\hat{\Sigma}_{\mathrm{MAP}} = \frac{1}{\nu_0 + n - p - 1}\left(\Lambda_0^{-1} + S_0\right),   (8.3)

where S_0 is the matrix defined in Equation (8.2). The maximum a posteriori estimator of the within-class covariance matrix (8.3) can thus be identified with the penalized within-class variance (4.19) resulting from the p-OS regression (4.16a) if \nu_0 is chosen to be p + 1 and \Lambda_0^{-1} = \lambda\Omega, where Ω is the penalty matrix from the group-Lasso regularization (4.25).
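Substituting these choices into (8.3) makes the identification explicit:

\hat{\Sigma}_{\mathrm{MAP}} = \frac{1}{(p+1) + n - p - 1}\left(\lambda\Omega + S_0\right) = \frac{1}{n}\left(\lambda\Omega + S_0\right),

that is, the soft within-class scatter S_0/n augmented by the penalty term \lambda\Omega/n.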


9 Mix-GLOSS Algorithm

Mix-GLOSS is an algorithm for unsupervised classification that embeds feature selection, resulting in parsimonious decision rules. It is based on the GLOSS algorithm developed in Chapter 5, which has been adapted for clustering. In this chapter I describe the details of the implementation of Mix-GLOSS and of the model selection mechanism.

9.1 Mix-GLOSS

The implementation of Mix-GLOSS involves three nested loops, as depicted in Figure 9.1. The inner one is an EM algorithm that, for a given value of the regularization parameter λ, iterates between an M-step, where the parameters of the model are estimated, and an E-step, where the corresponding posterior probabilities are computed. The main outputs of the EM are the coefficient matrix B, which projects the input data X onto the best subspace (in Fisher's sense), and the posteriors t_ik.

When several values of the penalty parameter are tested, we give them to the algorithm in ascending order, and the algorithm is initialized with the solution found for the previous λ value. This process continues until all the penalty parameter values have been tested, if a vector of penalty parameters was provided, or until a given sparsity is achieved, as measured by the number of variables estimated to be relevant.

The outer loop implements complete repetitions of the clustering algorithm for all the penalty parameter values, with the purpose of choosing the best execution. This loop alleviates local minima issues by resorting to multiple initializations of the partition.

9.1.1 Outer Loop: Whole Algorithm Repetitions

This loop performs a user-defined number of repetitions of the clustering algorithm. It takes as inputs:

• the centered n × p feature matrix X;

• the vector of penalty parameter values to be tried (an option is to provide an empty vector and let the algorithm set trial values automatically);

• the number of clusters K;

• the maximum number of iterations for the EM algorithm;

• the convergence tolerance for the EM algorithm;

• the number of whole repetitions of the clustering algorithm;


Figure 9.1: Mix-GLOSS loops scheme.

• a p × (K − 1) initial coefficient matrix (optional);

• an n × K initial posterior probability matrix (optional).

For each algorithm repetition, an initial label matrix Y is needed. This matrix may contain either hard or soft assignments. If no such matrix is available, K-means is used to initialize the process. If we have an initial guess for the coefficient matrix B, it can also be fed into Mix-GLOSS to warm-start the process.

9.1.2 Penalty Parameter Loop

The penalty parameter loop goes through all the values of the input vector λ. These values are sorted in ascending order, such that the resulting B and Y matrices can be used to warm-start the EM loop for the next value of the penalty parameter. If some λ value results in a null coefficient matrix, the algorithm halts. We have observed that this warm-start reduces the computation time by a factor of 8 with respect to using a null B matrix and a K-means execution for the initial Y label matrix.

Mix-GLOSS may be fed with an empty vector of penalty parameters, in which case a first non-penalized execution of Mix-GLOSS is done, and its resulting coefficient matrix B and posterior matrix Y are used to estimate a trial value of λ that should remove about 10% of the relevant features. This estimation is repeated until a minimum number of relevant variables is reached. The parameter that sets the estimated percentage of variables to be removed with the next penalty parameter can be modified to make feature selection more or less aggressive.

Algorithm 2 details the implementation of the automatic selection of the penalty parameter. If the alternative variational approach from Appendix D is used, Equations (4.32b) must be replaced by (D.10b).

Algorithm 2: Automatic selection of λ

Input: X, K, λ = ∅, minVAR
Initialize: B ← 0; Y ← K-means(X, K)
Run non-penalized Mix-GLOSS: λ ← 0; (B, Y) ← Mix-GLOSS(X, K, B, Y, λ)
lastLAMBDA ← false
repeat
  Estimate λ:
    compute the gradient at β^j = 0:
      \frac{\partial J(B)}{\partial \beta^j}\Big|_{\beta^j=0} = \mathbf{x}^{j\top}\Big(\sum_{m \neq j} \mathbf{x}^m \beta^m - Y\Theta\Big)
    compute λmax for every feature using (4.32b):
      \lambda_{\max}^{j} = \frac{1}{w_j}\Big\|\frac{\partial J(B)}{\partial \beta^j}\Big|_{\beta^j=0}\Big\|_2
    choose λ so as to remove 10% of the relevant features
  Run penalized Mix-GLOSS: (B, Y) ← Mix-GLOSS(X, K, B, Y, λ)
  if the number of relevant variables in B > minVAR then
    lastLAMBDA ← false
  else
    lastLAMBDA ← true
  end if
until lastLAMBDA
Output: B, L(θ), t_ik, π_k, μ_k, Σ, Y for every λ in the solution path
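As an illustration of the λ-estimation step of Algorithm 2, the sketch below computes the per-feature λmax values from the gradient of the quadratic loss at β^j = 0. The helper name and the numpy-based implementation are ours and only meant to mirror the formulas above, not to reproduce the actual Mix-GLOSS code.

import numpy as np

def lambda_max_per_feature(X, YTheta, B, w):
    """lambda_max_j = ||grad of the loss at beta_j = 0||_2 / w_j (illustrative sketch)."""
    p = X.shape[1]
    lam_max = np.empty(p)
    for j in range(p):
        B_minus_j = B.copy()
        B_minus_j[j, :] = 0.0                              # residual excluding feature j
        grad_j = X[:, j] @ (X @ B_minus_j - YTheta)        # x^j' (sum_{m!=j} x^m beta^m - Y Theta)
        lam_max[j] = np.linalg.norm(grad_j, 2) / w[j]
    return lam_max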

9.1.3 Inner Loop: EM Algorithm

The inner loop implements the actual clustering algorithm by means of successive maximizations of a penalized likelihood criterion. Once convergence of the posterior probabilities t_ik is achieved, the maximum a posteriori rule is applied to classify all examples. Algorithm 3 describes this inner loop.


Algorithm 3: Mix-GLOSS for one value of λ

Input: X, K, B0, Y0, λ
Initialize:
  if (B0, Y0) are available then
    B_OS ← B0; Y ← Y0
  else
    B_OS ← 0; Y ← K-means(X, K)
  end if
  convergenceEM ← false; tolEM ← 1e-3
repeat
  M-step:
    (B_OS, Θ, α) ← GLOSS(X, Y, B_OS, λ)
    X_LDA = X B_OS diag(α^{-1}(1 − α^2)^{-1/2})
    π_k, μ_k and Σ as per (7.10), (7.11) and (7.12)
  E-step:
    t_ik as per (8.1)
    L(θ) as per (8.2)
  if (1/n) Σ_i |t_ik − y_ik| < tolEM then
    convergenceEM ← true
  end if
  Y ← T
until convergenceEM
Y ← MAP(T)
Output: B_OS, Θ, L(θ), t_ik, π_k, μ_k, Σ, Y


M-Step

The M-step deals with the estimation of the model parameters, that is, the clusters' means μ_k, the common covariance matrix Σ and the prior of every component π_k. In a classical M-step, this is done explicitly by maximizing the likelihood expression. Here, this maximization is implicitly performed by penalized optimal scoring (see Section 8.1). The core of this step is a GLOSS execution that regresses X on the scaled version of the label matrix YΘ. For the first iteration of EM, if no initialization is available, Y results from a K-means execution. In subsequent iterations, Y is updated as the posterior probability matrix T resulting from the E-step.

E-Step

The E-step evaluates the posterior probability matrix T using

t_{ik} \propto \exp\left[-\frac{d(x_i, \mu_k) - 2\log(\pi_k)}{2}\right].

The convergence of these t_ik is used as the stopping criterion for the EM algorithm.
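A minimal sketch of this computation, assuming the squared distances d (an n × K array) and the priors π are already available, could read as follows; the stabilized softmax and the names are illustrative, not the thesis implementation.

import numpy as np

def e_step(d, pi):
    """Posterior probabilities t_ik from squared distances and priors (sketch of the rule above)."""
    logit = -0.5 * (d - 2.0 * np.log(pi))       # exponent of the proportionality relation
    logit -= logit.max(axis=1, keepdims=True)   # numerical stabilization before exponentiation
    T = np.exp(logit)
    return T / T.sum(axis=1, keepdims=True)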

9.2 Model Selection

Here, model selection refers to the choice of the penalty parameter. Up to now, we have not conducted experiments where the number of clusters has to be automatically selected.

In a first attempt, we tried a classical structure where clustering was performed several times, from different initializations, for all penalty parameter values. Then, using the log-likelihood criterion, the best repetition for every value of the penalty parameter was chosen. The definitive λ was selected by means of the stability criterion described by Lange et al. (2002). This algorithm took a lot of computing resources, since the stability selection mechanism required a certain number of repetitions that transformed Mix-GLOSS into a lengthy four-nested-loop structure.

In a second attempt, we replaced the stability-based model selection algorithm by the evaluation of a modified version of BIC (Pan and Shen, 2007). This version of BIC looks like the traditional one (Schwarz, 1978) but takes into consideration the variables that have been removed. This mechanism, even if it turned out to be faster, also required a large computation time.
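For illustration, a BIC-style criterion in the spirit of Pan and Shen (2007) can be sketched as below; this is a rough approximation with a simple parameter count based on the retained features, not the exact formula of the paper, and the helper name is ours.

import numpy as np

def modified_bic(loglik, B, n, K, tol=1e-8):
    """Sketch of a sparsity-aware BIC: only non-zero rows of B contribute parameters."""
    kept = int(np.sum(np.linalg.norm(B, axis=1) > tol))   # features kept in the model
    n_params = (K - 1) + kept * (K - 1)                   # priors + discriminant coefficients (rough count)
    return -2.0 * loglik + np.log(n) * n_params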

The third and definitive attempt (up to now) proceeds with several executions of Mix-GLOSS for the non-penalized case (λ = 0). The execution with the best log-likelihood is chosen; the repetitions are only performed for the non-penalized problem. The coefficient matrix B and the posterior matrix T resulting from the best non-penalized execution are used to warm-start a new Mix-GLOSS execution. This second execution of Mix-GLOSS is done using the values of the penalty parameter provided by the user or computed by the automatic selection mechanism. This time, only one repetition of the algorithm is done for every value of the penalty parameter.


Figure 9.2: Mix-GLOSS model selection diagram (initial non-penalized Mix-GLOSS with 20 repetitions; the best B and T warm-start the penalized runs over λ; BIC is computed for each λ and the λ minimizing BIC is chosen, yielding the partition, t_ik, π_k, λ_BEST, B, Θ, D, L(θ) and the active set).

This version has been tested with no significant differences in the quality of the clustering, while dramatically reducing the computation time. Figure 9.2 summarizes the mechanism that implements the model selection of the penalty parameter λ.


10 Experimental Results

The performance of Mix-GLOSS is measured here on the artificial dataset that was used in Chapter 6.

This synthetic database is interesting because it covers four different situations where feature selection can be applied. Basically, it considers four setups with 1200 examples equally distributed between classes. It is a small-sample regime with p = 500 variables, out of which 100 differ between classes. Independent variables are generated for all simulations except for simulation 2, where they are slightly correlated. In simulations 2 and 3, classes are optimally separated by a single projection of the original variables, while the two other scenarios require three discriminant directions. The Bayes' error was estimated to be respectively 1.7%, 6.7%, 7.3% and 30.0%. The exact description of every setup has already been given in Section 6.3.

In our tests we have reduced the volume of the problem, because with the original size of 1200 samples and 500 dimensions some of the algorithms to be tested took several days (even weeks) to finish. Hence, the definitive database was chosen to maintain approximately the Bayes' error of the original one, but with five times fewer examples and dimensions (n = 240, p = 100). Figure 10.1 has been adapted from Witten and Tibshirani (2011) to the dimensionality of our experiments and allows a better understanding of the different simulations.

The simulation protocol involves 25 repetitions of each setup, generating a different dataset for each repetition. Thus, the results of the tested algorithms are provided as the average value and the standard deviation over the 25 repetitions.

10.1 Tested Clustering Algorithms

This section compares Mix-GLOSS with the following state-of-the-art methods:

• CS general cov. This is a model-based clustering with unconstrained covariance matrices, based on the regularization of the likelihood function using L1 penalties, followed by a classical EM algorithm. Further details can be found in Zhou et al. (2009). We use the R function available on the website of Wei Pan.

• Fisher EM. This method models and clusters the data in a discriminative and low-dimensional latent subspace (Bouveyron and Brunet, 2012b,a). Feature selection is induced by means of the "sparsification" of the projection matrix (three possibilities are suggested by Bouveyron and Brunet, 2012a). The corresponding R package, FisherEM, is available from the website of Charles Bouveyron or from the Comprehensive R Archive Network.


Figure 10.1: Class mean vectors for each artificial simulation.

• SelvarClust/Clustvarsel. This implements a method of variable selection for clustering using Gaussian mixture models, as a modification of the Raftery and Dean (2006) algorithm. SelvarClust (Maugis et al., 2009b) is a software implemented in C++ that makes use of the clustering library MIXMOD (Biernacki et al., 2008). Further information can be found in the related paper, Maugis et al. (2009a). The software can be downloaded from the SelvarClust project homepage; there is a link to the project from Cathy Maugis's website.

After several tests, this entrant was discarded due to the amount of computing time required by the greedy selection technique, which basically involves two executions of a classical clustering algorithm (with MIXMOD) for every single variable whose inclusion needs to be considered.

The substitute for SelvarClust has been the algorithm that inspired it, that is, the method developed by Raftery and Dean (2006). There is an R package named Clustvarsel that can be downloaded from the website of Nema Dean or from the Comprehensive R Archive Network.

• LumiWCluster. LumiWCluster is an R package available from the homepage of Pei Fen Kuan. This algorithm is inspired by Wang and Zhu (2008), who propose a penalty for the likelihood that incorporates group information through an L1,∞ mixed norm. Kuan et al. (2010) introduce some slight changes in the penalty term, such as weighting parameters, that are particularly important for their dataset. The package LumiWCluster allows performing clustering using the expression from Wang and Zhu (2008) (called LumiWCluster-Wang here) or the one from Kuan et al. (2010) (called LumiWCluster-Kuan).

• Mix-GLOSS. This is the clustering algorithm implemented using GLOSS (see Chapter 9). It makes use of an EM algorithm and of the equivalences between the M-step and an LDA problem, and between a p-LDA problem and a p-OS problem. It penalizes an OS regression with a variational approach of the group-Lasso penalty (see Section 8.1.4) that induces zeros in all discriminant directions for the same variable.

10.2 Results

Table 10.1 shows the results of the experiments for all the algorithms from Section 10.1. The performance is measured as follows:

• Clustering Error (in percentage). To measure the quality of the partition with the a priori knowledge of the real classes, the clustering error is computed as explained in Wu and Schölkopf (2007). If the obtained partition and the real labeling are the same, then the clustering error is 0%. The way this measure is defined allows obtaining the ideal 0% of clustering error even if the IDs of the clusters and of the real classes differ (see the sketch after this list).

• Number of Disposed Features. This value shows the number of variables whose coefficients have been zeroed and that are therefore not used in the partitioning. In our datasets, only the first 20 features are relevant for the discrimination; the last 80 variables can be discarded. Hence a good result for the tested algorithms should be around 80.

• Time of execution (in hours, minutes or seconds). Finally, the time needed to execute the 25 repetitions of each simulation setup is also measured. Those algorithms tend to be more memory- and CPU-consuming as the number of variables increases; this is one of the reasons why the dimensionality of the original problem was reduced.
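A common way to implement the clustering error above is to search for the best one-to-one matching between cluster IDs and class IDs; the sketch below does so with the Hungarian algorithm. It is only meant to illustrate the criterion and is not necessarily the implementation used to produce Table 10.1.

import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_error(y_true, y_pred, K):
    """Clustering error under the best matching of cluster labels to class labels (sketch)."""
    C = np.zeros((K, K), dtype=int)            # contingency table classes x clusters
    for t, p in zip(y_true, y_pred):
        C[t, p] += 1
    rows, cols = linear_sum_assignment(-C)     # matching that maximizes agreement
    return 1.0 - C[rows, cols].sum() / len(y_true)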

The adequacy of the selected features was assessed by the True Positive Rate (TPR) and the False Positive Rate (FPR): the TPR is the proportion of relevant variables that are selected, TPR = |selected ∩ relevant| / |relevant|, and the FPR is the proportion of irrelevant variables that are selected, FPR = |selected ∩ irrelevant| / |irrelevant|. The best algorithm would be the one that selects all the relevant variables and rejects all the others, that is, TPR = 1 and FPR = 0 simultaneously. In order to avoid cluttered results, we compare TPR and FPR for the four simulations but only for three algorithms: CS general cov and Clustvarsel were discarded due to high computing time and high clustering error respectively, and, the two versions of LumiWCluster providing almost the same TPR and FPR, only one is displayed. The three remaining algorithms are Fisher EM (Bouveyron and Brunet, 2012a), the version of LumiWCluster by Kuan et al. (2010), and Mix-GLOSS.

Results, in percentages, are displayed in Figure 10.2 and in Table 10.2.


Table 10.1: Experimental results for simulated data (clustering error in %, number of disposed variables, execution time; standard deviations in parentheses).

Sim 1: K = 4, mean shift, independent features
  CS general cov       Err 46 (15)    Var 985 (72)    Time 884h
  Fisher EM            Err 58 (87)    Var 784 (52)    Time 1645m
  Clustvarsel          Err 602 (107)  Var 378 (291)   Time 383h
  LumiWCluster-Kuan    Err 42 (68)    Var 779 (4)     Time 389s
  LumiWCluster-Wang    Err 43 (69)    Var 784 (39)    Time 619s
  Mix-GLOSS            Err 32 (16)    Var 80 (09)     Time 15h

Sim 2: K = 2, mean shift, dependent features
  CS general cov       Err 154 (2)    Var 997 (09)    Time 783h
  Fisher EM            Err 74 (23)    Var 809 (28)    Time 8m
  Clustvarsel          Err 73 (2)     Var 334 (207)   Time 166h
  LumiWCluster-Kuan    Err 64 (18)    Var 798 (04)    Time 155s
  LumiWCluster-Wang    Err 63 (17)    Var 799 (03)    Time 14s
  Mix-GLOSS            Err 77 (2)     Var 841 (34)    Time 2h

Sim 3: K = 4, 1D mean shift, independent features
  CS general cov       Err 304 (57)   Var 55 (468)    Time 1317h
  Fisher EM            Err 233 (65)   Var 366 (55)    Time 22m
  Clustvarsel          Err 658 (115)  Var 232 (291)   Time 542h
  LumiWCluster-Kuan    Err 323 (21)   Var 80 (02)     Time 83s
  LumiWCluster-Wang    Err 308 (36)   Var 80 (02)     Time 1292s
  Mix-GLOSS            Err 347 (92)   Var 81 (88)     Time 21h

Sim 4: K = 4, mean shift, independent features
  CS general cov       Err 626 (55)   Var 999 (02)    Time 112h
  Fisher EM            Err 567 (104)  Var 55 (48)     Time 195m
  Clustvarsel          Err 732 (4)    Var 24 (12)     Time 767h
  LumiWCluster-Kuan    Err 692 (112)  Var 99 (2)      Time 876s
  LumiWCluster-Wang    Err 697 (119)  Var 991 (21)    Time 825s
  Mix-GLOSS            Err 669 (91)   Var 975 (12)    Time 11h

Table 10.2: TPR versus FPR (in %), averages computed over 25 repetitions for the best performing algorithms.

                     Simulation 1      Simulation 2      Simulation 3      Simulation 4
                     TPR     FPR       TPR     FPR       TPR     FPR       TPR     FPR
  Mix-GLOSS          992     015       828     335       884     67        780     12
  LumiWCluster-Kuan  992     28        1000    02        1000    005       50      005
  Fisher EM          986     24        888     17        838     5825      620     4075


Figure 10.2: TPR versus FPR (in %) for the best performing algorithms (Mix-GLOSS, LumiWCluster-Kuan, Fisher EM) on the four simulations; TPR on the vertical axis, FPR on the horizontal axis.

10.3 Discussion

After reviewing Tables 10.1 and 10.2 and Figure 10.2, we see that there is no definitive winner in all situations regarding all criteria. According to the objectives and constraints of the problem, the following observations deserve to be highlighted.

LumiWCluster (Wang and Zhu, 2008; Kuan et al., 2010) is by far the fastest kind of method, with good behavior regarding the other performance measures. At the other end of this criterion, CS general cov is extremely slow and Clustvarsel, though twice as fast, also takes very long to produce an output. Of course, the speed criterion does not say much by itself: the implementations use different programming languages and different stopping criteria, and we do not know what effort has been spent on implementation. That being said, the slowest algorithms are not the most precise ones, so their long computation time is worth mentioning here.

The quality of the partition varies depending on the simulation and the algorithm. Mix-GLOSS has a small edge in Simulation 1, LumiWCluster (Wang and Zhu, 2008; Kuan et al., 2010) performs better in Simulation 2, while Fisher EM (Bouveyron and Brunet, 2012a) does slightly better in Simulations 3 and 4.

From the feature selection point of view, LumiWCluster (Kuan et al., 2010) and Mix-GLOSS succeed in removing irrelevant variables in all situations; Fisher EM (Bouveyron and Brunet, 2012a) and Mix-GLOSS discover the relevant ones. Mix-GLOSS consistently performs best or close to the best solution in terms of fall-out and recall.


Conclusions


Summary

The linear regression of scaled indicator matrices, or optimal scoring, is a versatile technique with applicability in many fields of the machine learning domain. By means of regularization, an optimal scoring regression can be strengthened to be more robust, avoid overfitting, counteract ill-posed problems, or remove correlated or noisy variables.

In this thesis we have proved the utility of penalized optimal scoring in the fields ofmulti-class linear discrimination and clustering

The equivalence between LDA and OS problems allows taking advantage of all the resources available for solving regression problems in the solution of linear discrimination. In their penalized versions, this equivalence holds under certain conditions that have not always been obeyed when OS has been used to solve LDA problems.

In Part II, we have used a variational approach to the group-Lasso penalty to preserve this equivalence, granting the use of penalized optimal scoring regressions for the solution of linear discrimination problems. This theory has been verified with the implementation of our Group-Lasso Optimal Scoring Solver (GLOSS), which has proved its effectiveness, inducing extremely parsimonious models without renouncing any predictive capability. GLOSS has been tested on four artificial and three real datasets, outperforming other state-of-the-art algorithms in almost all situations.

In Part III, this theory has been adapted, by means of an EM algorithm, to the unsupervised domain. As for the supervised case, the theory must guarantee the equivalence between penalized LDA and penalized OS. The difficulty of this method resides in the computation of the criterion to maximize at every iteration of the EM loop, which is typically used to detect the convergence of the algorithm and to implement model selection of the penalty parameter. Also in this case, the theory has been put into practice with the implementation of Mix-GLOSS. By now, due to time constraints, only artificial datasets have been tested, with positive results.

Perspectives

Even if the preliminary results are optimistic, Mix-GLOSS has not been sufficiently tested. We have planned to test it at least with the same real datasets that we used with GLOSS; however, more testing would be recommended in both cases. Those algorithms are well suited for genomic data, where the number of samples is smaller than the number of variables, but other high-dimension low-sample-size (HDLSS) domains are also possible: identification of male or female silhouettes, of fungal species or of fish species based on shape and texture (Clemmensen et al., 2011), or the Stirling faces (Roth and Lange, 2004), are only some examples. Moreover, we are not constrained to the HDLSS domain: the USPS handwritten digits database (Roth and Lange, 2004), or the well-known Fisher's Iris dataset and six other UCI datasets (Bouveyron and Brunet, 2012a), have also been tested in the bibliography.

At the programming level, both codes must be revisited to improve their robustness and optimize their computation, because during the prototyping phase the priority was achieving a functional code. An old version of GLOSS, numerically more stable but less efficient, has been made available to the public. A better suited and documented version should be made available for GLOSS and Mix-GLOSS in the short term.

The theory developed in this thesis and the programming structure used for its implementation allow easy alterations of the algorithm by modifying the within-class covariance matrix. Diagonal versions of the model can be obtained by discarding all the elements but the diagonal of the covariance matrix. Spherical models could also be implemented easily. Prior information concerning the correlation between features can be included by adding a quadratic penalty term, such as the Laplacian that describes the relationships between variables; this can be used to implement pairwise penalties when the dataset is formed by pixels. Quadratic penalty matrices can also be added to the within-class covariance to implement Elastic-net-like penalties. Some of those possibilities have been partially implemented, such as the diagonal version of GLOSS, but they have not been properly tested or even updated with the last algorithmic modifications. Their equivalents for the unsupervised domain have not yet been proposed, due to the time deadlines for the publication of this thesis.

From the point of view of the supporting theory, we did not succeed in finding the exact criterion that is maximized in Mix-GLOSS. We believe it must be a kind of penalized, or even hyper-penalized, likelihood, but we decided to prioritize the experimental results due to the time constraints. Ignoring this criterion does not prevent successful simulations of Mix-GLOSS: other mechanisms that do not involve the computation of the real criterion have been used to stop the EM algorithm and to perform model selection. However, further investigations must be done in this direction to assess the convergence properties of this algorithm.

At the beginning of this thesis, even if the work finally took the direction of feature selection, a big effort was made in the domains of outlier detection and block clustering. One of the most successful mechanisms for the detection of outliers models the population with a mixture model where the outliers are described by a uniform distribution. This technique does not need any prior knowledge about the number or the percentage of outliers. As the basic model of this thesis is a mixture of Gaussians, our impression is that it should not be difficult to introduce a new uniform component to gather together all those points that do not fit the Gaussian mixture. On the other hand, the application of penalized optimal scoring to block clustering looks more complex; but as block clustering is typically defined as a mixture model whose parameters are estimated by means of an EM algorithm, it could be possible to re-interpret that estimation using a penalized optimal scoring regression.


Appendix


A Matrix Properties

Property 1. By definition, \Sigma_W and \Sigma_B are both symmetric matrices:

\Sigma_W = \frac{1}{n}\sum_{k=1}^{g}\sum_{i\in C_k}(x_i - \mu_k)(x_i - \mu_k)^\top, \qquad \Sigma_B = \frac{1}{n}\sum_{k=1}^{g} n_k(\mu_k - \bar{x})(\mu_k - \bar{x})^\top.

Property 2. \frac{\partial\, x^\top a}{\partial x} = \frac{\partial\, a^\top x}{\partial x} = a.

Property 3. \frac{\partial\, x^\top A x}{\partial x} = (A + A^\top)x.

Property 4. \frac{\partial |X^{-1}|}{\partial X} = -|X^{-1}|(X^{-1})^\top.

Property 5. \frac{\partial\, a^\top X b}{\partial X} = a b^\top.

Property 6. \frac{\partial}{\partial X}\operatorname{tr}\left(A X^{-1} B\right) = -(X^{-1} B A X^{-1})^\top = -X^{-\top}A^\top B^\top X^{-\top}.


B The Penalized-OS Problem is anEigenvector Problem

In this appendix we answer the question of why the solution of a penalized optimal scoring regression involves the computation of an eigenvector decomposition. The p-OS problem has the form

\min_{\theta_k, \beta_k} \; \|Y\theta_k - X\beta_k\|_2^2 + \beta_k^\top \Omega_k \beta_k   (B.1)
\text{s.t.} \; \theta_k^\top Y^\top Y \theta_k = 1, \qquad \theta_\ell^\top Y^\top Y \theta_k = 0 \;\; \forall \ell < k,

for k = 1, …, K − 1. The Lagrangian associated with Problem (B.1) is

L_k(\theta_k, \beta_k, \lambda_k, \nu_k) = \|Y\theta_k - X\beta_k\|_2^2 + \beta_k^\top \Omega_k \beta_k + \lambda_k\left(\theta_k^\top Y^\top Y \theta_k - 1\right) + \sum_{\ell<k}\nu_\ell\,\theta_\ell^\top Y^\top Y \theta_k.   (B.2)

Setting the gradient of (B.2) with respect to \beta_k to zero gives the value of the optimal \beta_k:

\beta_k^\star = (X^\top X + \Omega_k)^{-1} X^\top Y \theta_k.   (B.3)

The objective function of (B.1) evaluated at \beta_k^\star is

\min_{\theta_k} \|Y\theta_k - X\beta_k^\star\|_2^2 + \beta_k^{\star\top}\Omega_k\beta_k^\star = \min_{\theta_k} \theta_k^\top Y^\top\left(I - X(X^\top X + \Omega_k)^{-1}X^\top\right)Y\theta_k
 = \max_{\theta_k} \theta_k^\top Y^\top X(X^\top X + \Omega_k)^{-1}X^\top Y\theta_k.   (B.4)

If the penalty matrix \Omega_k is identical for all problems, \Omega_k = \Omega, then (B.4) corresponds to an eigen-problem where the k score vectors \theta_k are the eigenvectors of Y^\top X(X^\top X + \Omega)^{-1}X^\top Y.

B.1 How to Solve the Eigenvector Decomposition

Making an eigen-decomposition of an expression like Y^\top X(X^\top X + \Omega)^{-1}X^\top Y is not trivial due to the p × p inverse. With some datasets, p can be extremely large, making this inverse intractable. In this section we show how to circumvent this issue by solving an easier eigenvector decomposition.


Let M be the matrix Y^\top X(X^\top X + \Omega)^{-1}X^\top Y, so that we can rewrite expression (B.4) in a compact way:

\max_{\Theta\in\mathbb{R}^{K\times(K-1)}} \operatorname{tr}\left(\Theta^\top M \Theta\right)   (B.5)
\text{s.t.} \; \Theta^\top Y^\top Y \Theta = I_{K-1}.

If (B.5) is an eigenvector problem, it can be reformulated in the traditional way. Let the (K−1)×(K−1) matrix M_\Theta be \Theta^\top M\Theta. Hence the classical eigenvector formulation associated with (B.5) is

M_\Theta v = \lambda v,   (B.6)

where v is an eigenvector and λ the associated eigenvalue of M_\Theta. Operating,

v^\top M_\Theta v = \lambda \iff v^\top \Theta^\top M \Theta v = \lambda.

Making the variable change w = \Theta v, we obtain an alternative eigen-problem where w is an eigenvector of M and λ the associated eigenvalue:

w^\top M w = \lambda.   (B.7)

Therefore, v are the eigenvectors of the eigen-decomposition of matrix M_\Theta and w are the eigenvectors of the eigen-decomposition of matrix M. Note that the only difference between the (K−1)×(K−1) matrix M_\Theta and the K×K matrix M is the K×(K−1) matrix Θ in the expression M_\Theta = \Theta^\top M\Theta. Then, to avoid the computation of the p × p inverse (X^\top X + \Omega)^{-1}, we can use the optimal value of the coefficient matrix B^\star = (X^\top X + \Omega)^{-1}X^\top Y\Theta in M_\Theta:

M_\Theta = \Theta^\top Y^\top X(X^\top X + \Omega)^{-1}X^\top Y\Theta = \Theta^\top Y^\top X B^\star.

Thus the eigen-decomposition of the (K minus 1) times (K minus 1) matrix MΘ = ΘgtYgtXB results in the v eigenvectors of (B6) To obtain the w eigenvectors of the alternativeformulation (B7) the variable change w = Θv needs to be undone

To summarize we calcule the v eigenvectors computed as the eigen-decomposition of atractable MΘ matrix evaluated as ΘgtYgtXB Then the definitive eigenvectors w arerecovered by doing w = Θv The final step is the reconstruction of the optimal scorematrix Θ using the vectors w as its columns At this point we understand what inthe literature is called ldquoupdating the initial score matrixrdquo Multiplying the initial Θ tothe eigenvectors matrix V from decomposition (B6) is reversing the change of variableto restore the w vectors The B matrix also needs to be ldquoupdatedrdquo by multiplying Bby the same matrix of eigenvectors V in order to affect the initial Θ matrix used in thefirst computation of B

B^\star = (X^\top X + \Omega)^{-1}X^\top Y\Theta V = B V.
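As an illustration, this update can be sketched as follows (numpy-based, with hypothetical names; Omega is any quadratic penalty matrix). It is a sketch of the procedure described above, not the thesis code.

import numpy as np

def update_scores(X, Y, Theta0, lam, Omega):
    """Solve the ridge system once, eigen-decompose the small matrix
    M_Theta = Theta0' Y' X B, then rotate Theta0 and B by the eigenvectors."""
    B = np.linalg.solve(X.T @ X + lam * Omega, X.T @ Y @ Theta0)
    M_Theta = Theta0.T @ Y.T @ X @ B                 # (K-1) x (K-1): cheap to decompose
    evals, V = np.linalg.eigh(M_Theta)
    V = V[:, np.argsort(evals)[::-1]]                # sort eigenvectors by decreasing eigenvalue
    return Theta0 @ V, B @ V                         # updated scores and coefficients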


B.2 Why the OS Problem is Solved as an Eigenvector Problem

In the optimal scoring literature, the score matrix Θ that optimizes Problem (B.1) is obtained by means of an eigenvector decomposition of the matrix M = Y^\top X(X^\top X + \Omega)^{-1}X^\top Y.

By definition of the eigen-decomposition, the eigenvectors of the matrix M (called w in (B.7)) form a basis, so that any score vector θ can be expressed as a linear combination of them:

\theta_k = \sum_{m=1}^{K-1}\alpha_m w_m, \quad \text{s.t.} \;\; \theta_k^\top\theta_k = 1.   (B.8)

The orthogonality constraint \theta_k^\top\theta_k = 1 can also be expressed as a function of this basis:

\left(\sum_{m=1}^{K-1}\alpha_m w_m\right)^\top\left(\sum_{m=1}^{K-1}\alpha_m w_m\right) = 1,

which, by the eigenvector properties, reduces to

\sum_{m=1}^{K-1}\alpha_m^2 = 1.   (B.9)

Let M be multiplied by a score vector \theta_k, which can be replaced by its linear combination of eigenvectors w_m (B.8):

M\theta_k = M\sum_{m=1}^{K-1}\alpha_m w_m = \sum_{m=1}^{K-1}\alpha_m M w_m.

As the w_m are the eigenvectors of M, the relationship M w_m = \lambda_m w_m can be used to obtain

M\theta_k = \sum_{m=1}^{K-1}\alpha_m\lambda_m w_m.

Multiplying on the left by \theta_k^\top, expressed as its corresponding linear combination of eigenvectors,

\theta_k^\top M\theta_k = \left(\sum_{\ell=1}^{K-1}\alpha_\ell w_\ell\right)^\top\left(\sum_{m=1}^{K-1}\alpha_m\lambda_m w_m\right).

This equation can be simplified using the orthogonality property of eigenvectors, according to which w_\ell^\top w_m = 0 for any \ell \neq m, giving

\theta_k^\top M\theta_k = \sum_{m=1}^{K-1}\alpha_m^2\lambda_m.


The optimization problem (B.5) for discriminant direction k can then be rewritten as

\max_{\theta_k\in\mathbb{R}^{K}} \left\{\theta_k^\top M\theta_k\right\} = \max_{\theta_k} \sum_{m=1}^{K-1}\alpha_m^2\lambda_m,   (B.10)
\text{with } \theta_k = \sum_{m=1}^{K-1}\alpha_m w_m \text{ and } \sum_{m=1}^{K-1}\alpha_m^2 = 1.

One way of maximizing Problem (B.10) is choosing \alpha_m = 1 for m = k and \alpha_m = 0 otherwise. Hence, as \theta_k = \sum_{m=1}^{K-1}\alpha_m w_m, the resulting score vector \theta_k is equal to the kth eigenvector w_k. As a summary, it can be concluded that the solution of the original problem (B.1) can be achieved by an eigenvector decomposition of the matrix M = Y^\top X(X^\top X + \Omega)^{-1}X^\top Y.


C Solving Fisher's Discriminant Problem

The classical Fisher's discriminant problem seeks a projection that best separates the class centers while every class remains compact. This is formalized as looking for a projection such that the projected data has maximal between-class variance under a unitary constraint on the within-class variance:

\max_{\beta\in\mathbb{R}^p} \; \beta^\top\Sigma_B\beta   (C.1a)
\text{s.t.} \; \beta^\top\Sigma_W\beta = 1,   (C.1b)

where \Sigma_B and \Sigma_W are respectively the between-class variance and the within-class variance of the original p-dimensional data.

The Lagrangian of Problem (C.1) is

L(\beta, \nu) = \beta^\top\Sigma_B\beta - \nu\left(\beta^\top\Sigma_W\beta - 1\right),

so that its first derivative with respect to β is

\frac{\partial L(\beta,\nu)}{\partial\beta} = 2\Sigma_B\beta - 2\nu\Sigma_W\beta.

A necessary optimality condition for \beta^\star is that this derivative is zero, that is,

\Sigma_B\beta^\star = \nu\Sigma_W\beta^\star.

Provided \Sigma_W is full rank, we have

\Sigma_W^{-1}\Sigma_B\beta^\star = \nu\beta^\star.   (C.2)

Thus, the solutions \beta^\star match the definition of an eigenvector of the matrix \Sigma_W^{-1}\Sigma_B with eigenvalue ν. To characterize this eigenvalue, we note that the objective function (C.1a) can be expressed as follows:

\beta^{\star\top}\Sigma_B\beta^\star = \beta^{\star\top}\Sigma_W\Sigma_W^{-1}\Sigma_B\beta^\star = \nu\,\beta^{\star\top}\Sigma_W\beta^\star \;\text{(from (C.2))} = \nu \;\text{(from (C.1b))}.

That is, the optimal value of the objective function to be maximized is the eigenvalue ν. Hence ν is the largest eigenvalue of \Sigma_W^{-1}\Sigma_B, and \beta^\star is any eigenvector corresponding to this maximal eigenvalue.
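A direct numerical sketch of this result (illustrative only; the helper name is ours):

import numpy as np

def fisher_direction(Sigma_W, Sigma_B):
    """Leading eigenvector of Sigma_W^{-1} Sigma_B, i.e. the Fisher direction."""
    evals, evecs = np.linalg.eig(np.linalg.solve(Sigma_W, Sigma_B))
    evals, evecs = np.real(evals), np.real(evecs)     # the leading eigenpair is real here
    return evecs[:, np.argmax(evals)]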


D Alternative Variational Formulation for the Group-Lasso

In this appendix, an alternative to the variational form of the group-Lasso (4.21) presented in Section 4.3.1 is proposed:

\min_{\tau\in\mathbb{R}^p}\;\min_{B\in\mathbb{R}^{p\times(K-1)}}\; J(B) + \lambda\sum_{j=1}^{p}\frac{w_j^2\|\beta^j\|_2^2}{\tau_j}   (D.1a)
\text{s.t.} \; \sum_{j=1}^{p}\tau_j = 1,   (D.1b)
\tau_j \ge 0, \quad j = 1, \dots, p.   (D.1c)

Following the approach detailed in Section 4.3.1, its equivalence with the standard group-Lasso formulation is demonstrated here. Let B \in \mathbb{R}^{p\times(K-1)} be a matrix composed of row vectors \beta^j \in \mathbb{R}^{K-1}: B = \left(\beta^{1\top}, \dots, \beta^{p\top}\right)^\top.

L(B, \tau, \lambda, \nu_0, \nu_j) = J(B) + \lambda\sum_{j=1}^{p}\frac{w_j^2\|\beta^j\|_2^2}{\tau_j} + \nu_0\left(\sum_{j=1}^{p}\tau_j - 1\right) - \sum_{j=1}^{p}\nu_j\tau_j   (D.2)

The starting point is the Lagrangian (D.2), which is differentiated with respect to \tau_j to get the optimal value \tau_j^\star:

\frac{\partial L(B,\tau,\lambda,\nu_0,\nu_j)}{\partial\tau_j}\bigg|_{\tau_j=\tau_j^\star} = 0 \;\Rightarrow\; -\lambda\frac{w_j^2\|\beta^j\|_2^2}{\tau_j^{\star 2}} + \nu_0 - \nu_j = 0
\;\Rightarrow\; -\lambda w_j^2\|\beta^j\|_2^2 + \nu_0\tau_j^{\star 2} - \nu_j\tau_j^{\star 2} = 0
\;\Rightarrow\; -\lambda w_j^2\|\beta^j\|_2^2 + \nu_0\tau_j^{\star 2} = 0.

The last two expressions are related through a property of the Lagrange multipliers, which states that \nu_j^\star g_j(\tau^\star) = 0, where \nu_j^\star is the Lagrange multiplier and g_j(\tau) the corresponding inequality constraint. Then the optimal \tau_j^\star can be deduced:

\tau_j^\star = \frac{\sqrt{\lambda}}{\sqrt{\nu_0}}\, w_j\|\beta^j\|_2.

Placing this optimal value of \tau_j^\star into constraint (D.1b):

\sum_{j=1}^{p}\tau_j = 1 \;\Rightarrow\; \tau_j^\star = \frac{w_j\|\beta^j\|_2}{\sum_{j=1}^{p} w_j\|\beta^j\|_2}.   (D.3)


With this value of \tau_j^\star, Problem (D.1) is equivalent to

\min_{B\in\mathbb{R}^{p\times(K-1)}}\; J(B) + \lambda\left(\sum_{j=1}^{p} w_j\|\beta^j\|_2\right)^2.   (D.4)

This problem is a slight alteration of the standard group-Lasso, as the penalty is squared compared to the usual form. This square only affects the strength of the penalty, and the usual properties of the group-Lasso apply to the solution of Problem (D.4); in particular, its solution is expected to be sparse, with some null vectors \beta^j.

The penalty term of (D.1a) can be conveniently presented as \lambda B^\top\Omega B, where

\Omega = \operatorname{diag}\left(\frac{w_1^2}{\tau_1}, \frac{w_2^2}{\tau_2}, \dots, \frac{w_p^2}{\tau_p}\right).   (D.5)

Using the value of \tau_j^\star from (D.3), each diagonal component of Ω is

(\Omega)_{jj} = \frac{w_j\sum_{j'=1}^{p} w_{j'}\|\beta^{j'}\|_2}{\|\beta^j\|_2}.   (D.6)

In the following paragraphs, the optimality conditions and properties developed for the quadratic variational approach detailed in Section 4.3.1 are also derived for this alternative formulation.

D.1 Useful Properties

Lemma D.1. If J is convex, Problem (D.1) is convex.

In what follows, J will be a convex quadratic (hence smooth) function, in which case a necessary and sufficient optimality condition is that zero belongs to the subdifferential of the objective function, whose expression is provided in the following lemma.

Lemma D.2. For all B \in \mathbb{R}^{p\times(K-1)}, the subdifferential of the objective function of Problem (D.4) is

\left\{ V \in \mathbb{R}^{p\times(K-1)} : V = \frac{\partial J(B)}{\partial B} + 2\lambda\left(\sum_{j=1}^{p} w_j\|\beta^j\|_2\right) G \right\},   (D.7)

where G = \left(g^{1\top}, \dots, g^{p\top}\right)^\top is a p×(K−1) matrix defined as follows. Let S(B) denote the row support of B, S(B) = \{j \in \{1,\dots,p\} : \|\beta^j\|_2 \neq 0\}; then we have

\forall j \in S(B), \quad g^j = w_j\|\beta^j\|_2^{-1}\beta^j,   (D.8)
\forall j \notin S(B), \quad \|g^j\|_2 \le w_j.   (D.9)


This condition results in an equality for the "active" non-zero vectors \beta^j and an inequality for the other ones, which both provide essential building blocks of our algorithm.

Lemma D.3. Problem (D.4) admits at least one solution, which is unique if J(B) is strictly convex. All critical points B^\star of the objective function verifying the following conditions are global minima. Let S(B^\star) denote the row support of B^\star, S(B^\star) = \{j \in \{1,\dots,p\} : \|\beta^{\star j}\|_2 \neq 0\}, and let \bar{S}(B^\star) be its complement; then we have

\forall j \in S(B^\star), \quad -\frac{\partial J(B^\star)}{\partial\beta^j} = 2\lambda\left(\sum_{j'=1}^{p} w_{j'}\|\beta^{\star j'}\|_2\right) w_j\|\beta^{\star j}\|_2^{-1}\beta^{\star j},   (D.10a)
\forall j \in \bar{S}(B^\star), \quad \left\|\frac{\partial J(B^\star)}{\partial\beta^j}\right\|_2 \le 2\lambda w_j\left(\sum_{j'=1}^{p} w_{j'}\|\beta^{\star j'}\|_2\right).   (D.10b)

In particular, Lemma D.3 provides a well-defined appraisal of the support of the solution, which is not easily handled from the direct analysis of the variational problem (D.1).

D.2 An Upper Bound on the Objective Function

Lemma D.4. The objective function of the variational form (D.1) is an upper bound on the group-Lasso objective function (D.4), and, for a given B, the gap between these objectives is null at \tau^\star such that

\tau_j^\star = \frac{w_j\|\beta^j\|_2}{\sum_{j=1}^{p} w_j\|\beta^j\|_2}.

Proof. The objective functions of (4.21) and (4.24) only differ in their second term. Let \tau \in \mathbb{R}^p be any feasible vector; we have

\left(\sum_{j=1}^{p} w_j\|\beta^j\|_2\right)^2 = \left(\sum_{j=1}^{p}\tau_j^{1/2}\,\frac{w_j\|\beta^j\|_2}{\tau_j^{1/2}}\right)^2 \le \left(\sum_{j=1}^{p}\tau_j\right)\left(\sum_{j=1}^{p}\frac{w_j^2\|\beta^j\|_2^2}{\tau_j}\right) \le \sum_{j=1}^{p}\frac{w_j^2\|\beta^j\|_2^2}{\tau_j},

where we used the Cauchy-Schwarz inequality in the second line and the definition of the feasibility set of τ in the last one.
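The tightness of the bound at \tau^\star can be checked numerically; the snippet below is a self-contained illustration with random data (variable names are ours).

import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(6, 3))                    # random coefficient matrix (p = 6, K - 1 = 3)
w = rng.uniform(1.0, 2.0, size=6)              # positive weights
norms = np.linalg.norm(B, axis=1)
group_lasso_sq = (w @ norms) ** 2              # squared group-Lasso penalty of (D.4)
tau = w * norms / (w @ norms)                  # tau* from Lemma D.4
variational = np.sum(w**2 * norms**2 / tau)    # penalty term of (D.1a) evaluated at tau*
assert np.isclose(group_lasso_sq, variational)  # the gap is null at tau*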


This lemma only holds for the alternative variational formulation described in this appendix. It is difficult to obtain the same result for the first variational form (Section 4.3.1), because the definitions of the feasible sets of τ and β are intertwined.


E Invariance of the Group-Lasso to Unitary Transformations

The computational trick described in Section 5.2 for quadratic penalties can be applied to the group-Lasso provided that the following holds: if the regression coefficients B_0^\star are optimal for the score values \Theta_0, and if the optimal scores \Theta^\star are obtained by a unitary transformation of \Theta_0, say \Theta^\star = \Theta_0 V (where V \in \mathbb{R}^{M\times M} is a unitary matrix), then B^\star = B_0^\star V is optimal conditionally on \Theta^\star; that is, (\Theta^\star, B^\star) is a global solution of the optimal scoring problem. To show this, we use the standard group-Lasso formulation and prove the following proposition.

Proposition E.1. Let B^\star be a solution of

\min_{B\in\mathbb{R}^{p\times M}} \|Y - XB\|_F^2 + \lambda\sum_{j=1}^{p} w_j\|\beta^j\|_2,   (E.1)

and let \tilde{Y} = YV, where V \in \mathbb{R}^{M\times M} is a unitary matrix. Then \tilde{B} = B^\star V is a solution of

\min_{B\in\mathbb{R}^{p\times M}} \|\tilde{Y} - XB\|_F^2 + \lambda\sum_{j=1}^{p} w_j\|\beta^j\|_2.   (E.2)

Proof. The first-order necessary optimality conditions for B^\star are

\forall j \in S(B^\star), \quad 2\mathbf{x}^{j\top}\left(\mathbf{x}^j\beta^{\star j} - Y\right) + \lambda w_j\|\beta^{\star j}\|_2^{-1}\beta^{\star j} = 0,   (E.3a)
\forall j \notin S(B^\star), \quad 2\left\|\mathbf{x}^{j\top}\left(\mathbf{x}^j\beta^{\star j} - Y\right)\right\|_2 \le \lambda w_j,   (E.3b)

where S(B^\star) \subseteq \{1,\dots,p\} denotes the set of non-zero row vectors of B^\star, and \bar{S}(B^\star) is its complement.

First, we note that from the definition of \tilde{B} we have S(\tilde{B}) = S(B^\star). Then we may rewrite the above conditions as follows:

\forall j \in S(\tilde{B}), \quad 2\mathbf{x}^{j\top}\left(\mathbf{x}^j\tilde{\beta}^j - \tilde{Y}\right) + \lambda w_j\|\tilde{\beta}^j\|_2^{-1}\tilde{\beta}^j = 0,   (E.4a)
\forall j \notin S(\tilde{B}), \quad 2\left\|\mathbf{x}^{j\top}\left(\mathbf{x}^j\tilde{\beta}^j - \tilde{Y}\right)\right\|_2 \le \lambda w_j,   (E.4b)

where (E.4a) is obtained by multiplying both sides of Equation (E.3a) by V, and also uses that VV^\top = I, so that \forall u \in \mathbb{R}^M, \|u^\top\|_2 = \|u^\top V\|_2; Equation (E.4b) is also obtained from the latter relationship. Conditions (E.4) are then recognized as the first-order necessary conditions for \tilde{B} to be a solution of Problem (E.2). As the latter is convex, these conditions are sufficient, which concludes the proof.


F Expected Complete Likelihood and Likelihood

Section 7.1.2 explains that, with the maximization of the conditional expectation of the complete log-likelihood Q(θ, θ′) (7.7) by means of the EM algorithm, the log-likelihood (7.1) is also maximized. The value of the log-likelihood can be computed using its definition (7.1), but there is a shorter way to compute it from Q(θ, θ′) when the latter is available:

L(\theta) = \sum_{i=1}^{n}\log\left(\sum_{k=1}^{K}\pi_k f_k(x_i;\theta_k)\right)   (F.1)

Q(\theta,\theta') = \sum_{i=1}^{n}\sum_{k=1}^{K} t_{ik}(\theta')\log\left(\pi_k f_k(x_i;\theta_k)\right)   (F.2)

\text{with} \quad t_{ik}(\theta') = \frac{\pi'_k f_k(x_i;\theta'_k)}{\sum_{\ell}\pi'_\ell f_\ell(x_i;\theta'_\ell)}.   (F.3)

In the EM algorithm, θ′ denotes the model parameters at the previous iteration, t_ik(θ′) are the posterior probabilities computed from θ′ at the previous E-step, and θ (without prime) denotes the parameters of the current iteration, to be obtained by the maximization of Q(θ, θ′).

Using (F.3), we have

Q(\theta,\theta') = \sum_{i,k} t_{ik}(\theta')\log\left(\pi_k f_k(x_i;\theta_k)\right)
 = \sum_{i,k} t_{ik}(\theta')\log(t_{ik}(\theta)) + \sum_{i,k} t_{ik}(\theta')\log\left(\sum_{\ell}\pi_\ell f_\ell(x_i;\theta_\ell)\right)
 = \sum_{i,k} t_{ik}(\theta')\log(t_{ik}(\theta)) + L(\theta).

In particular, after the evaluation of the t_ik in the E-step, where θ = θ′, the log-likelihood can be computed from the value of Q(θ, θ) (7.7) and the entropy of the posterior probabilities:

L(\theta) = Q(\theta,\theta) - \sum_{i,k} t_{ik}(\theta)\log(t_{ik}(\theta)) = Q(\theta,\theta) + H(T).


G Derivation of the M-Step Equations

This appendix shows the whole process to obtain expressions (7.10), (7.11) and (7.12) in the context of a Gaussian mixture model with a common covariance matrix. The criterion is defined as

Q(\theta,\theta') = \sum_{i,k} t_{ik}(\theta')\log\left(\pi_k f_k(x_i;\theta_k)\right)
 = \sum_{k}\left(\sum_{i} t_{ik}\right)\log\pi_k - \frac{np}{2}\log(2\pi) - \frac{n}{2}\log|\Sigma| - \frac{1}{2}\sum_{i,k} t_{ik}(x_i - \mu_k)^\top\Sigma^{-1}(x_i - \mu_k),

which has to be maximized subject to \sum_k \pi_k = 1.

The Lagrangian of this problem is

L(\theta) = Q(\theta,\theta') + \lambda\left(\sum_k \pi_k - 1\right).

Partial derivatives of the Lagrangian are set to zero to obtain the optimal values of \pi_k, \mu_k and Σ.

G.1 Prior probabilities

\frac{\partial L(\theta)}{\partial\pi_k} = 0 \iff \frac{1}{\pi_k}\sum_i t_{ik} + \lambda = 0,

where λ is identified from the constraint, leading to

\pi_k = \frac{1}{n}\sum_i t_{ik}.


G.2 Means

\frac{\partial L(\theta)}{\partial\mu_k} = 0 \iff -\frac{1}{2}\sum_i t_{ik}\, 2\Sigma^{-1}(\mu_k - x_i) = 0 \;\Rightarrow\; \mu_k = \frac{\sum_i t_{ik}\, x_i}{\sum_i t_{ik}}.

G.3 Covariance Matrix

\frac{\partial L(\theta)}{\partial\Sigma^{-1}} = 0 \iff \underbrace{\frac{n}{2}\Sigma}_{\text{as per Property 4}} - \underbrace{\frac{1}{2}\sum_{i,k} t_{ik}(x_i - \mu_k)(x_i - \mu_k)^\top}_{\text{as per Property 5}} = 0
\;\Rightarrow\; \Sigma = \frac{1}{n}\sum_{i,k} t_{ik}(x_i - \mu_k)(x_i - \mu_k)^\top.


Bibliography

F Bach R Jenatton J Mairal and G Obozinski Convex optimization with sparsity-inducing norms Optimization for Machine Learning pages 19ndash54 2011

F R Bach Bolasso model consistent lasso estimation through the bootstrap InProceedings of the 25th international conference on Machine learning ICML 2008

F R Bach R Jenatton J Mairal and G Obozinski Optimization with sparsity-inducing penalties Foundations and Trends in Machine Learning 4(1)1ndash106 2012

J D Banfield and A E Raftery Model-based Gaussian and non-Gaussian clusteringBiometrics pages 803ndash821 1993

A Beck and M Teboulle A fast iterative shrinkage-thresholding algorithm for linearinverse problems SIAM Journal on Imaging Sciences 2(1)183ndash202 2009

H Bensmail and G Celeux Regularized Gaussian discriminant analysis through eigen-value decomposition Journal of the American statistical Association 91(436)1743ndash1748 1996

P J Bickel and E Levina Some theory for Fisher's linear discriminant function, 'naive Bayes', and some alternatives when there are many more variables than observations Bernoulli 10(6)989–1010 2004

C Biernacki G Celeux G Govaert and F Langrognet MIXMOD Statistical Documentation http://www.mixmod.org 2008

C M Bishop Pattern Recognition and Machine Learning Springer New York 2006

C Bouveyron and C Brunet Discriminative variable selection for clustering with thesparse Fisher-EM algorithm Technical Report 12042067 Arxiv e-prints 2012a

C Bouveyron and C Brunet Simultaneous model-based clustering and visualization inthe Fisher discriminative subspace Statistics and Computing 22(1)301ndash324 2012b

S Boyd and L Vandenberghe Convex optimization Cambridge university press 2004

L Breiman Better subset regression using the nonnegative garrote Technometrics 37(4)373ndash384 1995

L Breiman and R Ihaka Nonlinear discriminant analysis via ACE and scaling TechnicalReport 40 University of California Berkeley 1984


T Cai and W Liu A direct estimation approach to sparse linear discriminant analysisJournal of the American Statistical Association 106(496)1566ndash1577 2011

S Canu and Y Grandvalet Outcomes of the equivalence of adaptive ridge with least absolute shrinkage Advances in Neural Information Processing Systems page 445 1999

C Caramanis S Mannor and H Xu Robust optimization in machine learning InS Sra S Nowozin and S J Wright editors Optimization for Machine Learningpages 369ndash402 MIT Press 2012

B Chidlovskii and L Lecerf Scalable feature selection for multi-class problems InW Daelemans B Goethals and K Morik editors Machine Learning and KnowledgeDiscovery in Databases volume 5211 of Lecture Notes in Computer Science pages227ndash240 Springer 2008

L Clemmensen T Hastie D Witten and B Ersbøll Sparse discriminant analysis Technometrics 53(4)406–413 2011

C De Mol E De Vito and L Rosasco Elastic-net regularization in learning theoryJournal of Complexity 25(2)201ndash230 2009

A P Dempster N M Laird and D B Rubin Maximum likelihood from incomplete data via the EM algorithm Journal of the Royal Statistical Society Series B (Methodological) 39(1)1–38 1977 ISSN 0035-9246

D L Donoho M Elad and V N Temlyakov Stable recovery of sparse overcompleterepresentations in the presence of noise IEEE Transactions on Information Theory52(1)6ndash18 2006

R O Duda P E Hart and D G Stork Pattern Classification Wiley 2000

B Efron T Hastie I Johnstone and R Tibshirani Least angle regression The Annalsof statistics 32(2)407ndash499 2004

Jianqing Fan and Yingying Fan High dimensional classification using features annealedindependence rules Annals of statistics 36(6)2605 2008

R A Fisher The use of multiple measurements in taxonomic problems Annals ofHuman Genetics 7(2)179ndash188 1936

V Franc and S Sonnenburg Optimized cutting plane algorithm for support vectormachines In Proceedings of the 25th international conference on Machine learningpages 320ndash327 ACM 2008

J Friedman T Hastie and R Tibshirani The Elements of Statistical Learning DataMining Inference and Prediction Springer 2009


J Friedman T Hastie and R Tibshirani A note on the group lasso and a sparse grouplasso Technical Report 10010736 ArXiv e-prints 2010

J H Friedman Regularized discriminant analysis Journal of the American StatisticalAssociation 84(405)165ndash175 1989

W J Fu Penalized regressions the bridge versus the lasso Journal of Computationaland Graphical Statistics 7(3)397ndash416 1998

A Gelman J B Carlin H S Stern and D B Rubin Bayesian Data Analysis Chapman & Hall/CRC 2003

D Ghosh and A M Chinnaiyan Classification and selection of biomarkers in genomicdata using lasso Journal of Biomedicine and Biotechnology 2147ndash154 2005

G Govaert Y Grandvalet X Liu and L F Sanchez Merchante Implementation base-line for clustering Technical Report D71-m12 Massive Sets of Heuristics for MachineLearning httpssecuremash-projecteufilesmash-deliverable-D71-m12pdf 2010

G Govaert Y Grandvalet B Laval X Liu and L F Sanchez Merchante Implemen-tations of original clustering Technical Report D72-m24 Massive Sets of Heuristicsfor Machine Learning httpssecuremash-projecteufilesmash-deliverable-D72-m24pdf 2011

Y Grandvalet Least absolute shrinkage is equivalent to quadratic penalization InPerspectives in Neural Computing volume 98 pages 201ndash206 1998

Y Grandvalet and S Canu Adaptive scaling for feature selection in svms Advances inNeural Information Processing Systems 15553ndash560 2002

L Grosenick S Greer and B Knutson Interpretable classifiers for fMRI improveprediction of purchases IEEE Transactions on Neural Systems and RehabilitationEngineering 16(6)539ndash548 2008

Y Guermeur G Pollastri A Elisseeff D Zelus H Paugam-Moisy and P Baldi Com-bining protein secondary structure prediction models with ensemble methods of opti-mal complexity Neurocomputing 56305ndash327 2004

J Guo E Levina G Michailidis and J Zhu Pairwise variable selection for high-dimensional model-based clustering Biometrics 66(3)793ndash804 2010

I Guyon and A Elisseeff An introduction to variable and feature selection Journal ofMachine Learning Research 31157ndash1182 2003

T Hastie and R Tibshirani Discriminant analysis by Gaussian mixtures Journal ofthe Royal Statistical Society Series B (Methodological) 58(1)155ndash176 1996

T Hastie R Tibshirani and A Buja Flexible discriminant analysis by optimal scoringJournal of the American Statistical Association 89(428)1255ndash1270 1994


T Hastie A Buja and R Tibshirani Penalized discriminant analysis The Annals ofStatistics 23(1)73ndash102 1995

A E Hoerl and R W Kennard Ridge regression Biased estimation for nonorthogonalproblems Technometrics 12(1)55ndash67 1970

J Huang S Ma H Xie and C H Zhang A group bridge approach for variableselection Biometrika 96(2)339ndash355 2009

T Joachims Training linear svms in linear time In Proceedings of the 12th ACMSIGKDD international conference on Knowledge discovery and data mining pages217ndash226 ACM 2006

K Knight and W Fu Asymptotics for lasso-type estimators The Annals of Statistics28(5)1356ndash1378 2000

P F Kuan S Wang X Zhou and H Chu A statistical framework for illumina DNAmethylation arrays Bioinformatics 26(22)2849ndash2855 2010

T Lange M Braun V Roth and J Buhmann Stability-based model selection Ad-vances in Neural Information Processing Systems 15617ndash624 2002

M H C Law M A T Figueiredo and A K Jain Simultaneous feature selectionand clustering using mixture models IEEE Transactions on Pattern Analysis andMachine Intelligence 26(9)1154ndash1166 2004

Y Lee Y Lin and G Wahba Multicategory support vector machines Journal of theAmerican Statistical Association 99(465)67ndash81 2004

C Leng Sparse optimal scoring for multiclass cancer diagnosis and biomarker detectionusing microarray data Computational Biology and Chemistry 32(6)417ndash425 2008

C Leng Y Lin and G Wahba A note on the lasso and related procedures in modelselection Statistica Sinica 16(4)1273 2006

H Liu and L Yu Toward integrating feature selection algorithms for classification andclustering IEEE Transactions on Knowledge and Data Engineering 17(4)491ndash5022005

J MacQueen Some methods for classification and analysis of multivariate observa-tions In Proceedings of the fifth Berkeley Symposium on Mathematical Statistics andProbability volume 1 pages 281ndash297 University of California Press 1967

Q Mai H Zou and M Yuan A direct approach to sparse discriminant analysis inultra-high dimensions Biometrika 99(1)29ndash42 2012

C Maugis G Celeux and M L Martin-Magniette Variable selection for clusteringwith Gaussian mixture models Biometrics 65(3)701ndash709 2009a


C Maugis G Celeux and M L Martin-Magniette SelvarClust: software for variable selection in model-based clustering http://www.math.univ-toulouse.fr/~maugis/SelvarClustHomepage.html 2009b

L Meier S Van De Geer and P Bühlmann The group lasso for logistic regression Journal of the Royal Statistical Society Series B (Statistical Methodology) 70(1)53–71 2008

N Meinshausen and P Bühlmann High-dimensional graphs and variable selection with the lasso The Annals of Statistics 34(3)1436–1462 2006

B Moghaddam Y Weiss and S Avidan Generalized spectral bounds for sparse LDAIn Proceedings of the 23rd international conference on Machine learning pages 641ndash648 ACM 2006

B Moghaddam Y Weiss and S Avidan Fast pixelpart selection with sparse eigen-vectors In IEEE 11th International Conference on Computer Vision 2007 ICCV2007 pages 1ndash8 2007

Y Nesterov Gradient methods for minimizing composite functions preprint 2007

S Newcomb A generalized theory of the combination of observations so as to obtainthe best result American Journal of Mathematics 8(4)343ndash366 1886

B Ng and R Abugharbieh Generalized group sparse classifiers with application in fMRIbrain decoding In Computer Vision and Pattern Recognition (CVPR) 2011 IEEEConference on pages 1065ndash1071 IEEE 2011

M R Osborne B Presnell and B A Turlach On the lasso and its dual Journal ofComputational and Graphical statistics 9(2)319ndash337 2000a

M R Osborne B Presnell and B A Turlach A new approach to variable selection inleast squares problems IMA Journal of Numerical Analysis 20(3)389ndash403 2000b

W Pan and X Shen Penalized model-based clustering with application to variableselection Journal of Machine Learning Research 81145ndash1164 2007

W Pan X Shen A Jiang and R P Hebbel Semi-supervised learning via penalizedmixture model with application to microarray sample classification Bioinformatics22(19)2388ndash2395 2006

K Pearson Contributions to the mathematical theory of evolution Philosophical Trans-actions of the Royal Society of London 18571ndash110 1894

S Perkins K Lacker and J Theiler Grafting Fast incremental feature selection bygradient descent in function space Journal of Machine Learning Research 31333ndash1356 2003

Z. Qiao, L. Zhou, and J. Huang. Sparse linear discriminant analysis with applications to high dimensional low sample size data. International Journal of Applied Mathematics, 39(1), 2009.

A. E. Raftery and N. Dean. Variable selection for model-based clustering. Journal of the American Statistical Association, 101(473):168–178, 2006.

C. R. Rao. The utilization of multiple measurements in problems of biological classification. Journal of the Royal Statistical Society. Series B (Methodological), 10(2):159–203, 1948.

S. Rosset and J. Zhu. Piecewise linear regularized solution paths. The Annals of Statistics, 35(3):1012–1030, 2007.

V. Roth. The generalized lasso. IEEE Transactions on Neural Networks, 15(1):16–28, 2004.

V. Roth and B. Fischer. The group-lasso for generalized linear models: uniqueness of solutions and efficient algorithms. In W. W. Cohen, A. McCallum, and S. T. Roweis, editors, Machine Learning, Proceedings of the Twenty-Fifth International Conference (ICML 2008), volume 307 of ACM International Conference Proceeding Series, pages 848–855, 2008.

V. Roth and T. Lange. Feature selection in clustering problems. In S. Thrun, L. K. Saul, and B. Scholkopf, editors, Advances in Neural Information Processing Systems 16, pages 473–480. MIT Press, 2004.

C. Sammut and G. I. Webb. Encyclopedia of Machine Learning. Springer-Verlag New York Inc., 2010.

L. F. Sanchez Merchante, Y. Grandvalet, and G. Govaert. An efficient approach to sparse linear discriminant analysis. In Proceedings of the 29th International Conference on Machine Learning, ICML, 2012.

G. Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464, 1978.

A. J. Smola, S. V. N. Vishwanathan, and Q. Le. Bundle methods for machine learning. Advances in Neural Information Processing Systems, 20:1377–1384, 2008.

S. Sonnenburg, G. Ratsch, C. Schafer, and B. Scholkopf. Large scale multiple kernel learning. Journal of Machine Learning Research, 7:1531–1565, 2006.

P. Sprechmann, I. Ramirez, G. Sapiro, and Y. Eldar. Collaborative hierarchical sparse modeling. In Information Sciences and Systems (CISS), 2010 44th Annual Conference on, pages 1–6. IEEE, 2010.

M. Szafranski. Penalites Hierarchiques pour l'Integration de Connaissances dans les Modeles Statistiques. PhD thesis, Universite de Technologie de Compiegne, 2008.

M. Szafranski, Y. Grandvalet, and P. Morizet-Mahoudeaux. Hierarchical penalization. Advances in Neural Information Processing Systems, 2008.

R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.

J. E. Vogt and V. Roth. The group-lasso: l1,∞ regularization versus l1,2 regularization. In Pattern Recognition, 32nd DAGM Symposium, Lecture Notes in Computer Science, 2010.

S. Wang and J. Zhu. Variable selection for model-based high-dimensional clustering and its application to microarray data. Biometrics, 64(2):440–448, 2008.

D. Witten and R. Tibshirani. Penalized classification using Fisher's linear discriminant. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(5):753–772, 2011.

D. M. Witten and R. Tibshirani. A framework for feature selection in clustering. Journal of the American Statistical Association, 105(490):713–726, 2010.

D. M. Witten, R. Tibshirani, and T. Hastie. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3):515–534, 2009.

M. Wu and B. Scholkopf. A local learning approach for clustering. Advances in Neural Information Processing Systems, 19:1529, 2007.

M. C. Wu, L. Zhang, Z. Wang, D. C. Christiani, and X. Lin. Sparse linear discriminant analysis for simultaneous testing for the significance of a gene set/pathway and gene selection. Bioinformatics, 25(9):1145–1151, 2009.

T. T. Wu and K. Lange. Coordinate descent algorithms for lasso penalized regression. The Annals of Applied Statistics, pages 224–244, 2008.

B. Xie, W. Pan, and X. Shen. Penalized model-based clustering with cluster-specific diagonal covariance matrices and grouped variables. Electronic Journal of Statistics, 2:168–172, 2008a.

B. Xie, W. Pan, and X. Shen. Variable selection in penalized model-based clustering via regularization on grouped parameters. Biometrics, 64(3):921–930, 2008b.

C. Yang, X. Wan, Q. Yang, H. Xue, and W. Yu. Identifying main effects and epistatic interactions from large-scale SNP data via adaptive group lasso. BMC Bioinformatics, 11(Suppl 1):S18, 2010.

J. Ye. Least squares linear discriminant analysis. In Proceedings of the 24th International Conference on Machine Learning, pages 1087–1093. ACM, 2007.

M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49–67, 2006.

P. Zhao and B. Yu. On model selection consistency of lasso. Journal of Machine Learning Research, 7(2):2541, 2007.

P. Zhao, G. Rocha, and B. Yu. The composite absolute penalties family for grouped and hierarchical variable selection. The Annals of Statistics, 37(6A):3468–3497, 2009.

H. Zhou, W. Pan, and X. Shen. Penalized model-based clustering with unconstrained covariance matrices. Electronic Journal of Statistics, 3:1473–1496, 2009.

H. Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429, 2006.

H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005.


Contents

4.1.3 Penalized Linear Discriminant Analysis 39

4.1.4 Summary 40

4.2 Practicalities 41

4.2.1 Solution of the Penalized Optimal Scoring Regression 41

4.2.2 Distance Evaluation 42

4.2.3 Posterior Probability Evaluation 43

4.2.4 Graphical Representation 43

4.3 From Sparse Optimal Scoring to Sparse LDA 43

4.3.1 A Quadratic Variational Form 44

4.3.2 Group-Lasso OS as Penalized LDA 47

5 GLOSS Algorithm 49
5.1 Regression Coefficients Updates 49

5.1.1 Cholesky decomposition 52

5.1.2 Numerical Stability 52

5.2 Score Matrix 52

5.3 Optimality Conditions 53

5.4 Active and Inactive Sets 54

5.5 Penalty Parameter 54

5.6 Options and Variants 55

5.6.1 Scaling Variables 55

5.6.2 Sparse Variant 55

5.6.3 Diagonal Variant 55

5.6.4 Elastic net and Structured Variant 55

6 Experimental Results 57
6.1 Normalization 57

6.2 Decision Thresholds 57

6.3 Simulated Data 58

6.4 Gene Expression Data 60

6.5 Correlated Data 63

Discussion 63

III Sparse Clustering Analysis 67

Abstract 69

7 Feature Selection in Mixture Models 71
7.1 Mixture Models 71

7.1.1 Model 71

7.1.2 Parameter Estimation: The EM Algorithm 72

7.2 Feature Selection in Model-Based Clustering 75
7.2.1 Based on Penalized Likelihood 76
7.2.2 Based on Model Variants 77
7.2.3 Based on Model Selection 79

8 Theoretical Foundations 81
8.1 Resolving EM with Optimal Scoring 81

8.1.1 Relationship Between the M-Step and Linear Discriminant Analysis 81
8.1.2 Relationship Between Optimal Scoring and Linear Discriminant Analysis 82
8.1.3 Clustering Using Penalized Optimal Scoring 82
8.1.4 From Sparse Optimal Scoring to Sparse Linear Discriminant Analysis 83

8.2 Optimized Criterion 83
8.2.1 A Bayesian Derivation 84
8.2.2 Maximum a Posteriori Estimator 85

9 Mix-GLOSS Algorithm 87
9.1 Mix-GLOSS 87

9.1.1 Outer Loop: Whole Algorithm Repetitions 87
9.1.2 Penalty Parameter Loop 88
9.1.3 Inner Loop: EM Algorithm 89

9.2 Model Selection 91

10 Experimental Results 93
10.1 Tested Clustering Algorithms 93
10.2 Results 95
10.3 Discussion 97

Conclusions 97

Appendix 103

A Matrix Properties 105

B The Penalized-OS Problem is an Eigenvector Problem 107
B.1 How to Solve the Eigenvector Decomposition 107
B.2 Why the OS Problem is Solved as an Eigenvector Problem 109

C Solving Fisher's Discriminant Problem 111

D Alternative Variational Formulation for the Group-Lasso 113
D.1 Useful Properties 114
D.2 An Upper Bound on the Objective Function 115

E Invariance of the Group-Lasso to Unitary Transformations 117

F Expected Complete Likelihood and Likelihood 119

G Derivation of the M-Step Equations 121
G.1 Prior probabilities 121
G.2 Means 122
G.3 Covariance Matrix 122

Bibliography 123

List of Figures

1.1 MASH project logo 5

2.1 Example of relevant features 10
2.2 Four key steps of feature selection 11
2.3 Admissible sets in two dimensions for different pure norms ||β||p 14
2.4 Two dimensional regularized problems with ||β||1 and ||β||2 penalties 15
2.5 Admissible sets for the Lasso and Group-Lasso 20
2.6 Sparsity patterns for an example with 8 variables characterized by 4 parameters 20

4.1 Graphical representation of the variational approach to Group-Lasso 45

5.1 GLOSS block diagram 50
5.2 Graph and Laplacian matrix for a 3 × 3 image 56

6.1 TPR versus FPR for all simulations 60
6.2 2D-representations of Nakayama and Sun datasets based on the two first discriminant vectors provided by GLOSS and SLDA 62
6.3 USPS digits "1" and "0" 63
6.4 Discriminant direction between digits "1" and "0" 64
6.5 Sparse discriminant direction between digits "1" and "0" 64

9.1 Mix-GLOSS Loops Scheme 88
9.2 Mix-GLOSS model selection diagram 92

10.1 Class mean vectors for each artificial simulation 94
10.2 TPR versus FPR for all simulations 97

List of Tables

6.1 Experimental results for simulated data, supervised classification 59
6.2 Average TPR and FPR for all simulations 60
6.3 Experimental results for gene expression data, supervised classification 61

10.1 Experimental results for simulated data, unsupervised clustering 96
10.2 Average TPR versus FPR for all clustering simulations 96

Notation and Symbols

Throughout this thesis, vectors are denoted by lowercase letters in bold font and matrices by uppercase letters in bold font. Unless otherwise stated, vectors are column vectors and parentheses are used to build line vectors from comma-separated lists of scalars, or to build matrices from comma-separated lists of column vectors.

Sets

N    the set of natural numbers, N = {1, 2, . . .}
R    the set of reals
|A|    cardinality of a set A (for finite sets, the number of elements)
Ā    complement of set A

Data

X    input domain
xi    input sample, xi ∈ X
X    design matrix, X = (x1ᵀ, . . . , xnᵀ)ᵀ
xj    column j of X
yi    class indicator of sample i
Y    indicator matrix, Y = (y1ᵀ, . . . , ynᵀ)ᵀ
z    complete data, z = (x, y)
Gk    set of the indices of observations belonging to class k
n    number of examples
K    number of classes
p    dimension of X
i, j, k    indices running over N

Vectors, Matrices and Norms

0    vector with all entries equal to zero
1    vector with all entries equal to one
I    identity matrix
Aᵀ    transpose of matrix A (ditto for vector)
A⁻¹    inverse of matrix A
tr(A)    trace of matrix A
|A|    determinant of matrix A
diag(v)    diagonal matrix with v on the diagonal
||v||1    L1 norm of vector v
||v||2    L2 norm of vector v
||A||F    Frobenius norm of matrix A

Probability

E[·]    expectation of a random variable
var[·]    variance of a random variable
N(µ, σ²)    normal distribution with mean µ and variance σ²
W(W, ν)    Wishart distribution with ν degrees of freedom and W scale matrix
H(X)    entropy of random variable X
I(X, Y)    mutual information between random variables X and Y

Mixture Models

yik    hard membership of sample i to cluster k
fk    distribution function for cluster k
tik    posterior probability of sample i to belong to cluster k
T    posterior probability matrix
πk    prior probability or mixture proportion for cluster k
µk    mean vector of cluster k
Σk    covariance matrix of cluster k
θk    parameter vector for cluster k, θk = (µk, Σk)
θ(t)    parameter vector at iteration t of the EM algorithm
f(X; θ)    likelihood function
L(θ; X)    log-likelihood function
LC(θ; X, Y)    complete log-likelihood function

Optimization

J(·)    cost function
L(·)    Lagrangian
β̂    generic notation for the solution with respect to β
βls    least squares solution coefficient vector
A    active set
γ    step size to update regularization path
h    direction to update regularization path

Penalized models

λ, λ1, λ2    penalty parameters
Pλ(θ)    penalty term over a generic parameter vector
βkj    coefficient j of discriminant vector k
βk    kth discriminant vector, βk = (βk1, . . . , βkp)
B    matrix of discriminant vectors, B = (β1, . . . , βK−1)
βj    jth row of B = (β1ᵀ, . . . , βpᵀ)ᵀ
BLDA    coefficient matrix in the LDA domain
BCCA    coefficient matrix in the CCA domain
BOS    coefficient matrix in the OS domain
XLDA    data matrix in the LDA domain
XCCA    data matrix in the CCA domain
XOS    data matrix in the OS domain
θk    score vector k
Θ    score matrix, Θ = (θ1, . . . , θK−1)
Y    label matrix
Ω    penalty matrix
LCP(θ; X, Z)    penalized complete log-likelihood function
ΣB    between-class covariance matrix
ΣW    within-class covariance matrix
ΣT    total covariance matrix
Σ̂B    sample between-class covariance matrix
Σ̂W    sample within-class covariance matrix
Σ̂T    sample total covariance matrix
Λ    inverse of covariance matrix or precision matrix
wj    weights
τj    penalty components of the variational approach

Part I

Context and Foundations

This thesis is divided in three parts. In Part I, I am introducing the context in which this work has been developed, the project that funded it and the constraints that we had to obey. Generic foundations are also detailed here to introduce the models and some basic concepts that will be used along this document. The state of the art of feature selection is also reviewed.

The first contribution of this thesis is explained in Part II, where I present the supervised learning algorithm GLOSS and its supporting theory, as well as some experiments to test its performance compared to other state of the art mechanisms. Before describing the algorithm and the experiments, its theoretical foundations are provided.

The second contribution is described in Part III, with an analogous structure to Part II but for the unsupervised domain. The clustering algorithm Mix-GLOSS adapts the supervised technique from Part II by means of a modified EM. This part is also furnished with specific theoretical foundations, an experimental section and a final discussion.

1 Context

The MASH project is a research initiative to investigate the open and collaborative design of feature extractors for the Machine Learning scientific community. The project is structured around a web platform (http://mash-project.eu) comprising collaborative tools such as wiki-documentation, forums, coding templates and an experiment center empowered with non-stop calculation servers. The applications targeted by MASH are vision and goal-planning problems, either in a 3D virtual environment or with a real robotic arm.

The MASH consortium is led by the IDIAP Research Institute in Switzerland. The other members are the University of Potsdam in Germany, the Czech Technical University of Prague, the National Institute for Research in Computer Science and Control (INRIA) in France, and the National Centre for Scientific Research (CNRS), also in France, through the laboratory of Heuristics and Diagnosis for Complex Systems (HEUDIASYC) attached to the University of Technology of Compiegne.

From the point of view of the research, the members of the consortium must deal with four main goals:

1. Software development of website framework and APIs

2. Classification and goal-planning in high dimensional feature spaces

3. Interfacing the platform with the 3D virtual environment and the robot arm

4. Building tools to assist contributors with the development of the feature extractors and the configuration of the experiments

Figure 1.1: MASH project logo

The work detailed in this text has been done in the context of goal 4. From the very beginning of the project, our role is to provide the users with some feedback regarding the feature extractors. At the moment of writing this thesis, the number of public feature extractors reaches 75. In addition to the public ones, there are also private extractors that contributors decide not to share with the rest of the community. The last number I was aware of was about 300. Within those 375 extractors, there must be some of them sharing the same theoretical principles or supplying similar features. The framework of the project tests every new piece of code with some datasets of reference in order to provide a ranking depending on the quality of the estimation. However, similar performance of two extractors for a particular dataset does not mean that both are using the same variables.

Our engagement was to provide some textual or graphical tools to discover which extractors compute features similar to other ones. Our hypothesis is that many of them use the same theoretical foundations, which should induce a grouping of similar extractors. If we succeed in discovering those groups, we would also be able to select representatives. This information can be used in several ways. For example, from the perspective of a user that develops feature extractors, it would be interesting to compare the performance of his code against the K representatives instead of against the whole database. As another example, imagine a user wants to obtain the best prediction results for a particular dataset. Instead of selecting all the feature extractors, creating an extremely high dimensional space, he could select only the K representatives, foreseeing similar results with a faster experiment.

As there is no prior knowledge about the latent structure, we make use of unsupervised techniques. Below there is a brief description of the different tools that we developed for the web platform.

• Clustering Using Mixture Models. This is a well-known technique that models the data as if it was randomly generated from a distribution function. This distribution is typically a mixture of Gaussians with unknown mixture proportions, means and covariance matrices. The number of Gaussian components matches the number of expected groups. The parameters of the model are computed using the EM algorithm and the clusters are built by maximum a posteriori estimation. For the calculation we use mixmod, a C++ library that can be interfaced with Matlab. This library allows working with high dimensional data. Further information regarding mixmod is given by Biernacki et al. (2008). All details concerning the tool implemented are given in deliverable "mash-deliverable-D71-m12" (Govaert et al., 2010).

• Sparse Clustering Using Penalized Optimal Scoring. This technique intends again to perform clustering by modelling the data as a mixture of Gaussian distributions. However, instead of using a classic EM algorithm for estimating the components' parameters, the M-step is replaced by a penalized Optimal Scoring problem. This replacement induces sparsity, improving the robustness and the interpretability of the results. Its theory will be explained later in this thesis. All details concerning the tool implemented can be found in deliverable "mash-deliverable-D72-m24" (Govaert et al., 2011).

• Table Clustering Using the RV Coefficient. This technique applies clustering methods directly to the tables computed by the feature extractors, instead of creating a single matrix. A distance in the extractors space is defined using the RV coefficient, which is a multivariate generalization of Pearson's correlation coefficient in the form of an inner product. The distance is defined for every pair i and j as RV(Oi, Oj), where Oi and Oj are operators computed from the tables returned by feature extractors i and j (a small numerical sketch is given after this list). Once we have a distance matrix, several standard techniques may be used to group extractors. A detailed description of this technique can be found in deliverables "mash-deliverable-D71-m12" (Govaert et al., 2010) and "mash-deliverable-D72-m24" (Govaert et al., 2011).
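As announced above, the following minimal sketch computes the RV coefficient between two feature tables with numpy. It is only an illustration of the general formula, under the assumption that the operators Oi are the Gram matrices of the centered tables; the exact operators used in the project are defined in the deliverables cited above.

import numpy as np

def rv_coefficient(X1, X2):
    """RV coefficient between two feature tables computed on the same n samples.

    Each table is centered, then represented by its n x n Gram matrix, which
    plays the role of the operators O_i mentioned above (assumption).
    """
    O1 = (X1 - X1.mean(axis=0)) @ (X1 - X1.mean(axis=0)).T
    O2 = (X2 - X2.mean(axis=0)) @ (X2 - X2.mean(axis=0)).T
    # Frobenius inner product of the operators, normalized like a correlation
    return np.sum(O1 * O2) / np.sqrt(np.sum(O1 * O1) * np.sum(O2 * O2))

# toy usage (made-up data): two extractors computing overlapping features
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 10))
B = A[:, :5] + 0.1 * rng.normal(size=(50, 5))     # extractor B overlaps with A
print(rv_coefficient(A, B))                        # larger value: similar information
print(rv_coefficient(A, rng.normal(size=(50, 8))))  # smaller value: unrelated extractor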

I am not extending this section with further explanations about the MASH project or deeper details about the theory that we used to commit our engagements. I will simply refer to the public deliverables of the project, where everything is carefully detailed (Govaert et al., 2010, 2011).

2 Regularization for Feature Selection

With the advances in technology, data is becoming larger and larger, resulting in high dimensional ensembles of information. Genomics, textual indexation and medical images are some examples of data that can easily exceed thousands of dimensions. The first experiments aiming to cluster the data from the MASH project (see Chapter 1) intended to work with the whole dimensionality of the samples. As the number of feature extractors rose, the numerical issues also rose. Redundancy or extremely correlated features may happen if two contributors implement the same extractor with different names. When the number of features exceeded the number of samples, we started to deal with singular covariance matrices, whose inverses are not defined. Many algorithms in the field of Machine Learning make use of this statistic.

2.1 Motivations

There is a quite recent effort in the direction of handling high dimensional data. Traditional techniques can be adapted, but quite often large dimensions render those techniques useless. Linear Discriminant Analysis was shown to be no better than a "random guessing" of the object labels when the dimension is larger than the sample size (Bickel and Levina, 2004; Fan and Fan, 2008).

As a rule of thumb, in discriminant and clustering problems the complexity of calculus increases with the number of objects in the database, the number of features (dimensionality) and the number of classes or clusters. One way to reduce this complexity is to reduce the number of features. This reduction induces more robust estimators, allows faster learning and predictions in the supervised environments, and easier interpretations in the unsupervised framework. Removing features must be done wisely to avoid removing critical information.

When talking about dimensionality reduction, there are two families of techniques that could induce confusion:

• Reduction by feature transformations summarizes the dataset with fewer dimensions by creating combinations of the original attributes. These techniques are less effective when there are many irrelevant attributes (noise). Principal Component Analysis or Independent Component Analysis are two popular examples.

• Reduction by feature selection removes irrelevant dimensions, preserving the integrity of the informative features from the original dataset. The problem comes out when there is a restriction in the number of variables to preserve and discarding the exceeding dimensions leads to a loss of information. Prediction with feature selection is computationally cheaper because only relevant features are used, and the resulting models are easier to interpret. The Lasso operator is an example of this category.

Figure 2.1: Example of relevant features, from Chidlovskii and Lecerf (2008)

As a basic rule, we can use the reduction techniques by feature transformation when the majority of the features are relevant and when there is a lot of redundancy or correlation. On the contrary, feature selection techniques are useful when there are plenty of useless or noisy features (irrelevant information) that need to be filtered out. In the paper of Chidlovskii and Lecerf (2008) we find a great explanation about the difference between irrelevant and redundant features. The following two paragraphs are almost exact reproductions of their text.

"Irrelevant features are those which provide negligible distinguishing information. For example, if the objects are all dogs, cats or squirrels, and it is desired to classify each new animal into one of these three classes, the feature of color may be irrelevant if each of dogs, cats and squirrels have about the same distribution of brown, black and tan fur colors. In such a case, knowing that an input animal is brown provides negligible distinguishing information for classifying the animal as a cat, dog or squirrel. Features which are irrelevant for a given classification problem are not useful, and accordingly a feature that is irrelevant can be filtered out.

Redundant features are those which provide distinguishing information but are cumulative to another feature or group of features that provide substantially the same distinguishing information. Using the previous example, consider illustrative "diet" and "domestication" features. Dogs and cats both have similar carnivorous diets, while squirrels consume nuts and so forth. Thus the "diet" feature can efficiently distinguish squirrels from dogs and cats, although it provides little information to distinguish between dogs and cats. Dogs and cats are also both typically domesticated animals, while squirrels are wild animals. Thus the "domestication" feature provides substantially the same information as the "diet" feature, namely distinguishing squirrels from dogs and cats, but not distinguishing between dogs and cats. Thus the "diet" and "domestication" features are cumulative, and one can identify one of these features as redundant so as to be filtered out. However, unlike irrelevant features, care should be taken with redundant features to ensure that one retains enough of the redundant features to provide the relevant distinguishing information. In the foregoing example, one may wish to filter out either the "diet" feature or the "domestication" feature, but if one removes both the "diet" and the "domestication" features then useful distinguishing information is lost."

Figure 2.2: The four key steps of feature selection, according to Liu and Yu (2005)

There are some tricks to build robust estimators when the number of features exceeds the number of samples. Ignoring some of the dependencies among variables and replacing the covariance matrix by a diagonal approximation are two of them. Another popular technique, and the one chosen in this thesis, is imposing regularity conditions.

2.2 Categorization of Feature Selection Techniques

Feature selection is one of the most frequent techniques in preprocessing data, in order to remove irrelevant, redundant or noisy features. Nevertheless, the risk of removing some informative dimensions is always there, thus the relevance of the remaining subset of features must be measured.

I am reproducing here the scheme that generalizes any feature selection process, as it is shown by Liu and Yu (2005). Figure 2.2 provides a very intuitive scheme with the four key steps in a feature selection algorithm.

The classification of those algorithms can respond to different criteria. Guyon and Elisseeff (2003) propose a check list that summarizes the steps that may be taken to solve a feature selection problem, guiding the user through several techniques. Liu and Yu (2005) propose a framework that integrates supervised and unsupervised feature selection algorithms through a categorizing framework. Both references are excellent reviews to characterize feature selection techniques according to their characteristics. I am proposing a framework inspired by these references that does not cover all the possibilities, but which gives a good summary of the existing ones.

• Depending on the type of integration with the machine learning algorithm, we have:

– Filter Models - The filter models work as a preprocessing step, using an independent evaluation criterion to select a subset of variables without assistance of the mining algorithm.

– Wrapper Models - The wrapper models require a classification or clustering algorithm and use its prediction performance to assess the relevance of the subset selection. The feature selection is done in the optimization block while the feature subset evaluation is done in a different one. Therefore, the criterion to optimize and to evaluate may be different. Those algorithms are computationally expensive.

– Embedded Models - They perform variable selection inside the learning machine, with the selection being made at the training step. That means that there is only one criterion: the optimization and the evaluation are a single block, and the features are selected to optimize this unique criterion and do not need to be re-evaluated in a later phase. That makes them more efficient, since no validation or test process is needed for every variable subset investigated. However, they are less universal because they are specific to the training process for a given mining algorithm.

• Depending on the feature searching technique:

– Complete - No subsets are missed from evaluation. Involves combinatorial searches.

– Sequential - Features are added (forward searches) or removed (backward searches) one at a time.

– Random - The initial subset or even subsequent subsets are randomly chosen to escape local optima.

• Depending on the evaluation technique:

– Distance Measures - Choosing the features that maximize the difference in separability, divergence or discrimination measures.

– Information Measures - Choosing the features that maximize the information gain, that is, minimizing the posterior uncertainty.

– Dependency Measures - Measuring the correlation between features.

– Consistency Measures - Finding a minimum number of features that separate classes as consistently as the full set of features can.

– Predictive Accuracy - Use the selected features to predict the labels.

– Cluster Goodness - Use the selected features to perform clustering and evaluate the result (cluster compactness, scatter separability, maximum likelihood).

The distance, information, correlation and consistency measures are typical of variable ranking algorithms, commonly used in filter models. Predictive accuracy and cluster goodness allow to evaluate subsets of features and can be used in wrapper and embedded models.

In this thesis we developed some algorithms following the embedded paradigm, either in the supervised or the unsupervised framework. Integrating the subset selection problem in the overall learning problem may be computationally demanding, but it is appealing from a conceptual viewpoint: there is a perfect match between the formalized goal and the process dedicated to achieve this goal, thus avoiding many problems arising in filter or wrapper methods. Practically, it is however intractable to solve exactly hard selection problems when the number of features exceeds a few tens. Regularization techniques allow to provide a sensible approximate answer to the selection problem with reasonable computing resources, and their recent study has demonstrated powerful theoretical and empirical results. The following section introduces the tools that will be employed in Parts II and III.

2.3 Regularization

In the machine learning domain, the term "regularization" refers to a technique that introduces some extra assumptions or knowledge in the resolution of an optimization problem. The most popular point of view presents regularization as a mechanism to prevent overfitting, but it can also help to fix some numerical issues on ill-posed problems (like some matrix singularities when solving a linear system), besides other interesting properties like the capacity to induce sparsity, thus producing models that are easier to interpret.

An ill-posed problem violates the rules defined by Jacques Hadamard, according to whom the solution to a mathematical problem has to exist, be unique and be stable. This happens, for example, when the number of samples is smaller than their dimensionality and we try to infer some generic laws from such a small sample of the population. Regularization transforms an ill-posed problem into a well-posed one. To do that, some a priori knowledge is introduced in the solution through a regularization term that penalizes a criterion J with a penalty P. Below are the two most popular formulations:

\min_{\beta}\; J(\beta) + \lambda P(\beta) \qquad (2.1)

\min_{\beta}\; J(\beta) \quad \text{s.t.} \quad P(\beta) \le t \qquad (2.2)

In the expressions (2.1) and (2.2), the parameters λ and t have a similar function, that is, to control the trade-off between fitting the data to the model according to J(β) and the effect of the penalty P(β). The set such that the constraint in (2.2) is verified, {β : P(β) ≤ t}, is called the admissible set. This penalty term can also be understood as a measure that quantifies the complexity of the model (as in the definition of Sammut and Webb, 2010). Note that regularization terms can also be interpreted in the Bayesian paradigm as prior distributions on the parameters of the model. In this thesis both views will be taken.
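The two formulations can be related through the Lagrangian of (2.2): for a convex problem, each value of t corresponds to some multiplier λ ≥ 0 (stated here informally, leaving aside the exact technical conditions),

\mathcal{L}(\beta, \lambda) = J(\beta) + \lambda \bigl( P(\beta) - t \bigr), \qquad \lambda \ge 0,

so that, for the appropriate λ, minimizing the Lagrangian over β amounts to solving the penalized form (2.1) up to the constant term λt.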

In this section I am reviewing pure, mixed and hybrid penalties that will be used in the following chapters to implement feature selection. I first list important properties that may pertain to any type of penalty.

Figure 2.3: Admissible sets in two dimensions for different pure norms ||β||p

2.3.1 Important Properties

Penalties may have different properties that can be more or less interesting depending on the problem and the expected solution. The most important properties for our purposes here are convexity, sparsity and stability.

Convexity. Regarding optimization, convexity is a desirable property that eases finding global solutions. A convex function verifies

\forall (x_1, x_2) \in \mathcal{X}^2, \quad f(t\,x_1 + (1-t)\,x_2) \le t\,f(x_1) + (1-t)\,f(x_2) \qquad (2.3)

for any value of t ∈ [0, 1]. Replacing the inequality by a strict inequality, we obtain the definition of strict convexity. A regularized expression like (2.2) is convex if the function J(β) and the penalty P(β) are both convex.

Sparsity. Usually, null coefficients furnish models that are easier to interpret. When sparsity does not harm the quality of the predictions, it is a desirable property, which moreover entails less memory usage and computation resources.

Stability. There are numerous notions of stability or robustness, which measure how the solution varies when the input is perturbed by small changes. This perturbation can be adding, removing or replacing a few elements in the training set. Adding regularization, in addition to preventing overfitting, is a means to favor the stability of the solution.

2.3.2 Pure Penalties

For pure penalties, defined as P(β) = ||β||p, convexity holds for p ≥ 1. This is graphically illustrated in Figure 2.3, borrowed from Szafranski (2008), whose Chapter 3 is an excellent review of regularization techniques and the algorithms to solve them. In this figure, the shape of the admissible sets corresponding to different pure penalties is greyed out. Since convexity of the penalty corresponds to the convexity of the set, we see that this property is verified for p ≥ 1.

Figure 2.4: Two dimensional regularized problems with ||β||1 and ||β||2 penalties

Regularizing a linear model with a norm like ||β||p means that the larger the component |βj|, the more important the feature xj in the estimation. On the contrary, the closer to zero, the more dispensable it is. In the limit of |βj| = 0, xj is not involved in the model. If many dimensions can be dismissed, then we can speak of sparsity.

A graphical interpretation of sparsity, borrowed from Marie Szafranski, is given in Figure 2.4. In a 2D problem, a solution can be considered as sparse if any of its components (β1 or β2) is null, that is, if the optimal β is located on one of the coordinate axes. Let us consider a search algorithm that minimizes an expression like (2.2) where J(β) is a quadratic function. When the solution to the unconstrained problem does not belong to the admissible set defined by P(β) (greyed out area), the solution to the constrained problem is as close as possible to the global minimum of the cost function inside the grey region. Depending on the shape of this region, the probability of having a sparse solution varies. A region with vertexes, as the one corresponding to an L1 penalty, has more chances of inducing sparse solutions than the one of an L2 penalty. That idea is displayed in Figure 2.4, where J(β) is a quadratic function represented with three isolevel curves whose global minimum βls is outside the penalties' admissible regions. The closest point to this βls for the L1 regularization is βl1 and for the L2 regularization it is βl2. Solution βl1 is sparse because its second component is zero, while both components of βl2 are different from zero.

After reviewing the regions from Figure 2.3, we can relate the capacity of generating sparse solutions to the quantity and the "sharpness" of vertexes of the greyed out area. For example, an L1/3 penalty has a support region with sharper vertexes that would induce a sparse solution even more strongly than an L1 penalty; however, the non-convex shape of the L1/3 results in difficulties during optimization that will not happen with a convex shape.

To summarize, a convex problem with a sparse solution is desired. But with pure penalties, sparsity is only possible with Lp norms with p ≤ 1, due to the fact that they are the only ones that have vertexes. On the other side, only norms with p ≥ 1 are convex, hence the only pure penalty that builds a convex problem with a sparse solution is the L1 penalty.

L0 Penalties. The L0 pseudo-norm of a vector β is defined as the number of entries different from zero, that is, P(β) = ||β||0 = card{βj | βj ≠ 0}:

\min_{\beta}\; J(\beta) \quad \text{s.t.} \quad \|\beta\|_0 \le t \qquad (2.4)

where the parameter t represents the maximum number of non-zero coefficients in vector β. The larger the value of t (or the lower the value of λ if we use the equivalent expression in (2.1)), the fewer the number of zeros induced in vector β. If t is equal to the dimensionality of the problem (or if λ = 0), then the penalty term is not effective and β is not altered. In general, the computation of the solutions relies on combinatorial optimization schemes. Their solutions are sparse but unstable.
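To make the combinatorial nature of this constraint concrete, the following brute-force sketch solves a tiny best-subset problem, assuming a least-squares loss; its cost explodes with p, which is why the convex relaxations discussed next are preferred.

from itertools import combinations
import numpy as np

def best_subset(X, y, t):
    """Exhaustive least-squares fit under the constraint ||beta||_0 <= t (toy sizes only)."""
    n, p = X.shape
    best_rss, best_beta = np.inf, np.zeros(p)
    for k in range(t + 1):
        for subset in combinations(range(p), k):
            beta = np.zeros(p)
            if subset:
                Xs = X[:, subset]
                beta[list(subset)] = np.linalg.lstsq(Xs, y, rcond=None)[0]
            rss = np.sum((y - X @ beta) ** 2)
            if rss < best_rss:
                best_rss, best_beta = rss, beta
    return best_beta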

L1 Penalties. The penalties built using L1 norms induce sparsity and stability. This penalty has been named the Lasso (Least Absolute Shrinkage and Selection Operator) by Tibshirani (1996):

\min_{\beta}\; J(\beta) \quad \text{s.t.} \quad \sum_{j=1}^{p} |\beta_j| \le t \qquad (2.5)
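The following minimal sketch illustrates the sparsity induced by this penalty on synthetic data, assuming a least-squares loss J(β) and using scikit-learn's Lasso estimator, which implements the penalized form (2.1) (its alpha parameter plays the role of λ, up to scaling conventions).

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 50, 20
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]          # only 3 informative variables (assumption)
y = X @ beta_true + 0.1 * rng.normal(size=n)

lasso = Lasso(alpha=0.1).fit(X, y)
print(np.nonzero(lasso.coef_)[0])          # most coefficients are exactly zero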

Despite all the advantages of the Lasso, the choice of the right penalty is not just a question of convexity and sparsity. For example, concerning the Lasso, Osborne et al. (2000a) have shown that when the number of examples n is lower than the number of variables p, then the maximum number of non-zero entries of β is n. Therefore, if there is a strong correlation between several variables, this penalty risks dismissing all but one, resulting in a hardly interpretable model. In a field like genomics, where n is typically some tens of individuals and p several thousands of genes, the performance of the algorithm and the interpretability of the genetic relationships are severely limited.

The Lasso is a popular tool that has been used in multiple contexts besides regression, particularly in the field of feature selection in supervised classification (Mai et al., 2012; Witten and Tibshirani, 2011) and clustering (Roth and Lange, 2004; Pan et al., 2006; Pan and Shen, 2007; Zhou et al., 2009; Guo et al., 2010; Witten and Tibshirani, 2010; Bouveyron and Brunet, 2012b,a).

The consistency of the problems regularized by a Lasso penalty is also a key feature. Defining consistency as the capability of always making the right choice of relevant variables when the number of individuals is infinitely large, Leng et al. (2006) have shown that when the penalty parameter (t or λ, depending on the formulation) is chosen by minimization of the prediction error, the Lasso penalty does not lead to consistent models. There is a large bibliography defining conditions where Lasso estimators become consistent (Knight and Fu, 2000; Donoho et al., 2006; Meinshausen and Buhlmann, 2006; Zhao and Yu, 2007; Bach, 2008). In addition to those papers, some authors have introduced modifications to improve the interpretability and the consistency of the Lasso, such as the adaptive Lasso (Zou, 2006).

L2 Penalties. The graphical interpretation of pure norm penalties in Figure 2.3 shows that this norm does not induce sparsity, due to its lack of vertexes. Strictly speaking, the L2 norm involves the square root of the sum of all squared components. In practice, when using L2 penalties, the square of the norm is used to avoid the square root and solve a linear system. Thus, an L2 penalized optimization problem looks like

\min_{\beta}\; J(\beta) + \lambda \|\beta\|_2^2 \qquad (2.6)

The effect of this penalty is the "equalization" of the components of the parameter that is being penalized. To enlighten this property, let us consider a least squares problem

\min_{\beta}\; \sum_{i=1}^{n} (y_i - \mathbf{x}_i^\top \beta)^2 \qquad (2.7)

with solution βls = (XᵀX)⁻¹Xᵀy. If some input variables are highly correlated, the estimator βls is very unstable. To fix this numerical instability, Hoerl and Kennard (1970) proposed ridge regression, which regularizes Problem (2.7) with a quadratic penalty:

\min_{\beta}\; \sum_{i=1}^{n} (y_i - \mathbf{x}_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} \beta_j^2

The solution to this problem is βl2 = (XᵀX + λIp)⁻¹Xᵀy. All eigenvalues, in particular the small ones corresponding to the correlated dimensions, are now moved upwards by λ. This can be enough to avoid the instability induced by small eigenvalues. This "equalization" in the coefficients reduces the variability of the estimation, which may improve performances.
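A small numerical check of this statement, on an assumed toy design with two nearly collinear columns: adding λIp shifts every eigenvalue of XᵀX upwards by exactly λ, which stabilizes the solution of the linear system.

import numpy as np

rng = np.random.default_rng(0)
n = 30
x1 = rng.normal(size=n)
X = np.column_stack([x1, x1 + 1e-3 * rng.normal(size=n)])   # two correlated columns
y = X @ np.array([1.0, 1.0]) + 0.1 * rng.normal(size=n)

G = X.T @ X
lam = 1.0
print(np.linalg.eigvalsh(G))                     # one eigenvalue close to zero
print(np.linalg.eigvalsh(G + lam * np.eye(2)))   # every eigenvalue moved up by lambda

beta_ls = np.linalg.solve(G, X.T @ y)                     # unstable least squares estimate
beta_l2 = np.linalg.solve(G + lam * np.eye(2), X.T @ y)   # ridge estimate
print(beta_ls, beta_l2)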

As with the Lasso operator, there are several variations of ridge regression. For example, Breiman (1995) proposed the nonnegative garrotte, which looks like a ridge regression where each variable is penalized adaptively. To do that, the least squares solution is used to define the penalty parameter attached to each coefficient:

\min_{\beta}\; \sum_{i=1}^{n} (y_i - \mathbf{x}_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} \frac{\beta_j^2}{(\beta_j^{ls})^2} \qquad (2.8)

The effect is an elliptic admissible set instead of the ball of ridge regression. Another example is the adaptive ridge regression (Grandvalet, 1998; Grandvalet and Canu, 2002), where the penalty parameter differs on each component. There, every λj is optimized to penalize more or less depending on the influence of βj in the model.

Although L2 penalized problems are stable, they are not sparse. That makes those models harder to interpret, mainly in high dimensions.

L∞ Penalties. A special case of Lp norms is the infinity norm, defined as ||x||∞ = max(|x1|, |x2|, . . . , |xp|). The admissible region for a penalty like ||β||∞ ≤ t is displayed in Figure 2.3. For the L∞ norm, the greyed out region fits a square containing all the β vectors whose largest coefficient is less or equal to the value of the penalty parameter t.

This norm is not commonly used as a regularization term itself; however, it is frequently combined in mixed penalties, as shown in Section 2.3.4. In addition, in the optimization of penalized problems there exists the concept of dual norms. Dual norms arise in the analysis of estimation bounds and in the design of algorithms that address optimization problems by solving an increasing sequence of small subproblems (working set algorithms). The dual norm plays a direct role in computing optimality conditions of sparse regularized problems. The dual norm ||β||* of a norm ||β|| is defined as

\|\beta\|_* = \max_{\mathbf{w} \in \mathbb{R}^p}\; \beta^\top \mathbf{w} \quad \text{s.t.} \quad \|\mathbf{w}\| \le 1

In the case of an Lq norm with q ∈ [1, +∞], the dual norm is the Lr norm such that 1/q + 1/r = 1. For example, the L2 norm is self-dual and the dual norm of the L1 norm is the L∞ norm. This is one of the reasons why L∞ is so important even if it is not as popular as a penalty itself, because L1 is. An extensive explanation about dual norms and the algorithms that make use of them can be found in Bach et al. (2011).
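A quick numerical check of this L1/L∞ duality, as a minimal illustration with numpy: the maximum of βᵀw over the L1 ball is attained at a signed canonical basis vector and equals ||β||∞.

import numpy as np

rng = np.random.default_rng(0)
beta = rng.normal(size=6)

# Maximizer of beta' w over {w : ||w||_1 <= 1}: a signed basis vector on argmax |beta_j|
j = np.argmax(np.abs(beta))
w_star = np.zeros_like(beta)
w_star[j] = np.sign(beta[j])
print(beta @ w_star, np.linalg.norm(beta, np.inf))   # identical values

# Random points on the L1 sphere never exceed this value
W = rng.normal(size=(10000, beta.size))
W /= np.abs(W).sum(axis=1, keepdims=True)
print((W @ beta).max() <= np.linalg.norm(beta, np.inf) + 1e-12)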

2.3.3 Hybrid Penalties

There are no reasons for using pure penalties in isolation. We can combine them and try to obtain different benefits from any of them. The most popular example is the Elastic net regularization (Zou and Hastie, 2005), with the objective of improving the Lasso penalization when n ≤ p. As recalled in Section 2.3.2, when n ≤ p the Lasso penalty can select at most n non-null features. Thus, in situations where there are more relevant variables, the Lasso penalty risks selecting only some of them. To avoid this effect, a combination of L1 and L2 penalties has been proposed. For the least squares example (2.7) from Section 2.3.2, the Elastic net is

\min_{\beta}\; \sum_{i=1}^{n} (y_i - \mathbf{x}_i^\top \beta)^2 + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^2 \qquad (2.9)

The term in λ1 is a Lasso penalty that induces sparsity in vector β; on the other side, the term in λ2 is a ridge regression penalty that provides universal strong consistency (De Mol et al., 2009), that is, the asymptotic capability (when n goes to infinity) of always making the right choice of relevant variables.
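A minimal sketch of this behaviour using scikit-learn's ElasticNet, whose (alpha, l1_ratio) parametrization is a rescaling of (λ1, λ2); the toy data, with a group of three almost identical relevant variables, is an assumption made for the illustration.

import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(1)
n = 40
z = rng.normal(size=n)
X = np.column_stack([z + 0.01 * rng.normal(size=n) for _ in range(3)]
                    + [rng.normal(size=n) for _ in range(7)])
y = z + 0.1 * rng.normal(size=n)          # the 3 correlated copies are all relevant

print(Lasso(alpha=0.1).fit(X, y).coef_[:3])                       # typically only one survives
print(ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y).coef_[:3])    # weight tends to be shared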

2.3.4 Mixed Penalties

Imagine a linear regression problem where each variable is a gene. Depending on the application, several biological processes can be identified by L different groups of genes. Let us identify as Gℓ the group of genes for the ℓth process and dℓ the number of genes (variables) in each group, for all ℓ ∈ {1, . . . , L}. Thus, the dimension of vector β will be the sum of the number of genes of every group, dim(β) = Σℓ=1..L dℓ. Mixed norms are a type of norms that take those groups into consideration. The general expression is shown below:

\|\beta\|_{(r,s)} = \left( \sum_{\ell} \Big( \sum_{j \in \mathcal{G}_\ell} |\beta_j|^s \Big)^{\frac{r}{s}} \right)^{\frac{1}{r}} \qquad (2.10)

The pair (r, s) identifies the norms that are combined: an Ls norm within groups and an Lr norm between groups. The Ls norm penalizes the variables in every group Gℓ, while the Lr norm penalizes the within-group norms. The pair (r, s) is set so as to induce different properties in the resulting β vector. Note that the outer norm is often weighted to adjust for the different cardinalities of the groups, in order to avoid favoring the selection of the largest groups.
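The following minimal sketch evaluates the mixed norm (2.10) with numpy for the group-Lasso pair (r, s) = (1, 2); the group structure and the optional weighting by group size are assumptions made for the example.

import numpy as np

def mixed_norm(beta, groups, r=1, s=2, weight_by_size=False):
    """Compute ||beta||_(r,s): an L_s norm within each group, an L_r norm between groups."""
    within = []
    for g in groups:
        norm_g = np.sum(np.abs(beta[g]) ** s) ** (1.0 / s)
        if weight_by_size:
            norm_g *= np.sqrt(len(g))        # usual group-Lasso weighting for s = 2
        within.append(norm_g)
    within = np.array(within)
    return np.sum(within ** r) ** (1.0 / r)

beta = np.array([1.0, -2.0, 0.0, 0.0, 3.0, 0.5])
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]   # L = 3 groups (assumed)
print(mixed_norm(beta, groups))          # the group-Lasso norm ||beta||_(1,2)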

Several combinations are available; the most popular is the norm ||β||(1,2), known as the group-Lasso (Yuan and Lin, 2006; Leng, 2008; Xie et al., 2008a,b; Meier et al., 2008; Roth and Fischer, 2008; Yang et al., 2010; Sanchez Merchante et al., 2012). Figure 2.5 shows the difference between the admissible sets of a pure L1 norm and a mixed L(1,2) norm. Many other mixings are possible, such as ||β||(1,4/3) (Szafranski et al., 2008) or ||β||(1,∞) (Wang and Zhu, 2008; Kuan et al., 2010; Vogt and Roth, 2010). Modifications of mixed norms have also been proposed, such as the group bridge penalty (Huang et al., 2009), the composite absolute penalties (Zhao et al., 2009), or combinations of mixed and pure norms, such as Lasso and group-Lasso (Friedman et al., 2010; Sprechmann et al., 2010) or group-Lasso and ridge penalty (Ng and Abugharbieh, 2011).

2.3.5 Sparsity Considerations

In this chapter I have reviewed several possibilities that induce sparsity in the solution of optimization problems. However, having sparse solutions does not always lead to parsimonious models featurewise. For example, if we have four parameters per feature, we look for solutions where all four parameters are null for non-informative variables.

The Lasso and the other L1 penalties encourage solutions such as the one in the left of Figure 2.6. If the objective is sparsity, then the L1 norm does the job. However, if we aim at feature selection and if the number of parameters per variable exceeds one, this type of sparsity does not target the removal of variables.

To be able to dismiss some features the sparsity pattern must encourage null valuesfor the same variable across parameters as shown in the right of Figure 26 This can beachieved with mixed penalties that define groups of features For example L12 or L1infinmixed norms with the proper definition of groups can induce sparsity patterns such as


Figure 2.5: Admissible sets for the Lasso (a: L_1) and the group-Lasso (b: L_{(1,2)}).

Figure 2.6: Sparsity patterns for an example with 8 variables characterized by 4 parameters: (a) L_1-induced sparsity; (b) L_{(1,2)} group-induced sparsity.


That pattern, displayed on the right of Figure 2.6, corresponds to a solution where variables 3, 5 and 8 are removed.

2.3.6 Optimization Tools for Regularized Problems

In Caramanis et al. (2012), there is a good collection of mathematical techniques and optimization methods to solve regularized problems. Another good reference is the thesis of Szafranski (2008), which also reviews some techniques, classified into four categories. Those techniques, even if they belong to different categories, can be used separately or combined to produce improved optimization algorithms.

In fact, the algorithm implemented in this thesis is inspired by three of those techniques. It could be described as an algorithm of “active constraints”, implemented following a regularization path that is updated by approaching the cost function with secant hyper-planes. Deeper details are given in the dedicated Chapter 5.

Subgradient Descent. Subgradient descent is a generic optimization method that can be used in penalized settings where the subgradient of the loss function, ∂J(β), and the subgradient of the regularizer, ∂P(β), can be computed efficiently. On the one hand, it is essentially blind to the problem structure; on the other hand, many iterations are needed, so the convergence is slow and the solutions are not sparse. Basically, it is a generalization of the iterative gradient descent algorithm, where the solution vector β^(t+1) is updated proportionally to the negative subgradient of the objective at the current point β^(t):

\[
\beta^{(t+1)} = \beta^{(t)} - \alpha\,(s + \lambda s'), \quad \text{where } s \in \partial J(\beta^{(t)}),\ s' \in \partial P(\beta^{(t)}) .
\]
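A minimal Python sketch of this update (illustrative, not from the original text) for the Lasso case, where J(β) = ‖y − Xβ‖² and P(β) = ‖β‖₁; the step size α and the number of iterations are arbitrary choices.

```python
import numpy as np

def subgradient_descent_lasso(X, y, lam, alpha=1e-3, n_iter=5000):
    """Generic subgradient descent for L1-penalized least squares."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        s = 2 * X.T @ (X @ beta - y)      # gradient of the quadratic loss
        s_prime = np.sign(beta)           # a subgradient of the L1 penalty
        beta -= alpha * (s + lam * s_prime)
    return beta                           # note: iterates are not exactly sparse
```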

Coordinate Descent. Coordinate descent is based on the first-order optimality conditions of the criterion (2.1). In the case of penalties like the Lasso, setting to zero the first-order derivative with respect to coefficient β_j gives

\[
\beta_j = \frac{-\lambda\,\mathrm{sign}(\beta_j) - \partial J(\beta)/\partial \beta_j}{2 \sum_{i=1}^{n} x_{ij}^2} .
\]

In the literature, those algorithms are also referred to as “iterative thresholding” algorithms, because the optimization can be solved by soft-thresholding in an iterative process. As an example, Fu (1998) implements this technique, initializing every coefficient with the least squares solution β^ls and updating its value with an iterative thresholding algorithm where β_j^(t+1) = S_λ(∂J(β^(t))/∂β_j). The objective function is optimized with respect


to one variable at a time, while all others are kept fixed:

\[
S_\lambda\!\left( \frac{\partial J(\beta)}{\partial \beta_j} \right) =
\begin{cases}
\dfrac{\lambda - \partial J(\beta)/\partial \beta_j}{2\sum_{i=1}^{n} x_{ij}^2} & \text{if } \partial J(\beta)/\partial \beta_j > \lambda \\[2ex]
\dfrac{-\lambda - \partial J(\beta)/\partial \beta_j}{2\sum_{i=1}^{n} x_{ij}^2} & \text{if } \partial J(\beta)/\partial \beta_j < -\lambda \\[2ex]
0 & \text{if } \left| \partial J(\beta)/\partial \beta_j \right| \le \lambda
\end{cases}
\tag{2.11}
\]

The same principles define “block-coordinate descent” algorithms. In this case, the first-order derivatives are applied to the equations of a group-Lasso penalty (Yuan and Lin, 2006; Wu and Lange, 2008).
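The following Python sketch (illustrative, not from the original text) implements this coordinate-wise scheme for the Lasso with the soft-thresholding update (2.11), cycling over the coordinates for a fixed number of sweeps.

```python
import numpy as np

def coordinate_descent_lasso(X, y, lam, n_sweeps=100):
    """Cyclic coordinate descent for min_beta ||y - X beta||^2 + lam * ||beta||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = 2 * np.sum(X ** 2, axis=0)          # 2 * sum_i x_ij^2 for each j
    for _ in range(n_sweeps):
        for j in range(p):
            # partial residual excluding the contribution of variable j
            r_j = y - X @ beta + X[:, j] * beta[j]
            rho = 2 * X[:, j] @ r_j
            # soft-thresholding of the univariate solution
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return beta
```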

Active and Inactive Sets. Active set algorithms are also referred to as “active constraints” or “working set” methods. These algorithms define a subset of variables called the “active set”, which stores the indices of the variables with non-zero β_j; it is usually denoted A. The complement of the active set is the “inactive set”, denoted Ā; in the inactive set we find the indices of the variables whose β_j is zero. Thus, the problem can be reduced to the dimensionality of A.

Osborne et al. (2000a) proposed the first of those algorithms to solve quadratic problems with Lasso penalties. Their algorithm starts from an empty active set that is updated incrementally (forward growing). There also exists a backward view, where relevant variables are allowed to leave the active set; however, the forward philosophy, which starts with an empty A, has the advantage that the first calculations are low dimensional. In addition, the forward view fits better the feature selection intuition, where few features are expected to be selected.

Working set algorithms have to deal with three main tasks. There is an optimization task, where a minimization problem has to be solved using only the variables from the active set; Osborne et al. (2000a) solve a linear approximation of the original problem to determine the descent direction of the objective function, but any other method can be considered. In general, as the solutions of successive active sets are typically close to each other, it is a good idea to use the solution of the previous iteration as the initialization of the current one (warm start). Besides the optimization task, there is a working set update task, where the active set A is augmented with the variable from the inactive set Ā that most violates the optimality conditions of Problem (2.1). Finally, there is also a task to compute the optimality conditions. Their expressions are essential in the selection of the next variable to add to the active set, and to test whether a particular vector β is a solution of Problem (2.1).

These active constraints or working set methods, even if they were originally proposed to solve L_1-regularized quadratic problems, can also be adapted to generic functions and penalties, for example linear functions and L_1 penalties (Roth, 2004), linear functions


and L_{1,2} penalties (Roth and Fischer, 2008), or even logarithmic cost functions and combinations of L_0, L_1 and L_2 penalties (Perkins et al., 2003). The algorithm developed in this work belongs to this family of solutions.

Hyper-Planes Approximation. Hyper-plane approximations solve a regularized problem using a piecewise linear approximation of the original cost function. This convex approximation is built from several secant hyper-planes at different points, obtained from the sub-gradient of the cost function at these points.

This family of algorithms implements an iterative mechanism where the number of hyper-planes increases at every iteration. These techniques are useful with large populations, since the number of iterations needed to converge does not depend on the size of the dataset. On the contrary, if few hyper-planes are used, then the quality of the approximation is not good enough and the solution can be unstable.

This family of algorithms is not as popular as the previous one, but some examples can be found in the domain of Support Vector Machines (Joachims, 2006; Smola et al., 2008; Franc and Sonnenburg, 2008) or Multiple Kernel Learning (Sonnenburg et al., 2006).

Regularization Path. The regularization path is the set of solutions that can be reached when solving a series of optimization problems of the form (2.1), where the penalty parameter λ is varied. It is not an optimization technique per se, but it is of practical use when the exact regularization path can be easily followed. Rosset and Zhu (2007) stated that this path is piecewise linear for those problems where the cost function is piecewise quadratic and the regularization term is piecewise linear (or vice-versa).

This concept was first applied to the Lasso algorithm of Osborne et al. (2000b). However, it was after the publication of the Least Angle Regression (LARS) algorithm developed by Efron et al. (2004) that those techniques became popular. LARS defines the regularization path using active constraint techniques.

Once an active set A^(t) and its corresponding solution β^(t) have been set, following the regularization path means looking for a direction h and a step size γ to update the solution as β^(t+1) = β^(t) + γh. Afterwards, the active and inactive sets A^(t+1) and Ā^(t+1) are updated. That can be done by looking for the variables that most strongly violate the optimality conditions. Hence, LARS sets the update step size and the variable that should enter the active set from the correlation with the residuals.

Proximal Methods. Proximal methods optimize an objective function of the form (2.1), resulting from the addition of a Lipschitz-differentiable cost function J(β) and a non-differentiable penalty λP(β):

\[
\min_{\beta \in \mathbb{R}^p}\ J(\beta^{(t)}) + \nabla J(\beta^{(t)})^\top (\beta - \beta^{(t)}) + \lambda P(\beta) + \frac{L}{2} \left\| \beta - \beta^{(t)} \right\|_2^2 \tag{2.12}
\]

They are also iterative methods, where the cost function J(β) is linearized in the proximity of the current solution β^(t), so that the problem to solve at each iteration looks like


(2.12), where the parameter L > 0 should be an upper bound on the Lipschitz constant of the gradient ∇J. That can be rewritten as

\[
\min_{\beta \in \mathbb{R}^p}\ \frac{1}{2} \left\| \beta - \Big( \beta^{(t)} - \frac{1}{L} \nabla J(\beta^{(t)}) \Big) \right\|_2^2 + \frac{\lambda}{L} P(\beta) \tag{2.13}
\]

The basic algorithm uses the solution to (2.13) as the next iterate β^(t+1). However, there are faster versions that take advantage of information about previous steps, such as the ones described by Nesterov (2007) or the FISTA algorithm (Beck and Teboulle, 2009). Proximal methods can be seen as generalizations of gradient updates; in fact, setting λ = 0 in equation (2.13) recovers the standard gradient update rule.
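A minimal proximal-gradient (ISTA-type) Python sketch for the Lasso case (illustrative, not from the original text): each iteration takes a gradient step on J and then applies the proximal operator of (λ/L)P(β), which is soft-thresholding when P is the L₁ norm.

```python
import numpy as np

def ista_lasso(X, y, lam, n_iter=500):
    """Proximal gradient (ISTA) for min ||y - X beta||^2 + lam * ||beta||_1."""
    n, p = X.shape
    L = 2 * np.linalg.norm(X, 2) ** 2      # upper bound on the Lipschitz constant of grad J
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = 2 * X.T @ (X @ beta - y)    # gradient of the smooth part J
        z = beta - grad / L                # gradient step, as in (2.13)
        beta = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # prox of (lam/L)*||.||_1
    return beta
```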


Part II

Sparse Linear Discriminant Analysis


Abstract

Linear discriminant analysis (LDA) aims to describe data by a linear combination of features that best separates the classes. It may be used for classifying future observations or for describing those classes.

There is a vast bibliography about sparse LDA methods, reviewed in Chapter 3. Sparsity is typically induced by regularizing the discriminant vectors or the class means with L_1 penalties (see Section 2). Section 2.3.5 discussed why this sparsity-inducing penalty may not guarantee parsimonious models regarding variables.

In this part, we develop the group-Lasso Optimal Scoring Solver (GLOSS), which addresses a sparse LDA problem globally, through a regression approach of LDA. Our analysis, presented in Chapter 4, formally relates GLOSS to Fisher's discriminant analysis, and also enables to derive variants, such as LDA assuming a diagonal within-class covariance structure (Bickel and Levina, 2004). The group-Lasso penalty selects the same features in all discriminant directions, leading to a more interpretable low-dimensional representation of data. The discriminant directions can be used in their totality, or the first ones may be chosen to produce a reduced-rank classification. The first two or three directions can also be used to project the data, in order to generate a graphical display of the data. The algorithm is detailed in Chapter 5, and our experimental results of Chapter 6 demonstrate that, compared to the competing approaches, the models are extremely parsimonious without compromising prediction performances. The algorithm efficiently processes medium to large numbers of variables, and is thus particularly well suited to the analysis of gene expression data.


3 Feature Selection in Fisher Discriminant Analysis

3.1 Fisher Discriminant Analysis

Linear discriminant analysis (LDA) aims to describe n labeled observations belonging to K groups by a linear combination of features which characterizes or separates the classes. It is used for two main purposes: classifying future observations, or describing the essential differences between classes, either by providing a visual representation of data or by revealing the combinations of features that discriminate between classes. There are several frameworks in which linear combinations can be derived; Friedman et al. (2009) dedicate a whole chapter to linear methods for classification. In this part, we focus on Fisher's discriminant analysis, which is a standard tool for linear discriminant analysis whose formulation does not rely on posterior probabilities, but rather on some inertia principles (Fisher, 1936).

We consider that the data consist of a set of n examples, with observations x_i ∈ R^p comprising p features, and labels y_i ∈ {0,1}^K indicating the exclusive assignment of observation x_i to one of the K classes. It will be convenient to gather the observations in the n×p matrix X = (x_1^⊤, …, x_n^⊤)^⊤ and the corresponding labels in the n×K matrix Y = (y_1^⊤, …, y_n^⊤)^⊤.

Fisher's discriminant problem was first proposed for two-class problems, for the analysis of the famous iris dataset, as the maximization of the ratio of the projected between-class covariance to the projected within-class covariance:

\[
\max_{\beta \in \mathbb{R}^p}\ \frac{\beta^\top \Sigma_B \beta}{\beta^\top \Sigma_W \beta} \tag{3.1}
\]

where β is the discriminant direction used to project the data, and Σ_B and Σ_W are the p×p between-class and within-class covariance matrices, respectively defined (for a K-class problem) as

\[
\Sigma_W = \frac{1}{n} \sum_{k=1}^{K} \sum_{i \in G_k} (x_i - \mu_k)(x_i - \mu_k)^\top ,
\qquad
\Sigma_B = \frac{1}{n} \sum_{k=1}^{K} \sum_{i \in G_k} (\mu - \mu_k)(\mu - \mu_k)^\top ,
\]

where μ is the sample mean of the whole dataset, μ_k the sample mean of class k, and G_k indexes the observations of class k.


This analysis can be extended to the multi-class framework with K groups. In this case, K−1 discriminant vectors β_k may be computed. Such a generalization was first proposed by Rao (1948). Several formulations of the multi-class Fisher's discriminant are available, for example the maximization of a trace ratio:

\[
\max_{B \in \mathbb{R}^{p \times (K-1)}}\ \frac{\operatorname{tr}\!\left( B^\top \Sigma_B B \right)}{\operatorname{tr}\!\left( B^\top \Sigma_W B \right)} \tag{3.2}
\]

where the matrix B is built with the discriminant directions β_k as columns. Solving the multi-class criterion (3.2) is an ill-posed problem; a better formulation is based on a series of K−1 subproblems:
\[
\begin{aligned}
\max_{\beta_k \in \mathbb{R}^p}\ & \beta_k^\top \Sigma_B \beta_k \\
\text{s.t. } & \beta_k^\top \Sigma_W \beta_k \le 1 \\
& \beta_k^\top \Sigma_W \beta_\ell = 0 \quad \forall \ell < k .
\end{aligned}
\tag{3.3}
\]

The maximizer of subproblem k is the eigenvector of Σ_W^{-1} Σ_B associated with the kth largest eigenvalue (see Appendix C).
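For concreteness, the following Python sketch (illustrative, not from the original text) computes Σ_W, Σ_B as defined above and the discriminant directions as the leading eigenvectors of Σ_W^{-1} Σ_B; a small ridge term is an assumption added here to keep Σ_W invertible.

```python
import numpy as np

def fisher_directions(X, labels, ridge=1e-6):
    """Multi-class Fisher discriminant directions (columns of B), cf. problem (3.3)."""
    n, p = X.shape
    classes = np.unique(labels)
    mu = X.mean(axis=0)
    Sigma_W = np.zeros((p, p))
    Sigma_B = np.zeros((p, p))
    for k in classes:
        Xk = X[labels == k]
        mu_k = Xk.mean(axis=0)
        Sigma_W += (Xk - mu_k).T @ (Xk - mu_k) / n
        Sigma_B += Xk.shape[0] * np.outer(mu - mu_k, mu - mu_k) / n
    # eigenvectors of Sigma_W^{-1} Sigma_B, sorted by decreasing eigenvalue
    M = np.linalg.solve(Sigma_W + ridge * np.eye(p), Sigma_B)
    eigval, eigvec = np.linalg.eig(M)
    order = np.argsort(eigval.real)[::-1]
    return eigvec.real[:, order[:len(classes) - 1]]
```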

3.2 Feature Selection in LDA Problems

LDA is often used as a data reduction technique, where the K−1 discriminant directions summarize the p original variables. However, all variables intervene in the definition of these discriminant directions, and this behavior may be troublesome.

Several modifications of LDA have been proposed to generate sparse discriminant directions. Sparse LDA reveals discriminant directions that only involve a few variables. This sparsity mainly targets the reduction of the dimensionality of the problem (as in genetic analysis), but parsimonious classification is also motivated by the need for interpretable models, robustness of the solution, or computational constraints.

The easiest approach to sparse LDA performs variable selection before discrimination. The relevancy of each feature is usually based on univariate statistics, which are fast and convenient to compute, but whose very partial view of the overall classification problem may lead to dramatic information loss. As a result, several approaches have been devised in recent years to construct LDA with wrapper and embedded feature selection capabilities.

They can be categorized according to the LDA formulation that provides the basis for the sparsity-inducing extension, that is, either Fisher's discriminant analysis (variance-based) or regression-based.

3.2.1 Inertia Based

The Fisher discriminant seeks a projection maximizing the separability of classes from inertia principles: mass centers should be far away from each other (large between-class variance), and


classes should be concentrated around their mass centers (small within-class variance). This view motivates a first series of sparse LDA formulations.

Moghaddam et al. (2006) propose an algorithm for sparse LDA in binary classification, where sparsity originates from a hard cardinality constraint. The formalization is based on Fisher's discriminant (3.1), reformulated as a quadratically-constrained quadratic program (3.3). Computationally, the algorithm implements a combinatorial search with some eigenvalue properties that are used to avoid exploring subsets of possible solutions. Extensions of this approach have been developed, with new sparsity bounds for the two-class discrimination problem and shortcuts to speed up the evaluation of eigenvalues (Moghaddam et al., 2007).

Also for binary problems, Wu et al. (2009) proposed a sparse LDA applied to gene expression data, where Fisher's discriminant (3.1) is solved as

\[
\begin{aligned}
\min_{\beta \in \mathbb{R}^p}\ & \beta^\top \Sigma_W \beta \\
\text{s.t. } & (\mu_1 - \mu_2)^\top \beta = 1 \\
& \textstyle\sum_{j=1}^{p} |\beta_j| \le t ,
\end{aligned}
\]

where μ_1 and μ_2 are the vectors of mean gene expression values corresponding to the two groups. The expression to optimize and the first constraint match problem (3.1); the second constraint encourages parsimony.

Witten and Tibshirani (2011) describe a multi-class technique using Fisher's discriminant, rewritten in the form of K−1 constrained and penalized maximization problems:
\[
\begin{aligned}
\max_{\beta_k \in \mathbb{R}^p}\ & \beta_k^\top \Sigma_B^k \beta_k - P_k(\beta_k) \\
\text{s.t. } & \beta_k^\top \Sigma_W \beta_k \le 1 .
\end{aligned}
\]

The term to maximize is the projected between-class covariance β_k^⊤ Σ_B β_k, subject to an upper bound on the projected within-class covariance β_k^⊤ Σ_W β_k. The penalty P_k(β_k) is added to avoid singularities and to induce sparsity. The authors suggest weighted versions of the regular Lasso and fused Lasso penalties for general-purpose data. The Lasso shrinks to zero the less informative variables, and the fused Lasso encourages a piecewise constant β_k vector. The R code is available from the website of Daniela Witten.

Cai and Liu (2011) use Fisher's discriminant to solve a binary LDA problem. But instead of performing separate estimations of Σ_W and (μ_1 − μ_2) to obtain the optimal solution β = Σ_W^{-1}(μ_1 − μ_2), they estimate the product directly through constrained L_1 minimization:
\[
\begin{aligned}
\min_{\beta \in \mathbb{R}^p}\ & \|\beta\|_1 \\
\text{s.t. } & \left\| \Sigma \beta - (\mu_1 - \mu_2) \right\|_\infty \le \lambda .
\end{aligned}
\]

Sparsity is encouraged by the L_1 norm of vector β, and the parameter λ is used to tune the optimization.


Most of the algorithms reviewed are conceived for binary classification. And for those that are envisaged for multi-class scenarios, the Lasso is the most popular way to induce sparsity; however, as discussed in Section 2.3.5, the Lasso is not the best tool to encourage parsimonious models when there are multiple discriminant directions.

3.2.2 Regression Based

In binary classification, LDA has been known to be equivalent to linear regression of scaled class labels since Fisher (1936). For K > 2, many studies show that multivariate linear regression of a specific class indicator matrix can be applied as a preprocessing step for LDA. However, directly casting LDA as a least squares problem is challenging in the multi-class case (Duda et al., 2000; Friedman et al., 2009).

Predefined Indicator Matrix

Multi-class classification is usually linked with linear regression through the definition of an indicator matrix (Friedman et al., 2009). An indicator matrix Y is an n×K matrix with the class labels of all samples. There are several well-known types in the literature. For example, the binary or dummy indicator (y_ik = 1 if sample i belongs to class k, and y_ik = 0 otherwise) is commonly used in linking multi-class classification with linear regression (Friedman et al., 2009). Another popular choice is y_ik = 1 if sample i belongs to class k and y_ik = −1/(K−1) otherwise; it was used, for example, in extending Support Vector Machines to multi-class classification (Lee et al., 2004) or for generalizing the kernel target alignment measure (Guermeur et al., 2004).

Some works propose a formulation of the least squares problem based on a new class indicator matrix (Ye, 2007). This new indicator matrix allows the definition of LS-LDA (Least Squares Linear Discriminant Analysis), which holds a rigorous equivalence with multi-class LDA under a mild condition, shown empirically to hold in many applications involving high-dimensional data.

Qiao et al. (2009) propose a discriminant analysis in the high-dimensional, low-sample-size setting, which incorporates variable selection in a Fisher's LDA formulated as a generalized eigenvalue problem, which is then recast as a least squares regression. Sparsity is obtained by means of a Lasso penalty on the discriminant vectors. Even if this is not mentioned in the article, their formulation looks very close in spirit to Optimal Scoring regression. Some rather clumsy steps in the developments hinder the comparison, so that further investigations are required. The lack of publicly available code also restrained an empirical test of this conjecture. If the similitude is confirmed, their formalization would be very close to the one of Clemmensen et al. (2011), reviewed in the following section.

In a recent paper, Mai et al. (2012) take advantage of the equivalence between ordinary least squares and LDA problems to propose a binary classifier solving a penalized least squares problem with a Lasso penalty. The sparse version of the projection vector β is


obtained by solving

\[
\min_{\beta \in \mathbb{R}^p,\ \beta_0 \in \mathbb{R}}\ n^{-1} \sum_{i=1}^{n} \left( y_i - \beta_0 - x_i^\top \beta \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j| ,
\]

where y_i is the binary indicator of the label of pattern x_i. Even if the authors focus on the Lasso penalty, they also suggest any other generic sparsity-inducing penalty. The decision rule x^⊤β + β_0 > 0 is the LDA classifier when it is built using the resulting β vector for λ = 0, but a different intercept β_0 is required.

Optimal Scoring

In binary classification, the regression of (scaled) class indicators enables to recover exactly the LDA discriminant direction. For more than two classes, regressing predefined indicator matrices may be impaired by the masking effect, where the scores assigned to a class situated between two other ones never dominate (Hastie et al., 1994). Optimal scoring (OS) circumvents the problem by assigning “optimal scores” to the classes. This route was opened by Fisher (1936) for binary classification, and pursued for more than two classes by Breiman and Ihaka (1984), with the aim of developing a non-linear extension of discriminant analysis based on additive models. They named their approach optimal scaling, for it optimizes the scaling of the indicators of classes together with the discriminant functions. Their approach was later disseminated under the name optimal scoring by Hastie et al. (1994), who proposed several extensions of LDA, either aiming at constructing more flexible discriminants (Hastie and Tibshirani, 1996) or more conservative ones (Hastie et al., 1995).

As an alternative method to solve LDA problems, Hastie et al. (1995) proposed to incorporate a smoothness prior on the discriminant directions in the OS problem, through a positive-definite penalty matrix Ω, leading to a problem expressed in compact form as

\[
\min_{\Theta,\, B}\ \|Y\Theta - XB\|_F^2 + \lambda \operatorname{tr}\!\left( B^\top \Omega B \right) \tag{3.4a}
\]
\[
\text{s.t. } n^{-1}\, \Theta^\top Y^\top Y \Theta = I_{K-1} \tag{3.4b}
\]

where Θ ∈ R^{K×(K−1)} are the class scores, B ∈ R^{p×(K−1)} are the regression coefficients, and ‖·‖_F is the Frobenius norm. This compact form does not render the order that arises naturally when considering the following series of K−1 problems:

\[
\min_{\theta_k \in \mathbb{R}^K,\ \beta_k \in \mathbb{R}^p}\ \|Y\theta_k - X\beta_k\|^2 + \beta_k^\top \Omega \beta_k \tag{3.5a}
\]
\[
\text{s.t. } n^{-1}\, \theta_k^\top Y^\top Y \theta_k = 1 \tag{3.5b}
\]
\[
\theta_k^\top Y^\top Y \theta_\ell = 0, \quad \ell = 1, \dots, k-1 , \tag{3.5c}
\]

where each β_k corresponds to a discriminant direction.


Several sparse LDA methods have been derived by introducing non-quadratic sparsity-inducing penalties in the OS regression problem (Ghosh and Chinnaiyan, 2005; Leng, 2008; Grosenick et al., 2008; Clemmensen et al., 2011). Grosenick et al. (2008) proposed a variant of the Lasso-based penalized OS of Ghosh and Chinnaiyan (2005), introducing an elastic-net penalty in binary class problems. A generalization to multi-class problems was suggested by Clemmensen et al. (2011), where the objective function (3.5a) is replaced by

\[
\min_{\beta_k \in \mathbb{R}^p,\ \theta_k \in \mathbb{R}^K}\ \sum_{k} \|Y\theta_k - X\beta_k\|_2^2 + \lambda_1 \|\beta_k\|_1 + \lambda_2\, \beta_k^\top \Omega \beta_k ,
\]

where λ_1 and λ_2 are regularization parameters and Ω is a penalization matrix, often taken to be the identity for the elastic net. The code for SLDA is available from the website of Line Clemmensen.

Another generalization of the work of Ghosh and Chinnaiyan (2005) was proposed by Leng (2008), with an extension to the multi-class framework based on a group-Lasso penalty in the objective function (3.5a):

\[
\min_{\beta_k \in \mathbb{R}^p,\ \theta_k \in \mathbb{R}^K}\ \sum_{k=1}^{K-1} \|Y\theta_k - X\beta_k\|_2^2 + \lambda \left( \sum_{j=1}^{p} \sqrt{\sum_{k=1}^{K-1} \beta_{kj}^2} \right)^{2} , \tag{3.6}
\]

which is the criterion that was chosen in this thesis. The following chapters present our theoretical and algorithmic contributions regarding this formulation. The proposal of Leng (2008) was heuristically driven, and his algorithm followed closely the group-Lasso algorithm of Yuan and Lin (2006), which is not very efficient (the experiments of Leng (2008) are limited to small data sets with hundreds of examples and 1000 preselected genes, and no code is provided). Here, we formally link (3.6) to penalized LDA and propose a publicly available, efficient code for solving this problem.


4 Formalizing the Objective

In this chapter, we detail the rationale supporting the Group-Lasso Optimal Scoring Solver (GLOSS) algorithm. GLOSS addresses a sparse LDA problem globally, through a regression approach. Our analysis formally relates GLOSS to Fisher's discriminant analysis, and also enables to derive variants, such as LDA assuming a diagonal within-class covariance structure (Bickel and Levina, 2004).

The sparsity arises from the group-Lasso penalty (3.6), due to Leng (2008), that selects the same features in all discriminant directions, thus providing an interpretable low-dimensional representation of data. For K classes, this representation can be either complete, in dimension (K−1), or partial for a reduced-rank classification. The first two or three discriminants can also be used to display a graphical summary of the data.

The derivation of penalized LDA as a penalized optimal scoring regression is quite tedious, but it is required here, since the algorithm hinges on this equivalence. The main lines have been derived in several places (Breiman and Ihaka, 1984; Hastie et al., 1994; Hastie and Tibshirani, 1996; Hastie et al., 1995) and already used before for sparsity-inducing penalties (Roth and Lange, 2004). However, the published demonstrations were quite elusive on a number of points, leading to generalizations that were not supported in a rigorous way. To our knowledge, we disclosed the first formal equivalence between the optimal scoring regression problem penalized by group-Lasso and penalized LDA (Sanchez Merchante et al., 2012).

4.1 From Optimal Scoring to Linear Discriminant Analysis

Following Hastie et al. (1995), we now show the equivalence between the series of problems encountered in penalized optimal scoring (p-OS) and in penalized LDA (p-LDA), by going through canonical correlation analysis. We first provide some properties about the solutions of an arbitrary problem in the p-OS series (3.5).

Throughout this chapter we assume that:

• there is no empty class, that is, the diagonal matrix Y^⊤Y is full rank;

• inputs are centered, that is, X^⊤1_n = 0;

• the quadratic penalty Ω is positive-semidefinite and such that X^⊤X + Ω is full rank.


4.1.1 Penalized Optimal Scoring Problem

For the sake of simplicity, we now drop the subscript k to refer to any problem in the p-OS series (3.5). First, note that Problems (3.5) are biconvex in (θ, β), that is, convex in θ for each β value and vice-versa. The problems are however non-convex; in particular, if (θ*, β*) is a solution, then (−θ*, −β*) is also a solution.

The orthogonality constraint (3.5c) inherently limits the number of possible problems in the series to K, since we assumed that there are no empty classes. Moreover, as X is centered, the K−1 first optimal scores are orthogonal to 1 (and the Kth problem would be solved by β_K = 0). All the problems considered here can be solved by a singular value decomposition of a real symmetric matrix, so that the orthogonality constraints are easily dealt with. Hence, in the sequel, we do not mention anymore these orthogonality constraints (3.5c) that apply along the route, so as to simplify all expressions. The generic problem solved is thus

\[
\min_{\theta \in \mathbb{R}^K,\ \beta \in \mathbb{R}^p}\ \|Y\theta - X\beta\|^2 + \beta^\top \Omega \beta \tag{4.1a}
\]
\[
\text{s.t. } n^{-1}\, \theta^\top Y^\top Y \theta = 1 . \tag{4.1b}
\]

For a given score vector θ, the discriminant direction β that minimizes the p-OS criterion (4.1) is the penalized least squares estimator

\[
\beta_{\mathrm{os}} = \left( X^\top X + \Omega \right)^{-1} X^\top Y \theta . \tag{4.2}
\]

The objective function (4.1a) is then

\[
\begin{aligned}
\|Y\theta - X\beta_{\mathrm{os}}\|^2 + \beta_{\mathrm{os}}^\top \Omega \beta_{\mathrm{os}}
&= \theta^\top Y^\top Y \theta - 2\, \theta^\top Y^\top X \beta_{\mathrm{os}} + \beta_{\mathrm{os}}^\top \left( X^\top X + \Omega \right) \beta_{\mathrm{os}} \\
&= \theta^\top Y^\top Y \theta - \theta^\top Y^\top X \left( X^\top X + \Omega \right)^{-1} X^\top Y \theta ,
\end{aligned}
\]

where the second line stems from the definition of β_os (4.2). Now, using the fact that the optimal θ obeys constraint (4.1b), the optimization problem is equivalent to

\[
\max_{\theta:\ n^{-1} \theta^\top Y^\top Y \theta = 1}\ \theta^\top Y^\top X \left( X^\top X + \Omega \right)^{-1} X^\top Y \theta , \tag{4.3}
\]

which shows that the optimization of the p-OS problem with respect to θ_k boils down to finding the kth largest eigenvector of Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Y. Indeed, Appendix C details that Problem (4.3) is solved by

\[
(Y^\top Y)^{-1} Y^\top X \left( X^\top X + \Omega \right)^{-1} X^\top Y \theta = \alpha^2 \theta , \tag{4.4}
\]


where α² is the maximal eigenvalue¹:

\[
\begin{aligned}
n^{-1} \theta^\top Y^\top X \left( X^\top X + \Omega \right)^{-1} X^\top Y \theta &= \alpha^2\, n^{-1} \theta^\top (Y^\top Y) \theta \\
n^{-1} \theta^\top Y^\top X \left( X^\top X + \Omega \right)^{-1} X^\top Y \theta &= \alpha^2 .
\end{aligned}
\tag{4.5}
\]

4.1.2 Penalized Canonical Correlation Analysis

As per Hastie et al. (1995), the penalized Canonical Correlation Analysis (p-CCA) problem between variables X and Y is defined as follows:

\[
\max_{\theta \in \mathbb{R}^K,\ \beta \in \mathbb{R}^p}\ n^{-1}\, \theta^\top Y^\top X \beta \tag{4.6a}
\]
\[
\text{s.t. } n^{-1}\, \theta^\top Y^\top Y \theta = 1 \tag{4.6b}
\]
\[
n^{-1}\, \beta^\top \left( X^\top X + \Omega \right) \beta = 1 . \tag{4.6c}
\]

The solutions to (4.6) are obtained by finding the saddle points of the Lagrangian:

\[
\begin{aligned}
n L(\beta, \theta, \nu, \gamma) &= \theta^\top Y^\top X \beta - \nu\left( \theta^\top Y^\top Y \theta - n \right) - \gamma\left( \beta^\top (X^\top X + \Omega)\beta - n \right) \\
\Rightarrow\ n \frac{\partial L(\beta, \theta, \gamma, \nu)}{\partial \beta} &= X^\top Y \theta - 2\gamma (X^\top X + \Omega) \beta \\
\Rightarrow\ \beta_{\mathrm{cca}} &= \frac{1}{2\gamma} (X^\top X + \Omega)^{-1} X^\top Y \theta .
\end{aligned}
\]

Then, as β_cca obeys (4.6c), we obtain

\[
\beta_{\mathrm{cca}} = \frac{(X^\top X + \Omega)^{-1} X^\top Y \theta}{\sqrt{n^{-1}\, \theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \theta}} , \tag{4.7}
\]

so that the optimal objective function (4.6a) can be expressed with θ alone:

\[
n^{-1} \theta^\top Y^\top X \beta_{\mathrm{cca}}
= \frac{n^{-1} \theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \theta}{\sqrt{n^{-1} \theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \theta}}
= \sqrt{n^{-1} \theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \theta} ,
\]

and the optimization problem with respect to θ can be restated as

\[
\max_{\theta:\ n^{-1} \theta^\top Y^\top Y \theta = 1}\ \theta^\top Y^\top X \left( X^\top X + \Omega \right)^{-1} X^\top Y \theta . \tag{4.8}
\]

Hence, the p-OS and p-CCA problems produce the same optimal score vectors θ. The regression coefficients are thus proportional, as shown by (4.2) and (4.7):

\[
\beta_{\mathrm{os}} = \alpha\, \beta_{\mathrm{cca}} , \tag{4.9}
\]

¹ The awkward notation α² for the eigenvalue was chosen here to ease comparison with Hastie et al. (1995). It is easy to check that this eigenvalue is indeed non-negative (see Equation (4.5) for example).


where α is defined by (4.5). The p-CCA optimization problem can also be written as a function of β alone, using the optimality conditions for θ:

\[
\begin{aligned}
n \frac{\partial L(\beta, \theta, \gamma, \nu)}{\partial \theta} &= Y^\top X \beta - 2\nu\, Y^\top Y \theta \\
\Rightarrow\ \theta_{\mathrm{cca}} &= \frac{1}{2\nu} (Y^\top Y)^{-1} Y^\top X \beta .
\end{aligned}
\tag{4.10}
\]

Then, as θ_cca obeys (4.6b), we obtain

\[
\theta_{\mathrm{cca}} = \frac{(Y^\top Y)^{-1} Y^\top X \beta}{\sqrt{n^{-1}\, \beta^\top X^\top Y (Y^\top Y)^{-1} Y^\top X \beta}} , \tag{4.11}
\]

leading to the following expression of the optimal objective function:

\[
n^{-1} \theta_{\mathrm{cca}}^\top Y^\top X \beta
= \frac{n^{-1} \beta^\top X^\top Y (Y^\top Y)^{-1} Y^\top X \beta}{\sqrt{n^{-1} \beta^\top X^\top Y (Y^\top Y)^{-1} Y^\top X \beta}}
= \sqrt{n^{-1} \beta^\top X^\top Y (Y^\top Y)^{-1} Y^\top X \beta} .
\]

The p-CCA problem can thus be solved with respect to β by plugging this value in (4.6):

\[
\max_{\beta \in \mathbb{R}^p}\ n^{-1}\, \beta^\top X^\top Y (Y^\top Y)^{-1} Y^\top X \beta \tag{4.12a}
\]
\[
\text{s.t. } n^{-1}\, \beta^\top \left( X^\top X + \Omega \right) \beta = 1 , \tag{4.12b}
\]

where the positive objective function has been squared compared to (4.6). This formulation is important, since it will be used to link p-CCA to p-LDA. We thus derive its solution; following the reasoning of Appendix C, β_cca verifies

\[
n^{-1} X^\top Y (Y^\top Y)^{-1} Y^\top X \beta_{\mathrm{cca}} = \lambda \left( X^\top X + \Omega \right) \beta_{\mathrm{cca}} , \tag{4.13}
\]

where λ is the maximal eigenvalue, shown below to be equal to α²:

\[
\begin{aligned}
& n^{-1} \beta_{\mathrm{cca}}^\top X^\top Y (Y^\top Y)^{-1} Y^\top X \beta_{\mathrm{cca}} = \lambda \\
\Rightarrow\ & n^{-1} \alpha^{-1} \beta_{\mathrm{cca}}^\top X^\top Y (Y^\top Y)^{-1} Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \theta = \lambda \\
\Rightarrow\ & n^{-1} \alpha\, \beta_{\mathrm{cca}}^\top X^\top Y \theta = \lambda \\
\Rightarrow\ & n^{-1} \theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \theta = \lambda \\
\Rightarrow\ & \alpha^2 = \lambda .
\end{aligned}
\]

The first line is obtained by obeying constraint (4.12b), the second line by the relationship (4.7), whose denominator is α, the third line comes from (4.4), the fourth line uses again the relationship (4.7), and the last one the definition of α (4.5).


4.1.3 Penalized Linear Discriminant Analysis

Still following Hastie et al. (1995), penalized Linear Discriminant Analysis is defined as follows:

\[
\max_{\beta \in \mathbb{R}^p}\ \beta^\top \Sigma_B \beta \tag{4.14a}
\]
\[
\text{s.t. } \beta^\top \left( \Sigma_W + n^{-1}\Omega \right) \beta = 1 , \tag{4.14b}
\]

where Σ_B and Σ_W are respectively the sample between-class and within-class covariances of the original p-dimensional data. This problem may be solved by an eigenvector decomposition, as detailed in Appendix C.

As the feature matrix X is assumed to be centered, the sample total, between-class and within-class covariance matrices can be written in a simple form that is amenable to a simple matrix representation, using the projection operator Y(Y^⊤Y)^{-1}Y^⊤:

\[
\begin{aligned}
\Sigma_T &= \frac{1}{n} \sum_{i=1}^{n} x_i x_i^\top = n^{-1} X^\top X \\
\Sigma_B &= \frac{1}{n} \sum_{k=1}^{K} n_k\, \mu_k \mu_k^\top = n^{-1} X^\top Y \left( Y^\top Y \right)^{-1} Y^\top X \\
\Sigma_W &= \frac{1}{n} \sum_{k=1}^{K} \sum_{i:\, y_{ik}=1} (x_i - \mu_k)(x_i - \mu_k)^\top = n^{-1} \left( X^\top X - X^\top Y \left( Y^\top Y \right)^{-1} Y^\top X \right) .
\end{aligned}
\]

Using these formulae, the solution to the p-LDA problem (4.14) is obtained as

\[
\begin{aligned}
X^\top Y \left( Y^\top Y \right)^{-1} Y^\top X \beta_{\mathrm{lda}} &= \lambda \left( X^\top X + \Omega - X^\top Y \left( Y^\top Y \right)^{-1} Y^\top X \right) \beta_{\mathrm{lda}} \\
X^\top Y \left( Y^\top Y \right)^{-1} Y^\top X \beta_{\mathrm{lda}} &= \frac{\lambda}{1 - \lambda} \left( X^\top X + \Omega \right) \beta_{\mathrm{lda}} .
\end{aligned}
\]

The comparison of the last equation with β_cca (4.13) shows that β_lda and β_cca are proportional, and that λ/(1−λ) = α². Using constraints (4.12b) and (4.14b), it comes that

\[
\begin{aligned}
\beta_{\mathrm{lda}} &= (1 - \alpha^2)^{-1/2}\, \beta_{\mathrm{cca}} \\
&= \alpha^{-1} (1 - \alpha^2)^{-1/2}\, \beta_{\mathrm{os}} ,
\end{aligned}
\]

which ends the path from p-OS to p-LDA.


4.1.4 Summary

The three previous subsections considered a generic form of the kth problem in the p-OS series. The relationships unveiled above also hold for the compact notation gathering all problems (3.4), which is recalled below:

\[
\begin{aligned}
\min_{\Theta,\, B}\ & \|Y\Theta - XB\|_F^2 + \lambda \operatorname{tr}\!\left( B^\top \Omega B \right) \\
\text{s.t. } & n^{-1}\, \Theta^\top Y^\top Y \Theta = I_{K-1} .
\end{aligned}
\]

Let A represent the (K−1)×(K−1) diagonal matrix with elements α_k, α_k being the square root of the kth largest eigenvalue of Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Y; we have

\[
\begin{aligned}
B_{\mathrm{LDA}} &= B_{\mathrm{CCA}} \left( I_{K-1} - A^2 \right)^{-\frac{1}{2}} \\
&= B_{\mathrm{OS}}\, A^{-1} \left( I_{K-1} - A^2 \right)^{-\frac{1}{2}} ,
\end{aligned}
\tag{4.15}
\]

where I_{K−1} is the (K−1)×(K−1) identity matrix. At this point, the feature matrix X, which in the input space has dimensions n×p, can be projected into the optimal scoring domain as an n×(K−1) matrix X_OS = X B_OS, or into the linear discriminant analysis space as an n×(K−1) matrix X_LDA = X B_LDA. Classification can be performed in any of those domains if the appropriate distance (penalized within-class covariance matrix) is applied.

With the aim of performing classification, the whole process can be summarized as follows:

1. Solve the p-OS problem as
\[
B_{\mathrm{OS}} = \left( X^\top X + \lambda\Omega \right)^{-1} X^\top Y \Theta ,
\]
where Θ contains the K−1 leading eigenvectors of Y^⊤X(X^⊤X + λΩ)^{-1}X^⊤Y.

2. Translate the data samples X into the LDA domain as X_LDA = X B_OS D, where D = A^{-1}(I_{K−1} − A²)^{-1/2}.

3. Compute the matrix M of centroids μ_k from X_LDA and Y.

4. Evaluate the distances d(x, μ_k) in the LDA domain as a function of M and X_LDA.

5. Translate distances into posterior probabilities and assign every sample i to a class k following the maximum a posteriori rule.

6. Produce a graphical representation, if desired.


The solution of the penalized optimal scoring regression and the computation of the distance and posterior matrices are detailed in Section 4.2.1, Section 4.2.2 and Section 4.2.3, respectively.

4.2 Practicalities

4.2.1 Solution of the Penalized Optimal Scoring Regression

Following Hastie et al. (1994) and Hastie et al. (1995), a quadratically penalized LDA problem can be presented as a quadratically penalized OS problem:

\[
\min_{\Theta \in \mathbb{R}^{K \times (K-1)},\ B \in \mathbb{R}^{p \times (K-1)}}\ \|Y\Theta - XB\|_F^2 + \lambda \operatorname{tr}\!\left( B^\top \Omega B \right) \tag{4.16a}
\]
\[
\text{s.t. } n^{-1}\, \Theta^\top Y^\top Y \Theta = I_{K-1} , \tag{4.16b}
\]

where Θ are the class scores, B the regression coefficients, and ‖·‖_F is the Frobenius norm.

Though non-convex, the OS problem is readily solved by a decomposition in Θ and B: the optimal B_OS does not intervene in the optimality conditions with respect to Θ, and the optimum with respect to B is obtained in closed form as a linear combination of the optimal scores Θ (Hastie et al., 1995). The algorithm may seem a bit tortuous considering the properties mentioned above, as it proceeds in four steps:

1. Initialize Θ to Θ⁰ such that n^{-1} Θ⁰ᵀY^⊤YΘ⁰ = I_{K−1}.

2. Compute B = (X^⊤X + λΩ)^{-1}X^⊤YΘ⁰.

3. Set Θ to be the K−1 leading eigenvectors of Y^⊤X(X^⊤X + λΩ)^{-1}X^⊤Y.

4. Compute the optimal regression coefficients
\[
B_{\mathrm{OS}} = \left( X^\top X + \lambda\Omega \right)^{-1} X^\top Y \Theta . \tag{4.17}
\]

Defining Θ⁰ in Step 1, instead of using directly Θ as expressed in Step 3, drastically reduces the computational burden of the eigen-analysis: the latter is performed on Θ⁰ᵀY^⊤X(X^⊤X + λΩ)^{-1}X^⊤YΘ⁰, which is computed as Θ⁰ᵀY^⊤XB, thus avoiding a costly matrix inversion. The solution of the penalized optimal scoring as an eigenvector decomposition is detailed and justified in Appendix B.
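A compact Python/NumPy sketch of these four steps is given below (illustrative only, not the GLOSS code, which is written in Matlab); the construction of Θ⁰ from orthonormal vectors orthogonal to 1_K, the √n scaling, and all names are our assumptions.

```python
import numpy as np

def penalized_os(X, Y, Omega, lam):
    """Four-step penalized optimal scoring (Section 4.2.1), quadratic penalty Omega.
    X: n x p centered features, Y: n x K dummy indicator matrix."""
    n, p = X.shape
    K = Y.shape[1]
    # Step 1: Theta0 such that n^{-1} Theta0' Y'Y Theta0 = I_{K-1}
    Q, _ = np.linalg.qr(np.hstack([np.ones((K, 1)), np.random.randn(K, K - 1)]))
    U = Q[:, 1:]                                   # orthonormal, orthogonal to 1_K
    class_counts = Y.sum(axis=0)
    Theta0 = np.sqrt(n) * (U / np.sqrt(class_counts)[:, None])
    # Step 2: penalized regression of Y Theta0 on X
    A = X.T @ X + lam * Omega
    B0 = np.linalg.solve(A, X.T @ Y @ Theta0)
    # Step 3: eigen-decomposition of the small (K-1)x(K-1) matrix Theta0' Y'X B0
    M = Theta0.T @ Y.T @ X @ B0
    eigval, V = np.linalg.eigh((M + M.T) / 2)      # symmetrize for numerical safety
    V = V[:, np.argsort(eigval)[::-1]]
    # Step 4: optimal scores and regression coefficients
    return Theta0 @ V, B0 @ V
```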

This four-step algorithm is valid when the penalty is of the form B^⊤ΩB. However, when an L_1 penalty is applied in (4.16), the optimization algorithm requires iterative updates of B and Θ. That situation is developed by Clemmensen et al. (2011), where


a Lasso or an elastic net penalty is used to induce sparsity in the OS problem. Furthermore, these Lasso and elastic net penalties do not enjoy the equivalence with LDA problems.

4.2.2 Distance Evaluation

The simplest classification rule is the nearest centroid rule, where sample x_i is assigned to class k if x_i is closer (in terms of the shared within-class Mahalanobis distance) to centroid μ_k than to any other centroid μ_ℓ. In general, the parameters of the model are unknown and the rule is applied with the parameters estimated from training data (sample estimators μ̂_k and Σ̂_W). If μ_k are the centroids in the input space, sample x_i is assigned to class k if the distance

\[
d(x_i, \mu_k) = (x_i - \mu_k)^\top \Sigma_{W\Omega}^{-1} (x_i - \mu_k) - 2 \log\!\left( \frac{n_k}{n} \right) \tag{4.18}
\]

is minimized over all k. In expression (4.18), the first term is the Mahalanobis distance in the input space, and the second term is an adjustment for unequal class sizes that estimates the prior probability of class k. Note that this is inspired by the Gaussian view of LDA, and that another definition of the adjustment term could be used (Friedman et al., 2009; Mai et al., 2012). The matrix Σ_WΩ used in (4.18) is the penalized within-class covariance matrix, which can be decomposed into a penalized and a non-penalized component:

\[
\begin{aligned}
\Sigma_{W\Omega}^{-1} &= \left( n^{-1}(X^\top X + \lambda\Omega) - \Sigma_B \right)^{-1} \\
&= \left( n^{-1} X^\top X - \Sigma_B + n^{-1}\lambda\Omega \right)^{-1} \\
&= \left( \Sigma_W + n^{-1}\lambda\Omega \right)^{-1} .
\end{aligned}
\tag{4.19}
\]

Before explaining how to compute the distances, let us summarize some clarifying points:

• The solution B_OS of the p-OS problem is enough to accomplish classification.

• In the LDA domain (space of discriminant variates X_LDA), classification is based on Euclidean distances.

• Classification can be done in a reduced-rank space of dimension R < K−1, by using the first R discriminant directions {β_k}_{k=1}^{R}.

As a result, the expression of the distance (4.18) depends on the domain where the classification is performed. If we classify in the p-OS domain, the distance is
\[
\left\| (x_i - \mu_k) B_{\mathrm{OS}} \right\|^2_{\Sigma_{W\Omega}} - 2 \log(\pi_k) ,
\]
where π_k is the estimated class prior and ‖·‖_S is the Mahalanobis norm assuming within-class covariance S. If classification is done in the p-LDA domain, the distance is
\[
\left\| (x_i - \mu_k)\, B_{\mathrm{OS}} A^{-1} \left( I_{K-1} - A^2 \right)^{-\frac{1}{2}} \right\|_2^2 - 2 \log(\pi_k) ,
\]
which is a plain Euclidean distance.
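A sketch of the nearest-centroid rule in the LDA domain (illustrative Python, not from the original text), where the metric is plain Euclidean corrected by the log-priors; X_lda, the centroids and the priors are assumed to have been computed beforehand.

```python
import numpy as np

def classify_lda_domain(X_lda, centroids, priors):
    """Nearest-centroid classification in the LDA domain (Euclidean metric).
    X_lda: n x (K-1) projected samples, centroids: K x (K-1), priors: length-K vector."""
    # squared Euclidean distances to every centroid, minus the prior adjustment of (4.18)
    d2 = ((X_lda[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    scores = d2 - 2 * np.log(priors)[None, :]
    return np.argmin(scores, axis=1)
```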


4.2.3 Posterior Probability Evaluation

Let d(x, μ_k) be the distance between x and μ_k defined as in (4.18). Under the assumption that classes are Gaussian, the posterior probabilities p(y_k = 1 | x) can be estimated as

\[
\hat{p}(y_k = 1 \,|\, x) \;\propto\; \exp\!\left( -\frac{d(x, \mu_k)}{2} \right)
\;\propto\; \pi_k \exp\!\left( -\frac{1}{2} \left\| (x - \mu_k)\, B_{\mathrm{OS}} A^{-1} \left( I_{K-1} - A^2 \right)^{-\frac{1}{2}} \right\|_2^2 \right) . \tag{4.20}
\]

Those probabilities must be normalized to ensure that they sum to one. When the distances d(x, μ_k) take large values, exp(−d(x, μ_k)/2) can take extremely small values, generating underflow issues. A classical trick to fix this numerical issue is detailed below:

\[
\hat{p}(y_k = 1 \,|\, x)
= \frac{\pi_k \exp\!\left( -\frac{d(x, \mu_k)}{2} \right)}{\sum_{\ell} \pi_\ell \exp\!\left( -\frac{d(x, \mu_\ell)}{2} \right)}
= \frac{\pi_k \exp\!\left( \frac{-d(x, \mu_k) + d_{\max}}{2} \right)}{\sum_{\ell} \pi_\ell \exp\!\left( \frac{-d(x, \mu_\ell) + d_{\max}}{2} \right)} ,
\]

where d_max = max_k d(x, μ_k).
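The corresponding numerically stable computation, as an illustrative Python sketch (not from the original text):

```python
import numpy as np

def posterior_probabilities(distances, priors):
    """Class posteriors from the distances of (4.18), using the d_max shift
    described above to avoid underflow. distances: n x K, priors: length-K vector."""
    d_max = distances.max(axis=1, keepdims=True)
    unnormalized = priors[None, :] * np.exp(-(distances - d_max) / 2.0)
    return unnormalized / unnormalized.sum(axis=1, keepdims=True)
```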

4.2.4 Graphical Representation

Sometimes it can be useful to have a graphical display of the data set. Using only the two or three most discriminant directions may not provide the best separation between classes, but it can suffice to inspect the data. This can be accomplished by plotting the first two or three dimensions of the regression fits X_OS or of the discriminant variates X_LDA, depending on whether we represent the dataset in the OS or in the LDA domain. Other attributes, such as the centroids or the shape of the within-class variance, can also be represented.

4.3 From Sparse Optimal Scoring to Sparse LDA

The equivalence stated in Section 4.1 holds for quadratic penalties of the form β^⊤Ωβ, under the assumption that Y^⊤Y and X^⊤X + λΩ are full rank (fulfilled when there are no empty classes and Ω is positive definite). Quadratic penalties have interesting properties but, as recalled in Section 2.3, they do not induce sparsity. In this respect, L_1 penalties are preferable, but they lack a connection, such as the one stated by Hastie et al. (1995), between p-LDA and p-OS.

In this section, we introduce the tools used to obtain sparse models while maintaining the equivalence between p-LDA and p-OS problems. We use a group-Lasso penalty (see Section 2.3.4) that induces groups of zeroes in the coefficients corresponding to the same feature in all discriminant directions, resulting in truly parsimonious models. Our derivation uses a variational formulation of the group-Lasso to generalize the equivalence drawn by Hastie et al. (1995) for quadratic penalties. Therefore, we intend to show that our formulation of the group-Lasso can be written in the quadratic form B^⊤ΩB.

4.3.1 A Quadratic Variational Form

Quadratic variational forms of the Lasso and group-Lasso were proposed shortly after the original Lasso paper of Tibshirani (1996), as a means to address optimization issues, but also as an inspiration for generalizing the Lasso penalty (Grandvalet, 1998; Canu and Grandvalet, 1999). The algorithms based on these quadratic variational forms iteratively reweight a quadratic penalty. They are now often outperformed by more efficient strategies (Bach et al., 2012).

Our formulation of the group-Lasso is shown below:

\[
\min_{\tau \in \mathbb{R}^p}\ \min_{B \in \mathbb{R}^{p \times (K-1)}}\ J(B) + \lambda \sum_{j=1}^{p} w_j^2 \frac{\left\| \beta^j \right\|_2^2}{\tau_j} \tag{4.21a}
\]
\[
\text{s.t. } \sum_j \tau_j - \sum_j w_j \left\| \beta^j \right\|_2 \le 0 \tag{4.21b}
\]
\[
\tau_j \ge 0, \quad j = 1, \dots, p , \tag{4.21c}
\]

where B ∈ R^{p×(K−1)} is a matrix composed of row vectors β^j ∈ R^{K−1}, B = (β^{1⊤}, …, β^{p⊤})^⊤, and the w_j are predefined nonnegative weights. The cost function J(B) in our context is the OS regression ‖YΘ − XB‖²_2; from now on, for the sake of simplicity, I keep the generic notation J(B). Here and in what follows, b/τ is defined by continuation at zero, as b/0 = +∞ if b ≠ 0 and 0/0 = 0. Note that variants of (4.21) have been proposed elsewhere (see e.g. Canu and Grandvalet, 1999; Bach et al., 2012, and references therein).

The intuition behind our approach is that, using the variational formulation, we recast a non-quadratic expression into the convex hull of a family of quadratic penalties defined by the variables τ_j. This is graphically shown in Figure 4.1.

Let us start by proving the equivalence between our variational formulation and the standard group-Lasso (there is an alternative variational formulation, detailed and demonstrated in Appendix D).

Lemma 4.1. The quadratic penalty in β^j in (4.21) acts as the group-Lasso penalty λ Σ_{j=1}^{p} w_j ‖β^j‖_2.

Proof. The Lagrangian of Problem (4.21) is

\[
L = J(B) + \lambda \sum_{j=1}^{p} w_j^2 \frac{\left\| \beta^j \right\|_2^2}{\tau_j} + \nu_0 \left( \sum_{j=1}^{p} \tau_j - \sum_{j=1}^{p} w_j \left\| \beta^j \right\|_2 \right) - \sum_{j=1}^{p} \nu_j \tau_j .
\]


Figure 4.1: Graphical representation of the variational approach to the group-Lasso.

Thus, the first-order optimality conditions for τ_j* are

\[
\begin{aligned}
\frac{\partial L}{\partial \tau_j}(\tau_j^\star) = 0
&\Leftrightarrow -\lambda w_j^2 \frac{\left\| \beta^j \right\|_2^2}{\tau_j^{\star\,2}} + \nu_0 - \nu_j = 0 \\
&\Leftrightarrow -\lambda w_j^2 \left\| \beta^j \right\|_2^2 + \nu_0\, \tau_j^{\star\,2} - \nu_j\, \tau_j^{\star\,2} = 0 \\
&\Rightarrow -\lambda w_j^2 \left\| \beta^j \right\|_2^2 + \nu_0\, \tau_j^{\star\,2} = 0 .
\end{aligned}
\]

The last line is obtained from complementary slackness, which implies here ν_j τ_j* = 0. Complementary slackness states that ν_j g_j(τ_j*) = 0, where ν_j is the Lagrange multiplier of constraint g_j(τ_j) ≤ 0. As a result, the optimal value of τ_j is

\[
\tau_j^\star = \sqrt{\frac{\lambda\, w_j^2 \left\| \beta^j \right\|_2^2}{\nu_0}} = \sqrt{\frac{\lambda}{\nu_0}}\; w_j \left\| \beta^j \right\|_2 . \tag{4.22}
\]

We note that ν_0 ≠ 0 if there is at least one coefficient β_jk ≠ 0; thus the inequality constraint (4.21b) is at bound (due to complementary slackness):

\[
\sum_{j=1}^{p} \tau_j^\star - \sum_{j=1}^{p} w_j \left\| \beta^j \right\|_2 = 0 , \tag{4.23}
\]

so that τ_j* = w_j ‖β^j‖_2. Plugging this value into (4.21a), it is possible to conclude that Problem (4.21) is equivalent to the standard group-Lasso operator

\[
\min_{B \in \mathbb{R}^{p \times M}}\ J(B) + \lambda \sum_{j=1}^{p} w_j \left\| \beta^j \right\|_2 . \tag{4.24}
\]

So we have presented a convex quadratic variational form of the group-Lasso and demonstrated its equivalence with the standard group-Lasso formulation.


With Lemma 4.1, we have proved that, under constraints (4.21b)-(4.21c), the quadratic problem (4.21a) is equivalent to the standard formulation of the group-Lasso (4.24). The penalty term of (4.21a) can be conveniently presented as λB^⊤ΩB, where

\[
\Omega = \operatorname{diag}\!\left( \frac{w_1^2}{\tau_1}, \frac{w_2^2}{\tau_2}, \dots, \frac{w_p^2}{\tau_p} \right) , \tag{4.25}
\]

with τ_j = w_j ‖β^j‖_2, resulting in the diagonal components of Ω:

resulting in Ω diagonal components

(Ω)jj =wj∥∥βj∥∥

2

(426)

And, as stated at the beginning of this section, the equivalence between p-LDA problems and p-OS problems is demonstrated for the variational formulation. This equivalence is crucial to the derivation of the link between sparse OS and sparse LDA; it furthermore suggests a convenient implementation. We sketch below some properties that are instrumental in the implementation of the active set method described in Chapter 5.

The first property states that the quadratic formulation is convex when J is convex, thus providing an easy control of optimality and convergence.

Lemma 4.2. If J is convex, Problem (4.21) is convex.

Proof. The function g(β, τ) = ‖β‖²_2/τ, known as the perspective function of f(β) = ‖β‖²_2, is convex in (β, τ) (see e.g. Boyd and Vandenberghe, 2004, Chapter 3), and the constraints (4.21b)-(4.21c) define convex admissible sets; hence Problem (4.21) is jointly convex with respect to (B, τ).

In what follows, J will be a convex quadratic (hence smooth) function, in which case a necessary and sufficient optimality condition is that zero belongs to the subdifferential of the objective function, whose expression is provided in the following lemma.

Lemma 4.3. For all B ∈ R^{p×(K−1)}, the subdifferential of the objective function of Problem (4.24) is

\[
\left\{ V \in \mathbb{R}^{p \times (K-1)} :\ V = \frac{\partial J(B)}{\partial B} + \lambda G \right\} , \tag{4.27}
\]

where G ∈ R^{p×(K−1)} is a matrix composed of row vectors g^j ∈ R^{K−1}, G = (g^{1⊤}, …, g^{p⊤})^⊤, defined as follows. Let S(B) denote the columnwise support of B, S(B) = {j ∈ {1,…,p} : ‖β^j‖_2 ≠ 0}; then we have
\[
\forall j \in S(B),\quad g^j = w_j \left\| \beta^j \right\|_2^{-1} \beta^j \tag{4.28}
\]
\[
\forall j \notin S(B),\quad \left\| g^j \right\|_2 \le w_j . \tag{4.29}
\]


This condition results in an equality for the “active” non-zero vectors β^j, and in an inequality for the other ones, which both provide essential building blocks of our algorithm.

Proof. When ‖β^j‖_2 ≠ 0, the gradient of the penalty with respect to β^j is

\[
\frac{\partial}{\partial \beta^j} \left( \lambda \sum_{m=1}^{p} w_m \left\| \beta^m \right\|_2 \right) = \lambda w_j \frac{\beta^j}{\left\| \beta^j \right\|_2} . \tag{4.30}
\]

At ‖β^j‖_2 = 0, the gradient of the objective function is not continuous, and the optimality conditions then make use of the subdifferential (Bach et al., 2011):

\[
\partial_{\beta^j} \left( \lambda \sum_{m=1}^{p} w_m \left\| \beta^m \right\|_2 \right)
= \partial_{\beta^j} \left( \lambda w_j \left\| \beta^j \right\|_2 \right)
= \left\{ \lambda w_j v \in \mathbb{R}^{K-1} :\ \|v\|_2 \le 1 \right\} . \tag{4.31}
\]

This gives expression (4.29).

Lemma 4.4. Problem (4.21) admits at least one solution, which is unique if J is strictly convex. All critical points B of the objective function verifying the following conditions are global minima:

\[
\forall j \in S,\quad \frac{\partial J(B)}{\partial \beta^j} + \lambda w_j \left\| \beta^j \right\|_2^{-1} \beta^j = 0 \tag{4.32a}
\]
\[
\forall j \notin S,\quad \left\| \frac{\partial J(B)}{\partial \beta^j} \right\|_2 \le \lambda w_j , \tag{4.32b}
\]

where S ⊆ {1, …, p} denotes the set of non-zero row vectors β^j and S̄(B) is its complement.

Lemma 4.4 provides a simple appraisal of the support of the solution, which would not be as easily handled with a direct analysis of the variational problem (4.21).

4.3.2 Group-Lasso OS as Penalized LDA

With all the previous ingredients, the group-Lasso Optimal Scoring Solver for performing sparse LDA can be introduced.

Proposition 4.1. The group-Lasso OS problem

\[
\begin{aligned}
B_{\mathrm{OS}} = \operatorname*{argmin}_{B \in \mathbb{R}^{p \times (K-1)}}\ \min_{\Theta \in \mathbb{R}^{K \times (K-1)}}\ & \frac{1}{2} \|Y\Theta - XB\|_F^2 + \lambda \sum_{j=1}^{p} w_j \left\| \beta^j \right\|_2 \\
\text{s.t. }\ & n^{-1}\, \Theta^\top Y^\top Y \Theta = I_{K-1}
\end{aligned}
\]


is equivalent to the penalized LDA problem

\[
\begin{aligned}
B_{\mathrm{LDA}} = \operatorname*{argmax}_{B \in \mathbb{R}^{p \times (K-1)}}\ & \operatorname{tr}\!\left( B^\top \Sigma_B B \right) \\
\text{s.t. }\ & B^\top \left( \Sigma_W + n^{-1}\lambda\,\Omega \right) B = I_{K-1} ,
\end{aligned}
\]

\[
\text{where } \Omega = \operatorname{diag}\!\left( \frac{w_1^2}{\tau_1}, \dots, \frac{w_p^2}{\tau_p} \right),
\quad \text{with } \Omega_{jj} =
\begin{cases}
+\infty & \text{if } \beta^j_{\mathrm{os}} = 0 \\[1ex]
w_j \left\| \beta^j_{\mathrm{os}} \right\|_2^{-1} & \text{otherwise.}
\end{cases}
\tag{4.33}
\]

That is, B_LDA = B_OS diag(α_k^{-1}(1 − α_k²)^{-1/2}), where α_k ∈ (0, 1) is the kth leading eigenvalue of
\[
n^{-1}\, Y^\top X \left( X^\top X + \lambda\Omega \right)^{-1} X^\top Y .
\]

Proof. The proof simply consists in applying the result of Hastie et al. (1995), which holds for quadratic penalties, to the quadratic variational form of the group-Lasso.

The proposition applies in particular to the Lasso-based OS approaches to sparse LDA (Grosenick et al., 2008; Clemmensen et al., 2011) for K = 2, that is, for binary classification, or more generally for a single discriminant direction. Note however that it leads to a slightly different decision rule if the decision threshold is chosen a priori, according to the Gaussian assumption on the features. For more than one discriminant direction, the equivalence does not hold anymore, since the Lasso penalty does not result in an equivalent quadratic penalty of the simple form tr(B^⊤ΩB).


5 GLOSS Algorithm

The efficient approaches developed for the Lasso take advantage of the sparsity of the solution by solving a series of small linear systems, whose sizes are incrementally increased or decreased (Osborne et al., 2000a). This approach was also pursued for the group-Lasso in its standard formulation (Roth and Fischer, 2008). We adapt this algorithmic framework to the variational form (4.21), with J(B) = ½‖YΘ − XB‖²_2.

The algorithm belongs to the working set family of optimization methods (see Section 2.3.6). It starts from a sparse initial guess, say B = 0, thus defining the set A of “active” variables currently identified as non-zero. Then it iterates the three steps summarized below:

1. Update the coefficient matrix B within the current active set A, where the optimization problem is smooth. First, the quadratic penalty is updated, and then a standard penalized least squares fit is computed.

2. Check the optimality conditions (4.32) with respect to the active variables. One or more β^j may be declared inactive when they vanish from the current solution.

3. Check the optimality conditions (4.32) with respect to the inactive variables. If they are satisfied, the algorithm returns the current solution, which is optimal. If they are not satisfied, the variable corresponding to the greatest violation is added to the active set.

This mechanism is graphically represented as a block diagram in Figure 5.1, and formalized in more detail in Algorithm 1. Note that this formulation uses the equations of the variational approach detailed in Section 4.3.1. If we want to use the alternative variational approach from Appendix D, then we have to replace Equations (4.21), (4.32a) and (4.32b) by (D.1), (D.10a) and (D.10b), respectively.

5.1 Regression Coefficients Updates

Step 1 of Algorithm 1 updates the coefficient matrix B within the current active set A. The quadratic variational form of the problem suggests a blockwise optimization strategy, consisting in solving (K−1) independent card(A)-dimensional problems instead of a single (K−1)×card(A)-dimensional problem. The interaction between the (K−1) problems is relegated to the common adaptive quadratic penalty Ω. This decomposition is especially attractive, as we then solve (K−1) similar systems
\[
\left( X_{\mathcal{A}}^\top X_{\mathcal{A}} + \lambda \Omega \right) \beta_k = X_{\mathcal{A}}^\top Y \theta_k^0 , \tag{5.1}
\]


Figure 5.1: GLOSS block diagram. The flowchart iterates over the following blocks: initialize the model (λ, B); form the active set (all j such that ‖β^j‖_2 > 0); solve the p-OS problem so that B satisfies the first optimality condition; move to the inactive set any active variable that must leave, and test the second optimality condition on the inactive set to decide whether a variable must enter the active set; when no more moves are required, compute Θ, update B and stop.


Algorithm 1: Adaptively Penalized Optimal Scoring
Input: X, Y, B, λ
Initialize: A ← {j ∈ {1,…,p} : ‖β^j‖_2 > 0}; Θ⁰ such that n^{-1} Θ⁰ᵀY^⊤YΘ⁰ = I_{K−1}; convergence ← false
repeat
    % Step 1: solve (4.21) in B, assuming A optimal
    repeat
        Ω ← diag(Ω_A), with ω_j ← ‖β^j‖_2^{-1}
        B_A ← (X_AᵀX_A + λΩ)^{-1} X_AᵀYΘ⁰
    until condition (4.32a) holds for all j ∈ A
    % Step 2: identify inactivated variables
    for j ∈ A such that ‖β^j‖_2 = 0 do
        if optimality condition (4.32b) holds then
            A ← A \ {j}; go back to Step 1
        end if
    end for
    % Step 3: check the greatest violation of optimality condition (4.32b) in Ā
    j* ← argmax_{j ∈ Ā} ‖∂J/∂β^j‖_2
    if ‖∂J/∂β^{j*}‖_2 < λ then
        convergence ← true   % B is optimal
    else
        A ← A ∪ {j*}
    end if
until convergence
(s, V) ← eigenanalyze(Θ⁰ᵀY^⊤X_A B), that is, Θ⁰ᵀY^⊤X_A B V_k = s_k V_k, k = 1, …, K−1
Θ ← Θ⁰V;  B ← BV;  α_k ← n^{-1/2} s_k^{1/2}, k = 1, …, K−1
Output: Θ, B, α


where X_A denotes the columns of X indexed by A, and β_k and θ_k^0 denote the kth column of B and Θ⁰, respectively. These linear systems only differ in their right-hand-side term, so that a single Cholesky decomposition is necessary to solve all the systems, whereas a blockwise Newton-Raphson method based on the standard group-Lasso formulation would result in different “penalties” Ω for each system.

5.1.1 Cholesky decomposition

Dropping the subscripts and considering the (K−1) systems together, (5.1) leads to
\[
\left( X^\top X + \lambda\Omega \right) B = X^\top Y \Theta . \tag{5.2}
\]

Defining the Cholesky decomposition as C^⊤C = (X^⊤X + λΩ), (5.2) is solved efficiently as follows:

\[
\begin{aligned}
C^\top C B &= X^\top Y \Theta \\
C B &= C^\top \backslash\, X^\top Y \Theta \\
B &= C \backslash \left( C^\top \backslash\, X^\top Y \Theta \right) ,
\end{aligned}
\tag{5.3}
\]

where the symbol “\” is the Matlab mldivide operator, which solves linear systems efficiently. The GLOSS code implements (5.3).
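For readers who do not use Matlab, the same two triangular solves can be written with SciPy as follows (an illustrative sketch, not the GLOSS code itself):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def solve_penalized_systems(X_A, Y, Theta, lam, Omega):
    """Solve (X_A' X_A + lam * Omega) B = X_A' Y Theta for all K-1 columns at once,
    with a single Cholesky factorization, as in (5.3)."""
    A = X_A.T @ X_A + lam * Omega
    rhs = X_A.T @ Y @ Theta
    c, lower = cho_factor(A)           # Cholesky factorization of the common matrix
    return cho_solve((c, lower), rhs)  # two triangular solves for all right-hand sides
```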

5.1.2 Numerical Stability

The OS regression coefficients are obtained by (5.2), where the penalizer Ω is iteratively updated by (4.33). In this iterative process, when a variable is about to leave the active set, the corresponding entry of Ω reaches large values, thereby driving some OS regression coefficients to zero. These large values may cause numerical stability problems in the Cholesky decomposition of X^⊤X + λΩ. This difficulty can be avoided using the following equivalent expression:

\[
B = \Omega^{-1/2} \left( \Omega^{-1/2} X^\top X\, \Omega^{-1/2} + \lambda I \right)^{-1} \Omega^{-1/2} X^\top Y \Theta^0 , \tag{5.4}
\]

where the conditioning of Ω^{-1/2}X^⊤XΩ^{-1/2} + λI is always well-behaved, provided X is appropriately normalized (recall that 0 ≤ 1/ω_j ≤ 1). This stabler expression demands more computation and is thus reserved to cases with large ω_j values. Our code is otherwise based on expression (5.2).

5.2 Score Matrix

The optimal score matrix Θ is made of the K−1 leading eigenvectors of Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Y. This eigen-analysis is actually solved in the form Θ^⊤Y^⊤X(X^⊤X + Ω)^{-1}X^⊤YΘ (see Section 4.2.1 and Appendix B). The latter eigenvector decomposition does not require the costly computation of (X^⊤X + Ω)^{-1}, which


involves the inversion of an n times n matrix Let Θ0 be an arbitrary K times (K minus 1) ma-

trix whose range includes the Kminus1 leading eigenvectors of YgtX(XgtX + Ω

)minus1XgtY 1

Then, solving the K−1 systems (5.3) provides the value of B⁰ = (X^⊤X + λΩ)^{-1}X^⊤YΘ⁰. This B⁰ matrix can be identified in the expression to eigenanalyze, as
\[
\Theta^{0\top} Y^\top X \left( X^\top X + \Omega \right)^{-1} X^\top Y \Theta^0 = \Theta^{0\top} Y^\top X B^0 .
\]
Thus, the solution to the penalized OS problem can be computed through the singular value decomposition of the (K−1)×(K−1) matrix Θ⁰ᵀY^⊤XB⁰ = VΛV^⊤. Defining Θ = Θ⁰V, we have Θ^⊤Y^⊤X(X^⊤X + Ω)^{-1}X^⊤YΘ = Λ and, when Θ⁰ is chosen such that n^{-1}Θ⁰ᵀY^⊤YΘ⁰ = I_{K−1}, we also have n^{-1}Θ^⊤Y^⊤YΘ = I_{K−1}, so that the constraints of the p-OS problem hold. Hence, assuming that the diagonal elements of Λ are sorted in decreasing order, θ_k is an optimal solution to the p-OS problem. Finally, once Θ has been computed, the corresponding optimal regression coefficients B satisfying (5.2) are simply recovered using the mapping from Θ⁰ to Θ, that is, B = B⁰V. Appendix E details why the computational trick described here for quadratic penalties can be applied to the group-Lasso, for which Ω is defined by a variational formulation.

5.3 Optimality Conditions

GLOSS uses an active set optimization technique to obtain the optimal values of the coefficient matrix B and the score matrix Θ. To be a solution, the coefficient matrix must obey Lemmas 4.3 and 4.4. The optimality conditions (4.32a) and (4.32b) can be deduced from those lemmas; both expressions require the computation of the gradient of the objective function

1

2YΘminusXB22 + λ

psumj=1

wj∥∥βj∥∥

2(55)

Let J(B) be the data-fitting term 12 YΘminusXB22 Its gradient with respect to the jth

row of B βj is the (K minus 1)-dimensional vector

partJ(B)

partβj= xj

gt(XBminusYΘ)

where xj is the column j of X Hence the first optimality condition (432a) can becomputed for every variable j as

xjgt

(XBminusYΘ) + λwjβj∥∥βj∥∥

2

1. As X is centered, 1_K belongs to the null space of Y^T X (X^T X + λΩ)^{-1} X^T Y. It is thus sufficient to choose Θ_0 orthogonal to 1_K to ensure that its range spans the leading eigenvectors of Y^T X (X^T X + λΩ)^{-1} X^T Y. In practice, to comply with this desideratum and conditions (3.5b) and (3.5c), we set Θ_0 = (Y^T Y)^{-1/2} U, where U is a K × (K−1) matrix whose columns are orthonormal vectors orthogonal to 1_K.


The second optimality condition (4.32b) can be computed for every variable j as

‖x_j^T (XB − YΘ)‖_2 ≤ λ w_j .

5.4 Active and Inactive Sets

The feature selection mechanism embedded in GLOSS selects the variables that provide the greatest decrease in the objective function. This is accomplished by means of the optimality conditions (4.32a) and (4.32b). Let A be the active set, containing the variables that have already been considered relevant. A variable j can be considered for inclusion into the active set if it violates the second optimality condition. We proceed one variable at a time, by choosing the one that is expected to produce the greatest decrease in the objective function:

j* = arg max_j  max( ‖x_j^T (XB − YΘ)‖_2 − λ w_j , 0 ) .

The exclusion of a variable belonging to the active set A is considered if the norm ‖β^j‖_2 is small and if, after setting β^j to zero, the following optimality condition holds:

‖x_j^T (XB − YΘ)‖_2 ≤ λ w_j .

The process continues until no variable in the active set violates the first optimality condition and no variable in the inactive set violates the second optimality condition.
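A minimal matlab sketch of the inclusion test could look as follows, where X, B, Y, Theta, lambda, the weight vector w and the index set A are assumed to be available (illustrative code, not the package implementation):

% Select the inactive variable that most violates the second optimality condition.
G = X' * (X*B - Y*Theta);              % p x (K-1) gradient of the data-fitting term
viol = sqrt(sum(G.^2, 2)) - lambda*w;  % per-variable violation of (4.32b)
viol(A) = -Inf;                        % only inactive variables are candidates
[v, jstar] = max(viol);
if v > 0
    A = [A; jstar];                    % variable jstar enters the active set
end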

5.5 Penalty Parameter

The penalty parameter can be specified by the user, in which case GLOSS solves the problem with this value of λ. The other strategy is to compute the solution path for several values of λ: GLOSS then looks for the maximum value of the penalty parameter λ_max such that B ≠ 0, and solves the p-OS problem for decreasing values of λ until a prescribed number of features are declared active.

The maximum value of the penalty parameter λ_max corresponding to a null B matrix is obtained by computing the optimality condition (4.32b) at B = 0:

λ_max = max_{j ∈ {1,…,p}}  (1/w_j) ‖x_j^T YΘ_0‖_2 .

The algorithm then computes a series of solutions along the regularization path defined by a series of penalties λ_1 = λ_max > · · · > λ_t > · · · > λ_T = λ_min ≥ 0, by regularly decreasing the penalty, λ_{t+1} = λ_t / 2, and using a warm-start strategy where the feasible initial guess for B(λ_{t+1}) is initialized with B(λ_t). The final penalty parameter λ_min is determined during the optimization process, when the maximum number of desired active variables is attained (by default, the minimum of n and p).
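For illustration, λ_max and the halving schedule could be computed as sketched below (Theta0, the weight vector w and the path length nLambda are assumed given; this is not the package code):

% Largest penalty for which B = 0, from (4.32b), and the regularization path.
G0 = X' * (Y * Theta0);                          % gradient of the fit term at B = 0
lambda_max = max(sqrt(sum(G0.^2, 2)) ./ w);
lambdas = lambda_max * 0.5.^(0:(nLambda-1));     % lambda_1 = lambda_max, halved at each step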

5.6 Options and Variants

5.6.1 Scaling Variables

Like most penalization schemes, GLOSS is sensitive to the scaling of variables. It thus makes sense to normalize them before applying the algorithm or, equivalently, to accommodate weights in the penalty. This option is available in the algorithm.

5.6.2 Sparse Variant

This version replaces some matlab commands used in the standard version of GLOSS by their sparse equivalents. In addition, some mathematical structures are adapted for sparse computation.

5.6.3 Diagonal Variant

We motivated the group-Lasso penalty by sparsity requirements, but robustness considerations could also drive its usage, since LDA is known to be unstable when the number of examples is small compared to the number of variables. In this context, LDA has been experimentally observed to benefit from unrealistic assumptions on the form of the estimated within-class covariance matrix. Indeed, the diagonal approximation, which ignores correlations between genes, may lead to better classification in microarray analysis. Bickel and Levina (2004) showed that this crude approximation provides a classifier with better worst-case performance than the LDA decision rule in small sample size regimes, even if variables are correlated.

The equivalence proof between penalized OS and penalized LDA (Hastie et al., 1995) reveals that quadratic penalties in the OS problem are equivalent to penalties on the within-class covariance matrix in the LDA formulation. This proof suggests a slight variant of penalized OS, corresponding to penalized LDA with a diagonal within-class covariance matrix, where the least squares problems

min_{B ∈ R^{p×(K−1)}} ‖YΘ − XB‖²_F = min_{B ∈ R^{p×(K−1)}} tr( Θ^T Y^T YΘ − 2 Θ^T Y^T XB + n B^T Σ_T B )

are replaced by

min_{B ∈ R^{p×(K−1)}} tr( Θ^T Y^T YΘ − 2 Θ^T Y^T XB + n B^T (Σ_B + diag(Σ_W)) B ) .

Note that this variant only requires diag(Σ_W) + Σ_B + n^{-1}Ω to be positive definite, which is a weaker requirement than Σ_T + n^{-1}Ω positive definite.
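In practice, the matrices involved in this variant can be formed from the centered data matrix X and the class indicator matrix Y, for instance as sketched below (illustrative matlab with assumed variable names):

% Empirical covariance matrices used by the diagonal variant.
nk = sum(Y, 1)';                            % class counts (K x 1)
Mu = bsxfun(@rdivide, Y'*X, nk);            % class means (K x p), X is centered
ST = (X'*X) / n;                            % total covariance
SB = Mu' * bsxfun(@times, nk, Mu) / n;      % between-class covariance
SW = ST - SB;                               % within-class covariance
% the diagonal variant replaces ST by SB + diag(diag(SW)) in the quadratic term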

5.6.4 Elastic net and Structured Variant

For some learning problems, the structure of correlations between variables is partially known. Hastie et al. (1995) applied this idea to the field of handwritten digit recognition


7 8 9
4 5 6
1 2 3

Ω_L =
[  3 −1  0 −1 −1  0  0  0  0
  −1  5 −1 −1 −1 −1  0  0  0
   0 −1  3  0 −1 −1  0  0  0
  −1 −1  0  5 −1  0 −1 −1  0
  −1 −1 −1 −1  8 −1 −1 −1 −1
   0 −1 −1  0 −1  5  0 −1 −1
   0  0  0 −1 −1  0  3 −1  0
   0  0  0 −1 −1 −1 −1  5 −1
   0  0  0  0 −1 −1  0 −1  3 ]

Figure 5.2: Graph and Laplacian matrix for a 3 × 3 image.

for their penalized discriminant analysis model, to constrain the discriminant directions to be spatially smooth.

When an image is represented as a vector of pixels, it is reasonable to assume positive correlations between the variables corresponding to neighboring pixels. Figure 5.2 represents the neighborhood graph of pixels in a 3 × 3 image, with the corresponding Laplacian matrix. The Laplacian matrix Ω_L is positive semi-definite, and the penalty β^T Ω_L β favors, among vectors of identical L2 norms, the ones having similar coefficients in the neighborhoods of the graph. For example, this penalty is 9 for the vector (1 1 0 1 1 0 0 0 0)^T, which is the indicator of pixel 1 and its neighbors, and it is 21 for the vector (−1 1 0 1 1 0 0 0 0)^T, which has a sign mismatch between pixel 1 and its neighborhood.

This smoothness penalty can be imposed jointly with the group-Lasso. From the computational point of view, GLOSS hardly needs to be modified: the smoothness penalty just has to be added to the group-Lasso penalty. As the new penalty is convex and quadratic (thus smooth), there is no additional burden in the overall algorithm. There is, however, an additional hyperparameter to be tuned.
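As an illustration, the Laplacian of Figure 5.2 and the two penalty values mentioned above can be reproduced in a few lines of matlab (a sketch with illustrative names, assuming the pixel numbering of the figure):

% Build the 8-neighborhood Laplacian of a 3x3 image and evaluate the penalty.
[r, c] = ndgrid(1:3, 1:3);              % pixel grid coordinates
A = zeros(9);
for i = 1:9
    for j = 1:9
        if i ~= j && max(abs([r(i)-r(j), c(i)-c(j)])) == 1
            A(i,j) = 1;                 % pixels i and j are neighbors
        end
    end
end
OmegaL = diag(sum(A, 2)) - A;           % Laplacian matrix of Figure 5.2
beta1 = [ 1 1 0 1 1 0 0 0 0]';          % indicator of pixel 1 and its neighbors
beta2 = [-1 1 0 1 1 0 0 0 0]';          % same support, sign mismatch on pixel 1
penalties = [beta1'*OmegaL*beta1, beta2'*OmegaL*beta2]   % returns 9 and 21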

6 Experimental Results

This section presents comparison results between the Group-Lasso Optimal Scoring Solver and two other state-of-the-art classifiers proposed to perform sparse LDA: Penalized LDA (PLDA) (Witten and Tibshirani, 2011), which applies a Lasso penalty within Fisher's LDA framework, and Sparse Linear Discriminant Analysis (SLDA) (Clemmensen et al., 2011), which applies an Elastic net penalty to the OS problem. With the aim of testing the parsimony capacities, the latter algorithm was tested without any quadratic penalty, that is, with a Lasso penalty. The implementations of PLDA and SLDA are available from the authors' websites; PLDA is an R implementation and SLDA is coded in matlab. All the experiments used the same training, validation and test sets. Note that they differ significantly from the ones of Witten and Tibshirani (2011) in Simulation 4, for which there was a typo in their paper.

6.1 Normalization

With shrunken estimates, the scaling of features has important consequences. For the linear discriminants considered here, the two most common normalization strategies consist in setting to ones either the diagonal of the total covariance matrix Σ_T or the diagonal of the within-class covariance matrix Σ_W. These options can be implemented either by scaling the observations accordingly prior to the analysis, or by providing penalties with weights. The latter option is implemented in our matlab package.¹

6.2 Decision Thresholds

The derivations of LDA based on the analysis of variance or on the regression of class indicators do not rely on the normality of the class-conditional distribution of the observations. Hence, their applicability extends beyond the realm of Gaussian data. Based on this observation, Friedman et al. (2009, chapter 4) suggest investigating decision thresholds other than the ones stemming from the Gaussian mixture assumption. In particular, they propose to select the decision thresholds that empirically minimize the training error. This option was tested using validation sets or cross-validation.

1. The GLOSS matlab code can be found in the software section of www.hds.utc.fr/~grandval.

6.3 Simulated Data

We first compare the three techniques in the simulation study of Witten and Tibshirani (2011), which considers four setups with 1200 examples equally distributed between classes. They are split into a training set of size n = 100, a validation set of size 100, and a test set of size 1000. We are in the small sample regime, with p = 500 variables, out of which 100 differ between classes. Independent variables are generated for all simulations except for Simulation 2, where they are slightly correlated. In Simulations 2 and 3, classes are optimally separated by a single projection of the original variables, while the two other scenarios require three discriminant directions. The Bayes' error was estimated to be respectively 1.7%, 6.7%, 7.3% and 30.0%. The exact definition of every setup, as provided in Witten and Tibshirani (2011), is given below.

Simulation 1: Mean shift with independent features. There are four classes. If sample i is in class k, then x_i ∼ N(μ_k, I), where μ_1j = 0.7 × 1(1 ≤ j ≤ 25), μ_2j = 0.7 × 1(26 ≤ j ≤ 50), μ_3j = 0.7 × 1(51 ≤ j ≤ 75), μ_4j = 0.7 × 1(76 ≤ j ≤ 100).

Simulation 2: Mean shift with dependent features. There are two classes. If sample i is in class 1, then x_i ∼ N(0, Σ), and if i is in class 2, then x_i ∼ N(μ, Σ), with μ_j = 0.6 × 1(j ≤ 200). The covariance structure is block diagonal, with 5 blocks, each of dimension 100 × 100. The blocks have (j, j′) element 0.6^{|j−j′|}. This covariance structure is intended to mimic gene expression data correlation.

Simulation 3: One-dimensional mean shift with independent features. There are four classes and the features are independent. If sample i is in class k, then X_ij ∼ N((k−1)/3, 1) if j ≤ 100, and X_ij ∼ N(0, 1) otherwise.

Simulation 4: Mean shift with independent features and no linear ordering. There are four classes. If sample i is in class k, then x_i ∼ N(μ_k, I), with mean vectors defined as follows: μ_1j ∼ N(0, 0.3²) for j ≤ 25 and μ_1j = 0 otherwise; μ_2j ∼ N(0, 0.3²) for 26 ≤ j ≤ 50 and μ_2j = 0 otherwise; μ_3j ∼ N(0, 0.3²) for 51 ≤ j ≤ 75 and μ_3j = 0 otherwise; μ_4j ∼ N(0, 0.3²) for 76 ≤ j ≤ 100 and μ_4j = 0 otherwise.
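For concreteness, Simulation 1 can be generated along the following lines (a sketch; the exactly balanced class assignment is replaced here by random labels, balanced in expectation):

% Data generation for Simulation 1 (four classes, mean shift, independent features).
n = 1200; p = 500; K = 4;
Mu = zeros(K, p);
for k = 1:K
    Mu(k, (25*(k-1)+1):(25*k)) = 0.7;   % 25 shifted features per class
end
y = randi(K, n, 1);                      % class labels
X = Mu(y, :) + randn(n, p);              % x_i ~ N(mu_k, I)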

Note that this protocol is detrimental to GLOSS, as each relevant variable only affects a single class mean out of K. The setup is favorable to PLDA in the sense that most within-class covariance matrices are diagonal. We thus also tested the diagonal GLOSS variant discussed in Section 5.6.3.

The results are summarized in Table 6.1. Overall, the best predictions are performed by PLDA and GLOSS-D, which both benefit from the knowledge of the true within-class covariance structure. Then, among SLDA and GLOSS, which both ignore this structure, our proposal has a clear edge. The error rates are far away from the Bayes' error rates, but the sample size is small with regard to the number of relevant variables. Regarding sparsity, the clear overall winner is GLOSS, followed far behind by SLDA, which is the only

Table 6.1: Experimental results for simulated data: averages, with standard deviations, computed over 25 repetitions, of the test error rate, the number of selected variables, and the number of discriminant directions selected on the validation set.

                         Err (%)       Var            Dir

Sim 1: K = 4, mean shift, ind. features
  PLDA                 12.6 (0.1)    411.7 (3.7)    3.0 (0.0)
  SLDA                 31.9 (0.1)    228.0 (0.2)    3.0 (0.0)
  GLOSS                19.9 (0.1)    106.4 (1.3)    3.0 (0.0)
  GLOSS-D              11.2 (0.1)    251.1 (4.1)    3.0 (0.0)

Sim 2: K = 2, mean shift, dependent features
  PLDA                  9.0 (0.4)    337.6 (5.7)    1.0 (0.0)
  SLDA                 19.3 (0.1)     99.0 (0.0)    1.0 (0.0)
  GLOSS                15.4 (0.1)     39.8 (0.8)    1.0 (0.0)
  GLOSS-D               9.0 (0.0)    203.5 (4.0)    1.0 (0.0)

Sim 3: K = 4, 1D mean shift, ind. features
  PLDA                 13.8 (0.6)    161.5 (3.7)    1.0 (0.0)
  SLDA                 57.8 (0.2)    152.6 (2.0)    1.9 (0.0)
  GLOSS                31.2 (0.1)    123.8 (1.8)    1.0 (0.0)
  GLOSS-D              18.5 (0.1)    357.5 (2.8)    1.0 (0.0)

Sim 4: K = 4, mean shift, ind. features
  PLDA                 60.3 (0.1)    336.0 (5.8)    3.0 (0.0)
  SLDA                 65.9 (0.1)    208.8 (1.6)    2.7 (0.0)
  GLOSS                60.7 (0.2)     74.3 (2.2)    2.7 (0.0)
  GLOSS-D              58.8 (0.1)    162.7 (4.9)    2.9 (0.0)


Figure 6.1: TPR versus FPR (in %) for all algorithms (GLOSS, GLOSS-D, SLDA, PLDA) and all four simulations.

Table 6.2: Average TPR and FPR (in %) computed over 25 repetitions.

            Simulation 1     Simulation 2     Simulation 3     Simulation 4
            TPR     FPR      TPR     FPR      TPR     FPR      TPR     FPR
PLDA        99.0    78.2     96.9    60.3     98.0    15.9     74.3    65.6
SLDA        73.9    38.5     33.8    16.3     41.6    27.8     50.7    39.5
GLOSS       64.1    10.6     30.0     4.6     51.1    18.2     26.0    12.1
GLOSS-D     93.5    39.4     92.1    28.1     95.6    65.5     42.9    29.9

method that does not succeed in uncovering a low-dimensional representation in Simulation 3. The adequacy of the selected features was assessed by the True Positive Rate (TPR) and the False Positive Rate (FPR). The TPR is defined as the proportion of relevant variables that are selected; similarly, the FPR is the proportion of irrelevant variables that are selected. The best algorithm would be the one that selects all the relevant variables and rejects all the others, that is, TPR = 1 and FPR = 0 simultaneously. PLDA has the best TPR but a terrible FPR, except in Simulation 3, where it dominates all the other methods. GLOSS has by far the best FPR, with an overall TPR slightly below SLDA. Results are displayed in Figure 6.1 and in Table 6.2 (both in percentages).

6.4 Gene Expression Data

We now compare GLOSS to PLDA and SLDA on three genomic datasets. The Nakayama² dataset contains 105 examples of 22,283 gene expressions for categorizing 10 soft tissue tumors. It was reduced to the 86 examples belonging to the 5 dominant categories (Witten and Tibshirani, 2011). The Ramaswamy³ dataset contains 198

2. http://www.broadinstitute.org/cancer/software/genepattern/datasets
3. http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS2736

Table 6.3: Experimental results for gene expression data: averages over 10 training/test set splits, with standard deviations, of the test error rates and the number of selected variables.

                                Err (%)           Var

Nakayama: n = 86, p = 22,283, K = 5
  PLDA                        20.95 (1.3)    10478.7 (2116.3)
  SLDA                        25.71 (1.7)      252.5 (3.1)
  GLOSS                       20.48 (1.4)      129.0 (18.6)

Ramaswamy: n = 198, p = 16,063, K = 14
  PLDA                        38.36 (6.0)    14873.5 (720.3)
  SLDA                            —                —
  GLOSS                       20.61 (6.9)      372.4 (122.1)

Sun: n = 180, p = 54,613, K = 4
  PLDA                        33.78 (5.9)    21634.8 (7443.2)
  SLDA                        36.22 (6.5)      384.4 (16.5)
  GLOSS                       31.77 (4.5)       93.0 (93.6)

examples of 16,063 gene expressions for categorizing 14 classes of cancer. Finally, the Sun⁴ dataset contains 180 examples of 54,613 gene expressions for categorizing 4 classes of tumors.

Each dataset was split into a training set and a test set with respectively 75% and 25% of the examples. Parameter tuning is performed by 10-fold cross-validation, and the test performances are then evaluated. The process is repeated 10 times, with random choices of the training and test set split.

Test error rates and the number of selected variables are presented in Table 6.3. The results for the PLDA algorithm are extracted from Witten and Tibshirani (2011). The three methods have comparable prediction performances on the Nakayama and Sun datasets, but GLOSS performs better on the Ramaswamy data, where the SparseLDA package failed to return a solution due to numerical problems in the LARS-EN implementation. Regarding the number of selected variables, GLOSS is again much sparser than its competitors.

Finally, Figure 6.2 displays the projection of the observations for the Nakayama and Sun datasets in the first canonical planes estimated by GLOSS and SLDA. For the Nakayama dataset, groups 1 and 2 are well separated from the other ones in both representations, but GLOSS is more discriminant in the meta-cluster gathering groups 3 to 5. For the Sun dataset, SLDA suffers from a high colinearity of its first canonical variables, which renders the second one almost non-informative. As a result, group 1 is better separated in the first canonical plane with GLOSS.

4. http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS1962


[Figure 6.2: scatter plots of the observations in the plane of the first two discriminant directions (1st discriminant, horizontal; 2nd discriminant, vertical), for GLOSS (left) and SLDA (right). Top row: Nakayama dataset, classes 1) Synovial sarcoma, 2) Myxoid liposarcoma, 3) Dedifferentiated liposarcoma, 4) Myxofibrosarcoma, 5) Malignant fibrous histiocytoma. Bottom row: Sun dataset, classes 1) NonTumor, 2) Astrocytomas, 3) Glioblastomas, 4) Oligodendrogliomas.]

Figure 6.2: 2D-representations of the Nakayama and Sun datasets based on the two first discriminant vectors provided by GLOSS and SLDA. The big squares represent class means.

Figure 6.3: USPS digits "1" and "0".

6.5 Correlated Data

When the features are known to be highly correlated, the discrimination algorithm can be improved by using this information in the optimization problem. The structured variant of GLOSS presented in Section 5.6.4, S-GLOSS from now on, was conceived to easily introduce this prior knowledge.

The experiments described in this section are intended to illustrate the effect of combining the group-Lasso sparsity-inducing penalty with a quadratic penalty used as a surrogate of the unknown within-class covariance matrix. This preliminary experiment does not include comparisons with other algorithms; more comprehensive experimental results are left for future work.

For this illustration, we have used a subset of the USPS handwritten digit dataset, made of 16 × 16 pixel images representing digits from 0 to 9. For our purpose, we compare the discriminant direction that separates digits "1" and "0", computed with GLOSS and S-GLOSS. The mean image of every digit is shown in Figure 6.3.

As in Section 5.6.4, we have encoded the pixel proximity relationships of Figure 5.2 into a penalty matrix Ω_L, but this time on a 256-node graph. Introducing this new 256 × 256 Laplacian penalty matrix Ω_L in the GLOSS algorithm is straightforward.

The effect of this penalty is fairly evident in Figure 6.4, where the discriminant vector β resulting from a non-penalized execution of GLOSS is compared with the β resulting from a Laplace-penalized execution of S-GLOSS (without group-Lasso penalty). We clearly distinguish the center of the digit "0" in the discriminant direction obtained by S-GLOSS, which is probably the most important element to discriminate both digits.

Figure 6.5 displays the discriminant direction β obtained by GLOSS and S-GLOSS for a non-zero group-Lasso penalty, with an identical penalty parameter (λ = 0.3). Even if both solutions are sparse, the discriminant vector from S-GLOSS keeps connected pixels, which allows the detection of strokes and will probably provide better prediction results.

Figure 6.4: Discriminant direction between digits "1" and "0" (left: β for GLOSS; right: β for S-GLOSS).

Figure 6.5: Sparse discriminant direction between digits "1" and "0" (left: β for GLOSS with λ = 0.3; right: β for S-GLOSS with λ = 0.3).

Discussion

GLOSS is an efficient algorithm that performs sparse LDA based on the regression of class indicators. Our proposal is equivalent to a penalized LDA problem; to our knowledge, it is the first approach that enjoys this property in the multi-class setting. This relationship is also amenable to accommodating interesting constraints on the equivalent penalized LDA problem, such as imposing a diagonal structure on the within-class covariance matrix.

Computationally, GLOSS is based on an efficient active set strategy that is amenable to the processing of problems with a large number of variables. The inner optimization problem decouples the p × (K−1)-dimensional problem into (K−1) independent p-dimensional problems. The interaction between the (K−1) problems is relegated to the computation of the common adaptive quadratic penalty. The algorithm presented here is highly efficient in medium to high dimensional setups, which makes it a good candidate for the analysis of gene expression data.

The experimental results confirm the relevance of the approach, which behaves well compared to its competitors, regarding both its prediction abilities and its interpretability (sparsity). Generally, compared to the competing approaches, GLOSS provides extremely parsimonious discriminants without compromising prediction performance. Employing the same features in all discriminant directions enables the generation of models that are globally extremely parsimonious, with good prediction abilities. The resulting sparse discriminant directions also allow for visual inspection of the data from the low-dimensional representations that can be produced.

The approach has many potential extensions that have not yet been implemented. A first line of development is to consider a broader class of penalties. For example, plain quadratic penalties can also be added to the group penalty to encode priors about the within-class covariance structure, in the spirit of the Penalized Discriminant Analysis of Hastie et al. (1995). Also, besides the group-Lasso, our framework can be customized to any penalty that is uniformly spread within groups, and many composite or hierarchical penalties that have been proposed for structured data meet this condition.


Part III

Sparse Clustering Analysis


Abstract

Clustering can be defined as the task of grouping samples such that all the elements belonging to one cluster are more "similar" to each other than to the objects belonging to the other groups. There are similarity measures for any data structure: database records, or even multimedia objects (audio, video). The similarity concept is closely related to the idea of distance, which is a specific dissimilarity.

Model-based clustering aims to describe a heterogeneous population with a probabilistic model that represents each group with its own distribution. Here, the distributions will be Gaussians, and the different populations are identified by different means and a common covariance matrix.

As in the supervised framework, traditional clustering techniques perform worse when the number of irrelevant features increases. In this part, we develop Mix-GLOSS, which builds on the supervised GLOSS algorithm to address unsupervised problems, resulting in a clustering mechanism with embedded feature selection.

Chapter 7 reviews different techniques for inducing sparsity in model-based clustering algorithms. The theory that motivates our original formulation of the EM algorithm is developed in Chapter 8, followed by the description of the algorithm in Chapter 9. Its performance is assessed and compared to other state-of-the-art model-based sparse clustering mechanisms in Chapter 10.

7 Feature Selection in Mixture Models

7.1 Mixture Models

One of the most popular clustering algorithms is K-means, which aims to partition n observations into K clusters, each observation being assigned to the cluster with the nearest mean (MacQueen, 1967). A generalization of K-means can be made through probabilistic models, which represent K subpopulations by a mixture of distributions. Since their first use by Newcomb (1886) for the detection of outlier points, and 8 years later by Pearson (1894) to identify two separate populations of crabs, finite mixtures of distributions have been employed to model a wide variety of random phenomena. These models assume that measurements are taken from a set of individuals, each of which belongs to one out of a number of different classes, while any individual's particular class is unknown. Mixture models can thus address the heterogeneity of a population and are especially well suited to the problem of clustering.

7.1.1 Model

We assume that the observed data X = (x_1^T, …, x_n^T)^T have been drawn identically from K different subpopulations in the domain R^p. The generative distribution is a finite mixture model, that is, the data are assumed to be generated from a compounded distribution whose density can be expressed as

f(x_i) = ∑_{k=1}^{K} π_k f_k(x_i) ,  ∀i ∈ {1, …, n} ,

where K is the number of components, f_k are the densities of the components, and π_k are the mixture proportions (π_k ∈ ]0,1[ for all k, and ∑_k π_k = 1). Mixture models transcribe that, given the proportions π_k and the distributions f_k for each class, the data are generated according to the following mechanism:

• y: each individual is allotted to a class according to a multinomial distribution with parameters π_1, …, π_K;

• x: each x_i is assumed to arise from a random vector with probability density function f_k.

In addition, it is usually assumed that the component densities f_k belong to a parametric family of densities φ(·; θ_k). The density of the mixture can then be written as

f(x_i; θ) = ∑_{k=1}^{K} π_k φ(x_i; θ_k) ,  ∀i ∈ {1, …, n} ,

where θ = (π_1, …, π_K, θ_1, …, θ_K) is the parameter of the model.

7.1.2 Parameter Estimation: The EM Algorithm

For the estimation of the parameters of the mixture model, Pearson (1894) used the method of moments to estimate the five parameters (μ_1, μ_2, σ_1², σ_2², π) of a univariate Gaussian mixture model with two components. That method required him to solve polynomial equations of degree nine. There are also graphical methods, maximum likelihood methods and Bayesian approaches.

The most widely used process to estimate the parameters is by maximizing the log-likelihood using the EM algorithm It is typically used to maximize the likelihood formodels with latent variables for which no analytical solution is available (Dempsteret al 1977)

The EM algorithm iterates two steps called the expectation step (E) and the max-imization step (M) Each expectation step involves the computation of the likelihoodexpectation with respect to the hidden variables while each maximization step esti-mates the parameters by maximizing the E-step expected likelihood

Under mild regularity assumptions this mechanism converges to a local maximumof the likelihood However the type of problems targeted is typically characterized bythe existence of several local maxima and global convergence cannot be guaranteed Inpractice the obtained solution depends on the initialization of the algorithm

Maximum Likelihood Definitions

The likelihood is commonly expressed in its logarithmic version:

L(θ; X) = log ( ∏_{i=1}^{n} f(x_i; θ) )
        = ∑_{i=1}^{n} log ( ∑_{k=1}^{K} π_k f_k(x_i; θ_k) ) ,    (7.1)

where n is the number of samples, K is the number of components of the mixture (or number of clusters), and π_k are the mixture proportions.

To obtain maximum likelihood estimates, the EM algorithm works with the joint distribution of the observations x and the unknown latent variables y, which indicate the cluster membership of every sample. The pair z = (x, y) is called the complete data. The log-likelihood of the complete data is called the complete log-likelihood, or classification log-likelihood:

L_C(θ; X, Y) = log ( ∏_{i=1}^{n} f(x_i, y_i; θ) )
             = ∑_{i=1}^{n} log ( ∑_{k=1}^{K} y_ik π_k f_k(x_i; θ_k) )
             = ∑_{i=1}^{n} ∑_{k=1}^{K} y_ik log ( π_k f_k(x_i; θ_k) ) .    (7.2)

The y_ik are the binary entries of the indicator matrix Y, with y_ik = 1 if observation i belongs to cluster k, and y_ik = 0 otherwise.

Define the soft membership t_ik(θ) as

t_ik(θ) = p(Y_ik = 1 | x_i; θ)    (7.3)
        = π_k f_k(x_i; θ_k) / f(x_i; θ) .    (7.4)

To lighten notations, t_ik(θ) will be denoted t_ik when the parameter θ is clear from the context. The regular (7.1) and complete (7.2) log-likelihoods are related as follows:

L_C(θ; X, Y) = ∑_{i,k} y_ik log ( π_k f_k(x_i; θ_k) )
             = ∑_{i,k} y_ik log ( t_ik f(x_i; θ) )
             = ∑_{i,k} y_ik log t_ik + ∑_{i,k} y_ik log f(x_i; θ)
             = ∑_{i,k} y_ik log t_ik + ∑_{i=1}^{n} log f(x_i; θ)
             = ∑_{i,k} y_ik log t_ik + L(θ; X) ,    (7.5)

where ∑_{i,k} y_ik log t_ik can be reformulated as

∑_{i,k} y_ik log t_ik = ∑_{i=1}^{n} ∑_{k=1}^{K} y_ik log p(Y_ik = 1 | x_i; θ)
                      = ∑_{i=1}^{n} log p(Y_{i k_i} = 1 | x_i; θ)   (k_i denoting the class of sample i)
                      = log p(Y | X; θ) .

As a result, the relationship (7.5) can be rewritten as

L(θ; X) = L_C(θ; Z) − log p(Y | X; θ) .    (7.6)


Likelihood Maximization

The complete log-likelihood cannot be assessed because the variables y_ik are unknown. However, it is possible to estimate the value of the log-likelihood by taking expectations of (7.6), conditionally on a current value θ^(t) of the parameter:

L(θ; X) = E_{Y∼p(·|X;θ^(t))} [ L_C(θ; X, Y) ] + E_{Y∼p(·|X;θ^(t))} [ −log p(Y | X; θ) ]
        =              Q(θ, θ^(t))           +              H(θ, θ^(t)) .

In this expression, H(θ, θ^(t)) is an entropy term and Q(θ, θ^(t)) is the conditional expectation of the complete log-likelihood. Let us define an increment of the log-likelihood as ΔL = L(θ^(t+1); X) − L(θ^(t); X). Then, θ^(t+1) = arg max_θ Q(θ, θ^(t)) also increases the log-likelihood:

ΔL = ( Q(θ^(t+1), θ^(t)) − Q(θ^(t), θ^(t)) ) + ( H(θ^(t+1), θ^(t)) − H(θ^(t), θ^(t)) ) ,

where the first difference is non-negative by definition of iteration t+1 and the second difference is non-negative by Jensen's inequality, so that ΔL ≥ 0. Therefore, it is possible to maximize the likelihood by optimizing Q(θ, θ^(t)). The relationship between Q(θ, θ′) and L(θ; X) is developed in deeper detail in Appendix F, to show how the value of L(θ; X) can be recovered from Q(θ, θ^(t)).

For the mixture model problem, Q(θ, θ′) is

Q(θ, θ′) = E_{Y∼p(Y|X;θ′)} [ L_C(θ; X, Y) ]
         = ∑_{i,k} p(Y_ik = 1 | x_i; θ′) log ( π_k f_k(x_i; θ_k) )
         = ∑_{i=1}^{n} ∑_{k=1}^{K} t_ik(θ′) log ( π_k f_k(x_i; θ_k) ) .    (7.7)

Due to its similarity to the expression of the complete likelihood (7.2), Q(θ, θ′) is also known as the weighted likelihood. In (7.7), the weights t_ik(θ′) are the posterior probabilities of cluster memberships.

Hence, the EM algorithm sketched above results in:

• Initialization (not iterated): choice of the initial parameter θ^(0);

• E-step: evaluation of Q(θ, θ^(t)), using t_ik(θ^(t)) (7.4) in (7.7);

• M-step: calculation of θ^(t+1) = arg max_θ Q(θ, θ^(t)).


Gaussian Model

In the particular case of a Gaussian mixture model with a common covariance matrix Σ and different mean vectors μ_k, the mixture density is

f(x_i; θ) = ∑_{k=1}^{K} π_k f_k(x_i; θ_k)
          = ∑_{k=1}^{K} π_k (2π)^{-p/2} |Σ|^{-1/2} exp{ −(1/2) (x_i − μ_k)^T Σ^{-1} (x_i − μ_k) } .

At the E-step, the posterior probabilities t_ik are computed as in (7.4) with the current parameters θ^(t); then the M-step maximizes Q(θ, θ^(t)) (7.7), whose form is as follows:

Q(θ, θ^(t)) = ∑_{i,k} t_ik log π_k − ∑_{i,k} t_ik log( (2π)^{p/2} |Σ|^{1/2} ) − (1/2) ∑_{i,k} t_ik (x_i − μ_k)^T Σ^{-1} (x_i − μ_k)
            = ∑_k t_k log π_k − (np/2) log(2π) − (n/2) log|Σ| − (1/2) ∑_{i,k} t_ik (x_i − μ_k)^T Σ^{-1} (x_i − μ_k)
            ≡ ∑_k t_k log π_k − (n/2) log|Σ| − ∑_{i,k} t_ik ( (1/2) (x_i − μ_k)^T Σ^{-1} (x_i − μ_k) ) ,    (7.8)

where the constant term (np/2) log(2π) has been dropped in the last line, and

t_k = ∑_{i=1}^{n} t_ik .    (7.9)

The M-step, which maximizes this expression with respect to θ, applies the following updates, defining θ^(t+1):

π_k^(t+1) = t_k / n ,    (7.10)

μ_k^(t+1) = ∑_i t_ik x_i / t_k ,    (7.11)

Σ^(t+1) = (1/n) ∑_k W_k ,    (7.12)

with  W_k = ∑_i t_ik (x_i − μ_k)(x_i − μ_k)^T .    (7.13)

The derivations are detailed in Appendix G.
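For illustration, these updates can be written in a few lines of matlab; the sketch below assumes a data matrix X (n × p) and a posterior matrix T (n × K) coming from the E-step (illustrative names only).

% M-step updates (7.10)-(7.13) for the common-covariance Gaussian mixture.
tk = sum(T, 1)';                           % (7.9), K x 1
pik = tk / n;                              % (7.10) mixture proportions
Mu = bsxfun(@rdivide, T'*X, tk);           % (7.11) K x p matrix of cluster means
Sigma = zeros(p);
for k = 1:K
    Xc = bsxfun(@minus, X, Mu(k, :));      % x_i - mu_k
    Sigma = Sigma + Xc' * bsxfun(@times, T(:, k), Xc);   % W_k in (7.13)
end
Sigma = Sigma / n;                         % (7.12)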

7.2 Feature Selection in Model-Based Clustering

When a common covariance matrix is assumed, Gaussian mixtures are related to LDA, with partitions defined by linear decision rules. When every cluster has its own


covariance matrix Σ_k, Gaussian mixtures are associated with quadratic discriminant analysis (QDA), with quadratic boundaries.

In the high-dimensional, low-sample-size setting, numerical issues appear in the estimation of the covariance matrix. To avoid those singularities, regularization may be applied. A regularized trade-off between LDA and QDA (RDA) was proposed by Friedman (1989). Bensmail and Celeux (1996) extended this algorithm by rewriting the covariance matrix in terms of its eigenvalue decomposition, Σ_k = λ_k D_k A_k D_k^T (Banfield and Raftery, 1993). These regularization schemes address singularity and stability issues, but they do not induce parsimonious models.

In this chapter, we review some techniques to induce sparsity in model-based clustering algorithms. This sparsity refers to the rule that assigns examples to classes: clustering is still performed in the original p-dimensional space, but the decision rule can be expressed with only a few coordinates of this high-dimensional space.

7.2.1 Based on Penalized Likelihood

Penalized log-likelihood maximization is a popular estimation technique for mixture models. It is typically achieved by the EM algorithm, using mixture models for which the allocation of examples is expressed as a simple function of the input features. For example, for Gaussian mixtures with a common covariance matrix, the log-ratio of posterior probabilities is a linear function of x:

log ( p(Y_k = 1 | x) / p(Y_ℓ = 1 | x) ) = x^T Σ^{-1} (μ_k − μ_ℓ) − (1/2) (μ_k + μ_ℓ)^T Σ^{-1} (μ_k − μ_ℓ) + log ( π_k / π_ℓ ) .

In this model, a simple way of introducing sparsity in the discriminant vectors Σ^{-1}(μ_k − μ_ℓ) is to constrain Σ to be diagonal and to favor sparse means μ_k. Indeed, for Gaussian mixtures with a common diagonal covariance matrix, if all means have the same value on dimension j, then variable j is useless for class allocation and can be discarded. The means can be penalized by the L1 norm,

λ ∑_{k=1}^{K} ∑_{j=1}^{p} |μ_kj| ,

as proposed by Pan et al. (2006) and Pan and Shen (2007). Zhou et al. (2009) consider more complex penalties on full covariance matrices:

λ_1 ∑_{k=1}^{K} ∑_{j=1}^{p} |μ_kj| + λ_2 ∑_{k=1}^{K} ∑_{j=1}^{p} ∑_{m=1}^{p} |(Σ_k^{-1})_{jm}| .

In their algorithm, they make use of the graphical Lasso to estimate the covariances. Even if their formulation induces sparsity on the parameters, their combination of L1 penalties does not directly target decision rules based on few variables, and thus does not guarantee parsimonious models.

Guo et al. (2010) propose a variation with a Pairwise Fusion Penalty (PFP):

λ ∑_{j=1}^{p} ∑_{1 ≤ k ≤ k′ ≤ K} |μ_kj − μ_k′j| .

This PFP regularization does not shrink the means to zero, but towards each other. If the jth components of all cluster means are driven to the same value, that variable can be considered as non-informative.

An L1,∞ penalty is used by Wang and Zhu (2008) and Kuan et al. (2010) to penalize the likelihood, encouraging null groups of features:

λ ∑_{j=1}^{p} ‖(μ_1j, μ_2j, …, μ_Kj)‖_∞ .

One group is defined for each variable j, as the set of the jth components of the K means, (μ_1j, …, μ_Kj). The L1,∞ penalty forces zeros at the group level, favoring the removal of the corresponding feature. This method seems to produce parsimonious models and good partitions within a reasonable computing time; in addition, the code is publicly available. Xie et al. (2008b) apply a group-Lasso penalty. Their principle describes a vertical mean grouping (VMG, with the same groups as Xie et al. (2008a)) and a horizontal mean grouping (HMG). VMG achieves real feature selection because it forces null values for the same variable in all cluster means:

λ √K ∑_{j=1}^{p} √( ∑_{k=1}^{K} μ_kj² ) .

The clustering algorithm of VMG differs from ours, but the proposed group penalty is the same; however, no code is available on the authors' website that would allow testing it.

The optimization of a penalized likelihood by means of an EM algorithm can be reformulated by rewriting the maximization expressions of the M-step as a penalized optimal scoring regression. Roth and Lange (2004) implemented it for two-cluster problems, using an L1 penalty to encourage sparsity of the discriminant vector. The generalization from quadratic to non-quadratic penalties is quickly outlined in their work. We extend this work by considering an arbitrary number of clusters and by formalizing the link between penalized optimal scoring and penalized likelihood estimation.

7.2.2 Based on Model Variants

The algorithm proposed by Law et al. (2004) takes a different stance. The authors define feature relevancy in terms of conditional independence: the jth feature is presumed uninformative if its distribution is independent of the class labels. The density is expressed as

f(x_i | φ, π, θ, ν) = ∑_{k=1}^{K} π_k ∏_{j=1}^{p} [f(x_ij | θ_jk)]^{φ_j} [h(x_ij | ν_j)]^{1−φ_j} ,

where f(·|θ_jk) is the distribution of the relevant features and h(·|ν_j) is the distribution of the irrelevant ones. The binary vector φ = (φ_1, φ_2, …, φ_p) represents relevance, with φ_j = 1 if the jth feature is informative and φ_j = 0 otherwise. The saliency of variable j is then formalized as ρ_j = P(φ_j = 1), so all the φ_j are treated as missing variables. Thus, the set of parameters is {π_k, θ_jk, ν_j, ρ_j}; their estimation is done by means of the EM algorithm (Law et al., 2004).

An original and recent technique is the Fisher-EM algorithm proposed by Bouveyron and Brunet (2012b,a). Fisher-EM is a modified version of EM that runs in a latent space. This latent space is defined by an orthogonal projection matrix U ∈ R^{p×(K−1)}, which is updated inside the EM loop with a new step, called the Fisher step (F-step from now on), which maximizes the multi-class Fisher criterion

tr( (U^T Σ_W U)^{-1} U^T Σ_B U ) ,    (7.14)

so as to maximize the separability of the data. The E-step is the standard one, computing the posterior probabilities. Then, the F-step updates the projection matrix that projects the data onto the latent space. Finally, the M-step estimates the parameters by maximizing the conditional expectation of the complete log-likelihood. Those parameters can be rewritten as a function of the projection matrix U and of the model parameters in the latent space, so that the U matrix enters the M-step equations.

To induce feature selection, Bouveyron and Brunet (2012a) suggest three possibilities. The first one computes the best sparse orthogonal approximation Û of the matrix U that maximizes (7.14). This sparse approximation is defined as the solution of

min_{Û ∈ R^{p×(K−1)}}  ‖X_U − XÛ‖²_F + λ ∑_{k=1}^{K−1} ‖û_k‖_1 ,

where X_U = XU is the input data projected onto the non-sparse space and û_k is the kth column vector of the projection matrix Û. The second possibility is inspired by Qiao et al. (2009); it reformulates Fisher's discriminant (7.14), used to compute the projection matrix, as a regression criterion penalized by a mixture of Lasso and Elastic net penalties:

min_{A, B ∈ R^{p×(K−1)}}  ∑_{k=1}^{K} ‖ R_W^{-T} H_{B,k} − A B^T H_{B,k} ‖²_2 + ρ ∑_{j=1}^{K−1} β_j^T Σ_W β_j + λ ∑_{j=1}^{K−1} ‖β_j‖_1

s.t.  A^T A = I_{K−1} ,

where H_B ∈ R^{p×K} is a matrix defined conditionally on the posterior probabilities t_ik, satisfying H_B H_B^T = Σ_B, and H_{B,k} is the kth column of H_B; R_W ∈ R^{p×p} is an upper triangular matrix resulting from the Cholesky decomposition of Σ_W. Σ_W and Σ_B are the p × p within-class and between-class covariance matrices in the observation space. A ∈ R^{p×(K−1)} and B ∈ R^{p×(K−1)} are the solutions of the optimization problem, such that B = [β_1, …, β_{K−1}] is the best sparse approximation of U.

The last possibility computes the solution of Fisher's discriminant (7.14) as the solution of the following constrained optimization problem:

min_{Û ∈ R^{p×(K−1)}}  ∑_{j=1}^{p} ‖ Σ_{B,j} − Û Û^T Σ_{B,j} ‖²_2

s.t.  Û^T Û = I_{K−1} ,

where Σ_{B,j} is the jth column of the between-class covariance matrix in the observation space. This problem can be solved by the penalized version of the singular value decomposition proposed by Witten et al. (2009), resulting in a sparse approximation of U.

To comply with the constraint stating that the columns of U are orthogonal, the first and second options must be followed by a singular value decomposition of Û to restore orthogonality. This is not necessary with the third option, since the penalized version of SVD already guarantees orthogonality.

However, there is a lack of guarantees regarding convergence. Bouveyron states: "the update of the orientation matrix U in the F-step is done by maximizing the Fisher criterion and not by directly maximizing the expected complete log-likelihood as required in the EM algorithm theory. From this point of view, the convergence of the Fisher-EM algorithm cannot therefore be guaranteed." Immediately after this paragraph, we can read that under certain assumptions their algorithm converges: "the model [...] which assumes the equality and the diagonality of covariance matrices, the F-step of the Fisher-EM algorithm satisfies the convergence conditions of the EM algorithm theory and the convergence of the Fisher-EM algorithm can be guaranteed in this case. For the other discriminant latent mixture models, although the convergence of the Fisher-EM procedure cannot be guaranteed, our practical experience has shown that the Fisher-EM algorithm rarely fails to converge with these models if correctly initialized."

7.2.3 Based on Model Selection

Some clustering algorithms recast the feature selection problem as a model selection problem. Following this approach, Raftery and Dean (2006) model the observations as a mixture of Gaussian distributions. To discover a subset of relevant features (and its superfluous complement), they define three subsets of variables:

• X^(1): the set of selected relevant variables;

• X^(2): the set of variables being considered for inclusion into or exclusion from X^(1);

• X^(3): the set of irrelevant variables.

With those subsets, they define two different models, where Y is the partition to consider:

• M1:  f(X | Y) = f(X^(1), X^(2), X^(3) | Y) = f(X^(3) | X^(2), X^(1)) f(X^(2) | X^(1)) f(X^(1) | Y) ;

• M2:  f(X | Y) = f(X^(1), X^(2), X^(3) | Y) = f(X^(3) | X^(2), X^(1)) f(X^(2), X^(1) | Y) .

Model M1 means that the variables in X^(2) are independent of the clustering Y; model M2 states that the variables in X^(2) depend on the clustering Y. To simplify the algorithm, the subset X^(2) is only updated one variable at a time. Therefore, deciding the relevance of a variable in X^(2) amounts to a model selection between M1 and M2. The selection is done via the Bayes factor

B_12 = f(X | M1) / f(X | M2) ,

where the high-dimensional factor f(X^(3) | X^(2), X^(1)) cancels from the ratio:

B_12 = f(X^(1), X^(2), X^(3) | M1) / f(X^(1), X^(2), X^(3) | M2)
     = f(X^(2) | X^(1), M1) f(X^(1) | M1) / f(X^(2), X^(1) | M2) .

This factor is approximated, since the integrated likelihoods f(X^(1) | M1) and f(X^(2), X^(1) | M2) are difficult to calculate exactly; Raftery and Dean (2006) use the BIC approximation. The term f(X^(2) | X^(1), M1), when there is only one variable in X^(2), can be represented as a linear regression of variable X^(2) on the variables in X^(1); there is also a BIC approximation for this term.

Maugis et al (2009a) have proposed a variation of the algorithm developed by Rafteryand Dean They define three subsets of variables the relevant and irrelevant subsets(X(1) and X(3)) remains the same but X(2) is reformulated as a subset of relevantvariables that explains the irrelevance through a multidimensional regression This algo-rithm also uses of a backward stepwise strategy instead of the forward stepwise used byRaftery and Dean (2006) Their algorithm allows to define blocks of indivisible variablesthat in certain situations improve the clustering and its interpretability

Both algorithms are well motivated and appear to produce good results; however, the amount of computation needed to test the different subsets of variables requires a huge computation time. In practice, they cannot be used for the amount of data considered in this thesis.

8 Theoretical Foundations

In this chapter, we develop Mix-GLOSS, which uses the GLOSS algorithm conceived for supervised classification (see Chapter 5) to solve clustering problems. The goal here is similar, that is, providing an assignment of examples to clusters based on few features.

We use a modified version of the EM algorithm whose M-step is formulated as a penalized linear regression of a scaled indicator matrix, that is, a penalized optimal scoring problem. This idea was originally proposed by Hastie and Tibshirani (1996) to build reduced-rank decision rules using fewer than K−1 discriminant directions. Their motivation was mainly driven by stability issues; no sparsity-inducing mechanism was introduced in the construction of the discriminant directions. Roth and Lange (2004) pursued this idea for binary clustering problems, where sparsity was introduced by a Lasso penalty applied to the OS problem. Besides extending the work of Roth and Lange (2004) to an arbitrary number of clusters, we draw links between the OS penalty and the parameters of the Gaussian model.

In the subsequent sections, we provide the principles that allow the M-step to be solved as an optimal scoring problem. The feature selection technique is embedded by means of a group-Lasso penalty. We must then guarantee that the equivalence between the M-step and the OS problem holds for our penalty. As with GLOSS, this is accomplished with a variational approach of the group-Lasso. Finally, some considerations regarding the criterion that is optimized with this modified EM are provided.

8.1 Resolving EM with Optimal Scoring

In the previous chapters, EM was presented as an iterative algorithm that computes a maximum likelihood estimate through the maximization of the expected complete log-likelihood. This section explains how a penalized OS regression embedded into an EM algorithm produces a penalized likelihood estimate.

8.1.1 Relationship Between the M-Step and Linear Discriminant Analysis

LDA is typically used in a supervised learning framework, for classification and dimension reduction. It looks for a projection of the data where the ratio of between-class variance to within-class variance is maximized (see Appendix C). Classification in the LDA domain is based on the Mahalanobis distance

d(x_i, μ_k) = (x_i − μ_k)^T Σ_W^{-1} (x_i − μ_k) ,

where μ_k are the p-dimensional centroids and Σ_W is the p × p common within-class covariance matrix.

The likelihood equations in the M-step, (7.11) and (7.12), can be interpreted as the mean and covariance estimates of a weighted and augmented LDA problem (Hastie and Tibshirani, 1996), where the n observations are replicated K times and weighted by t_ik (the posterior probabilities computed at the E-step).

Having replicated the data vectors, Hastie and Tibshirani (1996) remark that the parameters maximizing the mixture likelihood in the M-step of the EM algorithm, (7.11) and (7.12), can also be defined as the maximizers of the weighted and augmented likelihood

2 l_weight(μ, Σ) = − ∑_{i=1}^{n} ∑_{k=1}^{K} t_ik d(x_i, μ_k) − n log(|Σ_W|) ,

which arises when considering a weighted and augmented LDA problem. This viewpoint provides the basis for an alternative maximization of the penalized likelihood in Gaussian mixtures.

which arises when considering a weighted and augmented LDA problem This viewpointprovides the basis for an alternative maximization of penalized maximum likelihood inGaussian mixtures

8.1.2 Relationship Between Optimal Scoring and Linear Discriminant Analysis

The equivalence between penalized optimal scoring and penalized linear discriminant analysis has already been detailed in Section 4.1, in the supervised learning framework. It is a critical part of the link between the M-step of an EM algorithm and optimal scoring regression.

8.1.3 Clustering Using Penalized Optimal Scoring

The solution of the penalized optimal scoring regression in the M-step is a coefficient matrix B_OS, analytically related to Fisher's discriminant directions B_LDA for the data (X, Y), where Y is the current (hard or soft) cluster assignment. In order to compute the posterior probabilities t_ik in the E-step, the distance between the samples x_i and the centroids μ_k must be evaluated. Depending on whether we are working in the input, OS or LDA domain, different expressions are used for the distances (see Section 4.2.2 for more details). Mix-GLOSS works in the LDA domain, based on the following expression:

d(x_i, μ_k) = ‖(x_i − μ_k) B_LDA‖²_2 − 2 log(π_k) .

This distance defines the computation of the posterior probabilities t_ik in the E-step (see Section 4.2.3). Putting all those elements together, the complete clustering algorithm can be summarized as follows:


1. Initialize the membership matrix Y (for example by the K-means algorithm).

2. Solve the p-OS problem as
   B_OS = (X^T X + λΩ)^{-1} X^T YΘ ,
   where Θ holds the K−1 leading eigenvectors of Y^T X (X^T X + λΩ)^{-1} X^T Y.

3. Map X to the LDA domain: X_LDA = X B_OS D, with D = diag( α_k^{-1} (1 − α_k²)^{-1/2} ).

4. Compute the centroids M in the LDA domain.

5. Evaluate the distances in the LDA domain.

6. Translate distances into posterior probabilities t_ik, with
   t_ik ∝ exp[ −( d(x_i, μ_k) − 2 log(π_k) ) / 2 ] .    (8.1)

7. Update the labels using the posterior probability matrix: Y = T.

8. Go back to step 2 and iterate until the t_ik converge.

Items 2 to 5 can be interpreted as the M-step, and Item 6 as the E-step, in this alternative view of the EM algorithm for Gaussian mixtures.
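In practice, the E-step boils down to a normalized soft-max of the negative distances; a minimal matlab sketch, assuming a matrix D of squared distances d(x_i, μ_k) in the LDA domain and a row vector pik of proportions (illustrative names), could be:

% E-step (8.1): posteriors from distances and proportions.
logT = -0.5 * bsxfun(@minus, D, 2*log(pik));     % n x K unnormalized log-posteriors
logT = bsxfun(@minus, logT, max(logT, [], 2));   % stabilization before exponentiation
T = exp(logT);
T = bsxfun(@rdivide, T, sum(T, 2));              % rows sum to one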

8.1.4 From Sparse Optimal Scoring to Sparse Linear Discriminant Analysis

In the previous section, we sketched a clustering algorithm that replaces the M-step with a penalized OS problem. This modified version of EM holds for any quadratic penalty. We extend this equivalence to sparsity-inducing penalties through the quadratic variational approach to the group-Lasso provided in Section 4.3. We now look for a formal equivalence between this penalty and penalized maximum likelihood for Gaussian mixtures.

8.2 Optimized Criterion

In the classical EM for Gaussian mixtures, the M-step maximizes the weighted likelihood Q(θ, θ′) (7.7), so as to maximize the likelihood L(θ) (see Section 7.1.2). Replacing the M-step by a penalized optimal scoring problem is possible, and the link between the penalized optimal scoring problem and penalized LDA holds, but it remains to relate this penalized LDA problem to a penalized maximum likelihood criterion for the Gaussian mixture.

This penalized likelihood cannot be rigorously interpreted as a maximum a posteriori criterion, in particular because the penalty only operates on the covariance matrix Σ (there is no prior on the means and proportions of the mixture). We however believe that the Bayesian interpretation provides some insight, and we detail it in what follows.

8.2.1 A Bayesian Derivation

This section sketches a Bayesian treatment of inference limited to our present needs, where penalties are interpreted as prior distributions over the parameters of the probabilistic model to be estimated. Further details can be found in Bishop (2006, Section 2.3.6) and in Gelman et al. (2003, Section 3.6).

The model proposed in this thesis considers a classical maximum likelihood estimation for the means, and a penalized common covariance matrix. This penalization can be interpreted as arising from a prior on this parameter.

The prior over the covariance matrix of a Gaussian variable is classically expressed as a Wishart distribution, since it is a conjugate prior:

f(Σ | Λ_0, ν_0) = ( 2^{np/2} |Λ_0|^{n/2} Γ_p(n/2) )^{-1} |Σ^{-1}|^{(ν_0 − p − 1)/2} exp{ −(1/2) tr( Λ_0^{-1} Σ^{-1} ) } ,

where ν_0 is the number of degrees of freedom of the distribution, Λ_0 is a p × p scale matrix, and Γ_p is the multivariate gamma function, defined as

Γ_p(n/2) = π^{p(p−1)/4} ∏_{j=1}^{p} Γ( n/2 + (1−j)/2 ) .

The posterior distribution can be maximized, similarly to the likelihood, through the maximization of

Q(θ, θ′) + log f(Σ | Λ_0, ν_0)
  = ∑_{k=1}^{K} t_k log π_k − ((n+1)p/2) log 2 − (n/2) log |Λ_0| − (p(p+1)/4) log π
    − ∑_{j=1}^{p} log Γ( n/2 + (1−j)/2 ) − ((ν_n − p − 1)/2) log |Σ| − (1/2) tr( Λ_n^{-1} Σ^{-1} )
  ≡ ∑_{k=1}^{K} t_k log π_k − (n/2) log |Λ_0| − ((ν_n − p − 1)/2) log |Σ| − (1/2) tr( Λ_n^{-1} Σ^{-1} ) ,    (8.2)

with

t_k = ∑_{i=1}^{n} t_ik ,
ν_n = ν_0 + n ,
Λ_n^{-1} = Λ_0^{-1} + S_0 ,
S_0 = ∑_{i=1}^{n} ∑_{k=1}^{K} t_ik (x_i − μ_k)(x_i − μ_k)^T .

Details of these calculations can be found in textbooks (for example, Bishop, 2006; Gelman et al., 2003).

8.2.2 Maximum a Posteriori Estimator

The maximization of (8.2) with respect to μ_k and π_k is of course not affected by the additional prior term, where only the covariance Σ intervenes. The MAP estimator for Σ is simply obtained by differentiating (8.2) with respect to Σ. The details of the calculations follow the same lines as the ones for maximum likelihood detailed in Appendix G. The resulting estimator for Σ is

Σ_MAP = ( Λ_0^{-1} + S_0 ) / ( ν_0 + n − p − 1 ) ,    (8.3)

where S_0 is the matrix defined in Equation (8.2). The maximum a posteriori estimator of the within-class covariance matrix (8.3) can thus be identified with the penalized within-class variance (4.19) resulting from the p-OS regression (4.16a), if ν_0 is chosen to be p + 1 and Λ_0^{-1} = λΩ, where Ω is the penalty matrix from the group-Lasso regularization (4.25).

9 Mix-GLOSS Algorithm

Mix-GLOSS is an algorithm for unsupervised classification that embeds feature selection, resulting in parsimonious decision rules. It is based on the GLOSS algorithm developed in Chapter 5, which has been adapted for clustering. In this chapter, I describe the details of the implementation of Mix-GLOSS and of the model selection mechanism.

9.1 Mix-GLOSS

The implementation of Mix-GLOSS involves three nested loops, as schemed in Figure 9.1. The inner one is an EM algorithm that, for a given value of the regularization parameter λ, iterates between an M-step, where the parameters of the model are estimated, and an E-step, where the corresponding posterior probabilities are computed. The main outputs of the EM are the coefficient matrix B, which projects the input data X onto the best subspace (in Fisher's sense), and the posteriors t_ik.

When several values of the penalty parameter are tested we give them to the algorithmin ascending order and the algorithm is initialized by the solution found for the previousλ value This process continues until all the penalty parameter values have been testedif a vector of penalty parameter was provided or until a given sparsity is achieved asmeasured by the number of variables estimated to be relevant

The outer loop implements complete repetitions of the clustering algorithm for all the penalty parameter values, with the purpose of choosing the best execution. This loop alleviates local minima issues by resorting to multiple initializations of the partition.

9.1.1 Outer Loop: Whole Algorithm Repetitions

This loop performs a user-defined number of repetitions of the clustering algorithm. It takes as inputs:

• the centered n × p feature matrix X;

• the vector of penalty parameter values to be tried (an option is to provide an empty vector and let the algorithm set trial values automatically);

• the number of clusters K;

• the maximum number of iterations for the EM algorithm;

• the convergence tolerance for the EM algorithm;

• the number of whole repetitions of the clustering algorithm;

Figure 9.1: Mix-GLOSS loops scheme.

• a p × (K−1) initial coefficient matrix (optional);

• an n × K initial posterior probability matrix (optional).

For each algorithm repetition an initial label matrix Y is needed This matrix maycontain either hard or soft assignments If no such matrix is available K-means is usedto initialize the process If we have an initial guess for the coefficient matrix B it canalso be fed into Mix-GLOSS to warm-start the process

9.1.2 Penalty Parameter Loop

The penalty parameter loop goes through all the values of the input vector λ. These values are sorted in ascending order, so that the resulting B and Y matrices can be used to warm-start the EM loop for the next value of the penalty parameter. If some λ value results in a null coefficient matrix, the algorithm halts. We have verified that the warm-start implemented here reduces the computation time by a factor of 8, with respect to using a null B matrix and a K-means execution for the initial Y label matrix.

Mix-GLOSS may be fed with an empty vector of penalty parameters, in which case a first non-penalized execution of Mix-GLOSS is done, and its resulting coefficient matrix B and posterior matrix Y are used to estimate a trial value of λ that should remove about 10% of the relevant features. This estimation is repeated until a minimum number of relevant variables is reached. The parameter that sets the estimated percentage of variables to be removed with the next penalty parameter can be modified to make feature selection more or less aggressive.

Algorithm 2 details the implementation of the automatic selection of the penalty parameter. If the alternative variational approach from Appendix D is used, we have to replace Equation (4.32b) by (D.10b).

Algorithm 2: Automatic selection of λ

Input: X, K, λ = ∅, minVAR
Initialize: B ← 0, Y ← K-means(X, K)
Run non-penalized Mix-GLOSS:
    λ ← 0
    (B, Y) ← Mix-GLOSS(X, K, B, Y, λ)
lastLAMBDA ← false
repeat
    Estimate λ:
        compute the gradient at β^j = 0,
            ∂J(B)/∂β^j |_{β^j = 0} = x^{j⊤} ( Σ_{m≠j} x^m β^m − YΘ ),
        compute λ^max for every feature using (4.32b),
            λ^max_j = (1/w_j) ‖ ∂J(B)/∂β^j |_{β^j = 0} ‖_2,
        choose λ so as to remove 10% of the relevant features
    Run penalized Mix-GLOSS:
        (B, Y) ← Mix-GLOSS(X, K, B, Y, λ)
    if the number of relevant variables in B > minVAR then
        lastLAMBDA ← false
    else
        lastLAMBDA ← true
    end if
until lastLAMBDA
Output: B, L(θ), t_ik, π_k, μ_k, Σ, Y for every λ in the solution path
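As an illustration of the λ-estimation step, the sketch below computes the per-feature λ^max values from the gradient formula above and returns a trial λ aimed at discarding roughly 10% of the currently relevant features. The weights w_j, the current coefficient matrix B and the scaled label matrix YΘ are taken as given; the way the 10% target is turned into a single λ value is an assumption of this sketch, not the exact Mix-GLOSS rule.

import numpy as np

def lambda_trial(X, Y_theta, B, w, drop_ratio=0.10):
    """Per-feature lambda_max and a trial lambda removing ~drop_ratio of the
    currently relevant features (illustrative names and selection rule)."""
    n, p = X.shape
    lam_max = np.empty(p)
    for j in range(p):
        # partial residual excluding feature j: sum_{m != j} x^m beta^m - Y Theta
        partial = X @ B - np.outer(X[:, j], B[j]) - Y_theta
        grad_j = X[:, j] @ partial                  # gradient of J at beta^j = 0
        lam_max[j] = np.linalg.norm(grad_j) / w[j]  # feature j is active for lambda < lam_max[j]
    relevant = np.linalg.norm(B, axis=1) > 0 if np.any(B) else np.ones(p, dtype=bool)
    lams = np.sort(lam_max[relevant])
    # pick lambda at the lambda_max of the weakest ~10% of the relevant features
    k = max(1, int(np.ceil(drop_ratio * lams.size)))
    return lams[k - 1]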

9.1.3 Inner Loop: EM Algorithm

The inner loop implements the actual clustering algorithm by means of successive maximizations of a penalized likelihood criterion. Once convergence of the posterior probabilities t_ik is achieved, the maximum a posteriori rule is applied to classify all examples. Algorithm 3 describes this inner loop.


Algorithm 3: Mix-GLOSS for one value of λ

Input: X, K, B0, Y0, λ
Initialize:
    if (B0, Y0) available then
        B_OS ← B0, Y ← Y0
    else
        B_OS ← 0, Y ← K-means(X, K)
    end if
    convergenceEM ← false, tolEM ← 1e-3
repeat
    M-step:
        (B_OS, Θ, α) ← GLOSS(X, Y, B_OS, λ)
        X_LDA = X B_OS diag( α^{-1} (1 − α^2)^{-1/2} )
        π_k, μ_k and Σ as per (7.10), (7.11) and (7.12)
    E-step:
        t_ik as per (8.1)
        L(θ) as per (8.2)
    if (1/n) Σ_i |t_ik − y_ik| < tolEM then
        convergenceEM ← true
    end if
    Y ← T
until convergenceEM
Y ← MAP(T)
Output: B_OS, Θ, L(θ), t_ik, π_k, μ_k, Σ, Y


M-Step

The M-step deals with the estimation of the model parameters, that is, the cluster means μ_k, the common covariance matrix Σ and the priors π_k of every component. In a classical M-step this is done explicitly by maximizing the likelihood expression. Here, this maximization is implicitly performed by penalized optimal scoring (see Section 8.1). The core of this step is a GLOSS execution that regresses X on the scaled version of the label matrix, YΘ. For the first iteration of EM, if no initialization is available, Y results from a K-means execution. In subsequent iterations, Y is updated as the posterior probability matrix T resulting from the E-step.

E-Step

The E-step evaluates the posterior probability matrix T using

\[
t_{ik} \propto \exp\!\left[ -\frac{d(x_i, \mu_k) - 2\log(\pi_k)}{2} \right] .
\]

The convergence of these t_ik is used as the stopping criterion for EM.
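A direct transcription of this E-step could look as follows; here d(·,·) is assumed to be the squared Mahalanobis distance associated with the common covariance matrix Σ, which matches the Gaussian model with common covariance but is an assumption on the exact form of d used above.

import numpy as np

def e_step(X, means, Sigma, priors):
    """Posterior probabilities t_ik ∝ exp(-(d(x_i, mu_k) - 2 log pi_k) / 2),
    with d taken as the squared Mahalanobis distance (assumption of this sketch)."""
    n, K = X.shape[0], means.shape[0]
    Sigma_inv = np.linalg.inv(Sigma)
    log_t = np.empty((n, K))
    for k in range(K):
        diff = X - means[k]                                   # n x p
        d = np.einsum('ij,jk,ik->i', diff, Sigma_inv, diff)   # squared Mahalanobis
        log_t[:, k] = -(d - 2.0 * np.log(priors[k])) / 2.0
    log_t -= log_t.max(axis=1, keepdims=True)                 # stabilize the exponentials
    T = np.exp(log_t)
    return T / T.sum(axis=1, keepdims=True)                   # rows sum to one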

9.2 Model Selection

Here, model selection refers to the choice of the penalty parameter. Up to now, we have not conducted experiments where the number of clusters has to be automatically selected.

In a first attempt, we tried a classical structure where clustering was performed several times from different initializations for all penalty parameter values. Then, using the log-likelihood criterion, the best repetition for every value of the penalty parameter was chosen. The definitive λ was selected by means of the stability criterion described by Lange et al. (2002). This algorithm required a lot of computing resources, since the stability selection mechanism needed a certain number of repetitions that transformed Mix-GLOSS into a lengthy structure of four nested loops.

In a second attempt, we replaced the stability-based model selection algorithm by the evaluation of a modified version of BIC (Pan and Shen, 2007). This version of BIC looks like the traditional one (Schwarz, 1978) but takes into consideration the variables that have been removed. This mechanism, even if it turned out to be faster, still required a large computation time.

The third and definitive attempt (up to now) proceeds with several executions of Mix-GLOSS for the non-penalized case (λ = 0). The execution with the best log-likelihood is chosen; the repetitions are only performed for the non-penalized problem. The coefficient matrix B and the posterior matrix T resulting from the best non-penalized execution are used to warm-start a new Mix-GLOSS execution. This second execution of Mix-GLOSS is done using the values of the penalty parameter provided by the user or computed by the automatic selection mechanism. This time, only one repetition of the algorithm is done for every value of the penalty parameter. This version has been tested with no significant differences in the quality of the clustering, but with a dramatic reduction of the computation time. Figure 9.2 summarizes the mechanism that implements the model selection of the penalty parameter λ.

Figure 9.2: Mix-GLOSS model selection diagram (initial non-penalized Mix-GLOSS with 20 repetitions; the best B and T warm-start the penalized runs; BIC is computed for every λ and λ* = argmin_λ BIC is chosen).
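The following sketch illustrates this third strategy. The BIC variant is written in the spirit of Pan and Shen (2007), penalizing only the parameters attached to retained variables; the exact count of free parameters used below, as well as the run_em signature, are assumptions of this sketch rather than the precise choices made in Mix-GLOSS.

import numpy as np

def select_lambda(X, K, lambdas, run_em, n_init=20):
    """Model-selection sketch: several non-penalized runs, keep the best one by
    log-likelihood, then one warm-started penalized run per lambda and pick the
    lambda minimizing a sparsity-aware BIC.  `run_em(X, K, lam, B0, T0)` is
    assumed to return (B, T, loglik)."""
    n, p = X.shape
    runs = [run_em(X, K, 0.0, None, None) for _ in range(n_init)]
    B0, T0, _ = max(runs, key=lambda r: r[2])        # best non-penalized execution
    results = []
    for lam in sorted(lambdas):
        B, T, loglik = run_em(X, K, lam, B0, T0)     # warm start, one repetition only
        q = int(np.sum(np.linalg.norm(B, axis=1) > 0))   # number of retained variables
        dof = (K - 1) + K * q + q * (q + 1) / 2      # one possible parameter count
        bic = -2.0 * loglik + np.log(n) * dof        # BIC restricted to retained variables
        results.append((lam, bic, B, T))
        B0, T0 = B, T                                # warm start for the next lambda
    lam_best, _, B_best, T_best = min(results, key=lambda r: r[1])
    return lam_best, B_best, T_best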


10 Experimental Results

The performance of Mix-GLOSS is measured here with the artificial dataset that has been used in Section 6.

This synthetic database is interesting because it covers four different situations where feature selection can be applied. Basically, it considers four setups with 1200 examples equally distributed between classes. It is a small-sample regime, with p = 500 variables, out of which 100 differ between classes. Independent variables are generated for all simulations except for simulation 2, where they are slightly correlated. In simulations 2 and 3, classes are optimally separated by a single projection of the original variables, while the two other scenarios require three discriminant directions. The Bayes' error was estimated to be respectively 1.7%, 6.7%, 7.3% and 30.0%. The exact description of every setup has already been given in Section 6.3.

In our tests we have reduced the volume of the problem, because with the original size of 1200 samples and 500 dimensions some of the algorithms to be tested took several days (even weeks) to finish. Hence, the definitive database was chosen to maintain approximately the Bayes' error of the original one, but with five times fewer examples and dimensions (n = 240, p = 100). Figure 10.1 has been adapted from Witten and Tibshirani (2011) to the dimensionality of our experiments and allows a better understanding of the different simulations.

The simulation protocol involves 25 repetitions of each setup, generating a different dataset for each repetition. Thus, the results of the tested algorithms are provided as the average value and the standard deviation over the 25 repetitions.

10.1 Tested Clustering Algorithms

This section compares Mix-GLOSS with the following state-of-the-art methods:

• CS general cov. This is a model-based clustering method with unconstrained covariance matrices, based on the regularization of the likelihood function using L1 penalties, followed by a classical EM algorithm. Further details can be found in Zhou et al. (2009). We use the R function available on the website of Wei Pan.

• Fisher EM. This method models and clusters the data in a discriminative and low-dimensional latent subspace (Bouveyron and Brunet, 2012b,a). Feature selection is induced by means of the "sparsification" of the projection matrix (three possibilities are suggested by Bouveyron and Brunet, 2012a). The corresponding R package "Fisher EM" is available from the website of Charles Bouveyron or from the Comprehensive R Archive Network website.


Figure 10.1: Class mean vectors for each artificial simulation.

• SelvarClust/Clustvarsel. Implements a method of variable selection for clustering using Gaussian mixture models, as a modification of the Raftery and Dean (2006) algorithm. SelvarClust (Maugis et al., 2009b) is a software package implemented in C++ that makes use of the clustering library mixmod (Bienarcki et al., 2008). Further information can be found in the related paper by Maugis et al. (2009a). The software can be downloaded from the SelvarClust project homepage; there is a link to the project from Cathy Maugis's website.

After several tests this entrant was discarded, due to the amount of computing time required by the greedy selection technique, which basically involves two executions of a classical clustering algorithm (with mixmod) for every single variable whose inclusion needs to be considered.

The substitute for SelvarClust has been the algorithm that inspired it, that is, the method developed by Raftery and Dean (2006). There is an R package named Clustvarsel that can be downloaded from the website of Nema Dean or from the Comprehensive R Archive Network website.

• LumiWCluster. LumiWCluster is an R package available from the homepage of Pei Fen Kuan. This algorithm is inspired by Wang and Zhu (2008), who propose a penalty for the likelihood that incorporates group information through an L1,∞ mixed norm. In Kuan et al. (2010), they introduce some slight changes in the penalty term, such as weighting parameters, that are particularly important for their dataset. The package LumiWCluster allows performing clustering using the expression from Wang and Zhu (2008) (called LumiWCluster-Wang) or the one from Kuan et al. (2010) (called LumiWCluster-Kuan).

• Mix-GLOSS. This is the clustering algorithm implemented using GLOSS (see Section 9). It makes use of an EM algorithm and of the equivalences between the M-step and an LDA problem, and between a p-LDA problem and a p-OS problem. It penalizes an OS regression with a variational approach of the group-Lasso penalty (see Section 8.1.4) that induces zeros in all discriminant directions for the same variable.

10.2 Results

Table 10.1 shows the results of the experiments for all the algorithms from Section 10.1. The performance measures are:

• Clustering Error (in percentage). To measure the quality of the partition with the a priori knowledge of the real classes, the clustering error is computed as explained in Wu and Scholkopf (2007) (a sketch of such a computation is given after this list). If the obtained partition and the real labeling are the same, then the clustering error is 0%. The way this measure is defined allows reaching the ideal 0% clustering error even if the IDs of the clusters and of the real classes differ.

• Number of Disposed Features. This value shows the number of variables whose coefficients have been zeroed, and which are therefore not used in the partitioning. In our datasets only the first 20 features are relevant for the discrimination; the last 80 variables can be discarded. Hence a good result for the tested algorithms should be around 80.

• Time of execution (in hours, minutes or seconds). Finally, the time needed to execute the 25 repetitions of each simulation setup is also measured. These algorithms tend to be more memory- and CPU-consuming as the number of variables increases. This is one of the reasons why the dimensionality of the original problem was reduced.
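As announced in the first item of the list above, a label-permutation-invariant clustering error can be computed by searching for the cluster-to-class assignment that maximizes the agreement, for instance with the Hungarian algorithm. The sketch below is one such implementation, not necessarily the exact measure of Wu and Scholkopf (2007).

import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_error(y_true, y_pred):
    """Fraction of misclassified points under the best one-to-one matching
    between cluster IDs and class IDs (invariant to cluster relabeling)."""
    classes = np.unique(y_true)
    clusters = np.unique(y_pred)
    # contingency table: rows = clusters, columns = classes
    C = np.array([[np.sum((y_pred == cl) & (y_true == c)) for c in classes]
                  for cl in clusters])
    row, col = linear_sum_assignment(-C)          # maximize the matched counts
    return 1.0 - C[row, col].sum() / y_true.size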

The adequacy of the selected features was assessed by the True Positive Rate (TPR) and the False Positive Rate (FPR). The TPR is the proportion of the truly relevant variables that are selected; similarly, the FPR is the proportion of the irrelevant variables that are (wrongly) selected. The best algorithm would be the one that selects all the relevant variables and rejects all the others, that is, TPR = 1 and FPR = 0 simultaneously. In order to avoid cluttered results, we compare TPR and FPR for the four simulations but only for three algorithms: CS general cov and Clustvarsel were discarded due to their high computing time and clustering error, respectively, and since the two versions of LumiWCluster provide almost the same TPR and FPR, only one is displayed. The three remaining algorithms are Fisher EM by Bouveyron and Brunet (2012a), the version of LumiWCluster by Kuan et al. (2010), and Mix-GLOSS.
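With a boolean mask of selected variables and the convention of this chapter (the first 20 variables are the relevant ones), these two rates can be computed as follows; the helper below is only an illustration of the definitions.

import numpy as np

def tpr_fpr(selected, n_relevant=20):
    """TPR and FPR of variable selection: `selected` is a boolean mask of
    length p, the first `n_relevant` variables being the relevant ones."""
    selected = np.asarray(selected, dtype=bool)
    relevant = np.zeros(selected.size, dtype=bool)
    relevant[:n_relevant] = True
    tpr = selected[relevant].mean()               # selected among the relevant variables
    fpr = selected[~relevant].mean()              # selected among the irrelevant variables
    return tpr, fpr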

Results, in percentages, are displayed in Figure 10.2 (or in Table 10.2).


Table 10.1: Experimental results for simulated data.

Sim 1: K = 4, mean shift, ind. features
    Algorithm            Err. (%)       Var.           Time
    CS general cov       4.6  (1.5)     98.5 (7.2)     88.4 h
    Fisher EM            5.8  (8.7)     78.4 (5.2)     16.45 m
    Clustvarsel          60.2 (10.7)    37.8 (29.1)    38.3 h
    LumiWCluster-Kuan    4.2  (6.8)     77.9 (4)       38.9 s
    LumiWCluster-Wang    4.3  (6.9)     78.4 (3.9)     61.9 s
    Mix-GLOSS            3.2  (1.6)     80 (0.9)       1.5 h

Sim 2: K = 2, mean shift, dependent features
    Algorithm            Err. (%)       Var.           Time
    CS general cov       15.4 (2)       99.7 (0.9)     78.3 h
    Fisher EM            7.4  (2.3)     80.9 (2.8)     8 m
    Clustvarsel          7.3  (2)       33.4 (20.7)    16.6 h
    LumiWCluster-Kuan    6.4  (1.8)     79.8 (0.4)     15.5 s
    LumiWCluster-Wang    6.3  (1.7)     79.9 (0.3)     14 s
    Mix-GLOSS            7.7  (2)       84.1 (3.4)     2 h

Sim 3: K = 4, 1D mean shift, ind. features
    Algorithm            Err. (%)       Var.           Time
    CS general cov       30.4 (5.7)     55 (46.8)      131.7 h
    Fisher EM            23.3 (6.5)     36.6 (5.5)     22 m
    Clustvarsel          65.8 (11.5)    23.2 (29.1)    54.2 h
    LumiWCluster-Kuan    32.3 (2.1)     80 (0.2)       8.3 s
    LumiWCluster-Wang    30.8 (3.6)     80 (0.2)       129.2 s
    Mix-GLOSS            34.7 (9.2)     81 (8.8)       2.1 h

Sim 4: K = 4, mean shift, ind. features
    Algorithm            Err. (%)       Var.           Time
    CS general cov       62.6 (5.5)     99.9 (0.2)     11.2 h
    Fisher EM            56.7 (10.4)    55 (4.8)       19.5 m
    Clustvarsel          73.2 (4)       24 (12)        76.7 h
    LumiWCluster-Kuan    69.2 (11.2)    99 (2)         87.6 s
    LumiWCluster-Wang    69.7 (11.9)    99.1 (2.1)     82.5 s
    Mix-GLOSS            66.9 (9.1)     97.5 (1.2)     1.1 h

Table 10.2: TPR versus FPR (in %), average computed over 25 repetitions for the best performing algorithms.

                Simulation 1       Simulation 2       Simulation 3       Simulation 4
                TPR      FPR       TPR      FPR       TPR      FPR       TPR      FPR
MIX-GLOSS       99.2     0.15      82.8     3.35      88.4     6.7       78.0     1.2
LUMI-KUAN       99.2     2.8       100.0    0.2       100.0    0.05      50       0.05
FISHER-EM       98.6     2.4       88.8     1.7       83.8     58.25     62.0     40.75


Figure 10.2: TPR versus FPR (in %) for the best performing algorithms and simulations (one point per algorithm, MIX-GLOSS, LUMI-KUAN and FISHER-EM, and per simulation).

10.3 Discussion

After reviewing Tables 10.1–10.2 and Figure 10.2, we see that there is no definitive winner in all situations regarding all criteria. According to the objectives and constraints of the problem, the following observations deserve to be highlighted.

LumiWCluster (Wang and Zhu, 2008; Kuan et al., 2010) is by far the fastest kind of method, with good behavior regarding the other performance criteria. At the other end of this criterion, CS general cov is extremely slow, and Clustvarsel, though twice as fast, also takes very long to produce an output. Of course, the speed criterion does not say much by itself: the implementations use different programming languages and different stopping criteria, and we do not know what effort has been spent on implementation. That being said, the slowest algorithms are not the most precise ones, so their long computation time is worth mentioning here.

The quality of the partition varies depending on the simulation and the algorithm: Mix-GLOSS has a small edge in Simulation 1, LumiWCluster (Zhou et al., 2009) performs better in Simulation 2, while Fisher EM (Bouveyron and Brunet, 2012a) does slightly better in Simulations 3 and 4.

From the feature selection point of view, LumiWCluster (Kuan et al., 2010) and Mix-GLOSS succeed in removing irrelevant variables in all the situations, while Fisher EM (Bouveyron and Brunet, 2012a) and Mix-GLOSS discover the relevant ones. Mix-GLOSS consistently performs best or close to the best in terms of fall-out and recall.


Conclusions


Summary

The linear regression of scaled indicator matrices, or optimal scoring, is a versatile technique with applicability in many fields of the machine learning domain. By means of regularization, an optimal scoring regression can be strengthened to be more robust, avoid overfitting, counteract ill-posed problems, or remove correlated or noisy variables.

In this thesis we have proved the utility of penalized optimal scoring in the fields of multi-class linear discrimination and clustering.

The equivalence between LDA and OS problems allows taking advantage of all the resources available for solving regression problems when solving linear discrimination problems. In their penalized versions, this equivalence holds under certain conditions that have not always been obeyed when OS has been used to solve LDA problems.

In Part II, we have used a variational approach to the group-Lasso penalty to preserve this equivalence, granting the use of penalized optimal scoring regressions for the solution of linear discrimination problems. This theory has been verified with the implementation of our Group-Lasso Optimal Scoring Solver algorithm (GLOSS), which has proved its effectiveness, inducing extremely parsimonious models without renouncing any predictive capability. GLOSS has been tested with four artificial and three real datasets, outperforming other state-of-the-art algorithms in almost all situations.

In Part III, this theory has been adapted, by means of an EM algorithm, to the unsupervised domain. As for the supervised case, the theory must guarantee the equivalence between penalized LDA and penalized OS. The difficulty of this method resides in the computation of the criterion maximized at every iteration of the EM loop, which is typically used to detect the convergence of the algorithm and to implement model selection of the penalty parameter. Also in this case, the theory has been put into practice with the implementation of Mix-GLOSS. So far, due to time constraints, only artificial datasets have been tested, with positive results.

Perspectives

Even if the preliminary results are encouraging, Mix-GLOSS has not been sufficiently tested. We have planned to test it at least with the same real datasets that we used with GLOSS; however, more testing would be recommended in both cases. These algorithms are well suited for genomic data, where the number of samples is smaller than the number of variables, but other high-dimensional low-sample-size (HDLSS) domains are also possible: identification of male or female silhouettes, of fungal species or of fish species based on shape and texture (Clemmensen et al., 2011), or the Stirling faces (Roth and Lange, 2004), are only some examples. Moreover, we are not constrained to the HDLSS domain: the USPS handwritten digits database (Roth and Lange, 2004), or the well-known Fisher's Iris dataset and six other UCI datasets (Bouveyron and Brunet, 2012a), have also been tested in the bibliography.

At the programming level, both codes must be revisited to improve their robustness and optimize their computation, because during the prototyping phase the priority was achieving functional code. An old version of GLOSS, numerically more stable but less efficient, has been made available to the public. A better suited and documented version should be made available for GLOSS and Mix-GLOSS in the short term.

The theory developed in this thesis and the programming structure used for its implementation allow easy alterations of the algorithm by modifying the within-class covariance matrix. Diagonal versions of the model can be obtained by discarding all the elements but the diagonal of the covariance matrix; spherical models could also be implemented easily. Prior information concerning the correlation between features can be included by adding a quadratic penalty term, such as a Laplacian that describes the relationships between variables. This can be used to implement pair-wise penalties when the dataset is formed by pixels. Quadratic penalty matrices can also be added to the within-class covariance to implement Elastic-net-like penalties. Some of those possibilities have been partially implemented, such as the diagonal version of GLOSS; however, they have not been properly tested or even updated with the last algorithmic modifications. Their equivalents for the unsupervised domain have not yet been proposed, due to the time deadlines for the publication of this thesis.

From the point of view of the supporting theory, we did not succeed in finding the exact criterion that is maximized in Mix-GLOSS. We believe it must be a kind of penalized, or even hyper-penalized, likelihood, but we decided to prioritize the experimental results due to the time constraints. Not knowing this criterion does not prevent successful simulations of Mix-GLOSS: other mechanisms that do not involve the computation of the real criterion have been used for stopping the EM algorithm and for model selection. However, further investigation must be done in this direction to assess the convergence properties of this algorithm.

At the beginning of this thesis, even if the work finally took the direction of feature selection, a considerable effort was devoted to outlier detection and block clustering. One of the most successful mechanisms for the detection of outliers consists in modelling the population with a mixture model where the outliers are described by a uniform distribution. This technique does not need any prior knowledge about the number or the percentage of outliers. As the basic model of this thesis is a mixture of Gaussians, our impression is that it should not be difficult to introduce a new uniform component to gather all those points that do not fit the Gaussian mixture. On the other hand, the application of penalized optimal scoring to block clustering looks more complex; but as block clustering is typically defined as a mixture model whose parameters are estimated by means of an EM algorithm, it could be possible to re-interpret that estimation using a penalized optimal scoring regression.


Appendix


A Matrix Properties

Property 1. By definition, Σ_W and Σ_B are both symmetric matrices:
\[
\Sigma_W = \frac{1}{n} \sum_{k=1}^{g} \sum_{i \in C_k} (x_i - \mu_k)(x_i - \mu_k)^\top , \qquad
\Sigma_B = \frac{1}{n} \sum_{k=1}^{g} n_k (\mu_k - \bar{x})(\mu_k - \bar{x})^\top .
\]

Property 2. \( \partial x^\top a / \partial x = \partial a^\top x / \partial x = a \).

Property 3. \( \partial x^\top A x / \partial x = (A + A^\top) x \).

Property 4. \( \partial |X^{-1}| / \partial X = -|X^{-1}| (X^{-1})^\top \).

Property 5. \( \partial a^\top X b / \partial X = a b^\top \).

Property 6. \( \dfrac{\partial}{\partial X} \operatorname{tr}\!\left( A X^{-1} B \right) = -(X^{-1} B A X^{-1})^\top = -X^{-\top} A^\top B^\top X^{-\top} \).


B The Penalized-OS Problem is an Eigenvector Problem

In this appendix we answer the question of why the solution of a penalized optimal scoring regression involves the computation of an eigenvector decomposition. The p-OS problem has this form:
\[
\min_{\theta_k, \beta_k} \; \|Y\theta_k - X\beta_k\|_2^2 + \beta_k^\top \Omega_k \beta_k \tag{B.1}
\]
\[
\text{s.t.} \quad \theta_k^\top Y^\top Y \theta_k = 1 , \qquad \theta_\ell^\top Y^\top Y \theta_k = 0 \;\; \forall \ell < k ,
\]
for k = 1, ..., K − 1.

The Lagrangian associated to Problem (B.1) is
\[
L_k(\theta_k, \beta_k, \lambda_k, \nu_k) = \|Y\theta_k - X\beta_k\|_2^2 + \beta_k^\top \Omega_k \beta_k
+ \lambda_k \big( \theta_k^\top Y^\top Y \theta_k - 1 \big) + \sum_{\ell < k} \nu_\ell \, \theta_\ell^\top Y^\top Y \theta_k . \tag{B.2}
\]

Setting the gradient of (B.2) with respect to β_k to zero gives the value of the optimal β_k:
\[
\beta_k^\star = (X^\top X + \Omega_k)^{-1} X^\top Y \theta_k . \tag{B.3}
\]
The objective function of (B.1) evaluated at β_k^⋆ is
\[
\min_{\theta_k} \; \|Y\theta_k - X\beta_k^\star\|_2^2 + \beta_k^{\star\top} \Omega_k \beta_k^\star
= \min_{\theta_k} \; \theta_k^\top Y^\top \big( I - X(X^\top X + \Omega_k)^{-1} X^\top \big) Y \theta_k
= \max_{\theta_k} \; \theta_k^\top Y^\top X (X^\top X + \Omega_k)^{-1} X^\top Y \theta_k . \tag{B.4}
\]
If the penalty matrix Ω_k is identical for all problems, Ω_k = Ω, then (B.4) corresponds to an eigen-problem where the k score vectors θ_k are the eigenvectors of Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Y.

B.1 How to Solve the Eigenvector Decomposition

Computing an eigen-decomposition of an expression like Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Y is not trivial because of the p × p inverse. With some datasets p can be extremely large, making this inverse intractable. In this section we show how to circumvent this issue by solving an easier eigenvector decomposition.


Let M be the matrix Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Y, such that we can rewrite expression (B.4) in a compact way:
\[
\max_{\Theta \in \mathbb{R}^{K \times (K-1)}} \; \operatorname{tr}\!\left( \Theta^\top M \Theta \right) \tag{B.5}
\]
\[
\text{s.t.} \quad \Theta^\top Y^\top Y \Theta = I_{K-1} .
\]

MΘv = λv (B6)

where v is the eigenvector and λ the associated eigenvalue of MΘ Operating

vgtMΘv = λhArr vgtΘgtMΘv = λ

Making the variable change w = Θv we obtain an alternative eigenproblem where ware the eigenvectors of M and λ the associated eigenvalue

wgtMw = λ (B7)

Therefore v are the eigenvectors of the eigen-decomposition of matrix MΘ and w arethe eigenvectors of the eigen-decomposition of matrix M Note that the only differencebetween the K minus 1 times K minus 1 matrix MΘ and the K times K matrix M is the K times K minus 1matrix Θ in expression MΘ = ΘgtMΘ Then to avoid the computation of the p times pinverse (XgtX+Ω)minus1 we can use the optimal value of the coefficient matrix B = (XgtX+Ω)minus1XgtYΘ into MΘ

MΘ = ΘgtYgtX(XgtX + Ω)minus1XgtYΘ

= ΘgtYgtXB

Thus the eigen-decomposition of the (K minus 1) times (K minus 1) matrix MΘ = ΘgtYgtXB results in the v eigenvectors of (B6) To obtain the w eigenvectors of the alternativeformulation (B7) the variable change w = Θv needs to be undone

To summarize, we compute the v eigenvectors from the eigen-decomposition of the tractable matrix M_Θ, evaluated as Θ^⊤Y^⊤XB^⋆. Then the definitive eigenvectors w are recovered by undoing the change of variable, w = Θv. The final step is the reconstruction of the optimal score matrix Θ using the vectors w as its columns. At this point we understand what in the literature is called "updating the initial score matrix": multiplying the initial Θ by the matrix of eigenvectors V from decomposition (B.6) reverses the change of variable and restores the w vectors. The B^⋆ matrix also needs to be "updated", by multiplying B^⋆ by the same matrix of eigenvectors V, in order to account for the initial Θ matrix used in the first computation of B^⋆:
\[
B^\star = (X^\top X + \Omega)^{-1} X^\top Y \Theta V = B V .
\]
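The following numerical sketch illustrates this shortcut on random data: it builds M_Θ = Θ^⊤Y^⊤XB from an arbitrary initial score matrix satisfying the constraint of (B.5), diagonalizes it, updates Θ and B, and checks that the updated scores diagonalize the full matrix M. Dimensions and the penalty Ω are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
n, p, K = 60, 10, 3
X = rng.standard_normal((n, p))
labels = np.arange(n) % K                          # balanced classes, for the example
Y = np.eye(K)[labels]                              # n x K indicator matrix
Omega = np.eye(p)                                  # arbitrary quadratic penalty

# initial score matrix satisfying Theta0' Y'Y Theta0 = I_{K-1}
L = np.linalg.cholesky(Y.T @ Y)
Q, _ = np.linalg.qr(rng.standard_normal((K, K - 1)))
Theta0 = np.linalg.solve(L.T, Q)

# regression coefficients for the initial scores, then the small matrix M_Theta
B0 = np.linalg.solve(X.T @ X + Omega, X.T @ Y @ Theta0)    # p x (K-1)
M_Theta = Theta0.T @ Y.T @ X @ B0                          # (K-1) x (K-1)

eigval, V = np.linalg.eigh(M_Theta)                # small eigen-decomposition
Theta = Theta0 @ V                                 # "updated" score matrix
B = B0 @ V                                         # coefficients updated accordingly

# check against the K x K matrix M = Y'X (X'X + Omega)^{-1} X'Y
M = Y.T @ X @ np.linalg.solve(X.T @ X + Omega, X.T @ Y)
print(np.allclose(Theta.T @ M @ Theta, np.diag(eigval)))   # True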


B.2 Why the OS Problem is Solved as an Eigenvector Problem

In the optimal scoring literature, the score matrix Θ that optimizes Problem (B.1) is obtained by means of an eigenvector decomposition of the matrix M = Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Y.

By definition of the eigen-decomposition, the eigenvectors of the matrix M (called w in (B.7)) form a basis, so that any score vector θ can be expressed as a linear combination of them:
\[
\theta_k = \sum_{m=1}^{K-1} \alpha_m w_m \quad \text{s.t.} \quad \theta_k^\top \theta_k = 1 . \tag{B.8}
\]
The score vectors' orthogonality constraint θ_k^⊤θ_k = 1 can also be expressed as a function of this basis,
\[
\Big( \sum_{m=1}^{K-1} \alpha_m w_m \Big)^{\!\top} \Big( \sum_{m=1}^{K-1} \alpha_m w_m \Big) = 1 ,
\]
which, as per the eigenvector properties, can be reduced to
\[
\sum_{m=1}^{K-1} \alpha_m^2 = 1 . \tag{B.9}
\]

Let M be multiplied by a score vector θ_k, which can be replaced by its linear combination of eigenvectors w_m (B.8):
\[
M \theta_k = M \sum_{m=1}^{K-1} \alpha_m w_m = \sum_{m=1}^{K-1} \alpha_m M w_m .
\]
As the w_m are the eigenvectors of the matrix M, the relationship M w_m = λ_m w_m can be used to obtain
\[
M \theta_k = \sum_{m=1}^{K-1} \alpha_m \lambda_m w_m .
\]
Multiplying on the left by θ_k^⊤, written as its corresponding linear combination of eigenvectors,
\[
\theta_k^\top M \theta_k = \Big( \sum_{\ell=1}^{K-1} \alpha_\ell w_\ell \Big)^{\!\top} \Big( \sum_{m=1}^{K-1} \alpha_m \lambda_m w_m \Big) .
\]
This equation can be simplified using the orthogonality property of eigenvectors, according to which w_ℓ^⊤w_m is zero for any ℓ ≠ m, giving
\[
\theta_k^\top M \theta_k = \sum_{m=1}^{K-1} \alpha_m^2 \lambda_m .
\]


The optimization Problem (B.5), for discriminant direction k, can be rewritten as
\[
\max_{\theta_k \in \mathbb{R}^{K \times 1}} \; \theta_k^\top M \theta_k
= \max_{\theta_k \in \mathbb{R}^{K \times 1}} \; \sum_{m=1}^{K-1} \alpha_m^2 \lambda_m , \tag{B.10}
\]
\[
\text{with} \quad \theta_k = \sum_{m=1}^{K-1} \alpha_m w_m \quad \text{and} \quad \sum_{m=1}^{K-1} \alpha_m^2 = 1 .
\]
One way of maximizing Problem (B.10) is choosing α_m = 1 for m = k and α_m = 0 otherwise. Hence, as θ_k = Σ_{m=1}^{K-1} α_m w_m, the resulting score vector θ_k will be equal to the k-th eigenvector w_k.

As a summary, it can be concluded that the solution to the original problem (B.1) can be achieved by an eigenvector decomposition of the matrix M = Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Y.


C Solving Fisher's Discriminant Problem

The classical Fisher's discriminant problem seeks a projection that best separates the class centers while every class remains compact. This is formalized as looking for a projection such that the projected data has maximal between-class variance under a unitary constraint on the within-class variance:
\[
\max_{\beta \in \mathbb{R}^p} \; \beta^\top \Sigma_B \beta \tag{C.1a}
\]
\[
\text{s.t.} \quad \beta^\top \Sigma_W \beta = 1 , \tag{C.1b}
\]
where Σ_B and Σ_W are respectively the between-class variance and the within-class variance of the original p-dimensional data.

The Lagrangian of Problem (C.1) is
\[
L(\beta, \nu) = \beta^\top \Sigma_B \beta - \nu \big( \beta^\top \Sigma_W \beta - 1 \big) ,
\]
so that its first derivative with respect to β is
\[
\frac{\partial L(\beta, \nu)}{\partial \beta} = 2 \Sigma_B \beta - 2 \nu \Sigma_W \beta .
\]
A necessary optimality condition for β^⋆ is that this derivative is zero, that is,
\[
\Sigma_B \beta^\star = \nu \Sigma_W \beta^\star .
\]
Provided Σ_W is full rank, we have
\[
\Sigma_W^{-1} \Sigma_B \beta^\star = \nu \beta^\star . \tag{C.2}
\]

Thus the solutions β^⋆ match the definition of an eigenvector of the matrix Σ_W^{-1}Σ_B, with eigenvalue ν. To characterize this eigenvalue, we note that the objective function (C.1a) can be expressed as follows:
\[
\beta^{\star\top} \Sigma_B \beta^\star = \beta^{\star\top} \Sigma_W \Sigma_W^{-1} \Sigma_B \beta^\star
\]
\[
= \nu \, \beta^{\star\top} \Sigma_W \beta^\star \qquad \text{from (C.2)}
\]
\[
= \nu \qquad \text{from (C.1b)} .
\]
That is, the optimal value of the objective function to be maximized is the eigenvalue ν. Hence ν is the largest eigenvalue of Σ_W^{-1}Σ_B, and β^⋆ is any eigenvector corresponding to this maximal eigenvalue.
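Numerically, Problem (C.1) is a generalized symmetric eigenproblem and can be solved directly as such. The sketch below uses scipy for the generalized eigen-decomposition, with the class-conditional estimates of Σ_W and Σ_B computed as in Property 1 of Appendix A; it is an illustration, not part of the thesis software.

import numpy as np
from scipy.linalg import eigh

def fisher_direction(X, y):
    """Leading discriminant direction beta solving Sigma_B beta = nu Sigma_W beta,
    i.e. the eigenvector of Sigma_W^{-1} Sigma_B with the largest eigenvalue."""
    n, p = X.shape
    xbar = X.mean(axis=0)
    Sw = np.zeros((p, p))
    Sb = np.zeros((p, p))
    for k in np.unique(y):
        Xk = X[y == k]
        mk = Xk.mean(axis=0)
        Sw += (Xk - mk).T @ (Xk - mk) / n
        Sb += Xk.shape[0] * np.outer(mk - xbar, mk - xbar) / n
    # generalized symmetric eigenproblem Sb v = nu Sw v (Sw assumed full rank)
    nu, V = eigh(Sb, Sw)
    beta = V[:, -1]                                # eigenvector of the largest eigenvalue
    return beta / np.sqrt(beta @ Sw @ beta)        # enforce beta' Sigma_W beta = 1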


D Alternative Variational Formulation for the Group-Lasso

In this appendix, an alternative to the variational form of the group-Lasso (4.21) presented in Section 4.3.1 is proposed:

\[
\min_{\tau \in \mathbb{R}^p} \; \min_{B \in \mathbb{R}^{p \times (K-1)}} \; J(B) + \lambda \sum_{j=1}^{p} \frac{w_j^2 \, \|\beta^j\|_2^2}{\tau_j} \tag{D.1a}
\]
\[
\text{s.t.} \quad \sum_{j=1}^{p} \tau_j = 1 , \tag{D.1b}
\]
\[
\qquad \tau_j \ge 0 , \;\; j = 1, \dots, p . \tag{D.1c}
\]

Following the approach detailed in Section 4.3.1, its equivalence with the standard group-Lasso formulation is demonstrated here. Let B ∈ R^{p×(K−1)} be a matrix composed of row vectors β^j ∈ R^{K−1}, B = (β^{1⊤}, ..., β^{p⊤})^⊤:
\[
L(B, \tau, \lambda, \nu_0, \nu) = J(B) + \lambda \sum_{j=1}^{p} \frac{w_j^2 \, \|\beta^j\|_2^2}{\tau_j}
+ \nu_0 \Big( \sum_{j=1}^{p} \tau_j - 1 \Big) - \sum_{j=1}^{p} \nu_j \tau_j . \tag{D.2}
\]

The starting point is the Lagrangian (D.2), which is differentiated with respect to τ_j to get the optimal value τ_j^⋆:
\[
\left. \frac{\partial L(B, \tau, \lambda, \nu_0, \nu)}{\partial \tau_j} \right|_{\tau_j = \tau_j^\star} = 0
\;\Rightarrow\; -\lambda w_j^2 \frac{\|\beta^j\|_2^2}{\tau_j^{\star 2}} + \nu_0 - \nu_j = 0
\]
\[
\Rightarrow\; -\lambda w_j^2 \|\beta^j\|_2^2 + \nu_0 \tau_j^{\star 2} - \nu_j \tau_j^{\star 2} = 0
\]
\[
\Rightarrow\; -\lambda w_j^2 \|\beta^j\|_2^2 + \nu_0 \tau_j^{\star 2} = 0 .
\]
The last two expressions are related through a property of the Lagrange multipliers which states that ν_j g_j(τ^⋆) = 0, where ν_j is the Lagrange multiplier and g_j(τ) is the inequality Lagrange condition. Then the optimal τ_j^⋆ can be deduced:
\[
\tau_j^\star = \sqrt{\frac{\lambda}{\nu_0}} \, w_j \|\beta^j\|_2 .
\]

Placing this optimal value of τ_j^⋆ into constraint (D.1b),
\[
\sum_{j=1}^{p} \tau_j = 1 \;\Rightarrow\; \tau_j^\star = \frac{w_j \|\beta^j\|_2}{\sum_{j'=1}^{p} w_{j'} \|\beta^{j'}\|_2} . \tag{D.3}
\]


With this value of τ_j^⋆, Problem (D.1) is equivalent to
\[
\min_{B \in \mathbb{R}^{p \times (K-1)}} \; J(B) + \lambda \Big( \sum_{j=1}^{p} w_j \|\beta^j\|_2 \Big)^{\!2} . \tag{D.4}
\]

This problem is a slight alteration of the standard group-Lasso, as the penalty is squared compared to the usual form. This square only affects the strength of the penalty, and the usual properties of the group-Lasso apply to the solution of Problem (D.4). In particular, its solution is expected to be sparse, with some null vectors β^j.

The penalty term of (D.1a) can be conveniently presented as λB^⊤ΩB, where
\[
\Omega = \operatorname{diag}\!\left( \frac{w_1^2}{\tau_1}, \frac{w_2^2}{\tau_2}, \dots, \frac{w_p^2}{\tau_p} \right) . \tag{D.5}
\]
Using the value of τ_j^⋆ from (D.3), each diagonal component of Ω is
\[
(\Omega)_{jj} = \frac{w_j \sum_{j'=1}^{p} w_{j'} \|\beta^{j'}\|_2}{\|\beta^j\|_2} . \tag{D.6}
\]

In the following paragraphs, the optimality conditions and properties developed for the quadratic variational approach detailed in Section 4.3.1 are also derived for this alternative formulation.

D.1 Useful Properties

Lemma D.1. If J is convex, Problem (D.1) is convex.

In what follows, J will be a convex quadratic (hence smooth) function, in which case a necessary and sufficient optimality condition is that zero belongs to the subdifferential of the objective function, whose expression is provided in the following lemma.

Lemma D.2. For all B ∈ R^{p×(K−1)}, the subdifferential of the objective function of Problem (D.4) is
\[
\left\{ V \in \mathbb{R}^{p \times (K-1)} : \; V = \frac{\partial J(B)}{\partial B}
+ 2\lambda \Big( \sum_{j'=1}^{p} w_{j'} \|\beta^{j'}\|_2 \Big) G \right\} , \tag{D.7}
\]
where G = (g^{1⊤}, ..., g^{p⊤})^⊤ is a p × (K−1) matrix defined as follows. Let S(B) denote the row-wise support of B, S(B) = { j ∈ {1, ..., p} : ‖β^j‖_2 ≠ 0 }; then we have
\[
\forall j \in S(B) , \quad g^j = w_j \|\beta^j\|_2^{-1} \beta^j , \tag{D.8}
\]
\[
\forall j \notin S(B) , \quad \|g^j\|_2 \le w_j . \tag{D.9}
\]


This condition results in an equality for the "active" non-zero vectors β^j and an inequality for the other ones, which both provide essential building blocks of our algorithm.

Lemma D.3. Problem (D.4) admits at least one solution, which is unique if J(B) is strictly convex. All critical points B^⋆ of the objective function verifying the following conditions are global minima. Let S(B^⋆) denote the row-wise support of B^⋆, S(B^⋆) = { j ∈ {1, ..., p} : ‖β^{⋆j}‖_2 ≠ 0 }, and let S̄(B^⋆) be its complement; then we have
\[
\forall j \in S(B^\star) , \quad -\frac{\partial J(B^\star)}{\partial \beta^j}
= 2\lambda \Big( \sum_{j'=1}^{p} w_{j'} \|\beta^{\star j'}\|_2 \Big) \, w_j \|\beta^{\star j}\|_2^{-1} \beta^{\star j} , \tag{D.10a}
\]
\[
\forall j \in \bar{S}(B^\star) , \quad \left\| \frac{\partial J(B^\star)}{\partial \beta^j} \right\|_2
\le 2\lambda \, w_j \Big( \sum_{j'=1}^{p} w_{j'} \|\beta^{\star j'}\|_2 \Big) . \tag{D.10b}
\]

In particular, Lemma D.3 provides a well-defined characterization of the support of the solution, which is not easily handled from the direct analysis of the variational problem (D.1).

D.2 An Upper Bound on the Objective Function

Lemma D.4. The objective function of the variational form (D.1) is an upper bound on the group-Lasso objective function (D.4) and, for a given B, the gap between these objectives is null at τ^⋆ such that
\[
\tau_j^\star = \frac{w_j \|\beta^j\|_2}{\sum_{j'=1}^{p} w_{j'} \|\beta^{j'}\|_2} .
\]

Proof. The objective functions of (D.1) and (D.4) only differ in their second term. Let τ ∈ R^p be any feasible vector; we have
\[
\Big( \sum_{j=1}^{p} w_j \|\beta^j\|_2 \Big)^{\!2}
= \Big( \sum_{j=1}^{p} \tau_j^{1/2} \, \frac{w_j \|\beta^j\|_2}{\tau_j^{1/2}} \Big)^{\!2}
\le \Big( \sum_{j=1}^{p} \tau_j \Big) \Big( \sum_{j=1}^{p} \frac{w_j^2 \|\beta^j\|_2^2}{\tau_j} \Big)
\le \sum_{j=1}^{p} \frac{w_j^2 \|\beta^j\|_2^2}{\tau_j} ,
\]
where we used the Cauchy–Schwarz inequality for the first inequality and the definition of the feasibility set of τ for the second one.


This lemma only holds for the alternative variational formulation described in this appendix. It is difficult to obtain the same result for the first variational form (Section 4.3.1), because the definitions of the feasible sets of τ and β are intertwined.


E Invariance of the Group-Lasso to Unitary Transformations

The computational trick described in Section 5.2 for quadratic penalties can be applied to the group-Lasso provided that the following holds: if the regression coefficients B0 are optimal for the score values Θ0, and if the optimal scores Θ^⋆ are obtained by a unitary transformation of Θ0, say Θ^⋆ = Θ0 V (where V ∈ R^{M×M} is a unitary matrix), then B^⋆ = B0 V is optimal conditionally on Θ^⋆, that is, (Θ^⋆, B^⋆) is a global solution corresponding to the optimal scoring problem. To show this, we use the standard group-Lasso formulation and prove the following proposition.

Proposition E.1. Let B^⋆ be a solution of
\[
\min_{B \in \mathbb{R}^{p \times M}} \; \|Y - XB\|_F^2 + \lambda \sum_{j=1}^{p} w_j \|\beta^j\|_2 , \tag{E.1}
\]
and let Ỹ = YV, where V ∈ R^{M×M} is a unitary matrix. Then B̃ = B^⋆V is a solution of
\[
\min_{B \in \mathbb{R}^{p \times M}} \; \|\tilde{Y} - XB\|_F^2 + \lambda \sum_{j=1}^{p} w_j \|\beta^j\|_2 . \tag{E.2}
\]

Proof. The first-order necessary optimality conditions for B^⋆ are
\[
\forall j \in S(B^\star) , \quad 2\, x^{j\top} \big( X B^\star - Y \big) + \lambda w_j \|\beta^{\star j}\|_2^{-1} \beta^{\star j} = 0 , \tag{E.3a}
\]
\[
\forall j \notin S(B^\star) , \quad 2\, \big\| x^{j\top} \big( X B^\star - Y \big) \big\|_2 \le \lambda w_j , \tag{E.3b}
\]
where S(B^⋆) ⊆ {1, ..., p} denotes the set of non-zero row vectors of B^⋆ and S̄(B^⋆) is its complement.

First, we note that, from the definition of B̃, we have S(B̃) = S(B^⋆). Then we may rewrite the above conditions as follows:
\[
\forall j \in S(\tilde{B}) , \quad 2\, x^{j\top} \big( X \tilde{B} - \tilde{Y} \big) + \lambda w_j \|\tilde{\beta}^{j}\|_2^{-1} \tilde{\beta}^{j} = 0 , \tag{E.4a}
\]
\[
\forall j \notin S(\tilde{B}) , \quad 2\, \big\| x^{j\top} \big( X \tilde{B} - \tilde{Y} \big) \big\|_2 \le \lambda w_j , \tag{E.4b}
\]

where (E.4a) is obtained by multiplying both sides of Equation (E.3a) by V, which also uses that VV^⊤ = I, so that, for all u ∈ R^M, ‖u^⊤‖_2 = ‖u^⊤V‖_2. Equation (E.4b) is also obtained from the latter relationship. Conditions (E.4) are then recognized as the first-order necessary conditions for B̃ to be a solution to Problem (E.2). As the latter is convex, these conditions are sufficient, which concludes the proof.


F Expected Complete Likelihood and Likelihood

Section 7.1.2 explains that, with the maximization of the conditional expectation of the complete log-likelihood Q(θ, θ′) (7.7) by means of the EM algorithm, the log-likelihood (7.1) is also maximized. The value of the log-likelihood can be computed using its definition (7.1), but there is a shorter way to compute it from Q(θ, θ′) when the latter is available:

\[
L(\theta) = \sum_{i=1}^{n} \log \Big( \sum_{k=1}^{K} \pi_k f_k(x_i; \theta_k) \Big) , \tag{F.1}
\]
\[
Q(\theta, \theta') = \sum_{i=1}^{n} \sum_{k=1}^{K} t_{ik}(\theta') \log \big( \pi_k f_k(x_i; \theta_k) \big) , \tag{F.2}
\]
\[
\text{with} \quad t_{ik}(\theta') = \frac{\pi'_k f_k(x_i; \theta'_k)}{\sum_{\ell} \pi'_\ell f_\ell(x_i; \theta'_\ell)} . \tag{F.3}
\]

In the EM algorithm, θ′ denotes the model parameters at the previous iteration, t_ik(θ′) are the posterior probability values computed from θ′ at the previous E-step, and θ (without the prime) denotes the parameters of the current iteration, to be obtained by the maximization of Q(θ, θ′).

Using (F.3), we have
\[
Q(\theta, \theta') = \sum_{i,k} t_{ik}(\theta') \log \big( \pi_k f_k(x_i; \theta_k) \big)
= \sum_{i,k} t_{ik}(\theta') \log(t_{ik}(\theta)) + \sum_{i,k} t_{ik}(\theta') \log \Big( \sum_{\ell} \pi_\ell f_\ell(x_i; \theta_\ell) \Big)
= \sum_{i,k} t_{ik}(\theta') \log(t_{ik}(\theta)) + L(\theta) .
\]

In particular, after the evaluation of t_ik in the E-step, where θ = θ′, the log-likelihood can be computed using the value of Q(θ, θ) (7.7) and the entropy of the posterior probabilities:
\[
L(\theta) = Q(\theta, \theta) - \sum_{i,k} t_{ik}(\theta) \log(t_{ik}(\theta)) = Q(\theta, \theta) + H(T) .
\]
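This identity is easy to check numerically; the sketch below does so for an arbitrary two-component Gaussian mixture with common covariance (the mixture and its parameters are arbitrary, chosen only for the illustration).

import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 2))
pis = np.array([0.3, 0.7])
mus = np.array([[0.0, 0.0], [2.0, 1.0]])
Sigma = np.eye(2)

# component densities, posteriors, and the three quantities of this appendix
F = np.column_stack([multivariate_normal.pdf(X, mean=m, cov=Sigma) for m in mus])
T = pis * F
T /= T.sum(axis=1, keepdims=True)                         # t_ik, Eq. (F.3)
L = np.sum(np.log(F @ pis))                               # log-likelihood, Eq. (F.1)
Q = np.sum(T * np.log(pis * F))                           # Q(theta, theta), Eq. (F.2)
H = -np.sum(T * np.log(T))                                # entropy of the posteriors
print(np.allclose(L, Q + H))                              # True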


G Derivation of the M-Step Equations

This appendix details the whole process to obtain expressions (7.10), (7.11) and (7.12) in the context of a Gaussian mixture model with common covariance matrix. The criterion is defined as

\[
Q(\theta, \theta') = \sum_{i,k} t_{ik}(\theta') \log \big( \pi_k f_k(x_i; \theta_k) \big)
= \sum_{k} \Big( \sum_{i} t_{ik} \Big) \log \pi_k - \frac{np}{2} \log(2\pi) - \frac{n}{2} \log |\Sigma|
- \frac{1}{2} \sum_{i,k} t_{ik} (x_i - \mu_k)^\top \Sigma^{-1} (x_i - \mu_k) ,
\]
which has to be maximized subject to \( \sum_k \pi_k = 1 \).

The Lagrangian of this problem is
\[
L(\theta) = Q(\theta, \theta') + \lambda \Big( \sum_{k} \pi_k - 1 \Big) .
\]

Partial derivatives of the Lagrangian are set to zero to obtain the optimal values of π_k, μ_k and Σ.

G.1 Prior Probabilities

\[
\frac{\partial L(\theta)}{\partial \pi_k} = 0 \;\Leftrightarrow\; \frac{1}{\pi_k} \sum_{i} t_{ik} + \lambda = 0 ,
\]
where λ is identified from the constraint, leading to
\[
\pi_k = \frac{1}{n} \sum_{i} t_{ik} .
\]


G.2 Means

\[
\frac{\partial L(\theta)}{\partial \mu_k} = 0 \;\Leftrightarrow\; -\frac{1}{2} \sum_{i} t_{ik} \, 2 \Sigma^{-1} (\mu_k - x_i) = 0
\;\Rightarrow\; \mu_k = \frac{\sum_i t_{ik} x_i}{\sum_i t_{ik}} .
\]

G.3 Covariance Matrix

\[
\frac{\partial L(\theta)}{\partial \Sigma^{-1}} = 0 \;\Leftrightarrow\;
\underbrace{\frac{n}{2} \Sigma}_{\text{as per Property 4}}
\; - \; \underbrace{\frac{1}{2} \sum_{i,k} t_{ik} (x_i - \mu_k)(x_i - \mu_k)^\top}_{\text{as per Property 5}} = 0
\;\Rightarrow\; \Sigma = \frac{1}{n} \sum_{i,k} t_{ik} (x_i - \mu_k)(x_i - \mu_k)^\top .
\]
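These closed-form updates translate directly into code; the sketch below computes them from a given posterior matrix T, in the common-covariance setting of this appendix (an illustration, not the thesis implementation).

import numpy as np

def m_step(X, T):
    """Closed-form M-step updates for a Gaussian mixture with common covariance,
    given the n x K posterior matrix T (priors, means and pooled covariance)."""
    n, p = X.shape
    nk = T.sum(axis=0)                            # soft counts, one per component
    priors = nk / n                               # pi_k
    means = (T.T @ X) / nk[:, None]               # mu_k = sum_i t_ik x_i / sum_i t_ik
    Sigma = np.zeros((p, p))
    for k in range(T.shape[1]):
        diff = X - means[k]
        Sigma += (T[:, k, None] * diff).T @ diff  # sum_i t_ik (x_i - mu_k)(x_i - mu_k)'
    return priors, means, Sigma / n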


Bibliography

F Bach R Jenatton J Mairal and G Obozinski Convex optimization with sparsity-inducing norms Optimization for Machine Learning pages 19ndash54 2011

F R Bach Bolasso model consistent lasso estimation through the bootstrap InProceedings of the 25th international conference on Machine learning ICML 2008

F R Bach R Jenatton J Mairal and G Obozinski Optimization with sparsity-inducing penalties Foundations and Trends in Machine Learning 4(1)1ndash106 2012

J D Banfield and A E Raftery Model-based Gaussian and non-Gaussian clusteringBiometrics pages 803ndash821 1993

A Beck and M Teboulle A fast iterative shrinkage-thresholding algorithm for linearinverse problems SIAM Journal on Imaging Sciences 2(1)183ndash202 2009

H Bensmail and G Celeux Regularized Gaussian discriminant analysis through eigen-value decomposition Journal of the American statistical Association 91(436)1743ndash1748 1996

P J Bickel and E Levina Some theory for Fisher's linear discriminant function, 'naive Bayes', and some alternatives when there are many more variables than observations Bernoulli 10(6):989–1010 2004

C Bienarcki G Celeux G Govaert and F Langrognet MIXMOD Statistical Documentation http://www.mixmod.org 2008

C M Bishop Pattern Recognition and Machine Learning Springer New York 2006

C Bouveyron and C Brunet Discriminative variable selection for clustering with thesparse Fisher-EM algorithm Technical Report 12042067 Arxiv e-prints 2012a

C Bouveyron and C Brunet Simultaneous model-based clustering and visualization inthe Fisher discriminative subspace Statistics and Computing 22(1)301ndash324 2012b

S Boyd and L Vandenberghe Convex optimization Cambridge university press 2004

L Breiman Better subset regression using the nonnegative garrote Technometrics 37(4)373ndash384 1995

L Breiman and R Ihaka Nonlinear discriminant analysis via ACE and scaling TechnicalReport 40 University of California Berkeley 1984


T Cai and W Liu A direct estimation approach to sparse linear discriminant analysisJournal of the American Statistical Association 106(496)1566ndash1577 2011

S Canu and Y Grandvalet Outcomes of the equivalence of adaptive ridge with leastabsolute shrinkage Advances in Neural Information Processing Systems page 4451999

C Caramanis S Mannor and H Xu Robust optimization in machine learning InS Sra S Nowozin and S J Wright editors Optimization for Machine Learningpages 369ndash402 MIT Press 2012

B Chidlovskii and L Lecerf Scalable feature selection for multi-class problems InW Daelemans B Goethals and K Morik editors Machine Learning and KnowledgeDiscovery in Databases volume 5211 of Lecture Notes in Computer Science pages227ndash240 Springer 2008

L Clemmensen T Hastie D Witten and B Ersbøll Sparse discriminant analysis Technometrics 53(4):406–413 2011

C De Mol E De Vito and L Rosasco Elastic-net regularization in learning theoryJournal of Complexity 25(2)201ndash230 2009

A P Dempster N M Laird and D B Rubin Maximum likelihood from incompletedata via the em algorithm Journal of the Royal Statistical Society Series B (Method-ological) 39(1)1ndash38 1977 ISSN 0035-9246

D L Donoho M Elad and V N Temlyakov Stable recovery of sparse overcompleterepresentations in the presence of noise IEEE Transactions on Information Theory52(1)6ndash18 2006

R O Duda P E Hart and D G Stork Pattern Classification Wiley 2000

B Efron T Hastie I Johnstone and R Tibshirani Least angle regression The Annalsof statistics 32(2)407ndash499 2004

Jianqing Fan and Yingying Fan High dimensional classification using features annealedindependence rules Annals of statistics 36(6)2605 2008

R A Fisher The use of multiple measurements in taxonomic problems Annals ofHuman Genetics 7(2)179ndash188 1936

V Franc and S Sonnenburg Optimized cutting plane algorithm for support vectormachines In Proceedings of the 25th international conference on Machine learningpages 320ndash327 ACM 2008

J Friedman T Hastie and R Tibshirani The Elements of Statistical Learning DataMining Inference and Prediction Springer 2009


J Friedman T Hastie and R Tibshirani A note on the group lasso and a sparse grouplasso Technical Report 10010736 ArXiv e-prints 2010

J H Friedman Regularized discriminant analysis Journal of the American StatisticalAssociation 84(405)165ndash175 1989

W J Fu Penalized regressions the bridge versus the lasso Journal of Computationaland Graphical Statistics 7(3)397ndash416 1998

A Gelman J B Carlin H S Stern and D B Rubin Bayesian Data Analysis Chapman & Hall/CRC 2003

D Ghosh and A M Chinnaiyan Classification and selection of biomarkers in genomicdata using lasso Journal of Biomedicine and Biotechnology 2147ndash154 2005

G Govaert Y Grandvalet X Liu and L F Sanchez Merchante Implementation base-line for clustering Technical Report D71-m12 Massive Sets of Heuristics for MachineLearning httpssecuremash-projecteufilesmash-deliverable-D71-m12pdf 2010

G Govaert Y Grandvalet B Laval X Liu and L F Sanchez Merchante Implemen-tations of original clustering Technical Report D72-m24 Massive Sets of Heuristicsfor Machine Learning httpssecuremash-projecteufilesmash-deliverable-D72-m24pdf 2011

Y Grandvalet Least absolute shrinkage is equivalent to quadratic penalization InPerspectives in Neural Computing volume 98 pages 201ndash206 1998

Y Grandvalet and S Canu Adaptive scaling for feature selection in svms Advances inNeural Information Processing Systems 15553ndash560 2002

L Grosenick S Greer and B Knutson Interpretable classifiers for fMRI improveprediction of purchases IEEE Transactions on Neural Systems and RehabilitationEngineering 16(6)539ndash548 2008

Y Guermeur G Pollastri A Elisseeff D Zelus H Paugam-Moisy and P Baldi Com-bining protein secondary structure prediction models with ensemble methods of opti-mal complexity Neurocomputing 56305ndash327 2004

J Guo E Levina G Michailidis and J Zhu Pairwise variable selection for high-dimensional model-based clustering Biometrics 66(3)793ndash804 2010

I Guyon and A Elisseeff An introduction to variable and feature selection Journal ofMachine Learning Research 31157ndash1182 2003

T Hastie and R Tibshirani Discriminant analysis by Gaussian mixtures Journal ofthe Royal Statistical Society Series B (Methodological) 58(1)155ndash176 1996

T Hastie R Tibshirani and A Buja Flexible discriminant analysis by optimal scoringJournal of the American Statistical Association 89(428)1255ndash1270 1994


T Hastie A Buja and R Tibshirani Penalized discriminant analysis The Annals ofStatistics 23(1)73ndash102 1995

A E Hoerl and R W Kennard Ridge regression Biased estimation for nonorthogonalproblems Technometrics 12(1)55ndash67 1970

J Huang S Ma H Xie and C H Zhang A group bridge approach for variableselection Biometrika 96(2)339ndash355 2009

T Joachims Training linear svms in linear time In Proceedings of the 12th ACMSIGKDD international conference on Knowledge discovery and data mining pages217ndash226 ACM 2006

K Knight and W Fu Asymptotics for lasso-type estimators The Annals of Statistics28(5)1356ndash1378 2000

P F Kuan S Wang X Zhou and H Chu A statistical framework for illumina DNAmethylation arrays Bioinformatics 26(22)2849ndash2855 2010

T Lange M Braun V Roth and J Buhmann Stability-based model selection Ad-vances in Neural Information Processing Systems 15617ndash624 2002

M H C Law M A T Figueiredo and A K Jain Simultaneous feature selectionand clustering using mixture models IEEE Transactions on Pattern Analysis andMachine Intelligence 26(9)1154ndash1166 2004

Y Lee Y Lin and G Wahba Multicategory support vector machines Journal of theAmerican Statistical Association 99(465)67ndash81 2004

C Leng Sparse optimal scoring for multiclass cancer diagnosis and biomarker detectionusing microarray data Computational Biology and Chemistry 32(6)417ndash425 2008

C Leng Y Lin and G Wahba A note on the lasso and related procedures in modelselection Statistica Sinica 16(4)1273 2006

H Liu and L Yu Toward integrating feature selection algorithms for classification andclustering IEEE Transactions on Knowledge and Data Engineering 17(4)491ndash5022005

J MacQueen Some methods for classification and analysis of multivariate observa-tions In Proceedings of the fifth Berkeley Symposium on Mathematical Statistics andProbability volume 1 pages 281ndash297 University of California Press 1967

Q Mai H Zou and M Yuan A direct approach to sparse discriminant analysis inultra-high dimensions Biometrika 99(1)29ndash42 2012

C Maugis G Celeux and M L Martin-Magniette Variable selection for clusteringwith Gaussian mixture models Biometrics 65(3)701ndash709 2009a


C Maugis G Celeux and ML Martin-Magniette SelvarClust: software for variable selection in model-based clustering http://www.math.univ-toulouse.fr/~maugis/SelvarClustHomepage.html 2009b

L Meier S Van De Geer and P Buhlmann The group lasso for logistic regressionJournal of the Royal Statistical Society Series B (Statistical Methodology) 70(1)53ndash71 2008

N Meinshausen and P Buhlmann High-dimensional graphs and variable selection withthe lasso The Annals of Statistics 34(3)1436ndash1462 2006

B Moghaddam Y Weiss and S Avidan Generalized spectral bounds for sparse LDAIn Proceedings of the 23rd international conference on Machine learning pages 641ndash648 ACM 2006

B Moghaddam Y Weiss and S Avidan Fast pixelpart selection with sparse eigen-vectors In IEEE 11th International Conference on Computer Vision 2007 ICCV2007 pages 1ndash8 2007

Y Nesterov Gradient methods for minimizing composite functions preprint 2007

S Newcomb A generalized theory of the combination of observations so as to obtainthe best result American Journal of Mathematics 8(4)343ndash366 1886

B Ng and R Abugharbieh Generalized group sparse classifiers with application in fMRIbrain decoding In Computer Vision and Pattern Recognition (CVPR) 2011 IEEEConference on pages 1065ndash1071 IEEE 2011

M R Osborne B Presnell and B A Turlach On the lasso and its dual Journal ofComputational and Graphical statistics 9(2)319ndash337 2000a

M R Osborne B Presnell and B A Turlach A new approach to variable selection inleast squares problems IMA Journal of Numerical Analysis 20(3)389ndash403 2000b

W Pan and X Shen Penalized model-based clustering with application to variableselection Journal of Machine Learning Research 81145ndash1164 2007

W Pan X Shen A Jiang and R P Hebbel Semi-supervised learning via penalizedmixture model with application to microarray sample classification Bioinformatics22(19)2388ndash2395 2006

K Pearson Contributions to the mathematical theory of evolution Philosophical Trans-actions of the Royal Society of London 18571ndash110 1894

S Perkins K Lacker and J Theiler Grafting Fast incremental feature selection bygradient descent in function space Journal of Machine Learning Research 31333ndash1356 2003


Z Qiao L Zhou and J Huang Sparse linear discriminant analysis with applications tohigh dimensional low sample size data International Journal of Applied Mathematics39(1) 2009

A E Raftery and N Dean Variable selection for model-based clustering Journal ofthe American Statistical Association 101(473)168ndash178 2006

C R Rao The utilization of multiple measurements in problems of biological classi-fication Journal of the Royal Statistical Society Series B (Methodological) 10(2)159ndash203 1948

S Rosset and J Zhu Piecewise linear regularized solution paths The Annals of Statis-tics 35(3)1012ndash1030 2007

V Roth The generalized lasso IEEE Transactions on Neural Networks 15(1)16ndash282004

V Roth and B Fischer The group-lasso for generalized linear models uniqueness ofsolutions and efficient algorithms In W W Cohen A McCallum and S T Roweiseditors Machine Learning Proceedings of the Twenty-Fifth International Conference(ICML 2008) volume 307 of ACM International Conference Proceeding Series pages848ndash855 2008

V Roth and T Lange Feature selection in clustering problems In S Thrun L KSaul and B Scholkopf editors Advances in Neural Information Processing Systems16 pages 473ndash480 MIT Press 2004

C Sammut and G I Webb Encyclopedia of Machine Learning Springer-Verlag NewYork Inc 2010

L F Sanchez Merchante Y Grandvalet and G Govaert An efficient approach to sparselinear discriminant analysis In Proceedings of the 29th International Conference onMachine Learning ICML 2012

Gideon Schwarz Estimating the dimension of a model The annals of statistics 6(2)461ndash464 1978

A J Smola SVN Vishwanathan and Q Le Bundle methods for machine learningAdvances in Neural Information Processing Systems 201377ndash1384 2008

S Sonnenburg G Ratsch C Schafer and B Scholkopf Large scale multiple kernellearning Journal of Machine Learning Research 71531ndash1565 2006

P Sprechmann I Ramirez G Sapiro and Y Eldar Collaborative hierarchical sparsemodeling In Information Sciences and Systems (CISS) 2010 44th Annual Conferenceon pages 1ndash6 IEEE 2010

M Szafranski Pénalités Hiérarchiques pour l'Intégration de Connaissances dans les Modèles Statistiques PhD thesis Université de Technologie de Compiègne 2008


M Szafranski Y Grandvalet and P Morizet-Mahoudeaux Hierarchical penalizationAdvances in Neural Information Processing Systems 2008

R Tibshirani Regression shrinkage and selection via the lasso Journal of the RoyalStatistical Society Series B (Methodological) pages 267ndash288 1996

J E Vogt and V Roth The group-lasso l1 regularization versus l12 regularization InPattern Recognition 32-nd DAGM Symposium Lecture Notes in Computer Science2010

S Wang and J Zhu Variable selection for model-based high-dimensional clustering andits application to microarray data Biometrics 64(2)440ndash448 2008

D Witten and R Tibshirani Penalized classification using Fisherrsquos linear discriminantJournal of the Royal Statistical Society Series B (Statistical Methodology) 73(5)753ndash772 2011

D M Witten and R Tibshirani A framework for feature selection in clustering Journalof the American Statistical Association 105(490)713ndash726 2010

D M Witten R Tibshirani and T Hastie A penalized matrix decomposition withapplications to sparse principal components and canonical correlation analysis Bio-statistics 10(3)515ndash534 2009

M Wu and B Scholkopf A local learning approach for clustering Advances in NeuralInformation Processing Systems 191529 2007

MC Wu L Zhang Z Wang DC Christiani and X Lin Sparse linear discriminantanalysis for simultaneous testing for the significance of a gene setpathway and geneselection Bioinformatics 25(9)1145ndash1151 2009

T T Wu and K Lange Coordinate descent algorithms for lasso penalized regressionThe Annals of Applied Statistics pages 224ndash244 2008

B Xie W Pan and X Shen Penalized model-based clustering with cluster-specificdiagonal covariance matrices and grouped variables Electronic Journal of Statistics2168ndash172 2008a

B Xie W Pan and X Shen Variable selection in penalized model-based clustering viaregularization on grouped parameters Biometrics 64(3)921ndash930 2008b

C Yang X Wan Q Yang H Xue and W Yu Identifying main effects and epistaticinteractions from large-scale snp data via adaptive group lasso BMC bioinformatics11(Suppl 1)S18 2010

J Ye Least squares linear discriminant analysis In Proceedings of the 24th internationalconference on Machine learning pages 1087ndash1093 ACM 2007


M Yuan and Y Lin Model selection and estimation in regression with grouped variablesJournal of the Royal Statistical Society Series B (Statistical Methodology) 68(1)49ndash67 2006

P Zhao and B Yu On model selection consistency of lasso Journal of Machine LearningResearch 7(2)2541 2007

P Zhao G Rocha and B Yu The composite absolute penalties family for grouped andhierarchical variable selection The Annals of Statistics 37(6A)3468ndash3497 2009

H Zhou W Pan and X Shen Penalized model-based clustering with unconstrainedcovariance matrices Electronic Journal of Statistics 31473ndash1496 2009

H Zou The adaptive lasso and its oracle properties Journal of the American StatisticalAssociation 101(476)1418ndash1429 2006

H Zou and T Hastie Regularization and variable selection via the elastic net Journal ofthe Royal Statistical Society Series B (Statistical Methodology) 67(2)301ndash320 2005


                                                                                                  • Matrix Properties
                                                                                                  • The Penalized-OS Problem is an Eigenvector Problem
                                                                                                    • How to Solve the Eigenvector Decomposition
                                                                                                    • Why the OS Problem is Solved as an Eigenvector Problem
                                                                                                      • Solving Fishers Discriminant Problem
                                                                                                      • Alternative Variational Formulation for the Group-Lasso
                                                                                                        • Useful Properties
                                                                                                        • An Upper Bound on the Objective Function
                                                                                                          • Invariance of the Group-Lasso to Unitary Transformations
                                                                                                          • Expected Complete Likelihood and Likelihood
                                                                                                          • Derivation of the M-Step Equations
                                                                                                            • Prior probabilities
                                                                                                            • Means
                                                                                                            • Covariance Matrix
                                                                                                                • Bibliography
Page 8: Luis Francisco Sanchez Merchante To cite this version

Contents

7.2 Feature Selection in Model-Based Clustering
    7.2.1 Based on Penalized Likelihood
    7.2.2 Based on Model Variants
    7.2.3 Based on Model Selection

8 Theoretical Foundations
  8.1 Resolving EM with Optimal Scoring
    8.1.1 Relationship Between the M-Step and Linear Discriminant Analysis
    8.1.2 Relationship Between Optimal Scoring and Linear Discriminant Analysis
    8.1.3 Clustering Using Penalized Optimal Scoring
    8.1.4 From Sparse Optimal Scoring to Sparse Linear Discriminant Analysis
  8.2 Optimized Criterion
    8.2.1 A Bayesian Derivation
    8.2.2 Maximum a Posteriori Estimator

9 Mix-GLOSS Algorithm
  9.1 Mix-GLOSS
    9.1.1 Outer Loop: Whole Algorithm Repetitions
    9.1.2 Penalty Parameter Loop
    9.1.3 Inner Loop: EM Algorithm
  9.2 Model Selection

10 Experimental Results
  10.1 Tested Clustering Algorithms
  10.2 Results
  10.3 Discussion

Conclusions

Appendix

A Matrix Properties

B The Penalized-OS Problem is an Eigenvector Problem
  B.1 How to Solve the Eigenvector Decomposition
  B.2 Why the OS Problem is Solved as an Eigenvector Problem

C Solving Fisher's Discriminant Problem

D Alternative Variational Formulation for the Group-Lasso
  D.1 Useful Properties
  D.2 An Upper Bound on the Objective Function

E Invariance of the Group-Lasso to Unitary Transformations

F Expected Complete Likelihood and Likelihood

G Derivation of the M-Step Equations
  G.1 Prior probabilities
  G.2 Means
  G.3 Covariance Matrix

Bibliography

List of Figures

1.1 MASH project logo
2.1 Example of relevant features
2.2 Four key steps of feature selection
2.3 Admissible sets in two dimensions for different pure norms ||β||p
2.4 Two-dimensional regularized problems with ||β||1 and ||β||2 penalties
2.5 Admissible sets for the Lasso and Group-Lasso
2.6 Sparsity patterns for an example with 8 variables characterized by 4 parameters
4.1 Graphical representation of the variational approach to Group-Lasso
5.1 GLOSS block diagram
5.2 Graph and Laplacian matrix for a 3×3 image
6.1 TPR versus FPR for all simulations
6.2 2D-representations of Nakayama and Sun datasets based on the two first discriminant vectors provided by GLOSS and SLDA
6.3 USPS digits "1" and "0"
6.4 Discriminant direction between digits "1" and "0"
6.5 Sparse discriminant direction between digits "1" and "0"
9.1 Mix-GLOSS loops scheme
9.2 Mix-GLOSS model selection diagram
10.1 Class mean vectors for each artificial simulation
10.2 TPR versus FPR for all simulations

List of Tables

6.1 Experimental results for simulated data, supervised classification
6.2 Average TPR and FPR for all simulations
6.3 Experimental results for gene expression data, supervised classification
10.1 Experimental results for simulated data, unsupervised clustering
10.2 Average TPR versus FPR for all clustering simulations

Notation and Symbols

Throughout this thesis, vectors are denoted by lowercase letters in bold font and matrices by uppercase letters in bold font. Unless otherwise stated, vectors are column vectors, and parentheses are used to build line vectors from comma-separated lists of scalars, or to build matrices from comma-separated lists of column vectors.

Sets

N         the set of natural numbers, N = {1, 2, . . .}
R         the set of reals
|A|       cardinality of a set A (for finite sets, the number of elements)
Ā         complement of set A

Data

X         input domain
xi        input sample, xi ∈ X
X         design matrix, X = (x1^⊤, . . . , xn^⊤)^⊤
xj        column j of X
yi        class indicator of sample i
Y         indicator matrix, Y = (y1^⊤, . . . , yn^⊤)^⊤
z         complete data, z = (x, y)
Gk        set of the indices of observations belonging to class k
n         number of examples
K         number of classes
p         dimension of X
i, j, k   indices running over N

Vectors, Matrices and Norms

0         vector with all entries equal to zero
1         vector with all entries equal to one
I         identity matrix
A^⊤       transpose of matrix A (ditto for vectors)
A^{−1}    inverse of matrix A
tr(A)     trace of matrix A
|A|       determinant of matrix A
diag(v)   diagonal matrix with v on the diagonal
‖v‖1      L1 norm of vector v
‖v‖2      L2 norm of vector v
‖A‖F      Frobenius norm of matrix A


Probability

E[·]          expectation of a random variable
var[·]        variance of a random variable
N(µ, σ2)      normal distribution with mean µ and variance σ2
W(W, ν)       Wishart distribution with ν degrees of freedom and W scale matrix
H(X)          entropy of random variable X
I(X; Y)       mutual information between random variables X and Y

Mixture Models

yik           hard membership of sample i to cluster k
fk            distribution function for cluster k
tik           posterior probability of sample i to belong to cluster k
T             posterior probability matrix
πk            prior probability or mixture proportion for cluster k
µk            mean vector of cluster k
Σk            covariance matrix of cluster k
θk            parameter vector for cluster k, θk = (µk, Σk)
θ(t)          parameter vector at iteration t of the EM algorithm
f(X; θ)       likelihood function
L(θ; X)       log-likelihood function
LC(θ; X, Y)   complete log-likelihood function

Optimization

J(·)          cost function
L(·)          Lagrangian
β̂             generic notation for the solution with respect to β
βls           least squares solution coefficient vector
A             active set
γ             step size to update the regularization path
h             direction to update the regularization path


Penalized models

λ, λ1, λ2      penalty parameters
Pλ(θ)          penalty term over a generic parameter vector
βkj            coefficient j of discriminant vector k
βk             kth discriminant vector, βk = (βk1, . . . , βkp)
B              matrix of discriminant vectors, B = (β1, . . . , βK−1)
βj             jth row of B = (β1^⊤, . . . , βp^⊤)^⊤
BLDA           coefficient matrix in the LDA domain
BCCA           coefficient matrix in the CCA domain
BOS            coefficient matrix in the OS domain
XLDA           data matrix in the LDA domain
XCCA           data matrix in the CCA domain
XOS            data matrix in the OS domain
θk             score vector k
Θ              score matrix, Θ = (θ1, . . . , θK−1)
Y              label matrix
Ω              penalty matrix
LCP(θ; X, Z)   penalized complete log-likelihood function
ΣB             between-class covariance matrix
ΣW             within-class covariance matrix
ΣT             total covariance matrix
Σ̂B             sample between-class covariance matrix
Σ̂W             sample within-class covariance matrix
Σ̂T             sample total covariance matrix
Λ              inverse of covariance matrix, or precision matrix
wj             weights
τj             penalty components of the variational approach


Part I

Context and Foundations


This thesis is divided in three parts. In Part I, I introduce the context in which this work has been developed, the project that funded it and the constraints that we had to obey. Generic foundations are also detailed here, to introduce the models and some basic concepts that will be used along this document, and the state of the art is also reviewed.

The first contribution of this thesis is explained in Part II, where I present the supervised learning algorithm GLOSS and its supporting theory, as well as some experiments to test its performance compared to other state-of-the-art mechanisms. Before describing the algorithm and the experiments, its theoretical foundations are provided.

The second contribution is described in Part III, with a structure analogous to that of Part II, but for the unsupervised domain. The clustering algorithm Mix-GLOSS adapts the supervised technique from Part II by means of a modified EM algorithm. This part is also furnished with specific theoretical foundations, an experimental section and a final discussion.


1 Context

The MASH project is a research initiative to investigate the open and collaborative design of feature extractors for the Machine Learning scientific community. The project is structured around a web platform (http://mash-project.eu) comprising collaborative tools such as wiki-documentation, forums, coding templates and an experiment center empowered with non-stop calculation servers. The applications targeted by MASH are vision and goal-planning problems, either in a 3D virtual environment or with a real robotic arm.

The MASH consortium is led by the IDIAP Research Institute in Switzerland. The other members are the University of Potsdam in Germany, the Czech Technical University of Prague, the National Institute for Research in Computer Science and Control (INRIA) in France, and the National Centre for Scientific Research (CNRS), also in France, through the laboratory of Heuristics and Diagnosis for Complex Systems (HEUDIASYC) attached to the University of Technology of Compiègne.

From the point of view of the research, the members of the consortium must deal with four main goals:

1 Software development of the website, framework and APIs

2 Classification and goal-planning in high dimensional feature spaces

3 Interfacing the platform with the 3D virtual environment and the robot arm

4 Building tools to assist contributors with the development of the feature extractors and the configuration of the experiments

Figure 1.1: MASH project logo

The work detailed in this text has been done in the context of goal 4. From the very beginning of the project, our role has been to provide the users with some feedback regarding the feature extractors. At the moment of writing this thesis, the number of public feature extractors reaches 75. In addition to the public ones, there are also private extractors that contributors decide not to share with the rest of the community. The last number I was aware of was about 300. Within those 375 extractors, there must be some of them sharing the same theoretical principles or supplying similar features. The framework of the project tests every new piece of code with some datasets of reference, in order to provide a ranking depending on the quality of the estimation. However, similar performance of two extractors for a particular dataset does not mean that both are using the same variables.

Our engagement was to provide some textual or graphical tools to discover which extractors compute features similar to other ones. Our hypothesis is that many of them use the same theoretical foundations, which should induce a grouping of similar extractors. If we succeed in discovering those groups, we would also be able to select representatives. This information can be used in several ways. For example, from the perspective of a user that develops feature extractors, it would be interesting to compare the performance of his code against the K representatives instead of the whole database. As another example, imagine a user wants to obtain the best prediction results for a particular dataset. Instead of selecting all the feature extractors, creating an extremely high dimensional space, he could select only the K representatives, foreseeing similar results with a faster experiment.

As there is no prior knowledge about the latent structure, we make use of unsupervised techniques. Below is a brief description of the different tools that we developed for the web platform.

• Clustering Using Mixture Models. This is a well-known technique that models the data as if it were randomly generated from a distribution function. This distribution is typically a mixture of Gaussians with unknown mixture proportions, means and covariance matrices. The number of Gaussian components matches the number of expected groups. The parameters of the model are computed using the EM algorithm and the clusters are built by maximum a posteriori estimation. For the calculation we use mixmod, a C++ library that can be interfaced with MATLAB. This library allows working with high dimensional data. Further information regarding mixmod is given by Bienarcki et al. (2008). All details concerning the tool implemented are given in deliverable "mash-deliverable-D7.1-m12" (Govaert et al., 2010).

• Sparse Clustering Using Penalized Optimal Scoring. This technique intends again to perform clustering by modelling the data as a mixture of Gaussian distributions. However, instead of using a classic EM algorithm for estimating the components' parameters, the M-step is replaced by a penalized Optimal Scoring problem. This replacement induces sparsity, improving the robustness and the interpretability of the results. Its theory will be explained later in this thesis. All details concerning the tool implemented can be found in deliverable "mash-deliverable-D7.2-m24" (Govaert et al., 2011).

• Table Clustering Using The RV Coefficient. This technique applies clustering methods directly to the tables computed by the feature extractors, instead of creating a single matrix. A distance in the extractor space is defined using the RV coefficient, which is a multivariate generalization of Pearson's correlation coefficient in the form of an inner product. The distance is defined for every pair i and j as RV(Oi, Oj), where Oi and Oj are operators computed from the tables returned by feature extractors i and j. Once we have a distance matrix, several standard techniques may be used to group extractors. A detailed description of this technique can be found in deliverables "mash-deliverable-D7.1-m12" (Govaert et al., 2010) and "mash-deliverable-D7.2-m24" (Govaert et al., 2011).

I am not extending this section with further explanations about the MASH project or deeper details about the theory that we used to fulfil our engagements. I simply refer to the public deliverables of the project, where everything is carefully detailed (Govaert et al., 2010, 2011).


2 Regularization for Feature Selection

With the advances in technology, data is becoming larger and larger, resulting in high dimensional ensembles of information. Genomics, textual indexation and medical images are some examples of data that can easily exceed thousands of dimensions. The first experiments aiming to cluster the data from the MASH project (see Chapter 1) intended to work with the whole dimensionality of the samples. As the number of feature extractors rose, the numerical issues also rose. Redundancy or extremely correlated features may happen if two contributors implement the same extractor with different names. When the number of features exceeded the number of samples, we started to deal with singular covariance matrices, whose inverses are not defined. Many algorithms in the field of Machine Learning make use of this statistic.

2.1 Motivations

There is a quite recent effort in the direction of handling high dimensional data. Traditional techniques can be adapted, but quite often large dimensions render those techniques useless. Linear Discriminant Analysis was shown to be no better than a "random guessing" of the object labels when the dimension is larger than the sample size (Bickel and Levina, 2004; Fan and Fan, 2008).

As a rule of thumb, in discriminant and clustering problems the complexity of the calculations increases with the number of objects in the database, the number of features (dimensionality) and the number of classes or clusters. One way to reduce this complexity is to reduce the number of features. This reduction induces more robust estimators, allows faster learning and predictions in supervised environments, and provides easier interpretations in the unsupervised framework. Removing features must be done wisely, to avoid removing critical information.

When talking about dimensionality reduction, there are two families of techniques that could induce confusion:

• Reduction by feature transformation summarizes the dataset with fewer dimensions by creating combinations of the original attributes. These techniques are less effective when there are many irrelevant attributes (noise). Principal Component Analysis and Independent Component Analysis are two popular examples.

• Reduction by feature selection removes irrelevant dimensions, preserving the integrity of the informative features from the original dataset. The problem comes out when there is a restriction in the number of variables to preserve, and discarding the exceeding dimensions leads to a loss of information. Prediction with feature selection is computationally cheaper, because only relevant features are used, and the resulting models are easier to interpret. The Lasso operator is an example of this category.

Figure 2.1: Example of relevant features, from Chidlovskii and Lecerf (2008)

As a basic rule, we can use the reduction techniques by feature transformation when the majority of the features are relevant and when there is a lot of redundancy or correlation. On the contrary, feature selection techniques are useful when there are plenty of useless or noisy features (irrelevant information) that need to be filtered out. In the paper of Chidlovskii and Lecerf (2008) we find a great explanation about the difference between irrelevant and redundant features. The following two paragraphs are almost exact reproductions of their text.

"Irrelevant features are those which provide negligible distinguishing information. For example, if the objects are all dogs, cats or squirrels, and it is desired to classify each new animal into one of these three classes, the feature of color may be irrelevant if each of dogs, cats and squirrels have about the same distribution of brown, black and tan fur colors. In such a case, knowing that an input animal is brown provides negligible distinguishing information for classifying the animal as a cat, dog or squirrel. Features which are irrelevant for a given classification problem are not useful, and accordingly a feature that is irrelevant can be filtered out.

Redundant features are those which provide distinguishing information but are cumulative to another feature or group of features that provide substantially the same distinguishing information. Using the previous example, consider illustrative "diet" and "domestication" features. Dogs and cats both have similar carnivorous diets, while squirrels consume nuts and so forth. Thus the "diet" feature can efficiently distinguish squirrels from dogs and cats, although it provides little information to distinguish between dogs and cats. Dogs and cats are also both typically domesticated animals, while squirrels are wild animals. Thus the "domestication" feature provides substantially the same information as the "diet" feature, namely distinguishing squirrels from dogs and cats but not distinguishing between dogs and cats. Thus the "diet" and "domestication" features are cumulative, and one can identify one of these features as redundant so as to be filtered out. However, unlike irrelevant features, care should be taken with redundant features to ensure that one retains enough of the redundant features to provide the relevant distinguishing information. In the foregoing example, one may wish to filter out either the "diet" feature or the "domestication" feature, but if one removes both the "diet" and the "domestication" features, then useful distinguishing information is lost."

Figure 2.2: The four key steps of feature selection, according to Liu and Yu (2005)

There are some tricks to build robust estimators when the number of features exceeds the number of samples. Ignoring some of the dependencies among variables and replacing the covariance matrix by a diagonal approximation are two of them. Another popular technique, and the one chosen in this thesis, is imposing regularity conditions.

2.2 Categorization of Feature Selection Techniques

Feature selection is one of the most frequent techniques in preprocessing data, in order to remove irrelevant, redundant or noisy features. Nevertheless, the risk of removing some informative dimensions is always there, thus the relevance of the remaining subset of features must be measured.

I reproduce here the scheme that generalizes any feature selection process, as shown by Liu and Yu (2005). Figure 2.2 provides a very intuitive scheme with the four key steps of a feature selection algorithm.

The classification of those algorithms can respond to different criteria. Guyon and Elisseeff (2003) propose a check list that summarizes the steps that may be taken to solve a feature selection problem, guiding the user through several techniques. Liu and Yu (2005) propose a framework that integrates supervised and unsupervised feature selection algorithms through a categorizing framework. Both references are excellent reviews to characterize feature selection techniques according to their characteristics. I propose below a framework inspired by these references; it does not cover all the possibilities, but it gives a good summary of the existing ones.

• Depending on the type of integration with the machine learning algorithm, we have:

– Filter Models - The filter models work as a preprocessing step, using an independent evaluation criterion to select a subset of variables without assistance of the mining algorithm.

– Wrapper Models - The wrapper models require a classification or clustering algorithm and use its prediction performance to assess the relevance of the subset selection. The feature selection is done in the optimization block, while the feature subset evaluation is done in a different one. Therefore, the criterion to optimize and the criterion to evaluate may be different. Those algorithms are computationally expensive.

– Embedded Models - They perform variable selection inside the learning machine, with the selection being made at the training step. That means that there is only one criterion: the optimization and the evaluation form a single block, and the features are selected to optimize this unique criterion and do not need to be re-evaluated in a later phase. That makes them more efficient, since no validation or test process is needed for every variable subset investigated. However, they are less universal, because they are specific to the training process of a given mining algorithm.

• Depending on the feature searching technique:

– Complete - No subsets are missed from evaluation. Involves combinatorial searches.

– Sequential - Features are added (forward searches) or removed (backward searches) one at a time.

– Random - The initial subset, or even subsequent subsets, are randomly chosen to escape local optima.

• Depending on the evaluation technique:

– Distance Measures - Choosing the features that maximize the difference in separability, divergence or discrimination measures.

– Information Measures - Choosing the features that maximize the information gain, that is, minimizing the posterior uncertainty.

– Dependency Measures - Measuring the correlation between features.

– Consistency Measures - Finding a minimum number of features that separate classes as consistently as the full set of features can.

– Predictive Accuracy - Use the selected features to predict the labels.

– Cluster Goodness - Use the selected features to perform clustering and evaluate the result (cluster compactness, scatter separability, maximum likelihood).

The distance, information, correlation and consistency measures are typical of variable ranking algorithms, commonly used in filter models. Predictive accuracy and cluster goodness allow evaluating subsets of features and can be used in wrapper and embedded models.

In this thesis we developed some algorithms following the embedded paradigm, either in the supervised or the unsupervised framework. Integrating the subset selection problem in the overall learning problem may be computationally demanding, but it is appealing from a conceptual viewpoint: there is a perfect match between the formalized goal and the process dedicated to achieving this goal, thus avoiding many problems arising in filter or wrapper methods. Practically, it is however intractable to solve exactly hard selection problems when the number of features exceeds a few tens. Regularization techniques allow to provide a sensible approximate answer to the selection problem with reasonable computing resources, and their recent study has demonstrated powerful theoretical and empirical results. The following section introduces the tools that will be employed in Parts II and III.

2.3 Regularization

In the machine learning domain, the term "regularization" refers to a technique that introduces some extra assumptions or knowledge in the resolution of an optimization problem. The most popular point of view presents regularization as a mechanism to prevent overfitting, but it can also help to fix some numerical issues in ill-posed problems (like some matrix singularities when solving a linear system), besides providing other interesting properties, like the capacity to induce sparsity, thus producing models that are easier to interpret.

An ill-posed problem violates the rules defined by Jacques Hadamard, according to whom the solution to a mathematical problem has to exist, be unique and be stable. This happens, for example, when the number of samples is smaller than their dimensionality and we try to infer some generic laws from such a small sample of the population. Regularization transforms an ill-posed problem into a well-posed one. To do that, some a priori knowledge is introduced in the solution through a regularization term that penalizes a criterion J with a penalty P. Below are the two most popular formulations:

min_β J(β) + λ P(β)                                          (2.1)

min_β J(β)   s.t.   P(β) ≤ t                                 (2.2)

In the expressions (2.1) and (2.2), the parameters λ and t have a similar function, that is, to control the trade-off between fitting the data to the model according to J(β) and the effect of the penalty P(β). The set such that the constraint in (2.2) is verified, {β : P(β) ≤ t}, is called the admissible set. This penalty term can also be understood as a measure that quantifies the complexity of the model (as in the definition of Sammut and Webb, 2010). Note that regularization terms can also be interpreted in the Bayesian paradigm as prior distributions on the parameters of the model. In this thesis both views will be taken.

In this section I review the pure, mixed and hybrid penalties that will be used in the following chapters to implement feature selection. I first list important properties that may pertain to any type of penalty.

Figure 2.3: Admissible sets in two dimensions for different pure norms ||β||p

2.3.1 Important Properties

Penalties may have different properties that can be more or less interesting depending on the problem and the expected solution. The most important properties for our purposes here are convexity, sparsity and stability.

Convexity. Regarding optimization, convexity is a desirable property that eases finding global solutions. A convex function verifies

∀(x1, x2) ∈ X², f(t x1 + (1 − t) x2) ≤ t f(x1) + (1 − t) f(x2)                  (2.3)

for any value of t ∈ [0, 1]. Replacing the inequality by a strict inequality, we obtain the definition of strict convexity. A regularized expression like (2.2) is convex if the function J(β) and the penalty P(β) are both convex.

Sparsity. Usually, null coefficients furnish models that are easier to interpret. When sparsity does not harm the quality of the predictions, it is a desirable property, which moreover entails less memory usage and computation resources.

Stability. There are numerous notions of stability or robustness, which measure how the solution varies when the input is perturbed by small changes. This perturbation can be adding, removing or replacing a few elements in the training set. Adding regularization, in addition to preventing overfitting, is a means to favor the stability of the solution.

2.3.2 Pure Penalties

For pure penalties, defined as P(β) = ||β||p, convexity holds for p ≥ 1. This is graphically illustrated in Figure 2.3, borrowed from Szafranski (2008), whose Chapter 3 is an excellent review of regularization techniques and of the algorithms to solve them. In this figure, the shape of the admissible sets corresponding to different pure penalties is greyed out. Since the convexity of the penalty corresponds to the convexity of the set, we see that this property is verified for p ≥ 1.

Figure 2.4: Two-dimensional regularized problems with ||β||1 and ||β||2 penalties

Regularizing a linear model with a norm like ||β||p means that the larger the component |βj|, the more important the feature xj in the estimation. On the contrary, the closer to zero, the more dispensable it is. In the limit of |βj| = 0, xj is not involved in the model. If many dimensions can be dismissed, then we can speak of sparsity.

A graphical interpretation of sparsity, borrowed from Marie Szafranski, is given in Figure 2.4. In a 2D problem, a solution can be considered as sparse if any of its components (β1 or β2) is null, that is, if the optimal β is located on one of the coordinate axes. Let us consider a search algorithm that minimizes an expression like (2.2), where J(β) is a quadratic function. When the solution to the unconstrained problem does not belong to the admissible set defined by P(β) (greyed out area), the solution to the constrained problem is as close as possible to the global minimum of the cost function inside the grey region. Depending on the shape of this region, the probability of having a sparse solution varies. A region with vertexes, such as the one corresponding to an L1 penalty, has more chances of inducing sparse solutions than that of an L2 penalty. This idea is displayed in Figure 2.4, where J(β) is a quadratic function represented with three isolevel curves whose global minimum βls is outside the penalties' admissible region. The closest point to this βls for the L1 regularization is βl1, and for the L2 regularization it is βl2. Solution βl1 is sparse because its second component is zero, while both components of βl2 are different from zero.

After reviewing the regions from Figure 2.3, we can relate the capacity of generating sparse solutions to the quantity and the "sharpness" of the vertexes of the greyed out area. For example, an L1/3 penalty has a support region with sharper vertexes, which would induce a sparse solution even more strongly than an L1 penalty; however, the non-convex shape of the L1/3 results in difficulties during optimization that will not happen with a convex shape.


To summarize, a convex problem with a sparse solution is what is desired. But with pure penalties, sparsity is only possible with Lp norms with p ≤ 1, due to the fact that they are the only ones that have vertexes. On the other side, only norms with p ≥ 1 are convex; hence, the only pure penalty that builds a convex problem with a sparse solution is the L1 penalty.

L0 Penalties. The L0 pseudo-norm of a vector β is defined as the number of entries different from zero, that is, P(β) = ‖β‖0 = card{βj : βj ≠ 0}:

min_β J(β)   s.t.   ‖β‖0 ≤ t                                 (2.4)

where the parameter t represents the maximum number of non-zero coefficients in vector β. The larger the value of t (or the lower the value of λ, if we use the equivalent expression in (2.1)), the fewer the number of zeros induced in vector β. If t is equal to the dimensionality of the problem (or if λ = 0), then the penalty term is not effective and β is not altered. In general, the computation of the solutions relies on combinatorial optimization schemes. Their solutions are sparse but unstable.
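
As an illustration of why such combinatorial schemes do not scale, the following sketch (not part of the algorithms discussed in this thesis; a plain least squares cost is assumed) solves the L0-constrained problem by exhaustive search over subsets of at most t variables:

```python
# Illustrative sketch: exhaustive search for the best least squares fit under
# ||beta||_0 <= t. The loop over all subsets is what makes L0-penalized
# problems intractable beyond a few tens of features.
import itertools
import numpy as np

def best_subset(X, y, t):
    n, p = X.shape
    best_beta, best_cost = np.zeros(p), np.inf
    for size in range(1, t + 1):
        for subset in itertools.combinations(range(p), size):
            cols = list(subset)
            coef, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            cost = np.sum((y - X[:, cols] @ coef) ** 2)
            if cost < best_cost:
                best_cost = cost
                best_beta = np.zeros(p)
                best_beta[cols] = coef
    return best_beta
```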

L1 Penalties. Penalties built using the L1 norm induce sparsity and stability. This penalty has been named the Lasso (Least Absolute Shrinkage and Selection Operator) by Tibshirani (1996):

min_β J(β)   s.t.   Σ_{j=1}^{p} |βj| ≤ t                     (2.5)

Despite all the advantages of the Lasso, the choice of the right penalty is not just a question of convexity and sparsity. For example, concerning the Lasso, Osborne et al. (2000a) have shown that when the number of examples n is lower than the number of variables p, then the maximum number of non-zero entries of β is n. Therefore, if there is a strong correlation between several variables, this penalty risks dismissing all but one, resulting in a hardly interpretable model. In a field like genomics, where n is typically some tens of individuals and p several thousands of genes, the performance of the algorithm and the interpretability of the genetic relationships are severely limited.

The Lasso is a popular tool that has been used in multiple contexts besides regression, particularly in the field of feature selection in supervised classification (Mai et al., 2012; Witten and Tibshirani, 2011) and clustering (Roth and Lange, 2004; Pan et al., 2006; Pan and Shen, 2007; Zhou et al., 2009; Guo et al., 2010; Witten and Tibshirani, 2010; Bouveyron and Brunet, 2012b,a).

The consistency of the problems regularized by a Lasso penalty is also a key feature. Defining consistency as the capability of always making the right choice of relevant variables when the number of individuals is infinitely large, Leng et al. (2006) have shown that when the penalty parameter (t or λ, depending on the formulation) is chosen by minimization of the prediction error, the Lasso penalty does not lead to consistent models. There is a large bibliography defining conditions under which Lasso estimators become consistent (Knight and Fu, 2000; Donoho et al., 2006; Meinshausen and Bühlmann, 2006; Zhao and Yu, 2007; Bach, 2008). In addition to those papers, some authors have introduced modifications to improve the interpretability and the consistency of the Lasso, such as the adaptive Lasso (Zou, 2006).

L2 Penalties. The graphical interpretation of pure norm penalties in Figure 2.3 shows that this norm does not induce sparsity, due to its lack of vertexes. Strictly speaking, the L2 norm involves the square root of the sum of all squared components. In practice, when using L2 penalties, the square of the norm is used, to avoid the square root and solve a linear system. Thus, an L2 penalized optimization problem looks like

min_β J(β) + λ ‖β‖2²                                         (2.6)

The effect of this penalty is the "equalization" of the components of the parameter that is being penalized. To enlighten this property, let us consider a least squares problem

min_β Σ_{i=1}^{n} (yi − xi^⊤ β)²                             (2.7)

with solution βls = (X^⊤X)^{−1} X^⊤ y. If some input variables are highly correlated, the estimator βls is very unstable. To fix this numerical instability, Hoerl and Kennard (1970) proposed ridge regression, which regularizes Problem (2.7) with a quadratic penalty:

min_β Σ_{i=1}^{n} (yi − xi^⊤ β)² + λ Σ_{j=1}^{p} βj²

The solution to this problem is βl2 = (X^⊤X + λ Ip)^{−1} X^⊤ y. All eigenvalues, in particular the small ones corresponding to the correlated dimensions, are now moved upwards by λ. This can be enough to avoid the instability induced by small eigenvalues. This "equalization" of the coefficients reduces the variability of the estimation, which may improve performances.
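
As a concrete illustration of the closed-form ridge solution above, the following sketch (synthetic data, for illustration only) contrasts the plain least squares estimate with the ridge estimate when two columns of X are nearly collinear:

```python
# Sketch of the ridge estimator on synthetic correlated data (illustrative).
# The eigenvalues of X'X are shifted upwards by lambda, which stabilizes the
# inverse when features are strongly correlated.
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 50, 10, 1.0
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=n)   # two nearly collinear columns
y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=n)

beta_ls = np.linalg.solve(X.T @ X, X.T @ y)                    # ill-conditioned system
beta_l2 = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)  # ridge solution above
```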

As with the Lasso operator, there are several variations of ridge regression. For example, Breiman (1995) proposed the nonnegative garrote, which looks like a ridge regression where each variable is penalized adaptively. To do that, the least squares solution is used to define the penalty parameter attached to each coefficient:

min_β Σ_{i=1}^{n} (yi − xi^⊤ β)² + λ Σ_{j=1}^{p} βj² / (βj^{ls})²               (2.8)

The effect is an elliptic admissible set instead of the ball of ridge regression. Another example is adaptive ridge regression (Grandvalet, 1998; Grandvalet and Canu, 2002), where the penalty parameter differs for each component: every λj is optimized to penalize more or less, depending on the influence of βj in the model.

Although L2 penalized problems are stable, they are not sparse. That makes those models harder to interpret, mainly in high dimensions.

L∞ Penalties. A special case of Lp norms is the infinity norm, defined as ‖x‖∞ = max(|x1|, |x2|, ..., |xp|). The admissible region for a penalty like ‖β‖∞ ≤ t is displayed in Figure 2.3. For the L∞ norm, the greyed out region is a square containing all the β vectors whose largest coefficient is less than or equal to the value of the penalty parameter t.

This norm is not commonly used as a regularization term by itself; however, it frequently appears combined in mixed penalties, as shown in Section 2.3.4. In addition, in the optimization of penalized problems there exists the concept of dual norms. Dual norms arise in the analysis of estimation bounds and in the design of algorithms that address optimization problems by solving an increasing sequence of small subproblems (working set algorithms). The dual norm plays a direct role in computing the optimality conditions of sparse regularized problems. The dual norm ‖β‖* of a norm ‖β‖ is defined as

‖β‖* = max_{w∈R^p} β^⊤ w   s.t.   ‖w‖ ≤ 1

In the case of an Lq norm with q ∈ [1, +∞], the dual norm is the Lr norm such that 1/q + 1/r = 1. For example, the L2 norm is self-dual, and the dual norm of the L1 norm is the L∞ norm. This is one of the reasons why the L∞ norm is so important even though it is not so popular as a penalty itself: because the L1 norm is. An extensive explanation about dual norms and the algorithms that make use of them can be found in Bach et al. (2011).
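
The duality between the L1 and L∞ norms can also be checked numerically; the sketch below (illustrative only) maximizes β^⊤w over points of the L1 unit sphere, including its vertices, and recovers ‖β‖∞:

```python
# Numerical illustration of the duality between L1 and L_infinity:
# maximizing beta' w over the L1 unit ball yields ||beta||_inf, attained at a
# signed canonical basis vector (a vertex of the L1 ball).
import numpy as np

rng = np.random.default_rng(0)
beta = rng.normal(size=5)

W = rng.normal(size=(10000, 5))
W /= np.abs(W).sum(axis=1, keepdims=True)       # random points with ||w||_1 = 1
vertices = np.vstack([np.eye(5), -np.eye(5)])   # vertices of the L1 ball
candidates = np.vstack([W, vertices])

dual_value = (candidates @ beta).max()
print(dual_value, np.abs(beta).max())           # both equal ||beta||_inf
```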

2.3.3 Hybrid Penalties

There is no reason for using pure penalties in isolation. We can combine them and try to obtain different benefits from each of them. The most popular example is the Elastic net regularization (Zou and Hastie, 2005), with the objective of improving the Lasso penalization when n ≤ p. As recalled in Section 2.3.2, when n ≤ p the Lasso penalty can select at most n non-null features. Thus, in situations where there are more relevant variables, the Lasso penalty risks selecting only some of them. To avoid this effect, a combination of L1 and L2 penalties has been proposed. For the least squares example (2.7) from Section 2.3.2, the Elastic net is

min_β Σ_{i=1}^{n} (yi − xi^⊤ β)² + λ1 Σ_{j=1}^{p} |βj| + λ2 Σ_{j=1}^{p} βj²     (2.9)

The term in λ1 is a Lasso penalty that induces sparsity in vector β; on the other side, the term in λ2 is a ridge regression penalty that provides universal strong consistency (De Mol et al., 2009), that is, the asymptotic capability (when n goes to infinity) of always making the right choice of relevant variables.
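
A practical consequence, noted by Zou and Hastie (2005), is that the quadratic term can be absorbed into an augmented least squares problem, so that any plain Lasso solver applied to the augmented data with penalty λ1 solves the criterion (2.9). The sketch below (illustrative only, synthetic data) checks the underlying identity numerically:

```python
# Illustrative check: the quadratic part of the elastic net criterion can be
# written as a residual sum of squares on augmented data, i.e.
#   ||y_aug - X_aug b||^2 = ||y - X b||^2 + lam2 * ||b||^2   for any b.
import numpy as np

rng = np.random.default_rng(0)
n, p, lam2 = 30, 8, 0.7
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
b = rng.normal(size=p)

X_aug = np.vstack([X, np.sqrt(lam2) * np.eye(p)])   # p extra "ridge" rows
y_aug = np.concatenate([y, np.zeros(p)])

lhs = np.sum((y_aug - X_aug @ b) ** 2)
rhs = np.sum((y - X @ b) ** 2) + lam2 * np.sum(b ** 2)
print(np.isclose(lhs, rhs))   # True
```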


2.3.4 Mixed Penalties

Imagine a linear regression problem where each variable is a gene. Depending on the application, several biological processes can be identified by L different groups of genes. Let us identify as G` the group of genes for the `th process and d` the number of genes (variables) in each group, ∀` ∈ {1, . . . , L}. Thus, the dimension of vector β will be the sum of the number of genes of every group, dim(β) = Σ_{`=1}^{L} d`. Mixed norms are a type of norms that take those groups into consideration. The general expression is shown below:

‖β‖_(r,s) = ( Σ_` ( Σ_{j∈G`} |βj|^s )^{r/s} )^{1/r}                             (2.10)

The pair (r, s) identifies the norms that are combined: an Ls norm within groups and an Lr norm between groups. The Ls norm penalizes the variables in every group G`, while the Lr norm penalizes the within-group norms. The pair (r, s) is set so as to induce different properties in the resulting β vector. Note that the outer norm is often weighted to adjust for the different cardinalities of the groups, in order to avoid favoring the selection of the largest groups.
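
As a small illustration (not part of the original text), the mixed norm of equation (2.10) can be computed directly from a list of group indices:

```python
# Illustrative sketch of the mixed norm ||beta||_(r,s): an L_s norm within each
# group, then an L_r norm of the resulting group norms. `groups` is a list of
# index lists, one per group.
import numpy as np

def mixed_norm(beta, groups, r=1.0, s=2.0):
    group_norms = [np.sum(np.abs(beta[idx]) ** s) ** (1.0 / s) for idx in groups]
    return np.sum(np.array(group_norms) ** r) ** (1.0 / r)

beta = np.array([0.0, 0.0, 1.5, -2.0, 0.3])
groups = [[0, 1], [2, 3], [4]]
print(mixed_norm(beta, groups, r=1, s=2))   # group-Lasso norm ||beta||_(1,2)
```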

Several combinations are available; the most popular is the norm ‖β‖_(1,2), known as the group-Lasso (Yuan and Lin, 2006; Leng, 2008; Xie et al., 2008a,b; Meier et al., 2008; Roth and Fischer, 2008; Yang et al., 2010; Sanchez Merchante et al., 2012). Figure 2.5 shows the difference between the admissible sets of a pure L1 norm and a mixed L(1,2) norm. Many other mixings are possible, such as ‖β‖_(1,4/3) (Szafranski et al., 2008) or ‖β‖_(1,∞) (Wang and Zhu, 2008; Kuan et al., 2010; Vogt and Roth, 2010). Modifications of mixed norms have also been proposed, such as the group bridge penalty (Huang et al., 2009), the composite absolute penalties (Zhao et al., 2009), or combinations of mixed and pure norms, such as Lasso and group-Lasso (Friedman et al., 2010; Sprechmann et al., 2010) or group-Lasso and ridge penalty (Ng and Abugharbieh, 2011).

2.3.5 Sparsity Considerations

In this chapter I have reviewed several possibilities to induce sparsity in the solution of optimization problems. However, having sparse solutions does not always lead to parsimonious models feature-wise. For example, if we have four parameters per feature, we look for solutions where all four parameters are null for non-informative variables.

The Lasso and the other L1 penalties encourage solutions such as the one on the left of Figure 2.6. If the objective is sparsity, then the L1 norm does the job. However, if we aim at feature selection and if the number of parameters per variable exceeds one, this type of sparsity does not target the removal of variables.

To be able to dismiss some features, the sparsity pattern must encourage null values for the same variable across parameters, as shown on the right of Figure 2.6. This can be achieved with mixed penalties that define groups of features. For example, L(1,2) or L(1,∞) mixed norms with the proper definition of groups can induce sparsity patterns such as the one on the right of Figure 2.6, which displays a solution where variables 3, 5 and 8 are removed.

Figure 2.5: Admissible sets for (a) the Lasso L1 and (b) the group-Lasso L(1,2)

Figure 2.6: Sparsity patterns for an example with 8 variables characterized by 4 parameters: (a) L1-induced sparsity, (b) L(1,2) group-induced sparsity

2.3.6 Optimization Tools for Regularized Problems

In Caramanis et al. (2012) there is a good collection of mathematical techniques and optimization methods to solve regularized problems. Another good reference is the thesis of Szafranski (2008), which also reviews some techniques, classified into four categories. Those techniques, even if they belong to different categories, can be used separately or combined to produce improved optimization algorithms.

In fact, the algorithm implemented in this thesis is inspired by three of those techniques. It could be defined as an algorithm of "active constraints", implemented following a regularization path that is updated by approaching the cost function with secant hyper-planes. Deeper details are given in the dedicated Chapter 5.

Subgradient Descent. Subgradient descent is a generic optimization method that can be used for the settings of penalized problems where the subgradient of the loss function, ∂J(β), and the subgradient of the regularizer, ∂P(β), can be computed efficiently. On the one hand, it is essentially blind to the problem structure. On the other hand, many iterations are needed, so the convergence is slow and the solutions are not sparse. Basically, it is a generalization of the iterative gradient descent algorithm, where the solution vector β(t+1) is updated proportionally to the negative subgradient of the function at the current point β(t):

β(t+1) = β(t) − α (s + λ s′),   where s ∈ ∂J(β(t)), s′ ∈ ∂P(β(t))
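
A minimal sketch of this update for a quadratic loss and a Lasso penalty is given below (illustrative only; the constant step size α and the number of iterations are arbitrary choices):

```python
# Illustrative subgradient descent for J(beta) = ||y - X beta||^2 penalized by
# lambda * ||beta||_1, where sign(0) = 0 is a valid subgradient of |.| at zero.
import numpy as np

def subgradient_lasso(X, y, lam, alpha=1e-3, n_iter=5000):
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        s = -2.0 * X.T @ (y - X @ beta)      # gradient of the quadratic loss
        s_prime = np.sign(beta)              # a subgradient of the L1 penalty
        beta = beta - alpha * (s + lam * s_prime)
    return beta                              # near-sparse, but rarely exactly sparse
```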

Coordinate Descent. Coordinate descent is based on the first order optimality conditions of the criterion (2.1). In the case of penalties like the Lasso, setting to zero the first order derivative with respect to the coefficient βj gives

βj = ( −λ sign(βj) − ∂J(β)/∂βj ) / ( 2 Σ_{i=1}^{n} xij² )

In the literature, those algorithms can also be referred to as "iterative thresholding" algorithms, because the optimization can be solved by soft-thresholding in an iterative process. As an example, Fu (1998) implements this technique, initializing every coefficient with the least squares solution βls and updating their values using an iterative thresholding algorithm where βj(t+1) = Sλ(∂J(β(t))/∂βj). The objective function is optimized with respect to one variable at a time, while all others are kept fixed:

Sλ( ∂J(β)/∂βj ) =
    ( λ − ∂J(β)/∂βj ) / ( 2 Σ_{i=1}^{n} xij² )     if ∂J(β)/∂βj > λ
    ( −λ − ∂J(β)/∂βj ) / ( 2 Σ_{i=1}^{n} xij² )    if ∂J(β)/∂βj < −λ
    0                                              if |∂J(β)/∂βj| ≤ λ
                                                                                (2.11)

The same principles define "block-coordinate descent" algorithms. In this case, the first order derivatives are applied to the equations of a group-Lasso penalty (Yuan and Lin, 2006; Wu and Lange, 2008).
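
For the quadratic loss, the update (2.11) is the classical soft-thresholding operation; the following sketch (illustrative, not the solver developed in this thesis) implements such coordinate descent sweeps for the Lasso:

```python
# Illustrative coordinate descent for J(beta) = ||y - X beta||^2 plus
# lambda * ||beta||_1: each coefficient is updated in turn by soft-thresholding
# its univariate least squares update, all others being kept fixed.
import numpy as np

def soft_threshold(z, thr):
    return np.sign(z) * np.maximum(np.abs(z) - thr, 0.0)

def cd_lasso(X, y, lam, n_sweeps=100):
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)            # sum_i x_ij^2
    residual = y - X @ beta
    for _ in range(n_sweeps):
        for j in range(p):
            residual += X[:, j] * beta[j]    # remove the contribution of beta_j
            rho = X[:, j] @ residual         # correlation with the partial residual
            beta[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]   # lam/2 from the factor 2
            residual -= X[:, j] * beta[j]
    return beta
```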

Active and Inactive Sets. Active set algorithms are also referred to as "active constraints" or "working set" methods. These algorithms define a subset of variables called the "active set", which stores the indices of the variables with non-zero βj. It is usually identified as the set A. The complement of the active set is the "inactive set", noted Ā. In the inactive set we can find the indices of the variables whose βj is zero. Thus, the problem can be simplified to the dimensionality of A.

Osborne et al. (2000a) proposed the first of those algorithms to solve quadratic problems with Lasso penalties. Their algorithm starts from an empty active set that is updated incrementally (forward growing). There also exists a backward view, where relevant variables are allowed to leave the active set; however, the forward philosophy that starts with an empty A has the advantage that the first calculations are low dimensional. In addition, the forward view fits better the feature selection intuition, where few features are intended to be selected.

Working set algorithms have to deal with three main tasks. There is an optimization task, where a minimization problem has to be solved using only the variables from the active set; Osborne et al. (2000a) solve a linear approximation of the original problem to determine the objective function descent direction, but any other method can be considered. In general, the solutions of successive active sets are typically close to each other, so it is a good idea to use the solution of the previous iteration as the initialization of the current one (warm start). Besides the optimization task, there is a working set update task, where the active set A is augmented with the variable from the inactive set Ā that most violates the optimality conditions of Problem (2.1). Finally, there is also a task to compute the optimality conditions. Their expressions are essential in the selection of the next variable to add to the active set and to test whether a particular vector β is a solution of Problem (2.1).

These active constraints or working set methods, even if they were originally proposed to solve L1 regularized quadratic problems, can also be adapted to generic functions and penalties: for example, linear functions and L1 penalties (Roth, 2004), linear functions and L1,2 penalties (Roth and Fischer, 2008), or even logarithmic cost functions and combinations of L0, L1 and L2 penalties (Perkins et al., 2003). The algorithm developed in this work belongs to this family of solutions.

23 Regularization

and L12 penalties (Roth and Fischer 2008) or even logarithmic cost functions and com-binations of L0 L1 and L2 penalties (Perkins et al 2003) The algorithm developed inthis work belongs to this family of solutions

Hyper-Planes Approximation. Hyper-plane approximations solve a regularized problem using a piecewise linear approximation of the original cost function. This convex approximation is built using several secant hyper-planes at different points, obtained from the sub-gradient of the cost function at these points.

This family of algorithms implements an iterative mechanism where the number of hyper-planes increases at every iteration. These techniques are useful with large populations, since the number of iterations needed to converge does not depend on the size of the dataset. On the contrary, if few hyper-planes are used, then the quality of the approximation is not good enough and the solution can be unstable.

This family of algorithms is not as popular as the previous one, but some examples can be found in the domain of Support Vector Machines (Joachims, 2006; Smola et al., 2008; Franc and Sonnenburg, 2008) or Multiple Kernel Learning (Sonnenburg et al., 2006).

Regularization Path. The regularization path is the set of solutions that can be reached when solving a series of optimization problems of the form (2.1), where the penalty parameter λ is varied. It is not an optimization technique per se, but it is of practical use when the exact regularization path can be easily followed. Rosset and Zhu (2007) stated that this path is piecewise linear for those problems where the cost function is piecewise quadratic and the regularization term is piecewise linear (or vice-versa).

This concept was first applied to the Lasso algorithm of Osborne et al. (2000b). However, it was after the publication of the algorithm called Least Angle Regression (LARS), developed by Efron et al. (2004), that those techniques became popular. LARS defines the regularization path using active constraint techniques.

Once an active set A(t) and its corresponding solution β(t) have been set, looking for the regularization path means looking for a direction h and a step size γ to update the solution as β(t+1) = β(t) + γh. Afterwards, the active and inactive sets A(t+1) and Ā(t+1) are updated. That can be done by looking for the variables that most strongly violate the optimality conditions. Hence, LARS sets the update step size, and which variable should enter the active set, from the correlation with the residuals.

Proximal Methods. Proximal methods optimize an objective function of the form (2.1), resulting from the addition of a Lipschitz-differentiable cost function J(β) and a non-differentiable penalty λP(β):

min_{β∈R^p} J(β(t)) + ∇J(β(t))^⊤ (β − β(t)) + λ P(β) + (L/2) ‖β − β(t)‖2²       (2.12)

They are also iterative methods, where the cost function J(β) is linearized in the proximity of the solution β(t), so that the problem to solve at each iteration looks like (2.12), where the parameter L > 0 should be an upper bound on the Lipschitz constant of the gradient ∇J. That can be rewritten as

min_{β∈R^p} (1/2) ‖β − (β(t) − (1/L) ∇J(β(t)))‖2² + (λ/L) P(β)                  (2.13)

The basic algorithm makes use of the solution to (2.13) as the next value β(t+1). However, there are faster versions that take advantage of information about previous steps, such as the ones described by Nesterov (2007) or the FISTA algorithm (Beck and Teboulle, 2009). Proximal methods can be seen as generalizations of gradient updates; in fact, setting λ = 0 in equation (2.13), the standard gradient update rule comes up.
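
For a quadratic loss and a Lasso penalty, the proximal operator in (2.13) reduces to soft-thresholding, which gives the basic proximal gradient ("ISTA") iteration sketched below (illustrative only, not the algorithm developed in this thesis):

```python
# Illustrative proximal gradient (ISTA) for the Lasso: a gradient step on
# J(beta) = ||y - X beta||^2 followed by the proximal operator of
# (lambda / L) * ||.||_1, i.e. soft-thresholding, as in (2.13).
import numpy as np

def ista(X, y, lam, n_iter=500):
    L = 2.0 * np.linalg.norm(X, ord=2) ** 2       # Lipschitz constant of grad J
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = -2.0 * X.T @ (y - X @ beta)
        z = beta - grad / L                        # gradient step
        beta = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)   # prox step
    return beta
```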


Part II

Sparse Linear Discriminant Analysis


Abstract

Linear discriminant analysis (LDA) aims to describe data by a linear combination of features that best separates the classes. It may be used for classifying future observations or for describing those classes.

There is a vast bibliography about sparse LDA methods, reviewed in Chapter 3. Sparsity is typically induced by regularizing the discriminant vectors or the class means with L1 penalties (see Chapter 2). Section 2.3.5 discussed why this sparsity-inducing penalty may not guarantee parsimonious models regarding variables.

In this part we develop the group-Lasso Optimal Scoring Solver (GLOSS), which addresses a sparse LDA problem globally, through a regression approach of LDA. Our analysis, presented in Chapter 4, formally relates GLOSS to Fisher's discriminant analysis and also enables to derive variants, such as LDA assuming a diagonal within-class covariance structure (Bickel and Levina, 2004). The group-Lasso penalty selects the same features in all discriminant directions, leading to a more interpretable low-dimensional representation of the data. The discriminant directions can be used in their totality, or the first ones may be chosen to produce a reduced-rank classification. The first two or three directions can also be used to project the data, so as to generate a graphical display of the data. The algorithm is detailed in Chapter 5, and our experimental results of Chapter 6 demonstrate that, compared to the competing approaches, the models are extremely parsimonious without compromising prediction performances. The algorithm efficiently processes medium to large numbers of variables and is thus particularly well suited to the analysis of gene expression data.


3 Feature Selection in Fisher Discriminant Analysis

3.1 Fisher Discriminant Analysis

Linear discriminant analysis (LDA) aims to describe n labeled observations belonging to K groups by a linear combination of features which characterizes or separates classes. It is used for two main purposes: classifying future observations, or describing the essential differences between classes, either by providing a visual representation of data or by revealing the combinations of features that discriminate between classes. There are several frameworks in which linear combinations can be derived; Friedman et al. (2009) dedicate a whole chapter to linear methods for classification. In this part we focus on Fisher's discriminant analysis, which is a standard tool for linear discriminant analysis whose formulation does not rely on posterior probabilities but rather on some inertia principles (Fisher 1936).

We consider that the data consist of a set of n examples, with observations xᵢ ∈ R^p comprising p features, and labels yᵢ ∈ {0, 1}^K indicating the exclusive assignment of observation xᵢ to one of the K classes. It will be convenient to gather the observations in the n×p matrix X = (x₁⊤, ..., xₙ⊤)⊤ and the corresponding labels in the n×K matrix Y = (y₁⊤, ..., yₙ⊤)⊤.

Fisher's discriminant problem was first proposed for two-class problems, for the analysis of the famous iris dataset, as the maximization of the ratio of the projected between-class covariance to the projected within-class covariance:

max_{β∈R^p}  (β⊤Σ_B β) / (β⊤Σ_W β)        (3.1)

where β is the discriminant direction used to project the data, and Σ_B and Σ_W are the p×p between-class and within-class covariance matrices, respectively defined (for a K-class problem) as

Σ_W = (1/n) ∑_{k=1}^K ∑_{i∈G_k} (x_i − μ_k)(x_i − μ_k)⊤

Σ_B = (1/n) ∑_{k=1}^K ∑_{i∈G_k} (μ − μ_k)(μ − μ_k)⊤

where μ is the sample mean of the whole dataset, μ_k the sample mean of class k, and G_k indexes the observations of class k.


This analysis can be extended to the multi-class framework with K groups. In this case, K − 1 discriminant vectors β_k may be computed. Such a generalization was first proposed by Rao (1948). Several formulations for the multi-class Fisher's discriminant are available, for example as the maximization of a trace ratio:

max_{B∈R^{p×(K−1)}}  tr(B⊤Σ_B B) / tr(B⊤Σ_W B)        (3.2)

where the B matrix is built with the discriminant directions β_k as columns.
Solving the multi-class criterion (3.2) is an ill-posed problem; a better formulation is based on a series of K − 1 subproblems:

max_{β_k∈R^p}  β_k⊤ Σ_B β_k
s.t.  β_k⊤ Σ_W β_k ≤ 1
      β_k⊤ Σ_W β_ℓ = 0 ,  ∀ℓ < k        (3.3)

The maximizer of subproblem k is the eigenvector of Σ_W^{-1} Σ_B associated to the kth largest eigenvalue (see Appendix C).
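As an illustration, a minimal matlab sketch (assuming the sample matrices SigmaB and SigmaW defined above have been computed and SigmaW is invertible; variable names are assumptions) obtains the discriminant directions as leading generalized eigenvectors:

    % Discriminant directions: leading eigenvectors of SigmaW^{-1}*SigmaB, obtained by
    % solving the generalized eigenvalue problem SigmaB*b = lambda*SigmaW*b
    [V, D] = eig(SigmaB, SigmaW);
    [~, order] = sort(diag(D), 'descend');
    B = V(:, order(1:K-1));               % one column per discriminant direction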

3.2 Feature Selection in LDA Problems

LDA is often used as a data reduction technique, where the K − 1 discriminant directions summarize the p original variables. However, all variables intervene in the definition of these discriminant directions, and this behavior may be troublesome.

Several modifications of LDA have been proposed to generate sparse discriminant directions. Sparse LDA reveals discriminant directions that only involve a few variables. The main target of this sparsity is to reduce the dimensionality of the problem (as in genetic analysis), but parsimonious classification is also motivated by the need for interpretable models, robustness in the solution, or computational constraints.

The easiest approach to sparse LDA performs variable selection before discrimination. The relevancy of each feature is usually based on univariate statistics, which are fast and convenient to compute, but whose very partial view of the overall classification problem may lead to dramatic information loss. As a result, several approaches have been devised in recent years to construct LDA with wrapper and embedded feature selection capabilities.

They can be categorized according to the LDA formulation that provides the basis for the sparsity-inducing extension, that is, either Fisher's discriminant analysis (variance-based) or regression-based formulations.

3.2.1 Inertia Based

The Fisher discriminant seeks a projection maximizing the separability of classes from inertia principles: mass centers should be far away (large between-class variance) and classes should be concentrated around their mass centers (small within-class variance). This view motivates a first series of sparse LDA formulations.

Moghaddam et al. (2006) propose an algorithm for sparse LDA in binary classification, where sparsity originates in a hard cardinality constraint. The formalization is based on the Fisher's discriminant (3.1), reformulated as a quadratically-constrained quadratic program (3.3). Computationally, the algorithm implements a combinatorial search with some eigenvalue properties that are used to avoid exploring subsets of possible solutions. Extensions of this approach have been developed, with new sparsity bounds for the two-class discrimination problem and shortcuts to speed up the evaluation of eigenvalues (Moghaddam et al. 2007).

Also for binary problems, Wu et al. (2009) proposed a sparse LDA applied to gene expression data, where the Fisher's discriminant (3.1) is solved as

min_{β∈R^p}  β⊤Σ_W β
s.t.  (μ_1 − μ_2)⊤β = 1
      ∑_{j=1}^p |β_j| ≤ t

where μ_1 and μ_2 are the vectors of mean gene expression values corresponding to the two groups. The expression to optimize and the first constraint match problem (3.1); the second constraint encourages parsimony.

Witten and Tibshirani (2011) describe a multi-class technique using the Fisher's discriminant rewritten in the form of K − 1 constrained and penalized maximization problems:

max_{β_k∈R^p}  β_k⊤ Σ_B^k β_k − P_k(β_k)
s.t.  β_k⊤ Σ_W β_k ≤ 1

The term to maximize is the projected between-class covariance β_k⊤ Σ_B β_k, subject to an upper bound on the projected within-class covariance β_k⊤ Σ_W β_k. The penalty P_k(β_k) is added to avoid singularities and induce sparsity. The authors suggest weighted versions of the regular Lasso and fused Lasso penalties for general purpose data: the Lasso shrinks less informative variables to zero, and the fused Lasso encourages a piecewise constant β_k vector. The R code is available from the website of Daniela Witten.

Cai and Liu (2011) use the Fisher's discriminant to solve a binary LDA problem. But instead of performing separate estimations of Σ_W and (μ_1 − μ_2) to obtain the optimal solution β = Σ_W^{-1}(μ_1 − μ_2), they estimate the product directly, through a constrained L1 minimization:

min_{β∈R^p}  ‖β‖₁
s.t.  ‖Σβ − (μ_1 − μ_2)‖_∞ ≤ λ

Sparsity is encouraged by the L1 norm of the vector β, and the parameter λ is used to tune the optimization.


Most of the algorithms reviewed are conceived for binary classification. For those that are envisaged for multi-class scenarios, the Lasso is the most popular way to induce sparsity; however, as discussed in Section 2.3.5, the Lasso is not the best tool to encourage parsimonious models when there are multiple discriminant directions.

3.2.2 Regression Based

In binary classification, LDA has been known to be equivalent to linear regression of scaled class labels since Fisher (1936). For K > 2, many studies show that multivariate linear regression of a specific class indicator matrix can be applied as a preprocessing step for LDA. However, directly casting LDA as a least squares problem is challenging for the multi-class case (Duda et al. 2000, Friedman et al. 2009).

Predefined Indicator Matrix

Multi-class classification is usually linked with linear regression through the definition of an indicator matrix (Friedman et al. 2009). An indicator matrix Y is an n×K matrix with the class labels of all samples. There are several well-known types in the literature. For example, the binary or dummy indicator (y_ik = 1 if sample i belongs to class k, and y_ik = 0 otherwise) is commonly used in linking multi-class classification with linear regression (Friedman et al. 2009). Another "popular" choice is y_ik = 1 if sample i belongs to class k and y_ik = −1/(K−1) otherwise; it was used, for example, in extending Support Vector Machines to multi-class classification (Lee et al. 2004) or for generalizing the kernel target alignment measure (Guermeur et al. 2004).

There are some efforts which propose a formulation of the least squares problem based on a new class indicator matrix (Ye 2007). This new indicator matrix allows the definition of LS-LDA (Least Squares Linear Discriminant Analysis), which holds a rigorous equivalence with multi-class LDA under a mild condition, shown empirically to hold in many applications involving high-dimensional data.

Qiao et al. (2009) propose a discriminant analysis for the high-dimensional, low-sample-size setting which incorporates variable selection in a Fisher's LDA formulated as a generalized eigenvalue problem, which is then recast as a least squares regression. Sparsity is obtained by means of a Lasso penalty on the discriminant vectors. Even if this is not mentioned in the article, their formulation looks very close in spirit to Optimal Scoring regression. Some rather clumsy steps in the developments hinder the comparison, so that further investigations are required. The lack of publicly available code also restrained an empirical test of this conjecture. If the similitude is confirmed, their formalization would be very close to the one of Clemmensen et al. (2011), reviewed in the following section.

In a recent paper, Mai et al. (2012) take advantage of the equivalence between ordinary least squares and LDA problems to propose a binary classifier solving a penalized least squares problem with a Lasso penalty. The sparse version of the projection vector β is obtained by solving

min_{β∈R^p, β_0∈R}  n^{-1} ∑_{i=1}^n (y_i − β_0 − x_i⊤β)² + λ ∑_{j=1}^p |β_j|

where y_i is the binary indicator of the label of pattern x_i. Even if the authors focus on the Lasso penalty, they also suggest any other generic sparsity-inducing penalty. The decision rule x⊤β + β_0 > 0 is the LDA classifier when it is built using the resulting β vector for λ = 0, but a different intercept β_0 is required.

Optimal Scoring

In binary classification, the regression of (scaled) class indicators enables to recover exactly the LDA discriminant direction. For more than two classes, regressing predefined indicator matrices may be impaired by the masking effect, where the scores assigned to a class situated between two other ones never dominate (Hastie et al. 1994). Optimal scoring (OS) circumvents the problem by assigning "optimal scores" to the classes. This route was opened by Fisher (1936) for binary classification and pursued for more than two classes by Breiman and Ihaka (1984), with the aim of developing a non-linear extension of discriminant analysis based on additive models. They named their approach optimal scaling, for it optimizes the scaling of the indicators of classes together with the discriminant functions. Their approach was later disseminated under the name optimal scoring by Hastie et al. (1994), who proposed several extensions of LDA, either aiming at constructing more flexible discriminants (Hastie and Tibshirani 1996) or more conservative ones (Hastie et al. 1995).

As an alternative method to solve LDA problems, Hastie et al. (1995) proposed to incorporate a smoothness prior on the discriminant directions in the OS problem, through a positive-definite penalty matrix Ω, leading to a problem expressed in compact form as

min_{Θ, B}  ‖YΘ − XB‖²_F + λ tr(B⊤ΩB)        (3.4a)
s.t.  n^{-1} Θ⊤Y⊤YΘ = I_{K−1}        (3.4b)

where Θ ∈ R^{K×(K−1)} are the class scores, B ∈ R^{p×(K−1)} are the regression coefficients, and ‖·‖_F is the Frobenius norm. This compact form does not render the order that arises naturally when considering the following series of K − 1 problems:

min_{θ_k∈R^K, β_k∈R^p}  ‖Yθ_k − Xβ_k‖² + β_k⊤Ωβ_k        (3.5a)
s.t.  n^{-1} θ_k⊤Y⊤Yθ_k = 1        (3.5b)
      θ_k⊤Y⊤Yθ_ℓ = 0 ,  ℓ = 1, ..., k − 1        (3.5c)

where each βk corresponds to a discriminant direction


Several sparse LDA methods have been derived by introducing non-quadratic sparsity-inducing penalties in the OS regression problem (Ghosh and Chinnaiyan 2005, Leng 2008, Grosenick et al. 2008, Clemmensen et al. 2011). Grosenick et al. (2008) proposed a variant of the Lasso-based penalized OS of Ghosh and Chinnaiyan (2005) by introducing an elastic-net penalty in binary class problems. A generalization to multi-class problems was suggested by Clemmensen et al. (2011), where the objective function (3.5a) is replaced by

min_{β_k∈R^p, θ_k∈R^K}  ∑_k ‖Yθ_k − Xβ_k‖²₂ + λ₁ ‖β_k‖₁ + λ₂ β_k⊤Ωβ_k

where λ₁ and λ₂ are regularization parameters and Ω is a penalization matrix, often taken to be the identity for the elastic net. The code for SLDA is available from the website of Line Clemmensen.

Another generalization of the work of Ghosh and Chinnaiyan (2005) was proposed by Leng (2008), with an extension to the multi-class framework based on a group-Lasso penalty in the objective function (3.5a):

min_{β_k∈R^p, θ_k∈R^K}  ∑_{k=1}^{K−1} ‖Yθ_k − Xβ_k‖²₂ + λ ∑_{j=1}^p ( ∑_{k=1}^{K−1} β²_{kj} )^{1/2}        (3.6)

which is the criterion that was chosen in this thesis.
The following chapters present our theoretical and algorithmic contributions regarding this formulation. The proposal of Leng (2008) was heuristically driven, and his algorithm followed closely the group-Lasso algorithm of Yuan and Lin (2006), which is not very efficient (the experiments of Leng (2008) are limited to small data sets with hundreds of examples and 1000 preselected genes, and no code is provided). Here, we formally link (3.6) to penalized LDA and propose a publicly available efficient code for solving this problem.


4 Formalizing the Objective

In this chapter, we detail the rationale supporting the Group-Lasso Optimal Scoring Solver (GLOSS) algorithm. GLOSS addresses a sparse LDA problem globally, through a regression approach. Our analysis formally relates GLOSS to Fisher's discriminant analysis and also enables to derive variants, such as LDA assuming a diagonal within-class covariance structure (Bickel and Levina 2004).

The sparsity arises from the group-Lasso penalty (3.6) due to Leng (2008), which selects the same features in all discriminant directions, thus providing an interpretable low-dimensional representation of data. For K classes, this representation can be either complete, in dimension K − 1, or partial for a reduced-rank classification. The first two or three discriminants can also be used to display a graphical summary of the data.

The derivation of penalized LDA as a penalized optimal scoring regression is quite tedious, but it is required here since the algorithm hinges on this equivalence. The main lines have been derived in several places (Breiman and Ihaka 1984, Hastie et al. 1994, Hastie and Tibshirani 1996, Hastie et al. 1995) and already used before for sparsity-inducing penalties (Roth and Lange 2004). However, the published demonstrations were quite elusive on a number of points, leading to generalizations that were not supported in a rigorous way. To our knowledge, we disclosed the first formal equivalence between the optimal scoring regression problem penalized by the group-Lasso and penalized LDA (Sanchez Merchante et al. 2012).

4.1 From Optimal Scoring to Linear Discriminant Analysis

Following Hastie et al. (1995), we now show the equivalence between the series of problems encountered in penalized optimal scoring (p-OS) and in penalized LDA (p-LDA), by going through canonical correlation analysis. We first provide some properties about the solutions of an arbitrary problem in the p-OS series (3.5).

Throughout this chapter we assume that

• there is no empty class, that is, the diagonal matrix Y⊤Y is full rank;

• inputs are centered, that is, X⊤1_n = 0;

• the quadratic penalty Ω is positive-semidefinite and such that X⊤X + Ω is full rank.


4.1.1 Penalized Optimal Scoring Problem

For the sake of simplicity, we now drop the subscript k to refer to any problem in the p-OS series (3.5). First, note that Problems (3.5) are biconvex in (θ, β), that is, convex in θ for each β value and vice-versa. The problems are however non-convex: in particular, if (θ, β) is a solution, then (−θ, −β) is also a solution.

The orthogonality constraint (3.5c) inherently limits the number of possible problems in the series to K, since we assumed that there are no empty classes. Moreover, as X is centered, the K − 1 first optimal scores are orthogonal to 1 (and the Kth problem would be solved by β_K = 0). All the problems considered here can be solved by a singular value decomposition of a real symmetric matrix, so that the orthogonality constraints are easily dealt with. Hence, in the sequel, we do not mention anymore these orthogonality constraints (3.5c), which apply along the route, so as to simplify all expressions. The generic problem solved is thus

min_{θ∈R^K, β∈R^p}  ‖Yθ − Xβ‖² + β⊤Ωβ        (4.1a)
s.t.  n^{-1} θ⊤Y⊤Yθ = 1        (4.1b)

For a given score vector θ, the discriminant direction β that minimizes the p-OS criterion (4.1) is the penalized least squares estimator

β_os = (X⊤X + Ω)^{-1} X⊤Yθ        (4.2)

The objective function (4.1a) is then

‖Yθ − Xβ_os‖² + β_os⊤Ωβ_os = θ⊤Y⊤Yθ − 2θ⊤Y⊤Xβ_os + β_os⊤(X⊤X + Ω)β_os
                           = θ⊤Y⊤Yθ − θ⊤Y⊤X(X⊤X + Ω)^{-1}X⊤Yθ ,

where the second line stems from the definition of β_os (4.2). Now, using the fact that the optimal θ obeys constraint (4.1b), the optimization problem is equivalent to

max_{θ: n^{-1}θ⊤Y⊤Yθ=1}  θ⊤Y⊤X(X⊤X + Ω)^{-1}X⊤Yθ        (4.3)

which shows that the optimization of the p-OS problem with respect to θ_k boils down to finding the kth largest eigenvector of Y⊤X(X⊤X + Ω)^{-1}X⊤Y. Indeed, Appendix C details that Problem (4.3) is solved by

(Y⊤Y)^{-1} Y⊤X(X⊤X + Ω)^{-1}X⊤Yθ = α²θ        (4.4)


where α² is the maximal eigenvalue:¹

n^{-1} θ⊤Y⊤X(X⊤X + Ω)^{-1}X⊤Yθ = α² n^{-1} θ⊤(Y⊤Y)θ
n^{-1} θ⊤Y⊤X(X⊤X + Ω)^{-1}X⊤Yθ = α²        (4.5)

4.1.2 Penalized Canonical Correlation Analysis

As per Hastie et al. (1995), the penalized Canonical Correlation Analysis (p-CCA) problem between variables X and Y is defined as follows:

max_{θ∈R^K, β∈R^p}  n^{-1} θ⊤Y⊤Xβ        (4.6a)
s.t.  n^{-1} θ⊤Y⊤Yθ = 1        (4.6b)
      n^{-1} β⊤(X⊤X + Ω)β = 1        (4.6c)

The solutions to (4.6) are obtained by finding the saddle points of the Lagrangian:

n L(β, θ, ν, γ) = θ⊤Y⊤Xβ − ν(θ⊤Y⊤Yθ − n) − γ(β⊤(X⊤X + Ω)β − n)

⇒  n ∂L(β, θ, γ, ν)/∂β = X⊤Yθ − 2γ(X⊤X + Ω)β

⇒  β_cca = (1/(2γ)) (X⊤X + Ω)^{-1}X⊤Yθ .

Then, as β_cca obeys (4.6c), we obtain

β_cca = (X⊤X + Ω)^{-1}X⊤Yθ / ( n^{-1} θ⊤Y⊤X(X⊤X + Ω)^{-1}X⊤Yθ )^{1/2}        (4.7)

so that the optimal objective function (4.6a) can be expressed with θ alone:

n^{-1} θ⊤Y⊤Xβ_cca = n^{-1} θ⊤Y⊤X(X⊤X + Ω)^{-1}X⊤Yθ / ( n^{-1} θ⊤Y⊤X(X⊤X + Ω)^{-1}X⊤Yθ )^{1/2}
                  = ( n^{-1} θ⊤Y⊤X(X⊤X + Ω)^{-1}X⊤Yθ )^{1/2} ,

and the optimization problem with respect to θ can be restated as

max_{θ: n^{-1}θ⊤Y⊤Yθ=1}  θ⊤Y⊤X(X⊤X + Ω)^{-1}X⊤Yθ        (4.8)

Hence, the p-OS and p-CCA problems produce the same optimal score vectors θ. The regression coefficients are thus proportional, as shown by (4.2) and (4.7):

β_os = α β_cca        (4.9)

¹The awkward notation α² for the eigenvalue was chosen here to ease comparison with Hastie et al. (1995). It is easy to check that this eigenvalue is indeed non-negative (see Equation (4.5), for example).


where α is defined by (4.5). The p-CCA optimization problem can also be written as a function of β alone, using the optimality conditions for θ:

n ∂L(β, θ, γ, ν)/∂θ = Y⊤Xβ − 2ν Y⊤Yθ

⇒  θ_cca = (1/(2ν)) (Y⊤Y)^{-1}Y⊤Xβ        (4.10)

Then, as θ_cca obeys (4.6b), we obtain

θ_cca = (Y⊤Y)^{-1}Y⊤Xβ / ( n^{-1} β⊤X⊤Y(Y⊤Y)^{-1}Y⊤Xβ )^{1/2}        (4.11)

leading to the following expression of the optimal objective function

n^{-1} θ_cca⊤Y⊤Xβ = n^{-1} β⊤X⊤Y(Y⊤Y)^{-1}Y⊤Xβ / ( n^{-1} β⊤X⊤Y(Y⊤Y)^{-1}Y⊤Xβ )^{1/2}
                  = ( n^{-1} β⊤X⊤Y(Y⊤Y)^{-1}Y⊤Xβ )^{1/2} .

The p-CCA problem can thus be solved with respect to β by plugging this value in (4.6):

max_{β∈R^p}  n^{-1} β⊤X⊤Y(Y⊤Y)^{-1}Y⊤Xβ        (4.12a)
s.t.  n^{-1} β⊤(X⊤X + Ω)β = 1        (4.12b)

where the positive objective function has been squared compared to (4.6). This formulation is important, since it will be used to link p-CCA to p-LDA. We thus derive its solution: following the reasoning of Appendix C, β_cca verifies

n^{-1} X⊤Y(Y⊤Y)^{-1}Y⊤Xβ_cca = λ (X⊤X + Ω) β_cca        (4.13)

where λ is the maximal eigenvalue, shown below to be equal to α²:

n^{-1} β_cca⊤X⊤Y(Y⊤Y)^{-1}Y⊤Xβ_cca = λ
⇒  n^{-1} α^{-1} β_cca⊤X⊤Y(Y⊤Y)^{-1}Y⊤X(X⊤X + Ω)^{-1}X⊤Yθ = λ
⇒  n^{-1} α β_cca⊤X⊤Yθ = λ
⇒  n^{-1} θ⊤Y⊤X(X⊤X + Ω)^{-1}X⊤Yθ = λ
⇒  α² = λ .

The first line is obtained by obeying constraint (4.12b); the second line follows from the relationship (4.7), whose denominator is α; the third line comes from (4.4); the fourth line uses again the relationship (4.7); and the last one the definition of α (4.5).


4.1.3 Penalized Linear Discriminant Analysis

Still following Hastie et al. (1995), the penalized Linear Discriminant Analysis problem is defined as follows:

max_{β∈R^p}  β⊤Σ_B β        (4.14a)
s.t.  β⊤(Σ_W + n^{-1}Ω)β = 1        (4.14b)

where Σ_B and Σ_W are respectively the sample between-class and within-class covariances of the original p-dimensional data. This problem may be solved by an eigenvector decomposition, as detailed in Appendix C.

As the feature matrix X is assumed to be centered, the sample total, between-class and within-class covariance matrices can be written in a simple form that is amenable to a simple matrix representation using the projection operator Y(Y⊤Y)^{-1}Y⊤:

Σ_T = (1/n) ∑_{i=1}^n x_i x_i⊤ = n^{-1} X⊤X

Σ_B = (1/n) ∑_{k=1}^K n_k μ_k μ_k⊤ = n^{-1} X⊤Y(Y⊤Y)^{-1}Y⊤X

Σ_W = (1/n) ∑_{k=1}^K ∑_{i: y_ik=1} (x_i − μ_k)(x_i − μ_k)⊤ = n^{-1} ( X⊤X − X⊤Y(Y⊤Y)^{-1}Y⊤X )
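For reference, a minimal matlab sketch of these sample estimates, assuming X is the centered n-by-p data matrix and Y the n-by-K class-indicator matrix (variable names are assumptions):

    % Sample covariance matrices computed through the projection onto the columns of Y
    P      = Y / (Y'*Y) * Y';          % projection operator Y*(Y'Y)^{-1}*Y'
    SigmaT = (X'*X) / n;               % total covariance
    SigmaB = (X'*P*X) / n;             % between-class covariance
    SigmaW = SigmaT - SigmaB;          % within-class covariance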

Using these formulae, the solution to the p-LDA problem (4.14) is obtained as

X⊤Y(Y⊤Y)^{-1}Y⊤Xβ_lda = λ ( X⊤X + Ω − X⊤Y(Y⊤Y)^{-1}Y⊤X ) β_lda

X⊤Y(Y⊤Y)^{-1}Y⊤Xβ_lda = (λ/(1 − λ)) (X⊤X + Ω) β_lda .

The comparison of the last equation with β_cca (4.13) shows that β_lda and β_cca are proportional and that λ/(1 − λ) = α². Using constraints (4.12b) and (4.14b), it comes that

β_lda = (1 − α²)^{-1/2} β_cca
      = α^{-1} (1 − α²)^{-1/2} β_os ,

which ends the path from p-OS to p-LDA


4.1.4 Summary

The three previous subsections considered a generic form of the kth problem in the p-OS series. The relationships unveiled above also hold for the compact notation gathering all problems (3.4), which is recalled below:

min_{Θ, B}  ‖YΘ − XB‖²_F + λ tr(B⊤ΩB)
s.t.  n^{-1} Θ⊤Y⊤YΘ = I_{K−1} .

Let A represent the (K−1)×(K−1) diagonal matrix whose elements α_k are the square roots of the K−1 largest eigenvalues of Y⊤X(X⊤X + Ω)^{-1}X⊤Y; we have

B_LDA = B_CCA (I_{K−1} − A²)^{-1/2}
      = B_OS A^{-1} (I_{K−1} − A²)^{-1/2} ,        (4.15)

where I_{K−1} is the (K−1)×(K−1) identity matrix.
At this point, the feature matrix X, which in the input space has dimensions n×p, can be projected into the optimal scoring domain as an n×(K−1) matrix X_OS = XB_OS, or into the linear discriminant analysis space as an n×(K−1) matrix X_LDA = XB_LDA. Classification can be performed in any of those domains if the appropriate distance (penalized within-class covariance matrix) is applied.

With the aim of performing classification, the whole process can be summarized as follows:

1. Solve the p-OS problem as

   B_OS = (X⊤X + λΩ)^{-1} X⊤YΘ ,

   where Θ are the K−1 leading eigenvectors of Y⊤X(X⊤X + λΩ)^{-1}X⊤Y.

2. Translate the data samples X into the LDA domain as X_LDA = XB_OS D, where D = A^{-1}(I_{K−1} − A²)^{-1/2}.

3. Compute the matrix M of centroids μ_k from X_LDA and Y.

4. Evaluate the distances d(x, μ_k) in the LDA domain as a function of M and X_LDA.

5. Translate distances into posterior probabilities and assign every sample i to a class k following the maximum a posteriori rule.

6. Graphical representation.


The solution of the penalized optimal scoring regression and the computation of the distance and posterior matrices are detailed in Sections 4.2.1, 4.2.2 and 4.2.3, respectively.

4.2 Practicalities

4.2.1 Solution of the Penalized Optimal Scoring Regression

Following Hastie et al. (1994) and Hastie et al. (1995), a quadratically penalized LDA problem can be presented as a quadratically penalized OS problem:

min_{Θ∈R^{K×(K−1)}, B∈R^{p×(K−1)}}  ‖YΘ − XB‖²_F + λ tr(B⊤ΩB)        (4.16a)
s.t.  n^{-1} Θ⊤Y⊤YΘ = I_{K−1} ,        (4.16b)

where Θ are the class scores, B the regression coefficients, and ‖·‖_F is the Frobenius norm.

Though non-convex, the OS problem is readily solved by a decomposition in Θ and B: the optimal B_OS does not intervene in the optimality conditions with respect to Θ, and the optimization with respect to B is obtained in closed form as a linear combination of the optimal scores Θ (Hastie et al. 1995). The algorithm may seem a bit tortuous considering the properties mentioned above, as it proceeds in four steps:

1. Initialize Θ to Θ⁰ such that n^{-1} Θ⁰⊤Y⊤YΘ⁰ = I_{K−1}.

2. Compute B = (X⊤X + λΩ)^{-1} X⊤YΘ⁰.

3. Set Θ to be the K−1 leading eigenvectors of Y⊤X(X⊤X + λΩ)^{-1}X⊤Y.

4. Compute the optimal regression coefficients

   B_OS = (X⊤X + λΩ)^{-1} X⊤YΘ .        (4.17)

Defining Θ⁰ in Step 1, instead of using directly Θ as expressed in Step 3, drastically reduces the computational burden of the eigen-analysis: the latter is performed on Θ⁰⊤Y⊤X(X⊤X + λΩ)^{-1}X⊤YΘ⁰, which is computed as Θ⁰⊤Y⊤XB, thus avoiding a costly matrix inversion. The solution of the penalized optimal scoring problem as an eigenvector decomposition is detailed and justified in Appendix B.
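A minimal matlab sketch of these four steps, assuming X, Y, Theta0, lambda and Omega are available and Theta0 is a feasible initial score matrix (variable names are assumptions):

    % Step 2: K-1 penalized regressions sharing the same system matrix
    B0 = (X'*X + lambda*Omega) \ (X'*Y*Theta0);
    % Step 3: small (K-1)x(K-1) eigen-analysis of Theta0'*Y'*X*B0 (symmetrized for safety)
    M = Theta0' * (Y'*X) * B0;   M = (M + M')/2;
    [V, D] = eig(M);
    [~, order] = sort(diag(D), 'descend');
    % Step 4: optimal scores and regression coefficients
    Theta = Theta0 * V(:, order);
    BOS   = B0 * V(:, order);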

This four-step algorithm is valid when the penalty is of the form tr(B⊤ΩB). However, when an L1 penalty is applied in (4.16), the optimization algorithm requires iterative updates of B and Θ. That situation is developed by Clemmensen et al. (2011), where a Lasso or an elastic net penalty is used to induce sparsity in the OS problem. Furthermore, these Lasso and elastic net penalties do not enjoy the equivalence with LDA problems.

4.2.2 Distance Evaluation

The simplest classification rule is the nearest centroid rule, where sample x_i is assigned to class k if it is closer (in terms of the shared within-class Mahalanobis distance) to centroid μ_k than to any other centroid μ_ℓ. In general, the parameters of the model are unknown, and the rule is applied with the parameters estimated from training data (sample estimators μ_k and Σ_W). If μ_k are the centroids in the input space, sample x_i is assigned to class k if the distance

d(x_i, μ_k) = (x_i − μ_k)⊤ Σ_WΩ^{-1} (x_i − μ_k) − 2 log(n_k/n)        (4.18)

is minimized among all k. In expression (4.18), the first term is the Mahalanobis distance in the input space, and the second term is an adjustment for unequal class sizes that estimates the prior probability of class k. Note that this is inspired by the Gaussian view of LDA and that another definition of the adjustment term could be used (Friedman et al. 2009, Mai et al. 2012). The matrix Σ_WΩ used in (4.18) is the penalized within-class covariance matrix, which can be decomposed into a penalized and a non-penalized component:

Σ_WΩ^{-1} = ( n^{-1}(X⊤X + λΩ) − Σ_B )^{-1}
          = ( n^{-1}X⊤X − Σ_B + n^{-1}λΩ )^{-1}
          = ( Σ_W + n^{-1}λΩ )^{-1} .        (4.19)

Before explaining how to compute the distances, let us summarize some clarifying points:

• The solution B_OS of the p-OS problem is enough to accomplish classification.

• In the LDA domain (space of discriminant variates X_LDA), classification is based on Euclidean distances.

• Classification can be done in a reduced-rank space of dimension R < K − 1 by using the first R discriminant directions {β_k}_{k=1}^R.

As a result, the expression of the distance (4.18) depends on the domain where the classification is performed. If we classify in the p-OS domain, the distance is

‖(x_i − μ_k) B_OS‖²_{Σ_WΩ} − 2 log(π_k) ,

where π_k is the estimated class prior and ‖·‖_S is the Mahalanobis norm with within-class covariance S. If classification is done in the p-LDA domain, the distance is

‖(x_i − μ_k) B_OS A^{-1}(I_{K−1} − A²)^{-1/2}‖²₂ − 2 log(π_k) ,

which is a plain Euclidean distance.
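A minimal matlab sketch of the nearest-centroid rule in the LDA domain, assuming X, BLDA, the K-by-(K−1) centroid matrix M, the vector prior of estimated class priors, and n and K are available (names are assumptions):

    % Nearest-centroid classification with the prior-adjusted Euclidean distance (4.18)
    XLDA = X * BLDA;                               % discriminant variates
    d = zeros(n, K);
    for k = 1:K
        diffk  = bsxfun(@minus, XLDA, M(k,:));     % deviations from centroid k
        d(:,k) = sum(diffk.^2, 2) - 2*log(prior(k));
    end
    [~, yhat] = min(d, [], 2);                     % maximum a posteriori class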


4.2.3 Posterior Probability Evaluation

Let d(x, μ_k) be the distance between x and μ_k defined as in (4.18). Under the assumption that classes are Gaussian, the posterior probabilities p(y_k = 1|x) can be estimated as

p(y_k = 1|x) ∝ exp( −d(x, μ_k)/2 )
             ∝ π_k exp( −(1/2) ‖(x − μ_k) B_OS A^{-1}(I_{K−1} − A²)^{-1/2}‖²₂ ) .        (4.20)

Those probabilities must be normalized to ensure that they sum to one. When the distances d(x, μ_k) take large values, exp(−d(x, μ_k)/2) can take extremely small values, generating underflow issues. A classical trick to fix this numerical issue is detailed below:

p(y_k = 1|x) = π_k exp( −d(x, μ_k)/2 ) / ∑_ℓ π_ℓ exp( −d(x, μ_ℓ)/2 )
             = π_k exp( −(d(x, μ_k) − d_max)/2 ) / ∑_ℓ π_ℓ exp( −(d(x, μ_ℓ) − d_max)/2 ) ,

where d_max = max_k d(x, μ_k).
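A minimal matlab sketch of this normalization, assuming d is the n-by-K matrix of distances and prior a 1-by-K vector of class priors (names are assumptions):

    % Numerically stable posterior probabilities, shifting all exponents by dmax per sample
    dmax = max(d, [], 2);
    w    = bsxfun(@times, exp(-bsxfun(@minus, d, dmax)/2), prior);   % pi_l * exp(-(d_l - dmax)/2)
    post = bsxfun(@rdivide, w, sum(w, 2));                           % rows sum to one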

4.2.4 Graphical Representation

Sometimes it can be useful to have a graphical display of the data set. Using only the two or three most discriminant directions may not provide the best separation between classes, but can suffice to inspect the data. That can be accomplished by plotting the first two or three dimensions of the regression fits X_OS or of the discriminant variates X_LDA, depending on whether we present the dataset in the OS or in the LDA domain. Other attributes, such as the centroids or the shape of the within-class variance, can also be represented.

4.3 From Sparse Optimal Scoring to Sparse LDA

The equivalence stated in Section 4.1 holds for quadratic penalties of the form β⊤Ωβ, under the assumption that Y⊤Y and X⊤X + λΩ are full rank (fulfilled when there are no empty classes and Ω is positive definite). Quadratic penalties have interesting properties, but, as recalled in Section 2.3, they do not induce sparsity. In this respect, L1 penalties are preferable, but they lack a connection such as the one stated by Hastie et al. (1995) between p-LDA and p-OS.

In this section, we introduce the tools used to obtain sparse models while maintaining the equivalence between p-LDA and p-OS problems. We use a group-Lasso penalty (see Section 2.3.4) that induces groups of zeros in the coefficients corresponding to the same feature in all discriminant directions, resulting in truly parsimonious models. Our derivation uses a variational formulation of the group-Lasso to generalize the equivalence drawn by Hastie et al. (1995) for quadratic penalties. Therefore, we intend to show that our formulation of the group-Lasso can be written in the quadratic form B⊤ΩB.

4.3.1 A Quadratic Variational Form

Quadratic variational forms of the Lasso and group-Lasso have been proposed shortly after the original Lasso paper of Hastie and Tibshirani (1996), as a means to address optimization issues, but also as an inspiration for generalizing the Lasso penalty (Grandvalet 1998, Canu and Grandvalet 1999). The algorithms based on these quadratic variational forms iteratively reweight a quadratic penalty. They are now often outperformed by more efficient strategies (Bach et al. 2012).

Our formulation of the group-Lasso is shown below:

min_{τ∈R^p} min_{B∈R^{p×(K−1)}}  J(B) + λ ∑_{j=1}^p w_j² ‖β^j‖²₂ / τ_j        (4.21a)
s.t.  ∑_j τ_j − ∑_j w_j ‖β^j‖₂ ≤ 0        (4.21b)
      τ_j ≥ 0 ,  j = 1, ..., p ,        (4.21c)

where B ∈ R^{p×(K−1)} is a matrix composed of row vectors β^j ∈ R^{K−1}, B = (β^{1⊤}, ..., β^{p⊤})⊤, and the w_j are predefined nonnegative weights. The cost function J(B) in our context is the OS regression loss (1/2)‖YΘ − XB‖²₂; from now on, for the sake of simplicity, we simply write J(B). Here and in what follows, b/τ is defined by continuation at zero, as b/0 = +∞ if b ≠ 0 and 0/0 = 0. Note that variants of (4.21) have been proposed elsewhere (see e.g. Canu and Grandvalet 1999, Bach et al. 2012, and references therein).

The intuition behind our approach is that, using the variational formulation, we recast a non-quadratic expression into the convex hull of a family of quadratic penalties defined by the variables τ_j. That is graphically shown in Figure 4.1.

Let us start by proving the equivalence of our variational formulation with the standard group-Lasso (there is an alternative variational formulation, detailed and demonstrated in Appendix D).

Lemma 4.1. The quadratic penalty in β^j in (4.21) acts as the group-Lasso penalty λ ∑_{j=1}^p w_j ‖β^j‖₂.

Proof. The Lagrangian of Problem (4.21) is

L = J(B) + λ ∑_{j=1}^p w_j² ‖β^j‖²₂ / τ_j + ν_0 ( ∑_{j=1}^p τ_j − ∑_{j=1}^p w_j ‖β^j‖₂ ) − ∑_{j=1}^p ν_j τ_j .


Figure 4.1: Graphical representation of the variational approach to the group-Lasso.

Thus, the first-order optimality conditions for τ_j are

∂L/∂τ_j (τ_j*) = 0  ⇔  −λ w_j² ‖β^j‖²₂ / τ_j*² + ν_0 − ν_j = 0
                    ⇔  −λ w_j² ‖β^j‖²₂ + ν_0 τ_j*² − ν_j τ_j*² = 0
                    ⇒  −λ w_j² ‖β^j‖²₂ + ν_0 τ_j*² = 0 .

The last line is obtained from complementary slackness, which here implies ν_j τ_j* = 0. Complementary slackness states that ν_j g_j(τ_j*) = 0, where ν_j is the Lagrange multiplier for the constraint g_j(τ_j) ≤ 0. As a result, the optimal value of τ_j is

τ_j* = ( λ w_j² ‖β^j‖²₂ / ν_0 )^{1/2} = (λ/ν_0)^{1/2} w_j ‖β^j‖₂ .        (4.22)

We note that ν_0 ≠ 0 if there is at least one coefficient β_jk ≠ 0; thus the inequality constraint (4.21b) is at bound (due to complementary slackness):

∑_{j=1}^p τ_j* − ∑_{j=1}^p w_j ‖β^j‖₂ = 0 ,        (4.23)

so that τ_j* = w_j ‖β^j‖₂. Using this value in (4.21a), it is possible to conclude that Problem (4.21) is equivalent to the standard group-Lasso operator

min_{B∈R^{p×M}}  J(B) + λ ∑_{j=1}^p w_j ‖β^j‖₂ .        (4.24)

So we have presented a convex quadratic variational form of the group-Lasso and demonstrated its equivalence with the standard group-Lasso formulation.


With Lemma 4.1 we have proved that, under constraints (4.21b)–(4.21c), the quadratic problem (4.21a) is equivalent to the standard formulation of the group-Lasso (4.24). The penalty term of (4.21a) can be conveniently written as λ B⊤ΩB, where

Ω = diag( w_1²/τ_1, w_2²/τ_2, ..., w_p²/τ_p ) ,        (4.25)

with τ_j* = w_j ‖β^j‖₂, resulting in the diagonal components of Ω:

(Ω)_jj = w_j / ‖β^j‖₂ .        (4.26)
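A minimal matlab sketch of this adaptive penalty update, assuming B holds the current coefficients and w the nonnegative weights; rows with zero norm are excluded beforehand by the active-set mechanism (variable names are assumptions):

    % Adaptive quadratic penalty (4.25)-(4.26) built from the current row norms of B
    rownorm = sqrt(sum(B.^2, 2));      % ||beta^j||_2 for each active variable j
    Omega   = diag(w(:) ./ rownorm);   % diagonal entries w_j / ||beta^j||_2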

As stated at the beginning of this section, the equivalence between p-LDA problems and p-OS problems is thus demonstrated for the variational formulation. This equivalence is crucial to the derivation of the link between sparse OS and sparse LDA; it furthermore suggests a convenient implementation. We sketch below some properties that are instrumental in the implementation of the active set described in Chapter 5.

The first property states that the quadratic formulation is convex when J is convex, thus providing an easy control of optimality and convergence.

Lemma 4.2. If J is convex, Problem (4.21) is convex.

Proof. The function g(β, τ) = ‖β‖²₂/τ, known as the perspective function of f(β) = ‖β‖²₂, is convex in (β, τ) (see e.g. Boyd and Vandenberghe 2004, Chapter 3), and the constraints (4.21b)–(4.21c) define convex admissible sets; hence Problem (4.21) is jointly convex with respect to (B, τ).

In what follows, J will be a convex quadratic (hence smooth) function, in which case a necessary and sufficient optimality condition is that zero belongs to the subdifferential of the objective function, whose expression is provided in the following lemma.

Lemma 4.3. For all B ∈ R^{p×(K−1)}, the subdifferential of the objective function of Problem (4.24) is

{ V ∈ R^{p×(K−1)} : V = ∂J(B)/∂B + λG } ,        (4.27)

where G ∈ R^{p×(K−1)} is a matrix composed of row vectors g^j ∈ R^{K−1}, G = (g^{1⊤}, ..., g^{p⊤})⊤, defined as follows. Let S(B) denote the columnwise support of B, S(B) = { j ∈ {1, ..., p} : ‖β^j‖₂ ≠ 0 }; then we have

∀j ∈ S(B),  g^j = w_j ‖β^j‖₂^{-1} β^j ,        (4.28)
∀j ∉ S(B),  ‖g^j‖₂ ≤ w_j .        (4.29)


This condition results in an equality for the "active" non-zero vectors β^j and an inequality for the other ones, which both provide essential building blocks of our algorithm.

Proof. When ‖β^j‖₂ ≠ 0, the gradient of the penalty with respect to β^j is

∂( λ ∑_{m=1}^p w_m ‖β^m‖₂ ) / ∂β^j = λ w_j β^j / ‖β^j‖₂ .        (4.30)

At ‖β^j‖₂ = 0, the gradient of the objective function is not continuous, and the optimality conditions then make use of the subdifferential (Bach et al. 2011):

∂_{β^j} ( λ ∑_{m=1}^p w_m ‖β^m‖₂ ) = ∂_{β^j} ( λ w_j ‖β^j‖₂ ) = { λ w_j v ∈ R^{K−1} : ‖v‖₂ ≤ 1 } ,        (4.31)

which gives the expression (4.29).

Lemma 4.4. Problem (4.21) admits at least one solution, which is unique if J is strictly convex. All critical points B of the objective function verifying the following conditions are global minima:

∀j ∈ S,  ∂J(B)/∂β^j + λ w_j ‖β^j‖₂^{-1} β^j = 0 ,        (4.32a)
∀j ∉ S,  ‖∂J(B)/∂β^j‖₂ ≤ λ w_j ,        (4.32b)

where S ⊆ {1, ..., p} denotes the set of non-zero row vectors β^j and S̄(B) is its complement.

Lemma 4.4 provides a simple appraisal of the support of the solution, which would not be as easily handled with a direct analysis of the variational problem (4.21).

4.3.2 Group-Lasso OS as Penalized LDA

With all the previous ingredients, the group-Lasso Optimal Scoring Solver for performing sparse LDA can be introduced.

Proposition 4.1. The group-Lasso OS problem

B_OS = argmin_{B∈R^{p×(K−1)}} min_{Θ∈R^{K×(K−1)}}  (1/2) ‖YΘ − XB‖²_F + λ ∑_{j=1}^p w_j ‖β^j‖₂
       s.t.  n^{-1} Θ⊤Y⊤YΘ = I_{K−1}


is equivalent to the penalized LDA problem

B_LDA = argmax_{B∈R^{p×(K−1)}}  tr(B⊤Σ_B B)
        s.t.  B⊤(Σ_W + n^{-1}λΩ)B = I_{K−1} ,

where Ω = diag( w_1²/τ_1, ..., w_p²/τ_p ), with

Ω_jj = +∞  if β^j_os = 0 ,   Ω_jj = w_j ‖β^j_os‖₂^{-1}  otherwise.        (4.33)

That is, B_LDA = B_OS diag( α_k^{-1}(1 − α_k²)^{-1/2} ), where α_k ∈ (0, 1) is the kth leading eigenvalue of

n^{-1} Y⊤X(X⊤X + λΩ)^{-1}X⊤Y .

Proof. The proof simply consists in applying the result of Hastie et al. (1995), which holds for quadratic penalties, to the quadratic variational form of the group-Lasso.

The proposition applies in particular to the Lasso-based OS approaches to sparse LDA (Grosenick et al. 2008, Clemmensen et al. 2011) for K = 2, that is, for binary classification, or more generally for a single discriminant direction. Note however that it leads to a slightly different decision rule if the decision threshold is chosen a priori according to the Gaussian assumption for the features. For more than one discriminant direction, the equivalence does not hold anymore, since the Lasso penalty does not result in an equivalent quadratic penalty of the simple form tr(B⊤ΩB).


5 GLOSS Algorithm

The efficient approaches developed for the Lasso take advantage of the sparsity of the solution by solving a series of small linear systems, whose sizes are incrementally increased or decreased (Osborne et al. 2000a). This approach was also pursued for the group-Lasso in its standard formulation (Roth and Fischer 2008). We adapt this algorithmic framework to the variational form (4.21), with J(B) = (1/2)‖YΘ − XB‖²₂.

The algorithm belongs to the working set family of optimization methods (see Section 2.3.6). It starts from a sparse initial guess, say B = 0, thus defining the set A of "active" variables currently identified as non-zero. Then it iterates the three steps summarized below:

1. Update the coefficient matrix B within the current active set A, where the optimization problem is smooth. First, the quadratic penalty is updated, and then a standard penalized least squares fit is computed.

2. Check the optimality conditions (4.32) with respect to the active variables. One or more β^j may be declared inactive when they vanish from the current solution.

3. Check the optimality conditions (4.32) with respect to the inactive variables. If they are satisfied, the algorithm returns the current solution, which is optimal. If they are not satisfied, the variable corresponding to the greatest violation is added to the active set.

This mechanism is graphically represented in Figure 5.1 as a block diagram and formalized in more detail in Algorithm 1. Note that this formulation uses the equations from the variational approach detailed in Section 4.3.1. If we want to use the alternative variational approach from Appendix D, then we have to replace Equations (4.21), (4.32a) and (4.32b) by (D.1), (D.10a) and (D.10b), respectively.

5.1 Regression Coefficients Updates

Step 1 of Algorithm 1 updates the coefficient matrix B within the current active set A. The quadratic variational form of the problem suggests a blockwise optimization strategy, consisting in solving (K−1) independent card(A)-dimensional problems instead of a single (K−1)×card(A)-dimensional problem. The interaction between the (K−1) problems is relegated to the common adaptive quadratic penalty Ω. This decomposition is especially attractive, as we then solve (K−1) similar systems

(X_A⊤X_A + λΩ) β_k = X_A⊤Yθ_k⁰ ,        (5.1)


Figure 5.1: GLOSS block diagram. Starting from an initial model (λ, B) and the active set of variables with ‖β^j‖₂ > 0, the algorithm solves the p-OS problem so that B satisfies the first optimality condition, moves vanished variables out of the active set, tests the second optimality condition on the inactive set to decide which variable enters the active set, and, once no condition is violated, computes Θ, updates B and stops.


Algorithm 1 Adaptively Penalized Optimal Scoring

Input: X, Y, B, λ
Initialize: A ← { j ∈ {1, ..., p} : ‖β^j‖₂ > 0 };  Θ⁰ such that n^{-1} Θ⁰⊤Y⊤YΘ⁰ = I_{K−1};  convergence ← false
repeat
    % Step 1: solve (4.21) in B, assuming A optimal
    repeat
        Ω ← diag(Ω_A), with ω_j ← ‖β^j‖₂^{-1}
        B_A ← (X_A⊤X_A + λΩ)^{-1} X_A⊤YΘ⁰
    until condition (4.32a) holds for all j ∈ A
    % Step 2: identify inactivated variables
    for j ∈ A such that ‖β^j‖₂ = 0 do
        if optimality condition (4.32b) holds then
            A ← A \ {j};  go back to Step 1
        end if
    end for
    % Step 3: check the greatest violation of optimality condition (4.32b) in the inactive set Ā
    j* ← argmax_{j∈Ā} ‖∂J/∂β^j‖₂
    if ‖∂J/∂β^{j*}‖₂ < λ then
        convergence ← true    % B is optimal
    else
        A ← A ∪ {j*}
    end if
until convergence
(s, V) ← eigenanalyze(Θ⁰⊤Y⊤X_A B), that is, Θ⁰⊤Y⊤X_A B V_k = s_k V_k, k = 1, ..., K−1
Θ ← Θ⁰V;  B ← BV;  α_k ← n^{-1/2} s_k^{1/2}, k = 1, ..., K−1
Output: Θ, B, α


where X_A denotes the columns of X indexed by A, and β_k and θ_k⁰ denote the kth columns of B and Θ⁰, respectively. These linear systems only differ in their right-hand-side term, so that a single Cholesky decomposition is necessary to solve all systems, whereas a blockwise Newton-Raphson method based on the standard group-Lasso formulation would result in different "penalties" Ω for each system.

5.1.1 Cholesky Decomposition

Dropping the subscripts and considering the (K−1) systems together, (5.1) leads to

(X⊤X + λΩ) B = X⊤YΘ .        (5.2)

Defining the Cholesky decomposition as C⊤C = X⊤X + λΩ, (5.2) is solved efficiently as follows:

C⊤CB = X⊤YΘ
CB = C⊤ \ (X⊤YΘ)
B = C \ ( C⊤ \ (X⊤YΘ) ) ,        (5.3)

where the symbol "\" is the matlab mldivide operator, which solves linear systems efficiently. The GLOSS code implements (5.3).
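A minimal matlab sketch of (5.3) on the active set, assuming XA, Y, Theta0, lambda and Omega are already available (variable names are assumptions):

    % One Cholesky factorization shared by the K-1 right-hand sides of (5.2)
    C = chol(XA'*XA + lambda*Omega);     % upper triangular factor, C'*C = XA'*XA + lambda*Omega
    B = C \ (C' \ (XA'*Y*Theta0));       % two triangular solves via mldivide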

5.1.2 Numerical Stability

The OS regression coefficients are obtained by (5.2), where the penalizer Ω is iteratively updated by (4.33). In this iterative process, when a variable is about to leave the active set, the corresponding entry of Ω reaches large values, thereby driving some OS regression coefficients to zero. These large values may cause numerical stability problems in the Cholesky decomposition of X⊤X + λΩ. This difficulty can be avoided using the following equivalent expression:

B = Ω^{-1/2} ( Ω^{-1/2}X⊤XΩ^{-1/2} + λI )^{-1} Ω^{-1/2} X⊤YΘ⁰ ,        (5.4)

where the conditioning of Ω^{-1/2}X⊤XΩ^{-1/2} + λI is always well-behaved provided X is appropriately normalized (recall that 0 ≤ 1/ω_j ≤ 1). This stabler expression demands more computation and is thus reserved to cases with large ω_j values; our code is otherwise based on expression (5.2).

5.2 Score Matrix

The optimal score matrix Θ is made of the K−1 leading eigenvectors of Y⊤X(X⊤X + Ω)^{-1}X⊤Y. This eigen-analysis is actually solved in the form Θ⊤Y⊤X(X⊤X + Ω)^{-1}X⊤YΘ (see Section 4.2.1 and Appendix B). The latter eigenvector decomposition does not require the costly computation of (X⊤X + Ω)^{-1}, which involves the inversion of a p×p matrix. Let Θ⁰ be an arbitrary K×(K−1) matrix whose range includes the K−1 leading eigenvectors of Y⊤X(X⊤X + Ω)^{-1}X⊤Y.¹ Then, solving the K−1 systems (5.3) provides the value of B⁰ = (X⊤X + λΩ)^{-1}X⊤YΘ⁰. This B⁰ matrix can be identified in the expression to eigen-analyze as

Θ⁰⊤Y⊤X(X⊤X + Ω)^{-1}X⊤YΘ⁰ = Θ⁰⊤Y⊤XB⁰ .

Thus, the solution to the penalized OS problem can be computed through the singular value decomposition of the (K−1)×(K−1) matrix Θ⁰⊤Y⊤XB⁰ = VΛV⊤. Defining Θ = Θ⁰V, we have Θ⊤Y⊤X(X⊤X + Ω)^{-1}X⊤YΘ = Λ, and when Θ⁰ is chosen such that n^{-1} Θ⁰⊤Y⊤YΘ⁰ = I_{K−1}, we also have n^{-1} Θ⊤Y⊤YΘ = I_{K−1}, so that the constraints of the p-OS problem hold. Hence, assuming that the diagonal elements of Λ are sorted in decreasing order, θ_k is an optimal solution to the p-OS problem. Finally, once Θ has been computed, the corresponding optimal regression coefficients B satisfying (5.2) are simply recovered using the mapping from Θ⁰ to Θ, that is, B = B⁰V. Appendix E details why the computational trick described here for quadratic penalties can be applied to the group-Lasso, for which Ω is defined by a variational formulation.

5.3 Optimality Conditions

GLOSS uses an active set optimization technique to obtain the optimal values of the coefficient matrix B and the score matrix Θ. To be a solution, the coefficient matrix must obey Lemmas 4.3 and 4.4, from which optimality conditions (4.32a) and (4.32b) are deduced. Both expressions require the computation of the gradient of the objective function

(1/2) ‖YΘ − XB‖²₂ + λ ∑_{j=1}^p w_j ‖β^j‖₂ .        (5.5)

Let J(B) be the data-fitting term (1/2)‖YΘ − XB‖²₂. Its gradient with respect to the jth row of B, β^j, is the (K−1)-dimensional vector

∂J(B)/∂β^j = x_j⊤(XB − YΘ) ,

where x_j is the jth column of X. Hence, the first optimality condition (4.32a) can be computed for every variable j as

x_j⊤(XB − YΘ) + λ w_j β^j / ‖β^j‖₂ .

¹As X is centered, 1_K belongs to the null space of Y⊤X(X⊤X + Ω)^{-1}X⊤Y. It is thus sufficient to choose Θ⁰ orthogonal to 1_K to ensure that its range spans the leading eigenvectors of Y⊤X(X⊤X + Ω)^{-1}X⊤Y. In practice, to comply with this desideratum and conditions (3.5b) and (3.5c), we set Θ⁰ = (Y⊤Y)^{-1/2}U, where U is a K×(K−1) matrix whose columns are orthonormal vectors orthogonal to 1_K.


The second optimality condition (4.32b) can be computed for every variable j as

‖x_j⊤(XB − YΘ)‖₂ ≤ λ w_j .
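A minimal matlab sketch of these optimality checks, assuming B, Theta, a column vector of weights w and a small tolerance tol are available (names are assumptions):

    % Gradient of the data-fitting term, one row per variable
    G       = X' * (X*B - Y*Theta);
    normB   = sqrt(sum(B.^2, 2));
    active  = normB > 0;
    % Condition (4.32a) on the active set, up to the tolerance tol
    R       = G(active,:) + lambda * bsxfun(@times, w(active)./normB(active), B(active,:));
    okActive   = all(sqrt(sum(R.^2, 2)) < tol);
    % Condition (4.32b) on the inactive set
    okInactive = all(sqrt(sum(G(~active,:).^2, 2)) <= lambda*w(~active) + tol);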

5.4 Active and Inactive Sets

The feature selection mechanism embedded in GLOSS selects the variables that provide the greatest decrease in the objective function. This is accomplished by means of the optimality conditions (4.32a) and (4.32b). Let A be the active set, containing the variables that have already been considered relevant. A variable j can be considered for inclusion into the active set if it violates the second optimality condition. We proceed one variable at a time, by choosing the one that is expected to produce the greatest decrease in the objective function:

j* = argmax_j  max( ‖x_j⊤(XB − YΘ)‖₂ − λ w_j , 0 ) .

The exclusion of a variable belonging to the active set A is considered if the norm ‖β^j‖₂ is small and if, after setting β^j to zero, the following optimality condition holds:

‖x_j⊤(XB − YΘ)‖₂ ≤ λ w_j .

The process continues until no variable in the active set violates the first optimality condition and no variable in the inactive set violates the second optimality condition.

5.5 Penalty Parameter

The penalty parameter can be specified by the user, in which case GLOSS solves the problem with this value of λ. The other strategy is to compute the solution path for several values of λ: GLOSS then looks for the maximum value of the penalty parameter λ_max such that B ≠ 0, and solves the p-OS problem for decreasing values of λ until a prescribed number of features are declared active.

The maximum value of the penalty parameter λ_max corresponding to a null B matrix is obtained by evaluating the optimality condition (4.32b) at B = 0:

λ_max = max_{j∈{1,...,p}}  (1/w_j) ‖x_j⊤YΘ⁰‖₂ .

The algorithm then computes a series of solutions along the regularization path defined by a series of penalties λ₁ = λ_max > ··· > λ_t > ··· > λ_T = λ_min ≥ 0, by regularly decreasing the penalty, λ_{t+1} = λ_t/2, and using a warm-start strategy where the feasible initial guess for B(λ_{t+1}) is initialized with B(λ_t). The final penalty parameter λ_min is set during the optimization process, when the maximum number of desired active variables is attained (by default, the minimum of n and p).
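A minimal matlab sketch of this schedule, assuming X, Y, Theta0, the weights w and a path length T are available (names are assumptions):

    % Largest useful penalty (null B) and halving schedule for the regularization path
    lambda_max = max( sqrt(sum((X'*Y*Theta0).^2, 2)) ./ w(:) );
    lambdas    = lambda_max * 0.5.^(0:T-1);   % each solution warm-starts the next one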


5.6 Options and Variants

5.6.1 Scaling Variables

As most penalization schemes, GLOSS is sensitive to the scaling of the variables. It thus makes sense to normalize them before applying the algorithm or, equivalently, to accommodate weights in the penalty. This option is available in the algorithm.

5.6.2 Sparse Variant

This version replaces some matlab commands used in the standard version of GLOSS by their sparse equivalents. In addition, some mathematical structures are adapted for sparse computation.

5.6.3 Diagonal Variant

We motivated the group-Lasso penalty by sparsity requisites, but robustness considerations could also drive its usage, since LDA is known to be unstable when the number of examples is small compared to the number of variables. In this context, LDA has been experimentally observed to benefit from unrealistic assumptions on the form of the estimated within-class covariance matrix. Indeed, the diagonal approximation, which ignores correlations between genes, may lead to better classification in microarray analysis. Bickel and Levina (2004) showed that this crude approximation provides a classifier with better worst-case performance than the LDA decision rule in small sample size regimes, even if variables are correlated.

The equivalence proof between penalized OS and penalized LDA (Hastie et al. 1995) reveals that quadratic penalties in the OS problem are equivalent to penalties on the within-class covariance matrix in the LDA formulation. This proof suggests a slight variant of penalized OS, corresponding to penalized LDA with a diagonal within-class covariance matrix, where the least squares problems

min_{B∈R^{p×(K−1)}}  ‖YΘ − XB‖²_F = min_{B∈R^{p×(K−1)}}  tr( Θ⊤Y⊤YΘ − 2Θ⊤Y⊤XB + n B⊤Σ_T B )

are replaced by

min_{B∈R^{p×(K−1)}}  tr( Θ⊤Y⊤YΘ − 2Θ⊤Y⊤XB + n B⊤(Σ_B + diag(Σ_W))B ) .

Note that this variant only requires diag(Σ_W) + Σ_B + n^{-1}Ω to be positive definite, which is a weaker requirement than Σ_T + n^{-1}Ω positive definite.

5.6.4 Elastic Net and Structured Variant

For some learning problems, the structure of correlations between variables is partially known. Hastie et al. (1995) applied this idea to the field of handwritten digit recognition for their penalized discriminant analysis model, to constrain the discriminant directions to be spatially smooth.

Figure 5.2: Graph and Laplacian matrix for a 3×3 image. Pixels are numbered

    7 8 9
    4 5 6
    1 2 3

and the corresponding Laplacian matrix is

    Ω_L = [  3 −1  0 −1 −1  0  0  0  0
            −1  5 −1 −1 −1 −1  0  0  0
             0 −1  3  0 −1 −1  0  0  0
            −1 −1  0  5 −1  0 −1 −1  0
            −1 −1 −1 −1  8 −1 −1 −1 −1
             0 −1 −1  0 −1  5  0 −1 −1
             0  0  0 −1 −1  0  3 −1  0
             0  0  0 −1 −1 −1 −1  5 −1
             0  0  0  0 −1 −1  0 −1  3 ]

When an image is represented as a vector of pixels, it is reasonable to assume positive correlations between the variables corresponding to neighboring pixels. Figure 5.2 represents the neighborhood graph of the pixels of a 3×3 image, with the corresponding Laplacian matrix. The Laplacian matrix Ω_L is positive semi-definite, and the penalty β⊤Ω_Lβ favors, among vectors of identical L2 norm, the ones having similar coefficients in the neighborhoods of the graph. For example, this penalty is 9 for the vector (1, 1, 0, 1, 1, 0, 0, 0, 0)⊤, which is the indicator of the neighbors of pixel 1, and it is 17 for the vector (−1, 1, 0, 1, 1, 0, 0, 0, 0)⊤, with a sign mismatch between pixel 1 and its neighborhood.
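A minimal matlab sketch that builds such a Laplacian for an r-by-c pixel grid with 8-connectivity (the grid size and variable names are assumptions; for r = c = 3 it reproduces the matrix of Figure 5.2):

    % Graph Laplacian of the pixel grid: Omega_L = D - A, with A the 8-neighbourhood adjacency
    r = 3;  c = 3;
    [I, J] = ndgrid(1:r, 1:c);
    coords = [I(:), J(:)];                                % pixel coordinates, column-major numbering
    A = zeros(r*c);
    for a = 1:r*c
        for b = a+1:r*c
            if max(abs(coords(a,:) - coords(b,:))) == 1   % neighbors, including diagonals
                A(a,b) = 1;  A(b,a) = 1;
            end
        end
    end
    OmegaL = diag(sum(A, 2)) - A;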

This smoothness penalty can be imposed jointly with the group-Lasso. From the computational point of view, GLOSS hardly needs to be modified: the smoothness penalty just has to be added to the group-Lasso penalty. As the new penalty is convex and quadratic (thus smooth), there is no additional burden in the overall algorithm. There is, however, an additional hyperparameter to be tuned.


6 Experimental Results

This section presents comparison results between the Group-Lasso Optimal Scoring Solver algorithm and two other state-of-the-art classifiers proposed to perform sparse LDA: Penalized LDA (PLDA) (Witten and Tibshirani 2011), which applies a Lasso penalty within a Fisher's LDA framework, and Sparse Linear Discriminant Analysis (SLDA) (Clemmensen et al. 2011), which applies an elastic net penalty to the OS problem. With the aim of testing the parsimony capacities, the latter algorithm was tested without any quadratic penalty, that is, with a Lasso penalty. The implementations of PLDA and SLDA are available from the authors' websites; PLDA is an R implementation and SLDA is coded in matlab. All the experiments used the same training, validation and test sets. Note that they differ significantly from the ones of Witten and Tibshirani (2011) in Simulation 4, for which there was a typo in their paper.

6.1 Normalization

With shrunken estimates, the scaling of features has important outcomes. For the linear discriminants considered here, the two most common normalization strategies consist in setting to ones either the diagonal of the total covariance matrix Σ_T or the diagonal of the within-class covariance matrix Σ_W. These options can be implemented either by scaling the observations accordingly prior to the analysis, or by providing penalties with weights. The latter option is implemented in our matlab package.¹

6.2 Decision Thresholds

The derivations of LDA based on the analysis of variance or on the regression of class indicators do not rely on the normality of the class-conditional distribution of the observations. Hence, their applicability extends beyond the realm of Gaussian data. Based on this observation, Friedman et al. (2009, chapter 4) suggest to investigate other decision thresholds than the ones stemming from the Gaussian mixture assumption; in particular, they propose to select the decision thresholds that empirically minimize the training error. This option was tested using validation sets or cross-validation.

¹The GLOSS matlab code can be found in the software section of www.hds.utc.fr/~grandval.


6.3 Simulated Data

We first compare the three techniques in the simulation study of Witten and Tibshirani (2011), which considers four setups with 1200 examples equally distributed between classes. They are split into a training set of size n = 100, a validation set of size 100 and a test set of size 1000. We are in the small sample regime, with p = 500 variables, out of which 100 differ between classes. Independent variables are generated for all simulations except for Simulation 2, where they are slightly correlated. In Simulations 2 and 3, classes are optimally separated by a single projection of the original variables, while the two other scenarios require three discriminant directions. The Bayes' error was estimated to be respectively 1.7%, 6.7%, 7.3% and 30.0%. The exact definition of every setup, as provided in Witten and Tibshirani (2011), is:

Simulation 1: Mean shift with independent features. There are four classes. If sample i is in class k, then x_i ∼ N(μ_k, I), where μ_{1j} = 0.7 × 1_{(1≤j≤25)}, μ_{2j} = 0.7 × 1_{(26≤j≤50)}, μ_{3j} = 0.7 × 1_{(51≤j≤75)}, μ_{4j} = 0.7 × 1_{(76≤j≤100)}.

Simulation 2: Mean shift with dependent features. There are two classes. If sample i is in class 1, then x_i ∼ N(0, Σ), and if i is in class 2, then x_i ∼ N(μ, Σ), with μ_j = 0.6 × 1_{(j≤200)}. The covariance structure is block diagonal, with 5 blocks, each of dimension 100×100. The blocks have (j, j′) element 0.6^{|j−j′|}. This covariance structure is intended to mimic gene expression data correlation.

Simulation 3: One-dimensional mean shift with independent features. There are four classes, and the features are independent. If sample i is in class k, then X_{ij} ∼ N((k−1)/3, 1) if j ≤ 100, and X_{ij} ∼ N(0, 1) otherwise.

Simulation 4: Mean shift with independent features and no linear ordering. There are four classes. If sample i is in class k, then x_i ∼ N(µ_k, I), with mean vectors defined as follows: µ_{1j} ∼ N(0, 0.3²) for j ≤ 25 and µ_{1j} = 0 otherwise; µ_{2j} ∼ N(0, 0.3²) for 26 ≤ j ≤ 50 and µ_{2j} = 0 otherwise; µ_{3j} ∼ N(0, 0.3²) for 51 ≤ j ≤ 75 and µ_{3j} = 0 otherwise; µ_{4j} ∼ N(0, 0.3²) for 76 ≤ j ≤ 100 and µ_{4j} = 0 otherwise.
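For concreteness, a possible numpy implementation of the first setup (mean shift with independent features) is sketched below; the sizes and the 0.7 shift follow the description above, while the function name, the equal class allotment and the seeding are illustrative choices.

    import numpy as np

    def simulate_setup1(p=500, per_class=300, rng=None):
        """Simulation 1: four classes, x_i ~ N(mu_k, I), class k shifted by 0.7 on its block of 25 features."""
        rng = np.random.default_rng(rng)
        K = 4
        mus = np.zeros((K, p))
        for k in range(K):
            mus[k, 25 * k:25 * (k + 1)] = 0.7      # class k differs on features 25k .. 25k+24
        y = np.repeat(np.arange(K), per_class)      # equally distributed classes
        X = rng.standard_normal((y.size, p)) + mus[y]
        perm = rng.permutation(y.size)              # shuffle before the train/validation/test split
        return X[perm], y[perm]
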

Note that this protocol is detrimental to GLOSS, as each relevant variable only affects a single class mean out of K. The setup is favorable to PLDA in the sense that most within-class covariance matrices are diagonal. We thus also tested the diagonal GLOSS variant discussed in Section 5.6.3.

The results are summarized in Table 6.1. Overall, the best predictions are performed by PLDA and GLOSS-D, which both benefit from the knowledge of the true within-class covariance structure. Then, among SLDA and GLOSS, which both ignore this structure, our proposal has a clear edge. The error rates are far away from the Bayes' error rates, but the sample size is small with regard to the number of relevant variables. Regarding sparsity, the clear overall winner is GLOSS, followed far away by SLDA, which is the only method that does not succeed in uncovering a low-dimensional representation in Simulation 3.


Table 6.1: Experimental results for simulated data: averages (with standard deviations) computed over 25 repetitions of the test error rate, the number of selected variables, and the number of discriminant directions selected on the validation set.

                     Err (%)        Var             Dir

Sim 1: K = 4, mean shift, ind. features
  PLDA               12.6 (0.1)     411.7 (3.7)     3.0 (0.0)
  SLDA               31.9 (0.1)     228.0 (0.2)     3.0 (0.0)
  GLOSS              19.9 (0.1)     106.4 (1.3)     3.0 (0.0)
  GLOSS-D            11.2 (0.1)     251.1 (4.1)     3.0 (0.0)

Sim 2: K = 2, mean shift, dependent features
  PLDA                9.0 (0.4)     337.6 (5.7)     1.0 (0.0)
  SLDA               19.3 (0.1)      99.0 (0.0)     1.0 (0.0)
  GLOSS              15.4 (0.1)      39.8 (0.8)     1.0 (0.0)
  GLOSS-D             9.0 (0.0)     203.5 (4.0)     1.0 (0.0)

Sim 3: K = 4, 1D mean shift, ind. features
  PLDA               13.8 (0.6)     161.5 (3.7)     1.0 (0.0)
  SLDA               57.8 (0.2)     152.6 (2.0)     1.9 (0.0)
  GLOSS              31.2 (0.1)     123.8 (1.8)     1.0 (0.0)
  GLOSS-D            18.5 (0.1)     357.5 (2.8)     1.0 (0.0)

Sim 4: K = 4, mean shift, ind. features
  PLDA               60.3 (0.1)     336.0 (5.8)     3.0 (0.0)
  SLDA               65.9 (0.1)     208.8 (1.6)     2.7 (0.0)
  GLOSS              60.7 (0.2)      74.3 (2.2)     2.7 (0.0)
  GLOSS-D            58.8 (0.1)     162.7 (4.9)     2.9 (0.0)


Figure 6.1: TPR versus FPR (in %) for all algorithms (GLOSS, GLOSS-D, SLDA, PLDA) and all simulations.

Table 6.2: Average TPR and FPR (in %) computed over 25 repetitions.

            Simulation 1      Simulation 2      Simulation 3      Simulation 4
            TPR     FPR       TPR     FPR       TPR     FPR       TPR     FPR

PLDA        99.0    78.2      96.9    60.3      98.0    15.9      74.3    65.6
SLDA        73.9    38.5      33.8    16.3      41.6    27.8      50.7    39.5
GLOSS       64.1    10.6      30.0     4.6      51.1    18.2      26.0    12.1
GLOSS-D     93.5    39.4      92.1    28.1      95.6    65.5      42.9    29.9

The adequacy of the selected features was assessed by the True Positive Rate (TPR) and the False Positive Rate (FPR). The TPR is the proportion of truly relevant variables that are selected; similarly, the FPR is the proportion of irrelevant variables that are selected. The best algorithm would be the one that selects all the relevant variables and rejects all the others, that is, TPR = 1 and FPR = 0 simultaneously. PLDA has the best TPR but a terrible FPR, except in Simulation 3, where it dominates all the other methods. GLOSS has by far the best FPR, with an overall TPR slightly below SLDA. Results are displayed in Figure 6.1 and in Table 6.2 (both in percentages).
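For reference, these rates can be computed directly from the support of an estimated coefficient matrix; a minimal numpy sketch, assuming (as in these simulations) that the relevant variables are the first ones:

    import numpy as np

    def tpr_fpr(B, n_relevant=100):
        """TPR/FPR of variable selection from a p x (K-1) coefficient matrix B.

        A variable is 'selected' if its coefficients are non-zero in at least
        one discriminant direction (one row of B).
        """
        selected = np.any(B != 0, axis=1)
        relevant = np.zeros(B.shape[0], dtype=bool)
        relevant[:n_relevant] = True
        tpr = selected[relevant].mean()      # fraction of relevant variables selected
        fpr = selected[~relevant].mean()     # fraction of irrelevant variables selected
        return tpr, fpr
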

6.4 Gene Expression Data

We now compare GLOSS to PLDA and SLDA on three genomic datasets. The Nakayama² dataset contains 105 examples of 22,283 gene expressions for categorizing 10 soft tissue tumors. It was reduced to the 86 examples belonging to the 5 dominant categories (Witten and Tibshirani, 2011).

² http://www.broadinstitute.org/cancer/software/genepattern/datasets
³ http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS2736


Table 6.3: Experimental results for gene expression data: averages over 10 training/test set splits, with standard deviations, of the test error rates and of the number of selected variables.

                                     Err (%)          Var

Nakayama: n = 86, p = 22,283, K = 5
  PLDA                               20.95 (1.3)      10,478.7 (2,116.3)
  SLDA                               25.71 (1.7)         252.5 (3.1)
  GLOSS                              20.48 (1.4)         129.0 (18.6)

Ramaswamy: n = 198, p = 16,063, K = 14
  PLDA                               38.36 (6.0)      14,873.5 (720.3)
  SLDA                                  —                   —
  GLOSS                              20.61 (6.9)         372.4 (122.1)

Sun: n = 180, p = 54,613, K = 4
  PLDA                               33.78 (5.9)      21,634.8 (7,443.2)
  SLDA                               36.22 (6.5)         384.4 (16.5)
  GLOSS                              31.77 (4.5)          93.0 (93.6)

The Ramaswamy³ dataset contains 198 examples of 16,063 gene expressions for categorizing 14 classes of cancer. Finally, the Sun⁴ dataset contains 180 examples of 54,613 gene expressions for categorizing 4 classes of tumors.

Each dataset was split into a training set and a test set with respectively 75% and 25% of the examples. Parameter tuning is performed by 10-fold cross-validation, and the test performances are then evaluated. The process is repeated 10 times, with random choices of the training and test set split.

Test error rates and the number of selected variables are presented in Table 6.3. The results for the PLDA algorithm are extracted from Witten and Tibshirani (2011). The three methods have comparable prediction performances on the Nakayama and Sun datasets, but GLOSS performs better on the Ramaswamy data, where the SparseLDA package failed to return a solution due to numerical problems in the LARS-EN implementation. Regarding the number of selected variables, GLOSS is again much sparser than its competitors.

Finally, Figure 6.2 displays the projection of the observations for the Nakayama and Sun datasets in the first canonical planes estimated by GLOSS and SLDA. For the Nakayama dataset, groups 1 and 2 are well separated from the other ones in both representations, but GLOSS is more discriminant in the meta-cluster gathering groups 3 to 5. For the Sun dataset, SLDA suffers from a high collinearity of its first canonical variables, which renders the second one almost non-informative. As a result, group 1 is better separated in the first canonical plane with GLOSS.

⁴ http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS1962


Figure 6.2: 2D-representations of the Nakayama and Sun datasets based on the two first discriminant vectors provided by GLOSS and SLDA (Nakayama classes: synovial sarcoma, myxoid liposarcoma, dedifferentiated liposarcoma, myxofibrosarcoma, malignant fibrous histiocytoma; Sun classes: non-tumor, astrocytomas, glioblastomas, oligodendrogliomas). The big squares represent class means.


Figure 6.3: USPS digits "1" and "0".

6.5 Correlated Data

When the features are known to be highly correlated, the discrimination algorithm can be improved by using this information in the optimization problem. The structured variant of GLOSS presented in Section 5.6.4, S-GLOSS from now on, was conceived to easily introduce this prior knowledge.

The experiments described in this section are intended to illustrate the effect of combining the group-Lasso sparsity-inducing penalty with a quadratic penalty used as a surrogate of the unknown within-class variance matrix. This preliminary experiment does not include comparisons with other algorithms; more comprehensive experimental results have been left for future work.

For this illustration, we have used a subset of the USPS handwritten digit dataset, made of 16×16 pixel images representing digits from 0 to 9. For our purpose, we compare the discriminant direction that separates digits "1" and "0" computed with GLOSS and with S-GLOSS. The mean image of every digit is shown in Figure 6.3.

As in Section 5.6.4, we have encoded the pixel proximity relationships from Figure 5.2 in a penalty matrix ΩL, but this time on a 256-node graph. Introducing this new 256×256 Laplacian penalty matrix ΩL in the GLOSS algorithm is straightforward.
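Such a Laplacian can be assembled as sketched below for the 4-nearest-neighbor graph over the 16×16 pixel grid; this is a generic construction under that neighborhood assumption, not the exact code used here.

    import numpy as np

    def grid_laplacian(rows=16, cols=16):
        """Graph Laplacian L = D - A of the 4-neighbor grid over rows x cols pixels."""
        p = rows * cols
        A = np.zeros((p, p))
        for r in range(rows):
            for c in range(cols):
                i = r * cols + c
                if c + 1 < cols:                     # right neighbor
                    A[i, i + 1] = A[i + 1, i] = 1
                if r + 1 < rows:                     # bottom neighbor
                    A[i, i + cols] = A[i + cols, i] = 1
        return np.diag(A.sum(axis=1)) - A

    Omega_L = grid_laplacian()                       # 256 x 256 penalty matrix
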

The effect of this penalty is fairly evident in Figure 6.4, where the discriminant vector β resulting from a non-penalized execution of GLOSS is compared with the β resulting from a Laplace-penalized execution of S-GLOSS (without group-Lasso penalty). We perfectly distinguish the center of the digit "0" in the discriminant direction obtained by S-GLOSS, which is probably the most important element to discriminate both digits.

Figure 6.5 displays the discriminant direction β obtained by GLOSS and S-GLOSS for a non-zero group-Lasso penalty with an identical penalization parameter (λ = 0.3). Even if both solutions are sparse, the discriminant vector from S-GLOSS keeps connected pixels that allow strokes to be detected, and will probably provide better prediction results.


β for GLOSS (left) and β for S-GLOSS (right)

Figure 6.4: Discriminant direction between digits "1" and "0".

β for GLOSS with λ = 0.3 (left) and β for S-GLOSS with λ = 0.3 (right)

Figure 6.5: Sparse discriminant direction between digits "1" and "0".


Discussion

GLOSS is an efficient algorithm that performs sparse LDA based on the regression of class indicators. Our proposal is equivalent to a penalized LDA problem. This is, up to our knowledge, the first approach that enjoys this property in the multi-class setting. This relationship is also amenable to accommodate interesting constraints on the equivalent penalized LDA problem, such as imposing a diagonal structure of the within-class covariance matrix.

Computationally, GLOSS is based on an efficient active set strategy that is amenable to the processing of problems with a large number of variables. The inner optimization problem decouples the p×(K−1)-dimensional problem into (K−1) independent p-dimensional problems. The interaction between the (K−1) problems is relegated to the computation of the common adaptive quadratic penalty. The algorithm presented here is highly efficient in medium to high dimensional setups, which makes it a good candidate for the analysis of gene expression data.

The experimental results confirm the relevance of the approach, which behaves well compared to its competitors, regarding either its prediction abilities or its interpretability (sparsity). Generally, compared to the competing approaches, GLOSS provides extremely parsimonious discriminants without compromising prediction performances. Employing the same features in all discriminant directions enables the generation of models that are globally extremely parsimonious, with good prediction abilities. The resulting sparse discriminant directions also allow for visual inspection of data from the low-dimensional representations that can be produced.

The approach has many potential extensions that have not yet been implemented. A first line of development is to consider a broader class of penalties. For example, plain quadratic penalties can also be added to the group penalty to encode priors about the within-class covariance structure, in the spirit of the Penalized Discriminant Analysis of Hastie et al. (1995). Also, besides the group-Lasso, our framework can be customized to any penalty that is uniformly spread within groups, and many composite or hierarchical penalties that have been proposed for structured data meet this condition.


Part III

Sparse Clustering Analysis


Abstract

Clustering can be defined as the task of grouping samples such that all the elements belonging to one cluster are more "similar" to each other than to the objects belonging to the other groups. There are similarity measures for any data structure: database records, or even multimedia objects (audio, video). The similarity concept is closely related to the idea of distance, which is a specific dissimilarity.

Model-based clustering aims to describe a heterogeneous population with a probabilistic model that represents each group with its own distribution. Here, the distributions will be Gaussian, and the different populations are identified by different means and a common covariance matrix.

As in the supervised framework, traditional clustering techniques perform worse when the number of irrelevant features increases. In this part, we develop Mix-GLOSS, which builds on the supervised GLOSS algorithm to address unsupervised problems, resulting in a clustering mechanism with embedded feature selection.

Chapter 7 reviews different techniques for inducing sparsity in model-based clustering algorithms. The theory that motivates our original formulation of the EM algorithm is developed in Chapter 8, followed by the description of the algorithm in Chapter 9. Its performance is assessed and compared to other model-based sparse clustering mechanisms at the state of the art in Chapter 10.


7 Feature Selection in Mixture Models

7.1 Mixture Models

One of the most popular clustering algorithms is K-means, which aims to partition n observations into K clusters, each observation being assigned to the cluster with the nearest mean (MacQueen, 1967). A generalization of K-means can be made through probabilistic models, which represent K subpopulations by a mixture of distributions. Since their first use by Newcomb (1886) for the detection of outlier points, and 8 years later by Pearson (1894) to identify two separate populations of crabs, finite mixtures of distributions have been employed to model a wide variety of random phenomena. These models assume that measurements are taken from a set of individuals, each of which belongs to one out of a number of different classes, while any individual's particular class is unknown. Mixture models can thus address the heterogeneity of a population and are especially well suited to the problem of clustering.

7.1.1 Model

We assume that the observed data X = (x_1^T, ..., x_n^T)^T have been drawn identically from K different subpopulations in the domain R^p. The generative distribution is a finite mixture model, that is, the data are assumed to be generated from a compounded distribution whose density can be expressed as

f(x_i) = ∑_{k=1}^K π_k f_k(x_i) ,    ∀i ∈ {1, ..., n} ,

where K is the number of components, f_k are the densities of the components, and π_k are the mixture proportions (π_k ∈ ]0, 1[ ∀k and ∑_k π_k = 1). Mixture models transcribe that, given the proportions π_k and the distributions f_k for each class, the data are generated according to the following mechanism:

• y: each individual is allotted to a class according to a multinomial distribution with parameters π_1, ..., π_K;

• x: each x_i is assumed to arise from a random vector with probability density function f_k.

In addition, it is usually assumed that the component densities f_k belong to a parametric family of densities φ(·; θ_k). The density of the mixture can then be written as

f(x_i; θ) = ∑_{k=1}^K π_k φ(x_i; θ_k) ,    ∀i ∈ {1, ..., n} ,


where θ = (π_1, ..., π_K, θ_1, ..., θ_K) is the parameter of the model.

7.1.2 Parameter Estimation: The EM Algorithm

For the estimation of the parameters of the mixture model, Pearson (1894) used the method of moments to estimate the five parameters (µ_1, µ_2, σ_1², σ_2², π) of a univariate Gaussian mixture model with two components. That method required him to solve polynomial equations of degree nine. There are also graphical methods, maximum likelihood methods and Bayesian approaches.

The most widely used process to estimate the parameters is to maximize the log-likelihood using the EM algorithm, which is typically used to maximize the likelihood of models with latent variables for which no analytical solution is available (Dempster et al., 1977).

The EM algorithm iterates two steps, called the expectation step (E) and the maximization step (M). Each expectation step involves the computation of the likelihood expectation with respect to the hidden variables, while each maximization step estimates the parameters by maximizing the E-step expected likelihood.

Under mild regularity assumptions, this mechanism converges to a local maximum of the likelihood. However, the type of problems targeted is typically characterized by the existence of several local maxima, and global convergence cannot be guaranteed. In practice, the obtained solution depends on the initialization of the algorithm.

Maximum Likelihood Definitions

The likelihood is commonly expressed in its logarithmic version:

L(θ; X) = log ( ∏_{i=1}^n f(x_i; θ) )
        = ∑_{i=1}^n log ( ∑_{k=1}^K π_k f_k(x_i; θ_k) ) ,    (7.1)

where n is the number of samples, K is the number of components of the mixture (or number of clusters), and π_k are the mixture proportions.

To obtain maximum likelihood estimates, the EM algorithm works with the joint distribution of the observations x and the unknown latent variables y, which indicate the cluster membership of every sample. The pair z = (x, y) is called the complete data. The log-likelihood of the complete data is called the complete log-likelihood or classification log-likelihood:


L_C(θ; X, Y) = log ( ∏_{i=1}^n f(x_i, y_i; θ) )
             = ∑_{i=1}^n log ( ∑_{k=1}^K y_ik π_k f_k(x_i; θ_k) )
             = ∑_{i=1}^n ∑_{k=1}^K y_ik log ( π_k f_k(x_i; θ_k) ) .    (7.2)

The y_ik are the binary entries of the indicator matrix Y, with y_ik = 1 if observation i belongs to cluster k, and y_ik = 0 otherwise.

The soft membership t_ik(θ) is defined as

t_ik(θ) = p(Y_ik = 1 | x_i; θ)    (7.3)
        = π_k f_k(x_i; θ_k) / f(x_i; θ) .    (7.4)

To lighten notations, t_ik(θ) will be denoted t_ik when the parameter θ is clear from the context. The regular (7.1) and complete (7.2) log-likelihoods are related as follows:

L_C(θ; X, Y) = ∑_{ik} y_ik log ( π_k f_k(x_i; θ_k) )
             = ∑_{ik} y_ik log ( t_ik f(x_i; θ) )
             = ∑_{ik} y_ik log t_ik + ∑_{ik} y_ik log f(x_i; θ)
             = ∑_{ik} y_ik log t_ik + ∑_{i=1}^n log f(x_i; θ)
             = ∑_{ik} y_ik log t_ik + L(θ; X) ,    (7.5)

where ∑_{ik} y_ik log t_ik can be reformulated as

∑_{ik} y_ik log t_ik = ∑_{i=1}^n ∑_{k=1}^K y_ik log ( p(Y_ik = 1 | x_i; θ) )
                     = ∑_{i=1}^n log ( p(y_i | x_i; θ) )
                     = log ( p(Y | X; θ) ) .

As a result, the relationship (7.5) can be rewritten as

L(θ; X) = L_C(θ; Z) − log ( p(Y | X; θ) ) .    (7.6)


Likelihood Maximization

The complete log-likelihood cannot be assessed because the variables y_ik are unknown. However, it is possible to estimate the value of the log-likelihood by taking expectations in (7.6), conditionally on a current value θ^(t):

L(θ; X) = Q(θ, θ^(t)) + H(θ, θ^(t)) ,  with
Q(θ, θ^(t)) = E_{Y∼p(·|X;θ^(t))} [ L_C(θ; X, Y) ]  and  H(θ, θ^(t)) = E_{Y∼p(·|X;θ^(t))} [ −log p(Y | X; θ) ] .

In this expression, H(θ, θ^(t)) is the entropy and Q(θ, θ^(t)) is the conditional expectation of the complete log-likelihood. Let us define an increment of the log-likelihood as ∆L = L(θ^(t+1); X) − L(θ^(t); X). Then θ^(t+1) = argmax_θ Q(θ, θ^(t)) also increases the log-likelihood:

∆L = ( Q(θ^(t+1), θ^(t)) − Q(θ^(t), θ^(t)) ) − ( H(θ^(t+1), θ^(t)) − H(θ^(t), θ^(t)) ) ,

where the first difference is ≥ 0 by definition of iteration t+1, and the second is ≤ 0 by Jensen's inequality.

Therefore, it is possible to maximize the likelihood by optimizing Q(θ, θ^(t)). The relationship between Q(θ, θ′) and L(θ; X) is developed in deeper detail in Appendix F, to show how the value of L(θ; X) can be recovered from Q(θ, θ^(t)).

For the mixture model problem, Q(θ, θ′) is

Q(θ, θ′) = E_{Y∼p(Y|X;θ′)} [ L_C(θ; X, Y) ]
         = ∑_{ik} p(Y_ik = 1 | x_i; θ′) log ( π_k f_k(x_i; θ_k) )
         = ∑_{i=1}^n ∑_{k=1}^K t_ik(θ′) log ( π_k f_k(x_i; θ_k) ) .    (7.7)

Q(θ, θ′), due to its similarity to the expression of the complete likelihood (7.2), is also known as the weighted likelihood. In (7.7), the weights t_ik(θ′) are the posterior probabilities of cluster memberships.

Hence, the EM algorithm sketched above results in:

• Initialization (not iterated): choice of the initial parameter θ^(0);

• E-Step: evaluation of Q(θ, θ^(t)), using t_ik(θ^(t)) (7.4) in (7.7);

• M-Step: calculation of θ^(t+1) = argmax_θ Q(θ, θ^(t)).


Gaussian Model

In the particular case of a Gaussian mixture model with a common covariance matrix Σ and different mean vectors µ_k, the mixture density is

f(x_i; θ) = ∑_{k=1}^K π_k f_k(x_i; θ_k)
          = ∑_{k=1}^K π_k (2π)^{−p/2} |Σ|^{−1/2} exp{ −(1/2) (x_i − µ_k)^T Σ^{−1} (x_i − µ_k) } .

At the E-step, the posterior probabilities t_ik are computed as in (7.4) with the current parameters θ^(t); then, the M-step maximizes Q(θ, θ^(t)) (7.7), whose form is as follows:

Q(θ, θ^(t)) = ∑_{ik} t_ik log(π_k) − ∑_{ik} t_ik log( (2π)^{p/2} |Σ|^{1/2} ) − (1/2) ∑_{ik} t_ik (x_i − µ_k)^T Σ^{−1} (x_i − µ_k)
            = ∑_k t_k log(π_k) − (np/2) log(2π)   [constant term]   − (n/2) log(|Σ|) − (1/2) ∑_{ik} t_ik (x_i − µ_k)^T Σ^{−1} (x_i − µ_k)
            ≡ ∑_k t_k log(π_k) − (n/2) log(|Σ|) − ∑_{ik} t_ik ( (1/2) (x_i − µ_k)^T Σ^{−1} (x_i − µ_k) ) ,    (7.8)

where

t_k = ∑_{i=1}^n t_ik .    (7.9)

The M-step, which maximizes this expression with respect to θ, applies the following updates, defining θ^(t+1):

π_k^(t+1) = t_k / n ,    (7.10)
µ_k^(t+1) = ( ∑_i t_ik x_i ) / t_k ,    (7.11)
Σ^(t+1) = (1/n) ∑_k W_k ,    (7.12)
with  W_k = ∑_i t_ik (x_i − µ_k)(x_i − µ_k)^T .    (7.13)

The derivations are detailed in Appendix G.
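To make these updates concrete, here is a minimal, self-contained numpy sketch of this EM for a Gaussian mixture with a common covariance matrix, implementing (7.4) and (7.9)–(7.13). It is a plain illustration, not the Mix-GLOSS implementation; the initialization, the small ridge added to Σ and the fixed number of iterations are arbitrary choices.

    import numpy as np

    def em_gaussian_mixture(X, K, n_iter=100, rng=None):
        """EM for a Gaussian mixture with common covariance (updates 7.10-7.13)."""
        rng = np.random.default_rng(rng)
        n, p = X.shape
        mu = X[rng.choice(n, K, replace=False)]              # random initial centroids
        Sigma = np.cov(X, rowvar=False) + 1e-6 * np.eye(p)
        pi = np.full(K, 1.0 / K)
        for _ in range(n_iter):
            # E-step: posterior probabilities t_ik (7.4), via log densities for stability
            Sinv = np.linalg.inv(Sigma)
            _, logdet = np.linalg.slogdet(Sigma)
            diff = X[:, None, :] - mu[None, :, :]            # n x K x p
            maha = np.einsum('nkp,pq,nkq->nk', diff, Sinv, diff)
            logt = np.log(pi) - 0.5 * (maha + logdet + p * np.log(2 * np.pi))
            logt -= logt.max(axis=1, keepdims=True)
            T = np.exp(logt)
            T /= T.sum(axis=1, keepdims=True)
            # M-step: (7.9)-(7.13)
            tk = T.sum(axis=0)
            pi = tk / n                                      # (7.10)
            mu = (T.T @ X) / tk[:, None]                     # (7.11)
            Sigma = np.zeros((p, p))
            for k in range(K):                               # (7.12)-(7.13)
                d = X - mu[k]
                Sigma += (T[:, k, None] * d).T @ d
            Sigma /= n
        return pi, mu, Sigma, T
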

7.2 Feature Selection in Model-Based Clustering

When common covariance matrices are assumed, Gaussian mixtures are related to LDA, with partitions defined by linear decision rules. When every cluster has its own covariance matrix Σ_k, Gaussian mixtures are associated with quadratic discriminant analysis (QDA), with quadratic boundaries.

In the high-dimensional, low-sample setting, numerical issues appear in the estimation of the covariance matrix. To avoid those singularities, regularization may be applied. A regularized trade-off between LDA and QDA (RDA) was proposed by Friedman (1989). Bensmail and Celeux (1996) extended this algorithm by rewriting the covariance matrix in terms of its eigenvalue decomposition, Σ_k = λ_k D_k A_k D_k^T (Banfield and Raftery, 1993). These regularization schemes address singularity and stability issues, but they do not induce parsimonious models.

In this chapter, we review some techniques to induce sparsity in model-based clustering algorithms. This sparsity refers to the rule that assigns examples to classes: clustering is still performed in the original p-dimensional space, but the decision rule can be expressed with only a few coordinates of this high-dimensional space.

7.2.1 Based on Penalized Likelihood

Penalized log-likelihood maximization is a popular estimation technique for mixture models. It is typically achieved by the EM algorithm, using mixture models for which the allocation of examples is expressed as a simple function of the input features. For example, for Gaussian mixtures with a common covariance matrix, the log-ratio of posterior probabilities is a linear function of x:

log ( p(Y_k = 1 | x) / p(Y_ℓ = 1 | x) ) = x^T Σ^{−1} (µ_k − µ_ℓ) − (1/2) (µ_k + µ_ℓ)^T Σ^{−1} (µ_k − µ_ℓ) + log(π_k / π_ℓ) .

In this model, a simple way of introducing sparsity in the discriminant vectors Σ^{−1}(µ_k − µ_ℓ) is to constrain Σ to be diagonal and to favor sparse means µ_k. Indeed, for Gaussian mixtures with a common diagonal covariance matrix, if all means have the same value on dimension j, then variable j is useless for class allocation and can be discarded. The means can be penalized by the L1 norm:

λ ∑_{k=1}^K ∑_{j=1}^p |µ_kj| ,

as proposed by Pan et al. (2006) and Pan and Shen (2007). Zhou et al. (2009) consider more complex penalties on full covariance matrices:

λ_1 ∑_{k=1}^K ∑_{j=1}^p |µ_kj| + λ_2 ∑_{k=1}^K ∑_{j=1}^p ∑_{m=1}^p |(Σ_k^{−1})_{jm}| .

In their algorithm, they make use of the graphical Lasso to estimate the covariances. Even if their formulation induces sparsity on the parameters, their combination of L1 penalties does not directly target decision rules based on few variables, and thus does not guarantee parsimonious models.


Guo et al. (2010) propose a variation with a Pairwise Fusion Penalty (PFP):

λ ∑_{j=1}^p ∑_{1≤k<k′≤K} |µ_kj − µ_k′j| .

This PFP regularization does not shrink the means to zero, but towards each other. If the jth components of all cluster means are driven to the same value, that variable can be considered as non-informative.

An L1,∞ penalty is used by Wang and Zhu (2008) and Kuan et al. (2010) to penalize the likelihood, encouraging null groups of features:

λ ∑_{j=1}^p ‖(µ_1j, µ_2j, ..., µ_Kj)‖_∞ .

One group is defined for each variable j, as the set of the K means' jth components (µ_1j, ..., µ_Kj). The L1,∞ penalty forces zeros at the group level, favoring the removal of the corresponding feature. This method seems to produce parsimonious models and good partitions within a reasonable computing time; in addition, the code is publicly available. Xie et al. (2008b) apply a group-Lasso penalty. Their principle describes a vertical mean grouping (VMG, with the same groups as Xie et al. (2008a)) and a horizontal mean grouping (HMG). VMG allows real feature selection, because it forces null values for the same variable in all cluster means:

λ √K ∑_{j=1}^p ( ∑_{k=1}^K µ_kj² )^{1/2} .

The clustering algorithm of VMG differs from ours, but the group penalty proposed is the same; however, no code is available on the authors' website that would allow testing.
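In code, this vertical grouping penalty is simply the sum of the column-wise Euclidean norms of the K×p matrix of cluster means; a short numpy sketch (the function name and interface are illustrative):

    import numpy as np

    def vmg_penalty(mu, lam):
        """Vertical mean grouping penalty: lam * sqrt(K) * sum_j ||(mu_1j, ..., mu_Kj)||_2."""
        K = mu.shape[0]                                  # mu is the K x p matrix of cluster means
        return lam * np.sqrt(K) * np.linalg.norm(mu, axis=0).sum()
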

The optimization of a penalized likelihood by means of an EM algorithm can be reformulated by rewriting the maximization expressions of the M-step as a penalized optimal scoring regression. Roth and Lange (2004) implemented it for two-cluster problems, using an L1 penalty to encourage sparsity of the discriminant vector; the generalization from quadratic to non-quadratic penalties is quickly outlined in their work. We extend this work by considering an arbitrary number of clusters and by formalizing the link between penalized optimal scoring and penalized likelihood estimation.

7.2.2 Based on Model Variants

The algorithm proposed by Law et al. (2004) takes a different stance. The authors define feature relevancy considering conditional independence: the jth feature is presumed uninformative if its distribution is independent of the class labels. The density is expressed as


f(x_i | φ, π, θ, ν) = ∑_{k=1}^K π_k ∏_{j=1}^p [ f(x_ij | θ_jk) ]^{φ_j} [ h(x_ij | ν_j) ]^{1−φ_j} ,

where f(·|θ_jk) is the distribution function for relevant features and h(·|ν_j) is the distribution function for the irrelevant ones. The binary vector φ = (φ_1, φ_2, ..., φ_p) represents relevance, with φ_j = 1 if the jth feature is informative and φ_j = 0 otherwise. The saliency of variable j is then formalized as ρ_j = P(φ_j = 1), so all φ_j must be treated as missing variables. Thus, the set of parameters is {π_k, θ_jk, ν_j, ρ_j}; their estimation is done by means of the EM algorithm (Law et al., 2004).

An original and recent technique is the Fisher-EM algorithm proposed by Bouveyron and Brunet (2012b,a). Fisher-EM is a modified version of EM that runs in a latent space. This latent space is defined by an orthogonal projection matrix U ∈ R^{p×(K−1)}, which is updated inside the EM loop with a new step, called the Fisher step (F-step from now on), which maximizes a multi-class Fisher's criterion

tr( (U^T Σ_W U)^{−1} U^T Σ_B U ) ,    (7.14)

so as to maximize the separability of the data. The E-step is the standard one, computing the posterior probabilities. Then, the F-step updates the projection matrix that projects the data into the latent space. Finally, the M-step estimates the parameters by maximizing the conditional expectation of the complete log-likelihood. Those parameters can be rewritten as a function of the projection matrix U and of the model parameters in the latent space, such that the U matrix enters into the M-step equations.

To induce feature selection, Bouveyron and Brunet (2012a) suggest three possibilities. The first one results in the best sparse orthogonal approximation Ũ of the matrix U which maximizes (7.14). This sparse approximation is defined as the solution of

min_{Ũ ∈ R^{p×(K−1)}} ‖X_U − XŨ‖²_F + λ ∑_{k=1}^{K−1} ‖ũ_k‖_1 ,

where X_U = XU is the input data projected in the non-sparse space and ũ_k is the kth column vector of the projection matrix Ũ. The second possibility is inspired by Qiao et al. (2009) and reformulates the Fisher's discriminant (7.14) used to compute the projection matrix as a regression criterion penalized by a mixture of Lasso and Elastic net:

min_{A,B ∈ R^{p×(K−1)}} ∑_{k=1}^K ‖R_W^{−T} H_{B,k} − A B^T H_{B,k}‖²_2 + ρ ∑_{j=1}^{K−1} β_j^T Σ_W β_j + λ ∑_{j=1}^{K−1} ‖β_j‖_1
s.t. A^T A = I_{K−1} ,

where H_B ∈ R^{p×K} is a matrix defined conditionally on the posterior probabilities t_ik, satisfying H_B H_B^T = Σ_B, and H_{B,k} is the kth column of H_B. R_W ∈ R^{p×p} is an upper triangular matrix resulting from the Cholesky decomposition of Σ_W. Σ_W and Σ_B are the p×p within-class and between-class covariance matrices in the observation space. A ∈ R^{p×(K−1)} and B ∈ R^{p×(K−1)} are the solutions of the optimization problem, such that B = [β_1, ..., β_{K−1}] is the best sparse approximation of U.

The last possibility computes the solution of the Fisher's discriminant (7.14) as the solution of the following constrained optimization problem:

min_{U ∈ R^{p×(K−1)}} ∑_{j=1}^p ‖Σ_{B,j} − U U^T Σ_{B,j}‖²_2    s.t. U^T U = I_{K−1} ,

where Σ_{B,j} is the jth column of the between-class covariance matrix in the observation space. This problem can be solved by the penalized version of the singular value decomposition proposed by Witten et al. (2009), resulting in a sparse approximation of U.

To comply with the constraint stating that the columns of U are orthogonal, the first and the second options must be followed by a singular value decomposition of U to recover orthogonality. This is not necessary with the third option, since the penalized version of SVD already guarantees orthogonality.

However, there is a lack of guarantees regarding convergence. Bouveyron states: "the update of the orientation matrix U in the F-step is done by maximizing the Fisher criterion and not by directly maximizing the expected complete log-likelihood as required in the EM algorithm theory. From this point of view, the convergence of the Fisher-EM algorithm cannot therefore be guaranteed." Immediately after this paragraph, we can read that under certain suppositions their algorithm converges: "the model [...] which assumes the equality and the diagonality of covariance matrices, the F-step of the Fisher-EM algorithm satisfies the convergence conditions of the EM algorithm theory and the convergence of the Fisher-EM algorithm can be guaranteed in this case. For the other discriminant latent mixture models, although the convergence of the Fisher-EM procedure cannot be guaranteed, our practical experience has shown that the Fisher-EM algorithm rarely fails to converge with these models if correctly initialized."

7.2.3 Based on Model Selection

Some clustering algorithms recast the feature selection problem as a model selection problem. Following this idea, Raftery and Dean (2006) model the observations as a mixture of Gaussian distributions. To discover a subset of relevant features (and its superfluous complementary subset), they define three subsets of variables:

• X^(1): the set of selected relevant variables;

• X^(2): the set of variables being considered for inclusion in or exclusion from X^(1);

• X^(3): the set of non-relevant variables.


With those subsets, they define two different models, where Y is the partition to consider:

• M1:
  f(X | Y) = f(X^(1), X^(2), X^(3) | Y) = f(X^(3) | X^(2), X^(1)) f(X^(2) | X^(1)) f(X^(1) | Y)

• M2:
  f(X | Y) = f(X^(1), X^(2), X^(3) | Y) = f(X^(3) | X^(2), X^(1)) f(X^(2), X^(1) | Y)

Model M1 means that the variables in X^(2) are independent of the clustering Y; model M2 states that the variables in X^(2) depend on the clustering Y. To simplify the algorithm, the subset X^(2) is only updated one variable at a time. Therefore, deciding the relevance of variable X^(2) amounts to a model selection between M1 and M2. The selection is done via the Bayes factor

B_12 = f(X | M1) / f(X | M2) ,

where the high-dimensional term f(X^(3) | X^(2), X^(1)) cancels from the ratio:

B_12 = f(X^(1), X^(2), X^(3) | M1) / f(X^(1), X^(2), X^(3) | M2)
     = [ f(X^(2) | X^(1), M1) f(X^(1) | M1) ] / f(X^(2), X^(1) | M2) .

This factor is approximated, since the integrated likelihoods f(X^(1) | M1) and f(X^(2), X^(1) | M2) are difficult to calculate exactly; Raftery and Dean (2006) use the BIC approximation. The computation of f(X^(2) | X^(1), M1), when there is only one variable in X^(2), can be represented as a linear regression of variable X^(2) on the variables in X^(1), and there is also a BIC approximation for this term.

Maugis et al. (2009a) have proposed a variation of the algorithm developed by Raftery and Dean. They define three subsets of variables: the relevant and irrelevant subsets (X^(1) and X^(3)) remain the same, but X^(2) is reformulated as a subset of relevant variables that explains the irrelevance through a multidimensional regression. This algorithm also uses a backward stepwise strategy instead of the forward stepwise search used by Raftery and Dean (2006). Their algorithm allows the definition of blocks of indivisible variables that, in certain situations, improve the clustering and its interpretability.

Both algorithms are well motivated and appear to produce good results; however, the quantity of computation needed to test the different subsets of variables requires a huge computation time. In practice, they cannot be used for the amount of data considered in this thesis.


8 Theoretical Foundations

In this chapter, we develop Mix-GLOSS, which uses the GLOSS algorithm conceived for supervised classification (see Chapter 5) to solve clustering problems. The goal here is similar, that is, providing an assignment of examples to clusters based on few features.

We use a modified version of the EM algorithm whose M-step is formulated as a penalized linear regression of a scaled indicator matrix, that is, a penalized optimal scoring problem. This idea was originally proposed by Hastie and Tibshirani (1996) to perform reduced-rank decision rules using less than K−1 discriminant directions. Their motivation was mainly driven by stability issues; no sparsity-inducing mechanism was introduced in the construction of discriminant directions. Roth and Lange (2004) pursued this idea for binary clustering problems, where sparsity was introduced by a Lasso penalty applied to the OS problem. Besides extending the work of Roth and Lange (2004) to an arbitrary number of clusters, we draw links between the OS penalty and the parameters of the Gaussian model.

In the subsequent sections, we provide the principles that allow the M-step to be solved as an optimal scoring problem. The feature selection technique is embedded by means of a group-Lasso penalty. We must then guarantee that the equivalence between the M-step and the OS problem holds for our penalty. As with GLOSS, this is accomplished with a variational approach of the group-Lasso. Finally, some considerations regarding the criterion that is optimized with this modified EM are provided.

8.1 Resolving EM with Optimal Scoring

In the previous chapters, EM was presented as an iterative algorithm that computes a maximum likelihood estimate through the maximization of the expected complete log-likelihood. This section explains how a penalized OS regression embedded into an EM algorithm produces a penalized likelihood estimate.

8.1.1 Relationship Between the M-Step and Linear Discriminant Analysis

LDA is typically used in a supervised learning framework, for classification and dimension reduction. It looks for a projection of the data where the ratio of between-class variance to within-class variance is maximized (see Appendix C). Classification in the LDA domain is based on the Mahalanobis distance

d(x_i, µ_k) = (x_i − µ_k)^T Σ_W^{−1} (x_i − µ_k) ,

where µ_k are the p-dimensional centroids and Σ_W is the p×p common within-class covariance matrix.


The likelihood equations in the M-step, (7.11) and (7.12), can be interpreted as the mean and covariance estimates of a weighted and augmented LDA problem (Hastie and Tibshirani, 1996), where the n observations are replicated K times and weighted by t_ik (the posterior probabilities computed at the E-step).

Having replicated the data vectors, Hastie and Tibshirani (1996) remark that the parameters maximizing the mixture likelihood in the M-step of the EM algorithm, (7.11) and (7.12), can also be defined as the maximizers of the weighted and augmented likelihood

2 l_weight(µ, Σ) = − ∑_{i=1}^n ∑_{k=1}^K t_ik d(x_i, µ_k) − n log(|Σ_W|) ,

tikd(ximicrok)minus n log(|ΣW|)

which arises when considering a weighted and augmented LDA problem This viewpointprovides the basis for an alternative maximization of penalized maximum likelihood inGaussian mixtures

8.1.2 Relationship Between Optimal Scoring and Linear Discriminant Analysis

The equivalence between penalized optimal scoring problems and penalized linear discriminant analysis has already been detailed in Section 4.1, in the supervised learning framework. It is a critical part of the link between the M-step of an EM algorithm and optimal scoring regression.

8.1.3 Clustering Using Penalized Optimal Scoring

The solution of the penalized optimal scoring regression in the M-step is a coefficient matrix B_OS, analytically related to the Fisher's discriminative directions B_LDA for the data (X, Y), where Y is the current (hard or soft) cluster assignment. In order to compute the posterior probabilities t_ik in the E-step, the distance between the samples x_i and the centroids µ_k must be evaluated. Depending on whether we are working in the input domain, the OS domain or the LDA domain, different expressions are used for the distances (see Section 4.2.2 for more details). Mix-GLOSS works in the LDA domain, based on the following expression:

d(x_i, µ_k) = ‖(x_i − µ_k) B_LDA‖²_2 − 2 log(π_k) .

This distance defines the computation of the posterior probabilities t_ik in the E-step (see Section 4.2.3). Putting together all those elements, the complete clustering algorithm can be summarized as follows:


1. Initialize the membership matrix Y (for example by the K-means algorithm).

2. Solve the p-OS problem as

   B_OS = (X^T X + λΩ)^{−1} X^T Y Θ ,

   where Θ are the K−1 leading eigenvectors of Y^T X (X^T X + λΩ)^{−1} X^T Y.

3. Map X to the LDA domain: X_LDA = X B_OS D, with D = diag( α_k^{−1} (1 − α_k²)^{−1/2} ).

4. Compute the centroids M in the LDA domain.

5. Evaluate the distances in the LDA domain.

6. Translate the distances into posterior probabilities t_ik with

   t_ik ∝ exp[ −( d(x, µ_k) − 2 log(π_k) ) / 2 ] .    (8.1)

7. Update the labels using the posterior probabilities matrix: Y = T.

8. Go back to step 2 and iterate until the t_ik converge.

Items 2 to 5 can be interpreted as the M-step and Item 6 as the E-step in this alternative view of the EM algorithm for Gaussian mixtures.
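As an illustration of step 2 alone, the following numpy sketch solves the p-OS problem for a fixed quadratic penalty matrix Ω, so it covers the ridge-type case only, not the group-Lasso path computed by GLOSS; the exact normalization of Y and of the scores α_k used to build D is not reproduced here and should be treated as an assumption.

    import numpy as np

    def penalized_optimal_scoring(X, Y, Omega, lam):
        """B_OS = (X'X + lam*Omega)^{-1} X'Y Theta, with Theta the K-1 leading
        eigenvectors of Y'X (X'X + lam*Omega)^{-1} X'Y (step 2 above)."""
        K = Y.shape[1]
        M = X.T @ X + lam * Omega
        XtY = X.T @ Y
        A = Y.T @ X @ np.linalg.solve(M, XtY)              # K x K symmetric matrix
        evals, evecs = np.linalg.eigh(A)
        order = np.argsort(evals)[::-1][:K - 1]            # K-1 leading eigenvectors
        Theta = evecs[:, order]
        alpha = np.sqrt(np.clip(evals[order], 0.0, None))  # scores for D, up to normalization
        B_OS = np.linalg.solve(M, XtY @ Theta)
        return B_OS, Theta, alpha
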

8.1.4 From Sparse Optimal Scoring to Sparse Linear Discriminant Analysis

In the previous section, we sketched a clustering algorithm that replaces the M-step with a penalized OS regression. This modified version of EM holds for any quadratic penalty. We extend this equivalence to sparsity-inducing penalties through the quadratic variational approach to the group-Lasso provided in Section 4.3. We now look for a formal equivalence between this penalty and penalized maximum likelihood for Gaussian mixtures.

8.2 Optimized Criterion

In the classical EM for Gaussian mixtures, the M-step maximizes the weighted likelihood Q(θ, θ′) (7.7), so as to maximize the likelihood L(θ) (see Section 7.1.2). Replacing the M-step by a penalized optimal scoring problem is possible, and the link between the penalized optimal scoring problem and penalized LDA holds, but it remains to relate this penalized LDA problem to a penalized maximum likelihood criterion for the Gaussian mixture.

This penalized likelihood cannot be rigorously interpreted as a maximum a posteriori criterion, in particular because the penalty only operates on the covariance matrix Σ (there is no prior on the means and proportions of the mixture). We however believe that the Bayesian interpretation provides some insight, and we detail it in what follows.

8.2.1 A Bayesian Derivation

This section sketches a Bayesian treatment of inference, limited to our present needs, where penalties are to be interpreted as prior distributions over the parameters of the probabilistic model to be estimated. Further details can be found in Bishop (2006, Section 2.3.6) and in Gelman et al. (2003, Section 3.6).

The model proposed in this thesis considers a classical maximum likelihood estimation for the means and a penalized common covariance matrix. This penalization can be interpreted as arising from a prior on this parameter.

The prior over the covariance matrix of a Gaussian variable is classically expressed as a Wishart distribution, since it is a conjugate prior:

f(Σ | Λ_0, ν_0) = [ 2^{np/2} |Λ_0|^{n/2} Γ_p(n/2) ]^{−1} |Σ^{−1}|^{(ν_0 − p − 1)/2} exp{ −(1/2) tr(Λ_0^{−1} Σ^{−1}) } ,

where ν_0 is the number of degrees of freedom of the distribution, Λ_0 is a p×p scale matrix, and Γ_p is the multivariate gamma function, defined as

Γ_p(n/2) = π^{p(p−1)/4} ∏_{j=1}^p Γ( n/2 + (1 − j)/2 ) .

The posterior distribution can be maximized, similarly to the likelihood, through the maximization of

Q(θ, θ′) + log( f(Σ | Λ_0, ν_0) )
  = ∑_{k=1}^K t_k log π_k − ((n+1)p/2) log 2 − (n/2) log|Λ_0| − (p(p+1)/4) log(π)
    − ∑_{j=1}^p log Γ( n/2 + (1 − j)/2 ) − ((ν_n − p − 1)/2) log|Σ| − (1/2) tr( Λ_n^{−1} Σ^{−1} )
  ≡ ∑_{k=1}^K t_k log π_k − (n/2) log|Λ_0| − ((ν_n − p − 1)/2) log|Σ| − (1/2) tr( Λ_n^{−1} Σ^{−1} ) ,    (8.2)

with

t_k = ∑_{i=1}^n t_ik ,    ν_n = ν_0 + n ,    Λ_n^{−1} = Λ_0^{−1} + S_0 ,    S_0 = ∑_{i=1}^n ∑_{k=1}^K t_ik (x_i − µ_k)(x_i − µ_k)^T .

Details of these calculations can be found in textbooks (for example, Bishop, 2006; Gelman et al., 2003).

8.2.2 Maximum a Posteriori Estimator

The maximization of (8.2) with respect to µ_k and π_k is of course not affected by the additional prior term, where only the covariance Σ intervenes. The MAP estimator for Σ is simply obtained by differentiating (8.2) with respect to Σ. The details of the calculations follow the same lines as the ones for maximum likelihood detailed in Appendix G. The resulting estimator for Σ is

Σ_MAP = (1 / (ν_0 + n − p − 1)) ( Λ_0^{−1} + S_0 ) ,    (8.3)

where S_0 is the matrix defined in Equation (8.2). The maximum a posteriori estimator of the within-class covariance matrix (8.3) can thus be identified with the penalized within-class variance (4.19) resulting from the p-OS regression (4.16a), if ν_0 is chosen to be p + 1 and Λ_0^{−1} = λΩ, where Ω is the penalty matrix from the group-Lasso regularization (4.25).
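Numerically, this identification is a one-liner; a small sketch, assuming S_0 is the soft within-class scatter defined in (8.2) and Ω the penalty matrix:

    import numpy as np

    def sigma_map(S0, Omega, lam, n, p):
        """MAP covariance (8.3) with nu_0 = p + 1 and Lambda_0^{-1} = lam * Omega."""
        nu0 = p + 1
        return (lam * Omega + S0) / (nu0 + n - p - 1)   # the denominator simplifies to n
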


9 Mix-GLOSS Algorithm

Mix-GLOSS is an algorithm for unsupervised classification that embeds feature selection, resulting in parsimonious decision rules. It is based on the GLOSS algorithm developed in Chapter 5, which has been adapted for clustering. In this chapter, I describe the details of the implementation of Mix-GLOSS and of the model selection mechanism.

9.1 Mix-GLOSS

The implementation of Mix-GLOSS involves three nested loops, as schemed in Figure 9.1. The inner one is an EM algorithm that, for a given value of the regularization parameter λ, iterates between an M-step, where the parameters of the model are estimated, and an E-step, where the corresponding posterior probabilities are computed. The main outputs of the EM are the coefficient matrix B, which projects the input data X onto the best subspace (in Fisher's sense), and the posteriors t_ik.

When several values of the penalty parameter are tested, we give them to the algorithm in ascending order, and the algorithm is initialized with the solution found for the previous λ value. This process continues until all the penalty parameter values have been tested, if a vector of penalty parameters was provided, or until a given sparsity is achieved, as measured by the number of variables estimated to be relevant.

The outer loop implements complete repetitions of the clustering algorithm for all the penalty parameter values, with the purpose of choosing the best execution. This loop alleviates local minima issues by resorting to multiple initializations of the partition.

9.1.1 Outer Loop: Whole Algorithm Repetitions

This loop performs a user-defined number of repetitions of the clustering algorithm. It takes as inputs:

• the centered n×p feature matrix X;

• the vector of penalty parameter values to be tried (an option is to provide an empty vector and let the algorithm set trial values automatically);

• the number of clusters K;

• the maximum number of iterations for the EM algorithm;

• the convergence tolerance for the EM algorithm;

• the number of whole repetitions of the clustering algorithm;


Figure 9.1: Mix-GLOSS loops scheme.

• a p×(K−1) initial coefficient matrix (optional);

• an n×K initial posterior probability matrix (optional).

For each algorithm repetition, an initial label matrix Y is needed. This matrix may contain either hard or soft assignments. If no such matrix is available, K-means is used to initialize the process. If we have an initial guess for the coefficient matrix B, it can also be fed into Mix-GLOSS to warm-start the process.

9.1.2 Penalty Parameter Loop

The penalty parameter loop goes through all the values of the input vector λ. These values are sorted in ascending order, such that the resulting B and Y matrices can be used to warm-start the EM loop for the next value of the penalty parameter. If some λ value results in a null coefficient matrix, the algorithm halts. We have checked that the warm-start implemented reduces the computation time by a factor of 8 with respect to using a null B matrix and a K-means execution for the initial Y label matrix.

Mix-GLOSS may be fed with an empty vector of penalty parameters, in which case a first non-penalized execution of Mix-GLOSS is done, and its resulting coefficient matrix B and posterior matrix Y are used to estimate a trial value of λ that should remove about 10% of the relevant features. This estimation is repeated until a minimum number of relevant variables is reached. The parameter that sets the estimated percentage of variables to be removed with the next penalty parameter can be modified to make feature selection more or less aggressive.

Algorithm 2 details the implementation of the automatic selection of the penalty parameter. If the alternate variational approach from Appendix D is used, we have to replace Equation (4.32b) by (D.10b).

Algorithm 2 Automatic selection of λ

Input: X, K, λ = ∅, minVAR
Initialize:
  B ← 0
  Y ← K-means(X, K)
Run non-penalized Mix-GLOSS:
  λ ← 0
  (B, Y) ← Mix-GLOSS(X, K, B, Y, λ)
lastLAMBDA ← false
repeat
  {Estimate λ}
  Compute the gradient at β^j = 0:
    ∂J(B)/∂β^j |_{β^j=0} = (x^j)^T ( ∑_{m≠j} x^m β^m − YΘ )
  Compute λ_max for every feature using (4.32b):
    λ_max^j = (1 / w_j) ‖ ∂J(B)/∂β^j |_{β^j=0} ‖_2
  Choose λ so as to remove 10% of the relevant features
  {Run penalized Mix-GLOSS}
  (B, Y) ← Mix-GLOSS(X, K, B, Y, λ)
  if the number of relevant variables in B > minVAR then
    lastLAMBDA ← false
  else
    lastLAMBDA ← true
  end if
until lastLAMBDA

Output: B, L(θ), t_ik, π_k, µ_k, Σ, Y for every λ in the solution path

9.1.3 Inner Loop: EM Algorithm

The inner loop implements the actual clustering algorithm by means of successive maximizations of a penalized likelihood criterion. Once convergence of the posterior probabilities t_ik is achieved, the maximum a posteriori rule is applied to classify all examples. Algorithm 3 describes this inner loop.


Algorithm 3 Mix-GLOSS for one value of λ

Input: X, K, B0, Y0, λ
Initialize:
  if (B0, Y0) available then
    B_OS ← B0 ; Y ← Y0
  else
    B_OS ← 0 ; Y ← K-means(X, K)
  end if
  convergenceEM ← false ; tolEM ← 1e-3
repeat
  {M-step}
  (B_OS, Θ, α) ← GLOSS(X, Y, B_OS, λ)
  X_LDA = X B_OS diag( α^{−1} (1 − α²)^{−1/2} )
  π_k, µ_k and Σ as per (7.10), (7.11) and (7.12)
  {E-step}
  t_ik as per (8.1)
  L(θ) as per (8.2)
  if (1/n) ∑_i |t_ik − y_ik| < tolEM then
    convergenceEM ← true
  end if
  Y ← T
until convergenceEM
Y ← MAP(T)

Output: B_OS, Θ, L(θ), t_ik, π_k, µ_k, Σ, Y


M-Step

The M-step deals with the estimation of the model parameters, that is, the clusters' means µ_k, the common covariance matrix Σ, and the priors of every component π_k. In a classical M-step, this is done explicitly by maximizing the likelihood expression. Here, this maximization is implicitly performed by penalized optimal scoring (see Section 8.1). The core of this step is a GLOSS execution that regresses the scaled label matrix YΘ on X. For the first iteration of EM, if no initialization is available, Y results from a K-means execution. In subsequent iterations, Y is updated as the posterior probability matrix T resulting from the E-step.

E-Step

The E-step evaluates the posterior probability matrix T, using

t_ik ∝ exp[ −( d(x, µ_k) − 2 log(π_k) ) / 2 ] .

The convergence of these t_ik is used as the stopping criterion for EM.

9.2 Model Selection

Here, model selection refers to the choice of the penalty parameter. Up to now, we have not conducted experiments where the number of clusters has to be automatically selected.

In a first attempt, we tried a classical structure where clustering was performed several times, from different initializations, for all penalty parameter values. Then, using the log-likelihood criterion, the best repetition for every value of the penalty parameter was chosen. The definitive λ was selected by means of the stability criterion described by Lange et al. (2002). This algorithm took a lot of computing resources, since the stability selection mechanism required a certain number of repetitions that transformed Mix-GLOSS into a lengthy four-nested-loops structure.

In a second attempt, we replaced the stability-based model selection algorithm by the evaluation of a modified version of BIC (Pan and Shen, 2007). This version of BIC looks like the traditional one (Schwarz, 1978) but takes into consideration the variables that have been removed. This mechanism, even if it turned out to be faster, still required a large computation time.

The third and definitive attempt (up to now) proceeds with several executions of Mix-GLOSS for the non-penalized case (λ = 0). The execution with the best log-likelihood is chosen. The repetitions are only performed for the non-penalized problem. The coefficient matrix B and the posterior matrix T resulting from the best non-penalized execution are used to warm-start a new Mix-GLOSS execution. This second execution of Mix-GLOSS is done using the values of the penalty parameter provided by the user or computed by the automatic selection mechanism. This time, only one repetition of the algorithm is done for every value of the penalty parameter. This version has been tested


Figure 9.2: Mix-GLOSS model selection diagram: an initial non-penalized Mix-GLOSS (λ = 0, 20 repetitions) provides the starting B and T; a penalized Mix-GLOSS is then run over the λ values, BIC is computed for each, and the λ minimizing BIC is chosen.

with no significant differences in the quality of the clustering, but with a dramatic reduction of the computation time. Figure 9.2 summarizes the mechanism that implements the model selection of the penalty parameter λ.


10 Experimental Results

The performance of Mix-GLOSS is measured here with the artificial datasets that have been used in Chapter 6.

This synthetic database is interesting because it covers four different situations where feature selection can be applied. Basically, it considers four setups with 1200 examples equally distributed between classes. It is a small-sample regime, with p = 500 variables, out of which 100 differ between classes. Independent variables are generated for all simulations except for Simulation 2, where they are slightly correlated. In Simulations 2 and 3, classes are optimally separated by a single projection of the original variables, while the two other scenarios require three discriminant directions. The Bayes' error was estimated to be respectively 1.7%, 6.7%, 7.3% and 30.0%. The exact description of every setup has already been given in Section 6.3.

In our tests, we have reduced the volume of the problem because, with the original size of 1200 samples and 500 dimensions, some of the algorithms to test took several days (even weeks) to finish. Hence, the definitive database was chosen to maintain approximately the Bayes' error of the original one, but with five times fewer examples and dimensions (n = 240, p = 100). Figure 10.1 has been adapted from Witten and Tibshirani (2011) to the dimensionality of our experiments and allows a better understanding of the different simulations.

The simulation protocol involves 25 repetitions of each setup, generating a different dataset for each repetition. Thus, the results of the tested algorithms are provided as the average value and the standard deviation over the 25 repetitions.

10.1 Tested Clustering Algorithms

This section compares Mix-GLOSS with the following methods at the state of the art:

• CS general cov: This is a model-based clustering with unconstrained covariance matrices, based on the regularization of the likelihood function using L1 penalties, followed by a classical EM algorithm. Further details can be found in Zhou et al. (2009). We use the R function available on the website of Wei Pan.

• Fisher EM: This method models and clusters the data in a discriminative and low-dimensional latent subspace (Bouveyron and Brunet, 2012b,a). Feature selection is induced by means of the "sparsification" of the projection matrix (three possibilities are suggested by Bouveyron and Brunet, 2012a). The corresponding R package "FisherEM" is available from the website of Charles Bouveyron or from the Comprehensive R Archive Network website.


Figure 10.1: Class mean vectors for each artificial simulation.

• SelvarClust/Clustvarsel: Implements a method of variable selection for clustering using Gaussian mixture models, as a modification of the Raftery and Dean (2006) algorithm. SelvarClust (Maugis et al., 2009b) is a software implemented in C++ that makes use of the clustering libraries mixmod (Biernacki et al., 2008). Further information can be found in the related paper, Maugis et al. (2009a). The software can be downloaded from the SelvarClust project homepage; there is a link to the project from Cathy Maugis's website.

After several tests, this entrant was discarded due to the amount of computing time required by the greedy selection technique, which basically involves two executions of a classical clustering algorithm (with mixmod) for every single variable whose inclusion needs to be considered.

The substitute for SelvarClust has been the algorithm that inspired it, that is, the method developed by Raftery and Dean (2006). There is an R package named Clustvarsel that can be downloaded from the website of Nema Dean or from the Comprehensive R Archive Network website.

• LumiWCluster: LumiWCluster is an R package available from the homepage of Pei Fen Kuan. This algorithm is inspired by Wang and Zhu (2008), who propose a penalty for the likelihood that incorporates group information through an L1,∞ mixed norm. In Kuan et al. (2010), they introduce some slight changes in the penalty term, such as weighting parameters, that are particularly important for their dataset. The package LumiWCluster allows clustering using the expression from Wang and Zhu (2008) (called LumiWCluster-Wang) or the one from Kuan et al. (2010) (called LumiWCluster-Kuan).

• Mix-GLOSS: This is the clustering algorithm implemented using GLOSS (see Chapter 9). It makes use of an EM algorithm and of the equivalences between the M-step and an LDA problem, and between a p-LDA problem and a p-OS problem. It penalizes an OS regression with a variational approach of the group-Lasso penalty (see Section 8.1.4) that induces zeros in all discriminant directions for the same variable.

10.2 Results

Table 10.1 shows the results of the experiments for all the algorithms from Section 10.1. The parameters used to measure the performance are:

• Clustering Error (in percentage): To measure the quality of the partition with the a priori knowledge of the real classes, the clustering error is computed as explained in Wu and Schölkopf (2007). If the obtained partition and the real labeling coincide, the clustering error is 0%. The way this measure is defined allows the ideal 0% clustering error to be reached even if the IDs of the clusters and of the real classes differ (see the sketch after this list).

• Number of Discarded Features: This value shows the number of variables whose coefficients have been zeroed, and which are therefore not used in the partitioning. In our datasets, only the first 20 features are relevant for the discrimination; the last 80 variables can be discarded. Hence, a good result for the tested algorithms should be around 80.

• Time of Execution (in hours, minutes or seconds). Finally, the time needed to execute the 25 repetitions of each simulation setup is also measured. Those algorithms tend to be more memory- and CPU-consuming as the number of variables increases. This is one of the reasons why the dimensionality of the original problem was reduced.

The adequacy of the selected features was assessed by the True Positive Rate (TPR) and the False Positive Rate (FPR). The TPR is the proportion of relevant variables that are selected; similarly, the FPR is the proportion of non-relevant variables that are selected. The best algorithm would be the one that selects all the relevant variables and rejects all the others, that is, TPR = 1 and FPR = 0 simultaneously. In order to avoid cluttered results, we compare TPR and FPR for the four simulations but only for three algorithms: CS general cov and Clustvarsel were discarded due to high computing time and clustering error, respectively, and since the two versions of LumiWCluster provide almost the same TPR and FPR, only one is displayed. The three remaining algorithms are Fisher EM by Bouveyron and Brunet (2012a), the version of LumiWCluster by Kuan et al. (2010), and Mix-GLOSS.
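To make these criteria concrete, the following sketch shows one possible way of computing them for a single run; it assumes the simulated design used throughout this chapter (the first 20 of the 100 variables are relevant) and implements the label-permutation-invariant clustering error with the Hungarian algorithm, which may differ in implementation details from Wu and Scholkopf (2007). All function and variable names are ours.

import numpy as np
from scipy.optimize import linear_sum_assignment

def selection_rates(selected, n_relevant=20):
    # TPR/FPR of a boolean selection mask; in our simulated designs the
    # first `n_relevant` of the 100 variables are the discriminant ones.
    selected = np.asarray(selected, dtype=bool)
    relevant = np.zeros_like(selected)
    relevant[:n_relevant] = True
    tpr = selected[relevant].mean()    # fraction of relevant variables kept
    fpr = selected[~relevant].mean()   # fraction of irrelevant variables kept
    return 100 * tpr, 100 * fpr

def clustering_error(y_true, y_pred):
    # Misclassification rate minimized over all one-to-one relabelings of
    # the clusters (Hungarian algorithm), so identical partitions give 0%.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes, clusters = np.unique(y_true), np.unique(y_pred)
    C = np.zeros((classes.size, clusters.size))
    for i, c in enumerate(classes):
        for j, k in enumerate(clusters):
            C[i, j] = np.sum((y_true == c) & (y_pred == k))
    rows, cols = linear_sum_assignment(-C)   # maximize correctly matched samples
    return 100 * (1 - C[rows, cols].sum() / y_true.size)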

Results, in percentages, are displayed in Figure 10.2 (or in Table 10.2).


Table 10.1: Experimental results for simulated data (mean and, in parentheses, standard deviation over 25 repetitions).

Sim 1: K = 4, mean shift, ind. features
                        Err (%)       Var           Time
  CS general cov        46 (15)       985 (72)      884h
  Fisher EM             58 (87)       784 (52)      1645m
  Clustvarsel           602 (107)     378 (291)     383h
  LumiWCluster-Kuan     42 (68)       779 (4)       389s
  LumiWCluster-Wang     43 (69)       784 (39)      619s
  Mix-GLOSS             32 (16)       80 (09)       15h

Sim 2: K = 2, mean shift, dependent features
  CS general cov        154 (2)       997 (09)      783h
  Fisher EM             74 (23)       809 (28)      8m
  Clustvarsel           73 (2)        334 (207)     166h
  LumiWCluster-Kuan     64 (18)       798 (04)      155s
  LumiWCluster-Wang     63 (17)       799 (03)      14s
  Mix-GLOSS             77 (2)        841 (34)      2h

Sim 3: K = 4, 1D mean shift, ind. features
  CS general cov        304 (57)      55 (468)      1317h
  Fisher EM             233 (65)      366 (55)      22m
  Clustvarsel           658 (115)     232 (291)     542h
  LumiWCluster-Kuan     323 (21)      80 (02)       83s
  LumiWCluster-Wang     308 (36)      80 (02)       1292s
  Mix-GLOSS             347 (92)      81 (88)       21h

Sim 4: K = 4, mean shift, ind. features
  CS general cov        626 (55)      999 (02)      112h
  Fisher EM             567 (104)     55 (48)       195m
  Clustvarsel           732 (4)       24 (12)       767h
  LumiWCluster-Kuan     692 (112)     99 (2)        876s
  LumiWCluster-Wang     697 (119)     991 (21)      825s
  Mix-GLOSS             669 (91)      975 (12)      11h

Table 10.2: TPR versus FPR (in %), averages computed over 25 repetitions for the best performing algorithms.

              Simulation 1     Simulation 2     Simulation 3     Simulation 4
              TPR     FPR      TPR     FPR      TPR     FPR      TPR     FPR
  MIX-GLOSS   992     015      828     335      884     67       780     12
  LUMI-KUAN   992     28       1000    02       1000    005      50      005
  FISHER-EM   986     24       888     17       838     5825     620     4075



Figure 10.2: TPR versus FPR (in %) for the best performing algorithms and simulations.

10.3 Discussion

After reviewing Tables 10.1–10.2 and Figure 10.2, we see that there is no definitive winner in all situations regarding all criteria. According to the objectives and constraints of the problem, the following observations deserve to be highlighted.

LumiWCluster (Wang and Zhu, 2008; Kuan et al., 2010) is by far the fastest kind of method, with good behavior regarding the other performance measures. At the other end of this criterion, CS general cov is extremely slow, and Clustvarsel, though twice as fast, also takes very long to produce an output. Of course, the speed criterion does not say much by itself: the implementations use different programming languages and different stopping criteria, and we do not know what effort has been spent on implementation. That being said, the slowest algorithms are not the most precise ones, so their long computation time is worth mentioning here.

The quality of the partition varies depending on the simulation and the algorithm: Mix-GLOSS has a small edge in Simulation 1, LumiWCluster (Wang and Zhu, 2008; Kuan et al., 2010) performs better in Simulation 2, while Fisher EM (Bouveyron and Brunet, 2012a) does slightly better in Simulations 3 and 4.

From the feature selection point of view, LumiWCluster (Kuan et al., 2010) and Mix-GLOSS succeed in removing irrelevant variables in all the situations, while Fisher EM (Bouveyron and Brunet, 2012a) and Mix-GLOSS discover the relevant ones. Mix-GLOSS consistently performs best or close to the best solution in terms of fall-out and recall.


Conclusions

Summary

The linear regression of scaled indicator matrices, or optimal scoring, is a versatile technique with applicability in many fields of the machine learning domain. By means of regularization, an optimal scoring regression can be strengthened to be more robust, avoid overfitting, counteract ill-posed problems, or remove correlated or noisy variables.

In this thesis we have demonstrated the utility of penalized optimal scoring in the fields of multi-class linear discrimination and clustering.

The equivalence between LDA and OS problems makes it possible to bring all the resources available for solving regression problems to bear on linear discrimination. In their penalized versions, this equivalence holds under certain conditions that have not always been obeyed when OS has been used to solve LDA problems.

In Part II we have used a variational approach to the group-Lasso penalty that preserves this equivalence, granting the use of penalized optimal scoring regressions for the solution of linear discrimination problems. This theory has been verified with the implementation of our Group-Lasso Optimal Scoring Solver algorithm (GLOSS), which has proved its effectiveness, inducing extremely parsimonious models without renouncing any predictive capabilities. GLOSS has been tested with four artificial and three real datasets, outperforming other state-of-the-art algorithms in almost all situations.

In Part III this theory has been adapted, by means of an EM algorithm, to the unsupervised domain. As for the supervised case, the theory must guarantee the equivalence between penalized LDA and penalized OS. The difficulty of this method resides in the computation of the criterion to maximize at every iteration of the EM loop, which is typically used to detect the convergence of the algorithm and to implement model selection of the penalty parameter. Also in this case, the theory has been put into practice with the implementation of Mix-GLOSS. So far, due to time constraints, only artificial datasets have been tested, with positive results.

Perspectives

Even if the preliminary results are promising, Mix-GLOSS has not been sufficiently tested. We have planned to test it at least with the same real datasets that we used with GLOSS. However, more testing would be recommended in both cases. Those algorithms are well suited for genomic data, where the number of samples is smaller than the number of variables, but other high-dimensional low-sample-size (HDLSS) domains are also possible: identification of male or female silhouettes, fungal species or fish species based on shape and texture (Clemmensen et al., 2011), or Stirling faces (Roth and Lange, 2004), are only some examples. Moreover, we are not constrained to the HDLSS domain: the USPS handwritten digits database (Roth and Lange, 2004), the well-known Fisher's Iris dataset and six other UCI datasets (Bouveyron and Brunet, 2012a) have also been tested in the literature.

At the programming level, both codes must be revisited to improve their robustness and optimize their computation, because during the prototyping phase the priority was achieving a functional code. An old version of GLOSS, numerically more stable but less efficient, has been made available to the public. A better suited and documented version should be made available for GLOSS and Mix-GLOSS in the short term.

The theory developed in this thesis and the programming structure used for its implementation allow easy alterations of the algorithm by modifying the within-class covariance matrix. Diagonal versions of the model can be obtained by discarding all the elements but the diagonal of the covariance matrix; spherical models could also be implemented easily. Prior information concerning the correlation between features can be included by adding a quadratic penalty term, such as the Laplacian that describes the relationships between variables; this can be used to implement pair-wise penalties when the dataset is formed by pixels (see the sketch below). Quadratic penalty matrices can also be added to the within-class covariance to implement Elastic-net equivalent penalties. Some of those possibilities have been partially implemented, such as the diagonal version of GLOSS; however, they have not been properly tested or even updated with the last algorithmic modifications. Their equivalents for the unsupervised domain have not yet been proposed, due to the time deadlines for the publication of this thesis.
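As an illustration of the structured variant mentioned above, the following sketch builds the graph Laplacian of a pixel grid with 4-neighbour connectivity; adding a term proportional to it to the quadratic penalty matrix (or to the within-class covariance) is one way of encoding pair-wise smoothness between neighbouring pixels. The image size and the two penalty values are arbitrary placeholders, not settings used in this thesis.

import numpy as np

def grid_laplacian(height, width):
    # Graph Laplacian L = D - A of a height x width pixel grid with
    # 4-neighbour connectivity; beta' L beta penalizes differences between
    # coefficients of neighbouring pixels.
    n = height * width
    A = np.zeros((n, n))
    for r in range(height):
        for c in range(width):
            i = r * width + c
            if c + 1 < width:
                A[i, i + 1] = A[i + 1, i] = 1          # right neighbour
            if r + 1 < height:
                A[i, i + width] = A[i + width, i] = 1  # bottom neighbour
    return np.diag(A.sum(axis=1)) - A

# hypothetical structured penalty for 16 x 16 images: Laplacian plus ridge term
L = grid_laplacian(16, 16)
Omega = 1.0 * L + 0.01 * np.eye(L.shape[0])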

From the point of view of the supporting theory, we did not succeed in finding the exact criterion that is maximized in Mix-GLOSS. We believe it must be a kind of penalized, or even hyper-penalized, likelihood, but we decided to prioritize the experimental results due to the time constraints. Not knowing this criterion does not prevent successful runs of Mix-GLOSS: other mechanisms, which do not involve the computation of the real criterion, have been used to stop the EM algorithm and to perform model selection. However, further investigation must be done in this direction to assess the convergence properties of this algorithm.

At the beginning of this thesis, even if the work finally took the direction of feature selection, a big effort was put into the domains of outlier detection and block clustering. One of the most successful mechanisms for the detection of outliers consists in modelling the population with a mixture model where the outliers are described by a uniform distribution. This technique does not need any prior knowledge about the number or the percentage of outliers. As the basic model of this thesis is a mixture of Gaussians, our impression is that it should not be difficult to introduce a new uniform component to gather together all those points that do not fit the Gaussian mixture (a sketch is given below). On the other hand, the application of penalized optimal scoring to block clustering looks more complex; but as block clustering is typically defined as a mixture model whose parameters are estimated by means of an EM algorithm, it could be possible to re-interpret that estimation using a penalized optimal scoring regression.
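For the outlier direction, a minimal sketch of the corresponding E-step is given below, in the spirit of the classical uniform noise component of Banfield and Raftery (1993); the uniform density is taken over the bounding box of the data, which is one common convention, and this is not part of Mix-GLOSS.

import numpy as np
from scipy.stats import multivariate_normal

def e_step_with_noise(X, pi, mus, cov, pi_noise):
    # Posterior probabilities for a Gaussian mixture augmented with a uniform
    # "noise" component over the bounding box of the data, meant to absorb
    # points that fit none of the Gaussian clusters.
    # pi are the Gaussian proportions; pi and pi_noise should sum to one.
    volume = np.prod(X.max(axis=0) - X.min(axis=0))
    dens = [p_k * multivariate_normal.pdf(X, mu_k, cov)
            for p_k, mu_k in zip(pi, mus)]
    dens.append(np.full(X.shape[0], pi_noise / volume))  # uniform component
    D = np.column_stack(dens)
    return D / D.sum(axis=1, keepdims=True)              # t_ik, last column = outlier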


Appendix


A Matrix Properties

Property 1. By definition, $\Sigma_W$ and $\Sigma_B$ are both symmetric matrices:
$$\Sigma_W = \frac{1}{n}\sum_{k=1}^{g}\sum_{i\in\mathcal{C}_k}(\mathbf{x}_i-\boldsymbol{\mu}_k)(\mathbf{x}_i-\boldsymbol{\mu}_k)^\top,\qquad
\Sigma_B = \frac{1}{n}\sum_{k=1}^{g} n_k(\boldsymbol{\mu}_k-\bar{\mathbf{x}})(\boldsymbol{\mu}_k-\bar{\mathbf{x}})^\top.$$

Property 2. $\dfrac{\partial\,\mathbf{x}^\top\mathbf{a}}{\partial\mathbf{x}} = \dfrac{\partial\,\mathbf{a}^\top\mathbf{x}}{\partial\mathbf{x}} = \mathbf{a}$.

Property 3. $\dfrac{\partial\,\mathbf{x}^\top\mathbf{A}\mathbf{x}}{\partial\mathbf{x}} = (\mathbf{A}+\mathbf{A}^\top)\,\mathbf{x}$.

Property 4. $\dfrac{\partial\,|\mathbf{X}^{-1}|}{\partial\mathbf{X}} = -|\mathbf{X}^{-1}|\,(\mathbf{X}^{-1})^\top$.

Property 5. $\dfrac{\partial\,\mathbf{a}^\top\mathbf{X}\mathbf{b}}{\partial\mathbf{X}} = \mathbf{a}\mathbf{b}^\top$.

Property 6. $\dfrac{\partial}{\partial\mathbf{X}}\operatorname{tr}\big(\mathbf{A}\mathbf{X}^{-1}\mathbf{B}\big) = -\big(\mathbf{X}^{-1}\mathbf{B}\mathbf{A}\mathbf{X}^{-1}\big)^\top = -\mathbf{X}^{-\top}\mathbf{A}^\top\mathbf{B}^\top\mathbf{X}^{-\top}$.
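These identities are used in the derivations of Appendices B and G. As a quick numerical sanity check, the sketch below verifies Property 6 by central finite differences on random matrices (sizes and tolerances are arbitrary).

import numpy as np

rng = np.random.default_rng(0)
p = 4
A, B = rng.standard_normal((p, p)), rng.standard_normal((p, p))
X = np.eye(p) + 0.1 * rng.standard_normal((p, p))

def f(M):
    return np.trace(A @ np.linalg.inv(M) @ B)   # f(X) = tr(A X^{-1} B)

Xinv = np.linalg.inv(X)
analytic = -(Xinv @ B @ A @ Xinv).T             # Property 6

numeric = np.zeros_like(X)
eps = 1e-6
for i in range(p):
    for j in range(p):
        E = np.zeros_like(X)
        E[i, j] = eps
        numeric[i, j] = (f(X + E) - f(X - E)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))   # expected: True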


B The Penalized-OS Problem is an Eigenvector Problem

In this appendix we answer the question of why the solution of a penalized optimal scoring regression involves the computation of an eigenvector decomposition. The p-OS problem has this form:

$$\min_{\theta_k,\,\beta_k}\ \|\mathbf{Y}\theta_k-\mathbf{X}\beta_k\|_2^2 + \beta_k^\top\Omega_k\beta_k \qquad\text{(B.1)}$$
$$\text{s.t.}\quad \theta_k^\top\mathbf{Y}^\top\mathbf{Y}\theta_k = 1,\qquad \theta_\ell^\top\mathbf{Y}^\top\mathbf{Y}\theta_k = 0\ \ \forall\,\ell<k,$$

for k = 1, ..., K − 1. The Lagrangian associated to Problem (B.1) is

$$\mathcal{L}_k(\theta_k,\beta_k,\lambda_k,\nu_k) = \|\mathbf{Y}\theta_k-\mathbf{X}\beta_k\|_2^2 + \beta_k^\top\Omega_k\beta_k + \lambda_k\big(\theta_k^\top\mathbf{Y}^\top\mathbf{Y}\theta_k - 1\big) + \sum_{\ell<k}\nu_\ell\,\theta_\ell^\top\mathbf{Y}^\top\mathbf{Y}\theta_k. \qquad\text{(B.2)}$$

Setting the gradient of (B.2) with respect to βk to zero gives the value of the optimal β*k:

$$\beta_k^\star = (\mathbf{X}^\top\mathbf{X}+\Omega_k)^{-1}\mathbf{X}^\top\mathbf{Y}\theta_k. \qquad\text{(B.3)}$$

The objective function of (B.1) evaluated at β*k is

$$\min_{\theta_k}\ \|\mathbf{Y}\theta_k-\mathbf{X}\beta_k^\star\|_2^2 + \beta_k^{\star\top}\Omega_k\beta_k^\star
= \min_{\theta_k}\ \theta_k^\top\mathbf{Y}^\top\big(\mathbf{I}-\mathbf{X}(\mathbf{X}^\top\mathbf{X}+\Omega_k)^{-1}\mathbf{X}^\top\big)\mathbf{Y}\theta_k
= \max_{\theta_k}\ \theta_k^\top\mathbf{Y}^\top\mathbf{X}(\mathbf{X}^\top\mathbf{X}+\Omega_k)^{-1}\mathbf{X}^\top\mathbf{Y}\theta_k. \qquad\text{(B.4)}$$

If the penalty matrix Ωk is identical for all problems, Ωk = Ω, then (B.4) corresponds to an eigen-problem where the k score vectors θk are the eigenvectors of Y⊤X(X⊤X + Ω)⁻¹X⊤Y.

B.1 How to Solve the Eigenvector Decomposition

Making an eigen-decomposition of an expression like Y⊤X(X⊤X + Ω)⁻¹X⊤Y is not trivial due to the p × p inverse. With some datasets, p can be extremely large, making this inverse intractable. In this section we show how to circumvent this issue by solving an easier eigenvector decomposition.


Let M be the matrix Y⊤X(X⊤X + Ω)⁻¹X⊤Y, so that we can rewrite expression (B.4) in a compact way:

$$\max_{\Theta\in\mathbb{R}^{K\times(K-1)}}\ \operatorname{tr}\big(\Theta^\top\mathbf{M}\Theta\big)\quad\text{s.t.}\quad \Theta^\top\mathbf{Y}^\top\mathbf{Y}\Theta = \mathbf{I}_{K-1}. \qquad\text{(B.5)}$$

If (B.5) is an eigenvector problem, it can be reformulated in the traditional way. Let the (K−1)×(K−1) matrix M_Θ be Θ⊤MΘ. The classical eigenvector formulation associated to (B.5) is then

$$\mathbf{M}_\Theta\mathbf{v} = \lambda\mathbf{v}, \qquad\text{(B.6)}$$

where v is the eigenvector and λ the associated eigenvalue of M_Θ. Operating,

$$\mathbf{v}^\top\mathbf{M}_\Theta\mathbf{v} = \lambda \;\Leftrightarrow\; \mathbf{v}^\top\Theta^\top\mathbf{M}\Theta\mathbf{v} = \lambda.$$

Making the variable change w = Θv, we obtain an alternative eigenproblem where the w are the eigenvectors of M and λ the associated eigenvalues:

$$\mathbf{w}^\top\mathbf{M}\mathbf{w} = \lambda. \qquad\text{(B.7)}$$

Therefore v are the eigenvectors of the matrix M_Θ and w are the eigenvectors of the matrix M. Note that the only difference between the (K−1)×(K−1) matrix M_Θ and the K×K matrix M is the K×(K−1) matrix Θ in the expression M_Θ = Θ⊤MΘ. Then, to avoid the computation of the p×p inverse (X⊤X + Ω)⁻¹, we can use the optimal value of the coefficient matrix B = (X⊤X + Ω)⁻¹X⊤YΘ in M_Θ:

$$\mathbf{M}_\Theta = \Theta^\top\mathbf{Y}^\top\mathbf{X}(\mathbf{X}^\top\mathbf{X}+\Omega)^{-1}\mathbf{X}^\top\mathbf{Y}\Theta = \Theta^\top\mathbf{Y}^\top\mathbf{X}\mathbf{B}.$$

Thus the eigen-decomposition of the (K−1)×(K−1) matrix M_Θ = Θ⊤Y⊤XB yields the v eigenvectors of (B.6). To obtain the w eigenvectors of the alternative formulation (B.7), the variable change w = Θv needs to be undone.

To summarize, we calculate the v eigenvectors from the eigen-decomposition of a tractable M_Θ matrix, evaluated as Θ⊤Y⊤XB. Then the definitive eigenvectors w are recovered by computing w = Θv. The final step is the reconstruction of the optimal score matrix Θ using the vectors w as its columns. At this point we understand what in the literature is called "updating the initial score matrix": multiplying the initial Θ by the eigenvector matrix V from decomposition (B.6) reverses the change of variable to restore the w vectors. The B matrix also needs to be "updated" by multiplying it by the same matrix of eigenvectors V, in order to account for the initial Θ matrix used in the first computation of B:

$$\mathbf{B}^\star = (\mathbf{X}^\top\mathbf{X}+\Omega)^{-1}\mathbf{X}^\top\mathbf{Y}\Theta\mathbf{V} = \mathbf{B}\mathbf{V}.$$
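A minimal sketch of this "update of the initial score matrix" is given below; it assumes that the initial Θ0 already satisfies the normalization Θ0⊤Y⊤YΘ0 = I, and it is only meant to illustrate the linear algebra of this section, not the actual GLOSS implementation.

import numpy as np

def penalized_os_update(X, Y, Omega, Theta0):
    # One pass of the mechanics of Section B.1: regress to get B for the
    # initial scores, eigen-decompose the small (K-1)x(K-1) matrix
    # M_Theta = Theta0' Y' X B, then rotate Theta0 and B by the eigenvectors.
    B0 = np.linalg.solve(X.T @ X + Omega, X.T @ Y @ Theta0)
    M_theta = Theta0.T @ Y.T @ X @ B0
    evals, V = np.linalg.eigh((M_theta + M_theta.T) / 2)   # M_theta is symmetric
    V = V[:, np.argsort(evals)[::-1]]                      # decreasing eigenvalues
    return Theta0 @ V, B0 @ V                              # updated scores and coefficients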


B.2 Why the OS Problem is Solved as an Eigenvector Problem

In the optimal scoring literature, the score matrix Θ that optimizes Problem (B.1) is obtained by means of an eigenvector decomposition of the matrix M = Y⊤X(X⊤X + Ω)⁻¹X⊤Y.

By definition of the eigen-decomposition, the eigenvectors of M (called w in (B.7)) form a basis, so that any score vector θ can be expressed as a linear combination of them:

$$\theta_k = \sum_{m=1}^{K-1}\alpha_m\mathbf{w}_m \quad\text{s.t.}\quad \theta_k^\top\theta_k = 1. \qquad\text{(B.8)}$$

The score vector normalization constraint θk⊤θk = 1 can also be expressed as a function of this basis,

$$\Big(\sum_{m=1}^{K-1}\alpha_m\mathbf{w}_m\Big)^{\top}\Big(\sum_{m=1}^{K-1}\alpha_m\mathbf{w}_m\Big) = 1,$$

which, as per the eigenvector properties, can be reduced to

$$\sum_{m=1}^{K-1}\alpha_m^2 = 1. \qquad\text{(B.9)}$$

Let M be multiplied by a score vector θk, which can be replaced by its linear combination of eigenvectors wm (B.8):

$$\mathbf{M}\theta_k = \mathbf{M}\sum_{m=1}^{K-1}\alpha_m\mathbf{w}_m = \sum_{m=1}^{K-1}\alpha_m\mathbf{M}\mathbf{w}_m.$$

As the wm are the eigenvectors of M, the relationship Mwm = λm wm can be used to obtain

$$\mathbf{M}\theta_k = \sum_{m=1}^{K-1}\alpha_m\lambda_m\mathbf{w}_m.$$

Pre-multiplying by θk⊤, written as the corresponding linear combination of eigenvectors, yields

$$\theta_k^\top\mathbf{M}\theta_k = \Big(\sum_{\ell=1}^{K-1}\alpha_\ell\mathbf{w}_\ell\Big)^{\top}\Big(\sum_{m=1}^{K-1}\alpha_m\lambda_m\mathbf{w}_m\Big).$$

This equation can be simplified using the orthogonality property of the eigenvectors, according to which w_ℓ⊤w_m is zero for any ℓ ≠ m, giving

$$\theta_k^\top\mathbf{M}\theta_k = \sum_{m=1}^{K-1}\alpha_m^2\lambda_m.$$

The optimization problem (B.5) for discriminant direction k can thus be rewritten as

$$\max_{\theta_k\in\mathbb{R}^{K}}\ \theta_k^\top\mathbf{M}\theta_k = \max_{\theta_k\in\mathbb{R}^{K}}\ \sum_{m=1}^{K-1}\alpha_m^2\lambda_m,\quad\text{with}\ \ \theta_k=\sum_{m=1}^{K-1}\alpha_m\mathbf{w}_m\ \text{ and }\ \sum_{m=1}^{K-1}\alpha_m^2=1. \qquad\text{(B.10)}$$

One way of maximizing Problem (B.10) is to choose αm = 1 for m = k and αm = 0 otherwise. Hence, as θk = Σm αm wm, the resulting score vector θk is equal to the kth eigenvector wk.

As a summary, it can be concluded that the solution of the original problem (B.1) can be obtained by an eigenvector decomposition of the matrix M = Y⊤X(X⊤X + Ω)⁻¹X⊤Y.


C Solving Fisher's Discriminant Problem

The classical Fisher's discriminant problem seeks a projection that best separates the class centers while every class remains compact. This is formalized as looking for a projection such that the projected data has maximal between-class variance under a unitary constraint on the within-class variance:

$$\max_{\beta\in\mathbb{R}^p}\ \beta^\top\Sigma_B\beta \qquad\text{(C.1a)}$$
$$\text{s.t.}\quad \beta^\top\Sigma_W\beta = 1, \qquad\text{(C.1b)}$$

where ΣB and ΣW are respectively the between-class variance and the within-class variance of the original p-dimensional data.

The Lagrangian of Problem (C.1) is

$$\mathcal{L}(\beta,\nu) = \beta^\top\Sigma_B\beta - \nu\big(\beta^\top\Sigma_W\beta - 1\big),$$

so that its first derivative with respect to β is

$$\frac{\partial\mathcal{L}(\beta,\nu)}{\partial\beta} = 2\Sigma_B\beta - 2\nu\Sigma_W\beta.$$

A necessary optimality condition for β* is that this derivative is zero, that is,

$$\Sigma_B\beta^\star = \nu\Sigma_W\beta^\star.$$

Provided ΣW is full rank, we have

$$\Sigma_W^{-1}\Sigma_B\beta^\star = \nu\beta^\star. \qquad\text{(C.2)}$$

Thus the solutions β* match the definition of an eigenvector of the matrix ΣW⁻¹ΣB with eigenvalue ν. To characterize this eigenvalue, we note that the objective function (C.1a) can be expressed as follows:

$$\beta^{\star\top}\Sigma_B\beta^\star = \beta^{\star\top}\Sigma_W\Sigma_W^{-1}\Sigma_B\beta^\star = \nu\,\beta^{\star\top}\Sigma_W\beta^\star\ \ \text{from (C.2)} \;=\; \nu\ \ \text{from (C.1b)}.$$

That is, the optimal value of the objective function to be maximized is the eigenvalue ν. Hence ν is the largest eigenvalue of ΣW⁻¹ΣB, and β* is any eigenvector corresponding to this maximal eigenvalue.
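In practice, this eigenproblem can be solved directly in its generalized form, without inverting ΣW explicitly. The sketch below does so with a standard generalized symmetric eigensolver; it assumes ΣW is positive definite, as required above.

import numpy as np
from scipy.linalg import eigh

def fisher_directions(Sigma_B, Sigma_W, n_directions=1):
    # Solve Sigma_B beta = nu Sigma_W beta as a generalized symmetric
    # eigenproblem; the eigenvectors returned by eigh satisfy
    # beta' Sigma_W beta = 1 and are ordered here by decreasing eigenvalue.
    evals, evecs = eigh(Sigma_B, Sigma_W)
    order = np.argsort(evals)[::-1]
    return evals[order][:n_directions], evecs[:, order[:n_directions]]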


D Alternative Variational Formulation for the Group-Lasso

In this appendix, an alternative to the variational form of the group-Lasso (4.21) presented in Section 4.3.1 is proposed:

$$\min_{\tau\in\mathbb{R}^p}\ \min_{\mathbf{B}\in\mathbb{R}^{p\times(K-1)}}\ J(\mathbf{B}) + \lambda\sum_{j=1}^{p} w_j^2\,\frac{\|\beta^j\|_2^2}{\tau_j} \qquad\text{(D.1a)}$$
$$\text{s.t.}\quad \sum_{j=1}^{p}\tau_j = 1, \qquad\text{(D.1b)}$$
$$\qquad\ \ \tau_j \ge 0,\ \ j=1,\dots,p. \qquad\text{(D.1c)}$$

Following the approach detailed in Section 4.3.1, its equivalence with the standard group-Lasso formulation is demonstrated here. Let B ∈ R^{p×(K−1)} be a matrix composed of row vectors β^j ∈ R^{K−1}, B = (β^{1⊤}, ..., β^{p⊤})⊤, and consider the Lagrangian

$$\mathcal{L}(\mathbf{B},\tau,\lambda,\nu_0,\nu_j) = J(\mathbf{B}) + \lambda\sum_{j=1}^{p} w_j^2\,\frac{\|\beta^j\|_2^2}{\tau_j} + \nu_0\Big(\sum_{j=1}^{p}\tau_j-1\Big) - \sum_{j=1}^{p}\nu_j\tau_j. \qquad\text{(D.2)}$$

The starting point is the Lagrangian (D.2), which is differentiated with respect to τj to get the optimal value τj*:

$$\frac{\partial\mathcal{L}(\mathbf{B},\tau,\lambda,\nu_0,\nu_j)}{\partial\tau_j}\bigg|_{\tau_j=\tau_j^\star} = 0
\ \Rightarrow\ -\lambda w_j^2\,\frac{\|\beta^j\|_2^2}{\tau_j^{\star 2}} + \nu_0 - \nu_j = 0
\ \Rightarrow\ -\lambda w_j^2\|\beta^j\|_2^2 + \nu_0\tau_j^{\star 2} - \nu_j\tau_j^{\star 2} = 0
\ \Rightarrow\ -\lambda w_j^2\|\beta^j\|_2^2 + \nu_0\tau_j^{\star 2} = 0.$$

The last two expressions are related through a property of the Lagrange multipliers, which states that νj gj(τ*) = 0, where νj is the Lagrange multiplier and gj(τ) is the inequality constraint. Then the optimal τj* can be deduced:

$$\tau_j^\star = \sqrt{\frac{\lambda}{\nu_0}}\ w_j\|\beta^j\|_2.$$

Placing this optimal value of τj into constraint (D.1b),

$$\sum_{j=1}^{p}\tau_j = 1\ \Rightarrow\ \tau_j^\star = \frac{w_j\|\beta^j\|_2}{\sum_{j'=1}^{p} w_{j'}\|\beta^{j'}\|_2}. \qquad\text{(D.3)}$$


With this value of τj*, Problem (D.1) is equivalent to

$$\min_{\mathbf{B}\in\mathbb{R}^{p\times(K-1)}}\ J(\mathbf{B}) + \lambda\Big(\sum_{j=1}^{p} w_j\|\beta^j\|_2\Big)^2. \qquad\text{(D.4)}$$

This problem is a slight alteration of the standard group-Lasso, as the penalty is squared compared to the usual form. This square only affects the strength of the penalty, and the usual properties of the group-Lasso apply to the solution of Problem (D.4). In particular, its solution is expected to be sparse, with some null row vectors β^j.

The penalty term of (D.1a) can be conveniently presented as λB⊤ΩB, where

$$\Omega = \operatorname{diag}\Big(\frac{w_1^2}{\tau_1},\frac{w_2^2}{\tau_2},\dots,\frac{w_p^2}{\tau_p}\Big). \qquad\text{(D.5)}$$

Using the value of τj* from (D.3), each diagonal component of Ω is

$$(\Omega)_{jj} = \frac{w_j\sum_{j'=1}^{p} w_{j'}\|\beta^{j'}\|_2}{\|\beta^j\|_2}. \qquad\text{(D.6)}$$
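For illustration, the closed-form τ* of (D.3) and the diagonal of Ω in (D.6) can be computed as in the sketch below; the handling of numerically zero rows (whose penalty weight diverges) is our own convention for the sketch.

import numpy as np

def variational_tau_omega(B, w, eps=1e-12):
    # Closed-form tau of (D.3) and the diagonal of Omega in (D.6).
    # Rows of B with (numerically) zero norm get tau_j -> 0 and an infinite
    # penalty weight; in an algorithm they would simply leave the active set.
    norms = np.linalg.norm(B, axis=1)          # ||beta^j||_2, one per variable
    weighted = w * norms
    tau = weighted / max(weighted.sum(), eps)
    omega_diag = np.full(B.shape[0], np.inf)
    active = norms > eps
    omega_diag[active] = w[active] * weighted.sum() / norms[active]
    return tau, omega_diag                     # Omega = diag(omega_diag)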

In the following paragraphs, the optimality conditions and properties developed for the quadratic variational approach detailed in Section 4.3.1 are also derived for this alternative formulation.

D.1 Useful Properties

Lemma D.1. If J is convex, Problem (D.1) is convex.

In what follows, J will be a convex quadratic (hence smooth) function, in which case a necessary and sufficient optimality condition is that zero belongs to the subdifferential of the objective function, whose expression is provided in the following lemma.

Lemma D.2. For all B ∈ R^{p×(K−1)}, the subdifferential of the objective function of Problem (D.4) is

$$\Big\{\mathbf{V}\in\mathbb{R}^{p\times(K-1)}\ :\ \mathbf{V} = \frac{\partial J(\mathbf{B})}{\partial\mathbf{B}} + 2\lambda\Big(\sum_{j=1}^{p} w_j\|\beta^j\|_2\Big)\mathbf{G}\Big\}, \qquad\text{(D.7)}$$

where G is a p×(K−1) matrix of row vectors g^j defined as follows. Let S(B) denote the row-wise support of B, S(B) = {j ∈ {1, ..., p} : ‖β^j‖₂ ≠ 0}; then we have

$$\forall j\in S(\mathbf{B}),\quad \mathbf{g}^j = w_j\,\|\beta^j\|_2^{-1}\beta^j, \qquad\text{(D.8)}$$
$$\forall j\notin S(\mathbf{B}),\quad \|\mathbf{g}^j\|_2 \le w_j. \qquad\text{(D.9)}$$


This condition results in an equality for the "active" non-zero vectors β^j and an inequality for the other ones, which both provide essential building blocks of our algorithm.

Lemma D.3. Problem (D.4) admits at least one solution, which is unique if J(B) is strictly convex. All critical points B* of the objective function verifying the following conditions are global minima. Let S(B*) denote the row-wise support of B*, S(B*) = {j ∈ {1, ..., p} : ‖β*^j‖₂ ≠ 0}, and let S̄(B*) be its complement; then we have

$$\forall j\in S(\mathbf{B}^\star),\quad -\frac{\partial J(\mathbf{B}^\star)}{\partial\beta^j} = 2\lambda\Big(\sum_{j'=1}^{p} w_{j'}\|\beta^{\star j'}\|_2\Big)\, w_j\,\|\beta^{\star j}\|_2^{-1}\beta^{\star j}, \qquad\text{(D.10a)}$$
$$\forall j\notin S(\mathbf{B}^\star),\quad \Big\|\frac{\partial J(\mathbf{B}^\star)}{\partial\beta^j}\Big\|_2 \le 2\lambda\, w_j\Big(\sum_{j'=1}^{p} w_{j'}\|\beta^{\star j'}\|_2\Big). \qquad\text{(D.10b)}$$

In particular, Lemma D.3 provides a well-defined appraisal of the support of the solution, which is not easily handled from the direct analysis of the variational problem (D.1).

D.2 An Upper Bound on the Objective Function

Lemma D.4. The objective function of the variational form (D.1) is an upper bound on the group-Lasso objective function (D.4), and, for a given B, the gap between these objectives is null at τ* such that

$$\tau_j^\star = \frac{w_j\|\beta^j\|_2}{\sum_{j'=1}^{p} w_{j'}\|\beta^{j'}\|_2}.$$

Proof. The objective functions of (D.1) and (D.4) only differ in their second term. Let τ ∈ R^p be any feasible vector; we have

$$\Big(\sum_{j=1}^{p} w_j\|\beta^j\|_2\Big)^2 = \Big(\sum_{j=1}^{p}\tau_j^{1/2}\,\frac{w_j\|\beta^j\|_2}{\tau_j^{1/2}}\Big)^2
\le \Big(\sum_{j=1}^{p}\tau_j\Big)\Big(\sum_{j=1}^{p}\frac{w_j^2\|\beta^j\|_2^2}{\tau_j}\Big)
\le \sum_{j=1}^{p}\frac{w_j^2\|\beta^j\|_2^2}{\tau_j},$$

where the first inequality is the Cauchy–Schwarz inequality and the second uses the definition of the feasibility set of τ.


This lemma only holds for the alternative variational formulation described in this appendix. It is difficult to obtain the same result for the first variational form (Section 4.3.1), because the definitions of the feasible sets of τ and β are intertwined.


E Invariance of the Group-Lasso to Unitary Transformations

The computational trick described in Section 5.2 for quadratic penalties can be applied to the group-Lasso provided that the following holds: if the regression coefficients B0 are optimal for the score values Θ0, and if the optimal scores Θ* are obtained by a unitary transformation of Θ0, say Θ* = Θ0V (where V ∈ R^{M×M} is a unitary matrix), then B* = B0V is optimal conditionally on Θ*, that is, (Θ*, B*) is a global solution of the corresponding optimal scoring problem. To show this, we use the standard group-Lasso formulation and prove the following proposition.

Proposition E.1. Let B* be a solution of

$$\min_{\mathbf{B}\in\mathbb{R}^{p\times M}}\ \|\mathbf{Y}-\mathbf{X}\mathbf{B}\|_F^2 + \lambda\sum_{j=1}^{p} w_j\|\beta^j\|_2, \qquad\text{(E.1)}$$

and let Ỹ = YV, where V ∈ R^{M×M} is a unitary matrix. Then B̃ = B*V is a solution of

$$\min_{\mathbf{B}\in\mathbb{R}^{p\times M}}\ \|\tilde{\mathbf{Y}}-\mathbf{X}\mathbf{B}\|_F^2 + \lambda\sum_{j=1}^{p} w_j\|\beta^j\|_2. \qquad\text{(E.2)}$$

Proof. The first-order necessary optimality conditions for B* are

$$\forall j\in S(\mathbf{B}^\star),\quad 2\,\mathbf{x}^{j\top}\big(\mathbf{X}\mathbf{B}^\star-\mathbf{Y}\big) + \lambda w_j\,\|\beta^{\star j}\|_2^{-1}\beta^{\star j} = \mathbf{0}, \qquad\text{(E.3a)}$$
$$\forall j\notin S(\mathbf{B}^\star),\quad 2\,\big\|\mathbf{x}^{j\top}\big(\mathbf{X}\mathbf{B}^\star-\mathbf{Y}\big)\big\|_2 \le \lambda w_j, \qquad\text{(E.3b)}$$

where S(B*) ⊆ {1, ..., p} denotes the set of non-zero row vectors of B* and S̄(B*) is its complement.

First, we note that, from the definition of B̃, we have S(B̃) = S(B*). Then we may rewrite the above conditions as follows:

$$\forall j\in S(\tilde{\mathbf{B}}),\quad 2\,\mathbf{x}^{j\top}\big(\mathbf{X}\tilde{\mathbf{B}}-\tilde{\mathbf{Y}}\big) + \lambda w_j\,\|\tilde\beta^{j}\|_2^{-1}\tilde\beta^{j} = \mathbf{0}, \qquad\text{(E.4a)}$$
$$\forall j\notin S(\tilde{\mathbf{B}}),\quad 2\,\big\|\mathbf{x}^{j\top}\big(\mathbf{X}\tilde{\mathbf{B}}-\tilde{\mathbf{Y}}\big)\big\|_2 \le \lambda w_j, \qquad\text{(E.4b)}$$

where (E.4a) is obtained by multiplying both sides of Equation (E.3a) by V, and also uses that VV⊤ = I, so that for all u ∈ R^M, ‖u⊤‖₂ = ‖u⊤V‖₂. Equation (E.4b) is also obtained from the latter relationship. Conditions (E.4) are then recognized as the first-order necessary conditions for B̃ to be a solution of Problem (E.2). As the latter is convex, these conditions are sufficient, which concludes the proof.
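Proposition E.1 can be illustrated numerically with any group-Lasso solver. The sketch below uses a plain proximal-gradient (ISTA) iteration, which is not the GLOSS algorithm, and checks that solving (E.2) with Ỹ = YV returns B*V; the problem sizes, penalty level and number of iterations are arbitrary.

import numpy as np

def group_lasso(X, Y, lam, w, n_iter=5000):
    # Plain proximal-gradient (ISTA) solver for
    #   min_B ||Y - X B||_F^2 + lam * sum_j w_j ||beta^j||_2,
    # i.e. problem (E.1); a rough sketch, not the GLOSS algorithm.
    B = np.zeros((X.shape[1], Y.shape[1]))
    step = 1.0 / (2 * np.linalg.norm(X, 2) ** 2)          # 1 / Lipschitz constant
    for _ in range(n_iter):
        Z = B - step * 2 * X.T @ (X @ B - Y)              # gradient step
        norms = np.linalg.norm(Z, axis=1, keepdims=True)
        shrink = np.maximum(0, 1 - step * lam * w[:, None] / np.maximum(norms, 1e-12))
        B = shrink * Z                                    # row-wise soft-thresholding
    return B

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 10))
Y = rng.standard_normal((50, 3))
V, _ = np.linalg.qr(rng.standard_normal((3, 3)))          # a unitary (orthogonal) matrix
w = np.ones(10)
B1 = group_lasso(X, Y, lam=5.0, w=w)
B2 = group_lasso(X, Y @ V, lam=5.0, w=w)
print(np.allclose(B1 @ V, B2, atol=1e-6))                 # illustrates Proposition E.1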


F Expected Complete Likelihood and Likelihood

Section 7.1.2 explains that, by maximizing the conditional expectation of the complete log-likelihood Q(θ, θ′) (7.7) by means of the EM algorithm, the log-likelihood (7.1) is also maximized. The value of the log-likelihood can be computed using its definition (7.1), but there is a shorter way to compute it from Q(θ, θ′) when the latter is available:

$$\mathcal{L}(\theta) = \sum_{i=1}^{n}\log\Big(\sum_{k=1}^{K}\pi_k f_k(\mathbf{x}_i;\theta_k)\Big), \qquad\text{(F.1)}$$

$$Q(\theta,\theta') = \sum_{i=1}^{n}\sum_{k=1}^{K} t_{ik}(\theta')\log\big(\pi_k f_k(\mathbf{x}_i;\theta_k)\big), \qquad\text{(F.2)}$$

$$\text{with}\quad t_{ik}(\theta') = \frac{\pi_k' f_k(\mathbf{x}_i;\theta_k')}{\sum_\ell \pi_\ell' f_\ell(\mathbf{x}_i;\theta_\ell')}. \qquad\text{(F.3)}$$

In the EM algorithm, θ′ denotes the model parameters at the previous iteration, tik(θ′) are the posterior probability values computed from θ′ at the previous E-step, and θ (without "prime") denotes the parameters of the current iteration, to be obtained by the maximization of Q(θ, θ′).

Using (F.3), we have

$$Q(\theta,\theta') = \sum_{i,k} t_{ik}(\theta')\log\big(\pi_k f_k(\mathbf{x}_i;\theta_k)\big)
= \sum_{i,k} t_{ik}(\theta')\log\big(t_{ik}(\theta)\big) + \sum_{i,k} t_{ik}(\theta')\log\Big(\sum_\ell \pi_\ell f_\ell(\mathbf{x}_i;\theta_\ell)\Big)
= \sum_{i,k} t_{ik}(\theta')\log\big(t_{ik}(\theta)\big) + \mathcal{L}(\theta).$$

In particular, after the evaluation of tik in the E-step, where θ = θ′, the log-likelihood can be computed using the value of Q(θ, θ) (7.7) and the entropy of the posterior probabilities:

$$\mathcal{L}(\theta) = Q(\theta,\theta) - \sum_{i,k} t_{ik}(\theta)\log\big(t_{ik}(\theta)\big) = Q(\theta,\theta) + H(\mathbf{T}).$$
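The identity L(θ) = Q(θ, θ) + H(T) is easy to check numerically; the sketch below does so for an arbitrary Gaussian mixture with a shared identity covariance (all parameter values are placeholders).

import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)
n, p, K = 200, 2, 3
X = rng.standard_normal((n, p))
pi = np.array([0.2, 0.3, 0.5])
mus = rng.standard_normal((K, p))

# joint terms pi_k f_k(x_i) and posteriors t_ik (E-step, so theta = theta')
joint = np.column_stack([pi[k] * multivariate_normal.pdf(X, mus[k], np.eye(p))
                         for k in range(K)])
T = joint / joint.sum(axis=1, keepdims=True)

loglik = np.log(joint.sum(axis=1)).sum()     # L(theta), definition (F.1)
Q = (T * np.log(joint)).sum()                # Q(theta, theta), definition (F.2)
H = -(T * np.log(T)).sum()                   # entropy of the posteriors
print(np.allclose(loglik, Q + H))            # expected: True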


G Derivation of the M-Step Equations

This appendix shows the whole process to obtain expressions (7.10), (7.11) and (7.12) in the context of a Gaussian mixture model with a common covariance matrix. The criterion to be maximized is

$$Q(\theta,\theta') = \sum_{i,k} t_{ik}(\theta')\log\big(\pi_k f_k(\mathbf{x}_i;\theta_k)\big)
= \sum_k \log(\pi_k)\sum_i t_{ik} - \frac{np}{2}\log(2\pi) - \frac{n}{2}\log|\boldsymbol{\Sigma}| - \frac{1}{2}\sum_{i,k} t_{ik}(\mathbf{x}_i-\boldsymbol{\mu}_k)^\top\boldsymbol{\Sigma}^{-1}(\mathbf{x}_i-\boldsymbol{\mu}_k),$$

which has to be maximized subject to $\sum_k\pi_k = 1$.

The Lagrangian of this problem is

$$\mathcal{L}(\theta) = Q(\theta,\theta') + \lambda\Big(\sum_k\pi_k - 1\Big).$$

Partial derivatives of the Lagrangian are set to zero to obtain the optimal values of πk, μk and Σ.

G.1 Prior probabilities

$$\frac{\partial\mathcal{L}(\theta)}{\partial\pi_k} = 0 \;\Leftrightarrow\; \frac{1}{\pi_k}\sum_i t_{ik} + \lambda = 0,$$

where λ is identified from the constraint, leading to

$$\pi_k = \frac{1}{n}\sum_i t_{ik}.$$


G.2 Means

$$\frac{\partial\mathcal{L}(\theta)}{\partial\boldsymbol{\mu}_k} = 0 \;\Leftrightarrow\; -\frac{1}{2}\sum_i t_{ik}\,2\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_k-\mathbf{x}_i) = 0
\;\Rightarrow\; \boldsymbol{\mu}_k = \frac{\sum_i t_{ik}\mathbf{x}_i}{\sum_i t_{ik}}.$$

G.3 Covariance Matrix

$$\frac{\partial\mathcal{L}(\theta)}{\partial\boldsymbol{\Sigma}^{-1}} = 0 \;\Leftrightarrow\;
\underbrace{\frac{n}{2}\boldsymbol{\Sigma}}_{\text{as per Property 4}} - \underbrace{\frac{1}{2}\sum_{i,k} t_{ik}(\mathbf{x}_i-\boldsymbol{\mu}_k)(\mathbf{x}_i-\boldsymbol{\mu}_k)^\top}_{\text{as per Property 5}} = 0
\;\Rightarrow\; \boldsymbol{\Sigma} = \frac{1}{n}\sum_{i,k} t_{ik}(\mathbf{x}_i-\boldsymbol{\mu}_k)(\mathbf{x}_i-\boldsymbol{\mu}_k)^\top.$$
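Collecting the three updates, a direct transcription of these M-step formulas for a shared covariance matrix could look as follows; this is a sketch, not the Mix-GLOSS M-step, which additionally involves the penalized optimal scoring regression.

import numpy as np

def m_step_common_covariance(X, T):
    # M-step for a Gaussian mixture with one covariance matrix shared by all
    # components, given the posterior probabilities T (n x K) from the E-step.
    n, p = X.shape
    nk = T.sum(axis=0)                         # soft counts per component
    pi = nk / n                                # prior probabilities
    mus = (T.T @ X) / nk[:, None]              # component means
    Sigma = np.zeros((p, p))
    for k in range(T.shape[1]):
        Xc = X - mus[k]
        Sigma += (T[:, k, None] * Xc).T @ Xc   # sum_i t_ik (x_i - mu_k)(x_i - mu_k)'
    return pi, mus, Sigma / n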


Bibliography

F Bach R Jenatton J Mairal and G Obozinski Convex optimization with sparsity-inducing norms Optimization for Machine Learning pages 19ndash54 2011

F R Bach Bolasso model consistent lasso estimation through the bootstrap InProceedings of the 25th international conference on Machine learning ICML 2008

F R Bach R Jenatton J Mairal and G Obozinski Optimization with sparsity-inducing penalties Foundations and Trends in Machine Learning 4(1)1ndash106 2012

J D Banfield and A E Raftery Model-based Gaussian and non-Gaussian clusteringBiometrics pages 803ndash821 1993

A Beck and M Teboulle A fast iterative shrinkage-thresholding algorithm for linearinverse problems SIAM Journal on Imaging Sciences 2(1)183ndash202 2009

H Bensmail and G Celeux Regularized Gaussian discriminant analysis through eigen-value decomposition Journal of the American statistical Association 91(436)1743ndash1748 1996

P J Bickel and E Levina Some theory for Fisher's linear discriminant function 'naive Bayes' and some alternatives when there are many more variables than observations Bernoulli 10(6)989–1010 2004

C Bienarcki G Celeux G Govaert and F Langrognet MIXMOD Statistical Docu-mentation httpwwwmixmodorg 2008

C M Bishop Pattern Recognition and Machine Learning Springer New York 2006

C Bouveyron and C Brunet Discriminative variable selection for clustering with thesparse Fisher-EM algorithm Technical Report 12042067 Arxiv e-prints 2012a

C Bouveyron and C Brunet Simultaneous model-based clustering and visualization inthe Fisher discriminative subspace Statistics and Computing 22(1)301ndash324 2012b

S Boyd and L Vandenberghe Convex optimization Cambridge university press 2004

L Breiman Better subset regression using the nonnegative garrote Technometrics 37(4)373ndash384 1995

L Breiman and R Ihaka Nonlinear discriminant analysis via ACE and scaling TechnicalReport 40 University of California Berkeley 1984


T Cai and W Liu A direct estimation approach to sparse linear discriminant analysisJournal of the American Statistical Association 106(496)1566ndash1577 2011

S Canu and Y Grandvalet Outcomes of the equivalence of adaptive ridge with leastabsolute shrinkage Advances in Neural Information Processing Systems page 4451999

C Caramanis S Mannor and H Xu Robust optimization in machine learning InS Sra S Nowozin and S J Wright editors Optimization for Machine Learningpages 369ndash402 MIT Press 2012

B Chidlovskii and L Lecerf Scalable feature selection for multi-class problems InW Daelemans B Goethals and K Morik editors Machine Learning and KnowledgeDiscovery in Databases volume 5211 of Lecture Notes in Computer Science pages227ndash240 Springer 2008

L Clemmensen T Hastie D Witten and B Ersboslashll Sparse discriminant analysisTechnometrics 53(4)406ndash413 2011

C De Mol E De Vito and L Rosasco Elastic-net regularization in learning theoryJournal of Complexity 25(2)201ndash230 2009

A P Dempster N M Laird and D B Rubin Maximum likelihood from incompletedata via the em algorithm Journal of the Royal Statistical Society Series B (Method-ological) 39(1)1ndash38 1977 ISSN 0035-9246

D L Donoho M Elad and V N Temlyakov Stable recovery of sparse overcompleterepresentations in the presence of noise IEEE Transactions on Information Theory52(1)6ndash18 2006

R O Duda P E Hart and D G Stork Pattern Classification Wiley 2000

B Efron T Hastie I Johnstone and R Tibshirani Least angle regression The Annalsof statistics 32(2)407ndash499 2004

Jianqing Fan and Yingying Fan High dimensional classification using features annealedindependence rules Annals of statistics 36(6)2605 2008

R A Fisher The use of multiple measurements in taxonomic problems Annals ofHuman Genetics 7(2)179ndash188 1936

V Franc and S Sonnenburg Optimized cutting plane algorithm for support vectormachines In Proceedings of the 25th international conference on Machine learningpages 320ndash327 ACM 2008

J Friedman T Hastie and R Tibshirani The Elements of Statistical Learning DataMining Inference and Prediction Springer 2009


J Friedman T Hastie and R Tibshirani A note on the group lasso and a sparse grouplasso Technical Report 10010736 ArXiv e-prints 2010

J H Friedman Regularized discriminant analysis Journal of the American StatisticalAssociation 84(405)165ndash175 1989

W J Fu Penalized regressions the bridge versus the lasso Journal of Computationaland Graphical Statistics 7(3)397ndash416 1998

A Gelman J B Carlin H S Stern and D B Rubin Bayesian Data Analysis Chap-man amp HallCRC 2003

D Ghosh and A M Chinnaiyan Classification and selection of biomarkers in genomicdata using lasso Journal of Biomedicine and Biotechnology 2147ndash154 2005

G Govaert Y Grandvalet X Liu and L F Sanchez Merchante Implementation base-line for clustering Technical Report D71-m12 Massive Sets of Heuristics for MachineLearning httpssecuremash-projecteufilesmash-deliverable-D71-m12pdf 2010

G Govaert Y Grandvalet B Laval X Liu and L F Sanchez Merchante Implemen-tations of original clustering Technical Report D72-m24 Massive Sets of Heuristicsfor Machine Learning httpssecuremash-projecteufilesmash-deliverable-D72-m24pdf 2011

Y Grandvalet Least absolute shrinkage is equivalent to quadratic penalization InPerspectives in Neural Computing volume 98 pages 201ndash206 1998

Y Grandvalet and S Canu Adaptive scaling for feature selection in svms Advances inNeural Information Processing Systems 15553ndash560 2002

L Grosenick S Greer and B Knutson Interpretable classifiers for fMRI improveprediction of purchases IEEE Transactions on Neural Systems and RehabilitationEngineering 16(6)539ndash548 2008

Y Guermeur G Pollastri A Elisseeff D Zelus H Paugam-Moisy and P Baldi Com-bining protein secondary structure prediction models with ensemble methods of opti-mal complexity Neurocomputing 56305ndash327 2004

J Guo E Levina G Michailidis and J Zhu Pairwise variable selection for high-dimensional model-based clustering Biometrics 66(3)793ndash804 2010

I Guyon and A Elisseeff An introduction to variable and feature selection Journal ofMachine Learning Research 31157ndash1182 2003

T Hastie and R Tibshirani Discriminant analysis by Gaussian mixtures Journal ofthe Royal Statistical Society Series B (Methodological) 58(1)155ndash176 1996

T Hastie R Tibshirani and A Buja Flexible discriminant analysis by optimal scoringJournal of the American Statistical Association 89(428)1255ndash1270 1994


T Hastie A Buja and R Tibshirani Penalized discriminant analysis The Annals ofStatistics 23(1)73ndash102 1995

A E Hoerl and R W Kennard Ridge regression Biased estimation for nonorthogonalproblems Technometrics 12(1)55ndash67 1970

J Huang S Ma H Xie and C H Zhang A group bridge approach for variableselection Biometrika 96(2)339ndash355 2009

T Joachims Training linear svms in linear time In Proceedings of the 12th ACMSIGKDD international conference on Knowledge discovery and data mining pages217ndash226 ACM 2006

K Knight and W Fu Asymptotics for lasso-type estimators The Annals of Statistics28(5)1356ndash1378 2000

P F Kuan S Wang X Zhou and H Chu A statistical framework for illumina DNAmethylation arrays Bioinformatics 26(22)2849ndash2855 2010

T Lange M Braun V Roth and J Buhmann Stability-based model selection Ad-vances in Neural Information Processing Systems 15617ndash624 2002

M H C Law M A T Figueiredo and A K Jain Simultaneous feature selectionand clustering using mixture models IEEE Transactions on Pattern Analysis andMachine Intelligence 26(9)1154ndash1166 2004

Y Lee Y Lin and G Wahba Multicategory support vector machines Journal of theAmerican Statistical Association 99(465)67ndash81 2004

C Leng Sparse optimal scoring for multiclass cancer diagnosis and biomarker detectionusing microarray data Computational Biology and Chemistry 32(6)417ndash425 2008

C Leng Y Lin and G Wahba A note on the lasso and related procedures in modelselection Statistica Sinica 16(4)1273 2006

H Liu and L Yu Toward integrating feature selection algorithms for classification andclustering IEEE Transactions on Knowledge and Data Engineering 17(4)491ndash5022005

J MacQueen Some methods for classification and analysis of multivariate observa-tions In Proceedings of the fifth Berkeley Symposium on Mathematical Statistics andProbability volume 1 pages 281ndash297 University of California Press 1967

Q Mai H Zou and M Yuan A direct approach to sparse discriminant analysis inultra-high dimensions Biometrika 99(1)29ndash42 2012

C Maugis G Celeux and M L Martin-Magniette Variable selection for clusteringwith Gaussian mixture models Biometrics 65(3)701ndash709 2009a


C Maugis G Celeux and ML Martin-Magniette Selvarclust software for variable selection in model-based clustering httpwwwmathuniv-toulousefr~maugisSelvarClustHomepagehtml 2009b

L Meier S Van De Geer and P Buhlmann The group lasso for logistic regressionJournal of the Royal Statistical Society Series B (Statistical Methodology) 70(1)53ndash71 2008

N Meinshausen and P Buhlmann High-dimensional graphs and variable selection withthe lasso The Annals of Statistics 34(3)1436ndash1462 2006

B Moghaddam Y Weiss and S Avidan Generalized spectral bounds for sparse LDAIn Proceedings of the 23rd international conference on Machine learning pages 641ndash648 ACM 2006

B Moghaddam Y Weiss and S Avidan Fast pixelpart selection with sparse eigen-vectors In IEEE 11th International Conference on Computer Vision 2007 ICCV2007 pages 1ndash8 2007

Y Nesterov Gradient methods for minimizing composite functions preprint 2007

S Newcomb A generalized theory of the combination of observations so as to obtainthe best result American Journal of Mathematics 8(4)343ndash366 1886

B Ng and R Abugharbieh Generalized group sparse classifiers with application in fMRIbrain decoding In Computer Vision and Pattern Recognition (CVPR) 2011 IEEEConference on pages 1065ndash1071 IEEE 2011

M R Osborne B Presnell and B A Turlach On the lasso and its dual Journal ofComputational and Graphical statistics 9(2)319ndash337 2000a

M R Osborne B Presnell and B A Turlach A new approach to variable selection inleast squares problems IMA Journal of Numerical Analysis 20(3)389ndash403 2000b

W Pan and X Shen Penalized model-based clustering with application to variableselection Journal of Machine Learning Research 81145ndash1164 2007

W Pan X Shen A Jiang and R P Hebbel Semi-supervised learning via penalizedmixture model with application to microarray sample classification Bioinformatics22(19)2388ndash2395 2006

K Pearson Contributions to the mathematical theory of evolution Philosophical Trans-actions of the Royal Society of London 18571ndash110 1894

S Perkins K Lacker and J Theiler Grafting Fast incremental feature selection bygradient descent in function space Journal of Machine Learning Research 31333ndash1356 2003


Z Qiao L Zhou and J Huang Sparse linear discriminant analysis with applications tohigh dimensional low sample size data International Journal of Applied Mathematics39(1) 2009

A E Raftery and N Dean Variable selection for model-based clustering Journal ofthe American Statistical Association 101(473)168ndash178 2006

C R Rao The utilization of multiple measurements in problems of biological classi-fication Journal of the Royal Statistical Society Series B (Methodological) 10(2)159ndash203 1948

S Rosset and J Zhu Piecewise linear regularized solution paths The Annals of Statis-tics 35(3)1012ndash1030 2007

V Roth The generalized lasso IEEE Transactions on Neural Networks 15(1)16ndash282004

V Roth and B Fischer The group-lasso for generalized linear models uniqueness ofsolutions and efficient algorithms In W W Cohen A McCallum and S T Roweiseditors Machine Learning Proceedings of the Twenty-Fifth International Conference(ICML 2008) volume 307 of ACM International Conference Proceeding Series pages848ndash855 2008

V Roth and T Lange Feature selection in clustering problems In S Thrun L KSaul and B Scholkopf editors Advances in Neural Information Processing Systems16 pages 473ndash480 MIT Press 2004

C Sammut and G I Webb Encyclopedia of Machine Learning Springer-Verlag NewYork Inc 2010

L F Sanchez Merchante Y Grandvalet and G Govaert An efficient approach to sparselinear discriminant analysis In Proceedings of the 29th International Conference onMachine Learning ICML 2012

Gideon Schwarz Estimating the dimension of a model The annals of statistics 6(2)461ndash464 1978

A J Smola SVN Vishwanathan and Q Le Bundle methods for machine learningAdvances in Neural Information Processing Systems 201377ndash1384 2008

S Sonnenburg G Ratsch C Schafer and B Scholkopf Large scale multiple kernellearning Journal of Machine Learning Research 71531ndash1565 2006

P Sprechmann I Ramirez G Sapiro and Y Eldar Collaborative hierarchical sparsemodeling In Information Sciences and Systems (CISS) 2010 44th Annual Conferenceon pages 1ndash6 IEEE 2010

M Szafranski Penalites Hierarchiques pour lrsquoIntegration de Connaissances dans lesModeles Statistiques PhD thesis Universite de Technologie de Compiegne 2008


M Szafranski Y Grandvalet and P Morizet-Mahoudeaux Hierarchical penalizationAdvances in Neural Information Processing Systems 2008

R Tibshirani Regression shrinkage and selection via the lasso Journal of the RoyalStatistical Society Series B (Methodological) pages 267ndash288 1996

J E Vogt and V Roth The group-lasso l1 regularization versus l12 regularization InPattern Recognition 32-nd DAGM Symposium Lecture Notes in Computer Science2010

S Wang and J Zhu Variable selection for model-based high-dimensional clustering andits application to microarray data Biometrics 64(2)440ndash448 2008

D Witten and R Tibshirani Penalized classification using Fisherrsquos linear discriminantJournal of the Royal Statistical Society Series B (Statistical Methodology) 73(5)753ndash772 2011

D M Witten and R Tibshirani A framework for feature selection in clustering Journalof the American Statistical Association 105(490)713ndash726 2010

D M Witten R Tibshirani and T Hastie A penalized matrix decomposition withapplications to sparse principal components and canonical correlation analysis Bio-statistics 10(3)515ndash534 2009

M Wu and B Scholkopf A local learning approach for clustering Advances in NeuralInformation Processing Systems 191529 2007

MC Wu L Zhang Z Wang DC Christiani and X Lin Sparse linear discriminantanalysis for simultaneous testing for the significance of a gene setpathway and geneselection Bioinformatics 25(9)1145ndash1151 2009

T T Wu and K Lange Coordinate descent algorithms for lasso penalized regressionThe Annals of Applied Statistics pages 224ndash244 2008

B Xie W Pan and X Shen Penalized model-based clustering with cluster-specificdiagonal covariance matrices and grouped variables Electronic Journal of Statistics2168ndash172 2008a

B Xie W Pan and X Shen Variable selection in penalized model-based clustering viaregularization on grouped parameters Biometrics 64(3)921ndash930 2008b

C Yang X Wan Q Yang H Xue and W Yu Identifying main effects and epistaticinteractions from large-scale snp data via adaptive group lasso BMC bioinformatics11(Suppl 1)S18 2010

J Ye Least squares linear discriminant analysis In Proceedings of the 24th internationalconference on Machine learning pages 1087ndash1093 ACM 2007


M Yuan and Y Lin Model selection and estimation in regression with grouped variablesJournal of the Royal Statistical Society Series B (Statistical Methodology) 68(1)49ndash67 2006

P Zhao and B Yu On model selection consistency of lasso Journal of Machine LearningResearch 7(2)2541 2007

P Zhao G Rocha and B Yu The composite absolute penalties family for grouped andhierarchical variable selection The Annals of Statistics 37(6A)3468ndash3497 2009

H Zhou W Pan and X Shen Penalized model-based clustering with unconstrainedcovariance matrices Electronic Journal of Statistics 31473ndash1496 2009

H Zou The adaptive lasso and its oracle properties Journal of the American StatisticalAssociation 101(476)1418ndash1429 2006

H Zou and T Hastie Regularization and variable selection via the elastic net Journal ofthe Royal Statistical Society Series B (Statistical Methodology) 67(2)301ndash320 2005



iv

List of Figures

11 MASH project logo 5

21 Example of relevant features 1022 Four key steps of feature selection 1123 Admissible sets in two dimensions for different pure norms ||β||p 1424 Two dimensional regularized problems with ||β||1 and ||β||2 penalties 1525 Admissible sets for the Lasso and Group-Lasso 2026 Sparsity patterns for an example with 8 variables characterized by 4 pa-

rameters 20

41 Graphical representation of the variational approach to Group-Lasso 45

51 GLOSS block diagram 5052 Graph and Laplacian matrix for a 3times 3 image 56

61 TPR versus FPR for all simulations 6062 2D-representations of Nakayama and Sun datasets based on the two first

discriminant vectors provided by GLOSS and SLDA 6263 USPS digits ldquo1rdquo and ldquo0rdquo 6364 Discriminant direction between digits ldquo1rdquo and ldquo0rdquo 6465 Sparse discriminant direction between digits ldquo1rdquo and ldquo0rdquo 64

91 Mix-GLOSS Loops Scheme 8892 Mix-GLOSS model selection diagram 92

101 Class mean vectors for each artificial simulation 94102 TPR versus FPR for all simulations 97

v

List of Tables

61 Experimental results for simulated data supervised classification 5962 Average TPR and FPR for all simulations 6063 Experimental results for gene expression data supervised classification 61

101 Experimental results for simulated data unsupervised clustering 96102 Average TPR versus FPR for all clustering simulations 96

vii

Notation and Symbols

Throughout this thesis vectors are denoted by lowercase letters in bold font andmatrices by uppercase letters in bold font Unless otherwise stated vectors are columnvectors and parentheses are used to build line vectors from comma-separated lists ofscalars or to build matrices from comma-separated lists of column vectors

Sets

N the set of natural numbers N = 1 2 R the set of reals|A| cardinality of a set A (for finite sets the number of elements)A complement of set A

Data

X input domainxi input sample xi isin XX design matrix X = (xgt1 x

gtn )gt

xj column j of Xyi class indicator of sample i

Y indicator matrix Y = (ygt1 ygtn )gt

z complete data z = (xy)Gk set of the indices of observations belonging to class kn number of examplesK number of classesp dimension of Xi j k indices running over N

Vectors Matrices and Norms

0 vector with all entries equal to zero1 vector with all entries equal to oneI identity matrixAgt transposed of matrix A (ditto for vector)Aminus1 inverse of matrix Atr(A) trace of matrix A|A| determinant of matrix Adiag(v) diagonal matrix with v on the diagonalv1 L1 norm of vector vv2 L2 norm of vector vAF Frobenius norm of matrix A

ix

Notation and Symbols

Probability

E [middot] expectation of a random variablevar [middot] variance of a random variableN (micro σ2) normal distribution with mean micro and variance σ2

W(W ν) Wishart distribution with ν degrees of freedom and W scalematrix

H (X) entropy of random variable XI (XY ) mutual information between random variables X and Y

Mixture Models

yik hard membership of sample i to cluster kfk distribution function for cluster ktik posterior probability of sample i to belong to cluster kT posterior probability matrixπk prior probability or mixture proportion for cluster kmicrok mean vector of cluster kΣk covariance matrix of cluster kθk parameter vector for cluster k θk = (microkΣk)

θ(t) parameter vector at iteration t of the EM algorithmf(Xθ) likelihood functionL(θ X) log-likelihood functionLC(θ XY) complete log-likelihood function

Optimization

J(middot) cost functionL(middot) Lagrangianβ generic notation for the solution wrt β

βls least squares solution coefficient vectorA active setγ step size to update regularization pathh direction to update regularization path

x

Notation and Symbols

Penalized models

λ, λ1, λ2   penalty parameters
P_λ(θ)      penalty term over a generic parameter vector
β_kj        coefficient j of discriminant vector k
β_k         kth discriminant vector, β_k = (β_k1, . . . , β_kp)
B           matrix of discriminant vectors, B = (β_1, . . . , β_{K−1})
β^j         jth row of B, B = (β^{1T}, . . . , β^{pT})^T
B_LDA       coefficient matrix in the LDA domain
B_CCA       coefficient matrix in the CCA domain
B_OS        coefficient matrix in the OS domain
X_LDA       data matrix in the LDA domain
X_CCA       data matrix in the CCA domain
X_OS        data matrix in the OS domain
θ_k         score vector k
Θ           score matrix, Θ = (θ_1, . . . , θ_{K−1})
Y           label matrix
Ω           penalty matrix
L_CP(θ; X, Z) penalized complete log-likelihood function
Σ_B         between-class covariance matrix
Σ_W         within-class covariance matrix
Σ_T         total covariance matrix
Σ̂_B         sample between-class covariance matrix
Σ̂_W         sample within-class covariance matrix
Σ̂_T         sample total covariance matrix
Λ           inverse of the covariance matrix, or precision matrix
w_j         weights
τ_j         penalty components of the variational approach


Part I

Context and Foundations


This thesis is divided into three parts. In Part I, I introduce the context in which this work has been developed, the project that funded it and the constraints that we had to obey. Generic notions are also detailed here to introduce the models and some basic concepts that will be used along this document, and the state of the art is also reviewed.

The first contribution of this thesis is explained in Part II, where I present the supervised learning algorithm GLOSS and its supporting theory, as well as some experiments to test its performance compared to other state of the art mechanisms. Before describing the algorithm and the experiments, its theoretical foundations are provided.

The second contribution is described in Part III, with a structure analogous to Part II but for the unsupervised domain. The clustering algorithm Mix-GLOSS adapts the supervised technique from Part II by means of a modified EM. This part is also furnished with specific theoretical foundations, an experimental section and a final discussion.


1 Context

The MASH project is a research initiative to investigate the open and collaborative design of feature extractors for the Machine Learning scientific community. The project is structured around a web platform (http://mash-project.eu) comprising collaborative tools such as wiki documentation, forums, coding templates and an experiment center empowered with non-stop calculation servers. The applications targeted by MASH are vision and goal-planning problems, either in a 3D virtual environment or with a real robotic arm.

The MASH consortium is led by the IDIAP Research Institute in Switzerland. The other members are the University of Potsdam in Germany, the Czech Technical University of Prague, the National Institute for Research in Computer Science and Control (INRIA) in France, and the National Centre for Scientific Research (CNRS), also in France, through the laboratory of Heuristics and Diagnosis for Complex Systems (HEUDIASYC) attached to the University of Technology of Compiègne.

From the research point of view, the members of the consortium must deal with four main goals:

1. Software development of the website, framework and APIs.

2. Classification and goal-planning in high dimensional feature spaces.

3. Interfacing the platform with the 3D virtual environment and the robot arm.

4. Building tools to assist contributors with the development of the feature extractors and the configuration of the experiments.


Figure 1.1: MASH project logo


The work detailed in this text has been done in the context of goal 4. From the very beginning of the project, our role has been to provide the users with some feedback regarding the feature extractors. At the moment of writing this thesis, the number of public feature extractors reaches 75. In addition to the public ones, there are also private extractors that contributors decide not to share with the rest of the community; the last number I was aware of was about 300. Within those 375 extractors, there must be some sharing the same theoretical principles or supplying similar features. The framework of the project tests every new piece of code on some reference datasets in order to provide a ranking depending on the quality of the estimation. However, similar performance of two extractors on a particular dataset does not mean that both are using the same variables.

Our engagement was to provide some textual or graphical tools to discover which extractors compute features similar to other ones. Our hypothesis is that many of them use the same theoretical foundations, which should induce a grouping of similar extractors. If we succeed in discovering those groups, we would also be able to select representatives. This information can be used in several ways. For example, from the perspective of a user who develops feature extractors, it would be interesting to compare the performance of his code against the K representatives instead of against the whole database. As another example, imagine a user who wants to obtain the best prediction results for a particular dataset: instead of selecting all the feature extractors, creating an extremely high dimensional space, he could select only the K representatives, foreseeing similar results with a faster experiment.

As there is no prior knowledge about the latent structure, we make use of unsupervised techniques. Below is a brief description of the different tools that we developed for the web platform.

• Clustering Using Mixture Models. This is a well-known technique that models the data as if it was randomly generated from a distribution function. This distribution is typically a mixture of Gaussians with unknown mixture proportions, means and covariance matrices. The number of Gaussian components matches the number of expected groups. The parameters of the model are computed using the EM algorithm, and the clusters are built by maximum a posteriori estimation. For the calculation we use mixmod, a C++ library that can be interfaced with Matlab. This library allows working with high dimensional data. Further information regarding mixmod is given by Biernacki et al. (2008). All details concerning the implemented tool are given in deliverable “mash-deliverable-D71-m12” (Govaert et al., 2010).

• Sparse Clustering Using Penalized Optimal Scoring. This technique again intends to perform clustering by modelling the data as a mixture of Gaussian distributions. However, instead of using a classic EM algorithm for estimating the components' parameters, the M-step is replaced by a penalized Optimal Scoring problem. This replacement induces sparsity, improving the robustness and the interpretability of the results. Its theory will be explained later in this thesis.


All details concerning the implemented tool can be found in deliverable “mash-deliverable-D72-m24” (Govaert et al., 2011).

• Table Clustering Using The RV Coefficient. This technique applies clustering methods directly to the tables computed by the feature extractors, instead of creating a single matrix. A distance in the extractor space is defined using the RV coefficient, which is a multivariate generalization of Pearson's correlation coefficient in the form of an inner product. The distance is defined for every pair i and j as RV(Oi, Oj), where Oi and Oj are operators computed from the tables returned by feature extractors i and j. Once we have a distance matrix, several standard techniques may be used to group extractors. A detailed description of this technique can be found in deliverables “mash-deliverable-D71-m12” (Govaert et al., 2010) and “mash-deliverable-D72-m24” (Govaert et al., 2011).

I am not extending this section with further explanations about the MASH project or deeper details about the theory that we used to fulfil our engagements. I simply refer to the public deliverables of the project, where everything is carefully detailed (Govaert et al., 2010, 2011).


2 Regularization for Feature Selection

With the advances in technology, data is becoming larger and larger, resulting in high dimensional ensembles of information. Genomics, textual indexation and medical images are some examples of data that can easily exceed thousands of dimensions. The first experiments aiming to cluster the data from the MASH project (see Chapter 1) intended to work with the whole dimensionality of the samples. As the number of feature extractors rose, the numerical issues also rose. Redundancy or extremely correlated features may happen if two contributors implement the same extractor with different names. When the number of features exceeded the number of samples, we started to deal with singular covariance matrices, whose inverses are not defined; many algorithms in the field of Machine Learning make use of this statistic.

2.1 Motivations

There is a quite recent effort in the direction of handling high dimensional data. Traditional techniques can be adapted, but quite often large dimensions render those techniques useless. Linear Discriminant Analysis was shown to be no better than a “random guessing” of the object labels when the dimension is larger than the sample size (Bickel and Levina, 2004; Fan and Fan, 2008).

As a rule of thumb, in discriminant and clustering problems, the complexity of the calculus increases with the number of objects in the database, the number of features (dimensionality) and the number of classes or clusters. One way to reduce this complexity is to reduce the number of features. This reduction induces more robust estimators, allows faster learning and predictions in supervised environments, and eases interpretation in the unsupervised framework. Removing features must be done wisely to avoid removing critical information.

When talking about dimensionality reduction, there are two families of techniques that could induce confusion:

• Reduction by feature transformation summarizes the dataset with fewer dimensions by creating combinations of the original attributes. These techniques are less effective when there are many irrelevant attributes (noise). Principal Component Analysis and Independent Component Analysis are two popular examples.

• Reduction by feature selection removes irrelevant dimensions, preserving the integrity of the informative features from the original dataset. The problem comes out when there is a restriction in the number of variables to preserve, and discarding the exceeding dimensions leads to a loss of information. Prediction with feature selection is computationally cheaper, because only relevant features are used, and the resulting models are easier to interpret. The Lasso operator is an example of this category.


Figure 2.1: Example of relevant features, from Chidlovskii and Lecerf (2008)


As a basic rule, we can use reduction techniques by feature transformation when the majority of the features are relevant, and when there is a lot of redundancy or correlation. On the contrary, feature selection techniques are useful when there are plenty of useless or noisy features (irrelevant information) that need to be filtered out. In the paper of Chidlovskii and Lecerf (2008) we find a great explanation of the difference between irrelevant and redundant features. The following two paragraphs are almost exact reproductions of their text:

“Irrelevant features are those which provide negligible distinguishing information. For example, if the objects are all dogs, cats or squirrels, and it is desired to classify each new animal into one of these three classes, the feature of color may be irrelevant if each of dogs, cats and squirrels have about the same distribution of brown, black and tan fur colors. In such a case, knowing that an input animal is brown provides negligible distinguishing information for classifying the animal as a cat, dog or squirrel. Features which are irrelevant for a given classification problem are not useful, and accordingly a feature that is irrelevant can be filtered out.

Redundant features are those which provide distinguishing information, but are cumulative to another feature or group of features that provide substantially the same distinguishing information. Using the previous example, consider illustrative “diet” and “domestication” features. Dogs and cats both have similar carnivorous diets, while squirrels consume nuts and so forth. Thus, the “diet” feature can efficiently distinguish squirrels from dogs and cats, although it provides little information to distinguish between dogs and cats. Dogs and cats are also both typically domesticated animals, while squirrels are wild animals. Thus, the “domestication” feature provides substantially the same information as the “diet” feature, namely distinguishing squirrels from dogs and cats but not distinguishing between dogs and cats. Thus, the “diet” and “domestication” features are cumulative, and one can identify one of these features as redundant, so as to be filtered out. However, unlike irrelevant features, care should be taken with redundant features to ensure that one retains enough of the redundant features to provide the relevant distinguishing information. In the foregoing example, one may wish to filter out either the


Figure 2.2: The four key steps of feature selection, according to Liu and Yu (2005)

“diet” feature or the “domestication” feature, but if one removes both the “diet” and the “domestication” features, then useful distinguishing information is lost.”

There are some tricks to build robust estimators when the number of features exceeds the number of samples. Ignoring some of the dependencies among variables and replacing the covariance matrix by a diagonal approximation are two of them. Another popular technique, and the one chosen in this thesis, is imposing regularity conditions.

2.2 Categorization of Feature Selection Techniques

Feature selection is one of the most frequent techniques for preprocessing data, in order to remove irrelevant, redundant or noisy features. Nevertheless, the risk of removing some informative dimensions is always there, thus the relevance of the remaining subset of features must be measured.

I reproduce here the scheme that generalizes any feature selection process, as shown by Liu and Yu (2005). Figure 2.2 provides a very intuitive scheme with the four key steps of a feature selection algorithm.

The classification of those algorithms can respond to different criteria. Guyon and Elisseeff (2003) propose a check list that summarizes the steps that may be taken to solve a feature selection problem, guiding the user through several techniques. Liu and Yu (2005) propose a framework that integrates supervised and unsupervised feature selection algorithms through a categorizing framework. Both references are excellent reviews to characterize feature selection techniques according to their characteristics. I propose a framework inspired by these references that does not cover all the possibilities, but which gives a good summary of the existing ones:

• Depending on the type of integration with the machine learning algorithm, we have:

– Filter Models - The filter models work as a preprocessing step, using an independent evaluation criterion to select a subset of variables without assistance of the mining algorithm.

– Wrapper Models - The wrapper models require a classification or clustering algorithm and use its prediction performance to assess the relevance of the subset selection. The feature selection is done in the optimization block, while


the feature subset evaluation is done in a different one. Therefore, the criterion to optimize and the criterion to evaluate may be different. Those algorithms are computationally expensive.

– Embedded Models - They perform variable selection inside the learning machine, with the selection being made at the training step. That means that there is only one criterion: the optimization and the evaluation form a single block, and the features are selected to optimize this unique criterion and do not need to be re-evaluated in a later phase. That makes them more efficient, since no validation or test process is needed for every variable subset investigated. However, they are less universal, because they are specific to the training process of a given mining algorithm.

• Depending on the feature searching technique:

– Complete - No subsets are missed from evaluation; involves combinatorial searches.

– Sequential - Features are added (forward searches) or removed (backward searches) one at a time.

– Random - The initial subset, or even subsequent subsets, are randomly chosen to escape local optima.

• Depending on the evaluation technique:

– Distance Measures - Choosing the features that maximize the difference in separability, divergence or discrimination measures.

– Information Measures - Choosing the features that maximize the information gain, that is, minimizing the posterior uncertainty.

– Dependency Measures - Measuring the correlation between features.

– Consistency Measures - Finding a minimum number of features that separate classes as consistently as the full set of features can.

– Predictive Accuracy - Use the selected features to predict the labels.

– Cluster Goodness - Use the selected features to perform clustering and evaluate the result (cluster compactness, scatter separability, maximum likelihood).

The distance, information, correlation and consistency measures are typical of variable ranking algorithms, commonly used in filter models. Predictive accuracy and cluster goodness allow the evaluation of subsets of features and can be used in wrapper and embedded models.

In this thesis we developed some algorithms following the embedded paradigm, either in the supervised or the unsupervised framework. Integrating the subset selection problem in the overall learning problem may be computationally demanding, but it is appealing from a conceptual viewpoint: there is a perfect match between the formalized


goal and the process dedicated to achieving this goal, thus avoiding many problems arising in filter or wrapper methods. Practically, it is however intractable to solve hard selection problems exactly when the number of features exceeds a few tens. Regularization techniques allow a sensible approximate answer to the selection problem to be provided with reasonable computing resources, and their recent study has demonstrated powerful theoretical and empirical results. The following section introduces the tools that will be employed in Parts II and III.

2.3 Regularization

In the machine learning domain, the term “regularization” refers to a technique that introduces some extra assumptions or knowledge in the resolution of an optimization problem. The most popular point of view presents regularization as a mechanism to prevent overfitting, but it can also help to fix some numerical issues in ill-posed problems (like some matrix singularities when solving a linear system), besides other interesting properties like the capacity to induce sparsity, thus producing models that are easier to interpret.

An ill-posed problem violates the rules defined by Jacques Hadamard, according to whom the solution to a mathematical problem has to exist, be unique and be stable. This happens, for example, when the number of samples is smaller than their dimensionality and we try to infer some generic laws from such a small sample of the population. Regularization transforms an ill-posed problem into a well-posed one. To do that, some a priori knowledge is introduced in the solution through a regularization term that penalizes a criterion J with a penalty P. Below are the two most popular formulations:

$$\min_{\beta} \; J(\beta) + \lambda P(\beta) \qquad (2.1)$$

$$\min_{\beta} \; J(\beta) \quad \text{s.t.} \quad P(\beta) \leq t \qquad (2.2)$$

In expressions (2.1) and (2.2), the parameters λ and t have a similar function, that is, to control the trade-off between fitting the data to the model according to J(β) and the effect of the penalty P(β). The set such that the constraint in (2.2) is verified, {β : P(β) ≤ t}, is called the admissible set. This penalty term can also be understood as a measure that quantifies the complexity of the model (as in the definition of Sammut and Webb, 2010). Note that regularization terms can also be interpreted in the Bayesian paradigm as prior distributions on the parameters of the model. In this thesis both views will be taken.

In this section I review the pure, mixed and hybrid penalties that will be used in the following chapters to implement feature selection. I first list important properties that may pertain to any type of penalty.


Figure 2.3: Admissible sets in two dimensions for different pure norms ‖β‖_p

2.3.1 Important Properties

Penalties may have different properties that are more or less interesting depending on the problem and the expected solution. The most important properties for our purposes here are convexity, sparsity and stability.

Convexity. Regarding optimization, convexity is a desirable property that eases finding global solutions. A convex function verifies

$$\forall (x_1, x_2) \in \mathcal{X}^2, \quad f(t x_1 + (1-t) x_2) \leq t f(x_1) + (1-t) f(x_2) \qquad (2.3)$$

for any value of t ∈ [0, 1]. Replacing the inequality by a strict inequality, we obtain the definition of strict convexity. A regularized expression like (2.2) is convex if the function J(β) and the penalty P(β) are both convex.

Sparsity. Null coefficients usually furnish models that are easier to interpret. When sparsity does not harm the quality of the predictions, it is a desirable property, which moreover entails less memory usage and fewer computational resources.

Stability. There are numerous notions of stability or robustness, which measure how the solution varies when the input is perturbed by small changes. This perturbation can be adding, removing or replacing a few elements in the training set. Adding regularization, in addition to preventing overfitting, is a means to favor the stability of the solution.

2.3.2 Pure Penalties

For pure penalties, defined as P(β) = ‖β‖_p, convexity holds for p ≥ 1. This is graphically illustrated in Figure 2.3, borrowed from Szafranski (2008), whose Chapter 3 is an excellent review of regularization techniques and of the algorithms to solve them. In


Figure 2.4: Two-dimensional regularized problems with ‖β‖_1 and ‖β‖_2 penalties

this figure, the shape of the admissible sets corresponding to different pure penalties is greyed out. Since convexity of the penalty corresponds to the convexity of the set, we see that this property is verified for p ≥ 1.

Regularizing a linear model with a norm like ‖β‖_p means that the larger the component |β_j|, the more important the feature x_j in the estimation. On the contrary, the closer it is to zero, the more dispensable it is. In the limit of |β_j| = 0, x_j is not involved in the model. If many dimensions can be dismissed, then we can speak of sparsity.

A graphical interpretation of sparsity, borrowed from Marie Szafranski, is given in Figure 2.4. In a 2D problem, a solution can be considered as sparse if any of its components (β1 or β2) is null, that is, if the optimal β is located on one of the coordinate axes. Let us consider a search algorithm that minimizes an expression like (2.2), where J(β) is a quadratic function. When the solution to the unconstrained problem does not belong to the admissible set defined by P(β) (greyed out area), the solution to the constrained problem is as close as possible to the global minimum of the cost function inside the grey region. Depending on the shape of this region, the probability of having a sparse solution varies. A region with vertexes, like the one corresponding to a L1 penalty, has more chances of inducing sparse solutions than that of a L2 penalty. That idea is displayed in Figure 2.4, where J(β) is a quadratic function represented by three isolevel curves whose global minimum β^ls is outside the penalties' admissible region. The closest point to this β^ls for the L1 regularization is β^ℓ1, and for the L2 regularization it is β^ℓ2. Solution β^ℓ1 is sparse because its second component is zero, while both components of β^ℓ2 are different from zero.

After reviewing the regions from Figure 2.3, we can relate the capacity of generating sparse solutions to the quantity and the “sharpness” of the vertexes of the greyed out area. For example, a L_{1/3} penalty has a support region with sharper vertexes that would induce a sparse solution even more strongly than a L1 penalty; however, the non-convex shape of the L_{1/3} ball results in difficulties during optimization that will not happen with a convex shape.


To summarize, a convex problem with a sparse solution is desired. But with pure penalties, sparsity is only possible with Lp norms with p ≤ 1, due to the fact that they are the only ones that have vertexes. On the other side, only norms with p ≥ 1 are convex; hence the only pure penalty that builds a convex problem with a sparse solution is the L1 penalty.

L0 Penalties. The L0 pseudo-norm of a vector β is defined as the number of entries different from zero, that is, P(β) = ‖β‖_0 = card{β_j | β_j ≠ 0}:

$$\min_{\beta} \; J(\beta) \quad \text{s.t.} \quad \|\beta\|_0 \leq t \qquad (2.4)$$

where the parameter t represents the maximum number of non-zero coefficients in vector β. The larger the value of t (or the lower the value of λ if we use the equivalent expression (2.1)), the fewer the number of zeros induced in vector β. If t is equal to the dimensionality of the problem (or if λ = 0), then the penalty term is not effective and β is not altered. In general, the computation of the solutions relies on combinatorial optimization schemes. Their solutions are sparse but unstable.

L1 Penalties. The penalties built using L1 norms induce sparsity and stability. The resulting estimator has been named the Lasso (Least Absolute Shrinkage and Selection Operator) by Tibshirani (1996):

$$\min_{\beta} \; J(\beta) \quad \text{s.t.} \quad \sum_{j=1}^{p} |\beta_j| \leq t \qquad (2.5)$$

Despite all the advantages of the Lasso, the choice of the right penalty is not just a question of convexity and sparsity. For example, concerning the Lasso, Osborne et al. (2000a) have shown that when the number of examples n is lower than the number of variables p, then the maximum number of non-zero entries of β is n. Therefore, if there is a strong correlation between several variables, this penalty risks dismissing all but one, resulting in a hardly interpretable model. In a field like genomics, where n is typically some tens of individuals and p several thousands of genes, the performance of the algorithm and the interpretability of the genetic relationships are severely limited.

The Lasso is a popular tool that has been used in multiple contexts besides regression, particularly in the field of feature selection in supervised classification (Mai et al., 2012; Witten and Tibshirani, 2011) and clustering (Roth and Lange, 2004; Pan et al., 2006; Pan and Shen, 2007; Zhou et al., 2009; Guo et al., 2010; Witten and Tibshirani, 2010; Bouveyron and Brunet, 2012b,a).
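To make the behavior of the L1 penalty concrete, the following minimal sketch fits a Lasso on synthetic data with more features than samples; it assumes scikit-learn and NumPy are available, and the data, penalty value and variable names are purely illustrative (they are not taken from the thesis experiments).

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.RandomState(0)
    n, p = 50, 200                        # fewer samples than features
    X = rng.randn(n, p)
    beta_true = np.zeros(p)
    beta_true[:5] = 2.0                   # only 5 informative variables
    y = X @ beta_true + 0.1 * rng.randn(n)

    model = Lasso(alpha=0.1).fit(X, y)    # alpha plays the role of the penalty parameter
    print("selected variables:", np.flatnonzero(model.coef_))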

The consistency of problems regularized by a Lasso penalty is also a key feature. Defining consistency as the capability of always making the right choice of relevant variables when the number of individuals is infinitely large, Leng et al. (2006) have shown that when the penalty parameter (t or λ, depending on the formulation) is chosen by


minimization of the prediction error, the Lasso penalty does not lead to consistent models. There is a large bibliography defining conditions where Lasso estimators become consistent (Knight and Fu, 2000; Donoho et al., 2006; Meinshausen and Bühlmann, 2006; Zhao and Yu, 2007; Bach, 2008). In addition to those papers, some authors have introduced modifications to improve the interpretability and the consistency of the Lasso, such as the adaptive Lasso (Zou, 2006).

L2 Penalties. The graphical interpretation of pure norm penalties in Figure 2.3 shows that this norm does not induce sparsity, due to its lack of vertexes. Strictly speaking, the L2 norm involves the square root of the sum of all squared components. In practice, when using L2 penalties, the square of the norm is used to avoid the square root and to solve a linear system. Thus, a L2 penalized optimization problem looks like

$$\min_{\beta} \; J(\beta) + \lambda \|\beta\|_2^2 \qquad (2.6)$$

The effect of this penalty is the “equalization” of the components of the parameter that is being penalized. To enlighten this property, let us consider a least squares problem

$$\min_{\beta} \; \sum_{i=1}^{n} (y_i - \mathbf{x}_i^\top \beta)^2 \qquad (2.7)$$

with solution β^ls = (X^⊤X)^{-1} X^⊤y. If some input variables are highly correlated, the estimator β^ls is very unstable. To fix this numerical instability, Hoerl and Kennard (1970) proposed ridge regression, which regularizes Problem (2.7) with a quadratic penalty:

$$\min_{\beta} \; \sum_{i=1}^{n} (y_i - \mathbf{x}_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$$

The solution to this problem is β^ℓ2 = (X^⊤X + λI_p)^{-1} X^⊤y. All eigenvalues, in particular the small ones corresponding to the correlated dimensions, are now moved upwards by λ. This can be enough to avoid the instability induced by small eigenvalues. This “equalization” of the coefficients reduces the variability of the estimation, which may improve performance.
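As an illustration of this closed-form solution, here is a small NumPy sketch (the helper name and the toy data are hypothetical, for illustration only) that computes β^ℓ2 and prints the smallest eigenvalues of X^⊤X that λ shifts upwards.

    import numpy as np

    def ridge_solution(X, y, lam):
        """Closed-form ridge estimate: (X^T X + lam I)^{-1} X^T y."""
        p = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

    rng = np.random.RandomState(0)
    X = rng.randn(30, 5)
    X[:, 4] = X[:, 3] + 1e-3 * rng.randn(30)   # two nearly collinear columns
    y = rng.randn(30)
    print(ridge_solution(X, y, lam=1.0))
    print(np.linalg.eigvalsh(X.T @ X)[:2])     # small eigenvalues moved upwards by lambda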

As with the Lasso operator, there are several variations of ridge regression. For example, Breiman (1995) proposed the nonnegative garrotte, which looks like a ridge regression where each variable is penalized adaptively. To do that, the least squares solution is used to define the penalty parameter attached to each coefficient:

$$\min_{\beta} \; \sum_{i=1}^{n} (y_i - \mathbf{x}_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} \frac{\beta_j^2}{(\hat{\beta}_j^{\,ls})^2} \qquad (2.8)$$

The effect is an elliptic admissible set instead of the ball of ridge regression. Another example is the adaptive ridge regression (Grandvalet, 1998; Grandvalet and Canu, 2002),


where the penalty parameter differs on each component. There, every λ_j is optimized to penalize more or less depending on the influence of β_j in the model.

Although L2 penalized problems are stable, they are not sparse. That makes those models harder to interpret, mainly in high dimensions.

L∞ Penalties. A special case of Lp norms is the infinity norm, defined as ‖x‖_∞ = max(|x_1|, |x_2|, . . . , |x_p|). The admissible region for a penalty like ‖β‖_∞ ≤ t is displayed in Figure 2.3. For the L∞ norm, the greyed out region is a square containing all the β vectors whose largest coefficient is less than or equal to the value of the penalty parameter t.

This norm is not commonly used as a regularization term itself; however, it frequently appears combined in mixed penalties, as shown in Section 2.3.4. In addition, in the optimization of penalized problems there exists the concept of dual norms. Dual norms arise in the analysis of estimation bounds and in the design of algorithms that address optimization problems by solving an increasing sequence of small subproblems (working set algorithms). The dual norm plays a direct role in computing optimality conditions of sparse regularized problems. The dual norm ‖β‖_* of a norm ‖β‖ is defined as

$$\|\beta\|_* = \max_{\mathbf{w} \in \mathbb{R}^p} \; \beta^\top \mathbf{w} \quad \text{s.t.} \quad \|\mathbf{w}\| \leq 1 .$$

In the case of an Lq norm with q ∈ [1, +∞], the dual norm is the Lr norm such that 1/q + 1/r = 1. For example, the L2 norm is self-dual, and the dual norm of the L1 norm is the L∞ norm. This is one of the reasons why L∞ is so important even if it is not as popular as a penalty itself, because L1 is. An extensive explanation about dual norms and the algorithms that make use of them can be found in Bach et al. (2011).

2.3.3 Hybrid Penalties

There is no reason for using pure penalties in isolation. We can combine them and try to obtain different benefits from each of them. The most popular example is the Elastic net regularization (Zou and Hastie, 2005), with the objective of improving the Lasso penalization when n ≤ p. As recalled in Section 2.3.2, when n ≤ p the Lasso penalty can select at most n non-null features. Thus, in situations where there are more relevant variables, the Lasso penalty risks selecting only some of them. To avoid this effect, a combination of L1 and L2 penalties has been proposed. For the least squares example (2.7) from Section 2.3.2, the Elastic net is

$$\min_{\beta} \; \sum_{i=1}^{n} (y_i - \mathbf{x}_i^\top \beta)^2 + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^2 \qquad (2.9)$$

The term in λ_1 is a Lasso penalty that induces sparsity in vector β; on the other side, the term in λ_2 is a ridge regression penalty that provides universal strong consistency (De Mol et al., 2009), that is, the asymptotic capability (when n goes to infinity) of always making the right choice of relevant variables.
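A short sketch of (2.9) with scikit-learn is given below; note, as a caveat, that scikit-learn's ElasticNet parametrizes the penalty with an overall strength alpha and a mixing ratio l1_ratio rather than the separate λ_1 and λ_2 of (2.9), and the data here is purely synthetic.

    import numpy as np
    from sklearn.linear_model import ElasticNet

    rng = np.random.RandomState(0)
    X, y = rng.randn(40, 100), rng.randn(40)

    # alpha controls the overall penalty strength, l1_ratio the L1/L2 mix
    enet = ElasticNet(alpha=0.5, l1_ratio=0.7).fit(X, y)
    print("non-zero coefficients:", np.count_nonzero(enet.coef_))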


2.3.4 Mixed Penalties

Imagine a linear regression problem where each variable is a gene. Depending on the application, several biological processes can be identified by L different groups of genes. Let us denote by G_ℓ the group of genes for the ℓth process and by d_ℓ the number of genes (variables) in each group, for all ℓ ∈ {1, . . . , L}. Thus, the dimension of vector β is the sum of the number of genes of every group, dim(β) = Σ_{ℓ=1}^{L} d_ℓ. Mixed norms are a type of norms that take those groups into consideration. The general expression is shown below:

$$\|\beta\|_{(r,s)} = \left( \sum_{\ell} \Big( \sum_{j \in G_\ell} |\beta_j|^s \Big)^{\frac{r}{s}} \right)^{\frac{1}{r}} \qquad (2.10)$$

The pair (r, s) identifies the norms that are combined: a Ls norm within groups and a Lr norm between groups. The Ls norm penalizes the variables in every group G_ℓ, while the Lr norm penalizes the within-group norms. The pair (r, s) is set so as to induce different properties in the resulting β vector. Note that the outer norm is often weighted to adjust for the different cardinalities of the groups, in order to avoid favoring the selection of the largest groups.

Several combinations are available; the most popular is the norm ‖β‖_{(1,2)}, known as the group-Lasso (Yuan and Lin, 2006; Leng, 2008; Xie et al., 2008a,b; Meier et al., 2008; Roth and Fischer, 2008; Yang et al., 2010; Sanchez Merchante et al., 2012). Figure 2.5 shows the difference between the admissible sets of a pure L1 norm and a mixed L12 norm. Many other mixings are possible, such as ‖β‖_{(1,4/3)} (Szafranski et al., 2008) or ‖β‖_{(1,∞)} (Wang and Zhu, 2008; Kuan et al., 2010; Vogt and Roth, 2010). Modifications of mixed norms have also been proposed, such as the group bridge penalty (Huang et al., 2009), the composite absolute penalties (Zhao et al., 2009), or combinations of mixed and pure norms, such as Lasso and group-Lasso (Friedman et al., 2010; Sprechmann et al., 2010) or group-Lasso and ridge penalty (Ng and Abugharbieh, 2011).
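As a concrete reading of (2.10) for the group-Lasso case (r, s) = (1, 2), the following small NumPy function (an illustrative helper, not part of the thesis code) computes the L2 norm within each group and sums them across groups.

    import numpy as np

    def mixed_norm_1_2(beta, groups):
        """||beta||_(1,2): L2 norm within each group, L1 norm across groups."""
        return sum(np.linalg.norm(beta[idx]) for idx in groups)

    beta = np.array([0.0, 0.0, 1.0, -2.0, 2.0])
    groups = [np.array([0, 1]), np.array([2, 3, 4])]   # two groups of coefficients
    print(mixed_norm_1_2(beta, groups))                # 0 + 3 = 3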

2.3.5 Sparsity Considerations

In this chapter I have reviewed several possibilities that induce sparsity in the solution of optimization problems. However, having sparse solutions does not always lead to parsimonious models featurewise. For example, if we have four parameters per feature, we look for solutions where all four parameters are null for non-informative variables.

The Lasso and the other L1 penalties encourage solutions such as the one on the left of Figure 2.6. If the objective is sparsity, then the L1 norm does the job. However, if we aim at feature selection and if the number of parameters per variable exceeds one, this type of sparsity does not target the removal of variables.

To be able to dismiss some features, the sparsity pattern must encourage null values for the same variable across parameters, as shown on the right of Figure 2.6. This can be achieved with mixed penalties that define groups of features. For example, L12 or L1∞ mixed norms with the proper definition of groups can induce sparsity patterns such as


Figure 2.5: Admissible sets for the Lasso and group-Lasso. (a) L1, Lasso; (b) L(1,2), group-Lasso.

Figure 2.6: Sparsity patterns for an example with 8 variables characterized by 4 parameters. (a) L1-induced sparsity; (b) L(1,2) group-induced sparsity.


the one on the right of Figure 2.6, which displays a solution where variables 3, 5 and 8 are removed.

2.3.6 Optimization Tools for Regularized Problems

Caramanis et al. (2012) provide a good collection of mathematical techniques and optimization methods to solve regularized problems. Another good reference is the thesis of Szafranski (2008), which also reviews some techniques classified into four categories. Those techniques, even if they belong to different categories, can be used separately or combined to produce improved optimization algorithms.

In fact, the algorithm implemented in this thesis is inspired by three of those techniques. It could be defined as an algorithm of “active constraints”, implemented following a regularization path that is updated by approaching the cost function with secant hyper-planes. Deeper details are given in the dedicated Chapter 5.

Subgradient Descent. Subgradient descent is a generic optimization method that can be used in penalized settings where the subgradient of the loss function, ∂J(β), and the subgradient of the regularizer, ∂P(β), can be computed efficiently. On the one hand, it is essentially blind to the problem structure. On the other hand, many iterations are needed, so convergence is slow and the solutions are not sparse. Basically, it is a generalization of the iterative gradient descent algorithm, where the solution vector β^(t+1) is updated proportionally to the negative subgradient of the function at the current point β^(t):

$$\beta^{(t+1)} = \beta^{(t)} - \alpha \, (\mathbf{s} + \lambda \mathbf{s}'), \quad \text{where } \mathbf{s} \in \partial J(\beta^{(t)}), \; \mathbf{s}' \in \partial P(\beta^{(t)})$$
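A minimal sketch of this update for the Lasso case, assuming NumPy and a quadratic loss (function name, step size and iteration count are illustrative choices, not prescribed by the thesis):

    import numpy as np

    def lasso_subgradient_descent(X, y, lam, n_iter=5000):
        """Plain subgradient descent on 0.5*||y - X b||^2 + lam*||b||_1 (slow, not sparse)."""
        beta = np.zeros(X.shape[1])
        step = 1.0 / np.linalg.norm(X, 2) ** 2       # conservative constant step size
        for _ in range(n_iter):
            s = X.T @ (X @ beta - y)                 # gradient of the smooth part
            s_prime = np.sign(beta)                  # a subgradient of the L1 norm
            beta = beta - step * (s + lam * s_prime)
        return beta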

Coordinate Descent. Coordinate descent is based on the first order optimality conditions of criterion (2.1). In the case of penalties like the Lasso, setting to zero the first order derivative with respect to coefficient β_j gives

$$\beta_j = \frac{-\lambda \, \mathrm{sign}(\beta_j) - \partial J(\beta)/\partial \beta_j}{2 \sum_{i=1}^{n} x_{ij}^2} .$$

In the literature, those algorithms may also be referred to as “iterative thresholding” algorithms, because the optimization can be solved by soft-thresholding in an iterative process. As an example, Fu (1998) implements this technique, initializing every coefficient with the least squares solution β^ls and updating the values using an iterative thresholding algorithm where β_j^{(t+1)} = S_λ(∂J(β^{(t)})/∂β_j). The objective function is optimized with respect


to one variable at a time, while all others are kept fixed:

$$S_\lambda\!\left(\frac{\partial J(\beta)}{\partial \beta_j}\right) = \begin{cases} \dfrac{\lambda - \partial J(\beta)/\partial \beta_j}{2\sum_{i=1}^{n} x_{ij}^2} & \text{if } \dfrac{\partial J(\beta)}{\partial \beta_j} > \lambda \\[2ex] \dfrac{-\lambda - \partial J(\beta)/\partial \beta_j}{2\sum_{i=1}^{n} x_{ij}^2} & \text{if } \dfrac{\partial J(\beta)}{\partial \beta_j} < -\lambda \\[2ex] 0 & \text{if } \left|\dfrac{\partial J(\beta)}{\partial \beta_j}\right| \leq \lambda \end{cases} \qquad (2.11)$$

The same principles define “block-coordinate descent” algorithms. In this case, the first order derivatives are applied to the equations of a group-Lasso penalty (Yuan and Lin, 2006; Wu and Lange, 2008).
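The soft-thresholding update can be sketched as follows for a quadratic loss with a Lasso penalty; this is a generic illustration assuming NumPy, not the thesis implementation, and the helper names are hypothetical.

    import numpy as np

    def soft_threshold(z, lam):
        return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

    def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
        """Cyclic coordinate descent for 0.5*||y - X b||^2 + lam*||b||_1."""
        n, p = X.shape
        beta = np.zeros(p)
        col_sq = (X ** 2).sum(axis=0)
        for _ in range(n_sweeps):
            for j in range(p):
                r_j = y - X @ beta + X[:, j] * beta[j]          # partial residual without x_j
                beta[j] = soft_threshold(X[:, j] @ r_j, lam) / col_sq[j]
        return beta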

Active and Inactive Sets. Active set algorithms are also referred to as “active constraints” or “working set” methods. These algorithms define a subset of variables called the “active set”, which stores the indices of variables with non-zero β_j and is usually denoted A. The complement of the active set is the “inactive set”, noted Ā, which contains the indices of the variables whose β_j is zero. Thus, the problem can be simplified to the dimensionality of A.

Osborne et al. (2000a) proposed the first of those algorithms to solve quadratic problems with Lasso penalties. Their algorithm starts from an empty active set that is updated incrementally (forward growing). There also exists a backward view, where relevant variables are allowed to leave the active set; however, the forward philosophy that starts with an empty A has the advantage that the first calculations are low dimensional. In addition, the forward view fits better the feature selection intuition, where few features are intended to be selected.

Working set algorithms have to deal with three main tasks. There is an optimization task, where a minimization problem has to be solved using only the variables from the active set; Osborne et al. (2000a) solve a linear approximation of the original problem to determine the objective function descent direction, but any other method can be considered. In general, as the solutions of successive active sets are typically close to each other, it is a good idea to use the solution of the previous iteration as the initialization of the current one (warm start). Besides the optimization task, there is a working set update task, where the active set A is augmented with the variable from the inactive set Ā that most violates the optimality conditions of Problem (2.1). Finally, there is also a task to compute the optimality conditions; their expressions are essential for selecting the next variable to add to the active set and for testing whether a particular vector β is a solution of Problem (2.1).

These active constraints or working set methods, even if they were originally proposed to solve L1 regularized quadratic problems, can also be adapted to generic functions and penalties, for example linear functions and L1 penalties (Roth, 2004), linear functions


and L12 penalties (Roth and Fischer, 2008), or even logarithmic cost functions and combinations of L0, L1 and L2 penalties (Perkins et al., 2003). The algorithm developed in this work belongs to this family of solutions.
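The generic forward working-set loop described above can be sketched as follows; worst_violation and solve_restricted are hypothetical callbacks standing for the optimality-check and restricted-optimization tasks, and this sketch is not the GLOSS algorithm itself (which is detailed in Chapter 5).

    def working_set_loop(worst_violation, solve_restricted, tol=1e-6):
        """Forward active-set scheme (sketch): grow the active set with the worst
        violator of the optimality conditions, re-solve, stop when none remains."""
        active, beta = [], None
        while True:
            j, violation = worst_violation(beta, active)      # optimality check on the inactive set
            if violation <= tol:
                return active, beta                           # beta is optimal for the full problem
            active.append(j)                                  # working set update task
            beta = solve_restricted(active, warm_start=beta)  # optimization task on the active set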

Hyper-Planes Approximation. Hyper-plane approximations solve a regularized problem using a piecewise linear approximation of the original cost function. This convex approximation is built using several secant hyper-planes at different points, obtained from the sub-gradient of the cost function at these points.

This family of algorithms implements an iterative mechanism where the number of hyper-planes increases at every iteration. These techniques are useful with large populations, since the number of iterations needed to converge does not depend on the size of the dataset. On the contrary, if few hyper-planes are used, then the quality of the approximation is not good enough and the solution can be unstable.

This family of algorithms is not as popular as the previous one, but some examples can be found in the domain of Support Vector Machines (Joachims, 2006; Smola et al., 2008; Franc and Sonnenburg, 2008) or Multiple Kernel Learning (Sonnenburg et al., 2006).

Regularization Path. The regularization path is the set of solutions that can be reached when solving a series of optimization problems of the form (2.1), where the penalty parameter λ is varied. It is not an optimization technique per se, but it is of practical use when the exact regularization path can be easily followed. Rosset and Zhu (2007) stated that this path is piecewise linear for those problems where the cost function is piecewise quadratic and the regularization term is piecewise linear (or vice-versa).

This concept was first applied to the Lasso algorithm of Osborne et al. (2000b). However, it was after the publication of the algorithm called Least Angle Regression (LARS), developed by Efron et al. (2004), that those techniques became popular. LARS defines the regularization path using active constraint techniques.

Once an active set A^(t) and its corresponding solution β^(t) have been set, looking for the regularization path means looking for a direction h and a step size γ to update the solution as β^(t+1) = β^(t) + γh. Afterwards, the active and inactive sets A^(t+1) and Ā^(t+1) are updated. That can be done by looking for the variables that most strongly violate the optimality conditions. Hence, LARS sets the update step size, and which variable should enter the active set, from the correlation with the residuals.
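For illustration, scikit-learn exposes the LARS computation of the piecewise-linear Lasso path; the sketch below assumes scikit-learn is available and uses synthetic data, simply to show the breakpoints and the order in which variables enter the active set.

    import numpy as np
    from sklearn.linear_model import lars_path

    rng = np.random.RandomState(0)
    X, y = rng.randn(60, 20), rng.randn(60)

    # piecewise-linear Lasso regularization path computed with the LARS strategy
    alphas, active, coefs = lars_path(X, y, method="lasso")
    print("order in which variables enter the active set:", active)
    print("number of breakpoints along the path:", len(alphas))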

Proximal Methods. Proximal methods optimize an objective function of the form (2.1), resulting from the addition of a Lipschitz differentiable cost function J(β) and a non-differentiable penalty λP(β):

$$\min_{\beta \in \mathbb{R}^p} \; J(\beta^{(t)}) + \nabla J(\beta^{(t)})^\top (\beta - \beta^{(t)}) + \lambda P(\beta) + \frac{L}{2} \left\| \beta - \beta^{(t)} \right\|_2^2 \qquad (2.12)$$

They are also iterative methods, where the cost function J(β) is linearized in the proximity of the solution β, so that the problem to solve at each iteration looks like


(2.12), where the parameter L > 0 should be an upper bound on the Lipschitz constant of the gradient ∇J. That can be rewritten as

$$\min_{\beta \in \mathbb{R}^p} \; \frac{1}{2} \left\| \beta - \Big( \beta^{(t)} - \frac{1}{L} \nabla J(\beta^{(t)}) \Big) \right\|_2^2 + \frac{\lambda}{L} P(\beta) \qquad (2.13)$$

The basic algorithm makes use of the solution to (2.13) as the next value β^(t+1). However, there are faster versions that take advantage of information about previous steps, such as the ones described by Nesterov (2007) or the FISTA algorithm (Beck and Teboulle, 2009). Proximal methods can be seen as generalizations of gradient updates: in fact, setting λ = 0 in equation (2.13), the standard gradient update rule comes up.
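For the Lasso penalty, the proximal operator of (2.13) is soft-thresholding, which gives the classical ISTA scheme; the sketch below is a generic NumPy illustration under that assumption, not code from this thesis.

    import numpy as np

    def ista(X, y, lam, n_iter=500):
        """Proximal gradient (ISTA) for 0.5*||y - X b||^2 + lam*||b||_1."""
        L = np.linalg.norm(X, 2) ** 2               # Lipschitz constant of the gradient
        beta = np.zeros(X.shape[1])
        for _ in range(n_iter):
            grad = X.T @ (X @ beta - y)
            z = beta - grad / L                     # gradient step on the smooth part
            beta = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)   # proximal (soft-thresholding) step
        return beta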


Part II

Sparse Linear Discriminant Analysis


Abstract

Linear discriminant analysis (LDA) aims to describe data by a linear combination of features that best separates the classes. It may be used for classifying future observations or for describing those classes.

There is a vast bibliography about sparse LDA methods, reviewed in Chapter 3. Sparsity is typically induced by regularizing the discriminant vectors or the class means with L1 penalties (see Chapter 2). Section 2.3.5 discussed why this sparsity-inducing penalty may not guarantee parsimonious models regarding variables.

In this part we develop the group-Lasso Optimal Scoring Solver (GLOSS), which addresses a sparse LDA problem globally through a regression approach to LDA. Our analysis, presented in Chapter 4, formally relates GLOSS to Fisher's discriminant analysis and also enables the derivation of variants, such as LDA assuming a diagonal within-class covariance structure (Bickel and Levina, 2004). The group-Lasso penalty selects the same features in all discriminant directions, leading to a more interpretable low-dimensional representation of data. The discriminant directions can be used in their totality, or the first ones may be chosen to produce a reduced rank classification. The first two or three directions can also be used to project the data and generate a graphical display of the data. The algorithm is detailed in Chapter 5, and our experimental results of Chapter 6 demonstrate that, compared to the competing approaches, the models are extremely parsimonious without compromising prediction performance. The algorithm efficiently processes medium to large numbers of variables and is thus particularly well suited to the analysis of gene expression data.


3 Feature Selection in Fisher Discriminant Analysis

3.1 Fisher Discriminant Analysis

Linear discriminant analysis (LDA) aims to describe n labeled observations belonging to K groups by a linear combination of features which characterizes or separates the classes. It is used for two main purposes: classifying future observations, or describing the essential differences between classes, either by providing a visual representation of data or by revealing the combinations of features that discriminate between classes. There are several frameworks in which linear combinations can be derived; Friedman et al. (2009) dedicate a whole chapter to linear methods for classification. In this part we focus on Fisher's discriminant analysis, which is a standard tool for linear discriminant analysis whose formulation does not rely on posterior probabilities but rather on some inertia principles (Fisher, 1936).

We consider that the data consist of a set of n examples, with observations x_i ∈ R^p comprising p features, and labels y_i ∈ {0, 1}^K indicating the exclusive assignment of observation x_i to one of the K classes. It will be convenient to gather the observations in the n × p matrix X = (x_1^⊤, . . . , x_n^⊤)^⊤ and the corresponding labels in the n × K matrix Y = (y_1^⊤, . . . , y_n^⊤)^⊤.

Fisher's discriminant problem was first proposed for two-class problems, for the analysis of the famous iris dataset, as the maximization of the ratio of the projected between-class covariance to the projected within-class covariance:

$$\max_{\beta \in \mathbb{R}^p} \; \frac{\beta^\top \Sigma_B \beta}{\beta^\top \Sigma_W \beta} \qquad (3.1)$$

where β is the discriminant direction used to project the data, and Σ_B and Σ_W are the p × p between-class and within-class covariance matrices, respectively defined (for a K-class problem) as

$$\Sigma_W = \frac{1}{n} \sum_{k=1}^{K} \sum_{i \in G_k} (\mathbf{x}_i - \mu_k)(\mathbf{x}_i - \mu_k)^\top$$

$$\Sigma_B = \frac{1}{n} \sum_{k=1}^{K} \sum_{i \in G_k} (\mu - \mu_k)(\mu - \mu_k)^\top$$

where μ is the sample mean of the whole dataset, μ_k the sample mean of class k, and G_k indexes the observations of class k.


This analysis can be extended to the multi-class framework with K groups. In this case, K − 1 discriminant vectors β_k may be computed. Such a generalization was first proposed by Rao (1948). Several formulations of the multi-class Fisher's discriminant are available, for example as the maximization of a trace ratio:

$$\max_{\mathbf{B} \in \mathbb{R}^{p \times (K-1)}} \; \frac{\mathrm{tr}\left(\mathbf{B}^\top \Sigma_B \mathbf{B}\right)}{\mathrm{tr}\left(\mathbf{B}^\top \Sigma_W \mathbf{B}\right)} \qquad (3.2)$$

where the matrix B is built with the discriminant directions β_k as columns. Solving the multi-class criterion (3.2) is an ill-posed problem; a better formulation is based on a series of K − 1 subproblems:

$$\begin{aligned} \max_{\beta_k \in \mathbb{R}^p} \quad & \beta_k^\top \Sigma_B \beta_k \\ \text{s.t.} \quad & \beta_k^\top \Sigma_W \beta_k \leq 1 \\ & \beta_k^\top \Sigma_W \beta_\ell = 0, \quad \forall \ell < k \end{aligned} \qquad (3.3)$$

The maximizer of subproblem k is the eigenvector of Σ_W^{-1} Σ_B associated with the kth largest eigenvalue (see Appendix C).
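This eigenvalue characterization can be illustrated with a short NumPy sketch that builds the sample covariance matrices and extracts the leading eigenvectors of Σ_W^{-1} Σ_B; it assumes integer class labels in {0, . . . , K−1} and an invertible Σ_W, and is only an illustration of the unpenalized criterion, not of the sparse methods discussed below.

    import numpy as np

    def fisher_directions(X, y, K):
        """Discriminant directions as leading eigenvectors of Sigma_W^{-1} Sigma_B."""
        n, p = X.shape
        mu = X.mean(axis=0)
        S_W, S_B = np.zeros((p, p)), np.zeros((p, p))
        for k in range(K):
            Xk = X[y == k]
            mu_k = Xk.mean(axis=0)
            S_W += (Xk - mu_k).T @ (Xk - mu_k)
            S_B += len(Xk) * np.outer(mu_k - mu, mu_k - mu)
        S_W, S_B = S_W / n, S_B / n
        evals, evecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
        order = np.argsort(evals.real)[::-1]
        return evecs.real[:, order[:K - 1]]          # the K-1 discriminant directions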

3.2 Feature Selection in LDA Problems

LDA is often used as a data reduction technique, where the K − 1 discriminant directions summarize the p original variables. However, all variables intervene in the definition of these discriminant directions, and this behavior may be troublesome.

Several modifications of LDA have been proposed to generate sparse discriminant directions. Sparse LDA reveals discriminant directions that only involve a few variables. This sparsity mainly targets the reduction of the dimensionality of the problem (as in genetic analysis), but parsimonious classification is also motivated by the need for interpretable models, robustness in the solution, or computational constraints.

The easiest approach to sparse LDA performs variable selection before discrimination. The relevancy of each feature is usually based on univariate statistics, which are fast and convenient to compute, but whose very partial view of the overall classification problem may lead to dramatic information loss. As a result, several approaches have been devised in recent years to construct LDA with wrapper and embedded feature selection capabilities.

They can be categorized according to the LDA formulation that provides the basis for the sparsity-inducing extension, that is, either Fisher's discriminant analysis (variance-based) or regression-based.

3.2.1 Inertia Based

The Fisher discriminant seeks a projection maximizing the separability of classes from inertia principles: mass centers should be far away (large between-class variance), and


classes should be concentrated around their mass centers (small within-class variance). This view motivates a first series of sparse LDA formulations.

Moghaddam et al. (2006) propose an algorithm for sparse LDA in binary classification, where sparsity originates in a hard cardinality constraint. The formalization is based on the Fisher's discriminant (3.1), reformulated as a quadratically-constrained quadratic program (3.3). Computationally, the algorithm implements a combinatorial search with some eigenvalue properties that are used to avoid exploring subsets of possible solutions. Extensions of this approach have been developed, with new sparsity bounds for the two-class discrimination problem and shortcuts to speed up the evaluation of eigenvalues (Moghaddam et al., 2007).

Also for binary problems, Wu et al. (2009) proposed a sparse LDA applied to gene expression data, where the Fisher's discriminant (3.1) is solved as

$$\begin{aligned} \min_{\beta \in \mathbb{R}^p} \quad & \beta^\top \Sigma_W \beta \\ \text{s.t.} \quad & (\mu_1 - \mu_2)^\top \beta = 1 \\ & \textstyle\sum_{j=1}^{p} |\beta_j| \leq t \end{aligned}$$

where μ_1 and μ_2 are the vectors of mean gene expression values corresponding to the two groups. The expression to optimize and the first constraint match problem (3.1); the second constraint encourages parsimony.

Witten and Tibshirani (2011) describe a multi-class technique using the Fisher's discriminant, rewritten in the form of K − 1 constrained and penalized maximization problems:

$$\begin{aligned} \max_{\beta_k \in \mathbb{R}^p} \quad & \beta_k^\top \Sigma_B^k \beta_k - P_k(\beta_k) \\ \text{s.t.} \quad & \beta_k^\top \Sigma_W \beta_k \leq 1 \end{aligned}$$

The term to maximize is the projected between-class covariance matrix β_k^⊤ Σ_B β_k, subject to an upper bound on the projected within-class covariance matrix β_k^⊤ Σ_W β_k. The penalty P_k(β_k) is added to avoid singularities and induce sparsity. The authors suggest weighted versions of the regular Lasso and fused Lasso penalties for general purpose data: the Lasso shrinks less informative variables to zero, and the fused Lasso encourages a piecewise constant β_k vector. The R code is available from the website of Daniela Witten.

Cai and Liu (2011) use the Fisher's discriminant to solve a binary LDA problem. But instead of performing separate estimations of Σ_W and (μ_1 − μ_2) to obtain the optimal solution β = Σ_W^{-1}(μ_1 − μ_2), they estimate the product directly through constrained L1 minimization:

$$\begin{aligned} \min_{\beta \in \mathbb{R}^p} \quad & \|\beta\|_1 \\ \text{s.t.} \quad & \left\| \hat{\Sigma} \beta - (\hat{\mu}_1 - \hat{\mu}_2) \right\|_\infty \leq \lambda \end{aligned}$$

Sparsity is encouraged by the L1 norm of vector β, and the parameter λ is used to tune the optimization.


Most of the algorithms reviewed are conceived for binary classification. And for those that are envisaged for multi-class scenarios, the Lasso is the most popular way to induce sparsity; however, as discussed in Section 2.3.5, the Lasso is not the best tool to encourage parsimonious models when there are multiple discriminant directions.

3.2.2 Regression Based

In binary classification, LDA has been known to be equivalent to linear regression of scaled class labels since Fisher (1936). For K > 2, many studies show that multivariate linear regression of a specific class indicator matrix can be applied as a preprocessing step for LDA. However, directly casting LDA as a least squares problem is challenging for the multi-class case (Duda et al., 2000; Friedman et al., 2009).

Predefined Indicator Matrix

Multi-class classification is usually linked with linear regression through the definition of an indicator matrix (Friedman et al., 2009). An indicator matrix Y is an n × K matrix with the class labels for all samples. There are several well-known types in the literature. For example, the binary or dummy indicator (y_ik = 1 if sample i belongs to class k, and y_ik = 0 otherwise) is commonly used in linking multi-class classification with linear regression (Friedman et al., 2009). Another popular choice is y_ik = 1 if sample i belongs to class k and y_ik = −1/(K − 1) otherwise; it was used, for example, in extending Support Vector Machines to multi-class classification (Lee et al., 2004) or in generalizing the kernel target alignment measure (Guermeur et al., 2004).

There are some efforts that propose a formulation of the least squares problem based on a new class indicator matrix (Ye, 2007). This new indicator matrix allows the definition of LS-LDA (Least Squares Linear Discriminant Analysis), which holds a rigorous equivalence with multi-class LDA under a mild condition, shown empirically to hold in many applications involving high-dimensional data.

Qiao et al. (2009) propose a discriminant analysis for the high-dimensional, low-sample setting which incorporates variable selection in a Fisher's LDA formulated as a generalized eigenvalue problem, which is then recast as a least squares regression. Sparsity is obtained by means of a Lasso penalty on the discriminant vectors. Even if this is not mentioned in the article, their formulation looks very close in spirit to Optimal Scoring regression. Some rather clumsy steps in the developments hinder the comparison, so that further investigations are required. The lack of publicly available code also restrained an empirical test of this conjecture. If the similitude is confirmed, their formalization would be very close to the one of Clemmensen et al. (2011), reviewed in the following section.

In a recent paper, Mai et al. (2012) take advantage of the equivalence between ordinary least squares and LDA problems to propose a binary classifier solving a penalized least squares problem with a Lasso penalty. The sparse version of the projection vector β is


obtained by solving

$$\min_{\beta \in \mathbb{R}^p, \, \beta_0 \in \mathbb{R}} \; n^{-1} \sum_{i=1}^{n} (y_i - \beta_0 - \mathbf{x}_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$$

where y_i is the binary indicator of the label of pattern x_i. Even if the authors focus on the Lasso penalty, they also suggest any other generic sparsity-inducing penalty. The decision rule x^⊤β̂ + β̂_0 > 0 is the LDA classifier when it is built using the resulting β̂ vector for λ = 0, but a different intercept β̂_0 is required.

Optimal Scoring

In binary classification, the regression of (scaled) class indicators enables one to recover exactly the LDA discriminant direction. For more than two classes, regressing predefined indicator matrices may be impaired by the masking effect, where the scores assigned to a class situated between two other ones never dominate (Hastie et al., 1994). Optimal scoring (OS) circumvents the problem by assigning “optimal scores” to the classes. This route was opened by Fisher (1936) for binary classification and pursued for more than two classes by Breiman and Ihaka (1984), with the aim of developing a non-linear extension of discriminant analysis based on additive models. They named their approach optimal scaling, for it optimizes the scaling of the indicators of classes together with the discriminant functions. Their approach was later disseminated under the name optimal scoring by Hastie et al. (1994), who proposed several extensions of LDA, either aiming at constructing more flexible discriminants (Hastie and Tibshirani, 1996) or more conservative ones (Hastie et al., 1995).

As an alternative method to solve LDA problems, Hastie et al. (1995) proposed to incorporate a smoothness prior on the discriminant directions in the OS problem, through a positive-definite penalty matrix Ω, leading to a problem expressed in compact form as

minΘ BYΘminusXB2F + λ tr

(BgtΩB

)(34a)

s t nminus1 ΘgtYgtYΘ = IKminus1 (34b)

where Θ isin RKtimes(Kminus1) are the class scores B isin Rptimes(Kminus1) are the regression coefficientsand middotF is the Frobenius norm This compact form does not render the order thatarises naturally when considering the following series of K minus 1 problems

minθkisinRK βkisinRp

Yθk minusXβk2 + βgtk Ωβk (35a)

s t nminus1 θgtk YgtYθk = 1 (35b)

θgtk YgtYθ` = 0 ` = 1 k minus 1 (35c)

where each βk corresponds to a discriminant direction


Several sparse LDA methods have been derived by introducing non-quadratic sparsity-inducing penalties in the OS regression problem (Ghosh and Chinnaiyan, 2005; Leng, 2008; Grosenick et al., 2008; Clemmensen et al., 2011). Grosenick et al. (2008) proposed a variant of the lasso-based penalized OS of Ghosh and Chinnaiyan (2005) by introducing an elastic-net penalty in binary class problems. A generalization to multi-class problems was suggested by Clemmensen et al. (2011), where the objective function (3.5a) is replaced by

  \min_{\beta_k \in \mathbb{R}^p, \theta_k \in \mathbb{R}^K} \; \sum_k \|Y\theta_k - X\beta_k\|_2^2 + \lambda_1 \|\beta_k\|_1 + \lambda_2\, \beta_k^\top \Omega \beta_k ,

where λ_1 and λ_2 are regularization parameters and Ω is a penalization matrix, often taken to be the identity for the elastic net. The code for SLDA is available from the website of Line Clemmensen.

Another generalization of the work of Ghosh and Chinnaiyan (2005) was proposed by Leng (2008), with an extension to the multi-class framework based on a group-Lasso penalty in the objective function (3.5a):

  \min_{\beta_k \in \mathbb{R}^p, \theta_k \in \mathbb{R}^K} \; \sum_{k=1}^{K-1} \|Y\theta_k - X\beta_k\|_2^2 + \lambda \sum_{j=1}^{p} \Bigl(\sum_{k=1}^{K-1} \beta_{kj}^2\Bigr)^{1/2} ,   (3.6)

which is the criterion that was chosen in this thesis. The following chapters present our theoretical and algorithmic contributions regarding this formulation. The proposal of Leng (2008) was heuristically driven, and his algorithm followed closely the group-Lasso algorithm of Yuan and Lin (2006), which is not very efficient (the experiments of Leng (2008) are limited to small data sets with hundreds of examples and 1000 preselected genes, and no code is provided). Here, we formally link (3.6) to penalized LDA and propose a publicly available efficient code for solving this problem.


4 Formalizing the Objective

In this chapter, we detail the rationale supporting the Group-Lasso Optimal Scoring Solver (GLOSS) algorithm. GLOSS addresses a sparse LDA problem globally through a regression approach. Our analysis formally relates GLOSS to Fisher's discriminant analysis and also enables to derive variants, such as LDA assuming a diagonal within-class covariance structure (Bickel and Levina, 2004).

The sparsity arises from the group-Lasso penalty (3.6), due to Leng (2008), which selects the same features in all discriminant directions, thus providing an interpretable low-dimensional representation of the data. For K classes, this representation can be either complete, in dimension K-1, or partial, for a reduced-rank classification. The first two or three discriminants can also be used to display a graphical summary of the data.

The derivation of penalized LDA as a penalized optimal scoring regression is quite tedious, but it is required here since the algorithm hinges on this equivalence. The main lines have been derived in several places (Breiman and Ihaka, 1984; Hastie et al., 1994; Hastie and Tibshirani, 1996; Hastie et al., 1995) and already used before for sparsity-inducing penalties (Roth and Lange, 2004). However, the published demonstrations were quite elusive on a number of points, leading to generalizations that were not supported in a rigorous way. To our knowledge, we disclosed the first formal equivalence between the optimal scoring regression problem penalized by the group-Lasso and penalized LDA (Sanchez Merchante et al., 2012).

4.1 From Optimal Scoring to Linear Discriminant Analysis

Following Hastie et al. (1995), we now show the equivalence between the series of problems encountered in penalized optimal scoring (p-OS) and in penalized LDA (p-LDA), by going through canonical correlation analysis. We first provide some properties about the solutions of an arbitrary problem in the p-OS series (3.5).

Throughout this chapter, we assume that:

• there is no empty class, that is, the diagonal matrix Y^\top Y is full rank;

• inputs are centered, that is, X^\top 1_n = 0;

• the quadratic penalty Ω is positive-semidefinite and such that X^\top X + Ω is full rank.


4.1.1 Penalized Optimal Scoring Problem

For the sake of simplicity, we now drop subscript k to refer to any problem in the p-OS series (3.5). First, note that Problems (3.5) are biconvex in (θ, β), that is, convex in θ for each β value and vice-versa; the problems are however non-convex. In particular, if (θ*, β*) is a solution, then (-θ*, -β*) is also a solution.

The orthogonality constraint (3.5c) inherently limits the number of possible problems in the series to K, since we assumed that there are no empty classes. Moreover, as X is centered, the K-1 first optimal scores are orthogonal to 1 (and the Kth problem would be solved by β_K = 0). All the problems considered here can be solved by a singular value decomposition of a real symmetric matrix, so that the orthogonality constraints are easily dealt with. Hence, in the sequel, we do not mention these orthogonality constraints (3.5c) anymore, so as to simplify all expressions. The generic problem solved is thus

  \min_{\theta \in \mathbb{R}^K, \beta \in \mathbb{R}^p} \; \|Y\theta - X\beta\|^2 + \beta^\top \Omega \beta   (4.1a)
  \text{s.t. } n^{-1} \theta^\top Y^\top Y \theta = 1 .   (4.1b)

For a given score vector θ, the discriminant direction β that minimizes the p-OS criterion (4.1) is the penalized least squares estimator

  \beta_{os} = (X^\top X + \Omega)^{-1} X^\top Y \theta .   (4.2)

The objective function (4.1a) is then

  \|Y\theta - X\beta_{os}\|^2 + \beta_{os}^\top \Omega \beta_{os} = \theta^\top Y^\top Y\theta - 2\,\theta^\top Y^\top X\beta_{os} + \beta_{os}^\top (X^\top X + \Omega)\beta_{os}
                                                               = \theta^\top Y^\top Y\theta - \theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y\theta ,

where the second line stems from the definition of β_os (4.2). Now, using the fact that the optimal θ obeys constraint (4.1b), the optimization problem is equivalent to

  \max_{\theta:\; n^{-1}\theta^\top Y^\top Y\theta = 1} \; \theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \theta ,   (4.3)

which shows that the optimization of the p-OS problem with respect to θ_k boils down to finding the kth largest eigenvector of Y^\top X (X^\top X + \Omega)^{-1} X^\top Y. Indeed, Appendix C details that Problem (4.3) is solved by

  (Y^\top Y)^{-1} Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \theta = \alpha^2 \theta ,   (4.4)

where α² is the maximal eigenvalue:¹

  n^{-1}\theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \theta = \alpha^2\, n^{-1}\theta^\top Y^\top Y \theta = \alpha^2 .   (4.5)

4.1.2 Penalized Canonical Correlation Analysis

As per Hastie et al. (1995), the penalized Canonical Correlation Analysis (p-CCA) problem between variables X and Y is defined as follows:

  \max_{\theta \in \mathbb{R}^K, \beta \in \mathbb{R}^p} \; n^{-1}\, \theta^\top Y^\top X \beta   (4.6a)
  \text{s.t. } n^{-1}\, \theta^\top Y^\top Y \theta = 1 ,   (4.6b)
             n^{-1}\, \beta^\top (X^\top X + \Omega)\beta = 1 .   (4.6c)

The solutions to (4.6) are obtained by finding the saddle points of the Lagrangian:

  n L(\beta, \theta, \nu, \gamma) = \theta^\top Y^\top X\beta - \nu(\theta^\top Y^\top Y\theta - n) - \gamma(\beta^\top (X^\top X + \Omega)\beta - n)
  \Rightarrow\; n\, \partial L(\beta, \theta, \gamma, \nu)/\partial\beta = X^\top Y\theta - 2\gamma (X^\top X + \Omega)\beta
  \Rightarrow\; \beta_{cca} = \frac{1}{2\gamma} (X^\top X + \Omega)^{-1} X^\top Y \theta .

Then, as β_cca obeys (4.6c), we obtain

  \beta_{cca} = \frac{(X^\top X + \Omega)^{-1} X^\top Y\theta}{\sqrt{n^{-1}\theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y\theta}} ,   (4.7)

so that the optimal objective function (4.6a) can be expressed with θ alone:

  n^{-1}\theta^\top Y^\top X\beta_{cca} = \frac{n^{-1}\theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y\theta}{\sqrt{n^{-1}\theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y\theta}} = \sqrt{n^{-1}\theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y\theta} ,

and the optimization problem with respect to θ can be restated as

  \max_{\theta:\; n^{-1}\theta^\top Y^\top Y\theta = 1} \; \theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \theta .   (4.8)

Hence the p-OS and p-CCA problems produce the same optimal score vectors θ. The regression coefficients are thus proportional, as shown by (4.2) and (4.7):

  \beta_{os} = \alpha\, \beta_{cca} ,   (4.9)

where α is defined by (4.5).

The p-CCA optimization problem can also be written as a function of β alone, using the optimality conditions for θ:

  n\, \partial L(\beta, \theta, \gamma, \nu)/\partial\theta = Y^\top X\beta - 2\nu\, Y^\top Y\theta
  \Rightarrow\; \theta_{cca} = \frac{1}{2\nu} (Y^\top Y)^{-1} Y^\top X \beta .   (4.10)

Then, as θ_cca obeys (4.6b), we obtain

  \theta_{cca} = \frac{(Y^\top Y)^{-1} Y^\top X\beta}{\sqrt{n^{-1}\beta^\top X^\top Y (Y^\top Y)^{-1} Y^\top X\beta}} ,   (4.11)

leading to the following expression of the optimal objective function:

  n^{-1}\theta_{cca}^\top Y^\top X\beta = \frac{n^{-1}\beta^\top X^\top Y (Y^\top Y)^{-1} Y^\top X\beta}{\sqrt{n^{-1}\beta^\top X^\top Y (Y^\top Y)^{-1} Y^\top X\beta}} = \sqrt{n^{-1}\beta^\top X^\top Y (Y^\top Y)^{-1} Y^\top X\beta} .

The p-CCA problem can thus be solved with respect to β by plugging this value in (4.6):

  \max_{\beta \in \mathbb{R}^p} \; n^{-1}\beta^\top X^\top Y (Y^\top Y)^{-1} Y^\top X\beta   (4.12a)
  \text{s.t. } n^{-1}\, \beta^\top (X^\top X + \Omega)\beta = 1 ,   (4.12b)

where the positive objective function has been squared compared to (4.6). This formulation is important since it will be used to link p-CCA to p-LDA. We thus derive its solution: following the reasoning of Appendix C, β_cca verifies

  n^{-1} X^\top Y (Y^\top Y)^{-1} Y^\top X \beta_{cca} = \lambda (X^\top X + \Omega)\beta_{cca} ,   (4.13)

where λ is the maximal eigenvalue, shown below to be equal to α²:

  n^{-1}\beta_{cca}^\top X^\top Y (Y^\top Y)^{-1} Y^\top X\beta_{cca} = \lambda
  \Rightarrow\; n^{-1}\alpha^{-1}\beta_{cca}^\top X^\top Y (Y^\top Y)^{-1} Y^\top X (X^\top X + \Omega)^{-1} X^\top Y\theta = \lambda
  \Rightarrow\; n^{-1}\alpha\, \beta_{cca}^\top X^\top Y\theta = \lambda
  \Rightarrow\; n^{-1}\theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y\theta = \lambda
  \Rightarrow\; \alpha^2 = \lambda .

The first line is obtained by obeying constraint (4.12b), the second line by the relationship (4.7), where the denominator is α, the third line comes from (4.4), the fourth line uses again the relationship (4.7), and the last one the definition of α (4.5).

¹The awkward notation α² for the eigenvalue was chosen here to ease the comparison with Hastie et al. (1995). It is easy to check that this eigenvalue is indeed non-negative (see Equation (4.5), for example).


4.1.3 Penalized Linear Discriminant Analysis

Still following Hastie et al. (1995), penalized Linear Discriminant Analysis is defined as follows:

  \max_{\beta \in \mathbb{R}^p} \; \beta^\top \Sigma_B \beta   (4.14a)
  \text{s.t. } \beta^\top (\Sigma_W + n^{-1}\Omega)\beta = 1 ,   (4.14b)

where Σ_B and Σ_W are respectively the sample between-class and within-class variances of the original p-dimensional data. This problem may be solved by an eigenvector decomposition, as detailed in Appendix C.

As the feature matrix X is assumed to be centered, the sample total, between-class and within-class covariance matrices can be written in a simple form, amenable to a simple matrix representation using the projection operator Y(Y^\top Y)^{-1}Y^\top:

  \Sigma_T = \frac{1}{n}\sum_{i=1}^{n} x_i x_i^\top = n^{-1} X^\top X
  \Sigma_B = \frac{1}{n}\sum_{k=1}^{K} n_k\, \mu_k \mu_k^\top = n^{-1} X^\top Y (Y^\top Y)^{-1} Y^\top X
  \Sigma_W = \frac{1}{n}\sum_{k=1}^{K}\sum_{i: y_{ik}=1} (x_i - \mu_k)(x_i - \mu_k)^\top = n^{-1}\bigl(X^\top X - X^\top Y (Y^\top Y)^{-1} Y^\top X\bigr) .

Using these formulae, the solution to the p-LDA problem (4.14) is obtained as

  X^\top Y (Y^\top Y)^{-1} Y^\top X \beta_{lda} = \lambda \bigl(X^\top X + \Omega - X^\top Y (Y^\top Y)^{-1} Y^\top X\bigr)\beta_{lda}
  X^\top Y (Y^\top Y)^{-1} Y^\top X \beta_{lda} = \frac{\lambda}{1-\lambda} (X^\top X + \Omega)\beta_{lda} .

The comparison of the last equation with the characterization of β_cca (4.13) shows that β_lda and β_cca are proportional, and that λ/(1-λ) = α². Using constraints (4.12b) and (4.14b), it comes that

  \beta_{lda} = (1 - \alpha^2)^{-1/2}\, \beta_{cca} = \alpha^{-1}(1 - \alpha^2)^{-1/2}\, \beta_{os} ,

which ends the path from p-OS to p-LDA.


4.1.4 Summary

The three previous subsections considered a generic form of the kth problem in the p-OS series. The relationships unveiled above also hold for the compact notation gathering all problems (3.4), which is recalled below:

  \min_{\Theta, B} \; \|Y\Theta - XB\|_F^2 + \lambda \operatorname{tr}(B^\top \Omega B) \quad \text{s.t. } n^{-1} \Theta^\top Y^\top Y \Theta = I_{K-1} .

Let A represent the (K-1)×(K-1) diagonal matrix whose elements α_k are the square roots of the K-1 largest eigenvalues of Y^\top X (X^\top X + \Omega)^{-1} X^\top Y; we have

  B_{LDA} = B_{CCA}\,(I_{K-1} - A^2)^{-1/2} = B_{OS}\, A^{-1}(I_{K-1} - A^2)^{-1/2} ,   (4.15)

where I_{K-1} is the (K-1)×(K-1) identity matrix.

At this point, the feature matrix X, which in the input space has dimensions n×p, can be projected into the optimal scoring domain as an n×(K-1) matrix X_OS = X B_OS, or into the linear discriminant analysis space as an n×(K-1) matrix X_LDA = X B_LDA. Classification can be performed in any of those domains if the appropriate distance (penalized within-class covariance matrix) is applied.

With the aim of performing classification, the whole process can be summarized as follows (a sketch of steps 2-5 is given after the list):

1. Solve the p-OS problem as B_OS = (X^\top X + \lambda\Omega)^{-1} X^\top Y \Theta, where Θ are the K-1 leading eigenvectors of Y^\top X (X^\top X + \lambda\Omega)^{-1} X^\top Y.

2. Translate the data samples X into the LDA domain as X_LDA = X B_OS D, where D = A^{-1}(I_{K-1} - A^2)^{-1/2}.

3. Compute the matrix M of centroids μ_k from X_LDA and Y.

4. Evaluate the distances d(x, μ_k) in the LDA domain as a function of M and X_LDA.

5. Translate distances into posterior probabilities and assign every sample i to a class k following the maximum a posteriori rule.

6. Build a graphical representation of the data (Section 4.2.4).
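A minimal MATLAB sketch of steps 2-5 follows, assuming that B_OS (p×(K-1)) and the vector alpha of the α_k have already been obtained from the p-OS problem of step 1; variable names are illustrative and pdist2 comes from the Statistics Toolbox, so this is not the GLOSS implementation itself.

    % Sketch of steps 2-5; X is n-by-p (centered), Y is the n-by-K indicator matrix.
    [n, K] = deal(size(X, 1), size(Y, 2));
    Dmap  = diag(1 ./ (alpha .* sqrt(1 - alpha.^2)));  % step 2: OS -> LDA scaling
    X_LDA = X * B_OS * Dmap;                           % discriminant variates
    nk    = sum(Y)';                                   % class sizes
    Mu    = (Y' * X_LDA) ./ nk;                        % step 3: class centroids
    d2    = pdist2(X_LDA, Mu).^2 - 2*log(nk'/n);       % step 4: prior-adjusted squared distances
    [~, yhat] = min(d2, [], 2);                        % step 5: maximum a posteriori class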


The solution of the penalized optimal scoring regression and the computation of the distance and posterior matrices are detailed in Sections 4.2.1, 4.2.2 and 4.2.3, respectively.

4.2 Practicalities

4.2.1 Solution of the Penalized Optimal Scoring Regression

Following Hastie et al. (1994) and Hastie et al. (1995), a quadratically penalized LDA problem can be presented as a quadratically penalized OS problem:

  \min_{\Theta \in \mathbb{R}^{K\times(K-1)}, B \in \mathbb{R}^{p\times(K-1)}} \; \|Y\Theta - XB\|_F^2 + \lambda \operatorname{tr}(B^\top \Omega B)   (4.16a)
  \text{s.t. } n^{-1} \Theta^\top Y^\top Y \Theta = I_{K-1} ,   (4.16b)

where Θ are the class scores, B the regression coefficients and ‖·‖_F is the Frobenius norm.

Though non-convex, the OS problem is readily solved by a decomposition in Θ and B: the optimal B_OS does not intervene in the optimality conditions with respect to Θ, and the optimization with respect to B is obtained in closed form as a linear combination of the optimal scores Θ (Hastie et al., 1995). The algorithm may seem a bit tortuous considering the properties mentioned above, as it proceeds in four steps (a code sketch is given below):

1. Initialize Θ to Θ⁰ such that n^{-1} Θ⁰^\top Y^\top Y Θ⁰ = I_{K-1}.

2. Compute B = (X^\top X + \lambda\Omega)^{-1} X^\top Y Θ⁰.

3. Set Θ to be the K-1 leading eigenvectors of Y^\top X (X^\top X + \lambda\Omega)^{-1} X^\top Y.

4. Compute the optimal regression coefficients

  B_{OS} = (X^\top X + \lambda\Omega)^{-1} X^\top Y \Theta .   (4.17)

Defining Θ⁰ in Step 1 instead of using directly Θ as expressed in Step 3 drastically reduces the computational burden of the eigen-analysis: the latter is performed on Θ⁰^\top Y^\top X (X^\top X + \lambda\Omega)^{-1} X^\top Y Θ⁰, which is computed as Θ⁰^\top Y^\top X B, thus avoiding a costly matrix inversion. The solution of the penalized optimal scoring problem as an eigenvector decomposition is detailed and justified in Appendix B.
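The four steps can be sketched as follows in MATLAB for a quadratic penalty Ω and a fixed λ; the names are illustrative and do not reproduce the exact GLOSS implementation.

    % Sketch of the four-step solution of the penalized OS problem (4.16).
    [n, p] = size(X);  K = size(Y, 2);
    U      = null(ones(1, K));                       % K-by-(K-1), orthonormal, orthogonal to 1_K
    Theta0 = sqrt(n) * diag(1 ./ sqrt(sum(Y))) * U;  % step 1: n^{-1} Theta0'*Y'*Y*Theta0 = I
    B0     = (X'*X + lambda*Omega) \ (X'*Y*Theta0);  % step 2: regression on the initial scores
    M      = Theta0' * (Y'*X) * B0;                  % small (K-1)-by-(K-1) matrix, no extra inversion
    [V, S] = eig((M + M')/2);                        % step 3: eigen-analysis (symmetrized)
    [s, o] = sort(diag(S), 'descend');
    Theta  = Theta0 * V(:, o);                       % optimal scores
    B_OS   = B0 * V(:, o);                           % step 4: optimal regression coefficients
    alpha  = sqrt(s / n);                            % canonical correlations (cf. Algorithm 1)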

This four-step algorithm is valid when the penalty is of the form tr(B^\top Ω B). However, when an L1 penalty is applied in (4.16), the optimization algorithm requires iterative updates of B and Θ. That situation is developed by Clemmensen et al. (2011), where a Lasso or an Elastic net penalty is used to induce sparsity in the OS problem. Furthermore, these Lasso and Elastic net penalties do not enjoy the equivalence with LDA problems.

4.2.2 Distance Evaluation

The simplest classification rule is the nearest centroid rule, where sample x_i is assigned to class k if x_i is closer (in terms of the shared within-class Mahalanobis distance) to centroid μ_k than to any other centroid μ_ℓ. In general, the parameters of the model are unknown and the rule is applied with parameters estimated from the training data (sample estimators of μ_k and Σ_W). If μ_k are the centroids in the input space, sample x_i is assigned to class k if the distance

  d(x_i, \mu_k) = (x_i - \mu_k)^\top \Sigma_{W\Omega}^{-1}(x_i - \mu_k) - 2\log(n_k/n)   (4.18)

is minimized among all k. In expression (4.18), the first term is the Mahalanobis distance in the input space and the second term is an adjustment for unequal class sizes that estimates the prior probability of class k. Note that this is inspired by the Gaussian view of LDA, and that another definition of the adjustment term could be used (Friedman et al., 2009; Mai et al., 2012). The matrix Σ_{WΩ} used in (4.18) is the penalized within-class covariance matrix, which can be decomposed in a penalized and a non-penalized component:

  \Sigma_{W\Omega}^{-1} = \bigl(n^{-1}(X^\top X + \lambda\Omega) - \Sigma_B\bigr)^{-1} = \bigl(n^{-1}X^\top X - \Sigma_B + n^{-1}\lambda\Omega\bigr)^{-1} = \bigl(\Sigma_W + n^{-1}\lambda\Omega\bigr)^{-1} .   (4.19)

Before explaining how to compute the distances, let us summarize some clarifying points:

• the solution B_OS of the p-OS problem is enough to accomplish classification;

• in the LDA domain (space of discriminant variates X_LDA), classification is based on Euclidean distances;

• classification can be done in a reduced-rank space of dimension R < K-1, by using the first R discriminant directions β_1, ..., β_R.

As a result, the expression of the distance (4.18) depends on the domain where the classification is performed. If we classify in the p-OS domain, the distance is

  \|(x_i - \mu_k) B_{OS}\|^2_{\Sigma_{W\Omega}} - 2\log(\pi_k) ,

where π_k is the estimated class prior and ‖·‖_S is the Mahalanobis norm assuming within-class covariance S. If classification is done in the p-LDA domain, the distance is

  \bigl\|(x_i - \mu_k) B_{OS} A^{-1}(I_{K-1} - A^2)^{-1/2}\bigr\|_2^2 - 2\log(\pi_k) ,

which is a plain Euclidean distance.


4.2.3 Posterior Probability Evaluation

Let d(x, μ_k) be the distance between x and μ_k defined as in (4.18). Under the assumption that classes are Gaussian, the posterior probabilities p(y_k = 1|x) can be estimated as

  p(y_k = 1 \mid x) \propto \exp\bigl(-d(x, \mu_k)/2\bigr) \propto \pi_k \exp\Bigl(-\tfrac{1}{2}\bigl\|(x - \mu_k) B_{OS} A^{-1}(I_{K-1} - A^2)^{-1/2}\bigr\|_2^2\Bigr) .   (4.20)

Those probabilities must be normalized to ensure that they sum to one. When the distances d(x, μ_k) take large values, exp(-d(x, μ_k)/2) can take extremely small values, generating underflow issues. A classical trick to fix this numerical issue is detailed below (a code sketch follows):

  p(y_k = 1 \mid x) = \frac{\pi_k \exp\bigl(-d(x, \mu_k)/2\bigr)}{\sum_\ell \pi_\ell \exp\bigl(-d(x, \mu_\ell)/2\bigr)}
                    = \frac{\pi_k \exp\bigl(-(d(x, \mu_k) - d_{\max})/2\bigr)}{\sum_\ell \pi_\ell \exp\bigl(-(d(x, \mu_\ell) - d_{\max})/2\bigr)} ,

where d_max = max_k d(x, μ_k).
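A MATLAB sketch of this normalization is given below; it shifts by the largest log-term (equivalently, by the smallest distance), a standard variant of the trick above. Here d2 is the n×K matrix of squared distances d(x, μ_k) and pik the row vector of class priors (set pik to ones if the prior adjustment is already folded into d2); these names are illustrative.

    % Underflow-safe computation of the posterior probabilities.
    logu = log(pik) - d2/2;            % log of pi_k * exp(-d(x, mu_k)/2), row-wise
    logu = logu - max(logu, [], 2);    % shift each row by its maximum log-term
    P    = exp(logu);
    P    = P ./ sum(P, 2);             % each row now sums to one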

4.2.4 Graphical Representation

Sometimes it can be useful to have a graphical display of the data set. Using only the two or three most discriminant directions may not provide the best separation between classes, but it can suffice to inspect the data. This can be accomplished by plotting the first two or three dimensions of the regression fits X_OS or of the discriminant variates X_LDA, depending on whether the dataset is represented in the OS or in the LDA domain. Other attributes, such as the centroids or the shape of the within-class variance, can also be represented.

4.3 From Sparse Optimal Scoring to Sparse LDA

The equivalence stated in Section 4.1 holds for quadratic penalties of the form β^\top Ω β, under the assumption that Y^\top Y and X^\top X + λΩ are full rank (fulfilled when there are no empty classes and Ω is positive definite). Quadratic penalties have interesting properties but, as recalled in Section 2.3, they do not induce sparsity. In this respect, L1 penalties are preferable, but they lack a connection such as the one stated by Hastie et al. (1995) between p-LDA and p-OS.

In this section, we introduce the tools used to obtain sparse models while maintaining the equivalence between p-LDA and p-OS problems. We use a group-Lasso penalty (see Section 2.3.4) that induces groups of zeroes in the coefficients corresponding to the same feature in all discriminant directions, resulting in truly parsimonious models. Our derivation uses a variational formulation of the group-Lasso to generalize the equivalence drawn by Hastie et al. (1995) for quadratic penalties. We therefore intend to show that our formulation of the group-Lasso can be written in the quadratic form B^\top Ω B.

4.3.1 A Quadratic Variational Form

Quadratic variational forms of the Lasso and group-Lasso have been proposed shortly after the original Lasso paper of Tibshirani (1996), as a means to address optimization issues, but also as an inspiration for generalizing the Lasso penalty (Grandvalet, 1998; Canu and Grandvalet, 1999). The algorithms based on these quadratic variational forms iteratively reweight a quadratic penalty. They are now often outperformed by more efficient strategies (Bach et al., 2012).

Our formulation of the group-Lasso is shown below:

  \min_{\tau \in \mathbb{R}^p}\; \min_{B \in \mathbb{R}^{p\times(K-1)}} \; J(B) + \lambda \sum_{j=1}^{p} \frac{w_j^2 \|\beta^j\|_2^2}{\tau_j}   (4.21a)
  \text{s.t. } \sum_j \tau_j - \sum_j w_j \|\beta^j\|_2 \le 0 ,   (4.21b)
             \tau_j \ge 0 , \quad j = 1, \ldots, p ,   (4.21c)

where B ∈ R^{p×(K-1)} is a matrix composed of row vectors β^j ∈ R^{K-1}, B = (β^{1\top}, ..., β^{p\top})^\top, and the w_j are predefined nonnegative weights. The cost function J(B) in our context is the OS regression objective ½‖YΘ - XB‖²_2; from now on, for simplicity, we keep the generic notation J(B). Here and in what follows, b/τ is defined by continuation at zero, as b/0 = +∞ if b ≠ 0 and 0/0 = 0. Note that variants of (4.21) have been proposed elsewhere (see, e.g., Canu and Grandvalet, 1999; Bach et al., 2012, and references therein).

The intuition behind our approach is that, using the variational formulation, we recast a non-quadratic penalty into the convex hull of a family of quadratic penalties indexed by the variables τ_j. This is graphically shown in Figure 4.1.

Let us start by proving the equivalence of our variational formulation with the standard group-Lasso (an alternative variational formulation is detailed and demonstrated in Appendix D).

Lemma 4.1. The quadratic penalty on β^j in (4.21) acts as the group-Lasso penalty λ \sum_{j=1}^{p} w_j \|\beta^j\|_2.

Proof. The Lagrangian of Problem (4.21) is

  L = J(B) + \lambda \sum_{j=1}^{p} \frac{w_j^2\|\beta^j\|_2^2}{\tau_j} + \nu_0\Bigl(\sum_{j=1}^{p}\tau_j - \sum_{j=1}^{p} w_j\|\beta^j\|_2\Bigr) - \sum_{j=1}^{p}\nu_j \tau_j .

[Figure 4.1: Graphical representation of the variational approach to the group-Lasso.]

Thus, the first-order optimality conditions for τ_j are

  \frac{\partial L}{\partial \tau_j}(\tau_j^\star) = 0 \;\Leftrightarrow\; -\lambda \frac{w_j^2\|\beta^j\|_2^2}{\tau_j^{\star 2}} + \nu_0 - \nu_j = 0
  \Leftrightarrow\; -\lambda w_j^2\|\beta^j\|_2^2 + \nu_0 \tau_j^{\star 2} - \nu_j \tau_j^{\star 2} = 0
  \Rightarrow\; -\lambda w_j^2\|\beta^j\|_2^2 + \nu_0 \tau_j^{\star 2} = 0 .

The last line is obtained from complementary slackness, which implies here ν_j τ_j^\star = 0 (complementary slackness states that ν_j g_j(τ_j^\star) = 0, where ν_j is the Lagrange multiplier for the constraint g_j(τ_j) ≤ 0). As a result, the optimal value of τ_j is

  \tau_j^\star = \sqrt{\frac{\lambda w_j^2\|\beta^j\|_2^2}{\nu_0}} = \sqrt{\frac{\lambda}{\nu_0}}\, w_j\|\beta^j\|_2 .   (4.22)

We note that ν_0 ≠ 0 if there is at least one coefficient β_{jk} ≠ 0; the inequality constraint (4.21b) is then at bound (due to complementary slackness):

  \sum_{j=1}^{p}\tau_j^\star - \sum_{j=1}^{p} w_j\|\beta^j\|_2 = 0 ,   (4.23)

so that τ_j^\star = w_j‖β^j‖_2. Using this value in (4.21a), it is possible to conclude that Problem (4.21) is equivalent to the standard group-Lasso problem

  \min_{B \in \mathbb{R}^{p\times(K-1)}} \; J(B) + \lambda\sum_{j=1}^{p} w_j\|\beta^j\|_2 .   (4.24)

We have thus presented a convex quadratic variational form of the group-Lasso and demonstrated its equivalence with the standard group-Lasso formulation.
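A quick numerical check of Lemma 4.1 can be sketched in MATLAB, under the assumption that no row of B is exactly zero: plugging the optimal τ_j^\star = w_j‖β^j‖_2 into the quadratic penalty recovers the group-Lasso penalty.

    % Numerical illustration of Lemma 4.1 (B: p-by-(K-1), w: p-by-1 positive weights).
    rownorm = sqrt(sum(B.^2, 2));               % ||beta^j||_2 for each row j
    tau     = w .* rownorm;                     % optimal tau_j from (4.22)-(4.23)
    quad    = sum((w.^2 .* rownorm.^2) ./ tau); % quadratic penalty of (4.21a) at tau
    glasso  = sum(w .* rownorm);                % standard group-Lasso penalty of (4.24)
    % quad and glasso coincide (up to rounding) whenever all rownorm > 0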


With Lemma 4.1, we have proved that, under constraints (4.21b)-(4.21c), the quadratic problem (4.21a) is equivalent to the standard formulation of the group-Lasso (4.24). The penalty term of (4.21a) can be conveniently written as λ tr(B^\top Ω B), where

  \Omega = \operatorname{diag}\Bigl(\frac{w_1^2}{\tau_1}, \frac{w_2^2}{\tau_2}, \ldots, \frac{w_p^2}{\tau_p}\Bigr) ,   (4.25)

with τ_j^\star = w_j‖β^j‖_2, resulting in the diagonal components

  (\Omega)_{jj} = \frac{w_j}{\|\beta^j\|_2} .   (4.26)
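In practice, this adaptive penalty is a simple diagonal matrix recomputed from the current B; a minimal MATLAB sketch (illustrative names) is:

    % Build the adaptive quadratic penalty (4.25)-(4.26) from the current B.
    rownorm = sqrt(sum(B.^2, 2));      % ||beta^j||_2
    omega   = w ./ rownorm;            % (Omega)_jj = w_j / ||beta^j||_2, +Inf for null rows
    Omega   = diag(omega);             % penalty reads lambda * trace(B'*Omega*B);
                                       % in the solver, null rows are simply left out of the active set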

As stated at the beginning of this section, the equivalence between p-LDA and p-OS problems, established for quadratic penalties, thus extends to the variational formulation. This equivalence is crucial to the derivation of the link between sparse OS and sparse LDA; it furthermore suggests a convenient implementation. We sketch below some properties that are instrumental in the implementation of the active set method described in Chapter 5.

The first property states that the quadratic formulation is convex when J is convex, thus providing an easy control of optimality and convergence.

Lemma 4.2. If J is convex, Problem (4.21) is convex.

Proof. The function g(β, τ) = ‖β‖²_2/τ, known as the perspective function of f(β) = ‖β‖²_2, is convex in (β, τ) (see, e.g., Boyd and Vandenberghe, 2004, Chapter 3), and the constraints (4.21b)-(4.21c) define convex admissible sets; hence Problem (4.21) is jointly convex with respect to (B, τ).

In what follows, J will be a convex quadratic (hence smooth) function, in which case a necessary and sufficient optimality condition is that zero belongs to the subdifferential of the objective function, whose expression is provided in the following lemma.

Lemma 4.3. For all B ∈ R^{p×(K-1)}, the subdifferential of the objective function of Problem (4.24) is

  \Bigl\{ V \in \mathbb{R}^{p\times(K-1)} : V = \frac{\partial J(B)}{\partial B} + \lambda G \Bigr\} ,   (4.27)

where G ∈ R^{p×(K-1)} is a matrix composed of row vectors g^j ∈ R^{K-1}, G = (g^{1\top}, ..., g^{p\top})^\top, defined as follows. Let S(B) denote the row-wise support of B, S(B) = { j ∈ {1, ..., p} : ‖β^j‖_2 ≠ 0 }; then we have

  \forall j \in S(B) , \quad g^j = w_j\|\beta^j\|_2^{-1}\beta^j ,   (4.28)
  \forall j \notin S(B) , \quad \|g^j\|_2 \le w_j .   (4.29)

This condition results in an equality for the "active" non-zero vectors β^j and an inequality for the other ones, which both provide essential building blocks of our algorithm.

Proof. When ‖β^j‖_2 ≠ 0, the gradient of the penalty with respect to β^j is

  \frac{\partial}{\partial \beta^j}\Bigl(\lambda\sum_{m=1}^{p} w_m\|\beta^m\|_2\Bigr) = \lambda w_j \frac{\beta^j}{\|\beta^j\|_2} .   (4.30)

At ‖β^j‖_2 = 0, the gradient of the objective function is not continuous, and the optimality conditions then make use of the subdifferential (Bach et al., 2011):

  \partial_{\beta^j}\Bigl(\lambda\sum_{m=1}^{p} w_m\|\beta^m\|_2\Bigr) = \partial_{\beta^j}\bigl(\lambda w_j\|\beta^j\|_2\bigr) = \bigl\{\lambda w_j v \in \mathbb{R}^{K-1} : \|v\|_2 \le 1\bigr\} ,   (4.31)

which gives the expression (4.29).

Lemma 4.4. Problem (4.21) admits at least one solution, which is unique if J is strictly convex. All critical points B of the objective function verifying the following conditions are global minima:

  \forall j \in S , \quad \frac{\partial J(B)}{\partial \beta^j} + \lambda w_j\|\beta^j\|_2^{-1}\beta^j = 0 ,   (4.32a)
  \forall j \notin S , \quad \Bigl\|\frac{\partial J(B)}{\partial \beta^j}\Bigr\|_2 \le \lambda w_j ,   (4.32b)

where S ⊆ {1, ..., p} denotes the set of non-zero row vectors β^j and S̄ its complement.

Lemma 4.4 provides a simple appraisal of the support of the solution, which would not be as easily handled with the direct analysis of the variational problem (4.21).

4.3.2 Group-Lasso OS as Penalized LDA

With all the previous ingredients, the group-Lasso Optimal Scoring Solver for performing sparse LDA can be introduced.

Proposition 4.1. The group-Lasso OS problem

  B_{OS} = \operatorname*{argmin}_{B \in \mathbb{R}^{p\times(K-1)}} \min_{\Theta \in \mathbb{R}^{K\times(K-1)}} \; \frac{1}{2}\|Y\Theta - XB\|_F^2 + \lambda\sum_{j=1}^{p} w_j\|\beta^j\|_2
  \text{s.t. } n^{-1} \Theta^\top Y^\top Y \Theta = I_{K-1}

is equivalent to the penalized LDA problem

  B_{LDA} = \operatorname*{argmax}_{B \in \mathbb{R}^{p\times(K-1)}} \; \operatorname{tr}(B^\top \Sigma_B B) \quad \text{s.t. } B^\top(\Sigma_W + n^{-1}\lambda\Omega)B = I_{K-1} ,

where Ω = diag(w_1^2/\tau_1, ..., w_p^2/\tau_p), with

  \Omega_{jj} = \begin{cases} +\infty & \text{if } \beta^j_{os} = 0 \\ w_j\|\beta^j_{os}\|_2^{-1} & \text{otherwise.} \end{cases}   (4.33)

That is, B_{LDA} = B_{OS}\,\operatorname{diag}\bigl(\alpha_k^{-1}(1 - \alpha_k^2)^{-1/2}\bigr), where α_k ∈ (0, 1) is the kth leading eigenvalue of

  n^{-1} Y^\top X (X^\top X + \lambda\Omega)^{-1} X^\top Y .

Proof. The proof simply consists in applying the result of Hastie et al. (1995), which holds for quadratic penalties, to the quadratic variational form of the group-Lasso.

The proposition applies in particular to the Lasso-based OS approaches to sparse LDA (Grosenick et al., 2008; Clemmensen et al., 2011) for K = 2, that is, for binary classification or, more generally, for a single discriminant direction. Note however that it leads to a slightly different decision rule if the decision threshold is chosen a priori according to the Gaussian assumption for the features. For more than one discriminant direction, the equivalence does not hold any more, since the Lasso penalty does not result in an equivalent quadratic penalty in the simple form tr(B^\top Ω B).

5 GLOSS Algorithm

The efficient approaches developed for the Lasso take advantage of the sparsity of the solution by solving a series of small linear systems, whose sizes are incrementally increased or decreased (Osborne et al., 2000a). This approach was also pursued for the group-Lasso in its standard formulation (Roth and Fischer, 2008). We adapt this algorithmic framework to the variational form (4.21), with J(B) = ½‖YΘ - XB‖²_2.

The algorithm belongs to the working-set family of optimization methods (see Section 2.3.6). It starts from a sparse initial guess, say B = 0, thus defining the set A of "active" variables currently identified as non-zero. Then it iterates the three steps summarized below:

1. Update the coefficient matrix B within the current active set A, where the optimization problem is smooth. First the quadratic penalty is updated, and then a standard penalized least squares fit is computed.

2. Check the optimality conditions (4.32) with respect to the active variables. One or more β^j may be declared inactive when they vanish from the current solution.

3. Check the optimality conditions (4.32) with respect to the inactive variables. If they are satisfied, the algorithm returns the current solution, which is optimal. If they are not satisfied, the variable corresponding to the greatest violation is added to the active set.

This mechanism is represented graphically in Figure 5.1 as a block diagram, and formalized in more detail in Algorithm 1. Note that this formulation uses the equations from the variational approach detailed in Section 4.3.1. If we want to use the alternative variational approach from Appendix D, then we have to replace Equations (4.21), (4.32a) and (4.32b) by (D.1), (D.10a) and (D.10b), respectively.

5.1 Regression Coefficients Updates

Step 1 of Algorithm 1 updates the coefficient matrix B within the current active set A. The quadratic variational form of the problem suggests a blockwise optimization strategy, consisting in solving K-1 independent card(A)-dimensional problems instead of a single (K-1)×card(A)-dimensional problem. The interaction between the K-1 problems is relegated to the common adaptive quadratic penalty Ω. This decomposition is especially attractive, as we then solve K-1 similar systems

  (X_A^\top X_A + \lambda\Omega)\beta_k = X_A^\top Y \theta_k^0 ,   (5.1)


[Figure 5.1: GLOSS block diagram. Flowchart of the active-set loop: initialize the model (λ, B) and the active set {j : ‖β^j‖_2 > 0}; solve the p-OS problem on the active set until the first optimality condition holds; move any vanished variable to the inactive set; test the second optimality condition on the inactive set and move the worst violator, if any, to the active set; when no violation remains, compute Θ, update B and stop.]


Algorithm 1: Adaptively Penalized Optimal Scoring

Input: X, Y, B, λ
Initialize: A ← { j ∈ {1, ..., p} : ‖β^j‖_2 > 0 }; Θ⁰ such that n^{-1} Θ⁰^\top Y^\top Y Θ⁰ = I_{K-1}; convergence ← false
repeat
  % Step 1: solve (4.21) in B, assuming A optimal
  repeat
    Ω ← diag(Ω_A), with ω_j ← ‖β^j‖_2^{-1}
    B_A ← (X_A^\top X_A + λΩ)^{-1} X_A^\top Y Θ⁰
  until condition (4.32a) holds for all j ∈ A
  % Step 2: identify inactivated variables
  for j ∈ A such that ‖β^j‖_2 = 0 do
    if optimality condition (4.32b) holds then
      A ← A \ {j}; go back to Step 1
    end if
  end for
  % Step 3: check the greatest violation of optimality condition (4.32b) in the inactive set
  j* ← argmax_{j ∉ A} ‖∂J/∂β^j‖_2
  if ‖∂J/∂β^{j*}‖_2 < λ then
    convergence ← true (B is optimal)
  else
    A ← A ∪ {j*}
  end if
until convergence
(s, V) ← eigenanalyze(Θ⁰^\top Y^\top X_A B_A), that is, Θ⁰^\top Y^\top X_A B_A V_k = s_k V_k, k = 1, ..., K-1
Θ ← Θ⁰V;  B ← BV;  α_k ← n^{-1/2} s_k^{1/2}, k = 1, ..., K-1
Output: Θ, B, α


where X_A denotes the columns of X indexed by A, and β_k and θ_k^0 denote the kth columns of B and Θ⁰, respectively. These linear systems only differ in their right-hand-side term, so that a single Cholesky decomposition is necessary to solve all systems, whereas a blockwise Newton-Raphson method based on the standard group-Lasso formulation would result in different "penalties" Ω for each system.

5.1.1 Cholesky decomposition

Dropping the subscripts and considering the K-1 systems together, (5.1) leads to

  (X^\top X + \lambda\Omega)B = X^\top Y \Theta .   (5.2)

Defining the Cholesky decomposition C^\top C = X^\top X + λΩ, (5.2) is solved efficiently as follows:

  C^\top C B = X^\top Y\Theta
  C B = C^\top \backslash X^\top Y\Theta
  B = C \backslash (C^\top \backslash X^\top Y\Theta) ,   (5.3)

where the symbol "\" is the matlab mldivide operator, which solves linear systems efficiently. The GLOSS code implements (5.3).
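A minimal MATLAB sketch of the update (5.3) on the active set follows; XA, Omega and Theta denote the current restriction of X, the current diagonal penalty and the current scores (names are illustrative).

    % Solve (XA'*XA + lambda*Omega) B = XA'*Y*Theta with one Cholesky factorization.
    C = chol(XA'*XA + lambda*Omega);   % upper triangular factor, C'*C = XA'*XA + lambda*Omega
    B = C \ (C' \ (XA'*Y*Theta));      % two triangular solves shared by all K-1 right-hand sides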

5.1.2 Numerical Stability

The OS regression coefficients are obtained by (5.2), where the penalizer Ω is iteratively updated by (4.33). In this iterative process, when a variable is about to leave the active set, the corresponding entry of Ω reaches large values, thereby driving some OS regression coefficients to zero. These large values may cause numerical stability problems in the Cholesky decomposition of X^\top X + λΩ. This difficulty can be avoided using the following equivalent expression:

  B = \Omega^{-1/2}\bigl(\Omega^{-1/2} X^\top X \Omega^{-1/2} + \lambda I\bigr)^{-1}\Omega^{-1/2} X^\top Y \Theta^0 ,   (5.4)

where the conditioning of Ω^{-1/2} X^\top X Ω^{-1/2} + λI is always well-behaved provided X is appropriately normalized (recall that 0 ≤ 1/ω_j ≤ 1). This stabler expression demands more computation and is thus reserved to cases with large ω_j values; our code is otherwise based on expression (5.2).
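The stabler expression (5.4) can be sketched as follows, assuming Omega is diagonal with positive entries (names are illustrative):

    % Stable computation of B when some omega_j are very large (cf. (5.4)).
    s  = 1 ./ sqrt(diag(Omega));                   % entries of Omega^{-1/2}
    Xs = XA .* s';                                 % XA * Omega^{-1/2}
    B  = s .* ((Xs'*Xs + lambda*eye(numel(s))) \ (Xs' * Y * Theta0));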

5.2 Score Matrix

The optimal score matrix Θ is made of the K-1 leading eigenvectors of Y^\top X (X^\top X + \Omega)^{-1} X^\top Y. This eigen-analysis is actually solved in the form Θ^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y Θ (see Section 4.2.1 and Appendix B). The latter eigenvector decomposition does not require the costly computation of (X^\top X + Ω)^{-1}, which involves the inversion of a p×p matrix. Let Θ⁰ be an arbitrary K×(K-1) matrix whose range includes the K-1 leading eigenvectors of Y^\top X (X^\top X + \Omega)^{-1} X^\top Y.¹ Then, solving the K-1 systems (5.3) provides the value of B⁰ = (X^\top X + λΩ)^{-1} X^\top Y Θ⁰. This B⁰ matrix can be identified in the expression to eigen-analyze, as

  \Theta^{0\top} Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \Theta^0 = \Theta^{0\top} Y^\top X B^0 .

Thus, the solution of the penalized OS problem can be computed through the singular value decomposition of the (K-1)×(K-1) matrix Θ⁰^\top Y^\top X B⁰ = VΛV^\top. Defining Θ = Θ⁰V, we have Θ^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y Θ = Λ, and when Θ⁰ is chosen such that n^{-1} Θ⁰^\top Y^\top Y Θ⁰ = I_{K-1}, we also have n^{-1} Θ^\top Y^\top Y Θ = I_{K-1}, so that the constraints of the p-OS problem hold. Hence, assuming that the diagonal elements of Λ are sorted in decreasing order, θ_k is an optimal solution to the p-OS problem. Finally, once Θ has been computed, the corresponding optimal regression coefficients B satisfying (5.2) are simply recovered using the mapping from Θ⁰ to Θ, that is, B = B⁰V. Appendix E details why the computational trick described here for quadratic penalties can be applied to the group-Lasso, for which Ω is defined by a variational formulation.

5.3 Optimality Conditions

GLOSS uses an active set optimization technique to obtain the optimal values of the coefficient matrix B and the score matrix Θ. To be a solution, the coefficient matrix must obey Lemmas 4.3 and 4.4. Optimality conditions (4.32a) and (4.32b) can be deduced from those lemmas. Both expressions require the computation of the gradient of the objective function

  \frac{1}{2}\|Y\Theta - XB\|_2^2 + \lambda\sum_{j=1}^{p} w_j\|\beta^j\|_2 .   (5.5)

Let J(B) be the data-fitting term ½‖YΘ - XB‖²_2. Its gradient with respect to the jth row of B, β^j, is the (K-1)-dimensional vector

  \frac{\partial J(B)}{\partial \beta^j} = x_j^\top (XB - Y\Theta) ,

where x_j is the jth column of X. Hence the first optimality condition (4.32a) can be computed for every variable j as

  x_j^\top (XB - Y\Theta) + \lambda w_j \frac{\beta^j}{\|\beta^j\|_2} = 0 .

¹As X is centered, 1_K belongs to the null space of Y^\top X (X^\top X + \Omega)^{-1} X^\top Y. It is thus sufficient to choose Θ⁰ orthogonal to 1_K to ensure that its range spans the leading eigenvectors of Y^\top X (X^\top X + \Omega)^{-1} X^\top Y. In practice, to comply with this desideratum and conditions (3.5b) and (3.5c), we set Θ⁰ = (Y^\top Y)^{-1/2} U, where U is a K×(K-1) matrix whose columns are orthonormal vectors orthogonal to 1_K.


The second optimality condition (4.32b) can be computed for every variable j as

  \bigl\|x_j^\top (XB - Y\Theta)\bigr\|_2 \le \lambda w_j .
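Both checks can be evaluated jointly from the residual matrix, as in the following MATLAB sketch (tol is a small user-chosen tolerance; names are illustrative):

    % Check the optimality conditions (4.32a)-(4.32b) for the current (B, Theta).
    G = X' * (X*B - Y*Theta);                                 % row j is dJ/dbeta^j
    rownorm = sqrt(sum(B.^2, 2));
    act = rownorm > 0;                                        % active variables
    r1  = G(act,:) + lambda * (w(act) ./ rownorm(act)) .* B(act,:);
    ok1 = all(sqrt(sum(r1.^2, 2)) < tol);                     % condition (4.32a)
    ok2 = all(sqrt(sum(G(~act,:).^2, 2)) <= lambda*w(~act) + tol);  % condition (4.32b)
    optimal = ok1 && ok2;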

5.4 Active and Inactive Sets

The feature selection mechanism embedded in GLOSS selects the variables that provide the greatest decrease in the objective function. This is accomplished by means of the optimality conditions (4.32a) and (4.32b). Let A be the active set, containing the variables that have already been considered relevant. A variable j can be considered for inclusion into the active set if it violates the second optimality condition. We proceed one variable at a time, choosing the one that is expected to produce the greatest decrease in the objective function:

  j^\star = \operatorname*{argmax}_{j} \; \max\Bigl\{ \bigl\|x_j^\top (XB - Y\Theta)\bigr\|_2 - \lambda w_j ,\, 0 \Bigr\} .

The exclusion of a variable belonging to the active set A is considered if the norm ‖β^j‖_2 is small and if, after setting β^j to zero, the following optimality condition holds:

  \bigl\|x_j^\top (XB - Y\Theta)\bigr\|_2 \le \lambda w_j .

The process continues until no variable in the active set violates the first optimality condition and no variable in the inactive set violates the second optimality condition.

5.5 Penalty Parameter

The penalty parameter can be specified by the user, in which case GLOSS solves the problem for this value of λ. The other strategy is to compute the solution path for several values of λ: GLOSS then looks for the maximum value of the penalty parameter λ_max such that B ≠ 0, and solves the p-OS problem for decreasing values of λ until a prescribed number of features are declared active.

The maximum value of the penalty parameter λ_max corresponding to a null B matrix is obtained by computing the optimality condition (4.32b) at B = 0:

  \lambda_{\max} = \max_{j \in \{1, \ldots, p\}} \; \frac{1}{w_j}\bigl\|x_j^\top Y \Theta^0\bigr\|_2 .

The algorithm then computes a series of solutions along the regularization path defined by a series of penalties λ_1 = λ_max > ... > λ_t > ... > λ_T = λ_min ≥ 0, by regularly decreasing the penalty, λ_{t+1} = λ_t/2, and using a warm-start strategy where the feasible initial guess for B(λ_{t+1}) is initialized with B(λ_t). The final penalty parameter λ_min is determined during the optimization process, when the maximum number of desired active variables is attained (by default, the minimum of n and p).
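The path strategy can be sketched as below; gloss_fit is a hypothetical routine solving the p-OS problem for a fixed λ (warm-started from the current B), and T the number of path steps, both introduced only for illustration.

    % Regularization path with warm starts (illustrative only).
    lambda_max = max(sqrt(sum((X'*Y*Theta0).^2, 2)) ./ w);   % condition (4.32b) at B = 0
    B = zeros(p, K-1);
    lambda = lambda_max;
    for t = 1:T
        B = gloss_fit(X, Y, B, lambda);          % hypothetical solver, warm-started from B
        if nnz(any(B, 2)) >= min(n, p), break; end
        lambda = lambda / 2;                     % lambda_{t+1} = lambda_t / 2
    end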


5.6 Options and Variants

5.6.1 Scaling Variables

Like most penalization schemes, GLOSS is sensitive to the scaling of the variables. It thus makes sense to normalize them before applying the algorithm or, equivalently, to accommodate weights in the penalty. This option is available in the algorithm.

5.6.2 Sparse Variant

This version replaces some matlab commands used in the standard version of GLOSS by their sparse equivalents. In addition, some mathematical structures are adapted for sparse computation.

5.6.3 Diagonal Variant

We motivated the group-Lasso penalty by sparsity requisites, but robustness considerations could also drive its usage, since LDA is known to be unstable when the number of examples is small compared to the number of variables. In this context, LDA has been experimentally observed to benefit from unrealistic assumptions on the form of the estimated within-class covariance matrix. Indeed, the diagonal approximation, which ignores correlations between genes, may lead to better classification in microarray analysis. Bickel and Levina (2004) showed that this crude approximation provides a classifier with better worst-case performance than the LDA decision rule in small sample size regimes, even if variables are correlated.

The equivalence proof between penalized OS and penalized LDA (Hastie et al., 1995) reveals that quadratic penalties in the OS problem are equivalent to penalties on the within-class covariance matrix in the LDA formulation. This proof suggests a slight variant of penalized OS, corresponding to penalized LDA with a diagonal within-class covariance matrix, where the least squares problems

  \min_{B \in \mathbb{R}^{p\times(K-1)}} \|Y\Theta - XB\|_F^2 = \min_{B \in \mathbb{R}^{p\times(K-1)}} \operatorname{tr}\bigl(\Theta^\top Y^\top Y\Theta - 2\Theta^\top Y^\top XB + nB^\top \Sigma_T B\bigr)

are replaced by

  \min_{B \in \mathbb{R}^{p\times(K-1)}} \operatorname{tr}\bigl(\Theta^\top Y^\top Y\Theta - 2\Theta^\top Y^\top XB + nB^\top(\Sigma_B + \operatorname{diag}(\Sigma_W))B\bigr) .

Note that this variant only requires diag(Σ_W) + Σ_B + n^{-1}Ω to be positive definite, which is a weaker requirement than Σ_T + n^{-1}Ω positive definite.

5.6.4 Elastic net and Structured Variant

For some learning problems, the structure of correlations between variables is partially known. Hastie et al. (1995) applied this idea to the field of handwritten digit recognition,


[Figure 5.2: Graph and Laplacian matrix for a 3×3 image. Pixels are numbered 1-9 (bottom row 1-3, middle row 4-6, top row 7-9) and connected to their 8 neighbours; the corresponding Laplacian is

  Ω_L = [  3 -1  0 -1 -1  0  0  0  0
          -1  5 -1 -1 -1 -1  0  0  0
           0 -1  3  0 -1 -1  0  0  0
          -1 -1  0  5 -1  0 -1 -1  0
          -1 -1 -1 -1  8 -1 -1 -1 -1
           0 -1 -1  0 -1  5  0 -1 -1
           0  0  0 -1 -1  0  3 -1  0
           0  0  0 -1 -1 -1 -1  5 -1
           0  0  0  0 -1 -1  0 -1  3 ] ]

for their penalized discriminant analysis model, to constrain the discriminant directions to be spatially smooth.

When an image is represented as a vector of pixels, it is reasonable to assume positive correlations between the variables corresponding to neighboring pixels. Figure 5.2 represents the neighborhood graph of the pixels of a 3×3 image, with the corresponding Laplacian matrix. The Laplacian matrix Ω_L is positive semi-definite, and the penalty β^\top Ω_L β favors, among vectors of identical L2 norms, the ones having similar coefficients within the neighborhoods of the graph. For example, this penalty is 9 for the vector (1, 1, 0, 1, 1, 0, 0, 0, 0)^\top, which is the indicator of the neighbors of pixel 1, and it is 17 for the vector (-1, 1, 0, 1, 1, 0, 0, 0, 0)^\top, with a sign mismatch between pixel 1 and its neighborhood.

This smoothness penalty can be imposed jointly with the group-Lasso (a sketch of the Laplacian construction is given below). From the computational point of view, GLOSS hardly needs to be modified: the smoothness penalty has just to be added to the group-Lasso penalty. As the new penalty is convex and quadratic (thus smooth), there is no additional burden in the overall algorithm. There is, however, an additional hyperparameter to be tuned.
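The Laplacian of the pixel grid can be built in a few lines of MATLAB; the sketch below reproduces the 3×3 example of Figure 5.2 with 8-neighbour connectivity, and only the grid dimensions need to change for larger images.

    % Graph Laplacian of a w-by-h pixel grid with 8-neighbour (kings-move) connectivity.
    w = 3;  h = 3;
    [c, r] = meshgrid(1:w, 1:h);
    coord  = [r(:), c(:)];                                   % pixel coordinates
    D = max(abs(coord(:,1) - coord(:,1)'), abs(coord(:,2) - coord(:,2)'));
    A = double(D == 1);                                      % adjacency matrix of the grid
    OmegaL = diag(sum(A, 2)) - A;                            % positive semi-definite Laplacian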


6 Experimental Results

This section presents comparison results between the Group-Lasso Optimal Scoring Solver algorithm and two other state-of-the-art classifiers proposed to perform sparse LDA: penalized LDA (PLDA) (Witten and Tibshirani, 2011), which applies a Lasso penalty within Fisher's LDA framework, and Sparse Linear Discriminant Analysis (SLDA) (Clemmensen et al., 2011), which applies an Elastic net penalty to the OS problem. With the aim of testing the parsimony capacities, the latter algorithm was tested without any quadratic penalty, that is, with a Lasso penalty. The implementations of PLDA and SLDA are available from the authors' websites; PLDA is an R implementation and SLDA is coded in matlab. All the experiments used the same training, validation and test sets. Note that they differ significantly from the ones of Witten and Tibshirani (2011) in Simulation 4, for which there was a typo in their paper.

6.1 Normalization

With shrunken estimates, the scaling of features has important outcomes. For the linear discriminants considered here, the two most common normalization strategies consist in setting either the diagonal of the total covariance matrix Σ_T to ones, or the diagonal of the within-class covariance matrix Σ_W to ones. These options can be implemented either by scaling the observations accordingly prior to the analysis, or by providing penalties with weights. The latter option is implemented in our matlab package.¹

6.2 Decision Thresholds

The derivations of LDA based on the analysis of variance or on the regression of class indicators do not rely on the normality of the class-conditional distribution of the observations. Hence, their applicability extends beyond the realm of Gaussian data. Based on this observation, Friedman et al. (2009, chapter 4) suggest to investigate other decision thresholds than the ones stemming from the Gaussian mixture assumption. In particular, they propose to select the decision thresholds that empirically minimize the training error. This option was tested using validation sets or cross-validation.

¹The GLOSS matlab code can be found in the software section of www.hds.utc.fr/~grandval.


6.3 Simulated Data

We first compare the three techniques in the simulation study of Witten and Tibshirani (2011), which considers four setups with 1200 examples equally distributed between classes. They are split into a training set of size n = 100, a validation set of size 100 and a test set of size 1000. We are in the small sample regime, with p = 500 variables, out of which 100 differ between classes. Independent variables are generated for all simulations, except for Simulation 2 where they are slightly correlated. In Simulations 2 and 3, classes are optimally separated by a single projection of the original variables, while the two other scenarios require three discriminant directions. The Bayes' error was estimated to be respectively 1.7%, 6.7%, 7.3% and 30.0%. The exact definition of every setup, as provided in Witten and Tibshirani (2011), is given below.

Simulation 1: mean shift with independent features. There are four classes. If sample i is in class k, then x_i ~ N(μ_k, I), where μ_{1j} = 0.7 × 1(1 ≤ j ≤ 25), μ_{2j} = 0.7 × 1(26 ≤ j ≤ 50), μ_{3j} = 0.7 × 1(51 ≤ j ≤ 75), μ_{4j} = 0.7 × 1(76 ≤ j ≤ 100).

Simulation 2: mean shift with dependent features. There are two classes. If sample i is in class 1, then x_i ~ N(0, Σ), and if i is in class 2, then x_i ~ N(μ, Σ), with μ_j = 0.6 × 1(j ≤ 200). The covariance structure is block diagonal, with 5 blocks, each of dimension 100×100. The blocks have (j, j') element 0.6^{|j-j'|}. This covariance structure is intended to mimic gene expression data correlation.

Simulation 3: one-dimensional mean shift with independent features. There are four classes and the features are independent. If sample i is in class k, then X_{ij} ~ N((k-1)/3, 1) if j ≤ 100, and X_{ij} ~ N(0, 1) otherwise.

Simulation 4: mean shift with independent features and no linear ordering. There are four classes. If sample i is in class k, then x_i ~ N(μ_k, I), with mean vectors defined as follows: μ_{1j} ~ N(0, 0.3²) for j ≤ 25 and μ_{1j} = 0 otherwise; μ_{2j} ~ N(0, 0.3²) for 26 ≤ j ≤ 50 and μ_{2j} = 0 otherwise; μ_{3j} ~ N(0, 0.3²) for 51 ≤ j ≤ 75 and μ_{3j} = 0 otherwise; μ_{4j} ~ N(0, 0.3²) for 76 ≤ j ≤ 100 and μ_{4j} = 0 otherwise.

Note that this protocol is detrimental to GLOSS, as each relevant variable only affects a single class mean out of K. The setup is favorable to PLDA in the sense that most within-class covariance matrices are diagonal. We thus also tested the diagonal GLOSS variant discussed in Section 5.6.3.

The results are summarized in Table 6.1. Overall, the best predictions are performed by PLDA and GLOSS-D, which both benefit from the knowledge of the true within-class covariance structure. Then, among SLDA and GLOSS, which both ignore this structure, our proposal has a clear edge. The error rates are far away from the Bayes' error rates, but the sample size is small with regard to the number of relevant variables. Regarding sparsity, the clear overall winner is GLOSS, followed far away by SLDA, which is the only method that does not succeed in uncovering a low-dimensional representation in Simulation 3.


Table 6.1: Experimental results for simulated data: averages (with standard deviations) computed over 25 repetitions of the test error rate, the number of selected variables and the number of discriminant directions selected on the validation set.

                 Err (%)        Var             Dir
  Sim. 1, K = 4: mean shift, independent features
    PLDA         12.6 (0.1)     411.7 (3.7)     3.0 (0.0)
    SLDA         31.9 (0.1)     228.0 (0.2)     3.0 (0.0)
    GLOSS        19.9 (0.1)     106.4 (1.3)     3.0 (0.0)
    GLOSS-D      11.2 (0.1)     251.1 (4.1)     3.0 (0.0)
  Sim. 2, K = 2: mean shift, dependent features
    PLDA          9.0 (0.4)     337.6 (5.7)     1.0 (0.0)
    SLDA         19.3 (0.1)      99.0 (0.0)     1.0 (0.0)
    GLOSS        15.4 (0.1)      39.8 (0.8)     1.0 (0.0)
    GLOSS-D       9.0 (0.0)     203.5 (4.0)     1.0 (0.0)
  Sim. 3, K = 4: 1D mean shift, independent features
    PLDA         13.8 (0.6)     161.5 (3.7)     1.0 (0.0)
    SLDA         57.8 (0.2)     152.6 (2.0)     1.9 (0.0)
    GLOSS        31.2 (0.1)     123.8 (1.8)     1.0 (0.0)
    GLOSS-D      18.5 (0.1)     357.5 (2.8)     1.0 (0.0)
  Sim. 4, K = 4: mean shift, independent features
    PLDA         60.3 (0.1)     336.0 (5.8)     3.0 (0.0)
    SLDA         65.9 (0.1)     208.8 (1.6)     2.7 (0.0)
    GLOSS        60.7 (0.2)      74.3 (2.2)     2.7 (0.0)
    GLOSS-D      58.8 (0.1)     162.7 (4.9)     2.9 (0.0)


[Figure 6.1: TPR versus FPR (in %) for all algorithms (GLOSS, GLOSS-D, SLDA, PLDA) and all four simulations.]

Table 6.2: Average TPR and FPR (in %) computed over 25 repetitions.

             Simulation 1      Simulation 2      Simulation 3      Simulation 4
             TPR     FPR       TPR     FPR       TPR     FPR       TPR     FPR
  PLDA       99.0    78.2      96.9    60.3      98.0    15.9      74.3    65.6
  SLDA       73.9    38.5      33.8    16.3      41.6    27.8      50.7    39.5
  GLOSS      64.1    10.6      30.0     4.6      51.1    18.2      26.0    12.1
  GLOSS-D    93.5    39.4      92.1    28.1      95.6    65.5      42.9    29.9

The adequacy of the selected features was assessed by the True Positive Rate (TPR) and the False Positive Rate (FPR). The TPR is the proportion of relevant variables that are selected; similarly, the FPR is the proportion of irrelevant variables that are (wrongly) selected. The best algorithm would be the one that selects all the relevant variables and rejects all the others, that is, TPR = 1 and FPR = 0 simultaneously. PLDA has the best TPR but a terrible FPR, except in Simulation 3, where it dominates all the other methods. GLOSS has by far the best FPR, with an overall TPR slightly below SLDA. Results are displayed in Figure 6.1 and in Table 6.2 (both in percentages).

6.4 Gene Expression Data

We now compare GLOSS to PLDA and SLDA on three genomic datasets. The Nakayama² dataset contains 105 examples of 22,283 gene expressions for categorizing 10 soft tissue tumors. It was reduced to the 86 examples belonging to the 5 dominant categories (Witten and Tibshirani, 2011). The Ramaswamy³ dataset contains 198 examples of 16,063 gene expressions for categorizing 14 classes of cancer. Finally, the Sun⁴ dataset contains 180 examples of 54,613 gene expressions for categorizing 4 classes of tumors.

²http://www.broadinstitute.org/cancer/software/genepattern/datasets
³http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS2736

Table 6.3: Experimental results for gene expression data: averages over 10 training/test set splits (with standard deviations) of the test error rates and of the number of selected variables.

                                   Err (%)         Var
  Nakayama, n = 86, p = 22,283, K = 5
    PLDA                           20.95 (1.3)     10478.7 (2116.3)
    SLDA                           25.71 (1.7)       252.5 (3.1)
    GLOSS                          20.48 (1.4)       129.0 (18.6)
  Ramaswamy, n = 198, p = 16,063, K = 14
    PLDA                           38.36 (6.0)     14873.5 (720.3)
    SLDA                           --              --
    GLOSS                          20.61 (6.9)       372.4 (122.1)
  Sun, n = 180, p = 54,613, K = 4
    PLDA                           33.78 (5.9)     21634.8 (7443.2)
    SLDA                           36.22 (6.5)       384.4 (16.5)
    GLOSS                          31.77 (4.5)        93.0 (93.6)

Each dataset was split into a training set and a test set, with respectively 75% and 25% of the examples. Parameter tuning is performed by 10-fold cross-validation, and the test performances are then evaluated. The process is repeated 10 times, with random choices of the training and test set split.

Test error rates and the number of selected variables are presented in Table 6.3. The results for the PLDA algorithm are extracted from Witten and Tibshirani (2011). The three methods have comparable prediction performances on the Nakayama and Sun datasets, but GLOSS performs better on the Ramaswamy data, where the SparseLDA package failed to return a solution due to numerical problems in the LARS-EN implementation. Regarding the number of selected variables, GLOSS is again much sparser than its competitors.

Finally, Figure 6.2 displays the projection of the observations onto the first canonical plane estimated by GLOSS and SLDA for the Nakayama and Sun datasets. For the Nakayama dataset, groups 1 and 2 are well separated from the other ones in both representations, but GLOSS is more discriminant in the meta-cluster gathering groups 3 to 5. For the Sun dataset, SLDA suffers from a high colinearity of its first canonical variables, which renders the second one almost non-informative. As a result, group 1 is better separated in the first canonical plane with GLOSS.

⁴http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS1962


[Figure 6.2: 2D representations of the Nakayama (top) and Sun (bottom) datasets based on the first two discriminant vectors provided by GLOSS (left) and SLDA (right); the big squares represent class means. Nakayama classes: 1) synovial sarcoma, 2) myxoid liposarcoma, 3) dedifferentiated liposarcoma, 4) myxofibrosarcoma, 5) malignant fibrous histiocytoma. Sun classes: 1) non-tumor, 2) astrocytomas, 3) glioblastomas, 4) oligodendrogliomas.]


[Figure 6.3: Mean images of the USPS digits "1" and "0".]

6.5 Correlated Data

When the features are known to be highly correlated, the discrimination algorithm can be improved by using this information in the optimization problem. The structured variant of GLOSS presented in Section 5.6.4, S-GLOSS from now on, was conceived to easily introduce this prior knowledge.

The experiments described in this section are intended to illustrate the effect of combining the group-Lasso sparsity-inducing penalty with a quadratic penalty used as a surrogate of the unknown within-class variance matrix. This preliminary experiment does not include comparisons with other algorithms; more comprehensive experimental results are left for future work.

For this illustration, we have used a subset of the USPS handwritten digit dataset, made of 16×16-pixel images representing digits from 0 to 9. For our purpose, we compare the discriminant direction that separates digits "1" and "0", computed with GLOSS and S-GLOSS. The mean image of every digit is shown in Figure 6.3.

As in Section 5.6.4, we have encoded the pixel proximity relationships of Figure 5.2 into a penalty matrix Ω_L, but this time on a 256-node graph. Introducing this new 256×256 Laplacian penalty matrix Ω_L in the GLOSS algorithm is straightforward.

The effect of this penalty is fairly evident in Figure 6.4, where the discriminant vector β resulting from a non-penalized execution of GLOSS is compared with the β resulting from a Laplace-penalized execution of S-GLOSS (without group-Lasso penalty). We perfectly distinguish the center of the digit "0" in the discriminant direction obtained by S-GLOSS, which is probably the most important element to discriminate both digits.

Figure 6.5 displays the discriminant direction β obtained by GLOSS and S-GLOSS for a non-zero group-Lasso penalty, with an identical penalization parameter (λ = 0.3). Even if both solutions are sparse, the discriminant vector from S-GLOSS keeps connected pixels that allow to detect strokes, and will probably provide better prediction results.


[Figure 6.4: Discriminant direction between digits "1" and "0"; left: β for GLOSS, right: β for S-GLOSS.]

[Figure 6.5: Sparse discriminant direction between digits "1" and "0"; left: β for GLOSS with λ = 0.3, right: β for S-GLOSS with λ = 0.3.]


Discussion

GLOSS is an efficient algorithm that performs sparse LDA based on the regression of class indicators. Our proposal is equivalent to a penalized LDA problem; to our knowledge, this is the first approach that enjoys this property in the multi-class setting. This relationship is also amenable to accommodate interesting constraints on the equivalent penalized LDA problem, such as imposing a diagonal structure on the within-class covariance matrix.

Computationally, GLOSS is based on an efficient active set strategy that is amenable to the processing of problems with a large number of variables. The inner optimization problem decouples the p×(K-1)-dimensional problem into K-1 independent p-dimensional problems. The interaction between the K-1 problems is relegated to the computation of the common adaptive quadratic penalty. The algorithm presented here is highly efficient in medium to high dimensional setups, which makes it a good candidate for the analysis of gene expression data.

The experimental results confirm the relevance of the approach, which behaves well compared to its competitors, both regarding its prediction abilities and its interpretability (sparsity). Generally, compared to the competing approaches, GLOSS provides extremely parsimonious discriminants without compromising prediction performances. Employing the same features in all discriminant directions enables to generate models that are globally extremely parsimonious, with good prediction abilities. The resulting sparse discriminant directions also allow for visual inspection of the data from the low-dimensional representations that can be produced.

The approach has many potential extensions that have not yet been implemented. A first line of development is to consider a broader class of penalties. For example, plain quadratic penalties can also be added to the group penalty, to encode priors about the within-class covariance structure, in the spirit of the Penalized Discriminant Analysis of Hastie et al. (1995). Also, besides the group-Lasso, our framework can be customized to any penalty that is uniformly spread within groups, and many composite or hierarchical penalties that have been proposed for structured data meet this condition.


Part III

Sparse Clustering Analysis


Abstract

Clustering can be defined as the task of grouping samples such that the elements belonging to one cluster are more "similar" to each other than to the objects belonging to the other groups. There are similarity measures for any data structure: database records, or even multimedia objects (audio, video). The similarity concept is closely related to the idea of distance, which is a specific dissimilarity.

Model-based clustering aims to describe a heterogeneous population with a probabilistic model that represents each group with its own distribution. Here, the distributions will be Gaussians, and the different populations are identified by different means and a common covariance matrix.

As in the supervised framework, traditional clustering techniques perform worse as the number of irrelevant features increases. In this part we develop Mix-GLOSS, which builds on the supervised GLOSS algorithm to address unsupervised problems, resulting in a clustering mechanism with embedded feature selection.

Chapter 7 reviews different techniques for inducing sparsity in model-based clustering algorithms. The theory that motivates our original formulation of the EM algorithm is developed in Chapter 8, followed by the description of the algorithm in Chapter 9. Its performance is assessed and compared to other model-based sparse clustering mechanisms at the state of the art in Chapter 10.


7 Feature Selection in Mixture Models

7.1 Mixture Models

One of the most popular clustering algorithms is K-means, which aims to partition n observations into K clusters, each observation being assigned to the cluster with the nearest mean (MacQueen, 1967). A generalization of K-means can be made through probabilistic models, which represent K subpopulations by a mixture of distributions. Since their first use by Newcomb (1886) for the detection of outlier points, and 8 years later by Pearson (1894) to identify two separate populations of crabs, finite mixtures of distributions have been employed to model a wide variety of random phenomena. These models assume that measurements are taken from a set of individuals, each of which belongs to one out of a number of different classes, while any individual's particular class is unknown. Mixture models can thus address the heterogeneity of a population and are especially well suited to the problem of clustering.

7.1.1 Model

We assume that the observed data X = (x_1^\top, \dots, x_n^\top)^\top have been drawn identically from K different subpopulations in the domain R^p. The generative distribution is a finite mixture model, that is, the data are assumed to be generated from a compounded distribution whose density can be expressed as

f(x_i) = \sum_{k=1}^{K} \pi_k f_k(x_i), \qquad \forall i \in \{1, \dots, n\},

where K is the number of components, f_k are the densities of the components and π_k are the mixture proportions (π_k ∈ ]0, 1[ for all k, and \sum_k \pi_k = 1). Mixture models transcribe that, given the proportions π_k and the distributions f_k for each class, the data are generated according to the following mechanism:

• y: each individual is allotted to a class according to a multinomial distribution with parameters π_1, …, π_K;

• x: each x_i is assumed to arise from a random vector with probability density function f_k.

In addition, it is usually assumed that the component densities f_k belong to a parametric family of densities φ(·; θ_k). The density of the mixture can then be written as

f(x_i; \theta) = \sum_{k=1}^{K} \pi_k \phi(x_i; \theta_k), \qquad \forall i \in \{1, \dots, n\},


where θ = (π_1, …, π_K, θ_1, …, θ_K) is the parameter of the model.
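For concreteness, the following minimal sketch evaluates this mixture density for the Gaussian case with a common covariance matrix, which is the model used throughout this part. It is an illustration written for this document, not the thesis code; the function name and signature are ours.

    import numpy as np
    from scipy.stats import multivariate_normal

    def mixture_density(X, pis, mus, Sigma):
        """Mixture density f(x_i; theta) = sum_k pi_k * phi(x_i; mu_k, Sigma).
        X: (n, p) data; pis: (K,) proportions; mus: (K, p) means; Sigma: (p, p)."""
        f = np.zeros(X.shape[0])
        for k in range(len(pis)):
            f += pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigma)
        return f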

7.1.2 Parameter Estimation: The EM Algorithm

For the estimation of the parameters of the mixture model, Pearson (1894) used the method of moments to estimate the five parameters (μ_1, μ_2, σ_1^2, σ_2^2, π) of a univariate Gaussian mixture model with two components. That method required him to solve polynomial equations of degree nine. There are also graphical methods, maximum likelihood methods and Bayesian approaches.

The most widely used process to estimate the parameters is the maximization of the log-likelihood using the EM algorithm, which is typically employed to maximize the likelihood of models with latent variables, for which no analytical solution is available (Dempster et al., 1977).

The EM algorithm iterates two steps, called the expectation step (E) and the maximization step (M). Each expectation step involves the computation of the expectation of the likelihood with respect to the hidden variables, while each maximization step estimates the parameters by maximizing the expected likelihood of the E-step.

Under mild regularity assumptions, this mechanism converges to a local maximum of the likelihood. However, the type of problems targeted is typically characterized by the existence of several local maxima, and global convergence cannot be guaranteed: in practice, the obtained solution depends on the initialization of the algorithm.

Maximum Likelihood Definitions

The likelihood is commonly expressed in its logarithmic version:

L(\theta; X) = \log\left(\prod_{i=1}^{n} f(x_i; \theta)\right) = \sum_{i=1}^{n} \log\left(\sum_{k=1}^{K} \pi_k f_k(x_i; \theta_k)\right),    (7.1)

where n is the number of samples, K is the number of components of the mixture (or number of clusters) and π_k are the mixture proportions.

To obtain maximum likelihood estimates, the EM algorithm works with the joint distribution of the observations x and the unknown latent variables y, which indicate the cluster membership of every sample. The pair z = (x, y) is called the complete data. The log-likelihood of the complete data is called the complete log-likelihood or classification log-likelihood:

L_C(\theta; X, Y) = \log\left(\prod_{i=1}^{n} f(x_i, y_i; \theta)\right)
                  = \sum_{i=1}^{n} \log\left(\sum_{k=1}^{K} y_{ik} \pi_k f_k(x_i; \theta_k)\right)
                  = \sum_{i=1}^{n} \sum_{k=1}^{K} y_{ik} \log\left(\pi_k f_k(x_i; \theta_k)\right).    (7.2)

The y_{ik} are the binary entries of the indicator matrix Y, with y_{ik} = 1 if observation i belongs to cluster k and y_{ik} = 0 otherwise.

Defining the soft membership t_{ik}(θ) as

t_{ik}(\theta) = p(Y_{ik} = 1 | x_i; \theta)    (7.3)
               = \frac{\pi_k f_k(x_i; \theta_k)}{f(x_i; \theta)}.    (7.4)

To lighten notations, t_{ik}(θ) will be denoted t_{ik} when the parameter θ is clear from the context. The regular (7.1) and complete (7.2) log-likelihoods are related as follows:

L_C(\theta; X, Y) = \sum_{i,k} y_{ik} \log\left(\pi_k f_k(x_i; \theta_k)\right)
                  = \sum_{i,k} y_{ik} \log\left(t_{ik} f(x_i; \theta)\right)
                  = \sum_{i,k} y_{ik} \log t_{ik} + \sum_{i,k} y_{ik} \log f(x_i; \theta)
                  = \sum_{i,k} y_{ik} \log t_{ik} + \sum_{i=1}^{n} \log f(x_i; \theta)
                  = \sum_{i,k} y_{ik} \log t_{ik} + L(\theta; X),    (7.5)

where \sum_{i,k} y_{ik} \log t_{ik} can be reformulated as

\sum_{i,k} y_{ik} \log t_{ik} = \sum_{i=1}^{n} \sum_{k=1}^{K} y_{ik} \log\left(p(Y_{ik} = 1 | x_i; \theta)\right)
                              = \sum_{i=1}^{n} \log\left(p(Y_{ik} = 1 | x_i; \theta)\right)
                              = \log\left(p(Y | X; \theta)\right).

As a result, the relationship (7.5) can be rewritten as

L(\theta; X) = L_C(\theta; Z) - \log\left(p(Y | X; \theta)\right).    (7.6)


Likelihood Maximization

The complete log-likelihood cannot be evaluated because the variables y_{ik} are unknown. However, it is possible to estimate the value of the log-likelihood by taking expectations in (7.6), conditionally on a current value of θ:

L(\theta; X) = \underbrace{E_{Y \sim p(\cdot | X; \theta^{(t)})}\left[L_C(\theta; X, Y)\right]}_{Q(\theta, \theta^{(t)})} + \underbrace{E_{Y \sim p(\cdot | X; \theta^{(t)})}\left[-\log p(Y | X; \theta)\right]}_{H(\theta, \theta^{(t)})}.

In this expression, H(θ, θ^(t)) is an entropy term and Q(θ, θ^(t)) is the conditional expectation of the complete log-likelihood. Let us define the increment of the log-likelihood as ΔL = L(θ^(t+1); X) − L(θ^(t); X). Then θ^(t+1) = argmax_θ Q(θ, θ^(t)) also increases the log-likelihood:

\Delta L = \underbrace{\left(Q(\theta^{(t+1)}, \theta^{(t)}) - Q(\theta^{(t)}, \theta^{(t)})\right)}_{\geq 0 \text{ by definition of iteration } t+1} - \underbrace{\left(H(\theta^{(t+1)}, \theta^{(t)}) - H(\theta^{(t)}, \theta^{(t)})\right)}_{\leq 0 \text{ by Jensen's inequality}}.

Therefore, it is possible to maximize the likelihood by optimizing Q(θ, θ^(t)). The relationship between Q(θ, θ′) and L(θ; X) is developed in greater detail in Appendix F, which shows how the value of L(θ; X) can be recovered from Q(θ, θ^(t)).

For the mixture model problem, Q(θ, θ′) is

Q(\theta, \theta') = E_{Y \sim p(Y | X; \theta')}\left[L_C(\theta; X, Y)\right]
                   = \sum_{i,k} p(Y_{ik} = 1 | x_i; \theta') \log\left(\pi_k f_k(x_i; \theta_k)\right)
                   = \sum_{i=1}^{n} \sum_{k=1}^{K} t_{ik}(\theta') \log\left(\pi_k f_k(x_i; \theta_k)\right).    (7.7)

Q(θ, θ′), due to its similarity with the expression of the complete likelihood (7.2), is also known as the weighted likelihood. In (7.7), the weights t_{ik}(θ′) are the posterior probabilities of cluster membership.

Hence, the EM algorithm sketched above results in:

• Initialization (not iterated): choice of the initial parameter θ^(0);

• E-Step: evaluation of Q(θ, θ^(t)), using t_{ik}(θ^(t)) (7.4) in (7.7);

• M-Step: calculation of θ^(t+1) = argmax_θ Q(θ, θ^(t)).


Gaussian Model

In the particular case of a Gaussian mixture model with common covariance matrix Σ and different mean vectors μ_k, the mixture density is

f(x_i; \theta) = \sum_{k=1}^{K} \pi_k f_k(x_i; \theta_k)
              = \sum_{k=1}^{K} \pi_k \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left\{-\frac{1}{2}(x_i - \mu_k)^\top \Sigma^{-1} (x_i - \mu_k)\right\}.

At the E-step, the posterior probabilities t_{ik} are computed as in (7.4) with the current parameters θ^(t); then the M-step maximizes Q(θ, θ^(t)) (7.7), whose form is as follows:

Q(\theta, \theta^{(t)}) = \sum_{i,k} t_{ik} \log(\pi_k) - \sum_{i,k} t_{ik} \log\left((2\pi)^{p/2} |\Sigma|^{1/2}\right) - \frac{1}{2} \sum_{i,k} t_{ik} (x_i - \mu_k)^\top \Sigma^{-1} (x_i - \mu_k)
                        = \sum_{k} t_k \log(\pi_k) - \underbrace{\frac{np}{2} \log(2\pi)}_{\text{constant term}} - \frac{n}{2} \log(|\Sigma|) - \frac{1}{2} \sum_{i,k} t_{ik} (x_i - \mu_k)^\top \Sigma^{-1} (x_i - \mu_k)
                        \equiv \sum_{k} t_k \log(\pi_k) - \frac{n}{2} \log(|\Sigma|) - \sum_{i,k} t_{ik} \left(\frac{1}{2}(x_i - \mu_k)^\top \Sigma^{-1} (x_i - \mu_k)\right),    (7.8)

where

t_k = \sum_{i=1}^{n} t_{ik}.    (7.9)

The M-step, which maximizes this expression with respect to θ, applies the following updates defining θ^(t+1):

\pi_k^{(t+1)} = \frac{t_k}{n},    (7.10)

\mu_k^{(t+1)} = \frac{\sum_i t_{ik} x_i}{t_k},    (7.11)

\Sigma^{(t+1)} = \frac{1}{n} \sum_k W_k,    (7.12)

\text{with } W_k = \sum_i t_{ik} (x_i - \mu_k)(x_i - \mu_k)^\top.    (7.13)

The derivations are detailed in Appendix G.
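The updates (7.4) and (7.10)-(7.13) translate directly into code. The following is a minimal sketch written for this document (not the thesis implementation); names and signatures are ours.

    import numpy as np
    from scipy.stats import multivariate_normal

    def e_step(X, pis, mus, Sigma):
        # t_ik proportional to pi_k * phi(x_i; mu_k, Sigma), normalized over k -- (7.4)
        K = len(pis)
        T = np.column_stack([pis[k] * multivariate_normal.pdf(X, mus[k], Sigma)
                             for k in range(K)])
        return T / T.sum(axis=1, keepdims=True)

    def m_step(X, T):
        # Equations (7.9)-(7.13): weighted proportions, means and pooled covariance
        n, p = X.shape
        tk = T.sum(axis=0)                      # (7.9)
        pis = tk / n                            # (7.10)
        mus = (T.T @ X) / tk[:, None]           # (7.11)
        Sigma = np.zeros((p, p))
        for k in range(len(tk)):                # (7.12)-(7.13)
            Xc = X - mus[k]
            Sigma += (T[:, k, None] * Xc).T @ Xc
        return pis, mus, Sigma / n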

7.2 Feature Selection in Model-Based Clustering

When common covariance matrices are assumed, Gaussian mixtures are related to LDA, with partitions defined by linear decision rules. When every cluster has its own covariance matrix Σ_k, Gaussian mixtures are associated with quadratic discriminant analysis (QDA), with quadratic boundaries.

In the high-dimensional low-sample setting, numerical issues appear in the estimation of the covariance matrix. To avoid those singularities, regularization may be applied. A regularized trade-off between LDA and QDA (RDA) was proposed by Friedman (1989); Bensmail and Celeux (1996) extended this algorithm by rewriting the covariance matrix in terms of its eigenvalue decomposition Σ_k = λ_k D_k A_k D_k^⊤ (Banfield and Raftery, 1993).

These regularization schemes address singularity and stability issues, but they do not induce parsimonious models.

In this chapter we review some techniques to induce sparsity in model-based clustering algorithms. This sparsity refers to the rule that assigns examples to classes: clustering is still performed in the original p-dimensional space, but the decision rule can be expressed with only a few coordinates of this high-dimensional space.

7.2.1 Based on Penalized Likelihood

Penalized log-likelihood maximization is a popular estimation technique for mixture models. It is typically achieved by the EM algorithm, using mixture models for which the allocation of examples is expressed as a simple function of the input features. For example, for Gaussian mixtures with a common covariance matrix, the log-ratio of posterior probabilities is a linear function of x:

\log\left(\frac{p(Y_k = 1 | x)}{p(Y_\ell = 1 | x)}\right) = x^\top \Sigma^{-1}(\mu_k - \mu_\ell) - \frac{1}{2}(\mu_k + \mu_\ell)^\top \Sigma^{-1}(\mu_k - \mu_\ell) + \log\frac{\pi_k}{\pi_\ell}.

In this model, a simple way of introducing sparsity in the discriminant vectors Σ^{-1}(μ_k − μ_ℓ) is to constrain Σ to be diagonal and to favor sparse means μ_k. Indeed, for Gaussian mixtures with a common diagonal covariance matrix, if all means have the same value on dimension j, then variable j is useless for class allocation and can be discarded. The means can be penalized by the L1 norm,

\lambda \sum_{k=1}^{K} \sum_{j=1}^{p} |\mu_{kj}|,

as proposed by Pan et al. (2006) and Pan and Shen (2007). Zhou et al. (2009) consider more complex penalties on full covariance matrices:

\lambda_1 \sum_{k=1}^{K} \sum_{j=1}^{p} |\mu_{kj}| + \lambda_2 \sum_{k=1}^{K} \sum_{j=1}^{p} \sum_{m=1}^{p} |(\Sigma_k^{-1})_{jm}|.

In their algorithm, they make use of the graphical Lasso to estimate the covariances. Even if their formulation induces sparsity on the parameters, their combination of L1 penalties does not directly target decision rules based on few variables, and thus does not guarantee parsimonious models.
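To make the mean-penalization idea concrete, the sketch below shows the coordinate-wise M-step update that results from the L1 penalty on the means when the common covariance matrix is diagonal: the weighted mean is soft-thresholded. This is a generic illustration of the technique discussed above (in the spirit of Pan and Shen, 2007), written for this document; it is not the exact update used by any of the cited packages.

    import numpy as np

    def l1_penalized_means(X, T, sigma2, lam):
        """Soft-thresholded M-step update of the cluster means mu_kj under the
        penalty lam * sum_kj |mu_kj|, assuming a common diagonal covariance
        with variances sigma2 (length-p array). Illustrative sketch only."""
        tk = T.sum(axis=0)                      # soft counts per cluster, (K,)
        S = T.T @ X                             # S_kj = sum_i t_ik x_ij, (K, p)
        thresh = lam * sigma2[None, :]          # coordinate-wise threshold
        return np.sign(S) * np.maximum(np.abs(S) - thresh, 0.0) / tk[:, None]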


Guo et al. (2010) propose a variation with a Pairwise Fusion Penalty (PFP):

\lambda \sum_{j=1}^{p} \sum_{1 \leq k \leq k' \leq K} |\mu_{kj} - \mu_{k'j}|.

This PFP regularization does not shrink the means to zero but towards each other. If the jth components of all cluster means are driven to the same value, that variable can be considered as non-informative.

An L1∞ penalty is used by Wang and Zhu (2008) and Kuan et al. (2010) to penalize the likelihood, encouraging null groups of features:

\lambda \sum_{j=1}^{p} \left\|(\mu_{1j}, \mu_{2j}, \dots, \mu_{Kj})\right\|_\infty.

One group is defined for each variable j, as the set of the jth components of the K means, (μ_{1j}, …, μ_{Kj}). The L1∞ penalty forces zeros at the group level, favoring the removal of the corresponding feature. This method seems to produce parsimonious models and good partitions within a reasonable computing time; in addition, the code is publicly available. Xie et al. (2008b) apply a group-Lasso penalty. Their principle describes a vertical mean grouping (VMG, with the same groups as Xie et al. (2008a)) and a horizontal mean grouping (HMG). VMG achieves real feature selection because it forces null values for the same variable in all cluster means:

\lambda \sqrt{K} \sum_{j=1}^{p} \sqrt{\sum_{k=1}^{K} \mu_{kj}^2}.

The clustering algorithm of VMG differs from ours, but the proposed group penalty is the same; however, no code is available on the authors' website that would allow testing it.

The optimization of a penalized likelihood by means of an EM algorithm can be reformulated by rewriting the maximization expressions of the M-step as a penalized optimal scoring regression. Roth and Lange (2004) implemented this for two-cluster problems, using an L1 penalty to encourage sparsity on the discriminant vector. The generalization from quadratic to non-quadratic penalties is only quickly outlined in their work. We extend it by considering an arbitrary number of clusters and by formalizing the link between penalized optimal scoring and penalized likelihood estimation.

7.2.2 Based on Model Variants

The algorithm proposed by Law et al. (2004) takes a different stance. The authors define feature relevancy in terms of conditional independence: the jth feature is presumed uninformative if its distribution is independent of the class labels. The density is expressed as

f(x_i | \phi, \pi, \theta, \nu) = \sum_{k=1}^{K} \pi_k \prod_{j=1}^{p} \left[f(x_{ij} | \theta_{jk})\right]^{\phi_j} \left[h(x_{ij} | \nu_j)\right]^{1-\phi_j},

where f(·|θ_{jk}) is the distribution function for relevant features and h(·|ν_j) is the distribution function for the irrelevant ones. The binary vector φ = (φ_1, φ_2, …, φ_p) represents relevance, with φ_j = 1 if the jth feature is informative and φ_j = 0 otherwise. The saliency of variable j is then formalized as ρ_j = P(φ_j = 1), so all the φ_j must be treated as missing variables. Thus, the set of parameters is {π_k, θ_{jk}, ν_j, ρ_j}; their estimation is done by means of the EM algorithm (Law et al., 2004).

An original and recent technique is the Fisher-EM algorithm proposed by Bouveyron and Brunet (2012b,a). Fisher-EM is a modified version of EM that runs in a latent space. This latent space is defined by an orthogonal projection matrix U ∈ R^{p×(K−1)}, which is updated inside the EM loop with a new step, called the Fisher step (F-step from now on), which maximizes the multi-class Fisher criterion

\mathrm{tr}\left((U^\top \Sigma_W U)^{-1} U^\top \Sigma_B U\right),    (7.14)

so as to maximize the separability of the data. The E-step is the standard one, computing the posterior probabilities. Then the F-step updates the projection matrix that maps the data to the latent space. Finally, the M-step estimates the parameters by maximizing the conditional expectation of the complete log-likelihood. Those parameters can be rewritten as a function of the projection matrix U and of the model parameters in the latent space, such that the U matrix enters into the M-step equations.
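Criterion (7.14) is simple to evaluate; the following sketch (ours, not part of the Fisher-EM package) computes it for a given projection matrix.

    import numpy as np

    def fisher_criterion(U, Sigma_W, Sigma_B):
        """Multi-class Fisher criterion (7.14): tr((U' Sigma_W U)^-1 U' Sigma_B U).
        U is a p x (K-1) projection matrix; Sigma_W and Sigma_B are the
        within-class and between-class covariance matrices."""
        SW = U.T @ Sigma_W @ U
        SB = U.T @ Sigma_B @ U
        return np.trace(np.linalg.solve(SW, SB))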

To induce feature selection, Bouveyron and Brunet (2012a) suggest three possibilities. The first one results in the best sparse orthogonal approximation Ũ of the matrix U which maximizes (7.14). This sparse approximation is defined as the solution of

\min_{\tilde{U} \in \mathbb{R}^{p\times(K-1)}} \left\|X_U - X\tilde{U}\right\|_F^2 + \lambda \sum_{k=1}^{K-1} \left\|\tilde{u}_k\right\|_1,

where X_U = XU is the input data projected in the non-sparse space and ũ_k is the kth column vector of the projection matrix Ũ. The second possibility is inspired by Qiao et al. (2009) and reformulates the Fisher discriminant (7.14), used to compute the projection matrix, as a regression criterion penalized by a mixture of Lasso and Elastic net:

\min_{A, B \in \mathbb{R}^{p\times(K-1)}} \sum_{k=1}^{K} \left\|R_W^{-\top} H_{B,k} - A B^\top H_{B,k}\right\|_2^2 + \rho \sum_{j=1}^{K-1} \beta_j^\top \Sigma_W \beta_j + \lambda \sum_{j=1}^{K-1} \left\|\beta_j\right\|_1
\quad \text{s.t. } A^\top A = I_{K-1},

where H_B ∈ R^{p×K} is a matrix defined conditionally on the posterior probabilities t_{ik}, satisfying H_B H_B^⊤ = Σ_B, and H_{B,k} is the kth column of H_B. R_W ∈ R^{p×p} is an upper triangular matrix resulting from the Cholesky decomposition of Σ_W. Σ_W and Σ_B are the p × p within-class and between-class covariance matrices in the observation space. A ∈ R^{p×(K−1)} and B ∈ R^{p×(K−1)} are the solutions of the optimization problem, such that B = [β_1, …, β_{K−1}] is the best sparse approximation of U.

The last possibility suggests obtaining the solution of the Fisher discriminant (7.14) as the solution of the following constrained optimization problem:

\min_{U \in \mathbb{R}^{p\times(K-1)}} \sum_{j=1}^{p} \left\|\Sigma_{B,j} - U U^\top \Sigma_{B,j}\right\|_2^2
\quad \text{s.t. } U^\top U = I_{K-1},

where Σ_{B,j} is the jth column of the between-class covariance matrix in the observation space. This problem can be solved by the penalized version of the singular value decomposition proposed by Witten et al. (2009), resulting in a sparse approximation of U.

To comply with the constraint stating that the columns of U are orthogonal, the first and the second options must be followed by a singular value decomposition of U to recover orthogonality. This is not necessary with the third option, since the penalized version of the SVD already guarantees orthogonality.

However, there is a lack of guarantees regarding convergence. Bouveyron states: "the update of the orientation matrix U in the F-step is done by maximizing the Fisher criterion and not by directly maximizing the expected complete log-likelihood as required in the EM algorithm theory. From this point of view, the convergence of the Fisher-EM algorithm cannot therefore be guaranteed". Immediately after this paragraph, we can read that under certain assumptions their algorithm converges: "the model [...] which assumes the equality and the diagonality of covariance matrices, the F-step of the Fisher-EM algorithm satisfies the convergence conditions of the EM algorithm theory and the convergence of the Fisher-EM algorithm can be guaranteed in this case. For the other discriminant latent mixture models, although the convergence of the Fisher-EM procedure cannot be guaranteed, our practical experience has shown that the Fisher-EM algorithm rarely fails to converge with these models if correctly initialized".

7.2.3 Based on Model Selection

Some clustering algorithms recast the feature selection problem as a model selection problem. Following this idea, Raftery and Dean (2006) model the observations as a mixture of Gaussian distributions. To discover a subset of relevant features (and its superfluous complement), they define three subsets of variables:

• X^(1): the set of selected relevant variables;

• X^(2): the set of variables being considered for inclusion in, or exclusion from, X^(1);

• X^(3): the set of non-relevant variables.


With those subsets they define two different models, where Y is the partition to consider:

• M1:  f(X | Y) = f(X^(1), X^(2), X^(3) | Y) = f(X^(3) | X^(2), X^(1)) f(X^(2) | X^(1)) f(X^(1) | Y)

• M2:  f(X | Y) = f(X^(1), X^(2), X^(3) | Y) = f(X^(3) | X^(2), X^(1)) f(X^(2), X^(1) | Y)

Model M1 means that the variables in X^(2) are independent of the clustering Y; model M2 states that the variables in X^(2) depend on the clustering Y. To simplify the algorithm, the subset X^(2) is updated only one variable at a time. Therefore, deciding the relevance of a variable in X^(2) amounts to a model selection between M1 and M2. The selection is done via the Bayes factor

B_{12} = \frac{f(X | M_1)}{f(X | M_2)},

where the high-dimensional f(X^(3) | X^(2), X^(1)) cancels from the ratio:

B_{12} = \frac{f(X^{(1)}, X^{(2)}, X^{(3)} | M_1)}{f(X^{(1)}, X^{(2)}, X^{(3)} | M_2)} = \frac{f(X^{(2)} | X^{(1)}, M_1)\, f(X^{(1)} | M_1)}{f(X^{(2)}, X^{(1)} | M_2)}.

This factor is approximated, since the integrated likelihoods f(X^(1) | M1) and f(X^(2), X^(1) | M2) are difficult to calculate exactly; Raftery and Dean (2006) use the BIC approximation. The computation of f(X^(2) | X^(1), M1), if there is only one variable in X^(2), can be represented as a linear regression of variable X^(2) on the variables in X^(1); there is also a BIC approximation for this term.

Maugis et al. (2009a) have proposed a variation of the algorithm developed by Raftery and Dean. They define three subsets of variables: the relevant and irrelevant subsets (X^(1) and X^(3)) remain the same, but X^(2) is reformulated as a subset of relevant variables that explains the irrelevance through a multidimensional regression. This algorithm also uses a backward stepwise strategy, instead of the forward stepwise search used by Raftery and Dean (2006). Their algorithm allows the definition of blocks of indivisible variables that, in certain situations, improve the clustering and its interpretability.

Both algorithms are well motivated and appear to produce good results; however, the amount of computation needed to test the different subsets of variables requires a huge computation time. In practice, they cannot be used for the amount of data considered in this thesis.


8 Theoretical Foundations

In this chapter we develop Mix-GLOSS, which uses the GLOSS algorithm conceived for supervised classification (see Chapter 5) to solve clustering problems. The goal here is similar, that is, providing an assignment of examples to clusters based on few features.

We use a modified version of the EM algorithm whose M-step is formulated as a penalized linear regression of a scaled indicator matrix, that is, a penalized optimal scoring problem. This idea was originally proposed by Hastie and Tibshirani (1996) to build reduced-rank decision rules using fewer than K − 1 discriminant directions. Their motivation was mainly driven by stability issues; no sparsity-inducing mechanism was introduced in the construction of discriminant directions. Roth and Lange (2004) pursued this idea for binary clustering problems, where sparsity was introduced by a Lasso penalty applied to the OS problem. Besides extending the work of Roth and Lange (2004) to an arbitrary number of clusters, we draw links between the OS penalty and the parameters of the Gaussian model.

In the subsequent sections we provide the principles that allow the M-step to be solved as an optimal scoring problem. The feature selection technique is embedded by means of a group-Lasso penalty; we must then guarantee that the equivalence between the M-step and the OS problem holds for our penalty. As with GLOSS, this is accomplished with a variational approach of the group-Lasso. Finally, some considerations are provided regarding the criterion that is optimized with this modified EM.

8.1 Resolving EM with Optimal Scoring

In the previous chapters, EM was presented as an iterative algorithm that computes a maximum likelihood estimate through the maximization of the expected complete log-likelihood. This section explains how a penalized OS regression embedded into an EM algorithm produces a penalized likelihood estimate.

8.1.1 Relationship Between the M-Step and Linear Discriminant Analysis

LDA is typically used in a supervised learning framework for classification and dimension reduction. It looks for a projection of the data where the ratio of between-class variance to within-class variance is maximized (see Appendix C). Classification in the LDA domain is based on the Mahalanobis distance

d(x_i, \mu_k) = (x_i - \mu_k)^\top \Sigma_W^{-1} (x_i - \mu_k),

where μ_k are the p-dimensional centroids and Σ_W is the p × p common within-class covariance matrix.


The likelihood equations in the M-step, (7.11) and (7.12), can be interpreted as the mean and covariance estimates of a weighted and augmented LDA problem (Hastie and Tibshirani, 1996), where the n observations are replicated K times and weighted by t_{ik} (the posterior probabilities computed at the E-step).

Having replicated the data vectors, Hastie and Tibshirani (1996) remark that the parameters maximizing the mixture likelihood in the M-step of the EM algorithm, (7.11) and (7.12), can also be defined as the maximizers of the weighted and augmented likelihood

2\, l_{\mathrm{weight}}(\mu, \Sigma) = \sum_{i=1}^{n} \sum_{k=1}^{K} t_{ik}\, d(x_i, \mu_k) - n \log(|\Sigma_W|),

which arises when considering a weighted and augmented LDA problem. This viewpoint provides the basis for an alternative maximization of the penalized maximum likelihood in Gaussian mixtures.

8.1.2 Relationship Between Optimal Scoring and Linear Discriminant Analysis

The equivalence between penalized optimal scoring problems and penalized linear discriminant analysis has already been detailed in Section 4.1, in the supervised learning framework. It is a critical part of the link between the M-step of an EM algorithm and optimal scoring regression.

8.1.3 Clustering Using Penalized Optimal Scoring

The solution of the penalized optimal scoring regression in the M-step is a coefficient matrix B_OS, analytically related to the Fisher discriminant directions B_LDA for the data (X, Y), where Y is the current (hard or soft) cluster assignment. In order to compute the posterior probabilities t_{ik} in the E-step, the distance between the samples x_i and the centroids μ_k must be evaluated. Depending on whether we are working in the input, OS or LDA domain, different expressions are used for the distances (see Section 4.2.2 for more details). Mix-GLOSS works in the LDA domain, based on the following expression:

d(x_i, \mu_k) = \left\|(x_i - \mu_k) B_{\mathrm{LDA}}\right\|_2^2 - 2 \log(\pi_k).

This distance defines the computation of the posterior probabilities t_{ik} in the E-step (see Section 4.2.3). Putting together all those elements, the complete clustering algorithm can be summarized as follows:


1. Initialize the membership matrix Y (for example by the K-means algorithm).

2. Solve the p-OS problem as

   B_{OS} = \left(X^\top X + \lambda\Omega\right)^{-1} X^\top Y \Theta,

   where Θ are the K − 1 leading eigenvectors of Y^\top X \left(X^\top X + \lambda\Omega\right)^{-1} X^\top Y.

3. Map X to the LDA domain: X_{LDA} = X B_{OS} D, with D = \mathrm{diag}\left(\alpha_k^{-1}(1 - \alpha_k^2)^{-1/2}\right).

4. Compute the centroids M in the LDA domain.

5. Evaluate the distances in the LDA domain.

6. Translate the distances into posterior probabilities t_{ik}, with

   t_{ik} \propto \exp\left[-\frac{d(x_i, \mu_k) - 2\log(\pi_k)}{2}\right].    (8.1)

7. Update the labels using the posterior probability matrix: Y = T.

8. Go back to step 2 and iterate until the t_{ik} converge.

Items 2 to 5 can be interpreted as the M-step, and Item 6 as the E-step, in this alternative view of the EM algorithm for Gaussian mixtures; a code sketch of these steps is given below.
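The following minimal sketch strings steps 2 to 6 together for one EM iteration. It is an illustration written for this document under simplifying assumptions (the normalization Θ^⊤Y^⊤YΘ = I and most numerical safeguards are omitted, and the eigenvalues of the reduced matrix are taken to be the α_k^2 appearing in D); it is not the Mix-GLOSS implementation.

    import numpy as np

    def pos_em_iteration(X, Y, Omega, lam):
        """One iteration of the scheme above. X: (n, p) centered data; Y: (n, K)
        current hard or soft memberships; Omega: (p, p) quadratic penalty; lam: its
        weight. Illustrative sketch only."""
        n, p = X.shape
        K = Y.shape[1]
        A = X.T @ X + lam * Omega
        M = Y.T @ X @ np.linalg.solve(A, X.T @ Y)        # Y'X (X'X + lam Omega)^-1 X'Y
        evals, evecs = np.linalg.eigh(M)
        order = np.argsort(evals)[::-1][:K - 1]          # step 2: K-1 leading eigenvectors
        Theta, alpha2 = evecs[:, order], evals[order]
        alpha2 = np.clip(alpha2, 1e-6, 1 - 1e-6)         # guard against degenerate values
        B_os = np.linalg.solve(A, X.T @ Y @ Theta)       # B_OS = (X'X + lam Omega)^-1 X'Y Theta
        D = np.diag(1.0 / np.sqrt(alpha2 * (1.0 - alpha2)))
        X_lda = X @ B_os @ D                             # step 3: map to the LDA domain
        pi = Y.sum(axis=0) / n                           # class proportions
        mu = (Y.T @ X_lda) / Y.sum(axis=0)[:, None]      # step 4: centroids
        d = ((X_lda[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # step 5: distances
        T = np.exp(-(d - 2 * np.log(pi)[None, :]) / 2)   # step 6: posteriors, cf. (8.1)
        return B_os, T / T.sum(axis=1, keepdims=True)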

8.1.4 From Sparse Optimal Scoring to Sparse Linear Discriminant Analysis

In the previous section we outlined a clustering algorithm that replaces the M-step with a penalized OS regression. This modified version of EM holds for any quadratic penalty. We extend this equivalence to sparsity-inducing penalties through the quadratic variational approach to the group-Lasso provided in Section 4.3. We now look for a formal equivalence between this penalty and penalized maximum likelihood for Gaussian mixtures.

8.2 Optimized Criterion

In the classical EM for Gaussian mixtures, the M-step maximizes the weighted likelihood Q(θ, θ′) (7.7) so as to maximize the likelihood L(θ) (see Section 7.1.2). Replacing the M-step by a penalized optimal scoring problem is possible, and the link between the penalized optimal scoring problem and penalized LDA holds, but it remains to relate this penalized LDA problem to a penalized maximum likelihood criterion for the Gaussian mixture.

This penalized likelihood cannot be rigorously interpreted as a maximum a posteriori criterion, in particular because the penalty only operates on the covariance matrix Σ (there is no prior on the means and proportions of the mixture). We however believe that the Bayesian interpretation provides some insight, and we detail it in what follows.

8.2.1 A Bayesian Derivation

This section sketches a Bayesian treatment of inference limited to our present needs, where penalties are to be interpreted as prior distributions over the parameters of the probabilistic model to be estimated. Further details can be found in Bishop (2006, Section 2.3.6) and in Gelman et al. (2003, Section 3.6).

The model proposed in this thesis considers a classical maximum likelihood estimation for the means and a penalized common covariance matrix. This penalization can be interpreted as arising from a prior on this parameter.

The prior over the covariance matrix of a Gaussian variable is classically expressed as a Wishart distribution, since it is a conjugate prior:

f(\Sigma | \Lambda_0, \nu_0) = \frac{1}{2^{np/2} |\Lambda_0|^{n/2} \Gamma_p(n/2)} |\Sigma^{-1}|^{(\nu_0 - p - 1)/2} \exp\left\{-\frac{1}{2}\mathrm{tr}\left(\Lambda_0^{-1} \Sigma^{-1}\right)\right\},

where ν_0 is the number of degrees of freedom of the distribution, Λ_0 is a p × p scale matrix, and Γ_p is the multivariate gamma function, defined as

\Gamma_p(n/2) = \pi^{p(p-1)/4} \prod_{j=1}^{p} \Gamma\left(n/2 + (1 - j)/2\right).

The posterior distribution can be maximized, similarly to the likelihood, through the maximization of

Q(\theta, \theta') + \log\left(f(\Sigma | \Lambda_0, \nu_0)\right)
 = \sum_{k=1}^{K} t_k \log \pi_k - \frac{(n+1)p}{2}\log 2 - \frac{n}{2}\log|\Lambda_0| - \frac{p(p+1)}{4}\log(\pi)
   - \sum_{j=1}^{p} \log\left(\Gamma\left(\frac{n}{2} + \frac{1 - j}{2}\right)\right) - \frac{\nu_n - p - 1}{2}\log|\Sigma| - \frac{1}{2}\mathrm{tr}\left(\Lambda_n^{-1}\Sigma^{-1}\right)
 \equiv \sum_{k=1}^{K} t_k \log \pi_k - \frac{n}{2}\log|\Lambda_0| - \frac{\nu_n - p - 1}{2}\log|\Sigma| - \frac{1}{2}\mathrm{tr}\left(\Lambda_n^{-1}\Sigma^{-1}\right),    (8.2)

with

t_k = \sum_{i=1}^{n} t_{ik}, \qquad \nu_n = \nu_0 + n, \qquad \Lambda_n^{-1} = \Lambda_0^{-1} + S_0, \qquad S_0 = \sum_{i=1}^{n} \sum_{k=1}^{K} t_{ik}(x_i - \mu_k)(x_i - \mu_k)^\top.

Details of these calculations can be found in textbooks (for example Bishop, 2006; Gelman et al., 2003).

8.2.2 Maximum a Posteriori Estimator

The maximization of (8.2) with respect to μ_k and π_k is of course not affected by the additional prior term, in which only the covariance Σ intervenes. The MAP estimator for Σ is simply obtained by differentiating (8.2) with respect to Σ. The details of the calculations follow the same lines as the ones for maximum likelihood, detailed in Appendix G. The resulting estimator for Σ is

\Sigma_{\mathrm{MAP}} = \frac{1}{\nu_0 + n - p - 1}\left(\Lambda_0^{-1} + S_0\right),    (8.3)

where S_0 is the matrix defined in Equation (8.2). The maximum a posteriori estimator of the within-class covariance matrix (8.3) can thus be identified with the penalized within-class variance (4.19) resulting from the p-OS regression (4.16a), if ν_0 is chosen to be p + 1 and Λ_0^{-1} = λΩ, where Ω is the penalty matrix from the group-Lasso regularization (4.25).


9 Mix-GLOSS Algorithm

Mix-GLOSS is an algorithm for unsupervised classification that embeds feature selection, resulting in parsimonious decision rules. It is based on the GLOSS algorithm developed in Chapter 5, which has been adapted for clustering. In this chapter I describe the details of the implementation of Mix-GLOSS and of the model selection mechanism.

9.1 Mix-GLOSS

The implementation of Mix-GLOSS involves three nested loops, as schemed in Figure 9.1. The inner one is an EM algorithm that, for a given value of the regularization parameter λ, iterates between an M-step, where the parameters of the model are estimated, and an E-step, where the corresponding posterior probabilities are computed. The main outputs of the EM are the coefficient matrix B, which projects the input data X onto the best subspace (in Fisher's sense), and the posteriors t_{ik}.

When several values of the penalty parameter are tested, we give them to the algorithm in ascending order, and the algorithm is initialized with the solution found for the previous λ value. This process continues until all the penalty parameter values have been tested, if a vector of penalty parameters was provided, or until a given sparsity is achieved, as measured by the number of variables estimated to be relevant.

The outer loop implements complete repetitions of the clustering algorithm for all the penalty parameter values, with the purpose of choosing the best execution. This loop alleviates local minima issues by resorting to multiple initializations of the partition.

9.1.1 Outer Loop: Whole Algorithm Repetitions

This loop performs a user-defined number of repetitions of the clustering algorithm. It takes as inputs:

• the centered n × p feature matrix X;

• the vector of penalty parameter values to be tried (an option is to provide an empty vector and let the algorithm set trial values automatically);

• the number of clusters K;

• the maximum number of iterations for the EM algorithm;

• the convergence tolerance for the EM algorithm;

• the number of whole repetitions of the clustering algorithm;


Figure 9.1: Mix-GLOSS loops scheme.

• a p × (K − 1) initial coefficient matrix (optional);

• an n × K initial posterior probability matrix (optional).

For each repetition of the algorithm, an initial label matrix Y is needed. This matrix may contain either hard or soft assignments. If no such matrix is available, K-means is used to initialize the process. If we have an initial guess for the coefficient matrix B, it can also be fed into Mix-GLOSS to warm-start the process.

9.1.2 Penalty Parameter Loop

The penalty parameter loop goes through all the values of the input vector λ. These values are sorted in ascending order, such that the resulting B and Y matrices can be used to warm-start the EM loop for the next value of the penalty parameter. If some λ value results in a null coefficient matrix, the algorithm halts. We have verified that the warm start implemented here reduces the computation time by a factor of 8 with respect to using a null B matrix and a K-means execution for the initial Y label matrix.

Mix-GLOSS may be fed with an empty vector of penalty parameters, in which case a first non-penalized execution of Mix-GLOSS is done, and its resulting coefficient matrix B and posterior matrix Y are used to estimate a trial value of λ that should remove about 10% of the relevant features. This estimation is repeated until a minimum number of relevant variables is reached. The parameter that controls the percentage of variables estimated to be removed with the next penalty parameter can be modified to make feature selection more or less aggressive.

Algorithm 2 details the implementation of the automatic selection of the penalty parameter. If the alternate variational approach from Appendix D is used, Equations (4.32b) must be replaced by (D.10b).

Algorithm 2 Automatic selection of λ

Input: X, K, λ = ∅, minVAR
Initialize:
  B ← 0
  Y ← K-means(X, K)
Run non-penalized Mix-GLOSS:
  λ ← 0
  (B, Y) ← Mix-GLOSS(X, K, B, Y, λ)
lastLAMBDA ← false
repeat
  Estimate λ:
    Compute the gradient at β_j = 0:
      ∂J(B)/∂β_j |_{β_j = 0} = x_j^⊤ ( Σ_{m≠j} x_m β_m − YΘ )
    Compute λ_max for every feature using (4.32b):
      λ_max,j = (1/w_j) ‖ ∂J(B)/∂β_j |_{β_j = 0} ‖_2
    Choose λ so as to remove 10% of the relevant features
  Run penalized Mix-GLOSS:
    (B, Y) ← Mix-GLOSS(X, K, B, Y, λ)
  if the number of relevant variables in B > minVAR then
    lastLAMBDA ← false
  else
    lastLAMBDA ← true
  end if
until lastLAMBDA

Output: B, L(θ), t_ik, π_k, μ_k, Σ, Y for every λ in the solution path

9.1.3 Inner Loop: EM Algorithm

The inner loop implements the actual clustering algorithm, by means of successive maximizations of a penalized likelihood criterion. Once convergence of the posterior probabilities t_{ik} is achieved, the maximum a posteriori rule is applied to classify all examples. Algorithm 3 describes this inner loop.


Algorithm 3 Mix-GLOSS for one value of λ

Input: X, K, B0, Y0, λ
Initialize:
  if (B0, Y0) available then
    B_OS ← B0; Y ← Y0
  else
    B_OS ← 0; Y ← K-means(X, K)
  end if
  convergenceEM ← false; tolEM ← 1e-3
repeat
  M-step:
    (B_OS, Θ, α) ← GLOSS(X, Y, B_OS, λ)
    X_LDA = X B_OS diag(α^{-1}(1 − α^2)^{-1/2})
    π_k, μ_k and Σ as per (7.10), (7.11) and (7.12)
  E-step:
    t_ik as per (8.1)
    L(θ) as per (8.2)
  if (1/n) Σ_i |t_ik − y_ik| < tolEM then
    convergenceEM ← true
  end if
  Y ← T
until convergenceEM
Y ← MAP(T)

Output: B_OS, Θ, L(θ), t_ik, π_k, μ_k, Σ, Y


M-Step

The M-step deals with the estimation of the model parameters, that is, the cluster means μ_k, the common covariance matrix Σ and the prior π_k of every component. In a classical M-step this is done explicitly, by maximizing the likelihood expression; here, this maximization is implicitly performed by penalized optimal scoring (see Section 8.1). The core of this step is a GLOSS execution that regresses X on the scaled version YΘ of the label matrix. For the first iteration of EM, if no initialization is available, Y results from a K-means execution. In subsequent iterations, Y is updated as the posterior probability matrix T resulting from the E-step.

E-Step

The E-step evaluates the posterior probability matrix T using

t_{ik} \propto \exp\left[-\frac{d(x_i, \mu_k) - 2\log(\pi_k)}{2}\right].

The convergence of these t_{ik} is used as the stopping criterion for EM.

9.2 Model Selection

Here, model selection refers to the choice of the penalty parameter. Up to now, we have not conducted experiments where the number of clusters has to be automatically selected.

In a first attempt, we tried a classical structure where clustering was performed several times, from different initializations, for all penalty parameter values. Then, using the log-likelihood criterion, the best repetition for every value of the penalty parameter was chosen. The definitive λ was selected by means of the stability criterion described by Lange et al. (2002). This algorithm required a lot of computing resources, since the stability selection mechanism needed a certain number of repetitions that transformed Mix-GLOSS into a lengthy four-nested-loop structure.

In a second attempt, we replaced the stability-based model selection by the evaluation of a modified version of BIC (Pan and Shen, 2007). This version of BIC looks like the traditional one (Schwarz, 1978), but takes into consideration the variables that have been removed. This mechanism, even if it turned out to be faster, also required a large computation time.

The third and definitive attempt (up to now) proceeds with several executions of Mix-GLOSS for the non-penalized case (λ = 0), and the execution with the best log-likelihood is chosen; the repetitions are only performed for the non-penalized problem. The coefficient matrix B and the posterior matrix T resulting from the best non-penalized execution are used to warm-start a new Mix-GLOSS execution. This second execution of Mix-GLOSS is done using the values of the penalty parameter provided by the user, or computed by the automatic selection mechanism; this time, only one repetition of the algorithm is done for every value of the penalty parameter. This version has been tested with no significant differences in the quality of the clustering, but with a dramatic reduction of the computation time. Figure 9.2 summarizes the mechanism that implements the model selection of the penalty parameter λ.

Figure 9.2: Mix-GLOSS model selection diagram (initial non-penalized runs, warm start of the penalized runs, and choice of λ by minimum BIC).
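As a rough illustration of the BIC-based choice of λ mentioned above, the sketch below scores one penalized solution with a sparsity-aware BIC. The exact parameter count used by Mix-GLOSS is not reproduced here; the counting rule in the code is an assumption made for this example only.

    import numpy as np

    def modified_bic(loglik, B, n, K):
        """Sparsity-aware BIC in the spirit of Pan and Shen (2007): the usual
        log(n) complexity term, but counting only the parameters that remain
        free once variables have been removed. Assumed count: mixture
        proportions (K-1), nonzero entries of the p x (K-1) matrix B, and the
        covariance parameters restricted to the q selected variables."""
        q = int(np.sum(np.any(B != 0.0, axis=1)))     # number of selected variables
        d = (K - 1) + int(np.count_nonzero(B)) + q * (q + 1) // 2
        return -2.0 * loglik + np.log(n) * d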


10 Experimental Results

The performance of Mix-GLOSS is measured here on the artificial dataset that was used in Chapter 6.

This synthetic database is interesting because it covers four different situations where feature selection can be applied. Basically, it considers four setups with 1200 examples equally distributed between classes. It is a small-sample regime with p = 500 variables, out of which 100 differ between classes. Independent variables are generated for all simulations except for simulation 2, where they are slightly correlated. In simulations 2 and 3, classes are optimally separated by a single projection of the original variables, while the two other scenarios require three discriminant directions. The Bayes error was estimated to be respectively 1.7%, 6.7%, 7.3% and 30.0%. The exact description of every setup has already been given in Section 6.3.

In our tests we have reduced the volume of the problem, because with the original size of 1200 samples and 500 dimensions some of the algorithms to be tested took several days (even weeks) to finish. Hence, the definitive database was chosen to maintain approximately the Bayes error of the original one, but with five times fewer examples and dimensions (n = 240, p = 100). Figure 10.1 has been adapted from Witten and Tibshirani (2011) to the dimensionality of our experiments and allows a better understanding of the different simulations.

The simulation protocol involves 25 repetitions of each setup, generating a different dataset for each repetition. Thus, the results of the tested algorithms are provided as the average value and the standard deviation over the 25 repetitions.

10.1 Tested Clustering Algorithms

This section compares Mix-GLOSS with the following methods from the state of the art:

• CS general cov: a model-based clustering method with unconstrained covariance matrices, based on the regularization of the likelihood function using L1 penalties, followed by a classical EM algorithm. Further details can be found in Zhou et al. (2009). We use the R function available on the website of Wei Pan.

• Fisher EM: this method models and clusters the data in a discriminative and low-dimensional latent subspace (Bouveyron and Brunet, 2012b,a). Feature selection is induced by means of the "sparsification" of the projection matrix (three possibilities are suggested by Bouveyron and Brunet, 2012a). The corresponding R package "FisherEM" is available from the website of Charles Bouveyron or from the Comprehensive R Archive Network website.


Figure 10.1: Class mean vectors for each artificial simulation.

• SelvarClust/Clustvarsel: implements a method of variable selection for clustering using Gaussian mixture models, as a modification of the Raftery and Dean (2006) algorithm. SelvarClust (Maugis et al., 2009b) is a software implemented in C++ that makes use of the clustering libraries of mixmod (Biernacki et al., 2008). Further information can be found in the related paper, Maugis et al. (2009a). The software can be downloaded from the SelvarClust project homepage; there is a link to the project from Cathy Maugis's website.

After several tests, this entrant was discarded due to the amount of computing time required by the greedy selection technique, which basically involves two executions of a classical clustering algorithm (with mixmod) for every single variable whose inclusion needs to be considered.

The substitute for SelvarClust has been the algorithm that inspired it, that is, the method developed by Raftery and Dean (2006). There is an R package named Clustvarsel that can be downloaded from the website of Nema Dean or from the Comprehensive R Archive Network website.

• LumiWCluster: LumiWCluster is an R package available from the homepage of Pei Fen Kuan. This algorithm is inspired by Wang and Zhu (2008), who propose a penalty for the likelihood that incorporates group information through an L1∞ mixed norm. In Kuan et al. (2010), some slight changes are introduced in the penalty term, such as weighting parameters that are particularly important for their dataset. The package LumiWCluster allows clustering to be performed using the expression from Wang and Zhu (2008) (called LumiWCluster-Wang) or the one from Kuan et al. (2010) (called LumiWCluster-Kuan).

• Mix-GLOSS: this is the clustering algorithm implemented using GLOSS (see Chapter 9). It makes use of an EM algorithm and of the equivalences between the M-step and an LDA problem, and between a p-LDA problem and a p-OS problem. It penalizes an OS regression with a variational approach of the group-Lasso penalty (see Section 8.1.4), which induces zeros in all discriminant directions for the same variable.

10.2 Results

Table 10.1 shows the results of the experiments for all the algorithms of Section 10.1. The parameters used to measure the performance are:

• Clustering error (in percentage): to measure the quality of the partition, given the a priori knowledge of the real classes, the clustering error is computed as explained in Wu and Schölkopf (2007). If the obtained partition and the real labeling are the same, then the clustering error is 0%. The way this measure is defined allows the ideal 0% clustering error to be reached even if the IDs of the clusters and of the real classes differ.

• Number of disposed features: this value shows the number of variables whose coefficients have been zeroed, so that they are not used in the partitioning. In our datasets only the first 20 features are relevant for the discrimination; the last 80 variables can be discarded. Hence, a good result for the tested algorithms should be around 80.

• Time of execution (in hours, minutes or seconds): finally, the time needed to execute the 25 repetitions of each simulation setup is also measured. These algorithms tend to be more memory- and CPU-consuming as the number of variables increases; this is one of the reasons why the dimensionality of the original problem was reduced.

The adequacy of the selected features was assessed by the True Positive Rate (TPR) and the False Positive Rate (FPR). The TPR is defined as the ratio of selected variables that are actually relevant; similarly, the FPR is the ratio of selected variables that are actually non-relevant. The best algorithm would be the one that selects all the relevant variables and rejects all the others, that is, TPR = 1 and FPR = 0 simultaneously (a sketch of these evaluation measures is given below). In order to avoid cluttered results, we compare TPR and FPR for the four simulations but only for three algorithms: CS general cov and Clustvarsel were discarded due to their high computing time and clustering error, respectively, and the two versions of LumiWCluster providing almost the same TPR and FPR, only one is displayed. The three remaining algorithms are Fisher EM by Bouveyron and Brunet (2012a), the version of LumiWCluster by Kuan et al. (2010), and Mix-GLOSS.
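The sketch below shows how these measures can be computed. The optimal matching of cluster IDs to class IDs via the Hungarian algorithm is our reading of the Wu and Schölkopf (2007) clustering error, and the TPR/FPR convention in the code is the usual one, which may differ slightly from the wording above; both are illustrations, not the evaluation code of the thesis.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def clustering_error(y_true, y_pred, K):
        """Misclassification rate after the best one-to-one matching between
        cluster IDs and class IDs (Hungarian algorithm on the contingency table)."""
        C = np.zeros((K, K), dtype=int)
        for t, p in zip(y_true, y_pred):
            C[t, p] += 1
        rows, cols = linear_sum_assignment(-C)        # maximize matched counts
        return 1.0 - C[rows, cols].sum() / len(y_true)

    def tpr_fpr(selected, relevant, p):
        """TPR = fraction of truly relevant variables that are selected;
        FPR = fraction of truly irrelevant variables that are selected (assumed convention)."""
        selected, relevant = set(selected), set(relevant)
        irrelevant = set(range(p)) - relevant
        return (len(selected & relevant) / len(relevant),
                len(selected & irrelevant) / len(irrelevant))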

The results, in percentages, are displayed in Figure 10.2 (and in Table 10.2).


Table 10.1: Experimental results for simulated data

                      Err (%)       Var           Time
Sim 1: K = 4, mean shift, ind. features
  CS general cov      46 (15)       985 (72)      884h
  Fisher EM           58 (87)       784 (52)      1645m
  Clustvarsel         602 (107)     378 (291)     383h
  LumiWCluster-Kuan   42 (68)       779 (4)       389s
  LumiWCluster-Wang   43 (69)       784 (39)      619s
  Mix-GLOSS           32 (16)       80 (09)       15h
Sim 2: K = 2, mean shift, dependent features
  CS general cov      154 (2)       997 (09)      783h
  Fisher EM           74 (23)       809 (28)      8m
  Clustvarsel         73 (2)        334 (207)     166h
  LumiWCluster-Kuan   64 (18)       798 (04)      155s
  LumiWCluster-Wang   63 (17)       799 (03)      14s
  Mix-GLOSS           77 (2)        841 (34)      2h
Sim 3: K = 4, 1D mean shift, ind. features
  CS general cov      304 (57)      55 (468)      1317h
  Fisher EM           233 (65)      366 (55)      22m
  Clustvarsel         658 (115)     232 (291)     542h
  LumiWCluster-Kuan   323 (21)      80 (02)       83s
  LumiWCluster-Wang   308 (36)      80 (02)       1292s
  Mix-GLOSS           347 (92)      81 (88)       21h
Sim 4: K = 4, mean shift, ind. features
  CS general cov      626 (55)      999 (02)      112h
  Fisher EM           567 (104)     55 (48)       195m
  Clustvarsel         732 (4)       24 (12)       767h
  LumiWCluster-Kuan   692 (112)     99 (2)        876s
  LumiWCluster-Wang   697 (119)     991 (21)      825s
  Mix-GLOSS           669 (91)      975 (12)      11h

Table 10.2: TPR versus FPR (in %), averages computed over 25 repetitions for the best performing algorithms

             Simulation 1     Simulation 2     Simulation 3     Simulation 4
             TPR    FPR       TPR    FPR       TPR    FPR       TPR    FPR
MIX-GLOSS    992    015       828    335       884    67        780    12
LUMI-KUAN    992    28        1000   02        1000   005       50     005
FISHER-EM    986    24        888    17        838    5825      620    4075


Figure 10.2: TPR versus FPR (in %) for the best performing algorithms and simulations.

10.3 Discussion

After reviewing Tables 10.1 and 10.2 and Figure 10.2, we see that there is no definitive winner in all situations regarding all criteria. According to the objectives and constraints of the problem, the following observations deserve to be highlighted.

LumiWCluster (Wang and Zhu, 2008; Kuan et al., 2010) is by far the fastest kind of method, with good behavior regarding the other performance measures. At the other end of this criterion, CS general cov is extremely slow, and Clustvarsel, though twice as fast, also takes very long to produce an output. Of course, the speed criterion does not say much by itself: the implementations use different programming languages and different stopping criteria, and we do not know what effort has been spent on each implementation. That being said, the slowest algorithms are not the most precise ones, so their long computation times are worth mentioning here.

The quality of the partition varies depending on the simulation and the algorithm. Mix-GLOSS has a small edge in Simulation 1, LumiWCluster (Zhou et al., 2009) performs better in Simulation 2, while Fisher EM (Bouveyron and Brunet, 2012a) does slightly better in Simulations 3 and 4.

From the feature selection point of view, LumiWCluster (Kuan et al., 2010) and Mix-GLOSS succeed in removing irrelevant variables in all situations, while Fisher EM (Bouveyron and Brunet, 2012a) and Mix-GLOSS discover the relevant ones. Mix-GLOSS consistently performs best, or close to the best solution, in terms of fall-out and recall.


Conclusions


Conclusions

Summary

The linear regression of scaled indicator matrices, or optimal scoring, is a versatile technique with applicability in many fields of the machine learning domain. By means of regularization, an optimal scoring regression can be strengthened to be more robust, avoid overfitting, counteract ill-posed problems, or remove correlated or noisy variables.

In this thesis we have demonstrated the utility of penalized optimal scoring in the fields of multi-class linear discrimination and clustering.

The equivalence between LDA and OS problems allows all the resources available for solving regression problems to be brought to bear on linear discrimination. In their penalized versions, this equivalence holds under certain conditions that have not always been obeyed when OS has been used to solve LDA problems.

In Part II we have used a variational approach of the group-Lasso penalty to preserve this equivalence, granting the use of penalized optimal scoring regressions for the solution of linear discrimination problems. This theory has been verified with the implementation of our Group-Lasso Optimal Scoring Solver (GLOSS), which has proved its effectiveness, inducing extremely parsimonious models without renouncing any predictive capabilities. GLOSS has been tested on four artificial and three real datasets, outperforming other algorithms at the state of the art in almost all situations.

In Part III this theory has been adapted, by means of an EM algorithm, to the unsupervised domain. As in the supervised case, the theory must guarantee the equivalence between penalized LDA and penalized OS. The difficulty of this method resides in the computation of the criterion that is maximized at every iteration of the EM loop, which is typically used to detect the convergence of the algorithm and to implement the model selection of the penalty parameter. Also in this case, the theory has been put into practice with the implementation of Mix-GLOSS. By now, due to time constraints, only artificial datasets have been tested, with positive results.

Perspectives

Even if the preliminary results are encouraging, Mix-GLOSS has not been sufficiently tested. We have planned to test it at least with the same real datasets that we used with GLOSS; however, more testing would be advisable in both cases. These algorithms are well suited for genomic data, where the number of samples is smaller than the number of variables, but other high-dimension low-sample-size (HDLSS) domains are also possible: identification of male or female silhouettes, of fungal species or of fish species based on shape and texture (Clemmensen et al., 2011), or the Stirling faces (Roth and Lange, 2004), are only some examples. Moreover, we are not constrained to the HDLSS domain: the USPS handwritten digits database (Roth and Lange, 2004), or the well-known Fisher's Iris dataset and six other UCI datasets (Bouveyron and Brunet, 2012a), have also been tested in the bibliography.

At the programming level, both codes must be revisited to improve their robustness and optimize their computation, because during the prototyping phase the priority was achieving functional code. An old version of GLOSS, numerically more stable but less efficient, has been made available to the public. Better suited and documented versions of GLOSS and Mix-GLOSS should be made available in the short term.

The theory developed in this thesis and the programming structure used for its implementation allow easy alterations of the algorithm by modifying the within-class covariance matrix. Diagonal versions of the model can be obtained by discarding all the elements but the diagonal of the covariance matrix; spherical models could also be implemented easily. Prior information concerning the correlation between features can be included by adding a quadratic penalty term, such as the Laplacian that describes the relationships between variables; this can be used to implement pairwise penalties when the dataset is formed by pixels. Quadratic penalty matrices can also be added to the within-class covariance to implement Elastic-net-like penalties. Some of those possibilities have been partially implemented, such as the diagonal version of GLOSS; however, they have not been properly tested, or even updated with the last algorithmic modifications. Their equivalents for the unsupervised domain have not yet been proposed, due to the time deadlines for the publication of this thesis.

From the point of view of the supporting theory, we did not succeed in finding the exact criterion that is maximized by Mix-GLOSS. We believe it must be a kind of penalized, or even hyper-penalized, likelihood, but we decided to prioritize the experimental results due to the time constraints. Not knowing this criterion does not prevent successful executions of Mix-GLOSS: other mechanisms, which do not involve the computation of the real criterion, have been used to stop the EM algorithm and to perform model selection. However, further investigations must be carried out in this direction to assess the convergence properties of the algorithm.

At the beginning of this thesis, even if the work finally took the direction of feature selection, a big effort was made in the domains of outlier detection and block clustering. One of the most successful mechanisms for the detection of outliers consists of modelling the population with a mixture model in which the outliers are described by a uniform distribution. This technique does not need any prior knowledge about the number or the percentage of outliers. As the basic model of this thesis is a mixture of Gaussians, our impression is that it should not be difficult to introduce a new uniform component to gather together all those points that do not fit the Gaussian mixture. On the other hand, the application of penalized optimal scoring to block clustering looks more complex; but as block clustering is typically defined as a mixture model whose parameters are estimated by means of an EM algorithm, it could be possible to re-interpret that estimation using a penalized optimal scoring regression.


Appendix


A Matrix Properties

Property 1. By definition, Σ_W and Σ_B are both symmetric matrices:

\Sigma_W = \frac{1}{n} \sum_{k=1}^{g} \sum_{i \in C_k} (x_i - \mu_k)(x_i - \mu_k)^\top,

\Sigma_B = \frac{1}{n} \sum_{k=1}^{g} n_k (\mu_k - \bar{x})(\mu_k - \bar{x})^\top.

Property 2. \partial(x^\top a)/\partial x = \partial(a^\top x)/\partial x = a.

Property 3. \partial(x^\top A x)/\partial x = (A + A^\top) x.

Property 4. \partial|X^{-1}|/\partial X = -|X^{-1}| (X^{-1})^\top.

Property 5. \partial(a^\top X b)/\partial X = a b^\top.

Property 6. \frac{\partial}{\partial X}\mathrm{tr}\left(A X^{-1} B\right) = -(X^{-1} B A X^{-1})^\top = -X^{-\top} A^\top B^\top X^{-\top}.


B The Penalized-OS Problem is an Eigenvector Problem

In this appendix we answer the question of why the solution of a penalized optimal scoring regression involves an eigenvector decomposition. The p-OS problem has the form

\min_{\theta_k, \beta_k} \|Y\theta_k - X\beta_k\|_2^2 + \beta_k^\top \Omega_k \beta_k    (B.1)
\quad \text{s.t. } \theta_k^\top Y^\top Y \theta_k = 1, \quad \theta_\ell^\top Y^\top Y \theta_k = 0 \;\; \forall \ell < k,

for k = 1, …, K − 1. The Lagrangian associated with Problem (B.1) is

L_k(\theta_k, \beta_k, \lambda_k, \nu_k) = \|Y\theta_k - X\beta_k\|_2^2 + \beta_k^\top \Omega_k \beta_k + \lambda_k\left(\theta_k^\top Y^\top Y \theta_k - 1\right) + \sum_{\ell<k} \nu_\ell\, \theta_\ell^\top Y^\top Y \theta_k.    (B.2)

Setting the gradient of (B.2) with respect to β_k to zero gives the value of the optimal β_k:

\beta_k = \left(X^\top X + \Omega_k\right)^{-1} X^\top Y \theta_k.    (B.3)

The objective function of (B.1) evaluated at β_k is

\min_{\theta_k} \|Y\theta_k - X\beta_k\|_2^2 + \beta_k^\top \Omega_k \beta_k = \min_{\theta_k} \theta_k^\top Y^\top \left(I - X(X^\top X + \Omega_k)^{-1} X^\top\right) Y \theta_k
 = \max_{\theta_k} \theta_k^\top Y^\top X (X^\top X + \Omega_k)^{-1} X^\top Y \theta_k.    (B.4)

If the penalty matrix Ω_k is identical for all problems, Ω_k = Ω, then (B.4) corresponds to an eigen-problem where the k score vectors θ_k are the eigenvectors of Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Y.

B.1 How to Solve the Eigenvector Decomposition

Making an eigen-decomposition of an expression like $Y^\top X(X^\top X+\Omega)^{-1}X^\top Y$ is not trivial due to the $p\times p$ inverse. With some datasets, $p$ can be extremely large, making this inverse intractable. In this section we show how to circumvent this issue by solving an easier eigenvector decomposition.


Let $M$ be the matrix $Y^\top X(X^\top X+\Omega)^{-1}X^\top Y$, so that expression (B.4) can be rewritten in a compact way:

$$\max_{\Theta\in\mathbb{R}^{K\times(K-1)}}\ \mathrm{tr}\big(\Theta^\top M\Theta\big) \qquad (B.5)$$
$$\text{s.t.}\quad \Theta^\top Y^\top Y\Theta = I_{K-1}.$$

If (B.5) is an eigenvector problem, it can be reformulated in the traditional way. Let the $(K-1)\times(K-1)$ matrix $M_\Theta$ be $\Theta^\top M\Theta$. Hence the classical eigenvector formulation associated to (B.5) is

$$M_\Theta v = \lambda v, \qquad (B.6)$$

where $v$ is the eigenvector and $\lambda$ the associated eigenvalue of $M_\Theta$. Operating,

$$v^\top M_\Theta v = \lambda \;\Longleftrightarrow\; v^\top\Theta^\top M\Theta v = \lambda.$$

Making the variable change $w = \Theta v$, we obtain an alternative eigenproblem where $w$ are the eigenvectors of $M$ and $\lambda$ the associated eigenvalue:

$$w^\top M w = \lambda. \qquad (B.7)$$

Therefore $v$ are the eigenvectors of the eigen-decomposition of matrix $M_\Theta$, and $w$ are the eigenvectors of the eigen-decomposition of matrix $M$. Note that the only difference between the $(K-1)\times(K-1)$ matrix $M_\Theta$ and the $K\times K$ matrix $M$ is the $K\times(K-1)$ matrix $\Theta$ in the expression $M_\Theta = \Theta^\top M\Theta$. Then, to avoid the computation of the $p\times p$ inverse $(X^\top X+\Omega)^{-1}$, we can use the optimal value of the coefficient matrix $B^\star = (X^\top X+\Omega)^{-1}X^\top Y\Theta$ in $M_\Theta$:

$$M_\Theta = \Theta^\top Y^\top X(X^\top X+\Omega)^{-1}X^\top Y\Theta = \Theta^\top Y^\top X B^\star.$$

Thus, the eigen-decomposition of the $(K-1)\times(K-1)$ matrix $M_\Theta = \Theta^\top Y^\top X B^\star$ results in the $v$ eigenvectors of (B.6). To obtain the $w$ eigenvectors of the alternative formulation (B.7), the variable change $w = \Theta v$ needs to be undone.

To summarize, we calculate the $v$ eigenvectors as the eigen-decomposition of a tractable $M_\Theta$ matrix, evaluated as $\Theta^\top Y^\top X B^\star$. Then the definitive eigenvectors $w$ are recovered by doing $w = \Theta v$. The final step is the reconstruction of the optimal score matrix $\Theta^\star$ using the vectors $w$ as its columns. At this point we understand what in the literature is called "updating the initial score matrix": multiplying the initial $\Theta$ by the matrix of eigenvectors $V$ from decomposition (B.6) reverses the change of variable to restore the $w$ vectors. The $B^\star$ matrix also needs to be "updated" by multiplying $B^\star$ by the same matrix of eigenvectors $V$, in order to account for the initial $\Theta$ matrix used in the first computation of $B^\star$:

$$B^\star = (X^\top X+\Omega)^{-1}X^\top Y\Theta V = B^\star V.$$
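As an illustration of this shortcut, the sketch below (NumPy) forms the small matrix $M_\Theta = \Theta^\top Y^\top X B^\star$ from an initial score matrix and a coefficient matrix obtained by a ridge-type solve, and then updates both with the eigenvector matrix $V$. Names, shapes and the plain ridge solve are assumptions for illustration only; in the actual algorithm the coefficients would come from the penalized regression fit described in the main text.

```python
import numpy as np

def penalized_os_scores(X, Y, Omega, Theta0):
    """Sketch of the eigen-decomposition trick of Appendix B.1.

    X: (n, p) centered data, Y: (n, K) class indicator matrix,
    Omega: (p, p) penalty matrix, Theta0: (K, K-1) initial score matrix.
    Returns the updated scores Theta and regression coefficients B.
    """
    # Coefficients for the initial scores, as in (B.3).
    B0 = np.linalg.solve(X.T @ X + Omega, X.T @ Y @ Theta0)
    # Small (K-1) x (K-1) matrix M_Theta = Theta0' Y' X B0, cheap to eigen-decompose.
    M_theta = Theta0.T @ Y.T @ X @ B0
    eigval, V = np.linalg.eigh((M_theta + M_theta.T) / 2)   # symmetrize for numerical safety
    V = V[:, np.argsort(eigval)[::-1]]                      # order by decreasing eigenvalue
    # "Update" scores and coefficients, i.e. undo the change of variable w = Theta v.
    return Theta0 @ V, B0 @ V
```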


B.2 Why the OS Problem is Solved as an Eigenvector Problem

In the Optimal Scoring literature, the score matrix $\Theta^\star$ that optimizes Problem (B.1) is obtained by means of an eigenvector decomposition of the matrix $M = Y^\top X(X^\top X+\Omega)^{-1}X^\top Y$.

By definition of eigen-decomposition, the eigenvectors of the $M$ matrix (called $w$ in (B.7)) form a basis, so that any score vector $\theta$ can be expressed as a linear combination of them:

$$\theta_k = \sum_{m=1}^{K-1}\alpha_m w_m, \qquad \text{s.t.}\ \ \theta_k^\top\theta_k = 1. \qquad (B.8)$$

The score vectors' orthogonality constraint $\theta_k^\top\theta_k = 1$ can also be expressed as a function of this basis,

$$\left(\sum_{m=1}^{K-1}\alpha_m w_m\right)^{\!\top}\left(\sum_{m=1}^{K-1}\alpha_m w_m\right) = 1,$$

which, as per the eigenvector properties, can be reduced to

$$\sum_{m=1}^{K-1}\alpha_m^2 = 1. \qquad (B.9)$$

Let $M$ be multiplied by a score vector $\theta_k$, which can be replaced by its linear combination of eigenvectors $w_m$ (B.8):

$$M\theta_k = M\sum_{m=1}^{K-1}\alpha_m w_m = \sum_{m=1}^{K-1}\alpha_m M w_m.$$

As the $w_m$ are the eigenvectors of the $M$ matrix, the relationship $Mw_m = \lambda_m w_m$ can be used to obtain

$$M\theta_k = \sum_{m=1}^{K-1}\alpha_m\lambda_m w_m.$$

Multiplying both sides on the left by $\theta_k^\top$, written as its linear combination of eigenvectors, yields

$$\theta_k^\top M\theta_k = \left(\sum_{\ell=1}^{K-1}\alpha_\ell w_\ell\right)^{\!\top}\left(\sum_{m=1}^{K-1}\alpha_m\lambda_m w_m\right).$$

This equation can be simplified using the orthogonality property of eigenvectors, according to which $w_\ell^\top w_m$ is zero for any $\ell\neq m$, giving

$$\theta_k^\top M\theta_k = \sum_{m=1}^{K-1}\alpha_m^2\lambda_m.$$


The optimization Problem (B.5) for discriminant direction $k$ can be rewritten as

$$\max_{\theta_k\in\mathbb{R}^{K\times 1}}\ \theta_k^\top M\theta_k = \max_{\theta_k\in\mathbb{R}^{K\times 1}}\ \sum_{m=1}^{K-1}\alpha_m^2\lambda_m \qquad (B.10)$$
$$\text{with}\quad \theta_k = \sum_{m=1}^{K-1}\alpha_m w_m \quad\text{and}\quad \sum_{m=1}^{K-1}\alpha_m^2 = 1.$$

One way of maximizing Problem (B.10) is choosing $\alpha_m = 1$ for $m = k$ and $\alpha_m = 0$ otherwise. Hence, as $\theta_k = \sum_{m=1}^{K-1}\alpha_m w_m$, the resulting score vector $\theta_k$ will be equal to the $k$th eigenvector $w_k$.

As a summary, it can be concluded that the solution to the original problem (B.1) can be achieved by an eigenvector decomposition of the matrix $M = Y^\top X(X^\top X+\Omega)^{-1}X^\top Y$.


C Solving Fisher's Discriminant Problem

The classical Fisher's discriminant problem seeks a projection that best separates the class centers while every class remains compact. This is formalized as looking for a projection such that the projected data has maximal between-class variance, under a unitary constraint on the within-class variance:

$$\max_{\beta\in\mathbb{R}^p}\ \beta^\top\Sigma_B\beta \qquad (C.1a)$$
$$\text{s.t.}\quad \beta^\top\Sigma_W\beta = 1, \qquad (C.1b)$$

where $\Sigma_B$ and $\Sigma_W$ are respectively the between-class variance and the within-class variance of the original $p$-dimensional data.

The Lagrangian of Problem (C.1) is

$$\mathcal{L}(\beta,\nu) = \beta^\top\Sigma_B\beta - \nu\big(\beta^\top\Sigma_W\beta - 1\big),$$

so that its first derivative with respect to $\beta$ is

$$\frac{\partial\mathcal{L}(\beta,\nu)}{\partial\beta} = 2\Sigma_B\beta - 2\nu\Sigma_W\beta.$$

A necessary optimality condition for $\beta^\star$ is that this derivative is zero, that is,

$$\Sigma_B\beta^\star = \nu\Sigma_W\beta^\star.$$

Provided $\Sigma_W$ is full rank, we have

$$\Sigma_W^{-1}\Sigma_B\beta^\star = \nu\beta^\star. \qquad (C.2)$$

Thus the solutions $\beta^\star$ match the definition of an eigenvector of the matrix $\Sigma_W^{-1}\Sigma_B$, with eigenvalue $\nu$. To characterize this eigenvalue, we note that the objective function (C.1a) can be expressed as follows:

$$\beta^{\star\top}\Sigma_B\beta^\star = \beta^{\star\top}\Sigma_W\Sigma_W^{-1}\Sigma_B\beta^\star = \nu\,\beta^{\star\top}\Sigma_W\beta^\star \ \ \text{from (C.2)}\ \ = \nu \ \ \text{from (C.1b)}.$$

That is, the optimal value of the objective function to be maximized is the eigenvalue $\nu$. Hence $\nu$ is the largest eigenvalue of $\Sigma_W^{-1}\Sigma_B$, and $\beta^\star$ is any eigenvector corresponding to this maximal eigenvalue.
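Numerically, this is the generalized eigenvalue problem $\Sigma_B\beta = \nu\Sigma_W\beta$, which standard linear-algebra libraries solve directly. A minimal sketch (SciPy), assuming symmetric estimates of $\Sigma_B$ and a positive definite $\Sigma_W$ are already available; the function name is hypothetical:

```python
import numpy as np
from scipy.linalg import eigh

def fisher_direction(Sigma_B, Sigma_W):
    """Leading discriminant direction: largest generalized eigenvalue of (Sigma_B, Sigma_W)."""
    # eigh solves Sigma_B v = nu Sigma_W v for symmetric Sigma_B and positive definite Sigma_W.
    eigvals, eigvecs = eigh(Sigma_B, Sigma_W)
    beta = eigvecs[:, -1]                      # eigenvalues are returned in ascending order
    beta /= np.sqrt(beta @ Sigma_W @ beta)     # enforce the constraint beta' Sigma_W beta = 1
    return beta, eigvals[-1]
```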


D Alternative Variational Formulation for the Group-Lasso

In this appendix, an alternative to the variational form of the group-Lasso (4.21) presented in Section 4.3.1 is proposed:

$$\min_{\tau\in\mathbb{R}^p}\ \min_{B\in\mathbb{R}^{p\times(K-1)}}\ J(B) + \lambda\sum_{j=1}^{p} w_j^2\,\frac{\|\beta^j\|_2^2}{\tau_j} \qquad (D.1a)$$
$$\text{s.t.}\quad \sum_{j=1}^{p}\tau_j = 1, \qquad (D.1b)$$
$$\phantom{\text{s.t.}}\quad \tau_j \ge 0,\ \ j = 1,\dots,p. \qquad (D.1c)$$

Following the approach detailed in Section 4.3.1, its equivalence with the standard group-Lasso formulation is demonstrated here. Let $B\in\mathbb{R}^{p\times(K-1)}$ be a matrix composed of row vectors $\beta^j\in\mathbb{R}^{K-1}$, $B = \big(\beta^{1\top},\dots,\beta^{p\top}\big)^\top$.

$$\mathcal{L}(B,\tau,\lambda,\nu_0,\nu_j) = J(B) + \lambda\sum_{j=1}^{p} w_j^2\,\frac{\|\beta^j\|_2^2}{\tau_j} + \nu_0\Big(\sum_{j=1}^{p}\tau_j - 1\Big) - \sum_{j=1}^{p}\nu_j\tau_j. \qquad (D.2)$$

The starting point is the Lagrangian (D.2), which is differentiated with respect to $\tau_j$ to get the optimal value $\tau_j^\star$:

$$\frac{\partial\mathcal{L}(B,\tau,\lambda,\nu_0,\nu_j)}{\partial\tau_j}\bigg|_{\tau_j=\tau_j^\star} = 0 \;\Rightarrow\; -\lambda w_j^2\,\frac{\|\beta^j\|_2^2}{\tau_j^{\star2}} + \nu_0 - \nu_j = 0$$
$$\Rightarrow\; -\lambda w_j^2\|\beta^j\|_2^2 + \nu_0\tau_j^{\star2} - \nu_j\tau_j^{\star2} = 0$$
$$\Rightarrow\; -\lambda w_j^2\|\beta^j\|_2^2 + \nu_0\tau_j^{\star2} = 0.$$

The last two expressions are related through one property of the Lagrange multipliers, which states that $\nu_j\,g_j(\tau^\star) = 0$, where $\nu_j$ is the Lagrange multiplier and $g_j(\tau)$ is the corresponding inequality constraint. Then the optimal $\tau_j^\star$ can be deduced:

$$\tau_j^\star = \sqrt{\frac{\lambda}{\nu_0}}\; w_j\|\beta^j\|_2.$$

Placing this optimal value of $\tau_j^\star$ into constraint (D.1b),

$$\sum_{j=1}^{p}\tau_j = 1 \;\Rightarrow\; \tau_j^\star = \frac{w_j\|\beta^j\|_2}{\sum_{j'=1}^{p} w_{j'}\|\beta^{j'}\|_2}. \qquad (D.3)$$


With this value of $\tau_j^\star$, Problem (D.1) is equivalent to

$$\min_{B\in\mathbb{R}^{p\times(K-1)}}\ J(B) + \lambda\Big(\sum_{j=1}^{p} w_j\|\beta^j\|_2\Big)^2. \qquad (D.4)$$

This problem is a slight alteration of the standard group-Lasso, as the penalty is squared compared to the usual form. This square only affects the strength of the penalty, and the usual properties of the group-Lasso apply to the solution of Problem (D.4). In particular, its solution is expected to be sparse, with some null vectors $\beta^j$.

The penalty term of (D.1a) can be conveniently presented as $\lambda\,B^\top\Omega B$, where

$$\Omega = \mathrm{diag}\left(\frac{w_1^2}{\tau_1},\,\frac{w_2^2}{\tau_2},\,\dots,\,\frac{w_p^2}{\tau_p}\right). \qquad (D.5)$$

Using the value of $\tau_j^\star$ from (D.3), each diagonal component of $\Omega$ is

$$(\Omega)_{jj} = \frac{w_j\,\sum_{j'=1}^{p} w_{j'}\|\beta^{j'}\|_2}{\|\beta^j\|_2}. \qquad (D.6)$$

In the following paragraphs, the optimality conditions and properties developed for the quadratic variational approach detailed in Section 4.3.1 are also derived for this alternative formulation.

D.1 Useful Properties

Lemma D.1. If $J$ is convex, Problem (D.1) is convex.

In what follows, $J$ will be a convex quadratic (hence smooth) function, in which case a necessary and sufficient optimality condition is that zero belongs to the subdifferential of the objective function, whose expression is provided in the following lemma.

Lemma D.2. For all $B\in\mathbb{R}^{p\times(K-1)}$, the subdifferential of the objective function of Problem (D.4) is

$$\left\{ V\in\mathbb{R}^{p\times(K-1)} :\ V = \frac{\partial J(B)}{\partial B} + 2\lambda\Big(\sum_{j=1}^{p} w_j\|\beta^j\|_2\Big)\,G \right\}, \qquad (D.7)$$

where $G = \big(g^{1\top},\dots,g^{p\top}\big)^\top$ is a $p\times(K-1)$ matrix defined as follows. Let $S(B)$ denote the row-wise support of $B$, $S(B) = \{j\in\{1,\dots,p\} : \|\beta^j\|_2\neq 0\}$; then we have

$$\forall j\in S(B), \quad g^j = w_j\|\beta^j\|_2^{-1}\beta^j, \qquad (D.8)$$
$$\forall j\notin S(B), \quad \|g^j\|_2 \le w_j. \qquad (D.9)$$


This condition results in an equality for the "active" non-zero vectors $\beta^j$ and an inequality for the other ones, which both provide essential building blocks of our algorithm.

Lemma D.3. Problem (D.4) admits at least one solution, which is unique if $J(B)$ is strictly convex. All critical points $B^\star$ of the objective function verifying the following conditions are global minima. Let $S(B^\star)$ denote the row-wise support of $B^\star$, $S(B^\star) = \{j\in\{1,\dots,p\} : \|\beta^{\star j}\|_2\neq 0\}$, and let $\bar{S}(B^\star)$ be its complement; then we have

$$\forall j\in S(B^\star), \quad -\frac{\partial J(B^\star)}{\partial\beta^j} = 2\lambda\Big(\sum_{j'=1}^{p} w_{j'}\|\beta^{\star j'}\|_2\Big)\, w_j\|\beta^{\star j}\|_2^{-1}\beta^{\star j}, \qquad (D.10a)$$
$$\forall j\in\bar{S}(B^\star), \quad \Big\|\frac{\partial J(B^\star)}{\partial\beta^j}\Big\|_2 \le 2\lambda\, w_j\Big(\sum_{j'=1}^{p} w_{j'}\|\beta^{\star j'}\|_2\Big). \qquad (D.10b)$$

In particular, Lemma D.3 provides a well-defined appraisal of the support of the solution, which is not easily handled from the direct analysis of the variational problem (D.1).

D.2 An Upper Bound on the Objective Function

Lemma D.4. The objective function of the variational form (D.1) is an upper bound on the group-Lasso objective function (D.4), and, for a given $B$, the gap between these objectives is null at $\tau^\star$ such that

$$\tau_j^\star = \frac{w_j\|\beta^j\|_2}{\sum_{j'=1}^{p} w_{j'}\|\beta^{j'}\|_2}.$$

Proof. The objective functions of (D.1) and (D.4) only differ in their second term. Let $\tau\in\mathbb{R}^p$ be any feasible vector; we have

$$\Big(\sum_{j=1}^{p} w_j\|\beta^j\|_2\Big)^2 = \Big(\sum_{j=1}^{p} \tau_j^{1/2}\,\frac{w_j\|\beta^j\|_2}{\tau_j^{1/2}}\Big)^2 \le \Big(\sum_{j=1}^{p}\tau_j\Big)\Big(\sum_{j=1}^{p} w_j^2\,\frac{\|\beta^j\|_2^2}{\tau_j}\Big) \le \sum_{j=1}^{p} w_j^2\,\frac{\|\beta^j\|_2^2}{\tau_j},$$

where the first inequality uses the Cauchy-Schwarz inequality and the second the definition of the feasibility set of $\tau$.
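The bound and its tightness at $\tau^\star$ are easy to verify numerically. The following sketch (NumPy, random data, illustrative only) compares the variational penalty $\sum_j w_j^2\|\beta^j\|_2^2/\tau_j$ with the squared group-Lasso penalty $(\sum_j w_j\|\beta^j\|_2)^2$ for a random feasible $\tau$ and for $\tau^\star$ as given in (D.3).

```python
import numpy as np

rng = np.random.default_rng(1)
p, K = 10, 4
B = rng.standard_normal((p, K - 1))      # rows play the role of the beta^j vectors
w = rng.uniform(0.5, 2.0, size=p)        # positive group weights
norms = np.linalg.norm(B, axis=1)        # ||beta^j||_2

group_lasso_sq = (w @ norms) ** 2        # (sum_j w_j ||beta^j||_2)^2

# Any feasible tau (positive entries summing to one) gives an upper bound ...
tau = rng.dirichlet(np.ones(p))
variational = np.sum(w**2 * norms**2 / tau)
assert variational >= group_lasso_sq - 1e-10

# ... and the bound is tight at tau* from (D.3).
tau_star = w * norms / np.sum(w * norms)
assert np.isclose(np.sum(w**2 * norms**2 / tau_star), group_lasso_sq)
```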


This lemma only holds for the alternative variational formulation described in this appendix. It is difficult to obtain the same result for the first variational form (Section 4.3.1), because the definitions of the feasible sets of $\tau$ and $\beta$ are intertwined.


E Invariance of the Group-Lasso to Unitary Transformations

The computational trick described in Section 5.2 for quadratic penalties can be applied to the group-Lasso provided that the following holds: if the regression coefficients $B_0$ are optimal for the score values $\Theta_0$, and if the optimal scores $\Theta^\star$ are obtained by a unitary transformation of $\Theta_0$, say $\Theta^\star = \Theta_0 V$ (where $V\in\mathbb{R}^{M\times M}$ is a unitary matrix), then $B^\star = B_0 V$ is optimal conditionally on $\Theta^\star$, that is, $(\Theta^\star, B^\star)$ is a global solution corresponding to the optimal scoring problem. To show this, we use the standard group-Lasso formulation and prove the following proposition.

Proposition E.1. Let $B^\star$ be a solution of

$$\min_{B\in\mathbb{R}^{p\times M}}\ \|Y - XB\|_F^2 + \lambda\sum_{j=1}^{p} w_j\|\beta^j\|_2, \qquad (E.1)$$

and let $\tilde{Y} = YV$, where $V\in\mathbb{R}^{M\times M}$ is a unitary matrix. Then $\tilde{B} = B^\star V$ is a solution of

$$\min_{B\in\mathbb{R}^{p\times M}}\ \|\tilde{Y} - XB\|_F^2 + \lambda\sum_{j=1}^{p} w_j\|\beta^j\|_2. \qquad (E.2)$$

Proof. The first-order necessary optimality conditions for $B^\star$ are

$$\forall j\in S(B^\star), \quad 2\,x^{j\top}\big(x^j\beta^{\star j} - Y\big) + \lambda w_j\|\beta^{\star j}\|_2^{-1}\beta^{\star j} = 0, \qquad (E.3a)$$
$$\forall j\notin S(B^\star), \quad 2\,\big\|x^{j\top}\big(x^j\beta^{\star j} - Y\big)\big\|_2 \le \lambda w_j, \qquad (E.3b)$$

where $S(B^\star)\subseteq\{1,\dots,p\}$ denotes the set of non-zero row vectors of $B^\star$, and $\bar{S}(B^\star)$ is its complement.

First, we note that, from the definition of $\tilde{B}$, we have $S(\tilde{B}) = S(B^\star)$. Then we may rewrite the above conditions as follows:

$$\forall j\in S(\tilde{B}), \quad 2\,x^{j\top}\big(x^j\tilde{\beta}^j - \tilde{Y}\big) + \lambda w_j\|\tilde{\beta}^j\|_2^{-1}\tilde{\beta}^j = 0, \qquad (E.4a)$$
$$\forall j\notin S(\tilde{B}), \quad 2\,\big\|x^{j\top}\big(x^j\tilde{\beta}^j - \tilde{Y}\big)\big\|_2 \le \lambda w_j, \qquad (E.4b)$$

where (E.4a) is obtained by multiplying both sides of Equation (E.3a) by $V$, and also uses that $VV^\top = I$, so that $\forall u\in\mathbb{R}^M$, $\|u^\top\|_2 = \|u^\top V\|_2$; Equation (E.4b) is also obtained from the latter relationship. Conditions (E.4) are then recognized as the first-order necessary conditions for $\tilde{B}$ to be a solution to Problem (E.2). As the latter is convex, these conditions are sufficient, which concludes the proof.
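The invariance can also be read directly on the objective function: a unitary $V$ changes neither the Frobenius term, since $\|(Y-XB)V\|_F = \|Y-XB\|_F$, nor the row norms of $B$. A small numerical illustration (NumPy, random data; the group-Lasso objective is written out explicitly and no solver is involved):

```python
import numpy as np

def group_lasso_objective(Y, X, B, w, lam):
    """Objective of (E.1): squared Frobenius loss plus weighted row-wise group penalty."""
    return np.linalg.norm(Y - X @ B, "fro") ** 2 + lam * np.sum(w * np.linalg.norm(B, axis=1))

rng = np.random.default_rng(2)
n, p, M = 30, 8, 3
X = rng.standard_normal((n, p))
Y = rng.standard_normal((n, M))
B = rng.standard_normal((p, M))
w = np.ones(p)
lam = 0.7

V, _ = np.linalg.qr(rng.standard_normal((M, M)))   # random orthogonal (unitary) matrix

# The objective at (Y, B) equals the objective at (YV, BV) for any B, so the bijection
# B -> BV maps minimizers of (E.1) onto minimizers of (E.2).
assert np.isclose(group_lasso_objective(Y, X, B, w, lam),
                  group_lasso_objective(Y @ V, X, B @ V, w, lam))
```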


F Expected Complete Likelihood and Likelihood

Section 7.1.2 explains that, with the maximization of the conditional expectation of the complete log-likelihood $Q(\theta,\theta')$ (7.7) by means of the EM algorithm, the log-likelihood (7.1) is also maximized. The value of the log-likelihood can be computed using its definition (7.1), but there is a shorter way to compute it from $Q(\theta,\theta')$ when the latter is available:

$$L(\theta) = \sum_{i=1}^{n}\log\Big(\sum_{k=1}^{K}\pi_k f_k(x_i;\theta_k)\Big), \qquad (F.1)$$
$$Q(\theta,\theta') = \sum_{i=1}^{n}\sum_{k=1}^{K} t_{ik}(\theta')\,\log\big(\pi_k f_k(x_i;\theta_k)\big), \qquad (F.2)$$
$$\text{with}\quad t_{ik}(\theta') = \frac{\pi'_k f_k(x_i;\theta'_k)}{\sum_{\ell}\pi'_\ell f_\ell(x_i;\theta'_\ell)}. \qquad (F.3)$$

In the EM algorithm, $\theta'$ denotes the model parameters at the previous iteration, $t_{ik}(\theta')$ are the posterior probability values computed from $\theta'$ at the previous E-step, and $\theta$, without "prime", denotes the parameters of the current iteration, to be obtained with the maximization of $Q(\theta,\theta')$.

Using (F.3), we have

$$\begin{aligned}
Q(\theta,\theta') &= \sum_{i,k} t_{ik}(\theta')\,\log\big(\pi_k f_k(x_i;\theta_k)\big)\\
&= \sum_{i,k} t_{ik}(\theta')\,\log\big(t_{ik}(\theta)\big) + \sum_{i,k} t_{ik}(\theta')\,\log\Big(\sum_{\ell}\pi_\ell f_\ell(x_i;\theta_\ell)\Big)\\
&= \sum_{i,k} t_{ik}(\theta')\,\log\big(t_{ik}(\theta)\big) + L(\theta).
\end{aligned}$$

In particular, after the evaluation of $t_{ik}$ in the E-step, where $\theta = \theta'$, the log-likelihood can be computed using the value of $Q(\theta,\theta)$ (7.7) and the entropy of the posterior probabilities:

$$L(\theta) = Q(\theta,\theta) - \sum_{i,k} t_{ik}(\theta)\,\log\big(t_{ik}(\theta)\big) = Q(\theta,\theta) + H(T).$$
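As a sanity check, the identity $L(\theta) = Q(\theta,\theta) + H(T)$ can be verified numerically on a small Gaussian mixture. The sketch below (NumPy and SciPy, arbitrary parameter values, not the Mix-GLOSS code) computes the log-likelihood both directly from (F.1) and through $Q(\theta,\theta)$ plus the entropy of the posteriors.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(3)
n, p, K = 200, 2, 3
X = rng.standard_normal((n, p))
pi = np.array([0.2, 0.5, 0.3])
mu = rng.standard_normal((K, p))
Sigma = np.eye(p)

# Component densities f_k(x_i; theta_k), stored as an (n, K) matrix.
dens = np.column_stack([multivariate_normal.pdf(X, mean=mu[k], cov=Sigma) for k in range(K)])

L_direct = np.sum(np.log(dens @ pi))               # direct log-likelihood (F.1)

T = (dens * pi) / (dens @ pi)[:, None]             # posteriors t_ik (F.3) at theta' = theta
Q = np.sum(T * np.log(dens * pi))                  # Q(theta, theta)
H = -np.sum(T * np.log(T))                         # entropy of the posteriors
assert np.isclose(L_direct, Q + H)
```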


G Derivation of the M-Step Equations

This appendix shows the whole process to obtain expressions (7.10), (7.11) and (7.12) in the context of a Gaussian mixture model with a common covariance matrix. The criterion is defined as

$$\begin{aligned}
Q(\theta,\theta') &= \sum_{i,k} t_{ik}(\theta')\,\log\big(\pi_k f_k(x_i;\theta_k)\big)\\
&= \sum_{k}\Big(\sum_{i} t_{ik}\Big)\log\pi_k - \frac{np}{2}\log(2\pi) - \frac{n}{2}\log|\Sigma| - \frac{1}{2}\sum_{i,k} t_{ik}\,(x_i-\mu_k)^\top\Sigma^{-1}(x_i-\mu_k),
\end{aligned}$$

which has to be maximized subject to $\sum_k\pi_k = 1$.

The Lagrangian of this problem is

$$\mathcal{L}(\theta) = Q(\theta,\theta') + \lambda\Big(\sum_k\pi_k - 1\Big).$$

The partial derivatives of the Lagrangian are set to zero to obtain the optimal values of $\pi_k$, $\mu_k$ and $\Sigma$.

G.1 Prior probabilities

$$\frac{\partial\mathcal{L}(\theta)}{\partial\pi_k} = 0 \;\Longleftrightarrow\; \frac{1}{\pi_k}\sum_i t_{ik} + \lambda = 0,$$

where $\lambda$ is identified from the constraint, leading to

$$\pi_k = \frac{1}{n}\sum_i t_{ik}.$$


G.2 Means

$$\frac{\partial\mathcal{L}(\theta)}{\partial\mu_k} = 0 \;\Longleftrightarrow\; -\frac{1}{2}\sum_i t_{ik}\,2\Sigma^{-1}(\mu_k - x_i) = 0 \;\Rightarrow\; \mu_k = \frac{\sum_i t_{ik}\,x_i}{\sum_i t_{ik}}.$$

G.3 Covariance Matrix

$$\frac{\partial\mathcal{L}(\theta)}{\partial\Sigma^{-1}} = 0 \;\Longleftrightarrow\; \underbrace{\frac{n}{2}\Sigma}_{\text{as per Property 4}} - \underbrace{\frac{1}{2}\sum_{i,k} t_{ik}(x_i-\mu_k)(x_i-\mu_k)^\top}_{\text{as per Property 5}} = 0 \;\Rightarrow\; \Sigma = \frac{1}{n}\sum_{i,k} t_{ik}(x_i-\mu_k)(x_i-\mu_k)^\top.$$
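These closed-form updates are what the M-step implements. A direct transcription (NumPy) is sketched below; the responsibilities T of shape (n, K) are assumed to come from the E-step, and the function is an illustration rather than the Mix-GLOSS implementation.

```python
import numpy as np

def m_step(X, T):
    """M-step of a Gaussian mixture with a common covariance matrix.

    X: (n, p) data, T: (n, K) posterior probabilities t_ik from the E-step.
    Returns the priors pi (K,), the means mu (K, p) and the shared covariance Sigma (p, p).
    """
    n, p = X.shape
    nk = T.sum(axis=0)                          # sum_i t_ik, one value per component
    pi = nk / n                                 # pi_k = (1/n) sum_i t_ik
    mu = (T.T @ X) / nk[:, None]                # mu_k = sum_i t_ik x_i / sum_i t_ik
    Sigma = np.zeros((p, p))
    for k in range(len(nk)):
        Xc = X - mu[k]                          # data centered on component k
        Sigma += (T[:, k, None] * Xc).T @ Xc    # sum_i t_ik (x_i - mu_k)(x_i - mu_k)'
    return pi, mu, Sigma / n                    # Sigma = (1/n) sum_{i,k} t_ik (...)(...)'
```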



  • SANCHEZ MERCHANTE PDTpdf
  • Thesis Luis Francisco Sanchez Merchantepdf
    • List of figures
    • List of tables
    • Notation and Symbols
    • Context and Foundations
      • Context
      • Regularization for Feature Selection
        • Motivations
        • Categorization of Feature Selection Techniques
        • Regularization
          • Important Properties
          • Pure Penalties
          • Hybrid Penalties
          • Mixed Penalties
          • Sparsity Considerations
          • Optimization Tools for Regularized Problems
            • Sparse Linear Discriminant Analysis
              • Abstract
              • Feature Selection in Fisher Discriminant Analysis
                • Fisher Discriminant Analysis
                • Feature Selection in LDA Problems
                  • Inertia Based
                  • Regression Based
                      • Formalizing the Objective
                        • From Optimal Scoring to Linear Discriminant Analysis
                          • Penalized Optimal Scoring Problem
                          • Penalized Canonical Correlation Analysis
                          • Penalized Linear Discriminant Analysis
                          • Summary
                            • Practicalities
                              • Solution of the Penalized Optimal Scoring Regression
                              • Distance Evaluation
                              • Posterior Probability Evaluation
                              • Graphical Representation
                                • From Sparse Optimal Scoring to Sparse LDA
                                  • A Quadratic Variational Form
                                  • Group-Lasso OS as Penalized LDA
                                      • GLOSS Algorithm
                                        • Regression Coefficients Updates
                                          • Cholesky decomposition
                                          • Numerical Stability
                                            • Score Matrix
                                            • Optimality Conditions
                                            • Active and Inactive Sets
                                            • Penalty Parameter
                                            • Options and Variants
                                              • Scaling Variables
                                              • Sparse Variant
                                              • Diagonal Variant
                                              • Elastic net and Structured Variant
                                                  • Experimental Results
                                                    • Normalization
                                                    • Decision Thresholds
                                                    • Simulated Data
                                                    • Gene Expression Data
                                                    • Correlated Data
                                                      • Discussion
                                                        • Sparse Clustering Analysis
                                                          • Abstract
                                                          • Feature Selection in Mixture Models
                                                            • Mixture Models
                                                              • Model
                                                              • Parameter Estimation The EM Algorithm
                                                                • Feature Selection in Model-Based Clustering
                                                                  • Based on Penalized Likelihood
                                                                  • Based on Model Variants
                                                                  • Based on Model Selection
                                                                      • Theoretical Foundations
                                                                        • Resolving EM with Optimal Scoring
                                                                          • Relationship Between the M-Step and Linear Discriminant Analysis
                                                                          • Relationship Between Optimal Scoring and Linear Discriminant Analysis
                                                                          • Clustering Using Penalized Optimal Scoring
                                                                          • From Sparse Optimal Scoring to Sparse Linear Discriminant Analysis
                                                                            • Optimized Criterion
                                                                              • A Bayesian Derivation
                                                                              • Maximum a Posteriori Estimator
                                                                                  • Mix-GLOSS Algorithm
                                                                                    • Mix-GLOSS
                                                                                      • Outer Loop Whole Algorithm Repetitions
                                                                                      • Penalty Parameter Loop
                                                                                      • Inner Loop EM Algorithm
                                                                                        • Model Selection
                                                                                          • Experimental Results
                                                                                            • Tested Clustering Algorithms
                                                                                            • Results
                                                                                            • Discussion
                                                                                                • Conclusions
                                                                                                • Appendix
                                                                                                  • Matrix Properties
                                                                                                  • The Penalized-OS Problem is an Eigenvector Problem
                                                                                                    • How to Solve the Eigenvector Decomposition
                                                                                                    • Why the OS Problem is Solved as an Eigenvector Problem
                                                                                                      • Solving Fishers Discriminant Problem
                                                                                                      • Alternative Variational Formulation for the Group-Lasso
                                                                                                        • Useful Properties
                                                                                                        • An Upper Bound on the Objective Function
                                                                                                          • Invariance of the Group-Lasso to Unitary Transformations
                                                                                                          • Expected Complete Likelihood and Likelihood
                                                                                                          • Derivation of the M-Step Equations
                                                                                                            • Prior probabilities
                                                                                                            • Means
                                                                                                            • Covariance Matrix
                                                                                                                • Bibliography
Page 10: Luis Francisco Sanchez Merchante To cite this version
Page 11: Luis Francisco Sanchez Merchante To cite this version
Page 12: Luis Francisco Sanchez Merchante To cite this version
Page 13: Luis Francisco Sanchez Merchante To cite this version
Page 14: Luis Francisco Sanchez Merchante To cite this version
Page 15: Luis Francisco Sanchez Merchante To cite this version
Page 16: Luis Francisco Sanchez Merchante To cite this version
Page 17: Luis Francisco Sanchez Merchante To cite this version
Page 18: Luis Francisco Sanchez Merchante To cite this version
Page 19: Luis Francisco Sanchez Merchante To cite this version
Page 20: Luis Francisco Sanchez Merchante To cite this version
Page 21: Luis Francisco Sanchez Merchante To cite this version
Page 22: Luis Francisco Sanchez Merchante To cite this version
Page 23: Luis Francisco Sanchez Merchante To cite this version
Page 24: Luis Francisco Sanchez Merchante To cite this version
Page 25: Luis Francisco Sanchez Merchante To cite this version
Page 26: Luis Francisco Sanchez Merchante To cite this version
Page 27: Luis Francisco Sanchez Merchante To cite this version
Page 28: Luis Francisco Sanchez Merchante To cite this version
Page 29: Luis Francisco Sanchez Merchante To cite this version
Page 30: Luis Francisco Sanchez Merchante To cite this version
Page 31: Luis Francisco Sanchez Merchante To cite this version
Page 32: Luis Francisco Sanchez Merchante To cite this version
Page 33: Luis Francisco Sanchez Merchante To cite this version
Page 34: Luis Francisco Sanchez Merchante To cite this version
Page 35: Luis Francisco Sanchez Merchante To cite this version
Page 36: Luis Francisco Sanchez Merchante To cite this version
Page 37: Luis Francisco Sanchez Merchante To cite this version
Page 38: Luis Francisco Sanchez Merchante To cite this version
Page 39: Luis Francisco Sanchez Merchante To cite this version
Page 40: Luis Francisco Sanchez Merchante To cite this version
Page 41: Luis Francisco Sanchez Merchante To cite this version
Page 42: Luis Francisco Sanchez Merchante To cite this version
Page 43: Luis Francisco Sanchez Merchante To cite this version
Page 44: Luis Francisco Sanchez Merchante To cite this version
Page 45: Luis Francisco Sanchez Merchante To cite this version
Page 46: Luis Francisco Sanchez Merchante To cite this version
Page 47: Luis Francisco Sanchez Merchante To cite this version
Page 48: Luis Francisco Sanchez Merchante To cite this version
Page 49: Luis Francisco Sanchez Merchante To cite this version
Page 50: Luis Francisco Sanchez Merchante To cite this version
Page 51: Luis Francisco Sanchez Merchante To cite this version
Page 52: Luis Francisco Sanchez Merchante To cite this version
Page 53: Luis Francisco Sanchez Merchante To cite this version
Page 54: Luis Francisco Sanchez Merchante To cite this version
Page 55: Luis Francisco Sanchez Merchante To cite this version
Page 56: Luis Francisco Sanchez Merchante To cite this version
Page 57: Luis Francisco Sanchez Merchante To cite this version
Page 58: Luis Francisco Sanchez Merchante To cite this version
Page 59: Luis Francisco Sanchez Merchante To cite this version
Page 60: Luis Francisco Sanchez Merchante To cite this version
Page 61: Luis Francisco Sanchez Merchante To cite this version
Page 62: Luis Francisco Sanchez Merchante To cite this version
Page 63: Luis Francisco Sanchez Merchante To cite this version
Page 64: Luis Francisco Sanchez Merchante To cite this version
Page 65: Luis Francisco Sanchez Merchante To cite this version
Page 66: Luis Francisco Sanchez Merchante To cite this version
Page 67: Luis Francisco Sanchez Merchante To cite this version
Page 68: Luis Francisco Sanchez Merchante To cite this version
Page 69: Luis Francisco Sanchez Merchante To cite this version
Page 70: Luis Francisco Sanchez Merchante To cite this version
Page 71: Luis Francisco Sanchez Merchante To cite this version
Page 72: Luis Francisco Sanchez Merchante To cite this version
Page 73: Luis Francisco Sanchez Merchante To cite this version
Page 74: Luis Francisco Sanchez Merchante To cite this version
Page 75: Luis Francisco Sanchez Merchante To cite this version
Page 76: Luis Francisco Sanchez Merchante To cite this version
Page 77: Luis Francisco Sanchez Merchante To cite this version
Page 78: Luis Francisco Sanchez Merchante To cite this version
Page 79: Luis Francisco Sanchez Merchante To cite this version
Page 80: Luis Francisco Sanchez Merchante To cite this version
Page 81: Luis Francisco Sanchez Merchante To cite this version
Page 82: Luis Francisco Sanchez Merchante To cite this version
Page 83: Luis Francisco Sanchez Merchante To cite this version
Page 84: Luis Francisco Sanchez Merchante To cite this version
Page 85: Luis Francisco Sanchez Merchante To cite this version
Page 86: Luis Francisco Sanchez Merchante To cite this version
Page 87: Luis Francisco Sanchez Merchante To cite this version
Page 88: Luis Francisco Sanchez Merchante To cite this version
Page 89: Luis Francisco Sanchez Merchante To cite this version
Page 90: Luis Francisco Sanchez Merchante To cite this version
Page 91: Luis Francisco Sanchez Merchante To cite this version
Page 92: Luis Francisco Sanchez Merchante To cite this version
Page 93: Luis Francisco Sanchez Merchante To cite this version
Page 94: Luis Francisco Sanchez Merchante To cite this version
Page 95: Luis Francisco Sanchez Merchante To cite this version
Page 96: Luis Francisco Sanchez Merchante To cite this version
Page 97: Luis Francisco Sanchez Merchante To cite this version
Page 98: Luis Francisco Sanchez Merchante To cite this version
Page 99: Luis Francisco Sanchez Merchante To cite this version
Page 100: Luis Francisco Sanchez Merchante To cite this version
Page 101: Luis Francisco Sanchez Merchante To cite this version
Page 102: Luis Francisco Sanchez Merchante To cite this version
Page 103: Luis Francisco Sanchez Merchante To cite this version
Page 104: Luis Francisco Sanchez Merchante To cite this version
Page 105: Luis Francisco Sanchez Merchante To cite this version
Page 106: Luis Francisco Sanchez Merchante To cite this version
Page 107: Luis Francisco Sanchez Merchante To cite this version
Page 108: Luis Francisco Sanchez Merchante To cite this version
Page 109: Luis Francisco Sanchez Merchante To cite this version
Page 110: Luis Francisco Sanchez Merchante To cite this version
Page 111: Luis Francisco Sanchez Merchante To cite this version
Page 112: Luis Francisco Sanchez Merchante To cite this version
Page 113: Luis Francisco Sanchez Merchante To cite this version
Page 114: Luis Francisco Sanchez Merchante To cite this version
Page 115: Luis Francisco Sanchez Merchante To cite this version
Page 116: Luis Francisco Sanchez Merchante To cite this version
Page 117: Luis Francisco Sanchez Merchante To cite this version
Page 118: Luis Francisco Sanchez Merchante To cite this version
Page 119: Luis Francisco Sanchez Merchante To cite this version
Page 120: Luis Francisco Sanchez Merchante To cite this version
Page 121: Luis Francisco Sanchez Merchante To cite this version
Page 122: Luis Francisco Sanchez Merchante To cite this version
Page 123: Luis Francisco Sanchez Merchante To cite this version
Page 124: Luis Francisco Sanchez Merchante To cite this version
Page 125: Luis Francisco Sanchez Merchante To cite this version
Page 126: Luis Francisco Sanchez Merchante To cite this version
Page 127: Luis Francisco Sanchez Merchante To cite this version
Page 128: Luis Francisco Sanchez Merchante To cite this version
Page 129: Luis Francisco Sanchez Merchante To cite this version