statistical methods for the analysis of high …...2018/12/02  · statistical methods for the...

81
Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet 1,2 , Matthew Sutton 2 , Pierre Lafaye de Micheaux 3 , Boris Heljbum 4 , Rodolphe Thi ´ ebaut 4 1 University de Pau et Pays de l’Adour, LMAP, E2S-UPPA. 2 ARC Centre of Excellence for Mathematical and Statistical Frontiers, 3 UNSW, 4 Inria, SISTM Big Data PLS Methods December 2018, AASC Rotorua NZ 1/54

Upload: others

Post on 07-Jun-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

Statistical Methods for the Analysis ofHigh-Dimensional and Massive Data

using R

Benoit Liquet1,2, Matthew Sutton2, Pierre Lafaye de Micheaux3,Boris Heljbum4, Rodolphe Thiebaut4

1 University de Pau et Pays de l’Adour, LMAP, E2S-UPPA.2 ARC Centre of Excellence for Mathematical and Statistical Frontiers,

3 UNSW,4 Inria, SISTM

Big Data PLS Methods December 2018, AASC Rotorua NZ 1/54

Page 2: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

Contents

1. Motivation: Integrative Analysis for group data

2. Application on a HIV vaccine study

3. PLS approaches: SVD, PLS-W2A, canonical, regression

4. Sparse ModelsI Lasso penaltyI Group penaltyI Group and Sparse Group PLS

5. R package: sgPLS, sgsPLS, BIG-PLS

6. Regularized PLS Scalable to BIG-DATA

7. Concluding remarks

Big Data PLS Methods December 2018, AASC Rotorua NZ 2/54

Page 3: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

Integrative Analysis

Wikipedia. Data integration “involves combining data residing in dif-ferent sources and providing users with a unified view of these data.This process becomes significant in a variety of situations, which in-clude both commercial and scientific domains”.

System Biology. Integrative Analysis: Analysis of heterogeneoustypes of data from inter-platform technologies.

Goal. Combine multiple types of data:I Contribute to a better understanding of biological mechanisms.I Have the potential to improve the diagnosis and treatments of

complex diseases.

Big Data PLS Methods December 2018, AASC Rotorua NZ 3/54

Page 4: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

Example: Data definition

- n observations

- p variables

Xn

- n observations

- q variables

Yn

p q

I “Omics.” Y matrix: gene expression, X matrix: SNP (single nu-cleotide polymorphism). Many others such as proteomic, metabolomicdata.

I “Neuroimaging”. Y matrix: behavioral variables, X matrix: brainactivity (e.g., EEG, fMRI, NIRS)

I “Neuroimaging Genetics.” Y matrix: DTI (Diffusion Tensor Imag-ing), X matrix: SNP

I “Ecology/Environment.” Y matrix: Water quality variables , X ma-trix: Landscape variables

Big Data PLS Methods December 2018, AASC Rotorua NZ 4/54

Page 5: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

Example: Data definition

- n observations

- p variables

Xn

- n observations

- q variables

Yn

p q

I “Omics.” Y matrix: gene expression, X matrix: SNP (single nu-cleotide polymorphism). Many others such as proteomic, metabolomicdata.

I “Neuroimaging”. Y matrix: behavioral variables, X matrix: brainactivity (e.g., EEG, fMRI, NIRS)

I “Neuroimaging Genetics.” Y matrix: DTI (Diffusion Tensor Imag-ing), X matrix: SNP

I “Ecology/Environment.” Y matrix: Water quality variables , X ma-trix: Landscape variables

Big Data PLS Methods December 2018, AASC Rotorua NZ 4/54

Page 6: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

Example: Data definition

- n observations

- p variables

Xn

- n observations

- q variables

Yn

p q

I “Omics.” Y matrix: gene expression, X matrix: SNP (single nu-cleotide polymorphism). Many others such as proteomic, metabolomicdata.

I “Neuroimaging”. Y matrix: behavioral variables, X matrix: brainactivity (e.g., EEG, fMRI, NIRS)

I “Neuroimaging Genetics.” Y matrix: DTI (Diffusion Tensor Imag-ing), X matrix: SNP

I “Ecology/Environment.” Y matrix: Water quality variables , X ma-trix: Landscape variables

Big Data PLS Methods December 2018, AASC Rotorua NZ 4/54

Page 7: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

Example: Data definition

- n observations

- p variables

Xn

- n observations

- q variables

Yn

p q

I “Omics.” Y matrix: gene expression, X matrix: SNP (single nu-cleotide polymorphism). Many others such as proteomic, metabolomicdata.

I “Neuroimaging”. Y matrix: behavioral variables, X matrix: brainactivity (e.g., EEG, fMRI, NIRS)

I “Neuroimaging Genetics.” Y matrix: DTI (Diffusion Tensor Imag-ing), X matrix: SNP

I “Ecology/Environment.” Y matrix: Water quality variables , X ma-trix: Landscape variables

Big Data PLS Methods December 2018, AASC Rotorua NZ 4/54

Page 8: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

Example: Data definition

- n observations

- p variables

Xn

- n observations

- q variables

Yn

p q

I “Omics.” Y matrix: gene expression, X matrix: SNP (single nu-cleotide polymorphism). Many others such as proteomic, metabolomicdata.

I “Neuroimaging”. Y matrix: behavioral variables, X matrix: brainactivity (e.g., EEG, fMRI, NIRS)

I “Neuroimaging Genetics.” Y matrix: DTI (Diffusion Tensor Imag-ing), X matrix: SNP

I “Ecology/Environment.” Y matrix: Water quality variables , X ma-trix: Landscape variables

Big Data PLS Methods December 2018, AASC Rotorua NZ 4/54

Page 9: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

Data: Constraints and AimsI Main constraint: colinearity among the variables, or situation with

p > n or q > n.

I Two Aims:1. Symmetric situation. Analyze the association between two blocks

of information. Analysis focused on shared information.2. Asymmetric situation. X matrix= predictors and Y matrix=

response variables. Analysis focused on prediction.

I Partial Least Square Family: dimension reduction approachesI PLS finds pairs of latent vectors ξ = Xu, ω = Yv with maximal

covariance.

e.g., ξ = u1 × SNP1 + u2 × SNP2 + · · ·+ up × SNPp

I Symmetric situation and Asymmetric situation.I Matrix decomposition of X and Y into successive latent variables.

Latent variables: are not directly observed but are rather inferred(through a mathematical model) from other variables that are observed(directly measured). Capture an underlying phenomenon (e.g., health).

Big Data PLS Methods December 2018, AASC Rotorua NZ 5/54

Page 10: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

Data: Constraints and AimsI Main constraint: colinearity among the variables, or situation with

p > n or q > n.I Two Aims:

1. Symmetric situation. Analyze the association between two blocksof information. Analysis focused on shared information.

2. Asymmetric situation. X matrix= predictors and Y matrix=response variables. Analysis focused on prediction.

I Partial Least Square Family: dimension reduction approachesI PLS finds pairs of latent vectors ξ = Xu, ω = Yv with maximal

covariance.

e.g., ξ = u1 × SNP1 + u2 × SNP2 + · · ·+ up × SNPp

I Symmetric situation and Asymmetric situation.I Matrix decomposition of X and Y into successive latent variables.

Latent variables: are not directly observed but are rather inferred(through a mathematical model) from other variables that are observed(directly measured). Capture an underlying phenomenon (e.g., health).

Big Data PLS Methods December 2018, AASC Rotorua NZ 5/54

Page 11: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

Data: Constraints and AimsI Main constraint: colinearity among the variables, or situation with

p > n or q > n.I Two Aims:

1. Symmetric situation. Analyze the association between two blocksof information. Analysis focused on shared information.

2. Asymmetric situation. X matrix= predictors and Y matrix=response variables. Analysis focused on prediction.

I Partial Least Square Family: dimension reduction approachesI PLS finds pairs of latent vectors ξ = Xu, ω = Yv with maximal

covariance.

e.g., ξ = u1 × SNP1 + u2 × SNP2 + · · ·+ up × SNPp

I Symmetric situation and Asymmetric situation.I Matrix decomposition of X and Y into successive latent variables.

Latent variables: are not directly observed but are rather inferred(through a mathematical model) from other variables that are observed(directly measured). Capture an underlying phenomenon (e.g., health).

Big Data PLS Methods December 2018, AASC Rotorua NZ 5/54

Page 12: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

Data: Constraints and AimsI Main constraint: colinearity among the variables, or situation with

p > n or q > n.I Two Aims:

1. Symmetric situation. Analyze the association between two blocksof information. Analysis focused on shared information.

2. Asymmetric situation. X matrix= predictors and Y matrix=response variables. Analysis focused on prediction.

I Partial Least Square Family: dimension reduction approaches

I PLS finds pairs of latent vectors ξ = Xu, ω = Yv with maximalcovariance.

e.g., ξ = u1 × SNP1 + u2 × SNP2 + · · ·+ up × SNPp

I Symmetric situation and Asymmetric situation.I Matrix decomposition of X and Y into successive latent variables.

Latent variables: are not directly observed but are rather inferred(through a mathematical model) from other variables that are observed(directly measured). Capture an underlying phenomenon (e.g., health).

Big Data PLS Methods December 2018, AASC Rotorua NZ 5/54

Page 13: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

Data: Constraints and AimsI Main constraint: colinearity among the variables, or situation with

p > n or q > n.I Two Aims:

1. Symmetric situation. Analyze the association between two blocksof information. Analysis focused on shared information.

2. Asymmetric situation. X matrix= predictors and Y matrix=response variables. Analysis focused on prediction.

I Partial Least Square Family: dimension reduction approachesI PLS finds pairs of latent vectors ξ = Xu, ω = Yv with maximal

covariance.

e.g., ξ = u1 × SNP1 + u2 × SNP2 + · · ·+ up × SNPp

I Symmetric situation and Asymmetric situation.I Matrix decomposition of X and Y into successive latent variables.

Latent variables: are not directly observed but are rather inferred(through a mathematical model) from other variables that are observed(directly measured). Capture an underlying phenomenon (e.g., health).

Big Data PLS Methods December 2018, AASC Rotorua NZ 5/54

Page 14: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

PLS and sparse PLS

Classical PLSI Output of PLS: H pairs of latent variables (ξh ,ωh), h = 1, . . . ,H.I Reduction method (H << min(p, q)). But no variable selection for

extracting the most relevant (original) variables from each latentvariable.

sparse PLSI sparse PLS selects the relevant SNPsI Some coefficients u` are equal to 0ξh = u1 ×SNP1 + u2︸︷︷︸

=0

×SNP2 + u3︸︷︷︸=0

×SNP3 + · · ·+up ×SNPp

I The sPLS components are linear combinations of the selectedvariables

Big Data PLS Methods December 2018, AASC Rotorua NZ 6/54

Page 15: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

PLS and sparse PLS

Classical PLSI Output of PLS: H pairs of latent variables (ξh ,ωh), h = 1, . . . ,H.I Reduction method (H << min(p, q)). But no variable selection for

extracting the most relevant (original) variables from each latentvariable.

sparse PLSI sparse PLS selects the relevant SNPsI Some coefficients u` are equal to 0ξh = u1 ×SNP1 + u2︸︷︷︸

=0

×SNP2 + u3︸︷︷︸=0

×SNP3 + · · ·+up ×SNPp

I The sPLS components are linear combinations of the selectedvariables

Big Data PLS Methods December 2018, AASC Rotorua NZ 6/54

Page 16: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

Group structures within the dataI Natural example: Categorical variables form a group of dummy variables

in a regression setting.

I Genomics: genes within the same pathway have similar functions andact together in regulating a biological system.→ These genes can add up to have a larger effect→ can be detected as a group (i.e., at a pathway or gene set/modulelevel).

We consider that variables are divided into groups:

I Example: p SNPs grouped into K genes (Xj = SNPj)

X =[SNP1, . . . ,SNPk︸ ︷︷ ︸

gene1

|SNPk+1,SNPk+2, . . . ,SNPh︸ ︷︷ ︸gene2

| . . . |SNPl+1, . . . ,SNPp︸ ︷︷ ︸geneK

]I Example: p genes grouped into K pathways/modules (Xj = genej)

X =[X1,X2, . . . ,Xk︸ ︷︷ ︸

M1

|Xk+1,Xk+2, . . . ,Xh︸ ︷︷ ︸M2

| . . . |Xl+1,Xl+2, . . . ,Xp︸ ︷︷ ︸MK

]

Big Data PLS Methods December 2018, AASC Rotorua NZ 7/54

Page 17: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

Group structures within the dataI Natural example: Categorical variables form a group of dummy variables

in a regression setting.

I Genomics: genes within the same pathway have similar functions andact together in regulating a biological system.→ These genes can add up to have a larger effect→ can be detected as a group (i.e., at a pathway or gene set/modulelevel).

We consider that variables are divided into groups:

I Example: p SNPs grouped into K genes (Xj = SNPj)

X =[SNP1, . . . ,SNPk︸ ︷︷ ︸

gene1

|SNPk+1,SNPk+2, . . . ,SNPh︸ ︷︷ ︸gene2

| . . . |SNPl+1, . . . ,SNPp︸ ︷︷ ︸geneK

]I Example: p genes grouped into K pathways/modules (Xj = genej)

X =[X1,X2, . . . ,Xk︸ ︷︷ ︸

M1

|Xk+1,Xk+2, . . . ,Xh︸ ︷︷ ︸M2

| . . . |Xl+1,Xl+2, . . . ,Xp︸ ︷︷ ︸MK

]

Big Data PLS Methods December 2018, AASC Rotorua NZ 7/54

Page 18: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

Group structures within the dataI Natural example: Categorical variables form a group of dummy variables

in a regression setting.

I Genomics: genes within the same pathway have similar functions andact together in regulating a biological system.→ These genes can add up to have a larger effect→ can be detected as a group (i.e., at a pathway or gene set/modulelevel).

We consider that variables are divided into groups:

I Example: p SNPs grouped into K genes (Xj = SNPj)

X =[SNP1, . . . ,SNPk︸ ︷︷ ︸

gene1

|SNPk+1,SNPk+2, . . . ,SNPh︸ ︷︷ ︸gene2

| . . . |SNPl+1, . . . ,SNPp︸ ︷︷ ︸geneK

]I Example: p genes grouped into K pathways/modules (Xj = genej)

X =[X1,X2, . . . ,Xk︸ ︷︷ ︸

M1

|Xk+1,Xk+2, . . . ,Xh︸ ︷︷ ︸M2

| . . . |Xl+1,Xl+2, . . . ,Xp︸ ︷︷ ︸MK

]

Big Data PLS Methods December 2018, AASC Rotorua NZ 7/54

Page 19: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

Group PLS

Aim: select groups of variables taking into account the data structure

I PLS componentsξh = u1 × X1 + u2 × X2 + u3 × X3 + · · ·+ up × Xp

I sparse PLS components (sPLS)ξh = u1 × X1 + u2︸︷︷︸

=0

×X2 + u3︸︷︷︸=0

×X3 + · · ·+ up × Xp

I group PLS components (gPLS)

ξh =

module1︷ ︸︸ ︷u1︸︷︷︸=0

X1 + u2︸︷︷︸=0

X2 +

module2︷ ︸︸ ︷u3︸︷︷︸,0

X3 + u4︸︷︷︸,0

X1 + u5︸︷︷︸,0

X5 + · · ·+

moduleK︷ ︸︸ ︷up−1︸︷︷︸=0

Xp−1 + up︸︷︷︸=0

Xp

→ select groups of variables; either all the variables within a group are selectedor none of them are selected

... does not achieve sparsity within each group ...

Big Data PLS Methods December 2018, AASC Rotorua NZ 8/54

Page 20: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

Group PLS

Aim: select groups of variables taking into account the data structure

I PLS componentsξh = u1 × X1 + u2 × X2 + u3 × X3 + · · ·+ up × Xp

I sparse PLS components (sPLS)ξh = u1 × X1 + u2︸︷︷︸

=0

×X2 + u3︸︷︷︸=0

×X3 + · · ·+ up × Xp

I group PLS components (gPLS)

ξh =

module1︷ ︸︸ ︷u1︸︷︷︸=0

X1 + u2︸︷︷︸=0

X2 +

module2︷ ︸︸ ︷u3︸︷︷︸,0

X3 + u4︸︷︷︸,0

X1 + u5︸︷︷︸,0

X5 + · · ·+

moduleK︷ ︸︸ ︷up−1︸︷︷︸=0

Xp−1 + up︸︷︷︸=0

Xp

→ select groups of variables; either all the variables within a group are selectedor none of them are selected

... does not achieve sparsity within each group ...

Big Data PLS Methods December 2018, AASC Rotorua NZ 8/54

Page 21: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

Group PLS

Aim: select groups of variables taking into account the data structure

I PLS componentsξh = u1 × X1 + u2 × X2 + u3 × X3 + · · ·+ up × Xp

I sparse PLS components (sPLS)ξh = u1 × X1 + u2︸︷︷︸

=0

×X2 + u3︸︷︷︸=0

×X3 + · · ·+ up × Xp

I group PLS components (gPLS)

ξh =

module1︷ ︸︸ ︷u1︸︷︷︸=0

X1 + u2︸︷︷︸=0

X2 +

module2︷ ︸︸ ︷u3︸︷︷︸,0

X3 + u4︸︷︷︸,0

X1 + u5︸︷︷︸,0

X5 + · · ·+

moduleK︷ ︸︸ ︷up−1︸︷︷︸=0

Xp−1 + up︸︷︷︸=0

Xp

→ select groups of variables; either all the variables within a group are selectedor none of them are selected

... does not achieve sparsity within each group ...

Big Data PLS Methods December 2018, AASC Rotorua NZ 8/54

Page 22: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

Group PLS

Aim: select groups of variables taking into account the data structure

I PLS componentsξh = u1 × X1 + u2 × X2 + u3 × X3 + · · ·+ up × Xp

I sparse PLS components (sPLS)ξh = u1 × X1 + u2︸︷︷︸

=0

×X2 + u3︸︷︷︸=0

×X3 + · · ·+ up × Xp

I group PLS components (gPLS)

ξh =

module1︷ ︸︸ ︷u1︸︷︷︸=0

X1 + u2︸︷︷︸=0

X2 +

module2︷ ︸︸ ︷u3︸︷︷︸,0

X3 + u4︸︷︷︸,0

X1 + u5︸︷︷︸,0

X5 + · · ·+

moduleK︷ ︸︸ ︷up−1︸︷︷︸=0

Xp−1 + up︸︷︷︸=0

Xp

→ select groups of variables; either all the variables within a group are selectedor none of them are selected

... does not achieve sparsity within each group ...

Big Data PLS Methods December 2018, AASC Rotorua NZ 8/54

Page 23: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

Sparse Group PLSAim: combine both sparsity of groups and within each group.Example: X matrix = genes. We might be interested in identifying particularlyimportant genes in pathways of interest.I sparse PLS components (sPLS)ξh = u1 × X1 + u2︸︷︷︸

=0

×X2 + u3︸︷︷︸=0

×X3 + · · ·+ up × Xp

I group PLS components (gPLS)

ξh =

module1︷ ︸︸ ︷u1︸︷︷︸=0

X1 + u2︸︷︷︸=0

X2 +

module2︷ ︸︸ ︷u3︸︷︷︸,0

X3 + u4︸︷︷︸,0

X1 + u5︸︷︷︸,0

X5 + · · ·+

moduleK︷ ︸︸ ︷up−1︸︷︷︸=0

Xp−1 + up︸︷︷︸=0

Xp

I sparse group PLS components (sgPLS)

ξh =

module1︷ ︸︸ ︷u1︸︷︷︸=0

X1 + u2︸︷︷︸=0

X2 +

module2︷ ︸︸ ︷u3︸︷︷︸,0

X3 + u4︸︷︷︸=0

X4 + u5︸︷︷︸=0

X5 + · · ·+

moduleK︷ ︸︸ ︷up−1︸︷︷︸=0

Xp−1 + up︸︷︷︸=0

Xp

Big Data PLS Methods December 2018, AASC Rotorua NZ 9/54

Page 24: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

Sparse Group PLSAim: combine both sparsity of groups and within each group.Example: X matrix = genes. We might be interested in identifying particularlyimportant genes in pathways of interest.I sparse PLS components (sPLS)ξh = u1 × X1 + u2︸︷︷︸

=0

×X2 + u3︸︷︷︸=0

×X3 + · · ·+ up × Xp

I group PLS components (gPLS)

ξh =

module1︷ ︸︸ ︷u1︸︷︷︸=0

X1 + u2︸︷︷︸=0

X2 +

module2︷ ︸︸ ︷u3︸︷︷︸,0

X3 + u4︸︷︷︸,0

X1 + u5︸︷︷︸,0

X5 + · · ·+

moduleK︷ ︸︸ ︷up−1︸︷︷︸=0

Xp−1 + up︸︷︷︸=0

Xp

I sparse group PLS components (sgPLS)

ξh =

module1︷ ︸︸ ︷u1︸︷︷︸=0

X1 + u2︸︷︷︸=0

X2 +

module2︷ ︸︸ ︷u3︸︷︷︸,0

X3 + u4︸︷︷︸=0

X4 + u5︸︷︷︸=0

X5 + · · ·+

moduleK︷ ︸︸ ︷up−1︸︷︷︸=0

Xp−1 + up︸︷︷︸=0

Xp

Big Data PLS Methods December 2018, AASC Rotorua NZ 9/54

Page 25: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

Aims in a regression setting

I Select groups of variables taking into account the data structure;all the variables within a group are selected otherwise none ofthem are selected

I Combine both sparsity of groups and within each group; only rel-evant variables within a group are selected

Big Data PLS Methods December 2018, AASC Rotorua NZ 10/54

Page 26: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

Illustration: Dendritic Cells in Addition to AntiretroviralTreatment (DALIA) trial

I Evaluation of the safety and the immunogenicity of a vaccine on n = 19HIV-1 infected patients.

I The vaccine was injected on weeks 0, 4, 8 and 12 while patients re-ceived an antiretroviral therapy. An interruption of the antiretrovirals wasperformed at week 24.

I After vaccination, a deep evaluation of the immune response was per-formed at week 16.

I Repeated measurements of the main immune markers and gene ex-pression were performed every 4 weeks until the end of the trials.

Big Data PLS Methods December 2018, AASC Rotorua NZ 11/54

Page 27: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

DALIA trial: Question ?

First results obtained using group of genesI Significant change of gene expression among 69 modules over

time before antiretroviral treatment interruption.

I How does the gene abundance of these 69 modules as measuredat week 16 correlate with immune markers measured at week16?

Big Data PLS Methods December 2018, AASC Rotorua NZ 12/54

Page 28: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

DALIA trial: Question ?

First results obtained using group of genesI Significant change of gene expression among 69 modules over

time before antiretroviral treatment interruption.I How does the gene abundance of these 69 modules as measured

at week 16 correlate with immune markers measured at week16?

Big Data PLS Methods December 2018, AASC Rotorua NZ 12/54

Page 29: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

sPLS, gPLS and sgPLS

I Response variables Y= immune markers composed of q = 7 cy-tokines (IL21, IL2, IL13, IFNg, Luminex score, TH1 score, CD4).

I Predictor variables X= expression of p = 5399 genes extractedfrom the 69 modules.

I Use the structure of the data (modules) for gPLS and sgPLS.Each gene belongs to one of the 69 modules.

I Asymmetric situation.

Big Data PLS Methods December 2018, AASC Rotorua NZ 13/54

Page 30: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

Results: Modules and number of genes selected

p = 5399 ; 24 modules selected by gPLS or sgPLS on 3 scoresBig Data PLS Methods December 2018, AASC Rotorua NZ 14/54

Page 31: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

Results: Modules and number of genes selected

Big Data PLS Methods December 2018, AASC Rotorua NZ 15/54

Page 32: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

Results: Venn diagram

I sgPLS selects slightly more genes than sPLS (respectively 487 and 420 genes selected)I But sgPLS selects fewer modules than sPLS (respectively 21 and 64 groups of genes

selected)I Note: all the 21 groups of genes selected by sgPLS were included in those selected by

sPLS.I sgPLS selects slightly more modules than gPLS (4 more, 14/21 in common).I However, gPLS leads to more genes selected than sgPLS (944)I In this application, the sgPLS approach led to a parsimonious selection of modules and

genes that sound very relevant biologicallyChaussabel’s functional modules: http://www.biir.net/public wikis/module annotation/V2 Trial 8 Modules

Big Data PLS Methods December 2018, AASC Rotorua NZ 16/54

Page 33: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

Results: Venn diagram

I sgPLS selects slightly more genes than sPLS (respectively 487 and 420 genes selected)I But sgPLS selects fewer modules than sPLS (respectively 21 and 64 groups of genes

selected)I Note: all the 21 groups of genes selected by sgPLS were included in those selected by

sPLS.I sgPLS selects slightly more modules than gPLS (4 more, 14/21 in common).

I However, gPLS leads to more genes selected than sgPLS (944)I In this application, the sgPLS approach led to a parsimonious selection of modules and

genes that sound very relevant biologicallyChaussabel’s functional modules: http://www.biir.net/public wikis/module annotation/V2 Trial 8 Modules

Big Data PLS Methods December 2018, AASC Rotorua NZ 16/54

Page 34: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

Results: Venn diagram

I sgPLS selects slightly more genes than sPLS (respectively 487 and 420 genes selected)I But sgPLS selects fewer modules than sPLS (respectively 21 and 64 groups of genes

selected)I Note: all the 21 groups of genes selected by sgPLS were included in those selected by

sPLS.I sgPLS selects slightly more modules than gPLS (4 more, 14/21 in common).I However, gPLS leads to more genes selected than sgPLS (944)I In this application, the sgPLS approach led to a parsimonious selection of modules and

genes that sound very relevant biologicallyChaussabel’s functional modules: http://www.biir.net/public wikis/module annotation/V2 Trial 8 Modules

Big Data PLS Methods December 2018, AASC Rotorua NZ 16/54

Page 35: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

Stability of the variable selection (100 bootstrap samples)

SelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelected Not SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot Selected

0.0

0.2

0.4

0.6

0.8

1.0

M7.

1M

5.1

M3.

2M

5.15

M4.

2M

4.13

M4.

15M

4.6

M6.

6M

7.27

M6.

13M

5.7

M5.

14M

1.1

M3.

6M

3.5

M4.

1M

5.5

M5.

3M

4.7

M5.

11M

2.3

M6.

14M

3.1

M7.

2M

4.4

M7.

35M

5.4

M5.

8M

7.5

M7.

12M

7.15

M4.

16M

7.8

M7.

25M

6.4

M6.

10M

6.7

M7.

6M

7.11

M7.

33M

8.14

M4.

3M

5.10

M5.

6M

5.9

M7.

16M

7.7

M4.

12M

4.14

M4.

8M

7.24

M6.

20M

7.21

M5.

13M

4.9

M6.

2M

2.1

M7.

4M

7.14

M7.

26M

4.5

M7.

28M

8.35

M5.

2M

8.13

M4.

11M

6.9

M8.

59

Module

Freq

uenc

y

Apoptosis / SurvivalApoptosis / SurvivalCell CycleCell DeathCytotoxic/NK CellErythrocytesInflammationMitochondrial RespirationMitochondrial Stress / ProteasomeMonocytesNeutrophilsPlasma CellsPlateletsProtein SynthesisT cellT cellsUndetermined

sgPLS − component 1

SelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelected Not SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot Selected

0.0

0.2

0.4

0.6

0.8

1.0

M7.

1M

5.5

M5.

1M

5.13

M3.

5M

5.10

M5.

7M

7.5

M4.

7M

7.12

M5.

4M

7.4

M7.

11M

7.25

M6.

2M

6.7

M5.

9M

5.6

M7.

8M

4.9

M1.

1M

7.6

M4.

6M

6.4

M7.

2M

7.21

M5.

3M

7.16

M5.

8M

6.20

M6.

10M

5.11

M4.

12M

4.16

M6.

9M

3.2

M7.

15M

4.15

M2.

1M

7.14

M6.

14M

3.1

M4.

1M

7.26

M7.

24M

4.2

M6.

6M

4.8

M6.

13M

4.13

M5.

14M

7.27

M8.

14M

3.6

M7.

35M

8.13

M5.

15M

7.7

M7.

28M

5.2

M4.

5M

4.4

M7.

33M

2.3

M4.

3M

4.14

M8.

59M

8.35

M4.

11

Module

Freq

uenc

y

Apoptosis / SurvivalApoptosis / SurvivalCell CycleCell DeathCytotoxic/NK CellErythrocytesInflammationMitochondrial RespirationMitochondrial Stress / ProteasomeMonocytesNeutrophilsPlasma CellsPlateletsProtein SynthesisT cellT cellsUndetermined

sPLS − component 1

SelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelected Not SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot Selected

0.0

0.2

0.4

0.6

0.8

1.0

M5.

15M

4.2

M6.

6M

3.2

M7.

27M

4.13

M4.

15M

7.1

M6.

13M

3.6

M7.

35M

4.6

M1.

1M

4.8

M5.

14M

6.14

M2.

3M

3.1

M4.

4M

5.1

M4.

7M

4.1

M2.

1M

6.10

M5.

3M

5.11

M7.

8M

6.9

M5.

5M

4.3

M7.

33M

8.14

M4.

14M

6.20

M4.

11M

6.7

M8.

59M

4.12

M7.

14M

6.4

M7.

15M

4.16

M7.

24M

7.25

M5.

7M

5.9

M3.

5M

6.2

M7.

2M

7.6

M8.

13M

8.35

M5.

8M

7.16

M4.

5M

4.9

M5.

2M

7.5

M7.

7M

5.10

M5.

4M

7.11

M7.

4M

5.13

M5.

6M

7.12

M7.

21M

7.26

M7.

28

Module

Freq

uenc

y

Apoptosis / SurvivalApoptosis / SurvivalCell CycleCell DeathCytotoxic/NK CellErythrocytesInflammationMitochondrial RespirationMitochondrial Stress / ProteasomeMonocytesNeutrophilsPlasma CellsPlateletsProtein SynthesisT cellT cellsUndetermined

gPLS − component 1

SelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelectedSelected Not SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot SelectedNot Selected

0.0

0.2

0.4

0.6

0.8

1.0

M7.

1M

5.1

M3.

2M

5.15

M4.

2M

4.13

M4.

15M

4.6

M6.

6M

7.27

M6.

13M

5.7

M5.

14M

1.1

M3.

6M

3.5

M4.

1M

5.5

M5.

3M

4.7

M5.

11M

2.3

M6.

14M

3.1

M7.

2M

4.4

M7.

35M

5.4

M5.

8M

7.5

M7.

12M

7.15

M4.

16M

7.8

M7.

25M

6.4

M6.

10M

6.7

M7.

6M

7.11

M7.

33M

8.14

M4.

3M

5.10

M5.

6M

5.9

M7.

16M

7.7

M4.

12M

4.14

M4.

8M

7.24

M6.

20M

7.21

M5.

13M

4.9

M6.

2M

2.1

M7.

4M

7.14

M7.

26M

4.5

M7.

28M

8.35

M5.

2M

8.13

M4.

11M

6.9

M8.

59

Module

Freq

uenc

y

Apoptosis / SurvivalApoptosis / SurvivalCell CycleCell DeathCytotoxic/NK CellErythrocytesInflammationMitochondrial RespirationMitochondrial Stress / ProteasomeMonocytesNeutrophilsPlasma CellsPlateletsProtein SynthesisT cellT cellsUndetermined

sgPLS − component 1

Stability of the variable selection assessed on 100 bootstrap sampleson DALIA-1 trial data, for the gPLS, sgPLS and sPLS procedures re-spectively. For each procedure, the modules selected on the originalsample are separated from those that were not.

Big Data PLS Methods December 2018, AASC Rotorua NZ 17/54

Page 36: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

Now some mathematics ...

Big Data PLS Methods December 2018, AASC Rotorua NZ 18/54

Page 37: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

PLS family

PLS = Partial Least Squares or Projection to Latent Structures

Four main methods coexist in the literature:

(i) Partial Least Squares Correlation (PLSC) also called PLS-SVD;

(ii) PLS in mode A (PLS-W2A, for Wold’s Two-Block, Mode A PLS);

(iii) PLS in mode B (PLS-W2B) also called Canonical CorrelationAnalysis (CCA);

(iv) Partial Least Squares Regression (PLSR, or PLS2).

I (i),(ii) and (iii) are symmetric while (iv) is asymmetric.

I Different objective functions to optimise.

I Good news: all use the singular value decomposition (SVD).

Big Data PLS Methods December 2018, AASC Rotorua NZ 19/54

Page 38: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

PLS family

PLS = Partial Least Squares or Projection to Latent Structures

Four main methods coexist in the literature:

(i) Partial Least Squares Correlation (PLSC) also called PLS-SVD;

(ii) PLS in mode A (PLS-W2A, for Wold’s Two-Block, Mode A PLS);

(iii) PLS in mode B (PLS-W2B) also called Canonical CorrelationAnalysis (CCA);

(iv) Partial Least Squares Regression (PLSR, or PLS2).

I (i),(ii) and (iii) are symmetric while (iv) is asymmetric.

I Different objective functions to optimise.

I Good news: all use the singular value decomposition (SVD).

Big Data PLS Methods December 2018, AASC Rotorua NZ 19/54

Page 39: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

Singular Value Decomposition (SVD)

Definition 1Let a matrixM : p × q of rank r :

M =U∆VT =r∑

l=1

δlulvTl , (1)

I U = (ul) : p × p and V = (v l) : q × q are two orthogonal matriceswhich contain the normalised left (resp. right) singular vectors

I ∆ = diag(δ1, . . . , δr , 0, . . . , 0): the ordered singular values δ1 > δ2 >· · · > δr > 0.

Note: fast and efficient algorithms exist to solve the SVD.

Big Data PLS Methods December 2018, AASC Rotorua NZ 20/54

Page 40: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

Connexion between SVD and maximum covariance

We were able to describe the optimization problem of the four PLSmethods as:

(u∗, v∗) = argmax‖u‖2=‖v‖2=1

Cov(Xh−1u,Yh−1v), h = 1, . . . ,H.

Matrices Xh and Yh are obtained recursively from Xh−1 and Yh−1.

The four methods differ by the deflation process, chosen so that theabove scores or weight vectors satisfy given constraints.

The solution at step h is obtained by computing only the first triplet(δ1,u1, v1) of singular elements of the SVD ofMh−1 = XT

h−1Yh−1:

(u∗, v∗) = (u1, v1)

Why is this useful ?

Big Data PLS Methods December 2018, AASC Rotorua NZ 21/54

Page 41: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

Connexion between SVD and maximum covariance

We were able to describe the optimization problem of the four PLSmethods as:

(u∗, v∗) = argmax‖u‖2=‖v‖2=1

Cov(Xh−1u,Yh−1v), h = 1, . . . ,H.

Matrices Xh and Yh are obtained recursively from Xh−1 and Yh−1.

The four methods differ by the deflation process, chosen so that theabove scores or weight vectors satisfy given constraints.

The solution at step h is obtained by computing only the first triplet(δ1,u1, v1) of singular elements of the SVD ofMh−1 = XT

h−1Yh−1:

(u∗, v∗) = (u1, v1)

Why is this useful ?

Big Data PLS Methods December 2018, AASC Rotorua NZ 21/54

Page 42: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

Connexion between SVD and maximum covariance

We were able to describe the optimization problem of the four PLSmethods as:

(u∗, v∗) = argmax‖u‖2=‖v‖2=1

Cov(Xh−1u,Yh−1v), h = 1, . . . ,H.

Matrices Xh and Yh are obtained recursively from Xh−1 and Yh−1.

The four methods differ by the deflation process, chosen so that theabove scores or weight vectors satisfy given constraints.

The solution at step h is obtained by computing only the first triplet(δ1,u1, v1) of singular elements of the SVD ofMh−1 = XT

h−1Yh−1:

(u∗, v∗) = (u1, v1)

Why is this useful ?

Big Data PLS Methods December 2018, AASC Rotorua NZ 21/54

Page 43: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

Connexion between SVD and maximum covariance

We were able to describe the optimization problem of the four PLSmethods as:

(u∗, v∗) = argmax‖u‖2=‖v‖2=1

Cov(Xh−1u,Yh−1v), h = 1, . . . ,H.

Matrices Xh and Yh are obtained recursively from Xh−1 and Yh−1.

The four methods differ by the deflation process, chosen so that theabove scores or weight vectors satisfy given constraints.

The solution at step h is obtained by computing only the first triplet(δ1,u1, v1) of singular elements of the SVD ofMh−1 = XT

h−1Yh−1:

(u∗, v∗) = (u1, v1)

Why is this useful ?Big Data PLS Methods December 2018, AASC Rotorua NZ 21/54

Page 44: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

SVD properties

Theorem 2Eckart-Young (1936) states that the (truncated) SVD of a givenmatrixM (of rank r) provides the best reconstitution (in a leastsquares sense) ofM by a matrix with a lower rank k :

minA of rank k

‖M −A‖2F =

∥∥∥∥∥∥∥M −k∑`=1

δ`u`vT`

∥∥∥∥∥∥∥2

F

=r∑

`=k+1

δ2` .

If the minimum is searched for matrices A of rank 1, which are underthe form uvT where u, v are non-zero vectors, we obtain

minu,v

∥∥∥∥M − uvT∥∥∥∥2

F=

r∑`=2

δ2` =

∥∥∥M − δ1u1vT1

∥∥∥2F .

Big Data PLS Methods December 2018, AASC Rotorua NZ 22/54

Page 45: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

SVD properties

Thus, solving

argminu,v

∥∥∥∥Mh−1 − uvT∥∥∥∥2

F(2)

and norming the resulting vectors gives us u1 and v1. This is an-other approach to solve the PLS optimization problem.

Big Data PLS Methods December 2018, AASC Rotorua NZ 23/54

Page 46: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

Towards sparse PLSI Shen and Huang (2008) connected (2) (in a PCA context) to least square

minimisation in regression:

∥∥∥∥Mh−1 − uvT∥∥∥∥2

F=

∥∥∥∥∥∥∥∥∥∥vec(Mh−1)︸ ︷︷ ︸y

− (Ip ⊗ u)v︸ ︷︷ ︸Xβ

∥∥∥∥∥∥∥∥∥∥2

2

=

∥∥∥∥∥∥∥∥∥∥vec(Mh−1)︸ ︷︷ ︸y

− (v ⊗ Iq)u︸ ︷︷ ︸Xβ

∥∥∥∥∥∥∥∥∥∥2

2

.

→ Possible to use many existing variable selection techniques usingregularization penalties.

We propose iterative alternating algorithms to find normed vectorsu/‖u‖ and v/‖v‖ that minimise the following penalised sum-of-squarescriterion ∥∥∥∥Mh−1 − uvT

∥∥∥∥2

F+ Pλ(u, v),

for various penalization terms Pλ(u, v).

→ We obtain several sparse versions (in terms of the weights u andv) of the four methods (i)–(iv).

Big Data PLS Methods December 2018, AASC Rotorua NZ 24/54

Page 47: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

Towards sparse PLSI Shen and Huang (2008) connected (2) (in a PCA context) to least square

minimisation in regression:

∥∥∥∥Mh−1 − uvT∥∥∥∥2

F=

∥∥∥∥∥∥∥∥∥∥vec(Mh−1)︸ ︷︷ ︸y

− (Ip ⊗ u)v︸ ︷︷ ︸Xβ

∥∥∥∥∥∥∥∥∥∥2

2

=

∥∥∥∥∥∥∥∥∥∥vec(Mh−1)︸ ︷︷ ︸y

− (v ⊗ Iq)u︸ ︷︷ ︸Xβ

∥∥∥∥∥∥∥∥∥∥2

2

.

→ Possible to use many existing variable selection techniques usingregularization penalties.

We propose iterative alternating algorithms to find normed vectorsu/‖u‖ and v/‖v‖ that minimise the following penalised sum-of-squarescriterion ∥∥∥∥Mh−1 − uvT

∥∥∥∥2

F+ Pλ(u, v),

for various penalization terms Pλ(u, v).

→ We obtain several sparse versions (in terms of the weights u andv) of the four methods (i)–(iv).

Big Data PLS Methods December 2018, AASC Rotorua NZ 24/54

Page 48: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

Sparse PLS models

For cases (i)–(iv),

I Aim: obtaining sparse weight vectors uh and vh .

I Associated component scores (i.e., latent variables) ξh := Xh−1uh andωh := Yh−1vh , h = 1, . . . ,H, for a small number of components.

I Recursive procedure with objective function involving Xh−1 and Yh−1

→ decomposition (approximation) of the original matrices X and Y:

X = ΞHCTH + FX ,H , Y = ΩHD

TH + FY ,H , (3)

where Ξ = (ξh) and Ω = (ωh).

I For the regression mode, we have the multivariate linear regressionmodel

Y = XBPLS + E,

with BPLS =UH(CTHUH)

−1PHDTH and E is a matrix of residuals.

Big Data PLS Methods December 2018, AASC Rotorua NZ 25/54

Page 49: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

The algorithmMain steps of the iterative algorithm

1. X0 = X, Y0 = Y h = 1

2. Mh−1 := XTh−1Yh−1.

3. SVD: extraction of the first pair of singular vectors uh and vh .

4. Sparsity step. Produces sparse weights usparse and vsparse.

5. Latent variables: ξh = Xh−1usparse and ωh = Yh−1vsparse

6. Slope coefficients:I ch = XT

h−1ξh/ξThξh for both modes

I dh = YTh−1ξh/ξ

Thξh for “PLSR regression mode”

I eh = YTh−1ωh/ω

Thωh for “PLS mode A”

7. Deflation:I Xh = Xh−1 − ξhcT

h for both modesI Yh = Yh−1 − ξhdT

h for “PLSR regression mode”I Yh = Yh−1 − ωheT

h for “PLS mode A”

8. If h = H stop, else h = h + 1 and goto step 2.

Big Data PLS Methods December 2018, AASC Rotorua NZ 26/54

Page 50: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

sparse PLS (sPLS)

In sPLS, the optimisation problem to solve is

minuh ,vh

∥∥∥Mh − uhvTh

∥∥∥2

F+ Pλ1,h (uh) + Pλ2,h (vh),

I ‖Mh − uhvTh ‖

2F =

∑pi=1

∑qj=1(mij − uihvjh)

2,

I Mh = XTh Yh for each iteration h.

I Pλ1,h (uh) =∑p

i=1 2λh1 |ui | and Pλ2,h (vh) =

∑qj=1 2λh

2 |vi |

Iterative solution. Applying the thresholding function gsoft(x, λ) = sign(x)(|x | − λ)+I to the vectorMvh componentwise to get uh .

I to the vectorMTuh componentwise to get vh .

Big Data PLS Methods December 2018, AASC Rotorua NZ 27/54

Page 51: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

sparse PLS (sPLS)

In sPLS, the optimisation problem to solve is

minuh ,vh

∥∥∥Mh − uhvTh

∥∥∥2

F+ Pλ1,h (uh) + Pλ2,h (vh),

I ‖Mh − uhvTh ‖

2F =

∑pi=1

∑qj=1(mij − uihvjh)

2,

I Mh = XTh Yh for each iteration h.

I Pλ1,h (uh) =∑p

i=1 2λh1 |ui | and Pλ2,h (vh) =

∑qj=1 2λh

2 |vi |

Iterative solution. Applying the thresholding function gsoft(x, λ) = sign(x)(|x | − λ)+I to the vectorMvh componentwise to get uh .

I to the vectorMTuh componentwise to get vh .

Big Data PLS Methods December 2018, AASC Rotorua NZ 27/54

Page 52: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

group PLS (gPLS)I X and Y can be divided respectively into K and L sub-matrices (groups) X(k) :

n × pk and Y(l) : n × ql .I Same idea as Yuan and Lin (2006), we use group lasso penalties:

Pλ1(u) = λ1

K∑k=1

√pk ‖u(k)‖2 and Pλ2(v) = λ2

L∑l=1

√ql‖v(l)‖2,

where u(k) (resp. v(l)) is the weight vector associated to the k -th (resp. l-th)block.

In gPLS, the optimisation problem to solve is

K∑k=1

L∑l=1

∥∥∥∥M(k ,l)− u(k)v(l)T

∥∥∥∥2

F+ Pλ1(u) + Pλ2(v),

I M(k ,l) = X(k)Y(l)T .

Remark if the k -th block is composed by only one variable then

‖u(k)‖2 =√(u(k))2 = |u(k)|.

Big Data PLS Methods December 2018, AASC Rotorua NZ 28/54

Page 53: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

group PLS (gPLS)

Theorem 3Solution of the group PLS optimisation problem is given by:

u(k) =

(1 −

λ1

2

√pk

||M(k ,·)v ||2

)+

M(k ,·)v (for fixed v)

and

v(l) =

1 − λ2

2

√ql

||M(·,l)Tu||2

+

M(·,l)Tu (for fixed u).

Note: we will iterate until convergence of u(k) and v(l), using alterna-tively one of the above formulas.

Big Data PLS Methods December 2018, AASC Rotorua NZ 29/54

Page 54: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

sparse group PLS: sparsity within groups

I Following Simon et al. (2013), we introduce sparse group lasso penalties:

Pλ1(u) = (1 − α1)λ1

K∑k=1

√pk ‖u(k)‖2 + α1λ1‖u‖1,

Pλ2(v) = (1 − α2)λ2

L∑l=1

√ql‖v(l)‖2 + α2λ2‖v‖1.

Big Data PLS Methods December 2018, AASC Rotorua NZ 30/54

Page 55: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

sparse group PLS (sgPLS)

Theorem 4Solution of the sparse group PLS optimisation problem is given by:u(k) = 0 if ∥∥∥∥gsoft

(M

(k ,·)v, λ1α1/2)∥∥∥∥26 λ1(1 − α1)

√pk ,

otherwise

u(k) =12

gsoft(M

(k ,·)v, λ1α1/2)− λ1(1 − α1)

√pk

gsoft(M(k ,·)v, λ1α1/2

)‖gsoft

(M(k ,·)v, λ1α1/2

)‖2

.We have v(l) = 0 if ∥∥∥∥∥gsoft

(M

(·,l)Tu, λ2α2/2)∥∥∥∥∥26 λ2(1 − α2)

√ql

and

v(l) =12

gsoft(M

(·,l)Tu, λ1α1/2)− λ2(1 − α2)

√ql

gsoft(M(·,l)Tu, λ2α2/2

)‖gsoft

(M(·,l)Tu, λ2α2/2

)‖2

otherwise.

Similar proof (see our paper in Bioinformatics, 2016).

Big Data PLS Methods December 2018, AASC Rotorua NZ 31/54

Page 56: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

R package: sgPLS

I sgPLS package implements sPLS, gPLS and sgPLS methods:http://cran.r-project.org/web/packages/sgPLS/index.html

I Includes some functions for choosing the tuning parameters related tothe predictor matrix for different sparse PLS model (regression mode).

I Some simple code to perform a sgPLS:

model.sgPLS <- sgPLS(X, Y, ncomp = 2, mode = "regression",keepX = c(4, 4), keepY = c(4, 4),ind.block.x = ind.block.x ,ind.block.y = ind.block.y,alpha.x = c(0.5, 0.5),alpha.y = c(0.5, 0.5))

I Last version also includes sparse group Discriminant Analysis.

Big Data PLS Methods December 2018, AASC Rotorua NZ 32/54

Page 57: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

Extension of sparse group PLS

Taking into account one more layer in the group structure:

I Example: SNP ⊂ Gene ⊂ Pathways

I Longitudinal study

Big Data PLS Methods December 2018, AASC Rotorua NZ 33/54

Page 58: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

Extension of sparse group PLS

Taking into account one more layer in the group structure:

I Example: SNP ⊂ Gene ⊂ PathwaysI Longitudinal study

Big Data PLS Methods December 2018, AASC Rotorua NZ 33/54

Page 59: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

Group structures within the data

I Gene Module: genes within the same pathway have similar func-tions and act together to regulate the biological system.

X = [gene1, . . . , genekG1

| · · · | genel+1, . . . , genepG4

]

Big Data PLS Methods December 2018, AASC Rotorua NZ 34/54

Page 60: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

Longitudinal group structures:

I Time index: genes within the same pathway at the same timeindex have similar functions in regulating a biological system.2

G1 = [gene1, . . . , genekG1T1

| gene1, . . . , genekG1T2

]

Big Data PLS Methods December 2018, AASC Rotorua NZ 35/54

Page 61: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

Longitudinal group structures:

X = [G1T1,G1T2G1

| G2T1,G2T2G2

| · · · | G4T1,G4T2G4

]

Big Data PLS Methods December 2018, AASC Rotorua NZ 36/54

Page 62: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

Aims:

I Identify important modules at a group level, important times at asubgroup level and single genes at an individual level.

Big Data PLS Methods December 2018, AASC Rotorua NZ 37/54

Page 63: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

Sparse Models: sgsPLS

sparse group subgroup PLS

ξ =

Time 1︷ ︸︸ ︷0 × X1 + 0 × X2 +

Time 2︷ ︸︸ ︷0 × X1 + 0 × X2︸ ︷︷ ︸

Module 1

+ · · ·

+

Time 1︷ ︸︸ ︷up−1 × Xp−1 + 0 × Xp +

Time 2︷ ︸︸ ︷0 × Xp−1 + 0 × Xp︸ ︷︷ ︸

Module k

Big Data PLS Methods December 2018, AASC Rotorua NZ 38/54

Page 64: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

Sparse Models: sgsPLS

Optimisation of the weights

I X-score ξh = Xh−1uh , Y-score ωh = Yh−1vh

maxvh , uh

Cov(Xu,Yv) − λ1

K∑k=1

‖u(k)‖2 − λ2

K∑k=1

Ak∑a=1

‖u(k ,a)‖2 − λ3‖u‖1

such that vTh vh 6 1 and uT

h uh 6 1.

Big Data PLS Methods December 2018, AASC Rotorua NZ 39/54

Page 65: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

DALIA application

Big Data PLS Methods December 2018, AASC Rotorua NZ 40/54

Page 66: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

Preliminary results – selected variables

I 19 modules, 784 genes total of 1452 selected variables.

Big Data PLS Methods December 2018, AASC Rotorua NZ 41/54

Page 67: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

R Package

sgsPLS Available now on GITHUB

library(devtools)install_github("sgspls", "matt-sutton")

Big Data PLS Methods December 2018, AASC Rotorua NZ 42/54

Page 68: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

Regularized PLS scalable for BIG-DATA

What happens in a MASSIVE DATA SET context?

Massive datasets. The size of the data is large and analysing it takesa significant amount of time and computer memory.

Emerson & Kane (2012). Dataset considered large if it exceeds 20% ofthe RAM (Random Access Memory) on a given machine, and massiveif it exceeds 50%

Big Data PLS Methods December 2018, AASC Rotorua NZ 43/54

Page 69: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

Regularized PLS scalable for BIG-DATA

What happens in a MASSIVE DATA SET context?

Massive datasets. The size of the data is large and analysing it takesa significant amount of time and computer memory.

Emerson & Kane (2012). Dataset considered large if it exceeds 20% ofthe RAM (Random Access Memory) on a given machine, and massiveif it exceeds 50%

Big Data PLS Methods December 2018, AASC Rotorua NZ 43/54

Page 70: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

Case of a lot of observations: two massive data sets X: n × p matrixand Y: n × q matrix due to a large number of observations.

We suppose here that n is very large, but not p nor q.

PLS algorithm mainly based on the SVD ofMh−1 = XTh−1Yh−1:

Dimension ofMh−1: p × q matrix !!

This matrix fits into memory.

But not X nor Y.

Big Data PLS Methods December 2018, AASC Rotorua NZ 44/54

Page 71: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

Case of a lot of observations: two massive data sets X: n × p matrixand Y: n × q matrix due to a large number of observations.

We suppose here that n is very large, but not p nor q.

PLS algorithm mainly based on the SVD ofMh−1 = XTh−1Yh−1:

Dimension ofMh−1: p × q matrix !!

This matrix fits into memory.

But not X nor Y.

Big Data PLS Methods December 2018, AASC Rotorua NZ 44/54

Page 72: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

Case of a lot of observations: two massive data sets X: n × p matrixand Y: n × q matrix due to a large number of observations.

We suppose here that n is very large, but not p nor q.

PLS algorithm mainly based on the SVD ofMh−1 = XTh−1Yh−1:

Dimension ofMh−1: p × q matrix !!

This matrix fits into memory.

But not X nor Y.

Big Data PLS Methods December 2018, AASC Rotorua NZ 44/54

Page 73: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

Computation ofM = XTY by chunks

M = XTY =G∑

g=1

XT(g)Y(g)

All terms fit (successively) into memory!

Big Data PLS Methods December 2018, AASC Rotorua NZ 45/54

Page 74: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

Computation ofM = XTY by chunks using R

I No need to load the big matrices X and YI Use memory-mapped files (called “filebacking”) through the big-

memory package to allow matrices to exceed the RAM size.I A big.matrix is created which supports the use of shared memory

for efficiency in parallel computing.I foreach: package for running in parallel the computation ofM by

chunks

Regularized PLS algorithm:I Computation of the components (“Scores”):

Xu (n × 1) and Yv (n × 1)

I Easy to compute by chunks and store in a big.matrix object.

Big Data PLS Methods December 2018, AASC Rotorua NZ 46/54

Page 75: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

Computation ofM = XTY by chunks using R

I No need to load the big matrices X and YI Use memory-mapped files (called “filebacking”) through the big-

memory package to allow matrices to exceed the RAM size.I A big.matrix is created which supports the use of shared memory

for efficiency in parallel computing.I foreach: package for running in parallel the computation ofM by

chunks

Regularized PLS algorithm:I Computation of the components (“Scores”):

Xu (n × 1) and Yv (n × 1)

I Easy to compute by chunks and store in a big.matrix object.

Big Data PLS Methods December 2018, AASC Rotorua NZ 46/54

Page 76: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

Illustration of group PLS with Big-Data

I Simulated: X (5GB) and Y (5GB);I n = 560, 000 observations, p = 400 and q = 500;I Linked by two latent variables, made up of sparse linear combina-

tions of the original variables;I Both X and Y have a group structure: 20 groups of 20 variables

for X and 25 groups of 20 variables for Y;I Only 4 groups in each data set are relevant, 5 variables in each of

these groups are not relevant.

Big Data PLS Methods December 2018, AASC Rotorua NZ 47/54

Page 77: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

Figure 1: Comparison of gPLS and BIG-gPLS (for small n = 1, 000)

Big Data PLS Methods December 2018, AASC Rotorua NZ 48/54

Page 78: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

Figure 2: Use of BIG-gPLS. Left: small n. Right: Large n.Blue: truth. Red: Recovered.

Big Data PLS Methods December 2018, AASC Rotorua NZ 49/54

Page 79: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

Regularised PLS Discriminant Analysis

Categorical response variable becomes a dummy matrix in PLS algo-rithms:

Big Data PLS Methods December 2018, AASC Rotorua NZ 50/54

Page 80: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

Concluding Remarks and Take Home Message

I We were able to derive a simple unified algorithm that perfomsstandard, sparse, group and sparse group versions of the fourclassical PLS algorithms (i)–(iv). (And also PLSDA.)

I We used big memory objects, and a simple trick that makes ourprocedure scalable to big data (large n).

I We also parallelized the code for faster computation.

I bigsgPLS Available now on GITHUB:library(devtools)install_github("bigsgPLS", "matt-sutton")

I We have also offered a version of this algorithm for any combina-tion of large values of n, p and q.

Big Data PLS Methods December 2018, AASC Rotorua NZ 51/54

Page 81: Statistical Methods for the Analysis of High …...2018/12/02  · Statistical Methods for the Analysis of High-Dimensional and Massive Data using R Benoit Liquet1;2, Matthew Sutton2,

ReferencesI Yuan M. and Lin Y. (2006) Model Selection and Estimation in Re-

gression with Grouped Variables. Journal of the Royal Statisti-cal Society: Series B (Statistical Methodology), 68 (1), 49–67.

I Simon N., Friedman J., Hastie T. and Tibshirani R. (2013) A Sparse-group Lasso. Journal of Computational and Graphical Statis-tics, 22 (2), 231–245.

I Liquet B., Lafaye de Micheaux P., Hejblum B. and Thiebaut R.,(2016) Group and Sparse Group Partial Least Square ApproachesApplied in Genomics Context. Bioinformatics, 32(1), 35–42.

I Lafaye de Micheaux P., Liquet B. and Sutton M., PLS for Big Data:A Unified Parallel Algorithm for Regularized Group PLS. (Submit-ted) https://arxiv.org/abs/1702.07066

I M. Sutton, R. Thiebaut, and B. Liquet. (2018) Sparse group sub-group Partial Least Squares with application to genomics data.Statistics in Medicine.

Big Data PLS Methods December 2018, AASC Rotorua NZ 52/54