High-dimensional feature selection in precision medicine · April 25, 2017
-
High-dimensional feature selection in precision medicine
Chloé-Agathe Azencott
Center for Computational Biology (CBIO), Mines ParisTech – Institut Curie – INSERM U900
PSL Research University, Paris, France
April 25, 2017 – ICLR
http://cazencott.info · chloe-agathe.azencott@mines-paristech.fr · @cazencott
-
Precision Medicine
I Adapt treatment to the (genetic) specificities of the patient.
E.g. Trastuzumab for HER2+ breast cancer.
I Data-driven biology/medicine
Identify similarities between patients that exhibit similar phenotypes.
Data + Feature Selection
1
-
Sequencing costs
2
-
Big data!
3
-
Big data!
4
-
5
-
GWAS: Genome-Wide Association Studies
Which genomic features explain the phenotype?
p = 10⁵–10⁷ Single Nucleotide Polymorphisms (SNPs)
n = 10²–10⁴ samples
High-dimensional (large p), low sample size (small n)
6
-
Missing heritability
GWAS fail to explain most of the heritable variability of complex traits.
Many possible reasons:
– non-genetic / non-SNP factors
– heterogeneity of the phenotype
– rare SNPs
– weak effect sizes
– few samples in high dimension (p ≫ n)
– joint effects of multiple SNPs.
7
-
Is extracting knowledge from such data doomed from the start?
8
-
Reducing p
9
-
Integrating prior knowledge
Use prior knowledge as a constraint on the selected features
Prior knowledge can be represented as structure:
– linear structure of DNA
– groups, e.g. pathways
– networks (molecular, 3D structure).
Elephant image by Danny Chapman @ Flickr.
10
-
Regularized relevance
Set V of p variables.
I Relevance score R : 2^V → ℝ
Quantifies the importance of any subset of variables for the question under consideration.
Ex: correlation, HSIC, statistical test of association.
I Structured regularizer Ω : 2^V → ℝ
Promotes a sparsity pattern that is compatible with the constraint on the feature space.
Ex: cardinality Ω : S ↦ |S|.
I Regularized relevance
arg max_{S ⊆ V}  R(S) − λΩ(S)
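As a minimal illustration of this framework (not from the slides; scores are hypothetical): when the relevance score is additive over features and the regularizer is the cardinality Ω(S) = |S|, the maximization decouples and simply keeps every feature whose individual score exceeds λ.

```python
# Sketch of regularized relevance with an additive score and the
# cardinality regularizer Omega(S) = |S| (hypothetical numbers):
# arg max_S sum_{i in S} relevance[i] - lam * |S| decouples per feature.
def select(relevance, lam):
    """Return the optimal feature subset for an additive relevance score."""
    return {i for i, r in enumerate(relevance) if r > lam}

scores = [2.0, 0.3, 1.5, 0.1]   # per-feature relevance, e.g. association scores
print(select(scores, lam=1.0))  # features 0 and 2 survive the penalty
```

With a structured regularizer such as the network penalty of the next slides, the problem no longer decouples, which is what motivates the graph-cut machinery below.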
11
-
Network-guided multi-locus GWAS
Goal: Find a set of explanatory SNPs compatible with a givennetwork structure.
12
-
Network-guided GWAS
I Additive test of association SKAT [Wu et al. 2011]
R(S) = ∑_{i∈S} c_i,   c_i = (Xᵀ(y − µ))²_i
I Sparse Laplacian regularization
Ω : S ↦ ∑_{i∈S} ∑_{j∉S} W_ij + α|S|
I Regularized maximization of R
arg max_{S⊆V}  ∑_{i∈S} c_i [association] − η|S| [sparsity] − λ ∑_{i∈S} ∑_{j∉S} W_ij [connectivity]
13
-
Minimum cut reformulation
The graph-regularized maximization of the score Q(S) is equivalent to an s/t-min-cut for a graph with adjacency matrix A and two additional nodes s and t, where A_ij = λW_ij for 1 ≤ i, j ≤ p and the weights of the edges adjacent to nodes s and t are defined as
A_si = c_i − η if c_i > η, 0 otherwise;   A_it = η − c_i if c_i < η, 0 otherwise.
SConES: Selecting Connected Explanatory SNPs.
14
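A tiny brute-force check of this equivalence (toy scores and a chain network, all hypothetical; the real solver uses a max-flow algorithm, not enumeration): Q(S) plus the capacity of the cut that places S on the source side equals the constant ∑_i max(c_i − η, 0), so maximizing Q is the same as minimizing the cut.

```python
from itertools import combinations

# Hypothetical toy instance: 4 SNPs on a chain network.
c = [5.0, 4.0, 0.1, 0.1]                     # SKAT-style association scores
W = {(0, 1): 1.0, (1, 2): 1.0, (2, 3): 1.0}  # network edge weights
eta, lam = 1.0, 0.5
V = set(range(len(c)))

def cross(S):
    """Total weight of network edges crossing the boundary of S."""
    return sum(w for (i, j), w in W.items() if (i in S) != (j in S))

def Q(S):
    """Regularized relevance: association - sparsity - connectivity."""
    return sum(c[i] for i in S) - eta * len(S) - lam * cross(S)

def cut(S):
    """Capacity of the s/t-cut ({s} ∪ S, (V \\ S) ∪ {t})."""
    return (lam * cross(S)
            + sum(max(c[i] - eta, 0.0) for i in V - S)   # severed s-edges
            + sum(max(eta - c[i], 0.0) for i in S))      # severed t-edges

subsets = [set(s) for r in range(len(V) + 1) for s in combinations(V, r)]
best_q = max(subsets, key=Q)
best_cut = min(subsets, key=cut)
assert best_q == best_cut  # argmax Q coincides with argmin cut
```

Here the connected pair {0, 1} wins: the two high-scoring SNPs are adjacent, so selecting them costs only one cut edge.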
-
Experiments: Performance on simulated data
I Arabidopsis thaliana genotypes
n = 500 samples, p = 1,000 SNPs
TAIR Protein-Protein Interaction data, ≈ 50 × 10⁶ edges
I Higher power and lower FDR than comparison partners, except for the group Lasso when groups = causal structure
I Fairly robust to missing edges
I Fails if the network is random.
Image source: Jean Weber / INRA via Flickr.
15
-
SConES: Selecting Connected Explanatory SNPs
I selects connected, explanatory SNPs;
I incorporates large networks into GWAS;
I is efficient, effective and robust.
C.-A. Azencott, D. Grimm, M. Sugiyama, Y. Kawahara and K. Borgwardt (2013) Efficient network-guided multi-locus association mapping with graph cuts, Bioinformatics 29(13), i171–i179, doi:10.1093/bioinformatics/btt238
https://github.com/chagaz/scones
https://github.com/chagaz/sfan
https://github.com/dominikgrimm/easyGWASCore
16
-
Increasing n
17
-
Multi-trait GWAS
Increase sample size by jointly performing GWAS for multiplerelated phenotypes
18
-
Toxicogenetics / Pharmacogenomics
Tasks (phenotypes) = chemical compounds
F. Eduati, L. Mangravite, et al. (2015) Prediction of human population responses to toxic compounds by a collaborative competition. Nature Biotechnology, 33 (9), 933–940, doi:10.1038/nbt.3299
19
-
Multi-SConES
T related phenotypes.
I Goal: obtain similar sets of features on related tasks.
arg max_{S₁, …, S_T ⊆ V}  ∑_{t=1}^T ( ∑_{i∈S_t} c_i − η|S_t| − λ ∑_{i∈S_t} ∑_{j∉S_t} W_ij − µ |S_{t−1} Δ S_t| [task sharing] )
S Δ S′ = (S ∪ S′) \ (S ∩ S′) (symmetric difference)
I Can be reduced to single-task by building a meta-network.
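For intuition, the multi-task objective can be evaluated by brute force on a toy two-task problem (all numbers hypothetical; the actual method uses the max-flow reduction, not enumeration). The symmetric-difference term rewards consecutive tasks for agreeing on their selections.

```python
from itertools import combinations, product

# Hypothetical two-task toy problem on a 3-SNP chain network.
V = {0, 1, 2}
W = {(0, 1): 1.0, (1, 2): 1.0}          # shared network
c = [[3.0, 2.5, 0.1], [2.8, 2.7, 0.2]]  # per-task association scores
eta, lam, mu = 1.0, 0.5, 1.0

def objective(sets):
    """Sum over tasks of association - sparsity - connectivity - sharing."""
    total = 0.0
    for t, S in enumerate(sets):
        total += sum(c[t][i] for i in S) - eta * len(S)
        total -= lam * sum(w for (i, j), w in W.items() if (i in S) != (j in S))
        if t > 0:
            total -= mu * len(sets[t - 1] ^ S)  # symmetric difference penalty
    return total

subsets = [set(s) for r in range(len(V) + 1) for s in combinations(V, r)]
best = max(product(subsets, repeat=2), key=objective)
assert best == ({0, 1}, {0, 1})  # both tasks settle on the connected pair
```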
20
-
Multi-SConES: Multiple related tasks
Simulations: retrieving causal features
[Figure: MSE and MCC for Models 1–4, comparing LA, CR, EN, GL, GR, AG and SC across single-task, two-task, three-task and four-task settings.]
M. Sugiyama, C.-A. Azencott, D. Grimm, Y. Kawahara and K. Borgwardt (2014) Multi-task feature selection on multiple networks via maximum flows, SIAM ICDM, 199–207, doi:10.1137/1.9781611973440.23
https://github.com/mahito-sugiyama/Multi-SConES
https://github.com/chagaz/sfan
21
-
Using task similarity
22
-
Using task similarity
Use prior knowledge about the relationship between the tasks: Ω ∈ ℝ^{T×T}
arg max_{S₁, …, S_T ⊆ V}  ∑_{t=1}^T ( ∑_{i∈S_t} c_i − η|S_t| − λ ∑_{i∈S_t} ∑_{j∉S_t} W_ij − µ ∑_{u=1}^T ∑_{i∈S_t∩S_u} Ω⁻¹_tu [task sharing] )
Can also be mapped to a meta-network.
Code: http://github.com/chagaz/sfan
23
-
Using task descriptors
PhD thesis of Víctor Bellón.
24
-
Multiplicative Multitask Lasso with Task Descriptors
I Multitask Lasso [Obozinski et al. 2006]
arg min_{β ∈ ℝ^{T×p}}  L(yᵗ_m, ∑_{i=1}^p βᵗ_i gᵗ_mi) [loss] + λ ∑_{i=1}^p ||β_i||₂ [task sharing]
I Multilevel Multitask Lasso [Lozano and Swirszcz, 2012]
arg min_{θ ∈ ℝ₊^p, γ ∈ ℝ^{T×p}}  L(yᵗ_m, ∑_{i=1}^p θ_i γᵗ_i gᵗ_mi) [loss] + λ₁ ||θ||₁ [sparsity] + λ₂ ∑_{i=1}^p ∑_{t=1}^T |γᵗ_i| [task sharing]
I Multiplicative Multitask Lasso with Task Descriptors
arg min_{θ ∈ ℝ₊^p, α ∈ ℝ^{p×L}}  L(yᵗ_m, ∑_{i=1}^p θ_i (∑_{l=1}^L α_il dᵗ_l) gᵗ_mi) [loss] + λ₁ ||θ||₁ [sparsity] + λ₂ ∑_{i=1}^p ∑_{l=1}^L |α_il| [task sharing]
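To see why the ℓ1/ℓ2 penalty of the multitask Lasso induces task sharing, here is a small sketch (hypothetical numbers, not from the slides) of its proximal operator: each feature's row of weights, one entry per task, is shrunk as a unit, so a feature is either kept or dropped across all tasks simultaneously.

```python
import numpy as np

# Row-wise group soft-thresholding: the prox of lam * sum_i ||B[i, :]||_2,
# where B is (features x tasks). A whole row is zeroed when its norm is
# below lam, i.e. the feature is removed from every task at once.
def group_soft_threshold(B, lam):
    norms = np.linalg.norm(B, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - lam / np.maximum(norms, 1e-12))
    return scale * B

B = np.array([[3.0, 4.0],    # strong feature: shrunk but kept in both tasks
              [0.3, 0.4]])   # weak feature: removed from both tasks
print(group_soft_threshold(B, lam=2.5))
```

The multiplicative variants above achieve a similar coupling through the shared factor θ_i, while γ or the descriptor weights α let individual tasks deviate.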
25
-
Multiplicative Multitask Lasso with Task Descriptors
arg min_{θ ∈ ℝ₊^p, α ∈ ℝ^{p×L}}  L(yᵗ_m, ∑_{i=1}^p θ_i (∑_{l=1}^L α_il dᵗ_l) gᵗ_mi) [loss] + λ₁ ||θ||₁ [sparsity] + λ₂ ∑_{i=1}^p ∑_{l=1}^L |α_il| [task sharing]
I On simulations:
I Sparser solution
I Better recovery of true features (higher PPV)
I Improved stability
I Better predictivity (RMSE).
26
-
Multiplicative Multitask Lasso with Task Descriptors
I Making predictions for tasks for which you have no data.
V. Bellón, V. Stoven, and C.-A. Azencott (2016) Multitask feature selection with task descriptors, PSB.
https://github.com/vmolina/MultitaskDescriptor
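A sketch of the zero-data prediction idea (all array values hypothetical; the symbols θ, α and d are those of the model above): once θ and α have been learned on other tasks, the descriptors of a brand-new task are enough to assemble its weight vector.

```python
import numpy as np

# Weight of feature i for task t: beta_i^t = theta_i * sum_l alpha_il * d_l^t,
# so a task with no training data still gets a model from its descriptors.
# All values below are made up for illustration.
theta = np.array([1.0, 0.0, 2.0])      # shared per-feature weights (learned)
alpha = np.array([[1.0, 0.0],
                  [5.0, 5.0],
                  [0.5, 0.5]])         # feature-descriptor interactions (learned)
d_new = np.array([1.0, 1.0])           # descriptors of the unseen task

beta_new = theta * (alpha @ d_new)     # task-specific weights
X_new = np.array([[1.0, 1.0, 1.0]])    # genotypes for the unseen task
print(X_new @ beta_new)                # predictions without any task data
```

Note that feature 1 has a large descriptor weight but θ₁ = 0, so it is excluded from every task: sparsity in θ is shared, as in the slides.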
27
-
Limitations of current approaches
I Robustness/stability
Recovering the same SNPs when the data changes slightly.
I Complex interaction patterns
– Limited to additive or quadratic effects
– Some work on e.g. random forests + importance scores.
I Statistical significance
– Computing p-values
– Correcting for multiple hypotheses.
28
-
Further challenges
Privacy
I More data → data sharing → ethical concerns
I How to learn from privacy-protected patient data?
S. Simmons and B. Berger (2016) Realizing privacy preserving genome-wide association studies, Bioinformatics 32 (9), 1293–1300
29
-
Further challenges
Heterogeneity
I Multiple relevant data sources and types
I Multiple (unknown) populations of samples.
Tumor heterogeneity Heterogeneous data sources
L. Gay et al. (2016), F1000Research
30
-
Further challenges
Risk prediction
I State of the art: Polygenic Risk Scores
Linear combination of SNPs passing a p-value threshold (summary statistics),
weighted by log odds ratios / univariate linear regression coefficients.
I More complex models are slow to be adopted – reliability?
H.-C. So and P. C. Sham (2017) Improving polygenic risk prediction from summary statistics by an empirical Bayes approach. Scientific Reports 7.
S. Okser et al. (2014) Regularized machine learning in the genetic prediction of complex traits. PLoS Genet 10(11): e1004754.
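A minimal sketch of such a score (all numbers hypothetical): SNPs whose univariate p-value passes a threshold are kept, and each individual is scored by the log-odds-ratio-weighted sum of their allele dosages.

```python
import numpy as np

# Toy polygenic risk score from (hypothetical) summary statistics.
pvals = np.array([1e-8, 0.5, 1e-6])        # per-SNP GWAS p-values
odds_ratios = np.array([1.2, 0.9, 2.0])    # per-SNP odds ratios
dosages = np.array([[2, 1, 1],
                    [0, 2, 0]])            # individuals x SNPs, values in {0,1,2}

keep = pvals < 1e-5                        # p-value threshold
weights = np.log(odds_ratios) * keep       # log odds ratios, non-selected SNPs zeroed
prs = dosages @ weights                    # one risk score per individual
print(prs)
```

This is exactly the linear, univariate-weight construction the slide questions: the regularized multivariate models cited above aim to improve on it.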
31
-
Further challenges
Bioimage informatics
High-throughput molecular and cellular images
I Subcellular location analysis
I High-content screening
I Segmentation, tracking, registration.
BioImage Informatics http://bioimageinformatics.org/
Detecting cells undergoing apoptosis
32
-
Further challenges
Electronic health records
I Clinical notes: incomplete, imbalanced, time series
I Combine text + images + genetics
I Assisting evidence-based medicine
R. Miotto et al. (2016) Deep Patient: An Unsupervised Representation to Predict the Future of Patients from the Electronic Health Records. Scientific Reports 6.
Machine Learning in Health Care http://mucmd.org/
Previously known as Meaningful Use of Complex Medical Data
33
-
34
-
A few starting places
Data and Challenges
I DREAM Challenges: Crowdsourcing challenges for biology and medicine http://dreamchallenges.org/
I Epidemium: Cancer research through data challenges http://www.epidemium.cc/
I MIMIC: Deidentified electronic health recordshttps://mimic.physionet.org/
I BioImage Informatics Challengeshttps://bii.eecs.wsu.edu/challenges/
35
-
A few starting places
Workshops
I Machine Learning in Healthcare at NIPS
http://www.nipsml4hc.ws/
I Machine Learning in Computational Biology https://mlcb.github.io/
I Machine Learning in Systems Biology http://mlsb.cc
36
-
A few starting places
Basics in molecular biology
I Talk to specialists!
I The DNA Learning Center https://www.dnalc.org/resources/
I Scitable eBooks https://www.nature.com/scitable/ebooks
37
-
https://github.com/chagaz/
source: http://www.flickr.com/photos/wwworks/
CBIO: Víctor Bellón, Yunlong Jiao, Véronique Stoven, Athénaïs Vaginay, Nelle Varoquaux, Jean-Philippe Vert, Thomas Walter.
MLCB Tübingen: Karsten Borgwardt, Aasa Feragen, Dominik Grimm, Theofanis Karaletsos, Niklas Kasenburg, Christoph Lippert, Barbara Rakitsch, Damian Roqueiro, Nino Shervashidze, Oliver Stegle, Mahito Sugiyama.
MPI for Intelligent Systems: Lawrence Cayton, Bernhard Schölkopf.
MPI for Developmental Biology: Detlef Weigel.
MPI for Psychiatry: André Altmann, Tony Kam-Thong, Bertram Müller-Myhsok, Benno Pütz.
Osaka University: Yoshinobu Kawahara.
38