Network-Guided Multi-Locus Association Mapping with Graph Cuts

Chloé-Agathe Azencott

Machine Learning & Computational Biology Research Group

Max Planck Institute for Intelligent Systems & Max Planck Institute for Developmental Biology

Tübingen (Germany)

July 2, 2013

C.-A. Azencott JOBIM 2013-07-02 1 July 2, 2013 1

GWAS: Genome-Wide Association Studies

[Figure: aligned genotype sequences of three individuals, identical except at one position (C / G / C).]

$p = 10^5$ to $10^7$ Single Nucleotide Polymorphisms (SNPs)

$n = 10^2$ to $10^4$ samples

Which SNPs explain the phenotype?

Missing heritability

GWAS often fail to explain much of the heritability of complex traits.

Possible causes:

- Failure to consider rare SNPs
- Failure to account for small effect sizes
- Failure to account for the joint effects of multiple SNPs

Multi-locus GWAS

- Multiplicative models are intractable
- Additive models are hard to interpret

→ integrate prior knowledge

Goal: automatically discover relevant sets of SNPs that follow an underlying network structure.

Feature selection with sparsity and connectivity constraints

- ncLasso: Network Connected LASSO [Li and Li, 2008]
- groupLasso, graphLasso: Overlapping Group LASSO [Jacob et al., 2009]
- Structured sparsity penalty [Huang et al., 2009]
- Path-coding penalties for DAGs [Mairal and Yu, 2011]

Feature selection with sparsity and connectivity constraints

Additive test of association

$$Q(S) = \sum_{i \in S} Q(i) \qquad Q(f) = \sum_{i=1}^{p} c_i f_i = c^\top f$$

E.g. SKAT [Wu et al., 2011]

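Graph-regularized maximization of $Q(\ast)$

$$\arg\max_{f \in \{0,1\}^p} \; \underbrace{c^\top f}_{\text{association}} \; - \; \underbrace{\lambda \, f^\top L f}_{\text{connectivity}} \; - \; \underbrace{\eta \, \|f\|_0}_{\text{sparsity}}$$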

Laplacian regularization

Laplacian: $L = D - W$

$$f^\top L f = \sum_{i \sim j} (f_i - f_j)^2$$
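
As a rough illustration of the objective above, the sketch below evaluates $c^\top f - \lambda f^\top L f - \eta \|f\|_0$ on a toy network and checks that $f^\top L f$ equals the sum over edges. The helper names (`laplacian`, `quadratic_form`, `edge_sum`, `objective`) and the 4-SNP chain are hypothetical, not taken from the talk. The edge sum is written with weights $W_{ij}$, which reduces to the slide's formula when all edge weights are 1.

```python
from itertools import combinations

def laplacian(W):
    """Graph Laplacian L = D - W for a symmetric weight matrix W (list of lists)."""
    p = len(W)
    return [[(sum(W[i]) if i == j else 0.0) - W[i][j] for j in range(p)]
            for i in range(p)]

def quadratic_form(L, f):
    """Compute f^T L f."""
    p = len(f)
    return sum(f[i] * L[i][j] * f[j] for i in range(p) for j in range(p))

def edge_sum(W, f):
    """Sum of W_ij * (f_i - f_j)^2 over unordered pairs i < j."""
    return sum(W[i][j] * (f[i] - f[j]) ** 2
               for i, j in combinations(range(len(f)), 2))

def objective(c, W, f, lam, eta):
    """c^T f - lam * f^T L f - eta * ||f||_0 for a binary indicator vector f."""
    association = sum(ci * fi for ci, fi in zip(c, f))
    connectivity = quadratic_form(laplacian(W), f)
    sparsity = sum(1 for fi in f if fi != 0)
    return association - lam * connectivity - eta * sparsity

# Hypothetical toy data: a chain of 4 SNPs (0 - 1 - 2 - 3) with unit edge weights.
W = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
c = [2.0, 1.5, 0.25, 0.5]   # made-up association scores
f = [1, 1, 0, 0]            # select SNPs 0 and 1

smoothness = quadratic_form(laplacian(W), f)   # one edge (1, 2) is cut
score = objective(c, W, f, lam=0.5, eta=1.0)   # 3.5 - 0.5 * 1 - 1.0 * 2
```

Selecting two neighbouring SNPs cuts a single network edge, so the connectivity penalty stays small compared with selecting two SNPs that are far apart in the network.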

Minimum cut reformulation

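Proposition

The graph-regularized maximization of score $Q(\ast)$ is equivalent to an s/t-min-cut for a graph with adjacency matrix $A$ and two additional nodes $s$ and $t$, where $A_{ij} = \lambda W_{ij}$ for $1 \le i, j \le p$ and the weights of the edges adjacent to nodes $s$ and $t$ are defined as

$$A_{si} = \begin{cases} c_i - \eta & \text{if } c_i > \eta \\ 0 & \text{otherwise} \end{cases} \qquad \text{and} \qquad A_{it} = \begin{cases} \eta - c_i & \text{if } c_i < \eta \\ 0 & \text{otherwise.} \end{cases}$$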
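
One way to see why the proposition holds: for any binary $f$, the cost of the cut that keeps the selected SNPs on the source side equals a constant $K = \sum_i \max(c_i - \eta, 0)$ minus the objective, so minimizing the cut maximizes the objective. The sketch below checks this identity by enumeration on a toy instance. The function names and data are hypothetical, and the network term uses $f^\top L f = \sum_{i<j} W_{ij}(f_i - f_j)^2$.

```python
from itertools import product

def st_capacities(c, eta):
    """Terminal edge weights from the proposition: A_si and A_it for each SNP i."""
    A_s = [ci - eta if ci > eta else 0.0 for ci in c]
    A_t = [eta - ci if ci < eta else 0.0 for ci in c]
    return A_s, A_t

def cut_cost(c, W, f, lam, eta):
    """Cost of the s/t cut that puts the SNPs with f_i = 1 on the source side."""
    A_s, A_t = st_capacities(c, eta)
    p = len(c)
    cost = sum(A_s[i] for i in range(p) if f[i] == 0)    # cut s -> i edges
    cost += sum(A_t[i] for i in range(p) if f[i] == 1)   # cut i -> t edges
    cost += sum(lam * W[i][j]                            # cut network edges
                for i in range(p) for j in range(i + 1, p) if f[i] != f[j])
    return cost

def objective(c, W, f, lam, eta):
    """c^T f - lam * f^T L f - eta * ||f||_0 for binary f."""
    p = len(c)
    smooth = sum(W[i][j] * (f[i] - f[j]) ** 2
                 for i in range(p) for j in range(i + 1, p))
    return sum(ci * fi for ci, fi in zip(c, f)) - lam * smooth - eta * sum(f)

# Hypothetical toy instance: chain of 4 SNPs with unit edge weights.
W = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
c = [2.0, 1.5, 0.25, 0.5]
lam, eta = 0.5, 1.0

K = sum(st_capacities(c, eta)[0])   # constant offset, independent of f
gaps = [cut_cost(c, W, f, lam, eta) + objective(c, W, f, lam, eta) - K
        for f in product([0, 1], repeat=len(c))]
```

Every entry of `gaps` should be zero, which is the enumerated form of the equivalence stated in the proposition.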

SConES: Selecting Connected Explanatory SNPs

Solve

$$\arg\max_{f \in \{0,1\}^p} \; c^\top f - \lambda \, f^\top L f - \eta \, \|f\|_0$$

by solving its min-cut reformulation with the Boykov-Kolmogorov maxflow algorithm.
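
A minimal sketch of the min-cut route on a toy problem follows. It is not the authors' implementation: it swaps in a plain Edmonds-Karp max-flow on a dense matrix instead of Boykov-Kolmogorov, and all names (`build_capacities`, `source_side_of_min_cut`, `select_snps`) and data are hypothetical. The selected SNPs are those left on the source side of the minimum cut; a brute-force search over all $2^p$ indicator vectors serves as a check.

```python
from collections import deque
from itertools import product

def build_capacities(c, W, lam, eta):
    """Dense capacity matrix over nodes 0..p-1 (SNPs), p (source s) and p+1 (sink t)."""
    p = len(c)
    s, t = p, p + 1
    cap = [[0.0] * (p + 2) for _ in range(p + 2)]
    for i in range(p):
        for j in range(p):
            cap[i][j] = lam * W[i][j]      # A_ij = lambda * W_ij
        if c[i] > eta:
            cap[s][i] = c[i] - eta         # A_si
        elif c[i] < eta:
            cap[i][t] = eta - c[i]         # A_it
    return cap, s, t

def source_side_of_min_cut(cap, s, t, tol=1e-12):
    """Edmonds-Karp max-flow; returns nodes reachable from s in the final residual graph."""
    n = len(cap)
    res = [row[:] for row in cap]
    while True:
        parent = [-1] * n
        parent[s] = s
        queue = deque([s])
        while queue and parent[t] == -1:   # breadth-first search for an augmenting path
            u = queue.popleft()
            for v in range(n):
                if parent[v] == -1 and res[u][v] > tol:
                    parent[v] = u
                    queue.append(v)
        if parent[t] == -1:                # no augmenting path left: the flow is maximal
            return {v for v in range(n) if parent[v] != -1}
        bottleneck, v = float("inf"), t
        while v != s:                      # smallest residual capacity on the path
            bottleneck = min(bottleneck, res[parent[v]][v])
            v = parent[v]
        v = t
        while v != s:                      # push flow along the path
            res[parent[v]][v] -= bottleneck
            res[v][parent[v]] += bottleneck
            v = parent[v]

def select_snps(c, W, lam, eta):
    """Indicator vector f: 1 for SNPs on the source side of the minimum cut."""
    cap, s, t = build_capacities(c, W, lam, eta)
    reachable = source_side_of_min_cut(cap, s, t)
    return [1 if i in reachable else 0 for i in range(len(c))]

def objective(c, W, f, lam, eta):
    """c^T f - lam * f^T L f - eta * ||f||_0 for binary f."""
    p = len(c)
    smooth = sum(W[i][j] * (f[i] - f[j]) ** 2
                 for i in range(p) for j in range(i + 1, p))
    return sum(ci * fi for ci, fi in zip(c, f)) - lam * smooth - eta * sum(f)

# Hypothetical toy instance: chain of 5 SNPs with unit edge weights.
W = [[1 if abs(i - j) == 1 else 0 for j in range(5)] for i in range(5)]
c = [3.0, 2.5, 0.25, 0.125, 1.75]
lam, eta = 0.5, 1.0

selected = select_snps(c, W, lam, eta)
best = max(objective(c, W, list(f), lam, eta)
           for f in product([0, 1], repeat=len(c)))
```

On this instance the cut keeps SNPs 0, 1 and 4: their scores exceed $\eta$ by enough to pay for the two network edges that are cut, while SNPs 2 and 3 are too weak to be worth including as a bridge.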

Parameter selection

Consistency

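$$I_C(S, S') := \frac{\mathrm{Observed}(|S \cap S'|) - \mathrm{Expected}(|S \cap S'|)}{\mathrm{Maximum}(|S \cap S'|) - \mathrm{Expected}(|S \cap S'|)}$$

$$\mathrm{Maximum}(|S \cap S'|) = \min(|S|, |S'|)$$

$$\mathrm{Expected}(|S \cap S'|) = \frac{|S||S'|}{n}$$

where $n$ here denotes the total number of features (SNPs) from which $S$ and $S'$ are drawn.

$$I_C(S, S') = \frac{n \, |S \cap S'| - |S||S'|}{n \min(|S|, |S'|) - |S||S'|}$$

k-fold cross-validation (average over all pairs of folds):

$$I_C(S_1, S_2, \ldots, S_k) = \frac{2}{k(k-1)} \sum_{i=1}^{k} \sum_{j=i+1}^{k} I_C(S_i, S_j)$$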
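
The consistency index can be sketched in a few lines. The function names are hypothetical, `n` is taken to be the total number of features the sets are drawn from, and the k-fold version is assumed to be the mean over all pairs of folds.

```python
from itertools import combinations

def consistency_index(S, T, n):
    """Pairwise consistency index between two selected feature sets.

    n is the total number of features (assumption: the universe both sets come from).
    """
    S, T = set(S), set(T)
    denom = n * min(len(S), len(T)) - len(S) * len(T)
    if denom == 0:
        raise ValueError("index undefined: empty set or a set covering all features")
    return (n * len(S & T) - len(S) * len(T)) / denom

def mean_consistency(sets, n):
    """Average pairwise consistency over the sets selected in each fold."""
    pairs = list(combinations(sets, 2))
    return sum(consistency_index(a, b, n) for a, b in pairs) / len(pairs)

# Hypothetical selections from three cross-validation folds over n = 10 SNPs.
folds = [{0, 1, 2}, {0, 1, 2}, {0, 1, 3}]
identical = consistency_index(folds[0], folds[1], n=10)   # same set in both folds
disjoint = consistency_index({0, 1}, {2, 3}, n=10)        # no overlap at all
average = mean_consistency(folds, n=10)
```

Identical selections score 1, while selections that overlap less than expected by chance score below 0. A parameter setting that yields a high average across folds is selecting a stable set of SNPs.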

Networks between SNPs

[Figure: example networks between SNPs. Panels show SNPs 1 to 6 along the genomic sequence, SNPs 1 to 6 with gene1, and SNPs 1 to 9 with gene1 and gene2.]

Experiments: Comparison partners

Univariate linear regression: $y_k = \alpha_0 + \beta G_{ik}$

Lasso: $\arg\min_{f \in \mathbb{R}^p} \frac{1}{2} \|Gf - y\|_2^2 + \lambda \|f\|_1$

graphLasso, ncLasso, SConES: "Gene Membership" network, where SNPs near the same gene are connected

groupLasso: "Gene Membership" groups, where SNPs near the same gene are grouped together
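
As an illustration of the "Gene Membership" structure, the sketch below derives both the groups (for groupLasso) and the network edges (for graphLasso, ncLasso and SConES) from SNP positions and gene intervals. Everything here is hypothetical: the function names, the coordinates, and the definition of "near" as lying within `window` base pairs of a gene.

```python
from itertools import combinations

def gene_membership(snp_positions, genes, window=0):
    """Map each gene name to the indices of SNPs within `window` bp of the gene."""
    return {name: [i for i, pos in enumerate(snp_positions)
                   if start - window <= pos <= end + window]
            for name, (start, end) in genes.items()}

def membership_network(members):
    """Connect every pair of SNPs that are assigned to the same gene."""
    edges = set()
    for snps in members.values():
        edges.update(combinations(sorted(snps), 2))
    return sorted(edges)

# Hypothetical coordinates on a single chromosome.
snp_positions = [100, 250, 400, 900, 1100, 1300]
genes = {"gene1": (50, 450), "gene2": (850, 1150)}

groups = gene_membership(snp_positions, genes)   # groups for groupLasso
edges = membership_network(groups)               # edges for the network methods
```

Here SNPs 0, 1 and 2 form one clique, SNPs 3 and 4 another, and SNP 5 lies near no gene, so it stays isolated.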

Experiments: Runtime

[Figure: CPU runtime in seconds (log scale, $10^{-2}$ to $10^{6}$) versus number of SNPs ($10^{2}$ to $10^{6}$) for graphLasso, ncLasso, ncLasso (accelerated), SConES and linear regression.]

n = 200, exponential random network (2% density)

Experiments: Data simulation

Arabidopsis thaliana genotypes

n = 500 samples, p = 1,000 SNPs

TAIR Protein-Protein Interaction data → $\sim 50 \times 10^6$ edges

20 causal SNPs: $y = \omega^\top x + \epsilon$

- Causal SNPs adjacent in the genomic sequence
- Causal SNPs near the same gene
- Causal SNPs near any of 2–5 interacting genes
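
A small sketch of the first scenario (causal SNPs adjacent in the sequence) under loudly stated assumptions: the talk uses real Arabidopsis thaliana genotypes, whereas this example draws random binary genotypes as a stand-in, and the weight and noise distributions are arbitrary choices. The function name `simulate_phenotype` is hypothetical.

```python
import random

def simulate_phenotype(X, causal, weights, noise_sd, rng):
    """y = w^T x + eps per sample; only the causal SNPs carry a non-zero weight."""
    return [sum(w * x[j] for j, w in zip(causal, weights)) + rng.gauss(0.0, noise_sd)
            for x in X]

rng = random.Random(0)
n, p, n_causal = 500, 1000, 20

# Stand-in genotypes (assumption): random 0/1 alleles instead of real data.
X = [[rng.randint(0, 1) for _ in range(p)] for _ in range(n)]

# Scenario 1: a block of 20 causal SNPs adjacent in the genomic sequence.
start = rng.randrange(p - n_causal + 1)
causal = list(range(start, start + n_causal))

# Effect sizes and noise level are arbitrary illustrative choices.
weights = [rng.gauss(0.0, 1.0) for _ in causal]
y = simulate_phenotype(X, causal, weights, noise_sd=1.0, rng=rng)
```

The other two scenarios differ only in how `causal` is chosen: from the SNPs near one gene, or from the SNPs near a few interacting genes.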

Experiments: Performance on simulated data

[Figure: power (y-axis, 0 to 1) versus FDR (x-axis, 0 to 1) in three panels: "Adjacent", "Near the same gene", and "Near one of 5 interacting genes". Methods: Univariate, Lasso, ncLasso, groupLasso, graphLasso, SConES.]

Experiments: Performance on simulated data

- Higher power and lower FDR than comparison partners
  - Except groupLasso when the groups match the causal structure
- Systematically better than the relaxed version (ncLasso)
- Fairly robust to missing edges
- Fails if the network is random
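
For reference, power and FDR on simulated data can be computed from the selected and the truly causal SNP sets, as in this sketch. The function name and example sets are hypothetical, and returning 0 for an empty selection is a convention chosen here.

```python
def power_and_fdr(selected, causal):
    """Power = fraction of causal SNPs recovered; FDR = fraction of selected SNPs that are not causal."""
    selected, causal = set(selected), set(causal)
    true_pos = len(selected & causal)
    power = true_pos / len(causal) if causal else 0.0
    fdr = (len(selected) - true_pos) / len(selected) if selected else 0.0
    return power, fdr

# Hypothetical example: 5 causal SNPs, 4 selected, 3 of them correct.
power, fdr = power_and_fdr(selected={0, 1, 2, 7}, causal={0, 1, 2, 3, 4})
```

A good method sits in the top-left corner of a power versus FDR plot: it recovers most causal SNPs while selecting few spurious ones.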

Experiments: Arabidopsis thaliana flowering time

17 flowering time phenotypes [Atwell et al. 2010]

p ∼ 170,000 SNPs (after MAF filtering)

n ∼ 150 samples

165 candidate genes [Segura et al. 2012]

Correction for population structure: regress out PCs
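
One simple reading of "regress out PCs" is to replace the phenotype by its residuals after regression on the leading principal components of the genotypes. The sketch below assumes the PC score vectors are already computed, centred and mutually orthogonal, so that projecting them out one by one matches a joint least-squares fit. Whether the talk applies this to the phenotype alone is not stated, and the function name and data are hypothetical.

```python
def regress_out(y, components):
    """Residuals of y after regression on an intercept and the given component scores.

    Assumes each component is centred and the components are mutually orthogonal.
    """
    mean_y = sum(y) / len(y)
    resid = [v - mean_y for v in y]               # remove the intercept
    for pc in components:
        beta = sum(r * s for r, s in zip(resid, pc)) / sum(s * s for s in pc)
        resid = [r - beta * s for r, s in zip(resid, pc)]
    return resid

# Hypothetical example with 4 samples and 2 orthogonal, centred components.
pc1 = [-1.0, 0.0, 1.0, 0.0]
pc2 = [0.0, -1.0, 0.0, 1.0]
y = [2.0, 3.0, 6.0, 1.0]          # built as 3 + 2*pc1 - 1*pc2 + [1, -1, 1, -1]
residuals = regress_out(y, [pc1, pc2])
```

The residuals keep only the part of the phenotype that the components cannot explain, which is then used in place of the raw phenotype.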

Experiments: Arabidopsis thaliana flowering time

[Figure: number of candidate genes hit (y-axis, 0 to 10) for Univariate, Lasso, groupLasso, ncLasso and SConES. Blue annotations give the number of selected SNPs; values in extraction order: 5, 86, 611, 608, 546.]

Experiments: Arabidopsis thaliana flowering time

Predictivity

[Figure: $R^2$ (y-axis, 0 to 1) of Lasso, groupLasso, ncLasso and SConES for each phenotype: 0W, 0W GH LN, 4W, 8W GH FT, FLC, FT GH, LDV, LN16, SD, 0W GH FT, 2W, 8W GH LN, FRI, FT Field, LN10, LN22, SDV.]
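
The talk does not say how $R^2$ is computed. A common choice, assumed here, is the coefficient of determination of predictions on held-out samples; the function name and numbers are hypothetical.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ss_tot = sum((a - mean_y) ** 2 for a in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical held-out phenotype values and predictions.
y_true = [1.0, 2.0, 3.0, 4.0]
perfect = r_squared(y_true, y_true)                  # predictions match exactly
baseline = r_squared(y_true, [2.5, 2.5, 2.5, 2.5])   # always predict the mean
```

A perfect predictor scores 1 and predicting the mean scores 0, so higher bars mean the selected SNPs carry more predictive signal.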

Availability

[Image: ISMB/ECCB 2013 conference announcement (SIGs & tutorials July 19–20, conference July 21–23), www.iscb.org/ismbeccb2013]

C.-A. Azencott, D. Grimm, M. Sugiyama, Y. Kawahara, and K. Borgwardt.

Efficient network-guided multi-locus association mapping with graph cuts.

Bioinformatics (2013) 29 (13): i171-i179.

Code available from http://agkb.is.tuebingen.mpg.de

Summary

SConES

- selects connected, explanatory SNPs;
- incorporates large networks into GWAS;
- is efficient, effective and robust.

Image source: http://www.flickr.com/photos/fimbrethil/

Future directions

- Other structure-inducing regularizers
  - groups
  - networks
- Defining the SNP network
- Learning the SNP network
- More models of association
- Determining p-values
- GPU speed-up
- Application to human data: COPDGene, IHGC (migraine)

Image source: http://www.flickr.com/photos/buckaroobay/

Acknowledgements

MLCB Tübingen: Dominik Grimm, Mahito Sugiyama, Karsten Borgwardt, Recep Colak (U. Toronto), Barbara Rakitsch, Nino Shervashidze (INRIA)

ISIR Osaka University: Yoshinobu Kawahara

MPI for Intelligent Systems: Bernhard Schölkopf

MPI for Developmental Biology: Detlef Weigel

MPI for Psychiatry (Munich): Bertram Müller-Myhsok, Tony Kam-Thong (Roche)

References

S. Atwell et al. Genome-wide association study of 107 phenotypes in Arabidopsis thaliana inbred lines. Nature, 465(7298):627–631, 2010.

J. Huang, T. Zhang, and D. Metaxas. Learning with structured sparsity. In ICML, pages 417–424, New York, NY, USA, 2009.

L. Jacob, G. Obozinski, and J.-P. Vert. Group lasso with overlap and graph lasso. In ICML, pages 433–440, 2009.

C. Li and H. Li. Network-constrained regularization and variable selection for analysis of genomic data. Bioinformatics, 24(9):1175–1182, 2008.

J. Mairal and B. Yu. Path coding penalties for directed acyclic graphs. In NIPS OPT, 2011.

V. Segura et al. An efficient multi-locus mixed-model approach for genome-wide association studies in structured populations. Nat Genet, 44(7):825–830, 2012.
