Network-Guided Multi-Locus Association Mapping with Graph Cuts
Chloe-Agathe Azencott
Machine Learning & Computational Biology Research Group
Max Planck Institute for Intelligent Systems & Max Planck Institute for Developmental Biology
Tübingen (Germany)
July 2, 2013
C.-A. Azencott JOBIM 2013-07-02 1 July 2, 2013 1
GWAS: Genome-Wide Association Studies
[Figure: example genotype sequences from several samples, differing at one SNP position (C/G)]
$p = 10^5$ to $10^7$ Single Nucleotide Polymorphisms (SNPs)
$n = 10^2$ to $10^4$ samples
Which SNPs explain the phenotype?
Missing heritability
GWAS often fail to explain much of the heritability of complex traits.
Possible causes:
- Failure to consider rare SNPs
- Failure to account for small effect sizes
- Failure to account for the joint effects of multiple SNPs
Multi-locus GWAS
- Multiplicative models are intractable
- Additive models are hard to interpret
→ integrate prior knowledge
Goal: automatically discover relevant sets of SNPs that follow an underlying network structure.
Feature selection with sparsity and connectivity constraints
- ncLasso: Network Connected LASSO [Li and Li, 2008]
- groupLasso, graphLasso: Overlapping Group LASSO [Jacob et al., 2009]
- Structured sparsity penalty [Huang et al., 2009]
- Path-coding penalties for DAGs [Mairal et al., 2011]
Feature selection with sparsity and connectivity constraints
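Additive test of association
$$Q(\mathcal{S}) = \sum_{i \in \mathcal{S}} Q(i), \qquad Q(f) = \sum_{i=1}^{p} c_i f_i = c^\top f$$
E.g. SKAT [Wu et al., 2011]
Graph-regularized maximization of Q(∗)
$$\arg\max_{f \in \{0,1\}^p} \; \underbrace{c^\top f}_{\text{association}} - \underbrace{\lambda f^\top L f}_{\text{connectivity}} - \underbrace{\eta \|f\|_0}_{\text{sparsity}}$$
Laplacian regularization
Laplacian: $L = D - W$
$$f^\top L f = \sum_{i \sim j} (f_i - f_j)^2$$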
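To make the three terms of the objective concrete, here is a minimal sketch that evaluates them on a toy network and finds the best indicator vector by exhaustive search. The 4-SNP chain, the scores `c` and the values of `lam` and `eta` are made up for illustration, and the edge sum is written with weights $W_{ij}$, which coincides with the expression above when all weights are 0 or 1. Exhaustive search is only feasible for a handful of SNPs.

```python
from itertools import product

def laplacian_quadratic(f, W):
    """Return f^T L f with L = D - W, computed entry by entry."""
    p = len(f)
    degree = [sum(W[i]) for i in range(p)]
    total = 0.0
    for i in range(p):
        for j in range(p):
            L_ij = (degree[i] if i == j else 0.0) - W[i][j]
            total += f[i] * L_ij * f[j]
    return total

def edge_disagreement(f, W):
    """Return the sum over edges i~j (i < j) of W_ij * (f_i - f_j)^2."""
    p = len(f)
    return sum(W[i][j] * (f[i] - f[j]) ** 2
               for i in range(p) for j in range(i + 1, p))

def objective(f, c, W, lam, eta):
    """Association minus connectivity penalty minus sparsity penalty."""
    association = sum(ci * fi for ci, fi in zip(c, f))
    connectivity = laplacian_quadratic(f, W)
    sparsity = sum(1 for fi in f if fi != 0)   # ||f||_0
    return association - lam * connectivity - eta * sparsity

# Hypothetical toy data: 4 SNPs on a chain 0-1-2-3 with 0/1 edge weights.
c = [3.0, 0.5, 2.5, 0.1]
W = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
lam, eta = 0.5, 1.0

# Exhaustive search over all 2^p indicator vectors (toy sizes only).
best = max(product((0, 1), repeat=len(c)),
           key=lambda f: objective(f, c, W, lam, eta))
print(best, objective(best, c, W, lam, eta))   # (1, 1, 1, 0) 2.5
```

In this toy case SNP 1 is selected even though its score is below `eta`, because it connects two strongly associated SNPs and selecting it avoids two cut edges.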
Minimum cut reformulation
Proposition
The graph-regularized maximization of score Q(∗) is equivalent to an s/t-min-cut for a graph with adjacency matrix A and two additional nodes s and t, where $A_{ij} = \lambda W_{ij}$ for $1 \le i, j \le p$ and the weights of the edges adjacent to nodes s and t are defined as
$$A_{si} = \begin{cases} c_i - \eta & \text{if } c_i > \eta \\ 0 & \text{otherwise} \end{cases} \qquad \text{and} \qquad A_{it} = \begin{cases} \eta - c_i & \text{if } c_i < \eta \\ 0 & \text{otherwise.} \end{cases}$$
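A short sketch of why the proposition holds, under two assumptions that are not spelled out on the slide: the Laplacian term is taken in its weighted form $f^\top L f = \sum_{i \sim j} W_{ij} (f_i - f_j)^2$, and $\mathcal{S} = \{i : f_i = 1\}$ denotes the SNPs on the source side of the cut. Since $\|f\|_0 = \sum_i f_i$ for binary $f$:

```latex
\begin{align*}
\arg\max_{f \in \{0,1\}^p} \; c^\top f - \lambda f^\top L f - \eta \|f\|_0
  &= \arg\min_{f \in \{0,1\}^p} \; \sum_{i=1}^{p} (\eta - c_i) f_i + \lambda f^\top L f, \\
\sum_{i=1}^{p} (\eta - c_i) f_i
  &= \sum_{i : c_i < \eta} (\eta - c_i) f_i
   + \sum_{i : c_i > \eta} (c_i - \eta)(1 - f_i)
   - \sum_{i : c_i > \eta} (c_i - \eta) \\
  &= \sum_{i \in \mathcal{S}} A_{it} + \sum_{i \notin \mathcal{S}} A_{si} + \text{const}, \\
\lambda f^\top L f
  &= \sum_{i \in \mathcal{S},\, j \notin \mathcal{S}} \lambda W_{ij}
   = \sum_{i \in \mathcal{S},\, j \notin \mathcal{S}} A_{ij}.
\end{align*}
```

The right-hand sides add up to the capacity of the cut separating $\mathcal{S} \cup \{s\}$ from its complement, up to a constant that does not depend on $f$, so minimizing the cut maximizes the original score.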
SConES: Selecting Connected Explanatory SNPs
Solve
$$\arg\max_{f \in \{0,1\}^p} \; c^\top f - \lambda f^\top L f - \eta \|f\|_0$$
by solving its min-cut reformulation with the Boykov-Kolmogorov maxflow algorithm.
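The min-cut reformulation can be sketched in a few lines of plain Python. This is an illustration only, not the authors' implementation: it swaps in the simpler Edmonds-Karp augmenting-path algorithm for Boykov-Kolmogorov, uses a dense capacity matrix that would not scale to real GWAS sizes, and the function name `scones_min_cut` and the toy inputs are hypothetical.

```python
from collections import deque

def scones_min_cut(c, W, lam, eta):
    """Return the 0/1 selection vector given by the source side of a minimum s/t cut.

    Edmonds-Karp (BFS augmenting paths) is used for brevity as a stand-in
    for the Boykov-Kolmogorov algorithm named in the talk.
    """
    p = len(c)
    s, t = p, p + 1
    cap = [[0.0] * (p + 2) for _ in range(p + 2)]   # residual capacities
    for i in range(p):
        for j in range(p):
            if i != j:
                cap[i][j] = lam * W[i][j]           # A_ij = lambda * W_ij
        if c[i] > eta:
            cap[s][i] = c[i] - eta                  # A_si
        elif c[i] < eta:
            cap[i][t] = eta - c[i]                  # A_it

    def bfs_parents():
        """BFS from s in the residual graph; parent[v] == -1 means unreachable."""
        parent = [-1] * (p + 2)
        parent[s] = s
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in range(p + 2):
                if parent[v] == -1 and cap[u][v] > 1e-12:
                    parent[v] = u
                    queue.append(v)
        return parent

    while True:
        parent = bfs_parents()
        if parent[t] == -1:          # no augmenting path left: flow is maximal
            break
        bottleneck = float("inf")
        v = t
        while v != s:                # find the smallest residual capacity on the path
            bottleneck = min(bottleneck, cap[parent[v]][v])
            v = parent[v]
        v = t
        while v != s:                # push flow along the path
            u = parent[v]
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck
            v = u

    # SNPs still reachable from s lie on the source side, i.e. are selected.
    return tuple(1 if parent[i] != -1 else 0 for i in range(p))

# Hypothetical toy data: 4 SNPs on a chain 0-1-2-3 with 0/1 edge weights.
c = [3.0, 0.5, 2.5, 0.1]
W = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
print(scones_min_cut(c, W, lam=0.5, eta=1.0))   # (1, 1, 1, 0)
```

When several cuts are minimal, this sketch returns the one whose source side is the set of nodes reachable from s in the final residual graph.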
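Parameter selection
Consistency
$$I_C(\mathcal{S}, \mathcal{S}') := \frac{\text{Observed}(|\mathcal{S} \cap \mathcal{S}'|) - \text{Expected}(|\mathcal{S} \cap \mathcal{S}'|)}{\text{Maximum}(|\mathcal{S} \cap \mathcal{S}'|) - \text{Expected}(|\mathcal{S} \cap \mathcal{S}'|)}$$
$$\text{Maximum}(|\mathcal{S} \cap \mathcal{S}'|) = \min(|\mathcal{S}|, |\mathcal{S}'|)$$
$$\text{Expected}(|\mathcal{S} \cap \mathcal{S}'|) = \frac{|\mathcal{S}||\mathcal{S}'|}{p}, \quad \text{with } p \text{ the total number of SNPs}$$
$$I_C(\mathcal{S}, \mathcal{S}') = \frac{p\,|\mathcal{S} \cap \mathcal{S}'| - |\mathcal{S}||\mathcal{S}'|}{p \min(|\mathcal{S}|, |\mathcal{S}'|) - |\mathcal{S}||\mathcal{S}'|}$$
k-fold cross-validation (average over all pairs of folds):
$$I_C(\mathcal{S}_1, \mathcal{S}_2, \ldots, \mathcal{S}_k) = \frac{2}{k(k-1)} \sum_{i=1}^{k} \sum_{j=i+1}^{k} I_C(\mathcal{S}_i, \mathcal{S}_j)$$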
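The consistency index is simple to compute directly from the selected sets. The sketch below is an illustration with made-up selections; `p` stands for the total number of SNPs (the denominator of the expected overlap), and the function names are hypothetical.

```python
from itertools import combinations

def consistency_index(S, S_prime, p):
    """Consistency between two selected SNP sets drawn from p SNPs."""
    S, S_prime = set(S), set(S_prime)
    product_of_sizes = len(S) * len(S_prime)
    numerator = p * len(S & S_prime) - product_of_sizes
    denominator = p * min(len(S), len(S_prime)) - product_of_sizes
    if denominator == 0:
        # Happens when one set is empty or contains all p SNPs.
        raise ValueError("consistency index is undefined for these set sizes")
    return numerator / denominator

def mean_consistency(selections, p):
    """Average pairwise consistency over the sets selected in k folds (k >= 2)."""
    pairs = list(combinations(selections, 2))   # k(k-1)/2 pairs
    return sum(consistency_index(a, b, p) for a, b in pairs) / len(pairs)

# Hypothetical selections from 3 cross-validation folds, out of p = 10 SNPs.
folds = [{1, 2, 3}, {1, 2, 3}, {1, 2, 4}]
print(consistency_index(folds[0], folds[1], p=10))   # 1.0
print(mean_consistency(folds, p=10))
```

Identical selections score 1, and selections that overlap no more than expected by chance score about 0, which is what makes the index usable for choosing λ and η by cross-validation.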
Networks between SNPs
[Figure: example SNP networks built over a SNP sequence (SNPs 1 to 6, then SNPs 1 to 9), with SNPs annotated to gene1 and gene2]
Experiments: Comparison partners
Univariate linear regression: $y_k = \alpha_0 + \beta G_{ik}$
Lasso: $\arg\min_{f \in \mathbb{R}^p} \frac{1}{2} \|Gf - y\|_2^2 + \lambda \|f\|_1$
graphLasso, ncLasso, SConES: "Gene Membership" network (SNPs near the same gene are connected)
groupLasso: "Gene Membership" groups (SNPs near the same gene are grouped together)
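As an illustration of the "Gene Membership" construction, the sketch below derives both the network edges (for graphLasso, ncLasso and SConES) and the groups (for groupLasso) from one SNP-to-gene mapping. The mapping itself is hypothetical: how "near" is defined (for example, a distance window around each gene) is not specified here and is assumed to have been decided upstream.

```python
from itertools import combinations

def gene_membership(snp_to_genes):
    """Return (edges, groups) from a mapping SNP id -> genes the SNP is near.

    edges:  set of SNP pairs (a, b), a < b, that share at least one gene
    groups: dict gene -> set of SNPs near that gene
    """
    groups = {}
    for snp, genes in snp_to_genes.items():
        for gene in genes:
            groups.setdefault(gene, set()).add(snp)
    edges = set()
    for snps in groups.values():
        # Connect every pair of SNPs assigned to the same gene.
        edges.update(combinations(sorted(snps), 2))
    return edges, groups

# Hypothetical annotation: SNPs 1-3 near gene1, SNPs 5-6 near gene2, SNP 4 intergenic.
snp_to_genes = {1: ["gene1"], 2: ["gene1"], 3: ["gene1"],
                4: [], 5: ["gene2"], 6: ["gene2"]}
edges, groups = gene_membership(snp_to_genes)
print(sorted(edges))   # [(1, 2), (1, 3), (2, 3), (5, 6)]
```

Feeding the same annotation to all methods keeps the comparison fair: the network-based methods see it as edges, groupLasso sees it as groups.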
Experiments: Runtime
[Figure: CPU runtime in seconds (log scale, 10^-2 to 10^6) versus number of SNPs (10^2 to 10^6) for graphLasso, ncLasso, ncLasso (accelerated), SConES and linear regression]
n = 200 exponential random network (2% density)
Experiments: Data simulation
Arabidopsis thaliana genotypes
n = 500 samples, p = 1,000 SNPs
TAIR Protein-Protein Interaction data → ~50 × 10^6 edges
20 causal SNPs: $y = \omega^\top x + \varepsilon$
- Causal SNPs adjacent in the genomic sequence
- Causal SNPs near the same gene
- Causal SNPs near any of 2–5 interacting genes
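The phenotype model $y = \omega^\top x + \varepsilon$ can be sketched as follows for the "adjacent in the genomic sequence" scenario. Several choices here are assumptions made for illustration and are not stated in the talk: genotypes are drawn at random as 0/1 values (the talk uses real Arabidopsis thaliana genotypes), and both the effect sizes ω and the noise ε are Gaussian.

```python
import random

def simulate_phenotype(genotypes, causal, rng, noise_sd=1.0):
    """Return (y, omega) with y_k = sum over causal j of omega_j * x_kj + eps_k."""
    omega = {j: rng.gauss(0.0, 1.0) for j in causal}   # assumed Gaussian effects
    y = []
    for x in genotypes:
        signal = sum(w * x[j] for j, w in omega.items())
        y.append(signal + rng.gauss(0.0, noise_sd))    # assumed Gaussian noise
    return y, omega

rng = random.Random(0)
n, p = 500, 1000
# Placeholder genotypes: random 0/1 alleles instead of real data.
genotypes = [[rng.randint(0, 1) for _ in range(p)] for _ in range(n)]
causal = list(range(100, 120))   # 20 SNPs adjacent in the sequence
y, omega = simulate_phenotype(genotypes, causal, rng)
print(len(y), len(omega))   # 500 20
```

The other two scenarios differ only in how `causal` is chosen: from the SNPs near one gene, or from the SNPs near a few interacting genes.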
Experiments: Performance on simulated data
[Figure: power (y-axis, 0 to 1) versus FDR (x-axis, 0 to 1) on simulated data, in three panels: "Adjacent", "Near the same gene", "Near one of 5 interacting genes". Methods: Univariate, Lasso, ncLasso, groupLasso, graphLasso, SConES]
Experiments: Performance on simulated data
- Higher power and lower FDR than comparison partners
- Except groupLasso when groups = causal structure
- Systematically better than relaxed version (ncLasso)
- Fairly robust to missing edges
- Fails if network is random
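For reference, power and FDR on simulated data can be computed from the selected and causal SNP sets as below. This assumes the usual definitions (power is the fraction of causal SNPs recovered, FDR is the fraction of selected SNPs that are not causal); the example sets are made up.

```python
def power_and_fdr(selected, causal):
    """Return (power, FDR) for a selection, given the true causal SNPs."""
    selected, causal = set(selected), set(causal)
    true_positives = len(selected & causal)
    power = true_positives / len(causal)
    # By convention here, an empty selection has FDR 0.
    fdr = (len(selected) - true_positives) / len(selected) if selected else 0.0
    return power, fdr

# Hypothetical example: 5 causal SNPs, 4 selected, 2 of them causal.
print(power_and_fdr(selected={1, 2, 3, 4}, causal={1, 2, 5, 6, 7}))   # (0.4, 0.5)
```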
Experiments: Arabidopsis thaliana flowering time
17 flowering time phenotypes [Atwell et al., 2010]
p ∼ 170,000 SNPs (after MAF filtering)
n ∼ 150 samples
165 candidate genes [Segura et al., 2012]
Correction for population structure: regress out PCs
Experiments: Arabidopsis thaliana flowering time
[Figure: number of candidate genes hit (axis 0 to 10) by Univariate, Lasso, groupLasso, ncLasso and SConES. Blue labels give the number of selected SNPs: 5, 86, 611, 608, 546]
Experiments: Arabidopsis thaliana flowering time
Predictivity
[Figure: predictivity ($R^2$, axis 0 to 1) of Lasso, groupLasso, ncLasso and SConES for each flowering time phenotype: 0W, 0W GH LN, 4W, 8W GH FT, FLC, FT GH, LDV, LN16, SD, 0W GH FT, 2W, 8W GH LN, FRI, FT Field, LN10, LN22, SDV]
Availability
[Image: ISMB/ECCB 2013 conference announcement (SIGs & Tutorials July 19-20, Conference July 21-23), www.iscb.org/ismbeccb2013]
C.-A. Azencott, D. Grimm, M. Sugiyama, Y. Kawahara, and K. Borgwardt.
Efficient network-guided multi-locus association mapping with graph cuts.
Bioinformatics (2013) 29 (13): i171-i179.
Code available from http://agkb.is.tuebingen.mpg.de
Summary
SConES
- selects connected, explanatory SNPs;
- incorporates large networks into GWAS;
- is efficient, effective and robust.
Image source: http://www.flickr.com/photos/fimbrethil/
Future directions
- Other structure-inducing regularizers
  - groups
  - networks
- Defining the SNP network
- Learning the SNP network
- More models of association
- Determining p-values
- GPU speed-up
- Application to human data: COPDGene, IHGC (migraine)
Image source: http://www.flickr.com/photos/buckaroobay/
Acknowledgements
MLCB Tübingen: Dominik Grimm, Mahito Sugiyama, Karsten Borgwardt, Recep Colak (U. Toronto), Barbara Rakitsch, Nino Shervashidze (INRIA)
ISIR Osaka University: Yoshinobu Kawahara
MPI for Intelligent Systems: Bernhard Schölkopf
MPI for Developmental Biology: Detlef Weigel
MPI for Psychiatry (Munich): Bertram Müller-Myhsok, Tony Kam-Thong (Roche)
References
S. Atwell et al. Genome-wide association study of 107 phenotypes in Arabidopsis thaliana inbred lines. Nature, 465(7298):627–631, 2010.
J. Huang, T. Zhang, and D. Metaxas. Learning with structured sparsity. In ICML, pages 417–424, New York, NY, USA, 2009.
L. Jacob, G. Obozinski, and J.-P. Vert. Group lasso with overlap and graph lasso. In ICML, pages 433–440, 2009.
C. Li and H. Li. Network-constrained regularization and variable selection for analysis of genomic data. Bioinformatics, 24(9):1175–1182, 2008.
J. Mairal and B. Yu. Path coding penalties for directed acyclic graphs. In NIPS OPT, 2011.
V. Segura et al. An efficient multi-locus mixed-model approach for genome-wide association studies in structured populations. Nat Genet, 44(7):825–830, 2012.