Networks of Protein Interactions: Introduction and Integration
Balaji S. Srinivasan
CS 374
Lecture 5
10/11/2005
Overview
Genomics: 1 genome (Assembly, Gene Finding)
Comparative Genomics: N genomes (Sequence Alignment)
Functional Genomics: 1 assay (Microarray Analysis)
Integrative Genomics: N assays (Network Integration), this talk
Coexpression
Microarray data (expression matrix: genes x arrays) -> Pearson correlation -> gene-by-gene correlation matrix
[Figure: correlation matrix for Gene A, Gene B, Gene C with 1 on the diagonal and off-diagonal entries .8, -.7, -.6]
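
The coexpression step can be sketched in a few lines. This is a minimal illustration, not the lecture's code: the expression values below are invented, and `pearson` is a helper written for this example.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical expression matrix: one row per gene, one column per array.
expression = {
    "Gene A": [1.0, 2.0, 3.0, 4.0],
    "Gene B": [1.5, 2.5, 2.0, 4.5],
    "Gene C": [4.0, 3.0, 2.5, 1.0],
}

# One correlation per unordered gene pair: the edge weights of the network.
genes = list(expression)
coexpression = {
    (g, h): pearson(expression[g], expression[h])
    for i, g in enumerate(genes)
    for h in genes[i + 1:]
}

for pair, r in coexpression.items():
    print(pair, round(r, 2))
```

Each unordered gene pair gets one weight in [-1, 1], which is the form the later integration steps take as input.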
Coinheritance
BLAST bit scores (inheritance matrix: proteins x species) -> Spearman correlation -> protein-by-protein correlation matrix
[Figure: bit scores for Protein A, Protein B, Protein C across Species 1-4; correlation matrix with 1 on the diagonal and off-diagonal entries .95, -1, -.95]
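
Spearman correlation is Pearson correlation applied to ranks, which makes it insensitive to the scale of the bit scores. The sketch below assumes one bit score per protein per species; the numbers are made up for illustration, and `ranks`, `pearson`, and `spearman` are helpers defined here.

```python
from math import sqrt

def ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))

# Hypothetical bit scores: one row per protein, one column per species.
bit_scores = {
    "Protein A": [600, 200, 300, 100],
    "Protein B": [500, 100, 300, 50],
    "Protein C": [100, 300, 200, 400],
}

print(spearman(bit_scores["Protein A"], bit_scores["Protein B"]))
print(spearman(bit_scores["Protein A"], bit_scores["Protein C"]))
```

In this toy data, Protein A and Protein B rise and fall together across species, while Protein C has the opposite profile.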
Colocation
Assembled genomes (location matrix: proteins x chromosomes) -> average chromosomal distance -> protein-by-protein distance matrix
[Figure: locations of Protein A, Protein B, Protein C on Chrom 1-4; distance matrix with 0 on the diagonal and off-diagonal entries .06, .25, .25]
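Coevolution
Multiple alignments (evolution: trees for protein families A, B, C) -> tree distances -> family-by-family matrix
[Figure: trees for Prt Fam A (A, A', A'', A'''), Prt Fam B (B, B', B'', B'''), Prt Fam C (C, C', C'', C'''); matrix with 1 on the diagonal and off-diagonal entries .9, -.8, -.7]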
Functional Genomics
Many others…
Experimental: TAP + Mass Spec; Y2H; Pheno & antibody arrays; Synthetic lethal; RNAi knockdown
Computational: Rosetta Stone (conserved domain); Shared Operon; PSIMAP
Integration Motivation
Can we combine data?
Example: Caulobacter crescentus flagellar proteins
  Coexpression cluster
  Compare to coinheritance
  Potential for integration…
How to use 2 predictors?
  Agree & disagree…
  Scales, noise levels, sources very different
Can we do network integration?
[Figure: coexpression network ≠ coinheritance network]
Early Integration Hacks
Given 2 nets: intersection, union, average weights
$G_1 = (V_1, E_1)$: coexpression
$G_2 = (V_2, E_2)$: coinheritance
$V_1, V_2 \subseteq V$ (set of all proteins)
$E_{\mathrm{isc}} = 1 \text{ if } (E_1 > T_1) \text{ and } (E_2 > T_2)$
$E_{\mathrm{union}} = 1 \text{ if } (E_1 > T_1) \text{ or } (E_2 > T_2)$
$E_{\mathrm{avg}} = .5\,(E_1 + E_2)$
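
The three hacks are easy to state in code. Intersection keeps an edge only when both networks exceed their thresholds, union keeps it when either does, and average takes the mean weight. This is a sketch under assumptions: the edge weights and the thresholds `T1`, `T2` are invented, and `integrate` is a name chosen for this example.

```python
def integrate(e1, e2, t1, t2):
    """Combine two edge weights for one protein pair in three naive ways."""
    return {
        "intersection": int(e1 > t1 and e2 > t2),
        "union": int(e1 > t1 or e2 > t2),
        "average": 0.5 * (e1 + e2),
    }

# Hypothetical edge weights keyed by protein pair; thresholds are assumptions.
coexpression = {("A", "B"): 0.9, ("A", "C"): 0.2, ("B", "C"): 0.7}
coinheritance = {("A", "B"): 0.5, ("A", "C"): 0.1, ("B", "C"): 0.8}
T1, T2 = 0.6, 0.6

combined = {
    pair: integrate(coexpression[pair], coinheritance[pair], T1, T2)
    for pair in coexpression
}

for pair, result in combined.items():
    print(pair, result)
```

The pair (A, B) shows the disagreement between the rules: strong coexpression alone is enough for union but not for intersection.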
Early Integration Hacks
Coexpression + Coinheritance =
  Intersection: too strict
  Union: too lenient
  Average: too dumb :)
[Figure: example coexpression and coinheritance networks with edge weights, and their intersection, union, and average]
Early Integration Hacks
Useful, but dumb…
  All data equal?
  No explicit statistical formulation: different noise levels, different intervals
  Uninformed by prior data…
[Figure: averaged network, labelled "Too dumb"]
Recent work
Bayesian Networks (Troyanskaya 2003)
Decision Trees (Wong 2004)
Naïve Bayes + Boosting (Lu 2005)
Likelihood Ratios (Lee 2004)
Recent work
Major innovation: Training Set
  MIPS, “Gold Standard” (Gerstein)
  SSL, synthetic lethals (Wong)
  DIP (Marcotte)
Defines the signal: what is our algorithm learning?
[Figure: KEGG (Pyrimidine Metabolism)]
Recent work
Major limitations
Method specific
  Decision trees: binary coding
  Bayesian Networks: need to poll people for prior
All methods
  Biological: limited to yeast
  Statistical: dependency hacks!
    Lee: heuristic weighting
    Naïve Bayes
[Figure: Naïve Bayes (Lu 2005); Heuristic Weighting (Lee 2004)]
Recap
Just shown
  Functional Genomics
  Integration Problem
  Previous work: all in S. cerevisiae; major innovation: training set; major shortcoming: dependence hacks
To come
  training set, common scale
  rigorous statistical dependence
  microbes only (for now…)
[Figure: coexpression + coinheritance + colocation + …]
Training Set
Observation: known linkages for a nontrivial fraction of pairs
Caulobacter crescentus KEGG: 783 of 3737 proteins in 1 or more KEGG pathways
Ex: pyrimidine metabolism, pathway 240
Training Set
Tabulate pairs
  1 if shared COG/KEGG/GO
  0 if unshared
  ? if one or both unknown
Most pairs totally unknown…
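
The tabulation rule can be sketched as follows. The protein names and pathway assignments are placeholders invented for this example, `None` stands in for the slide's "?", and `label_pair` is a hypothetical helper.

```python
from itertools import combinations

def label_pair(pathways_a, pathways_b):
    """Return 1 (shared), 0 (unshared), or None for '?' (unknown)."""
    if not pathways_a or not pathways_b:
        return None  # '?': one or both proteins have no annotation
    return 1 if pathways_a & pathways_b else 0

# Hypothetical annotations: protein -> set of pathway IDs (empty = unannotated).
annotations = {
    "P1": {"240"},
    "P2": {"240", "230"},
    "P3": {"10"},
    "P4": set(),
}

labels = {
    (a, b): label_pair(annotations[a], annotations[b])
    for a, b in combinations(annotations, 2)
}

# Only pairs with a known label (0 or 1) enter the training set.
training = {pair: l for pair, l in labels.items() if l is not None}
print(labels)
print(len(training), "training pairs out of", len(labels))
```

Even in this four-protein toy, one unannotated protein leaves half of the pairs unlabelled.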
Training Sets
Most pairs totally unknown…
Caulobacter crescentus: 3737 proteins, 783 KEGG
  small in relative terms
  large in absolute terms
All pairs (L=0,1,?): $\binom{3737}{2} = 6980716$ pairs $= 6667480 + 298961 + 14275$ pairs
Relative frequency, training pairs vs. all pairs: $\binom{783}{2} / \binom{3737}{2} = .043$
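
The pair counts follow from the binomial coefficient "n choose 2". A quick check of the arithmetic, using only the figures quoted on the slide:

```python
from math import comb

n_proteins, n_annotated = 3737, 783  # figures quoted on the slide

all_pairs = comb(n_proteins, 2)
ratio = comb(n_annotated, 2) / all_pairs

print(all_pairs)
print(f"{ratio:.4f}")
# The three pair counts on the slide add up to the total number of pairs.
print(6667480 + 298961 + 14275 == all_pairs)
```

About a fifth of the proteins are annotated, but only about 4% of the pairs have both members annotated, because the pair count grows quadratically.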
Training Sets
Training data is crucial
  Reveals hidden structure
  Small effort yields large payoff
  L=0,1,? stats
Puts data on common scale
  meter in biology (predictive power), not physics (units)
[Figure: raw data + training set -> hidden structure]
Bayes’ Rule in 1D
Predict Linkages: Bayes’ Rule, Coexpression
Evaluate posterior at millions of pairs: P(L=1|E) for L=?
Optimal decision rule: “Bayes error rate” = min. error rate of classifier
$$P(L \mid E) = \frac{P(E \mid L)\,P(L)}{\sum_{L'} P(E \mid L')\,P(L')}$$
Bayes’ Rule: calculate conditional probability of linkage given evidence
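
A one-dimensional version of this calculation can be sketched as below. It assumes, purely for illustration, Gaussian class-conditional densities and a prior of 0.05. None of these numbers come from the lecture, which estimates the densities from training data.

```python
from statistics import NormalDist

# Assumed class-conditional densities of the coexpression evidence E.
linked = NormalDist(mu=0.6, sigma=0.2)     # stands in for P(E | L=1)
unlinked = NormalDist(mu=0.0, sigma=0.3)   # stands in for P(E | L=0)
PRIOR_LINKED = 0.05                        # assumed P(L=1)

def bayes_posterior(lik1, lik0, prior1):
    """P(L=1 | E) from P(E | L=1), P(E | L=0) and the prior P(L=1)."""
    num = lik1 * prior1
    den = num + lik0 * (1.0 - prior1)
    return num / den

def posterior_linked(e):
    """Posterior probability of linkage for one pair with evidence e."""
    return bayes_posterior(linked.pdf(e), unlinked.pdf(e), PRIOR_LINKED)

for e in (0.0, 0.5, 0.9):
    print(e, round(posterior_linked(e), 3))
```

When the two likelihoods are equal the posterior falls back to the prior, which is why a rare class needs strong evidence before its posterior passes 0.5.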
2D Network Integration
Account for statistical dependence
2D Scatterplot: coexpression vs. coinheritance
2D Network Integration
Estimate densities
  Kernel Density Estimation
  Gray-Moore dual tree algorithm (digression #1)
2D Network Integration
Posterior probability of interaction: P(L=1|E)
  visual, geometric interpretation
[Figure: regions labelled $P(L=1 \mid E) = .9$, $P(L=1 \mid E) = .5$, $P(L=1 \mid E) = .1$]
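
The density estimation and posterior steps can be sketched together in two dimensions. This is a minimal version under stated assumptions: the training points and the bandwidth `H` are invented, and the density is evaluated by a naive sum over all training points rather than by the dual tree algorithm named above.

```python
from math import exp, pi

def kde2d(points, h1, h2):
    """Gaussian product-kernel density estimate built from 2D training points."""
    norm = 1.0 / (len(points) * 2.0 * pi * h1 * h2)

    def density(e1, e2):
        total = 0.0
        for p1, p2 in points:
            u, v = (e1 - p1) / h1, (e2 - p2) / h2
            total += exp(-0.5 * (u * u + v * v))
        return norm * total

    return density

# Hypothetical training pairs as (coexpression, coinheritance) evidence.
linked_pts = [(0.8, 0.7), (0.7, 0.9), (0.9, 0.8), (0.6, 0.6)]
unlinked_pts = [(0.1, 0.0), (-0.2, 0.2), (0.0, -0.1),
                (0.2, 0.1), (-0.1, -0.3), (0.1, 0.3)]
H = 0.2  # bandwidth: an assumption, not a tuned value

f_linked = kde2d(linked_pts, H, H)      # estimate of P(E1, E2 | L=1)
f_unlinked = kde2d(unlinked_pts, H, H)  # estimate of P(E1, E2 | L=0)
prior_linked = len(linked_pts) / (len(linked_pts) + len(unlinked_pts))

def posterior_linked(e1, e2):
    """P(L=1 | E1, E2) by Bayes' rule on the two joint density estimates."""
    num = f_linked(e1, e2) * prior_linked
    den = num + f_unlinked(e1, e2) * (1.0 - prior_linked)
    return num / den

print(round(posterior_linked(0.8, 0.8), 3))
print(round(posterior_linked(0.0, 0.0), 3))
```

Because the densities are estimated jointly over both evidence axes, any dependence between coexpression and coinheritance in the training data carries through to the posterior.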
2D Network Integration
Hacks revisited: Intersection, Union, Average
All are suboptimal… including decision trees, naïve Bayes, etc.
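
A small discrete example shows why naïve Bayes falls short of the full posterior when evidence sources are dependent. The probability tables below are invented for illustration, and exact fractions are used so the two answers can be compared without rounding.

```python
from fractions import Fraction as F

# Hypothetical class-conditional joint tables P(E1, E2 | L) for binary evidence.
joint = {
    1: {(0, 0): F(1, 10), (0, 1): F(1, 10), (1, 0): F(1, 10), (1, 1): F(7, 10)},
    0: {(0, 0): F(4, 10), (0, 1): F(1, 10), (1, 0): F(1, 10), (1, 1): F(4, 10)},
}
prior = {1: F(1, 2), 0: F(1, 2)}

def marginal(label, axis, value):
    """P(E_axis = value | L = label), summed out of the joint table."""
    return sum(p for e, p in joint[label].items() if e[axis] == value)

def posterior_joint(e):
    """P(L=1 | E1, E2) using the full joint likelihood."""
    num = joint[1][e] * prior[1]
    return num / (num + joint[0][e] * prior[0])

def posterior_naive(e):
    """P(L=1 | E1, E2) assuming E1 and E2 are independent given L."""
    lik = {l: marginal(l, 0, e[0]) * marginal(l, 1, e[1]) for l in (0, 1)}
    num = lik[1] * prior[1]
    return num / (num + lik[0] * prior[0])

print(posterior_joint((1, 1)))  # 7/11
print(posterior_naive((1, 1)))  # 64/89
```

In these tables the two evidences are correlated within each class, so multiplying the marginals counts the shared signal twice and overstates the posterior (64/89 is about .72, against 7/11 or about .64).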
Hidden Biology
Dividend of Network Integration
  Joint density reveals hidden biology
  Moderate evidence from multiple sources!
  Subtle interactions missed by univariate methods…
Recap #2
Just shown
  Training set: scale to common axes
  Scatterplot + KDE
  Posterior probability of interaction
  Hidden biology
To show: generalizations
  N evidences, arbitrary microbes…
Using N predictors
Example with N = 3 (coinheritance, colocation, coexpression)
  note evidence coupling: high colocation compensates for low coexpression
  nonlinear relation revealed by joint density…
[Figure: $P(E_1, E_2, E_3 \mid L=1)$, $P(E_1, E_2, E_3 \mid L=0)$, and $P(L=1 \mid E_1, E_2, E_3)$ on axes coexpression ($E_1$), colocation ($E_2$), coinheritance ($E_3$)]
Binary Classifier Paradigm
Pair w/ unknown linkage status: given interaction predictors, predict functional association
[Figure: pair A, B with L=?, E known -> P(L|E) -> Different Function (L=0) or Same Function (L=1)]
Classifier builds network
Binary classifier on pairs
  apply to all microbes: all protein pairs in 230 species
  first rigorous nets for many human pathogens
[Figure: networks for Escherichia coli K12, Helicobacter pylori 26695, Caulobacter crescentus]
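
Turning per-pair posteriors into a network amounts to keeping the pairs whose posterior clears a cutoff. The sketch below assumes the posteriors have already been computed. The protein names, the posterior values, and the cutoff of 0.5 are all placeholders, and `build_network` is a name chosen for this example.

```python
from collections import defaultdict

def build_network(pair_posteriors, cutoff=0.5):
    """Adjacency map keeping pairs whose P(L=1|E) is at least `cutoff`."""
    network = defaultdict(dict)
    for (a, b), p in pair_posteriors.items():
        if p >= cutoff:
            network[a][b] = p
            network[b][a] = p
    return dict(network)

# Hypothetical posteriors P(L=1|E) for pairs with unknown linkage status.
pair_posteriors = {
    ("A", "B"): 0.92,
    ("A", "C"): 0.08,
    ("B", "C"): 0.61,
    ("C", "D"): 0.30,
}

network = build_network(pair_posteriors)
for protein, neighbours in sorted(network.items()):
    print(protein, neighbours)
```

Storing the posterior on each edge, rather than a bare 0/1, lets the cutoff be changed later without recomputing anything.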
C. jejuni glycosylation
Eukaryote-like N-linked glycosylation
  mysterious
  biotechnological & clinical importance
Nets speed experiment
Sec partners for MreB: Natalie Dye
Predicting mislocalization: Grant Bowman, Esteban Toro
Interacting 2-component proteins: Nathan Hillson
Recap
integrate data sources non-naïvely
rigorous probabilistic formulation
moderate evidence from multiple sources
$P(L \mid E_1, E_2, \ldots, E_n)$
Result: a unified probability of functional linkage given all evidence.