Networks of Protein Interactions Introduction and Integration Balaji S. Srinivasan CS 374 Lecture 5 10/11/2005


Networks of Protein Interactions: Introduction and Integration

Balaji S. Srinivasan

CS 374

Lecture 5

10/11/2005

Overview

Genomics: 1 genome (Assembly, Gene Finding)
Comparative Genomics: N genomes (Sequence Alignment)
Functional Genomics: 1 assay (Microarray Analysis)
Integrative Genomics: N assays (Network Integration; this talk)

Coexpression

[Figure: microarray data (expression matrix, genes x arrays) -> Pearson correlation -> correlation matrix over Gene A, Gene B, Gene C (1 on the diagonal; off-diagonal values .8, -.6, -.7) = network with edge weights .8, -.7, -.6]
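The coexpression step can be sketched in a few lines of Python. This is only an illustration: the expression values and the names `pearson` and `coexpression_network` are made up here and do not come from the lecture.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def coexpression_network(expression):
    """Map each unordered gene pair to the correlation of its two profiles."""
    genes = sorted(expression)
    return {
        (g, h): pearson(expression[g], expression[h])
        for i, g in enumerate(genes)
        for h in genes[i + 1:]
    }

# Made-up expression levels over 5 arrays (illustration only).
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.0, 4.1, 5.9, 8.2, 9.8],
    "geneC": [5.0, 4.2, 2.9, 2.1, 1.0],
}
network = coexpression_network(expression)
```

Each edge weight lies in [-1, 1], so genes that rise and fall together get a weight near 1 and genes with opposite profiles get a weight near -1.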

Coinheritance

[Figure: BLAST bit scores for Protein A, Protein B, Protein C across Species 1-4 (inheritance profiles) -> Spearman correlation -> correlation matrix (1 on the diagonal; off-diagonal values .95, -.95, -1) = network with edge weights .95, -1, -.95]
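A minimal sketch of the coinheritance score, assuming each protein has one BLAST bit score per species. Spearman correlation is Pearson correlation applied to ranks. The bit scores and the helper names below are illustrative assumptions, not values recovered from the slide.

```python
import math

def ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Made-up bit scores across 4 species (illustration only).
protein_a = [600, 200, 300, 100]
protein_b = [500, 100, 300, 200]
protein_c = [50, 250, 250, 400]

rho_ab = spearman(protein_a, protein_b)
rho_ac = spearman(protein_a, protein_c)
```

Working on ranks makes the score insensitive to the absolute scale of the bit scores, which is one plausible reason to prefer it over Pearson for this kind of data.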

Colocation

[Figure: assembled genomes; locations of Protein A, Protein B, Protein C on Chrom 1-4 -> average chromosomal distance -> distance matrix (0 on the diagonal; off-diagonal values .06, .25, .25) = network with edge weights .06, .25, .25]

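Coevolution

[Figure: multiple alignments of protein families A, B, C (members A, A’, A’’, A’’’; B, B’, B’’, B’’’; C, C’, C’’, C’’’) -> tree distances -> matrix over Prt Fam A, Prt Fam B, Prt Fam C (1 on the diagonal; off-diagonal values .9, -.7, -.8) = network with edge weights .9, -.8, -.7]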

Functional Genomics

Many others…
Experimental
  TAP + Mass Spec
  Y2H
  Pheno & antibody arrays
  Synthetic lethal
  RNAi knockdown
Computational
  Rosetta Stone (conserved domain)
  Shared Operon
  PSIMAP

Integration Motivation

Can we combine data?
Example: Caulobacter crescentus flagellar proteins
  Coexpression cluster
  Compare to coinheritance
  Potential for integration…

[Figure: coexpression and coinheritance of the flagellar proteins]

How to use 2 predictors?
  Agree & disagree…
  Scales, noise levels, sources very different
Can we do network integration?

[Figure: coinheritance vs. coexpression]

Early Integration Hacks

Given 2 nets
  intersection
  union
  average weights

G1 = (V1, E1): coexpression
G2 = (V2, E2): coinheritance
V1, V2 ⊆ V (V = set of all proteins)

E_isc = 1 if (E1 > T1) && (E2 > T2)
E_union = 1 if (E1 > T1) || (E2 > T2)
E_avg = .5 (E1 + E2)
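The three hacks are easy to state in code. In this sketch, intersection keeps a pair only when both networks pass their thresholds, union keeps it when either does, and average takes the mean weight. The function name, the thresholds, the toy weights, and the choice to treat a missing edge as weight 0 are all assumptions made for illustration.

```python
def integrate(e1, e2, t1, t2):
    """Combine two edge-weight dicts keyed by protein pair.

    Pairs missing from a network are treated as weight 0 (an assumption).
    Returns (intersection set, union set, average-weight dict).
    """
    intersection, union, average = set(), set(), {}
    for pair in set(e1) | set(e2):
        w1, w2 = e1.get(pair, 0.0), e2.get(pair, 0.0)
        if w1 > t1 and w2 > t2:      # both sources agree
            intersection.add(pair)
        if w1 > t1 or w2 > t2:       # at least one source fires
            union.add(pair)
        average[pair] = 0.5 * (w1 + w2)
    return intersection, union, average

# Made-up edge weights (illustration only).
coexpression = {("A", "B"): 0.9, ("A", "C"): 0.8, ("B", "C"): 0.2}
coinheritance = {("A", "B"): 0.5, ("A", "C"): 0.9, ("C", "D"): 0.7}

isc, uni, avg = integrate(coexpression, coinheritance, t1=0.6, t2=0.6)
```

With these toy inputs the intersection keeps only the A-C pair, while the union also admits A-B and C-D on the strength of a single source each.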

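Early Integration Hacks

[Figure: coexpression network (edge weights .9, .8, .7, .6) + coinheritance network (edge weights .5, .7, .8, .9) = integrated network]

Intersection: too strict
Union: too lenient
Average: too dumb :)

[Figure: averaged network with edge weights .65, .35, .45, .75, .35, .4]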

Early Integration Hacks

Useful, but dumb…
  All data equal?
  No explicit statistical formulation
    diff noise levels
    diff intervals
  Uninformed by prior data…

[Figure: averaged network (edge weights .65, .35, .85, .75, .35): too dumb]

Recent work

Bayesian Networks (Troyanskaya 2003)
Decision Trees (Wong 2004)
Naïve Bayes + Boosting (Lu 2005)
Likelihood Ratios (Lee 2004)

Recent work

Major innovation: Training Set
  MIPS, “Gold Standard” (Gerstein)
  SSL, synthetic lethals (Wong)
  DIP (Marcotte)
Defines the signal
  What is our algorithm learning?

[Figure: KEGG (Pyrimidine Metabolism)]

Recent work

Major limitations
  Method specific
    Decision trees: binary coding
    Bayesian Networks: need to poll people for prior
  All methods
    Biological: limited to yeast
    Statistical: dependency hacks!
      Lee: heuristic weighting
      Naïve Bayes

[Figure: Naïve Bayes (Lu 2005); Heuristic Weighting (Lee 2004)]

Recap

Just shown
  Functional Genomics
  Integration Problem
  Previous work
    all in S. cerevisiae
    major innovation: training set
    major shortcoming: dependence hacks
To come
  training set, common scale
  rigorous statistical dependence
  microbes only (for now…)

[Figure: coexpression + coinheritance + colocation + …]

Training Set

Observation
  Known linkages for nontrivial fraction of pairs
Caulobacter crescentus
  KEGG: 783 of 3737 proteins in 1 or more KEGG pathways
  Ex: pyrimidine metabolism, pathway 240

Training Set

Tabulate pairs
  1 if shared COG/KEGG/GO
  0 if unshared
  ? if one or both unknown
Most pairs totally unknown…
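The tabulation rule can be sketched as follows, assuming each annotated protein maps to a set of category IDs (KEGG pathways, COG or GO terms). `None` stands in for "?". The protein names and pathway IDs are hypothetical placeholders.

```python
from itertools import combinations

def label_pairs(proteins, annotations):
    """Label every unordered protein pair as 1, 0, or None.

    1    -> both annotated and they share at least one category
    0    -> both annotated and they share none
    None -> one or both proteins unannotated (the '?' case)
    """
    labels = {}
    for a, b in combinations(sorted(proteins), 2):
        cats_a, cats_b = annotations.get(a), annotations.get(b)
        if not cats_a or not cats_b:
            labels[(a, b)] = None
        elif cats_a & cats_b:
            labels[(a, b)] = 1
        else:
            labels[(a, b)] = 0
    return labels

# Hypothetical proteins and pathway IDs (illustration only).
proteins = ["p1", "p2", "p3", "p4"]
annotations = {
    "p1": {"pathway240"},
    "p2": {"pathway240", "pathway010"},
    "p3": {"pathway020"},
}
labels = label_pairs(proteins, annotations)
unknown = sum(1 for v in labels.values() if v is None)
```

Even in this four-protein toy, a single unannotated protein leaves half of the six pairs unlabeled, which mirrors the observation that most pairs are unknown.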

Training Sets

Most pairs totally unknown…
Caulobacter crescentus: 3737 proteins, 783 KEGG
  small in relative terms
  large in absolute terms

All pairs (L = 0, 1, ?): $\binom{3737}{2} = 6980716$ pairs = 6667480 pairs + 298961 pairs + 14275 pairs

Relative frequency, training pairs vs. all pairs: $\binom{783}{2} / \binom{3737}{2} = .043$
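The pair counts can be checked directly. This short sketch uses only the two protein counts quoted on the slide; the variable names are made up.

```python
from math import comb

n_proteins = 3737  # C. crescentus proteins (from the slide)
n_kegg = 783       # proteins in 1 or more KEGG pathways (from the slide)

all_pairs = comb(n_proteins, 2)    # every unordered protein pair
kegg_pairs = comb(n_kegg, 2)       # pairs where both proteins are in KEGG
fraction = kegg_pairs / all_pairs  # relative frequency of training pairs

print(all_pairs)  # 6980716
print(f"{fraction:.1%} of all pairs have both proteins annotated")
```

The annotated pairs are roughly 4% of the total: small in relative terms, but still hundreds of thousands of pairs in absolute terms.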


Training Sets

Training data is crucial
  Reveals hidden structure
  Small effort yields large payoff
  L=0,1,? stats
Puts data on common scale
  meter in biology (predictive power), not physics (units)

[Figure: raw data + add training set -> hidden structure]

Bayes’ Rule in 1D

Predict Linkages
  Bayes’ Rule
  Coexpression
Evaluate posterior at millions of pairs
  P(L=1|E) for L=?
Optimal decision rule
  “Bayes error rate” = min. error rate of classifier

$$P(L \mid E) = \frac{P(E \mid L)\, P(L)}{\sum_{L'} P(E \mid L')\, P(L')}$$

Bayes’ Rule: Calculate conditional probability of linkage given evidence
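A minimal sketch of the 1D case, assuming the class-conditional densities P(E|L=1) and P(E|L=0) are estimated from training pairs. A plain histogram is swapped in here as the simplest density estimate (the talk itself goes on to use kernel density estimation). The correlation values, bin edges, and function names are invented for illustration.

```python
def bin_index(x, edges):
    """Index of the bin [edges[i], edges[i+1]) holding x; the last bin is closed."""
    if not edges[0] <= x <= edges[-1]:
        raise ValueError("value outside histogram range")
    for i in range(len(edges) - 2):
        if x < edges[i + 1]:
            return i
    return len(edges) - 2

def histogram_density(samples, edges):
    """Piecewise-constant density estimate on the bins defined by edges."""
    counts = [0] * (len(edges) - 1)
    for s in samples:
        counts[bin_index(s, edges)] += 1
    n = len(samples)
    return [c / (n * (edges[i + 1] - edges[i])) for i, c in enumerate(counts)]

def posterior_linked(e, edges, dens1, dens0, prior1):
    """P(L=1 | E=e) by Bayes' rule; falls back to the prior in empty bins."""
    i = bin_index(e, edges)
    num = dens1[i] * prior1
    den = num + dens0[i] * (1.0 - prior1)
    return num / den if den > 0 else prior1

# Made-up coexpression correlations for training pairs (illustration only).
linked = [0.9, 0.8, 0.7, 0.85, 0.3, 0.6]                            # L = 1
unlinked = [-0.2, 0.1, 0.0, 0.3, -0.5, 0.2, 0.4, -0.1, 0.7, 0.05]   # L = 0

edges = [-1.0, -0.5, 0.0, 0.5, 1.0]
dens1 = histogram_density(linked, edges)
dens0 = histogram_density(unlinked, edges)
prior1 = len(linked) / (len(linked) + len(unlinked))

high = posterior_linked(0.8, edges, dens1, dens0, prior1)
low = posterior_linked(0.2, edges, dens1, dens0, prior1)
```

Once the densities are estimated, the posterior can be evaluated at every pair with L=?, turning a raw correlation into a probability of linkage.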

2D Network Integration

Account for statistical dependence
2D Scatterplot: coexpression vs. coinheritance

2D Network Integration

Estimate densities
  Kernel Density Estimation
  Gray-Moore dual tree algorithm (digression #1)

2D Network Integration

2D Network Integration

Posterior probability of interaction: P(L=1|E)
  visual, geometric interpretation

[Figure: regions labeled P(L=1|E) = .9, P(L=1|E) = .5, P(L=1|E) = .1]
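The 2D posterior can be sketched with a naive Gaussian kernel density estimate over (coexpression, coinheritance) points. This is an assumption-laden toy: the bandwidth `h`, the sample points, and the function names are made up, and the sum over all samples is the slow brute-force evaluation, not the dual tree algorithm mentioned in the talk.

```python
import math

def kde2d(point, samples, h):
    """Gaussian product-kernel density estimate at point from 2D samples."""
    x, y = point
    total = 0.0
    for sx, sy in samples:
        total += math.exp(-((x - sx) ** 2 + (y - sy) ** 2) / (2 * h * h))
    return total / (len(samples) * 2 * math.pi * h * h)

def posterior_linked(point, linked, unlinked, h=0.2):
    """P(L=1 | E1, E2) from joint class-conditional KDEs and empirical priors."""
    prior1 = len(linked) / (len(linked) + len(unlinked))
    num = kde2d(point, linked, h) * prior1
    den = num + kde2d(point, unlinked, h) * (1.0 - prior1)
    return num / den

# Made-up (coexpression, coinheritance) evidence for training pairs.
linked = [(0.8, 0.9), (0.7, 0.8), (0.9, 0.7), (0.5, 0.6), (0.6, 0.5)]
unlinked = [(0.1, 0.0), (-0.2, 0.2), (0.0, -0.3), (0.3, 0.1),
            (-0.1, -0.1), (0.2, -0.2), (0.9, -0.4), (-0.5, 0.8)]

both_strong = posterior_linked((0.8, 0.8), linked, unlinked)
both_moderate = posterior_linked((0.55, 0.55), linked, unlinked)
one_strong = posterior_linked((0.9, -0.4), linked, unlinked)
```

Because the densities are estimated jointly, the posterior depends on where a pair falls in the plane rather than on each score separately. In this toy data, a pair with moderate evidence from both sources scores far higher than a pair with strong coexpression but negative coinheritance.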

2D Network Integration

Hacks revisited
  Intersection
  Union
  Average
All are suboptimal… including decision trees, naïve Bayes, etc.

Hidden Biology

Dividend of Network Integration
  Joint density reveals hidden biology
  Moderate evidence from multiple sources!
  Subtle interactions missed by univariate methods…

Recap #2

Just shown
  Training set: scale to common axes
  Scatterplot + KDE
  Posterior probability of interaction
  Hidden biology
To show
  generalizations
  N evidences, arbitrary microbes…

Using N predictors

Example with N = 3 (coinheritance, colocation, coexpression)
  note evidence coupling
  high colocation compensates for low coexpression
  nonlinear relationship revealed by joint density…

[Figure: three panels, $P(E_1, E_2, E_3 \mid L=1)$, $P(E_1, E_2, E_3 \mid L=0)$, and $P(L=1 \mid E_1, E_2, E_3)$; axes: coexpression (E1), colocation (E2), coinheritance (E3)]
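Written out for N = 3, the posterior in the third panel follows from the two class-conditional joint densities by Bayes' rule. The priors P(L=1) and P(L=0) are assumed here to come from the labeled training pairs.

```latex
P(L=1 \mid E_1, E_2, E_3) =
  \frac{P(E_1, E_2, E_3 \mid L=1)\, P(L=1)}
       {P(E_1, E_2, E_3 \mid L=1)\, P(L=1) + P(E_1, E_2, E_3 \mid L=0)\, P(L=0)}
```

A naïve Bayes classifier would instead replace each joint density by the product $\prod_i P(E_i \mid L)$, which discards the coupling between evidence types that the joint density keeps.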

Binary Classifier Paradigm

Pair w/ unknown linkage status
  given interaction predictors
  predict functional association

[Figure: pair A, B with L=?, E known -> P(L|E) -> Different Function (L=0) or Same Function (L=1)]

Blessing of Dimensionality

Classifier builds network
  Binary classifier on pairs
  apply to all microbes, all protein pairs in 230 species
  first rigorous nets for many human pathogens

[Figure: networks for Escherichia coli K12, Helicobacter pylori 26695, Caulobacter crescentus]
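How a pairwise classifier turns into a network can be sketched as below, assuming the posterior P(L=1|E) has already been computed for each pair. The 0.5 cutoff is the usual choice for minimizing misclassification when both error types cost the same; the protein names and probabilities are hypothetical.

```python
from collections import defaultdict

def build_network(posteriors, threshold=0.5):
    """Adjacency map of predicted links: pairs with P(L=1|E) above threshold."""
    network = defaultdict(dict)
    for (a, b), p in posteriors.items():
        if p > threshold:
            # Store the edge in both directions, weighted by the posterior.
            network[a][b] = p
            network[b][a] = p
    return dict(network)

# Hypothetical posteriors for pairs with unknown linkage (illustration only).
posteriors = {
    ("protA", "protB"): 0.93,
    ("protA", "protC"): 0.12,
    ("protB", "protC"): 0.64,
    ("protC", "protD"): 0.08,
}
network = build_network(posteriors)
```

Keeping the posterior as the edge weight, rather than a bare 0/1 call, lets later analyses re-threshold the network without recomputing anything.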

MreB

Example: MreB
  relative of eukaryotic actin
  predict interaction partners

CtrA and CcrM

Laub et al., 2000

C. jejuni glycosylation

Eukaryote-like N-linked glycosylation
  mysterious
  biotechnological & clinical importance

Nets speed experiment

Sec partners for MreB: Natalie Dye
Predicting mislocalization: Grant Bowman, Esteban Toro
Interacting 2-component proteins: Nathan Hillson

Recap

integrate data sources non-naïvely
rigorous probabilistic formulation
moderate evidence from multiple sources

$P(L \mid E_1, E_2, \ldots, E_n)$

Result: Unified posterior probability of functional linkage given all evidence.

Further Directions

Whatcha gonna do with it?
M genomes + N assays
  Comparative Genomics + Integrative Genomics = Network Alignment

To be continued…