reconstructing biological networks from data:...

74
reconstructing biological networks from data: cMonkey & Inferelator Richard Bonneau [email protected] http://www.cs.nyu.edu/ ~bonneau/ New York University, Dept. of Biology & Computer Science Dept. Wednesday, June 24, 2009

Upload: lamdieu

Post on 27-Jun-2019

214 views

Category:

Documents


0 download

TRANSCRIPT

reconstructing biological networks from data: cMonkey & Inferelator

Richard [email protected]

http://www.cs.nyu.edu/~bonneau/

New York University,

Dept. of Biology &

Computer Science Dept.

!"#$"%&'(%&)"#(*+!,

#"- .(%/ 0#+1"%,+$.2#3&,.,$"*,&4+(5().

Wednesday, June 24, 2009

Close agreement between the orientation dependence of hydrogen bonds observed in protein structures and quantum mechanical calculations. Morozov, Kortemme, Baker, PNAS, 2003

Oligo(N-aryl glycines): A New Twist on Structured Peptoids, Shah, Butterfoss, Bonneau, Kirshenbaum, 2008, JACS

justification of functional-from-principles, parameters-from-data by example

Wednesday, June 24, 2009

imd1 TR(Hrg) asnc VNG1845Coxygen

arsr

220

232310 338339

396407 411

431

448 455

16533 9977 39180

(Bi)clustering

aNX

E

P

S

A.

B.

D.

C.

1

98

2

29 3

7

61

124

163

205

141

184

53

15

100

1

Data

Dynamical

network model

Prediction

overview

1. co-regulatedmodules (integrate data types).

2. Learn topology and Dynamics withgreedy / local aprox.(inferelator 1.0, 1.1)

3. improving performanceover multiple time-scales(Inferelator 2.x)

Main results:

- Surprising predictive performance forprokaryotic networks, T-cell and macrophage differentiationEE Networks

- Longer time scale stability

- model flexibility

Wednesday, June 24, 2009

DNA RNA

microarrays

cDNAESTs

libraries ofFunctional RNA

phenotype

Automatedmicroscopy,etc.

Gene sequencingWhole Genome assembly

ChIP-chipTF-DNA Bindingexperiments

protein sequence databases,

Protein structures,

Proteomics

Protein-proteininteractions

protein

http://www.molbio.uoregon.edu/images/research/spragueg1.jpg

Metabolomics

Mass-spectroscopyNMRChromotography

Genotype & sequencing

Measuring affinities / binding

Measuring Levels

Assaying functional outcome

Wednesday, June 24, 2009

algorithms:David J. Reiss (cMonkey)Vesteinn Thorsson (Inferelator) Richard Bonneau functional genomics:Marc T. FacciottiAmy Schmid, Kenia WhiteheadMin Pan, Amardeep Kaur,Leroy HoodNitin S. Baliga

Wednesday, June 24, 2009

An example : Halobacterium

why halobacterium:• if your friends are working on

halo ... (Hood, Baliga)• not a “model” system (originally)• high IQ• diverse environment• small genome• good genetics, cultivable, etc. • a very tough extremophile,

bioengineering

Data collection and modeling effort✴ genome and genome annotation✴ microarrays✴ genetic and environmental perturbations✴ proteomics✴ ChIP-chip✴ some protein-protein

Wednesday, June 24, 2009

fructose,manribosomeatp,cobprecorrin,metdipeptideradA,hjr,smcrpooxphosk+ transmettrpDNA repairgvpFe!SmutS,dcdglycerol kinaseLPSmutS, primaseZn, + transnicotinamidephytoene,cyrptbat,bopaa trans and metarssirRphos upsop1O2!stressftsZ,cctA,flafla,cctB

Oxygen

Light

Iron

Metals

Radiation

VN

G6288C

VN

G1405C

asnC

gvpE

2tb

pC

.DkaiC

cspd1

ars

Rtb

pE

VN

G0751C

tfbF

thh3

VN

G2020C

tfbG

bat

VN

G2641H

cspd2

VN

G2614H

trh7

2.0

-2.0

! = 0.0

Halobacterium dataset including

>800 microarraystime seriesknock outs

ChIP-chip experiments

proteomics

phenotype

among the mostcomplete prokaryotic datasets

M. Facciotti, N. Baliga

min pan, Kenia Whitehead, Amy Schmid

Wednesday, June 24, 2009

cMonkey

integrative

biclustering

Expresion

Networks

Upstream

Biclusters,motifs,

subnetworks

Inferelator

inference of

dynamic regulatory

networks

Regulatorynetworkmodel

overview:

Wednesday, June 24, 2009

cspd2

VNG0040C

rhl

VNG0320H

3

kaic

AND

trh7

tbpe

AND

tbpd

4

tbpc

5

7

phou

AND

AND

VNG0293H

VNG2641H

AND

gvpe2

1H

VNG1029C

snp

VNG0194H

idr2

28

29

pai1

VNG2614H

4950

AND

AND

AND

52

AND

53

AND

78

tfbg

boa3

113

140

173

AND AND

Biological motivation:Learn regulatory interactions from data that are predictive of equilibrium and dynamical systems behavior

Challenges: Interactions between transcription factors Number of interactions much greater than number of observations Heterogeneous data (e.g. equilibrium and kinetic measurements) Indirect effects and noise (Causal symmetry between activators and

targets) Resultant models are a complex low-level abstraction

of the systems behavior

II. The Inferelator: regulatory network inference

Wednesday, June 24, 2009

experimental design

τ dydt

= −y + g(β • Z ) (1)

βZ = β1x1 + β2x2 + β3 min(x1, x2 ) (2)

10 7 6 8 5 1 2 9 3 4

6

2

4

1

11

8

9

7

3

5

10

8 1 5 7 6 9 3 2 4 10

2

1

4

11

9

8

6

7

3

5

10

Optimal (?) :-) the usual :-(

1. know your model/framework:

2. Time series:sampling with correctrate(s)!

3. samplingin correct regime.

4. Multifactorial on a budget:

Wednesday, June 24, 2009

Inferelator v 1.0

steady-state kinetic

Bonneau, Reiss, Baliga, Thorsson, 2006

yg( )

z = x1z = min(x1,x2)

Wednesday, June 24, 2009

steady state

time series/ kinetic

core assumption

steady-state

kinetic

Bonneau, Baliga, Thorsson, 2006

Wednesday, June 24, 2009

2 Squashing functions: promoter saturation

Wednesday, June 24, 2009

Representing Interactions:

y j = β1x1 j + β2x2 j + β3 min(x1 j ,x2 j ) +1

yj = β1x1 β2x2 β1x1⊕β2x2( )

(,,⊕) x⊕ y := min{x, y}

x y := x + y

Wednesday, June 24, 2009

Model selection using L1-shrinkage:avoiding overfitting

model size

(α̂, β̂) =(α̂ , β̂ )

argmin yi −α − β j zijj=1

p

∑⎛

⎝⎜⎞

⎠⎟

2

i=1

N

∑⎧⎨⎪

⎩⎪

⎫⎬⎪

⎭⎪

why L1?beta -> 0LARS

β̂ jj=1

p

∑ ≤ t βols jj=1

p

Wednesday, June 24, 2009

0.25 0.3 0.35 0.4 0.45 0.5 0.55

0.2

0.3

0.4

0.5

0.6

training error

ne

w d

ata

err

or

RMS error over 300 biclusters

1

2

3

4

5

6

Counts

0.3 0.4 0.5 0.6 0.7 0.8 0.9

0.2

0.4

0.6

0.8

1

training cor

ne

w d

ata

co

r

Cor over 300 biclust

12346789101112131416171819

Counts

B. C.

D. E. F.

RMSD over trianing

RMSD (%var)

Fre

qu

en

cy

0.2 0.4 0.6 0.8

02

06

01

00

A.

mean = 0.369

RMSD (new conditions)

RMSD (%var)

Fre

qu

en

cy

0.2 0.4 0.6 0.8

02

04

06

08

0 mean = 0.375

Cor over trianing

Corr true vs. pred

Fre

qu

en

cy

0.0 0.2 0.4 0.6 0.8 1.0

04

08

01

20

mean = 0.788

Cor (new conditions) over

Corr true vs. pred

Fre

qu

en

cy

0.0 0.2 0.4 0.6 0.8 1.0

02

04

06

08

0 mean = 0.807

Predictive power over 130 new experiments

77. Amino acid uptake

! "

123 .Cell motility

150. Ribosome

205 . Phosphte uptake209 . Cation/ Zn transport

214 . Fe transport

217. Fe-S clusters, Heavy metal transport

244. Bop, DMSO resperation

251. DNA repair, nucleotide metabolism

258. Phosphate consumption

273. Pyrimidine biosynthesis

69 . K transport

Wednesday, June 24, 2009

Prediction of outcome following genetic and Environmental perturbations

Wednesday, June 24, 2009

VN G0040C

AN D

AN D

217

AN D

AN D

VN G2163HAN D

AN D

69

AN D

AN D

AN D

VN G0293H

125

257

214

289

251

282

86

205

150

264

232

77 3

238

6

11

215

273

174

163

124

209

79

68258

AN D

83

123

298

226

AN D

AN D

28

AN D

trh3

trh5

trh7

trh4

tbpd

csp d1

phou ka ic

rhl

imd1

bat

idr2

asn c

Fe transport, heme-aerotaxisDNA repair and mixed nucleotide metabolismPotasium transportpyridine biosythesisPhototrophy and DMSO metabolismCell motilityUnknown / MixedPhosphate uptakeAmino acid uptakeColbamine bisynthesis Phosphate consumptionCation / Zinc transportRibosomeFe-S clusters, Heavy metal transport, molybendum cofactor biosynthesis

VN G6 88C2

156

VN G0156C

inferred trh controlled subnetwork:

Homeostasisis an emergent

property ofthe global

network

Bonneau, et al, Genome Biology, 2006, Bonneau, et al. Cell, 2007

edges validated since by ChIP-chipM. Facciotti, N. Baliga

Wednesday, June 24, 2009

Finite difference approximation is poor

Inferelator 1--- Limitations

tk tk+1 tk+2

Error over long time intervals

PredictError propagation

Wednesday, June 24, 2009

Finite difference approximation is poor

Inferelator 1--- Limitations

tk tk+1 tk+2

Error over long time intervals

Predict

Wednesday, June 24, 2009

Finite difference approximation is poor

Inferelator 1--- Limitations

tk tk+1 tk+2

Error over long time intervals

Wednesday, June 24, 2009

Finite difference approximation is poor

Inferelator 1--- Limitations

tk tk+1 tk+2

Wednesday, June 24, 2009

Predictorslevels are const.

Finite difference approximation is poor

Inferelator 1--- Limitations

tk tk+1 tk+2

Wednesday, June 24, 2009

Explicit global solutions using Metropolis-Hastings:

Calculate gradient of the Energy with respect to

E β( ) = 12

xi(tk ) − xiobs(tk )( )

i=1

M

∑k=1

K

∑2

(1)

∂E β( )∂β j

= xi(tk ) − xiobs(tk )( )

i=1

M

∑k=1

K

∑ ∂xi(tk )∂β j

(2)

β jn+1 = β j

n − h ∂E(βn )

∂β j

+ 2σh ∗ζ jn

(3)

Slope

SlopeStep

Temperature

~N(0,1)

Aviv Madar on Right

Work done with Eric Vanden-Eijnden,Courant Institute of Mathematical Sciences,NYU

Wednesday, June 24, 2009

Inferelator 2: Concepts

tk tk+1

Inject intermediate time points

Wednesday, June 24, 2009

Inferelator 2: Concepts

tk tk+1

Wednesday, June 24, 2009

Finite difference approximation is improved

Inferelator 2: Concepts

tk tk+1

Predictorslevels are not const.

Wednesday, June 24, 2009

Finite difference approximation is improved

Inferelator 2: Concepts

tk tk+1

Predictorslevels are not const.

Wednesday, June 24, 2009

Finite difference approximation is improved

Inferelator 2: Concepts

tk tk+1

Predictorslevels are not const.

Wednesday, June 24, 2009

Finite difference approximation is improved

Inferelator 2: Concepts

tk tk+1

How do we estimate parameters?

Predictorslevels are not const.

Wednesday, June 24, 2009

Markovchain

Importancesampling

Gaussian Noise term

Inferelator 2: Mathematical Overview

Error over time series data

Error over steady state data

L2 norm constraint/regularizer

Minimize Energy (scoring/objective function)

Markov Chain Monte Carlo (MCMC) scheme to sample parameters

Wednesday, June 24, 2009

Markovchain

Importancesampling

Gaussian Noise term

Inferelator 2: Mathematical Overview

Error over time series data

Error over steady state data

L2 norm constraint/regularizer

Markov Chain Monte Carlo (MCMC) scheme to sample parameters

Wednesday, June 24, 2009

Markovchain

Importancesampling

Gaussian Noise term

Inferelator 2: Mathematical Overview

Error over time series data

Error over steady state data

L2 norm constraint/regularizer

Wednesday, June 24, 2009

Inferelator 2: Gradient Approximation

Wednesday, June 24, 2009

Inferelator 2: L1-norm of Parameters

Wednesday, June 24, 2009

Inferelator 2: Performance 5

time interval, minutes ->

length of time interval

rela

tive

err

or

Length of Time Interval vs. Relative Error

0 10 20 30 40 50 60

0.4

.81.2

1.6

train set

test set

Wednesday, June 24, 2009

cspd2

VNG0040C

rhl

VNG0320H

3

kaic

AND

trh7

tbpe

AND

tbpd

4

tbpc

5

7

phou

AND

AND

VNG0293H

VNG2641H

AND

gvpe2

1H

VNG1029C

snp

VNG0194H

idr2

28

29

pai1

VNG2614H

4950

AND

AND

AND

52

AND

53

AND

78

tfbg

boa3

113

140

173

AND AND

mixed time scales / mixed data-types:Learn regulatory interactions from sub-optimal datasetsMixed signaling & regulatory netsAdding metabolic effects

Inferelator 2, more explicit dynamics:New proposal distributionsNew functional forms for interactionsTesting in a wider variety of systems

Stochastic bayes /SDE aproach:Estimate or measure system convergence as well as mean, model error,multiple system paths

II. The Inferelator: Future Directions

Wednesday, June 24, 2009

post-docs for protein design, prediction & network inference

Wednesday, June 24, 2009

AcknowledgmentsBonneau lab:Glenn ButterfossKevin DrewAviv MadarPeter WaltmanThadeous KacmarczykShailla MusharofDevorah KengmanaChris Poultny (Shasha)Irina NudelmanAlex Pearlman (Ostrer)Alex Pine

NYU:Eric Vanden-EijndenHarry OstrerMike PuruggananPatrick EichenbergerDennis Shasha

Tacitus- Howard Coale

• IBM– Robin Wilner

– Bill Boverman– Viktors Berstis– Rick Alther

• ETH Zurich

- Reudi Aebersold - Lars Malmstroem

Mike BoxemMarc Vidal

Dave Goodlett

Jochen Supper (Zell Lab)

- ISB

– Nitin Baliga (&lab)– Leroy Hood – Marc Facciotti

– David Reiss– Vesteinn Thorsson- Paul Shannon

- Iliana Avila-Campillo (MERC)

Alan Aderem

DOD-computing and society, NSF ABI, NSF Plant genome NSF DBI,DOE GTL

Rosetta CommonsCharlie Strauss (los alamos)David Baker (UW seattle)

Wednesday, June 24, 2009

1

Rosetta de novo structure prediction: The Human Proteomefolding Project

Kevin Drew,Lars Malmstroem,

Glenn Butterfoss,

Richard Bonneau

Rosetta Commons

Wednesday, June 24, 2009

4

Motivation: Genome AnnotationCheaper sequencing technologies

New protein sequences

Proteins w/ unknown function

Shibu Yooseph et al. 2007 Plos Biology

Wednesday, June 24, 2009

7

Background: Quick ExampleBacteriocin AS-48, Casp 4

1E68 1NKLGYFCESCRKIIQKLEDMVGPQPNEDTVTQAASQVCDKLKILRGLCKKIMRSFLRRISWDILTGKKPQAICVDIKICKE

MAKEFGIPAAVAGTVLNVVEAGGWVTTIVSILTAVGSGGLSLLAAAGRESIKAYLKKEIKKKGKRAVIAW 4%=

=

Cyclic Bacterial Lysin = NK Lysin

Structure:

Function:

Sequence:

Bonneau, R., Tsai, J., Ruczinski, I., Baker, D. Functional Inferences from Blind ab Initio Protein Structure Predictions. J. Structural Biology. (2001)

Wednesday, June 24, 2009

Plasmodium

SBRI top candidates for Vaccine

for preventing pregnancy

malaria

Wednesday, June 24, 2009

Arabidopsis example:

RPT3

1: cofactor

2: point of mutation causing

differential response to morphogen

1

2

Wednesday, June 24, 2009

E Andersen-Nissen, R Bonneau, R Strong, A AderemJournal of Experimental Medicine, 2007

Distant Multi-template fold recognition for Toll receptors

Wednesday, June 24, 2009

Lars Malmstroem

Kevin Drew

Wednesday, June 24, 2009

Rosetta

Local Sequence Bias

Non-local Interactions

Kevin Drew, Chivian, D., Bonneau, R. Ab initio structure prediction. (In) Bourne, P.E. (2007) Structural Bioinformatics (Methods of Biochemical Analysis, V. 44). New York: John Wiley & Sons; ISBN: 0471201995. Second Edition.

Wednesday, June 24, 2009

Rosetta

Local Sequence Bias

Non-local Interactions

CC

R

N

F Y

N

Hq

HH d

Experimental, Evolution

Kevin Drew, Chivian, D., Bonneau, R. Ab initio structure prediction. (In) Bourne, P.E. (2007) Structural Bioinformatics (Methods of Biochemical Analysis, V. 44). New York: John Wiley & Sons; ISBN: 0471201995. Second Edition.

Wednesday, June 24, 2009

Rosetta Fragment Libraries

• 25-200 fragments for each/every 3 and 9 residue sequence window (overlapping)

• Selected from database of known structures > 2.5Å resolution < 50% sequence identity

• Ranked by sequence similarity and similarity of predicted and known secondary structure

• Fragments restrict search to protein-like local conformations

Sequence similar or exact sequence BUT not long enough that the similarity is attributable to evolution (not homologous strictly speaking).

Wednesday, June 24, 2009

χ

χ

Low resolution:

Atom Model centroid reduction of side chains

Energy function terms

van der Waals repulsion

“pair” terms (electrostatics)

residue environment (prob of burial)

2º structure pairing terms (H-bonds)

radius of gyration

packing density

Implicit terms

fragments (local interactions)

Wednesday, June 24, 2009

Low resolution:

Atom Model centroid reduction of side chains

Energy function terms

van der Waals repulsion

“pair” terms (electrostatics)

residue environment (prob of burial)

2º structure pairing terms (H-bonds)

radius of gyration

packing density

Implicit terms

fragments (local interactions)

d

E

CLASH!!

(rij2 − dij

2 )2

rijj< i∑

i∑ ;dij < rij

d = distance

r = radii∑

Evaluate between Centoids and Backbone Atoms

Wednesday, June 24, 2009

Low resolution:

Atom Model centroid reduction of side chains

Energy function terms

van der Waals repulsion

“pair” terms (electrostatics)

residue environment (prob of burial)

2º structure pairing terms (H-bonds)

radius of gyration

packing density

Implicit terms

fragments (local interactions)

Pair-wise probability based on PDB statistics(electrostatics)

−lnP(aai,aa j | sijdij )

P(aai | sijdij )P(aai | sijdij )

⎣ ⎢

⎦ ⎥

j> i∑

i∑

aa = residue typed = centroid distance (binned, interpolated)s = sequence seperation (must be > 8 res )

Wednesday, June 24, 2009

Low resolution:

Atom Model centroid reduction of side chains

Energy function terms

van der Waals repulsion

“pair” terms (electrostatics)

residue environment (prob of burial)

2º structure pairing terms (H-bonds)

radius of gyration

packing density

Implicit terms

fragments (local interactions)

neighbors within 10 Å of Cβ

binned by : 0-3, 4,5, … , >30

also interpolated�

−ln P(aai | neighborsi)[ ]i∑

Probability of burial /exposure(solvation)

Wednesday, June 24, 2009

Low resolution:

Atom Model centroid reduction of side chains

Energy function terms

van der Waals repulsion

“pair” terms (electrostatics)

residue environment (prob of burial)

2º structure pairing terms (H-bonds)

radius of gyration

packing density

Implicit terms

fragments (local interactions)

Optimize 2º orientation

Wednesday, June 24, 2009

Low resolution:

Atom Model centroid reduction of side chains

Energy function terms

van der Waals repulsion

“pair” terms (electrostatics)

residue environment (prob of burial)

2º structure pairing terms (H-bonds)

radius of gyration

packing density

Implicit terms

fragments (local interactions)

N

R1

R2

C

O

Represent protein as vectors of2 residue “strands”

sheet vector

helix vector

Wednesday, June 24, 2009

Low resolution:

Atom Model centroid reduction of side chains

Energy function terms

van der Waals repulsion

“pair” terms (electrostatics)

residue environment (prob of burial)

2º structure pairing terms (H-bonds)

radius of gyration

packing density

Implicit terms

fragments (local interactions)

Coordinate system

v1

v2

θ

φ

σ

r

hb

Scores selected to discriminate “near native structures for “non native”:

Relative direction (φ,θ)

Relative H-bond orientation (hb)

Distance (r, rσ)

Number of sheets given number of strands

Helix-Strand Packing

Wednesday, June 24, 2009

Low resolution:

Atom Model centroid reduction of side chains

Energy function terms

van der Waals repulsion

“pair” terms (electrostatics)

residue environment (prob of burial)

2º structure pairing terms (H-bonds)

radius of gyration

packing density

Implicit terms

fragments (local interactions)

Used in earlier stages and for filtering�

RG = dij2

Density = −lnPcompact (neighborsi,sh )Prandom (neighborsi,sh )

⎣ ⎢

⎦ ⎥

sh∑

i∑

Promote a compact fold

Wednesday, June 24, 2009

χ

χ

High resolution:

Atom Model full atom representation

Energy function terms

Rotamer (Dunbrack)

Ramachandran

Solvation (Lazaridius Karplus)

Hydrogen bonding

Lennard-Jones

Pair (electrostatic)

Reference energies

Wednesday, June 24, 2009

9

Process of Obtaining Structures1. Split proteins into domains (ginzu,

Chivian) (chop, Rost)

2. Find domains we can annotate using Rosetta

3. Fold remaining domains using Rosetta on IBM’s World Community Grid

• 180,000 domains folded from 120 genomes

BIG caveat emptor:all results from this pointfor domains < 170 aa

Wednesday, June 24, 2009

worldcommunitygrid.org & grid.org

collaborators: Lars Malmstroem, Viktors Berstis, Mike Riffle, Leroy Hood, David Baker

Wednesday, June 24, 2009

Bacterial and Archaea:Bonneau, & Baliga. (2004)Genome Biology:Annotaion of Halobacterium NRC-1identification of transcription factorsrole of chemotaxis sensing domains

Yeast:Malstroem, Baker, Bonneau (2006) Plos Biology

Human & others:Bonneau, Malstroem, IBM: Human and others (in Progress)

Completed and ongoing projects

Wednesday, June 24, 2009

12

overview of approach

Wednesday, June 24, 2009

13Molecular Function GO Tree

P( MF | Predicted Structure, GO Process, GO Localization, … )

specificity

overview of approach

Wednesday, June 24, 2009

15

1. Structure (MCM) Score

2. Training Set (attaching structure to GO)

3. Naïve Bayes• Naïve Bayes with continuous SF prob• Naïve Bayes with GO terms

overview of approach

Wednesday, June 24, 2009

16

Mammoth Confidence Metric (MCM)• Compare Cluster Representatives

to PDB Structures• MCM Score [0…1] probability • based on:

– Quality of match– Rosetta quality– Length ratio of PDB and cluster

rep– Contact Order

Wednesday, June 24, 2009

17

Gene Ontology (GO) & Training Data • Function, Process, Localization terms

• 1.6 million sequences with annotations•

BLAST astral sequences to GO sequences (astral = pdb w/ SCOP SF)

• Cluster using CD-hit to reduce redundancy

• Cluster again using genome of benchmark sequences and remove matches

Wednesday, June 24, 2009

18

Naïve Bayes

y = molecular function and x = {sf, bp, cc}

Wednesday, June 24, 2009

19

Naïve Bayes w/ Superfamilies

• How to take continuous probabilities of SF (by way of mcm scores) – we weight log-likelihood by the mcm scores:

Wednesday, June 24, 2009

20

Naïve Bayes w/ GO terms• Problem: Go terms are not

independent – if we use all terms annotated to a sequence we end up

double counting

• Solution: pick a term that will be predictive– Mutual information between term and MF

Wednesday, June 24, 2009

22

Results: Solved StructuresHow accurate are we when we predict SCOP Superfamily for PDB Structures?

Wednesday, June 24, 2009

23

Results: Since Solved Structures (2005)How accurate are we when we predict SCOP Superfamily for Swissprot Proteins?

P(s|rosetta)

Wednesday, June 24, 2009

24

Results: Bayes Function Prediction (Swissprot Benchmark )How accurate are our function predictions using structure only?

77 unique molecular functions100 unique molecular functions

P(MF|rosetta)

Wednesday, June 24, 2009

25

Results: Bayes Function Prediction (Swissprot Benchmark )How accurate are our function predictions using GO process & structure?

P(MF|rosetta,P)

Wednesday, June 24, 2009

VAM6/ YDL077C: Vacuolar protein that plays a critical role in the tethering steps of vacuolar membrane fusion by facilitating guanine nucleotide exchange on small guanosine triphosphatase Ypt7p.We find the following: (domain1) unknown (domain 2) Rosetta hit to Polynucleotide phosphorylase/guanosine pentaphosphate synthase (PNPase/GPSI), domain 3 (domain 3) Clathrin proximal leg (domain 4) Rosetta de novo hit to Hemerythrin (domain 5) Rosetta hit to SAM/ Pointed domain.

VPS29: Endosomal protein that is a subunit of the membrane-associated retromer complex essential for endosome-to-Golgi retrograde transport; forms a subcomplex with Vps35p and Vps26p that selects cargo proteins for endosome-to-Golgi retrieval. But, with this context so well defined, there is still no molecular function known, that is to say there is no precise mechanistic role known for this protein. We find a strong hit to Mre11 ( a double stranded mismatch repair protein, metal dependent phosphotase for domain 1 and a strong Rosetta hit for domain 2 to the PUA-domain like fold (implicated in RNA binding OR ATP sulfurylase N-terminal domain). The fold predictions are as confident as we ever see (MCM = 0.95, psiblast evalue to domain 1 hit Z = 13. ).

Results: HPF:vesicle transport

Wednesday, June 24, 2009