prediction of protein contact maps piero fariselli department of biology university of bologna

Post on 13-Jan-2016


Prediction of protein contact maps


Piero Fariselli

Department of Biology, University of Bologna

From Sequence to Function

Functional Genomics and Proteomics

Genomic sequences

>BGAL_SULSO BETA-GALACTOSIDASE Sulfolobus solfataricus.
MYSFPNSFRFGWSQAGFQSEMGTPGSEDPNTDWYKWVHDPENMAAGLVSGDLPENGPGYWGNYKTFHDNAQKMGLKIARLNVEWSRIFPNPLPRPQNFDESKQDVTEVEINENELKRLDEYANKDALNHYREIFKDLKSRGLYFILNMYHWPLPLWLHDPIRVRRGDFTGPSGWLSTRTVYEFARFSAYIAWKFDDLVDEYSTMNEPNVVGGLGYVGVKSGFPPGYLSFELSRRHMYNIIQAHARAYDGIKSVSKKPVGIIYANSSFQPLTDKDMEAVEMAENDNRWWFFDAIIRGEITRGNEKIVRDDLKGRLDWIGVNYYTRTVVKRTEKGYVSLGGYGHGCERNSVSLAGLPTSDFGWEFFPEGLYDVLTKYWNRYHLYMYVTENGIADDADYQRPYYLVSHVYQVHRAINSGADVRGYLHWSLADNYEWASGFSMRFGLLKVDYNTKRLYWRPSALVYREIATNGAITDEIEHLNSVPPVKPLRH

Protein sequences Protein structures

Protein functions

The Protein Folding

T T C C P S I V A R S N F N V C R L P G T P E A L C A T Y T G C I I I P G A T C P G D Y A N

(Rost B.) http://dodo.cpmc.columbia.edu/cubic/papers/

The Data Bases of Sequences and Structures


EMBL: 195,241,608 sequences; 292,078,866,691 nucleotides

UniProt: 428,650 sequences; 154,416,236 residues

PDB: 68,000 structures (membrane proteins: 1%)

(November 2009)

What is a multiple alignment? The short answer is this:

VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWESNG--

Aligned sequences (rows) by sequence position (columns):

 1  Y K D Y H S - D K K K G E L - -
 2  Y R D Y Q T - D Q K K G D L - -
 3  Y R D Y Q S - D H K K G E L - -
 4  Y R D Y V S - D H K K G E L - -
 5  Y R D Y Q F - D Q K K G S L - -
 6  Y K D Y N T - H Q K K N E S - -
 7  Y R D Y Q T - D H K K A D L - -
 8  G Y G F G - - L I K N T E T T K
 9  T K G Y G F G L I K N T E T T K
10  T K G Y G F G L I K N T E T T K

Sequence profile (rows: residue type; columns: sequence position; values in %):

Pos   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16
A     0   0   0   0   0   0   0   0   0   0   0  10   0   0   0   0
C     0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
D     0   0  70   0   0   0   0  60   0   0   0   0  20   0   0   0
E     0   0   0   0   0   0   0   0   0   0   0   0  70   0   0   0
F     0   0   0  10   0  33   0   0   0   0   0   0   0   0   0   0
G    10   0  30   0  30   0 100   0   0   0   0  50   0   0   0   0
H     0   0   0   0  10   0   0  10  30   0   0   0   0   0   0   0
K     0  40   0   0   0   0   0   0  10 100  70   0   0   0   0 100
I     0   0   0   0   0   0   0   0  30   0   0   0   0   0   0   0
L     0   0   0   0   0   0   0  30   0   0   0   0   0   0   0   0
M     0   0   0   0   0   0   0   0   0   0   0   0   0  60   0   0
N     0   0   0   0  10   0   0   0   0   0  30  10   0   0   0   0
P     0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
Q     0   0   0   0  40   0   0   0  30   0   0   0   0   0   0   0
R     0  50   0   0   0   0   0   0   0   0   0   0   0   0   0   0
S     0   0   0   0   0  33   0   0   0   0   0   0  10  10   0   0
T    20   0   0   0   0  33   0   0   0   0   0  30   0  30 100   0
V     0   0   0   0  10   0   0   0   0   0   0   0   0   0   0   0
W     0  10   0   0   0   0   0   0   0   0   0   0   0   0   0   0
Y    70   0   0  90   0   0   0   0   0   0   0   0   0   0   0   0

Evolutionary information

• Multiple Sequence Alignment (MSA) of similar sequences

• Sequence profile: for each position, a 20-valued vector contains the amino acid composition of the aligned sequences.

[Figure labels: MSA; Sequence profile]
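The sequence profile described above can be sketched in a few lines. This is a minimal illustration rather than the method used for the slides: the function name `sequence_profile` is hypothetical, and the choice to leave gap characters out of the per-column total is an assumption. The input is the ten-sequence alignment shown earlier.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, alphabetical


def sequence_profile(alignment):
    """Return, for each column, a 20-valued vector of residue percentages.

    Assumption: gaps ('-') and non-standard symbols are left out of the total.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in alignment if seq[pos] in AMINO_ACIDS)
        total = sum(counts.values())
        profile.append(
            [100.0 * counts[aa] / total if total else 0.0 for aa in AMINO_ACIDS]
        )
    return profile


# The ten aligned sequences from the slide, with gaps written as '-'.
msa = [
    "YKDYHS-DKKKGEL--",
    "YRDYQT-DQKKGDL--",
    "YRDYQS-DHKKGEL--",
    "YRDYVS-DHKKGEL--",
    "YRDYQF-DQKKGSL--",
    "YKDYNT-HQKKNES--",
    "YRDYQT-DHKKADL--",
    "GYGFG--LIKNTETTK",
    "TKGYGFGLIKNTETTK",
    "TKGYGFGLIKNTETTK",
]
profile = sequence_profile(msa)
first_column = dict(zip(AMINO_ACIDS, profile[0]))  # Y, G and T share column 1
```

Each entry of `profile` is one column of the alignment, so `profile[0]` describes the first sequence position.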

3D structure prediction of proteins

[Figure: prediction approaches arranged along a Homology (%) axis from 0 to 100. Labels: New folds, Existing folds; Ab initio prediction, Threading, Building by homology]

Contacts and Contact Maps

Contact definition

[Figure: residues in contact, labelled F 156, V 238, I 240, I 269, V 271, F 297 and V 299]

Protein contact definitions:

1. Based on Cα
2. Based on Cβ
3. All-atom (without hydrogens)

From the 3D structure to the contact map

Given a protein of length L and a square matrix M of dimension L × L:

For each pair of residues i and j:
    calculate the distance between i and j
    if distance < threshold:
        put 1 in the cell M(i,j)
    otherwise:
        put 0 in the cell M(i,j)
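The procedure above translates almost line by line into code. The sketch below assumes one representative 3D coordinate per residue (in Å) and uses a hypothetical function name, `contact_map`; the four toy coordinates are invented for illustration and do not come from any structure.

```python
import math


def contact_map(coords, threshold=8.0):
    """Build an L x L matrix with 1 where two residues are closer than threshold.

    coords: one (x, y, z) tuple per residue, in Angstrom (assumed input format).
    """
    n = len(coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Euclidean distance between the representative atoms of i and j
            if math.dist(coords[i], coords[j]) < threshold:
                cmap[i][j] = 1
    return cmap


# Hypothetical toy chain: four points on a straight line, 3.8 Angstrom apart.
toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
toy_map = contact_map(toy_coords, threshold=8.0)
```

The matrix is symmetric and its diagonal is always 1, because every residue is at distance 0 from itself.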

Computation of Contact Maps

[Figure: From 3D Structure (residues F 156, V 238, I 240, I 269, V 271, F 297, V 299) To Contact Map; both map axes carry the sequence TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN]

Protein Structural Classes

All-α, All-β, α+β, α/β

An Example of a Contact Map (All-α): 1C5A

[Figure: contact map of 1C5A; both axes run from 0 to 70; regions labelled 1 to 4]

An Example of a Contact Map (All-β): 1SFP

[Figure: contact map of 1SFP; both axes run from 0 to 120; N and C termini marked]

An Example of a Contact Map: 6PTI

[Figure: contact map of 6PTI; both axes run from 0 to 60; N and C termini marked]

From the contact map to the 3D structure

Two methods have been proposed:

1. Bohr et al., “Protein Structure from Distance Inequalities”, J. Mol. Biol. 1993, 231:861-869 => based on a steepest descent procedure

2. Vendruscolo and Domany, Fold. Des. 1998, 2:295-306 => based on a modified Metropolis procedure

6pti Reconstruction Efficiency (58 residues)

At M = 200:
• No of eliminated true contacts: 6% of real contacts
• No of added false contacts: 52% of real contacts

[Figure: RMSD versus M (number of random flips)]

Vendruscolo and Domany, Fold. Des. 1998

From the contact map to the 3D structure: the reconstruction efficiency

3-D Modelling through Contact Maps, example: Bacteriorhodopsin

[Figure: contact map, model and structure 1QHJ (1.9 Å); RMSD = 2.5 Å; N and C termini marked]

MARC efficiency in 3D reconstruction from the protein contact map after progressive elimination of true contacts (6pti)

[Figure: RMSD (0 to 5) versus % missing contacts (0 to 100)]

MARC efficiency in 3D reconstruction after progressive addition of wrong contacts to a protein contact map with 30% of true contacts (6pti)

[Figure: RMSD (0 to 7) versus % wrong contacts (0 to 40)]

Prediction of Contact Maps


Several methods have been applied:

Bohr et al., FEBS 1990, 261:43-46 => based on neural networks

Göbel et al., PROTEINS 1994, 18:309-317 => based on correlated mutations in proteins

Thomas et al., Prot. Eng. 1996, 9:941-948 => based on a statistical method and evolutionary information

Olmea and Valencia, Fold. Des. 1997, 2:S25-S32 => based on correlated mutations and other information

Fariselli and Casadio, Prot. Eng. 1999, 12:15-21 => based on neural networks and evolutionary information

Fariselli et al., CASP4 and Prot. Eng., in press => based on neural networks and other information

Pollastri and Baldi, Bioinformatics 2002, 18:S62-S70 => based on recurrent neural networks

Relevant points

• Contact threshold

• Sequence separation (or sequence gap)

• No of contacts vs No of non-contacts

The Contact Threshold

[Figure: contact maps drawn with contact thresholds of 16 Å, 12 Å, 8 Å and 6 Å; both axes run from 0 to 50]
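How the number of contacts shrinks as the threshold is lowered can be illustrated with a small count. The sketch below is only a toy: `count_contacts` is a hypothetical helper, and the six-point straight chain with 3.8 Å spacing is an invented input, not a real protein.

```python
import math
from itertools import combinations


def count_contacts(coords, threshold):
    """Count residue pairs (i < j) whose distance is below threshold (Angstrom)."""
    return sum(1 for a, b in combinations(coords, 2) if math.dist(a, b) < threshold)


# Hypothetical toy chain: six points on a straight line, 3.8 Angstrom apart.
toy_chain = [(3.8 * k, 0.0, 0.0) for k in range(6)]

# The four thresholds shown on the slides: 16, 12, 8 and 6 Angstrom.
contacts_by_threshold = {
    t: count_contacts(toy_chain, t) for t in (16.0, 12.0, 8.0, 6.0)
}
```

On this toy chain the count falls monotonically with the threshold; at 6 Å only sequence neighbours remain in contact.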

Sequence separation

[Figure: positions 1, 20, 40 and 100 marked along the chain …VTISCTGSSSNIGAGNHVKWYQQLPG…]

The Sequence Separation

[Figure: contact map with both axes running from 0 to 60; example of a sequence separation = 10 residues]
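A sequence-separation cutoff can be applied to a contact map as a simple mask. In this sketch the function name `filter_by_separation` is hypothetical, and the convention that pairs with |i - j| equal to the cutoff are kept is an assumption; a strict "greater than" rule would use `>` instead.

```python
def filter_by_separation(cmap, min_separation):
    """Zero out contacts between residues closer in sequence than min_separation.

    Assumption: pairs with |i - j| >= min_separation are kept.
    """
    n = len(cmap)
    return [
        [cmap[i][j] if abs(i - j) >= min_separation else 0 for j in range(n)]
        for i in range(n)
    ]


# Hypothetical toy map: five residues, all pairs marked as contacts.
all_ones = [[1] * 5 for _ in range(5)]
filtered = filter_by_separation(all_ones, min_separation=3)
remaining = sum(map(sum, filtered))  # contacts left after masking
```

Masking removes the diagonal and the near-diagonal band, which is where trivial contacts between sequence neighbours sit.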

Frequency distribution of the real and hypothetical contacts as a function of sequence separation

Relation between the number of contacts and the protein length

[Figure: number of contacts versus protein length]

Evaluation of the efficiency of contact map predictions

1) Accuracy:

A = Ncp* / Ncp

where Ncp* and Ncp are the number of correctly assigned contacts and the total number of predicted contacts, respectively.

2) Improvement over a random predictor:

R = A / (Nc/Np)

where Nc/Np is the accuracy of a random predictor; Nc is the number of real contacts in the protein of length Lp, and Np is the number of all possible contacts.

3) Difference in the distribution of the inter-residue distances in the 3D structure for predicted pairs compared with all pair distances in the structure (Pazos et al., 1997):

$X_d = \sum_{i=1}^{n} \frac{P_{ic} - P_{ia}}{n \, d_i}$

where n is the number of bins of the distance distribution (15 equally distributed bins from 4 to 60 Å cluster all the possible distances of residue pairs observed in the protein structure); d_i is the upper limit (normalised to 60 Å) of each bin, e.g. 8 Å for the 4 to 8 Å bin; P_ic and P_ia are the percentage of predicted contact pairs (with distance between d_{i-1} and d_i) and that of all possible pairs, respectively.
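The three measures can be sketched as follows. All function names are hypothetical. For Xd the code assumes 15 bins of 4 Å with upper limits 4, 8, ..., 60 Å, and it places distances beyond 60 Å in the last bin; both choices are assumptions about details the text does not spell out.

```python
def accuracy(predicted_pairs, real_pairs):
    """A = Ncp* / Ncp: fraction of predicted contacts that are real contacts."""
    predicted, real = set(predicted_pairs), set(real_pairs)
    return len(predicted & real) / len(predicted)


def improvement_over_random(a, n_real, n_possible):
    """R = A / (Nc / Np)."""
    return a / (n_real / n_possible)


def xd(pred_distances, all_distances, n_bins=15, bin_width=4.0, max_dist=60.0):
    """Xd = sum_i (Pic - Pia) / (n * d_i), with d_i normalised to max_dist."""

    def percentages(distances):
        counts = [0] * n_bins
        for d in distances:
            # assumption: distances beyond the last limit go into the last bin
            counts[min(int(d // bin_width), n_bins - 1)] += 1
        return [100.0 * c / len(distances) for c in counts]

    pic = percentages(pred_distances)
    pia = percentages(all_distances)
    total = 0.0
    for i in range(n_bins):
        d_i = (i + 1) * bin_width / max_dist  # normalised upper limit of bin i
        total += (pic[i] - pia[i]) / (n_bins * d_i)
    return total


# Hypothetical toy data: four predicted pairs, three real contacts, 30 possible pairs.
a = accuracy([(1, 10), (2, 20), (3, 30), (4, 40)], [(1, 10), (2, 20), (5, 50)])
r = improvement_over_random(a, n_real=3, n_possible=30)
xd_short = xd([5.0] * 4, [5.0, 30.0, 50.0, 58.0])  # predictions sit at short range
```

A positive Xd means the predicted pairs are shifted towards shorter distances than the overall pair distribution; identical distributions give 0.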

Tools out of machine learning approaches: Neural Networks

[Figure: Data Base Subset → Training on a known mapping (e.g. TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN) → General rules; New sequence → Prediction]

Contact definition used:

• C- C distance < 0.8 nm

• Sequence gap > 7 residues

L<100: 1a1i_A 1a1t_A 1a68 1a7i 1acp 1ah9 1aho 1aie 1ail 1ajj 1aoo 1ap0 1ark 1awd 1awj 1awo 1bbo 1bc8_C 1brf 1c5a 1cfh 1ctj 1cyo 1fna 1hev 1hrz_A 1kbs 1mbh 1mbj 1msi 1mzm 1nxb 1ocp 1opd 1pce 1plc 1pou 1ppt 1rof 1sco 1spy 1sro 1tbn 1tiv 1tle 1tsg 1ubi 1uxd 2acy 2adx 2bop_A 2ech 2fdn 2fn2 2fow 2hfh 2hoa 2hqi 2lef_A 2sn3 2sxl 3gat_A 3mef_A 4mt2 5pti

L: 100-169: 1a62 1a6g 1acz 1asx 1aud_A 1ax3 1b10 1bc4 1bd8 1bea 1bfe_A 1bfg 1bgf 1bkf 1bkr_A 1br0 1bsn 1bv1 1bxa 1c25 1cew_I 1cfe 1cyx 1dun 1eca 1erv 1exg 1hfc 1ifc 1jvr 1kpf 1kte 1mak 1npk 1pdn_C 1pkp 1poa 1put 1ra9 1rcf 1rie 1skz 1tam 1vsd 1whi 2fsp 2gdm 2ilk 2lfb 2pil 2tgi 2ucz 3chy 3lzt 3nul 5p21 7rsa

L: 170-299: 1ad2 1akz 1amm 1aol 1ap8 1bf8 1bjk 1byq_A 1c3d 1cdi 1cne 1cnv 1csn 1ezm 1fts 1juk 1kid 1mml 1mrj 1nls 1ppn 1rgs 1rhs 1thv 1vin 1xnb 1yub 1zin 2baa 2fha

L>300: 16pk 1a8e 1ads 1arv 1axn 1b0m 1bg2 1bgp 1bxo 1dlc 1irk 1iso 1kvu 1moq 1svb 1uro_A 1ysc 2cae 2dpg 2pgd 3grs

The database of proteins used to train and test the contact map predictors.

Neural Network-based predictor

• 1 output neuron (contact/non-contact)

• 1 hidden layer with 8 neurons

• Input layer with 1071 input neurons:

• Ordered residue pairs (1050 neurons)

• Secondary structures (18 neurons)

• Correlated mutations (1 neuron)

• Sequence conservation (2 neurons)
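The layer sizes listed above (1071 inputs, 8 hidden neurons, 1 output) are enough to sketch a forward pass. Everything else here is an assumption: the sigmoid activation, the bias terms, the random untrained weights and the function names `init_network` and `forward` are illustrative choices, not details taken from the slides.

```python
import math
import random

N_INPUT = 1050 + 18 + 1 + 2  # ordered pairs + secondary structure + correlation + conservation
N_HIDDEN = 8


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_network(seed=0):
    """Return small random weights; the last entry of each row is the bias."""
    rng = random.Random(seed)
    w_hidden = [
        [rng.uniform(-0.01, 0.01) for _ in range(N_INPUT + 1)] for _ in range(N_HIDDEN)
    ]
    w_out = [rng.uniform(-0.01, 0.01) for _ in range(N_HIDDEN + 1)]
    return w_hidden, w_out


def forward(x, w_hidden, w_out):
    """Map one 1071-valued input vector to a contact score between 0 and 1."""
    if len(x) != N_INPUT:
        raise ValueError(f"expected {N_INPUT} inputs, got {len(x)}")
    # zip stops at the shorter sequence, so the trailing bias is added separately
    hidden = [sigmoid(w[-1] + sum(wi * xi for wi, xi in zip(w, x))) for w in w_hidden]
    return sigmoid(w_out[-1] + sum(wi * hi for wi, hi in zip(w_out, hidden)))


w_hidden, w_out = init_network()
score = forward([0.0] * N_INPUT, w_hidden, w_out)  # untrained, so not meaningful
```

With untrained weights the score carries no information; a trained network would be thresholded to call contact or non-contact.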

(A) An alignment of 5 (hypothetical) sequences as they are represented in an HSSP file (Sander and Schneider, 1991). i and j stand for the positions of the two residues making or not making contact (A and D in the leading sequence, sequence 1). (B) Single sequence coding. The position representing the couple (AD) in the vector is set to 1.0 while the other positions are set to 0. (C) Multiple sequence coding. For each sequence in the alignment (1 to 5 in the scheme in A), the couple of residues in positions i and j is counted. The final input coding, representing the frequency of each couple in the alignment, is normalized to the number of sequences.

Representation of the input coding based on ordered couples.
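The single-sequence and multiple-sequence codings described in the caption can be sketched as one function. Two assumptions are made here: a couple and its symmetric (AD and DA) share one slot, which gives 210 slots for 20 residue types, and sequences with a gap at either position contribute nothing. The sketch covers one couple vector only; how the 1050 input neurons are assembled from such vectors is not stated in the transcript. The five toy sequences are invented.

```python
from itertools import combinations_with_replacement

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # alphabetical, so sorted pairs match the keys

# One slot per unordered couple of residue types: 20 * 21 / 2 = 210 slots.
COUPLES = {
    pair: k for k, pair in enumerate(combinations_with_replacement(AMINO_ACIDS, 2))
}


def encode_couple(alignment, i, j):
    """Frequency of each residue couple at positions (i, j) over the alignment."""
    vector = [0.0] * len(COUPLES)
    for seq in alignment:
        pair = tuple(sorted((seq[i], seq[j])))
        if pair in COUPLES:  # assumption: gaps and unknown symbols are skipped
            vector[COUPLES[pair]] += 1.0
    return [v / len(alignment) for v in vector]


# Single sequence coding: one slot is set to 1.0.
single = encode_couple(["AD"], 0, 1)

# Multiple sequence coding on five hypothetical two-residue sequences.
multiple = encode_couple(["AD", "DA", "AE", "AD", "GD"], 0, 1)
```

With one sequence the vector is one-hot; with an alignment the slots hold couple frequencies that sum to 1 when no sequence is skipped.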

Correlated mutations

Multiple sequence alignment (N sequences):

1 MVKGPGLYTDIGKKARDLLYKDYHSDKKFTISTYSPTGVAITSS
2 MVKGPGLYSDIGKRARDLLYRDYQSDHKFTLTTYTANGVAITST
3 MVKGPGLYTEIGKKARDLLYRDYQGDQKFSVTTYSSTGVAITTT

The N sequences give M = N·(N-1)/2 couples of sequences; here the couples are (1, 2), (1, 3) and (2, 3).

For two positions i and j, each couple of sequences contributes one substitution score:

Position i: S(T;S), S(T;T), S(S;T)
Position j: S(I;L), S(I;V), S(L;V)

S: McLachlan substitution matrix

M-valued vectors: V_i, V_j

Correlation:

$C_{ij} = \frac{1}{M} \sum_{k=1}^{M} \frac{\left(V_i(k) - \langle V_i \rangle\right)\left(V_j(k) - \langle V_j \rangle\right)}{\sigma_{V_i} \, \sigma_{V_j}}$
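The correlation between two alignment columns can be sketched on the three sequences shown above. The McLachlan matrix is not reproduced here, so `toy_score` (1 for identical residues, 0 otherwise) is a hypothetical stand-in, and returning 0.0 when a column has zero variance is an assumption. The function names are illustrative.

```python
import math
from itertools import combinations


def toy_score(a, b):
    """Hypothetical stand-in for the McLachlan matrix: 1 if identical, else 0."""
    return 1.0 if a == b else 0.0


def substitution_vector(alignment, pos, score=toy_score):
    """M-valued vector: one substitution score per couple of sequences."""
    return [score(s[pos], t[pos]) for s, t in combinations(alignment, 2)]


def correlation(alignment, i, j, score=toy_score):
    """C_ij = (1/M) * sum_k (Vi(k) - <Vi>)(Vj(k) - <Vj>) / (sigma_i * sigma_j)."""
    vi = substitution_vector(alignment, i, score)
    vj = substitution_vector(alignment, j, score)
    m = len(vi)
    mean_i, mean_j = sum(vi) / m, sum(vj) / m
    sd_i = math.sqrt(sum((v - mean_i) ** 2 for v in vi) / m)
    sd_j = math.sqrt(sum((v - mean_j) ** 2 for v in vj) / m)
    if sd_i == 0.0 or sd_j == 0.0:
        return 0.0  # assumption: a column without variation carries no signal
    covariance = sum((a - mean_i) * (b - mean_j) for a, b in zip(vi, vj))
    return covariance / (m * sd_i * sd_j)


msa = [
    "MVKGPGLYTDIGKKARDLLYKDYHSDKKFTISTYSPTGVAITSS",
    "MVKGPGLYSDIGKRARDLLYRDYQSDHKFTLTTYTANGVAITST",
    "MVKGPGLYTEIGKKARDLLYRDYQGDQKFSVTTYSSTGVAITTT",
]
# 0-based columns 8 (T, S, T) and 13 (K, R, K) change in the same sequence.
c_same_pattern = correlation(msa, 8, 13)
# Column 30 (I, L, V) has no identical couple, so the toy vector is constant.
c_no_variation = correlation(msa, 8, 30)
```

Swapping `toy_score` for a lookup in a real substitution matrix would keep the rest of the calculation unchanged.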

The neural network architecture for prediction of contact maps

Accuracy of contact map prediction using a cross-validated data set (170 proteins)

[Figure: histogram of No of proteins (0 to 35) versus Accuracy (0 to 0.7)]

T0087: 310 residues (A = 0.20, FR/NF)

T0106: 123 residues (A = 0.06, FR/NF)

T0128: 222 residues (A = 0.24, CM)

T0110: 128 residues (A = 0.30, FR)

T0125: 141 residues (A = 0.03, CM)

T0124: 242 residues (A = 0.01, NF)

TARGET: T0115 (300 residues) (A = 0.17, FR/NF); PDB code: 1FWK (Homoserine kinase, Methanococcus jannaschii)

[Figures: one contact map per target, with N and C termini marked; for T0115 both axes are sequence position, 0 to 300]

Predictive performance on 29 targets

Target   Q3 (SS)  Pred. Fr(H)  Pred. Fr(E)  Obs. Fr(H)  Obs. Fr(E)    Lp   Nal     Xd     A  Class
T0087      0.777        0.453        0.204       0.401       0.207   310    15   10.5  0.20  FR/NF
T0089      0.762        0.333        0.328       0.397       0.331   419    37    6.6  0.10  CM/FR/NF
T0094      0.701        0.441        0.254       0.294       0.356   181     2    1.4  0.04  FR/NF
T0099      0.625        0.000        0.411       0.000       0.071    56   358    6.0  0.25  CM/FR
T0100      0.664        0.117        0.456       0.088       0.377   342   132    9.2  0.11  FR
T0101      0.718        0.040        0.403       0.045       0.290   400     8    4.3  0.07  FR
T0105      0.649        0.181        0.298       0.202       0.213    94    43    0.5  0.08  FR/NF
T0107      0.803        0.016        0.463       0.074       0.463   188    16    7.8  0.14  FR
T0108      0.754        0.006        0.475       0.078       0.374   206     9    9.4  0.18  FR
T0109      0.819        0.527        0.165       0.451       0.159   182    19    7.7  0.18  FR
T0110      0.832        0.474        0.326       0.451       0.159   128    27    8.3  0.30  FR
T0111      0.742        0.493        0.142       0.428       0.177   431   222    7.2  0.08  CM
T0112      0.701        0.213        0.365       0.316       0.270   352   704    5.8  0.17  CM/FR
T0115      0.821        0.412        0.250       0.375       0.260   300    45    7.5  0.17  FR/NF
T0116      0.793        0.520        0.162       0.464       0.194   811   165    7.5  0.09  FR/NF
T0121      0.796        0.285        0.336       0.304       0.349   372  1000   14.7  0.16  CM/FR
T0122      0.888        0.515        0.170       0.527       0.166   248   103   13.2  0.19  CM
T0123      0.681        0.263        0.300       0.138       0.425   160    70   12.9  0.25  CM
T0126      0.704        0.420        0.216       0.241       0.370   163     8    7.9  0.12  FR
T0127      0.750        0.512        0.154       0.509       0.142   350    70    9.2  0.14  FR
T0128      0.801        0.526        0.123       0.493       0.133   222   551   20.2  0.24  CM
all-β
T0114      0.747        0.000        0.529       0.000       0.483    87     1    7.3  0.26  FR
all-α
T0096      0.874        0.712        0.036       0.757       0.045   239    22    2.7  0.07  FR/NF
T0097      0.846        0.779        0.000       0.663       0.000   105     7    2.4  0.08  FR/NF
T0098      0.731        0.647        0.109       0.714       0.000   121    36    0.1  0.07  NF
T0102      0.457        0.357        0.414       0.814       0.000    70     2    1.2  0.14  FR
T0106      0.728        0.320        0.128       0.408       0.056   128   123    2.3  0.06  FR/NF
T0124      0.868        0.979        0.000       0.855       0.000   242   429   -5.0  0.01  NF
T0125      0.803        0.679        0.066       0.723       0.015   141     5    0.0  0.03  CM

Q3 = secondary structure prediction accuracy; Fr(H) and Fr(E) = frequency of predicted and observed alpha and beta structures in the chain; Lp = protein length in residues; Nal = number of sequences in the alignment; Xd and A are as defined in equations 2 and 1, respectively; Class is the classification of targets by prediction difficulty: CM = comparative modeling, FR = fold recognition, NF = new fold.

COMMENTS

• The predictor is trained mainly on globular mixed proteins

• Contacts among beta structures dominate

• Contacts in all-alpha proteins are more difficult to predict

• A filtering algorithm is needed
