prediction of structural and functional features in proteins startin … neurali... · 2012. 9....

Prediction of structural and functional features in proteins starting from the residue sequence

INTRODUCTION TO NEURAL NETWORKS

MAPPING PROBLEMS: Secondary structure

Covalent structure: TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN

Secondary structure: EEEE..HHHHHHHHHHHH....HHHHHHHH.EEEE...........

3D structure: [Figure: 3D structure]

MAPPING PROBLEMS: Topology of transmembrane proteins

Topography: position of the transmembrane segments along the sequence

ALALMLCMLTYRHKELKLKLKK

[Figure: outer membrane proteins (β-barrel), e.g. porin (Rhodobacter capsulatus), and inner membrane proteins (α-helices), e.g. bacteriorhodopsin (Halobacterium salinarum)]

First generation methods: single residue statistics

Propensity scales

For each residue:

• The association between each residue and the different features is statistically evaluated

• Physical and chemical features of residues

A propensity value for any structure can be associated to any residue

HOW?

Secondary structure: Chou-Fasman propensity scale

Given a set of known structures, we can count how many times a residue is associated to a structure.

Example:
ALAKSLAKPSDTLAKSDFREKWEWLKLLKALACCKLSAAL
hhhhhhhhccccccccccccchhhhhhhhhhhhhhhhhhh

N(A,h) = 7, N(A,c) = 1, N = 40

P(A,h) = 7/40, P(A,c) = 1/40

Is that enough for estimating a propensity?

We also need to estimate how independent the residue-to-structure association is.

P(h) = 27/40, P(c) = 13/40

If the structure is independent of the residue: P(A,h) = P(A)P(h)

The ratio P(A,h) / (P(A)P(h)) is the propensity.

Given a LARGE set of examples, a propensity value can be computed for each residue and each structure type.
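The propensity ratio defined above can be checked on the 40-residue example. The following is only a minimal sketch: the function name `propensity` is illustrative, and a single short sequence is far too small a sample for a real scale.

```python
from collections import Counter

# Example sequence and structure from the slide (h = helix, c = coil).
seq = "ALAKSLAKPSDTLAKSDFREKWEWLKLLKALACCKLSAAL"
ss = "h" * 8 + "c" * 13 + "h" * 19  # 40 positions, same string as in the example

def propensity(seq, ss, residue, state):
    """Return P(residue, state) / (P(residue) * P(state)) from raw counts."""
    n = len(seq)
    joint = Counter(zip(seq, ss))          # N(residue, state)
    p_joint = joint[(residue, state)] / n  # P(A, h)
    p_res = seq.count(residue) / n         # P(A)
    p_state = ss.count(state) / n          # P(h)
    return p_joint / (p_res * p_state)

print(round(propensity(seq, ss, "A", "h"), 3))  # (7/40) / ((8/40) * (27/40))
print(round(propensity(seq, ss, "A", "c"), 3))  # (1/40) / ((8/40) * (13/40))
```

A value above 1 means the residue is found in that structure more often than expected under independence; a value below 1 means less often.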

Name             P(H)   P(E)
Alanine          1.42   0.83
Arginine         0.98   0.93
Aspartic Acid    1.01   0.54
Asparagine       0.67   0.89
Cysteine         0.70   1.19
Glutamic Acid    1.51   0.37
Glutamine        1.11   1.10
Glycine          0.57   0.75
Histidine        1.00   0.87
Isoleucine       1.08   1.60
Leucine          1.21   1.30
Lysine           1.14   0.74
Methionine       1.45   1.05
Phenylalanine    1.13   1.38
Proline          0.57   0.55
Serine           0.77   0.75
Threonine        0.83   1.19
Tryptophan       1.08   1.37
Tyrosine         0.69   1.47
Valine           1.06   1.70

Given a new sequence, a secondary structure prediction can be obtained by plotting the propensity values for each structure, residue by residue (values × 100):

         T    S    P    T    A    E    L    M    R    S    T    G
P(H)    83   77   57   83  142  151  121  145   98   77   83   57
P(E)   119   75   55  119   83   37  130  105   93   75  119   75

Considering three secondary structures (H, E, C), the overall accuracy, as evaluated on an uncorrelated set of sequences with known structure, is very low: Q3 = 50-60%.

http://www.expasy.ch/cgi-bin/protscale.pl

Transmembrane alpha-helices: Kyte-Doolittle scale

It is computed taking into consideration the octanol-water partition coefficient, combined with the propensity of the residues to be found in known transmembrane helices.

Ala:  1.800   Arg: -4.500   Asn: -3.500   Asp: -3.500
Cys:  2.500   Gln: -3.500   Glu: -3.500   Gly: -0.400
His: -3.200   Ile:  4.500   Leu:  3.800   Lys: -3.900
Met:  1.900   Phe:  2.800   Pro: -1.600   Ser: -0.800
Thr: -0.700   Trp: -0.900   Tyr: -1.300   Val:  4.200
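A scale like this is typically applied as a sliding-window average along the sequence. The sketch below assumes a 19-residue window, a common choice for transmembrane helices that is not stated on the slide; the function name `hydropathy_profile` is illustrative.

```python
# Kyte-Doolittle values from the table above, keyed by one-letter code.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(seq, window=19):
    """Average hydropathy over a centred window (window size is an assumption).

    Returns one value per position that has a full window around it.
    """
    half = window // 2
    profile = []
    for i in range(half, len(seq) - half):
        segment = seq[i - half:i + half + 1]
        profile.append(sum(KD[a] for a in segment) / window)
    return profile

# Toy usage with the peptide from the topography slide and a short window.
print([round(v, 2) for v in hydropathy_profile("ALALMLCMLTYRHKELKLKLKK", window=7)])
```

Stretches where the averaged value stays high are candidate transmembrane helices; any threshold would have to be calibrated separately.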

Second generation methods: GOR

The structure of a residue in a protein strongly depends on the sequence context.

It is possible to estimate the influence of a residue in determining the structure of a residue close along the sequence. Usually windows from -8/+8 to -13/+13 are considered.

Coefficients P(A,s,i) estimate the contribution of the residue A in determining the structure s for a residue that is i positions apart along the sequence.

Secondary structure: GOR method

Q3 = 65% (considering three secondary structures (H, E, C), and evaluating the overall accuracy on an uncorrelated set of sequences with known structure)

The contribution of each position in the window is independent of the other ones. No correlation among the positions in the window is taken into account.
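The independence assumption means that the score of a structure at one position is a sum of per-position contributions. A minimal sketch of that idea follows; it assumes additive scores (for example log-odds), and the coefficient table `coeff` is purely hypothetical, not the published GOR parameters.

```python
def gor_predict(seq, coeff, half_window=8, states="HEC"):
    """Predict one state per residue by summing independent window contributions.

    coeff[(residue, state, i)] is the (hypothetical) contribution of `residue`
    found i positions away from the residue being predicted; missing entries
    count as 0.
    """
    prediction = []
    for j in range(len(seq)):
        scores = {}
        for s in states:
            total = 0.0
            for i in range(-half_window, half_window + 1):
                k = j + i
                if 0 <= k < len(seq):
                    total += coeff.get((seq[k], s, i), 0.0)
            scores[s] = total
        # Ties are broken in favour of the first state listed in `states`.
        prediction.append(max(states, key=lambda s: scores[s]))
    return "".join(prediction)

# Toy coefficients: only the central position (i = 0) contributes.
toy = {("A", "H", 0): 1.0, ("G", "C", 0): 1.0}
print(gor_predict("AAGG", toy, half_window=1))
```

Because each term depends on a single window position, no pairwise correlation inside the window can be captured, which is the limitation noted above.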

A more efficient method: Neural Networks

Alternative computing algorithm: analogies with the computation in the nervous system.

1) The nervous system is constituted of elementary computing units: neurons
2) The electric signal flows in a determined direction (dendrites -> axon) (Principle of dynamic polarization)
3) There is no cytoplasmic continuity among the neurons. Each neuron specifically communicates with some neighboring neurons by means of synapses (Principle of connective specificity)

Tools out of machine learning approaches

Neural Networks can learn the mapping from sequence to secondary structure

[Figure: training on a database subset with known mapping (e.g. TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN -> EEEE..HHHHHHHHHHHH....HHHHHHHH.EEEE) yields general rules, which are then used to produce a prediction for a new sequence]

Neural network for secondary structure prediction

[Figure: feed-forward network with input, hidden and output layers. The input is a window sliding over the sequence M P I L K Q K P I H Y H P N H G E A K G; each residue is encoded as a 20-component binary vector (rows A C D E F G H I K L M N P Q R S T V W Y, a single 1 marking the residue type). The window K P I H Y H P N H is shown. The output is the structure class of the central residue (C in the example).]

Usually:
Input: 17-23 residues
Hidden neurons: 4-15

[Figure: a second example window listing residues with their structure labels, e.g. D (L), R (E), Q (E), G (E), F (E), V (E), P (E), A (H), Y (H), K (E), and the input alphabet ACDEFGHIKLMNPQRSTVWY.]
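The binary input encoding shown in the figure can be sketched as follows. The function name `one_hot_window` and the choice to encode positions beyond the chain ends as all-zero columns are assumptions made for illustration.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # row order used in the figure

def one_hot_window(seq, center, half_window):
    """Return one 20-component 0/1 column per window position.

    Positions falling outside the sequence are left as all zeros
    (an assumption; the slide does not say how chain ends are handled).
    """
    columns = []
    for k in range(center - half_window, center + half_window + 1):
        column = [0] * len(ALPHABET)
        if 0 <= k < len(seq) and seq[k] in ALPHABET:
            column[ALPHABET.index(seq[k])] = 1
        columns.append(column)
    return columns

seq = "MPILKQKPIHYHPNHGEAKG"
window = one_hot_window(seq, center=10, half_window=4)  # K P I H Y H P N H
h_row = [col[ALPHABET.index("H")] for col in window]
print(h_row)
```

With a window of 17 residues this encoding gives 17 × 20 = 340 binary inputs to the network.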

Third generation methods: evolutionary information

 1  Y K D Y H S - D K K K G E L - -
 2  Y R D Y Q T - D Q K K G D L - -
 3  Y R D Y Q S - D H K K G E L - -
 4  Y R D Y V S - D H K K G E L - -
 5  Y R D Y Q F - D Q K K G S L - -
 6  Y K D Y N T - H Q K K N E S - -
 7  Y R D Y Q T - D H K K A D L - -
 8  G Y G F G - - L I K N T E T T K
 9  T K G Y G F G L I K N T E T T K
10  T K G Y G F G L I K N T E T T K

Sequence profile (percentage of each residue type at each alignment position, gaps excluded):

Position
      1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16
A     0   0   0   0   0   0   0   0   0   0   0  10   0   0   0   0
C     0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
D     0   0  70   0   0   0   0  60   0   0   0   0  20   0   0   0
E     0   0   0   0   0   0   0   0   0   0   0   0  70   0   0   0
F     0   0   0  10   0  33   0   0   0   0   0   0   0   0   0   0
G    10   0  30   0  30   0 100   0   0   0   0  50   0   0   0   0
H     0   0   0   0  10   0   0  10  30   0   0   0   0   0   0   0
K     0  40   0   0   0   0   0   0  10 100  70   0   0   0   0 100
I     0   0   0   0   0   0   0   0  30   0   0   0   0   0   0   0
L     0   0   0   0   0   0   0  30   0   0   0   0   0  60   0   0
M     0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
N     0   0   0   0  10   0   0   0   0   0  30  10   0   0   0   0
P     0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
Q     0   0   0   0  40   0   0   0  30   0   0   0   0   0   0   0
R     0  50   0   0   0   0   0   0   0   0   0   0   0   0   0   0
S     0   0   0   0   0  33   0   0   0   0   0   0  10  10   0   0
T    20   0   0   0   0  33   0   0   0   0   0  30   0  30 100   0
V     0   0   0   0  10   0   0   0   0   0   0   0   0   0   0   0
W     0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
Y    70  10   0  90   0   0   0   0   0   0   0   0   0   0   0   0
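A sequence profile of this kind can be derived from the alignment by counting residues column by column. The sketch below assumes that gaps are excluded from the normalisation and that percentages are rounded to integers; the function name `sequence_profile` is illustrative.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

# The ten aligned sequences from the slide ('-' marks a gap).
MSA = [
    "YKDYHS-DKKKGEL--",
    "YRDYQT-DQKKGDL--",
    "YRDYQS-DHKKGEL--",
    "YRDYVS-DHKKGEL--",
    "YRDYQF-DQKKGSL--",
    "YKDYNT-HQKKNES--",
    "YRDYQT-DHKKADL--",
    "GYGFG--LIKNTETTK",
    "TKGYGFGLIKNTETTK",
    "TKGYGFGLIKNTETTK",
]

def sequence_profile(msa):
    """Return, for each alignment column, the percentage of each residue type.

    Gaps are ignored when normalising (an assumption made for this sketch).
    """
    profile = []
    for column in zip(*msa):
        residues = [r for r in column if r in ALPHABET]
        n = len(residues)
        profile.append(
            {a: (round(100 * residues.count(a) / n) if n else 0) for a in ALPHABET}
        )
    return profile

profile = sequence_profile(MSA)
print({a: v for a, v in profile[0].items() if v})  # non-zero entries of column 1
```

Each column of the profile replaces the single 1 of the binary encoding with a 20-valued composition vector, which is how evolutionary information enters the network input.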

The Network Architecture for Secondary Structure Prediction

The First Network (Sequence to Structure)

Output classes: H E C

CCHHEHHHHCHHCCEECCEEEEHHHCC

SeqNo No    V   L   I   M   F   W   Y   G   A   P   S   T   C   H   R   K   Q   E   N   D
1     1    80   0  20   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
2     2     0   0   0   0   0   0   0   0   0  20   0   0   0   0   0   0   0   0   0  80
3     3    50   0   0   0   0   0   0   0  33   0   0   0   0   0   0   0   0  17   0   0
4     4     0   0   0   0   0   0   0   0  13  63  13   0   0   0   0   0   0  13   0   0
5     5    13   0   0   0   0   0   0  13  75   0   0   0   0   0   0   0   0   0   0   0
6     6     0   0   0  13   0   0   0   0   0  13   0  13   0   0   0   0   0   0   0  63
7     7     0   0   0  38   0   0   0  38   0   0   0   0   0   0   0  25   0   0   0   0
8     8    25  13   0   0   0   0   0   0  50   0  13   0   0   0   0   0   0   0   0   0
9     9     0  13  13   0   0   0   0   0   0  25   0   0   0   0   0  50   0   0   0   0
10    10    0   0  25  13   0   0   0   0  13  13   0   0   0   0   0  38   0   0   0   0
11    11    0   0   0   0   0   0   0   0  25   0   0   0   0   0   0  13  13   0   0  50
12    12    0   0   0   0  43   0   0  29   0  29   0   0   0   0   0   0   0   0   0   0
13    13    0  14  29   0   0   0   0   0  29   0   0   0   0   0   0   0   0  14   0  14
14    14    0   0   0   0   0   0   0  43  29   0   0   0   0   0   0  29   0   0   0   0

The Network Architecture for Secondary Structure Prediction

The Second Network (Structure to Structure)

Output classes: H E C

CCHHEHHHHCHHCCEECCEEEEHHHCC

[Figure: the same sequence profile (SeqNo 1-14, residue types V L I M F W Y G A P S T C H R K Q E N D) shown as input to the first network, whose output feeds the second network]

The Performance on the Task of Secondary Structure Prediction

The cross validation procedure

[Figure: the protein set is partitioned into a testing set (Testing set 1) and the complementary training set (Training set 1)]

Efficiency of the Neural Network-Based Predictors on the 822 Proteins of the Testing Set

Input                          Q3 (%)  SOV   Q[H]  Q[E]  Q[C]  P[H]  P[E]  P[C]  C[H]  C[E]  C[C]
Single sequence                66.3    0.62  0.69  0.61  0.66  0.70  0.54  0.71  0.54  0.44  0.45
Multiple sequence (MaxHom)     72.4    0.69  0.75  0.65  0.75  0.77  0.64  0.73  0.64  0.54  0.53
Multiple sequence (PSI-BLAST)  73.4    0.70  0.75  0.70  0.73  0.80  0.63  0.75  0.67  0.56  0.53

Combining different networks: Q3 = 76-78%
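The indices in this table can be computed from an observed and a predicted structure string. The sketch below uses the usual definitions, which are assumed here because the slide does not spell them out: Q3 is the percentage of correctly predicted residues, Q[s] the fraction of observed residues of class s that are predicted as s, P[s] the fraction of predictions of class s that are correct, and C[s] the Matthews correlation coefficient. SOV is not covered.

```python
import math

def q3(observed, predicted):
    """Percentage of residues whose predicted class matches the observed one."""
    hits = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * hits / len(observed)

def class_scores(observed, predicted, cls):
    """Return (Q, P, C) for one class, using the assumed standard definitions."""
    tp = fp = fn = tn = 0
    for o, p in zip(observed, predicted):
        if o == cls and p == cls:
            tp += 1
        elif o != cls and p == cls:
            fp += 1
        elif o == cls and p != cls:
            fn += 1
        else:
            tn += 1
    q = tp / (tp + fn) if tp + fn else 0.0   # coverage of the observed class
    pc = tp / (tp + fp) if tp + fp else 0.0  # reliability of the prediction
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    c = (tp * tn - fp * fn) / denom if denom else 0.0
    return q, pc, c

# Toy strings, not taken from the slides.
obs, pred = "HHHEECCC", "HHEEECCH"
print(q3(obs, pred), class_scores(obs, pred, "H"))
```

In a cross-validation setting these indices would be computed only on the testing set of each split and then averaged.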

Secondary Structure Prediction

From sequence:
TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN

To secondary structure:
EEEE..HHHHHHHHHHHH....HHHHHHHH.EEEE...........

And to the reliability of the prediction:
7997688899999988776886778999887679956889999999

SERVERS

PredictProtein, Burkhard Rost (Columbia Univ.): http://cubic.bioc.columbia.edu/predictprotein/

PsiPRED, David Jones (UCL): http://bioinf.cs.ucl.ac.uk/psipred/

JPred, Geoff Barton (Dundee Univ.)

SecPRED: http://www.biocomp.unibo.it

Chameleon sequences

QEALEIA

Translation Initiation Factor 3 (Bacillus stearothermophilus), 1TIF: ……GIKSKQEALEIAARRN……

Transcription Factor 1 (Bacteriophage SPO1), 1WTUA: ……FNPQTQEALEIAPSVGV……

We extract, from a set of 822 non-homologous proteins (174,192 residues):

2,452 5-mer chameleons
107 6-mer chameleons
16 7-mer chameleons
1 8-mer chameleon

2,576 couples in total

Chameleons cover 26,044 residues in 755 protein chains (~15% of all residues).
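A search for chameleons can be sketched as a scan for identical k-mers that occur in different chains with different secondary structure. The criterion used below (the two structure strings are simply not identical) is an assumption, and the structure strings attached to the two fragments are illustrative placeholders, not the real assignments.

```python
from collections import defaultdict

def find_chameleons(chains, k):
    """Return k-mers found in at least two chains with non-identical structure.

    chains maps a chain id to (sequence, structure); both strings must have
    the same length.
    """
    seen = defaultdict(set)  # k-mer -> {(chain id, structure of that k-mer)}
    for chain_id, (seq, ss) in chains.items():
        for i in range(len(seq) - k + 1):
            seen[seq[i:i + k]].add((chain_id, ss[i:i + k]))
    result = {}
    for kmer, occurrences in seen.items():
        chain_ids = {cid for cid, _ in occurrences}
        structures = {s for _, s in occurrences}
        if len(chain_ids) > 1 and len(structures) > 1:
            result[kmer] = sorted(occurrences)
    return result

# Fragments from the slide; the structure strings are hypothetical.
chains = {
    "1TIF": ("GIKSKQEALEIAARRN", "C" * 5 + "H" * 7 + "C" * 4),
    "1WTUA": ("FNPQTQEALEIAPSVGV", "C" * 17),
}
print(find_chameleons(chains, k=7))
```

On a real data set the scan would be run for each k from 5 upwards over all pairs of non-homologous chains.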

Prediction of the Secondary Structure of Chameleon sequences with Neural Networks

QEALEIA                          QEALEIA
HHHHHHH                          CCCCCCC

NGDQLGIKSKQEALEIAARRNLDLVLVAP    ARKGFNPQTQEALEIAPSVGVSVKPG

The Prediction of Chameleons with Neural Networks

Method                               Performance on the protein data set   Performance on chameleon sequences
NN with MSA input                    73.4%                                 75.1%
NN with single-sequence (SS) input   66.3%                                 58.9%
GOR IV                               64.4%                                 55.2%

Other neural network-based predictors

• Secondary structure

• Topology of transmembrane proteins

• Cysteine bonding state

• Contact maps of proteins

• Interaction sites on protein surface

Prediction of the cysteine bonding state

Tryparedoxin-I from Crithidia fasciculata (1QK8)
MSGLDKYLPGIEKLRRGDGEVEVKSLAGKLVFFYFSASWCPPCRGFTPQLIEFYDKFHES
KNFEVVFCTWDEEEDGFAGYFAKMPWLAVPFAQSEAVQKLSKHFNVESIPTLIGVDADSG
DVVTTRARATLVKDPEGEQFPWKDAP

Free cysteines: Cys68

Disulphide bonded cysteines: Cys40, Cys43

A neural network-based method for predicting the disulfide connectivity in proteins

The Protein Folding

TTCCPSIVARSNFNVCRLPGTPEALCATYTGCIIIPGATCPGDYAN

RPDFCLEPPYTGPCKARIIRYFYNAKAGLCQTFVYGGCRAKRNNFKSAEDCMRTCGGA

Disulfide bonds

[Figure: two cysteine side chains joined by an S-S bridge]

2 -SH -> -SS- + 2H+ + 2e-

S-S distance: 2.2 Å

Torsion angle C-S-S-C: 90°

Bond energy: 3 kcal/mol

Intra-chain disulfide bonds in proteins

Of 1259 proteins (a non-redundant PDB subset):

• 23% of the chains have disulfide bonds (SS)

• SS distribution (between secondary structures):

%     H    E    C
H     7    9   14
E         17   27
C              26

Intra-chain disulfide bonds in proteins

Distribution of disulfide bonds in the SCOP domains

• 99% of the disulfide bonds are intra-domain

• Distribution:

Type             %
All-α           13
All-β           31
α/β             11
α+β             13
Small domains   29
Others           3

Problem no. 1:

Starting from the protein sequence, can we discriminate whether a cysteine residue is disulfide-bonded?

Prediction of the disulfide-bonding state of cysteines in proteins

Perceptron (input: sequence profile)

Output: bonded / non bonded

NGDQLGIKSKQEALCIAARRNLDLVLVAP

Plotting the trained weights

[Figure: Hinton's plots of the trained weights for the bonding state and for the non-bonding state. Columns: residue types V L I M F W Y G A P S T C H R K Q E N D 0 & #; rows: window positions from -5 to 5.]

Is it possible to add a syntax?

[Figure: a four-state model between Begin and End. In the example paths, states 1 and 3 label free cysteines and states 2 and 4 label bonded cysteines.]

A path

Residue   State   Bonding state
C40       1       F
C43       2       B
C68       4       B

$P(\mathrm{seq}) = P(1 \mid \mathrm{Begin})\, P(\mathrm{C40} \mid 1)\, P(2 \mid 1)\, P(\mathrm{C43} \mid 2)\, P(4 \mid 2)\, P(\mathrm{C68} \mid 4)\, P(\mathrm{End} \mid 4)$

4 possible paths

Path 1
Residue   State   Bonding state
C40       1       F
C43       2       B
C68       4       B

Path 2
Residue   State   Bonding state
C40       2       B
C43       3       F
C68       4       B

Path 3
Residue   State   Bonding state
C40       1       F
C43       1       F
C68       1       F

Path 4
Residue   State   Bonding state
C40       2       B
C43       4       B
C68       1       F

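Hybrid system

MYSFPNSFRFGWSQAGFQCEMSTPGSEDPNTDWYKWVHDPENMAAGLCSGDLPENGPGYWGNYKTFHDNAQKMCLKIARLNVEWSRIFPNP...

[Figure: a neural network reads the windows W1, W2 and W3 centred on the cysteines and outputs P(B|W1), P(F|W1); P(B|W2), P(F|W2); P(B|W3), P(F|W3). These values feed the Free Cys and Bonded Cys states of the HMM between Begin and End; the Viterbi path gives the prediction of the bonding state of the cysteines.]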

Prediction for Tryparedoxin

Residue   NN output B   NN output F   NN pred   HMM Viterbi path   HMM pred
C40       99            1             B         2                  B
C43       82            18            B         4                  B
C68       61            39            B         1                  F
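The correction of the network output by the grammar can be reproduced with a small Viterbi search. The sketch below is built on assumptions: the allowed transitions are read off the four example paths (Begin to 1 or 2; 1 to 1 or 2; 2 to 3 or 4; 3 to 3 or 4; 4 to 1 or 2; End reachable from 1 and 4), all allowed transitions are given equal weight, and the network outputs are used directly as emission scores. It is not the published hidden neural network.

```python
import math

# Assumed grammar, inferred from the example paths (not taken from a paper).
LABEL = {1: "F", 2: "B", 3: "F", 4: "B"}
NEXT = {"Begin": [1, 2], 1: [1, 2], 2: [3, 4], 3: [3, 4], 4: [1, 2]}
FINAL = {1, 4}  # states from which End can be reached

def viterbi(nn_outputs):
    """nn_outputs: one (P(B|W), P(F|W)) pair per cysteine, in sequence order.

    Returns (state path, bonding labels) of the best path allowed by the grammar.
    """
    best = {"Begin": (0.0, [])}  # state -> (log score, path so far)
    for p_bonded, p_free in nn_outputs:
        emit = {"B": p_bonded, "F": p_free}
        new_best = {}
        for prev, (score, path) in best.items():
            for state in NEXT[prev]:
                p = emit[LABEL[state]]
                if p <= 0:
                    continue
                candidate = score + math.log(p)
                if state not in new_best or candidate > new_best[state][0]:
                    new_best[state] = (candidate, path + [state])
        best = new_best
    finals = [entry for state, entry in best.items() if state in FINAL]
    _, path = max(finals)
    return path, [LABEL[s] for s in path]

# Network outputs for Tryparedoxin from the table above, rescaled to [0, 1].
print(viterbi([(0.99, 0.01), (0.82, 0.18), (0.61, 0.39)]))
```

The network alone labels all three cysteines as bonded, which is impossible for intra-chain bridges; the grammar only admits an even number of bonded cysteines, so the least confident call (C68) is switched to free.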

Performance

Neural Network

Set   Q2     C      Q(B)   Q(F)   P(B)   P(F)   Q2prot
WD    80.4   0.56   67.2   87.5   74.3   83.2   56.9
RD    80.1   0.56   67.2   87.6   75.7   82.2   49.7

Hybrid system

Set   Q2     C      Q(B)   Q(F)   P(B)   P(F)   Q2prot
WD    88.0   0.73   78.1   93.3   86.3   88.8   84.0
RD    87.4   0.73   78.1   92.8   86.3   88.0   80.2

B = cysteine bonding state, F = cysteine free state. WD = whole database (969 proteins, 4136 cysteines). RD = reduced database, in which the chains containing only one cysteine are removed (782 proteins, 3949 cysteines).

Martelli PL, Fariselli P, Malaguti L, Casadio R. Prediction of the disulfide bonding state of cysteines in proteins with hidden neural networks. Protein Eng. 15:951-953 (2002)

Problem no. 2:

When the bonding state of cysteines is known, can we predict the connectivity pattern of disulfide bonds?

Prediction of the connectivity of disulfide bonds in proteins

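Prediction of disulfide connectivity in proteins

Bovine trypsin inhibitor (6PTI)

[Figure: 3D structure with the N and C termini and the cysteines 5, 14, 30, 38, 51 and 55 labelled; below, the connectivity pattern drawn over the cysteine positions 5 14 30 38 51 55 along the sequence]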

Prediction of disulfide connectivity in proteins as a problem of maximum-weight perfect matching

Representation:

[Figure: the protein sequence from N to C with the cysteines Cys1-Cys4 as vertices, joined by edges with weights W12, W13, W14, W23, W24 and W34]

An undirected weighted graph with V = 2B vertices (the number of cysteines) and E = 2B(2B-1)/2 undirected edges (W is the strength of the interaction).

From graph theory:

• It is not necessary to compute all the possible connectivity patterns ($\prod_{i \le B} (2i-1)$)

• Given a complete graph G = (2B, E), the matching with the maximum weight can be computed in $O(B^3)$ time with the Edmonds-Gabow algorithm*

* Gabow, H.N. (1975). Technical Report CU-CS-075-75, Dept. of Comp. Sci., Colorado University
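For a small number of bonds the connectivity patterns can simply be enumerated, which makes the count above concrete. The sketch below is an exhaustive search, not the Edmonds-Gabow algorithm, and the edge weights are toy values chosen to favour the pairing 5-55, 14-38, 30-51.

```python
from itertools import combinations

def perfect_matchings(nodes):
    """Yield every way of pairing up all the nodes (list of even length)."""
    if not nodes:
        yield []
        return
    first, rest = nodes[0], nodes[1:]
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        for matching in perfect_matchings(remaining):
            yield [(first, partner)] + matching

def best_connectivity(cysteines, weight):
    """Return the perfect matching with the largest total edge weight."""
    return max(
        perfect_matchings(list(cysteines)),
        key=lambda m: sum(weight[frozenset(pair)] for pair in m),
    )

# Cysteine positions of the 6PTI example; the weights are hypothetical.
cysteines = [5, 14, 30, 38, 51, 55]
favoured = {frozenset(p) for p in [(5, 55), (14, 38), (30, 51)]}
weight = {
    frozenset(p): (0.9 if frozenset(p) in favoured else 0.1)
    for p in combinations(cysteines, 2)
}

print(len(list(perfect_matchings(cysteines))))  # number of patterns for B = 3
print(best_connectivity(cysteines, weight))
```

The number of patterns grows as 1, 3, 15, 105, 945 for B = 1 to 5, so the chance of guessing the right pattern at random is 1/3 for two bonds and about 0.001 for five, which is why a polynomial-time matching algorithm matters for larger B.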

How to assign the costs (W) of the edges in the graph

[Figure: the graph of cysteines Cys1-Cys4 along the sequence from N to C, with edge weights W12, W13, W14, W23, W24 and W34]

Assumption: for each cysteine, all its sequence nearest neighbours make contacts

[Figure: all possible interactions, using 1 nearest neighbour, between the neighbours (Ni) of Cys i and the neighbours (Nj) of Cys j]

Frequency distribution of disulfide bonds with respect to sequence separation (726 proteins)

[Figure: histogram. x axis: sequence separation (0-450); y axis: frequency (%), 0-16]

Neural Networks for predicting the edge values

Output (1 node): disulfide pair propensity (output = wij)

Hidden nodes (6 nodes)

Input (212 nodes): each pair in the neighbours of 4 residues + sequence separation + no. of SS bonds (210 + 2 input nodes)

Accuracy (Qp) of EG vs NN

Chains   B   Random   EG     NN
158      2   0.333    0.46   0.68
153      3   0.067    0.17   0.21
103      4   0.009    0.11   0.20
44       5   0.001    0.00   0.02

The state of the art:

• Prediction of bonding states is quite satisfactory

• Prediction of connectivity needs to be improved

Prediction of Foldons

Piero Fariselli

The Folding Problem as a Mapping Problem

Covalent structure: TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN

Secondary structure: EEEE..HHHHHHHHHHHH....HHHHHHHH.EEEE...........

3D structure: [Figure: 3D structure]

From the PDB database we can collect some 1500 chains of known structure, from which to derive non-redundant information relating sequence to:

• secondary structure

• structural and functional motifs

• 3D structure

Evolutionary information

• Multiple Sequence Alignment (MSA) of similar sequences

• Sequence profile: for each position, a 20-valued vector contains the amino acid composition of the aligned sequences.

[Figure: a multiple sequence alignment of 10 sequences and the sequence profile derived from it (residue type × sequence position)]

Prediction of Initiation Sites of Protein Folding

The Folding Process

[Figure: the unfolded chain -> the early stages of folding: initiation sites -> folded protein]

Frustration in proteins

• The simultaneous minimisation of all the interaction energies is impossible

The network architecture

[Figure: an input window sliding over the sequence ..ALS.......QGFLLIARQPPFTYFTV......HW.. feeds a hidden layer; the output layer has two units, helix (H) and non-helix]

The prediction efficiency of the network

Q2 = 0.85   Q(H) = 0.67    Q(nonH) = 0.93    Sov(pred) = 0.85
C = 0.63    Pc(H) = 0.80   Pc(nonH) = 0.86   Sov(obs) = 0.76

Theoretical background

The conformation of residue R depends both on local (window W) and non-local (context C) interactions.

[Figure: residue R at the centre of window W, surrounded by context C; the neural network reads W and produces the outputs $O_{\alpha}$ and $O_{\text{non-}\alpha}$]

The convergence theorem ensures that:

$O_i = \text{Probability}(\text{Structure}_R = i \mid W)$

If, for any i, $O_i \approx 1$, then the structure of residue R depends mainly on W and only slightly on C.

• Anfinsen's hypothesis:

$P(i \mid W, C) = \delta(i, \mathrm{nat}(W, C))$

• Averaging over all the contexts (performed by the NN):

$P(i \mid W) = \sum_{C} P(i \mid W, C)\, P(C)$

• When the pattern is self-stabilising (W dependent):

$P(i \mid W, C) = P(i \mid W)$

• Then Anfinsen's hypothesis can be cast in a local form:

$P(i \mid W) = \sum_{C} P(i \mid W, C)\, P(C) = \delta(i, \mathrm{nat}(W))$

Relationship between the reliability index and the Shannon entropy

[Figure: reliability index (0-1) plotted against the entropy S5 (0-0.7)]

[Figure: the input window MAS..... QLMLKDFLNRTPL.........GHI feeds the network, which produces the outputs $O_{\alpha}$ and $O_{\text{non-}\alpha}$]

$S = -\sum_i O_i \log O_i$
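The entropy of the network outputs can be computed per residue and then smoothed along the chain. In the sketch below the outputs are normalised to sum to 1 before taking the entropy, and S5 is taken to be a centred moving average over 5 residues; both choices are assumptions, since the slides do not define the smoothing.

```python
import math

def shannon_entropy(outputs):
    """S = -sum_i O_i ln O_i, after normalising the outputs to sum to 1."""
    total = sum(outputs)
    probs = [o / total for o in outputs]
    return -sum(p * math.log(p) for p in probs if p > 0)

def smoothed(values, window=5):
    """Centred moving average; windows are truncated at the chain ends."""
    half = window // 2
    result = []
    for i in range(len(values)):
        segment = values[max(0, i - half):i + half + 1]
        result.append(sum(segment) / len(segment))
    return result

# Toy network outputs (helix, non-helix) for five consecutive residues.
outputs = [(0.5, 0.5), (0.9, 0.1), (0.99, 0.01), (0.95, 0.05), (0.6, 0.4)]
entropies = [shannon_entropy(o) for o in outputs]
print([round(s, 3) for s in smoothed(entropies)])
```

For two outputs the entropy ranges from 0 (a fully confident prediction) to ln 2, about 0.69, which matches the range of the entropy axis in the plots.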

Protein segments correctly predicted in α-helical structure

[Figure: NC / NT (%), 0-100, as a function of entropy (1-9) and segment length (1-29)]

Entropy = Shannon entropy in (ln 2)/10 units ($S = -\sum_i o_i \ln o_i$)
NC = number of protein segments correctly predicted in α-helix
NT = total number of protein segments predicted in α-helix

Profile of the smoothed entropy (S5) for the hen egg lysozyme (132L)

[Figure: S5 (0-0.7) along the protein chain (residues 1-121), showing the entropy, the predicted helices and the extracted fragments]

Hen egg lysozyme (132L)

[Figure: 3D structure with the N-terminus and C-terminus labelled]

Frequency distribution of predicted helical segments as a function of their entropy value

[Figure: frequency (-0.1 to 0.2) of correct and wrong predictions, and their difference, against the entropy S5 (0-0.7); the threshold value is marked]

An example of the data base of minimally frustrated protein fragments

http://www.biocomp.unibo.it/DB/

CODE    ENTROPY   POSITIONS   SEQUENCE          DSSP SECONDARY STRUCTURE
1msk_   0.002     192-206     ADRLAEAFAEYLHER   HHHHHHHHHHHHHHH
1pyda   0.004     307-319     MKFVLQKLLTNIA     HHHHHHHHHHHHH
1ngr_   0.005     63-72       LDALLAALRR        HHHHHHHHHH
1sly_   0.005     338-346     AKEILHQLM         HHHHHHHH!
1aerb   0.006     20-28       VERLLQAHR         HHHHHHHHH
1bcn_   0.006     113-123     LENFLERLKTI       HHHHHHHHHHH
1bib_   0.006     215-226     LAAMLIRELRAA      HHHHHHHHHHHH
1fkx_   0.006     337-346     KKELLERLYR        HHHHHHHHHH
2arcb   0.006     148-158     NLLEQLLLRRM       HHHHHHHHHHH
1aqt_   0.008     112-125     DYAQASAELAKAIA    HHHHHHHHHHHHHH
1fit_   0.008     111-120     EEEXAAEAAA        HHHHHHHHHH
1mtyg   0.009     22-30       LEKAAEMLK         HHHHHHHHH
2tct_   0.009     50-60       LLDALAVEILA       HHHHHHHHHHH
1hsba   0.010     150-157     AHVAEQWR          !!HHHHHH
2chsa   0.010     17-26       EEILQKTKQL        HHHHHHHHHH
1hjp_   0.011     175-184     ETLIREALRA        HHHHHHHHH!
1pou_   0.011     5-13        LEQFAKTFK         HHHHHHHHH
...

Training set from PDB

Number of proteins   Number of amino acids   Number of α-helices   Average length
822                  174191                  4783                  11.6

Data base of minimally frustrated α-helical segments

Number of proteins   Number of amino acids   Number of α-helical segments   Average length
626                  21553                   3000                           7.2

Comparison of minimally frustrated segments with putative folding initiation sites experimentally determined

PDB code   Entropy (S5)   Position in the protein chain   Extracted sequence   Reference
132l       0.109          8-14                            LAAAMKR              Radford & al. (1992)
132l       0.212          89-95                           TASVNCA              Radford & al. (1992)
1hfx       0.186          86-91                           TDDIMC               Chyan & al. (1993)
1hfx       0.221          7-13                            ALSHELN              *
1hrc       0.156          92-99                           EDLIAYLK             Jeng & al. (1990)
2mm1       0.050          127-132                         AQGAMN               Hugson & al. (1990)
2mm1       0.104          139-146                         RKDMASNY             Hugson & al. (1990)
2mm1       0.154          105-111                         EFISEAI              Hugson & al. (1990)
7rsa       0.409          8-11                            FERQ                 Udgaonkar & Baldwin (1990)
1ubq       0.322          25-28                           NVKA                 Briggs & Roder (1992)
1gf1       0.311          10-16                           LVDALQF              Hua & al. (1996)
2ci2       0.236          14-19                           VEEAKK               Fersht (1995)

* Not yet experimentally detected

Comparison of minimally frustrated segments with peptides extracted from proteins

Code*    Peptide*                      % Helix in solution*   Entropy (S5)   Extracted segment
3FXC     TYKVTELINEAEGINETIDCDD        1                      ####           ####
3LZM     GFTNSLRMLQQKRWDEAVNLAKS       10                     0.262          WDEAVNL
"                                      10                     0.329          LRMLQQK
3LZM-2   GVAGFTNSLRMLQQKRWDEAAVNLAKS   12                     0.203          SLRMLQ
"                                      12                     0.210          DEAAVNL
CIII     ESLLERITRKLRDGWKRLIDIL        8                      0.171          LLERIT
"                                      8                      0.260          WKRLID
CIII-L   ESLLERITRKL                   15                     0.171          LLERIT
CIII-R   RDGWKRLIDIL                   4                      0.260          WKRLID
CIII-M   RITRKLRDGWK                   2                      ####           ####
Sigma    KVATTKAQRKLFFNLRKTKQRL        9                      0.218          TKAQRK
COMA1    DHPAVMEGTKTILETDSNLS          4                      ####           ####
COMA2    EPSEQFIKQHDFSSY               3                      ####           ####
COMA3    VNGMELSKQILQENPH              6                      0.189          LSKQILQ
COMA4    EVEDYFEEAIRAGLH               20                     0.020          YFEEAIR
COMA5    KEKITQYIYHVLNGEIL             3                      ####           ####
ARA1     AVGKSNLLSRYARNEFSA            2                      ####           ####
ARA2     RFRAVTSAYYRGAVG               3                      ####           ####
ARA3     TRRTTFESVGRWLDELKIHSD         7.5                    0.194          SVGRWL
ARA4     AVSVEEGKALAEEEGLF             4                      ####           ####
ARA5     STNVKTAFEMVILDIYNNV           3                      ####           ####
G1       DTYKLILNGKTLKGETTTEA          2                      ####           ####
G2       GDAATAEKVFKKIANDNGVD          4                      ####           ####
G3       GEWTYDDATKTFTVTE              2                      ####           ####

* Muñoz and Serrano, 1994.

Minimally frustrated α-helical segments are useful for determining:

• Folding initiation sites

• α-helix stability

• de novo design of α-helices

Structure prediction of membrane proteins

Outer membrane proteins (all β-transmembrane proteins)

Inner membrane proteins (all α-transmembrane proteins)

[Figure: outer membrane, β-barrel: porin (Rhodobacter capsulatus); inner membrane, α-helices: bacteriorhodopsin (Halobacterium salinarum)]

Predictors of the Topology of Membrane Proteins

Topography: position of the transmembrane segments along the sequence

Topology: position of the N and C termini with respect to the bilayer

[Figure: a membrane protein crossing the lipidic bilayer between the outer (Out) and inner (In) sides, with the N and C termini and the positive charges (+) marked]

Prediction of transmembrane segments

Neural network for the prediction of TMS in β-barrel membrane proteins (Jacoboni et al., 2001)

Output: 2 neurons (TM, nonTM)
Hidden layer: 5 neurons
Input window: 9 residues

[Figure: the input is a 9-column slice of the sequence profile, one 20-valued column per window position]

A generic model for membrane proteins (TMHMM)

[Figure: HMM with Begin and End states and three groups of states: inner side, transmembrane, outer side]

Sequence-profile-based HMM

Standard HMM input: a sequence of characters $c_t$, e.g. ..A C L P R P E T ...

Profile-based HMM input: a sequence of A-dimensional vectors $s_t$ (for proteins, A = 20)

[Figure: each sequence position t is replaced by a column of profile values]

Constraints:

$0 \le s_t(n) \le S \quad \forall\, t, n$

$\sum_{k=1}^{A} s_t(k) = S \quad \forall\, t$

with S = 100.

Martelli et al., Bioinformatics 18, S46-53, 2002

The new algorithms make it possible:

• to feed HMMs with sequence profiles

• to eventually couple NNs and HMMs (Hidden Neural Networks)

Advantages:

• Higher performance than standard HMMs

• Increased discrimination capability of a given class

Martelli et al., Bioinformatics, 2002
Martelli et al., Protein Eng., 2002

Prediction of the Topology of α-Transmembrane Proteins

Topography: position of transmembrane helices along the sequence

The prediction accuracy of topography is 92 %

Topology: position of the N and C termini with respect to the bilayer

The prediction accuracy of topology is 81 %

[Figure: helical membrane protein in the bilayer, with the Out and In sides, the N and C termini, and + marks along the chain]

Prediction of the Topology of β-Transmembrane Proteins

Topography: position of transmembrane strands along the sequence

The prediction accuracy of topography is 73 %

Topology: position of the N and C termini with respect to the bilayer

The prediction accuracy of topology is 73 %

[Figure: β-barrel membrane protein in the bilayer, with the LPS (Out) and Periplasmic (In) sides, the N and C termini, and + marks along the chain]

The discriminative capability of the HMM model

[Figure: percentage of proteins (0 to 100) plotted against I(s | M), x-axis from 2.75 to 2.95, with separate series for outer membrane, globular and inner membrane proteins]

I(s | M) = -1/L log P(s | M)
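The score on the x-axis is the log-probability of the sequence under the model, normalized by sequence length so that proteins of different length can be compared. A minimal sketch follows; it assumes natural logarithms and a user-chosen threshold, neither of which is stated on the slide, and the function names are hypothetical.

```python
import math

def normalized_score(log_prob, length):
    """I(s | M) = -(1/L) * log P(s | M), given log P(s | M) and the sequence length L."""
    if length <= 0:
        raise ValueError("sequence length must be positive")
    return -log_prob / length

def accepted_by_model(log_prob, length, threshold):
    """Lower I(s | M) means the model explains the sequence better.
    The threshold is a free parameter to be chosen on annotated data."""
    return normalized_score(log_prob, length) <= threshold

# Reference point: a model emitting the 20 residues uniformly and independently
# gives I = log(20), about 2.9957, whatever the sequence length.
L = 250
log_p_uniform = L * math.log(1 / 20)
print(normalized_score(log_p_uniform, L))
```

A sequence that fits the model better than this uniform baseline scores below log(20), which is why a threshold on I(s | M) can be used to separate the class the model was trained on from other proteins.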

An application: modeling the 3D structure of eukaryotic β-barrel proteins

3D structure prediction of proteins

[Figure: prediction methods arranged along a homology axis (0 to 100 %): ab initio prediction for new folds; threading/fold recognition and building by homology for existing folds; membrane proteins marked on the diagram]

Structural alignment of VDAC with the template

2omf_.seq/ AEIYNKDGNK VDLYGKAVGL HYFSKGNGEN SYGGNGDMTY ARLGFKGETQ
2omf_.str/ CCCCCCCCEE EEEEEEEEEE EEECCCCCCC CCCCCCCCCE EEEEEEEEEE
protx.str/ *******CCC CCCCEEEEEE EEEC****** ********CE EEEEEEEECC
protx.seq/ *******KGY NFGLWKLDLK TKTS****** ********SG IEFNTAGHSN

2omf_.seq/ I*NSDLTGYG QWEYNFQGNN SEGADAQTGN KTRLAFAGLK YADVGSFDYG
2omf_.str/ C*CCCEEEEE EEEEEEECCC CCCCCCCCCC EEEEEEEEEE ECCCEEEEEE
protx.str/ CCCCCEEEEE EEEEEEC*** ********** EEEEEEEEEC CCCCCEEEEE
protx.seq/ QESGKVFGSL ETKYKVK*** ********** DYGLTLTEKW NTDNTLFTEV

2omf_.seq/ RNYGVVYDAL GYTDMLPEFG GDTAYSDDFF VGRVGGVATY RNSNFFGLVD
2omf_.str/ ECCCCCCCCC CCCCCCCCCC CCCCCCCCCC CCCCCCEEEE EECCCCCCCC
protx.str/ EEEECC**** ********** ********** **CCEEEEEE EEECCCCCCC
protx.seq/ AVQDQL**** ********** ********** **LEGLKLSL EGNFAPQSGN

2omf_.seq/ GLNFAVQYLG KNER****** *********D TARRSNGDGV GGSISYEYE*
2omf_.str/ CEEEEEEEEC CCCC****** *********C CCCCCCCCEE EEEEEEEEC*
protx.str/ EEEEEEEEEE EEEECCCCCC CCCCCCCEEE EEEEEEEEEE EEEEEEECCC
protx.seq/ KNGKFKVAYG HENVKADSDV NIDLKGPLIN ASAVLGYQGW LAGYQTAFDT

2omf_.seq/ **GFGIVGAY GAADRTNLQE AQPLGNGKKA EQWATGLKYD ANNIYLAANY
2omf_.str/ **CEEEEEEE EEEECCCCCC CCCCCCCCEE EEEEEEEEEE ECCEEEEEEE
protx.str/ CCEEEEEEEE EEEEEEEEEE EEECCCCCCC EEEEEEEEEE CEEEEEEEEE
protx.seq/ QQSKLTTNNF ALGYTTKDFV LHTAVNDGQE FSGSIFQRTS DKLDVGVQLS

2omf_.seq/ GETRNATPIT NKFTNTSGFA NKTQDVLLVA QYQFDFGLRP SIAYTKSKAK
2omf_.str/ EEEECCCCCC CCCCCCCCCC CEEEEEEEEE EEECCCCEEE EEEEEEEEEE
protx.str/ EEECC***** ********** *CCCEEEEEE EEECCCCEEE EEEEEEC***
protx.seq/ WASGT***** ********** *SNTKFAIGA KYQLDDDARV RAKVNNA***

2omf_.seq/ DVEGIGDVDL VNYFEVGATY YFNKNMSTYV DYIINQIDSD NKLGVGSDDT
2omf_.str/ CCCCCCCEEE EEEEEEEEEE ECCCCEEEEE EEEEECCCCC CCCCCCCCCE
protx.str/ *********E EEEEEEEEEE EC***EEEEE EEEEECCC** *****CCCCE
protx.seq/ *********S QVGLGYQQKL RT***GVTLT LSTLVDGK** *****NFNAG

2omf_.seq/ VAVGIVYQF* ***
2omf_.str/ EEEEEEEEE* ***
protx.str/ EEEEEEEEEE EC*
protx.seq/ GHKIGVGLEL EA*

A low resolution 3D model of VDAC (the sequence from Neurospora crassa)

A low resolution 3D model of VDAC: location of mutated residues

Casadio et al., FEBS Lett 520:1-7 (2002)

Predictors of membrane protein structures can be used to filter genomes and find new membrane proteins without sequence homologues

FISHING NEW OUTER MEMBRANE PROTEINS IN GRAM-NEGATIVE BACTERIA

Proteins have intrinsic signals that govern their transport and localization in the cell: a hydrophobic secretion marker (or signal peptide)

Signal peptides in protein sequences:

MRAKLLGIVLTTPIAISSFASTETLSFTPDNINADISLGTLSGKTKERVYLAEEGGRKVSQLDWKFNNAAIIKGAINWDLMPQISIGAAGWTTLGSRGGNMVDQDWMDSSNPGTWTDESRHPDTQLNYANEFDLNIKGWLLNEPNYRLGLMAGYQESRYSFTARGGSYIYSSEEGFRDDIGSFPNGER

AIGYKQRFKMPYIGLTGSYRYEDFELGGTFKYSGWVESSDNDEHYDPGKRITYRSKVKDQNYYSVAVNAGYYVTPNAKVYVEGAWNRVTNKKGNTSLYDHNNNTSDYSKNGAGIENYNFITTAGLKYTF

Sequences of outer membrane proteins have signal peptides: the secretion marker is also a marker of outer membrane proteins

Signal Peptide prediction

[Figure: example sequence with the N-terminal signal peptide, the cleavage site and the mature protein marked]

MKLLQRGVALALLTTFTLASETALAYEQDKTYKITVLHTNDHHGHF

2 Neural Networks

SignalNet: predicts if a given residue position belongs to the signal peptide

CleavageNet: predicts if a given residue position is the cleavage site

SignalNet Accuracy

Organism        Window    C      Q2
Eukaryotes      15-1-15   0.83   0.95
Gram positive   15-1-15   0.79   0.92
Gram negative   11-1-11   0.78   0.92

CleavageNet Accuracy

Organism        Window    C      Q2
Eukaryotes      15-1-2    0.61   0.97
Gram positive   20-1-3    0.56   0.96
Gram negative   11-1-2    0.62   0.96
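The window notation in these tables can be illustrated with a short sketch. It reads "15-1-2" as 15 residues before the position, the position itself, and 2 residues after it; that reading, the padding character and the function name are assumptions, since the slides do not spell the notation out.

```python
def windows(sequence, left, right, pad="-"):
    """Return one window per residue: `left` residues before, the residue itself,
    and `right` residues after, padded at the sequence ends."""
    padded = pad * left + sequence + pad * right
    size = left + 1 + right
    return [padded[i:i + size] for i in range(len(sequence))]

seq = "MKLLQRGVALALLTTFTLASETALAYEQDKTYKITVLHTNDHHGHF"

symmetric = windows(seq, 15, 15)   # a SignalNet-style window for eukaryotes
asymmetric = windows(seq, 15, 2)   # a CleavageNet-style window for eukaryotes

print(len(symmetric[0]), len(asymmetric[0]))  # 31 18
print(asymmetric[20])
```

An asymmetric window of this kind looks mostly at the residues upstream of the candidate position.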

Comparison with SignalP

Organism                 SignalP   SPEP
Eukaryotes (+)           0.99      0.97
Eukaryotes (-)           0.85      0.94
Prokaryotes (+)          0.99      0.97
Prokaryotes (-)          0.93      0.96
Escherichia coli (+/-)   0.95      0.96

Performance of SignalNN on 2160 annotated proteins

Annotation       Predicted with signal   Predicted without signal   Total
With signal      205                     45                         250
Without signal   55                      1855                       1910
Total            260                     1900                       2160

(Diagonal cells: correct predictions. Off-diagonal cells: wrong predictions.)

Q2 = 96 %

Qsignal = 82 %, Qnon-signal = 97 %

Psignal = 78 %, Pnon-signal = 98 %
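The indices on this slide follow from the four cells of the confusion matrix (205 and 45 for annotated signal peptides, 55 and 1855 for proteins without one). The sketch below recomputes them under the usual definitions: Q as per-class coverage and P as per-class precision. These definitions are assumed here, as is the reading of the "C" column in the accuracy tables as the Matthews correlation coefficient. The recomputed values agree with the slide to within about one percentage point.

```python
import math

# Cells as read from the slide (rows: annotation, columns: prediction).
tp, fn = 205, 45     # annotated with signal: predicted with / without
fp, tn = 55, 1855    # annotated without signal: predicted with / without

total = tp + fn + fp + tn

q2 = (tp + tn) / total              # overall fraction of correct predictions
q_signal = tp / (tp + fn)           # coverage of the signal class
q_non_signal = tn / (tn + fp)       # coverage of the non-signal class
p_signal = tp / (tp + fp)           # precision of signal predictions
p_non_signal = tn / (tn + fn)       # precision of non-signal predictions

# Matthews correlation coefficient (assumed meaning of the "C" column).
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

for name, value in [
    ("Q2", q2), ("Qsignal", q_signal), ("Qnon-signal", q_non_signal),
    ("Psignal", p_signal), ("Pnon-signal", p_non_signal), ("C", mcc),
]:
    print(f"{name:12s} {value:.3f}")
```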

Predictors of Membrane Topography: Rate of false positives

The predictors are tested on 809 globular proteins (sequence identity threshold: 25 %):

0.5 % have at least 1 α-TM helix predicted.

5.6 % have at least 2 β-TM strands predicted.

[Figure: the Hunter pipeline. Box labels: Proteome, Signal peptide, all-α TM, all-β TM, Globular]

Predicting globular, inner and outer membrane proteins in genomes of Gram-negative bacteria with Hunter

Organism                        Outer membrane   Inner membrane   Globular       Total
Escherichia coli K12            65 (1.6%)        907 (21.7%)      3201 (76.7%)   4173
  New*                          18               136              1099           1253
Escherichia coli O157:H7        78 (1.5%)        1034 (19.3%)     4249 (79.2%)   5361
  New                           10               327              1564           1901
Chlamydia pneumoniae CWL029     12 (1.1%)        290 (27.6%)      750 (71.3%)    1052
  New                           2                181              236            419
Salmonella typhimurium LT2      70 (1.6%)        1002 (22.5%)     3379 (75.9%)   4451
  New                           0                2                21             23
Neisseria meningitidis MC58     34 (1.7%)        372 (18.4%)      1619 (80.0%)   2025
  New                           6                176              662            844
Helicobacter pylori 26695       36 (2.3%)        352 (22.5%)      1178 (75.2%)   1566
  New                           10               141              445            596
Haemophilus influenzae Rd       23 (1.3%)        348 (20.4%)      1338 (78.3%)   1709
  New                           5                121              430            556
Thermotoga maritima             18 (1.0%)        370 (20.0%)      1458 (79.0%)   1846
  New                           11               203              559            773
Pseudomonas aeruginosa          131 (2.4%)       1292 (23.2%)     4142 (74.4%)   5565
  New                           62               616              1867           2545

* the number of new proteins predicted in the class with Hunter, out of the non-annotated region
