prediction of structural and functional features in proteins startin … neurali... · 2012. 9....
TRANSCRIPT
-
Prediction of structural and functional features in proteins starting from the residue sequence
INTRODUCTION TO NEURAL NETWORKS
-
MAPPING PROBLEMS: Secondary structure
Covalent structure: TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN
Secondary structure: EEEE..HHHHHHHHHHHH....HHHHHHHH.EEEE...........
3D structure: [Figure: 3D structure of the chain]
-
MAPPING PROBLEMS: Topology of transmembrane proteins
Topography: position of Trans Membrane Segments along the sequence
ALALMLCMLTYRHKELKLKLKK
Outer Membrane: β-barrel proteins, e.g. Porin (Rhodobacter capsulatus)
Inner Membrane: α-helical proteins, e.g. Bacteriorhodopsin (Halobacterium salinarum)
-
First generation methods: single residue statistics
Propensity scales
For each residue:
• The association between each residue and the different features is statistically evaluated
• Physical and chemical features of residues
A propensity value for any structure can be associated to any residue
HOW?
-
Secondary structure: Chou-Fasman propensity scale
Given a set of known structures, we can count how many times a residue is associated with a structure.
Example:
ALAKSLAKPSDTLAKSDFREKWEWLKLLKALACCKLSAAL
hhhhhhhhccccccccccccchhhhhhhhhhhhhhhhhhh
N(A,h) = 7, N(A,c) = 1, N = 40
P(A,h) = 7/40, P(A,c) = 1/40
Is that enough for estimating a propensity?
We need to estimate how independent the residue-to-structure association is.
P(h) = 27/40, P(c) = 13/40
If the structure is independent of the residue: P(A,h) = P(A)P(h)
The ratio P(A,h)/(P(A)P(h)) is the propensity
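The counting above can be reproduced with a few lines of Python. This is a minimal sketch, not the original Chou-Fasman procedure: the function name `propensities` is hypothetical, and a real scale would be estimated on a large set of structures rather than on one 40-residue example.

```python
from collections import Counter

def propensities(sequence, structure):
    """Return {(residue, state): P(r,s) / (P(r) * P(s))} for one labelled example."""
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    n = len(sequence)
    pair_counts = Counter(zip(sequence, structure))  # N(residue, state)
    res_counts = Counter(sequence)                   # N(residue)
    state_counts = Counter(structure)                # N(state)
    return {
        (r, s): (pair_counts[(r, s)] / n)
        / ((res_counts[r] / n) * (state_counts[s] / n))
        for (r, s) in pair_counts
    }

seq = "ALAKSLAKPSDTLAKSDFREKWEWLKLLKALACCKLSAAL"
sst = "hhhhhhhhccccccccccccchhhhhhhhhhhhhhhhhhh"
prop = propensities(seq, sst)
print(round(prop[("A", "h")], 3))  # (7/40) / ((8/40) * (27/40))
```

In this example alanine occurs 8 times, so P(A) = 8/40 and the helix propensity of A is above 1, while its coil propensity is below 1.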
-
Secondary structure: Chou-Fasman propensity scale
Given a LARGE set of examples, a propensity value can be computed for each residue and each structure type
Name            P(H)   P(E)
Alanine         1.42   0.83
Arginine        0.98   0.93
Aspartic Acid   1.01   0.54
Asparagine      0.67   0.89
Cysteine        0.70   1.19
Glutamic Acid   1.51   0.37
Glutamine       1.11   1.10
Glycine         0.57   0.75
Histidine       1.00   0.87
Isoleucine      1.08   1.60
Leucine         1.21   1.30
Lysine          1.14   0.74
Methionine      1.45   1.05
Phenylalanine   1.13   1.38
Proline         0.57   0.55
Serine          0.77   0.75
Threonine       0.83   1.19
Tryptophan      1.08   1.37
Tyrosine        0.69   1.47
Valine          1.06   1.70
-
Secondary structure: Chou-Fasman propensity scale
Given a new sequence, a secondary structure prediction can be obtained by plotting the propensity values for each structure, residue by residue
Residue   T    S    P    T    A    E    L    M    R    S    T    G
P(H)      83   77   57   83  142  151  121  145   98   77   83   57
P(E)     119   75   55  119   83   37  130  105   93   75  119   75
(propensity values × 100)
Considering three secondary structures (H, E, C), the overall accuracy, as evaluated on an uncorrelated set of sequences with known structure, is very low: Q3 = 50-60 %
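The per-residue lookup behind such a plot can be sketched as follows. The dictionary copies the P(H) and P(E) values of the propensity table in these slides, keyed by one-letter code; the function name `propensity_tracks` is hypothetical, and no smoothing or decision rule is applied, since the slides do not specify one.

```python
# Chou-Fasman propensities from the table in these slides: residue -> (P(H), P(E))
CHOU_FASMAN = {
    "A": (1.42, 0.83), "R": (0.98, 0.93), "D": (1.01, 0.54), "N": (0.67, 0.89),
    "C": (0.70, 1.19), "E": (1.51, 0.37), "Q": (1.11, 1.10), "G": (0.57, 0.75),
    "H": (1.00, 0.87), "I": (1.08, 1.60), "L": (1.21, 1.30), "K": (1.14, 0.74),
    "M": (1.45, 1.05), "F": (1.13, 1.38), "P": (0.57, 0.55), "S": (0.77, 0.75),
    "T": (0.83, 1.19), "W": (1.08, 1.37), "Y": (0.69, 1.47), "V": (1.06, 1.70),
}

def propensity_tracks(sequence):
    """Return two lists (helix, strand) with the propensities x 100, residue by residue."""
    helix = [round(100 * CHOU_FASMAN[r][0]) for r in sequence]
    strand = [round(100 * CHOU_FASMAN[r][1]) for r in sequence]
    return helix, strand

helix, strand = propensity_tracks("TSPTAELMRSTG")
print(helix)
print(strand)
```

Plotting the two lists against the residue index gives the kind of profile the slide refers to; stretches where one track stays above 100 suggest the corresponding structure.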
-
Secondary structure: Chou-Fasman propensity scale
http://www.expasy.ch/cgi-bin/protscale.pl
-
Transmembrane alpha-helices: Kyte-Doolittle scale
It is computed taking into consideration the octanol-water partition coefficient, combined with the propensity of the residues to be found in known transmembrane helices
Ala:  1.800   Arg: -4.500   Asn: -3.500   Asp: -3.500
Cys:  2.500   Gln: -3.500   Glu: -3.500   Gly: -0.400
His: -3.200   Ile:  4.500   Leu:  3.800   Lys: -3.900
Met:  1.900   Phe:  2.800   Pro: -1.600   Ser: -0.800
Thr: -0.700   Trp: -0.900   Tyr: -1.300   Val:  4.200
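A scale of this kind is usually applied as a sliding-window average along the sequence. The sketch below assumes a window of 19 residues, a common choice for spotting transmembrane helices that is not stated in the slides; the function name `hydropathy_profile` is hypothetical.

```python
# Kyte-Doolittle hydropathy values from the slide, keyed by one-letter code
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(sequence, window=19):
    """Average hydropathy over a centred window; one value per fully covered position."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    scores = [KD[r] for r in sequence]
    return [
        sum(scores[i - half:i + half + 1]) / window
        for i in range(half, len(sequence) - half)
    ]

profile = hydropathy_profile("ALALMLCMLTYRHKELKLKLKK", window=9)
print([round(v, 2) for v in profile])
```

High positive averages mark hydrophobic stretches that are candidate transmembrane segments; the threshold to apply is a separate choice not covered here.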
-
Second generation methods: GOR
The structure of a residue in a protein strongly depends on the sequence context
It is possible to estimate the influence of a residue in determining the structure of a residue close to it along the sequence. Usually, windows from -8/+8 to -13/+13 are considered.
Coefficients P(A,s,i) estimate the contribution of residue A in determining the structure s of a residue that is i positions apart along the sequence
-
Secondary structure: GOR method
Q3 = 65 % (considering three secondary structures (H, E, C), and evaluating the overall accuracy on an uncorrelated set of sequences with known structure)
The contribution of each position in the window is independent of the others. No correlation among the positions in the window is taken into account.
-
A more efficient method: Neural Networks
Alternative computing algorithm: analogies with the computation in the nervous system.
1) The nervous system consists of elementary computing units: neurons
2) The electric signal flows in a determined direction (dendrites -> axon) (Principle of dynamic polarization)
3) There is no cytoplasmic continuity among the neurons. Each neuron specifically communicates with some neighboring neurons by means of synapses (Principle of connective specificity)
-
Tools out of machine learning approaches
Neural Networks can learn the mapping from sequence to secondary structure
[Figure: Training on a data base subset with known mapping (TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN -> EEEE..HHHHHHHHHHHH....HHHHHHHH.EEEE...........) yields general rules; the general rules applied to a new sequence give the prediction]
-
Neural network for secondary structure prediction
Output: C
Input (window K P I H Y H P N H taken from M P I L K Q K P I H Y H P N H G E A K G):
    K P I H Y H P N H
A   0 0 0 0 0 0 0 0 0
C   0 0 0 0 0 0 0 0 0
D   0 0 0 0 0 0 0 0 0
E   0 0 0 0 0 0 0 0 0
F   0 0 0 0 0 0 0 0 0
G   0 0 0 0 0 0 0 0 0
H   0 0 0 1 0 1 0 0 1
I   0 0 1 0 0 0 0 0 0
K   1 0 0 0 0 0 0 0 0
L   0 0 0 0 0 0 0 0 0
M   0 0 0 0 0 0 0 0 0
N   0 0 0 0 0 0 0 1 0
P   0 1 0 0 0 0 1 0 0
Q   0 0 0 0 0 0 0 0 0
R   0 0 0 0 0 0 0 0 0
S   0 0 0 0 0 0 0 0 0
T   0 0 0 0 0 0 0 0 0
V   0 0 0 0 0 0 0 0 0
W   0 0 0 0 0 0 0 0 0
Y   0 0 0 0 1 0 0 0 0
Usually: input 17-23 residues; hidden neurons: 4-15
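The input encoding shown in the matrix can be sketched as follows. The function name `one_hot_window` is hypothetical, and leaving positions beyond the chain ends as all-zero columns is an assumption (a dedicated padding symbol is another common choice).

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def one_hot_window(sequence, center, window=9):
    """Return `window` columns of 20 values; each column has a 1 at the residue type."""
    half = window // 2
    columns = []
    for pos in range(center - half, center + half + 1):
        column = [0] * len(ALPHABET)
        if 0 <= pos < len(sequence):
            column[ALPHABET.index(sequence[pos])] = 1
        # positions outside the chain stay all-zero (assumption)
        columns.append(column)
    return columns

seq = "MPILKQKPIHYHPNHGEAKG"
cols = one_hot_window(seq, center=10, window=9)  # index 10 is the central Y
decoded = "".join(ALPHABET[c.index(1)] for c in cols)
print(decoded)
```

Flattening the columns gives a vector of 9 × 20 = 180 inputs for this window; with the 17-23 residue windows mentioned on the slide the input layer grows accordingly.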
-
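[Figure: the input window slides along the sequence one residue at a time; each residue is paired with its observed structure, with labels such as D (L), R (E), Q (E), G (E), F (E), V (E), P (E), A (H), Y (H), K (E). Input alphabet: ACDEFGHIKLMNPQRSTVWY plus "."]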
-
Third generation methods: evolutionary information
 1  Y K D Y H S - D K K K G E L - -
 2  Y R D Y Q T - D Q K K G D L - -
 3  Y R D Y Q S - D H K K G E L - -
 4  Y R D Y V S - D H K K G E L - -
 5  Y R D Y Q F - D Q K K G S L - -
 6  Y K D Y N T - H Q K K N E S - -
 7  Y R D Y Q T - D H K K A D L - -
 8  G Y G F G - - L I K N T E T T K
 9  T K G Y G F G L I K N T E T T K
10  T K G Y G F G L I K N T E T T K
Sequence profile (rows: residue type; columns: position):
A    0   0   0   0   0   0   0   0   0   0   0  10   0   0   0   0
C    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
D    0   0  70   0   0   0   0  60   0   0   0   0  20   0   0   0
E    0   0   0   0   0   0   0   0   0   0   0   0  70   0   0   0
F    0   0   0  10   0  33   0   0   0   0   0   0   0   0   0   0
G   10   0  30   0  30   0 100   0   0   0   0  50   0   0   0   0
H    0   0   0   0  10   0   0  10  30   0   0   0   0   0   0   0
K    0  40   0   0   0   0   0   0  10 100  70   0   0   0   0 100
I    0   0   0   0   0   0   0   0  30   0   0   0   0   0   0   0
L    0   0   0   0   0   0   0  30   0   0   0   0   0   0   0   0
M    0   0   0   0   0   0   0   0   0   0   0   0   0  60   0   0
N    0   0   0   0  10   0   0   0   0   0  30  10   0   0   0   0
P    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
Q    0   0   0   0  40   0   0   0  30   0   0   0   0   0   0   0
R    0  50   0   0   0   0   0   0   0   0   0   0   0   0   0   0
S    0   0   0   0   0  33   0   0   0   0   0   0  10  10   0   0
T   20   0   0   0   0  33   0   0   0   0   0  30   0  30 100   0
V    0   0   0   0  10   0   0   0   0   0   0   0   0   0   0   0
W    0  10   0   0   0   0   0   0   0   0   0   0   0   0   0   0
Y   70   0   0  90   0   0   0   0   0   0   0   0   0   0   0   0
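A sequence profile of this kind can be computed column by column from the alignment. The sketch below is an assumption about the procedure, not the authors' code: percentages are taken over the non-gap residues of each column and rounded to integers, and the function name `sequence_profile` is hypothetical.

```python
from collections import Counter

# The ten aligned sequences of the slide, one string per row ("-" is a gap)
MSA = [
    "YKDYHS-DKKKGEL--",
    "YRDYQT-DQKKGDL--",
    "YRDYQS-DHKKGEL--",
    "YRDYVS-DHKKGEL--",
    "YRDYQF-DQKKGSL--",
    "YKDYNT-HQKKNES--",
    "YRDYQT-DHKKADL--",
    "GYGFG--LIKNTETTK",
    "TKGYGFGLIKNTETTK",
    "TKGYGFGLIKNTETTK",
]

def sequence_profile(msa):
    """Return one dict per column: residue -> percentage among non-gap residues."""
    profile = []
    for j in range(len(msa[0])):
        column = [row[j] for row in msa if row[j] != "-"]
        counts = Counter(column)
        profile.append({aa: round(100 * n / len(column)) for aa, n in counts.items()})
    return profile

prof = sequence_profile(MSA)
print(prof[0])  # first alignment column
print(prof[5])  # a column that contains one gap
```

For the first column this gives Y 70, G 10, T 20, and for the sixth column S, T and F at 33 each. Each column, expanded to a 20-valued vector, replaces the one-hot column used for single-sequence input.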
-
The Network Architecture for Secondary Structure Prediction
The First Network (Sequence to Structure)
Output: H E C
CCHHEHHHHCHHCCEECCEEEEHHHCC
Input (sequence profile):
SeqNo  No   V   L   I   M   F   W   Y   G   A   P   S   T   C   H   R   K   Q   E   N   D
  1     1  80   0  20   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
  2     2   0   0   0   0   0   0   0   0   0  20   0   0   0   0   0   0   0   0   0  80
  3     3  50   0   0   0   0   0   0   0  33   0   0   0   0   0   0   0   0  17   0   0
  4     4   0   0   0   0   0   0   0   0  13  63  13   0   0   0   0   0   0  13   0   0
  5     5  13   0   0   0   0   0   0  13  75   0   0   0   0   0   0   0   0   0   0   0
  6     6   0   0   0  13   0   0   0   0   0  13   0  13   0   0   0   0   0   0   0  63
  7     7   0   0   0  38   0   0   0  38   0   0   0   0   0   0   0  25   0   0   0   0
  8     8  25  13   0   0   0   0   0   0  50   0  13   0   0   0   0   0   0   0   0   0
  9     9   0  13  13   0   0   0   0   0   0  25   0   0   0   0   0  50   0   0   0   0
 10    10   0   0  25  13   0   0   0   0  13  13   0   0   0   0   0  38   0   0   0   0
 11    11   0   0   0   0   0   0   0   0  25   0   0   0   0   0   0  13  13   0   0  50
 12    12   0   0   0   0  43   0   0  29   0  29   0   0   0   0   0   0   0   0   0   0
 13    13   0  14  29   0   0   0   0   0  29   0   0   0   0   0   0   0   0  14   0  14
 14    14   0   0   0   0   0   0   0  43  29   0   0   0   0   0   0  29   0   0   0   0
-
The Network Architecture for Secondary Structure Prediction
The Second Network (Structure to Structure)
Output: H E C
CCHHEHHHHCHHCCEECCEEEEHHHCC
[Figure: the same example sequence profile (14 positions × 20 residue types, columns V L I M F W Y G A P S T C H R K Q E N D) is shown below the two networks]
-
The Performance on the Task of Secondary Structure Prediction
The cross validation procedure
[Figure: the protein set is split into Testing set 1 and Training set 1]
-
Efficiency of the Neural Network-Based Predictors on the 822 Proteins of the Testing Set
INPUT                           Q3 (%)  SOV    Q[H]  Q[E]  Q[C]   P[H]  P[E]  P[C]   C[H]  C[E]  C[C]
Single Sequence                 66.3    0.62   0.69  0.61  0.66   0.70  0.54  0.71   0.54  0.44  0.45
Multiple Sequence (MaxHom)      72.4    0.69   0.75  0.65  0.75   0.77  0.64  0.73   0.64  0.54  0.53
Multiple Sequence (PSI-BLAST)   73.4    0.70   0.75  0.70  0.73   0.80  0.63  0.75   0.67  0.56  0.53
Combining different networks: Q3 = 76-78 %
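The indices in the table can be computed from an observed and a predicted string. The slides do not define them, so the sketch below assumes the usual conventions: Q3 is the percentage of correctly predicted residues, Q[s] the fraction of residues observed in state s that are predicted correctly, P[s] the fraction of residues predicted in state s that are correct, and C[s] the Matthews correlation coefficient for state s. SOV is not covered, and the function names are hypothetical.

```python
from math import sqrt

def q3(observed, predicted):
    """Percentage of residues whose predicted state equals the observed one."""
    correct = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * correct / len(observed)

def class_indices(observed, predicted, state):
    """Return (Q, P, C) for one state under the assumed definitions."""
    pairs = list(zip(observed, predicted))
    tp = sum(o == state and p == state for o, p in pairs)
    fn = sum(o == state and p != state for o, p in pairs)
    fp = sum(o != state and p == state for o, p in pairs)
    tn = len(pairs) - tp - fn - fp
    q = tp / (tp + fn) if tp + fn else 0.0
    p = tp / (tp + fp) if tp + fp else 0.0
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    c = (tp * tn - fp * fn) / denom if denom else 0.0
    return q, p, c

obs = "HHHHEEEECC"   # toy example, not data from the slides
pred = "HHHCEEECCC"
print(q3(obs, pred))
print(class_indices(obs, pred, "H"))
```

In a cross-validation setting these indices are computed on the testing sets only and then averaged.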
-
Secondary Structure Prediction
From sequence
TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN
To secondary structure
EEEE..HHHHHHHHHHHH....HHHHHHHH.EEEE...........
And to the reliability of the prediction
7997688899999988776886778999887679956889999999
-
SERVERS
PredictProtein, Burkhard Rost (Columbia Univ.): http://cubic.bioc.columbia.edu/predictprotein/
PsiPRED, David Jones (UCL): http://bioinf.cs.ucl.ac.uk/psipred/
JPred, Geoff Barton (Dundee Univ.)
SecPRED: http://www.biocomp.unibo.it
-
Chameleon sequences
QEALEIA
Translation Initiation Factor 3, Bacillus stearothermophilus (1TIF): ……GIKSKQEALEIAARRN……
Transcription Factor 1, Bacteriophage Spo1 (1WTUA): ……FNPQTQEALEIAPSVGV……
-
We extract, from a set of 822 non-homologous proteins (174,192 residues):
2,452 5-mer chameleons
107 6-mer chameleons
16 7-mer chameleons
1 8-mer chameleon
2,576 couples
The total number of residues in chameleons is 26,044 out of 755 protein chains (~15%)
-
Prediction of the Secondary Structure of Chameleon sequences with Neural Networks
QEALEIA -> HHHHHHH in NGDQLGIKSKQEALEIAARRNLDLVLVAP
QEALEIA -> CCCCCCC in ARKGFNPQTQEALEIAPSVGVSVKPG
-
The Prediction of Chameleons with Neural Networks
Method              Performance on the protein data set   Performance on chameleon sequences
NN with MSA input   73.4 %                                 75.1 %
NN with SS input    66.3 %                                 58.9 %
GOR IV              64.4 %                                 55.2 %
-
Other neural network-based predictors
•Secondary structure
•Topology of transmembrane proteins
•Cysteine bonding state
•Contact maps of proteins
•Interaction sites on protein surface
-
Prediction of the cysteine bonding state
Tryparedoxin-I from Crithidia fasciculata (1QK8)
MSGLDKYLPGIEKLRRGDGEVEVKSLAGKLVFFYFSASWCPPCRGFTPQLIEFYDKFHES
KNFEVVFCTWDEEEDGFAGYFAKMPWLAVPFAQSEAVQKLSKHFNVESIPTLIGVDADSG
DVVTTRARATLVKDPEGEQFPWKDAP
Free cysteines: Cys68
Disulphide bonded cysteines: Cys40, Cys43
-
A neural network-based method for predicting the disulfide connectivity in proteins
-
The Protein Folding
T T C C P S I V A R S N F N V C R L P G T P E A L C A T Y T G C I I I P G A T C P G D Y A N
-
The Protein Folding
RPDFCLEPPYTGPCKARIIRYFYNAKAGLCQTFVYGGCRAKRNNFKSAEDCMRTCGGA
-
Disulfide bonds
[Figure: two cysteine side chains joined by an S-S bond]
2 -SH -> -SS- + 2H+ + 2e-
S-S distance 2.2 Å
Torsion angle C-S-S-C 90°
Bond Energy 3 Kcal/mol
-
Intra-chain disulfide bonds in proteins
Of 1259 proteins (a non redundant PDB subset):
• 23% of the chains have disulfide bonds (SS)
• SS distribution (between secondary structures):
%    H    E    C
H    7    9   14
E        17   27
C             26
-
Intra-chain disulfide bonds in proteins
Distribution of disulfide bonds in the SCOP domains
•99 % of the disulfide bonds are intra-domain
•Distribution:
Type            %
All-α           13
All-β           31
α/β             11
α+β             13
Small domains   29
Others           3
-
Problem no 1:
Starting from the protein sequence, can we discriminate whether a cysteine residue is disulfide-bonded?
Prediction of the disulfide-bonding state of cysteines in proteins
-
Perceptron (input: sequence profile)
Output: bonded / non bonded
Input: NGDQLGIKSKQEALCIAARRNLDLVLVAP
-
Plotting the trained weights
[Figure: Hinton's plots of the trained weights for the bonding state and for the non bonding state. x axis: residue (V L I M F W Y G A P S T C H R K Q E N D 0 & #); y axis: position in the window, from -5 to 5]
-
Is it possible to add a syntax?
[Figure: state diagram with Begin, states 1, 2, 3, 4 and End; the states are divided into free states and bonded states]
-
A path
[Figure: state diagram with Begin, states 1, 2, 3, 4 and End]
Residue   State   Bonding State
C40       1       F
C43       2       B
C68       4       B
P(seq) = P(1 | Begin) P(C40 | 1) ... P(2 | 1) P(C43 | 2) ... P(4 | 2) P(C68 | 4) ... P(End | 4)
-
4 possible paths
[Figure: the state diagram (Begin, states 1, 2, 3, 4, End) repeated for each path]
Path 1:  Residue   State   Bonding State
         C40       1       F
         C43       2       B
         C68       4       B
Path 2:  C40       2       B
         C43       3       F
         C68       4       B
Path 3:  C40       1       F
         C43       1       F
         C68       1       F
Path 4:  C40       2       B
         C43       4       B
         C68       1       F
-
Hybrid system
MYSFPNSFRFGWSQAGFQCEMSTPGSEDPNTDWYKWVHDPENMAAGLCSGDLPENGPGYWGNYKTFHDNAQKMCLKIARLNVEWSRIFPNP...
Windows W1, W2, W3: the neural network gives P(B|W1), P(F|W1); P(B|W2), P(F|W2); P(B|W3), P(F|W3)
HMM states: Begin, Free Cys, Bonded Cys, End
Viterbi path
Prediction of bonding state of cysteines
-
Prediction for Tryparedoxin
[Figure: state diagram with Begin, states 1, 2, 3, 4 and End]
Residue   NN output B   NN output F   NN pred   HMM Viterbi path   HMM pred
C40       99            1             B         2                  B
C43       82            18            B         4                  B
C68       61            39            B         1                  F
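The effect of the syntax can be illustrated with a small Viterbi decoder over the four states. This is a sketch under stated assumptions, not the published hybrid system: the allowed transitions are inferred from the four example paths (states 1 and 3 free, states 2 and 4 bonded, End reachable only from states 1 and 4, so that the number of bonded cysteines is even), all allowed transitions are given equal weight, and the neural network outputs are used directly as emission scores.

```python
LABEL = {1: "F", 2: "B", 3: "F", 4: "B"}
# Allowed transitions (assumed from the example paths)
NEXT = {"begin": (1, 2), 1: (1, 2), 2: (3, 4), 3: (3, 4), 4: (1, 2)}
FINAL = (1, 4)  # states that may precede End

def viterbi(nn_outputs):
    """nn_outputs: list of (P(B|W), P(F|W)), one pair per cysteine.
    Return (state path, bonding labels) of the best path allowed by the syntax."""
    def emit(state, probs):
        return probs[0] if LABEL[state] == "B" else probs[1]

    # best[state] = (score, path) of the best partial path ending in `state`
    best = {s: (emit(s, nn_outputs[0]), [s]) for s in NEXT["begin"]}
    for probs in nn_outputs[1:]:
        new = {}
        for prev, (score, path) in best.items():
            for s in NEXT[prev]:
                cand = score * emit(s, probs)
                if s not in new or cand > new[s][0]:
                    new[s] = (cand, path + [s])
        best = new
    score, path = max((best[s] for s in FINAL if s in best), key=lambda t: t[0])
    return path, [LABEL[s] for s in path]

# Neural network outputs for C40, C43 and C68, as fractions
path, labels = viterbi([(0.99, 0.01), (0.82, 0.18), (0.61, 0.39)])
print(path, labels)
```

With these inputs the network alone would call all three cysteines bonded, which is an odd number; the constrained decoding instead selects the path 2, 4, 1 (B, B, F), in agreement with the table above.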
-
Performance
Neural Network
Set   Q2     C      Q(B)   Q(F)   P(B)   P(F)   Q2prot
WD    80.4   0.56   67.2   87.5   74.3   83.2   56.9
RD    80.1   0.56   67.2   87.6   75.7   82.2   49.7
Hybrid system
Set   Q2     C      Q(B)   Q(F)   P(B)   P(F)   Q2prot
WD    88.0   0.73   78.1   93.3   86.3   88.8   84.0
RD    87.4   0.73   78.1   92.8   86.3   88.0   80.2
B = cysteine bonding state, F = cysteine free state. WD = whole database (969 proteins, 4136 cysteines). RD = reduced database, in which the chains containing only one cysteine are removed (782 proteins, 3949 cysteines).
Martelli PL, Fariselli P, Malaguti L, Casadio R. Prediction of the disulfide bonding state of cysteines in proteins with hidden neural networks. Protein Eng. 15:951-953 (2002)
-
Problem no 2:
When the bonding state of cysteines is known, can we predict the connectivity pattern of disulfide bonds?
Prediction of the connectivity of disulfide bonds in proteins
-
Prediction of disulfide connectivity in proteins
Bovine trypsin Inhibitor (6PTI)
[Figure: 3D structure with the cysteines at positions 5, 14, 30, 38, 51 and 55 marked; below, the sequence from N to C terminus with the connectivity pattern drawn over the six cysteines]
-
Prediction of disulfide connectivity in proteins as a problem of maximum-weight perfect matching
Representation:
[Figure: the protein sequence (N to C) with Cys1, Cys2, Cys3 and Cys4 drawn as the vertices of a complete graph with edge weights W12, W13, W14, W23, W24, W34]
The undirected weighted graph has V = 2B vertices (no. of cysteines) and E = 2B(2B-1)/2 undirected edges (W = strength of the interaction)
-
From the Graph Theory:
•It is not necessary to compute all the possible connectivity patterns ($\prod_{i \le B} (2i-1)$)
•Given a complete graph G = (2B, E), the matching with the maximum weight can be computed in $O(B^3)$ time with the Edmonds-Gabow algorithm*
* Gabow, H.N. (1975). Technical Report CU-CS-075-75, Dept. of Comp. Sci., Colorado University
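The size of the search space and the meaning of "maximum-weight perfect matching" can be illustrated on a toy graph. The sketch below deliberately uses exhaustive enumeration, which is only practical for small B; it is not the Edmonds-Gabow algorithm cited above. The function names and the toy weights are hypothetical.

```python
def n_patterns(b):
    """Number of connectivity patterns for B disulfide bonds: product of (2i - 1), i = 1..B."""
    n = 1
    for i in range(1, b + 1):
        n *= 2 * i - 1
    return n

def best_matching(weights):
    """Exhaustive maximum-weight perfect matching.
    weights: {(i, j): w} with i < j for every pair of the 2B vertices 0..2B-1."""
    def search(free):
        if not free:
            return 0.0, []
        first, rest = free[0], free[1:]
        best = None
        for k, partner in enumerate(rest):
            score, pairs = search(rest[:k] + rest[k + 1:])
            score += weights[(first, partner)]
            if best is None or score > best[0]:
                best = (score, [(first, partner)] + pairs)
        return best

    n_vertices = max(j for _, j in weights) + 1
    return search(tuple(range(n_vertices)))

# Toy edge weights for 4 cysteines (B = 2); values are made up for illustration
w = {(0, 1): 0.9, (0, 2): 0.1, (0, 3): 0.5, (1, 2): 0.6, (1, 3): 0.2, (2, 3): 0.8}
score, pairs = best_matching(w)
print([n_patterns(b) for b in (2, 3, 4, 5)], pairs)
```

The pattern counts 3, 15, 105 and 945 grow quickly with B, which is why a polynomial-time matching algorithm is preferred over enumeration; picking one pattern at random succeeds with probability 1/n_patterns(B).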
-
How to assign the costs (W) of the edges in the graph
[Figure: the complete graph over Cys1, Cys2, Cys3 and Cys4 with edge weights W12, W13, W14, W23, W24, W34, drawn on the chain from N to C]
-
Assumption: for each cysteine, all its sequence nearest neighbours make contacts
[Figure: Cys i with its neighbours (Ni) and Cys j with its neighbours (Nj) along the chain from N to C; all possible interactions using 1 nearest neighbour]
-
Frequency distribution of disulfide bonds with respect to sequence separation (726 proteins)
[Figure: histogram; x axis: sequence separation (0 to 450), y axis: frequency (%), from 0 to 16]
-
Neural Networks for predicting the edge values
Output (1 node): disulfide pair propensity (output = wij)
Hidden nodes (6 nodes)
Input (212 nodes = 210 + 2 input nodes): each pair in the neighbours of 4 residues + sequence separation + no. of SS bonds
-
Accuracy (Qp) of EG vs NN
Chains   B   Random   EG     NN
158      2   0.333    0.46   0.68
153      3   0.067    0.17   0.21
103      4   0.009    0.11   0.20
44       5   0.001    0.00   0.02
-
The state of the art:
•Prediction of bonding states is quite satisfactory
•Prediction of connectivity needs to be improved
-
Prediction of Foldons
Piero Fariselli
-
The Folding Problem as a Mapping Problem
Covalent structure: TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN
Secondary structure: EEEE..HHHHHHHHHHHH....HHHHHHHH.EEEE...........
3D structure: [Figure: 3D structure of the chain]
-
We can collect from the PDB data base some 1500 chains of known structure from which to derive non-redundant information relating sequence to:
• secondary structure
• structural and functional motifs
• 3D structure
-
Evolutionary information
•Multiple Sequence Alignment (MSA) of similar sequences
•Sequence profile: for each position a 20-valued vector contains the aminoacidic composition of the aligned sequences.
[Figure: the MSA of 10 sequences (Y K D Y H S - D K K K G E L - - and homologues) over 16 sequence positions, and the sequence profile derived from it (20 residue types × 16 positions)]
-
Prediction of Initiation Sites of Protein Folding
The Folding Process
[Figure: The Unfolded Chain -> The Early Stages of Folding: Initiation Sites -> Folded Protein]
-
Frustration in proteins
• The simultaneous minimisation of all the interaction energies is impossible
-
The network architecture
Output: α / non-α
Hidden
Input
Input window: ..ALS.......QGFLLIARQPPFTYFTV......HW..
-
The prediction efficiency of the network
Q2 = 0.85 Q(H) = 0.67 Q(nonH) = 0.93 Sovpred = 0.85
C = 0.63 Pc(H) = 0.80 Pc(nonH) = 0.86 Sovobs = 0.76
-
Theoretical background
The conformation of residue R depends both on local (window W) and non-local (context C) interactions.
[Figure: residue R inside window W, surrounded by context C; window W is the input of the neural network, whose outputs are O_α and O_non-α]
The convergence theorem ensures that:
O_i = Probability(Structure_R = i | W)
If, for any i, O_i ≈ 1, then the structure of residue R depends mainly on W and only slightly on C
-
• Anfinsen's hypothesis:
$P(\sigma_i \mid W, C) = \delta(\sigma_i, \sigma_{nat}(W, C))$
• Averaging over all the contexts (performed by NN):
$P(\sigma_i \mid W) = \sum_C P(\sigma_i \mid W, C)\, P(C)$
• When the pattern is self-stabilising (W dependent):
$P(\sigma_i \mid W, C) = P(\sigma_i \mid W)$
• Then Anfinsen's hypothesis can be cast in a local form:
$P(\sigma_i \mid W) = \sum_C P(\sigma_i \mid W, C)\, P(C) = \delta(\sigma_i, \sigma_{nat}(W))$
-
Relationship between the reliability index and the Shannon entropy
[Figure: reliability index (y axis, 0 to 1) versus entropy S5 (x axis, 0 to 0.7)]
-
INPUT: MAS..... QLMLKDFLNRTPL.........GHI
Outputs: O_α, O_non-α
$S = -\sum_i O_i \log O_i$
Protein segments correctly predicted in -helical structure
100
80
100
40
60NC / NT (%)
0
20
13579
1 3 5 7 9 11 13 15 17 19 21 23 25 27 29
0
Entropy Segment length
Entropy = Shannon-entropy in (ln 2)/10 units ( S = -i o i ln ( o i ) )
9Segment length
Entropy Shannon entropy in (ln 2)/10 units ( S i o i ln ( o i ) )NC = Number of protein segments correctly predicted in -helixNT = Total number of protein segments predicted in -helix
-
Profile of the smoothed entropy (S5) for the hen egg lysozyme (132L)
[Figure: S5 (y axis, 0 to 0.7) along the protein chain (x axis, residues 1 to 121); legend: entropy, predicted helices, extracted fragments]
-
Hen egg lysozyme (132L)
[Figure: 3D structure with the N-terminus and C-terminus marked]
-
Frequency distribution of predicted helical segments as a function of their entropy value
[Figure: frequency (y axis, -0.1 to 0.2) versus entropy S5 (x axis, 0.0 to 0.7); curves: correct, wrong, differences; the threshold value is marked]
-
An example of the data base of minimally frustrated protein fragments
http://www.biocomp.unibo.it/DB/
CODE    ENTROPY   POSITIONS   SEQUENCE          DSSP SECONDARY STRUCTURE
1msk_   0.002     192-206     ADRLAEAFAEYLHER   HHHHHHHHHHHHHHH
1pyda   0.004     307-319     MKFVLQKLLTNIA     HHHHHHHHHHHHH
1ngr_   0.005     63-72       LDALLAALRR        HHHHHHHHHH
1sly_   0.005     338-346     AKEILHQLM         HHHHHHHH!
1aerb   0.006     20-28       VERLLQAHR         HHHHHHHHH
1bcn_   0.006     113-123     LENFLERLKTI       HHHHHHHHHHH
1bib_   0.006     215-226     LAAMLIRELRAA      HHHHHHHHHHHH
1fkx_   0.006     337-346     KKELLERLYR        HHHHHHHHHH
2arcb   0.006     148-158     NLLEQLLLRRM       HHHHHHHHHHH
1aqt_   0.008     112-125     DYAQASAELAKAIA    HHHHHHHHHHHHHH
1fit_   0.008     111-120     EEEXAAEAAA        HHHHHHHHHH
1mtyg   0.009     22-30       LEKAAEMLK         HHHHHHHHH
2tct_   0.009     50-60       LLDALAVEILA       HHHHHHHHHHH
1hsba   0.010     150-157     AHVAEQWR          !!HHHHHH
2chsa   0.010     17-26       EEILQKTKQL        HHHHHHHHHH
1hjp_   0.011     175-184     ETLIREALRA        HHHHHHHHH!
1pou_   0.011     5-13        LEQFAKTFK         HHHHHHHHH
...
-
Training set from PDB
Number of proteins   Number of amino acids   Number of α-helices   Average length
822                  174191                  4783                  11.6
Data base of minimally frustrated α-helical segments
Number of proteins   Number of amino acids   Number of α-helical segments   Average length
626                  21553                   3000                           7.2
-
Comparison of minimally frustrated segments with putative folding initiation sites experimentally determined
PDB Code   Entropy (S5)   Position in the protein chain   Extracted Sequence   Reference
132l       0.109          8-14                            LAAAMKR              Radford & al. (1992)
132l       0.212          89-95                           TASVNCA              Radford & al. (1992)
1hfx       0.186          86-91                           TDDIMC               Chyan & al. (1993)
1hfx       0.221          7-13                            ALSHELN              *
1hrc       0.156          92-99                           EDLIAYLK             Jeng & al. (1990)
2mm1       0.050          127-132                         AQGAMN               Hugson & al. (1990)
2mm1       0.104          139-146                         RKDMASNY             Hugson & al. (1990)
2mm1       0.154          105-111                         EFISEAI              Hugson & al. (1990)
7rsa       0.409          8-11                            FERQ                 Udgaonkar & Baldwin (1990)
1ubq       0.322          25-28                           NVKA                 Briggs & Roder (1992)
1gf1       0.311          10-16                           LVDALQF              Hua & al. (1996)
2ci2       0.236          14-19                           VEEAKK               Fersht (1995)
*Not yet experimentally detected
-
Comparison of minimally frustrated segments with peptides extracted from proteins
Code*    Peptides*                     % Helix in solution*   Entropy (S5)   Extracted Segment
3FXC     TYKVTELINEAEGINETIDCDD        1                      #####          ####
3LZM     GFTNSLRMLQQKRWDEAVNLAKS       10                     0.262          WDEAVNL
"                                      10                     0.329          LRMLQQK
3LZM-2   GVAGFTNSLRMLQQKRWDEAAVNLAKS   12                     0.203          SLRMLQ
"                                      12                     0.210          DEAAVNL
CIII     ESLLERITRKLRDGWKRLIDIL        8                      0.171          LLERIT
"                                      8                      0.260          WKRLID
CIII-L   ESLLERITRKL                   15                     0.171          LLERIT
CIII-R   RDGWKRLIDIL                   4                      0.260          WKRLID
CIII-M   RITRKLRDGWK                   2                      ####           ####
Sigma    KVATTKAQRKLFFNLRKTKQRL        9                      0.218          TKAQRK
COMA1    DHPAVMEGTKTILETDSNLS          4                      ####           ####
COMA2    EPSEQFIKQHDFSSY               3                      ####           ####
COMA3    VNGMELSKQILQENPH              6                      0.189          LSKQILQ
COMA4    EVEDYFEEAIRAGLH               20                     0.020          YFEEAIR
COMA5    KEKITQYIYHVLNGEIL             3                      ####           ####
ARA1     AVGKSNLLSRYARNEFSA            2                      ####           ####
ARA2     RFRAVTSAYYRGAVG               3                      ####           ####
ARA3     TRRTTFESVGRWLDELKIHSD         7.5                    0.194          SVGRWL
ARA4     AVSVEEGKALAEEEGLF             4                      ####           ####
ARA5     STNVKTAFEMVILDIYNNV           3                      ####           ####
G1       DTYKLILNGKTLKGETTTEA          2                      ####           ####
G2       GDAATAEKVFKKIANDNGVD          4                      ####           ####
G3       GEWTYDDATKTFTVTE              2                      ####           ####
* Muñoz and Serrano, 1994.
-
Minimally frustrated α-helical segments are useful for determining:
• Folding initiation sites
• α-helix stability
• de-novo design of α-helices
-
Structure prediction of membrane proteins
-
Outer Membrane proteins (all β-Transmembrane proteins): β-barrel, e.g. Porin (Rhodobacter capsulatus)
Inner Membrane proteins (all α-Transmembrane proteins): α-helices, e.g. Bacteriorhodopsin (Halobacterium salinarum)
-
Predictors of the Topology of Membrane Proteins
Topography: position of Trans Membrane Segments along the sequence
Topology: position of N and C termini with respect to the bilayer
[Figure: a chain crossing the lipidic bilayer, with the Out and In sides, the positively charged (+) loops and the N and C termini marked]
-
Prediction of transmembrane segments
-
Neural Network for the prediction of TMS in β-barrel membrane proteins (Jacoboni et al., 2001)
2 output neurons: TM / nonTM
5 hidden neurons
Input: sequence profile, window of 9 residues
[Figure: the network drawn above a 20 × 9 sequence profile window]
-
A generic model for membrane proteins (TMHMM)
[Figure: HMM with Begin and End states and three blocks of states: Inner Side, Transmembrane, Outer Side]
-
Sequence-profile-based HMM
Standard HMM input: sequence of characters c_t, e.g. ..A C L P R P E T ...
Profile-based HMM input: sequence of A-dimensional vectors s_t (for proteins A = 20)
Constraints:
$0 \le s_t(n) \le S \quad \forall\, t, n$ (S = 100)
$\sum_{n=1}^{A} s_t(n) = S \quad \forall\, t$
[Figure: the character sequence shown above the corresponding columns of profile values]
Martelli et al., Bioinformatics 18, S46-53, 2002
-
The new algorithms make it possible:
•to feed HMMs with sequence profiles
•to eventually couple NNs and HMMs (Hidden Neural Networks)
Advantages:
•Higher performance than standard HMMs
•Increased discrimination capability of a given class
Martelli et al., Bioinformatics, 2002
Martelli et al., Protein Eng. 2002
-
Prediction of the Topology of α-Transmembrane Proteins
Topography: position of Trans Membrane Helices along the sequence
The prediction accuracy of topography is 92 %
Topology: position of N and C termini with respect to the bilayer
The prediction accuracy of topology is 81 %
[Figure: a helical membrane protein in the bilayer, with the Out and In sides, the N and C termini and positively charged (+) residues marked]
-
Prediction of the Topology of β-Transmembrane Proteins
Topography: position of Transmembrane Strands along the sequence
The prediction accuracy of topography is 73 %
Topology: position of N and C termini with respect to the bilayer
The prediction accuracy of topology is 73 %
[Figure: a β-barrel protein in the bilayer, with the LPS (Out) and Periplasmic (In) sides, the N and C termini and positively charged (+) residues marked]
-
The discriminative capability of the HMM model
[Figure: Percentage (y axis, 0 to 100) plotted against I(s | M) (x axis, 2.75 to 2.95) for three sets of proteins: Outer membrane, Globular and Inner membrane]
I(s | M) = -1/L log P(s | M)
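The score on the x axis is the negative log-probability of the sequence under the model, divided by the sequence length L so that sequences of different lengths are comparable. A minimal sketch of the calculation follows; the log base (natural log here), the example numbers and the idea of a fixed threshold are assumptions, not values from the slide.

```python
def normalized_score(log_prob, length):
    """I(s | M) = -(1/L) * log P(s | M), with log P given as a natural log."""
    if length <= 0:
        raise ValueError("sequence length must be positive")
    return -log_prob / length


def accepted_by_model(score, threshold):
    """Lower I means the model explains the sequence better (threshold is hypothetical)."""
    return score <= threshold


# Hypothetical values: a 300-residue sequence with log P(s | M) = -840.
i_value = normalized_score(-840.0, 300)
is_member = accepted_by_model(i_value, threshold=2.9)
```

In practice log P(s | M) would come from the forward algorithm of the HMM, and a threshold would be chosen from distributions like the ones in the figure.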
-
An application: modeling the 3D structure of eukaryotic barrel proteins
-
3D structure prediction of proteins
Targets: New folds, Existing folds, Membrane proteins
Methods: Ab initio prediction, Threading/fold recognition, Building by homology
[Figure: the methods are arranged along a Homology (%) axis running from 0 to 100]
-
Structural alignment of VDAC with the template

2omf_.seq/ AEIYNKDGNK VDLYGKAVGL HYFSKGNGEN SYGGNGDMTY ARLGFKGETQ
2omf_.str/ CCCCCCCCEE EEEEEEEEEE EEECCCCCCC CCCCCCCCCE EEEEEEEEEE
protx.str/ *******CCC CCCCEEEEEE EEEC****** ********CE EEEEEEEECC
protx.seq/ *******KGY NFGLWKLDLK TKTS****** ********SG IEFNTAGHSN

2omf_.seq/ I*NSDLTGYG QWEYNFQGNN SEGADAQTGN KTRLAFAGLK YADVGSFDYG
2omf_.str/ C*CCCEEEEE EEEEEEECCC CCCCCCCCCC EEEEEEEEEE ECCCEEEEEE
protx.str/ CCCCCEEEEE EEEEEEC*** ********** EEEEEEEEEC CCCCCEEEEE
protx.seq/ QESGKVFGSL ETKYKVK*** ********** DYGLTLTEKW NTDNTLFTEV

2omf_.seq/ RNYGVVYDAL GYTDMLPEFG GDTAYSDDFF VGRVGGVATY RNSNFFGLVD
2omf_.str/ ECCCCCCCCC CCCCCCCCCC CCCCCCCCCC CCCCCCEEEE EECCCCCCCC
protx.str/ EEEECC**** ********** ********** **CCEEEEEE EEECCCCCCC
protx.seq/ AVQDQL**** ********** ********** **LEGLKLSL EGNFAPQSGN

2omf_.seq/ GLNFAVQYLG KNER****** *********D TARRSNGDGV GGSISYEYE*
2omf_.str/ CEEEEEEEEC CCCC****** *********C CCCCCCCCEE EEEEEEEEC*
protx.str/ EEEEEEEEEE EEEECCCCCC CCCCCCCEEE EEEEEEEEEE EEEEEEECCC
protx.seq/ KNGKFKVAYG HENVKADSDV NIDLKGPLIN ASAVLGYQGW LAGYQTAFDT

2omf_.seq/ **GFGIVGAY GAADRTNLQE AQPLGNGKKA EQWATGLKYD ANNIYLAANY
2omf_.str/ **CEEEEEEE EEEECCCCCC CCCCCCCCEE EEEEEEEEEE ECCEEEEEEE
protx.str/ CCEEEEEEEE EEEEEEEEEE EEECCCCCCC EEEEEEEEEE CEEEEEEEEE
protx.seq/ QQSKLTTNNF ALGYTTKDFV LHTAVNDGQE FSGSIFQRTS DKLDVGVQLS

2omf_.seq/ GETRNATPIT NKFTNTSGFA NKTQDVLLVA QYQFDFGLRP SIAYTKSKAK
2omf_.str/ EEEECCCCCC CCCCCCCCCC CEEEEEEEEE EEECCCCEEE EEEEEEEEEE
protx.str/ EEECC***** ********** *CCCEEEEEE EEECCCCEEE EEEEEEC***
protx.seq/ WASGT***** ********** *SNTKFAIGA KYQLDDDARV RAKVNNA***

2omf_.seq/ DVEGIGDVDL VNYFEVGATY YFNKNMSTYV DYIINQIDSD NKLGVGSDDT
2omf_.str/ CCCCCCCEEE EEEEEEEEEE ECCCCEEEEE EEEEECCCCC CCCCCCCCCE
protx.str/ *********E EEEEEEEEEE EC***EEEEE EEEEECCC** *****CCCCE
protx.seq/ *********S QVGLGYQQKL RT***GVTLT LSTLVDGK** *****NFNAG

2omf_.seq/ VAVGIVYQF* ***
2omf_.str/ EEEEEEEEE* ***
protx.str/ EEEEEEEEEE EC*
protx.seq/ GHKIGVGLEL EA*
-
A low resolution 3D Model of VDAC (the sequence from Neurospora crassa)
[Figure: the 3D model; label: Ca]
-
A low resolution 3D model of VDAC: location of mutated residues
Casadio et al., FEBS Lett 520:1-7 (2002)
-
Predictors of membrane protein structures can be used to filter genomes and find new membrane proteins without sequence homologues
-
FISHING NEW OUTER MEMBRANE PROTEINS IN GRAM-NEGATIVE BACTERIA
-
Proteins have intrinsic signals that govern their transport and localization in the cell: a hydrophobic secretion marker (or signal peptide)
Signal peptides in protein sequences:
MRAKLLGIVLTTPIAISSFASTETLSFTPDNINADISLGTLSGKTKERVYLAEEGGRKVSQLDWKFNNAAIIKGAINWDLMPQISIGAAGWTTLGSRGGNMVDQDWMDSSNPGTWTDESRHPDTQLNYANEFDLNIKGWLLNEPNYRLGLMAGYQESRYSFTARGGSYIYSSEEGFRDDIGSFPNGER
AIGYKQRFKMPYIGLTGSYRYEDFELGGTFKYSGWVESSDNDEHYDPGKRITYRSKVKDQNYYSVAVNAGYYVTPNAKVYVEGAWNRVTNKKGNTSLYDHNNNTSDYSKNGAGIENYNFITTAGLKYTF
Sequences of outer membrane proteins have signal peptides: the secretion marker is also a marker of outer membrane proteins
-
Signal Peptide prediction
MKLLQRGVALALLTTFTLASETALAYEQDKTYKITVLHTNDHHGHF
[Figure: the sequence is divided at the Cleavage site into the Signal Peptide and the Mature protein]
-
2 Neural Networks
MKLLQRGVALALLTTFTLASETALAYEQDKTYKITVLHTNDHHGHF
SignalNet: predicts if a given residue position belongs to the Signal Peptide
CleavageNet: predicts if a given residue position is the cleavage site
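Both networks make one prediction per residue position from a window of neighbouring residues. The sketch below extracts such windows from the example sequence. The window notation of the accuracy tables (for instance 15-1-15) is read here as "residues before, central residue, residues after"; that reading, the padding symbol and the function names are assumptions of the example.

```python
PAD = "-"  # assumed padding symbol for positions beyond the sequence ends


def window(sequence, center, left, right):
    """Return the residues from center-left to center+right, padded at the ends."""
    out = []
    for i in range(center - left, center + right + 1):
        out.append(sequence[i] if 0 <= i < len(sequence) else PAD)
    return "".join(out)


def all_windows(sequence, left, right):
    """One window per residue position, as input for a per-residue predictor."""
    return [window(sequence, i, left, right) for i in range(len(sequence))]


seq = "MKLLQRGVALALLTTFTLASETALAYEQDKTYKITVLHTNDHHGHF"
signal_windows = all_windows(seq, 15, 15)    # symmetric window, 31 residues
cleavage_windows = all_windows(seq, 15, 2)   # asymmetric window, 18 residues
```

An asymmetric window fits the cleavage problem if most of the informative residues lie on the signal-peptide side of the site, which is consistent with the window sizes listed for CleavageNet.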
-
SignalNet Accuracy
Organism        Window    C     Q2
Eukaryotes      15-1-15   0.83  0.95
Gram positive   15-1-15   0.79  0.92
Gram negative   11-1-11   0.78  0.92
-
CleavageNet Accuracy
Organism        Window    C     Q2
Eukaryotes      15-1-2    0.61  0.97
Gram positive   20-1-3    0.56  0.96
Gram negative   11-1-2    0.62  0.96
-
Comparison with SignalP
Organism                 SignalP   SPEP
Eukaryotes (+)           0.99      0.97
Eukaryotes (-)           0.85      0.94
Prokaryotes (+)          0.99      0.97
Prokaryotes (-)          0.93      0.96
Escherichia coli (+/-)   0.95      0.96
-
Performance of SignalNN on 2160 annotated proteins
                    Prediction
Annotation          With signal   Without signal   Total
With signal         205           45               250
Without signal      55            1855             1910
Total               260           1900             2160
Q2 = 96 %
Qsignal = 82 %
Qnon-signal = 97 %
Psignal = 78 %
Pnon-signal = 98 %
[Legend: Correct predictions (205 and 1855); Wrong predictions (45 and 55)]
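The indices on this slide follow from the four counts of the table. The sketch below recomputes them under the usual definitions (Q as the fraction of annotated cases recovered, P as the fraction of predictions that are correct); these definitions, and the reading of "C" in the accuracy tables as the Matthews correlation coefficient, are assumptions. Values recomputed this way can differ from the percentages printed on the slide by about one point because of rounding.

```python
import math

# Counts from the slide: rows are annotations, columns are predictions.
TP = 205    # with signal, predicted with signal
FN = 45     # with signal, predicted without signal
FP = 55     # without signal, predicted with signal
TN = 1855   # without signal, predicted without signal

total = TP + FN + FP + TN

q2 = (TP + TN) / total                 # overall accuracy
q_signal = TP / (TP + FN)              # coverage of the signal class
q_non_signal = TN / (TN + FP)          # coverage of the non-signal class
p_signal = TP / (TP + FP)              # reliability of "signal" predictions
p_non_signal = TN / (TN + FN)          # reliability of "non-signal" predictions

# Matthews correlation coefficient, assumed to be the "C" of the accuracy tables.
mcc = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)
)

for name, value in [("Q2", q2), ("Qsignal", q_signal),
                    ("Qnon-signal", q_non_signal),
                    ("Psignal", p_signal), ("Pnon-signal", p_non_signal)]:
    print(f"{name} = {100 * value:.1f} %")
print(f"C = {mcc:.2f}")
```

Reporting Q and P per class matters here because the classes are unbalanced: only 250 of the 2160 proteins carry a signal peptide, so Q2 alone is dominated by the non-signal class.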
-
Predictors of Membrane Topography: Rate of false positives
The predictors are tested on 809 globular proteins with sequence identity 25 %:
0.5 % have at least 1 α-TM helix predicted.
5.6 % have at least 2 β-TM strands predicted.
-
HUNTER
[Figure: flow chart. The PROTEOME passes through a Signal peptide filter and through all-β TM and all-α TM filters; the resulting classes are all-β TM, all-α TM and Globular]
Predicting globular, inner and outer membrane proteins in genomes of Gram-negative bacteria with Hunter
Organism                       Outer membrane   Inner membrane   Globular        Total
Escherichia coli K12           65 (1.6%)        907 (21.7%)      3201 (76.7%)    4173
  New*                         18               136              1099            1253
Escherichia coli O157:H7       78 (1.5%)        1034 (19.3%)     4249 (79.2%)    5361
  New                          10               327              1564            1901
Chlamydia pneumoniae CWL029    12 (1.1%)        290 (27.6%)      750 (71.3%)     1052
  New                          2                181              236             419
Salmonella typhimurium LT2     70 (1.6%)        1002 (22.5%)     3379 (75.9%)    4451
  New                          0                2                21              23
Neisseria meningitidis MC58    34 (1.7%)        372 (18.4%)      1619 (80.0%)    2025
  New                          6                176              662             844
Helicobacter pylori 26695      36 (2.3%)        352 (22.5%)      1178 (75.2%)    1566
  New                          10               141              445             596
Haemophilus influenzae Rd      23 (1.3%)        348 (20.4%)      1338 (78.3%)    1709
  New                          5                121              430             556
Thermotoga maritima            18 (1.0%)        370 (20.0%)      1458 (79.0%)    1846
  New                          11               203              559             773
Pseudomonas aeruginosa         131 (2.4%)       1292 (23.2%)     4142 (74.4%)    5565
  New                          62               616              1867            2545
* the number of new proteins predicted in the class with Hunter, out of the non-annotated region