prediction of structural and functional features in proteins startin … neurali... · 2012. 9....

Prediction of structural and functional features in proteins starting from the residue sequence: INTRODUCTION TO NEURAL NETWORKS


Post on 17-Feb-2021


  • Prediction of structural and functional features in proteins starting from the residue sequence

    INTRODUCTION TO NEURAL NETWORKS

  • MAPPING PROBLEMS: Secondary structure

    Covalent structure: TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN

    Secondary structure: EEEE..HHHHHHHHHHHH....HHHHHHHH.EEEE...........

    3D structure: [figure]

  • MAPPING PROBLEMS: Topology of transmembrane proteins

    Topography: position of the transmembrane segments along the sequence

    ALALMLCMLTYRHKELKLKLKK

    Outer membrane proteins: β-barrel, e.g. porin (Rhodobacter capsulatus)

    Inner membrane proteins: α-helices, e.g. bacteriorhodopsin (Halobacterium salinarum)

  • First generation methods: single residue statistics

    Propensity scales

    For each residue:

    • The association between each residue and the different features is statistically evaluated

    • Physical and chemical features of residues

    A propensity value for any structure can be associated with any residue

    HOW?

  • Secondary structure: Chou-Fasman propensity scale

    Given a set of known structures, we can count how many times a residue is associated with a structure.

    Example:
    ALAKSLAKPSDTLAKSDFREKWEWLKLLKALACCKLSAAL
    hhhhhhhhccccccccccccchhhhhhhhhhhhhhhhhhh

    N(A,h) = 7, N(A,c) = 1, N = 40

    P(A,h) = 7/40, P(A,c) = 1/40

    Is that enough for estimating a propensity?

  • Secondary structure: Chou-Fasman propensity scale

    Counting is not enough: we also need to estimate how far the residue-to-structure association departs from independence.

    In the same example: P(h) = 27/40, P(c) = 13/40

  • Secondary structure: Chou-Fasman propensity scale

    With N(A,h) = 7, N(A,c) = 1, N = 40: P(A,h) = 7/40, P(A,c) = 1/40

    P(h) = 27/40, P(c) = 13/40

    If the structure is independent of the residue: P(A,h) = P(A)P(h)

    The ratio P(A,h) / (P(A)P(h)) is the propensity
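    The counting and the propensity ratio above can be sketched in a few lines of Python. This is only an illustration on the 40-residue example from the slides; the function name `propensities` is introduced here and is not part of the original material.

```python
from collections import Counter

def propensities(sequence, structure):
    """Return {(residue, state): P(r,s) / (P(r) * P(s))} for two aligned strings."""
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    n = len(sequence)
    pair_counts = Counter(zip(sequence, structure))   # N(r, s)
    residue_counts = Counter(sequence)                # N(r)
    state_counts = Counter(structure)                 # N(s)
    return {
        (r, s): (count / n) / ((residue_counts[r] / n) * (state_counts[s] / n))
        for (r, s), count in pair_counts.items()
    }

seq = "ALAKSLAKPSDTLAKSDFREKWEWLKLLKALACCKLSAAL"
ss = "hhhhhhhhccccccccccccchhhhhhhhhhhhhhhhhhh"
prop = propensities(seq, ss)
print(round(prop[("A", "h")], 3))  # (7/40) / ((8/40) * (27/40)) = 1.296
print(round(prop[("A", "c")], 3))  # (1/40) / ((8/40) * (13/40)) = 0.385
```

    On such a small sample the values are only indicative, which is why a large set of examples is needed.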

  • Secondary structure: Chou-Fasman propensity scale

    Given a LARGE set of examples, a propensity value can be computed for each residue and each structure type

    Name            P(H)   P(E)
    Alanine         1.42   0.83
    Arginine        0.98   0.93
    Aspartic Acid   1.01   0.54
    Asparagine      0.67   0.89
    Cysteine        0.70   1.19
    Glutamic Acid   1.51   0.37
    Glutamine       1.11   1.10
    Glycine         0.57   0.75
    Histidine       1.00   0.87
    Isoleucine      1.08   1.60
    Leucine         1.21   1.30
    Lysine          1.14   0.74
    Methionine      1.45   1.05
    Phenylalanine   1.13   1.38
    Proline         0.57   0.55
    Serine          0.77   0.75
    Threonine       0.83   1.19
    Tryptophan      1.08   1.37
    Tyrosine        0.69   1.47
    Valine          1.06   1.70

  • Secondary structure: Chou-Fasman propensity scale

    Given a new sequence, a secondary structure prediction can be obtained by plotting the propensity values for each structure, residue by residue (values × 100):

    Residue   T    S    P    T    A    E    L    M    R    S    T    G
    P(H)     69   77   57   69  142  151  121  145   98   77   69   57
    P(E)    147   75   55  147   83   37  130  105   93   75  147   75

    Considering three secondary structures (H, E, C), the overall accuracy, evaluated on an uncorrelated set of sequences with known structure, is very low: Q3 = 50-60%

  • Secondary structure: Chou-Fasman propensity scale

    http://www.expasy.ch/cgi-bin/protscale.pl

  • Transmembrane alpha-helices: Kyte-Doolittle scale

    It is computed taking into consideration the octanol-water partition coefficient, combined with the propensity of the residues to be found in known transmembrane helices

    Ala:  1.800   Arg: -4.500   Asn: -3.500   Asp: -3.500
    Cys:  2.500   Gln: -3.500   Glu: -3.500   Gly: -0.400
    His: -3.200   Ile:  4.500   Leu:  3.800   Lys: -3.900
    Met:  1.900   Phe:  2.800   Pro: -1.600   Ser: -0.800
    Thr: -0.700   Trp: -0.900   Tyr: -1.300   Val:  4.200
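    A scale like this is typically applied by averaging the values over a sliding window along the sequence. The sketch below assumes a plain moving average; the window length of 19 and the function name `hydropathy_profile` are assumptions made here, not values given in the slides.

```python
# Kyte-Doolittle values from the table above, keyed by one-letter code.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(sequence, window=19):
    """Mean hydropathy of every full window along the sequence (assumed window length)."""
    if window > len(sequence):
        return []
    scores = [KD[residue] for residue in sequence]
    total = sum(scores[:window])
    profile = [total / window]
    for i in range(window, len(scores)):
        total += scores[i] - scores[i - window]   # slide the window by one residue
        profile.append(total / window)
    return profile

# Short example sequence from the topology slide; a short window is used only for display.
profile = hydropathy_profile("ALALMLCMLTYRHKELKLKLKK", window=7)
print([round(value, 2) for value in profile])
```

    Stretches where the averaged value stays high are candidates for transmembrane helices; the threshold to use is not specified in the slides.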

  • Second generation methods: GOR

    The structure of a residue in a protein strongly depends on the sequence context

    It is possible to estimate the influence of a residue in determining the structure of a residue close to it along the sequence. Usually windows from -8/+8 to -13/+13 are considered.

    Coefficients P(A,s,i) estimate the contribution of the residue A in determining the structure s for a residue that is i positions apart along the sequence

  • Secondary structure: GOR method

    Q3 = 65% (considering three secondary structures (H, E, C), and evaluating the overall accuracy on an uncorrelated set of sequences with known structure)

    The contribution of each position in the window is independent of the others: no correlation among the positions in the window is taken into account.

  • A more efficient method: Neural Networks

    An alternative computing algorithm, built on analogies with computation in the nervous system.

    1) The nervous system is made up of elementary computing units: neurons
    2) The electric signal flows in a determined direction (dendrites -> axon) (Principle of dynamic polarization)
    3) There is no cytoplasmic continuity among neurons. Each neuron communicates specifically with some neighbouring neurons by means of synapses (Principle of connective specificity)

  • Tools out of machine learning approaches

    Neural Networks can learn the mapping from sequence to secondary structure

    [Figure: training on a data base subset with known mapping yields general rules, which are then applied to a new sequence to give a prediction]

    Data base subset (known mapping):
    TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN
    EEEE..HHHHHHHHHHHH....HHHHHHHH.EEEE

  • Neural network for secondary structure prediction

    [Figure: feed-forward network with input, hidden and output layers; the only output label recoverable here is C]

    Sequence: M P I L K Q K P I H Y H P N H G E A K G

    Usually: input window of 17-23 residues, 4-15 hidden neurons

    Input encoding of the window K P I H Y H P N H (one column per window position, one row per residue type):

       K P I H Y H P N H
    A  0 0 0 0 0 0 0 0 0
    C  0 0 0 0 0 0 0 0 0
    D  0 0 0 0 0 0 0 0 0
    E  0 0 0 0 0 0 0 0 0
    F  0 0 0 0 0 0 0 0 0
    G  0 0 0 0 0 0 0 0 0
    H  0 0 0 1 0 1 0 0 1
    I  0 0 1 0 0 0 0 0 0
    K  1 0 0 0 0 0 0 0 0
    L  0 0 0 0 0 0 0 0 0
    M  0 0 0 0 0 0 0 0 0
    N  0 0 0 0 0 0 0 1 0
    P  0 1 0 0 0 0 1 0 0
    Q  0 0 0 0 0 0 0 0 0
    R  0 0 0 0 0 0 0 0 0
    S  0 0 0 0 0 0 0 0 0
    T  0 0 0 0 0 0 0 0 0
    V  0 0 0 0 0 0 0 0 0
    W  0 0 0 0 0 0 0 0 0
    Y  0 0 0 0 1 0 0 0 0
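    The binary input matrix on this slide is a one-hot encoding of a sequence window. A minimal sketch follows, assuming that positions falling outside the chain are encoded as all zeros (the slides do not say how chain ends are handled); `one_hot_window` is a name introduced here for illustration.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # same residue order as the rows of the slide matrix

def one_hot_window(sequence, center, half_width):
    """One row per window position, 20 columns; out-of-chain positions stay all-zero."""
    rows = []
    for pos in range(center - half_width, center + half_width + 1):
        row = [0] * len(ALPHABET)
        if 0 <= pos < len(sequence):
            row[ALPHABET.index(sequence[pos])] = 1
        rows.append(row)
    return rows

seq = "MPILKQKPIHYHPNHGEAKG"
# Nine-residue window centred on the Y at index 10: K P I H Y H P N H
window = one_hot_window(seq, center=10, half_width=4)
for residue, row in zip(seq[6:15], window):
    print(residue, "".join(map(str, row)))
```

    Here each row is a window position, so the result is the transpose of the matrix drawn on the slide. Flattening it gives the input vector of the network (9 × 20 values for this window).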

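  • [Figure: a window sliding along the sequence feeds the network (input alphabet ACDEFGHIKLMNPQRSTVWY plus a spacer symbol; outputs H, E, L). Residues shown with their structure labels: D (L), R (E), Q (E), G (E), F (E), V (E), P (E), P (E), A (H), A (H), A (H), Y (H), V (E), K (E), K (E)]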

  • Third generation methods: evolutionary information

     1  Y K D Y H S - D K K K G E L - -
     2  Y R D Y Q T - D Q K K G D L - -
     3  Y R D Y Q S - D H K K G E L - -
     4  Y R D Y V S - D H K K G E L - -
     5  Y R D Y Q F - D Q K K G S L - -
     6  Y K D Y N T - H Q K K N E S - -
     7  Y R D Y Q T - D H K K A D L - -
     8  G Y G F G - - L I K N T E T T K
     9  T K G Y G F G L I K N T E T T K
    10  T K G Y G F G L I K N T E T T K

    Sequence profile (rows: residue type; columns: alignment position; values in %):

    Position   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16
    A          0   0   0   0   0   0   0   0   0   0   0  10   0   0   0   0
    C          0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    D          0   0  70   0   0   0   0  60   0   0   0   0  20   0   0   0
    E          0   0   0   0   0   0   0   0   0   0   0   0  70   0   0   0
    F          0   0   0  10   0  33   0   0   0   0   0   0   0   0   0   0
    G         10   0  30   0  30   0 100   0   0   0   0  50   0   0   0   0
    H          0   0   0   0  10   0   0  10  30   0   0   0   0   0   0   0
    K          0  40   0   0   0   0   0   0  10 100  70   0   0   0   0 100
    I          0   0   0   0   0   0   0   0  30   0   0   0   0   0   0   0
    L          0   0   0   0   0   0   0  30   0   0   0   0   0   0   0   0
    M          0   0   0   0   0   0   0   0   0   0   0   0   0  60   0   0
    N          0   0   0   0  10   0   0   0   0   0  30  10   0   0   0   0
    P          0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    Q          0   0   0   0  40   0   0   0  30   0   0   0   0   0   0   0
    R          0  50   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    S          0   0   0   0   0  33   0   0   0   0   0   0  10  10   0   0
    T         20   0   0   0   0  33   0   0   0   0   0  30   0  30 100   0
    V          0   0   0   0  10   0   0   0   0   0   0   0   0   0   0   0
    W          0  10   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    Y         70   0   0  90   0   0   0   0   0   0   0   0   0   0   0   0
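    A sequence profile of this kind can be computed from the alignment column by column. The sketch below assumes that gaps are ignored and that frequencies are normalised over the non-gap residues of each column; this is inferred from the 33% entries in the slide and is not stated explicitly. The function name `sequence_profile` is introduced here.

```python
from collections import Counter

def sequence_profile(alignment, gap="-"):
    """Per-column residue frequencies in percent, normalised over non-gap residues."""
    profile = []
    for column in zip(*alignment):
        residues = [r for r in column if r != gap]
        counts = Counter(residues)
        total = len(residues)
        profile.append({r: round(100 * c / total) for r, c in counts.items()} if total else {})
    return profile

msa = [
    "YKDYHS-DKKKGEL--",
    "YRDYQT-DQKKGDL--",
    "YRDYQS-DHKKGEL--",
    "YRDYVS-DHKKGEL--",
    "YRDYQF-DQKKGSL--",
    "YKDYNT-HQKKNES--",
    "YRDYQT-DHKKADL--",
    "GYGF--GLIKNTETTK".replace("F--G", "FG--"),  # row 8 of the slide: G Y G F G - - L ...
    "TKGYGFGLIKNTETTK",
    "TKGYGFGLIKNTETTK",
]
profile = sequence_profile(msa)
print(profile[0])  # column 1: Y 70, G 10, T 20
print(profile[5])  # column 6: S, T and F at 33 each
```

    Each column of the profile then replaces the one-hot vector as the network input for that position.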

  • The Network Architecture for Secondary Structure Prediction

    The First Network (Sequence to Structure)

    Output: H E C

    CCHHEHHHHCHHCCEECCEEEEHHHCC

    Input profile (values in %):

    SeqNo No    V   L   I   M   F   W   Y   G   A   P   S   T   C   H   R   K   Q   E   N   D
     1     1   80   0  20   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
     2     2    0   0   0   0   0   0   0   0   0  20   0   0   0   0   0   0   0   0   0  80
     3     3   50   0   0   0   0   0   0   0  33   0   0   0   0   0   0   0   0  17   0   0
     4     4    0   0   0   0   0   0   0   0  13  63  13   0   0   0   0   0   0  13   0   0
     5     5   13   0   0   0   0   0   0  13  75   0   0   0   0   0   0   0   0   0   0   0
     6     6    0   0   0  13   0   0   0   0   0  13   0  13   0   0   0   0   0   0   0  63
     7     7    0   0   0  38   0   0   0  38   0   0   0   0   0   0   0  25   0   0   0   0
     8     8   25  13   0   0   0   0   0   0  50   0  13   0   0   0   0   0   0   0   0   0
     9     9    0  13  13   0   0   0   0   0   0  25   0   0   0   0   0  50   0   0   0   0
    10    10    0   0  25  13   0   0   0   0  13  13   0   0   0   0   0  38   0   0   0   0
    11    11    0   0   0   0   0   0   0   0  25   0   0   0   0   0   0  13  13   0   0  50
    12    12    0   0   0   0  43   0   0  29   0  29   0   0   0   0   0   0   0   0   0   0
    13    13    0  14  29   0   0   0   0   0  29   0   0   0   0   0   0   0   0  14   0  14
    14    14    0   0   0   0   0   0   0  43  29   0   0   0   0   0   0  29   0   0   0   0

  • The Network Architecture for Secondary Structure Prediction

    The Second Network (Structure to Structure)

    Output: H E C

    CCHHEHHHHCHHCCEECCEEEEHHHCC

    [Table: the same 14-position input profile (SeqNo, No, V L I M F W Y G A P S T C H R K Q E N D) shown on the slide for the first network]

  • The Performance on the Task of Secondary Structure Prediction

    The cross validation procedure

    [Figure: the protein set is split into Testing set 1 and Training set 1]
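    The splitting step of a cross-validation procedure can be sketched as below. This is a generic k-fold split and only an assumption about how the sets were built: the number of folds is not given in the slides, and in practice similar proteins must not be shared between a training set and its testing set. `k_fold_splits` is a hypothetical helper name.

```python
def k_fold_splits(items, k):
    """Yield (training, testing) lists; every item is tested exactly once."""
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        testing = folds[i]
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield training, testing

proteins = [f"protein_{n}" for n in range(10)]  # placeholder identifiers
splits = list(k_fold_splits(proteins, k=5))
for number, (training, testing) in enumerate(splits, start=1):
    print(f"Testing set {number}: {testing}")
```

    The reported accuracy is then the average over the testing sets, each scored by a network trained only on the matching training set.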

  • Efficiency of the Neural Network-Based Predictors on the 822 Proteins of the Testing Set

    Input                           Q3 (%)   SOV    Q[H]   Q[E]   Q[C]   P[H]   P[E]   P[C]   C[H]   C[E]   C[C]
    Single sequence                 66.3     0.62   0.69   0.61   0.66   0.70   0.54   0.71   0.54   0.44   0.45
    Multiple sequence (MaxHom)      72.4     0.69   0.75   0.65   0.75   0.77   0.64   0.73   0.64   0.54   0.53
    Multiple sequence (PSI-BLAST)   73.4     0.70   0.75   0.70   0.73   0.80   0.63   0.75   0.67   0.56   0.53

    Combining different networks: Q3 = 76-78%
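    The per-residue indices in this table can be computed from an observed and a predicted structure string. The sketch below assumes the usual definitions, which the slides do not spell out: Q3 is the fraction of correctly predicted residues, Q[s] is the fraction of residues observed in state s that are predicted in s, and P[s] is the fraction of residues predicted in s that are observed in s. SOV and the correlation C[s] are not covered.

```python
def per_residue_scores(observed, predicted, states="HEC"):
    """Q3 plus per-state coverage Q[s] and precision P[s] (assumed definitions)."""
    if len(observed) != len(predicted):
        raise ValueError("strings must have the same length")
    pairs = list(zip(observed, predicted))
    scores = {"Q3": sum(o == p for o, p in pairs) / len(pairs)}
    for s in states:
        correct = sum(o == s and p == s for o, p in pairs)
        n_observed = observed.count(s)
        n_predicted = predicted.count(s)
        scores[f"Q[{s}]"] = correct / n_observed if n_observed else float("nan")
        scores[f"P[{s}]"] = correct / n_predicted if n_predicted else float("nan")
    return scores

# Toy strings, made up for illustration only.
scores = per_residue_scores("HHHHEECC", "HHHCEECH")
print(scores)
```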

  • Secondary Structure Prediction

    From sequence:
    TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN

    To secondary structure:
    EEEE..HHHHHHHHHHHH....HHHHHHHH.EEEE...........

    And to the reliability of the prediction:
    7997688899999988776886778999887679956889999999

  • SERVERS

    PredictProtein, Burkhard Rost (Columbia Univ.): http://cubic.bioc.columbia.edu/predictprotein/

    PsiPRED, David Jones (UCL): http://bioinf.cs.ucl.ac.uk/psipred/

    JPred, Geoff Barton (Dundee Univ.)

    SecPRED: http://www.biocomp.unibo.it

  • Chameleon sequences

    QEALEIA

    Translation Initiation Factor 3 (Bacillus stearothermophilus), 1TIF:
    ……GIKSKQEALEIAARRN……

    Transcription Factor 1 (Bacteriophage Spo1), 1WTUA:
    ……FNPQTQEALEIAPSVGV……

  • We extract, from a set of 822 non-homologous proteins (174,192 residues):

    2,452 5-mer chameleons
    107 6-mer chameleons
    16 7-mer chameleons
    1 8-mer chameleon

    2,576 couples in total

    The total number of residues in chameleons is 26,044 out of 755 protein chains (~15%)
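    A simple way to look for such couples is to index every k-mer together with the secondary structure it adopts. The sketch below treats any k-mer seen with two different structure strings as a chameleon; the exact criterion used for the counts above is not given in the slides, so this definition is an assumption. The structure labels flanking QEALEIA in the toy data are made up for illustration.

```python
from collections import defaultdict

def find_chameleons(chains, k):
    """Map each k-mer to its observed structure strings; keep k-mers with more than one."""
    seen = defaultdict(set)
    for sequence, structure in chains:
        for i in range(len(sequence) - k + 1):
            seen[sequence[i:i + k]].add(structure[i:i + k])
    return {kmer: structs for kmer, structs in seen.items() if len(structs) > 1}

chains = [
    # Helical QEALEIA; flanking labels are placeholders.
    ("GIKSKQEALEIAARRN", "C" * 5 + "H" * 7 + "C" * 4),
    # Coil QEALEIA; flanking labels are placeholders.
    ("FNPQTQEALEIAPSVGV", "C" * 17),
]
chameleons = find_chameleons(chains, k=7)
print(chameleons)
```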

  • Prediction of the Secondary Structure of Chameleon sequences with Neural Networks

    QEALEIA is HHHHHHH in NGDQLGIKSKQEALEIAARRNLDLVLVAP

    QEALEIA is CCCCCCC in ARKGFNPQTQEALEIAPSVGVSVKPG

  • The Prediction of Chameleons with Neural Networks

    Method              Performance on the protein data set   Performance on chameleon sequences
    NN with MSA input   73.4%                                 75.1%
    NN with SS input    66.3%                                 58.9%
    GOR IV              64.4%                                 55.2%

  • Other neural network-based predictors

    •Secondary structure

    •Topology of transmembrane proteins

    •Cysteine bonding state

    •Contact maps of proteins

    •Interaction sites on protein surface

  • Prediction of the cysteine bonding state

    Tryparedoxin-I from Crithidia fasciculata (1QK8)

    MSGLDKYLPGIEKLRRGDGEVEVKSLAGKLVFFYFSASWCPPCRGFTPQLIEFYDKFHES
    KNFEVVFCTWDEEEDGFAGYFAKMPWLAVPFAQSEAVQKLSKHFNVESIPTLIGVDADSG
    DVVTTRARATLVKDPEGEQFPWKDAP

    Free cysteines: Cys68

    Disulphide bonded cysteines: Cys40, Cys43

  • A neural network-based method for predicting the disulfide connectivity in proteins

  • The Protein Folding

    TTCCPSIVARSNFNVCRLPGTPEALCATYTGCIIIPGATCPGDYAN

  • The Protein Folding

    RPDFCLEPPYTGPCKARIIRYFYNAKAGLCQTFVYGGCRAKRNNFKSAEDCMRTCGGA

  • Disulfide bonds

    [Figure: two cysteine side chains joined by an S-S bridge]

    2 -SH -> -SS- + 2H+ + 2e-

    S-S distance: 2.2 Å

    Torsion angle C-S-S-C: 90°

    Bond energy: 3 kcal/mol

  • Intra-chain disulfide bonds in proteins

    Of 1259 proteins (a non redundant PDB subset):

    • 23% of the chains have disulfide bonds (SS)

    • SS distribution (between secondary structures):

    %    H    E    C
    H    7    9   14
    E        17   27
    C             26

  • Intra-chain disulfide bonds in proteins

    Distribution of disulfide bonds in the SCOP domains

    • 99% of the disulfide bonds are intra-domain

    • Distribution:

    Type            %
    All-α           13
    All-β           31
    α/β             11
    α+β             13
    Small domains   29
    Others           3

  • Problem no 1:

    Starting from the protein sequence, can we discriminate whether a cysteine residue is disulfide-bonded?

    Prediction of the disulfide-bonding state of cysteines in proteins

  • Perceptron (input: sequence profile)

    Outputs: bonded, non bonded

    NGDQLGIKSKQEALCIAARRNLDLVLVAP

  • Plotting the trained weights

    [Figure: Hinton's plots of the trained weights for the bonding state and the non bonding state outputs; x axis: residue (V L I M F W Y G A P S T C H R K Q E N D 0 & #); y axis: position in the window (-5 to 5)]

  • Is it possible to add a syntax?

    [Figure: state graph with Begin, End and four states numbered 1 to 4, grouped into free states and bonded states]

  • A path

    [Figure: state graph with Begin, End and states 1 to 4, with the path below highlighted step by step]

    Residue   State   Bonding state
    C40       1       F
    C43       2       B
    C68       4       B

    P(seq) = P(1 | Begin) P(C40 | 1) × P(2 | 1) P(C43 | 2) × P(4 | 2) P(C68 | 4) × P(End | 4)

  • 4 possible paths

    [Figure: the state graph (Begin, states 1 to 4, End) drawn four times, once per path]

    Path   C40     C43     C68
    1      1 (F)   2 (B)   4 (B)
    2      2 (B)   3 (F)   4 (B)
    3      1 (F)   1 (F)   1 (F)
    4      2 (B)   4 (B)   1 (F)

    Each entry gives the state and, in parentheses, the bonding state.

  • Hybrid system

    MYSFPNSFRFGWSQAGFQCEMSTPGSEDPNTDWYKWVHDPENMAAGLCSGDLPENGPGYWGNYKTFHDNAQKMCLKIARLNVEWSRIFPNP...

    For the windows W1, W2, W3 along the sequence, the neural network gives P(B|W1), P(F|W1); P(B|W2), P(F|W2); P(B|W3), P(F|W3)

    [Figure: these outputs feed a state graph with Begin, Free Cys, Bonded Cys and End states]

    Viterbi path

    Prediction of bonding state of cysteines

  • Prediction for Tryparedoxin

    [Figure: state graph with Begin, End and states 1 to 4]

    Residue   NN output B   NN output F   NN pred   HMM Viterbi path   HMM pred
    C40       99            1             B         2                  B
    C43       82            18            B         4                  B
    C68       61            39            B         1                  F
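    The effect of the syntax on this example can be reproduced with a small enumeration. The sketch rests on three assumptions: the allowed transitions are inferred from the four paths listed on the earlier slide (with 3 -> 3 and 4 -> 2 added so that longer chains are possible), all allowed transitions are given equal weight, and the best path is found by exhaustive search rather than by the Viterbi algorithm used in the hybrid system.

```python
from itertools import product

# Assumed topology: states 1 and 3 are free (F), states 2 and 4 are bonded (B).
TRANSITIONS = {
    "Begin": {1, 2},
    1: {1, 2, "End"},
    2: {3, 4},
    3: {3, 4},
    4: {1, 2, "End"},
}
STATE_LABEL = {1: "F", 2: "B", 3: "F", 4: "B"}

def allowed_paths(n):
    """All state paths of length n that go from Begin to End through allowed transitions."""
    paths = []
    for path in product((1, 2, 3, 4), repeat=n):
        steps = ("Begin",) + path + ("End",)
        if all(b in TRANSITIONS[a] for a, b in zip(steps, steps[1:])):
            paths.append(path)
    return paths

def best_path(p_bonded):
    """Allowed path with the highest product of NN outputs (equal transition weights)."""
    def score(path):
        value = 1.0
        for state, pb in zip(path, p_bonded):
            value *= pb if STATE_LABEL[state] == "B" else 1.0 - pb
        return value
    return max(allowed_paths(len(p_bonded)), key=score)

paths = allowed_paths(3)
best = best_path([0.99, 0.82, 0.61])  # NN outputs for C40, C43, C68
print(paths)
print(best, "".join(STATE_LABEL[s] for s in best))
```

    With three cysteines only four paths survive, and the best one labels C68 as free even though the network alone calls it bonded: an odd number of bonded cysteines is not allowed by the syntax.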

  • Performance

    Neural Network

    Set   Q2     C      Q(B)   Q(F)   P(B)   P(F)   Q2prot
    WD    80.4   0.56   67.2   87.5   74.3   83.2   56.9
    RD    80.1   0.56   67.2   87.6   75.7   82.2   49.7

    Hybrid system

    Set   Q2     C      Q(B)   Q(F)   P(B)   P(F)   Q2prot
    WD    88.0   0.73   78.1   93.3   86.3   88.8   84.0
    RD    87.4   0.73   78.1   92.8   86.3   88.0   80.2

    B = cysteine bonding state, F = cysteine free state. WD = whole database (969 proteins, 4136 cysteines). RD = reduced database, in which the chains containing only one cysteine are removed (782 proteins, 3949 cysteines).

    Martelli PL, Fariselli P, Malaguti L, Casadio R. Prediction of the disulfide bonding state of cysteines in proteins with hidden neural networks. Protein Eng. 15:951-953 (2002)

  • Problem no 2:

    When the bonding state of cysteines is known, can we predict the connectivity pattern of disulfide bonds?

    Prediction of the connectivity of disulfide bonds in proteins

  • Prediction of disulfide connectivity in proteins: Bovine trypsin inhibitor (6PTI)

    [Figure: 3D structure with the N and C termini and the cysteines labelled, next to the connectivity pattern drawn along the sequence]

    Cysteines along the sequence: 5 14 30 38 51 55

  • Prediction of disulfide connectivity in proteins as a problem of maximum-weight perfect matching

    Representation:

    [Figure: protein sequence from N to C with Cys1, Cys2, Cys3, Cys4 as vertices, joined by edges W12, W13, W14, W23, W24, W34]

    The undirected weighted graph has V = 2B vertices (the number of cysteines) and E = 2B(2B-1)/2 undirected edges (strength of the interaction W)

  • From the Graph Theory:

    • It is not necessary to compute all the possible connectivity patterns ( $\prod_{i \le B} (2i-1)$ )

    • Given a complete graph G = (2B, E), the matching with the maximum weight can be computed in $O(B^3)$ time with the Edmonds-Gabow algorithm*

    * Gabow, H.N. (1975). Technical Report CU-CS-075-75, Dept. of Comp. Sci., Colorado University
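    To make the counting concrete, the sketch below computes the number of connectivity patterns and finds the maximum-weight perfect matching by exhaustive search. It is deliberately not the Edmonds-Gabow algorithm: brute force is only practical for a handful of cysteines and is used here because it is short and easy to check. The weight matrix is invented for illustration.

```python
def n_connectivity_patterns(b):
    """Number of ways to pair 2B cysteines: product of (2i - 1) for i = 1..B."""
    count = 1
    for i in range(1, b + 1):
        count *= 2 * i - 1
    return count

def best_matching(weights):
    """Exhaustive maximum-weight perfect matching on a symmetric weight matrix."""
    def search(remaining):
        if not remaining:
            return 0.0, []
        first, rest = remaining[0], remaining[1:]
        best = (float("-inf"), [])
        for k, partner in enumerate(rest):
            sub_weight, sub_pairs = search(rest[:k] + rest[k + 1:])
            total = weights[first][partner] + sub_weight
            if total > best[0]:
                best = (total, [(first, partner)] + sub_pairs)
        return best
    return search(list(range(len(weights))))

# Hypothetical edge weights W for four cysteines (indices 0 to 3).
W = [
    [0.0, 0.9, 0.2, 0.1],
    [0.9, 0.0, 0.3, 0.4],
    [0.2, 0.3, 0.0, 0.8],
    [0.1, 0.4, 0.8, 0.0],
]
total, pairs = best_matching(W)
print([n_connectivity_patterns(b) for b in (2, 3, 4, 5)])  # 3, 15, 105, 945
print(pairs, round(total, 2))
```

    The reciprocals of these counts (1/3, 1/15, 1/105, 1/945) are the chance levels a random assignment would reach for B = 2 to 5.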

  • How to assign the costs (W) of the edges in the graph

    [Figure: the four-cysteine graph (Cys1 to Cys4, edges W12, W13, W14, W23, W24, W34) drawn along the protein chain from N to C]

  • Assumption: for each cysteine, all its sequence nearest neighbours make contacts

    [Figure: Cys i with its neighbours (Ni) and Cys j with its neighbours (Nj) along the chain from N to C; all possible interactions using 1 nearest neighbour]

  • Frequency distribution of disulfide bonds with respect to sequence separation (726 proteins)

    [Figure: histogram of frequency (%), from 0 to 16, versus sequence separation, from 0 to 450]

  • Neural Networks for predicting the edge values

    Output (1 node): disulfide pair propensity (output = wij)

    Hidden nodes (6 nodes)

    Input (212 nodes): each pair in the neighbours of 4 residues + sequence separation + no. of SS bonds (210 + 2 input nodes)

  • Accuracy (Qp) of EG vs NN

    Chains   B   Random   EG     NN
    158      2   0.333    0.46   0.68
    153      3   0.067    0.17   0.21
    103      4   0.009    0.11   0.20
    44       5   0.001    0.00   0.02

  • The state of the art:

    • Prediction of bonding states is quite satisfactory

    • Prediction of connectivity needs to be improved

  • Prediction of Foldons

    Piero Fariselli

  • The Folding Problem as a Mapping Problem

    Covalent structure: TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN

    Secondary structure: EEEE..HHHHHHHHHHHH....HHHHHHHH.EEEE...........

    3D structure: [figure]

  • From the PDB data base we can collect some 1500 chains of known structure, from which to derive non-redundant information relating sequence to:

    • secondary structure

    • structural and functional motifs

    • 3D structure

  • Evolutionary information

    • Multiple Sequence Alignment (MSA) of similar sequences (columns: sequence position)

     1  Y K D Y H S - D K K K G E L - -
     2  Y R D Y Q T - D Q K K G D L - -
     3  Y R D Y Q S - D H K K G E L - -
     4  Y R D Y V S - D H K K G E L - -
     5  Y R D Y Q F - D Q K K G S L - -
     6  Y K D Y N T - H Q K K N E S - -
     7  Y R D Y Q T - D H K K A D L - -
     8  G Y G F G - - L I K N T E T T K
     9  T K G Y G F G L I K N T E T T K
    10  T K G Y G F G L I K N T E T T K

    • Sequence profile: for each position a 20-valued vector contains the amino acid composition of the aligned sequences.

    [Table: the sequence profile of this alignment (rows: residue types A to Y; columns: alignment positions), as on the "Third generation methods: evolutionary information" slide]

  • Prediction of Initiation Sites of Protein Folding

    The Folding Process

    The Early Stages of Folding: Initiation Sites

    [Figure: from the unfolded chain to the folded protein]

  • Frustration in proteins

    • The simultaneous minimisation of all the interaction energies is impossible

  • The network architecture

    [Figure: network with input, hidden and output layers; the two outputs are α and non-α]

    Input window: ..ALS.......QGFLLIARQPPFTYFTV......HW..

  • The prediction efficiency of the network

    Q2 = 0.85 Q(H) = 0.67 Q(nonH) = 0.93 Sovpred = 0.85

    C = 0.63 Pc(H) = 0.80 Pc(nonH) = 0.86 Sovobs = 0.76

  • Theoretical background

    The conformation of residue R depends both on local (window W) and non local (context C) interactions.

    [Figure: residue R inside window W, surrounded by context C; the window feeds a neural network with outputs $O_\alpha$ and $O_{non\text{-}\alpha}$]

    The convergence theorem ensures that:

    $O_i = \mathrm{Probability}(\mathrm{Structure}_R = i \mid W)$

    If, for any i, $O_i \approx 1$, then the structure of residue R depends mainly on W and only slightly on C

  • Anfinsen's hypothesis (R = residue, W = window, C = context):

    $P(i \mid W, C) = \delta(i, i_{nat}(W, C))$

    • Averaging over all the contexts (performed by the NN):

    $P(i \mid W) = \sum_{C} P(i \mid W, C)\, P(C)$

    • When the pattern is self-stabilising (W dependent):

    $P(i \mid W, C) = P(i \mid W)$

    • Then Anfinsen's hypothesis can be cast in a local form:

    $P(i \mid W) = \sum_{C} P(i \mid W, C)\, P(C) = \delta(i, i_{nat}(W))$

  • Relationship between the reliability index and the Shannon entropy

    [Figure: reliability index, from 0 to 1, versus entropy (S5), from 0 to 0.7]

  • INPUT

    MAS..... QLMLKDFLNRTPL.........GHI

    [Figure: the input window feeds the network, whose outputs are $O_\alpha$ and $O_{non\text{-}\alpha}$]

    $S = -\sum_i O_i \log O_i$
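    The entropy of the network outputs can be computed as below. Two choices in this sketch are assumptions rather than details taken from the slides: the outputs are rescaled to sum to 1 before the entropy is taken, and the smoothed value S5 is read as a moving average over 5 residues. Natural logarithms are used, which gives a maximum of ln 2 (about 0.69) for two outputs.

```python
import math

def shannon_entropy(outputs):
    """S = -sum_i O_i ln O_i after rescaling the outputs to sum to 1 (assumption)."""
    total = sum(outputs)
    probabilities = [o / total for o in outputs]
    return -sum(p * math.log(p) for p in probabilities if p > 0)

def smooth(values, width=5):
    """Moving average over `width` residues, with shorter windows at the chain ends."""
    half = width // 2
    return [
        sum(values[max(0, i - half): i + half + 1]) / len(values[max(0, i - half): i + half + 1])
        for i in range(len(values))
    ]

# Hypothetical (alpha, non-alpha) outputs for six consecutive residues.
outputs = [(0.98, 0.02), (0.95, 0.05), (0.90, 0.10), (0.60, 0.40), (0.50, 0.50), (0.20, 0.80)]
entropies = [shannon_entropy(o) for o in outputs]
print([round(s, 3) for s in smooth(entropies)])
```

    Low entropy means the network is confident about the residue given the window alone, which is the signature used to pick minimally frustrated segments.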

  • Protein segments correctly predicted in α-helical structure

    [Figure: 3D bar plot of NC / NT (%) as a function of entropy and segment length]

    Entropy = Shannon entropy in (ln 2)/10 units ( $S = -\sum_i o_i \ln o_i$ )
    NC = number of protein segments correctly predicted in α-helix
    NT = total number of protein segments predicted in α-helix

  • Profile of the smoothed entropy (S5) for the hen egg lysozyme (132L)

    [Figure: S5, from 0 to 0.7, along the protein chain (residues 1 to 121); legend: entropy, predicted helices, extracted fragments]

  • Hen egg lysozyme (132L)

    [Figure: 3D structure with the N-terminus and C-terminus labelled]

  • Frequency distribution of predicted helical segments as a function of their entropy value

    [Figure: frequency, from -0.1 to 0.2, versus entropy (S5), from 0 to 0.7; series: correct, wrong, differences; a threshold value is marked]

  • An example of the data base of minimally frustrated protein fragments

    http://www.biocomp.unibo.it/DB/

    CODE    ENTROPY   POSITIONS   SEQUENCE          DSSP SECONDARY STRUCTURE
    1msk_   0.002     192-206     ADRLAEAFAEYLHER   HHHHHHHHHHHHHHH
    1pyda   0.004     307-319     MKFVLQKLLTNIA     HHHHHHHHHHHHH
    1ngr_   0.005     63-72       LDALLAALRR        HHHHHHHHHH
    1sly_   0.005     338-346     AKEILHQLM         HHHHHHHH!
    1aerb   0.006     20-28       VERLLQAHR         HHHHHHHHH
    1bcn_   0.006     113-123     LENFLERLKTI       HHHHHHHHHHH
    1bib_   0.006     215-226     LAAMLIRELRAA      HHHHHHHHHHHH
    1fkx_   0.006     337-346     KKELLERLYR        HHHHHHHHHH
    2arcb   0.006     148-158     NLLEQLLLRRM       HHHHHHHHHHH
    1aqt_   0.008     112-125     DYAQASAELAKAIA    HHHHHHHHHHHHHH
    1fit_   0.008     111-120     EEEXAAEAAA        HHHHHHHHHH
    1mtyg   0.009     22-30       LEKAAEMLK         HHHHHHHHH
    2tct_   0.009     50-60       LLDALAVEILA       HHHHHHHHHHH
    1hsba   0.010     150-157     AHVAEQWR          !!HHHHHH
    2chsa   0.010     17-26       EEILQKTKQL        HHHHHHHHHH
    1hjp_   0.011     175-184     ETLIREALRA        HHHHHHHHH!
    1pou_   0.011     5-13        LEQFAKTFK         HHHHHHHHH
    ...

  • Training set from PDB

    Number of proteins   Number of amino acids   Number of α-helices   Average length
    822                  174191                  4783                  11.6

    Data base of minimally frustrated α-helical segments

    Number of proteins   Number of amino acids   Number of α-helical segments   Average length
    626                  21553                   3000                           7.2

  • Comparison of minimally frustrated segments with putative folding initiation sites experimentally determined

    PDB code   Entropy (S5)   Position in the protein chain   Extracted sequence   Reference
    132l       0.109          8-14                            LAAAMKR              Radford & al. (1992)
    132l       0.212          89-95                           TASVNCA              Radford & al. (1992)
    1hfx       0.186          86-91                           TDDIMC               Chyan & al. (1993)
    1hfx       0.221          7-13                            ALSHELN              *
    1hrc       0.156          92-99                           EDLIAYLK             Jeng & al. (1990)
    2mm1       0.050          127-132                         AQGAMN               Hugson & al. (1990)
    2mm1       0.104          139-146                         RKDMASNY             Hugson & al. (1990)
    2mm1       0.154          105-111                         EFISEAI              Hugson & al. (1990)
    7rsa       0.409          8-11                            FERQ                 Udgaonkar & Baldwin (1990)
    1ubq       0.322          25-28                           NVKA                 Briggs & Roder (1992)
    1gf1       0.311          10-16                           LVDALQF              Hua & al. (1996)
    2ci2       0.236          14-19                           VEEAKK               Fersht (1995)

    * Not yet experimentally detected

  • Comparison of minimally frustrated segments with peptides extracted from proteins

    Code*    Peptides*                     % Helix in solution*   Entropy (S5)   Extracted segment
    3FXC     TYKVTELINEAEGINETIDCDD        1                      #####          ####
    3LZM     GFTNSLRMLQQKRWDEAVNLAKS       10                     0.262          WDEAVNL
    "                                      10                     0.329          LRMLQQK
    3LZM-2   GVAGFTNSLRMLQQKRWDEAAVNLAKS   12                     0.203          SLRMLQ
    "                                      12                     0.210          DEAAVNL
    CIII     ESLLERITRKLRDGWKRLIDIL        8                      0.171          LLERIT
    "                                      8                      0.260          WKRLID
    CIII-L   ESLLERITRKL                   15                     0.171          LLERIT
    CIII-R   RDGWKRLIDIL                   4                      0.260          WKRLID
    CIII-M   RITRKLRDGWK                   2                      ####           ####
    Sigma    KVATTKAQRKLFFNLRKTKQRL        9                      0.218          TKAQRK
    COMA1    DHPAVMEGTKTILETDSNLS          4                      ####           ####
    COMA2    EPSEQFIKQHDFSSY               3                      ####           ####
    COMA3    VNGMELSKQILQENPH              6                      0.189          LSKQILQ
    COMA4    EVEDYFEEAIRAGLH               20                     0.020          YFEEAIR
    COMA5    KEKITQYIYHVLNGEIL             3                      ####           ####
    ARA1     AVGKSNLLSRYARNEFSA            2                      ####           ####
    ARA2     RFRAVTSAYYRGAVG               3                      ####           ####
    ARA3     TRRTTFESVGRWLDELKIHSD         7.5                    0.194          SVGRWL
    ARA4     AVSVEEGKALAEEEGLF             4                      ####           ####
    ARA5     STNVKTAFEMVILDIYNNV           3                      ####           ####
    G1       DTYKLILNGKTLKGETTTEA          2                      ####           ####
    G2       GDAATAEKVFKKIANDNGVD          4                      ####           ####
    G3       GEWTYDDATKTFTVTE              2                      ####           ####

    * Muñoz and Serrano, 1994.

  • Minimally frustrated α-helical segments are useful for determining:

    • Folding initiation sites

    • α-helix stability

    • de novo design of α-helices

  • Structure prediction of membrane proteins

  • Outer Membrane proteins (all β-Transmembrane proteins)

    Inner Membrane proteins (all α-Transmembrane proteins)

  • Outer Membrane: β-barrel. Inner Membrane: α-helices

    Porin (Rhodobacter capsulatus)

    Bacteriorhodopsin (Halobacterium salinarum)

  • Predictors of the Topology of Membrane Proteins

    Topography: position of Trans Membrane Segments along the sequence

    Topology: position of N and C termini with respect to the bilayer

    [Figure: a membrane protein crossing the lipidic bilayer, with the Out and In sides, the N and C termini, and "+" marks labelled]

  • Prediction of transmembrane segments

  • Neural Network for the prediction of TMS in β-barrel membrane proteins (Jacoboni et al., 2001)

    2 output neurons: TM, nonTM

    5 hidden neurons

    [Figure: input sequence profile, one column of residue percentages per position]

    Window: 9 residues
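
    The architecture on this slide (a 9-residue window, 5 hidden neurons, 2 output neurons) can be sketched as a small feed-forward pass. This is only an illustration under stated assumptions: the input is taken to be a 9 x 20 sequence-profile window with percentages from 0 to 100, the activation is assumed to be a sigmoid, and the weights are random placeholders rather than the trained values of Jacoboni et al. All function names are hypothetical.

```python
import math
import random

WINDOW = 9      # residues per input window (from the slide)
ALPHABET = 20   # assumed: one profile value per amino acid type
HIDDEN = 5      # hidden neurons (from the slide)
OUTPUTS = 2     # TM and nonTM (from the slide)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_layer(n_in, n_out, rng):
    # Placeholder random weights; each row holds n_in weights plus one bias.
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in + 1)] for _ in range(n_out)]


def layer_forward(weights, inputs):
    # Weighted sum of the inputs plus the bias, passed through a sigmoid.
    return [sigmoid(sum(w * x for w, x in zip(row[:-1], inputs)) + row[-1])
            for row in weights]


def predict_window(profile_window, w_hidden, w_out):
    # Flatten the 9 x 20 window and rescale percentages to [0, 1].
    inputs = [value / 100.0 for column in profile_window for value in column]
    hidden = layer_forward(w_hidden, inputs)
    return layer_forward(w_out, hidden)  # [TM score, nonTM score]


rng = random.Random(0)
w_hidden = init_layer(WINDOW * ALPHABET, HIDDEN, rng)
w_out = init_layer(HIDDEN, OUTPUTS, rng)

# Toy window: every position is 100 % one residue type (index 0).
toy_window = [[100] + [0] * (ALPHABET - 1) for _ in range(WINDOW)]
tm_score, non_tm_score = predict_window(toy_window, w_hidden, w_out)
label = "TM" if tm_score > non_tm_score else "nonTM"
```

    With untrained weights the label is meaningless; the sketch only shows how a window maps to the two output neurons, with the prediction assigned to the central residue of the window.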

  • A generic model for membrane proteins (TMHMM)

    [Figure: model states Begin, Outer Side, Transmembrane, Inner Side, End]

  • Sequence-profile-based HMM

    Sequence of characters c_t:

    ..A C L P R P E T ...

    Sequence of A-dimensional vectors s_t. For proteins A = 20.

    Constraints, with S = 100:

    $0 \le s_t(n) \le S \quad \forall\, t, n$

    $\sum_{n=1}^{A} s_t(n) = S \quad \forall\, t$

    [Figure: example sequence profile, one column of 20 values per sequence position]

    Martelli et al., Bioinformatics 18, S46-53, 2002
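
    The two constraints on the profile vectors can be checked mechanically. The sketch below is an illustration, not the authors' code: the alphabet ordering and the function names are assumptions. It also shows that a plain sequence of characters c_t fits the same representation when each vector puts all of S on a single residue.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # A = 20 symbols; this ordering is an assumption
S = 100  # every profile vector sums to S (from the slide)


def one_hot_profile(sequence):
    # Encode each character c_t as a vector s_t with a single entry equal to S.
    profile = []
    for residue in sequence:
        column = [0] * len(AMINO_ACIDS)
        column[AMINO_ACIDS.index(residue)] = S
        profile.append(column)
    return profile


def satisfies_constraints(profile, total=S):
    # Check 0 <= s_t(n) <= S for all t, n and sum_n s_t(n) = S for all t.
    return all(
        len(column) == len(AMINO_ACIDS)
        and all(0 <= value <= total for value in column)
        and sum(column) == total
        for column in profile
    )


profile = one_hot_profile("ACLPRPET")  # the example sequence shown on the slide
ok = satisfies_constraints(profile)
```

    A profile derived from a multiple alignment would pass the same check as long as each position's residue percentages add up to 100.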

  • The new algorithms make possible:

    • to feed HMMs with sequence profiles

    • to eventually couple NNs and HMMs (Hidden Neural Networks)

    Advantages:

    • Higher performance than standard HMMs

    • Increased discrimination capability of a given class

    Martelli et al., Bioinformatics, 2002; Martelli et al., Protein Eng., 2002

  • Prediction of the Topology of α-Transmembrane Proteins

    Topography: position of Trans Membrane Helices along the sequence

    The prediction accuracy of topography is 92 %

    Topology: position of N and C termini with respect to the bilayer

    The prediction accuracy of topology is 81 %

    [Figure: helices crossing the bilayer, with the Out and In sides, the N and C termini, and "+" marks labelled]

  • Prediction of the Topology of β-Transmembrane Proteins

    Topography: position of Transmembrane Strands along the sequence

    The prediction accuracy of topography is 73 %

    Topology: position of N and C termini with respect to the bilayer

    The prediction accuracy of topology is 73 %

    [Figure: strands crossing the bilayer, with the LPS (Out) and Periplasmic (In) sides, the N and C termini, and "+" marks labelled]

  • The discriminative capability of the HMM model

    [Figure: Percentage (y axis, 0 to 100) versus I(s | M) (x axis, about 2.75 to 2.95) for outer membrane, globular, and inner membrane proteins]

    $I(s \mid M) = -\frac{1}{L} \log P(s \mid M)$
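
    The score on the x axis is the negative log-probability of a sequence s under the model M, divided by the sequence length L. A minimal sketch follows, assuming natural logarithms and using a hypothetical independent-residue model as a stand-in for the HMM (a real HMM would supply log P(s | M) from its forward algorithm).

```python
import math


def information_per_residue(log_prob, length):
    # I(s | M) = -(1/L) * log P(s | M); natural logarithm assumed.
    return -log_prob / length


def log_prob_iid(sequence, frequencies):
    # Hypothetical stand-in for the HMM: residues are independent draws
    # from a fixed frequency table.
    return sum(math.log(frequencies[residue]) for residue in sequence)


uniform = {aa: 1.0 / 20 for aa in "ACDEFGHIKLMNPQRSTVWY"}
sequence = "ACLPRPET"
score = information_per_residue(log_prob_iid(sequence, uniform), len(sequence))
```

    Under this uniform stand-in the score equals ln 20 for any sequence, so a model that fits a sequence better than chance gives a lower value. Dividing by L makes sequences of different lengths comparable.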

  • An application: modeling the 3D structure of eukaryotic β-barrel proteins

  • 3D structure prediction of proteins

    [Figure: prediction methods placed along a homology axis (0 to 100 %): ab initio prediction, threading/fold recognition, building by homology; regions labelled New folds, Existing folds, and Membrane proteins]

  • Structural alignment of VDAC with the template

    2omf_.seq/ AEIYNKDGNK VDLYGKAVGL HYFSKGNGEN SYGGNGDMTY ARLGFKGETQ
    2omf_.str/ CCCCCCCCEE EEEEEEEEEE EEECCCCCCC CCCCCCCCCE EEEEEEEEEE
    protx.str/ *******CCC CCCCEEEEEE EEEC****** ********CE EEEEEEEECC
    protx.seq/ *******KGY NFGLWKLDLK TKTS****** ********SG IEFNTAGHSN

    2omf_.seq/ I*NSDLTGYG QWEYNFQGNN SEGADAQTGN KTRLAFAGLK YADVGSFDYG
    2omf_.str/ C*CCCEEEEE EEEEEEECCC CCCCCCCCCC EEEEEEEEEE ECCCEEEEEE
    protx.str/ CCCCCEEEEE EEEEEEC*** ********** EEEEEEEEEC CCCCCEEEEE
    protx.seq/ QESGKVFGSL ETKYKVK*** ********** DYGLTLTEKW NTDNTLFTEV

    2omf_.seq/ RNYGVVYDAL GYTDMLPEFG GDTAYSDDFF VGRVGGVATY RNSNFFGLVD
    2omf_.str/ ECCCCCCCCC CCCCCCCCCC CCCCCCCCCC CCCCCCEEEE EECCCCCCCC
    protx.str/ EEEECC**** ********** ********** **CCEEEEEE EEECCCCCCC
    protx.seq/ AVQDQL**** ********** ********** **LEGLKLSL EGNFAPQSGN

    2omf_.seq/ GLNFAVQYLG KNER****** *********D TARRSNGDGV GGSISYEYE*
    2omf_.str/ CEEEEEEEEC CCCC****** *********C CCCCCCCCEE EEEEEEEEC*
    protx.str/ EEEEEEEEEE EEEECCCCCC CCCCCCCEEE EEEEEEEEEE EEEEEEECCC
    protx.seq/ KNGKFKVAYG HENVKADSDV NIDLKGPLIN ASAVLGYQGW LAGYQTAFDT

    2omf_.seq/ **GFGIVGAY GAADRTNLQE AQPLGNGKKA EQWATGLKYD ANNIYLAANY
    2omf_.str/ **CEEEEEEE EEEECCCCCC CCCCCCCCEE EEEEEEEEEE ECCEEEEEEE
    protx.str/ CCEEEEEEEE EEEEEEEEEE EEECCCCCCC EEEEEEEEEE CEEEEEEEEE
    protx.seq/ QQSKLTTNNF ALGYTTKDFV LHTAVNDGQE FSGSIFQRTS DKLDVGVQLS

    2omf_.seq/ GETRNATPIT NKFTNTSGFA NKTQDVLLVA QYQFDFGLRP SIAYTKSKAK
    2omf_.str/ EEEECCCCCC CCCCCCCCCC CEEEEEEEEE EEECCCCEEE EEEEEEEEEE
    protx.str/ EEECC***** ********** *CCCEEEEEE EEECCCCEEE EEEEEEC***
    protx.seq/ WASGT***** ********** *SNTKFAIGA KYQLDDDARV RAKVNNA***

    2omf_.seq/ DVEGIGDVDL VNYFEVGATY YFNKNMSTYV DYIINQIDSD NKLGVGSDDT
    2omf_.str/ CCCCCCCEEE EEEEEEEEEE ECCCCEEEEE EEEEECCCCC CCCCCCCCCE
    protx.str/ *********E EEEEEEEEEE EC***EEEEE EEEEECCC** *****CCCCE
    protx.seq/ *********S QVGLGYQQKL RT***GVTLT LSTLVDGK** *****NFNAG

    2omf_.seq/ VAVGIVYQF* ***
    2omf_.str/ EEEEEEEEE* ***
    protx.str/ EEEEEEEEEE EC*
    protx.seq/ GHKIGVGLEL EA*

  • A low resolution 3D model of VDAC (the sequence from Neurospora crassa)

    Ca

  • A low resolution 3D model of VDAC: location of mutated residues

    Casadio et al., FEBS Lett 520:1-7 (2002)

  • Predictors of membrane protein structures can be used to filter genomes and find new membrane proteins without sequence homologues

  • FISHING NEW OUTER MEMBRANE PROTEINS IN GRAM-NEGATIVE BACTERIA

  • Proteins have intrinsic signals that govern their transport and localization in the cell: a hydrophobic secretion marker (or signal peptide)

    Signal peptides in protein sequences:

    MRAKLLGIVLTTPIAISSFASTETLSFTPDNINADISLGTLSGKTKERVYLAEEGGRKVSQLDWKFNNAAIIKGAINWDLMPQISIGAAGWTTLGSRGGNMVDQDWMDSSNPGTWTDESRHPDTQLNYANEFDLNIKGWLLNEPNYRLGLMAGYQESRYSFTARGGSYIYSSEEGFRDDIGSFPNGER
    AIGYKQRFKMPYIGLTGSYRYEDFELGGTFKYSGWVESSDNDEHYDPGKRITYRSKVKDQNYYSVAVNAGYYVTPNAKVYVEGAWNRVTNKKGNTSLYDHNNNTSDYSKNGAGIENYNFITTAGLKYTF

    Sequences of outer membrane proteins have signal peptides: the secretion marker is also a marker of outer membrane proteins

  • Signal Peptide prediction

    MKLLQRGVALALLTTFTLASETALAYEQDKTYKITVLHTNDHHGHF

    [Figure: the sequence divided into the Signal Peptide and the Mature protein, with the Cleavage site marked between them]

  • 2 Neural Networks

    MKLLQRGVALALLTTFTLASETALAYEQDKTYKITVLHTNDHHGHF

    SignalNet: predicts if a given residue position belongs to the Signal Peptide

    CleavageNet: predicts if a given residue position is the cleavage site

  • SignalNet Accuracy

    Organism        Window    C     Q2
    Eukaryotes      15-1-15   0.83  0.95
    Gram positive   15-1-15   0.79  0.92
    Gram negative   11-1-11   0.78  0.92

  • CleavageNet Accuracy

    Organism        Window    C     Q2
    Eukaryotes      15-1-2    0.61  0.97
    Gram positive   20-1-3    0.56  0.96
    Gram negative   11-1-2    0.62  0.96
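
    The Window columns in the two accuracy tables can be read as "residues before, central residue, residues after", so 15-1-15 is a symmetric window and 15-1-2 an asymmetric one. That reading is an assumption, as are the padding symbol and the function name below. The sketch only shows how such windows would be cut from a sequence before being passed to a network.

```python
PAD = "-"  # hypothetical padding symbol for positions beyond the sequence ends


def windows(sequence, left, right):
    # Yield (position, window) with `left` residues before the central
    # residue and `right` residues after it.
    padded = PAD * left + sequence + PAD * right
    for i in range(len(sequence)):
        yield i, padded[i:i + left + 1 + right]


seq = "MKLLQRGVALALLTTFTLASETALAYEQDKTYKITVLHTNDHHGHF"
signal_windows = list(windows(seq, 15, 15))   # 15-1-15, as listed for eukaryotes
cleavage_windows = list(windows(seq, 15, 2))  # 15-1-2, as listed for eukaryotes
```

    Each residue gets exactly one window, so a network scoring the windows produces one prediction per sequence position.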

  • Comparison with SignalP

    Organism                 SignalP  SPEP
    Eukaryotes (+)           0.99     0.97
    Eukaryotes (-)           0.85     0.94
    Prokaryotes (+)          0.99     0.97
    Prokaryotes (-)          0.93     0.96
    Escherichia coli (+/-)   0.95     0.96

  • Performance of SignalNN on 2160 annotated proteins

                                  Prediction:   Prediction:
                                  with signal   without signal   Total
    Annotation: with signal       205           45               250
    Annotation: without signal    55            1855             1910
    Total                         260           1900             2160

    Correct predictions: 205 and 1855. Wrong predictions: 45 and 55.

    Q2 = 96 %

    Qsignal = 82 %

    Qnon-signal = 97 %

    Psignal = 78 %

    Pnon-signal = 98 %
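
    The indices on this slide can be recomputed from the four counts. The sketch below assumes the usual definitions: Q for a class is the fraction of annotated members that are recovered, P for a class is the fraction of predictions for that class that are correct, and the C column of the accuracy tables is taken to be the Matthews correlation coefficient. Variable names are illustrative.

```python
import math

# Counts read from the slide: rows are annotation, columns are prediction.
tp, fn = 205, 45     # annotated with signal: predicted with / without
fp, tn = 55, 1855    # annotated without signal: predicted with / without

total = tp + fn + fp + tn
q2 = (tp + tn) / total            # overall two-class accuracy
q_signal = tp / (tp + fn)         # coverage of the signal class
q_non_signal = tn / (tn + fp)     # coverage of the non-signal class
p_signal = tp / (tp + fp)         # correct fraction of "signal" predictions
p_non_signal = tn / (tn + fn)     # correct fraction of "non-signal" predictions

# Matthews correlation coefficient (assumed meaning of the "C" column).
c = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)
```

    With these counts Qsignal, Qnon-signal and Pnon-signal round to the slide's 82 %, 97 % and 98 %; Q2 and Psignal come out near 95.4 % and 78.8 %, within about one point of the slide's 96 % and 78 %.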

  • Predictors of Membrane Topography: Rate of false positives

    The predictors are tested on 809 globular proteins with sequence identity below 25 %:

    0.5 % have at least 1 α-TM helix predicted.

    5.6 % have at least 2 β-TM strands predicted.

  • [Figure: the HUNTER workflow. Labels: Proteome, Signal peptide, all-β TM, all-α TM, Globular]

  • Predicting globular, inner and outer membrane proteins in genomes of Gram-negative bacteria with Hunter

    Organism                        Outer membrane   Inner membrane   Globular        Total
    Escherichia coli K12            65 (1.6%)        907 (21.7%)      3201 (76.7%)    4173
      New*                          18               136              1099            1253
    Escherichia coli O157:H7        78 (1.5%)        1034 (19.3%)     4249 (79.2%)    5361
      New                           10               327              1564            1901
    Chlamydia pneumoniae CWL029     12 (1.1%)        290 (27.6%)      750 (71.3%)     1052
      New                           2                181              236             419
    Salmonella typhimurium LT2      70 (1.6%)        1002 (22.5%)     3379 (75.9%)    4451
      New                           0                2                21              23
    Neisseria meningitidis MC58     34 (1.7%)        372 (18.4%)      1619 (80.0%)    2025
      New                           6                176              662             844
    Helicobacter pylori 26695       36 (2.3%)        352 (22.5%)      1178 (75.2%)    1566
      New                           10               141              445             596
    Haemophilus influenzae Rd       23 (1.3%)        348 (20.4%)      1338 (78.3%)    1709
      New                           5                121              430             556
    Thermotoga maritima             18 (1.0%)        370 (20.0%)      1458 (79.0%)    1846
      New                           11               203              559             773
    Pseudomonas aeruginosa          131 (2.4%)       1292 (23.2%)     4142 (74.4%)    5565
      New                           62               616              1867            2545

    * the number of new proteins predicted in the class with Hunter, out of the non-annotated region