Secondary structure prediction
Chou & Fasman (1974)
Protein Structure – Why do we care?
• Structure-function relation: the shape of a protein molecule directly determines its biological function.
• Proteins with similar function often have similar shapes or similar regions or domains.
• Hence, if we find a new protein and know its shape, we can make a good guess about its biological function.
Why predict when we can get the real thing?
PDB database: 24168 protein structures
Swiss-Prot Release: 143790 protein sequences
TrEMBL Release: 1075779 protein sequences
Primary structure: no problems
Secondary structure: overall 77% accurate at predicting
Tertiary structure: overall 30% accurate at predicting
Quaternary structure: no reliable means of predicting yet
Function: do you feel like guessing?
Secondary structure is derived from tertiary coordinates.
To get to tertiary structure we need NMR or X-ray crystallography.
We have an abundance of primary sequences, so why not use them?
Structure Prediction Methods
Method | Knowledge | Approach | Difficulty | Usefulness
Homology modeling | Proteins of known structure | Identify related structure with sequence methods, copy 3D coordinates and modify as necessary | Relatively easy | Very, if sequence identity > 40% (drug design)
Fold recognition | Proteins of known structure | Same as above, but use more sophisticated methods to find related structure | Medium | Limited due to poor models
Secondary structure prediction | Sequence-structure statistics | Forget the 3D arrangement and predict where the helices/strands are | Medium | Can improve alignments, fold recognition, ab initio
Ab initio prediction | Energy functions, statistics | Simulate folding, or generate lots of structures and try to pick the correct one | Very hard | Not really
History
• 1974: Chou and Fasman propose a statistical method based on the propensities of amino acids to adopt secondary structures, derived from observing their locations in 15 protein structures determined by X-ray diffraction. These statistics reflect the particular stereochemical and physicochemical properties of the amino acids. Rather than analyzing position by position, the propensity at a position is calculated as an average over the 5 or 6 residues surrounding it. On a larger set of 62 proteins the base method reports a success rate of 50%.
• 1978: Garnier improved the method by using statistically significant pair-wise interactions between residues as an additional determinant of the prediction. This improved the success rate to 62%.
• 1993: Levin improved the prediction level by using multiple sequence alignments. The reasoning is as follows. Conserved regions in a multiple sequence alignment provide a strong evolutionary indicator of a role in the function of the protein. Those regions are also likely to have conserved structure, including secondary structure, and they strengthen the prediction through their joint propensities. This improved the success rate to 69%.
• 1994: Rost and Sander combined neural networks with multiple sequence alignments. The idea of a neural net is to create a complex network of interconnected nodes, where progress from one node to the next depends on satisfying a weighted function that has been derived by training the net with data of known results, in this case protein sequences with known secondary structures. The success rate is 72%.
Secondary Structure Prediction Algorithms
• These methods are 70-75% accurate at predicting secondary structure.
• A few examples are:
– Chou-Fasman algorithm
– Garnier-Osguthorpe-Robson (GOR) method
– Neural network models
– Nearest-neighbor method
Chou-Fasman Algorithm
• Analyzed the frequency of the 20 amino acids in alpha helices, beta sheets and turns.
• Ala (A), Glu (E), Leu (L), and Met (M) are strong predictors of helices
• Pro (P) and Gly (G) break helices.
• When 4 of 6 amino acids have a high probability of being in an alpha helix, it predicts an alpha helix.
• When 3 of 5 amino acids have a high probability of being in a strand, it predicts a strand.
• 4 amino acids are used to predict turns.
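The window rules in the bullets above amount to a sliding-window count. The sketch below is only an illustration: the function name nucleation_windows is hypothetical, "high probability" is assumed to mean a preference value above 1.0, and the example scores are invented numbers, not values from the paper. Window length and required count are arguments, so the same function covers the helix and strand rules.

```python
def nucleation_windows(scores, window, min_hits, threshold=1.0):
    """Return the start index of every window of length `window` in which
    at least `min_hits` scores are above `threshold` (assumed cutoff)."""
    starts = []
    for i in range(len(scores) - window + 1):
        hits = sum(1 for s in scores[i:i + window] if s > threshold)
        if hits >= min_hits:
            starts.append(i)
    return starts


# Invented strand preference values for a 7-residue stretch (illustration only).
p_strand = [1.6, 0.7, 1.7, 1.3, 0.6, 0.8, 0.7]

# Strand rule from the slide: 3 of 5 residues with a high strand preference.
print(nucleation_windows(p_strand, window=5, min_hits=3))
```

Each returned index marks a candidate nucleation site; deciding how far the element extends from there is a separate step.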
Name P(a) P(b) P(turn) f(i) f(i+1) f(i+2) f(i+3)
Alanine 1.42 0.83 0.66 0.06 0.076 0.035 0.058
Articles
• Chou, P.Y. and Fasman, G.D. (1974).
Conformational parameters for amino acids in helical, β-sheet, and random coil regions calculated from proteins.
Biochemistry 13, 211-221.
• Chou, P.Y. and Fasman, G.D. (1974).
Prediction of protein conformation.
Biochemistry 13, 222-245.
Method
• Assigning a set of prediction values to a residue, based on statistical analysis of 15 proteins
• Applying a simple algorithm to those numbers
Calculation of Propensities
• Pr[i|β-sheet]/Pr[i], Pr[i|α-helix]/Pr[i] and Pr[i|other]/Pr[i] give the probability that amino acid i is in each structure, normalized by the background probability that i occurs at all.
Example: let's say that there are 20,000 amino acids in the database, of which 2,000 are serine, and there are 5,000 amino acids in helical conformation, of which 500 are serine. Then the helical propensity for serine is (500/5000) / (2000/20000) = 1.0.
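The serine example can be checked with a few lines of Python. This is a minimal sketch of the ratio described above; the function name and argument names are hypothetical, chosen only for this illustration.

```python
def propensity(n_res_in_state, n_state, n_res_total, n_total):
    """Propensity of a residue for a conformational state:
    (fraction of the state made up of this residue) divided by
    (fraction of the whole database made up of this residue)."""
    p_res_given_state = n_res_in_state / n_state
    p_res_background = n_res_total / n_total
    return p_res_given_state / p_res_background


# Counts from the serine example: 500 of 5,000 helical residues are serine,
# and 2,000 of 20,000 residues overall are serine.
serine_helix = propensity(500, 5000, 2000, 20000)
print(serine_helix)
```

With these counts serine is exactly as common in helices as in the database overall, so the ratio is 1.0.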
Calculation of preference parameters
• Preference parameter > 1.0: the residue has a preference for the specific secondary structure.
• Preference parameter = 1.0: the residue neither prefers nor dislikes the specific secondary structure.
• Preference parameter < 1.0: the residue dislikes the specific secondary structure.
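As a small illustration, the three cases can be written as a function. The name classify_preference, the labels it returns and the tolerance used to treat a value as "equal to 1.0" are assumptions made for this sketch; the example values are the P(a) entries for Glu, Ile and Gly from the preference parameter table.

```python
def classify_preference(p, tol=1e-9):
    """Map a preference parameter to one of the three cases.
    `tol` is an assumed tolerance for comparing against 1.0."""
    if p > 1.0 + tol:
        return "prefers"
    if p < 1.0 - tol:
        return "dislikes"
    return "indifferent"


# P(a) values taken from the preference parameter table.
for name, p_a in [("Glu", 1.53), ("Ile", 1.00), ("Gly", 0.53)]:
    print(name, classify_preference(p_a))
```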
Preference parameters
Residue P(a) P(b) P(t) f(i) f(i+1) f(i+2) f(i+3)
Ala 1.45 0.97 0.57 0.049 0.049 0.034 0.029
Arg 0.79 0.90 1.00 0.051 0.127 0.025 0.101
Asn 0.73 0.65 1.68 0.101 0.086 0.216 0.065
Asp 0.98 0.80 1.26 0.137 0.088 0.069 0.059
Cys 0.77 1.30 1.17 0.089 0.022 0.111 0.089
Gln 1.17 1.23 0.56 0.050 0.089 0.030 0.089
Glu 1.53 0.26 0.44 0.011 0.032 0.053 0.021
Gly 0.53 0.81 1.68 0.104 0.090 0.158 0.113
His 1.24 0.71 0.69 0.083 0.050 0.033 0.033
Ile 1.00 1.60 0.58 0.068 0.034 0.017 0.051
Leu 1.34 1.22 0.53 0.038 0.019 0.032 0.051
Lys 1.07 0.74 1.01 0.060 0.080 0.067 0.073
Met 1.20 1.67 0.67 0.070 0.070 0.036 0.070
Phe 1.12 1.28 0.71 0.031 0.047 0.063 0.063
Pro 0.59 0.62 1.54 0.074 0.272 0.012 0.062
Ser 0.79 0.72 1.56 0.100 0.095 0.095 0.104
Thr 0.82 1.20 1.00 0.062 0.093 0.056 0.068
Trp 1.14 1.19 1.11 0.045 0.000 0.045 0.205
Tyr 0.61 1.29 1.25 0.136 0.025 0.110 0.102
Val 1.14 1.65 0.30 0.023 0.029 0.011 0.029
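For use in a program, the first three columns of the table can be stored in a lookup keyed by one-letter amino acid code. The sketch below copies P(a), P(b) and P(t) from the table above; the names PARAMS and mean_propensities are hypothetical, and averaging over a segment is shown only because the method works with window averages.

```python
# (P(a), P(b), P(t)) copied from the table above, keyed by one-letter code.
PARAMS = {
    "A": (1.45, 0.97, 0.57), "R": (0.79, 0.90, 1.00), "N": (0.73, 0.65, 1.68),
    "D": (0.98, 0.80, 1.26), "C": (0.77, 1.30, 1.17), "Q": (1.17, 1.23, 0.56),
    "E": (1.53, 0.26, 0.44), "G": (0.53, 0.81, 1.68), "H": (1.24, 0.71, 0.69),
    "I": (1.00, 1.60, 0.58), "L": (1.34, 1.22, 0.53), "K": (1.07, 0.74, 1.01),
    "M": (1.20, 1.67, 0.67), "F": (1.12, 1.28, 0.71), "P": (0.59, 0.62, 1.54),
    "S": (0.79, 0.72, 1.56), "T": (0.82, 1.20, 1.00), "W": (1.14, 1.19, 1.11),
    "Y": (0.61, 1.29, 1.25), "V": (1.14, 1.65, 0.30),
}


def mean_propensities(segment):
    """Average P(a), P(b) and P(t) over the residues of a sequence segment."""
    if not segment:
        raise ValueError("segment must contain at least one residue")
    rows = [PARAMS[residue] for residue in segment]
    return tuple(sum(column) / len(rows) for column in zip(*rows))


avg_a, avg_b, avg_t = mean_propensities("AELM")
print(round(avg_a, 2), round(avg_b, 2), round(avg_t, 2))
```

For the segment AELM the averages come out at about 1.38 for helix, 1.03 for sheet and 0.55 for turn, which matches the earlier remark that Ala, Glu, Leu and Met are strong helix predictors.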
Applying algorithm
1. Assign parameters (propensities) to each residue.
2. Identify regions (nucleation sites) where 4 out of 6 residues have P(a) > 100 (that is, 1.00 on the scale of the table): α-helix. Extend the helix in both directions until four contiguous residues have an average P(a) < 100: end of α-helix. If the segment is longer than 5 residues and P(a) > P(b): α-helix.
3. Repeat this procedure to locate all of the helical regions.
4. Identify regions where 3 out of 5 residues have P(b) > 100: β-sheet. Extend the sheet in both directions until four contiguous residues have an average P(b) < 100: end of β-sheet. If P(b) > 105 and P(b) > P(a): β-sheet.
5. Rest: if P(a) > P(b), α-helix; if P(b) > P(a), β-sheet.
6. To identify a bend at residue number i, calculate the following value: p(t) = f(i) × f(i+1) × f(i+2) × f(i+3). If (1) p(t) > 0.000075, (2) the average P(t) > 1.00 in the tetrapeptide, and (3) the averages for the tetrapeptide obey P(a) < P(t) > P(b): β-turn.
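The helix steps and the bend test can be sketched in Python as follows. This is one possible reading of the steps, not a reference implementation. It assumes that the thresholds of 100 in the steps correspond to 1.00 on the scale of the parameter table, and that extension proceeds one residue at a time and stops when the four-residue window at the growing end averages below the threshold. The names TABLE, find_helices and is_turn are hypothetical, and only a few table rows are copied in to keep the example short.

```python
# Rows copied from the preference parameter table for a few residues:
# residue -> (P(a), P(b), P(t), f(i), f(i+1), f(i+2), f(i+3))
TABLE = {
    "A": (1.45, 0.97, 0.57, 0.049, 0.049, 0.034, 0.029),
    "E": (1.53, 0.26, 0.44, 0.011, 0.032, 0.053, 0.021),
    "L": (1.34, 1.22, 0.53, 0.038, 0.019, 0.032, 0.051),
    "M": (1.20, 1.67, 0.67, 0.070, 0.070, 0.036, 0.070),
    "G": (0.53, 0.81, 1.68, 0.104, 0.090, 0.158, 0.113),
    "P": (0.59, 0.62, 1.54, 0.074, 0.272, 0.012, 0.062),
    "N": (0.73, 0.65, 1.68, 0.101, 0.086, 0.216, 0.065),
    "S": (0.79, 0.72, 1.56, 0.100, 0.095, 0.095, 0.104),
}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def find_helices(seq, threshold=1.0):
    """Steps 2-3: nucleate on 4-of-6 windows, extend, then check P(a) > P(b).
    Returns half-open (start, end) index pairs."""
    pa = [TABLE[r][0] for r in seq]
    helices = []
    lower = 0  # residues before this index belong to an earlier segment
    i = 0
    while i + 6 <= len(seq):
        if sum(p > threshold for p in pa[i:i + 6]) < 4:
            i += 1
            continue
        start, end = i, i + 6
        # Grow each end by one residue while the 4-residue window at that
        # end still averages at or above the threshold (assumed reading).
        while start > lower and mean(pa[start - 1:start + 3]) >= threshold:
            start -= 1
        while end < len(seq) and mean(pa[end - 3:end + 1]) >= threshold:
            end += 1
        segment = seq[start:end]
        avg_a = mean(TABLE[r][0] for r in segment)
        avg_b = mean(TABLE[r][1] for r in segment)
        if len(segment) > 5 and avg_a > avg_b:
            helices.append((start, end))
        lower = i = end
    return helices


def is_turn(seq, i, cutoff=0.000075):
    """Step 6: test the tetrapeptide starting at index i for a turn."""
    tetra = seq[i:i + 4]
    if len(tetra) < 4:
        return False
    p_t = 1.0
    for k, residue in enumerate(tetra):
        p_t *= TABLE[residue][3 + k]  # f(i), f(i+1), f(i+2), f(i+3)
    avg_a, avg_b, avg_t = [mean(TABLE[r][c] for r in tetra) for c in range(3)]
    return p_t > cutoff and avg_t > 1.0 and avg_a < avg_t > avg_b


print(find_helices("GPSAELMAELMAGPNS"))
print(is_turn("NPGS", 0))
```

The sheet search in step 4 would follow the same pattern with a 3-of-5 window and the P(b) column; it is left out here to keep the sketch short.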
Successful method?
15 proteins evaluated:
• helix = 46%, β-sheet = 35%, turn = 65%
• Overall accuracy of predicting the three conformational states for all residues (helix, β-sheet, and coil) is 56%
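One common way to read "overall accuracy" is the fraction of residues whose predicted state matches the observed one. The sketch below assumes that reading, and the one-letter state codes H (helix), E (strand) and C (coil) are a convention chosen for this example; the two strings are invented for illustration and are not data from the evaluation.

```python
def per_residue_accuracy(predicted, observed):
    """Fraction of positions where the predicted state equals the observed state."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return correct / len(observed)


# Invented 10-residue example: two positions disagree.
predicted = "HHHHCCEEEC"
observed = "HHHCCCEEHC"
print(per_residue_accuracy(predicted, observed))
```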
Chou & Fasman: not so great?
After 1974: improvement of preference parameters