secondary structure prediction chou & fasman (1974)

Secondary structure prediction

Chou & Fasman (1974)

Protein Structure – Why do we care?

• Structure-Function Relation – The shape of a protein molecule directly determines its biological function.

• Proteins with similar function often have similar shape or similar regions or domains.

• Hence, if we find a new protein and know its shape, we can make a good guess about its biological function.

Why predict when we can get the real thing?

PDB database : 24168 protein structures

Swiss-Prot Release : 143790 protein sequences

TrEMBL Release : 1075779 protein sequences

Primary structure: no problems

Secondary structure: overall 77% accurate at predicting

Tertiary structure: overall 30% accurate at predicting

Quaternary structure: no reliable means of predicting yet

Function: do you feel like guessing?

Secondary structure is derived from tertiary coordinates.

To get to tertiary structure we need NMR or X-ray crystallography.

We have an abundance of primary sequences, so why not use them?

Structure Prediction Methods

Method: Homology modeling
Knowledge: Proteins of known structure
Approach: Identify a related structure with sequence methods, copy the 3D coordinates and modify as necessary
Difficulty: Relatively easy
Usefulness: Very useful if sequence identity > 40% (drug design)

Method: Fold recognition
Knowledge: Proteins of known structure
Approach: Same as above, but use more sophisticated methods to find a related structure
Difficulty: Medium
Usefulness: Limited due to poor models

Method: Secondary structure prediction
Knowledge: Sequence-structure statistics
Approach: Forget the 3D arrangement and predict where the helices/strands are
Difficulty: Medium
Usefulness: Can improve alignments, fold recognition, ab initio prediction

Method: Ab initio prediction
Knowledge: Energy function statistics
Approach: Simulate folding, or generate lots of structures and try to pick the correct one
Difficulty: Very hard
Usefulness: Not really

History

• 1974: Chou and Fasman propose a statistical method based on the propensities of amino acids to adopt secondary structures, derived from observing their locations in 15 protein structures determined by X-ray diffraction. These statistics reflect the particular stereochemical and physicochemical properties of the amino acids. Rather than analyzing position by position, the method calculates the propensity of a position as an average over the 5 or 6 residues surrounding it. On a larger set of 62 proteins the base method reports a success rate of 50%.

• 1978: Garnier improved the method by using statistically significant pair-wise interactions as a determinant of the statistical significance. This improved the success rate to 62%.

• 1993: Levin improved the prediction level by using multiple sequence alignments. The reasoning is as follows. Conserved regions in a multiple sequence alignment provide a strong evolutionary indicator of a role in the function of the protein. Those regions are also likely to have conserved structure, including secondary structure, and they strengthen the prediction through their joint propensities. This improved the success rate to 69%.

• 1994: Rost and Sander combined neural networks with multiple sequence alignments. The idea of a neural net is to create a complex network of interconnected nodes, where progress from one node to the next depends on satisfying a weighted function that has been derived by training the net with data of known results, in this case protein sequences with known secondary structures. The success rate is 72%.

Secondary Structure Prediction Algorithms

• These methods are 70-75% accurate at predicting secondary structure.

• A few examples are
– Chou-Fasman algorithm
– Garnier-Osguthorpe-Robson (GOR) method
– Neural network models
– Nearest-neighbor method

Chou-Fasman Algorithm

• Analyzed the frequency of the 20 amino acids in alpha helices, beta sheets and turns.

• Ala (A), Glu (E), Leu (L), and Met (M) are strong predictors of helices

• Pro (P) and Gly (G) break helices.

• When 4 of 6 amino acids have a high probability of being in an alpha helix, it predicts an alpha helix.

• When 3 of 5 amino acids have a high probability of being in a strand, it predicts a strand.

• 4 amino acids are used to predict turns.

Name P(a) P(b) P(turn) f(i) f(i+1) f(i+2) f(i+3)

Alanine 1.42 0.83 0.66 0.06 0.076 0.035 0.058

Articles

• Chou, P.Y. and Fasman, G.D. (1974). Conformational parameters for amino acids in helical, β-sheet, and random coil regions calculated from proteins. Biochemistry 13, 211-221.

• Chou, P.Y. and Fasman, G.D. (1974). Prediction of protein conformation. Biochemistry 13, 222-245.

Method

• Assigning a set of prediction values to each residue, based on statistical analysis of 15 proteins

• Applying a simple algorithm to those numbers

Calculation of Propensities

• Pr[i|β-sheet]/Pr[i], Pr[i|α-helix]/Pr[i] and Pr[i|other]/Pr[i] determine the probability that amino acid i is in each structure, normalized by the background probability that i occurs at all.

Example: let's say that there are 20,000 amino acids in the database, of which 2,000 are serine, and there are 5,000 amino acids in helical conformation, of which 500 are serine. Then the helical propensity for serine is (500/5000) / (2000/20000) = 1.0.
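The serine example can be reproduced with a few lines of Python. This is a minimal sketch; the function name `propensity` and its argument names are illustrative assumptions, not part of the original method description.

```python
def propensity(n_res_in_state, n_state, n_res_total, n_total):
    """Return Pr[i | state] / Pr[i] computed from raw residue counts.

    n_res_in_state: occurrences of amino acid i in the given conformation
    n_state:        all residues in that conformation
    n_res_total:    occurrences of amino acid i in the whole database
    n_total:        all residues in the database
    """
    p_given_state = n_res_in_state / n_state   # Pr[i | state]
    p_background = n_res_total / n_total       # Pr[i]
    return p_given_state / p_background


# Counts taken from the serine example in the text.
serine_helix = propensity(500, 5000, 2000, 20000)
print(serine_helix)
```

A value of 1.0 means serine occurs in helices exactly as often as its overall abundance would suggest.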

Calculation of preference parameters

• Preference parameter > 1.0: the residue has a preference for the specific secondary structure.

• Preference parameter = 1.0: the residue neither prefers nor dislikes the specific secondary structure.

• Preference parameter < 1.0: the residue dislikes the specific secondary structure.
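The three cases can be written as a small helper. This is only a sketch: the function name and the returned labels are illustrative assumptions, and the sample values are the P(a) entries for Glu, Ile and Gly from the preference parameter table.

```python
def describe_preference(parameter):
    """Classify a preference parameter relative to the neutral value 1.0."""
    if parameter > 1.0:
        return "prefers"
    if parameter < 1.0:
        return "avoids"
    return "indifferent"


# P(a) values for Glu, Ile and Gly from the preference parameter table.
for name, p_a in [("Glu", 1.53), ("Ile", 1.00), ("Gly", 0.53)]:
    print(name, describe_preference(p_a), "the alpha helix")
```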

Preference parameters

Residue  P(a)  P(b)  P(t)  f(i)   f(i+1)  f(i+2)  f(i+3)
Ala      1.45  0.97  0.57  0.049  0.049   0.034   0.029
Arg      0.79  0.90  1.00  0.051  0.127   0.025   0.101
Asn      0.73  0.65  1.68  0.101  0.086   0.216   0.065
Asp      0.98  0.80  1.26  0.137  0.088   0.069   0.059
Cys      0.77  1.30  1.17  0.089  0.022   0.111   0.089
Gln      1.17  1.23  0.56  0.050  0.089   0.030   0.089
Glu      1.53  0.26  0.44  0.011  0.032   0.053   0.021
Gly      0.53  0.81  1.68  0.104  0.090   0.158   0.113
His      1.24  0.71  0.69  0.083  0.050   0.033   0.033
Ile      1.00  1.60  0.58  0.068  0.034   0.017   0.051
Leu      1.34  1.22  0.53  0.038  0.019   0.032   0.051
Lys      1.07  0.74  1.01  0.060  0.080   0.067   0.073
Met      1.20  1.67  0.67  0.070  0.070   0.036   0.070
Phe      1.12  1.28  0.71  0.031  0.047   0.063   0.063
Pro      0.59  0.62  1.54  0.074  0.272   0.012   0.062
Ser      0.79  0.72  1.56  0.100  0.095   0.095   0.104
Thr      0.82  1.20  1.00  0.062  0.093   0.056   0.068
Trp      1.14  1.19  1.11  0.045  0.000   0.045   0.205
Tyr      0.61  1.29  1.25  0.136  0.025   0.110   0.102
Val      1.14  1.65  0.30  0.023  0.029   0.011   0.029
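To see which residues the table marks as the strongest helix formers and breakers, the P(a) column can be ranked. The sketch below copies the P(a) values from the table; the dictionary name `P_ALPHA` is an illustrative assumption.

```python
# P(a) column copied from the preference parameter table.
P_ALPHA = {
    "Ala": 1.45, "Arg": 0.79, "Asn": 0.73, "Asp": 0.98, "Cys": 0.77,
    "Gln": 1.17, "Glu": 1.53, "Gly": 0.53, "His": 1.24, "Ile": 1.00,
    "Leu": 1.34, "Lys": 1.07, "Met": 1.20, "Phe": 1.12, "Pro": 0.59,
    "Ser": 0.79, "Thr": 0.82, "Trp": 1.14, "Tyr": 0.61, "Val": 1.14,
}

# Sort residues from highest to lowest helix preference.
ranked = sorted(P_ALPHA, key=P_ALPHA.get, reverse=True)

strongest_formers = ranked[:3]        # highest P(a)
strongest_breakers = ranked[-2:][::-1]  # lowest P(a), lowest first

print("Helix formers:", strongest_formers)
print("Helix breakers:", strongest_breakers)
```

With these values, Glu, Ala and Leu rank highest, and Gly and Pro rank lowest.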

Applying the algorithm

Thresholds below are given on the scale of the preference parameter table, where 1.00 corresponds to 100 in percentage notation.

1. Assign parameters (propensities) to each residue.
2. Identify regions (nucleation sites) where 4 out of 6 residues have P(a) > 1.00: α-helix. Extend the helix in both directions until four contiguous residues have an average P(a) < 1.00: end of α-helix. If the segment is longer than 5 residues and its average P(a) > P(b): α-helix.
3. Repeat this procedure to locate all of the helical regions.
4. Identify regions where 3 out of 5 residues have P(b) > 1.00: β-sheet. Extend the sheet in both directions until four contiguous residues have an average P(b) < 1.00: end of β-sheet. If the average P(b) > 1.05 and P(b) > P(a): β-sheet.
5. For the rest: if P(a) > P(b), α-helix; if P(b) > P(a), β-sheet.
6. To identify a bend at residue number i, calculate the value p(t) = f(i) × f(i+1) × f(i+2) × f(i+3), where each factor is taken from the residue at that position of the tetrapeptide. If (1) p(t) > 0.000075, (2) the average P(t) > 1.00 in the tetrapeptide, and (3) the averages for the tetrapeptide obey P(a) < P(t) > P(b): β-turn.
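The nucleation search and the bend test can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation: it uses the values of the preference parameter table with one-letter residue codes, expresses thresholds on the table's scale (1.00 rather than 100), and leaves out the extension and overlap steps. All names (`TABLE`, `PARAMS`, `nucleation_sites`, `is_turn`) and the test sequences are assumptions made for this example.

```python
# Columns: residue, P(a), P(b), P(t), f(i), f(i+1), f(i+2), f(i+3)
TABLE = """
A 1.45 0.97 0.57 0.049 0.049 0.034 0.029
R 0.79 0.90 1.00 0.051 0.127 0.025 0.101
N 0.73 0.65 1.68 0.101 0.086 0.216 0.065
D 0.98 0.80 1.26 0.137 0.088 0.069 0.059
C 0.77 1.30 1.17 0.089 0.022 0.111 0.089
Q 1.17 1.23 0.56 0.050 0.089 0.030 0.089
E 1.53 0.26 0.44 0.011 0.032 0.053 0.021
G 0.53 0.81 1.68 0.104 0.090 0.158 0.113
H 1.24 0.71 0.69 0.083 0.050 0.033 0.033
I 1.00 1.60 0.58 0.068 0.034 0.017 0.051
L 1.34 1.22 0.53 0.038 0.019 0.032 0.051
K 1.07 0.74 1.01 0.060 0.080 0.067 0.073
M 1.20 1.67 0.67 0.070 0.070 0.036 0.070
F 1.12 1.28 0.71 0.031 0.047 0.063 0.063
P 0.59 0.62 1.54 0.074 0.272 0.012 0.062
S 0.79 0.72 1.56 0.100 0.095 0.095 0.104
T 0.82 1.20 1.00 0.062 0.093 0.056 0.068
W 1.14 1.19 1.11 0.045 0.000 0.045 0.205
Y 0.61 1.29 1.25 0.136 0.025 0.110 0.102
V 1.14 1.65 0.30 0.023 0.029 0.011 0.029
"""

# Step 1: make the parameters available per residue.
PARAMS = {}
for line in TABLE.strip().splitlines():
    aa, *values = line.split()
    pa, pb, pt, f0, f1, f2, f3 = map(float, values)
    PARAMS[aa] = {"pa": pa, "pb": pb, "pt": pt, "f": (f0, f1, f2, f3)}


def nucleation_sites(seq, key, window, needed, threshold=1.00):
    """Start indices of windows in which at least `needed` residues
    have a preference parameter `key` above `threshold`."""
    sites = []
    for start in range(len(seq) - window + 1):
        segment = seq[start:start + window]
        hits = sum(PARAMS[aa][key] > threshold for aa in segment)
        if hits >= needed:
            sites.append(start)
    return sites


def is_turn(seq, i):
    """Apply the three bend conditions to the tetrapeptide starting at i."""
    tetra = seq[i:i + 4]
    if len(tetra) < 4:
        return False

    # p(t): product of the position-specific frequencies f(i)..f(i+3).
    p_t = 1.0
    for offset, aa in enumerate(tetra):
        p_t *= PARAMS[aa]["f"][offset]

    def mean(key):
        return sum(PARAMS[aa][key] for aa in tetra) / 4

    return (p_t > 0.000075
            and mean("pt") > 1.00
            and mean("pa") < mean("pt") > mean("pb"))


print(nucleation_sites("EALMEA", "pa", window=6, needed=4))  # helix rule
print(nucleation_sites("VIVYG", "pb", window=5, needed=3))   # sheet rule
print(is_turn("NPGS", 0))
```

Keeping the nucleation rule generic (`key`, `window`, `needed`) lets the same function serve both the 4-of-6 helix rule and the 3-of-5 sheet rule.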

Successful method?

15 proteins evaluated:

• helix = 46%, β-sheet = 35%, turn = 65%

• Overall accuracy of predicting the three conformational states for all residues (helix, β, and coil) is 56%
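An overall per-residue accuracy of this kind is the fraction of residues whose predicted state matches the observed one. The sketch below assumes that definition; the function name `overall_accuracy`, the state letters (H = helix, E = β, C = coil) and the two short strings are hypothetical and are not data from the 15 proteins.

```python
def overall_accuracy(predicted, observed):
    """Percentage of residues whose predicted state equals the observed state."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


# Hypothetical state strings: H = helix, E = beta, C = coil.
observed = "HHHHCCEEEC"
predicted = "HHHCCCEEHC"
print(overall_accuracy(predicted, observed))
```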

Chou & Fasman: not so great?

After 1974: improvement of preference parameters
