Computational Molecular Biology Protein Structure: Introduction and Prediction

Upload: aislin

Post on 13-Jan-2016


TRANSCRIPT

Page 1: Computational Molecular Biology

Computational Molecular Biology

Protein Structure: Introduction and Prediction

Page 2: Computational Molecular Biology

My T. [email protected]

2

Protein Folding

One of the most important problems in molecular biology

Given the one-dimensional amino-acid sequence that specifies the protein, what is the protein’s fold in three dimensions?

Page 3: Computational Molecular Biology


Overview

Understand protein structures: primary, secondary, tertiary

Why study protein folding:
• Structure can reveal functional information that we cannot find from the sequence
• Misfolded proteins can cause diseases, e.g., mad cow disease
• Use in drug design

Page 4: Computational Molecular Biology


Overview of Protein Structure

Proteins make up about 50% of the dry (water-free) mass of a typical human cell

Play a vital role in keeping our bodies functioning properly

Biopolymers made up of amino acids

The order of the amino acids in a protein and the properties of their side chains determine the three-dimensional structure and function of the protein

Page 5: Computational Molecular Biology


Building blocks of proteins

Each amino acid consists of the following, attached to the central α-carbon:
• An amino group (-NH2)
• A carboxyl group (-COOH)
• A hydrogen atom (-H)
• A side chain group (-R)

There are 20 standard amino acids

Primary protein structure is the sequence of amino acids in the chain

[Figure: structure of an amino acid, labeling the amino group, the carboxyl group, and the side chain]

Page 6: Computational Molecular Biology


Side chains (Amino Acids)

20 amino acids have side chains that vary in structure, size, hydrogen bonding ability, and charge.

R gives the amino acid its identity

R can be as simple as a hydrogen atom (glycine) or more complex, such as an aromatic ring system (tryptophan)

Page 7: Computational Molecular Biology


Chemical Structure of Amino Acids

Page 8: Computational Molecular Biology


How Amino Acids Become Proteins

Peptide bonds

Page 9: Computational Molecular Biology


Polypeptide

A chain of more than fifty amino acids is called a polypeptide. A protein is usually composed of 50 to 400+ amino acids. We call the units of a protein amino acid residues.

[Figure labels: carbonyl carbon, amide nitrogen]

Page 10: Computational Molecular Biology


Side chain properties

Carbon does not readily form hydrogen bonds with water: hydrophobic
• These ‘water-fearing’ side chains tend to sequester themselves in the interior of the protein

O and N are generally more likely than C to hydrogen-bond with water: hydrophilic
• These side chains tend to turn outward, toward the exterior of the protein

Page 11: Computational Molecular Biology


Page 12: Computational Molecular Biology


Primary Structure

Primary structure: linear string of amino acids

[Figure labels: backbone, side chain]

... ALA PHE LEU ILE LEU ARG ...

Each amino acid within a protein is referred to as a residue

Each different protein has a unique sequence of amino acid residues; this is its primary structure
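As a small illustration of primary structure as a string, the three-letter fragment above can be rewritten in the one-letter alphabet. This is a minimal sketch: the mapping is the standard one for the 20 amino acids, while the names `THREE_TO_ONE` and `to_one_letter` are chosen here for illustration only.

```python
# Standard three-letter to one-letter codes for the 20 amino acids.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}

def to_one_letter(residues: str) -> str:
    """Convert whitespace-separated three-letter codes into a one-letter string."""
    return "".join(THREE_TO_ONE[code.upper()] for code in residues.split())

fragment = to_one_letter("ALA PHE LEU ILE LEU ARG")
print(fragment)  # AFLILR
```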

Page 13: Computational Molecular Biology


Secondary Structure

Refers to the spatial arrangement of contiguous amino acid residues

Regularly repeating local structures stabilized by hydrogen bonds
• Hydrogen bond: a hydrogen atom attached to a relatively electronegative atom is attracted to another electronegative atom nearby

Examples of secondary structure are the α-helix and the β-pleated sheet

Page 14: Computational Molecular Biology


Alpha-Helix

Amino acids adopt the form of a right-handed spiral

The polypeptide backbone forms the inner part of the spiral

The side chains project outward

Every backbone N-H group donates a hydrogen bond to the backbone C=O group of the residue four positions earlier

Page 15: Computational Molecular Biology


Beta-Pleated-Sheet

Consists of extended stretches of polypeptide chain called beta-strands, aligned adjacent to each other in parallel or anti-parallel orientation

Hydrogen bonding between the strands keeps them together, forming the sheet

Hydrogen bonding occurs between backbone N-H and C=O groups of different strands

Page 16: Computational Molecular Biology


Parallel Beta Sheets

Page 17: Computational Molecular Biology


Anti-Parallel Beta Sheets

Page 18: Computational Molecular Biology


Mixed Beta Sheets

Page 19: Computational Molecular Biology


Tertiary Structure

The full three-dimensional structure, describing the overall shape of a protein

Also known as its fold

Page 20: Computational Molecular Biology


Quaternary Structure

Many proteins are made up of multiple polypeptide chains, each called a subunit

The spatial arrangement of these subunits is referred to as the quaternary structure

Sometimes distinct proteins must combine in order to form the correct 3-dimensional structure for a particular protein to function properly.

Example: the protein hemoglobin, which carries oxygen in blood. Hemoglobin is made of four similar polypeptide chains (subunits) that combine to form its quaternary structure.

Page 21: Computational Molecular Biology


Other Units of Structure

Motifs (super-secondary structure):
• Frequently occurring combinations of secondary structure units
• A pattern of alpha-helices and beta-strands

Domains:
• A protein chain often consists of different regions, or domains
• Domains within a protein often perform different functions
• Can have completely different structures and folds
• Typically 100 to 400 residues long

Page 22: Computational Molecular Biology


What Determines Structure

What causes a protein to fold in a particular way?

At a fundamental level, chemical interactions between all the amino acids in the sequence contribute to a protein’s final conformation

There are four fundamental chemical forces:
• Hydrogen bonds
• Hydrophobic effect
• Van der Waals forces
• Electrostatic forces

Page 23: Computational Molecular Biology


Hydrogen Bonds

Occur when a pair of electronegative atoms, such as oxygen and nitrogen, share a hydrogen between them

The pattern of hydrogen bonding is essential in stabilizing basic secondary structures

Page 24: Computational Molecular Biology


Van der Waals Forces

Interactions between immediately adjacent atoms

Result from the attraction between an atom’s nucleus and its neighbor’s electrons

Page 25: Computational Molecular Biology


Electrostatic Forces

Oppositely charged side chains can form salt bridges, which pull chains together

Page 26: Computational Molecular Biology


Experimental Determination

Protein structures are deposited in a centralized database called the Protein Data Bank (PDB), accessible at http://www.rcsb.org/pdb/index.html

Two main techniques are used to determine/verify the structure of a given protein:
• X-ray crystallography
• Nuclear Magnetic Resonance (NMR)

Both are slow, labor-intensive, and expensive (determining a structure sometimes takes longer than a year!)

Page 27: Computational Molecular Biology


X-ray Crystallography

A technique that can reveal the precise three dimensional positions of most of the atoms in a protein molecule

The protein is first isolated to yield a high concentration solution of the protein

This solution is then used to grow crystals

The resulting crystal is then exposed to an X-ray beam

Page 28: Computational Molecular Biology


Disadvantages

Not all proteins can be crystallized

The crystalline structure of a protein may be different from its structure in solution

Multiple maps may be needed to get a consensus

Page 29: Computational Molecular Biology


NMR

The spinning of certain atomic nuclei generates a magnetic moment

NMR measures the energy levels of such magnetic nuclei (radio frequency)

These levels are sensitive to the environment of the atom:
• What they are bonded to, which atoms they are close to spatially, what distances are between different atoms…

Thus, by careful measurement, the structure of the protein can be constructed

Page 30: Computational Molecular Biology


Disadvantages

Constraint on the size of the protein: the practical upper bound is about 200 residues

Protein structure is very sensitive to pH.

Page 31: Computational Molecular Biology


Computational Methods

Because the experimental methods are long and painful, we need computational approaches to predict a protein's structure from its sequence.

Page 32: Computational Molecular Biology


Functional Region Prediction

Page 33: Computational Molecular Biology


Protein Secondary Structure

Page 34: Computational Molecular Biology


Tertiary Structure Prediction

Page 35: Computational Molecular Biology


More Details on X-ray Crystallography

Page 36: Computational Molecular Biology


Overview

Page 37: Computational Molecular Biology


Overview

Page 38: Computational Molecular Biology


Crystal

A crystal can be defined as an arrangement of building blocks which is periodic in three dimensions

Page 39: Computational Molecular Biology


Crystallize a Protein

Have to find the right combination of all the different influences to get the protein to crystallize

This can take a couple of hundred or even thousands of experiments

Most popular way to conduct these experiments: the hanging-drop method

Page 40: Computational Molecular Biology


Hanging drop method

The reservoir contains a precipitant concentration twice as high as that of the protein solution

The protein solution is made up of 50% stock protein solution and 50% reservoir solution

Over time, water will diffuse from the protein drop into the reservoir

Both the protein concentration and precipitant concentration will increase

Crystals will appear after days, weeks, or months
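A back-of-the-envelope sketch of the arithmetic behind the hanging drop, under idealized assumptions that are not stated on the slide: only water leaves the drop, the reservoir is large enough that its concentration stays constant, and equilibrium is reached when the precipitant concentration in the drop matches the reservoir. The function name and the example numbers are hypothetical.

```python
def hanging_drop(protein_stock, reservoir_precipitant):
    """Return (start, equilibrium) drop concentrations as (protein, precipitant) pairs."""
    # The drop is a 1:1 mix of stock protein solution and reservoir solution,
    # so both components start at half their source concentration.
    protein_start = protein_stock / 2
    precipitant_start = reservoir_precipitant / 2
    # Water leaves until the drop's precipitant matches the reservoir:
    # final volume / initial volume = precipitant_start / reservoir_precipitant.
    volume_ratio = precipitant_start / reservoir_precipitant
    start = (protein_start, precipitant_start)
    equilibrium = (protein_start / volume_ratio, precipitant_start / volume_ratio)
    return start, equilibrium

start, equilibrium = hanging_drop(protein_stock=10.0, reservoir_precipitant=2.0)
print(start)        # (5.0, 1.0)
print(equilibrium)  # (10.0, 2.0)
```

Under these assumptions the drop shrinks to half its volume, so both concentrations double on the way to equilibrium.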

Page 41: Computational Molecular Biology


Properties of protein crystal

Very soft

Mechanically fragile

Large solvent areas (30-70%)

Page 42: Computational Molecular Biology


A Schematic Diffraction Experiment

Page 43: Computational Molecular Biology


Why do we need Crystals

A single molecule could never be oriented and handled properly for a diffraction experiment

In a crystal, we have about 10^15 molecules in the same orientation so that we get a tremendous amplification of the diffraction

Crystals produce much simpler diffraction patterns than single molecules

Page 44: Computational Molecular Biology


Why do we need X-rays

X-rays are electromagnetic waves with a wavelength close to the distances between atoms in the protein molecules

To get information about where the atoms are, we need to resolve them; thus we need radiation of a comparable wavelength

Page 45: Computational Molecular Biology


A Diffraction Pattern

Page 46: Computational Molecular Biology


Page 47: Computational Molecular Biology


Resolution

The primary measure of crystal order/quality of the model

Ranges of resolution:
• Low resolution (>3-5 Å): difficult to see the side chains, only the overall structural fold
• Medium resolution (2.5-3 Å)
• High resolution (2.0 Å)

Page 48: Computational Molecular Biology


Some Crystallographic Terms

h,k,l: Miller indices (like a name for the reflection)

I(h,k,l): intensity of the reflection

2θ: angle between the incident X-ray beam and the reflected beam

Page 49: Computational Molecular Biology


Diffraction by a Molecule in a Crystal

The electric vector of the X-ray wave forces the electrons in our sample to oscillate at the same frequency as the incoming wave

Page 50: Computational Molecular Biology


Description of Waves

Page 51: Computational Molecular Biology


Structure Factor Equation

fj: proportional to the number of electrons this atom j has

One of the fundamental equations in X-ray Crystallography
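The equation itself was an image on the slide and did not survive in the transcript. As an assumption about what was shown, the standard textbook form of the structure factor for reflection (hkl) is given below, where the sum runs over the N atoms in the unit cell and (x_j, y_j, z_j) are the fractional coordinates of atom j; N and the coordinate symbols are notation introduced here.

```latex
F(hkl) = \sum_{j=1}^{N} f_j \, \exp\!\left[ 2\pi i \,( h x_j + k y_j + l z_j ) \right]
```

Each atom contributes a wave whose amplitude is set by f_j and whose phase is set by the atom's position in the cell.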

Page 52: Computational Molecular Biology


The Phase

From the measurement, we can only obtain the intensity I(hkl) of any given reflection (hkl)

The phase α(hkl) cannot be measured

Page 53: Computational Molecular Biology


How to Determine the Phase

Small changes are introduced into the crystal of the protein of interest:
• E.g., soaking the crystal in a solution containing a heavy-atom compound

A second diffraction data set needs to be collected

Comparing the two data sets determines the phases (and also localizes the heavy atoms)

Page 54: Computational Molecular Biology


Other Phase Determination Methods

Page 55: Computational Molecular Biology


Electron Density Map

Once we know the complete diffraction pattern (amplitudes and phases), we need to calculate an image of the structure

The above equation returns the electron density (so we get a map of where the electrons are and their concentration)
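The equation referred to here was an image and is missing from the transcript. As an assumption about its content, the standard textbook form of the electron density synthesis is shown below: a Fourier sum over all reflections, using the measured amplitudes |F(hkl)| and the phases α(hkl), with V denoting the unit-cell volume (a symbol introduced here).

```latex
\rho(x, y, z) = \frac{1}{V} \sum_{h} \sum_{k} \sum_{l}
    \left| F(hkl) \right| \, e^{\, i \alpha(hkl)} \, e^{-2\pi i \,( h x + k y + l z )}
```

Since the experiment yields only intensities, which are proportional to |F(hkl)|^2, the phases have to come from a separate method before this sum can be evaluated.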

Page 56: Computational Molecular Biology


Interpretation of Electron Density

Now, the electron density has to be interpreted in terms of atom identities and positions.

(1): packing of the whole molecules in the crystal is shown

(2): a chain of seven amino acids is shown with the resulting structure superimposed

(3): the electron density of a tryptophan side chain is shown

Page 57: Computational Molecular Biology


Refinement and the R-Factor

Page 58: Computational Molecular Biology


Nuclear Magnetic Resonance

Concentrated protein solution (highly purified)

Magnetic field

Effect of radio frequencies on the resonance of different atoms is measured.

Page 59: Computational Molecular Biology


Page 60: Computational Molecular Biology


NMR

Behavior of any atom is influenced by neighboring atoms
• More closely spaced residues are more perturbed than distant residues
• Distances can be calculated from the perturbation

Page 61: Computational Molecular Biology


NMR spectrum of a protein

Page 62: Computational Molecular Biology

Computational Molecular Biology

Protein Structure: Secondary Prediction

Page 63: Computational Molecular Biology


Primary Structure: Symbolic Definition

A = {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y}: the set of symbols denoting all 20 amino acids

A*: the set of all finite sequences formed out of elements of A, called protein sequences

Elements of A* are denoted by x, y, z, …, i.e., we write x ∈ A*, y ∈ A*, z ∈ A*, etc.

PROTEIN PRIMARY STRUCTURE: any x ∈ A* is also called a protein sequence or protein sub-unit

Page 64: Computational Molecular Biology


Protein Secondary Structure (PSS)

Secondary structure: the arrangement of the peptide backbone in space. It is produced by hydrogen bonding between amino acids

PROTEIN SECONDARY STRUCTURE consists of a protein sequence and its hydrogen-bonding patterns, called SS categories

Page 65: Computational Molecular Biology


Protein Secondary Structure

Databases for protein sequences are expanding rapidly

The number of determined protein structures (PSS: protein secondary structures) is still limited compared with the number of known protein sequences

PSSP (Protein Secondary Structure Prediction) research is trying to bridge this gap.

Page 66: Computational Molecular Biology


Protein Secondary Structure

The most commonly observed conformations in secondary structure are:
• Alpha helix
• Beta sheets/strands
• Loops/turns

Page 67: Computational Molecular Biology


Turns and Loops

Secondary structure elements are connected by regions of turns and loops

Turns: short regions of non-α, non-β conformation

Loops – larger stretches with no secondary structure.

Page 68: Computational Molecular Biology


Three secondary structure states

Prediction methods are normally assessed for 3 states:
• H (helix)
• E (strands)
• L (others: loop or turn)

Page 69: Computational Molecular Biology


Secondary Structure

8 different categories:
• H: α-helix
• G: 3₁₀-helix
• I: π-helix (extremely rare)
• E: β-strand
• B: β-bridge
• T: β-turn
• S: bend
• L: the rest

Page 70: Computational Molecular Biology


Three SS states: Reduction methods

Method 1, used by DSSP program:
• H (helix) = {G (3₁₀-helix), H (α-helix)}
• E (strands) = {E (β-strand), B (β-bridge)}
• L = all the rest
• In short: E,B => E; G,H => H; rest => L (coil, often written C)

Method 2, used by STRIDE program:
• H as in Method 1
• E = {E (β-strand), b (isolated β-bridge)}
• L = all the rest

Page 71: Computational Molecular Biology


Three SS states: Reduction methods

Method 3, used by DEFINE program:
• H (helix) as in Method 1
• E (strands) = {E (β-strand)}
• L = all the rest
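The three reduction methods are simple symbol-to-symbol mappings, which can be sketched as lookup tables. This is an illustrative sketch that follows the slides literally; `REDUCTIONS` and `reduce_states` are names chosen here, and the 8-state input string is made up for the example.

```python
# Reduction tables taken from the three methods on the slides.
# Any state not listed falls through to "L" (all the rest).
REDUCTIONS = {
    "DSSP":   {"H": "H", "G": "H", "E": "E", "B": "E"},
    "STRIDE": {"H": "H", "G": "H", "E": "E", "b": "E"},
    "DEFINE": {"H": "H", "G": "H", "E": "E"},
}

def reduce_states(ss8: str, method: str = "DSSP") -> str:
    """Collapse an 8-state assignment string into the 3 states H, E, L."""
    table = REDUCTIONS[method]
    return "".join(table.get(state, "L") for state in ss8)

print(reduce_states("HHHGGGTTEEEBSS"))            # HHHHHHLLEEEELL
print(reduce_states("HHHGGGTTEEEBSS", "DEFINE"))  # HHHHHHLLEEELLL
```

The only difference between the two outputs is the bridge residue B, which Method 1 counts as a strand and Method 3 counts as "the rest".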

Page 72: Computational Molecular Biology


Example of typical PSS Data

Example:

Sequence:
KELVLALYDYQEKSPREVTHKKGDILTLLNSTNKDWWKYEYNDRQGFVP

Observed SS:
HHHHHLLLLEEEHHHLLLEEEEEELLLHHHHHHHHLLLEEEEEELLLHHH

Page 73: Computational Molecular Biology


PSS: Symbolic Definition

Given A = {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y}, the set of symbols denoting amino acids, and a protein sequence x ∈ A*

Let S = {H, E, L} be the set of symbols of 3 states: H (helix), E (strands) and L (loop), and let S* be the set of all finite sequences of elements of S.

We denote elements of S* by e, e ∈ S*

Page 74: Computational Molecular Biology


PSS: Symbolic Definition

Any one-to-one function f : A* → S*, i.e., f ⊆ A* × S*, is called a protein secondary structure (PSS) identification function

An element (x, e) ∈ f is called a protein secondary structure (of the protein sequence x)

The element e ∈ S* (of (x, e) ∈ f) is called a secondary structure.

Page 75: Computational Molecular Biology


PSSP

If a protein sequence shows clear similarity to a protein of known three-dimensional structure, then the most accurate method of predicting the secondary structure is to align the sequences by standard dynamic programming algorithms

Why? Homology modelling is much more accurate than secondary structure prediction for high levels of sequence identity.

Page 76: Computational Molecular Biology


PSSP

Secondary structure prediction methods are of most use when sequence similarity to a protein of known structure is undetectable.

It is important that there is no detectable sequence similarity between sequences used to train and test secondary structure prediction methods.

Page 77: Computational Molecular Biology


Classification and Classifiers

We are given a database table DB with a special attribute C, called a class attribute (or decision attribute). The values c1, c2, ..., cn of the class attribute are called class labels.

Example:

A1  A2  A3  A4  C
1   1   m   g   c1
0   1   v   g   c2
1   0   m   b   c1

Page 78: Computational Molecular Biology


Classification and Classifiers

The attribute C partitions the records in the DB: it divides the records into disjoint subsets defined by the values of attribute C, i.e., it CLASSIFIES the records.

This means we use the attribute C and its values to divide the set R of records of DB into n disjoint classes:

C1 = {r ∈ DB : C = c1}, ..., Cn = {r ∈ DB : C = cn}

Example (from our table):

C1 = {(1,1,m,g), (1,0,m,b)} = {r1, r3}

C2 = {(0,1,v,g)} = {r2}
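The partition induced by the class attribute can be sketched in a few lines. This is only an illustration using the three records of the example table; the tuple layout (attributes first, class label last) and the function name `partition_by_class` are assumptions made here.

```python
from collections import defaultdict

# The three records r1, r2, r3 of the example table: (A1, A2, A3, A4, C).
records = [
    (1, 1, "m", "g", "c1"),
    (0, 1, "v", "g", "c2"),
    (1, 0, "m", "b", "c1"),
]

def partition_by_class(rows):
    """Group records into disjoint classes keyed by their class label."""
    classes = defaultdict(list)
    for *attributes, label in rows:
        classes[label].append(tuple(attributes))
    return dict(classes)

classes = partition_by_class(records)
print(classes["c1"])  # [(1, 1, 'm', 'g'), (1, 0, 'm', 'b')]
print(classes["c2"])  # [(0, 1, 'v', 'g')]
```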

Page 79: Computational Molecular Biology


Classification and Classifiers

An algorithm is called a classification algorithm if it uses the data and its classification to build a set of patterns.

Those patterns are structured in such a way that we can use them to classify unknown sets of objects, i.e., unknown records.

For that reason (because of the goal), the classification algorithm is often called simply a classifier.

The name classifier implies more than just a classification algorithm. A classifier is the final product of a data set and a classification algorithm.

Page 80: Computational Molecular Biology


Classification and Classifiers

Building a classifier consists of two phases: training and testing.

In both phases we use data (a training data set and a test data set disjoint from it) for which the class labels are known for ALL of the records.

We use the training data set to create patterns.

We evaluate the created patterns using the test data, whose classification is known.

The measure of a trained classifier's accuracy is called predictive accuracy.

The classifier is built, i.e., we terminate the process, once it has been trained and tested and its predictive accuracy is at an acceptable level.

Page 81: Computational Molecular Biology


Classifiers Predictive Accuracy

The PREDICTIVE ACCURACY of a classifier is the percentage of correctly classified records in the test data set.

Predictive accuracy depends heavily on the choice of the test and training data.

There are many methods of choosing test and training sets, and hence of evaluating the predictive accuracy. This is a separate field of research.

Page 82: Computational Molecular Biology


Accuracy Evaluation

Use the training data to adjust the parameters of the method until it gives the best agreement between its predictions and the known classes

Use the testing data to evaluate how well the method works (without adjusting parameters!)

How do we report the performance?
• Average accuracy = fraction of all test examples that were classified correctly

Page 83: Computational Molecular Biology


Accuracy Evaluation

Multiple cross-validation tests have to be performed to exclude a potential dependency of the evaluated accuracy on the particular test set chosen

Jack-knife:
• Use 129 chains for setting up the tool (training set)
• Use 1 chain for estimating the performance (testing)
• This has to be repeated 130 times, until each protein has been used once for testing
• The average over all 130 tests gives an estimate of the prediction accuracy
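The jack-knife (leave-one-out) procedure can be sketched generically. This is a toy illustration, not a real PSSP method: `train_majority` is a hypothetical stand-in for the training step, and the chain identifiers and labels are made up.

```python
from collections import Counter

def train_majority(train_items, train_labels):
    """Toy stand-in for a real training step: always predict the most common label."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return lambda item: most_common

def jackknife_accuracy(items, labels, train):
    """Leave-one-out estimate: each item serves as the test set exactly once."""
    hits = 0
    for i in range(len(items)):
        # Train on everything except item i, then test on item i alone.
        predict = train(items[:i] + items[i + 1:], labels[:i] + labels[i + 1:])
        hits += predict(items[i]) == labels[i]
    return hits / len(items)

chains = ["chain1", "chain2", "chain3", "chain4"]  # hypothetical identifiers
labels = ["H", "H", "H", "E"]                      # hypothetical class labels
print(jackknife_accuracy(chains, labels, train_majority))  # 0.75
```

With 130 chains the loop would run 130 times, each time training on the other 129.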

Page 84: Computational Molecular Biology


PSSP Datasets

Historic RS126 dataset. Contains 126 sub-units with known secondary structure, selected by Rost and Sander. It is no longer used today

CB513 dataset. Contains 513 sub-units with known secondary structure, selected by Cuff and Barton in 1999. Used quite frequently in PSSP research

HS1771 dataset. Created by Hobohm and Scharf. In March 2002 it contained 1771 sub-units

Many authors have their own “secret” datasets

Page 85: Computational Molecular Biology


Measures for PSSP accuracy

http://cubic.bioc.columbia.edu/eva/doc/measure_sec.html (for more information)

Q3: three-state prediction accuracy (percentage of residues classified correctly)

Qi%obs: how many of the observed residues were correctly predicted?

Qi%prd: how many of the predicted residues were correctly predicted?

Page 86: Computational Molecular Biology


Measures for PSSP Accuracy

A_ij = number of residues predicted to be in structure type j and observed to be in type i

Number of residues predicted to be in structure i:

$$a_i = \sum_{j=1}^{3} A_{ji}$$

Number of residues observed to be in structure i:

$$b_i = \sum_{j=1}^{3} A_{ij}$$

Page 87: Computational Molecular Biology


Measures for SSP Accuracy

The percentage of residues correctly predicted to be in class i relative to those observed to be in class i:

$$Q_i = Q_i^{\%obs} = \frac{A_{ii}}{b_i} \times 100$$

The percentage of residues correctly predicted to be in class i from all residues predicted to be in i:

$$Q_i^{\%prd} = \frac{A_{ii}}{a_i} \times 100$$

Overall 3-state accuracy, where b is the total number of residues:

$$Q_3 = \frac{\sum_{i=1}^{3} A_{ii}}{b} \times 100$$
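These measures follow directly from the 3x3 matrix A. The sketch below builds A from an observed and a predicted state string and then computes Q3, Qi%obs and Qi%prd; the function names and the two example strings are hypothetical, and rows of A index the observed state while columns index the predicted state, as in the definition of A_ij.

```python
def confusion_matrix(observed, predicted, states="HEL"):
    """A[i][j] = residues observed in state i and predicted in state j."""
    index = {state: n for n, state in enumerate(states)}
    A = [[0] * len(states) for _ in states]
    for obs, prd in zip(observed, predicted):
        A[index[obs]][index[prd]] += 1
    return A

def accuracy_measures(A):
    """Return (Q3, per-state Q%obs list, per-state Q%prd list) in percent."""
    n = len(A)
    b = [sum(A[i]) for i in range(n)]                       # observed in state i
    a = [sum(A[j][i] for j in range(n)) for i in range(n)]  # predicted in state i
    q_obs = [100 * A[i][i] / b[i] if b[i] else None for i in range(n)]
    q_prd = [100 * A[i][i] / a[i] if a[i] else None for i in range(n)]
    q3 = 100 * sum(A[i][i] for i in range(n)) / sum(b)
    return q3, q_obs, q_prd

A = confusion_matrix("HHHHEEELLL", "HHHEEEELLH")
q3, q_obs, q_prd = accuracy_measures(A)
print(q3)     # 80.0
print(q_prd)  # [75.0, 75.0, 100.0]
```

States that never occur as observed (or predicted) yield `None` rather than a division by zero.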

Page 88: Computational Molecular Biology


PSSP Algorithms

There are three generations of PSSP algorithms

• First Generation: based on statistical information of single amino acids (1960s and 1970s)

• Second Generation: based on windows (segments) of amino acids. Typically a window contains 11-21 amino acids (dominating the field until the early 1990s)

• Third Generation: based on the use of windows on evolutionary information

Page 89: Computational Molecular Biology


PSSP: First Generation

First generation PSSP systems are based on statistical information on a single amino acid

The most relevant algorithms:
• Chou-Fasman, 1974
• GOR, 1978

Both algorithms claimed 74-78% predictive accuracy, but when tested with better-constructed datasets they were shown to have a predictive accuracy of ~50% (Nishikawa, 1983)

Page 90: Computational Molecular Biology


Chou-Fasman method

Uses a table of conformational parameters determined primarily from measurements of known structures (from experimental methods)

The table consists of one “likelihood” for each structure for each amino acid

Based on frequencies of residues in α-helices, β-sheets and turns

Notation:
• P(H): propensity to form alpha helices
• f(i): probability of being in position 1 (of a turn)

Page 91: Computational Molecular Biology


Chou-Fasman Pij-values

Name P(H) P(E) P(turn) f(i) f(i+1) f(i+2) f(i+3)

Alanine 142 83 66 0.06 0.076 0.035 0.058

Arginine 98 93 95 0.07 0.106 0.099 0.085

Aspartic Acid 101 54 146 0.147 0.11 0.179 0.081

Asparagine 67 89 156 0.161 0.083 0.191 0.091

Cysteine 70 119 119 0.149 0.05 0.117 0.128

Glutamic Acid 151 37 74 0.056 0.06 0.077 0.064

Glutamine 111 110 98 0.074 0.098 0.037 0.098

Glycine 57 75 156 0.102 0.085 0.19 0.152

Histidine 100 87 95 0.14 0.047 0.093 0.054

Isoleucine 108 160 47 0.043 0.034 0.013 0.056

Leucine 121 130 59 0.061 0.025 0.036 0.07

Lysine 114 74 101 0.055 0.115 0.072 0.095

Methionine 145 105 60 0.068 0.082 0.014 0.055

Phenylalanine 113 138 60 0.059 0.041 0.065 0.065

Proline 57 55 152 0.102 0.301 0.034 0.068

Serine 77 75 143 0.12 0.139 0.125 0.106

Threonine 83 119 96 0.086 0.108 0.065 0.079

Tryptophan 108 137 96 0.077 0.013 0.064 0.167

Tyrosine 69 147 114 0.082 0.065 0.114 0.125

Valine 106 170 50 0.062 0.048 0.028 0.053

Page 92: Computational Molecular Biology


Chou-Fasman

A prediction is made for each type of structure for each amino acid
• Can result in ambiguity if a region has high propensities for both helix and sheet (the higher value is usually chosen)

Page 93: Computational Molecular Biology


Chou-Fasman

How it works:
1. Assign all of the residues the appropriate set of parameters
2. Identify α-helix and β-sheet regions. Extend the regions in both directions.
3. If structures overlap, compare average values for P(H) and P(E) and assign secondary structure based on best scores.
4. Turns are calculated using 2 different probability values.

Page 94: Computational Molecular Biology


Assign Pij values

1. Assign all of the residues the appropriate set of parameters

           T    S    P    T    A    E    L    M    R    S    T    G
P(H)      69   77   57   69  142  151  121  145   98   77   69   57
P(E)     147   75   55  147   83   37  130  105   93   75  147   75
P(turn)  114  143  152  114   66   74   59   60   95  143  114  156
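Step 1 amounts to a table lookup. A minimal sketch, assuming one-letter residue codes and the P(H), P(E), P(turn) columns of the Pij table above; `CHOU_FASMAN` and `assign_parameters` are names chosen here. Note that looking up T (threonine) in the table gives 83/119/96, whereas the values printed under T in the slide's rows (69/147/114) are the table's tyrosine (Y) entries.

```python
# (P(H), P(E), P(turn)) per residue, copied from the Pij table,
# keyed by the standard one-letter code.
CHOU_FASMAN = {
    "A": (142, 83, 66),  "R": (98, 93, 95),   "D": (101, 54, 146),
    "N": (67, 89, 156),  "C": (70, 119, 119), "E": (151, 37, 74),
    "Q": (111, 110, 98), "G": (57, 75, 156),  "H": (100, 87, 95),
    "I": (108, 160, 47), "L": (121, 130, 59), "K": (114, 74, 101),
    "M": (145, 105, 60), "F": (113, 138, 60), "P": (57, 55, 152),
    "S": (77, 75, 143),  "T": (83, 119, 96),  "W": (108, 137, 96),
    "Y": (69, 147, 114), "V": (106, 170, 50),
}

def assign_parameters(sequence):
    """Return three parallel lists: P(H), P(E), P(turn) for each residue."""
    p_h, p_e, p_turn = zip(*(CHOU_FASMAN[residue] for residue in sequence))
    return list(p_h), list(p_e), list(p_turn)

p_h, p_e, p_turn = assign_parameters("TSPTAELMRSTG")
print(p_h)  # [83, 77, 57, 83, 142, 151, 121, 145, 98, 77, 83, 57]
```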

Page 95: Computational Molecular Biology


Scan peptide for helix regions

2. Identify regions where 4 out of 6 residues have P(H) > 100: an “alpha-helix nucleus”

           T    S    P    T    A    E    L    M    R    S    T    G
P(H)      69   77   57   69  142  151  121  145   98   77   69   57

Page 96: Computational Molecular Biology


Extend α-helix nucleus

3. Extend the helix in both directions until reaching a set of four consecutive residues with an average P(H) < 100.

           T    S    P    T    A    E    L    M    R    S    T    G
P(H)      69   77   57   69  142  151  121  145   98   77   69   57

Find the sum of P(H) and the sum of P(E) in the extended region

If the region is long enough (>= 5 letters) and sum P(H) > sum P(E), then declare the extended region an alpha helix
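The nucleation and extension steps can be sketched as follows. This is one possible reading, not a reference implementation: it uses the P(H) and P(E) rows exactly as printed on the slides, assumes the stop test looks at the average of four consecutive residues (as the slides state for β-sheets), and grows the region one residue at a time. All function names are chosen here for illustration.

```python
P_H = [69, 77, 57, 69, 142, 151, 121, 145, 98, 77, 69, 57]   # as printed on the slide
P_E = [147, 75, 55, 147, 83, 37, 130, 105, 93, 75, 147, 75]  # as printed on the slide

def helix_nuclei(p_h, window=6, needed=4, threshold=100):
    """Start indices of 6-residue windows with at least 4 values above 100."""
    return [s for s in range(len(p_h) - window + 1)
            if sum(v > threshold for v in p_h[s:s + window]) >= needed]

def mean(values):
    return sum(values) / len(values)

def extend(p, start, end, threshold=100, probe=4):
    """Grow the half-open region [start, end) until a 4-residue average drops below 100."""
    # Left: accept residue start-1 while the 4 residues beginning there average >= 100.
    while start > 0 and mean(p[start - 1:start - 1 + probe]) >= threshold:
        start -= 1
    # Right: accept residue `end` while the 4 residues ending there average >= 100.
    while end < len(p) and mean(p[end - probe + 1:end + 1]) >= threshold:
        end += 1
    return start, end

def is_helix(p_h, p_e, start, end, min_len=5):
    """Length and sum test from the slide for declaring the region a helix."""
    return end - start >= min_len and sum(p_h[start:end]) > sum(p_e[start:end])

nuclei = helix_nuclei(P_H)
start, end = extend(P_H, nuclei[0], nuclei[0] + 6)
print(nuclei)                            # [2, 3, 4]
print(start, end)                        # 2 10
print(is_helix(P_H, P_E, start, end))    # True
```

Under these assumptions the region covers residues at indices 2 to 9 (P T A E L M R S), where the sum of P(H) is 860 against 725 for P(E).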

Page 97: Computational Molecular Biology


Scan peptide for β-sheet regions

4. Identify regions where 3 out of 5 residues have P(E) > 100: a “β-sheet nucleus”

5. Extend the β-sheet until 4 contiguous residues have an average P(E) < 100

6. If the region average > 100 and the average P(E) > average P(H), then “β-sheet”

           T    S    P    T    A    E    L    M    R    S    T    G
P(H)      69   77   57   69  142  151  121  145   98   77   69   57
P(E)     147   75   55  147   83   37  130  105   93   75  147   75

Page 98: Computational Molecular Biology


Overlapping

Resolving overlapping alpha helix & beta sheet
• Compute the sum of P(H) and the sum of P(E) in the overlap.
• If sum P(H) > sum P(E) => alpha helix
• If sum P(E) > sum P(H) => beta sheet

Page 99: Computational Molecular Biology


Turn Prediction

An amino acid is predicted as a turn if all of the following hold:
• f(i)*f(i+1)*f(i+2)*f(i+3) > 0.000075
• Avg(P(turn)) over positions i+k > 100, for k = 0, 1, 2, 3
• Sum(P(turn)) > Sum(P(H)) and Sum(P(turn)) > Sum(P(E)) over positions i+k (k = 0, 1, 2, 3)
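The three turn conditions can be checked for a four-residue window as sketched below. This is an illustrative sketch only: `PARAMS` holds just a handful of rows copied from the Pij table (the rest would be added the same way), the function name `is_turn` is chosen here, and the two test peptides are made up.

```python
# (P(H), P(E), P(turn), f(i), f(i+1), f(i+2), f(i+3)) for a few residues,
# copied from the Pij table; the remaining rows follow the same layout.
PARAMS = {
    "N": (67, 89, 156, 0.161, 0.083, 0.191, 0.091),
    "P": (57, 55, 152, 0.102, 0.301, 0.034, 0.068),
    "D": (101, 54, 146, 0.147, 0.110, 0.179, 0.081),
    "G": (57, 75, 156, 0.102, 0.085, 0.190, 0.152),
    "A": (142, 83, 66, 0.060, 0.076, 0.035, 0.058),
    "E": (151, 37, 74, 0.056, 0.060, 0.077, 0.064),
    "L": (121, 130, 59, 0.061, 0.025, 0.036, 0.070),
    "M": (145, 105, 60, 0.068, 0.082, 0.014, 0.055),
}

def is_turn(sequence, i, cutoff=0.000075):
    """Apply the three turn conditions to residues i .. i+3."""
    rows = [PARAMS[residue] for residue in sequence[i:i + 4]]
    if len(rows) < 4:
        return False
    # Product f(i)*f(i+1)*f(i+2)*f(i+3): residue k uses the f column for position k.
    bend = 1.0
    for k, row in enumerate(rows):
        bend *= row[3 + k]
    sum_h = sum(row[0] for row in rows)
    sum_e = sum(row[1] for row in rows)
    sum_t = sum(row[2] for row in rows)
    return bend > cutoff and sum_t / 4 > 100 and sum_t > sum_h and sum_t > sum_e

print(is_turn("NPDG", 0))  # True
print(is_turn("AELM", 0))  # False
```

For NPDG the product is about 0.0013 and the average P(turn) is 152.5, so all three conditions hold; for AELM the product is about 0.000007, below the cutoff.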

Page 100: Computational Molecular Biology


PSSP: Second Generation

Based on the information contained in a window of amino acids (11-21 aa.)

Most systems use algorithms based on:
• Statistical information
• Physico-chemical properties
• Sequence patterns
• Graph theory
• Multivariate statistics
• Expert rules
• Nearest-neighbour algorithms

Page 101: Computational Molecular Biology


PSSP: First & Second Generation

Main problems:

Prediction accuracy <70%

SS assignments differ even between crystals of the same protein

SS formation is partially determined by long-range interactions, i.e., by contacts between residues that are not visible to any method based on windows of 11-21 adjacent residues

Page 102: Computational Molecular Biology


PSSP: First & Second Generation

Main problems:

Prediction accuracy for β-strands is 28-48%, only slightly better than random
• Beta-sheet formation is determined by more nonlocal contacts than alpha-helix formation

Predicted helices and strands are usually too short
• Overlooked by most developers

Page 103: Computational Molecular Biology


Example of Second Generation

Example of a typical secondary structure prediction of the 2nd generation.

The protein sequence (SEQ) given was the SH3 structure.

The observed secondary structure (OBS) was assigned by DSSP (H = helix; E = strand; blank = non-regular structure; the dashes indicate the continuation).

The typical prediction of too-short segments (TYP) poses the following problems in practice.
(i) Are the residues predicted to be strand in segments 1, 5, and 6 errors, or should the helices be elongated?
(ii) Should the 2nd and 3rd strand be joined, or should one of them be ignored, or does the prediction indicate two strands here?

Note: the three-state per-residue accuracy is 60% for the prediction given.

Page 104: Computational Molecular Biology

PSSP: Third Generation

PHD: first algorithm in this generation (1994)
Evolutionary information improves the prediction accuracy to 72%

Use of evolutionary information:
1. Scan a database of known sequences with alignment methods to find similar sequences
2. Filter the resulting list with a threshold to identify the most significant sequences
3. Build amino acid exchange profiles from the probable homologs (the most significant sequences)
4. Use the profiles in the prediction, i.e., in building the classifier

Page 105: Computational Molecular Biology

PSSP: Third Generation

Many of the second generation algorithms have been updated to the third generation

Page 106: Computational Molecular Biology

PSSP: Third Generation

Because protein databases now hold more and better evolutionary information, today’s predictive accuracy is ~80%

It is believed that the maximum reachable accuracy is 88%. Why this conjecture?

Page 107: Computational Molecular Biology

Why 88%?

SS assignments may vary between two versions of the same structure:
Proteins are dynamic objects, with some regions more mobile than others
Assignments differ by 5-15% between different X-ray (NMR) versions of the same protein
Assignments differ by about 12% between structural homologues

B. Rost, C. Sander, and R. Schneider, Redefining the goals of protein secondary structure prediction, J. Mol. Biol.

Page 108: Computational Molecular Biology

PSSP Data Preparation

Public protein data sets used in PSSP research contain protein secondary structure sequences. To use classification algorithms, we must transform these secondary structure sequences into classification data tables.

In the PSSP literature, records in the classification data tables are called (learning) instances.

The mechanism used in this transformation is called a window.

A window algorithm takes a secondary structure as input and returns a classification table, i.e., a set of instances for the classification algorithm.

Page 109: Computational Molecular Biology

Window

Consider a secondary structure (x, e), where (x, e) = (x1x2…xn, e1e2…en).

A window of length w chooses a subsequence of length w of x1x2…xn and an element ei of e1e2…en corresponding to a special position in the window, usually the middle.

The window moves along the sequences x = x1x2…xn and e = e1e2…en simultaneously, starting at the beginning and moving to the right one letter at a time.

Page 110: Computational Molecular Biology

Window: Sequence to Structure

Such a window is called a sequence-to-structure window. We will call it a window for short.

The process terminates when the window or its middle position reaches the end of the sequence x.

The pair (subsequence, element of e) is often written in the form

subsequence → H, E or L

and is called an instance, or a rule.

Page 111: Computational Molecular Biology

Example: Window

Consider a secondary structure (x, e) and a window of length 5 with the special position in the middle.

The first position of the window is:

x = A R N S T V V S T A A …
e = H H H H L L L E E E

The window covers A R N S T, whose middle residue N has state H, so it returns the instance:

A R N S T → H

Page 112: Computational Molecular Biology

Example: Window

The second position of the window is:

x = A R N S T V V S T A A …
e = H H H H L L L E E E

The window returns the instance:

R N S T V → H

The next instances are:
N S T V V → L
S T V V S → L
T V V S T → L
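The window mechanism in this example can be sketched in a few lines. This is a minimal illustration, assuming the sequences are plain strings and the special position p is 1-based; the function name `window_instances` is hypothetical:

```python
from typing import Iterator, Optional, Tuple


def window_instances(x: str, e: str, w: int,
                     p: Optional[int] = None) -> Iterator[Tuple[str, str]]:
    """Yield (subsequence, label) pairs for a window of length w sliding over x.

    p is the 1-based special position inside the window (default: the middle).
    """
    if p is None:
        p = (w + 1) // 2
    if not 1 <= p <= w:
        raise ValueError("p must lie in 1..w")
    for start in range(len(x) - w + 1):
        label_index = start + p - 1
        if label_index >= len(e):
            break  # no structure label is available for this position
        yield x[start:start + w], e[label_index]


x = "ARNSTVVSTAA"
e = "HHHHLLLEEE"
instances = list(window_instances(x, e, w=5))
for subsequence, label in instances[:5]:
    print(" ".join(subsequence), "->", label)
```

The first five instances printed are the ones listed in the example; each pair becomes one record of the classification table.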

Page 113: Computational Molecular Biology

Symbolic Notation

Let f be a protein secondary structure (PSS) identification function:

f : A* → S*, i.e. f ⊆ A* × S*

where A is the amino acid alphabet and S = {H, E, L}.

Let x = x1x2…xn, e = e1e2…en, and f(x) = e. We define

f(x1x2…xn)|{xi} = ei, i.e. f(x)|{xi} = ei

Page 114: Computational Molecular Biology

Example: Semantics of Instances

Let
x = A R N S T V V S T A A …
e = H H H H L L L E E E

Assume that the window returns the instance

A R N S T → H

The semantics of the instance is

f(x)|{N} = H,

where f is the identification function, N is preceded by A R and followed by S T, and the window has length 5.

Page 115: Computational Molecular Biology

Classification Data Base (Table)

We build the classification table with the positions p1, p2, p3, p4, p5, …, pw in the window as attributes, where w is the length of the window. The value of each attribute is the element of the subsequence at that position.

The classification attribute is S, with values in the set {H, E, L} assigned by the window operation (instance, rule).

The classification table for our example (first few records) is the following.

Page 116: Computational Molecular Biology

Classification Table (Example)

x = A R N S T V V S T A A …
e = H H H H L L L E E E

p1  p2  p3  p4  p5  S
A   R   N   S   T   H
R   N   S   T   V   H
N   S   T   V   V   L
S   T   V   V   S   L

The semantics of a record r = r(p1, p2, p3, p4, p5, S) is f(x)|{Vp3} = VS, where Va denotes the value of the attribute a.

Page 117: Computational Molecular Biology

Size of classification datasets (tables)

The window mechanism produces very large datasets

For example, a window of size 13 applied to the CB513 dataset of 513 protein subunits produces about 70,000 records (instances)

Page 118: Computational Molecular Biology

Window

The window has the following parameters:

PARAMETER 1: i ∈ N+, the starting point of the window as it moves along the sequence x = x1x2…xn. The value i = 1 means that the window starts at x1; i = 5 means that the window starts at x5.

PARAMETER 2: w ∈ N+, the size (length) of the window.

For example, the PHD system of Rost and Sander (1994) uses two window sizes: 13 and 17.

Page 119: Computational Molecular Biology

Window

PARAMETER 3: p ∈ {1, 2, …, w}, the special position of the window that returns the classification attribute value from S = {H, E, L}; w is the size (length) of the window.

PSSP PROBLEM:

Find the window size w and the special position p that give the best prediction accuracy.

Page 120: Computational Molecular Biology

Window: Symbolic Definition

Window arguments: the window parameters and a secondary structure (x, e)

Window value: (subsequence of x, element of e)

OPERATION (sequence-to-structure window)

W is a partial function

W : N+ × N+ × {1, …, k} × (A* × S*) → A* × S

W(i, k, p, (x, e)) = (xi x(i+1) … x(i+k-1), f(x)|{x(i+p-1)}), where (x, e) = (x1x2…xn, e1e2…en)

W is undefined when the window runs past the end of x, i.e. when i+k-1 > n.

Page 121: Computational Molecular Biology

Neural network models

Machine learning approach
Provide training sets of structures (e.g. α-helices, non-α-helices)
Networks are trained to recognize patterns in known secondary structures
Provide a test set (proteins with known structures)
Accuracy ~70-75%

Page 122: Computational Molecular Biology

Reasons for improved accuracy

Align the sequence with other related proteins of the same protein family

Find members that have a known structure

If there are significant matches between structure and sequence, assign secondary structures to the corresponding residues

Page 123: Computational Molecular Biology

3 State Neural Network

Page 124: Computational Molecular Biology

Neural Network

Page 125: Computational Molecular Biology

Input Layer

Most approaches set w = 17. Why?
Based on evidence of statistical correlation with secondary structure as far as 8 residues on either side of the prediction point

The input layer consists of:
17 blocks, each representing one position of the window
Each block has 21 units (17 × 21 = 357 input units in total):
The first 20 units represent the 20 aa
One unit provides a null input, used when the moving window overlaps the amino- or carboxyl-terminal end of the protein

Page 126: Computational Molecular Biology

Binary Encoding Scheme

Example: let w = 5 and suppose we have the sequence A E G K Q …

Then the input layer has one row per window position, with columns A, C, D, E, F, G, …, N, P, Q, R, S, T, V, W, Y:

A:  1 0 0 0 0 0 … 0 0
E:  0 0 0 1 0 0 … 0 0
G:  0 0 0 0 0 1 … 0 0
…
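The binary encoding can be sketched as follows. This is an illustration under stated assumptions: the 20 amino acids are ordered alphabetically by one-letter code, the 21st unit of each block is the null input, w is odd, and the names `AMINO_ACIDS` and `encode_window` are hypothetical.

```python
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"       # assumed ordering of the 20 residue units
UNITS_PER_POSITION = len(AMINO_ACIDS) + 1  # the last unit is the null input


def encode_window(sequence: str, center: int, w: int = 17) -> List[int]:
    """Return the w * 21 binary input units for the window centred on `center` (0-based)."""
    half = w // 2
    units: List[int] = []
    for pos in range(center - half, center + half + 1):
        block = [0] * UNITS_PER_POSITION
        if 0 <= pos < len(sequence):
            # Raises ValueError for letters outside the 20 standard residues.
            block[AMINO_ACIDS.index(sequence[pos])] = 1
        else:
            block[-1] = 1  # window overlaps the N- or C-terminal end
        units.extend(block)
    return units


vec = encode_window("AEGKQ", center=2, w=5)
print(len(vec), sum(vec))
```

With w = 5 the vector has 5 × 21 = 105 units, exactly one of which is set in each block; with w = 17 it has 357.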

Page 127: Computational Molecular Biology

Hidden Layer

Represents the structure of the central aa

Encoding scheme:
Two units can be used: (1,0) = H, (0,1) = E, (0,0) = L
Some approaches use three units: (1,0,0) = H, (0,1,0) = E, (0,0,1) = L

Each connection is assigned a weight value.
The weights are adjusted during training to best fit the data.

Page 128: Computational Molecular Biology

Output Level

Based on the hidden level and some function f, calculate the output.

Helix (H) is assigned to any group of four or more contiguous residues that have helix output values greater than the sheet outputs and greater than some threshold t

Strand (E) is assigned to any group of two or more contiguous residues that have sheet output values greater than the helix outputs and greater than t

Otherwise, the residue is assigned to L

Note that t can be adjusted as well (training)
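The assignment rule above can be sketched as a two-pass procedure: first label each residue by comparing its helix and sheet outputs with the threshold, then keep only runs that are long enough. The function name, the default threshold and the example output values are assumptions made for illustration.

```python
from typing import List, Sequence


def assign_states(helix_out: Sequence[float], sheet_out: Sequence[float],
                  t: float = 0.5, min_helix: int = 4, min_strand: int = 2) -> str:
    """Turn per-residue helix/sheet outputs into a string over H, E, L."""
    # Pass 1: candidate state for each residue, before the length filter.
    raw: List[str] = []
    for h, s in zip(helix_out, sheet_out):
        if h > s and h > t:
            raw.append("H")
        elif s > h and s > t:
            raw.append("E")
        else:
            raw.append("L")
    # Pass 2: keep helix runs of >= min_helix and strand runs of >= min_strand.
    result = ["L"] * len(raw)
    i = 0
    while i < len(raw):
        j = i
        while j < len(raw) and raw[j] == raw[i]:
            j += 1
        run_length = j - i
        if (raw[i] == "H" and run_length >= min_helix) or \
           (raw[i] == "E" and run_length >= min_strand):
            result[i:j] = [raw[i]] * run_length
        i = j
    return "".join(result)


helix = [0.9, 0.8, 0.7, 0.9, 0.1, 0.1, 0.6, 0.1, 0.1, 0.1]
sheet = [0.1, 0.1, 0.1, 0.1, 0.2, 0.1, 0.1, 0.8, 0.7, 0.1]
print(assign_states(helix, sheet))
```

In the toy data the isolated helix-like residue at index 6 is demoted to L because a single residue cannot form a helix under the four-residue rule.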

Page 129: Computational Molecular Biology

How PHD works

Step 1. BLAST search with input sequence

Step 2. Perform multiple seq. alignment and calculate aa frequencies for each position

Protein   Aligned sequences   Pos.   Profile
K         K-HK                1      K=0.75, H=0.25
E         EDAE                2      E=0.6, D=0.2, A=0.2
L         FFFF                3      L=0.2, F=0.8
N         SAAS                4      N=0.2, S=0.4, A=0.4
D         QKKQ                5      K=0.4, Q=0.4, D=0.2
L         LLLL                6      L=1.0
E         EEEE                7      E=1.0
K         KEKK                8      K=0.8, E=0.2
Y         FFYF                9      Y=0.4, F=0.6
N         DDND                10     D=0.6, N=0.4
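The profile column of the table can be reproduced by counting residues per alignment position, ignoring gaps. A minimal sketch, assuming the query and the four aligned sequences are read column-wise from the table; the function name `position_profile` is hypothetical and this is not PHD's actual code:

```python
from collections import Counter
from typing import Dict, List


def position_profile(query: str, aligned: List[str]) -> List[Dict[str, float]]:
    """Per-position amino acid frequencies over the query plus aligned sequences."""
    profiles = []
    for pos, query_residue in enumerate(query):
        column = [query_residue] + [seq[pos] for seq in aligned]
        residues = [c for c in column if c != "-"]  # gaps do not count
        counts = Counter(residues)
        profiles.append({aa: n / len(residues) for aa, n in counts.items()})
    return profiles


query = "KELNDLEKYN"
aligned = ["KEFSQLEKFD", "-DFAKLEEFD", "HAFAKLEKYN", "KEFSQLEKFD"]
for pos, profile in enumerate(position_profile(query, aligned), start=1):
    print(pos, profile)
```

Position 1 has one gap, so its frequencies are taken over four residues (K = 3/4, H = 1/4); every other position is taken over five.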

Page 130: Computational Molecular Biology

How PHD works

Step 3. First level: “sequence-to-structure net”
Input: alignment profile. Output: units for H, E, L.
Calculate “occurrences” of any of the residues to be present in either an α-helix, β-strand, or loop.

[Figure: a window over alignment positions 1-7 with the example profile N=0.2, S=0.4, A=0.4 as input; output units H = 0.05, E = 0.18, L = 0.67]

Page 131: Computational Molecular Biology

How PHD works

Step 3. Second level: “structure-to-structure net”
Input: first-level values. Output: units for H, E, L.
Window size = 17

[Figure: first-level values (e.g. E = 0.18) as input; output units H = 0.59, E = 0.09, L = 0.31]

Step 4. Decision level

Page 132: Computational Molecular Biology

Prepare Data for PHD Neural Nets

Starting from a sequence of unknown structure (SEQUENCE), the following steps are required to feed evolutionary information into the PHD neural networks:
1. a database search for homologues (method: BLAST)
2. a refined profile-based dynamic-programming alignment of the most likely homologues (method: MaxHom)
3. a decision on which proteins will be considered homologues (length-dependent cut-off for pairwise sequence identity)
4. a final refinement and extraction of the resulting multiple alignment

Numbers 1-3 indicate the points where users of the PredictProtein service can intervene to improve prediction accuracy without changing the final prediction method PHD.

http://cubic.bioc.columbia.edu/papers/2000_rev_humana/paper.html

Page 133: Computational Molecular Biology

PHD Neural Network

Page 134: Computational Molecular Biology

Prediction Accuracy

Authors         Year   % accuracy   Method
Chou-Fasman     1974   50%          propensities of aa's in secondary structures
Garnier         1978   62%          interactions between aa's
Levin           1993   69%          multiple seq. alignments (MSA)
Rost & Sander   1994   72%          neural networks + MSA

Page 135: Computational Molecular Biology

Where can I learn more?

Protein Structure Prediction Center
Biology and Biotechnology Research Program
Lawrence Livermore National Laboratory, Livermore, CA

http://predictioncenter.llnl.gov/Center.html

DSSP (Define Secondary Structure of Proteins): secondary structure assignments for known protein structures

http://www.sander.ebi.ac.uk/dssp/

Page 136: Computational Molecular Biology

Computational Molecular Biology

Protein Structure: Tertiary Prediction via Threading

Page 137: Computational Molecular Biology

Objective

Study the problem of predicting the tertiary structure of a given protein sequence

Page 138: Computational Molecular Biology

A Few Examples

[Figure: several actual protein structures shown next to their predicted structures]

Page 139: Computational Molecular Biology

Two Comparative Modeling Approaches

Homology modeling: identify homologous proteins through sequence alignment, then predict the structure by placing residues into the “corresponding” positions of the homologous structure models

Protein threading: predict the structure by identifying a “good” sequence-structure fit

We will focus on protein threading.

Page 140: Computational Molecular Biology

Why Does It Work?

Observation: many protein structures in the PDB are very similar
E.g., the set of solved structures contains many 4-helical bundles, globins, …

Conjecture: there are only a limited number of “unique” protein folds in nature

Page 141: Computational Molecular Biology

Threading Method

General idea:
Try to determine the structure of a new sequence by finding its best “fit” to some fold in a library of structures

Sequence-structure alignment problem:
Given a solved structure T for a sequence t1t2…tn and a new sequence S = s1s2…sm, find the “best match” between S and T

Page 142: Computational Molecular Biology

What to Consider

How to evaluate (score) a given alignment of S with a structure T?

How to efficiently search over all possible alignments?

Page 143: Computational Molecular Biology

Three Main Approaches

Protein Sequence Alignment
3D Profile Method
Contact Potentials

Page 144: Computational Molecular Biology

Protein Sequence Alignment Method

Align the two sequences S and T
If si aligns with tj in the alignment, assign si to position pj in the structure

Advantages:
Simple

Disadvantages:
Similar structures have a lot of sequence variability, so sequence alignment may not be very helpful

Page 145: Computational Molecular Biology

3D Profile Method

Actually uses structural information

Main idea:
Reduce the 3D structure to a 1D string describing the environment of each position in the protein (called the 3D profile of the fold)
To determine whether a new sequence S belongs to a given fold T, align the sequence with the fold’s 3D profile

First question: How to create the 3D profile?

Page 146: Computational Molecular Biology

Create the 3D Profile

For a given fold, do:
1. For each residue, determine:
How buried is it?
What fraction of its surrounding environment is polar?
What secondary structure is it in (alpha-helix, beta-sheet, or neither)?

Page 147: Computational Molecular Biology

Create the 3D profile

2. Assign an environment class to each position:

Six classes describe the burial and polarity criteria (exposed, partially buried, very buried, different fractions of polar environment)

Page 148: Computational Molecular Biology

Create the 3D Profile

These environment classes depend on the number of surrounding polar residues and how buried the position is.

Each of these six classes is combined with one of 3 secondary structure states, giving 6 × 3 = 18 environment classes

Page 149: Computational Molecular Biology

Create the 3D Profile

3. Convert the known structure T to a string of environment descriptors E = e1e2…en, one environment class per position

4. Align the new sequence S with E using dynamic programming

Page 150: Computational Molecular Biology

Scores for Alignment

Need scores for aligning individual residues with environments.

Key: different aa prefer different environments, so the scores are determined from statistical data

Page 151: Computational Molecular Biology

Scores for Alignment

1. Choose a database of known structures

2. Tabulate the number of times we see a particular residue in a particular environment class, then compute a score for each (environment class, aa) pair

3. Choose gap penalties, e.g. we may charge more for gaps in alpha and beta environments
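One common way to turn the tabulated counts into scores is a log-odds ratio, ln(P(aa | env) / P(aa)). The slides do not state the formula, so the sketch below is an assumption; the pseudocount smoothing, the function name and the toy observations are likewise illustrative.

```python
import math
from collections import Counter
from typing import Dict, Iterable, Tuple


def environment_scores(observations: Iterable[Tuple[str, str]],
                       pseudocount: float = 1.0) -> Dict[Tuple[str, str], float]:
    """Log-odds score for each (amino acid, environment class) pair.

    observations: (amino_acid, environment_class) pairs from known structures.
    """
    obs = list(observations)
    pair_counts = Counter(obs)
    aa_counts = Counter(aa for aa, _ in obs)
    env_counts = Counter(env for _, env in obs)
    scores = {}
    for env, env_total in env_counts.items():
        for aa, aa_total in aa_counts.items():
            # P(aa | env), smoothed so that unseen pairs get a finite score.
            p_aa_given_env = (pair_counts[(aa, env)] + pseudocount) / \
                             (env_total + pseudocount * len(aa_counts))
            p_aa = aa_total / len(obs)
            scores[(aa, env)] = math.log(p_aa_given_env / p_aa)
    return scores


# Toy data (hypothetical): leucine is mostly buried, lysine mostly exposed.
toy = [("L", "buried")] * 8 + [("K", "buried")] * 2 + \
      [("L", "exposed")] * 2 + [("K", "exposed")] * 8
scores = environment_scores(toy)
print(scores[("L", "buried")], scores[("K", "buried")])
```

A positive score means the residue is seen in that environment more often than its overall frequency would suggest; a negative score means less often.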

Page 152: Computational Molecular Biology

Alignment

This gives us a table of scores for aligning an aa sequence with an environment string

Using this scoring and Dynamic Programming, we can find an optimal alignment and score for each fold in our library

The fold with the highest score is the best fold for the new sequence
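The fold search can be sketched as a global alignment of the sequence against each fold's environment string, keeping the fold with the highest score. Assumptions made for illustration: a single linear gap penalty (no environment-specific gap costs), a score table keyed by (amino acid, environment class), and the hypothetical names `align_score` and `best_fold`.

```python
from typing import Dict, List, Sequence, Tuple

ScoreTable = Dict[Tuple[str, str], float]


def align_score(seq: str, envs: Sequence[str], score: ScoreTable,
                gap: float = -2.0) -> float:
    """Best global alignment score of a sequence against an environment string."""
    m, n = len(seq), len(envs)
    # dp[i][j] = best score for aligning seq[:i] with envs[:j]
    dp: List[List[float]] = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = i * gap
    for j in range(1, n + 1):
        dp[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            match = dp[i - 1][j - 1] + score.get((seq[i - 1], envs[j - 1]), 0.0)
            dp[i][j] = max(match, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
    return dp[m][n]


def best_fold(seq: str, library: Dict[str, Sequence[str]], score: ScoreTable,
              gap: float = -2.0) -> str:
    """Name of the library fold whose 3D profile aligns best with seq."""
    return max(library, key=lambda name: align_score(seq, library[name], score, gap))


# Toy score table and fold library (hypothetical): B = buried, E = exposed.
score = {("L", "B"): 1.0, ("L", "E"): -1.0, ("K", "B"): -1.0, ("K", "E"): 1.0}
library = {"foldA": ["B", "B", "E", "E"], "foldB": ["E", "E", "B", "B"]}
print(best_fold("LLKK", library, score))
```

Each alignment costs O(mn) time for a sequence of length m and a fold of length n, so scanning a library is linear in the number of folds.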

Page 153: Computational Molecular Biology

Contact Potentials Method

Takes the 3D structure into account more carefully
Includes information about how residues interact with each other
Considers pairwise interactions between positions pi, pj in the fold
For a given alignment, produces a score that is the sum over these interactions:

Page 154: Computational Molecular Biology

Problem

We have a sequence T = t1…tn from the database with known positions p1…pn, and a new sequence S = s1…sm.

Find 1 <= r1 < r2 < … < rn <= m that maximize the total interaction score, where ri is the index of the aa in S that occupies position pi

This problem is NP-complete for pairwise interactions
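One plausible way to write the objective, assuming a pairwise score w (a symbol introduced here for illustration, not taken from the slide) that depends on the two positions and on the residues placed at them:

```latex
\max_{1 \le r_1 < r_2 < \dots < r_n \le m} \;
\sum_{1 \le i < j \le n} w\bigl(p_i,\, p_j,\, s_{r_i},\, s_{r_j}\bigr)
```

Here s_{r_i} is the residue of S placed at position p_i. In a contact-potential model, w would be nonzero only for pairs of positions that are in contact in the fold.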

Page 155: Computational Molecular Biology

How to Define that Score?

Use so-called “knowledge-based potentials”, which come from databases of observed interactions.

The general form:

Page 156: Computational Molecular Biology

How to Define the Score

General idea:
Define a cutoff parameter for “contact” (e.g. up to 6 Angstroms)
Use the PDB to count the number of times aa i and j are in contact

There are several methods for normalization, e.g. normalizing by hypothetical random frequencies
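The counting and normalization steps can be sketched as below. Assumptions made for illustration: each structure is reduced to one coordinate per residue, residues adjacent in the chain are skipped (`min_separation`), the random expectation comes from the overall residue composition, and the result is a log-odds score in which higher means more favourable (an energy-like potential would be its negative). The names and the toy structure are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations, combinations_with_replacement
from typing import Dict, List, Sequence, Tuple

Structure = Tuple[str, Sequence[Tuple[float, float, float]]]


def contact_scores(structures: List[Structure], cutoff: float = 6.0,
                   min_separation: int = 2,
                   pseudocount: float = 1.0) -> Dict[Tuple[str, str], float]:
    """Log-odds contact score for each unordered pair of residue types."""
    contact_counts: Counter = Counter()
    residue_counts: Counter = Counter()
    for seq, coords in structures:
        residue_counts.update(seq)
        for i, j in combinations(range(len(seq)), 2):
            if j - i >= min_separation and math.dist(coords[i], coords[j]) <= cutoff:
                contact_counts[tuple(sorted((seq[i], seq[j])))] += 1
    total_contacts = sum(contact_counts.values())
    total_residues = sum(residue_counts.values())
    scores = {}
    for a, b in combinations_with_replacement(sorted(residue_counts), 2):
        # Contacts expected if residue types paired up at random.
        expected = (total_contacts * (residue_counts[a] / total_residues)
                    * (residue_counts[b] / total_residues))
        if a != b:
            expected *= 2  # the unordered pair (a, b) covers both orders
        observed = contact_counts[(a, b)]
        scores[(a, b)] = math.log((observed + pseudocount) / (expected + pseudocount))
    return scores


# Toy structure (hypothetical coordinates in Angstroms): only the two L residues touch.
toy = [("LKLK", [(0.0, 0.0, 0.0), (20.0, 0.0, 0.0), (3.0, 0.0, 0.0), (40.0, 0.0, 0.0)])]
scores = contact_scores(toy)
print(scores)
```

`math.dist` requires Python 3.8 or later; the pseudocount keeps the logarithm finite for pairs that are never observed in contact.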

Page 157: Computational Molecular Biology

Other Variations

There are many other variations in defining the potentials:
In addition to pairwise potentials, consider single-residue potentials
Distance-dependent intervals: count pairwise contacts separately for each interval (within 1 Angstrom, between 1 and 2 Angstroms, …)

Page 158: Computational Molecular Biology

Threading via Tree-Decomposition

Page 159: Computational Molecular Biology

Contact Graph

1. Each residue is a vertex
2. There is one edge between two residues if their spatial distance is within a given cutoff
3. Cores are the most conserved segments in the template

Page 160: Computational Molecular Biology

Simplified Contact Graph

Page 161: Computational Molecular Biology

Alignment Example

Page 162: Computational Molecular Biology

Alignment Example

Page 163: Computational Molecular Biology

Calculation of Alignment Score

Page 164: Computational Molecular Biology

Graph Labeling Problem

Each core is a vertex
Two cores interact if there is an interaction between any two residues, one in each core
Add one edge between two cores that interact

Each possible sequence alignment position for a single core can be treated as a possible label assignment to a vertex in G.
Let D[i] be the set of all possible label assignments to vertex i.
Then for each label assignment A(i) in D[i], we have:

[Figure: example graph G with vertices labelled a, b, c, d, e, f, h, i, j, k, l, m, s]

Page 165: Computational Molecular Biology

Tree Decomposition

Page 166: Computational Molecular Biology

Tree Decomposition [Robertson & Seymour, 1986]

Greedy: minimum degree heuristic

1. Choose the vertex with minimum degree
2. The chosen vertex and its neighbors form a component
3. Add an edge between every two neighbors of the chosen vertex
4. Remove the chosen vertex
5. Repeat the above steps until the graph is empty

[Figure: example graph on vertices a-m before and after one elimination step, with the resulting component abd]
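The greedy steps above can be sketched directly on an adjacency-set representation. Assumptions made for illustration: the graph is undirected and given as a dict from vertex to neighbour set, ties are broken by vertex name so the result is deterministic, and the function name is hypothetical. The sketch returns the components in elimination order and does not build the tree edges between them.

```python
from typing import Dict, FrozenSet, List, Set


def min_degree_components(graph: Dict[str, Set[str]]) -> List[FrozenSet[str]]:
    """Components produced by the minimum degree elimination heuristic."""
    adj = {v: set(nbrs) for v, nbrs in graph.items()}  # work on a copy
    components = []
    while adj:
        # 1. Choose the vertex with minimum degree (ties broken by name).
        v = min(adj, key=lambda u: (len(adj[u]), u))
        # 4. Remove the chosen vertex (popped early so it is out of the dict).
        nbrs = adj.pop(v)
        # 2. The chosen vertex and its neighbours form a component.
        components.append(frozenset(nbrs | {v}))
        # 3. Connect every two neighbours of the chosen vertex.
        for a in nbrs:
            adj[a].discard(v)
            adj[a] |= nbrs - {a}
    return components


# Toy graph (hypothetical): the 4-cycle a-b-c-d-a.
cycle = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"a", "c"}}
components = min_degree_components(cycle)
print([sorted(c) for c in components])
print("width estimate:", max(len(c) for c in components) - 1)
```

Components that are subsets of an earlier one (the last two in this toy run) are usually dropped when the tree is assembled. The largest component size minus one is an upper bound on the tree width, which governs the cost of the dynamic programming over the decomposition.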

Page 167: Computational Molecular Biology

Tree Decomposition (Cont’d)

[Figure: the example graph on vertices a-m and a tree decomposition with components abd, acd, clk, cdem, defm, fgh, eij; a second panel, labelled "remove dem", shows ab, ac, clk, cf, fgh, ij]

Page 168: Computational Molecular Biology

Tree Decomposition-Based Algorithms

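1. Bottom-to-top: calculate the minimal F function

2. Top-to-bottom: extract the optimal assignment

$$F(X_i, A(X_{ir})) = \min_{A(X_i - X_{ir})} \left[ F(X_j, A(X_{ji})) + F(X_l, A(X_{li})) + \mathrm{Score}(X_i, A(X_i)) \right]$$

where
F(X_i, A(X_{ir})) is the score of the subtree rooted at X_i
Score(X_i, A(X_i)) is the score of component X_i
F(X_j, A(X_{ji})) is the score of the subtree rooted at X_j
F(X_l, A(X_{li})) is the score of the subtree rooted at X_l

[Figure: a tree decomposition rooted at X_r, with components X_r, X_p, X_i, X_j, X_l, X_q; the edges into X_i are labelled X_ir, X_ji and X_li]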