class 7: protein secondary structure · shortcoming of this method: ignoring the context of the...

Post on 13-Oct-2020


Class 7: Protein Secondary Structure

Protein Structure

Amino-acid chains can fold to form 3-dimensional structures

Proteins are sequences that have a (more or less) stable 3-dimensional configuration

Why Is Structure Important?

The structure a protein takes is crucial for its function

Forms “pockets” that can recognize an enzyme substrate

Positions the side chains of specific groups together to form areas with the desired chemical/electrical properties

Creates firm structures such as collagen, keratins, and fibroins

Determining Structure

X-ray and NMR methods make it possible to determine the structure of proteins and protein complexes

These methods are expensive and difficult

Solving a single protein can take several months of work

A centralized database, the Protein Data Bank (PDB), contains all solved protein structures

XYZ coordinates of atoms, within a specified precision

~23,000 proteins have solved structures

Structure is Sequence Dependent

Experiments show that for many proteins, the 3-dimensional structure is a function of the sequence

Force the protein to lose its structure by introducing agents that change the environment

After the sequence is put back in water, the original conformation/activity is restored

However, for complex proteins, there are cellular processes that “help” in folding

Levels of structure


Secondary Structure

α-helix, β-strands

α-Helix

Single protein chain

One turn every 3.6 amino acids

Shape maintained by intramolecular hydrogen bonding between -C=O and H-N-

Hydrogen Bonds in α-Helices

Amphipathic α-helix

Hydrophilic residues on one side

Hydrophobic residues on the other side

β-Strands

Alternating 120° angles

Often form sheets

β-Strands Form Sheets

Parallel, anti-parallel

These sheets are held together by hydrogen bonds across strands


…which can form a β-barrel

Porin: a membrane transporter

Angular Coordinates

Secondary structures force specific angles between residues

Ramachandran Plot

We can relate angles to types of structures

Define "secondary structure"

3D protein coordinates may be converted to a 1D secondary structure representation using DSSP or STRIDE

DSSP EEEE_SS_EEEE_GGT__EE_E_HHHHHHHHHHHHHHHGG_TT

DSSP = Define Secondary Structure of Proteins (the program that produces these assignments)

DSSP symbols

H = helix: backbone angles (-50,-60) and H-bonding pattern (i -> i+4)

E = extended strand: backbone angles (-120,+120) with beta-sheet H-bonds (parallel/anti-parallel are not distinguished)

S = beta-bridge (isolated backbone H-bonds); standard DSSP output labels isolated bridges B and uses S for bends

T = beta-turn (specific sets of angles and one i -> i+3 H-bond)

G = 3-10 helix or turn (i, i+3 H-bonds)

I = pi-helix (i, i+5 H-bonds) (rare!)

_ = unclassified: none of the above. Generic loop, or beta-strand with no regular H-bonding.


Labeling Secondary Structure

Using both hydrogen-bond patterns and angles, we can assign secondary structure labels from the XYZ coordinates of the amino acids

These criteria do not lead to an absolute definition of secondary structure


Prediction of Secondary Structure

Input: amino-acid sequence

Output: annotation sequence of three classes:

alpha

beta

other (sometimes called coil/turn)

Measure of success: Percentage of residues that were correctly labeled

Accuracy of 3-state predictions

Q3-score = % of 3-state symbols that are correct

Measured on a "test set"

Test set = an independent set of cases (proteins) that were not used to train, or in any way derive, the method being tested.

Best methods:

PHD (Burkhard Rost) -- 72-74% Q3

Psi-pred (David T. Jones) -- 76-78% Q3

True SS:    EEEE_SS_EEEE_GGT__EE_E_HHHHHHHHHHHHHHHGG_TT
Prediction: EEEELLLLHHHHHHLLLLEEEEEHHHHHHHHHHHHHHHHHHLL

Prediction Accuracy

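Three-state prediction accuracy: Q3

$$Q_3 = \frac{100}{N_{obs}} \sum_{i=1}^{3} M_{ii}$$

where M_ii is the number of residues observed in state i and also predicted in state i, and N_obs is the total number of observed residues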
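As a minimal sketch of the Q3 measure, the function below counts matching labels between an observed and a predicted 3-state string. The function name `q3_score` and the two short example strings are made up for illustration; they are not taken from the slides.

```python
def q3_score(observed: str, predicted: str) -> float:
    """Percentage of residues whose predicted 3-state label equals the observed one."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    if not observed:
        raise ValueError("sequences must be non-empty")
    # Each matching position contributes to one diagonal entry M_ii.
    correct = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * correct / len(observed)


# Illustrative strings (H = helix, E = strand, L = other); 8 of 10 labels agree.
example_q3 = q3_score("HHHHLLEEEE", "HHHLLLEEEH")
print(example_q3)
```

Q3 treats every residue independently, which is why per-segment measures such as SOV are reported alongside it.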

Other measures: per-segment accuracy (SOV), Matthews correlation coefficient

What can you do with a secondary structure prediction?

(1) Find out if a homolog of unknown structure is missing any of the SS (secondary structure) units, i.e. a helix or a strand.

(2) Find out whether a helix or strand is extended/shortened in the homolog.

(3) Model a large insertion or terminal domain

(4) Aid tertiary structure prediction

Statistical Methods

From the PDB database, calculate the propensity for a given amino acid to adopt a certain secondary structure type

$$P_\alpha(aa_i) = \frac{p(\alpha \mid aa_i)}{p(\alpha)} = \frac{p(\alpha, aa_i)}{p(\alpha)\, p(aa_i)}$$

Example: #Ala = 2,000, #residues = 20,000, #helix = 4,000, #Ala in helix = 500

p(α, aa) = 500/20,000, p(α) = 4,000/20,000, p(aa) = 2,000/20,000

P = 500 / (4,000/10) = 1.25

Used in Chou-Fasman algorithm (1974)
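The propensity calculation can be sketched directly from residue counts. This is an illustration of the ratio p(α, aa) / (p(α) p(aa)) using the example counts above; the function and argument names are assumptions made here, not part of the Chou-Fasman papers.

```python
def propensity(n_aa_in_ss: int, n_ss: int, n_aa: int, n_total: int) -> float:
    """Propensity of an amino acid for a secondary structure type.

    p(ss, aa) / (p(ss) * p(aa)) with each probability estimated as a count
    divided by n_total; the expression below is the same ratio simplified.
    """
    if min(n_ss, n_aa, n_total) <= 0:
        raise ValueError("counts must be positive")
    return (n_aa_in_ss * n_total) / (n_ss * n_aa)


# Counts from the example: 500 Ala in helix, 4,000 helix residues,
# 2,000 Ala, 20,000 residues in total.
ala_helix = propensity(500, 4000, 2000, 20000)
print(ala_helix)
```

A value above 1 means the amino acid occurs in that structure type more often than expected by chance; a value below 1 means less often.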


Chou-Fasman: Initiation

Residue  T   S   P   T   A    E    L    M    R   S   T   G
P(H)     69  77  57  69  142  151  121  145  98  77  69  57

Identify regions where 4 of 6 consecutive residues have propensity P(H) > 1.00 (the table lists P(H) multiplied by 100). This forms an “alpha-helix nucleus”.
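The initiation rule can be sketched as a scan over all 6-residue windows. The code uses the sequence and P(H) values from the slide (scaled by 100, so the threshold is 100); the function name `helix_nuclei` and its parameters are illustrative, and this is not a full Chou-Fasman implementation.

```python
SEQ = "TSPTAELMRSTG"
# P(H) values as listed on the slide, multiplied by 100.
P_H = [69, 77, 57, 69, 142, 151, 121, 145, 98, 77, 69, 57]


def helix_nuclei(p_h, window=6, min_formers=4, threshold=100):
    """Return start indices of windows where at least min_formers residues exceed threshold."""
    starts = []
    for start in range(len(p_h) - window + 1):
        formers = sum(1 for value in p_h[start:start + window] if value > threshold)
        if formers >= min_formers:
            starts.append(start)
    return starts


nuclei = helix_nuclei(P_H)
for start in nuclei:
    print(start, SEQ[start:start + 6])
```

On this example three overlapping windows qualify, all centered on the A-E-L-M stretch.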

Chou-Fasman: Propagation

Residue  T   S   P   T   A    E    L    M    R   S   T   G
P(H)     69  77  57  69  142  151  121  145  98  77  69  57

Extend the helix in both directions until a set of four residues has an average P(H) < 1.00.
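A rough sketch of the propagation step follows. It assumes one particular reading of the rule: the segment grows by one residue at a time as long as the four-residue window ending (or starting) at the new residue keeps an average P(H) of at least 100. The function name, the half-open segment convention, and the starting nucleus are choices made here for illustration.

```python
# P(H) values from the slide, multiplied by 100 (sequence TSPTAELMRSTG).
P_H = [69, 77, 57, 69, 142, 151, 121, 145, 98, 77, 69, 57]


def extend_helix(p_h, start, end, window=4, threshold=100):
    """Grow the half-open segment [start, end) in both directions.

    Assumes the initial segment is at least `window` residues long,
    which holds for a 6-residue nucleus.
    """
    # Extend to the right: test the window that ends at the candidate residue.
    while end < len(p_h) and sum(p_h[end - window + 1:end + 1]) / window >= threshold:
        end += 1
    # Extend to the left: test the window that starts at the candidate residue.
    while start > 0 and sum(p_h[start - 1:start - 1 + window]) / window >= threshold:
        start -= 1
    return start, end


# Start from the nucleus covering residues 2..7 (P T A E L M).
helix_start, helix_end = extend_helix(P_H, 2, 8)
print(helix_start, helix_end)
```

With these values the helix grows two residues to the right and stops when the window M-R-S-T averages below 100; it does not grow to the left.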

Chou-Fasman Prediction

Predict as α-helix segment with

E[Pα] > 1.03

E[Pα] > E[Pβ]

Not including proline

Predict as β-strand segment with

E[Pβ] > 1.05

E[Pβ] > E[Pα]

Others are labeled as turns/loops.

(Various extensions appear in the literature)

Achieved accuracy: around 50%

Shortcoming of this method: it ignores the sequence context when predicting from individual amino acids

We would like to use the sequence context as an input to a classifier

There are many ways to address this.

The most successful to date are based on neural networks

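A Neuron

Artificial Neuron

[Figure: inputs a1, a2, ..., ak with weights W1, W2, ..., Wk feed a single unit that produces one output]

$$\text{output} = f\left(b + \sum_i W_i a_i\right)$$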

•A neuron is a multiple-input -> single output unit

•Wi = weights assigned to inputs; b = internal “bias”

•f = output function (linear, sigmoid)
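A single artificial neuron is small enough to write out in full. The sketch below computes f(b + Σ Wi ai) with a sigmoid as the default output function; the example inputs and weights are arbitrary numbers chosen for illustration.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic output function 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))


def neuron(inputs, weights, bias, f=sigmoid):
    """Multiple-input, single-output unit: f(bias + sum_i W_i * a_i)."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    return f(bias + sum(w * a for w, a in zip(weights, inputs)))


# Arbitrary example: the weighted sum cancels the bias, so the sigmoid sees 0.
print(neuron([1.0, 0.0, 1.0], [0.5, -0.3, 0.8], bias=-1.3))
# A linear output function just returns the weighted sum plus bias.
print(neuron([2.0, 3.0], [1.0, 1.0], bias=0.0, f=lambda x: x))
```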


Artificial Neural Network

[Figure: a network with an input layer (a1, a2, a3), a hidden layer, and an output layer (o1 ... om), connected by weights W1, W2, ..., Wk]

•Neurons in hidden layers compute “features” from outputs of previous layers

•Output neurons can be interpreted as a classifier

Example: Fruit Classifier

         Shape    Texture  Weight  Color
Apple    ellipse  hard     heavy   red
Orange   round    soft     light   yellow

Qian-Sejnowski Architecture

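[Figure: the input layer holds the window of residues S(i-w) ... S(i) ... S(i+w); it feeds a hidden layer, which feeds three output units oα, oβ, oo]

$$h_k \leftarrow l\left(a_k + \sum_{j,a} w_{j,a,k}\, 1\{s_{i+j} = a\}\right)$$

$$o_s \leftarrow l\left(b_s + \sum_k u_{k,s}\, h_k\right)$$

$$\text{label}_i \leftarrow \arg\max_s o_s$$

$$l(x) = \frac{1}{1 + e^{-x}}$$

Here 1{s_{i+j} = a} is 1 when the residue at offset j from position i is amino acid a, and 0 otherwise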

Neural Network Prediction

A neural network defines a function from inputs to outputs

Inputs can be discrete or continuous valued

In this case, the network defines a function from a window of size 2w+1 around a residue to a secondary structure label for it

The structure element is determined by the largest output, max(oα, oβ, oo)
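The mapping from a window of 2w+1 residues to network inputs can be sketched as a one-hot encoding, followed by picking the label of the largest output. The 21-symbol alphabet with a "-" padding symbol for positions beyond the chain ends is an assumption made for this sketch, as are the function names.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# "-" is a padding symbol (an assumption here) for window positions off either end.
ALPHABET = AMINO_ACIDS + "-"


def encode_window(sequence: str, i: int, w: int):
    """One-hot encode the 2w+1 residues centered on position i.

    Raises ValueError for residue symbols outside the 20 standard amino acids.
    """
    features = []
    for pos in range(i - w, i + w + 1):
        symbol = sequence[pos] if 0 <= pos < len(sequence) else "-"
        one_hot = [0] * len(ALPHABET)
        one_hot[ALPHABET.index(symbol)] = 1
        features.extend(one_hot)
    return features


def predict_label(outputs: dict) -> str:
    """Return the label whose output unit has the largest value."""
    return max(outputs, key=outputs.get)


x = encode_window("TSPTAELMRSTG", i=0, w=6)
print(len(x))  # (2w+1) positions times 21 symbols
print(predict_label({"alpha": 0.7, "beta": 0.2, "other": 0.1}))
```

The output values passed to `predict_label` are invented; in a real predictor they would come from the trained network.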

Training Neural Networks

By modifying the network weights, we change the function

Training is performed by

Defining an error score for training pairs <input,output>

Performing gradient-descent minimization of the error score

The back-propagation algorithm computes the gradient efficiently

We have to be careful not to overfit the training data
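To make the training loop concrete, here is a deliberately reduced sketch: gradient descent on the squared error of a single sigmoid neuron, so the chain rule used by back-propagation has only one step. The toy data (the logical OR of two inputs), the learning rate, and the number of epochs are arbitrary choices for illustration, not values from any secondary structure predictor.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def predict(weights, bias, inputs) -> float:
    return sigmoid(bias + sum(w * a for w, a in zip(weights, inputs)))


def train(pairs, n_inputs, rate=0.5, epochs=2000):
    """Minimize 0.5 * (output - target)^2 over the training pairs by gradient descent."""
    weights = [0.0] * n_inputs
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in pairs:
            out = predict(weights, bias, inputs)
            # Derivative of the error with respect to the weighted sum (chain rule).
            delta = (out - target) * out * (1.0 - out)
            weights = [w - rate * delta * a for w, a in zip(weights, inputs)]
            bias -= rate * delta
    return weights, bias


# Toy training pairs <input, output>: logical OR.
PAIRS = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
weights, bias = train(PAIRS, n_inputs=2)
for inputs, target in PAIRS:
    print(inputs, target, round(predict(weights, bias, inputs), 3))
```

With hidden layers the same delta is propagated backwards layer by layer, and a held-out test set is used to check that the fitted weights have not overfit the training pairs.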

Smoothing Outputs

The Qian-Sejnowski network assigns each residue a secondary structure by taking max(oα, oβ, oo)

Some sequences of secondary structure are impossible:

αααβαααβαα

To smooth the output of the network, another layer is applied on top of the three output units for each residue:

Neural network

Markov model
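One way a Markov model can smooth per-residue outputs is Viterbi decoding over a chain whose transitions make isolated state changes unlikely. The sketch below assumes the network outputs can be treated as positive per-state scores; the transition and start probabilities are invented for illustration and are not fitted to any data.

```python
import math

STATES = "HEL"  # helix, strand, other


def viterbi_smooth(scores, transition, start):
    """Most probable state path given per-residue scores and Markov transitions."""
    best = {s: math.log(start[s]) + math.log(scores[0][s]) for s in STATES}
    back = []
    for residue in scores[1:]:
        pointers, new_best = {}, {}
        for s in STATES:
            # Best predecessor for ending in state s at this residue.
            prev = max(STATES, key=lambda p: best[p] + math.log(transition[p][s]))
            pointers[s] = prev
            new_best[s] = best[prev] + math.log(transition[prev][s]) + math.log(residue[s])
        back.append(pointers)
        best = new_best
    # Trace back from the best final state.
    state = max(STATES, key=best.get)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


# Hypothetical transition probabilities: direct helix <-> strand switches are rare.
TRANSITION = {
    "H": {"H": 0.90, "E": 0.01, "L": 0.09},
    "E": {"H": 0.01, "E": 0.90, "L": 0.09},
    "L": {"H": 0.10, "E": 0.10, "L": 0.80},
}
START = {"H": 1 / 3, "E": 1 / 3, "L": 1 / 3}

helix_like = {"H": 0.7, "E": 0.2, "L": 0.1}
noisy = {"H": 0.4, "E": 0.5, "L": 0.1}
SCORES = [helix_like] * 3 + [noisy] + [helix_like] * 3

raw = "".join(max(STATES, key=r.get) for r in SCORES)
print(raw)                                        # per-residue maximum
print(viterbi_smooth(SCORES, TRANSITION, START))  # smoothed path
```

The per-residue maximum yields a single strand residue inside a helix; the smoothed path removes it because the two transitions it requires are too costly.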


Success Rate

Variants of the neural network architecture and other methods achieved accuracy of about 65% on unseen proteins

Depending on the exact choice of training/test sets

Breaking the 70% Threshold

An innovation that made a crucial difference uses evolutionary information to improve prediction

Key idea: structure is preserved more than sequence

Surviving mutations are not random

Suppose we find homologues (same structure) of the query sequence

The types of replacement at position i during evolution provide information about the role of residue i in the secondary structure

Nearest Neighbor Approach

Select a window around the target residue

Perform local alignment to sequences with known structure

Choose an alignment weight matrix that matches remote homologies

The alignment weight takes into account the secondary structure of the aligned sequence

Use max(na, nb, nc) or max(sa, sb, sc)

Key: a scoring measure of evolutionary similarity.

PHD Approach

Multi-step procedure:

Perform a BLAST search to find local alignments

Remove alignments that are “too close”

Perform multiple alignment of the sequences

Construct a profile (PSSM) of amino-acid frequencies at each residue

Use this profile as input to the neural network

A second network performs “smoothing”
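The profile-construction step can be illustrated with plain per-column frequencies from a multiple alignment. This is a simplified sketch: the toy alignment is invented, gaps are simply ignored, and practical profiles typically also apply sequence weighting and pseudocounts, which are omitted here.

```python
from collections import Counter


def column_frequencies(alignment):
    """Per-column amino-acid frequencies of an alignment given as equal-length strings."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*alignment):
        residues = [c for c in column if c != "-"]  # ignore gap characters
        counts = Counter(residues)
        total = len(residues)
        profile.append({aa: n / total for aa, n in counts.items()} if total else {})
    return profile


# Invented toy alignment of four homologous fragments.
ALIGNMENT = ["MKVL", "MRVL", "MKIL", "MK-L"]
for position, freqs in enumerate(column_frequencies(ALIGNMENT)):
    print(position, freqs)
```

Each window of the profile, rather than a window of single residues, then becomes the input to the network, so a position where K and R both occur looks different from one where K is strictly conserved.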

PHD Architecture

Psi-pred: same idea

(Step 1) Run PSI-BLAST --> output sequence profile

(Step 2) 15-residue sliding window = 315 input values (15 × 21), multiplied by the hidden weights in the 1st neural net. Output is 3 values (one for each state H, E or L) per position.

(Step 3) 60 input values, multiplied by the weights in the 2nd neural network, summed. Output is the final 3-state prediction.

Performs slightly better than PHD


Other Classification Methods

Neural networks were used as the classifier in the methods described so far.

We can apply the same idea with other classifiers, e.g. SVM

Advantages:

Effectively avoids overfitting

Supplies prediction confidence

SVM-Based Approach

Suggested by S. Hua and Z. Sun (2001).

Multiple sequence alignment from the HSSP database (same as PHD)

Sliding window of w → 21 × w input dimensions

Apply an SVM with an RBF kernel

Multiclass problem:

Training: one-against-others classifiers (e.g. H/~H, E/~E, L/~L) and binary classifiers (e.g. H/E)

Combining the classifiers:

Maximum output score

Decision tree method

Jury decision method

Decision tree

Tree1: H / ~H? Yes: H. No: E / C? Yes: E. No: C.

Tree2: E / ~E? Yes: E. No: C / H? Yes: C. No: H.

Tree3: C / ~C? Yes: C. No: H / E? Yes: H. No: E.
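Two of the ways to turn binary classifiers into a three-class prediction can be sketched without any SVM library, given the signed outputs of the binary classifiers. The scores below are made-up numbers, a positive score is assumed to mean the first class of the pair, and the function names are illustrative.

```python
def predict_max(one_vs_rest: dict) -> str:
    """Maximum output score: pick the class whose one-against-others output is largest.

    one_vs_rest maps "H", "E", "C" to the outputs of H/~H, E/~E and C/~C.
    """
    return max(one_vs_rest, key=one_vs_rest.get)


def predict_tree1(h_vs_rest: float, e_vs_c: float) -> str:
    """Decision tree that asks H/~H first, then E/C for residues that are not H."""
    if h_vs_rest > 0:
        return "H"
    return "E" if e_vs_c > 0 else "C"


# Made-up classifier outputs for a single residue.
print(predict_max({"H": -0.2, "E": 0.4, "C": 0.1}))
print(predict_tree1(h_vs_rest=-0.3, e_vs_c=0.5))
```

The other two trees follow the same pattern with the order of the questions changed, which is why their per-class accuracies differ in the results table.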

Accuracy on CB513 set

Classifier  Q3    QH    QE    QC    SOV
Max         72.9  74.8  58.6  79.0  75.4
Tree1       68.9  73.5  54.0  73.1  72.1
Tree2       68.2  72.0  61.0  69.0  71.4
Tree3       67.5  69.5  46.6  77.0  70.8
Vote        70.7  73.0  74.7  76.6  73.2
Jury        73.5  75.2  60.3  79.5  76.2
NN          72.0  74.7  57.7  77.4  75.0

State of the Art

Both PHD and nearest neighbor get about 72%-74% accuracy

Both predicted well in CASP2 (1996)

PSI-Pred is slightly better (around 76%)

Recent trend: combining classification methods

Best predictions in CASP3 (1998)

Failures:

Long-range effects: S-S bonds, parallel strands

Chemical patterns

Wrong predictions at the ends of helices/strands