Protein structure – introduction

“Bioinformatics: genes, proteins and computers”, Orengo, Jones and Thornton (2003).

Secondary structure elements

α-helix

β-strand, β-sheet

Tertiary structure = protein fold

complete 3-dimensional structure

why is it interesting? isn’t the sequence enough?

a key to understanding protein function

structure-based drug design

detection of distant evolutionary relationships

the structure is more conserved!

Fold classification

classification: clustering proteins into structural families

motivation?

profound analysis of evolutionary mechanisms

constraints on secondary structure packing?

classification at domain level

hierarchical classification of protein domain structures in the Brookhaven Protein Data Bank (PDB).

domains are clustered at four major levels:

Class

Architecture

Topology

Homologous superfamily

Sequence family

CATH – Protein Structure Classification

http://www.biochem.ucl.ac.uk/bsm/cath_new/

Class
secondary structure content: mainly α, mainly β, α–β, low 2nd structure content.

Architecture
gross orientation of secondary structures, independent of connectivity.

Topology ( = fold)
clusters structures according to their topological connections.

CATH – hierarchical classification

CATH – architectures

CATH – architectures (cont.)

Homologous superfamily
homologous domains identified by sequence similarity and structure similarity

Sequence family
domains clustered in the same sequence families, with sequence identity > 35%
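The 35% sequence identity threshold can be illustrated with a minimal sketch. It assumes the two domain sequences are already aligned (equal length, `-` for gaps) and counts identity over the columns where neither sequence has a gap. The slides do not say which denominator CATH uses, so that convention and the function name are assumptions made for illustration.

```python
SEQUENCE_FAMILY_THRESHOLD = 35.0  # percent identity, as stated on the slide


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned, non-gap columns (assumed convention)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only the columns where both sequences have a residue.
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b)
               if x != "-" and y != "-"]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)


def same_sequence_family(aligned_a: str, aligned_b: str) -> bool:
    """True when the pair exceeds the 35% identity threshold."""
    return percent_identity(aligned_a, aligned_b) > SEQUENCE_FAMILY_THRESHOLD
```

For example, `percent_identity("ACDE-F", "ACDQWF")` compares five gap-free columns, four of which match.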

CATH – hierarchical classification

other classification schemes: SCOP, FSSP

partial disagreement between them.

Growing demand for protein structures!

PDB contains 20,868 structures

X-Ray and NMR have limitations.

WE NEED FASTER METHODS!

GenBank contains 24,027,936 sequences!

Protein Structure Prediction

I) Ab initio = ‘from the beginning’

- Simulation (physics)

- search for conformation with lowest energy

- Knowledge-based (i.e. statistics)

protein sequence: RGYSLGNWVC KVFGRCELAA

AMKRHGLDNY AAKFESNFNT

QATNRNTDGS TDYGILQINS

RWWCNDGRTP GSRNLCNIPC

SALLSSDITA SVNCAKKIVS

DGNGMNAWVA WRNRCKGTDV

Limited to very short peptides!

Can known structures assist prediction?

the number of possible folds seems to be limited!

CATH inspection: more than 36,000 domains, but...

only ~800 topology groups

Total of "new folds" (light blue) and "old folds" (orange) for a given year

PDB inspection: a ‘new’ protein has a good chance to be of a known structure!

Template-based prediction (fold recognition)

II) Comparative modeling (homology modeling)

- alignment with homologous sequence of known structure.

- high sequence identity areas: similar structure

- variable areas: must be built

can’t be used if no sequence similarity found!

III) Threading

- alignment with structure sequences in fold library

- sophisticated scoring function finds most similar fold

- ‘Threading’ aligns target sequence onto template structure

“What are the baselines for protein fold recognition?”

McGuffin, Bryson and Jones (2001)

Goals:

1. what constitutes a baseline level of success for protein fold recognition methods, above random guesswork?

2. can simple methods that make use of 2nd structure information assign folds more reliably?

3. how valuable might these methods be in the rapid construction of a useful hierarchical classification?

1. Absolute difference in length
2. Absolute difference in number of secondary structure elements
3. Simple alignment of secondary structure elements
4. Alignment of secondary structure elements (Przytycka et al., 1999)
5. Alignment of secondary structure elements without additional scoring
6. Alignment of secondary structure elements using DSSP as secondary structure assignment
7. Alignment of secondary structure elements with gap penalty
8. Alignment of secondary structure elements with gap penalty for long elements
9. Alignment of secondary structure elements with absolute difference in length as scoring scheme
10. Alignment of full length secondary structure strings
11. Alignment of primary sequence

shorten 2nd structure strings: CCCHHHHCCCEEECCHHHCCC → HCECH

pairwise alignment

scoring function also considers length of elements
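A minimal sketch of the shortening step and a pairwise alignment over the shortened strings follows. The slide's example yields HCECH, which suggests that terminal coil runs are dropped; that trimming is an inference, exposed here as a flag. The alignment is a plain Needleman-Wunsch global score with hypothetical match, mismatch and gap values, not the scoring function used in the paper.

```python
from itertools import groupby


def to_elements(ss_string, trim_terminal_coil=True):
    """Collapse runs into (state, length) pairs, e.g. 'HHHH' -> ('H', 4)."""
    elements = [(state, len(list(run))) for state, run in groupby(ss_string)]
    if trim_terminal_coil:  # assumption inferred from the slide's example
        while elements and elements[0][0] == "C":
            elements.pop(0)
        while elements and elements[-1][0] == "C":
            elements.pop()
    return elements


def shorten(ss_string, trim_terminal_coil=True):
    """One letter per secondary structure element."""
    return "".join(state for state, _length in
                   to_elements(ss_string, trim_terminal_coil))


def alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score (Needleman-Wunsch); score values are hypothetical."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]
```

The element lengths returned by `to_elements` are what a length-aware scoring function would consume; the sketch above scores element types only.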

The methods evaluated (ordered by complexity and runtime)

A representative set of protein domains

a set of 1087 domains representing different “Sequence Families” was selected from CATH.

generate an informative file for each domain:

1. >1atx00
2. GAAaLbKSDGPNTRGNSMSGTIWVFGcPSGWNNbEGRAIIGYacKQ
3. EEE TTS S TTSSEEEEEESS TT EEE SSSSSEEEE
4. CEEEEEHHECEEEECCCECEEEECCCEECCEECEEECCEECEEEEC

First evaluation: true positive percentage

compare true positive percentage, at a fixed 3% false positive.

run each method on all possible pairs from the 1087 set:

(a,b) (a,c) (a,d) ... (g,d) (g,e) ... (k,f) ... (r,s) ... ~590,000 pairs

sort each score list by descending similarity score:

(a,d) 0.99
(g,e) 0.98
(r,s) 0.87
...
(a,b) 0.63
(k,f) 0.45
(g,d) 0.37

let's assume there are 100 structurally similar pairs and 100 dissimilar pairs.

for each list: go top downward, and compare assignment to CATH:

start: true counter = 0, false counter = 0
CATH (a) = CATH (d): true counter = 1, false counter = 0
CATH (g) != CATH (e): true counter = 1, false counter = 1
CATH (r) = CATH (s): true counter = 2, false counter = 1
CATH (a) != CATH (b): true counter = 2, false counter = 2
CATH (k) != CATH (f): true counter = 2, false counter = 3

STOP! 3% false positives reached.

true positive for this method = 2%
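The walk down a sorted score list can be sketched as follows. This is a minimal sketch, not the authors' code: the function and parameter names are illustrative, and the example reuses the six pairs from the slide together with its assumed totals of 100 similar and 100 dissimilar pairs.

```python
def true_positive_percentage(scored_pairs, same_topology, n_similar,
                             n_dissimilar, max_false_percent=3):
    """Walk a score list from the top and stop at the false-positive limit."""
    ranked = sorted(scored_pairs, key=lambda item: item[1], reverse=True)
    true_count = 0
    false_count = 0
    for pair, _score in ranked:
        if same_topology(pair):
            true_count += 1
        else:
            false_count += 1
            # Integer comparison avoids floating-point rounding at the limit.
            if false_count * 100 >= max_false_percent * n_dissimilar:
                break
    return 100.0 * true_count / n_similar


# The six pairs shown on the slide, highest score first.
slide_scores = [(("a", "d"), 0.99), (("g", "e"), 0.98), (("r", "s"), 0.87),
                (("a", "b"), 0.63), (("k", "f"), 0.45), (("g", "d"), 0.37)]
# Pairs for which the slide states CATH(x) = CATH(y); (g,d) is never reached.
similar_pairs = {("a", "d"), ("r", "s")}

result = true_positive_percentage(
    slide_scores, lambda pair: pair in similar_pairs,
    n_similar=100, n_dissimilar=100)
print(result)  # 2.0
```

The third false positive, (k,f), reaches the 3% limit, at which point two true positives have been counted out of the assumed 100 similar pairs.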

We need lower, upper controls to compare with

lower control: intelligent guesswork

1. randomly assign CATH topology codes according to frequency

2. calculate true positive, false positive percentage
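The lower control can be sketched as a frequency-weighted random draw. The topology codes below are placeholders (`T1`, `T2`), and the seed parameter is added only to make the sketch reproducible; neither comes from the paper.

```python
import random
from collections import Counter


def random_topology_assignment(true_codes, seed=0):
    """Assign each domain a topology code drawn with probability
    proportional to how often that code occurs in the data set."""
    rng = random.Random(seed)
    frequency = Counter(true_codes)
    codes = list(frequency)
    weights = [frequency[code] for code in codes]
    return rng.choices(codes, weights=weights, k=len(true_codes))


def percent_correct(true_codes, assigned_codes):
    """Share of domains whose random code equals the true code."""
    hits = sum(1 for t, a in zip(true_codes, assigned_codes) if t == a)
    return 100.0 * hits / len(true_codes)


# Placeholder data: three domains of topology T1 and one of T2.
example_codes = ["T1", "T1", "T1", "T2"]
guessed = random_topology_assignment(example_codes, seed=1)
```

Because common topologies are drawn more often, this guess does better than a uniform draw over codes, which is why the slide calls it intelligent guesswork.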

upper control: automated recognition (given the 3D structure)

1. FSSP, SCOP and CATH databases were screened for all dissimilar domains that exist in all three of them.

2. FSSP gave similarity scores to all possible pairs.

3. FSSP assignments were compared against CATH, and against SCOP.

Optimisation of similarity scoring methods: “Class pre-filter”

each domain was assigned a class according to 2nd structure:

percentage of residues constituting α-helices / β-strands

domain “1cgt03”

80% of AA in β-strand

10% of AA in α-helix
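A class pre-filter of this kind can be sketched from a three-state secondary structure string. The percentage calculation follows the slide; the two cutoffs used to turn percentages into a class are hypothetical values chosen for illustration, since the slides do not give the thresholds that were actually used.

```python
def secondary_structure_percentages(ss_string):
    """Return (helix %, strand %) for a string of H, E and C states."""
    if not ss_string:
        raise ValueError("empty secondary structure string")
    n = len(ss_string)
    helix = 100.0 * ss_string.count("H") / n
    strand = 100.0 * ss_string.count("E") / n
    return helix, strand


def assign_class(ss_string, low_cutoff=20.0, minor_cutoff=15.0):
    """Assign a class; both cutoffs are hypothetical, not from the paper."""
    helix, strand = secondary_structure_percentages(ss_string)
    if helix + strand < low_cutoff:
        return "low secondary structure content"
    if strand < minor_cutoff:
        return "mainly alpha"
    if helix < minor_cutoff:
        return "mainly beta"
    return "alpha-beta"


# A toy string with the proportions quoted for domain 1cgt03:
# 80% strand, 10% helix, 10% coil.
toy_1cgt03 = "E" * 8 + "H" + "C"
```

With the pre-filter in place, a method only needs to score pairs whose assigned classes agree, which reduces the number of comparisons.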

most accurate is method number 5: “Alignment of secondary structure elements without additional scoring”, with: 27.18% true positive.

partial agreement between classification schemes: FSSP compared with SCOP: 61.1%, FSSP compared with CATH: 46.7%

methods that use 2nd structure alignments are in better agreement with CATH

accuracy ordering of methods doesn’t correspond to their relative complexity

methods that use 2nd structure usually don’t benefit notably from class pre-filter.

Second evaluation: CASP-like sensitivity

similarly to CASP – we measure the sensitivity of each method:

what is the probability of a method correctly assigning a fold?

lower control: a random proportional fold assignment

upper control: FSSP was used as a scoring method
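One way to read this sensitivity measure is as the share of query domains whose top-scoring match has the same fold. The sketch below takes that reading as an assumption; the slides do not spell out the exact definition, and the data structure and names are illustrative.

```python
def sensitivity(scores, fold_of):
    """Percent of queries whose best-scoring other domain shares their fold.

    scores: dict mapping (query, candidate) -> similarity score
    fold_of: dict mapping domain -> fold (topology) label
    """
    best = {}  # query -> (candidate, score) of the top hit so far
    for (query, candidate), score in scores.items():
        if query == candidate:
            continue  # a domain does not count as its own match
        if query not in best or score > best[query][1]:
            best[query] = (candidate, score)
    if not best:
        return 0.0
    correct = sum(1 for query, (candidate, _score) in best.items()
                  if fold_of[query] == fold_of[candidate])
    return 100.0 * correct / len(best)


# Toy data: a and b share a fold, c has a different one.
toy_folds = {"a": "X", "b": "X", "c": "Y"}
toy_scores = {("a", "b"): 0.9, ("a", "c"): 0.2,
              ("b", "a"): 0.9, ("b", "c"): 0.1,
              ("c", "a"): 0.2, ("c", "b"): 0.3}
```

In the toy data, a and b find each other as top hits, while c has no fold partner and is necessarily assigned wrongly.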

Sensitivity results:

method 5 wins again: 31.8% sensitivity.

other 2nd structure based methods follow with a small gap.

sensitivity order of the methods ~ true positive percentage order.

Similarity trees - can we construct classification?

Best method’s similarity scores for all pairs were clustered into a tree.

a. globin-like <> casein kinase

b. immunoglobulin-like <> thrombin subunit H

whole tree: generally disordered

[Figure: tree excerpts (a) and (b), with domain identifiers as leaf labels.
(a) 1ckjA2, 1irk02, 1phk02, 1ampE2, 1hcl02, 1ckjA2, 1gdj00, 1kobA2, 1hbg00, 1babA0, 1lhs00, 1mba00, 1eca00, 1ithA0, 1ash00, 1flp00, 1sctA0, 1cpcA0, 1ddt02, 1colA0
(b) 1bec01, 1tcrA2, 1edhA2, 1nfkA1, 1itbB1, 1cgt03, 1svpA2, 1jxpA2, 1try02, 1sgt02, 1sgpE1, 1sgpE2, 1dar02]

Conclusions

1. Baseline level to be exceeded by fold recognition methods: 27% true positive assignments allowing 3% false positive; sensitivity level of 32%.

2. methods which make use of 2nd structure information seem more accurate and sensitive than those that don’t.

3. simple 2nd structure alignments alone cannot construct a reliable classification hierarchy.

4. the agreement between FSSP, SCOP and CATH classification schemes is surprisingly low.
