proteins: structure & function - hu-berlin.de · structure (cath, scop, fssp) • highly...

Proteins:Proteins:Structure & Function

Ulf Leser

This Lecture

• Introduction– Structure– Function– Databases

• Predicting Protein Secondary Structure

• Many figures from Zvelebil, M. and Baum, J. O. (2008). "Understanding Bioinformatics", Garland Science, Taylor & Francis Group.E l f f O K hlb h V l S k h WS• Examples often from O. Kohlbacher, Vorlesung Strukturvorhersage, WS 2004/2005, Universität Tübingen

Ulf Leser: Bioinformatics, Summer Semester 2011 2

Central Dogma of Molecular Biologyg gy

The Real Picture

• Alternative Splicing• Alternative Splicing– “One gene – one protein” is wrong– Exons may be spliced from the protein sequence– Human: ~ 6 times more proteins than genes

• Post-translational modifications– (De-)Phosporylation, glycolysation, cleavage of signals, …– Estimates: 100K proteins, 500K protein forms

C l• Complexes– Proteins gather together to perform specific function

Example: Proteasomep

• Very large complexes present in allVery large complexes present in all eukaryotes (and more)– >2000 kDa, made of dozens of single

protein chains– Formation of the complex is a very

complex process only partly understood yetcomplex process only partly understood yet

• Breaks (mis-folded, broken, superfluous, …) proteins into small p , ) ppeptides for reuse– Suspicious proteins are tagged with

bi iti hi h bi d t th tubiquitin which binds to the proteasome

Protein Structure

• Primary– 1D-Seq. of AAq

• Secondary– 1D-Seq. of

“subfolds”

• Tertiary3D St t– 3D-Structure

• Quaternary3D Structure of– 3D-Structure of assembled complexes

Protein Function

• Proteins perform essentially everything that makes anProteins perform essentially everything that makes an organism alive– Metabolism– Signal processing– Gene regulation

Cell c cle– Cell cycle – …

• For ~1/3 of all human• For ~1/3 of all human gene, no function is known

• Describing functionDescribing function– Gene Ontology: 3 branches, >30.000 concepts– Used world-wide to describe gene/protein function

Protein Interactions and Networks

• Function usually is carried out by a complex interplayFunction usually is carried out by a complex interplay between many proteins and other molecules

• Pathways: (artificial) fragments of the cellular network y ( ) gassociated to a certain function– See lectures later

Function and Motifs

• A protein may “have”A protein may have many functions– Avg. n# of GO terms

assigned to a human protein: ~6

• Functions are carried out by substructures of a proteinC ll d tif d i– Called motifs or domains

• There probably exist only 4000-5000 motifsProteins: “Assemblies of functional motifs”– Proteins: Assemblies of functional motifs

• Performing a function often requires binding to another protein or moleculeprotein or molecule– The binding requires a certain constellation of the protein structure– Blocking such bindings is a major goal in pharmacological research

Protein Families and Classification

• Several DBs classify proteinsSeveral DBs classify proteins according to their overall structure (CATH, SCOP, FSSP)

• Highly related to evolutionary relationships (and function)

• Example: CATH– Class (all α, all β, α-β, other)

A hit t– Architecture– Topology (similar folds)– Homologous superfamilyHomologous superfamily

(sets of concrete proteins)

• Helpful to characterize novel proteins

Functional Classification

• Folds correlate with function, but many exceptionsFolds correlate with function, but many exceptions• Enzyme classification (EC-numbers)

– 4-level hierarchyy– Based on chemical

reactions that are catalyzedCl l t d t f ti– Closer related to functionthan classes of folds

– Relation protein:EC-number pis mostly 1:1

EC n mbe and GO• EC-number and GO-annotation are highly correlated– But >30 000 concepts versus <4000 EC-numbers

But >30.000 concepts versus <4000 EC numbers

Proteomics – Large Scale Protein Identificationg

• Measuring gene expression is comparably simpleMeasuring gene expression is comparably simple – Sequencing mRNA, microarrays (based on hybridization)

• Measuring proteins is much harderg p– Isolating proteins is very complex– Sequencing a protein is very slow

• Options (see lecture later)– Using 2D-Page

U i t t– Using mass spectrometry– De-dovo sequencing with MS/MS– Quantification is very difficultQuantification is very difficult

• Some classes of proteins are particularly hard to handle– Membrane proteins, non-soluble proteins

UniProt

• “Standard” database for protein sequences and annotationStandard database for protein sequences and annotation– Original name: SwissProt– Started at the Swiss Institute of Bioinformatics, now mostly EBI– Other: PIR, HPRD

• Continuous growth and curation– >30 „Scientific Database Curators“– Quarterly releases– Very rich annotation set– Very rich annotation set– Redundancy-free

• Actually two databasesActually two databases– SwissProt: Curated, high quality, versioned– TrEMBL: Automatic generation from (putative) coding genomic

sequences, low quality, redundant, much larger

UniProt: Species [http://www.expasy.org/sprot/relnotes/relstat.html]p

20258 Homo sapiens (Human) 16327 Mus musculus (Mouse) 9842 Arabidopsis thaliana (Mouse-ear cress)9842 Arabidopsis thaliana (Mouse ear cress) 7560 Rattus norvegicus (Rat) 6582 Saccharomyces cerevisiae (Baker's yeast) 5803 Bos taurus (Bovine)

( )…

PDB – Protein Structure Database

• Oldest protein database, evolved from a bookOldest protein database, evolved from a book• Contains all experimentally obtained protein 3D-structures

– Plus DNA, protein-ligand, complexes, …, p g , p ,– X-Ray (~75%), NMR (Nuclear magnetic resonance, ~23%)

• Still costly and slow techniques– Growth much smaller than that

of sequence-related DBs

Many problems with legacy• Many problems with legacydata and data formats

InterPro

• Integrated database of protein signatures, classifications,Integrated database of protein signatures, classifications, and motifs– Currently ~21.000 signatures

• Associates signatures with function (GO term)• InterProScan – quick identification of signatures in a

protein sequence– For a fast, first functional annotation

This Lecture

• IntroductionIntroduction• Predicting Protein Secondary Structure

– Secondary structure elementsy– Chou-Fasman– GOR IV – Other methods

Amino Acids

• An AA consists of a commonAn AA consists of a common„core“ and a specific residue– Amino group – NH2

– Central Cα - Carbon – CH– Carboxyl group – COOH

R id ( id h i )• Residues (side chains) vary greatly between AA

• Residues determine the• Residues determine the specific properties of a AA

Structure of a Protein

• The core forms the backbone of a protein (AA chain)The core forms the backbone of a protein (AA chain)• Main structure: Covalent

peptide bond between p pcarboxyl and amino group– Loss of H2O

Structure Flexibilityy

• In principle, every chemical bond can rotate freely, whichIn principle, every chemical bond can rotate freely, which would allow arbitrary structures to be formed

• In proteins things are much more restrictedp g– Peptide bound is “flat” – almost no torsion possible– Flexibility only in the Cα-flanking bonds φ and ψ

Ramachandran Plots

• Combinations of φ and ψ are highly constrainedCombinations of φ and ψ are highly constrained • Due to chemical properties of the backbone / side chains

• Two combinations are favored: α-helixes and β-sheetso o b a o a a o d α a d β• More detailed classifications exist• Angels lead to specific structures• Secondary structure

α-Helixes

• Comb. of angles forming a regularly structured helix

• Additional bonds between amino and carboxyl groups– Very stable structure

• May have two orientationsM t i ht h d d– Most are right-handed

• 3.4 AA per twistOft h t ti• Often short, sometimes very long

β-Sheetsβ

• Two linear and parallel stretches (β-strands)Two linear and parallel stretches (β strands)• Strands are bound together by hydrogen bounds• Can be parallel or anti-parallel (wrt. N/C terminus)Can be parallel or anti parallel (wrt. N/C terminus)

Other Substructures

• α-helixes and β-sheets form around 50-80% of a proteinα helixes and β sheets form around 50 80% of a protein• All other „glue“ parts are called loops or coils

– Usually not very important for the structure of the proteiny y p p– But very important for its function– Loops are often exposed on the surface and participate in binding

t th l lto other molecules

Importancep

• Secondary structure elements (SSE) are vital for the overall structure of a proteinp

• Thus, they often are evolutionary well conserved• SSE can be used to classify proteinsy p

– Such classes are highly associated to function

• Knowing secondary structure thus gives important clues to protein function

• Secondary structure prediction (SSP) is much simpler than 3D structure prediction– And 3D structure prediction can benefit a lot from a good SSP

Predicting Secondary Structureg y

• SSP: Given a protein sequence, assign each AA in theSSP: Given a protein sequence, assign each AA in the sequence one of the three classes Helix, E (Strand), or Coil

KVYGRCELAAAMKRLGLDNYRGYSLGNWVCAAKFESNFNTHATNRNTDGSTDYGILQINSRWWCNDGRTPGSKNLCNIPCSALLSSDITASVNCAKKIASGGNGMNAWVAWRNRCKGTDVHAWIRGCRL

KVYGRCELAAAMKRLGLDNYRGYSLGNWVCAAKFESNFNTHATNRNTD -----HHHHHHHHH-------------EEEEE----------------GSTDYGILQINSRWWCNDGRTPGSKNLCNIPCSALLSSDITASVNCAK ----EEEEEE--------------------------------HHHHHH KIASGGNGMNAWVAWRNRCKGTDVHAWIRGCRL HHH-------EEE--------------------

Classification

• Classification: Classify each AA into one of three classesy• Classification is a fundamental problem (not only in bioinformatics)

– Classify the readout of a microarray as diseased / healthyCl if b f di / di– Classify a subsequence of a genome as coding / non-coding

– Classify an email as spam / no spam

• A wealth of different techniques exist (Machine Learning)q ( g)– Naïve Bayes, Regression, Decision Trees, Support Vector Machines, Neural

Networks, Sequential Models, …

• Many of them use the same abstraction and can be exchanged easily• Many of them use the same abstraction and can be exchanged easily– Describe objects by features– Learn properties of feature values in the different classes– Derive a classification function on the features– Pre-requisite: Distribution of feature values are at least slightly different in

the different classes

This Lecture

– Secondary structure elementsy– Chou-Fasman– Other methods

Chou-Fasmann Algorithm Ch & F (1974) P di ti f t i f ti Bi h i t 13Chou & Fasman (1974). Prediction of protein conformation. Biochemistry 13

• Observation: Different AA favor different foldsObservation: Different AA favor different folds– Different AA are more or less often in H, E, C– Different AA are more or less often within, starting, or ending a

stretch of H, E, C

• Chou-Fasman algorithm (rough idea)Cl ifi h AA i t E H l ifi d AA i d C– Classifies each AA into E or H; unclassified AA are assigned C

– Compute a score for the probability of any AA to be E / H– Basis: Relative frequencies of assignments in a set of sequencesBasis: Relative frequencies of assignments in a set of sequences

with known secondary structure– In principle, assigns each AA its most frequent class– Add constraints about minimal length of E, H stretches and several

other heuristics

Some Details [sketch, some heuristics omitted]

• Let fj k be the relative frequency of observing AA j in class kLet fj,k be the relative frequency of observing AA j in class k• Let fk be the average over all 20 fj,k values• Compute the propensity of AA j to be part of class k asCompute the propensity of AA j to be part of class k as

Pj,k=fj,k/fk

• Using Pj,k, classify each AA j for class k according into– Strong, normal, weak builder (Hα, hα, Iα)Strong, normal, weak builder (Hα, hα, Iα)– Strong, weak breaker (Bα, bα)– Indifferent (iα)

• For now, context (neighboring AAs) is ignored completely

Concrete Values

• Originally computed on only 15 proteins (1974)Originally computed on only 15 proteins (1974)

Algorithm for Helicesg

• Go through the protein sequenceGo through the protein sequence• Score each AA with 1 (Hα, hα), 0.5 (Iα, iα), or -1 (Bα, bα)• Find helix cores: subsequences of length 6 with anFind helix cores: subsequences of length 6 with an

aggregated AA score ≥ 4• Starting from the middle of each core, shift a window of g ,

length 4 to the left (then to the right)– Compute the aggregated Pk-score P inside the window– If P ≥ 4, continue; otherwise stop

• Similar method for strands• Conflicts (regions assigned both H and E) are resolved

based on aggregated Pk-scores

Examplep

Performance

• Accuracy app. 50-60%Accuracy app. 50 60%– Measured on per-AA correctness

• Prediction is more accurate in helices than in strands– Because helices build local bridges (hydrogen bounds between the

turns; each AA binds to the +4 AA)

G l bl• General problem– Secondary structure is not only a local problem

Looking only at single AAs is not enough– Looking only at single AAs is not enough• Note: Scores are based on individual AA; aggregation by summation

assumes statistical independence of pairs, triples … in a class

– One needs to include the context of an AA

This Lecture

– Secondary structure elementsy– Chou-Fasman– Other methods

Classes of Methods

• First generation: Only look at properties of single AAFirst generation: Only look at properties of single AA– Accuracy: 50-60%– E.g. Chou-Fasman (1974)

• Second generation: Include info. about neighborhood– Accuracy: ~65%– E.g. GOR (1974 – 1987)

• Third generation: Include info. from homologous seq’sA 70 75%– Accuracy: ~70-75%

– E.g. PHD (1994)

• Forth generation: Build ensembles of good methods• Forth generation: Build ensembles of good methods• Accuracy: ~80%• E.g. Jpred (1998)

g p ( )

GOR Algorithm[Garnier et al. Analysis of the accuracy and implications of simple methods for predicting the secondary

f l b l l f l l l 20( ) 9 8]structure of globular proteins, Jounal of Molecular Biology 120(1), 1978]

• In principle, GOR uses Pk values for 16-grams of AAsIn principle, GOR uses Pk values for 16 grams of AAs– Recall: Helices have a “reach” of ~4 AA– But neighbors in a strand can be arbitrarily far away from each

other (but they are not in practice)

• These cannot be learned from countingTh 1620 1 2E24 diff t h 16– There are 1620~1.2E24 different such 16-grams

• Different versions of GOR use different methods to estimate these valuesestimate these values– GOR IV uses propensities of all pairs in this window

• Other differenceOther difference– Use of negative information (chances of AA j not being in class k)– No cores+extension: Each AA is classified based on its 16-context

PHD [Rost et al.: PHD-an automatic mail server for protein SSP, Bioinformatics 10(1), 1994]

• Uses two neural networksUses two neural networks• Input is not only the AA and its context, but its profile

– Given the input sequence X, PHD first search homologous p q , gsequences (using PSI-BLAST)

– All these are subjected to a multiple sequence alignmentTh “ l ” f AA i X i it fil– The “column” of an AA in X is its profile

• RationaleOn average one can exchange ~65% of a protein without– On average, one can exchange ~65% of a protein without changing its secondary structure notably

– Thus, the concrete AA is not as important as one might think– Homologous sequences are believed to have the same function and

the same secondary structure– We do not classify the AA but the list of all its replacements

– We do not classify the AA, but the list of all its replacements

proteins: structure & function - hu-berlin.de · structure (cath, scop, fssp) • highly...

Documents

cardiac cath complications

cath presentation

upgrades to the fssp-100 electronics -...

food spoilage and safety predictor (fssp) software · 8.2...

cath church

facilities shared services programme (fssp) team brief...

gabriel baumann, fssp - rerogabriel baumann, fssp la...

cath procedure: coronary angiography - sickkids · cath...

cath lab essentials: basic hemodynamics for the cath lab...

facilities shared services programme (fssp) team brief...

financial management fssp application part 2 · pdf...

cath sisson

leglise du christ - fssp-bordeaux.fr

problems cath ch

fssp offre de cours / fsp offerta di corsi 2016

cath preso

food staples sufficiency program (fssp)

ordinaire de la sainte messe - fssp archidiocèse de lyon

quÆ sursum sunt sapite - fssp-bordeaux.fr

solving the non-permutation flow shop scheduling problem ·...