proteins: structure & function - hu-berlin.de · structure (cath, scop, fssp) • highly...
Post on 18-Jul-2020
9 Views
Preview:
TRANSCRIPT
Proteins:Proteins:Structure & Function
Ulf Leser
This Lecture
• Introduction– Structure– Function– Databases
• Predicting Protein Secondary Structure
• Many figures from Zvelebil, M. and Baum, J. O. (2008). "Understanding Bioinformatics", Garland Science, Taylor & Francis Group.E l f f O K hlb h V l S k h WS• Examples often from O. Kohlbacher, Vorlesung Strukturvorhersage, WS 2004/2005, Universität Tübingen
Ulf Leser: Bioinformatics, Summer Semester 2011 2
Central Dogma of Molecular Biologyg gy
Ulf Leser: Bioinformatics, Summer Semester 2011 3
The Real Picture
• Alternative Splicing• Alternative Splicing– “One gene – one protein” is wrong– Exons may be spliced from the protein sequence– Human: ~ 6 times more proteins than genes
• Post-translational modifications– (De-)Phosporylation, glycolysation, cleavage of signals, …– Estimates: 100K proteins, 500K protein forms
C l• Complexes– Proteins gather together to perform specific function
Ulf Leser: Bioinformatics, Summer Semester 2011 4
Example: Proteasomep
• Very large complexes present in allVery large complexes present in all eukaryotes (and more)– >2000 kDa, made of dozens of single
protein chains– Formation of the complex is a very
complex process only partly understood yetcomplex process only partly understood yet
• Breaks (mis-folded, broken, superfluous, …) proteins into small p , ) ppeptides for reuse– Suspicious proteins are tagged with
bi iti hi h bi d t th tubiquitin which binds to the proteasome
Ulf Leser: Bioinformatics, Summer Semester 2011 5
Protein Structure
• Primary– 1D-Seq. of AAq
• Secondary– 1D-Seq. of
“subfolds”
• Tertiary3D St t– 3D-Structure
• Quaternary3D Structure of– 3D-Structure of assembled complexes
Ulf Leser: Bioinformatics, Summer Semester 2011 6
Protein Function
• Proteins perform essentially everything that makes anProteins perform essentially everything that makes an organism alive– Metabolism– Signal processing– Gene regulation
Cell c cle– Cell cycle – …
• For ~1/3 of all human• For ~1/3 of all human gene, no function is known
• Describing functionDescribing function– Gene Ontology: 3 branches, >30.000 concepts– Used world-wide to describe gene/protein function
Ulf Leser: Bioinformatics, Summer Semester 2011 7
Protein Interactions and Networks
• Function usually is carried out by a complex interplayFunction usually is carried out by a complex interplay between many proteins and other molecules
• Pathways: (artificial) fragments of the cellular network y ( ) gassociated to a certain function– See lectures later
Ulf Leser: Bioinformatics, Summer Semester 2011 8
Function and Motifs
• A protein may “have”A protein may have many functions– Avg. n# of GO terms
assigned to a human protein: ~6
• Functions are carried out by substructures of a proteinC ll d tif d i– Called motifs or domains
• There probably exist only 4000-5000 motifsProteins: “Assemblies of functional motifs”– Proteins: Assemblies of functional motifs
• Performing a function often requires binding to another protein or moleculeprotein or molecule– The binding requires a certain constellation of the protein structure– Blocking such bindings is a major goal in pharmacological research
Ulf Leser: Bioinformatics, Summer Semester 2011 9
Protein Families and Classification
• Several DBs classify proteinsSeveral DBs classify proteins according to their overall structure (CATH, SCOP, FSSP)
• Highly related to evolutionary relationships (and function)
• Example: CATH– Class (all α, all β, α-β, other)
A hit t– Architecture– Topology (similar folds)– Homologous superfamilyHomologous superfamily
(sets of concrete proteins)
• Helpful to characterize novel proteins
Ulf Leser: Bioinformatics, Summer Semester 2011 10
Functional Classification
• Folds correlate with function, but many exceptionsFolds correlate with function, but many exceptions• Enzyme classification (EC-numbers)
– 4-level hierarchyy– Based on chemical
reactions that are catalyzedCl l t d t f ti– Closer related to functionthan classes of folds
– Relation protein:EC-number pis mostly 1:1
EC n mbe and GO• EC-number and GO-annotation are highly correlated– But >30 000 concepts versus <4000 EC-numbers
Ulf Leser: Bioinformatics, Summer Semester 2011 11
But >30.000 concepts versus <4000 EC numbers
Proteomics – Large Scale Protein Identificationg
• Measuring gene expression is comparably simpleMeasuring gene expression is comparably simple – Sequencing mRNA, microarrays (based on hybridization)
• Measuring proteins is much harderg p– Isolating proteins is very complex– Sequencing a protein is very slow
• Options (see lecture later)– Using 2D-Page
U i t t– Using mass spectrometry– De-dovo sequencing with MS/MS– Quantification is very difficultQuantification is very difficult
• Some classes of proteins are particularly hard to handle– Membrane proteins, non-soluble proteins
Ulf Leser: Bioinformatics, Summer Semester 2011 12
p , p
UniProt
• “Standard” database for protein sequences and annotationStandard database for protein sequences and annotation– Original name: SwissProt– Started at the Swiss Institute of Bioinformatics, now mostly EBI– Other: PIR, HPRD
• Continuous growth and curation– >30 „Scientific Database Curators“– Quarterly releases– Very rich annotation set– Very rich annotation set– Redundancy-free
• Actually two databasesActually two databases– SwissProt: Curated, high quality, versioned– TrEMBL: Automatic generation from (putative) coding genomic
Ulf Leser: Bioinformatics, Summer Semester 2011 13
sequences, low quality, redundant, much larger
UniProt: Species [http://www.expasy.org/sprot/relnotes/relstat.html]p
20258 Homo sapiens (Human) 16327 Mus musculus (Mouse) 9842 Arabidopsis thaliana (Mouse-ear cress)9842 Arabidopsis thaliana (Mouse ear cress) 7560 Rattus norvegicus (Rat) 6582 Saccharomyces cerevisiae (Baker's yeast) 5803 Bos taurus (Bovine)
Ulf Leser: Bioinformatics, Summer Semester 2011 14
( )…
PDB – Protein Structure Database
• Oldest protein database, evolved from a bookOldest protein database, evolved from a book• Contains all experimentally obtained protein 3D-structures
– Plus DNA, protein-ligand, complexes, …, p g , p ,– X-Ray (~75%), NMR (Nuclear magnetic resonance, ~23%)
• Still costly and slow techniques– Growth much smaller than that
of sequence-related DBs
Many problems with legacy• Many problems with legacydata and data formats
Ulf Leser: Bioinformatics, Summer Semester 2011 15
InterPro
• Integrated database of protein signatures, classifications,Integrated database of protein signatures, classifications, and motifs– Currently ~21.000 signatures
• Associates signatures with function (GO term)• InterProScan – quick identification of signatures in a
protein sequence– For a fast, first functional annotation
Ulf Leser: Bioinformatics, Summer Semester 2011 16
This Lecture
• IntroductionIntroduction• Predicting Protein Secondary Structure
– Secondary structure elementsy– Chou-Fasman– GOR IV – Other methods
Ulf Leser: Bioinformatics, Summer Semester 2011 17
Amino Acids
• An AA consists of a commonAn AA consists of a common„core“ and a specific residue– Amino group – NH2
– Central Cα - Carbon – CH– Carboxyl group – COOH
R id ( id h i )• Residues (side chains) vary greatly between AA
• Residues determine the• Residues determine the specific properties of a AA
Ulf Leser: Bioinformatics, Summer Semester 2011 18
Structure of a Protein
• The core forms the backbone of a protein (AA chain)The core forms the backbone of a protein (AA chain)• Main structure: Covalent
peptide bond between p pcarboxyl and amino group– Loss of H2O
Ulf Leser: Bioinformatics, Summer Semester 2011 19
Structure Flexibilityy
• In principle, every chemical bond can rotate freely, whichIn principle, every chemical bond can rotate freely, which would allow arbitrary structures to be formed
• In proteins things are much more restrictedp g– Peptide bound is “flat” – almost no torsion possible– Flexibility only in the Cα-flanking bonds φ and ψ
Ulf Leser: Bioinformatics, Summer Semester 2011 20
Ramachandran Plots
• Combinations of φ and ψ are highly constrainedCombinations of φ and ψ are highly constrained • Due to chemical properties of the backbone / side chains
• Two combinations are favored: α-helixes and β-sheetso o b a o a a o d α a d β• More detailed classifications exist• Angels lead to specific structures• Secondary structure
Ulf Leser: Bioinformatics, Summer Semester 2011 21
α-Helixes
• Comb. of angles forming a regularly structured helix
• Additional bonds between amino and carboxyl groups– Very stable structure
• May have two orientationsM t i ht h d d– Most are right-handed
• 3.4 AA per twistOft h t ti• Often short, sometimes very long
Ulf Leser: Bioinformatics, Summer Semester 2011 22
β-Sheetsβ
• Two linear and parallel stretches (β-strands)Two linear and parallel stretches (β strands)• Strands are bound together by hydrogen bounds• Can be parallel or anti-parallel (wrt. N/C terminus)Can be parallel or anti parallel (wrt. N/C terminus)
Ulf Leser: Bioinformatics, Summer Semester 2011 23
Other Substructures
• α-helixes and β-sheets form around 50-80% of a proteinα helixes and β sheets form around 50 80% of a protein• All other „glue“ parts are called loops or coils
– Usually not very important for the structure of the proteiny y p p– But very important for its function– Loops are often exposed on the surface and participate in binding
t th l lto other molecules
Ulf Leser: Bioinformatics, Summer Semester 2011 24
Importancep
• Secondary structure elements (SSE) are vital for the overall structure of a proteinp
• Thus, they often are evolutionary well conserved• SSE can be used to classify proteinsy p
– Such classes are highly associated to function
• Knowing secondary structure thus gives important clues to protein function
• Secondary structure prediction (SSP) is much simpler than 3D structure prediction– And 3D structure prediction can benefit a lot from a good SSP
Ulf Leser: Bioinformatics, Summer Semester 2011 25
Predicting Secondary Structureg y
• SSP: Given a protein sequence, assign each AA in theSSP: Given a protein sequence, assign each AA in the sequence one of the three classes Helix, E (Strand), or Coil
KVYGRCELAAAMKRLGLDNYRGYSLGNWVCAAKFESNFNTHATNRNTDGSTDYGILQINSRWWCNDGRTPGSKNLCNIPCSALLSSDITASVNCAKKIASGGNGMNAWVAWRNRCKGTDVHAWIRGCRL
KVYGRCELAAAMKRLGLDNYRGYSLGNWVCAAKFESNFNTHATNRNTD -----HHHHHHHHH-------------EEEEE----------------GSTDYGILQINSRWWCNDGRTPGSKNLCNIPCSALLSSDITASVNCAK ----EEEEEE--------------------------------HHHHHH KIASGGNGMNAWVAWRNRCKGTDVHAWIRGCRL HHH-------EEE--------------------
Ulf Leser: Bioinformatics, Summer Semester 2011 26
Classification
• Classification: Classify each AA into one of three classesy• Classification is a fundamental problem (not only in bioinformatics)
– Classify the readout of a microarray as diseased / healthyCl if b f di / di– Classify a subsequence of a genome as coding / non-coding
– Classify an email as spam / no spam
• A wealth of different techniques exist (Machine Learning)q ( g)– Naïve Bayes, Regression, Decision Trees, Support Vector Machines, Neural
Networks, Sequential Models, …
• Many of them use the same abstraction and can be exchanged easily• Many of them use the same abstraction and can be exchanged easily– Describe objects by features– Learn properties of feature values in the different classes– Derive a classification function on the features– Pre-requisite: Distribution of feature values are at least slightly different in
the different classes
Ulf Leser: Bioinformatics, Summer Semester 2011 27
This Lecture
• IntroductionIntroduction• Predicting Protein Secondary Structure
– Secondary structure elementsy– Chou-Fasman– Other methods
Ulf Leser: Bioinformatics, Summer Semester 2011 28
Chou-Fasmann Algorithm Ch & F (1974) P di ti f t i f ti Bi h i t 13Chou & Fasman (1974). Prediction of protein conformation. Biochemistry 13
• Observation: Different AA favor different foldsObservation: Different AA favor different folds– Different AA are more or less often in H, E, C– Different AA are more or less often within, starting, or ending a
stretch of H, E, C
• Chou-Fasman algorithm (rough idea)Cl ifi h AA i t E H l ifi d AA i d C– Classifies each AA into E or H; unclassified AA are assigned C
– Compute a score for the probability of any AA to be E / H– Basis: Relative frequencies of assignments in a set of sequencesBasis: Relative frequencies of assignments in a set of sequences
with known secondary structure– In principle, assigns each AA its most frequent class– Add constraints about minimal length of E, H stretches and several
other heuristics
Ulf Leser: Bioinformatics, Summer Semester 2011 29
Some Details [sketch, some heuristics omitted]
• Let fj k be the relative frequency of observing AA j in class kLet fj,k be the relative frequency of observing AA j in class k• Let fk be the average over all 20 fj,k values• Compute the propensity of AA j to be part of class k asCompute the propensity of AA j to be part of class k as
Pj,k=fj,k/fk
• Using Pj,k, classify each AA j for class k according into– Strong, normal, weak builder (Hα, hα, Iα)Strong, normal, weak builder (Hα, hα, Iα)– Strong, weak breaker (Bα, bα)– Indifferent (iα)
• For now, context (neighboring AAs) is ignored completely
Ulf Leser: Bioinformatics, Summer Semester 2011 30
Concrete Values
• Originally computed on only 15 proteins (1974)Originally computed on only 15 proteins (1974)
Ulf Leser: Bioinformatics, Summer Semester 2011 31
Algorithm for Helicesg
• Go through the protein sequenceGo through the protein sequence• Score each AA with 1 (Hα, hα), 0.5 (Iα, iα), or -1 (Bα, bα)• Find helix cores: subsequences of length 6 with anFind helix cores: subsequences of length 6 with an
aggregated AA score ≥ 4• Starting from the middle of each core, shift a window of g ,
length 4 to the left (then to the right)– Compute the aggregated Pk-score P inside the window– If P ≥ 4, continue; otherwise stop
• Similar method for strands• Conflicts (regions assigned both H and E) are resolved
based on aggregated Pk-scores
Ulf Leser: Bioinformatics, Summer Semester 2011 32
Examplep
Ulf Leser: Bioinformatics, Summer Semester 2011 33
Performance
• Accuracy app. 50-60%Accuracy app. 50 60%– Measured on per-AA correctness
• Prediction is more accurate in helices than in strands– Because helices build local bridges (hydrogen bounds between the
turns; each AA binds to the +4 AA)
G l bl• General problem– Secondary structure is not only a local problem
Looking only at single AAs is not enough– Looking only at single AAs is not enough• Note: Scores are based on individual AA; aggregation by summation
assumes statistical independence of pairs, triples … in a class
– One needs to include the context of an AA
Ulf Leser: Bioinformatics, Summer Semester 2011 34
This Lecture
• IntroductionIntroduction• Predicting Protein Secondary Structure
– Secondary structure elementsy– Chou-Fasman– Other methods
Ulf Leser: Bioinformatics, Summer Semester 2011 35
Classes of Methods
• First generation: Only look at properties of single AAFirst generation: Only look at properties of single AA– Accuracy: 50-60%– E.g. Chou-Fasman (1974)
• Second generation: Include info. about neighborhood– Accuracy: ~65%– E.g. GOR (1974 – 1987)
• Third generation: Include info. from homologous seq’sA 70 75%– Accuracy: ~70-75%
– E.g. PHD (1994)
• Forth generation: Build ensembles of good methods• Forth generation: Build ensembles of good methods• Accuracy: ~80%• E.g. Jpred (1998)
Ulf Leser: Bioinformatics, Summer Semester 2011 36
g p ( )
GOR Algorithm[Garnier et al. Analysis of the accuracy and implications of simple methods for predicting the secondary
f l b l l f l l l 20( ) 9 8]structure of globular proteins, Jounal of Molecular Biology 120(1), 1978]
• In principle, GOR uses Pk values for 16-grams of AAsIn principle, GOR uses Pk values for 16 grams of AAs– Recall: Helices have a “reach” of ~4 AA– But neighbors in a strand can be arbitrarily far away from each
other (but they are not in practice)
• These cannot be learned from countingTh 1620 1 2E24 diff t h 16– There are 1620~1.2E24 different such 16-grams
• Different versions of GOR use different methods to estimate these valuesestimate these values– GOR IV uses propensities of all pairs in this window
• Other differenceOther difference– Use of negative information (chances of AA j not being in class k)– No cores+extension: Each AA is classified based on its 16-context
Ulf Leser: Bioinformatics, Summer Semester 2011 37
PHD [Rost et al.: PHD-an automatic mail server for protein SSP, Bioinformatics 10(1), 1994]
• Uses two neural networksUses two neural networks• Input is not only the AA and its context, but its profile
– Given the input sequence X, PHD first search homologous p q , gsequences (using PSI-BLAST)
– All these are subjected to a multiple sequence alignmentTh “ l ” f AA i X i it fil– The “column” of an AA in X is its profile
• RationaleOn average one can exchange ~65% of a protein without– On average, one can exchange ~65% of a protein without changing its secondary structure notably
– Thus, the concrete AA is not as important as one might think– Homologous sequences are believed to have the same function and
the same secondary structure– We do not classify the AA but the list of all its replacements
Ulf Leser: Bioinformatics, Summer Semester 2011 38
– We do not classify the AA, but the list of all its replacements
Further Readingg
• Gerhard Steger (2003). “Bioinformatik – Methoden zurGerhard Steger (2003). Bioinformatik Methoden zurVorhersage von RNA- und Proteinstrukturen”, Birkhäuser, chapter 8,10,11,13
• Zvelebil, M. and Baum, J. O. (2008). "Understanding Bioinformatics", Garland Science, Taylor & Francis Group, h 2 11 12 ( l )chapter 2, 11, 12 (partly)
Ulf Leser: Bioinformatics, Summer Semester 2011 39
top related