Classifying the protein universe
Ashwin Sivakumar
Synapse-Associated Protein 97
Wu et al, 2002. EMBO J 19:5740-5751
Domain Analysis and Protein Families
Introduction: What are protein families?
Protein families: Description & Definition
Motifs and Profiles
The modular architecture of proteins
Domain Properties and Classification
Protein Families
Protein families are defined by homology:
In a family, everyone is related to everyone
Everybody in a family shares a common ancestor
Figure: Protein family 1, Protein family 2
Homology versus Similarity
Homologous proteins have similar 3D structures and (usually) share common ancestry:
1chg and 1sgt: 31% identity, 43% similarity
We can infer homology from similarity!
Figure: structures of 1chg and 1sgt (Superfamily: Trypsin-like Serine Proteases)
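Identity and similarity percentages like the ones quoted above are read off a pairwise alignment. The following is a minimal sketch of that calculation, assuming the two sequences are already aligned (equal length, "-" for gaps) and that percentages are taken over ungapped columns. The similarity groups are a hypothetical, simplified grouping chosen for illustration, not the scoring scheme behind the numbers on the slide.

```python
# Hypothetical grouping of physico-chemically similar residues (an assumption
# for illustration; real tools usually derive similarity from a scoring matrix).
SIMILAR_GROUPS = ["ILVM", "FYW", "KRH", "DE", "ST", "NQ", "AG"]


def percent_identity_similarity(aln_a, aln_b):
    """Return (% identity, % similarity) over the ungapped alignment columns."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    aligned = identical = similar = 0
    for x, y in zip(aln_a, aln_b):
        if x == "-" or y == "-":
            continue  # gap columns are not counted
        aligned += 1
        if x == y:
            identical += 1
            similar += 1  # identical residues also count as similar
        elif any(x in group and y in group for group in SIMILAR_GROUPS):
            similar += 1
    if aligned == 0:
        return 0.0, 0.0
    return 100.0 * identical / aligned, 100.0 * similar / aligned
```

Other conventions divide by the alignment length or by the length of the shorter sequence, so published percentages depend on the denominator as well as on the alignment.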
Homology versus Similarity
But homologous proteins may not share sequence similarity:
1chg and 1sgc: 15% identity, 25% similarity
We cannot infer similarity from homology
Figure: structures of 1chg and 1sgc (Superfamily: Trypsin-like Serine Proteases)
Homology versus Similarity
Similar sequences may not have structural similarity:
1chg and 2baa: 30% similarity, 140/245 aa
We cannot assume homology from similarity!
Figure: structures of 1chg and 2baa
Homology versus Similarity
Summary
– Sequences can be similar without being homologous
– Sequences can be homologous without being similar
Figure: Evolution / Homology; BLAST / Similarity; Families ??
Description of a Protein Family
Let’s assume we know some members of a protein family
What is common to them all?
Multiple alignment!
Describing Sequences in a Protein Family
As a motif or rule: describes essential features of the protein family
(catalytic residues, important structural residues)
As a profile: describes variability in the family alignment
Techniques for searching sequence databases
Some common strategies to uncover common domains/motifs of biological significance that categorize a protein into a family:
• Pattern - a deterministic syntax that describes multiple combinations of possible residues within a protein string
• Profile - a probabilistic generalization that assigns, to every position of a segment, the probability that each of the 20 amino acids (aa) will occur
• Consensus - the probability that a particular amino acid will be located at a given position
• Probabilistic pattern constructed from a multiple sequence alignment (MSA), with the opportunity to assign penalties for insertions and deletions
• PSSM - Position Specific Scoring Matrix
– Represents the sequence profile in tabular form
– Columns of weights for every aa, corresponding to each column of an MSA
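To make the PSSM idea concrete, here is a minimal sketch that turns an ungapped multiple alignment into one column of log-odds weights per alignment position. The uniform background frequency and the pseudocount of 1 are simplifying assumptions for illustration; real profile tools use empirical background frequencies and more elaborate sequence weighting.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Return a list with one {amino acid: log-odds score} dict per column."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for col in range(length):
        # Gap characters and non-standard letters are simply ignored.
        counts = Counter(seq[col] for seq in alignment if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score(pssm, segment):
    """Sum the column weights for a segment as long as the profile."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


def consensus(pssm):
    """Highest-scoring residue in every column."""
    return "".join(max(column, key=column.get) for column in pssm)


toy_alignment = ["ACD", "ACE", "ACD", "GCD"]  # made-up sequences
toy_pssm = build_pssm(toy_alignment)
```

Residues seen often in a column get positive weights and unseen residues get negative ones, which is what lets a profile tolerate variable positions while still penalizing changes at conserved ones.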
HMMs
Hidden Markov Models are statistical methods that consider all the possible combinations of matches, mismatches, and gaps to generate a consensus (Higgins, 2000)
• Sequence ordering and alignments are not necessary at the onset (but in many cases alignments are recommended)
• The more sequences, the better the models
• One can generate a model (profile/PSSM), then search a database with it (e.g. PFAM)
Motif Description of a Protein Family
Regular expressions:
........C.............S...L..I..DRY..I.......................W...
(alternative residues: I for L, E for D, W for Y, V for the last I)
/ C x{13} S x{3} [LI] x{2} I x{2} [DE] R [YW] x{2} [IV] x{10} – x{12} W /
x = [AC-IK-NP-TVWY]
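The motif above maps almost directly onto a Python regular expression. The sketch below assumes that "x{10} – x{12}" means "10 to 12 residues of any kind"; that reading is an assumption, since the slide notation is ambiguous. The helper name `find_motif` is hypothetical.

```python
import re

# x as defined on the slide: any of the 20 standard amino acids.
X = "[AC-IK-NP-TVWY]"

# Assumption: "x{10} – x{12}" is read as a variable spacer of 10 to 12 residues.
MOTIF = re.compile(
    f"C{X}{{13}}S{X}{{3}}[LI]{X}{{2}}I{X}{{2}}[DE]R[YW]{X}{{2}}[IV]{X}{{10,12}}W"
)


def find_motif(sequence):
    """Return (start position, matched segment) for every non-overlapping hit."""
    return [(m.start(), m.group()) for m in MOTIF.finditer(sequence)]


# Made-up test sequence that satisfies the motif, starting at index 2.
example = ("MK" + "C" + "A" * 13 + "S" + "A" * 3 + "L" + "A" * 2 + "I"
           + "A" * 2 + "DRY" + "A" * 2 + "I" + "A" * 11 + "W" + "GG")
hits = find_motif(example)
```

A regular expression is all-or-nothing: changing a single required residue (for example R to K in the DRY triplet) removes the hit entirely, which is the main motivation for the profile descriptions discussed next to motifs.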
Motif Description of a Protein Family
Database: PROSITE
“PROSITE is a database of protein families and domains. It is based on the observation that, while there is a huge number of different proteins, most of them can be grouped, on the basis of similarities in their sequences, into a limited number of families. Proteins or protein domains belonging to a particular family generally share functional attributes and are derived from a common ancestor. It is apparent, when studying protein sequence families, that some regions have been better conserved than others during evolution. These regions are generally important for the function of a protein and/or for the maintenance of its three-dimensional structure. By analyzing the constant and variable properties of such groups of similar sequences, it is possible to derive a signature for a protein family or domain, which distinguishes its members from all other unrelated proteins.”
http://au.expasy.org/prosite/prosite_details.html
Automated Motif Discovery
Given a set of sequences:
GIBBS Sampler: http://bayesweb.wadsworth.org/cgi-bin/gibbs.8.pl?data_type=protein
MEME: http://meme.sdsc.edu/meme/
PRATT: http://www.ebi.ac.uk/pratt
TEIRESIAS: http://cbcsrv.watson.ibm.com/Tspd.html
Automated Profile Generation
Any multiple alignment is a profile!
PSIBLAST
Algorithm:
1. Start from a single query sequence
2. Perform BLAST search
3. Build profile of neighbours
4. Repeat from 2 …
Very sensitive method for database search
PSI-Blast
Position Specific Iterative Blast
Starts with a sequence, BLASTs it, aligns selected results to the query sequence, estimates a profile (PSSM) from the resulting MSA, then searches the database with the profile
Iterate until the process stabilizes
Focus here is on domains, not entire sequences
Greatly improves sensitivity
PSIBLAST
Figure: Query, Profile1, Profile2, ... after n iterations; threshold for inclusion in profile
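The iteration described above can be sketched as a toy profile search. This is not the PSI-BLAST implementation: it assumes all sequences are ungapped and of equal length, replaces the BLAST step with a direct PSSM score, and uses a fixed raw-score threshold instead of E-values. All names, sequences and the threshold are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(seqs, pseudocount=1.0):
    """Log-odds profile from equal-length sequences (uniform background assumed)."""
    pssm = []
    for col in range(len(seqs[0])):
        counts = Counter(seq[col] for seq in seqs)
        total = len(seqs) + pseudocount * len(AMINO_ACIDS)
        pssm.append({aa: math.log2((counts[aa] + pseudocount) / total * len(AMINO_ACIDS))
                     for aa in AMINO_ACIDS})
    return pssm


def score(pssm, seq):
    return sum(column[aa] for column, aa in zip(pssm, seq))


def iterative_profile_search(query, database, threshold, max_iterations=10):
    """Grow the member set until no new sequence passes the inclusion threshold."""
    members = {query}
    for _ in range(max_iterations):
        pssm = build_pssm(sorted(members))                  # profile of current members
        hits = {s for s in database if score(pssm, s) >= threshold} | {query}
        if hits == members:                                 # process has stabilized
            break
        members = hits
    return members


query = "ACDEFG"
database = ["ACDEYG", "ACDQYG", "ACNQYG", "WWWWWW"]  # made-up sequences
family = iterative_profile_search(query, database, threshold=3.0)
```

In this toy run "ACNQYG" differs from the query at three of six positions and scores below the threshold against the query alone, but it is picked up once the closer relatives have widened the profile. The same mechanism can pull in false positives if the inclusion threshold is too permissive.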
Benchmarking a motif/profile
You have a description of a protein family, and you do a database search…
Are all hits truly members of your protein family?
Benchmarking:
Figure: Result compared with a Dataset of family members, non-members and unknown sequences
TP: true positive
TN: true negative
FP: false positive
FN: false negative
Benchmarking a motif/profile
Precision / Selectivity: Precision = TP / (TP + FP)
Sensitivity / Recall: Sensitivity = TP / (TP + FN)
Balancing both:
Precision ~ 1, Recall ~ 0: easy but useless
Precision ~ 0, Recall ~ 1: easy but useless
Precision ~ 1, Recall ~ 1: perfect but very difficult
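The two formulas can be evaluated directly from the set of search hits and the set of known family members. This is a minimal sketch; it assumes every sequence in the benchmark is labelled, so sequences of unknown status are left out. The function name and identifiers are hypothetical.

```python
def benchmark(hits, family_members):
    """Return (precision, sensitivity) of a search result against known members."""
    hits, family_members = set(hits), set(family_members)
    tp = len(hits & family_members)   # reported and truly in the family
    fp = len(hits - family_members)   # reported but not in the family
    fn = len(family_members - hits)   # in the family but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    return precision, sensitivity


# Made-up identifiers: 3 true positives, 1 false positive, 2 false negatives.
precision, sensitivity = benchmark(
    hits={"P1", "P2", "P3", "X1"},
    family_members={"P1", "P2", "P3", "P4", "P5"},
)
```

Reporting only the single best hit pushes precision towards 1 and recall towards 0, and reporting the whole database does the opposite, which is why both numbers have to be quoted together.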
The Modular Architecture of Proteins
BLAST search of a multi-domain protein
Figure labels: Phosphoglycerate kinase, Triosephosphate isomerase
What are domains?
Functional - from experiments:
Example: Decay Accelerating Factor (DAF) or CD55
Has six domains (units):
4x Sushi domain (complement regulation)
1x ST-rich ‘stalk’
1x GPI anchor (membrane attachment)
PDB entry 1ojy (sushi domains only)
P Williams et al (2003) Mapping CD55 Function. J Biol Chem 278(12): 10691-10696
There is only so much we can conclude…
Classifying domains [To aid structure prediction (predict structural domains, molecular function of the domain)]
Classifying complete sequences (predicting molecular function of proteins, large scale annotation)
Majority of proteins are multi-domain proteins.
What are domains?
Structural - from structures:
MKTQVAIIGAGPSGLLLGQLLHKAGIDNVILERQTPDYVLGRIRAGVLEQGMVDLLREAGVDRRMARDGLVHEGVEIAFAGQRRRIDLKRLSGGKTVTVYGQTEVTRDLMEAREACGATTVYQAAEVRLHDLQGERPYVTFERDGERLRLDCDYIAGCDGFHGISRQSIPAERLKVFERVYPFGWLGLLADTPPVSHELIYANHPRGFALCSQRSATRSRYYVQVPLTEKVEDWSDERFWTELKARLPAEVAEKLVTGPSLEKSIAPLRSFVVEPMQHGRLFLAGDAAHIVPPTGAKGLNLAASDVSTLYRLLLKAYREGRGELLERYSAICLRRIWKAERFSWWMTSVLHRFPDTDAFSQRIQQTELEYYLGSEAGLATIAENYVGLPYEEIE
1phh
Are these domains?
Yes - structural domains!
M A Marti-Renom (2003) Identification of Structural Domains in Proteins. DIMACS, Rutgers University, Piscataway, NJ, Feb 27 2003.
What are domains?
Mobile – Sequence Domains:
Figure: a mobile module recurring in Protein 1, Protein 2, Protein 3 and Protein 4
Domains are...
...evolutionary building blocks:
Families of evolutionarily-related sequence segments
Domain assignment often coupled with classification
With one or more of the following properties:
Globular
Independently foldable
Recurrence in different contexts
To be precise,
we say: “protein family”
we mean: “protein domain family”
Example: global alignment
Phthalate dioxygenase reductase (PDR_BURCE)
Toluene-4-monooxygenase electron transfer component (TMOF_PSEME)
Global alignment fails!
Only aligns largest domain.
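Why global alignment struggles with multi-domain proteins can be illustrated with two score-only dynamic programs: a global (Needleman-Wunsch style) and a local (Smith-Waterman style) alignment. This is a sketch under simple assumptions: linear gap penalties, a match/mismatch score instead of a substitution matrix, and two made-up sequences that share one "domain" embedded in unrelated flanks. It does not reproduce the PDR_BURCE / TMOF_PSEME example.

```python
def global_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best score when both sequences must be aligned end to end."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best score over all pairs of substrings (scores are clamped at zero)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best


shared_domain = "ACDEFGHIK"               # made-up shared segment
protein_a = "WWWWW" + shared_domain       # domain at the C-terminus
protein_b = shared_domain + "YYYYYYYY"    # domain at the N-terminus
```

The local score rewards the shared segment in full, while the global score is dragged down by the unrelated flanks it is forced to align or gap. This is one reason domain-level comparison is preferred for modular proteins.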
Sometimes even more complex!
PGBM_HUMAN: “Basement membrane-specific heparan sulphate proteoglycan core protein precursor”
http://www.sanger.ac.uk/cgi-bin/Pfam/swisspfamget.pl?name=P98160
http://www.glycoforum.gr.jp/science/word/proteoglycan/PGA09E.html
Figure: domain architecture of PGBM_HUMAN (axis labels: 980, 1960, 2940, 3920, 4391)
45 domains of 9 different types, according to Pfam
Categories of Domain Definitions
Sequence (continuous domains)
– Curated: PFAM, SMART, PROSITE, PRINTS
– Automatic: ADDA, DOMO, TRIBE-MCL, GENERAGE, SYSTERS, PROTOMAP
Structure (discontinuous domains)
– Curated: SCOP, CATH
– Automatic: DALI, PUU, DETEKTIVE, DOMAINPARSER 1 & 2, DIAL, STRUDL, DOMAK
Pfam - Protein family database
Pfam-A covers 7973 protein families, each described by an HMM profile built from a hand-curated multiple alignment.
You can search your sequence against these profiles to determine which family it belongs to.
Why we need to consider domains:
Sequence Space Graph
Figure: sequence space graph (labels: Sequence, Alignment)
Topology:
● 80% of all sequences in one giant component
● 10% in smaller groups
● 10% in singletons
Automatic domain definitions
Rely on alignment information
Alignment information is unreliable:
Incomplete sequences (fragments)
Spurious alignments
Conserved motifs in mostly disordered regions
How to remove the noise?
Distant relatives
UREA_CANEN: three-domain protein
Sequence Space Graph:
• Where to cut connections?
• What is real, what is noise?
• Precision vs Sensitivity…
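The question of where to cut connections can be made concrete on a toy sequence space graph. In this sketch the nodes are sequences, the edges carry an alignment score, and clusters are the connected components left after dropping edges below a cutoff. The scores, identifiers and cutoffs are made up for illustration; this is not the ADDA algorithm.

```python
from collections import defaultdict


def components(nodes, scored_edges, min_score):
    """Connected components after removing edges scoring below min_score."""
    neighbours = defaultdict(set)
    for a, b, edge_score in scored_edges:
        if edge_score >= min_score:
            neighbours[a].add(b)
            neighbours[b].add(a)
    seen, result = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], set()
        while stack:                       # depth-first traversal
            node = stack.pop()
            component.add(node)
            for other in neighbours[node]:
                if other not in seen:
                    seen.add(other)
                    stack.append(other)
        result.append(component)
    return sorted(result, key=len, reverse=True)


nodes = ["a1", "a2", "a3", "b1", "b2", "c1"]
edges = [("a1", "a2", 90), ("a2", "a3", 80), ("b1", "b2", 85),
         ("a3", "b1", 30)]                 # the last edge plays a spurious alignment


def sizes(cutoff):
    return [len(c) for c in components(nodes, edges, cutoff)]
```

With a permissive cutoff the single spurious edge merges two families into one large component (high sensitivity, low precision). A strict cutoff separates them but starts to split true families as well, which is the precision versus sensitivity trade-off named on the slide.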
ADDA
HolmGroup in-house database!
http://ekhidna.biocenter.helsinki.fi:9801/sqgraph/pairsdb
Classification of non-redundant sequences
100% level: 1562243 sequences, 2697368 domains
40% level: 479740 sequences, 827925 domains
PFAM-A benchmark
Sensitivity: 87% (average unification in single cluster)
Selectivity: 98% (average purity of cluster)
Coverage: 100% (all known proteins) [Pfam ~50%]
Figure labels: PFAM, PRODOM, DOMO, ADDA
Example: ABC transporter
UniProt id: CFTR_BOVIN
Properties of domains
Most domains: size approx 75 – 200 residues
So, you have a sequence...
...look it up in an existing database
– SRS: http://srs.ebi.ac.uk
– INTERPRO: http://www.ebi.ac.uk/interpro
...search against existing family descriptions
– PFAM: http://www.sanger.ac.uk/Software/Pfam
– SMART: http://smart.embl-heidelberg.de
– PRINTS: http://bioinf.man.ac.uk/dbbrowser/PRINTS
– PROSITE: http://us.expasy.org/prosite
...look it up in ADDA
Manually Curated Protein Family Databases
PFAM (Hidden Markov Models)
– http://www.sanger.ac.uk/Software/Pfam
SMART (Hidden Markov Models)
– http://smart.embl-heidelberg.de
PROSITE (Regular Expressions, Profiles)
– http://au.expasy.org/prosite
PRINTS (combination of Profiles)
– http://bioinf.man.ac.uk/dbbrowser/PRINTS
Why a multiple alignment?
With a multiple alignment, we can
guess which residues are “important”
secondary structure prediction
transmembrane segments prediction
homology modelling
guide to wet-lab EXPERIMENTATION!
build a motif/profile and find more family members
build phylogenetic trees
Multiple Alignments are THE central object in protein sequence analysis!
From sequence to function…
Methylmalonyl CoA Decarboxylase pattern [ILV]-x(3)-E-x(7)-V-[GA]-x-[IVL]-x-L-N-R-P mapped on the structure of 1DUB. The pink balls show the potential ligands and their binding pockets. The blue balls represent the residues that make up the motif on the known structure.
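The pattern in the caption above is written in PROSITE syntax, which can be translated mechanically into a regular expression. The converter below is a minimal sketch covering only the common elements (single residues, `x`, `[...]`, `{...}`, repeat counts, and the `<` / `>` anchors); it is not a complete PROSITE parser, and the function name is hypothetical.

```python
import re

# One PROSITE element: optional "<", a core, an optional repeat "(n)" or "(n,m)", optional ">".
ELEMENT = re.compile(r"(<?)(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?(>?)")


def prosite_to_regex(pattern):
    """Translate a (subset of a) PROSITE pattern into a Python regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        start, core, low, high, end = m.groups()
        if core == "x":
            core = "."                          # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"      # forbidden residues
        repeat = ""
        if low:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if start else "") + core + repeat + ("$" if end else ""))
    return "".join(parts)


regex = prosite_to_regex("[ILV]-x(3)-E-x(7)-V-[GA]-x-[IVL]-x-L-N-R-P")

# Made-up sequence containing one occurrence of the pattern, starting at index 2.
toy_sequence = "MA" + "I" + "AAA" + "E" + "AAAAAAA" + "V" + "G" + "A" + "L" + "A" + "LNRP" + "K"
hit = re.search(regex, toy_sequence)
```

Once translated, the pattern can be scanned over any sequence with the standard `re` module, and the matched positions can then be mapped onto a structure as in the figure.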
3-motif resource
The server seems to be down today!