![Page 1: Classifying the protein universe Ashwin Sivakumar Synapse- Associated Protein 97 Wu et al, 2002. EMBO J 19:5740-5751](https://reader030.vdocuments.site/reader030/viewer/2022012910/56649d9f5503460f94a8a697/html5/thumbnails/1.jpg)
Classifying the protein universe
Ashwin Sivakumar
Synapse-Associated Protein 97
Wu et al, 2002. EMBO J 19:5740-5751
![Page 2: Classifying the protein universe Ashwin Sivakumar Synapse- Associated Protein 97 Wu et al, 2002. EMBO J 19:5740-5751](https://reader030.vdocuments.site/reader030/viewer/2022012910/56649d9f5503460f94a8a697/html5/thumbnails/2.jpg)
Domain Analysis and Protein Families
Introduction: What are protein families?
Protein families: Description & Definition
Motifs and Profiles
The modular architecture of proteins
Domain Properties and Classification
![Page 3: Classifying the protein universe Ashwin Sivakumar Synapse- Associated Protein 97 Wu et al, 2002. EMBO J 19:5740-5751](https://reader030.vdocuments.site/reader030/viewer/2022012910/56649d9f5503460f94a8a697/html5/thumbnails/3.jpg)
Protein Families
Protein families are defined by homology:
• In a family, everyone is related to everyone
• Everybody in a family shares a common ancestor
Protein family 1 Protein family 2
![Page 4: Classifying the protein universe Ashwin Sivakumar Synapse- Associated Protein 97 Wu et al, 2002. EMBO J 19:5740-5751](https://reader030.vdocuments.site/reader030/viewer/2022012910/56649d9f5503460f94a8a697/html5/thumbnails/4.jpg)
Homology versus Similarity
Homologous proteins share common ancestry and (usually) have similar 3D structures:
1chg and 1sgt: 31% identity, 43% similarity
We can infer homology from similarity!
[Figure: 1chg and 1sgt. Superfamily: Trypsin-like Serine Proteases]
![Page 5: Classifying the protein universe Ashwin Sivakumar Synapse- Associated Protein 97 Wu et al, 2002. EMBO J 19:5740-5751](https://reader030.vdocuments.site/reader030/viewer/2022012910/56649d9f5503460f94a8a697/html5/thumbnails/5.jpg)
Homology versus Similarity
But homologous proteins may not share sequence similarity:
[Figure: 1chg and 1sgc. Superfamily: Trypsin-like Serine Proteases]
1chg and 1sgc: 15% identity, 25% similarity
We cannot infer similarity from homology
![Page 6: Classifying the protein universe Ashwin Sivakumar Synapse- Associated Protein 97 Wu et al, 2002. EMBO J 19:5740-5751](https://reader030.vdocuments.site/reader030/viewer/2022012910/56649d9f5503460f94a8a697/html5/thumbnails/6.jpg)
Homology versus Similarity
Similar sequences may not have structural similarity:
[Figure: 1chg and 2baa]
1chg and 2baa: 30% similarity, 140/245 aa
We cannot assume homology from similarity!
![Page 7: Classifying the protein universe Ashwin Sivakumar Synapse- Associated Protein 97 Wu et al, 2002. EMBO J 19:5740-5751](https://reader030.vdocuments.site/reader030/viewer/2022012910/56649d9f5503460f94a8a697/html5/thumbnails/7.jpg)
Homology versus Similarity
Summary
– Sequences can be similar without being homologous
– Sequences can be homologous without being similar
[Diagram labels: Evolution / Homology; BLAST; Similarity; Families ??]
![Page 8: Classifying the protein universe Ashwin Sivakumar Synapse- Associated Protein 97 Wu et al, 2002. EMBO J 19:5740-5751](https://reader030.vdocuments.site/reader030/viewer/2022012910/56649d9f5503460f94a8a697/html5/thumbnails/8.jpg)
Domain Analysis and Protein Families
Introduction: What are protein families?
Protein families: Description & Definition
Motifs and Profiles
The modular architecture of proteins
Domain Properties and Classification
![Page 9: Classifying the protein universe Ashwin Sivakumar Synapse- Associated Protein 97 Wu et al, 2002. EMBO J 19:5740-5751](https://reader030.vdocuments.site/reader030/viewer/2022012910/56649d9f5503460f94a8a697/html5/thumbnails/9.jpg)
Description of a Protein Family
Let’s assume we know some members of a protein family
What is common to them all?
Multiple alignment!
![Page 10: Classifying the protein universe Ashwin Sivakumar Synapse- Associated Protein 97 Wu et al, 2002. EMBO J 19:5740-5751](https://reader030.vdocuments.site/reader030/viewer/2022012910/56649d9f5503460f94a8a697/html5/thumbnails/10.jpg)
Describing Sequences in a Protein Family
• As a motif or rule: describes essential features of the protein family
  – catalytic residues, important structural residues
• As a profile: describes variability in the family alignment
![Page 11: Classifying the protein universe Ashwin Sivakumar Synapse- Associated Protein 97 Wu et al, 2002. EMBO J 19:5740-5751](https://reader030.vdocuments.site/reader030/viewer/2022012910/56649d9f5503460f94a8a697/html5/thumbnails/11.jpg)
Techniques for searching sequence databases to uncover common domains/motifs of biological significance that categorize a protein into a family
Some common strategies:
• Pattern - a deterministic syntax that describes multiple combinations of possible residues within a protein string
• Profile - a probabilistic generalization that assigns, to every segment position, a probability that each of the 20 aa will occur
![Page 12: Classifying the protein universe Ashwin Sivakumar Synapse- Associated Protein 97 Wu et al, 2002. EMBO J 19:5740-5751](https://reader030.vdocuments.site/reader030/viewer/2022012910/56649d9f5503460f94a8a697/html5/thumbnails/12.jpg)
• Consensus - mathematical probability that a particular amino acid will be located at a given position
• Profile - probabilistic pattern constructed from an MSA, with the opportunity to assign penalties for insertions and deletions
• PSSM (Position-Specific Scoring Matrix)
– Represents the sequence profile in tabular form
– One column of weights, with an entry for every aa, for each column of the MSA
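As a minimal sketch of the PSSM idea, the snippet below turns a tiny ungapped alignment into per-column log-odds weights and scores a candidate segment against them. The toy alignment, the uniform background frequency, and the pseudocount of 1 are illustrative assumptions, not values from these slides; real tools use sequence weighting and substitution-matrix-based pseudocounts.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(alignment, pseudocount=1.0):
    """Return one {residue: log-odds weight} dict per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background frequency
    pssm = []
    for column in zip(*alignment):
        # Count only the 20 standard residues; gap characters are ignored.
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm

def score(pssm, segment):
    """Sum the column weights of an ungapped segment of the same length."""
    return sum(column[res] for column, res in zip(pssm, segment))

# Hypothetical toy alignment (four sequences, five columns).
toy_alignment = ["DRYAI", "ERYGV", "DRWAI", "DRYSI"]
pssm = build_pssm(toy_alignment)
family_like = score(pssm, "DRYAI")
unrelated = score(pssm, "KKKKK")
```

A segment that resembles the alignment scores higher than an unrelated one, and a fully conserved column (the R in column 2) gives that residue a positive weight and every other residue a negative one.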
![Page 13: Classifying the protein universe Ashwin Sivakumar Synapse- Associated Protein 97 Wu et al, 2002. EMBO J 19:5740-5751](https://reader030.vdocuments.site/reader030/viewer/2022012910/56649d9f5503460f94a8a697/html5/thumbnails/13.jpg)
HMMs
Hidden Markov Models are statistical methods that consider all possible combinations of matches, mismatches, and gaps to generate a consensus (Higgins, 2000)
• Sequence ordering and alignments are not necessary at the outset (but in many cases alignments are recommended)
• The more sequences, the better the model
• One can generate a model (profile/PSSM), then search a database with it (e.g. PFAM)
![Page 14: Classifying the protein universe Ashwin Sivakumar Synapse- Associated Protein 97 Wu et al, 2002. EMBO J 19:5740-5751](https://reader030.vdocuments.site/reader030/viewer/2022012910/56649d9f5503460f94a8a697/html5/thumbnails/14.jpg)
Motif Description of a Protein Family
Regular expressions:
Consensus: ........C.............S...L..I..DRY..I.......................W...
Alternative residues: I (for L), E (for D), W (for Y), V (for I)
/ C x{13} S x{3} [LI] x{2} I x{2} [DE] R [YW] x{2} [IV] x{10} – x{12} W /
x = [AC-IK-NP-TVWY]
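The motif above can be tried out directly with Python's `re` module. This is only a sketch: it assumes that "x{10} – x{12}" means a run of 10 to 12 arbitrary residues, and the test sequence is made up for illustration.

```python
import re

# x = any of the 20 standard amino acids, i.e. [AC-IK-NP-TVWY] written out.
X = "[ACDEFGHIKLMNPQRSTVWY]"

# Assumption: "x{10} - x{12}" on the slide is read as X{10,12}.
MOTIF = re.compile(
    "C" + X + "{13}"
    + "S" + X + "{3}"
    + "[LI]" + X + "{2}"
    + "I" + X + "{2}"
    + "[DE]R[YW]" + X + "{2}"
    + "[IV]" + X + "{10,12}"
    + "W"
)

def find_motif(sequence):
    """Return (start, matched_text) for every non-overlapping motif hit."""
    return [(m.start(), m.group()) for m in MOTIF.finditer(sequence)]

# Hypothetical sequence built to contain exactly one occurrence of the motif.
toy_sequence = (
    "MK" + "C" + "A" * 13 + "S" + "A" * 3 + "L" + "A" * 2 + "I" + "A" * 2
    + "DRY" + "A" * 2 + "I" + "A" * 11 + "W" + "GG"
)
hits = find_motif(toy_sequence)
```

A regular expression either matches or it does not, which is why such patterns are called deterministic: one unexpected substitution at a fixed position is enough to lose a true family member.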
![Page 15: Classifying the protein universe Ashwin Sivakumar Synapse- Associated Protein 97 Wu et al, 2002. EMBO J 19:5740-5751](https://reader030.vdocuments.site/reader030/viewer/2022012910/56649d9f5503460f94a8a697/html5/thumbnails/15.jpg)
Motif Description of a Protein Family
Database: PROSITE
“PROSITE is a database of protein families and domains. It is based on the observation that, while there is a huge number of different proteins, most of them can be grouped, on the basis of similarities in their sequences, into a limited number of families. Proteins or protein domains belonging to a particular family generally share functional attributes and are derived from a common ancestor. It is apparent, when studying protein sequence families, that some regions have been better conserved than others during evolution. These regions are generally important for the function of a protein and/or for the maintenance of its three-dimensional structure. By analyzing the constant and variable properties of such groups of similar sequences, it is possible to derive a signature for a protein family or domain, which distinguishes its members from all other unrelated proteins.”
http://au.expasy.org/prosite/prosite_details.html
![Page 16: Classifying the protein universe Ashwin Sivakumar Synapse- Associated Protein 97 Wu et al, 2002. EMBO J 19:5740-5751](https://reader030.vdocuments.site/reader030/viewer/2022012910/56649d9f5503460f94a8a697/html5/thumbnails/16.jpg)
Automated Motif Discovery
Given a set of sequences:
• GIBBS Sampler: http://bayesweb.wadsworth.org/cgi-bin/gibbs.8.pl?data_type=protein
• MEME: http://meme.sdsc.edu/meme/
• PRATT: http://www.ebi.ac.uk/pratt
• TEIRESIAS: http://cbcsrv.watson.ibm.com/Tspd.html
![Page 17: Classifying the protein universe Ashwin Sivakumar Synapse- Associated Protein 97 Wu et al, 2002. EMBO J 19:5740-5751](https://reader030.vdocuments.site/reader030/viewer/2022012910/56649d9f5503460f94a8a697/html5/thumbnails/17.jpg)
Automated Profile Generation
Any multiple alignment is a profile!
PSIBLAST
Algorithm:
1. Start from a single query sequence
2. Perform BLAST search
3. Build profile of neighbours
4. Repeat from 2 …
Very sensitive method for database search
![Page 18: Classifying the protein universe Ashwin Sivakumar Synapse- Associated Protein 97 Wu et al, 2002. EMBO J 19:5740-5751](https://reader030.vdocuments.site/reader030/viewer/2022012910/56649d9f5503460f94a8a697/html5/thumbnails/18.jpg)
PSI-Blast
• Start with a sequence and BLAST it
• Align selected results to the query sequence
• Estimate a profile from the MSA (this constructs a PSSM), then search the database with the profile
• Iterate until the process stabilizes
• Focus here is on domains, not entire sequences
• Greatly improves sensitivity
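The iterate-until-stable loop can be illustrated with a deliberately simplified toy. It is not BLAST: it assumes ungapped sequences of equal length, scores whole sequences against a log-odds profile with a uniform background, and uses a made-up database and inclusion threshold. The function names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(members, pseudocount=1.0):
    """Log-odds weights per column, against a uniform background (assumption)."""
    profile = []
    for column in zip(*members):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2((counts[aa] + pseudocount) / total * len(AMINO_ACIDS))
            for aa in AMINO_ACIDS
        })
    return profile

def profile_score(profile, sequence):
    return sum(column[res] for column, res in zip(profile, sequence))

def iterative_search(query, database, threshold, max_iterations=10):
    """Search, rebuild the profile from the hits, and repeat until the hit set is stable."""
    included = [query]
    for iteration in range(1, max_iterations + 1):
        profile = build_profile(included)
        hits = [s for s in database if profile_score(profile, s) >= threshold]
        new_included = [query] + [s for s in hits if s != query]
        if set(new_included) == set(included):
            return included, iteration  # process has stabilized
        included = new_included
    return included, max_iterations

# Hypothetical toy data: increasingly distant relatives of the query plus one outlier.
query = "DRYAI"
database = ["DRYGV", "ERYGV", "ERWGV", "KKKKK"]
threshold = 2.5  # illustrative "threshold for inclusion in profile"
family, iterations = iterative_search(query, database, threshold)
first_pass = profile_score(build_profile([query]), "ERYGV")
```

In this toy, the first pass finds only the closest relative. Once that hit is folded into the profile, the more distant relatives clear the threshold as well, which is the sensitivity gain that iteration buys; the unrelated sequence never does.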
![Page 19: Classifying the protein universe Ashwin Sivakumar Synapse- Associated Protein 97 Wu et al, 2002. EMBO J 19:5740-5751](https://reader030.vdocuments.site/reader030/viewer/2022012910/56649d9f5503460f94a8a697/html5/thumbnails/19.jpg)
Position Specific Iterative Blast (PSIBLAST)
[Figure: Query, Profile1, Profile2, ... after n iterations; threshold for inclusion in profile]
![Page 20: Classifying the protein universe Ashwin Sivakumar Synapse- Associated Protein 97 Wu et al, 2002. EMBO J 19:5740-5751](https://reader030.vdocuments.site/reader030/viewer/2022012910/56649d9f5503460f94a8a697/html5/thumbnails/20.jpg)
Benchmarking a motif/profile
You have a description of a protein family, and you do a database search…
Are all hits truly members of your protein family?
Benchmarking:
[Figure: dataset entries (unknown, family member, not a family member) compared with the search result]
TP: true positive
TN: true negative
FP: false positive
FN: false negative
![Page 21: Classifying the protein universe Ashwin Sivakumar Synapse- Associated Protein 97 Wu et al, 2002. EMBO J 19:5740-5751](https://reader030.vdocuments.site/reader030/viewer/2022012910/56649d9f5503460f94a8a697/html5/thumbnails/21.jpg)
Benchmarking a motif/profile
Precision / Selectivity: Precision = TP / (TP + FP)
Sensitivity / Recall: Sensitivity = TP / (TP + FN)
Balancing both:
• Precision ~ 1, Recall ~ 0: easy but useless
• Precision ~ 0, Recall ~ 1: easy but useless
• Precision ~ 1, Recall ~ 1: perfect but very difficult
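The two formulas are easy to compute once the search hits and the known family members are available as sets. The sketch below assumes a hypothetical benchmark of ten proteins with made-up identifiers; the function name is illustrative.

```python
def benchmark(hits, family_members, dataset):
    """Count TP/FP/FN/TN and derive precision and sensitivity (recall)."""
    hits, family_members, dataset = set(hits), set(family_members), set(dataset)
    tp = len(hits & family_members)            # hits that are true members
    fp = len(hits - family_members)            # hits that are not members
    fn = len(family_members - hits)            # members the search missed
    tn = len(dataset - hits - family_members)  # non-members correctly left out
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn,
            "precision": precision, "sensitivity": sensitivity}

# Hypothetical benchmark: ten proteins, four of which belong to the family.
dataset = [f"p{i}" for i in range(1, 11)]
family_members = ["p1", "p2", "p3", "p4"]
hits = ["p1", "p2", "p3", "p9"]
result = benchmark(hits, family_members, dataset)
```

Here the search finds three of the four members and reports one non-member, so precision and sensitivity are both 3/4. Reporting every protein as a hit would push sensitivity to 1 while precision collapses, which is the "easy but useless" corner.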
![Page 22: Classifying the protein universe Ashwin Sivakumar Synapse- Associated Protein 97 Wu et al, 2002. EMBO J 19:5740-5751](https://reader030.vdocuments.site/reader030/viewer/2022012910/56649d9f5503460f94a8a697/html5/thumbnails/22.jpg)
Domain Analysis and Protein Families
Introduction: What are protein families?
Protein families: Description & Definition
Motifs and Profiles
The modular architecture of proteins
Domain Properties and Classification
![Page 23: Classifying the protein universe Ashwin Sivakumar Synapse- Associated Protein 97 Wu et al, 2002. EMBO J 19:5740-5751](https://reader030.vdocuments.site/reader030/viewer/2022012910/56649d9f5503460f94a8a697/html5/thumbnails/23.jpg)
The Modular Architecture of Proteins
BLAST search of a multi-domain protein
[Figure labels: Phosphoglycerate kinase, Triosephosphate isomerase]
![Page 24: Classifying the protein universe Ashwin Sivakumar Synapse- Associated Protein 97 Wu et al, 2002. EMBO J 19:5740-5751](https://reader030.vdocuments.site/reader030/viewer/2022012910/56649d9f5503460f94a8a697/html5/thumbnails/24.jpg)
What are domains?
Functional - from experiments:
Example: Decay Accelerating Factor (DAF) or CD55
Has six domains (units):
• 4x Sushi domain (complement regulation)
• 1x ST-rich ‘stalk’
• 1x GPI anchor (membrane attachment)
PDB entry 1ojy (sushi domains only)
P Williams et al (2003) Mapping CD55 Function. J Biol Chem 278(12): 10691-10696
![Page 25: Classifying the protein universe Ashwin Sivakumar Synapse- Associated Protein 97 Wu et al, 2002. EMBO J 19:5740-5751](https://reader030.vdocuments.site/reader030/viewer/2022012910/56649d9f5503460f94a8a697/html5/thumbnails/25.jpg)
There is only so much we can conclude…
• Classifying domains [to aid structure prediction (predict structural domains, molecular function of the domain)]
• Classifying complete sequences (predicting molecular function of proteins, large scale annotation)
• The majority of proteins are multi-domain proteins.
![Page 26: Classifying the protein universe Ashwin Sivakumar Synapse- Associated Protein 97 Wu et al, 2002. EMBO J 19:5740-5751](https://reader030.vdocuments.site/reader030/viewer/2022012910/56649d9f5503460f94a8a697/html5/thumbnails/26.jpg)
What are domains?
Structural - from structures:
MKTQVAIIGAGPSGLLLGQLLHKAGIDNVILERQTPDYVLGRIRAGVLEQGMVDLLREAGVDRRMARDGLVHEGVEIAFAGQRRRIDLKRLSGGKTVTVYGQTEVTRDLMEAREACGATTVYQAAEVRLHDLQGERPYVTFERDGERLRLDCDYIAGCDGFHGISRQSIPAERLKVFERVYPFGWLGLLADTPPVSHELIYANHPRGFALCSQRSATRSRYYVQVPLTEKVEDWSDERFWTELKARLPAEVAEKLVTGPSLEKSIAPLRSFVVEPMQHGRLFLAGDAAHIVPPTGAKGLNLAASDVSTLYRLLLKAYREGRGELLERYSAICLRRIWKAERFSWWMTSVLHRFPDTDAFSQRIQQTELEYYLGSEAGLATIAENYVGLPYEEIE
1phh
Are these domains?
Yes - structural domains!
M A Marti-Renom (2003) Identification of Structural Domains in Proteins. DIMACS, Rutgers University, Piscataway, NJ, Feb 27 2003.
![Page 27: Classifying the protein universe Ashwin Sivakumar Synapse- Associated Protein 97 Wu et al, 2002. EMBO J 19:5740-5751](https://reader030.vdocuments.site/reader030/viewer/2022012910/56649d9f5503460f94a8a697/html5/thumbnails/27.jpg)
What are domains?
Mobile – Sequence Domains:
[Figure: a mobile module shown in Protein 1, Protein 2, Protein 3, and Protein 4]
![Page 28: Classifying the protein universe Ashwin Sivakumar Synapse- Associated Protein 97 Wu et al, 2002. EMBO J 19:5740-5751](https://reader030.vdocuments.site/reader030/viewer/2022012910/56649d9f5503460f94a8a697/html5/thumbnails/28.jpg)
Domains are...
...evolutionary building blocks:
• Families of evolutionarily-related sequence segments
• Domain assignment often coupled with classification
• With one or more of the following properties:
  – Globular
  – Independently foldable
  – Recurrence in different contexts
To be precise,
• we say: “protein family”
• we mean: “protein domain family”
![Page 29: Classifying the protein universe Ashwin Sivakumar Synapse- Associated Protein 97 Wu et al, 2002. EMBO J 19:5740-5751](https://reader030.vdocuments.site/reader030/viewer/2022012910/56649d9f5503460f94a8a697/html5/thumbnails/29.jpg)
Example: global alignment
• Phthalate dioxygenase reductase (PDR_BURCE)
• Toluene-4-monooxygenase electron transfer component (TMOF_PSEME)
Global alignment fails! Only aligns largest domain.
![Page 30: Classifying the protein universe Ashwin Sivakumar Synapse- Associated Protein 97 Wu et al, 2002. EMBO J 19:5740-5751](https://reader030.vdocuments.site/reader030/viewer/2022012910/56649d9f5503460f94a8a697/html5/thumbnails/30.jpg)
Sometimes even more complex!
PGBM_HUMAN: “Basement membrane-specific heparan sulphate proteoglycan core protein precursor”
http://www.sanger.ac.uk/cgi-bin/Pfam/swisspfamget.pl?name=P98160
http://www.glycoforum.gr.jp/science/word/proteoglycan/PGA09E.html
[Figure: domain architecture of PGBM_HUMAN; residue scale 980, 1960, 2940, 3920, 4391]
45 domains of 9 different types, according to Pfam
![Page 31: Classifying the protein universe Ashwin Sivakumar Synapse- Associated Protein 97 Wu et al, 2002. EMBO J 19:5740-5751](https://reader030.vdocuments.site/reader030/viewer/2022012910/56649d9f5503460f94a8a697/html5/thumbnails/31.jpg)
Domain Analysis and Protein Families
Introduction: What are protein families?
Protein families: Description & Definition
Motifs and Profiles
The modular architecture of proteins
Domain Properties and Classification
![Page 32: Classifying the protein universe Ashwin Sivakumar Synapse- Associated Protein 97 Wu et al, 2002. EMBO J 19:5740-5751](https://reader030.vdocuments.site/reader030/viewer/2022012910/56649d9f5503460f94a8a697/html5/thumbnails/32.jpg)
Categories of Domain Definitions

| | Sequence (continuous domains) | Structure (discontinuous domains) |
|---|---|---|
| Curated | PFAM, SMART, PROSITE, PRINTS | SCOP, CATH |
| Automatic | ADDA, DOMO, TRIBE-MCL, GENERAGE, SYSTERS, PROTOMAP | DALI, PUU, DETEKTIVE, DOMAINPARSER 1 & 2, DIAL, STRUDL, DOMAK |
![Page 33: Classifying the protein universe Ashwin Sivakumar Synapse- Associated Protein 97 Wu et al, 2002. EMBO J 19:5740-5751](https://reader030.vdocuments.site/reader030/viewer/2022012910/56649d9f5503460f94a8a697/html5/thumbnails/33.jpg)
Pfam: Protein family database
Pfam A covers 7973 protein families, each described by an HMM profile built from a hand-curated multiple alignment.
You can search your sequence against these profiles to determine which families it belongs to.
![Page 34: Classifying the protein universe Ashwin Sivakumar Synapse- Associated Protein 97 Wu et al, 2002. EMBO J 19:5740-5751](https://reader030.vdocuments.site/reader030/viewer/2022012910/56649d9f5503460f94a8a697/html5/thumbnails/34.jpg)
Why we need to consider domains: Sequence Space Graph
[Figure labels: Sequence, Alignment]
Topology:
● 80% of all sequences in one giant component
● 10% smaller groups
● 10% in singletons
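The topology figures describe connected components of the sequence space graph. As a sketch, assume that each sequence is a node and each pairwise alignment is an edge; the breadth-first search below then groups sequences into components. The identifiers and alignment pairs are invented for illustration.

```python
from collections import defaultdict, deque

def connected_components(sequences, alignments):
    """Group sequences that are linked, directly or indirectly, by alignments."""
    neighbours = defaultdict(set)
    for a, b in alignments:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen = set()
    components = []
    for start in sequences:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:  # breadth-first search from the unseen start node
            node = queue.popleft()
            component.append(node)
            for nb in neighbours[node]:
                if nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        components.append(sorted(component))
    return sorted(components, key=len, reverse=True)  # largest component first

# Hypothetical toy graph: ten sequences and six pairwise alignments.
sequences = [f"s{i}" for i in range(1, 11)]
alignments = [("s1", "s2"), ("s2", "s3"), ("s3", "s4"),
              ("s4", "s5"), ("s5", "s6"), ("s7", "s8")]
components = connected_components(sequences, alignments)
sizes = [len(c) for c in components]
```

Note that s1 and s6 land in the same component although they never align to each other directly. A multi-domain protein can act as such a bridge between otherwise unrelated families, which is one way a single giant component can arise.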
![Page 35: Classifying the protein universe Ashwin Sivakumar Synapse- Associated Protein 97 Wu et al, 2002. EMBO J 19:5740-5751](https://reader030.vdocuments.site/reader030/viewer/2022012910/56649d9f5503460f94a8a697/html5/thumbnails/35.jpg)
Automatic domain definitions
Rely on alignment information
Alignment information is unreliable:
• Incomplete sequences (fragments)
• Spurious alignments
• Conserved motifs in mostly disordered regions
How to remove the noise?
Distant relatives
UREA_CANEN: three domain protein
![Page 36: Classifying the protein universe Ashwin Sivakumar Synapse- Associated Protein 97 Wu et al, 2002. EMBO J 19:5740-5751](https://reader030.vdocuments.site/reader030/viewer/2022012910/56649d9f5503460f94a8a697/html5/thumbnails/36.jpg)
Sequence Space Graph:
• Where to cut connections?
• What is real, what is noise?
• Precision vs Sensitivity…
![Page 37: Classifying the protein universe Ashwin Sivakumar Synapse- Associated Protein 97 Wu et al, 2002. EMBO J 19:5740-5751](https://reader030.vdocuments.site/reader030/viewer/2022012910/56649d9f5503460f94a8a697/html5/thumbnails/37.jpg)
ADDA
HolmGroup in-house database!
http://ekhidna.biocenter.helsinki.fi:9801/sqgraph/pairsdb
Classification of non-redundant sequences
• 100% level: 1562243 sequences, 2697368 domains
• 40% level: 479740 sequences, 827925 domains
PFAM-A benchmark
• Sensitivity: 87% (average unification in single cluster)
• Selectivity: 98% (average purity of cluster)
• Coverage: 100% (all known proteins) [Pfam ~50%]
![Page 38: Classifying the protein universe Ashwin Sivakumar Synapse- Associated Protein 97 Wu et al, 2002. EMBO J 19:5740-5751](https://reader030.vdocuments.site/reader030/viewer/2022012910/56649d9f5503460f94a8a697/html5/thumbnails/38.jpg)
Example: ABC transporter
UniProt id: CFTR_BOVIN
[Figure labels: PFAM, PRODOM, DOMO, ADDA]
![Page 39: Classifying the protein universe Ashwin Sivakumar Synapse- Associated Protein 97 Wu et al, 2002. EMBO J 19:5740-5751](https://reader030.vdocuments.site/reader030/viewer/2022012910/56649d9f5503460f94a8a697/html5/thumbnails/39.jpg)
Properties of domains
Most domains: size approx 75 – 200 residues
![Page 40: Classifying the protein universe Ashwin Sivakumar Synapse- Associated Protein 97 Wu et al, 2002. EMBO J 19:5740-5751](https://reader030.vdocuments.site/reader030/viewer/2022012910/56649d9f5503460f94a8a697/html5/thumbnails/40.jpg)
So, you have a sequence...
...look it up in an existing database
– SRS: http://srs.ebi.ac.uk
– INTERPRO: http://www.ebi.ac.uk/interpro
...search against existing family descriptions
– PFAM: http://www.sanger.ac.uk/Software/Pfam
– SMART: http://smart.embl-heidelberg.de
– PRINTS: http://bioinf.man.ac.uk/dbbrowser/PRINTS
– PROSITE: http://us.expasy.org/prosite
...look it up in ADDA
![Page 41: Classifying the protein universe Ashwin Sivakumar Synapse- Associated Protein 97 Wu et al, 2002. EMBO J 19:5740-5751](https://reader030.vdocuments.site/reader030/viewer/2022012910/56649d9f5503460f94a8a697/html5/thumbnails/41.jpg)
Manually Curated Protein Family Databases
• PFAM (Hidden Markov Models)
  – http://www.sanger.ac.uk/Software/Pfam
• SMART (Hidden Markov Models)
  – http://smart.embl-heidelberg.de
• PROSITE (Regular Expressions, Profiles)
  – http://au.expasy.org/prosite
• PRINTS (combination of Profiles)
  – http://bioinf.man.ac.uk/dbbrowser/PRINTS
![Page 42: Classifying the protein universe Ashwin Sivakumar Synapse- Associated Protein 97 Wu et al, 2002. EMBO J 19:5740-5751](https://reader030.vdocuments.site/reader030/viewer/2022012910/56649d9f5503460f94a8a697/html5/thumbnails/42.jpg)
Why a multiple alignment?
With a multiple alignment, we can:
• guess which residues are “important”
• predict secondary structure
• predict transmembrane segments
• do homology modelling
• guide wet-lab EXPERIMENTATION!
• build a motif/profile and find more family members
• build phylogenetic trees
Multiple alignments are THE central object in protein sequence analysis!
![Page 43: Classifying the protein universe Ashwin Sivakumar Synapse- Associated Protein 97 Wu et al, 2002. EMBO J 19:5740-5751](https://reader030.vdocuments.site/reader030/viewer/2022012910/56649d9f5503460f94a8a697/html5/thumbnails/43.jpg)
From sequence to function…
Methylmalonyl CoA Decarboxylase pattern [ILV]-x(3)-E-x(7)-V-[GA]-x-[IVL]-x-L-N-R-P mapped onto the structure of 1DUB. The pink balls show the potential ligands and their binding pockets. The blue balls represent the residues that make up the motif on the known structure.
3-motif resource
The server seems to be down today!