Protein domain/family db
• Secondary databases are the fruit of analyses of the sequences found in the primary sequence databases
• Either manually curated (e.g. PROSITE, Pfam) or automatically generated (e.g. ProDom, DOMO)
• Each of them uses a different method to detect whether a protein belongs to a particular domain/family (patterns, profiles, HMMs)
Protein domain/family
• Most proteins have « modular » structures
• Estimation: ~ 3 domains / protein
• Domains (conserved sequences or structures) are identified by multiple sequence alignments
• Domains can be defined by different methods:
– Pattern (regular expression); used for very conserved domains
– Profiles (weighted matrices): two-dimensional tables of position-specific match, gap, and insertion scores, derived from aligned sequence families; used for less conserved domains
– Hidden Markov Model (HMM): probabilistic models; another method to generate profiles.
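To make the "weighted matrix" idea concrete, here is a minimal sketch of a profile built from a toy alignment and slid along a sequence. It is an illustration under loud assumptions, not the method of any particular database: the alignment and query sequence are made up, the background distribution is assumed uniform, only match scores are modelled (gap and insertion scores are left out), and the function names `build_profile`, `score_window` and `best_hit` are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(alignment, pseudocount=1.0):
    """Return one {residue: log-odds score} table per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2((counts[aa] + pseudocount) / total / background)
            for aa in AMINO_ACIDS
        })
    return profile

def score_window(profile, window):
    """Sum the position-specific match scores for one ungapped window."""
    return sum(column[aa] for column, aa in zip(profile, window))

def best_hit(profile, sequence):
    """Slide the profile along the sequence; return (best score, start index)."""
    width = len(profile)
    hits = [(score_window(profile, sequence[i:i + width]), i)
            for i in range(len(sequence) - width + 1)]
    return max(hits)

# Toy, ungapped alignment of four made-up family members.
toy_alignment = ["GKST", "GKTT", "GRST", "GKSS"]
toy_profile = build_profile(toy_alignment)
score, start = best_hit(toy_profile, "MAAGKSTLLA")
print(f"best window starts at {start} with score {score:.2f}")
```

Unlike a pattern, the profile still gives a (lower) score to a window that deviates from the consensus at one position, which is why profiles suit less conserved domains.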
Some statistics
• 15 most common domains for H. sapiens (Incomplete)
– Immunoglobulin and major histocompatibility complex domain
– Zinc finger, C2H2 type
– Eukaryotic protein kinase
– Rhodopsin-like GPCR superfamily
– Pleckstrin homology (PH) domain
– Zinc finger, RING type
– Src homology 3 (SH3) domain
– RNA-binding region RNP-1 (RNA recognition motif)
– EF-hand family
– Homeobox domain
– KRAB box
– PDZ domain (also known as DHR or GLGF)
– Fibronectin type III domain
– EGF-like domain
– Cadherin domain
– …
http://www.ebi.ac.uk/proteome/HUMAN/interpro/top15d.html
Protein domain/family db
• PROSITE: Patterns / Profiles
• ProDom: Aligned motifs
• PRINTS: Aligned motifs
• Pfam: HMM (Hidden Markov Models)
• SMART: HMM
• BLOCKS: Aligned motifs
InterPro
Prosite
• Created in 1988 (SIB)
• Contains fully annotated functional domains, based on two methods: patterns and profiles
• Entries are deposited in PROSITE in two distinct files:
– Patterns/profiles, with the list of all matches in SWISS-PROT
– Documentation
• Aug 2001: contains 1089 documentation entries that describe 1474 different patterns, rules and profiles/matrices.
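Since a PROSITE pattern is essentially a regular expression written in its own notation, a short sketch can show how one might be translated and applied. The code below is a simplified illustration, not PROSITE's own scanning software: `prosite_to_regex` is a hypothetical helper, it covers only the common syntax elements (`x`, `[...]`, `{...}`, repeats such as `(3)` or `(2,4)`, and the `<` / `>` anchors), and the test sequence is made up. The pattern used is the N-glycosylation site pattern, `N-{P}-[ST]-{P}`.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    regex_parts = []
    for element in pattern.rstrip(".").split("-"):
        start = end = ""
        if element.startswith("<"):      # '<' anchors the match at the N-terminus
            start, element = "^", element[1:]
        if element.endswith(">"):        # '>' anchors the match at the C-terminus
            end, element = "$", element[:-1]
        # Separate the residue specification from an optional repeat: (n) or (n,m)
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        residues, low, high = match.groups()
        if residues == "x":              # any residue
            core = "."
        elif residues.startswith("{"):   # {ABC}: any residue except A, B or C
            core = "[^" + residues[1:-1] + "]"
        else:                            # single residue or [ABC] alternatives
            core = residues
        if high:
            core += "{" + low + "," + high + "}"
        elif low:
            core += "{" + low + "}"
        regex_parts.append(start + core + end)
    return "".join(regex_parts)

n_glyc = prosite_to_regex("N-{P}-[ST]-{P}")
sequence = "MKNGSAANPTQ"  # made-up sequence, for illustration only
hits = [(m.start(), m.group()) for m in re.finditer(n_glyc, sequence)]
print(n_glyc, hits)  # N[^P][ST][^P] [(2, 'NGSA')]
```

The second `N` in the toy sequence is followed by a proline, which the `{P}` element excludes, so only one site is reported. Such short patterns match by chance quite often, which is one reason each PROSITE entry comes with documentation and a list of matches.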
Diagnostic performance
List of matches
Prosite (profile): example
PFAM (HMMs): an entry
…
…
PFAM (HMMs): query output
HMMs
• Most protein families are characterized by several conserved motifs
• Fingerprint: set of motif(s) (simple or composite, such as multidomains) = signature of family membership
• True family members exhibit all elements of the fingerprint, while subfamily members may possess only part of it
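The all-or-part logic of a fingerprint can be sketched in a few lines. This is only an illustration of the decision rule stated above: the motif names, the motif definitions and the sequences are invented, regular expressions stand in for the aligned motifs a real fingerprint database would use, and `classify` is a hypothetical helper.

```python
import re

# Hypothetical fingerprint: three made-up motifs written as regular expressions.
FINGERPRINT = {
    "motif_1": r"GKS[ST]",
    "motif_2": r"DE.D",
    "motif_3": r"W..W",
}

def classify(sequence, fingerprint=FINGERPRINT):
    """Report which motifs are present and what that suggests about membership."""
    found = [name for name, regex in fingerprint.items()
             if re.search(regex, sequence)]
    if len(found) == len(fingerprint):
        verdict = "family member (all motifs present)"
    elif found:
        verdict = "possible subfamily member (partial fingerprint)"
    else:
        verdict = "no match"
    return found, verdict

for seq in ("AAGKSTAADEADAAWAAWAA", "AAGKSTAAAA", "AAAA"):
    print(seq, classify(seq))
```

A fuller version would also check that the motifs occur in the expected order and spacing; this sketch only tests for presence.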
ProDom
• Consists of an automated compilation of homologous domain alignments.
• August 2001: 390 ProDom families were generated automatically using PSI-BLAST, built from non-fragmentary sequences from SWISS-PROT 39 + TREMBL (May 29th, 2000)
ProDom: query output example
Your query
Protein domain/family: Composite databases
Example: InterPro
• Unification of PROSITE, PRINTS, Pfam, ProDom and SMART into an integrated resource of protein families, domains and functional sites;
• Single set of documents linked to the various methods;
• Will be used to improve the functional annotation of SWISS-PROT (classification of unknown protein…)
• This release (3.2, July 2001) contains 3939 entries, representing 1009 domains, 2850 families, 65 repeats and 15 post-translational modification sites.