Automated Prokaryotic Annotation at JCVI
DESCRIPTION
Conference: Annual BRC Meeting (BRC6), Oct 28-29, 2008 in Ft. Lauderdale, Florida. Presenter: Dan Haft
TRANSCRIPT
![Page 1: Automated Prokaryotic Annotation at JCVI](https://reader034.vdocuments.site/reader034/viewer/2022042614/554e9065b4c905fc368b4bd3/html5/thumbnails/1.jpg)
Automated Prokaryotic Annotation
at the JCVI
Daniel Haft
2008
![Page 2: Automated Prokaryotic Annotation at JCVI](https://reader034.vdocuments.site/reader034/viewer/2022042614/554e9065b4c905fc368b4bd3/html5/thumbnails/2.jpg)
A Dual-Use Pipeline
Multiple types of stored evidence
Persistent & Flexibly Interleaved
Supports selective re-annotation
Features annotation-driving databases
- CHAR
- TIGRFAMs
- Genome Properties
- BrainGrab Rules
Evidence used by Machine and by Experts
MANATEE interface for annotators
Capture new rules with BrainGrab
![Page 3: Automated Prokaryotic Annotation at JCVI](https://reader034.vdocuments.site/reader034/viewer/2022042614/554e9065b4c905fc368b4bd3/html5/thumbnails/3.jpg)
Computable objects: Output from one program becomes input to another.
HMM results drive Genome Properties
Genome Properties guide GO process assignments
GO process terms
![Page 4: Automated Prokaryotic Annotation at JCVI](https://reader034.vdocuments.site/reader034/viewer/2022042614/554e9065b4c905fc368b4bd3/html5/thumbnails/4.jpg)
Identification of Genome Features
IMM built
Glimmer builds a statistical model from the training set
Genome Sequence
• rRNA, tRNA, Rfam
• IS elements · Phage regions · Repeats
ORFs
Other Genome Features
![Page 5: Automated Prokaryotic Annotation at JCVI](https://reader034.vdocuments.site/reader034/viewer/2022042614/554e9065b4c905fc368b4bd3/html5/thumbnails/5.jpg)
Gene Finding: Glimmer & friends, homology methods
Homology Searches (gathering evidence): BLAST-Extend-Repraze, Hidden Markov Models, misc.
Structural Curation (ORF Management): Auto_Gene_Curate (start sites, overlaps), InterEvidence
Functional Assignments: Auto_Annotate, Manual, Mapped
Data Availability
![Page 6: Automated Prokaryotic Annotation at JCVI](https://reader034.vdocuments.site/reader034/viewer/2022042614/554e9065b4c905fc368b4bd3/html5/thumbnails/6.jpg)
Homology Searches
• HMM searches: TIGRFAMs & Pfam
• BLAST searches: against internal NIAA
• PROSITE motifs
• InterPro
• TmHMM
• SignalP
• Lipoprotein
• Psort
• Generate Paralogous Families
• Custom database searches (TransportDB, Rules)
![Page 7: Automated Prokaryotic Annotation at JCVI](https://reader034.vdocuments.site/reader034/viewer/2022042614/554e9065b4c905fc368b4bd3/html5/thumbnails/7.jpg)
Gene Model Curation
• Overlaps resolved by evidence competition
• Start site curation
• Missed genes / unsupported gene calls
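As a rough illustration of resolving overlaps by evidence competition, the sketch below keeps the better-supported of two overlapping gene calls. The `GeneCall` fields, the single `evidence_score`, and the `max_overlap` threshold are all hypothetical; the slides do not say how Auto_Gene_Curate actually weighs evidence.

```python
from dataclasses import dataclass


@dataclass
class GeneCall:
    name: str
    start: int             # 1-based inclusive coordinates, start <= end
    end: int
    evidence_score: float  # hypothetical summary of homology evidence


def overlap_length(a: GeneCall, b: GeneCall) -> int:
    """Number of positions shared by two gene calls (0 if disjoint)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def resolve_overlaps(calls, max_overlap=30):
    """Greedy competition: visit calls from strongest to weakest evidence and
    keep a call only if it overlaps every already-kept call by at most
    max_overlap positions."""
    kept = []
    for call in sorted(calls, key=lambda c: c.evidence_score, reverse=True):
        if all(overlap_length(call, k) <= max_overlap for k in kept):
            kept.append(call)
    return sorted(kept, key=lambda c: c.start)
```

A real curation step would presumably also consider moving a start site to shrink an overlap before discarding a gene call; this sketch only shows the competition itself.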
![Page 8: Automated Prokaryotic Annotation at JCVI](https://reader034.vdocuments.site/reader034/viewer/2022042614/554e9065b4c905fc368b4bd3/html5/thumbnails/8.jpg)
Evidence can Overhang the Gene
Blast-Extend-Repraze (BER)
The extensions help in the detection of frameshifts (FS) and point mutations resulting in in-frame stop codons (PM). This is indicated when similarity extends outside the coordinates of the protein coding sequence. Blue line indicates predicted protein coding sequence, green line indicates up- and downstream extensions. Red line is the match protein.
[Figure: ORFxxxxx from end5 to end3 with a 300 bp extension on each side; search protein aligned against match protein]
Figure labels:
- normal full length match
- similarity extends in the same frame through a stop codon
- * similarity extends upstream through a start, or downstream past a frameshift
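The cases on this slide can be sketched as a small classifier over alignment segments. This is a minimal sketch, not the BER implementation: the `Hsp` record, the frame convention (frame 0 is the ORF's own frame), and the returned labels are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Hsp:
    start: int  # nucleotide coordinates on the extended search sequence
    end: int
    frame: int  # 0 = same frame as the ORF, 1 or 2 = shifted (assumed convention)


def classify_overhang(orf_start, orf_end, hsps):
    """Label how the match similarity relates to the predicted ORF."""
    overhanging = [h for h in hsps if h.start < orf_start or h.end > orf_end]
    if not overhanging:
        return "no overhang"
    if any(h.frame != 0 for h in overhanging):
        # Similarity continues in a different frame outside the ORF.
        return "possible frameshift (FS)"
    if any(h.end > orf_end for h in overhanging):
        # Same frame, running downstream through the ORF's stop codon.
        return "possible in-frame stop (PM)"
    # Same frame, running upstream through the annotated start.
    return "possible upstream start"
```

With 300 bp extensions, an ORF of 900 bp would sit at coordinates 301 to 1200 of the extended sequence, so any segment reaching below 301 or above 1200 counts as overhanging.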
![Page 9: Automated Prokaryotic Annotation at JCVI](https://reader034.vdocuments.site/reader034/viewer/2022042614/554e9065b4c905fc368b4bd3/html5/thumbnails/9.jpg)
Pfam vs. TIGRFAMs
TIGRFAMs: RULES
- Functional assignments to proteins
- Granularity tuned for single-hit equivalogs (mono-functional!)
- Generates computable objects --> pathway reconstructions
Pfam: Explanations
- Names for homology domains in proteins
- Granularity tuned for twilight-level sequence similarity detection
- Explains things to annotator
![Page 10: Automated Prokaryotic Annotation at JCVI](https://reader034.vdocuments.site/reader034/viewer/2022042614/554e9065b4c905fc368b4bd3/html5/thumbnails/10.jpg)
TIGRFAMs equivalogs vs. Pfam domains
[Figure: sequences labeled X, X, X, Y, Z, with braces marking the groups associated with TIGRxxxxx and PFxxxxx]
![Page 11: Automated Prokaryotic Annotation at JCVI](https://reader034.vdocuments.site/reader034/viewer/2022042614/554e9065b4c905fc368b4bd3/html5/thumbnails/11.jpg)
TIGRFAMs as annotation rules
- EC number: computable!
- GO term: computable!
- protein name: computable?
- HMM hit: computable!!
![Page 12: Automated Prokaryotic Annotation at JCVI](https://reader034.vdocuments.site/reader034/viewer/2022042614/554e9065b4c905fc368b4bd3/html5/thumbnails/12.jpg)
Isology (homology) types: ranking our rules
EXCEPTION: additional info, e.g. “vegetative”
EQUIVALOG: the SAME (in enough ways) to receive the same name across multiple genomes, reflecting one specific function.
SUBFAMILY: can name a whole class
DOMAIN: class name for a protein region
(and apply these classifications also to Pfam)
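One way to read "ranking our rules" is that, when several models hit the same protein, the most specific isology type supplies the name. The sketch below assumes the order listed on the slide is the priority order; the accessions and protein names in the usage note are placeholders, not real TIGRFAMs or Pfam entries.

```python
# Assumed priority: lower number wins. The order follows the slide's listing.
ISOLOGY_RANK = {"EXCEPTION": 0, "EQUIVALOG": 1, "SUBFAMILY": 2, "DOMAIN": 3}


def best_rule(hits):
    """Pick the naming rule with the most specific isology type.

    hits: list of (model_accession, isology_type, protein_name) tuples
    for models that scored above their cutoff on one protein.
    Returns the winning tuple, or None if there are no hits.
    """
    if not hits:
        return None
    return min(hits, key=lambda hit: ISOLOGY_RANK[hit[1]])
```

For example, given a hypothetical domain-level hit `("PF_B", "DOMAIN", "kinase domain protein")` and an equivalog hit `("TIGR_A", "EQUIVALOG", "example kinase")`, the equivalog wins and the protein receives the specific name.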
![Page 13: Automated Prokaryotic Annotation at JCVI](https://reader034.vdocuments.site/reader034/viewer/2022042614/554e9065b4c905fc368b4bd3/html5/thumbnails/13.jpg)
CHAR: Experimentally Characterized Protein Database
• Highly curated database of experimentally characterized proteins; connects protein accessions, known function, and the scientific literature.
• What does it include:
– Controlled vocabulary describes the type of experimentation performed in each publication
– Key annotation fields (protein name, gene symbol, Enzyme Commission (EC) number, taxonomic data, Gene Ontology (GO) terms) are extracted
– Synonymous protein accessions obtained from public databases (GenBank and UniProt) are stored
![Page 14: Automated Prokaryotic Annotation at JCVI](https://reader034.vdocuments.site/reader034/viewer/2022042614/554e9065b4c905fc368b4bd3/html5/thumbnails/14.jpg)
Annotation Proceeds from …
Inside --> out (e.g. AutoAnnotate): for every protein
- Collect evidence
- Best-guess annotation
Outside --> in (e.g. TIGRFAMs): for every model
- Search tool + cutoff + standards = annotation rule
- Achieves partial coverage
Hybrid (BrainGrab): for every unfinished protein
- Look for means to annotate: blastp, synteny, hole-filling, etc.
- Capture annotator logic as a new rule
- Add to library of rules/models for all future genomes
![Page 15: Automated Prokaryotic Annotation at JCVI](https://reader034.vdocuments.site/reader034/viewer/2022042614/554e9065b4c905fc368b4bd3/html5/thumbnails/15.jpg)
[Figure: diagram with the labels Subject Genome; Trusted, Complete, Automatic; Proper Realm of Annotator Attention; RULES; BrainGrab; NEW; genome, genome; share; validate; IMPORT]
![Page 16: Automated Prokaryotic Annotation at JCVI](https://reader034.vdocuments.site/reader034/viewer/2022042614/554e9065b4c905fc368b4bd3/html5/thumbnails/16.jpg)
A Teachable Moment
EcHS_A1984 is manually annotated confidently because it is similar enough to:
SP|P07363|CHEA_ECOLI Chemotaxis protein cheA EC 2.7.13.3
(method: defines “similar enough”)
BLASTP_MATCH [SP|P07363, 1600, 95, 92, 60, 1]
Must be the only protein in genome that scores >= 1600 by blastp, covering >= 95 % of the length of the characterized protein and >= 92 % of the target protein, with >= 60 % sequence identity.
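A captured rule such as BLASTP_MATCH [SP|P07363, 1600, 95, 92, 60, 1] could be evaluated along the following lines. This is a sketch under stated assumptions: the class and field names are hypothetical, the final field `1` is read as "at most one protein in the genome may pass", and the hit records stand in for parsed blastp output.

```python
from dataclasses import dataclass


@dataclass
class BlastpMatchRule:
    accession: str         # characterized protein, e.g. "SP|P07363"
    min_score: float       # blastp score threshold
    min_char_cov: float    # % of the characterized protein's length covered
    min_target_cov: float  # % of the target (genome) protein covered
    min_identity: float    # % sequence identity
    max_hits: int          # assumed meaning of the rule's last field


@dataclass
class BlastHit:
    target_id: str
    accession: str
    score: float
    char_cov: float
    target_cov: float
    identity: float


def apply_rule(rule, hits):
    """Return the target proteins the rule annotates, or [] if the rule
    does not fire (no passing protein, or more than rule.max_hits)."""
    passing = {
        h.target_id
        for h in hits
        if h.accession == rule.accession
        and h.score >= rule.min_score
        and h.char_cov >= rule.min_char_cov
        and h.target_cov >= rule.min_target_cov
        and h.identity >= rule.min_identity
    }
    return sorted(passing) if 0 < len(passing) <= rule.max_hits else []
```

The uniqueness check matters: if two proteins in a genome both pass the thresholds, the rule declines to name either and the case goes back to an annotator.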
![Page 17: Automated Prokaryotic Annotation at JCVI](https://reader034.vdocuments.site/reader034/viewer/2022042614/554e9065b4c905fc368b4bd3/html5/thumbnails/17.jpg)
A sample of expert opinion: “For This Particular Protein Family”
I (D.H.H.) assert that any > 75 %-identical, full-length match is the same protein.
Ditto any > 65 % match, as long as the region is clearly syntenic.
Ditto any single-copy > 50 % match, as long as it fills this hole in this otherwise mostly complete pathway.
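The three assertions read naturally as a tiered rule, where weaker sequence identity must be compensated by stronger contextual evidence. The function below is an illustrative sketch; its name and parameters are hypothetical, and it assumes the full-length requirement applies only to the first tier, as literally stated.

```python
def same_protein_reason(identity, full_length=False, syntenic=False,
                        single_copy=False, fills_pathway_hole=False):
    """Return the tier that justifies calling a match 'the same protein',
    or None if no tier applies. identity is percent identity (0-100)."""
    if identity > 75 and full_length:
        return "identity > 75% and full-length"
    if identity > 65 and syntenic:
        return "identity > 65% and syntenic"
    if identity > 50 and single_copy and fills_pathway_hole:
        return "identity > 50%, single-copy, fills pathway hole"
    return None
```

Returning the reason rather than a bare boolean keeps the annotator's logic attached to each automated call, which is the point of capturing it as a rule.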
![Page 18: Automated Prokaryotic Annotation at JCVI](https://reader034.vdocuments.site/reader034/viewer/2022042614/554e9065b4c905fc368b4bd3/html5/thumbnails/18.jpg)
B “Bag of Genes”
G Genome Properties
E Evidence to drive other programs
Image from Gödel, Escher, Bach: an Eternal Golden Braid by Douglas Hofstadter, 1979
![Page 19: Automated Prokaryotic Annotation at JCVI](https://reader034.vdocuments.site/reader034/viewer/2022042614/554e9065b4c905fc368b4bd3/html5/thumbnails/19.jpg)
Genome Properties: annotation at the level of systems
- pathway (glyoxylate shunt)
- system (type III secretion)
- structure (outer membrane)
- genometrics (GC content)
- phenotype (motility, pathogenesis)
[Figure: property states: YES, some evidence, not supported, NO]
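A property state can be computed from HMM results roughly as follows. This is a simplified sketch, not the Genome Properties implementation: it assumes each required step lists the HMMs that can satisfy it, it covers only the three states that can be inferred from hits, and the model identifiers in the usage note are made up.

```python
def evaluate_property(required_steps, hmm_hits):
    """Assign a state to one genome property.

    required_steps: dict mapping step name -> set of HMM accessions,
                    any one of which satisfies that step.
    hmm_hits:       set of HMM accessions with a hit in the genome.
    """
    if not required_steps:
        raise ValueError("a property needs at least one required step")
    found = [step for step, models in required_steps.items() if models & hmm_hits]
    if len(found) == len(required_steps):
        return "YES"
    if found:
        return "some evidence"
    return "not supported"
```

For the glyoxylate shunt, the steps might be `{"isocitrate lyase": {"HMM_ICL"}, "malate synthase": {"HMM_MS_A", "HMM_MS_G"}}`; a genome with hits to `HMM_ICL` and `HMM_MS_G` evaluates to "YES", while a genome with only `HMM_ICL` evaluates to "some evidence". A firm "NO" presumably needs more than the absence of hits, so it is left out of this sketch.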
![Page 20: Automated Prokaryotic Annotation at JCVI](https://reader034.vdocuments.site/reader034/viewer/2022042614/554e9065b4c905fc368b4bd3/html5/thumbnails/20.jpg)
Some Novel Genome Properties
12 subtypes of CRISPR/Cas system
PEP-CTERM / exo-sortase: Biofilm-associated protein sorting
Type VI secretion (53 loci in B. mallei 23344)
Post-translational selenium-modified enzymes
Heterocycle-containing bacterial toxin production: BA_2677 = “heterocyclo-anthracin”
![Page 21: Automated Prokaryotic Annotation at JCVI](https://reader034.vdocuments.site/reader034/viewer/2022042614/554e9065b4c905fc368b4bd3/html5/thumbnails/21.jpg)
A family of variable putative toxins with patterns of CGG insertions.
![Page 22: Automated Prokaryotic Annotation at JCVI](https://reader034.vdocuments.site/reader034/viewer/2022042614/554e9065b4c905fc368b4bd3/html5/thumbnails/22.jpg)
Future Annotation Pipeline Enhancements
• Populate the Characterized Protein Database
• Develop META-RULES from CHAR
• BrainGrab for novel content
• Import additional computable evidence
• Improve exchanges of validation sets
• Build a protein names ontology
![Page 23: Automated Prokaryotic Annotation at JCVI](https://reader034.vdocuments.site/reader034/viewer/2022042614/554e9065b4c905fc368b4bd3/html5/thumbnails/23.jpg)
Acknowledgements
Ramana Madupu
Jeremy Selengut
Alex Richter
JCVI microbial annotation team