prokka - rapid bacterial genome annotation - abphm 2013
TRANSCRIPT
Rapid automatic microbial genome annotation
using Prokka
Dr Torsten Seemann
Applied Bioinformatics and Public Health Microbiology - Wed 15 May 2013 @ 1530h - Moller Centre, Cambridge, UK
Background
We come from a land down-under
The team
● Simon Gladmano VelvetOptimiser author, presenting Galaxy poster
● Paul Harrisono author of Nesoni toolkit
● David Powello author of VAGUE, software wizard, theoretician
● Dieter Bulacho sequence magician, closes genomes at will
... and we are recruiting.
History
2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014
?
?
?
?
?
?
?
?
Introduction
De novo assembly
Align reads to a reference
Process
De novo assembly
Ideally, one sequence per replicon.
Millions of short sequences
(reads)
A few long sequences
(contigs)
Reconstruct the original genome sequence from the sequence reads only
Annotation
Adding biological information to sequences.
ACCGGCCGAGACAGCGAGCATATGCAGGAAGCGGCAGGAATAAGGAAAAGCAGCCTCCTGACTTTCCTCGCTTGGTGGTTTGAGTGGACCTCCCAGGCCAGTGCCGGGCCCCTCATAGGAGAGGAAGCTCGGGAGGTGGCCAGGCGGCAGGAAGGCGCACCCCCCCAGCAATCCGCGCGCCGGGACAGAATGCCCTGCAGGAACTTCTTCTAGAAGACCTTCTCCTCCTGCAAATAAAACCTCACCCATGAATGCTCACGCAAGTTTAATTACAGACCTGAAACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGTCCGTCCGTGGGCCACGGCCACCGCTTTTTTTTTTGCC
delta toxinPubMed: 15353161
ribosome binding site
transfer RNALeu-(UUR)
tandem repeatCCGT x 3
homopolymer10 x T
What's in an annotation?
● Locationo which sequence? chromosome 2o where on the sequence? 100..659o what strand? -ve
● Feature typeo what is it? protein
coding gene
● Attributeso protein product? alcohol
dehydrogenaseo enzyme code? EC:1.1.1.1o subcellular location? cytoplasmo note?
essential for ABPHM
Bacterial feature types
● protein coding geneso promoter (-10, -35)o ribosome binding site (RBS)o coding sequence (CDS)
signal peptide, protein domains, structureo terminator
● non coding geneso transfer RNA (tRNA)o ribosomal RNA (rRNA)o non-coding RNA (ncRNA)
● othero repeat patterns, operons, origin of replication, ...
Automatic annotation
Key bacterial features
● tRNAo easy to find and annotate: anti-codon
● rRNAo easy to find and annotate: 5s 16s 23s
● CDSo straightforward to find candidates
false positives are often small ORFs wrong start codon
o partial genes, remnantso pseudogenes o assigning function is the bulk of the workload
Automatic annotation
Two strategies for identifying coding genes:
● sequence alignment o find known protein sequences in the contigs
transfer the annotation acrosso will miss proteins not in your databaseo may miss partial proteins
● ab initio gene findingo find candidate open reading frames
build model of ribosome binding sites predict coding regions
o may choose the incorrect start codono may miss atypical genes, overpredict small genes
Some good existing tools
Software ab initio
align-ment Availability Speed
RAST yes yes web only 12-24 hours
xBASE yes no web only >8 hours
BG7 no yes standalone >10 hours
PGAAP(NCBI) yes yes email / we >1 month
Why another tool?
● Convenienceo I have sequence, just tell me what's in it, please.
● Speedo exploit multi-core computers (aim < 15min)
● Standards complianto GFF3/GBK for viewing, TBL/FSA for Genbank sub.
● Rich consistent trustworthy outputo /product /gene /EC_number
● Provenanceo a record of where/how/why is was annotated so
Why "Prokka" ?
● Unique in Google
● I like the letter "k"
● Easy to type
● It sounds Aussie
● Loosely fits "Prokaryotic Annotation"
● It rhymes with "Quokka" o Australian cat-sized nocturnal marsupial herbivoreo first Aussie mammal seen by Europeans - "giant rat"
Prokka pipeline (simplified)
tRNA
rRNA
ncRNA
CDS
FASTAcontigs
Infernal
RNAmmer
Prodigal SignalP
Aragorn
sig_peptide
protein domains
HMMER3
protein annotation
BLAST+
Rfam
Swiss Pfam TIGRUser
GFF3GBKASN1
OUT
What can you trust?
Predicting protein function
Sequence similarity is a proxy for homology
● Sequence based (alignment)o tools: BLAST, BLAT, FASTA, Exonerateo databases: RefSeq, Uniprot, ...
● Model based ("fuzzy sequence" matching)o PSSM: position specific scoring matrix
tools: RPS-BLAST, Psi-BLAST databases: CDD, COG, Smart
o HMM: hidden Markov models tools: HMMER, HHblits databases: Pfam, TIGRfams
Sequence databases
I'll just BLAST against the non-redundant database. -- Anonymous
● Which one?o nucleotide (nt) or protein (nr)
● It's actually quite redundanto only eliminates exact matching sequences
● It's not pickyo nearly anything is admitted, garbage in garbage out
● It's too bigo searching takes too long
Hierarchical searching
● Factso searching against smaller databases is fastero searching against similar sequences is faster
● Ideao start with small set of close proteinso advance to larger sets of more distant proteins
● Prokkao your own custom "trusted" set (optional)o core bacterial proteome (default)o genus specific proteome (optional)o whole protein HMMs: PRK clusters, TIGRfamso protein domain HMMs: Pfam
Core bacterial proteome
● Many bacterial proteins are conservedo experimentally validatedo small number of themo good annotations
● Prokka provides this databaseo derived from UniProt-Swissproto only bacterial proteinso only accept evidence level 1 (aa) or 2 (RNA) o reject "Fragment" entrieso extract /gene /EC_number /product /db_xref
● First step gets ~50% of the geneso BLAST+ blastp, multi-threading to use all CPUs
The remainder
● Prokka has genus specific databaseso aim to capture "genus specific" naming conventionso derived from proteins in completed genomeso proteins are clustered and majority annotation winso some annotations are rubbish though
● Custom model databaseso I took COG/PRK MSAs and made HMMs
● Existing model databaseso Pfam, TIGRfams are well curated
● And if all else failso we always have our friend "hypothetical protein"
Provenance
Provenance
Recording where an annotation came from.
Prokka uses Genbank "evidence qualifier" tags:
Wet lab/experiment="EXISTENCE:Northern blot"
Dry lab/inference="similar to DNA sequence:INSD:AACN010222672.1"/inference="profile:tRNAscan:2.1"/inference="protein motif:InterPro:IPR001900"/inference="ab initio prediction:Glimmer:3.0"
Example from Prokka
Feature Type:
tRNA
Location:contig000341 @ 655..730 +
Attributes:
/gene="tRNA-Leu(UUR)"
/anticodon=(pos:678..680,aa:Leu)
/product="transfer RNA-Leu(UUR)"
/inference="profile:Aragorn:1.2"
Software quality
Software goals
● Follow basic conventions o "prokka" should say something helpfulo "prokka -h" or "prokka --help" should show help
● All options should be optionalo "prokka contigs.fa" should do something useful
● Fail gracefullyo check your dependencies existo produce useful error messageso generate a log file (provenance!)
● Use standard input and output file formatso or at least tab-separated values if you insist...
Prokka in context
● Prokka is not o particularly originalo technically or algorithmically significanto foolproof to install some dependencieso for everyone
BUT
● Prokka o is an ongoing project which will only improve :-)o checks it will run properly before wasting your timeo does what it claims, and does it quicklyo is being used widely
Conclusions
Prokka in the wild
● Pathogen Informatics @ Sanger UKo Andrew Pageo 50,000 draft genomes in 2 weeks (24 sec each!)
● Austin Hospital @ Melbourne AUo Ben Howden - Dept Infectious Diseaseso assembly & annotation of MiSeq clinical isolates
● VBC @ Monash AUo assemble & annotate all of SRA
● Many moreo Public Health Agency of Canadao Some of you next week hopefully!
Planned features
● Modularityo alternate sub-tools eg. Aragorn vs tRNAscan-SEo every sub-system should be optionalo facilitate entry into Galaxy Toolshed
● Better support foro metagenome assemblies, viruses and archaeao broken genes, pseudogenes, assembly breakpoints
● Fastero smaller core databaseso better parallelisation and less disk i/o
● Prokka-Webo web server version currently in beta testing
Acknowledgements
● Organiserso Conference committee - for inviting meo Wellcome Trust - Laura Hubbard
● Original Prokka testerso Simon Gladman & Dieter Bulach (internal)o Tim Stinear & Scott Chandry (external)
● Fundingo VLSCI / LSCC o Monash University
● Familyo Naomi, Oskar, Zoe - for tolerating my absences
Contact
Email [email protected]
Twitter @torstenseemann
Blog TheGenomeFactory.blogspot.com
Web www.bioinformatics.net.au
Thank you.