prokka - rapid bacterial genome annotation - abphm 2013

Rapid automatic microbial genome annotation

using Prokka

Dr Torsten Seemann

Applied Bioinformatics and Public Health Microbiology - Wed 15 May 2013 @ 1530h - Moller Centre, Cambridge, UK

Background

We come from a land down-under

The team

● Simon Gladmano VelvetOptimiser author, presenting Galaxy poster

● Paul Harrisono author of Nesoni toolkit

● David Powello author of VAGUE, software wizard, theoretician

● Dieter Bulacho sequence magician, closes genomes at will

... and we are recruiting.

History

2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014

?

?

?

?

?

?

?

?

Introduction

De novo assembly

Align reads to a reference

Process

De novo assembly

Ideally, one sequence per replicon.

Millions of short sequences

(reads)

A few long sequences

(contigs)

Reconstruct the original genome sequence from the sequence reads only

Annotation

Adding biological information to sequences.

ACCGGCCGAGACAGCGAGCATATGCAGGAAGCGGCAGGAATAAGGAAAAGCAGCCTCCTGACTTTCCTCGCTTGGTGGTTTGAGTGGACCTCCCAGGCCAGTGCCGGGCCCCTCATAGGAGAGGAAGCTCGGGAGGTGGCCAGGCGGCAGGAAGGCGCACCCCCCCAGCAATCCGCGCGCCGGGACAGAATGCCCTGCAGGAACTTCTTCTAGAAGACCTTCTCCTCCTGCAAATAAAACCTCACCCATGAATGCTCACGCAAGTTTAATTACAGACCTGAAACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGTCCGTCCGTGGGCCACGGCCACCGCTTTTTTTTTTGCC

delta toxinPubMed: 15353161

ribosome binding site

transfer RNALeu-(UUR)

tandem repeatCCGT x 3

homopolymer10 x T

What's in an annotation?

● Locationo which sequence? chromosome 2o where on the sequence? 100..659o what strand? -ve

● Feature typeo what is it? protein

coding gene

● Attributeso protein product? alcohol

dehydrogenaseo enzyme code? EC:1.1.1.1o subcellular location? cytoplasmo note?

essential for ABPHM

Bacterial feature types

● protein coding geneso promoter (-10, -35)o ribosome binding site (RBS)o coding sequence (CDS)

signal peptide, protein domains, structureo terminator

● non coding geneso transfer RNA (tRNA)o ribosomal RNA (rRNA)o non-coding RNA (ncRNA)

● othero repeat patterns, operons, origin of replication, ...

Automatic annotation

Key bacterial features

● tRNAo easy to find and annotate: anti-codon

● rRNAo easy to find and annotate: 5s 16s 23s

● CDSo straightforward to find candidates

false positives are often small ORFs wrong start codon

o partial genes, remnantso pseudogenes o assigning function is the bulk of the workload

Automatic annotation

Two strategies for identifying coding genes:

● sequence alignment o find known protein sequences in the contigs

transfer the annotation acrosso will miss proteins not in your databaseo may miss partial proteins

● ab initio gene findingo find candidate open reading frames

build model of ribosome binding sites predict coding regions

o may choose the incorrect start codono may miss atypical genes, overpredict small genes

Some good existing tools

Software ab initio

align-ment Availability Speed

RAST yes yes web only 12-24 hours

xBASE yes no web only >8 hours

BG7 no yes standalone >10 hours

PGAAP(NCBI) yes yes email / we >1 month

Why another tool?

● Convenienceo I have sequence, just tell me what's in it, please.

● Speedo exploit multi-core computers (aim < 15min)

● Standards complianto GFF3/GBK for viewing, TBL/FSA for Genbank sub.

● Rich consistent trustworthy outputo /product /gene /EC_number

● Provenanceo a record of where/how/why is was annotated so

Why "Prokka" ?

● Unique in Google

● I like the letter "k"

● Easy to type

● It sounds Aussie

● Loosely fits "Prokaryotic Annotation"

● It rhymes with "Quokka" o Australian cat-sized nocturnal marsupial herbivoreo first Aussie mammal seen by Europeans - "giant rat"

Prokka pipeline (simplified)

tRNA

rRNA

ncRNA

CDS

FASTAcontigs

Infernal

RNAmmer

Prodigal SignalP

Aragorn

sig_peptide

protein domains

HMMER3

protein annotation

BLAST+

Rfam

Swiss Pfam TIGRUser

GFF3GBKASN1

OUT

What can you trust?

Predicting protein function

Sequence similarity is a proxy for homology

● Sequence based (alignment)o tools: BLAST, BLAT, FASTA, Exonerateo databases: RefSeq, Uniprot, ...

● Model based ("fuzzy sequence" matching)o PSSM: position specific scoring matrix

tools: RPS-BLAST, Psi-BLAST databases: CDD, COG, Smart

o HMM: hidden Markov models tools: HMMER, HHblits databases: Pfam, TIGRfams

Sequence databases

I'll just BLAST against the non-redundant database. -- Anonymous

● Which one?o nucleotide (nt) or protein (nr)

● It's actually quite redundanto only eliminates exact matching sequences

● It's not pickyo nearly anything is admitted, garbage in garbage out

● It's too bigo searching takes too long

Hierarchical searching

● Factso searching against smaller databases is fastero searching against similar sequences is faster

● Ideao start with small set of close proteinso advance to larger sets of more distant proteins

● Prokkao your own custom "trusted" set (optional)o core bacterial proteome (default)o genus specific proteome (optional)o whole protein HMMs: PRK clusters, TIGRfamso protein domain HMMs: Pfam

Core bacterial proteome

● Many bacterial proteins are conservedo experimentally validatedo small number of themo good annotations

● Prokka provides this databaseo derived from UniProt-Swissproto only bacterial proteinso only accept evidence level 1 (aa) or 2 (RNA) o reject "Fragment" entrieso extract /gene /EC_number /product /db_xref

● First step gets ~50% of the geneso BLAST+ blastp, multi-threading to use all CPUs

The remainder

● Prokka has genus specific databaseso aim to capture "genus specific" naming conventionso derived from proteins in completed genomeso proteins are clustered and majority annotation winso some annotations are rubbish though

● Custom model databaseso I took COG/PRK MSAs and made HMMs

● Existing model databaseso Pfam, TIGRfams are well curated

● And if all else failso we always have our friend "hypothetical protein"

Provenance

Provenance

Recording where an annotation came from.

Prokka uses Genbank "evidence qualifier" tags:

Wet lab/experiment="EXISTENCE:Northern blot"

Dry lab/inference="similar to DNA sequence:INSD:AACN010222672.1"/inference="profile:tRNAscan:2.1"/inference="protein motif:InterPro:IPR001900"/inference="ab initio prediction:Glimmer:3.0"

Example from Prokka

Feature Type:

tRNA

Location:contig000341 @ 655..730 +

Attributes:

/gene="tRNA-Leu(UUR)"

/anticodon=(pos:678..680,aa:Leu)

/product="transfer RNA-Leu(UUR)"

/inference="profile:Aragorn:1.2"

Software quality

Software goals

● Follow basic conventions o "prokka" should say something helpfulo "prokka -h" or "prokka --help" should show help

● All options should be optionalo "prokka contigs.fa" should do something useful

● Fail gracefullyo check your dependencies existo produce useful error messageso generate a log file (provenance!)

● Use standard input and output file formatso or at least tab-separated values if you insist...

Prokka in context

● Prokka is not o particularly originalo technically or algorithmically significanto foolproof to install some dependencieso for everyone

BUT

● Prokka o is an ongoing project which will only improve :-)o checks it will run properly before wasting your timeo does what it claims, and does it quicklyo is being used widely

Conclusions

Prokka in the wild

● Pathogen Informatics @ Sanger UKo Andrew Pageo 50,000 draft genomes in 2 weeks (24 sec each!)

● Austin Hospital @ Melbourne AUo Ben Howden - Dept Infectious Diseaseso assembly & annotation of MiSeq clinical isolates

● VBC @ Monash AUo assemble & annotate all of SRA

● Many moreo Public Health Agency of Canadao Some of you next week hopefully!

Planned features

● Modularityo alternate sub-tools eg. Aragorn vs tRNAscan-SEo every sub-system should be optionalo facilitate entry into Galaxy Toolshed

● Better support foro metagenome assemblies, viruses and archaeao broken genes, pseudogenes, assembly breakpoints

● Fastero smaller core databaseso better parallelisation and less disk i/o

● Prokka-Webo web server version currently in beta testing

Acknowledgements

● Organiserso Conference committee - for inviting meo Wellcome Trust - Laura Hubbard

● Original Prokka testerso Simon Gladman & Dieter Bulach (internal)o Tim Stinear & Scott Chandry (external)

● Fundingo VLSCI / LSCC o Monash University

● Familyo Naomi, Oskar, Zoe - for tolerating my absences

Contact

Email [email protected]

Twitter @torstenseemann

Blog TheGenomeFactory.blogspot.com

Web www.bioinformatics.net.au

Thank you.

prokka - rapid bacterial genome annotation - abphm 2013

Science

sequence alignment

locationwhich sequence

original genome sequence

coding regionsmay

small genes

atypical genes

annotation acrosswill

protein domains