Web Apollo tutorial for the i5K copepod research community

UNIVERSITY OF

CALIFORNIA

An introduction to Web Apollo. A webinar for the Eurytemora affinis research community.

Monica Munoz-Torres, PhD | @monimunozto

Berkeley Bioinformatics Open-Source Projects (BBOP)

Genomics Division, Lawrence Berkeley National Laboratory

29 August, 2014


Outline

1. What is Web Apollo?
   • Definition & working concept.
2. Our experience with community-based curation.
3. The manual annotation process.
4. Becoming acquainted with Web Apollo.
5. Example.


During this webinar you will:

• Learn to identify homologs of known genes of interest in your newly sequenced genome.
• Become familiar with the environment and functionality of the Web Apollo genome annotation editing tool.


What is Web Apollo?

• Web Apollo is a web-based, collaborative genomic annotation editing platform.

We need annotation editing tools to modify and refine the precise location and structure of the genome elements that predictive algorithms cannot yet resolve automatically.


Find more about Web Apollo at http://GenomeArchitect.org and in Genome Biol 14:R93 (2013).


Brief history of Apollo*:

a. Desktop: one person at a time edited a specific region, and annotations were saved in local files; this slowed down collaboration.

b. Java Web Start: users saved annotations directly to a centralized database; potential issues with stale annotation data remained.


* Biologists could finally visualize computational analyses and experimental evidence from genomic features and build manually curated consensus gene structures. Apollo became a very popular, open-source tool (insects, fish, mammals, birds, etc.).

Web Apollo

• Browser-based tool integrated with JBrowse.
• Two new tracks: “Annotation” and “DNA Sequence”.
• Allows for intuitive annotation creation and editing, with gestures and pull-down menus to create and modify transcript and exon structures, insert comments (CV, freeform text), etc.
• Customizable look & feel.
• Edits in one client are instantly pushed to all other clients: Collaborative!


Working Concept

In the context of manual gene annotation, curation tries to find the best examples and/or eliminate most errors.

To conduct manual annotation efforts:

Gather and evaluate all available evidence using quality-control metrics to corroborate or modify automated annotation predictions.

Perform sequence similarity searches (phylogenetic framework) and use literature and public databases to:
• Predict functional assignments from experimental data.
• Distinguish orthologs from paralogs, and classify gene membership in families and networks.


Dispersed, community-based gene manual annotation efforts.

Automated gene models
Evidence: cDNAs, HMM domain searches, alignments with assemblies or genes from other species.
Manual annotation & curation

We continuously train and support hundreds of geographically dispersed scientists from many research communities to perform biologically supported manual annotations using Web Apollo.

– Gate keepers and monitoring.
– Written tutorials.
– Training workshops and geneborees.
– Personalized user support.


What we have learned.

By harvesting expertise from dispersed researchers who assigned functions to predicted and curated peptides, we have developed more interactive and responsive tools, as well as better visualization, editing, and analysis capabilities.


http://people.csail.mit.edu/fredo/PUBLI/Drawing/

Collaborative Efforts Improved Automated Annotations

In many cases, automated annotations have been improved (e.g. Apis mellifera; Elsik et al., BMC Genomics 2014, 15:86).

We also learned of the challenges of newer sequencing technologies, e.g.:
– Frameshifts and indel errors
– Split genes across scaffolds
– Highly repetitive sequences

To face these challenges, we train annotators in recovering coding sequences in agreement with all available biological evidence.


It is helpful to work together.

Scientific community efforts bring together domain-specific and natural history expertise that would otherwise remain disconnected.

Breaking down large amounts of data into manageable portions and mobilizing groups of researchers to extract the most accurate representation of the biology from all available data distills invaluable knowledge from genome analysis.


Understanding the evolution of sociality

Comparing the genomes of 7 species of ants contributed to a better understanding of the evolution and organization of insect societies at the molecular level.

Insights drawn mainly from six core aspects of ant biology:
1. Alternative morphological castes
2. Division of labor
3. Chemical communication
4. Alternative social organization
5. Social immunity
6. Mutualism

Libbrecht et al. Genome Biology 2013, 14:212


Atta cephalotes (above) and Harpegnathos saltator.

©alexanderwild.com

Groups of communities continue to guide our efforts.

A little training goes a long way!

With the right tools, wet lab scientists make exceptional curators who can easily learn to maximize the generation of accurate, biologically supported gene models.


Manual Annotation

How do we get there?

In a genome sequencing project…

[Diagram: Assembly → Automated annotation → Manual annotation → Experimental validation]

Gene Prediction

Identification of protein-coding genes, tRNAs, rRNAs, regulatory motifs, repetitive elements (masked), etc.

- Ab initio (DNA composition): Augustus, GENSCAN, geneid, fgenesh
- Homology-based: e.g. SGP2, fgenesh++

Nucleic Acids Res. 2003, vol. 31, no. 13, 3738-3741


Gene Annotation

Integration of data from prediction tools to generate a consensus set of predictions or gene models.

• Models may be organized using:
  - automatic integration of predicted sets, e.g. GLEAN
  - packaging necessary tools into a pipeline, e.g. MAKER
• All available biological evidence (e.g. transcriptomes) further informs the annotation process.

In some cases, algorithms and metrics used to generate consensus sets may actually reduce the accuracy of the gene’s representation; in such cases it is usually better to use an ab initio model to create a new annotation.

Manual Genome Annotation

• Identifies elements that best represent the underlying biology.
• Eliminates elements that reflect the systemic errors of automated genome analyses.
• Determines functional roles through comparative analysis of well-studied, phylogenetically similar genome elements using literature, databases, and the researcher’s experience.


The Curation Process is Necessary

1. A computationally predicted consensus gene set is generated using multiple lines of evidence.
2. Manual annotation takes place.
3. Ideally, consensus computational predictions will be integrated with manual annotations to produce an updated Official Gene Set (OGS).

Otherwise, “incorrect and incomplete genome annotations will poison every experiment that uses them”.
- M. Yandell.


The Collaborative Curation Process at i5K

1) A computationally predicted consensus gene set has been generated using multiple lines of evidence; e.g. Consensus Gene EAFF_v0.5.3-Models.

2) i5K Projects will integrate consensus computational predictions with manual annotations to produce an updated Official Gene Set (OGS):
» If it’s not on either track, it won’t make the OGS!
» If it’s there and it shouldn’t be, it will still make the OGS!


Consensus set: reference and start point

• In some cases, algorithms and metrics used to generate consensus sets may actually reduce the accuracy of the gene’s representation; e.g. use the Augustus model instead to create a new annotation.
• Isoforms: drag the original and the alternatively spliced form to the ‘User-created Annotations’ area.
• If an annotation needs to be removed from the consensus set, drag it to the ‘User-created Annotations’ area and label it as ‘Delete’ in the Information Editor.
• Overlapping interests? Collaborate to reach agreement.
• Follow guidelines for i5K Pilot Species Projects as shown at http://goo.gl/LRu1VY
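The rules above (both tracks feed the OGS, and a model labeled ‘Delete’ is dropped) can be pictured with a small sketch. This is only an illustration under stated assumptions, not the procedure i5K actually runs: the `Model` class, the `build_ogs` function, and the choice to treat any coordinate overlap on the same scaffold as "this consensus model was curated" are all hypothetical, and strand is ignored for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    """A gene model on one scaffold (hypothetical structure, half-open coordinates)."""
    name: str
    scaffold: str
    start: int
    end: int
    status: str = ""  # "Delete" marks a model the curator wants removed


def overlaps(a, b):
    """True if two models share at least one base on the same scaffold."""
    return a.scaffold == b.scaffold and a.start < b.end and b.start < a.end


def build_ogs(consensus, manual):
    """Combine manual annotations with untouched consensus models.

    Assumption: a consensus model overlapped by any manual annotation
    (including one labeled "Delete") is considered superseded.
    """
    # Manual annotations are kept unless the curator labeled them "Delete".
    kept = [m for m in manual if m.status != "Delete"]
    # Consensus models nobody touched pass through unchanged.
    for c in consensus:
        if not any(overlaps(c, m) for m in manual):
            kept.append(c)
    return sorted(kept, key=lambda m: (m.scaffold, m.start))


consensus = [
    Model("A", "scaffold1", 0, 100),
    Model("B", "scaffold1", 200, 300),
    Model("C", "scaffold1", 400, 500),
]
manual = [
    Model("A_curated", "scaffold1", 10, 120),
    Model("B", "scaffold1", 200, 300, status="Delete"),
]
ogs = build_ogs(consensus, manual)
```

In this toy run, `A` is replaced by its curated version, `B` is removed, and `C` goes into the gene set untouched, which is why an incorrect consensus model that nobody reviews still ends up in the OGS.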


Web Apollo: The Sequence Selection Window

Web Apollo: Graphical User Interface (GUI) for editing annotations

• Navigation tools: pan and zoom.
• Search box: go to a scaffold or a gene model.
• Grey bar of coordinates: indicates location. You can also select here in order to zoom to a sub-region.
• ‘View’: change color by CDS, toggle strands, set highlight.
• ‘File’: upload your own evidence (GFF3, BAM, BigWig, VCF*); add combination and sequence search tracks.
• ‘Tools’: use BLAT to query the genome with a protein or DNA sequence.
• Available Tracks
• Evidence Tracks Area
• ‘User-created Annotations’ Track
• Login


• Flags non-canonical splice sites.
• Selection of features and sub-features.
• Edge-matching.
• Evidence Tracks Area.


The editing logic in the server:
• selects the longest ORF as the CDS
• flags non-canonical splice sites
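To make these two checks concrete, here is a minimal sketch of what "longest ORF" and "non-canonical splice site" mean. It is not Web Apollo's server code: the function names are hypothetical, only the forward strand is handled, an ORF is assumed to run from an ATG to the first in-frame stop codon, and "canonical" is assumed to mean a GT donor and an AG acceptor.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(mrna):
    """Return (start, end) of the longest ATG..stop ORF, end exclusive, or None.

    Only complete ORFs (with an in-frame stop codon) are considered.
    """
    seq = mrna.upper()
    best = None
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG opens a candidate ORF in this frame
            elif start is not None and codon in STOP_CODONS:
                if best is None or (i + 3 - start) > (best[1] - best[0]):
                    best = (start, i + 3)
                start = None  # look for the next ORF in the same frame
    return best


def noncanonical_splice_sites(genome, exons):
    """Flag introns whose boundaries are not GT...AG.

    `exons` is a sorted list of (start, end) pairs, 0-based and half-open,
    on the forward strand of `genome`.
    """
    g = genome.upper()
    flagged = []
    for (_, donor_pos), (acceptor_pos, _) in zip(exons, exons[1:]):
        donor = g[donor_pos:donor_pos + 2]            # first two intron bases
        acceptor = g[acceptor_pos - 2:acceptor_pos]   # last two intron bases
        if donor != "GT":
            flagged.append(("donor", donor_pos, donor))
        if acceptor != "AG":
            flagged.append(("acceptor", acceptor_pos - 2, acceptor))
    return flagged
```

For example, `longest_orf("AAATGAAACCCTAGTT")` finds the ORF spanning positions 2 to 14, and an intron that starts with GC instead of GT is reported as a non-canonical donor site.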

DNA Track

There are two new kinds of tracks, for:
• annotation editing
• sequence alteration editing

Annotations, annotation edits, and History are stored in a centralized database.

The Information Editor

• DBXRefs
• PubMed IDs
• GO terms
• Comments

Additional Functionality

In addition to the protein-coding gene annotation that you know and love:

• Non-coding genes: ncRNAs, miRNAs, repeat regions, and TEs
• Sequence alterations (less coverage = more fragmentation)
• Visualization of stage and cell-type specific transcription data as coverage plots, heat maps, and alignments


How to begin curating

To find the gene region you wish to annotate, you may use:

a) a protein sequence from another species;
b) a sequence from a similar gene;
c) your own alignments of gene models or transcriptomic data to the genome;
d) high-quality proteins and/or gene family alignments (multi- or single-species) in which you can identify conserved domains.

Option 1 – You have a sequence but don’t know where it is in this genome:
• Use BLAT in the Web Apollo window, or BLAST at NAL’s i5k BLAST server, available at: http://i5k.nal.usda.gov/blastn
• Alternatively, use any other tool; for example Geneious.

Option 2 – The genome has already been annotated with your sequences and you have a gene identifier that has been indexed in Web Apollo.
• That is, you know where to look, so type the ID in the Search box of Web Apollo.
• Web Apollo autocompletes using a case-insensitive search anchored on the left-hand side of the word. For example “HaGR” will show all “hagr” objects (up to 30).
• Choose one of the genes and click “Go”.
• You can do that with Domains, Alignments or Gene names provided to you (if they have been indexed).

Option 3 – Find genes based on functional ontology terms or network membership identifiers.
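The autocomplete behaviour described under Option 2 (case-insensitive, anchored on the left, at most 30 hits) amounts to a prefix match. A minimal sketch follows; the `autocomplete` function and the example names are hypothetical, and this is not how Web Apollo indexes names internally.

```python
def autocomplete(query, names, limit=30):
    """Return up to `limit` names that start with `query`, ignoring case."""
    q = query.lower()
    return [name for name in names if name.lower().startswith(q)][:limit]


# Hypothetical indexed names, for illustration only.
indexed = ["HaGR1", "hagr2", "OBP_HaGR", "HAGR10"]
hits = autocomplete("HaGR", indexed)
```

Here `OBP_HaGR` is not returned because the match is anchored at the start of the name, so it helps to know how an identifier begins before searching.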


General Process of Curation

1. Select the chromosomal region of interest, e.g. scaffold.
2. Select appropriate evidence tracks.
3. Determine whether a feature in an existing evidence track will provide a reasonable gene model to start working.
   - If yes: select and drag the feature to the ‘User-created Annotations’ area, creating an initial gene model. If necessary, use editing functions to adjust the gene model.
   - Nothing available to you? Let’s have a talk.
4. Check your edited gene model for integrity and accuracy by comparing it with available homologs.

Always remember: when annotating gene models using Web Apollo, you are looking at a ‘frozen’ version of the genome assembly and you will not be able to modify the assembly itself.

Example

Introductory demonstration using the Apis mellifera genome.

Q&A session using the Eurytemora affinis genome at

https://apollo.nal.usda.gov/euraff/selectTrack.jsp


A public Honey Bee Web Apollo Demo is available at

http://genomearchitect.org/WebApolloDemo

What do we know for this species?

• What data are currently available?
• At NCBI:
  • 5,570 nucleotide sequences (scaffolds)
  • 446 amino acid sequences (CO-I)
  • 0 conserved domains identified
  • 0 “gene” entries submitted


PubMed Search: what’s new?


Empirical examples of beneficial reversal of dominance:
• Warfarin resistance: mutation of VKORC1 is associated with increased dietary requirement for vitamin K.

How many sequences for your gene of interest?


And what do we know about it?

• VKORC1 – vitamin K epoxide reductase complex, subunit 1.
• MF: quinone binding (IEA, GO:0048038), vitamin K epoxide reductase activity (IDA, GO:0047057).
• BP: blood coagulation (IMP, GO:0007596), bone development (ISS, GO:0060348).
• CC: endoplasmic reticulum membrane (TAS, GO:0005789), integral component of membrane (IEA, GO:0016021).

BLAST at i5K: https://i5k.nal.usda.gov/blast


To Web Apollo

BLAST at i5K: hsps in “BLAST+ results” track


Available Tracks


Creating a new gene model: drag and drop


• Web Apollo automatically calculates the longest open reading frame (ORF). In this case, the ORF includes the HSP.

Get Sequence


http://blast.ncbi.nlm.nih.gov/Blast.cgi

Flanking sequences (other gene models) vs. NCBI nr


At 3’ end

At 5’ end

Additional evidence in support of split


Editing: split


Finished model


Information Editor

• DBXRefs: NP_076869.1, H. sapiens, RefSeq

• PubMed identifier: PMID:24337963

• Gene Ontology IDs: GO:0048038, GO:0047057, GO:0007596, GO:0060348, GO:0005789, GO:0016021.

• Comments.

• Name, Symbol.

• Approve / Delete radio button.
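Before pasting identifiers into the Information Editor, it can help to check that they are well formed. The sketch below is a hypothetical helper, not part of Web Apollo: it assumes GO IDs are "GO:" plus seven digits, PubMed IDs are written as "PMID:" plus digits, and it uses a simplified pattern for RefSeq protein accessions that covers only the NP_, XP_ and YP_ prefixes.

```python
import re

# Simplified patterns; the RefSeq one is an assumption covering common prefixes only.
PATTERNS = {
    "GO": re.compile(r"GO:\d{7}"),
    "PMID": re.compile(r"PMID:\d+"),
    "RefSeq protein": re.compile(r"[NXY]P_\d+\.\d+"),
}


def classify_identifier(text):
    """Return the kind of identifier that matches the whole string, or None."""
    candidate = text.strip()
    for kind, pattern in PATTERNS.items():
        if pattern.fullmatch(candidate):
            return kind
    return None


# Identifiers taken from the example on this slide.
for identifier in ["NP_076869.1", "PMID:24337963", "GO:0048038", "GO:48038"]:
    print(identifier, "->", classify_identifier(identifier))
```

A GO ID with missing leading zeros, such as the last one, is reported as unrecognized, which is a common copy-and-paste error worth catching early.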


Comments

Arthropodcentric Thanks!

AgriPest Base

FlyBase

Hymenoptera Genome Database

VectorBase

Acromyrmex echinatior

Acyrthosiphon pisum

Apis mellifera

Atta cephalotes

Bombus terrestris

Camponotus floridanus

Helicoverpa armigera

Linepithema humile

Manduca sexta

Mayetiola destructor

Nasonia vitripennis

Pogonomyrmex barbatus

Solenopsis invicta

Tribolium castaneum

… and you!

Thanks!

• Berkeley Bioinformatics Open-source Projects (BBOP), Berkeley Lab: Web Apollo and Gene Ontology teams. Suzanna E. Lewis (PI).
• Christine G. Elsik (PI). § University of Missouri.
• Ian Holmes (PI). University of California, Berkeley.
• Arthropod genomics community, i5K Steering Committee, Monica Poelchau at USDA/NAL, fringy Richards at HGSC-BCM, Alexie Papanicolaou at CSIRO, Oliver Niehuis at 1KITE http://www.1kite.org/, BGI, and the Honey Bee Genome Sequencing Consortium.
• Web Apollo is supported by NIH grants 5R01GM080203 from NIGMS, and 5R01HG004483 from NHGRI, and by the Director, Office of Science, Office of Basic Energy Sciences, of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.
• Insect images used with permission: http://AlexanderWild.com and O. Niehuis.
• For your attention, thank you!


Web Apollo: Suzanna Lewis, Gregg Helt, Colin Diesh§, Deepak Unni§

Gene Ontology: Chris Mungall, Seth Carbon, Heiko Dietze

Colleagues at BBOP

Web Apollo: http://GenomeArchitect.org

GO: http://GeneOntology.org

i5K: http://arthropodgenomes.org/wiki/i5K
