Web Apollo tutorial for the i5K copepod research community
DESCRIPTION
Introduction to Web Apollo for the i5K copepod research community. Web Apollo is a genome annotation editor; it provides a web-based environment that allows multiple distributed users to review, edit, and share manual annotations. This presentation includes information specific to the projects of the Global Initiative to sequence the genomes of 5,000 species of arthropods, i5K. Let's get started!
TRANSCRIPT
UNIVERSITY OF
CALIFORNIA
An introduction to Web Apollo. A webinar for the Eurytemora affinis research community.
Monica Munoz-Torres, PhD | @monimunozto
Berkeley Bioinformatics Open-Source Projects (BBOP)
Genomics Division, Lawrence Berkeley National Laboratory
29 August, 2014
Outline
1. What is Web Apollo?
• Definition & working concept.
2. Our Experience With Community
Based Curation.
3. The Manual Annotation Process.
4. Becoming acquainted with Web
Apollo.
5. Example.
During this webinar you will:
• Learn to identify homologs of known genes of interest
in your newly sequenced genome.
• Become familiar with the environment and
functionality of the Web Apollo genome annotation
editing tool.
What is Web Apollo?
• Web Apollo is a web-based, collaborative genomic
annotation editing platform.
We need annotation editing tools to modify and refine the precise location and structure of the genome elements that predictive algorithms cannot yet resolve automatically.
1. What is Web Apollo?
Find more about Web Apollo at
http://GenomeArchitect.org
and
Genome Biol 14:R93. (2013).
Brief history of Apollo*:
a. Desktop:
one person at a time editing a
specific region, annotations
saved in local files; slowed down
collaboration.
b. Java Web Start:
users saved annotations directly
to a centralized database;
potential issues with stale
annotation data remained.
* Biologists could finally visualize computational analyses and
experimental evidence from genomic features and build
manually-curated consensus gene structures. Apollo became a
very popular, open source tool (insects, fish, mammals, birds, etc.).
Web Apollo
• Browser-based tool integrated with JBrowse.
• Two new tracks: “Annotation” and “DNA Sequence”
• Allows for intuitive annotation creation and editing,
with gestures and pull-down menus to create and
modify transcript and exon
structures, insert comments
(CV, freeform text), etc.
• Customizable look & feel.
• Edits in one client are
instantly pushed to all other
clients: Collaborative!
Working
Concept
In the context of gene manual annotation,
curation tries to find the best examples
and/or eliminate most errors.
To conduct manual annotation efforts:
Gather and evaluate all available evidence
using quality-control metrics to
corroborate or modify automated
annotation predictions.
Perform sequence similarity searches
(phylogenetic framework) and use
literature and public databases to:
• Predict functional assignments from
experimental data.
• Distinguish orthologs from paralogs,
and classify gene membership in
families and networks.
2. In our experience.
Automated gene models
Evidence:
cDNAs, HMM domain searches,
alignments with assemblies or
genes from other species.
Manual annotation & curation
Dispersed, community-based gene
manual annotation efforts.
We continuously train and support
hundreds of geographically dispersed
scientists from many research
communities, to perform biologically
supported manual annotations using
Web Apollo.
– Gate keepers and monitoring.
– Written tutorials.
– Training workshops and geneborees.
– Personalized user support.
What we have learned.
By harvesting expertise from dispersed researchers who
assigned functions to predicted and curated peptides,
we have developed more interactive and
responsive tools, as well as better visualization,
editing, and analysis capabilities.
http://people.csail.mit.edu/fredo/PUBLI/Drawing/
Collaborative Efforts Improved
Automated Annotations
In many cases, automated annotations have been
improved (e.g. Apis mellifera; Elsik et al. BMC Genomics 2014, 15:86).
We have also learned of the challenges of newer sequencing
technologies, e.g.:
– Frameshifts and indel errors
– Split genes across scaffolds
– Highly repetitive sequences
To face these challenges, we train annotators in
recovering coding sequences in agreement with all
available biological evidence.
It is helpful to work together.
Scientific community efforts bring together domain-specific
and natural history expertise that would
otherwise remain disconnected.
Breaking down large amounts of data into
manageable portions and mobilizing groups
of researchers to extract the most accurate
representation of the biology from all
available data distills invaluable
knowledge from genome analysis.
Understanding the evolution of sociality
Comparing the genomes of 7 species of ants
contributed to a better understanding of the
evolution and organization of insect societies
at the molecular level.
Insights drawn mainly from six core aspects of
ant biology:
1. Alternative morphological castes
2. Division of labor
3. Chemical Communication
4. Alternative social organization
5. Social immunity
6. Mutualism
Libbrecht et al. Genome Biology 2013, 14:212
Atta cephalotes (above) and Harpegnathos saltator.
©alexanderwild.com
Groups of
communities
continue to guide
our efforts.
A little training goes a long way!
With the right tools, wet lab scientists make exceptional
curators who can easily learn to maximize the
generation of accurate, biologically supported gene
models.
Manual Annotation: How do we get there?
In a genome sequencing project…
Assembly
Automated annotation
Manual annotation
Experimental validation
3. How do we get there?
Gene Prediction
Identification of protein-coding genes, tRNAs, rRNAs,
regulatory motifs, repetitive elements (masked), etc.
- Ab initio (DNA composition): Augustus, GENSCAN,
geneid, fgenesh
- Homology-based: e.g. SGP2, fgenesh++
Nucleic Acids Res. 2003, 31(13):3738-3741
Gene Annotation
Integration of data from prediction tools to generate a
consensus set of predictions or gene models.
• Models may be organized using:
- automatic integration of predicted sets; e.g. GLEAN
- packaging necessary tools into a pipeline; e.g. MAKER
• All available biological evidence (e.g. transcriptomes) further
informs the annotation process.
In some cases algorithms and metrics used to generate
consensus sets may actually reduce the accuracy of the
gene’s representation; in such cases it is usually better to
use an ab initio model to create a new annotation.
Manual Genome Annotation
• Identifies elements that best represent the underlying
biology.
• Eliminates elements that reflect the systemic errors of
automated genome analyses.
• Determines functional roles through comparative
analysis of well-studied, phylogenetically similar
genome elements using literature, databases, and
the researcher’s experience.
The Curation Process is Necessary
1. A computationally predicted consensus gene set is
generated using multiple lines of evidence.
2. Manual annotation takes place.
3. Ideally consensus computational predictions will be
integrated with manual annotations to produce an
updated Official Gene Set (OGS).
Otherwise, “incorrect and incomplete genome annotations
will poison every experiment that uses them”.
- M. Yandell.
The Collaborative Curation Process at
i5K
1) A computationally predicted consensus gene set has
been generated using multiple lines of evidence; e.g.
Consensus Gene EAFF_v0.5.3-Models.
2) i5K Projects will integrate consensus computational
predictions with manual annotations to produce an updated
Official Gene Set (OGS):
» If it’s not on either track, it won’t make the OGS!
» If it’s there and it shouldn’t be, it will still make the OGS!
Consensus set: reference and start point
• In some cases algorithms and metrics used to generate
consensus sets may actually reduce the accuracy of the gene’s
representation; in such cases, use another model instead (e.g. the Augustus model) to create a new
annotation.
• Isoforms: drag the original and the alternatively spliced form to the ‘User-created
Annotations’ area.
• If an annotation needs to be removed from the consensus set,
drag it to the ‘User-created Annotations’ area and label it as
‘Delete’ in the Information Editor.
• Overlapping interests? Collaborate to reach agreement.
• Follow guidelines for i5K Pilot Species Projects as shown at
http://goo.gl/LRu1VY
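The rules above can be illustrated with a small Python sketch that merges a consensus set with user-created annotations. This is only an illustration under stated assumptions, not the i5K integration pipeline: the `Model` record, its `status` field, and the rule that a curated model replaces any consensus model it overlaps on the same strand are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Model:
    name: str
    scaffold: str
    start: int        # 1-based, inclusive
    end: int          # 1-based, inclusive
    strand: str       # "+" or "-"
    status: str = ""  # hypothetical field; "Delete" marks a model to remove

def overlaps(a, b):
    """True if two models share at least one base on the same scaffold and strand."""
    return (a.scaffold == b.scaffold and a.strand == b.strand
            and a.start <= b.end and b.start <= a.end)

def merge_ogs(consensus, manual):
    """Assumed rule: a manual model supersedes every consensus model it overlaps;
    manual models labeled 'Delete' remove the consensus model and add nothing."""
    kept = [c for c in consensus if not any(overlaps(c, m) for m in manual)]
    curated = [m for m in manual if m.status != "Delete"]
    return sorted(kept + curated, key=lambda m: (m.scaffold, m.start))

# Made-up example data.
consensus = [
    Model("cons_1", "Scaffold1", 100, 900, "+"),
    Model("cons_2", "Scaffold1", 2000, 2500, "+"),
    Model("cons_3", "Scaffold2", 50, 400, "-"),
]
manual = [
    Model("cur_1", "Scaffold1", 150, 950, "+"),            # corrected version of cons_1
    Model("del_1", "Scaffold2", 50, 400, "-", "Delete"),   # cons_3 should be removed
]
merged = merge_ogs(consensus, manual)
print([m.name for m in merged])
```

A consensus model that nobody touches (here `cons_2`) passes through unchanged, which is why an incorrect model that is never flagged would still reach the OGS.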
Web Apollo
The Sequence Selection Window
4. Becoming Acquainted with Web Apollo.
Navigation tools: pan and zoom.
Search box: go to a scaffold or a gene model.
Grey bar of coordinates
indicates location. You can
also select here in order to
zoom to a sub-region.
‘View’: change
color by CDS,
toggle strands,
set highlight.
‘File’:
Upload your own
evidence: GFF3,
BAM, BigWig, VCF*.
Add combination
and sequence
search tracks.
‘Tools’:
Use BLAT to query the
genome with a protein
or DNA sequence.
Available Tracks
Evidence Tracks Area
‘User-created Annotations’ Track
Login
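For the ‘File’ upload option listed above, it may help to recall that GFF3 is a tab-separated text format with nine columns (seqid, source, type, start, end, score, strand, phase, attributes). The following Python sketch writes one gene model as GFF3; the scaffold name, IDs, coordinates, and the `my_evidence` source label are made up for illustration.

```python
def gff3_line(seqid, source, ftype, start, end, strand, attributes,
              score=".", phase="."):
    """Format one GFF3 feature line (nine tab-separated columns)."""
    attr = ";".join(f"{key}={value}" for key, value in attributes.items())
    return "\t".join([seqid, source, ftype, str(start), str(end),
                      score, strand, phase, attr])

def gene_to_gff3(seqid, gene_id, strand, exons, source="my_evidence"):
    """Build a gene > mRNA > exon hierarchy from (start, end) exon pairs.
    Coordinates are 1-based and inclusive, as GFF3 requires."""
    start = min(s for s, _ in exons)
    end = max(e for _, e in exons)
    mrna_id = f"{gene_id}-RA"  # hypothetical naming scheme
    lines = ["##gff-version 3",
             gff3_line(seqid, source, "gene", start, end, strand,
                       {"ID": gene_id}),
             gff3_line(seqid, source, "mRNA", start, end, strand,
                       {"ID": mrna_id, "Parent": gene_id})]
    for number, (s, e) in enumerate(sorted(exons), start=1):
        lines.append(gff3_line(seqid, source, "exon", s, e, strand,
                               {"ID": f"{mrna_id}.exon{number}",
                                "Parent": mrna_id}))
    return "\n".join(lines) + "\n"

gff3_text = gene_to_gff3("Scaffold1", "gene1", "+", [(100, 200), (300, 450)])
print(gff3_text)
```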
Web Apollo
Graphical User Interface (GUI) for editing annotations
Flags non-canonical splice sites.
Selection of features and
sub-features
Edge-matching
Evidence Tracks Area
‘User-created Annotations’ Track
The editing logic in the server:
selects longest ORF as CDS
flags non-canonical splice sites
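To make the splice-site flag concrete: a canonical intron begins with GT and ends with AG when read on the strand of the transcript. The sketch below checks each intron of a gene model against that rule. It is an illustration only, not Web Apollo's server code; the function names and the example sequence are made up.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def noncanonical_introns(genome, exons, strand):
    """Return (intron_start, intron_end, donor, acceptor) for every intron
    that is not GT...AG. Exons are (start, end) pairs in 1-based, inclusive,
    forward-strand coordinates."""
    flagged = []
    exons = sorted(exons)
    for (_, left_end), (right_start, _) in zip(exons, exons[1:]):
        # Bases left_end+1 .. right_start-1 (1-based) form the intron.
        intron = genome[left_end:right_start - 1]
        if strand == "-":
            intron = revcomp(intron)
        donor, acceptor = intron[:2].upper(), intron[-2:].upper()
        if (donor, acceptor) != ("GT", "AG"):
            flagged.append((left_end + 1, right_start - 1, donor, acceptor))
    return flagged

# Made-up sequence: exon, canonical intron, exon, GC...AG intron, exon.
genome = ("ATGGCC" + "GTAAGTTTTCAG" + "GGATAA"
          + "GC" + "T" * 8 + "AG" + "CCC")
exons = [(1, 6), (19, 24), (37, 39)]
flags = noncanonical_introns(genome, exons, "+")
print(flags)
```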
Web Apollo
DNA Track
‘User-created Annotations’ Track
Web Apollo
There are two new kinds of tracks for:
annotation editing
sequence alteration editing
Web Apollo
Annotations, annotation edits, and history are stored in a centralized database.
Web Apollo
The Information Editor
• DBXRefs
• PubMed IDs
• GO terms
• Comments
Additional Functionality
In addition to the protein-coding gene annotation that you know and love:
• Non-coding genes: ncRNAs, miRNAs, repeat regions, and TEs
• Sequence alterations (less coverage = more fragmentation)
• Visualization of stage and cell-type specific transcription data as coverage plots, heat maps, and alignments
How to begin curating
To find the gene region you wish to annotate, you may use:
a) a protein sequence from another species;
b) a sequence from a similar gene;
c) your own alignments of gene models or transcriptomic data to the genome;
d) high-quality protein and/or gene family alignments (multi- or single-species) in which you can identify conserved domains.
Option 1 – You have a sequence but don’t know where it is in this genome:
• Use BLAT in the Web Apollo window, or BLAST at NAL’s i5k BLAST server, available at:
http://i5k.nal.usda.gov/blastn
• Alternatively, use any other tool; for example Geneious.
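If you run BLAST yourself and save the results in the standard 12-column tabular format (`-outfmt 6` in BLAST+), a few lines of Python can turn the best hit into a region to visit. This is a sketch under that assumption; the hits below are made up, and picking the single highest bit score is a simplification of what a curator would actually review.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

def best_hit_region(tabular_text):
    """Return (scaffold, start, end, strand) of the highest-scoring hit,
    or None if there are no hits."""
    best = None
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        hit = dict(zip(FIELDS, row))
        if best is None or float(hit["bitscore"]) > float(best["bitscore"]):
            best = hit
    if best is None:
        return None
    s, e = int(best["sstart"]), int(best["send"])
    # In tabular output, sstart > send indicates a hit on the reverse strand.
    strand = "+" if s <= e else "-"
    return best["sseqid"], min(s, e), max(s, e), strand

# Made-up example hits for illustration.
example = (
    "query1\tScaffold12\t78.5\t120\t20\t1\t5\t124\t45210\t44851\t3e-40\t155\n"
    "query1\tScaffold3\t45.0\t80\t40\t2\t10\t89\t1000\t1239\t1e-05\t48.1\n"
)
region = best_hit_region(example)
print(region)
```

The scaffold name and coordinates can then be typed into the Search box to navigate to the region.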
Option 2 – The genome has already been annotated with your sequences and you have a gene
identifier that has been indexed in Web Apollo.
• That is, you know where to look, so type the ID in the Search box of Web Apollo.
• Web Apollo autocompletes using a case-insensitive search anchored on the left-hand side of
the word. For example “HaGR” will show all “hagr” objects (up to 30).
• Choose one of the genes and click “Go”.
• You can do that with Domains, Alignments or Gene names provided to you (if they have been
indexed).
Option 3 – Find genes based on functional ontology terms or network membership identifiers.
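The autocomplete behavior described under Option 2 (case-insensitive, anchored on the left, at most 30 results) can be sketched in Python as follows. This mimics the described behavior only; it is not Web Apollo's implementation, and the gene names are made up.

```python
def autocomplete(query, names, limit=30):
    """Return up to `limit` names that start with `query`, ignoring case."""
    prefix = query.casefold()
    matches = [name for name in names if name.casefold().startswith(prefix)]
    return matches[:limit]

# Hypothetical indexed names.
indexed = ["HaGR1", "hagr2", "HAGR10", "OrHaGR", "Hox3"]
print(autocomplete("HaGR", indexed))
```

Note that "OrHaGR" is not returned: the match is anchored at the start of the name, so a query that appears only in the middle of an identifier finds nothing.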
General Process of Curation
1. Select the chromosomal region of interest, e.g. scaffold.
2. Select appropriate evidence tracks.
3. Determine whether a feature in an existing evidence track will
provide a reasonable gene model to start working.
- If yes: select and drag the feature to the ‘User-created
Annotations’ area, creating an initial gene model. If necessary
use editing functions to adjust the gene model.
- Nothing available to you? Let’s have a talk.
4. Check your edited gene model for integrity and accuracy by
comparing it with available homologs.
Always remember: when annotating gene models using Web
Apollo, you are looking at a ‘frozen’ version of the genome
assembly and you will not be able to modify the assembly itself.
Example
Introductory demonstration using the Apis mellifera genome.
Q&A session using the Eurytemora affinis genome at
https://apollo.nal.usda.gov/euraff/selectTrack.jsp
A public Honey Bee Web Apollo Demo is available at
http://genomearchitect.org/WebApolloDemo
What do we know for this species?
• What data are currently available?
• At NCBI:
• 5,570 nucleotide sequences (scaffolds)
• 446 amino acid sequences (CO-I)
• 0 conserved domains identified
• 0 “gene” entries submitted
PubMed Search: what’s new?
Empirical examples of
beneficial reversal of
dominance:
• Warfarin resistance: mutation
of VKORC1 is associated with
increased dietary requirement
for vit. K
How many sequences for your gene of
interest?
• VKORC1 – vit. K epoxide reductase
complex, subunit 1.
• MF: quinone binding (IEA,
GO:0048038), vit K epoxide reductase
activity (IDA, GO:0047057).
• BP: blood coagulation (IMP,
GO:0007596), bone development
(ISS, GO:0060348).
• CC: endoplasmic reticulum membrane
(TAS, GO:0005789), integral
component of membrane (IEA,
GO:0016021).
And what do we know about it?
BLAST at i5K https://i5k.nal.usda.gov/blast
To Web Apollo
BLAST at i5K: hsps in “BLAST+ results” track
Available Tracks
Creating a new gene model: drag and drop
• Web Apollo automatically calculates the longest open reading
frame (ORF). In this case, the ORF includes the hsp.
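As an illustration of what "longest ORF" means here, the sketch below scans the three forward reading frames of a spliced transcript for the longest stretch that runs from an ATG to an in-frame stop codon. It is a simplified sketch, not Web Apollo's actual logic, which may treat partial ORFs or alternative start codons differently; the example sequence is made up.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def longest_orf(transcript):
    """Return (start, end) of the longest ATG...stop ORF as 0-based,
    half-open coordinates on the transcript, or None if there is none."""
    seq = transcript.upper()
    best = None
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first start codon in this ORF
            elif start is not None and codon in STOP_CODONS:
                if best is None or (i + 3 - start) > (best[1] - best[0]):
                    best = (start, i + 3)      # include the stop codon
                start = None
    return best

# Made-up transcript with a short ORF and a longer one in another frame.
transcript = "CC" + "ATGAAATAG" + "G" + "ATGCCCGGGAAATTTTGA" + "CC"
orf = longest_orf(transcript)
print(orf, transcript[orf[0]:orf[1]])
```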
Get Sequence
http://blast.ncbi.nlm.nih.gov/Blast.cgi
Flanking sequences (other gene models) vs. NCBI nr
At 3’ end
At 5’ end
Additional evidence in support of split
Editing: split
Finished model
Information Editor
• DBXRefs: NP_076869.1, H. sapiens, RefSeq
• PubMed identifier: PMID:24337963
• Gene Ontology IDs: GO:0048038, GO:0047057, GO:0007596,
GO:0060348, GO:0005789, GO:0016021.
• Comments.
• Name, Symbol.
• Approve / Delete radio button.
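Before pasting identifiers into the Information Editor, it can be worth checking that they are well formed. The sketch below validates the two formats used on this slide: GO identifiers (`GO:` followed by seven digits) and PubMed identifiers written as `PMID:` followed by digits. The helper name is hypothetical, and the check is purely syntactic; it does not confirm that an identifier exists.

```python
import re

GO_ID = re.compile(r"GO:\d{7}")
PMID = re.compile(r"PMID:\d+")

def malformed_ids(go_ids, pmids):
    """Return the identifiers that do not match the expected pattern."""
    bad = [x for x in go_ids if not GO_ID.fullmatch(x)]
    bad += [x for x in pmids if not PMID.fullmatch(x)]
    return bad

# Identifiers taken from the slide above.
go_ids = ["GO:0048038", "GO:0047057", "GO:0007596",
          "GO:0060348", "GO:0005789", "GO:0016021"]
pmids = ["PMID:24337963"]
print(malformed_ids(go_ids, pmids))
```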
Comments
Arthropodcentric Thanks!
AgriPest Base
FlyBase
Hymenoptera Genome Database
VectorBase
Acromyrmex echinatior
Acyrthosiphon pisum
Apis mellifera
Atta cephalotes
Bombus terrestris
Camponotus floridanus
Helicoverpa armigera
Linepithema humile
Manduca sexta
Mayetiola destructor
Nasonia vitripennis
Pogonomyrmex barbatus
Solenopsis invicta
Tribolium castaneum
… and you!
Thanks!
• Berkeley Bioinformatics Open-source Projects
(BBOP), Berkeley Lab: Web Apollo and Gene Ontology
teams. Suzanna E. Lewis (PI).
• Christine G. Elsik (PI). § University of Missouri.
• Ian Holmes (PI). University of California, Berkeley.
• Arthropod genomics community, i5K Steering
Committee, Monica Poelchau at USDA/NAL, fringy
Richards at HGSC-BCM, Alexie Papanicolaou at
CSIRO, Oliver Niehuis at 1KITE http://www.1kite.org/,
BGI, and the Honey Bee Genome Sequencing
Consortium.
• Web Apollo is supported by NIH grants
5R01GM080203 from NIGMS, and 5R01HG004483
from NHGRI, and by the Director, Office of Science,
Office of Basic Energy Sciences, of the U.S.
Department of Energy under Contract No.
DE-AC02-05CH11231.
• Insect images used with permission:
http://AlexanderWild.com and O. Niehuis.
• For your attention, thank you!
Web Apollo
Suzanna Lewis
Gregg Helt
Colin Diesh§
Deepak Unni§
Gene Ontology
Chris Mungall
Seth Carbon
Heiko Dietze
Colleagues at BBOP
Web Apollo: http://GenomeArchitect.org
GO: http://GeneOntology.org
i5K: http://arthropodgenomes.org/wiki/i5K