The new VectorBase: our improved resource for invertebrate vectors
Scott Emrich
On behalf of VectorBase
“bigger, better, faster”
Or
“consolidate, improve and rationalise” (UK)
VectorBase has been mostly a collator of genomes
[Figure labels: Raw GenBank data from sequencing centers; (our) Annotation; Organism pages. Legend: Full release; Pre-released (*)]
Rapid growth, however, in past 5 years
[Figure: number (#) of Genome, Proteomics, and Strains entries by year, 2002 to 2014; y-axis 0 to 60]
VectorBase is also:
• A service providing tools for browsing and mining vector “-omics” data
• A content generator
– Mostly genome annotation (later talks)
• Committed to regular releases (5-6 per year)
• A help desk that supports our community on genome informatics and is responsible for facilitating data submission
In the end, VectorBase is a team
And YOU!
Left side:
• Welcome message
• Available data
• Tools and Resources
Right side:
• Past jobs
• Organisms (2)
• Latest news
Left side:
• Community
Right side:
• Rotating tips
• Newsletters
• Upcoming meetings
This is the new organism page:
Collects strain, data, and relevant tools
~3700-8300 jobs per month
Mostly Anopheles but other species
Web development goals (2015)
• Patching/upgrading webApollo instances (1)
– multiple genomes in one instance
– reworked framework to improve performance
• Integrating subcontractor work with Drupal CMS (2)
– Easier releases and better cross site development
• Sitewide authentication for single user accounts
– Drupal
– Web Apollo
– Galaxy
Modifying webApollo
example
Advanced Search
Antelmo (ND) is making Advanced Search more stable and intuitive via Drupal and SOLR
-> Also allows looking at saved searches, for advanced analysis of BRC usage
-> Now running 4.x SOLR to further support PopBio
Current VectorBase variation + PopBio dataflows.
[Diagram: VCF into the Ensembl variation database (display of variant data in genomic context); ISA-TAB into PopBio (display of detailed sample metadata, e.g. geodata); the two linked by sample + variation set IDs]
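As a rough illustration of the link in the dataflow above, the following sketch joins variant calls to sample metadata on a shared sample ID. The record layout, field names, and values are all hypothetical placeholders; they are not VectorBase's actual schema, and the slides do not say how the join is implemented.

```python
# Hypothetical variant calls, as they might be summarised from a VCF file.
variant_calls = [
    {"sample_id": "SAMPLE_A", "variation_set_id": "SET_1", "genotype": "A/T"},
    {"sample_id": "SAMPLE_B", "variation_set_id": "SET_1", "genotype": "A/A"},
    {"sample_id": "SAMPLE_C", "variation_set_id": "SET_2", "genotype": "T/T"},
]

# Hypothetical sample metadata, as it might be summarised from ISA-TAB sheets.
sample_metadata = {
    "SAMPLE_A": {"species": "Anopheles gambiae", "country": "Exampleland"},
    "SAMPLE_B": {"species": "Anopheles gambiae", "country": "Exampleland"},
}


def attach_metadata(calls, metadata):
    """Return calls enriched with sample metadata, skipping unknown samples."""
    merged = []
    for call in calls:
        sample = metadata.get(call["sample_id"])
        if sample is None:
            # No metadata record for this sample, so it cannot be displayed
            # with geodata; leave it out of the merged view.
            continue
        merged.append({**call, **sample})
    return merged


merged = attach_metadata(variant_calls, sample_metadata)
```

Calls without a matching metadata record are dropped here; a real pipeline might prefer to report them instead.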
Use of Apache Solr to provide unified search (and thus integration) across the BRC
[Diagram: VCF into the Ensembl variation database (display of variant data in genomic context); ISA-TAB into PopBio (display of detailed sample metadata, e.g. geodata)]
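To make the unified search idea concrete, here is a minimal sketch of how a client might build a request against Solr's standard `select` handler. The host, the core name `vb_search`, and the `site` field used in the filter query are assumptions for illustration only; the slides do not describe the actual index layout.

```python
from urllib.parse import urlencode


def build_solr_query(base_url, core, text, site=None, rows=10):
    """Build a Solr select URL; 'site' is a hypothetical field naming the data source."""
    params = [("q", text), ("rows", rows), ("wt", "json")]
    if site is not None:
        # fq restricts the result set without affecting relevance scoring.
        params.append(("fq", f"site:{site}"))
    return f"{base_url}/solr/{core}/select?{urlencode(params)}"


# Hypothetical host and core name.
url = build_solr_query(
    "http://solr.example.org:8983", "vb_search", "insecticide resistance", site="PopBio"
)
```

Leaving out `site` would search every record in the one index, which is the sense in which a shared index gives integration across resources.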
PopBio import
• Current size: 121 projects, 57,637 samples, 172,636 assays (of which 4,387 are IR)
• At present loading can be done overnight, but this may change
• The web interface is not slow, but only because of “pre-loading,” which definitely isn’t scalable
PopBio plans
Map interface: delivery June release + Kolymbari + ICEMR meetings
Spreadsheet submission wizard development scheduled for Fall 2015.
Year 2: Sample x genotype browser development, including e! REST and variation Solr work.
Year 2: Refactor project pages with scalable (but still flexible) data transfer (probably also Solr-driven) & update graphics.
Scaling up to millions of SNPs, thousands of samples
Plan to develop or modify something similar to MalariaGEN's Panoptes with richer/more flexible metadata capabilities.
Upcoming genome updates
• June 2015
– sandflies x 2
– Anopheles assembly updates x 4
• Summer 2015 - QC of Glossina workshop data, 16G data
• August 2015 – Release of MalariaGEN 1000G data (pending publication plans); we expect ~50 million new malaria mosquito variants by the end of summer.
• October - Glossinas x 6
Updating genes and assemblies
• We recently supported the Glossina gene annotation workshop held in Kenya (3/2015). The workshop data will be integrated into the existing Glossina databases for release in late 2015. A new database for the final species (Glossina palpalis) will also be created for release in late 2015.
• Assembly updates for An. farauti, An. melas, An. merus and An. sinensis have been examined to assess whether we can project gene information onto the new assemblies. Over 90% of transcripts could be projected and we intend to schedule the assembly updates for Q3 2015.
• New databases have been proposed for Sarcoptes scabiei var. canis and Aedes albopictus.
• Emrich, Hahn, Lawniczak and Besansky will submit a new reference genome of An. gambiae (S) for summer 2015.
Improved EBI production
• Data management systems
Web Apollo databases have been set up for 32 organisms, and are being actively used by the community for Biomphalaria glabrata (snail), Phlebotomus papatasi and Lutzomyia longipalpis (sandflies), Musca domestica (house fly) and the five current Glossina (tsetse) species.
• IT infrastructure
VectorBase production pipelines are being migrated to the EBI eHive system (https://github.com/Ensembl/ensembl-hive). This encourages standardization of our code base, and also allows using EBI parallel computing resources.
• Analysis tools
New pipelines for xrefs, search, protein alignment and exonerate-based sequence alignments have been developed using the eHive system. This has allowed us to speed up run times in addition to the advantages above.
Future production work at EBI
Search
• We had previously experienced scaling problems with the generation of Solr indices for the VectorBase search, and have now rewritten the core gene Solr index generation for eHive.
Updating genome data
• Projection of gene descriptions between closely related orthologs will be introduced in an attempt to improve basal gene annotation in some of the new species. First deployment of this code is scheduled for June 2015.
• Transcript, genomic sequence and GTF/GFF dumping have been included in the eHive pipeline, but data files are still updated on the VectorBase Drupal site manually.
• Adding the UCSC track hub system to facilitate metadata and additional “-omics” data
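The projection of gene descriptions between orthologs mentioned above could look roughly like the sketch below. It is only an illustration under stated assumptions: the slides do not describe the actual rules, so the one-to-one restriction, the "projected from" suffix, and the gene IDs are all hypothetical.

```python
from collections import Counter


def project_descriptions(source_desc, target_desc, ortholog_pairs):
    """Copy descriptions onto undescribed target genes via one-to-one orthologs."""
    source_counts = Counter(source_id for source_id, _ in ortholog_pairs)
    target_counts = Counter(target_id for _, target_id in ortholog_pairs)
    result = dict(target_desc)
    for source_id, target_id in ortholog_pairs:
        # Assumption: only unambiguous one-to-one ortholog pairs are used.
        if source_counts[source_id] != 1 or target_counts[target_id] != 1:
            continue
        # Never overwrite a description the target gene already has.
        if result.get(target_id):
            continue
        description = source_desc.get(source_id)
        if description:
            result[target_id] = f"{description} (projected from {source_id})"
    return result


# Hypothetical gene IDs and descriptions.
source_desc = {"SRC1": "odorant receptor", "SRC2": "cytochrome P450", "SRC3": None}
target_desc = {"TGT1": None, "TGT2": "existing annotation", "TGT3": None, "TGT4": None}
ortholog_pairs = [("SRC1", "TGT1"), ("SRC2", "TGT2"), ("SRC3", "TGT3")]

updated = project_descriptions(source_desc, target_desc, ortholog_pairs)
```

Recording the source gene in the projected text keeps inferred descriptions distinguishable from curated ones.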