the new vectorbase: our improved resource for invertebrate vectors scott emrich on behalf of...

The new VectorBase: our improved resource for invertebrate vectors

Scott Emrich
On behalf of VectorBase

“bigger, better, faster”
or
“consolidate, improve and rationalise” (UK)

VectorBase has been mostly a collator of genomes

[Figure labels: Organism pages; Raw GenBank data from sequencing centers; (our) Annotation. Legend: Full release; Pre-released*]

Rapid growth, however, in past 5 years

[Chart: counts (#) by year, 2002 to 2014, for three series: Genome, Proteomics, Strains; count axis 0 to 60]

VectorBase is also:

• A service providing tools for browsing and mining vector “-omics” data

• A content generator
– Mostly genome annotation (later talks)

• Committed to regular releases (5-6 per year)

• A help desk that supports our community on genome informatics and is responsible for facilitating data submission


In the end, VectorBase is a team

And YOU!

Left side:
• Welcome message
• Available data
• Tools and Resources

Right side:
• Past jobs
• Organisms (2)
• Latest news

Left side:
• Community

Right side:
• Rotating tips
• Newsletters
• Upcoming meetings

This is the new organism page: collects strain, data, and relevant tools

~3700-8300 jobs per month

Mostly Anopheles, but other species too

Web development goals (2015)

• Patching/upgrading webApollo instances (1)
– multiple genomes in one instance
– reworked framework to improve performance

• Integrating subcontractor work with Drupal CMS (2)
– Easier releases and better cross-site development

• Sitewide authentication for single user accounts
– Drupal
– Web Apollo
– Galaxy

Modifying webApollo

example

Advanced Search

Antelmo (ND) is making Advanced Search more stable and intuitive via Drupal and Solr
-> Also allows looking at saved searches, for advanced analysis of BRC usage
-> Now running Solr 4.x to further support PopBio
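As an illustration of what a Solr-backed search request looks like, here is a minimal sketch that builds a query URL for Solr's standard `/select` handler. The host, the core name `vb_search` and the `site` field are assumptions made for this example; the slides do not give the actual VectorBase endpoint or schema.

```python
from urllib.parse import urlencode

# Hypothetical endpoint and core name; not taken from the slides.
SOLR_SELECT = "http://localhost:8983/solr/vb_search/select"

def build_search_url(text, site=None, rows=10):
    """Return a Solr /select URL for a free-text query with an optional filter."""
    params = [("q", text), ("rows", str(rows)), ("wt", "json")]
    if site is not None:
        # "site" is an assumed field naming the source component (e.g. PopBio).
        params.append(("fq", f'site:"{site}"'))
    return SOLR_SELECT + "?" + urlencode(params)

url = build_search_url("kdr resistance", site="PopBio", rows=5)
print(url)
```

Keeping the filter in `fq` rather than in `q` lets Solr cache the filter separately from the free-text part, which is the usual reason to split them.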

Current VectorBase variation + PopBio dataflows.

[Diagram: VCF feeds the Ensembl variation database (display of variant data in genomic context); ISA-TAB feeds PopBio (display of detailed sample metadata, e.g. geodata); the two are linked by sample + variation set ids]

Use of Apache Solr to provide unified search (and thus integration) across the BRC
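The dataflow above links variant calls and sample metadata through shared sample ids. A minimal sketch of that kind of check follows: it reads the sample names from a VCF header and compares them with metadata records. The sample names, the metadata dictionary (standing in for records derived from ISA-TAB) and the function name are invented for illustration; this is not the VectorBase loader.

```python
import io

def vcf_sample_ids(handle):
    """Return the sample names from the #CHROM header line of a VCF stream."""
    for line in handle:
        if line.startswith("#CHROM"):
            columns = line.rstrip("\n").split("\t")
            # The first nine columns are fixed by the VCF format; samples follow.
            return columns[9:]
    return []

# Toy inputs: names and metadata are made up for this example.
vcf_text = (
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n"
    "2L\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\t1/1\n"
)
sample_metadata = {"S1": {"country": "Kenya"}, "S3": {"country": "Mali"}}

ids = vcf_sample_ids(io.StringIO(vcf_text))
shared = [s for s in ids if s in sample_metadata]       # usable in both views
missing = [s for s in ids if s not in sample_metadata]  # variants without metadata
print(shared, missing)
```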

PopBio import

• Current size: 121 projects, 57,637 samples, 172,636 assays (of which 4,387 are insecticide resistance, IR)

• At present loading can be done overnight, but this may change

• The web interface stays responsive only because of “pre-loading,” which definitely isn’t scalable

PopBio plans

Map interface: delivery June release + Kolymbari + ICEMR meetings

Spreadsheet submission wizard development scheduled for Fall 2015.

Year 2: Sample x genotype browser development, including Ensembl (e!) REST and variation Solr work.

Year 2: Refactor project pages with scalable (but still flexible) data transfer (probably also Solr-driven) & update graphics.

Scaling up to millions of SNPs, thousands of samples

Plan to develop or modify something similar to MalariaGEN's Panoptes, with richer/more flexible metadata capabilities.
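To give a feel for the scale involved, here is a back-of-envelope estimate of a dense sample x genotype matrix. The ~50 million variant figure is the one quoted for the MalariaGEN release elsewhere in these slides; the 1,000 samples and the one byte per genotype call are assumptions chosen for the arithmetic, since the slides say only "thousands of samples".

```python
def genotype_matrix_bytes(n_variants, n_samples, bytes_per_call=1):
    """Size of a dense variants x samples matrix, ignoring compression."""
    return n_variants * n_samples * bytes_per_call

n_variants = 50_000_000  # "~50 million" variants quoted in the slides
n_samples = 1_000        # assumed; the slides say only "thousands"
total = genotype_matrix_bytes(n_variants, n_samples)
print(f"{total / 1e9:.0f} GB uncompressed")
```

Under these assumptions the matrix is about 50 GB before any compression, which is why a pre-loaded, in-memory approach is unlikely to carry over.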

Upcoming genome updates

• June 2015
– sandflies x 2
– Anopheles assembly updates x 4

• Summer 2015 - QC of Glossina workshop data, 16G data

• August 2015 – Release of MalariaGEN 1000G data (pending publication plans); we expect ~50 million new malaria mosquito variants by the end of summer.

• October - Glossinas x 6

Updating genes and assemblies

• We recently supported the Glossina gene annotation workshop held in Kenya (3/2015). The workshop data will be integrated into the existing Glossina databases for release in late 2015. A new database for the final species (Glossina palpalis) will also be created for release in late 2015.

• Assembly updates for An. farauti, An. melas, An. merus and An. sinensis have been examined to assess whether we can project gene information onto the new assemblies. Over 90% of transcripts could be projected and we intend to schedule the assembly updates for Q3 2015.

• New databases have been proposed for Sarcoptes scabiei var. canis and Aedes albopictus.

• Emrich, Hahn, Lawniczak and Besansky will submit a new reference genome of An. gambiae (S) for summer 2015.

Improved EBI production

• Data management systems
WebApollo databases have been set up for 32 organisms, and are being actively used by the community for Biomphalaria glabrata (snail), Phlebotomus papatasi and Lutzomyia longipalpis (sandflies), Musca domestica (house fly) and the five current Glossina (tsetse) species.

• IT infrastructure
VectorBase production pipelines are being migrated to the EBI eHive system (https://github.com/Ensembl/ensembl-hive). This encourages standardization of our code base, and also allows using EBI parallel computing resources.

• Analysis tools
New pipelines for xrefs, search, protein alignment and Exonerate-based sequence alignments have been developed using the eHive system. This has allowed us to speed up run times in addition to the advantages above.

Future production work at EBI

Search

• We had previously experienced scaling problems with the generation of Solr indices for the VectorBase search, and have now rewritten the core Solr gene index generation for eHive.
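One common way to make index generation scale is to split the gene set into independent batches that can be built and loaded separately. The sketch below shows only that general idea: it serializes each batch as a JSON array of documents, the form Solr's JSON update handler accepts. The gene ids, field names and batch size are invented for illustration, and this is plain Python, not the eHive (Perl) pipeline described above.

```python
import json

def batches(items, size):
    """Yield consecutive slices of `items`, each holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def solr_update_payload(genes):
    """Serialize gene records as a JSON array of Solr documents."""
    docs = [
        {
            "id": gene["id"],
            "species": gene["species"],
            "description": gene.get("description", ""),
        }
        for gene in genes
    ]
    return json.dumps(docs)

# Toy records; ids and fields are assumptions, not the VectorBase schema.
genes = [{"id": f"gene_{n}", "species": "Anopheles gambiae"} for n in range(5)]
payloads = [solr_update_payload(batch) for batch in batches(genes, 2)]
print(len(payloads))
```

Because each payload is self-contained, a failed batch can be retried without rebuilding the whole index.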

Updating genome data

• Projection of gene descriptions between closely related orthologs will be introduced in an attempt to improve basal gene annotation in some of the new species. First deployment of this code is scheduled for June 2015.

• Transcript, genomic sequence and GTF/GFF dumping have been included in the eHive pipeline, but data files are still updated manually on the VectorBase Drupal site.

• Adding the UCSC track hub system to facilitate metadata and additional “-omics” data
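The description projection mentioned in the first bullet above can be pictured with a small sketch: copy a description from a source gene to its ortholog only when the target has none, and record where it came from. The function name, the gene ids, the one-to-one ortholog pairs and the provenance suffix are all assumptions for illustration; the slides do not describe the actual rules used.

```python
def project_descriptions(source_desc, target_desc, orthologs):
    """Copy descriptions onto target genes that lack one, via ortholog pairs."""
    projected = dict(target_desc)  # leave the caller's mapping untouched
    for src, tgt in orthologs:
        description = source_desc.get(src)
        # Never overwrite an existing description on the target gene.
        if description and not projected.get(tgt):
            projected[tgt] = f"{description} [projected from {src}]"
    return projected

# Toy data; ids and descriptions are made up.
source = {"srcA": "odorant receptor", "srcB": "", "srcC": "protein kinase"}
target = {"tgt1": "", "tgt2": "curated kinase", "tgt3": ""}
pairs = [("srcA", "tgt1"), ("srcC", "tgt2"), ("srcB", "tgt3")]

result = project_descriptions(source, target, pairs)
print(result)
```

Tagging the projected text with its source keeps inferred descriptions distinguishable from curated ones.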
