The new VectorBase: our improved resource for invertebrate vectors Scott Emrich On behalf of VectorBase “bigger, better, faster” Or “consolidate, improve and rationalise” (UK)

Upload: russell-ramsey

Post on 30-Dec-2015



The new VectorBase: our improved resource for invertebrate vectors

Scott Emrich
On behalf of VectorBase

“bigger, better, faster”
Or
“consolidate, improve and rationalise” (UK)


VectorBase has been mostly a collator of genomes

[Figure labels: Full release; Pre-released*; Organism pages; Raw GenBank data from sequencing centers; (our) Annotation]


Rapid growth, however, in past 5 years

[Figure: chart by year, 2002 to 2014; y-axis: # (0 to 60); series: Genome, Proteomics, Strains]


VectorBase is also:

• A service providing tools for browsing and mining vector “-omics” data

• A content generator
  – Mostly genome annotation (later talks)

• Committed to regular releases (5-6 per year)

• A help desk that supports our community on genome informatics and is responsible for facilitating data submission


In the end, VectorBase is a team

And YOU!


Left side:
• Welcome message
• Available data
• Tools and Resources

Right side:
• Past jobs
• Organisms (2)
• Latest news


Left side:
• Community

Right side:
• Rotating tips
• Newsletters
• Upcoming meetings


This is the new organism page:
Collects strain, data, and relevant tools


~3700-8300 jobs per month

Mostly Anopheles, but other species as well


Web development goals (2015)

• Patching/upgrading WebApollo instances (1)
  – multiple genomes in one instance
  – reworked framework to improve performance

• Integrating subcontractor work with Drupal CMS (2)
  – Easier releases and better cross-site development

• Sitewide authentication for single user accounts
  – Drupal
  – WebApollo
  – Galaxy


Modifying WebApollo

example


Advanced Search

Antelmo (ND) is making Advanced Search more stable and intuitive via Drupal and Solr
-> Also allows looking at saved searches, for advanced analysis of BRC usage
-> Now running Solr 4.x to further support PopBio
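As a rough illustration of what a Solr-backed search request looks like, the sketch below builds a URL for Solr's standard /select handler. The host, the core name `vb_search` and the `species` field are assumptions made up for this example; they are not taken from the VectorBase configuration.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_solr_query(base_url, text, species=None, rows=10):
    """Return a Solr /select URL for a free-text query with an optional species filter."""
    params = [("q", text), ("rows", str(rows)), ("wt", "json")]
    if species:
        # fq restricts the result set without changing relevance scoring.
        # The "species" field name is hypothetical.
        params.append(("fq", f'species:"{species}"'))
    return f"{base_url.rstrip('/')}/select?{urlencode(params)}"


# Hypothetical local Solr core; no request is sent here.
url = build_solr_query(
    "http://localhost:8983/solr/vb_search",
    "odorant receptor",
    species="Anopheles gambiae",
)
print(url)
print(parse_qs(urlsplit(url).query))
```

Keeping the species restriction in `fq` rather than in `q` lets Solr cache the filter separately from the free-text part, which is one reason filter queries are commonly used for faceted search pages.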


Current VectorBase variation + PopBio dataflows.

[Diagram labels: VCF; ISA-TAB; Sample + variation set ids; Ensembl variation database (display of variant data in genomic context); PopBio (display of detailed sample metadata, e.g. geodata)]
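To make the role of shared sample ids concrete, here is a minimal sketch that joins genotype calls from a VCF with per-sample metadata from a tab-separated sheet. The tiny inline data, the sample names and the `Characteristics[country]` column are invented for illustration, and this is not the actual VectorBase loading code.

```python
import csv
import io

# Toy VCF with two samples (S1, S2); values are made up.
VCF_TEXT = """##fileformat=VCFv4.1
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2
2L\t1000\t.\tA\tG\t.\tPASS\t.\tGT\t0/1\t1/1
"""

# Toy ISA-TAB-like sample sheet keyed by "Sample Name" (hypothetical columns).
META_TEXT = "Sample Name\tCharacteristics[country]\nS1\tKenya\nS2\tMali\n"


def vcf_samples(vcf_text):
    """Return the sample ids listed after the FORMAT column of the #CHROM header."""
    for line in vcf_text.splitlines():
        if line.startswith("#CHROM"):
            return line.split("\t")[9:]
    return []


def sample_metadata(meta_text):
    """Index metadata rows by sample name."""
    reader = csv.DictReader(io.StringIO(meta_text), delimiter="\t")
    return {row["Sample Name"]: row for row in reader}


def genotypes_with_metadata(vcf_text, meta_text):
    """Pair each genotype call with the country recorded for its sample."""
    samples = vcf_samples(vcf_text)
    meta = sample_metadata(meta_text)
    joined = []
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip header and blank lines
        fields = line.split("\t")
        chrom, pos = fields[0], int(fields[1])
        for sample, call in zip(samples, fields[9:]):
            country = meta.get(sample, {}).get("Characteristics[country]")
            joined.append((chrom, pos, sample, call, country))
    return joined


for row in genotypes_with_metadata(VCF_TEXT, META_TEXT):
    print(row)
```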


Use of Apache Solr to provide unified search (and thus integration) across the BRC

[Diagram labels: VCF; ISA-TAB; Ensembl variation database (display of variant data in genomic context); PopBio (display of detailed sample metadata, e.g. geodata)]


PopBio import

• Current size: 121 projects, 57,637 samples, 172,636 assays (of which 4,387 are insecticide resistance, IR)

• At present loading can be done overnight, but this may change

• The web interface stays responsive only because of “pre-loading,” which definitely isn’t scalable
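One common alternative to pre-loading everything is to fetch results in fixed-size pages on demand. The sketch below shows the idea on an in-memory list; `fetch_page` is a hypothetical stand-in for a backend query, and nothing here describes how PopBio is actually implemented.

```python
def fetch_page(records, start, rows):
    """Hypothetical stand-in for a backend query returning one slice of results."""
    return records[start:start + rows]


def iter_pages(records, rows=1000):
    """Yield successive pages until the backend returns an empty page."""
    start = 0
    while True:
        page = fetch_page(records, start, rows)
        if not page:
            break
        yield page
        start += rows


# Use the sample count quoted above as the toy data size.
samples = [f"sample_{i}" for i in range(57637)]
page_sizes = [len(page) for page in iter_pages(samples, rows=10000)]
print(page_sizes)
```

With paging, memory use and response time depend on the page size rather than on the total number of samples, which is the property pre-loading lacks.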


PopBio plans

Map interface: delivery June release + Kolymbari + ICEMR meetings

Spreadsheet submission wizard development scheduled for Fall 2015.

Year 2: Sample x genotype browser development, including Ensembl REST and variation Solr work.

Year 2: Refactor project pages with scalable (but still flexible) data transfer (probably also Solr-driven) and update graphics.


Scaling up to millions of SNPs, thousands of samples

Plan to develop or modify something similar to MalariaGEN's Panoptes, with richer and more flexible metadata capabilities.


Upcoming genome updates

• June 2015
  – sandflies x 2
  – Anopheles assembly updates x 4

• Summer 2015 – QC of Glossina workshop data, 16G data

• August 2015 – Release of MalariaGEN 1000G data (pending publication plans); we expect ~50 million new malaria mosquito variants by the end of summer.

• October – Glossina x 6


Updating genes and assemblies

• We recently supported the Glossina gene annotation workshop held in Kenya (3/2015). The workshop data will be integrated into the existing Glossina databases for release in late 2015. A new database for the final species (Glossina palpalis) will also be created for release in late 2015.

• Assembly updates for An. farauti, An. melas, An. merus and An. sinensis have been examined to assess whether we can project gene information onto the new assemblies. Over 90% of transcripts could be projected and we intend to schedule the assembly updates for Q3 2015.

• New databases have been proposed for Sarcoptes scabiei var. canis and Aedes albopictus.

• Emrich, Hahn, Lawniczak and Besansky will submit a new reference genome of An. gambiae (S) for summer 2015.


Improved EBI production

• Data management systems
WebApollo databases have been set up for 32 organisms, and are being actively used by the community for Biomphalaria glabrata (snail), Phlebotomus papatasi and Lutzomyia longipalpis (sandflies), Musca domestica (house fly) and the five current Glossina (tsetse) species.

• IT infrastructure
VectorBase production pipelines are being migrated to the EBI eHive system (https://github.com/Ensembl/ensembl-hive). This encourages standardization of our code base, and also allows using EBI parallel computing resources.

• Analysis tools
New pipelines for xrefs, search, protein alignment and exonerate-based sequence alignments have been developed using the eHive system. This has allowed us to speed up run times in addition to the advantages above.


Future production work at EBI

Search

• We had previously experienced scaling problems with the generation of Solr indices for the VectorBase search, and have now rewritten the core gene Solr index generation for eHive.

Updating genome data

• Projection of gene descriptions between closely related orthologs will be introduced in an attempt to improve basal gene annotation in some of the new species. First deployment of this code is scheduled for June 2015.

• Transcript, genomic sequence and GTF/GFF dumping have been included in the eHive pipeline, but data files are still updated on the VectorBase Drupal site in a manual fashion.

• Adding the UCSC track hub system to facilitate metadata and additional “-omics” data
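The projection of gene descriptions between orthologs mentioned above can be sketched roughly as follows. This is only an illustration under stated assumptions: the gene ids, descriptions and the one-to-one rule are invented here, and the actual production code (part of the eHive pipelines) is not shown in the slides.

```python
def project_descriptions(target_genes, orthologs, source_descriptions):
    """Fill in missing target descriptions from unambiguous (one-to-one) orthologs.

    target_genes: target gene id -> existing description, or None if missing
    orthologs: target gene id -> list of source gene ids
    source_descriptions: source gene id -> description
    """
    projected = {}
    for gene_id, description in target_genes.items():
        if description:
            # Never overwrite an existing description.
            projected[gene_id] = description
            continue
        hits = orthologs.get(gene_id, [])
        # Project only when exactly one ortholog exists and it is described.
        if len(hits) == 1 and source_descriptions.get(hits[0]):
            source_id = hits[0]
            projected[gene_id] = (
                f"{source_descriptions[source_id]} [projected from {source_id}]"
            )
        else:
            projected[gene_id] = None
    return projected


# Hypothetical ids and descriptions, for illustration only.
target = {"TGT001": None, "TGT002": "existing description", "TGT003": None}
ortho = {"TGT001": ["SRC100"], "TGT003": ["SRC200", "SRC201"]}
source = {"SRC100": "odorant receptor", "SRC200": "kinase", "SRC201": "kinase-like"}

result = project_descriptions(target, ortho, source)
print(result)
```

Tagging each projected description with its source id keeps the provenance visible, so projected text can be told apart from curated annotation later.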