deliverable 9.5 - phenomenal · 7 figure 3: an example galaxy workflow showing tools developed to...

Deliverable 9.5.2

Project ID: 654241
Project Title: A comprehensive and standardised e-infrastructure for analysing medical metabolic phenotype data.
Project Acronym: PhenoMeNal
Start Date of the Project: 1st September 2015
Duration of the Project: 36 Months
Work Package Number: 9
Work Package Title: WP9 Tools, Workflows, Audit and Data Management
Deliverable Title: D9.5.2 Updated Data processing Virtual Machine Image2
Delivery Date: M34
Work Package leader: IPB
Contributing Partners: EMBL-EBI, IPB, CEA, UOXF
Authors: Ken Haug, Pablo Moreno, Steffen Neumann, David Johnson, Pierrick Roger Mele

Abstract

The PhenoMeNal project supports several of the most common workflows in metabolomics. An important aspect is the standardised data exchange to and from the MetaboLights repository, hosted at the European Bioinformatics Institute (EMBL-EBI), and the support of Metabolomics Standards Initiative (MSI) compliant metadata through the ISA framework. This deliverable describes the ISA datatype in Galaxy and the Galaxy tools for handling ISA datasets implemented in PhenoMeNal. We also describe the MetaboLights downloader, used to import data into a Galaxy instance, and the corresponding MetaboLights uploader.

Table of Contents

1 Executive Summary
2 Contribution towards the project objectives
3 Detailed report on the deliverable
  3.1 ISA-Tab support in PhenoMeNal
    3.1.1 ISA datatype in Galaxy
  3.2 MetaboLights downloader
  3.3 MetaboLights uploader
4 Delivery and Schedule
5 Conclusion

1 Executive Summary

The PhenoMeNal project supports several of the most common workflows in metabolomics, covering, amongst others, NMR and Mass Spectrometry data processing. An important aspect is the standardised data exchange. MetaboLights [1] is a general-purpose, open-access database repository for metabolomics research, hosted at the European Bioinformatics Institute (EMBL-EBI). MetaboLights is based on the open-source ISA framework [2], and provides Metabolomics Standards Initiative [3] (MSI) compliant metadata and raw experimental data associated with metabolomics experiments.

In deliverable D9.2.2, "PhenoMeNal-Data Virtual Machine image to enable sharing and dissemination of standardised and processed omics data to participating online repositories, like MetaboLights" (submitted in M14) [4], we reported on the PhenoMeNal-Data container images. These were created to enable sharing and dissemination of standardised and processed omics data to and from MetaboLights. We additionally described how we handle primary research data files (raw data) in the PhenoMeNal Virtual Research Environment (VRE) and the interactions with the MetaboLights repository. We provide easy-to-use mechanisms for fast and secure data transfer between the VRE and MetaboLights.

This deliverable D9.5.2 reports on the main changes we have implemented for the Data container images since submitting D9.2.2.

2 Contribution towards the project objectives

A summary of the contribution of this work towards the project objectives:

Objective 9.1: Specify and integrate software pipelines and tools utilised in the PhenoMeNal e-Infrastructure into VMIs, adhering to data standards developed in WP8 and supporting the interoperability and federation middleware developed in WP5. Most tools will be already available (see table 1.1) and we will develop new applications to complete ‘missing links’ in pipelines. Although two explicit releases for VMIs are listed as deliverables below, we will use public repositories and continuous integration to always provide development snapshots of the infrastructure VMIs.

We developed the new Galaxy ISA datatype. New tools, described below, were developed to produce and support this data type. All the packaged tools are available on the PhenoMeNal public container registry.

Objective 9.2: Develop methods to scale-up software pipelines for high-throughput analysis, supporting distributed execution on e.g. local clusters, private clouds, federated clouds, or GRIDs.

[1] https://www.ebi.ac.uk/metabolights/
[2] https://isa-tools.org
[3] http://www.metabolomics-msi.org
[4] http://phenomenal-h2020.eu/home/wp-content/uploads/2016/09/D9.2.2PhenoMeNal-DataVirtualmachine.pdf

All the software packaged in containers has been rigorously tested and currently runs on scalable infrastructure, using the Kubernetes container orchestrator on the EMBL-EBI EMBASSY OpenStack installation. Testing is driven by our continuous integration (CI) system, Jenkins [5]. Whenever the CI builds a new version of a tool container, it triggers additional tests in which the containerised tool processes real data on a Kubernetes cluster running in the EMBASSY Cloud. These tests aim to replicate the environment the tool container is exposed to when running jobs through Galaxy and Kubernetes. Workflows are additionally tested through the same CI against a PhenoMeNal-deployed Galaxy instance on the same Kubernetes cluster.

We have taken steps to simplify importing large datasets into PhenoMeNal through the use of an advanced data transfer client, IBM Aspera, and methods for transferring only the required portions of a dataset.

3 Detailed report on the deliverable

3.1 ISA-Tab support in PhenoMeNal

The ISA-Tab (Investigation, Study, Assay) format describes the data and metadata of a metabolomics study. In particular, it records the organism under study, the samples and experimental factors, the analytical methods, and both raw and derived metabolite data. ISA-Tab is used in several -omics disciplines, including metabolomics, is the underlying data formalism of the MetaboLights repository, and is supported by software packages for many programming languages, including R and Python.
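
As an illustration of the Investigation/Study/Assay split, the sketch below groups the files of an ISA-Tab directory by role. It assumes the usual ISA-Tab naming convention (i_*.txt for the investigation file, s_*.txt for study files, a_*.txt for assay files). The function name and the example file names are hypothetical and are not part of the PhenoMeNal tooling.

```python
from collections import defaultdict
from pathlib import PurePath

# Assumed ISA-Tab file-name prefixes and the role each one signals.
ROLE_BY_PREFIX = {"i_": "investigation", "s_": "study", "a_": "assay"}


def group_isatab_files(filenames):
    """Group file names by ISA-Tab role; anything else counts as data."""
    groups = defaultdict(list)
    for name in filenames:
        base = PurePath(name).name
        role = "data"
        if base.endswith(".txt"):
            role = ROLE_BY_PREFIX.get(base[:2], "data")
        groups[role].append(base)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    example = ["i_Investigation.txt", "s_study.txt", "a_assay_nmr.txt", "sample01.mzML"]
    print(group_isatab_files(example))
```

A grouping of this kind is only a first structural check; it says nothing about whether the contents of the files are valid ISA-Tab.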

3.1.1 ISA datatype in Galaxy

The ISA concepts are ideally suited to describing the input to a Galaxy workflow, where experimental metadata is required for data processing and analysis. They can also be used to capture the information obtained through the workflow, so that the complete results can be published and submitted to, for example, MetaboLights.

We have implemented an ISA datatype in Galaxy that supports both the ISA-Tab and ISA-JSON formats, and submitted a pull request to include this datatype in the standard Galaxy platform for the Galaxy community. The work was presented at the joint 2018 Galaxy Community Conference (GCC2018) and Bioinformatics Open Source Conference 2018 (BOSC2018), Portland, Oregon, USA [6], alongside work on ISA-related Galaxy tooling developed for PhenoMeNal. In particular, the poster features the ISAcreate tool, which was developed to produce prospective ISA-Tab templates based on study design information provided by users.

[5] https://portal.phenomenal-h2020.eu/statistics
[6] https://gccbosc2018.sched.com/event/FEWs/g26-isacreate-a-galaxy-tool-for-prospective-data-management-with-isa-format-support-application-to-metabolomics-datasets

Figure 1: Selected detail of the poster presented at the 2018 Galaxy Community Conference (GCC2018) and Bioinformatics Open Source Conference 2018 (BOSC2018) on “ISAcreate: a Galaxy Tool for Prospective Data Management with ISA format support - Application to Metabolomic Datasets”.

Figure 2: Tweet about the poster presentation at the GCC/BOSC conference. Original at https://twitter.com/isatools/status/1011367709102223361

In addition to the datatype itself, we have developed several Galaxy tools for handling ISA datasets [7], including tools for format conversion, ISA-Tab validation, metadata exploration through queries over ISA-Tab, visualization, and upload to and download from MetaboLights.

[7] https://portal.phenomenal-h2020.eu/app-library

Figure 3: An example Galaxy workflow showing tools developed to work with ISA datasets in Galaxy. The workflow links the ISAcreate tool with the other ISA-related tooling to demonstrate the full process: producing the data from the study design, visualizing the study groups, validating the ISA-Tab created by the ISAcreate tool, and uploading to MetaboLights to obtain a study accession ID for prospective upload of raw data later on.

Figure 4: The Factors visualization resulting from the workflow in Figure 3 above.

3.2 MetaboLights downloader

The MetaboLights downloader is used to import data from the MetaboLights repository into a Galaxy instance. It has been extended to support:

- Data file selection based on study factor values. This feature allows partial retrieval of a study instead of downloading it in full. Some studies are tens of gigabytes in size, and a given processing step may not need all of the available data, so selecting only the relevant files reduces both the download time and the size of the downloaded study.

- ISA data type support. The new Galaxy ISA data type is used to output the downloaded study as a complex dataset in which all downloaded files are stored. Subsequent tools can then take the study as input more easily, and may in turn output another ISA dataset.
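
The factor-based file selection can be sketched as follows. This is a minimal illustration, not the downloader's actual implementation. It assumes a single tab-separated ISA-Tab table in which factor columns are named "Factor Value[...]" and raw files are listed under a "Raw Spectral Data File" column; in a real study these columns may be spread over the study and assay files. The function name and the example table are hypothetical.

```python
import csv
import io


def select_files(table_text, wanted, file_column="Raw Spectral Data File"):
    """Return file names from rows whose factor values all match `wanted`.

    `wanted` maps a factor name (e.g. "Gender") to the required value.
    """
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    selected = []
    for row in reader:
        if all(row.get(f"Factor Value[{factor}]") == value
               for factor, value in wanted.items()):
            selected.append(row[file_column])
    return selected


# Hypothetical table, for illustration only.
EXAMPLE = (
    "Sample Name\tFactor Value[Gender]\tRaw Spectral Data File\n"
    "S1\tMale\ts1.mzML\n"
    "S2\tFemale\ts2.mzML\n"
    "S3\tMale\ts3.mzML\n"
)

if __name__ == "__main__":
    print(select_files(EXAMPLE, {"Gender": "Male"}))
```

Only the files returned by such a selection would then need to be transferred, which is where the savings in download time and size come from.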

We have also optimised the download strategy. Data is now downloaded directly to the destination, which avoids one copy operation and makes it possible to transfer studies larger than the local disk space on the Galaxy or Kubernetes nodes. The downloader was also adapted to follow changes in the network setup of the general upload infrastructure at EMBL-EBI, in particular the UDP ports being used. Another important aspect, to be documented for users in both the tool container and the Galaxy tool help, is that the networking of the PhenoMeNal instance needs to allow UDP connections with source port 33001 for the fast Aspera download. This is the case for most commercial providers, including Amazon AWS and Google GCP, but some local OpenStack installations might have additional firewall rules in place. See also https://test-connect.asperasoft.com/ for more information. If these connections are not allowed, falling back to a wget download is still possible.
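
The fallback behaviour can be sketched as below. The two transfer functions are placeholders supplied by the caller; they do not correspond to real Aspera or wget invocations, and all names here are hypothetical. The sketch only shows the control flow: try the fast transfer, and use the plain one if the fast transfer fails.

```python
def download_with_fallback(url, destination, fast_transfer, plain_transfer):
    """Try the fast transfer first; fall back to the plain one on failure.

    Both callables take (url, destination) and are expected to write
    straight to `destination`, so no intermediate copy is needed.
    Returns the name of the method that succeeded.
    """
    try:
        fast_transfer(url, destination)
        return "fast"
    except OSError:
        # For example, a firewall blocking the UDP traffic the fast client needs.
        plain_transfer(url, destination)
        return "fallback"


if __name__ == "__main__":
    def blocked(url, destination):  # stands in for a blocked fast client
        raise ConnectionError("UDP traffic not allowed")

    def plain(url, destination):  # stands in for a plain HTTP/FTP download
        print(f"fetching {url} -> {destination}")

    print(download_with_fallback("ftp://example.org/study.zip",
                                 "/data/study.zip", blocked, plain))
```

Passing the transfer functions in as arguments keeps the sketch independent of any particular client and makes the fallback path easy to exercise in a test.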

Details of the latest version of the downloader can be found in its repository [8].

3.3 MetaboLights uploader

The MetaboLights uploader enables any PhenoMeNal infrastructure user to upload data to a new workspace in MetaboLights Labs [9]. MetaboLights Labs is a working area for staging files before a set of metadata and raw data files is transformed into a MetaboLights study for publication. The uploader has been extended to support the ISA data type described in the section above. EMBL-EBI modified the general upload infrastructure, which is based on FTP and IBM Aspera, so a parameter change was required to ensure continuous operation.

The MetaboLights uploader is registered in the PhenoMeNal Application Library/Service Catalogue [10] and is containerised so that it can be included in any default deployment of the PhenoMeNal infrastructure. It is available on PhenoMeNal Galaxy instances under "PhenoMeNal H2020 Tools", in the section "Transfer".

The uploader is also available as a conda package in Anaconda Cloud [11]. Details of the latest version of the uploader can be found in its repository [12].

4 Delivery and Schedule

The delivery is delayed: No

5 Conclusion

The changes described in this deliverable are the result of normal software evolution since we reported on D9.2.2. New features have been added where required, and tools have been adjusted in response to changes in the workflows and tools integrated in the PhenoMeNal infrastructure. The fact that so few changes have been required is a testament to the maturity and stability of the technical components in the infrastructure.

[8] https://github.com/phnmnl/container-mtbls-dwnld
[9] https://www.ebi.ac.uk/metabolights/labs/
[10] https://portal.phenomenal-h2020.eu/app-library/mtbl-labs-uploader
[11] https://anaconda.org/cs76/metabolightslabs-cli
[12] https://github.com/EBI-Metabolights/MetaboLightsLabs-PythonCLI