ELIXIR Pilot Actions launched in 2014: Integration of BILS-ProteomeXchange using EUDAT resources



Page 1: ELIXIR Pilot Actions launched in 2014: Integration of BILS-ProteomeXchange using EUDAT resources

European Life Sciences Infrastructure for Biological Information
www.elixir-europe.org

“BILS-ProteomeXchange integration using EUDAT resources”

ELIXIR-Pilot Project

Dr. Juan A. Vizcaíno, EMBL-EBI, [email protected]
Dr. Fredrik Levander, BILS, [email protected]

Page 2: Main people involved directly in this pilot

Juan A. Vizcaíno, [email protected]

ELIXIR Webinar, 20 May 2015

• Andy Jenkinson (Systems group)
• Rui Wang (PRIDE)
• Juan A. Vizcaíno (PRIDE)

• Fredrik Levander
• Samuel Lampa
• Janos Nagy
• Mikael Borg

• Jani Heikkinen

Page 3: Overview

• Short intro to PRIDE & ProteomeXchange, BILS and EUDAT
• Objectives of the pilot
• Report on the results
• Perspectives for the future and conclusions

Page 4: Overview

• Short intro to PRIDE & ProteomeXchange, BILS and EUDAT
• Objectives of the pilot
• Report on the results
• Perspectives for the future and conclusions

Page 5: PRIDE (PRoteomics IDEntifications) database

• PRIDE stores mass spectrometry (MS)-based proteomics data:
  • Peptide and protein expression data (identification and quantification)
  • Post-translational modifications
  • Mass spectra (raw data and peak lists)
  • Technical and biological metadata
  • Any other related information
• Full support for tandem MS approaches

http://www.ebi.ac.uk/pride
Martens et al., Proteomics, 2005
Vizcaíno et al., NAR, 2013

Page 6: ProteomeXchange Consortium

• Goal: Development of a framework to allow standard data submission and dissemination pipelines between the main existing proteomics repositories.
• Includes PeptideAtlas (ISB, Seattle), PRIDE (Cambridge, UK) and MassIVE (UCSD, San Diego).
• Tranche and Peptidome were initially included but have been discontinued.
• Common identifier space (PXD identifiers).
• Two supported data workflows: MS/MS and SRM.
• Main objective: Make life easier for researchers.

http://www.proteomexchange.org
Vizcaíno et al., Nat Biotechnol, 2014

Page 7: ProteomeXchange data workflow: PRIDE

[Figure: ProteomeXchange data workflow diagram. Labels: researcher's results, raw data*, metadata; receiving repositories: PRIDE (MS/MS data), MassIVE (MS/MS data), PASSEL (SRM data), other DBs; ProteomeCentral (metadata / manuscript); journals, UniProt/neXtProt, PeptideAtlas, GPMDB, other DBs; reprocessed results.]

Vizcaíno et al., Nat Biotechnol, 2014

Page 8: PX Data workflow for MS/MS data

1. Mass spectrometer output files: raw data (binary files) or peak list spectra in a standardized format (mzML, mzXML).

2. Result files:
   a. Complete submissions: Result files can be converted to PRIDE XML or the mzIdentML data standard.
   b. Partial submissions: For workflows not yet supported by PRIDE, search engine output files will be stored and provided in their original form.

3. Metadata: Sufficiently detailed description of sample origin, workflow, instrumentation, submitter.

4. Other files (optional):
   a. QUANT: Quantification related results
   b. PEAK: Peak list files
   c. GEL: Gel images
   d. OTHER: Any other file type
   e. FASTA
   f. SP_LIBRARY

[Figure labels: Published, Raw files, Other files]
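The file categories listed above can be illustrated with a small lookup that assigns a file to one of the named types. This is only a sketch: the extension-to-type mapping and the function name `classify_file` are assumptions made for this example and do not reproduce the logic of the PX Submission Tool.

```python
from enum import Enum
from pathlib import PurePath


class PXFileType(Enum):
    """File categories named on the slide (RAW and RESULT stand for items 1 and 2)."""
    RAW = "RAW"
    RESULT = "RESULT"
    QUANT = "QUANT"
    PEAK = "PEAK"
    GEL = "GEL"
    FASTA = "FASTA"
    SP_LIBRARY = "SP_LIBRARY"
    OTHER = "OTHER"


# Hypothetical mapping from file extension to category, for illustration only.
EXTENSION_TO_TYPE = {
    ".raw": PXFileType.RAW,
    ".mzid": PXFileType.RESULT,  # mzIdentML result file
    ".mgf": PXFileType.PEAK,
    ".fasta": PXFileType.FASTA,
}


def classify_file(name: str) -> PXFileType:
    """Return the category for a file name, falling back to OTHER."""
    suffix = PurePath(name).suffix.lower()
    return EXTENSION_TO_TYPE.get(suffix, PXFileType.OTHER)


if __name__ == "__main__":
    for example in ("sample01.RAW", "search.mzid", "notes.txt"):
        print(example, classify_file(example).value)
```

Falling back to OTHER mirrors the slide's "any other file type" category, so unknown extensions never raise an error.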

Page 9: Current PSI Standard File Formats for MS

Page 10: PRIDE Components: Submission Process

[Figure labels: PRIDE Converter 2, PRIDE Inspector, PX Submission Tool; file formats mzIdentML and PRIDE XML]

Page 11: ProteomeXchange: 1,963 datasets up until 1st April, 2015

Origin:

396 USA
224 Germany
191 United Kingdom
106 Netherlands
105 China
104 France
94 Switzerland
75 Canada
55 Japan
55 Spain
54 Denmark
52 Sweden
50 Belgium
48 Australia
34 Austria
25 Norway
23 Taiwan
22 India
21 Finland
20 Ireland
20 Italy
16 Brazil
15 Russia
14 Republic of Korea
10 Israel
10 Singapore
…

Type:

613 PRIDE complete

1177 PRIDE partial

79 PeptideAtlas/PASSEL complete

69 MassIVE

25 reprocessed

Publicly Accessible:

959 datasets, 49% of all

88% PRIDE

9% PASSEL

3% MassIVE

Data volume:

Total: ~102 TB

Number of all files: ~250,000

PXD000320-324: ~5 TB

PXD000065: ~1.4 TB

Datasets/year:

2012: 102

2013: 527

2014: 963

2015: 371

Top Species studied by at least 20 datasets:

839 Homo sapiens

232 Mus musculus

79 Arabidopsis thaliana

77 Saccharomyces cerevisiae

44 Rattus norvegicus

35 Escherichia coli

21 Bos taurus

21 Glycine max

~ 460 species in total
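As a quick consistency check on the figures above, the per-type and per-year counts can be summed and compared with the stated total, and the public share recomputed. All numbers are copied from the slide; only the variable names are introduced here.

```python
# Dataset counts as reported on the slide (up until 1st April 2015).
total = 1963

by_type = {
    "PRIDE complete": 613,
    "PRIDE partial": 1177,
    "PeptideAtlas/PASSEL complete": 79,
    "MassIVE": 69,
    "reprocessed": 25,
}

by_year = {2012: 102, 2013: 527, 2014: 963, 2015: 371}

public = 959

# Both breakdowns should add up to the stated total.
assert sum(by_type.values()) == total
assert sum(by_year.values()) == total

# Share of publicly accessible datasets, rounded to a whole percent.
public_share = round(100 * public / total)
print(f"Publicly accessible: {public_share}% of all datasets")
```

Both breakdowns add up to 1,963, and 959 of 1,963 rounds to the 49% quoted on the slide.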

Page 12: BILS – Bioinformatics Infrastructure for Life Sciences

• Distributed national research infrastructure supported by the Swedish Research Council
• Coordination with other bioinformatics activities

• BILS provides:
  • Bioinformatics support (consultancy)
  • Bioinformatics infrastructure (data and tools)

Computing and storage are provided in collaboration with SNIC

• Bioinformatics network
  • Nodes at each of the 6 large university cities
  • Annual workshop
  • Training
  • Coordination with other bioinformatics activities

• Swedish node in ELIXIR

Page 13: Main BILS proteomics support aims

• Data storage:
  • Secure
  • Long-term
  • Metadata
  • Automated
  • Publishing
  • Standardised formats

• Data processing:
  • Accessible data processing workflows

Page 14: Proteios: Software environment for proteomics

A multi-user platform for analysis and management of proteomics data

[Figure labels: web browser access and analysis of own data only; BILS scripts; public access to released raw data]

Häkkinen et al. (2009) J Proteome Res

Page 15: EUDAT

• EUDAT aims to contribute to building and operating a Collaborative Data Infrastructure for European science.

• This involves a suite of co-ordinated and interoperable services for preserving scientific data, and for making them accessible to researchers.

• EUDAT collaborates with research communities across a range of disciplines, from social sciences to environmental science and including molecular biology (as represented by ELIXIR).

• These communities have diverse structures, cultures and scales but also share some common requirements regarding the management of data.

http://www.eudat.eu

Page 16: EUDAT services

http://www.eudat.eu

Page 17: B2SAFE

Page 18: EUDAT: B2SAFE and iRODS

• B2SAFE aims to provide a software ecosystem for persistently available data, including persistent identification, abstracted data storage, and reliable automated replication via auditable rules.

• It is built on top of the iRODS data management software (http://irods.org) and integrates a PID system such as the Handle API of the European Persistent Identification Consortium (EPIC, http://www.pidconsortium.eu).

Page 19: Overview

• PRIDE, ProteomeXchange, BILS and EUDAT
• Objectives of the pilot
• Report on the results
• Perspectives for the future and conclusions

Page 20: Objective

• To integrate the data repositories for MS proteomics data run by BILS (Sweden) and ProteomeXchange (via the PRIDE database, EMBL-EBI, UK), using EUDAT’s B2SAFE software.

Page 21: Plans at European level

[Figure: data flow between national proteomics centers (metadata, results, raw data), the central repository (metadata, results, raw data) and data storage centers (metadata, raw data), with two labelled steps: 1. ELIXIR replication, 2. EUDAT replication.]

Page 22: Objective

• To integrate the data repositories for MS proteomics data run by BILS (Sweden) and ProteomeXchange (via the PRIDE database, EMBL-EBI, UK), using EUDAT’s B2SAFE software.

• This project will also show the potential of collaboration among research infrastructures and e-infrastructures to better manage the data deluge. It will help to evaluate the requirements of such federated systems.

Page 23: Overview

• Short intro to PRIDE & ProteomeXchange, BILS and EUDAT
• Objectives of the pilot
• Report on the results
• Perspectives for the future and conclusions

Page 24: Timeline

• The pilot started when Jani Heikkinen (EUDAT) installed B2SAFE at EMBL-EBI (July 2014).

• The data workflow was defined in September/October 2014.

• Implementation work happened in parallel, with regular weekly calls from January 2015.

• The pilot is now finishing (May 2015).

Page 25: Envisioned data workflow (September/October 2014)

• Default B2SAFE rules trigger replication of data from BILS to EBI
• PIDs assigned per file
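The idea of the envisioned workflow, replicating each file and assigning it its own persistent identifier, can be sketched with the Python standard library alone. This is a conceptual illustration under stated assumptions: it does not call B2SAFE, iRODS or the EPIC Handle API, the directories stand in for the two sites, and `PLACEHOLDER_PREFIX` and `replicate` are hypothetical names.

```python
import hashlib
import shutil
import uuid
from pathlib import Path

# Placeholder only; a real deployment would use a registered Handle prefix.
PLACEHOLDER_PREFIX = "example-prefix"


def sha256(path: Path) -> str:
    """Return the SHA-256 checksum of a file."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def replicate(source_dir: Path, target_dir: Path) -> dict[str, str]:
    """Copy every file from source_dir to target_dir and assign one PID per file.

    Returns a mapping from the file's relative path to its (made-up) PID.
    """
    pids: dict[str, str] = {}
    files = sorted(p for p in source_dir.rglob("*") if p.is_file())
    for src in files:
        rel = src.relative_to(source_dir)
        dst = target_dir / rel
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)
        # Verify the replica before registering an identifier for it.
        if sha256(src) != sha256(dst):
            raise IOError(f"checksum mismatch for {rel}")
        pids[rel.as_posix()] = f"{PLACEHOLDER_PREFIX}/{uuid.uuid4()}"
    return pids
```

Calling `replicate(Path("bils_data"), Path("ebi_replica"))` on two local directories (hypothetical names) returns one identifier per copied file. In B2SAFE the equivalent steps are expressed as iRODS rules rather than as client-side code.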

Page 26: Implementation process (1)

• B2SAFE 3.0.0 (including iRODS 3.3.1) was initially installed at EMBL-EBI.

• However, BILS had already moved to iRODS v4.

• Incompatibility problems were found.

• It was decided to install iRODS 4.0 at the EBI to solve the incompatibility issue.

• At the time, B2SAFE was not officially supported with iRODS version 4.0.3, so changes to the original install procedure were necessary to accommodate 4.0.3.

Page 27: Implementation process (2)

• EBI and BILS obtained Handle prefixes and made them available within EPIC. The integration with iRODS was successfully tested.

• The next step was to configure B2SAFE and achieve a test replication of a file from BILS to EBI using the B2SAFE PID creation and file transfer rules.

• Unexpected delays:
  • EBI experienced some network issues that affected communications between the EBI and BILS iRODS installations.
  • Two successive bugs were discovered. Both centered on the rule execution engine and prevented B2SAFE from functioning.
  • These bugs were solved by EUDAT and iRODS developers.

Page 28: Implementation process (3)

• With workarounds now in place, it was possible to manually trigger a successful replication of a file from BILS to EBI.

• However, it became apparent that the authorisation mechanism employed by iRODS in a federation would make the proposed submission workflow difficult to manage in a production environment.

• Specifically, every BILS researcher who submits data must first have a user created for them on the EBI server. Alternative customised solutions could solve this issue by decoupling the actions of researchers from the replication itself. However, this would inevitably add complexity.

Page 29: Implementation process (4)

• At this point (March 2015) the pilot had overrun (it was expected to last 6 months), with more work required to integrate the B2SAFE replication process with the PRIDE submission pipeline.

• It was decided to halt the process and find an alternative way to achieve the same goals using existing resources.

• A detailed report has been written and has been sent to all the parties involved.

Page 30: Implemented alternative solution

• Proteios is able to generate the metadata file needed for the submission to ProteomeXchange via PRIDE.

• The PX submission tool was extended to support loading of files that are not available locally at the moment of submission (URLs are specified instead).

• As a proof of concept, dataset PXD002037 was submitted to PRIDE. It is now publicly available.
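The distinction the extended tool has to make, between files present locally and files referenced by URL, can be sketched as follows. This is an illustration under stated assumptions and not the PX submission tool's code or file format: `SubmissionFile`, `split_by_location` and the example locations are hypothetical.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# URL schemes treated as "remote" in this sketch (an assumption).
REMOTE_SCHEMES = {"http", "https", "ftp"}


@dataclass(frozen=True)
class SubmissionFile:
    """One entry of a submission: a file category and where the file lives."""
    file_type: str  # e.g. "RAW" or "RESULT"
    location: str   # local path or URL

    @property
    def is_remote(self) -> bool:
        """True when the location is a URL rather than a local path."""
        return urlparse(self.location).scheme in REMOTE_SCHEMES


def split_by_location(files):
    """Separate entries that can be read locally from those referenced by URL."""
    local = [f for f in files if not f.is_remote]
    remote = [f for f in files if f.is_remote]
    return local, remote


if __name__ == "__main__":
    entries = [
        SubmissionFile("RAW", "https://example.org/data/run1.raw"),
        SubmissionFile("RESULT", "/home/user/run1.mzid"),
    ]
    local, remote = split_by_location(entries)
    print(len(local), "local file(s),", len(remote), "referenced by URL")
```

Keeping the location as a plain string lets the same record describe both cases, so the metadata file can list remote raw data next to locally available result files.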

Page 31: PX submission tool updated to streamline BILS submissions

Page 32: Submitted dataset (now publicly available in PRIDE)

Page 33: Dataset tags in PRIDE Archive

http://www.ebi.ac.uk/pride/archive/simpleSearch?q=&projectTagFilters=Bioinformatics%20Infrastructure%20for%20Life%20Sciences%20(BILS)%20network%20(Sweden)

- Datasets can be tagged with different attributes.
- Functionality available in the submission process.
- Stable URLs can be generated.
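A stable tag URL of the form shown above can be built by percent-encoding the tag name. The sketch below reproduces the URL from the slide; the base address and the `projectTagFilters` parameter are taken from that URL as it appeared in 2015, and the function name `tag_search_url` is an assumption.

```python
from urllib.parse import quote

# Base address as shown on the slide (2015); it may have changed since.
BASE = "http://www.ebi.ac.uk/pride/archive/simpleSearch"


def tag_search_url(tag: str) -> str:
    """Build the search URL for a project tag.

    Parentheses are left unescaped so the result matches the slide's URL;
    spaces become %20.
    """
    return f"{BASE}?q=&projectTagFilters={quote(tag, safe='()')}"


if __name__ == "__main__":
    tag = "Bioinformatics Infrastructure for Life Sciences (BILS) network (Sweden)"
    print(tag_search_url(tag))
```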

Page 34: Overview

• Short intro to PRIDE & ProteomeXchange, BILS and EUDAT
• Objectives of the pilot
• Report on the results
• Perspectives for the future and conclusions

Page 35: At present and in the near future…

• EMBL-EBI is involved in the EUDAT 2020 project (PI is Steven Newhouse).

• EMBL-EBI will therefore continue to collaborate with EUDAT to gain experience in the use of this software.

• PRIDE will evaluate the situation in the future to decide whether to implement the originally envisioned submission pipeline (based on B2SAFE and iRODS).

Page 36: Conclusions

• The pilot establishes that the original use case is not the best application of B2SAFE at the present time. However, the situation will be kept under review by PRIDE.

• This conclusion is not a reflection on B2SAFE per se. Indeed, B2SAFE and iRODS have been found to be very flexible and are likely to be interesting candidates for other use cases outside of PRIDE, elsewhere in EMBL-EBI or ELIXIR.

• In particular, this applies to use cases focused on data management within or between data centres (i.e. bipartite collaborations), or to environments where mature data submission, curation and archiving solutions do not already exist.

• In addition, we recommend that ELIXIR continues to explore EUDAT services and their relevance to ELIXIR use cases.

Page 37: Conclusions: Technical recommendations

• Incorporate a fully functional RESTful interface for iRODS into B2SAFE, so that a client can use it without installing iCommands on the client machine.

• The security model should be adapted to allow anonymous read/write (RW) access to a specified URL.

• If widespread deployment of EUDAT software is expected, effort must be committed by EUDAT 2020 to make the software more easily and quickly deployable by ‘ordinary’ system administrators.

Page 38: Acknowledgements

• Henning Hermjakob
• Steven Newhouse

• Rafael Jimenez

• Bengt Persson

• EUDAT management & developers