

Data volumes in proteomics data resources: PRIDE and ProteomeXchange

Dr. Juan Antonio Vizcaíno

PRIDE Group Coordinator

Proteomics Services Team

EMBL-EBI

Hinxton, Cambridge, UK

Juan A. Vizcaíno [email protected]

Metabolomics data compression workshop, 29 July 2015

PRIDE (PRoteomics IDEntifications) database

http://www.ebi.ac.uk/pride

• PRIDE stores mass spectrometry (MS)-based proteomics data:

• Peptide and protein expression data (identification and quantification)

• Post-translational modifications

• Mass spectra (raw data and peak lists)

• Technical and biological metadata

• Any other related information

• Full support for tandem MS approaches

Martens et al., Proteomics, 2005

Vizcaíno et al., NAR, 2013


ProteomeXchange Consortium

• Goal: Development of a framework to allow standard data submission and dissemination pipelines between the main existing proteomics repositories.

• Includes PeptideAtlas (ISB, Seattle), PRIDE (Cambridge, UK) and MassIVE (UCSD, San Diego).

• Tranche and Peptidome were initially included but have been discontinued.

• Common identifier space (PXD identifiers).

• Two supported data workflows: MS/MS and SRM.

• Main objective: Make life easier for researchers.

http://www.proteomexchange.org

Vizcaíno et al., Nat Biotechnol, 2014
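As an illustration of the common identifier space, the following Python sketch checks and extracts PXD identifiers from free text. The slides only state that PXD identifiers exist; the pattern used here ("PXD" followed by six digits, e.g. PXD000001) is an assumption based on commonly seen accessions, and the function names are hypothetical.

```python
import re
from typing import List

# Assumption: a PXD identifier is "PXD" followed by exactly six digits.
PXD_PATTERN = re.compile(r"\bPXD\d{6}\b")


def is_pxd_identifier(text: str) -> bool:
    """Return True if the whole string looks like a PXD identifier."""
    return PXD_PATTERN.fullmatch(text.strip()) is not None


def find_pxd_identifiers(text: str) -> List[str]:
    """Return the distinct PXD-like identifiers in text, in order of appearance."""
    found: List[str] = []
    for match in PXD_PATTERN.findall(text):
        if match not in found:
            found.append(match)
    return found


if __name__ == "__main__":
    sentence = "Data are available via ProteomeXchange with identifier PXD000001."
    print(find_pxd_identifiers(sentence))
```

A helper like this could be used, for example, to pick dataset accessions out of a manuscript's data availability statement.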


ProteomeXchange data workflow: PRIDE

[Diagram labels: Researcher's results; Raw data*; Metadata / Manuscript; Results; Receiving repositories: PRIDE (MS/MS data), MassIVE (MS/MS data), PASSEL (SRM data), Other DBs; ProteomeCentral; Journals; UniProt/neXtProt; PeptideAtlas; GPMDB; Other DBs; Reprocessed results.]


Current PSI Standard File Formats for MS

• mzTab (Griss et al., MCP, 2014): Final results

• TraML (Deutsch et al., MCP, 2012): SRM

• mzQuantML (Walter et al., MCP, 2013): Quantitation

• mzIdentML (Jones et al., MCP, 2012): Identification

• mzML (Martens et al., MCP, 2011): MS data
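The mapping from format to purpose listed above can be written as a small lookup table. This is only a sketch: the file extensions (`.mzML`, `.mzid`, `.mzq`, `.traML`, `.mzTab`) are assumptions about common naming, not something stated on the slide, and `describe_psi_file` is a hypothetical helper.

```python
from pathlib import Path
from typing import Optional, Tuple

# Keys are lower-cased file extensions (assumed, not taken from the slide);
# values are (format name, purpose) as listed on the slide.
PSI_FORMATS = {
    ".mzml": ("mzML", "MS data"),
    ".mzid": ("mzIdentML", "Identification"),
    ".mzq": ("mzQuantML", "Quantitation"),
    ".traml": ("TraML", "SRM"),
    ".mztab": ("mzTab", "Final results"),
}


def describe_psi_file(filename: str) -> Optional[Tuple[str, str]]:
    """Return (format, purpose) for a file name, or None if it is not recognised."""
    name = filename.lower()
    # Look through a trailing .gz, since XML files are often stored gzipped.
    if name.endswith(".gz"):
        name = name[:-3]
    return PSI_FORMATS.get(Path(name).suffix)


if __name__ == "__main__":
    for example in ("run01.mzML.gz", "search.mzid", "notes.txt"):
        print(example, "->", describe_psi_file(example))
```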


PX Data workflow for MS/MS data

1. Mass spectrometer output files: raw data (binary files) or peak list spectra in a standardized format (mzML, mzXML).

2. Result files:

a. Complete submissions: Result files can be converted to PRIDE XML or the mzIdentML data standard.

b. Partial submissions: For workflows not yet supported by PRIDE, search engine output files will be stored and provided in their original form.

3. Metadata: Sufficiently detailed description of sample origin, workflow, instrumentation, submitter.

4. Other files (optional):

a. QUANT: Quantification related results

b. PEAK: Peak list files

c. GEL: Gel images

d. OTHER: Any other file type

e. FASTA

f. SP_LIBRARY
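The distinction between complete and partial submissions can be sketched in Python as follows. This is an illustration under stated assumptions, not the logic of the PX Submission Tool: the labels `RAW`, `RESULT` and `SEARCH` are hypothetical names for items 1, 2a and 2b, the optional labels are those listed in item 4, and the metadata check (item 3) is left out.

```python
from dataclasses import dataclass
from typing import Iterable

# Optional file types named in item 4.
OPTIONAL_TYPES = {"QUANT", "PEAK", "GEL", "OTHER", "FASTA", "SP_LIBRARY"}
# Hypothetical labels introduced for this sketch only.
RAW, RESULT, SEARCH = "RAW", "RESULT", "SEARCH"
KNOWN_TYPES = OPTIONAL_TYPES | {RAW, RESULT, SEARCH}


@dataclass(frozen=True)
class SubmissionFile:
    name: str
    file_type: str


def submission_kind(files: Iterable[SubmissionFile]) -> str:
    """Return "complete" or "partial" for a set of files, or raise ValueError."""
    types = {f.file_type for f in files}
    unknown = types - KNOWN_TYPES
    if unknown:
        raise ValueError(f"unknown file type(s): {sorted(unknown)}")
    # Item 1: raw data or peak lists must be present (treating PEAK files
    # as sufficient is an assumption of this sketch).
    if not types & {RAW, "PEAK"}:
        raise ValueError("mass spectrometer output files are required")
    if RESULT in types:
        return "complete"  # item 2a: PRIDE XML or mzIdentML result files
    if SEARCH in types:
        return "partial"   # item 2b: search engine output in original form
    raise ValueError("no result or search engine output files found")


if __name__ == "__main__":
    files = [SubmissionFile("run01.raw", RAW), SubmissionFile("run01.mzid", RESULT)]
    print(submission_kind(files))
```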



Raw file sizes are increasing

• Most raw data files in PRIDE are binary files in proprietary formats. mzML is sometimes submitted as raw data as well, but not often.

• We gzip all the XML files (mzML, mzIdentML, PRIDE XML).

• Our users have asked us to convert all binary files to mzML -> Very costly in terms of storage at EBI (also taking into account all the backups).
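To make the gzip step concrete, here is a minimal Python sketch that compresses a file with the standard `gzip` module and reports the size ratio. It is not the PRIDE pipeline; the helper names and the synthetic XML-like test file are assumptions for illustration. Real mzML files may compress quite differently, since their binary arrays are base64-encoded and can already be zlib-compressed.

```python
import gzip
import os
import shutil
import tempfile


def gzip_file(src_path: str) -> str:
    """Write a gzip-compressed copy next to src_path and return its path."""
    dst_path = src_path + ".gz"
    with open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # stream in chunks, no full read into memory
    return dst_path


def compression_ratio(src_path: str, gz_path: str) -> float:
    """Return original size divided by compressed size."""
    return os.path.getsize(src_path) / os.path.getsize(gz_path)


def write_demo_xml(path: str, n_spectra: int) -> None:
    """Write a synthetic, repetitive XML-like file (not valid mzML)."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("<run>\n")
        for i in range(n_spectra):
            handle.write(f'  <spectrum index="{i}" defaultArrayLength="100"/>\n')
        handle.write("</run>\n")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        source = os.path.join(tmp, "demo.xml")
        write_demo_xml(source, 1000)
        compressed = gzip_file(source)
        print(f"ratio: {compression_ratio(source, compressed):.1f}x")
```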


New proteomics approaches get even bigger files

• Data from other proteomics workflows can also be stored in PRIDE, not only MS/MS data (~95% of datasets): the submission mechanism is flexible enough to capture all types of datasets.

• Data independent acquisition (DIA) techniques are increasingly popular, e.g. SWATH-MS, MSe, HDMSe, etc. -> Big raw data sizes.

• Other data types: MS imaging, top-down proteomics, etc.


PRIDE Components: Submission Process

[Diagram labels: PRIDE Converter 2; PRIDE Inspector; PX Submission Tool; mzIdentML; PRIDE XML.]


PX submission tool: screenshots

[Screenshots: Step 3 and Step 4 of the PX Submission Tool.]


Fast file transfer with Aspera

- Aspera is the default file transfer protocol to PRIDE:

  - PX Submission tool

  - Command line

- Much faster than FTP

- Also now available for downloading files


PRIDE: Submitted datasets per month

[Chart: Processed PRIDE/PX submissions per month; x-axis: month, 2012-01 to 2015-07; y-axis: 0 to 200.]

150-200 datasets/month


PRIDE: Submitted datasets per month

[Chart: Cumulative size of PRIDE PX data in TB; x-axis: month, 03-2012 to 05-2015; y-axis: size in TB, 0 to 120.]


PRIDE: Size comparison with other EBI resources (May 2015)

[Chart: Data accumulation by resource; series: Metabolites, PRIDE, EGA, ENA (less AE), AE; x-axis: date, 2004 to 2016; y-axis: bytes, 1E+07 to 1E+17.]

Chart generated by Guy Cochrane


Data size of Mass Spectrometry data at the EBI (May 2015)

[Chart: Data accumulation by platform; series: sequence, array, MS; x-axis: date, 2004 to 2016; y-axis: bytes, 1E+07 to 1E+17.]

Chart generated by Guy Cochrane


Will this trend continue?

• Journals are increasingly mandating submission of (at least) raw data to proteomics resources. As an example, MCP has made this mandatory since 1 July 2015.

• Funders are also pushing in this direction.

• Change of culture in the field -> Data sharing is more widely accepted as good scientific practice.

• Reanalysis and reuse of datasets to extract new knowledge are also increasing (e.g. proteogenomics, spectral libraries, PTMs).

• Raw files are becoming increasingly bigger (e.g. ion mobility, DIA approaches).


Conclusions

• The amount of proteomics data in the public domain has been increasing very fast since the formalisation of the ProteomeXchange Consortium.

• PRIDE is the world-leading resource in this context, and is now receiving between 150 and 200 datasets/month.

• This trend is expected to continue.


Acknowledgements: The PRIDE Team and collaborators

People: Attila Csordas, Tobias Ternent, Noemi del Toro, Rui Wang, Johannes Griss, Yasset Perez-Riverol, Henning Hermjakob

All past team members, especially Florian Reisinger and Jose A. Dianes

All ProteomeXchange partners, especially Eric Deutsch and Nuno Bandeira


Questions?