Data volumes in proteomics data resources: PRIDE and ProteomeXchange
Dr. Juan Antonio Vizcaíno
PRIDE Group Coordinator
Proteomics Services Team
EMBL-EBI
Hinxton, Cambridge, UK
Juan A. Vizcaíno
Metabolomics data compression workshop, 29 July 2015
PRIDE (PRoteomics IDEntifications) database
http://www.ebi.ac.uk/pride
• PRIDE stores mass spectrometry (MS)-based proteomics data:
• Peptide and protein expression data (identification and quantification)
• Post-translational modifications
• Mass spectra (raw data and peak lists)
• Technical and biological metadata
• Any other related information
• Full support for tandem MS approaches
Martens et al., Proteomics, 2005; Vizcaíno et al., NAR, 2013
ProteomeXchange Consortium
• Goal: Development of a framework to allow standard data submission and dissemination pipelines between the main existing proteomics repositories.
• Includes PeptideAtlas (ISB, Seattle), PRIDE (Cambridge, UK) and MassIVE (UCSD, San Diego).
• Tranche and Peptidome initially included but discontinued.
• Common identifier space (PXD identifiers)
• Two supported data workflows: MS/MS and SRM.
• Main objective: Make life easier for researchers
http://www.proteomexchange.org
Vizcaíno et al., Nat Biotechnol, 2014
ProteomeXchange data workflow: PRIDE
[Figure: workflow diagram. Labels: Researcher's results; Raw data*; Metadata; Metadata / Manuscript; Results; Reprocessed results; Receiving repositories: PRIDE (MS/MS data), MassIVE (MS/MS data), PASSEL (SRM data); ProteomeCentral; Journals; UniProt/neXtProt; PeptideAtlas; GPMDB; Other DBs]
Current PSI Standard File Formats for MS
• mzTab (Griss et al., MCP, 2014): Final results
• TraML (Deutsch et al., MCP, 2012): SRM
• mzQuantML (Walter et al., MCP, 2013): Quantitation
• mzIdentML (Jones et al., MCP, 2012): Identification
• mzML (Martens et al., MCP, 2011): MS data
PX Data workflow for MS/MS data
1. Mass spectrometer output files: raw data (binary files) or peak list spectra in a standardized format (mzML, mzXML).
2. Result files:
a. Complete submissions: Result files can be converted to PRIDE XML or the mzIdentML data standard.
b. Partial submissions: For workflows not yet supported by PRIDE, search engine output files will be stored and provided in their original form.
3. Metadata: Sufficiently detailed description of sample origin, workflow, instrumentation, submitter.
4. Other files (optional):
a. QUANT: Quantification-related results
b. PEAK: Peak list files
c. GEL: Gel images
d. OTHER: Any other file type
e. FASTA
f. SP_LIBRARY
[Figure labels: Published; Raw files; Other files]
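The distinction between complete and partial submissions above can be sketched in a few lines of Python. This is only an illustration of the rule stated on the slide, not the PX Submission Tool's actual logic: the class `SubmittedFile`, the function `submission_type`, and the category labels `RAW` and `RESULT` are hypothetical names, and treating mass spectrometer output files as mandatory is an assumption.

```python
from dataclasses import dataclass

# Result-file formats that allow a complete submission, as listed on the slide.
COMPLETE_RESULT_FORMATS = {"PRIDE XML", "mzIdentML"}

# Optional categories come from the slide; "RAW" and "RESULT" are
# hypothetical labels for items 1 and 2 of the workflow.
CATEGORIES = {"RAW", "RESULT", "QUANT", "PEAK", "GEL", "OTHER", "FASTA", "SP_LIBRARY"}


@dataclass(frozen=True)
class SubmittedFile:
    name: str
    category: str          # one of CATEGORIES
    file_format: str = ""  # only inspected for RESULT files


def submission_type(files):
    """Return "complete" or "partial" for a list of SubmittedFile objects."""
    unknown = sorted({f.category for f in files if f.category not in CATEGORIES})
    if unknown:
        raise ValueError(f"unknown file categories: {unknown}")
    # Assumption: every submission needs mass spectrometer output files.
    if not any(f.category == "RAW" for f in files):
        raise ValueError("no mass spectrometer output files")
    results = [f for f in files if f.category == "RESULT"]
    if not results:
        raise ValueError("no result files")
    # Complete only if every result file is in a supported standard format.
    if all(f.file_format in COMPLETE_RESULT_FORMATS for f in results):
        return "complete"
    return "partial"


complete = [
    SubmittedFile("run1.raw", "RAW"),
    SubmittedFile("run1.mzid", "RESULT", "mzIdentML"),
    SubmittedFile("proteins.fasta", "FASTA"),
]
partial = [
    SubmittedFile("run2.raw", "RAW"),
    SubmittedFile("run2_search_output.txt", "RESULT", "search engine output"),
]
print(submission_type(complete), submission_type(partial))
```

In this sketch a single result file in an unsupported format makes the whole submission partial, which mirrors the slide's statement that such files are stored and provided in their original form.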
Raw file sizes are increasing
• Most raw data files in PRIDE are binary files in proprietary formats. mzML is sometimes submitted as raw data as well, but not often.
• We gzip all the XML files (mzML, mzIdentML, PRIDE XML).
• We have been asked by our users to convert all binary files to mzML -> Very costly in terms of storage at EBI (also taking into account all the backups).
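As a rough illustration of the gzip step mentioned above, the following Python sketch compresses a block of XML-like text with the standard library and reports the size ratio. The sample text is synthetic and highly repetitive, so the ratio it prints says nothing about real PRIDE files; real mzML also carries base64-encoded binary arrays, which typically compress less well than markup. The function name `gzip_ratio` is a hypothetical helper.

```python
import gzip


def gzip_ratio(data: bytes, level: int = 9) -> float:
    """Return compressed size divided by original size (smaller is better)."""
    if not data:
        raise ValueError("cannot compute a ratio for empty input")
    compressed = gzip.compress(data, compresslevel=level)
    return len(compressed) / len(data)


# Synthetic XML-like text, NOT a real mzML document.
line = '<spectrum index="0"><cvParam name="ms level" value="2"/></spectrum>\n'
sample = (line * 10_000).encode("utf-8")

ratio = gzip_ratio(sample)
print(f"original: {len(sample)} bytes, compressed/original: {ratio:.4f}")

# gzip is lossless: decompressing returns the original bytes.
assert gzip.decompress(gzip.compress(sample)) == sample
```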
New proteomics approaches get even bigger files
• Data coming from other proteomics workflows can be stored in PRIDE, not only MS/MS data (~95% of datasets): very flexible mechanism to be able to capture all types of datasets.
• Data-independent acquisition (DIA) techniques are increasingly popular, e.g. SWATH-MS, MSe, HDMSe, etc. -> Big raw data sizes.
• Other data types: MS imaging, top down proteomics, etc.
PRIDE Components: Submission Process
[Figure: submission process diagram. Labels: PRIDE Converter 2; PRIDE Inspector; PX Submission Tool; mzIdentML; PRIDE XML]
PX submission tool: screenshots
[Screenshots: Step 3 and Step 4 of the PX Submission Tool]
Fast file transfer with Aspera
- Aspera is the default file transfer protocol to PRIDE:
- PX Submission tool
- Command line
- Much faster than FTP
- Also now available for downloading files
PRIDE: Submitted datasets per month
[Chart: Processed PRIDE/PX submissions per month, January 2012 to July 2015; y-axis from 0 to 200 datasets]
150-200 datasets/month
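To put the monthly figure in yearly terms, a minimal calculation follows. It assumes the rate stays constant across a full year, which the slide does not claim; the function name `annual_range` is hypothetical.

```python
MONTHS_PER_YEAR = 12

# Range quoted on the slide.
low_per_month, high_per_month = 150, 200


def annual_range(low: int, high: int, months: int = MONTHS_PER_YEAR):
    """Scale a monthly (low, high) range to a yearly one, assuming a constant rate."""
    return low * months, high * months


low_per_year, high_per_year = annual_range(low_per_month, high_per_month)
print(f"{low_per_year}-{high_per_year} datasets/year")  # prints "1800-2400 datasets/year"
```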
PRIDE: Submitted datasets per month
[Chart: Cumulative size of PRIDE PX data in TB, by month, March 2012 to May 2015; y-axis: size in TB, from 0 to 120]
PRIDE: Size comparison with other EBI resources (May 2015)
[Chart: Data accumulation by resource; series: Metabolites, PRIDE, EGA, ENA (less AE), AE; x-axis: date, 2004 to 2016; y-axis: bytes, logarithmic scale, 1E+07 to 1E+17]
Chart generated by Guy Cochrane
Data size of Mass Spectrometry data at the EBI (May 2015)
[Chart: Data accumulation by platform; series: sequence, array, MS; x-axis: date, 2004 to 2016; y-axis: bytes, logarithmic scale, 1E+07 to 1E+17]
Chart generated by Guy Cochrane
Will this trend continue?
• Journals are increasingly mandating submission of (at least) raw data to proteomics resources. As an example, MCP has made this mandatory since 1 July 2015.
• Funders are also pushing in this direction.
• Change of culture in the field -> More widely accepted as good scientific practice.
• Reanalysis/reuse of datasets to extract new knowledge is also increasing (e.g. proteogenomics, spectral libraries, PTMs).
• Raw files are getting bigger (e.g. ion mobility, DIA approaches).
Conclusions
• The amount of proteomics data in the public domain has been increasing very fast since the formalisation of the ProteomeXchange Consortium.
• PRIDE is the world-leading resource in this context, and is now receiving between 150 and 200 datasets/month.
• It is expected that this trend will continue.
Acknowledgements: People
Attila Csordas
Tobias Ternent
Noemi del Toro
Rui Wang
Johannes Griss
Yasset Perez-Riverol
Henning Hermjakob
All past team members, especially Florian Reisinger and Jose A. Dianes
All ProteomeXchange partners, especially Eric Deutsch and Nuno Bandeira
Acknowledgements: The PRIDE Team and collaborators