proteomexchange_and_pride_semmeting_2015
TRANSCRIPT
PRIDE and ProteomeXchange: Share and
explore public proteomics datasets like never
before
Dr. Juan Antonio Vizcaíno
PRIDE Group Coordinator
Proteomics Services Team
EMBL-EBI
Hinxton, Cambridge, UK
Juan A. Vizcaí[email protected]
Midwinter Proteomics Bioinformatics Seminar, Semmering, 15 January 2015
Overview
• PRIDE Archive (in the context of ProteomeXchange and the PSI standards)
• How to submit data to PRIDE: PRIDE tools
• ProteomeCentral, submission and access stats
Data sharing in Proteomics
• Public availability of data in proteomics enables:
• Reinterpretation (e.g. data reprocessing with different aims)
• Improved analysis software.
• Change in protein sequence databases (e.g. proteogenomics
studies).
• Consider new post-translational modifications.
• Validation of the experimental results reported.
• Specific use cases for proteomics: spectral libraries,
fragmentation models, SRM transitions,…
PRIDE (PRoteomics IDEntifications) database
http://www.ebi.ac.uk/pride
• PRIDE stores mass spectrometry (MS)-based proteomics data:
• Peptide and protein expression data
(identification and quantification)
• Post-translational modifications
• Mass spectra (raw data and peak lists)
• Technical and biological metadata
• Any other related information
• Full support for tandem MS approaches
Martens et al., Proteomics, 2005
Vizcaíno et al., NAR, 2013
PRIDE Archive
• New PRIDE DB archival system from 01/2014. Three iterations
released so far. Still work in progress.
• Very flexible, its development has happened in parallel with:
• Implementation of ProteomeXchange.
• New community PSI data standards: mzIdentML, mzQuantML and
mzTab.
ProteomeXchange Consortium
• Goal: Development of a framework to allow standard
data submission and dissemination pipelines
between the main existing proteomics repositories.
• Includes PeptideAtlas (ISB, Seattle), PRIDE
(Cambridge, UK) and MassIVE (UCSD, San Diego).
• Tranche and Peptidome initially included but
discontinued.
• Common identifier space (PXD identifiers)
• Two supported data workflows: MS/MS and SRM.
• Main objective: Make life easier for researchers
http://www.proteomexchange.org
Vizcaíno et al., Nat Biotechnol, 2014
Current PSI Standard File Formats for MS
• mzTab (Griss et al., MCP, 2014): final results
• TraML (Deutsch et al., MCP, 2012): SRM
• mzQuantML (Walter et al., MCP, 2013): quantitation
• mzIdentML (Jones et al., MCP, 2012): identification
• mzML (Martens et al., MCP, 2011): MS data
mzTab format: tab delimited format (ident/quant)
http://code.google.com/p/mztab/
J. Griss et al., MCP, 2014
Q.W. Xu et al., Proteomics, 2014
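The tab-delimited layout can be illustrated with a minimal sketch. It assumes the line prefixes defined in the mzTab specification (MTD for metadata, PRH/PRT for the protein header and rows, PSH/PSM for the PSM header and rows). The sample content and the function name read_mztab_sections are hypothetical and only serve the illustration; a real file has many more mandatory columns.

```python
import csv
import io
from collections import defaultdict

# Header-line prefix -> data-line prefix, following the mzTab specification.
HEADER_TO_ROW = {"PRH": "PRT", "PEH": "PEP", "PSH": "PSM", "SMH": "SML"}


def read_mztab_sections(text):
    """Split mzTab text into a metadata dict and per-section lists of row dicts."""
    metadata = {}
    columns = {}
    rows = defaultdict(list)
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        # Skip empty lines and comment lines.
        if not fields or fields[0] == "COM":
            continue
        prefix = fields[0]
        if prefix == "MTD" and len(fields) >= 3:
            metadata[fields[1]] = fields[2]
        elif prefix in HEADER_TO_ROW:
            # Remember the column names for the matching data-line prefix.
            columns[HEADER_TO_ROW[prefix]] = fields[1:]
        elif prefix in columns:
            rows[prefix].append(dict(zip(columns[prefix], fields[1:])))
    return metadata, dict(rows)


# Hypothetical, heavily shortened example (not a complete valid mzTab file).
SAMPLE = (
    "MTD\tmzTab-version\t1.0.0\n"
    "MTD\tdescription\tIllustrative example\n"
    "PRH\taccession\tdescription\n"
    "PRT\tP12345\tExample protein\n"
    "PSH\tsequence\taccession\n"
    "PSM\tPEPTIDEK\tP12345\n"
)

metadata, sections = read_mztab_sections(SAMPLE)
print(metadata["mzTab-version"], sections["PSM"][0]["sequence"])
```

Because every line carries its own prefix, identification and quantification results can be read with generic tabular tooling, which is the main appeal of the format for final results.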
Ways to access data in PRIDE Archive
• PRIDE web interface
• File repository
• REST web service
• PRIDE Inspector tool
PRIDE Archive web interface
PRIDE Archive web interface (2)
• Next: visualization of spectra (in a couple of weeks)
Programmatic access: PRIDE REST web service
http://www.ebi.ac.uk/pride/ws/archive/
• Intended to replace the most popular functionality provided by the PRIDE BioMart interface (now discontinued)
Overview
• Introduction to PRIDE Archive (in the context of ProteomeXchange and the PSI standards)
• How to submit data to PRIDE: PRIDE tools
• ProteomeCentral, submission and access stats
• A sneak peek at data reuse
ProteomeXchange data workflow: PRIDE
[Diagram labels: researcher’s results, raw data*, metadata / manuscript; receiving repositories: PRIDE (MS/MS data), MassIVE (MS/MS data), PASSEL (SRM data); ProteomeCentral; journals; reprocessed results; UniProt/neXtProt, PeptideAtlas, GPMDB, other DBs]
Manuscript published detailing the process
Ternent et al., Proteomics, 2014
http://www.proteomexchange.org/submission
Example dataset:
PXD000764
- Title: “Discovery of new CSF biomarkers for meningitis in children”
- 12 runs: 4 controls and 8 infected samples
- Identification and quantification data
PX Data workflow for MS/MS data
1. Mass spectrometer output files: raw data (binary files) or peak list spectra in a standardized format (mzML, mzXML).
2. Result files:
a. Complete submissions: Result files can be converted to PRIDE XML or the mzIdentML data standard.
b. Partial submissions: For workflows not yet supported by PRIDE, search engine output files will be stored and provided in their original form.
3. Metadata: Sufficiently detailed description of sample origin, workflow, instrumentation, submitter.
4. Other files: Optional files (the list can be extended):
a. QUANT: Quantification related results
b. PEAK: Peak list files
c. GEL: Gel images
d. OTHER: Any other file type
e. FASTA
f. SP_LIBRARY
[Diagram labels: Published, RawFiles, Other files]
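The distinction between complete and partial submissions can be sketched in a few lines. The category names (RAW, RESULT, PEAK, FASTA, OTHER) follow the slides, but the extension-to-category mapping and both function names are assumptions made for illustration; they are not the rules applied by the PX submission tool.

```python
from pathlib import PurePath

# Illustrative mapping only: the category names come from the slides, the
# extensions chosen here are assumptions and not the PX submission tool's rules.
EXTENSION_CATEGORY = {
    ".raw": "RAW",
    ".mzid": "RESULT",  # mzIdentML result file
    ".mzml": "PEAK",
    ".mzxml": "PEAK",
    ".mgf": "PEAK",
    ".fasta": "FASTA",
}


def categorize(filename):
    """Return the assumed file category for a file name, defaulting to OTHER."""
    suffix = PurePath(filename).suffix.lower()
    return EXTENSION_CATEGORY.get(suffix, "OTHER")


def submission_type(filenames):
    """A dataset counts as 'complete' here if it has at least one RESULT file."""
    categories = {categorize(name) for name in filenames}
    return "complete" if "RESULT" in categories else "partial"


# Hypothetical file lists for two datasets.
with_result = ["run01.raw", "run01.mzid", "run01.mgf"]
without_result = ["run01.raw", "search_output.dat"]

print(submission_type(with_result))
print(submission_type(without_result))
```

In this sketch a search engine output file in its original form falls into OTHER, so the dataset is treated as a partial submission, mirroring item 2b above.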
Complete vs Partial submissions: experimental metadata
[Screenshots: complete vs. partial submission]
General experimental metadata about the projects is similar for both.
However, at the assay level, the information in partial submissions is less detailed.
Complete vs Partial submissions: processed results
[Screenshots: complete vs. partial submission]
For complete submissions, the spectra can be connected to the processed identification results and visualized.
PRIDE Components: Submission Process (step 1)
[Diagram labels: PRIDE Converter 2, PRIDE Inspector, PX Submission Tool; mzIdentML, PRIDE XML]
Before: only file conversion to PRIDE XML
[Diagram: original data files (search output files, spectra files); ‘RESULT’ file generation by file conversion with PRIDE Converter; final ‘RESULT’ file in PRIDE XML]
Now: native file export
[Diagram: tools (Mascot, ProteinPilot, Scaffold, PEAKS, MSGF+, others) and spectra files; ‘RESULT’ file generation by native file export; final ‘RESULT’ file in mzIdentML]
Complete submissions
[Diagram labels: search engine results + MS files; search engines; mzIdentML]
An increasing number of tools support export to mzIdentML 1.1:
- Mascot
- MSGF+
- Myrimatch and related tools from D. Tabb’s lab
- OpenMS
- PEAKS
- PeptideShaker
- ProCon (ProteomeDiscoverer, Sequest)
- Scaffold
- TPP via the idConvert tool (ProteoWizard)
- ProteinPilot (from version 5.0)
- Crux
- Others: library for X!Tandem conversion, lab internal pipelines, …
Referenced spectral files need to be submitted as well (all open formats are supported).
Updated list: http://www.psidev.info/tools-implementing-mzIdentML#.
In the near future: native file export
[Diagram: tools (Mascot, ProteinPilot, Scaffold, PEAKS, MSGF+, others) and spectra files; ‘RESULT’ file generation by native file export; final ‘RESULT’ file in mzTab]
PRIDE Components: Submission Process (step 2)
[Diagram labels: PRIDE Converter 2, PRIDE Inspector, PX Submission Tool; mzIdentML, PRIDE XML]
PRIDE Inspector 2
Wang et al., Nat. Biotechnology, 2012
PRIDE Inspector 2 supports:
- PRIDE XML
- mzIdentML + all types of spectra files
- mzML
- mzTab Quantitation (work in progress)
https://github.com/PRIDE-Toolsuite/
PRIDE Inspector 2
https://github.com/PRIDE-Toolsuite/
New visualisation functionality for Protein Groups
PRIDE Inspector 2
Private review of files submitted to PRIDE
https://github.com/PRIDE-Toolsuite/
PRIDE Components: Submission Process (step 3)
[Diagram labels: PRIDE Converter 2, PRIDE Inspector, PX Submission Tool; mzIdentML, PRIDE XML]
PX submission tool
• It selects and captures the mappings between the different types of files included in the submission.
• It transfers all the files using Aspera (default) or FTP.
• Command line alternative: some scripting is needed
http://www.proteomexchange.org/submission
[Diagram labels: Published, Raw, Other files; PX submission tool]
PX submission tool: screenshots
Step 3
Step 4
Fast file transfer with Aspera
- Aspera is the default file transfer protocol to PRIDE:
- PX Submission tool
- Command line
- Up to 50X faster than FTP
- Also now available for downloading files
File transfer speed should not be a problem!
Partial submissions can be used to store other data workflows
• Everything can be stored, not only MS/MS data (~90% of datasets):
very flexible mechanism to be able to capture all types of datasets
• PRIDE does not store SRM data (it goes to PASSEL)
• Top down proteomics datasets (10 datasets).
• Mass Spectrometry Imaging datasets (1 dataset).
• Data independent acquisition techniques: e.g. SWATH-MS (9
datasets), HDMSE (1 dataset).
[Figure panels C and D: image from the original publication [13] next to the image reconstructed from ProteomeXchange data]
From original publication [13]:
1. Thermo RAW data / UDP
2. Mirion Software (JLU)
Reconstructed ProteomeXchange data:
1. Thermo RAW data / UDP
2. Convert to imzML
3. Upload to PRIDE (EBI, Cambridge, UK)
4. Download from PRIDE
5. Display in MSiReader
- Vendor-independent data format
- Freely available software (open source)
- ‘open data’ – free to reuse
- Anybody can do this!
Römpp et al., 2014, Anal Bioanal Chem, in press
Overview
• Introduction to PRIDE Archive (in the context of ProteomeXchange and the PSI standards)
• How to submit data to PRIDE: PRIDE tools
• ProteomeCentral, submission and access stats
ProteomeXchange data workflow: ProteomeCentral
[Diagram labels: researcher’s results, raw data*, metadata / manuscript; receiving repositories: PRIDE (MS/MS data), MassIVE (MS/MS data), PASSEL (SRM data); ProteomeCentral; journals; reprocessed results; UniProt/neXtProt, PeptideAtlas, GPMDB, other DBs]
ProteomeCentral: Portal for all PX datasets
http://proteomecentral.proteomexchange.org/cgi/GetDataset
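As a hedged illustration of how the common PXD identifier space ties into the portal, the sketch below validates an accession and builds a dataset link. Two things are assumptions not stated on the slides: that accessions consist of "PXD" followed by six digits (as in PXD000764), and that the portal takes the accession in an ID query parameter. The function name dataset_url is hypothetical. No network request is made.

```python
import re
from urllib.parse import urlencode

BASE_URL = "http://proteomecentral.proteomexchange.org/cgi/GetDataset"

# Assumption: "PXD" followed by exactly six digits, as in PXD000764.
PXD_PATTERN = re.compile(r"PXD\d{6}")


def dataset_url(accession):
    """Build a ProteomeCentral dataset link (the 'ID' parameter is assumed)."""
    if not PXD_PATTERN.fullmatch(accession):
        raise ValueError(f"not a PXD accession: {accession!r}")
    return BASE_URL + "?" + urlencode({"ID": accession})


print(dataset_url("PXD000764"))
```

Validating the accession before building the link keeps malformed identifiers out of scripts that loop over many datasets.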
RSS feed for public datasets
ProteomeXchange: 1620 datasets up until 8th January 2015
Origin:
322 USA
197 Germany
148 United Kingdom
91 Netherlands
85 France
81 China
80 Switzerland
61 Canada
48 Belgium
47 Spain
45 Denmark
42 Australia
40 Japan
37 Sweden
28 Austria
22 India
21 Norway
21 Taiwan
20 Ireland
20 Finland
17 Italy
14 Brazil
13 Republic of Korea
13 Russia
10 Israel
9 Singapore
…
Type:
526 PRIDE complete (32.5%)
982 PRIDE partial (60.6%)
63 PeptideAtlas/PASSEL complete
24 MassIVE
25 reprocessed
Publicly Accessible:
814 datasets, 50% of all
90% PRIDE
8% PASSEL
2% MassIVE
Data volume:
Total: ~71 TB
Number of all files: ~160,000
PXD000320-324: ~ 5 TB
PXD000065: ~ 1.4TB
Top Species studied by at least 10 datasets:
712 Homo sapiens
193 Mus musculus
65 Saccharomyces cerevisiae
61 Arabidopsis thaliana
35 Rattus norvegicus
34 Escherichia coli
17 Bos taurus
17 Glycine max
17 Mycobacterium tuberculosis
16 Drosophila melanogaster
14 Oryza sativa
~ 310 species in total
Datasets/year:
2012: 102
2013: 527
2014: 963
2015: 28
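The figures on this slide are internally consistent, which a short check makes explicit. The numbers are copied from the slide; the variable names and the helper share are introduced here for illustration only.

```python
TOTAL_DATASETS = 1620

# Counts copied from the slide.
by_type = {
    "PRIDE complete": 526,
    "PRIDE partial": 982,
    "PeptideAtlas/PASSEL complete": 63,
    "MassIVE": 24,
    "reprocessed": 25,
}
by_year = {2012: 102, 2013: 527, 2014: 963, 2015: 28}
publicly_accessible = 814


def share(count, total=TOTAL_DATASETS):
    """Percentage of the total, rounded to one decimal place."""
    return round(100 * count / total, 1)


# Both breakdowns add up to the stated total of 1620 datasets.
print(sum(by_type.values()), sum(by_year.values()))

# Percentages quoted on the slide.
print(share(by_type["PRIDE complete"]))
print(share(by_type["PRIDE partial"]))
print(share(publicly_accessible))
```

The type and per-year breakdowns both sum to 1620, and the quoted shares of 32.5% and 60.6% follow directly from the counts; the publicly accessible share comes out just above the rounded 50% given on the slide.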
PRIDE: Submitted datasets per month
Access statistics: PRIDE File repository
2014: The rise of proteomics data re-use
Which are the most accessed datasets?
PeptideShaker facilitates reuse of PRIDE data
Vaudel M, Barsnes H, Berven FS, Sickmann A, Martens L: Proteomics 2011;11(5):996-9.
http://searchgui.googlecode.com
Vaudel M, Burkhart J, Zahedi RP, Berven FS, Sickmann A, Martens L, Barsnes H: Nature Biotechnology 2015; 33(1):22-4.
http://peptide-shaker.googlecode.com
Draft Human proteome papers published in 2014
Wilhelm et al., Nature, 2014
• Around 60% of the data used for the analysis comes from previous experiments, most of them stored in proteomics repositories such as PRIDE/ProteomeXchange, PASSEL or MassIVE.
• They complement that data with “exotic” tissues.
Conclusions
• Data submission and data reuse in the field are rising.
• PRIDE and ProteomeXchange enable this for you.
• Data standards are key for us.
• Quantification data depends on mzTab support.
Acknowledgements: People
Attila Csordas
Tobias Ternent
Noemi del Toro
Rui Wang
Florian Reisinger
Jose A. Dianes
Johannes Griss
Steven Lewis
Yasset Perez-Riverol
Henning Hermjakob
All ProteomeXchange partners,
especially Eric Deutsch and Nuno
Bandeira
Acknowledgements: The PRIDE Team and collaborators
Acknowledgements: Funding
http://www.proteomexchange.org
http://code.google.com/p/pride-converter-2/
@pride_ebi
Acknowledgements
Questions?