ahupo_vizcaino_remote_presentation_082014
DESCRIPTION
ProteomeXchange: Update for the C-HPP Consortium. 10th C-HPP Workshop: “Proteome data management and identification of missing proteins”. Bangkok, Thailand. 09/08/2014. Remote presentation.
TRANSCRIPT
ProteomeXchange: Update for the C-HPP Consortium
Dr. Juan Antonio Vizcaíno PRIDE Group Coordinator
Proteomics Services Team
EMBL-EBI
Hinxton, Cambridge, UK
Juan A. Vizcaíno
[email protected] 10th C-HPP Workshop Bangkok, 9 August 2014
Overview
• The ProteomeXchange (PX) consortium
• How to submit and access data in PX via PRIDE
• How to access PX data
• Miscellaneous
• Your questions for the discussion
ProteomeXchange Consortium
• Goal: Development of a framework to allow standard data submission and dissemination pipelines between the main existing proteomics repositories.
• Includes PeptideAtlas (ISB, Seattle), PRIDE (Cambridge, UK) and (very recently) MassIVE (UCSD, San Diego).
• EU FP7 Coordination Action (01/2011 to 06/2014).
• Common identifier space (PXD identifiers)
• Two supported data workflows: MS/MS and SRM.
• Main objective: Make life easier for researchers
http://www.proteomexchange.org
Vizcaíno et al., Nat Biotechnol, 2014
[Figure: ProteomeXchange data workflow (Vizcaíno et al., Nat Biotechnol, 2014). Labels: researcher’s results, raw data*, metadata / manuscript; receiving repositories PRIDE (MS/MS data), MassIVE (MS/MS data) and PASSEL (SRM data); ProteomeCentral (metadata); journals, UniProt/neXtProt, PeptideAtlas, GPMDB, other DBs; reprocessed results.]
MassIVE (UCSD)
http://proteomics.ucsd.edu/service/massive/
• Joined ProteomeXchange in June 2014
• Similar role to PRIDE (although not yet formalised).
PRIDE (PRoteomics IDEntifications) database
http://www.ebi.ac.uk/pride
• Focused on MS/MS approaches
Martens et al., Proteomics, 2005
Vizcaíno et al., NAR, 2013
Manuscript just out detailing the process
Ternent et al., Proteomics, 2014, in press
http://www.proteomexchange.org/submission
Complete vs Partial submissions
• The challenge: PRIDE is an archival database, aiming to provide long-term access to proteomics data from all workflows
• The huge variety of proteomics workflows and file formats creates a data management nightmare
• Previously, we had to decline submissions because they were in formats we could not handle.
• Solution: Complete and Partial submissions
• Metadata, raw data and results are mandatory in both cases; the results are just not parsed for partial submissions
• Complete submission
  • All data in standard formats, accessible through PRIDE Inspector and web interface (not yet)
  • Results searchable in DB
  • Submission gets DOI
• Partial submission
  • Part of data in non-standard formats
  • Files are made available to download
  • Only metadata searchable
  • No DOI
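The complete/partial rule above can be sketched in a few lines of Python. This is only an illustration, not PRIDE's actual validation logic: the function name, the `FEATURES` table and the idea of recognising standard formats by file suffix (`.mzid` for mzIdentML, `.xml` for PRIDE XML) are assumptions made for the example.

```python
from pathlib import Path

# ASSUMPTION: file suffixes stand in for "standard format". Here ".mzid" is
# taken to mean mzIdentML and ".xml" to mean PRIDE XML; a real check would
# need to inspect the file contents.
STANDARD_RESULT_SUFFIXES = {".mzid", ".xml"}

# What each submission type offers, as listed on the slide.
FEATURES = {
    "complete": {"results_searchable": True, "doi": True},
    "partial": {"results_searchable": False, "doi": False},
}


def classify_submission(result_files):
    """Return "complete" if every result file looks standard, else "partial"."""
    if not result_files:
        # Results are mandatory for both submission types.
        raise ValueError("at least one result file is required")
    all_standard = all(
        Path(name).suffix.lower() in STANDARD_RESULT_SUFFIXES
        for name in result_files
    )
    return "complete" if all_standard else "partial"


if __name__ == "__main__":
    for files in (["run1.mzid"], ["run1.mzid", "run2.dat"]):
        kind = classify_submission(files)
        print(files, "->", kind, FEATURES[kind])
```

A single result file in a non-standard format is enough to make the whole submission partial in this sketch, which mirrors the "part of data in non-standard formats" wording.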
PX Data workflow for MS/MS data
1. Mass spectrometer output files: raw data (binary files) or peak list spectra in a standardized format (mzML, mzXML).
2. Result files:
a. Complete submissions: Result files can be converted to PRIDE XML or the mzIdentML data standard.
b. Partial submissions: For workflows not yet supported by PRIDE, search engine output files will be stored and provided in their original form.
Complete submissions: search engine results + MS files
An increasing number of tools support export to mzIdentML 1.1:
- Mascot
- MSGF+
- Myrimatch and related tools from D. Tabb’s lab
- OpenMS
- PEAKS
- ProCon (ProteomeDiscoverer, Sequest)
- Scaffold
- TPP via the idConvert tool (ProteoWizard)
- ProteinPilot (planned by the end of 2014)
- Others: library for X!Tandem conversion, lab internal pipelines, …
Referenced spectral files need to be submitted as well.
Updated list: http://www.psidev.info/tools-implementing-mzIdentML#.
PRIDE Inspector 2.0
Available for complete submissions
Wang et al., Nat Biotechnol, 2012
PRIDE Inspector 2.0 supports:
- PRIDE XML
- mzIdentML + all types of spectra files
- mzML
- mzTab Ident (work in progress)
http://code.google.com/p/pride-toolsuite/wiki/PRIDEInspector
3. Metadata: Sufficiently detailed description of sample origin, workflow, instrumentation, submitter.
4. Other files (optional):
a. QUANT: Quantification related results
b. PEAK: Peak list files
c. GEL: Gel images
d. OTHER: Any other file type
e. FASTA
f. SP_LIBRARY
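The file categories listed above, plus the links between result files and the spectra they reference, could be modelled roughly as follows. This is a hypothetical sketch and not the format used by the PX submission tool: the `SubmissionFile` class, the `validate` function and the `RAW` and `RESULT` type names are invented for illustration, while the optional type names are taken from the slide.

```python
from dataclasses import dataclass, field

# Optional file types as named on the slide, plus two ASSUMED labels
# ("RAW", "RESULT") for the mandatory raw and result files.
FILE_TYPES = {"RAW", "RESULT", "QUANT", "PEAK", "GEL", "FASTA", "SP_LIBRARY", "OTHER"}


@dataclass
class SubmissionFile:
    file_id: int
    file_type: str
    path: str
    # IDs of related files, e.g. the raw files behind a result file.
    mapped_to: list = field(default_factory=list)


def validate(files):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    known_ids = {f.file_id for f in files}
    for f in files:
        if f.file_type not in FILE_TYPES:
            problems.append(f"{f.path}: unknown file type {f.file_type!r}")
        for target in f.mapped_to:
            if target not in known_ids:
                problems.append(f"{f.path}: mapping to missing file id {target}")
    # Raw data and results are mandatory for every submission.
    for required in ("RAW", "RESULT"):
        if not any(f.file_type == required for f in files):
            problems.append(f"no {required} file in submission")
    return problems


if __name__ == "__main__":
    files = [
        SubmissionFile(1, "RESULT", "run1.mzid", mapped_to=[2]),
        SubmissionFile(2, "RAW", "run1.raw"),
        SubmissionFile(3, "FASTA", "human.fasta"),
    ]
    print(validate(files) or "submission looks consistent")
```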
PX submission tool
http://www.proteomexchange.org/submission
• Captures the mappings between the different types of files.
• Makes the file upload process straightforward for the submitter (it transfers all the files using Aspera or FTP).
• Command line alternative: some scripting is needed
Fast file transfer with Aspera
- Aspera is the default file transfer protocol to PRIDE:
  - PX Submission tool
  - Command line
- Up to 50X faster than FTP
Part C: Difficulties in Connections:
Q1. How to make the connections between local server and central DBs much faster and accessible (e.g. local server and ProteomeXchange)?
File transfer speed should not be a problem!!
PX submission tool: HPP tags
Partial submissions
• Everything can be stored, not only MS/MS data: a very flexible mechanism to be able to capture all types of datasets
  • Top down proteomics datasets
  • Mass Spectrometry Imaging datasets
  • Data independent acquisition techniques: e.g. SWATH-MS datasets, among other DIA approaches
ProteomeXchange: 1,148 datasets up until August 2014
Type: 386 PRIDE complete, 687 PRIDE partial, 51 PeptideAtlas/PASSEL complete, 1 MassIVE, 23 reprocessed
Publicly accessible: 544 datasets, 50% of all (90% PRIDE, 10% PASSEL)
Data volume: ~51 TB in total; ~130,000 files; PXD000320-324: ~5 TB; PXD000065: ~1.4 TB
Datasets/year: 2012: 102; 2013: 527; 2014: 519
Origin: 235 USA, 142 Germany, 97 United Kingdom, 67 Switzerland, 64 Netherlands, 62 China, 60 France, 48 Canada, 43 Spain, 36 Belgium, 32 Sweden, 29 Australia, 26 Denmark, 23 Japan, 18 Taiwan, 17 India, 16 Ireland, 14 Norway, 14 Italy, 12 Finland, 11 Republic of Korea, 10 Brazil, 8 Austria, 7 Israel, 7 Singapore, …
Top species (studied by at least 10 datasets): 510 Homo sapiens, 142 Mus musculus, 46 Saccharomyces cerevisiae, 45 Arabidopsis thaliana, 23 Rattus norvegicus, 16 Escherichia coli, 15 Bos taurus, 15 Mycobacterium tuberculosis, 13 Oryza sativa, 12 Drosophila melanogaster, 12 Glycine max; ~265 species in total
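As a quick consistency check, the per-type and per-year counts on this slide can be summed and compared with the stated total. The snippet below only re-uses numbers from the slide; the variable names are arbitrary. It also shows that 544 public datasets out of 1,148 is roughly 47%, so the 50% on the slide appears to be a rounded figure.

```python
# Counts copied from the slide.
by_type = {
    "PRIDE complete": 386,
    "PRIDE partial": 687,
    "PeptideAtlas/PASSEL complete": 51,
    "MassIVE": 1,
    "reprocessed": 23,
}
by_year = {2012: 102, 2013: 527, 2014: 519}
public = 544

total = sum(by_type.values())
public_share = public / total
partial_share = by_type["PRIDE partial"] / total

if __name__ == "__main__":
    print("total by type:", total)
    print("total by year:", sum(by_year.values()))
    print(f"publicly accessible: {public_share:.1%}")
    print(f"PRIDE partial submissions: {partial_share:.1%}")
```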
ProteomeCentral: Portal for all PX datasets
http://proteomecentral.proteomexchange.org/cgi/GetDataset
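A dataset page can be addressed by appending a PXD identifier to the URL above. The sketch below assumes that the accession is passed in an `ID` query parameter and that identifiers consist of `PXD` followed by six digits (as in PXD000065); both are assumptions for illustration and should be checked against the ProteomeCentral documentation. The function name is hypothetical.

```python
import re
from urllib.parse import urlencode

BASE_URL = "http://proteomecentral.proteomexchange.org/cgi/GetDataset"

# ASSUMPTION: identifiers look like "PXD" plus six digits, e.g. PXD000065.
PXD_PATTERN = re.compile(r"PXD\d{6}")


def dataset_url(accession):
    """Build a ProteomeCentral URL for one dataset (ASSUMES an `ID` parameter)."""
    if not PXD_PATTERN.fullmatch(accession):
        raise ValueError(f"not a PXD identifier: {accession!r}")
    return f"{BASE_URL}?{urlencode({'ID': accession})}"


if __name__ == "__main__":
    print(dataset_url("PXD000065"))
```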
Get notified about new PX datasets
- Subscribe to the RSS feed to receive information about new datasets:
http://groups.google.com/group/proteomexchange/feed/rss_v2_0_msgs.xml
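For readers who prefer to process the feed programmatically, an RSS 2.0 document can be scanned for PXD identifiers with the Python standard library. The feed content below is a made-up sample, not real announcement text, and the function name and the identifier pattern (`PXD` plus six digits) are assumptions; fetching the live feed is deliberately left out.

```python
import re
import xml.etree.ElementTree as ET

# Made-up RSS 2.0 sample for illustration; real feed entries may look different.
SAMPLE_FEED = """<rss version="2.0"><channel>
<title>Sample announcements</title>
<item><title>New dataset PXD000065 announced</title><link>http://example.org/1</link></item>
<item><title>Unrelated message</title><link>http://example.org/2</link></item>
</channel></rss>"""


def announced_accessions(feed_xml):
    """Collect PXD identifiers mentioned in the item titles of an RSS feed."""
    root = ET.fromstring(feed_xml)
    found = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        found.extend(re.findall(r"PXD\d{6}", title))
    return found


if __name__ == "__main__":
    print(announced_accessions(SAMPLE_FEED))
```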
Reuse of datasets in PeptideAtlas can be tracked
HPP datasets are now tagged
The projects are now tagged and can be browsed as a group of datasets.
Tags for: HPP, C-HPP and B/D-HPP
HPP PX datasets: some numbers
Since January 2014, we have been capturing the PI information
- 25 HPP datasets: 22 C-HPP and 3 B/D-HPP
- Countries represented in C-HPP:
  - 5 Spain
  - 4 South Korea
  - 3 Brazil, China
Only a small proportion of the datasets have been made publicly available, at least through ProteomeXchange
For the near future in PRIDE…
• Complete data workflow for data visualization in PRIDE Archive web
• Improvement of existing PRIDE REST API
• Incorporation of reprocessed data in PRIDE (in collaboration with Prof. L. Martens (VIB/Ghent) and Dr. A. Jones (U. Liverpool))
• Integration of the data in the EBI Molecular Atlas
Conclusions
• ProteomeXchange is widely used. It now has a new consortium member: MassIVE (UCSD).
• PRIDE is getting a lot of data via ProteomeXchange. Pipeline in production since summer 2012. More than 1,100 datasets have already been submitted.
• Half of them are already public.
• Different open source tools are available to facilitate the process
• File transfer speed should not be a problem (Aspera support)
Acknowledgements
PRIDE Team: Attila Csordas, Rui Wang, Florian Reisinger, Jose A. Dianes, Tobias Ternent, Yasset Perez-Riverol, Noemi del Toro, Henning Hermjakob
EU FP7 grant number 260558
PeptideAtlas Team (ISB, Seattle): Eric Deutsch, Terry Farrah, Zhi Sun
Andrew R. Jones, Lennart Martens, Juan Pablo Albar, Martin Eisenacher, Gil Omenn
And many other PX partners and stakeholders
Questions for discussion Part A: Proteomic Dataset with PXD:
Q1. How to make data deposited through ProteomeXchange, or published data, available to the consortium members as well as public DB managers (GPMDB, neXtProt, PeptideAtlas and ProteinAtlas)?
• There is plenty of documentation available:
  • http://www.proteomexchange.org/submission
  • PRIDE documentation
  • Original paper (PMID: 24727771) and submission tutorial paper (PMID: 25047258).
• Resources need to be devoted to this…
Q2. What can we do for such inaccessible datasets in the public DB?
• Contact the author directly and convince him/her to make a submission to ProteomeXchange.
Q3. Do we need to ask people to place a link with PXD identifier on the Wiki in order to see which chromosome team placed which datasets online (for sharing)?
• No: in the PX submission tool it is possible to specify the tags for HPP and/or C-HPP or B/D-HPP.
• Visit ‘ProteomeCentral’ and look for them.
• Visit PRIDE and look for them there (specific tags)
Questions for discussion Part B: Proteogenomic Dataset:
Q1. How to make it easy to access or deposit proteogenomic datasets, including RNAseq and other types of genetic data?
• Not easy to do at present.
• NCBI and EBI have created “BioSamples databases” to be able to link different studies performed using the same sample.
• Proteomics data could be submitted to PRIDE/ProteomeXchange and RNAseq data to e.g. ArrayExpress (EBI), linked by the same sample number.
• Sample IDs coming from the BioSamples DB are to be integrated in the PX submission tool and in PRIDE (expected before the end of the year).
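The linking idea described above, where studies in different archives share one sample identifier, can be illustrated with a small grouping routine. Everything in this sketch is hypothetical: the record layout, the accession strings and the sample IDs are placeholders, and no real BioSamples, PRIDE or ArrayExpress API is called.

```python
from collections import defaultdict

# Placeholder records; accessions and sample IDs are invented for illustration.
RECORDS = [
    {"archive": "PRIDE", "accession": "PXD-EXAMPLE-1", "sample_id": "SAMPLE-A"},
    {"archive": "ArrayExpress", "accession": "AE-EXAMPLE-1", "sample_id": "SAMPLE-A"},
    {"archive": "PRIDE", "accession": "PXD-EXAMPLE-2", "sample_id": "SAMPLE-B"},
]


def group_by_sample(records):
    """Map each sample ID to the list of dataset records that reference it."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["sample_id"]].append(record)
    return dict(grouped)


def multi_archive_samples(records):
    """Return sample IDs that have datasets in more than one archive."""
    grouped = group_by_sample(records)
    return sorted(
        sample_id
        for sample_id, linked in grouped.items()
        if len({r["archive"] for r in linked}) > 1
    )


if __name__ == "__main__":
    for sample_id in multi_archive_samples(RECORDS):
        accessions = [r["accession"] for r in group_by_sample(RECORDS)[sample_id]]
        print(sample_id, "links", accessions)
```

In this toy data only SAMPLE-A has both a proteomics and an RNAseq dataset, so it is the only sample reported as a proteogenomic link.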
Biosamples DB
http://www.ebi.ac.uk/biosamples/