ahupo_vizcaino_remote_presentation_082014

ProteomeXchange: Update for the C-HPP Consortium. Dr. Juan Antonio Vizcaíno, PRIDE Group Coordinator, Proteomics Services Team, EMBL-EBI, Hinxton, Cambridge, UK

Upload: juan-antonio-vizcaino

Post on 30-Jun-2015

DESCRIPTION

ProteomeXchange: Update for the C-HPP Consortium. 10th C-HPP Workshop: “Proteome data management and identification of missing proteins”. Bangkok, Thailand. 09/08/2014. Remote presentation.

TRANSCRIPT

Page 1: AHUPO_Vizcaino_remote_presentation_082014

ProteomeXchange: Update for the C-HPP Consortium

Dr. Juan Antonio Vizcaíno PRIDE Group Coordinator

Proteomics Services Team

EMBL-EBI

Hinxton, Cambridge, UK

Page 2: AHUPO_Vizcaino_remote_presentation_082014

Juan A. Vizcaíno

[email protected] 10th C-HPP Workshop Bangkok, 9 August 2014

Overview

•  The ProteomeXchange (PX) consortium

•  How to submit and access data in PX via PRIDE

•  How to access PX data

•  Miscellaneous

•  Your questions for the discussion

Page 4: AHUPO_Vizcaino_remote_presentation_082014

ProteomeXchange Consortium

•  Goal: develop a framework for standard data submission and dissemination pipelines between the main existing proteomics repositories.

•  Includes PeptideAtlas (ISB, Seattle), PRIDE (Cambridge, UK) and (very recently) MassIVE (UCSD, San Diego).

•  EU FP7 Coordination Action (01/2011 to 06/2014).

•  Common identifier space (PXD identifiers).

•  Two supported data workflows: MS/MS and SRM.

•  Main objective: make life easier for researchers.

http://www.proteomexchange.org

Vizcaíno et al., Nat Biotechnol, 2014

Page 5: AHUPO_Vizcaino_remote_presentation_082014

ProteomeXchange data workflow

[Figure: workflow diagram. Labels: researcher’s results (metadata / manuscript, raw data*, results); receiving repositories: PRIDE (MS/MS data), MassIVE (MS/MS data), PASSEL (SRM data); ProteomeCentral (metadata); journals, UniProt/neXtProt, PeptideAtlas, GPMDB, other DBs (reprocessed results, raw data*).]

Vizcaíno et al., Nat Biotechnol, 2014

Page 6: AHUPO_Vizcaino_remote_presentation_082014

MassIVE (UCSD)

http://proteomics.ucsd.edu/service/massive/

•  Joined ProteomeXchange in June 2014.

•  Similar role to PRIDE (although not yet formalised).

Page 9: AHUPO_Vizcaino_remote_presentation_082014

PRIDE (PRoteomics IDEntifications) database

http://www.ebi.ac.uk/pride

•  Focused on MS/MS approaches

Martens et al., Proteomics, 2005; Vizcaíno et al., NAR, 2013

Page 10: AHUPO_Vizcaino_remote_presentation_082014

Manuscript just out detailing the process

Ternent et al., Proteomics, 2014, in press

http://www.proteomexchange.org/submission

Page 11: AHUPO_Vizcaino_remote_presentation_082014

Complete vs Partial submissions

•  The challenge: PRIDE is an archival database that aims to provide long-term access to proteomics data from all workflows.

•  The huge variety of proteomics workflows and file formats creates a data management nightmare.

•  Previously, we had to decline submissions because they were in formats we could not handle.

•  Solution: Complete and Partial submissions.

•  Metadata, raw data and results are mandatory in both cases; the results are just not parsed for partial submissions.

•  Complete submission
    •  All data in standard formats, accessible through PRIDE Inspector and the web interface (web interface not yet available)
    •  Results searchable in DB
    •  Submission gets DOI

•  Partial submission
    •  Part of the data in non-standard formats
    •  Files are made available to download
    •  Only metadata searchable
    •  No DOI
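The complete/partial distinction above can be sketched in a few lines of Python. This is only an illustration of the rule stated on the slide, not PRIDE's actual implementation: the function names, the file extensions used to recognise "standard" result formats, and the feature dictionary are all assumptions made for the example.

```python
from pathlib import Path

# Assumption: mzIdentML results end in ".mzid" and PRIDE XML results in ".xml".
# Real submissions may use other extensions or compressed files.
STANDARD_RESULT_SUFFIXES = {".mzid", ".xml"}


def classify_submission(result_files):
    """Return 'complete' if every result file is in a standard format, else 'partial'."""
    if not result_files:
        # Results are mandatory for both submission types.
        raise ValueError("at least one result file is required")
    all_standard = all(
        Path(name).suffix.lower() in STANDARD_RESULT_SUFFIXES for name in result_files
    )
    return "complete" if all_standard else "partial"


def submission_features(kind):
    """Summarise what each submission type offers, as listed on the slide."""
    if kind == "complete":
        return {"results_searchable": True, "metadata_searchable": True, "doi": True}
    if kind == "partial":
        return {"results_searchable": False, "metadata_searchable": True, "doi": False}
    raise ValueError(f"unknown submission type: {kind}")


if __name__ == "__main__":
    kind = classify_submission(["sample1.mzid", "sample2.dat"])
    print(kind, submission_features(kind))
```

A single non-standard result file is enough to make the whole submission partial in this sketch, which mirrors the "part of the data in non-standard formats" wording.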

Page 13: AHUPO_Vizcaino_remote_presentation_082014

PX Data workflow for MS/MS data

1.  Mass spectrometer output files: raw data (binary files) or peak list spectra in a standardized format (mzML, mzXML).

2.  Result files:

a.  Complete submissions: result files can be converted to PRIDE XML or the mzIdentML data standard.

b.  Partial submissions: for workflows not yet supported by PRIDE, search engine output files will be stored and provided in their original form.

[Diagram labels: Published, Raw Files]

Page 14: AHUPO_Vizcaino_remote_presentation_082014

Complete submissions

[Diagram labels: Search Engine Results + MS files; Search engines; mzIdentML]

An increasing number of tools support export to mzIdentML 1.1:

-  Mascot
-  MS-GF+
-  MyriMatch and related tools from D. Tabb’s lab
-  OpenMS
-  PEAKS
-  ProCon (Proteome Discoverer, Sequest)
-  Scaffold
-  TPP via the idconvert tool (ProteoWizard)
-  ProteinPilot (planned by the end of 2014)
-  Others: library for X!Tandem conversion, lab internal pipelines, …

Referenced spectral files need to be submitted as well.

Updated list: http://www.psidev.info/tools-implementing-mzIdentML#.

Page 15: AHUPO_Vizcaino_remote_presentation_082014

PRIDE Inspector 2.0

Available for complete submissions.

PRIDE Inspector 2.0 supports:

-  PRIDE XML
-  mzIdentML + all types of spectra files
-  mzML
-  mzTab Ident (work in progress)

http://code.google.com/p/pride-toolsuite/wiki/PRIDEInspector

Wang et al., Nat. Biotechnology, 2012

Page 17: AHUPO_Vizcaino_remote_presentation_082014

PX Data workflow for MS/MS data

1.  Mass spectrometer output files: raw data (binary files) or peak list spectra in a standardized format (mzML, mzXML).

2.  Result files:

a.  Complete submissions: result files can be converted to PRIDE XML or the mzIdentML data standard.

b.  Partial submissions: for workflows not yet supported by PRIDE, search engine output files will be stored and provided in their original form.

3.  Metadata: sufficiently detailed description of sample origin, workflow, instrumentation, submitter.

4.  Other files (optional):

a.  QUANT: quantification-related results

b.  PEAK: peak list files

c.  GEL: gel images

d.  OTHER: any other file type

e.  FASTA

f.  SP_LIBRARY

[Diagram labels: Published, Raw Files, Other files]

Page 18: AHUPO_Vizcaino_remote_presentation_082014

PX submission tool

http://www.proteomexchange.org/submission

[Diagram labels: Published, Raw, Other files, PX submission tool]

•  Captures the mappings between the different types of files.

•  Makes the file upload process straightforward for the submitter (it transfers all the files using Aspera or FTP).

•  Command line alternative: some scripting is needed.
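To make the idea of "mappings between the different types of files" concrete, here is a minimal Python sketch of a file manifest in which each result file lists the raw or peak files it refers to. It does not reproduce the file format used by the PX submission tool; the `SubmissionFile` class, the `RAW` and `RESULT` labels and the `validate` function are hypothetical names chosen for this example (the other type labels are the ones listed on the slides).

```python
from dataclasses import dataclass, field

# QUANT, PEAK, GEL, FASTA, SP_LIBRARY and OTHER come from the slides;
# RAW and RESULT are assumed labels for the two mandatory file groups.
FILE_TYPES = {"RAW", "RESULT", "QUANT", "PEAK", "GEL", "FASTA", "SP_LIBRARY", "OTHER"}


@dataclass
class SubmissionFile:
    file_id: int
    file_type: str
    path: str
    mapped_ids: list = field(default_factory=list)  # ids of files this one refers to


def validate(files):
    """Return a list of problems; an empty list means the mapping looks consistent."""
    problems = []
    by_id = {f.file_id: f for f in files}
    for f in files:
        if f.file_type not in FILE_TYPES:
            problems.append(f"{f.path}: unknown file type {f.file_type}")
        for mapped in f.mapped_ids:
            if mapped not in by_id:
                problems.append(f"{f.path}: mapped file id {mapped} is missing")
        if f.file_type == "RESULT":
            # Referenced spectral files need to be submitted as well.
            has_spectra = any(
                by_id[m].file_type in {"RAW", "PEAK"} for m in f.mapped_ids if m in by_id
            )
            if not has_spectra:
                problems.append(f"{f.path}: result file has no raw or peak file mapped")
    return problems


if __name__ == "__main__":
    files = [
        SubmissionFile(1, "RESULT", "sample1.mzid", [2]),
        SubmissionFile(2, "RAW", "sample1.raw"),
    ]
    print(validate(files))  # prints []
```

A script along these lines is the kind of "some scripting" the command line route might involve before the files are transferred.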

Page 19: AHUPO_Vizcaino_remote_presentation_082014

Fast file transfer with Aspera

Part C: Difficulties in Connections

Q1. How can the connections between a local server and the central DBs (e.g. a local server and ProteomeXchange) be made much faster and more accessible?

-  Aspera is the default file transfer protocol to PRIDE, both in the PX submission tool and on the command line.

-  Up to 50X faster than FTP.

File transfer speed should not be a problem!
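As a rough worked example of what "up to 50X faster" could mean in practice, the sketch below estimates transfer times for a 1.4 TB dataset. The FTP throughput of 5 MB/s is a purely hypothetical baseline picked for the arithmetic, and the 50X factor is the upper bound quoted on the slide, so real transfers may be considerably slower.

```python
def transfer_hours(size_tb, rate_mb_per_s):
    """Hours needed to move size_tb terabytes at rate_mb_per_s megabytes per second."""
    size_mb = size_tb * 1_000_000  # decimal units: 1 TB = 1,000,000 MB
    return size_mb / rate_mb_per_s / 3600


FTP_RATE_MB_S = 5.0   # hypothetical FTP throughput, not a measured value
MAX_SPEEDUP = 50      # "up to 50X" from the slide, i.e. a best case

ftp_hours = transfer_hours(1.4, FTP_RATE_MB_S)
aspera_hours = transfer_hours(1.4, FTP_RATE_MB_S * MAX_SPEEDUP)

if __name__ == "__main__":
    print(f"FTP:    {ftp_hours:.1f} h")
    print(f"Aspera: {aspera_hours:.1f} h (best case)")
```

Under these assumptions the transfer drops from roughly three days to under two hours.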

Page 20: AHUPO_Vizcaino_remote_presentation_082014

PX submission tool

Page 21: AHUPO_Vizcaino_remote_presentation_082014

PX submission tool: HPP tags

Page 23: AHUPO_Vizcaino_remote_presentation_082014

Partial submissions

•  Everything can be stored, not only MS/MS data: a very flexible mechanism that can capture all types of datasets.

•  Top-down proteomics datasets

•  Mass spectrometry imaging datasets

•  Data-independent acquisition techniques: e.g. SWATH-MS datasets, among other DIA approaches

Page 24: AHUPO_Vizcaino_remote_presentation_082014

ProteomeXchange: 1,148 datasets up until August 2014

Type:

-  386 PRIDE complete
-  687 PRIDE partial
-  51 PeptideAtlas/PASSEL complete
-  1 MassIVE
-  23 reprocessed

Publicly accessible: 544 datasets, 50% of all (90% PRIDE, 10% PASSEL)

Data volume:

-  Total: ~51 TB
-  Number of all files: ~130,000
-  PXD000320-324: ~5 TB
-  PXD000065: ~1.4 TB

Origin: 235 USA, 142 Germany, 97 United Kingdom, 67 Switzerland, 64 Netherlands, 62 China, 60 France, 48 Canada, 43 Spain, 36 Belgium, 32 Sweden, 29 Australia, 26 Denmark, 23 Japan, 18 Taiwan, 17 India, 16 Ireland, 14 Norway, 14 Italy, 12 Finland, 11 Republic of Korea, 10 Brazil, 8 Austria, 7 Israel, 7 Singapore, …

Top species studied by at least 10 datasets: 510 Homo sapiens, 142 Mus musculus, 46 Saccharomyces cerevisiae, 45 Arabidopsis thaliana, 23 Rattus norvegicus, 16 Escherichia coli, 15 Bos taurus, 15 Mycobacterium tuberculosis, 13 Oryza sativa, 12 Drosophila melanogaster, 12 Glycine max (~265 species in total)

Datasets/year: 2012: 102; 2013: 527; 2014: 519
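The figures on this slide can be cross-checked with a few lines of Python. The numbers are copied from the slide; the variable names are only for this example. Both the per-type and the per-year counts add up to the stated total of 1,148, and the public share works out at a little under half.

```python
# Counts copied from the slide.
by_type = {
    "PRIDE complete": 386,
    "PRIDE partial": 687,
    "PeptideAtlas/PASSEL complete": 51,
    "MassIVE": 1,
    "reprocessed": 23,
}
by_year = {2012: 102, 2013: 527, 2014: 519}
total = 1148
public = 544

type_sum = sum(by_type.values())
year_sum = sum(by_year.values())
public_share = public / total

if __name__ == "__main__":
    print(type_sum == total, year_sum == total)
    print(f"public share: {public_share:.1%}")  # prints "public share: 47.4%"
```

The computed share of about 47% is consistent with the rounded "50%" and with the later statement that half of the datasets are already public.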

Page 26: AHUPO_Vizcaino_remote_presentation_082014

ProteomeXchange data workflow

[Figure: workflow diagram repeated from earlier in the talk. Vizcaíno et al., Nat Biotechnol, 2014.]

Page 27: AHUPO_Vizcaino_remote_presentation_082014

ProteomeCentral: Portal for all PX datasets

http://proteomecentral.proteomexchange.org/cgi/GetDataset
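For scripted access, a dataset page address can be built from a PXD identifier. The sketch below makes two assumptions that are not stated on the slide: that accessions consist of "PXD" followed by six digits (as in the PXD000065 example elsewhere in this talk), and that the portal accepts the accession in an `ID` query parameter. The helper name `dataset_url` is hypothetical, and no network request is made.

```python
import re
from urllib.parse import urlencode

BASE_URL = "http://proteomecentral.proteomexchange.org/cgi/GetDataset"

# Assumption: "PXD" followed by exactly six digits.
PXD_PATTERN = re.compile(r"^PXD\d{6}$")


def dataset_url(accession):
    """Build a ProteomeCentral URL for one dataset (the ID parameter is an assumption)."""
    if not PXD_PATTERN.match(accession):
        raise ValueError(f"not a PXD identifier: {accession!r}")
    return f"{BASE_URL}?{urlencode({'ID': accession})}"


if __name__ == "__main__":
    print(dataset_url("PXD000065"))
```

Validating the identifier before building the URL keeps typos from turning into confusing empty result pages.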

Page 29: AHUPO_Vizcaino_remote_presentation_082014

Get notified about new PX datasets

-  Subscribe to the RSS feed to receive information about new datasets:

http://groups.google.com/group/proteomexchange/feed/rss_v2_0_msgs.xml

[Diagram labels: ProteomeCentral, Researchers]

Page 30: AHUPO_Vizcaino_remote_presentation_082014

Reuse of datasets in PeptideAtlas can be tracked

Page 32: AHUPO_Vizcaino_remote_presentation_082014

HPP datasets are now tagged

The projects are now tagged and can be browsed as a group of datasets.

Tags for: HPP, C-HPP and B/D-HPP

Page 33: AHUPO_Vizcaino_remote_presentation_082014

HPP PX datasets

Page 34: AHUPO_Vizcaino_remote_presentation_082014

HPP PX datasets: some numbers

Since January 2014, we have been capturing the PI information.

-  25 HPP datasets: 22 C-HPP and 3 B/D-HPP

-  Countries represented in C-HPP:
    -  5 Spain
    -  4 South Korea
    -  3 Brazil, China

Only a small proportion of the datasets have been made publicly available, at least through ProteomeXchange.

Page 35: AHUPO_Vizcaino_remote_presentation_082014

For the near future in PRIDE…

•  Complete data workflow for data visualization in the PRIDE Archive web interface

•  Improvement of the existing PRIDE REST API

•  Incorporation of reprocessed data in PRIDE, in collaboration with Prof. L. Martens (VIB/Ghent) and Dr. A. Jones (U. Liverpool)

•  Integration of the data in the EBI Molecular Atlas

Page 36: AHUPO_Vizcaino_remote_presentation_082014

Conclusions

•  ProteomeXchange is widely used. It now has a new consortium member: MassIVE (UCSD).

•  PRIDE is getting a lot of data via ProteomeXchange. The pipeline has been in production since summer 2012. More than 1,100 datasets have already been submitted.

•  Half of them are already public.

•  Different open source tools are available to facilitate the process.

•  File transfer speed should not be a problem (Aspera support).

Page 37: AHUPO_Vizcaino_remote_presentation_082014

Acknowledgements

PRIDE Team: Attila Csordas, Rui Wang, Florian Reisinger, Jose A. Dianes, Tobias Ternent, Yasset Perez-Riverol, Noemi del Toro, Henning Hermjakob

PeptideAtlas Team (ISB, Seattle): Eric Deutsch, Terry Farrah, Zhi Sun

Andrew R. Jones, Lennart Martens, Juan Pablo Albar, Martin Eisenacher, Gil Omenn

And many other PX partners and stakeholders

EU FP7 grant number 260558

Page 38: AHUPO_Vizcaino_remote_presentation_082014

Questions?

Page 40: AHUPO_Vizcaino_remote_presentation_082014

Questions for discussion

Part A: Proteomic Dataset with PXD

Q1. How can data deposited through ProteomeXchange, or published data, be made available to the consortium members as well as to public DB managers (GPMDB, neXtProt, PeptideAtlas and ProteinAtlas)?

•  There is plenty of documentation available:
    •  http://www.proteomexchange.org/submission
    •  PRIDE documentation
    •  Original paper (PMID: 24727771) and submission tutorial paper (PMID: 25047258)

•  Resources need to be devoted to this…

Page 41: AHUPO_Vizcaino_remote_presentation_082014

Questions for discussion

Part A: Proteomic Dataset with PXD

Q2. What can we do about such inaccessible datasets in the public DBs?

•  Contact the author directly and convince him/her to make a submission to ProteomeXchange.

Page 42: AHUPO_Vizcaino_remote_presentation_082014

Questions for discussion

Part A: Proteomic Dataset with PXD

Q3. Do we need to ask people to place a link with the PXD identifier on the Wiki in order to see which chromosome team placed which datasets online (for sharing)?

•  No. In the PX submission tool it is possible to specify the tags for HPP and/or C-HPP or B/D-HPP.

•  Visit ‘ProteomeCentral’ and look for them.

•  Visit PRIDE and look for them there (specific tags).

Page 44: AHUPO_Vizcaino_remote_presentation_082014

Questions for discussion

Part B: Proteogenomic Dataset

Q1. How can proteogenomic datasets, including RNAseq and other types of genetic data, be easily accessed or deposited?

•  Not easy to do at present.

•  NCBI and EBI have created “BioSamples databases” to be able to link different studies performed using the same sample.

•  Proteomics data could be submitted to PRIDE/ProteomeXchange and RNAseq data to e.g. ArrayExpress (EBI), linked by the same sample number.

•  Sample IDs coming from the BioSamples DB are to be integrated in the PX submission tool and in PRIDE (expected before the end of the year).
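The linking idea above amounts to a join on a shared sample identifier. The following Python sketch shows that join on made-up records; the accessions and sample IDs are placeholders invented for the example, and the functions `group_by_sample` and `linked_samples` are hypothetical, not part of any repository API.

```python
from collections import defaultdict

# Placeholder records: the accessions and sample IDs are invented for illustration.
records = [
    {"repository": "PRIDE", "accession": "PX-EXAMPLE-1", "sample_id": "SAMPLE-A"},
    {"repository": "ArrayExpress", "accession": "AE-EXAMPLE-1", "sample_id": "SAMPLE-A"},
    {"repository": "PRIDE", "accession": "PX-EXAMPLE-2", "sample_id": "SAMPLE-B"},
]


def group_by_sample(items):
    """Map each sample ID to the (repository, accession) pairs that mention it."""
    groups = defaultdict(list)
    for item in items:
        groups[item["sample_id"]].append((item["repository"], item["accession"]))
    return dict(groups)


def linked_samples(groups):
    """Keep only samples that appear in more than one repository."""
    return {
        sample: entries
        for sample, entries in groups.items()
        if len({repository for repository, _ in entries}) > 1
    }


if __name__ == "__main__":
    print(linked_samples(group_by_sample(records)))
```

Only SAMPLE-A survives the filter here, because it is the only sample with both a proteomics and an RNAseq record.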

Page 45: AHUPO_Vizcaino_remote_presentation_082014

Biosamples DB

http://www.ebi.ac.uk/biosamples/
