PRIDE and ProteomeXchange: Share and explore public proteomics datasets like never before

Dr. Juan Antonio Vizcaíno

PRIDE Group Coordinator

Proteomics Services Team

EMBL-EBI

Hinxton, Cambridge, UK

Juan A. Vizcaí[email protected]

Midwinter Proteomics Bioinformatics Seminar, Semmering, 15 January 2015

Overview

• PRIDE Archive (in the context of ProteomeXchange and the PSI standards)
• How to submit data to PRIDE: PRIDE tools
• ProteomeCentral, submission and access stats


Data sharing in Proteomics

• Public availability of data in proteomics enables:
• Reinterpretation (e.g. data reprocessing with different aims):
• Improved analysis software.
• Changes in protein sequence databases (e.g. proteogenomics studies).
• Consideration of new post-translational modifications.
• Validation of the experimental results reported.
• Specific use cases for proteomics: spectral libraries, fragmentation models, SRM transitions, …


PRIDE (PRoteomics IDEntifications) database

http://www.ebi.ac.uk/pride

• PRIDE stores mass spectrometry (MS)-based proteomics data:
• Peptide and protein expression data (identification and quantification)
• Post-translational modifications
• Mass spectra (raw data and peak lists)
• Technical and biological metadata
• Any other related information
• Full support for tandem MS approaches

Martens et al., Proteomics, 2005

Vizcaíno et al., NAR, 2013


PRIDE Archive

• New PRIDE DB archival system in place since 01/2014. Three iterations released so far; still a work in progress.
• Very flexible. Its development has happened in parallel with:
• Implementation of ProteomeXchange.
• New community PSI data standards: mzIdentML, mzQuantML and mzTab.


ProteomeXchange Consortium

• Goal: Development of a framework to allow standard data submission and dissemination pipelines between the main existing proteomics repositories.
• Includes PeptideAtlas (ISB, Seattle), PRIDE (Cambridge, UK) and MassIVE (UCSD, San Diego).
• Tranche and Peptidome initially included but discontinued.
• Common identifier space (PXD identifiers)
• Two supported data workflows: MS/MS and SRM.
• Main objective: Make life easier for researchers

http://www.proteomexchange.org

Vizcaíno et al., Nat Biotechnol, 2014


Current PSI Standard File Formats for MS

• mzTab: final results (Griss et al., MCP, 2014)
• TraML: SRM (Deutsch et al., MCP, 2012)
• mzQuantML: quantitation (Walzer et al., MCP, 2013)
• mzIdentML: identification (Jones et al., MCP, 2012)
• mzML: MS data (Martens et al., MCP, 2011)


mzTab format: tab delimited format (ident/quant)

http://code.google.com/p/mztab/

J. Griss et al., MCP, 2014

Q.W. Xu et al., Proteomics, 2014
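
Because mzTab is plain tab-delimited text, it can be read with very little code. The following is a minimal sketch, assuming the three-letter line prefixes defined in the mzTab specification (MTD for metadata, PRH/PRT for the protein header and rows, PSH/PSM for PSMs, COM for comments). The sample content and the function name `read_mztab` are illustrative only and do not come from the talk.

```python
import csv
import io
from collections import defaultdict

# Tiny illustrative mzTab fragment (made-up values, not a real dataset).
SAMPLE = "\n".join([
    "MTD\tmzTab-version\t1.0.0",
    "MTD\tdescription\tExample file",
    "PRH\taccession\tdescription\tbest_search_engine_score[1]",
    "PRT\tP12345\tExample protein\t0.01",
    "PRT\tQ67890\tAnother protein\t0.02",
])

# Each header prefix announces the columns of one row prefix.
HEADER_TO_ROWS = {"PRH": "PRT", "PEH": "PEP", "PSH": "PSM", "SMH": "SML"}


def read_mztab(text):
    """Split mzTab text into a metadata dict and per-section row dicts."""
    metadata = {}
    headers = {}
    rows = defaultdict(list)
    for fields in csv.reader(io.StringIO(text), delimiter="\t"):
        if not fields or fields[0] == "COM":
            continue  # skip blank lines and comments
        prefix, values = fields[0], fields[1:]
        if prefix == "MTD" and len(values) >= 2:
            metadata[values[0]] = values[1]
        elif prefix in HEADER_TO_ROWS:
            headers[HEADER_TO_ROWS[prefix]] = values
        elif prefix in headers:
            rows[prefix].append(dict(zip(headers[prefix], values)))
    return metadata, dict(rows)


metadata, sections = read_mztab(SAMPLE)
```

Keeping the metadata and the row sections separate mirrors the layout of the file itself, so a spreadsheet-style view of the protein or PSM table is only one step away.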


Ways to access data in PRIDE Archive

• PRIDE web interface

• File repository

• REST web service

• PRIDE Inspector tool


PRIDE Archive web interface


PRIDE Archive web interface (2)

• Next: visualization of spectra (in a couple of weeks)


Programmatic access: PRIDE REST web service

http://www.ebi.ac.uk/pride/ws/archive/

• Intended to replace the most popular functionality provided by the PRIDE BioMart interface (now discontinued)
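
As a rough illustration of what programmatic access could look like, the sketch below builds request URLs on top of the base address shown on the slide. The resource paths `project/<accession>` and `file/list/project/<accession>` are assumptions made for this example, as are the function names; check the web service documentation before relying on them. Nothing is fetched when the block runs.

```python
import json
from urllib.parse import quote, urljoin
from urllib.request import urlopen

BASE_URL = "http://www.ebi.ac.uk/pride/ws/archive/"  # base address from the slide


def project_url(accession):
    """URL for one project's metadata.

    ASSUMPTION: the 'project/<accession>' path is illustrative.
    """
    return urljoin(BASE_URL, "project/" + quote(accession))


def project_files_url(accession):
    """URL listing the files of one project.

    ASSUMPTION: the 'file/list/project/<accession>' path is illustrative.
    """
    return urljoin(BASE_URL, "file/list/project/" + quote(accession))


def fetch_json(url, timeout=30):
    """Download a URL and decode the body as JSON (not called in this sketch)."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


example = project_url("PXD000764")  # example dataset used later in the talk
```

Separating URL construction from the network call keeps the first part easy to test offline.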


Overview

• Introduction to PRIDE Archive (in the context of ProteomeXchange and the PSI standards)
• How to submit data to PRIDE: PRIDE tools
• ProteomeCentral, submission and access stats
• A sneak peek at data reuse


ProteomeXchange data workflow: PRIDE

[Figure: ProteomeXchange data workflow diagram. Labels: researcher’s results, raw data*, metadata; receiving repositories: PRIDE (MS/MS data), MassIVE (MS/MS data), PASSEL (SRM data); ProteomeCentral (metadata / manuscript); reprocessed results; journals, UniProt/neXtProt, PeptideAtlas, GPMDB, other DBs.]


Manuscript published detailing the process

Ternent et al., Proteomics, 2014

http://www.proteomexchange.org/submission

Example dataset:

PXD000764

- Title: “Discovery of new CSF biomarkers for meningitis in children”

- 12 runs: 4 controls and 8 infected samples

- Identification and quantification data


PX Data workflow for MS/MS data

1. Mass spectrometer output files: raw data (binary files) or peak list spectra in a standardized format (mzML, mzXML).
2. Result files:
a. Complete submissions: Result files can be converted to PRIDE XML or the mzIdentML data standard.
b. Partial submissions: For workflows not yet supported by PRIDE, search engine output files will be stored and provided in their original form.
3. Metadata: Sufficiently detailed description of sample origin, workflow, instrumentation, submitter.
4. Other files: Optional files:
a. QUANT: Quantification-related results
b. PEAK: Peak list files
c. GEL: Gel images
d. OTHER: Any other file type
e. FASTA
f. SP_LIBRARY

[Figure labels: Published; Raw files; Other files]
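
The distinction between complete and partial submissions in item 2 can be sketched as a small decision rule. This is an illustrative sketch only: the tuple layout and the `RAW` and `SEARCH` labels are assumptions made for the example, while `RESULT`, PRIDE XML and mzIdentML are the terms used in the talk.

```python
# Result formats that make a submission "complete", as named on the slide.
RESULT_FORMATS = {"PRIDE XML", "mzIdentML"}


def submission_type(files):
    """Classify a submission as 'complete' or 'partial'.

    files: iterable of (file_name, file_type, file_format) tuples.
    ASSUMPTION: this tuple layout is hypothetical, chosen for the sketch.
    """
    has_supported_result = any(
        file_type == "RESULT" and file_format in RESULT_FORMATS
        for _name, file_type, file_format in files
    )
    return "complete" if has_supported_result else "partial"


complete_example = [
    ("run01.raw", "RAW", None),
    ("run01.mzid", "RESULT", "mzIdentML"),
]
partial_example = [
    ("run01.raw", "RAW", None),
    ("run01.dat", "SEARCH", "search engine output"),
]

kinds = (submission_type(complete_example), submission_type(partial_example))
```

In this sketch a submission counts as complete as soon as one result file is in a supported format; the real validation performed on submission may be stricter.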


Complete vs Partial submissions: experimental metadata

[Screenshots: Complete; Partial]

General experimental metadata about the projects is similar for complete and partial submissions.

However, at the assay level, the information in partial submissions is less detailed.


Complete vs Partial submissions: processed results

[Screenshots: Complete; Partial]

For complete submissions, the spectra can be connected with the processed identification results, and both can be visualized.


PX Data workflow for MS/MS data

1. Mass spectrometer output files: raw data (binary files) or peak list spectra in a standardized format (mzML, mzXML).
2. Result files:
a. Complete submissions: Result files can be converted to PRIDE XML or the mzIdentML data standard.
b. Partial submissions: For workflows not yet supported by PRIDE, search engine output files will be stored and provided in their original form.
3. Metadata: Sufficiently detailed description of sample origin, workflow, instrumentation, submitter.
4. Other files: Optional files (the list can be extended):
a. QUANT: Quantification-related results
b. PEAK: Peak list files
c. GEL: Gel images
d. OTHER: Any other file type
e. FASTA
f. SP_LIBRARY

[Figure labels: Published; Raw files; Other files]


PRIDE Components: Submission Process

[Figure, step 1. Labels: PRIDE Converter 2; PRIDE Inspector; PX Submission Tool; file formats: mzIdentML, PRIDE XML]


Before: only file conversion to PRIDE XML

[Figure: original data files (search output files, spectra files) → ‘RESULT’ file generation (file conversion with PRIDE Converter) → final ‘RESULT’ file (PRIDE XML)]


Now: native file export

[Figure: tools (Mascot, ProteinPilot, Scaffold, PEAKS, MSGF+, others) → ‘RESULT’ file generation (native file export) → final ‘RESULT’ file (mzIdentML), together with the spectra files]


Complete submissions

[Figure labels: search engines; search engine results + MS files; mzIdentML]

An increasing number of tools support export to mzIdentML 1.1:

- Mascot
- MSGF+
- MyriMatch and related tools from D. Tabb’s lab
- OpenMS
- PEAKS
- PeptideShaker
- ProCon (Proteome Discoverer, Sequest)
- Scaffold
- TPP via the idConvert tool (ProteoWizard)
- ProteinPilot (from version 5.0)
- Crux
- Others: library for X!Tandem conversion, lab internal pipelines, …

Referenced spectral files need to be submitted as well (all open formats are supported).

Updated list: http://www.psidev.info/tools-implementing-mzIdentML
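
Since the spectra files referenced from an mzIdentML file must be part of the submission, a quick pre-submission check can list them. The sketch below is a minimal example, assuming the mzIdentML 1.1 namespace and the `location` attribute of `SpectraData` elements; the inline XML is a simplified, non-schema-valid fragment and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse

NS = {"mzid": "http://psidev.info/psi/pi/mzIdentML/1.1"}

# Simplified fragment for illustration; a real file has many more elements.
SAMPLE = """<MzIdentML xmlns="http://psidev.info/psi/pi/mzIdentML/1.1">
  <DataCollection>
    <Inputs>
      <SpectraData id="SD_1" location="file:///data/run01.mgf"/>
      <SpectraData id="SD_2" location="run02.mzML"/>
    </Inputs>
  </DataCollection>
</MzIdentML>"""


def referenced_spectra_files(mzid_text):
    """Return the file names that the SpectraData elements point to."""
    root = ET.fromstring(mzid_text)
    names = []
    for element in root.iterfind(".//mzid:SpectraData", NS):
        location = element.get("location", "")
        # Accept both file:// URIs and plain POSIX-style paths.
        path = unquote(urlparse(location).path)
        names.append(PurePosixPath(path).name)
    return names


def missing_spectra_files(mzid_text, submitted_names):
    """Referenced spectra files that are absent from the submitted file names."""
    submitted = set(submitted_names)
    return [n for n in referenced_spectra_files(mzid_text) if n not in submitted]


referenced = referenced_spectra_files(SAMPLE)
missing = missing_spectra_files(SAMPLE, ["run01.mgf"])
```

Windows-style locations (for example paths with drive letters and backslashes) are not handled here and would need extra normalisation.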


In the near future: native file export

[Figure: tools (Mascot, ProteinPilot, Scaffold, PEAKS, MSGF+, others) → ‘RESULT’ file generation (native file export) → final ‘RESULT’ file (mzTab), together with the spectra files]


PRIDE Components: Submission Process

[Figure, step 2. Labels: PRIDE Converter 2; PRIDE Inspector; PX Submission Tool; file formats: mzIdentML, PRIDE XML]


PRIDE Inspector 2

Wang et al., Nat Biotechnol, 2012

PRIDE Inspector 2 supports:

- PRIDE XML
- mzIdentML + all types of spectra files
- mzML
- mzTab Quantitation (work in progress)

https://github.com/PRIDE-Toolsuite/


PRIDE Inspector 2

New visualisation functionality for Protein Groups

https://github.com/PRIDE-Toolsuite/


PRIDE Inspector 2

Private review of files submitted to PRIDE

https://github.com/PRIDE-Toolsuite/


PRIDE Components: Submission Process

[Figure, step 3. Labels: PRIDE Converter 2; PRIDE Inspector; PX Submission Tool; file formats: mzIdentML, PRIDE XML]


PX submission tool

• It selects and captures the mappings between the different types of files included in the submission.
• It transfers all the files using Aspera (default) or FTP.
• Command line alternative: some scripting is needed.

http://www.proteomexchange.org/submission

[Figure labels: Published; Raw; Other files; PX submission tool]
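
To give a feel for the scripting involved in capturing file mappings, here is a minimal sketch that writes a tab-delimited table linking each result file to the files it depends on. The column names and layout are hypothetical and chosen only for illustration; they are not the file format expected by the PX submission tool.

```python
import csv
import io


def write_mapping_table(mappings):
    """Render result-to-source file mappings as tab-delimited text.

    mappings: dict of result file name -> list of related file names.
    ASSUMPTION: the three-column layout below is invented for this sketch.
    """
    ids = {}
    ordered = []
    for result, related in mappings.items():
        for name in [result, *related]:
            if name not in ids:
                ids[name] = len(ids) + 1  # sequential numeric file id
                ordered.append(name)

    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["file_id", "file_name", "mapped_to"])
    for name in ordered:
        # Only result files carry references to other files.
        mapped = ",".join(str(ids[r]) for r in mappings.get(name, []))
        writer.writerow([ids[name], name, mapped])
    return out.getvalue()


table = write_mapping_table({"search1.mzid": ["run1.raw", "run1.mgf"]})
```

Referring to files by numeric id rather than by name keeps the table compact when one raw file is shared by several result files.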


PX submission tool: screenshots

Step 3

Step 4


Fast file transfer with Aspera

- Aspera is the default file transfer protocol to PRIDE:
- PX Submission tool
- Command line
- Up to 50X faster than FTP
- Also now available for downloading files

File transfer speed should not be a problem!!


Partial submissions can be used to store other data workflows

• Everything can be stored, not only MS/MS data (~90% of datasets): a very flexible mechanism that can capture all types of datasets
• PRIDE does not store SRM data (it goes to PASSEL)
• Top-down proteomics datasets (10 datasets).
• Mass spectrometry imaging datasets (1 dataset).
• Data-independent acquisition techniques: e.g. SWATH-MS (9 datasets), HDMSE (1 dataset).


[Figure: mass spectrometry imaging example, panels C and D, comparing the image from the original publication with the image reconstructed from ProteomeXchange data]

From original publication [13]:
1. Thermo RAW data / UDP
2. Mirion Software (JLU)

Reconstructed ProteomeXchange data:
1. Thermo RAW data / UDP
2. Convert to imzML
3. Upload to PRIDE (European Bioinformatics Institute, Cambridge, UK)
4. Download from PRIDE
5. Display in MSiReader

- Vendor-independent data format
- Freely available software (open source)
- ‘Open data’: free to reuse
- Anybody can do this!

Römpp et al., 2014, Anal Bioanal Chem, in press


Overview

• Introduction to PRIDE Archive (in the context of ProteomeXchange and the PSI standards)
• How to submit data to PRIDE: PRIDE tools
• ProteomeCentral, submission and access stats


ProteomeXchange data workflow: ProteomeCentral

[Figure: ProteomeXchange data workflow diagram. Labels: researcher’s results, raw data*, metadata; receiving repositories: PRIDE (MS/MS data), MassIVE (MS/MS data), PASSEL (SRM data); ProteomeCentral (metadata / manuscript); reprocessed results; journals, UniProt/neXtProt, PeptideAtlas, GPMDB, other DBs.]


ProteomeCentral: Portal for all PX datasets

http://proteomecentral.proteomexchange.org/cgi/GetDataset
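
A dataset page on the portal can be addressed directly from its PXD identifier. The sketch below is illustrative: the `ID` query parameter and the "PXD plus six digits" pattern are assumptions inferred from identifiers such as PXD000764 in this talk, and the function name is hypothetical.

```python
import re
from urllib.parse import urlencode

PORTAL = "http://proteomecentral.proteomexchange.org/cgi/GetDataset"

# ASSUMPTION: identifiers look like 'PXD' followed by six digits.
PXD_PATTERN = re.compile(r"^PXD\d{6}$")


def dataset_url(accession):
    """Build the portal URL for one dataset, rejecting malformed identifiers."""
    if not PXD_PATTERN.match(accession):
        raise ValueError(f"not a PXD identifier: {accession!r}")
    # ASSUMPTION: the portal accepts the identifier via an 'ID' parameter.
    return PORTAL + "?" + urlencode({"ID": accession})


url = dataset_url("PXD000764")
```

Validating the identifier before building the URL catches typos early instead of producing a request for a non-existent dataset.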


RSS feed for public datasets


ProteomeXchange: 1620 datasets up until 8th January 2015

Origin:
322 USA
197 Germany
148 United Kingdom
91 Netherlands
85 France
81 China
80 Switzerland
61 Canada
48 Belgium
47 Spain
45 Denmark
42 Australia
40 Japan
37 Sweden
28 Austria
22 India
21 Norway
21 Taiwan
20 Ireland
20 Finland
17 Italy
14 Brazil
13 Republic of Korea
13 Russia
10 Israel
9 Singapore
…

Type:
526 PRIDE complete (32.5%)
982 PRIDE partial (60.6%)
63 PeptideAtlas/PASSEL complete
24 MassIVE
25 reprocessed

Publicly accessible:
814 datasets, 50% of all
90% PRIDE
8% PASSEL
2% MassIVE

Data volume:
Total: ~71 TB
Number of all files: ~160,000
PXD000320-324: ~5 TB
PXD000065: ~1.4 TB

Top species studied by at least 10 datasets:
712 Homo sapiens
193 Mus musculus
65 Saccharomyces cerevisiae
61 Arabidopsis thaliana
35 Rattus norvegicus
34 Escherichia coli
17 Bos taurus
17 Glycine max
17 Mycobacterium tuberculosis
16 Drosophila melanogaster
14 Oryza sativa
~310 species in total

Datasets/year:
2012: 102
2013: 527
2014: 963
2015: 28
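
The figures on this slide are internally consistent, which a few lines of arithmetic can confirm. The sketch below only uses the counts quoted above; the variable names are illustrative.

```python
TOTAL = 1620  # datasets up until 8th January 2015

by_type = {
    "PRIDE complete": 526,
    "PRIDE partial": 982,
    "PeptideAtlas/PASSEL complete": 63,
    "MassIVE": 24,
    "reprocessed": 25,
}
by_year = {2012: 102, 2013: 527, 2014: 963, 2015: 28}
public = 814

# Both breakdowns should add up to the overall total.
type_sum = sum(by_type.values())
year_sum = sum(by_year.values())

# Percentage share of each submission type, rounded to one decimal place.
shares = {name: round(100 * count / TOTAL, 1) for name, count in by_type.items()}
public_share = round(100 * public / TOTAL, 1)
```

Both breakdowns sum to 1620, the PRIDE complete and partial shares come out at 32.5% and 60.6% as stated, and the publicly accessible fraction is 50.2%, which the slide rounds to 50%.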


PRIDE: Submitted datasets per month


Access statistics: PRIDE File repository

2014: The rise of proteomics data re-use


Which are the most accessed datasets?


PeptideShaker facilitates reuse of PRIDE data

Vaudel M, Barsnes H, Berven FS, Sickmann A, Martens L: Proteomics 2011;11(5):996-9.
http://searchgui.googlecode.com

Vaudel M, Burkhart J, Zahedi RP, Berven FS, Sickmann A, Martens L, Barsnes H: Nature Biotechnology 2015;33(1):22-4.
http://peptide-shaker.googlecode.com


Draft Human proteome papers published in 2014

Wilhelm et al., Nature, 2014

• Around 60% of the data used for the analysis comes from previous experiments, most of them stored in proteomics repositories such as PRIDE/ProteomeXchange, PASSEL or MassIVE.
• They complement that data with “exotic” tissues.


Conclusions

• Data submission and data reuse in the field are rising.
• PRIDE and ProteomeXchange enable this for you.
• Data standards are key for us.
• Quantification data depends on mzTab support.


Acknowledgements: The PRIDE Team and collaborators

Attila Csordas
Tobias Ternent
Noemi del Toro
Rui Wang
Florian Reisinger
Jose A. Dianes
Johannes Griss
Steven Lewis
Yasset Perez-Riverol
Henning Hermjakob

All ProteomeXchange partners, especially Eric Deutsch and Nuno Bandeira


Acknowledgements: Funding

[email protected]

[email protected]

http://www.proteomexchange.org

http://code.google.com/p/pride-converter-2/

@pride_ebi

Acknowledgements


Questions?