proteomics repositories

Dr. Juan Antonio Vizcaíno, Proteomics Team Leader, EMBL-EBI, Hinxton, Cambridge, UK

Post on 15-Apr-2017



Juan A. Vizcaíno, Proteomics Bioinformatics Course 2016, Hinxton, 8 December 2016

Overview

- Why share MS proteomics data?
- Types of information stored in MS proteomics repositories
- Main existing repositories and their main characteristics: no data reprocessing, data reprocessing, other resources

Corresponding public repositories

- Genomics: DNA sequence databases (GenBank, EMBL, DDBJ)
- Transcriptomics: ArrayExpress (EBI), GEO (NCBI)
- Proteomics: MS proteomics resources (ProteomeXchange)
- Metabolomics: MetaboLights (MetabolomeXchange)

Data sharing in proteomics

Proteomics data can be very complex and its interpretation is often troublesome and/or controversial.

In other omics fields, the data sharing culture is well established. Generally, it is considered to be good scientific practice.

In proteomics, the culture is definitely evolving in that direction. A big shift has happened in the last few years.

Scientific journals and funding agencies are two of the main drivers.

Reproducible Science

http://www.nature.com/nature/focus/reproducibility/

What is a proteomics publication in 2016?

Proteomics studies generate potentially large amounts of data and results.

Ideally, a proteomics publication needs to:
- Summarize the results of the study
- Provide supporting information for reliability of any results reported

Information in a publication:
- Manuscript
- Supplementary material
- Associated data submitted to a public repository

Journal Submission Recommendations

Journal guidelines recommend and/or mandate submission to proteomics repositories:
- Proteomics
- Nature Biotechnology
- Nature Methods
- Molecular and Cellular Proteomics

Funding agencies are enforcing public deposition of data to maximize the value of the funds provided.


Main types of information stored

1) Original experimental data recorded by the mass spectrometer (primary data): raw data and peak lists

2) Identification results inferred from the original primary data

3) Quantification information

4) Experimental and technical metadata

5) Any other type of information (e.g. scripts)

Current PSI Standard File Formats for MS


Proteomics repositories

Many different workflows need to be supported. They provide complementary views.

No data reprocessing. Data is stored as published or originally analysed:
- PRIDE Archive (focused on MS/MS data, but all data types are supported)
- MassIVE (focused on MS/MS data)
- jPOST (focused on MS/MS data)
- PASSEL (only SRM data)

Data reprocessing (MS/MS data):
- PeptideAtlas and GPMDB
- proteomicsDB and HPM

ProteomeXchange: a global, distributed proteomics database

http://www.proteomexchange.org

Goal: development of a framework to allow standard data submission and dissemination pipelines between the main existing proteomics repositories.

Member repositories:
- PRIDE (MS/MS data)
- MassIVE (MS/MS data)
- PASSEL (SRM data)
- jPOST (MS/MS data), new in 2016

Data handled: raw data (Raw), identification and quantification results (ID/Q) and metadata (Meta).

Mandatory raw data deposition since July 2015.

Vizcaíno et al., Nat Biotechnol, 2014; Deutsch et al., NAR, 2017, in press


Resources that don't reprocess data

1) Resources that try to represent the author's analysis view on the data.

Various workflows are allowed and they can provide complementary results.

Data are not updated over time. However, meta-analysis on top is possible.

False discovery rates (FDRs) accumulate when datasets are combined.

Main representatives: PRIDE Archive and MassIVE (MS/MS data) and PeptideAtlas/PASSEL (SRM data).

Data standards are essential.
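The point about FDRs accumulating when datasets are combined can be illustrated with a small calculation. This is only a sketch under a deliberately pessimistic assumption that is not stated on the slide: the true identifications are the same in every dataset, while the false identifications never coincide between datasets. The function name and the numbers are hypothetical.

```python
def combined_fdr(n_datasets, true_ids_per_dataset, fdr_per_dataset):
    """Estimate the FDR of the union of several datasets.

    Assumption (illustrative only): true identifications overlap
    completely across datasets, false identifications never overlap.
    """
    # fdr = false / (true + false)  =>  false = true * fdr / (1 - fdr)
    false_per_dataset = (
        true_ids_per_dataset * fdr_per_dataset / (1.0 - fdr_per_dataset)
    )
    # Every dataset adds new false identifications but no new true ones.
    total_false = n_datasets * false_per_dataset
    total_true = true_ids_per_dataset
    return total_false / (total_true + total_false)


if __name__ == "__main__":
    for n in (1, 5, 10, 50):
        print(n, round(combined_fdr(n, 990, 0.01), 4))
```

Under these assumptions, ten datasets that each report a 1% FDR combine into a list whose FDR is roughly 9%. Real datasets overlap only partially, so the actual inflation depends on the data.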


PRIDE (PRoteomics IDEntifications) Archive

http://www.ebi.ac.uk/pride/archive

PRIDE stores mass spectrometry (MS)-based proteomics data:
- Peptide and protein expression data (identification and quantification)
- Post-translational modifications
- Mass spectra (raw data and peak lists)
- Technical and biological metadata
- Any other related information

Full support for tandem MS approaches. Any type of data can be stored.

Martens et al., Proteomics, 2005; Vizcaíno et al., NAR, 2016


MassIVE (UCSD)

http://proteomics.ucsd.edu/service/massive/

- Data repository for MS proteomics data
- Tools available for users to analyse their own data
- Joined ProteomeXchange in June 2014

MassIVE Interactivity

MassIVE = Mass spectrometry Interactive Virtual Environment

http://massive.ucsd.edu http://proteomics.ucsd.edu

MassIVE: Do it yourself

- MSGF+: database search engine
- MSPLIT: spectral library search engine
- ENOSI: proteogenomic search engine
- MODa: multi-blind modification database search engine
- Spectral Networks: spectral alignment-based analysis and propagation of identifications
- Multi-pass: MSPLIT, MSGFDB, MODa cascade search workflow
- MSGFDB: database search engine
- MSPLIT-DIA: spectral library search for SWATH
- Upload your own! (mzIdentML, mzTab, TSV)

http://massive.ucsd.edu http://proteomics.ucsd.edu

jPOST repository site (www.jpost.org)

Joined ProteomeXchange in July 2016

PASSEL: repository for SRM data

http://www.peptideatlas.org/passel/

Suitable for SRM assays

Uses the PSI standard TraML plus the output of the most popular vendor pipelines

Started in 2012

Part of the ProteomeXchange consortium

Farrah et al., Proteomics, 2012


Reprocessing repositories

These resources collect MS raw data and reprocess it using one given analysis pipeline and an up-to-date protein sequence database.

Advantage: They provide a standardized and updated view on the experimental data available.

Disadvantage: Only one common analysis method is used and there can be information loss.

The result is different from the author's view on the data.

Main resources: GPMDB and PeptideAtlas (ISB, Seattle).

Novel resources: proteomicsDB and the Human Proteome Map.

PeptideAtlas

http://www.peptideatlas.org

Developed at the Institute for Systems Biology (ISB, Seattle, USA)

Peptide identifications from MS/MS approaches

Data are reprocessed using the popular Trans-Proteomic Pipeline (TPP)

- Uses PeptideProphet to derive a probability of correct identification for every peptide it contains
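Per-peptide probabilities of this kind can be turned into an error estimate for a filtered list. The sketch below is not PeptideAtlas or TPP code. It only illustrates the general idea, assuming the probabilities are well calibrated: the expected fraction of wrong identifications above a threshold is the mean of (1 - p) over the accepted peptides. The function name and the example values are hypothetical.

```python
def expected_fdr(probabilities, threshold):
    """Expected FDR among identifications with probability >= threshold.

    Assumes each value is a calibrated probability that the
    identification is correct, so (1 - p) is its chance of being wrong.
    """
    accepted = [p for p in probabilities if p >= threshold]
    if not accepted:
        return 0.0  # nothing accepted, so nothing can be wrong
    expected_false = sum(1.0 - p for p in accepted)
    return expected_false / len(accepted)


if __name__ == "__main__":
    probs = [0.99, 0.95, 0.90, 0.50]
    print(expected_fdr(probs, 0.9))
```

Raising the threshold lowers the expected FDR at the cost of accepting fewer peptides.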

PeptideAtlas

All peptide IDs are mapped to Ensembl using ProteinProphet (to handle protein inference)

Provides proteotypic peptide predictions

Limited metadata available

Part of the HPP project

Deutsch et al., Proteomics, 2005; Desiere et al., NAR, 2006; Deutsch et al., EMBO Rep, 2008

PeptideAtlas builds

Builds are updated on a regular basis (usually once a year)

Examples of builds:

- Human (HPP context)
- Human plasma
- Human urine
- Drosophila
- Mouse
- Mouse plasma
- Cow
- Yeast

GPMDB (Global Proteome Machine DB)

http://gpmdb.thegpm.org/

Originally developed by R. Beavis & R. Craig

End point of the GPM proteomics pipeline, to aid in the process of validating peptide MS/MS spectra and protein coverage patterns.

Craig et al., J Proteome Res, 2004

GPMDB

Data are reprocessed using the popular X!Tandem search engine or the X!Hunter spectral library search algorithm

Also provides proteotypic peptides

GPMDB

Nice visualization features

Provides very limited annotation with GO, BTO

Some support for targeted approaches is available

Part of the HPP consortium

The Human Proteome Project (HPP)

http://thehpp.org/

HPP guidelines version 2.1


Draft human proteome papers published in 2014

Wilhelm et al., Nature, 2014; Kim et al., Nature, 2014

Nature cover, 29 May 2014

Two independent groups claimed to have produced the first complete draft of the human proteome by MS.

Some of their findings are controversial and need further validation, but they generated a lot of discussion and put proteomics in the spotlight.

Two proteomics resources have been developed: proteomicsDB and the Human Proteome Map (HPM).

ProteomicsDB

https://www.proteomicsdb.org/

Data analysis using Mascot and MaxQuant

The way the protein FDR is calculated is controversial

Quantification information using label-free techniques

New datasets are added on a regular basis

ProteomicsDB (2)

Human Proteome Map (HPM)

http://www.humanproteomemap.org/

Developed by the Pandey group.

Data reanalysis using Mascot.

Protein FDR is not mentioned at all in the corresponding Nature paper.

Static resource: it will not be updated any longer.


Chorus

https://chorusproject.org/pages/index.html

Developed by the M. MacCoss group in Seattle (UW).

Built on top of Amazon Cloud technologies

Provides data analysis capabilities for the users

Free for public datasets.

The objective is to connect the data to analysis tools in a cloud environment

Other repositories

- MaxQB
- Human Proteinpedia

COPaKB: Cardiac Organellar Protein Atlas Knowledgebase

- International collaboration (EMBL-EBI involved)
- Windows client and iPad app
- Submit data for analysis in dta and mzML formats
- Data submitted to a ProLuCID pipeline
- No MS data download

CPTAC data portal

Pep2pro (Arabidopsis)

http://fgcz-pep2pro.uzh.ch/

- Centered on Arabidopsis data
- Download spectrum by spectrum
- Quantitative information
- Linked to gelmap.de (2DE)

FINAL THOUGHTS

Why are repositories not more popular?

1. Don't want to share data

- Researchers don't like to be shown that they did not analyze the data as well as they could have. Their FDR may be higher than they reported or think.
- Researchers are worried that they missed something in the data that they could discover if they go back to it at a later date.
- Researchers don't want other authors to get a publication from their data.

However, this philosophy is changing gradually.

Slide from R. Chalkley

Why are repositories not more popular? (2)

2. Submission burden

- Getting data into the correct format may require some work.
- The author is not necessarily computer-savvy.
- Having to also supply metadata is seen as a burden if the information is already present in an associated manuscript.
- Associated raw data may be many GB in size; file transfer to the repository could take a while.
- Authors are impatient: they want to spend time doing science, not administration!

Slide from R. Chalkley

Conclusions

Importance of sharing MS proteomics data

The main existing proteomics repositories are complementary in focus and functionality.

Main characteristics of:

- PeptideAtlas and GPMDB (reprocess data)
- PASSEL, MassIVE, jPOST and PRIDE Archive (at present they do not reprocess data)
- New resources: proteomicsDB, HPM
- Chorus, CPTAC portal


Recommended reading

Perez-Riverol et al., Proteomics, 2015. PMID: 25158685

Questions?
