Download - Reuse of public proteomics data

Reuse of public proteomics data

Dr. Juan Antonio Vizcaíno

EMBL-EBI

Hinxton, Cambridge, UK

Juan A. Vizcaí[email protected]

WT Proteomics Bioinformatics Course 2016Hinxton, 20 July 2017

Countries with at least

100 submitted datasets :

1019 USA

734 Germany

492 United Kingdom

470 China

273 France

209 Netherlands

173 Canada

165 Switzerland

157 Australia

148 Austria

142 Denmark

137 Spain

115 Sweden

109 Japan

100 India

5,198 ProteomeXchange datasets in PRIDE (by March 2017)

Type:

3835 ‘Partial’ submissions (73.8%)

1363 ‘Complete’ submissions (26.2%)

Released: 3462 datasets (66.6%)

Unpublished: 1,736 datasets (33.4%)

Data volume in PRIDE:

Total: ~320 TB

Number of files: ~670,000

PXD000320-324: ~ 4 TB

PXD002319-26 ~2.4 TB

PXD001471 ~1.6 TB

Top Species represented (at least

100 datasets):

2267 Homo sapiens

765 Mus musculus

201 Saccharomyces cerevisiae

169 Arabidopsis thaliana

154 Rattus norvegicus

124Escherichia coli

~ 1000 species in total



Public proteomics datasets are being increasingly

reused…

Martens & Vizcaíno, Trends Bioch Sci, 2017



Downloads are increasing

Data download volume for PRIDE Archive in 2016: 243 TB

0

50

100

150

200

250

300

2013 2014 2015 2016

Downloads in TBs



Data sharing in Proteomics

Vaudel et al., Proteomics, 2016



What do researchers think about the main uses

of public proteomics data?

• Verify published results

• Build Spectral libraries

• Find interesting datasets for reanalysis:

• Find new splice isoforms

• Discover new PTMs

Source: MassIVE workshop, ASMS 2017, Nuno Bandeira et al.




Vaudel et al., Proteomics, 2016




• Data directly as they are.

• Protein knowledge bases: UniProt, neXtProt.

• Contributing to the Protein Evidence Code.



Protein Evidence codes in UniProt/neXtProt

http://www.uniprot.org/help/protein_existence

http://www.uniprot.org/help/protein_existence



Use of MS data in UniProt



Reuse

• Information is not only extracted, but reused in new

experiments with the potential of generating new

knowledge.

• Transitions used in SRM approaches.

• Spectral libraries.



SRMAtlas

http://www.srmatlas.org/

http://www.srmatlas.org/



PeptidePicker

http://mrmpeptidepicker.proteincentre.com/

http://mrmpeptidepicker.proteincentre.com/



Spectral searching

• Concept: To compare experimental spectra to other

experimental spectra.

• There are many spectral libraries publicly available (for

instance, from NIST, PeptideAtlas and PRIDE)

• Custom ‘search engines’ have been developed:

• SpectraST (TPP)

• X!Hunter (GPM)

• Bibliospec

• It has been claimed that the searches have more

sensitivity that with sequence database approaches



Spectral searching (2)

http://peptide.nist.gov/

http://peptide.nist.gov/



PRIDE Cluster as a Public Data Mining Resource

18

• http://www.ebi.ac.uk/pride/cluster

• Spectral libraries for 16 species.

• All clustering results, as well as specific subsets of interest available.

• Source code (open source) and Java API

http://www.ebi.ac.uk/pride/cluster



Benchmarking

• Many datasets are reused to benchmark new algorithms

and/or tools. Comparison with previous tools.

• Usually raw data is used as the base (new analysis), but

not always.

• Very common reuse case -> A lot of examples.



Reprocess

• Data are reprocessed with the intention of obtaining

new knowledge or to provide an updated view on

the results.

• It mainly serves the same purpose of the original

experiment.

• For instance, a shot-gun dataset can be reprocessed

with a different algorithm or an updated sequence

database.



Reprocessing repositories

• These resources collect MS raw data and reprocess it using

one given analysis pipeline, and an up-to date protein

sequence database.

• Main resources: GPMDB and PeptideAtlas (ISB, Seattle).



PeptideAtlas and GPMDB



proteomicsDB -> Draft Human proteome paper

Wilhelm et al., Nature, 2014

•Around 60% of the data used for the

analysis comes from previous

experiments, most of them stored in

proteomics repositories such as

PRIDE/ProteomeXchange, PASSEL or

MassIVE.

•They complement that data with “exotic”

tissues.



Reprocessing for the validation of controversial data

• Analysis of Tyrannosaurus rex fossils: controversial presence of

collagen (is it a contamination of the sample? Did the sample contain

any T. rex proteins at all?)

Asara et al. (2007) Science 316: 280-5.

Asara et al. (2007) Science 316: 1324-5.

Bern et al. (2009) JPR 9: 4328-32

PRIDE Archive assay accession

8633



Info from R. Chalkley

Bromenshenk et al. (2011) PLOS One 5: e13181

Reprocessing for the validation of controversial data (2)



Experimental Protocol

1. Collected samples from healthy, collapsing and collapsed bee colonies.

2. Homogenised bees.

3. Digested with Trypsin

4. Analyzed by LC-MSMS on LTQ

5. Searched using Sequest

6. Filtered Results using Peptide and Protein Prophet

7. Performed further analysis to determine species statistically more

commonly found in collapsing/collapsed colony samplesInfo from R. Chalkley

Bromenshenk et al. (2011) PLOS One 5: e13181




• Big pitfall: Search database was only composed by viral

proteins. Not bee proteins at all!!

• After researching the data, there is no evidence for viral

peptides/proteins in any of their data: honey bee, fruit fly,

wasp, moth, human keratin, bacteria that like sugary

environments, …

• “We believe that there is currently insufficient evidence to

conclude that bees are a natural host for IIV-6, let alone that

the virus is linked to CCD”.Info from R. Chalkley

Knudsen & Chalkley (2011) PLOS One 6:

e20873

Foster (2011), MCP 10: M110.006387




Reprocessing for the validation of controversial data

Datasets PXD000561 and PXD000865 in PRIDE Archive



Various reanalysis of these datasets have been performed…

Reanalysis of Pandey dataset (Nature, 2014) made by J. Choudhary’s group at

Sanger Institute

Wright et al., Nat Commun, 2016Dataset PXD000561

http://www.ebi.ac.uk/gxa

http://www.ebi.ac.uk/gxa



Meta-analysis approaches

• Putting data coming from a lot of experiments

together, to extract new knowledge. Examples:

• Peptide fragmentation patterns.

• Retention time prediction.

• Data integration of experiments done at different time

points.

• Two different approaches:

• Data as submitted

• Reanalysis all the data together



Meta-analysis recent examples

Lund-Johanssen et al., Nat Methods, 2016 Drew et al., Mol Systems Biol, 2017



Repurposing

• Data are considered in light of a question or a

context that is different from the original study.

• Proteogenomics studies

• Discovery of novel PTMs.



Examples of repurposing datasets: proteogenomics

Data in public resources can be used for genome annotation purposes



Public datasets from different omics: OmicsDI

http://www.omicsdi.org/

• Aims to integrate of ‘omics’ datasets (proteomics,

transcriptomics, metabolomics and genomics at present).

PRIDE

MassIVE

jPOST

PASSEL

GPMDB

ArrayExpress

Expression Atlas

MetaboLights

Metabolomics Workbench

GNPS

EGA

Perez-Riverol et al., Nat Biotechnol, 2017

http://www.omicsdi.org/



Repurposing: new PTMs found

• Individual authors can reprocess raw data with new

hypotheses in mind (not taken into account by the original

authors).

• Recent examples (using phosphoproteomics data sets):

• O-GlcNAc-6-phosphate1

• Phosphoglyceryl2

• ADP-ribosylation3

1Hahne & Kuster, Mol Cell Proteomics (2012) 11 10 1063-92Moellering & Cravatt, Science (2013) 341 549-553

3Matic et al., Nat Methods (2012) 9 771-2



What do researchers think about the main uses

of public proteomics data?

• Verify published results

• Build Spectral libraries

• Find interesting datasets for reanalysis:

• Find new splice isoforms

• Discover new PTMs

Source: MassIVE workshop, ASMS 2017, Nuno Bandeira et al.



Do you want to know a bit more…?

http://www.slideshare.net/JuanAntonioVizcaino

Martens & Vizcaíno, Trends Bioch Sci, 2017 Vaudel et al., Proteomics, 2016

http://www.slideshare.net/JuanAntonioVizcaino



EXERCISE



Vaudel M, Barsnes H, Berven FS, Sickmann A,

Martens L:

Proteomics 2011;11(5):996-9.

https://github.com/compomics/searchgui https://github.com/compomics/peptide-shaker

Vaudel M, Burkhart J, Zahedi RP, Berven FS, Sickmann A, Martens L,

Barsnes H:

Nature Biotechnology 2015; 33(1):22-4.

CompOmics Open Source Analysis Pipeline



Find the desired PRIDE project …

… and start re-analyzing the data!

… inspect the project details ….

Reshake PRIDE data!

Download - Reuse of public proteomics data

Top Related