Page 1: Data standards from the Proteomics Standards Initiative Andy Jones andrew.jones@liv.ac.uk University of Liverpool

Data standards from the Proteomics Standards Initiative

Andy Jones, andrew.jones@liv.ac.uk, University of Liverpool

Page 2

Overview

• HUPO-PSI background
• Data formats
  – Protein and peptide separations
    • GelML
    • spML
  – Mass spectrometry and proteomics informatics
    • mzML
    • mzIdentML
    • mzQuantML

Page 3

HUPO-PSI background

• HUPO was founded in 2001 with several objectives:
  – Consolidate worldwide proteome organisations
  – Assist in the coordination of public proteome initiatives
  – Engage in scientific and educational activities
• Tissue proteome projects and other initiatives:
  – Plasma, Liver, Brain, Glyco and Antibody initiatives
  – Proteomics Standards Initiative (PSI)
• HUPO-PSI: “The HUPO Proteomics Standards Initiative (PSI) defines community standards for data representation in proteomics to facilitate data comparison, exchange and verification.”
• Main outputs are:
  – Minimum reporting guidelines (MIAPE modules)
  – Data exchange formats (usually in XML)
  – Ontologies or controlled vocabularies

Page 4

PSI main outputs

• MIAPE – minimum information about a proteomics experiment
  – Information that should be recorded about a proteomics experiment (Taylor et al. Nature Biotechnology 25, 887-893; 2007)
  – Modules: gel electrophoresis, gel image informatics, capillary electrophoresis, column chromatography, mass spectrometry, mass spectrometry informatics and molecular interactions
• Data formats for:
  – molecular interactions
  – mass spectrometry
  – protein identifications
  – gel electrophoresis and other separation methods
• Plus supporting controlled vocabularies for each format
• All outputs must pass a stringent standardisation process
  – Specifications reviewed by public comment and anonymous review
  – PSI editor will not sign off specification until reviewers’ comments have been satisfied

Page 5

PSI data formats

• Protein separation
  – GelML
    • 2007-01-18: GelML 1.0
    • Current: GelML 1.1 (no formal release yet)
  – spML
    • 2007: milestone 2
    • No active development...
• Mass spectrometry
  – mzML (mass spec)
    • 2008-06-01: mzML 1.0.0 released
    • 2009-06-01: mzML 1.1.0 released
    • Previous / related standards: mzData v1.0.5 (PSI), mzXML (from ISB)
• Proteomics Informatics
  – mzIdentML (protein identifications)
    • 2009-08-20: mzIdentML 1.0.0
  – mzQuantML (protein quantifications)
    • Early drafting only
• Molecular interactions
  – MI, version 2.5

Page 6

GelML

Data format for exchanging protocols and image data resulting from gel electrophoresis, extension of FuGE

• Contents:
  – Models of 1D and 2D separation, electrophoresis protocol, detection, and includes DIGE
• Status:
  – v1.0 was built by extending the complete FuGE model; version 1.1 extends from “FuGElight”
  – v1.1 simplified protocols, e.g. for electrophoresis (free-text, not parameterized)
  – v1.1 shares the same CV structure as mzML and mzIdentML
  – v1.1 implemented in ProteoRed MIAPE database, beta implementation in MIAPEGelDB (SIB)

Page 7

spML

Data exchange format for non-gel based separations, extension of FuGE

• Contents:
  – Multi-dimensional chromatography, generic model for other types of separation (capillary electrophoresis, rotofors, centrifugation etc.)
• Status:
  – Milestone 2 extended from FuGE
  – Some work has been done to convert this to the same structure as GelML v1.1
  – No active development for some time; decision to be taken at next PSI meeting about community requirement for format

Page 8

mzML History

[Timeline figure]

• Predecessor formats: mzData 1.05, mzXML 3.0
• Meetings and milestones shown: SFO (2006-05), DC (2006-09), ISB (2006-11), Lyon (2007-04), EBI (2007-06), PSI Doc Proc (2007-11), Toledo (2008-04), Turku (2009-04)
• Versions shown: mzML 0.90, dataXML 0.6, mzML 0.91, mzML 0.99 RC, mzML 1.0.0 (release, 2008-06), mzML 1.1.0 RC5, mzML 1.1.0 (release, 2009-06)
• Phases: Early Development, Final Development

Page 9

mzML

[Schema diagram]

mzML
  cvList
  referenceableParamGroupList
  sampleList
  acquisitionSettingsList
  dataProcessingList
  softwareList
  instrumentConfigurationList
  run
    spectrumList
      spectrum (• • •)
        spectrumDescription
          precursorList
          scan
        binaryDataArray
        binaryDataArray
    chromatogramList
      chromatogram (• • •)
        binaryDataArray
        binaryDataArray

Each spectrum contains a header with scan information and optionally precursor information, followed by two or more base64-encoded binary data arrays.

Chromatograms may be encoded in mzML in a special element that contains cvParams to describe the type of chromatogram, followed by two base64-encoded binary data arrays.
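The base64-encoded binary data arrays mentioned above can be illustrated with a minimal Python sketch. It assumes the array holds little-endian floating-point values, optionally zlib-compressed. In a real mzML file the precision and compression are declared by cvParams on each binaryDataArray; here they are passed as plain arguments. The function names are hypothetical and are not part of any mzML library.

```python
import base64
import struct
import zlib


def encode_binary_array(values, compress=False):
    """Pack floats as little-endian 64-bit values and base64-encode them."""
    raw = struct.pack(f"<{len(values)}d", *values)
    if compress:
        raw = zlib.compress(raw)
    return base64.b64encode(raw).decode("ascii")


def decode_binary_array(text, precision=64, compressed=False):
    """Reverse the encoding: base64 -> (optional) zlib -> list of floats."""
    raw = base64.b64decode(text)
    if compressed:
        raw = zlib.decompress(raw)
    code = "d" if precision == 64 else "f"  # 64-bit or 32-bit floats
    count = len(raw) // struct.calcsize(code)
    return list(struct.unpack(f"<{count}{code}", raw))


# Round trip over a small, made-up m/z array.
mz_values = [100.0, 200.5, 300.25]
encoded = encode_binary_array(mz_values)
print(decode_binary_array(encoded))
```

A spectrum would typically carry at least two such arrays, one for m/z and one for intensity, decoded in the same way.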

Page 10

mzML implementations

Page 11

mzIdentML overview

• Various software packages for searching:
  – MASCOT, SEQUEST, X!Tandem, Omssa, Inspect...
  – Each piece of software has its own output format
  – User interacts with results formatted as web pages
  – Not easy to submit to databases or re-analyse results
• mzIdentML
  – Standard format for results of searches with mass spec data
  – Can capture results from PMF and tandem MS
  – Flexible model of peptide and protein identifications
  – Capture search engine parameters, scores and modifications using controlled vocabulary terms:

    <Modification location="7" residues="M" monoisotopicMassDelta="15.994919">
      <cvParam accession="UNIMOD:35" name="Oxidation" cvRef="UNIMOD" />
    </Modification>
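As a rough illustration of how a Modification fragment with its controlled vocabulary term might be read, the sketch below parses it with Python's standard `xml.etree.ElementTree`. The closing `</Modification>` tag is added so the fragment is well-formed, and `read_modification` is a hypothetical helper. A complete mzIdentML file declares an XML namespace, so element lookups there would need namespace-qualified names; this sketch ignores that.

```python
import xml.etree.ElementTree as ET

# The fragment from the slide, closed so that it parses on its own.
FRAGMENT = """
<Modification location="7" residues="M" monoisotopicMassDelta="15.994919">
  <cvParam accession="UNIMOD:35" name="Oxidation" cvRef="UNIMOD" />
</Modification>
"""


def read_modification(xml_text):
    """Return the modification attributes and its CV term as a dict."""
    element = ET.fromstring(xml_text)
    cv_param = element.find("cvParam")
    return {
        "location": int(element.get("location")),
        "residues": element.get("residues"),
        "mass_delta": float(element.get("monoisotopicMassDelta")),
        "accession": cv_param.get("accession"),
        "name": cv_param.get("name"),
    }


modification = read_modification(FRAGMENT)
print(modification)
```

Because the modification is identified by a CV accession rather than free text, software can compare results from different search engines without matching on names.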

Page 12

mzIdentML: Schema overview

[Schema diagram]

mzIdentML
  cvList
  AnalysisSoftwareList: software packages
  AnalysisSampleCollection: biological samples
  SequenceCollection: DB entries of protein / peptide sequences
  AnalysisCollection
    SpectrumIdentification: inputs = external spectra (1..n), output = SpectrumIdentificationList (1)
    ProteinDetection: inputs = SpectrumIdentificationLists, output = ProteinDetectionList
  AnalysisProtocolCollection
    SpectrumIdentificationProtocol
      AdditionalSearchParams
      ModificationParams
      Enzymes
      DatabaseFilters
    ProteinDetectionProtocol
  DataCollection
    Inputs: the database searched and the input file converted to mzIdentML
    AnalysisData
      SpectrumIdentificationList
        SpectrumIdentificationResult: all identifications made from searching one spectrum
          SpectrumIdentificationItem: one (poly)peptide-spectrum match
      ProteinDetectionList
        ProteinAmbiguityGroup: a set of related protein identifications, e.g. conflicting peptide-protein assignments
          ProteinDetectionHypothesis: a single protein identification

Page 13

mzIdentML: Peptide identifications

[Diagram]

SequenceCollection
  DBSequence: Accession = “HSP7D_MANSE”, Seq = “MAKAPAVGIDLGTTYSCVGVF...”
  DBSequence: Accession = “HSP70_ECHGR”, Seq = “MMSKGPAVGIDLGTTFSCVGV...”
  Peptide: Seq = “DAGMISGLNVLR”, Mod = Methionine oxidation (pos 4)

SpectrumIdentificationList 1
  SpectrumIdentificationResult 1
    SpectrumIdentificationItem 1_1: Score = 67.2, E-value = 0.000867, Rank = 1
      PeptideEvidence 1_1_A: start=161 end=172 pre=K post=I
      PeptideEvidence 1_1_B: start=160 end=171 pre=K post=L
    SpectrumIdentificationItem 1_2: Score = 54.4, E-value = 0.026, Rank = 2
      PeptideEvidence 1_2_A: start=54 end=65 pre=K post=T

External data: spectrum, spectrum, spectrum, spectrum, spectrum
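The peptide-level example on this slide can be mirrored in a few Python data classes. This is only a sketch: the class names echo the mzIdentML element names, but the field names and the `top_ranked` helper are hypothetical and not taken from the schema or any library. The values are those shown on the slide.

```python
from dataclasses import dataclass, field


@dataclass
class PeptideEvidence:
    id: str
    start: int  # first residue of the peptide in the protein sequence
    end: int    # last residue of the peptide in the protein sequence
    pre: str    # residue before the peptide
    post: str   # residue after the peptide


@dataclass
class SpectrumIdentificationItem:
    id: str
    score: float
    e_value: float
    rank: int
    evidence: list = field(default_factory=list)


def top_ranked(items):
    """Return the peptide-spectrum match with the best (lowest) rank."""
    return min(items, key=lambda item: item.rank)


PEPTIDE = "DAGMISGLNVLR"
items = [
    SpectrumIdentificationItem("1_2", 54.4, 0.026, 2, [
        PeptideEvidence("1_2_A", 54, 65, "K", "T"),
    ]),
    SpectrumIdentificationItem("1_1", 67.2, 0.000867, 1, [
        PeptideEvidence("1_1_A", 161, 172, "K", "I"),
        PeptideEvidence("1_1_B", 160, 171, "K", "L"),
    ]),
]

best = top_ranked(items)
print(best.id, best.score, [e.id for e in best.evidence])
```

The two PeptideEvidence entries under item 1_1 show the same peptide mapping to two different database sequences, which is what leads to ambiguity at the protein level.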

Page 14

mzIdentML: Protein identifications

[Diagram]

SequenceCollection
  DBSequence: Accession = “HSP7D_MANSE”, Seq = “MAKAPAVGIDLGTTYSCVGVF...”
  DBSequence: Accession = “HSP70_ECHGR”, Seq = “MMSKGPAVGIDLGTTFSCVGV...”

SpectrumIdentificationList
  SpectrumIdentificationResult 1
    SpectrumIdentificationItem 1_1
      PeptideEvidence 1_1_A
      PeptideEvidence 1_1_B
  SpectrumIdentificationResult 2
    SpectrumIdentificationItem 2_1
      PeptideEvidence 2_1_A
  SpectrumIdentificationResult 3
    SpectrumIdentificationItem 3_1
      PeptideEvidence 3_1_A
      PeptideEvidence 3_1_B

ProteinDetectionList
  ProteinAmbiguityGroup 1
    ProteinDetectionHypothesis 1_1: Score = 141, Peptide coverage = 17%, E-value = 0.0034
      PeptideHypothesis (1_1_A)
      PeptideHypothesis (2_1_A)
      PeptideHypothesis (3_1_A)
    ProteinDetectionHypothesis 1_2
      PeptideHypothesis (1_1_B)

• Protein ambiguity group: groups proteins that share the same set of peptides (protein inference problem)
• Protein detection hypothesis: one potential protein hit supported by peptide evidence

Page 15

mzIdentML: Protein identifications

[Diagram]

SequenceCollection
  DBSequence: Accession = “HSP70_ECHGR”, Seq = “MMSKGPAVGIDLGTTFSCVGV...”

SpectrumIdentificationList
  SpectrumIdentificationResult 1
    SpectrumIdentificationItem 1_1
      PeptideEvidence 1_1_A
      PeptideEvidence 1_1_B
  SpectrumIdentificationResult 2
    SpectrumIdentificationItem 2_1
      PeptideEvidence 2_1_A
  SpectrumIdentificationResult 3
    SpectrumIdentificationItem 3_1
      PeptideEvidence 3_1_A
      PeptideEvidence 3_1_B

ProteinDetectionList
  ProteinAmbiguityGroup 1
    ProteinDetectionHypothesis 1_1: Score = 141, Peptide coverage = 17%, E-value = 0.0034
      PeptideHypothesis (1_1_A)
      PeptideHypothesis (2_1_A)
      PeptideHypothesis (3_1_A)
    ProteinDetectionHypothesis 1_2: Score = 85, Peptide coverage = 12%, E-value = 0.055
      PeptideHypothesis (1_1_B)
      PeptideHypothesis (3_1_B)

• ProteinDetectionHypothesis 1_1 has 3 peptides: ESTLHLVLR, TLSDYNIQK, TITLEVEPSDTIENVK
• ProteinDetectionHypothesis 1_2 has 2 peptides: ESTLHLVLR, TLSDYNIQK

There is stronger evidence supporting hypothesis 1_1, but both hypotheses are placed within the same ambiguity group.
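One simple way to arrive at such ambiguity groups is to merge protein hypotheses whenever they share at least one peptide. The sketch below does this with plain Python sets. It is an assumption made for illustration only: it is not the grouping rule of mzIdentML or of any particular search engine, the identifiers `PDH_1_1` and `PDH_1_2` are shorthand for the two hypotheses on the slide, and the `unrelated` entry with peptide `AAAAK` is made up.

```python
def ambiguity_groups(protein_peptides):
    """Group proteins that are linked by at least one shared peptide.

    protein_peptides maps a protein id to an iterable of peptide sequences.
    Returns a sorted list of sorted protein-id lists, one per group.
    """
    groups = []  # each entry: (set of protein ids, set of peptides)
    for protein, peptides in protein_peptides.items():
        peptides = set(peptides)
        merged_proteins = {protein}
        merged_peptides = set(peptides)
        untouched = []
        for group_proteins, group_peptides in groups:
            if group_peptides & peptides:
                # Shared peptide: fold the existing group into the new one.
                merged_proteins |= group_proteins
                merged_peptides |= group_peptides
            else:
                untouched.append((group_proteins, group_peptides))
        groups = untouched + [(merged_proteins, merged_peptides)]
    return sorted(sorted(proteins) for proteins, _ in groups)


hypotheses = {
    "PDH_1_1": ["ESTLHLVLR", "TLSDYNIQK", "TITLEVEPSDTIENVK"],
    "PDH_1_2": ["ESTLHLVLR", "TLSDYNIQK"],
    "unrelated": ["AAAAK"],  # made-up entry to show a separate group
}

groups = ambiguity_groups(hypotheses)
print(groups)
```

Grouping records only that the hypotheses cannot be separated cleanly; the scores attached to each ProteinDetectionHypothesis still indicate which one is better supported.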

Page 16

mzIdentML export will be available from Mascot in the next release

Page 17

• Sequest converter produced by MPC (Germany) as part of the ProDac consortium: http://www.medizinisches-proteom-center.de
• Thermo also working on an “official” exporter
• Basic scripts available for converting other search engine formats (X!Tandem, Omssa, pepXML)
• Export in next version of Scaffold
• Database implementation in PRIDE is coming...

Page 18

mzQuantML

• Format to capture proteins quantified from MS data
  – Very early drafting
• Many methods of quantification
  – Label/tag based
    • Stable isotopes (SILAC)
    • Tags: ICAT / iTRAQ
  – Label-free
    • Extracted ion chromatogram: align parallel runs
    • Spectral counting
• Methods still in flux
  – New methods reported frequently in the literature
• Will need to reference back to spectra (+ chromatograms) and identifications
  – Needs more community input: please offer to help!

Page 19

Acknowledgements

• PSI workgroups:
  – Protein separation
    • Chair: Juan-Pablo Albar (ProteoRed)
  – Mass spectrometry
    • Chair: Eric Deutsch (ISB)
  – Proteomics Informatics
    • Chair: Andy Jones (Liverpool)
    • Co-Chair: David Creasy (Matrix Science)
  – Molecular interactions
    • Chair: Henning Hermjakob (and chair of PSI)
• and many developers worldwide...

See: http://www.psidev.info/