Enabling Systems Biology:
Development and Implementation of
Proteomics Standards and Services
Engineering 1850
•Nuts and bolts fit perfectly
together, but only if they
originate from the same
factory
•Standardisation proposal in
1864 by William Sellers
•It was not generally accepted until after WWII, though …
Proteomics today
•Proteomics results are perfectly
compatible, but only if they are from
the same lab, from the same
software
•Fragmentation of proteomics data
•“Publish and vanish”
Proteomics Data Sharing
Incompleteness of the
public record
•Nucleotide sequences, protein sequences,
macromolecular 3D structures, DNA microarrays:
Database submission mandatory
•Proteomics: No standardised reporting, no standard
database submission
•Proteomics data is generated at a high rate, and lost at
a high rate
•Simple questions like “Give me all tissues in which my protein of interest was identified” are currently unanswerable
•Experiments are repeated unnecessarily, and the field advances more slowly than necessary
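As a rough illustration of why uniform reporting matters, the tissue question above becomes a one-line query once identifications are deposited as consistent records. This is only a sketch: the record fields (`protein`, `tissue`, `lab`) and the sample values are invented for the example and do not correspond to any actual database schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Identification:
    protein: str  # hypothetical accession, for illustration only
    tissue: str
    lab: str


def tissues_for(protein, records):
    """Return the sorted, de-duplicated tissues in which `protein` was identified."""
    return sorted({r.tissue for r in records if r.protein == protein})


# Invented sample data from three imaginary labs.
records = [
    Identification("PROT_A", "liver", "Lab A"),
    Identification("PROT_A", "plasma", "Lab B"),
    Identification("PROT_B", "liver", "Lab C"),
    Identification("PROT_A", "liver", "Lab C"),
]

print(tissues_for("PROT_A", records))
```

The query only works because every lab reports the same fields in the same way, which is the point of the slide.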
The tide is turning, though …
•Bradshaw RA, Burlingame AL, Carr S, Aebersold R.
Reporting protein identification data: the next
generation of guidelines.
Mol Cell Proteomics. 2006 May;5(5):787-8.
•Wilkins et al.
Guidelines for the next 10 years of proteomics.
Proteomics. 2006 Jan;6(1):4-8.
•Nature Biotechnology 2006, Nov:
• Editorial: Standard Operating Procedures
• Burgoon LD. The need for standards, not guidelines, in biological data
reporting and sharing.
• Ball C. Are we stuck in standards?
•Nature Biotechnology: Community Consultation on
Standards: http://www.nature.com/nbt/consult/index.html
Community Consultation
•Nature Biotechnology community consultation
•http://www.nature.com/nbt/consult/index.html
•Currently nine “standards” papers on the NBT website for public consultation, six of them from PSI
• MIAPE parent
• MIAPE MS
• MIAPE MS Informatics
• MIAPE Gel
• MIAPE MI
• PSI MO
HUPO Proteomics Standards Initiative
•Develop data format standards
•Data representation and annotation
standards
•Involve data producers, database providers,
software producers, publishers
•Open community initiative
PSI deliverables
•Minimum Information about a Proteomics
Experiment (MIAPE)
•XML schema
•Detailed controlled vocabularies
•Support tools
Document process
• Significant investments
into PSI standards require
formal process for PSI
standards
• Process ensures good
balance between expert
design and public scrutiny
• Document process
approved at PSI spring
meeting, San Francisco,
April 2006
• Now in implementation
PSI work groups
(Diagram: PSI-MI Molecular Interactions; PSI-MS Mass Spectrometry; PSI-MOD; Separations; FuGE Functional Genomics Experiment model)
MGED collaboration
(Diagram: PSI-MI Molecular Interactions; PSI-MS Mass Spectrometry; MGED MIAME / MAGE-OM Microarray Standard; PSI-MOD; Separations; FuGE Functional Genomics Experiment model)
PSI work groups: MI
(Diagram: PSI-MI Molecular Interactions; PSI-MS Mass Spectrometry; MGED MIAME / MAGE-OM Microarray Standard; PSI-MOD; Separations)
PSI-MI community standard
•Community standard for Molecular Interactions
•Jointly developed by major data providers: BIND, CellZome, DIP, GSK, HPRD, Hybrigenics, IntAct, MINT, MIPS,
Serono, U. Bielefeld, U. Bordeaux, U. Cambridge, and others
•XML schema
•Controlled vocabularies
•Tools
•Minimum requirements (submitted)
•Implemented by major data providers
PSI-MI controlled vocabularies
•PSI develops not only formats, but also controlled vocabularies/ontologies where necessary
•Example: > 20 ways to write: yeast two hybrid, Y2H, 2H, yeast-two-hybrid, two-hybrid, …
•Ca. 800 terms, fully defined and cross-referenced
•GO format
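The “yeast two hybrid” spelling problem can be sketched as a lookup that maps free-text variants onto one controlled term. This is a minimal illustration, not the PSI-MI tooling: the term identifier `EX:0001` is a made-up placeholder rather than a real accession, and the synonym list covers only the variants named on the slide.

```python
import re

# Hypothetical controlled-vocabulary entry; "EX:0001" is a placeholder ID,
# not an actual PSI-MI accession.
TERM = {"id": "EX:0001", "name": "two hybrid"}

# Normalised spellings that should all resolve to the same term.
SYNONYMS = {"yeast two hybrid", "y2h", "2h", "two hybrid"}


def normalise(label):
    """Lower-case the label and collapse hyphens and whitespace to single spaces."""
    return re.sub(r"[\s\-]+", " ", label.strip().lower())


def lookup(label):
    """Return the controlled term for a free-text label, or None if it is unknown."""
    return TERM if normalise(label) in SYNONYMS else None


for variant in ["yeast two hybrid", "Y2H", "2H", "yeast-two-hybrid", "two-hybrid"]:
    print(variant, "->", lookup(variant)["id"])
```

Annotating data with the term identifier instead of free text is what lets databases compare and merge records.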
PSI-MI format development
•Iterative development:
Do the feasible first, leave the unfeasible for later
•Version 1.0 published in February 2004
• The HUPO PSI Molecular Interaction Format - A community standard for the
representation of protein interaction data.
Henning Hermjakob et al,
Nature Biotechnology 2004, 22, 176-183.
•Version 2.5 released December 2005
• Technical improvements
• Quantitative parameters
• Additional interactor types: DNA, RNA, small molecules
• Additional, simplified tabular format
• Submitted
PSI-MI Support
•Data: DIP, HPRD, IntAct, MINT, MIPS, …
•Tools
•Conversion Tabular – PSI XML
•XML -> HTML
•Semantic validation
•Visualisation
•PimWalker ®: http://pim.hybrigenics.com/pimwalker
•ProViz: http://cbi.labri.fr/eng/proviz.htm
•Cytoscape: http://www.cytoscape.org
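A tabular-to-XML conversion of the kind listed under Tools might look roughly like the sketch below. The three-column layout (interactor A, interactor B, detection method) and the element names `interactionList`, `interaction` and `participant` are assumptions chosen for illustration; they are not the actual PSI-MI tabular columns or XML schema.

```python
import csv
import io
import xml.etree.ElementTree as ET


def tabular_to_xml(text):
    """Convert tab-separated rows (interactor A, interactor B, method) to an XML string."""
    root = ET.Element("interactionList")
    for a, b, method in csv.reader(io.StringIO(text), delimiter="\t"):
        # One <interaction> element per row, with the method as an attribute.
        interaction = ET.SubElement(root, "interaction", method=method)
        ET.SubElement(interaction, "participant").text = a
        ET.SubElement(interaction, "participant").text = b
    return ET.tostring(root, encoding="unicode")


# Invented sample rows; the identifiers are placeholders.
sample = "P1\tP2\ttwo hybrid\nP2\tP3\tpull down\n"
print(tabular_to_xml(sample))
```

A real converter would also validate identifiers and method names against the controlled vocabularies, which is the job of the semantic validation tool.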
IntAct as an implementation of PSI MI
•Curated molecular interaction database
•128,000 binary interactions
•Open source
•Open data
•http://www.ebi.ac.uk/intact
IntAct curation
•Detailed, “deep” curation
•Based on full text papers
•Experimental conditions
•Detailed interactor identification
•Use of detailed controlled vocabularies
•Annotation of binding domains, protein modifications, etc.
The IMEx consortium
•International Molecular-Interaction Exchange
consortium
•DIP, IntAct, MINT, MIPS
are establishing an exchange of curated literature data
in PSI-MI format from summer 2006 onwards to
provide a network of stable, comprehensive resources
for molecular interaction data
•Aims:
•Consistent body of public data
•Avoid redundant curation
•http://imex.sf.net
IMEx data deposition
•Deposition of published data in one of the IMEx
databases is strongly encouraged
•Any dataset submitted to one of the IMEx databases
will be replicated to the other IMEx databases
•IMEx partners are already co-ordinating their curation
efforts
•Public guidelines: Orchard et al. The Minimum Information on a Molecular Interaction
Experiment (MIMIx).
Nature Biotechnology, accepted.
PSI work groups: MS
(Diagram: PSI-MI Molecular Interactions; PSI-MS Mass Spectrometry; MGED MIAME / MAGE-OM Microarray Standard; PSI-MOD; Separations; FuGE Functional Genomics Experiment model)
Mass spectrometry: PSI-MS
•mzData format as common instrument output format
•Format beta version accepted in Nice, April 2004
•EBI workshop July 2004
•Version 1.05 released January 4, 2005
•Next revision spring 2007, in collaboration with the Institute
for Systems Biology (ISB), merging mzData and mzXML
•Controlled vocabularies developed jointly with ASTM
•Key concept:
Request direct vendor support to avoid version problems due
to vendor API changes
•Move to mzML (merge of mzData and mzXML)
Current mzData support
•Applied Biosystems
•Bruker
•EBI
•GeneBio
• Insilicos
•Kratos
•MatrixScience
•Swiss Institute of Bioinformatics
•GPM
•Thermo Electron
•Waters
Mass spectrometry: PSI-MS
•analysisXML format as common search
engine output format
•Suggested in Nice, April 2004
•Further developed in Siena, April 2005
•Aim: Facilitate comparison and archiving of search
engine output, in particular in comparative projects
like the HUPO PPP
•Beta release under internal review
PSI-MS based data flow
(Diagram: mass spectrometer A and mass spectrometer B; proprietary format; converter; mzData; search engine A and search engine B; analysisXML; public repository)
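The idea behind this data flow can be sketched in a few lines: each proprietary format needs one converter into a shared peak-list structure, after which any downstream tool works unchanged. Everything here is a toy under stated assumptions. `PeakList` merely stands in for a common format such as mzData, the two vendor layouts are invented, and `toy_search` is a placeholder that reports the most intense peak rather than performing a real database search.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PeakList:
    """Stand-in for a common instrument output format (not the mzData schema)."""
    mz: List[float]
    intensity: List[float]


def convert_vendor_a(raw: str) -> PeakList:
    # Hypothetical vendor A layout: "mz:intensity" pairs separated by ";".
    pairs = [p.split(":") for p in raw.split(";") if p]
    return PeakList([float(m) for m, _ in pairs], [float(i) for _, i in pairs])


def convert_vendor_b(raw: str) -> PeakList:
    # Hypothetical vendor B layout: one "mz intensity" pair per line.
    pairs = [line.split() for line in raw.splitlines() if line.strip()]
    return PeakList([float(m) for m, _ in pairs], [float(i) for _, i in pairs])


def toy_search(peaks: PeakList) -> dict:
    # Placeholder for a search engine: it only reports the most intense peak,
    # standing in for a result written in a common output format.
    top = max(range(len(peaks.mz)), key=lambda k: peaks.intensity[k])
    return {"base_peak_mz": peaks.mz[top], "n_peaks": len(peaks.mz)}


# Results from both instruments end up in one repository in the same shape.
repository = []
for convert, raw in [
    (convert_vendor_a, "100.5:20;200.25:80"),
    (convert_vendor_b, "150.0 10\n250.5 90\n350.0 30"),
]:
    repository.append(toy_search(convert(raw)))

print(repository)
```

With N instruments and M search engines, a shared intermediate format needs N converters instead of N × M pairwise ones.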
PRIDE – Protein Identification Database
•Turns publicly available data into publicly accessible
data
•Protein identifications
•Experimental detail
•Peak lists
•Linkout to raw data
•Fully open source
•Fully open data
•Implementation of PSI standards as they are released
Data views
Experiment Comparison
PRIDE private mode
(Diagram: Lab A, Lab B, Lab C; private data in a PRIDE “Collaboration”; Comparison; Reviewer; publicly available data)
•Private mode allows data analysis within a collaboration
•PRIDE tools are already accessible in private mode, in particular experiment comparison (alpha)
•On manuscript submission, reviewers can access the data in standard format
•On manuscript publication, the data becomes public
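The three visibility stages described for private mode can be written down as a small access rule. This is a sketch of the policy as stated on the slide, not PRIDE's actual implementation; the stage names and the role labels (`collaborator`, `reviewer`, `anyone`) are hypothetical.

```python
from enum import Enum


class Stage(Enum):
    PRIVATE = 1       # data shared within the collaboration only
    UNDER_REVIEW = 2  # manuscript submitted
    PUBLIC = 3        # manuscript published


def can_access(role, stage):
    """Return True if a user with `role` may see a data set at `stage`.

    Roles are hypothetical labels: 'collaborator', 'reviewer', 'anyone'.
    """
    if stage is Stage.PUBLIC:
        return True  # published data is open to everyone
    if role == "collaborator":
        return True  # collaboration members always see their own data
    # Reviewers gain access only once the manuscript has been submitted.
    return role == "reviewer" and stage is Stage.UNDER_REVIEW


for stage in Stage:
    print(stage.name, {r: can_access(r, stage) for r in ("collaborator", "reviewer", "anyone")})
```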
Data entry
•Register
•XML-based data deposition
• Target group: Larger labs with good bioinformatics support, large scale
data sets
•Generate PRIDE XML directly
•Supporting toolkit currently under development
• Fully automated, web-based submission
•Excel-based
• Target group: Smaller labs, low to medium throughput
• “Biologists love Excel”
•Advanced Excel spreadsheet will allow user input in “familiar” Excel
environment
•Spreadsheet supports use of controlled vocabularies and validation
•Automatic submission directly from the spreadsheet into PRIDE
Medium term vision
•Collaborate with regional or project centers for data collection and
analysis
•Establish data exchange and collaboration between PeptideAtlas,
GPMDB, PRIDE, PRIDE@NPC, …
•Provide a set of compatible, synchronized, public resources for protein
identification data
(Diagram: Regional Center; PRIDE; PeptideAtlas; Regional Center; HUPO xPP; PRIDE@NPC)
Acknowledgements
•All PSI participants
• Luisa Montecchi-Palazzi
• Sandra Orchard
• Chris Taylor
• Randy Julian (Lilly)
• Patrick Pedrioli (ISB, ETH)
•PRIDE
• Phil Jones
• Lennart Martens
• Richard Cote
• Sebastian Klie
•BBSRC ISPIDER Grant
•BBSRC ProteomeHarvest Grant
•EU ProDaC grant
•Henning Hermjakob
•http://www.psidev.info
Resources
•http://psidev.sf.net
•http://imex.sf.net
•http://www.ebi.ac.uk/intact
•http://www.ebi.ac.uk/pride
•http://www.nature.com/nbt/consult/index.html