The Representation of Scientific Data
![Page 2: The Representation of Scientific Data](https://reader035.vdocuments.site/reader035/viewer/2022062804/56814a58550346895db77dab/html5/thumbnails/2.jpg)
Overview
• Recording, archiving and sharing the process and results of experiments is a challenge
What to store?
How to store it?
Why?
![Page 3: The Representation of Scientific Data](https://reader035.vdocuments.site/reader035/viewer/2022062804/56814a58550346895db77dab/html5/thumbnails/3.jpg)
Science is complicated
![Page 4: The Representation of Scientific Data](https://reader035.vdocuments.site/reader035/viewer/2022062804/56814a58550346895db77dab/html5/thumbnails/4.jpg)
Technology
• Complex experimental workflow
• Advances in instrumentation
• High-throughput methods
![Page 5: The Representation of Scientific Data](https://reader035.vdocuments.site/reader035/viewer/2022062804/56814a58550346895db77dab/html5/thumbnails/5.jpg)
![Page 6: The Representation of Scientific Data](https://reader035.vdocuments.site/reader035/viewer/2022062804/56814a58550346895db77dab/html5/thumbnails/6.jpg)
Analysis is complicated
    12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt
    12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt
    12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct
    12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt
    12421 taggtgactt gcctgttttt ttttaattgg gatcttaatt tttttaaatt attgatttgt
    12481 aggagctatt tatatattct ggatacaagt tctttatcag atacacagtt tgtgactatt
    12541 ttcttataag tctgtggttt ttatattaat gtttttattg atgactgttt tttacaattg
    12601 tggttaagta tacatgacat aaaacggatt atcttaacca ttttaaaatg taaaattcga
    12661 tggcattaag tacatccaca atattgtgca actatcacca ctatcatact ccaaaagggc
    12721 atccaatacc cattaagctg tcactcccca atctcccatt ttcccacccc tgacaatcaa
    12781 taacccattt tctgtctcta tggatttgcc tgttctggat attcatatta atagaatcaa
![Page 7: The Representation of Scientific Data](https://reader035.vdocuments.site/reader035/viewer/2022062804/56814a58550346895db77dab/html5/thumbnails/7.jpg)
Analysis
• New algorithms and software
• Data integration
• From multiple sources
• Genomics
• Proteomics
• Metabolomics
• Neuroscience
• Systems biology
![Page 8: The Representation of Scientific Data](https://reader035.vdocuments.site/reader035/viewer/2022062804/56814a58550346895db77dab/html5/thumbnails/8.jpg)
2D Image analysis
[Figure: four image panels, A to D. Labels: added alignment vector; alpha blend display animates between current and reference; current focus]
![Page 9: The Representation of Scientific Data](https://reader035.vdocuments.site/reader035/viewer/2022062804/56814a58550346895db77dab/html5/thumbnails/9.jpg)
Problems
• “In the standard model, one collects data, publishes a paper or papers and then gradually loses the original dataset.”
• Geoffrey Bowker, University of California, San Diego, “The New Knowledge Economy and Science and Technology Policy”
![Page 10: The Representation of Scientific Data](https://reader035.vdocuments.site/reader035/viewer/2022062804/56814a58550346895db77dab/html5/thumbnails/10.jpg)
Problems
• Large, complex datasets are commonplace
• Heterogeneous data formats
– Vendor specific, lab specific
• Multitude of analysis methods
– Proprietary, open source
![Page 11: The Representation of Scientific Data](https://reader035.vdocuments.site/reader035/viewer/2022062804/56814a58550346895db77dab/html5/thumbnails/11.jpg)
Benefits
• Knowledge discovery – results
• Sharing of best practice
• Evaluation of results
• Sharing of data
• Re-use
![Page 12: The Representation of Scientific Data](https://reader035.vdocuments.site/reader035/viewer/2022062804/56814a58550346895db77dab/html5/thumbnails/12.jpg)
Re-use of neuroscience datasets
• Data that are shared and can be interpreted can often be used to address multiple questions.
• Data that have been collected with one question in mind often turn out to be highly valuable for addressing other questions.
• (1) Hippocampus recordings for mapping place fields were the basis for high-profile papers addressing questions about the temporal organization of neural codes (PMID: 12891358).
• (2) Paired recordings using extracellular and intracellular electrodes, originally collected for detecting dendritically generated action potentials, provide ground truth for testing and comparing spike-sorting techniques (PMID: 10899214).
![Page 13: The Representation of Scientific Data](https://reader035.vdocuments.site/reader035/viewer/2022062804/56814a58550346895db77dab/html5/thumbnails/13.jpg)
CARMEN: Code, Analysis, Repository and Modelling for e-Neuroscience
www.carmen.org.uk
Engineering and Physical Sciences Research Council
![Page 14: The Representation of Scientific Data](https://reader035.vdocuments.site/reader035/viewer/2022062804/56814a58550346895db77dab/html5/thumbnails/14.jpg)
Virtual Laboratory for Neurophysiology
• Enabling sharing and collaborative exploitation of data, analysis code and expertise that are not physically collocated
![Page 15: The Representation of Scientific Data](https://reader035.vdocuments.site/reader035/viewer/2022062804/56814a58550346895db77dab/html5/thumbnails/15.jpg)
Cost
• Infrastructure
• Acquisition – data and metadata
• Developing a common representation
• Potential benefits are not always experienced by data producers
• Lab experimenter vs bioinformatician
![Page 16: The Representation of Scientific Data](https://reader035.vdocuments.site/reader035/viewer/2022062804/56814a58550346895db77dab/html5/thumbnails/16.jpg)
Data pyramid
Raw data
Derived data
Results
Processing
![Page 17: The Representation of Scientific Data](https://reader035.vdocuments.site/reader035/viewer/2022062804/56814a58550346895db77dab/html5/thumbnails/17.jpg)
Mass Spectrometry Data pyramid
Raw data
Derived data
Results
Processing
![Page 18: The Representation of Scientific Data](https://reader035.vdocuments.site/reader035/viewer/2022062804/56814a58550346895db77dab/html5/thumbnails/18.jpg)
How do we store the data?
• Dictated by form of access
• Raw data: typically vendor-specific formats for vendor-specific analysis software
• Derived data: unlimited formats; a higher level of access is required to determine results
• Results: often queries over derived data
• Problematic if derived data are represented in inconsistent structures
• A consistent representation is valuable
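As a rough illustration of the last two points, the sketch below shows how a single query function can serve two labs once their derived data share one structure. The field names (`sample`, `mz`, `intensity`), the values, and both "labs" are invented for this example and do not come from the slides.

```python
# Derived data from two hypothetical labs, stored in inconsistent structures.
lab_a = [{"sample": "s1", "mz": 445.12, "intensity": 1200.0}]
lab_b = [("s2", 445.10, 300.0)]  # same information, as a positional tuple


def normalise_lab_b(rows):
    """Convert lab B's tuples into the dictionary structure used by lab A."""
    return [{"sample": s, "mz": mz, "intensity": i} for s, mz, i in rows]


def peaks_above(records, threshold):
    """A 'result' expressed as a query over derived data."""
    return [r["sample"] for r in records if r["intensity"] > threshold]


# Once both sources share one structure, one query covers them all.
records = lab_a + normalise_lab_b(lab_b)
strong = peaks_above(records, 500.0)
print(strong)
```

Without the normalisation step, `peaks_above` would need a separate branch for every lab-specific layout, which is the cost the slide warns about.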
![Page 19: The Representation of Scientific Data](https://reader035.vdocuments.site/reader035/viewer/2022062804/56814a58550346895db77dab/html5/thumbnails/19.jpg)
Metadata
• Description of results
• Sample
• How it was generated
• Equipment
• Processing steps
• Expensive to capture
• Important to validate result
Lab-book
![Page 20: The Representation of Scientific Data](https://reader035.vdocuments.site/reader035/viewer/2022062804/56814a58550346895db77dab/html5/thumbnails/20.jpg)
Standards
• Science is a challenge
• Scientific data is complex
• Different data representations add further complexity to complex science
• We need a common representation of data to get back to just complex science
• Lots of individuals have created formats in isolation; these only work for their data in their lab
![Page 21: The Representation of Scientific Data](https://reader035.vdocuments.site/reader035/viewer/2022062804/56814a58550346895db77dab/html5/thumbnails/21.jpg)
What is a standard?
• “established by consensus and approved by a recognized body, that provides, for common and repeated use, rules, guidelines or characteristics for activities or their results, aimed at the achievement of the optimum degree of order in a given context”
• Source: BSI, http://www.bsi-global.com/en/Standards-and-Publications/About-standards/Glossary/
![Page 22: The Representation of Scientific Data](https://reader035.vdocuments.site/reader035/viewer/2022062804/56814a58550346895db77dab/html5/thumbnails/22.jpg)
Community standards development
![Page 23: The Representation of Scientific Data](https://reader035.vdocuments.site/reader035/viewer/2022062804/56814a58550346895db77dab/html5/thumbnails/23.jpg)
Knowledge
Standards: allow working together for knowledge discovery
![Page 24: The Representation of Scientific Data](https://reader035.vdocuments.site/reader035/viewer/2022062804/56814a58550346895db77dab/html5/thumbnails/24.jpg)
Standards bodies
• W3C – World Wide Web Consortium
• IEEE – Institute of Electrical and Electronics Engineers
• OMG – Object Management Group
![Page 25: The Representation of Scientific Data](https://reader035.vdocuments.site/reader035/viewer/2022062804/56814a58550346895db77dab/html5/thumbnails/25.jpg)
Life science communities
| Society | Domain | Website |
| --- | --- | --- |
| Genomic Standards Consortium (GSC) | Genomics | http://darwin.nox.ac.uk/gsc/ |
| Microarray and Gene Expression Data Society (MGED) | Genomics | www.mged.org |
| Proteomics Standards Initiative (PSI) | Proteomics | http://psidev.info |
| Metabolomics Standards Initiative (MSI) | Metabolomics | www.metabolomicssociety.org |
| Flow Cytometry experiment Community | Flow Cytometry | www.flowcyt.org |
![Page 26: The Representation of Scientific Data](https://reader035.vdocuments.site/reader035/viewer/2022062804/56814a58550346895db77dab/html5/thumbnails/26.jpg)
![Page 27: The Representation of Scientific Data](https://reader035.vdocuments.site/reader035/viewer/2022062804/56814a58550346895db77dab/html5/thumbnails/27.jpg)
Technologies for data standards
• Important to adopt a technology that provides a clear representation of the domain
• The model and the model documentation capture a shared understanding of the domain
• Many technologies exist which support modelling
• Each focuses on a different use, such as validation, code generation or data transmission
![Page 28: The Representation of Scientific Data](https://reader035.vdocuments.site/reader035/viewer/2022062804/56814a58550346895db77dab/html5/thumbnails/28.jpg)
Technologies being used
• Simple text documents or spreadsheets
• XML - Extensible Markup Language
• RDF – Resource Description Framework
• UML – Unified Modeling Language
• OWL – Web Ontology Language
• OBO – Open Biomedical Ontologies format
![Page 29: The Representation of Scientific Data](https://reader035.vdocuments.site/reader035/viewer/2022062804/56814a58550346895db77dab/html5/thumbnails/29.jpg)
Simple documents
• A list of what is required
• MIxxx: Minimum Information XXX
• MIAME
• Minimum Information About a Microarray Experiment
• MIAPE
• Minimum Information About a Proteomics Experiment
![Page 30: The Representation of Scientific Data](https://reader035.vdocuments.site/reader035/viewer/2022062804/56814a58550346895db77dab/html5/thumbnails/30.jpg)
MIAPE:GE
• Identifies the minimum information required to report the use of n-dimensional gel electrophoresis in a proteomics experiment
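A minimum-information document is essentially a checklist, so compliance can be checked mechanically. The sketch below assumes a made-up set of required fields; the real MIAPE or MIAME documents define their own, much longer lists, and none of the names here are taken from them.

```python
# Hypothetical required fields; a real MIxxx document defines its own list.
REQUIRED_FIELDS = {"sample", "protocol", "equipment", "contact"}


def missing_fields(report):
    """Return the required fields that are absent or empty in a report."""
    return sorted(f for f in REQUIRED_FIELDS if not report.get(f))


# An illustrative report with one empty field and one absent field.
report = {"sample": "mouse liver extract", "protocol": "2D gel", "equipment": ""}
print(missing_fields(report))
```

A repository or journal could run a check of this kind at submission time and return the list of missing items to the experimenter.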
![Page 31: The Representation of Scientific Data](https://reader035.vdocuments.site/reader035/viewer/2022062804/56814a58550346895db77dab/html5/thumbnails/31.jpg)
![Page 32: The Representation of Scientific Data](https://reader035.vdocuments.site/reader035/viewer/2022062804/56814a58550346895db77dab/html5/thumbnails/32.jpg)
![Page 33: The Representation of Scientific Data](https://reader035.vdocuments.site/reader035/viewer/2022062804/56814a58550346895db77dab/html5/thumbnails/33.jpg)
XML
• Widely used for representing biological information
• Mark up sections with elements
• Can be validated against a schema

    <lecture>
      <to>Bioinformatics students</to>
      <from>Frank Gibson</from>
      <title>Representation of scientific data</title>
      <feedback>Students all fell asleep</feedback>
    </lecture>
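The lecture example above can be read with Python's standard library. This is only a sketch: `xml.etree.ElementTree` parses XML but does not perform schema validation, so the list of expected child elements below is a simplified stand-in for a schema, chosen for this illustration.

```python
import xml.etree.ElementTree as ET

LECTURE_XML = """
<lecture>
  <to>Bioinformatics students</to>
  <from>Frank Gibson</from>
  <title>Representation of scientific data</title>
  <feedback>Students all fell asleep</feedback>
</lecture>
"""

# Stand-in for a schema: the child elements we expect, in order.
EXPECTED_CHILDREN = ["to", "from", "title", "feedback"]


def check_lecture(xml_text):
    """Parse the document and report whether it has the expected structure."""
    root = ET.fromstring(xml_text)
    children = [child.tag for child in root]
    return root.tag == "lecture" and children == EXPECTED_CHILDREN


root = ET.fromstring(LECTURE_XML)
title = root.findtext("title").strip()
print(check_lecture(LECTURE_XML), title)
```

Real validation against an XML Schema needs a schema-aware tool; the point here is only that marked-up elements can be addressed by name and checked by a program.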
![Page 34: The Representation of Scientific Data](https://reader035.vdocuments.site/reader035/viewer/2022062804/56814a58550346895db77dab/html5/thumbnails/34.jpg)
UML
• An implementation independent model
• Allows multiple technology implementations of the same model
• Such as
• XML, JAVA, Relational tables
![Page 35: The Representation of Scientific Data](https://reader035.vdocuments.site/reader035/viewer/2022062804/56814a58550346895db77dab/html5/thumbnails/35.jpg)
The numbers indicate the multiplicity of the relationship, with * meaning “many”. One or more instances of JetEngine can be associated with one or more instances of Aeroplane.
A filled diamond indicates containment. An Aeroplane cannot exist without a JetEngine.
An arrow shows the direction of the relationship. An open-headed arrow indicates inheritance. Pilot and Passenger are both kinds of Person, inheriting the attributes “name” and “DOB”.
[Figure labels: multiplicity 1..* at both ends of the JetEngine and Aeroplane association]
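One possible implementation of the diagram described above, written in Python, is sketched here. The class names follow the slide; the `serial` attribute, the example values and the way the 1..* multiplicity is enforced are assumptions made for illustration, and a UML model could equally be realised as XML or relational tables.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Person:
    name: str
    dob: str  # date of birth, kept as text for simplicity


@dataclass
class Pilot(Person):
    """Inherits name and dob from Person."""


@dataclass
class Passenger(Person):
    """Inherits name and dob from Person."""


@dataclass
class JetEngine:
    serial: str  # hypothetical attribute, not shown on the slide


@dataclass
class Aeroplane:
    engines: List[JetEngine] = field(default_factory=list)

    def __post_init__(self):
        # Enforce the 1..* multiplicity: at least one engine is required.
        if not self.engines:
            raise ValueError("An Aeroplane needs at least one JetEngine")


pilot = Pilot(name="A. Pilot", dob="1970-01-01")
plane = Aeroplane(engines=[JetEngine("E1"), JetEngine("E2")])
print(isinstance(pilot, Person), len(plane.engines))
```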
![Page 36: The Representation of Scientific Data](https://reader035.vdocuments.site/reader035/viewer/2022062804/56814a58550346895db77dab/html5/thumbnails/36.jpg)
Functional Genomics Experiment (FuGE)
• Model of common components in science investigations, such as materials, data, protocols, equipment and software.
• Provides a framework for capturing complete laboratory workflows, enabling the integration of pre-existing data formats.
![Page 37: The Representation of Scientific Data](https://reader035.vdocuments.site/reader035/viewer/2022062804/56814a58550346895db77dab/html5/thumbnails/37.jpg)
![Page 38: The Representation of Scientific Data](https://reader035.vdocuments.site/reader035/viewer/2022062804/56814a58550346895db77dab/html5/thumbnails/38.jpg)
GelML
![Page 39: The Representation of Scientific Data](https://reader035.vdocuments.site/reader035/viewer/2022062804/56814a58550346895db77dab/html5/thumbnails/39.jpg)
RDF
• Overcomes limited expressivity of XML
• Allows the semantic meaning of statements to be captured
[Figure: an RDF triple (subject, predicate, object) linking Slc2a4 and GTR4_MOUSE through the predicate hasGeneProduct]
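The subject, predicate, object structure can be sketched with plain Python tuples. This is not an RDF library, only an illustration of the data shape. It assumes the gene Slc2a4 is the subject and the protein GTR4_MOUSE is the object, which the figure text does not state explicitly; the second triple is invented to give the query something to exclude.

```python
# Each statement is a (subject, predicate, object) triple.
triples = [
    ("Slc2a4", "hasGeneProduct", "GTR4_MOUSE"),  # direction is an assumption
    ("GTR4_MOUSE", "type", "Protein"),  # illustrative extra statement
]


def match(pattern, store):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in store
        if all(p is None or p == v for p, v in zip(pattern, t))
    ]


# Ask: what are the gene products of Slc2a4?
products = [o for _, _, o in match(("Slc2a4", "hasGeneProduct", None), triples)]
print(products)
```

Because every statement has the same three-part shape, statements from different sources can be pooled into one store and queried together.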
![Page 40: The Representation of Scientific Data](https://reader035.vdocuments.site/reader035/viewer/2022062804/56814a58550346895db77dab/html5/thumbnails/40.jpg)
UniProt (beta) in RDF
![Page 41: The Representation of Scientific Data](https://reader035.vdocuments.site/reader035/viewer/2022062804/56814a58550346895db77dab/html5/thumbnails/41.jpg)
Ontologies for life science
• Emergence has occurred for two reasons
• Consistent annotation of data
• To add meaning and understanding that can be interpreted computationally
• Bio-ontologies registered on the OBO Foundry
![Page 42: The Representation of Scientific Data](https://reader035.vdocuments.site/reader035/viewer/2022062804/56814a58550346895db77dab/html5/thumbnails/42.jpg)
![Page 43: The Representation of Scientific Data](https://reader035.vdocuments.site/reader035/viewer/2022062804/56814a58550346895db77dab/html5/thumbnails/43.jpg)
Bio-ontologies
• OBO format
• Flat file format, more suited to controlled vocabularies, made popular by GO
• OWL
• W3C recommendation, designed for computers not humans
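To show what "flat file" means in practice, here is a sketch that reads two `[Term]` stanzas written in the OBO tag-value style. The identifiers in the `EX:` namespace and the term names are invented for this example and are not real ontology entries; the parser handles only the simple cases shown.

```python
OBO_TEXT = """\
[Term]
id: EX:0000001
name: separation technique

[Term]
id: EX:0000002
name: gel electrophoresis
is_a: EX:0000001 ! separation technique
"""


def parse_obo_terms(text):
    """Parse [Term] stanzas into a list of {tag: [values]} dictionaries."""
    terms, current = [], None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("[") and line.endswith("]"):
            # A new stanza starts; only [Term] stanzas are collected.
            current = {} if line == "[Term]" else None
            if current is not None:
                terms.append(current)
        elif line and current is not None and ": " in line:
            tag, value = line.split(": ", 1)
            # Drop the trailing "! comment" that may follow a value.
            value = value.split(" !", 1)[0].strip()
            current.setdefault(tag, []).append(value)
    return terms


terms = parse_obo_terms(OBO_TEXT)
parents = {t["id"][0]: t.get("is_a", []) for t in terms}
print(parents)
```

The format is easy for people to read and edit by hand, which fits the slide's remark that it suits controlled vocabularies.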
![Page 44: The Representation of Scientific Data](https://reader035.vdocuments.site/reader035/viewer/2022062804/56814a58550346895db77dab/html5/thumbnails/44.jpg)
sepCV in OBO
![Page 45: The Representation of Scientific Data](https://reader035.vdocuments.site/reader035/viewer/2022062804/56814a58550346895db77dab/html5/thumbnails/45.jpg)
OBI
• An ontology for all investigations in the life sciences
• Implemented in OWL
• Large community involvement
• sepCV to be integrated within OBI
![Page 46: The Representation of Scientific Data](https://reader035.vdocuments.site/reader035/viewer/2022062804/56814a58550346895db77dab/html5/thumbnails/46.jpg)
Tools
• Tools are important
• Biologists don’t want to look at XML
• Need data entry tools – a website…
• Direct export of data and metadata from instruments
• Equipment vendors and manufacturers need to be involved in the “community” of standards development
• Tools lag behind development of the standard
![Page 47: The Representation of Scientific Data](https://reader035.vdocuments.site/reader035/viewer/2022062804/56814a58550346895db77dab/html5/thumbnails/47.jpg)
Symba - data entry and storage
![Page 48: The Representation of Scientific Data](https://reader035.vdocuments.site/reader035/viewer/2022062804/56814a58550346895db77dab/html5/thumbnails/48.jpg)
The Representation of Scientific Data
The Road Map
![Page 49: The Representation of Scientific Data](https://reader035.vdocuments.site/reader035/viewer/2022062804/56814a58550346895db77dab/html5/thumbnails/49.jpg)
Patience
• Standards development is slow; it requires:
• A measure of technical and political consensus
• An organisational framework
• Individuals who are willing to contribute time and expertise, both domain experts and knowledge engineers (modellers)
![Page 50: The Representation of Scientific Data](https://reader035.vdocuments.site/reader035/viewer/2022062804/56814a58550346895db77dab/html5/thumbnails/50.jpg)
The Problem
• Identify the problem
• Identify the users that need the problem solved
• Requirements gathering – what do the users need?
• See if someone else has already done it!
• If so, use it and go to the pub
![Page 51: The Representation of Scientific Data](https://reader035.vdocuments.site/reader035/viewer/2022062804/56814a58550346895db77dab/html5/thumbnails/51.jpg)
Implementation
• Define the problem – MIxxx
• Model the problem – UML (FuGE)
• Generate an implementation (XML)
• Define semantics - Ontologies
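The step from model to implementation can be sketched as generating XML directly from a modelled class. The `Protocol` class below is a tiny stand-in invented for this example; it is not the FuGE Protocol class, and real toolchains generate far richer output from a UML model.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, fields


@dataclass
class Protocol:
    """A tiny stand-in for a modelled class; not the FuGE Protocol."""
    name: str
    equipment: str


def to_xml(obj):
    """Generate an XML element whose children mirror the model's attributes."""
    element = ET.Element(type(obj).__name__)
    for f in fields(obj):
        child = ET.SubElement(element, f.name)
        child.text = str(getattr(obj, f.name))
    return element


xml_text = ET.tostring(to_xml(Protocol("2D gel run", "gel tank")), encoding="unicode")
print(xml_text)
```

Keeping the model separate from the generated format is what allows the same model to be implemented in several technologies.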
![Page 52: The Representation of Scientific Data](https://reader035.vdocuments.site/reader035/viewer/2022062804/56814a58550346895db77dab/html5/thumbnails/52.jpg)
Testing and Review
Stage One: Requirements gathering
– Extensive interactions with the community
– Consideration of several (informal) use cases
– Internal generation of first draft of guidelines
Stage Two: Module testing
– Guidelines used to document real experiments
– Feedback gathered on coherence and practical usability
Stage Three: Committee review
– Build an invited panel of leaders in the particular technique
– Send draft for ‘review’ by experts on an individual basis
– Final round of discussion by panel on email list
Stage Four: Controlled release
– Make the module publicly available
– Recommend to organised groups and proactive individuals
– Provide mechanisms to gather feedback
– Released alongside practical examples of use cases
Stage Five: Enforcement
– Offer the module to journals, repositories and funders for review, with a view to their enforcing it (either to get published, or to get money)
Cycle
![Page 53: The Representation of Scientific Data](https://reader035.vdocuments.site/reader035/viewer/2022062804/56814a58550346895db77dab/html5/thumbnails/53.jpg)
[Figure: PSI document process flowchart]
1. Candidate Recommendation submitted to PSI Editor
2. PSI Editor reviews draft
– Revise: PSI Editor returns draft
– Pass: PSI Editor submits draft to PSI-SG
3. 15-day PSI-SG comment
4. PSI Editor reviews comments
– Revise
– Pass: PSI Editor posts and announces PSI Working Draft Proposal (PWD-R.P)
5. 30-day public comment
6. PSI Editor reviews comments
– Revise: PSI Editor returns draft, remove PWD from index
– Pass: PSI Editor posts and announces PSI Final Document Proposal (PFD-R.P)
7. PSI-WG submits PFD-R.P with supporting documents (tutorials, etc.) to PSI-SG requesting PFD-R status
8. PSI-SG reviews request
– Revise: PSI-SG provide feedback to WG chairs
– Pass: PSI-SG and PSI Editor conduct Formal External Review
9. 60-day formal review and public comment
10. PSI-SG examines reviews
– Revise
– Pass: PSI Editor posts and announces PSI Final Document (PFD-R)
![Page 54: The Representation of Scientific Data](https://reader035.vdocuments.site/reader035/viewer/2022062804/56814a58550346895db77dab/html5/thumbnails/54.jpg)
Tool support
• Tool support
• Can occur in parallel but often after release
• Abstraction away from the model
• Simple data entry – often website
![Page 55: The Representation of Scientific Data](https://reader035.vdocuments.site/reader035/viewer/2022062804/56814a58550346895db77dab/html5/thumbnails/55.jpg)
Standards for Gel electrophoresis
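[Figure: standards linking the laboratory to public repositories through data entry and transfer]
• Reporting guidelines: MIAPE GE, MIAPE GI
• Formats: GelML, GelInfoML
• Controlled vocabulary: sepCV
• Routes: I) GelML data entry tools; II) direct database submission; III) automated export of GelInfoML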
![Page 56: The Representation of Scientific Data](https://reader035.vdocuments.site/reader035/viewer/2022062804/56814a58550346895db77dab/html5/thumbnails/56.jpg)
Pitfalls
• Re-invention. Don’t re-invent the wheel! If it exists use it
• Over-ambition: compromise pragmatically. Don’t over-complicate it or it will not get used. Keep it simple, stupid
• Under-investment: of money and time, but most importantly in the people who will use it
![Page 57: The Representation of Scientific Data](https://reader035.vdocuments.site/reader035/viewer/2022062804/56814a58550346895db77dab/html5/thumbnails/57.jpg)
What is the point?
• Facilitate consistent computational analysis
• Develop one piece of code to do one thing instead of lots of code to do one thing
• Easier lab management of data
• Storage and analysis
• Allow data integration and systems biology
• Efficient science
![Page 58: The Representation of Scientific Data](https://reader035.vdocuments.site/reader035/viewer/2022062804/56814a58550346895db77dab/html5/thumbnails/58.jpg)
Take away message
• MIxxx
• FuGE
• OBI
• They have done the hard work
• Re-use, extend and contribute
![Page 59: The Representation of Scientific Data](https://reader035.vdocuments.site/reader035/viewer/2022062804/56814a58550346895db77dab/html5/thumbnails/59.jpg)
Questions?
![Page 60: The Representation of Scientific Data](https://reader035.vdocuments.site/reader035/viewer/2022062804/56814a58550346895db77dab/html5/thumbnails/60.jpg)
Data mining
[Cartoon: a data store surrounded by “mine” and “Keep out” signs. Caption: “Data is mine, mine, mine…”]