2013 01-14 ops-dataset_descriptions

Dataset Descriptions in Open PHACTS Alasdair J G Gray University of Manchester W3C HCLS Call – 14 January 2013 www.openphacts.org/specs/datadesc/ Authors: Christian Y. A. Brenninkmeijer, Chris Evelo, Carole Goble, Alasdair J. G. Gray, Andra Waagmeester and Egon L. Willighagen

Upload: alasdair-gray

Post on 29-Aug-2014


DESCRIPTION

Alice: "What version of ChEMBL are we using?" Bob: "Er…let me check. It's going to take a while, I'll get back to you." This simple question took us the best part of a month to resolve and involved several individuals. Knowing the provenance of your data is essential, especially when using large complex systems that process multiple datasets. The underlying issues of this simple question motivated us to improve the provenance data in the Open PHACTS project. We developed a guideline for dataset descriptions where the metadata is carried with the data. In this talk I will highlight the challenges we faced and give an overview of our metadata guidelines. Presentation given to the W3C Semantic Web for Health Care and Life Sciences Interest Group on 14 January 2013.

TRANSCRIPT

Page 1: 2013 01-14 ops-dataset_descriptions

Dataset Descriptions in Open PHACTS

Alasdair J G Gray
University of Manchester
W3C HCLS Call – 14 January 2013

www.openphacts.org/specs/datadesc/

Authors: Christian Y. A. Brenninkmeijer, Chris Evelo, Carole Goble, Alasdair J. G. Gray, Andra Waagmeester and Egon L. Willighagen

Page 2: 2013 01-14 ops-dataset_descriptions

Why?

Public Domain Drug Discovery Data: Pharma are accessing, processing, storing & re-processing

[Diagram: Literature, PubChem, Genbank, Patents, Databases, Downloads; Data Integration, Data Analysis, Firewalled Databases; Repeat @ each company]

Page 3: 2013 01-14 ops-dataset_descriptions

The Project

The Innovative Medicines Initiative
• EC funded public-private partnership for pharmaceutical research
• Focus on key problems
  – Efficacy, Safety, Education & Training, Knowledge Management

The Open PHACTS Project
• Create a semantic integration hub (“Open Pharmacological Space”)…
• Delivering services to support on-going drug discovery programs in pharma and public domain
• Not just another project; leading academics in semantics, pharmacology and informatics, driven by solid industry business requirements
• 13 academic partners, 9 pharmaceutical companies, 6 SMEs
• Work split into clusters:
  – Technical Build (focus here)
  – Scientific Drive
  – Community & Sustainability

Page 4: 2013 01-14 ops-dataset_descriptions

Architecture

[Diagram components:]
• User Interfaces & Applications
• Linked Data API
• Linked Data Cache
• Identity Mapping Service
• Identity Resolution Service
• Domain Specific Services
• Data

Page 5: 2013 01-14 ops-dataset_descriptions

Datasets and Links

Page 6: 2013 01-14 ops-dataset_descriptions

ChemSpider
• ChemSpider aggregates data from over 400 sources
• Central integration point for chemicals in OPS
• OPS data covers
  – ChEBI
  – ChEMBL
  – DrugBank

14 January 2013 OPS Dataset Descriptions – A. J. G. Gray 6

Page 7: 2013 01-14 ops-dataset_descriptions

What version of ChEMBL? ~Jan 2012
• ChemSpider: EBI SDF file
  – ChEMBL 13
• Data Cache: Chem2Bio2RDF ChEMBL RDF
  – File downloaded May 2011
  – Chem2Bio2RDF metadata webpages: ChEMBL 8
  – File: ChEMBL 2
• Mapping Server: Kasabi ChEMBL RDF file
  – ChEMBL 12
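The disagreement on this slide can be checked mechanically once each version claim is written down. The following is a minimal sketch, not part of the talk: the claims are copied from the slide, while the function and variable names are made up for illustration.

```python
from collections import defaultdict

# (component, where the claim comes from, claimed ChEMBL release), as listed on the slide.
claims = [
    ("ChemSpider", "EBI SDF file", "13"),
    ("Data Cache", "Chem2Bio2RDF metadata webpages", "8"),
    ("Data Cache", "Chem2Bio2RDF file", "2"),
    ("Mapping Server", "Kasabi ChEMBL RDF file", "12"),
]


def find_conflicts(claims):
    """Return (components whose own sources disagree, all releases claimed anywhere)."""
    by_component = defaultdict(set)
    for component, _source, release in claims:
        by_component[component].add(release)
    internal = {
        component: sorted(releases, key=int)
        for component, releases in by_component.items()
        if len(releases) > 1
    }
    all_releases = sorted({release for _, _, release in claims}, key=int)
    return internal, all_releases


internal_conflicts, all_releases = find_conflicts(claims)
print(internal_conflicts)  # {'Data Cache': ['2', '8']}
print(all_releases)        # ['2', '8', '12', '13']
```

With more than one entry in `all_releases`, the question "what version of ChEMBL are we using?" has no single answer, which is the situation the slide describes.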


Page 8: 2013 01-14 ops-dataset_descriptions

For the record
• OPS currently uses ChEMBL 13
  – RDF generated from EBI database dump
  – Published at linkedchemistry.info
    • Credit: Egon Willighagen
• Soon moving to ChEMBL 15
  – RDF published by EBI


Page 9: 2013 01-14 ops-dataset_descriptions

Challenges
• Datasets available
  – In many versions over time
  – In different formats
  – From many mirrors/registries
• Files do not carry metadata
• Registries
  – Can be out-of-date
  – Can contain conflicting information


Page 10: 2013 01-14 ops-dataset_descriptions

VoID: Vocabulary of Interlinked Datasets
• Describes RDF datasets
  – W3C Note: http://www.w3.org/TR/void/
• Metadata carried with data
  – Directly embedded or linked (void:inDataset)
• Problems
  – Very generic
  – No checklist of requisite fields
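To make the two ways of carrying metadata concrete, here is a minimal sketch that emits Turtle for an embedded VoID description and for a file that only links to one through void:inDataset. The `void:` and `dcterms:` terms are real vocabulary terms; every `example.org` URI, the title and the helper function names are assumptions made for illustration, and the literal escaping only covers backslashes and double quotes.

```python
PREFIXES = (
    "@prefix void: <http://rdfs.org/ns/void#> .\n"
    "@prefix dcterms: <http://purl.org/dc/terms/> .\n"
)


def turtle_literal(text):
    """Quote a single-line string as a Turtle literal (escapes backslash and quote only)."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def void_description(dataset_uri, title, description):
    """Embedded form: the description itself travels inside the data file."""
    return (
        f"<{dataset_uri}> a void:Dataset ;\n"
        f"    dcterms:title {turtle_literal(title)} ;\n"
        f"    dcterms:description {turtle_literal(description)} .\n"
    )


def link_to_dataset(document_uri, dataset_uri):
    """Linked form: the data file points at a description published elsewhere."""
    return f"<{document_uri}> void:inDataset <{dataset_uri}> .\n"


# Hypothetical URIs, used only to show the shape of the output.
DATASET = "http://example.org/void.ttl#exampleDataset"
embedded = PREFIXES + void_description(
    DATASET, "Example dataset", "A hypothetical RDF dataset."
)
linked = PREFIXES + link_to_dataset("http://example.org/data/part1.ttl", DATASET)
print(embedded)
print(linked)
```

Either form keeps the description reachable from the file itself, so it does not depend on a registry being up to date.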


Page 11: 2013 01-14 ops-dataset_descriptions

Provenance Vocabularies
• Dublin Core Terms
  – Widely used
  – Terms too generic to give proper credit
    • “Date: A point or period of time associated with an event in the lifecycle of the resource.”
• PROV
  – New W3C standard: www.w3.org/2011/prov
  – Generic framework for exchanging data
  – Does not contain required predicates


Page 12: 2013 01-14 ops-dataset_descriptions

PAV: Provenance, Authoring and Versioning Vocabulary

http://code.google.com/p/pav-ontology/wiki/Homepage
• Easy to understand predicates
  – http://purl.org/pav/
• Right level of granularity
  – Distinguishes: author/creator/curator
  – Captures source of data:
    • import/derived/accessed
    • version/previousVersion
• Being aligned with PROV-O
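As an illustration of the granularity listed above, the sketch below maps each distinction on the slide to a PAV term and produces plain triples. The term names reflect my reading of the PAV vocabulary and should be checked against the published ontology; in particular, mapping "accessed" to `sourceAccessedAt` is an assumption. The `example.org` URIs and the function name are hypothetical.

```python
PAV_NS = "http://purl.org/pav/"

# Slide concept -> PAV local name (assumed mapping; verify against the PAV ontology).
PAV_TERMS = {
    "author": "authoredBy",
    "creator": "createdBy",
    "curator": "curatedBy",
    "import": "importedFrom",
    "derived": "derivedFrom",
    "accessed": "sourceAccessedAt",
    "version": "version",
    "previousVersion": "previousVersion",
}


def pav_statements(subject, facts):
    """Turn {slide concept: value} into (subject, predicate URI, value) triples."""
    unknown = sorted(set(facts) - set(PAV_TERMS))
    if unknown:
        raise KeyError(f"no PAV mapping for: {', '.join(unknown)}")
    return [
        (subject, PAV_NS + PAV_TERMS[concept], value)
        for concept, value in facts.items()
    ]


# Hypothetical dataset and agents, used only to show the output shape.
statements = pav_statements(
    "http://example.org/dataset/v2",
    {
        "creator": "http://example.org/people/alice",
        "import": "http://example.org/upstream/dump",
        "version": "2",
        "previousVersion": "http://example.org/dataset/v1",
    },
)
for subject, predicate, value in statements:
    print(subject, predicate, value)
```

Keeping author, creator and curator as separate predicates is what lets each contributor be credited for the role they actually played, which a single generic Dublin Core term cannot express.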

Page 13: 2013 01-14 ops-dataset_descriptions

Dataset Descriptions in the Open Pharmacological Space


Page 14: 2013 01-14 ops-dataset_descriptions

Related Work
• Registries: DataHub, MIRIAM
  – Do not tie metadata with the data
  – No checklist of attributes
• BioDBCore
  – Checklist
    • Similar information captured
    • Includes point of contact information
  – Not tied to the data


Page 15: 2013 01-14 ops-dataset_descriptions

Realisation of Dataset Descriptions

• Needs to be incorporated into data publishing pipeline
• Hard for publishers to provide conformant descriptions
  – Datasets are complex
  – Evolve over time
  – Seen as yet another burden
• Validation tool provided
  – http://openphacts.cs.man.ac.uk:9090/OPS-IMS/validate
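The core of a checklist-style validation can be sketched in a few lines. This is not the Open PHACTS validator and says nothing about how that tool works; the list of required fields below is a placeholder chosen for illustration, and the actual checklist is defined in the guideline at www.openphacts.org/specs/datadesc/.

```python
# Placeholder checklist: NOT the real Open PHACTS list of required fields.
REQUIRED_FIELDS = (
    "dcterms:title",
    "dcterms:description",
    "dcterms:license",
    "pav:version",
)


def missing_fields(description, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or blank, in checklist order."""
    missing = []
    for field in required:
        value = description.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


# Hypothetical description with one field absent and one left blank.
candidate = {
    "dcterms:title": "Example dataset",
    "dcterms:description": "A hypothetical RDF dataset.",
    "pav:version": "",
}
print(missing_fields(candidate))  # ['dcterms:license', 'pav:version']
```

Running a check like this inside the publishing pipeline reports gaps at the point where the publisher can still fix them.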


Page 16: 2013 01-14 ops-dataset_descriptions

Future Vision
• Provide rich and accurate provenance trail of data
  – Alignment with BioDBCore
• One standard to rule them all
  – Automatic pipeline from VoID file to registries
• Write once, use many times


Page 17: 2013 01-14 ops-dataset_descriptions

Thank you

[email protected]/~graya/

www.openphacts.org
