Ontology-based multi-domain metadata for research data
management using triple stores
João Rocha da Silva [email protected]
Faculdade de Engenharia da
Universidade do Porto / INESC TEC
Cristina Ribeiro [email protected]
DEI, Faculdade de Engenharia da
Universidade do Porto / INESC TEC
João Correia Lopes [email protected]
IDEAS '14, July 07 - 09 2014, Porto, Portugal
Contents
• Diverse metadata: relational modeling challenges
• Current approaches built on relational databases
• Dendro: graph-based research data management
• Live demo
• Conclusions
Problem: diverse metadata
Relational modeling challenges
                  Analytical Chemistry Dataset    Mechanical Engineering Dataset    …
Generic           Author                          Author                            …
                  Description                     Description
                  Creation date                   Creation date
                  …                               …
Domain Specific   Sample Count                    Initial Crack Length              …
                  Analysed Substance              Specimen Type
                  …                               …
Common challenges in RDB schema modeling
• Entities with unknown attributes at time of modeling
• Time-variant attribute values
• Inheritance / sub-class mapping
• Resource hierarchies (parents of parents…)
• Schemas rely on external documentation
Data management and description platforms
Study of relational models
DSpace
• Academic publications management platform
• Not targeted specifically at data
• More than 1000 active installations
• Mature open-source codebase
• Designed for self-deposit by common users
• Good deposit workflow (validation, licensing…)
U.Porto Open Repository homepage (http://repositorio-aberto.up.pt), powered by DSpace
A thesis record in the repository (http://repositorio-aberto.up.pt/handle/10216/58508), powered by DSpace
[Figure: DSpace metadata model. Entities shown: Item, Bitstream, Metadata Schema, Metadata Descriptor and metadata value, connected by one-to-many (1 to *) relationships.]
DSpace: new requirements
• Metadata profiles for objects other than Items
• Descriptor hierarchy for specialization
• Collaborative schema derivation
• Validation of metadata completeness against different schemas
• Restricting possible metadata for each type of resource
CKAN
• Open-source data publishing platform
• Deposit requires minimal metadata at first
• Flexible metadata model
[Figure: CKAN database schema (source: CKAN). Annotations: entity with variable, time-dependent attributes; fixed attributes; attribute name; value (always varchar); timestamps.]
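The pattern annotated in the figure above (fixed attributes as columns, variable attributes as name/value/timestamp rows) can be sketched with an in-memory SQLite database. This is only an illustration of the general entity-attribute-value idea: the table names, column names and sample values are assumptions made for this sketch and are not CKAN's actual schema.

```python
import sqlite3

# Hypothetical schema: one table for the fixed attributes, one table holding
# every variable attribute as a (name, value, timestamp) row.
SCHEMA = """
CREATE TABLE dataset (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL                -- fixed attribute
);
CREATE TABLE dataset_attribute (
    dataset_id  INTEGER NOT NULL REFERENCES dataset(id),
    name        TEXT NOT NULL,        -- attribute name
    value       TEXT,                 -- value, always stored as text
    recorded_at TEXT NOT NULL         -- ISO 8601 timestamp of this value
);
"""


def build_demo_db() -> sqlite3.Connection:
    """Create the sketch schema and load a few illustrative rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO dataset (id, name) VALUES (1, 'DCB Base Data')")
    conn.executemany(
        "INSERT INTO dataset_attribute VALUES (?, ?, ?, ?)",
        [
            (1, "specimen_width", "120", "2014-01-01"),
            (1, "initial_crack_length", "250", "2014-01-01"),
            (1, "initial_crack_length", "280", "2014-02-01"),  # later value
        ],
    )
    return conn


def current_attributes(conn: sqlite3.Connection, dataset_id: int) -> dict:
    """Return the most recent value of every attribute of one dataset."""
    rows = conn.execute(
        """
        SELECT a.name, a.value
        FROM dataset_attribute AS a
        WHERE a.dataset_id = ?
          AND a.recorded_at = (
              SELECT MAX(b.recorded_at)
              FROM dataset_attribute AS b
              WHERE b.dataset_id = a.dataset_id AND b.name = a.name
          )
        """,
        (dataset_id,),
    ).fetchall()
    return dict(rows)
```

Calling current_attributes(build_demo_db(), 1) returns the latest value per attribute. Note that even this simple lookup needs a correlated subquery, and every value comes back as a string, so numeric comparisons need explicit casts.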
Invenio
• Software behind Zenodo, a data publishing portal
• Static metadata model
• Very complex relational schema generated by business logic code
• Tight coupling between DB and code
• Open-Source
541 tables, no foreign keys
Ontologies
Semantic annotation for richer metadata
[Figure: RDF description of a file, built up one statement at a time]
Subject: http://dendro.fe.up.pt/project/datanotes/data/base%20data.xls
  nie:isLogicalPartOf      http://dendro.fe.up.pt/project/datanotes/data
  rdf:type                 nie:File
  dc:title                 “Base data of the DCB experiments”
  nie:title                base data.xls
  dcb:initialCrackLength   base data.xls
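The statements in the figure above can be held as plain (subject, predicate, object) triples. The following is a minimal sketch in pure Python, not a real triple store: the match and describe helpers are hypothetical names introduced here, prefixed names such as nie:File are kept as opaque strings, and no distinction is made between IRIs and literals.

```python
from collections import defaultdict
from typing import Optional

Triple = tuple[str, str, str]

FOLDER = "http://dendro.fe.up.pt/project/datanotes/data"
FILE = FOLDER + "/base%20data.xls"

# Statements taken from the figure; everything is stored as a string.
TRIPLES: set[Triple] = {
    (FILE, "nie:isLogicalPartOf", FOLDER),
    (FILE, "rdf:type", "nie:File"),
    (FILE, "dc:title", "Base data of the DCB experiments"),
    (FILE, "nie:title", "base data.xls"),
}


def match(
    triples: set[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> list[Triple]:
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t
        for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


def describe(triples: set[Triple], subject: str) -> dict[str, list[str]]:
    """Group every statement about one subject by its predicate."""
    description = defaultdict(list)
    for _, predicate, obj in match(triples, s=subject):
        description[predicate].append(obj)
    return dict(description)
```

Adding a new descriptor to the file is just adding one more triple; no table has to be altered, which is the flexibility the slides contrast with relational schemas.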
Semantic MediaWiki
• Semantic extension of MediaWiki, the code behind Wikipedia
• Semantic Links between pages
• Uses ontologies
• Strong emphasis on page versioning
• DB schema built around the time dimension
Loading an ontology
Describing a resource
Semantic Forms
From DataNotes + UPBox http://purl.pt/24107/1/iPres2013_PDF/UPBox%20and%20DataNotes%20a%20collaborative%20data%20management%20environment%20for%20the%20long%20tail%20of%20research%20data.pdf
[Figure: MediaWiki database schema (source: MediaWiki)]
“Old Versions”, aka “copy everything and add a timestamp”
Now imagine we want images of different kinds, with different attributes…
Redundancy…
[Figure: Relational Database (MySQL) and Triple Store (Apache Jena), connected by Mapping Logic]
[Table: CKAN, DSpace, Invenio and Semantic MediaWiki compared on four criteria: time, flexible attributes, wide use, DB-code coupling]
Issues review
• Entities with unknown attributes at time of modeling
• Time-variant attribute values
• Inheritance / sub-classing
• Hierarchies (parents of parents of parents…)
• Need for external documentation
Dendro
A graph-based data management platform
Graph databases
• Represent entities (Users, Products, Places…) as vertices (entity types are called classes)
• Connections between them are directed graph edges (edge types are called properties)
• The meaning of these connections is expressed in ontologies that can be shared and reused
Getting all my Projects
• Will fetch all the projects created by the user
• Will also return their attributes (“database columns”)
• Different projects may have different attributes
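The behaviour described in the bullets above can be sketched in pure Python. This is roughly what a SPARQL pattern such as "?project dc:creator <user> . ?project ?p ?o" would return, but it is only an illustration: the ex: identifiers, the ex:Project class, the projects_created_by helper and all sample values are assumptions made for this sketch, not Dendro's actual vocabulary or query.

```python
from collections import defaultdict

# Hypothetical data: three projects, two of them created by ex:alice,
# each described with a different set of properties.
TRIPLES = [
    ("ex:projA", "rdf:type", "ex:Project"),
    ("ex:projA", "dc:creator", "ex:alice"),
    ("ex:projA", "dc:title", "DCB experiments"),
    ("ex:projA", "dcb:specimenWidth", "120"),
    ("ex:projB", "rdf:type", "ex:Project"),
    ("ex:projB", "dc:creator", "ex:alice"),
    ("ex:projB", "dc:title", "Chemistry samples"),
    ("ex:projB", "ex:sampleCount", "42"),
    ("ex:projC", "rdf:type", "ex:Project"),
    ("ex:projC", "dc:creator", "ex:bob"),
]


def projects_created_by(triples, user):
    """Return {project: {property: [values]}} for the user's projects."""
    typed = {s for s, p, o in triples if p == "rdf:type" and o == "ex:Project"}
    mine = {
        s for s, p, o in triples
        if p == "dc:creator" and o == user and s in typed
    }
    result = {project: defaultdict(list) for project in mine}
    for s, p, o in triples:
        if s in mine:
            result[s][p].append(o)  # whatever properties the project has
    return {project: dict(attrs) for project, attrs in result.items()}
```

The returned dictionaries do not share a fixed set of keys: ex:projA carries dcb:specimenWidth while ex:projB carries ex:sampleCount, and the lookup code is the same for both.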
Inference
• Transitive Properties
• Subclasses
• Multiple inheritance (a resource can be a Folder and a Dataset at the same time)
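The three kinds of inference listed above can be sketched in a few lines of Python. All ex: names are hypothetical, and treating the part-of relation as transitive is an assumption made for the example; in a triple store this reasoning would typically be driven by the loaded ontology rather than by hand-written code.

```python
def transitive_closure(edges, start):
    """All nodes reachable from start by following edges one or more times."""
    reached = set()
    frontier = [start]
    while frontier:
        node = frontier.pop()
        for nxt in edges.get(node, ()):
            if nxt not in reached:
                reached.add(nxt)
                frontier.append(nxt)
    return reached


# Hypothetical class hierarchy: both classes specialise ex:Resource.
SUBCLASS_OF = {
    "ex:Folder": {"ex:Resource"},
    "ex:Dataset": {"ex:Resource"},
}

# Hypothetical containment chain: file -> folder -> project.
PART_OF = {
    "ex:file": {"ex:folder"},
    "ex:folder": {"ex:project"},
}

# Multiple inheritance: one resource asserted to be a Folder and a Dataset.
ASSERTED_TYPES = {"ex:folder": {"ex:Folder", "ex:Dataset"}}


def inferred_types(resource):
    """Asserted types plus every superclass reachable from them."""
    types = set(ASSERTED_TYPES.get(resource, ()))
    for asserted in list(types):
        types |= transitive_closure(SUBCLASS_OF, asserted)
    return types
```

Here transitive_closure(PART_OF, "ex:file") yields both the folder and the project (parents of parents), and inferred_types("ex:folder") adds ex:Resource to the two asserted classes.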
Loading an ontology
• Load ontology straight from the web
• No platform-specific syntax (like in SMW)
Nothing comes for free
• Aggregation operators are slow
• No ACID properties
• Transactions are not supported in standard SPARQL
• (“SPARQL 1.1 Query/Update Services should be atomic but that they are not required to be atomic.”)
• Graph DBMS Solutions are in early stages (many bugs, many “beta”s, many mailing lists…)
Dendro
• Dropbox and File/Folder description platform
• Variable descriptions
• Time-dependent values
• Directory structures (hierarchy)
• Need for simple querying…
[Figure: Change recording in Dendro, built up step by step]
Nodes: Pn, Dn (current dataset version), Dn-1 (past revisions), Fn, Un, C (change recording)
Properties shown: nie:isLogicalPartOf, dc:isReferencedBy, dc:isVersionOf, dc:creator (Un), dc:title (“DCB Base Data”), dcb:specimenWidth (120), dcb:initialCrackLength (280mm), dc:created (01/01/2014^^xsd:date), dc:modified (01/01/2014^^xsd:date), ddr:pertainsTo, ddr:changedDescriptor (ddr:initialCrackLength), ddr:operation (“add”)
Annotations: added property instance; changed modification timestamp; revision creation timestamp
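One way to read the change-recording figure above is sketched below in Python. It is an interpretation, not Dendro's implementation: the class and method names are hypothetical, the revision is modelled as a snapshot of the descriptors before the change, and the "edit" operation is an assumption (only "add" appears in the figure).

```python
from dataclasses import dataclass, field


@dataclass
class Change:
    changed_descriptor: str  # cf. ddr:changedDescriptor
    operation: str           # cf. ddr:operation: "add", or "edit" (assumed)


@dataclass
class Revision:
    descriptors: dict        # snapshot of the dataset before the change
    created: str             # revision creation timestamp
    changes: list            # the changes that pertain to this revision


@dataclass
class Dataset:
    descriptors: dict = field(default_factory=dict)
    modified: str = ""
    revisions: list = field(default_factory=list)

    def set_descriptor(self, name: str, value: str, when: str) -> None:
        """Archive the current state, then apply and timestamp the change."""
        if self.descriptors.get(name) == value:
            return  # nothing changed, so no revision is created
        operation = "edit" if name in self.descriptors else "add"
        self.revisions.append(
            Revision(dict(self.descriptors), when, [Change(name, operation)])
        )
        self.descriptors[name] = value
        self.modified = when
```

For example, starting from a dataset with dc:title and dcb:specimenWidth and calling set_descriptor("dcb:initialCrackLength", "280mm", "2014-01-01") leaves the new value on the current version, updates the modification timestamp, and keeps one past revision that lacks the new descriptor together with an "add" change record.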
Demo
Dendroβ
Conclusions
• Recording rich metadata requires data model flexibility
• Unknown attributes, time-variant information or hierarchies can be hard to model in a relational database
• Several current solutions make compromises due to their relational database layer
Conclusions (cont’d)
• Graph-based models are more flexible and easily extensible through ontology loading
• Ontologies are shareable on the web, and document the database “schema”
• Queries become simpler due to the graph model’s ability to easily model challenging scenarios for RDBs
• Dendro is a collaborative data management platform fully built on a graph model
João Rocha da Silva
Research Data Management and Semantic Web Researcher, Web & iPhone Developer
João Rocha da Silva is an Informatics Engineering PhD student at the Faculty of Engineering of the University of Porto. He specializes in research data management, applying the latest Semantic Web technologies to the adequate preservation and discovery of research data assets. He is also an experienced freelance iOS developer with several apps published on the App Store, and a self-taught DIY mechanic with a special interest in classic cars, particularly his 1987 Toyota Corolla GT Twin Cam, also known as Hachi-Roku or AE86.
João Correia Lopes
Assistant Professor in Informatics Engineering at Universidade do Porto, Researcher at INESC TEC
João Correia Lopes is an Assistant Professor in Informatics Engineering at Universidade do Porto and a researcher at INESC TEC. He graduated in Electrical Engineering from the University of Porto in 1984 and received a PhD in Computing Science from Glasgow University in 1997. His teaching includes undergraduate and graduate courses in databases and web applications, software engineering and object-oriented programming, markup languages and semantic web. He has been involved in research projects in the area of long-term preservation, service-oriented architectures and e-Science. Currently his main research interests are e-Science and the management of research data.
Cristina Ribeiro
Assistant Professor in Informatics Engineering at Universidade do Porto, Researcher at INESC TEC
Cristina Ribeiro is an Assistant Professor in Informatics Engineering at Universidade do Porto and a researcher at INESC TEC. She graduated in Electrical Engineering and holds a Master's in Electrical and Computer Engineering and a Ph.D. in Informatics. Her teaching includes undergraduate and graduate courses in information retrieval, digital libraries, knowledge representation and markup languages. She has been involved in research projects in the areas of cultural heritage, multimedia databases and information retrieval. Currently her main research interests are information retrieval, digital preservation and the management of research data.