Methodology for Linked Data Generation and Publication
Daniel Vila-Suero, [email protected]
TRANSCRIPT
Introduction
• Different methods and life-cycles available:
• LOD2
• Datalift
• W3C Linked Data cookbook
• W3C Best practices for Linked Data
• Guidelines for Multilingual Linked Data
• "datos.bne.es: an insight into Library Linked Data"
Daniel Vila-Suero, Asunción Gómez-Pérez, Library Hi-tech Journal 2013
OEG LD Projects
Libraries: Biblioteca Nacional
http://datos.bne.es
Geographic:
IGN: http://geo.linkeddata.es/
OTALEX
Meteorology:
AEMET: http://aemet.linkeddata.es/
Travel:
Grupo Prisa : http://webenemasuno.linkeddata.es/
Linked Data lifecycles
• A methodological approach to Linked Data publication and management.
• Typically described as a series of steps or activities with associated tasks, technologies and methods.
• Several approaches have been proposed (Hausenblas et al., Hyland et al., Villazón-Terrazas et al., etc.) with many similarities; see http://www.w3.org/2011/gld/wiki/GLD_Life_cycle
• However, these lifecycles should be understood as sets of practices that are currently applied, not as a one-size-fits-all formula [1]
[1] Publishing Linked Data - There is no One-Size-Fits-All Formula. http://mccarthy.dia.fi.upm.es/doc/edf.pdf
LD Lifecycles: Hyland et al.
LD Lifecycles: Hausenblas et al.
http://linked-data-life-cycles.info/
LD Lifecycles: Datalift approach
• French national project: http://datalift.org/
• Activities:
  • Raw RDF conversion
  • Selection of vocabularies
  • Conversion according to a schema
  • Publication
  • Interlinking
  • Exploitation
LD Lifecycles: LOD2 Approach
• Extraction
• Storage
• Authoring
• Interlinking
• Enrichment
• Quality
• Evolution
• Exploration
• More info at:
http://lod2.eu/
LD Lifecycles: Villazón-Terrazas et al.
Guidelines for ML Linked Data
• Set of main activities:
  • Specification
  • Modelling
  • RDF Generation
  • Data curation
  • Linking
  • Publication
• Each activity is composed of several tasks
Method and lifecycle
Specification: Intro
• The goal is to:
  • Specify and analyse the data sources
  • Produce documentation that will be used within the next activities
Specification: Tasks
1. Identification and analysis of data sources
2. URI/IRI design
3. Analysis and definition of licensing and provenance information
1. Specification
1. Analyze the data sources: What is the structure of your data? In which format? What types of entities are described in the data?
2. URI design: How will you name your resources?
• Several guidelines available:
  • Linked Data Patterns (Dodds and Davis): http://patterns.dataincubator.org/book/
  • UK Cabinet Office: http://www.cabinetoffice.gov.uk/media/308995/public_sector_uri.pdf
  • Style Guidelines for Naming and Labelling Ontologies in the Multilingual Web: http://bit.ly/xJwA9g
1. Specification
3. Define/describe the provenance information: How will you express and track the sources and aggregations of resources?
• Different vocabularies available: OPMV, W3C PROV-O, OAI-ORE, DCMI-PROV, etc.
• Good starting point: PROV Model Primer
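The pattern described in the PROV Model Primer can be sketched in a few triples. A minimal example (all resource names under example.org are illustrative, not from a real dataset):

```turtle
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix ex:   <http://example.org/> .

# A generated RDF resource points back to its (hypothetical) source record
ex:book123 a prov:Entity ;
    prov:wasDerivedFrom ex:marcRecord123 ;
    prov:wasGeneratedBy ex:rdfGeneration .

# The transformation itself is modelled as an activity
ex:rdfGeneration a prov:Activity ;
    prov:used ex:marcRecord123 .
```

This uses only core PROV-O terms (prov:Entity, prov:Activity, prov:wasDerivedFrom, prov:wasGeneratedBy, prov:used).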
Specification
Identification and analysis of data sources
INPUT:
• Data sources
• Associated documentation
OUTPUT:
• Documentation of data sources (type of data, formats, data model, language, identifiers, etc.)
Specification
Identification and analysis of data sources
• Documentation of data sources:
• Type of data: Authors, Publications, Subjects, etc.
• Format: MARC21, XML, EXCEL, TSV, CSV, RDF
• Data model: MARC21, Ad-hoc RDB, XSD schema
Specification
Identification and analysis of data sources
• Documentation of data sources:
• Languages: English
• Identifiers: Labels of terms, code 001
• Licensing and provenance information:
Specification
URI/IRI design
INPUT:
• Documentation of data sources
OUTPUT:
• Documentation of URI/IRI patterns and namespaces
URI Forms
• Publication of RDF should at least:
  • Have a de-referenceable URI that returns RDF (preferably Turtle)
  • Have a de-referenceable URI that returns a human-readable representation (HTML)
• Two URI strategies:
  • Hash (#{id}) strategy:
    • A request to http://example.org/vocabulary#Person returns the complete vocabulary to the client, and the client is responsible for finding #Person
  • HTTP 303 (slash) strategy:
    • A request to http://example.org/vocabulary/Person returns HTTP 303 with the location of the document, using content negotiation
Specification
URI/IRI design
- Namespace:
- http://www.datos.bne.es/
- prefix: bne:
- Patterns:
- Canonical: http://datos.bne.es/resource/{id}
- Persons: http://datos.bne.es/autor/{id}
- Works: http://datos.bne.es/obra/{id}
- Expressions: http://datos.bne.es/versión/{id}
- Manifestations: http://datos.bne.es/edición/{id}
- Subjects: http://datos.bne.es/edición/{id}
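With these patterns in place, a described resource could look roughly as follows (the identifier values are placeholders, not real datos.bne.es identifiers):

```turtle
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

<http://datos.bne.es/autor/XX0000000>
    rdfs:label "Example author" ;
    rdfs:seeAlso <http://datos.bne.es/obra/XX0000001> .
```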
Specification
URI/IRI design
- Namespace vocabulary:
- http://www.datos.bne.es/vocab
- prefix: bnevoc:
- Patterns:
- Canonical: http://datos.bne.es/vocab/{id}
URI design = naming
Some general URI design guidelines
Naming: Preliminary guidelines for a multilingual scenario
Some tools are not prepared for opaque URIs (Pubby)…
A Semantic Web Journal reviewer on the datos.bne.es paper*:
"It is pity that local names of chosen IFLA-FRBR properties are cryptic codes … but authors of this paper are not to blame about that"
* http://datos.bne.es/resource/XX1718747
* http://www.semantic-web-journal.net/content/datosbnees-library-linked-data-dataset
Some others are better prepared (Puelia)…
frbr:C1005 a rdfs:Class ;
    rdfs:label "Person"@en, "Persona"@es .
Display labels are configurable using a Turtle config file
* http://datos.bne.es/frontend/persons
Label is not selected based on the user's locale
Specification
Licensing and provenance information
INPUT:
• Documentation of data sources
OUTPUT:
• Documentation of provenance and licensing terms
ODRL: Open Digital Rights Language
Exercise:
• Define the URI patterns for the following elements:
- Person
- Book
- Person is author of Book
- Person first name
- Person last name
- Book title
- Book is part of Collection
- Collection has part Book
How open is the LOD cloud?
[1] Rodriguez-Doncel, Victor et al. Rights declaration in Linked Data. In Proc. of the 3rd Int. Workshop on Consuming Linked Data, O. Hartig et al. (Eds.), CEUR vol. 1034 (2013)
Modelling: Intro
• The goal is to:
  • Design and implement a vocabulary for describing the data sources in RDF
  • Produce a valid and LD-compliant vocabulary that facilitates the understanding and consumption of data
2. Modelling
2. Create vocabulary terms (classes, properties and relationships): if you are unable to find suitable terms, you need to create them
  • From an existing conceptual model
  • From scratch
  • By specializing existing terms (subPropertyOf, subClassOf, domain, range, etc.). For example:
    myVocabulary:Archivist rdfs:subClassOf foaf:Person
• Desktop tools:
  • NeOn Toolkit
  • Protégé
  • TopBraid Composer Community Edition
• Online tools:
  • Metadata Registry
  • Neologism (DERI)
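The one-line example above can be expanded into a small, documented term definition. A sketch (the myVocabulary namespace is illustrative):

```turtle
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix myVocabulary: <http://example.org/vocab#> .

# Specialize an existing, widely-used term instead of minting an unrelated one
myVocabulary:Archivist a rdfs:Class ;
    rdfs:subClassOf foaf:Person ;
    rdfs:label "Archivist"@en, "Archivero"@es ;
    rdfs:comment "A person professionally responsible for archives."@en .
```

Because Archivist is declared a subclass of foaf:Person, any consumer that understands FOAF can still process the data.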
Modelling: Tasks
1. Analysis and selection of domain vocabularies
2. Development of vocabulary
3. Vocabulary for representing licensing and provenance information
Modelling
Analysis and selection of domain vocabularies
INPUT:
• Documentation of data sources (data models, type of data)
OUTPUT:
• A selection of standard/widely-used vocabularies
LOV: Linked Open Vocabularies
2. Modelling
1. Search for suitable vocabularies to model the data sources: What properties and classes will you use to describe the RDF data?
  • Yes, but how can I find suitable vocabularies?
    • Go to thedatahub.org to search for similar datasets and see what vocabularies are used
    • Given a SPARQL endpoint, use vocab-express
    • Given a set of keywords describing entities in your source data, go to LOV (http://lov.okfn.org/) and use the search utility
    • Open Metadata Registry: http://metadataregistry.org
    • Ask the community: [email protected], [email protected], [email protected]
  • What makes a vocabulary suitable?
Is your LD vocabulary 5-star?
Modelling
Analysis and selection of domain vocabularies
NIF: NLP Interchange Format
IFLA
BIBO
Dublin Core
Use http://lov.okfn.org/
Modelling
Development of vocabulary
INPUT:
• Documentation of data sources (data model, type of data)
• Selection of standard/widely-used vocabs
OUTPUT:
• Well-documented and linked vocabulary (with mappings to widely-used vocabs)
Modelling
Development of vocabulary
• Integrate all reused vocabularies (possibly using subClassOf and subPropertyOf)
• Document the vocabulary using an Ontology Editor Tool
• Validate the vocabulary:
Example:
http://oeg-lia3.dia.fi.upm.es/oops/
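Documentation can also live in the vocabulary itself. A sketch of an ontology header with basic metadata (the vocabulary URI, title and license are illustrative choices):

```turtle
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix dct: <http://purl.org/dc/terms/> .

# Metadata attached to the vocabulary as a whole
<http://example.org/vocab> a owl:Ontology ;
    dct:title "Example bibliographic vocabulary"@en ;
    owl:versionInfo "0.1" ;
    dct:license <http://creativecommons.org/publicdomain/zero/1.0/> .
```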
Exercise:
• Create a small vocabulary in Protege with the following model:
- Person
- Book
- Person is author of Book
- Person first name
- Person last name
- Book title
- Book is part of Collection
- Collection has part Book
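One possible sketch of such a vocabulary, reusing FOAF, BIBO and Dublin Core terms (the ex namespace is illustrative; other alignments are equally valid):

```turtle
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix bibo: <http://purl.org/ontology/bibo/> .
@prefix ex:   <http://example.org/vocab#> .

ex:Person     a rdfs:Class ; rdfs:subClassOf foaf:Person .
ex:Book       a rdfs:Class ; rdfs:subClassOf bibo:Book .
ex:Collection a rdfs:Class ; rdfs:subClassOf bibo:Collection .

ex:authorOf  a rdf:Property ; rdfs:domain ex:Person ; rdfs:range ex:Book .
ex:firstName a rdf:Property ; rdfs:subPropertyOf foaf:givenName .
ex:lastName  a rdf:Property ; rdfs:subPropertyOf foaf:familyName .
ex:title     a rdf:Property ; rdfs:subPropertyOf dct:title .
ex:isPartOf  a rdf:Property ; rdfs:subPropertyOf dct:isPartOf ;
             rdfs:domain ex:Book ; rdfs:range ex:Collection .
ex:hasPart   a rdf:Property ; rdfs:subPropertyOf dct:hasPart ;
             rdfs:domain ex:Collection ; rdfs:range ex:Book .
```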
Exercise:
• Reuse and align with existing vocabularies if possible
• Validate using OOPS! (http://oeg-lia3.dia.fi.upm.es/oops/)
Modelling
Vocabulary for licensing and provenance
INPUT:
• Documentation of data sources (licensing and provenance)
OUTPUT:
• Selection of standard vocabs
ODRL: Open Digital Rights Language
PROV: W3C Provenance Ontology
Guideline: Licensing models & mechanisms
1. Add "rights" metadata in the dataset description (e.g., VoID, DCAT)
2. Use standard predicates to declare "rights" statements (e.g., Dublin Core terms: dc:rights, dct:license)
3. Is a standard license available?
  3a. Yes: use the URI of the standard license, e.g., CC0
  3b. No: use a rights declaration language, e.g., ODRL

DCAT: Data catalog vocabulary
ODRL: Open Digital Rights Language
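The two branches of step 3 can be written down in a few triples. A sketch (the dataset and policy URIs are illustrative):

```turtle
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix odrl: <http://www.w3.org/ns/odrl/2/> .
@prefix ex:   <http://example.org/> .

# 3a: a standard license exists, so point to its URI (here CC0)
ex:openDataset a void:Dataset ;
    dct:license <http://creativecommons.org/publicdomain/zero/1.0/> .

# 3b: no standard license fits, so attach a custom ODRL policy instead
ex:restrictedDataset a void:Dataset ;
    dct:license ex:customPolicy .
ex:customPolicy a odrl:Policy .
```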
RDF Generation: Intro
• The goal is to:
  • Define the method, process and technologies to generate RDF
  • Ideally, produce a set of explicit mappings from the data sources to RDF
RDF Generation: Tasks
1. Selection, extension or development of technologies for RDF generation
2. Mapping of data sources to RDF
3. Transformation of data sources to RDF
RDF Generation
Technologies for RDF Generation
INPUT:
• Documentation of data sources (formats)
OUTPUT:
• Configuration of technologies (framework)
Open Refine
Jena Toolkit
any23
Morph: R2RML engine
3. RDF Generation
1. Transform the data source to RDF
  • For that we need to know the model and format of the data source
  • Increasing number of RDFizers:
    • RDB: the new W3C Recommendations R2RML and Direct Mapping
    • CSV, TSV, tabular data: Google Refine, NOR2O
    • XML: GRDDL, XSLT, etc.
    • …
  • Not always easy to map source data to RDF: low usability
2. Data cleansing:
  • New tools for data quality, for example LDIF (Linked Data Integration Framework)
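R2RML mappings are themselves written in Turtle. A sketch mapping a hypothetical AUTHORS table (table, column and vocabulary names are all illustrative):

```turtle
@prefix rr: <http://www.w3.org/ns/r2rml#> .
@prefix ex: <http://example.org/vocab#> .

<#AuthorMapping> a rr:TriplesMap ;
    rr:logicalTable [ rr:tableName "AUTHORS" ] ;
    # One subject URI per row, minted from the ID column
    rr:subjectMap [
        rr:template "http://example.org/autor/{ID}" ;
        rr:class ex:Person
    ] ;
    # The NAME column becomes a literal-valued property
    rr:predicateObjectMap [
        rr:predicate ex:name ;
        rr:objectMap [ rr:column "NAME" ]
    ] .
```

An R2RML engine such as Morph executes this mapping against the database to produce the RDF dataset.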
RDF Generation
Technologies for RDF Generation
RDF Generation
Mapping of data sources to RDF
INPUT:
• Vocabulary
• Data sources
• Configuration of technologies
OUTPUT:
• Ideally, machine-processable mappings
• RDF dataset
Data curation: Intro
• Cross-cutting activity
• The goal is to:
  • Ensure the quality of the LD dataset
  • Enable data curation at the data sources level
Data curation: Tasks
1. Data sources curation
2. RDF data curation
Data curation
Data sources curation
INPUT:
• Data sources
• Reports of issues from previous activities
OUTPUT:
• Documentation of issues in data sources
• Fixes to data sources
Data curation
RDF data curation
INPUT:
• RDF dataset
• Vocabulary
OUTPUT:
• Documentation of issues
• Fixes to RDF dataset
RDFUnit: a test-driven data debugging framework
http://validator.linkeddata.org/vapour
Exercise:
• Validate 5 different URIs from existing Linked Data projects using http://validator.linkeddata.org/vapour
• If you don't know any dataset, search the datahub for URIs to check
• Analyse and explain the results
Linking: Intro
• The goal is to:
  • Maximize the connectivity to external datasets
  • Facilitate data discovery and enrichment
Linking: Tasks
1. Target datasets discovery and selection
2. Link discovery
3. Links validation
4. Linking
1. Identify datasets that may be suitable as linking targets:
  • Still difficult; most people think of DBpedia
  • Use thedatahub.org to find similar datasets
  • Use Sindice to search for entities in your dataset
2. Discover relationships between data items of your dataset and items of the datasets identified in the previous step
  • Most well-known semi-automatic tools: SILK, LIMES
  • Still require a high level of configuration
3. Validate the relationships discovered in the previous step
  • Still difficult; maybe it should be included in editorial/cataloguing workflows?
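A validated link and its linkset description could look like this in Turtle (all URIs except the vocabulary terms are illustrative placeholders):

```turtle
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix ex:   <http://example.org/> .

# A discovered and validated identity link
ex:author123 owl:sameAs <http://dbpedia.org/resource/Some_Author> .

# The linkset as a whole, described with VoID
ex:linksToDBpedia a void:Linkset ;
    void:subjectsTarget ex:myDataset ;
    void:objectsTarget  ex:dbpediaDataset ;
    void:linkPredicate  owl:sameAs .
```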
Linking
Target datasets discovery and selection
INPUT:
• RDF dataset
• Vocabulary
OUTPUT:
• Selection of target datasets
• Commonalities with target datasets (at the data and vocabulary levels)
Linking
Link discovery
INPUT:
• RDF dataset
• Selection of target datasets
OUTPUT:
• Linking tasks specification
• Tentative linksets
Open Refine, Silk, Limes
Linking
Link validation
INPUT:
• Tentative linksets
OUTPUT:
• Validated linksets (including provenance and licensing information)
sameAsValidator
Publication: Intro
• The goal is to:
  • Make the RDF dataset available following LD best practices
  • Facilitate dataset discovery and consumption
Publication: Tasks
1. Dataset and vocabulary publication on the Web
2. Metadata definition and publication
Publication
Dataset and vocabulary publication on the Web
INPUT:
• RDF dataset
• Validated linksets
• Vocabulary
OUTPUT:
• Linked dataset available and accessible on the Web
• Vocabulary available and accessible on the Web (as LD)
SPARQL, Linked Data APIs (Puelia, Elda, LDP implementations)
Publication
Dataset and vocabulary publication on the Web
Example of LD vocabulary:
GND Ontology:
http://d-nb.info/standards/elementset/gnd
Publication
Metadata definition and publication
INPUT:
• All previous assets and documentation
OUTPUT:
• Dataset registered in relevant catalogues
• Metadata available, accessible and discoverable
DCAT, VoID
Publication
Metadata definition and publication
DCAT: http://www.w3.org/TR/vocab-dcat/
Example:
:catalog a dcat:Catalog ;
    dct:title "Imaginary Catalog" ;
    rdfs:label "Imaginary Catalog" ;
    foaf:homepage <http://example.org/catalog> ;
    dct:publisher :transparency-office ;
    dct:language <http://id.loc.gov/vocabulary/iso639-1/en> ;
    dcat:dataset :dataset-001 , :dataset-002 , :dataset-003 .
Exercise:
• Create a DCAT description in Turtle syntax of the Book, Authors, Collections catalog:
  • Book dataset
  • Authors dataset
  • Collections dataset
  • If possible, use different licenses, languages, and provenance
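A possible sketch of such a description, following the :catalog example above (names, languages, licenses and provenance are illustrative choices):

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:   <http://example.org/> .

ex:catalog a dcat:Catalog ;
    dct:title "Books, Authors and Collections catalog"@en ;
    dcat:dataset ex:books , ex:authors , ex:collections .

ex:books a dcat:Dataset ;
    dct:title "Book dataset"@en ;
    dct:language <http://id.loc.gov/vocabulary/iso639-1/en> ;
    dct:license <http://creativecommons.org/publicdomain/zero/1.0/> .

ex:authors a dcat:Dataset ;
    dct:title "Authors dataset"@en ;
    dct:language <http://id.loc.gov/vocabulary/iso639-1/es> ;
    dct:license <http://creativecommons.org/licenses/by/4.0/> .

ex:collections a dcat:Dataset ;
    dct:title "Collections dataset"@en ;
    dct:provenance [ rdfs:label "Derived from the library catalogue"@en ] .
```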
Conclusions
• Data curation is key to the success of LD
• Documentation of data sources and issues
• Language issues have to be taken into account during the whole process
• Metadata description is key for enabling reuse and discovery
• Vocabularies have to be documented and published following LD best practices