Methodology for Linked Data Generation and Publication
Daniel Vila-Suero, [email protected]
TRANSCRIPT
Introduction
• Different methods and life-cycles available:
• LOD2
• Datalift
• W3C Linked Data cookbook
• W3C Best practices for Linked Data
• Guidelines for Multilingual Linked Data
• "datos.bne.es: an insight into Library Linked Data"
Daniel Vila-Suero, Asunción Gómez-Pérez, Library Hi-tech Journal 2013
OEG LD Projects
Libraries: Biblioteca Nacional
http://datos.bne.es
Geographic:
IGN: http://geo.linkeddata.es/
OTALEX
Meteorology:
AEMET: http://aemet.linkeddata.es/
Travel:
Grupo Prisa : http://webenemasuno.linkeddata.es/
Linked Data lifecycles
• A methodological approach to Linked Data publication and management.
• Typically described as a series of steps or activities with associated tasks, technologies and methods.
• Several approaches have been proposed (Hausenblas et al., Hyland et al., Villazón-Terrazas et al., etc.) with many similarities; see http://www.w3.org/2011/gld/wiki/GLD_Life_cycle
• However, these lifecycles should be understood as sets of practices that are currently applied, not as a one-size-fits-all formula [1]
[1] Publishing Linked Data - There is no One-Size-Fits-All Formula. http://mccarthy.dia.fi.upm.es/doc/edf.pdf
LD Lifecycles: Hyland et al.
LD Lifecycles: Hausenblas et al.
http://linked-data-life-cycles.info/
LD Lifecycles: Datalift approach
• French national project: http://datalift.org/
• Activities:
  • Raw RDF conversion
  • Selection of vocabularies
  • Conversion according to a schema
  • Publication
  • Interlinking
  • Exploitation
LD Lifecycles: LOD2 Approach
• Extraction
• Storage
• Authoring
• Interlinking
• Enrichment
• Quality
• Evolution
• Exploration
• More info at:
http://lod2.eu/
LD Lifecycles: Villazón-Terrazas et al.
Guidelines for ML Linked Data
• Set of main activities:
  • Specification
  • Modelling
  • RDF Generation
  • Data curation
  • Linking
  • Publication
• Each activity is composed of several tasks
Method and lifecycle
Specification: Intro
• The goal is to:
  • Specify and analyse the data sources
  • Produce documentation that will be used within the next activities
Specification: Tasks
1. Identification and analysis of data sources
2. URI/IRI design
3. Analysis and definition of licensing and provenance information
1. Specification
1. Analyze the data sources: What is the structure of your data? In which format? What types of entities are described in the data?
2. URI design: How will you name your resources?
• Several guidelines available:
  • Linked Data Patterns (Dodds and Davis): http://patterns.dataincubator.org/book/
  • UK Cabinet Office: http://www.cabinetoffice.gov.uk/media/308995/public_sector_uri.pdf
  • Style Guidelines for Naming and Labelling Ontologies in the Multilingual Web: http://bit.ly/xJwA9g
1. Specification
3. Define/describe the provenance information: How will you express and track the sources and aggregations of resources?
• Different vocabularies available: OPMV, W3C PROV-O, OAI-ORE, DCMI-PROV, etc.
• Good starting point: PROV Model Primer
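The pattern described in the PROV Model Primer can be sketched in a few triples. A minimal example (all resource names under example.org are illustrative, not from a real dataset):

```turtle
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix ex:   <http://example.org/> .

# A generated RDF resource points back to its (hypothetical) source record
ex:book123 a prov:Entity ;
    prov:wasDerivedFrom ex:marcRecord123 ;
    prov:wasGeneratedBy ex:rdfGeneration .

# The transformation itself is modelled as an activity
ex:rdfGeneration a prov:Activity ;
    prov:used ex:marcRecord123 .
```

This uses only core PROV-O terms (prov:Entity, prov:Activity, prov:wasDerivedFrom, prov:wasGeneratedBy, prov:used).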
Specification
Identification and analysis of data sources
INPUT:
• Data sources
• Associated documentation
OUTPUT:
• Documentation of data sources (type of data, formats, data model, language, identifiers, etc.)
Specification
Identification and analysis of data sources
• Documentation of data sources:
• Type of data: Authors, Publications, Subjects, etc.
• Format: MARC21, XML, EXCEL, TSV, CSV, RDF
• Data model: MARC21, Ad-hoc RDB, XSD schema
Specification
Identification and analysis of data sources
• Documentation of data sources:
• Languages: English
• Identifiers: Labels of terms, code 001
• Licensing and provenance information:
Specification
URI/IRI design
INPUT:
• Documentation of data sources
OUTPUT:
• Documentation of URI/IRI patterns and namespaces
URI Forms
• Publication of RDF should at least:
  • Have a de-referenceable URI that returns RDF (preferably Turtle)
  • Have a de-referenceable URI that returns a human-readable representation (HTML)
• Two URI strategies:
  • Hash (#{id}) strategy:
    • A request to http://example.org/vocabulary#Person returns the complete vocabulary to the client, and the client is responsible for finding #Person
  • HTTP 303 (slash) strategy:
    • A request to http://example.org/vocabulary/Person returns HTTP 303 with the location of the document, using content negotiation
Specification
URI/IRI design
- Namespace:
- http://www.datos.bne.es/
- prefix: bne:
- Patterns:
- Canonical: http://datos.bne.es/resource/{id}
- Persons: http://datos.bne.es/autor/{id}
- Works: http://datos.bne.es/obra/{id}
- Expressions: http://datos.bne.es/versión/{id}
- Manifestations: http://datos.bne.es/edición/{id}
- Subjects: http://datos.bne.es/edición/{id}
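With these patterns in place, a described resource could look roughly as follows (the identifier values are placeholders, not real datos.bne.es identifiers):

```turtle
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

<http://datos.bne.es/autor/XX0000000>
    rdfs:label "Example author" ;
    rdfs:seeAlso <http://datos.bne.es/obra/XX0000001> .
```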
Specification
URI/IRI design
- Namespace vocabulary:
- http://www.datos.bne.es/vocab
- prefix: bnevoc:
- Patterns:
- Canonical: http://datos.bne.es/vocab/{id}
URI design = naming
Some general URI design guidelines
Naming: Preliminary guidelines for a multilingual scenario
Some tools are not prepared for opaque URIs (Pubby)…
A Semantic Web Journal reviewer on the datos.bne.es paper*:
"It is pity that local names of chosen IFLA-FRBR properties are cryptic codes … but authors of this paper are not to blame about that"
* http://datos.bne.es/resource/XX1718747
* http://www.semantic-web-journal.net/content/datosbnees-library-linked-data-dataset
Some others are better prepared (Puelia)…
frbr:C1005 a rdfs:Class ;
    rdfs:label "Person"@en, "Persona"@es .
Display labels are configurable using a Turtle config file
* http://datos.bne.es/frontend/persons
Label is not selected based on the user's locale
Specification
Licensing and provenance information
INPUT:
• Documentation of data sources
OUTPUT:
• Documentation of provenance and licensing terms
ODRL: Open Digital Rights Language
Exercise:
• Define the URI patterns for the following elements:
- Person
- Book
- Person is author of Book
- Person first name
- Person last name
- Book title
- Book is part of Collection
- Collection has part Book
How open is the LOD cloud?
[1] Rodriguez-Doncel, Victor et al. Rights declaration in Linked Data. In Proc. of the 3rd Int. Workshop on Consuming Linked Data, O. Hartig et al. (Eds.), CEUR vol. 1034 (2013)
Modelling: Intro
• The goal is to:
  • Design and implement a vocabulary for describing the data sources in RDF
  • Produce a valid and LD-compliant vocabulary that facilitates the understanding and consumption of data
2. Modelling
2. Create vocabulary terms (classes, properties and relationships): if you are unable to find suitable terms, you need to create them
  • From an existing conceptual model
  • From scratch
  • By specializing existing terms (subPropertyOf, subClassOf, domain, range, etc.). For example:
    myVocabulary:Archivist rdfs:subClassOf foaf:Person
• Desktop tools:
  • NeOn Toolkit
  • Protégé
  • TopBraid Composer Community Edition
• Online tools:
  • Metadata Registry
  • Neologism (DERI)
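The one-line example above can be expanded into a small, documented term definition. A sketch (the myVocabulary namespace is illustrative):

```turtle
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix myVocabulary: <http://example.org/vocab#> .

# Specialize an existing, widely-used term instead of minting an unrelated one
myVocabulary:Archivist a rdfs:Class ;
    rdfs:subClassOf foaf:Person ;
    rdfs:label "Archivist"@en, "Archivero"@es ;
    rdfs:comment "A person professionally responsible for archives."@en .
```

Because Archivist is declared a subclass of foaf:Person, any consumer that understands FOAF can still process the data.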
Modelling: Tasks
1. Analysis and selection of domain vocabularies
2. Development of vocabulary
3. Vocabulary for representing licensing and provenance information
Modelling
Analysis and selection of domain vocabularies
INPUT:
• Documentation of data sources (data models, type of data)
OUTPUT:
• A selection of standard/widely-used vocabularies
LOV: Linked Open Vocabularies
2. Modelling
1. Search for suitable vocabularies to model the data sources: What properties and classes will you use to describe the RDF data?
  • Yes, but how can I find suitable vocabularies?
    • Go to thedatahub.org to search for similar datasets and see what vocabularies are used
    • Given a SPARQL endpoint, use vocab-express
    • Given a set of keywords describing entities in your source data, go to LOV (http://lov.okfn.org/) and use the search utility
    • Open Metadata Registry: http://metadataregistry.org
    • Ask the community: [email protected], [email protected], [email protected]
  • What makes a vocabulary suitable?
Is your LD vocabulary 5-star?
Modelling
Analysis and selection of domain vocabularies
NIF: NLP Interchange Format
IFLA
BIBO
Dublin Core
Use http://lov.okfn.org/
Modelling
Development of vocabulary
INPUT:
• Documentation of data sources (data model, type of data)
• Selection of standard/widely-used vocabs
OUTPUT:
• Well-documented and linked vocabulary (with mappings to widely-used vocabs)
Modelling
Development of vocabulary
• Integrate all reused vocabularies (possibly using subClassOf and subPropertyOf)
• Document the vocabulary using an Ontology Editor Tool
• Validate the vocabulary:
Example:
http://oeg-lia3.dia.fi.upm.es/oops/
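Documentation can also live in the vocabulary itself. A sketch of an ontology header with basic metadata (the vocabulary URI, title and license are illustrative choices):

```turtle
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix dct: <http://purl.org/dc/terms/> .

# Metadata attached to the vocabulary as a whole
<http://example.org/vocab> a owl:Ontology ;
    dct:title "Example bibliographic vocabulary"@en ;
    owl:versionInfo "0.1" ;
    dct:license <http://creativecommons.org/publicdomain/zero/1.0/> .
```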
Exercise:
• Create a small vocabulary in Protege with the following model:
- Person
- Book
- Person is author of Book
- Person first name
- Person last name
- Book title
- Book is part of Collection
- Collection has part Book
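One possible sketch of such a vocabulary, reusing FOAF, BIBO and Dublin Core terms (the ex namespace is illustrative; other alignments are equally valid):

```turtle
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix bibo: <http://purl.org/ontology/bibo/> .
@prefix ex:   <http://example.org/vocab#> .

ex:Person     a rdfs:Class ; rdfs:subClassOf foaf:Person .
ex:Book       a rdfs:Class ; rdfs:subClassOf bibo:Book .
ex:Collection a rdfs:Class ; rdfs:subClassOf bibo:Collection .

ex:authorOf  a rdf:Property ; rdfs:domain ex:Person ; rdfs:range ex:Book .
ex:firstName a rdf:Property ; rdfs:subPropertyOf foaf:givenName .
ex:lastName  a rdf:Property ; rdfs:subPropertyOf foaf:familyName .
ex:title     a rdf:Property ; rdfs:subPropertyOf dct:title .
ex:isPartOf  a rdf:Property ; rdfs:subPropertyOf dct:isPartOf ;
             rdfs:domain ex:Book ; rdfs:range ex:Collection .
ex:hasPart   a rdf:Property ; rdfs:subPropertyOf dct:hasPart ;
             rdfs:domain ex:Collection ; rdfs:range ex:Book .
```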
Exercise:
• Reuse and align with existing vocabularies if possible
• Validate using OOPS! (http://oeg-lia3.dia.fi.upm.es/oops/)
Modelling
Vocabulary for licensing and provenance
INPUT:
• Documentation of data sources (licensing and provenance)
OUTPUT:
• Selection of standard vocabs
ODRL: Open Digital Rights Language
PROV: W3C Provenance Ontology
Guideline: Licensing models & mechanisms
1. Add "rights" metadata in the dataset description (e.g., VoID, DCAT)
2. Use standard predicates to declare "rights" statements (e.g., Dublin Core terms: dc:rights, dct:license)
3. Is a standard license available?
  3a. Yes: use the URI of the standard license, e.g., CC0
  3b. No: use a rights declaration language, e.g., ODRL

DCAT: Data catalog vocabulary
ODRL: Open Digital Rights Language
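The two branches of step 3 can be written down in a few triples. A sketch (the dataset and policy URIs are illustrative):

```turtle
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix odrl: <http://www.w3.org/ns/odrl/2/> .
@prefix ex:   <http://example.org/> .

# 3a: a standard license exists, so point to its URI (here CC0)
ex:openDataset a void:Dataset ;
    dct:license <http://creativecommons.org/publicdomain/zero/1.0/> .

# 3b: no standard license fits, so attach a custom ODRL policy instead
ex:restrictedDataset a void:Dataset ;
    dct:license ex:customPolicy .
ex:customPolicy a odrl:Policy .
```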
RDF Generation: Intro
• The goal is to:
  • Define the method, process and technologies to generate RDF
  • Ideally, produce a set of explicit mappings from the data sources to RDF
RDF Generation: Tasks
1. Selection, extension or development of technologies for RDF generation
2. Mapping of data sources to RDF
3. Transformation of data sources to RDF
RDF Generation
Technologies for RDF Generation
INPUT:
• Documentation of data sources (formats)
OUTPUT:
• Configuration of technologies (framework)
Open Refine
Jena Toolkit
any23
Morph: R2RML engine
3. RDF Generation
1. Transform the data source to RDF
  • For that we need to know the model and format of the data source
  • Increasing number of RDFizers:
    • RDB: the new W3C Recommendations R2RML and Direct Mapping
    • CSV, TSV, tabular data: Google Refine, NOR2O
    • XML: GRDDL, XSLT, etc.
    • …
  • Not always easy to map source data to RDF: low usability
2. Data cleansing:
  • New tools for data quality, for example LDIF (Linked Data Integration Framework)
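R2RML mappings are themselves written in Turtle. A sketch mapping a hypothetical AUTHORS table (table, column and vocabulary names are all illustrative):

```turtle
@prefix rr: <http://www.w3.org/ns/r2rml#> .
@prefix ex: <http://example.org/vocab#> .

<#AuthorMapping> a rr:TriplesMap ;
    rr:logicalTable [ rr:tableName "AUTHORS" ] ;
    # One subject URI per row, minted from the ID column
    rr:subjectMap [
        rr:template "http://example.org/autor/{ID}" ;
        rr:class ex:Person
    ] ;
    # The NAME column becomes a literal-valued property
    rr:predicateObjectMap [
        rr:predicate ex:name ;
        rr:objectMap [ rr:column "NAME" ]
    ] .
```

An R2RML engine such as Morph executes this mapping against the database to produce the RDF dataset.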
RDF Generation
Technologies for RDF Generation
RDF Generation
Mapping of data sources to RDF
INPUT:
• Vocabulary
• Data sources
• Configuration of technologies
OUTPUT:
• Ideally, machine-processable mappings
• RDF dataset
Data curation: Intro
• Cross-cutting activity
• The goal is to:
  • Ensure the quality of the LD dataset
  • Enable data curation at the data sources level
Data curation: Tasks
1. Data sources curation
2. RDF data curation
Data curation
Data sources curation
INPUT:
• Data sources
• Reports of issues from previous activities
OUTPUT:
• Documentation of issues in data sources
• Fixes to data sources
Data curation
RDF data curation
INPUT:
• RDF dataset
• Vocabulary
OUTPUT:
• Documentation of issues
• Fixes to RDF dataset
RDFUnit: a test-driven data debugging framework
http://validator.linkeddata.org/vapour
Exercise:
• Validate 5 different URIs from existing Linked Data projects using http://validator.linkeddata.org/vapour
• If you don't know any dataset, search the datahub for URIs to check
• Analyse and explain the results
Linking: Intro
• The goal is to:
  • Maximize the connectivity to external datasets
  • Facilitate data discovery and enrichment
Linking: Tasks
1. Target datasets discovery and selection
2. Link discovery
3. Links validation
4. Linking
1. Identify datasets that may be suitable as linking targets:
  • Still difficult; most people think of DBpedia
  • Use thedatahub.org to find similar datasets
  • Use Sindice to search for entities in your dataset
2. Discover relationships between data items of your dataset and items of the datasets identified in the previous step
  • Most well-known semi-automatic tools: SILK, LIMES
  • Still require a high level of configuration
3. Validate the relationships discovered in the previous step
  • Still difficult; maybe it should be included in editorial/cataloguing workflows?
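A validated link and its linkset description could look like this in Turtle (all URIs except the vocabulary terms are illustrative placeholders):

```turtle
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix ex:   <http://example.org/> .

# A discovered and validated identity link
ex:author123 owl:sameAs <http://dbpedia.org/resource/Some_Author> .

# The linkset as a whole, described with VoID
ex:linksToDBpedia a void:Linkset ;
    void:subjectsTarget ex:myDataset ;
    void:objectsTarget  ex:dbpediaDataset ;
    void:linkPredicate  owl:sameAs .
```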
Linking
Target datasets discovery and selection
INPUT:
• RDF dataset
• Vocabulary
OUTPUT:
• Selection of target datasets
• Commonalities with target datasets (at the data and vocabulary levels)
Linking
Link discovery
INPUT:
• RDF dataset
• Selection of target datasets
OUTPUT:
• Linking tasks specification
• Tentative linksets
Open Refine, Silk, Limes
Linking
Link validation
INPUT:
• Tentative linksets
OUTPUT:
• Validated linksets (including provenance and licensing information)
sameAsValidator
Publication: Intro
• The goal is to:
  • Make the RDF dataset available following LD best practices
  • Facilitate dataset discovery and consumption
Publication: Tasks
1. Dataset and vocabulary publication on the Web
2. Metadata definition and publication
Publication
Dataset and vocabulary publication on the Web
INPUT:
• RDF dataset
• Validated linksets
• Vocabulary
OUTPUT:
• Linked dataset available and accessible on the Web
• Vocabulary available and accessible on the Web (as LD)
SPARQL, Linked Data APIs (Puelia, Elda, LDP implementations)
Publication
Dataset and vocabulary publication on the Web
Example of LD vocabulary:
GND Ontology:
http://d-nb.info/standards/elementset/gnd
Publication
Metadata definition and publication
INPUT:
• All previous assets and documentation
OUTPUT:
• Dataset registered in relevant catalogues
• Metadata available, accessible and discoverable
DCAT, VoID
Publication
Metadata definition and publication
DCAT: http://www.w3.org/TR/vocab-dcat/
Example:
:catalog a dcat:Catalog ;
    dct:title "Imaginary Catalog" ;
    rdfs:label "Imaginary Catalog" ;
    foaf:homepage <http://example.org/catalog> ;
    dct:publisher :transparency-office ;
    dct:language <http://id.loc.gov/vocabulary/iso639-1/en> ;
    dcat:dataset :dataset-001 , :dataset-002 , :dataset-003 .
Exercise:
• Create a DCAT description in Turtle syntax of the Book, Authors, Collections catalog:
  • Book dataset
  • Authors dataset
  • Collections dataset
  • If possible, use different licenses, languages, and provenance
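A possible sketch of such a description, following the :catalog example above (names, languages, licenses and provenance are illustrative choices):

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:   <http://example.org/> .

ex:catalog a dcat:Catalog ;
    dct:title "Books, Authors and Collections catalog"@en ;
    dcat:dataset ex:books , ex:authors , ex:collections .

ex:books a dcat:Dataset ;
    dct:title "Book dataset"@en ;
    dct:language <http://id.loc.gov/vocabulary/iso639-1/en> ;
    dct:license <http://creativecommons.org/publicdomain/zero/1.0/> .

ex:authors a dcat:Dataset ;
    dct:title "Authors dataset"@en ;
    dct:language <http://id.loc.gov/vocabulary/iso639-1/es> ;
    dct:license <http://creativecommons.org/licenses/by/4.0/> .

ex:collections a dcat:Dataset ;
    dct:title "Collections dataset"@en ;
    dct:provenance [ rdfs:label "Derived from the library catalogue"@en ] .
```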
Conclusions
• Data curation is key to the success of LD
• Documentation of data sources and issues
• Language issues have to be taken into account during the whole process
• Metadata description is key for enabling reuse and discovery
• Vocabularies have to be documented and published following LD best practices