
Page 1: Methodology for Linked Data Generation and publication Daniel Vila-Suero dvila@fi.upm.es

Methodology for Linked Data Generation and publication

Daniel Vila-Suero

dvila@fi.upm.es


Introduction

• Different methods and life-cycles available:
  • LOD2
  • Datalift
  • W3C Linked Data cookbook
  • W3C Best practices for Linked Data
  • Guidelines for Multilingual Linked Data
  • "datos.bne.es: an insight into Library Linked Data", Daniel Vila-Suero, Asunción Gómez-Pérez, Library Hi-tech Journal 2013


OEG LD Projects

Libraries:

Biblioteca Nacional: http://datos.bne.es

Geographic:

IGN: http://geo.linkeddata.es/

OTALEX

Meteorology:

AEMET: http://aemet.linkeddata.es/

Travel:

Grupo Prisa: http://webenemasuno.linkeddata.es/


Linked Data lifecycles

• Methodological approach to Linked Data publication and management.

• Basically described as a series of steps or activities with associated tasks, technologies and methods.

• Several approaches have been proposed (Hausenblas et al., Hyland et al., Villazón-Terrazas et al., etc.) with many similarities; see http://www.w3.org/2011/gld/wiki/GLD_Life_cycle

• However, these lifecycles should be understood as sets of practices that are currently applied, not as a one-size-fits-all formula [1]

[1] Publishing Linked Data - There is no One-Size-Fits-All Formula http://mccarthy.dia.fi.upm.es/doc/edf.pdf


LD Lifecycles: Hyland et al.


LD Lifecycles: Hausenblas et al.


http://linked-data-life-cycles.info/


LD Lifecycles: Datalift approach


• French national project: http://datalift.org/

• Activities:
  • Raw RDF conversion
  • Selection of vocabularies
  • Conversion according to a schema
  • Publication
  • Interlinking
  • Exploitation


LD Lifecycles: LOD2 Approach


• Extraction
• Storage
• Authoring
• Interlinking
• Enrichment
• Quality
• Evolution
• Exploration

• More info at:

http://lod2.eu/


LD Lifecycles: Villazón-Terrazas et al.


Guidelines for ML Linked Data

• Set of main activities:
  • Specification
  • Modelling
  • RDF Generation
  • Data curation
  • Linking
  • Publication

• Each activity composed of several tasks


Method and lifecycle


Specification: Intro

• The goal is to:
  • Specify and analyse the data sources
  • Produce documentation that will be used within the next activities


Specification: Tasks

1. Identification and analysis of data sources

2. URI/IRI design

3. Analysis and definition of licensing and provenance information


1. Specification

1. Analyze the data sources: What is the structure of your data? In which format? What types of entities are described in the data?

2. URI design: How will you name your resources?

• Several guidelines available:
  • Linked Data Patterns (Dodds and Davis): http://patterns.dataincubator.org/book/
  • UK Cabinet Office: http://www.cabinetoffice.gov.uk/media/308995/public_sector_uri.pdf
  • Style Guidelines for Naming and Labelling Ontologies in the Multilingual Web: http://bit.ly/xJwA9g


1. Specification

3. Define/describe the provenance information: How will you express and track the sources and aggregations of resources?

• Different vocabularies available: OPMV, W3C PROV-O, OAI-ORE, DCMI-PROV, etc.

• Good starting point: PROV Model Primer


Specification

Identification and analysis of data sources

INPUT:
• Data sources
• Associated documentation

OUTPUT:
• Documentation of data sources (type of data, formats, data model, language, identifiers, etc.)


Specification

Identification and analysis of data sources

• Documentation of data sources:

• Type of data: Authors, Publications, Subjects, etc.

• Format: MARC21, XML, EXCEL, TSV, CSV, RDF

• Data model: MARC21, Ad-hoc RDB, XSD schema


Specification

Identification and analysis of data sources

• Documentation of data sources:

• Languages: English

• Identifiers: Labels of terms, code 001

• Licensing and provenance information:


Specification

URI/IRI design

INPUT:
• Documentation of data sources

OUTPUT:
• Documentation of URI/IRI patterns and namespaces


URI Forms

• A publication of RDF should at least:
  • Have a de-referenceable URI that returns RDF (preferably Turtle)
  • Have a de-referenceable URI that returns a human-readable representation (HTML)

• Two URI strategies:
  • Hash (#{id}) strategy:
    • A request to http://example.org/vocabulary#Person returns the complete vocabulary to the client, and the client is responsible for finding #Person
  • HTTP 303 or slash (/) strategy:
    • A request to http://example.org/vocabulary/Person
    • Returns HTTP 303 with the location of the document, chosen by content negotiation
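The difference between the two strategies can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names and the `http://example.org/doc/` document base are hypothetical, and no real server is contacted.

```python
from urllib.parse import urldefrag


def document_to_fetch(uri):
    """Hash strategy: return the URL a client actually requests.

    HTTP clients never send the fragment to the server; they fetch the
    whole document and look up the fragment (e.g. #Person) locally.
    """
    document, _fragment = urldefrag(uri)
    return document


def see_other(uri, doc_base="http://example.org/doc/"):
    """Slash strategy: simulate the 303 redirect a server could answer with.

    doc_base is a hypothetical location for the describing documents.
    """
    local_name = uri.rstrip("/").rsplit("/", 1)[-1]
    return 303, {"Location": doc_base + local_name}


print(document_to_fetch("http://example.org/vocabulary#Person"))
print(see_other("http://example.org/vocabulary/Person"))
```

With hash URIs one request returns the whole vocabulary; with slash URIs every term costs a redirect but can be served individually.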


Specification

URI/IRI design

- Namespace:

- http://www.datos.bne.es/

- prefix: bne:

- Patterns:

- Canonical

http://datos.bne.es/resource/{id}

- Persons: http://datos.bne.es/autor/{id}

- Works: http://datos.bne.es/obra/{id}

- Expressions: http://datos.bne.es/versión/{id}

- Manifestations: http://datos.bne.es/edición/{id}

- Subjects: http://datos.bne.es/edición/{id}
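A minimal sketch of how such patterns could be applied when minting identifiers. The dictionary keys and function names are assumptions made for illustration, not part of datos.bne.es; the identifier is the one used in an example URI elsewhere in these slides. Because some patterns contain non-ASCII characters (they are IRIs), the sketch also shows the percent-encoded URI form.

```python
from urllib.parse import quote

BASE = "http://datos.bne.es/"

# Path patterns taken from the slide; the keys are hypothetical labels.
PATTERNS = {
    "canonical": "resource/{id}",
    "person": "autor/{id}",
    "work": "obra/{id}",
    "expression": "versión/{id}",
    "manifestation": "edición/{id}",
}


def mint_iri(kind, identifier):
    """Fill the pattern for the given kind of resource."""
    return BASE + PATTERNS[kind].format(id=identifier)


def iri_to_uri(iri):
    """Percent-encode non-ASCII characters (as UTF-8), keeping URI delimiters."""
    return quote(iri, safe=":/#?&=")


iri = mint_iri("expression", "XX1718747")
print(iri)
print(iri_to_uri(iri))
```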


Specification

URI/IRI design

- Namespace vocabulary:

- http://www.datos.bne.es/vocab

- prefix: bnevoc:

- Patterns:

- Canonical

http://datos.bne.es/vocab/{id}


URI design = naming

Some general URI design guidelines


Naming: Preliminary guidelines for a multilingual scenario


Some tools are not prepared for opaque URIs (Pubby)…


Semantic Web Journal reviewer about datos.bne.es' paper* :

"It is pity that local names of chosen IFLA-FRBR properties are cryptic codes … but authors of this paper are not to blame about that"

* http://datos.bne.es/resource/XX1718747

* http://www.semantic-web-journal.net/content/datosbnees-library-linked-data-dataset


Some others are better prepared (Puelia)…


frbr:C1005 a rdfs:Class; rdfs:label "Person"@en, "Persona"@es

Display labels are configurable using a Turtle config file

* http://datos.bne.es/frontend/persons

Label not selected based on User's locale


Specification

Licensing and provenance information

INPUT:
• Documentation of data sources

OUTPUT:
• Documentation of provenance and licensing terms

ODRL: Open Digital Rights Language


Exercise:

• Define the URI patterns for the following elements:

- Person

- Book

- Person is author of Book

- Person first name

- Person last name

- Book title

- Book is part of Collection

- Collection has part Book


How open is the LOD cloud?

[1] Rodriguez-Doncel, Victor et al., 2013. Rights declaration in Linked Data. in Proc. of the 3rd Int. W. on Consuming Linked Data O. Hartig et al. (Eds) CEUR vol. 1034 (2013)


Modelling: Intro

• The goal is to:
  • Design and implement a vocabulary for describing the data sources in RDF
  • Produce a valid and LD-compliant vocabulary that facilitates the understanding and consumption of data


2. Modelling

2. Create vocabulary terms (classes, properties and relationships): if you are unable to find suitable terms, you need to create them
• From an existing conceptual model
• From scratch
• By specializing existing terms (subPropertyOf, subClassOf, domain, range, etc.). For example:

myVocabulary:Archivist rdfs:subClassOf foaf:Person

• Desktop tools:
  • NeOn Toolkit
  • Protégé
  • TopBraid Composer Community Edition

• Online tools:
  • Metadata registry
  • Neologism (DERI)
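Why specializing an existing term pays off can be shown with a tiny sketch of rdfs:subClassOf reasoning. It is plain Python over pairs of prefixed names, not a real RDF library; the function name is hypothetical, and the second pair relies on FOAF declaring foaf:Person a subclass of foaf:Agent.

```python
# (subclass, superclass) pairs; the first comes from the slide's example.
SUBCLASS_OF = {
    ("myVocabulary:Archivist", "foaf:Person"),
    ("foaf:Person", "foaf:Agent"),
}


def superclasses(cls, subclass_of):
    """Return every class reachable from cls by following subClassOf."""
    found = set()
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for sub, sup in subclass_of:
            if sub == current and sup not in found:
                found.add(sup)
                frontier.append(sup)
    return found


print(sorted(superclasses("myVocabulary:Archivist", SUBCLASS_OF)))
```

A consumer that only knows FOAF can therefore still treat every archivist as a foaf:Person, which is the main benefit of aligning a new vocabulary with a widely used one.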


Modelling: Tasks

1. Analysis and selection of domain vocabularies

2. Development of vocabulary

3. Vocabulary for representing licensing and provenance information


Modelling

Analysis and selection of domain vocabularies

INPUT:
• Documentation of data sources (data models, type of data)

OUTPUT:
• A selection of standard/widely-used vocabularies

LOV: Linked Open Vocabularies


2. Modelling

1. Search for suitable vocabularies to model the data sources: What properties and classes will you use to describe the RDF data?
• How can I find suitable vocabularies?
  • Go to thedatahub.org to search for similar datasets and see which vocabularies they use
  • Given a SPARQL endpoint, use vocab-express
  • Given a set of keywords describing entities in your source data, go to LOV http://lov.okfn.org/ and use the search utility
  • Open Metadata Registry http://metadataregistry.org
  • Ask the community on the relevant mailing lists
• What makes a vocabulary suitable?

Is your LD vocabulary 5-star?


Modelling

Analysis and selection of domain vocabularies

NIF: NLP Interchange Format

IFLA

BIBO

DublinCore

Use http://lov.okfn.org/


Modelling

Development of vocabulary

INPUT:
• Documentation of data sources (data model, type of data)
• Selection of standard/widely-used vocabs

OUTPUT:
• Well-documented and linked vocabulary (with mappings to widely-used vocabs)


Modelling

Development of vocabulary

• Integrate all reused vocabularies (possibly using subClassOf and subPropertyOf)

• Document the vocabulary using an Ontology Editor Tool

• Validate the vocabulary:

Example:

http://oeg-lia3.dia.fi.upm.es/oops/


Exercise:

• Create a small vocabulary in Protégé with the following model:

- Person

- Book

- Person is author of Book

- Person first name

- Person last name

- Book title

- Book is part of Collection

- Collection has part Book


Exercise:

• Reuse and align with existing vocabularies if possible

• Validate using


Modelling

Vocabulary for licensing and provenance

INPUT:
• Documentation of data sources (licensing and provenance)

OUTPUT:
• Selection of standard vocabs

ODRL: Open Digital Rights Language

PROV: W3C Provenance Ontology


Guideline: Licensing models & mechanisms

1. Add "rights" metadata in the dataset description (e.g., VoID, DCAT)

2. Use standard predicates to declare "rights" statements (e.g., Dublin Core terms: dc:rights, dct:license)

3. Standard license available?
  • 3a. Yes: use the URI of the standard license, e.g., CC0
  • 3b. No: use a rights declaration language, e.g., ODRL

ODRL: Open Digital Rights Language

DCAT: Data Catalog Vocabulary
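The decision in step 3 can be sketched as a small Python function that emits Turtle statements. Everything here is illustrative: the function name, the dataset and policy URIs are hypothetical, and linking the dataset to an ODRL policy through dct:rights is an assumption of this sketch rather than something the guideline prescribes.

```python
# URI of the CC0 public domain dedication.
CC0 = "http://creativecommons.org/publicdomain/zero/1.0/"


def rights_statement(dataset, standard_license=None, policy=None):
    """Return Turtle lines declaring the rights of a dataset.

    3a: a standard license exists, so point to its URI with dct:license.
    3b: no standard license fits, so point to a custom policy document
        (assumed here to be typed as odrl:Policy and linked via dct:rights).
    """
    if standard_license:
        return f"<{dataset}> dct:license <{standard_license}> ."
    if policy:
        return (
            f"<{dataset}> dct:rights <{policy}> .\n"
            f"<{policy}> a odrl:Policy ."
        )
    raise ValueError("no rights information supplied")


print(rights_statement("http://example.org/dataset/books", standard_license=CC0))
print(rights_statement("http://example.org/dataset/authors",
                       policy="http://example.org/policy/1"))
```

The output assumes the dct: and odrl: prefixes are declared in the surrounding Turtle document.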


RDF Generation: Intro

• The goal is to:
  • Define the method, process and technologies to generate RDF
  • Ideally, produce a set of explicit mappings from the data sources to RDF


RDF Generation: Tasks

1. Selection, extension or development of technologies for RDF generation

2. Mapping of data sources to RDF

3. Transformation of data sources to RDF


RDF Generation

Technologies for RDF Generation

INPUT:
• Documentation of data sources (formats)

OUTPUT:
• Configuration of technologies (framework)

Open Refine

Jena Toolkit

any23

Morph: R2RML engine


3. RDF Generation

1. Transform the data source to RDF
• For that we need to know the model and format of the data source.
• Increasing number of RDFizers:
  • RDB: new W3C Recommendations R2RML and Direct Mapping
  • CSV, TSV, tabular data: Google Refine, NOR2O
  • XML: GRDDL, XSLT, etc.
  • …
• Mapping source data to RDF is not always easy: low usability

2. Data cleansing:
• New tools for data quality, for example LDIF (Linked Data Integration Framework)
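To make the idea of an RDFizer for tabular data concrete, here is a minimal sketch that turns CSV rows into N-Triples using only the Python standard library. It is not how any of the tools above work internally; the base URIs, the column-to-property mapping (one property per column name) and the sample data are all assumptions.

```python
import csv
import io

BASE = "http://example.org/resource/"   # hypothetical resource namespace
VOCAB = "http://example.org/vocab/"     # hypothetical vocabulary namespace


def escape(value):
    """Escape characters that are not allowed raw in an N-Triples literal."""
    return (value.replace("\\", "\\\\").replace('"', '\\"')
                 .replace("\n", "\\n").replace("\r", "\\r"))


def csv_to_ntriples(text, id_column="id"):
    """Map each row to a subject and each non-empty cell to a literal triple."""
    triples = []
    for row in csv.DictReader(io.StringIO(text)):
        subject = f"<{BASE}{row[id_column]}>"
        for column, value in row.items():
            if column == id_column or not value:
                continue
            triples.append(f'{subject} <{VOCAB}{column}> "{escape(value)}" .')
    return triples


SAMPLE = "id,title,language\n1,Don Quijote,es\n"
for line in csv_to_ntriples(SAMPLE):
    print(line)
```

A real mapping would replace the column-name properties with terms from the vocabulary selected in the modelling activity, which is exactly what explicit mapping languages such as R2RML make declarative.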


RDF Generation

Technologies for RDF Generation


RDF Generation

Mapping of data sources to RDF

INPUT:
• Vocabulary
• Data sources
• Configuration of technologies

OUTPUT:
• Ideally, machine-processable mappings
• RDF dataset


Data curation: Intro

• Cross-cutting activity
• The goal is to:
  • Ensure the quality of the LD dataset
  • Enable data curation at the data sources level


Data curation: Tasks

1. Data sources curation

2. RDF data curation


Data curation

Data sources curation

INPUT:
• Data sources
• Reports of issues from previous activities

OUTPUT:
• Documentation of issues in data sources
• Fixes to data sources


Data curation

RDF data curation

INPUT:
• RDF dataset
• Vocabulary

OUTPUT:
• Documentation of issues
• Fixes to RDF dataset

RDFUnit: a Test-Driven Data Debugging framework

http://validator.linkeddata.org/vapour


RDF Data Curation

• A publication of RDF should at least:
  • Have a de-referenceable URI that returns RDF (preferably Turtle)
  • Have a de-referenceable URI that returns a human-readable representation (HTML)

• Two URI strategies:
  • Hash (#{id}) strategy:
    • A request to http://example.org/vocabulary#Person returns the complete vocabulary to the client, and the client is responsible for finding #Person
  • HTTP 303 or slash (/) strategy:
    • A request to http://example.org/vocabulary/Person
    • Returns HTTP 303 with the location of the document, chosen by content negotiation
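Content negotiation itself can be sketched as choosing a representation from the client's Accept header. The function below is a simplified illustration, not a complete implementation of HTTP negotiation: it handles q-values and the `*/*` wildcard only, and the set of available media types is an assumption.

```python
def negotiate(accept, available=("text/turtle", "text/html")):
    """Pick the available media type with the highest q-value.

    Simplification: partial wildcards such as text/* are ignored, and */*
    maps to the first (preferred) available type.
    """
    best, best_q = None, 0.0
    for part in accept.split(","):
        pieces = [p.strip() for p in part.split(";")]
        media_type, q = pieces[0], 1.0
        for param in pieces[1:]:
            if param.startswith("q="):
                try:
                    q = float(param[2:])
                except ValueError:
                    q = 0.0
        if media_type == "*/*":
            candidates = list(available)
        else:
            candidates = [m for m in available if m == media_type]
        for candidate in candidates[:1]:
            if q > best_q:
                best, best_q = candidate, q
    return best


print(negotiate("text/turtle"))
print(negotiate("text/html,application/xhtml+xml;q=0.9,*/*;q=0.8"))
```

A validator checking a Linked Data URI essentially repeats the request with different Accept headers and verifies that both the RDF and the HTML representation can be reached.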


Exercise:

• Validate 5 different URIs from existing Linked Data projects using http://validator.linkeddata.org/vapour

• Search in the datahub for URIs to check if you don't know any dataset

• Analyse and explain the results


Linking: Intro

• The goal is to:
  • Maximize the connectivity to external datasets
  • Facilitate data discovery and enrichment


Linking: Tasks

1. Target datasets discovery and selection

2. Link discovery

3. Links validation


4. Linking

1. Identify datasets that may be suitable as linking targets:
• Still difficult; most people think of DBpedia
• Use thedatahub.org to find similar datasets
• Use Sindice to search for entities in your dataset

2. Discover relationships between the data items of your dataset and the items of the datasets identified in the previous step
• Best-known semi-automatic tools: SILK, LIMES
• Still require a high level of configuration

3. Validate the relationships discovered in the previous step
• Still difficult; maybe it should be included in editorial/cataloguing workflows?
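The link discovery step can be illustrated with a deliberately naive sketch that compares labels with a string similarity measure. This is not how SILK or LIMES are configured; the function names, threshold, URIs and labels are assumptions, and the result is only a tentative linkset that still needs validation.

```python
from difflib import SequenceMatcher


def normalise(label):
    """Lower-case and collapse whitespace before comparing labels."""
    return " ".join(label.lower().split())


def candidate_links(source, target, threshold=0.9):
    """Propose owl:sameAs links between resources with similar labels.

    source and target map resource URIs to labels.
    """
    links = []
    for s_uri, s_label in source.items():
        for t_uri, t_label in target.items():
            score = SequenceMatcher(
                None, normalise(s_label), normalise(t_label)
            ).ratio()
            if score >= threshold:
                links.append((s_uri, "owl:sameAs", t_uri, round(score, 2)))
    return links


source = {"http://example.org/a/1": "Miguel de  Cervantes"}
target = {
    "http://example.org/b/9": "miguel de cervantes",
    "http://example.org/b/10": "Lope de Vega",
}
for link in candidate_links(source, target):
    print(link)
```

Comparing every pair is quadratic in the dataset sizes, and label similarity alone produces false matches (for example homonymous authors), which is why real tools add blocking strategies and why the validation task exists.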


Linking

Target datasets discovery and selection

INPUT:
• RDF dataset
• Vocabulary

OUTPUT:
• Selection of target datasets
• Commonalities with target datasets (data, vocabulary level)


Linking

Link discovery

INPUT:
• RDF dataset
• Selection of target datasets

OUTPUT:
• Linking tasks specification
• Tentative linksets

Open Refine, Silk, Limes


Linking

Link validation

INPUT:
• Tentative linksets

OUTPUT:
• Validated linksets (including provenance and licensing information)

sameAsValidator


Publication: Intro

• The goal is to:
  • Make the RDF dataset available following LD best practices
  • Facilitate dataset discovery and consumption


Publication: Tasks

1. Dataset and vocabulary publication on the Web

2. Metadata definition and publication


Publication

Dataset and vocabulary publication on the Web

INPUT:
• RDF dataset
• Validated linksets
• Vocabulary

OUTPUT:
• Linked Dataset available and accessible on the Web
• Vocabulary available and accessible on the Web (as LD)

SPARQL

Linked Data APIs

Puelia, Elda, LDP implementations


Publication

Dataset and vocabulary publication on the Web

Example of LD vocabulary:

GND Ontology:

http://d-nb.info/standards/elementset/gnd


Publication

Metadata definition and publication

INPUT:
• All previous assets and documentation

OUTPUT:
• Dataset registered in relevant catalogues
• Metadata available, accessible and discoverable

DCAT, VoID


Publication

Metadata definition and publication

DCAT: http://www.w3.org/TR/vocab-dcat/

Example:

:catalog a dcat:Catalog ;
    dct:title "Imaginary Catalog" ;
    rdfs:label "Imaginary Catalog" ;
    foaf:homepage <http://example.org/catalog> ;
    dct:publisher :transparency-office ;
    dct:language <http://id.loc.gov/vocabulary/iso639-1/en> ;
    dcat:dataset :dataset-001 , :dataset-002 , :dataset-003 ;
    .
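When a catalog lists many datasets, a description like the one above is usually generated rather than written by hand. The following sketch builds such a Turtle document with plain string formatting. The function name, the default `:` namespace and the dataset names are assumptions; the dcat:, dct: and foaf: namespaces are the standard ones.

```python
PREFIXES = "\n".join([
    "@prefix : <http://example.org/> .",  # hypothetical default namespace
    "@prefix dcat: <http://www.w3.org/ns/dcat#> .",
    "@prefix dct: <http://purl.org/dc/terms/> .",
    "@prefix foaf: <http://xmlns.com/foaf/0.1/> .",
])


def catalog_turtle(title, homepage, datasets):
    """Return a Turtle description of a dcat:Catalog and its datasets."""
    dataset_list = " , ".join(f":{name}" for name in datasets)
    body = "\n".join([
        ":catalog a dcat:Catalog ;",
        f'    dct:title "{title}" ;',
        f"    foaf:homepage <{homepage}> ;",
        f"    dcat:dataset {dataset_list} .",
    ])
    return PREFIXES + "\n\n" + body


print(catalog_turtle("Imaginary Catalog", "http://example.org/catalog",
                     ["dataset-001", "dataset-002", "dataset-003"]))
```

The sketch does not escape quotes in the title, so it is only suitable for trusted, simple input.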


Exercise:

• Create a DCAT description in Turtle syntax of the Book, Authors, Collections catalog:
  • Book dataset
  • Authors dataset
  • Collections dataset
  • If possible, include different licenses, languages, and provenance.


Conclusions

• Data curation is key to the success of LD
• Documentation of data sources and issues
• Language issues have to be taken into account during the whole process
• Metadata description is key to enabling reuse and discovery
• Vocabularies have to be documented and published following LD BPs