a candidate dataset_discovery_and_linkage_recommendation_system_for_linked_data

1

A Candidate Dataset Discovery and Linkage Recommendation System for Linked Data

OC WG Meeting – 26.6.2012

Michael Luger

2

Outline

● Introduction

● Linked Data

● Candidate Dataset Selection

● Linkage Recommendation

● Conclusion

3

Introduction

● Maturing of Semantic Web technologies

● Growing amount of structured data becoming available as Linked Data

– Data publishers from a variety of domains should be encouraged to participate

● Establishment of links between Linked Data resources → Data Linking

– Draws influences from Record Linkage, Ontology Matching

– Need for solutions to facilitate this process

● Data Linking can benefit from the exploitation of various information sources

– Available Metadata

– Ontology Alignments

– Statistical Information

– User Input

4

Linked Data – Linking Open Data Initiative

● 300 datasets

● 30 billion triples

● 500 million RDF links

LOD Cloud – September 2011

5

Linked Data – Publishing Workflow

[Figure: Linked Data publishing workflow. Stages shown: Ontology Selection, Conversion, Interlinking, Publication, Application. Sub-steps shown: Dataset Selection, Matcher Configuration, Matching.]

6

Dataset Discovery – Overview

● Goal – Given a source dataset, discover candidate target datasets from the LOD cloud

[Figure: dataset discovery pipeline. Information extraction is applied to the source dataset and, pre-computed, to the set of LOD datasets. Each side yields representative information sources (keywords, schema elements, topics, ...). These are compared, together with user-supplied information sources, to produce the results: the top-ranked similar datasets from the set of LOD datasets.]

7

Dataset Discovery – Metadata Initiatives (i)

● CKAN

– Data hub platform covering LOD datasets

– Allows organizations to publish metadata about their data in a structured format

Example – CKAN metadata published about the dataset bbcmusic.

8

Dataset Discovery – Metadata Initiatives (ii)

● Ontologies for Linked Data

– VoID – Vocabulary of Interlinked Datasets

    ● RDFS vocabulary for describing linked datasets

    ● Main concepts – void:Dataset, void:Linkset

– LOV – Linked Open Vocabularies

    ● RDFS vocabulary for describing ontologies used by linked datasets

– VOAF – Vocabulary of a Friend

    ● Extension of LOV

    ● Describes relationships between ontologies and topic classifications

9

Dataset Discovery – Metadata Initiatives (iii)

Example – LOV / VOAF metadata about the Music Ontology.

10

Dataset Discovery – Information Sources

● Schema Elements

– Classes and properties, extracted from datasets using SPARQL

– Ontology Alignments

● Keywords

– Class- and property-labels, extracted from schema information

– CKAN metadata (tags, groups, …)

– User-supplied keywords

● Vocabularies

– Derived from schema elements

– Retrieved from VoID metadata

● Topics

– Retrieved from LOV / VOAF metadata

– User-supplied topics
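
As a rough illustration of how schema elements might be gathered and how vocabularies and keywords could be derived from them, the sketch below lists two generic SPARQL queries and a few small helpers. The queries, the helper names (`split_uri`, `vocabularies`, `keywords`) and the sample URIs are assumptions made for illustration, not the system's actual implementation.

```python
import re

# Generic SPARQL queries for schema extraction (assumed, not taken from the slides).
CLASS_QUERY = "SELECT DISTINCT ?class WHERE { ?s a ?class }"
PROPERTY_QUERY = "SELECT DISTINCT ?property WHERE { ?s ?property ?o }"


def split_uri(uri):
    """Split a URI into (namespace, local name) at the last '#' or '/'."""
    cut = max(uri.rfind("#"), uri.rfind("/")) + 1
    return uri[:cut], uri[cut:]


def vocabularies(schema_elements):
    """Derive the set of vocabulary namespaces used by the schema elements."""
    return {split_uri(uri)[0] for uri in schema_elements}


def keywords(schema_elements):
    """Derive lower-case keywords from local names, splitting camelCase and '_'."""
    words = set()
    for uri in schema_elements:
        local = split_uri(uri)[1]
        spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", local).replace("_", " ")
        words.update(word.lower() for word in spaced.split())
    return words


# Sample schema elements, as they might be returned by the queries above.
elements = [
    "http://purl.org/ontology/mo/MusicArtist",
    "http://purl.org/ontology/mo/Record",
    "http://xmlns.com/foaf/0.1/name",
]
print(sorted(vocabularies(elements)))
print(sorted(keywords(elements)))
```

Real datasets usually also carry rdfs:label values for classes and properties; those would be a better keyword source than local names where they exist.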

11

Dataset Discovery – Computation

● Computation of overlap score $O$ between the source dataset $D_S$ and each potential target dataset $D_T$

● $D_S$ … Source Dataset

● $D_T$ … Target Dataset

● $KW_{D_S}$, $KW_{D_T}$, $SE_{D_S}$, $SE_{D_T}$, $T_{D_S}$, $T_{D_T}$ ... Sets of Keywords, Schema Elements, Topics

● $kw_{D_S}$, $kw_{D_T}$, $se_{D_S}$, $se_{D_T}$, $t_{D_S}$, $t_{D_T}$ ... Individual Keywords, Schema Elements, Topics

● $M_{kw}$, $M_{se}$, $M_t$ … Matching Predicates

● $w_{kw}$, $w_{se}$, $w_t$ ... Weights
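
The exact formula for O is not reproduced in this text, so the following is only one plausible reading of the symbols listed above: for each kind of information source, take the fraction of source items that have at least one match in the target under the matching predicate, then combine the fractions as a weighted average. The function names, the dictionary keys (`kw`, `se`, `t`) and the sample data are hypothetical.

```python
def overlap(source_items, target_items, matches):
    """Fraction of source items with at least one matching target item."""
    if not source_items:
        return 0.0
    hits = sum(
        1 for s in source_items if any(matches(s, t) for t in target_items)
    )
    return hits / len(source_items)


def overlap_score(source, target, predicates, weights):
    """Weighted average of the per-kind overlaps (assumed form of O)."""
    total_weight = sum(weights.values())
    if total_weight == 0:
        return 0.0
    weighted = sum(
        weights[kind]
        * overlap(source.get(kind, set()), target.get(kind, set()), predicates[kind])
        for kind in weights
    )
    return weighted / total_weight


def rank(source, candidates, predicates, weights):
    """Return (dataset name, score) pairs, best candidate first."""
    scored = [
        (name, overlap_score(source, target, predicates, weights))
        for name, target in candidates.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def exact(a, b):
    """Simplest possible matching predicate: string equality."""
    return a == b


# Hypothetical information sources for a source dataset and two candidates.
source = {"kw": {"music", "artist", "album"},
          "se": {"mo:MusicArtist", "foaf:name"},
          "t": {"music"}}
candidates = {
    "dataset_a": {"kw": {"music", "artist"}, "se": {"mo:MusicArtist"}, "t": {"music"}},
    "dataset_b": {"kw": {"geography"}, "se": {"foaf:name"}, "t": {"places"}},
}
predicates = {"kw": exact, "se": exact, "t": exact}
weights = {"kw": 1.0, "se": 2.0, "t": 1.0}  # user-adjustable weights

ranking = rank(source, candidates, predicates, weights)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Swapping `exact` for a predicate that consults ontology alignments or a string similarity would be the natural place to plug in the richer matching mentioned on the slides.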

12

Dataset Discovery – Results

● CKAN Metadata

– Complete but can skew results (non-representative keywords)

● VoID Descriptions

– Not available for all the datasets

● Schema Elements

– Problems with retrieval for some datasets

– Ontology alignments improve results significantly

– Potential through application of ontology matching

● Topics

– LOV provides a very general classification

– Potential through application of Upper Level Ontologies

● Computation works well in combination with user input (weights, custom input)

13

Linkage Recommendation – Overview

● Matching Linked Data resources → Data Linking

– Given two datasets, links between their resources are established by matching, i.e. by assessing how similar the resources are

– Reliance on well established value-matching techniques

– Adapted to RDF-type data characteristics (graph-like structure, heterogeneous vocabularies, large collections)

● Existing data linking tools

– Fully automated approaches exhibit limited applicability

– Semi-automated approaches demand manual configuration → Linkage Specification

● Goal – User interactive environment for creating linkage specifications

– Recommendation of linkage specifications

– Exploitation of available information sources

– Ability to perform matching and evaluate results in an iterative way

14

Linkage Recommendation – Specification (Silk LSL)

<Prefixes>
  <!-- rdf and owl are used further down, so they are declared here as well -->
  <Prefix id="rdf" namespace="http://www.w3.org/1999/02/22-rdf-syntax-ns#" />
  <Prefix id="rdfs" namespace="http://www.w3.org/2000/01/rdf-schema#" />
  <Prefix id="owl" namespace="http://www.w3.org/2002/07/owl#" />
  <Prefix id="dbpedia" namespace="http://dbpedia.org/ontology/" />
  <Prefix id="gn" namespace="http://www.geonames.org/ontology#" />
</Prefixes>

<DataSources>
  <DataSource id="dbpedia">
    <Param name="endpointURI" value="http://example.org/sparql" />
    <Param name="graph" value="http://dbpedia.org" />
  </DataSource>
  <DataSource id="geonames">
    <Param name="endpointURI" value="http://example2.org/sparql" />
    <Param name="graph" value="http://sws.geonames.org/" />
  </DataSource>
</DataSources>

<Interlinks>
  <Interlink id="cities">
    <LinkType>owl:sameAs</LinkType>
    <SourceDataset dataSource="dbpedia" var="a">
      <RestrictTo> ?a rdf:type dbpedia:City </RestrictTo>
    </SourceDataset>
    <TargetDataset dataSource="geonames" var="b">
      <RestrictTo> ?b rdf:type gn:P </RestrictTo>
    </TargetDataset>
    <LinkageRule>
      <Aggregate type="average">
        <Compare metric="jaro">
          <Input path="?a/rdfs:label" />
          <Input path="?b/gn:name" />
        </Compare>
        <Compare metric="num">
          <Input path="?a/dbpedia:populationTotal" />
          <Input path="?b/gn:population" />
        </Compare>
      </Aggregate>
    </LinkageRule>
    <...>
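
To make the linkage rule above more concrete, the sketch below mimics what it expresses: a Jaro comparison of the two labels and a numeric comparison of the two population values, combined by averaging. This is not Silk's implementation. In particular, `numeric_similarity` is a hypothetical stand-in for the `num` metric, and the dictionary keys and sample values are invented for illustration.

```python
def jaro(s, t):
    """Jaro string similarity in [0, 1]."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_matched = [False] * len(s)
    t_matched = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        lo = max(0, i - window)
        hi = min(len(t), i + window + 1)
        for j in range(lo, hi):
            if not t_matched[j] and t[j] == ch:
                s_matched[i] = t_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    s_chars = [s[i] for i in range(len(s)) if s_matched[i]]
    t_chars = [t[j] for j in range(len(t)) if t_matched[j]]
    transpositions = sum(a != b for a, b in zip(s_chars, t_chars)) // 2
    return (
        matches / len(s) + matches / len(t) + (matches - transpositions) / matches
    ) / 3


def numeric_similarity(a, b):
    """Assumed numeric metric: 1 minus the relative difference, floored at 0."""
    if a == b:
        return 1.0
    return max(0.0, 1.0 - abs(a - b) / max(abs(a), abs(b)))


def link_confidence(source, target):
    """Average of both comparisons, mirroring <Aggregate type="average">."""
    label_score = jaro(source["label"], target["name"])
    population_score = numeric_similarity(
        source["populationTotal"], target["population"]
    )
    return (label_score + population_score) / 2


# Invented sample resources standing in for ?a and ?b.
a = {"label": "Berlin", "populationTotal": 3500000}
b = {"name": "Berlin", "population": 3400000}
print(round(link_confidence(a, b), 3))
```

A link of type owl:sameAs would then be proposed for pairs whose confidence exceeds some user-chosen threshold, which is one of the parameters a recommendation step could suggest.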

15

Linkage Recommendation – Information Sources (i)

● User input

– Tabular view for browsing the contents of two chosen datasets

– Filtering of contents by specifying constraints

– Ability to specify and execute linkage specification

– Display of additional information sources inline

[Figure: system architecture. Components shown: User, Matcher & Analysis Service, Storage / Retrieval, Alignment Server, Source Dataset Endpoint and Target Dataset Endpoint. Classes and properties of both datasets are accessed via SPARQL.]

16

Linkage Recommendation – Information Sources (ii)

● Ontology Alignments

– Management of ontology correspondences as part of the user interface

● Statistics

– Property Discriminability & Coverage

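$$\mathrm{dis}(p) = \frac{\left|\{\, o \mid t = \langle i, p, o \rangle \,\}\right|}{\left|\{\, t \mid t = \langle i, p, o \rangle \,\}\right|}$$

$$\mathrm{cov}(p) = \frac{\left|\{\, i \mid t = \langle i, p, o \rangle \,\}\right|}{\left|\{\, i \mid t = \langle i, *, o \rangle \,\}\right|}$$

– Sample Property Pair Analysis

    ● Instance-based ontology matching based on instance samples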
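
The two statistics can be computed directly from a collection of triples. The sketch below follows the definitions above: discriminability is the number of distinct object values of a property divided by the number of triples using it, and coverage is the number of instances having the property divided by the number of all instances. The function names and the sample triples are assumptions for illustration, and the input is assumed to contain no duplicate triples.

```python
def discriminability(triples, prop):
    """Distinct object values of prop divided by the number of triples using prop."""
    with_prop = [(i, p, o) for i, p, o in triples if p == prop]
    if not with_prop:
        return 0.0
    return len({o for _, _, o in with_prop}) / len(with_prop)


def coverage(triples, prop):
    """Instances that have prop divided by all instances in the data."""
    instances = {i for i, _, _ in triples}
    if not instances:
        return 0.0
    return len({i for i, p, _ in triples if p == prop}) / len(instances)


# Invented sample data: three city instances with partly shared values.
triples = [
    ("c1", "label", "Berlin"),
    ("c2", "label", "Paris"),
    ("c3", "label", "Paris"),
    ("c1", "country", "DE"),
    ("c2", "country", "FR"),
    ("c3", "country", "FR"),
    ("c1", "population", "3500000"),
]

for prop in ("label", "country", "population"):
    print(prop, discriminability(triples, prop), coverage(triples, prop))
```

A property with both high discriminability and high coverage is a good candidate for a comparison in a linkage rule; one with high discriminability but low coverage, like `population` in the sample, identifies instances well but is present on too few of them to rely on alone.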

17

Linkage Recommendation – User Interface (i)

18

Linkage Recommendation – User Interface (ii)

19

Linkage Recommendation – User Interface (iii)

20

Conclusion

● Linked Data is seeing growing adoption

● Development of a Data Linking system

– Focus on dataset selection & matcher configuration

– Combination of available information sources and user feedback

● Outlook

– Data publishers should be encouraged to participate in Linked Data

– High quality data is key

– Important aspects also include up-to-date views on data, trust, provenance information, ...

– Ongoing research and growing tool support
