a candidate dataset discovery and linkage recommendation system … · 2012-07-20 · 3...
1
A Candidate Dataset Discovery and
Linkage Recommendation System
for Linked Data
OC WG Meeting – 26.6.2012
Michael Luger
2
Outline
● Introduction
● Linked Data
● Candidate Dataset Selection
● Linkage Recommendation
● Conclusion
3
Introduction
● Maturing of Semantic Web technologies
● Growing amount of structured data becoming available as Linked Data
– Data publishers from a variety of domains should be encouraged to participate
● Establishment of links between Linked Data resources → Data Linking
– Draws on Record Linkage and Ontology Matching
– Need for solutions to facilitate this process
● Data Linking can benefit from the exploitation of various information sources
– Available Metadata
– Ontology Alignments
– Statistical Information
– User Input
4
Linked Data – Linking Open Data Initiative
● 300 datasets
● 30 billion triples
● 500 million RDF links
LOD Cloud – September 2011
5
Linked Data – Publishing Workflow
Interlinking
Publication
Conversion
Ontology Selection
Application
Dataset Selection
Matcher Configuration
Matching
6
Dataset Discovery – Overview
● Goal – Given a source dataset, discover candidate target datasets from the LOD cloud
[Diagram: Information Extraction is applied to the Source Dataset and, pre-computed, to the Set of LOD Datasets. Each yields Representative Information Sources (Keywords, Schema Elements, Topics, ...). A Comparison of Information Sources, which also takes User-supplied Information Sources from the User, produces the Results: top-ranked similar datasets from the set of LOD Datasets.]
7
Dataset Discovery – Metadata Initiatives (i)
● CKAN
– Data hub platform covering LOD datasets
– Allows organizations to publish metadata about their data in a structured format
Example – CKAN metadata published about the dataset bbcmusic.
8
Dataset Discovery – Metadata Initiatives (ii)
● Ontologies for Linked Data
– VoID – Vocabulary of Interlinked Datasets
    ● RDFS vocabulary for describing linked datasets
    ● Main concepts – void:Dataset, void:Linkset
– LOV – Linked Open Vocabularies
    ● RDFS vocabulary for describing ontologies used by linked datasets
– VOAF – Vocabulary of a Friend
    ● Extension of LOV
    ● Allows describing relationships between ontologies and topic classifications
9
Dataset Discovery – Metadata Initiatives (iii)
Example – LOV / VOAF metadata about the Music Ontology.
10
Dataset Discovery – Information Sources
● Schema Elements
– Classes and properties, extracted from datasets using SPARQL
– Ontology Alignments
● Keywords
– Class- and property-labels, extracted from schema information
– CKAN metadata (tags, groups, …)
– User-supplied keywords
● Vocabularies
– Derived from schema elements
– Retrieved from VoID metadata
● Topics
– Retrieved from LOV / VOAF metadata
– User-supplied topics
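Sketch – how schema elements and vocabularies might be collected. The two queries are plain SPARQL; the helper names (bindings_to_uris, vocabularies) and the sample response are illustrative assumptions, not part of the presented system. The code parses a response in the standard SPARQL JSON results format, so it runs without a live endpoint.

```python
import json
from typing import Set

# Standard SPARQL queries that list the classes and properties used in a dataset.
CLASSES_QUERY = "SELECT DISTINCT ?class WHERE { ?s a ?class }"
PROPERTIES_QUERY = "SELECT DISTINCT ?property WHERE { ?s ?property ?o }"


def bindings_to_uris(result_json: str, variable: str) -> Set[str]:
    """Collect the URIs bound to `variable` in a SPARQL JSON result document."""
    result = json.loads(result_json)
    return {
        row[variable]["value"]
        for row in result["results"]["bindings"]
        if variable in row and row[variable]["type"] == "uri"
    }


def vocabularies(schema_elements: Set[str]) -> Set[str]:
    """Derive vocabulary namespaces by cutting each URI after its last '#' or '/'."""
    namespaces = set()
    for uri in schema_elements:
        cut = max(uri.rfind("#"), uri.rfind("/"))
        if cut > 0:
            namespaces.add(uri[: cut + 1])
    return namespaces


# Hand-written sample response to CLASSES_QUERY (an assumption, for illustration only).
SAMPLE_RESPONSE = """
{"head": {"vars": ["class"]},
 "results": {"bindings": [
   {"class": {"type": "uri", "value": "http://purl.org/ontology/mo/MusicArtist"}},
   {"class": {"type": "uri", "value": "http://xmlns.com/foaf/0.1/Person"}}
 ]}}
"""

classes = bindings_to_uris(SAMPLE_RESPONSE, "class")
print(sorted(vocabularies(classes)))
```

Deriving the vocabulary from the namespace part of each URI is a simple heuristic; where VoID metadata is available it would be the more reliable source.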
11
Dataset Discovery – Computation
● Computation of overlap score O between the source dataset D_S and each potential target dataset D_T
● D_S … Source Dataset
● D_T … Target Dataset
● KW_DS, KW_DT, SE_DS, SE_DT, T_DS, T_DT … Sets of Keywords, Schema Elements, Topics
● kw_DS, kw_DT, se_DS, se_DT, t_DS, t_DT … Individual Keywords, Schema Elements, Topics
● M_kw, M_se, M_t … Matching Predicates
● w_kw, w_se, w_t … Weights
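Sketch – one possible shape of the overlap computation. The formula itself is not preserved in this transcript, so the following is an assumption: O is taken as a weighted sum over keywords, schema elements and topics, where each part is the fraction of source items for which the matching predicate finds a counterpart in the target set. All function names and the sample data are hypothetical.

```python
from typing import Callable, Dict, List, Set, Tuple

Matcher = Callable[[str, str], bool]
Profile = Dict[str, Set[str]]  # keys: "kw" (keywords), "se" (schema elements), "t" (topics)


def source_overlap(source: Set[str], target: Set[str], matches: Matcher) -> float:
    """Fraction of source items that match at least one target item."""
    if not source:
        return 0.0
    hits = sum(1 for s in source if any(matches(s, t) for t in target))
    return hits / len(source)


def overlap_score(ds: Profile, dt: Profile,
                  matchers: Dict[str, Matcher], weights: Dict[str, float]) -> float:
    """Weighted sum of the per-information-source overlaps (assumed form of O)."""
    return sum(
        weights[k] * source_overlap(ds.get(k, set()), dt.get(k, set()), matchers[k])
        for k in weights
    )


def rank_candidates(ds: Profile, candidates: Dict[str, Profile],
                    matchers: Dict[str, Matcher],
                    weights: Dict[str, float]) -> List[Tuple[str, float]]:
    """Score every candidate target dataset and sort by descending score."""
    scored = [(name, overlap_score(ds, dt, matchers, weights))
              for name, dt in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def exact(a: str, b: str) -> bool:
    return a == b


def case_insensitive(a: str, b: str) -> bool:
    return a.lower() == b.lower()


# Made-up profiles for illustration only.
source = {"kw": {"music", "artist"}, "se": {"mo:MusicArtist", "foaf:name"}, "t": {"Media"}}
candidates = {
    "dataset_a": {"kw": {"Music", "album"}, "se": {"mo:MusicArtist"}, "t": {"Media"}},
    "dataset_b": {"kw": {"city"}, "se": {"foaf:name"}, "t": {"Geography"}},
}
matchers = {"kw": case_insensitive, "se": exact, "t": exact}
weights = {"kw": 0.3, "se": 0.5, "t": 0.2}

ranking = rank_candidates(source, candidates, matchers, weights)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

User-adjusted weights enter through the weights dictionary; a matching predicate backed by ontology alignments could replace exact for schema elements.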
12
Dataset Discovery – Results
● CKAN Metadata
– Complete but can skew results (non-representative keywords)
● VoID Descriptions
– Not available for all the datasets
● Schema Elements
– Problems with retrieval for some datasets
– Ontology alignments improve results significantly
– Potential through application of ontology matching
● Topics
– LOV provides a very general classification
– Potential through application of Upper Level Ontologies
● Computation works well in combination with user input (weights, custom input)
13
Linkage Recommendation – Overview
● Matching Linked Data resources → Data Linking
– Given two datasets, links between their resources are established by means of assessing similarity through matching
– Reliance on well established value-matching techniques
– Adapted to RDF-type data characteristics (graph-like structure, heterogeneous vocabularies, large collections)
● Existing data linking tools
– Fully automated approaches exhibit limited applicability
– Semi-automated approaches demand manual configuration → Linkage Specification
● Goal – Interactive user environment for creating linkage specifications
– Recommendation of linkage specifications
– Exploitation of available information sources
– Ability to perform matching and evaluate results in an iterative way
14
Linkage Recommendation – Specification (Silk LSL)
<Prefixes>
  <Prefix id="rdfs" namespace="http://www.w3.org/2000/01/rdf-schema#" />
  <Prefix id="dbpedia" namespace="http://dbpedia.org/ontology/" />
  <Prefix id="gn" namespace="http://www.geonames.org/ontology#" />
</Prefixes>
<DataSources>
  <DataSource id="dbpedia">
    <Param name="endpointURI" value="http://example.org/sparql" />
    <Param name="graph" value="http://dbpedia.org" />
  </DataSource>
  <DataSource id="geonames">
    <Param name="endpointURI" value="http://example2.org/sparql" />
    <Param name="graph" value="http://sws.geonames.org/" />
  </DataSource>
</DataSources>
<Interlinks>
  <Interlink id="cities">
    <LinkType>owl:sameAs</LinkType>
    <SourceDataset dataSource="dbpedia" var="a">
      <RestrictTo> ?a rdf:type dbpedia:City </RestrictTo>
    </SourceDataset>
    <TargetDataset dataSource="geonames" var="b">
      <RestrictTo> ?b rdf:type gn:P </RestrictTo>
    </TargetDataset>
    <LinkageRule>
      <Aggregate type="average">
        <Compare metric="jaro">
          <Input path="?a/rdfs:label" />
          <Input path="?b/gn:name" />
        </Compare>
        <Compare metric="num">
          <Input path="?a/dbpedia:populationTotal" />
          <Input path="?b/gn:population" />
        </Compare>
      </Aggregate>
    </LinkageRule>
    <...>
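Sketch – what the linkage rule above expresses, written out in Python for a single pair of resources. This is not Silk's implementation: jaro follows the textbook Jaro similarity, while numeric_similarity is an assumed stand-in for the "num" metric (relative difference), and the two records are made-up values.

```python
def jaro(s1: str, s2: str) -> float:
    """Textbook Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    # Find characters that agree within the matching window.
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def numeric_similarity(a: float, b: float) -> float:
    """Assumed stand-in for the 'num' metric: 1 minus the relative difference."""
    largest = max(abs(a), abs(b))
    if largest == 0:
        return 1.0
    return max(0.0, 1.0 - abs(a - b) / largest)


def linkage_score(source: dict, target: dict) -> float:
    """Average of label and population similarity, as in <Aggregate type="average">."""
    label_sim = jaro(source["label"], target["name"])
    population_sim = numeric_similarity(source["populationTotal"], target["population"])
    return (label_sim + population_sim) / 2


# Made-up records standing in for one source and one target resource.
city_a = {"label": "Springfield", "populationTotal": 100000}
city_b = {"name": "Springfield", "population": 98000}
print(round(linkage_score(city_a, city_b), 2))
```

A link of the configured type would be emitted only for pairs whose score exceeds a threshold; that threshold sits in the part of the specification elided on the slide.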
15
Linkage Recommendation – Information Sources (i)
● User input
– Tabular view for browsing the contents of two chosen datasets
– Filtering of contents by specifying constraints
– Ability to specify and execute linkage specifications
– Display of additional information sources inline
[Architecture diagram. Components: User; Matcher & Analysis Service; Storage / Retrieval; Alignment Server; Source Dataset Endpoint and Target Dataset Endpoint, each accessed via SPARQL and each providing Classes and Properties.]
16
Linkage Recommendation – Information Sources (ii)
● Ontology Alignments
– Management of ontology correspondences as part of the user interface
● Statistics
– Property Discriminability & Coverage
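$$\mathrm{dis}(p) = \frac{\left|\{\, o \mid t = \langle i, p, o \rangle \,\}\right|}{\left|\{\, t \mid t = \langle i, p, o \rangle \,\}\right|}$$
$$\mathrm{cov}(p) = \frac{\left|\{\, i \mid t = \langle i, p, o \rangle \,\}\right|}{\left|\{\, i \mid t = \langle i, \ast, o \rangle \,\}\right|}$$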
– Sample Property Pair Analysis
    ● Instance-based ontology matching based on instance samples
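Sketch – computing the two property statistics over a list of triples. Discriminability is read here as the number of distinct object values of a property divided by the number of triples using it, and coverage as the number of instances carrying the property divided by the number of all instances. Function names and the sample triples are illustrative assumptions.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (instance, property, object)


def discriminability(triples: List[Triple], prop: str) -> float:
    """Distinct object values of `prop` divided by the number of triples using `prop`."""
    with_prop = [t for t in triples if t[1] == prop]
    if not with_prop:
        return 0.0
    return len({o for _, _, o in with_prop}) / len(with_prop)


def coverage(triples: List[Triple], prop: str) -> float:
    """Instances that have `prop` divided by all instances in the dataset."""
    all_instances = {i for i, _, _ in triples}
    if not all_instances:
        return 0.0
    instances_with_prop = {i for i, p, _ in triples if p == prop}
    return len(instances_with_prop) / len(all_instances)


# Made-up triples for illustration only.
triples = [
    ("a", "name", "X"),
    ("b", "name", "Y"),
    ("c", "name", "Y"),
    ("a", "population", "1"),
    ("c", "type", "City"),
]

for prop in ("name", "population"):
    print(prop, round(discriminability(triples, prop), 2), round(coverage(triples, prop), 2))
```

A property with both high discriminability and high coverage is a natural candidate for a comparison in a linkage rule, since its values tell instances apart and are present for most of them.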
17
Linkage Recommendation – User Interface (i)
18
Linkage Recommendation – User Interface (ii)
19
Linkage Recommendation – User Interface (iii)
20
Conclusion
● Linked Data is seeing growing adoption
● Development of a Data Linking system
– Focus on dataset selection & matcher configuration
– Combination of available information sources and user feedback
● Outlook
– Data publishers should be encouraged to participate in Linked Data
– High quality data is key
– Important aspects also include up-to-date views on data, trust, provenance information, ...
– Ongoing research and growing tool support