Transcript
Page 1: Sieve - Data Quality and Fusion - LWDM2012

Sieve: Linked Data Quality Assessment and Fusion

Pablo N. Mendes

Hannes Mühleisen

Christian Bizer

With contributions from:

Andreas Schultz, Andrea Matteini, Christian Becker, Robert Isele

Page 2: Sieve - Data Quality and Fusion - LWDM2012

“A sieve, or sifter, separates wanted elements from unwanted material using a woven screen such as a mesh or net.”

Source: http://en.wikipedia.org/wiki/Sieve

“sieve”

Page 3: Sieve - Data Quality and Fusion - LWDM2012

What is Linked Data?

• Raw data (RDF)

• Accessible on the Web

• Data can link to other data sources

• Benefits: ease of access and re-use; enables discovery

[Figure: data sources A, B, C, D and E, each holding several "Thing" entities, connected to one another by data links]

Page 4: Sieve - Data Quality and Fusion - LWDM2012

Linking Open Data Cloud

http://lod-cloud.net

Page 5: Sieve - Data Quality and Fusion - LWDM2012

Linked Data Challenges

• Data providers have different intentions, experience and knowledge

  • Data may be inaccurate, outdated, spam, etc.

• Data sources that overlap in content may use:

  • different RDF schemata

  • different identifiers for the same real-world entity

  • conflicting values for properties

• Integrating public datasets with internal databases poses the same problems

Page 6: Sieve - Data Quality and Fusion - LWDM2012

An Architecture for Linked Data Applications

Page 7: Sieve - Data Quality and Fusion - LWDM2012

LDIF – Linked Data Integration Framework

• Open source (Apache License, Version 2.0)

• Collaboration between Freie Universität Berlin and mes|semantics

1. Collect data: managed download and update

2. Translate data into a single target vocabulary

3. Resolve identifier aliases into local target URIs

4. Assess quality, filter bad results, resolve conflicts

5. Output

Page 8: Sieve - Data Quality and Fusion - LWDM2012

LDIF Pipeline

1. Collect data

2. Translate data

3. Resolve identities

4. Filter and fuse

5. Output

Supported data sources:

• RDF dumps (various formats)

• SPARQL endpoints

• Crawling Linked Data

Page 9: Sieve - Data Quality and Fusion - LWDM2012

LDIF Pipeline

1. Collect data

2. Translate data

3. Resolve identities

4. Filter and fuse

5. Output

R2R

Data sources use a wide range of different RDF vocabularies

• Mappings expressed in RDF (Turtle)

• Simple mappings using OWL / RDFS statements (x rdfs:subClassOf y)

• Complex mappings with SPARQL expressivity

• Transformation functions

[Figure: vocabulary terms dbpedia-owl:City, schema:Place, fb:location.citytown and local:City]
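The idea of translating several source vocabularies into a single target vocabulary can be sketched in a few lines of Python. This is not R2R syntax (R2R mappings are written in Turtle); the mapping table, the translate function and the example subjects are hypothetical, and the assumption that both source classes map to local:City is made only for illustration.

```python
RDF_TYPE = "rdf:type"

# Hypothetical mapping table: source class -> target class (an assumption).
CLASS_MAPPING = {
    "dbpedia-owl:City": "local:City",
    "fb:location.citytown": "local:City",
}

def translate(triples, mapping=CLASS_MAPPING):
    """Rewrite rdf:type objects into the target vocabulary; keep other triples unchanged."""
    out = []
    for s, p, o in triples:
        if p == RDF_TYPE and o in mapping:
            o = mapping[o]
        out.append((s, p, o))
    return out

# Hypothetical input triples from two sources.
source_triples = [
    ("dbpedia:Berlin", "rdf:type", "dbpedia-owl:City"),
    ("fb:en.berlin", "rdf:type", "fb:location.citytown"),
    ("dbpedia:Berlin", "rdfs:label", "Berlin"),
]
translated = translate(source_triples)
```

After translation both typed resources share one class, so later pipeline steps only need to know the target vocabulary.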

Page 10: Sieve - Data Quality and Fusion - LWDM2012

LDIF Pipeline

1. Collect data

2. Translate data

3. Resolve identities

4. Filter and fuse

5. Output

Silk

Data sources use different identifiers for the same entity

• Profiles expressed in XML

• Supports various comparators and transformations

[Figure: candidate entities Berlin, Germany; Berlin, CT; Berlin, MD; Berlin, NJ; Berlin, MA. Result: Berlin = Berlin, Germany]
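How comparators and transformations combine to resolve identities can be sketched as follows. This is a minimal Python illustration, not a Silk profile (those are written in XML); the records, the country field, the equal weighting and the 0.9 threshold are all assumptions made for the example.

```python
from difflib import SequenceMatcher

def normalize(label):
    """Transformation: lowercase and drop a trailing ', qualifier'."""
    return label.split(",")[0].strip().lower()

def label_similarity(a, b):
    """Comparator: string similarity in [0, 1] on normalized labels."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def match_score(rec_a, rec_b):
    """Average the label comparator with an exact-match comparator on country."""
    same_country = 1.0 if rec_a["country"] == rec_b["country"] else 0.0
    return (label_similarity(rec_a["label"], rec_b["label"]) + same_country) / 2

def link(record, candidates, threshold=0.9):
    """Return the URIs of all candidates whose score reaches the threshold."""
    return [c["uri"] for c in candidates if match_score(record, c) >= threshold]

# Hypothetical records; the URIs and country codes are illustrative only.
source = {"uri": "src:Berlin", "label": "Berlin", "country": "DE"}
candidates = [
    {"uri": "tgt:Berlin_Germany", "label": "Berlin, Germany", "country": "DE"},
    {"uri": "tgt:Berlin_CT", "label": "Berlin, CT", "country": "US"},
    {"uri": "tgt:Berlin_NJ", "label": "Berlin, NJ", "country": "US"},
]
links = link(source, candidates)
```

The label alone cannot separate the candidates, since all of them normalize to "berlin"; the second comparator is what singles out Berlin, Germany.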

Page 11: Sieve - Data Quality and Fusion - LWDM2012

LDIF Pipeline

1. Collect data

2. Translate data

3. Resolve identities

4. Filter and fuse

5. Output

Sieve

Sources provide different values for the same property

• Profiles expressed in XML

• Supports various scoring and fusion functions

[Figure: sources report Total Area values of 891.85 km² and 891.82 km²; after quality assessment, the fused Total Area is 891.85 km²]

Page 12: Sieve - Data Quality and Fusion - LWDM2012

LDIF Pipeline

1. Collect data

2. Translate data

3. Resolve identities

4. Filter and fuse

5. Output

• Output options:

  • N-Quads

  • N-Triples

  • SPARQL Update Stream

• Provenance tracking using Named Graphs
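The difference between the N-Triples and N-Quads outputs, and how a named graph carries provenance, can be sketched in Python. This is a simplified serializer written for illustration, not LDIF code: it treats any value starting with http:// or https:// as a URI and everything else as a plain literal, and the graph URI is a hypothetical example.

```python
def term(value):
    """Serialize a term: URIs in angle brackets, anything else as a plain literal."""
    if value.startswith(("http://", "https://")):
        return f"<{value}>"
    escaped = value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'

def to_ntriple(s, p, o):
    """One N-Triples line: subject, predicate, object."""
    return f"{term(s)} {term(p)} {term(o)} ."

def to_nquad(s, p, o, graph):
    """One N-Quads line: the fourth element names the graph the statement came from."""
    return f"{term(s)} {term(p)} {term(o)} <{graph}> ."

# Hypothetical statement and graph name.
subject = "http://dbpedia.org/resource/Berlin"
predicate = "http://dbpedia.org/ontology/areaTotal"
graph = "http://example.org/graphs/source-a"

triple_line = to_ntriple(subject, predicate, "891.85")
quad_line = to_nquad(subject, predicate, "891.85", graph)
```

Because every statement keeps its graph name, quality indicators attached to that graph (for example, when it was last updated) remain available to later steps.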

Page 13: Sieve - Data Quality and Fusion - LWDM2012

An Architecture for Linked Data Applications

Data Quality and Fusion Module

Page 14: Sieve - Data Quality and Fusion - LWDM2012

Data Fusion

“fusing multiple records representing the same real-world object into a single, consistent, and clean representation” (Bleiholder & Naumann, 2008)

Page 15: Sieve - Data Quality and Fusion - LWDM2012

Conflict resolution strategies

• Independent of quality assessment metrics

  • Pick most frequent (democratic voting)

  • Average, max, min, concatenation

  • Within interval

• Based on task-specific quality assessment

  • Keep highest scored

  • Keep all that pass a threshold

  • Trust some sources over others

  • Weighted voting
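Several of these strategies can be sketched as small Python functions. The function names and the quality scores below are hypothetical and chosen only for illustration; they are not Sieve's FusionFunction classes.

```python
from collections import Counter
from statistics import mean

def most_frequent(values):
    """Democratic voting: keep the value reported most often."""
    return Counter(values).most_common(1)[0][0]

def average(values):
    """Quality-independent numeric resolution: the arithmetic mean."""
    return mean(values)

def keep_highest_scored(scored):
    """scored is a list of (value, quality score) pairs; keep the best-scored value."""
    return max(scored, key=lambda pair: pair[1])[0]

def keep_above_threshold(scored, threshold):
    """Keep every value whose quality score reaches the threshold."""
    return [value for value, score in scored if score >= threshold]

def weighted_vote(scored):
    """Sum the quality scores per value and keep the value with the largest total."""
    totals = {}
    for value, score in scored:
        totals[value] = totals.get(value, 0.0) + score
    return max(totals, key=totals.get)

# Conflicting area values with made-up quality scores per source.
values = [891.85, 891.82, 891.85]
scored = [(891.85, 0.9), (891.82, 0.4), (891.82, 0.3)]

voted = most_frequent(values)
best = keep_highest_scored(scored)
passing = keep_above_threshold(scored, 0.35)
weighted = weighted_vote(scored)
```

Note that plain voting on the scored list would pick 891.82 (two sources against one), while weighted voting picks 891.85 because its single source has a higher score than the other two combined.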

Page 16: Sieve - Data Quality and Fusion - LWDM2012

Data Fusion

• Input:

  • (Potentially) conflicting data

  • Quality metadata describing input

• Execution:

  • Use existing or custom FusionFunctions

• Output:

  • Clean data, according to user’s definition of clean

Page 17: Sieve - Data Quality and Fusion - LWDM2012

Configuration: Data Fusion

Page 18: Sieve - Data Quality and Fusion - LWDM2012

Sieve: Quality Assessment

• Quality as “fitness for use”:

  • Subjective: good for me might not be enough for you

  • Task dependent: temperature for planning a weekend vs. for a biology experiment

  • Multidimensional: even correct data may be outdated or not available

• Requires task-specific quality assessment.

Page 19: Sieve - Data Quality and Fusion - LWDM2012

Data Quality - Conceptual Framework

Dimensions:

Accuracy

Consistency

Objectivity

Timeliness

Validity

Believability

Completeness

Understandability

Relevancy

Reputation

Verifiability

Amount of Data

Interpretability

Representational Conciseness

Representational Consistency

Availability

Response Time

Security

Page 20: Sieve - Data Quality and Fusion - LWDM2012

Configuration: Quality Assessment

• Quality Assessment Metrics are composed of:

  • ScoringFunction (generically applicable to given data types)

  • Quality Indicator as input (adaptable to use case)

  • Output: a score in [0, 1]

Describes input within a quality dimension, according to a user’s definition of quality
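A metric of this shape can be sketched as a scoring function that takes a quality indicator and returns a score in [0, 1]. The example below uses a last-update date as the indicator and a linear decay over a configurable time span; the function name, the decay formula and the dates are assumptions for illustration, not Sieve's implementation.

```python
from datetime import date

def recency_score(last_updated, today, time_span_days):
    """Score 1.0 for data updated today, decaying linearly to 0.0 at time_span_days."""
    age_days = (today - last_updated).days
    if age_days <= 0:
        return 1.0
    return max(0.0, 1.0 - age_days / time_span_days)

# Hypothetical indicator values for three sources.
today = date(2012, 3, 30)
fresh = recency_score(date(2012, 3, 20), today, time_span_days=100)
stale = recency_score(date(2010, 1, 1), today, time_span_days=100)
current = recency_score(today, today, time_span_days=100)
```

The same scoring function can serve different tasks by changing only its parameters: a short time span suits fast-changing data, a long one suits values that rarely change.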

Page 21: Sieve - Data Quality and Fusion - LWDM2012

Configuration: Quality Assessment

Page 22: Sieve - Data Quality and Fusion - LWDM2012

More about Sieve

• Software: open source, Apache License, Version 2.0

• Scoring Functions and Fusion Functions can be extended

  • Scala/Java interface, methods score/fuse and fromXML

• Quality scores can be stored and shared with other applications

• Website: http://sieve.wbsg.de

  • Documentation, examples, downloads, support

Page 23: Sieve - Data Quality and Fusion - LWDM2012

Use Case

Conflicting values

Quality indicators

User config

Voilà!

(Multidimensional)

(Task-dependent)

Multiple data sources

(Complementary)

(Conflict Resolution Strategies)

(Heterogeneous)

Page 24: Sieve - Data Quality and Fusion - LWDM2012

Evaluating Quality of Data Integration

• Completeness

  • How many cities did we find?

  • How many of the properties did we fill with values?

• Conciseness

  • How much redundancy is there in the object identifiers?

  • How much redundancy is there in the property values?

• Consistency

  • How many conflicting values are there?
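These three questions can be turned into simple ratios. The Python sketch below uses deliberately simplified definitions over (entity, property, value) triples; the formulas, function names and sample data are assumptions for illustration and may differ from the measures used in the actual evaluation.

```python
def completeness(triples, entities, properties):
    """Share of expected (entity, property) slots that have at least one value."""
    filled = {(s, p) for s, p, _ in triples if s in entities and p in properties}
    return len(filled) / (len(entities) * len(properties))

def conciseness(triples):
    """Share of distinct triples among all triples (1.0 means no redundancy)."""
    return len(set(triples)) / len(triples)

def consistency(triples):
    """Share of (entity, property) pairs that carry exactly one distinct value."""
    values = {}
    for s, p, o in triples:
        values.setdefault((s, p), set()).add(o)
    conflict_free = sum(1 for objs in values.values() if len(objs) == 1)
    return conflict_free / len(values)

# Hypothetical integrated data with placeholder values.
entities = {"Berlin", "Hamburg"}
properties = {"area", "population"}
triples = [
    ("Berlin", "area", "a"),
    ("Berlin", "area", "b"),   # conflicts with the value above
    ("Berlin", "area", "a"),   # redundant copy
    ("Hamburg", "area", "c"),
]

scores = (
    completeness(triples, entities, properties),
    conciseness(triples),
    consistency(triples),
)
```

Computing the same ratios before and after fusion gives a direct way to check whether the integrated data improved on the sources.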

Page 25: Sieve - Data Quality and Fusion - LWDM2012

Results

Generated data that is more complete, concise and consistent than in the original sources

Page 26: Sieve - Data Quality and Fusion - LWDM2012

Linked Data Application Architecture

My view on this data space can also be shared and reused.

We can “pay as we go”

Page 27: Sieve - Data Quality and Fusion - LWDM2012

• Twitter: @pablomendes

• E-mail: [email protected]

• Website: http://sieve.wbsg.de

• Google Group: http://bit.ly/ldifgroup

THANK YOU!

Supported in part by:

• Vulcan Inc. as part of its Project Halo

• EU FP7 projects:

  • LOD2 - Creating Knowledge out of Interlinked Data

  • PlanetData - A European Network of Excellence on Large-Scale Data Management

