Quality and Repair Pablo N. Mendes (Freie Universität Berlin) Giorgos Flouris (FORTH) 1st year review Luxembourg, December 2011 11/02/11



Quality and Repair

Pablo N. Mendes (Freie Universität Berlin)
Giorgos Flouris (FORTH)

1st year review
Luxembourg, December 2011

11/02/11


Work Plan View WP2

[Gantt chart; time axis in project months: 0, 6, 12, 18, 24, 30, 36, 42, 48; partner labels: FUB, KIT]

Tasks:
Task 2.1 Data quality assessment and repair
Task 2.2 Temporal, spatial and social aspects of data
Task 2.3 Recommendations for enhancing best practices for data publishing

Deliverables:
D2.1 Conceptual model and best practices for high-quality data publishing
D2.2 Methods for quality repair
D2.3 Modelling and processing contextual aspects of data
D2.4 Update of D2.1
D2.5 Proof-of-concept evaluation for modelling space and time
D2.6 Methods for assessing the quality of sensor data
D2.7 Recommendations for contextual data publishing


Upcoming deliverables

Quality Assessment
D2.1 - Conceptual model and best practices for high-quality metadata publishing

Quality Enhancement
D2.2 - Methods for quality repair


Outline

Overview of Quality

Data Quality Framework

Quality Assessment

Quality Enhancement (Repair)


Quality

“Fitness for use.”

Joseph Juran. The Quality Control Handbook. McGraw-Hill, New York, 3rd edition, 1974.


Data Quality

Multifaceted
• accurate = high quality?
• availability?
• timeliness?

Subjective
• “weekly updates are ok.”
• for me, for vacation planning

Task-dependent
• task: weather forecast
• data is not good if it is not available for online query
• vacation planning or aviation?


Data Quality Dimensions

Category: Dimensions
Intrinsic Dimensions: Accuracy, Consistency, Objectivity, Timeliness
Contextual Dimensions: Validity, Believability, Completeness, Understandability, Relevancy, Reputation, Verifiability, Amount of Data
Representational Dimensions: Interpretability, Rep. Conciseness, Rep. Consistency
Accessibility Dimensions: Availability, Response Time, Security

Presentation order


Data Quality Framework

[Diagram with two parts: Quality Assessment, Quality Enhancement]


ACCESSIBILITY

Dereferenceability

• Indicator: Dereferenceable URIs
  • “Resources identified by URIs that respond with RDF to HTTP requests?”
• Metrics:
  • for datasets (d) and for resources (r)
  • deref(d) = count(r | deref(r))
  • ratioderef(d) = deref(d) / no-deref(d)
• Recommendation:
  • Your URIs should be dereferenceable.
  • Prefer reusing URIs that are dereferenceable.
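The dereferenceability metrics can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: the function names, the list of RDF media types and the timeout are assumptions. The share computed here is normalised by the total number of resources (so it stays in [0, 1]), which is a deliberate variation on the ratio given above.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable

# Assumed set of media types counted as "RDF" for this sketch.
RDF_TYPES = {"application/rdf+xml", "text/turtle", "application/n-triples"}


def http_deref(uri: str, timeout: float = 10.0) -> bool:
    """deref(r): True if a GET on the URI answers with an RDF media type."""
    try:
        req = urllib.request.Request(
            uri, headers={"Accept": ", ".join(sorted(RDF_TYPES))}
        )
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.headers.get_content_type() in RDF_TYPES
    except (urllib.error.URLError, ValueError, OSError):
        # Unreachable host, HTTP error status or malformed URI.
        return False


def deref_count(resources: Iterable[str], deref: Callable[[str], bool]) -> int:
    """deref(d): number of resources r in d for which deref(r) holds."""
    return sum(1 for r in resources if deref(r))


def deref_share(resources: Iterable[str], deref: Callable[[str], bool]) -> float:
    """Share of dereferenceable resources in d (0.0 for an empty dataset)."""
    rs = list(resources)
    return deref_count(rs, deref) / len(rs) if rs else 0.0
```

Passing the predicate as a parameter keeps the counting logic independent of the network: `deref_share(uris, http_deref)` probes live URIs, while a lookup table can stand in for recorded results.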


ACCESSIBILITY

Access methods

• Indicator: Access methods
  • “Data is accessible in varied and recommended ways.”
• Metrics:
  • sample(d): {0,1} “example resource available for d”
  • endpoint(d): {0,1} “SPARQL endpoint available for d”
  • dump(d): {0,1} “RDF dumps available for d”
• Recommendation:
  • Provide as many access methods as possible.
  • A sample resource provides a quick view into the type of data you serve.
  • SPARQL endpoints let clients obtain part of the data.
  • Dumps are cheaper than alternatives when bulk access is needed.


ACCESSIBILITY

Availability

• Indicator: Availability
  • “Average availability in time interval”
• Metrics:
  • avail(d,hour) = ( ∑{h=1..24} deref(sample(d)) ) / 24
  • Alternatively, httphead() instead of deref()
• Recommendation:
  • the higher the better
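As a worked illustration of the availability metric, the sketch below averages a series of recorded probe outcomes. The function name and the representation of probes as booleans are assumptions; with 24 hourly probes of the sample resource it corresponds to avail(d,hour).

```python
from typing import Sequence


def availability(probes: Sequence[bool]) -> float:
    """Fraction of successful probes in the observed interval."""
    if not probes:
        raise ValueError("at least one probe is required")
    return sum(1 for ok in probes if ok) / len(probes)


# Hypothetical day: the sample resource answered in 18 of 24 hourly probes.
day = [True] * 18 + [False] * 6
print(availability(day))
```

Each probe can come from a full dereference or from a cheaper HTTP HEAD request; the aggregation is the same.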


ACCESSIBILITY

Accessibility Dimensions

Dimension: Example
Dereferenceability: http GET / HEAD
Availability: hourly derefs
Access methods: URI, Bulk, SPARQL
Response time: timed deref
Robustness: requests per minute
Reachability: LOD cloud inlinks


REPRESENTATIONAL

Representational: Interpretability

• Indicator: Human/Machine interpretability
  • “URI is dereferenceable to human and machine readable formats”
• Metrics:
  • format(deref(r,f)) in {Fh ∪ Fm} : {0,1}
  • Fh = HTML, XHTML+RDFa, ...: {0,1}
  • Fm = NT, RDF/XML, ...: {0,1}
• Recommendation:
  • Resources should dereference at least to human-readable HTML and one widely adopted RDF serialization.
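The interpretability check can be sketched as a set intersection over the media types a resource is served in. The mapping of Fh and Fm to concrete media types below is an assumption (Turtle is added because the deck's own example uses it), and the function name is hypothetical.

```python
from typing import Dict, Iterable

# Assumed media types for the human-readable (Fh) and machine-readable (Fm) sets.
HUMAN_FORMATS = {"text/html", "application/xhtml+xml"}
MACHINE_FORMATS = {"application/n-triples", "application/rdf+xml", "text/turtle"}


def interpretability(served: Iterable[str]) -> Dict[str, int]:
    """Return {0,1} flags for human, machine and combined interpretability."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    types = {m.split(";")[0].strip().lower() for m in served}
    human = int(bool(types & HUMAN_FORMATS))
    machine = int(bool(types & MACHINE_FORMATS))
    return {"human": human, "machine": machine, "both": human & machine}


print(interpretability({"text/html; charset=utf-8", "text/turtle"}))
```

The `both` flag mirrors the recommendation: a resource scores 1 only if it dereferences to at least one format from each set.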


REPRESENTATIONAL

Vocabulary understandability

• Indicator: Schema understandability
  • “Schema terms are familiar to existing agents.”
• Metrics:
  • vocab-underst(d) = triples(v,d) * triples(v,D) / triples(D)
  • Alternative: PageRank (probability that a random surfer has found v)
• Recommendation:
  • Reuse widely deployed vocabularies.
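A small worked example of the vocab-underst formula follows. It assumes that a triple "uses" vocabulary v when its predicate URI starts with the vocabulary namespace, which is one possible reading; the helper names and the sample triples are made up for illustration.

```python
from typing import Iterable, Tuple

Triple = Tuple[str, str, str]


def vocab_triples(triples: Iterable[Triple], namespace: str) -> int:
    """triples(v, .): triples whose predicate lies in the namespace of v."""
    return sum(1 for (_s, p, _o) in triples if p.startswith(namespace))


def vocab_underst(triples_v_d: int, triples_v_all: int, triples_all: int) -> float:
    """vocab-underst(d) = triples(v,d) * triples(v,D) / triples(D)."""
    if triples_all <= 0:
        raise ValueError("triples(D) must be positive")
    return triples_v_d * triples_v_all / triples_all


FOAF = "http://xmlns.com/foaf/0.1/"
d = [("ex:a", FOAF + "name", "A"), ("ex:a", "ex:p", "1")]
others = [
    ("ex:b", FOAF + "name", "B"),
    ("ex:b", FOAF + "knows", "ex:a"),
    ("ex:b", "ex:q", "2"),
    ("ex:c", "ex:q", "3"),
]
D = d + others  # all known datasets, including d

score = vocab_underst(vocab_triples(d, FOAF), vocab_triples(D, FOAF), len(D))
print(score)  # 1 * 3 / 6
```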


REPRESENTATIONAL

Representational Dimensions

Human/Machine Interpretability: HTML, RDF
Vocabulary Understandability: Vocabulary usage stats
Representational Conciseness: Triples / Byte


CONTEXTUAL DIMENSIONS

Contextual Dimensions

Completeness: Full set of objects and attributes with respect to a task
Conciseness: Amount of duplicate entries, redundant attributes
Coherence: How well instance data conforms to schema


CONTEXTUAL DIMENSIONS

Contextual Dimensions

Verifiability: How easy is it to check the data? Can use provenance information.
Validity: Encodes context- or application-specific requirements


INTRINSIC DIMENSIONS

Intrinsic Dimensions

Accuracy: usually estimated; may be available for sensors
Timeliness: can use last update
Consistency: two or more values do not conflict with each other
Objectivity: can be traced via provenance


Example: AEMET

Metadata entry: http://thedatahub.org/dataset/aemet

Example item: http://aemet.linkeddata.es/page/resource/WeatherStation/id08001?output=ttl

Access methods: Example URI, SPARQL, Bulk

Availability:
Example URI: available
SPARQL Endpoint: 100%

Format Interpretability:
TTL=OK
RDF/XML=OK

Verifiability: Published by third party

http://www4.wiwiss.fu-berlin.de/lodcloud/ckan/validator/validate.php?package=aemet


Data Quality Framework

[Diagram with two parts: Quality Assessment, Quality Enhancement]


Validity as a Quality Indicator

Validity is an important quality indicator
• Encodes context- or application-specific requirements
• Applications may be useless over invalid data
• Binary concept (valid/invalid)

Two steps to guarantee validity (repair process):
1. Identifying invalid ontologies (diagnosis)
   • Detecting invalidities in an automated manner
   • Subtask of Quality Assessment
2. Removing invalidities (repair)
   • Repairing invalidities in an automated manner
   • Subtask of Quality Enhancement


Diagnosis

Expressing validity using validity rules over an adequate relational schema

Examples:
• Properties must have a unique domain
  ∀p Prop(p) → ∃a Dom(p,a)
  ∀p,a,b Dom(p,a) ∧ Dom(p,b) → (a=b)
• Correct classification in property instances
  ∀x,y,p,a P_Inst(x,y,p) ∧ Dom(p,a) → C_Inst(x,a)
  ∀x,y,p,a P_Inst(x,y,p) ∧ Rng(p,a) → C_Inst(y,a)

Diagnosis reduced to relational queries
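To illustrate how diagnosis reduces to queries, the sketch below stores the ontology as a set of tuples (one tuple per relational fact) and checks the four example rules with plain set lookups. The tuple encoding, the rule names and the `diagnose` function are assumptions made for this example; the sample facts are those of the example ontology O0 used in this deck.

```python
from typing import List, Set, Tuple

Fact = Tuple[str, ...]


def diagnose(o: Set[Fact]) -> List[Tuple[str, Fact]]:
    """Return (rule name, offending fact) pairs for the four example rules."""
    doms = sorted((f[1], f[2]) for f in o if f[0] == "Dom")
    rngs = sorted((f[1], f[2]) for f in o if f[0] == "Rng")
    found: List[Tuple[str, Fact]] = []
    for f in sorted(o):
        # Prop(p) -> exists a: Dom(p,a)
        if f[0] == "Prop" and not any(p == f[1] for p, _ in doms):
            found.append(("domain-exists", f))
        if f[0] == "P_Inst":
            _, x, y, p = f
            # P_Inst(x,y,p) and Dom(p,a) -> C_Inst(x,a)
            if any(q == p and ("C_Inst", x, a) not in o for q, a in doms):
                found.append(("domain-classification", f))
            # P_Inst(x,y,p) and Rng(p,a) -> C_Inst(y,a)
            if any(q == p and ("C_Inst", y, a) not in o for q, a in rngs):
                found.append(("range-classification", f))
    # Dom(p,a) and Dom(p,b) -> a = b (neighbours in the sorted list)
    for (p, a), (q, b) in zip(doms, doms[1:]):
        if p == q and a != b:
            found.append(("domain-unique", ("Dom", p, a)))
    return found


O0 = {
    ("Class", "Sensor"), ("Class", "SpatialThing"), ("Class", "Observation"),
    ("Prop", "geo:location"),
    ("Dom", "geo:location", "Sensor"),
    ("Rng", "geo:location", "SpatialThing"),
    ("Inst", "Item1"), ("Inst", "ST1"),
    ("P_Inst", "Item1", "ST1", "geo:location"),
    ("C_Inst", "Item1", "Observation"), ("C_Inst", "ST1", "SpatialThing"),
}
print(diagnose(O0))
```

In a relational store each check would be a join query; the in-memory version only shows the shape of those queries.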


Example

Ontology O0
Class(Sensor), Class(SpatialThing), Class(Observation)
Prop(geo:location)
Dom(geo:location,Sensor)
Rng(geo:location,SpatialThing)
Inst(Item1), Inst(ST1)
P_Inst(Item1,ST1,geo:location)
C_Inst(Item1,Observation), C_Inst(ST1,SpatialThing)

[Diagram. Schema: Sensor, SpatialThing, Observation, with geo:location from Sensor to SpatialThing. Data: Item1, ST1, with geo:location from Item1 to ST1]

Violated rule: Correct classification in property instances
∀x,y,p,a P_Inst(x,y,p) ∧ Dom(p,a) → C_Inst(x,a)

Violation:
• Item1 geo:location ST1: P_Inst(Item1,ST1,geo:location) ∈ O0
• Sensor is the domain of geo:location: Dom(geo:location,Sensor) ∈ O0
• Item1 is not a Sensor: C_Inst(Item1,Sensor) ∉ O0

Possible resolutions:
• Remove P_Inst(Item1,ST1,geo:location)
• Add C_Inst(Item1,Sensor)
• Remove Dom(geo:location,Sensor)


Preferences for Repair

Which repairing option is best?
• Ontology engineer determines that via preferences
• Specified by ontology engineer beforehand
• High-level “specifications” for the ideal repair
• Serve as “instructions” to determine the preferred solution


Preferences (On Ontologies)

[Figure: O0 with the candidate repaired ontologies O1, O2, O3; each resulting ontology carries a score (scores shown: 3, 4, 6)]


Preferences (On Deltas)

[Figure: O0 leads to O1, O2, O3 through the deltas -P_Inst(Item1,ST1,geo:location), +C_Inst(Item1,Sensor), -Dom(geo:location,Sensor); each delta carries a score (scores shown: 2, 4, 5)]


Preferences

Preferences on ontologies are result-oriented
• Consider the quality of the repair result
• Ignore the impact of repair
• Popular options: prefer newest information, prefer trustable information

Preferences on deltas are more impact-oriented
• Consider the impact of repair
• Ignore the quality of the repair result
• Popular options: minimize schema changes, minimize addition/deletion of information, minimize delta size

Two sides of the same coin (equivalent options)

Quality metrics can be used for stating preferences
• Metadata on the data may be needed
• Can be qualitative or quantitative
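Preferences on deltas can be sketched as scoring functions over lists of changes, with the preferred deltas being those with the minimal score. The encoding of a change as a sign plus a fact, the set of schema predicates and the function names are assumptions for illustration; the three deltas are the ones from the running example.

```python
from typing import Callable, List, Tuple

Change = Tuple[str, Tuple[str, ...]]  # ("+", fact) adds, ("-", fact) removes
Delta = List[Change]

# Assumed split between schema-level and data-level predicates.
SCHEMA_PREDICATES = {"Class", "Prop", "Dom", "Rng"}


def schema_changes(delta: Delta) -> int:
    """Number of changes that touch the schema."""
    return sum(1 for _op, fact in delta if fact[0] in SCHEMA_PREDICATES)


def deletions(delta: Delta) -> int:
    """Number of removed facts."""
    return sum(1 for op, _fact in delta if op == "-")


def preferred(deltas: List[Delta], score: Callable[[Delta], object]) -> List[Delta]:
    """Keep the deltas with the minimal score under the given preference."""
    best = min(score(d) for d in deltas)
    return [d for d in deltas if score(d) == best]


d1 = [("-", ("P_Inst", "Item1", "ST1", "geo:location"))]
d2 = [("+", ("C_Inst", "Item1", "Sensor"))]
d3 = [("-", ("Dom", "geo:location", "Sensor"))]

# "Minimize schema changes" keeps the two data-level deltas.
print(preferred([d1, d2, d3], schema_changes))
# Breaking ties by "minimize deletions" leaves only the addition.
print(preferred([d1, d2, d3], lambda d: (schema_changes(d), deletions(d))))
```

A tuple-valued score gives a lexicographic combination of preferences, which is one simple way to express a qualitative ordering.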


Generalizing the Approach

For one violated constraint
1. Diagnose invalidity
2. Determine minimal ways to resolve it
3. Determine and return preferred resolution

For many violated constraints
• Problem becomes more complicated
• More than one resolution step is required

Issues:
1. Resolution order
2. When and how to filter non-preferred solutions?
3. Constraint (and resolution) interdependencies


Constraint Interdependencies

A given resolution may:
• Cause other violations (bad)
• Resolve other violations (good)

Cannot pre-determine the best resolution
• Difficult to predict the ramifications of each one
• Exhaustive search required
• Recursive, tree-based search (resolution tree)

Two ways to create the resolution tree
• Globally-preferred (GP), locally-preferred (LP)
• When and how to filter non-preferred solutions?


Resolution Tree Creation (GP)

Find all minimal resolutions for all the violated constraints, then find the preferred ones

Globally-preferred (GP)
• Find all minimal resolutions for one violation
• Explore them all
• Repeat recursively until consistent
• Return the preferred leaves

[Figure: resolution tree; the preferred leaves are the preferred repairs (returned)]
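The GP strategy can be sketched as an exhaustive recursive search. This is a simplified illustration under stated assumptions, not the authors' algorithm: it handles only the domain-classification rule, the cost weights (3 for a schema change, 2 for a data change) are invented, and the ontology with a second item, Item2, is a hypothetical extension of the running example.

```python
from typing import FrozenSet, List, Tuple

Fact = Tuple[str, ...]
Change = Tuple[str, Fact]       # ("+", fact) adds, ("-", fact) removes
Repair = Tuple[Change, ...]


def violations(o: FrozenSet[Fact]) -> List[Tuple[Fact, Fact, Fact]]:
    """Failing instances of P_Inst(x,y,p) and Dom(p,a) -> C_Inst(x,a)."""
    found = []
    for f in sorted(o):
        if f[0] != "P_Inst":
            continue
        _, x, _y, p = f
        for g in sorted(o):
            if g[0] == "Dom" and g[1] == p and ("C_Inst", x, g[2]) not in o:
                found.append((f, g, ("C_Inst", x, g[2])))
    return found


def resolutions(v: Tuple[Fact, Fact, Fact]) -> List[Change]:
    """Minimal resolutions: drop the instance, classify x, or drop the domain."""
    p_inst, dom, c_inst = v
    return [("-", p_inst), ("+", c_inst), ("-", dom)]


def apply_change(o: FrozenSet[Fact], change: Change) -> FrozenSet[Fact]:
    op, fact = change
    return o | {fact} if op == "+" else o - {fact}


def cost(repair: Repair) -> int:
    """Hypothetical preference: schema changes cost 3, data changes cost 2."""
    return sum(3 if fact[0] == "Dom" else 2 for _op, fact in repair)


def gp_leaves(o: FrozenSet[Fact], repair: Repair = ()) -> List[Repair]:
    """Explore every resolution of the first violation until consistent."""
    pending = violations(o)
    if not pending:
        return [repair]
    leaves: List[Repair] = []
    for change in resolutions(pending[0]):
        leaves.extend(gp_leaves(apply_change(o, change), repair + (change,)))
    return leaves


def gp_repairs(o: FrozenSet[Fact]) -> List[Repair]:
    """Filter only at the end: return the cheapest leaves of the full tree."""
    leaves = gp_leaves(o)
    best = min(cost(r) for r in leaves)
    return [r for r in leaves if cost(r) == best]


O = frozenset({
    ("Dom", "geo:location", "Sensor"),
    ("P_Inst", "Item1", "ST1", "geo:location"),
    ("P_Inst", "Item2", "ST1", "geo:location"),
})
print(gp_repairs(O))
```

In this sample, removing the domain resolves both violations at once, so the single schema change turns out cheaper overall than two separate data changes. Only the exhaustive tree makes that visible.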


Resolution Tree Creation (LP)

Find the minimal and preferred resolutions for one violated constraint, then repeat for the next

Locally-preferred (LP)
• Find all minimal resolutions for one violation
• Explore the preferred one(s)
• Repeat recursively until consistent
• Return all remaining leaves

[Figure: resolution tree; the remaining leaf is the preferred repair (returned)]
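The LP strategy differs from exhaustive search only in where the filtering happens: non-preferred resolutions are pruned at every step. The sketch below is a simplified illustration under the same kind of assumptions as before: one rule only (domain classification), invented cost weights (3 for a schema change, 2 for a data change), and a hypothetical ontology with two offending items.

```python
from typing import FrozenSet, List, Tuple

Fact = Tuple[str, ...]
Change = Tuple[str, Fact]       # ("+", fact) adds, ("-", fact) removes
Repair = Tuple[Change, ...]


def violations(o: FrozenSet[Fact]) -> List[Tuple[Fact, Fact, Fact]]:
    """Failing instances of P_Inst(x,y,p) and Dom(p,a) -> C_Inst(x,a)."""
    return [
        (f, g, ("C_Inst", f[1], g[2]))
        for f in sorted(o) if f[0] == "P_Inst"
        for g in sorted(o)
        if g[0] == "Dom" and g[1] == f[3] and ("C_Inst", f[1], g[2]) not in o
    ]


def resolutions(v: Tuple[Fact, Fact, Fact]) -> List[Change]:
    p_inst, dom, c_inst = v
    return [("-", p_inst), ("+", c_inst), ("-", dom)]


def apply_change(o: FrozenSet[Fact], change: Change) -> FrozenSet[Fact]:
    op, fact = change
    return o | {fact} if op == "+" else o - {fact}


def cost(repair: Repair) -> int:
    """Hypothetical preference: schema changes cost 3, data changes cost 2."""
    return sum(3 if fact[0] == "Dom" else 2 for _op, fact in repair)


def lp_leaves(o: FrozenSet[Fact], repair: Repair = ()) -> List[Repair]:
    """Greedy search: follow only the locally cheapest resolutions."""
    pending = violations(o)
    if not pending:
        return [repair]
    options = resolutions(pending[0])
    best = min(cost((c,)) for c in options)
    leaves: List[Repair] = []
    for change in options:
        if cost((change,)) == best:  # prune non-preferred branches here
            leaves.extend(lp_leaves(apply_change(o, change), repair + (change,)))
    return leaves


O = frozenset({
    ("Dom", "geo:location", "Sensor"),
    ("P_Inst", "Item1", "ST1", "geo:location"),
    ("P_Inst", "Item2", "ST1", "geo:location"),
})
for leaf in lp_leaves(O):
    print(cost(leaf), leaf)
```

In this sample every returned repair consists of two data changes. The single schema change that would resolve both violations at once is pruned in the first step because it looks more expensive locally, so the greedy tree is smaller but can miss the overall cheapest repair.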


Comparison (GP versus LP)

Characteristics of GP
• Exhaustive
• Less efficient: large resolution trees
• Always returns most preferred repairs
• Insensitive to constraint syntax
• Does not depend on resolution order

Characteristics of LP
• Greedy
• More efficient: small resolution trees
• Does not always return most preferred repairs
• Sensitive to constraint syntax
• Depends on resolution order


Algorithm and Complexity

Detailed complexity analysis for GP/LP and various different types of constraints and preferences

Inherently difficult problem
• Exponential complexity (in general)
• Main exception: LP is polynomial (in special cases)

Theoretical complexity is misleading as to the actual performance of the algorithms


Performance in Practice

Performance in practice
• Linear with respect to ontology size
• Linear with respect to tree size
  • Types of violated constraints (tree width)
  • Number of violations (tree height), which causes the exponential blowup
  • Constraint interdependencies (tree height)
  • Preference (for LP): affects pruning (tree width)

Further performance improvement
• Use optimizations
• Use LP with restrictive preference


Evaluation Parameters

Evaluation
1. Effect of ontology size (for GP/LP)
2. Effect of tree size (for GP/LP)
3. Effect of violations (for GP/LP)
4. Effect of preference (relevant for LP only)
5. Quality of LP repairs

Preliminary results support our claims:
• Linear with respect to ontology size
• Linear with respect to tree size


Publications

Yannis Roussakis, Giorgos Flouris, Vassilis Christophides. Declarative Repairing Policies for Curated KBs. In Proceedings of the 10th Hellenic Data Management Symposium (HDMS-11), 2011

Yannis Roussakis, Giorgos Flouris, Vassilis Christophides. Preference-Based Repairing of RDF/S DBs. Tentative title, to be submitted to PVLDB, January 2012


Outlook

• Continue refining model based on experience with data sets catalog

• Derive “best practices checks” from metrics

• Results of quality assessment to be added to next release of the catalog

• Collaboration with EU-funded LOD2 (FP7) towards Data Fusion based on the PlanetData Quality Framework

• Finalize experiments for Data Repair