Quality and Repair
Pablo N. Mendes (Freie Universität Berlin), Giorgos Flouris (FORTH)
1st year review, Luxembourg, December 2011
Work Plan View WP2
Task 2.1: Data quality assessment and repair
Task 2.2: Temporal, spatial and social aspects of data
Task 2.3: Recommendations for enhancing best practices for data publishing
D2.1 Conceptual model and best practices for high-quality data publishing
D2.2 Methods for quality repair
D2.3 Modelling and processing contextual aspects of data
D2.4 Update of D2.1
D2.5 Proof-of-concept evaluation for modelling space and time
D2.6 Methods for assessing the quality of sensor data
D2.7 Recommendations for contextual data publishing
Upcoming deliverables
Quality Assessment: D2.1 - Conceptual model and best practices for high-quality metadata publishing
Quality Enhancement: D2.2 - Methods for quality repair
Outline
Overview of Quality
Data Quality Framework
Quality Assessment
Quality Enhancement (Repair)
“Fitness for use.”
Joseph Juran. The Quality Control Handbook. McGraw-Hill, New York, 3rd edition, 1974.
Quality
Data Quality
Multifaceted: accurate = high quality? availability? timeliness?
Subjective: weekly updates are ok (for me, for vacation planning)
Task-dependent: task: weather forecast; data is not good if it is not available for online query; vacation planning or aviation?
Category: Dimensions
Intrinsic: Accuracy, Consistency, Objectivity, Timeliness
Contextual: Validity, Believability, Completeness, Understandability, Relevancy, Reputation, Verifiability, Amount of Data
Representational: Interpretability, Representational Conciseness, Representational Consistency
Accessibility: Availability, Response Time, Security
Data Quality Dimensions
Presentation order
Data Quality Framework
Quality Assessment and Quality Enhancement
ACCESSIBILITY
Dereferenceability
• Indicator: Dereferenceable URIs
• "Resources identified by URIs that respond with RDF to HTTP requests?"
• Metrics (see the sketch below):
  • for datasets (d) and for resources (r)
  • deref(d) = count(r | deref(r))
  • ratioderef(d) = deref(d) / no-deref(r)
• Recommendation:
  • Your URIs should be dereferenceable.
  • Prefer reusing URIs that are dereferenceable.
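A minimal sketch of how the dereferenceability metrics could be computed over a list of resource URIs. The helper names (dereferences_to_rdf, deref_metrics) and the list of RDF media types are illustrative assumptions, and the ratio is read here as the share of dereferenceable resources among all resources of the dataset, which is one possible reading of ratioderef(d).

```python
# Sketch: estimating deref(d) and a dereferenceability ratio for a dataset,
# given a list of its resource URIs. Standard library only; the RDF media types
# below are an illustrative, non-exhaustive list.
from urllib.request import Request, urlopen

RDF_TYPES = ("application/rdf+xml", "text/turtle", "application/n-triples")

def dereferences_to_rdf(uri, timeout=10):
    """Return True if an HTTP GET on the URI answers with an RDF content type."""
    req = Request(uri, headers={"Accept": ", ".join(RDF_TYPES)})
    try:
        with urlopen(req, timeout=timeout) as resp:
            content_type = resp.headers.get("Content-Type", "")
            return any(t in content_type for t in RDF_TYPES)
    except Exception:
        return False  # network errors count as non-dereferenceable

def deref_metrics(resource_uris):
    """deref(d): number of dereferenceable resources; ratio: their share of all resources."""
    deref_d = sum(1 for r in resource_uris if dereferences_to_rdf(r))
    ratio = deref_d / len(resource_uris) if resource_uris else 0.0
    return deref_d, ratio
```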
Access methods
• Indicator: Access methods
• "Data is accessible in varied and recommended ways."
• Metrics (see the sketch below):
  • sample(d): {0,1} "example resource available for d"
  • endpoint(d): {0,1} "SPARQL endpoint available for d"
  • dump(d): {0,1} "RDF dumps available for d"
• Recommendation:
  • Provide as many access methods as possible
  • A sample resource provides a quick view into the type of data you serve.
  • SPARQL endpoints for clients to obtain part of the data
  • Dumps are cheaper than alternatives when bulk access is needed
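A sketch of the three binary access-method indicators, assuming the dataset description already lists a sample resource URI, a SPARQL endpoint URL, and a dump URL. Probing the endpoint with a trivial ASK query and the other URLs with plain HTTP requests is one possible choice, not prescribed by the framework.

```python
# Sketch: the binary indicators sample(d), endpoint(d), dump(d) from known URLs.
from urllib.parse import urlencode
from urllib.request import Request, urlopen

def reachable(url, timeout=10):
    """True if the URL answers with a 2xx status."""
    try:
        with urlopen(Request(url), timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except Exception:
        return False

def access_method_flags(sample_uri=None, sparql_endpoint=None, dump_url=None):
    sample = 1 if sample_uri and reachable(sample_uri) else 0
    endpoint = 0
    if sparql_endpoint:
        # probe the endpoint with a trivial ASK query via the SPARQL protocol
        ask = sparql_endpoint + "?" + urlencode({"query": "ASK { ?s ?p ?o }"})
        endpoint = 1 if reachable(ask) else 0
    dump = 1 if dump_url and reachable(dump_url) else 0
    return {"sample": sample, "endpoint": endpoint, "dump": dump}
```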
ACCESSIBILITY
Availability
• Indicator: Availability
• "Average availability in a time interval"
• Metrics (see the sketch below):
  • avail(d) = ∑_{hour=1..24} deref(sample(d)) / 24 (one check per hour)
  • Alternatively, httphead() instead of deref()
• Recommendation: the higher the better
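A small sketch of the availability metric: average the outcome of one dereferencing (or HTTP HEAD) check per hour over a day. The hourly scheduling itself (e.g., via cron) is outside the snippet; the function name is illustrative.

```python
# Sketch: availability as the fraction of successful hourly checks over one day.
def availability(hourly_results):
    """hourly_results: 24 booleans, one per hourly deref/HEAD check of the sample resource."""
    if not hourly_results:
        return 0.0
    return sum(1 for ok in hourly_results if ok) / len(hourly_results)

# Example: the sample resource answered in 23 of 24 hourly checks -> ~0.958
print(availability([True] * 23 + [False]))
```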
ACCESSIBILITY
Accessibility Dimensions (with example metrics):
  Dereferenceability: HTTP GET / HEAD
  Availability: hourly derefs
  Access methods: URI, Bulk, SPARQL
  Response time: timed deref
  Robustness: requests per minute
  Reachability: LOD cloud inlinks
Representational: Interpretability
• Indicator: Human/Machine interpretability
• "URI is dereferenceable to human- and machine-readable formats"
• Metrics (see the sketch below):
  • format(deref(r,f)) in {Fh ∪ Fm}: {0,1}
  • Fh = HTML, XHTML+RDFa, ...
  • Fm = NT, RDF/XML, ...
• Recommendation:
  • Resources should dereference at least to human-readable HTML and one widely adopted RDF serialization.
REPRESENTATIONAL
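A sketch of the interpretability check, assuming content negotiation: one request asks for human-readable formats (Fh), another for machine-readable RDF formats (Fm). The media-type lists are illustrative; real assessments may accept more serializations.

```python
# Sketch: does a resource dereference to a human-readable and to a machine-readable format?
from urllib.request import Request, urlopen

F_HUMAN = ("text/html", "application/xhtml+xml")
F_MACHINE = ("application/rdf+xml", "text/turtle", "application/n-triples")

def serves_format(uri, accepted_types, timeout=10):
    """True if the URI answers a GET with one of the accepted content types."""
    req = Request(uri, headers={"Accept": ", ".join(accepted_types)})
    try:
        with urlopen(req, timeout=timeout) as resp:
            ctype = resp.headers.get("Content-Type", "")
            return any(t in ctype for t in accepted_types)
    except Exception:
        return False

def interpretability(uri):
    """Returns (human, machine) as 0/1 indicators for one resource URI."""
    return int(serves_format(uri, F_HUMAN)), int(serves_format(uri, F_MACHINE))
```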
Vocabulary understandability
• Schema understandability
• "Schema terms are familiar to existing agents."
• Metrics (see the sketch below):
  • vocab-underst(d) = triples(v,d) * triples(v,D) / triples(D)
  • Alternative: PageRank (probability that a random surfer has found v)
• Recommendation:
  • Reuse widely deployed vocabularies.
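A sketch of the vocabulary-understandability score following the slide's formula: each vocabulary v used in dataset d is weighted by how widely it is deployed in the whole corpus D. Summing the per-vocabulary terms over all vocabularies in d is an assumption here, and the counts in the example are made up.

```python
# Sketch: vocab-underst(d) = sum over v of triples(v, d) * triples(v, D) / triples(D)
def vocab_understandability(triples_in_d_by_vocab, triples_in_corpus_by_vocab, corpus_size):
    """
    triples_in_d_by_vocab:      {vocab_uri: number of triples in d using that vocabulary}
    triples_in_corpus_by_vocab: {vocab_uri: number of triples in the whole corpus D using it}
    corpus_size:                total number of triples in D
    """
    score = 0.0
    for v, triples_v_d in triples_in_d_by_vocab.items():
        triples_v_corpus = triples_in_corpus_by_vocab.get(v, 0)
        score += triples_v_d * triples_v_corpus / corpus_size
    return score

# Made-up counts: a dataset reusing foaf (widely deployed) scores higher than one
# using a rarely deployed local vocabulary.
corpus = {"http://xmlns.com/foaf/0.1/": 5_000_000, "http://example.org/myvocab#": 200}
print(vocab_understandability({"http://xmlns.com/foaf/0.1/": 1000}, corpus, 10_000_000))
print(vocab_understandability({"http://example.org/myvocab#": 1000}, corpus, 10_000_000))
```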
Representational Dimensions
Human/Machine Interpretability: HTML, RDF
Vocabulary Understandability: vocabulary usage stats
Representational Conciseness: triples / byte
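As an illustration of the triples-per-byte reading of representational conciseness, a sketch over an N-Triples dump; treating each non-empty, non-comment line as one triple is a simplification that only holds for line-based serializations.

```python
# Sketch: representational conciseness as triples per byte of an N-Triples dump.
import os

def triples_per_byte(nt_path):
    size = os.path.getsize(nt_path)
    with open(nt_path, "rb") as f:
        triples = sum(1 for line in f if line.strip() and not line.startswith(b"#"))
    return triples / size if size else 0.0
```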
REPRESENTATIONAL
Contextual Dimensions
Completeness: full set of objects and attributes with respect to a task
Conciseness: amount of duplicate entries, redundant attributes
Coherence: how well instance data conforms to the schema
CONTEXTUAL DIMENSIONS
Contextual Dimensions
Verifiability: how easy is it to check the data? Can use provenance information.
Validity: encodes context- or application-specific requirements
CONTEXTUAL DIMENSIONS
INTRINSIC DIMENSIONS
Intrinsic Dimensions
Accuracy: usually estimated; may be available for sensors
Timeliness: can use last update
Consistency: two or more values do not conflict with each other
Objectivity: can be traced via provenance
Example: AEMET
Metadata entry: http://thedatahub.org/dataset/aemet
Example item: http://aemet.linkeddata.es/page/resource/WeatherStation/id08001?output=ttl
Access methods: Example URI, SPARQL, Bulk
Availability: Example URI available; SPARQL endpoint 100%
Format interpretability: TTL = OK, RDF/XML = OK
Verifiability: Published by third party
http://www4.wiwiss.fu-berlin.de/lodcloud/ckan/validator/validate.php?package=aemet
Data Quality Framework
Quality Assessment and Quality Enhancement
Validity as a Quality Indicator
Validity is an important quality indicator
  Encodes context- or application-specific requirements
  Applications may be useless over invalid data
  Binary concept (valid/invalid)
Two steps to guarantee validity (repair process):
  1. Identifying invalid ontologies (diagnosis)
     Detecting invalidities in an automated manner
     Subtask of Quality Assessment
  2. Removing invalidities (repair)
     Repairing invalidities in an automated manner
     Subtask of Quality Enhancement
Diagnosis
Expressing validity using validity rules over an adequate relational schema
Examples:
  Properties must have a unique domain:
    ∀p: Prop(p) → ∃a Dom(p,a)
    ∀p,a,b: Dom(p,a) ∧ Dom(p,b) → (a=b)
  Correct classification in property instances:
    ∀x,y,p,a: P_Inst(x,y,p) ∧ Dom(p,a) → C_Inst(x,a)
    ∀x,y,p,a: P_Inst(x,y,p) ∧ Rng(p,a) → C_Inst(y,a)
Diagnosis reduced to relational queries
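A sketch of diagnosis as a relational query, using SQLite from Python: the correct-classification rule is violated by every property instance whose subject lacks the classification required by the property's domain. Table names mirror the relational schema above; the inserted facts preview the geo:location example on the next slides.

```python
# Sketch: violations of "forall x,y,p,a: P_Inst(x,y,p) AND Dom(p,a) -> C_Inst(x,a)"
# found with a plain relational (SQL) query.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
  CREATE TABLE Dom   (p TEXT, a TEXT);
  CREATE TABLE P_Inst(x TEXT, y TEXT, p TEXT);
  CREATE TABLE C_Inst(x TEXT, a TEXT);
""")
# geo:location has domain Sensor, but Item1 is only classified as an Observation.
conn.execute("INSERT INTO Dom    VALUES ('geo:location', 'Sensor')")
conn.execute("INSERT INTO P_Inst VALUES ('Item1', 'ST1', 'geo:location')")
conn.execute("INSERT INTO C_Inst VALUES ('Item1', 'Observation')")

violations = conn.execute("""
  SELECT P.x, P.y, P.p, D.a
  FROM P_Inst P JOIN Dom D ON P.p = D.p
  WHERE NOT EXISTS (SELECT 1 FROM C_Inst C WHERE C.x = P.x AND C.a = D.a)
""").fetchall()
print(violations)   # [('Item1', 'ST1', 'geo:location', 'Sensor')]
```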
Ontology O0
Class(Sensor), Class(SpatialThing), Class(Observation)
Prop(geo:location)
Dom(geo:location,Sensor), Rng(geo:location,SpatialThing)
Inst(Item1), Inst(ST1)
P_Inst(Item1,ST1,geo:location)
C_Inst(Item1,Observation), C_Inst(ST1,SpatialThing)
Example
Correct classification in property instances: ∀x,y,p,a: P_Inst(x,y,p) ∧ Dom(p,a) → C_Inst(x,a)
Schema: geo:location has domain Sensor and range SpatialThing; Observation is another class.
Data: Item1 geo:location ST1; Item1 is an Observation, ST1 a SpatialThing.

Violation: Item1 geo:location ST1, and Sensor is the domain of geo:location, but Item1 is not a Sensor:
  P_Inst(Item1,ST1,geo:location) ∈ O0, Dom(geo:location,Sensor) ∈ O0, C_Inst(Item1,Sensor) ∉ O0

Possible resolutions (see the sketch below):
  Remove P_Inst(Item1,ST1,geo:location)
  Add C_Inst(Item1,Sensor)
  Remove Dom(geo:location,Sensor)
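A sketch of the resolution step for this example: given the single violated domain constraint, the three minimal resolutions are exactly the options listed above. Facts are modelled as plain tuples; the function name is illustrative.

```python
# Sketch: enumerating the minimal resolutions of one domain-constraint violation.
# Each resolution is an (additions, deletions) pair that makes the rule instance true again.
def minimal_resolutions(violation):
    """violation = (x, y, p, a): P_Inst(x,y,p) and Dom(p,a) hold, but C_Inst(x,a) does not."""
    x, y, p, a = violation
    return [
        (set(), {("P_Inst", x, y, p)}),   # remove the property instance
        ({("C_Inst", x, a)}, set()),      # add the missing classification
        (set(), {("Dom", p, a)}),         # remove the domain axiom
    ]

for additions, deletions in minimal_resolutions(("Item1", "ST1", "geo:location", "Sensor")):
    print("add:", additions, "remove:", deletions)
```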
Preferences for Repair
Which repairing option is best? The ontology engineer determines that via preferences.
Preferences:
  Specified by the ontology engineer beforehand
  High-level "specifications" for the ideal repair
  Serve as "instructions" to determine the preferred solution
Preferences (On Ontologies)
The repair options lead from O0 to candidate repaired ontologies O1, O2, O3; each resulting ontology is assigned a score (e.g., 3, 4, 6) and the one with the best score is the preferred repair.
Preferences (On Deltas)
The candidate deltas from O0 (remove P_Inst(Item1,ST1,geo:location), add C_Inst(Item1,Sensor), remove Dom(geo:location,Sensor)) lead to O1, O2, O3; each delta is assigned a score (e.g., 2, 4, 5) and the delta with the best score determines the preferred repair.
Preferences
Preferences on ontologies are result-oriented
  Consider the quality of the repair result
  Ignore the impact of the repair
  Popular options: prefer newest information, prefer trustable information
Preferences on deltas are more impact-oriented
  Consider the impact of the repair
  Ignore the quality of the repair result
  Popular options: minimize schema changes, minimize addition/deletion of information, minimize delta size
Two sides of the same coin (equivalent options)
Quality metrics can be used for stating preferences (see the sketch below)
  Metadata on the data may be needed
  Can be qualitative or quantitative
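A sketch of a delta-oriented preference, assuming repairs are represented as (additions, deletions) sets of facts as in the earlier sketch: schema-level changes are penalised more heavily than instance-level ones, so the smallest, least schema-invasive delta wins. The weights are illustrative assumptions, not part of the framework.

```python
# Sketch: a simple impact-oriented preference over candidate deltas.
SCHEMA_PREDICATES = {"Dom", "Rng", "Class", "Prop"}   # schema-level facts

def delta_cost(additions, deletions):
    """Lower is better: schema changes weigh more than instance-level changes."""
    cost = 0
    for fact in additions | deletions:
        predicate = fact[0]
        cost += 10 if predicate in SCHEMA_PREDICATES else 1
    return cost

def preferred_delta(candidate_deltas):
    """Pick the candidate (additions, deletions) pair with the lowest cost."""
    return min(candidate_deltas, key=lambda d: delta_cost(*d))

# With these weights, removing Dom(geo:location,Sensor) (a schema change) is the
# least preferred of the three candidate repairs from the example.
```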
Generalizing the Approach
For one violated constraint:
  1. Diagnose the invalidity
  2. Determine minimal ways to resolve it
  3. Determine and return the preferred resolution
For many violated constraints:
  The problem becomes more complicated
  More than one resolution step is required
Issues:
  1. Resolution order
  2. When and how to filter non-preferred solutions?
  3. Constraint (and resolution) interdependencies
Constraint Interdependencies
A given resolution may:
  Cause other violations (bad)
  Resolve other violations (good)
Cannot pre-determine the best resolution
  Difficult to predict the ramifications of each one
  Exhaustive search required
  Recursive, tree-based search (resolution tree)
Two ways to create the resolution tree:
  Globally-preferred (GP), locally-preferred (LP)
  When and how to filter non-preferred solutions?
Resolution Tree Creation (GP)
Find all minimal resolutions for all the violated constraints, then find the preferred ones
Globally-preferred (GP):
  Find all minimal resolutions for one violation
  Explore them all
  Repeat recursively until consistent
  Return the preferred leaves (the preferred repairs)
Resolution Tree Creation (LP)
Find the minimal and preferred resolutions for one violated constraint, then repeat for the next
Locally-preferred (LP):
  Find all minimal resolutions for one violation
  Explore the preferred one(s)
  Repeat recursively until consistent
  Return all remaining leaves (the preferred repair); see the sketch below
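A sketch of the recursive resolution-tree search covering both strategies. find_violations, minimal_resolutions and cost are assumed callbacks (diagnosis, candidate generation, and a result-oriented preference); greedy=False expands every minimal resolution as in GP, greedy=True keeps only the locally preferred branch as in LP. Termination is assumed, i.e., resolutions eventually lead to a consistent ontology.

```python
# Sketch: recursive construction of the resolution tree, GP vs. LP.
def apply(ontology, additions, deletions):
    """Apply a delta to an ontology represented as a set of fact tuples."""
    return (ontology - deletions) | additions

def repair(ontology, find_violations, minimal_resolutions, cost, greedy=False):
    violations = find_violations(ontology)
    if not violations:
        return [ontology]                       # consistent leaf of the resolution tree
    options = minimal_resolutions(violations[0])
    if greedy:                                  # LP: keep only the locally preferred option
        options = [min(options, key=lambda d: cost(apply(ontology, *d)))]
    leaves = []
    for additions, deletions in options:        # GP: expand every minimal resolution
        leaves.extend(repair(apply(ontology, additions, deletions),
                             find_violations, minimal_resolutions, cost, greedy))
    best = min(cost(o) for o in leaves)
    return [o for o in leaves if cost(o) == best]   # return the preferred leaves
```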
Comparison (GP versus LP)
Characteristics of GP:
  Exhaustive
  Less efficient: large resolution trees
  Always returns the most preferred repairs
  Insensitive to constraint syntax
  Does not depend on resolution order
Characteristics of LP:
  Greedy
  More efficient: small resolution trees
  Does not always return the most preferred repairs
  Sensitive to constraint syntax
  Depends on resolution order
Algorithm and Complexity
Detailed complexity analysis for GP/LP and various types of constraints and preferences
Inherently difficult problem:
  Exponential complexity (in general)
  Main exception: LP is polynomial (in special cases)
Theoretical complexity is misleading as to the actual performance of the algorithms
Performance in Practice
Performance in practice:
  Linear with respect to ontology size
  Linear with respect to tree size
Tree size depends on:
  Types of violated constraints (tree width)
  Number of violations (tree height); causes the exponential blowup
  Constraint interdependencies (tree height)
  Preference (for LP): affects pruning (tree width)
Further performance improvements:
  Use optimizations
  Use LP with a restrictive preference
Evaluation Parameters
Evaluation:
  1. Effect of ontology size (for GP/LP)
  2. Effect of tree size (for GP/LP)
  3. Effect of violations (for GP/LP)
  4. Effect of preference (relevant for LP only)
  5. Quality of LP repairs
Preliminary results support our claims:
  Linear with respect to ontology size
  Linear with respect to tree size
Publications
Yannis Roussakis, Giorgos Flouris, Vassilis Christophides. Declarative Repairing Policies for Curated KBs. In Proceedings of the 10th Hellenic Data Management Symposium (HDMS-11), 2011
Yannis Roussakis, Giorgos Flouris, Vassilis Christophides. Preference-Based Repairing of RDF/S DBs. Tentative title, to be submitted to PVLDB, January 2012
Outlook
• Continue refining model based on experience with data sets catalog
• Derive “best practices checks” from metrics
• Results of quality assessment to be added to next release of the catalog
• Collaboration with EU-funded LOD2 (FP7) towards Data Fusion based on the PlanetData Quality Framework
• Finalize experiments for Data Repair