Quality and Repair
Pablo N. Mendes (Freie Universität Berlin), Giorgos Flouris (FORTH)
1st year review, Luxembourg, December 2011
Work Plan View WP2
Task 2.1: Data quality assessment and repair
Task 2.2: Temporal, spatial and social aspects of data
Task 2.3: Recommendations for enhancing best practices for data publishing
D2.1 Conceptual model and best practices for high-quality data publishing
D2.2 Methods for quality repair
D2.3 Modelling and processing contextual aspects of data
D2.4 Update of D2.1
D2.5 Proof-of-concept evaluation for modelling space and time
D2.6 Methods for assessing the quality of sensor data
D2.7 Recommendations for contextual data publishing
Upcoming deliverables
Quality Assessment: D2.1 - Conceptual model and best practices for high-quality metadata publishing
Quality Enhancement: D2.2 - Methods for quality repair
Outline
Overview of Quality
Data Quality Framework
Quality Assessment
Quality Enhancement (Repair)
“Fitness for use.”
Joseph Juran. The Quality Control Handbook. McGraw-Hill, New York, 3rd edition, 1974.
Quality
Data Quality
Multifaceted: accurate = high quality? availability? timeliness?
Subjective: weekly updates are ok (for me, for vacation planning)
Task-dependent: task: weather forecast; data is not good if it is not available for online query; vacation planning or aviation?
Category: Dimensions
Intrinsic: Accuracy, Consistency, Objectivity, Timeliness
Contextual: Validity, Believability, Completeness, Understandability, Relevancy, Reputation, Verifiability, Amount of Data
Representational: Interpretability, Representational Conciseness, Representational Consistency
Accessibility: Availability, Response Time, Security
Data Quality Dimensions
Presentation order
Data Quality Framework
Quality Assessment and Quality Enhancement
ACCESSIBILITY
Dereferenceability
• Indicator: Dereferenceable URIs
• "Resources identified by URIs that respond with RDF to HTTP requests?"
• Metrics (see the sketch below):
  • for datasets (d) and for resources (r)
  • deref(d) = count(r | deref(r))
  • ratioderef(d) = deref(d) / no-deref(r)
• Recommendation:
  • Your URIs should be dereferenceable.
  • Prefer reusing URIs that are dereferenceable.
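A minimal sketch of how the dereferenceability metrics could be computed over a list of resource URIs. The helper names (dereferences_to_rdf, deref_metrics) and the list of RDF media types are illustrative assumptions, and the ratio is read here as the share of dereferenceable resources among all resources of the dataset, which is one possible reading of ratioderef(d).

```python
# Sketch: estimating deref(d) and a dereferenceability ratio for a dataset,
# given a list of its resource URIs. Standard library only; the RDF media types
# below are an illustrative, non-exhaustive list.
from urllib.request import Request, urlopen

RDF_TYPES = ("application/rdf+xml", "text/turtle", "application/n-triples")

def dereferences_to_rdf(uri, timeout=10):
    """Return True if an HTTP GET on the URI answers with an RDF content type."""
    req = Request(uri, headers={"Accept": ", ".join(RDF_TYPES)})
    try:
        with urlopen(req, timeout=timeout) as resp:
            content_type = resp.headers.get("Content-Type", "")
            return any(t in content_type for t in RDF_TYPES)
    except Exception:
        return False  # network errors count as non-dereferenceable

def deref_metrics(resource_uris):
    """deref(d): number of dereferenceable resources; ratio: their share of all resources."""
    deref_d = sum(1 for r in resource_uris if dereferences_to_rdf(r))
    ratio = deref_d / len(resource_uris) if resource_uris else 0.0
    return deref_d, ratio
```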
Access methods
• Indicator: Access methods
• "Data is accessible in varied and recommended ways."
• Metrics (see the sketch below):
  • sample(d): {0,1} "example resource available for d"
  • endpoint(d): {0,1} "SPARQL endpoint available for d"
  • dump(d): {0,1} "RDF dumps available for d"
• Recommendation:
  • Provide as many access methods as possible
  • A sample resource provides a quick view into the type of data you serve.
  • SPARQL endpoints for clients to obtain part of the data
  • Dumps are cheaper than alternatives when bulk access is needed
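A sketch of the three binary access-method indicators, assuming the dataset description already lists a sample resource URI, a SPARQL endpoint URL, and a dump URL. Probing the endpoint with a trivial ASK query and the other URLs with plain HTTP requests is one possible choice, not prescribed by the framework.

```python
# Sketch: the binary indicators sample(d), endpoint(d), dump(d) from known URLs.
from urllib.parse import urlencode
from urllib.request import Request, urlopen

def reachable(url, timeout=10):
    """True if the URL answers with a 2xx status."""
    try:
        with urlopen(Request(url), timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except Exception:
        return False

def access_method_flags(sample_uri=None, sparql_endpoint=None, dump_url=None):
    sample = 1 if sample_uri and reachable(sample_uri) else 0
    endpoint = 0
    if sparql_endpoint:
        # probe the endpoint with a trivial ASK query via the SPARQL protocol
        ask = sparql_endpoint + "?" + urlencode({"query": "ASK { ?s ?p ?o }"})
        endpoint = 1 if reachable(ask) else 0
    dump = 1 if dump_url and reachable(dump_url) else 0
    return {"sample": sample, "endpoint": endpoint, "dump": dump}
```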
ACCESSIBILITY
Availability
• Indicator: Availability
• "Average availability in a time interval"
• Metrics (see the sketch below):
  • avail(d) = ∑_{hour=1..24} deref(sample(d)) / 24 (one check per hour)
  • Alternatively, httphead() instead of deref()
• Recommendation: the higher the better
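A small sketch of the availability metric: average the outcome of one dereferencing (or HTTP HEAD) check per hour over a day. The hourly scheduling itself (e.g., via cron) is outside the snippet; the function name is illustrative.

```python
# Sketch: availability as the fraction of successful hourly checks over one day.
def availability(hourly_results):
    """hourly_results: 24 booleans, one per hourly deref/HEAD check of the sample resource."""
    if not hourly_results:
        return 0.0
    return sum(1 for ok in hourly_results if ok) / len(hourly_results)

# Example: the sample resource answered in 23 of 24 hourly checks -> ~0.958
print(availability([True] * 23 + [False]))
```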
ACCESSIBILITY
Accessibility Dimensions (with example metrics):
  Dereferenceability: HTTP GET / HEAD
  Availability: hourly derefs
  Access methods: URI, Bulk, SPARQL
  Response time: timed deref
  Robustness: requests per minute
  Reachability: LOD cloud inlinks
Representational: Interpretability
• Indicator: Human/Machine interpretability
• "URI is dereferenceable to human- and machine-readable formats"
• Metrics (see the sketch below):
  • format(deref(r,f)) in {Fh ∪ Fm}: {0,1}
  • Fh = HTML, XHTML+RDFa, ...
  • Fm = NT, RDF/XML, ...
• Recommendation:
  • Resources should dereference at least to human-readable HTML and one widely adopted RDF serialization.
REPRESENTATIONAL
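A sketch of the interpretability check, assuming content negotiation: one request asks for human-readable formats (Fh), another for machine-readable RDF formats (Fm). The media-type lists are illustrative; real assessments may accept more serializations.

```python
# Sketch: does a resource dereference to a human-readable and to a machine-readable format?
from urllib.request import Request, urlopen

F_HUMAN = ("text/html", "application/xhtml+xml")
F_MACHINE = ("application/rdf+xml", "text/turtle", "application/n-triples")

def serves_format(uri, accepted_types, timeout=10):
    """True if the URI answers a GET with one of the accepted content types."""
    req = Request(uri, headers={"Accept": ", ".join(accepted_types)})
    try:
        with urlopen(req, timeout=timeout) as resp:
            ctype = resp.headers.get("Content-Type", "")
            return any(t in ctype for t in accepted_types)
    except Exception:
        return False

def interpretability(uri):
    """Returns (human, machine) as 0/1 indicators for one resource URI."""
    return int(serves_format(uri, F_HUMAN)), int(serves_format(uri, F_MACHINE))
```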
Vocabulary understandability
• Schema understandability
• "Schema terms are familiar to existing agents."
• Metrics (see the sketch below):
  • vocab-underst(d) = triples(v,d) * triples(v,D) / triples(D)
  • Alternative: PageRank (probability that a random surfer has found v)
• Recommendation:
  • Reuse widely deployed vocabularies.
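A sketch of the vocabulary-understandability score following the slide's formula: each vocabulary v used in dataset d is weighted by how widely it is deployed in the whole corpus D. Summing the per-vocabulary terms over all vocabularies in d is an assumption here, and the counts in the example are made up.

```python
# Sketch: vocab-underst(d) = sum over v of triples(v, d) * triples(v, D) / triples(D)
def vocab_understandability(triples_in_d_by_vocab, triples_in_corpus_by_vocab, corpus_size):
    """
    triples_in_d_by_vocab:      {vocab_uri: number of triples in d using that vocabulary}
    triples_in_corpus_by_vocab: {vocab_uri: number of triples in the whole corpus D using it}
    corpus_size:                total number of triples in D
    """
    score = 0.0
    for v, triples_v_d in triples_in_d_by_vocab.items():
        triples_v_corpus = triples_in_corpus_by_vocab.get(v, 0)
        score += triples_v_d * triples_v_corpus / corpus_size
    return score

# Made-up counts: a dataset reusing foaf (widely deployed) scores higher than one
# using a rarely deployed local vocabulary.
corpus = {"http://xmlns.com/foaf/0.1/": 5_000_000, "http://example.org/myvocab#": 200}
print(vocab_understandability({"http://xmlns.com/foaf/0.1/": 1000}, corpus, 10_000_000))
print(vocab_understandability({"http://example.org/myvocab#": 1000}, corpus, 10_000_000))
```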
Representational Dimensions
Human/Machine Interpretability: HTML, RDF
Vocabulary Understandability: vocabulary usage stats
Representational Conciseness: triples / byte
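As an illustration of the triples-per-byte reading of representational conciseness, a sketch over an N-Triples dump; treating each non-empty, non-comment line as one triple is a simplification that only holds for line-based serializations.

```python
# Sketch: representational conciseness as triples per byte of an N-Triples dump.
import os

def triples_per_byte(nt_path):
    size = os.path.getsize(nt_path)
    with open(nt_path, "rb") as f:
        triples = sum(1 for line in f if line.strip() and not line.startswith(b"#"))
    return triples / size if size else 0.0
```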
REPRESENTATIONAL
Contextual Dimensions
Completeness: full set of objects and attributes with respect to a task
Conciseness: amount of duplicate entries, redundant attributes
Coherence: how well instance data conforms to the schema
CONTEXTUAL DIMENSIONS
Contextual Dimensions
Verifiability: how easy is it to check the data? Can use provenance information.
Validity: encodes context- or application-specific requirements
CONTEXTUAL DIMENSIONS
INTRINSIC DIMENSIONS
Intrinsic Dimensions
Accuracy: usually estimated; may be available for sensors
Timeliness: can use last update
Consistency: two or more values do not conflict with each other
Objectivity: can be traced via provenance
Example: AEMET
Metadata entry: http://thedatahub.org/dataset/aemet
Example item: http://aemet.linkeddata.es/page/resource/WeatherStation/id08001?output=ttl
Access methods: Example URI, SPARQL, Bulk
Availability: Example URI available; SPARQL endpoint 100%
Format interpretability: TTL = OK, RDF/XML = OK
Verifiability: Published by third party
http://www4.wiwiss.fu-berlin.de/lodcloud/ckan/validator/validate.php?package=aemet
Data Quality Framework
Quality Assessment and Quality Enhancement
Validity as a Quality Indicator
Validity is an important quality indicator
  Encodes context- or application-specific requirements
  Applications may be useless over invalid data
  Binary concept (valid/invalid)
Two steps to guarantee validity (repair process):
  1. Identifying invalid ontologies (diagnosis)
     Detecting invalidities in an automated manner
     Subtask of Quality Assessment
  2. Removing invalidities (repair)
     Repairing invalidities in an automated manner
     Subtask of Quality Enhancement
Diagnosis
Expressing validity using validity rules over an adequate relational schema
Examples:
  Properties must have a unique domain:
    ∀p: Prop(p) → ∃a Dom(p,a)
    ∀p,a,b: Dom(p,a) ∧ Dom(p,b) → (a=b)
  Correct classification in property instances:
    ∀x,y,p,a: P_Inst(x,y,p) ∧ Dom(p,a) → C_Inst(x,a)
    ∀x,y,p,a: P_Inst(x,y,p) ∧ Rng(p,a) → C_Inst(y,a)
Diagnosis reduced to relational queries
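A sketch of diagnosis as a relational query, using SQLite from Python: the correct-classification rule is violated by every property instance whose subject lacks the classification required by the property's domain. Table names mirror the relational schema above; the inserted facts preview the geo:location example on the next slides.

```python
# Sketch: violations of "forall x,y,p,a: P_Inst(x,y,p) AND Dom(p,a) -> C_Inst(x,a)"
# found with a plain relational (SQL) query.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
  CREATE TABLE Dom   (p TEXT, a TEXT);
  CREATE TABLE P_Inst(x TEXT, y TEXT, p TEXT);
  CREATE TABLE C_Inst(x TEXT, a TEXT);
""")
# geo:location has domain Sensor, but Item1 is only classified as an Observation.
conn.execute("INSERT INTO Dom    VALUES ('geo:location', 'Sensor')")
conn.execute("INSERT INTO P_Inst VALUES ('Item1', 'ST1', 'geo:location')")
conn.execute("INSERT INTO C_Inst VALUES ('Item1', 'Observation')")

violations = conn.execute("""
  SELECT P.x, P.y, P.p, D.a
  FROM P_Inst P JOIN Dom D ON P.p = D.p
  WHERE NOT EXISTS (SELECT 1 FROM C_Inst C WHERE C.x = P.x AND C.a = D.a)
""").fetchall()
print(violations)   # [('Item1', 'ST1', 'geo:location', 'Sensor')]
```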
Ontology O0
Class(Sensor), Class(SpatialThing), Class(Observation)
Prop(geo:location)
Dom(geo:location,Sensor), Rng(geo:location,SpatialThing)
Inst(Item1), Inst(ST1)
P_Inst(Item1,ST1,geo:location)
C_Inst(Item1,Observation), C_Inst(ST1,SpatialThing)
Example
Correct classification in property instances: ∀x,y,p,a: P_Inst(x,y,p) ∧ Dom(p,a) → C_Inst(x,a)
Schema: geo:location has domain Sensor and range SpatialThing; Observation is another class.
Data: Item1 geo:location ST1; Item1 is an Observation, ST1 a SpatialThing.

Violation: Item1 geo:location ST1, and Sensor is the domain of geo:location, but Item1 is not a Sensor:
  P_Inst(Item1,ST1,geo:location) ∈ O0, Dom(geo:location,Sensor) ∈ O0, C_Inst(Item1,Sensor) ∉ O0

Possible resolutions (see the sketch below):
  Remove P_Inst(Item1,ST1,geo:location)
  Add C_Inst(Item1,Sensor)
  Remove Dom(geo:location,Sensor)
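A sketch of the resolution step for this example: given the single violated domain constraint, the three minimal resolutions are exactly the options listed above. Facts are modelled as plain tuples; the function name is illustrative.

```python
# Sketch: enumerating the minimal resolutions of one domain-constraint violation.
# Each resolution is an (additions, deletions) pair that makes the rule instance true again.
def minimal_resolutions(violation):
    """violation = (x, y, p, a): P_Inst(x,y,p) and Dom(p,a) hold, but C_Inst(x,a) does not."""
    x, y, p, a = violation
    return [
        (set(), {("P_Inst", x, y, p)}),   # remove the property instance
        ({("C_Inst", x, a)}, set()),      # add the missing classification
        (set(), {("Dom", p, a)}),         # remove the domain axiom
    ]

for additions, deletions in minimal_resolutions(("Item1", "ST1", "geo:location", "Sensor")):
    print("add:", additions, "remove:", deletions)
```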
Preferences for Repair
Which repairing option is best? The ontology engineer determines that via preferences.
Preferences:
  Specified by the ontology engineer beforehand
  High-level "specifications" for the ideal repair
  Serve as "instructions" to determine the preferred solution
Preferences (On Ontologies)
The repair options lead from O0 to candidate repaired ontologies O1, O2, O3; each resulting ontology is assigned a score (e.g., 3, 4, 6) and the one with the best score is the preferred repair.
Preferences (On Deltas)
The candidate deltas from O0 (remove P_Inst(Item1,ST1,geo:location), add C_Inst(Item1,Sensor), remove Dom(geo:location,Sensor)) lead to O1, O2, O3; each delta is assigned a score (e.g., 2, 4, 5) and the delta with the best score determines the preferred repair.
Preferences
Preferences on ontologies are result-oriented
  Consider the quality of the repair result
  Ignore the impact of the repair
  Popular options: prefer newest information, prefer trustable information
Preferences on deltas are more impact-oriented
  Consider the impact of the repair
  Ignore the quality of the repair result
  Popular options: minimize schema changes, minimize addition/deletion of information, minimize delta size
Two sides of the same coin (equivalent options)
Quality metrics can be used for stating preferences (see the sketch below)
  Metadata on the data may be needed
  Can be qualitative or quantitative
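A sketch of a delta-oriented preference, assuming repairs are represented as (additions, deletions) sets of facts as in the earlier sketch: schema-level changes are penalised more heavily than instance-level ones, so the smallest, least schema-invasive delta wins. The weights are illustrative assumptions, not part of the framework.

```python
# Sketch: a simple impact-oriented preference over candidate deltas.
SCHEMA_PREDICATES = {"Dom", "Rng", "Class", "Prop"}   # schema-level facts

def delta_cost(additions, deletions):
    """Lower is better: schema changes weigh more than instance-level changes."""
    cost = 0
    for fact in additions | deletions:
        predicate = fact[0]
        cost += 10 if predicate in SCHEMA_PREDICATES else 1
    return cost

def preferred_delta(candidate_deltas):
    """Pick the candidate (additions, deletions) pair with the lowest cost."""
    return min(candidate_deltas, key=lambda d: delta_cost(*d))

# With these weights, removing Dom(geo:location,Sensor) (a schema change) is the
# least preferred of the three candidate repairs from the example.
```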
Generalizing the Approach
For one violated constraint:
  1. Diagnose the invalidity
  2. Determine minimal ways to resolve it
  3. Determine and return the preferred resolution
For many violated constraints:
  The problem becomes more complicated
  More than one resolution step is required
Issues:
  1. Resolution order
  2. When and how to filter non-preferred solutions?
  3. Constraint (and resolution) interdependencies
Constraint Interdependencies
A given resolution may:
  Cause other violations (bad)
  Resolve other violations (good)
Cannot pre-determine the best resolution
  Difficult to predict the ramifications of each one
  Exhaustive search required
  Recursive, tree-based search (resolution tree)
Two ways to create the resolution tree:
  Globally-preferred (GP), locally-preferred (LP)
  When and how to filter non-preferred solutions?
Resolution Tree Creation (GP)
Find all minimal resolutions for all the violated constraints, then find the preferred ones
Globally-preferred (GP):
  Find all minimal resolutions for one violation
  Explore them all
  Repeat recursively until consistent
  Return the preferred leaves (the preferred repairs)
Resolution Tree Creation (LP)
Find the minimal and preferred resolutions for one violated constraint, then repeat for the next
Locally-preferred (LP):
  Find all minimal resolutions for one violation
  Explore the preferred one(s)
  Repeat recursively until consistent
  Return all remaining leaves (the preferred repair); see the sketch below
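A sketch of the recursive resolution-tree search covering both strategies. find_violations, minimal_resolutions and cost are assumed callbacks (diagnosis, candidate generation, and a result-oriented preference); greedy=False expands every minimal resolution as in GP, greedy=True keeps only the locally preferred branch as in LP. Termination is assumed, i.e., resolutions eventually lead to a consistent ontology.

```python
# Sketch: recursive construction of the resolution tree, GP vs. LP.
def apply(ontology, additions, deletions):
    """Apply a delta to an ontology represented as a set of fact tuples."""
    return (ontology - deletions) | additions

def repair(ontology, find_violations, minimal_resolutions, cost, greedy=False):
    violations = find_violations(ontology)
    if not violations:
        return [ontology]                       # consistent leaf of the resolution tree
    options = minimal_resolutions(violations[0])
    if greedy:                                  # LP: keep only the locally preferred option
        options = [min(options, key=lambda d: cost(apply(ontology, *d)))]
    leaves = []
    for additions, deletions in options:        # GP: expand every minimal resolution
        leaves.extend(repair(apply(ontology, additions, deletions),
                             find_violations, minimal_resolutions, cost, greedy))
    best = min(cost(o) for o in leaves)
    return [o for o in leaves if cost(o) == best]   # return the preferred leaves
```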
Comparison (GP versus LP)
Characteristics of GP:
  Exhaustive
  Less efficient: large resolution trees
  Always returns the most preferred repairs
  Insensitive to constraint syntax
  Does not depend on resolution order
Characteristics of LP:
  Greedy
  More efficient: small resolution trees
  Does not always return the most preferred repairs
  Sensitive to constraint syntax
  Depends on resolution order
Algorithm and Complexity
Detailed complexity analysis for GP/LP and various types of constraints and preferences
Inherently difficult problem:
  Exponential complexity (in general)
  Main exception: LP is polynomial (in special cases)
Theoretical complexity is misleading as to the actual performance of the algorithms
Performance in Practice
Performance in practice:
  Linear with respect to ontology size
  Linear with respect to tree size
Tree size depends on:
  Types of violated constraints (tree width)
  Number of violations (tree height); causes the exponential blowup
  Constraint interdependencies (tree height)
  Preference (for LP): affects pruning (tree width)
Further performance improvements:
  Use optimizations
  Use LP with a restrictive preference
Evaluation Parameters
Evaluation:
  1. Effect of ontology size (for GP/LP)
  2. Effect of tree size (for GP/LP)
  3. Effect of violations (for GP/LP)
  4. Effect of preference (relevant for LP only)
  5. Quality of LP repairs
Preliminary results support our claims:
  Linear with respect to ontology size
  Linear with respect to tree size
Publications
Yannis Roussakis, Giorgos Flouris, Vassilis Christophides. Declarative Repairing Policies for Curated KBs. In Proceedings of the 10th Hellenic Data Management Symposium (HDMS-11), 2011
Yannis Roussakis, Giorgos Flouris, Vassilis Christophides. Preference-Based Repairing of RDF/S DBs. Tentative title, to be submitted to PVLDB, January 2012
Outlook
• Continue refining model based on experience with data sets catalog
• Derive “best practices checks” from metrics
• Results of quality assessment to be added to next release of the catalog
• Collaboration with EU-funded LOD2 (FP7) towards Data Fusion based on the PlanetData Quality Framework
• Finalize experiments for Data Repair