2014-May-07
What is the problem?
What have others done?
What is our solution?
Does it work?
Outline
What is the problem?
• Linked Open Data (LOD):
  ▫ Realizing the Semantic Web by interlinking existing but dispersed data
• Main components of LOD:
  ▫ URIs to identify things
  ▫ RDF to describe data
  ▫ HTTP to access data
Datasets: 295
Triples: over 30,000,000,000 (30 B)
Links: over 500,000,000 (500 M)
What is the problem?
Inclusion criteria for publishing and interlinking datasets into the LOD cloud
• Resolvable http/https URIs
• Presented in one of the standard formats of the Semantic Web (RDF, RDFa, RDF/XML, Turtle, N-Triples)
• Contains at least 1000 triples
• Connected via at least 50 RDF links to the existing datasets of LOD
• Accessible via RDF crawling, RDF dump, or SPARQL endpoint
Is the dataset ready to publish?
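The inclusion criteria above can be read as a simple checklist. The following is a minimal sketch, assuming a hypothetical `DatasetSummary` record whose fields (and the way they would be measured) are not part of the slides; only the thresholds of 1000 triples and 50 links come from the list above.

```python
from dataclasses import dataclass

MIN_TRIPLES = 1000  # "at least 1000 triples"
MIN_LINKS = 50      # "at least 50 RDF links"
STANDARD_FORMATS = {"rdf", "rdfa", "rdf/xml", "turtle", "n-triples"}
ACCESS_METHODS = {"crawling", "dump", "sparql"}


@dataclass
class DatasetSummary:
    """Hypothetical summary of a candidate dataset (an assumption, not from the slides)."""
    uris_resolvable: bool     # all URIs are resolvable http/https URIs
    rdf_format: str           # e.g. "Turtle" or "RDF/XML"
    triple_count: int
    outgoing_link_count: int  # RDF links to datasets already in the LOD cloud
    access_methods: set       # subset of ACCESS_METHODS


def unmet_criteria(ds):
    """Return the inclusion criteria that the dataset does not meet."""
    problems = []
    if not ds.uris_resolvable:
        problems.append("URIs are not resolvable http/https URIs")
    if ds.rdf_format.lower() not in STANDARD_FORMATS:
        problems.append("not in a standard Semantic Web format")
    if ds.triple_count < MIN_TRIPLES:
        problems.append(f"fewer than {MIN_TRIPLES} triples")
    if ds.outgoing_link_count < MIN_LINKS:
        problems.append(f"fewer than {MIN_LINKS} RDF links to existing LOD datasets")
    if not (set(ds.access_methods) & ACCESS_METHODS):
        problems.append("not accessible via crawling, dump, or SPARQL endpoint")
    return problems
```

An empty result means the dataset passes these entry checks; it says nothing yet about the quality of the data itself.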
What is the problem?
Idea of LOD: publish first, improve later
Result: quality problems in the published datasets
Missing link: data quality evaluation before release
Data quality in the context of LOD
Validators
• General validators
• Parsing and syntax
• Accessibility / dereferenceability
Quality assessment of published data
• Classifying quality problems of LOD
• Using metadata for quality assessment
• Filtering poor-quality data (WIQA)
• Semantic annotation using ontologies
What have others done?
Limitations of related work:
• Syntax validation, not quality evaluation
• Not scalable
• Not fully automated
• Evaluation only after publishing
What is our solution?
Proposing a set of metrics for inherent quality assessment of datasets before interlinking to the LOD cloud
1. Selecting Inherent Quality Dimensions
2. Proposing Metrics
3. Developing a Quality Model
4. Theoretical Validation
5. Empirical Evaluation
6. Quality Prediction
What is our solution?
Studying data quality models
Defining inherent quality of LOD
Selecting the basic model (ISO/IEC 25012)
Mapping quality dimensions of ISO to LOD
1. Selecting Inherent Quality Dimensions
Inherent Quality of LOD
Interlinking
Completeness
Semantic Accuracy
Syntactic Accuracy
Uniqueness
Consistency
Defining metrics using GQM (Goal-Question-Metric)
Implementing an automated tool
Formal definition
2. Proposing Metrics
Example:
Goal: Assessment of the consistency of a dataset in the context of LOD
Question: What is the degree of conflict in the context of data values?
Metric: The number of functional properties with inconsistent values
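The example metric can be sketched in a few lines. This is an illustration under stated assumptions, not the tool described in the slides: triples are plain `(subject, property, object)` tuples, the set of functional properties is supplied by the caller, and any two distinct values for the same subject are counted as a conflict. The function name and the `ex:` identifiers are hypothetical.

```python
from collections import defaultdict


def inconsistent_functional_properties(triples, functional_properties):
    """Count functional properties that have more than one distinct value
    for at least one subject.

    triples: iterable of (subject, property, object) tuples.
    functional_properties: set of property names declared functional
    (an assumed input; the slides do not say how they are identified).
    """
    values = defaultdict(set)  # (subject, property) -> distinct objects
    for s, p, o in triples:
        if p in functional_properties:
            values[(s, p)].add(o)
    # A functional property is inconsistent if any subject has 2+ values.
    conflicting = {p for (_, p), objs in values.items() if len(objs) > 1}
    return len(conflicting)


# Usage with made-up data: ex:birthYear is functional, ex:knows is not.
example = [
    ("ex:alice", "ex:birthYear", "1980"),
    ("ex:alice", "ex:birthYear", "1981"),  # conflicting value
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:alice", "ex:knows", "ex:carol"),  # fine: not functional
    ("ex:bob", "ex:birthYear", "1975"),
]
count = inconsistent_functional_properties(example, {"ex:birthYear"})
```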
LODQM: Linked Open Data Quality Model
• 6 quality dimensions
• 32 metrics
3. Developing LODQM
Using a theoretical measurement framework
Identifying properties of desirable metrics
Validating metrics
4. Theoretical Validation
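| Metric type | Number of metrics | Null-Value | Non-Negativity | Symmetry | Monotonicity | Disjoint Module Additivity | Merging | Cohesive Modules |
|---|---|---|---|---|---|---|---|---|
| Complexity | 29 | √ | √ | √ | √ | n/a | _ | _ |
| Cohesion | 2 | √ | √ | _ | √ | _ | _ | √ |
| Coupling | 1 | √ | √ | _ | √ | n/a | √ | _ |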
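Some of the properties in the table can be spot-checked on sample data. The sketch below is only an illustration: the validation in the slides is theoretical, whereas this code tests a few examples, and the simplified readings of "null value", "non-negativity", and "monotonicity" used here are assumptions. `distinct_property_count` is a hypothetical stand-in metric, not one of the 32 metrics.

```python
def check_properties(metric, datasets, extras):
    """Spot-check three measurement properties on sample data.

    metric: function mapping a list of triples to a number.
    datasets: sample datasets (lists of triples).
    extras: lists of triples to append when testing monotonicity.
    """
    return {
        # Null value: an empty dataset should measure 0.
        "null_value": metric([]) == 0,
        # Non-negativity: the metric never goes below 0.
        "non_negativity": all(metric(d) >= 0 for d in datasets),
        # Monotonicity: adding triples never decreases the value.
        "monotonicity": all(
            metric(d) <= metric(d + e) for d in datasets for e in extras
        ),
    }


def distinct_property_count(triples):
    """Hypothetical example metric: number of distinct properties used."""
    return len({p for _, p, _ in triples})


sample_datasets = [[("a", "p", "b")], [("a", "p", "b"), ("a", "q", "c")]]
sample_extras = [[("x", "r", "y")], [("a", "p", "z")]]
report = check_properties(distinct_property_count, sample_datasets, sample_extras)
```

Passing such checks does not prove a property; a single failing example, however, is enough to show that a metric violates it.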
5. Empirical Evaluation
5.1 Selecting several real datasets from LOD
5.2 Calculating the metric values for the datasets
5.3 Metrics interdependency study
5.4 Manipulating the quality of the datasets using heuristics
5.5 Comparing the trends of the metrics over two observations
5.6 Collecting experts' subjective perception of the quality dimensions
5.7 Correlation study between metrics and quality dimensions
| Dataset | No. of triples | No. of instances | No. of classes | No. of properties |
|---|---|---|---|---|
| FAO Water Areas | 10,730 | 586 | 31 | 19 |
| Water Economic Zones | 29,193 | 1,074 | 113 | 127 |
| Large Marine Ecosystems | 12,012 | 716 | 21 | 31 |
| Geopolitical Entities | 22,725 | 312 | 88 | 101 |
| ISSCAAP Species Classification | 398,166 | 25,253 | 52 | 93 |
| Species Taxonomic Classification | 319,490 | 11,741 | 33 | 26 |
| Commodities | 56,420 | 2,788 | 10 | 19 |
| Vessels | 4,236 | 240 | 6 | 22 |
Result (metrics interdependency study):
• Three pairs of metrics are correlated: {IFP, Im_DT}, {Im_DT, Sml_Cls}, {Inc_Prp_Vlu, IF}
• The others are independent
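An interdependency study of this kind can be sketched as a pairwise correlation over the metric values measured on each dataset. The slides do not state which coefficient or cut-off was used, so the Pearson coefficient and the `threshold` of 0.8 below are assumptions, as are the function names.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series is treated as uncorrelated here
    return cov / sqrt(var_x * var_y)


def correlated_pairs(metric_values, threshold=0.8):
    """Return (metric_a, metric_b, r) for every pair with |r| >= threshold.

    metric_values: dict mapping a metric name to its values,
    one value per dataset, in the same dataset order for all metrics.
    """
    pairs = []
    for a, b in combinations(sorted(metric_values), 2):
        r = pearson(metric_values[a], metric_values[b])
        if abs(r) >= threshold:
            pairs.append((a, b, round(r, 3)))
    return pairs
```

Pairs that exceed the threshold are candidates for being treated as interdependent; all remaining metrics would be reported as independent.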
Result:
• Only one pair of quality dimensions is correlated: {Interlinking, Syntactic accuracy}
• The others are independent
Applying the PCA method to select the highly correlated metrics
Developing predictive models
Assessing the quality of new datasets using the models
6. Quality Prediction
Result: 20 out of 32 metrics are selected
Using a neural network method: multilayer perceptron (MLP)
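The PCA-based selection step can be illustrated with a small, dependency-free sketch. The slides do not say how the PCA output was turned into a selection, so everything here is an assumption: the code finds the first principal component by power iteration on the covariance matrix and keeps metrics whose absolute loading reaches a hypothetical `min_loading` cut-off. Metrics on very different scales would normally be standardized first, which this sketch omits.

```python
from math import sqrt


def first_component_loadings(rows, iterations=200):
    """Loadings of the first principal component (unit-length vector).

    rows: list of observations (at least two), each a list of metric values.
    Uses power iteration on the sample covariance matrix.
    """
    n, k = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(k)]
    centered = [[r[j] - means[j] for j in range(k)] for r in rows]
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(k)]
        for i in range(k)
    ]
    # Non-uniform start vector, to make an unlucky orthogonal start less likely.
    v = [float(j + 1) for j in range(k)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(k)) for i in range(k)]
        norm = sqrt(sum(x * x for x in w))
        if norm == 0:
            break  # no variance in the data at all
        v = [x / norm for x in w]
    return v


def select_metrics(names, rows, min_loading=0.3):
    """Keep metrics whose absolute loading is at least min_loading."""
    loadings = first_component_loadings(rows)
    return [name for name, l in zip(names, loadings) if abs(l) >= min_loading]
```

The selected metrics would then serve as the inputs of the predictive model; the multilayer perceptron itself is not sketched here.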
| Dataset | No. of triples | No. of instances | Domain |
|---|---|---|---|
| Geonames | 6,590 | 699 | Geography |
| IMDB | 866 | 291 | Movie |
| Anatomy | 6,449 | 6,449 | Anatomy |
| Citeseer | 948,770 | 173,963 | Publication |
| FAO | 248,731 | 28,098 | Food Science |
Conclusion on Metrics
Definable
•Proposed by GQM (32)
•Formally defined (32)
Valid
•Theoretically validated (32)
Practical
•Implemented (32)
Correlated with quality
•Experts (28)
•Correlation study (27)
•PCA (20)
Predictability
•MLP (20)
Thank you for your attention and comments