Transcript
Page 1: Metrics-Driven Approach for LOD  Quality  Assessment

2014-May-07

Page 2: Metrics-Driven Approach for LOD  Quality  Assessment

What is the problem?

What have others done?

What is our solution?

Does it work?

Outline

Page 3: Metrics-Driven Approach for LOD  Quality  Assessment

What is the problem?

• Linked Open Data (LOD):
  ▫ Realizing the Semantic Web by interlinking existing but dispersed data

• Main components of LOD:
  ▫ URIs to identify things
  ▫ RDF to describe data
  ▫ HTTP to access data

Page 4: Metrics-Driven Approach for LOD  Quality  Assessment

Datasets: 295
Triples: over 30,000,000,000 (30 B)
Links: over 500,000,000 (500 M)

What is the problem?

Page 5: Metrics-Driven Approach for LOD  Quality  Assessment

Inclusion Criteria for publishing and interlinking datasets into LOD cloud

• Resolvable http/https URIs

• Presented in one of the standard formats of the Semantic Web (RDF, RDFa, RDF/XML, Turtle, N-Triples)

• Contains at least 1000 triples

• Connected via at least 50 RDF links to the existing datasets of LOD

• Accessible via RDF crawling, RDF dump, or SPARQL endpoint
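The two numeric criteria above (at least 1000 triples, at least 50 RDF links) can be checked mechanically. The following is a minimal sketch, not the tool from the talk. It assumes triples are plain (subject, predicate, object) string tuples, and it approximates "a link to an existing LOD dataset" as any triple whose subject is in the local namespace and whose object is an http/https URI outside it. The names `count_external_links` and `meets_size_and_link_criteria` are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]

MIN_TRIPLES = 1000  # threshold stated in the inclusion criteria
MIN_LINKS = 50      # threshold stated in the inclusion criteria


def is_http_uri(term: str) -> bool:
    """Syntactic check only; the URI is not actually dereferenced."""
    return term.startswith(("http://", "https://"))


def count_external_links(triples: Iterable[Triple], local_prefix: str) -> int:
    """Count triples that point from a local subject to a URI in another namespace.

    Assumption: every such triple counts as an RDF link. A real check would
    also verify that the target belongs to a dataset already in the LOD cloud.
    """
    return sum(
        1
        for s, _p, o in triples
        if s.startswith(local_prefix)
        and is_http_uri(o)
        and not o.startswith(local_prefix)
    )


def meets_size_and_link_criteria(
    triples: Iterable[Triple], local_prefix: str
) -> Dict[str, bool]:
    """Report the two numeric inclusion criteria separately."""
    triple_list: List[Triple] = list(triples)
    return {
        "enough_triples": len(triple_list) >= MIN_TRIPLES,
        "enough_links": count_external_links(triple_list, local_prefix) >= MIN_LINKS,
    }
```

The remaining criteria (resolvable URIs, standard serialization, access method) need network access or a parser and are left out of this sketch.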

Is the dataset ready to publish?

What is the problem?

Page 6: Metrics-Driven Approach for LOD  Quality  Assessment

Idea of the LOD: Publishing first, improving later

Results in: quality problems in the published datasets

Missing link: data quality evaluation before release

What is the problem?

Page 7: Metrics-Driven Approach for LOD  Quality  Assessment

Data quality in the context of LOD

Validators
• General validators
• Parsing and syntax
• Accessibility / dereferenceability

Quality assessment of published data
• Classifying quality problems of LOD
• Using metadata for quality assessment
• Filtering poor-quality data (WIQA)
• Semantic annotation using ontologies

What have others done?

Page 8: Metrics-Driven Approach for LOD  Quality  Assessment

Limitations of related work:

• Syntax validation, not quality evaluation

• Not scalable

• Not fully automated

• Evaluation after publishing

What have others done?

Page 9: Metrics-Driven Approach for LOD  Quality  Assessment

What is our solution?

Proposing a set of metrics for inherent quality assessment of datasets before interlinking them into the LOD cloud

Page 10: Metrics-Driven Approach for LOD  Quality  Assessment

1. Selecting Inherent Quality Dimensions

2. Proposing Metrics

3. Developing a Quality Model

4. Theoretical Validation

5. Empirical Evaluation

6. Quality Prediction

What is our solution?

Page 11: Metrics-Driven Approach for LOD  Quality  Assessment

Studying data quality models

Defining inherent quality of LOD

Selecting the basic model (ISO-25012)

Mapping quality dimensions of ISO to LOD

1. Selecting Inherent Quality Dimensions

Page 12: Metrics-Driven Approach for LOD  Quality  Assessment

Inherent Quality of LOD

Interlinking

Completeness

Semantic Accuracy

Syntactic Accuracy

Uniqueness

Consistency

1. Selecting Inherent Quality Dimensions

Page 13: Metrics-Driven Approach for LOD  Quality  Assessment

Defining metrics using GQM

Implementing an automated tool

Formal definition

2. Proposing Metrics

Example:
Goal: Assessment of the consistency of a dataset in the context of LOD
Question: What is the degree of conflict in the context of data values?
Metric: The number of functional properties with inconsistent values
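The example metric can be sketched in a few lines. This is an illustrative reading, not the formal definition used in LODQM. It assumes triples are (subject, predicate, object) string tuples, that the set of functional properties is supplied by the caller, and that a functional property counts as inconsistent when some subject has more than one distinct value for it. The function name and the `ex:` identifiers are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]


def count_inconsistent_functional_properties(
    triples: Iterable[Triple], functional_properties: Set[str]
) -> int:
    """Count functional properties that carry conflicting values.

    Assumption: two distinct objects for the same (subject, property) pair
    are treated as a conflict. Under OWL semantics this only holds when the
    values are known to be different (for example, distinct literals).
    """
    values: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for s, p, o in triples:
        if p in functional_properties:
            values[(s, p)].add(o)
    # Collect each offending property once, however many subjects are affected.
    inconsistent = {p for (_s, p), objs in values.items() if len(objs) > 1}
    return len(inconsistent)


# Made-up data for illustration only.
sample = [
    ("ex:alice", "ex:birthDate", "1980-01-01"),
    ("ex:alice", "ex:birthDate", "1981-05-05"),  # conflicts with the line above
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:alice", "ex:knows", "ex:carol"),        # ex:knows is not functional
    ("ex:bob", "ex:birthDate", "1975-03-03"),
]
print(count_inconsistent_functional_properties(sample, {"ex:birthDate"}))  # prints 1
```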

Page 14: Metrics-Driven Approach for LOD  Quality  Assessment

LODQM: Linked Open Data Quality Model

• 6 quality dimensions
• 32 metrics

3. Developing LODQM

Page 15: Metrics-Driven Approach for LOD  Quality  Assessment

Using Theoretical Measurement Framework

Identifying properties of desirable metrics

Validating metrics

4. Theoretical Validation

Metric Type  Number of metrics  Null-Value  Non-Negativity  Symmetry  Monotonicity  Disjoint Module Additivity  Merging  Cohesive Modules
Complexity   29                 √           √               √         √             n/a                         _        _
Cohesion     2                  √           √               _         √             _                           _        √
Coupling     1                  √           √               _         √             n/a                         √        _

Page 16: Metrics-Driven Approach for LOD  Quality  Assessment

5. Empirical Evaluation

5.1 Selecting several real datasets from LOD
5.2 Calculation of the metrics values for datasets
5.3 Metrics interdependency study
5.4 Manipulating the quality of the datasets
5.5 Comparing the trends of metrics over two observations
5.6 Collecting experts’ subjective perception on quality dimensions
5.7 Correlation study between metrics and quality dimensions

Page 17: Metrics-Driven Approach for LOD  Quality  Assessment

5. Empirical Evaluation

√ Selecting several real datasets from LOD

Datasets                          No. of triples  No. of instances  No. of classes  No. of properties
FAO Water Areas                   10,730          586               31              19
Water Economic Zones              29,193          1,074             113             127
Large Marine Ecosystems           12,012          716               21              31
Geopolitical Entities             22,725          312               88              101
ISSCAAP Species Classification    398,166         25,253            52              93
Species Taxonomic Classification  319,490         11,741            33              26
Commodities                       56,420          2,788             10              19
Vessels                           4,236           240               6               22

Page 18: Metrics-Driven Approach for LOD  Quality  Assessment

5. Empirical Evaluation

√ Calculation of the metrics values for datasets

Page 19: Metrics-Driven Approach for LOD  Quality  Assessment

5. Empirical Evaluation

√ Metrics interdependency study

Result:
• Three pairs of metrics are correlated: {IFP, Im_DT}, {Im_DT, Sml_Cls}, {Inc_Prp_Vlu, IF}
• The others are independent
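An interdependency study of this kind can be sketched as a pairwise correlation check over the metric values measured on each dataset. The slides do not say which correlation coefficient or cut-off was used, so Pearson's r and the 0.8 threshold below are assumptions, and the metric names and values are made up for illustration.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List, Sequence, Tuple


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0
    return cov / (spread_x * spread_y)


def correlated_pairs(
    metric_values: Dict[str, List[float]], threshold: float = 0.8
) -> List[Tuple[str, str, float]]:
    """Return metric pairs whose absolute correlation reaches the threshold.

    metric_values maps a metric name to its values, one value per dataset,
    with datasets in the same order for every metric.
    """
    pairs = []
    for a, b in combinations(sorted(metric_values), 2):
        r = pearson(metric_values[a], metric_values[b])
        if abs(r) >= threshold:
            pairs.append((a, b, r))
    return pairs


# Hypothetical metrics measured on four datasets (not values from the study).
example = {
    "metric_a": [1.0, 2.0, 3.0, 4.0],
    "metric_b": [2.0, 4.0, 6.0, 8.5],
    "metric_c": [5.0, 1.0, 4.0, 2.0],
}
for a, b, r in correlated_pairs(example):
    print(a, b, round(r, 3))
```

Metrics that never appear in a returned pair would be treated as independent under this threshold.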

Page 20: Metrics-Driven Approach for LOD  Quality  Assessment

5. Empirical Evaluation

√ Manipulating the quality of the datasets using heuristics

Page 21: Metrics-Driven Approach for LOD  Quality  Assessment

5. Empirical Evaluation

√ Comparing the trends of metrics over two observations
√ Collecting experts’ subjective perception on quality dimensions

Page 22: Metrics-Driven Approach for LOD  Quality  Assessment

5. Empirical Evaluation

Result:
• Only one pair of quality dimensions is correlated: {Interlinking, Syntactic accuracy}
• The others are independent

Page 23: Metrics-Driven Approach for LOD  Quality  Assessment

Applying the PCA method to select the highly correlated metrics

Developing predictive models

Assessing the quality of new datasets using models

6. Quality Prediction

Result: 20 out of 32 metrics are selected

Using a neural network method: MultiLayerPerceptron

Dataset   No. of triples  No. of instances  Domain
Geonames  6,590           699               Geography
IMDB      866             291               Movie
Anatomy   6,449           6,449             Anatomy
Citeseer  948,770         173,963           Publication
FAO       248,731         28,098            Food Science
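The metric-selection step can be illustrated with a small PCA sketch. The slides do not describe how PCA was applied, so the procedure below is an assumption: metrics are standardized, the first principal component of their correlation matrix is found by power iteration, and the metrics with the largest absolute loadings are kept. All function names and the sample values are hypothetical, and the MLP prediction step is not covered here.

```python
from math import sqrt
from typing import Dict, List


def standardize(columns: List[List[float]]) -> List[List[float]]:
    """Center each metric column and scale it to unit sample variance."""
    result = []
    for col in columns:
        n = len(col)
        mean = sum(col) / n
        sd = sqrt(sum((v - mean) ** 2 for v in col) / (n - 1))
        result.append([(v - mean) / sd if sd else 0.0 for v in col])
    return result


def correlation_matrix(std_cols: List[List[float]]) -> List[List[float]]:
    """Correlation matrix of already standardized columns."""
    n = len(std_cols[0])
    return [
        [sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in std_cols]
        for ci in std_cols
    ]


def first_component(matrix: List[List[float]], iterations: int = 200) -> List[float]:
    """Dominant eigenvector by power iteration.

    Assumption: the all-ones start vector is not orthogonal to the dominant
    eigenvector, which holds for the mostly positive correlations used here.
    """
    vec = [1.0] * len(matrix)
    for _ in range(iterations):
        nxt = [sum(m * v for m, v in zip(row, vec)) for row in matrix]
        norm = sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    return vec


def select_metrics(metric_values: Dict[str, List[float]], keep: int) -> List[str]:
    """Keep the metrics with the largest absolute loading on the first component."""
    names = sorted(metric_values)
    std_cols = standardize([metric_values[name] for name in names])
    loadings = first_component(correlation_matrix(std_cols))
    ranked = sorted(zip(names, loadings), key=lambda item: abs(item[1]), reverse=True)
    return [name for name, _ in ranked[:keep]]


# Hypothetical metric values on five datasets (not values from the study).
example = {
    "m1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "m2": [2.0, 4.0, 6.0, 8.0, 11.0],
    "m3": [3.0, 1.0, 4.0, 1.0, 5.0],
}
print(select_metrics(example, keep=2))
```

A full analysis would usually look at several components and their explained variance; a single component keeps the sketch short.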

Page 24: Metrics-Driven Approach for LOD  Quality  Assessment

6. Quality Prediction

Page 25: Metrics-Driven Approach for LOD  Quality  Assessment

Conclusion on Metrics

Definable
• Proposed by GQM (32)
• Formally defined (32)

Valid
• Theoretically validated (32)

Practical
• Implemented (32)

Correlated with quality
• Experts (28)
• Correlation study (27)
• PCA (20)

Predictability
• MLP (20)

Page 26: Metrics-Driven Approach for LOD  Quality  Assessment

Thank you for your attention and comments

