Identifying Wrong Links between Datasets by Multi-dimensional Outlier Detection

05/26/14 Heiko Paulheim 1 Identifying Wrong Links between Datasets by Multi-dimensional Outlier Detection Heiko Paulheim

Upload: heiko-paulheim

Post on 14-Jun-2015

DESCRIPTION

Links between datasets are an essential ingredient of Linked Open Data. Since the manual creation of links is expensive at large scale, link sets are often created using heuristics, which may lead to errors. In this paper, we propose an unsupervised approach for finding erroneous links. We represent each link as a feature vector in a higher-dimensional vector space, and find wrong links by means of different multi-dimensional outlier detection methods. We show how the approach can be implemented in the RapidMiner platform using only off-the-shelf components, and present a first evaluation with real-world datasets from the Linked Open Data cloud showing promising results, with an F-measure of up to 0.54 and an area under the ROC curve of up to 0.86.

TRANSCRIPT

Page 1: Identifying Wrong Links between Datasets by Multi-dimensional Outlier Detection

05/26/14 Heiko Paulheim 1

Identifying Wrong Links between Datasets by Multi-dimensional Outlier Detection

Heiko Paulheim

Page 2: Identifying Wrong Links between Datasets by Multi-dimensional Outlier Detection

Motivation

• Dataset interlinks can be wrong for many reasons

– Oversimplified heuristic generation (e.g., label equality)

– owl:sameAs abuse (a Starbucks coffee shop ↔ Starbucks Inc.)

– Concept drift of link targets

• e.g., dbpedia:Prong used to denote a band until DBpedia 3.1

• now it's a disambiguation page

[Figure: two dated screenshots (04/08/08 and 12/04/07)]

<http://dbtune.org/bbc/peel/artist/1495> owl:sameAs <http://dbpedia.org/resource/Prong> .

Page 3: Identifying Wrong Links between Datasets by Multi-dimensional Outlier Detection

Overall Idea

• Links between datasets follow certain patterns

– e.g., linking a mo:MusicArtist to a dbo:Artist, and a mo:MusicalWork to a dbo:Album or a dbo:Song

• Wrong links violate those patterns

• Hence, outlier detection should find wrong links

– Definition: “finding patterns in data that do not conform to the expected normal behavior” (Chandola et al., 2009)

• Differences from related approaches

– does not require the same schema used in both datasets

– nor schema mappings

– no external/human knowledge required

Page 4: Identifying Wrong Links between Datasets by Multi-dimensional Outlier Detection

Projection of Links into Vector Space

• Represent each link as a point in an n-dimensional vector space

– e.g., using their direct types

• Outliers are found in sparse areas
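The idea that outliers sit in sparse areas can be illustrated with a very simple distance-based score. The sketch below is not one of the algorithms used in the talk; it is a minimal k-nearest-neighbour score written for illustration, and the function name `knn_outlier_scores` and the toy vectors are assumptions.

```python
import math

def knn_outlier_scores(points, k=2):
    """Score each point by its mean Euclidean distance to its k nearest neighbours.

    Points in dense regions get a low score; points in sparse regions get a high one.
    """
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

# Toy data (hypothetical): five links share one pattern, one link deviates.
links = [(1, 0, 1, 0)] * 5 + [(0, 1, 0, 1)]
scores = knn_outlier_scores(links, k=2)
most_suspicious = max(range(len(links)), key=scores.__getitem__)
```

With this toy data, the deviating link is the only point whose nearest neighbours are far away, so it receives the highest score.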

Page 5: Identifying Wrong Links between Datasets by Multi-dimensional Outlier Detection

Projection of Links into Vector Space

• Types

– each type of LHS and RHS resource becomes a binary (0/1) feature

– types on both sides are treated separately

• i.e., LHS_foaf:person and RHS_foaf:person are distinct features

• Properties

– each ingoing/outgoing property of LHS and RHS resource becomes a binary (0/1) feature

– properties on both sides are treated separately

– ingoing and outgoing properties are treated separately

• i.e., LHS_foaf:based_near, RHS_foaf:based_near, foaf:based_near_LHS and foaf:based_near_RHS are all distinct features

• Joint feature set of types and properties
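The projection described above can be sketched in a few lines of Python. This is an illustrative sketch, not the RapidMiner implementation from the talk: the helper names `link_features` and `to_matrix` are hypothetical, and the mapping of the prefix form (`LHS_p`) to outgoing properties and the suffix form (`p_LHS`) to ingoing properties is an assumption, since the slide only states that the four variants are distinct.

```python
def link_features(lhs_types=(), rhs_types=(),
                  lhs_out=(), lhs_in=(), rhs_out=(), rhs_in=()):
    """Return the set of binary features that are 'on' for one link."""
    feats = set()
    # Types: the same type on the LHS and RHS yields two distinct features.
    feats.update(f"LHS_{t}" for t in lhs_types)
    feats.update(f"RHS_{t}" for t in rhs_types)
    # Properties: side and direction are both encoded in the feature name.
    # Assumption: prefix = outgoing, suffix = ingoing.
    feats.update(f"LHS_{p}" for p in lhs_out)
    feats.update(f"{p}_LHS" for p in lhs_in)
    feats.update(f"RHS_{p}" for p in rhs_out)
    feats.update(f"{p}_RHS" for p in rhs_in)
    return feats

def to_matrix(feature_sets):
    """Turn per-link feature sets into a 0/1 matrix over the joint feature set."""
    columns = sorted(set().union(*feature_sets))
    rows = [[1 if c in fs else 0 for c in columns] for fs in feature_sets]
    return columns, rows

# Two hypothetical links, described by their types and properties.
link_a = link_features(lhs_types=["mo:MusicArtist"], rhs_types=["dbo:Artist"],
                       lhs_out=["foaf:based_near"], rhs_in=["foaf:based_near"])
link_b = link_features(lhs_types=["mo:MusicArtist"], rhs_types=["dbo:Album"])
columns, rows = to_matrix([link_a, link_b])
```

Each row of `rows` is one link as a point in the vector space; the number of columns is the dimensionality of that space.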

Page 6: Identifying Wrong Links between Datasets by Multi-dimensional Outlier Detection

Experiments

• Datasets: link sets between

– BBC Peel Sessions and DBpedia (2,087 links)

– DBTropes and DBpedia (4,229 links)

• Gold standard

– 100 randomly sampled links from each set, manually evaluated

– Peel: 90 out of 100 are correct

– Tropes: 76 out of 100 are correct

• We run outlier detection on the whole link set

– and validate the output only on the gold standard
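The setup of scoring every link but validating only on the manually judged sample could look roughly like the following sketch. The dictionary layout, the link identifiers, and the function name `restrict_to_gold` are assumptions made for illustration.

```python
def restrict_to_gold(scores, gold):
    """Keep only the scored links that also have a manual judgement.

    scores: dict mapping link id -> outlier score (computed on the whole link set)
    gold:   dict mapping link id -> True if the link was judged wrong
    Returns a list of (score, is_wrong) pairs for the gold-standard links.
    """
    return [(scores[link], wrong) for link, wrong in gold.items() if link in scores]

# Hypothetical scores for a whole link set, and a small judged sample of it.
all_scores = {"link1": 0.1, "link2": 0.9, "link3": 0.4, "link4": 0.2}
gold_sample = {"link2": True, "link4": False}
evaluation_pairs = restrict_to_gold(all_scores, gold_sample)
```

The outlier detector still sees all links, so its notion of "normal" is learned from the full link set; only the quality measurement is limited to the sample.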

Page 7: Identifying Wrong Links between Datasets by Multi-dimensional Outlier Detection
Page 8: Identifying Wrong Links between Datasets by Multi-dimensional Outlier Detection

Experiments

• Outlier Detection Approaches

– assign a score (or label) to each data point

– the higher the score, the more likely the point is an outlier

• Evaluation

– Ordering descending by outlier score

– Ideally, all outliers are above all non-outliers

– Plot a ROC curve to measure the quality

• i.e., AUC

– F-Measure

• with best possible threshold
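Both evaluation measures can be computed directly from a list of (score, is_wrong) pairs. The following is a minimal pure-Python sketch under the assumption that wrong links are the positive class; the function names and the toy scores are hypothetical, and the talk's actual evaluation was done in RapidMiner.

```python
def roc_auc(pairs):
    """Area under the ROC curve, computed as the probability that a randomly
    chosen wrong link scores higher than a randomly chosen correct one
    (ties count as one half)."""
    wrong = [s for s, is_wrong in pairs if is_wrong]
    correct = [s for s, is_wrong in pairs if not is_wrong]
    wins = sum((w > c) + 0.5 * (w == c) for w in wrong for c in correct)
    return wins / (len(wrong) * len(correct))

def best_f1(pairs):
    """F-measure at the best possible threshold: try every observed score as
    the cut-off and flag all links scoring at or above it as wrong."""
    total_wrong = sum(1 for _, is_wrong in pairs if is_wrong)
    best = 0.0
    for threshold in sorted({s for s, _ in pairs}):
        flagged = [is_wrong for s, is_wrong in pairs if s >= threshold]
        true_positives = sum(flagged)
        if true_positives == 0:
            continue
        precision = true_positives / len(flagged)
        recall = true_positives / total_wrong
        best = max(best, 2 * precision * recall / (precision + recall))
    return best

# Toy gold standard (hypothetical): (outlier score, link is wrong?)
pairs = [(0.9, True), (0.8, False), (0.7, True), (0.2, False), (0.1, False)]
auc = roc_auc(pairs)
f1 = best_f1(pairs)
```

An AUC of 1.0 corresponds to the ideal ordering in which all wrong links rank above all correct ones, while 0.5 corresponds to a random ordering.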

Page 9: Identifying Wrong Links between Datasets by Multi-dimensional Outlier Detection
Page 10: Identifying Wrong Links between Datasets by Multi-dimensional Outlier Detection

Results

• Type features work better than property features

• LoOP delivers reliably good results

– though not the best

• Best performance on Peel dataset

– CBLOF (F1 = 0.537), 1-class SVM (AUC = 0.857)

• Best performance on DBTropes dataset

– LOF (F1 = 0.5, AUC = 0.619)

Page 11: Identifying Wrong Links between Datasets by Multi-dimensional Outlier Detection

Results

• ROC curves for Peel dataset

[Figure: ROC curves, both axes from 0 to 1, for GAS (k=10, 25, 50), LOF, LoOP (k=10, 25, 50), CBLOF, LDCOF, and 1-class SVM]

Note: GAS k=10,25,50 identical, LoOP k=25,50 identical

Page 12: Identifying Wrong Links between Datasets by Multi-dimensional Outlier Detection

Results

• ROC curves for DBTropes dataset

[Figure: ROC curves, both axes from 0 to 1, for GAS (k=10, 25, 50), LOF, LoOP (k=10, 25, 50), CBLOF, LDCOF, and 1-class SVM]

Note: GAS k=25,50 mostly identical; LoOP k=25,50 identical; CBLOF and LDCOF mostly identical

Page 13: Identifying Wrong Links between Datasets by Multi-dimensional Outlier Detection

Runtimes

• Most outlier detection algorithms are reasonably fast

– both linksets processed in less than 10 seconds on a normal laptop

• Exceptions:

– clustering (for CBLOF/LDCOF) takes up to 30 seconds

– 1-class SVM takes up to 15 minutes

• ...but creating the feature vector representation takes much more time

– some hours against public SPARQL endpoint(s)

– reasonably fast with downloaded dumps

Page 14: Identifying Wrong Links between Datasets by Multi-dimensional Outlier Detection

Discussion of Results

• Results on Peel dataset better than on DBTropes dataset

• Projection based on types better than on properties

– most likely due to lower dimensionality of vector space

• Peel: #types = 34, #properties = 60

• DBTropes: #types = 81, #properties = 142

• Variation of outlier detection algorithms across datasets

– also observed in other experiments

– general rules of thumb are hard to come up with

Page 15: Identifying Wrong Links between Datasets by Multi-dimensional Outlier Detection

Possible Improvements & Future Work

• Other projection methods

– e.g., using numeric counts of relations

• Other outlier detection algorithms

– e.g., Replicator Neural Networks and their generalizations

• Preprocessing

– e.g., Feature Subset Selection

– caveat: the valuable features are often sparse

Page 16: Identifying Wrong Links between Datasets by Multi-dimensional Outlier Detection

Possible Improvements & Future Work

• So far, we have looked at owl:sameAs links

• The approach is not limited to that

– should work for other link predicates as well

– e.g., a dataset of persons and a dataset of places

– linked by foaf:based_near

• It is not even limited to linksets

– also for debugging statements inside a knowledge base

– e.g., dbpedia-owl:deathPlace

Page 17: Identifying Wrong Links between Datasets by Multi-dimensional Outlier Detection

Identifying Wrong Links between Datasets by Multi-dimensional Outlier Detection

Heiko Paulheim