identifying wrong links between datasets by multi-dimensional outlier detection

05/26/14 Heiko Paulheim 1

Identifying Wrong Links between Datasets by Multi-dimensional Outlier Detection

Heiko Paulheim


Motivation

• Dataset interlinks can be wrong for many reasons

– Oversimplified heuristic generation (e.g., label equality)

– owl:sameAs abuse (a Starbucks coffee shop ↔ Starbucks Inc.)

– Concept drift of link targets

• e.g., dbpedia:Prong used to denote a band until DBpedia 3.1

• now it's a disambiguation page


<http://dbtune.org/bbc/peel/artist/1495> owl:sameAs <http://dbpedia.org/resource/Prong> .


Overall Idea

• Links between datasets follow certain patterns

– e.g., linking a mo:MusicArtist to a dbo:Artist, and a mo:MusicalWork to a dbo:Album or a dbo:Song

• Wrong links violate those patterns

• Hence, outlier detection should find wrong links

– Definition: “finding patterns in data that do not conform to the expected normal behavior” (Chandola et al., 2009)

• Differences from related approaches

– does not require both datasets to use the same schema

– nor schema mappings

– no external/human knowledge required


Projection of Links into Vector Space

• Represent each link as a point in an n-dimensional vector space

– e.g., using their direct types

• Outliers are found in sparse areas
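
To make "outliers are found in sparse areas" concrete, here is a minimal sketch of a distance-based outlier score: each point is scored by its mean distance to its k nearest neighbors, so isolated points score highest. This is an illustrative stand-in, not necessarily one of the algorithms evaluated in the talk; the function name `knn_outlier_scores` and the toy vectors are assumptions made for this example.

```python
import math


def knn_outlier_scores(points, k=2):
    """Score each point by its mean distance to its k nearest neighbors.

    Points lying in sparse regions of the space receive higher scores.
    """
    if len(points) < 2:
        raise ValueError("need at least two points")
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        nearest = dists[:k]
        scores.append(sum(nearest) / len(nearest))
    return scores


# Hypothetical binary link vectors: five links share one pattern, one deviates.
links = [
    (1, 0, 1, 0),
    (1, 0, 1, 0),
    (1, 0, 1, 0),
    (1, 0, 1, 0),
    (1, 0, 1, 0),
    (0, 1, 0, 1),
]
scores = knn_outlier_scores(links, k=2)
most_suspicious = max(range(len(links)), key=scores.__getitem__)
```

In this toy data the deviating link is the only one far from its neighbors, so it is the one ranked first for inspection.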


Projection of Links into Vector Space

• Types

– each type of LHS and RHS resource becomes a binary (0/1) feature

– types on both sides are treated separately

• i.e., LHS_foaf:person and RHS_foaf:person are distinct features

• Properties

– each ingoing/outgoing property of LHS and RHS resource becomes a binary (0/1) feature

– properties on both sides are treated separately

– ingoing and outgoing properties are treated separately

• i.e., LHS_foaf:based_near, RHS_foaf:based_near, foaf:based_near_LHS, and foaf:based_near_RHS are all distinct features

• Joint feature set of types and properties
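
The projection described above can be sketched as follows. The input layout (dicts with `types`, `out`, and `in` keys), the function names, and the sample resources are assumptions made for illustration. The slides list the four property feature names but do not say which naming denotes ingoing and which outgoing; this sketch assumes a side prefix for outgoing and a side suffix for ingoing properties.

```python
def link_features(lhs, rhs, use_types=True, use_properties=True):
    """Return the set of active (value 1) binary features for one link.

    lhs and rhs describe the two linked resources as dicts with the
    keys 'types', 'out' and 'in' (a hypothetical layout).
    """
    feats = set()
    for side, res in (("LHS", lhs), ("RHS", rhs)):
        if use_types:
            feats.update(f"{side}_{t}" for t in res.get("types", ()))
        if use_properties:
            # Assumption: outgoing properties carry the side as a prefix,
            # ingoing properties carry it as a suffix.
            feats.update(f"{side}_{p}" for p in res.get("out", ()))
            feats.update(f"{p}_{side}" for p in res.get("in", ()))
    return feats


def to_matrix(links):
    """Turn (lhs, rhs) pairs into a sorted feature vocabulary and 0/1 rows."""
    feature_sets = [link_features(lhs, rhs) for lhs, rhs in links]
    vocabulary = sorted(set().union(*feature_sets))
    rows = [[1 if f in fs else 0 for f in vocabulary] for fs in feature_sets]
    return vocabulary, rows


# Hypothetical resources, loosely modeled on the mo:/dbo: example above.
links = [
    ({"types": ["mo:MusicArtist"], "out": ["foaf:name"], "in": []},
     {"types": ["dbo:Artist"], "out": ["dbo:genre"], "in": ["dbo:artist"]}),
    ({"types": ["mo:MusicArtist"], "out": ["foaf:name"], "in": []},
     {"types": ["dbo:Album"], "out": ["dbo:genre"], "in": []}),
]
vocabulary, rows = to_matrix(links)
```

Passing `use_properties=False` or `use_types=False` to `link_features` yields the type-only and property-only feature sets; the default corresponds to the joint feature set.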


Experiments

• Datasets: link sets between

– BBC Peel Sessions and DBpedia (2,087 links)

– DBTropes and DBpedia (4,229 links)

• Gold standard

– 100 randomly sampled links from each set, manually evaluated

– Peel: 90 out of 100 are correct

– Tropes: 76 out of 100 are correct

• We run outlier detection on the whole link set

– and validate the output only on the gold standard


Experiments

• Outlier Detection Approaches

– assign a score (or label) to each data point

– the higher the score, the more likely the point is an outlier

• Evaluation

– Ordering descending by outlier score

– Ideally, all outliers are above all non-outliers

– Plot a ROC curve to measure the quality

• i.e., AUC

– F-Measure

• with best possible threshold
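
The two evaluation measures can be sketched in a few lines, treating the wrong links in the gold standard as the positive (outlier) class. The function names and toy scores below are assumptions for illustration, not the evaluation code used in the talk. AUC is computed in its rank form: the probability that a randomly chosen outlier receives a higher score than a randomly chosen non-outlier.

```python
def roc_auc(scores, is_outlier):
    """Area under the ROC curve, computed pairwise; ties count as 0.5."""
    pos = [s for s, y in zip(scores, is_outlier) if y]
    neg = [s for s, y in zip(scores, is_outlier) if not y]
    if not pos or not neg:
        raise ValueError("need at least one outlier and one non-outlier")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def best_f1(scores, is_outlier):
    """Maximum F1 over all thresholds; score >= threshold means 'flagged'."""
    total_pos = sum(1 for y in is_outlier if y)
    best = 0.0
    for threshold in set(scores):
        flagged = [y for s, y in zip(scores, is_outlier) if s >= threshold]
        tp = sum(1 for y in flagged if y)
        if tp == 0:
            continue  # precision and recall are both zero here
        precision = tp / len(flagged)
        recall = tp / total_pos
        best = max(best, 2 * precision * recall / (precision + recall))
    return best


# Hypothetical outlier scores and gold-standard labels (True = wrong link).
scores = [0.9, 0.8, 0.3, 0.2, 0.1]
labels = [True, False, True, False, False]
auc = roc_auc(scores, labels)
f1 = best_f1(scores, labels)
```

Because the threshold is chosen with knowledge of the labels, the resulting F1 is an upper bound on what a fixed threshold would achieve in practice.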


Results

• Type features work better than property features

• LoOP delivers reliably good results

– though not the best

• Best performance on Peel dataset

– CBLOF (F1 = 0.537), 1-class SVM (AUC = 0.857)

• Best performance on DBTropes dataset

– LOF (F1 = 0.5, AUC = 0.619)


Results

• ROC curves for Peel dataset

[Figure: ROC curves for GAS (k=10, 25, 50), LOF, LoOP (k=10, 25, 50), CBLOF, LDCOF, and 1-class SVM; both axes range from 0 to 1]

Note: GAS k=10,25,50 identical, LoOP k=25,50 identical


Results

• ROC curves for DBTropes dataset

[Figure: ROC curves for GAS (k=10, 25, 50), LOF, LoOP (k=10, 25, 50), CBLOF, LDCOF, and 1-class SVM; both axes range from 0 to 1]

Note: GAS k=25,50 mostly identical; LoOP k=25,50 identical; CBLOF and LDCOF mostly identical


Runtimes

• Most outlier detection algorithms are reasonably fast

– both linksets processed in less than 10 seconds on a normal laptop

• Exceptions:

– clustering (for CBLOF/LDCOF) takes up to 30 seconds

– 1-class SVM takes up to 15 minutes

• ...but creating the feature vector representation takes much more time

– some hours against public SPARQL endpoint(s)

– reasonably fast with downloaded dumps


Discussion of Results

• Results on Peel dataset better than on DBTropes dataset

• Projection based on types better than on properties

– most likely due to lower dimensionality of vector space

• Peel: #types = 34, #properties = 60

• DBTropes: #types = 81, #properties = 142

• Variation of outlier detection algorithms across datasets

– also observed in other experiments

– general rules of thumb are hard to come up with


Possible Improvements & Future Work

• Other projection methods

– e.g., using numeric counts of relations

• Other outlier detection algorithms

– e.g., Replicating Neural Networks and their generalizations

• Preprocessing

– e.g., Feature Subset Selection

– caveat: the valuable features are often sparse
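
As one possible reading of the "numeric counts of relations" idea, the binary property features could be replaced by occurrence counts. The sketch below is an assumption about how such a projection might look; the function name, the feature naming (side prefix for outgoing, side suffix for ingoing), and the sample properties are hypothetical.

```python
from collections import Counter


def count_features(side, outgoing, ingoing):
    """Count how often each property occurs on one side of a link.

    Unlike a 0/1 feature, a resource with many foaf:made statements is
    distinguished from a resource with only one.
    """
    counts = Counter()
    for prop in outgoing:
        counts[f"{side}_{prop}"] += 1
    for prop in ingoing:
        counts[f"{prop}_{side}"] += 1
    return counts


# Hypothetical property occurrences for the left-hand resource of a link.
lhs_counts = count_features(
    "LHS",
    outgoing=["foaf:made", "foaf:made", "foaf:name"],
    ingoing=["foaf:maker"],
)
```

Count features have very different value ranges, so distance-based outlier detectors would likely need some form of normalization before using them.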


Possible Improvements & Future Work

• So far, we have looked at owl:sameAs links

• The approach is not limited to that

– should work for other link predicates as well

– e.g., a dataset of persons and a dataset of places

– linked by foaf:based_near

• It is not even limited to linksets

– also for debugging statements inside a knowledge base

– e.g., dbpedia-owl:deathPlace

