fusing semantic data
TRANSCRIPT
Fusing automatically extracted annotations for the Semantic Web
Andriy Nikolov
Knowledge Media Institute, The Open University
2
Outline
• Motivation
• Handling fusion subtasks
  – problem-solving method approach
• Processing inconsistencies
  – applying uncertainty reasoning
• Overcoming schema heterogeneity
  – Linked Data scenario
3
Database scenario
• Classical scenario (database domain)
  – Merging information from datasets containing partially overlapping information

Name          | Year of birth | E-mail            | Address
H. Schmidt    | 1972          | [email protected] | …
J. Smith      | 1983          | [email protected] | …

Name          | Year of birth | E-Mail            | Job position
Wen, Zhao     | 1980          | [email protected] | …
Schmidt, Hans | 1973          | [email protected] | …
4
Database scenario
• Coreference resolution (record linkage)
  – Resolving ambiguous identities

Name          | Year of birth | E-mail            | Address
H. Schmidt    | 1972          | [email protected] | …
J. Smith      | 1983          | [email protected] | …

Name          | Year of birth | E-Mail            | Job position
Wen, Zhao     | 1980          | [email protected] | …
Schmidt, Hans | 1973          | [email protected] | …
5
Database scenario
• Inconsistency resolution
  – Handling contradictory pieces of data

Name          | Year of birth | E-mail            | Address
H. Schmidt    | 1972          | [email protected] | …
J. Smith      | 1983          | [email protected] | …

Name          | Year of birth | E-Mail            | Job position
Wen, Zhao     | 1980          | [email protected] | …
Schmidt, Hans | 1973          | [email protected] | …
6
Semantic data scenario
• Database domain:
  – A record belongs to a single table
  – Table structure defines relevant attributes
  – Inconsistency of values
• Semantic data:
  – Classes are organised into hierarchies
  – One individual may belong to several classes
  – Available properties depend on the level of granularity
  – Other types of inconsistencies are possible, e.g., class disjointness
[Diagram: sweto:Person, a subclass of foaf:Person, with datatype properties foaf:name and foaf:mbox (xsd:string) and object properties sweto:lives_in (sweto:Place) and sweto:affiliated_with (sweto:Organization); sweto:Researcher adds some:has_degree (xsd:string)]
7
Motivating scenario – X-Media
[Diagram: text, images, and other data from internal corporate reports (Intranet) and pre-defined public sources (WWW) pass through annotation to produce RDF; KnoFuss fuses the RDF annotations into a knowledge base, guided by a domain ontology]
8
Outline
• Motivation
• Handling fusion subtasks
  – problem-solving method approach
• Processing inconsistencies
  – applying uncertainty reasoning
• Overcoming schema heterogeneity
  – Linked Data scenario
9
Handling fusion subtasks
• For each subtask, several available methods exist
• Example: coreference resolution
  – Aggregated attribute similarity [Fellegi and Sunter 1969]
  – String similarity: Levenshtein, Jaro, Jaro-Winkler
  – Machine learning: clustering, classification
  – Rule-based
10
Handling fusion subtasks
• All methods have their pros and cons
  – Rule-based: high precision, but restricted to a specific domain
  – Machine learning: requires sufficient training data
  – String similarity: lower precision; still needs configuration (e.g., distance metric, threshold, set of attributes to include)
• Trade-off between the quality of results and applicability range: better precision requires more domain-specific knowledge
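As a concrete illustration of the string-similarity family listed above, here is a minimal sketch of Levenshtein edit distance with a normalised similarity score; the threshold and attribute selection mentioned on the slide would sit on top of a function like this (the function names are illustrative, not from KnoFuss):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Normalised similarity in [0, 1]: 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))
```

A matcher would then declare two attribute values coreferent when `similarity` exceeds a configured threshold.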
11
Problem-solving method approach
• The fusion task is decomposed into subtasks
• Algorithms are defined as methods solving a particular task
• Each method is formally described using the fusion ontology:
  – Task handled by the method
  – Applicability criteria
  – Domain knowledge required
  – Reliability of output
• Methods are selected based on their capabilities
12
KnoFuss architecture
[Diagram: new data enters KnoFuss, which applies a CoreferenceResolutionMethod, ConflictDetectionMethod, and ConflictResolutionMethod drawn from the method library; intermediate data is kept in a fusion KB before the main KB is updated]
• Method library
  – Contains an implementation of each technique for specific subtasks (problem-solving method [Motta 1999])
• Fusion ontology
  – Describes method capabilities
  – Defines intermediate structures (mappings, conflict sets, etc.)
13
Task decomposition
[Diagram: knowledge fusion decomposes into coreference resolution (model configuration, link discovery) and knowledge base updating (dependency identification, dependency resolution); the source KB and target KB are combined into a fused target KB]
14
Method selection
• Example: an adaptive learning matcher with three application contexts
  – rdf:type owl:Thing, any datatypeProperty ?x: reliability = 0.4
  – rdf:type sweto:Publication, rdfs:label ?x, sweto:year ?y: reliability = 0.8
  – rdf:type sweto:Article, rdfs:label ?x, sweto:year ?y, sweto:journal ?z, sweto:volume ?a: reliability = 0.9
• Selection depends on:
  – Range of applicability
  – Reliability
• Configuration parameters
  – Generic (for individuals of unknown types)
  – Context-dependent
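The selection logic on this slide can be sketched as follows: among the methods whose application context covers the instance's class, pick the one with the highest declared reliability. The method registry and toy class hierarchy below are illustrative stand-ins for the fusion ontology, not KnoFuss code:

```python
# Hypothetical method descriptors mirroring the fusion-ontology idea:
# each entry names the context class it applies to and its reliability.
METHODS = [
    {"context": "owl:Thing",         "reliability": 0.4},
    {"context": "sweto:Publication", "reliability": 0.8},
    {"context": "sweto:Article",     "reliability": 0.9},
]

# Toy class hierarchy: class -> set of its superclasses (including itself).
SUPERCLASSES = {
    "sweto:Article":     {"sweto:Article", "sweto:Publication", "owl:Thing"},
    "sweto:Publication": {"sweto:Publication", "owl:Thing"},
    "owl:Thing":         {"owl:Thing"},
}

def select_method(instance_class: str) -> dict:
    """Pick the applicable method with the highest declared reliability."""
    applicable = [m for m in METHODS
                  if m["context"] in SUPERCLASSES[instance_class]]
    return max(applicable, key=lambda m: m["reliability"])
```

For an individual known to be a sweto:Article, the most specific (0.9) matcher wins; for an individual of unknown type, only the generic owl:Thing matcher applies.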
15
Using class hierarchy
• Configuring machine-learning methods:
  – Using training instances for a subclass to learn generic models for superclasses
[Diagram: class hierarchy owl:Thing → foaf:Person, foaf:Document → sweto:Publication → sweto:Article, sweto:Article_in_Proceedings, with attributes name, label, year, volume, journal_name, book_title; training instances Ind1-Ind3 described by {label, year, book_title} at the subclass level are projected to {label, year} for sweto:Publication and to {label} for owl:Thing]
16
Using class hierarchy
• Configuring machine-learning methods:
  – Combining training instances for subclasses to learn a generic model for a superclass
[Diagram: sweto:Article instances Ind4-Ind6 {label, year, journal_name, volume} and sweto:Article_in_Proceedings instances Ind1-Ind3 {label, year, book_title} are projected to the shared attributes {label, year} and pooled as training data Ind1-Ind6 for sweto:Publication]
17
Outline
• Motivation
• Handling fusion subtasks
  – problem-solving method approach
• Processing inconsistencies
  – applying uncertainty reasoning
• Overcoming schema heterogeneity
  – Linked Data scenario
18
Data quality problems
• Causes of inconsistency
  – Data errors
    • Obsolete data
    • Mistakes of manual annotators
    • Errors of information extraction algorithms
  – Coreference resolution errors
    • Automatic methods are not 100% reliable
• Applying uncertainty reasoning
  – Estimated reliability of separate pieces of data
  – Domain knowledge defined in the ontology
19
Refining fused data
• Additional evidence:
  – Ontological schema restrictions
    • Disjointness
    • Cardinality
    • …
  – Neighborhood graph
    • Mappings between related entities
  – Provenance
    • Uncertainty of candidate mappings
    • Uncertainty of data statements
    • “Cleanness” of data sources
20
Dempster-Shafer theory of evidence
• Bayesian probability theory: assigning probabilities to atomic alternatives
  – p(true) = 0.6 ⇒ p(false) = 0.4
  – Sometimes hard to assign
  – Negative bias: extraction uncertainty below 0.5 expresses negative evidence rather than insufficient evidence
• Dempster-Shafer theory: assigning confidence degrees (masses) to sets of alternatives
  – m({true}) = 0.6
  – m({false}) = 0.1
  – m({true; false}) = 0.3
  – Support and plausibility bound the probability: support ≤ probability ≤ plausibility
21
Dependency detection
• Identifying and localizing conflicts
  – Using formal diagnosis [Reiter 1987] in combination with standard ontological reasoning
[Diagram: Paper_10 has rdf:type links to both Article and Proceedings, which are declared owl:disjointWith each other; the owl:FunctionalProperty hasYear has two values, 2006 and 2007; hasAuthor links to E. Motta and V.S. Uren]
22
Belief networks (cont)
• Valuation networks [Shenoy and Shafer 1990]
• Network nodes correspond to OWL axioms
  – Variable nodes
    • ABox statements (I ∈ X, R(I1, I2))
    • One variable: the statement itself
  – Valuation nodes
    • TBox axioms (e.g., X ⊔ Y)
    • Mass distribution over several variables (I ∈ X, I ∈ Y, I ∈ X ⊔ Y)
23
Belief networks (cont)
• Belief network construction
  – Using translation rules
  – Rule antecedents:
    • Existence of specific OWL axioms (one rule per OWL construct)
    • Existence of network nodes
  – Example rules:
    • Explicit ABox statements: IF I ∈ X THEN CREATE N1(I ∈ X)
    • TBox inferencing: IF Trans(R) AND EXIST N1(R(I1, I2)) AND EXIST N2(R(I2, I3)) THEN CREATE N3(Trans(R)) AND CREATE N4(R(I1, I3))
24
Example
[Diagram: #Paper_10 with rdf:type links to both Article and Proceedings, which are owl:disjointWith each other]
25
Example
[Diagram: #Paper_10 with rdf:type links to Article and Proceedings (owl:disjointWith); variable nodes are created for the statements #Paper_10 ∈ Article and #Paper_10 ∈ Proceedings]
26
Example
[Diagram: #Paper_10 with rdf:type links to Article and Proceedings (owl:disjointWith); a valuation node for the TBox axiom Article ⊑ ¬Proceedings is added, connected to the variable nodes #Paper_10 ∈ Article and #Paper_10 ∈ Proceedings]
27
Example
• Variable node #Paper_10 ∈ Article: m(true) = 0.8, m(false) = 0, m({true; false}) = 0.2
• Variable node #Paper_10 ∈ Proceedings: m(true) = 0.6, m(false) = 0, m({true; false}) = 0.4
• Valuation node Article ⊑ ¬Proceedings, over the pairs (#Paper_10 ∈ Article, #Paper_10 ∈ Proceedings):
  – m = 1.0 on the consistent set {(false, false), (false, true), (true, false)}
  – m = 0.0 on {(true, true)}
28
Example
• Combining the variable nodes #Paper_10 ∈ Article (m(true) = 0.8, m({true; false}) = 0.2) and #Paper_10 ∈ Proceedings (m(true) = 0.6, m({true; false}) = 0.4) with the valuation node Article ⊑ ¬Proceedings, using Dempster’s rule over the pairs (#Paper_10 ∈ Article, #Paper_10 ∈ Proceedings):
  – m({(false, false), (false, true), (true, false)}) = 0.15
  – m({(false, true)}) = 0.23
  – m({(true, false)}) = 0.62
29
Example
• Marginalizing the combined distribution back to the variable nodes:
  – #Paper_10 ∈ Article: m(true) = 0.62, m(false) = 0.23, m({true; false}) = 0.15
  – #Paper_10 ∈ Proceedings: m(true) = 0.23, m(false) = 0.62, m({true; false}) = 0.15
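The combination in this example can be reproduced with a short sketch of Dempster's rule over the joint frame of the two membership statements (the code below is an illustration of the calculation, not KnoFuss itself):

```python
from itertools import product

# Joint frame: (Paper_10 in Article, Paper_10 in Proceedings).
FRAME = frozenset(product([False, True], repeat=2))

def combine(m1: dict, m2: dict) -> dict:
    """Dempster's rule: intersect focal sets, renormalise away conflict."""
    out, conflict = {}, 0.0
    for s1, v1 in m1.items():
        for s2, v2 in m2.items():
            inter = s1 & s2
            if inter:
                out[inter] = out.get(inter, 0.0) + v1 * v2
            else:
                conflict += v1 * v2
    k = 1.0 - conflict
    return {s: v / k for s, v in out.items()}

# Evidence for "Paper_10 in Article": m(true)=0.8, rest on the whole frame.
m_article = {frozenset(s for s in FRAME if s[0]): 0.8, FRAME: 0.2}
# Evidence for "Paper_10 in Proceedings": m(true)=0.6.
m_proc = {frozenset(s for s in FRAME if s[1]): 0.6, FRAME: 0.4}
# Disjointness axiom: all mass on states where not both memberships hold.
m_disjoint = {frozenset(s for s in FRAME if not (s[0] and s[1])): 1.0}

result = combine(combine(m_article, m_proc), m_disjoint)
```

The mass 0.8 × 0.6 = 0.48 assigned to (true, true) conflicts with the disjointness axiom and is normalised away, yielding 0.62 for (true, false), 0.23 for (false, true), and 0.15 for the remaining consistent set, as on the slides.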
30
Belief propagation
• Translating a subontology into a belief network
  – Using provenance and confidence values of data statements
  – Coreferencing algorithm precision for owl:sameAs mappings
• Data refinement:
  – Detecting spurious mappings
  – Removing unreliable data statements
[Diagram: belief network with valuation nodes Article ⊑ ¬in_Proc and Functional(year) linked to variable nodes Article(Ind1), in_Proc(Ind1), in_Proc(Ind2), Ind1 = Ind2, year(Ind1, 2007), year(Ind2, 2007), year(Ind1, 2006); each variable carries a (support; plausibility) interval before/after propagation: (0.99; 1.0)/(0.97; 0.98), (0.9; 1.0)/(0.74; 0.82), (0.92; 1.0)/(0.2; 0.21), (0.85; 1.0)/(0.72; 0.85), (0.95; 1.0)/(0.91; 0.96)]
31
Neighbourhood graph
• Non-functional relations have varying impact
[Diagram 1: Paper_10 –hasAuthor→ “H. Schmidt” and Paper_11 –hasAuthor→ “Schmidt, Hans”, both papers of rdf:type Proceedings; owl:sameAs(Paper_10, Paper_11) with confidence 0.9 supports owl:sameAs between the Person instances with confidence 0.3]
[Diagram 2: “H. Schmidt” –citizen_of→ Germany and “Schmidt, Hans” –citizen_of→ Germany, both persons of rdf:type Person; owl:sameAs between the Country instances holds with confidence 1.0, yet supports owl:sameAs between the Person instances only with confidence 0.3]
32
Neighborhood graph
• Implicit relations: set co-membership
[Diagram: co-authorship evidence. Person11 = Person12 together with Coauthor(Person12, Person22) supports Coauthor(Person11, Person22), which in turn supports Person21 = Person22; e.g., the mapping “Bard, J.B.L.” = “Jonathan Bard” with 0.84/(0.86; 1.0) reinforces “Webber, B.L.” = “Bonnie L. Webber” with 0.16/(0.83; 1.0) via co-authorship links of 1.0/(1.0; 1.0)]
33
Provenance
• Initial belief assignments:
  – Data statements (source AND/OR extractor confidence)
  – Candidate mappings (precision of attribute similarity algorithms)
  – Source “cleanness”: whether it contains duplicates or not
[Diagram: “Arlington” has candidate mappings Arlington = Arl_Va (Arlington, Virginia) with 0.95 and Arlington = Arl_Tx (Arlington, Texas) with 0.9; the source containing Arl_Va and Arl_Tx is clean, so Arl_Va ≠ Arl_Tx holds with 1.0/(1.0; 1.0); after propagation the intervals become (0.65; 0.69) and (0.31; 0.35)]
34
Experiments
• Datasets:
  – Publication:
    • AKT
    • Rexa
    • SWETO-DBLP
  – Cora:
    • database community benchmark
    • translated into RDF
    • 2 versions used: different structure, different gold standard
35
Experiments
Publication individuals (left triple: matcher alone; right triple: with refinement):

Dataset   | No | Matcher         | Prec. | Recall | F1    | Prec. | Recall | F1
AKT/Rexa  | 1  | Jaro-Winkler    | 0.950 | 0.833  | 0.887 | 0.969 | 0.832  | 0.895
AKT/Rexa  | 2  | L2 Jaro-Winkler | 0.879 | 0.956  | 0.916 | 0.923 | 0.956  | 0.939
AKT/DBLP  | 3  | Jaro-Winkler    | 0.922 | 0.952  | 0.937 | 0.992 | 0.952  | 0.971
AKT/DBLP  | 4  | L2 Jaro-Winkler | 0.389 | 0.984  | 0.558 | 0.838 | 0.983  | 0.905
Rexa/DBLP | 5  | Jaro-Winkler    | 0.899 | 0.933  | 0.916 | 0.944 | 0.932  | 0.938
Rexa/DBLP | 6  | L2 Jaro-Winkler | 0.546 | 0.982  | 0.702 | 0.823 | 0.981  | 0.895
Cora (I)  | 7  | Monge-Elkan     | 0.735 | 0.931  | 0.821 | 0.939 | 0.836  | 0.884
Cora (II) | 8  | Monge-Elkan     | 0.698 | 0.986  | 0.817 | 0.958 | 0.956  | 0.957

• Publication individuals: ontological restrictions mainly influence precision
36
Experiments
Person individuals (left triple: matcher alone; right triple: with refinement):

Dataset   | No | Matcher         | Prec. | Recall | F1    | Prec. | Recall | F1
AKT/Rexa  | 7  | L2 Jaro-Winkler | 0.738 | 0.888  | 0.806 | 0.788 | 0.935  | 0.855
AKT/DBLP  | 8  | L2 Jaro-Winkler | 0.532 | 0.746  | 0.621 | 0.583 | 0.921  | 0.714
Rexa/DBLP | 9  | Jaro-Winkler    | 0.965 | 0.755  | 0.846 | 0.968 | 0.876  | 0.920
Cora (I)  | 10 | L2 Jaro-Winkler | 0.983 | 0.879  | 0.928 | 0.981 | 0.895  | 0.936
Cora (II) | 11 | L2 Jaro-Winkler | 0.999 | 0.994  | 0.997 | 0.999 | 0.994  | 0.997

• Person individuals: evidence comes from the neighborhood graph and mainly influences recall
37
Outline
• Motivation
• Handling fusion subtasks
  – problem-solving method approach
• Processing inconsistencies
  – applying uncertainty reasoning
• Overcoming schema heterogeneity
  – Linked Data scenario
38
Advanced scenario
• Linked Data cloud: network of public RDF repositories [Bizer et al. 2009]
• Added value: coreference links (owl:sameAs)
39
Data linking: current state
• Automatic instance matching algorithms
  – SILK, ODDLinker, KnoFuss, …
• Pairwise matching of datasets
  – Requires significant configuration effort
• Transitive closure of links
  – Use of “reference” datasets
40
Reference datasets
41
Problems
• Transitive closures are often incomplete
  – Reference dataset is incomplete
  – Missing intermediate links
  – Direct comparison of relevant datasets is desirable
• Schema heterogeneity
  – Which instances to compare?
  – Which properties are relevant?
[Diagram: datasets A and B linked only indirectly through a reference dataset]
42
Schema matching
• Interpretation mismatches
  – dbpedia:Actor = professional actor
  – movie:actor = anybody who participated in a movie
• Class interpretation “as used” vs. “as designed”
  – FOAF: foaf:Person = any person
  – DBLP: foaf:Person = computer scientist
• Instance-based ontology matching

Class         | Repository | Richard Nixon | David Garrick
dbpedia:Actor | DBPedia    | −             | +
movie:Actor   | LinkedMDB  | +             | −
43
KnoFuss - enhanced
[Diagram: knowledge fusion now decomposes into ontology integration (ontology matching, instance transformation) and knowledge base integration (coreference resolution, dependency resolution); SPARQL query translation mediates between the source KB and the target KB]
44
Schema matching
• Step 1: inferring schema mappings from pre-existing instance mappings
• Step 2: utilizing schema mappings to produce new instance mappings
[Diagrams: in step 1, instance mappings between Dataset 1 and Dataset 2 induce mappings between Ontology 1 and Ontology 2; in step 2, the ontology mappings drive new instance mappings between the datasets]
45
Overview
• Background knowledge:
  – Data-level (intermediate repositories)
  – Schema-level (datasets with more fine-grained schemas)
46
Algorithm
• Step 1: obtaining the transitive closure of existing mappings
[Diagram: movie:music_contributor/2490 (LinkedMDB) = music:artist/a16…9fdf (MusicBrainz) = dbpedia:Ennio_Morricone (DBPedia)]
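Computing the transitive closure of owl:sameAs links amounts to finding connected components; a minimal sketch with union-find (the URIs are those from the slide, with the elided MusicBrainz identifier kept as-is):

```python
# Union-find over owl:sameAs links: every URI in a connected component
# belongs to the same identity cluster.
parent = {}

def find(x: str) -> str:
    """Return the cluster representative of x, with path halving."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def same_as(a: str, b: str) -> None:
    """Register an owl:sameAs link by merging the two clusters."""
    parent[find(a)] = find(b)

same_as("movie:music_contributor/2490", "music:artist/a16…9fdf")
same_as("music:artist/a16…9fdf", "dbpedia:Ennio_Morricone")
```

After both links are registered, the LinkedMDB and DBPedia URIs fall into one identity cluster even though no direct link between them exists.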
47
Algorithm
• Step 2: inferring class and property mappings
  – ClassOverlap and PropertyOverlap mappings
  – Confidence(classes A, B) = |c(A) ∩ c(B)| / min(|c(A)|, |c(B)|) (overlap coefficient)
  – Confidence(properties r1, r2) = |c(X)| / |c(Y)|
    • X: identity clusters with equivalent values of r1 and r2
    • Y: all identity clusters which have values for both r1 and r2
[Diagram: via the identity cluster movie:music_contributor/2490 (LinkedMDB) = music:artist/a16…9fdf (MusicBrainz) = dbpedia:Ennio_Morricone (DBPedia), an overlap between the classes movie:music_contributor and dbpedia:Artist is inferred]
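The two confidence formulas above can be sketched directly over sets of identity clusters (the function names and toy cluster identifiers are illustrative):

```python
def class_overlap(clusters_a: set, clusters_b: set) -> float:
    """Overlap coefficient between the identity-cluster sets of two classes:
    |c(A) ∩ c(B)| / min(|c(A)|, |c(B)|)."""
    if not clusters_a or not clusters_b:
        return 0.0
    return len(clusters_a & clusters_b) / min(len(clusters_a), len(clusters_b))

def property_overlap(equal_value_clusters: set, both_valued_clusters: set) -> float:
    """|c(X)| / |c(Y)|: clusters where r1 and r2 have equivalent values,
    over clusters that have values for both properties."""
    if not both_valued_clusters:
        return 0.0
    return len(equal_value_clusters) / len(both_valued_clusters)
```

The overlap coefficient is deliberately normalised by the smaller class, so a small specific class fully contained in a large general one still scores 1.0.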
48
Algorithm
• Step 3: inferring data patterns
  – Functionality restrictions
  – Example: IF two equivalent movies do not have overlapping actors AND have different release dates THEN break the equivalence link
  – Note: only usable if not taken into account at the initial instance matching stage
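The movie pattern from this slide can be sketched as a simple filter; the record layout (an `actors` list and a `release_date` field) is an assumption for illustration, not the actual KnoFuss data model:

```python
def spurious_movie_link(m1: dict, m2: dict) -> bool:
    """Apply the inferred pattern: break the equivalence link when two
    supposedly equal movies share no actors AND differ in release date."""
    no_shared_actors = not (set(m1["actors"]) & set(m2["actors"]))
    different_dates = m1["release_date"] != m2["release_date"]
    return no_shared_actors and different_dates
```

A filter like this runs over candidate owl:sameAs links produced earlier and flags those matching the pattern as potentially spurious.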
49
Algorithm
• Step 4: utilizing mappings and patterns
  – Run instance-level matching for individuals of strongly overlapping classes
  – Use patterns to filter out existing mappings
• LinkedMDB:
SELECT ?uri WHERE { ?uri rdf:type movie:music_contributor . }
• DBPedia:
SELECT ?uri WHERE { ?uri rdf:type dbpedia:Artist . }
50
Results
• Class mappings: improvement in recall
  – Previously omitted mappings were discovered after direct comparison of instances
• Data patterns: improved precision
  – Filtered out spurious mappings
  – Identified 140 mappings between movies as “potentially spurious”; 132 of them identified correctly
[Charts: precision, recall, and F1-measure for the “Existing”, “KnoFuss (only)”, and “Combined” configurations on DBPedia/DBLP, DBPedia/LinkedMDB, and DBPedia/BookMashup]
51
Future work
• From the pairwise scenario to a network of repositories
• Combining schema and data integration in an efficient way
• Evaluating data sources
  – Which data source(s) to link to?
  – Which data source(s) to select data from?
52
Questions?
Thanks for your attention
53
References
[Shenoy and Shafer 1990] P. Shenoy, G. Shafer. Axioms for probability and belief-function propagation. In: Readings in Uncertain Reasoning. San Francisco: Morgan Kaufmann, pp. 575-610, 1990.
[Motta 1999] E. Motta. Reusable Components for Knowledge Modelling. Amsterdam: IOS Press, 1999.
[Bizer et al. 2009] C. Bizer, T. Heath, T. Berners-Lee. Linked Data - the story so far. International Journal on Semantic Web and Information Systems, 5(3):1-22, 2009.
[Fellegi and Sunter 1969] I. P. Fellegi, A. B. Sunter. A theory for record linkage. Journal of the American Statistical Association, 64(328):1183-1210, 1969.
[Reiter 1987] R. Reiter. A theory of diagnosis from first principles. Artificial Intelligence, 32(1):57-95, 1987.