fusing semantic data
TRANSCRIPT
Fusing automatically extracted annotations for the Semantic Web
Andriy Nikolov
Knowledge Media Institute, The Open University
2
Outline
• Motivation
• Handling fusion subtasks
  – problem-solving method approach
• Processing inconsistencies
  – applying uncertainty reasoning
• Overcoming schema heterogeneity
  – Linked Data scenario
3
Database scenario
• Classical scenario (database domain)
  – Merging information from datasets containing partially overlapping information

Name          | Year of birth | E-mail            | Address
H. Schmidt    | 1972          | [email protected] | …
J. Smith      | 1983          | [email protected] | …

Name          | Year of birth | E-Mail            | Job position
Wen, Zhao     | 1980          | [email protected] | …
Schmidt, Hans | 1973          | [email protected] | …
4
Database scenario
• Coreference resolution (record linkage)
  – Resolving ambiguous identities

Name          | Year of birth | E-mail            | Address
H. Schmidt    | 1972          | [email protected] | …
J. Smith      | 1983          | [email protected] | …

Name          | Year of birth | E-Mail            | Job position
Wen, Zhao     | 1980          | [email protected] | …
Schmidt, Hans | 1973          | [email protected] | …
5
Database scenario
• Inconsistency resolution
  – Handling contradictory pieces of data

Name          | Year of birth | E-mail            | Address
H. Schmidt    | 1972          | [email protected] | …
J. Smith      | 1983          | [email protected] | …

Name          | Year of birth | E-Mail            | Job position
Wen, Zhao     | 1980          | [email protected] | …
Schmidt, Hans | 1973          | [email protected] | …
6
Semantic data scenario
• Database domain:
  – A record belongs to a single table
  – Table structure defines relevant attributes
  – Inconsistency of values
• Semantic data:
  – Classes are organised into hierarchies
  – One individual may belong to several classes
  – Available properties depend on the level of granularity
  – Other types of inconsistencies are possible, e.g., class disjointness
[Diagram: sweto:Person, a subclass of foaf:Person, with datatype properties foaf:name and foaf:mbox (xsd:string) and object properties sweto:lives_in (sweto:Place) and sweto:affiliated_with (sweto:Organization); sweto:Researcher adds some:has_degree (xsd:string)]
7
Motivating scenario – X-Media
[Diagram: text, images, and other data from internal corporate reports (Intranet) and pre-defined public sources (WWW) pass through annotation to produce RDF; KnoFuss fuses the RDF annotations into a knowledge base, guided by a domain ontology]
8
Outline
• Motivation
• Handling fusion subtasks
  – problem-solving method approach
• Processing inconsistencies
  – applying uncertainty reasoning
• Overcoming schema heterogeneity
  – Linked Data scenario
9
Handling fusion subtasks
• For each subtask, several available methods exist
• Example: coreference resolution
  – Aggregated attribute similarity [Fellegi and Sunter 1969]
  – String similarity: Levenshtein, Jaro, Jaro-Winkler
  – Machine learning: clustering, classification
  – Rule-based
10
Handling fusion subtasks
• All methods have their pros and cons
  – Rule-based: high precision, but restricted to a specific domain
  – Machine learning: requires sufficient training data
  – String similarity: lower precision; still needs configuration (e.g., distance metric, threshold, set of attributes to include)
• Trade-off between the quality of results and applicability range: better precision requires more domain-specific knowledge
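As a concrete illustration of the string-similarity family listed above, here is a minimal sketch of Levenshtein edit distance with a normalised similarity score; the threshold and attribute selection mentioned on the slide would sit on top of a function like this (the function names are illustrative, not from KnoFuss):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Normalised similarity in [0, 1]: 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))
```

A matcher would then declare two attribute values coreferent when `similarity` exceeds a configured threshold.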
11
Problem-solving method approach
• The fusion task is decomposed into subtasks
• Algorithms are defined as methods solving a particular task
• Each method is formally described using the fusion ontology:
  – Task handled by the method
  – Applicability criteria
  – Domain knowledge required
  – Reliability of output
• Methods are selected based on their capabilities
12
KnoFuss architecture
[Diagram: new data enters KnoFuss, which applies a CoreferenceResolutionMethod, ConflictDetectionMethod, and ConflictResolutionMethod drawn from the method library; intermediate data is kept in a fusion KB before the main KB is updated]
• Method library
  – Contains an implementation of each technique for specific subtasks (problem-solving method [Motta 1999])
• Fusion ontology
  – Describes method capabilities
  – Defines intermediate structures (mappings, conflict sets, etc.)
13
Task decomposition
[Diagram: knowledge fusion decomposes into coreference resolution (model configuration, link discovery) and knowledge base updating (dependency identification, dependency resolution); the source KB and target KB are combined into a fused target KB]
14
Method selection
• Example: an adaptive learning matcher with three application contexts
  – rdf:type owl:Thing, any datatypeProperty ?x: reliability = 0.4
  – rdf:type sweto:Publication, rdfs:label ?x, sweto:year ?y: reliability = 0.8
  – rdf:type sweto:Article, rdfs:label ?x, sweto:year ?y, sweto:journal ?z, sweto:volume ?a: reliability = 0.9
• Selection depends on:
  – Range of applicability
  – Reliability
• Configuration parameters
  – Generic (for individuals of unknown types)
  – Context-dependent
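The selection logic on this slide can be sketched as follows: among the methods whose application context covers the instance's class, pick the one with the highest declared reliability. The method registry and toy class hierarchy below are illustrative stand-ins for the fusion ontology, not KnoFuss code:

```python
# Hypothetical method descriptors mirroring the fusion-ontology idea:
# each entry names the context class it applies to and its reliability.
METHODS = [
    {"context": "owl:Thing",         "reliability": 0.4},
    {"context": "sweto:Publication", "reliability": 0.8},
    {"context": "sweto:Article",     "reliability": 0.9},
]

# Toy class hierarchy: class -> set of its superclasses (including itself).
SUPERCLASSES = {
    "sweto:Article":     {"sweto:Article", "sweto:Publication", "owl:Thing"},
    "sweto:Publication": {"sweto:Publication", "owl:Thing"},
    "owl:Thing":         {"owl:Thing"},
}

def select_method(instance_class: str) -> dict:
    """Pick the applicable method with the highest declared reliability."""
    applicable = [m for m in METHODS
                  if m["context"] in SUPERCLASSES[instance_class]]
    return max(applicable, key=lambda m: m["reliability"])
```

For an individual known to be a sweto:Article, the most specific (0.9) matcher wins; for an individual of unknown type, only the generic owl:Thing matcher applies.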
15
Using class hierarchy
• Configuring machine-learning methods:
  – Using training instances for a subclass to learn generic models for superclasses
[Diagram: class hierarchy owl:Thing → foaf:Person, foaf:Document → sweto:Publication → sweto:Article, sweto:Article_in_Proceedings, with attributes name, label, year, volume, journal_name, book_title; training instances Ind1-Ind3 described by {label, year, book_title} at the subclass level are projected to {label, year} for sweto:Publication and to {label} for owl:Thing]
16
Using class hierarchy
• Configuring machine-learning methods:
  – Combining training instances for subclasses to learn a generic model for a superclass
[Diagram: sweto:Article instances Ind4-Ind6 {label, year, journal_name, volume} and sweto:Article_in_Proceedings instances Ind1-Ind3 {label, year, book_title} are projected to the shared attributes {label, year} and pooled as training data Ind1-Ind6 for sweto:Publication]
17
Outline
• Motivation
• Handling fusion subtasks
  – problem-solving method approach
• Processing inconsistencies
  – applying uncertainty reasoning
• Overcoming schema heterogeneity
  – Linked Data scenario
18
Data quality problems
• Causes of inconsistency
  – Data errors
    • Obsolete data
    • Mistakes of manual annotators
    • Errors of information extraction algorithms
  – Coreference resolution errors
    • Automatic methods are not 100% reliable
• Applying uncertainty reasoning
  – Estimated reliability of separate pieces of data
  – Domain knowledge defined in the ontology
19
Refining fused data
• Additional evidence:
  – Ontological schema restrictions
    • Disjointness
    • Cardinality
    • …
  – Neighborhood graph
    • Mappings between related entities
  – Provenance
    • Uncertainty of candidate mappings
    • Uncertainty of data statements
    • “Cleanness” of data sources
20
Dempster-Shafer theory of evidence
• Bayesian probability theory: assigning probabilities to atomic alternatives
  – p(true) = 0.6 ⇒ p(false) = 0.4
  – Sometimes hard to assign
  – Negative bias: extraction uncertainty below 0.5 expresses negative evidence rather than insufficient evidence
• Dempster-Shafer theory: assigning confidence degrees (masses) to sets of alternatives
  – m({true}) = 0.6
  – m({false}) = 0.1
  – m({true; false}) = 0.3
  – Support and plausibility bound the probability: support ≤ probability ≤ plausibility
21
Dependency detection
• Identifying and localizing conflicts
  – Using formal diagnosis [Reiter 1987] in combination with standard ontological reasoning
[Diagram: Paper_10 has rdf:type links to both Article and Proceedings, which are declared owl:disjointWith each other; the owl:FunctionalProperty hasYear has two values, 2006 and 2007; hasAuthor links to E. Motta and V.S. Uren]
22
Belief networks (cont)
• Valuation networks [Shenoy and Shafer 1990]
• Network nodes correspond to OWL axioms
  – Variable nodes
    • ABox statements (I ∈ X, R(I1, I2))
    • One variable: the statement itself
  – Valuation nodes
    • TBox axioms (e.g., X ⊔ Y)
    • Mass distribution over several variables (I ∈ X, I ∈ Y, I ∈ X ⊔ Y)
23
Belief networks (cont)
• Belief network construction
  – Using translation rules
  – Rule antecedents:
    • Existence of specific OWL axioms (one rule per OWL construct)
    • Existence of network nodes
  – Example rules:
    • Explicit ABox statements: IF I ∈ X THEN CREATE N1(I ∈ X)
    • TBox inferencing: IF Trans(R) AND EXIST N1(R(I1, I2)) AND EXIST N2(R(I2, I3)) THEN CREATE N3(Trans(R)) AND CREATE N4(R(I1, I3))
24
Example
[Diagram: #Paper_10 with rdf:type links to both Article and Proceedings, which are owl:disjointWith each other]
25
Example
[Diagram: #Paper_10 with rdf:type links to Article and Proceedings (owl:disjointWith); variable nodes are created for the statements #Paper_10 ∈ Article and #Paper_10 ∈ Proceedings]
26
Example
[Diagram: #Paper_10 with rdf:type links to Article and Proceedings (owl:disjointWith); a valuation node for the TBox axiom Article ⊑ ¬Proceedings is added, connected to the variable nodes #Paper_10 ∈ Article and #Paper_10 ∈ Proceedings]
27
Example
• Variable node #Paper_10 ∈ Article: m(true) = 0.8, m(false) = 0, m({true; false}) = 0.2
• Variable node #Paper_10 ∈ Proceedings: m(true) = 0.6, m(false) = 0, m({true; false}) = 0.4
• Valuation node Article ⊑ ¬Proceedings, over the pairs (#Paper_10 ∈ Article, #Paper_10 ∈ Proceedings):
  – m = 1.0 on the consistent set {(false, false), (false, true), (true, false)}
  – m = 0.0 on {(true, true)}
28
Example
• Combining the variable nodes #Paper_10 ∈ Article (m(true) = 0.8, m({true; false}) = 0.2) and #Paper_10 ∈ Proceedings (m(true) = 0.6, m({true; false}) = 0.4) with the valuation node Article ⊑ ¬Proceedings, using Dempster’s rule over the pairs (#Paper_10 ∈ Article, #Paper_10 ∈ Proceedings):
  – m({(false, false), (false, true), (true, false)}) = 0.15
  – m({(false, true)}) = 0.23
  – m({(true, false)}) = 0.62
29
Example
• Marginalizing the combined distribution back to the variable nodes:
  – #Paper_10 ∈ Article: m(true) = 0.62, m(false) = 0.23, m({true; false}) = 0.15
  – #Paper_10 ∈ Proceedings: m(true) = 0.23, m(false) = 0.62, m({true; false}) = 0.15
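The combination in this example can be reproduced with a short sketch of Dempster's rule over the joint frame of the two membership statements (the code below is an illustration of the calculation, not KnoFuss itself):

```python
from itertools import product

# Joint frame: (Paper_10 in Article, Paper_10 in Proceedings).
FRAME = frozenset(product([False, True], repeat=2))

def combine(m1: dict, m2: dict) -> dict:
    """Dempster's rule: intersect focal sets, renormalise away conflict."""
    out, conflict = {}, 0.0
    for s1, v1 in m1.items():
        for s2, v2 in m2.items():
            inter = s1 & s2
            if inter:
                out[inter] = out.get(inter, 0.0) + v1 * v2
            else:
                conflict += v1 * v2
    k = 1.0 - conflict
    return {s: v / k for s, v in out.items()}

# Evidence for "Paper_10 in Article": m(true)=0.8, rest on the whole frame.
m_article = {frozenset(s for s in FRAME if s[0]): 0.8, FRAME: 0.2}
# Evidence for "Paper_10 in Proceedings": m(true)=0.6.
m_proc = {frozenset(s for s in FRAME if s[1]): 0.6, FRAME: 0.4}
# Disjointness axiom: all mass on states where not both memberships hold.
m_disjoint = {frozenset(s for s in FRAME if not (s[0] and s[1])): 1.0}

result = combine(combine(m_article, m_proc), m_disjoint)
```

The mass 0.8 × 0.6 = 0.48 assigned to (true, true) conflicts with the disjointness axiom and is normalised away, yielding 0.62 for (true, false), 0.23 for (false, true), and 0.15 for the remaining consistent set, as on the slides.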
30
Belief propagation
• Translating a subontology into a belief network
  – Using provenance and confidence values of data statements
  – Coreferencing algorithm precision for owl:sameAs mappings
• Data refinement:
  – Detecting spurious mappings
  – Removing unreliable data statements
[Diagram: belief network with valuation nodes Article ⊑ ¬in_Proc and Functional(year) linked to variable nodes Article(Ind1), in_Proc(Ind1), in_Proc(Ind2), Ind1 = Ind2, year(Ind1, 2007), year(Ind2, 2007), year(Ind1, 2006); each variable carries a (support; plausibility) interval before/after propagation: (0.99; 1.0)/(0.97; 0.98), (0.9; 1.0)/(0.74; 0.82), (0.92; 1.0)/(0.2; 0.21), (0.85; 1.0)/(0.72; 0.85), (0.95; 1.0)/(0.91; 0.96)]
31
Neighbourhood graph
• Non-functional relations have varying impact
[Diagram 1: Paper_10 –hasAuthor→ “H. Schmidt” and Paper_11 –hasAuthor→ “Schmidt, Hans”, both papers of rdf:type Proceedings; owl:sameAs(Paper_10, Paper_11) with confidence 0.9 supports owl:sameAs between the Person instances with confidence 0.3]
[Diagram 2: “H. Schmidt” –citizen_of→ Germany and “Schmidt, Hans” –citizen_of→ Germany, both persons of rdf:type Person; owl:sameAs between the Country instances holds with confidence 1.0, yet supports owl:sameAs between the Person instances only with confidence 0.3]
32
Neighborhood graph
• Implicit relations: set co-membership
[Diagram: co-authorship evidence. Person11 = Person12 together with Coauthor(Person12, Person22) supports Coauthor(Person11, Person22), which in turn supports Person21 = Person22; e.g., the mapping “Bard, J.B.L.” = “Jonathan Bard” with 0.84/(0.86; 1.0) reinforces “Webber, B.L.” = “Bonnie L. Webber” with 0.16/(0.83; 1.0) via co-authorship links of 1.0/(1.0; 1.0)]
33
Provenance
• Initial belief assignments:
  – Data statements (source AND/OR extractor confidence)
  – Candidate mappings (precision of attribute similarity algorithms)
  – Source “cleanness”: whether it contains duplicates or not
[Diagram: “Arlington” has candidate mappings Arlington = Arl_Va (Arlington, Virginia) with 0.95 and Arlington = Arl_Tx (Arlington, Texas) with 0.9; the source containing Arl_Va and Arl_Tx is clean, so Arl_Va ≠ Arl_Tx holds with 1.0/(1.0; 1.0); after propagation the intervals become (0.65; 0.69) and (0.31; 0.35)]
34
Experiments
• Datasets:
  – Publication:
    • AKT
    • Rexa
    • SWETO-DBLP
  – Cora:
    • database community benchmark
    • translated into RDF
    • 2 versions used: different structure, different gold standard
35
Experiments
Publication individuals (left triple: matcher alone; right triple: with refinement):

Dataset   | No | Matcher         | Prec. | Recall | F1    | Prec. | Recall | F1
AKT/Rexa  | 1  | Jaro-Winkler    | 0.950 | 0.833  | 0.887 | 0.969 | 0.832  | 0.895
AKT/Rexa  | 2  | L2 Jaro-Winkler | 0.879 | 0.956  | 0.916 | 0.923 | 0.956  | 0.939
AKT/DBLP  | 3  | Jaro-Winkler    | 0.922 | 0.952  | 0.937 | 0.992 | 0.952  | 0.971
AKT/DBLP  | 4  | L2 Jaro-Winkler | 0.389 | 0.984  | 0.558 | 0.838 | 0.983  | 0.905
Rexa/DBLP | 5  | Jaro-Winkler    | 0.899 | 0.933  | 0.916 | 0.944 | 0.932  | 0.938
Rexa/DBLP | 6  | L2 Jaro-Winkler | 0.546 | 0.982  | 0.702 | 0.823 | 0.981  | 0.895
Cora (I)  | 7  | Monge-Elkan     | 0.735 | 0.931  | 0.821 | 0.939 | 0.836  | 0.884
Cora (II) | 8  | Monge-Elkan     | 0.698 | 0.986  | 0.817 | 0.958 | 0.956  | 0.957

• Publication individuals: ontological restrictions mainly influence precision
36
Experiments
Person individuals (left triple: matcher alone; right triple: with refinement):

Dataset   | No | Matcher         | Prec. | Recall | F1    | Prec. | Recall | F1
AKT/Rexa  | 7  | L2 Jaro-Winkler | 0.738 | 0.888  | 0.806 | 0.788 | 0.935  | 0.855
AKT/DBLP  | 8  | L2 Jaro-Winkler | 0.532 | 0.746  | 0.621 | 0.583 | 0.921  | 0.714
Rexa/DBLP | 9  | Jaro-Winkler    | 0.965 | 0.755  | 0.846 | 0.968 | 0.876  | 0.920
Cora (I)  | 10 | L2 Jaro-Winkler | 0.983 | 0.879  | 0.928 | 0.981 | 0.895  | 0.936
Cora (II) | 11 | L2 Jaro-Winkler | 0.999 | 0.994  | 0.997 | 0.999 | 0.994  | 0.997

• Person individuals: evidence comes from the neighborhood graph and mainly influences recall
37
Outline
• Motivation
• Handling fusion subtasks
  – problem-solving method approach
• Processing inconsistencies
  – applying uncertainty reasoning
• Overcoming schema heterogeneity
  – Linked Data scenario
38
Advanced scenario
• Linked Data cloud: network of public RDF repositories [Bizer et al. 2009]
• Added value: coreference links (owl:sameAs)
39
Data linking: current state
• Automatic instance matching algorithms
  – SILK, ODDLinker, KnoFuss, …
• Pairwise matching of datasets
  – Requires significant configuration effort
• Transitive closure of links
  – Use of “reference” datasets
40
Reference datasets
41
Problems
• Transitive closures are often incomplete
  – Reference dataset is incomplete
  – Missing intermediate links
  – Direct comparison of relevant datasets is desirable
• Schema heterogeneity
  – Which instances to compare?
  – Which properties are relevant?
[Diagram: datasets A and B linked only indirectly through a reference dataset]
42
Schema matching
• Interpretation mismatches
  – dbpedia:Actor = professional actor
  – movie:actor = anybody who participated in a movie
• Class interpretation “as used” vs. “as designed”
  – FOAF: foaf:Person = any person
  – DBLP: foaf:Person = computer scientist
• Instance-based ontology matching

Class         | Repository | Richard Nixon | David Garrick
dbpedia:Actor | DBPedia    | −             | +
movie:Actor   | LinkedMDB  | +             | −
43
KnoFuss - enhanced
[Diagram: knowledge fusion now decomposes into ontology integration (ontology matching, instance transformation) and knowledge base integration (coreference resolution, dependency resolution); SPARQL query translation mediates between the source KB and the target KB]
44
Schema matching
• Step 1: inferring schema mappings from pre-existing instance mappings
• Step 2: utilizing schema mappings to produce new instance mappings
[Diagrams: in step 1, instance mappings between Dataset 1 and Dataset 2 induce mappings between Ontology 1 and Ontology 2; in step 2, the ontology mappings drive new instance mappings between the datasets]
45
Overview
• Background knowledge:
  – Data-level (intermediate repositories)
  – Schema-level (datasets with more fine-grained schemas)
46
Algorithm
• Step 1: obtaining the transitive closure of existing mappings
[Diagram: movie:music_contributor/2490 (LinkedMDB) = music:artist/a16…9fdf (MusicBrainz) = dbpedia:Ennio_Morricone (DBPedia)]
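Computing the transitive closure of owl:sameAs links amounts to finding connected components; a minimal sketch with union-find (the URIs are those from the slide, with the elided MusicBrainz identifier kept as-is):

```python
# Union-find over owl:sameAs links: every URI in a connected component
# belongs to the same identity cluster.
parent = {}

def find(x: str) -> str:
    """Return the cluster representative of x, with path halving."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def same_as(a: str, b: str) -> None:
    """Register an owl:sameAs link by merging the two clusters."""
    parent[find(a)] = find(b)

same_as("movie:music_contributor/2490", "music:artist/a16…9fdf")
same_as("music:artist/a16…9fdf", "dbpedia:Ennio_Morricone")
```

After both links are registered, the LinkedMDB and DBPedia URIs fall into one identity cluster even though no direct link between them exists.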
47
Algorithm
• Step 2: inferring class and property mappings
  – ClassOverlap and PropertyOverlap mappings
  – Confidence(classes A, B) = |c(A) ∩ c(B)| / min(|c(A)|, |c(B)|) (overlap coefficient)
  – Confidence(properties r1, r2) = |c(X)| / |c(Y)|
    • X: identity clusters with equivalent values of r1 and r2
    • Y: all identity clusters which have values for both r1 and r2
[Diagram: via the identity cluster movie:music_contributor/2490 (LinkedMDB) = music:artist/a16…9fdf (MusicBrainz) = dbpedia:Ennio_Morricone (DBPedia), an overlap between the classes movie:music_contributor and dbpedia:Artist is inferred]
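The two confidence formulas above can be sketched directly over sets of identity clusters (the function names and toy cluster identifiers are illustrative):

```python
def class_overlap(clusters_a: set, clusters_b: set) -> float:
    """Overlap coefficient between the identity-cluster sets of two classes:
    |c(A) ∩ c(B)| / min(|c(A)|, |c(B)|)."""
    if not clusters_a or not clusters_b:
        return 0.0
    return len(clusters_a & clusters_b) / min(len(clusters_a), len(clusters_b))

def property_overlap(equal_value_clusters: set, both_valued_clusters: set) -> float:
    """|c(X)| / |c(Y)|: clusters where r1 and r2 have equivalent values,
    over clusters that have values for both properties."""
    if not both_valued_clusters:
        return 0.0
    return len(equal_value_clusters) / len(both_valued_clusters)
```

The overlap coefficient is deliberately normalised by the smaller class, so a small specific class fully contained in a large general one still scores 1.0.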
48
Algorithm
• Step 3: inferring data patterns
  – Functionality restrictions
  – Example: IF two equivalent movies do not have overlapping actors AND have different release dates THEN break the equivalence link
  – Note: only usable if not taken into account at the initial instance matching stage
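The movie pattern from this slide can be sketched as a simple filter; the record layout (an `actors` list and a `release_date` field) is an assumption for illustration, not the actual KnoFuss data model:

```python
def spurious_movie_link(m1: dict, m2: dict) -> bool:
    """Apply the inferred pattern: break the equivalence link when two
    supposedly equal movies share no actors AND differ in release date."""
    no_shared_actors = not (set(m1["actors"]) & set(m2["actors"]))
    different_dates = m1["release_date"] != m2["release_date"]
    return no_shared_actors and different_dates
```

A filter like this runs over candidate owl:sameAs links produced earlier and flags those matching the pattern as potentially spurious.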
49
Algorithm
• Step 4: utilizing mappings and patterns
  – Run instance-level matching for individuals of strongly overlapping classes
  – Use patterns to filter out existing mappings
• LinkedMDB:
SELECT ?uri WHERE { ?uri rdf:type movie:music_contributor . }
• DBPedia:
SELECT ?uri WHERE { ?uri rdf:type dbpedia:Artist . }
50
Results
• Class mappings: improvement in recall
  – Previously omitted mappings were discovered after direct comparison of instances
• Data patterns: improved precision
  – Filtered out spurious mappings
  – Identified 140 mappings between movies as “potentially spurious”; 132 of them identified correctly
[Charts: precision, recall, and F1-measure for the “Existing”, “KnoFuss (only)”, and “Combined” configurations on DBPedia/DBLP, DBPedia/LinkedMDB, and DBPedia/BookMashup]
51
Future work
• From the pairwise scenario to a network of repositories
• Combining schema and data integration in an efficient way
• Evaluating data sources
  – Which data source(s) to link to?
  – Which data source(s) to select data from?
52
Questions?
Thanks for your attention
53
References
[Shenoy and Shafer 1990] P. Shenoy, G. Shafer. Axioms for probability and belief-function propagation. In: Readings in Uncertain Reasoning. San Francisco: Morgan Kaufmann, pp. 575-610, 1990.
[Motta 1999] E. Motta. Reusable Components for Knowledge Modelling. Amsterdam: IOS Press, 1999.
[Bizer et al. 2009] C. Bizer, T. Heath, T. Berners-Lee. Linked Data - the story so far. International Journal on Semantic Web and Information Systems, 5(3):1-22, 2009.
[Fellegi and Sunter 1969] I. P. Fellegi, A. B. Sunter. A theory for record linkage. Journal of the American Statistical Association, 64(328):1183-1210, 1969.
[Reiter 1987] R. Reiter. A theory of diagnosis from first principles. Artificial Intelligence, 32(1):57-95, 1987.