Computing FOAF Co-reference Relations with Rules and Machine Learning
Jennifer Sleeman and Tim Finin
University of Maryland, Baltimore County
The Third International Workshop on Social Data on the Web, November 2010
http://ebiquity.umbc.edu/paper/html/id/506/
FOAF
The Friend of a Friend (FOAF) vocabulary describes people and their relationships
– One of the oldest and most widely used ontologies
It does not include a globally unique identifier
– Inverse functional properties (IFPs) help
Multiple foaf instances referring to the same person are common
– Increasingly so with more linked data
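As a rough illustration of how an IFP can stand in for a missing global identifier, the sketch below computes a foaf:mbox_sha1sum value, which FOAF defines as the SHA-1 hash of a mailbox's mailto: URI. The function name and the address alice@example.org are assumptions made for this example; they do not come from the slides.

```python
import hashlib


def mbox_sha1sum(email: str) -> str:
    """Return the foaf:mbox_sha1sum value for an email address.

    The value is the SHA-1 hex digest of the full mailto: URI.
    """
    uri = email if email.startswith("mailto:") else "mailto:" + email
    return hashlib.sha1(uri.encode("utf-8")).hexdigest()


# Hypothetical address, used only for illustration.
digest = mbox_sha1sum("alice@example.org")

# Two profiles that publish the same digest point at the same mailbox.
same_mailbox = digest == mbox_sha1sum("mailto:alice@example.org")
```

Two profiles carrying the same digest are candidates for the same person, although the later examples show that real data can be noisy.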
introduction foaf co-reference approach methodology evaluation conclusions
Linking data
Data integration requires linking instances from different data sets
Linking foaf instances is a common and typical use case
Sindice reports 23 foaf instances all referring to Sir Tim Berners-Lee
– Probably more than my query revealed
– Only a handful are linked via owl:sameAs
Automatically linking foaf instances is not always easy
Example 1
<swivt:Subject rdf:about="http://tw.rpi.edu/wiki/Special:URIResolver/Bijan_Parsia">
  <rdfs:label>Bijan Parsia</rdfs:label>
  <swivt:page rdf:resource="http://tw.rpi.edu/wiki/Bijan_Parsia"/>
  <rdfs:isDefinedBy rdf:resource="http://tw.rpi.edu/wiki/Special:ExportRDF/Bijan_Parsia"/>
  <rdf:type rdf:resource="http://tw.rpi.edu/wiki/Special:URIResolver/Category-3APerson"/>
  <property:Foaf-3Adepiction rdf:resource="http://tw.rpi.edu/wiki/Special:URIResolver/Anonymous.png"/>
  <foaf:firstName rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Bijan</foaf:firstName>
  <foaf:interest rdf:resource="http://tw.rpi.edu/wiki/Special:URIResolver/Category-3ASemantic_Web_Topic"/>
  <foaf:name rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Bijan Parsia</foaf:name>
  <foaf:surname rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Parsia</foaf:surname>
  <property:Has_affiliation rdf:resource="http://tw.rpi.edu/wiki/Special:URIResolver/Manchester_University"/>
  <property:Has_identifier rdf:resource="http://tw.rpi.edu/wiki/Special:URIResolver/Bijan_Parsia"/>
</swivt:Subject>
http://tw.rpi.edu/wiki/Special:ExportRDF/Bijan_Parsia
<foaf:Person rdf:ID="bparsia"> <foaf:mbox_sha1sum>f49a6854842c5fa76dc0edb8e82f8fe04fd56bc9</foaf:mbox_sha1sum> <foaf:firstName>Bijan</foaf:firstName> <foaf:surname>Parsia</foaf:surname> <foaf:name>Bijan Parsia</foaf:name> <foaf:homepage rdf:resource="http://trust.mindswap.org/cgi-bin/FilmTrust/foaf.cgi?user=bparsia"/> <foaf:img rdf:resource="http://www.mindswap.org/~bparsia/talks/uri-use/bijan.jpg"/> <foaf:depiction rdf:resource="http://www.mindswap.org/~bparsia/talks/uri-use/bijan.jpg"/> <foaf:nick>bparsia</foaf:nick> <foaf:holdsAccount> <foaf:OnlineAccount> <foaf:accountName>bparsia</foaf:accountName> <foaf:accountServiceHomepage rdf:resource="http://trust.mindswap.org/FilmTrust/"/> </foaf:OnlineAccount> </foaf:holdsAccount>
http://trust.mindswap.org/cgi-bin/FilmTrust/foaf.cgi?user=bparsia#tt0084827-bparpia
Common properties, but can we say this is the same person…
Example 2
<foaf:Person>
<foaf:name>James A. Hendler</foaf:name>
<foaf:firstName>James</foaf:firstName>
<foaf:surname>Hendler</foaf:surname>
<foaf:publications>http://ebiquity.umbc.edu/papers/select/person/James/Hendler/</foaf:publications>
<foaf:homepage rdf:resource="http://www.cs.umd.edu/~hendler/"/>
<foaf:workInfoHomepage rdf:resource="http://www.cs.umd.edu/~hendler/"/>
http://ebiquity.umbc.edu/person/foaf/James/A./Hendler/foaf.rdf
<foaf:Person rdf:ID="jhendler"> <foaf:mbox_sha1sum>0b62d4242736e64be6138547c79a811b3e82fd52</foaf:mbox_sha1sum> <foaf:firstName>Jim</foaf:firstName> <foaf:surname>Hendler</foaf:surname> <foaf:name>Jim Hendler</foaf:name> <foaf:title>Tetherless World Constellation Chair</foaf:title> <foaf:homepage rdf:resource="http://trust.mindswap.org/cgi-bin/FilmTrust/foaf.cgi?user=jhendler"/> <foaf:homepage rdf:resource="http://www.cs.umd.edu/~hendler"/> <foaf:depiction rdf:resource="http://www.semanticgrid.org/q-iantbljim.jpg"/> <foaf:workplaceHomepage rdf:resource="http://owl.mindswap.org"/> <foaf:img rdf:resource="http://www.cs.umd.edu/~hendler/hendler.gif"/> <foaf:depiction rdf:resource="http://www.cs.umd.edu/~hendler/hendler.gif"/> <foaf:nick>jhendler</foaf:nick> <foaf:openID rdf:resource="http://jhendler.pip.verisignlabs.com/" /> <foaf:holdsAccount> <foaf:OnlineAccount> <foaf:accountName>jhendler</foaf:accountName> <foaf:accountServiceHomepage rdf:resource="http://trust.mindswap.org/FilmTrust/"/> </foaf:OnlineAccount> </foaf:holdsAccount>
http://www.cs.rpi.edu/~hendler/foaf.rdf
Aliases and slight name variations…
Example 3
<Agent rdf:about="http://identi.ca/user/53505">
  <mbox_sha1sum>08445a31a78661b5c746feff39a9db6e4e2cc5cf</mbox_sha1sum>
  <name>David Wood</name>
  <homepage rdf:resource="http://dw2-0.com"/>
  <weblog rdf:resource="http://identi.ca/dw2"/>
  <holdsAccount>
    <OnlineAccount rdf:about="http://identi.ca/user/53505#acct">
      <accountServiceHomepage rdf:resource="http://identi.ca/"/>
      <accountName>dw2</accountName>
      <accountProfilePage rdf:resource="http://identi.ca/dw2"/>
      <sioc:account_of rdf:resource="http://identi.ca/user/53505"/>
      <sioc:follows rdf:resource="http://identi.ca/user/136#acct"/>
    </OnlineAccount>
  </holdsAccount>
http://identi.ca/dw2/foaf
<foaf:Person rdf:about="http://zepheira.com/team/dave/#me"> <foaf:name>David Wood</foaf:name> <foaf:title>Dr.</foaf:title> <foaf:givenname>David</foaf:givenname> <foaf:family_name>Wood</foaf:family_name> <foaf:nick>prototypo</foaf:nick> <foaf:mbox_sha1sum>37c8d030d4e615d05f31625b3460532a3f4e214e</foaf:mbox_sha1sum> <foaf:homepage rdf:resource="http://prototypo.blogspot.com/"/> <foaf:depiction rdf:resource="http://www.itee.uq.edu.au/~dwood/images/dave_w_0.jpg"/> <foaf:phone rdf:resource="tel:+1-(571)-331-3723"/> <foaf:workplaceHomepage rdf:resource="http://www.zepheira.com/"/> <foaf:workInfoHomepage rdf:resource="http://www.zepheira.com/team/dave"/> <foaf:schoolHomepage rdf:resource="http://www.vmi.edu/"/> <foaf:schoolHomepage rdf:resource="http://www.nps.navy.mil/"/> <foaf:schoolHomepage rdf:resource="http://www.itee.uq.edu.au/"/> <foaf:aimChatID>piprototypo</foaf:aimChatID>
http://www.itee.uq.edu.au/~dwood/dave.rdf#me
What if mbox_sha1sums are different?
Example 3 cont.
<ms:Researcher rdf:ID="David_Wood" rdfs:label="David Wood">
  <foaf:name>David Wood</foaf:name>
  <foaf:mbox><owl:Thing rdf:about="mailto:[email protected]"/></foaf:mbox>
  <foaf:homepage><foaf:Document rdf:about="http://www.mindswap.org/~dwood/"/></foaf:homepage>
  <foaf:workInfoHomepage><foaf:Document rdf:about="http://www.mindswap.org/~dwood/"/></foaf:workInfoHomepage>
</ms:Researcher>
http://www.mindswap.org/2004/owl/mindswappers#David.Wood
Which David Wood was a mindswapper?
Example 5
<foaf:Person rdf:ID="jgolbeck">
  <foaf:mbox_sha1sum>08445a31a78661b5c746feff39a9db6e4e2cc5cf</foaf:mbox_sha1sum>
  <foaf:firstName></foaf:firstName>
  <foaf:surname></foaf:surname>
  <foaf:name> </foaf:name>
  <foaf:homepage rdf:resource="http://trust.mindswap.org/cgi-bin/FilmTrust/foaf.cgi?user=jgolbeck"/>
  <foaf:img rdf:resource=""/>
  <foaf:depiction rdf:resource=""/>
  <foaf:nick>jgolbeck</foaf:nick>
  <foaf:holdsAccount>
    <foaf:OnlineAccount>
      <foaf:accountName>jgolbeck</foaf:accountName>
      <foaf:accountServiceHomepage rdf:resource="http://trust.mindswap.org/FilmTrust/"/>
    </foaf:OnlineAccount>
  </foaf:holdsAccount>
http://trust.mindswap.org/cgi-bin/FilmTrust/foaf.cgi?user=jgolbeck
<swivt:Subject rdf:about="http://tw.rpi.edu/wiki/Special:URIResolver/Jennifer_Golbeck">
<rdfs:label>Jennifer Golbeck</rdfs:label>
<swivt:page rdf:resource="http://tw.rpi.edu/wiki/Jennifer_Golbeck"/>
<rdfs:isDefinedBy rdf:resource="http://tw.rpi.edu/wiki/Special:ExportRDF/Jennifer_Golbeck"/>
<rdf:type rdf:resource="http://tw.rpi.edu/wiki/Special:URIResolver/Category-3AAssistant_Professor"/>
<rdf:type rdf:resource="http://tw.rpi.edu/wiki/Special:URIResolver/Category-3APerson"/>
<property:Foaf-3Adepiction rdf:resource="http://tw.rpi.edu/wiki/Special:URIResolver/Anonymous.png"/>
<foaf:firstName rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Jennifer</foaf:firstName>
<foaf:interest rdf:resource="http://tw.rpi.edu/wiki/Special:URIResolver/Category-3ASemantic_Web_Topic"/>
<foaf:name rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Jennifer Golbeck</foaf:name>
<foaf:surname rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Golbeck</foaf:surname>
http://tw.rpi.edu/wiki/Special:ExportRDF/Jennifer_Golbeck
Could jgolbeck and Jennifer Golbeck be the same person …
Example 5 cont.
<rdf:RDF>
<foaf:Person>
  <foaf:name>Jennifer Golbeck</foaf:name>
  <foaf:mbox rdf:resource="mailto:[email protected]"/>
  <foaf:mbox rdf:resource="mailto:[email protected]"/>
  <owl:sameAs rdf:resource="http://www.mindswap.org/2004/owl/mindswappers#Jennifer.Golbeck"/>
  <foaf:workplaceHomepage rdf:resource="http://www.cs.umd.edu/~golbeck"/>
  <foaf:currentProject rdf:resoruce="http://trust.mindswap.org"/>
  <foaf:publications rdf:resource="http://www.mindswap.org/papers"/>
  <foaf:knows rdf:resource="#danbri"/>
  <rdfs:seeAlso rdf:resource="http://trust.mindswap.org/cgi-bin/getList.cgi"/>
http://www.cs.umd.edu/~golbeck/daml/golbeckFOAF.rdf
<ms:Researcher rdf:ID="Jennifer.Golbeck" rdfs:label="Jennifer Golbeck">
<rdfs:seeAlso rdf:resource="http://www.cs.umd.edu/~golbeck/daml/golbeckFOAF.rdf"/>
<foaf:name>Jennifer Golbeck</foaf:name>
<foaf:mbox><owl:Thing rdf:about="mailto:[email protected]"/></foaf:mbox>
<foaf:homepage><foaf:Document rdf:about="http://www.cs.umd.edu/~golbeck/"/></foaf:homepage>
<foaf:workInfoHomepage><foaf:Document rdf:about="http://www.mindswap.org/~golbeck/"/>
</foaf:workInfoHomepage>
</ms:Researcher>
http://www.mindswap.org/2004/owl/mindswappers#Jennifer.Golbeck
Which profile is most recent/relevant?
Our Contributions
Treating foaf smushing as entity co-reference
Use machine learning to train a classifier for recognizing co-referent foaf instances
Combine this with rule-based evidence
Use of narrower RDF properties to express co-reference, avoiding overuse of owl:sameAs
Use of a greedy algorithm for iteratively clustering co-referent entities and re-evaluating their potential co-reference relations
Co-Reference in FOAF
Approach the problem like cross-document co-reference resolution in text
Match pairs of FOAF agents
Use rules and properties
Assign new properties to represent coref and notCoref relationships
Cluster co-referent pairs
Cross-Document Co-reference Resolution
Determine when two documents mention the same entity
– Are two documents that talk about “George Bush” talking about the same George Bush?
– Is a document mentioning “Mahmoud Abbas” referring to the same person as one mentioning “Muhammed Abbas”? What about “Abu Abbas”? “Abu Mazen”?
Drawing appropriate inferences from multiple documents demands cross-document co-reference resolution
2008 NIST Text Analysis Conference
TAC KBP: Entity Linking
John Williams
Richard Kaufman goes a long way back with John Williams. Trained as a classical violinist, Californian Kaufman started doing session work in the Hollywood studios in the 1970s. One of his movies was Jaws, with Williams conducting his score in recording sessions in 1975...
John Williams author 1922-1994
J. Lloyd Williams botanist 1854-1945
John Williams politician 1955-
John J. Williams US Senator 1904-1988
John Williams Archbishop 1582-1650
John Williams composer 1932-
Jonathan Williams poet 1929-
Michael Phelps
Debbie Phelps, the mother of swimming star Michael Phelps, who won a record eight gold medals in Beijing, is the author of a new memoir, ...
Michael Phelps swimmer 1985-
Michael Phelps biophysicist 1939-
Michael Phelps is the scientist most often identified as the inventor of PET, a technique that permits the imaging of biological processes in the organ systems of living individuals. Phelps has ...
Given an entity mention in an article, find the link to the right Wikipedia entity if one exists.
2009 NIST TAC Knowledge Base Population Track
Smushing
Smushing is the traditional term used for recognizing that two “blank nodes” refer to the same thing and merging them
Past work on smushing has exploited IFPs (e.g., foaf:mbox), heuristic similarity metrics and custom SPARQL queries
owl:sameAs is often used to relate smushed nodes, enabling a reasoner to effect the merging
rdfs:seeAlso is used to find related foaf data
Smushing
[Figure: RDF graph before merging. Labels: foaf:Person, rdf:type, foaf:mbox (twice), foaf:knows, foaf:nick "bar", owl:sameAs]
Smushing
[Figure: RDF graph after merging. Labels: foaf:Person, rdf:type, foaf:knows, foaf:nick "bar", foaf:mbox]
owl:sameAs considered harmful
Known problems
– Temporally qualified data (Ding vs. Ding)
– Noisy data (Clinton vs. Clinton)
– Referentially opaque contexts (John likes the Morning Star beautiful)
Halpin et al. (2010) suggest a vocabulary for similarity relations: similarity.owl
We use two weaker predicates: coref & notCoref
– Defer the sameAs problem to applications
Co-Reference in FOAF
coref: transitive, symmetric and reflexive; has sameAs as subproperty
notCoref: symmetric and irreflexive but not transitive; has differentFrom as subproperty
:coref a owl:TransitiveProperty, owl:SymmetricProperty, owl:ReflexiveProperty.
owl:sameAs rdfs:subPropertyOf :coref.
:notCoref a owl:SymmetricProperty, owl:IrreflexiveProperty.
owl:differentFrom rdfs:subPropertyOf :notCoref.
{?a :notCoref ?b. ?b :coref ?c.} => {?a :notCoref ?c}
{?a foaf:knows ?b.} => {?a :notCoref ?b}
The :coref and :notCoref properties that we use instead of owl:sameAs
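The interaction between the two properties can be sketched in Python. This is a minimal illustration, not the authors' reasoner: it groups entities into coref equivalence classes (using symmetry and transitivity) and then applies the rule {?a :notCoref ?b. ?b :coref ?c.} => {?a :notCoref ?c} by spreading every notCoref assertion across both classes. The function names and the entities a, b, c, d are hypothetical.

```python
def coref_classes(entities, coref_pairs):
    """Group entities into equivalence classes under coref (union-find)."""
    parent = {e: e for e in entities}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in coref_pairs:
        parent[find(a)] = find(b)

    classes = {}
    for e in entities:
        classes.setdefault(find(e), set()).add(e)
    return list(classes.values())


def propagate_not_coref(entities, coref_pairs, not_coref_pairs):
    """Derive every notCoref pair implied by the coref classes."""
    class_of = {}
    for cls in coref_classes(entities, coref_pairs):
        for e in cls:
            class_of[e] = cls
    derived = set()
    for a, b in not_coref_pairs:
        if class_of[a] is class_of[b]:
            # notCoref is irreflexive, so this input is contradictory.
            raise ValueError(f"{a} and {b} are both coref and notCoref")
        for x in class_of[a]:
            for y in class_of[b]:
                derived.add(frozenset((x, y)))  # unordered: notCoref is symmetric
    return derived


# Hypothetical entities: b and c are co-referent, a is known to differ from b.
derived = propagate_not_coref(
    ["a", "b", "c", "d"], coref_pairs=[("b", "c")], not_coref_pairs=[("a", "b")]
)
```

Here the single assertion that a differs from b also yields that a differs from c, while nothing is concluded about d.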
Batch Approach
Given a potentially large set of foaf instances
Generate candidate pairs
Evaluate each pair for co-reference
– Using rules and classifier independently
– Each results in a {coref, notCoref, unknown} decision
– Trust rules over classifier
Designate pairs as co-referent
Create clusters
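The "trust rules over classifier" step can be sketched as a small decision function. This is an assumed reading of the slide: the rule outcome wins whenever the rules reach a conclusion, and the classifier is consulted only when the rules return unknown. The pair identifiers and decisions are invented for illustration.

```python
COREF, NOT_COREF, UNKNOWN = "coref", "notCoref", "unknown"


def combine(rule_decision: str, classifier_decision: str) -> str:
    """Prefer the rule outcome; fall back to the classifier when rules are silent."""
    if rule_decision != UNKNOWN:
        return rule_decision
    return classifier_decision


# Hypothetical decisions per candidate pair: (rule outcome, classifier outcome).
decisions = {
    ("ex:a", "ex:b"): (COREF, NOT_COREF),   # rules and classifier disagree
    ("ex:a", "ex:c"): (UNKNOWN, COREF),     # rules are silent
    ("ex:b", "ex:d"): (UNKNOWN, UNKNOWN),   # nobody decides
}
final = {pair: combine(r, c) for pair, (r, c) in decisions.items()}
```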
Ingest
Extract triples from FOAF profiles
Add each foaf agent as new entity in database
Entity URLs followed in foaf:knows graph to get additional information
Approach: System Architecture
[Figure: pipeline diagram. Stages: ingestion; candidate pair generation (potential pairs: reduces classifier workload); rule-based reasoning (deductive decisions); machine learning (model generation, predictions); co-referent designation and clustering; abstract entity generation (clusters form new abstract entities)]
Candidate Pairs
Filter pairs to reduce the matching set
Use simple string matching predicates
– Dice score for 3-grams
Apply both to values of common properties and also cross-property values
Experiment 2: ~30% reduction
Reductions vary based on data set
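A candidate-pair filter of this kind might look like the sketch below. It assumes character 3-grams over lowercased strings and a cut-off of 0.5; neither the normalization nor the threshold is given in the slides, and the profile dictionaries are simplified stand-ins for real FOAF data.

```python
def trigrams(text: str) -> set:
    """Return the set of character 3-grams of a lowercased, trimmed string."""
    s = text.lower().strip()
    if len(s) < 3:
        return {s} if s else set()
    return {s[i:i + 3] for i in range(len(s) - 2)}


def dice(a: str, b: str) -> float:
    """Dice coefficient 2|X∩Y| / (|X| + |Y|) over the two 3-gram sets."""
    x, y = trigrams(a), trigrams(b)
    if not x or not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def is_candidate(p1: dict, p2: dict, threshold: float = 0.5) -> bool:
    """Keep a pair if any value of p1 resembles any value of p2.

    Comparing all values against all values covers both the
    common-property and the cross-property comparisons.
    """
    return any(dice(v1, v2) >= threshold
               for v1 in p1.values() for v2 in p2.values())


# Simplified profiles; the names are taken from the earlier examples.
jim = {"foaf:name": "Jim Hendler", "foaf:nick": "jhendler"}
james = {"foaf:name": "James A. Hendler"}
david = {"foaf:name": "David Wood"}

keep = is_candidate(jim, james)
drop = is_candidate(jim, david)
```

Pairs that fail the filter never reach the classifier, which is where the reported reduction in workload comes from.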
Input data sources
FOAF profiles extracted from Swoogle
Also used URLs extracted from tests conducted in previous work
[Figure: Distribution of URLs from Experiment 2]
Methodology: Rule-based Model
Rules conclude that two instances are co-referent, not co-referent or draw no conclusion (the most common outcome)
Basic co-reference rules:
{?p a owl:IFP. ?a ?p ?x. ?b ?p ?x.} => {?a :coref ?b}
{?p a owl:FP. ?a ?p ?x. ?a ?p ?y.} => {?x :coref ?y}
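The first rule can be mimicked over plain (subject, predicate, object) tuples, as in the sketch below. It is an illustration under assumptions rather than the system's rule engine: the set of IFPs is hard-coded to three properties that FOAF declares inverse functional, and the ex: identifiers and values are made up.

```python
from collections import defaultdict
from itertools import combinations

# Properties treated as inverse functional in this sketch.
IFPS = {"foaf:mbox", "foaf:mbox_sha1sum", "foaf:homepage"}


def ifp_coref(triples):
    """Return subject pairs that share a value for some IFP."""
    subjects_by_value = defaultdict(set)
    for s, p, o in triples:
        if p in IFPS:
            subjects_by_value[(p, o)].add(s)
    pairs = set()
    for subjects in subjects_by_value.values():
        # Every two subjects with the same (property, value) are co-referent.
        pairs.update(combinations(sorted(subjects), 2))
    return pairs


# Hypothetical triples: ex:a and ex:b publish the same mailbox digest.
triples = [
    ("ex:a", "foaf:mbox_sha1sum", "abc123"),
    ("ex:b", "foaf:mbox_sha1sum", "abc123"),
    ("ex:c", "foaf:mbox_sha1sum", "def456"),
    ("ex:a", "foaf:name", "Alice"),
    ("ex:c", "foaf:name", "Alice"),  # a shared name is not an IFP match
]
coref_pairs = ifp_coref(triples)
```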
Methodology: Rule-based Model
In text processing, very similar name mentions in a document are more likely to be co-referent
This heuristic is also used in disambiguating name mentions in citations in a single paper or Web page
A similar heuristic is useful for a “knows graph” extracted from a single foaf profile
– Distinct entries in one agent's foaf:knows list are taken to be different people

{?a foaf:knows ?b. ?a foaf:knows ?c. ?b neq ?c} => {?b :notCoref ?c}
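The knows-graph rule can be sketched the same way. The code below assumes that ?b neq ?c means the two identifiers differ as strings, and it works on hypothetical triples; it is not the authors' implementation.

```python
from collections import defaultdict
from itertools import combinations


def knows_not_coref(triples):
    """Return pairs of distinct agents that appear in the same knows list."""
    known_by = defaultdict(set)
    for s, p, o in triples:
        if p == "foaf:knows":
            known_by[s].add(o)
    pairs = set()
    for friends in known_by.values():
        # combinations() only yields pairs of different identifiers.
        pairs.update(combinations(sorted(friends), 2))
    return pairs


# Hypothetical profile: ex:a knows three agents, ex:z knows one.
triples = [
    ("ex:a", "foaf:knows", "ex:b"),
    ("ex:a", "foaf:knows", "ex:c"),
    ("ex:a", "foaf:knows", "ex:d"),
    ("ex:z", "foaf:knows", "ex:b"),
]
not_coref_pairs = knows_not_coref(triples)
```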
Methodology – Vector Model
Support Vector Machine with a linear kernel
Features:
– Match/no match of any IFPs
– Distance measures over common property values (Levenshtein & 3-gram Dice score)
– Alias and entity mention resolution
– Property-specific feature comparison
– Knows graph comparisons: Jaccard coefficient of similarity of foaf names of one-hop neighbors
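To make the feature list more concrete, here is a sketch that builds a three-element vector for one pair of profiles: an IFP match flag, a Levenshtein-based name similarity, and the Jaccard coefficient over neighbor names. The dictionary layout, the normalization of the edit distance, and the choice of these three features are assumptions for illustration; the real system uses more features and feeds them to an SVM, which is not shown here.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance computed with the standard dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def jaccard(x, y) -> float:
    """Jaccard coefficient |X∩Y| / |X∪Y| of two collections."""
    x, y = set(x), set(y)
    if not x and not y:
        return 0.0
    return len(x & y) / len(x | y)


def features(p1: dict, p2: dict) -> list:
    """Build [ifp_match, name_similarity, knows_similarity] for one pair."""
    sha1, sha2 = p1.get("mbox_sha1sum"), p2.get("mbox_sha1sum")
    ifp_match = 1.0 if sha1 and sha1 == sha2 else 0.0
    n1, n2 = p1.get("name", "").lower(), p2.get("name", "").lower()
    longest = max(len(n1), len(n2)) or 1
    name_sim = 1.0 - levenshtein(n1, n2) / longest
    knows_sim = jaccard(p1.get("knows", []), p2.get("knows", []))
    return [ifp_match, name_sim, knows_sim]


# Hypothetical profiles with invented neighbor names.
p1 = {"name": "Jim Hendler", "mbox_sha1sum": "abc", "knows": ["Ann", "Bob"]}
p2 = {"name": "Jim Hendler", "mbox_sha1sum": "abc", "knows": ["Bob", "Cy"]}
vector = features(p1, p2)
```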
Methodology: Clustering
Pairs form clusters
Clusters used as part of system evaluation
Can result in:
– Entity to Entity pairing
– Cluster to Entity pairing
– Cluster to Cluster pairing
Greedy process with a confidence threshold
Use rule-based model to eliminate known non-coreferent pairs
Methodology – Clustering
Instance matching can result in new cluster formation and cluster matching can result in merged clusters.
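A greedy merge of this kind could be sketched as follows. The code is a simplification under stated assumptions: pairs are visited in order of decreasing confidence, merging stops below a threshold (0.8 here, an invented value), and a merge is refused when the rules have marked any cross-cluster pair as notCoref. It does not re-score merged clusters as abstract entities, which the full system does.

```python
def greedy_cluster(scored_pairs, not_coref, threshold=0.8):
    """Greedily merge clusters along high-confidence pairs.

    scored_pairs: iterable of (a, b, confidence)
    not_coref:    set of frozenset({x, y}) pairs ruled out by the rules
    """
    scored_pairs = list(scored_pairs)
    cluster_of = {}
    for a, b, _ in scored_pairs:
        cluster_of.setdefault(a, {a})
        cluster_of.setdefault(b, {b})

    for a, b, conf in sorted(scored_pairs, key=lambda t: -t[2]):
        if conf < threshold:
            break  # every remaining pair scores lower
        ca, cb = cluster_of[a], cluster_of[b]
        if ca is cb:
            continue  # already in the same cluster
        if any(frozenset((x, y)) in not_coref for x in ca for y in cb):
            continue  # the rules forbid this merge
        merged = ca | cb
        for e in merged:
            cluster_of[e] = merged

    unique = {id(c): c for c in cluster_of.values()}
    return sorted(sorted(c) for c in unique.values())


# Hypothetical scores; the rules say a and d are different people.
pairs = [("a", "b", 0.95), ("b", "c", 0.90), ("c", "d", 0.85), ("d", "e", 0.40)]
clusters = greedy_cluster(pairs, not_coref={frozenset(("a", "d"))})
```

In this run the c-d pair clears the threshold, but the merge is blocked because it would place a and d in one cluster.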
Evaluation
Two experiments
– E1: 50,000 triples, over 500 entity mentions, 600 classes used for training
– E2: 250,000 triples, over 3500 entity mentions, over 1800 classes for training
10-fold cross-validation tests
Evaluation
For E1: 900 pairs non-match, majority undetermined
E2: results shown below

Pairs      Rule                 Conclusion
9138326    differentFrom        Undetermined
47184      inverse functional   Undetermined
2402       inverse functional   Co-referent
8687410    knows graph          Undetermined
9138326    sameAs               Undetermined
1047874    knows                Not Co-referent
Evaluation
Results promising
During our E2 clustering phase:
– First phase: 90% accuracy
– Second phase: no new relationships among pairs; cluster to cluster pairing occurred
[Table: Classification Results using 10-fold Validation]
Evaluation
Retrieving additional FOAF profiles based on knows graph
Quickly retrieve large number of entities
Tightly linked
– reduced diversity of analyzed data
– more entities that are co-referent
Future experiments: a diversity filter spanning domains
Future Work
Evaluating the contribution of each rule and SVM feature to performance
Other ML approaches, e.g., Markov logic, EM
Exploiting better clustering algorithms
Adding more features, e.g., non-foaf vocabulary, non-RDF data (e.g., hosting site)
Applying approach to other RDF instances
Scalability:
Providing a non-batch, streaming service
Offering a coref Web service
Conclusions
We can treat instance linking as co-reference resolution & exploit in-doc and xdoc distinction
Good results with an ensemble approach combining rules and an SVM classifier
Apply clustering to form groups of co-referent relations and reprocess
Promising initial results