Disambiguation Problems in Digital Libraries
Tan Yee Fan
2006 August 11
WING Group Meeting
Introduction
Bibliographic digital libraries
  DBLP, Citeseer, ACM Portal, …
Metadata records
  Authors, title, venue, year, …
Inconsistencies and errors
  Typographical errors
  Abbreviation
  Different entities sharing same name
  …
Problem formulation
General disambiguation problem
  Given a list of data items X
  Find a function δ : X × X → {0, 1} such that
    δ(x1, x2) = 1 if x1 and x2 match
    δ(x1, x2) = 0 otherwise
Matching relation is not necessarily transitive
  δ(“ab”, “bc”) = 1 and δ(“bc”, “cd”) = 1, but δ(“ab”, “cd”) = 0
  If the relation is transitive, the problem becomes clustering/classification
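A minimal sketch of the non-transitivity example above. The matching rule used here (two strings match if they share at least one character) is an assumption chosen only because it reproduces the three δ values on the slide; the helper `is_transitive` is likewise hypothetical.

```python
from itertools import product


def delta(x1: str, x2: str) -> int:
    """Illustrative matching rule (an assumption, not from the talk):
    two strings match if they share at least one character."""
    return 1 if set(x1) & set(x2) else 0


def is_transitive(items, match) -> bool:
    """Check whether match(a, b) and match(b, c) always imply match(a, c)."""
    for a, b, c in product(items, repeat=3):
        if match(a, b) and match(b, c) and not match(a, c):
            return False
    return True


items = ["ab", "bc", "cd"]
transitive = is_transitive(items, delta)
```

With this rule, "ab" matches "bc" and "bc" matches "cd", yet "ab" does not match "cd", so the pairwise decisions cannot be read directly as a partition into clusters.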
Related fields
String similarity
  Edit distance, Jaro-Winkler, …
Abbreviation matching
  Mostly deals with biomedical texts and predefined formats
Data cleaning
  High level architectures by database people
Social network analysis
  Collaboration graphs of authors
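As an illustration of the string similarity measures listed above, here is a sketch of edit distance (Levenshtein distance) using the standard dynamic programming recurrence. The `normalized_similarity` helper, which scales the distance into [0, 1], is one common convention and is an assumption here, not something the talk specifies.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def normalized_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest
```

A threshold on such a similarity gives one possible matching function δ, for example to catch typographical variants of the same name.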
Citation matching, author name disambiguation
  Can be cast as classification/clustering
Usual information source
  Coauthor information, titles and venues
  i.e. within the records themselves (internal)
Models
  Naïve Bayes, K-means, SVM, vector space model, graphical models, …
Some apply methods to reduce number of comparisons required
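One widely used way to reduce the number of comparisons is blocking: records are grouped by a cheap key and only records within the same group are compared. The sketch below is an assumption about what such a method could look like, not the specific technique of any work the talk refers to; the key (last name plus first initial) and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(name: str) -> str:
    """Hypothetical blocking key: last name plus first initial, lowercased."""
    parts = name.lower().split()
    if not parts:
        return ""
    return parts[-1] + "_" + parts[0][0]


def candidate_pairs(names):
    """Return only the pairs of names that fall into the same block."""
    blocks = defaultdict(list)
    for name in names:
        blocks[blocking_key(name)].append(name)
    pairs = []
    for block in blocks.values():
        pairs.extend(combinations(block, 2))
    return pairs


names = ["Dongwon Lee", "D. Lee", "Byung-Won On", "Jinsoo Park", "J. Park"]
pairs = candidate_pairs(names)
```

For these five names, all-pairs comparison would need 10 comparisons, while blocking leaves 2 candidate pairs. The trade-off is that true matches with different keys are never compared.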
Resources
Internal resources
  May contain insufficient information
  Information may be difficult to extract
External resources
  Web resources, ontologies
  Contains additional freely available information
Objective
  Combine internal and external resources
Mixed citation problem
Given an ambiguous name X (belonging to k different authors)
Given a list of citations C containing X
Which citations in C belong to which author?
Yoojin Hong, Byung-Won On and Dongwon Lee. System Support for Name Authority Control Problem in Digital Libraries: OpenDBLP Approach. ECDL 2004.
Sudha Ram, Jinsoo Park and Dongwon Lee. Digital Libraries for the Next Millennium: Challenges and Research Directions. Information Systems Frontiers 1999.
Search engine results
For each citation c in C
  Query search engine with title of c to obtain relevant URLs
  Represent c by a feature vector of relevant URLs
  Each URL weighted by its inverse host frequency
Cosine similarity between feature vectors
Perform clustering on C to derive k clusters
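The steps above can be sketched as follows. Several details are assumptions: the search engine call is replaced by hard-coded, hypothetical URL lists; inverse host frequency is defined by analogy with inverse document frequency as log(N / host count) + 1, which the slides do not spell out; and single-link agglomerative clustering is used as the clustering step.

```python
import math
from collections import Counter
from urllib.parse import urlparse


def host_of(url: str) -> str:
    return urlparse(url).netloc.lower()


def ihf_weights(url_lists):
    """Assumed IHF: log(N / number of citations whose results hit the host) + 1."""
    n = len(url_lists)
    host_count = Counter()
    for urls in url_lists:
        host_count.update({host_of(u) for u in urls})  # count each host once per citation
    return {h: math.log(n / c) + 1.0 for h, c in host_count.items()}


def feature_vector(urls, weights):
    """Represent a citation as {URL: weight of the URL's host}."""
    return {u: weights[host_of(u)] for u in set(urls)}


def cosine(v1, v2):
    dot = sum(w * v2[k] for k, w in v1.items() if k in v2)
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    return 0.0 if n1 == 0 or n2 == 0 else dot / (n1 * n2)


def single_link_clusters(vectors, k):
    """Merge the two most similar clusters until k clusters remain."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single link: similarity of the closest pair of members.
                sim = max(cosine(vectors[i], vectors[j])
                          for i in clusters[a] for j in clusters[b])
                if best is None or sim > best[0]:
                    best = (sim, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])
        del clusters[b]
    return clusters


# Hypothetical search results, one URL list per citation.
url_lists = [
    ["http://a.example.edu/~lee/pubs.html", "http://dl.example.org/p1"],
    ["http://a.example.edu/~lee/pubs.html", "http://dl.example.org/p2"],
    ["http://b.example.com/lee/cv.html", "http://dl.example.org/p3"],
    ["http://b.example.com/lee/cv.html"],
]
weights = ihf_weights(url_lists)
vectors = [feature_vector(urls, weights) for urls in url_lists]
clusters = single_link_clusters(vectors, k=2)
```

In this toy data, hosts that appear for many citations (a large digital library, for instance) get lower weight than a personal publication page, which is the intuition behind weighting by inverse host frequency.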
External coauthor network
Coauthor network from DBLP metadata
  Each node represents a name
  Connected if they are coauthors in some DBLP citation
Delete the node representing X and its edges
Similarity between two author names computed as an inverse of their distance
Similarity between two citations is pairwise sum of their author similarities
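A minimal sketch of the coauthor network similarity described above, on a tiny hypothetical set of records instead of DBLP. The slides say only that similarity is "an inverse" of distance; the form 1 / (1 + d), with 0 for unreachable names, is an assumption made here so that identical names get similarity 1. All function names are hypothetical.

```python
from collections import defaultdict, deque
from itertools import combinations


def build_graph(records):
    """Nodes are author names; an edge joins names that coauthor a record."""
    graph = defaultdict(set)
    for authors in records:
        for a in authors:
            graph[a]  # make sure single-author names also become nodes
        for a, b in combinations(authors, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def remove_node(graph, name):
    """Delete the node for the ambiguous name and all of its edges."""
    for neighbour in graph.pop(name, set()):
        graph[neighbour].discard(name)


def distance(graph, src, dst):
    """Shortest path length by breadth-first search; None if unreachable."""
    if src == dst:
        return 0
    if src not in graph or dst not in graph:
        return None
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, d = queue.popleft()
        for neighbour in graph[node]:
            if neighbour == dst:
                return d + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, d + 1))
    return None


def name_similarity(graph, a, b):
    d = distance(graph, a, b)
    return 0.0 if d is None else 1.0 / (1 + d)  # assumed inverse form


def citation_similarity(graph, authors1, authors2, ambiguous_name):
    """Sum of name similarities over all pairs of coauthors, excluding X."""
    a1 = [a for a in authors1 if a != ambiguous_name]
    a2 = [a for a in authors2 if a != ambiguous_name]
    return sum(name_similarity(graph, a, b) for a in a1 for b in a2)


# Hypothetical records; "X" plays the role of the ambiguous name.
records = [["X", "A", "B"], ["B", "C"], ["X", "D"], ["D", "E"]]
graph = build_graph(records)
remove_node(graph, "X")
sim_near = citation_similarity(graph, ["X", "A"], ["X", "C"], "X")
sim_far = citation_similarity(graph, ["X", "A"], ["X", "E"], "X")
```

Removing X first matters: otherwise every coauthor of X would be at distance 2 from every other coauthor of X, and the network would not help separate the k authors.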
Results
[Bar chart, vertical axis from 0.83 to 0.86. Values 0.836, 0.844 and 0.850, in order of appearance, for the methods IHF (IP address, single link), Coauthor linkage (complete link) and Combined (hybrid).]
Venue name disambiguation
To determine e.g. “TREC” = “Text Retrieval Conference”
  Not using other parts of the citation records
Problems
  Abbreviations are extremely common
  Venues change name over time
Experiments using Google in progress
  Using URL features
  Using Google snippets