Disambiguation Problems in Digital Libraries Tan Yee Fan 2006 August 11 WING Group Meeting

Page 1: Disambiguation Problems in Digital Libraries

Disambiguation Problems in Digital Libraries

Tan Yee Fan

2006 August 11

WING Group Meeting

Page 2: Disambiguation Problems in Digital Libraries

Introduction

Bibliographic digital libraries
  DBLP, Citeseer, ACM Portal, …

Metadata records
  Authors, title, venue, year, …

Inconsistencies and errors
  Typographical errors
  Abbreviations
  Different entities sharing the same name
  …

Page 3: Disambiguation Problems in Digital Libraries
Page 4: Disambiguation Problems in Digital Libraries

Problem formulation

General disambiguation problem
  Given a list of data items X
  Find a function δ : X × X → {0, 1} such that
    δ(x1, x2) = 1 if x1 and x2 match
    δ(x1, x2) = 0 otherwise

The matching relation is not necessarily transitive
  δ(“ab”, “bc”) = 1 and δ(“bc”, “cd”) = 1, but δ(“ab”, “cd”) = 0
  If it is transitive, the problem becomes clustering/classification
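The non-transitive example can be reproduced with a toy matcher. The sketch below is only an illustration: the rule "two strings match if they share a character" and the name `delta` are assumptions chosen to fit the three example values, not a definition taken from the slides.

```python
def delta(x1: str, x2: str) -> int:
    """Toy matching function (assumed rule): 1 if the strings share a character."""
    return 1 if set(x1) & set(x2) else 0


# "ab" matches "bc" and "bc" matches "cd", yet "ab" does not match "cd",
# so this delta is not transitive and does not induce a partition of X.
print(delta("ab", "bc"), delta("bc", "cd"), delta("ab", "cd"))
```

Because the relation is not transitive here, grouping the items into disjoint clusters would necessarily contradict at least one pairwise decision.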

Page 5: Disambiguation Problems in Digital Libraries

Related fields

String similarity
  Edit distance, Jaro-Winkler, …

Abbreviation matching
  Mostly deals with biomedical texts and predefined formats

Data cleaning
  High-level architectures by database people

Social network analysis
  Collaboration graphs of authors
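As a concrete instance of the string similarity measures listed above, here is a minimal sketch of the standard Levenshtein edit distance (unit cost for insertion, deletion and substitution). The function name `edit_distance` is chosen for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


print(edit_distance("Citeseer", "CiteSeer"))
```

A small distance relative to the string length suggests a typographical variant, but it says little about abbreviations, which is one reason the other fields in the list are relevant.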

Page 6: Disambiguation Problems in Digital Libraries

Citation matching, author name disambiguation
  Can be cast as classification/clustering

Usual information source
  Coauthor information, titles and venues
  i.e. within the records themselves (internal)

Models
  Naïve Bayes, K-means, SVM, vector space model, graphical models, …

Some apply methods to reduce the number of comparisons required
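One common way to reduce the number of comparisons is blocking: records are grouped by a cheap key and only records within the same group are compared. The sketch below is an assumption about what such a method could look like, not the specific technique used in any of the cited work; the key (last name plus first initial) and the "First Last" name format are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(name: str) -> tuple:
    """Hypothetical key: (last name, first initial); assumes 'First Last' order."""
    parts = name.lower().split()
    return (parts[-1], parts[0][0])


def candidate_pairs(names):
    """Return only the pairs of names that fall into the same block."""
    blocks = defaultdict(list)
    for name in names:
        blocks[blocking_key(name)].append(name)
    pairs = []
    for members in blocks.values():
        pairs.extend(combinations(members, 2))
    return pairs


names = ["Dongwon Lee", "D. Lee", "Byung-Won On", "B. On", "Jinsoo Park"]
pairs = candidate_pairs(names)
# 5 names give 10 possible pairs; blocking leaves only the within-block pairs.
print(len(pairs), pairs)
```

The trade-off is that a poorly chosen key can separate true matches into different blocks, so they are never compared.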

Page 7: Disambiguation Problems in Digital Libraries

Resources

Internal resources
  May contain insufficient information
  Information may be difficult to extract

External resources
  Web resources, ontologies
  Contain additional freely available information

Objective
  Combine internal and external resources

Page 8: Disambiguation Problems in Digital Libraries

Mixed citation problem

Given an ambiguous name X (belonging to k different authors)

Given a list of citations C containing X

Which citations in C belong to which author?

Yoojin Hong, Byung-Won On and Dongwon Lee. System Support for Name Authority Control Problem in Digital Libraries: OpenDBLP Approach. ECDL 2004.

Sudha Ram, Jinsoo Park and Dongwon Lee. Digital Libraries for the Next Millennium: Challenges and Research Directions. Information Systems Frontiers 1999.

Page 9: Disambiguation Problems in Digital Libraries

Search engine results

For each citation c in C
  Query search engine with title of c to obtain relevant URLs
  Represent c by a feature vector of relevant URLs
  Each URL weighted by its inverse host frequency

Cosine similarity between feature vectors

Perform clustering on C to derive k clusters
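The feature construction and similarity steps can be sketched as follows. The slides do not give the exact inverse host frequency formula, so the IDF-style definition log(1 + N / hf(h)), where hf(h) is the number of citations whose results include host h, is an assumption. The URLs are made-up placeholders standing in for search engine results, and no search engine is queried.

```python
import math
from collections import Counter
from urllib.parse import urlparse


def host(url: str) -> str:
    return urlparse(url).netloc.lower()


def inverse_host_frequency(results):
    """Assumed IDF-style weight per host: log(1 + N / citations containing host)."""
    n = len(results)
    host_count = Counter()
    for urls in results.values():
        host_count.update({host(u) for u in urls})
    return {h: math.log(1 + n / c) for h, c in host_count.items()}


def vectorize(urls, ihf):
    """Sparse feature vector: one dimension per URL, weighted by its host's IHF."""
    return {u: ihf[host(u)] for u in set(urls)}


def cosine(a, b):
    dot = sum(w * b[u] for u, w in a.items() if u in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical search results, keyed by citation id.
results = {
    "c1": ["http://example.org/lee/pubs", "http://dblp.example.com/a"],
    "c2": ["http://example.org/lee/pubs", "http://dblp.example.com/b"],
    "c3": ["http://other.example.net/x", "http://dblp.example.com/c"],
}
ihf = inverse_host_frequency(results)
vectors = {c: vectorize(urls, ihf) for c, urls in results.items()}
print(cosine(vectors["c1"], vectors["c2"]), cosine(vectors["c1"], vectors["c3"]))
```

The resulting pairwise similarities would then be passed to a clustering algorithm to derive the k clusters. Hosts that appear for almost every citation receive a low weight, so a shared URL on a rare host counts for more.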

Page 10: Disambiguation Problems in Digital Libraries

External coauthor network

Coauthor network from DBLP metadata
  Each node represents a name
  Two nodes are connected if they are coauthors in some DBLP citation

Delete the node representing X and its edges

Similarity between two author names computed as an inverse of their distance

Similarity between two citations is the pairwise sum of their author similarities
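A minimal sketch of this computation on a toy network is given below. The slides say only that similarity is "an inverse" of the distance, so the form 1 / (1 + d), the value 0 for disconnected names, and the use of shortest-path length as the distance are assumptions; the author names are placeholders.

```python
from collections import deque


def build_network(author_lists, ambiguous):
    """Coauthor graph with the ambiguous name (and hence its edges) left out."""
    graph = {}
    for authors in author_lists:
        names = [a for a in authors if a != ambiguous]
        for a in names:
            graph.setdefault(a, set()).update(b for b in names if b != a)
    return graph


def distance(graph, source, target):
    """Shortest-path length by breadth-first search; None if unreachable."""
    if source not in graph or target not in graph:
        return None
    if source == target:
        return 0
    seen, queue = {source}, deque([(source, 0)])
    while queue:
        node, d = queue.popleft()
        for neighbour in graph[node]:
            if neighbour == target:
                return d + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, d + 1))
    return None


def name_similarity(graph, a, b):
    d = distance(graph, a, b)
    return 0.0 if d is None else 1.0 / (1.0 + d)  # assumed inverse form


def citation_similarity(graph, c1, c2, ambiguous):
    """Pairwise sum of author similarities, ignoring the ambiguous name."""
    return sum(name_similarity(graph, a, b)
               for a in c1 if a != ambiguous
               for b in c2 if b != ambiguous)


dblp = [["X", "A", "B"], ["B", "C"], ["X", "D"], ["D", "E"]]
graph = build_network(dblp, "X")
print(citation_similarity(graph, ["X", "A"], ["X", "C"], "X"))
print(citation_similarity(graph, ["X", "A"], ["X", "E"], "X"))
```

Removing X matters: with X left in the graph, every coauthor of any person named X would be at most two steps from every other, which would blur exactly the distinction the method is trying to make.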

Page 11: Disambiguation Problems in Digital Libraries

Results

[Chart: axis range 0.83 to 0.86]

Values, in chart order: 0.836, 0.844, 0.850

Series, in legend order:
  IHF (IP address, single link)
  Coauthor linkage (complete link)
  Combined (hybrid)

Page 12: Disambiguation Problems in Digital Libraries

Venue name disambiguation

To determine e.g. “TREC” = “Text Retrieval Conference”
  Not using other parts of the citation records

Problems
  Abbreviations are extremely common
  Venues change name over time

Experiments using Google in progress
  Using URL features
  Using Google snippets
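To illustrate why string evidence alone is limited for venue names, here is a naive baseline that is not the approach described in the slides: it accepts an abbreviation if its letters appear in order within the full name and the first letters agree. The function name and the rule are assumptions made for illustration.

```python
def is_subsequence_abbreviation(abbr: str, full: str) -> bool:
    """Naive rule (assumed): same first letter, and abbr is an in-order subsequence."""
    letters = [ch for ch in full.lower() if ch.isalpha()]
    abbr = abbr.lower()
    if not abbr or not letters or abbr[0] != letters[0]:
        return False
    remaining = iter(letters)
    # 'ch in remaining' consumes the iterator, which enforces left-to-right order.
    return all(ch in remaining for ch in abbr)


print(is_subsequence_abbreviation("TREC", "Text Retrieval Conference"))
print(is_subsequence_abbreviation("SIGIR", "Text Retrieval Conference"))
```

A rule like this cannot link a venue to a former name that shares no letters with the abbreviation, and it can accept unrelated names by coincidence, which is where external evidence such as URL features or snippets may help.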