a real-time heuristic-based unsupervised method for name disambiguation in digital libraries
A Real-time Heuristic based Name Disambiguation Method for Digital Libraries
Muhammad Imran, Syed Zeeshan Haider Gillani, Maurizio Marchese
Outline
• Name Disambiguation problem
• Mixed and Split Citations
• Related work
• Our approach
• Experiments & results
• Conclusion
Name Disambiguation
Muhammad Imran
Author-1 Author-2 Author-3 Author-4
Multiple authors share the same name
Name variation-1: Muhammad Imran
Name variation-2: M. Imran
Name variation-3: Imran Muhammad
One author with multiple name variations
Name Disambiguation Types
M. Imran
Muhammad Imran, Malik Imran, Mehar Imran
Mixed citations
(Figure: mixed citation records in the DL.)
Name Disambiguation Types
Muhammad Imran
Author-1 Author-2 Author-3
Split citations
(Figure: split citations in the DL.)
Related Work
• Supervised approaches
  • Generative (naïve Bayes)
  • Discriminative (support vector machines)
  • Labor-intensive, high training cost
• Unsupervised approaches
  • Mostly fail to handle the name-variation issue
  • No user intervention
Our Contributions
• An end-to-end system
  • Retrieval -> pre-processing -> disambiguation
• A generic disambiguation approach
  • Unsupervised
  • Heuristic-based
  • Involves users' feedback
Our Approach
(Figure: layered pipeline, Layer-1 to Layer-4, applied to citation records (CR) containing both mixed and split citations. Steps shown, in order:)
• Discipline based clustering (a cluster is a subset of the citation records)
• Cluster selection
• Co-author based split & building the candidate principal authors' list
• Affiliation & candidate authors based merge
• Title & homepage based merge
• Principal cluster selection (user selected)
(Figure legend: title based vector, list of candidate principal authors, principal author.)
Hierarchical Clustering & Feature Representation
• Approaches
  • Agglomerative
  • Divisive
• Feature matrix X (N x D)
  • N (rows) = number of citation records
  • D (columns) = number of features
  • Xi,j = jth feature of the ith citation record
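As a rough illustration of the feature matrix, the sketch below treats each citation record as a row, consistent with Xi,j being the jth feature of the ith record. The binary "field=value" encoding, the field names and the sample records are assumptions made for this example; the slides do not say how features are encoded.

```python
# Hypothetical citation records; field names and values are illustrative only.
records = [
    {"coauthors": ["author-2", "author-3"], "venue": "venue-A", "affiliation": "univ-X"},
    {"coauthors": ["author-2"], "venue": "venue-A", "affiliation": "univ-X"},
    {"coauthors": ["author-9"], "venue": "venue-B", "affiliation": "univ-Y"},
]


def record_tokens(record):
    """Turn one citation record into a set of 'field=value' tokens."""
    tokens = {f"coauthor={name}" for name in record["coauthors"]}
    tokens.add(f"venue={record['venue']}")
    tokens.add(f"affiliation={record['affiliation']}")
    return tokens


def build_feature_matrix(records):
    """Return (X, features) where X[i][j] is 1 if record i has feature j, else 0."""
    features = sorted(set().union(*(record_tokens(r) for r in records)))
    X = [[1 if f in record_tokens(r) else 0 for f in features] for r in records]
    return X, features


X, features = build_feature_matrix(records)
print(len(X), "x", len(features))  # N x D
```

Each row of X can then be compared with any other row by a clustering step.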
Features: co-authorship
• Joint authors of a book, article …
• Available across DLs
• We use it as:
  • Principal author
  • Co-authors
(Figure: a citation record {author-1, author-2, author-3, author-4, author-5} divided into the principal author and the co-authors.)
Features: co-authorship
• Heuristic: “If a co-author appears in two different publications with the same principal author, then most likely both publications belong to the principal author.”
• Example
  • citation record-1: {author-1, author-2, ...}
  • citation record-2: {author-1, author-2, ...}
  • IF both records share co-author author-2 and principal author-1, THEN both most likely belong to principal author-1
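The co-author heuristic can be sketched as a simple set test. This is a minimal illustration rather than the system's actual code: it assumes author names are compared as exact strings, and the function name and sample records are hypothetical.

```python
def same_author_by_coauthor(record_a, record_b, principal):
    """True if both records list `principal` and share at least one other author."""
    a, b = set(record_a), set(record_b)
    if principal not in a or principal not in b:
        return False
    shared_coauthors = (a & b) - {principal}
    return len(shared_coauthors) > 0


# Hypothetical author lists for three citation records.
record_1 = ["author-1", "author-2", "author-3"]
record_2 = ["author-1", "author-2", "author-7"]
record_3 = ["author-1", "author-8"]

print(same_author_by_coauthor(record_1, record_2, "author-1"))  # True
print(same_author_by_coauthor(record_1, record_3, "author-1"))  # False
```

Records 1 and 2 share author-2, so the heuristic links them; record 3 has no co-author in common with record 1 and stays separate at this step.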
Features: Conference Venue
• A venue is the name of an event or outlet, e.g., a conference, workshop or journal.
• Available across DLs.
• Heuristic: “The venue information of two researchers with the same name can differentiate one from the other by examining the disciplines and sub-disciplines of each researcher's interest.”
Features: Author’s Affiliation
• The author’s affiliation with an institute, university, organization etc.
• Available across DLs.
• Heuristic: “If two publications with the same principal author name also share the same affiliation information, then both publications are considered to belong to the same author.”
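A minimal sketch of the affiliation heuristic, assuming all input records already carry the same principal author name and that two affiliations match when they are identical after lowercasing and whitespace normalization. The record ids and affiliation strings are made up for the example; real affiliation strings would likely need fuzzier matching.

```python
def merge_by_affiliation(records):
    """Group record ids whose normalized affiliation strings are identical.

    `records` is a list of (record_id, affiliation) pairs that all share
    the same principal author name.
    """
    groups = {}
    for record_id, affiliation in records:
        key = " ".join(affiliation.lower().split())
        groups.setdefault(key, []).append(record_id)
    return list(groups.values())


# Hypothetical records for one ambiguous name.
records_with_affiliation = [
    ("cr-1", "University X"),
    ("cr-2", "Institute Y"),
    ("cr-3", "university  x"),
]

print(merge_by_affiliation(records_with_affiliation))
# [['cr-1', 'cr-3'], ['cr-2']]
```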
Features: Authors Names
• An author’s name can have multiple name variations.
• For example: Muhammad Imran
  • M. Imran
  • Imran Muhammad
  • Muhammad I.
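The variations listed above can be generated mechanically. The sketch below is only an illustration: it assumes a two-token "First Last" name, and both function names are hypothetical.

```python
def name_variations(full_name):
    """Return simple variations of a two-token 'First Last' name."""
    first, last = full_name.split()  # raises ValueError if not exactly two tokens
    return {
        f"{first} {last}",      # Muhammad Imran
        f"{first[0]}. {last}",  # M. Imran
        f"{last} {first}",      # Imran Muhammad
        f"{first} {last[0]}.",  # Muhammad I.
    }


def may_be_same_name(name_a, name_b):
    """True if name_b matches a generated variation of name_a (case-insensitive)."""
    variants = {v.lower() for v in name_variations(name_a)}
    return " ".join(name_b.lower().split()) in variants


print(may_be_same_name("Muhammad Imran", "M. Imran"))     # True
print(may_be_same_name("Muhammad Imran", "Malik Imran"))  # False
```

A match here only marks a candidate: "M. Imran" also matches Malik Imran and Mehar Imran, which is why the other features are needed to separate them.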
Features: Publications titles
• Title as a string literal
• We maintain a vector of important keywords
• Represents the author’s interests
• A similarity measure between a given citation record and the vector can be useful
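One way to compare a citation record's title with the keyword vector is cosine similarity over keyword counts. The slides do not name the similarity measure, so cosine similarity, the tiny stopword list and the sample titles below are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

# A deliberately small, illustrative stopword list.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}


def keywords(title):
    """Count the non-stopword words of a title."""
    words = re.findall(r"[a-z]+", title.lower())
    return Counter(w for w in words if w not in STOPWORDS)


def cosine(u, v):
    """Cosine similarity between two keyword-count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Keyword vector built from titles already attributed to the author (hypothetical).
profile = Counter()
for known_title in [
    "Name disambiguation in digital libraries",
    "Clustering citation records in digital libraries",
]:
    profile.update(keywords(known_title))

related = cosine(profile, keywords("Heuristics for name disambiguation"))
unrelated = cosine(profile, keywords("Protein folding simulation"))
print(related > unrelated)  # True
```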
Features: Principal Author’s Homepage
• Homepage is the URL of an author's homepage.
Disambiguation System in Action
• Inter-related disciplines based formation of clusters
• Co-authors based split
• Affiliation based agglomerative clustering
• Pursuit of the remaining bits
• Exploits venue/discipline information
• Forms relatively big clusters
• Involves users and considers their selection among clusters
Inter-related disciplines based formation of clusters
Co-author Based Split
• Using k-means clustering
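The slide names k-means but gives no details, so the following is only a sketch of plain k-means under stated assumptions: each citation record is a binary vector marking which co-authors it lists, k is fixed at 2, and the first k points seed the centroids so that the result is deterministic. The data is invented for the example.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def kmeans(points, k, iterations=20):
    """Plain k-means; the first k points seed the centroids (an arbitrary choice)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = mean(members)
    return labels


# Columns (hypothetical co-authors): author-2, author-3, author-8, author-9
coauthor_vectors = [
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
]

print(kmeans(coauthor_vectors, k=2))  # [0, 1, 0, 1]
```

Records that share co-authors end up in the same cluster, which is the effect the split step relies on.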
Experiment & Evaluation
Dataset
• 50 most ambiguous researchers
• Manually annotated a golden dataset
• Used DBLP as the data source
• Used ADANA as the baseline approach
• Used precision, recall and F1 as performance measures
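For reference, the three measures can be computed as below for a single author. The slides do not specify whether scores were computed per cluster, per author or pairwise, so this set-based version and its sample record ids are assumptions.

```python
def precision_recall_f1(predicted, gold):
    """Compare the records assigned to an author with the gold-standard records."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical record ids: what the system assigned vs. the manual annotation.
predicted_records = {"cr-1", "cr-2", "cr-3", "cr-4"}
gold_records = {"cr-1", "cr-2", "cr-3", "cr-5", "cr-6"}

precision, recall, f1 = precision_recall_f1(predicted_records, gold_records)
print(precision, recall, round(f1, 3))  # 0.75 0.6 0.667
```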
Thank you! Muhammad Imran