a real-time heuristic-based unsupervised method for name disambiguation in digital libraries

A Real-time Heuristic based Name Disambiguation Method for Digital Libraries
Muhammad Imran, Syed Zeeshan Haider Gillani, Maurizio Marchese

Upload: muhammad-imran

Post on 17-Jul-2015


Page 1: A Real-time Heuristic-based Unsupervised Method for Name Disambiguation in Digital Libraries

A Real-time Heuristic based Name Disambiguation Method for Digital Libraries

Muhammad Imran, Syed Zeeshan Haider Gillani, Maurizio Marchese

Page 2: A Real-time Heuristic-based Unsupervised Method for Name Disambiguation in Digital Libraries

Outline

•  Name Disambiguation problem

•  Mixed and Split Citations

•  Related work

•  Our approach

•  Experiments & results

•  Conclusion

Page 3: A Real-time Heuristic-based Unsupervised Method for Name Disambiguation in Digital Libraries

Name Disambiguation

Muhammad Imran

Author-1 Author-2 Author-3 Author-4

Multiple authors share the same name

Muhammad Imran (name variation-1), M. Imran (name variation-2), Imran Muhammad (name variation-3)

One author with multiple name variations

Page 4: A Real-time Heuristic-based Unsupervised Method for Name Disambiguation in Digital Libraries

Name Disambiguation Types

Mixed citations

M. Imran: Muhammad Imran, Malik Imran, Mehar Imran

Figure: the DL holds mixed citation records for these authors under the single name M. Imran.

Page 5: A Real-time Heuristic-based Unsupervised Method for Name Disambiguation in Digital Libraries

Name Disambiguation Types

Muhammad Imran

Author-1 Author-2 Author-3

Split citations

Figure: the DL holds three sets of split citations, one each under Author-1, Author-2 and Author-3.

Page 6: A Real-time Heuristic-based Unsupervised Method for Name Disambiguation in Digital Libraries

Related Work

•  Supervised approaches

    •  Generative (naïve Bayes)

    •  Discriminative (support vector machines)

    •  Labor-intensive, high training cost

•  Unsupervised approaches

    •  Mostly fail to tackle the name-variation issue

    •  No user intervention

Page 7: A Real-time Heuristic-based Unsupervised Method for Name Disambiguation in Digital Libraries

Our Contributions

•  An end-to-end system

    •  Retrieval -> pre-processing -> disambiguation

•  A generic disambiguation approach

    •  Unsupervised

    •  Heuristics-based

    •  Involves users’ feedback

Page 8: A Real-time Heuristic-based Unsupervised Method for Name Disambiguation in Digital Libraries

Our Approach

Figure: the disambiguation pipeline, applied to citation records (CR) containing both mixed and split citations. Stages shown, in order: discipline based clustering; cluster selection by the user (a cluster is a subset of citation records); co-author based split and building of the candidate principal authors' list (cp); affiliation and candidate authors based merge; title and homepage based merge, using a title based vector; principal cluster selection for the user-selected principal author (pa). The figure labels four layers, Layer-1 to Layer-4.

Page 9: A Real-time Heuristic-based Unsupervised Method for Name Disambiguation in Digital Libraries

Hierarchical Clustering & Feature Representation

•  Approaches

    •  Agglomerative

    •  Divisive

Feature matrix X (N x D)

N (rows) = number of citation records

D (columns) = number of features

Xi,j = jth feature of the ith citation record

Page 10: A Real-time Heuristic-based Unsupervised Method for Name Disambiguation in Digital Libraries

Features: co-authorship

•  Joint authors of a book, article …

•  Available across DLs

•  We use it as:

    •  Principal author

    •  Co-authors

Figure: a citation record {author-1, author-2, author-3, author-4, author-5}, divided into the principal author and the co-authors.

Page 11: A Real-time Heuristic-based Unsupervised Method for Name Disambiguation in Digital Libraries

Features: co-authorship

•  Heuristics

“If a co-author appears in two different publications with the same principal author, then most likely both publications belong to the principal author.”

Figure: citation record-1 and citation record-2 each list {author-1, author-2, ...}. IF author-2 appears in both records and both records have principal author-1, THEN both records are attributed to principal author-1.
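The co-authorship heuristic can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the record layout (a dict with `principal` and `coauthors` keys) and the function names `normalize`, `share_coauthor` and `linked_pairs` are assumptions made for this example, and names are compared by exact match after simple normalization.

```python
from itertools import combinations


def normalize(name):
    """Lower-case and collapse whitespace so trivially different spellings match."""
    return " ".join(name.lower().split())


def share_coauthor(record_a, record_b):
    """Return True if both records have the same principal author name
    and at least one co-author in common (hypothetical record layout)."""
    if normalize(record_a["principal"]) != normalize(record_b["principal"]):
        return False
    co_a = {normalize(n) for n in record_a["coauthors"]}
    co_b = {normalize(n) for n in record_b["coauthors"]}
    return bool(co_a & co_b)


def linked_pairs(records):
    """List index pairs of records that the heuristic attributes to one author."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(records), 2)
        if share_coauthor(a, b)
    ]


records = [
    {"principal": "Muhammad Imran", "coauthors": ["Maurizio Marchese"]},
    {"principal": "Muhammad Imran", "coauthors": ["maurizio  marchese", "A. Author"]},
    {"principal": "Muhammad Imran", "coauthors": ["B. Author"]},
]
print(linked_pairs(records))
```

Records 0 and 1 share a co-author and are linked; record 2 shares none and stays separate until another feature provides evidence.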

Page 12: A Real-time Heuristic-based Unsupervised Method for Name Disambiguation in Digital Libraries

Features: Conference Venue

•  Venue represents an event name, e.g., a conference, workshop or journal name.

•  Available across DLs.

•  Heuristics

“The venue information of two researchers with the same name can differentiate one from the other by examining the disciplines and sub-disciplines of each researcher's interest.”

Page 13: A Real-time Heuristic-based Unsupervised Method for Name Disambiguation in Digital Libraries

Features: Author’s Affiliation

•  Author’s affiliation with an institute, university, organization, etc.

•  Available across DLs.

•  Heuristics

“If two publications with the same principal author name also share the same affiliation information, then both publications are considered to belong to the same author.”
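A minimal Python sketch of the affiliation heuristic is given below. It assumes each record is a dict with hypothetical `principal` and `affiliation` keys and treats two affiliations as equal only when they match exactly after lower-casing and whitespace cleanup. Real affiliation strings vary far more, so a production system would need fuzzier matching.

```python
from collections import defaultdict


def clean(text):
    """Lower-case and collapse whitespace (a deliberately simple normalization)."""
    return " ".join(text.lower().split())


def merge_by_affiliation(records):
    """Group record indices that share both principal author name and affiliation."""
    groups = defaultdict(list)
    for idx, rec in enumerate(records):
        key = (clean(rec["principal"]), clean(rec["affiliation"]))
        groups[key].append(idx)
    # Dicts preserve insertion order, so groups appear in first-seen order.
    return list(groups.values())


records = [
    {"principal": "Muhammad Imran", "affiliation": "University of Trento"},
    {"principal": "Muhammad Imran", "affiliation": "Some Other Institute"},
    {"principal": "Muhammad Imran", "affiliation": "university of  trento"},
]
print(merge_by_affiliation(records))
```

Records 0 and 2 end up in one group because their affiliations normalize to the same string. The affiliation values are illustrative only.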

Page 14: A Real-time Heuristic-based Unsupervised Method for Name Disambiguation in Digital Libraries

Features: Authors' Names

•  An author’s name can have multiple name variations.

•  For example: Muhammad Imran

    •  M. Imran

    •  Imran Muhammad

    •  Muhammad. I
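One way to test whether two strings could be variations of the same name is sketched below. This is an assumption-laden illustration rather than the method used in the talk: it ignores token order, treats a single letter as an initial that matches any token starting with that letter, and requires both names to have the same number of tokens. The function names are hypothetical.

```python
from itertools import permutations


def tokens(name):
    """Split a name into lower-case tokens, treating periods and commas as spaces."""
    return name.lower().replace(".", " ").replace(",", " ").split()


def token_match(x, y):
    """Two tokens match if equal, or if one is an initial of the other."""
    if x == y:
        return True
    return (len(x) == 1 and y.startswith(x)) or (len(y) == 1 and x.startswith(y))


def names_compatible(a, b):
    """True if some ordering of b's tokens matches a's tokens one by one."""
    ta, tb = tokens(a), tokens(b)
    if len(ta) != len(tb):
        return False
    return any(
        all(token_match(x, y) for x, y in zip(ta, perm))
        for perm in permutations(tb)
    )


for variant in ["M. Imran", "Imran Muhammad", "Muhammad. I"]:
    print(variant, names_compatible("Muhammad Imran", variant))
```

A check like this accepts all three variations listed above. It also accepts "M. Imran" for "Malik Imran", which is exactly the mixed-citation case, so name matching alone cannot disambiguate and the other features are still needed.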

Page 15: A Real-time Heuristic-based Unsupervised Method for Name Disambiguation in Digital Libraries

Features: Publication titles

•  Title as a string literal

•  We maintain a vector of important keywords

•  Represents the author’s interests

•  A similarity measure between a given citation record and the vector can be useful
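The slide does not say which similarity measure is used. As a sketch under that caveat, the example below builds a keyword-count vector from known titles and compares a new title against it with cosine similarity. The stop-word list, the tokenization and the function names are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Small illustrative stop-word list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "the", "of", "for", "in", "on", "and", "to", "with"}


def keywords(title):
    """Count the non-stop-word tokens of a title."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return Counter(w for w in words if w not in STOPWORDS)


def build_profile(titles):
    """Accumulate keyword counts over an author's known titles."""
    profile = Counter()
    for title in titles:
        profile.update(keywords(title))
    return profile


def cosine(u, v):
    """Cosine similarity between two sparse count vectors; 0.0 if either is empty."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


profile = build_profile([
    "Name disambiguation in digital libraries",
    "Author name disambiguation with heuristics",
])
related = keywords("Heuristic name disambiguation")
unrelated = keywords("Protein folding dynamics")
print(cosine(profile, related) > cosine(profile, unrelated))
```

A citation record whose title scores high against the profile is a candidate for merging into the principal author's cluster. The threshold for "high" would have to be tuned and is not given in the source.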

Page 16: A Real-time Heuristic-based Unsupervised Method for Name Disambiguation in Digital Libraries

Features: Principal Author’s Homepage

•  Homepage is the URL of an author's homepage.

Page 17: A Real-time Heuristic-based Unsupervised Method for Name Disambiguation in Digital Libraries

Disambiguation System in Action

•  Inter-related disciplines based formation of clusters

•  Co-authors based split

•  Affiliation based agglomerative

•  Pursuit of the remaining bits

Page 18: A Real-time Heuristic-based Unsupervised Method for Name Disambiguation in Digital Libraries

Inter-related disciplines based formation of clusters

•  Exploits venue/discipline information

•  Forms relatively big clusters

•  Involves users and considers their selection among clusters

Page 19: A Real-time Heuristic-based Unsupervised Method for Name Disambiguation in Digital Libraries

Inter-related disciplines based formation of clusters


Page 20: A Real-time Heuristic-based Unsupervised Method for Name Disambiguation in Digital Libraries

Co-author Based Split

•  Using k-means clustering

Page 21: A Real-time Heuristic-based Unsupervised Method for Name Disambiguation in Digital Libraries

Experiment & Evaluation

Dataset

•  50 most ambiguous researchers

•  Manually annotated a gold-standard dataset

•  Used DBLP as the data source

•  Used ADANA as the baseline approach

•  Used Precision, Recall and F1 as performance measures
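For reference, the three performance measures can be computed as in the sketch below. The slides do not state whether they are computed per author cluster or over record pairs. This example assumes the simplest case: one set of record identifiers predicted for an author, compared with the manually annotated set for that author. The function name and identifiers are hypothetical.

```python
def precision_recall_f1(predicted, gold):
    """Compare predicted record ids with gold-standard ids for one author."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Four records predicted, five in the gold set, three in common.
p, r, f1 = precision_recall_f1({1, 2, 3, 4}, {2, 3, 4, 5, 6})
print(p, r, round(f1, 3))
```

Precision falls when records of other authors are wrongly merged in (mixed citations), and recall falls when the author's own records are left out (split citations), so F1 summarizes both failure modes.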

Page 22: A Real-time Heuristic-based Unsupervised Method for Name Disambiguation in Digital Libraries

Experiment & Evaluation

Page 23: A Real-time Heuristic-based Unsupervised Method for Name Disambiguation in Digital Libraries

Thank you! Muhammad Imran
