effective string processing and matching for author entity
TRANSCRIPT
Effective String Processing and Matching for Author
Disambiguation
Source:Journal of Machine Learning Research’14Speaker:LIN,CI-JIE
Introduction
Track 2 of KDD Cup 2013 aims at determining duplicated authors in a data set from Microsoft Academic Search
Track 2 in KDD Cup 2013 is a task of name disambiguation
Overview of Data sets
1. Alleviation2. Unusual Name3. Inconsistent Information4. Typo5. Incomplete Name6. Empty Entry7. Missing Value8. Nickname9. Wrong matching between authors and papers10.Non-English characters
Main Strategies
The first strategy is that we identify duplicates mainly based on string matching
The second strategy is that if an author in Author.csv has no publication records in PaperAuthor.csv, then we assume that this author has no duplicates
The third strategy is to classify an author as Chinese or non-Chinese before any string matching
The First Implementation
2. Cleaning Split two consecutive uppercase characters
replace “CJ" with “C J.” Remove English honorics (e.g., “Mr." and “Dr.") Transform uppercase to lowercase Remove apostrophes and replace punctuation. For
example, “o'relly" becomes “orelly.“
The First Implementation
2. Cleaning Replace European alphabets with similar English
alphabets Replace common English nicknames
replace “bill” with “william."
The First Implementation
3. Selection build a dictionary of (key, value) pairs. Each key is a set
of words, while each value is a set of authors containing the key
The First Implementation
4. Identification Matching Functions
1. Two names share the same set of words “ Chih Jen Lin“ and “ Lin Chih Jen.„
2. A shortened name " Ch. J. Lin" and " Chih Jen Lin.“
3. A partially shortened name "C. J. Lin" and "C. J. Lint.“
4. Alias dry-run procedure
“ C J Lin “ and “ Chih Jen Lion “ are loosely identical, while “ Chih Jen Lin“ and “ Chen Ju Lin “ are not.
The First Implementation
5. Splitting Each author name is a full name with two words. Neither author name is a partially shortened
name of the other. " kazuo kobayashi" and "kunikazu kobayashi"
The Second Implementation
5. Splitting Common extended names(CEN) Longest common extended name(LCEN)
The Second Implementation
6. Linking previous stages group duplicated names rather
than identifiers. However, the competition task is to group
duplicated identifiers
Ensemble
the two implementations detect different sets of duplicates,an ensemble of their results may improve the performance
The filter considers that two authors have a similar background Affiliation、 field of study
Typo Correction
handle typos in author names of Author.csvTwo author names are considered as
duplicates iftheir word sets are the same after treating each
typo the same as after its correction,andtheir affiliations share at least one common word