Author Name Disambiguation for Citations Using Topic and Web Correlation

Upload: kenny-boston

Post on 11-Dec-2015

TRANSCRIPT

  • Slide 1

Author Name Disambiguation for Citations Using Topic and Web Correlation

  • Slide 2

Prior Work
Supervised classification approaches: model each author's patterns from a set of training data.
Unsupervised classification approaches: cluster ambiguous citations into groups of distinct authors by measuring the similarities between the attributes in the citations.

  • Slide 3

Proposed Approach
Topic Correlation
Web Correlation
Pair-Wise Grouping Algorithm

  • Slide 4

Topic Correlation
Build a topic association network from the citations:
1. Apriori
2. k-way hypergraph partition
3. Topic association network

  • Slide 5

Web Correlation
Use each title to query a search engine.
Filter the URLs of several digital libraries.
If two citations appear in the same URL, we use them as an instance of Web correlation.

  • Slide 6

Pair-Wise Grouping Algorithm
Generate pairs of citations by using similarity metrics.
Use the training data to train a binary classifier.
Apply the classifier to determine whether the pairs are matched.
Combine the predicted results to group the citations into appropriate clusters.
Filter out the pairs that would make the clusters sparse.

  • Slide 7

Pair-Wise Similarity Metrics
Similarity metrics for coauthor, title, and venue:
1. CSM
2. MSF
Similarity metric for topic correlation: TSM
Similarity metric for Web correlation: MNDF

  • Slide 8

Binary Classifier
A binary classifier is used to learn the distribution of pair-wise vectors.
The pairs predicted as matched are used to build citation clusters (by constructing an undirected graph).

  • Slide 9

Cluster Filter
A threshold is set for choosing which bridges should be removed.
A bridge is removed if the numbers of vertices in the two separate, but connected, components it joins are both above the given threshold.

  • Slide 10

Detecting Ambiguous Author Names in Crowdsourced Scholarly Data

  • Slide 11

Prior Work
Name disambiguation has been cast as the problem of clustering a set of publications into profiles such that each profile corresponds to a single author.
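
The clustering and cluster-filter steps from Slides 8 and 9 of the first presentation can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names (`build_graph`, `reachable`, `filter_bridges`, `clusters`) are hypothetical, the matched pairs are assumed to come from some already-trained binary classifier, and "above the threshold" is read here as "strictly greater than the threshold on both sides of the bridge".

```python
from collections import defaultdict


def build_graph(matched_pairs):
    """Undirected graph: one vertex per citation, one edge per matched pair."""
    graph = defaultdict(set)
    for a, b in matched_pairs:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def reachable(graph, start, skip_edge=None):
    """Vertices reachable from `start`, optionally ignoring one edge."""
    skip = frozenset(skip_edge) if skip_edge is not None else None
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in graph[node]:
            if skip is not None and frozenset((node, nxt)) == skip:
                continue
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def filter_bridges(matched_pairs, threshold):
    """Drop each bridge whose two sides both have more than `threshold` vertices.

    Assumption: an edge is a bridge if removing it disconnects its endpoints.
    The check is done edge by edge, which is simple but quadratic.
    """
    graph = build_graph(matched_pairs)
    kept = []
    for a, b in matched_pairs:
        side_a = reachable(graph, a, skip_edge=(a, b))
        if b in side_a:
            kept.append((a, b))  # not a bridge: another path joins a and b
            continue
        side_b = reachable(graph, b, skip_edge=(a, b))
        if len(side_a) > threshold and len(side_b) > threshold:
            continue  # bridge between two large components: remove it
        kept.append((a, b))
    return kept


def clusters(citations, pairs):
    """Connected components of the graph, one sorted list per cluster."""
    graph = build_graph(pairs)
    seen = set()
    result = []
    for citation in citations:
        if citation in seen:
            continue
        component = reachable(graph, citation)
        seen |= component
        result.append(sorted(component))
    return result
```

For example, two triangles of citations joined by a single matched pair are split into two clusters when the threshold is 2, and stay as one cluster when the threshold is 3.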

  • Slide 12

Name Variations and Citations
Extract the name variations from a collection of publications.
Sort them by number of citations.
Look at the percentage of the total citations that is attributed to the top name variations. (A high percentage suggests that the name is not ambiguous.)

  • Slide 13

Topic Consistency
Leverage the discipline tags crowdsourced from the users of the Scholarometer system.
Detect different but related disciplines associated with an author name: map an author's publications to topics, and measure the similarity between these topics.
Derive an author's topic profile.

  • Slide 14

A Brief Survey of Automatic Methods for Author Name Disambiguation

  • Slide 15

Two Problems
Synonyms: the same author may appear under distinct names.
Polysems: distinct authors may have similar names.

  • Slide 16

Proposed Taxonomy

  • Slide 17

Author Grouping Methods
Defining a similarity function:
1. Using predefined functions: the Levenshtein distance, Jaccard coefficient, cosine similarity, soft-TFIDF, and others.
2. Learning a similarity function: use the training data to produce a similarity function S from R × R (R: the set of references) to {0, 1}, where 1 means that the two references refer to the same author and 0 means that they do not.
3. Exploiting graph-based similarity functions: create a coauthorship graph G = (V, E) for each ambiguous group. Each distinct coauthor name is represented by a vertex, and the weight of an edge is related to the number of articles coauthored by the author names represented by its two vertices.

  • Slide 18

Author Grouping Methods
Clustering techniques:
1. Partitioning
2. Hierarchical agglomerative clustering
3. Density-based clustering
4. Spectral clustering

  • Slide 19

Author Assignment Methods
Classification: assign the references to their authors using a supervised machine learning technique.
Clustering: use probabilistic techniques to determine the author iteratively, refitting the model at each step.
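
Three of the predefined similarity functions named on Slide 17 (Levenshtein distance, Jaccard coefficient, cosine similarity) can be sketched in plain Python as below. This is an illustrative sketch only: the function names and the choice to compare token lists are assumptions, and the surveyed methods may tokenize or weight terms differently (soft-TFIDF, for instance, is not covered here).

```python
import math
from collections import Counter


def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def jaccard(tokens_a, tokens_b):
    """Jaccard coefficient of two token collections, treated as sets."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a and not set_b:
        return 1.0  # assumption: two empty sets count as identical
    return len(set_a & set_b) / len(set_a | set_b)


def cosine(tokens_a, tokens_b):
    """Cosine similarity of raw term-frequency vectors."""
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(freq_a[t] * freq_b[t] for t in freq_a)
    norm = (math.sqrt(sum(v * v for v in freq_a.values()))
            * math.sqrt(sum(v * v for v in freq_b.values())))
    return dot / norm if norm else 0.0
```

A grouping method might, for example, apply `levenshtein` to author name strings and `jaccard` or `cosine` to tokenized titles or coauthor lists, then combine the scores before clustering.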

  • Slide 20

Explored Evidence
Citation information: the attributes directly extracted from the citations, such as author/coauthor names, work title, publication venue title, year, and so on.
Web information: data retrieved from the Web that is used as additional information about an author's publication profile.
Implicit evidence: evidence inferred from visible elements of attributes, such as the latent topics of a citation.

  • Slide 21

Summary of Characteristics: Author Grouping Methods

  • Slide 22

Summary of Characteristics: Author Assignment Methods

  • Slide 23

Open Challenges
Very little data in the citations
Very ambiguous cases: ambiguous references will have coauthors who also have ambiguous names (especially Asian names)
Citations with errors
Efficiency
Different knowledge areas: our focus is only on computer science
Incremental disambiguation
Author profile changes
New authors

  • Slide 24

pandasearch
Implicit evidence
Web information
CV

  • Slide 25

pandasearch
Type of approach: author grouping method, learning a similarity function.
Explored evidence: citation information, Web information, implicit evidence.