
Page 1: Clustering More than Two Million Biomedical Publications

Clustering More than Two Million Biomedical Publications

Comparing the Accuracies of Nine Text-Based Similarity Approaches

Boyack et al. (2011). PLoS ONE 6(3): e18029

Page 2:

Motivation

• Compare different similarity measurements
• Make use of a biomedical data set
• Process a large corpus

Page 3:

Procedures

1. Define a corpus of documents
2. Extract and pre-process the relevant textual information from the corpus
3. Calculate pairwise document-document similarities using nine different similarity approaches
4. Create similarity matrices keeping only the top-n similarities per document
5. Cluster the documents based on this similarity matrix
6. Assess each cluster solution using coherence and concentration metrics

Page 4:

Data

• Goal: build a corpus with titles, abstracts, MeSH terms, and reference lists
• Matched and combined data from the MEDLINE and Scopus (Elsevier) databases
• The resulting set was then limited to documents published from 2004-2008 that contained abstracts, at least five MeSH terms, and at least five references in their bibliographies
• This resulted in a corpus comprising 2,153,769 unique scientific documents
• Base matrix: word-document co-occurrence matrix
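
The base matrix above can be sketched as a simple term-by-document count structure; this is a minimal illustration (the function name and toy documents are hypothetical, not from the paper):

```python
from collections import Counter

def word_document_matrix(docs):
    """Build a word-document co-occurrence (count) matrix as nested dicts.
    docs: dict mapping doc_id -> raw text."""
    matrix = {}  # term -> {doc_id: count}
    for doc_id, text in docs.items():
        for term, count in Counter(text.lower().split()).items():
            matrix.setdefault(term, {})[doc_id] = count
    return matrix

docs = {
    "d1": "gene expression in cancer cells",
    "d2": "cancer gene therapy",
}
m = word_document_matrix(docs)
```

A real pipeline at this scale would use a sparse matrix library rather than dicts, but the row/column semantics are the same.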

Page 5:

Methods

Page 6:

tf-idf

• The tf–idf weight (term frequency–inverse document frequency)

• A statistical measure used to evaluate how important a word is to a document in a collection or corpus

• The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus.
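
The standard tf-idf weighting described above can be sketched in a few lines; this is an illustrative toy (function name and documents are hypothetical), using the common tf × log(N/df) form:

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: dict doc_id -> list of tokens. Returns doc_id -> {term: weight},
    where weight = term frequency * log(N / document frequency)."""
    n = len(docs)
    df = Counter()                       # in how many documents each term occurs
    for tokens in docs.values():
        df.update(set(tokens))
    weights = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        weights[doc_id] = {t: tf[t] * math.log(n / df[t]) for t in tf}
    return weights

docs = {
    "d1": ["cancer", "gene", "gene"],
    "d2": ["cancer", "therapy"],
}
w = tf_idf(docs)
```

Note how "cancer", which occurs in every document, gets weight 0, while the repeated, document-specific "gene" is weighted up.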

Page 7:

tf-idf


Page 8:

LSA

• Latent semantic analysis

Page 9:

LSA

Page 10:

BM25

• Okapi BM25
• A ranking function widely used by search engines to rank matching documents according to their relevance to a query
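
A minimal BM25 scoring function looks like this; it is a sketch of the standard Okapi formula with the usual defaults k1 = 1.2, b = 0.75 (the function name and toy corpus are hypothetical, not the paper's implementation):

```python
import math
from collections import Counter

def bm25_score(query, doc, corpus, k1=1.2, b=0.75):
    """Okapi BM25 score of a tokenized document `doc` for a list of
    `query` terms, given the full tokenized `corpus`."""
    n = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n          # average document length
    tf = Counter(doc)
    score = 0.0
    for term in query:
        df = sum(1 for d in corpus if term in d)      # document frequency
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1)  # non-negative idf variant
        denom = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * tf[term] * (k1 + 1) / denom
    return score

corpus = [["cancer", "gene", "gene"], ["gene", "therapy"], ["cell", "biology"]]
```

The term-frequency saturation (k1) and length normalization (b) are what distinguish BM25 from plain tf-idf ranking.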

Page 11:

BM25

Page 12:

SOM

• Self-organizing map
• A form of artificial neural network that generates a low-dimensional geometric model from high-dimensional data
• An SOM may be considered a nonlinear generalization of principal component analysis (PCA)

Page 13:

SOM

1. Randomize the map's nodes' weight vectors
2. Grab an input vector
3. Traverse each node in the map
   1. Use the Euclidean distance formula to find the similarity between the input vector and the map node's weight vector
   2. Track the node that produces the smallest distance (this node is the best matching unit, BMU)
4. Update the nodes in the neighbourhood of the BMU by pulling them closer to the input vector: Wv(t + 1) = Wv(t) + Θ(t)α(t)(D(t) − Wv(t))
5. Increase t and repeat from step 2 while t < λ
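
The steps above can be sketched as a toy 1-D SOM over 2-D inputs; this is an illustrative implementation (the decaying learning-rate and neighbourhood schedules are common choices, not taken from the paper):

```python
import math
import random

def train_som(data, n_nodes, n_steps, seed=0):
    """Train a toy 1-D self-organizing map on 2-D input vectors."""
    rng = random.Random(seed)
    # 1. Randomize the map's weight vectors
    weights = [[rng.random(), rng.random()] for _ in range(n_nodes)]
    for t in range(n_steps):
        # 2. Grab an input vector D(t)
        d = rng.choice(data)
        # 3. Find the best matching unit (smallest Euclidean distance)
        bmu = min(range(n_nodes), key=lambda i: math.dist(weights[i], d))
        # 4. Pull the BMU and its neighbours toward the input:
        #    Wv(t+1) = Wv(t) + theta(t) * alpha(t) * (D(t) - Wv(t))
        alpha = 0.5 * (1 - t / n_steps)                   # decaying learning rate
        radius = max(1.0, (n_nodes / 2) * (1 - t / n_steps))
        for v in range(n_nodes):
            theta = math.exp(-((v - bmu) ** 2) / (2 * radius ** 2))
            weights[v] = [w + theta * alpha * (x - w)
                          for w, x in zip(weights[v], d)]
    # 5. (loop above increases t and repeats while t < n_steps)
    return weights

trained = train_som([[0.0, 0.0], [1.0, 1.0]], n_nodes=4, n_steps=200)
```

Here Θ(t) is the neighbourhood kernel around the BMU and α(t) the learning rate, both shrinking over time so the map stabilizes.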

Page 14:

Topic modeling

• Three separate Gibbs-sampled topic models were learned at the following topic resolutions: T = 500, T = 1000 and T = 2000 topics.

• Dirichlet prior hyperparameter settings of β = 0.01 and α = 0.05N/(D·T) were used, where N is the total number of word tokens, D is the number of documents, and T is the number of topics.
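
As a quick numeric illustration of that prior, the α formula is just an arithmetic scaling; the token count below is hypothetical (the slide does not give N), while the document count matches the paper's corpus:

```python
def dirichlet_alpha(n_tokens, n_docs, n_topics):
    """Document-topic Dirichlet prior from the slide: alpha = 0.05 * N / (D * T)."""
    return 0.05 * n_tokens / (n_docs * n_topics)

# Hypothetical corpus size for illustration: 300 million word tokens,
# the paper's 2,153,769 documents, and T = 1000 topics.
alpha = dirichlet_alpha(300_000_000, 2_153_769, 1000)
```

The point of the scaling is that the prior mass per document-topic cell shrinks as the number of documents and topics grows.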

Page 15:

Topic modeling

Page 16:

PMRA

• The PMRA ranking measure is used to calculate ‘Related Articles’ in the PubMed interface

• The de facto standard
• Proxy

Page 17:

Similarity filtering

• Reduce matrix size
• Generate a top-n similarity file from each of the larger similarity matrices
• With n = 15, each document thus contributes between 5 and 15 edges to the similarity file
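
The top-n filtering step is straightforward to sketch; this toy (function name and scores are hypothetical) keeps only each document's n strongest edges:

```python
def top_n_edges(similarities, n=15):
    """Keep only each document's n highest-scoring edges.
    similarities: dict doc_id -> {other_doc_id: similarity score}."""
    return {
        doc: dict(sorted(sims.items(), key=lambda kv: kv[1], reverse=True)[:n])
        for doc, sims in similarities.items()
    }

sims = {"d1": {"d2": 0.9, "d3": 0.4, "d4": 0.7}}
filtered = top_n_edges(sims, n=2)
```

Filtering per document (rather than globally) guarantees every document keeps some edges and therefore stays connected to the graph handed to the clustering step.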

Page 18:

Clustering

• DrL (now called OpenOrd)
• A graph layout algorithm that calculates an (x, y) position for each document in a collection using an input set of weighted edges
• http://gephi.org/

Page 19:

Evaluation

• Textual coherence (Jensen-Shannon divergence)
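
The Jensen-Shannon divergence behind the coherence metric can be computed directly from two discrete distributions; this is a standard base-2 implementation (a sketch, not the paper's code):

```python
import math

def jensen_shannon(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete
    probability distributions given as equal-length lists."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]   # mixture distribution

    def kl(a, b):
        # Kullback-Leibler divergence, skipping zero-probability terms
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

In base 2 the value is bounded in [0, 1]: identical distributions score 0 and disjoint ones score 1, which makes it convenient for comparing cluster word distributions.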

Page 20:

Evaluation

• Concentration: a metric based on grant acknowledgements from MEDLINE, using a grant-to-article linkage dataset from a previous study

Page 21:

Results

Page 22:
Page 23:
Page 24:
Page 25:

Results (cont.)

Page 26: