Post on 24-Dec-2015

N-gram Topic Models for Bibliometric Analysis

Gideon Mann, David Mimno, and Andrew McCallum

Can topic models provide better measurements of the impact of research literature?

Bibliometrics and Scientometrics

Typically analyzes patterns of citations in research literature

Derek de Solla Price: “Little Science, Big Science”

Eugene Garfield: Science Citation Index, Journal Citation Reports

Comparing apples to apples: top journals by citations

Biochemistry and molecular biology:

J. Biol. Chem 405017

Cell 136472

Biochem.-US 96809

Mathematics:

Lect. Notes Math 6926

T. Am. Math. Soc 6469

J. Math. Anal. Appl. 6004

Source: Journal Citation Reports (2004)

What’s wrong with grouping by journal?

• 10 of the 200 most cited papers in CiteSeer are unpublished technical reports, and 15% of the most cited papers are from conference proceedings

• Open-access publication increasing, but venue information often not available

• Hand-entered ISI citation data noisy

• Article has only one venue, journals cover many topics

A topic model for N-grams

Determine whether the next word will be part of an n-gram based on the current word and the current hidden topic. “White house” is a collocation in politics, but may not be one in real estate.
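
The collocation decision can be illustrated with a toy sketch. It assumes a lookup table of probabilities keyed by (topic, previous word); the table `CONTINUE_PROB`, its values, the `default` fallback, and the function name are all hypothetical, invented here for illustration rather than taken from the model itself.

```python
import random

# Hypothetical probabilities that the word after (topic, previous word)
# continues an n-gram. The numbers are invented for illustration only.
CONTINUE_PROB = {
    ("politics", "white"): 0.9,
    ("real_estate", "white"): 0.1,
}

def continues_ngram(topic, prev_word, rng, default=0.05):
    """Draw a binary indicator: does the next word extend the current n-gram?"""
    p = CONTINUE_PROB.get((topic, prev_word), default)
    return rng.random() < p

rng = random.Random(0)  # fixed seed so the sketch is repeatable
draws = 1000
politics = sum(continues_ngram("politics", "white", rng) for _ in range(draws))
real_estate = sum(continues_ngram("real_estate", "white", rng) for _ in range(draws))
print(politics, real_estate)
```

In this toy setting "white" is followed by a collocation far more often under the politics topic than under the real estate topic, which is the behavior the slide describes.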

Sample n-gram topics

1. Digital Libraries (102): digital, electronic, library, metadata, access; “digital libraries”, “digital library”, “electronic commerce”, “dublin core”, “cultural heritage”

2. WWW (129): web, site, pages, page, www, sites; “world wide web”, “web pages”, “web sites”, “web site”, “world wide”

3. Ontologies (186): semantic, ontology, ontologies, rdf, semantics, meta; “semantic web”, “description logics”, “rdf schema”, “description logic”, “resource description framework”

4. Web services (184): web, services, service, xml, business; “web services”, “web service”, “markup language”, “xml documents”, “xml schema”

Assigning topics to documents

1. Build a 200 topic n-gram topic model on 300k documents

2. Remove stopword or methodological topics (e.g. “efficient, fast, speed”)

3. For each document d, if more than 10% of d’s tokens are assigned to topic t, and that comprises more than two tokens, assign d to t
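
Step 3 above can be sketched as follows, assuming each document is available as a list of per-token topic ids from the trained model. The function name, the parameter names, and the `excluded` set (standing in for the topics removed in step 2) are assumptions made for this example.

```python
from collections import Counter

def assign_topics(token_topics, min_fraction=0.10, min_tokens=2,
                  excluded=frozenset()):
    """Return the set of topics a document is assigned to.

    A topic qualifies when it accounts for more than `min_fraction` of the
    document's tokens AND for more than `min_tokens` tokens in absolute terms.
    """
    counts = Counter(token_topics)
    total = len(token_topics)
    assigned = set()
    for topic, n in counts.items():
        if topic in excluded:
            continue  # stopword or methodological topic dropped in step 2
        if n > min_fraction * total and n > min_tokens:
            assigned.add(topic)
    return assigned

# Toy document with 20 tokens: topic 138 (5 tokens), 7 (2 tokens), 50 (13 tokens).
doc = [138] * 5 + [7] * 2 + [50] * 13
print(assign_topics(doc, excluded={50}))
```

Both comparisons are strict, matching the "more than" wording of the rule, so a topic with exactly two tokens or exactly 10% of the tokens is not assigned.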

Each topic is now an intellectual “domain” that includes some number of documents. We can substitute topic for journal in most traditional bibliometric indicators. We can also now define several new indicators.

Impact Factor

Journal Impact Factor: Citations from articles published in 2004 to articles in Cell published in 2002-3, divided by the number of articles published in Cell in 2002-3.

2004 Impact factors from JCR:

Nature 32.182

Cell 28.389

JMLR 5.952

Machine Learning 3.258

Topic Impact Factor
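
Substituting topic for journal in the journal definition gives one plausible reading of a topic impact factor. The sketch below assumes that reading; the data layout (`papers` as a dict of id to publication year and topic set, `citations` as pairs of citing year and cited paper id) and the choice to return 0.0 for an empty window are assumptions made for this example.

```python
def topic_impact_factor(topic, year, papers, citations):
    """Citations made in `year` to the topic's papers from the two preceding
    years, divided by the number of such papers."""
    window = {year - 1, year - 2}
    in_window = {pid for pid, (pub_year, topics) in papers.items()
                 if topic in topics and pub_year in window}
    if not in_window:
        return 0.0  # assumption: no papers in the window means no impact factor
    n_cites = sum(1 for citing_year, cited in citations
                  if citing_year == year and cited in in_window)
    return n_cites / len(in_window)

# Hypothetical toy data: paper id -> (publication year, assigned topics).
papers = {
    "a": (2002, {138}),
    "b": (2003, {138, 7}),
    "c": (2001, {138}),   # too old for the 2004 window
    "d": (2003, {7}),
}
# (year of the citing article, id of the cited paper)
citations = [(2004, "a"), (2004, "a"), (2004, "b"),
             (2003, "a"), (2004, "c"), (2004, "d")]
print(topic_impact_factor(138, 2004, papers, citations))
```

Because a document can be assigned to several topics, a single paper may count toward more than one topic's impact factor, unlike the one-venue-per-article journal case.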

Broad Impact: Diffusion

Journal Diffusion: # of journals citing Cell divided by the total number of citations to Cell, over a given time period, times 100

Problem: relatively brittle at low citation counts. If a topic/journal is cited twice by two different topics/journals, it will have high diffusion.
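
A minimal sketch of the diffusion measure makes the brittleness concrete. It assumes the citations to a topic (or journal) are given as a list with one citing-source label per citation; the function name and the zero result for an empty list are choices made for this example.

```python
def diffusion(citing_sources):
    """Distinct citing topics/journals per 100 citations received."""
    if not citing_sources:
        return 0.0  # assumption: undefined case reported as zero
    return 100.0 * len(set(citing_sources)) / len(citing_sources)

rarely_cited = ["topic_a", "topic_b"]                 # two citations, two sources
heavily_cited = ["topic_a"] * 50 + ["topic_b"] * 50   # same two sources, 100 citations
print(diffusion(rarely_cited), diffusion(heavily_cited))
```

The rarely cited item reaches the maximum score of 100 while the heavily cited one scores 2, even though both are cited by the same two sources.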

Broad Impact: Diversity

Topic Diversity: Entropy of the distribution of citing topics

Better at capturing the broad end of the impact spectrum. With diffusion, by contrast, the highest-diffusion topics are identical to the least frequently cited topics.
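
Topic diversity can be sketched as the Shannon entropy of the empirical distribution of citing topics. The input format (one citing-topic label per citation) and the use of the natural logarithm are assumptions; the slides do not state a log base.

```python
import math
from collections import Counter

def topic_diversity(citing_topics):
    """Shannon entropy (in nats) of the distribution of citing topics."""
    counts = Counter(citing_topics)
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total) for n in counts.values())

narrow = ["ir"] * 9 + ["ml"]           # citations concentrated in one topic
broad = ["ir", "ml", "db", "nlp"] * 5  # citations spread evenly over four topics
print(topic_diversity(narrow), topic_diversity(broad))
```

Unlike diffusion, entropy depends on how evenly citations are spread across topics rather than on the ratio of distinct sources to total citations, so a large citation count does not drive the score toward zero.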

Topic diversity can also be measured for papers:

Longevity: Cited Half Life

Two views:

• Given a paper, what is the median age of citations to that paper?

• What is the median age of citations from current literature?
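
Both views reduce to a median over citation ages. The sketch below assumes ages are measured in whole publication years; the function names and the `None` result for an empty input are choices made for this example.

```python
from statistics import median

def cited_half_life(paper_year, citing_years):
    """View 1: median age of the citations a given paper has received."""
    ages = [y - paper_year for y in citing_years]
    return median(ages) if ages else None

def citing_half_life(current_year, cited_years):
    """View 2: median age of the works that current literature cites."""
    ages = [current_year - y for y in cited_years]
    return median(ages) if ages else None

# Hypothetical toy data.
print(cited_half_life(1990, [1991, 1992, 2000]))
print(citing_half_life(2004, [2003, 2001, 1994, 1984]))
```

The same functions apply to a topic by pooling the citations to, or from, all documents assigned to that topic.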

History: Topical Precedence

Within a topic, what are the earliest papers that received more than n citations?

Information Retrieval (138):

On Relevance, Probabilistic Indexing and Information Retrieval, Kuhns and Maron (1960)

Expected Search Length: A Single Measure of Retrieval Effectiveness Based on the Weak Ordering Action of Retrieval Systems, Cooper (1968)

Relevance feedback in information retrieval, Rocchio (1971)

Relevance feedback and the optimization of retrieval effectiveness, Salton (1971)

New experiments in relevance feedback, Ide (1971)

Automatic Indexing of a Sound Database Using Self-organizing Neural Nets, Feiten and Gunzel (1982)
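
A list like the one above could be produced by filtering a topic's papers on citation count and sorting by year. This is a sketch under assumed inputs: the record fields (`title`, `year`, `topics`, `citations`), the cutoff `k`, and the toy records are all hypothetical.

```python
def topical_precedence(papers, topic, min_citations, k=5):
    """Earliest papers in `topic` that received more than `min_citations` citations."""
    qualifying = [p for p in papers
                  if topic in p["topics"] and p["citations"] > min_citations]
    # sorted() is stable, so papers from the same year keep their input order
    return sorted(qualifying, key=lambda p: p["year"])[:k]

# Hypothetical toy records, not real citation counts.
papers = [
    {"title": "A", "year": 1971, "topics": {138}, "citations": 40},
    {"title": "B", "year": 1960, "topics": {138}, "citations": 25},
    {"title": "C", "year": 1955, "topics": {138}, "citations": 3},   # too few citations
    {"title": "D", "year": 1950, "topics": {7},   "citations": 90},  # different topic
]
print([p["title"] for p in topical_precedence(papers, 138, min_citations=10)])
```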

Sharing: Topical Transfer
