Measuring Academic Influence: Not All Citations Are Equal

Xiaodan Zhu and Peter Turney (National Research Council Canada), Daniel Lemire (TELUQ, Université du Québec à Montréal), Andre Vellino (School of Information Studies, University of Ottawa)

Upload: andre-vellino

Post on 10-May-2015


DESCRIPTION

The importance of a research article is routinely measured by counting how many times it has been cited. However, treating all citations with equal weight ignores the wide variety of functions that citations perform. The research described in this presentation – work that was performed with Xiaodan Zhu, Peter Turney (National Research Council Canada) and Daniel Lemire (TELUQ, Université du Québec à Montréal) – aims to automatically identify the subset of references in a bibliography that have a central academic influence on the citing paper. To achieve this we examined the effectiveness of a variety of features in the citing paper that might plausibly predict the academic influence of a citation. We asked a group of authors to identify the key references in their own work and created a dataset in which citations were labeled according to their academic influence. Using automatic feature selection with supervised machine learning, we developed a model for predicting academic influence that achieves good performance on this dataset using only four features. The performance of these features inspired us to design an influence-primed h-index (the hip-index). According to our experiments, the hip-index is a better indicator of researcher performance than the conventional h-index.

TRANSCRIPT

Page 1: Measuring academic influence: Not all citations are equal

Xiaodan Zhu and Peter Turney National Research Council Canada

Daniel Lemire TELUQ, Université du Québec Montréal

Andre Vellino School of Information Studies, University of Ottawa, Ottawa

Measuring Academic Influence: Not All Citations Are Equal

Page 2: Measuring academic influence: Not all citations are equal

Overview
• Some background in Citation Analysis
• What we tried to do and why
• How we did it
• What the results were
• What the implications are

Page 3: Measuring academic influence: Not all citations are equal

What Is Citation Analysis?

Citation analysis refers to the collection of methods for measuring the importance of scholars, journals and institutions by counting citations in a graph of references in the published literature.


Page 4: Measuring academic influence: Not all citations are equal

Why Do Citation Analysis?

• Reason #1: Because it generates measurable quantities!

“Since we can’t really measure what interests us, we begin to be interested in what we can measure.”
Joel Westheimer, Professor of Education, University of Ottawa

Page 5: Measuring academic influence: Not all citations are equal

Uses for Citation Measures
• For Readers
  ◦ To evaluate the quality of articles / journals
• For Universities
  ◦ To evaluate the productivity of academics
  ◦ To help in tenure and promotion decisions
• For Journals
  ◦ To attract authors to publish
• For Libraries
  ◦ To make collections / acquisition decisions
  ◦ To make automated recommendations to users

Page 6: Measuring academic influence: Not all citations are equal

How Are Citations Counted?
• Add 1 for every new occurrence of a cited article
• Sum the results
• Average per article and/or count the total number of citations

Problems
• Self-citations!
• No measure of the quality of the citing source
• May be skewed by a small number of highly cited items
• Easy to “game” by tricking Google Scholar
  ◦ viz. Ike Antkare (h-index = 94) vs. Einstein (h-index = 84)

Page 7: Measuring academic influence: Not all citations are equal

h-index
• Jorge Hirsch (PNAS, 2005) defined the h-index:
  ◦ Attempts to measure both the productivity and impact of the author’s published work
  ◦ An author has index h if h of their N papers have at least h citations each, and the other (N − h) papers have at most h citations each.
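The definition above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `h_index` and the sample citation counts are assumptions, not part of the original slides.

```python
def h_index(citation_counts):
    """Return the largest h such that h papers have at least h citations each."""
    # Rank papers from most cited to least cited.
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank      # the top `rank` papers all have >= rank citations
        else:
            break         # counts only decrease from here, so stop
    return h


# Hypothetical citation counts for one author's six papers.
print(h_index([25, 8, 5, 3, 3, 0]))
```

Because the index is capped by the number of papers, ten papers with 100 citations each and ten papers with 10 citations each both give h = 10 under this sketch.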

Page 8: Measuring academic influence: Not all citations are equal

Some Criticisms of the h-index
• The h-index does not account for the number of authors or the order of the authors of a paper.
• The h-index cannot be used to compare authors in different fields.
• Young researchers with as yet short careers are at a built-in disadvantage compared with older researchers.
• It is constrained by the total number of publications.
  ◦ 10 papers with 100 citations each yield the same h-index (10) as 10 papers with 10 citations each.

“[h-index] captures a small amount of information about the distribution of a scientist's citations [and] loses crucial information that is essential for the assessment of research.”
Adler, R., Ewing, J., Taylor, P. Citation statistics. A report from the International Mathematical Union. http://www.mathunion.org/fileadmin/IMU/Report/CitationStatistics.pdf

Page 9: Measuring academic influence: Not all citations are equal

Journal Impact Factor (IF)
• Invented by Eugene Garfield in 1955 to identify journals for the Science Citation Index
• Definition:

$$\mathrm{JIF} = \frac{\text{Total Citations (2 preceding years)}}{\text{Total Articles (2 preceding years)}}$$

i.e. the impact factor of a journal is the average number of citations to those papers that were published during the two preceding years
  ◦ e.g. the number of times articles published in 2001 and 2002 were cited by indexed journals during 2003, divided by the total number of items published in 2001 and 2002
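As a rough illustration of the definition, the ratio can be computed as below. The function name `impact_factor` and all numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def impact_factor(citations_this_year, items_in_two_preceding_years):
    """Citations received this year by items from the two preceding years,
    divided by the number of items published in those two years."""
    if items_in_two_preceding_years <= 0:
        raise ValueError("the journal must have published at least one item")
    return citations_this_year / items_in_two_preceding_years


# Hypothetical journal: 120 items in 2001 and 130 items in 2002,
# which together were cited 500 times by indexed journals during 2003.
jif_2003 = impact_factor(500, 120 + 130)
print(jif_2003)  # 2.0
```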

Page 10: Measuring academic influence: Not all citations are equal

Some Criticisms of Impact Factor
• Letters or editorials in some journals (e.g. Nature) are often cited (and counted) in “Total Citations” (numerator) but not in “Total Articles” (denominator)
• 2-year window not applicable in many fields (e.g. in Math 90% of citations fall outside the 2-year window)
• IF varies considerably across disciplines (Math has an average of 0.9 citations per article, Life Sciences have an average of 6.2)

“Using the impact factor alone to judge a journal is like using weight alone to judge a person's health.”
Adler, R., Ewing, J., Taylor, P. Citation statistics. A report from the International Mathematical Union. http://www.mathunion.org/fileadmin/IMU/Report/CitationStatistics.pdf

Page 11: Measuring academic influence: Not all citations are equal

What We Did and Why

Page 12: Measuring academic influence: Not all citations are equal

One Big Assumption
All citations should count equally!
• As early as 1965 Garfield identified 15 different reasons for citing, e.g.
  ◦ giving credit for related work
  ◦ correcting a work
  ◦ criticizing previous work
• Many attempts since to categorize citations

Page 13: Measuring academic influence: Not all citations are equal

Citation Typing Ontology (CiTO)

Here are the first 21 of the 91 citation types in CiTO

Example of a semantically annotated article using CiTO:
http://imageweb.zoo.ox.ac.uk/pub/2008/plospaper/latest/#refs

Page 14: Measuring academic influence: Not all citations are equal

Our Objective
• Solve a binary classification problem:
  Given a Paper-Reference (P-R) pair, does P-R belong to the class “R is highly influential for P” or not?

Our Method
• Apply Machine Learning methods to train a computer to recognize a “Highly Influential Reference” from examples

Page 15: Measuring academic influence: Not all citations are equal

Step 1 – Data Collection

We believe that most papers are based on 1, 2, 3 or 4 essential references. By an essential reference, we mean a reference that was highly influential or inspirational for the core ideas in your paper; that is, a reference that inspired or strongly influenced your new algorithm, your experimental design, or your choice of a research problem. Other references merely support the work.

Page 16: Measuring academic influence: Not all citations are equal

We asked for
• Title of your paper (research papers only; no surveys)
• The essential references on which your paper builds

We got
• 100 papers
• 322 “influential” references
  ◦ i.e. 3.2 “influential references” per article
• Each paper
  ◦ Contained ~31 references in the References section
  ◦ Cited references ~54 times in the body of the paper
  ◦ i.e. each reference was cited an average of 1.7 times per paper

Page 17: Measuring academic influence: Not all citations are equal

The Problem
• The 100 papers yield 3143 paper-reference pairs
• The authors have selected ~320 paper-reference pairs
  ◦ Algorithmically: to accurately select those ~320 from the 3143

Page 18: Measuring academic influence: Not all citations are equal

Paper-Reference Analysis
• OpenNLP used to detect sentence boundaries and tokenize.
• ParsCit to parse the papers.
  ◦ ParsCit is an open-source package for parsing references and document structure in scientific papers.
• Regular expressions to capture citation occurrences in paper bodies that were not detected by ParsCit.
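The slides do not show the regular expressions that were used. The sketch below is one plausible pattern, assuming numeric bracketed citations such as [4] or [3,4,5]; the names `BRACKET_CITATION` and `find_citations` are hypothetical.

```python
import re

# Assumed citation style: bracketed numbers such as [4] or [3, 4, 5].
# Author-year styles like "(Hirsch, 2005)" would need a different pattern.
BRACKET_CITATION = re.compile(r"\[(\d+(?:\s*,\s*\d+)*)\]")


def find_citations(sentence):
    """Return one list of reference numbers per bracketed citation group."""
    groups = []
    for match in BRACKET_CITATION.finditer(sentence):
        # int() tolerates the whitespace left around each number after split.
        groups.append([int(number) for number in match.group(1).split(",")])
    return groups


print(find_citations("As shown in [4], and later extended in [3, 4,5]."))
```

Keeping each bracket group as its own list preserves whether a reference was cited alone or together with others, which some of the context-based features rely on.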

Page 19: Measuring academic influence: Not all citations are equal

Characteristics of Corpus

Page 20: Measuring academic influence: Not all citations are equal

We Looked at 5 Classes of Features
1. Count-based features
2. Similarity-based features
3. Context-based features
4. Position-based features
5. Miscellaneous features

Page 21: Measuring academic influence: Not all citations are equal

Count Based Features
• Total number of times a paper is referenced in the citing paper
• The number of different sections in which a given reference appears
• Number of times a paper is referenced in the
  ◦ “Related” section
  ◦ “Introduction” section
  ◦ “Core” sections (all sections excluding “Related”, “Introduction”, “Acknowledgements”, “Conclusion” and “Future Work”)
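A minimal sketch of how these counts could be derived for a single reference, assuming each in-text citation has already been tagged with the name of the section it occurs in. The function name, the feature names and the normalised section labels are assumptions made for illustration.

```python
# Sections excluded from "Core", following the slide; the lowercase labels
# are an assumed normalisation of the section headings.
NON_CORE = {"related", "introduction", "acknowledgements",
            "conclusion", "future work"}


def count_features(occurrence_sections):
    """occurrence_sections: one section label per in-text citation of a
    given reference, e.g. ["Introduction", "Method", "Method"]."""
    sections = [name.lower() for name in occurrence_sections]
    return {
        "count_in_paper": len(sections),
        "count_sections": len(set(sections)),
        "count_related": sections.count("related"),
        "count_intro": sections.count("introduction"),
        "count_core": sum(1 for name in sections if name not in NON_CORE),
    }


print(count_features(["Introduction", "Related", "Method", "Method", "Experiments"]))
```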

Page 22: Measuring academic influence: Not all citations are equal

Content-Similarity Based Features
Similarity between the citing article and the referenced articles:
• Title-Title
• Title-Abstract
• Title-Conclusion
• Title-Introduction
• Title-Core
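The slides do not say which similarity measure was used. As an assumption, the sketch below uses cosine similarity over simple word counts, which is one common way to compare two short texts such as a title and an abstract; the function names are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the word-count vectors of two texts."""
    a, b = bag_of_words(text_a), bag_of_words(text_b)
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical title pair.
print(cosine_similarity("Measuring academic influence",
                        "Academic influence of citations"))
```

A real pipeline would probably also remove stop words or weight terms (for example with TF-IDF); that is left out here to keep the sketch short.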

Page 23: Measuring academic influence: Not all citations are equal

Citing Context
• When an article is cited, the linguistic context in which the article is cited is considered as saying something about the cited article.

e.g. “Like Moravcsik and Murugesan (1975), we are concerned about the side effects of counting insignificant references”

Page 24: Measuring academic influence: Not all citations are equal

Context-Similarity Based Features
Sections of the citing article used in the comparison:
• Title
• Abstract
• Introduction
• Conclusion

Page 25: Measuring academic influence: Not all citations are equal

Other Context Based Features
• Authors explicitly mentioned in citation context?
• Citation alone [4] or with others [3,4,5]
• If “with others” is it first? (e.g. “[3]” is first in “[3,4,5]”)

Using pre-defined word lists, is the lexical content of a citation
• “relevant” [likewise, influential, inspiring, useful, …]
• “new” [recently, latest, current, improved, …]
• “extreme” [greatly, intensely, acutely, almighty, awfully]
• “comparative” [easy, easier, easiest, strong, stronger, …]
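Some of these checks can be sketched as follows, assuming citation groups have already been extracted as lists of reference numbers. The word list contains only the four “relevant” entries visible on the slide, so it is deliberately incomplete, and the function names are hypothetical.

```python
import re

# Only the entries shown on the slide; the full list is not reproduced there.
RELEVANT_WORDS = {"likewise", "influential", "inspiring", "useful"}


def group_features(group, ref_number):
    """group: reference numbers cited together, e.g. [3, 4, 5]."""
    return {
        "cited_alone": group == [ref_number],
        "first_in_group": len(group) > 1 and group[0] == ref_number,
    }


def has_relevant_word(context_sentence):
    """True if the citing sentence contains a word from the 'relevant' list."""
    words = set(re.findall(r"[a-z]+", context_sentence.lower()))
    return bool(words & RELEVANT_WORDS)


print(group_features([3, 4, 5], 3))
print(has_relevant_word("Likewise, we follow the approach of [3]."))
```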

Page 26: Measuring academic influence: Not all citations are equal

Lexical Context Features
Using a lexicon of 114,271 words obtained from the General Inquirer Lexicon (11,788 words) extended with WordNet and the Turney and Littman algorithm,
• Count the number of words labeled
  ◦ “Strong”
  ◦ “Positive”
  ◦ “Evaluative”

Also, sentiment analysis with a different lexicon gave us
• Presence / absence of “Emotion” (Joy, Sadness, Anger, Fear, etc.)
• “Positive” / “Negative”

Page 27: Measuring academic influence: Not all citations are equal

Position Based Features
Where does the citation occur?
• Citation appears at the beginning of a sentence? (Y/N)
• Citation appears at the end of a sentence? (Y/N)
• Where are the sentence(s) in which the citation(s) occur(s), e.g.
  ◦ 0 (First sentence) to 1 (Last sentence)
  ◦ distance from the mean of occurrences of all citations
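One way to read the first three items is sketched below: the sentence index is scaled to the range 0 to 1, and the start/end checks ignore sentence-final punctuation. Both function names and the exact normalisation are assumptions, since the slides give only the endpoints of the scale.

```python
def normalized_position(sentence_index, total_sentences):
    """Map a 0-based sentence index to [0, 1]: 0 = first, 1 = last sentence."""
    if total_sentences < 2:
        return 0.0  # a single-sentence text has only one possible position
    return sentence_index / (total_sentences - 1)


def citation_at_edges(sentence, marker):
    """Check whether the citation marker starts or ends the sentence."""
    # Drop surrounding whitespace and closing punctuation before comparing.
    stripped = sentence.strip().rstrip(".!?").strip()
    return {
        "at_start": stripped.startswith(marker),
        "at_end": stripped.endswith(marker),
    }


print(normalized_position(9, 10))
print(citation_at_edges("This was shown earlier [4].", "[4]"))
```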

Page 28: Measuring academic influence: Not all citations are equal

Count Based Features

Similarity Based Features

Context Based Features

Position Based Features

Misc. Features

Page 29: Measuring academic influence: Not all citations are equal

Top 7 Features: 4 “counts”, 3 “similarity”
• Counts in Paper
• Counts in Sections
• Counts in Core Section
• Title-Abstract Similarity
• Counts in Intro Section
• Title-Core Similarity
• Title-Intro Similarity

Page 30: Measuring academic influence: Not all citations are equal

Conventional Measures on Citation Graph
[Diagram: edge from citing paper C to reference R, with weight 1]

Page 31: Measuring academic influence: Not all citations are equal

Influence Primed Measures
[Diagram: edge from citing paper C to reference R, with weight X]

where X = (number of times C cites R)²
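The difference between the conventional weight (1 per citing paper) and the influence-primed weight (the squared number of in-text citations) can be sketched as below. The dictionary layout and the function names are assumptions made for illustration.

```python
from collections import defaultdict


def conventional_counts(mentions):
    """mentions: {(citing, cited): number of times `citing` cites `cited`}.
    Every citing paper contributes a weight of 1 to the cited paper."""
    totals = defaultdict(int)
    for (_citing, cited), _times in mentions.items():
        totals[cited] += 1
    return dict(totals)


def influence_primed_counts(mentions):
    """Every citing paper contributes (number of in-text citations) squared."""
    totals = defaultdict(int)
    for (_citing, cited), times in mentions.items():
        totals[cited] += times ** 2
    return dict(totals)


# Hypothetical graph: C1 cites R1 three times, C2 cites R1 twice,
# and C3 cites R2 once.
mentions = {("C1", "R1"): 3, ("C2", "R1"): 2, ("C3", "R2"): 1}
print(conventional_counts(mentions))      # {'R1': 2, 'R2': 1}
print(influence_primed_counts(mentions))  # {'R1': 13, 'R2': 1}
```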

Page 32: Measuring academic influence: Not all citations are equal

hip-index
• Each occurrence of a citation of paper R by paper C counts as 1
• The hip-index (influence-primed h-index) for an author is the largest number h such that at least h of the author's papers have an influence-primed citation count of at least h.

Page 33: Measuring academic influence: Not all citations are equal

Examples

Example 1: hip-index = 5, h-index = 2
Paper | Citations (times cited, squared)        | Influence-primed count
R1    | C1: 3 times = 9; C2: 2 times = 4        | 13
R2    | C3: 2 times = 4; C4: 2 times = 4        | 8
R3    | C5: 3 times = 9                         | 9
R4    | C6: 3 times = 9                         | 9
R5    | C7: 3 times = 9                         | 9
R6    | C8: 2 times = 4                         | 4
R7    | C9: 1 time = 1                          | 1

Example 2: hip-index = 3, h-index = 2
Paper | Citations (times cited, squared)        | Influence-primed count
R1    | C1: 2 times = 4; C2: 1 time = 1         | 5
R2    | C3: 2 times = 4; C4: 1 time = 1         | 5
R3    | C5: 2 times = 4                         | 4
R4    | C6: 1 time = 1                          | 1
R5    | C7: 1 time = 1                          | 1
R6    | C8: 1 time = 1                          | 1
R7    | C9: 1 time = 1                          | 1
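The two examples can be checked with a short sketch. It assumes each of the author's papers is represented as a list holding, for every citing paper, the number of times that paper cites it; the function names are hypothetical.

```python
def _h_from_scores(scores):
    """Largest h such that at least h of the scores are >= h."""
    ranked = sorted(scores, reverse=True)
    # In descending order the test `score >= rank` is true for a prefix only,
    # so counting the true cases gives h.
    return sum(1 for rank, score in enumerate(ranked, start=1) if score >= rank)


def h_index(papers):
    """Conventional h-index: each citing paper counts once."""
    return _h_from_scores([len(mentions) for mentions in papers])


def hip_index(papers):
    """Influence-primed h-index: each citing paper counts (times cited)^2."""
    return _h_from_scores([sum(m * m for m in mentions) for mentions in papers])


# Papers R1..R7 from the two examples; e.g. [3, 2] means cited 3 times by one
# paper and 2 times by another.
example_1 = [[3, 2], [2, 2], [3], [3], [3], [2], [1]]
example_2 = [[2, 1], [2, 1], [2], [1], [1], [1], [1]]

print(hip_index(example_1), h_index(example_1))  # 5 2
print(hip_index(example_2), h_index(example_2))  # 3 2
```

Both authors have the same conventional h-index, while the hip-index separates them because the first author's papers are cited repeatedly within the citing papers.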

Page 34: Measuring academic influence: Not all citations are equal

Using hip-index to Predict ACL Fellows
• Used the citation network constructed from ~20,000 papers in the Association for Computational Linguistics (ACL) Anthology
• Calculated the h-index of ACL Fellows
• Calculated the hip-index of ACL Fellows
• Compared the precision of h-index and hip-index
  ◦ Precision = the number of ACL Fellows in the top N divided by N
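The precision measure described here can be sketched as follows. The author names and the ranking are made up for illustration, and `precision_at_n` is a hypothetical helper name.

```python
def precision_at_n(ranked_authors, fellows, n):
    """Fraction of the top-n ranked authors who are Fellows."""
    if n <= 0:
        raise ValueError("n must be positive")
    top = ranked_authors[:n]
    return sum(1 for author in top if author in fellows) / n


# Hypothetical ranking (best index first) and hypothetical set of Fellows.
ranking = ["author_a", "author_b", "author_c", "author_d"]
fellows = {"author_a", "author_c"}

for n in (2, 3, 4):
    print(n, precision_at_n(ranking, fellows, n))
```

Ranking the same authors once by h-index and once by hip-index, and comparing the resulting precision values for each N, gives the kind of comparison the slide describes.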

Page 35: Measuring academic influence: Not all citations are equal

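[Figure: precision (ACL Fellows in the top N divided by N) for h-index and hip-index. Point labels: 1/2, 2/3, 1/4, 2/6, 3/10, 3/9, 4/11, 4/10, 5/11, 5/12]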

Page 36: Measuring academic influence: Not all citations are equal

Conclusions
• We can throw away h-index and Impact Factor etc. completely OR we can try to improve them by counting citations more relevantly
• A measure of academic influence for a citation is possible and
• It is easy to compute to a first approximation: merely count how often each reference is cited in the citing paper
• Apply the influence-primed weights on citation graphs to compute
  ◦ Influence-primed Impact Factor, g-index etc.

Page 37: Measuring academic influence: Not all citations are equal

Thanks!