TRANSCRIPT
Bibliometric-enhanced Retrieval Models for Big Scholarly Information Systems
Workshop on Scholarly Big Data: Challenges and Ideas. IEEE BigData 2013
Intro
• What are Big Scholarly Information Systems?
Intro
• What are bibliometric-enhanced IR models?
– a set of methods to quantitatively analyze scientific and technological literature
– e.g. citation analysis (h-index; see the sketch below)
– CiteSeer was a pioneering bibliometric-enhanced IR system
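The h-index mentioned above is straightforward to compute from a list of per-paper citation counts. A minimal Python sketch with toy data (illustrative only, not part of the systems described in this talk):

def h_index(citation_counts):
    # h-index: the largest h such that h papers have at least h citations each
    counts = sorted(citation_counts, reverse=True)
    h = 0
    for rank, cites in enumerate(counts, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h

print(h_index([10, 8, 5, 2, 1]))  # -> 3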
Background
• DFG-funded (2009-2013): Projects IRM I and IRM II
– IRM = Information Retrieval Mehrwertdienste (value-added IR services)
• Goal: Implementation and evaluation of value-added IR services for digital library systems
• Main idea: Applying scholarly (science) models for IR
– Co-occurrence analysis of controlled vocabularies (thesauri)
– Bibliometric analysis of core journals (Bradford's law)
– Centrality in author networks (betweenness)
• In IRM I we concentrated on the basic evaluation
• In IRM II we concentrate on the implementation of reusable (web) services
http://www.gesis.org/en/research/external-funding-projects/archive/irm/
Search Term Recommender (Petras 2006)
Search Term Service: recommending strongly associated terms from a controlled vocabulary (see the sketch below)
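A minimal sketch of the underlying idea, assuming documents that carry both free title terms and controlled-vocabulary descriptors; the association measure here is plain co-occurrence frequency, which only approximates the measures evaluated by Petras (2006):

from collections import Counter

# Toy corpus: each document pairs free title terms with thesaurus descriptors.
docs = [
    {"title_terms": {"unemployment", "youth"}, "descriptors": {"labour market", "adolescents"}},
    {"title_terms": {"unemployment", "policy"}, "descriptors": {"labour market", "social policy"}},
    {"title_terms": {"migration"}, "descriptors": {"migration", "social policy"}},
]

def recommend(query_term, docs, top_n=5):
    # Count how often each descriptor co-occurs with the query term.
    cooc = Counter()
    term_freq = 0
    for doc in docs:
        if query_term in doc["title_terms"]:
            term_freq += 1
            cooc.update(doc["descriptors"])
    if term_freq == 0:
        return []
    # Normalize to a simple association strength between 0 and 1.
    return [(d, c / term_freq) for d, c in cooc.most_common(top_n)]

print(recommend("unemployment", docs))
# [('labour market', 1.0), ('adolescents', 0.5), ('social policy', 0.5)]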
Bradfordizing (White 1981, Mayr 2009)
Bradford's Law of Scattering (Bradford 1948): idealized example with 450 articles
Nucleus/Core: 150 papers in 3 Journals
Zone 2: 150 papers in 9 Journals
Zone 3: 150 papers in 27 Journals
Ranking by Bradfordizing: sorting the core journal papers / core books to the top (see the sketch below)
Bradfordized list of journals in informetrics, applied to monographs: the publisher serves as the sorting criterion
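A minimal sketch of Bradfordizing a result set (illustrative field names; for monographs one would count publishers instead of journals):

from collections import Counter

def bradfordize(results):
    # results: list of dicts with at least a "journal" key
    journal_counts = Counter(r["journal"] for r in results)
    # Most productive journals first; papers inherit their journal's rank.
    return sorted(results, key=lambda r: -journal_counts[r["journal"]])

hits = [
    {"id": 1, "journal": "Scientometrics"},
    {"id": 2, "journal": "JASIST"},
    {"id": 3, "journal": "Scientometrics"},
    {"id": 4, "journal": "Rare Journal"},
    {"id": 5, "journal": "Scientometrics"},
]
print([h["id"] for h in bradfordize(hits)])  # [1, 3, 5, 2, 4] – core journal papers first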
Author Centrality (Mutschke 2001, 2004)
Ranking by Author Centrality: sorting papers by central authors to the top (see the sketch below)
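A sketch of the author-centrality ranking, assuming the networkx library is available: build the co-author graph of the result set, compute betweenness centrality, and put papers by central authors on top (toy data):

import itertools
import networkx as nx

hits = [
    {"id": 1, "authors": ["A", "B"]},
    {"id": 2, "authors": ["B", "C"]},
    {"id": 3, "authors": ["C", "D"]},
    {"id": 4, "authors": ["E"]},
]

# Co-author network: every pair of co-authors of a paper gets an edge.
G = nx.Graph()
for h in hits:
    G.add_nodes_from(h["authors"])
    G.add_edges_from(itertools.combinations(h["authors"], 2))

centrality = nx.betweenness_centrality(G)

# Rank each paper by the centrality of its most central author.
ranked = sorted(hits, key=lambda h: -max(centrality[a] for a in h["authors"]))
print([h["id"] for h in ranked])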
Scenarios for combined ranking services
Iterative use: Result Set → Core Journal Papers → Central Author Papers → Relevant Papers
Simultaneous use: Result Set → Central Author Papers + Core Journal Papers
(see the sketch below)
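A rough sketch of the two combination scenarios, reusing bradfordize() from the earlier sketch and a hypothetical rank_by_author_centrality() helper that wraps the centrality ranking above:

def iterative_combination(results, top_k=100):
    # Iterative use: restrict to the top of the Bradfordized list (core
    # journal papers), then re-rank that subset by author centrality.
    # rank_by_author_centrality() is a hypothetical helper, see above.
    core = bradfordize(results)[:top_k]
    return rank_by_author_centrality(core)

def simultaneous_combination(results, top_k=100):
    # Simultaneous use: run both services on the full result set and
    # merge their top lists, dropping duplicates.
    core_top = bradfordize(results)[:top_k]
    central_top = rank_by_author_centrality(results)[:top_k]
    seen, merged = set(), []
    for hit in core_top + central_top:
        if hit["id"] not in seen:
            seen.add(hit["id"])
            merged.append(hit)
    return merged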
Prototype
http://multiweb.gesis.org/irsa/IRMPrototype
Evaluation
Main Research Issue:
Contribution to retrieval quality and usability
• Precision:
– Do central authors (core journals) provide more relevant hits?
– Do highly associated co-words have any positive effects?
• Value-adding effects:
– Do central authors (core journals) provide OTHER relevant hits?
– Do co-word relationships provide OTHER relevant search terms?
• Mashup effects:
– Do combinations of the services enhance the effects?
Evaluation Design
• precision in existing evaluation data:
– CLEF 2003-2007: 125 topics; 65,297 SOLIS documents
– KoMoHe 2007: 39 topics; 31,155 SOLIS documents
• plausibility tests:
– author centrality / journal coreness ↔ precision
– Bradfordizing ↔ author centrality
• precision tests with users (Online-Assessment-Tool; see the precision sketch after this list)
• usability tests with users (acceptance)
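The precision figures on the following slides are measured over the top of the ranked result lists. A minimal sketch of a precision@n computation with hypothetical relevance judgments; the actual CLEF/KoMoHe assessment setup may differ in cutoffs and pooling:

def precision_at_n(ranked_ids, relevant_ids, n=10):
    # Fraction of the top-n ranked documents that were judged relevant.
    top = ranked_ids[:n]
    return sum(1 for doc_id in top if doc_id in relevant_ids) / len(top)

print(precision_at_n([3, 7, 1, 9, 4], {1, 4, 8}, n=5))  # -> 0.4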
Evaluation of Bradfordizing on CLEF Data (Mayr 2013)
Precision between Bradford zones (core, zone 2, zone 3):

                   core   z2     z3
2003 articles      0.29   0.22   0.16
2004 articles      0.23   0.18   0.13
2005 articles      0.31   0.24   0.17
2006 articles      0.29   0.27   0.24
2007 articles      0.28   0.26   0.22
2005 monographs    0.21   0.16   0.19
2006 monographs    0.28   0.28   0.24
2007 monographs    0.24   0.21   0.23

Journal articles: significant improvement of precision from zone 3 to the core.
Monographs: slight improvement of the precision distribution between the three zones.
Evaluation of Author Centrality on CLEF Data
• moderate positive relationship between rate of networking and precision
• precision of TF-IDF rankings (0.60) significantly higher than author-centrality-based rankings (0.31) – BUT:
• very little overlap of documents at the top of the ranking lists: 90% of the relevant hits provided by author centrality did not appear at the top of the TF-IDF rankings
→ added precision of 28%
[Scatter plot: giant size vs. precision per topic; correlation Precision@10 – giant size: 0.25]
• author centrality seems to favor OTHER relevant documents than traditional rankings
• value-adding effect: a different view of the information space

avg. number of docs: 517
avg. number of authors: 664
avg. number of co-authors: 302
avg. giant size: 24
Result: overlap
Intersection of suggested top n=10 documents over all topics and services (Mutschke et al. 2011):
the top-10 result lists overlap only marginally! (see the sketch below)
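As a sketch, such an overlap figure can be computed by intersecting the top-n document ids of any two service rankings and normalizing by n (toy data):

def top_n_overlap(ranking_a, ranking_b, n=10):
    # Share of documents that appear in the top n of both rankings.
    top_a = {doc["id"] for doc in ranking_a[:n]}
    top_b = {doc["id"] for doc in ranking_b[:n]}
    return len(top_a & top_b) / n

tfidf_top = [{"id": i} for i in range(1, 11)]       # ids 1..10
centrality_top = [{"id": i} for i in range(9, 19)]  # ids 9..18
print(top_n_overlap(tfidf_top, centrality_top))     # 0.2 -> marginal overlap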
IRSA
IRSA: Workflow
Analysis
Output
Returning suggestions for any query term
IRM & Modeling Science
measuring the contribution of bibliometric-enhanced services to retrieval quality
deeper insights into the structure & functioning of science
bibliometric-enhanced services (structural attributes of the science system)
a way towards a formal model of science
References
• Mutschke, P., Mayr, P., Schaer, P., & Sure, Y. (2011). Science models as value-added services for scholarly information systems. Scientometrics, 89(1), 349–364. doi:10.1007/s11192-011-0430-x
• Lüke, T., Schaer, P., & Mayr, P. (2013). A framework for specific term recommendation systems. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '13) (pp. 1093–1094). New York, NY, USA: ACM Press. doi:10.1145/2484028.2484207
• Mayr, P. (2013). Relevance distributions across Bradford Zones: Can Bradfordizing improve search? In J. Gorraiz, E. Schiebel, C. Gumpenberger, M. Hörlesberger, & H. Moed (Eds.), 14th International Society of Scientometrics and Informetrics Conference (pp. 1493–1505). Vienna, Austria. Retrieved from http://arxiv.org/abs/1305.0357
• Hienert, D., Schaer, P., Schaible, J., & Mayr, P. (2011). A novel combined term suggestion service for domain-specific digital libraries. In S. Gradmann, F. Borri, C. Meghini, & H. Schuldt (Eds.), International Conference on Theory and Practice of Digital Libraries (TPDL) (pp. 192–203). Berlin: Springer. doi:10.1007/978-3-642-24469-8_21
Using IRSA
Thank you!
Dr Philipp Mayr
GESIS Leibniz Institute for the Social Sciences
Unter Sachsenhausen 6-8
50667 Cologne
Germany