TRANSCRIPT
Bibliometric-enhanced Retrieval Models for Big Scholarly Information Systems
Workshop on Scholarly Big Data: Challenges and Ideas. IEEE BigData 2013
Intro
• What are Big Scholarly Information Systems?
Intro
• What are bibliometric-enhanced IR models?
– a set of methods to quantitatively analyze scientific and technological literature
– e.g. citation analysis (h-index; see the sketch below)
– CiteSeer was a pioneering bibliometric-enhanced IR system
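The h-index mentioned above is straightforward to compute from a list of per-paper citation counts. A minimal Python sketch with toy data (illustrative only, not part of the systems described in this talk):

def h_index(citation_counts):
    # h-index: the largest h such that h papers have at least h citations each
    counts = sorted(citation_counts, reverse=True)
    h = 0
    for rank, cites in enumerate(counts, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h

print(h_index([10, 8, 5, 2, 1]))  # -> 3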
Background
• DFG-funded (2009-2013): Projects IRM I and IRM II
– IRM = Information Retrieval Mehrwertdienste (value-added IR services)
• Goal: Implementation and evaluation of value-added IR services for digital library systems
• Main idea: Applying scholarly (science) models for IR
– Co-occurrence analysis of controlled vocabularies (thesauri)
– Bibliometric analysis of core journals (Bradford's law)
– Centrality in author networks (betweenness)
• In IRM I we concentrated on the basic evaluation
• In IRM II we concentrate on the implementation of reusable (web) services
http://www.gesis.org/en/research/external-funding-projects/archive/irm/
Search Term Recommender (Petras 2006)
Search Term Service: recommending strongly associated terms from a controlled vocabulary (see the sketch below)
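A minimal sketch of the underlying idea, assuming documents that carry both free title terms and controlled-vocabulary descriptors; the association measure here is plain co-occurrence frequency, which only approximates the measures evaluated by Petras (2006):

from collections import Counter

# Toy corpus: each document pairs free title terms with thesaurus descriptors.
docs = [
    {"title_terms": {"unemployment", "youth"}, "descriptors": {"labour market", "adolescents"}},
    {"title_terms": {"unemployment", "policy"}, "descriptors": {"labour market", "social policy"}},
    {"title_terms": {"migration"}, "descriptors": {"migration", "social policy"}},
]

def recommend(query_term, docs, top_n=5):
    # Count how often each descriptor co-occurs with the query term.
    cooc = Counter()
    term_freq = 0
    for doc in docs:
        if query_term in doc["title_terms"]:
            term_freq += 1
            cooc.update(doc["descriptors"])
    if term_freq == 0:
        return []
    # Normalize to a simple association strength between 0 and 1.
    return [(d, c / term_freq) for d, c in cooc.most_common(top_n)]

print(recommend("unemployment", docs))
# [('labour market', 1.0), ('adolescents', 0.5), ('social policy', 0.5)]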
Bradfordizing (White 1981, Mayr 2009)
Bradford's Law of Scattering (Bradford 1948): idealized example with 450 articles
Nucleus/Core: 150 papers in 3 Journals
Zone 2: 150 papers in 9 Journals
Zone 3: 150 papers in 27 Journals
Ranking by Bradfordizing: sorting the core journal papers / core books to the top (see the sketch below)
Bradfordized list of journals in informetrics, applied to monographs: the publisher serves as the sorting criterion
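A minimal sketch of Bradfordizing a result set (illustrative field names; for monographs one would count publishers instead of journals):

from collections import Counter

def bradfordize(results):
    # results: list of dicts with at least a "journal" key
    journal_counts = Counter(r["journal"] for r in results)
    # Most productive journals first; papers inherit their journal's rank.
    return sorted(results, key=lambda r: -journal_counts[r["journal"]])

hits = [
    {"id": 1, "journal": "Scientometrics"},
    {"id": 2, "journal": "JASIST"},
    {"id": 3, "journal": "Scientometrics"},
    {"id": 4, "journal": "Rare Journal"},
    {"id": 5, "journal": "Scientometrics"},
]
print([h["id"] for h in bradfordize(hits)])  # [1, 3, 5, 2, 4] – core journal papers first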
Author Centrality (Mutschke 2001, 2004)
Ranking by Author Centrality: sorting papers by central authors to the top (see the sketch below)
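A sketch of the author-centrality ranking, assuming the networkx library is available: build the co-author graph of the result set, compute betweenness centrality, and put papers by central authors on top (toy data):

import itertools
import networkx as nx

hits = [
    {"id": 1, "authors": ["A", "B"]},
    {"id": 2, "authors": ["B", "C"]},
    {"id": 3, "authors": ["C", "D"]},
    {"id": 4, "authors": ["E"]},
]

# Co-author network: every pair of co-authors of a paper gets an edge.
G = nx.Graph()
for h in hits:
    G.add_nodes_from(h["authors"])
    G.add_edges_from(itertools.combinations(h["authors"], 2))

centrality = nx.betweenness_centrality(G)

# Rank each paper by the centrality of its most central author.
ranked = sorted(hits, key=lambda h: -max(centrality[a] for a in h["authors"]))
print([h["id"] for h in ranked])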
Scenarios for combined ranking services
Iterative use: Result Set → Core Journal Papers → Central Author Papers → Relevant Papers
Simultaneous use: Result Set → Central Author Papers + Core Journal Papers
(see the sketch below)
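A rough sketch of the two combination scenarios, reusing bradfordize() from the earlier sketch and a hypothetical rank_by_author_centrality() helper that wraps the centrality ranking above:

def iterative_combination(results, top_k=100):
    # Iterative use: restrict to the top of the Bradfordized list (core
    # journal papers), then re-rank that subset by author centrality.
    # rank_by_author_centrality() is a hypothetical helper, see above.
    core = bradfordize(results)[:top_k]
    return rank_by_author_centrality(core)

def simultaneous_combination(results, top_k=100):
    # Simultaneous use: run both services on the full result set and
    # merge their top lists, dropping duplicates.
    core_top = bradfordize(results)[:top_k]
    central_top = rank_by_author_centrality(results)[:top_k]
    seen, merged = set(), []
    for hit in core_top + central_top:
        if hit["id"] not in seen:
            seen.add(hit["id"])
            merged.append(hit)
    return merged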
Prototype
http://multiweb.gesis.org/irsa/IRMPrototype
Evaluation
Main Research Issue:
Contribution to retrieval quality and usability
• Precision:
– Do central authors (core journals) provide more relevant hits?
– Do highly associated co-words have any positive effects?
• Value-adding effects:
– Do central authors (core journals) provide OTHER relevant hits?
– Do co-word relationships provide OTHER relevant search terms?
• Mashup effects:
– Do combinations of the services enhance the effects?
Evaluation Design
• precision in existing evaluation data:
– CLEF 2003-2007: 125 topics; 65,297 SOLIS documents
– KoMoHe 2007: 39 topics; 31,155 SOLIS documents
• plausibility tests:
– author centrality / journal coreness ↔ precision
– Bradfordizing ↔ author centrality
• precision tests with users (Online-Assessment-Tool; see the precision sketch after this list)
• usability tests with users (acceptance)
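The precision figures on the following slides are measured over the top of the ranked result lists. A minimal sketch of a precision@n computation with hypothetical relevance judgments; the actual CLEF/KoMoHe assessment setup may differ in cutoffs and pooling:

def precision_at_n(ranked_ids, relevant_ids, n=10):
    # Fraction of the top-n ranked documents that were judged relevant.
    top = ranked_ids[:n]
    return sum(1 for doc_id in top if doc_id in relevant_ids) / len(top)

print(precision_at_n([3, 7, 1, 9, 4], {1, 4, 8}, n=5))  # -> 0.4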
Evaluation of Bradfordizing on CLEF Data (Mayr 2013)
Precision between Bradford zones (core, zone 2, zone 3):

                   core   z2     z3
2003 articles      0.29   0.22   0.16
2004 articles      0.23   0.18   0.13
2005 articles      0.31   0.24   0.17
2006 articles      0.29   0.27   0.24
2007 articles      0.28   0.26   0.22
2005 monographs    0.21   0.16   0.19
2006 monographs    0.28   0.28   0.24
2007 monographs    0.24   0.21   0.23

Journal articles: significant improvement of precision from zone 3 to the core.
Monographs: slight improvement of the precision distribution between the three zones.
Evaluation of Author Centrality on CLEF Data
• moderate positive relationship between rate of networking and precision
• precision of TF-IDF rankings (0.60) significantly higher than author-centrality-based rankings (0.31) – BUT:
• very little overlap of documents at the top of the ranking lists: 90% of the relevant hits provided by author centrality did not appear at the top of the TF-IDF rankings
→ added precision of 28%
[Scatter plot: giant size vs. precision per topic; correlation Precision@10 – giant size: 0.25]
• author centrality seems to favor OTHER relevant documents than traditional rankings
• value-adding effect: a different view of the information space

avg. number of docs: 517
avg. number of authors: 664
avg. number of co-authors: 302
avg. giant size: 24
Result: overlap
Intersection of suggested top n=10 documents over all topics and services (Mutschke et al. 2011):
the top-10 result lists overlap only marginally! (see the sketch below)
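As a sketch, such an overlap figure can be computed by intersecting the top-n document ids of any two service rankings and normalizing by n (toy data):

def top_n_overlap(ranking_a, ranking_b, n=10):
    # Share of documents that appear in the top n of both rankings.
    top_a = {doc["id"] for doc in ranking_a[:n]}
    top_b = {doc["id"] for doc in ranking_b[:n]}
    return len(top_a & top_b) / n

tfidf_top = [{"id": i} for i in range(1, 11)]       # ids 1..10
centrality_top = [{"id": i} for i in range(9, 19)]  # ids 9..18
print(top_n_overlap(tfidf_top, centrality_top))     # 0.2 -> marginal overlap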
IRSA
IRSA: Workflow
Analysis
Output
Returning suggestions for any query term
IRM & Modeling Science
measuring the contribution of bibliometric-enhanced services to retrieval quality
deeper insights into the structure & functioning of science
bibliometric-enhanced services (structural attributes of the science system)
a way towards a formal model of science
References
• Mutschke, P., Mayr, P., Schaer, P., & Sure, Y. (2011). Science models as value-added services for scholarly information systems. Scientometrics, 89(1), 349–364. doi:10.1007/s11192-011-0430-x
• Lüke, T., Schaer, P., & Mayr, P. (2013). A framework for specific term recommendation systems. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '13) (pp. 1093–1094). New York, NY, USA: ACM Press. doi:10.1145/2484028.2484207
• Mayr, P. (2013). Relevance distributions across Bradford Zones: Can Bradfordizing improve search? In J. Gorraiz, E. Schiebel, C. Gumpenberger, M. Hörlesberger, & H. Moed (Eds.), 14th International Society of Scientometrics and Informetrics Conference (pp. 1493–1505). Vienna, Austria. Retrieved from http://arxiv.org/abs/1305.0357
• Hienert, D., Schaer, P., Schaible, J., & Mayr, P. (2011). A novel combined term suggestion service for domain-specific digital libraries. In S. Gradmann, F. Borri, C. Meghini, & H. Schuldt (Eds.), International Conference on Theory and Practice of Digital Libraries (TPDL) (pp. 192–203). Berlin: Springer. doi:10.1007/978-3-642-24469-8_21
Using IRSA
Thank you!
Dr Philipp Mayr
GESIS Leibniz Institute for the Social Sciences
Unter Sachsenhausen 6-8
50667 Cologne
Germany