Semantic Journal Mapping for Search Visualization in a Large Scale Article Digital Library

Post on 11-Nov-2014

DESCRIPTION

In this paper we examine the scalability and utility of semantically mapping (visualizing) journals in a large-scale (5.7+ million article) science, technology and medical digital library. This work is part of a larger research effort to evaluate semantic journal and article mapping for refining search query results and placing them in visual context in a large-scale digital library. Here the Semantic Vectors software package is parallelized and evaluated to compute semantic distances between 2365 journals from the combined full text of each journal's articles. These distances are used to create a journal semantic map whose production scales and whose results are comparable to other maps of the scientific literature.

TRANSCRIPT

Semantic Journal Mapping for Search Visualization in a Large Scale Article Digital Library

Glen Newton1,2, Alison Callahan1, Michel Dumontier2

1National Research Council Canada, 2Carleton University

Second Workshop on Very Large Digital Libraries (VLDL) 2009

at ECDL 2009

Oct 2 2009 Corfu, Greece

Outline

• Maps of Science
• Background
• Research Interests
• Research Goals
• Process
• Scalability issues
• Environment
• Results
• Conclusions
• Future Work

[Figure: map of science, from Bollen et al. 2009, PLoS ONE]

[Figures: maps of science, from Leydesdorff & Rafols 2006]

Background

• Canada Institute for Scientific and Technical Information (CISTI): the Canadian national science library
• ~3000 active researchers at NRC
• Large collection of ~8.4m articles (full text + metadata) in science, technology, medicine (STM)
• 4100 journal titles
• ~1995 to 2009

Research Interests

• Domain-specific discovery
• Improved discovery in STM domains through results visualization and contextualization, browse/explore/refine
• Results set visualization: “mapping”

Research Goals

• Find a way to extract a journal (& article) semantic vector space
• Latent Semantic Analysis (LSA) works for small/medium sized corpora, but does not scale to large numbers of items and/or terms
• New alternative: Semantic Vectors (SV): uses random vectors & avoids expensive singular value decomposition (SVD)
• Can SV scale & generate a sensible semantic vector space of journals on a corpus of this size?
• Can the visualization produced be useful for query results visualization, refinement, discovery?
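The random-vector idea in the goals above can be illustrated with a minimal sketch. This is not the Semantic Vectors package's code: the function names, the sparse +1/-1 vectors and the `NONZERO` setting are assumptions chosen for illustration. Each term gets a fixed random vector, and a document vector is the count-weighted sum of its terms' vectors, so no SVD is needed.

```python
import math
import random
from collections import Counter

DIM = 512      # the slides use 512 semantic dimensions
NONZERO = 10   # assumed number of +1/-1 entries per random vector

def random_index_vector(term, dim=DIM, nonzero=NONZERO):
    """Sparse ternary random vector, seeded by the term so it is reproducible."""
    rng = random.Random(term)
    vec = [0.0] * dim
    for pos in rng.sample(range(dim), nonzero):
        vec[pos] = rng.choice((-1.0, 1.0))
    return vec

def document_vector(tokens, dim=DIM):
    """Sum of the random vectors of a document's terms, weighted by term count."""
    doc = [0.0] * dim
    for term, count in Counter(tokens).items():
        for i, value in enumerate(random_index_vector(term, dim)):
            doc[i] += count * value
    return doc

def cosine(a, b):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Toy "documents" (hypothetical text, already tokenized).
chem_a = document_vector("polymer synthesis catalyst reaction polymer".split())
chem_b = document_vector("catalyst reaction polymer yield".split())
econ = document_vector("market inflation monetary policy market".split())

sim_related = cosine(chem_a, chem_b)    # documents sharing several terms
sim_unrelated = cosine(chem_a, econ)    # documents sharing no terms
```

Documents that share terms end up with similar vectors, while unrelated documents are nearly orthogonal because independent sparse random vectors rarely overlap.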

Corpus

• Licensed journal articles from STM publishers: Elsevier, Springer, etc.
• ~4100 journal titles, classified into 23 categories (by librarians)
• ~8.4m journal articles
• Selection of articles/journals:
  – Only those with authors, abstract (no notices, obituaries, etc.)
  – Only English language articles
  – Only journals with >50 articles in corpus
  – Resulting corpus: 5,733,721 articles from 2231 journals
  – Categories overlapping: 1.53 categories per journal
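The selection rules above could be expressed as a simple filter. This is a sketch only: the record fields (`journal`, `authors`, `abstract`, `language`) are hypothetical, and it assumes the ">50 articles" threshold is applied after the per-article filters, which the slides do not state.

```python
from collections import Counter

MIN_ARTICLES_PER_JOURNAL = 50  # slides: only journals with >50 articles

def select_articles(articles, min_per_journal=MIN_ARTICLES_PER_JOURNAL):
    """Keep English articles with authors and an abstract, from large-enough journals."""
    eligible = [
        a for a in articles
        if a.get("authors") and a.get("abstract") and a.get("language") == "en"
    ]
    # Assumption: journal size is counted over the eligible articles only.
    per_journal = Counter(a["journal"] for a in eligible)
    return [a for a in eligible if per_journal[a["journal"]] > min_per_journal]

# Tiny hypothetical sample, using a threshold of 1 so the effect is visible.
sample = [
    {"journal": "J1", "authors": ["A"], "abstract": "x", "language": "en"},
    {"journal": "J1", "authors": ["B"], "abstract": "y", "language": "en"},
    {"journal": "J1", "authors": [], "abstract": "notice", "language": "en"},
    {"journal": "J2", "authors": ["C"], "abstract": "z", "language": "fr"},
    {"journal": "J3", "authors": ["D"], "abstract": "w", "language": "en"},
]
kept = select_articles(sample, min_per_journal=1)
```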

Category                                        # Journals
Agriculture & Biological Sciences               358
Arts and Humanities                             70
Biochemistry, Genetics and Molecular Biology    240
Business, Management and Accounting             106
Chemical Engineering                            126
Chemistry                                       226
Civil Engineering                               64
Computer Science                                218
Decision Science                                50
Earth and Planetary Science                     146
Economics, Econometrics and Finance             112
Energy and Power                                73
Engineering and Technology                      328
Environmental Science                           138
Immunology and Microbiology                     104
Materials Science                               160
Mathematics                                     205
Medicine                                        671
Neuroscience                                    103
Pharmacology, Toxicology and Pharmaceutics      73
Physics and Astronomy                           210
Psychology                                      126
Social Science                                  222

Process

• Index full-text (only) with Lucene 2.4 using the LuSql tool, with an aggressive stopword list and Porter stemming

• Build Semantic Vectors (v1.18, parallelized) index from Lucene index, with 512 semantic dimensions

• Find item x item distance matrix from SV index of 512-dimensional vectors

• Using R, use multidimensional scaling (MDS) to reduce from 512-D to 2-D
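The last two steps (distance matrix, then MDS down to 2-D) can be sketched in a few lines. The slides used R for the MDS; the version below is a plain-Python illustration of classical (metric) MDS, which is an assumption about the variant used. It also assumes Euclidean distance between length-normalised vectors as the distance measure, and uses power iteration in place of a library eigensolver. All function names are illustrative.

```python
import math
import random

def chord_distance_matrix(vectors):
    """Euclidean distance between length-normalised vectors (monotone in cosine)."""
    unit = []
    for v in vectors:
        norm = math.sqrt(sum(x * x for x in v))
        unit.append([x / norm for x in v])
    n = len(unit)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(unit[i], unit[j])))
            dist[i][j] = dist[j][i] = d
    return dist

def top_eigenpairs(matrix, k=2, iterations=500, seed=0):
    """Largest k eigenpairs of a symmetric PSD matrix: power iteration + deflation."""
    n = len(matrix)
    work = [row[:] for row in matrix]
    rng = random.Random(seed)
    pairs = []
    for _axis in range(k):
        vec = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        norm = math.sqrt(sum(x * x for x in vec))
        vec = [x / norm for x in vec]
        for _step in range(iterations):
            nxt = [sum(w * v for w, v in zip(row, vec)) for row in work]
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm < 1e-12:
                break
            vec = [x / norm for x in nxt]
        image = [sum(w * v for w, v in zip(row, vec)) for row in work]
        value = sum(v * x for v, x in zip(vec, image))  # Rayleigh quotient
        pairs.append((value, vec))
        for i in range(n):          # deflate: remove the component just found
            for j in range(n):
                work[i][j] -= value * vec[i] * vec[j]
    return pairs

def classical_mds(dist, k=2):
    """Classical MDS: double-centre squared distances, keep the top-k eigenpairs."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    total_mean = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]
    pairs = top_eigenpairs(b, k)
    return [[math.sqrt(max(val, 0.0)) * vec[i] for val, vec in pairs]
            for i in range(n)]

# Four toy 3-D "journal vectors": two similar pairs.
vectors = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.0], [0.0, 0.1, 1.0], [0.1, 0.0, 0.9]]
coords = classical_mds(chord_distance_matrix(vectors))
```

In the resulting 2-D layout, the two similar pairs stay close together and the pairs are far apart, which is the property the journal map relies on.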

Scalability Issues

• #items, #unique terms
  – #unique terms: SV handles very well
  – #items: SV handles fairly well
  – #items: impacts size of distance matrix (#items x #items)
  – R cannot handle a huge article distance matrix in MDS (i.e. millions of articles vs. thousands of journals)
• Instead of using articles for items, use journals for items
• Make a single large full-text document from the concatenation of all articles of a particular journal & index these
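A quick back-of-the-envelope calculation shows why the item count matters. The sketch below assumes a dense matrix of 8-byte doubles (an assumption; the slides do not say how the matrix was stored) and uses the article and journal counts from the Corpus slide. The `journal_documents` helper is a hypothetical illustration of the concatenation step, not the authors' code.

```python
from collections import defaultdict

BYTES_PER_DOUBLE = 8  # assumed storage: dense matrix of 64-bit floats

def dense_matrix_bytes(n_items):
    """Memory for a full n x n distance matrix of doubles."""
    return n_items * n_items * BYTES_PER_DOUBLE

def journal_documents(articles):
    """Concatenate article texts into one large document per journal."""
    by_journal = defaultdict(list)
    for journal, text in articles:
        by_journal[journal].append(text)
    return {journal: "\n".join(texts) for journal, texts in by_journal.items()}

article_matrix = dense_matrix_bytes(5_733_721)  # one item per article
journal_matrix = dense_matrix_bytes(2_231)      # one item per journal

print(f"articles: {article_matrix / 1e12:.0f} TB")
print(f"journals: {journal_matrix / 1e6:.1f} MB")

# Hypothetical (journal, full text) pairs.
docs = journal_documents([("J1", "a"), ("J2", "b"), ("J1", "c")])
```

Under these assumptions the article-level matrix needs on the order of hundreds of terabytes, while the journal-level matrix fits in about 40 MB.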

Environment

• Dell PowerEdge 1955 blade server, 2 x dual-core Xeon 5050 processors with 2x2MB cache, 3.0 GHz 64-bit, 32GB RAM, attached to a Dell EMC AX150 storage array via a SilkWorm 200E Series 16-Port Capable 4Gb Fabric Switch.

• Operating system: Linux openSUSE 10.2 (64-bit x86-64), kernel 2.6.18.8-0.10-default #1 SMP

• Java version 1.6.0_07 (build 1.6.0_07-b06), Java HotSpot 64-Bit Server VM (build 10.0-b23, mixed mode).

• Processing 1.0 (processing.org)

Results: Scalability

• Corpus: ~600GB full-text
• Lucene index: 43GB
  – LuSql: 13 hours 51 minutes to produce
• SV index: 58 minutes, 885 MB, 21.6m terms
  – Distance matrix: 6 minutes

Results: Visualization

• Using Processing environment, built simple validation/visualization tool

Harder sciences and engineering categories

[Map panel; labeled categories: Chemistry; Materials Science; Physics and Astronomy; Engineering and Technology; Mathematics; Computer Science; Civil Engineering; Chemical Engineering]

Agriculture and biomedical categories

[Map panel; labeled categories: Agriculture and Biological Sciences; Biochemistry, Genetics and Molecular Biology; Immunology and Microbiology; Pharmacology; Neuroscience; Medicine; Psychology]

Interdisciplinary and non-science categories

[Map panel; labeled categories: Environmental Science; Earth and Planetary Science; Energy and Power; Decision Science; Economics, Econometrics and Finance; Social Sciences; Business, Management and Accounting; Arts and Humanities]

Examination of outliers, extrema and cataloging errors

[Map panels; each line gives the category shown and the journals labeled in it]

Environmental Science: Ecotoxicology and Environmental Safety; Corporate Environmental Strategy; Organic Geochemistry

Medicine: Journal of X-Ray Science and Technology; Journal of Biomolecular NMR

Medicine: Annales Henri Poincare; Colloidal and Polymer Science

Medicine: French language Medical & Psychology Journals

Mathematics: Journal of Medical Ultrasonics; Bulletin of Mathematical Biology

Conclusions

• Reasonable mapping results
• Full-text only (no citations, metadata) gives good results
• Scalable to significant size

Future Work

• Proper precision and recall evaluation using same corpus
• Validate with NetNews-20 collection for P & R
• Evaluate non-metric MDS
• Project articles onto semantic journal space & build interactive discovery interface & evaluate
  – Index journal 'documents' and journal articles
  – SV on all
  – Distance matrix only on journals
  – Do MDS
  – Use eigenvectors to transform N-d article vector to 2-D
• Explore 3-D interface (MDS N-d → 3D)
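One possible reading of the "use eigenvectors to transform N-d article vector to 2-D" step is sketched below. It relies on the fact that classical MDS on Euclidean distances is equivalent to a principal-axes projection, so the top two eigenvectors fitted on the journal vectors can be reused to place any article vector in the same 2-D space. This is an interpretation, not the authors' stated method; the function names and toy vectors are assumptions.

```python
import math
import random

def top_eigenvectors(matrix, k=2, iterations=500, seed=0):
    """Top-k eigenvectors of a symmetric PSD matrix: power iteration + deflation."""
    n = len(matrix)
    work = [row[:] for row in matrix]
    rng = random.Random(seed)
    axes = []
    for _axis in range(k):
        vec = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        norm = math.sqrt(sum(x * x for x in vec))
        vec = [x / norm for x in vec]
        for _step in range(iterations):
            nxt = [sum(w * v for w, v in zip(row, vec)) for row in work]
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm < 1e-12:
                break
            vec = [x / norm for x in nxt]
        image = [sum(w * v for w, v in zip(row, vec)) for row in work]
        value = sum(v * x for v, x in zip(vec, image))
        for i in range(n):          # deflate before looking for the next axis
            for j in range(n):
                work[i][j] -= value * vec[i] * vec[j]
        axes.append(vec)
    return axes

def fit_journal_axes(journal_vectors, k=2):
    """Mean and top-k principal axes of the journal vectors."""
    n, d = len(journal_vectors), len(journal_vectors[0])
    mean = [sum(v[i] for v in journal_vectors) / n for i in range(d)]
    centered = [[x - m for x, m in zip(v, mean)] for v in journal_vectors]
    cov = [[sum(row[i] * row[j] for row in centered) / n for j in range(d)]
           for i in range(d)]
    return mean, top_eigenvectors(cov, k)

def project(vector, mean, axes):
    """Place any N-d vector (journal or article) in the fitted 2-D space."""
    centered = [x - m for x, m in zip(vector, mean)]
    return [sum(c * a for c, a in zip(centered, axis)) for axis in axes]

# Toy 4-D journal vectors (two similar pairs) and one article vector.
journals = [[1.0, 0.1, 0.0, 0.0], [0.9, 0.2, 0.1, 0.0],
            [0.0, 0.1, 1.0, 0.2], [0.1, 0.0, 0.9, 0.3]]
mean, axes = fit_journal_axes(journals)
journal_xy = [project(j, mean, axes) for j in journals]
article_xy = project([0.95, 0.15, 0.05, 0.0], mean, axes)
```

Because the axes are fitted on journals only, the expensive step stays at journal scale, and each article is placed with two dot products.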

Acknowledgements

• Greg Kresko, Andre Vellino, Jeff Demaine @ NRC-CISTI

License

Creative Commons Attribution-Noncommercial-No Derivative Works 2.5 Canada License
