semantic journal mapping for search visualization in a large scale article digital library
DESCRIPTION
In this paper we examine the scalability and utility of semantically mapping (visualizing) journals in a large scale (5.7+ million) science, technology and medical article digital library. This work is part of a larger research effort to evaluate semantic journal and article mapping for search query results refinement and visual contextualization in a large scale digital library. In this work the Semantic Vectors software package is parallelized and evaluated to create semantic distances between 2365 journals, from the sum of their full-text. This is used to create a journal semantic map whose production does scale and whose results are comparable to other maps of the scientific literature.
TRANSCRIPT
Semantic Journal Mapping for Search Visualization in a Large Scale Article Digital Library
Glen Newton1,2, Alison Callahan1, Michel Dumontier2
1National Research Council Canada, 2Carleton University
Second Workshop on Very Large Digital Libraries (VLDL) 2009
at ECDL 2009
Oct 2 2009 Corfu, Greece
Outline
• Maps of Science
• Background
• Research Interests
• Research Goals
• Process
• Scalability issues
• Environment
• Results
• Conclusions
• Future Work
From Bollen et al. 2009, PLoS ONE
From Leydesdorff & Rafols 2006
From Leydesdorff & Rafols 2006
Background
• Canada Institute for Scientific and Technical Information (CISTI): Canada's national science library
• ~3000 active researchers at NRC
• Large collection of ~8.4m articles (full text + metadata) in science, technology, medicine (STM)
• 4100 journal titles
• ~1995 to 2009
Research Interests
• Domain-specific discovery
• Improved discovery in STM domains through results visualization and contextualization, browse/explore/refine
• Results set visualization: “mapping”
Research Goals
• Find a way to extract a journal (& article) semantic vector space
• Latent Semantic Analysis (LSA) works for small/medium sized corpora, but does not scale to large numbers of items and/or terms
• New alternative: Semantic Vectors (SV): uses random vectors & avoids expensive singular value decomposition (SVD)
• Can SV scale & generate a sensible semantic vector space of journals on a corpus of this size?
• Can the visualization produced be useful for query results visualization, refinement, discovery?
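The idea of using random vectors in place of an SVD can be illustrated with a minimal random-indexing sketch. This is a generic illustration, not the Semantic Vectors package's implementation: the function names, the choice of 10 non-zero entries per vector, and the hash-based seeding are all assumptions made for the example.

```python
import hashlib
import math
import random
from collections import Counter

DIM = 512      # dimensionality used in the slides
NONZERO = 10   # assumed number of +1/-1 entries per elemental vector

def elemental_vector(term, dim=DIM, nonzero=NONZERO):
    """Sparse random ternary vector, seeded by the term so it is reproducible."""
    digest = hashlib.sha256(term.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    vec = [0.0] * dim
    for i, pos in enumerate(rng.sample(range(dim), nonzero)):
        vec[pos] = 1.0 if i % 2 == 0 else -1.0
    return vec

def document_vector(tokens, dim=DIM):
    """Sum the elemental vectors of a document's terms, weighted by frequency."""
    vec = [0.0] * dim
    for term, count in Counter(tokens).items():
        for pos, value in enumerate(elemental_vector(term, dim)):
            if value:
                vec[pos] += count * value
    return vec

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

chem_a = document_vector("polymer synthesis catalyst polymer".split())
chem_b = document_vector("catalyst polymer reaction".split())
econ = document_vector("market inflation policy".split())
print(cosine(chem_a, chem_b), cosine(chem_a, econ))
```

Because independent sparse random vectors are nearly orthogonal in 512 dimensions, documents that share terms end up with a high cosine similarity while unrelated documents stay close to zero, and no matrix decomposition is needed at any point.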
Corpus
• Licensed journal articles from STM publishers: Elsevier, Springer, etc.
• ~4100 journal titles, classified into 23 categories (by librarians)
• ~8.4m journal articles
• Selection of articles/journals:
– Only those with authors, abstract (no notices, obituaries, etc.)
– Only English language articles
– Only journals with >50 articles in corpus
– Resulting corpus: 5,733,721 articles from 2231 journals
– Categories overlapping: 1.53 categories per journal
Category                                      # Journals per category
Agriculture & Biological Sciences             358
Arts and Humanities                           70
Biochemistry, Genetics and Molecular Biology  240
Business, Management and Accounting           106
Chemical Engineering                          126
Chemistry                                     226
Civil Engineering                             64
Computer Science                              218
Decision Science                              50
Earth and Planetary Science                   146
Economics, Econometrics and Finance           112
Energy and Power                              73
Engineering and Technology                    328
Environmental Science                         138
Immunology and Microbiology                   104
Materials Science                             160
Mathematics                                   205
Medicine                                      671
Neuroscience                                  103
Pharmacology, Toxicology and Pharmaceutics    73
Physics and Astronomy                         210
Psychology                                    126
Social Science                                222
Process
• Index full-text (only) with Lucene 2.4 via the LuSql tool, using an aggressive stopword list and Porter stemming
• Build Semantic Vectors (v1.18, parallelized) index from Lucene index, with 512 semantic dimensions
• Find item x item distance matrix from SV index of 512-dimensional vectors
• Using R, use multidimensional scaling (MDS) to reduce from 512-D to 2-D
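The last two steps above (distance matrix, then MDS down to 2-D) can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions: it uses cosine distance and classical (Torgerson) MDS with power iteration, whereas the slides do not say which distance measure or which R routine was used. The function names are hypothetical.

```python
import math

def cosine_distance_matrix(vectors):
    """Pairwise (1 - cosine similarity) between non-zero row vectors."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    n = len(vectors)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            dist[i][j] = dist[j][i] = 1.0 - dot / (norms[i] * norms[j])
    return dist

def _matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def classical_mds(dist, dims=2, iterations=500):
    """Classical MDS: double-center the squared distances, then keep the
    leading eigenvectors. Power iteration finds the eigenvalue of largest
    magnitude, so this sketch assumes the leading eigenvalues are positive."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    total_mean = sum(row_mean) / n
    # B = -1/2 * J * D^2 * J, where J is the centering matrix
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]
    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        v = [math.sin(i + 1.0 + k) for i in range(n)]  # deterministic start
        for _ in range(iterations):
            w = _matvec(b, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:      # nothing left to extract
                break
            v = [x / norm for x in w]
        vnorm = math.sqrt(sum(x * x for x in v))
        v = [x / vnorm for x in v]
        eigenvalue = sum(x * y for x, y in zip(v, _matvec(b, v)))
        scale = math.sqrt(max(eigenvalue, 0.0))
        for i in range(n):
            coords[i][k] = scale * v[i]
        # Deflate so the next pass finds the next eigenvector
        for i in range(n):
            for j in range(n):
                b[i][j] -= eigenvalue * v[i] * v[j]
    return coords
```

Calling `classical_mds(cosine_distance_matrix(journal_vectors))` would return one (x, y) pair per journal. Pure Python is far too slow for a few thousand journals; an optimized eigen-solver (for example R's `cmdscale`) is the practical choice, and this sketch only shows what such a routine computes.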
Scalability Issues
• #items, #unique terms
– #unique terms: SV handles these easily
– #items: SV handles fairly well
– #items: impacts size of distance matrix (#items x #items)
– R cannot handle a huge article distance matrix in MDS (i.e. millions of articles vs. thousands of journals)
• Instead of using articles for items, use journals for items
• Make a single large full-text document from the concatenation of all articles of a particular journal & index these
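A quick back-of-the-envelope calculation shows why the item count matters. The sketch below assumes a dense matrix of 8-byte doubles (the slides do not state the storage format) and uses the article and journal counts from the Corpus slide. The `journal_documents` helper is a hypothetical illustration of the concatenation step, not the authors' code.

```python
from collections import defaultdict

def distance_matrix_bytes(n_items, bytes_per_entry=8):
    """Memory for a dense n x n matrix; 8-byte doubles are an assumption."""
    return n_items * n_items * bytes_per_entry

def journal_documents(articles):
    """Concatenate article texts into one document per journal.
    `articles` is an iterable of (journal_title, full_text) pairs."""
    texts = defaultdict(list)
    for journal, text in articles:
        texts[journal].append(text)
    return {journal: "\n".join(parts) for journal, parts in texts.items()}

ARTICLES = 5_733_721   # articles in the filtered corpus
JOURNALS = 2_231       # journals in the filtered corpus

article_bytes = distance_matrix_bytes(ARTICLES)
journal_bytes = distance_matrix_bytes(JOURNALS)
print(f"article x article matrix: {article_bytes / 1e12:.0f} TB")
print(f"journal x journal matrix: {journal_bytes / 1e6:.1f} MB")
```

Under these assumptions the article-level matrix needs roughly 263 TB, while the journal-level matrix needs about 40 MB, which fits comfortably in the 32GB of RAM on the server used.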
Environment
• Dell PowerEdge 1955 Blade server, 2 x dual-core Xeon 5050 processors with 2x2MB cache, 3.0 GHz 64-bit, 32GB RAM, attached to a Dell EMC AX150 storage array via a SilkWorm 200E Series 16-Port Capable 4Gb Fabric Switch.
• Operating system: Linux openSUSE 10.2 (64-bit X86-64), kernel 2.6.18.8-0.10-default #1 SMP
• Java version 1.6.0_07 (build 1.6.0_07-b06), Java HotSpot 64-Bit Server VM (build 10.0-b23, mixed mode).
• Processing 1.0 (processing.org)
Results: Scalability
• Corpus: ~600GB full-text
• Lucene index: 43GB
– LuSql: 13 hours 51 minutes to produce
• SV index: 58 minutes, 885 MB, 21.6m terms
– Distance matrix: 6 minutes
Results: Visualization
• Using Processing environment, built simple validation/visualization tool
Harder sciences and engineering categories
[Journal map, highlighted categories: Chemistry; Materials Science; Physics and Astronomy; Engineering and Technology; Mathematics; Computer Science; Civil Engineering; Chemical Engineering]
Agriculture and biomedical categories
[Journal map, highlighted categories: Agriculture and Biological Sciences; Biochemistry, Genetics and Molecular Biology; Immunology and Microbiology; Pharmacology; Neuroscience; Medicine; Psychology]
Interdisciplinary and non-science categories
[Journal map, highlighted categories: Environmental Science; Earth and Planetary Science; Energy and Power; Decision Science; Economics, Econometrics and Finance; Social Sciences; Business, Management and Accounting; Arts and Humanities]
Examination of outliers, extrema and cataloging errors
[Map labels, Environmental Science: Ecotoxicology and Environmental Safety; Corporate Environmental Strategy; Organic Geochemistry]
[Map labels, Medicine: Journal of X-Ray Science and Technology; Journal of Biomolecular NMR]
[Map labels, Medicine: Annales Henri Poincare; Colloidal and Polymer Science]
[Map labels, Medicine: French language Medical & Psychology Journals]
[Map labels, Mathematics: Journal of Medical Ultrasonics; Bulletin of Mathematical Biology]
Conclusions
• Reasonable mapping results
• Full-text only (no citations, metadata) gives good results
• Scalable to significant size
Future Work
• Proper precision and recall evaluation using same corpus
• Validate with NetNews-20 collection for P & R
• Evaluate non-metric MDS
• Project articles onto semantic journal space & build interactive discovery interface & evaluate
– Index journal 'documents' and journal articles
– SV on all
– Distance matrix only on journals
– Do MDS
– Use eigenvectors to transform N-d article vector to 2-D
• Explore 3-D interface (MDS N-d → 3D)
Acknowledgements
• Greg Kresko, Andre Vellino, Jeff Demaine @ NRC-CISTI
Demo
• Link to project demo page
License
Creative Commons Attribution-Noncommercial-No Derivative Works 2.5 Canada License