LITA Forum 2003
The ‘XML’ Project:
Integrated Access to Scientific Resources
Miriam Blake – Presenter
Beth Goldsmith
Mariella Di Giacomo
Los Alamos National Laboratory
Research Library
Rationale for Project
• 60+ million citations, with multiple access points
• Duplicate records / citations
• No links between bibliographies and the records we store
• Need ‘smart objects’ with pointers (to full text, etc.)
• Wanted an updated interface with new features
Existing Databases at LANL
• Citation (A&I) databases
  • ISI
    • SciSearch, 1945-present: ~30 M + 4 k weekly
    • Social SciSearch, 1973-present: ~15 M + 1 k
    • Arts & Humanities, 1975-present: ~5 M + .5 k
    • ISI Proceedings, 1990-present: ~3 M + .5 k
    • All ISI dbs have associated citation records
  • INSPEC: ~8 M+
  • BIOSIS: ~15 M
  • Engineering Index (Compendex)
  • Other (DOE, LAUP/tech repts., GeoRef, OPAC, etc.)
Project Team
• 6 developers (librarians and programmers)
  • Miriam Blake, Doug Chafe, Mariella Di Giacomo, Frances Knudson, Beth Goldsmith, Mark Martinez, Ming Yu, Jeff Scott (hardware)
• Research Library staff
  • Librarians / metadata experts
• Interface team: 2 staff doing JSP, HTML, and graphics for this project part time
Project Workflow
[Diagram: records from Vendor A, Vendor B, and Vendor C (multiple vendor record formats) go through Conversion into Verity XML (single record format); Indexing then builds the Verity indexes and MySQL indexes, which the Application uses for display, search & browse.]
Hardware
• Fault-tolerant architecture to provide reliability, flexibility, and speed
• Sun Solaris 2.8 platform
• Security environment
  • Data stored and accessed inside a firewall
  • Data accessed and application runs outside the firewall
• Required a data-sharing file system (for Solaris)
  • LSC file system called QFS
  • Multiple readers, one writer per filesystem
[Architecture diagram. Labels: Load Balancer; Firewall; several Application + Verity Broker nodes; Verity Servers; MySQL slave servers; User Authentication db (MySQL, Linux); SAN (Storage Area Network) holding Verity collections, XML records, and the AuthorBrowse db; Development Environment with Verity broker/servers.]
Software components
• Verity search engine
• MySQL to handle author browse, user functions
• Interface
  • XSLT to transform XML for query result displays
  • Java servlets
  • JSP
  • Apache / Tomcat to handle Java/JSP presentation
Verity Search Engine
• Commercial product, used by many large companies
• Used in our older apps, so users are familiar with its search capabilities
• Strength in full-text searching
• Required Solaris (now runs on Linux)
• Verity K2: parallel multi-tiered architecture
  • Brokered approach
  • Searches are distributed to multiple servers to concurrently search multiple Verity collections
• LANL collections broken by year
  • Recs colls: bibliographic metadata
  • Cites colls: citations within articles
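The brokered approach can be pictured as a fan-out and merge. The sketch below is only an illustration in Python, not Verity K2 code: the per-year collections are tiny in-memory lists, and `COLLECTIONS`, `search_collection`, and `brokered_search` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for per-year collections; the real system queries Verity servers.
COLLECTIONS = {
    "recs2001": ["protein folding kinetics", "xml in libraries"],
    "recs2002": ["xml citation linking", "plasma diagnostics"],
    "recs2003": ["integrated access with xml"],
}

def search_collection(name, term):
    """Search one collection and return (collection, title) hits."""
    return [(name, title) for title in COLLECTIONS[name] if term in title]

def brokered_search(term):
    """Send the query to every collection concurrently, then merge the hits."""
    names = sorted(COLLECTIONS)
    with ThreadPoolExecutor(max_workers=len(names)) as pool:
        per_collection = list(pool.map(lambda n: search_collection(n, term), names))
    return [hit for hits in per_collection for hit in hits]
```

Splitting collections by year keeps each unit of work small, so a broker can search them in parallel and combine the results.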
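ISI Conversion
[Diagram: an ISI vendor record (bib record + citations) is converted into two XML records: an XML record with bib data (“recs”), indexed into the Verity recs coll for searching, and an XML record with citation data items (“cites”), indexed into the Verity cites coll for searching.]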
Record Structure
• Record keys, <fullKey>, structure:
  • Combination of ISSN, author name, volume, issue, start page, and title letters
  • /recs/sici00/0018-8190/46/2/173_SCIANCE-LSTROST
  • Not all elements are always present
• ISI records are split into 2 XML records with the same fullKey: one for bibliographic data and one for citations (bibliography)
• Bibliographic and cited records are indexed into separate collections for searching
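As a rough illustration of the key layout, the Python sketch below reassembles the sample key from its parts. It rests on assumptions: the "siciNN" segment is taken to be the first two digits of the ISSN (inferred only from the sample keys in this talk), the title letters are passed in already computed because the slides do not say how they are derived, and `build_full_key` is a hypothetical name. Records with missing elements are not handled.

```python
def build_full_key(issn, volume, issue, start_page, author, title_letters):
    """Assemble a fullKey shaped like the samples on the slides."""
    # Assumption: the "siciNN" bucket is the first two digits of the ISSN.
    bucket = "sici" + issn[:2]
    location = f"{issn}/{volume}/{issue}/{start_page}"
    return f"/recs/{bucket}/{location}_{author.upper()}-{title_letters.upper()}"

key = build_full_key("0018-8190", 46, 2, 173, "Sciance", "LSTROST")
```

Because the bibliographic record and its citation record share one key, either can be located from the other without a lookup table.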
Conversion to XML - recs
• Verity XML
  • Specific fields needed to handle vendor indexing requirements
  • One XML record containing matching articles from multiple vendors
  • Consistent XML tags across databases as much as possible
[Figure: example Verity XML bib record]
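One way to picture "consistent tags across databases" is a per-vendor field map feeding a single record, with each element tagged by its source database. This Python sketch is an assumption-heavy illustration: only the `db="Soc"` attribute style and the `issn` tag come from the slides, while the vendor field names, the `Insp` code, the `title` tag, and the `record` root element are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical vendor field names mapped to common tags.
FIELD_MAPS = {
    "Soc": {"TI": "title", "SN": "issn"},
    "Insp": {"Title": "title", "ISSN": "issn"},
}

def to_common_xml(vendor_records):
    """vendor_records maps a db code to that vendor's fields for the same article."""
    rec = ET.Element("record")
    for db, fields in vendor_records.items():
        for vendor_field, value in fields.items():
            tag = FIELD_MAPS[db].get(vendor_field)
            if tag is None:
                continue  # fields without a common tag are skipped in this sketch
            ET.SubElement(rec, tag, db=db).text = value
    return ET.tostring(rec, encoding="unicode")

merged = to_common_xml({
    "Soc": {"TI": "ACADEMIC TITLE", "SN": "1042-9670"},
    "Insp": {"Title": "Academic title"},
})
```

Keeping the originating database on every element is what later allows the display layer to choose between vendors field by field.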
Kludges for Verity
• Sort fields
<sorttitle>SPIRITUALITY IN MEDICINE A COMPARISON OF MEDICAL STUDENTS ATTITUDES AND CLINICAL PERFORMANCE </sorttitle>
<sortauthor>MUSICK DW CHEEVER TR QUINLIVAN S NORA LM </sortauthor>
<sortsource>ACADEMIC PSYCHIATRY 2003000027000002000000000067</sortsource>
<sortdate>20030000000183296100001</sortdate>
• Display fields
<resauthor>(Art)Musick, DW; Cheever, TR; Quinlivan, S; Nora, LM</resauthor>
<ressource>(Art)Source: ACADEMIC PSYCHIATRY; SUM 2003; v.27, no.2, p.67-73 </ressource>
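The sort fields look like fixed-width, zero-padded strings, which lets a plain text sort order records numerically. The Python sketch below reproduces the sample values under assumptions read off the example: widths of 4/6/6/12 digits for year, volume, issue, and start page, and a sort date made of the year, four zeros, and the control number. The function names and the punctuated title used as input are hypothetical.

```python
import re

def sort_title(title):
    """Uppercase the title and drop punctuation so sorting ignores it."""
    return re.sub(r"[^A-Z0-9 ]", "", title.upper())

def sort_source(journal, year, volume, issue, start_page):
    """Journal name plus zero-padded numbers (assumed widths 4/6/6/12)."""
    return f"{journal.upper()} {year:04d}{volume:06d}{issue:06d}{start_page:012d}"

def sort_date(year, control_num):
    """Year, four zeros (assumed month/day placeholder), then the control number."""
    return f"{year:04d}0000{control_num}"

source_key = sort_source("Academic Psychiatry", 2003, 27, 2, 67)
date_key = sort_date(2003, "000183296100001")
```

Without the padding, a text sort would place volume 27 after volume 100; with it, string order and numeric order agree.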
Kludges for Verity
• Zones
<znumber>
  <issn db="Soc">1042-9670</issn>
  <controlNum db="Soc">000183296100001</controlNum>
</znumber>
• Data enhanced tags
<zjournal>
  <journalAbbrJ2 db="Soc">ACAD PSYCHIATR</journalAbbrJ2>
  <journalAbbrJ9 db="Soc">ACAD PSYCHIATRY</journalAbbrJ9>
  <journalAbbr db="Soc">Acad. Psych.</journalAbbr>
  <journalAbbrJ1 db="Soc">ACAD PSYCHI</journalAbbrJ1>
  <journal db="Soc">ACADEMIC PSYCHIATRY; SUM 2003; v.27, no.2, p.67-73</journal>
</zjournal>
Unified record display
• Preference order for fields to display when multiple databases are present in the same record
  • Some fields should be deduped (e.g. title)
  • Some fields should display all data from all databases (e.g. subject, keywords)
• Becomes critical when multiple vendor records are displayed together
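The display rules above can be sketched as a small merge policy. This Python example is illustrative only: the preference order, the field lists, and the name `unified_display` are hypothetical, since the slides state the idea but not the actual values.

```python
# Hypothetical preference order and field policy.
DB_PREFERENCE = ["SciSearch", "INSPEC", "BIOSIS"]
MERGED_FIELDS = {"subject", "keywords"}  # show every distinct value from all dbs

def unified_display(record):
    """record maps field -> {db name -> list of values}; returns field -> values to show."""
    display = {}
    for field, by_db in record.items():
        ordered = [db for db in DB_PREFERENCE if db in by_db]
        if field in MERGED_FIELDS:
            seen, merged = set(), []
            for db in ordered:
                for value in by_db[db]:
                    if value.lower() not in seen:  # case-insensitive de-duplication
                        seen.add(value.lower())
                        merged.append(value)
            display[field] = merged
        else:
            # Single-value field: take the value from the most preferred db only.
            display[field] = by_db[ordered[0]][:1] if ordered else []
    return display

shown = unified_display({
    "title": {"INSPEC": ["Xml Access"], "SciSearch": ["XML ACCESS"]},
    "keywords": {"SciSearch": ["xml", "search"], "INSPEC": ["XML", "indexing"]},
})
```

Here the title is shown once, from the preferred database, while keywords from both vendors are combined.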
Unified record display (image-only slide)
ISI Conversion - Cites
• Cites: citation data (bibliographies) in each bibliographic record
  • Searchable separately from the articles which cite them
  • 500+ million individual citations (~170 M are unique)
  • Can be searched by cited author, source, year, volume, or a combination thereof
• One cites XML record can have multiple citations
  • <refItem>: one for each citation
• After conversion to XML, fullKeys are created for each <refItem> where possible
ISI Conversion - cites
[Diagram: a record (Title: xxx) with its bibliography (Citation 1, Citation 2, Citation 3, Citation x…); one citation is expanded into a <refItem>.]
• 3 authors: Ling, TW (1st author); Goh, CH; Lee, ML
• Source (title, year, vol, issue, start page): Information and Software Technology, 1996, v.38, #9, p.601
• FullKey for this item: /recs/sici09/0950-5849/38/9/601_LING-ECFDFPDD
• 26 M records with bibliographies
• 500 M individual citations <refItem>
Cites Fuzzy Matching
• Every <refItem> is processed to try to link it to the recs article it matches using fullKey
• Use “fuzzy matching” rules developed internally
  • Internal db of ISSNs matches brief source data (PHYS REV B or P REV B)
  • ISSN + cited author name, cited volume, cited page creates a fullKey that can match the key of a bib record, creating a link
• ~60% of bib records match a cite
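A minimal Python sketch of the matching idea follows, and it is not the internal rule set. It assumes a small abbreviation-to-ISSN table and a wildcard key in which `*` covers the parts a citation does not carry (issue and title letters); `ISSN_BY_SOURCE`, `star_key`, and `match_bib_keys` are hypothetical names, and the "siciNN" prefix is again inferred from the sample keys.

```python
from fnmatch import fnmatchcase

# Hypothetical excerpt of an abbreviation -> ISSN lookup table.
ISSN_BY_SOURCE = {"ACAD MED": "1040-2446", "ACADEMIC MEDICINE": "1040-2446"}

def star_key(cited_source, volume, page, author_surname):
    """Build a wildcard key from a citation; None if the source is not recognized."""
    issn = ISSN_BY_SOURCE.get(" ".join(cited_source.upper().split()))
    if issn is None:
        return None
    # '*' stands in for the issue and the title letters, which a citation lacks.
    return f"/recs/sici{issn[:2]}/{issn}/{volume}/*/{page}_{author_surname.upper()}*"

def match_bib_keys(pattern, bib_keys):
    """Return the bibliographic fullKeys that the wildcard key matches."""
    return [k for k in bib_keys if fnmatchcase(k, pattern)]

bib_keys = [
    "/recs/sici10/1040-2446/67/1/42_VU-6YCCPBAUSPASIUS",
    "/recs/sici10/1040-2446/67/1/50_SMITH-ABC",
]
pattern = star_key("Acad  Med", 67, 42, "Vu")
links = match_bib_keys(pattern, bib_keys)
```

A trailing wildcard on the surname is deliberately loose: it would also accept a longer surname that starts with the same letters, which is one reason such matching is called fuzzy.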
XML Cited reference example
<refItem type="ref">
<fullKey>/recs/sici10/1040-2446/67/1/42_VU-6YCCPBAUSPASIUS</fullKey>
<starKey>/recs/sici10/1040-2446/67/*/42_VU*</starKey>
<citAu src="cit">VU, NV</citAu>
<citAu src="bib">VU, NV</citAu>
<citAu src="bib">BARROWS, HS</citAu>
<citAu src="bib">TRAVIS, T</citAu>
<citSo src="cit">ACAD MED</citSo>
<citSo src="bib">ACADEMIC MEDICINE</citSo>
<citSo src="bib">ACAD MED</citSo>
<citYear src="cit">1992</citYear>
<citVol src="cit">67</citVol>
<citIssue src="bib">1</citIssue>
<citPage src="cit">42</citPage>
<citEndPg src="bib">50</citEndPg>
<citIssn src="bib">1040-2446</citIssn>
</refItem>
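To show how such a record might be read, the Python sketch below parses a shortened copy of the example and groups values by their `src` attribute. Reading `cit` as "as given in the citation" and `bib` as "filled in from the matched bibliographic record" is an interpretation of the sample, not something the slides state; `fields_by_source` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

REF_ITEM = """<refItem type="ref">
  <fullKey>/recs/sici10/1040-2446/67/1/42_VU-6YCCPBAUSPASIUS</fullKey>
  <citAu src="cit">VU, NV</citAu>
  <citAu src="bib">VU, NV</citAu>
  <citAu src="bib">BARROWS, HS</citAu>
  <citSo src="cit">ACAD MED</citSo>
  <citSo src="bib">ACADEMIC MEDICINE</citSo>
  <citIssue src="bib">1</citIssue>
</refItem>"""

def fields_by_source(xml_text):
    """Group child element values by their src attribute, then by tag name."""
    root = ET.fromstring(xml_text)
    grouped = defaultdict(lambda: defaultdict(list))
    for child in root:
        src = child.get("src")
        if src is not None:  # fullKey has no src attribute and is skipped
            grouped[src][child.tag].append(child.text)
    return {src: dict(fields) for src, fields in grouped.items()}

by_src = fields_by_source(REF_ITEM)
```

In this sample the citation itself names one author and an abbreviated source, while the `bib` values add co-authors, the full journal title, and the issue.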
Matching citations and bib records
[Figure: sample bibliography, with the match result for each citation]
• No record match found
• Match on key /recs/sici01/0163-5808/29/3/76_LEE-CASXSL
• Match on key /recs/sici09/0950-5849/38/9/601_LING-ECFDFPDD
Cited browse
• “citeinfo” database with over ½ billion individual citations (one of the largest MySQL dbs around!)
• Individual <refItem>s include fullKeys (which come from cite XML) for linking
  • FullKeys are deduped
• Each cited author name is pulled from <refItem>s, normalized, and added to browse-list tables
  • Browse tables contain ~195 million names
  • After deduping, only ~12 M unique names
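The drop from ~195 million names to ~12 million unique ones comes from normalizing before de-duplicating. The Python sketch below shows the general shape; the normalization rules (uppercase, no periods, "SURNAME, INITIALS") and the names `normalize_name` and `browse_list` are assumptions, as the slides do not list the actual rules.

```python
import re
from collections import Counter

def normalize_name(raw):
    """Hypothetical normalization to 'SURNAME, INITIALS' in uppercase."""
    cleaned = raw.upper().replace(".", "")
    surname, _, initials = cleaned.partition(",")
    surname = " ".join(surname.split())
    initials = re.sub(r"\s+", "", initials)
    return f"{surname}, {initials}" if initials else surname

def browse_list(cited_names):
    """Return sorted (name, occurrence count) pairs for a browse table."""
    counts = Counter(normalize_name(n) for n in cited_names)
    return sorted(counts.items())

entries = browse_list(["Vu, N.V.", "VU, NV", "vu,  n v", "Barrows, HS"])
```

Three spellings of the same author collapse into one browse entry with a count of three.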
Cited browse
[Screenshot callouts]
• 12 M unique names
  • Browse cited papers
  • Browse general search
• Number of times each item is cited
• Links to record via fullKey
• Total cite count
Times cited
• <fullKey> is used to create real-time times-cited counts
• Counts are displayed in the bibliographic record as well as in cited browse
• The times-cited count is also pulled out and indexed into Verity to allow sorting of results by “times cited”
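A count keyed on fullKey is a simple aggregate query. The Python sketch below uses the standard-library `sqlite3` module as a stand-in for MySQL so that it runs anywhere; the table name follows the "citeinfo" database mentioned in the talk, while the column names and the functions `build_citeinfo` and `times_cited` are hypothetical.

```python
import sqlite3

def build_citeinfo(rows):
    """rows: (cited_full_key, citing_year) pairs, one per <refItem> that has a fullKey."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE citeinfo (full_key TEXT, citing_year INTEGER)")
    db.execute("CREATE INDEX idx_full_key ON citeinfo (full_key)")
    db.executemany("INSERT INTO citeinfo VALUES (?, ?)", rows)
    return db

def times_cited(db, full_key):
    """Return (total count, [(year, count), ...] newest first) for one cited item."""
    total = db.execute(
        "SELECT COUNT(*) FROM citeinfo WHERE full_key = ?", (full_key,)
    ).fetchone()[0]
    by_year = db.execute(
        "SELECT citing_year, COUNT(*) FROM citeinfo WHERE full_key = ? "
        "GROUP BY citing_year ORDER BY citing_year DESC",
        (full_key,),
    ).fetchall()
    return total, by_year

KEY = "/recs/sici09/0950-5849/38/9/601_LING-ECFDFPDD"
db = build_citeinfo([(KEY, 2003), (KEY, 2003), (KEY, 2002), ("/recs/other", 2003)])
total, by_year = times_cited(db, KEY)
```

Computing the count at query time keeps it current as new citing records are loaded, which matches the "real-time" claim on the slide.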
Times cited (image-only slide)
Cited linkages
[Diagram: links between a full record, its citations, and the records citing them]
• Full record: Title: A, published 2000, listing Citation 1, Citation 2, Citation 3, Citation x…
• Full record for Citation 1: Times cited: 96
• Number of times cited for Citation 1: Total 96; 2003: 12; 2002: 23; 2001: 24; 2000: 50; 1999: 70; 1998: 17
• Records citing Citation 1, published in 2000: Title A, Title B, Title C, …
Cited browse
• Connections to citeinfo MySQL use connection pooling
  • 100 connections, refreshed after every 10 queries (can be increased on the fly)
• Table structure optimizations reduced browse time to under 1 second on average
• Highly cited works (cited more than 10,000 times) are slow
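The pooling policy can be sketched as a fixed set of connections, each replaced after it has served a set number of queries. This Python example is one reading of "refreshed after every 10 queries", not the project's Java code: `RecyclingPool` is a hypothetical class, and `sqlite3` stands in for MySQL.

```python
import queue
import sqlite3

class RecyclingPool:
    """Fixed-size pool that replaces a connection after max_queries queries."""

    def __init__(self, connect, size=100, max_queries=10):
        self._connect = connect
        self.max_queries = max_queries  # plain attribute, so it can be raised at runtime
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put((connect(), 0))

    def query(self, sql, params=()):
        conn, used = self._idle.get()  # blocks while every connection is busy
        try:
            rows = conn.execute(sql, params).fetchall()
            used += 1
        except Exception:
            used = self.max_queries  # force a refresh after a failed query
            raise
        finally:
            if used >= self.max_queries:
                conn.close()
                conn, used = self._connect(), 0
            self._idle.put((conn, used))
        return rows

created = []

def connect():
    conn = sqlite3.connect(":memory:")
    created.append(conn)
    return conn

pool = RecyclingPool(connect, size=2, max_queries=3)
results = [pool.query("SELECT 1") for _ in range(6)]
```

With two connections and a limit of three queries each, six queries cause both connections to be replaced once, so four connections are opened in total.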
Adding MySQL to the mix
• Fast performance and an open source relational db
• On the Sun platform, can address up to 32 GB of memory for query caching
• Used to provide browse capability for article authors / cited authors
• Also provides a live, disk-based backup to XML bibliographic data
• Separate MySQL databases are used for user authentication and preferences and for current alerts services
Application - Requirements
• 250,000+ searches per month
• 3300 users have weekly alerts set up
• 115 run saved searches “on demand”
• Access requests from all over the world
  • National Inst. of Materials Physics – Bucharest, Romania
  • Univ. Program in Ecology – Duke University
  • Dept. of Biochemistry and Molecular Biology – U of Western Australia
  • National Center for Atmospheric Research – Boulder, CO
Application - Requirements
• Interface enhancements
  • Keep “successful” options from legacy interfaces
  • Add features based on user feedback
• Search screen options: features based on appropriate dbs
• Alerts and saved searches
• User preferences
• Marking and output
• SFX
Options in the Interface (image-only slide)
Performance
• Many variables: attempts to improve each component
  • XML layout on the filesystems
  • Memory use
  • Network infrastructure
  • Application issues
    • MySQL engine, Verity engine, JVM, Java compiler, XSL, and JSP
    • Application code itself
Lessons Learned
• As deadlines approach, design suffers
• Standards evolve slower than software
• As projects become bigger, teams need to formalize work patterns
• Project management tools are critical: Ant, CVS, Bugzilla
Next steps
• INSPEC will be added to ISI in October 2003
  • Some interface rework to handle
    • INSPEC “only” users: no cited features
    • New / expanded list of indexes
    • Searches over INSPEC db only (not ISI)
• BIOSIS by the end of 2003
• Merging user databases across product suite
• Expanding into a “component architecture”
• Increase use of standards and open source (MARCXML, OAI, etc.)