Seamless Searching of Numeric and Textual Resources
Funded by a National Library Leadership Grant from
the Institute of Museum and Library Services
Michael Buckland, Aitao Chen, Fredric Gey and Ray Larson
Friday Afternoon Seminar, Feb 14, 2003
http://metadata.sims.berkeley.edu/papers/SeamlessSearchFinalReport.pdf
From numbers to texts:
Iritani, Evelyn. "Normalizing ties to Vietnam important steps for U.S. firms; California stands to profit handsomely when barriers fall to trade with fast-growing country." Los Angeles Times v114 (July 12, 1995):D1.
An article found using the keywords “Import” and “Vietnam” as the query.
From text to numbers:
"U.S. bans import of most European meat." Los Angeles Times v116, n14 (Dec 14, 1997):A22. (On fear of mad cow disease.)
"Ban on cattle and sheep is extended to all Europe." New York Times v147, sec1 (Dec 14, 1997):16(N), 42(L). (The U.S. Agriculture Department responds to threat of 'Mad Cow' disease.)
Topic of interest: imports of beef to the United States from Britain
The sources at http://govinfo.kerr.orst.edu/import/import.html show no reported edible beef imports from the United Kingdom.
Seamless Search Project Goals:
• Phase I: The development and demonstration of a library gateway that supports searching both text and socio-economic numeric databases.
• Phase II: The demonstration of a library gateway supporting searches between text and numeric databases.
Data Sets to create Entry Vocabulary Indexes: MELVYL MARC Files
<RECORD>
<001> 73180254 </001>
<245><a>A study of operant conditioning under delayed reinforcement in early infancy</a></245>
<650><a>Infant psychology.</a></650>
<650><a>Operant conditioning.</a></650>
</RECORD>
Number of MARC records in the training data set: ~4,246,000.
Field 245: book title.
Field 650: LC Subject Headings.
A sample training record extracted from a MARC record.
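As a rough illustration of how such a training record could be read, the sketch below pulls the record ID, title, and subject headings out of one record. It is an assumption-laden example, not the project's own code: the function name `parse_record` and the regular expressions are hypothetical, and regular expressions are used because tag names such as `<245>` begin with a digit and so are not valid XML element names.

```python
import re

# The sample record from the slide, with well-formed closing tags.
SAMPLE = """<RECORD>
<001> 73180254 </001>
<245><a>A study of operant conditioning under delayed reinforcement in early infancy</a></245>
<650><a>Infant psychology.</a></650>
<650><a>Operant conditioning.</a></650>
</RECORD>"""

# A three-digit field tag with its matching close, and the <a> subfield inside it.
FIELD = re.compile(r"<(\d{3})>(.*?)</\1>", re.DOTALL)
SUBFIELD = re.compile(r"<a>(.*?)</a>", re.DOTALL)


def parse_record(text):
    """Return (record_id, title, subject_headings) for one <RECORD> block."""
    rec_id, title, headings = None, None, []
    for tag, body in FIELD.findall(text):
        sub = SUBFIELD.search(body)
        value = (sub.group(1) if sub else body).strip()
        if tag == "001":        # control number
            rec_id = value
        elif tag == "245":      # book title
            title = value
        elif tag == "650":      # LC Subject Heading; drop the trailing period
            headings.append(value.rstrip("."))
    return rec_id, title, headings


rec_id, title, headings = parse_record(SAMPLE)
```

Each parsed record then yields one training pair: the words of the title on one side and the assigned subject headings on the other.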
[Figure: three columns linked by lines. Title Words: behavior, infant, infancy, psychology, child, attitude, baby, development. Doc IDs: doc1, doc2, doc3, doc4, doc5. LCSHs: Infant psychology, Operant conditioning, Infant development, Psychology, Parent and child.]
Statistical association of title words and LCSH
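One way to turn such co-occurrences into ranked weights is sketched below. The slides do not name the association measure, so this example assumes a log-likelihood ratio (G²) over a 2x2 contingency table of documents; the function names `g2` and `build_evi` and the five toy records are made up for illustration and are not the project's data or code.

```python
import math
from collections import defaultdict


def g2(a, b, c, d):
    """Log-likelihood ratio for a 2x2 table:
    a = docs with word and heading, b = word only,
    c = heading only, d = neither."""
    n = a + b + c + d
    total = 0.0
    for obs, row, col in ((a, a + b, a + c), (b, a + b, b + d),
                          (c, c + d, a + c), (d, c + d, b + d)):
        if obs > 0:  # a zero cell contributes nothing
            total += obs * math.log(obs * n / (row * col))
    return 2.0 * total


def build_evi(records):
    """records: list of (title_words, headings) pairs.
    Returns {word: [(heading, weight), ...]} sorted by descending weight."""
    n = len(records)
    word_df = defaultdict(int)
    head_df = defaultdict(int)
    both = defaultdict(int)
    for words, heads in records:
        for w in set(words):
            word_df[w] += 1
        for h in set(heads):
            head_df[h] += 1
        for w in set(words):
            for h in set(heads):
                both[(w, h)] += 1
    evi = defaultdict(list)
    for (w, h), a in both.items():
        b = word_df[w] - a
        c = head_df[h] - a
        d = n - a - b - c
        if a * d > b * c:  # keep only positive associations
            evi[w].append((h, g2(a, b, c, d)))
    for w in evi:
        evi[w].sort(key=lambda pair: (-pair[1], pair[0]))
    return dict(evi)


# Toy training pairs (hypothetical), loosely modeled on the figure.
records = [
    (["infant", "behavior"], ["Infant psychology"]),
    (["infancy", "psychology"], ["Infant psychology", "Operant conditioning"]),
    (["infant", "development"], ["Infant development"]),
    (["psychology", "attitude"], ["Psychology"]),
    (["child", "baby"], ["Parent and child"]),
]
evi = build_evi(records)
```

Looking up a query word in `evi` returns its headings in ranked order, which is the shape of the ranked lists shown on the slides.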
Word to LCSH Entry Vocabulary Index (EVI)
Rank  LCSH                              Weight
1     alcoholism                        7470.46
2     alcoholic                         1745.23
3     alcohol                           709.26
4     alcoholism and employment         318.26
5     drug abuse                        257.75
6     alcohol, ethyl                    235.13
7     drinking of alcoholic beverages   151.46
8     substance abuse                   146.04
List of the LCSHs that are most closely associated, statistically, with the query word: alcoholism.
Words to LCSH Entry Vocabulary Index (EVI)
Rank  LCSH                Weight
1     economic policy     756.90
2     german (west)       645.02
3     switzerland         97.70
4     regional planning   96.39
5     economics           92.14
List of LCSHs that are most closely associated, statistically, with the German query word: Wirtschaftspolitik.
Note: The top-ranked LCSH “economic policy” happens to be the English translation of the German word “Wirtschaftspolitik”.
Words to LCSH Entry Vocabulary Index (EVI)
Rank  LCSH                      Weight
1     peanut                    1343.90
2     cookery (peanut butter)   429.61
3     cookery (peanuts)         423.47
4     peanut industry           359.57
5     peanut butter             316.23
6     butter                    309.36
7     schulz, charles m         277.30
8     cookery                   197.08
List of LCSHs that are most closely associated, statistically, with the phrase peanut butter as a query.
Word to LCSH Entry Vocabulary Index (EVI)
Rank  LCSH                             Weight
1     world war, 1939-1945             16430.62
2     vietnamese conflict, 1961-1975   15388.68
3     united states                    13989.66
4     world war, 1914-1918             8055.60
5     vietnam                          6523.90
List of LCSHs that are most closely associated, statistically, with the query: Vietnam War.
Note: “Vietnam War” is not an established (authorized) LCSH. The established LCSH is “Vietnamese conflict”.
LCSH to Words Entry Vocabulary Index
Rank  Words        Weight
1     alcohol      13471.94
2     alcoholism   11715.56
3     abuse        3708.09
4     drug         3467.22
5     drink        2563.53
6     alcoholic    2534.91
7     treatment    2349.03
8     prevention   1263.94
9     problem      1148.03
10    addiction    886.81
List of words that are most closely associated, statistically, with the Library of Congress Subject Heading: Alcoholism.
EVI-based Access to MELVYL
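[Figure: architecture diagram with numbered steps 1 to 7. Components: Web Browser; Web server (httpd, CGI) offering EVI access and gateway access; EVI; HTTP/Z39.50 Gateway; MELVYL Z39.50 server; other Z39.50 servers. Connections are labeled HTTP and Z39.50. Data labels: free-form query, ranked list of LCSHs, search results, full MARC record.]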
Counting California Database (http://countingcalifornia.cdlib.org/)
• A collection of some 3,000 numeric tables.
• Organized into 16 topics and 184 subtopics.
Sample topics:
• Banking, Finance and Insurance
• Elections
• Population and Demographics
• Social Services and Public Assistance
Sample subtopics under Agriculture and Natural Resources:
• Farms and Farming
• Fishing
• Forestry and Lumber
• Minerals
Enhanced Access to Counting California Database
• Conventional probabilistic retrieval of numeric tables using table captions, mapping query to text of captions.
• Access to numeric tables through the words-to-subtopic entry vocabulary index.
<table>
<topic> education </topic>
<subtopic> libraries </subtopic>
<caption>STATISTICS, STATEWIDE SUMMARY BY TYPE OF LIBRARY CALIFORNIA, 1992-93 TO 1997-98</caption>
</table>
A sample record created from http://countingcalifornia.cdlib.org.
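The first access route listed above, ranking tables by matching the query against caption text, can be sketched as follows. The slides say only "conventional probabilistic retrieval", so this example swaps in BM25 as a stand-in scoring function; it should not be read as the project's ranking formula. The function names are hypothetical, and the second and third captions are invented for the example (only the first comes from the sample record). There is no stemming, so "libraries" does not match "LIBRARY".

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_captions(query, captions, k1=1.2, b=0.75):
    """Return [(caption, score), ...] for captions with a nonzero BM25 score,
    best match first."""
    docs = [tokenize(c) for c in captions]
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    df = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))
    scored = []
    for caption, doc in zip(captions, docs):
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        if score > 0:
            scored.append((caption, score))
    scored.sort(key=lambda pair: -pair[1])
    return scored


captions = [
    # From the sample record above.
    "STATISTICS, STATEWIDE SUMMARY BY TYPE OF LIBRARY CALIFORNIA, 1992-93 TO 1997-98",
    # Hypothetical captions, for illustration only.
    "PUBLIC LIBRARIES BY COUNTY, CALIFORNIA",
    "COMMERCIAL FISHING LANDINGS BY PORT",
]
results = rank_captions("public libraries in California", captions)
```

Captions that share no term with the query are dropped, so `results` holds only the two California captions here.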
Probabilistic Access to Counting California Database
The query "public libraries in California" gives a ranked list of captions.
EVI-based Access to Counting California Database
Ranked list of subtopics that are most closely associated, statistically, with the query: personal/individual income tax.
Rank  Subtopic                               Weight
1     income                                 542.53
2     government earnings and tax revenues   251.71
3     property tax                           156.67
4     property tax                           74.58
5     personal income tax                    59.99
Numeric Tables with Subtopic: Personal income tax.
Traverse Searching Between Online Catalogs and Numeric Databases
[Figure: diagram with numbered steps 1 to 11 connecting search interface 1, online catalog, MARC, EVI LCSH, new query, search interface 2, numeric database, search results, captions, and numeric table.]
Melvyl MARC record as source of a query
Extract from MARC as a query
Any caption can become a query
http://metadata.sims.berkeley.edu/papers/SeamlessSearchFinalReport.pdf
Final Report on “Seamless Searching of Numeric and Textual Resources” Project, 1999-2002.
Two sequels:
1. Adding search by place: “Going Places in the Catalog: Improved Geographic Access,” funded by a National Library Leadership Project from the Institute of Museum and Library Services, 2002-2004.
2. Multilingual Search Across Multiple Genres: Proposal submitted Feb 13, 2003!