March 2, 2004 Ray R. Larson

Cheshire II: Features and Internals and Cheshire III overview

Ray R. Larson
School of Information Management and Systems
University of California, Berkeley

Overview

• Cheshire II feature overview
  – Logistic Regression Ranking, Okapi BM-25 and Boolean Operations
  – Fusion Operators
• Additions from INEX ‘03
  – Element/Index level re-estimation of LR coefficients
• Adhoc and Heterogeneous Track Methodology
• Evaluation Results - Adhoc

Overview of Cheshire II

• It supports SGML and XML with components and component indexes
• It is a client/server application
• Uses the Z39.50 Information Retrieval Protocol; support for SRW, OAI, SOAP, SDLIP also implemented
• Server supports a Relational Database Gateway
• Supports Boolean searching of all servers
• Supports probabilistic ranked retrieval in the Cheshire search engine as well as Boolean and proximity search
• Search engine supports “nearest neighbor” searches and relevance feedback
• GUI interface on X window displays and Windows NT
• WWW/CGI forms interface for DL, using combined client/server CGI scripting via WebCheshire
• Scriptable clients using Tcl and Python
• Store SGML/XML as files or “Datastore” database

Cheshire II Searching

[Figure: Cheshire II searching. Labels: Z39.50 (on each connection), Internet, Local, Remote, Images, Scanned Text.]

INEX Overview

[Figure: INEX overview. Labels: UI or Scripts, Local, Net, Map Query, Map Results, INEX Search Engine.]

Boolean Search Capability

• All Boolean operations are supported
  – “zfind author x and (title y or subject z) not subject A”
• Named sets are supported and stored on the server
• Boolean operations between stored sets are supported
  – “zfind SET1 and subject widgets or SET2”
• Nested parentheses and truncation are supported
  – “zfind xtitle Alice#”
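The Boolean operations above map onto set algebra over postings. The following is a minimal sketch, assuming a toy in-memory index: the `postings` dictionary, the record ids, and the helper names `find` and `find_truncated` are invented for illustration and are not Cheshire II data structures or commands. It evaluates the first example query left to right and resolves a right-truncated term.

```python
# Toy postings: (index name, term) -> set of record ids. All values are illustrative.
postings = {
    ("author", "x"): {1, 2, 3, 4},
    ("title", "y"): {2, 5},
    ("subject", "z"): {3, 4},
    ("subject", "a"): {4},
    ("xtitle", "alice in wonderland"): {7},
    ("xtitle", "alices adventures"): {8},
    ("xtitle", "through the looking glass"): {9},
}

def find(index, term):
    """Return the ids posted under one index/term pair (empty set if absent)."""
    return set(postings.get((index, term.lower()), set()))

def find_truncated(index, prefix):
    """Right truncation ('Alice#'): union of postings whose key starts with the prefix."""
    hits = set()
    for (idx, term), ids in postings.items():
        if idx == index and term.startswith(prefix.lower()):
            hits |= ids
    return hits

# zfind author x and (title y or subject z) not subject A
set1 = (find("author", "x") & (find("title", "y") | find("subject", "z"))) - find("subject", "A")

# zfind xtitle Alice#
set2 = find_truncated("xtitle", "Alice")
```

Keeping `set1` and `set2` around as named values mirrors the idea of stored result sets that later queries can combine.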

Probabilistic Retrieval

• Uses Logistic Regression ranking method developed at Berkeley (W. Cooper, F. Gey, D. Dabney, A. Chen) with new algorithm for weight calculation at retrieval time
• Z39.50 “relevance” operator used to indicate probabilistic search
• Any index can have Probabilistic searching performed:
  – zfind topic @ “cheshire cats, looking glasses, march hares and other such things”
  – zfind title @ caucus races
• Boolean and Probabilistic elements can be combined:
  – zfind topic @ government documents and title guidebooks

Probabilistic Retrieval: Logistic Regression

Probability of relevance is based on logistic regression from a sample set of documents to determine values of the coefficients. At retrieval the probability estimate is obtained by:

$$P(R \mid Q, D) = b_0 + \sum_{i=1}^{6} b_i X_i$$

for the 6 X attribute measures shown on the next slide.

Probabilistic Retrieval: Logistic Regression attributes

Average Absolute Query Frequency:
$$X_1 = \frac{1}{M} \sum_{j=1}^{M} \log QAF_{t_j}$$

Query Length:
$$X_2 = \sqrt{QL}$$

Average Absolute Component Frequency:
$$X_3 = \frac{1}{M} \sum_{j=1}^{M} \log DAF_{t_j}$$

Document Length:
$$X_4 = \sqrt{DL}$$

Average Inverse Component Frequency:
$$X_5 = \frac{1}{M} \sum_{j=1}^{M} \log IDF_{t_j}$$

Inverse Component Frequency:
$$IDF_{t_j} = \frac{N - n_{t_j}}{n_{t_j}}$$

Number of Terms in common between query and Component (logged):
$$X_6 = \log M$$
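The six attribute measures can be computed from simple term statistics. The sketch below is an illustration under stated assumptions rather than the Cheshire II implementation: it takes QAF and DAF to be raw term counts in the query and the component, QL and DL to be lengths in term occurrences (entered as square roots, following the published Berkeley formulation), N to be the number of components in the collection, and n_t the number of components containing term t. The function and parameter names are hypothetical.

```python
import math

def lr_features(query_tf, comp_tf, comp_length, n_components, comp_freq):
    """Compute [X1..X6] for one query/component pair, or None if no terms match.

    query_tf:     term -> frequency in the query (QAF)
    comp_tf:      term -> frequency in the component (DAF)
    comp_length:  component length in term occurrences (DL)
    n_components: number of components in the collection (N)
    comp_freq:    term -> number of components containing the term (n_t),
                  assumed to be smaller than N so the logarithm is defined
    """
    matched = [t for t in query_tf if comp_tf.get(t, 0) > 0]
    m = len(matched)  # M: number of terms in common
    if m == 0:
        return None
    query_length = sum(query_tf.values())  # QL
    x1 = sum(math.log(query_tf[t]) for t in matched) / m
    x2 = math.sqrt(query_length)
    x3 = sum(math.log(comp_tf[t]) for t in matched) / m
    x4 = math.sqrt(comp_length)
    x5 = sum(math.log((n_components - comp_freq[t]) / comp_freq[t]) for t in matched) / m
    x6 = math.log(m)
    return [x1, x2, x3, x4, x5, x6]
```

Because every attribute depends only on counts already held in the postings, the weights can be calculated at retrieval time without a precomputed weight file.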

Combining Boolean and Probabilistic Search Elements

• Two original approaches:
  – Boolean Approach

$$P(R \mid Q, D) = P(R \mid Q_{bool}, D)\, P(R \mid Q_{prob}, D)$$

$$P(R \mid Q_{bool}, D) = \begin{cases} 1 & \text{if Boolean eval successful for } D \\ 0 & \text{otherwise} \end{cases}$$

  – Non-probabilistic “Fusion Search” Set merger approach is a weighted merger of document scores from separate Boolean and Probabilistic queries
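The Boolean approach amounts to using the Boolean result as a 0/1 filter on the probabilistic score. A minimal sketch, assuming the probabilistic scores and the Boolean hit set have already been computed (the document ids and score values below are made up):

```python
def combined_score(boolean_match, p_prob):
    """P(R|Q,D) = P(R|Q_bool,D) * P(R|Q_prob,D), where the Boolean factor is 1 or 0."""
    p_bool = 1.0 if boolean_match else 0.0
    return p_bool * p_prob

def rank(prob_scores, boolean_hits):
    """Keep probabilistic scores only for documents that satisfy the Boolean part."""
    scored = {doc: combined_score(doc in boolean_hits, p) for doc, p in prob_scores.items()}
    survivors = [(doc, s) for doc, s in scored.items() if s > 0.0]
    return sorted(survivors, key=lambda pair: pair[1], reverse=True)

# e.g. zfind topic @ government documents and title guidebooks
prob_scores = {"d1": 0.62, "d2": 0.80, "d3": 0.35}  # hypothetical ranked part
boolean_hits = {"d1", "d3"}                         # hypothetical Boolean part
ranking = rank(prob_scores, boolean_hits)
```

Here `d2` has the highest probabilistic score but drops out because it fails the Boolean condition.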

Okapi BM25

$$\sum_{T \in Q} w^{(1)} \, \frac{(k_1 + 1)\, tf}{K + tf} \, \frac{(k_3 + 1)\, qtf}{k_3 + qtf}$$

• Where:
• Q is a query containing terms T
• K is k1((1 - b) + b · dl / avdl)
• k1, b and k3 are parameters, usually 1.2, 0.75 and 7-1000
• tf is the frequency of the term in a specific document
• qtf is the frequency of the term in a topic from which Q was derived
• dl and avdl are the document length and the average document length measured in some convenient unit
• w(1) is the Robertson-Sparck Jones weight:

$$w^{(1)} = \log \frac{(r + 0.5) / (R - r + 0.5)}{(n - r + 0.5) / (N - n - R + r + 0.5)}$$
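The BM25 sum can be written out directly. This sketch assumes the usual meanings for the symbols the slide leaves undefined: N is the collection size, n the number of documents containing the term, R the number of known relevant documents and r the number of those containing the term. With no relevance information (r = R = 0) the weight reduces to log((N - n + 0.5) / (n + 0.5)). Function names and defaults are illustrative.

```python
import math

def rsj_weight(n, N, r=0, R=0):
    """Robertson-Sparck Jones weight w(1); with r = R = 0 no relevance data is needed."""
    return math.log(((r + 0.5) / (R - r + 0.5)) / ((n - r + 0.5) / (N - n - R + r + 0.5)))

def bm25(query_tf, doc_tf, dl, avdl, N, doc_freq, k1=1.2, b=0.75, k3=7.0):
    """Sum the BM25 contribution of every query term T that occurs in the document."""
    K = k1 * ((1 - b) + b * dl / avdl)
    score = 0.0
    for term, qtf in query_tf.items():
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue  # terms absent from the document contribute nothing
        w1 = rsj_weight(doc_freq[term], N)
        score += w1 * ((k1 + 1) * tf / (K + tf)) * ((k3 + 1) * qtf / (k3 + qtf))
    return score
```

For a document of average length, K equals k1, so a term occurring once in both query and document contributes exactly its w(1) weight.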

Merging and Ranking Operators

• Extends the capabilities of merging to include merger operations in queries like Boolean operators
• Fuzzy Logic Operators (not used for INEX)
  – !FUZZY_AND
  – !FUZZY_OR
  – !FUZZY_NOT
• Containment operators: Restrict components to or with a particular parent
  – !RESTRICT_FROM
  – !RESTRICT_TO
• Merge Operators
  – !MERGE_SUM
  – !MERGE_MEAN
  – !MERGE_NORM
  – !MERGE_CMBZ
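The slide names the merge operators but does not give their formulas, so the following is only a sketch of generic score-fusion rules that operators with these names plausibly resemble: a plain sum, a mean, a sum after min-max normalization, and a CombMNZ-style rule that multiplies the summed score by the number of lists containing the item. The function names and the exact formulas are assumptions, not the Cheshire II definitions.

```python
def merge_sum(a, b):
    """Add the scores of items found in either result set (missing scores count as 0)."""
    return {k: a.get(k, 0.0) + b.get(k, 0.0) for k in a.keys() | b.keys()}

def merge_mean(a, b):
    """Average the two scores, again treating a missing score as 0."""
    return {k: v / 2.0 for k, v in merge_sum(a, b).items()}

def normalize(scores):
    """Min-max scale one result set to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 1.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}

def merge_norm(a, b):
    """Sum the scores after normalizing each result set separately."""
    return merge_sum(normalize(a), normalize(b))

def merge_cmbz(a, b):
    """CombMNZ-style: normalized sum times the number of sets that contain the item."""
    return {k: s * ((k in a) + (k in b)) for k, s in merge_norm(a, b).items()}
```

Normalizing before summing matters when the two result sets come from different scoring methods, for example a logistic regression probability and a BM25 score, whose raw ranges are not comparable.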

INEX ‘04 Fusion Search

• Merge multiple ranked and Boolean index searches within each query and multiple component search resultsets
  – Major components merged are Articles, Body, Sections, subsections, paragraphs

[Figure: Subquery (×4) → Comp. Query Results (×2) → Fusion/Merge → Final Ranked List.]

New LR Coefficients

Index        b0       b1      b2       b3      b4       b5      b6
Base         -3.700   1.269   -0.310   0.679   -0.021   0.223   4.010
topic        -7.758   5.670   -3.427   1.787   -0.030   1.952   5.880
topicshort   -6.364   2.739   -1.443   1.228   -0.020   1.280   3.837
abstract     -5.892   2.318   -1.364   0.860   -0.013   1.052   3.600
alltitles    -5.243   2.319   -1.361   1.415   -0.037   1.180   3.696
sec words    -6.392   2.125   -1.648   1.106   -0.075   1.174   3.632
para words   -8.632   1.258   -1.654   1.485   -0.084   1.143   4.004

Estimates using INEX ‘03 relevance assessments for
b1 = Average Absolute Query Frequency
b2 = Query Length
b3 = Average Absolute Component Frequency
b4 = Document Length
b5 = Average Inverse Component Frequency
b6 = Number of Terms in common between query and Component
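Per-index coefficients mean the same attribute vector is scored differently depending on which index was searched. The sketch below copies three rows of the table and applies b0 + Σ b_i X_i. Two things in it are assumptions rather than statements from the slides: falling back to the Base row for an index without its own estimates, and treating the linear sum as a log-odds value that a logistic transform turns into a probability.

```python
import math

# Coefficients b0..b6 copied from the table above, keyed by index name.
LR_COEFFICIENTS = {
    "Base":     [-3.700, 1.269, -0.310, 0.679, -0.021, 0.223, 4.010],
    "topic":    [-7.758, 5.670, -3.427, 1.787, -0.030, 1.952, 5.880],
    "abstract": [-5.892, 2.318, -1.364, 0.860, -0.013, 1.052, 3.600],
}

def lr_score(index_name, x):
    """b0 + sum(b_i * X_i), using the index's row or Base if none is listed (assumption)."""
    b = LR_COEFFICIENTS.get(index_name, LR_COEFFICIENTS["Base"])
    return b[0] + sum(bi * xi for bi, xi in zip(b[1:], x))

def to_probability(log_odds):
    """Logistic transform, assuming the linear sum is a log-odds value."""
    return 1.0 / (1.0 + math.exp(-log_odds))
```

Since the transform is monotonic, ranking by `lr_score` and ranking by `to_probability(lr_score(...))` give the same order within one index.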

SGML/XML Support

• Underlying native format for all data is SGML or XML
• The DTD defines the file format for each file
• Full SGML/XML parsing
• SGML/XML Format Configuration Files define the database
• USMARC DTD and MARC to SGML conversion (and back again)
• Access to full-text via special SGML/XML tags

Indexing

• Any SGML/XML tagged field or attribute can be indexed:
  – B-Tree and Hash access via Berkeley DB (Sleepycat)
  – Stemming, keyword, exact keys and “special keys”
  – Mapping from any Z39.50 Attribute combination to a specific index
  – Underlying postings information includes term frequency for probabilistic searching
• Component extraction with separate component indexes
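Keyword indexing of one tagged field, with term frequencies kept in the postings, can be sketched as follows. This is an illustration with an in-memory dictionary standing in for the Berkeley DB B-Tree, and without stemming; the function name and sample records are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict

def index_records(records, tag):
    """Build term -> {record id: term frequency} from every <tag> element."""
    postings = defaultdict(dict)
    for rec_id, xml_text in records.items():
        root = ET.fromstring(xml_text)
        words = []
        for elem in root.iter(tag):
            # Keyword extraction: lowercase the element text and split on non-alphanumerics.
            text = "".join(elem.itertext()).lower()
            words.extend(re.findall(r"[a-z0-9]+", text))
        for term, tf in Counter(words).items():
            postings[term][rec_id] = tf
    return postings

records = {
    1: "<DOC><TITLE>Cats and cats</TITLE><TEXT>dogs</TEXT></DOC>",
    2: "<DOC><TITLE>Dogs</TITLE></DOC>",
}
title_index = index_records(records, "TITLE")
```

Storing the frequency alongside each record id is what lets the same index serve both Boolean lookups (which only need the ids) and probabilistic ranking (which needs the counts).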

XML Element Extraction

• A new search “ElementSetName” is XML_ELEMENT_
• Any Xpath, element name, or regular expression can be included following the final underscore when submitting a present request
• The matching elements are extracted from the records matching the search and delivered in a simple format.

XML Extraction

% zselect sherlock
372 {Connection with SHERLOCK (sherlock.berkeley.edu) database 'bibfile' at port 2100 is open as connection #372}
% zfind topic mathematics
{OK {Status 1} {Hits 26} {Received 0} {Set Default} {RecordSyntax UNKNOWN}}
% zset recsyntax XML
% zset elementset XML_ELEMENT_Fld245
% zdisplay
{OK {Status 0} {Received 10} {Position 1} {Set Default} {NextPosition 11} {RecordSyntax XML 1.2.840.10003.5.109.10}} {
<RESULT_DATA DOCID="1">
<ITEM XPATH="/USMARC[1]/VarFlds[1]/VarDFlds[1]/Titles[1]/Fld245[1]">
<Fld245 AddEnty="No" NFChars="0"><a>Singularitâes áa Cargáese</a></Fld245>
</ITEM>
<RESULT_DATA> … etc…
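The effect of an `XML_ELEMENT_Fld245` request, returning only the matching element together with its position in the record, can be imitated with the Python standard library. This is a sketch of the idea, not of the server's code: it matches by element name only (no XPath or regular expressions), and the function name and sample record are made up.

```python
import xml.etree.ElementTree as ET

def extract_elements(xml_text, tag):
    """Return (path, serialized element) pairs for every element named `tag`."""
    root = ET.fromstring(xml_text)
    hits = []

    def walk(elem, path):
        if elem.tag == tag:
            hits.append((path, ET.tostring(elem, encoding="unicode")))
        counts = {}
        for child in elem:
            # Number siblings per tag name so the path reads like /A[1]/B[2].
            counts[child.tag] = counts.get(child.tag, 0) + 1
            walk(child, f"{path}/{child.tag}[{counts[child.tag]}]")

    walk(root, f"/{root.tag}[1]")
    return hits

record = ('<USMARC><VarFlds><VarDFlds><Titles>'
          '<Fld245 AddEnty="No"><a>Sample title</a></Fld245>'
          '</Titles></VarDFlds></VarFlds></USMARC>')
hits = extract_elements(record, "Fld245")
```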

SGML/XML Support

• Configuration files for the Server are SGML/XML:
  – They include elements describing all of the data files and indexes for the database.
  – They also include instructions on how data is to be extracted for indexing and how Z39.50 attributes map to the indexes for a given database.

SGML/XML Support

• Example XML record for a DL document

<ELIB-BIB>
<BIB-VERSION>ELIB-v1.0</BIB-VERSION>
<ID>756</ID>
<ENTRY>June 12, 1996</ENTRY>
<DATE>June 1996</DATE>
<TITLE>Cumulative Watershed Effects: Applicability of Available Methodologies to the Sierra Nevada</TITLE>
<ORGANIZATION>University of California</ORGANIZATION>
<TYPE>report</TYPE>
<AUTHOR-INSTITUTIONAL>USDA Forest Service</AUTHOR-INSTITUTIONAL>
<AUTHOR-PERSONAL>Neil H. Berg</AUTHOR-PERSONAL>
<AUTHOR-PERSONAL>Ken B. Roby</AUTHOR-PERSONAL>
<AUTHOR-PERSONAL>Bruce J. McGurk</AUTHOR-PERSONAL>
<PROJECT>SNEP</PROJECT>
<SERIES>Vol 3</SERIES>
<PAGES>40</PAGES>
<TEXT-REF>/elib/data/docs/0700/756/HYPEROCR/hyperocr.html</TEXT-REF>
<PAGED-REF>/elib/data/docs/0700/756/OCR-ASCII-NOZONE</PAGED-REF>
</ELIB-BIB>

SGML Support

• Example SGML/MARC Record

<USMARC Material="BK" ID="00000003">
<leader><LRL>00722</LRL><RecStat>n</RecStat> <RecType>a</RecType><BibLevel>m</BibLevel><UCP></UCP><IndCount>2</IndCount> <SFCount>2</SFCount><BaseAddr>00229</BaseAddr><EncLevel> </EncLevel> <DscCatFm></DscCatFm><LinkRec></LinkRec>
<EntryMap><FLength>4</Flength><SCharPos> 5</SCharPos><IDLength>0</IDLength><EMUCP></EMUCP></EntryMap></Leader>
<Directry>001001400000005001700014008004100031010001400072035002000086035001700106100001900123245010500142250001100247260003200258300003300290504005000323650003600373700002200409700002200431950003200453998000700485</Directry>
<VarFlds>
<VarCFlds><Fld001>CUBGGLAD1282B</Fld001><Fld005>19940414143202.0</Fld005> <Fld008>830810 1983 nyu eng u</Fld008></VarCFlds>
<VarDFlds>
<NumbCode><Fld010 I1="Blank" I2="Blnk"><a>82019962 </a></Fld010> <Fld035 I1="Blank" I2="Blnk"><a>(CU)ocm08866667</a></Fld035><Fld035 I1="Blank" I2="Blnk"><a>(CU)GLAD1282</a></Fld035></NumbCode>
<MainEnty><Fld100 NameType="Single" I2=""><a>Burch, John G.</a></Fld100></MainEnty>
<Titles><Fld245 AddEnty="Yes" NFChars="0"><a>Information systems :</a><b>theory and practice /</b><c>John G. Burch, Jr., Felix R. Strater, Gary Grudnitski</c></Fld245></Titles>
<EdImprnt><Fld250 I1="Blank" I2="Blnk"><a>3rd ed</a></Fld250><Fld260 I1="" I2="Blnk"><a>New York :</a><b>J. Wiley,</b><c>1983</c></Fld260></EdImprnt>
<PhysDesc><Fld300 I1="Blank" I2="Blnk"><a>xvi, 632 p. :</a><b>ill. ;</b><c>24 cm</c></Fld300></PhysDesc>
<Series></Series>
<Notes><Fld504 I1="Blank" I2="Blnk"><a>Includes bibliographical references and index</a></Fld504></Notes>
<SubjAccs><Fld650 SubjLvl="NoInfo" SubjSys="LCSH"><a>Management information systems.</a></Fld650> ...

SGML/XML Support

TREC document…

<DOC>
<DOCNO>FT931-3566</DOCNO>
<PROFILE>_AN-DCPCCAA3FT</PROFILE>
<DATE>930316</DATE>
<HEADLINE>FT 16 MAR 93 / Italy's Corruption Scandal: Magistrates hold key to
unlocking Tangentopoli - They will set the investigation agenda</HEADLINE>
<BYLINE> By ROBERT GRAHAM</BYLINE>
<TEXT>OVER the weekend the Italian media felt obliged to comment on a non-event.
No new arrests had taken place in any of the country's ever more numerous
corruption scandals which centre on the illicit funding of political parties...</TEXT>
<XX> …

…Companies:-</XX>
<CO>Ente Nazionale Idrocarburi. Ente Nazionale per L'Energia Electtrica. Ente Partecipazioni E Finanziamento Industria Manifatturiera. IRI Istituto per La Ricostruzione Industriale.</CO>
<XX>Countries:-</XX>
<CN>ITZ Italy, EC.</CN>
<XX>Industries:-</XX>
<IN>P9222 Legal Counsel and Prosecution. P91 Executive, Legislative and General Government. P13 Oil and Gas Extraction. P9631 Regulation, Administration of Utilities. P6719 Holding Companies, NEC.</IN>
<XX>Types:-</XX> …


<TP>CMMT Comment &amp; Analysis.

GOVT Legal issues.

</TP>

<PUB>The Financial Times

</PUB>

<PAGE>

London Page 4

</PAGE>

</DOC>

SGML/XML Support

• INEX Document

<article>
<fno>C1050</fno>
<doi>10.1041/C1050s-2000</doi>
<fm>
<hdr><hdr1><ti>COMPUTING IN SCIENCE &amp; ENGINEERING</ti><crt><issn>1521-9615</issn>/00/$10.00 <cci><onm>&copy; 2000 IEEE</onm></cci></crt></hdr1>
<hdr2><obi><volno>Vol. 2</volno><issno>No. 1</issno></obi><pdt><mo>JANUARY/FEBRUARY</mo><yr>2000</yr></pdt><pp>pp. 50-59</pp></hdr2></hdr>
<tig><atl>The Decompositional Approach to Matrix Computation</atl><pn>pp. 50-59</pn></tig>
<au sequence="first"><fnm>G.W.</fnm><snm>Stewart</snm><aff><onm>University of Maryland</onm></aff></au>
<fig><art file="c1050x1.gif" w="425" h="321" tw="150" th="113"/></fig>
<abs><p>The introduction of matrix decomposition into numerical linear algebra revolutionized matrix computations. This article outlines the decompositional approach, comments on its history, and surveys the six most widely used decompositions.</p></abs>
</fm>
<bdy>
<sec><st></st>
<ip1>In 1951, Paul S. Dwyer published <it>Linear Computations</it>, perhaps the first book devoted entirely to numerical linear algebra.<ref rid="bibc10501" type="bib">1</ref> Digital computing was in its infancy, and Dwyer focused on computation with mechanical calculators. Nonetheless, the book was state of the art. <ref rid="c10501" type="fig">Figure 1</ref> reproduces a page of the book dealing with Gaussian elimination. In 1954, Alston S. Householder published <it>Principles of Numerical Analysis</it>,<ref rid="bibc10502" type="bib">2</ref> one of the first modern treatments of high-speed digital computation. <ref rid="c10502" type="fig">Figure 2</ref> reproduces a page from this book, also dealing with Gaussian elimination.</ip1>
<fig id="c10501"><art file="c10501.gif" w="600" h="970" tw="150" th="243"/><no>1</no><fgc>This page from <it>Linear Computations</it> shows that Paul Dwyer's approach begins with a system of scalar equations. Courtesy of John Wiley &amp; Sons.</fgc></fig>
<fig id="c10502"><art file="c10502.gif" w="500" h="807" tw="150" th="242"/><no>2</no><fgc>On this page from <it>Principles of Numerical Analysis</it>, Alston Householder uses partitioned matrices and LU decomposition. Courtesy of McGraw-Hill.</fgc></fig>
<p>The contrast between these two excerpts is striking. The most obvious difference is that Dwyer used scalar equations whereas Householder used partitioned matrices. …

SGML/XML Support

…<sec><st>CONCLUSION</st>
<ip1>The big six are not the only decompositions in use; in fact, there are many more. As mentioned earlier, certain intermediate forms&mdash;such as tridiagonal and Hessenberg forms&mdash;have come to be regarded as decompositions in their own right. Since the singular value decomposition is expensive to compute and not readily updated, rank-revealing alternatives have received considerable attention.<ref rid="bibc105054" type="bib">54</ref><super>,</super><ref rid="bibc105055" type="bib">55</ref> There are also generalizations of the singular value decomposition and the Schur decomposition for pairs of matrices.<ref rid="bibc105056" type="bib">56</ref><super>,</super><ref rid="bibc105057" type="bib">57</ref> All crystal balls become cloudy when they look to the future, but it seems safe to say that as long as new matrix problems arise, new decompositions will be devised to solve them.</ip1></sec></bdy>
<bm><ack><h>Acknowledgment</h><ip1><it>This work was supported by the National Science Foundation under Grant No. 970909-8562.</it></ip1></ack>
<bib><bibl><h>References</h>
<bb id="bibc10501"><au><fnm>P.S.</fnm><snm>Dwyer</snm></au><ti>Linear Computations,</ti><obi>John Wiley &amp; Sons,</obi><loc><cty>New York,</cty></loc><pdt><yr>1951.</yr></pdt></bb>
<bb id="bibc10502"><au><fnm>A.S.</fnm><snm>Householder</snm></au><ti>Principles of Numerical Analysis,</ti><obi>McGraw-Hill,</obi><loc><cty>New York,</cty></loc><pdt><yr>1953.</yr></pdt></bb>
<bb id="bibc10503"><au><fnm>J.H.</fnm><snm>Wilkinson</snm></au><obi>and</obi><au><fnm>C.</fnm><snm>Reinsch</snm></au><ti>Handbook for Automatic Computation, Vol. II, Linear Algebra,</ti><obi>Springer-Verlag,</obi><loc><cty>New York,</cty></loc><pdt><yr>1971.</yr></pdt></bb>
<bb id="bibc10504"><au><fnm>B.S.</fnm><snm>Garbow</snm></au><obi>et al.,</obi><atl>"Matrix Eigensystem Routines&mdash;Eispack Guide Extension,"</atl><ti>Lecture Notes in Computer Science,</ti><obi>Springer-Verlag,</obi><loc><cty>New York,</cty></loc><pdt><yr>1977.</yr></pdt></bb>
<bb id="bibc10505"><au><fnm>J.J.</fnm><snm>Dongarra</snm></au><obi>et al.,</obi><ti>LINPACK User's Guide,</ti><obi>SIAM,</obi><loc><cty>Philadelphia,</cty></loc><pdt><yr>1979.</yr></pdt></bb> …

SGML/XML Support

• INEX CAS Query

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE inex_topic SYSTEM "topic.dtd">
<inex_topic topic_id="70" query_type="CAS" ct_no="49">
<title> /article[about(./fm/abs,'"information retrieval" "digital libraries"')]</title>
<description>Retrieve articles with an abstract indicating the article
is about information retrieval and/or digital libraries</description>
<narrative>To be relevant the retrieved articles must be about
information retrieval, digital libraries or, preferably both. Articles
about information retrieval from digital libraries will receive the
highest relevance judgements.</narrative>
<keywords>information retrieval,digital libraries</keywords>
</inex_topic>

SGML/XML Support

• Configuration files for the Server are also SGML/XML:
  – They include tags describing all of the data files and indexes for the database.
  – They also include instructions on how data is to be extracted for indexing and how Z39.50 attributes map to the indexes for a given database.

Cheshire Configuration Files

<!-- ******************************************************************* -->
<!-- ************************* TREC INTERACTIVE TEST DB **************** -->
<!-- ******************************************************************* -->
<!-- This is the config file for the Cheshire II TREC interactive Database -->
<DBCONFIG>
<DBENV>/projects/is240/GroupX/indexes </DBENV>

<!-- -->
<!-- TREC TEST DATABASE FILEDEF -->
<!-- -->

<!-- The Interactive TREC Financial Times datafile -->
<FILEDEF TYPE=SGML>

<DEFAULTPATH>/projects/is240/GroupX </DEFAULTPATH>

<!-- filetag is the "shorthand" name of the file -->
<FILETAG> trec </FILETAG>

<!-- filename is the full path name of the main data directory -->
<FILENAME> /projects/is240/ft </FILENAME>

<CONTINCLUDE> /projects/is240/ft.CONT </CONTINCLUDE>

<!-- fileDTD is the full path name of the file's DTD -->
<FILEDTD> /projects/is240/TREC.FT.DTD </FILEDTD>
<!-- assocfil is the full path name of the file's Associator -->
<ASSOCFIL> ft.assoc </ASSOCFIL>

<!-- history is the full path name of the file's history file -->
<HISTORY> cheshire_index/TESTDATA.history </HISTORY>
…
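Because of the unquoted `TYPE=SGML` attribute and the elided closing tags, the fragment above is SGML rather than well-formed XML, so a strict XML parser would reject it. For a quick inspection, a regular-expression pass over the simple `<NAME> value </NAME>` pairs is enough. This is a rough sketch with a hypothetical helper name, not how the Cheshire server reads its configuration.

```python
import re

# A shortened copy of the configuration fragment shown above.
CONFIG = """
<DBCONFIG>
<DBENV>/projects/is240/GroupX/indexes </DBENV>
<FILEDEF TYPE=SGML>
<DEFAULTPATH>/projects/is240/GroupX </DEFAULTPATH>
<FILETAG> trec </FILETAG>
<FILENAME> /projects/is240/ft </FILENAME>
<FILEDTD> /projects/is240/TREC.FT.DTD </FILEDTD>
"""

def read_settings(text, names):
    """Return {element name: stripped text} for each simple <NAME> value </NAME> pair."""
    settings = {}
    for name in names:
        match = re.search(rf"<{name}>\s*(.*?)\s*</{name}>", text, re.IGNORECASE | re.DOTALL)
        if match:
            settings[name] = match.group(1)
    return settings

settings = read_settings(CONFIG, ["DBENV", "FILETAG", "FILENAME", "FILEDTD", "HISTORY"])
```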

Page 30: March 2, 2004 Ray R. Larson Cheshire II: Features and Internals and Cheshire III overview Ray R. Larson School of Information Management and Systems University

March 2, 2004 Ray R. Larson

Indexing

• Any SGML/XML tagged field or attribute can be indexed:
– B-Tree and Hash access via Berkeley DB (Sleepycat)
– Stemming, keyword, exact keys and “special keys”
– Mapping from any Z39.50 Attribute combination to a specific index
– Underlying postings information includes term frequency for probabilistic searching.
– SGML may include address of full-text for indexing

• New indexes can be easily added, or old ones deleted
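
The mapping from a Z39.50 attribute combination to a specific index can be pictured as a simple lookup table. The sketch below is only an illustration, not Cheshire II's internal code: the `INDEX_MAP` table and the `resolve_index` function are hypothetical names, and the attribute values are copied from the INDXMAP entries of the TREC example configuration in this deck.

```python
from typing import Optional

# Hypothetical lookup table: (USE, RELATION, POSITION, STRUCTURE) -> index tag.
# None means the attribute is not given in the corresponding INDXMAP entry.
INDEX_MAP = {
    (12, None, None, 1): "docno",
    (12, None, None, 2): "docno",
    (12, None, None, 6): "docno",
    (29, None, 3, 6): "topic",
    (29, 102, 3, 6): "topic",
}


def resolve_index(use: int,
                  relation: Optional[int] = None,
                  position: Optional[int] = None,
                  structure: Optional[int] = None) -> Optional[str]:
    """Return the index tag mapped to this attribute combination, if any."""
    return INDEX_MAP.get((use, relation, position, structure))


if __name__ == "__main__":
    print(resolve_index(12, structure=2))
    print(resolve_index(29, relation=102, position=3, structure=6))
    print(resolve_index(4))
```

An exact-match dictionary is the simplest possible design; a real server would presumably also need rules for partially specified attribute sets, which this sketch does not attempt.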


Bitmapped Indexes

• Bitmap indexes can be used for Boolean operations where the data has only a few values and very large numbers of items with each value

• Only one bit per record stored in the index

• Processed on a demand basis so only blocks with the bits needed to resolve a query are fetched
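
The idea of one bit per record and Boolean resolution by bitwise operations can be sketched as follows. This is a minimal illustration under stated assumptions, not the Cheshire II implementation: the `BitmapIndex` class, the `to_ids` helper, and the sample field values are all hypothetical, a Python integer stands in for the bit vector, and the on-demand block fetching mentioned above is not modelled.

```python
from collections import defaultdict


class BitmapIndex:
    """Toy bitmap index: one bit vector (a Python int) per distinct value."""

    def __init__(self):
        self.bitmaps = defaultdict(int)

    def add(self, record_id: int, value: str) -> None:
        # Set the bit for this record in the vector belonging to `value`.
        self.bitmaps[value] |= 1 << record_id

    def lookup(self, value: str) -> int:
        # Unknown values yield an empty bit vector.
        return self.bitmaps.get(value, 0)


def to_ids(bits: int) -> list:
    """Convert a bit vector into the list of record ids whose bits are set."""
    ids = []
    position = 0
    while bits:
        if bits & 1:
            ids.append(position)
        bits >>= 1
        position += 1
    return ids


# Hypothetical sample data: (language, material type) for records 0..3.
records = [("eng", "book"), ("fre", "book"), ("eng", "serial"), ("eng", "book")]
by_language = BitmapIndex()
by_type = BitmapIndex()
for record_id, (language, material) in enumerate(records):
    by_language.add(record_id, language)
    by_type.add(record_id, material)

# Boolean AND and OR become single bitwise operations on the vectors.
english_books = by_language.lookup("eng") & by_type.lookup("book")
french_or_serial = by_language.lookup("fre") | by_type.lookup("serial")
print(to_ids(english_books))     # [0, 3]
print(to_ids(french_or_serial))  # [1, 2]
```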


<!-- The following are the index definitions for the file -->
<INDEXES>

<!-- ******************************************************************* -->
<!-- ************************* DOC NO. ********************************* -->
<!-- ******************************************************************* -->
<!-- The following provides document number access. -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=NONE PRIMARYKEY=IGNORE>
<INDXNAME> cheshire_index/trec.docno.index </INDXNAME>
<INDXTAG> docno </INDXTAG>
<INDXMAP><USE> 12 </USE><struct> 1 </struct> </INDXMAP>
<INDXMAP><USE> 12 </USE><struct> 2 </struct> </INDXMAP>
<INDXMAP><USE> 12 </USE><struct> 6 </struct> </INDXMAP>
<INDXKEY><TAGSPEC><FTAG>DOCNO </FTAG></TAGSPEC> </INDXKEY>
</INDEXDEF>
…


<!-- ******************************************************************* -->
<!-- ************************* TOPIC *********************************** -->
<!-- ******************************************************************* -->
<!-- The following is the primary index for probabilistic searches -->
<!-- It includes headlines, datelines, bylines, and full text -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROXIMITY NORMAL=STEM>
<INDXNAME> cheshire_index/trec.topic.index </INDXNAME>
<INDXTAG> topic </INDXTAG>
<INDXMAP><USE> 29 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<INDXMAP><USE> 29 </USE><RELAT> 102 </RELAT><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
…
<STOPLIST> cheshire_index/topicstoplist </STOPLIST>
<INDXKEY><TAGSPEC><FTAG>HEADLINE </FTAG><FTAG>DATELINE </FTAG><FTAG>BYLINE </FTAG><FTAG>TEXT </FTAG></TAGSPEC> </INDXKEY>
</INDEXDEF>


Cheshire II – EVI Generation

• Entry Vocabulary Indexes can improve access to data with controlled index terms

• Define basis for clustering records.
– Select field to form the basis of the cluster.
– Evidence Fields to use as contents of the pseudo-documents.

• During indexing cluster keys are generated with basis and evidence from each record.

• Cluster keys are sorted and merged on basis and pseudo-documents created for each unique basis element containing all evidence fields.

• Pseudo-Documents (Class clusters) are indexed on combined evidence fields.
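
The steps above (generate cluster keys from basis and evidence, sort and merge on basis, build one pseudo-document per unique basis value) can be sketched in a few lines. This is an illustrative sketch only: the function `build_pseudo_documents`, the dictionary record format, and the sample records are assumptions made for the example, and the final indexing of the pseudo-documents is left out.

```python
from collections import defaultdict


def build_pseudo_documents(records, basis_field, evidence_fields):
    """Group evidence text by basis value, one pseudo-document per basis."""
    # Step 1: generate a (basis, evidence) cluster key from each record.
    cluster_keys = []
    for record in records:
        basis = record.get(basis_field)
        if not basis:
            continue  # records without a basis value form no cluster key
        evidence = [record[f] for f in evidence_fields if record.get(f)]
        cluster_keys.append((basis, " ".join(evidence)))

    # Step 2: sort the keys so that equal basis values become adjacent.
    cluster_keys.sort()

    # Step 3: merge the evidence of all keys that share a basis value.
    merged = defaultdict(list)
    for basis, evidence in cluster_keys:
        merged[basis].append(evidence)
    return {basis: " ".join(parts) for basis, parts in merged.items()}


# Hypothetical sample records with a class number as the basis field.
sample = [
    {"class": "QA76", "title": "database systems"},
    {"class": "Z699", "title": "information retrieval"},
    {"class": "QA76", "title": "compiler design", "subject": "programming languages"},
    {"title": "record without a class number"},
]
pseudo_docs = build_pseudo_documents(sample, "class", ["title", "subject"])
for basis, text in pseudo_docs.items():
    print(basis, "->", text)
```

Searching the pseudo-documents with a user's free-text query would then suggest basis values (for example class numbers) whose evidence matches the query terms.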


EVI/Cluster Definitions

<!-- ************************* CLUSTER ********************************* -->
<!-- *********************** DEFINITIONS ******************************* -->
<CLUSTER>
<clusname> classcluster </clusname>
<cluskey normal=CLASSCLUS>
<tagspec><FTAG>FLD950 </FTAG> <s> ^a </s></tagspec>
</cluskey>
<stoplist> /usr3/cheshire2/data2/clasclusstoplist </stoplist>
<clusmap>
<from> <tagspec>
<ftag>FLD245</ftag><s>^[ab]</s>
<ftag>FLD440</ftag><s>^a</s>
<ftag>FLD490</ftag><s>^a</s>
<ftag>FLD830</ftag><s>^a</s>
<ftag>FLD740</ftag><s>^a</s>
</tagspec></from>
<to> <tagspec><ftag>titles</ftag> </tagspec></to>
<from> <tagspec><ftag>FLD6..</ftag><s>^[abcdxyz]</s> </tagspec></from>
<to> <tagspec><ftag>subjects</ftag> </tagspec></to>
<summarize> <maxnum> 5 </maxnum>
<tagspec> <ftag>subjsum</ftag></tagspec></summarize>
</clusmap>
</CLUSTER>


Component Extraction and Indexing

• Any element (or range of SGML/XML data starting with one element and ending with another) can be defined as a ‘component’ and accessed and indexed as if it were an entire document.

• Component indexes and document-level indexes can be combined in search operations (and special operators permit selection of document or components as the result).
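
Treating an element as a stand-alone unit can be sketched with Python's standard `xml.etree.ElementTree` module. The example is a simplification and not Cheshire II code: `extract_components` and the sample document are hypothetical, and only the single-element case is shown (a range running from a start element to a separate end element is not handled).

```python
import xml.etree.ElementTree as ET


def extract_components(xml_text: str, start_tag: str) -> list:
    """Return the text of every `start_tag` element as its own 'document'."""
    root = ET.fromstring(xml_text)
    components = []
    for element in root.iter(start_tag):
        # Join all text inside the element and normalise whitespace.
        text = " ".join(" ".join(element.itertext()).split())
        components.append(text)
    return components


# Hypothetical sample document using INEX-like tag names.
sample_doc = (
    "<article><fm><atl>Sample article</atl></fm><bdy>"
    "<sec><st>Intro</st><p>First part.</p></sec>"
    "<sec><st>Method</st><p>Second part.</p></sec>"
    "</bdy></article>"
)
sections = extract_components(sample_doc, "sec")
print(sections)  # ['Intro First part.', 'Method Second part.']
```

Each returned string could then be indexed and ranked on its own, which is what allows a search to return a section rather than the whole article.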


Component Definitions

<COMPONENTS>
<COMPONENTDEF>
<COMPONENTNAME> TESTDATA/COMPONENT_DB1 </COMPONENTNAME>
<COMPONENTNORM>NONE</COMPONENTNORM>
<COMPSTARTTAG> <TAGSPEC> <FTAG>mainenty </FTAG> <FTAG>titles </FTAG> </TAGSPEC></COMPSTARTTAG>
<COMPENDTAG> <TAGSPEC><FTAG>Fld300 </FTAG></TAGSPEC></COMPENDTAG>
<COMPONENTINDEXES>
<!-- First index def -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=NONE>
<INDXNAME> TESTDATA/comp1index1.author…
</INDEXDEF>
</COMPONENTDEF>
</COMPONENTS>


Result Formatting (Display)

<DISPOPTIONS>KEEP_ENTITIES</DISPOPTIONS>

<DISPLAY>
<FORMAT NAME="B" OID="1.2.840.10003.5.105" DEFAULT>
<convert function="TAGSET-G">
<clusmap>
<from> <tagspec> <ftag>DOCNO</ftag> </tagspec></from>
<to> <tagspec> <ftag>28</ftag> </tagspec></to>
<from> <tagspec> <ftag>#DOCID#</ftag> </tagspec></from>
<to> <tagspec> <ftag>5</ftag> </tagspec></to>
</clusmap>
</convert>
</FORMAT>
</DISPLAY>


INEX Configuration Example

<!-- ******************************************************************* -->
<!-- ********************* Config for INEX evaluation ****************** -->
<!-- ******************************************************************* -->
<!-- This is the config file for the Cheshire II TREC interactive Database -->
<!-- new version uses proximity indexes... -->

<DBCONFIG>
<DBENV>/projects/metadata/cheshire/TREC/cheshire_index </DBENV>

<!-- -->
<!-- INEX TEST DATABASE FILEDEF -->
<!-- -->

<FILEDEF TYPE=XML>
<DEFAULTPATH> /projects/metadata/cheshire/INEX </DEFAULTPATH>

<!-- filetag is the "shorthand" name of the file -->
<FILETAG> INEX </FILETAG>

<!-- filename is the full path name of the main data directory -->
<FILENAME> inex-1.3/xml </FILENAME>
<CONTINCLUDE> inex-1.3/xml_main.cont </CONTINCLUDE>

<!-- fileDTD is the full path name of the file's DTD -->
<FILEDTD> inex-1.3/dtd/wrapper.dtd </FILEDTD>
<SGMLCAT> inex-1.3/dtd/catalog </SGMLCAT>

<!-- assocfil is the full path name of the file's Associator -->
<ASSOCFIL> inex-1.3/xml_main.assoc </ASSOCFIL>

<!-- history is the full path name of the file's history file -->
<HISTORY> inex.history </HISTORY>


INEX Configuration Example

<!-- The following are the index definitions for the file -->
<INDEXES>

<!-- ******************************************************************* -->
<!-- ************************* DOC NO. ********************************* -->
<!-- ******************************************************************* -->
<!-- The following provides document number access. -->
<INDEXDEF ACCESS=BTREE EXTRACT=EXACTKEY NORMAL=DO_NOT_NORMALIZE PRIMARYKEY=IGNORE>
<INDXNAME> indexes/docno.index </INDXNAME>
<INDXTAG> docno </INDXTAG>
<INDXMAP><USE> 12 </USE><struct> 1 </struct> </INDXMAP>
<INDXMAP><USE> 12 </USE><struct> 2 </struct> </INDXMAP>
<INDXMAP><USE> 12 </USE><struct> 6 </struct> </INDXMAP>
<INDXKEY><TAGSPEC><FTAG> doi </FTAG></TAGSPEC> </INDXKEY>
</INDEXDEF>


INEX Configuration Example

<!-- ******************************************************************* -->
<!-- ********************** PERSONAL AUTHOR/BYLINE ********************* -->
<!-- ******************************************************************* -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=NONE>
<INDXNAME> indexes/pauthor.index</INDXNAME>
<INDXTAG> pauthor </INDXTAG>

<!-- The following INDXMAP items provide a mapping from the AUTHOR tag to -->
<!-- the appropriate Z39.50 BIB1 attribute numbers -->
<INDXMAP> <USE> 1 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<INDXMAP> <USE> 1004 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>

<!-- The stoplist for this file -->
<STOPLIST> indexes/authorstoplist </STOPLIST>

<!-- The INDXKEY area contains the specifications of tags in the doc -->
<!-- that are to be extracted and indexed for this index -->
<INDXKEY><TAGSPEC>
<FTAG>fm</FTAG><S>au</S><S>snm</S>
<FTAG>fm</FTAG><S>au</S><S>fnm</S>
</TAGSPEC> </INDXKEY>
</INDEXDEF>
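
One way to read a TAGSPEC such as `<FTAG>fm</FTAG><S>au</S><S>snm</S>` is as a path of nested elements: `snm` inside `au` inside `fm`. The sketch below illustrates that reading with Python's standard `xml.etree.ElementTree`. It rests on that interpretation as an assumption, `extract_by_path` is a hypothetical helper rather than a Cheshire II function, and the sample names are made up.

```python
import xml.etree.ElementTree as ET


def extract_by_path(root, path):
    """Collect the text of elements reached by following `path` downward."""
    current = [root]
    for tag in path:
        next_level = []
        for node in current:
            # Descendants named `tag`, excluding the node itself.
            next_level.extend(d for d in node.iter(tag) if d is not node)
        current = next_level
    return [" ".join(" ".join(n.itertext()).split()) for n in current]


# Hypothetical sample: one front-matter author and one bibliography author.
sample = ET.fromstring(
    "<article>"
    "<fm><au><fnm>Alex</fnm><snm>Smith</snm></au></fm>"
    "<bm><bb><au><fnm>Kim</fnm><snm>Jones</snm></au></bb></bm>"
    "</article>"
)
print(extract_by_path(sample, ["fm", "au", "snm"]))  # ['Smith']
print(extract_by_path(sample, ["bb", "au", "snm"]))  # ['Jones']
```

Under this reading, restricting the path to `fm` keeps authors cited in the bibliography out of the personal author index.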


INEX Configuration Example

<!-- ******************************************************************* -->
<!-- ************************* TITLE/HEADLINE ************************** -->
<!-- ******************************************************************* -->
<!-- The following provides keyword title access -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROX NORMAL=STEM>
<INDXNAME> indexes/title.index </INDXNAME>
<INDXTAG> title </INDXTAG>
<INDXMAP><USE> 4 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<INDXMAP> <USE> 5 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<INDXMAP> <USE> 6 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<STOPLIST> indexes/titlestoplist </STOPLIST>
<INDXKEY><TAGSPEC><FTAG>fm</FTAG><S>tig</S><S>atl</S></TAGSPEC> </INDXKEY>
</INDEXDEF>


INEX Configuration Example

<!-- ******************************************************************* -->
<!-- ************************* TOPIC *********************************** -->
<!-- ******************************************************************* -->
<!-- The following is the primary index for probabilistic searches -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROX NORMAL=STEM>
<INDXNAME> indexes/topic.index </INDXNAME>
<INDXTAG> topic </INDXTAG>
<INDXMAP><USE> 29 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
…
<INDXMAP><USE> 1017 </USE><RELAT> 102 </RELAT><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<STOPLIST> indexes/topicstoplist </STOPLIST>
<INDXKEY><TAGSPEC>
<FTAG>fm</FTAG><S>tig</S><S>atl</S>
<FTAG>abs</FTAG>
<FTAG>bdy</FTAG>
<FTAG>bibl</FTAG><S>bb</S><S>atl</S>
<FTAG>app</FTAG>
</TAGSPEC> </INDXKEY>
</INDEXDEF>


INEX Configuration Example

<!-- ******************************************************************* -->
<!-- ************************** DATE *********************************** -->
<!-- ******************************************************************* -->
<INDEXDEF ACCESS=BTREE EXTRACT=DATE NORMAL=YEAR>
<INDXNAME> indexes/date.index</INDXNAME>
<INDXTAG> date</INDXTAG>

<!-- The following INDXMAP items provide a mapping from the AUTHOR tag to -->
<!-- the appropriate Z39.50 BIB1 attribute numbers -->
<INDXMAP><USE> 30 </USE><POSIT> 3 </posit> <struct> 6 </struct></INDXMAP>
<INDXMAP><USE> 30 </USE><POSIT> 3 </posit> <struct> 5 </struct></INDXMAP>
<INDXKEY><TAGSPEC><FTAG>hdr2</FTAG><s>yr</s></TAGSPEC></INDXKEY>
</INDEXDEF>


INEX Configuration Example

<!-- ******************************************************************* -->
<!-- ************************** JOURNAL ******************************* -->
<!-- ******************************************************************* -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=NONE>
<INDXNAME> indexes/journal.index</INDXNAME>
<INDXTAG> journal</INDXTAG>
<INDXMAP><USE> 1022 </USE><POSIT> 3 </posit> <struct> 6 </struct></INDXMAP>
<INDXMAP><USE> 1022 </USE><POSIT> 3 </posit> <struct> 5 </struct></INDXMAP>
<INDXKEY><TAGSPEC><FTAG>hdr1</FTAG><s>ti</s></TAGSPEC></INDXKEY>
</INDEXDEF>


INEX Configuration Example

<!-- ******************************************************************* -->
<!-- ************************* KEYWORDS********************************* -->
<!-- ******************************************************************* -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROXIMITY NORMAL=STEM>
<INDXNAME> indexes/keywords.index </INDXNAME>
<INDXTAG> kwd </INDXTAG>
<INDXMAP><USE> 3121 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<STOPLIST> indexes/topicstoplist </STOPLIST>
<INDXKEY><TAGSPEC><FTAG>kwd</FTAG></TAGSPEC> </INDXKEY>
</INDEXDEF>


INEX Configuration Example

<!-- ******************************************************************* -->
<!-- ************************* ABSTRACT********************************* -->
<!-- ******************************************************************* -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROXIMITY NORMAL=STEM>
<INDXNAME> indexes/abstract.index </INDXNAME>
<INDXTAG> abstract </INDXTAG>
<INDXMAP><USE> 62 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<STOPLIST> indexes/topicstoplist </STOPLIST>
<INDXKEY><TAGSPEC><FTAG>abs</FTAG></TAGSPEC> </INDXKEY>
</INDEXDEF>


INEX Configuration Example

<!-- The following index has contents of the SEQUENCE attribute of the -->
<!-- au (author) tag: either "first" or "additional" -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=NONE>
<INDXNAME> indexes/author_seq.index</INDXNAME>
<INDXTAG> author_seq </INDXTAG>
<INDXKEY><TAGSPEC><FTAG>fm</FTAG><S>au</S><ATTR>sequence</ATTR></TAGSPEC> </INDXKEY>
</INDEXDEF>


INEX Configuration Example

<!-- ******************************************************************* -->
<!-- ************************* Bib author Forename ******************** -->
<!-- ******************************************************************* -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=NONE>
<INDXNAME> indexes/bib_author_fnm.index</INDXNAME>
<INDXTAG> bib_author_fnm </INDXTAG>
<INDXMAP> <USE> 1000 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<INDXKEY><TAGSPEC><FTAG>bb</FTAG><s>au</s><s>fnm</s> </TAGSPEC> </INDXKEY>
</INDEXDEF>


INEX Configuration Example

<!-- ******************************************************************* -->
<!-- ************************* Bib author surname ******************** -->
<!-- ******************************************************************* -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=NONE>
<INDXNAME> indexes/bib_author_snm.index</INDXNAME>
<INDXTAG> bib_author_snm </INDXTAG>
<INDXMAP> <USE> 1000 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<INDXKEY><TAGSPEC><FTAG>bb</FTAG><s>au</s><s>snm</s> </TAGSPEC> </INDXKEY>
</INDEXDEF>


INEX Configuration Example

<!-- ******************************************************************* -->
<!-- ************************* FIGURES ********************************* -->
<!-- ******************************************************************* -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=STEM>
<INDXNAME> indexes/fig.index </INDXNAME>
<INDXTAG> fig </INDXTAG>
<INDXMAP><USE> 3150 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<STOPLIST> indexes/topicstoplist </STOPLIST>
<INDXKEY><TAGSPEC><FTAG>fig</FTAG></TAGSPEC> </INDXKEY>
</INDEXDEF>


INEX Configuration Example

<!-- ******************************************************************* -->
<!-- ************************* acknowledgements ************************ -->
<!-- ******************************************************************* -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=STEM>
<INDXNAME> indexes/ack.index </INDXNAME>
<INDXTAG> ack </INDXTAG>
<INDXMAP><USE> 3188 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<STOPLIST> indexes/topicstoplist </STOPLIST>
<INDXKEY><TAGSPEC><FTAG>ack</FTAG></TAGSPEC> </INDXKEY>
</INDEXDEF>


INEX Configuration Example

<!-- ******************************************************************* -->
<!-- ************************* alltitles ******************************* -->
<!-- ******************************************************************* -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROXIMITY NORMAL=STEM>
<INDXNAME> indexes/alltitles.index </INDXNAME>
<INDXTAG> alltitles </INDXTAG>
<INDXMAP><USE> 3188 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<STOPLIST> indexes/titlestoplist </STOPLIST>
<INDXKEY><TAGSPEC><FTAG>atl</FTAG><FTAG>st</FTAG></TAGSPEC> </INDXKEY>
</INDEXDEF>


INEX Configuration Example

<!-- ******************************************************************* -->
<!-- ************************* Affiliation ***************************** -->
<!-- ******************************************************************* -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=NONE>
<INDXNAME> indexes/affil.index </INDXNAME>
<INDXTAG> affil </INDXTAG>
<INDXMAP><USE> 3189 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<STOPLIST> indexes/titlestoplist </STOPLIST>
<INDXKEY><TAGSPEC><FTAG>fm</FTAG><s>aff</s></TAGSPEC> </INDXKEY>
</INDEXDEF>


INEX Configuration Example

<!-- ******************************************************************* -->
<!-- ************************* FNO ********************************* -->
<!-- ******************************************************************* -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=none>
<INDXNAME> indexes/fno.index </INDXNAME>
<INDXTAG> fno </INDXTAG>
<INDXMAP><USE> 3192 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<INDXKEY><TAGSPEC><FTAG>fno</FTAG></TAGSPEC> </INDXKEY>
</INDEXDEF>


INEX Configuration Example

<!-- ******************************************************************* -->
<!-- ************************* FIGNO ********************************* -->
<!-- ******************************************************************* -->
<INDEXDEF ACCESS=BTREE EXTRACT=INTEGER NORMAL=NONE>
<INDXNAME> indexes/figno.index </INDXNAME>
<INDXTAG> figno </INDXTAG>
<INDXMAP><USE> 3193 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<INDXKEY><TAGSPEC><FTAG>fig</FTAG><s>no</s></TAGSPEC> </INDXKEY>
</INDEXDEF>


INEX Configuration Example

<!-- ******************************************************************* -->
<!-- ************************* topicshort ******************************** -->
<!-- ******************************************************************* -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROXIMITY NORMAL=STEM>
<INDXNAME> indexes/topicshort.index </INDXNAME>
<INDXTAG> topicshort </INDXTAG>
<INDXMAP><USE> 3192 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<INDXKEY><TAGSPEC>
<FTAG>fm</FTAG><S>tig</S><S>atl</S>
<FTAG>abs</FTAG>
<FTAG>kwd</FTAG>
<FTAG>st</FTAG>
</TAGSPEC> </INDXKEY>
</INDEXDEF>

</INDEXES>


INEX Configuration Example

<COMPONENTS>
<COMPONENTDEF>
<COMPONENTNAME> indexes/COMPONENT_SECTION </COMPONENTNAME>
<COMPONENTNORM>NONE</COMPONENTNORM>
<COMPSTARTTAG><TAGSPEC><FTAG>sec</FTAG></TAGSPEC></COMPSTARTTAG>
<COMPONENTINDEXES>
<!-- First index def -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROXIMITY NORMAL=NONE>
<INDXNAME> indexes/sec_title2.index</INDXNAME>
<INDXTAG> sec_title </INDXTAG>
<!-- the appropriate Z39.50 BIB1 attribute numbers -->
<INDXMAP> <USE> 38 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<!-- The stoplist for this file -->
<STOPLIST> indexes/titlestoplist </STOPLIST>

<!-- The INDXKEY area contains the specifications of tags in the doc -->
<!-- that are to be extracted and indexed for this index -->
<INDXKEY><TAGSPEC><FTAG>sec</FTAG><s>st</s></TAGSPEC> </INDXKEY>
</INDEXDEF>


INEX Configuration Example

<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROXIMITY NORMAL=STEM>
<INDXNAME> indexes/sec_words.index</INDXNAME>
<INDXTAG> sec_words </INDXTAG>

<!-- the appropriate Z39.50 BIB1 attribute numbers -->
<INDXMAP> <USE> 39 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>

<!-- The stoplist for this file -->
<STOPLIST> indexes/topicstoplist </STOPLIST>

<!-- The INDXKEY area contains the specifications of tags in the doc -->
<!-- that are to be extracted and indexed for this index -->
<INDXKEY><TAGSPEC><FTAG>sec</FTAG></TAGSPEC> </INDXKEY>
</INDEXDEF>

</COMPONENTINDEXES>
</COMPONENTDEF>


INEX Configuration Example

<COMPONENTDEF>
<COMPONENTNAME> indexes/COMPONENT_BIB </COMPONENTNAME>
<COMPONENTNORM>NONE</COMPONENTNORM>
<COMPSTARTTAG><TAGSPEC><FTAG>bm</FTAG><S>bib</S><s>bibl</s><s>bb</s></TAGSPEC></COMPSTARTTAG>
<!-- /* no end tag */ -->
<COMPONENTINDEXES>
<!-- First index def -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=NONE>
<INDXNAME> indexes/bib_author.index</INDXNAME>
<INDXTAG> bib_author </INDXTAG>
<!-- The following INDXMAP items provide a mapping from the AUTHOR tag to -->
<!-- the appropriate Z39.50 BIB1 attribute numbers -->
<INDXMAP> <USE> 1000 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>

<!-- The INDXKEY area contains the specifications of tags in the doc -->
<!-- that are to be extracted and indexed for this index -->
<INDXKEY><TAGSPEC><FTAG>au</FTAG> </TAGSPEC> </INDXKEY>
</INDEXDEF>


INEX Configuration Example

<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROXIMITY NORMAL=NONE>
<INDXNAME> indexes/bib_title.index</INDXNAME>
<INDXTAG> bib_title </INDXTAG>

<!-- The following INDXMAP items provide a mapping from the AUTHOR tag to -->
<!-- the appropriate Z39.50 BIB1 attribute numbers -->
<INDXMAP> <USE> 33 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>

<!-- The INDXKEY area contains the specifications of tags in the doc -->
<!-- that are to be extracted and indexed for this index -->
<INDXKEY><TAGSPEC><FTAG>atl</FTAG> </TAGSPEC> </INDXKEY>
</INDEXDEF>


INEX Configuration Example

<INDEXDEF ACCESS=BTREE EXTRACT=DATE NORMAL=YEAR>
<INDXNAME> indexes/bib_date.index</INDXNAME>
<INDXTAG> bib_date </INDXTAG>

<!-- The following INDXMAP items provide a mapping from the AUTHOR tag to -->
<!-- the appropriate Z39.50 BIB1 attribute numbers -->
<INDXMAP> <USE> 31 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>

<!-- The INDXKEY area contains the specifications of tags in the doc -->
<!-- that are to be extracted and indexed for this index -->
<INDXKEY><TAGSPEC><FTAG>pdt</FTAG><s>yr</s> </TAGSPEC> </INDXKEY>
</INDEXDEF>

</COMPONENTINDEXES>
</COMPONENTDEF>


INEX Configuration Example

<COMPONENTDEF>
<COMPONENTNAME> indexes/COMPONENT_PARAS </COMPONENTNAME>
<COMPONENTNORM>NONE</COMPONENTNORM>
<COMPSTARTTAG><TAGSPEC><FTAG>^ilrj$|^ip1$|^ip2$|^ip3$|^ip4$|^ip5$|^item-none$|^p$|^p1$|^p2$|^p3$|^tmath$|^tf$</FTAG></TAGSPEC></COMPSTARTTAG>
<COMPONENTINDEXES>
<!-- First index def -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROXIMITY NORMAL=STEM>
<INDXNAME> indexes/para_words.index</INDXNAME>
<INDXTAG> para_words </INDXTAG>
<!-- the appropriate Z39.50 BIB1 attribute numbers -->
<INDXMAP> <USE> 39 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<!-- The stoplist for this file -->
<STOPLIST> indexes/topicstoplist </STOPLIST>
<!-- The INDXKEY area contains the specifications of tags in the doc -->
<!-- that are to be extracted and indexed for this index -->
<INDXKEY><TAGSPEC><FTAG>.*</FTAG></TAGSPEC> </INDXKEY>
</INDEXDEF>

</COMPONENTINDEXES>
</COMPONENTDEF>
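
The COMPSTARTTAG pattern in the COMPONENT_PARAS definition is an alternation of anchored tag names, so only exact names such as `p` or `ip1` start a paragraph component. The sketch below checks tag names against the same pattern with Python's `re` module. It assumes that Cheshire II's pattern syntax behaves like Python regular expressions for this case, which the slides do not state, and `is_paragraph_component` is a hypothetical helper.

```python
import re

# Pattern copied from the COMPONENT_PARAS definition in the configuration.
START_TAG_PATTERN = re.compile(
    r"^ilrj$|^ip1$|^ip2$|^ip3$|^ip4$|^ip5$|^item-none$"
    r"|^p$|^p1$|^p2$|^p3$|^tmath$|^tf$"
)


def is_paragraph_component(tag_name: str) -> bool:
    """True if the tag name would start a paragraph-level component."""
    return START_TAG_PATTERN.search(tag_name) is not None


for name in ["p", "ip1", "item-none", "sec", "p4", "ip"]:
    print(name, is_paragraph_component(name))
```

Because every alternative is anchored with `^` and `$`, a name such as `p4` or `ip` does not match even though it shares a prefix with listed names.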


INEX Configuration Example

<COMPONENTDEF>
  <COMPONENTNAME> indexes/COMPONENT_FIG </COMPONENTNAME>
  <COMPONENTNORM>NONE</COMPONENTNORM>
  <COMPSTARTTAG>
    <TAGSPEC><FTAG>fig</FTAG></TAGSPEC>
  </COMPSTARTTAG>
  <COMPONENTINDEXES>
    <!-- First index def -->
    <INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=NONE>
      <INDXNAME> indexes/fig_caption.index</INDXNAME>
      <INDXTAG> fig_caption </INDXTAG>
      <!-- the appropriate Z39.50 BIB1 attribute numbers -->
      <INDXMAP> <USE> 38 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>

      <!-- The stoplist for this file -->
      <STOPLIST> indexes/titlestoplist </STOPLIST>

      <!-- The INDXKEY area contains the specifications of tags in the doc -->
      <!-- that are to be extracted and indexed for this index -->
      <INDXKEY><TAGSPEC><FTAG>fgc</FTAG></TAGSPEC> </INDXKEY>
    </INDEXDEF>
  </COMPONENTINDEXES>
</COMPONENTDEF>


INEX Configuration Example

<COMPONENTDEF>
  <COMPONENTNAME> indexes/COMPONENT_VITAE </COMPONENTNAME>
  <COMPONENTNORM>NONE</COMPONENTNORM>
  <COMPSTARTTAG>
    <TAGSPEC><FTAG>vt</FTAG></TAGSPEC>
  </COMPSTARTTAG>
  <COMPONENTINDEXES>
    <!-- First index def -->
    <INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROXIMITY NORMAL=NONE>
      <INDXNAME> indexes/vitae_words.index</INDXNAME>
      <INDXTAG> vt_vitae </INDXTAG>
      <!-- the appropriate Z39.50 BIB1 attribute numbers -->
      <INDXMAP> <USE> 38 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
      <!-- The stoplist for this file -->
      <STOPLIST> indexes/titlestoplist </STOPLIST>

      <!-- The INDXKEY area contains the specifications of tags in the doc -->
      <!-- that are to be extracted and indexed for this index -->
      <INDXKEY><TAGSPEC><FTAG>vt</FTAG></TAGSPEC> </INDXKEY>
    </INDEXDEF>
  </COMPONENTINDEXES>
</COMPONENTDEF>
</COMPONENTS>


INEX Configuration Example

<DISPOPTIONS>KEEP_ENTITIES</DISPOPTIONS>

<DISPLAY>
  <DISPLAYDEF NAME="B" OID="1.2.840.10003.5.105" DEFAULT>
    <convert function="MIXED">
      <clusmap>
        <from> <tagspec> <ftag>doi</ftag> </tagspec></from>
        <to> <tagspec> <ftag>28</ftag> </tagspec></to>
        <from> <tagspec> <ftag>#DOCID#</ftag> </tagspec></from>
        <to> <tagspec> <ftag>5</ftag> </tagspec></to>
        <from> <tagspec> <ftag>#DBNAME#</ftag> </tagspec></from>
        …


INEX Configuration Example

  <DISPLAYDEF name="XML_ELEMENT_" OID="1.2.840.10003.5.109.10">
    <convert function="XML_ELEMENT">
      <clusmap>
        <from> <tagspec> <ftag>#FILENAME#</ftag> </tagspec></from>
        <to> <tagspec> <ftag>FILENAME</ftag> </tagspec></to>
        <from> <tagspec> <ftag>#RANK#</ftag> </tagspec></from>
        <to> <tagspec> <ftag>RANK </ftag> </tagspec></to>
        …
        <from> <tagspec> <ftag>#RAWSCORE#</ftag> </tagspec></from>
        <to> <tagspec> <ftag>RAWSCORE </ftag> </tagspec></to>
        <from> <tagspec> <ftag> SUBST_ELEMENT </ftag> </tagspec></from>
        <to> <tagspec> <ftag> SUBST_ELEMENT </ftag> </tagspec> </to>
      </clusmap>
    </convert>
  </DISPLAYDEF>

</DISPLAY>
</FILEDEF>

</DBCONFIG>


XML Schemas and Element Retrieval


XML Schema Support

• XML Schemas or DTDs can be used to define the data contents

• Tested with a wide variety of schemas including METS (with various supporting schemas)


XML Element Extraction

• A new search “ElementSetName” is XML_ELEMENT_

• Any XPath, element name, or regular expression can be included following the final underscore when submitting a present request (note that only a subset of full XPath is available)

• The matching elements are extracted from the records matching the search and delivered in a simple format.


XML Extraction

% zselect sherlock
372 {Connection with SHERLOCK (sherlock.berkeley.edu) database 'bibfile' at port 2100 is open as connection #372}
% zfind topic mathematics
{OK {Status 1} {Hits 26} {Received 0} {Set Default} {RecordSyntax UNKNOWN}}
% zset recsyntax XML
% zset elementset XML_ELEMENT_Fld245
% zdisplay
{OK {Status 0} {Received 10} {Position 1} {Set Default} {NextPosition 11} {RecordSyntax XML 1.2.840.10003.5.109.10}} {
<RESULT_DATA DOCID="1">
<ITEM XPATH="/USMARC[1]/VarFlds[1]/VarDFlds[1]/Titles[1]/Fld245[1]">
<Fld245 AddEnty="No" NFChars="0"><a>Singularitâes áa Cargáese</a></Fld245>
</ITEM>
</RESULT_DATA> … etc…
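
The extraction step in this session can be imitated outside Cheshire II. The sketch below uses Python's standard `xml.etree.ElementTree` as a stand-in for the server-side element matching; the function name `extract_elements` and the sample record are made up for illustration, and the wrapper only loosely follows the RESULT_DATA/ITEM output in the session (the XPATH attribute is omitted).

```python
import xml.etree.ElementTree as ET


def extract_elements(record_xml: str, docid: int, name: str) -> str:
    """Wrap every element called `name` in a RESULT_DATA/ITEM envelope."""
    root = ET.fromstring(record_xml)
    result = ET.Element("RESULT_DATA", {"DOCID": str(docid)})
    # iter() walks the tree in document order and also visits the root.
    for elem in root.iter(name):
        item = ET.SubElement(result, "ITEM")
        item.append(elem)
    return ET.tostring(result, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical record, loosely shaped like the MARC-derived XML in the session.
    sample = (
        "<USMARC><VarFlds><VarDFlds><Titles>"
        '<Fld245 AddEnty="No" NFChars="0"><a>Example title</a></Fld245>'
        "</Titles></VarDFlds></VarFlds></USMARC>"
    )
    print(extract_elements(sample, 1, "Fld245"))
```

Matching by element name is the simplest of the three selector forms the slides mention; path or regular-expression selectors would need a different matching step.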


Database Storage

• All data stored as SGML/XML flat text files plus optional linked full-text (non-XML) files

• File format is defined through an SGML/XML DTD (also a flat text file) or Schema

• “Associator” files provide indexed direct access to each record in SGML/XML files.
  – Contain offset and record length for each “record”
  – Associators can be built to index any conformant document in a directory sub-tree
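
The offset-and-length idea behind an associator can be sketched in a few lines of Python. This is not the Cheshire II file format: the `<REC>` delimiter, the 12-byte packed entry layout, and the function names are all assumptions chosen to keep the example small.

```python
import io
import struct

RECORD_START = b"<REC>"   # hypothetical record delimiters for this sketch
RECORD_END = b"</REC>"
ENTRY = "<QI"             # assumed layout: 8-byte offset, 4-byte length


def build_associator(data: bytes) -> bytes:
    """Return packed (offset, length) entries, one per record in `data`."""
    entries = []
    pos = 0
    while True:
        start = data.find(RECORD_START, pos)
        if start < 0:
            break
        end = data.find(RECORD_END, start)
        if end < 0:
            break  # unterminated record: stop indexing
        end += len(RECORD_END)
        entries.append(struct.pack(ENTRY, start, end - start))
        pos = end
    return b"".join(entries)


def fetch_record(datafile, associator: bytes, recno: int) -> bytes:
    """Read record `recno` (0-based) with one seek and one read."""
    size = struct.calcsize(ENTRY)
    offset, length = struct.unpack_from(ENTRY, associator, recno * size)
    datafile.seek(offset)
    return datafile.read(length)


if __name__ == "__main__":
    data = b"<REC>a</REC>\n<REC>bb</REC>\n"
    assoc = build_associator(data)
    print(fetch_record(io.BytesIO(data), assoc, 1))
```

Fixed-size entries are what make direct access cheap: the position of entry `recno` is computed, not searched for.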


INEX CO Runs

• Three official, one later run - all Title-only
  – Fusion - Combines Okapi and LR using the MERGE_CMBZ operator
  – NewParms (LR) - Using only LR with the new parameters
  – Feedback - An attempt at blind relevance feedback
  – PostFusion - Fusion of the new LR coefficients and Okapi


Query Generation - CO

• # 162 TITLE = Text and Index Compression Algorithms

• QUERY: (topicshort @+ {Text and Index Compression Algorithms}) !MERGE_CMBZ (alltitles @+ {Text and Index Compression Algorithms}) !MERGE_CMBZ (topicshort @ {Text and Index Compression Algorithms}) !MERGE_CMBZ (alltitles @ {Text and Index Compression Algorithms})

• @+ is Okapi, @ is LR
• !MERGE_CMBZ is a normalized score summation and enhancement
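
A rough Python sketch of a "normalized score summation and enhancement" merge follows. The slides do not spell out the MERGE_CMBZ arithmetic, so two choices here are assumptions: scores are min-max normalized per result set, and the "enhancement" is modeled in the CombMNZ style by multiplying the summed score by the number of result sets that contain the item. The function names are hypothetical.

```python
from collections import defaultdict


def normalize(scores: dict) -> dict:
    """Min-max normalize one result set's scores to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def fuse(*result_sets: dict) -> list:
    """Sum normalized scores and boost items found in several result sets."""
    total = defaultdict(float)
    hits = defaultdict(int)
    for scores in result_sets:
        for doc, s in normalize(scores).items():
            total[doc] += s
            hits[doc] += 1
    # Assumed enhancement: multiply by the number of sets containing the item.
    fused = {doc: total[doc] * hits[doc] for doc in total}
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    okapi = {"a": 12.0, "b": 7.0, "c": 2.0}   # made-up Okapi scores
    lr = {"b": 0.9, "c": 0.4, "d": 0.1}       # made-up LR probabilities
    for doc, score in fuse(okapi, lr):
        print(doc, round(score, 3))
```

Normalizing first matters because Okapi scores and LR probabilities live on different scales; without it, the Okapi values would dominate the sum.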


INEX CO Runs

Generalized Strict

Avg Prec
FUSION = 0.0642
NEWPARMS = 0.0582
FDBK = 0.0415
POSTFUS = 0.0690

Avg Prec
FUSION = 0.0923
NEWPARMS = 0.0853
FDBK = 0.0390
POSTFUS = 0.0952


INEX VCAS Runs

• Two official runs
  – FUSVCAS - Element fusion using LR and various operators for path restriction
  – NEWVCAS - Using the new LR coefficients for each appropriate index and various operators for path restriction


Query Generation - VCAS

• #66 TITLE = //article[about(., intelligent transport systems)]//sec[about(., on-board route planning navigation system for automobiles)]

• Submitted query = ((topic @ {intelligent transport systems})) !RESTRICT_FROM ((sec_words @ {on-board route planning navigation system for automobiles}))

• Target elements: sec|ss1|ss2|ss3
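
One plausible reading of the path restriction in this query is that section-level hits are kept only when their parent article also matched the article-level search. The Python sketch below illustrates that reading only; the exact semantics of !RESTRICT_FROM are not given on the slide, and the function name, tuple layout, and XPaths are hypothetical.

```python
def restrict_from(article_hits: set, section_hits: list) -> list:
    """Keep section components whose parent article is in article_hits.

    article_hits: document ids matched by the article-level search.
    section_hits: (doc_id, xpath, score) tuples, already ranked.
    """
    # The ranking order of the surviving section hits is left unchanged.
    return [hit for hit in section_hits if hit[0] in article_hits]


if __name__ == "__main__":
    articles = {"d1", "d3"}
    sections = [
        ("d1", "/article[1]/bdy[1]/sec[2]", 0.71),
        ("d2", "/article[1]/bdy[1]/sec[1]", 0.64),
        ("d3", "/article[1]/bdy[1]/sec[4]", 0.22),
    ]
    for hit in restrict_from(articles, sections):
        print(hit)
```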


VCAS Results

Generalized Strict

Avg Prec
FUSVCAS = 0.0321
NEWVCAS = 0.0270

Avg Prec
FUSVCAS = 0.0601
NEWVCAS = 0.0569


Heterogeneous Track

• Approach using Cheshire’s Virtual Database options
  – Primarily a version of distributed IR
  – Each collection indexed separately
  – Search via Z39.50 distributed queries
  – Z39.50 Attribute mapping used to map query indexes to appropriate elements in a given collection
  – Only LR used and collection results merged using probability of relevance for each collection result
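
Merging per-collection results by probability of relevance can be sketched as a k-way merge of already ranked lists. The collection names, record ids, and probabilities below are invented, and the sketch assumes the probabilities from different collections are directly comparable, which is the premise of using LR alone.

```python
import heapq
from itertools import islice


def merge_by_probability(per_collection: dict, limit: int = 10) -> list:
    """Merge ranked per-collection hit lists into one list by probability.

    per_collection maps a collection name to a list of (record_id, probability)
    pairs, each list sorted by descending probability.
    """
    streams = []
    for name, hits in per_collection.items():
        # Negate probabilities so that ascending merge order means best first.
        streams.append([(-p, name, rec) for rec, p in hits])
    merged = heapq.merge(*streams)
    return [(name, rec, -negp) for negp, name, rec in islice(merged, limit)]


if __name__ == "__main__":
    results = {
        "collection_a": [("r1", 0.82), ("r2", 0.40)],
        "collection_b": [("x9", 0.65)],
    }
    for row in merge_by_probability(results):
        print(row)
```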


Heterogeneous Track Issues

• Very large “Documents”
  – Our approach was to segment

• Reporting XPath after segmenting large documents


Database Storage

[Diagram of database files: Config File, DTD File, SGML/XML File, Associator Files, Index Files, Postings File, Prox data File, Page Data File, Cluster File, History File, Remote RDBMS]


Client/Server Architecture

• Server Supports:
  – Database storage
  – Indexing
  – Z39.50 access to local data
  – Boolean and Probabilistic Searching
  – Relevance Feedback
  – External SQL database support

• Client Supports:
  – Programmable (Tcl/Tk – Python soon) Graphical User Interface
  – Z39.50 access to remote servers
  – SGML & MARC formatting

• Combined Client/Server CGI scripting via WebCheshire


Z39.50 Overview

[Diagram: UI components, Map Query and Map Results stages, the Internet, and a Search Engine]


Two Protocols: HTTP & Z39.50

SYSTEM BEHAVIOR                     HTTP     Z39.50
State maintenance                   client   server
Sessions                            no       yes
Policies adaptable to link speed    no       yes
Synch/asynch                        synch    both
Fixed/negotiated protocol           fixed    neg
Fixed/negotiated doc formats        none     neg
Standardized Metadata               no       yes


Server Z39.50 Support

• Locally developed Z39.50 Library

• Extended version 3 support
  – Supports version 3 attributes in BIB-1 including “stem”, “relevance”, etc. Also adding support for “type 102” ranked queries (version 4)

• Can provide MARC, SUTRS and SGML records, with support for Explain and GRS-1 conversion of any SGML records


Distributed Search


The Problem

• The Digital Library vision -- Access to everyone for “all human knowledge”
• Lyman and Varian’s estimates of the “Dark Web”
• Hundreds or Thousands of servers with databases ranging widely in content, topic, format
  – Broadcast search is expensive in terms of bandwidth and in processing too many irrelevant results
  – How to select the “best” ones to search?
    • Which resource to search first?
    • Which to search next if more is wanted?
  – Topical/domain constraints on the search selections
  – Variable contents of database (metadata only, full text, multimedia…)


Distributed Search Tasks

• Resource Description
  – How to collect metadata about digital libraries and their collections or databases

• Resource Selection
  – How to select relevant digital library collections or databases from a large number of databases

• Distributed Search
  – How to perform parallel or sequential searching over the selected digital library databases

• Data Fusion
  – How to merge query results from different digital libraries with their different search engines, differing record structures, etc.


An Approach for Distributed Resource Discovery

• Distributed resource representation and discovery
  – New approach to building resource descriptions based on Z39.50
  – Instead of using broadcast search across resources we are using two Z39.50 Services
    • Identification of database metadata using Z39.50 Explain
    • Extraction of distributed indexes using Z39.50 SCAN

• Evaluation
  – How efficiently can we build distributed indexes?
  – How effectively can we choose databases using the index?
  – How effective is merging search results from multiple sources?
  – Can we build hierarchies of servers (general/meta-topical/individual)?


Z39.50 Explain

• Explain supports searches for
  – Server-Level metadata
    • Server Name
    • IP Addresses
    • Ports
  – Database-Level metadata
    • Database name
    • Search attributes (indexes and combinations)
  – Support metadata (record syntaxes, etc)


Z39.50 SCAN

• Originally intended to support Browsing
• Query for
  – Database
  – Attributes plus Term (i.e., index and start point)
  – Step Size
  – Number of terms to retrieve
  – Position in Response set

• Results
  – Number of terms returned
  – List of Terms and their frequency in the database (for the given attribute combination)


Z39.50 SCAN Results

% zscan title cat 1 20 1
{SCAN {Status 0}{Terms 20}{StepSize 1}{Position 1}}{cat 27}{cat-fight 1}{catalan 19}{catalogu 37}{catalonia 8}{catalyt 2}{catania 1}{cataract 1}{catch 173}{catch-all 3}{catch-up 2} …

% zscan topic cat 1 20 1
{SCAN {Status 0}{Terms 20}{StepSize 1}{Position 1}}{cat 706}{cat-and-mouse 19}{cat-burglar 1}{cat-carrying 1}{cat-egory 1}{cat-fight 1}{cat-gut 1}{cat-litter 1}{cat-lovers 2}{cat-pee 1}{cat-run 1}{cat-scanners 1} …

Syntax: zscan indexname1 term stepsize number_of_terms pref_pos
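
A SCAN response in the brace-delimited form printed by the client can be turned into a term-frequency table with a short parser. The Python sketch below is an illustration only: it assumes every term is a single token without spaces, as in the two responses on this slide, and the function name `parse_scan` is made up.

```python
import re

# Leading status group, e.g. {SCAN {Status 0}{Terms 20}{StepSize 1}{Position 1}}
HEADER = re.compile(r"^\s*\{SCAN\s+(?:\{[^{}]*\}\s*)*\}")
# One {term frequency} pair; assumes the term contains no whitespace.
PAIR = re.compile(r"\{(\S+) (\d+)\}")


def parse_scan(response: str) -> dict:
    """Parse a zscan result string into a {term: frequency} dictionary."""
    body = HEADER.sub("", response)
    return {term: int(freq) for term, freq in PAIR.findall(body)}


if __name__ == "__main__":
    reply = (
        "{SCAN {Status 0}{Terms 20}{StepSize 1}{Position 1}}"
        "{cat 27}{cat-fight 1}{catalan 19}"
    )
    print(parse_scan(reply))
```

Stripping the status group first keeps entries such as `{Status 0}` from being mistaken for index terms.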


Resource Index Creation

• For all servers, or a topical subset…
  – Get Explain information
  – For each index
    • Use SCAN to extract terms and frequency
    • Add term + freq + source index + database metadata to the XML “Collection Document” for the resource
  – Planned extensions:
    • Post-Process indexes (especially Geo Names, etc) for special types of data
      – e.g. create “geographical coverage” indexes
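
The harvesting loop above ends in an XML "Collection Document" per resource. The slides do not show that document's schema, so the Python sketch below invents one (`collection`, `index`, and `term` elements with a `freq` attribute) purely to illustrate how term, frequency, source index, and database metadata could be recorded together.

```python
import xml.etree.ElementTree as ET


def build_collection_document(db_name: str, host: str, scans: dict) -> str:
    """Serialize harvested SCAN output as one XML collection document.

    scans maps an index name to a {term: frequency} dictionary.
    All element and attribute names here are hypothetical.
    """
    root = ET.Element("collection", {"database": db_name, "host": host})
    for index_name, terms in scans.items():
        index_el = ET.SubElement(root, "index", {"name": index_name})
        for term, freq in sorted(terms.items()):
            term_el = ET.SubElement(index_el, "term", {"freq": str(freq)})
            term_el.text = term
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    harvested = {"title": {"cat": 27, "catalan": 19}, "topic": {"cat": 706}}
    print(build_collection_document("bibfile", "sherlock.berkeley.edu", harvested))
```

Keeping the source index on each group of terms lets the collection document itself be indexed per field, so a title query can be matched against title terms only.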


MetaSearch Approach

[Diagram: MetaSearch Server (Map Explain and Scan Queries, Map Results, Distributed Index, Search Engine, Db 5, Db 6) connected over the Internet to remote servers, each with Map Query, Map Results and a Search Engine (DB 1, DB 2, DB 3, DB 4)]


Known Issues and Problems

• Not all Z39.50 Servers support SCAN or Explain
• Solutions that appear to work well:
  – Probing for attributes instead of explain (e.g. DC attributes or analogs)
  – We also support OAI and can extract OAI metadata for servers that support OAI
  – Query-based sampling (Callan)

• Collection Documents are static and need to be replaced when the associated collection changes


Evaluation

• Test Environment
  – TREC Tipster data (approx. 3 GB)
  – Partitioned into 236 smaller collections based on source and date by month (no DOE)
    • High size variability (from 1 to thousands of records)
    • Same database as used in other distributed search studies by J. French and J. Callan among others
  – Used TREC topics 51-150 for evaluation (these are the only topics with relevance judgements for all 3 TIPSTER disks)


Harvesting Efficiency

• Tested using the databases on the previous slide + the full FT database (210,158 records ~ 600 Mb)

• Average of 23.07 seconds per database to SCAN each database (3.4 indexes on average) and create a collection representative, over the network

• Average of 14.07 seconds
• Also tested larger databases (e.g. the TREC FT database, ~600 Mb with 7 indexes, was harvested in 131 seconds)


Our Collection Ranking Approach

• We attempt to estimate the probability of relevance for a given collection with respect to a query using the Logistic Regression method developed at Berkeley (W. Cooper, F. Gey, D. Dabney, A. Chen) with a new algorithm for weight calculation at retrieval time

• Estimates from multiple extracted indexes are combined to provide an overall ranking score for a given resource (i.e., fusion of multiple query results)
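The slide does not say which rule is used to combine the per-index estimates. As a minimal sketch, assuming a plain sum of the estimates (a CombSUM-style fusion chosen only for illustration), the combination step could look like this. The index names, collection names, and scores are hypothetical.

```python
from collections import defaultdict


def fuse_index_scores(per_index_scores):
    """Sum per-index estimates into one score per collection (CombSUM-style)."""
    totals = defaultdict(float)
    for scores in per_index_scores.values():
        for collection, estimate in scores.items():
            totals[collection] += estimate
    # Highest combined score first; ties are broken by collection name.
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical estimates for three collections from two extracted indexes.
example = {
    "title": {"FT": 0.50, "WSJ": 0.25},
    "subject": {"FT": 0.25, "AP": 0.125},
}
ranking = fuse_index_scores(example)
```

A collection that matches in several indexes accumulates a higher score than one that matches in a single index, which is the usual motivation for this kind of fusion.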

Probabilistic Retrieval: Logistic Regression

Probability of relevance for a given index is based on logistic regression from a sample set of documents to determine values of the coefficients (TREC). At retrieval the probability estimate is obtained by:

$$P(R \mid Q, C) = c_0 + \sum_{i=1}^{6} c_i X_i$$
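The retrieval-time estimate is a linear combination of six attributes with fitted coefficients. In the published Berkeley logistic regression work that linear combination is usually written as the log odds of relevance, which is then mapped to a probability with the logistic function. The sketch below follows that reading, which is an assumption about this slide. The coefficient and attribute values are made up and are not the fitted TREC values.

```python
import math


def relevance_probability(coefficients, features):
    """Compute c0 + sum(c_i * X_i), treat it as log odds, return a probability."""
    c0, *cs = coefficients
    if len(cs) != len(features):
        raise ValueError("need one coefficient per feature plus an intercept")
    log_odds = c0 + sum(c * x for c, x in zip(cs, features))
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical coefficients (c0..c6) and attribute values (X1..X6).
coeffs = [-3.5, 1.0, -0.1, 0.2, -0.05, 0.3, 0.1]
feats = [0.7, 1.7, 2.3, 12.0, 1.1, 1.1]
p = relevance_probability(coeffs, feats)
```

Because the logistic function is monotonic, ranking collections by the log odds or by the resulting probability gives the same order.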

Probabilistic Retrieval: Logistic Regression attributes

• $X_1 = \frac{1}{M} \sum_{j=1}^{M} \log QAF_{t_j}$ : Average Absolute Query Frequency

• $X_2 = \sqrt{QL}$ : Query Length

• $X_3 = \frac{1}{M} \sum_{j=1}^{M} \log CAF_{t_j}$ : Average Absolute Collection Frequency

• $X_4 = \sqrt{CL / 10}$ : Collection size estimate

• $X_5 = \frac{1}{M} \sum_{j=1}^{M} \log ICF_{t_j}$ : Average Inverse Collection Frequency

• $ICF_{t_j} = \frac{N - n_{t_j}}{n_{t_j}}$ : Inverse Document Frequency (N = Number of collections)

• $X_6 = \log M$

M = Number of Terms in common between query and document
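A minimal sketch of computing the six attributes, assuming they are: X1 = mean log query frequency of the matching terms, X2 = sqrt(QL), X3 = mean log collection frequency, X4 = sqrt(CL/10), X5 = mean log of (N - n_t)/n_t, and X6 = log M. The function name, argument names, and sample values are hypothetical, and n_t is taken here to be the number of collections containing a term.

```python
import math


def lr_features(qaf, caf, n_t, query_length, collection_length, n_collections):
    """Return [X1..X6] for the M terms shared by the query and the collection.

    qaf, caf and n_t are dicts keyed by term: query frequency, collection
    frequency, and the number of collections that contain the term.
    """
    terms = [t for t in qaf if t in caf]  # the M matching terms
    m = len(terms)
    if m == 0:
        return None  # nothing in common, so there is nothing to estimate
    x1 = sum(math.log(qaf[t]) for t in terms) / m
    x2 = math.sqrt(query_length)
    x3 = sum(math.log(caf[t]) for t in terms) / m
    x4 = math.sqrt(collection_length / 10)
    # Undefined when a term occurs in every collection (n_t == N).
    x5 = sum(math.log((n_collections - n_t[t]) / n_t[t]) for t in terms) / m
    x6 = math.log(m)
    return [x1, x2, x3, x4, x5, x6]


# Hypothetical statistics for a two-term match.
x = lr_features(
    qaf={"grid": 1, "retrieval": 2},
    caf={"grid": 10, "retrieval": 100, "xml": 5},
    n_t={"grid": 2, "retrieval": 4},
    query_length=4,
    collection_length=1000,
    n_collections=6,
)
```

The returned list can be combined with fitted coefficients to produce the ranking score for one index of one collection.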

Evaluation

• Effectiveness

– Tested using the collection representatives described above (as harvested from over the network) and the TIPSTER relevance judgements

– Tested by comparing our approach to known algorithms for ranking collections

– Results were measured against reported results for the Ideal and CORI algorithms and against the optimal “Relevance Based Ranking” (MAX)

– Recall analog (how many of the relevant documents occurred in the top n databases, averaged)
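The recall analog can be sketched as follows, assuming it is the fraction of a query's relevant documents that are held by the top n ranked collections, averaged over queries. The exact normalization used in the reported results is not stated on the slide, so this is one plausible reading. Collection names and counts are hypothetical.

```python
def recall_at_n(ranked_collections, relevant_per_collection, n):
    """Fraction of a query's relevant documents held by the top-n collections."""
    total = sum(relevant_per_collection.values())
    if total == 0:
        return 0.0
    found = sum(relevant_per_collection.get(c, 0) for c in ranked_collections[:n])
    return found / total


def mean_recall_at_n(runs, n):
    """Average recall_at_n over (ranking, relevant-count) pairs, one per query."""
    return sum(recall_at_n(ranking, rel, n) for ranking, rel in runs) / len(runs)


# Two hypothetical queries: a collection ranking and relevant-document counts.
runs = [
    (["A", "B", "C"], {"A": 6, "B": 2, "C": 2}),
    (["B", "A", "C"], {"A": 5, "C": 5}),
]
r1 = mean_recall_at_n(runs, 1)
r2 = mean_recall_at_n(runs, 2)
```

Plotting this value against n gives a curve that can be compared with the curve for an optimal relevance-based ranking of the same collections.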

Titles only (short query)

Future

• Logically clustering servers by topic

• Meta-Meta Servers (treating the MetaSearch database as just another database)

Distributed Metadata Servers

[Diagram labels: Replicated Servers, Meta-Topical Servers, General Servers, Database Servers]

Geographic Operators and Search Ranking

The GEO Operations

• Operators established for the GEO Z39.50 profile

• Implemented using special operations on indexes

• Indexing allows extraction of geographic coordinates and dates from SGML/XML data in a variety of formats

• Normalized internal representation in indexes

• Search using geographic and time elements as primary or limiting search elements

The GEO Operations

• X-based interfaces permit (simple) map drawing and search

• Interface to MapServer for web-based map searching

GEO Geographic operators

Operator   Name               Meaning
>=<        Overlap            Search region and data overlap
>#<        Fully Enclosed     Data fully enclosed in search region
<#>        Encloses           Data fully encloses search region
<>#        Fully Outside      Data outside of search region
++         Near               Data is near search region
:<:        Before             Data date is before search date
:<=:       Before or During   Data date is before or during search date
:>=:       During or After    Data date is during or after search date
:>:        After              Data date is after search date
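The four region operators can be illustrated with simple bounding-box tests. This is a sketch under stated assumptions and not the Cheshire implementation: regions are axis-aligned boxes that do not cross the antimeridian, touching boundaries count as overlap, and the class, function names, and coordinates are hypothetical. The Near operator is omitted because the slide does not define a distance threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in degrees (west <= east, south <= north)."""
    west: float
    south: float
    east: float
    north: float


def overlaps(search, data):
    """>=< : the search region and the data region share at least one point."""
    return (data.west <= search.east and data.east >= search.west
            and data.south <= search.north and data.north >= search.south)


def fully_enclosed(search, data):
    """>#< : the data region lies entirely inside the search region."""
    return (search.west <= data.west and data.east <= search.east
            and search.south <= data.south and data.north <= search.north)


def encloses(search, data):
    """<#> : the data region entirely contains the search region."""
    return fully_enclosed(data, search)


def fully_outside(search, data):
    """<># : the data region and the search region do not intersect."""
    return not overlaps(search, data)


# Rough, illustrative coordinates only.
california = Box(-124.4, 32.5, -114.1, 42.0)
bay_area = Box(-123.0, 37.0, -121.5, 38.5)
nevada = Box(-120.0, 35.0, -114.0, 42.0)
```

With these boxes, a search on `california` finds `bay_area` as fully enclosed, while a search on `bay_area` finds `california` through the Encloses operator.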

Overlaps search

Fully Enclosed Search

Map-Based Search

GeoSearch Web Interface

MySQL and PostgreSQL

RDBMS Support

• There are two reasons for RDBMS support

– IR systems are not meant for LOTS of update transactions

– Some applications need to have access to both relational data and text data via Z39.50

• Both MySQL and PostgreSQL are popular open source RDBMSs, and either can now be used via Cheshire

– Z39.50 mappings to RDBMS columns

– “ZQL” submission of SQL as Z39.50 Type 0 query
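The idea of mapping Z39.50 search attributes to RDBMS columns can be sketched as below. This is not Cheshire's configuration syntax: the mapping table, the `books` table, and the function name are hypothetical, and Python's built-in `sqlite3` stands in for MySQL or PostgreSQL so the example is self-contained. The Bib-1 Use attribute values 4 (Title) and 1003 (Author) are the standard ones.

```python
import sqlite3

# Hypothetical mapping from Bib-1 Use attribute values to columns of "books".
USE_TO_COLUMN = {4: "title", 1003: "author"}


def search_by_use_attribute(conn, use_attribute, term):
    """Translate one (Use attribute, term) pair into a parameterized SQL query."""
    column = USE_TO_COLUMN[use_attribute]  # KeyError for unmapped attributes
    # The column name comes from the fixed mapping; the term is bound as a parameter.
    sql = f"SELECT id, title, author FROM books WHERE {column} LIKE ? ORDER BY id"
    return conn.execute(sql, (f"%{term}%",)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT, author TEXT)")
conn.executemany(
    "INSERT INTO books (title, author) VALUES (?, ?)",
    [
        ("Alice's Adventures in Wonderland", "Carroll, Lewis"),
        ("Information Retrieval", "van Rijsbergen, C. J."),
    ],
)
rows = search_by_use_attribute(conn, 1003, "Carroll")
```

A Type 0 (“ZQL”) query skips this translation step: the SQL text itself travels as the query and is handed to the RDBMS.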

Protocol Support

Protocols

• In Cheshire II most protocols (except Z39.50) are implemented using scripting

• Example scripts to support the following are included in the distribution

– OAI

– SRW (Python version)

– SOAP

– SDLIP
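To give a sense of what a script-level protocol layer has to handle, here is a minimal sketch of the request shape for OAI (OAI-PMH 2.0). It builds a request URL from the client side using only the standard `verb` and `metadataPrefix` arguments. The base URL and the function name are hypothetical, and this is not one of the scripts shipped with the distribution.

```python
from urllib.parse import urlencode


def oai_request_url(base_url, verb, **arguments):
    """Build an OAI-PMH request URL; argument names follow the OAI-PMH 2.0 spec."""
    params = {"verb": verb}
    params.update({k: v for k, v in arguments.items() if v is not None})
    return f"{base_url}?{urlencode(params)}"


# Hypothetical repository address; "oai_dc" is the mandatory Dublin Core format.
url = oai_request_url("http://example.org/oai", "ListRecords", metadataPrefix="oai_dc")
```

A server-side script receives these key/value pairs, runs the corresponding search or scan, and wraps the records in the OAI-PMH XML response envelope.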

Cheshire III Design and Development

Cheshire III Goals

• Retain or reproduce (and refine) all Cheshire II features

– “Spring cleaning” of code base

– Add full Unicode support

– Store most system and content data in the database

• Permit easy and efficient integration in Web Services

• Use threaded server for economy of resource usage

• Enhanced multiprotocol support

• Support for distributed processing (i.e., GRID clusters)

• Enhance expandability and “drop in” functionality

• Interfaces and/or APIs for Java, Python, C/C++

Cheshire II Design Overview

[Diagram labels: XML DOCS, XML DIRECTORY, INDEX CLUSTER, INDEX CHESHIRE, CONT, BUILD ASSOC, ZSERVER, CONFIG, COMPONENT DEFINITION, INDEX(S), ASSOC, CLUSTER EXTENSION]

Cheshire III Server Overview

[Diagram of the Cheshire III SERVER. Labels: API, Indexing, XSLT Record Transforms, Search, Protocol Handler, DB API, Remote Systems (any protocol), XML Config & Metadata Info, Indexes, Local DB, Staff UI, Config, Network, Result Sets, Scan, User Info, Config & Control, Access Info, Authentication, Clustering, Native calls, Z39.50, SOAP, OAI, JDBC, Fetch ID, Put ID, OpenURL, Apache Interface, Server Control, UDDI, WSRP, SRW, Normalization, Client, User/Clients, OGIS]

Retain Features

• The intent is to permit all of the types of indexing, searching and record formatting available now, while making it easier to add new capabilities

• The new system will also support full UNICODE for content and for metadata

• Store metadata and content in the database (including config information, etc.)

Permit easy integration of Web Services

• The assumption is that the web server will be the central server mechanism in the future.

• The new design relies on the session handling, threading and load management tools available in Apache (2.0.40+)

• The Cheshire server is dynamically loaded as part of the Web Server

Multiprotocol Support

• The Web server handles the network issues and passes requests in various protocols along to the Cheshire Server.

• Individual Protocol “plugins” and the Protocol Handler convert search, display, and metadata requests in a particular protocol to the internal Cheshire III control language, and convert outgoing messages and data to the appropriate protocol form
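The plugin arrangement described above can be sketched as a small registry: each plugin converts one wire protocol to and from an internal request form, and the handler dispatches by protocol name. Everything here is hypothetical, including the class names, the toy "kv" protocol, and the use of a dict as a stand-in for the internal control language. It illustrates the pattern and is not the Cheshire III API.

```python
from abc import ABC, abstractmethod


class ProtocolPlugin(ABC):
    """Converts one external protocol to and from an internal request form."""

    name = ""

    @abstractmethod
    def to_internal(self, raw_request):
        """Parse a protocol-specific request into the internal form."""

    @abstractmethod
    def from_internal(self, result):
        """Format an internal result in the protocol's own form."""


class ProtocolHandler:
    """Dispatches requests to the plugin registered for their protocol."""

    def __init__(self):
        self._plugins = {}

    def register(self, plugin):
        self._plugins[plugin.name] = plugin

    def handle(self, protocol, raw_request, server):
        plugin = self._plugins[protocol]
        internal = plugin.to_internal(raw_request)
        return plugin.from_internal(server(internal))


class KeyValuePlugin(ProtocolPlugin):
    """Toy protocol: 'op=search;query=cat' in, 'hits=N' out."""

    name = "kv"

    def to_internal(self, raw_request):
        return dict(part.split("=", 1) for part in raw_request.split(";"))

    def from_internal(self, result):
        return f"hits={len(result)}"


def toy_server(internal):
    """Stand-in for the search engine: substring match over three documents."""
    docs = ["cheshire cat", "cat flap", "grid cluster"]
    return [d for d in docs if internal["query"] in d]


handler = ProtocolHandler()
handler.register(KeyValuePlugin())
reply = handler.handle("kv", "op=search;query=cat", toy_server)
```

Supporting another protocol then means registering one more plugin; the search code behind the handler does not change.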

Distributed & GRID Processing

• The server will support protocols for interchange of partial results and collection statistics with a single “Master” controlling the actions of a large number of “Slave” servers

• These will run in parallel in a GRID environment

• This is still “research” but will probably use “Storage Grid” technology from SDSC with our own applications

• Non-Grid use of the same protocols, etc. will be possible (but definitely slower)
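A minimal sketch of the controlling-server idea: one coordinator sends the query to several workers in parallel and merges their partial result lists into one ranking. The workers here are plain Python callables with a toy term-count score, and all names are hypothetical. A real deployment would also need the shared collection statistics mentioned above so that scores from different servers are comparable.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def search_all(workers, query, k):
    """Send the query to every worker in parallel and merge the top-k hits."""
    with ThreadPoolExecutor(max_workers=len(workers)) as pool:
        partials = list(pool.map(lambda worker: worker(query, k), workers))
    # Each partial result is a list of (score, doc_id); keep the k best overall.
    return heapq.nlargest(k, (hit for part in partials for hit in part))


def make_worker(docs):
    """Build a toy worker that scores documents by raw query-term count."""
    def worker(query, k):
        hits = [(text.split().count(query), doc_id) for doc_id, text in docs.items()]
        return heapq.nlargest(k, [h for h in hits if h[0] > 0])
    return worker


workers = [
    make_worker({"a1": "grid grid storage", "a2": "xml index"}),
    make_worker({"b1": "grid cluster", "b2": "grid grid grid"}),
]
top = search_all(workers, "grid", 2)
```

Each worker returns only its own top k, so the coordinator merges at most k hits per worker regardless of collection size.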

Enhanced Expandability

• Clearly defined APIs for interacting with the server will permit easy addition of new functionality, and replacement or upgrade of existing functionality

• Interactive user interface for database configuration and setup

– We want to make it easier for a user/administrator to create and manage the database

Multilingual APIs

• The system is being developed in a multilingual environment.

• We will include the ability to interface with (at a minimum) Java, Python and C/C++ applications.

• APIs for developing new functions will be available in these languages as well

Development

• Work is currently under way here (RRL) and (primarily) in the UK

• We have incomplete (Alpha) versions of the system, but haven’t been distributing them in their current form (they are changing constantly)

• First release version is expected in mid-’04

Further Information

• Full Cheshire II client and server is open source and available for academic and government use: ftp://cheshire.berkeley.edu/pub/cheshire/

– Includes HTML documentation

• Project Web Site http://cheshire.berkeley.edu/

• Archives Hub http://www.archiveshub.ac.uk/