Cheshire II: Features and Internals and Cheshire III Overview

Ray R. Larson
School of Information Management and Systems
University of California, Berkeley

March 2, 2004


Overview

• Cheshire II feature overview
– Logistic Regression Ranking, Okapi BM-25 and Boolean Operations
– Fusion Operators
• Additions from INEX ‘03
– Element/Index level re-estimation of LR coefficients
• Adhoc and Heterogeneous Track Methodology
• Evaluation Results - Adhoc


Overview of Cheshire II

• It supports SGML and XML with components and component indexes
• It is a client/server application
• Uses the Z39.50 Information Retrieval Protocol; support for SRW, OAI, SOAP and SDLIP is also implemented
• Server supports a Relational Database Gateway
• Supports Boolean searching of all servers
• Supports probabilistic ranked retrieval in the Cheshire search engine as well as Boolean and proximity search
• Search engine supports “nearest neighbor” searches and relevance feedback
• GUI interface on X window displays and Windows NT
• WWW/CGI forms interface for DL, using combined client/server CGI scripting via WebCheshire
• Scriptable clients using Tcl and Python
• Store SGML/XML as files or “Datastore” database


Cheshire II Searching

[Diagram: local and remote searching over the Internet via Z39.50. Diagram labels: Images, Scanned Text, Local, Remote, Z39.50.]


INEX Overview

[Diagram labels: Local, Net, UI or Scripts, Map Query, Map Results, INEX Search Engine.]


Boolean Search Capability

• All Boolean operations are supported
– “zfind author x and (title y or subject z) not subject A”
• Named sets are supported and stored on the server
• Boolean operations between stored sets are supported
– “zfind SET1 and subject widgets or SET2”
• Nested parentheses and truncation are supported
– “zfind xtitle Alice#”


Probabilistic Retrieval

• Uses the Logistic Regression ranking method developed at Berkeley (W. Cooper, F. Gey, D. Dabney, A. Chen) with a new algorithm for weight calculation at retrieval time
• Z39.50 “relevance” operator used to indicate probabilistic search
• Any index can have probabilistic searching performed:
– zfind topic @ “cheshire cats, looking glasses, march hares and other such things”
– zfind title @ caucus races
• Boolean and Probabilistic elements can be combined:
– zfind topic @ government documents and title guidebooks


Probabilistic Retrieval: Logistic Regression

Probability of relevance is based on logistic regression from a sample set of documents to determine values of the coefficients. At retrieval the probability estimate is obtained by:

$$P(R \mid Q, D) = b_0 + \sum_{i=1}^{6} b_i X_i$$

for the 6 X attribute measures shown on the next slide.



Probabilistic Retrieval: Logistic Regression attributes

$$X_1 = \frac{1}{M} \sum_{j=1}^{M} \log QAF_{t_j}$$
Average Absolute Query Frequency

$$X_2 = \sqrt{QL}$$
Query Length

$$X_3 = \frac{1}{M} \sum_{j=1}^{M} \log DAF_{t_j}$$
Average Absolute Component Frequency

$$X_4 = \sqrt{DL}$$
Document Length

$$X_5 = \frac{1}{M} \sum_{j=1}^{M} \log IDF_{t_j}$$
Average Inverse Component Frequency

$$IDF_{t_j} = \frac{N - n_{t_j}}{n_{t_j}}$$
Inverse Component Frequency

$$X_6 = \log M$$
Number of Terms in common between query and Component -- logged
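The attribute definitions above can be sketched in Python. This is a minimal illustration, not the Cheshire II implementation: the function names, the dictionary-based inputs, and the reading of QL as the total number of query term occurrences are assumptions. The coefficients are the "Base" row from the New LR Coefficients slide. Converting the linear score to a probability with the logistic function is also an assumption about how the score is used.

```python
import math

# "Base" coefficients b0..b6 copied from the New LR Coefficients slide.
BASE_COEFFS = (-3.700, 1.269, -0.310, 0.679, -0.021, 0.223, 4.010)


def lr_attributes(query_tf, comp_tf, comp_len, coll_df, n_components):
    """Return [X1..X6] for one query/component pair, or None if no terms match.

    query_tf, comp_tf: term -> frequency in the query / in the component.
    comp_len: component length DL (same unit as the frequencies).
    coll_df: term -> number of components containing the term (n_t).
    n_components: N, total number of components; every matching term is
    assumed to satisfy n_t < N so the IDF ratio stays positive.
    """
    common = [t for t in query_tf if t in comp_tf]
    m = len(common)
    if m == 0:
        return None
    x1 = sum(math.log(query_tf[t]) for t in common) / m
    x2 = math.sqrt(sum(query_tf.values()))  # assumed definition of QL
    x3 = sum(math.log(comp_tf[t]) for t in common) / m
    x4 = math.sqrt(comp_len)
    x5 = sum(
        math.log((n_components - coll_df[t]) / coll_df[t]) for t in common
    ) / m
    x6 = math.log(m)
    return [x1, x2, x3, x4, x5, x6]


def lr_score(xs, coeffs=BASE_COEFFS):
    """Linear combination b0 + sum(b_i * X_i) from the slide."""
    return coeffs[0] + sum(b * x for b, x in zip(coeffs[1:], xs))


def to_probability(score):
    """Assumption: treat the linear score as log-odds (logistic transform)."""
    return 1.0 / (1.0 + math.exp(-score))
```

Because the logistic transform is monotonic, ranking by the linear score and ranking by the transformed value give the same order.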


Combining Boolean and Probabilistic Search Elements

• Two original approaches:
– Boolean Approach
– Non-probabilistic “Fusion Search”: the set merger approach is a weighted merger of document scores from separate Boolean and Probabilistic queries

$$P(R \mid Q, D) = P(R \mid Q_{bool}, D) \, P(R \mid Q_{prob}, D)$$

$$P(R \mid Q_{bool}, D) = \begin{cases} 1 & \text{if Boolean eval successful for } D \\ 0 & \text{otherwise} \end{cases}$$
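A small sketch of the product form above: the Boolean factor is 1 for documents in the Boolean result set and 0 otherwise, so it acts as a filter on the probabilistic scores. The function names and the dictionary/set inputs are hypothetical and chosen only for illustration.

```python
def combine_boolean_prob(prob_scores, boolean_hits):
    """Return P(R|Q_bool,D) * P(R|Q_prob,D) for every scored document.

    prob_scores: document id -> probabilistic score.
    boolean_hits: set of document ids that satisfied the Boolean part.
    """
    return {
        doc: (1.0 if doc in boolean_hits else 0.0) * p
        for doc, p in prob_scores.items()
    }


def ranked(scores):
    """Rank documents with a non-zero combined score, best first."""
    kept = [(doc, s) for doc, s in scores.items() if s > 0.0]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)
```

For a query such as "zfind topic @ government documents and title guidebooks", `boolean_hits` would hold the documents matching the title clause and `prob_scores` the ranked topic results.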


Okapi BM25

$$\sum_{T \in Q} w^{(1)} \, \frac{(k_1 + 1)\, tf}{K + tf} \, \frac{(k_3 + 1)\, qtf}{k_3 + qtf}$$

• Where:
• $Q$ is a query containing terms $T$
• $K$ is $k_1((1 - b) + b \cdot dl / avdl)$
• $k_1$, $b$ and $k_3$ are parameters, usually 1.2, 0.75 and 7-1000
• $tf$ is the frequency of the term in a specific document
• $qtf$ is the frequency of the term in a topic from which $Q$ was derived
• $dl$ and $avdl$ are the document length and the average document length measured in some convenient unit
• $w^{(1)}$ is the Robertson-Sparck Jones weight:

$$w^{(1)} = \log \frac{(r + 0.5) / (R - r + 0.5)}{(n - r + 0.5) / (N - n - R + r + 0.5)}$$
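The BM25 formula above can be sketched as follows. The function names and inputs are hypothetical. The slide does not define r, R, n and N, so the usual Robertson-Sparck Jones reading is assumed here (N documents in the collection, n containing the term, R known relevant documents, r of those containing the term), with r = R = 0 when no relevance information is available. k3 = 7 is one value from the slide's 7-1000 range.

```python
import math


def rsj_weight(n, N, r=0, R=0):
    """Robertson-Sparck Jones weight w(1); r = R = 0 means no relevance data."""
    return math.log(
        ((r + 0.5) / (R - r + 0.5)) / ((n - r + 0.5) / (N - n - R + r + 0.5))
    )


def bm25(query_tf, doc_tf, dl, avdl, df, N, k1=1.2, b=0.75, k3=7.0):
    """Sum the BM25 term contributions for one document.

    query_tf: term -> qtf; doc_tf: term -> tf in this document.
    df: term -> n, the number of documents containing the term.
    """
    K = k1 * ((1 - b) + b * dl / avdl)
    score = 0.0
    for term, qtf in query_tf.items():
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue  # a term absent from the document contributes nothing
        w = rsj_weight(df[term], N)
        score += w * ((k1 + 1) * tf / (K + tf)) * ((k3 + 1) * qtf / (k3 + qtf))
    return score
```

With `dl == avdl`, `tf == 1` and `qtf == 1`, both frequency factors equal 1 and the score reduces to the term weight itself, which is a convenient sanity check.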


Merging and Ranking Operators

• Extends the capabilities of merging to include merger operations in queries like Boolean operators
• Fuzzy Logic Operators (not used for INEX)
– !FUZZY_AND
– !FUZZY_OR
– !FUZZY_NOT
• Containment operators: Restrict components to or with a particular parent
– !RESTRICT_FROM
– !RESTRICT_TO
• Merge Operators
– !MERGE_SUM
– !MERGE_MEAN
– !MERGE_NORM
– !MERGE_CMBZ


INEX ‘04 Fusion Search

• Merge multiple ranked and Boolean index searches within each query and multiple component search resultsets
– Major components merged are Articles, Body, Sections, subsections, paragraphs

[Diagram: several Subquery boxes produce Comp. Query Results, which pass through Fusion/Merge to give the Final Ranked List.]
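The slides name the merge operators but do not define them, so the following is only a sketch of common score-fusion rules that such operators could plausibly correspond to. The mapping from these hypothetical functions to !MERGE_SUM, !MERGE_MEAN, !MERGE_NORM and !MERGE_CMBZ is an assumption, as are the function names and the dictionary representation of a resultset.

```python
def merge_sum(resultsets):
    """Add the scores a document receives across all resultsets."""
    merged = {}
    for rs in resultsets:
        for doc, score in rs.items():
            merged[doc] = merged.get(doc, 0.0) + score
    return merged


def merge_mean(resultsets):
    """Summed score divided by the number of resultsets."""
    return {doc: s / len(resultsets) for doc, s in merge_sum(resultsets).items()}


def normalize(rs):
    """Min-max normalize one resultset to the range [0, 1]."""
    lo, hi = min(rs.values()), max(rs.values())
    if hi == lo:
        return {doc: 1.0 for doc in rs}
    return {doc: (s - lo) / (hi - lo) for doc, s in rs.items()}


def merge_norm(resultsets):
    """Normalize each non-empty resultset, then sum."""
    return merge_sum([normalize(rs) for rs in resultsets if rs])


def merge_combmnz(resultsets):
    """Summed score times the number of resultsets containing the document."""
    counts = {}
    for rs in resultsets:
        for doc in rs:
            counts[doc] = counts.get(doc, 0) + 1
    return {doc: s * counts[doc] for doc, s in merge_sum(resultsets).items()}
```

Each subquery or component search in the diagram would contribute one resultset, and sorting the merged dictionary by score gives the final ranked list.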


New LR Coefficients

Index         b0       b1       b2       b3       b4       b5       b6
Base        -3.700    1.269   -0.310    0.679   -0.021    0.223    4.010
topic       -7.758    5.670   -3.427    1.787   -0.030    1.952    5.880
topicshort  -6.364    2.739   -1.443    1.228   -0.020    1.280    3.837
abstract    -5.892    2.318   -1.364    0.860   -0.013    1.052    3.600
alltitles   -5.243    2.319   -1.361    1.415   -0.037    1.180    3.696
sec words   -6.392    2.125   -1.648    1.106   -0.075    1.174    3.632
para words  -8.632    1.258   -1.654    1.485   -0.084    1.143    4.004

Estimates using INEX ‘03 relevance assessments for:
b1 = Average Absolute Query Frequency
b2 = Query Length
b3 = Average Absolute Component Frequency
b4 = Document Length
b5 = Average Inverse Component Frequency
b6 = Number of Terms in common between query and Component


SGML/XML Support

• Underlying native format for all data is SGML or XML
• The DTD defines the file format for each file
• Full SGML/XML parsing
• SGML/XML Format Configuration Files define the database
• USMARC DTD and MARC to SGML conversion (and back again)
• Access to full-text via special SGML/XML tags


Indexing

• Any SGML/XML tagged field or attribute can be indexed:
– B-Tree and Hash access via Berkeley DB (Sleepycat)
– Stemming, keyword, exact keys and “special keys”
– Mapping from any Z39.50 Attribute combination to a specific index
– Underlying postings information includes term frequency for probabilistic searching
• Component extraction with separate component indexes


XML Element Extraction

• A new search “ElementSetName” is XML_ELEMENT_
• Any XPath, element name, or regular expression can be included following the final underscore when submitting a present request
• The matching elements are extracted from the records matching the search and delivered in a simple format.

XML Extraction

% zselect sherlock
372 {Connection with SHERLOCK (sherlock.berkeley.edu) database 'bibfile' at port 2100 is open as connection #372}
% zfind topic mathematics
{OK {Status 1} {Hits 26} {Received 0} {Set Default} {RecordSyntax UNKNOWN}}
% zset recsyntax XML
% zset elementset XML_ELEMENT_Fld245
% zdisplay
{OK {Status 0} {Received 10} {Position 1} {Set Default} {NextPosition 11} {RecordSyntax XML 1.2.840.10003.5.109.10}} {<RESULT_DATA DOCID="1"><ITEM XPATH="/USMARC[1]/VarFlds[1]/VarDFlds[1]/Titles[1]/Fld245[1]"><Fld245 AddEnty="No" NFChars="0"><a>Singularitâes áa Cargáese</a></Fld245></ITEM><RESULT_DATA> … etc…



• Configuration files for the Server are SGML/XML:
– They include elements describing all of the data files and indexes for the database.
– They also include instructions on how data is to be extracted for indexing and how Z39.50 attributes map to the indexes for a given database.

SGML/XML Support

• Example XML record for a DL document

<ELIB-BIB>
<BIB-VERSION>ELIB-v1.0</BIB-VERSION>
<ID>756</ID>
<ENTRY>June 12, 1996</ENTRY>
<DATE>June 1996</DATE>
<TITLE>Cumulative Watershed Effects: Applicability of Available Methodologies to the Sierra Nevada</TITLE>
<ORGANIZATION>University of California</ORGANIZATION>
<TYPE>report</TYPE>
<AUTHOR-INSTITUTIONAL>USDA Forest Service</AUTHOR-INSTITUTIONAL>
<AUTHOR-PERSONAL>Neil H. Berg</AUTHOR-PERSONAL>
<AUTHOR-PERSONAL>Ken B. Roby</AUTHOR-PERSONAL>
<AUTHOR-PERSONAL>Bruce J. McGurk</AUTHOR-PERSONAL>
<PROJECT>SNEP</PROJECT>
<SERIES>Vol 3</SERIES>
<PAGES>40</PAGES>
<TEXT-REF>/elib/data/docs/0700/756/HYPEROCR/hyperocr.html</TEXT-REF>
<PAGED-REF>/elib/data/docs/0700/756/OCR-ASCII-NOZONE</PAGED-REF>
</ELIB-BIB>

SGML Support

• Example SGML/MARC Record

<USMARC Material="BK" ID="00000003">
<leader><LRL>00722</LRL><RecStat>n</RecStat> <RecType>a</RecType><BibLevel>m</BibLevel><UCP></UCP><IndCount>2</IndCount> <SFCount>2</SFCount><BaseAddr>00229</BaseAddr><EncLevel> </EncLevel> <DscCatFm></DscCatFm><LinkRec></LinkRec><EntryMap><FLength>4</Flength><SCharPos> 5</SCharPos><IDLength>0</IDLength><EMUCP></EMUCP></EntryMap></Leader>
<Directry>001001400000005001700014008004100031010001400072035002000086035001700106100001900123245010500142250001100247260003200258300003300290504005000323650003600373700002200409700002200431950003200453998000700485</Directry>
<VarFlds>
<VarCFlds><Fld001>CUBGGLAD1282B</Fld001><Fld005>19940414143202.0</Fld005> <Fld008>830810 1983 nyu eng u</Fld008></VarCFlds>
<VarDFlds>
<NumbCode><Fld010 I1="Blank" I2="Blnk"><a>82019962 </a></Fld010> <Fld035 I1="Blank" I2="Blnk"><a>(CU)ocm08866667</a></Fld035><Fld035 I1="Blank" I2="Blnk"><a>(CU)GLAD1282</a></Fld035></NumbCode>
<MainEnty><Fld100 NameType="Single" I2=""><a>Burch, John G.</a></Fld100></MainEnty>
<Titles><Fld245 AddEnty="Yes" NFChars="0"><a>Information systems :</a>theory and practice /<c>John G. Burch, Jr., Felix R. Strater, Gary Grudnitski</c></Fld245></Titles>
<EdImprnt><Fld250 I1="Blank" I2="Blnk"><a>3rd ed</a></Fld250><Fld260 I1="" I2="Blnk"><a>New York :</a>J. Wiley,<c>1983</c></Fld260></EdImprnt>
<PhysDesc><Fld300 I1="Blank" I2="Blnk"><a>xvi, 632 p. :</a>ill. ;<c>24 cm</c></Fld300></PhysDesc>
<Series></Series>
<Notes><Fld504 I1="Blank" I2="Blnk"><a>Includes bibliographical references and index</a></Fld504></Notes>
<SubjAccs><Fld650 SubjLvl="NoInfo" SubjSys="LCSH"><a>Management information systems.</a></Fld650> ...

SGML/XML Support

TREC document…

<DOC>
<DOCNO>FT931-3566</DOCNO>
<PROFILE>_AN-DCPCCAA3FT</PROFILE>
<DATE>930316</DATE>
<HEADLINE>FT 16 MAR 93 / Italy's Corruption Scandal: Magistrates hold key to unlocking Tangentopoli - They will set the investigation agenda</HEADLINE>
<BYLINE> By ROBERT GRAHAM</BYLINE>
<TEXT>OVER the weekend the Italian media felt obliged to comment on a non-event. No new arrests had taken place in any of the country's ever more numerous corruption scandals which centre on the illicit funding of political parties...</TEXT>
<XX> …
…Companies:-</XX>
<CO>Ente Nazionale Idrocarburi. Ente Nazionale per L'Energia Electtrica. Ente Partecipazioni E Finanziamento Industria Manifatturiera. IRI Istituto per La Ricostruzione Industriale.</CO>
<XX>Countries:-</XX>
<CN>ITZ Italy, EC.</CN>
<XX>Industries:-</XX>
<IN>P9222 Legal Counsel and Prosecution. P91 Executive, Legislative and General Government. P13 Oil and Gas Extraction. P9631 Regulation, Administration of Utilities. P6719 Holding Companies, NEC.</IN>
<XX>Types:-</XX> …
…
<TP>CMMT Comment & Analysis.
GOVT Legal issues.
</TP>
<PUB>The Financial Times
</PUB>
<PAGE>
London Page 4
</PAGE>
</DOC>

• INEX Document

<article><fno>C1050</fno><doi>10.1041/C1050s-2000</doi>
<fm><hdr><hdr1><ti>COMPUTING IN SCIENCE & ENGINEERING</ti><crt><issn>1521-9615</issn>/00/$10.00 <cci><onm>© 2000 IEEE</onm></cci></crt></hdr1><hdr2><obi><volno>Vol. 2</volno><issno>No. 1</issno></obi><pdt><mo>JANUARY/FEBRUARY</mo><yr>2000</yr></pdt><pp>pp. 50-59</pp></hdr2></hdr>
<tig><atl>The Decompositional Approach to Matrix Computation</atl><pn>pp. 50-59</pn></tig>
<au sequence="first"><fnm>G.W.</fnm><snm>Stewart</snm><aff><onm>University of Maryland</onm></aff></au>
<fig><art file="c1050x1.gif" w="425" h="321" tw="150" th="113"/></fig>
<abs>The introduction of matrix decomposition into numerical linear algebra revolutionized matrix computations. This article outlines the decompositional approach, comments on its history, and surveys the six most widely used decompositions.</abs></fm>
<bdy><sec><st></st><ip1>In 1951, Paul S. Dwyer published <it>Linear Computations</it>, perhaps the first book devoted entirely to numerical linear algebra.<ref rid="bibc10501" type="bib">1</ref> Digital computing was in its infancy, and Dwyer focused on computation with mechanical calculators. Nonetheless, the book was state of the art. <ref rid="c10501" type="fig">Figure 1</ref> reproduces a page of the book dealing with Gaussian elimination. In 1954, Alston S. Householder published <it>Principles of Numerical Analysis</it>,<ref rid="bibc10502" type="bib">2</ref> one of the first modern treatments of high-speed digital computation. <ref rid="c10502" type="fig">Figure 2</ref> reproduces a page from this book, also dealing with Gaussian elimination.</ip1>
<fig id="c10501"><art file="c10501.gif" w="600" h="970" tw="150" th="243"/><no>1</no><fgc>This page from <it>Linear Computations</it> shows that Paul Dwyer's approach begins with a system of scalar equations. Courtesy of John Wiley & Sons.</fgc></fig>
<fig id="c10502"><art file="c10502.gif" w="500" h="807" tw="150" th="242"/><no>2</no><fgc>On this page from <it>Principles of Numerical Analysis</it>, Alston Householder uses partitioned matrices and LU decomposition. Courtesy of McGraw-Hill.</fgc></fig>The contrast between these two excerpts is striking. The most obvious difference is that Dwyer used scalar equations whereas Householder used partitioned matrices. …

SGML/XML Support

…<sec><st>CONCLUSION</st><ip1>The big six are not the only decompositions in use; in fact, there are many more. As mentioned earlier, certain intermediate forms—such as tridiagonal and Hessenberg forms—have come to be regarded as decompositions in their own right. Since the singular value decomposition is expensive to compute and not readily updated, rank-revealing alternatives have received considerable attention.<ref rid="bibc105054" type="bib">54</ref><super>,</super><ref rid="bibc105055" type="bib">55</ref> There are also generalizations of the singular value decomposition and the Schur decomposition for pairs of matrices.<ref rid="bibc105056" type="bib">56</ref><super>,</super><ref rid="bibc105057" type="bib">57</ref> All crystal balls become cloudy when they look to the future, but it seems safe to say that as long as new matrix problems arise, new decompositions will be devised to solve them.</ip1></sec></bdy>
<bm><ack><h>Acknowledgment</h><ip1><it>This work was supported by the National Science Foundation under Grant No. 970909-8562.</it></ip1></ack>
<bib><bibl><h>References</h>
<bb id="bibc10501"><au><fnm>P.S.</fnm><snm>Dwyer</snm></au><ti>Linear Computations,</ti><obi>John Wiley & Sons,</obi><loc><cty>New York,</cty></loc><pdt><yr>1951.</yr></pdt></bb>
<bb id="bibc10502"><au><fnm>A.S.</fnm><snm>Householder</snm></au><ti>Principles of Numerical Analysis,</ti><obi>McGraw-Hill,</obi><loc><cty>New York,</cty></loc><pdt><yr>1953.</yr></pdt></bb>
<bb id="bibc10503"><au><fnm>J.H.</fnm><snm>Wilkinson</snm></au><obi>and</obi><au><fnm>C.</fnm><snm>Reinsch</snm></au><ti>Handbook for Automatic Computation, Vol. II, Linear Algebra,</ti><obi>Springer-Verlag,</obi><loc><cty>New York,</cty></loc><pdt><yr>1971.</yr></pdt></bb>
<bb id="bibc10504"><au><fnm>B.S.</fnm><snm>Garbow</snm></au><obi>et al.,</obi><atl>"Matrix Eigensystem Routines—Eispack Guide Extension,"</atl><ti>Lecture Notes in Computer Science,</ti><obi>Springer-Verlag,</obi><loc><cty>New York,</cty></loc><pdt><yr>1977.</yr></pdt></bb>
<bb id="bibc10505"><au><fnm>J.J.</fnm><snm>Dongarra</snm></au><obi>et al.,</obi><ti>LINPACK User's Guide,</ti><obi>SIAM,</obi><loc><cty>Philadelphia,</cty></loc><pdt><yr>1979.</yr></pdt></bb> …

• INEX CAS Query

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE inex_topic SYSTEM "topic.dtd">
<inex_topic topic_id="70" query_type="CAS" ct_no="49">
<title> /article[about(./fm/abs,'"information retrieval" "digital libraries"')]</title>
<description>Retrieve articles with an abstract indicating the article is about information retrieval and/or digital libraries</description>
<narrative>To be relevant the retrieved articles must be about information retrieval, digital libraries or, preferably both. Articles about information retrieval from digital libraries will receive the highest relevance judgements.</narrative>
<keywords>information retrieval,digital libraries</keywords>
</inex_topic>



• Configuration files for the Server are also SGML/XML:
– They include tags describing all of the data files and indexes for the database.
– They also include instructions on how data is to be extracted for indexing and how Z39.50 attributes map to the indexes for a given database.

Cheshire Configuration Files

<DBCONFIG>
<DBENV>/projects/is240/GroupX/indexes </DBENV>
<FILEDEF TYPE=SGML>
<DEFAULTPATH>/projects/is240/GroupX </DEFAULTPATH>
<FILETAG> trec </FILETAG>
<FILENAME> /projects/is240/ft </FILENAME>
<CONTINCLUDE> /projects/is240/ft.CONT </CONTINCLUDE>
<FILEDTD> /projects/is240/TREC.FT.DTD </FILEDTD>
<ASSOCFIL> ft.assoc </ASSOCFIL>
<HISTORY> cheshire_index/TESTDATA.history </HISTORY>
…


Indexing

• Any SGML/XML tagged field or attribute can be indexed:
– B-Tree and Hash access via Berkeley DB (Sleepycat)
– Stemming, keyword, exact keys and “special keys”
– Mapping from any Z39.50 Attribute combination to a specific index
– Underlying postings information includes term frequency for probabilistic searching.
– SGML may include address of full-text for indexing
• New indexes can be easily added, or old ones deleted


Bitmapped Indexes

• Bitmap indexes can be used for Boolean operations where the data has only a few values and very large numbers of items with each value
• Only one bit per record stored in the index
• Processed on a demand basis so only blocks with the bits needed to resolve a query are fetched
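The one-bit-per-record idea can be sketched with Python integers used as bit vectors. This is an illustration only: the class and function names are hypothetical, and the sketch keeps whole bitmaps in memory rather than modelling the block-at-a-time fetching described above.

```python
class BitmapIndex:
    """One bitmap per distinct value; bit i is set when record i has that value."""

    def __init__(self):
        self.bitmaps = {}  # value -> int used as a bit vector

    def add(self, record_id, value):
        self.bitmaps[value] = self.bitmaps.get(value, 0) | (1 << record_id)

    def lookup(self, value):
        return self.bitmaps.get(value, 0)


def record_ids(bitmap):
    """List the positions of the set bits, lowest record id first."""
    ids = []
    position = 0
    while bitmap:
        if bitmap & 1:
            ids.append(position)
        bitmap >>= 1
        position += 1
    return ids


# Boolean operations become bitwise operations on the bitmaps.
language = BitmapIndex()
for rid, lang in enumerate(["eng", "eng", "fre", "eng"]):
    language.add(rid, lang)

material = BitmapIndex()
for rid, kind in enumerate(["book", "map", "book", "map"]):
    material.add(rid, kind)

eng_and_book = language.lookup("eng") & material.lookup("book")
eng_not_book = language.lookup("eng") & ~material.lookup("book")
```

AND, OR and NOT map directly onto `&`, `|` and `& ~`, which is why this layout suits fields with few distinct values and many records per value.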

<INDEXES>
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=NONE PRIMARYKEY=IGNORE>
<INDXNAME> cheshire_index/trec.docno.index </INDXNAME>
<INDXTAG> docno </INDXTAG>
<INDXMAP><USE> 12 </USE><struct> 1 </struct> </INDXMAP>
<INDXKEY><TAGSPEC><FTAG>DOCNO </FTAG></TAGSPEC> </INDXKEY>
</INDEXDEF>
…
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROXIMITY NORMAL=STEM>
<INDXNAME> cheshire_index/trec.topic.index </INDXNAME>
<INDXTAG> topic </INDXTAG>
<INDXMAP><USE> 29 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<INDXMAP><USE> 29 </USE><RELAT> 102 </RELAT><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
…
<STOPLIST> cheshire_index/topicstoplist </STOPLIST>
<INDXKEY><TAGSPEC><FTAG>HEADLINE </FTAG><FTAG>DATELINE </FTAG><FTAG>BYLINE </FTAG><FTAG>TEXT </FTAG></TAGSPEC> </INDXKEY>
</INDEXDEF>


Cheshire II – EVI Generation

• Entry Vocabulary Indexes can improve access to data with controlled index terms
• Define basis for clustering records.
– Select field to form the basis of the cluster.
– Evidence: fields to use as contents of the pseudo-documents.
• During indexing cluster keys are generated with basis and evidence from each record.
• Cluster keys are sorted and merged on basis and pseudo-documents created for each unique basis element containing all evidence fields.
• Pseudo-Documents (Class clusters) are indexed on combined evidence fields.
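The cluster-building steps above can be sketched as a grouping operation: every record contributes its evidence fields to the pseudo-document of each basis value it carries. The function name and the record layout (a dictionary mapping field names to lists of strings) are assumptions made for the example, not the Cheshire II data structures.

```python
from collections import defaultdict


def build_pseudo_documents(records, basis_field, evidence_fields):
    """Group evidence from all records that share a basis value.

    records: list of dicts mapping a field name to a list of strings.
    Returns basis value -> {evidence field -> merged list of values},
    with basis values in sorted order.
    """
    clusters = defaultdict(lambda: defaultdict(list))
    for record in records:
        for basis in record.get(basis_field, []):
            for field in evidence_fields:
                clusters[basis][field].extend(record.get(field, []))
    return {basis: dict(fields) for basis, fields in sorted(clusters.items())}


records = [
    {"class": ["QA76"], "titles": ["Information systems"],
     "subjects": ["Management information systems"]},
    {"class": ["QA76"], "titles": ["Databases"], "subjects": []},
    {"class": ["Z699"], "titles": ["Retrieval"]},
]
pseudo = build_pseudo_documents(records, "class", ["titles", "subjects"])
```

Each resulting pseudo-document could then be indexed on its combined evidence, so that a search on ordinary words leads to the controlled basis terms.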

EVI/Cluster Definitions

<CLUSTER>
<clusname> classcluster </clusname>
<cluskey normal=CLASSCLUS>
<tagspec><FTAG>FLD950 </FTAG> <s> ^a </s></tagspec>
</cluskey>
<stoplist> /usr3/cheshire2/data2/clasclusstoplist </stoplist>
<clusmap>
<from> <tagspec>
<ftag>FLD245</ftag><s>^[ab]</s>
<ftag>FLD440</ftag><s>^a</s>
<ftag>FLD490</ftag><s>^a</s>
<ftag>FLD830</ftag><s>^a</s>
<ftag>FLD740</ftag><s>^a</s>
</tagspec></from>
<to> <tagspec><ftag>titles</ftag> </tagspec></to>
<from> <tagspec><ftag>FLD6..</ftag><s>^[abcdxyz]</s> </tagspec></from>
<to> <tagspec><ftag>subjects</ftag> </tagspec></to>
<summarize> <maxnum> 5 </maxnum>
<tagspec> <ftag>subjsum</ftag></tagspec></summarize>
</clusmap>
</CLUSTER>


Component Extraction and Indexing

• Any element (or range of SGML/XML data starting with one element and ending with another) can be defined as a ‘component’ and accessed and indexed as if it were an entire document.
• Component indexes and document-level indexes can be combined in search operations (and special operators permit selection of documents or components as the result).

Component Definitions

<COMPONENTS>
<COMPONENTDEF>
<COMPONENTNAME> TESTDATA/COMPONENT_DB1 </COMPONENTNAME>
<COMPONENTNORM>NONE</COMPONENTNORM>
<COMPSTARTTAG> <TAGSPEC> <FTAG>mainenty </FTAG> <FTAG>titles </FTAG> </TAGSPEC></COMPSTARTTAG>
<COMPENDTAG> <TAGSPEC><FTAG>Fld300 </FTAG></TAGSPEC></COMPENDTAG>
<COMPONENTINDEXES>
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=NONE>
<INDXNAME> TESTDATA/comp1index1.author…
</INDEXDEF>
</COMPONENTDEF>
</COMPONENTS>

Result Formatting (Display)

<DISPOPTIONS>KEEP_ENTITIES</DISPOPTIONS>
<DISPLAY>
<FORMAT NAME="B" OID="1.2.840.10003.5.105" DEFAULT>
<convert function="TAGSET-G">
<clusmap>
<from> <tagspec> <ftag>DOCNO</ftag> </tagspec></from>
<to> <tagspec> <ftag>28</ftag> </tagspec></to>
<from> <tagspec> <ftag>#DOCID#</ftag> </tagspec></from>
<to> <tagspec> <ftag>5</ftag> </tagspec></to>
</clusmap>
</convert>
</FORMAT>
</DISPLAY>

INEX Configuration Example

<DBCONFIG>
<DBENV>/projects/metadata/cheshire/TREC/cheshire_index </DBENV>
<FILEDEF TYPE=XML>
<DEFAULTPATH> /projects/metadata/cheshire/INEX </DEFAULTPATH>
<FILETAG> INEX </FILETAG>
<FILENAME> inex-1.3/xml </FILENAME>
<CONTINCLUDE> inex-1.3/xml_main.cont </CONTINCLUDE>
<FILEDTD> inex-1.3/dtd/wrapper.dtd </FILEDTD>
<SGMLCAT> inex-1.3/dtd/catalog </SGMLCAT>
<ASSOCFIL> inex-1.3/xml_main.assoc </ASSOCFIL>
<HISTORY> inex.history </HISTORY>

INEX Configuration Example

<INDEXES>
<INDEXDEF ACCESS=BTREE EXTRACT=EXACTKEY NORMAL=DO_NOT_NORMALIZE PRIMARYKEY=IGNORE>
<INDXNAME> indexes/docno.index </INDXNAME>
<INDXTAG> docno </INDXTAG>
<INDXKEY><TAGSPEC><FTAG> doi </FTAG></TAGSPEC> </INDXKEY>
</INDEXDEF>

INEX Configuration Example

<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=NONE>
<INDXNAME> indexes/pauthor.index</INDXNAME>
<INDXTAG> pauthor </INDXTAG>
<INDXMAP> <USE> 1 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<INDXMAP> <USE> 1004 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<STOPLIST> indexes/authorstoplist </STOPLIST>
<INDXKEY><TAGSPEC><FTAG>fm</FTAG><S>au</S><S>snm</S><FTAG>fm</FTAG><S>au</S><S>fnm</S></TAGSPEC> </INDXKEY>
</INDEXDEF>

INEX Configuration Example

<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROX NORMAL=STEM>
<INDXNAME> indexes/title.index </INDXNAME>
<INDXTAG> title </INDXTAG>
<INDXMAP><USE> 4 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<INDXMAP> <USE> 5 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<STOPLIST> indexes/titlestoplist </STOPLIST>
<INDXKEY><TAGSPEC><FTAG>fm</FTAG><S>tig</S><S>atl</S></TAGSPEC> </INDXKEY>
</INDEXDEF>

INEX Configuration Example

<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROX NORMAL=STEM>
<INDXNAME> indexes/topic.index </INDXNAME>
<INDXTAG> topic </INDXTAG>
<INDXMAP><USE> 29 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
…
<INDXMAP><USE> 1017 </USE><RELAT> 102 </RELAT><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<STOPLIST> indexes/topicstoplist </STOPLIST>
<INDXKEY><TAGSPEC><FTAG>fm</FTAG><S>tig</S><S>atl</S><FTAG>abs</FTAG><FTAG>bdy</FTAG><FTAG>bibl</FTAG><S>bb</S><S>atl</S><FTAG>app</FTAG></TAGSPEC> </INDXKEY>
</INDEXDEF>

<INDEXDEF ACCESS=BTREE EXTRACT=DATE NORMAL=YEAR><INDXNAME> indexes/date.index</INDXNAME><INDXTAG> date</INDXTAG>

<INDXMAP><USE> 30 </USE><POSIT> 3 </posit> <struct> 6 </struct></INDXMAP><INDXMAP><USE> 30 </USE><POSIT> 3 </posit> <struct> 5 </struct></INDXMAP>

<INDXKEY><TAGSPEC>

<FTAG>hdr2</FTAG><s>yr</s></TAGSPEC></INDXKEY></INDEXDEF>

<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=NONE><INDXNAME> indexes/journal.index</INDXNAME><INDXTAG> journal</INDXTAG>

<INDXMAP><USE> 1022 </USE><POSIT> 3 </posit> <struct> 6 </struct></INDXMAP><INDXMAP><USE> 1022 </USE><POSIT> 3 </posit> <struct> 5 </struct></INDXMAP>

<INDXKEY><TAGSPEC>

<FTAG>hdr1</FTAG><s>ti</s></TAGSPEC></INDXKEY></INDEXDEF>

<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROXIMITY NORMAL=STEM><INDXNAME> indexes/keywords.index </INDXNAME><INDXTAG> kwd </INDXTAG>


<STOPLIST> indexes/topicstoplist </STOPLIST><INDXKEY><TAGSPEC><FTAG>kwd</FTAG></TAGSPEC> </INDXKEY> </INDEXDEF>

<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROXIMITY NORMAL=STEM><INDXNAME> indexes/abstract.index </INDXNAME><INDXTAG> abstract </INDXTAG>


<STOPLIST> indexes/topicstoplist </STOPLIST><INDXKEY><TAGSPEC><FTAG>abs</FTAG></TAGSPEC> </INDXKEY> </INDEXDEF>

<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=NONE><INDXNAME> indexes/author_seq.index</INDXNAME><INDXTAG> author_seq </INDXTAG><INDXKEY><TAGSPEC><FTAG>fm</FTAG><S>au</S><ATTR>sequence</ATTR></TAGSPEC> </INDXKEY> </INDEXDEF>

<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=NONE><INDXNAME> indexes/bib_author_fnm.index</INDXNAME><INDXTAG> bib_author_fnm </INDXTAG>


<INDXKEY><TAGSPEC><FTAG>bb</FTAG><s>au</s><s>fnm</s> </TAGSPEC> </INDXKEY> </INDEXDEF>

<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=NONE><INDXNAME> indexes/bib_author_snm.index</INDXNAME><INDXTAG> bib_author_snm </INDXTAG>


<INDXKEY><TAGSPEC><FTAG>bb</FTAG><s>au</s><s>snm</s> </TAGSPEC> </INDXKEY> </INDEXDEF>

<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=STEM><INDXNAME> indexes/fig.index </INDXNAME><INDXTAG> fig </INDXTAG>


<STOPLIST> indexes/topicstoplist </STOPLIST><INDXKEY><TAGSPEC><FTAG>fig</FTAG></TAGSPEC> </INDXKEY> </INDEXDEF>

<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=STEM><INDXNAME> indexes/ack.index </INDXNAME><INDXTAG> ack </INDXTAG>


<STOPLIST> indexes/topicstoplist </STOPLIST><INDXKEY><TAGSPEC><FTAG>ack</FTAG></TAGSPEC> </INDXKEY> </INDEXDEF>

<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROXIMITY NORMAL=STEM><INDXNAME> indexes/alltitles.index </INDXNAME><INDXTAG> alltitles </INDXTAG>


<STOPLIST> indexes/titlestoplist </STOPLIST><INDXKEY><TAGSPEC><FTAG>atl</FTAG><FTAG>st</FTAG></TAGSPEC> </INDXKEY> </INDEXDEF>

<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=NONE><INDXNAME> indexes/affil.index </INDXNAME><INDXTAG> affil </INDXTAG>


<STOPLIST> indexes/titlestoplist </STOPLIST><INDXKEY><TAGSPEC><FTAG>fm</FTAG><s>aff</s></TAGSPEC> </INDXKEY> </INDEXDEF>

<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=none><INDXNAME> indexes/fno.index </INDXNAME><INDXTAG> fno </INDXTAG>

<INDXMAP><USE> 3192 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP><INDXKEY><TAGSPEC><FTAG>fno</FTAG></TAGSPEC> </INDXKEY> </INDEXDEF>

<INDEXDEF ACCESS=BTREE EXTRACT=INTEGER NORMAL=NONE><INDXNAME> indexes/figno.index </INDXNAME><INDXTAG> figno </INDXTAG>

<INDXMAP><USE> 3193 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP><INDXKEY><TAGSPEC><FTAG>fig</FTAG><s>no</s></TAGSPEC> </INDXKEY> </INDEXDEF>

<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROXIMITY NORMAL=STEM><INDXNAME> indexes/topicshort.index </INDXNAME><INDXTAG> topicshort </INDXTAG>

<INDXMAP><USE> 3192 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP><INDXKEY><TAGSPEC><FTAG>fm</FTAG><S>tig</S><S>atl</S><FTAG>abs</FTAG><FTAG>kwd</FTAG><FTAG>st</FTAG></TAGSPEC> </INDXKEY> </INDEXDEF>

</INDEXES>

<COMPONENTS><COMPONENTDEF><COMPONENTNAME> indexes/COMPONENT_SECTION </COMPONENTNAME><COMPONENTNORM>NONE</COMPONENTNORM><COMPSTARTTAG><TAGSPEC><FTAG>sec</FTAG></TAGSPEC></COMPSTARTTAG><COMPONENTINDEXES>

<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROXIMITY NORMAL=NONE><INDXNAME> indexes/sec_title2.index</INDXNAME><INDXTAG> sec_title </INDXTAG><INDXMAP> <USE> 38 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP><STOPLIST> indexes/titlestoplist </STOPLIST>

<INDXKEY><TAGSPEC><FTAG>sec</FTAG><s>st</s></TAGSPEC> </INDXKEY> </INDEXDEF>

<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROXIMITY NORMAL=STEM><INDXNAME> indexes/sec_words.index</INDXNAME><INDXTAG> sec_words </INDXTAG>

<INDXMAP> <USE> 39 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>

<STOPLIST> indexes/topicstoplist </STOPLIST>

<INDXKEY><TAGSPEC><FTAG>sec</FTAG></TAGSPEC> </INDXKEY> </INDEXDEF>

</COMPONENTINDEXES></COMPONENTDEF>

<COMPONENTDEF><COMPONENTNAME> indexes/COMPONENT_BIB </COMPONENTNAME><COMPONENTNORM>NONE</COMPONENTNORM><COMPSTARTTAG><TAGSPEC><FTAG>bm</FTAG><S>bib</S><s>bibl</s><s>bb</s></TAGSPEC></COMPSTARTTAG><COMPONENTINDEXES><INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=NONE><INDXNAME> indexes/bib_author.index</INDXNAME><INDXTAG> bib_author </INDXTAG><INDXMAP> <USE> 1000 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>

<INDXKEY><TAGSPEC><FTAG>au</FTAG> </TAGSPEC> </INDXKEY> </INDEXDEF>

<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROXIMITY NORMAL=NONE><INDXNAME> indexes/bib_title.index</INDXNAME><INDXTAG> bib_title </INDXTAG>

<INDXMAP> <USE> 33 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>

<INDXKEY><TAGSPEC><FTAG>atl</FTAG> </TAGSPEC> </INDXKEY> </INDEXDEF>

<INDEXDEF ACCESS=BTREE EXTRACT=DATE NORMAL=YEAR><INDXNAME> indexes/bib_date.index</INDXNAME><INDXTAG> bib_date </INDXTAG>

<INDXMAP> <USE> 31 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>

<INDXKEY><TAGSPEC><FTAG>pdt</FTAG><s>yr</s> </TAGSPEC> </INDXKEY> </INDEXDEF>

<COMPONENTDEF><COMPONENTNAME> indexes/COMPONENT_PARAS </COMPONENTNAME><COMPONENTNORM>NONE</COMPONENTNORM><COMPSTARTTAG><TAGSPEC><FTAG>^ilrj$|^ip1$|^ip2$|^ip3$|^ip4$|^ip5$|^item-none$|^p$|^p1$|^p2$|^p3$|^tmath$|^tf$</FTAG></TAGSPEC></COMPSTARTTAG><COMPONENTINDEXES><INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROXIMITY NORMAL=STEM><INDXNAME> indexes/para_words.index</INDXNAME><INDXTAG> para_words </INDXTAG><INDXMAP> <USE> 39 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP><STOPLIST> indexes/topicstoplist </STOPLIST><INDXKEY><TAGSPEC><FTAG>.*</FTAG></TAGSPEC> </INDXKEY> </INDEXDEF>

<COMPONENTDEF><COMPONENTNAME> indexes/COMPONENT_FIG </COMPONENTNAME><COMPONENTNORM>NONE</COMPONENTNORM><COMPSTARTTAG><TAGSPEC><FTAG>fig</FTAG></TAGSPEC></COMPSTARTTAG><COMPONENTINDEXES><INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=NONE><INDXNAME> indexes/fig_caption.index</INDXNAME><INDXTAG> fig_caption </INDXTAG><INDXMAP> <USE> 38 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>

<STOPLIST> indexes/titlestoplist </STOPLIST>

<INDXKEY><TAGSPEC><FTAG>fgc</FTAG></TAGSPEC> </INDXKEY> </INDEXDEF></COMPONENTINDEXES></COMPONENTDEF>

<COMPONENTDEF><COMPONENTNAME> indexes/COMPONENT_VITAE </COMPONENTNAME><COMPONENTNORM>NONE</COMPONENTNORM><COMPSTARTTAG><TAGSPEC><FTAG>vt</FTAG></TAGSPEC></COMPSTARTTAG><COMPONENTINDEXES><INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROXIMITY NORMAL=NONE><INDXNAME> indexes/vitae_words.index</INDXNAME><INDXTAG> vt_vitae </INDXTAG><INDXMAP> <USE> 38 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP><STOPLIST> indexes/titlestoplist </STOPLIST>

<INDXKEY><TAGSPEC><FTAG>vt</FTAG></TAGSPEC> </INDXKEY> </INDEXDEF></COMPONENTINDEXES></COMPONENTDEF></COMPONENTS>

<DISPOPTIONS>KEEP_ENTITIES</DISPOPTIONS>

<DISPLAY> <DISPLAYDEF NAME="B" OID="1.2.840.10003.5.105" DEFAULT> <convert function="MIXED"> <clusmap> <from> <tagspec> <ftag>doi</ftag> </tagspec></from> <to> <tagspec> <ftag>28</ftag> </tagspec></to> <from> <tagspec> <ftag>#DOCID#</ftag> </tagspec></from> <to> <tagspec> <ftag>5</ftag> </tagspec></to> <from> <tagspec> <ftag>#DBNAME#</ftag> </tagspec></from>…

<DISPLAYDEF name="XML_ELEMENT_" OID="1.2.840.10003.5.109.10"><convert function="XML_ELEMENT"> <clusmap> <from> <tagspec> <ftag>#FILENAME#</ftag> </tagspec></from> <to> <tagspec> <ftag>FILENAME</ftag> </tagspec></to> <from> <tagspec> <ftag>#RANK#</ftag> </tagspec></from> <to> <tagspec> <ftag>RANK </ftag> </tagspec></to> …

…<from> <tagspec> <ftag>#RAWSCORE#</ftag> </tagspec></from> <to> <tagspec> <ftag>RAWSCORE </ftag> </tagspec></to> <from> <tagspec> <ftag> SUBST_ELEMENT </ftag> </tagspec></from> <to> <tagspec> <ftag> SUBST_ELEMENT </ftag> </tagspec> </to> </clusmap></convert></DISPLAYDEF>

</DISPLAY></FILEDEF>

</DBCONFIG>



XML Schemas and Element Retrieval


XML Schema Support

• XML Schemas or DTDs can be used to define the data contents
• Tested with a wide variety of schemas, including METS (with various supporting schemas)


XML Element Extraction

• A new search “ElementSetName” is XML_ELEMENT_
• Any XPath, element name, or regular expression can be included following the final underscore when submitting a present request (note: only a subset of full XPath is available)
• The matching elements are extracted from the records matching the search and delivered in a simple format.

XML Extraction

% zselect sherlock
372 {Connection with SHERLOCK (sherlock.berkeley.edu) database 'bibfile' at port 2100 is open as connection #372}
% zfind topic mathematics
{OK {Status 1} {Hits 26} {Received 0} {Set Default} {RecordSyntax UNKNOWN}}
% zset recsyntax XML
% zset elementset XML_ELEMENT_Fld245
% zdisplay
{OK {Status 0} {Received 10} {Position 1} {Set Default} {NextPosition 11} {RecordSyntax XML 1.2.840.10003.5.109.10}} {<RESULT_DATA DOCID="1"><ITEM XPATH="/USMARC[1]/VarFlds[1]/VarDFlds[1]/Titles[1]/Fld245[1]"><Fld245 AddEnty="No" NFChars="0"><a>Singularitâes áa Cargáese</a></Fld245></ITEM><RESULT_DATA> … etc…


Database Storage

• All data stored as SGML/XML flat text files plus optional linked full-text (non-XML) files
• File format is defined through an SGML/XML DTD (also a flat text file) or Schema
• “Associator” files provide indexed direct access to each record in SGML/XML files.
– Contain offset and record length for each “record”
– Associators can be built to index any conformant document in a directory sub-tree
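The associator idea (an offset and a length per record, giving direct access into a flat file) can be illustrated with a minimal Python sketch. This is not Cheshire's on-disk format: the function names `build_associator` and `fetch_record` are hypothetical, and the sketch assumes records with a given tag do not nest.

```python
import re
from typing import List, Tuple


def build_associator(data: bytes, record_tag: str) -> List[Tuple[int, int]]:
    """Return an (offset, length) pair for each <record_tag>...</record_tag>.

    Hypothetical helper: assumes records with this tag are not nested.
    """
    tag = re.escape(record_tag.encode("ascii"))
    # Match from an opening tag to the first matching closing tag.
    pattern = re.compile(b"<" + tag + rb"[\s>].*?</" + tag + b">", re.DOTALL)
    return [(m.start(), m.end() - m.start()) for m in pattern.finditer(data)]


def fetch_record(data: bytes, associator: List[Tuple[int, int]], recno: int) -> bytes:
    """Use the stored offset and length to slice one record out of the file."""
    offset, length = associator[recno]
    return data[offset:offset + length]


sample = b"<article><fno>A1</fno></article>\n<article><fno>B22</fno></article>\n"
assoc = build_associator(sample, "article")
second = fetch_record(sample, assoc, 1)
```

With a real file the same pairs would let a server seek straight to a record instead of parsing the whole file.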


INEX CO Runs

• Three official, one later run - all Title-only
– Fusion - Combines Okapi and LR using the MERGE_CMBZ operator
– NewParms (LR) - Using only LR with the new parameters
– Feedback - An attempt at blind relevance feedback
– PostFusion - Fusion of the new LR coefficients and Okapi
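The slides describe MERGE_CMBZ only as a normalized score summation with an enhancement, so the following Python sketch is an assumption rather than the Cheshire implementation. It min-max normalizes each result list, sums the normalized scores, and (as a CombMNZ-style guess at the "enhancement") multiplies each sum by the number of lists containing the document. The names `min_max` and `fuse` are hypothetical.

```python
from typing import Dict, List, Tuple


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale one result list's scores into the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def fuse(result_lists: List[Dict[str, float]]) -> List[Tuple[str, float]]:
    """Sum normalized scores, then boost documents found by several lists."""
    total: Dict[str, float] = {}
    hits: Dict[str, int] = {}
    for scores in result_lists:
        if not scores:
            continue
        for doc, s in min_max(scores).items():
            total[doc] = total.get(doc, 0.0) + s
            hits[doc] = hits.get(doc, 0) + 1
    # Assumed enhancement: multiply by the number of lists containing the doc.
    fused = {doc: total[doc] * hits[doc] for doc in total}
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))


okapi = {"d1": 12.0, "d2": 8.0, "d3": 4.0}   # made-up Okapi scores
lr = {"d2": 1.0, "d3": 0.5, "d4": 0.0}        # made-up LR probabilities
ranked = fuse([okapi, lr])
```

Normalizing first matters because Okapi scores and LR probabilities are on different scales; without it the summation would be dominated by whichever engine produces larger raw numbers.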


Query Generation - CO

• # 162 TITLE = Text and Index Compression Algorithms
• QUERY: (topicshort @+ {Text and Index Compression Algorithms}) !MERGE_CMBZ (alltitles @+ {Text and Index Compression Algorithms}) !MERGE_CMBZ (topicshort @ {Text and Index Compression Algorithms}) !MERGE_CMBZ (alltitles @ {Text and Index Compression Algorithms})
• @+ is Okapi, @ is LR
• !MERGE_CMBZ is a normalized score summation and enhancement
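The query for topic 162 follows a regular pattern: the topic title is searched in each index with the Okapi operator and again with the LR operator, and the clauses are joined with !MERGE_CMBZ. A minimal Python sketch of that string construction follows; the function name `co_query` is hypothetical, and the index names and clause order are taken from the example on this slide.

```python
from typing import Sequence


def co_query(title: str, indexes: Sequence[str] = ("topicshort", "alltitles")) -> str:
    """Build a fused content-only query string from a topic title."""
    clauses = []
    for op in ("@+", "@"):            # Okapi clauses first, then LR clauses
        for index in indexes:
            clauses.append(f"({index} {op} {{{title}}})")
    return " !MERGE_CMBZ ".join(clauses)


query = co_query("Text and Index Compression Algorithms")
```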


INEX CO Runs

Generalized Strict

Avg Prec
FUSION = 0.0642
NEWPARMS = 0.0582
FDBK = 0.0415
POSTFUS = 0.0690

Avg Prec
FUSION = 0.0923
NEWPARMS = 0.0853
FDBK = 0.0390
POSTFUS = 0.0952


INEX VCAS Runs

• Two official runs
– FUSVCAS - Element fusion using LR and various operators for path restriction
– NEWVCAS - Using the new LR coefficients for each appropriate index and various operators for path restriction


Query Generation - VCAS

• #66 TITLE = //article[about(., intelligent transport systems)]//sec[about(., on-board route planning navigation system for automobiles)]
• Submitted query = ((topic @ {intelligent transport systems})) !RESTRICT_FROM ((sec_words @ {on-board route planning navigation system for automobiles}))
• Target elements: sec|ss1|ss2|ss3
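The translation from the topic 66 title to the submitted query can be sketched in Python. This is only an illustration that handles simple `//element[about(., text)]` steps. The element-to-index mapping (`article` to `topic`, `sec` to `sec_words`) is inferred from this single example, and `vcas_query` is a hypothetical name.

```python
import re

# One path step of the form //element[about(., some text)]
ABOUT_STEP = re.compile(r"//(\w+)\[about\(\s*\.\s*,\s*([^)]*?)\s*\)\]")

# Assumed mapping from element names to index names, taken from the example.
INDEX_FOR_ELEMENT = {"article": "topic", "sec": "sec_words"}


def vcas_query(nexi_title: str) -> str:
    """Turn a simple structured topic title into a path-restricted LR query."""
    steps = ABOUT_STEP.findall(nexi_title)
    clauses = [f"(({INDEX_FOR_ELEMENT[el]} @ {{{text}}}))" for el, text in steps]
    return " !RESTRICT_FROM ".join(clauses)


title = ("//article[about(., intelligent transport systems)]"
         "//sec[about(., on-board route planning navigation system for automobiles)]")
submitted = vcas_query(title)
```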


VCAS Results

Generalized Strict

Avg Prec
FUSVCAS = 0.0321
NEWVCAS = 0.0270

Avg Prec
FUSVCAS = 0.0601
NEWVCAS = 0.0569


Heterogeneous Track

• Approach using Cheshire’s Virtual Database options
– Primarily a version of distributed IR
– Each collection indexed separately
– Search via Z39.50 distributed queries
– Z39.50 Attribute mapping used to map query indexes to appropriate elements in a given collection
– Only LR used and collection results merged using probability of relevance for each collection result
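Because every collection is ranked with LR, each result carries a probability-of-relevance estimate, and results from separately searched collections can be merged by sorting on that estimate. The Python sketch below shows the merge step only. The collection names, document identifiers and probabilities are invented for illustration, and `merge_results` is a hypothetical name.

```python
import heapq
from typing import Dict, List, Tuple

Hit = Tuple[float, str, str]  # (probability of relevance, collection, document id)


def merge_results(per_collection: Dict[str, List[Tuple[str, float]]],
                  limit: int = 10) -> List[Hit]:
    """Merge per-collection result lists into one ranking by probability."""
    hits = [(prob, coll, doc)
            for coll, results in per_collection.items()
            for doc, prob in results]
    return heapq.nlargest(limit, hits, key=lambda h: h[0])


results = {
    "collA": [("a1", 0.82), ("a2", 0.40)],
    "collB": [("b1", 0.67), ("b2", 0.55)],
}
merged = merge_results(results, limit=3)
```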


Heterogeneous Track Issues

• Very large “Documents”
– Our approach was to segment
• Reporting XPath after segmenting large documents


Database Storage

[Diagram labels: Config File, DTD File, SGML/XML File, Associator Files, History File, Page Data File, Cluster File, Index Files, Postings File, Prox Data File, Remote RDBMS]


Client/Server Architecture

• Server Supports:
– Database storage
– Indexing
– Z39.50 access to local data
– Boolean and Probabilistic Searching
– Relevance Feedback
– External SQL database support
• Client Supports:
– Programmable (Tcl/Tk; Python soon) Graphical User Interface
– Z39.50 access to remote servers
– SGML & MARC formatting
• Combined Client/Server CGI scripting via WebCheshire


Z39.50 Overview

[Diagram labels: UI, Map Query, Map Results, Internet, Search Engine]


Two Protocols: HTTP & Z39.50

SYSTEM BEHAVIOR                     HTTP     Z39.50
State maintenance                   client   server
Sessions                            no       yes
Policies adaptable to link speed    no       yes
Synch/asynch                        synch    both
Fixed/negotiated protocol           fixed    neg
Fixed/negotiated doc formats        none     neg
Standardized Metadata               no       yes


Server Z39.50 Support

• Locally developed Z39.50 Library
• Extended version 3 support
– Supports version 3 attributes in BIB-1 including “stem”, “relevance”, etc. Also adding support for “type 102” ranked queries (version 4)
• Can provide MARC, SUTRS and SGML records, with support for Explain and GRS-1 conversion of any SGML records


Distributed Search


The Problem

• The Digital Library vision -- access for everyone to “all human knowledge”
• Lyman and Varian’s estimates of the “Dark Web”
• Hundreds or thousands of servers with databases ranging widely in content, topic, format
– Broadcast search is expensive in terms of bandwidth and in processing too many irrelevant results
– How to select the “best” ones to search?
   • Which resource to search first?
   • Which to search next if more is wanted?
– Topical/domain constraints on the search selections
– Variable contents of database (metadata only, full text, multimedia…)


Distributed Search Tasks

• Resource Description
– How to collect metadata about digital libraries and their collections or databases
• Resource Selection
– How to select relevant digital library collections or databases from a large number of databases
• Distributed Search
– How to perform parallel or sequential searching over the selected digital library databases
• Data Fusion
– How to merge query results from different digital libraries with their different search engines, differing record structures, etc.


An Approach for Distributed Resource Discovery

• Distributed resource representation and discovery
– New approach to building resource descriptions based on Z39.50
– Instead of using broadcast search across resources we are using two Z39.50 Services
   • Identification of database metadata using Z39.50 Explain
   • Extraction of distributed indexes using Z39.50 SCAN
• Evaluation
– How efficiently can we build distributed indexes?
– How effectively can we choose databases using the index?
– How effective is merging search results from multiple sources?
– Can we build hierarchies of servers (general/meta-topical/individual)?


Z39.50 Explain

• Explain supports searches for
– Server-Level metadata
   • Server Name
   • IP Addresses
   • Ports
– Database-Level metadata
   • Database name
   • Search attributes (indexes and combinations)
– Support metadata (record syntaxes, etc.)


Z39.50 SCAN

• Originally intended to support Browsing
• Query for
– Database
– Attributes plus Term (i.e., index and start point)
– Step Size
– Number of terms to retrieve
– Position in Response set
• Results
– Number of terms returned
– List of Terms and their frequency in the database (for the given attribute combination)


Z39.50 SCAN Results

% zscan title cat 1 20 1
{SCAN {Status 0}{Terms 20}{StepSize 1}{Position 1}}{cat 27}{cat-fight 1}{catalan 19}{catalogu 37}{catalonia 8}{catalyt 2}{catania 1}{cataract 1}{catch 173}{catch-all 3}{catch-up 2} …

% zscan topic cat 1 20 1
{SCAN {Status 0}{Terms 20}{StepSize 1}{Position 1}}{cat 706}{cat-and-mouse 19}{cat-burglar 1}{cat-carrying 1}{cat-egory 1}{cat-fight 1}{cat-gut 1}{cat-litter 1}{cat-lovers 2}{cat-pee 1}{cat-run 1}{cat-scanners 1} …

Syntax: zscan indexname1 term stepsize number_of_terms pref_pos
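A SCAN response of the shape shown on this slide (a `{SCAN ...}` header followed by `{term frequency}` pairs) can be unpacked with a short Python sketch. The function name `parse_scan` is hypothetical, and the parser assumes single-word terms as in the example output; it is not part of the Cheshire client.

```python
import re
from typing import Dict, List, Tuple


def parse_scan(response: str) -> Tuple[Dict[str, int], List[Tuple[str, int]]]:
    """Split a zscan response into its header fields and (term, frequency) pairs."""
    text = response.strip()
    header_match = re.match(r"\{SCAN((?:\s*\{\w+ \d+\})*)\}", text)
    if header_match is None:
        raise ValueError("not a SCAN response")
    header = {key: int(value)
              for key, value in re.findall(r"\{(\w+) (\d+)\}", header_match.group(1))}
    # Everything after the header is a run of {term frequency} items.
    body = text[header_match.end():]
    terms = [(term, int(freq)) for term, freq in re.findall(r"\{(\S+) (\d+)\}", body)]
    return header, terms


sample = ("{SCAN {Status 0}{Terms 20}{StepSize 1}{Position 1}}"
          "{cat 27}{cat-fight 1}{catalan 19}")
header, terms = parse_scan(sample)
```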


Resource Index Creation

• For all servers, or a topical subset…
– Get Explain information
– For each index
   • Use SCAN to extract terms and frequency
   • Add term + freq + source index + database metadata to the XML “Collection Document” for the resource
– Planned extensions:
   • Post-Process indexes (especially Geo Names, etc.) for special types of data
      – e.g. create “geographical coverage” indexes
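The harvesting loop above ends by writing terms, frequencies, source index and database metadata into an XML collection document. The slides do not show that document's schema, so the element and attribute names in this Python sketch (`collection`, `index`, `term`, `freq`) are placeholders chosen for illustration, as is the function name `collection_document`.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple


def collection_document(db_name: str, host: str, port: int,
                        index_terms: Dict[str, List[Tuple[str, int]]]) -> str:
    """Serialize harvested terms into one XML document per resource.

    Element and attribute names are hypothetical, not the Cheshire schema.
    """
    root = ET.Element("collection", {"name": db_name, "host": host, "port": str(port)})
    for index_name, terms in index_terms.items():
        index_el = ET.SubElement(root, "index", {"name": index_name})
        for term, freq in terms:
            term_el = ET.SubElement(index_el, "term", {"freq": str(freq)})
            term_el.text = term
    return ET.tostring(root, encoding="unicode")


xml_text = collection_document(
    "bibfile", "sherlock.berkeley.edu", 2100,
    {"title": [("cat", 27), ("catalan", 19)], "topic": [("cat", 706)]},
)
```

Keeping the source index on each term group is what later allows a query on a given index to be matched against the corresponding harvested terms.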


MetaSearch Approach

[Diagram labels: MetaSearch Server, Map Explain and Scan Queries, Map Query, Map Results, Internet, Distributed Index, Search Engines, DB 1 through DB 6]


Known Issues and Problems

• Not all Z39.50 Servers support SCAN or Explain
• Solutions that appear to work well:
– Probing for attributes instead of Explain (e.g. DC attributes or analogs)
– We also support OAI and can extract OAI metadata for servers that support OAI
– Query-based sampling (Callan)
• Collection Documents are static and need to be replaced when the associated collection changes


Evaluation

• Test Environment
– TREC Tipster data (approx. 3 GB)
– Partitioned into 236 smaller collections based on source and date by month (no DOE)
   • High size variability (from 1 to thousands of records)
   • Same database as used in other distributed search studies by J. French and J. Callan among others
– Used TREC topics 51-150 for evaluation (these are the only topics with relevance judgements for all 3 TIPSTER disks)


Harvesting Efficiency

• Tested using the databases on the previous slide + the full FT database (210,158 records ~ 600 Mb)
• Average of 23.07 seconds per database to SCAN each database (3.4 indexes on average) and create a collection representative, over the network
• Average of 14.07 seconds
• Also tested larger databases (e.g. the TREC FT database, ~600 Mb with 7 indexes, was harvested in 131 seconds).


Our Collection Ranking Approach

• We attempt to estimate the probability of relevance for a given collection with respect to a query using the Logistic Regression method developed at Berkeley (W. Cooper, F. Gey, D. Dabney, A. Chen) with a new algorithm for weight calculation at retrieval time
• Estimates from multiple extracted indexes are combined to provide an overall ranking score for a given resource (i.e., fusion of multiple query results)




Probability of relevance for a given index is based on logistic regression from a sample set of documents to determine values of the coefficients (TREC). At retrieval the probability estimate is obtained by:

\[ P(R \mid Q, C) = c_0 + \sum_{i=1}^{6} c_i X_i \]

where

\[ X_1 = \frac{1}{M} \sum_{j=1}^{M} \log QAF_{t_j} \]
Average Absolute Query Frequency

\[ X_2 = \sqrt{QL} \]
Query Length

\[ X_3 = \frac{1}{M} \sum_{j=1}^{M} \log CAF_{t_j} \]
Average Absolute Collection Frequency

\[ X_4 = \sqrt{\frac{CL}{10}} \]
Collection size estimate

\[ X_5 = \frac{1}{M} \sum_{j=1}^{M} \log ICF_{t_j} \]
Average Inverse Collection Frequency

\[ ICF_{t_j} = \frac{N - n_{t_j}}{n_{t_j}} \]
Inverse Document Frequency (N = Number of collections)

\[ X_6 = \log M \]
M = Number of Terms in common between query and document
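The six features can be computed directly from term statistics. The Python sketch below is an illustration under stated assumptions, not the Cheshire code. It takes X1, X3 and X5 as averages of log QAF, log CAF and log ICF over the M matching terms, X2 and X4 as the square roots of QL and CL/10, X6 as log M, and ICF as (N - n_t)/n_t with n_t read as the number of collections containing term t. The coefficients c_0..c_6 are not given in the slides and must be supplied by the caller. The function names are hypothetical.

```python
import math
from typing import Dict, List, Optional, Sequence


def collection_features(query_tf: Dict[str, int], coll_tf: Dict[str, int],
                        coll_count: Dict[str, int], query_len: int,
                        coll_len: int, n_collections: int) -> Optional[List[float]]:
    """Return [X1..X6] for one collection, or None if no query term matches."""
    common = [t for t in query_tf if t in coll_tf]
    m = len(common)
    if m == 0:
        return None
    for t in common:
        if not 0 < coll_count[t] < n_collections:
            raise ValueError(f"ICF undefined for term {t!r}")
    x1 = sum(math.log(query_tf[t]) for t in common) / m   # avg log query frequency
    x2 = math.sqrt(query_len)                              # query length
    x3 = sum(math.log(coll_tf[t]) for t in common) / m     # avg log collection frequency
    x4 = math.sqrt(coll_len / 10)                          # collection size estimate
    x5 = sum(math.log((n_collections - coll_count[t]) / coll_count[t])
             for t in common) / m                          # avg log ICF
    x6 = math.log(m)                                       # log of matching term count
    return [x1, x2, x3, x4, x5, x6]


def lr_score(features: Sequence[float], coeffs: Sequence[float]) -> float:
    """Compute c_0 + sum_i c_i * X_i; coeffs holds c_0 followed by c_1..c_6."""
    return coeffs[0] + sum(c * x for c, x in zip(coeffs[1:], features))


feats = collection_features({"cat": 1}, {"cat": 1}, {"cat": 1},
                            query_len=4, coll_len=10, n_collections=2)
score = lr_score(feats, [0.5, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0])  # made-up coefficients
```

Collections would then be ranked by this score, one score per harvested index, before the per-index estimates are combined.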


Evaluation

• Effectiveness

– Tested using the collection representatives described above (as harvested from over the network) and the TIPSTER relevance judgements

– Tested by comparing our approach to known algorithms for ranking collections

– Results were measured against reported results for the Ideal and CORI algorithms and against the optimal “Relevance Based Ranking” (MAX)

– Recall analog (how many of the relevant documents occurred in the top n databases, averaged)
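The recall analog can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the evaluation code used for the reported results: the denominator is taken to be all relevant documents for the query across every collection, and the function names are hypothetical.

```python
def recall_analog(ranking, rel_counts, n):
    """Fraction of a query's relevant documents held by the top-n collections.

    ranking:    collection ids, best first, as produced by a ranking algorithm
    rel_counts: collection id -> number of relevant documents it holds
    """
    total = sum(rel_counts.values())
    if total == 0:
        return 0.0
    found = sum(rel_counts.get(c, 0) for c in ranking[:n])
    return found / total


def mean_recall_analog(rankings, rel_counts_by_query, n):
    """Average the recall analog over all queries (keys of `rankings`)."""
    scores = [recall_analog(rankings[q], rel_counts_by_query[q], n) for q in rankings]
    return sum(scores) / len(scores) if scores else 0.0


def relevance_based_ranking(rel_counts):
    """Optimal ordering: collections sorted by how many relevant documents they hold."""
    return sorted(rel_counts, key=rel_counts.get, reverse=True)
```

Running `recall_analog` on the output of `relevance_based_ranking` gives the upper bound that any collection-ranking algorithm can reach at a given n.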


Titles only (short query)

[Figure: plot of R̂ for title-only queries]


Future

• Logically clustering servers by topic

• Meta-Meta Servers (treating the MetaSearch database as just another database)


Distributed Metadata Servers

[Diagram: Replicated servers, Meta-Topical Servers, General Servers, Database Servers]


Geographic Operators and Search Ranking


The GEO Operations

• Operators established for the GEO Z39.50 profile

• Implemented using special operations on indexes

• Indexing allows extraction of geographic coordinates and dates from SGML/XML data in a variety of formats

• Normalized internal representation in indexes

• Search using geographic and time elements as primary or limiting search elements


The GEO Operations

• X-based interfaces permit (simple) map drawing and search

• Interface to MapServer for web-based map searching

GEO Geographic operators

Operator   Name               Meaning
>=<        Overlap            Search region and data overlap
>#<        Fully Enclosed     Data fully enclosed in search region
<#>        Encloses           Data fully encloses search region
<>#        Fully Outside      Data outside of search region
++         Near               Data is near search region
:<:        Before             Data date is before search date
:<=:       Before or During   Data date is before or during search date
:>=:       During or After    Data date is during or after search date
:>:        After              Data date is after search date
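The semantics of the region and date operators can be illustrated with a short Python sketch. It is not the Cheshire index code: it assumes axis-aligned bounding boxes that do not cross the date line, treats touching edges as overlap, and compares single dates rather than date ranges. The `Box` type and function names are hypothetical. The Near operator is left out because the slides do not state a distance threshold.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in degrees (hypothetical representation)."""
    west: float
    south: float
    east: float
    north: float


def overlaps(search, data):
    """>=<  search region and data share at least one point."""
    return (data.west <= search.east and data.east >= search.west and
            data.south <= search.north and data.north >= search.south)


def fully_enclosed(search, data):
    """>#<  data lies entirely inside the search region."""
    return (search.west <= data.west and data.east <= search.east and
            search.south <= data.south and data.north <= search.north)


def encloses(search, data):
    """<#>  data entirely contains the search region."""
    return fully_enclosed(data, search)


def fully_outside(search, data):
    """<>#  data and search region share no points."""
    return not overlaps(search, data)


# Date operators, simplified to single dates: (search_date, data_date) -> bool
DATE_OPERATORS = {
    ":<:": lambda search_date, data_date: data_date < search_date,
    ":<=:": lambda search_date, data_date: data_date <= search_date,
    ":>=:": lambda search_date, data_date: data_date >= search_date,
    ":>:": lambda search_date, data_date: data_date > search_date,
}

REGION_OPERATORS = {
    ">=<": overlaps,
    ">#<": fully_enclosed,
    "<#>": encloses,
    "<>#": fully_outside,
}
```

Keeping the operators in dictionaries keyed by their symbols makes it easy to dispatch from a parsed query term to the matching predicate.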


Overlaps Search


Fully Enclosed Search


Map-Based Search


GeoSearch Web Interface


MySQL and PostgreSQL


RDBMS Support

• There are two reasons for RDBMS support

– IR systems are not meant for LOTS of update transactions

– Some applications need to have access to both relational data and text data via Z39.50

• Both MySQL and PostgreSQL are popular open source RDBMSs, and either can now be used via Cheshire

– Z39.50 mappings to RDBMS columns

– “ZQL” submission of SQL as Z39.50 Type 0 query
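To make the idea of mapping Z39.50 attributes to RDBMS columns concrete, here is a small Python sketch. It is an assumption-laden illustration rather than the Cheshire gateway: the `records` table, its columns and the mapping dictionary are hypothetical, and SQLite from the standard library stands in for MySQL or PostgreSQL. The Bib-1 Use attribute values 4 (Title) and 1003 (Author) are the standard ones. ZQL pass-through of raw SQL is not shown.

```python
import sqlite3

# Hypothetical mapping from Bib-1 Use attribute values to table columns.
USE_ATTR_TO_COLUMN = {4: "title", 1003: "author"}


def search_by_use_attribute(conn, use_attr, term):
    """Translate a (Use attribute, term) pair into a parameterized SQL query."""
    column = USE_ATTR_TO_COLUMN[use_attr]   # KeyError for unmapped attributes
    # The column name comes only from the fixed mapping above; the search
    # term is always bound as a parameter, never interpolated into the SQL.
    sql = f"SELECT id, title, author FROM records WHERE {column} LIKE ? ORDER BY id"
    return conn.execute(sql, (f"%{term}%",)).fetchall()


def demo_connection():
    """Build an in-memory database with two sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE records (id INTEGER PRIMARY KEY, title TEXT, author TEXT)"
    )
    conn.executemany(
        "INSERT INTO records (title, author) VALUES (?, ?)",
        [("Information Retrieval", "van Rijsbergen"),
         ("Alice in Wonderland", "Carroll")],
    )
    conn.commit()
    return conn
```

For example, `search_by_use_attribute(demo_connection(), 4, "retrieval")` runs a title search, while passing 1003 searches the author column instead.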


Protocol Support


Protocols

• In Cheshire II most protocols (except Z39.50) are implemented using scripting

• Example scripts to support the following are included in the distribution

– OAI

– SRW (Python version)

– SOAP

– SDLIP


Cheshire III Design and Development


Cheshire III Goals

• Retain or reproduce (and refine) all Cheshire II features

– “Spring cleaning” of code base

– Add full Unicode support

– Store most system and content data in the database

• Permit easy and efficient integration in Web Services

• Use threaded server for economy of resource usage

• Enhanced multiprotocol support

• Support for distributed processing (i.e., GRID clusters)

• Enhance expandability and “drop in” functionality

• Interfaces and/or APIs for Java, Python, C/C++


Cheshire II Design Overview

[Diagram: XML DOCS, XML DIRECTORY, INDEX CLUSTER, INDEX CHESHIRE, CONT, BUILD ASSOC, ZSERVER, CONFIG, COMPONENT DEFINITION, INDEX(S), ASSOC, CLUSTER EXTENSION]


Cheshire III Server Overview

[Diagram: Cheshire III SERVER. Components: API, INDEXING, XSLT RECORD TRANSFORMS, SEARCH, PROTOCOL HANDLER, DB API, REMOTE SYSTEMS (any protocol), XML CONFIG & Metadata INFO, INDEXES, LOCAL DB, STAFF UI, CONFIG, NETWORK, RESULT SETS, SCAN, USER INFO, CONFIG & CONTROL, ACCESS INFO, AUTHENTICATION, CLUSTERING, Normalization, APACHE INTERFACE, SERVER CONTROL. Connections: Native calls, Z39.50, SOAP, OAI, SRW, JDBC, Fetch ID, Put ID, OpenURL, UDDI, WSRP, OGIS. External: Client/User, Clients]




Retain Features

• The intent is to permit all of the types of indexing, searching and record formatting available now, while making it easier to add new capabilities

• The new system will also support full UNICODE for content and for metadata

• Store metadata and content in the database (including config information, etc.)


Permit easy integration of Web Services

• The assumption is that the web server will be the central server mechanism in the future.

• The new design relies on the session handling, threading and load management tools available in Apache (2.0.40+)

• The Cheshire server is dynamically loaded as part of the Web Server


Multiprotocol Support

• The Web server handles the network issues and passes requests in various protocols along to the Cheshire Server.

• Individual Protocol “plugins” and the Protocol Handler convert search, display, and metadata requests in a particular protocol to the internal Cheshire III control language, and convert outgoing messages and data to the appropriate protocol form
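The plugin arrangement described above can be sketched as a small dispatch table in Python. Everything here is hypothetical and only illustrates the shape of the design: `InternalRequest` stands in for the internal control language, the "toy" plugin parses a made-up query-string format rather than any real protocol, and `fake_engine` replaces the search engine.

```python
import json
from dataclasses import dataclass, field
from urllib.parse import parse_qs


@dataclass
class InternalRequest:
    """Stand-in for a request in the internal control language."""
    operation: str
    params: dict = field(default_factory=dict)


class ProtocolHandler:
    """Routes each request through the decode/encode pair for its protocol."""

    def __init__(self):
        self._plugins = {}

    def register(self, protocol, decode, encode):
        self._plugins[protocol] = (decode, encode)

    def handle(self, protocol, raw_request, execute):
        decode, encode = self._plugins[protocol]   # KeyError if not registered
        return encode(execute(decode(raw_request)))


def decode_toy(raw):
    """Parse 'operation=search&query=...' into an InternalRequest."""
    fields = {key: values[0] for key, values in parse_qs(raw).items()}
    return InternalRequest(fields.pop("operation"), fields)


def encode_toy(result):
    """Serialize the engine's result for the toy protocol."""
    return json.dumps(result, sort_keys=True)


def fake_engine(request):
    """Placeholder for the search engine; echoes the request with zero hits."""
    return {"operation": request.operation,
            "query": request.params.get("query", ""),
            "hits": 0}


handler = ProtocolHandler()
handler.register("toy", decode_toy, encode_toy)
```

With this shape, supporting another protocol means registering one more decode/encode pair; the engine only ever sees `InternalRequest` objects.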


Distributed & GRID Processing

• The server will support protocols for interchange of partial results and collection statistics with a single “Master” controlling the actions of a large number of “Slave” servers

• These will run in parallel in a GRID environment

• This is still “research” but will probably be using “Storage Grid” technology from SDSC with our own applications

• Non-Grid use of the same protocols, etc. will be possible (but definitely slower)
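A minimal Python sketch of the master side of this arrangement follows, assuming each worker returns its own top-k list of `(score, doc_id)` pairs sorted best first and that scores are comparable across workers (which is what exchanging collection statistics is meant to enable). The function names are hypothetical, and local callables stand in for remote servers.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def merge_partial_results(partials, k):
    """Merge per-worker result lists (each sorted best first) into a global top k."""
    # heapq.merge expects ascending input, so order by negated score.
    merged = heapq.merge(*partials, key=lambda hit: -hit[0])
    return list(islice(merged, k))


def master_search(workers, query, k):
    """Send the query to every worker in parallel, then merge their partial results."""
    with ThreadPoolExecutor(max_workers=max(1, len(workers))) as pool:
        partials = list(pool.map(lambda worker: worker(query, k), workers))
    return merge_partial_results(partials, k)


def make_worker(hits):
    """Build a fake worker that always returns its best k of the given hits."""
    ranked = sorted(hits, reverse=True)
    return lambda query, k: ranked[:k]
```

Because each worker only ships its own top k, the master merges at most k hits per worker regardless of how large the underlying collections are.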


Enhanced Expandability

• Clearly defined APIs for interacting with the server will permit easy addition of new functionality, and replacement or upgrading of existing functionality

• Interactive user interface for database configuration and setup

– We want to make it easier for a user/administrator to create and manage the database


Multilingual APIs

• The system is being developed in a multilingual environment.

• We will include the ability to interface with (at a minimum) Java, Python and C/C++ applications.

• APIs for developing new functions will be available in these languages as well


Development

• Currently work is going on here (RRL) and (primarily) in the UK

• We have incomplete (Alpha) versions of the system, but haven’t been distributing them in their current form (they are changing constantly)

• First release version is expected in mid-’04


Further Information

• The full Cheshire II client and server are open source and available for academic and government use: ftp://cheshire.berkeley.edu/pub/cheshire/

– Includes HTML documentation

• Project Web Site: http://cheshire.berkeley.edu/

• Archives Hub: http://www.archiveshub.ac.uk/
