March 2, 2004 Ray R. Larson
Cheshire II: Features and Internals and Cheshire III overview
Ray R. Larson
School of Information Management and Systems
University of California, Berkeley

Overview
• Cheshire II feature overview
  – Logistic Regression Ranking, Okapi BM-25 and Boolean Operations
  – Fusion Operators
• Additions from INEX ‘03
  – Element/Index level re-estimation of LR coefficients
• Adhoc and Heterogeneous Track Methodology
• Evaluation Results - Adhoc

Overview of Cheshire II
• It supports SGML and XML with components and component indexes
• It is a client/server application
• Uses the Z39.50 Information Retrieval Protocol, support for SRW, OAI, SOAP, SDLIP also implemented
• Server supports a Relational Database Gateway
• Supports Boolean searching of all servers
• Supports probabilistic ranked retrieval in the Cheshire search engine as well as Boolean and proximity search
• Search engine supports ``nearest neighbor'' searches and relevance feedback
• GUI interface on X window displays and Windows NT
• WWW/CGI forms interface for DL, using combined client/server CGI scripting via WebCheshire
• Scriptable clients using Tcl and Python
• Store SGML/XML as files or “Datastore” database

Cheshire II Searching
[Diagram; labels: Z39.50, Internet, Images, Scanned Text, Local, Remote]

INEX Overview
[Diagram; labels: Local, Net, UI or Scripts, Map Query, Map Results, INEX, Search Engine]

Boolean Search Capability
• All Boolean operations are supported
  – “zfind author x and (title y or subject z) not subject A”
• Named sets are supported and stored on the server
• Boolean operations between stored sets are supported
  – “zfind SET1 and subject widgets or SET2”
• Nested parentheses and truncation are supported
  – “zfind xtitle Alice#”
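The stored-set query on this slide can be pictured as ordinary set algebra over record identifiers. The following is a minimal Python sketch, not Cheshire code: it assumes result sets are plain sets of record ids and that operators are applied left to right (the slides do not state the precedence rules). The ids, `stored_sets`, `subject_widgets` and `combine` are hypothetical names introduced for illustration.

```python
# Hypothetical named result sets held by the server, keyed by set name.
stored_sets = {
    "SET1": {1, 2, 3, 4},
    "SET2": {7, 8},
}
# Hypothetical postings for the index search "subject widgets".
subject_widgets = {2, 4, 6}


def combine(left, op, right):
    """Apply one Boolean operator to two sets of record ids."""
    if op == "and":
        return left & right
    if op == "or":
        return left | right
    if op == "not":  # "A not B" keeps records in A that are absent from B
        return left - right
    raise ValueError(f"unknown operator: {op}")


# "zfind SET1 and subject widgets or SET2", evaluated left to right (assumption).
step1 = combine(stored_sets["SET1"], "and", subject_widgets)
result = combine(step1, "or", stored_sets["SET2"])
```

Keeping named sets on the server means a later query can reuse `SET1` without re-running the search that produced it.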

Probabilistic Retrieval
• Uses Logistic Regression ranking method developed at Berkeley (W. Cooper, F. Gey, D. Dabney, A. Chen) with new algorithm for weight calculation at retrieval time
• Z39.50 “relevance” operator used to indicate probabilistic search
• Any index can have Probabilistic searching performed:
  – zfind topic @ “cheshire cats, looking glasses, march hares and other such things”
  – zfind title @ caucus races
• Boolean and Probabilistic elements can be combined:
  – zfind topic @ government documents and title guidebooks

Probabilistic Retrieval: Logistic Regression
Probability of relevance is based on Logistic regression from a sample set of documents to determine values of the coefficients. At retrieval the probability estimate is obtained by:
$$P(R \mid Q,D) = b_0 + \sum_{i=1}^{6} b_i X_i$$
For the 6 X attribute measures shown on the next slide

Probabilistic Retrieval: Logistic Regression attributes
• $X_1 = \frac{1}{M}\sum_{j=1}^{M} \log QAF_{t_j}$ : Average Absolute Query Frequency
• $X_2 = \sqrt{QL}$ : Query Length
• $X_3 = \frac{1}{M}\sum_{j=1}^{M} \log DAF_{t_j}$ : Average Absolute Component Frequency
• $X_4 = \sqrt{DL}$ : Document Length
• $X_5 = \frac{1}{M}\sum_{j=1}^{M} \log IDF_{t_j}$ : Average Inverse Component Frequency
• $IDF_{t_j} = \frac{N - n_{t_j}}{n_{t_j}}$ : Inverse Component Frequency
• $X_6 = \log M$ : Number of Terms in common between query and Component -- logged
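To make the six attribute measures concrete, here is a minimal Python sketch that computes them for one query/component pair and applies the linear form b0 + Σ bi·Xi. It is an illustration, not the Cheshire II implementation: it assumes the square-root and log forms used in published descriptions of the Cheshire LR algorithm, that query and component are given as term-frequency dictionaries, and that every matching term occurs in fewer than N components. All function and parameter names are hypothetical.

```python
import math


def lr_attributes(query_tf, comp_tf, comp_len, n_components, comp_freq):
    """Return [X1..X6] for one query/component pair, or None if no terms match.

    query_tf:     term -> frequency in the query (QAF)
    comp_tf:      term -> frequency in the component (DAF)
    comp_len:     component length (DL)
    n_components: N, number of components in the collection
    comp_freq:    term -> n_t, number of components containing the term
    """
    matching = [t for t in query_tf if comp_tf.get(t, 0) > 0]
    m = len(matching)  # M, number of terms in common
    if m == 0:
        return None
    query_len = sum(query_tf.values())  # QL
    x1 = sum(math.log(query_tf[t]) for t in matching) / m
    x2 = math.sqrt(query_len)
    x3 = sum(math.log(comp_tf[t]) for t in matching) / m
    x4 = math.sqrt(comp_len)
    x5 = sum(
        math.log((n_components - comp_freq[t]) / comp_freq[t]) for t in matching
    ) / m
    x6 = math.log(m)
    return [x1, x2, x3, x4, x5, x6]


def lr_score(coeffs, xs):
    """b0 + sum(b_i * X_i) for coeffs = [b0, b1, ..., b6]."""
    return coeffs[0] + sum(b * x for b, x in zip(coeffs[1:], xs))


# Hypothetical two-term query against a six-word component.
xs = lr_attributes(
    {"cheshire": 1, "cat": 1},
    {"cheshire": 2, "cat": 1, "grin": 3},
    comp_len=6,
    n_components=1000,
    comp_freq={"cheshire": 10, "cat": 100},
)
```

The coefficients passed to `lr_score` would come from a regression over judged query/component pairs; none are hard-coded here.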

Combining Boolean and Probabilistic Search Elements
• Two original approaches:
  – Boolean Approach
$$P(R \mid Q,D) = P(R \mid Q_{bool},D)\,P(R \mid Q_{prob},D)$$
$$P(R \mid Q_{bool},D) = \begin{cases} 1 & \text{if Boolean eval successful for } D \\ 0 & \text{otherwise} \end{cases}$$
  – Non-probabilistic “Fusion Search” Set merger approach is a weighted merger of document scores from separate Boolean and Probabilistic queries
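The product form above means the Boolean part acts as a filter on the probabilistic ranking: a document that fails the Boolean condition gets probability 0, and one that passes keeps its probabilistic estimate. A minimal Python sketch, assuming the Boolean result is a set of document ids and the probabilistic result is a dictionary of estimates (both hypothetical inputs):

```python
def combined_probability(doc_id, boolean_hits, prob_scores):
    """P(R|Q,D) = P(R|Q_bool,D) * P(R|Q_prob,D)."""
    p_bool = 1.0 if doc_id in boolean_hits else 0.0
    return p_bool * prob_scores.get(doc_id, 0.0)


# Hypothetical results of the Boolean and probabilistic parts of one query.
boolean_hits = {"d1", "d3"}
prob_scores = {"d1": 0.62, "d2": 0.80, "d3": 0.15}

scores = {d: combined_probability(d, boolean_hits, prob_scores) for d in prob_scores}
ranked = sorted((d for d in scores if scores[d] > 0), key=lambda d: -scores[d])
```

Here `d2` has the highest probabilistic estimate but drops out because it does not satisfy the Boolean part.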

Okapi BM25
$$\sum_{T \in Q} w^{(1)} \, \frac{(k_1 + 1)\,tf}{K + tf} \cdot \frac{(k_3 + 1)\,qtf}{k_3 + qtf}$$
• Where:
• Q is a query containing terms T
• K is k1((1-b) + b.dl/avdl)
• k1, b and k3 are parameters, usually 1.2, 0.75 and 7-1000
• tf is the frequency of the term in a specific document
• qtf is the frequency of the term in a topic from which Q was derived
• dl and avdl are the document length and the average document length measured in some convenient unit
• w(1) is the Robertson-Sparck Jones weight.
$$w^{(1)} = \log \frac{(r + 0.5)/(R - r + 0.5)}{(n - r + 0.5)/(N - n - R + r + 0.5)}$$
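The BM25 sum can be written out directly. The sketch below is a plain Python illustration, not the Cheshire II code: it uses the usual reading of the Robertson-Sparck Jones weight (N documents in the collection, n containing the term, R known relevant documents, r of those containing the term), defaults to r = R = 0 when no relevance information is available, and picks k3 = 7 from the range given on the slide. Function and parameter names are hypothetical.

```python
import math


def rsj_weight(n, N, r=0, R=0):
    """Robertson-Sparck Jones weight w(1)."""
    return math.log(
        ((r + 0.5) / (R - r + 0.5)) / ((n - r + 0.5) / (N - n - R + r + 0.5))
    )


def bm25(query_tf, doc_tf, dl, avdl, N, doc_freq, k1=1.2, b=0.75, k3=7.0):
    """Sum the BM25 contribution of every query term found in the document.

    query_tf: term -> qtf, doc_tf: term -> tf, doc_freq: term -> n.
    """
    K = k1 * ((1 - b) + b * dl / avdl)
    score = 0.0
    for term, qtf in query_tf.items():
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue  # terms absent from the document contribute nothing
        w1 = rsj_weight(doc_freq[term], N)
        score += w1 * ((k1 + 1) * tf / (K + tf)) * ((k3 + 1) * qtf / (k3 + qtf))
    return score


# Hypothetical one-term query against a document of average length.
example = bm25({"cat": 1}, {"cat": 1}, dl=100, avdl=100, N=1000, doc_freq={"cat": 10})
```

With dl equal to avdl and tf = qtf = 1, both saturation factors equal 1, so the score reduces to the term weight w(1).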

Merging and Ranking Operators
• Extends the capabilities of merging to include merger operations in queries like Boolean operators
• Fuzzy Logic Operators (not used for INEX)
  – !FUZZY_AND
  – !FUZZY_OR
  – !FUZZY_NOT
• Containment operators: Restrict components to or with a particular parent
  – !RESTRICT_FROM
  – !RESTRICT_TO
• Merge Operators
  – !MERGE_SUM
  – !MERGE_MEAN
  – !MERGE_NORM
  – !MERGE_CMBZ
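The slides name the merge operators but do not define them. As a rough illustration of what score-level merging of two ranked result sets can look like, the sketch below implements three generic fusion schemes: a sum, a mean, and a sum over min-max normalized scores. These are assumptions chosen for illustration and should not be read as the definitions of !MERGE_SUM, !MERGE_MEAN or !MERGE_NORM; all names and scores are hypothetical.

```python
def merge_sum(a, b):
    """Add scores; a record missing from one result set contributes 0 there."""
    return {d: a.get(d, 0.0) + b.get(d, 0.0) for d in a.keys() | b.keys()}


def merge_mean(a, b):
    """Average of the two scores, treating a missing score as 0."""
    return {d: s / 2.0 for d, s in merge_sum(a, b).items()}


def min_max(scores):
    """Rescale a non-empty score dictionary to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {d: 1.0 for d in scores}
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}


def merge_norm(a, b):
    """Sum after normalizing each result set separately."""
    return merge_sum(min_max(a), min_max(b))


def ranked(scores):
    """Ids by descending score, ties broken by id."""
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical scores from a title search and a topic search.
title_hits = {"d1": 0.9, "d2": 0.4}
topic_hits = {"d2": 0.7, "d3": 0.2}
summed = merge_sum(title_hits, topic_hits)
```

Normalizing before summing matters when the two searches produce scores on different scales, for example a probability from one index and a BM25 score from another.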

INEX ‘04 Fusion Search
• Merge multiple ranked and Boolean index searches within each query and multiple component search resultsets
  – Major components merged are Articles, Body, Sections, subsections, paragraphs
[Diagram; labels: Subquery (four boxes), Comp. Query Results (two boxes), Fusion/Merge, Final Ranked List]

New LR Coefficients
Index        b0      b1      b2      b3      b4      b5      b6
Base        -3.700   1.269  -0.310   0.679  -0.021   0.223   4.010
topic       -7.758   5.670  -3.427   1.787  -0.030   1.952   5.880
topicshort  -6.364   2.739  -1.443   1.228  -0.020   1.280   3.837
abstract    -5.892   2.318  -1.364   0.860  -0.013   1.052   3.600
alltitles   -5.243   2.319  -1.361   1.415  -0.037   1.180   3.696
sec words   -6.392   2.125  -1.648   1.106  -0.075   1.174   3.632
para words  -8.632   1.258  -1.654   1.485  -0.084   1.143   4.004
Estimates using INEX ‘03 relevance assessments for
b1 = Average Absolute Query Frequency
b2 = Query Length
b3 = Average Absolute Component Frequency
b4 = Document Length
b5 = Average Inverse Component Frequency
b6 = Number of Terms in common between query and Component
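Because each index has its own coefficient row, the same attribute values can score differently depending on which index was searched. The sketch below copies three rows from the table and applies the linear form b0 + Σ bi·Xi. The `probability` helper is an assumption: in the standard logistic regression formulation the linear sum is a log-odds value that is mapped to a probability with the logistic function, but the slides do not spell that step out. Function names and the attribute vector are hypothetical.

```python
import math

# Three rows copied from the coefficient table: [b0, b1, b2, b3, b4, b5, b6].
LR_COEFFICIENTS = {
    "Base": [-3.700, 1.269, -0.310, 0.679, -0.021, 0.223, 4.010],
    "topic": [-7.758, 5.670, -3.427, 1.787, -0.030, 1.952, 5.880],
    "abstract": [-5.892, 2.318, -1.364, 0.860, -0.013, 1.052, 3.600],
}


def score(index_name, xs):
    """b0 + sum(b_i * X_i) using the coefficients estimated for one index."""
    b = LR_COEFFICIENTS[index_name]
    return b[0] + sum(bi * xi for bi, xi in zip(b[1:], xs))


def probability(log_odds):
    """Logistic transform (assumption: the linear score is a log-odds value)."""
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical attribute vector [X1..X6] scored against two different indexes.
xs = [0.0, 1.4, 0.3, 2.4, 3.4, 0.7]
by_index = {name: score(name, xs) for name in ("Base", "topic")}
```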

SGML/XML Support
• Underlying native format for all data is SGML or XML
• The DTD defines the file format for each file
• Full SGML/XML parsing
• SGML/XML Format Configuration Files define the database
• USMARC DTD and MARC to SGML conversion (and back again)
• Access to full-text via special SGML/XML tags

Indexing
• Any SGML/XML tagged field or attribute can be indexed:
  – B-Tree and Hash access via Berkeley DB (Sleepycat)
  – Stemming, keyword, exact keys and “special keys”
  – Mapping from any Z39.50 Attribute combination to a specific index
  – Underlying postings information includes term frequency for probabilistic searching
• Component extraction with separate component indexes

XML Element Extraction
• A new search “ElementSetName” is XML_ELEMENT_
• Any Xpath, element name, or regular expression can be included following the final underscore when submitting a present request
• The matching elements are extracted from the records matching the search and delivered in a simple format.

XML Extraction
% zselect sherlock
372 {Connection with SHERLOCK (sherlock.berkeley.edu) database 'bibfile' at port 2100 is open as connection #372}
% zfind topic mathematics
{OK {Status 1} {Hits 26} {Received 0} {Set Default} {RecordSyntax UNKNOWN}}
% zset recsyntax XML
% zset elementset XML_ELEMENT_Fld245
% zdisplay
{OK {Status 0} {Received 10} {Position 1} {Set Default} {NextPosition 11} {RecordSyntax XML 1.2.840.10003.5.109.10}} {<RESULT_DATA DOCID="1"><ITEM XPATH="/USMARC[1]/VarFlds[1]/VarDFlds[1]/Titles[1]/Fld245[1]"><Fld245 AddEnty="No" NFChars="0"><a>Singularitâes áa Cargáese</a></Fld245></ITEM><RESULT_DATA> … etc…
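The effect of an element set name such as XML_ELEMENT_Fld245 can be imitated on a single record with Python's standard `xml.etree.ElementTree`. This is only a sketch of the idea: it matches by element name, wraps each match in RESULT_DATA/ITEM elements like the session output above, and omits the XPATH attribute that the server reports. The function name and the sample record are hypothetical.

```python
import xml.etree.ElementTree as ET


def extract_elements(record_xml, element_name, docid):
    """Wrap every element called element_name in a RESULT_DATA/ITEM envelope."""
    root = ET.fromstring(record_xml)
    result = ET.Element("RESULT_DATA", DOCID=str(docid))
    for elem in root.iter(element_name):
        item = ET.SubElement(result, "ITEM")
        item.append(elem)
    return result


# Hypothetical, heavily shortened MARC-style record.
record = (
    "<USMARC><VarFlds><VarDFlds><Titles>"
    '<Fld245 AddEnty="No"><a>Information systems :</a></Fld245>'
    "</Titles></VarDFlds></VarFlds></USMARC>"
)
out = extract_elements(record, "Fld245", 1)
serialized = ET.tostring(out, encoding="unicode")
```

Returning only the requested elements keeps the response small when a client needs, say, titles for a result list rather than full records.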

SGML/XML Support
• Configuration files for the Server are SGML/XML:
  – They include elements describing all of the data files and indexes for the database.
  – They also include instructions on how data is to be extracted for indexing and how Z39.50 attributes map to the indexes for a given database.

SGML/XML Support
• Example XML record for a DL document
<ELIB-BIB>
<BIB-VERSION>ELIB-v1.0</BIB-VERSION>
<ID>756</ID>
<ENTRY>June 12, 1996</ENTRY>
<DATE>June 1996</DATE>
<TITLE>Cumulative Watershed Effects: Applicability of Available Methodologies to the Sierra Nevada</TITLE>
<ORGANIZATION>University of California</ORGANIZATION>
<TYPE>report</TYPE>
<AUTHOR-INSTITUTIONAL>USDA Forest Service</AUTHOR-INSTITUTIONAL>
<AUTHOR-PERSONAL>Neil H. Berg</AUTHOR-PERSONAL>
<AUTHOR-PERSONAL>Ken B. Roby</AUTHOR-PERSONAL>
<AUTHOR-PERSONAL>Bruce J. McGurk</AUTHOR-PERSONAL>
<PROJECT>SNEP</PROJECT>
<SERIES>Vol 3</SERIES>
<PAGES>40</PAGES>
<TEXT-REF>/elib/data/docs/0700/756/HYPEROCR/hyperocr.html</TEXT-REF>
<PAGED-REF>/elib/data/docs/0700/756/OCR-ASCII-NOZONE</PAGED-REF>
</ELIB-BIB>
<USMARC Material="BK" ID="00000003"><leader><LRL>00722</LRL><RecStat>n</RecStat> <RecType>a</RecType><BibLevel>m</BibLevel><UCP></UCP><IndCount>2</IndCount> <SFCount>2</SFCount><BaseAddr>00229</BaseAddr><EncLevel> </EncLevel> <DscCatFm></DscCatFm><LinkRec></LinkRec><EntryMap><FLength>4</Flength><SCharPos> 5</SCharPos><IDLength>0</IDLength><EMUCP></EMUCP></EntryMap></Leader> <Directry>001001400000005001700014008004100031010001400072035002000086035001700106100001900123245010500142250001100247260003200258300003300290504005000323650003600373700002200409700002200431950003200453998000700485</Directry><VarFlds> <VarCFlds><Fld001>CUBGGLAD1282B</Fld001><Fld005>19940414143202.0</Fld005> <Fld008>830810 1983 nyu eng u</Fld008></VarCFlds> <VarDFlds><NumbCode><Fld010 I1="Blank" I2="Blnk"><a>82019962 </a></Fld010> <Fld035 I1="Blank" I2="Blnk"><a>(CU)ocm08866667</a></Fld035><Fld035 I1="Blank" I2="Blnk"><a>(CU)GLAD1282</a></Fld035></NumbCode><MainEnty><Fld100 NameType="Single" I2=""><a>Burch, John G.</a></Fld100></MainEnty><Titles><Fld245 AddEnty="Yes" NFChars="0"><a>Information systems :</a><b>theory and practice /</b><c>John G. Burch, Jr., Felix R. Strater, Gary Grudnitski</c></Fld245></Titles><EdImprnt><Fld250 I1="Blank" I2="Blnk"><a>3rd ed</a></Fld250><Fld260 I1="" I2="Blnk"><a>New York :</a><b>J. Wiley,</b><c>1983</c></Fld260></EdImprnt><PhysDesc><Fld300 I1="Blank" I2="Blnk"><a>xvi, 632 p. :</a><b>ill. ;</b><c>24 cm</c></Fld300></PhysDesc><Series></Series><Notes><Fld504 I1="Blank" I2="Blnk"><a>Includes bibliographical references and index</a></Fld504></Notes><SubjAccs><Fld650 SubjLvl="NoInfo" SubjSys="LCSH"><a>Management information systems.</a></Fld650> ...
SGML Support
• Example SGML/MARC Record

SGML/XML Support
TREC document…
<DOC>
<DOCNO>FT931-3566</DOCNO>
<PROFILE>_AN-DCPCCAA3FT</PROFILE>
<DATE>930316</DATE>
<HEADLINE>FT 16 MAR 93 / Italy's Corruption Scandal: Magistrates hold key to unlocking Tangentopoli - They will set the investigation agenda</HEADLINE>
<BYLINE> By ROBERT GRAHAM</BYLINE>
<TEXT>OVER the weekend the Italian media felt obliged to comment on a non-event. No new arrests had taken place in any of the country's ever more numerous corruption scandals which centre on the illicit funding of political parties...</TEXT>
<XX>Companies:-</XX>
<CO>Ente Nazionale Idrocarburi. Ente Nazionale per L'Energia Electtrica. Ente Partecipazioni E Finanziamento Industria Manifatturiera. IRI Istituto per La Ricostruzione Industriale.</CO>
<XX>Countries:-</XX>
<CN>ITZ Italy, EC.</CN>
<XX>Industries:-</XX>
<IN>P9222 Legal Counsel and Prosecution. P91 Executive, Legislative and General Government. P13 Oil and Gas Extraction. P9631 Regulation, Administration of Utilities. P6719 Holding Companies, NEC.</IN>
<XX>Types:-</XX>
<TP>CMMT Comment & Analysis.
GOVT Legal issues.
</TP>
<PUB>The Financial Times
</PUB>
<PAGE>
London Page 4
</PAGE>
</DOC>

SGML/XML Support
• INEX Document
<article>
<fno>C1050</fno><doi>10.1041/C1050s-2000</doi>
<fm>
<hdr><hdr1><ti>COMPUTING IN SCIENCE & ENGINEERING</ti><crt><issn>1521-9615</issn>/00/$10.00 <cci><onm>© 2000 IEEE</onm></cci></crt></hdr1><hdr2><obi><volno>Vol. 2</volno><issno>No. 1</issno></obi><pdt><mo>JANUARY/FEBRUARY</mo><yr>2000</yr></pdt><pp>pp. 50-59</pp></hdr2></hdr>
<tig><atl>The Decompositional Approach to Matrix Computation</atl><pn>pp. 50-59</pn></tig>
<au sequence="first"><fnm>G.W.</fnm><snm>Stewart</snm><aff><onm>University of Maryland</onm></aff></au>
<fig><art file="c1050x1.gif" w="425" h="321" tw="150" th="113"/></fig>
<abs><p>The introduction of matrix decomposition into numerical linear algebra revolutionized matrix computations. This article outlines the decompositional approach, comments on its history, and surveys the six most widely used decompositions.</p></abs>
</fm>
<bdy>
<sec><st></st>
<ip1>In 1951, Paul S. Dwyer published <it>Linear Computations</it>, perhaps the first book devoted entirely to numerical linear algebra.<ref rid="bibc10501" type="bib">1</ref> Digital computing was in its infancy, and Dwyer focused on computation with mechanical calculators. Nonetheless, the book was state of the art. <ref rid="c10501" type="fig">Figure 1</ref> reproduces a page of the book dealing with Gaussian elimination. In 1954, Alston S. Householder published <it>Principles of Numerical Analysis</it>,<ref rid="bibc10502" type="bib">2</ref> one of the first modern treatments of high-speed digital computation. <ref rid="c10502" type="fig">Figure 2</ref> reproduces a page from this book, also dealing with Gaussian elimination.</ip1>
<fig id="c10501"><art file="c10501.gif" w="600" h="970" tw="150" th="243"/><no>1</no><fgc>This page from <it>Linear Computations</it> shows that Paul Dwyer's approach begins with a system of scalar equations. Courtesy of John Wiley & Sons.</fgc></fig>
<fig id="c10502"><art file="c10502.gif" w="500" h="807" tw="150" th="242"/><no>2</no><fgc>On this page from <it>Principles of Numerical Analysis</it>, Alston Householder uses partitioned matrices and LU decomposition. Courtesy of McGraw-Hill.</fgc></fig>
<p>The contrast between these two excerpts is striking. The most obvious difference is that Dwyer used scalar equations whereas Householder used partitioned matrices. …

SGML/XML Support
…
<sec><st>CONCLUSION</st>
<ip1>The big six are not the only decompositions in use; in fact, there are many more. As mentioned earlier, certain intermediate forms—such as tridiagonal and Hessenberg forms—have come to be regarded as decompositions in their own right. Since the singular value decomposition is expensive to compute and not readily updated, rank-revealing alternatives have received considerable attention.<ref rid="bibc105054" type="bib">54</ref><super>,</super><ref rid="bibc105055" type="bib">55</ref> There are also generalizations of the singular value decomposition and the Schur decomposition for pairs of matrices.<ref rid="bibc105056" type="bib">56</ref><super>,</super><ref rid="bibc105057" type="bib">57</ref> All crystal balls become cloudy when they look to the future, but it seems safe to say that as long as new matrix problems arise, new decompositions will be devised to solve them.</ip1>
</sec>
</bdy>
<bm>
<ack><h>Acknowledgment</h><ip1><it>This work was supported by the National Science Foundation under Grant No. 970909-8562.</it></ip1></ack>
<bib><bibl><h>References</h>
<bb id="bibc10501"><au><fnm>P.S.</fnm><snm>Dwyer</snm></au><ti>Linear Computations,</ti><obi>John Wiley & Sons,</obi><loc><cty>New York,</cty></loc><pdt><yr>1951.</yr></pdt></bb>
<bb id="bibc10502"><au><fnm>A.S.</fnm><snm>Householder</snm></au><ti>Principles of Numerical Analysis,</ti><obi>McGraw-Hill,</obi><loc><cty>New York,</cty></loc><pdt><yr>1953.</yr></pdt></bb>
<bb id="bibc10503"><au><fnm>J.H.</fnm><snm>Wilkinson</snm></au><obi>and</obi><au><fnm>C.</fnm><snm>Reinsch</snm></au><ti>Handbook for Automatic Computation, Vol. II, Linear Algebra,</ti><obi>Springer-Verlag,</obi><loc><cty>New York,</cty></loc><pdt><yr>1971.</yr></pdt></bb>
<bb id="bibc10504"><au><fnm>B.S.</fnm><snm>Garbow</snm></au><obi>et al.,</obi><atl>"Matrix Eigensystem Routines—Eispack Guide Extension,"</atl><ti>Lecture Notes in Computer Science,</ti><obi>Springer-Verlag,</obi><loc><cty>New York,</cty></loc><pdt><yr>1977.</yr></pdt></bb>
<bb id="bibc10505"><au><fnm>J.J.</fnm><snm>Dongarra</snm></au><obi>et al.,</obi><ti>LINPACK User's Guide,</ti><obi>SIAM,</obi><loc><cty>Philadelphia,</cty></loc><pdt><yr>1979.</yr></pdt></bb> …

SGML/XML Support
• INEX CAS Query
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE inex_topic SYSTEM "topic.dtd">
<inex_topic topic_id="70" query_type="CAS" ct_no="49">
<title> /article[about(./fm/abs,'"information retrieval" "digital libraries"')]</title>
<description>Retrieve articles with an abstract indicating the article is about information retrieval and/or digital libraries</description>
<narrative>To be relevant the retrieved articles must be about information retrieval, digital libraries or, preferably both. Articles about information retrieval from digital libraries will receive the highest relevance judgements.</narrative>
<keywords>information retrieval,digital libraries</keywords>
</inex_topic>

SGML/XML Support
• Configuration files for the Server are also SGML/XML:
  – They include tags describing all of the data files and indexes for the database.
  – They also include instructions on how data is to be extracted for indexing and how Z39.50 attributes map to the indexes for a given database.

Cheshire Configuration Files
<!-- ******************************************************************* -->
<!-- ************************* TREC INTERACTIVE TEST DB **************** -->
<!-- ******************************************************************* -->
<!-- This is the config file for the Cheshire II TREC interactive Database -->
<DBCONFIG>
<DBENV>/projects/is240/GroupX/indexes </DBENV>
<!-- -->
<!-- TREC TEST DATABASE FILEDEF -->
<!-- -->
<!-- The Interactive TREC Financial Times datafile -->
<FILEDEF TYPE=SGML>
<DEFAULTPATH>/projects/is240/GroupX </DEFAULTPATH>
<!-- filetag is the "shorthand" name of the file -->
<FILETAG> trec </FILETAG>
<!-- filename is the full path name of the main data directory -->
<FILENAME> /projects/is240/ft </FILENAME>
<CONTINCLUDE> /projects/is240/ft.CONT </CONTINCLUDE>
<!-- fileDTD is the full path name of the file's DTD -->
<FILEDTD> /projects/is240/TREC.FT.DTD </FILEDTD>
<!-- assocfil is the full path name of the file's Associator -->
<ASSOCFIL> ft.assoc </ASSOCFIL>
<!-- history is the full path name of the file's history file -->
<HISTORY> cheshire_index/TESTDATA.history </HISTORY>
…

Indexing
• Any SGML/XML tagged field or attribute can be indexed:
  – B-Tree and Hash access via Berkeley DB (Sleepycat)
  – Stemming, keyword, exact keys and “special keys”
  – Mapping from any Z39.50 Attribute combination to a specific index
  – Underlying postings information includes term frequency for probabilistic searching.
  – SGML may include address of full-text for indexing
• New indexes can be easily added, or old ones deleted

Bitmapped Indexes
• Bitmap indexes can be used for Boolean operations where the data has only a few values and very large numbers of items with each value
• Only one bit per record stored in the index
• Processed on a demand basis so only blocks with the bits needed to resolve a query are fetched
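The one-bit-per-record idea can be shown with Python integers used as bit vectors. This is a toy sketch under stated assumptions: record ids are consecutive integers starting at 0, each field has a single value per record, and the whole bitmap is held in memory, so it does not model the block-wise, on-demand fetching described above. All names and data are hypothetical.

```python
def build_bitmaps(values):
    """values[i] is the field value of record i; returns value -> int bitmap."""
    bitmaps = {}
    for rec_id, value in enumerate(values):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << rec_id)
    return bitmaps


def record_ids(bitmap):
    """List the positions of the set bits, i.e. the matching record ids."""
    ids, pos = [], 0
    while bitmap:
        if bitmap & 1:
            ids.append(pos)
        bitmap >>= 1
        pos += 1
    return ids


# Hypothetical low-cardinality fields for five records.
language = build_bitmaps(["eng", "fre", "eng", "ger", "eng"])
material = build_bitmaps(["book", "book", "map", "book", "book"])

# Boolean AND and OR become single bitwise operations.
eng_books = record_ids(language["eng"] & material["book"])
fre_or_ger = record_ids(language["fre"] | language["ger"])
```

This layout pays off exactly in the case the slide describes: few distinct values, many records per value, where a postings list per value would be very long.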

<!-- The following are the index definitions for the file -->
<INDEXES>
<!-- ******************************************************************* -->
<!-- ************************* DOC NO. ********************************* -->
<!-- ******************************************************************* -->
<!-- The following provides document number access. -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=NONE PRIMARYKEY=IGNORE>
<INDXNAME> cheshire_index/trec.docno.index </INDXNAME>
<INDXTAG> docno </INDXTAG>
<INDXMAP><USE> 12 </USE><struct> 1 </struct> </INDXMAP>
<INDXMAP><USE> 12 </USE><struct> 2 </struct> </INDXMAP>
<INDXMAP><USE> 12 </USE><struct> 6 </struct> </INDXMAP>
<INDXKEY><TAGSPEC><FTAG>DOCNO </FTAG></TAGSPEC> </INDXKEY>
</INDEXDEF>
…

<!-- ******************************************************************* -->
<!-- ************************* TOPIC *********************************** -->
<!-- ******************************************************************* -->
<!-- The following is the primary index for probabilistic searches -->
<!-- It includes headlines, datelines, bylines, and full text -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROXIMITY NORMAL=STEM>
<INDXNAME> cheshire_index/trec.topic.index </INDXNAME>
<INDXTAG> topic </INDXTAG>
<INDXMAP><USE> 29 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<INDXMAP><USE> 29 </USE><RELAT> 102 </RELAT><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
…
<STOPLIST> cheshire_index/topicstoplist </STOPLIST>
<INDXKEY><TAGSPEC><FTAG>HEADLINE </FTAG><FTAG>DATELINE </FTAG><FTAG>BYLINE </FTAG><FTAG>TEXT </FTAG></TAGSPEC> </INDXKEY>
</INDEXDEF>

Cheshire II – EVI Generation
• Entry Vocabulary Indexes can improve access to data with controlled index terms
• Define basis for clustering records.
  – Select field to form the basis of the cluster.
  – Evidence Fields to use as contents of the pseudo-documents.
• During indexing cluster keys are generated with basis and evidence from each record.
• Cluster keys are sorted and merged on basis and pseudo-documents created for each unique basis element containing all evidence fields.
• Pseudo-Documents (Class clusters) are indexed on combined evidence fields.
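The clustering steps above amount to a group-by on the basis field followed by concatenation of the evidence fields. A minimal Python sketch, assuming records are dictionaries with one basis value and lists of evidence strings; the field names, class numbers and function name are hypothetical, and normalization and stoplists are left out:

```python
from collections import defaultdict


def build_pseudo_documents(records, basis_field, evidence_fields):
    """Merge the evidence of all records sharing a basis value into one text."""
    clusters = defaultdict(list)
    for record in records:
        basis = record.get(basis_field)
        if basis is None:
            continue  # records without a basis value form no cluster key
        for field in evidence_fields:
            clusters[basis].extend(record.get(field, []))
    # Sort on basis, then join the collected evidence into one pseudo-document.
    return {basis: " ".join(words) for basis, words in sorted(clusters.items())}


# Hypothetical records: a class number as basis, titles and subjects as evidence.
records = [
    {"class": "QA76", "titles": ["Information systems"],
     "subjects": ["Management information systems"]},
    {"class": "QA76", "titles": ["Database design"], "subjects": []},
    {"class": "Z699", "titles": ["Online retrieval"],
     "subjects": ["Information retrieval"]},
]
pseudo = build_pseudo_documents(records, "class", ["titles", "subjects"])
```

A ranked search of the user's words against these pseudo-documents then suggests the basis values (the controlled terms) whose records use similar vocabulary.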

EVI/Cluster Definitions
<!-- ************************* CLUSTER ********************************* -->
<!-- *********************** DEFINITIONS ******************************* -->
<CLUSTER>
<clusname> classcluster </clusname>
<cluskey normal=CLASSCLUS>
<tagspec><FTAG>FLD950 </FTAG> <s> ^a </s>
</tagspec></cluskey><stoplist> /usr3/cheshire2/data2/clasclusstoplist </stoplist><clusmap>
<from> <tagspec><ftag>FLD245</ftag><s>^[ab]</s><ftag>FLD440</ftag><s>^a</s><ftag>FLD490</ftag><s>^a</s><ftag>FLD830</ftag><s>^a</s><ftag>FLD740</ftag><s>^a</s>
</tagspec></from><to> <tagspec>
<ftag>titles</ftag> </tagspec></to><from> <tagspec>
<ftag>FLD6..</ftag><s>^[abcdxyz]</s> </tagspec></from><to> <tagspec>
<ftag>subjects</ftag> </tagspec></to><summarize> <maxnum> 5 </maxnum>
<tagspec> <ftag>subjsum</ftag></tagspec></summarize>
</clusmap></CLUSTER>

Component Extraction and Indexing
• Any element (or range of SGML/XML data starting with one element and ending with another) can be defined as a ‘component’ and accessed and indexed as if it were an entire document.
• Component indexes and document-level indexes can be combined in search operations (and special operators permit selection of document or components as the result)
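Treating an element as a retrievable unit of its own can be sketched with Python's standard `xml.etree.ElementTree`. The example handles only the simple case of a single element type as the component (not a range running from one start tag to a different end tag), and builds a toy inverted index keyed by (document id, component number). Function names and the sample article are hypothetical.

```python
import xml.etree.ElementTree as ET


def extract_components(doc_xml, doc_id, component_tag):
    """Return one entry per matching element, with its text flattened."""
    root = ET.fromstring(doc_xml)
    components = []
    for n, elem in enumerate(root.iter(component_tag), start=1):
        text = " ".join(" ".join(elem.itertext()).split())
        components.append({"doc": doc_id, "component": n, "text": text})
    return components


def build_index(units):
    """units: iterable of (unit_id, text); returns word -> set of unit ids."""
    index = {}
    for unit_id, text in units:
        for word in text.lower().split():
            index.setdefault(word, set()).add(unit_id)
    return index


# Hypothetical two-section article.
article = (
    "<article><bdy>"
    "<sec><st>Intro</st><p>Matrix decomposition</p></sec>"
    "<sec><st>Conclusion</st><p>New decompositions</p></sec>"
    "</bdy></article>"
)
comps = extract_components(article, "C1050", "sec")
sec_index = build_index(((c["doc"], c["component"]), c["text"]) for c in comps)
```

Because each posting names both the document and the component, a search can return either the section itself or the enclosing article.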

Component Definitions
<COMPONENTS>
<COMPONENTDEF>
<COMPONENTNAME> TESTDATA/COMPONENT_DB1 </COMPONENTNAME>
<COMPONENTNORM>NONE</COMPONENTNORM>
<COMPSTARTTAG> <TAGSPEC> <FTAG>mainenty </FTAG> <FTAG>titles </FTAG> </TAGSPEC></COMPSTARTTAG>
<COMPENDTAG> <TAGSPEC><FTAG>Fld300 </FTAG></TAGSPEC></COMPENDTAG>
<COMPONENTINDEXES>
<!-- First index def -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=NONE>
<INDXNAME> TESTDATA/comp1index1.author…
</INDEXDEF>
</COMPONENTDEF>
</COMPONENTS>
Result Formatting (Display)
<DISPOPTIONS>KEEP_ENTITIES</DISPOPTIONS>
<DISPLAY> <FORMAT NAME="B" OID="1.2.840.10003.5.105" DEFAULT> <convert function="TAGSET-G"> <clusmap> <from> <tagspec> <ftag>DOCNO</ftag> </tagspec></from> <to> <tagspec> <ftag>28</ftag> </tagspec></to> <from> <tagspec> <ftag>#DOCID#</ftag> </tagspec></from> <to> <tagspec> <ftag>5</ftag> </tagspec></to> </clusmap> </convert></FORMAT></DISPLAY>
INEX Configuration Example
<!-- ******************************************************************* -->
<!-- ********************* Config for INEX evaluation ****************** -->
<!-- ******************************************************************* -->
<!-- This is the config file for the Cheshire II TREC interactive Database -->
<!-- new version uses proximity indexes... -->
<DBCONFIG><DBENV>/projects/metadata/cheshire/TREC/cheshire_index </DBENV>
<!-- --><!-- INEX TEST DATABASE FILEDEF --><!-- -->
<FILEDEF TYPE=XML><DEFAULTPATH> /projects/metadata/cheshire/INEX </DEFAULTPATH><!-- filetag is the "shorthand" name of the file --><FILETAG> INEX </FILETAG>
<!-- filename is the full path name of the main data directory --><FILENAME> inex-1.3/xml </FILENAME>
<CONTINCLUDE> inex-1.3/xml_main.cont </CONTINCLUDE>
<!-- fileDTD is the full path name of the file's DTD --><FILEDTD> inex-1.3/dtd/wrapper.dtd </FILEDTD><SGMLCAT> inex-1.3/dtd/catalog </SGMLCAT>
<!-- assocfil is the full path name of the file's Associator --><ASSOCFIL> inex-1.3/xml_main.assoc </ASSOCFIL>
<!-- history is the full path name of the file's history file --><HISTORY> inex.history </HISTORY>
<!-- The following are the index definitions for the file -->
<INDEXES>
<!-- ******************************************************************* --><!-- ************************* DOC NO. ********************************* --><!-- ******************************************************************* --><!-- The following provides document number access. --><INDEXDEF ACCESS=BTREE EXTRACT=EXACTKEY NORMAL=DO_NOT_NORMALIZE PRIMARYKEY=IGNORE><INDXNAME> indexes/docno.index </INDXNAME><INDXTAG> docno </INDXTAG>
<INDXMAP><USE> 12 </USE><struct> 1 </struct> </INDXMAP>
<INDXMAP><USE> 12 </USE><struct> 2 </struct> </INDXMAP>
<INDXMAP><USE> 12 </USE><struct> 6 </struct> </INDXMAP>
<INDXKEY><TAGSPEC><FTAG> doi </FTAG></TAGSPEC> </INDXKEY> </INDEXDEF>
<!-- ******************************************************************* -->
<!-- ********************** PERSONAL AUTHOR/BYLINE ********************* -->
<!-- ******************************************************************* -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=NONE><INDXNAME> indexes/pauthor.index</INDXNAME><INDXTAG> pauthor </INDXTAG>
<!-- The following INDXMAP items provide a mapping from the AUTHOR tag to --><!-- the appropriate Z39.50 BIB1 attribute numbers --><INDXMAP> <USE> 1 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP><INDXMAP> <USE> 1004 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<!-- The stoplist for this file --><STOPLIST> indexes/authorstoplist </STOPLIST>
<!-- The INDXKEY area contains the specifications of tags in the doc --><!-- that are to be extracted and indexed for this index --><INDXKEY><TAGSPEC><FTAG>fm</FTAG><S>au</S><S>snm</S><FTAG>fm</FTAG><S>au</S><S>fnm</S></TAGSPEC> </INDXKEY> </INDEXDEF>
<!-- ******************************************************************* -->
<!-- ************************* TITLE/HEADLINE ************************** -->
<!-- ******************************************************************* -->
<!-- The following provides keyword title access -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROX NORMAL=STEM><INDXNAME> indexes/title.index </INDXNAME><INDXTAG> title </INDXTAG>
<INDXMAP><USE> 4 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<INDXMAP> <USE> 5 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<INDXMAP> <USE> 6 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<STOPLIST> indexes/titlestoplist </STOPLIST><INDXKEY><TAGSPEC><FTAG>fm</FTAG><S>tig</S><S>atl</S></TAGSPEC> </INDXKEY> </INDEXDEF>
<!-- ******************************************************************* -->
<!-- ************************* TOPIC *********************************** -->
<!-- ******************************************************************* -->
<!-- The following is the primary index for probabilistic searches -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROX NORMAL=STEM><INDXNAME> indexes/topic.index </INDXNAME><INDXTAG> topic </INDXTAG>
<INDXMAP><USE> 29 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>…<INDXMAP><USE> 1017 </USE><RELAT> 102 </RELAT><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<STOPLIST> indexes/topicstoplist </STOPLIST><INDXKEY><TAGSPEC><FTAG>fm</FTAG><S>tig</S><S>atl</S><FTAG>abs</FTAG><FTAG>bdy</FTAG><FTAG>bibl</FTAG><S>bb</S><S>atl</S><FTAG>app</FTAG></TAGSPEC> </INDXKEY> </INDEXDEF>
<!-- ******************************************************************* -->
<!-- ************************** DATE *********************************** -->
<!-- ******************************************************************* -->
<INDEXDEF ACCESS=BTREE EXTRACT=DATE NORMAL=YEAR><INDXNAME> indexes/date.index</INDXNAME><INDXTAG> date</INDXTAG>
<!-- The following INDXMAP items provide a mapping from the AUTHOR tag to --><!-- the appropriate Z39.50 BIB1 attribute numbers --><INDXMAP><USE> 30 </USE><POSIT> 3 </posit> <struct> 6 </struct></INDXMAP><INDXMAP><USE> 30 </USE><POSIT> 3 </posit> <struct> 5 </struct></INDXMAP>
<INDXKEY><TAGSPEC>
<FTAG>hdr2</FTAG><s>yr</s></TAGSPEC></INDXKEY></INDEXDEF>
<!-- ******************************************************************* -->
<!-- ************************** JOURNAL ******************************* -->
<!-- ******************************************************************* -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=NONE><INDXNAME> indexes/journal.index</INDXNAME><INDXTAG> journal</INDXTAG>
<INDXMAP><USE> 1022 </USE><POSIT> 3 </posit> <struct> 6 </struct></INDXMAP><INDXMAP><USE> 1022 </USE><POSIT> 3 </posit> <struct> 5 </struct></INDXMAP>
<INDXKEY><TAGSPEC>
<FTAG>hdr1</FTAG><s>ti</s></TAGSPEC></INDXKEY></INDEXDEF>
<!-- ******************************************************************* -->
<!-- ************************* KEYWORDS********************************* -->
<!-- ******************************************************************* -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROXIMITY NORMAL=STEM><INDXNAME> indexes/keywords.index </INDXNAME><INDXTAG> kwd </INDXTAG>
<INDXMAP><USE> 3121 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<STOPLIST> indexes/topicstoplist </STOPLIST><INDXKEY><TAGSPEC><FTAG>kwd</FTAG></TAGSPEC> </INDXKEY> </INDEXDEF>
<!-- ******************************************************************* -->
<!-- ************************* ABSTRACT********************************* -->
<!-- ******************************************************************* -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROXIMITY NORMAL=STEM><INDXNAME> indexes/abstract.index </INDXNAME><INDXTAG> abstract </INDXTAG>
<INDXMAP><USE> 62 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<STOPLIST> indexes/topicstoplist </STOPLIST><INDXKEY><TAGSPEC><FTAG>abs</FTAG></TAGSPEC> </INDXKEY> </INDEXDEF>
<!-- The following index has contents of the SEQUENCE attribute of the -->
<!-- au (author) tag: either "first" or "additional" -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=NONE><INDXNAME> indexes/author_seq.index</INDXNAME><INDXTAG> author_seq </INDXTAG><INDXKEY><TAGSPEC><FTAG>fm</FTAG><S>au</S><ATTR>sequence</ATTR></TAGSPEC> </INDXKEY> </INDEXDEF>
<!-- ******************************************************************* -->
<!-- ************************* Bib author Forename ******************** -->
<!-- ******************************************************************* -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=NONE><INDXNAME> indexes/bib_author_fnm.index</INDXNAME><INDXTAG> bib_author_fnm </INDXTAG>
<INDXMAP> <USE> 1000 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<INDXKEY><TAGSPEC><FTAG>bb</FTAG><s>au</s><s>fnm</s> </TAGSPEC> </INDXKEY> </INDEXDEF>
<!-- ******************************************************************* -->
<!-- ************************* Bib author surname ******************** -->
<!-- ******************************************************************* -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=NONE><INDXNAME> indexes/bib_author_snm.index</INDXNAME><INDXTAG> bib_author_snm </INDXTAG>
<INDXMAP> <USE> 1000 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<INDXKEY><TAGSPEC><FTAG>bb</FTAG><s>au</s><s>snm</s> </TAGSPEC> </INDXKEY> </INDEXDEF>
<!-- ******************************************************************* -->
<!-- ************************* FIGURES ********************************* -->
<!-- ******************************************************************* -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=STEM><INDXNAME> indexes/fig.index </INDXNAME><INDXTAG> fig </INDXTAG>
<INDXMAP><USE> 3150 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<STOPLIST> indexes/topicstoplist </STOPLIST><INDXKEY><TAGSPEC><FTAG>fig</FTAG></TAGSPEC> </INDXKEY> </INDEXDEF>
<!-- ******************************************************************* -->
<!-- ************************* acknowledgements ************************ -->
<!-- ******************************************************************* -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=STEM><INDXNAME> indexes/ack.index </INDXNAME><INDXTAG> ack </INDXTAG>
<INDXMAP><USE> 3188 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<STOPLIST> indexes/topicstoplist </STOPLIST><INDXKEY><TAGSPEC><FTAG>ack</FTAG></TAGSPEC> </INDXKEY> </INDEXDEF>
<!-- ******************************************************************* -->
<!-- ************************* alltitles ******************************* -->
<!-- ******************************************************************* -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROXIMITY NORMAL=STEM><INDXNAME> indexes/alltitles.index </INDXNAME><INDXTAG> alltitles </INDXTAG>
<INDXMAP><USE> 3188 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<STOPLIST> indexes/titlestoplist </STOPLIST><INDXKEY><TAGSPEC><FTAG>atl</FTAG><FTAG>st</FTAG></TAGSPEC> </INDXKEY> </INDEXDEF>
<!-- ******************************************************************* -->
<!-- ************************* Affiliation ***************************** -->
<!-- ******************************************************************* -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=NONE><INDXNAME> indexes/affil.index </INDXNAME><INDXTAG> affil </INDXTAG>
<INDXMAP><USE> 3189 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<STOPLIST> indexes/titlestoplist </STOPLIST><INDXKEY><TAGSPEC><FTAG>fm</FTAG><s>aff</s></TAGSPEC> </INDXKEY> </INDEXDEF>
<!-- ******************************************************************* -->
<!-- ************************* FNO ********************************* -->
<!-- ******************************************************************* -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=none><INDXNAME> indexes/fno.index </INDXNAME><INDXTAG> fno </INDXTAG>
<INDXMAP><USE> 3192 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP><INDXKEY><TAGSPEC><FTAG>fno</FTAG></TAGSPEC> </INDXKEY> </INDEXDEF>
<!-- ******************************************************************* -->
<!-- ************************* FIGNO ********************************* -->
<!-- ******************************************************************* -->
<INDEXDEF ACCESS=BTREE EXTRACT=INTEGER NORMAL=NONE><INDXNAME> indexes/figno.index </INDXNAME><INDXTAG> figno </INDXTAG>
<INDXMAP><USE> 3193 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP><INDXKEY><TAGSPEC><FTAG>fig</FTAG><s>no</s></TAGSPEC> </INDXKEY> </INDEXDEF>
<!-- ******************************************************************* -->
<!-- ************************* topicshort ******************************** -->
<!-- ******************************************************************* -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROXIMITY NORMAL=STEM><INDXNAME> indexes/topicshort.index </INDXNAME><INDXTAG> topicshort </INDXTAG>
<INDXMAP><USE> 3192 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP><INDXKEY><TAGSPEC><FTAG>fm</FTAG><S>tig</S><S>atl</S><FTAG>abs</FTAG><FTAG>kwd</FTAG><FTAG>st</FTAG></TAGSPEC> </INDXKEY> </INDEXDEF>
</INDEXES>
<COMPONENTS>
<COMPONENTDEF><COMPONENTNAME> indexes/COMPONENT_SECTION </COMPONENTNAME><COMPONENTNORM>NONE</COMPONENTNORM><COMPSTARTTAG><TAGSPEC><FTAG>sec</FTAG></TAGSPEC></COMPSTARTTAG><COMPONENTINDEXES><!-- First index def -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROXIMITY NORMAL=NONE><INDXNAME> indexes/sec_title2.index</INDXNAME><INDXTAG> sec_title </INDXTAG><!-- the appropriate Z39.50 BIB1 attribute numbers --><INDXMAP> <USE> 38 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP><!-- The stoplist for this file --><STOPLIST> indexes/titlestoplist </STOPLIST>
<!-- The INDXKEY area contains the specifications of tags in the doc --><!-- that are to be extracted and indexed for this index --><INDXKEY><TAGSPEC><FTAG>sec</FTAG><s>st</s></TAGSPEC> </INDXKEY> </INDEXDEF>
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROXIMITY NORMAL=STEM><INDXNAME> indexes/sec_words.index</INDXNAME><INDXTAG> sec_words </INDXTAG>
<!-- the appropriate Z39.50 BIB1 attribute numbers --><INDXMAP> <USE> 39 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<!-- The stoplist for this file --><STOPLIST> indexes/topicstoplist </STOPLIST>
<!-- The INDXKEY area contains the specifications of tags in the doc --><!-- that are to be extracted and indexed for this index --><INDXKEY><TAGSPEC><FTAG>sec</FTAG></TAGSPEC> </INDXKEY> </INDEXDEF>
</COMPONENTINDEXES></COMPONENTDEF>
<COMPONENTDEF><COMPONENTNAME> indexes/COMPONENT_BIB </COMPONENTNAME><COMPONENTNORM>NONE</COMPONENTNORM><COMPSTARTTAG><TAGSPEC><FTAG>bm</FTAG><S>bib</S><s>bibl</s><s>bb</s></TAGSPEC></COMPSTARTTAG><!-- /* no end tag */ --><COMPONENTINDEXES><!-- First index def -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=NONE><INDXNAME> indexes/bib_author.index</INDXNAME><INDXTAG> bib_author </INDXTAG>
<!-- The following INDXMAP items provide a mapping from the AUTHOR tag to --><!-- the appropriate Z39.50 BIB1 attribute numbers --><INDXMAP> <USE> 1000 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<!-- The INDXKEY area contains the specifications of tags in the doc --><!-- that are to be extracted and indexed for this index --><INDXKEY><TAGSPEC><FTAG>au</FTAG> </TAGSPEC> </INDXKEY> </INDEXDEF>
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROXIMITY NORMAL=NONE><INDXNAME> indexes/bib_title.index</INDXNAME><INDXTAG> bib_title </INDXTAG>
<!-- The following INDXMAP items provide a mapping from the AUTHOR tag to --><!-- the appropriate Z39.50 BIB1 attribute numbers --><INDXMAP> <USE> 33 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<!-- The INDXKEY area contains the specifications of tags in the doc --><!-- that are to be extracted and indexed for this index --><INDXKEY><TAGSPEC><FTAG>atl</FTAG> </TAGSPEC> </INDXKEY> </INDEXDEF>
<INDEXDEF ACCESS=BTREE EXTRACT=DATE NORMAL=YEAR><INDXNAME> indexes/bib_date.index</INDXNAME><INDXTAG> bib_date </INDXTAG>
<!-- The following INDXMAP items provide a mapping from the AUTHOR tag to --><!-- the appropriate Z39.50 BIB1 attribute numbers --><INDXMAP> <USE> 31 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<!-- The INDXKEY area contains the specifications of tags in the doc --><!-- that are to be extracted and indexed for this index --><INDXKEY><TAGSPEC><FTAG>pdt</FTAG><s>yr</s> </TAGSPEC> </INDXKEY> </INDEXDEF>
</COMPONENTINDEXES></COMPONENTDEF>
<COMPONENTDEF><COMPONENTNAME> indexes/COMPONENT_PARAS </COMPONENTNAME><COMPONENTNORM>NONE</COMPONENTNORM><COMPSTARTTAG><TAGSPEC><FTAG>^ilrj$|^ip1$|^ip2$|^ip3$|^ip4$|^ip5$|^item-none$|^p$|^p1$|^p2$|^p3$|^tmath$|^tf$</FTAG></TAGSPEC></COMPSTARTTAG><COMPONENTINDEXES><!-- First index def -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROXIMITY NORMAL=STEM><INDXNAME> indexes/para_words.index</INDXNAME><INDXTAG> para_words </INDXTAG>
<!-- the appropriate Z39.50 BIB1 attribute numbers --><INDXMAP> <USE> 39 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<!-- The stoplist for this file --><STOPLIST> indexes/topicstoplist </STOPLIST>
<!-- The INDXKEY area contains the specifications of tags in the doc --><!-- that are to be extracted and indexed for this index --><INDXKEY><TAGSPEC><FTAG>.*</FTAG></TAGSPEC> </INDXKEY> </INDEXDEF>
</COMPONENTINDEXES></COMPONENTDEF>
<COMPONENTDEF><COMPONENTNAME> indexes/COMPONENT_FIG </COMPONENTNAME><COMPONENTNORM>NONE</COMPONENTNORM><COMPSTARTTAG><TAGSPEC><FTAG>fig</FTAG></TAGSPEC></COMPSTARTTAG><COMPONENTINDEXES><!-- First index def -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=NONE><INDXNAME> indexes/fig_caption.index</INDXNAME><INDXTAG> fig_caption </INDXTAG>
<!-- the appropriate Z39.50 BIB1 attribute numbers --><INDXMAP> <USE> 38 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<!-- The stoplist for this file --><STOPLIST> indexes/titlestoplist </STOPLIST>
<!-- The INDXKEY area contains the specifications of tags in the doc --><!-- that are to be extracted and indexed for this index --><INDXKEY><TAGSPEC><FTAG>fgc</FTAG></TAGSPEC> </INDXKEY> </INDEXDEF></COMPONENTINDEXES></COMPONENTDEF>
<COMPONENTDEF><COMPONENTNAME> indexes/COMPONENT_VITAE </COMPONENTNAME><COMPONENTNORM>NONE</COMPONENTNORM><COMPSTARTTAG><TAGSPEC><FTAG>vt</FTAG></TAGSPEC></COMPSTARTTAG><COMPONENTINDEXES><!-- First index def -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROXIMITY NORMAL=NONE><INDXNAME> indexes/vitae_words.index</INDXNAME><INDXTAG> vt_vitae </INDXTAG>
<!-- the appropriate Z39.50 BIB1 attribute numbers --><INDXMAP> <USE> 38 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<!-- The stoplist for this file --><STOPLIST> indexes/titlestoplist </STOPLIST>
<!-- The INDXKEY area contains the specifications of tags in the doc --><!-- that are to be extracted and indexed for this index --><INDXKEY><TAGSPEC><FTAG>vt</FTAG></TAGSPEC> </INDXKEY> </INDEXDEF></COMPONENTINDEXES></COMPONENTDEF></COMPONENTS>
<DISPOPTIONS>KEEP_ENTITIES</DISPOPTIONS>
<DISPLAY> <DISPLAYDEF NAME="B" OID="1.2.840.10003.5.105" DEFAULT> <convert function="MIXED"> <clusmap> <from> <tagspec> <ftag>doi</ftag> </tagspec></from> <to> <tagspec> <ftag>28</ftag> </tagspec></to> <from> <tagspec> <ftag>#DOCID#</ftag> </tagspec></from> <to> <tagspec> <ftag>5</ftag> </tagspec></to> <from> <tagspec> <ftag>#DBNAME#</ftag> </tagspec></from>…
<DISPLAYDEF name="XML_ELEMENT_" OID="1.2.840.10003.5.109.10"><convert function="XML_ELEMENT"> <clusmap> <from> <tagspec> <ftag>#FILENAME#</ftag> </tagspec></from> <to> <tagspec> <ftag>FILENAME</ftag> </tagspec></to> <from> <tagspec> <ftag>#RANK#</ftag> </tagspec></from> <to> <tagspec> <ftag>RANK </ftag> </tagspec></to> …
…<from> <tagspec> <ftag>#RAWSCORE#</ftag> </tagspec></from> <to> <tagspec> <ftag>RAWSCORE </ftag> </tagspec></to> <from> <tagspec> <ftag> SUBST_ELEMENT </ftag> </tagspec></from> <to> <tagspec> <ftag> SUBST_ELEMENT </ftag> </tagspec> </to> </clusmap></convert></DISPLAYDEF>
</DISPLAY></FILEDEF>
</DBCONFIG>
XML Schemas and Element Retrieval
XML Schema Support
• XML Schemas or DTDs can be used to define the data contents
• Tested with a wide variety of schemas, including METS (with various supporting schemas)
XML Element Extraction
• A new search “ElementSetName” is XML_ELEMENT_
• Any XPath, element name, or regular expression can be included following the final underscore when submitting a present request (note that only a subset of full XPath is available)
• The matching elements are extracted from the records matching the search and delivered in a simple format.
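To make the mechanism concrete, here is a minimal sketch (an illustration, not the server's code) of how an element set name such as `XML_ELEMENT_Fld245` could drive extraction. It uses Python's `xml.etree.ElementTree`, which, like the facility described above, supports only a limited XPath subset. The sample record and the `extract_elements` function are hypothetical, and the sketch does not compute the positional XPATH attribute that the server reports for each item.

```python
import xml.etree.ElementTree as ET

PREFIX = "XML_ELEMENT_"

# Hypothetical record, modelled on a MARC-in-SGML structure.
SAMPLE_RECORD = (
    "<USMARC><VarFlds><VarDFlds><Titles>"
    '<Fld245 AddEnty="No"><a>Singularities</a></Fld245>'
    "</Titles></VarDFlds></VarFlds></USMARC>"
)


def extract_elements(record_xml, element_set_name, docid):
    """Wrap every element matching the name after the prefix in RESULT_DATA/ITEM."""
    if not element_set_name.startswith(PREFIX):
        raise ValueError("not an XML_ELEMENT_ element set name")
    target = element_set_name[len(PREFIX):]
    root = ET.fromstring(record_xml)
    result = ET.Element("RESULT_DATA", DOCID=str(docid))
    # The record root itself may match; descendants are found with ".//target".
    matches = ([root] if root.tag == target else []) + root.findall(f".//{target}")
    for match in matches:
        item = ET.SubElement(result, "ITEM")
        item.append(match)
    return ET.tostring(result, encoding="unicode")


extracted = extract_elements(SAMPLE_RECORD, "XML_ELEMENT_Fld245", 1)
```

Because the text after the final underscore is passed straight to `findall`, a relative path such as `Titles/Fld245` works in this sketch as well as a bare element name.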
XML Extraction
% zselect sherlock
372 {Connection with SHERLOCK (sherlock.berkeley.edu) database 'bibfile' at port 2100 is open as connection #372}
% zfind topic mathematics
{OK {Status 1} {Hits 26} {Received 0} {Set Default} {RecordSyntax UNKNOWN}}
% zset recsyntax XML
% zset elementset XML_ELEMENT_Fld245
% zdisplay
{OK {Status 0} {Received 10} {Position 1} {Set Default} {NextPosition 11} {RecordSyntax XML 1.2.840.10003.5.109.10}} {<RESULT_DATA DOCID="1"><ITEM XPATH="/USMARC[1]/VarFlds[1]/VarDFlds[1]/Titles[1]/Fld245[1]"><Fld245 AddEnty="No" NFChars="0"><a>Singularitâes áa Cargáese</a></Fld245></ITEM><RESULT_DATA> … etc…
Database Storage
• All data stored as SGML/XML flat text files plus optional linked full-text (non-XML) files
• File format is defined through an SGML/XML DTD (also a flat text file) or Schema
• “Associator” files provide indexed direct access to each record in SGML/XML files.
– Contain offset and record length for each “record”
– Associators can be built to index any conformant document in a directory sub-tree
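The offset-and-length idea can be sketched in a few lines of Python. This is only an illustration of the technique: the fixed-width binary layout (`<QQ`), the end-tag scan, and the function names are assumptions for the example and say nothing about the actual on-disk format of Cheshire's associator files.

```python
import io
import struct

# One entry per record: (byte offset, byte length). The layout is an assumption.
ENTRY = struct.Struct("<QQ")


def build_associator(data, end_tag):
    """Scan a bytes buffer and return packed (offset, length) entries, one per record."""
    entries = bytearray()
    pos = 0
    while True:
        end = data.find(end_tag, pos)
        if end == -1:
            break
        end += len(end_tag)
        # Skip whitespace separating this record from the previous one.
        start = pos
        while start < end and data[start:start + 1].isspace():
            start += 1
        entries += ENTRY.pack(start, end - start)
        pos = end
    return bytes(entries)


def fetch_record(data_file, associator, recno):
    """Seek directly to record number recno (0-based) using the associator."""
    offset, length = ENTRY.unpack_from(associator, recno * ENTRY.size)
    data_file.seek(offset)
    return data_file.read(length)


DATA = b"<rec>one</rec>\n<rec>two two</rec>\n"
ASSOC = build_associator(DATA, b"</rec>")
second = fetch_record(io.BytesIO(DATA), ASSOC, 1)
```

Fixed-width entries mean record *n* is found with a single multiplication and one seek, so the flat text file never has to be scanned at retrieval time.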
INEX CO Runs
• Three official, one later run - all Title-only
– Fusion - Combines Okapi and LR using the MERGE_CMBZ operator
– NewParms (LR) - Using only LR with the new parameters
– Feedback - An attempt at blind relevance feedback
– PostFusion - Fusion of the new LR coefficients and Okapi
Query Generation - CO
• # 162 TITLE = Text and Index Compression Algorithms
• QUERY: (topicshort @+ {Text and Index Compression Algorithms}) !MERGE_CMBZ (alltitles @+ {Text and Index Compression Algorithms}) !MERGE_CMBZ (topicshort @ {Text and Index Compression Algorithms}) !MERGE_CMBZ (alltitles @ {Text and Index Compression Algorithms})
• @+ is Okapi, @ is LR
• !MERGE_CMBZ is a normalized score summation and enhancement
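The query pattern above is mechanical enough to sketch in Python, along with one plausible reading of the fusion step. The slide only describes MERGE_CMBZ as "a normalized score summation and enhancement"; the `merge_cmbz` function below assumes min-max normalization and a CombMNZ-style boost for items found by both searches, which may differ from the formula Cheshire actually uses. Both function names are hypothetical.

```python
def co_query(title, indexes=("topicshort", "alltitles"), operators=("@+", "@")):
    """Build one clause per (operator, index) pair, joined by !MERGE_CMBZ."""
    clauses = [f"({index} {op} {{{title}}})" for op in operators for index in indexes]
    return " !MERGE_CMBZ ".join(clauses)


def merge_cmbz(run_a, run_b):
    """Assumed fusion: min-max normalize each run, sum, boost items present in both."""
    def normalize(run):
        if not run:
            return {}
        lo, hi = min(run.values()), max(run.values())
        span = hi - lo
        return {doc: (score - lo) / span if span else 1.0 for doc, score in run.items()}

    a, b = normalize(run_a), normalize(run_b)
    fused = {}
    for doc in a.keys() | b.keys():
        hits = (doc in a) + (doc in b)  # number of runs that retrieved this item
        fused[doc] = (a.get(doc, 0.0) + b.get(doc, 0.0)) * hits
    # Highest fused score first; ties broken by identifier for a stable order.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))


query = co_query("Text and Index Compression Algorithms")
okapi_run = {"d1": 12.0, "d2": 8.0, "d3": 4.0}   # made-up Okapi scores
lr_run = {"d2": 0.9, "d4": 0.1}                  # made-up LR probabilities
ranking = merge_cmbz(okapi_run, lr_run)
```

Normalizing first matters because Okapi scores and LR probabilities live on different scales; without it the Okapi run would dominate the sum.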
INEX CO Runs
Generalized Strict
Avg Prec: FUSION = 0.0642, NEWPARMS = 0.0582, FDBK = 0.0415, POSTFUS = 0.0690
Avg Prec: FUSION = 0.0923, NEWPARMS = 0.0853, FDBK = 0.0390, POSTFUS = 0.0952
INEX VCAS Runs
• Two official runs
– FUSVCAS - Element fusion using LR and various operators for path restriction
– NEWVCAS - Using the new LR coefficients for each appropriate index and various operators for path restriction
Query Generation - VCAS
• #66 TITLE = //article[about(., intelligent transport systems)]//sec[about(., on-board route planning navigation system for automobiles)]
• Submitted query = ((topic @ {intelligent transport systems})) !RESTRICT_FROM ((sec_words @ {on-board route planning navigation system for automobiles}))
• Target elements: sec|ss1|ss2|ss3
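The translation from the topic title to the submitted query can be sketched as follows. This is an illustrative reconstruction, not the script used for the runs: the regular expression handles only the two-step `about()` pattern shown above, and the element-to-index table (`article` to `topic`, `sec` to `sec_words`) is inferred from this single example.

```python
import re

# Matches one path step of the form //element[about(., terms)].
ABOUT_STEP = re.compile(r"//(\w+)\[about\(\.,\s*([^)]*)\)\]")

# Assumed mapping from element names to index names, taken from the example.
INDEX_FOR_ELEMENT = {"article": "topic", "sec": "sec_words"}


def vcas_query(nexi_title):
    """Turn a two-step about() title into a !RESTRICT_FROM query string."""
    steps = ABOUT_STEP.findall(nexi_title)
    if len(steps) != 2:
        raise ValueError("this sketch handles exactly two about() steps")
    (ctx_el, ctx_terms), (tgt_el, tgt_terms) = steps
    context = f"(({INDEX_FOR_ELEMENT[ctx_el]} @ {{{ctx_terms.strip()}}}))"
    target = f"(({INDEX_FOR_ELEMENT[tgt_el]} @ {{{tgt_terms.strip()}}}))"
    return f"{context} !RESTRICT_FROM {target}"


TITLE_66 = (
    "//article[about(., intelligent transport systems)]"
    "//sec[about(., on-board route planning navigation system for automobiles)]"
)
submitted = vcas_query(TITLE_66)
```

The document-level clause comes first and the component-level clause second, matching the order of the submitted query shown on the slide.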
VCAS Results
Generalized Strict
Avg Prec: FUSVCAS = 0.0321, NEWVCAS = 0.0270
Avg Prec: FUSVCAS = 0.0601, NEWVCAS = 0.0569
March 2, 2004 Ray R. Larson
Heterogeneous Track
• Approach using Cheshire’s Virtual Database options
  – Primarily a version of distributed IR
  – Each collection indexed separately
  – Search via Z39.50 distributed queries
  – Z39.50 Attribute mapping used to map query indexes to appropriate elements in a given collection
  – Only LR used and collection results merged using probability of relevance for each collection result
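Merging collection results by probability of relevance can be pictured as interleaving several ranked lists on one shared score. The sketch below assumes each collection returns `(doc_id, probability)` pairs; the collection names and the `limit` parameter are hypothetical, and no Z39.50 traffic is modeled.

```python
import heapq


def merge_by_probability(results_by_collection, limit=10):
    """Interleave per-collection hits into one list ordered by probability."""
    streams = []
    for name, hits in results_by_collection.items():
        # Each stream must already be sorted the same way heapq.merge orders it.
        ordered = sorted(hits, key=lambda h: -h[1])
        streams.append([(prob, name, doc) for doc, prob in ordered])
    merged = heapq.merge(*streams, key=lambda t: -t[0])
    return [(name, doc, prob) for prob, name, doc in merged][:limit]


# Illustrative data: two separately indexed collections.
results = {
    "collection_a": [("a1", 0.8), ("a2", 0.3)],
    "collection_b": [("b1", 0.6), ("b2", 0.5)],
}
merged_hits = merge_by_probability(results)
```

This only works as a fair merge if the probability estimates are comparable across collections, which is the reason the slide gives for using LR alone.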
March 2, 2004 Ray R. Larson
Heterogeneous Track Issues
• Very large “Documents”
  – Our approach was to segment
• Reporting XPath after segmenting large documents
March 2, 2004 Ray R. Larson
Database Storage
[Diagram labels: Config File, SGML/XML File, DTD File, Associator File (x2), History File, Page Data File, Cluster File, Index File (x3), Postings File, Prox Data File, Remote RDBMS]
March 2, 2004 Ray R. Larson
Client/Server Architecture
• Server Supports:
  – Database storage
  – Indexing
  – Z39.50 access to local data
  – Boolean and Probabilistic Searching
  – Relevance Feedback
  – External SQL database support
• Client Supports:
  – Programmable (Tcl/Tk – Python soon) Graphical User Interface
  – Z39.50 access to remote servers
  – SGML & MARC formatting
• Combined Client/Server CGI scripting via WebCheshire
March 2, 2004 Ray R. Larson
Z39.50 Overview
[Diagram labels: UI (x2), Map Query (x3), Map Results (x3), Internet, Search Engine]
March 2, 2004 Ray R. Larson
Two Protocols: HTTP & Z39.50
SYSTEM BEHAVIOR                     HTTP     Z39.50
State maintenance                   client   server
Sessions                            no       yes
Policies adaptable to link speed    no       yes
Synch/asynch                        synch    both
Fixed/negotiated protocol           fixed    neg
Fixed/negotiated doc formats        none     neg
Standardized Metadata               no       yes
March 2, 2004 Ray R. Larson
Server Z39.50 Support
• Locally developed Z39.50 Library
• Extended version 3 support
  – Supports version 3 attributes in BIB-1 including “stem”, “relevance”, etc. Also adding support for “type 102” ranked queries (version 4)
• Can provide MARC, SUTRS and SGML records, with support for Explain and GRS-1 conversion of any SGML records
March 2, 2004 Ray R. Larson
Distributed Search
March 2, 2004 Ray R. Larson
The Problem
• The Digital Library vision -- Access to everyone for “all human knowledge”
• Lyman and Varian’s estimates of the “Dark Web”
• Hundreds or Thousands of servers with databases ranging widely in content, topic, format
  – Broadcast search is expensive in terms of bandwidth and in processing too many irrelevant results
  – How to select the “best” ones to search?
    • Which resource to search first?
    • Which to search next if more is wanted?
  – Topical/domain constraints on the search selections
  – Variable contents of database (metadata only, full text, multimedia…)
March 2, 2004 Ray R. Larson
Distributed Search Tasks
• Resource Description
  – How to collect metadata about digital libraries and their collections or databases
• Resource Selection
  – How to select relevant digital library collections or databases from a large number of databases
• Distributed Search
  – How to perform parallel or sequential searching over the selected digital library databases
• Data Fusion
  – How to merge query results from different digital libraries with their different search engines, differing record structures, etc.
March 2, 2004 Ray R. Larson
An Approach for Distributed Resource Discovery
• Distributed resource representation and discovery
  – New approach to building resource descriptions based on Z39.50
  – Instead of using broadcast search across resources we are using two Z39.50 Services
    • Identification of database metadata using Z39.50 Explain
    • Extraction of distributed indexes using Z39.50 SCAN
• Evaluation
  – How efficiently can we build distributed indexes?
  – How effectively can we choose databases using the index?
  – How effective is merging search results from multiple sources?
  – Can we build hierarchies of servers (general/meta-topical/individual)?
March 2, 2004 Ray R. Larson
Z39.50 Explain
• Explain supports searches for
  – Server-Level metadata
    • Server Name
    • IP Addresses
    • Ports
  – Database-Level metadata
    • Database name
    • Search attributes (indexes and combinations)
  – Support metadata (record syntaxes, etc)
March 2, 2004 Ray R. Larson
Z39.50 SCAN
• Originally intended to support Browsing
• Query for
  – Database
  – Attributes plus Term (i.e., index and start point)
  – Step Size
  – Number of terms to retrieve
  – Position in Response set
• Results
  – Number of terms returned
  – List of Terms and their frequency in the database (for the given attribute combination)
March 2, 2004 Ray R. Larson
Z39.50 SCAN Results
% zscan title cat 1 20 1
{SCAN {Status 0}{Terms 20}{StepSize 1}{Position 1}}{cat 27}{cat-fight 1}{catalan 19}{catalogu 37}{catalonia 8}{catalyt 2}{catania 1}{cataract 1}{catch 173}{catch-all 3}{catch-up 2} …
zscan topic cat 1 20 1
{SCAN {Status 0}{Terms 20}{StepSize 1}{Position 1}}{cat 706}{cat-and-mouse 19}{cat-burglar 1}{cat-carrying 1}{cat-egory 1}{cat-fight 1}{cat-gut 1}{cat-litter 1}{cat-lovers 2}{cat-pee 1}{cat-run 1}{cat-scanners 1} …
Syntax: zscan indexname1 term stepsize number_of_terms pref_pos
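The zscan output above is a brace-delimited list: a header group followed by one `{term frequency}` group per term. A minimal sketch of turning that text into Python values follows. The helper names are hypothetical, and the parser assumes terms contain no braces, which holds for the sample shown but is not guaranteed in general.

```python
def split_top_level(text):
    """Return the contents of each top-level {...} group in `text`."""
    groups, depth, start = [], 0, None
    for i, ch in enumerate(text):
        if ch == "{":
            if depth == 0:
                start = i + 1
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                groups.append(text[start:i])
    return groups


def parse_scan(text):
    """Split a scan response into a header dict and (term, frequency) pairs."""
    groups = split_top_level(text)
    header_text, entries = groups[0], groups[1:]
    # Header looks like: SCAN {Status 0}{Terms 20}{StepSize 1}{Position 1}
    header = dict(item.split(" ", 1) for item in split_top_level(header_text))
    terms = []
    for entry in entries:
        term, freq = entry.rsplit(" ", 1)
        terms.append((term, int(freq)))
    return header, terms


sample = ("{SCAN {Status 0}{Terms 20}{StepSize 1}{Position 1}}"
          "{cat 27}{cat-fight 1}{catalan 19}")
header, terms = parse_scan(sample)
```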
March 2, 2004 Ray R. Larson
Resource Index Creation
• For all servers, or a topical subset…
  – Get Explain information
  – For each index
    • Use SCAN to extract terms and frequency
    • Add term + freq + source index + database metadata to the XML “Collection Document” for the resource
  – Planned extensions:
    • Post-Process indexes (especially Geo Names, etc) for special types of data
      – e.g. create “geographical coverage” indexes
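The harvesting loop above ends by writing terms, frequencies, source index and database metadata into an XML collection document. The sketch below shows one way such a document could be assembled. The element and attribute names (`collection`, `database`, `index`, `term`, `freq`) are invented for illustration, since the slides do not give the actual schema, and the host name is a placeholder.

```python
import xml.etree.ElementTree as ET


def build_collection_document(db_name, host, port, scans):
    """Build an XML string from {index_name: [(term, freq), ...]} scan data."""
    root = ET.Element("collection")
    # Database-level metadata, as would be obtained from Explain.
    ET.SubElement(root, "database", name=db_name, host=host, port=str(port))
    for index_name, terms in scans.items():
        index_el = ET.SubElement(root, "index", name=index_name)
        for term, freq in terms:
            term_el = ET.SubElement(index_el, "term", freq=str(freq))
            term_el.text = term
    return ET.tostring(root, encoding="unicode")


# Term data taken from the SCAN example; server details are placeholders.
scans = {
    "title": [("cat", 27), ("catalan", 19)],
    "topic": [("cat", 706)],
}
doc_xml = build_collection_document("example_db", "host.example.org", 210, scans)
```

The resulting document can then be indexed like any other XML record, which is what lets the collection descriptions be searched with the ordinary ranking machinery.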
March 2, 2004 Ray R. Larson
MetaSearch Approach
[Diagram labels: MetaSearch Server, Map Explain And Scan Queries, Internet, Map Query, Map Results, Distributed Index, Search Engine (x3), DB 1, DB 2, DB 3, DB 4, DB 5, DB 6]
March 2, 2004 Ray R. Larson
Known Issues and Problems
• Not all Z39.50 Servers support SCAN or Explain
• Solutions that appear to work well:
  – Probing for attributes instead of explain (e.g. DC attributes or analogs)
  – We also support OAI and can extract OAI metadata for servers that support OAI
  – Query-based sampling (Callan)
• Collection Documents are static and need to be replaced when the associated collection changes
March 2, 2004 Ray R. Larson
Evaluation
• Test Environment
  – TREC Tipster data (approx. 3 GB)
  – Partitioned into 236 smaller collections based on source and date by month (no DOE)
    • High size variability (from 1 to thousands of records)
    • Same database as used in other distributed search studies by J. French and J. Callan among others
  – Used TREC topics 51-150 for evaluation (these are the only topics with relevance judgements for all 3 TIPSTER disks)
March 2, 2004 Ray R. Larson
Harvesting Efficiency
• Tested using the databases on the previous slide + the full FT database (210,158 records ~ 600 Mb)
• Average of 23.07 seconds per database to SCAN each database (3.4 indexes on average) and create a collection representative, over the network
• Average of 14.07 seconds
• Also tested larger databases (e.g. the TREC FT database, ~600 Mb with 7 indexes, was harvested in 131 seconds)
March 2, 2004 Ray R. Larson
Our Collection Ranking Approach
• We attempt to estimate the probability of relevance for a given collection with respect to a query using the Logistic Regression method developed at Berkeley (W. Cooper, F. Gey, D. Dabney, A. Chen) with a new algorithm for weight calculation at retrieval time
• Estimates from multiple extracted indexes are combined to provide an overall ranking score for a given resource (i.e., fusion of multiple query results)
March 2, 2004 Ray R. Larson
Probabilistic Retrieval: Logistic Regression
Probability of relevance for a given index is based on logistic regression from a sample set of documents to determine the values of the coefficients (TREC). At retrieval the probability estimate is obtained by:
\[ P(R \mid Q, C) = c_0 + \sum_{i=1}^{6} c_i X_i \]
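The retrieval-time estimate is a constant plus a weighted sum of six attribute values. A small sketch of that calculation follows. The slides do not list the fitted coefficients, so the values below are placeholders. The second function assumes the linear score is read as log odds, as is usual for logistic regression, and converts it to a probability; that reading is an assumption, not something the slide states.

```python
import math


def linear_estimate(coefficients, attributes):
    """Compute c_0 + sum(c_i * X_i) for coefficients [c_0, c_1, ..., c_n]."""
    c0, rest = coefficients[0], coefficients[1:]
    if len(rest) != len(attributes):
        raise ValueError("need exactly one coefficient per attribute, plus c_0")
    return c0 + sum(c * x for c, x in zip(rest, attributes))


def log_odds_to_probability(log_odds):
    """Logistic function: map log odds onto the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-log_odds))


# Placeholder numbers only, not the trained coefficients.
coeffs = [1.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0]
xs = [2.0, 0.0, 0.0, 0.0, 0.0, 0.0]
score = linear_estimate(coeffs, xs)
probability = log_odds_to_probability(score)
```

Because the logistic function is monotonic, ranking by the linear score and ranking by the converted probability give the same order.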
March 2, 2004 Ray R. Larson
Probabilistic Retrieval: Logistic Regression attributes
\[ X_1 = \frac{1}{M} \sum_{j=1}^{M} \log QAF_{t_j} \]  Average Absolute Query Frequency
\[ X_2 = \sqrt{QL} \]  Query Length
\[ X_3 = \frac{1}{M} \sum_{j=1}^{M} \log CAF_{t_j} \]  Average Absolute Collection Frequency
\[ X_4 = \sqrt{\frac{CL}{10}} \]  Collection size estimate
\[ X_5 = \frac{1}{M} \sum_{j=1}^{M} \log ICF_{t_j} \]  Average Inverse Collection Frequency
\[ ICF = \frac{N - n_{t_j}}{N} \]  Inverse Document Frequency (N = Number of collections)
\[ X_6 = \log M \]  M = Number of Terms in common between query and document
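The six attributes can be computed from per-term statistics for the M terms shared by the query and a collection document. The sketch below is one reading of the slide, under several assumptions: natural logarithms, square roots on the query length and on collection length divided by 10 (the radical signs do not survive in the slide text), and ICF taken as (N - n) / N. Parameter names are hypothetical.

```python
import math


def lr_attributes(query_freqs, coll_freqs, colls_with_term,
                  query_len, coll_len, n_collections):
    """Return [X1..X6] for the M matching terms (lists are parallel, length M).

    Assumes every frequency is >= 1 and every term occurs in fewer than
    `n_collections` collections, so no logarithm receives zero.
    """
    m = len(query_freqs)
    x1 = sum(math.log(q) for q in query_freqs) / m   # avg. absolute query freq.
    x2 = math.sqrt(query_len)                        # query length
    x3 = sum(math.log(c) for c in coll_freqs) / m    # avg. absolute coll. freq.
    x4 = math.sqrt(coll_len / 10)                    # collection size estimate
    icf = [(n_collections - n) / n_collections for n in colls_with_term]
    x5 = sum(math.log(v) for v in icf) / m           # avg. inverse coll. freq.
    x6 = math.log(m)                                 # log of matching terms
    return [x1, x2, x3, x4, x5, x6]


# Illustrative statistics for two matching terms across 236 collections.
xs_example = lr_attributes(
    query_freqs=[1, 1], coll_freqs=[10, 1000], colls_with_term=[118, 118],
    query_len=4, coll_len=1000, n_collections=236,
)
```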
March 2, 2004 Ray R. Larson
Evaluation
• Effectiveness
  – Tested using the collection representatives described above (as harvested from over the network) and the TIPSTER relevance judgements
  – Testing by comparing our approach to known algorithms for ranking collections
  – Results were measured against reported results for the Ideal and CORI algorithms and against the optimal “Relevance Based Ranking” (MAX)
  – Recall analog (How many of the Rel docs occurred in the top n databases – averaged)
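The recall analog asks what share of a query's relevant documents sit in the top n ranked databases, averaged over queries. A minimal sketch of that measure follows, assuming the share is taken against all relevant documents for the query. The slide does not spell out the normalization, so this may differ in detail from the measure used in the reported results, and the database names and counts are made up.

```python
def recall_at_n(ranking, rel_counts, n):
    """Fraction of a query's relevant docs held by the top-n ranked databases."""
    total = sum(rel_counts.values())
    if total == 0:
        return 0.0
    found = sum(rel_counts.get(db, 0) for db in ranking[:n])
    return found / total


def mean_recall_at_n(runs, n):
    """Average recall_at_n over (ranking, rel_counts) pairs, one per query."""
    values = [recall_at_n(ranking, counts, n) for ranking, counts in runs]
    return sum(values) / len(values)


# Two hypothetical queries with per-database relevant-document counts.
runs = [
    (["db3", "db1", "db2"], {"db1": 6, "db2": 2, "db3": 2}),
    (["db1", "db2"], {"db1": 1, "db2": 3}),
]
average_at_1 = mean_recall_at_n(runs, 1)
```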
March 2, 2004 Ray R. Larson
Titles only (short query)
[Figure: plot for title-only queries; axis label R̂]
March 2, 2004 Ray R. Larson
Future
• Logically Clustering servers by topic
• Meta-Meta Servers (treating the MetaSearch database as just another database)
March 2, 2004 Ray R. Larson
Distributed Metadata Servers
[Diagram labels: Replicated servers, Meta-Topical Servers, General Servers, Database Servers]
March 2, 2004 Ray R. Larson
Geographic Operators and Search Ranking
March 2, 2004 Ray R. Larson
The GEO Operations
• Operators established for the GEO Z39.50 profile
• Implemented using special operations on indexes
• Indexing allows extraction of geographic coordinates and dates from SGML/XML data in a variety of formats
• Normalized internal representation in indexes
• Search using geographic and time elements as primary or limiting search elements
March 2, 2004 Ray R. Larson
The GEO Operations
• X-based interfaces permit (simple) map drawing and search
• Interface to MapServer for web-based map searching
March 2, 2004 Ray R. Larson
GEO Geographic operators
Operator   Name               Description
>=<        Overlap            Search region and data overlap
>#<        Fully Enclosed     Data fully enclosed in search region
<#>        Encloses           Data fully encloses search region
<>#        Fully Outside      Data outside of search region
++         Near               Data is near search region
:<:        Before             Data date is before search date
:<=:       Before or During   Data date is before or during search date
:>=:       During or After    Data date is during or after search date
:>:        After              Data date is after search date
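The four region operators can be pictured as tests on bounding boxes. The sketch below assumes regions are simple latitude/longitude rectangles that do not cross the date line. It is an illustration of the operator semantics listed above, not the index-level implementation, and the `Box` type and function names are invented for the example. "Near" is omitted because the slides give no distance threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    west: float
    south: float
    east: float
    north: float


def overlaps(search, data):
    """>=<  : the search region and the data region share some area or edge."""
    return (search.west <= data.east and data.west <= search.east and
            search.south <= data.north and data.south <= search.north)


def fully_enclosed(search, data):
    """>#<  : the data region lies entirely inside the search region."""
    return (search.west <= data.west and data.east <= search.east and
            search.south <= data.south and data.north <= search.north)


def encloses(search, data):
    """<#>  : the data region entirely contains the search region."""
    return fully_enclosed(data, search)


def fully_outside(search, data):
    """<>#  : the data region does not touch the search region at all."""
    return not overlaps(search, data)


# Illustrative coordinates only.
search_region = Box(-125.0, 32.0, -114.0, 42.0)
inner = Box(-123.0, 37.0, -122.0, 38.0)
far_away = Box(0.0, 0.0, 1.0, 1.0)
```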
March 2, 2004 Ray R. Larson
Overlaps search
March 2, 2004 Ray R. Larson
Fully Enclosed Search
March 2, 2004 Ray R. Larson
Map-Based Search
March 2, 2004 Ray R. Larson
GeoSearch Web Interface
March 2, 2004 Ray R. Larson
MySQL and PostgreSQL
March 2, 2004 Ray R. Larson
RDBMS Support
• There are two reasons for RDBMS support
  – IR systems are not meant for LOTS of update transactions
  – Some applications need to have access to both relational data and text data via Z39.50
• Both MySQL and PostgreSQL are popular open source RDBMS and either can now be used via Cheshire
  – Z39.50 mappings to RDBMS columns
  – “ZQL” submission of SQL as Z39.50 Type 0 query
March 2, 2004 Ray R. Larson
Protocol Support
March 2, 2004 Ray R. Larson
Protocols
• In Cheshire II most protocols (except Z39.50) are implemented using scripting
• Example scripts to support the following are included in the distribution
  – OAI
  – SRW (Python version)
  – SOAP
  – SDLIP
March 2, 2004 Ray R. Larson
Cheshire III Design and Development
March 2, 2004 Ray R. Larson
Cheshire III Goals
• Retain or reproduce (and refine) all Cheshire II features
  – “Spring cleaning” of code base
  – Add Full Unicode Support
  – Store most system and content data in the database
• Permit easy and efficient integration in Web Services
• Use threaded server for economy of resource usage
• Enhanced Multiprotocol support
• Support for distributed processing (i.e. GRID clusters)
• Enhance expandability and “drop in” functionality
• Interfaces and/or APIs for Java, Python, C/C++
March 2, 2004 Ray R. Larson
Cheshire II Design Overview
[Diagram labels: XML DOCS, XML DIRECTORY, INDEX CLUSTER, INDEX CHESHIRE, CONT, BUILD ASSOC, ZSERVER, CONFIG, COMPONENT DEFINITION, INDEX(S), ASSOC, CLUSTER EXTENSION]
March 2, 2004 Ray R. Larson
Cheshire III Server Overview
[Diagram: Cheshire III SERVER. Labels: API, INDEXING, SEARCH, PROTOCOL HANDLER, TRANSFORMS, XSLT, RECORD, DB API, REMOTE SYSTEMS (any protocol), XML CONFIG & Metadata INFO, INDEXES, LOCAL DB, STAFF UI, CONFIG, NETWORK, RESULT SETS, SCAN, USER INFO, CONFIG & CONTROL, ACCESS INFO, AUTHENTICATION, CLUSTERING, Native calls, Z39.50, SOAP, OAI, JDBC, Fetch ID, Put ID, OpenURL, APACHE INTERFACE, SERVER CONTROL, UDDI, WSRP, SRW, Normalization, Client, User/Clients, OGIS]
March 2, 2004 Ray R. Larson
Retain Features
• The intent is to permit all of the types of indexing, searching and record formatting available now, while making it easier to add new capabilities
• The new system will also support full UNICODE for content and for metadata
• Store metadata and content in the database (including config information, etc.)
March 2, 2004 Ray R. Larson
Permit easy integration of Web Services
• The assumption is that the web server will be the central server mechanism in the future.
• The new design relies on the session handling, threading and load management tools available in Apache (2.0.40+)
• The Cheshire server is dynamically loaded as part of the Web Server
March 2, 2004 Ray R. Larson
Multiprotocol Support
• The Web server handles the network issues and passes requests in various protocols along to the Cheshire Server.
• Individual Protocol “plugins” and the Protocol Handler convert search, display, and metadata requests in a particular protocol to the internal Cheshire III control language, and convert outgoing messages and data to the appropriate protocol form
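The plugin arrangement described above amounts to a pair of converters per protocol wrapped around one internal request form. The sketch below is a purely hypothetical illustration of that shape: the class, the method names, the toy "keyvalue" protocol and the dictionary used as an internal request are all invented here and do not reflect the Cheshire III API or its control language.

```python
class ProtocolHandler:
    """Route requests through per-protocol decode/encode plugins (hypothetical)."""

    def __init__(self):
        self._plugins = {}

    def register(self, name, decode, encode):
        """Register converters: wire format -> internal, and result -> wire."""
        self._plugins[name] = (decode, encode)

    def handle(self, name, raw_request, engine):
        """Decode the request, run it on the engine, encode the reply."""
        decode, encode = self._plugins[name]
        internal_request = decode(raw_request)
        result = engine(internal_request)
        return encode(result)


# Toy plugin: "index=term" strings in, comma-separated ids out.
def decode_keyvalue(raw):
    index, term = raw.split("=", 1)
    return {"op": "search", "index": index, "term": term}


def encode_keyvalue(result):
    return ",".join(result)


def toy_engine(request):
    # Stand-in for the search engine: returns fixed ids for one known term.
    return ["rec1", "rec2"] if request["term"] == "cat" else []


handler = ProtocolHandler()
handler.register("keyvalue", decode_keyvalue, encode_keyvalue)
reply = handler.handle("keyvalue", "title=cat", toy_engine)
```

The point of the design is that the engine sees only the internal form, so supporting another protocol means adding one more converter pair rather than touching the search code.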
March 2, 2004 Ray R. Larson
Distributed & GRID Processing
• The server will support protocols for interchange of partial results and collection statistics with a single “Master” controlling the actions of a large number of “Slave” servers
• These will run in parallel in a GRID environment
• This is still “research” but will probably be using “Storage Grid” technology from SDSC with our own applications
• Non-Grid use of the same protocols, etc will be possible (but definitely slower)
March 2, 2004 Ray R. Larson
Enhanced Expandability
• Clearly defined APIs for interacting with the server will permit easy addition of new functionality, or replacement or upgrade of existing functionality
• Interactive user interface for database configuration and setup
  – We want to make it easier for a user/administrator to create and manage the database
March 2, 2004 Ray R. Larson
Multilingual APIs
• The system is being developed in a multilingual environment.
• We will include the ability to interface with (at a minimum) Java, Python and C/C++ applications.
• APIs for developing new functions will be available in these languages as well
March 2, 2004 Ray R. Larson
Development
• Currently work is going on here (RRL) and (primarily) in the UK
• We have incomplete (Alpha) versions of the system, but haven’t been distributing it in the current form (changing constantly)
• First release version is expected in mid-’04
March 2, 2004 Ray R. Larson
Further Information
• Full Cheshire II client and server is open source and available for academic and government use: ftp://cheshire.berkeley.edu/pub/cheshire/
  – Includes HTML documentation
• Project Web Site http://cheshire.berkeley.edu/
• Archives Hub http://www.archiveshub.ac.uk/