august 22, 2001 nasa ames lecture -- ray r. larson xml structured document retrieval and distributed...

August 22, 2001 NASA Ames Lecture -- Ray R. Larson

XML Structured Document Retrieval and Distributed Resource Discovery

Ray R. Larson

School of Information Management & Systems
University of California, Berkeley

[email protected]@sherlock.berkeley.edu


Context

• NSF/JISC International Digital Library Grant
  – Cross-Domain Resource Discovery: Integrated Discovery and Use of Textual, Numeric and Spatial Data
• UC Berkeley DLI2 Grant:
  – ReInventing Scholarly Information Access
• UC Berkeley working with the University of Liverpool/Manchester Computing with participation from
  – De Montfort University (MASTER)
  – Arts and Humanities Data Service (http://ahds.ac.uk/)
    • OTA (Oxford), HDS (Essex), PADS (Glasgow), ADS (York), VADS (Surrey & Northumbria)
  – Consortium of University Research Libraries (CURL)
  – UC Berkeley Library (and California Digital Library)
    • Making of America II
    • Online Archive of California
  – British Natural History Museum, London
  – NESSTAR (NEtworked Social Science Tools and Resources)


Research Areas

• Goals are
  – Practical application of existing DL technologies to some large-scale cross-domain collections
  – Theoretical examination and evaluation of next-generation designs for systems architecture and distributed cross-domain searching for DLs


Approach

• For the first goal, we are implementing a distributed search system based on international standards (Z39.50 and SGML/XML) using the Cheshire II information retrieval system
• Databases include:
  – HE Archives hub
  – Arts and Humanities Data Service (AHDS)
  – MASTER
  – CURL (Consortium of University Research Libraries)
  – Online Archive of California (OAC)
  – Making of America II (MOA2)


Current Usage of Cheshire II

• Web clients for:
  – Berkeley NSF/NASA/ARPA Digital Library
  – World Conservation Digital Library
  – SunSite (UC Berkeley Science Libraries)
  – University of Liverpool
  – Higher Education Archives Hub
    • Glasgow, Edinburgh, Bath, Liverpool, King's College London, University College London, Nottingham, Durham, School of Oriental and African Studies, Manchester, Southampton, Warwick and others (to be expanded)
  – University of Essex, HDS (part of AHDS)
  – Oxford Text Archive (test only)
  – California Sheet Music Project
  – Cha-Cha (Berkeley Intranet Search Engine)
  – Berkeley Metadata project cross-language demo
  – Univ. of Virginia (test implementations)
  – Cheshire ranking algorithm is basis for original Inktomi


Current and Upcoming Usage of Cheshire II

• DIEPER Digitized European Periodicals project.
  – http://gdz.sub.uni-goettingen.de/dieper/
• NESSTAR (Networked Social Science Tools and Resources).
  – http://www.nesstar.org/
• FASTER – Flexible Access to Statistics Tables and Electronic Resources. (Continuation of NESSTAR)
  – http://www.faster-data.org/
• MASTER (Manuscript Access through Standards for Electronic Records).
  – http://www.cta.dmu.ac.uk/projects/master/


Upcoming Usage of Cheshire II

• ZETOC (Prototype of the Electronic Table of Contents from the British Library)
  – http://zetoc.mimas.ac.uk/
• Archives Hub
  – http://www.archiveshub.ac.uk/
• RSLP Palaeography project
  – http://www.palaeography.ac.uk/
• British Natural History Museum, London
• JISC data services directory hosted by MIMAS
• Resource Discovery Network (RDN), where it will be used to harvest RDN records from the various hubs using OAI and provide search


Client/Server Architecture

• Server Supports:
  – Database storage
  – Indexing
  – Z39.50 access to local data
  – Boolean and Probabilistic Searching
  – Relevance Feedback
  – External SQL database support
• Client Supports:
  – Programmable (Tcl/Tk) Graphical User Interface
  – Z39.50 access to remote servers
  – SGML/XML & MARC formatting
• Combined Client/Server CGI scripting via WebCheshire used for web applications


SGML/XML Support

• Underlying native format for all data is SGML/XML
• The DTD defines the file format for each file
• Full SGML/XML parsing
• XML Configuration Files define the database
• USMARC DTD and MARC to SGML conversion (and back again)
• Access to full-text via special SGML tags
• Support for SGML/XML component definition and indexing

SGML/XML Support

• Example XML record for a DL document

<ELIB-BIB>
  <BIB-VERSION>ELIB-v1.0</BIB-VERSION>
  <ID>756</ID>
  <ENTRY>June 12, 1996</ENTRY>
  <DATE>June 1996</DATE>
  <TITLE>Cumulative Watershed Effects: Applicability of Available Methodologies to the Sierra Nevada</TITLE>
  <ORGANIZATION>University of California</ORGANIZATION>
  <TYPE>report</TYPE>
  <AUTHOR-INSTITUTIONAL>USDA Forest Service</AUTHOR-INSTITUTIONAL>
  <AUTHOR-PERSONAL>Neil H. Berg</AUTHOR-PERSONAL>
  <AUTHOR-PERSONAL>Ken B. Roby</AUTHOR-PERSONAL>
  <AUTHOR-PERSONAL>Bruce J. McGurk</AUTHOR-PERSONAL>
  <PROJECT>SNEP</PROJECT>
  <SERIES>Vol 3</SERIES>
  <PAGES>40</PAGES>
  <TEXT-REF>/elib/data/docs/0700/756/HYPEROCR/hyperocr.html</TEXT-REF>
  <PAGED-REF>/elib/data/docs/0700/756/OCR-ASCII-NOZONE</PAGED-REF>
</ELIB-BIB>

SGML/XML Support

• Example SGML/MARC Record

<USMARC Material="BK" ID="00000003">
<Leader><LRL>00722</LRL><RecStat>n</RecStat> <RecType>a</RecType><BibLevel>m</BibLevel><UCP></UCP><IndCount>2</IndCount> <SFCount>2</SFCount><BaseAddr>00229</BaseAddr><EncLevel> </EncLevel> <DscCatFm></DscCatFm><LinkRec></LinkRec><EntryMap><FLength>4</FLength><SCharPos> 5</SCharPos><IDLength>0</IDLength><EMUCP></EMUCP></EntryMap></Leader>
<Directry>001001400000005001700014008004100031010001400072035002000086035001700106100001900123245010500142250001100247260003200258300003300290504005000323650003600373700002200409700002200431950003200453998000700485</Directry>
<VarFlds>
<VarCFlds><Fld001>CUBGGLAD1282B</Fld001><Fld005>19940414143202.0</Fld005> <Fld008>830810 1983 nyu eng u</Fld008></VarCFlds>
<VarDFlds>
<NumbCode><Fld010 I1="Blank" I2="Blnk"><a>82019962 </a></Fld010> <Fld035 I1="Blank" I2="Blnk"><a>(CU)ocm08866667</a></Fld035><Fld035 I1="Blank" I2="Blnk"><a>(CU)GLAD1282</a></Fld035></NumbCode>
<MainEnty><Fld100 NameType="Single" I2=""><a>Burch, John G.</a></Fld100></MainEnty>
<Titles><Fld245 AddEnty="Yes" NFChars="0"><a>Information systems :</a><b>theory and practice /</b><c>John G. Burch, Jr., Felix R. Strater, Gary Grudnitski</c></Fld245></Titles>
<EdImprnt><Fld250 I1="Blank" I2="Blnk"><a>3rd ed</a></Fld250><Fld260 I1="" I2="Blnk"><a>New York :</a><b>J. Wiley,</b><c>1983</c></Fld260></EdImprnt>
<PhysDesc><Fld300 I1="Blank" I2="Blnk"><a>xvi, 632 p. :</a><b>ill. ;</b><c>24 cm</c></Fld300></PhysDesc>
<Series></Series>
<Notes><Fld504 I1="Blank" I2="Blnk"><a>Includes bibliographical references and index</a></Fld504></Notes>
<SubjAccs><Fld650 SubjLvl="NoInfo" SubjSys="LCSH"><a>Management information systems.</a></Fld650> ...


SGML/XML Support

• Configuration files for the Server are also SGML/XML:
  – They include tags describing all of the data files and indexes for the database.
  – They also include instructions on how data is to be extracted for indexing and how Z39.50 attributes map to the indexes for a given database.
  – They include definition of components and component indexes


Component Extraction and Retrieval

• Any sub-elements of an SGML/XML document can be defined as a separately indexed “component”.
• Components can be ranked and retrieved independently of the source document (but linked back to their original source)
• For example, paragraphs and abstracts in the full text of documents could be defined as components to provide paragraph-level search
• Example: Glasier archives…




• The Glasier archive is an EAD document (1.9 Mb in size)
• Contains “Series, Subseries, and Item level” descriptions of things in the archive

Excerpt from Glasier Archive

<c level="subseries">
  <did>
    <head>GP-1-1: General correspondence. Public letters.</head>
    <unitid id="gp-1-1">GP-1-1</unitid>
    <unittitle>Glasier Papers. General correspondence. Public letters.</unittitle>
  </did>
  <arrangement><head>Arrangement </head>Public letters arranged alphabetically within each year </arrangement>
  <c level="item" langmaterial="eng">
    <did>
      <unitid id="gp-1-1-0001">GP-1-1-0001</unitid>
      <unittitle>Letter from Richard Murray. <geogname>Glasgow</geogname>; <unitdate>7 Apr 1879</unitdate>.</unittitle>
      <origination><persname>Murray, Richard</persname></origination>
      <physdesc><extent>1 letter</extent></physdesc>
    </did>
    <note>Employment reference for J.B.G. as draughtsman<subject>Glasier, John Bruce</subject></note>
  </c>

ETC…

Example Component Def

…
<COMPONENTS>
  <COMPONENTDEF>
    <COMPONENTNAME> /home/ray/Work/Glasier_test/indexes/COMPONENT_DB1 </COMPONENTNAME>
    <COMPONENTNORM>NONE</COMPONENTNORM>
    <COMPSTARTTAG>
      <tagspec><FTAG> c </FTAG><ATTR> level <VALUE>item</VALUE></ATTR></tagspec>
    </COMPSTARTTAG>
    <COMPONENTINDEXES>
…


Components

• Both individual tags and “ranges” with a starting tag and (different) ending tag can be used as components
• Components permit parts of complex SGML/XML documents to be treated as separate documents
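As a rough illustration of the idea (not Cheshire II's own extraction code), the sketch below pulls every `<c level="item">` element out of a small EAD fragment and treats each one as a separate retrievable unit that is still identified by its `unitid`. The fragment is abbreviated from the Glasier excerpt, the second item is invented for illustration, and the function name `extract_components` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Abbreviated EAD fragment modelled on the Glasier excerpt.
# The second item is invented purely for illustration.
EAD = """
<c level="subseries">
  <did>
    <unitid id="gp-1-1">GP-1-1</unitid>
    <unittitle>Glasier Papers. General correspondence. Public letters.</unittitle>
  </did>
  <c level="item">
    <did>
      <unitid id="gp-1-1-0001">GP-1-1-0001</unitid>
      <unittitle>Letter from Richard Murray.</unittitle>
    </did>
    <note>Employment reference for J.B.G. as draughtsman</note>
  </c>
  <c level="item">
    <did>
      <unitid id="gp-1-1-0002">GP-1-1-0002</unitid>
      <unittitle>Second letter (invented example).</unittitle>
    </did>
  </c>
</c>
"""


def extract_components(xml_text, tag="c", attr="level", value="item"):
    """Return (component_id, text) pairs for each element matching tag/attr/value."""
    root = ET.fromstring(xml_text)
    components = []
    for elem in root.iter(tag):
        if elem.get(attr) != value:
            continue  # e.g. the enclosing subseries is not a component
        unitid = elem.find("./did/unitid")
        comp_id = unitid.get("id") if unitid is not None else None
        # Flatten the element's text and normalise whitespace for indexing.
        text = " ".join(" ".join(elem.itertext()).split())
        components.append((comp_id, text))
    return components


components = extract_components(EAD)
for comp_id, text in components:
    print(comp_id, "->", text)
```

The tag, attribute and value arguments mirror the `FTAG`, `ATTR` and `VALUE` parts of a component definition; a real system would also record where each component sits in its source document so that results can link back to it.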


Cheshire II Searching

[Figure: Cheshire II searching. Diagram labels: Local, Remote, Internet, Z39.50 connections, Images, Scanned Text]


Boolean Search Capability

• All Boolean operations are supported
  – “zfind author x and (title y or subject z) not subject A”
• Named sets are supported and stored on the server
• Boolean operations between stored sets are supported
  – “zfind SET1 and subject widgets or SET2”
• Nested parentheses and truncation are supported
  – “zfind xtitle Alice#”
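The behaviour listed above can be pictured with ordinary set operations over an inverted index. The following is a minimal sketch under stated assumptions, not the Cheshire II implementation: the records are invented, `build_index` and `lookup` are hypothetical helpers, the `zfind` query syntax itself is not parsed, and a trailing `#` is treated as right truncation.

```python
from collections import defaultdict

# Toy records; contents are invented for illustration.
RECORDS = {
    1: {"title": "alice in wonderland", "subject": "fantasy"},
    2: {"title": "alices adventures", "subject": "widgets"},
    3: {"title": "through the looking glass", "subject": "fantasy"},
}


def build_index(records):
    """Map (field, word) -> set of record ids containing that word."""
    index = defaultdict(set)
    for rec_id, fields in records.items():
        for field, text in fields.items():
            for word in text.split():
                index[(field, word)].add(rec_id)
    return index


def lookup(index, field, term):
    """Look up one term in one index; a trailing '#' requests right truncation."""
    if term.endswith("#"):
        stem = term[:-1]
        hits = set()
        for (f, word), ids in index.items():
            if f == field and word.startswith(stem):
                hits |= ids
        return hits
    return set(index.get((field, term), set()))


INDEX = build_index(RECORDS)

# Named result sets kept on the "server" for reuse in later queries.
stored_sets = {
    "SET1": lookup(INDEX, "title", "alice#"),
    "SET2": lookup(INDEX, "subject", "fantasy"),
}

# AND is intersection, OR is union, NOT is set difference.
set1_and_widgets = stored_sets["SET1"] & lookup(INDEX, "subject", "widgets")
set1_or_set2 = stored_sets["SET1"] | stored_sets["SET2"]
set1_not_set2 = stored_sets["SET1"] - stored_sets["SET2"]
```

Keeping named sets as plain sets of record identifiers is what makes later Boolean combinations between stored sets cheap.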


Probabilistic Models

• Rigorous formal model attempts to predict the probability that a given document will be relevant to a given query
• Ranks retrieved documents according to this probability of relevance (Probability Ranking Principle)
• Relies on accurate estimates of probabilities


Probability Ranking Principle

• If a reference retrieval system’s response to each request is a ranking of the documents in the collections in the order of decreasing probability of usefulness to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data has been made available to the system for this purpose, then the overall effectiveness of the system to its users will be the best that is obtainable on the basis of that data.

Stephen E. Robertson, J. Documentation 1977


Probabilistic Models: Logistic Regression

• Estimates for relevance based on log-linear model with various statistical measures of document content as independent variables.

Log odds of relevance is a linear function of attributes:

$$\log O(R \mid q_i, d_j, t_k) = c_0 + c_1 v_1 + c_2 v_2 + \dots + c_n v_n$$

Term contributions summed:

$$\log O(R \mid q_i, d_j) = \sum_{k=1}^{m} \left[ \log O(R \mid q_i, d_j, t_k) - \log O(R) \right]$$

Probability of Relevance is inverse of log odds:

$$P(R \mid q_i, d_j) = \frac{1}{1 + e^{-\log O(R \mid q_i, d_j)}}$$


Logistic Regression

[Figure: logistic regression curve. x-axis: Term Frequency in Document (0 to 60); y-axis: Relevance (0 to 100)]


Probabilistic Retrieval: Logistic Regression

Probability of relevance is based on logistic regression from a sample set of documents to determine values of the coefficients (TREC). At retrieval the probability estimate is obtained from the log odds:

$$\log O(R \mid Q, D) = c_0 + \sum_{i=1}^{6} c_i X_i, \qquad P(R \mid Q, D) = \frac{1}{1 + e^{-\log O(R \mid Q, D)}}$$

For the 6 X attribute measures shown on the next slide


Probabilistic Retrieval: Logistic Regression attributes

Average Absolute Query Frequency:

$$X_1 = \frac{1}{M} \sum_{j=1}^{M} \log \mathrm{QAF}_{t_j}$$

Query Length:

$$X_2 = \sqrt{\mathrm{QL}}$$

Average Absolute Document Frequency:

$$X_3 = \frac{1}{M} \sum_{j=1}^{M} \log \mathrm{DAF}_{t_j}$$

Document Length:

$$X_4 = \sqrt{\mathrm{DL}}$$

Average Inverse Document Frequency:

$$X_5 = \frac{1}{M} \sum_{j=1}^{M} \log \mathrm{IDF}_{t_j}$$

Inverse Document Frequency, where N is the number of documents in the collection and n_{t_j} is the number of documents containing term t_j:

$$\mathrm{IDF}_{t_j} = \frac{N - n_{t_j}}{n_{t_j}}$$

Number of Terms in common between query and document -- logged:

$$X_6 = \log M$$
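To make the attribute measures concrete, here is a small sketch that computes the six X values for one query/document pair and turns the linear combination into a probability with the logistic transform. It rests on several assumptions: the square roots on the two length measures follow the attribute definitions as reconstructed here, the coefficient values in `COEFFS` are hypothetical placeholders (not the values fitted on TREC data), and the function names and toy frequencies are invented.

```python
import math


def lr_attributes(query_tf, doc_tf, doc_freq, n_docs, query_len, doc_len):
    """Compute [X1..X6] for one query/document pair.

    query_tf, doc_tf: term -> frequency in the query / in the document
    doc_freq: term -> number of documents containing the term (n_t)
    n_docs: number of documents in the collection (N); assumes n_t < N
    """
    matched = [t for t in query_tf if t in doc_tf]
    m = len(matched)
    if m == 0:
        return None  # no terms in common, so there is nothing to rank
    x1 = sum(math.log(query_tf[t]) for t in matched) / m
    x2 = math.sqrt(query_len)
    x3 = sum(math.log(doc_tf[t]) for t in matched) / m
    x4 = math.sqrt(doc_len)
    x5 = sum(math.log((n_docs - doc_freq[t]) / doc_freq[t]) for t in matched) / m
    x6 = math.log(m)
    return [x1, x2, x3, x4, x5, x6]


def probability_of_relevance(xs, coeffs):
    """coeffs = [c0, c1, ..., c6]; returns the logistic transform of the log odds."""
    log_odds = coeffs[0] + sum(c * x for c, x in zip(coeffs[1:], xs))
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical coefficients for illustration only.
COEFFS = [-3.0, 1.0, -0.1, 0.5, -0.05, 0.3, 1.0]

xs = lr_attributes(
    query_tf={"cheshire": 1, "cat": 2},
    doc_tf={"cat": 4, "grin": 1},
    doc_freq={"cat": 10, "cheshire": 2},
    n_docs=1000,
    query_len=3,
    doc_len=100,
)
p = probability_of_relevance(xs, COEFFS)
```

Only the term "cat" is shared in this toy case, so M is 1 and the sixth attribute is log 1 = 0. Documents are then ranked by the resulting probability.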


Cheshire Probabilistic Retrieval

• Uses Logistic Regression ranking method developed at Berkeley with new algorithm for weight calculation at retrieval time.
• Z39.50 “relevance” operator used to indicate probabilistic search
• Any index can have Probabilistic searching performed:
  – zfind topic @ “cheshire cats, looking glasses, march hares and other such things”
  – zfind title @ caucus races
• Boolean and Probabilistic elements can be combined:
  – zfind topic @ government documents and title guidebooks


Combining Search Types

• It is also possible to combine the results of multiple independent searches into a single result set (using the Z39.50 SORT service of the Cheshire system)
  – E.g.:
  – Search of Full Text (Probabilistic)
  – Search of Full Text (Boolean)
  – Search of Components (Probabilistic)
  – Search of Titles (Probabilistic)
  – Search of Subject Headings (Probabilistic)
• All result sets are merged and re-ranked to produce the final list.
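The slides do not say which rule is used to merge and re-rank the result sets, so the sketch below swaps in one common fusion rule purely for illustration: each set's scores are min-max normalised and the normalised scores are summed per document. The function names, document identifiers and scores are all invented.

```python
def normalize(results):
    """Min-max scale scores to [0, 1]; results is a list of (doc_id, score)."""
    scores = [score for _, score in results]
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # Unranked (e.g. Boolean) sets: every hit counts equally.
        return {doc: 1.0 for doc, _ in results}
    return {doc: (score - lo) / (hi - lo) for doc, score in results}


def merge_result_sets(*result_sets):
    """Sum normalised scores per document and return a re-ranked list."""
    combined = {}
    for results in result_sets:
        if not results:
            continue
        for doc, score in normalize(results).items():
            combined[doc] = combined.get(doc, 0.0) + score
    # Highest combined score first; ties broken by document id.
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))


full_text = [("d1", 0.9), ("d2", 0.4), ("d3", 0.1)]   # probabilistic
titles = [("d2", 0.8), ("d4", 0.2)]                   # probabilistic
boolean_hits = [("d1", 1.0), ("d2", 1.0)]             # Boolean, no ranking

merged = merge_result_sets(full_text, titles, boolean_hits)
```

A document found by several searches accumulates evidence from each of them, which is why `d2` ends up ahead of `d1` here even though `d1` tops the full-text list.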


Relevance Feedback

• Any records in a result set can be used for Relevance Feedback
• Uses the “set name” to receive feedback instructions.
  – zfind SET1:2,5-9,30,45
  – zfind SET2:6
• Chosen records are used to build a new probabilistic query
• Ranked results are returned
• Planned support for (modified) Rocchio RF
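The feedback syntax names a stored set and the positions of the chosen records within it. A minimal sketch of how such a specification could be expanded, and how the chosen records might seed a new query, is given below. Both helpers are hypothetical: the slides do not describe how terms are selected, so the frequency-based choice in `feedback_query` is an assumption made for illustration.

```python
from collections import Counter


def parse_feedback_spec(spec):
    """Split 'SET1:2,5-9,30,45' into ('SET1', [2, 5, 6, 7, 8, 9, 30, 45])."""
    set_name, _, ranges = spec.partition(":")
    positions = []
    for part in ranges.split(","):
        if "-" in part:
            start, end = part.split("-")
            positions.extend(range(int(start), int(end) + 1))
        else:
            positions.append(int(part))
    return set_name, positions


def feedback_query(records, positions, max_terms=10):
    """Build a term list from the chosen records (assumed rule: most frequent terms)."""
    counts = Counter()
    for pos in positions:
        counts.update(records[pos].lower().split())
    return [term for term, _ in counts.most_common(max_terms)]


set_name, positions = parse_feedback_spec("SET1:2,5-9,30,45")

# Toy stored set: position in the result set -> record text (invented).
toy_set = {2: "cheshire cat grin", 5: "cat and mouse"}
query_terms = feedback_query(toy_set, [2, 5], max_terms=3)
```

The resulting term list would then be submitted as a new probabilistic search, and the ranked results returned to the user.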


Cheshire II - Two-Stage Retrieval (EVM generation)

• Example: Using the LC Classification System
  – Pseudo-Document created for each LC class containing terms derived from “content-rich” portions of documents in that class (subject headings, titles, etc.)
  – Permits searching by any term in the class
  – Ranked Probabilistic retrieval techniques attempt to present the “Best Matches” to a query first.
  – User selects classes to feed back for the “second stage” search of documents (which includes info from first stage selections)
• Can be used with any classified/Indexed collection and controlled vocabulary


Automatic Class Assignment

[Figure: a set of documents (Doc) fed to a Search Engine]

1. Create pseudo-documents representing intellectually derived classes.
2. Search using document contents
3. Obtain ranked list
4. Assign document to N categories ranked over threshold, OR assign to top-ranked category

Automatic Class Assignment: polythetic, exclusive or overlapping, usually ordered; clusters are order-independent, usually based on an intellectually derived scheme
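The four steps can be sketched as follows. This is an illustration under stated assumptions rather than the system's own method: simple term overlap stands in for the probabilistic ranking, and the class names, pseudo-document contents and function names are invented.

```python
def rank_classes(doc_terms, pseudo_docs):
    """Steps 2-3: score each class by the share of document terms it contains."""
    doc_terms = set(doc_terms)
    scores = []
    for cls, terms in pseudo_docs.items():
        overlap = len(doc_terms & set(terms))
        scores.append((cls, overlap / len(doc_terms)))
    # Best match first; ties broken alphabetically.
    return sorted(scores, key=lambda pair: (-pair[1], pair[0]))


def assign_classes(ranked, threshold=None, max_classes=1):
    """Step 4: take the top-ranked class, or up to N classes over a threshold."""
    if threshold is None:
        return [ranked[0][0]] if ranked else []
    return [cls for cls, score in ranked if score >= threshold][:max_classes]


# Step 1: pseudo-documents for three invented classes.
PSEUDO_DOCS = {
    "retrieval": ["retrieval", "ranking", "probabilistic", "indexing"],
    "markup": ["xml", "sgml", "dtd"],
    "botany": ["plant", "leaf"],
}

ranked = rank_classes(["probabilistic", "retrieval", "ranking", "xml"], PSEUDO_DOCS)
exclusive = assign_classes(ranked)                                   # top class only
overlapping = assign_classes(ranked, threshold=0.2, max_classes=2)   # N over threshold
```

The two calls at the end correspond to the exclusive and overlapping variants of assignment mentioned on the slide.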


Cheshire II - Cluster Generation

• Define basis for clustering records.
  – Select field to form the basis of the cluster.
  – Evidence Fields to use as contents of the pseudo-documents.
• During indexing cluster keys are generated with basis and evidence from each record.
• Cluster keys are sorted and merged on basis and pseudo-documents created for each unique basis element containing all evidence fields.
• Pseudo-Documents (Class clusters) are indexed on combined evidence fields.
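A minimal sketch of these steps, assuming records are plain dictionaries: each record yields a (basis, evidence) key, the keys are sorted and merged on basis, and the merged evidence becomes the pseudo-document text for that class. The field names, class labels and the function `build_class_clusters` are hypothetical.

```python
from collections import defaultdict


def build_class_clusters(records, basis_field, evidence_fields):
    """Return {basis value: pseudo-document text} built from the evidence fields."""
    keys = []
    for rec in records:
        basis = rec.get(basis_field)
        if basis is None:
            continue  # records without a basis value join no cluster
        evidence = " ".join(rec.get(f, "") for f in evidence_fields).strip()
        keys.append((basis, evidence))

    keys.sort()  # sort on basis so equal keys are adjacent before merging

    merged = defaultdict(list)
    for basis, evidence in keys:
        merged[basis].append(evidence)
    return {basis: " ".join(parts) for basis, parts in merged.items()}


# Invented records; "class" plays the role of the basis field.
RECORDS = [
    {"class": "B2", "title": "Cats of Cheshire", "subject": "cats"},
    {"class": "A1", "title": "Information systems",
     "subject": "management information systems"},
    {"class": "A1", "title": "Database design", "subject": "databases"},
]

clusters = build_class_clusters(RECORDS, "class", ["title", "subject"])
```

Each value in `clusters` would then be indexed like any other document, so a query term found in any member record's evidence fields can retrieve the class.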


Cheshire II - Two-Stage Retrieval

• Using the MeSH Subject Heading System
  – Pseudo-Document created for each MeSH heading containing terms derived from “content-rich” portions of documents in that class (other subject headings, titles, abstract, etc.)
  – Permits searching by any term in the class
  – Ranked Probabilistic retrieval techniques attempt to present the “Best Matches” to a query first.
  – User selects classes to feed back for the “second stage” search of documents.
• Can be used with any classified/Indexed collection.


Distributed Search: The Problem

• Hundreds or Thousands of servers with databases ranging widely in content, topic, format
  – Broadcast search is expensive in terms of bandwidth and in processing too many irrelevant results
  – How to select the “best” ones to search?
    • What to search first
    • Which to search next
  – Topical/domain constraints on the search selections
  – Variable contents of database (metadata only, full text…)


An Approach for Cross-Domain Resource Discovery

• MetaSearch
  – New approach to building metasearch based on Z39.50
  – Instead of using broadcast search we are using two Z39.50 Services
    • Identification of database metadata using Z39.50 Explain
    • Extraction of distributed indexes using Z39.50 SCAN
• Evaluation
  – How efficiently can we build distributed indexes?
  – How effectively can we choose databases using the index?
  – How effective is merging search results from multiple sources?
  – Hierarchies of servers (general/meta-topical/individual)?


Z39.50 Overview

[Figure: Z39.50 overview. Diagram labels: UI, Map Query, Map Results, Internet, Search Engine]


Z39.50 Explain

• Explain supports searches for
  – Server-Level metadata
    • Server Name
    • IP Addresses
    • Ports
  – Database-Level metadata
    • Database name
    • Search attributes (indexes and combinations)
  – Support metadata (record syntaxes, etc)


Z39.50 SCAN

• Originally intended to support Browsing
• Query for
  – Database
  – Attributes plus Term (i.e., index and start point)
  – Step Size
  – Number of terms to retrieve
  – Position in Response set
• Results
  – Number of terms returned
  – List of Terms and their frequency in the database (for the given attribute combination)


Z39.50 SCAN Results

Syntax: zscan indexname1 term stepsize number_of_terms pref_pos

% zscan title cat 1 20 1
{SCAN {Status 0} {Terms 20} {StepSize 1} {Position 1}} {cat 27} {cat-fight 1} {catalan 19} {catalogu 37} {catalonia 8} {catalyt 2} {catania 1} {cataract 1} {catch 173} {catch-all 3} {catch-up 2} …

% zscan topic cat 1 20 1
{SCAN {Status 0} {Terms 20} {StepSize 1} {Position 1}} {cat 706} {cat-and-mouse 19} {cat-burglar 1} {cat-carrying 1} {cat-egory 1} {cat-fight 1} {cat-gut 1} {cat-litter 1} {cat-lovers 2} {cat-pee 1} {cat-run 1} {cat-scanners 1} …
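For harvesting, the brace-delimited response has to be turned into term/frequency pairs. The sketch below parses a response string of the shape shown in the sample output; it assumes single-word terms and treats the response as plain text, and the function name `parse_scan` is hypothetical (the Cheshire client itself would handle this as a Tcl list).

```python
import re


def parse_scan(response):
    """Return (header dict, [(term, frequency), ...]) from a zscan response string."""
    header = {}
    rest = response
    # The header is a nested group: {SCAN {Status 0}{Terms 20}...}
    header_match = re.match(r"\{SCAN\s+((?:\{\w+ \d+\}\s*)+)\}", response)
    if header_match:
        pairs = re.findall(r"\{(\w+) (\d+)\}", header_match.group(1))
        header = {name: int(value) for name, value in pairs}
        rest = response[header_match.end():]
    # Remaining groups are {term frequency}; multi-word terms are not handled.
    terms = [(term, int(freq))
             for term, freq in re.findall(r"\{([^{}\s]+) (\d+)\}", rest)]
    return header, terms


sample = ("{SCAN {Status 0}{Terms 20}{StepSize 1}{Position 1}}"
          "{cat 27}{cat-fight 1}{catalan 19}")
header, terms = parse_scan(sample)
```

The term/frequency pairs are exactly what a metasearch harvester needs: they summarise what an index contains without retrieving any records.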


MetaSearch Server Index Creation

• For all servers, or a topical subset…
  – Get Explain information (especially DC mappings)
  – For each index (or each DC index)
    • Use SCAN to extract terms and frequency
    • Add term + freq + source index + database metadata to the metasearch “Collection Document” (XML)
  – Planned extensions:
    • Post-Process indexes (especially Geo Names, etc) for special types of data
      – e.g. create “geographical coverage” indexes
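The harvesting loop ends by writing one XML “Collection Document” per database. The slides do not show that document's schema, so the element and attribute names in this sketch (`collection`, `index`, `term`, `freq`) are assumptions chosen for illustration, as are the server name and the function `build_collection_document`; the term frequencies are taken from the sample SCAN output.

```python
import xml.etree.ElementTree as ET


def build_collection_document(server, database, index_terms):
    """Serialise harvested SCAN data as one XML document.

    index_terms: index name -> list of (term, frequency) pairs.
    Element and attribute names are hypothetical.
    """
    root = ET.Element("collection", {"server": server, "database": database})
    for index_name, terms in index_terms.items():
        index_elem = ET.SubElement(root, "index", {"name": index_name})
        for term, freq in terms:
            term_elem = ET.SubElement(index_elem, "term", {"freq": str(freq)})
            term_elem.text = term
    return ET.tostring(root, encoding="unicode")


doc = build_collection_document(
    "z3950.example.org",  # hypothetical server name
    "books",              # hypothetical database name
    {
        "title": [("cat", 27), ("catalan", 19)],
        "topic": [("cat", 706)],
    },
)
parsed = ET.fromstring(doc)
```

Because the result is itself an XML document, it can be indexed by the metasearch server in the same way as any other record, with each source index mapped to its own searchable element.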


MetaSearch Approach

[Figure: MetaSearch approach. Diagram labels: MetaSearch Server, Map Explain and Scan Queries, Map Results, Distributed Index, Internet, remote Search Engines each with Map Query and Map Results, databases DB 1 to DB 6]


Known Problems

• Not all Z39.50 Servers support SCAN or Explain
• Solutions:
  – Probing for attributes instead of explain (e.g. DC attributes or analogs)
  – We also support OAI and can extract OAI metadata for servers that support OAI
• Collection Documents are static and need to be replaced when the associated collection changes


Evaluation

• Test Environment
  – TREC Tipster and FT data (approx. 3.5 GB)
  – Partitioned into 236 smaller collections based on source and (for TIPSTER) date by month (Distributed Search Testbed built by French, et al.)
    • High size variability (Range from 1 to thousands of docs)
    • 21,225,299 Words, 142,345,670 chars total for harvested records
• Efficiency (old data)
  – Average of 23.07 seconds per database to SCAN each database (3.4 indexes on average)
  – Average of 14.07 seconds excluding FT (131 seconds for FT database with 7 indexes)
  – Now collecting more information, so harvest times are longer, but still under one minute on average


Evaluation

• Effectiveness
  – Still working on evaluation comparing our DB ranking with the TIPSTER relevance judgements
  – Can be compared with published selection methods (CORI, GlOSS, etc.) using the same testbed


Future

• Testing of variant algorithms for ranking collections
• Application to real systems and testing in a production environment (Archives Hub)
• Logically Clustering servers by topic
• Meta-Meta Servers (treating the MetaSearch database as just another database)


Distributed Metadata Servers

[Figure: hierarchy of distributed metadata servers. Diagram labels: Replicated servers, Meta-Topical Servers, General Servers, Database Servers]


Conclusion

• A lot of interesting work to be done
  – Redesign and development of the Cheshire II system
  – Evaluating new meta-indexing methods
  – Developing and Evaluating methods for merging cross-domain results (or, perhaps, when to keep them separate)


Further Information

• Full Cheshire II client and server source is available at ftp://cheshire.berkeley.edu/pub/cheshire/
  – Includes HTML documentation
  – Also on Berkeley Digital Library Software Distribution CD
• Project Web Site http://cheshire.berkeley.edu/
