August 22, 2001 NASA Ames Lecture -- Ray R. Larson
XML Structured Document Retrieval and Distributed Resource Discovery
Ray R. Larson
School of Information Management & Systems
University of California, Berkeley
[email protected]@sherlock.berkeley.edu

Context
• NSF/JISC International Digital Library Grant
  – Cross-Domain Resource Discovery: Integrated Discovery and Use of Textual, Numeric and Spatial Data
• UC Berkeley DLI2 Grant:
  – ReInventing Scholarly Information Access
• UC Berkeley working with the University of Liverpool/Manchester Computing with participation from
  – De Montfort University (MASTER)
  – Arts and Humanities Data Service (http://ahds.ac.uk/)
    • OTA (Oxford), HDS (Essex), PADS (Glasgow), ADS (York), VADS (Surrey & Northumbria)
  – Consortium of University Research Libraries (CURL)
  – UC Berkeley Library (and California Digital Library)
    • Making of America II
    • Online Archive of California
  – British Natural History Museum, London
  – NESSTAR (NEtworked Social Science Tools and Resources)

Research Areas
• Goals are
  – Practical application of existing DL technologies to some large-scale cross-domain collections
  – Theoretical examination and evaluation of next-generation designs for systems architecture and distributed cross-domain searching for DLs

Approach
• For the first goal, we are implementing a distributed search system based on international standards (Z39.50 and SGML/XML) using the Cheshire II information retrieval system
• Databases include:
  – HE Archives hub
  – Arts and Humanities Data Service (AHDS)
  – MASTER
  – CURL (Consortium of University Research Libraries)
  – Online Archive of California (OAC)
  – Making of America II (MOA2)

Current Usage of Cheshire II
• Web clients for:
  – Berkeley NSF/NASA/ARPA Digital Library
  – World Conservation Digital Library
  – SunSite (UC Berkeley Science Libraries)
  – University of Liverpool
  – Higher Education Archives Hub
    • Glasgow, Edinburgh, Bath, Liverpool, King's College London, University College London, Nottingham, Durham, School of Oriental and African Studies, Manchester, Southampton, Warwick and others (to be expanded)
  – University of Essex, HDS (part of AHDS)
  – Oxford Text Archive (test only)
  – California Sheet Music Project
  – Cha-Cha (Berkeley Intranet Search Engine)
  – Berkeley Metadata project cross-language demo
  – Univ. of Virginia (test implementations)
  – Cheshire ranking algorithm is basis for original Inktomi

Current and Upcoming Usage of Cheshire II
• DIEPER Digitized European Periodicals project.
  – http://gdz.sub.uni-goettingen.de/dieper/
• NESSTAR (Networked Social Science Tools and Resources).
  – http://www.nesstar.org/
• FASTER – Flexible Access to Statistics Tables and Electronic Resources. (Continuation of NESSTAR)
  – http://www.faster-data.org/
• MASTER (Manuscript Access through Standards for Electronic Records).
  – http://www.cta.dmu.ac.uk/projects/master/

Upcoming Usage of Cheshire II
• ZETOC (Prototype of the Electronic Table of Contents from the British Library)
  – http://zetoc.mimas.ac.uk/
• Archives Hub
  – http://www.archiveshub.ac.uk/
• RSLP Palaeography project
  – http://www.palaeography.ac.uk/
• British Natural History Museum, London
• JISC data services directory hosted by MIMAS
• Resource Discovery Network (RDN), where it will be used to harvest RDN records from the various hubs using OAI and provide search

Client/Server Architecture
• Server Supports:
  – Database storage
  – Indexing
  – Z39.50 access to local data
  – Boolean and Probabilistic Searching
  – Relevance Feedback
  – External SQL database support
• Client Supports:
  – Programmable (Tcl/Tk) Graphical User Interface
  – Z39.50 access to remote servers
  – SGML/XML & MARC formatting
• Combined Client/Server CGI scripting via WebCheshire used for web applications

SGML/XML Support
• Underlying native format for all data is SGML/XML
• The DTD defines the file format for each file
• Full SGML/XML parsing
• XML Configuration Files define the database
• USMARC DTD and MARC to SGML conversion (and back again)
• Access to full-text via special SGML tags
• Support for SGML/XML component definition and indexing

SGML/XML Support
• Example XML record for a DL document
<ELIB-BIB>
  <BIB-VERSION>ELIB-v1.0</BIB-VERSION>
  <ID>756</ID>
  <ENTRY>June 12, 1996</ENTRY>
  <DATE>June 1996</DATE>
  <TITLE>Cumulative Watershed Effects: Applicability of Available Methodologies to the Sierra Nevada</TITLE>
  <ORGANIZATION>University of California</ORGANIZATION>
  <TYPE>report</TYPE>
  <AUTHOR-INSTITUTIONAL>USDA Forest Service</AUTHOR-INSTITUTIONAL>
  <AUTHOR-PERSONAL>Neil H. Berg</AUTHOR-PERSONAL>
  <AUTHOR-PERSONAL>Ken B. Roby</AUTHOR-PERSONAL>
  <AUTHOR-PERSONAL>Bruce J. McGurk</AUTHOR-PERSONAL>
  <PROJECT>SNEP</PROJECT>
  <SERIES>Vol 3</SERIES>
  <PAGES>40</PAGES>
  <TEXT-REF>/elib/data/docs/0700/756/HYPEROCR/hyperocr.html</TEXT-REF>
  <PAGED-REF>/elib/data/docs/0700/756/OCR-ASCII-NOZONE</PAGED-REF>
</ELIB-BIB>

SGML/XML Support
• Example SGML/MARC Record
<USMARC Material="BK" ID="00000003">
<leader><LRL>00722</LRL><RecStat>n</RecStat> <RecType>a</RecType><BibLevel>m</BibLevel><UCP></UCP><IndCount>2</IndCount> <SFCount>2</SFCount><BaseAddr>00229</BaseAddr><EncLevel> </EncLevel> <DscCatFm></DscCatFm><LinkRec></LinkRec>
<EntryMap><FLength>4</Flength><SCharPos> 5</SCharPos><IDLength>0</IDLength><EMUCP></EMUCP></EntryMap></Leader>
<Directry>001001400000005001700014008004100031010001400072035002000086035001700106100001900123245010500142250001100247260003200258300003300290504005000323650003600373700002200409700002200431950003200453998000700485</Directry>
<VarFlds>
<VarCFlds><Fld001>CUBGGLAD1282B</Fld001><Fld005>19940414143202.0</Fld005> <Fld008>830810 1983 nyu eng u</Fld008></VarCFlds>
<VarDFlds>
<NumbCode><Fld010 I1="Blank" I2="Blnk"><a>82019962 </a></Fld010> <Fld035 I1="Blank" I2="Blnk"><a>(CU)ocm08866667</a></Fld035><Fld035 I1="Blank" I2="Blnk"><a>(CU)GLAD1282</a></Fld035></NumbCode>
<MainEnty><Fld100 NameType="Single" I2=""><a>Burch, John G.</a></Fld100></MainEnty>
<Titles><Fld245 AddEnty="Yes" NFChars="0"><a>Information systems :</a><b>theory and practice /</b><c>John G. Burch, Jr., Felix R. Strater, Gary Grudnitski</c></Fld245></Titles>
<EdImprnt><Fld250 I1="Blank" I2="Blnk"><a>3rd ed</a></Fld250><Fld260 I1="" I2="Blnk"><a>New York :</a><b>J. Wiley,</b><c>1983</c></Fld260></EdImprnt>
<PhysDesc><Fld300 I1="Blank" I2="Blnk"><a>xvi, 632 p. :</a><b>ill. ;</b><c>24 cm</c></Fld300></PhysDesc>
<Series></Series>
<Notes><Fld504 I1="Blank" I2="Blnk"><a>Includes bibliographical references and index</a></Fld504></Notes>
<SubjAccs><Fld650 SubjLvl="NoInfo" SubjSys="LCSH"><a>Management information systems.</a></Fld650> ...

SGML/XML Support
• Configuration files for the Server are also SGML/XML:
  – They include tags describing all of the data files and indexes for the database.
  – They also include instructions on how data is to be extracted for indexing and how Z39.50 attributes map to the indexes for a given database.
  – They include definition of components and component indexes

Component Extraction and Retrieval
• Any sub-element of an SGML/XML document can be defined as a separately indexed “component”.
• Components can be ranked and retrieved independently of the source document (but linked back to their original source)
• For example, paragraphs and abstracts in the full text of documents could be defined as components to provide paragraph-level search
• Example: Glasier archives…

Component Extraction and Retrieval
• The Glasier archive is an EAD document (1.9 MB in size)
• Contains “Series, Subseries, and Item level” descriptions of things in the archive

Excerpt from Glasier Archive
<c level="subseries">
  <did>
    <head>GP-1-1: General correspondence. Public letters.</head>
    <unitid id="gp-1-1">GP-1-1</unitid>
    <unittitle>Glasier Papers. General correspondence. Public letters.</unittitle>
  </did>
  <arrangement>
    <head>Arrangement </head>
    <p>Public letters arranged alphabetically within each year </p>
  </arrangement>
  <c level="item" langmaterial="eng">
    <did>
      <unitid id="gp-1-1-0001">GP-1-1-0001</unitid>
      <unittitle>Letter from Richard Murray. <geogname>Glasgow</geogname>; <unitdate>7 Apr 1879</unitdate>.</unittitle>
      <origination><persname>Murray, Richard</persname></origination>
      <physdesc><extent>1 letter</extent></physdesc>
    </did>
    <note><p>Employment reference for J.B.G. as draughtsman<subject>Glasier, John Bruce</subject></p></note>
  </c>
etc.

Example Component Def
…
<COMPONENTS>
<COMPONENTDEF>
<COMPONENTNAME>
/home/ray/Work/Glasier_test/indexes/COMPONENT_DB1
</COMPONENTNAME>
<COMPONENTNORM>NONE</COMPONENTNORM>
<COMPSTARTTAG>
<tagspec>
<FTAG> c </FTAG>
<ATTR> level <VALUE>item</VALUE></ATTR>
</tagspec>
</COMPSTARTTAG>
<COMPONENTINDEXES>
<!-- First index def -->
…

Components
• Both individual tags and “ranges” with a starting tag and (different) ending tag can be used as components
• Components permit parts of complex SGML/XML documents to be treated as separate documents
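
To make the idea of a component concrete, here is a minimal Python sketch of pulling out every element that matches a start-tag specification such as tag `c` with attribute `level` equal to `item`. It is an illustration only, not how Cheshire II implements components: the function name `extract_components`, the returned dictionary layout and the shortened `SAMPLE` record are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed sample modelled on the EAD excerpt (illustrative only).
SAMPLE = """<c level="subseries"><did><unitid id="gp-1-1">GP-1-1</unitid></did>
<c level="item"><did><unitid id="gp-1-1-0001">GP-1-1-0001</unitid>
<unittitle>Letter from Richard Murray.</unittitle></did></c></c>"""


def extract_components(xml_text, tag, attr, value):
    """Return one small 'document' per element matching tag[@attr=value]."""
    root = ET.fromstring(xml_text)
    components = []
    for elem in root.iter(tag):  # iter() also visits the root element itself
        if elem.get(attr) == value:
            # Collapse all text inside the element into one whitespace-normalised string.
            text = " ".join("".join(elem.itertext()).split())
            components.append({"id": elem.findtext("did/unitid"), "text": text})
    return components


items = extract_components(SAMPLE, "c", "level", "item")
```

Each returned entry keeps the `unitid` so that a retrieved component can be linked back to its place in the source document.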

Cheshire II Searching
[Diagram with the labels: Z39.50, Internet, Images, Scanned Text, Local, Remote]

Boolean Search Capability
• All Boolean operations are supported
  – “zfind author x and (title y or subject z) not subject A”
• Named sets are supported and stored on the server
• Boolean operations between stored sets are supported
  – “zfind SET1 and subject widgets or SET2”
• Nested parentheses and truncation are supported
  – “zfind xtitle Alice#”
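
The Boolean and stored-set queries above can be pictured as ordinary set algebra over record IDs. The following is a minimal Python sketch, not Cheshire II code: the index contents, the record IDs and the `named_sets` dictionary are invented for illustration, and the second query is assumed to be evaluated left to right.

```python
# Hypothetical inverted index: (index name, term) -> set of record IDs.
index = {
    ("author", "x"): {1, 2, 3, 4},
    ("title", "y"): {2, 5},
    ("subject", "z"): {3, 6},
    ("subject", "a"): {3},
    ("subject", "widgets"): {2, 7},
}


def lookup(index_name, term):
    """Return the set of records that contain the term in the given index."""
    return index.get((index_name, term), set())


# zfind author x and (title y or subject z) not subject A
set1 = (lookup("author", "x")
        & (lookup("title", "y") | lookup("subject", "z"))) - lookup("subject", "a")

# Named result sets kept for later reuse (on the server, in the real system).
named_sets = {"SET1": set1, "SET2": {9}}

# zfind SET1 and subject widgets or SET2 (assumed left-to-right evaluation)
combined = (named_sets["SET1"] & lookup("subject", "widgets")) | named_sets["SET2"]
```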

Probabilistic Models
• Rigorous formal model attempts to predict the probability that a given document will be relevant to a given query
• Ranks retrieved documents according to this probability of relevance (Probability Ranking Principle)
• Relies on accurate estimates of probabilities

Probability Ranking Principle
• If a reference retrieval system’s response to each request is a ranking of the documents in the collections in the order of decreasing probability of usefulness to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data has been made available to the system for this purpose, then the overall effectiveness of the system to its users will be the best that is obtainable on the basis of that data.
Stephen E. Robertson, J. Documentation 1977

Probabilistic Models: Logistic Regression
• Estimates for relevance based on log-linear model with various statistical measures of document content as independent variables.

Log odds of relevance is a linear function of attributes:
$$\log O(R \mid q_i, d_j, t_k) = c_0 + c_1 v_1 + c_2 v_2 + \dots + c_n v_n$$

Term contributions summed:
$$\log O(R \mid q_i, d_j) = \sum_{k=1}^{m} \left[ \log O(R \mid q_i, d_j, t_k) - \log O(R) \right]$$

Probability of Relevance is inverse of log odds:
$$P(R \mid q_i, d_j) = \frac{1}{1 + e^{-\log O(R \mid q_i, d_j)}}$$
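
As a small worked illustration of the relationship between log odds and probability, the sketch below evaluates a linear log-odds model and applies the logistic transform. The function names and the coefficient values used in the example call are assumptions for demonstration; they are not fitted values from any real collection.

```python
import math


def log_odds(coefficients, values):
    """Linear model: c_0 + c_1*v_1 + ... + c_n*v_n."""
    c0, rest = coefficients[0], coefficients[1:]
    return c0 + sum(c * v for c, v in zip(rest, values))


def probability_of_relevance(log_o):
    """Logistic transform: P = 1 / (1 + exp(-log O))."""
    return 1.0 / (1.0 + math.exp(-log_o))


# Made-up coefficients (c_0, c_1, c_2) and attribute values (v_1, v_2).
example_log_odds = log_odds([1.0, 2.0, -1.0], [0.5, 3.0])
example_probability = probability_of_relevance(example_log_odds)
```

Log odds of zero correspond to a probability of exactly 0.5, and negative log odds give probabilities below 0.5.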

Logistic Regression
[Plot: Relevance (0 to 100) against Term Frequency in Document (0 to 60)]

Probabilistic Retrieval: Logistic Regression
Probability of relevance is based on logistic regression from a sample set of documents to determine the values of the coefficients (TREC). At retrieval the estimate is obtained by:
$$\log O(R \mid Q, D) = c_0 + \sum_{i=1}^{6} c_i X_i$$
for the 6 X attribute measures shown on the next slide. The probability $P(R \mid Q, D)$ follows from these log odds by the logistic transform $1 / (1 + e^{-\log O(R \mid Q, D)})$.

Probabilistic Retrieval: Logistic Regression attributes
Average Absolute Query Frequency:
$$X_1 = \frac{1}{M} \sum_{j=1}^{M} \log QAF_{t_j}$$
Query Length:
$$X_2 = \sqrt{QL}$$
Average Absolute Document Frequency:
$$X_3 = \frac{1}{M} \sum_{j=1}^{M} \log DAF_{t_j}$$
Document Length:
$$X_4 = \sqrt{DL}$$
Average Inverse Document Frequency:
$$X_5 = \frac{1}{M} \sum_{j=1}^{M} \log IDF_{t_j}$$
Inverse Document Frequency:
$$IDF_{t_j} = \frac{N - n_{t_j}}{n_{t_j}}$$
Number of Terms in common between query and document, logged:
$$X_6 = \log M$$
Here $M$ is the number of terms in common between query and document, $N$ is the number of documents in the collection, and $n_{t_j}$ is the number of documents containing term $t_j$.
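
A minimal Python sketch of computing the six attribute measures for one query and document pair is given below. It assumes the usual Berkeley definitions: X1, X3 and X5 are averages over the M matching terms of the logged query frequency, document frequency and inverse document frequency; X2 and X4 are the square roots of query and document length; X6 is log M; and IDF is taken as (N - n_t) / n_t. The function name, argument layout and sample numbers are hypothetical.

```python
import math


def lr_attributes(query_tf, doc_tf, doc_freq, n_docs, query_len, doc_len):
    """Return [X1..X6], or None when query and document share no terms.

    query_tf, doc_tf: term -> frequency in the query / document.
    doc_freq: term -> number of documents containing the term (n_t).
    Assumes 0 < n_t < n_docs for every matching term, so each log is defined.
    """
    common = [t for t in query_tf if t in doc_tf]
    m = len(common)
    if m == 0:
        return None
    x1 = sum(math.log(query_tf[t]) for t in common) / m   # avg. log query frequency
    x2 = math.sqrt(query_len)                             # query length
    x3 = sum(math.log(doc_tf[t]) for t in common) / m     # avg. log document frequency
    x4 = math.sqrt(doc_len)                               # document length
    x5 = sum(math.log((n_docs - doc_freq[t]) / doc_freq[t])
             for t in common) / m                         # avg. log IDF
    x6 = math.log(m)                                      # logged number of matching terms
    return [x1, x2, x3, x4, x5, x6]


# Invented example: one matching term ("cat") in a 110-document collection.
attrs = lr_attributes(
    query_tf={"cat": 1, "hat": 2},
    doc_tf={"cat": 4, "dog": 1},
    doc_freq={"cat": 10},
    n_docs=110,
    query_len=3,
    doc_len=16,
)
```

The resulting list can be combined with fitted coefficients in a linear log-odds model; no coefficient values are given here because the source does not state them.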

Cheshire Probabilistic Retrieval
• Uses Logistic Regression ranking method developed at Berkeley with new algorithm for weight calculation at retrieval time.
• Z39.50 “relevance” operator used to indicate probabilistic search
• Any index can have Probabilistic searching performed:
  – zfind topic @ “cheshire cats, looking glasses, march hares and other such things”
  – zfind title @ caucus races
• Boolean and Probabilistic elements can be combined:
  – zfind topic @ government documents and title guidebooks

Combining Search Types
• It is also possible to combine the results of multiple independent searches into a single result set (using the Z39.50 SORT service of the Cheshire system)
  – E.g.:
  – Search of Full Text (Probabilistic)
  – Search of Full Text (Boolean)
  – Search of Components (Probabilistic)
  – Search of Titles (Probabilistic)
  – Search of Subject Headings (Probabilistic)
• All result sets are merged and re-ranked to produce the final list.
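
The slides do not say how the merged list is scored, so the following Python sketch shows only one plausible scheme, stated as an assumption: each result set's scores are divided by that set's top score, the normalised scores are summed per record, and Boolean matches are given a score of 1.0. The function name and the sample result sets are invented.

```python
from collections import defaultdict


def merge_result_sets(result_sets):
    """Merge several {record ID: score} result sets into one ranked list.

    Assumed scheme (not taken from the source): normalise each set by its
    top score, then sum the normalised scores for each record.
    """
    combined = defaultdict(float)
    for results in result_sets:
        if not results:
            continue
        top = max(results.values())
        for rec_id, score in results.items():
            combined[rec_id] += score / top if top > 0 else 0.0
    # Highest combined score first; ties broken by record ID for stable output.
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))


full_text_probabilistic = {"a": 0.8, "b": 0.4}
full_text_boolean = {"b": 1.0, "c": 1.0}   # Boolean hits carry no rank, so use 1.0
title_probabilistic = {"a": 0.5}

merged = merge_result_sets(
    [full_text_probabilistic, full_text_boolean, title_probabilistic]
)
```

Records found by several searches rise to the top, which matches the intent of combining evidence from different indexes.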

Relevance Feedback
• Any records in a result set can be used for Relevance Feedback
• Uses the “set name” to receive feedback instructions.
  – zfind SET1:2,5-9,30,45
  – zfind SET2:6
• Chosen records are used to build a new probabilistic query
• Ranked results are returned
• Planned support for (modified) Rocchio RF
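
The feedback specifications above name a result set followed by record positions and ranges. A minimal Python sketch of expanding such a specification is shown below; it assumes that ranges like `5-9` are inclusive, and the function name `parse_feedback_spec` is hypothetical rather than part of Cheshire II.

```python
def parse_feedback_spec(spec):
    """Split 'SET1:2,5-9,30,45' into the set name and a list of record positions."""
    set_name, _, positions = spec.partition(":")
    records = []
    for part in positions.split(","):
        if "-" in part:
            first, last = part.split("-")
            # Ranges are assumed to include both end points.
            records.extend(range(int(first), int(last) + 1))
        else:
            records.append(int(part))
    return set_name, records


name, chosen = parse_feedback_spec("SET1:2,5-9,30,45")
```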

Cheshire II - Two-Stage Retrieval (EVM generation)
• Example: Using the LC Classification System
  – Pseudo-Document created for each LC class containing terms derived from “content-rich” portions of documents in that class (subject headings, titles, etc.)
  – Permits searching by any term in the class
  – Ranked Probabilistic retrieval techniques attempt to present the “Best Matches” to a query first.
  – User selects classes to feed back for the “second stage” search of documents (which includes info from first stage selections)
• Can be used with any classified/Indexed collection and controlled vocabulary

Automatic Class Assignment
[Diagram: documents (Doc) passed to a Search Engine]
1. Create pseudo-documents representing intellectually derived classes.
2. Search using document contents
3. Obtain ranked list
4. Assign document to N categories ranked over threshold, OR assign to top-ranked category
Automatic Class Assignment: polythetic, exclusive or overlapping, usually ordered; clusters are order-independent, usually based on an intellectually derived scheme

Cheshire II - Cluster Generation
• Define basis for clustering records.
  – Select field to form the basis of the cluster.
  – Evidence fields to use as contents of the pseudo-documents.
• During indexing cluster keys are generated with basis and evidence from each record.
• Cluster keys are sorted and merged on basis and pseudo-documents created for each unique basis element containing all evidence fields.
• Pseudo-Documents (Class clusters) are indexed on combined evidence fields.
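
The cluster generation steps above can be sketched in a few lines of Python: records are grouped by a basis field, and the evidence fields of every record in a group are concatenated into one pseudo-document. This is an illustrative sketch only; the field names (`lc_class`, `title`, `subject`) and the sample records are assumptions, and the real system builds these during indexing rather than in memory.

```python
from collections import defaultdict


def build_pseudo_documents(records, basis_field, evidence_fields):
    """Group records by basis value and concatenate their evidence fields."""
    clusters = defaultdict(list)
    for record in records:
        basis = record.get(basis_field)
        if basis is None:
            continue  # records without a basis value form no cluster key
        for field in evidence_fields:
            value = record.get(field)
            if value:
                clusters[basis].append(value)
    # One pseudo-document per unique basis value, in sorted basis order.
    return {basis: " ".join(parts) for basis, parts in sorted(clusters.items())}


records = [
    {"lc_class": "QA76", "title": "Information systems", "subject": "Databases"},
    {"lc_class": "QA76", "title": "Text retrieval", "subject": "Indexing"},
    {"lc_class": "Z699", "title": "Library automation"},
]
pseudo_docs = build_pseudo_documents(records, "lc_class", ["title", "subject"])
```

The pseudo-documents can then be indexed and searched like any other document, which is what allows a first-stage search to return classes instead of records.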

Cheshire II - Two-Stage Retrieval
• Using the MeSH Subject Heading System
  – Pseudo-Document created for each MeSH heading containing terms derived from “content-rich” portions of documents in that class (other subject headings, titles, abstract, etc.)
  – Permits searching by any term in the class
  – Ranked Probabilistic retrieval techniques attempt to present the “Best Matches” to a query first.
  – User selects classes to feed back for the “second stage” search of documents.
• Can be used with any classified/Indexed collection.

Distributed Search: The Problem
• Hundreds or Thousands of servers with databases ranging widely in content, topic, format
  – Broadcast search is expensive in terms of bandwidth and in processing too many irrelevant results
  – How to select the “best” ones to search?
    • What to search first
    • Which to search next
  – Topical/domain constraints on the search selections
  – Variable contents of database (metadata only, full text…)

An Approach for Cross-Domain Resource Discovery
• MetaSearch
  – New approach to building metasearch based on Z39.50
  – Instead of using broadcast search we are using two Z39.50 Services
    • Identification of database metadata using Z39.50 Explain
    • Extraction of distributed indexes using Z39.50 SCAN
• Evaluation
  – How efficiently can we build distributed indexes?
  – How effectively can we choose databases using the index?
  – How effective is merging search results from multiple sources?
  – Hierarchies of servers (general/meta-topical/individual)?

Z39.50 Overview
[Diagram with the labels: UI, Map Query, Map Results, Internet, Search Engine]

Z39.50 Explain
• Explain supports searches for
  – Server-Level metadata
    • Server Name
    • IP Addresses
    • Ports
  – Database-Level metadata
    • Database name
    • Search attributes (indexes and combinations)
  – Support metadata (record syntaxes, etc.)

Z39.50 SCAN
• Originally intended to support Browsing
• Query for
  – Database
  – Attributes plus Term (i.e., index and start point)
  – Step Size
  – Number of terms to retrieve
  – Position in Response set
• Results
  – Number of terms returned
  – List of Terms and their frequency in the database (for the given attribute combination)

Z39.50 SCAN Results
% zscan title cat 1 20 1
{SCAN {Status 0}{Terms 20}{StepSize 1}{Position 1}}{cat 27}{cat-fight 1}{catalan 19}{catalogu 37}{catalonia 8}{catalyt 2}{catania 1}{cataract 1}{catch 173}{catch-all 3}{catch-up 2} …
% zscan topic cat 1 20 1
{SCAN {Status 0}{Terms 20}{StepSize 1}{Position 1}}{cat 706}{cat-and-mouse 19}{cat-burglar 1}{cat-carrying 1}{cat-egory 1}{cat-fight 1}{cat-gut 1}{cat-litter 1}{cat-lovers 2}{cat-pee 1}{cat-run 1}{cat-scanners 1} …
Syntax: zscan indexname1 term stepsize number_of_terms pref_pos
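
For readers who want to post-process output in the brace-delimited form shown above, here is a minimal Python sketch of a parser. It is not part of Cheshire II: the function names are hypothetical, and it assumes that the first top-level group is the `SCAN` header and that terms never contain braces.

```python
import re


def top_level_groups(text):
    """Return the contents of each top-level {...} group in the text."""
    groups, depth, start = [], 0, None
    for i, ch in enumerate(text):
        if ch == "{":
            if depth == 0:
                start = i + 1
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                groups.append(text[start:i])
    return groups


def parse_scan(text):
    """Split a SCAN response into (header dict, {term: frequency})."""
    groups = top_level_groups(text)
    # Header fields such as {Status 0} are kept as strings.
    header = dict(re.findall(r"\{(\w+) (\d+)\}", groups[0]))
    terms = {}
    for group in groups[1:]:
        term, freq = group.rsplit(" ", 1)  # the frequency is the last token
        terms[term] = int(freq)
    return header, terms


header, terms = parse_scan(
    "{SCAN {Status 0}{Terms 20}{StepSize 1}{Position 1}}"
    "{cat 27}{cat-fight 1}{catalan 19}"
)
```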
August 22, 2001 NASA Ames Lecture -- Ray R. Larson
MetaSearch Server Index Creation
• For all servers, or a topical subset…
  – Get Explain information (especially DC mappings)
  – For each index (or each DC index)
    • Use SCAN to extract terms and frequency
    • Add term + freq + source index + database metadata to the metasearch “Collection Document” (XML)
  – Planned extensions:
    • Post-Process indexes (especially Geo Names, etc) for special types of data
      – e.g. create “geographical coverage” indexes
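The harvesting step above (term + freq + source index + database metadata gathered into one XML “Collection Document”) might be sketched as follows. This is an illustration only: the slides do not give the actual Collection Document schema, so the element and attribute names (`collection`, `metadata`, `field`, `index`, `term`, `freq`) and the function `build_collection_document` are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple


def build_collection_document(
    database: str,
    metadata: Dict[str, str],
    indexes: Dict[str, List[Tuple[str, int]]],
) -> str:
    """Serialize harvested SCAN terms for one database as an XML string.

    indexes maps a source index name (e.g. "title") to the list of
    (term, frequency) pairs extracted from that index with SCAN.
    All element and attribute names here are hypothetical.
    """
    root = ET.Element("collection", {"database": database})

    # Database-level metadata, e.g. taken from Explain information.
    meta_el = ET.SubElement(root, "metadata")
    for key, value in metadata.items():
        ET.SubElement(meta_el, "field", {"name": key}).text = value

    # One <index> element per source index, one <term> per harvested term.
    for index_name, terms in indexes.items():
        index_el = ET.SubElement(root, "index", {"name": index_name})
        for term, freq in terms:
            ET.SubElement(index_el, "term", {"freq": str(freq)}).text = term

    return ET.tostring(root, encoding="unicode")
```

Keeping the source index name on each group of terms preserves the information needed to map a metasearch hit back to the right index on the right server.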
MetaSearch Approach
[Figure: architecture diagram. Labels: MetaSearch Server; Map Explain and Scan Queries; Distributed Index; Internet; Map Query; Map Results; Search Engine (three); DB 1 through DB 6.]
Known Problems
• Not all Z39.50 Servers support SCAN or Explain
• Solutions:
  – Probing for attributes instead of explain (e.g. DC attributes or analogs)
  – We also support OAI and can extract OAI metadata for servers that support OAI
• Collection Documents are static and need to be replaced when the associated collection changes
Evaluation
• Test Environment
  – TREC Tipster and FT data (approx. 3.5 GB)
  – Partitioned into 236 smaller collections based on source and (for TIPSTER) date by month (Distributed Search Testbed built by French, et al.)
    • High size variability (Range from 1 to thousands of docs)
    • 21,225,299 Words, 142,345,670 chars total for harvested records
• Efficiency (old data)
  – Average of 23.07 seconds to SCAN each database (3.4 indexes on average)
  – Average of 14.07 seconds excluding FT (131 seconds for FT database with 7 indexes)
  – Now collecting more information, so harvest times are longer, but still under one minute on average
Evaluation
• Effectiveness
  – Still working on evaluation comparing our DB ranking with the TIPSTER relevance judgements
  – Can be compared with published selection methods (CORI, GlOSS, etc.) using the same testbed
Future
• Testing of variant algorithms for ranking collections
• Application to real systems and testing in a production environment (Archives Hub)
• Logically Clustering servers by topic
• Meta-Meta Servers (treating the MetaSearch database as just another database)
Distributed Metadata Servers
[Figure: server hierarchy diagram. Labels: Replicated servers; Meta-Topical Servers; General Servers; Database Servers.]
Conclusion
• A lot of interesting work to be done
  – Redesign and development of the Cheshire II system
  – Evaluating new meta-indexing methods
  – Developing and Evaluating methods for merging cross-domain results (or, perhaps, when to keep them separate)
Further Information
• Full Cheshire II client and server source is available at ftp://cheshire.berkeley.edu/pub/cheshire/
  – Includes HTML documentation
  – Also on Berkeley Digital Library Software Distribution CD
• Project Web Site: http://cheshire.berkeley.edu/