11/15/2001 Database Management -- Spring 2001 -- R. Larson
Object-Relational Database Applications -- The UC Berkeley Environmental Digital Library
University of California, Berkeley
School of Information Management and Systems
SIMS 257: Database Management
Today
• Object Relational Database Applications
– The Berkeley Digital Library Project
• Slides from RRL and Robert Wilensky, EECS
– Use of DBMS in DL project.
Final Presentations and Reports
• Specifications for final report are on the Web Site under assignments
• Presentations (Nov 27th & 30th, Dec 4th and 6th)
– Signup sheet being passed around.
Today
• Object Relational Applications
• The UCB Digital Library
Overview
• What is a Digital Library?
• Overview of Ongoing Research on Information Access in Digital Libraries
Digital Libraries Are Like Traditional Libraries...
• Involve large repositories of information (storage, preservation, and access)
• Provide information organization and retrieval facilities (categorization, indexing)
• Provide access for communities of users (communities may be as large as the general public or as small as the employees of a particular organization)
Traditional Library System
[Diagram labels: Originators, Libraries, Users]
But Digital Libraries Are Different From Libraries...
• Not a physical location with local copies; objects held closer to originators
• Decoupling of storage, organization, access
• Enhanced Authoring (origination, annotation, support for work groups)
• Subscription, pay-per-view supported in addition to “free” browsing.
• Integration into user tasks.
A Digital Library Infrastructure Model
[Diagram labels: Originators, Repositories, Index Services, Users, Network]
UC Berkeley Digital Library Project
• Foci: Work-centered digital information services and Re-Inventing Scholarly Information
• Testbed: Digital Library for the California Environment
• Research: Technical agenda supporting user-oriented access to large distributed collections of diverse data types.
• Part of the NSF/NASA/DARPA Digital Library Initiative (Phases 1 and 2, and the International DL initiative)
UCB Digital Library Project: Research Organizations
• UC Berkeley EECS, SIMS, CED, IS&T
• UCOP
• Xerox PARC’s Document Image Decoding group and Work Practices group
• Hewlett-Packard
• NEC
• SUN Microsystems
• IBM Almaden
• Microsoft
• Ricoh California Research
• Philips Research
Testbed: An Environmental Digital Library
• Collection: Diverse material relevant to California’s key habitats.
• Users: A consortium of state agencies, development corporations, private corporations, regional government alliances, educational institutions, and libraries.
• Potential: Impact on state-wide environmental system (CERES)
The Environmental Library - Users/Contributors
• California Resources Agency, California Environmental Resources Evaluation System (CERES)
• California Department of Water Resources
• The California Department of Fish & Game
• SANDAG
• UC Water Resources Center Archives
• New Partners: CDL and SDSC
The Environmental Library - Contents
• Environmental technical reports, bulletins, etc.
• County general plans
• Aerial and ground photography
• USGS topographic maps
• Land use and other special purpose maps
• Sensor data
• “Derived” information
• Collection databases for the classification and distribution of the California biota (e.g., SMASCH)
• Supporting 3-D, economic, traffic, etc. models
• Videos collected by the California Resources Agency
The Environmental Library - Contents
• As of late 2000, the collection represents about one terabyte of data, including over 165,000 digital images, about 300,000 pages of environmental documents, and nearly 2 million records in geographical and botanical databases.
Botanical Data:
The CalFlora Database contains taxonomical and distribution information for more than 8000 native California plants. The Occurrence Database includes over 600,000 records of California plant sightings from many federal, state, and private sources. The botanical databases are linked to our CalPhotos collection of California plants, and are also linked to external collections of data, maps, and photos.
Geographical Data:
Much of the geographical data in our collection is being used to develop our web-based GIS Viewer. The Street Finder uses 500,000 TIGER records of S.F. Bay Area streets along with 70,000 records from the USGS GNIS database. California Dams is a database of information about the 1395 dams under state jurisdiction. An additional 11 GB of geographical data represents maps and imagery that have been processed for inclusion as layers in our GIS Viewer. This includes Digital Ortho Quads and DRG maps for the S.F. Bay Area.
Documents:
Most of the 300,000 pages of digital documents are environmental reports and plans that were provided by California state agencies. This collection includes documents, maps, articles, and reports on the California environment including Environmental Impact Reports (EIRs), educational pamphlets, water usage bulletins, and county plans. Documents in this collection come from the California Department of Water Resources (DWR), California Department of Fish and Game (DFG), San Diego Association of Governments (SANDAG), and many other agencies. Among the most frequently accessed documents are County General Plans for every California county and a survey of 125 Sacramento Delta fish species.
Documents - cont.
The collection also includes about 20 MB of full-text (HTML) documents from the World Conservation Digital Library. In addition to providing online access to important environmental documents, the document collection is the testbed for our Multivalent Document research.
Image Data
• The photo collection includes over 17,000 images of California natural resources from the state Department of Water Resources, several hundred aerial photos, over 17,000 photos of California native plants from St. Mary's College, the California Academy of Sciences, and others, a small collection of California animals, and 40,000 Corel stock photos. These images are used within the project for computer vision research.
Testbed Success Stories
• LUPIN: CERES’ Land Use Planning Information Network
– California County General Plans and other environmental documents.
– Enter at Resources Agency Server, documents stored at and retrieved from UCB DLIB server.
• California flood relief efforts
– High demand for some data sets only available on our server (created by document recognition).
• CalFlora: Creation and interoperation of repositories pertaining to plant biology.
• Cloning of services at Cal State Library, FBI
Research Highlights
• Documents
– Multivalent Document prototype
  • Page images, structured documents, GIS data, photographs
• Intelligent Access to Content
– Document recognition
– Vision-based Image Retrieval: stuff, thing, scene retrieval
– Natural Language Processing: categorizing the web, Cheshire II, TileBar Interfaces
Multivalent Documents
• MVD Model
– radically distributed, open, extensible
– “behaviors” and “layers”
  • behaviors conform to a protocol suite
  • inter-operation via “IDEG”
• Applied to “enlivening legacy documents”
– various nice behaviors, e.g., lenses
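The split between "layers" (different kinds of content about one document) and "behaviors" (code that acts on those layers) can be illustrated with a small sketch. Every name below (`Layer`, `Document`, `show_ocr_text`, the file name, the OCR text) is hypothetical and chosen only for illustration. This is not the Multivalent Document API, and the slides do not describe the actual protocol suite.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

# Hypothetical names throughout: this is NOT the Multivalent Document API.


@dataclass
class Layer:
    """One kind of content about a document (e.g. page image, OCR text)."""
    name: str
    data: Dict[str, str]


# A behavior is any callable that takes the document and returns a string.
Behavior = Callable[["Document"], str]


@dataclass
class Document:
    """A document is a set of named layers plus a set of named behaviors."""
    layers: Dict[str, Layer] = field(default_factory=dict)
    behaviors: Dict[str, Behavior] = field(default_factory=dict)

    def add_layer(self, layer: Layer) -> None:
        self.layers[layer.name] = layer

    def add_behavior(self, name: str, behavior: Behavior) -> None:
        self.behaviors[name] = behavior

    def run(self, name: str) -> str:
        # Behaviors are looked up by name, so new ones can be added later
        # without changing the document class.
        return self.behaviors[name](self)


def show_ocr_text(doc: Document) -> str:
    """Toy 'lens': reveal the OCR text that sits behind the page image."""
    return doc.layers["ocr"].data.get("text", "")


doc = Document()
doc.add_layer(Layer("image", {"uri": "page1.gif"}))   # made-up file name
doc.add_layer(Layer("ocr", {"text": "Bulletin 17"}))  # made-up OCR output
doc.add_behavior("ocr_lens", show_ocr_text)
result = doc.run("ocr_lens")
```

The point of the sketch is only that layers and behaviors are registered independently, which is one plausible reading of "open, extensible".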
Document Presentation
• Problem: Digital libraries must deliver digital documents -- but in what form?
• Different forms have advantages for particular purposes
– Retrieval
– Reuse
– Content Analysis
– Storage and archiving
• Combining forms (Multivalent documents)
Spectrum of Digital Document Representations
Adapted from Fox, E.A., et al. “Users, User Interfaces and Objects: Envision, an Electronic Library”, JASIS 44(8), 1993
Document Representation: Multivalent Documents
• Primary user interface/document model for UCB Digital Library (Wilensky & Phelps)
• Goal: An approach to new document representations and their authoring.
• Supports active, distributed, composable transformations of multimedia documents.
• Enables sophisticated annotations, intelligent result handling, user-modifiable interface, composite documents.
Multivalent Documents
[Figure labels: Scanned Page Image, OCR Layer, OCR Mapping Layer, Cheshire Layer, GIS Layer, Table Layer, Network Protocols & Resources]
Valence: 2: The relative capacity to unite, react, or interact (as with antigens or a biological substrate). (Webster’s 7th Collegiate Dictionary)
MVD Third Party Work
• Japanese support by NEC; application to office document management
• Printing, support for other OCR formats, by HP
• Chinese character and multilingual lens by UCB Instructional Support staff (Owen McGrath)
• Automatic enlivening of documents via Transcend proxy.
MVD Forthcoming
• Support for XML + style sheets
• More robust parsing
• Saving where you want
• Media adaptors for
– Continuous media
– Near image formats, word proc. formats
• Improve authoring tools
• Interoperation with paper
• Application versus applet?
• Release to community, get feedback, iterate.
GIS in the MVD Framework
• Layers are georeferenced data sets.
• Behaviors are
– display semi-transparently
– pan
– zoom
– issue query
– display context
– “spatial hyperlinks”
– annotations
• Written in Java (to be merged with MVD-1 code line?)
GIS Viewer: Recent Developments
• Annotation and saving
– points, rectangles (w. labels and links), vectors
– saving of annotations as separate layer
• Integration with address, street finding, gazetteer services
• Application to image viewing: tilePix
• Castanet client
GIS Viewer Example http://elib.cs.berkeley.edu/annotations/gis/buildings.html
Geographic Information: Plans and Ideas
• More annotations, flexible saving
• Support for large vector data sets
• Interoperability
– On-the-fly
  • conversion of formats
  • generation of “catalogs”
– Via OGDI/GLTP
– Experimenting with various CERES servers
Documents: Information from scanned documents
• Built document recognizers for some important documents, e.g. “Bulletin 17”, “TR-9”.
• Recognized document structure, with an order of magnitude better OCR.
• Automatically generated a 1395-item relational database of dams.
• Enabled access via forms, map interfaces.
• Enable interoperation with image DB.
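Once recognized documents have been turned into relational rows, the "forms" and "map" access paths both reduce to ordinary queries. The sketch below uses Python's standard `sqlite3` module with an in-memory database. The `dam` table, its columns, and the three rows are invented for illustration and are not the project's schema or data.

```python
import sqlite3
from typing import List

# Assumed schema and made-up rows; the real dams database is not shown
# in the slides.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE dam (
        dam_id    INTEGER PRIMARY KEY,
        name      TEXT NOT NULL,
        county    TEXT,
        height_ft REAL,
        latitude  REAL,
        longitude REAL
    )
    """
)
rows = [
    (1, "Example Dam A", "Alameda", 120.0, 37.70, -122.00),
    (2, "Example Dam B", "Alameda", 85.0, 37.60, -121.90),
    (3, "Example Dam C", "Shasta", 300.0, 40.70, -122.40),
]
conn.executemany("INSERT INTO dam VALUES (?, ?, ?, ?, ?, ?)", rows)


def dams_in_county(county: str) -> List[str]:
    """Forms-style access: the user fills in one field of a search form."""
    cur = conn.execute(
        "SELECT name FROM dam WHERE county = ? ORDER BY name", (county,)
    )
    return [name for (name,) in cur]


def dams_in_box(south: float, west: float, north: float, east: float) -> List[str]:
    """Map-style access: the user drags a rectangle on a map."""
    cur = conn.execute(
        "SELECT name FROM dam "
        "WHERE latitude BETWEEN ? AND ? AND longitude BETWEEN ? AND ? "
        "ORDER BY name",
        (south, north, west, east),
    )
    return [name for (name,) in cur]


by_county = dams_in_county("Alameda")
by_box = dams_in_box(37.65, -122.50, 38.00, -121.95)
```

A production system would likely use a spatial index rather than plain `BETWEEN` filters; the plain form is used here only to keep the sketch self-contained.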
Document Recognition: Future Plans
• Document recognizers: for about a dozen document types
• Development and integration of mathematical OCR and recognition.
• Eventually produce document recognizer generator, i.e., make it easier to write recognizers.
Vision-Based Image Retrieval
• Stuff-based queries: “blobs”
– Basic blobs: colors, sizes, variable number
  • demonstrated utility for interesting queries
– “Blob world”: Above plus texture, applied to
  • retrieving similar images
  • successful learning scene classifier
• Thing-finding: Successfully deployed detectors adding body plans (adding shape, geometry and kinematic constraints)
Find objects by grouping coherent low-level properties
Image Retrieval Research
• Finding “Stuff” vs “Things”
• BlobWorld
• Other Vision Research
(Old “stuff”-based image retrieval: Query)
(Old “stuff”-based image retrieval: Result)
Blobworld: use regions for retrieval
• We want to find general objects, so we represent images based on coherent regions
(“Thing”-based image retrieval using “body plans”: Result)
Natural Language Processing: Automatic Topic Assignment
• Developed automatic categorization/disambiguation method to the point where topic assignment (but not disambiguation) appears feasible.
• Ran controlled experiment:
– Took Yahoo as ground truth.
– Chose 9 overlapping categories; took 1000 web pages from Yahoo as input.
– Result: 84% precision; 48% recall (using top 5 of 1073 categories)
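The slide does not say exactly how precision and recall were computed for the top-5 assignments. As a hedged illustration, the sketch below uses one common micro-averaged definition: precision is the fraction of assigned categories that are correct, and recall is the fraction of ground-truth categories that were assigned. The function name, page identifiers, and category labels are all made up.

```python
from typing import Dict, List, Set, Tuple


def precision_recall_at_k(
    ranked: Dict[str, List[str]],
    truth: Dict[str, Set[str]],
    k: int = 5,
) -> Tuple[float, float]:
    """Micro-averaged precision and recall over the top-k categories per page.

    ranked maps each page to its categories in ranked order (best first);
    truth maps each page to its set of ground-truth categories.
    """
    assigned_total = 0
    correct_total = 0
    truth_total = 0
    for page, categories in ranked.items():
        top_k = set(categories[:k])
        gold = truth.get(page, set())
        assigned_total += len(top_k)
        correct_total += len(top_k & gold)
        truth_total += len(gold)
    precision = correct_total / assigned_total if assigned_total else 0.0
    recall = correct_total / truth_total if truth_total else 0.0
    return precision, recall


# Made-up toy data: two pages, ranked category guesses, and ground truth.
ranked = {"p1": ["a", "b", "c"], "p2": ["d", "e", "f"]}
truth = {"p1": {"a", "c"}, "p2": {"x"}}
precision, recall = precision_recall_at_k(ranked, truth, k=2)
```

With `k=2`, the toy data gives one correct assignment out of four made (precision 0.25) and one of three ground-truth categories found (recall about 0.33). Other definitions, such as counting a page as correct when any true category appears in its top k, would give different numbers.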
Distributed Resource Discovery and Structured Data Searching With Cheshire II
Ray R. Larson
School of Information Management & Systems
University of California, Berkeley
Research Areas
• Goals are
– Practical application of existing Digital Library technologies to some large-scale cross-domain collections
  • Evaluation of distributed search in cross-domain environment
– Theoretical examination and evaluation of next-generation designs for systems architecture and distributed cross-domain searching for DLs
Approach
• For the first goal, we are implementing a distributed search system based on international standards (Z39.50 and SGML/XML) using the Cheshire II information retrieval system
• Databases include:
– HE Archives hub
– Arts and Humanities Data Service (AHDS)
– MASTER
– CURL (Consortium of University Research Libraries)
– Online Archive of California (OAC)
– Making of America II (MOA2)
Current Usage of Cheshire II
• Web clients for:
– Berkeley NSF/NASA/ARPA Digital Library
– World Conservation Digital Library
– SunSite (UC Berkeley Science Libraries)
– University of Liverpool
– Higher Education Archives Hub
  • Glasgow, Edinburgh, Bath, Liverpool, King's College London, University College London, Nottingham, Durham, School of Oriental and African Studies, Manchester, Southampton, Warwick and others (to be expanded)
– University of Essex, HDS (part of AHDS)
– Oxford Text Archive (test only)
– California Sheet Music Project
– Cha-Cha (Berkeley Intranet Search Engine)
– Berkeley Metadata project cross-language demo
– Univ. of Virginia (test implementations)
– Cheshire ranking algorithm is basis for original Inktomi
Current and Upcoming Usage of Cheshire II
• DIEPER (Digitized European Periodicals project)
  – http://gdz.sub.uni-goettingen.de/dieper/
• NESSTAR (Networked Social Science Tools and Resources)
  – http://www.nesstar.org/
• FASTER (Flexible Access to Statistics Tables and Electronic Resources), a continuation of NESSTAR
  – http://www.faster-data.org/
• MASTER (Manuscript Access through Standards for Electronic Records)
  – http://www.cta.dmu.ac.uk/projects/master/
Upcoming Usage of Cheshire II
• ZETOC (Prototype of the Electronic Table of Contents from the British Library)
  – http://zetoc.mimas.ac.uk/
• Archives Hub
  – http://www.archiveshub.ac.uk/
• RSLP Palaeography project
  – http://www.palaeography.ac.uk/
• British Natural History Museum, London
• JISC data services directory hosted by MIMAS
• Resource Discovery Network (RDN), where it will be used to harvest RDN records from the various hubs using OAI and provide search
Client/Server Architecture
• Server Supports:
  – Database storage
  – Indexing
  – Z39.50 access to local data
  – Boolean and Probabilistic Searching
  – Relevance Feedback
  – External SQL database support
• Client Supports:
  – Programmable (Tcl/Tk) Graphical User Interface
  – Z39.50 access to remote servers
  – SGML/XML & MARC formatting
• Combined Client/Server CGI scripting via WebCheshire used for web applications
• Mozilla client (under development in Liverpool)
SGML/XML Support
• Example XML record for a DL document
<ELIB-BIB>
  <BIB-VERSION>ELIB-v1.0</BIB-VERSION>
  <ID>756</ID>
  <ENTRY>June 12, 1996</ENTRY>
  <DATE>June 1996</DATE>
  <TITLE>Cumulative Watershed Effects: Applicability of Available Methodologies to the Sierra Nevada</TITLE>
  <ORGANIZATION>University of California</ORGANIZATION>
  <TYPE>report</TYPE>
  <AUTHOR-INSTITUTIONAL>USDA Forest Service</AUTHOR-INSTITUTIONAL>
  <AUTHOR-PERSONAL>Neil H. Berg</AUTHOR-PERSONAL>
  <AUTHOR-PERSONAL>Ken B. Roby</AUTHOR-PERSONAL>
  <AUTHOR-PERSONAL>Bruce J. McGurk</AUTHOR-PERSONAL>
  <PROJECT>SNEP</PROJECT>
  <SERIES>Vol 3</SERIES>
  <PAGES>40</PAGES>
  <TEXT-REF>/elib/data/docs/0700/756/HYPEROCR/hyperocr.html</TEXT-REF>
  <PAGED-REF>/elib/data/docs/0700/756/OCR-ASCII-NOZONE</PAGED-REF>
</ELIB-BIB>
SGML/XML Support
• Example SGML/MARC Record
<USMARC Material="BK" ID="00000003">
<leader><LRL>00722</LRL><RecStat>n</RecStat> <RecType>a</RecType><BibLevel>m</BibLevel><UCP></UCP><IndCount>2</IndCount> <SFCount>2</SFCount><BaseAddr>00229</BaseAddr><EncLevel> </EncLevel> <DscCatFm></DscCatFm><LinkRec></LinkRec><EntryMap><FLength>4</Flength><SCharPos> 5</SCharPos><IDLength>0</IDLength><EMUCP></EMUCP></EntryMap></Leader>
<Directry>001001400000005001700014008004100031010001400072035002000086035001700106100001900123245010500142250001100247260003200258300003300290504005000323650003600373700002200409700002200431950003200453998000700485</Directry>
<VarFlds>
<VarCFlds><Fld001>CUBGGLAD1282B</Fld001><Fld005>19940414143202.0</Fld005> <Fld008>830810 1983 nyu eng u</Fld008></VarCFlds>
<VarDFlds>
<NumbCode><Fld010 I1="Blank" I2="Blnk"><a>82019962 </a></Fld010> <Fld035 I1="Blank" I2="Blnk"><a>(CU)ocm08866667</a></Fld035><Fld035 I1="Blank" I2="Blnk"><a>(CU)GLAD1282</a></Fld035></NumbCode>
<MainEnty><Fld100 NameType="Single" I2=""><a>Burch, John G.</a></Fld100></MainEnty>
<Titles><Fld245 AddEnty="Yes" NFChars="0"><a>Information systems :</a><b>theory and practice /</b><c>John G. Burch, Jr., Felix R. Strater, Gary Grudnitski</c></Fld245></Titles>
<EdImprnt><Fld250 I1="Blank" I2="Blnk"><a>3rd ed</a></Fld250><Fld260 I1="" I2="Blnk"><a>New York :</a><b>J. Wiley,</b><c>1983</c></Fld260></EdImprnt>
<PhysDesc><Fld300 I1="Blank" I2="Blnk"><a>xvi, 632 p. :</a><b>ill. ;</b><c>24 cm</c></Fld300></PhysDesc>
<Series></Series>
<Notes><Fld504 I1="Blank" I2="Blnk"><a>Includes bibliographical references and index</a></Fld504></Notes>
<SubjAccs><Fld650 SubjLvl="NoInfo" SubjSys="LCSH"><a>Management information systems.</a></Fld650> ...
Component Extraction and Retrieval
• Any sub-elements of an SGML/XML document can be defined as a separately indexed “component”.
• Components can be ranked and retrieved independently of the source document (but linked back to their original source)
• For example, paragraphs and abstracts in the full text of documents could be defined as components to provide paragraph-level search
• Example: Glasier archives…
Component Extraction and Retrieval
• The Glasier archive is an EAD document (1.9 MB in size)
• Contains “Series, Subseries, and Item level” descriptions of things in the archive
Excerpt from Glasier Archive
<c level="subseries">
  <did>
    <head>GP-1-1: General correspondence. Public letters.</head>
    <unitid id="gp-1-1">GP-1-1</unitid>
    <unittitle>Glasier Papers. General correspondence. Public letters.</unittitle>
  </did>
  <arrangement><head>Arrangement </head><p>Public letters arranged alphabetically within each year </p></arrangement>
  <c level="item" langmaterial="eng">
    <did>
      <unitid id="gp-1-1-0001">GP-1-1-0001</unitid>
      <unittitle>Letter from Richard Murray. <geogname>Glasgow</geogname>; <unitdate>7 Apr 1879</unitdate>.</unittitle>
      <origination><persname>Murray, Richard</persname></origination>
      <physdesc><extent>1 letter</extent></physdesc>
    </did>
    <note><p>Employment reference for J.B.G. as draughtsman<subject>Glasier, John Bruce</subject></p></note>
  </c>
  ETC…
Example Component Def…
<COMPONENTS>
<COMPONENTDEF>
<COMPONENTNAME> /home/ray/Work/Glasier_test/indexes/COMPONENT_DB1 </COMPONENTNAME>
<COMPONENTNORM>NONE</COMPONENTNORM>
<COMPSTARTTAG>
<tagspec><FTAG> c </FTAG><ATTR> level <VALUE>item</VALUE></ATTR></tagspec>
</COMPSTARTTAG>
<COMPONENTINDEXES>
<!-- First index def -->
…
Components
• Both individual tags and “ranges” with a starting tag and (different) ending tag can be used as components
• Components permit parts of complex SGML/XML documents to be treated as separate documents
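As a rough illustration of the component idea (not Cheshire II's own code), the sketch below uses Python's standard xml.etree.ElementTree to pull every `c` element with `level="item"` out of an abridged version of the Glasier excerpt and treat each one as a small, separately searchable record that still points back to its source. The function name `extract_components` and the record layout are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Abridged from the Glasier excerpt shown in the slides.
EAD_SNIPPET = """<c level="subseries">
  <did>
    <unitid id="gp-1-1">GP-1-1</unitid>
    <unittitle>Glasier Papers. General correspondence. Public letters.</unittitle>
  </did>
  <c level="item">
    <did>
      <unitid id="gp-1-1-0001">GP-1-1-0001</unitid>
      <unittitle>Letter from Richard Murray.</unittitle>
    </did>
  </c>
</c>"""


def extract_components(xml_text, tag, attr, value):
    """Return one record per element matching tag[@attr=value].

    Each record keeps its own id and flattened text, plus the id of the
    enclosing source document so a hit can be linked back to it.
    """
    root = ET.fromstring(xml_text)
    source_id = root.findtext("did/unitid")
    components = []
    for elem in root.iter(tag):  # iter() also visits the root element
        if elem.get(attr) == value:
            text = " ".join("".join(elem.itertext()).split())
            components.append({
                "id": elem.findtext("did/unitid"),
                "text": text,
                "source": source_id,
            })
    return components


items = extract_components(EAD_SNIPPET, "c", "level", "item")
for item in items:
    print(item["id"], "from", item["source"])
```

Each returned record could then be indexed and ranked on its own, which is the effect the component definition is meant to achieve.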
Cheshire II Searching
[Diagram: local and remote systems connected across the Internet via Z39.50; collections shown include Images and Scanned Text.]
Probabilistic Retrieval: Logistic Regression attributes
Average Absolute Query Frequency: $X_1 = \frac{1}{M} \sum_{j=1}^{M} \log QAF_{t_j}$
Query Length: $X_2 = \sqrt{QL}$
Average Absolute Document Frequency: $X_3 = \frac{1}{M} \sum_{j=1}^{M} \log DAF_{t_j}$
Document Length: $X_4 = \sqrt{DL}$
Average Inverse Document Frequency: $X_5 = \frac{1}{M} \sum_{j=1}^{M} \log IDF_{t_j}$
Inverse Document Frequency: $IDF_{t_j} = \frac{N - n_{t_j}}{n_{t_j}}$
Number of Terms in common between query and document (logged): $X_6 = \log M$
Probabilistic Retrieval: Logistic Regression
Probability of relevance is based on logistic regression from a sample set of documents to determine values of the coefficients (TREC). At retrieval the probability estimate is obtained from the log odds of relevance:
$$\log O(R \mid Q, D) = c_0 + \sum_{i=1}^{6} c_i X_i, \qquad P(R \mid Q, D) = \frac{1}{1 + e^{-\log O(R \mid Q, D)}}$$
for the 6 X attribute measures shown on the previous slide.
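To make the arithmetic concrete, here is a minimal sketch that computes six attributes for one query/document pair and turns them into a probability with the logistic function. It assumes the usual Berkeley formulation (X1, X3 and X5 are averaged logs over the M matching terms, X2 and X4 are square roots of query and document length, X6 is log M, and IDF is (N - n)/n). The coefficient values in `COEFFS` are placeholders invented for illustration, not the TREC-trained values used by Cheshire II, and the function names are hypothetical.

```python
import math

# HYPOTHETICAL coefficients c0..c6, for illustration only.
# The real values come from logistic regression on TREC training data.
COEFFS = [-3.5, 0.4, -0.2, 0.3, -0.1, 0.6, 1.0]


def lr_attributes(query_tf, doc_tf, doc_freq, n_docs):
    """Compute X1..X6 for one query/document pair.

    query_tf: term -> frequency in the query
    doc_tf:   term -> frequency in the document
    doc_freq: term -> number of documents containing the term
              (assumed to satisfy 0 < n < n_docs for matching terms)
    Returns None when query and document share no terms.
    """
    common = [t for t in query_tf if t in doc_tf]
    m = len(common)
    if m == 0:
        return None
    x1 = sum(math.log(query_tf[t]) for t in common) / m
    x2 = math.sqrt(sum(query_tf.values()))
    x3 = sum(math.log(doc_tf[t]) for t in common) / m
    x4 = math.sqrt(sum(doc_tf.values()))
    x5 = sum(math.log((n_docs - doc_freq[t]) / doc_freq[t]) for t in common) / m
    x6 = math.log(m)
    return [x1, x2, x3, x4, x5, x6]


def relevance_probability(attrs, coeffs=COEFFS):
    """Linear combination gives the log odds; the logistic maps it to (0, 1)."""
    log_odds = coeffs[0] + sum(c * x for c, x in zip(coeffs[1:], attrs))
    return 1.0 / (1.0 + math.exp(-log_odds))


attrs = lr_attributes(
    query_tf={"cat": 1, "hat": 1},
    doc_tf={"cat": 3, "mat": 2},
    doc_freq={"cat": 10, "hat": 5, "mat": 2},
    n_docs=100,
)
prob = relevance_probability(attrs)
print(attrs, prob)
```

Documents would then be ranked by this probability estimate; since the logistic function is monotonic, ranking by the log odds gives the same order.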
Cheshire Probabilistic Retrieval
• Uses the Logistic Regression ranking method developed at Berkeley, with a new algorithm for weight calculation at retrieval time.
• Z39.50 “relevance” operator used to indicate probabilistic search
• Any index can have Probabilistic searching performed:
  – zfind topic @ “cheshire cats, looking glasses, march hares and other such things”
  – zfind title @ caucus races
• Boolean and Probabilistic elements can be combined:
  – zfind topic @ government documents and title guidebooks
Combining Search Types
• It is also possible to combine the results of multiple independent searches into a single result set (using the Z39.50 SORT service of the Cheshire system). E.g.:
  – Search of Full Text (Probabilistic)
  – Search of Full Text (Boolean)
  – Search of Components (Probabilistic)
  – Search of Titles (Probabilistic)
  – Search of Subject Headings (Probabilistic)
• All result sets are merged and re-ranked to produce the final list.
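The slides do not say which rule is used to merge and re-rank the result sets. The sketch below shows one simple possibility under that caveat: min-max normalize the scores within each result set and sum them per document (in the spirit of the CombSUM fusion rule). The function name and the sample scores are invented for the example; an unscored Boolean result set is represented by giving every hit the same score.

```python
def merge_result_sets(result_sets):
    """Merge several {doc_id: score} result sets into one ranked list.

    Scores are min-max normalized within each set so that sets produced
    by different search types are comparable, then summed per document.
    A set whose scores are all equal (e.g. a Boolean search) contributes
    1.0 for each of its hits.
    """
    merged = {}
    for results in result_sets:
        if not results:
            continue
        lo, hi = min(results.values()), max(results.values())
        span = hi - lo
        for doc_id, score in results.items():
            norm = (score - lo) / span if span else 1.0
            merged[doc_id] = merged.get(doc_id, 0.0) + norm
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)


full_text_prob = {"d1": 0.9, "d2": 0.5, "d3": 0.1}
full_text_bool = {"d2": 1.0, "d4": 1.0}
titles_prob = {"d2": 0.7, "d1": 0.2}

ranked = merge_result_sets([full_text_prob, full_text_bool, titles_prob])
for doc_id, score in ranked:
    print(doc_id, round(score, 3))
```

A document found by several independent searches accumulates evidence from each, so it tends to rise in the final list.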
Distributed Search: The Problem
• Hundreds or Thousands of servers with databases ranging widely in content, topic, format
  – Broadcast search is expensive in terms of bandwidth and in processing too many irrelevant results
  – How to select the “best” ones to search?
    • What to search first
    • Which to search next
  – Topical/domain constraints on the search selections
  – Variable contents of database (metadata only, full text…)
An Approach for Cross-Domain Resource Discovery
• MetaSearch
  – New approach to building metasearch based on Z39.50
  – Instead of using broadcast search we are using two Z39.50 Services
    • Identification of database metadata using Z39.50 Explain
    • Extraction of distributed indexes using Z39.50 SCAN
• Evaluation
  – How efficiently can we build distributed indexes? Very…
  – How effectively can we choose databases using the index?
  – How effective is merging search results from multiple sources?
  – Hierarchies of servers (general/meta-topical/individual)?
Z39.50 Overview
[Diagram: user interfaces (UI) exchange requests over the Internet; on each side a Z39.50 layer performs Map Query and Map Results between the UI and the Search Engine.]
Z39.50 Explain
• Explain supports searches for
  – Server-Level metadata
    • Server Name
    • IP Addresses
    • Ports
  – Database-Level metadata
    • Database name
    • Search attributes (indexes and combinations)
  – Support metadata (record syntaxes, etc)
Z39.50 SCAN
• Originally intended to support Browsing
• Query for
  – Database
  – Attributes plus Term (i.e., index and start point)
  – Step Size
  – Number of terms to retrieve
  – Position in Response set
• Results
  – Number of terms returned
  – List of Terms and their frequency in the database (for the given attribute combination)
Z39.50 SCAN Results
% zscan title cat 1 20 1
{SCAN {Status 0}{Terms 20}{StepSize 1}{Position 1}}{cat 27}{cat-fight 1}{catalan 19}{catalogu 37}{catalonia 8}{catalyt 2}{catania 1}{cataract 1}{catch 173}{catch-all 3}{catch-up 2} …
% zscan topic cat 1 20 1
{SCAN {Status 0}{Terms 20}{StepSize 1}{Position 1}}{cat 706}{cat-and-mouse 19}{cat-burglar 1}{cat-carrying 1}{cat-egory 1}{cat-fight 1}{cat-gut 1}{cat-litter 1}{cat-lovers 2}{cat-pee 1}{cat-run 1}{cat-scanners 1} …
Syntax: zscan indexname1 term stepsize number_of_terms pref_pos
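A harvester needs the SCAN response as data rather than text. The following sketch parses the brace-delimited output format shown above into a header dictionary and a list of (term, frequency) pairs using Python's re module. It assumes single-token terms and the exact layout in the sample output; `parse_zscan` is a name made up for this example, not part of the Cheshire client.

```python
import re


def parse_zscan(output):
    """Split a zscan response into header fields and (term, freq) pairs."""
    # The header is the leading {SCAN {...}{...}} group.
    header_match = re.match(r"\s*\{SCAN\s*((?:\{[^{}]*\}\s*)*)\}", output)
    if header_match is None:
        raise ValueError("not a SCAN response")
    header = {
        key: int(value)
        for key, value in re.findall(r"\{(\w+) (\d+)\}", header_match.group(1))
    }
    # Everything after the header is a run of {term frequency} groups.
    body = output[header_match.end():]
    terms = [
        (term, int(freq))
        for term, freq in re.findall(r"\{([^{}\s]+) (\d+)\}", body)
    ]
    return header, terms


sample = (
    "{SCAN {Status 0}{Terms 20}{StepSize 1}{Position 1}}"
    "{cat 27}{cat-fight 1}{catalan 19}{catalogu 37}"
)
header, terms = parse_zscan(sample)
print(header)
print(terms)
```

The term/frequency pairs are the raw material for the distributed index: one SCAN sweep per index yields the vocabulary of a remote database without retrieving any records.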
MetaSearch Server Index Creation
• For all servers, or a topical subset…
  – Get Explain information (especially DC mappings)
  – For each index (or each DC index)
    • Use SCAN to extract terms and frequency
    • Add term + freq + source index + database metadata to the metasearch “Collection Document” (XML)
  – Planned extensions:
    • Post-Process indexes (especially Geo Names, etc) for special types of data
      – e.g. create “geographical coverage” indexes
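The harvesting loop above can be sketched as follows. The slides do not show the Collection Document schema, so the element and attribute names here (`collection`, `index`, `term`, `freq`) are hypothetical, as are the host name and the sample frequencies. The Explain and SCAN network calls are replaced by an already-harvested dictionary so that the example stays self-contained.

```python
import xml.etree.ElementTree as ET


def build_collection_document(db_name, host, port, index_terms):
    """Build an XML "collection document" for one remote database.

    index_terms maps an index name to a list of (term, frequency) pairs,
    as would be obtained by running SCAN over each index.
    Element and attribute names are illustrative assumptions.
    """
    root = ET.Element(
        "collection",
        {"database": db_name, "host": host, "port": str(port)},
    )
    for index_name, terms in index_terms.items():
        index_elem = ET.SubElement(root, "index", {"name": index_name})
        for term, freq in terms:
            term_elem = ET.SubElement(index_elem, "term", {"freq": str(freq)})
            term_elem.text = term
    return root


# Stand-in for the output of Explain + SCAN against one server.
harvested = {
    "title": [("cat", 27), ("catalan", 19)],
    "topic": [("cat", 706)],
}
doc = build_collection_document("example_db", "z3950.example.org", 210, harvested)
print(ET.tostring(doc, encoding="unicode"))
```

Because each collection document is itself an XML record, the metasearch server can index and search these documents with the same machinery it uses for ordinary records.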
MetaSearch Approach
[Diagram: a MetaSearch Server maps Explain and Scan queries across the Internet to remote servers, each with Map Query / Map Results layers in front of a Search Engine (serving DB 1 through DB 6), and stores the harvested results in a Distributed Index.]
Known Problems
• Not all Z39.50 Servers support SCAN or Explain
• Solutions:
  – Probing for attributes instead of Explain (e.g. DC attributes or analogs)
  – We also support OAI and can extract OAI metadata for servers that support OAI
• Collection Documents are static and need to be replaced when the associated collection changes
Evaluation
• Test Environment
  – TREC Tipster and FT data (approx. 3.5 GB)
  – Partitioned into 236 smaller collections based on source and (for TIPSTER) date by month (Distributed Search Testbed built by French, et al.)
    • High size variability (Range from 1 to thousands of docs)
    • 21,225,299 Words, 142,345,670 chars total for harvested records
• Efficiency (old data)
  – Average of 23.07 seconds per database to SCAN each database (3.4 indexes on average)
  – Average of 14.07 seconds excluding FT (131 seconds for FT database with 7 indexes)
  – Now collecting more information, so harvest times are longer, but still under one minute on average
Evaluation
• Effectiveness
  – Still working on evaluation comparing our DB ranking with the TIPSTER relevance judgements
  – Can be compared with published selection methods (CORI, GlOSS, etc.) using the same testbed
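To show what "DB ranking" over harvested term frequencies can look like, here is a deliberately simple sketch. It is not the project's ranking method, nor CORI or GlOSS: it scores each database by summing, over the query terms, a damped term frequency weighted by how few databases contain the term. The function name, the weighting formula and the sample frequencies are all assumptions for illustration.

```python
import math


def rank_databases(query_terms, collection_index):
    """Rank databases for a query using harvested term frequencies.

    collection_index maps a database name to {term: frequency}.
    Score = sum over query terms of log(1 + freq) * log(1 + N / df),
    where N is the number of databases and df the number containing the term.
    """
    n_dbs = len(collection_index)
    db_count = {
        t: sum(1 for terms in collection_index.values() if t in terms)
        for t in query_terms
    }
    scores = {}
    for db, terms in collection_index.items():
        score = 0.0
        for t in query_terms:
            freq = terms.get(t, 0)
            if freq:
                score += math.log(1 + freq) * math.log(1 + n_dbs / db_count[t])
        scores[db] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


index = {
    "db_a": {"cat": 706, "dog": 3},
    "db_b": {"cat": 27},
    "db_c": {"fish": 10},
}
ranking = rank_databases(["cat"], index)
for db, score in ranking:
    print(db, round(score, 3))
```

Only the top-ranked databases would then be searched, which avoids the bandwidth and noise costs of broadcasting the query to every server.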
Future
• Testing of variant algorithms for ranking collections
• Application to real systems and testing in a production environment (Archives Hub)
• Logically Clustering servers by topic
• Meta-Meta Servers (treating the MetaSearch database as just another database)
Distributed Metadata Servers
[Diagram: hierarchy of Replicated servers, Meta-Topical Servers, General Servers, and Database Servers.]
Further Information
• Full Cheshire II client and server source is available at ftp://cheshire.berkeley.edu/pub/cheshire/
  – Includes HTML documentation
• Project Web Site: http://cheshire.berkeley.edu/