1 CS 430 / INFO 430 Information Retrieval Lecture 22 Non-Textual Materials 1

Post on 22-Dec-2015


CS 430 / INFO 430 Information Retrieval

Lecture 22

Non-Textual Materials 1


Course Administration

Thursday, November 11

No office hours

Tuesday, November 16

No class or office hours

Wednesday, November 17

Discussion class requires you to read three short papers.

Wednesday, December 1

Discussion class requires you to search for and read materials on a specified topic.


Course Administration

Discussion classes

• Attend!

• Speak!


The Google File System

Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, "The Google File System." 19th ACM Symposium on Operating Systems Principles, October 2003. http://www.cs.rochester.edu/sosp2003/papers/p125-ghemawat.pdf

"Component failures are the norm rather than the exception.... The quantity and quality of the components virtually guarantee that some are not functional at any given time and some will not recover from their current failures. We have seen problems caused by application bugs, operating system bugs, human errors, and the failures of disks, memory, connectors, networking, and power supplies...."


Examples of Non-textual Materials

Content                  Attribute
maps                     lat. and long., content
photograph               subject, date and place
bird songs and images    field mark, bird song
software                 task, algorithm
data set                 survey characteristics
video                    subject, date, etc.


Possible Approaches to Information Discovery for Non-text Materials

Human indexing

• Manually created metadata records

Automated information retrieval

• Automatically created metadata records (e.g., image recognition)

• Context: associated text, links, etc. (e.g., Google image search)

• Multimodal: combine information from several sources

User expertise

• Browsing: user interface design


Example 1: Blobworld


Surrogates

Surrogates for searching

• Catalog records

• Finding aids

• Classification schemes

Surrogates for browsing

• Summaries (thumbnails, titles, skims, etc.)


Catalog Records for Non-Textual Materials

• General metadata standards, such as Dublin Core and MARC, can be used to create a textual catalog record of non-textual items.

• Subject-based metadata standards apply to specific categories of materials, e.g., FGDC for geospatial materials.

• Text-based searching methods can be used to search these catalog records.
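The point that ordinary text retrieval works on catalog records can be illustrated with a minimal sketch. The field names below (title, subject, type, date, coverage) are Dublin Core elements, but the two records, their identifiers, and the helper functions are invented for illustration and are not taken from any real catalog.

```python
import re

# Hypothetical catalog records; the field names are Dublin Core elements,
# but the values are invented for illustration.
RECORDS = [
    {"id": "photo-001", "title": "Threshing crew on a wheat farm",
     "subject": "agriculture; harvest", "type": "StillImage", "date": "1910"},
    {"id": "map-001", "title": "Street map of Santa Barbara",
     "subject": "streets; cities", "type": "Image",
     "coverage": "Santa Barbara"},
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(records):
    """Build an inverted index: term -> set of record ids."""
    index = {}
    for record in records:
        for field, value in record.items():
            if field == "id":
                continue
            for term in tokenize(value):
                index.setdefault(term, set()).add(record["id"])
    return index


def search(index, query):
    """Return the ids of records that contain every query term (Boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return set()
    matches = [index.get(term, set()) for term in terms]
    return set.intersection(*matches)


INDEX = build_index(RECORDS)
print(search(INDEX, "wheat harvest"))  # {'photo-001'}
```

The photograph itself is never examined; only the words in its surrogate record are indexed, which is why the quality of the search depends entirely on the quality of the cataloguing.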


Automated Creation of Metadata Records

Sometimes it is possible to generate metadata automatically from the content of a digital object. The effectiveness varies from field to field.

Examples

• Images -- characteristics of color, texture, shape, etc. (crude)

• Music -- optical recognition of score (good)

• Bird song -- spectral analysis of sounds (good)

• Fingerprints (good)
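As a rough illustration of the "crude" image case, the sketch below derives a coarse color histogram from raw pixel values and compares two histograms by intersection. This is one common, simple technique and is not claimed to be the method used by any system named in this lecture; the pixel lists and function names are assumptions made for the example.

```python
def color_histogram(pixels, bins_per_channel=2):
    """Count pixels in a coarse RGB grid and return normalized frequencies.

    `pixels` is a list of (r, g, b) tuples with values in 0..255.
    """
    if not pixels:
        return {}
    step = 256 // bins_per_channel
    counts = {}
    for r, g, b in pixels:
        key = (r // step, g // step, b // step)
        counts[key] = counts.get(key, 0) + 1
    total = len(pixels)
    return {key: count / total for key, count in counts.items()}


def histogram_intersection(h1, h2):
    """Similarity in [0, 1]: the overlap between two normalized histograms."""
    keys = set(h1) | set(h2)
    return sum(min(h1.get(k, 0.0), h2.get(k, 0.0)) for k in keys)


# Tiny invented "images": mostly blue sky with a cloud, open sea, and grass.
sky = [(30, 60, 200)] * 3 + [(240, 240, 240)]
sea = [(20, 80, 180)] * 4
grass = [(40, 180, 50)] * 4

print(histogram_intersection(color_histogram(sky), color_histogram(sea)))    # 0.75
print(histogram_intersection(color_histogram(sky), color_histogram(grass)))  # 0.0
```

The histogram says that the sky and sea images are similar because both are mostly blue, which shows why such metadata is crude: it captures color distribution, not subject.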


Collections: Finding Aids and the EAD

Finding aid

• A list, inventory, index or other textual document created by an archive, library or museum to describe holdings.

• May provide fuller information than a catalog record normally contains, or may be less specific.

• Does not necessarily have a detailed record for every item.

The Encoded Archival Description (EAD)

• A format (XML DTD) used to encode electronic versions of finding aids.

• Heavily structured -- much of the information is derived from hierarchical relationships.
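To show what "derived from hierarchical relationships" can mean in practice, here is a minimal sketch that walks a heavily simplified EAD-like fragment and builds a full descriptive path for each component from its ancestors. The element names (archdesc, did, unittitle, dsc, c01, c02) follow EAD, but the fragment is far smaller than a real finding aid and the collection it describes is invented.

```python
import xml.etree.ElementTree as ET

# Simplified, invented finding aid; real EAD documents carry much more detail.
EAD_FRAGMENT = """
<archdesc level="collection">
  <did><unittitle>Example Family Photograph Collection</unittitle></did>
  <dsc>
    <c01 level="series">
      <did><unittitle>Farm life</unittitle></did>
      <c02 level="file">
        <did><unittitle>Threshing scenes</unittitle></did>
      </c02>
      <c02 level="file">
        <did><unittitle>Sod houses</unittitle></did>
      </c02>
    </c01>
  </dsc>
</archdesc>
"""


def walk(element, ancestors=()):
    """Yield the title path of every component, inherited from its ancestors."""
    title = element.findtext("did/unittitle")
    path = ancestors + ((title,) if title else ())
    if title:
        yield " > ".join(path)
    for child in element:
        if child.tag != "did":
            yield from walk(child, path)


PATHS = list(walk(ET.fromstring(EAD_FRAGMENT.strip())))
for path in PATHS:
    print(path)
```

A file-level component such as "Sod houses" says little on its own; most of its meaning comes from the series and collection above it, so an indexer would normally index the inherited path rather than the component title alone.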


Collection-Level Metadata

Collection-level metadata is used to describe a group of items.

For example, one record might describe all the images in a photographic collection.

Note: There are proposals to add collection-level metadata records to Dublin Core. However, a collection is not a document-like object.


Example 2: Photographs

Photographs in the Library of Congress's American Memory collections

In American Memory, each photograph is described by a MARC record.

The photographs are grouped into collections, e.g., The Northern Great Plains, 1880-1920: Photographs from the Fred Hultstrand and F.A. Pazandak Photograph Collections

Information discovery is by:

• searching the catalog records

• browsing the collections


Photographs: Cataloguing Difficulties

Automatic

• Image recognition methods are very primitive

Manual

• Photographic collections can be very large

• Many photographs may show the same subject

• Photographs have little or no internal metadata (no title page)

• The subject of a photograph may not be known (Who are the people in a picture? Where is the location?)


Photographs: Difficulties for Users

Searching

• Often difficult to narrow the selection down by searching -- browsing is required

• Criteria may be different from those in catalog (e.g., graphical characteristics)

Browsing

• Offline. Handling many photographs is tedious. Photographs can be damaged by repeated handling

• Online. Viewing many images can be tedious. Screen quality may be inadequate.


Example 3: Geospatial Information

Example: Alexandria Digital Library at the University of California, Santa Barbara

• Funded by the NSF Digital Libraries Initiative since 1994.

• Collections include any data referenced by a geographical footprint: terrestrial maps, aerial and satellite photographs, astronomical maps, databases, related textual information

• Program of research with practical implementation at the university's map library


Alexandria User Interface


Alexandria: Computer Systems and User Interfaces

Computer systems

• Digitized maps and geospatial information -- large files

• Wavelets provide multi-level decomposition of image

-> first level is a small coarse image

-> extra levels provide greater detail
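The idea of a coarse first level refined by extra levels can be sketched with the Haar wavelet, the simplest wavelet. The slides do not say which wavelet Alexandria uses, so Haar is an assumption here, and the example works on a single row of invented pixel values rather than a full two-dimensional image. The row length must be divisible by 2 raised to the number of levels.

```python
def haar_step(signal):
    """Split an even-length signal into pairwise averages and differences."""
    averages = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    details = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return averages, details


def decompose(signal, levels):
    """Return the coarsest approximation and the detail lists, coarsest first."""
    current = list(signal)
    details = []
    for _ in range(levels):
        current, d = haar_step(current)
        details.insert(0, d)
    return current, details


def reconstruct(coarse, details, upto=None):
    """Rebuild the signal using only the first `upto` detail levels.

    Levels that are left out are treated as zero, which yields a blurrier result.
    """
    use = len(details) if upto is None else upto
    current = list(coarse)
    for level, d in enumerate(details):
        if level >= use:
            d = [0.0] * len(d)
        rebuilt = []
        for average, diff in zip(current, d):
            rebuilt.extend([average + diff, average - diff])
        current = rebuilt
    return current


row = [9, 7, 3, 5, 6, 10, 2, 6]           # one row of invented pixel values
coarse, details = decompose(row, 3)
print(coarse)                              # [6.0]
print(reconstruct(coarse, details, 2))     # [8.0, 8.0, 4.0, 4.0, 8.0, 8.0, 4.0, 4.0]
print(reconstruct(coarse, details))        # [9.0, 7.0, 3.0, 5.0, 6.0, 10.0, 2.0, 6.0]
```

A server can send the coarse level first so that the user sees a small image quickly, then send detail levels only for the region and resolution the user asks for.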

User interfaces

• Small size of computer displays

• Slow performance of Internet in delivering large files

-> retain state throughout a session


Alexandria: Information Discovery

Metadata for information discovery

Coverage: geographical area covered, such as the city of Santa Barbara or the Pacific Ocean.

Scope: varieties of information, such as topographical features, political boundaries, or population density.

Latitude and longitude provide basic metadata for maps and for geographical features.


Gazetteer

Gazetteer: database and a set of procedures that translate representations of geospatial references:

place names, geographic features, coordinates, postal codes, census tracts

Search engine tailored to peculiarities of searching for place names.

Research is making steady progress on feature extraction (using automatic programs to identify objects in aerial photographs or printed maps), but this remains a topic for long-term research.


Gazetteers

The Alexandria Digital Library (ADL): a geolibrary at the University of California, Santa Barbara, in which a primary attribute of objects is location on Earth (e.g., map, satellite photograph).

Geographic footprint: latitude and longitude values that represent a point, a bounding box, a linear feature, or a complete polygonal boundary.

Gazetteer: list of geographic names, with geographic locations and other descriptive information.

Geographic name: proper name for a geographic place or feature (e.g., Santa Barbara County, Mount Washington, St. Francis Hospital, and Southern California)


Use of a Gazetteer

• Answers the "Where is" question; for example, "Where is Santa Barbara?"

• Translates between geographic names and locations. A user can find objects by matching the footprint of a geographic name to the footprints of the collection objects.

• Locates particular types of geographic features in a designated area. For example, a user can draw a box around an area on a map and find the schools, hospitals, lakes, or volcanoes in the area.
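Matching the footprint of a geographic name to the footprints of collection objects can be sketched with bounding boxes. The class, the gazetteer entry, the object identifiers, and all coordinates below are rough values invented for illustration; they are not taken from the Alexandria system. The overlap test ignores boxes that cross the 180-degree meridian.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    """A footprint in decimal degrees (south/west negative)."""
    south: float
    west: float
    north: float
    east: float

    def overlaps(self, other):
        """True if the two boxes share any area or touch at an edge."""
        return (self.west <= other.east and other.west <= self.east
                and self.south <= other.north and other.south <= self.north)


# Hypothetical gazetteer: geographic name -> approximate footprint.
GAZETTEER = {
    "Santa Barbara": BoundingBox(34.3, -119.9, 34.5, -119.6),
}

# Hypothetical collection objects, each with its own footprint.
COLLECTION = {
    "aerial-photo-17": BoundingBox(34.40, -119.75, 34.45, -119.68),
    "county-map-3": BoundingBox(34.0, -120.7, 35.1, -119.4),
    "tulsa-map-1": BoundingBox(36.0, -96.1, 36.3, -95.8),
}


def find_objects(place_name, gazetteer, collection):
    """Translate a name to a footprint, then return overlapping object ids."""
    footprint = gazetteer.get(place_name)
    if footprint is None:
        return []
    return sorted(object_id for object_id, box in collection.items()
                  if box.overlaps(footprint))


print(find_objects("Santa Barbara", GAZETTEER, COLLECTION))
# ['aerial-photo-17', 'county-map-3']
```

The same overlap test supports the third use above: a box drawn by the user on a map plays the role of the footprint, and the gazetteer entries themselves are the objects being matched.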


Alexandria Gazetteer: Example from a search on "Tulsa"

Feature name                              State  County  Type     Latitude  Longitude
Tulsa                                     OK     Tulsa   pop pl   360914N   0955933W
Tulsa Country Club                        OK     Osage   locale   360958N   0960012W
Tulsa County                              OK     Tulsa   civil    360600N   0955400W
Tulsa Helicopters Incorporated Heliport   OK     Tulsa   airport  360500N   0955205W
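The latitude and longitude values in this example appear to be packed as degrees, minutes, and seconds followed by a hemisphere letter (DDMMSS for latitude, DDDMMSS for longitude). Assuming that reading is correct, the following sketch converts such a value to signed decimal degrees, which is the form most footprint comparisons need. The function name is an assumption made for the example.

```python
def dms_to_decimal(value):
    """Convert 'DDMMSSN' or 'DDDMMSSW' to signed decimal degrees.

    South and west values are returned as negative numbers.
    """
    if len(value) not in (7, 8):
        raise ValueError(f"unexpected coordinate length: {value!r}")
    hemisphere = value[-1].upper()
    digits = value[:-1]
    if hemisphere not in ("N", "S", "E", "W") or not digits.isdigit():
        raise ValueError(f"unexpected coordinate format: {value!r}")
    seconds = int(digits[-2:])
    minutes = int(digits[-4:-2])
    degrees = int(digits[:-4])
    decimal = degrees + minutes / 60 + seconds / 3600
    return -decimal if hemisphere in ("S", "W") else decimal


# The first row of the example: the populated place named Tulsa.
latitude = dms_to_decimal("360914N")
longitude = dms_to_decimal("0955933W")
print(round(latitude, 4), round(longitude, 4))  # 36.1539 -95.9925
```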


Challenges for the Alexandria Gazetteer

Content standard: A standard conceptual schema for gazetteer information.

Feature types: A type scheme to categorize individual features that is rich in term variants and extensible.

Temporal aspects: Geographic names and attributes change through time.

"Fuzzy" footprints: Extent of a geographic feature is often approximate or ill-defined (e.g., Southern California).


Challenges for the Alexandria Gazetteer (continued)

Quality aspects:

(a) Indicate the accuracy of latitude and longitude data.

(b) Ensure that the reported coordinates agree with the other elements of the description.

Spatial extents:

(a) Points do not represent the extent of the geographic locations and are therefore only minimally useful.

(b) Bounding boxes often include too much territory (e.g., the bounding box for California also includes Nevada).
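The bounding-box problem can be made concrete with a small sketch. The L-shaped polygon below is an invented stand-in for a region with an irregular boundary, in arbitrary units rather than real coordinates, and the point-in-polygon test is the standard ray-casting method.

```python
def bounding_box(polygon):
    """Return (min_x, min_y, max_x, max_y) for a list of (x, y) vertices."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def in_box(point, box):
    """True if the point lies inside or on the edge of the box."""
    x, y = point
    min_x, min_y, max_x, max_y = box
    return min_x <= x <= max_x and min_y <= y <= max_y


def in_polygon(point, polygon):
    """Ray casting: count how many edges a ray going right from the point crosses."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Invented L-shaped region; its bounding box is the full 4 x 4 square.
REGION = [(0, 0), (4, 0), (4, 1), (1, 1), (1, 4), (0, 4)]
BOX = bounding_box(REGION)

outside_point = (3, 3)    # in the empty corner of the L
inside_point = (0.5, 2)   # in the vertical arm of the L

print(in_box(outside_point, BOX), in_polygon(outside_point, REGION))  # True False
print(in_box(inside_point, BOX), in_polygon(inside_point, REGION))    # True True
```

The first point is a false match for the bounding box, which is the same effect as a box around California also covering part of Nevada. A full polygonal boundary avoids this at the cost of more storage and a slower test.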


Alexandria Gazetteer

Alexandria Digital Library

Linda L. Hill, James Frew, and Qi Zheng, "Geographic Names: The Implementation of a Gazetteer in a Georeferenced Digital Library." D-Lib Magazine, 5(1), January 1999. http://www.dlib.org/dlib/january99/hill/01hill.html


Alexandria Thesaurus: Example

canals

A feature type category for places such as the Erie Canal.

Used for:

The category canals is used instead of any of the following.

canal bends, canalized streams, ditch mouths, ditches, drainage canals, drainage ditches, ... more ...

Broader Terms:

Canals is a sub-type of hydrographic structures.


Alexandria Thesaurus: Example (continued)

canals (continued)

Related Terms:

The following is a list of other categories related to canals (non-hierarchical relationships).

channels, locks, transportation features, tunnels

Scope Note:

Manmade waterway used by watercraft or for drainage, irrigation, mining, or water power.
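The canals entry can be represented as a small data structure, which makes the role of the "Used for" list clear: it maps variant terms to the single preferred category. The terms below are the ones shown in the example (the elided variants are omitted), while the dictionary layout and function names are assumptions made for this sketch, not the Alexandria thesaurus format.

```python
# The canals entry from the example, in a hypothetical dictionary layout.
THESAURUS = {
    "canals": {
        "used_for": ["canal bends", "canalized streams", "ditch mouths",
                     "ditches", "drainage canals", "drainage ditches"],
        "broader": ["hydrographic structures"],
        "related": ["channels", "locks", "transportation features", "tunnels"],
    },
}


def build_lookup(thesaurus):
    """Map every preferred term and every variant to its preferred term."""
    lookup = {}
    for preferred, entry in thesaurus.items():
        lookup[preferred] = preferred
        for variant in entry["used_for"]:
            lookup[variant] = preferred
    return lookup


def preferred_term(term, lookup):
    """Return the category to search under, or None if the term is unknown."""
    return lookup.get(term.strip().lower())


LOOKUP = build_lookup(THESAURUS)
print(preferred_term("Drainage ditches", LOOKUP))   # canals
print(THESAURUS["canals"]["broader"])               # ['hydrographic structures']
```

A user who searches for "ditches" is therefore shown features typed as canals, and the broader and related terms give the interface ways to widen or redirect the search.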