CS 430 / INFO 430 Information Retrieval

Lecture 16

Library Catalogs 1

Course Administration

Information Retrieval with High Recall

Full-text Indexing (automated)

• Text only. Most effective on medium-length documents on related topics.

• High recall requires tuning the system to the specific collection and skilled users.

Catalogs and Indexes (created manually)

• Can be used for all formats of material

• Requires close quality control of metadata creation

• High recall requires tuning the system to the specific collection and skilled users.

Descriptive metadata

Information discovery can be very effective when applied to metadata rather than raw information

• Allows fielded searching

author = "Goethe"

• Suitable for non-textual material

type = "picture" and subject = "Ithaca"

• Can be used with controlled vocabulary

language = "en" (English)
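The three query forms above can be sketched as a small fielded search over metadata records. This is only an illustration: the records, the field names beyond those on the slide, and the function name `fielded_search` are assumptions, and matching is by exact equality with all criteria combined by AND.

```python
# Hypothetical metadata records; field names follow the slide's examples.
RECORDS = [
    {"author": "Goethe", "type": "text", "subject": "Faust", "language": "de"},
    {"author": "Unknown", "type": "picture", "subject": "Ithaca", "language": "en"},
    {"author": "Unknown", "type": "picture", "subject": "Boston", "language": "en"},
]


def fielded_search(records, **criteria):
    """Return the records whose fields equal every given criterion (AND)."""
    return [
        record
        for record in records
        if all(record.get(field) == value for field, value in criteria.items())
    ]


# author = "Goethe"
goethe = fielded_search(RECORDS, author="Goethe")

# type = "picture" and subject = "Ithaca"
ithaca_pictures = fielded_search(RECORDS, type="picture", subject="Ithaca")

# language = "en", a value drawn from a controlled list of language codes
english = fielded_search(RECORDS, language="en")
```

Because the search runs on fields rather than on the raw item, the same function works for pictures as well as text.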

Examples of Library Catalogs

Cornell University Library catalog:

http://catalog.library.cornell.edu/

Library of Congress, Prints and Photographs:

http://www.loc.gov/rr/print/catalog.html

Origins of Library Catalogs

Bibliographic Objective:

• To bring together like items

• To differentiate among similar ones

Sir Anthony Panizzi, Keeper of Printed Books (1837-56) and later Principal Librarian (1856-66) at the British Museum.

His Ninety-One Rules (1841) were the basis of modern catalog rules.

Origins of Library Catalogs

Information Discovery:

• to enable a person to find a book of which either the author, title or subject is known

• to show what the library has by a given author, on a given subject, or in a given kind of literature

• to assist in the choice of a book as to its edition (bibliographically) or to its character (literary or topical).

Charles Ammi Cutter, Librarian of the Boston Athenaeum

Rules for a Dictionary Catalog, 1874

Origins of Library Catalogs

Classification:

Division of subject matter into a hierarchy. Typically used in libraries to provide a subject-based order for shelving books.

Melvil Dewey, Acting Librarian of Amherst College (1874)

The Dewey Decimal system of book classification uses the numbers 000 to 999 to cover the general fields of knowledge and decimals to fit special subjects.
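The hierarchy implied by a Dewey number can be sketched by reading it digit by digit. This is a simplified illustration that assumes each digit narrows the subject by one level; the function name `dewey_hierarchy` is hypothetical, and no class captions are included because they are not given on the slide.

```python
def dewey_hierarchy(number: str) -> list[str]:
    """Return the chain of broader classes for a Dewey number such as '027.7'."""
    whole, _, decimals = number.partition(".")
    whole = whole.zfill(3)
    # The three digits before the point give the hundreds, tens, and units levels.
    levels = [whole[0] + "00", whole[:2] + "0", whole]
    # Each further decimal digit narrows the subject by one level.
    for i in range(1, len(decimals) + 1):
        levels.append(f"{whole}.{decimals[:i]}")
    # Drop repeats (e.g. '500' would otherwise be listed three times).
    return list(dict.fromkeys(levels))


chain = dewey_hierarchy("027.7")
```

Shelving books in the order of these numbers keeps items on the same subject physically together.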

Technology

Materials to be catalogued:

• Originally books

• Extended to serials, maps, music, etc., but concepts still rely heavily on experience with books

Form of catalog:

• Entries in books (Panizzi)

• Index cards (Cutter)

• Online databases (Kilgour)

Catalogs as Investments

Costs:

• Conventional catalog records are created by skilled librarians (estimated cost: $100 per record).

• OCLC's catalog has 52 million records. Total investment is several billion dollars.
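As a rough check on the investment figure, multiplying the record count by the per-record estimate given above yields:

```latex
\[
52 \times 10^{6}\ \text{records} \times \$100\ \text{per record} = \$5.2 \times 10^{9}
\]
```

That is about $5 billion, consistent with "several billion dollars".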

Cataloguing Standards:

• Enable libraries to share records

• Combine records of the past with records created today

• Allow readers and librarians to move between libraries

Shared Cataloguing: OCLC

OCLC -- Large centralized transaction processing database system

When a library catalogs a book, it deposits a MARC record in OCLC

Other libraries can copy the record

• saves duplication of cataloguing

• builds a database of holdings

OCLC database has 52 million records, serves 47,000 libraries

When developed in 1967, OCLC was a pioneering computer system (it had to develop its own network, computer terminals, etc.)
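The deposit-and-copy workflow can be sketched as a toy in-memory union catalog. This is not a description of OCLC's actual system: the class `UnionCatalog`, its method names, and the library names are hypothetical, and the record is reduced to a single field.

```python
class UnionCatalog:
    """Toy centralized catalog: one shared record per item, plus holdings."""

    def __init__(self):
        self.records = {}   # record id -> catalog record
        self.holdings = {}  # record id -> set of libraries holding the item

    def deposit(self, record_id, record, library):
        """Original cataloguing: store the record unless one already exists."""
        self.records.setdefault(record_id, record)
        self.holdings.setdefault(record_id, set()).add(library)

    def copy(self, record_id, library):
        """Copy cataloguing: reuse the shared record and note the new holding."""
        record = self.records[record_id]
        self.holdings[record_id].add(library)
        return record


catalog = UnionCatalog()
catalog.deposit("89-16879", {"245": "Campus strategies for libraries"}, "Library A")
copied = catalog.copy("89-16879", "Library B")
```

The second library does no cataloguing work of its own, and the holdings table records which libraries own the item as a side effect.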

Layers of a Library Catalog

Encoding

• Rules that define how catalog records are encoded in a computer system, e.g., XML mark-up.

Syntax

• Rules that define the fields and subfields, whether repeated, optional, etc.

Semantics

• Rules that define the values of the field and subfield, with instructions for cataloguers of what data to include and how to decide when choices have to be made.
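The separation between the layers can be sketched in a few lines. The field specification below is an assumption made for illustration (it is not the official MARC definition), the name `check_syntax` is hypothetical, and JSON stands in for whatever encoding a real system uses. The semantic layer, what a cataloguer should write in each value, is not something code like this can check.

```python
import json

# Syntax layer (illustrative spec): which fields exist, whether each one is
# required, and whether it may be repeated.
FIELD_SPEC = {
    "001": {"required": True, "repeatable": False},   # control number
    "245": {"required": True, "repeatable": False},   # title statement
    "650": {"required": False, "repeatable": True},   # subject heading
}


def check_syntax(fields):
    """fields: list of (tag, value) pairs. Return a list of problems found."""
    tags = [tag for tag, _ in fields]
    problems = [f"unknown field {tag}" for tag in sorted(set(tags) - set(FIELD_SPEC))]
    for tag, rule in FIELD_SPEC.items():
        count = tags.count(tag)
        if rule["required"] and count == 0:
            problems.append(f"missing required field {tag}")
        if not rule["repeatable"] and count > 1:
            problems.append(f"field {tag} may not repeat")
    return problems


record = [
    ("001", "89-16879"),
    ("245", "Campus strategies for libraries and electronic information"),
    ("650", "Academic libraries--United States--Automation."),
    ("650", "Information technology--United States."),
]
problems = check_syntax(record)

# Encoding layer: one possible serialization of the same record.
encoded = json.dumps(record)
```

The same record could be re-encoded in another format without changing either the syntax rules or the cataloguing rules.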

Library Cataloging using the Anglo American Cataloguing Rules

Anglo American Cataloguing Rules (AACR2)

• Rules for each category of material, e.g., monographs (books). Specify what fields should be used and what data to include in each field. Text strings were originally intended for printed catalog cards.

MARC format

• An exchange format for catalog records. Includes encoding rules and syntax specification.

"MARC Catalog"

• Catalog in MARC format, where content of each field follows AACR2.

Anglo American Cataloguing Rules

The Anglo American Cataloguing Rules (AACR) provide detailed rules for

• the choice of fields

• the content of the data that goes into each field

• the syntax of the data that goes into each field

The rules are an excellent example of technical writing, precise but clear. For an example, see:

http://www.cs.cornell.edu/Courses/cs430/2004fa/slides/AACR.pdf

Example: Controlled Vocabulary

Level 1: Arts

Level 2:

• Architecture
• Art therapy
• Careers*
• Computers in art
• Dance
• Drama/dramatics
• Film
• History*
• Informal education*
• Instructional issues*
• Music
• Photography
• Popular culture*
• Process skills*
• Technology*
• Theater arts
• Visual arts

Terms marked * can appear in other hierarchies

Source: presentation by Diane Hillmann, 2004
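A controlled vocabulary is enforced by accepting only listed terms. The following is a minimal sketch using the Arts hierarchy from this slide; the names `VOCABULARY` and `is_valid_subject` are hypothetical, and the asterisks are dropped because they are annotations rather than part of the terms.

```python
# Level 1 -> Level 2 terms, copied from the slide. Starred terms (which can
# also appear in other hierarchies) are stored without the asterisk.
VOCABULARY = {
    "Arts": {
        "Architecture", "Art therapy", "Careers", "Computers in art", "Dance",
        "Drama/dramatics", "Film", "History", "Informal education",
        "Instructional issues", "Music", "Photography", "Popular culture",
        "Process skills", "Technology", "Theater arts", "Visual arts",
    },
}


def is_valid_subject(level1: str, level2: str) -> bool:
    """True only if the (level 1, level 2) pair is in the controlled vocabulary."""
    return level2 in VOCABULARY.get(level1, set())


ok = is_valid_subject("Arts", "Photography")
rejected = is_valid_subject("Arts", "Painting")  # not a listed term
```

Rejecting unlisted terms at entry time is what keeps later searches on the field reliable.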

MARC Format

The MARC format was developed in the late 1960s as a tagging scheme for exchanging catalog records on magnetic tape. It remains the standard way to represent such data.

At present, MARC is slowly being converted to modern computing formats, e.g., Unicode, XML.

MARC: Monograph catalog record

Citation

Caroline R. Arms, editor, Campus strategies for libraries and electronic information. Bedford, MA: Digital Press, 1990.

MARC fields

tag value

001 89-16879 r93

050 Z675.U5C16 1990

082 027.7/0973 20

245 Campus strategies for libraries and electronic information/Caroline Arms, editor. (title statement)

260 {Bedford, Mass.} : Digital Press, c1990. (publisher)

300 xi, 404 p. : ill. ; 24 cm. (collation)

440 EDUCOM strategies series on information technology (series title)

504 Includes bibliographical references (p. {373}-381).

020 ISBN 1-55558-036-X : $34.95

MARC fields (continued)

650 Academic libraries--United States--Automation. (subject heading)

650 Libraries and electronic publishing--United States.

650 Library information networks--United States.

650 Information technology--United States.

700 Arms, Caroline R. (Caroline Ruth)

040 DLC DLC DLC

043 n-us---

955 CIP ver. br02 to SL 02-26-90

985 APIF/MIG

MARC Encoding

tag: 260

subfield a: {Bedford, Mass.} :

subfield b: Digital Press,

subfield c: c1990.

MARC encoding:

&2600#abc#{Bedford, Mass.} :#Digital Press,#c1990.%

[Definitely not a modern encoding!]

Note that the content is designed to be part of a printed catalog record and is not in a convenient format for computer manipulation.
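To see how the subfields are recovered from the encoded string, here is a minimal parser for the notation used on this slide. The layout is an assumption read off the single example (a leading "&", a three-digit tag followed by one further character treated as an indicator, "#"-separated subfield codes and values, and a closing "%"); it is not a parser for real MARC files, and the function name is hypothetical.

```python
def parse_slide_encoding(text: str) -> dict:
    """Split the slide's illustrative string into tag, indicator, and subfields."""
    if not (text.startswith("&") and text.endswith("%")):
        raise ValueError("expected the field to start with '&' and end with '%'")
    # After stripping the delimiters: header, subfield codes, then one value per code.
    head, codes, *values = text[1:-1].split("#")
    if len(codes) != len(values):
        raise ValueError("number of subfield codes and values differ")
    return {
        "tag": head[:3],
        "indicator": head[3:],
        # Note: a dict keeps only the last value if a subfield code repeats.
        "subfields": dict(zip(codes, values)),
    }


field = parse_slide_encoding("&2600#abc#{Bedford, Mass.} :#Digital Press,#c1990.%")
```

Even after parsing, the values still carry card-style punctuation such as the trailing " :" and ",", which is the inconvenience the note above refers to.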

Name authority files

An Authority File "brings together like items and differentiates among similar ones."

• Caroline R. Arms or Caroline Ruth Arms?

• Which William Phillips of Cardiff?

• Mark Twain or Samuel Clemens?

• Epithets:

of Cardiff
doctor

• Dates:

1832 - 1876
flourished 1860
circa 1832 - 1876

Name authority: example

LC Control Number: n 87870182
HEADING: Arms, Caroline R. (Caroline Ruth)
000 00907cz 2200205n 450
001 4383796
005 19890706143144.8
008 70909n|acannaab |a aaa c
010 __ |a n 87870182
035 __ |a (DLC)n 87870182
040 __ |a InU |c DLC |d DLC
100 10 |a Arms, Caroline R. |q (Caroline Ruth)
400 10 |w nna |a Arms, Caroline Ruth
400 10 |a Arms, C. R. |q (Caroline Ruth)
670 __ |a Arms, W.Y. Report on the performance problems of the RLIN computer system, 1982: |b t.p. (Caroline R. Arms)
670 __ |a LC data base, 8/24/87 |b (hdg.: Arms, Caroline Ruth; usage: Caroline R. Arms, C. R. Arms)
670 __ |a Campus networking strategies, 1988: |b CIP t.p. (Caroline Arms)
670 __ |a Phone call to pub., 2/10/88 |b (Caroline Ruth Arms; studied at Oxford)
670 __ |a Campus strategies for libraries and electronic information, c1990: |b CIP t.p. (Caroline Arms) data sheet (b. 10-24-45)
953 __ |a bz46 |b bd24
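In the record above, field 100 carries the authorized heading and the 400 fields carry variant forms that should lead to it. A minimal sketch of that lookup follows; the function names and the whitespace and case normalization are assumptions made for illustration, not part of any cataloguing standard.

```python
AUTHORIZED = "Arms, Caroline R. (Caroline Ruth)"   # from field 100
VARIANTS = [                                        # from the 400 fields
    "Arms, Caroline Ruth",
    "Arms, C. R. (Caroline Ruth)",
]


def normalize(name: str) -> str:
    """Fold case and collapse whitespace so trivial differences do not matter."""
    return " ".join(name.lower().split())


def build_authority_index(heading, variants):
    """Map the heading and every variant form to the single authorized heading."""
    index = {normalize(heading): heading}
    for variant in variants:
        index[normalize(variant)] = heading
    return index


def resolve(index, name):
    """Return the authorized heading for a name, or None if it is not on file."""
    return index.get(normalize(name))


INDEX = build_authority_index(AUTHORIZED, VARIANTS)
heading = resolve(INDEX, "arms,  caroline ruth")
```

A catalog that stores only the authorized form in its records can then bring together all works by the same person, whichever form of the name appears on the title page.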

Subject information

Library of Congress Subject Headings

Academic libraries--United States--Automation

Hierarchical classification

Library of Congress call number: Z675.U5C16

Dewey Decimal Classification: 027.7

Creation and maintenance of lists of subject headings and classifications is a never ending task.
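The subject heading shown above is a main heading followed by subdivisions separated by "--". A minimal sketch of splitting it, assuming that convention holds for the headings in this lecture (the function name is hypothetical):

```python
def split_subject_heading(heading: str) -> dict:
    """Split a heading such as 'A--B--C.' into its main heading and subdivisions."""
    parts = [part.strip() for part in heading.rstrip(".").split("--")]
    return {"main": parts[0], "subdivisions": parts[1:]}


parsed = split_subject_heading("Academic libraries--United States--Automation")
```

Here the main heading is "Academic libraries", narrowed first by place and then by topic.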

Online public access catalog (OPAC)

History: First stage

• Library mounts its MARC records on a central computer

• Provides a simple terminal interface and dedicated terminals

• Boolean search -- fielded searching

[Most university libraries reached this stage about 1990]

History: Second stage

• Library connects computer to a campus network and Internet

• Converts card catalog records to MARC (retrospective conversion)

Library information systems

When the catalog is online ...

Add other collections and services:

• Secondary information (Inspec, Medline, Chemical Abstracts)

• Reference works (dictionaries, encyclopedias)

Improve user interface

• Add full text searching

• Add web interface

Add gateway to off-campus information sources:

• Scientific journals

• Databases (census, genome)

Library management systems

A library management system, sometimes called an integrated library system, integrates the internal processes of a library, e.g., acquisitions, cataloguing, binding, and circulation.

It usually contains an online public access catalog, but does not provide integrated services to users.

Library management systems are produced by small companies that lack the capital and technical expertise to develop modern digital libraries.

Notes on MARC

A great achievement:

• Developed in 1960s

• Magnetic tape exchange format for printing catalog records

• The dawn of computing:

mixed upper and lower case
variable length fields, repeated fields
non-Roman scripts

• 100(?) million records with standard content and format

• Thousands of trained librarians (millions?)

Notes on MARC

A great problem:

• Not designed for computer algorithms

• One record per item (poor links between records)

• Tied to traditional materials and traditional practices

• Not Unicode

• 100 million records at $100 each: $10 billion

A classic legacy system!