

1

CS 430: Information Discovery

Lecture 7

Descriptive Metadata 3

Dublin Core

Automatic Generation of Catalog Records

2

Course Administration

• Relationship between Library of Congress, OCLC and American Memory

3

Dublin Core elements

1. Title The name given to the resource by the creator or publisher.

2. Creator The person or organization primarily responsible for the intellectual content of the resource. For example, authors in the case of written documents, artists, photographers, or illustrators in the case of visual resources.

3. Subject The topic of the resource. Typically, subject will be expressed as keywords or phrases that describe the subject or content of the resource. The use of controlled vocabularies and formal classification schemes is encouraged.

4

Dublin Core elements

4. Description A textual description of the content of the resource, including abstracts in the case of document-like objects or content descriptions in the case of visual resources.

5. Publisher The entity responsible for making the resource available in its present form, such as a publishing house, a university department, or a corporate entity.

6. Contributor A person or organization not specified in a creator element who has made significant intellectual contributions to the resource but whose contribution is secondary to any person or organization specified in a creator element (for example, editor, transcriber, and illustrator).

5

Dublin Core elements

7. Date A date associated with the creation or availability of the resource.

8. Type The category of the resource, such as home page, novel, poem, working paper, preprint, technical report, essay, dictionary.

9. Format The data format of the resource, used to identify the software and possibly hardware that might be needed to display or operate the resource.

10. Identifier A string or number used to uniquely identify the resource. Examples for networked resources include URLs and URNs.

6

Dublin Core elements

11. Source Information about a second resource from which the present resource is derived.

12. Language The language of the intellectual content of the resource.

13. Relation An identifier of a second resource and its relationship to the present resource. This element permits links between related resources and resource descriptions to be indicated. Examples include an edition of a work (IsVersionOf), or a chapter of a book (IsPartOf).

7

Dublin Core elements

14. Coverage The spatial locations and temporal durations characteristic of the resource.

15. Rights A rights management statement, an identifier that links to a rights management statement, or an identifier that links to a service providing information about rights management for the resource.

8

Qualifiers

Element qualifier

Example: Date

DC.Date -> Created: 1997-11-01

DC.Date -> Issued: 1997-11-15

DC.Date -> Available: 1997-12-01/1998-06-01

DC.Date -> Valid: 1998-01-01/1998-06-01

9

Qualifiers

Value qualifiers

Example: Subject

DC.Subject -> DDC: 509.123

DC.Subject -> LCSH: Digital libraries-United States
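The two kinds of qualifier can be modelled as optional fields on a single statement. The sketch below is illustrative only: the class `Statement` and its field names `refinement` and `scheme` are assumptions made for this example, not names taken from the slides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Statement:
    """One Dublin Core statement (hypothetical class for illustration)."""
    element: str                      # one of the 15 elements, e.g. "Date"
    value: str                        # the content of the element
    refinement: Optional[str] = None  # element qualifier: narrows the element
    scheme: Optional[str] = None      # value qualifier: how to read the value

    def without_qualifiers(self) -> "Statement":
        """Drop both qualifiers, leaving a plain element/value pair."""
        return Statement(self.element, self.value)


# Element qualifier example from the slides: DC.Date -> Created
created = Statement("Date", "1997-11-01", refinement="Created")

# Value qualifier example from the slides: DC.Subject -> DDC
dewey = Statement("Subject", "509.123", scheme="DDC")

print(created.without_qualifiers())
print(dewey)
```

An element qualifier changes what the element means (which date), while a value qualifier changes how the value is interpreted (which classification scheme). Stripping both still leaves a usable unqualified statement.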

10

Metadata about subjects

(a) Classification (usually manual)

Dewey Decimal Classification (DDC)

324.973 political web site

Library of Congress classification system (LCC)

E840.8.G65 political web site

(b) Subject headings (usually manual)

Keywords assigned from controlled vocabulary, e.g., Medical Subject Headings (MeSH)

Library of Congress subject headings (LCSH)

Political campaigns - United States

(c) Terms extracted from text (automatic)

Automatic indexing [CS 430]

Methods from computational linguistics [CS 374/474]

11

Dewey Decimal Classification

Main classes:

000 Computers, information, & general reference

100 Philosophy & psychology

200 Religion

300 Social sciences

400 Language

500 Science

600 Technology

700 Arts & recreation

800 Literature

900 History & geography

12

Dewey Decimal Classification

Hierarchy, e.g.:

600 Technology (Applied sciences)

630 Agriculture and related technologies

636 Animal husbandry

636.7 Dogs

636.8 Cats
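Because the hierarchy is carried by the digits of the notation, broader classes can be derived from a class number by string manipulation alone. The following is a simplified sketch, assuming a three-digit class number with optional decimals; the function name `ddc_ancestors` is hypothetical, and the real DDC schedules have exceptions that this ignores.

```python
from typing import List


def ddc_ancestors(notation: str) -> List[str]:
    """Return the broader DDC classes for a notation, nearest first.

    Simplified: assumes three digits before the point and ignores
    the special cases found in the real schedules.
    """
    main, _, decimals = notation.partition(".")
    ancestors = []
    # Removing one decimal digit at a time gives a broader class.
    for i in range(len(decimals) - 1, 0, -1):
        ancestors.append(f"{main}.{decimals[:i]}")
    if decimals:
        ancestors.append(main)
    # Zero-filling the three-digit part gives the division and main class.
    for i in (2, 1):
        broader = main[:i].ljust(3, "0")
        if broader != main and broader not in ancestors:
            ancestors.append(broader)
    return ancestors


print(ddc_ancestors("636.7"))  # ['636', '630', '600']
```

Applied to 636.7 (Dogs), this walks up through 636 (Animal husbandry) and 630 (Agriculture) to 600 (Technology), matching the hierarchy on this slide.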

Uses:

• Shelving collections of physical objects so that items on similar subjects are shelved together

• Crude subject access

Scorpion project (OCLC):

Automatic subject recognition and assignment of DDC classes

13

14

15

Limits of Dublin Core

Complex objects

• Article within a journal

• A thumbnail of another image

• The March 28 final edition of a newspaper

Complete object

Sub-objects

Metadata records

16

Flat v. linked records

Flat record

All information about an item is held in a single Dublin Core record, including information about related items

convenient for access and preservation

information is repeated -- maintenance problem

Linked record

Related information is held in separate records with a link from the item record

less convenient for access and preservation

information is stored once

Compare with normal forms in relational databases
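The trade-off can be seen in a small sketch that uses Python dictionaries in place of Dublin Core records. The key `"dlib"`, the `"serial"` link field, and the `resolve` helper are assumptions made for this example.

```python
# Flat: every article record repeats the details of the serial.
flat_records = [
    {"title": "Digital Libraries and the Problem of Purpose",
     "serial-name": "D-Lib Magazine", "issn": "1082-9873"},
    {"title": "Languages for Dublin Core",
     "serial-name": "D-Lib Magazine", "issn": "1082-9873"},
]

# Linked: the serial is described once and referenced by a key.
serials = {
    "dlib": {"serial-name": "D-Lib Magazine", "issn": "1082-9873"},
}
linked_records = [
    {"title": "Digital Libraries and the Problem of Purpose", "serial": "dlib"},
    {"title": "Languages for Dublin Core", "serial": "dlib"},
]


def resolve(record, serials):
    """Expand a linked record into flat form by following its link."""
    flat = {k: v for k, v in record.items() if k != "serial"}
    flat.update(serials[record["serial"]])
    return flat


# Following the links reproduces the flat records.
print([resolve(r, serials) for r in linked_records] == flat_records)
```

Correcting the ISSN means one change in the linked form but one change per article in the flat form. In exchange, a flat record can be read or archived on its own, without the `serials` table.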

17

18

Dublin Core with qualifiers

<title>Digital Libraries and the Problem of Purpose</title>

<creator>David M. Levy</creator>

<publisher>Corporation for National Research Initiatives</publisher>

<date date-type = "publication">January 2000</date>

<type resource-type = "work">article</type>

<identifier uri-type = "DOI">10.1045/january2000-levy</identifier>

<identifier uri-type = "URL">http://www.dlib.org/dlib/january00/01levy.html</identifier>

<language>English</language>

<rights>Copyright (c) David M. Levy</rights>

19

Dublin Core with flat record extension

Continuation

<relation rel-type = "InSerial">

<serial-name>D-Lib Magazine</serial-name>

<issn>1082-9873</issn>

<volume>6</volume>

<issue>1</issue>

</relation>
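A record in this form can be read with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree` on an abridged copy of the record; the enclosing `<record>` element is an assumption added here so that the fragment has a single root.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the record; <record> is an assumed wrapper element.
RECORD = """<record>
<title>Digital Libraries and the Problem of Purpose</title>
<creator>David M. Levy</creator>
<date date-type="publication">January 2000</date>
<identifier uri-type="DOI">10.1045/january2000-levy</identifier>
<identifier uri-type="URL">http://www.dlib.org/dlib/january00/01levy.html</identifier>
<relation rel-type="InSerial">
  <serial-name>D-Lib Magazine</serial-name>
  <issn>1082-9873</issn>
  <volume>6</volume>
  <issue>1</issue>
</relation>
</record>"""

root = ET.fromstring(RECORD)

# Repeated elements are told apart by their qualifier attribute.
identifiers = {el.get("uri-type"): el.text for el in root.findall("identifier")}

# The flat-record extension nests the serial's details inside <relation>.
relation = root.find("relation")
serial = {child.tag: child.text for child in relation}

print(identifiers["DOI"])
print(relation.get("rel-type"), serial)
```

The nested `<relation>` block is what makes this a flat record: the description of the serial travels inside the article's own record.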

20

Events

Version 1

New material

Version 2

Should Version 2 have its own record or should extra information be added to the Version 1 record?

How are these represented in Dublin Core?

21

Minimalist versus structuralist

Minimalist

15 elements, no qualifiers, suitable for non-professionals

encourage creators to provide metadata

Structuralist

15 elements, qualifiers, RDF, detailed coding rules

will require trained metadata experts

[For an example of how complex Dublin Core can become, see the source of: http://purl.org/dc/documents/rec-dces-199809.htm#]

22

Dublin Core in many languages

See:

Thomas Baker, Languages for Dublin Core, D-Lib Magazine, December 1998, http://www.dlib.org/dlib/december98/12baker.html

23

Dublin Core: Personal Opinion

Dublin Core is a simple way to describe digital content that:

• is a single, self-contained object ("document-like")

• is static with time

• has few relationships

Some web sites satisfy these criteria

Dublin Core is not suitable for digital content that:

• is heavily structured

• changes dynamically

24

Automatic extraction of catalog data

Example: Dublin Core records for web pages

Strategies

• Manual by trained cataloguers - high quality records, but expensive and time consuming

• Entirely automatic - fast, almost zero cost, but poor quality

• Automatic followed by human editing - cost and quality depend on the amount of editing

• Manual collection level record, automatic item level record - moderate quality, moderate cost

25

DC-dot

DC-dot is a Dublin Core metadata editor for web pages, created by Andy Powell at UKOLN

http://www.ukoln.ac.uk/metadata/dcdot/

DC-dot has two parts:

(a) A skeleton Dublin Core record is created automatically from clues in the web page

(b) A user interface is provided for cataloguers to edit the record

26

27

Automatic record for CS 430 home page

DC-dot applied to http://www.cs.cornell.edu/courses/cs430/2001sp/

<link rel="schema.DC" href="http://purl.org/dc">

<meta name="DC.Title" content="CS 430: Information Discovery">

<meta name="DC.Subject" content="[email protected]; Course Structure; Readings and references; Slides; Basic Information; William Y. Arms; Information Retrieval Data Structures and Algorithms; [email protected]; Assignments; Syllabus; Text Book; Laptop computers; Assumed Background; Nomadic Computing Experiment; Notices; Course Description; Code of practice; Assignments and Grading; Last changed: February 6, 2001">

continued on next slide

28

Automatic record for CS 430 home page (continued)

DC-dot applied to http://www.cs.cornell.edu/courses/cs430/2001sp/

<meta name="DC.Publisher" content="Cornell University">

<meta name="DC.Date" scheme="W3CDTF" content="2001-02-07">

<meta name="DC.Type" scheme="DCMIType" content="Text">

<meta name="DC.Format" content="text/html">

<meta name="DC.Format" content="5781 bytes">

<meta name="DC.Identifier" content="http://www.cs.cornell.edu/courses/cs430/2001sp/">
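The embedding used in this record is regular enough to generate from a list of fields. The sketch below is one way to do it, not DC-dot's own code; the function `to_meta_tags` and the `(element, scheme, content)` tuple layout are assumptions made for this example.

```python
from html import escape


def to_meta_tags(record):
    """Render (element, scheme, content) tuples as HTML meta tags."""
    lines = ['<link rel="schema.DC" href="http://purl.org/dc">']
    for element, scheme, content in record:
        # The scheme attribute is written only when a value qualifier exists.
        scheme_attr = f' scheme="{escape(scheme)}"' if scheme else ""
        lines.append(
            f'<meta name="DC.{element}"{scheme_attr} content="{escape(content)}">'
        )
    return "\n".join(lines)


record = [
    ("Publisher", None, "Cornell University"),
    ("Date", "W3CDTF", "2001-02-07"),
    ("Type", "DCMIType", "Text"),
    ("Format", None, "text/html"),
]

print(to_meta_tags(record))
```

Escaping the content matters in practice because titles and descriptions can contain quotation marks or angle brackets that would otherwise break the tag.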

29

Observations on DC-dot applied to CS430 home page

DC.Title is a copy of the html <title> field

DC.Publisher is the owner of the IP address where the page was stored

DC.Subject is a list of headings and noun phrases presented for editing

DC.Date is taken from the Last-Modified field in the http header

DC.Type and DC.Format are taken from the MIME type of the http response

DC.Identifier was supplied by the user as input
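Most of these observations translate directly into code. The sketch below builds a skeleton record from a URL, a dictionary of HTTP response headers, and the page body. It illustrates the mapping described above and is not DC-dot's implementation: `TitleParser` and `skeleton_record` are hypothetical names, the rule that a `text/*` MIME type maps to the type "Text" is an assumption, and the publisher lookup by IP address is left out.

```python
from email.utils import parsedate_to_datetime
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        if tag == "title" and self.title is None:
            self.in_title = True
            self.title = ""

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def skeleton_record(url, headers, body):
    """Build a skeleton record from an HTTP response (hypothetical helper)."""
    parser = TitleParser()
    parser.feed(body)
    record = {"DC.Identifier": url}  # supplied by the user
    if parser.title:
        record["DC.Title"] = parser.title.strip()
    if "Last-Modified" in headers:
        modified = parsedate_to_datetime(headers["Last-Modified"])
        record["DC.Date"] = modified.date().isoformat()
    if "Content-Type" in headers:
        # Drop parameters such as "; charset=..."
        mime = headers["Content-Type"].split(";")[0].strip()
        record["DC.Format"] = mime
        if mime.startswith("text/"):  # assumed rule, for illustration
            record["DC.Type"] = "Text"
    return record


# The time of day in this header is made up; only the date is in the slides.
example = skeleton_record(
    "http://www.cs.cornell.edu/courses/cs430/2001sp/",
    {"Last-Modified": "Wed, 07 Feb 2001 12:00:00 GMT",
     "Content-Type": "text/html; charset=iso-8859-1"},
    "<html><head><title>CS 430: Information Discovery</title></head></html>",
)
print(example)
```

A page with no `<title>` element simply produces a record without DC.Title, which is one reason the automatic record still needs an editing step.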

30

31

Automatic record for George W. Bush home page

DC-dot applied to http://www.georgewbush.com/

<link rel="schema.DC" href="http://purl.org/dc">

<meta name="DC.Subject" content="George W. Bush; Bush; George Bush; President; republican; 2000 election; election; presidential election; George; B2K; Bush for President; Junior; Texas; Governor; taxes; technology; education; agriculture; health care; environment; society; social security; medicare; income tax; foreign policy; defense; government">

<meta name="DC.Description" content="George W. Bush is running for President of the United States to keep the country prosperous.">

continued on next slide

32

Automatic record for George W. Bush home page (continued)

DC-dot applied to http://www.georgewbush.com/

<meta name="DC.Publisher" content="Concentric Network Corporation">

<meta name="DC.Date" scheme="W3CDTF" content="2001-01-12">

<meta name="DC.Type" scheme="DCMIType" content="Text">

<meta name="DC.Format" content="text/html">

<meta name="DC.Format" content="12223 bytes">

<meta name="DC.Identifier" content="http://www.georgewbush.com/">

33

Observations on DC-dot applied to George W. Bush home page

The home page has several meta tags:

<META NAME="TITLE" CONTENT="George W. Bush for President"> [The page has no html <title>]

<META NAME="CONTACT" CONTENT="George W Bush Campaign, P. O. Box 1902, Austin, TX 78767, Phone: (512) 637-2000">

<META NAME="DESCRIPTION" CONTENT="George W. Bush is running for President of the United States to keep the country prosperous.">

<META NAME="KEYWORDS" CONTENT="George W. Bush, Bush, George Bush, President, republican, 2000 election and more
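Meta tags like these can be harvested with a small HTML parser. The sketch below is a guess at the general approach rather than DC-dot's code; `MetaParser` and the `META_TO_DC` mapping are assumptions. The DC-dot record for this page has no DC.Title, so DC-dot apparently did not use the TITLE tag; mapping it here is a choice made for this example.

```python
from html.parser import HTMLParser

# Assumed mapping from conventional meta names to Dublin Core elements.
META_TO_DC = {
    "title": "DC.Title",
    "description": "DC.Description",
    "keywords": "DC.Subject",
}


class MetaParser(HTMLParser):
    """Copy selected <meta> tags into a Dublin Core style dictionary."""

    def __init__(self):
        super().__init__()
        self.record = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)  # HTMLParser lower-cases tag and attribute names
        name = (attrs.get("name") or "").lower()
        content = attrs.get("content")
        if name in META_TO_DC and content:
            if name == "keywords":
                # Comma-separated keywords become a semicolon-separated list.
                content = "; ".join(k.strip() for k in content.split(","))
            self.record[META_TO_DC[name]] = content


parser = MetaParser()
parser.feed(
    '<META NAME="TITLE" CONTENT="George W. Bush for President">'
    '<META NAME="KEYWORDS" CONTENT="George W. Bush, Bush, George Bush">'
)
print(parser.record)
```

The quality of the result depends entirely on what the page author supplied: the keyword list on this page was written to attract search engines, not to describe the page for a catalog.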

34

Collection-level metadata

Several of the most difficult fields to extract automatically are the same across all pages in a web site.

Therefore create a collection record manually and combine it with automatic extraction of other fields at item level.

For the CS 430 home page, collection-level metadata:

<meta name="DC.Publisher" content="Cornell University">

<meta name="DC.Creator" content="William Y. Arms">

<meta name="DC.Rights" content="William Y. Arms, 2001">

See: Jenkins and Inman
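Combining the two levels amounts to a merge of two records. The sketch below represents each record as a dictionary from element name to a list of values and takes the union, listing collection-level values first. The function `combine` and this representation are assumptions for illustration, not the method of the cited reference.

```python
def combine(item_record, collection_record):
    """Merge an item-level record with a collection-level record.

    Both are dicts mapping an element name to a list of values.
    Collection-level values come first and duplicates are dropped.
    """
    combined = {}
    for element in list(item_record) + list(collection_record):
        if element in combined:
            continue
        values = list(collection_record.get(element, []))
        for value in item_record.get(element, []):
            if value not in values:
                values.append(value)
        combined[element] = values
    return combined


# Item-level fields as extracted automatically for the CS 430 home page.
item = {
    "DC.Title": ["CS 430: Information Discovery"],
    "DC.Publisher": ["Cornell University"],
    "DC.Format": ["text/html", "5781 bytes"],
}

# Collection-level fields created manually, once for the whole site.
collection = {
    "DC.Publisher": ["Cornell University"],
    "DC.Creator": ["William Y. Arms"],
    "DC.Rights": ["William Y. Arms, 2001"],
}

print(combine(item, collection))
```

Creator and rights, which are hard to extract automatically, are supplied once per site, while title and format still come from each page.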

35

Collection-level metadata

Compare:

(a) Metadata extracted automatically by DC-dot

(b) Collection-level record

(c) Combined item-level record (DC-dot plus collection-level)

(d) Manual record

36

37

Metadata extracted automatically by DC-dot

| D.C. Field | Qualifier | Content |
|---|---|---|
| title | | Digital Libraries and the Problem of Purpose |
| subject | | (not included in this slide) |
| publisher | | Corporation for National Research Initiatives |
| date | W3CDTF | 2000-05-11 |
| type | DCMIType | Text |
| format | | text/html |
| format | | 27718 bytes |
| identifier | | http://www.dlib.org/dlib/january00/01levy.html |

38

Collection-level record

| D.C. Field | Qualifier | Content |
|---|---|---|
| publisher | | Corporation for National Research Initiatives |
| type | | article |
| type | resource | work |
| relation | rel-type | InSerial |
| relation | serial-name | D-Lib Magazine |
| relation | issn | 1082-9873 |
| language | | English |
| rights | | Permission is hereby given for the material in D-Lib Magazine to be used for ... |

39

Combined item-level record (DC-dot plus collection-level)

| D.C. Field | Qualifier | Content |
|---|---|---|
| title | | Digital Libraries and the Problem of Purpose |
| publisher | | (*) Corporation for National Research Initiatives |
| date | W3CDTF | 2000-05-11 |
| type | | (*) article |
| type | resource | (*) work |
| type | DCMIType | Text |
| format | | text/html |
| format | | 27718 bytes |

(*) indicates collection-level metadata

continued on next slide

40

Combined item-level record (DC-dot plus collection-level)

| D.C. Field | Qualifier | Content |
|---|---|---|
| relation | rel-type | (*) InSerial |
| relation | serial-name | (*) D-Lib Magazine |
| relation | issn | (*) 1082-9873 |
| language | | (*) English |
| rights | | (*) Permission is hereby given for the material in D-Lib Magazine to be used for ... |
| identifier | | http://www.dlib.org/dlib/january00/01levy.html |

(*) indicates collection-level metadata

41

Manually created record

| D.C. Field | Qualifier | Content |
|---|---|---|
| title | | Digital Libraries and the Problem of Purpose |
| creator | | (+) David M. Levy |
| publisher | | Corporation for National Research Initiatives |
| date | publication | January 2000 |
| type | | article |
| type | resource | work |

(+) entry that is not in the automatically generated records

continued on next slide

42

Manually created record

| D.C. Field | Qualifier | Content |
|---|---|---|
| relation | rel-type | InSerial |
| relation | serial-name | D-Lib Magazine |
| relation | issn | 1082-9873 |
| relation | volume | (+) 6 |
| relation | issue | (+) 1 |
| identifier | DOI | (+) 10.1045/january2000-levy |
| identifier | URL | http://www.dlib.org/dlib/january00/01levy.html |
| language | | English |
| rights | | (+) Copyright (c) David M. Levy |

(+) entry that is not in the automatically generated records