Cornell CS 502 Metadata for the Web Issues and Simple Answers CS 502 – 20020221 Carl Lagoze – Cornell University

Page 1

Cornell CS 502

Metadata for the Web: Issues and Simple Answers

CS 502 – 20020221
Carl Lagoze – Cornell University

Page 2

“Metadata is data about data”

Page 3

Some untested hypotheses

• Metadata is useful for…
  – People
  – Machines
• More metadata is better
• (Semi-)automated digital libraries and simple metadata

Page 4

Some known facts

• Number and variety of metadata vocabularies will continue to increase

• The Tower of Babel is a franchise
  – There is not one common view of reality

• “The one thing I know about metadata is that it is expensive”

Page 5

Are metadata and data distinguishable?

• Objectivity?
• Intellectual property?
• Structure?
• Aboutness?

Page 6

The fiction of classification

…there is no classification of the universe that is not fictional and conjectural.

Jorge Luis Borges

Page 7

Lenses and Views

• All classification does and should provide a biased lens or view of reality

• Each view emphasizes certain characteristics and hides others

[Figure: lenses labeled Geospatial, Rights, and Museum]

Page 8

Reality is Complex

Created by: George Castaldo
Created on: 1994

Created by: Leonardo da Vinci
Created on: 1506

Relationship?

Page 9

Objects are Related

IFLA Entity Model

Page 10

Entities, Events, and Agents

[Figure labels: Photographer, Camera type, Software, Computer artist]

Page 11

Haven’t we done metadata already?

Page 12

What’s wrong with this model?

• Expensive
  – Complex (even for its original goal?)
  – Professional intervention (assumes single community of expertise)
• Monolithic
  – One-size-fits-all approach
  – Reflects its centralized system origins
• Bias towards physical artifacts
  – Fixed resources
  – Incomplete handling of resource evolution and other resource relationships
• Anglo-centric

Page 13

Web Challenge to Traditional Cataloging

• Scale

• Permanence

• Authenticity

• Organizational Context

• Custodial Control

• Variety

Page 14

Internet Commons includes Multiple Communities

[Figure: the Internet Commons and its communities: Scientific Data, Home Pages, Geo, Library, Museums, Commerce, Whatever...]

Page 15

Realities of Web search and discovery

• Search systems are motivated by advertising
• Index coverage is unpredictable and limited
• Too much recall, too little precision
• Index spam abounds
• Resources (and their names) are volatile
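
To make "too much recall, too little precision" concrete, here is a minimal sketch that computes both measures for a single query. The function name, the document identifiers, and the numbers are hypothetical and chosen only for illustration; they are not from the lecture.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = fraction of retrieved items that are relevant
    recall    = fraction of relevant items that were retrieved
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 10 relevant documents exist, and the engine returns
# 1000 results that happen to include all 10 of them.
relevant = {f"doc{i}" for i in range(10)}
retrieved = {f"doc{i}" for i in range(1000)}

p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.01 recall=1.00
```

In this made-up case every relevant document is found, but the searcher has to wade through 990 irrelevant results to reach them.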

Page 16

Metadata: Part of a Solution

• Structured data about data
  – helps to impose order on chaos
  – enables automated discovery/manipulation
• Variety across various dimensions:
  – specialization
  – decentralization
  – democratization

Page 17

Web Metadata Issues: Description vs. Discovery

• Library cataloging motivated by describing resources
• Fuzzy search buckets
  – Separate books about Sigmund Freud versus books by Sigmund Freud into different buckets
  – But, different types of data appropriate for different buckets: URLs, date strings, word strings, names
• But general, fuzzy categories may not be sufficient for describing resources
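
The "by Freud" versus "about Freud" distinction can be sketched as a search restricted to one bucket at a time. This is only an illustration: the record layout, the `search` helper, and the choice of `creator` and `subject` as bucket names (borrowed from Dublin Core element names) are assumptions, not something the slide specifies.

```python
# Illustrative records; each key is a "bucket" holding a loosely typed string.
RECORDS = [
    {"title": "The Interpretation of Dreams",
     "creator": "Sigmund Freud",
     "subject": "Dreams"},
    {"title": "Freud: A Life for Our Time",
     "creator": "Peter Gay",
     "subject": "Sigmund Freud"},
]


def search(records, bucket, term):
    """Return titles of records whose given bucket contains the term.

    Matching is a case-insensitive substring test, which is about as
    "fuzzy" as a bucket gets: no authority control, no typed values.
    """
    term = term.lower()
    return [r["title"] for r in records if term in r.get(bucket, "").lower()]


by_freud = search(RECORDS, "creator", "freud")     # books by Freud
about_freud = search(RECORDS, "subject", "freud")  # books about Freud

print(by_freud)     # ['The Interpretation of Dreams']
print(about_freud)  # ['Freud: A Life for Our Time']
```

A plain full-text search for "freud" would return both books; the buckets are what separate authorship from aboutness.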

Page 18

Web Metadata Models: Drill-Down Searching Paradigm

• Moving along a specificity spectrum
• Inter-domain vs. intra-domain terms, models, query mechanisms
• One size doesn't fit all
  – Cognitive models of searching and browsing

Page 19

Metadata Takes Many Forms

resource discovery
document administration
rights management
content rating
security and authentication
archival status
products and services
database schemas
process control or description

Page 20

Metadata: Part of the problem

[Figure: plot with axes labeled cost and functionality, showing AACR2/MARC, Google, and Dublin Core]

Page 21

Metadata Challenges

• Accommodate multiple varieties of metadata
  – community-specific functionality, creation, administration, access
• Tensions
  – functionality and simplicity
  – extensibility and interoperability
  – human and machine creation and use

Page 22

Interoperability has many facets

• Semantics
  – Meaning/classification/ontology
• Models/Structure
  – Entities and relationships
• Syntax
  – grammars to convey semantics and structure

Page 23

Warwick Framework: Containing Chaos

• Conceptual Architecture for metadata from the Warwick Metadata Workshop (DC-2)

• Conceptual architecture to support the specification, collection, encoding, and exchange of modular metadata

• Provide context for metadata efforts (including Dublin Core)
  – avoids the “black-hole” of comprehensive element sets
  – focuses interoperability issues at package level

Page 24

Metadata Container

[Figure: a metadata container holding four packages (Dublin Core, MARC record, Indirect Reference, Terms and Conditions); the figure also carries a URI label]
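
The container-and-package structure in this figure can be sketched as a small data model. Everything here is an illustrative assumption rather than the Warwick Framework's own specification: the class names, the `of_type` helper, the package contents, and the `http://example.org/...` URI are hypothetical, and the indirect reference is read as a package that lives elsewhere and is named by a URI.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Package:
    """A typed metadata package held directly in the container."""
    package_type: str  # e.g. "Dublin Core", "MARC record"
    content: dict      # opaque to the container; owned by its community


@dataclass
class IndirectReference:
    """A package stored outside the container and referenced by URI."""
    uri: str


@dataclass
class Container:
    """Aggregates packages without interpreting what is inside them."""
    packages: List[Union[Package, IndirectReference]] = field(default_factory=list)

    def of_type(self, package_type: str) -> List[Package]:
        """Return the directly held packages of the given type."""
        return [p for p in self.packages
                if isinstance(p, Package) and p.package_type == package_type]


# Hypothetical container mirroring the four packages named in the figure.
container = Container(packages=[
    Package("Dublin Core", {"Title": "Grammar of Dublin Core"}),
    Package("MARC record", {"245": "Grammar of Dublin Core"}),
    IndirectReference("http://example.org/metadata/record-1"),  # made-up URI
    Package("Terms and Conditions", {"access": "public"}),
])

print([type(p).__name__ for p in container.packages])
print(container.of_type("Dublin Core")[0].content["Title"])
```

The container only needs to know each package's type, which is what lets different communities define and manage their own packages independently.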

Page 25

Modularization Allows Distributed Management

• Communities of expertise (not software vendors) are responsible for:
  – Semantics
  – Registration
  – Administration
  – Access management
  – Authority of data
  – Sharing and distribution

Page 26

Modularization Implementation Issues

• Data encoding
• Semantic interaction of overlapping sets
  – between semantically related packages
  – between semantically distinct packages
• Type registry

Page 27

Dublin Core Metadata Initiative

• A simple set of properties to support resource discovery on the web (fuzzy search buckets)?

• A cross-domain switchboard for interoperable metadata?

• An extensible ontology for resource description?

http://dublincore.org

Page 28

The fifteen Dublin Core Elements

Creator      Title     Subject
Contributor  Date      Description
Publisher    Type      Format
Coverage     Rights    Relation
Source       Language  Identifier

http://www.dublincore.org/documents/1999/07/02/dces/

Page 29

A Pidgin for Digital Tourists

• Metadata is language
• Dublin Core is a small and simple language -- a pidgin -- for finding resources across domains.
• Speakers of different languages naturally "pidginize" to communicate
  – E.g., tourists using simple phrases to order beer ("zwei Bier bitte" "dva pivo" "biru o san bai"...)
• We are all "tourists" on the global Internet.

Page 30

A Grammar of Dublin Core

• http://www.dlib.org/dlib/october00/baker/10baker.html

• By design not as subtle as mother tongues, but easy to learn and extremely useful in practice

• Pidgins: small vocabularies (Dublin Core: fifteen special nouns and lots of optional adjectives)

• Simple grammars: sentences (statements) follow a simple fixed pattern...

Page 31

Example Dublin Core statements

• Resource has Title 'Grammar of Dublin Core'.
• Resource has Creator 'Tom Baker'.
• Resource has Subject 'Metadata'.
• Resource has Relation http://foo.org/file.htm.

Page 32

Resource has property X

• Resource: implied subject
• has: implied verb
• property: one of 15 properties (DC:Creator, DC:Title, DC:Subject, DC:Date, ...)
• X: property value (an appropriate literal)
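
The fixed statement pattern can be sketched in a few lines. This is a hedged illustration, not an official Dublin Core encoding: the `Statement` class, its `sentence` method, and the decision to reject unknown properties with `ValueError` are assumptions made for the example. The fifteen element names are the ones listed on the earlier slide.

```python
from dataclasses import dataclass

# The fifteen Dublin Core elements named in the lecture.
DC_ELEMENTS = frozenset({
    "Creator", "Title", "Subject",
    "Contributor", "Date", "Description",
    "Publisher", "Type", "Format",
    "Coverage", "Rights", "Relation",
    "Source", "Language", "Identifier",
})


@dataclass(frozen=True)
class Statement:
    """One 'Resource has <property> <value>' statement."""
    prop: str   # must be one of the fifteen elements
    value: str  # a literal value; always quoted here for simplicity

    def __post_init__(self):
        if self.prop not in DC_ELEMENTS:
            raise ValueError(
                f"{self.prop!r} is not one of the fifteen Dublin Core elements")

    def sentence(self) -> str:
        # The subject ("Resource") and verb ("has") are implied by the
        # pattern, so only the property and value vary.
        return f"Resource has {self.prop} '{self.value}'."


statements = [
    Statement("Title", "Grammar of Dublin Core"),
    Statement("Creator", "Tom Baker"),
    Statement("Subject", "Metadata"),
]
for s in statements:
    print(s.sentence())
```

Because the subject and verb never change, a record reduces to a list of property and value pairs, which is what keeps the "pidgin" easy to produce and to parse.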