IBE312: Information Architecture 2013

Ch. 9 – Metadata

Many of the slides in this slideset are reproduced and/or modified content from publicly available slidesets by Paul Jacobs (2012), The iSchool, University of Maryland, http://terpconnect.umd.edu/~psjacobs/s12/INFM700s12.htm.

These materials were made available and licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States license. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.


Metadata

• “Data about data” - Definitional and descriptive documentation/information about data…

• From Free On-line Dictionary of Computing: Data about data. In data processing, meta-data is definitional data that provides information about or documentation of other data managed within an application or environment.

For example, meta-data would document data about data elements or attributes, (name, size, data type, etc) and data about records or data structures (length, fields, columns, etc) and data about data (where it is located, how it is associated, ownership, etc.). Meta-data may include descriptive information about the context, quality and condition, or characteristics of the data.

• (Some other definitions.)

Metadata

• Why do we need this?
• Types of metadata
  – Descriptive/subjective/content (e.g. author, subject, keywords, …)
  – Administrative (e.g. owner, rights, cost, creation date, version, …)
  – Technical (e.g. format, size, dependencies, programs)
  – . . . .
• In practical terms:
  – Metadata helps users locate, navigate, interpret content
  – Metadata helps organizations manage content
  – Metadata helps systems manipulate content
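The three types of metadata listed above can be pictured as one record attached to a piece of content. The following is a minimal sketch, not a standard schema: every field name and value is a made-up illustration, and `find_by_keyword` is a hypothetical helper showing how descriptive metadata helps users locate content.

```python
# Hypothetical metadata record for one document.
# All field names and values are illustrative assumptions, not a standard.
record = {
    "descriptive": {
        "author": "J. Doe",
        "subject": "water quality",
        "keywords": ["lake", "nutrients"],
    },
    "administrative": {
        "owner": "Example Lab",
        "rights": "CC BY-NC-SA 3.0",
        "version": 2,
    },
    "technical": {
        "format": "text/csv",
        "size_bytes": 18432,
    },
}


def find_by_keyword(records, keyword):
    """Return the records whose descriptive keywords include `keyword`."""
    return [r for r in records if keyword in r["descriptive"]["keywords"]]
```

Here `find_by_keyword([record], "lake")` returns the record, while the administrative and technical parts would serve the organization and the system rather than the end user.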

Data without Metadata…

7/1/1988 OL 950 20.3 13 0.8 -0.1 33.1 27.8 5.3 5.92
7/2/1988 OL 950 24.2 12.6 1 -0.1 27.8 23.9 3.8 4.56
7/3/1988 OL . . . . . . . . .
7/4/1988 OL 950 0.4 16.3 0.4 0.2 41 34.5 6.5 15.5
7/5/1988 OL 1005 32.9 18.9 1.4 0.3 29.8 23.7 6.1 14.23
7/6/1988 OL 1020 32.3 20.5 1.4 0.3 23.4 18.9 4.5 12.97
7/7/1988 OL 1015 36.8 24.9 1.7 0.5 18.6 15.3 3.2 13.92
7/8/1988 OL 925 42.8 25.6 2.5 0.6 23.7 19.9 3.9 15.18
7/9/1988 OL 945 23.3 27.8 0.7 0.8 27.7 23.5 4.3 12.33
7/10/1988 OL 1030 49.8 26.2 2.6 0.6 40.3 34 6.3 22.14
7/11/1988 OL 940 44.8 25.2 2.5 0.8 34 29.2 4.8 16.76
7/12/1988 OL 1010 47.6 26.9 2.6 0.7 47.3 39.6 7.7 16.13
7/13/1988 OL 945 36.5 22.6 1.9 0.6 36.7 32.6 4 15.5
7/14/1988 OL 950 19.5 18.6 0.4 0.5 302 39.1 262.9 11.07
7/15/1988 OL 955 31.7 15.7 1.5 0.4 29.7 25 4.7 9.49
7/16/1988 OL 955 23.3 14.5 1.8 0.8 23.4 20.7 2.7 8.14
7/17/1988 OL 1015 23.8 16.6 1.6 0.6 27.7 24.1 3.7 9.17
7/18/1988 OL 934 32.9 16.7 2.1 0.7 34 28.9 5.1 9.49
7/19/1988 OL 1010 29.2 20.4 1.9 0.7 26 22.3 3.7 10.44
7/20/1988 OL 952 44.8 24.8 2.1 0.8 31.7 27.5 4.2 10.75
7/21/1988 OL 1029 33.7 37.1 1.9 0.6 34.5 30.1 4.3 12.02
7/22/1988 OL 1017 34.3 32.9 2 0.7 31.4 26.2 5.1 12.65
7/23/1988 OL 1040 35.7 24.6 2 0.8 23.7 20.4 3.3 15.5
7/24/1988 OL 923 47.6 28.9 2.9 0.8 67.3 58.9 8.4 20.87
7/25/1988 OL 1030 58.3 32.6 2.9 0.7 68 59.3 8.7 22.14
7/26/1988 OL 950 49.3 29.2 3.4 0.6 86 75.1 10.9 21.19
7/27/1988 OL 1006 54.1 20.9 3.9 0.6 94 82.8 11.2 25.06
7/28/1988 OL 1010 40.5 16.5 1.7 0.3 41 34.4 6.6 6.54
7/29/1988 OL 1000 25.5 23.6 1.4 0.1 41 35.4 5.6 3.82
7/30/1988 OL 1005 47.9 17.6 0.8 0.1 18.3 15.9 2.3 4.19
7/31/1988 OL 1015 38 22.5 1.5 0.1 30 25.3 4.7 4.44
8/1/1988 OL 1018 21.2 8.8 1.1 -0.1 24.7 21.1 3.6 4.81
8/2/1988 OL 1004 38.5 22.8 2.1 0.3 54 46.8 7.2 9.8
8/3/1988 OL 1011 94 32.6 2.1 0.3 45.5 38.9 6.6 9.49
8/4/1988 OL 955 58.3 43.1 2.5 1.1 41 33.1 7.9 9.8
8/5/1988 OL 951 55.8 42.2 2.1 0.8 38 31 7 8.86

Who: authored it? to contact about data?

What: are contents of database?

When: was it collected? processed? finalized?

Where: was the study done?

Why: was the data collected?

How: were data collected? processed? verified?

… can be pretty useless!

Early Example of Metadata

Menagerie of Terms

• Classification
• Hierarchies
• Epistemology
• Directories
• Controlled vocabularies
• Knowledge representation

Let’s focus on significant differences.
Let’s focus on advantages/disadvantages.
Let’s focus on how each is useful.

Controlled Vocabulary

• Any defined subset of natural language
• List of equivalent terms (synonym rings)
  – Use search logs.
• List of preferred terms (authority files)
  – Commonly also include variant terms
  – Educating users, enabling browsing
  – Term rotation (pointers in index) p.201
• Classification scheme / taxonomy
  – Hierarchical relationships (narrower/broader)

Controlled Vocabulary

Queries can be “exploded” (expanded to include narrower terms) to increase recall
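As a rough illustration of exploding a query, the sketch below expands a term with all of its narrower terms before searching. The `NARROWER` table and its terms are hypothetical, and `explode` is an illustrative helper, not part of any particular search system.

```python
# Hypothetical narrower-term table (term -> list of narrower terms).
# The vocabulary is illustrative only.
NARROWER = {
    "computer": ["notebook", "desktop"],
    "notebook": ["ultraportable", "tablet pc"],
}


def explode(term, narrower=NARROWER):
    """Return the term followed by all of its narrower terms, recursively."""
    expanded = [term]
    for child in narrower.get(term, []):
        expanded.extend(explode(child, narrower))
    return expanded
```

A search for "computer" would then also match content indexed under "notebook", "ultraportable", "tablet pc" and "desktop", which raises recall at some possible cost to precision.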

Controlled Vocabulary: authority file – inclusive; the preferred term can serve as the unique identifier for a collection of terms and helps educate users
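One way to picture an authority file is as a lookup table from variant terms to a preferred term. This is a minimal sketch under stated assumptions: the `AUTHORITY` table is invented, and treating "notebook" as the preferred term is an arbitrary choice for illustration.

```python
# Hypothetical authority file: each variant term points to its preferred term.
# Choosing "notebook" as the preferred term is an assumption for illustration.
AUTHORITY = {
    "laptop": "notebook",
    "notebook computer": "notebook",
    "notebook": "notebook",
}


def preferred(term, authority=AUTHORITY):
    """Map a user's term to the preferred term; unknown terms pass through."""
    key = term.strip().lower()
    return authority.get(key, key)


def synonym_ring(term, authority=AUTHORITY):
    """Return every term that shares the same preferred term as `term`."""
    target = preferred(term, authority)
    return sorted(t for t, p in authority.items() if p == target)
```

Because every variant resolves to one preferred term, that term can act as the identifier for the whole group, and showing it to users ("laptop: see notebook") is what the slide means by educating them.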

Related Terms & Techniques

• Taxonomies
  – Anything organized in some sort of hierarchical structure
• Tagging
  – Adding almost any kind of metadata to content, but now often descriptive and user-provided
• Thesauri
  – Focus on relations between terms
  – Focus on “concepts”
• Ontologies
  – Usually model a specific domain or part of the world
  – Generally machine-readable

Increasing complexity and richness

Metadata

Taxonomies & Thesauri

Practical Uses

How are taxonomies, tagging, controlled vocabularies and thesauri used?

• The semantic gap: What’s the problem?
  – Synonymy – roughly, different words or phrases can be used to express similar ideas (e.g. “notebook”, “laptop”)
  – Polysemy – roughly, the same word can have different meanings (e.g., “line” (fishing, code, queue, . . .))
• Taxonomies try to group similar concepts
• “Tags” often assign words to concepts, making it easier to find related concepts
• Controlled vocabularies avoid ambiguity (like a specific tag set)
• Thesauri represent attempts to better organize mappings between words and concepts

Do these present precision or recall problems?
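To make the precision/recall question concrete, here is a small worked sketch. It uses the standard definitions (precision = relevant retrieved / retrieved, recall = relevant retrieved / relevant); the four-document collection and the naive word-match `search` are invented for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical collection: document id -> text.
DOCS = {
    1: "cheap laptop deals",
    2: "notebook computer review",
    3: "fishing line strength",
    4: "line of code style",
}


def search(word, docs=DOCS):
    """Naive exact-word search with no vocabulary control."""
    return [doc_id for doc_id, text in docs.items() if word in text.split()]


# Synonymy: a user wanting portable computers (docs 1 and 2) types "laptop".
synonymy = precision_recall(search("laptop"), {1, 2})

# Polysemy: a user wanting fishing line (doc 3) types "line".
polysemy = precision_recall(search("line"), {3})
```

In this toy collection synonymy shows up as a recall problem (`synonymy` is `(1.0, 0.5)`: the "notebook" document is missed), while polysemy shows up as a precision problem (`polysemy` is `(0.5, 1.0)`: the code-style document is retrieved by mistake).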

Taxonomies

– Organization of objects according to some principle
– Familiar examples:
  • Linnaean taxonomy (for living organisms)
  • Web directories (e.g., Yahoo or ODP)
  • Corporate directories
  • Organization charts
  • Organizational structures previously discussed

Tagging, e.g. Flickr – popular tags

Flickr – related tags

Del.icio.us – related tags

Thesauri: Motivation

• “Semantic gap” between concepts and words
• Online thesauri help map many synonyms or word variants onto one preferred term – improve precision in retrieval (p.203)
• Words are used to evoke concepts
  – Concrete objects: MacBook Pro, iPhone
  – Abstract ideas: freedom, peace

(Diagram labels: Concepts, Words, Ideas, Meaning)

Thesauri

• Book of synonyms, often including related and contrasting words and antonyms.
• In this class:
  – A controlled vocabulary in which equivalence, hierarchical, and associative relationships are identified for purposes of improved retrieval.
• Technical lingo …
• Thesauri standards: ISO 2788, …

Thesauri Types

IA Uses of Thesauri

• For organization
• For navigation
• For indexing content
• For searching

Applying IA Principles

• Focus on users and user needs – users are different, and have different models
• Focus on content – concepts are different, too – different levels, words, complexity, vagueness
• Examples:
  – What’s the difference between laptop, PDA, phone, and convergence device?
  – When is “cancer research” “oncology”?
  – When a user browses a furniture catalog for chairs, do you show them ottomans and footstools?

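Standard Thesaurus Structure

(Diagram: “Computer” is the broader term; “Notebook” and “Laptop” are synonyms (variants, linked by AKA), one of which is the preferred term; “Desktop Replacement”, “Ultraportable” and “Tablet PC” are narrower terms. The levels are linked by IS-A.)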

Semantic relationships in a thesaurus

• Abbreviations (pp. 204-205):
  – PT: preferred term
  – VT: variant term
  – BT: broader term
  – NT: narrower term
  – RT: related term
  – Use (U): points from a VT to its PT
  – Use For (UF): full list of VTs on the PT record
  – Scope Note (SN): meaning of the term, to rule out ambiguity
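The abbreviations above map naturally onto the fields of a term record. The sketch below is one possible representation, not a format defined by any thesaurus standard: the `TermRecord` class, its field names and the sample "Notebook" record are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TermRecord:
    """One thesaurus entry, keyed by its preferred term (PT)."""
    preferred_term: str                            # PT
    use_for: list = field(default_factory=list)    # UF: variant terms (VT)
    broader: list = field(default_factory=list)    # BT
    narrower: list = field(default_factory=list)   # NT
    related: list = field(default_factory=list)    # RT
    scope_note: str = ""                           # SN


# Hypothetical record; the terms and the choice of PT are illustrative.
notebook = TermRecord(
    preferred_term="Notebook",
    use_for=["Laptop"],
    broader=["Computer"],
    narrower=["Ultraportable", "Tablet PC"],
    scope_note="A portable computer, not a paper notebook.",
)


def use_index(records):
    """Build the Use (U) index: variant term -> preferred term."""
    return {vt: r.preferred_term for r in records for vt in r.use_for}
```

The UF list lives on the PT record, and the Use pointers are derived from it, so the two directions cannot drift apart.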

Semantic relationships of a wine thesaurus, p. 206

Some Real Examples

• Content tagging and social media (e.g. Flickr, del.icio.us)

• Special-purpose classification schemes and thesauri (e.g. Art & Architecture Thesaurus – AAT, UMLS)

• General semantic tools and classification schemes (e.g., Princeton WordNet, Roget’s Thesaurus)

Art & Architecture Thesaurus

http://www.getty.edu/research/conducting_research/vocabularies/aat/

UMLS (Unified Medical Language System)
Source: National Library of Medicine (NIH)

3 Knowledge Sources, used separately or together:

• Metathesaurus: 1 million+ biomedical concepts from over 100 sources
• Semantic Network: 135 broad categories and 54 relationships between them
• SPECIALIST Lexicon + Tools: lexical information and programs for language processing

E.g. UMLS (Unified Medical Language System)

Source: National Library of Medicine (NIH)

• Began in 1986 as long-term R&D project
• Designed for systems developers
• Develop multi-purpose tools to enhance understanding of medical meaning across systems
• Overcome barriers to effective retrieval of machine-readable information
• Overcome variety of ways the same concepts are expressed in machine-readable and human language

UMLS Uses
Source: National Library of Medicine (NIH)

• Information retrieval
• Thesaurus construction
• Natural language processing
• Automated indexing
• Electronic health records (EHR)
• Distribution mechanism for HIPAA, CHI, PHIN regulatory standards SNOMED CT

UMLS Metathesaurus
http://www.nlm.nih.gov/research/umls/

UMLS Thesaurus Browser
http://www.nlm.nih.gov/research/umls/

Semantic Relationships

• Equivalence (PT = VT)
• Hierarchical: generic (Bird NT Magpie), whole-part (Foot NT Big toe) or instance (Seas NT Mediterranean Sea)
  – Faceted / multiple hierarchies
• Associative
  – Related terms (hammer RT nail)
• Preferred terms:
  – Form, selection, definition and specificity
• Polyhierarchy (Medline cross-lists viral pneumonia under both ... Fig 9-25, p. 220)
• Faceted classification – multiple taxonomies that focus on different dimensions of the content (e.g. wine.com, pp. 223-224).

Associative Term

Poly-Hierarchies

• Concepts can have multiple parents
• Example (from the Shoah Foundation’s thesaurus of Holocaust terms):
  – Cracow (Poland : Voivodship) and German death camps are both parents of Auschwitz II-Birkenau (Poland : Death Camp)
  – Block 25 (Auschwitz II-Birkenau) and Kanada (Auschwitz II-Birkenau) are its narrower terms
• What are the advantages and disadvantages?
• What’s the relationship to polysemy?
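A poly-hierarchy can be stored as a mapping from each term to a list of parents rather than a single parent. The sketch below assumes one reading of the example diagram, namely that Auschwitz II-Birkenau sits under both a geographic parent and a type parent; the `PARENTS` table encodes that assumption, and `ancestors` is an illustrative helper.

```python
# Parent links as read from the example diagram (an assumption about its
# layout): one term may list more than one parent.
PARENTS = {
    "Block 25": ["Auschwitz II-Birkenau"],
    "Kanada": ["Auschwitz II-Birkenau"],
    "Auschwitz II-Birkenau": ["Cracow", "German death camps"],
}


def ancestors(term, parents=PARENTS):
    """Return all broader terms of `term`, nearest first, without duplicates."""
    found = []
    for parent in parents.get(term, []):
        if parent not in found:
            found.append(parent)
        for ancestor in ancestors(parent, parents):
            if ancestor not in found:
                found.append(ancestor)
    return found
```

With this structure a user can reach the same term by browsing by place or by type, at the cost of navigation paths that are no longer unique.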

Faceted Hierarchies

• Alternative to single and poly-hierarchies
• Basic idea:
  – Describe objects along multiple facets
  – Each facet has its associated hierarchy
• Issues:
  – What’s a facet?
  – How do you navigate faceted hierarchies?

Faceted Browsing Example

Demo: http://flamenco.berkeley.edu/demos.html

Advantages of Facets

• Integrates searching and browsing
• Easy to build complex queries
• Easy to narrow, broaden, shift focus
• Helps users avoid getting lost
• Helps to prevent “categorization wars”

Relationship to IA?

(Diagram: Database, Web Server, Application Server, Network)

Ontologies are implicitly “hidden” here!!!

(Diagram: example ontology with the concepts Flight, Trip, Airplane and Equipment, a Part-of relation, and the attributes From, To, Departure Time, Arrival Time, Origin, Destination, Type and Capacity.)

Rule: Arrival Time is always after Departure Time

Rule: Distance from Origin to Destination is typically > 100 miles
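The flight ontology and its first rule can be sketched as typed records with a validity check. This is only an illustration: which attributes belong to which concept is an assumption (the slide's layout is not recoverable from the transcript), the class names are hypothetical, and times are assumed to be expressed in a single time zone.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Airplane:
    """Equipment used by a flight (attribute placement is an assumption)."""
    type: str
    capacity: int


@dataclass
class Flight:
    origin: str
    destination: str
    departure_time: datetime  # assumed to share a time zone with arrival_time
    arrival_time: datetime
    equipment: Airplane

    def __post_init__(self):
        # Rule from the slide: Arrival Time is always after Departure Time.
        if self.arrival_time <= self.departure_time:
            raise ValueError("Arrival Time must be after Departure Time")
```

The hard rule is enforced when a `Flight` is created. The second rule is only "typical", so it would fit better as a warning or a data-quality report than as a constraint.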

Putting it all together…

Three-Layer Architecture: Database, Application Server, Web Server, Network

Two-Layer Architecture: Database, Web Server, Network

Popular Implementation: Apache, MySQL, PHP

Content, Metadata: SQL Database

Presentation: PHP/HTML

Content Presentation

(Diagram: hierarchy of nodes A; B, C; D, E, F; G, H.)

You are here: A > C > D

Contents at D

Related
- D
- E

Hierarchy(child, parent)
Content(id, attribute1, attribute2, attribute3, …)
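The "You are here" trail can be derived directly from the Hierarchy(child, parent) table by walking parent links up to the root. The sketch below uses Python's built-in sqlite3 in place of the MySQL/PHP stack named on the slides, purely so that it is self-contained; the rows, including the placement of E under C, are assumptions about the diagram.

```python
import sqlite3

# In-memory stand-in for the Hierarchy(child, parent) table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Hierarchy (child TEXT PRIMARY KEY, parent TEXT)")
conn.executemany(
    "INSERT INTO Hierarchy VALUES (?, ?)",
    # Assumed rows; only the path A > C > D is given on the slide.
    [("A", None), ("B", "A"), ("C", "A"), ("D", "C"), ("E", "C")],
)


def breadcrumb(conn, node):
    """Walk parent links from `node` to the root and format the trail."""
    path = [node]
    while True:
        row = conn.execute(
            "SELECT parent FROM Hierarchy WHERE child = ?", (path[-1],)
        ).fetchone()
        if row is None or row[0] is None:
            break
        path.append(row[0])
    return " > ".join(reversed(path))


def related(conn, node):
    """Return the nodes that share a parent with `node`, including itself."""
    rows = conn.execute(
        "SELECT child FROM Hierarchy "
        "WHERE parent = (SELECT parent FROM Hierarchy WHERE child = ?) "
        "ORDER BY child",
        (node,),
    ).fetchall()
    return [r[0] for r in rows]
```

With these rows, `breadcrumb(conn, "D")` yields `A > C > D` and `related(conn, "D")` yields D and E. Making `child` the primary key limits each node to one parent; a poly-hierarchy would need a composite key on (child, parent).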

Faceted Browsing

Matching Results

Filter by
- Facet1 (possible values)
- Facet2 (possible values)

Hierarchy(child, parent)
Content(id, attribute1, attribute2, attribute3, …)
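A faceted browsing page needs two operations over the Content table: filter the items by the facet values chosen so far, and count the remaining values of each facet to build the "Filter by" lists. The following is a minimal in-memory sketch; the rows stand in for Content(id, attribute1, …), and the wine-style facet names and values are invented for illustration.

```python
from collections import Counter

# Hypothetical Content rows; each attribute acts as one facet.
CONTENT = [
    {"id": 1, "type": "red", "region": "France"},
    {"id": 2, "type": "red", "region": "Italy"},
    {"id": 3, "type": "white", "region": "France"},
]


def filter_items(items, selected):
    """Keep the items that match every selected facet value."""
    return [
        item for item in items
        if all(item.get(facet) == value for facet, value in selected.items())
    ]


def facet_counts(items, facets):
    """Count how many of the given items carry each value of each facet."""
    return {facet: Counter(item[facet] for item in items) for facet in facets}


matches = filter_items(CONTENT, {"type": "red"})
remaining = facet_counts(matches, ["region"])
```

After selecting type "red", the matching results are items 1 and 2, and the region facet offers France and Italy with one item each. Showing counts next to each value is a common way to let users narrow or broaden without hitting empty result sets.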

Summary

• Meta-data
  – General function
  – Types of meta-data
• Taxonomies and Thesauri
  – Role in organizing, navigating and searching content
  – General-purpose taxonomies
  – Special-purpose taxonomies
• Practical use & implementation