1 nov 2001is202: information organization and retrieval lexical relations and wordnet ray larson...

Lexical Relations and WordNet Ray Larson & Warren Sack University of California, Berkeley School of Information Management and Systems SIMS 202: Information Organization and Retrieval Lecture author: Warren Sack

Page 1: 1 Nov 2001IS202: Information Organization and Retrieval Lexical Relations and WordNet Ray Larson & Warren Sack University of California, Berkeley School

Lexical Relations and WordNet

Ray Larson & Warren Sack

University of California, Berkeley

School of Information Management and Systems

SIMS 202: Information Organization and Retrieval

Lecture author: Warren Sack

Last Time

• What is Cognitive Science?

• What is Artificial Intelligence?
  – Knowledge Representation
    • Languages and Programming Paradigms
  – Representing Common Sense
    • Common Sense Interfaces
    • Story Understanding, Story Generation, and Common Sense

Cognitive Science

• 10/30/01 – AI, knowledge representation and common sense

• 11/01/01 – Computational Linguistics, Cognitive Psychology and Lexical Knowledge

• 11/06/01 – AI and information extraction

• 11/08/01 – Linguistics, Philosophy, Psychology, categories, and cognition

Today

• Lexical relations
  – Linguistics
    • Two approaches to semantics:
      – Compositional
      – Relational
  – Psycholinguistics
• WordNet
  – Description
  – Structure
  – Applications

Levels of Linguistic Analysis

• Sentences
  – Phonological/Morphological analysis
  – Syntactic analysis
  – Semantic analysis
• More than one sentence
  – Pragmatic analysis

Phonology/Morphology

• Phonology: The study of the systems of sounds which are manifested in natural languages; the significant contrasts between sounds that are relevant to meaning.
  – E.g., consonants, vowels, stress, intonation, etc.
• Morphology: the forms of words
  – E.g., word = watched; morphs = watch + ed; morphemes = watch + past
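The word/morph/morpheme distinction in the "watched" example can be sketched in a few lines of Python. This is only a toy illustration: the suffix table, the function name `analyze`, and the set of known stems are assumptions made for the example, not part of the lecture or of any real morphological analyzer.

```python
# Toy suffix table: surface morph -> abstract morpheme (illustrative assumption).
SUFFIX_MORPHEMES = {"ed": "past", "ing": "progressive"}


def analyze(word, stems):
    """Return (morphs, morphemes) for a word, given a set of known stems."""
    if word in stems:
        return [word], [word]
    for suffix, morpheme in SUFFIX_MORPHEMES.items():
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and stem in stems:
            # morphs are surface pieces; morphemes are the abstract units
            return [stem, suffix], [stem, morpheme]
    return [word], [word]  # unknown word: treat it as a single morph


morphs, morphemes = analyze("watched", {"watch"})
print("morphs    =", "+".join(morphs))
print("morphemes =", "+".join(morphemes))
```

Real morphology is far less regular than this (consider "went" as go + past), which is one reason morphs and morphemes are kept apart.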

Syntax

The syntax of a language is to be understood as a set of rules which accounts for the distribution of word forms throughout the sentences of a language. These rules codify permissible combinations of classes of word forms.
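As a rough illustration of rules that "codify permissible combinations of classes of word forms", the sketch below assigns each word form to a class and accepts a sentence only if its sequence of classes is on a permitted list. The lexicon, the class names, and the two patterns are invented for the example; a real grammar would use recursive rules rather than a flat list of patterns.

```python
# Hypothetical mini-lexicon: word form -> word class.
LEXICON = {
    "the": "Det", "a": "Det",
    "dog": "N", "cat": "N",
    "sees": "V", "sleeps": "V",
}

# Permitted sequences of word classes (a stand-in for real grammar rules).
PATTERNS = {
    ("Det", "N", "V"),
    ("Det", "N", "V", "Det", "N"),
}


def is_permissible(sentence):
    """True if the sentence's sequence of word classes matches a pattern."""
    classes = tuple(LEXICON.get(word) for word in sentence.lower().split())
    return classes in PATTERNS


print(is_permissible("The dog sees a cat"))  # Det N V Det N
print(is_permissible("Dog the sees"))        # N Det V
```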

Semantics

• Semantics is the study of linguistic meaning.

• Two standard approaches to lexical semantics (cf. sentential semantics and logical semantics):
  – (1) compositional
  – (2) relational

• Other approaches…

Pragmatics

• Deixis
  – E.g., “I’ll be back in an hour” depends upon the time of the utterance.
• Conversational implicature
  – A: “Can you tell me the time?”
  – B: “Well, the milkman has come.” [I don’t know exactly, but perhaps you can deduce it from some extra information I give you.]
• Presupposition
  – “Are you still such a bad driver?”
• Speech acts
  – Constatives vs. performatives
  – E.g., “I second the motion.”
• Conversational structure
  – E.g., turn-taking rules

Lexical Semantics: Compositional Approach

• Compositional lexical semantics, introduced by Katz & Fodor (1963), analyzes the meaning of a word in much the same way a sentence is analyzed into semantic components. The semantic components of a word are not themselves considered to be words, but are abstract elements (semantic atoms) postulated in order to describe word meanings (semantic molecules) and to explain the semantic relations between words. For example, the representation of bachelor might be ANIMATE and HUMAN and MALE and ADULT and NEVER MARRIED. The representation of man might be ANIMATE and HUMAN and MALE and ADULT; because all the semantic components of man are included in the semantic components of bachelor, it can be inferred that bachelor → man. In addition, there are implicational rules between semantic components, e.g. HUMAN → ANIMATE, which also look very much like meaning postulates.

• George Miller, “On Knowing a Word,” 1999
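The bachelor/man inference described above amounts to a subset test over sets of semantic components. A minimal sketch, assuming the components are stored as plain Python sets (the dictionary name `COMPONENTS` and the function `entails` are illustrative choices, not Katz and Fodor's notation):

```python
# Semantic components taken from the example in the quoted passage.
COMPONENTS = {
    "man": {"ANIMATE", "HUMAN", "MALE", "ADULT"},
    "bachelor": {"ANIMATE", "HUMAN", "MALE", "ADULT", "NEVER MARRIED"},
}


def entails(word_a, word_b):
    """word_a entails word_b if every component of word_b is also in word_a."""
    return COMPONENTS[word_b] <= COMPONENTS[word_a]


print(entails("bachelor", "man"))  # all components of man are in bachelor
print(entails("man", "bachelor"))  # NEVER MARRIED is missing from man
```

The implicational rules between components (e.g. HUMAN implies ANIMATE) are not modeled here; they would let the ANIMATE entry be derived instead of listed.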

Lexical Semantics: Relational Approach

• Relational lexical semantics was first introduced by Carnap (1956) in the form of meaning postulates, where each postulate stated a semantic relation between words. A meaning postulate might look something like dog → animal (if x is a dog then x is an animal) or, adding logical constants, bachelor → man and never married [if x is a bachelor then x is a man and not(x has married)] or tall → not short [if x is tall then not(x is short)]. The meaning of a word was given, roughly, by the set of all meaning postulates in which it occurs.

• George Miller, “On Knowing a Word,” 1999
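One way to picture the relational approach is to store each meaning postulate as an (antecedent, consequent) pair and read off a word's "meaning" as the postulates that mention it. The sketch below uses only the three postulates from the passage; the pair encoding, the string "not married", and the function names are assumptions made for illustration.

```python
# Each postulate reads "if x is <antecedent> then x is <consequent>".
POSTULATES = [
    ("dog", "animal"),
    ("bachelor", "man"),
    ("bachelor", "not married"),
    ("tall", "not short"),
]


def meaning(word):
    """The postulates in which `word` occurs, on either side of the arrow."""
    return [postulate for postulate in POSTULATES if word in postulate]


def implied_by(word):
    """Everything that directly follows from 'x is <word>'."""
    return {consequent for antecedent, consequent in POSTULATES
            if antecedent == word}


print(meaning("bachelor"))
print(meaning("man"))
print(implied_by("bachelor"))
```

Note that "man" acquires part of its meaning from a postulate about "bachelor": in this view meaning lives in the network of relations, not in any single entry.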

Psycholinguistics

• The introduction of Noam Chomsky’s theory of syntax to psychologists:
  – Miller, G.A., Galanter, E., Pribram, K.H. (1960) Plans and the Structure of Behavior.
• Some areas of psycholinguistics:
  – Children’s acquisition of language
  – First and second language learning
  – Artificial intelligence? (see Lyons, 1981)

WordNet

• Started in 1985 by George Miller, students, and colleagues at the Cognitive Science Laboratory, Princeton University

• Can be downloaded for free: www.cogsci.princeton.edu/~wn/

• In terms of coverage, WordNet’s goals differ little from those of a good standard college-level dictionary, and the semantics of WordNet is based on the notion of word sense that lexicographers have traditionally used in writing dictionaries. It is in the organization of that information that WordNet aspires to innovation. (Miller, 1998, chapter 1)

Presuppositions of WordNet project

• Separability hypothesis: The lexical component of language can be separated and studied in its own right.

• Patterning hypothesis: People have knowledge of the systematic patterns and relations between word meanings.

• Comprehensiveness hypothesis: Computational linguistics programs need a store of lexical knowledge that is as extensive as that which people have.

WordNet structure

• Synsets versus Words

WordNet: Size

POS          Unique Strings   Synsets
Noun                 107930     74488
Verb                  10806     12754
Adjective             21365     18523
Adverb                 4583      3612
Totals               144684    109377
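The table counts unique strings and synsets separately because the two are related many-to-many: one string can belong to several synsets (one per sense), and one synset groups several strings. A small sketch of that structure, using invented toy synsets rather than entries copied from the WordNet database (the identifiers such as "car.1" are hypothetical):

```python
from collections import defaultdict

# Toy synsets: synset id -> set of word strings (illustrative data only).
SYNSETS = {
    "car.1": {"car", "auto", "automobile"},
    "car.2": {"car", "railcar", "railway_car"},
    "bank.1": {"bank", "depository"},
}


def build_index(synsets):
    """Invert the synset table: word string -> ids of synsets containing it."""
    index = defaultdict(set)
    for synset_id, words in synsets.items():
        for word in words:
            index[word].add(synset_id)
    return index


INDEX = build_index(SYNSETS)
print("unique strings:", len(INDEX))
print("synsets:", len(SYNSETS))
print("senses of 'car':", sorted(INDEX["car"]))
```

Looking a word up means going through the index to its synsets; relations such as hypernymy then hold between synsets, not between strings.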

Structure of WordNet

Unique Beginners

• { entity, something, (anything having existence (living or nonliving)) }
• { psychological_feature, (a feature of the mental life of a living organism) }
• { abstraction, (a general concept formed by extracting common features from specific examples) }
• { state, (the way something is with respect to its main attributes; "the current state of knowledge"; "his state of health"; "in a weak financial state") }
• { event, (something that happens at a given place and time) }
• { act, human_action, human_activity, (something that people do or cause to happen) }
• { group, grouping, (any number of entities (members) considered as a unit) }
• { possession, (anything owned or possessed) }
• { phenomenon, (any state or process known through the senses rather than by intuition or reasoning) }

Roget’s “Unique Beginners”

The ontology of Roget’s is headed by six Classes. The first three Classes cover the external world: Abstract Relations deals with such ideas as number, order and time; Space is concerned with movement, shapes and sizes, while Matter covers the physical world and humankind’s perception of it by means of five senses. The remaining Classes deal with the internal world of human beings: the mind (Intellect), the will (Volition), the heart and soul (Emotion, Religion and Morality). There is a logical progression from abstract concepts, through the material universe, to mankind itself, culminating in what Roget saw as mankind’s highest achievements: morality and religion (Kirkpatrick, 1998). Class Four, Intellect, is divided into Formation of ideas and Communication of ideas, and Class Five, Volition, into Individual volition and Social volition. In practice, therefore, the Thesaurus is headed by eight Classes. A path in Roget’s ontology always begins with one of the Classes. It branches to one of the 39 Sections and then to one of the 990 Heads. Each Head is divided into paragraphs grouped by parts of speech: nouns, adjectives, verbs and adverbs.

From Mario Jarmasz, Stan Szpakowicz, “Roget’s Thesaurus as an Electronic Lexical Knowledge Base,” 2000.

WordNet Browsers

• http://www.cogsci.princeton.edu/cgi-bin/webwn

• http://bogart.sip.ucm.es/~jorge/browser.htm

• http://www.visualthesaurus.com/

Other WordNets

http://www.hum.uva.nl/~ewn/gwa/wordnet_table.htm

• Dutch

• Spanish

• Italian

• German

• French

• Czech

• Estonian

Forthcoming WordNets

http://www.hum.uva.nl/~ewn/gwa/wordnet_table.htm

• Bengali
• Bulgarian
• Danish
• Greek
• Hebrew
• Hindi
• Kannada
• Latvian
• Moldavian
• Romanian
• Russian
• Slovenian
• Swedish
• Tamil
• Thai
• Turkish
• Yugoslavian
• Norwegian
• Icelandic

Psycholinguistic evidence for WordNet’s structure

• Bever and Rosenbaum, 1970:
  – A pistol is more dangerous than a rifle.
  – * A pistol is more dangerous than a gun.
  – * A gun is more dangerous than a pistol.
• Resnik, 1993
  – The direct object of the verb drink can be any hyponym of the noun beverage.
• Collins and Quillian, 1969
  – The time required to verify the statement “A robin is a bird” is shorter than the time required to verify the statement “A robin is an animal.”
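All three findings can be read as statements about an is-a hierarchy: a word and its own hypernym make an odd comparison, a verb can restrict its object to hyponyms of some noun, and verification time grows with the number of levels crossed. The sketch below encodes a tiny hand-made hierarchy to make this concrete. The hierarchy is a deliberate simplification (WordNet places many more levels between robin and animal), and the words "beer" and "tea" as well as the function names are assumptions for the example.

```python
# Toy is-a hierarchy: word -> its immediate hypernym (simplified assumption).
HYPERNYM = {
    "robin": "bird", "bird": "animal",
    "pistol": "gun", "rifle": "gun",
    "beer": "beverage", "tea": "beverage",
}


def ancestors(word):
    """Hypernyms of `word`, nearest first."""
    chain = []
    while word in HYPERNYM:
        word = HYPERNYM[word]
        chain.append(word)
    return chain


def is_hyponym(word, category):
    """True if `category` is somewhere above `word` in the hierarchy."""
    return category in ancestors(word)


def levels_up(word, category):
    """Number of is-a links from `word` up to `category`, or None."""
    chain = ancestors(word)
    return chain.index(category) + 1 if category in chain else None


print(levels_up("robin", "bird"), levels_up("robin", "animal"))
print(is_hyponym("beer", "beverage"), is_hyponym("rifle", "beverage"))
print(is_hyponym("pistol", "gun"), is_hyponym("pistol", "rifle"))
```

In this toy hierarchy robin is one link from bird and two from animal, which is the ordering the Collins and Quillian timing result would predict.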

Psycholinguistic evidence against WordNet’s structure

• Smith and Medin, 1981
  – The time required to verify that a chicken is a bird is significantly longer than the time required to verify that a robin is a bird, even though chicken and robin stand in the same taxonomic relation to bird.
• Rosch, 1973
  – Ratings of “typicality” have little to do with frequency or familiarity.
• Lakoff, 1987
  – Concepts are represented, not by a list of distinguishing features, but by the focal instances (or prototypes) that are the best examples of the concept.

WordNet Applications

• Using WordNet as a data structure. Many languages used by computational linguists and natural language processing researchers now have WordNet packages. E.g., for Perl:
  – Lingua::Wordnet, and
  – Lingua::Wordnet::Analysis
  by Dan Brian, http://search.cpan.org/search?dist=Lingua-Wordnet

WordNet Applications

• Information Retrieval: Voorhees, 1998
  – Query expansion via synsets
  – “sense-based” rather than “stem-based” vectors
  – Unfortunately, in both cases, the inability to automatically resolve word senses prevented any improvement from being made.
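A rough sketch of query expansion via synsets, and of why unresolved word senses get in the way. The synsets here are invented toy data and `expand_query` is a hypothetical helper; this is not Voorhees's experimental setup, only the general idea of adding every synonym of every query term.

```python
# Toy synsets (illustrative only); "bank" appears in two different senses.
SYNSETS = [
    {"car", "auto", "automobile"},
    {"bank", "depository"},   # financial institution sense
    {"bank", "riverbank"},    # side-of-a-river sense
]


def expand_query(terms):
    """Add every synonym from every synset that contains a query term."""
    expanded = []
    for term in terms:
        if term not in expanded:
            expanded.append(term)
        for synset in SYNSETS:
            if term in synset:
                for synonym in sorted(synset):
                    if synonym not in expanded:
                        expanded.append(synonym)
    return expanded


print(expand_query(["car", "bank"]))
```

Because nothing picks the intended sense of "bank", the expanded query gains both "depository" and "riverbank"; one of the two is noise for any given information need.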

WordNet Applications

• Textual Cohesion and the correction of Malapropisms: Hirst and St-Onge, 1998

Malapropism = the confounding of an intended word with another word of similar sound or similar spelling that has a quite different meaning; e.g., “Super bowl” → “Superb owl”
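The general idea can be sketched as follows: a word is suspicious if it is unrelated to the rest of the text, and a candidate correction is a similarly spelled word that is related. This is a heavily simplified stand-in and not Hirst and St-Onge's algorithm: the relatedness groups and vocabulary are hand-written toy data where their work relies on WordNet relations, and "similar spelling" is reduced to a single-character edit.

```python
# Toy groups of mutually related words (a stand-in for WordNet relations).
RELATED_GROUPS = [
    {"football", "game", "bowl", "team"},
    {"owl", "bird", "nest"},
]
VOCABULARY = set().union(*RELATED_GROUPS)


def related(a, b):
    return any(a in group and b in group for group in RELATED_GROUPS)


def one_edit_apart(a, b):
    """True if b differs from a by one insertion, deletion, or substitution."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equal-length) string
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # substitution at position i
    return a[i:] == b[i + 1:]          # b has one extra character at i


def suggest_corrections(words):
    """Map each word that fits nothing in the text to related spelling variants."""
    suggestions = {}
    for word in words:
        context = [other for other in words if other != word]
        if any(related(word, other) for other in context):
            continue  # the word coheres with the rest of the text
        variants = sorted(
            candidate for candidate in VOCABULARY
            if one_edit_apart(word, candidate)
            and any(related(candidate, other) for other in context)
        )
        if variants:
            suggestions[word] = variants
    return suggestions


print(suggest_corrections(["team", "game", "owl"]))
```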

WordNet Applications

• Temporal Indexing through lexical chaining: Al-Halimi and Kazman, 1998

• Indexing transcripts of conference meetings by topic.
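Lexical chaining groups related words as they occur, so each chain marks a topic and the stretch of the transcript it covers. The sketch below is a greedy toy version under stated assumptions: the topic groups, the timestamps in seconds, and the sample words are all invented, and the published method uses WordNet relations rather than hand-written groups.

```python
# Toy groups of related words (a stand-in for WordNet-based relatedness).
TOPIC_GROUPS = [
    {"budget", "cost", "price"},
    {"server", "network", "database"},
]


def related(a, b):
    return a == b or any(a in group and b in group for group in TOPIC_GROUPS)


def build_chains(timed_words):
    """Greedily attach each (time, word) pair to the first related chain."""
    chains = []
    for time, word in timed_words:
        for chain in chains:
            if any(related(word, member) for member in chain["words"]):
                chain["words"].append(word)
                chain["end"] = time
                break
        else:
            # no existing chain is related: this word starts a new topic
            chains.append({"words": [word], "start": time, "end": time})
    return chains


transcript = [(0, "budget"), (5, "cost"), (12, "server"),
              (20, "price"), (31, "network")]
for chain in build_chains(transcript):
    print(chain["start"], chain["end"], chain["words"])
```

Each chain's start and end times give a temporal index entry: a topic, plus where in the meeting it was discussed.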

WordNet Applications

• Conversation themes in Usenet: Sack, 2000

Next Time

• Information Extraction, Artificial Intelligence, and “Story Understanding” Revisited