Natural Language Processing (NLP) I. Introduction II. Issues in NLP III. Statistical NLP: Corpus-based Approach


Page 1

Natural Language Processing (NLP)

I. Introduction II. Issues in NLP III. Statistical NLP: Corpus-based Approach

Page 2

I. Introduction

Language: Medium for transfer of information

Natural Language: any language used by humans, as opposed to artificial languages such as computer languages

Two Basic Questions in Linguistics:

Q1 (Syntax): What kinds of things do people say?

Q2 (Semantics): What do these things say about the world?

Page 3

Natural Language Processing (NLP): As a branch of computer science, the goal is to use computers to process natural language.

Computational Linguistics (CL): As an interdisciplinary field between linguistics and computer science, it concerns the computational aspects (e.g., theory building & testing) of human language.

NLP is an applied component of CL

Page 4

Uses of Natural Language Processing:
- Speech Recognition (convert a continuous stream of sound waves into discrete words) – phonetics & signal processing
- Language Understanding (extract ‘meaning’ from identified words) – syntax & semantics

- Language Generation/ Speech Synthesis: Generate appropriate meaningful NL responses to NL inputs

- Turing Test (some humans fail this test!)
- ELIZA (Weizenbaum, 1966): rule-based keyword matching
- Loebner Prize (since 1991): $100,000; so far none with above 50% success
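ELIZA's rule-based keyword matching can be illustrated with a toy sketch. The patterns and responses below are hypothetical examples, not Weizenbaum's actual script:

```python
import re

# Toy ELIZA-style rules: (keyword pattern, response template).
# These rules are illustrative stand-ins, not the original DOCTOR script.
RULES = [
    (re.compile(r"\bI am (.+)", re.IGNORECASE), "Why do you say you are {0}?"),
    (re.compile(r"\bI feel (.+)", re.IGNORECASE), "How long have you felt {0}?"),
    (re.compile(r"\bmother\b", re.IGNORECASE), "Tell me more about your family."),
]
DEFAULT = "Please go on."

def respond(utterance: str) -> str:
    """Return the response of the first matching rule, else a default."""
    for pattern, template in RULES:
        m = pattern.search(utterance)
        if m:
            return template.format(*m.groups())
    return DEFAULT

print(respond("I am sad about my job"))  # Why do you say you are sad about my job?
print(respond("The weather is nice"))    # Please go on.
```

The trick, as the Turing Test discussion suggests, is that such shallow keyword matching can feel surprisingly human without any understanding at all.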

- Automatic Machine Translation (translate one NL into another) – hard problem
- Automatic Knowledge Acquisition (computer programs that read books and listen to human conversation to extract knowledge) – very hard problem

Page 5

II. Issues in NLP

1. Rational vs Empiricist Approach

2. Role of Nonlinguistic Knowledge

Page 6

1. Rational vs Empiricist Approach

Rational Approach to Language (1960–1985):

Most of the human language faculty is hardwired in the brain at birth and inherited in the genes.

- Universal Grammar (Chomsky, 1957)
- Explains why children can learn something as complex as a natural language from limited input in such a short time (2–3 years)

- Poverty of the Stimulus (Chomsky, 1986): There are simply not enough inputs for children to learn key parts of language

Page 7

Empiricist Approach (1920–60, 1985–present):

The baby’s brain comes with some general rules of language operation, but its detailed structure must be learned from external inputs (e.g., N–V–O vs N–O–V word order)

- Values of Parameters: A general language model is predetermined, but the values of its parameters must be fine-tuned

(e.g.)
- Y = aX + b (a, b: parameters)
- M/I Home: basic floor plan plus custom options

Page 8

2. Role of Nonlinguistic Knowledge

Grammatical Parsing (GP) View of NLP: Grammatical principles and rules play a primary role in language processing.

Extreme GP view: Every grammatically well-formed sentence is meaningfully interpretable, and vice versa. This is an unrealistic view! (“All grammars leak” (Sapir, 1921))

(e.g.) “Colorless green ideas sleep furiously”
(grammatically correct but semantically strange)

“The horse raced past the barn fell”
(seems ungrammatical but is semantically fine when read as “The horse that was raced (by someone) past the barn fell”) – resolving it requires nonlinguistic knowledge

Page 9

Examples of sentences with lexically ambiguous words (but semantically ok):
- The astronomer married the star.
- The sailor ate a submarine.
- Time flies like an arrow.
- Our company is training workers.

Clearly, language processing requires more than grammatical information

Integrated Knowledge View of NLP: Language processing requires both grammatical knowledge (grammaticality) and general world knowledge (conventionality).

(e.g.) John wanted money. He got a gun and walked into a liquor store. He told the owner he wanted some money. The owner gave John the money and John left.

This explains how difficult the NLP problem is and why no one has yet succeeded in developing a reliable NLP system.

Page 10

Page 11

III. Statistical NLP: Corpus-based approach

Rather than studying language by observing language use in actual situations, researchers use a pre-collected body of texts called a corpus.

Brown Corpus (1960s): one million words put together at Brown University from fiction, newspaper articles, scientific text, legal text, etc.

Susanne Corpus: 130,000 words; a subset of the Brown Corpus; syntactically annotated; available free.

Canadian Hansards: Bilingual corpus; fully translated between English and French.

Page 12

Example of the Corpus-based Approach to NLP

Mark Twain’s Tom Sawyer:
- 71,370 words total (tokens)
- 8,018 different words (types)

Word  Freq    Word  Freq
the   3332    in    906
and   2972    that  877
a     1775    he    877
to    1725    I     783
of    1440    his   772
was   1161    you   686
it    1027    Tom   679

Q: Why are the words not equally frequent? What does this tell us about language?
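Token and type counts like those above can be reproduced for any plain text with a few lines of Python; a minimal sketch using `collections.Counter` (the sample text is a stand-in, not Tom Sawyer):

```python
import re
from collections import Counter

def word_stats(text: str):
    """Tokenize crudely on letters/apostrophes and count tokens and types."""
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    counts = Counter(tokens)
    return len(tokens), len(counts), counts

text = "the cat sat on the mat and the dog sat too"
n_tokens, n_types, counts = word_stats(text)
print(n_tokens, n_types)      # 11 8
print(counts.most_common(2))  # [('the', 3), ('sat', 2)]
```

Running the same function over the full novel would reproduce the 71,370-token / 8,018-type figures (up to tokenization choices, which affect counts at the margins).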

Page 13

- Out of the 8,018 word types, 3,993 (50%) occur only once, 1,292 (16%) twice, 664 (8%) three times, …

- Over 90% of the word types occur 10 times or fewer.
- Each of the 12 most common words occurs over 700 times (about 1% of all tokens each, over 12% together)

Occurrence Count    No. of Word Types
1                   3993
2                   1292
3                   664
4                   410
…                   …
10                  91
11–50               540
51–100              99
>100                102

Page 14

Zipf’s Law and Principle of Least Effort- Empirical law uncovered by Zipf in 1929:

f (word type frequency) × r (rank) = k (constant)

The more frequent a word type is, the higher it ranks in the frequency list (rank 1 = most frequent). According to Zipf’s law, a word’s frequency count should be inversely proportional to its rank.
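The claim that f × r stays roughly constant can be checked directly against the Tom Sawyer counts from the earlier slide; a minimal sketch:

```python
# Top Tom Sawyer frequencies from the earlier slide, in rank order.
freqs = [3332, 2972, 1775, 1725, 1440, 1161, 1027, 906, 877, 877]

# Under Zipf's law, frequency * rank should be roughly constant.
products = [f * r for r, f in enumerate(freqs, start=1)]
print(products)
```

After rank 1 the products hover in the 5,000–9,000 range; the deviation at the very top ranks is typical of real data and is what Mandelbrot's correction (later in these slides) addresses.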

Principle of Least Effort: A unifying principle proposed by Zipf to account for Zipf’s law: “Both the speaker and the hearer are trying to minimize their effort. The former’s effort is conserved by having a small vocabulary of common words (i.e., larger f), whereas the latter’s effort is conserved by having a large vocabulary of rarer words (i.e., smaller f) so that messages are less ambiguous. Zipf’s law represents an optimal compromise between these opposing efforts.”

Page 15

Page 16

Zipf’s Law on a log-log plot: freq = k/rank

or, log(freq) = -log(rank) + log(k)
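On a log-log plot, Zipf's law predicts a straight line of slope -1. A plain-Python least-squares estimate over the slide's top-10 counts (only the head of the distribution, so the fitted slope comes out flatter than -1):

```python
import math

# Top Tom Sawyer frequencies from the earlier slide, ranks 1..10.
freqs = [3332, 2972, 1775, 1725, 1440, 1161, 1027, 906, 877, 877]
xs = [math.log(r) for r in range(1, len(freqs) + 1)]
ys = [math.log(f) for f in freqs]

# Ordinary least-squares slope of log(freq) versus log(rank).
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
print(round(slope, 2))
```

The slope is clearly negative but shallower than -1 on these ten points: the flattening at the top ranks is exactly the deviation Mandelbrot's more general form (next slides) is designed to capture.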

Page 17

“More exact” Zipf’s Law (Mandelbrot, 1954):

Mandelbrot derived a more general form of Zipf’s law from theoretical principles:

f = k / (r + a)^b    (k, a, b: parameters)

which reduces to Zipf’s law for a = 0 and b = 1.

Page 18

Mandelbrot’s Fit: log(freq) = -b*log(rank+a) + log(k)
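Mandelbrot's form has three free parameters. A crude way to fit it without any curve-fitting library is a grid search over a and b, noting that for fixed (a, b) the best log(k) is just the mean residual; a minimal sketch on the slide's top-10 counts (the grid ranges and step size are arbitrary choices):

```python
import math

# Top Tom Sawyer frequencies from the earlier slide, ranks 1..10.
freqs = [3332, 2972, 1775, 1725, 1440, 1161, 1027, 906, 877, 877]
ranks = range(1, len(freqs) + 1)

def sse(a: float, b: float) -> float:
    """Sum of squared log-errors of f = k/(r+a)^b; best log(k) is the mean residual."""
    resid = [math.log(f) + b * math.log(r + a) for f, r in zip(freqs, ranks)]
    log_k = sum(resid) / len(resid)
    return sum((x - log_k) ** 2 for x in resid)

# Grid search: a in [0, 5], b in [0.1, 3], step 0.1.
best = min(((a / 10, b / 10) for a in range(0, 51) for b in range(1, 31)),
           key=lambda ab: sse(*ab))
print(best)
```

Since the grid includes (a, b) = (0, 1), the fitted Mandelbrot curve can never do worse than plain Zipf on these points; in practice a > 0 improves the fit at the top ranks.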

Page 19

What does Zipf’s law tell us about language?

ANS: Not much (virtually nothing!)

A Zipf’s law can be obtained even under the assumption that text is randomly produced by independently choosing one of N letters with equal probability r and the space with probability (1 − Nr).

Thus, Zipf’s law is about the distribution of words, whereas language, especially semantics, is about the interrelations between words.

Page 20

Letters a, b, c, d, …, w, x, y, z each chosen with probability 1/30; the space with probability 4/30.

Sample random “text”: kaf yoguu th oouqp qbnsii th …

Zipf’s Law!
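The random-text argument can be checked empirically; a minimal simulation (the text length and random seed are arbitrary choices):

```python
import random
from collections import Counter

random.seed(0)
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

# Random "text": each of 26 letters with probability 1/30, space with 4/30.
chars = random.choices(list(ALPHABET) + [" "], weights=[1] * 26 + [4], k=200_000)
words = "".join(chars).split()

# Rank-frequency list of the random "words".
freqs = sorted(Counter(words).values(), reverse=True)
print(freqs[:5])
```

Plotting log(freq) against log(rank) for this output gives a near-straight line even though the "language" is pure noise: short letter combinations are simply far more probable than long ones, which is enough to produce the Zipf-like shape.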

Page 21

In short, Zipf’s law does not indicate some deep underlying process of language.

Zipf’s law or the like is typical of many stochastic random processes, and is unrelated to the characteristic features of any particular random process. In short, it represents a phenomenon of universal occurrence that contains no specific information about the underlying process. Rather, language-specific information is hidden in the deviations from the law, not in the law itself.