
Natural Language Processing: (Simple) Word Counting Regina Barzilay EECS Department MIT November 15, 2004


Page 1: Natural Language Processing - Massachusetts … dspace.mit.edu/bitstream/handle/1721.1/36844/6-881Fall-2004/NR/r...

Natural Language Processing: (Simple) Word Counting

Regina Barzilay

EECS Department

MIT

November 15, 2004


Today

• Corpora and their properties

• Zipf’s Law

• Examples of annotated corpora

• Word segmentation algorithm

Natural Language Processing:(Simple) Word Counting 1/35


Corpora

• A corpus is a body of naturally occurring text, stored in a machine-readable form

• A balanced corpus tries to be representative of a language or domain


Word Counts

• What are the most common words in the text?

• How many words are there in the text?

• What are the properties of word distribution in large corpora?

We will consider Mark Twain’s Tom Sawyer


Most Common Words

Word Freq Use

the 3332 determiner (article)

and 2972 conjunction

a 1775 determiner

to 1725 preposition, inf. marker

of 1440 preposition

was 1161 auxiliary verb

it 1027 pronoun

in 906 preposition

that 877 complementizer

Tom 678 proper name


Most Common Words (Cont.)

Some observations:

• Dominance of function words

• Presence of corpus-dependent items (e.g., “Tom”)

Is it possible to create a truly “representative” sample of English?


How Many Words Are There?

They picnicked by the pool, then lay back on the grass and looked at the stars.

• Type — number of distinct words in a corpus(vocabulary size)

• Token — total number of words in a corpus

Tom Sawyer:
word types — 8,018
word tokens — 71,370
average frequency — 9
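Counts like these drop out of a simple frequency table. A minimal sketch in Python (whitespace splitting is a crude stand-in for real tokenization, which later slides discuss):

```python
from collections import Counter

def word_stats(text):
    # crude tokenization: split on whitespace
    tokens = text.split()
    counts = Counter(tokens)
    n_tokens = len(tokens)           # total words (tokens)
    n_types = len(counts)            # distinct words (types)
    avg_freq = n_tokens / n_types    # average tokens per type
    return n_tokens, n_types, avg_freq

word_stats("the cat sat on the mat")  # -> (6, 5, 1.2)
```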


Frequencies of Frequencies

Word Frequency Frequency of Frequency

1 3993

2 1292

3 664

4 410

5 243

6 199

7 172

8 131

9 82

10 91

11-50 540

51-100 99

Most words in a corpus appear only once!
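The table above is a "Counter of a Counter": count each word, then count how many words share each frequency. A sketch (the toy token list is made up):

```python
from collections import Counter

def freq_of_freqs(tokens):
    # word -> frequency, then frequency -> number of words with that frequency
    word_freq = Counter(tokens)
    return Counter(word_freq.values())

freq_of_freqs("a a a b b c d e".split())
# -> {1: 3, 2: 1, 3: 1}: three hapaxes ("c", "d", "e"), one word each
#    with frequency 2 and 3
```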


Zipf’s Law in Tom Sawyer

Word Freq. (f) Rank (r) f · r

the 3332 1 3332

and 2972 2 5944

a 1775 3 5325

he 877 10 8770

but 410 20 8200

be 294 30 8820

there 222 40 8880

one 172 50 8600

about 158 60 9480

never 124 80 9920

Oh 116 90 10440


Zipf’s Law

The frequency of use of the nth-most-frequently-used word in any natural language is approximately inversely proportional to n.

Zipf’s Law captures the relationship between frequency and rank:

f ∝ 1/r

That is, there is a constant k such that:

f · r = k
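The f · r column of the Tom Sawyer table is easy to reproduce for any corpus: rank words by frequency and multiply. A sketch (the synthetic token list below is constructed to be roughly Zipfian; a real corpus would replace it):

```python
from collections import Counter

def zipf_table(tokens, ranks=(1, 2, 3, 10, 20)):
    # frequencies sorted in descending order; rank r is 1-based
    freqs = sorted(Counter(tokens).values(), reverse=True)
    return [(r, freqs[r - 1], r * freqs[r - 1])
            for r in ranks if r <= len(freqs)]

# synthetic corpus: word i appears about 100/i times
tokens = [f"w{i}" for i in range(1, 11) for _ in range(100 // i)]
zipf_table(tokens, ranks=(1, 2, 3))
# -> [(1, 100, 100), (2, 50, 100), (3, 33, 99)] — f * r is roughly constant
```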


Zipf’s Law


Mandelbrot’s refinement

f = P · (r + p)^(−B)

or, taking logs,

log f = log P − B · log(r + p)

• P, B, p are parameters estimated for a particular corpus

• Better fit at low and high ranks
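The two forms above are algebraically equivalent, which a few lines verify. The parameter values here are placeholders for illustration only (P, B, p must be fit to a particular corpus):

```python
import math

def mandelbrot_freq(r, P=10 ** 5.4, B=1.15, p=100):
    # f = P * (r + p)^(-B); parameter values are illustrative, not fitted
    return P * (r + p) ** (-B)

# log form agrees term by term: log f = log P - B * log(r + p)
f1 = mandelbrot_freq(1)
log_form = math.log(10 ** 5.4) - 1.15 * math.log(1 + 100)
# math.log(f1) equals log_form up to floating-point error
```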


Zipf’s Law and Principle of Least Effort

Human Behavior and the Principle of Least Effort(Zipf): “. . . Zipf argues that he found a unifying principle, the Principle of Least Effort, which underlies essentially the entire human condition (the book even includes some questionable remarks on human sexuality!). The principle argues that people will act so as to minimize their probable average rate of work”. (Manning&Schutze, p.23)


Other laws

• Word sense distribution

• Phoneme distribution

• Word co-occurrence patterns


Examples of collections approximately obeying Zipf’s law

• Frequency of accesses to web pages

• Sizes of settlements

• Income distribution amongst individuals

• Size of earthquakes

• Notes in musical performances


Is Zipf’s Law unique to human language?

(Li 1992): randomly generated text exhibits Zipf’s law

Consider a generator that randomly produces characters from the 26 letters of the alphabet and the blank.

p(w_n) = (26/27)^n · (1/27)

(the probability that the generator produces a word of length n)

The words produced by such a generator obey a Mandelbrot-type power law:

• There are 26 times more words of length n + 1 than words of length n


• There is a constant ratio by which words of length n are more frequent than words of length n + 1
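Li's observation is easy to reproduce: generate random characters and tabulate the "words". A sketch (corpus size and seed are arbitrary choices):

```python
import random
from collections import Counter

def random_text_word_lengths(n_chars=200_000, seed=0):
    # each character drawn uniformly from 26 letters + blank
    rng = random.Random(seed)
    alphabet = "abcdefghijklmnopqrstuvwxyz "
    text = "".join(rng.choice(alphabet) for _ in range(n_chars))
    words = text.split()  # maximal letter runs between blanks
    return Counter(len(w) for w in words)

by_len = random_text_word_lengths()
# There are 26 one-letter types but 26^2 = 676 two-letter types, so the
# average frequency of a length-1 type is roughly 26x that of a length-2
# type -- the constant-ratio property noted above.
avg_len1 = by_len[1] / 26
avg_len2 = by_len[2] / (26 * 26)
```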


Sparsity

• How often does “kick” occur in 1M words? 58

• How often does “kick a ball” occur in 1M words? 0


Sparsity

There is no data like more data

• How often does “kick” occur in 1M words? 58

• How often does “kick a ball” occur in 1M words? 0

• How often does “kick” occur on the web? 6M

• How often does “kick a ball” occur on the web? 8,000


Very Very Large Data

• Brill&Banko 2001: in the task of confusion-set disambiguation, increasing the size of the training data yields significant improvements over the best-performing system trained on the standard training corpus

– Task: disambiguate between pairs such as "too" and "to"

– Training size: varies from one million to one billion words

– Learning methods used for comparison: winnow, perceptron, decision tree

• Lapata&Keller 2002, 2003: the web can be used as a very very large corpus

– The counts can be noisy, but for some tasks this is not an issue


The Brown Corpus

Famous early corpus (compiled by Nelson Francis and Henry Kucera at Brown University in the 1960s)

• A balanced corpus of written American English

– Newspaper, novels, non-fiction, academic

• 1 million words, 500 written texts

Do you think this is a large corpus?


Recent Corpora

Corpus                    Size         Domain    Language
NA News Corpus            600 million  newswire  American English
British National Corpus   100 million  balanced  British English
EU proceedings            20 million   legal     10 language pairs
Penn Treebank             2 million    newswire  American English
Broadcast News                         spoken    7 languages
SwitchBoard               2.4 million  spoken    American English

For more corpora, check the Linguistic Data Consortium:http://www.ldc.upenn.edu/


Corpus Content

• Genre:

– newswires, novels, broadcast, spontaneous conversations

• Media: text, audio, video

• Annotations: tokenization, syntactic trees, semantic senses, translations


Example of Annotations: POS Tagging

POS tags encode simple grammatical functions

• Several tag sets:

– Penn tag set (45 tags)

– Brown tag set (87 tags)

– CLAWS2 tag set (132 tags)

Category              Example         CLAWS c5  Brown  Penn
Adverb                often, badly    AV0       RB     RB
Noun singular         table, rose     NN1       NN     NN
Noun plural           tables, roses   NN2       NNS    NNS
Noun proper singular  Boston, Leslie  NP0       NP     NNP


Issues in Annotations

• Different annotation schemes for the same task are common

• In some cases, there is a direct mapping between schemes; in other cases, they do not exhibit any regular relation

• Choice of annotation scheme is motivated by linguistic, computational, and/or task requirements


Tokenization

Goal: divide text into a sequence of words

• Word is a string of contiguous alphanumeric characters with space on either side; may include hyphens and apostrophes but no other punctuation marks (Kucera and Francis)

Is tokenization easy?
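Even the Kucera–Francis definition above takes some care to implement. A regex sketch (the example sentence is made up; it previews the hard cases on the next slide):

```python
import re

def kf_tokenize(text):
    # a word: contiguous alphanumeric run, allowing internal hyphens
    # or apostrophes; all other punctuation is dropped
    return re.findall(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*", text)

kf_tokenize("He visited Wash. but won't wash his 85-year-old car.")
# -> ['He', 'visited', 'Wash', 'but', "won't", 'wash', 'his',
#     '85-year-old', 'car']
```

Note that stripping the period from the abbreviation "Wash." makes it collide with "wash" after lowercasing, which is exactly the kind of ambiguity the next slide raises.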


What’s a word?

• English:

– “Wash. vs wash”

– “won’t”, “John’s”

– “pro-Arab”, “the idea of a child-as-required-yuppie-possession must be motivating them”, “85-year-old grandmother”

• East Asian languages:

– words are not separated by white spaces


Word Segmentation

• Rule-based approach: morphological analysis based on lexical and grammatical knowledge

• Corpus-based approach: learn from corpora(Ando&Lee, 2000)

Issues to consider: coverage, ambiguity, accuracy


Motivation for Statistical Segmentation

• Unknown words problem:

– presence of domain terms and proper names

• Grammatical constraints may not be sufficient

– Example: alternative segmentation of noun phrases

Segmentation: sha-choh/ken/gyoh-mu/bu-choh
Translation: “president/and/business/general/manager”

Segmentation: sha-choh/ken-gyoh/mu/bu-choh
Translation: “president/subsidiary business/Tsutomi [a name]/general manager”


Word Segmentation

Key idea: for each candidate boundary, compare the frequency of the n-grams adjacent to the proposed boundary with the frequency of the n-grams that straddle it.

[Figure: a character sequence with a candidate boundary “?”; s1 and s2 are the non-straddling n-grams on either side of the boundary, and t1, t2, t3 are the n-grams straddling it.]

For n = 4, consider the 6 questions of the form: “Is #(s_i) ≥ #(t_j)?”, where #(x) is the number of occurrences of x.

Example: Is “TING” more frequent in the corpus than “INGE”?


Algorithm for Word Segmentation

s_1^n — non-straddling n-gram to the left of location k
s_2^n — non-straddling n-gram to the right of location k
t_j^n — straddling n-gram with j characters to the right of location k
I≥(y, z) — indicator function that is 1 when y ≥ z, and 0 otherwise

1. Calculate the fraction of affirmative answers for each n in N:

v_n(k) = (1 / (2(n − 1))) · Σ_{i=1}^{2} Σ_{j=1}^{n−1} I≥(#(s_i^n), #(t_j^n))

2. Average the contributions of each n-gram order:

v_N(k) = (1 / |N|) · Σ_{n∈N} v_n(k)


Algorithm for Word Segmentation (Cont.)

Place boundary at all locations l such that either:

• l is a local maximum: v_N(l) > v_N(l − 1) and v_N(l) > v_N(l + 1)

• v_N(l) ≥ t, a threshold parameter

[Figure: plot of v_N(k) against location k with threshold t; the resulting segmentation is A B | C D | W X | Y | Z]
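The scoring and boundary-placement steps of the last two slides can be sketched compactly. This is a simplified illustration, not the paper's implementation (edge handling and tie-breaking are glossed over; the order set and threshold are arbitrary):

```python
from collections import Counter

def segment_scores(text, orders=(2, 3, 4)):
    # n-gram counts taken from the unsegmented corpus itself
    counts = {n: Counter(text[i:i + n] for i in range(len(text) - n + 1))
              for n in orders}

    def v_n(k, n):
        # s1, s2: non-straddling n-grams left/right of location k
        s = [text[k - n:k], text[k:k + n]]
        # t_j: straddling n-grams, j = 1..n-1 characters right of k
        t = [text[k - n + j:k + j] for j in range(1, n)]
        votes = sum(counts[n][si] >= counts[n][tj] for si in s for tj in t)
        return votes / (2 * (n - 1))

    m = max(orders)  # only score locations with full context on both sides
    return {k: sum(v_n(k, n) for n in orders) / len(orders)
            for k in range(m, len(text) - m + 1)}

def boundaries(scores, t=0.9):
    # boundary at local maxima of v_N, or wherever v_N >= threshold t
    return sorted(k for k in scores
                  if scores[k] >= t
                  or (k - 1 in scores and k + 1 in scores
                      and scores[k] > scores[k - 1]
                      and scores[k] > scores[k + 1]))
```

On a toy corpus such as `"abcdabcdabcdabcd"`, the score v_N(k) peaks at the repeats of "abcd", so boundaries land at multiples of 4.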


Experimental Framework

• Corpus: 150 megabytes of 1993 Nikkei newswire

• Manual annotations: 50 sequences for development set (parameter tuning) and 50 sequences for test set

• Baseline algorithms: Chasen and Juman morphological analyzers (115,000 and 231,000 words)


Evaluation Measures

tp — true positive; fp — false positive; tn — true negative; fn — false negative

System        target  not target
selected      tp      fp
not selected  fn      tn


Evaluation Measures (Cont.)

• Precision — the proportion of selected items that the system got right:

P = tp / (tp + fp)

• Recall — the proportion of target items that the system selected:

R = tp / (tp + fn)

• F-measure — the harmonic mean of the two:

F = 2PR / (P + R)

Word precision (P) is the percentage of proposed brackets that match word-level brackets in the annotation; word recall (R) is the percentage of word-level brackets that are proposed by the algorithm.
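The definitions above map directly to code. A minimal helper (the counts in the example call are made up):

```python
def precision_recall_f(tp, fp, fn):
    p = tp / (tp + fp)       # fraction of selected items that are correct
    r = tp / (tp + fn)       # fraction of target items that were selected
    f = 2 * p * r / (p + r)  # harmonic mean of P and R
    return p, r, f

precision_recall_f(8, 2, 4)  # -> (0.8, 0.666..., 0.727...)
```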


Conclusions

• Corpora are widely used in text processing

• Corpora can be used either annotated or raw

• Zipf’s law and its connection to natural language

• Sparsity is a major problem for corpus processing methods

Next time: Language modeling
