TRANSCRIPT
Natural Language Processing:(Simple) Word Counting
Regina Barzilay
EECS Department
MIT
November 15, 2004
Today
• Corpora and their properties
• Zipf’s Law
• Examples of annotated corpora
• Word segmentation algorithm
Natural Language Processing:(Simple) Word Counting 1/35
Corpora
• A corpus is a body of naturally occurring text, stored in a machine-readable form
• A balanced corpus tries to be representative across a language or domain
Word Counts
• What are the most common words in the text?
• How many words are there in the text?
• What are the properties of word distribution in large corpora?
We will consider Mark Twain’s Tom Sawyer
Most Common Words
Word Freq Use
the 3332 determiner (article)
and 2972 conjunction
a 1775 determiner
to 1725 preposition, inf. marker
of 1440 preposition
was 1161 auxiliary verb
it 1027 pronoun
in 906 preposition
that 877 complementizer
Tom 678 proper name
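A table like this can be reproduced in a few lines of Python (a minimal sketch: the tokenizer is a simple regex and `sample` stands in for the full text of Tom Sawyer):

```python
from collections import Counter
import re

def top_words(text, n=10):
    """Return the n most frequent words (lowercased, alphabetic) in text."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(n)

sample = "the cat and the dog and the bird"
print(top_words(sample, 3))  # [('the', 3), ('and', 2), ('cat', 1)]
```

On the real novel, the function words "the", "and", "a" would dominate exactly as in the table above.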
Most Common Words (Cont.)
Some observations:
• Dominance of function words
• Presence of corpus-dependent items (e.g., “Tom”)
Is it possible to create a truly “representative” sample of English?
How Many Words Are There?
They picnicked by the pool, then lay back on the grass and looked at the stars.
• Type — number of distinct words in a corpus(vocabulary size)
• Token — total number of words in a corpus
Tom Sawyer:
word types — 8,018
word tokens — 71,370
average frequency — 9
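Type and token counts can be computed directly (a sketch; the regex tokenizer is an assumption, and the toy `sample` replaces the novel):

```python
import re

def type_token_stats(text):
    """Return (types, tokens, average frequency) for a text."""
    tokens = re.findall(r"[A-Za-z']+", text)          # token = alphabetic run
    types = set(t.lower() for t in tokens)            # type = distinct word form
    return len(types), len(tokens), len(tokens) / len(types)

sample = "They picnicked by the pool , then lay back on the grass"
n_types, n_tokens, avg = type_token_stats(sample)
print(n_types, n_tokens, avg)  # 10 11 1.1  ("the" occurs twice)
```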
Frequencies of Frequencies
Word Frequency Frequency of Frequency
1 3993
2 1292
3 664
4 410
5 243
6 199
7 172
8 131
9 82
10 91
11-50 540
51-100 99
Most words in a corpus appear only once!
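The frequency-of-frequency table is itself just a second round of counting (a minimal sketch on a toy word list):

```python
from collections import Counter

def freq_of_freq(word_counts):
    """Map each frequency f to the number of word types occurring exactly f times."""
    return Counter(word_counts.values())

counts = Counter("the cat and the dog and the bird".split())
ff = freq_of_freq(counts)
# ff[1] == 3: three types ("cat", "dog", "bird") occur exactly once,
# mirroring the dominance of frequency-1 words in the table above.
```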
Zipf’s Law in Tom Sawyer
Word Freq. (f) Rank (r) f ∗ r
the 3332 1 3332
and 2972 2 5944
a 1775 3 5325
he 877 10 8770
but 410 20 8200
be 294 30 8820
there 222 40 8880
one 172 50 8600
about 158 60 9480
never 124 80 9920
Oh 116 90 10440
Zipf’s Law
The frequency of use of the nth-most-frequently-used word in any natural language is approximately inversely proportional to n.
Zipf’s Law captures the relationship between frequency and rank:
f ∝ 1/r

There is a constant k such that:

f ∗ r = k
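The f ∗ r column in the Tom Sawyer table can be generated automatically (a sketch; on a toy input the product is only illustrative, while on a real corpus it stays roughly constant):

```python
from collections import Counter

def zipf_table(text):
    """Return (word, freq, rank, freq*rank) rows, most frequent first."""
    counts = Counter(text.split())
    rows = []
    for rank, (word, freq) in enumerate(counts.most_common(), start=1):
        rows.append((word, freq, rank, freq * rank))
    return rows

rows = zipf_table("a a a b b c")
print(rows)  # [('a', 3, 1, 3), ('b', 2, 2, 4), ('c', 1, 3, 3)]
```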
Zipf’s Law
Mandelbrot’s refinement
f = P (r + p)^(−B)

or

log f = log P − B log(r + p)

• P, B, p are parametrized for particular corpora
• Better fit at low and high ranks
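Mandelbrot's formula is easy to evaluate directly; the parameter values below are illustrative placeholders, not fitted to any corpus. Note that pure Zipf is recovered as the special case B = 1, p = 0:

```python
import math

def mandelbrot_freq(rank, P, B, p):
    """Predicted frequency f = P * (rank + p) ** (-B)."""
    return P * (rank + p) ** (-B)

# Special case B=1, p=0 reduces to Zipf's law f = P / r:
f1 = mandelbrot_freq(1, P=1000.0, B=1.0, p=0.0)    # 1000.0
f10 = mandelbrot_freq(10, P=1000.0, B=1.0, p=0.0)  # 100.0

# The log form holds term by term: log f = log P - B log(r + p)
lhs = math.log(mandelbrot_freq(5, 1000.0, 1.2, 2.7))
rhs = math.log(1000.0) - 1.2 * math.log(5 + 2.7)
assert abs(lhs - rhs) < 1e-9
```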
Zipf’s Law and Principle of Least Effort
Human Behavior and the Principle of Least Effort(Zipf): “. . . Zipf argues that he found a unifying principle, the Principle of Least Effort, which underlies essentially the entire human condition (the book even includes some questionable remarks on human sexuality!). The principle argues that people will act so as to minimize their probable average rate of work”. (Manning&Schutze, p.23)
Other laws
• Word sense distribution
• Phoneme distribution
• Word co-occurrence patterns
Examples of collections approximately obeying Zipf’s law
• Frequency of accesses to web pages
• Sizes of settlements
• Income distribution amongst individuals
• Size of earthquakes
• Notes in musical performances
Is Zipf’s Law unique to human language?
(Li 1992): randomly generated text exhibits Zipf’s law
Consider a generator that randomly produces characters from the 26 letters of the alphabet and the blank.
p(w_n) = (26/27)^n ∗ (1/27)
The words generated by such a generator obey a power law of the Mandelbrot form:

• There are 26 times more words of length n + 1 than words of length n
• There is a constant ratio by which words of length n are more frequent than words of length n + 1
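Li's observation can be reproduced with a toy "monkey typing" generator (a sketch; the uniform distribution over 26 letters plus space is the slide's assumption):

```python
import random
from collections import Counter

def random_words(n_chars, seed=0):
    """Emit n_chars symbols uniformly over 'a'-'z' and space; split on spaces."""
    rng = random.Random(seed)
    alphabet = "abcdefghijklmnopqrstuvwxyz "
    text = "".join(rng.choice(alphabet) for _ in range(n_chars))
    return [w for w in text.split(" ") if w]

words = random_words(200_000)
by_len = Counter(len(w) for w in words)
# Word-length counts decay geometrically with ratio 26/27 per extra character,
# so short words are reliably more frequent than long ones.
print(sum(by_len[n] for n in (1, 2, 3)) > sum(by_len[n] for n in (10, 11, 12)))
```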
Sparsity
• How often does “kick” occur in 1M words? 58
• How often does “kick a ball” occur in 1M words? 0
Sparsity
There is no data like more data
• How often does “kick” occur in 1M words? 58
• How often does “kick a ball” occur in 1M words? 0
• How often does “kick” occur on the web? 6M
• How often does “kick a ball” occur on the web? 8,000
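Counting a multi-word phrase is the same operation at the n-gram level; a minimal sketch (the toy sentence is illustrative) shows why longer phrases are so much sparser than single words:

```python
def phrase_count(tokens, phrase):
    """Count occurrences of a multi-word phrase (a token list) in a token list."""
    n = len(phrase)
    return sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase)

tokens = "he tried to kick the ball but missed the kick".split()
print(phrase_count(tokens, ["kick"]))               # 2
print(phrase_count(tokens, ["kick", "a", "ball"]))  # 0 — sparsity in miniature
```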
Very Very Large Data
• Brill&Banko 2001: in the task of confusion set disambiguation, increasing the training data size yields significant improvements over the best-performing system trained on the standard-size training corpus
– Task: disambiguate between pairs such as too, to
– Training size: varies from one million to one billion words
– Learning methods used for comparison: winnow, perceptron, decision-tree
• Lapata&Keller 2002, 2003: the web can be used as a very very large corpus
– The counts can be noisy, but for some tasks this is not an issue
The Brown Corpus
Famous early corpus (compiled by Nelson Francis and Henry Kucera at Brown University in the 1960s)
• A balanced corpus of written American English
– Newspaper, novels, non-fiction, academic
• 1 million words, 500 written texts
Do you think this is a large corpus?
Recent Corpora
Corpus Size (words) Domain Language
NA News Corpus 600 million newswire American English
British National Corpus 100 million balanced British English
EU proceedings 20 million legal 10 language pairs
Penn Treebank 2 million newswire American English
Broadcast News spoken 7 languages
SwitchBoard 2.4 million spoken American English
For more corpora, check the Linguistic Data Consortium: http://www.ldc.upenn.edu/
Corpus Content
• Genre:
– newswires, novels, broadcast, spontaneous conversations
• Media: text, audio, video
• Annotations: tokenization, syntactic trees, semantic senses, translations
Example of Annotations: POS Tagging
POS tags encode simple grammatical functions
• Several tag sets:
– Penn tag set (45 tags)
– Brown tag set (87 tags)
– CLAWS2 tag set (132 tags)
Category Example Claws c5 Brown Penn
Adverb often, badly AV0 RB RB
Noun singular table, rose NN1 NN NN
Noun plural tables, roses NN2 NN NN
Noun proper singular Boston, Leslie NP0 NP NNP
Issues in Annotations
• Different annotation schemes for the same task are common
• In some cases, there is a direct mapping between schemes; in other cases, they do not exhibit any regular relation
• Choice of annotation is motivated by linguistic, computational, and/or task requirements
Tokenization
Goal: divide text into a sequence of words
• Word is a string of contiguous alphanumeric characters with space on either side; may include hyphens and apostrophes but no other punctuation marks (Kucera and Francis)
Is tokenization easy?
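The Kucera and Francis definition above translates almost directly into a regular expression (a sketch; real tokenizers must also handle the harder cases on the next slide):

```python
import re

def tokenize(text):
    """Alphanumeric runs, allowing internal hyphens and apostrophes,
    per the Kucera-and-Francis-style definition of a word."""
    return re.findall(r"[A-Za-z0-9]+(?:[-'][A-Za-z0-9]+)*", text)

print(tokenize("The 85-year-old grandmother won't go."))
# ['The', '85-year-old', 'grandmother', "won't", 'go']
```

Note how even this simple rule must already decide that "85-year-old" is one token and that the final period belongs to no token.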
What’s a word?
• English:
– “Wash.” vs. “wash”
– “won’t”, “John’s”
– “pro-Arab”, “the idea of a child-as-required-yuppie-possession must be motivating them”, “85-year-old grandmother”
• East Asian languages:
– words are not separated by white spaces
Word Segmentation
• Rule-based approach: morphological analysis based on lexical and grammatical knowledge
• Corpus-based approach: learn from corpora (Ando&Lee, 2000)
Issues to consider: coverage, ambiguity, accuracy
Motivation for Statistical Segmentation
• Unknown words problem:
– presence of domain terms and proper names
• Grammatical constraints may not be sufficient
– Example: alternative segmentation of noun phrases
Segmentation Translation
sha-choh/ken/gyoh-mu/bu-choh “president/and/business/general/manager”
sha-choh/ken-gyoh/mu/bu-choh “president/subsidiary business/Tsutomi[a name]/general manager”
Word Segmentation
Key idea: for each candidate boundary, compare the frequency of the n-grams adjacent to the proposed boundary with the frequency of the n-grams that straddle it.
[Figure: a candidate boundary location “?” in a character sequence (e.g., between “TING” and the following characters); s1 and s2 are the non-straddling n-grams on either side of the boundary, and t1, t2, t3 are the n-grams straddling it.]
For n = 4, consider the 6 questions of the form: “Is #(si) ≥ #(tj)?”, where #(x) is the number of occurrences of x.
Example: Is “TING” more frequent in the corpus than “INGE”?
Algorithm for Word Segmentation
s_n^1 — the non-straddling n-gram ending just to the left of location k
s_n^2 — the non-straddling n-gram starting just to the right of location k
t_n^j — the straddling n-gram with j characters to the right of location k
I≥(y, z) — indicator function that is 1 when y ≥ z, and 0 otherwise.
1. Calculate the fraction of affirmative answers for each n in N:

v_n(k) = (1 / (2(n − 1))) Σ_{i=1}^{2} Σ_{j=1}^{n−1} I≥(#(s_n^i), #(t_n^j))

2. Average the contributions of each n-gram order:

v_N(k) = (1 / |N|) Σ_{n∈N} v_n(k)
Algorithm for Word Segmentation (Cont.)
Place boundary at all locations l such that either:
• l is a local maximum: vN (l) > vN (l − 1) and vN (l) > vN (l + 1)
• vN (l) ≥ t, a threshold parameter
[Figure: v_N(k) plotted against location k with threshold t; boundaries are placed at local maxima and wherever v_N(k) ≥ t, yielding a segmentation such as A B | C D | W X | Y | Z]
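The voting scheme above can be sketched in Python. This is a simplified, unoptimized reading of the algorithm, not Ando & Lee's implementation; the toy corpus and function names are illustrative:

```python
from collections import Counter

def ngram_counts(corpus, n):
    """Counts of all character n-grams in the corpus string."""
    return Counter(corpus[i:i + n] for i in range(len(corpus) - n + 1))

def vote(text, k, n, counts):
    """Fraction of affirmative answers to 'is #(s_i) >= #(t_j)?' at boundary k."""
    if k < n or k + n > len(text):
        return 0.0
    s = [text[k - n:k], text[k:k + n]]              # 2 non-straddling n-grams
    t = [text[k - j:k - j + n] for j in range(1, n)]  # n-1 straddling n-grams
    yes = sum(1 for si in s for tj in t if counts[si] >= counts[tj])
    return yes / (2 * (n - 1))                      # = v_n(k)

def v_N(text, k, orders, counts_by_n):
    """Average the per-order votes v_n(k) over the n-gram orders in N."""
    return sum(vote(text, k, n, counts_by_n[n]) for n in orders) / len(orders)

def segment(text, corpus, orders=(2, 3, 4), threshold=0.9):
    """Place a boundary at local maxima of v_N and wherever v_N >= threshold."""
    counts_by_n = {n: ngram_counts(corpus, n) for n in orders}
    v = [v_N(text, k, orders, counts_by_n) for k in range(len(text) + 1)]
    boundaries = []
    for k in range(1, len(text)):
        local_max = v[k] > v[k - 1] and v[k] > v[k + 1]
        if local_max or v[k] >= threshold:
            boundaries.append(k)
    return boundaries

# Degenerate toy input, just to exercise the machinery:
print(segment("aaaabbbb", "aaaabbbb", orders=(2,), threshold=0.9))
```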
Experimental Framework
• Corpus: 150 megabytes of 1993 Nikkei newswire
• Manual annotations: 50 sequences for development set (parameter tuning) and 50 sequences for test set
• Baseline algorithms: Chasen and Juman morphological analyzers (with lexicons of 115,000 and 231,000 words, respectively)
Evaluation Measures
tp — true positive
fp — false positive
tn — true negative
fn — false negative
System target not target
selected tp fp
not selected fn tn
Evaluation Measures (Cont.)
• Precision — the proportion of selected items that the system got right:

P = tp / (tp + fp)

• Recall — the proportion of target items that the system selected:

R = tp / (tp + fn)

• F-measure:

F = 2 ∗ P ∗ R / (P + R)
Word precision (P) is the percentage of proposed brackets that match
word-level brackets in the annotation;
Word recall (R) is the percentage of word-level brackets that are
proposed by the algorithm.
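The three measures are one-liners; the tp/fp/fn values below are made-up numbers, purely to exercise the formulas:

```python
def precision(tp, fp):
    """Fraction of selected items that were correct."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of target items that were selected."""
    return tp / (tp + fn)

def f_measure(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

p = precision(tp=8, fp=2)  # 0.8
r = recall(tp=8, fn=8)     # 0.5
f = f_measure(p, r)        # 8/13, about 0.615
```

Note that F sits between P and R but closer to the smaller of the two, which is why it penalizes systems that trade one measure away entirely.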
Conclusions
• Corpora widely used in text processing
• Corpora are used either annotated or raw
• Zipf’s law and its connection to natural language
• Sparsity is a major problem for corpus processing methods
Next time: Language modeling