08 IR Statistics of Text (1)
TRANSCRIPT
-
7/29/2019 08 IR Statistics of Text (1)
1/23
Statistics of Text
-
Outline
Zipf distribution
Vocabulary growth (Heaps' law)
Communication Theory
-
George Kingsley Zipf
1902-1950
American linguist and philologist
Lecturer at Harvard University
-
Hans Peter Luhn
1896-1964
US information scientist
KWIC (keyword-in-context) index, a kind of automatic indexing (1958 @ IBM)
Automatic construction of abstracts on the basis of statistical occurrences of words
-
Claude Shannon
1916 - 2001
Mathematician, electrical engineer, and cryptographer
Father of information theory
-
Zipf's Law
A few words occur very often
2 most frequent words can account for 10% of occurrences
top 6 words are 20%, top 50 words are 50%
Many words are infrequent
Principle of Least Effort
it is easier to repeat words than to coin new ones
Rank * frequency ~ constant
p_r = (number of occurrences of word of rank r) / N
where N is the total number of word occurrences, so p_r is the probability that a word chosen randomly from the text will be the word of rank r
For D unique words, the probabilities sum to 1: sum_{r=1..D} p_r = 1
Zipf's law: r * p_r = A, with A ~ 0.1
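The quantities above can be computed directly. A minimal sketch in Python; the tiny corpus here is a hypothetical stand-in, and actually observing r * p_r ~ 0.1 requires a large collection:

```python
from collections import Counter

def zipf_table(tokens, top=10):
    """Rank words by frequency; report p_r and the Zipf product r * p_r."""
    counts = Counter(tokens)
    n_total = sum(counts.values())  # N: total word occurrences
    return [(rank, word, freq, freq / n_total, rank * freq / n_total)
            for rank, (word, freq) in enumerate(counts.most_common(top), start=1)]

tokens = "the cat sat on the mat and the dog sat on the log".split()
for rank, word, freq, p_r, product in zipf_table(tokens, top=5):
    print(f"{rank:>2}  {word:<4} {freq:>2}  p_r={p_r:.3f}  r*p_r={product:.3f}")
```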
-
Example of Frequent Words
Frequencies from 336,310 documents in the 1 GB TREC Volume 3 corpus: 125,720,891 total word occurrences; 508,209 unique words
-
Example of Zipf
Top 50 words from 423 short TIME magazine articles (243,836 word occurrences, lowercased, punctuation removed, 1.6 MB)
-
Example of Zipf
Top 50 words from 84,678 Associated Press 1989 articles (37,309,114 word occurrences, lowercased, punctuation removed, 266 MB)
-
Consequence of Zipf
A few words appear so often that they are poor differentiators or discriminators; in IR these are called stopwords
A large number of words appear only once, which can mislead statistical algorithms
Words of medium frequency are mostly descriptive words
-
Zipf's Law and H. P. Luhn
[Figure: a hyperbolic curve relating f, the frequency of occurrence, to r, the rank order]
-
Zipf's Law and H. P. Luhn
Luhn (1958) states that both the most common and the most unusual words are not useful for indexing.
-
Predicting Occurrence Frequencies
Since r * p_r = A and p_r = n/N, a word that occurs n times has rank r_n = AN/n
Several words may occur n times; assume the rank r_n applies to the last of the words that occur n times
Then r_n words occur n times or more (ranks 1..r_n), and r_{n+1} words occur n+1 times or more
The number of words that occur exactly n times is
I_n = r_n - r_{n+1} = AN/n - AN/(n+1) = AN/(n(n+1))
The highest-ranking term occurs once (n = 1) and has rank D = AN/1 = AN
Proportion of words with frequency n: I_n/D = 1/(n(n+1))
Proportion of words occurring exactly once: 1/2
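This derived distribution can be checked numerically; a small sketch confirming that the proportions 1/(n(n+1)) telescope toward 1 and that n = 1 accounts for half the vocabulary:

```python
def proportion(n):
    """Predicted proportion of vocabulary words occurring exactly n times."""
    return 1.0 / (n * (n + 1))

assert proportion(1) == 0.5          # half the words occur exactly once
assert abs(proportion(2) - 1/6) < 1e-12

# The series telescopes: sum_{n=1..N} 1/(n(n+1)) = 1 - 1/(N+1)
total = sum(proportion(n) for n in range(1, 100_001))
print(total)   # approaches 1 as the upper limit grows
```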
-
Does Real Data Fit Zipf's Law?
Based on Foundations of Statistical NLP, Manning and Schütze, 2000
Zipf's law is quite accurate except at very high and very low ranks.
-
Zipf's Impact on IR
Positive: stopwords make up a large part of the text, so eliminating them greatly reduces the cost of inverted-index storage.
Negative: for most words, collecting enough data for statistically meaningful analysis (e.g., correlation analysis for query expansion) is very difficult because those words are very rare.
-
Vocabulary Growth (Heaps' Law)
How does the size of the overall vocabulary (number of unique words) grow with the size of the corpus?
Vocabulary has no upper bound due to proper names, typos, etc.
New words occur less frequently as the vocabulary grows
If V is the size of the vocabulary and n is the length of the corpus in words:
V = K * n^beta (0 < beta < 1)
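As an illustration, Heaps' law can be fitted to the TREC Volume 3 figures quoted earlier (125,720,891 occurrences, 508,209 unique words). The beta values tried below are assumptions for illustration; reported values for English text typically fall between roughly 0.4 and 0.6:

```python
def heaps_k(n, v, beta):
    """Solve V = K * n**beta for the constant K."""
    return v / n ** beta

n_tokens, vocab = 125_720_891, 508_209   # TREC Volume 3 figures from the earlier slide
for beta in (0.4, 0.5, 0.6):
    k = heaps_k(n_tokens, vocab, beta)
    doubled = k * (2 * n_tokens) ** beta  # predicted vocabulary if the corpus doubled
    print(f"beta={beta}: K={k:.1f}, predicted V at 2n ~ {doubled:,.0f}")
```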
-
Heaps' Law Data
-
Information Theory
Shannon studied theoretical limits for data compression and transmission rate
Compression limits are given by entropy (H)
Transmission limits are given by channel capacity (C)
A number of language tasks have been formulated as a noisy channel problem, i.e., determine the most likely input given the noisy output:
OCR
Speech recognition
Question answering
Machine translation
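A minimal sketch of the entropy quantity H mentioned above, computed for a few simple symbol distributions (the distributions themselves are illustrative assumptions):

```python
import math

def entropy(probs):
    """Shannon entropy H = -sum p * log2(p), in bits per symbol."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))    # 1.0: a fair coin carries one bit
print(entropy([0.25] * 4))    # 2.0: four equally likely symbols
print(entropy([0.9, 0.1]))    # under 1 bit: a predictable source compresses well
```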
-
Information Theory
The information content of a message depends on the receiver's prior knowledge as well as on the message itself
How much of the receiver's uncertainty (entropy) is reduced?
How predictable is the message?
Information theory has been used for compression, term weighting, and evaluation measures
-
Does Zipf's Law Also Apply to the Indonesian Language?
-
Zipf's Law in the Indonesian Language
Research using 15 undergraduate (S1) theses from Fasilkom UI and the Department of Informatics ITB (Nazief, 1995)
In this study, the analysis was performed only on the main part of each thesis (the main chapter content), not including the preface, table of contents, or attachments
Each document averages 9,871.6 words, of which 1,431.13 are unique; on average a word is used 6.69 times
For each document, a word is used an average of 16.33 times
-
Word Frequency Data
50% of the words appear only once.
Differences in word frequency across ranks are very large.
-
Assignment
Find 50 articles on tech news:
Count the most frequent words
Number of occurrences
Percentage of total, and
p_r = (number of occurrences of word of rank r) / N
What is your conclusion about the consequences of Zipf's law for your document collection?
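The counting step of this assignment can be sketched as follows. The directory layout (plain-text article files) and the tokenizer are assumptions; adapt them to however the 50 articles are collected:

```python
import re
from collections import Counter
from pathlib import Path

def zipf_report(paths, top=20):
    """Count word frequencies across files; return (rank, word, freq, p_r) rows."""
    counts = Counter()
    for path in paths:
        text = Path(path).read_text(encoding="utf-8")
        # lowercase and drop punctuation, as in the TIME/AP examples
        counts.update(re.findall(r"[a-z]+", text.lower()))
    n_total = sum(counts.values())
    return [(rank, word, freq, freq / n_total)
            for rank, (word, freq) in enumerate(counts.most_common(top), start=1)]

# Hypothetical usage, assuming the articles are saved as articles/*.txt:
# for rank, word, freq, p_r in zipf_report(sorted(Path("articles").glob("*.txt"))):
#     print(f"{rank:>4} {word:<12} {freq:>8} {100 * p_r:>6.2f}%  p_r = {p_r:.4f}")
```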