08 IR Statistics of Text (1)
TRANSCRIPT
-
7/29/2019 08 IR Statistics of Text (1)
1/23
Statistics of Text
-
Outline
Zipf distribution
Vocabulary growth (Heaps' law)
Communication Theory
-
George Kingsley Zipf
1902-1950
American linguist and philologist
Lecturer at Harvard University
-
Hans Peter Luhn
1896-1964
US information scientist
KWIC (keyword-in-context) index, a kind of automatic indexing (1958 @ IBM)
Automatic construction of abstracts on the basis of statistical occurrences of words
-
Claude Shannon
1916 - 2001
Mathematician, electrical engineer, and cryptographer
Father of information theory
-
Zipf's Law
A few words occur very often
2 most frequent words can account for 10% of occurrences
top 6 words are 20%, top 50 words are 50%
Many words are infrequent
Principle of Least Effort
it is easier to repeat words than to coin new ones
Rank * frequency ~ constant
p_r = (number of occurrences of word of rank r) / N
where N is the total number of word occurrences, so p_r is the probability that a word chosen randomly from the text will be the word of rank r
For D unique words, the probabilities sum to 1: sum_{r=1..D} p_r = 1
Zipf's law: r * p_r = A, with A ~ 0.1
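The quantities above can be computed directly. A minimal sketch in Python; the tiny corpus here is a hypothetical stand-in, and actually observing r * p_r ~ 0.1 requires a large collection:

```python
from collections import Counter

def zipf_table(tokens, top=10):
    """Rank words by frequency; report p_r and the Zipf product r * p_r."""
    counts = Counter(tokens)
    n_total = sum(counts.values())  # N: total word occurrences
    return [(rank, word, freq, freq / n_total, rank * freq / n_total)
            for rank, (word, freq) in enumerate(counts.most_common(top), start=1)]

tokens = "the cat sat on the mat and the dog sat on the log".split()
for rank, word, freq, p_r, product in zipf_table(tokens, top=5):
    print(f"{rank:>2}  {word:<4} {freq:>2}  p_r={p_r:.3f}  r*p_r={product:.3f}")
```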
-
Example of Frequent Words
Frequencies from 336,310 documents in the 1 GB TREC Volume 3 corpus: 125,720,891 total word occurrences; 508,209 unique words
-
Example of Zipf
Top 50 words from 423 short TIME magazine articles (243,836 word occurrences, lowercased, punctuation removed, 1.6 MB)
-
Example of Zipf
Top 50 words from 84,678 Associated Press 1989 articles (37,309,114 word occurrences, lowercased, punctuation removed, 266 MB)
-
Consequence of Zipf
A few words appear so often that they are poor differentiators or discriminators; in IR these are called stopwords
A large number of words appear only once, which can mislead statistical algorithms
Words of medium frequency are mostly descriptive words
-
Zipf's Law and H. P. Luhn
[Figure: a hyperbolic curve relating f, the frequency of occurrence, to r, the rank order]
-
Zipf's Law and H. P. Luhn
Luhn (1958) states that both the most common and the most unusual words are not useful for indexing.
-
Predicting Occurrence Frequencies
Since r * p_r = A and p_r = n/N, a word that occurs n times has rank r_n = AN/n
Several words may occur n times; assume the rank r_n applies to the last of the words that occur n times
Then r_n words occur n times or more (ranks 1..r_n), and r_{n+1} words occur n+1 times or more
The number of words that occur exactly n times is
I_n = r_n - r_{n+1} = AN/n - AN/(n+1) = AN/(n(n+1))
The highest-ranking term occurs once (n = 1) and has rank D = AN/1 = AN
Proportion of words with frequency n: I_n/D = 1/(n(n+1))
Proportion of words occurring exactly once: 1/2
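This derived distribution can be checked numerically; a small sketch confirming that the proportions 1/(n(n+1)) telescope toward 1 and that n = 1 accounts for half the vocabulary:

```python
def proportion(n):
    """Predicted proportion of vocabulary words occurring exactly n times."""
    return 1.0 / (n * (n + 1))

assert proportion(1) == 0.5          # half the words occur exactly once
assert abs(proportion(2) - 1/6) < 1e-12

# The series telescopes: sum_{n=1..N} 1/(n(n+1)) = 1 - 1/(N+1)
total = sum(proportion(n) for n in range(1, 100_001))
print(total)   # approaches 1 as the upper limit grows
```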
-
Does Real Data Fit Zipf's Law?
Based on Foundations of Statistical NLP, Manning and Schütze, 2000
Zipf's law is quite accurate except at very high and very low ranks.
-
Zipf's Impact on IR
Positive: stopwords make up a large part of the text, so eliminating them greatly reduces the cost of inverted-index storage.
Negative: for most words, collecting enough data for statistically meaningful analysis (e.g., correlation analysis for query expansion) is very difficult because those words are very rare.
-
Vocabulary Growth (Heaps' Law)
How does the size of the overall vocabulary (number of unique words) grow with the size of the corpus?
Vocabulary has no upper bound due to proper names, typos, etc.
New words occur less frequently as the vocabulary grows
If V is the size of the vocabulary and n is the length of the corpus in words:
V = K * n^beta (0 < beta < 1)
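As an illustration, Heaps' law can be fitted to the TREC Volume 3 figures quoted earlier (125,720,891 occurrences, 508,209 unique words). The beta values tried below are assumptions for illustration; reported values for English text typically fall between roughly 0.4 and 0.6:

```python
def heaps_k(n, v, beta):
    """Solve V = K * n**beta for the constant K."""
    return v / n ** beta

n_tokens, vocab = 125_720_891, 508_209   # TREC Volume 3 figures from the earlier slide
for beta in (0.4, 0.5, 0.6):
    k = heaps_k(n_tokens, vocab, beta)
    doubled = k * (2 * n_tokens) ** beta  # predicted vocabulary if the corpus doubled
    print(f"beta={beta}: K={k:.1f}, predicted V at 2n ~ {doubled:,.0f}")
```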
-
Heaps' Law Data
-
Information Theory
Shannon studied theoretical limits for data compression and transmission rate
Compression limits are given by entropy (H)
Transmission limits are given by channel capacity (C)
A number of language tasks have been formulated as a noisy channel problem, i.e., determine the most likely input given the noisy output:
OCR
Speech recognition
Question answering
Machine translation
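A minimal sketch of the entropy quantity H mentioned above, computed for a few simple symbol distributions (the distributions themselves are illustrative assumptions):

```python
import math

def entropy(probs):
    """Shannon entropy H = -sum p * log2(p), in bits per symbol."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))    # 1.0: a fair coin carries one bit
print(entropy([0.25] * 4))    # 2.0: four equally likely symbols
print(entropy([0.9, 0.1]))    # under 1 bit: a predictable source compresses well
```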
-
Information Theory
The information content of a message depends on the receiver's prior knowledge as well as on the message itself
How much of the receiver's uncertainty (entropy) is reduced?
How predictable is the message?
Information theory has been used for compression, term weighting, and evaluation measures
-
Does Zipf's Law Also Apply to the Indonesian Language?
-
Zipf's Law in the Indonesian Language
Research using 15 undergraduate (S1) theses from Fasilkom UI and the Department of Informatics ITB (Nazief, 1995)
In this study, the analysis was performed only on the main part of each thesis (the main chapter content), not including the preface, table of contents, or attachments
Each document averages 9,871.6 words, of which 1,431.13 are unique; on average a word is used 6.69 times
For each document, a word is used an average of 16.33 times
-
Word Frequency Data
50% of the words appear only once.
Differences in word frequency across ranks are very large.
-
Assignment
Find 50 articles on tech news:
Count the most frequent words
Number of occurrences
Percentage of total, and
p_r = (number of occurrences of word of rank r) / N
What is your conclusion about the consequences of Zipf's law for your document collection?
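The counting step of this assignment can be sketched as follows. The directory layout (plain-text article files) and the tokenizer are assumptions; adapt them to however the 50 articles are collected:

```python
import re
from collections import Counter
from pathlib import Path

def zipf_report(paths, top=20):
    """Count word frequencies across files; return (rank, word, freq, p_r) rows."""
    counts = Counter()
    for path in paths:
        text = Path(path).read_text(encoding="utf-8")
        # lowercase and drop punctuation, as in the TIME/AP examples
        counts.update(re.findall(r"[a-z]+", text.lower()))
    n_total = sum(counts.values())
    return [(rank, word, freq, freq / n_total)
            for rank, (word, freq) in enumerate(counts.most_common(top), start=1)]

# Hypothetical usage, assuming the articles are saved as articles/*.txt:
# for rank, word, freq, p_r in zipf_report(sorted(Path("articles").glob("*.txt"))):
#     print(f"{rank:>4} {word:<12} {freq:>8} {100 * p_r:>6.2f}%  p_r = {p_r:.4f}")
```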