

    Statistics of Text


    Outline

    Zipf distribution

    Vocabulary growth (Heaps' law)

    Communication Theory


    George Kingsley Zipf

    1902-1950

    American linguist and philologist

    Lectured at Harvard University


    Hans Peter Luhn

    1896-1964

    US information scientist

    KWIC (keyword-in-context) index, a kind of automatic indexing (1958, at IBM)

    Automatic construction of abstracts on the basis of the statistical occurrences of words


    Claude Shannon

    1916-2001

    Mathematician, electronic engineer, and cryptographer

    Father of information theory


    Zipf's Law

    A few words occur very often

    the 2 most frequent words can account for 10% of all occurrences; the top 6 words for 20%, the top 50 words for 50%

    Many words are infrequent

    Principle of Least Effort: it is easier to repeat words than to coin new ones

    Rank × frequency ≈ constant:

    p_r = (number of occurrences of the word of rank r) / N

    where N is the total number of word occurrences, so p_r is the probability that a word chosen at random from the text is the word of rank r

    For D unique words, the probabilities sum to 1: sum_{r=1..D} p_r = 1

    Zipf's law: r · p_r = A, with A ≈ 0.1
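    As a quick sanity check, here is a minimal Python sketch (not from the slides; the file name corpus.txt and the whitespace tokenization are assumptions) that ranks words by frequency and prints r · p_r, which should hover near a constant A ≈ 0.1 for English text:

```python
# Minimal Zipf check: rank words by frequency, compute p_r = freq_r / N,
# and see whether r * p_r stays roughly constant (A ~ 0.1 for English).
from collections import Counter

def zipf_table(text: str, top: int = 20) -> None:
    words = text.lower().split()              # crude whitespace tokenization
    n = len(words)                            # N: total word occurrences
    for rank, (word, freq) in enumerate(Counter(words).most_common(top), start=1):
        p_r = freq / n                        # probability of the rank-r word
        print(f"{rank:>4}  {word:<15} {freq:>8}  p_r={p_r:.4f}  r*p_r={rank * p_r:.3f}")

# corpus.txt is a hypothetical local text file
zipf_table(open("corpus.txt", encoding="utf-8").read())
```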


    Example of Frequent Words

    Frequencies from the 336,310 documents in the 1 GB TREC Volume 3 corpus: 125,720,891 total word occurrences; 508,209 unique words


    Example of Zipf

    Top 50 words from 423 short TIME magazine articles (243,836 word occurrences, lowercased, punctuation removed, 1.6 MB)


    Example of Zipf

    Top 50 words from 84,678 Associated Press 1989 articles (37,309,114 word occurrences, lowercased, punctuation removed, 266 MB)


    Consequence of Zipf

    A few words occur so often that they are poor differentiators or discriminators; in IR these are called stopwords

    A large number of words appear only once, which can skew statistical algorithms

    Words of medium frequency are mostly the descriptive words


    Zipf's Law and H. P. Luhn

    A hyperbolic curve relates f, the frequency of occurrence, to r, the rank order

    [Figure: hyperbolic frequency-rank curve, axes f (frequency) and r (rank)]


    Zipf's Law and H. P. Luhn

    Luhn (1958) states that both the most common and the most unusual words are not useful for indexing


    Predicting Occurrence Frequencies

    A word that occurs n times has rank r_n = AN/n (substituting p_r = n/N into r · p_r = A)

    Several words may occur n times; assume the rank given by r_n applies to the last of the words that occur n times

    then r_n words occur n times or more (ranks 1..r_n), and r_{n+1} words occur n+1 times or more

    The number of words that occur exactly n times is I_n = r_n − r_{n+1} = AN/n − AN/(n+1) = AN/(n(n+1))

    The lowest-frequency term occurs once and has the highest rank, D = AN/1 = AN

    The proportion of words with frequency n is I_n/D = 1/(n(n+1))

    The proportion of words occurring exactly once is 1/(1·2) = 1/2
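    These predictions are easy to test empirically. A sketch in the same spirit (again assuming a hypothetical file corpus.txt and whitespace tokenization) compares the predicted proportion 1/(n(n+1)) with the observed proportion of words occurring exactly n times:

```python
# Compare the Zipf-derived prediction I_n / D = 1 / (n(n+1)) with the
# observed proportion of words that occur exactly n times.
from collections import Counter

words = open("corpus.txt", encoding="utf-8").read().lower().split()
freq = Counter(words)                     # word -> number of occurrences
d = len(freq)                             # D: number of unique words
freq_of_freq = Counter(freq.values())     # n -> words occurring exactly n times

for n in range(1, 11):
    observed = freq_of_freq.get(n, 0) / d
    predicted = 1 / (n * (n + 1))         # I_n / D from the derivation above
    print(f"n={n:>2}  observed={observed:.3f}  predicted={predicted:.3f}")
```

    For n = 1 the prediction is 0.5, matching the claim above that about half the vocabulary consists of words that occur only once.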


    Does Real Data Fit Zipf's Law?

    Based on Foundations of Statistical NLP, Manning and Schütze, 2000

    Zipf's law is quite accurate except at very high and very low ranks


    Zipf Impacts on IR

    Positive: stopwords make up a large fraction of the text, so eliminating them greatly reduces the cost of inverted-index storage

    Negative: for most words, collecting enough data for statistically meaningful analysis (e.g., correlation analysis for query expansion) is very difficult, because those words are very rare
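    To put a rough number on the positive side, here is a small sketch (the ten-word stoplist and the file corpus.txt are illustrative assumptions, not a recommended stoplist) estimating what share of all token occurrences a handful of stopwords accounts for:

```python
# Estimate how much of the token stream disappears if a few
# stopwords are dropped from the inverted index.
from collections import Counter

STOPWORDS = {"the", "of", "to", "a", "and", "in", "said", "for", "that", "was"}

words = open("corpus.txt", encoding="utf-8").read().lower().split()
counts = Counter(words)
stop_tokens = sum(c for w, c in counts.items() if w in STOPWORDS)
print(f"{len(STOPWORDS)} stopword types account for "
      f"{stop_tokens / len(words):.1%} of all token occurrences")
```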


    Vocabulary Growth (Heaps' Law)

    How does the size of the overall vocabulary (the number of unique words) grow with the size of the corpus?

    The vocabulary has no upper bound, due to proper names, typos, etc.

    New words occur less frequently as the vocabulary grows

    If V is the size of the vocabulary and n is the length of the corpus in words:

    V = K · n^β, where K is a constant and 0 < β < 1
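    K and β can be estimated from any collection by linear regression in log-log space, since log V = log K + β log n. A sketch (same hypothetical corpus.txt; assumes the corpus holds at least a few thousand words so there are enough sample points):

```python
# Fit Heaps' law V = K * n^beta by least squares on (log n, log V) points.
import math

words = open("corpus.txt", encoding="utf-8").read().lower().split()
seen, points = set(), []
for i, w in enumerate(words, start=1):
    seen.add(w)
    if i % 1000 == 0:                     # sample (n, V) every 1000 tokens
        points.append((math.log(i), math.log(len(seen))))

m = len(points)
sx = sum(x for x, _ in points)
sy = sum(y for _, y in points)
sxx = sum(x * x for x, _ in points)
sxy = sum(x * y for x, y in points)
beta = (m * sxy - sx * sy) / (m * sxx - sx * sx)   # slope = beta
k = math.exp((sy - beta * sx) / m)                 # intercept = log K
print(f"V ≈ {k:.1f} * n^{beta:.2f}")               # beta typically in (0, 1)
```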


    Heaps' Law Data


    Information Theory

    Shannon studied the theoretical limits of data compression and transmission rate

    compression limits are given by the entropy (H)

    transmission limits are given by the channel capacity (C)

    A number of language tasks have been formulated as noisy channel problems, i.e., determine the most likely input given the noisy output:

    OCR

    speech recognition

    question answering

    machine translation


    Information Theory

    The information content of a message depends on the receiver's prior knowledge as well as on the message itself:

    how much of the receiver's uncertainty (entropy) it reduces

    how predictable the message is

    Information theory has been used for compression, term weighting, and evaluation measures
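    As a concrete illustration of H, a sketch (same assumed corpus.txt and whitespace tokenization) that computes the entropy of a text's unigram word distribution, i.e., the lower bound in bits per word for losslessly compressing the text under a memoryless word model:

```python
# Unigram entropy H = -sum(p * log2(p)) over the word distribution.
import math
from collections import Counter

words = open("corpus.txt", encoding="utf-8").read().lower().split()
n = len(words)
h = -sum((c / n) * math.log2(c / n) for c in Counter(words).values())
print(f"unigram entropy: {h:.2f} bits per word")
```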


    Does Zipf's Law Also Apply to the Indonesian Language?


    Zipf's Law and the Indonesian Language

    A study of 15 undergraduate (S1) theses from Fasilkom UI and the Department of Informatics, ITB (Nazief, 1995)

    The analysis covered only the main body of each thesis (the content chapters), excluding the preface, table of contents, and attachments

    Each document contains an average of 9,871.6 word occurrences, of which 1,431.13 are unique; on average, each word is used 6.69 times

    Across the collection as a whole, each word is used an average of 16.33 times


    Word Frequency Data

    50% of the words appear only once, matching the 1/2 proportion predicted from Zipf's law above

    Differences in word frequency between adjacent ranks are very large


    Assignment

    Find 50 tech-news articles and, for the most frequent words, report:

    the number of occurrences

    the percentage of the total, and

    p_r = (number of occurrences of the word of rank r) / N

    What is your conclusion about the consequences of Zipf's law for your document collection? (A starting-point sketch follows.)
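    A minimal starting point for the assignment (the directory name articles/ holding the 50 files as plain text is an assumption; adjust the tokenization as needed):

```python
# Count word frequencies across a directory of articles and report, for the
# top-ranked words: occurrences, percentage of the total, and p_r.
from collections import Counter
from pathlib import Path

counts = Counter()
for path in Path("articles").glob("*.txt"):   # articles/ is a hypothetical directory
    counts.update(path.read_text(encoding="utf-8").lower().split())

n = sum(counts.values())                      # N: total word occurrences
print(f"{'rank':>4}  {'word':<15} {'count':>8} {'% total':>8} {'p_r':>8}")
for rank, (word, c) in enumerate(counts.most_common(50), start=1):
    print(f"{rank:>4}  {word:<15} {c:>8} {100 * c / n:>7.2f}% {c / n:>8.4f}")
```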