C.Watters csci6403 1 Term Frequency and IR

Upload: stephanie-mildred-boone

Post on 05-Jan-2016



What is a good index term

• Occurs only in some documents
• ???
  – Too often: get whole doc set
  – Too seldom: get too few matches
• Need to know the distribution of terms!
• Goal: get rid of “poor” index terms
• Why
  – Faster to process
  – Smaller indices
  – Better results


Look at

• Index term extraction

• Term distribution

• Growth of vocabulary

• Collocation of terms


Important Words?

Enron Ruling Leaves Corporate Advisers Open to Lawsuits

By KURT EICHENWALD

A ruling last week by a federal judge in Houston may well have accomplished what a year's worth of reform by lawmakers and regulators has failed to achieve: preventing the circumstances that led to Enron's stunning collapse from happening again.

To casual observers, Friday's decision by the judge, Melinda F. Harmon, may seem innocuous and not surprising. In it, she held that banks, law firms and investment houses — many of them criticized on Capitol Hill for helping Enron construct off-the-books partnerships that led to its implosion — could be sued by investors who are seeking to regain billions of dollars they lost in

the debacle.


Index term Preprocessing

• Lexical normalization (get terms)

• Stop lists (get rid of terms)

• Stemming (collapse terms)

• Thesaurus or categorization construction (replace terms)


Lexical Normalization

• Stream of characters → index terms
• Problems??
• Numbers – good index terms?
• Hyphens – online / on line / on-line
• Punctuation – remove?
• Letter case?

• Treat the query terms the same way
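The normalization questions above can be made concrete with a small sketch. This is one possible policy and not the one from the lecture: the function name `normalize`, the `keep_numbers` flag, and the choices to fold case, join hyphenated words, and drop punctuation are all assumptions made for illustration.

```python
import re

def normalize(text, keep_numbers=True):
    """Turn a stream of characters into a list of index terms (illustrative policy)."""
    text = text.lower()                          # fold letter case
    text = re.sub(r"(?<=\w)-(?=\w)", "", text)   # join hyphenated words: on-line -> online
    tokens = re.findall(r"[a-z0-9]+", text)      # drop punctuation, keep letter/digit runs
    if not keep_numbers:
        tokens = [t for t in tokens if not t.isdigit()]
    return tokens

# The same function is applied to documents and to queries.
print(normalize("On-line banking, 2003!"))
print(normalize("online BANKING", keep_numbers=False))
```

Under this policy "on-line" and "online" collapse to one term, but "on line" still yields two terms, so the hyphen problem is only partly solved.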


Stop Lists

• 10 most frequent words => 20% occurrences

• Standard list of 28 words => 30%

• Look at 10 most frequent words in applet 1
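A rough way to check the coverage figures above on any tokenized text is sketched below. The helper names `top_k_share` and `remove_stop_words` and the toy sentence are assumptions for illustration; the 20% and 30% figures come from the slide, not from this code.

```python
from collections import Counter

def top_k_share(tokens, k=10):
    """Return the k most frequent terms and the fraction of occurrences they cover."""
    if not tokens:
        return [], 0.0
    top = Counter(tokens).most_common(k)
    covered = sum(count for _, count in top)
    return [term for term, _ in top], covered / len(tokens)

def remove_stop_words(tokens, stop_words):
    """Drop every token that appears in the stop list."""
    return [t for t in tokens if t not in stop_words]

tokens = "the cat and the dog and the bird saw a cat".split()
terms, share = top_k_share(tokens, k=2)
print(terms, round(share, 3))
print(remove_stop_words(tokens, {"the", "and", "a"}))
```

On a real collection, `top_k_share(tokens, 10)` gives the share of occurrences that a 10-word stop list would remove from the index.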


Stemming

• Plurals: car/cars

• Variants: react/reaction/reacted/reacting

• Category based: adheres/adhesion/adhesive

• Errors
  – Understemming: division/divide
  – Overstemming: experiment/experience

• Divine/divide
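The stemming cases above can be tried with a deliberately naive suffix stripper. This is a toy sketch, not a real stemming algorithm such as Porter's: the suffix list `SUFFIXES`, the function `naive_stem`, and the minimum stem length are assumptions chosen only to reproduce the slide's examples.

```python
SUFFIXES = ("ing", "ion", "ed", "s")  # illustrative only, checked in this order

def naive_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

groups = (
    ["car", "cars"],
    ["react", "reaction", "reacted", "reacting"],
    ["division", "divide"],
)
for group in groups:
    # One stem per group means the variants were collapsed successfully.
    print(group, "->", sorted({naive_stem(w) for w in group}))
```

The first two groups collapse to a single stem each, while division/divide stay apart, which is the understemming error named on the slide.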


Thesaurus

• Control the vocabulary

• Automobile (car, suv, sedan, convertible, van, roadster, …)

• Problems?


What terms make good index terms?

• Resolving power or selection power?

• Most frequent?

• Least frequent?

• In between?

• Why not use all of them?


Resolving Power


2. Distribution of Terms in Text

• What terms occur very frequently

• What terms occur only once or twice

• What is the general distribution of terms in a document set


Time magazine sample: 243,836 word occurrences

word          freq      r     pr (%)   A = r x pr
the           15,659    1     6.422    0.064
of             7,179    2     2.944    0.058
to             6,287    3     2.578    0.077
a              5,830    4     2.391    0.095
and            5,580    5     2.288    0.114
week             626    38    0.257    0.097
government       582    39    0.239    0.093
when             577    40    0.237    0.095
will             488    50    0.200    0.100

(pr is shown as a percentage of all occurrences; A uses pr as a fraction.)


Zipf’s Relationship

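• Frequency of the ith most frequent term is inversely related to its rank i, scaled by the frequency of the most frequent word

• $f_i = f_1 / i^{\theta}$

• where $\theta$ depends on the text (~1-2)

• Rank x Frequency = constant
• constant ~ 0.1 (with frequency taken as a fraction of all occurrences)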


Principle of Least effort

• Describe the weather today

• Easier to use the same words!


Word frequency & vocab growth

[Figure not reproduced. Surviving axis labels: rank, FD, corpus size.]


Zipf’s Law

• A few words occur a lot
  – Top 10 words account for about 20% of occurrences

• A lot of words occur rarely
  – Almost half of the terms occur only once


Actual Zipf’s Law

• Rank x frequency = constant

• Frequency, $p_r$, is the probability that a word taken at random from N occurrences will have rank r

• Given D unique words, $\sum_{r=1}^{D} p_r = 1$

• $r \times p_r = A$

• $A \approx 0.1$
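The relation r x pr = A can be checked directly against the Time magazine sample. The sketch below recomputes pr and A from the raw counts; the function name `zipf_constant` is an assumption, and the count for "the" is taken to be 15,659 on the assumption that the "15.659" in the transcript is a typo.

```python
N = 243_836  # total word occurrences in the Time magazine sample

# (word, frequency, rank) as given in the sample table
ROWS = [
    ("the", 15_659, 1),   # assumed 15,659; the transcript prints "15.659"
    ("of", 7_179, 2),
    ("to", 6_287, 3),
    ("a", 5_830, 4),
    ("and", 5_580, 5),
    ("week", 626, 38),
    ("government", 582, 39),
    ("when", 577, 40),
    ("will", 488, 50),
]

def zipf_constant(freq, rank, total):
    """Return (p_r, A) where p_r = freq / total and A = rank * p_r."""
    p_r = freq / total
    return p_r, rank * p_r

for word, freq, rank in ROWS:
    p_r, a = zipf_constant(freq, rank, N)
    print(f"{word:<12}{freq:>7}{rank:>4}{100 * p_r:>8.3f}{a:>7.3f}")
```

The computed A values stay between roughly 0.06 and 0.11 across these ranks, which is the sense in which A is "about 0.1".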




Using Zipf to predict frequencies

• $r \times p_r = A$
• A word occurring n times has rank $r_n$ and $p_r = n/N$, so
• $r_n = AN/n$
• But several words may occur n times
• We say $r_n$ refers to the last word that occurs n times
• So $r_n$ words occur n or more times
• Number of unique terms, D, is the highest rank, reached at n = 1: $D = AN/1 = AN$
• Number of words occurring exactly n times, $I_n$:
• $I_n = r_n - r_{n+1} = AN/n - AN/(n+1) = AN/(n(n+1))$
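The predictions above can be sketched as three small functions. The function names are assumptions; the values A = 0.1 and N = 243,836 are borrowed from the Time magazine sample purely as an example input.

```python
def rank_of_count(n, A, N):
    """r_n = A*N/n: rank of the last word that occurs n times."""
    return A * N / n

def vocabulary_size(A, N):
    """D = A*N: the highest rank, reached by words that occur once."""
    return rank_of_count(1, A, N)

def words_with_count(n, A, N):
    """I_n = r_n - r_(n+1) = A*N / (n*(n+1)): words occurring exactly n times."""
    return A * N / (n * (n + 1))

A, N = 0.1, 243_836
D = vocabulary_size(A, N)
for n in (1, 2, 3):
    share = words_with_count(n, A, N) / D
    print(f"n={n}: predicted share of vocabulary = {share:.3f}")
```

The share of the vocabulary occurring exactly n times is 1/(n(n+1)) regardless of A and N, so the model predicts that half of all distinct terms occur once, in line with the "almost half" figure on the Zipf's Law slide.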


Zipf and Power Law

• Power law uses $y = kx^{c}$

• Zipf is a power law with c = -1

• $r = (AN)\,n^{-1}$

• On log-log plot expect straight line with slope = c

• So how does our Reuters data do?
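One way to test a collection against the power law is to fit a straight line to log frequency against log rank and read off the slope. The Reuters data is not part of this transcript, so the sketch below uses synthetic counts that follow Zipf exactly; the function name `loglog_slope` and the least-squares fit are assumptions about how one might do the check.

```python
import math

def loglog_slope(ranks, freqs):
    """Least-squares slope of log(freq) against log(rank)."""
    xs = [math.log(r) for r in ranks]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Synthetic data: frequency exactly proportional to 1/rank.
ranks = list(range(1, 51))
freqs = [1000 / r for r in ranks]
print(round(loglog_slope(ranks, freqs), 6))
```

For real text the fitted slope is only approximately -1, and the fit is usually worst at the very highest and very lowest ranks.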


Zipf log-log curve

[Figure not reproduced: log frequency plotted against log rank, with slope = c.]


2. Vocabulary Growth

• How quickly does the vocabulary grow as the size of the data corpus grows?

• Upper bound?

• Rate of increase of new words?

• Can be derived from Zipf’s law


Calculation

• Given n term occurrences in corpus (n is the size of the corpus in words)

• $D = kn^{b}$

• Where 0 < b < 1, typically between 0.4 and 0.6; k usually between 10 and 100
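The growth formula is easy to explore numerically. In the sketch below the function names and the sample values k = 50 and b = 0.5 are assumptions picked from inside the ranges quoted above, not measurements from any collection.

```python
import math

def vocabulary_growth(n, k, b):
    """D = k * n**b: predicted number of unique terms after n word occurrences."""
    return k * n ** b

def estimate_b(n1, d1, n2, d2):
    """Estimate b from two (corpus size, vocabulary size) measurements."""
    return math.log(d2 / d1) / math.log(n2 / n1)

k, b = 50, 0.5  # assumed values inside the typical ranges
for n in (10_000, 1_000_000, 100_000_000):
    print(f"n={n:>11,}  D={vocabulary_growth(n, k, b):>10,.0f}")

print(estimate_b(10_000, 5_000, 1_000_000, 50_000))
```

With b = 0.5, a corpus 100 times larger has only 10 times the vocabulary, so the vocabulary keeps growing but ever more slowly.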


3. Collocation of Terms

• Bag-of-words indexing is based on term independence

• Why do we do this?

• Should we do this?

• What could we do if we kept collocation?


What is collocation

• Next to
  – Tape deck
  – Day pass

• Ordered
  – Pass day
  – Deck tape

• Adjacency


What data do you need to use collocation?

• Word position

• Relative to?

• What about updates?
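One common answer to the questions above is a positional index that stores, for each term, the word offsets at which it occurs in each document. The sketch below assumes positions are counted from the start of the document (offset 0); the function name `build_positional_index` and the toy documents are illustrative.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map term -> {doc_id: [word offsets]} for docs given as {doc_id: [tokens]}."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for position, term in enumerate(tokens):
            index[term].setdefault(doc_id, []).append(position)
    return index

docs = {
    1: "tape deck and tape".split(),
    2: "deck tape".split(),
}
index = build_positional_index(docs)
print(index["tape"])
print(index["deck"])
```

Offsets relative to the start of the document mean that editing a document shifts every later position in it, which is one reason updates are costly for this kind of index.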


Queries and Collocation

• “Information retrieval”

• Information (±2) retrieval

• ??
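The two query styles above can be sketched over a tokenized document. The sketch assumes that the quoted form means ordered adjacency and that the second form means "within 2 word positions in either direction"; the function names are hypothetical and the code scans token lists directly, without an index.

```python
def positions(tokens, term):
    """Word offsets at which term occurs."""
    return [i for i, t in enumerate(tokens) if t == term]

def adjacent_in_order(tokens, first, second):
    """True if `second` appears immediately after `first` (phrase query)."""
    second_positions = set(positions(tokens, second))
    return any(p + 1 in second_positions for p in positions(tokens, first))

def within_window(tokens, a, b, k):
    """True if a and b occur within k positions of each other, in either order."""
    positions_b = positions(tokens, b)
    return any(
        pa != pb and abs(pa - pb) <= k
        for pa in positions(tokens, a)
        for pb in positions_b
    )

doc1 = "information retrieval systems".split()
doc2 = "retrieval of information".split()
for doc in (doc1, doc2):
    print(
        adjacent_in_order(doc, "information", "retrieval"),
        within_window(doc, "information", "retrieval", 2),
    )
```

The phrase query matches only the first document, while the window query matches both, so the looser form trades precision for recall.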


Summary

• We can use general knowledge about term distribution to
  – Design more efficient systems
  – Choose effective indexing terms
  – Map queries to document indexes

• Now what??
  – Using keywords in IR systems
  – Most common IR models