content analysis and statistical properties of text

9/11/2000 Information Organization and Retrieval Content Analysis and Statistical Properties of Text Ray Larson & Marti Hearst University of California, Berkeley School of Information Management and Systems SIMS 202: Information Organization and Retrieval

Page 1: Content Analysis and Statistical Properties of Text

9/11/2000 Information Organization and Retrieval

Content Analysis and Statistical Properties of Text

Ray Larson & Marti Hearst

University of California, Berkeley

School of Information Management and Systems

SIMS 202: Information Organization and Retrieval

Page 2: Content Analysis and Statistical Properties of Text

Today

• Overview of Content Analysis

• Text Representation

• Statistical Characteristics of Text Collections

• Zipf distribution

• Statistical dependence

Page 3: Content Analysis and Statistical Properties of Text

Content Analysis

• Automated transformation of raw text into a form that represents some aspect(s) of its meaning

• Including, but not limited to:

– Automated Thesaurus Generation

– Phrase Detection

– Categorization

– Clustering

– Summarization

Page 4: Content Analysis and Statistical Properties of Text

Techniques for Content Analysis

• Statistical

– Single Document

– Full Collection

• Linguistic

– Syntactic

– Semantic

– Pragmatic

• Knowledge-Based (Artificial Intelligence)

• Hybrid (Combinations)

Page 5: Content Analysis and Statistical Properties of Text

Text Processing

• Standard Steps:

– Recognize document structure

  • titles, sections, paragraphs, etc.

– Break into tokens

  • usually delimited by spaces and punctuation

  • special issues with Asian languages

– Stemming/morphological analysis

– Store in inverted index (to be discussed later)
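
The "break into tokens" step can be sketched roughly as below. This is only an illustration for English-like text where spaces and punctuation separate tokens; the function name `tokenize`, the regular expression, and the choice to lowercase are assumptions, not something the lecture prescribes.

```python
import re

def tokenize(text: str) -> list[str]:
    """Split text into lowercase tokens, treating spaces and punctuation as delimiters."""
    # \w+ matches runs of letters, digits, and underscores; everything else separates tokens.
    return re.findall(r"\w+", text.lower())

print(tokenize("Now is the time, for all good men."))
```

Languages written without spaces between words need a different segmentation step, which is the "special issues" point above.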

Page 6: Content Analysis and Statistical Properties of Text

[Figure: retrieval system diagram with the components Information need, Query, Parse, text input, Pre-process, Index, Collections, and Rank, annotated “How is the query constructed?” and “How is the text processed?”]

Page 7: Content Analysis and Statistical Properties of Text

Document Processing Steps

Page 8: Content Analysis and Statistical Properties of Text

Stemming and Morphological Analysis

• Goal: “normalize” similar words

• Morphology (“form” of words)

– Inflectional Morphology

  • E.g., inflect verb endings and noun number

  • Never change grammatical class

    – dog, dogs

    – tengo, tienes, tiene, tenemos, tienen

– Derivational Morphology

  • Derive one word from another

  • Often change grammatical class

    – build, building; health, healthy

Page 9: Content Analysis and Statistical Properties of Text

Automated Methods

• Powerful multilingual tools exist for morphological analysis

– PCKimmo, Xerox Lexical technology

– Require a grammar and dictionary

– Use “two-level” automata

• Stemmers:

– Very dumb rules work well (for English)

– Porter Stemmer: Iteratively remove suffixes

– Improvement: pass results through a lexicon
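
To make "iteratively remove suffixes" concrete, here is a toy suffix stripper. It is not the Porter algorithm, which uses a much larger, condition-guarded rule set; the suffix list, the `min_stem` threshold, and the function name are all hypothetical choices for illustration.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # hypothetical rule list, far smaller than Porter's

def strip_suffixes(word: str, min_stem: int = 3) -> str:
    """Repeatedly remove a known suffix while the remaining stem stays long enough."""
    changed = True
    while changed:
        changed = False
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
                word = word[: -len(suffix)]
                changed = True
                break  # restart the scan on the shortened word
    return word

for w in ("dogs", "buildings", "searched", "is"):
    print(w, "->", strip_suffixes(w))
```

Rules this crude conflate unrelated words as easily as related ones, which is why passing the results through a lexicon is listed as an improvement.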

Page 10: Content Analysis and Statistical Properties of Text

Errors Generated by Porter Stemmer (Krovetz 93)

Too Aggressive          Too Timid
organization/organ      european/europe
policy/police           cylinder/cylindrical
execute/executive       create/creation
arm/army                search/searcher

Page 11: Content Analysis and Statistical Properties of Text

Statistical Properties of Text

• Token occurrences in text are not uniformly distributed

• They are also not normally distributed

• They do exhibit a Zipf distribution

Page 12: Content Analysis and Statistical Properties of Text

A More Standard Collection

Freq   Token
8164   the
4771   of
4005   to
2834   a
2827   and
2802   in
1592   The
1370   for
1326   is
1324   s
1194   that
973    by
969    on
915    FT
883    Mr
860    was
855    be
849    Pounds
798    TEXT
798    PUB
798    PROFILE
798    PAGE
798    HEADLINE
798    DOCNO
…
1      ABC
1      ABFT
1      ABOUT
1      ACFT
1      ACI
1      ACQUI
1      ACQUISITIONS
1      ACSIS
1      ADFT
1      ADVISERS
1      AE

Government documents, 157734 tokens, 32259 unique

Page 13: Content Analysis and Statistical Properties of Text

Plotting Word Frequency by Rank

• Main idea: count

– How many times tokens occur in the text

  • Over all texts in the collection

• Now sort the tokens by how often they occur, most frequent first. A token's position in this ordering is called its rank.
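
The count-then-rank procedure can be sketched in a few lines. This is a minimal illustration using the standard library's `Counter`; the function name `rank_by_frequency` is an assumption, and ties are simply given consecutive ranks.

```python
from collections import Counter

def rank_by_frequency(tokens):
    """Return (rank, frequency, token) triples, most frequent token first."""
    counts = Counter(tokens)
    # most_common() sorts by descending count; equal counts keep first-seen order.
    return [(rank, freq, tok)
            for rank, (tok, freq) in enumerate(counts.most_common(), start=1)]

for row in rank_by_frequency("to be or not to be".split()):
    print(row)
```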

Page 14: Content Analysis and Statistical Properties of Text

Most and Least Frequent Terms

Rank   Freq   Term
1      37     system
2      32     knowledg
3      24     base
4      20     problem
5      18     abstract
6      15     model
7      15     languag
8      15     implem
9      13     reason
10     13     inform
11     11     expert
12     11     analysi
13     10     rule
14     10     program
15     10     oper
16     10     evalu
17     10     comput
18     10     case
19     9      gener
20     9      form
…
150    2      enhanc
151    2      energi
152    2      emphasi
153    2      detect
154    2      desir
155    2      date
156    2      critic
157    2      content
158    2      consider
159    2      concern
160    2      compon
161    2      compar
162    2      commerci
163    2      clause
164    2      aspect
165    2      area
166    2      aim
167    2      affect

Page 15: Content Analysis and Statistical Properties of Text

The Corresponding Zipf Curve

[Figure: plot of term frequency against rank; the slide repeats the rank/frequency list for ranks 1 to 20 from the previous slide]

Page 16: Content Analysis and Statistical Properties of Text

Zoom in on the Knee of the Curve

Rank   Freq   Term
43     6      approach
44     5      work
45     5      variabl
46     5      theori
47     5      specif
48     5      softwar
49     5      requir
50     5      potenti
51     5      method
52     5      mean
53     5      inher
54     5      data
55     5      commit
56     5      applic
57     4      tool
58     4      technolog
59     4      techniqu

Page 17: Content Analysis and Statistical Properties of Text

Zipf Distribution

• The Important Points:

– a few elements occur very frequently

– a medium number of elements have medium frequency

– many elements occur very infrequently

Page 18: Content Analysis and Statistical Properties of Text

Zipf Distribution

• The product of the frequency of words (f) and their rank (r) is approximately constant

– Rank = order of words’ frequency of occurrence

$$ f = C \cdot \frac{1}{r}, \qquad C \cong N/10 $$

• Another way to state this is with an approximately correct rule of thumb:

– Say the most common term occurs C times

– The second most common occurs C/2 times

– The third most common occurs C/3 times

– …
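
The rule of thumb can be turned into a tiny calculation. The sketch below only evaluates the idealized relation f = C / r; the function name `zipf_predicted` and the value C = 1200 are made up for illustration, and real counts follow the curve only approximately.

```python
def zipf_predicted(c: float, max_rank: int) -> list[float]:
    """Idealized Zipf frequencies C / r for ranks 1..max_rank."""
    return [c / r for r in range(1, max_rank + 1)]

# With a hypothetical C of 1200, rank times frequency stays at C for every rank.
for rank, freq in enumerate(zipf_predicted(1200, 4), start=1):
    print(rank, freq, rank * freq)
```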

Page 19: Content Analysis and Statistical Properties of Text

Zipf Distribution (linear and log scale)

[Figure: the Zipf curve plotted on a linear scale and on a log scale]

Page 20: Content Analysis and Statistical Properties of Text

What Kinds of Data Exhibit a Zipf Distribution?

• Words in a text collection

– Virtually any language usage

• Library book checkout patterns

• Incoming Web Page Requests (Nielsen)

• Outgoing Web Page Requests (Cunha & Crovella)

• Document Size on Web (Cunha & Crovella)

Page 21: Content Analysis and Statistical Properties of Text

Related Distributions/“Laws”

• Bradford’s Law of Scattering

• Lotka’s Law of Productivity

• De Solla Price’s Urn Model for “Cumulative Advantage Processes”

[Figure: urn-model illustration in which the proportion goes from ½ = 50% to 2/3 = 66% to ¾ = 75%, with a “Pick” and “Replace +1” step between each stage]
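
The pick-and-replace-plus-one step in the urn illustration can be simulated as follows. This is a loose sketch of a cumulative-advantage process, assuming an urn that starts with one ball of each of two colors; the function name, the colors, and the seed are illustrative choices, not details from the lecture.

```python
import random
from collections import Counter

def simulate_urn(steps: int, seed: int = 0) -> Counter:
    """Pick a ball at random, put it back, and add one more ball of the same color."""
    rng = random.Random(seed)   # seeded so the run is repeatable
    urn = ["red", "black"]      # assumed starting state: one ball of each color
    for _ in range(steps):
        picked = rng.choice(urn)
        urn.append(picked)      # "Replace +1": the picked color gains a ball
    return Counter(urn)

print(simulate_urn(1000))
```

Because every pick makes the same color more likely to be picked again, an early lead tends to persist, which is the "success breeds success" idea behind the model.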

Page 22: Content Analysis and Statistical Properties of Text

Very frequent word stems (Cha-Cha Web Index)

WORD         FREQ
u            63245
ha           65470
california   67251
m            67903
1998         68662
system       69345
t            70014
about        70923
servic       71822
work         71958
home         72131
other        72726
research     74264
1997         75323
can          76762
next         77973
your         78489
all          79993
public       81427
us           82551
c            83250
www          87029
wa           92384
program      95260
not          100204
http         100696
d            101034
html         103698
student      104635
univers      105183
inform       106463
will         109700
new          115937
have         119428
page         128702
messag       141542
from         147440
you          162499
edu          167298
be           185162
publib       189334
librari      189347
i            190635
lib          223851
that         227311
s            234467
berkelei     245406
re           272123
web          280966
archiv       305834

Page 23: Content Analysis and Statistical Properties of Text

Words that occur few times (Cha-Cha Web Index)

WORD               FREQ
agendaaugust       1
anelectronic       1
centerjanuary      1
packardequipment   1
systemjuly         1
systemscs186       1
todaymcb           1
workshopsfinding   1
workshopsthe       1
lollini            1
0+                 1
0                  1
00summary          1
35816              1
35823              1
01d                1
35830              1
35837              1
02-156-10          1
35844              1
35851              1
02aframst          1
311                1
313                1
03agenvchm         1
401                1
408                1
408                1
422                1
424                1
429                1
04agrcecon         1
04cklist           1
05-128-10          1
501                1
506                1
05amstud           1
06anhist           1
07-149             1
07-800-80          1
07anthro           1
08apst             1

Page 24: Content Analysis and Statistical Properties of Text

Consequences of Zipf

• There are always a few very frequent tokens that are not good discriminators.

– Called “stop words” in IR

– Usually correspond to linguistic notion of “closed-class” words

  • English examples: to, from, on, and, the, ...

  • Grammatical classes that don’t take on new members.

• There are always a large number of tokens that occur once and can mess up algorithms.

• Medium frequency words most descriptive

Page 25: Content Analysis and Statistical Properties of Text

Word Frequency vs. Resolving Power (from van Rijsbergen 79)

The most frequent words are not the most descriptive.

Page 26: Content Analysis and Statistical Properties of Text

Statistical Independence vs. Statistical Dependence

• How likely is a red car to drive by given we’ve seen a black one?

• How likely is the word “ambulance” to appear, given that we’ve seen “car accident”?

• Colors of cars driving by are independent (although more frequent colors are more likely)

• Words in text are not independent (although again more frequent words are more likely)

Page 27: Content Analysis and Statistical Properties of Text

Statistical Independence

Two events x and y are statistically independent if the product of the probabilities of each happening individually equals the probability of both happening together.

$$ P(x)\,P(y) = P(x, y) $$
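
The definition can be checked by brute force on a small sample space. The sketch below assumes two fair coin tosses with equally likely outcomes; the events and helper names are invented for illustration.

```python
from fractions import Fraction
from itertools import product

# Sample space: two fair coin tosses, every outcome equally likely.
outcomes = list(product("HT", repeat=2))

def prob(event) -> Fraction:
    """Probability of an event as (matching outcomes) / (all outcomes)."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

first_heads = lambda o: o[0] == "H"
second_heads = lambda o: o[1] == "H"
any_heads = lambda o: "H" in o

# Independent pair: P(x) P(y) equals P(x, y).
print(prob(first_heads) * prob(second_heads)
      == prob(lambda o: first_heads(o) and second_heads(o)))
# Dependent pair: knowing the first toss is heads changes the chance of "any heads".
print(prob(first_heads) * prob(any_heads)
      == prob(lambda o: first_heads(o) and any_heads(o)))
```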

Page 28: Content Analysis and Statistical Properties of Text

Statistical Independence and Dependence

• What are examples of things that are statistically independent?

• What are examples of things that are statistically dependent?

Page 29: Content Analysis and Statistical Properties of Text

Lexical Associations

• Subjects write first word that comes to mind

– doctor/nurse; black/white (Palermo & Jenkins 64)

• Text Corpora yield similar associations

• One measure: Mutual Information (Church and Hanks 89)

$$ I(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)} $$

• If word occurrences were independent, the numerator and denominator would be equal (if measured across a large collection)

Page 30: Content Analysis and Statistical Properties of Text

Statistical Independence

• Compute for a window of words

$$ P(x)\,P(y) = P(x, y) \quad \text{if independent.} $$

$$ P(x) = f(x)/N $$

We'll approximate P(x, y) as follows:

$$ P(x, y) = \frac{1}{N} \sum_{i=1}^{N-|w|} w_i(x, y) $$

|w| = length of window w (say 5)

w_i = words within window starting at position i

w(x, y) = number of times x and y co-occur in w

N = number of words in collection

[Figure: the letter sequence a b c d e f g h i j k l m n o p with windows labeled w1, w11, and w21 marked over it]
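
A sliding-window estimate of P(x, y) might look like the sketch below. How to count a co-occurrence inside one window is left open by the slide, so this code adopts one possible reading as an explicit assumption: a window scores 1 when it starts with x and contains y among the following |w| - 1 tokens. The function name is also hypothetical, and the loop visits every full window in the text.

```python
from typing import Sequence

def cooccurrence_probability(tokens: Sequence[str], x: str, y: str, window: int = 5) -> float:
    """Estimate P(x, y) as (windows starting with x that also contain y) / N."""
    n = len(tokens)
    if n == 0:
        return 0.0
    hits = 0
    for i in range(n - window + 1):       # every window that fits inside the text
        w_i = tokens[i:i + window]        # words within the window starting at position i
        if w_i[0] == x and y in w_i[1:]:  # assumed definition of w_i(x, y)
            hits += 1
    return hits / n

tokens = "the doctor and the nurse saw the doctor".split()
print(cooccurrence_probability(tokens, "doctor", "nurse"))
```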

Page 31: Content Analysis and Statistical Properties of Text

Interesting Associations with “Doctor” (AP Corpus, N=15 million, Church & Hanks 89)

I(x,y)   f(x,y)   f(x)    x          f(y)    y
11.3     12       111     Honorary   621     Doctor
11.3     8        1105    Doctors    44      Dentists
10.7     30       1105    Doctors    241     Nurses
9.4      8        1105    Doctors    154     Treating
9.0      6        275     Examined   621     Doctor
8.9      11       1105    Doctors    317     Treat
8.7      25       621     Doctor     1407    Bills

Page 32: Content Analysis and Statistical Properties of Text

Un-Interesting Associations with “Doctor” (AP Corpus, N=15 million, Church & Hanks 89)

I(x,y)   f(x,y)   f(x)     x        f(y)    y
0.96     6        621      doctor   73785   with
0.95     41       284690   a        1105    doctors
0.93     12       84716    is       1105    doctors

These associations were likely to happen because the non-doctor words shown here are very common and therefore likely to co-occur with any noun.

Page 33: Content Analysis and Statistical Properties of Text

Document Vectors

• Documents are represented as “bags of words”

• Represented as vectors when used computationally

– A vector is like an array of floating-point numbers

– Has direction and magnitude

– Each vector holds a place for every term in the collection

– Therefore, most vectors are sparse
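
The bag-of-words-to-vector idea can be sketched as follows. Both helpers are illustrative: `to_vector` builds the dense array with one slot per vocabulary term, and `to_sparse` keeps only the non-zero slots, which is one common way to exploit sparsity. The names and the tiny vocabulary are assumptions.

```python
from collections import Counter

def to_vector(tokens, vocabulary):
    """Dense vector: one slot per vocabulary term, holding that term's count in the document."""
    counts = Counter(tokens)
    return [float(counts[term]) for term in vocabulary]

def to_sparse(tokens):
    """Sparse form: store only the terms that actually occur."""
    return dict(Counter(tokens))

vocabulary = ["nova", "galaxy", "heat", "film"]
doc = ["nova", "nova", "galaxy"]
print(to_vector(doc, vocabulary))
print(to_sparse(doc))
```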

Page 34: Content Analysis and Statistical Properties of Text

Document Vectors

One location for each word.

Terms (columns): nova, galaxy, heat, h’wood, film, role, diet, fur

Non-blank counts for each document, left to right (blank cells not shown):

A   10 5 3
B   5 10
C   10 8 7
D   9 10 5
E   10 10
F   9 10
G   5 7 9
H   6 10 2 8
I   7 5 1 3

“Nova” occurs 10 times in text A

“Galaxy” occurs 5 times in text A

“Heat” occurs 3 times in text A

(Blank means 0 occurrences.)

Page 35: Content Analysis and Statistical Properties of Text

Document Vectors

One location for each word.

[Same term-by-document table as on the previous slide]

“Hollywood” occurs 7 times in text I

“Film” occurs 5 times in text I

“Diet” occurs 1 time in text I

“Fur” occurs 3 times in text I

Page 36: Content Analysis and Statistical Properties of Text

Document Vectors

[Same term-by-document table as on the previous two slides; the row labels A to I are the document ids]

Page 37: Content Analysis and Statistical Properties of Text

We Can Plot the Vectors

[Figure: documents plotted on two axes, Star and Diet, with points labeled “Doc about astronomy”, “Doc about movie stars”, and “Doc about mammal behavior”]

Page 38: Content Analysis and Statistical Properties of Text

Documents in 3D Space

Page 39: Content Analysis and Statistical Properties of Text

Content Analysis Summary

• Content Analysis: transforming raw text into more computationally useful forms

• Words in text collections exhibit interesting statistical properties

– Word frequencies have a Zipf distribution

– Word co-occurrences exhibit dependencies

• Text documents are transformed to vectors

– Pre-processing includes tokenization, stemming, collocations/phrases

– Documents occupy multi-dimensional space.

Page 40: Content Analysis and Statistical Properties of Text

[Figure: retrieval system diagram with the components Information need, Query, Parse, text input, Pre-process, Index, Collections, and Rank, annotated “How is the index constructed?”]

Page 41: Content Analysis and Statistical Properties of Text

Inverted Index

• This is the primary data structure for text indexes

• Main Idea:

– Invert documents into a big index

• Basic steps:

– Make a “dictionary” of all the tokens in the collection

– For each token, list all the docs it occurs in.

– Do a few things to reduce redundancy in the data structure
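
The basic steps can be sketched as below, using the two example documents the lecture walks through when it shows how inverted files are created. This is a minimal in-memory version: the function name, the regex tokenizer, and lowercasing are assumptions, and storing each document id once per token stands in for the "reduce redundancy" step.

```python
import re
from collections import defaultdict

def build_inverted_index(docs: dict[int, str]) -> dict[str, list[int]]:
    """Map each token to the sorted list of ids of the documents it occurs in."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in re.findall(r"\w+", text.lower()):
            index[token].add(doc_id)  # a set keeps each doc id once per token
    return {token: sorted(ids) for token, ids in sorted(index.items())}

docs = {
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. The time was past midnight",
}
index = build_inverted_index(docs)
print(index["country"], index["manor"])
```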

Page 42: Content Analysis and Statistical Properties of Text

Inverted Indexes

We have seen “Vector files” conceptually.

An Inverted File is a vector file “inverted” so that rows become columns and columns become rows

docs   t1   t2   t3
D1     1    0    1
D2     1    0    0
D3     0    1    1
D4     1    0    0
D5     1    1    1
D6     1    1    0
D7     0    1    0
D8     0    1    0
D9     0    0    1
D10    0    1    1

Terms   D1   D2   D3   D4   D5   D6   D7   …
t1      1    1    0    1    1    1    0
t2      0    0    1    0    1    1    1
t3      1    0    1    0    1    0    0
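
Swapping rows and columns is a matrix transpose, which can be shown directly on the document-by-term table above. This is only a sketch with the matrix held as nested lists; the variable names are illustrative.

```python
# Rows are documents D1..D10, columns are terms t1..t3 (values from the table above).
doc_term = [
    [1, 0, 1], [1, 0, 0], [0, 1, 1], [1, 0, 0], [1, 1, 1],
    [1, 1, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1], [0, 1, 1],
]

# zip(*rows) yields the columns, so each result row is one term across all documents.
term_doc = [list(column) for column in zip(*doc_term)]

for name, row in zip(("t1", "t2", "t3"), term_doc):
    print(name, row)
```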

Page 43: Content Analysis and Statistical Properties of Text

How Are Inverted Files Created

• Documents are parsed to extract tokens. These are saved with the Document ID.

Doc 1: Now is the time for all good men to come to the aid of their country

Doc 2: It was a dark and stormy night in the country manor. The time was past midnight

Term       Doc #
now        1
is         1
the        1
time       1
for        1
all        1
good       1
men        1
to         1
come       1
to         1
the        1
aid        1
of         1
their      1
country    1
it         2
was        2
a          2
dark       2
and        2
stormy     2
night      2
in         2
the        2
country    2
manor      2
the        2
time       2
was        2
past       2
midnight   2

Page 44: Content Analysis and Statistical Properties of Text

How Inverted Files are Created

• After all documents have been parsed the inverted file is sorted alphabetically.

Term       Doc #
a          2
aid        1
all        1
and        2
come       1
country    1
country    2
dark       2
for        1
good       1
in         2
is         1
it         2
manor      2
men        1
midnight   2
night      2
now        1
of         1
past       2
stormy     2
the        1
the        1
the        2
the        2
their      1
time       1
time       2
to         1
to         1
was        2
was        2

[The slide also repeats the unsorted Term / Doc # list from the previous slide for comparison]

Page 45: Content Analysis and Statistical Properties of Text

How Inverted Files are Created

• Multiple term entries for a single document are merged.

• Within-document term frequency information is compiled.

Term       Doc #   Freq
a          2       1
aid        1       1
all        1       1
and        2       1
come       1       1
country    1       1
country    2       1
dark       2       1
for        1       1
good       1       1
in         2       1
is         1       1
it         2       1
manor      2       1
men        1       1
midnight   2       1
night      2       1
now        1       1
of         1       1
past       2       1
stormy     2       1
the        1       2
the        2       2
their      1       1
time       1       1
time       2       1
to         1       2
was        2       2

[The slide also repeats the sorted Term / Doc # list from the previous slide for comparison]
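
Merging repeated entries and compiling within-document frequencies can be sketched with a per-document counter. The function name and the regex tokenizer are assumptions; the two documents are the lecture's own examples.

```python
import re
from collections import Counter

def postings_with_freq(docs: dict[int, str]) -> list[tuple[str, int, int]]:
    """Return sorted (term, doc_id, freq) rows: one row per term per document."""
    rows = []
    for doc_id, text in docs.items():
        counts = Counter(re.findall(r"\w+", text.lower()))  # merges repeats within one doc
        rows.extend((term, doc_id, freq) for term, freq in counts.items())
    return sorted(rows)  # alphabetical by term, then by document id

docs = {
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. The time was past midnight",
}
rows = postings_with_freq(docs)
print(rows[:3], len(rows))
```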

Page 46: Content Analysis and Statistical Properties of Text

How Inverted Files are Created

• Then the file can be split into

– A Dictionary file and

– A Postings file

Page 47: Content Analysis and Statistical Properties of Text

How Inverted Files are Created

[The slide repeats the merged Term / Doc # / Freq table from the earlier slide and splits it into the two files below]

Dictionary

Term       N docs   Tot Freq
a          1        1
aid        1        1
all        1        1
and        1        1
come       1        1
country    2        2
dark       1        1
for        1        1
good       1        1
in         1        1
is         1        1
it         1        1
manor      1        1
men        1        1
midnight   1        1
night      1        1
now        1        1
of         1        1
past       1        1
stormy     1        1
the        2        4
their      1        1
time       2        2
to         1        2
was        1        2

Postings

Doc #   Freq
2       1
1       1
1       1
2       1
1       1
1       1
2       1
2       1
1       1
1       1
2       1
1       1
2       1
2       1
1       1
2       1
2       1
1       1
1       1
2       1
2       1
1       2
2       2
1       1
1       1
2       1
1       2
2       2

Page 48: Content Analysis and Statistical Properties of Text

Inverted indexes

• Permit fast search for individual terms

• For each term, you get a list consisting of:

– document ID

– frequency of term in doc (optional)

– position of term in doc (optional)

• These lists can be used to solve Boolean queries:

  • country -> d1, d2

  • manor -> d2

  • country AND manor -> d2

• Also used for statistical ranking algorithms

Page 49: Content Analysis and Statistical Properties of Text

How Inverted Files are Used

[Dictionary and Postings tables repeated from the earlier “How Inverted Files are Created” slide]

Query on “time” AND “dark”

2 docs with “time” in dictionary -> IDs 1 and 2 from posting file

1 doc with “dark” in dictionary -> ID 2 from posting file

Therefore, only doc 2 satisfies the query.
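
The query walk-through amounts to intersecting posting lists. The sketch below assumes an in-memory index mapping each term to a list of document ids; the function name `boolean_and` is hypothetical, and the small index holds only the posting lists quoted in the lecture.

```python
def boolean_and(index: dict[str, list[int]], *terms: str) -> list[int]:
    """Return the ids of documents that contain every query term."""
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))  # keep only docs that are also in this list
    return sorted(result)

# Posting lists copied from the lecture's two-document example.
index = {"time": [1, 2], "dark": [2], "country": [1, 2], "manor": [2]}
print(boolean_and(index, "time", "dark"))  # prints [2]
```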

Page 50: Content Analysis and Statistical Properties of Text

Next Time

• Term weighting

• Statistical ranking