9/11/2001information organization and retrieval content analysis and statistical properties of text...
Post on 19-Dec-2015
215 Views
Preview:
TRANSCRIPT
9/11/2001 Information Organization and Retrieval
Content Analysis and Statistical Properties of Text
Ray Larson & Warren Sack
University of California, Berkeley
School of Information Management and Systems
SIMS 202: Information Organization and Retrieval
Lecture authors: Marti Hearst & Ray Larson & Warren Sack
9/11/2001 Information Organization and Retrieval
Last Time
• Boolean Model of Information Retrieval
9/11/2001 Information Organization and Retrieval
Boolean Queries• Cat
• Cat OR Dog
• Cat AND Dog
• (Cat AND Dog)
• (Cat AND Dog) OR Collar
• (Cat AND Dog) OR (Collar AND Leash)
• (Cat OR Dog) AND (Collar OR Leash)
9/11/2001 Information Organization and Retrieval
Boolean• Advantages
– simple queries are easy to understand– relatively easy to implement
• Disadvantages– difficult to specify what is wanted– too much returned, or too little– ordering not well determined
• Dominant language in commercial systems until the WWW
9/11/2001 Information Organization and Retrieval
How are the texts handled?
• What happens if you take the words exactly as they appear in the original text?
• What about punctuation, capitalization, etc.?• What about spelling errors? • What about plural vs. singular forms of words• What about cases and declension in non-
english languages?• What about non-roman alphabets?
9/11/2001 Information Organization and Retrieval
Today
• Overview of Content Analysis
• Text Representation
• Statistical Characteristics of Text Collections
• Zipf distribution
• Statistical dependence
9/11/2001 Information Organization and Retrieval
Content Analysis• Automated Transformation of raw text into a form
that represent some aspect(s) of its meaning• Including, but not limited to:
– Automated Thesaurus Generation
– Phrase Detection
– Categorization
– Clustering
– Summarization
9/11/2001 Information Organization and Retrieval
Techniques for Content Analysis• Statistical
– Single Document
– Full Collection
• Linguistic– Syntactic
– Semantic
– Pragmatic
• Knowledge-Based (Artificial Intelligence)• Hybrid (Combinations)
9/11/2001 Information Organization and Retrieval
Text Processing
• Standard Steps:– Recognize document structure
• titles, sections, paragraphs, etc.
– Break into tokens• usually space and punctuation delineated
– Stemming/morphological analysis• Special issues with morphologically complex languages (e.g.,
Finnish)
– Store in inverted index (to be discussed later)
Informationneed
Index
Pre-process
Parse
Collections
Rank
Query
text input
How isthe queryconstructed?
How isthe text processed?
Information Organization and Retrieval
Document Processing Steps
9/11/2001 Information Organization and Retrieval
Stemming and Morphological Analysis
• Goal: “normalize” similar words• Morphology (“form” of words)
– Inflectional Morphology• E.g,. inflect verb endings and noun number• Never change grammatical class
– dog, dogs– tengo, tienes, tiene, tenemos, tienen
– Derivational Morphology • Derive one word from another, • Often change grammatical class
– build, building; health, healthy
9/11/2001 Information Organization and Retrieval
Automated Methods• Powerful multilingual tools exist for
morphological analysis– PCKimmo, Xerox Lexical technology– Require rules (a “grammar”) and dictionary
• E.g., “spy”+”s” “spi”+”es” == “spies”
– Use finite-state automata
• Stemmers:– Very dumb rules work well (for English)– Porter Stemmer: Iteratively remove suffixes– Improvement: pass results through a lexicon
9/11/2001 Information Organization and Retrieval
Porter Stemmer Rules
• sses ss
• ies I
• ed NULL
• ing NULL
• ational ate
• ization ize
9/11/2001 Information Organization and Retrieval
Errors Generated by Porter Stemmer (Krovetz 93)
Errors of Commission Errors of Ommission
organization organ urgency urgent
generalization generic triangle triangular
arm army explain explanation
9/11/2001 Information Organization and Retrieval
Table-Lookup Stemming• E.g., Karp, Schabes, Zaidel and Egedi, “A Freely
Available Wide Coverage Morphological Analyzer for English,” COLING-92
• Example table entries:
matrices matrix N 3pl
fish fish N 3sg
fish V INF
fish N 3pl
9/11/2001 Information Organization and Retrieval
Statistical Properties of Text
• Token occurrences in text are not uniformly distributed
• They are also not normally distributed
• They do exhibit a Zipf distribution
9/11/2001 Information Organization and Retrieval
Common words in Tom Sawyer
• the 3332
• and 2972
• a 1775
• …
• I 783
• you 686
• Tom 679
9/11/2001 Information Organization and Retrieval
Plotting Word Frequency by Rank
• Frequency: How many times do tokens (words) occur in the text (or collection).
• Rank: Now order these according to how often they occur.
9/11/2001 Information Organization and Retrieval
Empirical evaluation of Zipf’s Law for Tom Sawyer
word frequency rank f * r
the 3332 1 3332
and 2972 2 5944
a 1775 3 5235
he 877 10 8770
be 294 30 8820
there 222 40 8880
one 172 50 8600
friends 10 800 8000
9/11/2001 Information Organization and Retrieval
Rank Freq1 37 system2 32 knowledg3 24 base4 20 problem5 18 abstract6 15 model7 15 languag8 15 implem9 13 reason10 13 inform11 11 expert12 11 analysi13 10 rule14 10 program15 10 oper16 10 evalu17 10 comput18 10 case19 9 gener20 9 form
The Corresponding Zipf Curve
9/11/2001 Information Organization and Retrieval
Zoom in on the Knee of the Curve43 6 approach44 5 work45 5 variabl46 5 theori47 5 specif48 5 softwar49 5 requir50 5 potenti51 5 method52 5 mean53 5 inher54 5 data55 5 commit
56 5 applic57 4 tool58 4 technolog59 4 techniqu
9/11/2001 Information Organization and Retrieval
Zipf Distribution
• The Important Points:– a few elements occur very frequently– a medium number of elements have medium
frequency– many elements occur very infrequently
9/11/2001 Information Organization and Retrieval
Zipf Distribution• The product of the frequency of words (f) and their rank (r) is
approximately constant– Rank = order of words’ frequency of occurrence
• Another way to state this is with an approximately correct rule of thumb:– Say the most common term occurs C times– The second most common occurs C/2 times– The third most common occurs C/3 times– …
10/
/1
NC
rCf
9/11/2001 Information Organization and Retrieval
Empirical evaluation of Zipf’s Law for Tom Sawyer
word frequency rank f * r
the 3332 1 3332
and 2972 2 5944
a 1775 3 5235
he 877 10 8770
be 294 30 8820
there 222 40 8880
one 172 50 8600
friends 10 800 8000
Information Organization and Retrieval
Zipf Distribution(linear and log scale)
9/11/2001 Information Organization and Retrieval
Consequences of Zipf• There are always a few very frequent tokens
that are not good discriminators.– Called “stop words” in IR– Usually correspond to linguistic notion of
“closed-class” words• English examples: to, from, on, and, the, ...• Grammatical classes that don’t take on new members.
• There are always a large number of tokens that occur once and can mess up algorithms.
• Medium frequency words most descriptive
9/11/2001 Information Organization and Retrieval
Frequency v. Resolving PowerLuhn's contribution to automatic text analysis: frequency data can be used to extract words and sentences to represent a documentLUHN, H.P., 'The automatic creation of literature abstracts', IBM Journal of Research and Development, 2, 159-165 (1958)
9/11/2001 Information Organization and Retrieval
What Kinds of Data Exhibit a Zipf Distribution?
• Words in a text collection– Virtually any language usage
• Library book checkout patterns• Incoming Web Page Requests (Nielsen)
• Outgoing Web Page Requests (Cunha & Crovella)
• Document Size on Web (Cunha & Crovella)
9/11/2001 Information Organization and Retrieval
Related Distributions/Laws
• Bradford’s Law of Literary Yield (1934)
If periodicals are ranked into three groups, each yielding the same number of articles on a specified topic, the numbers of periodicals in each group increased geometrically.
Thus, Leith (1969) shows that by reading only the “core” periodicals of your speciality you will miss about 40% of the articles relevant to it.
9/11/2001 Information Organization and Retrieval
Related Distributions/Laws
• Lotka’s distribution of literary productivity (1926)
The number of author-scientists who had published N papers in a given field was roughly 1/N**2 the number of authors who had published one paper only.
9/11/2001 Information Organization and Retrieval
Related Distributions/Laws
• For more examples, see Robert A. Fairthorne, “Empirical Distributions (Bradford-Zipf-Mandelbrot) for Bibliometric Description and Prediction,” Journal of Documentation, 25(4), 319-341
• Pareto distribution of wealth; Willis taxonomic distribition in biology; Mandelbrot on self-similarity and market prices and communication errors; etc.
9/11/2001 Information Organization and Retrieval
Statistical Independence vs. Statistical Dependence
• How likely is a red car to drive by given we’ve seen a black one?
• How likely is the word “ambulence” to appear, given that we’ve seen “car accident”?
• Color of cars driving by are independent (although more frequent colors are more likely)
• Words in text are not independent (although again more frequent words are more likely)
9/11/2001 Information Organization and Retrieval
Statistical Independence
Two events x and y are statistically independent if the product of their probability of their happening individually equals their probability of happening together.
),()()( yxPyPxP
9/11/2001 Information Organization and Retrieval
Statistical Independence and Dependence
• What are examples of things that are statistically independent?
• What are examples of things that are statistically dependent?
9/11/2001 Information Organization and Retrieval
Lexical Associations
• Subjects write first word that comes to mind– doctor/nurse; black/white (Palermo & Jenkins 64)
• Text Corpora yield similar associations• One measure: Mutual Information (Church and Hanks
89)
• If word occurrences were independent, the numerator and denominator would be equal (if measured across a large collection)
)(),(
),(log),( 2 yPxP
yxPyxI
9/11/2001 Information Organization and Retrieval
Statistical Independence• Compute for a window of words
collectionin wordsofnumber
in occur -co and timesofnumber ),(
position at starting ndow within wiwords
5)(say windowoflength ||
),(1
),(
:follows as ),( eapproximat llWe'
/)()(
t.independen if ),()()(
||
1
N
wyxyxw
iw
ww
yxwN
yxP
yxP
NxfxP
yxPyPxP
i
wN
ii
w1 w11w21
a b c d e f g h i j k l m n o p
9/11/2001 Information Organization and Retrieval
Interesting Associations with “Doctor” AP Corpus, N=15 million, Church & Hanks 89)
I(x,y) f(x,y) f(x) x f(y) y11.3 12 111 Honorary 621 Doctor
11.3 8 1105 Doctors 44 Dentists
10.7 30 1105 Doctors 241 Nurses
9.4 8 1105 Doctors 154 Treating
9.0 6 275 Examined 621 Doctor
8.9 11 1105 Doctors 317 Treat
8.7 25 621 Doctor 1407 Bills
9/11/2001 Information Organization and Retrieval
I(x,y) f(x,y) f(x) x f(y) y0.96 6 621 doctor 73785 with
0.95 41 284690 a 1105 doctors
0.93 12 84716 is 1105 doctors
Uninteresting Associations with “Doctor”
(AP Corpus, N=15 million, Church & Hanks 89)
These associations were likely to happen because the non-doctor words shown here are very commonand therefore likely to co-occur with any noun.
9/11/2001 Information Organization and Retrieval
Document Vectors
• Documents are represented as “bags of words”• Represented as vectors when used
computationally– A vector is an array of numbers – floating point
– Has direction and magnitude
– Each vector holds a place for every term in the collection
– Therefore, most vectors are sparse
9/11/2001 Information Organization and Retrieval
Document VectorsOne location for each word.
nova galaxy heat h’wood film role diet fur
10 5 3
5 10
10 8 7
9 10 5
10 10
9 10
5 7 9
6 10 2 8
7 5 1 3
ABCDEFGHI
“Nova” occurs 10 times in text A“Galaxy” occurs 5 times in text A“Heat” occurs 3 times in text A(Blank means 0 occurrences.)
9/11/2001 Information Organization and Retrieval
Document VectorsOne location for each word.
nova galaxy heat h’wood film role diet fur
10 5 3
5 10
10 8 7
9 10 5
10 10
9 10
5 7 9
6 10 2 8
7 5 1 3
ABCDEFGHI
“Hollywood” occurs 7 times in text I“Film” occurs 5 times in text I“Diet” occurs 1 time in text I“Fur” occurs 3 times in text I
9/11/2001 Information Organization and Retrieval
Document Vectors
nova galaxy heat h’wood film role diet fur
10 5 3
5 10
10 8 7
9 10 5
10 10
9 10
5 7 9
6 10 2 8
7 5 1 3
ABCDEFGHI
Document ids
9/11/2001 Information Organization and Retrieval
We Can Plot the VectorsStar
Diet
Doc about astronomyDoc about movie stars
Doc about mammal behavior
Information Organization and Retrieval
Documents in 3D Space
9/11/2001 Information Organization and Retrieval
Content Analysis Summary• Content Analysis: transforming raw text into more
computationally useful forms• Words in text collections exhibit interesting
statistical properties– Word frequencies have a Zipf distribution– Word co-occurrences exhibit dependencies
• Text documents are transformed to vectors– Pre-processing includes tokenization, stemming,
collocations/phrases– Documents occupy multi-dimensional space.
Informationneed
Index
Pre-process
Parse
Collections
Rank
Query
text inputHow isthe indexconstructed?
9/11/2001 Information Organization and Retrieval
Inverted Index• This is the primary data structure for text indexes• Main Idea:
– Invert documents into a big index
• Basic steps:– Make a “dictionary” of all the tokens in the collection
– For each token, list all the docs it occurs in.
– Do a few things to reduce redundancy in the data structure
9/11/2001 Information Organization and Retrieval
Inverted IndexesWe have seen “Vector files” conceptually.
An Inverted File is a vector file “inverted” so that rows become columns and columns become rowsdocs t1 t2 t3D1 1 0 1D2 1 0 0D3 0 1 1D4 1 0 0D5 1 1 1D6 1 1 0D7 0 1 0D8 0 1 0D9 0 0 1
D10 0 1 1
Terms D1 D2 D3 D4 D5 D6 D7 …
t1 1 1 0 1 1 1 0t2 0 0 1 0 1 1 1t3 1 0 1 0 1 0 0
9/11/2001 Information Organization and Retrieval
How Are Inverted Files Created• Documents are parsed to extract tokens.
These are saved with the Document ID.
Now is the timefor all good men
to come to the aidof their country
Doc 1
It was a dark andstormy night in
the country manor. The time was past midnight
Doc 2
Term Doc #now 1is 1the 1time 1for 1all 1good 1men 1to 1come 1to 1the 1aid 1of 1their 1country 1it 2was 2a 2dark 2and 2stormy 2night 2in 2the 2country 2manor 2the 2time 2was 2past 2midnight 2
9/11/2001 Information Organization and Retrieval
How Inverted Files are Created
• After all documents have been parsed the inverted file is sorted alphabetically.
Term Doc #a 2aid 1all 1and 2come 1country 1country 2dark 2for 1good 1in 2is 1it 2manor 2men 1midnight 2night 2now 1of 1past 2stormy 2the 1the 1the 2the 2their 1time 1time 2to 1to 1was 2was 2
Term Doc #now 1is 1the 1time 1for 1all 1good 1men 1to 1come 1to 1the 1aid 1of 1their 1country 1it 2was 2a 2dark 2and 2stormy 2night 2in 2the 2country 2manor 2the 2time 2was 2past 2midnight 2
9/11/2001 Information Organization and Retrieval
How InvertedFiles are Created
• Multiple term entries for a single document are merged.
• Within-document term frequency information is compiled.
Term Doc # Freqa 2 1aid 1 1all 1 1and 2 1come 1 1country 1 1country 2 1dark 2 1for 1 1good 1 1in 2 1is 1 1it 2 1manor 2 1men 1 1midnight 2 1night 2 1now 1 1of 1 1past 2 1stormy 2 1the 1 2the 2 2their 1 1time 1 1time 2 1to 1 2was 2 2
Term Doc #a 2aid 1all 1and 2come 1country 1country 2dark 2for 1good 1in 2is 1it 2manor 2men 1midnight 2night 2now 1of 1past 2stormy 2the 1the 1the 2the 2their 1time 1time 2to 1to 1was 2was 2
9/11/2001 Information Organization and Retrieval
How Inverted Files are Created
• Then the file can be split into – A Dictionary file and – A Postings file
9/11/2001 Information Organization and Retrieval
How Inverted Files are CreatedDictionary PostingsTerm Doc # Freq
a 2 1aid 1 1all 1 1and 2 1come 1 1country 1 1country 2 1dark 2 1for 1 1good 1 1in 2 1is 1 1it 2 1manor 2 1men 1 1midnight 2 1night 2 1now 1 1of 1 1past 2 1stormy 2 1the 1 2the 2 2their 1 1time 1 1time 2 1to 1 2was 2 2
Doc # Freq2 11 11 12 11 11 12 12 11 11 12 11 12 12 11 12 12 11 11 12 12 11 22 21 11 12 11 22 2
Term N docs Tot Freqa 1 1aid 1 1all 1 1and 1 1come 1 1country 2 2dark 1 1for 1 1good 1 1in 1 1is 1 1it 1 1manor 1 1men 1 1midnight 1 1night 1 1now 1 1of 1 1past 1 1stormy 1 1the 2 4their 1 1time 2 2to 1 2was 1 2
9/11/2001 Information Organization and Retrieval
Inverted indexes• Permit fast search for individual terms• For each term, you get a list consisting of:
– document ID – frequency of term in doc (optional) – position of term in doc (optional)
• These lists can be used to solve Boolean queries:• country -> d1, d2• manor -> d2• country AND manor -> d2
• Also used for statistical ranking algorithms
9/11/2001 Information Organization and Retrieval
How Inverted Files are Used
Dictionary PostingsDoc # Freq
2 11 11 12 11 11 12 12 11 11 12 11 12 12 11 12 12 11 11 12 12 11 22 21 11 12 11 22 2
Term N docs Tot Freqa 1 1aid 1 1all 1 1and 1 1come 1 1country 2 2dark 1 1for 1 1good 1 1in 1 1is 1 1it 1 1manor 1 1men 1 1midnight 1 1night 1 1now 1 1of 1 1past 1 1stormy 1 1the 2 4their 1 1time 2 2to 1 2was 1 2
Query on “time” AND “dark”
2 docs with “time” in dictionary ->IDs 1 and 2 from posting file
1 doc with “dark” in dictionary ->ID 2 from posting file
Therefore, only doc 2 satisfied the query.
9/11/2001 Information Organization and Retrieval
Next Time
• Term weighting
• Statistical ranking
top related