SIMS 296a-4: Text Data Mining, Marti Hearst, UC Berkeley SIMS
DESCRIPTION
Chapters to be Covered: 1. Introduction (this week) 2. Linguistic Essentials 3. Mathematical Foundations 4. Mathematical Foundations (cont.) 5. Collocations 6. Statistical Inference 7. Word Sense Disambiguation 8. Markov Models 9. Text Categorization 10. Topics in Information Retrieval 11. Clustering 12. Lexical Acquisition
TRANSCRIPT
SIMS 296a-4: Text Data Mining
Marti Hearst UC Berkeley SIMS
The Textbook
Foundations of Statistical Natural Language Processing, by Chris Manning and Hinrich Schuetze
We’ll go through one chapter each week
Chapters to be Covered
1. Introduction (this week)
2. Linguistic Essentials
3. Mathematical Foundations
4. Mathematical Foundations (cont.)
5. Collocations
6. Statistical Inference
7. Word Sense Disambiguation
8. Markov Models
9. Text Categorization
10. Topics in Information Retrieval
11. Clustering
12. Lexical Acquisition
Introduction
Scientific basis for this inquiry: Rationalist vs. Empirical Approach to Language Analysis
– Justification for the rationalist view: poverty of the stimulus
– Can overcome this if we assume humans can generalize concepts
Introduction
Competence vs. performance theory of grammar
– Focus on whether or not sentences are well-formed
– Syntactic vs. semantic well-formedness
– Conventionality of expression breaks this notion
Introduction
Categorical perception
– Works pretty well for recognizing phonemes
– But not for larger phenomena like syntax
– Language change as counter-evidence to strict categorizability of language
» "kind of" / "sort of" changed parts of speech very gradually
» They occupied an intermediate syntactic status during the transition
– Better to adopt a probabilistic view (of cognition as well as of language)
Introduction
The ambiguity of language
– Unlike programming languages, natural language is ambiguous if not understood in terms of all its parts
» Sometimes truly ambiguous too
– Parsing with syntax alone is harder than parsing that also uses the underlying meaning
Classifying Application Types
                      Patterns                     Non-novel nuggets        Novel nuggets
Non-textual data      Standard data mining         Database queries         ?
Textual data          Computational linguistics    Information retrieval    Real text data mining
Word Token Distribution
Word tokens are not uniformly distributed in text
– The most common tokens account for about 50% of the occurrences
– About 50% of the word types occur only once
– ~12% of the text consists of words occurring 3 times or fewer
Thus it is hard to predict the behavior of many words in the text.
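A minimal Python sketch of measuring these statistics on a plain-text corpus (my own illustration, not part of the lecture; the crude tokenizer, the top-100 cutoff, and the file name corpus.txt are assumptions):

from collections import Counter
import re

def token_stats(text):
    # Crude tokenizer: lowercase alphabetic strings only (an assumption).
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)

    # Share of all occurrences covered by the 100 most common word types.
    top_coverage = sum(c for _, c in counts.most_common(100)) / total

    # Share of word types that occur exactly once (hapax legomena).
    hapax_share = sum(1 for c in counts.values() if c == 1) / len(counts)

    # Share of the running text made up of words occurring 3 times or fewer.
    rare_mass = sum(c for c in counts.values() if c <= 3) / total

    return top_coverage, hapax_share, rare_mass

if __name__ == "__main__":
    text = open("corpus.txt").read()   # hypothetical input file
    print(token_stats(text))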
Zipf’s “Law”
[Histogram: word frequency by bin; y-axis runs 0 to 350]
f ∝ 1/r, i.e., f · r ≈ C (with C roughly N/10 for a corpus of N tokens)
Rank = order of words’ frequency of occurrence
The product of the frequency of words (f) and their rank (r) is approximately constant
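One way to see this on a corpus is a short Python sketch (an illustration, not from the lecture): sort word types by frequency and check that f times r stays in the same ballpark across ranks.

from collections import Counter
import re

def zipf_check(text, ranks=(1, 10, 100, 1000)):
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    by_freq = counts.most_common()            # (word, f) sorted by descending f
    for r in ranks:
        if r <= len(by_freq):
            word, f = by_freq[r - 1]
            # If Zipf's law holds, f * r should be roughly the same at every rank.
            print(f"rank={r:5d}  word={word:<12s}  f={f:7d}  f*r={f * r:8d}")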
Consequences of Zipf
There are always a few very frequent tokens that are not good discriminators.
– Called “stop words” in Information Retrieval
– Usually correspond to the linguistic notion of “closed-class” words
» English examples: to, from, on, and, the, ...
» Grammatical classes that don’t take on new members
Typically:
– A few very common words
– A middling number of medium-frequency words
– A large number of very infrequent words
Medium-frequency words are the most descriptive.
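A small sketch of carving out that medium-frequency band (my own illustration; the cutoffs of 100 top types and a minimum count of 3 are arbitrary assumptions):

from collections import Counter
import re

def medium_frequency_vocab(text, top_drop=100, min_count=3):
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    ranked = [w for w, _ in counts.most_common()]
    # Skip the top_drop most frequent types (stop-word-like, closed-class words)
    # and anything occurring fewer than min_count times; keep the middle band.
    return [w for w in ranked[top_drop:] if counts[w] >= min_count]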
Word Frequency vs. Resolving Power (from van Rijsbergen 79)
The most frequent words are not (usually) the most descriptive.
Order by Rank vs. by Alphabetical Order
Other Zipfian “Laws”
Conservation of speaker/hearer effort ->
– The number of meanings m of a word is correlated with its frequency
– (pure speaker economy would yield one word for all meanings; pure hearer economy would yield a separate word for every meaning)
– m proportional to sqrt(f), i.e., inversely proportional to sqrt(r)
– Important for word sense disambiguation
Content words tend to clump together
– Important for computing term distribution models
Is Zipf a Red Herring?
Power laws are common in natural systems
Li 1992 shows a Zipfian distribution of words can be generated randomly
– 26 characters and a blank
– The blank or any other character is equally likely to be generated
– Key insights:
» There are 26 times more words of length n+1 than of length n
» There is a constant ratio by which words of length n are more frequent than words of length n+1
Nevertheless, the Zipf insight is important to keep in mind when working with text corpora. Language modeling is hard because most words are rare.
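A rough reconstruction of that argument in Python (a sketch under my own assumptions, not Li's code): type 26 letters and a blank uniformly at random and inspect the rank/frequency behavior of the resulting "words".

import random
from collections import Counter

def random_text(n_chars=2_000_000, seed=0):
    # 26 letters plus a blank, each equally likely; blanks delimit "words".
    rng = random.Random(seed)
    alphabet = "abcdefghijklmnopqrstuvwxyz "
    return "".join(rng.choice(alphabet) for _ in range(n_chars))

counts = Counter(random_text().split())
ranked = counts.most_common()
for rank in (1, 10, 100, 1000, 10000):
    if rank <= len(ranked):
        word, f = ranked[rank - 1]
        # The curve is a stairstep (all words of a given length are about
        # equally frequent), but overall it approximates a power law.
        print(rank, repr(word), f, rank * f)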
Collocations
Collocation: any turn of phrase or accepted usage where the whole is perceived to have an existence beyond the sum of its parts.
– Compounds (disk drive)
– Phrasal verbs (make up)
– Stock phrases (bacon and eggs)
Another definition:
– The frequent use of a phrase as a fixed expression accompanied by certain connotations.
Computing Collocations
Take the most frequent adjacent pairs
– Doesn’t yield interesting results by itself
– Need to normalize for the word frequency within the corpus (see the sketch after this list)
Another tack: retain only pairs with interesting syntactic categories
» adj noun
» noun noun
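A rough sketch of the frequency-normalization idea (my illustration, not the course's code; pointwise mutual information is used here as one common way to normalize raw pair counts by the individual word frequencies):

import math
import re
from collections import Counter

def candidate_collocations(text, min_count=5):
    tokens = re.findall(r"[a-z]+", text.lower())
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))   # adjacent word pairs
    n = len(tokens)

    scored = []
    for (w1, w2), c in bigrams.items():
        if c < min_count:                        # drop very rare pairs
            continue
        # PMI: observed pair probability vs. what independence would predict.
        pmi = math.log2((c / n) / ((unigrams[w1] / n) * (unigrams[w2] / n)))
        scored.append((pmi, w1, w2, c))
    return sorted(scored, reverse=True)

The syntactic filter mentioned above could be layered on top by part-of-speech tagging the corpus and keeping only adjective-noun and noun-noun pairs.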
Next Week
Learn about linguistics!
Decide on project participation