Statistical Methods
Statistical Methods
- Traditional grammars may be “brittle”
- Statistical methods are built on formal theories
- Vary in complexity from simple trigrams to conditional random fields
- Can be used for language identification, text classification, information retrieval, and information extraction
N-Grams
- Text is composed of characters (or words, or phonemes)
- An n-gram is a sequence of n consecutive characters (or words, ...): unigram, bigram, trigram
- Technically, an n-gram model is a Markov chain of order n-1: P(c_i | c_1:i-1) = P(c_i | c_i-n+1:i-1)
- N-gram probabilities are calculated by counting over a large corpus
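As a small illustration of counting n-grams over a corpus, here is a minimal Python sketch; the toy corpus string and the maximum-likelihood estimate are illustrative choices, not from the slides:

```python
from collections import Counter

def char_ngram_model(text, n=3):
    """Estimate P(c_i | c_{i-n+1:i-1}) by relative frequency of character n-grams."""
    ngrams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    contexts = Counter(text[i:i + n - 1] for i in range(len(text) - n + 1))
    return {g: ngrams[g] / contexts[g[:-1]] for g in ngrams}

# Toy corpus; in practice the counts come from a large corpus.
model = char_ngram_model("the cat sat on the mat", n=3)
print(model["the"])   # P('e' | 'th') estimated from this tiny text
```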
Example – Language Identification
- Use P(c_i | c_i-2:i-1, l), where l ranges over languages
- About 100,000 characters of each language are needed
- l* = argmax_l P(l | c_1:N) = argmax_l P(l) ∏_i P(c_i | c_i-2:i-1, l)
- Learn the trigram model from a corpus; P(l), the prior probability of a given language, can also be estimated
- Other examples: spelling correction, genre classification, and named-entity recognition
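A hedged sketch of the argmax above, using per-language character-trigram models; the tiny training strings, the uniform priors, and the 1e-6 floor for unseen trigrams are illustrative assumptions:

```python
import math
from collections import Counter

def train_char_trigrams(text):
    """Maximum-likelihood character-trigram model for one language."""
    tri = Counter(text[i:i + 3] for i in range(len(text) - 2))
    bi = Counter(text[i:i + 2] for i in range(len(text) - 2))
    return {g: tri[g] / bi[g[:2]] for g in tri}

def identify(text, models, priors):
    """Return argmax_l P(l) * prod_i P(c_i | c_{i-2:i-1}, l), computed in log space.
    Unseen trigrams get a small floor probability (an assumption, not from the slides)."""
    def score(lang):
        m = models[lang]
        logp = math.log(priors[lang])
        for i in range(len(text) - 2):
            logp += math.log(m.get(text[i:i + 3], 1e-6))
        return logp
    return max(models, key=score)

# Toy training data; the slides suggest ~100,000 characters per language in practice.
models = {"en": train_char_trigrams("the quick brown fox jumps over the lazy dog "),
          "de": train_char_trigrams("der schnelle braune fuchs springt ueber den faulen hund ")}
print(identify("the fox", models, priors={"en": 0.5, "de": 0.5}))   # 'en'
```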
Smoothing
- Problem: What if a particular n-gram does not appear in the training corpus?
- Its probability would be 0, but it should be a small, positive number
- Smoothing: adjusting the probability of low-frequency counts
- Laplace smoothing: after n observations, estimate an unseen event at 1/(n+2) instead of 0
- Backoff model: back off to (n-1)-grams when the n-gram is unseen
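A minimal sketch of add-one (Laplace) smoothing; the slide's 1/(n+2) is the two-outcome special case, and the general form with a vocabulary-size denominator is my own generalization:

```python
def laplace_prob(count, n_observations, vocab_size):
    """Add-one (Laplace) smoothing: an unseen event gets a small positive probability."""
    return (count + 1) / (n_observations + vocab_size)

# Binary case from the slide: after n observations, an unseen outcome gets 1/(n+2).
print(laplace_prob(0, 1000, vocab_size=2))    # 1/1002
# General case, e.g., a 27-character alphabet:
print(laplace_prob(0, 1000, vocab_size=27))   # small but positive
print(laplace_prob(500, 1000, vocab_size=27)) # seen events shrink slightly toward uniform
```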
Model Evaluation
- Use cross-validation (split the corpus into training and evaluation sets)
- Need a metric for evaluation
- Perplexity describes the probability of a sequence: Perplexity(c_1:N) = P(c_1:N)^(-1/N)
- Can be thought of as the reciprocal of probability, normalized by sequence length
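A quick sketch of the perplexity formula, computed in log space from per-character conditional probabilities; the example probabilities are made up:

```python
import math

def perplexity(char_probs):
    """Perplexity(c_1:N) = P(c_1:N)^(-1/N), computed from per-character probabilities."""
    n = len(char_probs)
    log_p = sum(math.log(p) for p in char_probs)
    return math.exp(-log_p / n)

# If every character had probability 0.25, perplexity is 4:
print(perplexity([0.25] * 12))   # 4.0
```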
N-gram Word Models
- Can be used for text classification, e.g., spam vs. ham
- Problem: out-of-vocabulary words
- Trick: during training, replace the first occurrence of each word with <UNK> and use the word itself afterwards; at test time, treat any unknown word as <UNK>
- Calculate probabilities from a corpus, then phrases can be generated by random sampling
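A small sketch of the <UNK> trick described above; the token list is illustrative:

```python
def apply_unk(tokens):
    """Replace the first occurrence of each word with <UNK>, keep later occurrences;
    unknown words at test time then map to <UNK>."""
    seen = set()
    out = []
    for w in tokens:
        if w in seen:
            out.append(w)
        else:
            seen.add(w)
            out.append("<UNK>")
    return out

print(apply_unk("the cat saw the dog and the cat".split()))
# ['<UNK>', '<UNK>', '<UNK>', 'the', '<UNK>', '<UNK>', 'the', 'cat']
```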
Example – Spam Detection
- A text classification problem
- Train P(message|spam) and P(message|ham) using n-grams
- Calculate P(message|spam) P(spam) and P(message|ham) P(ham) and take whichever is greater
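A hedged sketch of this decision rule using unigram word models with add-one smoothing (a simplification of the n-gram models the slides describe); the toy training messages and the 0.5 prior are assumptions:

```python
import math
from collections import Counter

def train(messages):
    """Unigram word probabilities with add-one smoothing."""
    counts = Counter(w for m in messages for w in m.split())
    total = sum(counts.values())
    vocab = len(counts) + 1
    return lambda w: (counts[w] + 1) / (total + vocab)

def classify(message, p_word_spam, p_word_ham, p_spam=0.5):
    log_spam = math.log(p_spam) + sum(math.log(p_word_spam(w)) for w in message.split())
    log_ham = math.log(1 - p_spam) + sum(math.log(p_word_ham(w)) for w in message.split())
    return "spam" if log_spam > log_ham else "ham"

spam_model = train(["buy cheap pills now", "win money now"])
ham_model = train(["meeting at noon tomorrow", "see you at lunch"])
print(classify("win cheap money", spam_model, ham_model))   # spam
```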
Spam Detection – Other Methods
- Represent the message as a set of feature/value pairs
- Apply a classification algorithm to the feature vector
- Strongly depends on the features chosen
- Data compression: compression algorithms such as LZW look for commonly recurring sequences and replace later copies with pointers to earlier ones
- Append the new message to the list of spam messages and compress, do the same for ham, and whichever compresses smaller determines the class
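A rough sketch of the compression idea; the slides mention LZW, and here zlib (LZ77/DEFLATE) stands in as a readily available substitute, with toy spam/ham lists as assumptions:

```python
import zlib

def compressed_size(texts):
    return len(zlib.compress(" ".join(texts).encode("utf-8")))

def classify_by_compression(message, spam_msgs, ham_msgs):
    """Append the new message to each class and see which grows less when compressed;
    shared phrasing compresses better."""
    spam_growth = compressed_size(spam_msgs + [message]) - compressed_size(spam_msgs)
    ham_growth = compressed_size(ham_msgs + [message]) - compressed_size(ham_msgs)
    return "spam" if spam_growth < ham_growth else "ham"

spam = ["buy cheap pills now", "cheap pills buy now", "win money fast"]
ham = ["meeting moved to noon", "see you at lunch", "notes from the meeting"]
print(classify_by_compression("buy cheap pills", spam, ham))
```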
Information Retrieval
- Think WWW and search engines
- Characterized by: a corpus of documents, queries in some query language, a result set, and a presentation of the result set (some ordering)
- Methods: simple Boolean keyword models, IR scoring functions, the PageRank algorithm, the HITS algorithm
IR Scoring Function - BM25
- From the Okapi project (Robertson et al.)
- Three factors: the frequency with which a query word appears in the document (TF), the inverse document frequency (IDF), i.e., the inverse of how many documents the word appears in, and the length of the document
- BM25(d_j, q_1:N) = Σ_{i=1..N} IDF(q_i) · TF(q_i, d_j) · (k+1) / (TF(q_i, d_j) + k · (1 - b + b · |d_j| / L))
- |d_j| is the length of the document, L is the average document length, and k and b are tuned parameters
BM25 cont'd.
IDF(q_i) = log((N - DF(q_i) + 0.5) / (DF(q_i) + 0.5))

where DF(q_i) is the number of documents containing q_i and N is the total number of documents.
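Putting the two formulas together, here is a minimal BM25 scoring sketch; the k=1.2 and b=0.75 defaults and the toy document-frequency table are assumptions, not from the slides:

```python
import math

def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avg_len, k=1.2, b=0.75):
    """BM25(d, q) = sum_i IDF(q_i) * TF(q_i, d) * (k+1) / (TF(q_i, d) + k*(1 - b + b*|d|/L))."""
    score = 0.0
    for q in query_terms:
        tf = doc_terms.count(q)
        df = doc_freq.get(q, 0)
        idf = math.log((num_docs - df + 0.5) / (df + 0.5))
        score += idf * tf * (k + 1) / (tf + k * (1 - b + b * len(doc_terms) / avg_len))
    return score

doc = "the quick brown fox jumps over the lazy dog".split()
print(bm25_score(["fox", "dog"], doc, doc_freq={"fox": 3, "dog": 10},
                 num_docs=100, avg_len=12.0))
```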
Precision and Recall
- Precision measures the proportion of the documents in the result set that are actually relevant, e.g., if the result set contains 30 relevant documents and 10 non-relevant documents, precision is .75
- Recall is the proportion of relevant documents that are in the result set, e.g., if 30 relevant documents are in the result set out of a possible 50, recall is .60
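The slide's numbers as a tiny worked example:

```python
def precision_recall(retrieved, relevant):
    """Precision = hits / |retrieved|; recall = hits / |relevant|."""
    hits = len(retrieved & relevant)
    return hits / len(retrieved), hits / len(relevant)

# 40 documents retrieved, 30 of them relevant, out of 50 relevant documents overall.
retrieved = set(range(40))
relevant = set(range(30)) | set(range(40, 60))
print(precision_recall(retrieved, relevant))   # (0.75, 0.6)
```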
IR Refinement
- Pivoted document length normalization: longer documents tend to be favored, so instead of raw document length, use a different normalization function that can be tuned
- Use word stems
- Use synonyms
- Look at metadata
PageRank Algorithm (Google)
- Count the links that point to the page
- Weight links from “high-quality sites” higher; this minimizes the effect of creating lots of pages that point to the chosen page
- PR(p) = (1 - d)/N + d · Σ_i PR(x_i) / C(x_i)
- where PR(p) is the PageRank of p, N is the total number of pages in the corpus, d is a damping factor, x_i is a page that links to p, and C(x_i) is the count of the total number of out-links on page x_i
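A minimal iterative PageRank sketch implementing the formula above; the damping factor d=0.85, the iteration count, and the toy link graph are assumptions:

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterative PageRank: PR(p) = (1-d)/N + d * sum_i PR(x_i)/C(x_i),
    where the x_i are the pages linking to p and C(x_i) their out-link counts."""
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_pr = {}
        for p in pages:
            incoming = sum(pr[x] / len(links[x]) for x in pages if p in links[x])
            new_pr[p] = (1 - d) / n + d * incoming
        pr = new_pr
    return pr

# Toy link graph: each page maps to the set of pages it links to.
graph = {"A": {"B", "C"}, "B": {"C"}, "C": {"A"}}
print(pagerank(graph))
```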
Information Extraction
- The ability to answer questions
- Possibilities range from simple template matching to full-blown language understanding systems
- May be domain-specific or general
- Used as a DB front-end or for WWW searching
- Examples: AskMSR, IBM's Watson, Wolfram Alpha, Siri
Template Matching
- Simple template matching (Weizenbaum's Eliza)
- Regular expression matching (finite state automata)
- Relational extraction methods (FASTUS): processing is done in stages: tokenization, complex-word handling, basic-group handling, complex-phrase handling, and structure merging; each stage uses an FSA
Stochastic Methods for NLP
- Probabilistic Context-Free Parsers
- Probabilistic Lexicalized Context-Free Parsers
- Hidden Markov Models (Viterbi Algorithm)
- Statistical Decision-Tree Models
Markov Chain
- A discrete random process: the system is in various states and we move from state to state
- The probability of moving to a particular next state (a transition) depends solely on the current state and not on previous states (the Markov property)
- May be modeled by a finite state machine with probabilities on the edges
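A tiny simulation sketch of a Markov chain as a state machine with probabilities on the edges; the weather states and transition probabilities are made up:

```python
import random

# Toy weather Markov chain: the next state depends only on the current state.
transitions = {
    "sunny": [("sunny", 0.8), ("rainy", 0.2)],
    "rainy": [("sunny", 0.4), ("rainy", 0.6)],
}

def simulate(start, steps):
    """Walk the chain, sampling each next state from the current state's distribution."""
    state, path = start, [start]
    for _ in range(steps):
        states, probs = zip(*transitions[state])
        state = random.choices(states, weights=probs)[0]
        path.append(state)
    return path

print(simulate("sunny", 10))
```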
Hidden Markov Model
- Each state (or transition) may produce an output
- The outputs are visible to the viewer, but the underlying Markov model is not
- The problem is often to infer the path through the model given a sequence of outputs
- The probabilities associated with the transitions are known a priori
- There may be more than one start state; the probability of each start state may also be known
Uses of HMM
- Parts-of-speech (POS) tagging
- Speech recognition
- Handwriting recognition
- Machine translation
- Cryptanalysis
- Many other non-NLP applications
Viterbi Algorithm
- Used to find the most likely sequence of states (the Viterbi path) in an HMM that leads to a given sequence of observed events
- Runs in time proportional to (number of observations) * (number of states)^2
- Can be modified if the state depends on the last n states (instead of just the last state); this takes time proportional to (number of observations) * (number of states)^n
Viterbi Algorithm - Assumptions
- The system at any given time is in one particular state
- There are a finite number of states
- Transitions have an associated incremental metric
- Events are cumulative over a path, i.e., additive in some sense
Viterbi Algorithm - Code
See the pseudocode at http://en.wikipedia.org/wiki/Viterbi_algorithm; a Python sketch follows below.
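Below is a short sketch of the algorithm on a toy two-state tagging model; all the probabilities are illustrative assumptions, not from the slides:

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most likely state sequence for the observations; O(len(obs) * len(states)^2)."""
    # V[t][s] = best probability of any path that ends in state s at time t
    V = [{s: start_p[s] * emit_p[s].get(observations[0], 0.0) for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        V.append({})
        back.append({})
        for s in states:
            prob, prev = max((V[t - 1][p] * trans_p[p][s] * emit_p[s].get(observations[t], 0.0), p)
                             for p in states)
            V[t][s] = prob
            back[t][s] = prev
    # Trace back from the best final state
    last = max(V[-1], key=V[-1].get)
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.insert(0, back[t][path[0]])
    return path

# Toy tagging model (all numbers are illustrative assumptions):
states = ["Noun", "Verb"]
start_p = {"Noun": 0.7, "Verb": 0.3}
trans_p = {"Noun": {"Noun": 0.3, "Verb": 0.7}, "Verb": {"Noun": 0.8, "Verb": 0.2}}
emit_p = {"Noun": {"dogs": 0.6, "run": 0.1}, "Verb": {"dogs": 0.05, "run": 0.7}}
print(viterbi(["dogs", "run"], states, start_p, trans_p, emit_p))   # ['Noun', 'Verb']
```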
Example - Using HMMs
- Using HMMs to parse seminar announcements
- Look for different features: speaker, date, etc.
- Could use one big HMM for all features or separate HMMs for each feature
- Advantages: resistant to noise, can be trained from data, easily updated
- Can be used to generate output as well as to parse
Example: HMM for speaker recognition
Conditional Random Fields
- An HMM models the full joint probability of observations and hidden states, which is more work than needed
- Instead, model the conditional probability of the hidden attributes given the observations
- Given a text e_1:N, find the hidden state sequence X_1:N that maximizes P(X_1:N | e_1:N)
- A Conditional Random Field (CRF) does this
- Linear-chain CRF: the variables form a temporal sequence
Automated Template Construction
- Start with examples of the desired output, e.g., author-title pairs
- Match over a large corpus, noting the order and the prefix, suffix, and intermediate text
- Generate templates from the matches
- Sensitive to noise
Types of Grammars - Chomsky
- Recursively enumerable: unrestricted rules
- Context-sensitive: the right-hand side must contain at least as many symbols as the left-hand side
- Context-free: the left-hand side contains a single (non-terminal) symbol
- Regular: the left-hand side is a single non-terminal, the right-hand side is a terminal symbol optionally followed by a non-terminal symbol
Probabilistic CFG
1. sent <- np, vp.    p(sent) = p(r1) * p(np) * p(vp)
2. np <- noun.        p(np) = p(r2) * p(noun)
...
9. noun <- dog.       p(noun) = p(dog)

The probabilities are taken from a particular corpus of text.
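A small sketch of computing a parse tree's probability as the product of the probabilities of its rules; the rule set and the numbers are illustrative, not estimated from a real corpus:

```python
# Rule probabilities for a tiny PCFG (in practice, estimated from a corpus).
rule_p = {
    ("sent", ("np", "vp")): 1.0,
    ("np", ("noun",)): 0.4,
    ("vp", ("verb",)): 0.3,
    ("noun", ("dog",)): 0.05,
    ("verb", ("barks",)): 0.02,
}

def tree_prob(tree):
    """P(tree) = product of the probabilities of the rules used in the tree.
    A tree is (symbol, [children]); a leaf word is a plain string."""
    if isinstance(tree, str):
        return 1.0
    symbol, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = rule_p[(symbol, rhs)]
    for c in children:
        p *= tree_prob(c)
    return p

parse = ("sent", [("np", [("noun", ["dog"])]), ("vp", [("verb", ["barks"])])])
print(tree_prob(parse))   # 1.0 * 0.4 * 0.05 * 0.3 * 0.02 ≈ 0.00012
```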
Probabilistic Lexicalized CFG
1. sent <- np(noun), vp(verb).    p(sent) = p(r1) * p(np) * p(vp) * p(verb|noun)
2. np <- noun.                    p(np) = p(r2) * p(noun)
...
9. noun <- dog.                   p(noun) = p(dog)

Note that we've introduced the probability of a particular verb given a particular noun.