TRANSCRIPT
Natural Language Processing
Hidden Markov Models
Joakim Nivre
Uppsala University, Department of Linguistics and Philology
Sequence Classification
▶ Part-of-speech tagging is a sequence classification problem
  ▶ Input: Sequence w_1, ..., w_n of words
  ▶ Output: Sequence t_1, ..., t_n of tags
▶ Why not just take one word at a time?
    for i = 1 to n do: argmax_{t_i} P(t_i | w_i, t_{i-1})
▶ Because the desired solution may be:
    argmax_{t_1,...,t_n} P(t_1, ..., t_n | w_1, ..., w_n)
▶ Two approaches:
  ▶ Local optimization – simple, efficient, but may be suboptimal (sketched below)
  ▶ Global optimization – may be computationally challenging
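A minimal sketch of the local (greedy) strategy, assuming hypothetical probability tables trans[(prev, tag)] ≈ P(tag | prev) and emit[(tag, word)] ≈ P(word | tag); under the HMM factorization introduced later in these slides, their product is proportional to P(tag | word, prev):

def greedy_tag(words, tags, trans, emit, start="<s>"):
    prev = start
    result = []
    for w in words:
        # Local decision: best tag given only this word and the previous tag,
        # scored as P(t | prev) * P(w | t), proportional to P(t | w, prev).
        best = max(tags, key=lambda t: trans.get((prev, t), 0.0) * emit.get((t, w), 0.0))
        result.append(best)
        prev = best
    return result

Each decision is locally optimal, but an early mistake can never be revised, which is exactly why the globally optimal tag sequence may differ.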
Hidden Markov Models
▶ Markov models are probabilistic sequence models used for problems such as:
  1. Speech recognition
  2. Spell checking
  3. Part-of-speech tagging
  4. Named entity recognition
▶ A Markov model traverses a sequence of states emitting signals
▶ If the signal sequence does not uniquely determine the state sequence, the model is said to be hidden
▶ Markov models have efficient algorithms for optimization
Hidden Markov Models
▶ A Markov model consists of five elements (made concrete in the sketch below):
  1. A finite set of states Ω = {s_1, ..., s_k}
  2. A finite signal alphabet Σ = {σ_1, ..., σ_m}
  3. Initial probabilities P(s) (for every s ∈ Ω) defining the probability of starting in state s
  4. Transition probabilities P(s_i | s_j) (for every (s_i, s_j) ∈ Ω²) defining the probability of going from state s_j to state s_i
  5. Emission probabilities P(σ | s) (for every (σ, s) ∈ Σ × Ω) defining the probability of emitting symbol σ in state s
▶ We can get rid of initial probabilities by adding a start state
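To make the five elements concrete, here is a toy two-state tagging model written out as plain Python dictionaries (all names and numbers are invented for illustration only):

states = {"NOUN", "VERB"}                                  # Ω
alphabet = {"flies", "like"}                               # Σ
init = {"NOUN": 0.6, "VERB": 0.4}                          # P(s)
trans = {("NOUN", "NOUN"): 0.3, ("NOUN", "VERB"): 0.7,     # P(s_i | s_j), keyed (from, to)
         ("VERB", "NOUN"): 0.8, ("VERB", "VERB"): 0.2}
emit = {("NOUN", "flies"): 0.5, ("NOUN", "like"): 0.5,     # P(σ | s), keyed (state, symbol)
        ("VERB", "flies"): 0.4, ("VERB", "like"): 0.6}

Replacing init by transitions from a dedicated start state "<s>" (i.e. trans[("<s>", s)] = init[s]) is exactly the simplification mentioned in the last bullet.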
Hidden Markov Models
▶ A Markov model consists of four elements:
  1. A finite set of states Ω = {s_0, s_1, ..., s_k}
  2. A finite signal alphabet Σ = {σ_1, ..., σ_m}
  3. Transition probabilities P(s_i | s_j) (for every (s_i, s_j) ∈ Ω²) defining the probability of going from state s_j to state s_i
  4. Emission probabilities P(σ | s) (for every (σ, s) ∈ Σ × Ω) defining the probability of emitting symbol σ in state s
▶ We can get rid of initial probabilities by adding a start state
Markov Assumptions
▶ State transitions are assumed to be independent of everything except the current state:
    P(s_1, ..., s_n) = ∏_{i=1}^{n} P(s_i | s_{i-1})
▶ Signal emissions are assumed to be independent of everything except the current state:
    P(s_1, ..., s_n, σ_1, ..., σ_n) = P(s_1, ..., s_n) ∏_{i=1}^{n} P(σ_i | s_i)
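Putting the two assumptions together, the joint probability of a state sequence and a signal sequence is a simple product. A minimal sketch, reusing the hypothetical trans/emit tables from above with a start state "<s>" standing in for s_0:

def joint_prob(state_seq, signal_seq, trans, emit, start="<s>"):
    p = 1.0
    prev = start
    for s, sigma in zip(state_seq, signal_seq):
        p *= trans[(prev, s)]   # P(s_i | s_{i-1})
        p *= emit[(s, sigma)]   # P(σ_i | s_i)
        prev = s
    return p

For example, joint_prob(["NOUN", "VERB"], ["flies", "like"], trans, emit) multiplies the four factors P(NOUN|<s>) · P(flies|NOUN) · P(VERB|NOUN) · P(like|VERB), assuming trans also holds the start-state transitions.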
Tagging with an HMM
▶ Modeling assumptions:
  1. States represent tags (or tag sequences)
  2. Signals represent words
▶ Probability model (first-order):
    P(w_1, ..., w_n, t_1, ..., t_n) = ∏_{i=1}^{n} P(w_i | t_i) P(t_i | t_{i-1})
▶ An n-th order model conditions tags on n preceding tags
▶ Ambiguity:
  ▶ The same word (signal) generated by different tags (states)
  ▶ Language is ambiguous ⇔ Markov model is hidden
A Simple HMM
A Versatile Model
Problem                         State      Signal
Part-of-speech tagging          Tag        Word
Speech recognition              Phoneme    Sound
Optical character recognition   Character  Pixels
Predictive text entry           Character  Keystroke
...                             ...        ...
Problems for HMMs
▶ Decoding = finding the optimal state sequence
    argmax_{s_1,...,s_n} P(s_1, ..., s_n, σ_1, ..., σ_n)
▶ Learning = estimating the model parameters
    P̂(s_i | s_j) for all s_i, s_j        P̂(σ | s) for all s, σ
Quiz
▶ In a simple substitution cipher, every letter in the alphabet is replaced by exactly one symbol and that symbol is never used to represent another letter. For example, mummy would be written abaac in a cipher where m → a; u → b; and y → c.
▶ In a homophonic substitution cipher, more than one symbol may be used to replace the same letter but never to represent another letter. For example, mummy could be written abcde in a cipher where m → a, c, or d; u → b; and y → e.
▶ Suppose we use HMMs to relate sequences of plaintext characters and sequences of cipher characters. Which of the following HMMs are hidden?

      Cipher        State       Signal
  1   Simple        Plaintext   Cipher
  2   Simple        Cipher      Plaintext
  3   Homophonic    Plaintext   Cipher
  4   Homophonic    Cipher      Plaintext
Natural Language Processing
HMM Decoding
Joakim Nivre
Uppsala University
Department of Linguistics and Philology
Decoding
▶ Decoding = finding the optimal state sequence
    argmax_{s_1,...,s_n} P(s_1, ..., s_n, σ_1, ..., σ_n)
▶ Difficulty:
  ▶ Maximizing over all possible state sequences
  ▶ The number |Ω|^n of sequences grows exponentially
▶ Key observation for dynamic programming:
  ▶ Solution of size n contains solution of size n-1
    argmax_{s_1,...,s_n} P(s_1, ..., s_n, σ_1, ..., σ_n) =
    argmax_{s_n} argmax_{s_1,...,s_{n-1}} P(s_1, ..., s_{n-1}, σ_1, ..., σ_{n-1}) P(s_n | s_{n-1}) P(σ_n | s_n)
The Viterbi Algorithm
▶ Best probability and preceding state at step i and state s:
    viterbi(i, s) ≜ max_{s'} P(σ_1, ..., σ_i, s_{i-1} = s', s_i = s)
    ptr(i, s) ≜ argmax_{s'} P(σ_1, ..., σ_i, s_{i-1} = s', s_i = s)
▶ Compute for all i and s recursively:
    viterbi(1, s) = P(s | s_0) P(σ_1 | s)
    ptr(1, s) = s_0
    viterbi(i, s) = max_{s'} viterbi(i-1, s') P(s | s') P(σ_i | s)
    ptr(i, s) = argmax_{s'} viterbi(i-1, s') P(s | s') P(σ_i | s)
▶ Find best end state and follow pointers backwards:
    max_s viterbi(n, s) = max_{s_1,...,s_n} P(σ_1, ..., σ_n, s_1, ..., s_n)
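A sketch of this recursion in Python, under hypothetical conventions: trans[(prev, cur)] = P(cur | prev) with "<s>" playing the role of the start state s_0, and emit[(state, symbol)] = P(symbol | state):

def viterbi_decode(signals, states, trans, emit, start="<s>"):
    n = len(signals)
    V = [{} for _ in range(n)]    # V[i][s] = viterbi(i+1, s)
    ptr = [{} for _ in range(n)]  # ptr[i][s] = best predecessor of s at position i+1
    for s in states:              # initialization: viterbi(1, s) = P(s|s_0) P(σ_1|s)
        V[0][s] = trans.get((start, s), 0.0) * emit.get((s, signals[0]), 0.0)
        ptr[0][s] = start
    for i in range(1, n):         # recursion over positions 2..n
        for s in states:
            best_prev = max(states, key=lambda sp: V[i-1][sp] * trans.get((sp, s), 0.0))
            V[i][s] = (V[i-1][best_prev] * trans.get((best_prev, s), 0.0)
                       * emit.get((s, signals[i]), 0.0))
            ptr[i][s] = best_prev
    last = max(states, key=lambda s: V[n-1][s])   # best end state
    path = [last]
    for i in range(n - 1, 0, -1):                 # follow back-pointers
        path.append(ptr[i][path[-1]])
    return list(reversed(path)), V[n-1][last]

The table V takes O(n·|Ω|) space and the loops O(n·|Ω|²) time, which is why global optimization over all |Ω|^n state sequences remains tractable.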
Viterbi Example
Quiz
▶ Consider the following Viterbi computation:
▶ What is viterbi(4, N)?
  1. 2.6 · 10^-9
  2. 4.3 · 10^-6
  3. 0
▶ What is ptr(4, N)?
  1. V
  2. N
  3. ART
Natural Language Processing
HMM Learning
Joakim Nivre
Uppsala University
Department of Linguistics and Philology
Learning
▶ Learning = estimating the model parameters
    P̂(s_i | s_j) for all s_i, s_j        P̂(σ | s) for all s, σ
▶ Supervised learning:
  ▶ Given a labeled training corpus, we can estimate parameters using (smoothed) relative frequencies (sketched below)
▶ Weakly supervised learning:
  ▶ Given a dictionary and an unlabeled training corpus, we can use Expectation-Maximization to estimate parameters
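A minimal sketch of the supervised case with unsmoothed relative frequencies; each sentence is assumed to be a list of (word, tag) pairs, and "<s>" is a hypothetical start tag standing in for s_0:

from collections import Counter

def estimate(tagged_sentences, start="<s>"):
    tag_count, trans_count, emit_count = Counter(), Counter(), Counter()
    for sent in tagged_sentences:
        prev = start
        tag_count[prev] += 1
        for word, tag in sent:
            trans_count[(prev, tag)] += 1   # count of tag following prev
            emit_count[(tag, word)] += 1    # count of word emitted by tag
            tag_count[tag] += 1
            prev = tag
    trans = {pair: c / tag_count[pair[0]] for pair, c in trans_count.items()}  # P̂(t_i | t_{i-1})
    emit = {pair: c / tag_count[pair[0]] for pair, c in emit_count.items()}    # P̂(w | t)
    return trans, emit

In practice the counts would be smoothed, as the slide notes, so that unseen word–tag and tag–tag pairs do not receive zero probability.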
Maximum Likelihood Estimation
▶ With labeled data D = {(x_1, y_1), ..., (x_n, y_n)}, we maximize the joint likelihood of input and output:
    argmax_θ ∏_{i=1}^{n} P_θ(y_i) P_θ(x_i | y_i)
▶ With unlabeled data D = {x_1, ..., x_n}, we can instead maximize the marginal likelihood of the input:
    argmax_θ ∏_{i=1}^{n} ∑_{y ∈ Ω_Y} P_θ(y) P_θ(x_i | y)
▶ In this case, Y is a hidden variable that is marginalized out
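The marginal likelihood simply sums the joint over all possible hidden labels for each input; a tiny sketch with hypothetical tables prior[y] = P_θ(y) and like[(y, x)] = P_θ(x | y):

def marginal_likelihood(data, classes, prior, like):
    # ∏_i ∑_y P_θ(y) P_θ(x_i | y): a product of sums, with y marginalized out
    p = 1.0
    for x in data:
        p *= sum(prior[y] * like[(y, x)] for y in classes)
    return p

Maximizing this quantity over θ generally has no closed-form solution, which motivates EM on the following slides.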
Expectation-Maximization
▶ Goal:
    argmax_θ ∏_{i=1}^{n} ∑_{y ∈ Ω_Y} P_θ(y) P_θ(x_i | y)
▶ Problem:
  ▶ Maximizing a product of sums is mathematically hard
  ▶ In most cases, there is no closed-form solution
▶ Solution:
  ▶ Use numerical approximation methods
  ▶ Most common approach: Expectation-Maximization (EM)
Expectation-Maximization
▶ Basic observations:
  ▶ If we know f(x, y), we can find the MLE θ.
  ▶ If we know f(x) and the MLE θ, we can derive f(x, y).
  ▶ If P_θ(y | x) = p and f(x) = n, then f(x, y) = np.
▶ Basic idea behind EM:
  1. Start by guessing θ
  2. Compute expected counts E[f(x, y)] = P_θ(y | x) f(x)
  3. Find MLE θ given expectation E[f(x, y)]
  4. Repeat steps 2 and 3 until convergence
EM for Hidden Markov Models
▶ Computing expectations:
    E[f(s, s')] = ∑_{i=1}^{n-1} P(σ_1, ..., σ_n, s_i = s, s_{i+1} = s')
    E[f(s, σ)] = ∑_{i=1}^{n} P(σ_1, ..., σ_{i-1}, σ_i = σ, ..., σ_n, s_i = s)
▶ Difficulty:
  ▶ Summing over all possible state sequences
  ▶ The number |Ω|^n of sequences grows exponentially
▶ Sounds familiar?
  ▶ We can use dynamic programming again
  ▶ The forward-backward algorithm
The Forward-Backward Algorithm
▶ Probability of signal sequence (Forward):
    α(i, s) ≜ P(σ_1, ..., σ_i, s_i = s)
    α(1, s) = P(s | s_0) P(σ_1 | s)
    α(i, s) = ∑_{s'} α(i-1, s') P(s | s') P(σ_i | s)
    ∑_s α(n, s) = P(σ_1, ..., σ_n)
▶ Probability of signal sequence (Backward):
    β(i, s) ≜ P(σ_{i+1}, ..., σ_n | s_i = s)
    β(n, s) = 1
    β(i, s) = ∑_{s'} β(i+1, s') P(s' | s) P(σ_{i+1} | s')
    ∑_s β(1, s) P(s | s_0) P(σ_1 | s) = P(σ_1, ..., σ_n)
▶ Expected counts of hidden states (Forward-Backward):
    E[C(s, s')] = ∑_{i=1}^{n} α(i, s) P(s' | s) P(σ_{i+1} | s') β(i+1, s')
    E[C(s, σ)] = ∑_{i: σ_i = σ} α(i, s) P(σ_i | s) β(i, s)
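A sketch of the forward and backward recursions, using the same hypothetical conventions as the Viterbi sketch (trans[(prev, cur)] = P(cur | prev) with start state "<s>", emit[(state, symbol)] = P(symbol | state)):

def forward(signals, states, trans, emit, start="<s>"):
    n = len(signals)
    a = [{} for _ in range(n)]      # a[i][s] = α(i+1, s)
    for s in states:                # α(1, s) = P(s|s_0) P(σ_1|s)
        a[0][s] = trans.get((start, s), 0.0) * emit.get((s, signals[0]), 0.0)
    for i in range(1, n):           # α(i, s) = ∑_{s'} α(i-1, s') P(s|s') P(σ_i|s)
        for s in states:
            a[i][s] = emit.get((s, signals[i]), 0.0) * sum(
                a[i-1][sp] * trans.get((sp, s), 0.0) for sp in states)
    return a

def backward(signals, states, trans, emit):
    n = len(signals)
    b = [{} for _ in range(n)]      # b[i][s] = β(i+1, s)
    for s in states:                # β(n, s) = 1
        b[n-1][s] = 1.0
    for i in range(n - 2, -1, -1):  # β(i, s) = ∑_{s'} β(i+1, s') P(s'|s) P(σ_{i+1}|s')
        for s in states:
            b[i][s] = sum(b[i+1][sp] * trans.get((s, sp), 0.0)
                          * emit.get((sp, signals[i+1]), 0.0) for sp in states)
    return b

Summing the final forward values (or combining the first backward values with the start-state transitions and first emissions) recovers P(σ_1, ..., σ_n), matching the two identities above; the α·β products then supply the expected counts needed for EM. Both recursions run in O(n·|Ω|²) time.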
Quiz
▶ Assume we have the following tagged corpus:
    a/DET can/NOUN can/AUX trash/VERB a/DET trash/NOUN can/NOUN
▶ What is the MLE of P(AUX | NOUN)?
  1. 0.0
  2. 0.5
  3. 1.0
▶ What is the MLE of P(trash | NOUN)?
  1. 0.33
  2. 0.5
  3. 0.67