CS774. Markov Random Field: Theory and Application, Lecture 19 (Kyomin Jung, KAIST, Nov 12, 2009)
TRANSCRIPT
CS774. Markov Random Field : Theory and Application
Lecture 19
Kyomin Jung, KAIST
Nov 12 2009
Sequence Labeling Problem
Many NLP problems can be viewed as sequence labeling.
Each token in a sequence is assigned a label.
Labels of tokens depend on the labels of other tokens in the sequence, particularly their neighbors (not i.i.d.).
Part Of Speech (POS) Tagging
Annotate each word in a sentence with a part-of-speech. This is the lowest level of syntactic analysis.
Useful for subsequent syntactic parsing and word sense disambiguation.

John/PN saw/V the/Det saw/N and/Con decided/V to/Part take/V it/Pro to/Prep the/Det table/N
Bioinformatics
Sequence labeling is also valuable for labeling genetic sequences in genome analysis.

[Figure: the sequence AGCTAACGTTCGATACGGATTACAGCCT with regions labeled exon and intron.]
Sequence Labeling as Classification
Classify each token independently, but use information about the surrounding tokens as input features (sliding window).
John saw the saw and decided to take it to the table.
classifier
PN
Sequence Labeling as Classification
Classify each token independently, but use information about the surrounding tokens as input features (sliding window).
John saw the saw and decided to take it to the table.
classifier
V
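The sliding-window idea above can be sketched as a feature extractor: each token is classified from features of itself and its neighbors. A minimal sketch; the window size and feature names are illustrative assumptions, not from the lecture.

```python
# Sliding-window feature extraction for independent token classification.
# Boundary tokens get sentinel neighbors "<s>" / "</s>" (an assumption).

def window_features(tokens, i, size=1):
    """Features for token i: the token itself plus its neighbors."""
    feats = {"word": tokens[i]}
    for d in range(1, size + 1):
        feats[f"prev-{d}"] = tokens[i - d] if i - d >= 0 else "<s>"
        feats[f"next-{d}"] = tokens[i + d] if i + d < len(tokens) else "</s>"
    return feats

tokens = "John saw the saw".split()
print(window_features(tokens, 1))
# {'word': 'saw', 'prev-1': 'John', 'next-1': 'the'}
```

These feature dictionaries would then be fed, one token at a time, to any standard classifier.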
Probabilistic Sequence Models
Probabilistic sequence models allow integrating uncertainty over multiple, interdependent classifications and collectively determining the most likely global assignment.
Two standard models:
 Hidden Markov Model (HMM)
 Conditional Random Field (CRF)
Hidden Markov Model
A probabilistic generative model for sequences: a finite state machine with probabilistic transitions and probabilistic generation of outputs from states.
Assume an underlying set of states in which the model can be.
Assume probabilistic transitions between states over time.
Assume probabilistic generation of tokens from states.
Sample HMM for POS

[Figure: an HMM over POS states PropNoun (emitting John, Mary, Alice, Jerry, Tom), Noun (cat, dog, car, pen, bed, apple), Det (a, the, that), and Verb (bit, ate, saw, played, hit, gave), plus a stop state; the edges carry transition probabilities such as 0.95, 0.9, 0.8, 0.5, 0.25, 0.1, and 0.05.]
Sample HMM Generation

[Figure, repeated across slides 10 through 18: the same HMM as on the previous slide steps through one generation run, emitting a word from the current state and then transitioning, growing the output "John", "John bit", "John bit the", "John bit the apple", and finally reaching the stop state.]
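The generation run the slides step through can be sketched directly: sample a word from the current state's emission distribution, then sample the next state, until the stop state is reached. The probability tables below are illustrative placeholders, not the exact edge labels of the figure.

```python
import random

# Minimal HMM generation sketch. Distributions are illustrative
# assumptions loosely modeled on the slides' POS HMM.

transitions = {
    "start":    {"PropNoun": 0.5, "Det": 0.5},
    "PropNoun": {"Verb": 1.0},
    "Det":      {"Noun": 1.0},
    "Noun":     {"Verb": 0.5, "stop": 0.5},
    "Verb":     {"Det": 0.5, "stop": 0.5},
}
emissions = {
    "PropNoun": {"John": 0.5, "Mary": 0.5},
    "Det":      {"the": 0.8, "a": 0.2},
    "Noun":     {"apple": 0.5, "dog": 0.5},
    "Verb":     {"bit": 0.5, "saw": 0.5},
}

def sample(dist):
    """Draw one item from a {item: probability} distribution."""
    r, acc = random.random(), 0.0
    for item, p in dist.items():
        acc += p
        if r < acc:
            return item
    return item  # guard against floating-point rounding

def generate():
    state, words = sample(transitions["start"]), []
    while state != "stop":
        words.append(sample(emissions[state]))
        state = sample(transitions[state])
    return " ".join(words)

print(generate())
```

Each run yields a sentence such as "John bit the apple", with probability equal to the product of the transition and emission probabilities along the path taken.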
Three Useful HMM Tasks
Observation likelihood: to classify sequences.
Most likely state sequence: to tag each token in a sequence with a label.
Maximum likelihood training: to train models to fit empirical training data.
Observation Likelihood
Given a sequence of observations, O, and a model with a set of parameters, λ, what is the probability that this observation was generated by this model: P(O | λ)?
This allows an HMM to be used as a language model: a formal probabilistic model of a language that assigns a probability to each string, saying how likely that string was to have been generated by the language.
Useful for two tasks:
 Sequence classification
 Most likely sequence
Sequence Classification
Assume an HMM is available for each category (i.e. language).
What is the most likely category for a given observation sequence, i.e. which category's HMM is most likely to have generated it?
Used in speech recognition to find the most likely word model to have generated a given sound or phoneme sequence.

[Figure: two word HMMs, Austin and Boston, each a candidate source of the phoneme sequence O = "ah s t e n"; classification asks whether P(O | Austin) > P(O | Boston).]
Most Likely State Sequence
Given an observation sequence, O, and a model, λ, what is the most likely state sequence, Q = Q1, Q2, …, QT, that generated this sequence from this model?
Used for sequence labeling.

John gave the dog an apple.
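The most-likely-state-sequence task is standardly solved with the Viterbi algorithm. A minimal sketch; the toy parameters are illustrative assumptions, not from the lecture.

```python
# Viterbi: most likely state sequence Q for observations O under an HMM
# with start probabilities pi, transition matrix A, emission matrix B.

def viterbi(obs, states, pi, A, B):
    # delta[s] = probability of the best path ending in state s;
    # psi records, per step, the best predecessor of each state.
    delta = {s: pi[s] * B[s][obs[0]] for s in states}
    psi = []
    for o in obs[1:]:
        prev = delta
        psi.append({s: max(states, key=lambda r: prev[r] * A[r][s])
                    for s in states})
        delta = {s: prev[psi[-1][s]] * A[psi[-1][s]][s] * B[s][o]
                 for s in states}
    # Backtrack from the best final state.
    path = [max(states, key=lambda s: delta[s])]
    for bp in reversed(psi):
        path.append(bp[path[-1]])
    return list(reversed(path))

states = ["Noun", "Verb"]
pi = {"Noun": 0.7, "Verb": 0.3}
A = {"Noun": {"Noun": 0.3, "Verb": 0.7}, "Verb": {"Noun": 0.6, "Verb": 0.4}}
B = {"Noun": {"saw": 0.2, "John": 0.8}, "Verb": {"saw": 0.9, "John": 0.1}}
print(viterbi(["John", "saw"], states, pi, A, B))  # ['Noun', 'Verb']
```

The returned path labels each token with its state, which is exactly the sequence-labeling use shown on this slide.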
Observation Likelihood: Efficient Solution
Markov assumption: the probability of the current state depends only on the immediately previous state, not on any earlier history (via the transition probability distribution, A).
Forward-Backward Algorithm: uses dynamic programming to exploit this fact and compute the observation likelihood in O(N²T) time (N: number of states, T: length of the sequence).
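The forward pass of that dynamic program can be sketched as follows; each update sums over state pairs, giving the O(N²T) total. The toy parameters are illustrative assumptions, not from the lecture.

```python
# Forward algorithm for the observation likelihood P(O | lambda).

def forward_likelihood(obs, states, pi, A, B):
    # alpha[s] = P(observations so far, current state = s)
    alpha = {s: pi[s] * B[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {s: sum(alpha[r] * A[r][s] for r in states) * B[s][o]
                 for s in states}
    return sum(alpha.values())

states = ["Noun", "Verb"]
pi = {"Noun": 0.7, "Verb": 0.3}
A = {"Noun": {"Noun": 0.3, "Verb": 0.7}, "Verb": {"Noun": 0.6, "Verb": 0.4}}
B = {"Noun": {"saw": 0.2, "John": 0.8}, "Verb": {"saw": 0.9, "John": 0.1}}
print(forward_likelihood(["John", "saw"], states, pi, A, B))
```

Comparing this likelihood across per-category models is exactly the sequence-classification test from the earlier slide (e.g. P(O | Austin) vs. P(O | Boston)).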
Maximum Likelihood Training
Given an observation sequence, O, what set of parameters, λ, for a given model maximizes the probability that this data was generated from this model, P(O | λ)?
Only an unannotated observation sequence (or set of sequences) generated from the model is needed. In this sense, it is unsupervised.
Supervised HMM Training
If training sequences are labeled (tagged) with the underlying state sequences that generated them, then the parameters λ = {A, B, π} can all be estimated directly from counts accumulated from the labeled sequences (with appropriate smoothing).

[Figure: training sequences such as "John ate the apple", "A dog bit Mary", "Mary hit the dog", and "John gave Mary the cat." are fed to supervised HMM training, which estimates a model over the tags Det, Noun, PropNoun, and Verb.]
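The count-based estimation can be sketched directly: relative-frequency estimates of π, A, and B from tagged sentences. Smoothing is omitted for brevity, and the toy corpus is an illustrative assumption.

```python
from collections import Counter, defaultdict

# Supervised HMM training: estimate start (pi), transition (A), and
# emission (B) distributions from (word, tag) sequences by counting.

def train(tagged_sentences):
    start, trans, emit = Counter(), defaultdict(Counter), defaultdict(Counter)
    for sent in tagged_sentences:
        tags = [t for _, t in sent]
        start[tags[0]] += 1                      # pi counts
        for w, t in sent:
            emit[t][w] += 1                      # B counts
        for a, b in zip(tags, tags[1:]):
            trans[a][b] += 1                     # A counts
    norm = lambda c: {k: v / sum(c.values()) for k, v in c.items()}
    return (norm(start),
            {t: norm(c) for t, c in trans.items()},
            {t: norm(c) for t, c in emit.items()})

data = [[("John", "PropNoun"), ("bit", "Verb"), ("the", "Det"), ("dog", "Noun")],
        [("Mary", "PropNoun"), ("saw", "Verb"), ("a", "Det"), ("cat", "Noun")]]
pi, A, B = train(data)
print(pi["PropNoun"], A["Verb"]["Det"], B["Det"]["the"])  # 1.0 1.0 0.5
```

In practice the raw counts would be smoothed before normalizing, as the slide notes.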
Generative vs. Discriminative Sequence Labeling Models
HMMs are generative models and are not directly designed to maximize the performance of sequence labeling; they model the joint distribution P(O, Q).
The Conditional Random Field (CRF) is specifically designed and trained to maximize sequence labeling performance; it models the conditional distribution P(Q | O).
Definition of CRF
An Example of CRF
Sequence Labeling

[Figure: two chain-structured graphical models over labels Y1, Y2, …, YT and observations X1, X2, …, XT: the HMM (generative), with directed edges between consecutive states and from each Yt to Xt, and the linear-chain CRF (discriminative), with undirected edges conditioning the label chain on the observations.]
Conditional Distribution for Linear-Chain CRF

A typical form of CRF for sequence labeling is

$$P(Y \mid X) = \frac{1}{Z(X)} \exp\left( \sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_k f_k(Y_{t-1}, Y_t, X) \right)$$

$$Z(X) = \sum_{Y} \exp\left( \sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_k f_k(Y_{t-1}, Y_t, X) \right)$$
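For a tiny example, this conditional distribution can be computed by brute force, enumerating every label sequence to obtain Z(X). The feature functions and weights below are illustrative assumptions; real CRFs use dynamic programming over the chain rather than enumeration.

```python
import itertools
import math

# Brute-force linear-chain CRF: P(Y|X) proportional to
# exp(sum_t sum_k lambda_k * f_k(Y_{t-1}, Y_t, X, t)).

labels = ["N", "V"]

def features(y_prev, y, x, t):
    # Two hypothetical features: a transition cue and an emission cue.
    return [1.0 if (y_prev, y) == ("N", "V") else 0.0,
            1.0 if (y == "V" and x[t].endswith("ed")) else 0.0]

weights = [0.5, 1.2]  # the lambda_k, chosen arbitrarily

def score(ys, x):
    s = 0.0
    for t in range(len(x)):
        y_prev = ys[t - 1] if t > 0 else "<s>"
        s += sum(w * f for w, f in zip(weights, features(y_prev, ys[t], x, t)))
    return s

def prob(ys, x):
    # Z(X): sum of exp-scores over all label sequences of length len(x).
    z = sum(math.exp(score(c, x))
            for c in itertools.product(labels, repeat=len(x)))
    return math.exp(score(ys, x)) / z

x = ["John", "played"]
best = max(itertools.product(labels, repeat=2), key=lambda ys: prob(ys, x))
print(best, round(prob(best, x), 3))
```

Because enumeration is exponential in T, practical CRF inference replaces the sum over Y with the same kind of dynamic program used for HMMs.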
CRF Experimental Results
Experimental results verify that CRFs have superior accuracy on various sequence labeling tasks:
 Part-of-speech tagging
 Noun phrase chunking
 Named entity recognition
 …
However, CRFs are much slower to train and do not scale as well to large amounts of training data. Training for POS on the full Penn Treebank (~1M words) currently takes "over a week."
CRF Summary
CRF is a discriminative approach to sequence labeling, whereas HMMs are generative.
Discriminative methods are usually more accurate since they are trained for a specific performance task.
CRF also easily allows adding additional token features without making additional independence assumptions.
Training time is increased since a complex optimization procedure is needed to fit supervised training data.