Sequential Data Modeling
Tomoki Toda / Graham Neubig / (Sakriani Sakti)
Augmented Human Communication Laboratory
Graduate School of Information Science
Course Goals
The aim of this course is to learn basic knowledge of sequential data modeling techniques that can be applied to sequential data such as speech signals, biological signals, videos of moving objects, or natural language text. In particular, it will focus on deepening knowledge of methods based on probabilistic models, such as hidden Markov models or linear dynamical systems.
Credits and Grading
• 1 credit course
• Score will be graded by
  • Assignment report in every class
• Prerequisites
  • Fundamental Mathematics for Optimization (最適化数学基礎)
  • Calculus (微分積分学)
  • Basic Data Analysis (データ解析基礎)
Materials
• Textbook
  • There is no textbook for this course.
• Lecture slides
  • A handout will be distributed in each class.
  • PDF slides are available from http://ahclab.naist.jp/lecture/2015/sdm/index.html (internal access only)
• Reference materials
  • C.M. Bishop: Pattern Recognition and Machine Learning, Springer Science + Business Media, LLC, 2006
  • C.M. Bishop (translated by Motoda, Kurita, Higuchi, Matsumoto, and Murata): Pattern Recognition and Machine Learning, Vols. 1 and 2 (Japanese edition), Springer Japan, 2008
Office Hours
• Person in charge: Tomoki Toda
  • Augmented Human Communication Laboratory
  • Office: B713
  • Office hour: by appointment (send an email first)
  • Email: [email protected]
• Other lecturer: Graham Neubig
  • Email: [email protected]
  • Office: B714 (AHC Lab)
TA Members
• Email: [email protected]
• Kyoshiro Sugiyama  [email protected]
• Akiba Miura  [email protected]
• Kohei Mukaihara  [email protected]
Schedule
• 1st slot on every Tuesday 9:20‐10:50 at L2 room
[Calendar: June and July 2015, with the Tuesday class dates (6/09-7/28) marked]
Syllabus
Date  Course description                               Lecturer
6/09  Basics of sequential data modeling 1             Tomoki Toda
6/16  Basics of sequential data modeling 2             Tomoki Toda
6/23  Discrete latent variable models 1                Tomoki Toda
6/30  Discrete latent variable models 2                Tomoki Toda
7/07  Discriminative models for sequential labeling 1  Graham Neubig
7/14  Discriminative models for sequential labeling 2  Graham Neubig
7/21  Continuous latent variable models 1              Tomoki Toda
7/28  Continuous latent variable models 2              Tomoki Toda
1st and 2nd Classes (6/09 and 6/16)
• Lecturer: Tomoki Toda
• Contents: Basics of sequential data modeling
  • Markov process
  • Latent variables
  • Mixture models
  • Expectation‐maximization (EM) algorithm
[Diagram: latent variables z1-z4 with corresponding observations x1-x4]
3rd and 4th Classes (6/23 and 6/30)
• Lecturer: Tomoki Toda
• Contents: Discrete latent variable models
  • Hidden Markov models
  • Forward‐backward algorithm
  • Viterbi algorithm
  • Training algorithm
[Diagram: hidden Markov model with states 1-3 modeling a phoneme /s/]
5th and 6th Classes (7/07 and 7/14)
• Lecturer: Graham Neubig
• Contents: Discriminative models for sequential labeling
  • Structured perceptron
  • Conditional random fields
  • Training algorithm
Example: 「今日は晴れだ。」 ("It is sunny today.") with two candidate segmentations:
  今日/は/晴れ/だ/。
  今/日/は/晴れ/だ/。
7th and 8th Classes (7/21 and 7/28)
• Lecturer: Tomoki Toda
• Contents: Continuous latent variable models
  • Factor analysis
  • Linear dynamical systems
  • Prediction and update
  • (Training algorithm)
[Diagram: continuous latent variables z1-zN with observations x1-xN]
Sequential Data Modeling
1st class: "Basics of sequential data modeling 1"
Tomoki Toda
Augmented Human Communication Laboratory
Graduate School of Information Science
Sequential Data
• Data examples
  • Time series (speech, actions, moving objects, exchange rates, …)
  • Character strings (word sequences, symbol strings, …)
• Various lengths of data
  • E.g.,
    Data sample 1 (length = 5): { 1, 0, 1, 1, 0 }
    Data sample 2 (length = 8): { 1, 1, 1, 0, 1, 1, 0, 0 }
    Data sample 3 (length = 3): { 0, 0, 1 }
    Data sample 4 (length = 6): { 0, 1, 0, 1, 1, 0 }
• Probabilistic approach to modeling sequential data
  • Consistent framework for the quantification and manipulation of uncertainty
  • Effective for dealing with real data
How to Represent Sequential Data?
• A sequential data sample is represented in a high‐dimensional space ("# of dimensions" = "length of the sequential data sample").
• Examples of sequential data:
  • Length = 2: (x1, x2), represented by a 2‐dimensional vector
  • Length = 3: (x1, x2, x3), represented by a 3‐dimensional vector
  • Length = N: (x1, x2, …, xN), represented by an N‐dimensional vector
• We need to model probability distributions in these high‐dimensional spaces!
Rules of Probability (1)
• Assume two random variables, X and Y
  • X : x1 = "Bread", x2 = "Rice", or x3 = "Noodle"
  • Y : y1 = "Home" or y2 = "Restaurant"
• Assume the following data samples { X, Y }:
  {Bread, Home}, {Rice, Restaurant}, {Noodle, Home}, {Bread, Restaurant}, {Rice, Restaurant}, {Noodle, Home}, {Bread, Home}, {Rice, Home}
• Make the following table showing the number of samples:

  Number of samples   Bread  Rice  Noodle
  Home                  2      1     2
  Restaurant            1      2     0

  (e.g., the "2" in the Noodle column of the Home row is the # of samples of {Noodle, Home})
Rules of Probability (2)
• Let x_i be the i‐th value of a random variable X (i = 1, …, M) and y_j be the j‐th value of a random variable Y (j = 1, …, L).
• Let n_ij be the # of samples in the corresponding cell, c_i the # of samples in the i‐th column, r_j the # of samples in the j‐th row, and N the total # of samples.

• Joint probability:        p(X = x_i, Y = y_j) = n_ij / N
• Marginal probability:     p(X = x_i) = c_i / N
• Sum rule of probability:  p(X = x_i) = Σ_{j=1}^{L} p(X = x_i, Y = y_j)   (since c_i = Σ_j n_ij)
• Conditional probability:  p(Y = y_j | X = x_i) = n_ij / c_i
• Product rule of probability:  p(X = x_i, Y = y_j) = p(Y = y_j | X = x_i) p(X = x_i)
Rules of Probability (3)
• The rules of probability
  • Sum rule:     p(X) = Σ_Y p(X, Y)
  • Product rule: p(X, Y) = p(Y | X) p(X)
• Bayes' theorem:  p(Y | X) = p(X | Y) p(Y) / p(X),  where p(X) = Σ_Y p(X | Y) p(Y)
Probability Densities
• Probabilities with respect to continuous variables
• Probability density over a real‐valued variable x : p(x)
  • Probability that x will lie in an interval (a, b) :  p(x ∈ (a, b)) = ∫_a^b p(x) dx
  • Conditions to be satisfied :  p(x) ≥ 0  and  ∫_{-∞}^{∞} p(x) dx = 1
• Cumulative distribution function: P(z)
  • Probability that x lies in the interval (-∞, z) :  P(z) = ∫_{-∞}^z p(x) dx
How to Model Joint Probability?
• Length of sequential data (# of data points over a sequence) varies…
  i.e., # of dimensions of the joint probability distribution also varies…
• The joint probability distribution can be represented with conditional probability distributions of individual data points!
  i.e., # of distributions varies, but # of dimensions of each distribution is fixed:
  p(x_1, …, x_N) = p(x_1) p(x_2 | x_1) p(x_3 | x_1, x_2) … p(x_N | x_1, …, x_{N-1})
                 = p(x_1) Π_{n=2}^{N} p(x_n | x_1, …, x_{n-1})
• However, the conditional probability distribution of a present data point given all past data points needs to be modeled…
How can we effectively model the joint probability distribution of sequential data?
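The chain‐rule factorization above can be sketched numerically. A minimal sketch, where the binary sequence and all probability values are invented for illustration, and each conditional is backed by a dictionary keyed on the full history:

```python
# Chain rule: p(x1..xN) = p(x1) * prod_n p(x_n | x_1..x_{n-1})
# Hypothetical conditional distributions, keyed on the history tuple.
cond = {
    (): {0: 0.4, 1: 0.6},        # p(x1)
    (1,): {0: 0.3, 1: 0.7},      # p(x2 | x1=1)
    (1, 1): {0: 0.5, 1: 0.5},    # p(x3 | x1=1, x2=1)
}

def joint_prob(seq, cond):
    """Multiply the conditional probabilities along the sequence."""
    p = 1.0
    for n, x in enumerate(seq):
        p *= cond[tuple(seq[:n])][x]
    return p

print(joint_prob([1, 1, 0], cond))  # 0.6 * 0.7 * 0.5 = 0.21
```

Note that the table needs one entry per possible history, which is exactly the explosion the Markov assumption below avoids.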
Two Basic Approaches
• Markov process
[Diagram: chain x1 → x2 → x3 → x4]
• Latent variables
[Diagram: latent chain z1 → z2 → z3 → z4 with observations x1, x2, x3, x4]
Markov Process
• Assume that the conditional probability distribution of the present state depends on only a few past states
  e.g., it depends on only one past state:
  • 1st order Markov chain:  p(x_1, …, x_N) = p(x_1) Π_{n=2}^{N} p(x_n | x_{n-1})
  • 2nd order Markov chain:  p(x_1, …, x_N) = p(x_1) p(x_2 | x_1) Π_{n=3}^{N} p(x_n | x_{n-2}, x_{n-1})
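A first‐order Markov chain replaces each full‐history conditional with p(x_n | x_{n-1}), so a single transition table suffices. A minimal sketch, with the two states and all transition values invented for illustration:

```python
# Hypothetical 1st-order Markov model over states "A" and "B".
initial = {"A": 0.5, "B": 0.5}
trans = {"A": {"A": 0.9, "B": 0.1},
         "B": {"A": 0.4, "B": 0.6}}

def markov_joint(seq):
    """p(x1..xN) = p(x1) * prod_{n>=2} p(x_n | x_{n-1})."""
    p = initial[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= trans[prev][cur]
    return p

print(markov_joint(["A", "A", "B"]))  # 0.5 * 0.9 * 0.1 = 0.045
```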
Example of 1st Order Markov Process
• How many probability distributions are needed if we model English text using the 1st order Markov process?
  If only using 27 characters including "space":
  P("This sentence is represented by this …")
  = P(T) P(h|T) P(i|h) P(s|i) P(‐|s) P(s|‐) P(e|s) P(n|e) P(t|n) P(e|t) …
[Figure: P(x), P(x, y), and P(y|x) for x = 1st letter and y = 2nd letter; probability is shown by the areas of white squares.]
David J.C. MacKay, "Information Theory, Inference, and Learning Algorithms," Cambridge University Press, pp. 22-24
State Transition Diagram/Matrix
• Without explicit initial and final states (weather example: 晴 = sunny, 曇 = cloudy, 雨 = rainy)
[State transition diagram over the three states]
  State transition matrix A (rows: past state, columns: present state):

        晴    曇    雨
  晴    0.6   0.3   0.1   ← e.g., p(雨 | 晴) = 0.1
  曇    0.3   0.3   0.4
  雨    0.1   0.5   0.4

  Each row sums to one.
• With explicit initial state /s/ and final state /e/ (起 = wake up, 寝 = sleep)
[State transition diagram and matrix including /s/ and /e/; the matrix gains a row and column for each of /s/ and /e/, the /s/ column is all zeros (nothing transitions back to the initial state), the /e/ row is (0, 0, 0, 1) (the final state is absorbing), and each row still sums to one.]
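The weather transition matrix above can be written directly as nested dictionaries. A small sketch that checks the row‐sum property and scores a state sequence (transition values taken from the slide, state names translated to English):

```python
# Transition matrix: rows are the past state, columns the present state.
A = {"sunny":  {"sunny": 0.6, "cloudy": 0.3, "rainy": 0.1},
     "cloudy": {"sunny": 0.3, "cloudy": 0.3, "rainy": 0.4},
     "rainy":  {"sunny": 0.1, "cloudy": 0.5, "rainy": 0.4}}

# Every row of a state transition matrix must sum to one.
for past, row in A.items():
    assert abs(sum(row.values()) - 1.0) < 1e-9

def seq_prob(seq, start_prob=1.0):
    """Probability of a state sequence, given the probability of its first state."""
    p = start_prob
    for past, present in zip(seq, seq[1:]):
        p *= A[past][present]
    return p

print(seq_prob(["sunny", "sunny", "rainy"]))  # 0.6 * 0.1 = 0.06
```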
Example of Applications
• Language model (modeling discrete variables)
  • Modeling a word (morpheme) sequence with a Markov model (n‐gram)
    e.g., 「学校に行く」 ("go to school"): decompose into morphemes → /s/, 学校, に, 行, く, /e/
    Modeling with a bi‐gram (2‐gram):
    p(学校に行く) = p(学校 | /s/) p(に | 学校) p(行 | に) p(く | 行) p(/e/ | く)
• Autoregressive model (modeling continuous variables)
  • Predicting the present data point from the past M data points as a linear combination of the past M data points:
    x_n = Σ_{m=1}^{M} a_m x_{n-m} + ε_n
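A bi‐gram language model of this kind is just a table of p(w_n | w_{n-1}). A minimal sketch for the sentence above, where all bigram probability values are invented for illustration:

```python
# Hypothetical bigram probabilities for 「学校に行く」 with explicit /s/ and /e/.
bigram = {("/s/", "学校"): 0.2, ("学校", "に"): 0.5,
          ("に", "行"): 0.1, ("行", "く"): 0.8, ("く", "/e/"): 0.4}

def sentence_prob(words):
    """p(w1..wN) under a bigram model with /s/ and /e/ markers."""
    seq = ["/s/"] + words + ["/e/"]
    p = 1.0
    for prev, cur in zip(seq, seq[1:]):
        p *= bigram[(prev, cur)]
    return p

print(sentence_prob(["学校", "に", "行", "く"]))  # 0.2*0.5*0.1*0.8*0.4 = 0.0032
```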
Model Training (Maximum Likelihood Estimation)
• Training of conditional probability distributions from sequential data samples given as training data
• Likelihood function: Π p(w_c | w_p, λ) — a function of the model parameter set λ
• Determine the conditional probability distributions that maximize the (log‐scaled) likelihood function, subject to the constraint that the estimates are normalized as probabilities (Σ_{w_c} p(w_c | w_p, λ) = 1)
• Maximum likelihood estimate:
  p(w_c | w_p) = (# of samples of {w_p, w_c}) / (# of samples of w_p)
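The closed‐form MLE above is just relative‐frequency counting. A sketch that estimates bigram probabilities from a toy corpus (the training sequences are invented for illustration):

```python
from collections import Counter

# Toy training corpus of state sequences.
corpus = [["/s/", "a", "b", "a", "/e/"],
          ["/s/", "a", "a", "/e/"]]

pair_count, prev_count = Counter(), Counter()
for seq in corpus:
    for prev, cur in zip(seq, seq[1:]):
        pair_count[(prev, cur)] += 1   # # of samples of {w_p, w_c}
        prev_count[prev] += 1          # # of samples of w_p

def mle(prev, cur):
    """p(cur | prev) = count(prev, cur) / count(prev)."""
    return pair_count[(prev, cur)] / prev_count[prev]

print(mle("a", "b"))  # count(a,b)=1, count(a)=4 -> 0.25
```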
Example of MLE
• Log of A‐san's menu → MLE of conditional probabilities
[Figure: A‐san's menu log (food items shown as pictures) and the maximum likelihood estimates of the conditional probabilities, e.g., P(·|·) values of 1/4, 3/4, 2/4, 1/4, 1/8, 4/8, and 3/8, drawn as a state transition diagram]
Methods for Evaluating Models
• Use of a test data set {w_1, …, w_N} not included in the training data
• Evaluation metrics
  • Likelihood:
    p(w_1, …, w_N | λ) = Π_{n=1}^{N} p(w_n | w_{n-1}, λ)
  • Log‐scaled likelihood:
    log_2 p(w_1, …, w_N | λ) = Σ_{n=1}^{N} log_2 p(w_n | w_{n-1}, λ)
  • Entropy:
    H = -(1/N) log_2 p(w_1, …, w_N | λ) = -(1/N) Σ_{n=1}^{N} log_2 p(w_n | w_{n-1}, λ)
  • Perplexity:
    PP = 2^H — a measure of the effective "branching factor"
    e.g., if a uniform distribution is set for all n‐gram probabilities over M words,
    H = -(1/N) Σ_{n=1}^{N} log_2 (1/M) = log_2 M,  and therefore  PP = M
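Entropy and perplexity as defined above can be computed in a few lines. A sketch using invented per‐token conditional probabilities on a hypothetical test set:

```python
import math

# Hypothetical per-token conditional probabilities p(w_n | w_{n-1}) on a test set.
token_probs = [0.5, 0.25, 0.125, 0.5]

N = len(token_probs)
log_likelihood = sum(math.log2(p) for p in token_probs)  # log2 p(w_1..w_N)
H = -log_likelihood / N                                  # entropy (bits per token)
PP = 2 ** H                                              # perplexity

print(H, PP)  # H = (1+2+3+1)/4 = 1.75, PP = 2**1.75
```

As a sanity check, feeding in N uniform probabilities 1/M gives H = log2(M) and PP = M, matching the branching-factor example on the slide.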
Classification/Inference/Generation
• Classification: sequential data + multiple models → model selection based on maximum posterior probability → classification result
• Inference (prediction): sequential data + model → calculation of conditional probability → inference (data or probability)
• Generation: model → data generation based on the joint probability distribution → data
Classification w/ Maximum A Posteriori
• Select a model that maximizes the posterior probability
  (posterior probability ∝ likelihood function × prior probability; the denominator p(x_1, …, x_N) is constant over models):
  î = argmax_i p(λ_i | x_1, …, x_N)
• The model is selected by
  î = argmax_i p(x_1, …, x_N | λ_i) p(λ_i)
• If the prior probability is given by a uniform distribution,
  î = argmax_i p(x_1, …, x_N | λ_i)
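MAP model selection is an argmax over per‐model likelihood × prior. A small sketch, with the likelihoods and priors invented for illustration; note how a non‐uniform prior can flip the decision relative to maximum likelihood:

```python
# Hypothetical model likelihoods p(x | lambda_i) and priors p(lambda_i).
likelihood = {"model1": 0.0112, "model2": 0.0096}
prior = {"model1": 0.4, "model2": 0.6}

# MAP: argmax_i p(x | lambda_i) * p(lambda_i)
map_choice = max(likelihood, key=lambda m: likelihood[m] * prior[m])

# ML (uniform prior): argmax_i p(x | lambda_i)
ml_choice = max(likelihood, key=lambda m: likelihood[m])

print(map_choice, ml_choice)  # model2 model1
```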
Classification
• Comparison of model likelihoods (起: wake up, 寝: sleep)
• Models:
  Model 1: p(起|/s/) = 1; from 起: 起 0.7, 寝 0.1, /e/ 0.2; from 寝: 起 0.8, 寝 0.1, /e/ 0.1
  Model 2: p(起|/s/) = 1; from 起: 起 0.6, 寝 0.2, /e/ 0.2; from 寝: 起 0.4, 寝 0.4, /e/ 0.2
• Observed data: /s/ 起 起 寝 起 /e/
  Q. Which model is this data sample classified to?
• Likelihood:
  p(/s/ 起 起 寝 起 /e/) = p(起|/s/) p(起|起) p(寝|起) p(起|寝) p(/e/|起)
• Trellis: expansion of the state transition graph over the time axis
  Model 1: 1 × 0.7 × 0.1 × 0.8 × 0.2 = 0.0112
  Model 2: 1 × 0.6 × 0.2 × 0.4 × 0.2 = 0.0096
A. Classified to model 1.
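The two trellis computations above can be reproduced directly from the transition tables (probability values taken from the slide):

```python
# Transition probabilities of the two models from the slide.
model1 = {("/s/", "起"): 1.0, ("起", "起"): 0.7, ("起", "寝"): 0.1, ("起", "/e/"): 0.2,
          ("寝", "起"): 0.8, ("寝", "寝"): 0.1, ("寝", "/e/"): 0.1}
model2 = {("/s/", "起"): 1.0, ("起", "起"): 0.6, ("起", "寝"): 0.2, ("起", "/e/"): 0.2,
          ("寝", "起"): 0.4, ("寝", "寝"): 0.4, ("寝", "/e/"): 0.2}

def likelihood(model, states):
    """Product of transition probabilities along the observed state sequence."""
    p = 1.0
    for prev, cur in zip(states, states[1:]):
        p *= model[(prev, cur)]
    return p

obs = ["/s/", "起", "起", "寝", "起", "/e/"]
p1 = likelihood(model1, obs)  # 1*0.7*0.1*0.8*0.2 = 0.0112
p2 = likelihood(model2, obs)  # 1*0.6*0.2*0.4*0.2 = 0.0096
print(p1, p2)                 # model 1 has the higher likelihood
```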
Marginalization for Unobserved Data
• Likelihood can be calculated with marginalization even if a part of the data is not observed.
• Observed data: /s/ ? 寝 ? /e/
  Q. Which model is this data sample classified to?
• Likelihood: consider all possible data samples, summing over the unobserved points x_1 and x_3:
  Σ_{x_1 ∈ {起, 寝}} Σ_{x_3 ∈ {起, 寝}} p(x_1 | /s/) p(寝 | x_1) p(x_3 | 寝) p(/e/ | x_3)
  (here p(寝 | /s/) = 0, so only x_1 = 起 contributes)
• Trellis (summing over the unobserved states):
  Model 1: 1 × 0.1 × (0.8 × 0.2 + 0.1 × 0.1) = 0.017
  Model 2: 1 × 0.2 × (0.4 × 0.2 + 0.4 × 0.2) = 0.032
A. Classified to model 2.
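The marginal likelihood can be computed by summing the path probability over every completion of the unobserved points (same transition tables as in the classification example, values from the slide):

```python
from itertools import product

model1 = {("/s/", "起"): 1.0, ("/s/", "寝"): 0.0,
          ("起", "起"): 0.7, ("起", "寝"): 0.1, ("起", "/e/"): 0.2,
          ("寝", "起"): 0.8, ("寝", "寝"): 0.1, ("寝", "/e/"): 0.1}
model2 = {("/s/", "起"): 1.0, ("/s/", "寝"): 0.0,
          ("起", "起"): 0.6, ("起", "寝"): 0.2, ("起", "/e/"): 0.2,
          ("寝", "起"): 0.4, ("寝", "寝"): 0.4, ("寝", "/e/"): 0.2}

def marginal_likelihood(model, observed):
    """Sum the path probability over all fillings of None (unobserved) slots."""
    slots = [i for i, s in enumerate(observed) if s is None]
    total = 0.0
    for filling in product(["起", "寝"], repeat=len(slots)):
        seq = list(observed)
        for i, s in zip(slots, filling):
            seq[i] = s
        p = 1.0
        for prev, cur in zip(seq, seq[1:]):
            p *= model.get((prev, cur), 0.0)
        total += p
    return total

obs = ["/s/", None, "寝", None, "/e/"]
print(marginal_likelihood(model1, obs))  # 0.017
print(marginal_likelihood(model2, obs))  # 0.032
```

Enumerating all completions is exponential in the number of unobserved points; the forward algorithm covered in later classes performs this sum efficiently.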
Inference
• Calculation of posterior probability (using Model 1)
• Observed data: /s/ 起 ? 起 /e/
  Q. Which of 起 or 寝 is likely observed at the "?" point?
• Posterior probability:
  p(x_2 | x_1 = 起, x_3 = 起) = p(/s/, 起, x_2, 起, /e/) / Σ_{x_2 ∈ {起, 寝}} p(/s/, 起, x_2, 起, /e/)
  p(/s/, 起, 起, 起, /e/) = 1 × 0.7 × 0.7 × 0.2 = 0.098
  p(/s/, 起, 寝, 起, /e/) = 1 × 0.1 × 0.8 × 0.2 = 0.016
  p(x_2 = 起 | x_1 = 起, x_3 = 起) = 0.098 / (0.098 + 0.016) ≈ 0.860
  p(x_2 = 寝 | x_1 = 起, x_3 = 起) = 0.016 / (0.098 + 0.016) ≈ 0.140
A. "?" is 起 with 86% probability.
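The posterior computation above is the product rule run in reverse: compute the joint probability of each completion, then normalize. A sketch with Model 1's transition probabilities taken from the slide:

```python
# Model 1 transition probabilities from the slide.
model1 = {("/s/", "起"): 1.0, ("起", "起"): 0.7, ("起", "寝"): 0.1, ("起", "/e/"): 0.2,
          ("寝", "起"): 0.8, ("寝", "寝"): 0.1, ("寝", "/e/"): 0.1}

def path_prob(seq):
    p = 1.0
    for prev, cur in zip(seq, seq[1:]):
        p *= model1.get((prev, cur), 0.0)
    return p

# Joint probability of each completion of /s/ 起 ? 起 /e/,
# then normalize to obtain the posterior over the "?" point.
joints = {x2: path_prob(["/s/", "起", x2, "起", "/e/"]) for x2 in ["起", "寝"]}
total = sum(joints.values())
posterior = {x2: p / total for x2, p in joints.items()}

print(posterior)  # 起: 0.098/0.114, 寝: 0.016/0.114
```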
Generation
• Random generation of data samples from the model (Model 1)
  1. Sample x_1 from p(x | /s/)   e.g., 起
  2. Sample x_2 from p(x | x_1)   e.g., 寝
  3. Sample x_3 from p(x | x_2)   e.g., 起
  4. Sample x_4 … and so on
• End if the final state /e/ is sampled.
• Various lengths of data are generated, e.g., /s/ 起 寝 起 ・・・ /e/
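The ancestral sampling procedure above is a loop that draws the next state from the current state's transition row until /e/ is drawn (Model 1's probabilities taken from the slide):

```python
import random

model1 = {"/s/": {"起": 1.0},
          "起": {"起": 0.7, "寝": 0.1, "/e/": 0.2},
          "寝": {"起": 0.8, "寝": 0.1, "/e/": 0.1}}

def generate(model, rng=random):
    """Sample states one at a time until the final state /e/ is drawn."""
    seq, state = [], "/s/"
    while state != "/e/":
        row = model[state]
        state = rng.choices(list(row), weights=list(row.values()))[0]
        if state != "/e/":
            seq.append(state)
    return seq

random.seed(0)
print(generate(model1))  # e.g., a sequence of 起/寝 of random length
```

Because the stopping point is itself sampled, repeated calls produce sequences of varying length, exactly as the slide notes.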
Maximum Likelihood Data Generation
• Data generation by maximizing likelihood under the condition that the length of data is given (Model 1), using dynamic programming:
  1. Store the best path at each state
     e.g., at each time step keep, for 起 and 寝, the highest‐probability path reaching it:
     t = 1: 起: 1, 寝: 0
     t = 2: 起: max(1 × 0.7, 0 × 0.8) = 0.7;  寝: max(1 × 0.1, 0 × 0.1) = 0.1
     t = 3: 起: max(0.7 × 0.7, 0.1 × 0.8) = 0.49;  寝: max(0.7 × 0.1, 0.1 × 0.1) = 0.07
     final: /e/: max(0.49 × 0.2, 0.07 × 0.1) = 0.098
  2. Back‐track the best path from the final state
• /s/ 起 起 起 /e/ is generated if setting the data length to 3.
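The dynamic-programming search above can be sketched as a Viterbi-style recursion: keep the best path probability per state, record back-pointers, close with the transition into /e/, and back-track (Model 1's probabilities taken from the slide):

```python
# Model 1 from the slide, split into initial, internal, and final transitions.
init = {"起": 1.0, "寝": 0.0}                                  # p(x1 | /s/)
trans = {"起": {"起": 0.7, "寝": 0.1}, "寝": {"起": 0.8, "寝": 0.1}}
final = {"起": 0.2, "寝": 0.1}                                 # p(/e/ | state)

def best_sequence(length):
    """Most likely state sequence of the given length, with its probability."""
    states = list(init)
    delta, psi = dict(init), []   # delta[s]: best path prob ending in s; psi: back-pointers
    for _ in range(length - 1):
        new_delta, back = {}, {}
        for cur in states:
            prev_best = max(states, key=lambda p: delta[p] * trans[p][cur])
            new_delta[cur] = delta[prev_best] * trans[prev_best][cur]
            back[cur] = prev_best
        delta, psi = new_delta, psi + [back]
    # Close with the transition into the final state /e/, then back-track.
    last = max(states, key=lambda s: delta[s] * final[s])
    best_p = delta[last] * final[last]
    path = [last]
    for back in reversed(psi):
        path.append(back[path[-1]])
    return path[::-1], best_p

path, p = best_sequence(3)
print(path, p)  # ['起', '起', '起'] with probability ≈ 0.098
```

This is the same recursion as the Viterbi algorithm covered in the 3rd and 4th classes, applied here to generation rather than decoding.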