
Sequential Data Modeling

Tomoki Toda, Graham Neubig (Sakriani Sakti)

Augmented Human Communication Laboratory, Graduate School of Information Science

Course Goals
The aim of this course is to learn basic knowledge of sequential data modeling techniques that can be applied to sequential data such as speech signals, biological signals, videos of moving objects, or natural language text. In particular, it will focus on deepening knowledge of methods based on probabilistic models, such as hidden Markov models or linear dynamical systems.

Credits and Grading
• 1 credit course
• Score will be graded by
  • Assignment report in every class
• Prerequisites
  • Fundamental Mathematics for Optimization (最適化数学基礎)
  • Calculus (微分積分学)
  • Basic Data Analysis (データ解析基礎)

Materials
• Textbook
  • There is no textbook for this course.
• Lecture slides
  • Handouts will be distributed in each class.
  • PDF slides are available from http://ahclab.naist.jp/lecture/2015/sdm/index.html (internal access only)
• Reference materials
  • C.M. Bishop: Pattern Recognition and Machine Learning, Springer Science + Business Media, LLC, 2006
  • C.M. Bishop (translated by Motoda, Kurita, Higuchi, Matsumoto, and Murata): パターン認識と機械学習 (Pattern Recognition and Machine Learning), Vols. 1 and 2, Springer Japan, 2008

Office Hours
• Person in charge: Tomoki Toda (Augmented Human Communication Laboratory)
  • Office: B713
  • Office hour: by appointment (send an email first)
  • Email: tomoki@is.naist.jp
• Other lecturer: Graham Neubig
  • Email: [email protected]
  • Office: B714 (AHC Lab)

TA Members
• Kyoshiro Sugiyama: [email protected]
• Akiba Miura: [email protected]
• Kohei Mukaihara: [email protected]

Schedule
• 1st slot on every Tuesday, 9:20-10:50, at room L2
(Figure: June and July 2015 calendars marking the Tuesday class dates listed in the syllabus below.)

Syllabus

Date   Course description                               Lecturer
6/09   Basics of sequential data modeling 1             Tomoki Toda
6/16   Basics of sequential data modeling 2             Tomoki Toda
6/23   Discrete latent variable models 1                Tomoki Toda
6/30   Discrete latent variable models 2                Tomoki Toda
7/07   Discriminative models for sequential labeling 1  Graham Neubig
7/14   Discriminative models for sequential labeling 2  Graham Neubig
7/21   Continuous latent variable models 1              Tomoki Toda
7/28   Continuous latent variable models 2              Tomoki Toda

1st and 2nd Classes (6/09 and 6/16)
• Lecturer: Tomoki Toda
• Contents: Basics of sequential data modeling
  • Markov process
  • Latent variables
  • Mixture models
  • Expectation-maximization (EM) algorithm
(Figure: graphical model with latent variables z1, …, z4 and observations x1, …, x4.)

3rd and 4th Classes (6/23 and 6/30)
• Lecturer: Tomoki Toda
• Contents: Discrete latent variable models
  • Hidden Markov models
  • Forward-backward algorithm
  • Viterbi algorithm
  • Training algorithm
(Figure: HMM state transition diagram with states 1, 2, 3 and an initial state /s/.)

5th and 6th Classes (7/07 and 7/14)
• Lecturer: Graham Neubig
• Contents: Discriminative models for sequential labeling
  • Structured perceptron
  • Conditional random fields
  • Training algorithm
(Example: segmenting the sentence 「今日は晴れだ。」 "It is sunny today." into words, e.g. the candidate segmentations 今日/は/晴れ/だ/。 and 今/日/は/晴れ/だ/。)

7th and 8th Classes (7/21 and 7/28)
• Lecturer: Tomoki Toda
• Contents: Continuous latent variable models
  • Factor analysis
  • Linear dynamical systems
  • Prediction and update
  • (Training algorithm)
(Figure: graphical models with continuous latent variables z1, …, zN and observations x1, …, xN.)

Sequential Data Modeling

1st class: "Basics of sequential data modeling 1"

Tomoki Toda

Augmented Human Communication Laboratory, Graduate School of Information Science

Sequential Data
• Data examples
  • Time series (speech, actions, moving objects, exchange rates, …)
  • Character strings (word sequences, symbol strings, …)
• Various lengths of data, e.g.,
  Data sample 1 (length = 5): { 1, 0, 1, 1, 0 }
  Data sample 2 (length = 8): { 1, 1, 1, 0, 1, 1, 0, 0 }
  Data sample 3 (length = 3): { 0, 0, 1 }
  Data sample 4 (length = 6): { 0, 1, 0, 1, 1, 0 }
• Probabilistic approach to modeling sequential data
  • Consistent framework for the quantification and manipulation of uncertainty
  • Effective for dealing with real data

How to Represent Sequential Data?
• A sequential data sample is represented in a high-dimensional space ("# of dimensions" = "length of the sequential data sample").
• Examples of sequential data: a sequence of length 2 (x1, x2) is represented by a 2-dimensional vector, a sequence of length 3 (x1, x2, x3) by a 3-dimensional vector, and a sequence of length N (x1, …, xN) by an N-dimensional vector.

We need to model probability distributions in these high-dimensional spaces!

Rules of Probability (1)
• Assume two random variables, X and Y
  • X : x1 = "Bread", x2 = "Rice", or x3 = "Noodle"
  • Y : y1 = "Home" or y2 = "Restaurant"
• Assume the following data samples { X, Y }:
  {Bread, Home}, {Rice, Restaurant}, {Noodle, Home}, {Bread, Restaurant},
  {Rice, Restaurant}, {Noodle, Home}, {Rice, Home}, and {Bread, Home}
• Make the following table showing the number of samples:

  Number of samples   Bread   Rice   Noodle
  Home                  2       1      2
  Restaurant            1       2      0

  (e.g., the Noodle/Home cell holds the # of samples of {Noodle, Home}.)

Rules of Probability (2)
• Notation: $n_{ij}$ = # of samples in the corresponding cell where $X = x_i$ and $Y = y_j$; $c_i$ = # of samples in the $i$th column; $r_j$ = # of samples in the $j$th row; $N$ = total # of samples. Here $x_i$ is the $i$th value of a random variable $X$ (where $i = 1, \ldots, M$) and $y_j$ is the $j$th value of a random variable $Y$ (where $j = 1, \ldots, L$).
• Joint probability: $p(X = x_i, Y = y_j) = n_{ij} / N$
• Marginal probability: $p(X = x_i) = c_i / N$
• Sum rule of probability: $p(X = x_i) = \sum_{j=1}^{L} p(X = x_i, Y = y_j)$ (since $c_i = \sum_j n_{ij}$)
• Conditional probability: $p(Y = y_j \mid X = x_i) = n_{ij} / c_i$
• Product rule of probability: $p(X = x_i, Y = y_j) = p(Y = y_j \mid X = x_i)\, p(X = x_i)$
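
To make these rules concrete, here is a minimal Python sketch (my own addition, not part of the original slides) that computes the joint, marginal, and conditional probabilities from the Bread/Rice/Noodle count table on the previous slide and checks the sum and product rules numerically.

```python
import numpy as np

# Count table from the previous slide: rows = Y (Home, Restaurant),
# columns = X (Bread, Rice, Noodle).
counts = np.array([[2, 1, 2],    # Home
                   [1, 2, 0]])   # Restaurant
N = counts.sum()                 # total number of samples

joint = counts / N                           # p(X = x_i, Y = y_j) = n_ij / N
p_x = joint.sum(axis=0)                      # sum rule: p(X = x_i) = c_i / N
p_y = joint.sum(axis=1)                      # p(Y = y_j) = r_j / N
p_y_given_x = counts / counts.sum(axis=0)    # p(Y = y_j | X = x_i) = n_ij / c_i

# Product rule check: p(X, Y) = p(Y | X) p(X)
assert np.allclose(joint, p_y_given_x * p_x)

print("p(X):", dict(zip(["Bread", "Rice", "Noodle"], p_x)))
print("p(Y = Home | X = Bread):", p_y_given_x[0, 0])
```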

Rules of Probability (3)
• The rules of probability
  – Sum rule: $p(X) = \sum_{Y} p(X, Y)$
  – Product rule: $p(X, Y) = p(Y \mid X)\, p(X)$
• Bayes' theorem: $p(Y \mid X) = \dfrac{p(X \mid Y)\, p(Y)}{p(X)}$, where $p(X) = \sum_{Y} p(X \mid Y)\, p(Y)$

Probability Densities
• Probabilities with respect to continuous variables
• Probability density over a real-valued variable x : p(x)
  – Probability that x will lie in an interval (a, b): $p(x \in (a, b)) = \int_a^b p(x)\, dx$
  – Conditions to be satisfied: $p(x) \ge 0$ and $\int_{-\infty}^{\infty} p(x)\, dx = 1$
• Cumulative distribution function P(z)
  – Probability that x lies in the interval $(-\infty, z)$: $P(z) = \int_{-\infty}^{z} p(x)\, dx$
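
As a small numerical illustration (my own sketch, assuming a standard Gaussian as the example density p(x)), the following code approximates the interval probability, the normalization condition, and the cumulative distribution function with a simple Riemann sum.

```python
import numpy as np

def p(x):
    """Standard Gaussian density, used here only as an example of p(x)."""
    return np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

# Grid over a wide interval approximating (-inf, inf)
xs = np.linspace(-10.0, 10.0, 200001)
dx = xs[1] - xs[0]

total = np.sum(p(xs)) * dx                           # should be ~1 (normalization)
prob_ab = np.sum(p(xs[(xs > -1) & (xs < 1)])) * dx   # p(x in (-1, 1))
cdf_at_0 = np.sum(p(xs[xs <= 0.0])) * dx             # P(0) = p(x in (-inf, 0]) ~ 0.5

print(f"integral of p(x) over the real line ~ {total:.4f}")
print(f"p(x in (-1, 1)) ~ {prob_ab:.4f}")            # ~0.6827
print(f"P(z = 0) ~ {cdf_at_0:.4f}")
```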

How to Model Joint Probability?
• Length of sequential data (# of data points over a sequence) varies…
  i.e., # of dimensions of the joint probability distribution also varies…
• The joint probability distribution can be represented with conditional probability distributions of individual data points!
  i.e., # of distributions varies but # of dimensions of each distribution is fixed.
• However, the conditional probability distribution of the present data point given all past data points needs to be modeled…

  $p(x_1, \ldots, x_N) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1, x_2) \cdots p(x_N \mid x_1, \ldots, x_{N-1}) = p(x_1) \prod_{n=2}^{N} p(x_n \mid x_1, \ldots, x_{n-1})$

(Figure: a sequence of data points x1, …, x5.)
How can we effectively model the joint probability distribution of sequential data?

Two Basic Approaches
• Markov process
  (Figure: chain of observations x1 → x2 → x3 → x4.)
• Latent variables
  (Figure: latent chain z1 → z2 → z3 → z4 with observations x1, x2, x3, x4.)

Markov Process
• Assume that the conditional probability distribution of the present state depends only on a few past states
  e.g., it depends on only one past state:
  • 1st order Markov chain: $p(x_1, \ldots, x_N) = p(x_1) \prod_{n=2}^{N} p(x_n \mid x_{n-1})$
  • 2nd order Markov chain: $p(x_1, \ldots, x_N) = p(x_1)\, p(x_2 \mid x_1) \prod_{n=3}^{N} p(x_n \mid x_{n-1}, x_{n-2})$
(Figures: graphical models of the 1st and 2nd order Markov chains over x1, …, x4.)
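
A minimal sketch (not from the slides) of the 1st order Markov factorization: the joint probability becomes an initial probability times a product of pairwise conditionals. The values in p_init and p_trans below are made up for illustration; the sequence is data sample 1 from the earlier "Sequential Data" slide.

```python
def markov_joint(seq, p_init, p_trans):
    """Joint probability under a 1st order Markov chain:
    p(x_1, ..., x_N) = p(x_1) * prod_n p(x_n | x_{n-1})."""
    prob = p_init[seq[0]]
    for prev, cur in zip(seq[:-1], seq[1:]):
        prob *= p_trans[prev][cur]
    return prob

# Toy binary model (illustrative numbers only)
p_init = {0: 0.5, 1: 0.5}
p_trans = {0: {0: 0.8, 1: 0.2},
           1: {0: 0.4, 1: 0.6}}

# Data sample 1 from the "Sequential Data" slide: {1, 0, 1, 1, 0}
print(markov_joint([1, 0, 1, 1, 0], p_init, p_trans))  # 0.5*0.4*0.2*0.6*0.4
```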

Example of 1st Order Markov Process
• How many probability distributions are needed if we model English text using the 1st order Markov process?
  If only using 27 characters including "space",
  P("This sentence is represented by this …") = P(T) P(h|T) P(i|h) P(s|i) P(‐|s) P(s|‐) P(e|s) P(n|e) P(t|n) P(e|t) …
(Figure: diagrams of P(x), P(x, y), and P(y|x), where x is the 1st letter and y is the 2nd letter; probability is shown by the areas of the white squares.)

David J.C. MacKay, “Information Theory, Inference, and Learning Algorithms,” Cambridge University Press, pp. 22-24
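
The following sketch (mine, not from MacKay or the slides) shows the same idea in code: it estimates bigram letter probabilities by counting over a tiny toy string and scores a new string as P(first letter) × Π P(next letter | previous letter). With 27 characters, the full model would need 27 initial probabilities and 27 × 27 conditional probabilities.

```python
from collections import Counter

corpus = "this sentence is represented by this model"   # toy training text
unigrams = Counter(corpus[:-1])                          # counts of the conditioning letter x
bigrams = Counter(zip(corpus, corpus[1:]))               # counts of letter pairs (x, y)

def p_cond(y, x):
    """Maximum likelihood estimate of P(y | x) = count(x, y) / count(x)."""
    return bigrams[(x, y)] / unigrams[x] if unigrams[x] else 0.0

def score(text):
    """P(text) ~ P(first letter) * prod P(next letter | previous letter)."""
    prob = corpus.count(text[0]) / len(corpus)           # crude unigram P(x_1)
    for x, y in zip(text, text[1:]):
        prob *= p_cond(y, x)
    return prob

print(score("this"))
```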

State Transition Diagram/Matrix
• Without explicit initial and final states (weather example: 晴 = sunny, 曇 = cloudy, 雨 = rainy)

  State transition matrix A (rows: past state, columns: present state):
          晴     曇     雨
  晴     0.6    0.3    0.1
  曇     0.3    0.3    0.4
  雨     0.1    0.5    0.4

  e.g., p(雨|晴) = 0.1; the sum of each row becomes one.

• With explicit initial state /s/ and final state /e/ (起 = wake up, 寝 = sleep)

  State transition matrix A (rows: past state, columns: present state):
          /s/    起     寝    /e/
  /s/      0     1      0     0
  起       0    0.7    0.1   0.2
  寝       0    0.6    0.3   0.1
  /e/      0     0      0     1

  The /s/ row gives the initial state transitions and the /e/ column gives the final state transitions.
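
A short sketch (my own, using the weather transition probabilities as reconstructed above) that stores the matrix, checks that each row sums to one, and does a one-step prediction of tomorrow's weather.

```python
import numpy as np

# Transition matrix as reconstructed above (rows: past state, columns: present state)
states = ["sunny", "cloudy", "rainy"]        # 晴, 曇, 雨
A = np.array([[0.6, 0.3, 0.1],
              [0.3, 0.3, 0.4],
              [0.1, 0.5, 0.4]])

# Each row is a conditional distribution, so every row must sum to one
assert np.allclose(A.sum(axis=1), 1.0)

# e.g., p(rainy | sunny) is the (sunny, rainy) entry
print("p(rainy | sunny) =", A[states.index("sunny"), states.index("rainy")])

# One-step prediction: distribution over tomorrow's weather given that today is sunny
today = np.array([1.0, 0.0, 0.0])
print("tomorrow:", dict(zip(states, today @ A)))
```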

Example of Applications
• Language model (modeling discrete variables)
  • Modeling a word (morpheme) sequence with a Markov model (n-gram)
    e.g., 「学校に行く」 ("go to school") is decomposed into the morphemes /s/, 学校, に, 行, く, /e/ and modeled with a bi-gram (2-gram):
    p(学校|/s/) p(に|学校) p(行|に) p(く|行) p(/e/|く)
• Autoregressive model (modeling continuous variables)
  • Predicting the present data point from the past M data points as a linear combination of those M data points (see the fitting sketch below)
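
For the autoregressive case, here is an illustrative sketch (mine, with synthetic data) that predicts the present data point as a linear combination of the past M data points and fits the coefficients by least squares.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic sequential data: a noisy AR(2) process (illustrative only)
N, M = 200, 2
x = np.zeros(N)
for n in range(2, N):
    x[n] = 0.6 * x[n - 1] - 0.3 * x[n - 2] + rng.normal(scale=0.1)

# Build the regression problem: x_n ~ a_1 x_{n-1} + ... + a_M x_{n-M}
X = np.column_stack([x[M - m - 1: N - m - 1] for m in range(M)])
y = x[M:]
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)

print("estimated AR coefficients:", coeffs)   # close to [0.6, -0.3]
```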

Model Training (Maximum Likelihood Estimation)
• Training of conditional probability distributions from sequential data samples given as training data …
• Likelihood function: the probability of the training data computed with the conditional distributions $p(w_c \mid w_p, \lambda)$, viewed as a function of the model parameter set $\lambda$
• Determine the conditional probability distributions that maximize the (log-scaled) likelihood function, subject to the constraint that the estimates are normalized as probabilities
• Maximum likelihood estimate (see the counting sketch below):
  $p(w_c \mid w_p) = \dfrac{\#\text{ of samples } (w_p, w_c)}{\#\text{ of samples } w_p}$
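
A minimal counting sketch (my own; the token sequences are made up) of the maximum likelihood estimate above: count each (previous word, current word) pair and divide by the count of the previous word, which automatically satisfies the normalization constraint.

```python
from collections import Counter, defaultdict

# Made-up training sequences with explicit initial /s/ and final /e/ states
training = [
    ["/s/", "A", "B", "A", "/e/"],
    ["/s/", "A", "A", "/e/"],
    ["/s/", "B", "A", "/e/"],
]

pair_counts = Counter()
prev_counts = Counter()
for seq in training:
    for w_p, w_c in zip(seq[:-1], seq[1:]):
        pair_counts[(w_p, w_c)] += 1
        prev_counts[w_p] += 1

# MLE: p(w_c | w_p) = count(w_p, w_c) / count(w_p)
p = defaultdict(dict)
for (w_p, w_c), n in pair_counts.items():
    p[w_p][w_c] = n / prev_counts[w_p]

for w_p, dist in p.items():
    print(w_p, dist, "sum =", sum(dist.values()))   # each sum is 1.0
```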

Example of MLE
(Figure: the log of A-san's menu, shown as pictures of menu items, and the MLE of the conditional probabilities between consecutive menu items, with values such as 1/4, 3/4, 2/4, 1/8, 4/8, and 3/8 to be filled into the blanks, summarized as a state transition diagram.)

Methods for Evaluating Models
• Use of a test data set {w1, …, wN} not included in the training data
• Evaluation metrics
  • Likelihood: $p(w_1, \ldots, w_N \mid \lambda) = \prod_{n=1}^{N} p(w_n \mid w_{n-1}, \lambda)$
  • Log-scaled likelihood: $\log_2 p(w_1, \ldots, w_N \mid \lambda) = \sum_{n=1}^{N} \log_2 p(w_n \mid w_{n-1}, \lambda)$
  • Entropy: $H = -\frac{1}{N} \log_2 p(w_1, \ldots, w_N \mid \lambda) = -\frac{1}{N} \sum_{n=1}^{N} \log_2 p(w_n \mid w_{n-1}, \lambda)$
  • Perplexity: $PP = 2^{H}$, a measure of the effective "branching factor" (see the sketch below)
    e.g., if a uniform distribution is set for all n-gram probabilities over M words,
    $H = -\frac{1}{N} \sum_{n=1}^{N} \log_2 \frac{1}{M} = \log_2 M$, so $PP = M$
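
The metrics above can be computed directly; the sketch below (mine, with a made-up bigram model and test sequence) evaluates the likelihood, log-scaled likelihood, entropy, and perplexity.

```python
import math

# Made-up bigram model p(w_n | w_{n-1}) and test data
p = {
    "/s/": {"A": 0.7, "B": 0.3},
    "A":   {"A": 0.4, "B": 0.4, "/e/": 0.2},
    "B":   {"A": 0.6, "B": 0.2, "/e/": 0.2},
}
test = ["/s/", "A", "B", "A", "/e/"]

log2_likelihood = 0.0
for w_prev, w in zip(test[:-1], test[1:]):
    log2_likelihood += math.log2(p[w_prev][w])

N = len(test) - 1                       # number of predicted tokens
likelihood = 2.0 ** log2_likelihood
entropy = -log2_likelihood / N          # H = -(1/N) sum log2 p(w_n | w_{n-1})
perplexity = 2.0 ** entropy             # PP = 2^H

print(f"likelihood={likelihood:.4f}, log2-likelihood={log2_likelihood:.4f}")
print(f"entropy={entropy:.4f} bits, perplexity={perplexity:.4f}")
```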

Classification / Inference / Generation
• Classification: sequential data + multiple models → model selection based on maximum posterior probability → classification result
• Inference (prediction): model + sequential data → calculation of conditional probability → inference (data or probability)
• Generation: model → data generation based on the joint probability distribution → data

Classification w/ Maximum A Posteriori
• Select the model that maximizes the posterior probability
  Posterior probability: $p(\lambda_i \mid x_1, \ldots, x_N) = \dfrac{p(x_1, \ldots, x_N \mid \lambda_i)\, p(\lambda_i)}{p(x_1, \ldots, x_N)}$
  (likelihood function × prior probability, divided by a constant)
• The model is selected by
  $\hat{i} = \arg\max_i\, p(\lambda_i \mid x_1, \ldots, x_N) = \arg\max_i\, p(x_1, \ldots, x_N \mid \lambda_i)\, p(\lambda_i)$
• If the prior probability is given by a uniform distribution,
  $\hat{i} = \arg\max_i\, p(x_1, \ldots, x_N \mid \lambda_i)$

Classification
• Comparison of model likelihoods (起: wake up, 寝: sleep)
• Models (state transition diagrams with explicit /s/ and /e/):
  Model 1: p(起|/s/) = 1, p(起|起) = 0.7, p(寝|起) = 0.1, p(/e/|起) = 0.2, p(起|寝) = 0.8, p(寝|寝) = 0.1, p(/e/|寝) = 0.1
  Model 2: p(起|/s/) = 1, p(起|起) = 0.6, p(寝|起) = 0.2, p(/e/|起) = 0.2, p(起|寝) = 0.4, p(寝|寝) = 0.4, p(/e/|寝) = 0.2
• Observed data: /s/ 起 起 寝 起 /e/
  Q. Which model is this data sample classified to?
• Likelihood: p(起|/s/) p(起|起) p(寝|起) p(起|寝) p(/e/|起)
  Model 1: 1 × 0.7 × 0.1 × 0.8 × 0.2 = 0.0112
  Model 2: 1 × 0.6 × 0.2 × 0.4 × 0.2 = 0.0096
• Trellis: expansion of the state transition graph over the time axis
  A. Classified to the model 1.
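
The comparison on this slide can be reproduced with a few lines of code. This sketch (my own; it uses the English labels "wake" and "sleep" for 起 and 寝) scores the observed sequence under both models and reproduces the likelihoods 0.0112 and 0.0096.

```python
MODELS = {
    "Model 1": {"/s/":   {"wake": 1.0},
                "wake":  {"wake": 0.7, "sleep": 0.1, "/e/": 0.2},
                "sleep": {"wake": 0.8, "sleep": 0.1, "/e/": 0.1}},
    "Model 2": {"/s/":   {"wake": 1.0},
                "wake":  {"wake": 0.6, "sleep": 0.2, "/e/": 0.2},
                "sleep": {"wake": 0.4, "sleep": 0.4, "/e/": 0.2}},
}

def likelihood(seq, trans):
    """Product of transition probabilities along the sequence."""
    prob = 1.0
    for prev, cur in zip(seq[:-1], seq[1:]):
        prob *= trans[prev].get(cur, 0.0)
    return prob

# Observed data: /s/ wake wake sleep wake /e/  (起 = wake up, 寝 = sleep)
observed = ["/s/", "wake", "wake", "sleep", "wake", "/e/"]
for name, trans in MODELS.items():
    print(name, likelihood(observed, trans))   # 0.0112 vs 0.0096 -> Model 1
```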

Marginalization for Unobserved Data
• Likelihood calculation with marginalization even if a part of the data is not observed.
• Observed data: /s/ ? 寝 ? /e/
  Q. Which model is this data sample classified to?
• Likelihood: consider all possible data samples for the unobserved points,
  Σ_{x1 ∈ {起, 寝}} Σ_{x3 ∈ {起, 寝}} p(x1|/s/) p(寝|x1) p(x3|寝) p(/e/|x3)
  Model 1: 1 × 0.1 × (0.8 × 0.2 + 0.1 × 0.1) = 0.017
  Model 2: 1 × 0.2 × (0.4 × 0.2 + 0.4 × 0.2) = 0.032
• Trellis: the marginalization sums over all paths through the unobserved states.
  A. Classified to the model 2.
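
A sketch of the marginalization (again my own code, with "wake"/"sleep" standing in for 起/寝): it sums the sequence probability over all possible values of the unobserved points and reproduces 0.017 for Model 1 and 0.032 for Model 2.

```python
from itertools import product

# Model 1 and Model 2 transition probabilities (same as on the previous slides)
MODELS = {
    "Model 1": {"/s/":   {"wake": 1.0},
                "wake":  {"wake": 0.7, "sleep": 0.1, "/e/": 0.2},
                "sleep": {"wake": 0.8, "sleep": 0.1, "/e/": 0.1}},
    "Model 2": {"/s/":   {"wake": 1.0},
                "wake":  {"wake": 0.6, "sleep": 0.2, "/e/": 0.2},
                "sleep": {"wake": 0.4, "sleep": 0.4, "/e/": 0.2}},
}

def seq_prob(seq, trans):
    prob = 1.0
    for prev, cur in zip(seq[:-1], seq[1:]):
        prob *= trans[prev].get(cur, 0.0)
    return prob

# Observed data with missing points: /s/ ? sleep ? /e/
observed = ["/s/", None, "sleep", None, "/e/"]
states = ["wake", "sleep"]
missing = [i for i, w in enumerate(observed) if w is None]

for name, trans in MODELS.items():
    total = 0.0
    # Sum over all possible values of the unobserved data points
    for values in product(states, repeat=len(missing)):
        filled = list(observed)
        for i, v in zip(missing, values):
            filled[i] = v
        total += seq_prob(filled, trans)
    print(name, total)   # 0.017 vs 0.032 -> Model 2
```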

Inference
• Calculation of posterior probability (using Model 1)
• Observed data: /s/ 起 ? 起 /e/
  Q. Which of "起" or "寝" is likely observed at the "?" point?
• Posterior probability:
  p(x2 | x1 = 起, x3 = 起) = p(/s/, 起, x2, 起, /e/) / Σ_{x2'} p(/s/, 起, x2', 起, /e/)
  p(/s/, 起, 起, 起, /e/) = 1 × 0.7 × 0.7 × 0.2 = 0.098
  p(/s/, 起, 寝, 起, /e/) = 1 × 0.1 × 0.8 × 0.2 = 0.016
  p(x2 = 起 | x1 = 起, x3 = 起) = 0.098 / (0.098 + 0.016) = 0.860
  p(x2 = 寝 | x1 = 起, x3 = 起) = 0.016 / (0.098 + 0.016) = 0.140
  A. "?" is "起" with 86% probability.
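
The posterior computation can be written the same way; this sketch (mine, with "wake"/"sleep" for 起/寝) reproduces the joint probabilities 0.098 and 0.016 and the posterior 0.86 / 0.14.

```python
# Model 1 transition probabilities (wake = 起, sleep = 寝)
trans = {"/s/":   {"wake": 1.0},
         "wake":  {"wake": 0.7, "sleep": 0.1, "/e/": 0.2},
         "sleep": {"wake": 0.8, "sleep": 0.1, "/e/": 0.1}}

def seq_prob(seq):
    prob = 1.0
    for prev, cur in zip(seq[:-1], seq[1:]):
        prob *= trans[prev].get(cur, 0.0)
    return prob

# Observed: /s/ wake ? wake /e/ ; posterior over the missing middle point
joint = {x2: seq_prob(["/s/", "wake", x2, "wake", "/e/"])
         for x2 in ("wake", "sleep")}
norm = sum(joint.values())
posterior = {x2: p / norm for x2, p in joint.items()}

print(joint)       # {'wake': 0.098, 'sleep': 0.016}
print(posterior)   # wake ~ 0.86, sleep ~ 0.14
```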

Generation
• Random generation of data samples from the model (Model 1 on the slide)
  1. Sample x1 given the initial state /s/
  2. Sample x2 given x1
  3. Sample x3 given x2
  4. Sample x4 given x3, and so on
• End if the final state /e/ is sampled, e.g., /s/ 起 寝 起 ・・・ /e/
• Various lengths of data are generated.
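
A small sampling sketch (my own, using Model 1 with "wake"/"sleep" for 起/寝): states are drawn one by one from the conditional distributions, and generation stops when the final state /e/ is sampled, so sequences of various lengths are produced.

```python
import random

random.seed(0)

# Model 1 transition probabilities
trans = {"/s/":   {"wake": 1.0},
         "wake":  {"wake": 0.7, "sleep": 0.1, "/e/": 0.2},
         "sleep": {"wake": 0.8, "sleep": 0.1, "/e/": 0.1}}

def generate():
    """Sample a sequence state by state until the final state /e/ is drawn."""
    seq, state = ["/s/"], "/s/"
    while state != "/e/":
        nxt = trans[state]
        state = random.choices(list(nxt), weights=list(nxt.values()))[0]
        seq.append(state)
    return seq

for _ in range(3):
    print(generate())   # sequences of various lengths, each ending in /e/
```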

Maximum Likelihood Data Generation
• Data generation by maximizing the likelihood under the condition that the length of the data is given (Model 1, dynamic programming)
  1. Store the best path at each state
     e.g., for the 起 states the scores grow as 1 → 1 × 0.7 = 0.7 → 0.7 × 0.7 = 0.49, selecting at each step the better of the incoming paths (such as 1 × 0.7 vs. 0 × 0.8, or 0.7 × 0.7 vs. 0.1 × 0.8)
  2. Back-track the best path from the final state
     e.g., the best path into /e/ has score 0.49 × 0.2 = 0.098 (vs. 0.07 × 0.1 through 寝)
• /s/ 起 起 起 /e/ is generated if the data length is set to 3.
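
The dynamic programming procedure on this slide can be sketched as follows (my own code, Model 1 with "wake"/"sleep" for 起/寝): store the best-scoring path into each state at each time step, then back-track from the final state. For length 3 it returns /s/ wake wake wake /e/ with score 0.098, matching the slide.

```python
# Model 1 transition probabilities
trans = {"/s/":   {"wake": 1.0},
         "wake":  {"wake": 0.7, "sleep": 0.1, "/e/": 0.2},
         "sleep": {"wake": 0.8, "sleep": 0.1, "/e/": 0.1}}
states = ["wake", "sleep"]

def most_likely_sequence(length):
    """Dynamic programming: keep the best-scoring path into each state,
    then back-track from the explicit final state /e/."""
    # best[n][s] = (score of best length-(n+1) path ending in state s, previous state)
    best = [{s: (trans["/s/"].get(s, 0.0), "/s/") for s in states}]
    for _ in range(length - 1):
        prev = best[-1]
        best.append({s: max(((prev[p][0] * trans[p].get(s, 0.0), p) for p in states),
                            key=lambda t: t[0]) for s in states})
    # Transition into the final state /e/
    score, last = max(((best[-1][s][0] * trans[s].get("/e/", 0.0), s) for s in states),
                      key=lambda t: t[0])
    # Back-track the best path from the final state
    path = [last]
    for n in range(length - 1, 0, -1):
        path.append(best[n][path[-1]][1])
    return ["/s/"] + path[::-1] + ["/e/"], score

print(most_likely_sequence(3))   # (['/s/', 'wake', 'wake', 'wake', '/e/'], 0.098)
```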