
Page 1: Sequential Data Modeling

Sequential Data Modeling

Tomoki Toda², Graham Neubig¹, Sakriani Sakti¹

¹ Augmented Human Communication Laboratory, Graduate School of Information Science, NAIST
² Information Technology Center / Graduate School of Information Science, Nagoya University

Page 2: Sequential Data Modeling

Review: Evaluation/Alignment/Training

Evaluation: given the model $\lambda$ and sequential data $\mathbf{x}$, the forward/backward algorithms compute the likelihood
$p(\mathbf{x} \mid \lambda) = \sum_{\text{all } \mathbf{z}} p(\mathbf{x}, \mathbf{z} \mid \lambda)$

Alignment (decoding): given the model $\lambda$ and sequential data $\mathbf{x}$, the Viterbi algorithm finds the state sequence
$\hat{\mathbf{z}} = \arg\max_{\mathbf{z}} p(\mathbf{x}, \mathbf{z} \mid \lambda)$

Training: given sequential data $\mathbf{x}$, the Baum-Welch (i.e., EM) algorithm estimates the model parameter set
$\hat{\lambda} = \arg\max_{\lambda} p(\mathbf{x} \mid \lambda)$

Review: 1
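As a small illustration of the evaluation step, here is a Python sketch of the forward algorithm for a discrete-output HMM; the parameter values and the 0/1 output coding below are made up for this transcript and are not the example used later in the slides.

```python
import numpy as np

# Hypothetical 2-state HMM with two output symbols (0 and 1).
pi = np.array([0.6, 0.4])              # initial state probabilities
A = np.array([[0.7, 0.3],              # A[i, j] = P(state j at n | state i at n-1)
              [0.2, 0.8]])
B = np.array([[0.8, 0.2],              # B[i, o] = P(output o | state i)
              [0.3, 0.7]])

def forward_likelihood(obs):
    """Return p(x | lambda) = sum over all state sequences, via the forward algorithm."""
    alpha = pi * B[:, obs[0]]          # alpha_1(s) = pi_s * B_s(o_1)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # alpha_n(s) = sum_s' alpha_{n-1}(s') A_{s',s} B_s(o_n)
    return alpha.sum()

print(forward_likelihood([0, 1, 1]))   # likelihood of a 3-sample sequence
```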

Page 3: Sequential Data Modeling

Example of Parameter Estimation (起: wake up, 寝: sleep)

Training data samples with state sequences:
• /s/ ⇒ state 1: 起 ⇒ state 1: 起 ⇒ state 2: 寝
• /s/ ⇒ state 2: 寝 ⇒ state 2: 起 ⇒ state 1: 起 ⇒ state 1: 寝
• /s/ ⇒ state 2: 寝 ⇒ state 2: 起

(States 1 and 2 can also be a final state.)

Number of observed samples:
• Initial state: state 1 once, state 2 twice
• Outputs: state 1 emits 起 3 times and 寝 once; state 2 emits 起 twice and 寝 3 times
• Transitions: 1→1 twice, 1→2 once, 2→1 once, 2→2 twice

Maximum likelihood estimates:
• Initial state probabilities: $\pi_1 = 1/(1+2)$, $\pi_2 = 2/(1+2)$
• Output probabilities: $B_1(起) = 3/(3+1)$, $B_1(寝) = 1/(3+1)$, $B_2(起) = 2/(2+3)$, $B_2(寝) = 3/(2+3)$
• Transition probabilities: $A_{1,1} = 2/(2+1)$, $A_{1,2} = 1/(2+1)$, $A_{2,1} = 1/(2+1)$, $A_{2,2} = 2/(2+1)$

Review: 2
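The counting in this example is easy to reproduce in code. The sketch below (added for this transcript; the variable names are not from the slides) accumulates initial-state, transition, and output counts from the three labeled sequences and normalizes them, giving the same fractions as above.

```python
from collections import Counter

# The three labeled training sequences from the slide, as (state, output) pairs.
# 起 = "wake up", 寝 = "sleep".
sequences = [
    [(1, "起"), (1, "起"), (2, "寝")],
    [(2, "寝"), (2, "起"), (1, "起"), (1, "寝")],
    [(2, "寝"), (2, "起")],
]

init, trans, emit = Counter(), Counter(), Counter()
for seq in sequences:
    init[seq[0][0]] += 1
    for (s, o) in seq:
        emit[(s, o)] += 1
    for (s_prev, _), (s, _) in zip(seq, seq[1:]):
        trans[(s_prev, s)] += 1

# ML estimates = relative frequencies (transitions normalized over states 1 and 2,
# as on the slide, i.e., not counting the move into the final state).
states, outputs = (1, 2), ("起", "寝")
pi = {s: init[s] / sum(init.values()) for s in states}
A = {(i, j): trans[(i, j)] / sum(trans[(i, k)] for k in states)
     for i in states for j in states}
B = {(s, o): emit[(s, o)] / sum(emit[(s, p)] for p in outputs)
     for s in states for o in outputs}
print(pi)  # {1: 1/3, 2: 2/3}
print(A)   # A[1,1]=2/3, A[1,2]=1/3, A[2,1]=1/3, A[2,2]=2/3
print(B)   # B[1,起]=3/4, B[1,寝]=1/4, B[2,起]=2/5, B[2,寝]=3/5
```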

Page 4: Sequential Data Modeling

Review: Lower Bound of HMM Likelihood

Log-scaled likelihood function for $U$ samples of sequential data:

$\sum_{u=1}^{U} \ln p(\mathbf{x}^{(u)} \mid \lambda) = \sum_{u=1}^{U} \ln \sum_{\text{all } \mathbf{z}^{(u)}} p(\mathbf{x}^{(u)}, \mathbf{z}^{(u)} \mid \lambda) \geq \sum_{u=1}^{U} \sum_{\text{all } \mathbf{z}^{(u)}} q(\mathbf{z}^{(u)}) \ln \frac{p(\mathbf{x}^{(u)}, \mathbf{z}^{(u)} \mid \lambda)}{q(\mathbf{z}^{(u)})} = \mathcal{L}(q, \lambda)$ (lower bound)

E-step: calculate the posterior probabilities of the latent variables (i.e., state sequences)

$\hat{q}(\mathbf{z}^{(u)}) = p(\mathbf{z}^{(u)} \mid \mathbf{x}^{(u)}, \lambda^{\mathrm{old}}) = \dfrac{p(\mathbf{x}^{(u)}, \mathbf{z}^{(u)} \mid \lambda^{\mathrm{old}})}{p(\mathbf{x}^{(u)} \mid \lambda^{\mathrm{old}})}$

M-step: maximize the auxiliary function with respect to the model parameters

$\mathcal{Q}(\lambda^{\mathrm{new}}, \lambda^{\mathrm{old}}) = \sum_{u=1}^{U} \sum_{\text{all } \mathbf{z}^{(u)}} \hat{q}(\mathbf{z}^{(u)}) \ln p(\mathbf{x}^{(u)}, \mathbf{z}^{(u)} \mid \lambda^{\mathrm{new}})$

Review: 3

Page 5: Sequential Data Modeling

Review: E-Step

• Calculate the posterior probabilities of the latent variables.

$\gamma_n^{(u)}(s) = \hat{q}(z_n^{(u)} = s)$: expected # of samples observed in state $s$ at time $n$ in sample $u$

$\gamma_n^{(u)}(s) = p(z_n^{(u)} = s \mid \mathbf{x}^{(u)}, \lambda^{\mathrm{old}}) = \dfrac{p(\mathbf{x}^{(u)}, z_n^{(u)} = s \mid \lambda^{\mathrm{old}})}{p(\mathbf{x}^{(u)} \mid \lambda^{\mathrm{old}})}$

$\xi_n^{(u)}(s', s) = \hat{q}(z_{n-1}^{(u)} = s', z_n^{(u)} = s)$: expected # of samples moving from state $s'$ at time $n-1$ to state $s$ at time $n$ in sample $u$

$\xi_n^{(u)}(s', s) = p(z_{n-1}^{(u)} = s', z_n^{(u)} = s \mid \mathbf{x}^{(u)}, \lambda^{\mathrm{old}}) = \dfrac{p(\mathbf{x}^{(u)}, z_{n-1}^{(u)} = s', z_n^{(u)} = s \mid \lambda^{\mathrm{old}})}{p(\mathbf{x}^{(u)} \mid \lambda^{\mathrm{old}})}$, where the numerator is computed as $\alpha_{n-1}(s')\, A_{s',s}\, B_s(\mathbf{x}_n^{(u)})\, \beta_n(s)$.

[Trellis diagrams over the states at each time step illustrate which paths contribute to $\gamma_n^{(u)}(s)$ and $\xi_n^{(u)}(s', s)$.]

Review: 4
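A compact Python sketch of these E-step quantities, assuming a discrete-output HMM and the standard forward/backward recursions; the parameters at the bottom are placeholders rather than the slide example's values.

```python
import numpy as np

def forward_backward(obs, pi, A, B):
    """Posterior state probabilities gamma[n, s] and transition posteriors
    xi[n, s_prev, s] for one observation sequence (E-step quantities)."""
    N, S = len(obs), len(pi)
    alpha = np.zeros((N, S))
    beta = np.zeros((N, S))
    alpha[0] = pi * B[:, obs[0]]
    for n in range(1, N):
        alpha[n] = (alpha[n - 1] @ A) * B[:, obs[n]]
    beta[-1] = 1.0
    for n in range(N - 2, -1, -1):
        beta[n] = A @ (B[:, obs[n + 1]] * beta[n + 1])
    likelihood = alpha[-1].sum()                         # p(x | lambda)
    gamma = alpha * beta / likelihood                    # gamma[n, s]
    xi = (alpha[:-1, :, None] * A[None, :, :] *
          (B[:, obs[1:]].T * beta[1:])[:, None, :]) / likelihood  # xi[n, s', s]
    return gamma, xi, likelihood

# Hypothetical parameters for a 2-state, 2-symbol HMM (for illustration only).
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.2, 0.8]])
B = np.array([[0.8, 0.2], [0.3, 0.7]])
gamma, xi, lik = forward_backward([0, 1, 1], pi, A, B)
```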

Page 6: Sequential Data Modeling

Review: M-Step

Auxiliary function:

$\mathcal{Q}(\lambda^{\mathrm{new}}, \lambda^{\mathrm{old}}) = \sum_{s=1}^{S} \gamma_s(n{=}1) \ln \pi_s + \sum_{s'=1}^{S} \sum_{s=1}^{S} \xi_{s',s} \ln A_{s',s} + \sum_{s=1}^{S} \sum_{\text{all "o"}} \gamma_s(\text{"o"}) \ln B_s(\text{"o"})$

where $\gamma_s(n{=}1)$, $\xi_{s',s}$, and $\gamma_s(\text{"o"})$ are the sufficient statistics accumulated from the E-step posteriors.

For each state, setting the derivatives to zero under the sum-to-one constraints gives the ML estimates:

Initial state probability: $\hat{\pi}_s = \dfrac{\gamma_s(n{=}1)}{\sum_{s'=1}^{S} \gamma_{s'}(n{=}1)}$

Transition probability: $\hat{A}_{s',s} = \dfrac{\xi_{s',s}}{\sum_{s=1}^{S} \xi_{s',s}}$

Output probability: $\hat{B}_s(\text{"o"}) = \dfrac{\gamma_s(\text{"o"})}{\sum_{\text{all "o"}} \gamma_s(\text{"o"})}$

Review: 5
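A matching M-step sketch, assuming per-sequence gamma/xi arrays such as those returned by the forward-backward sketch above (function and variable names are illustrative, not from the slides). Running an E-step and this M-step in a loop is one round of Baum-Welch.

```python
import numpy as np

def m_step(gammas, xis, obs_seqs, n_states, n_symbols):
    """Re-estimate (pi, A, B) from the E-step posteriors of several sequences.
    gammas[u][n, s] and xis[u][n, s_prev, s] are the per-sequence posteriors."""
    pi_stat = np.zeros(n_states)
    A_stat = np.zeros((n_states, n_states))
    B_stat = np.zeros((n_states, n_symbols))
    for gamma, xi, obs in zip(gammas, xis, obs_seqs):
        pi_stat += gamma[0]                       # gamma at time 1
        A_stat += xi.sum(axis=0)                  # sum over time of xi(s', s)
        for n, o in enumerate(obs):
            B_stat[:, o] += gamma[n]              # sum over frames emitting "o"
    pi_new = pi_stat / pi_stat.sum()
    A_new = A_stat / A_stat.sum(axis=1, keepdims=True)
    B_new = B_stat / B_stat.sum(axis=1, keepdims=True)
    return pi_new, A_new, B_new
```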

Page 7: Sequential Data Modeling

Review: Example of E-Step

Forward and backward probabilities for the observation sequence 起, 寝, 寝 (time n = 1, 2, 3), computed on a two-state trellis with an initial state and a pseudo final state.

[Figure: the trellis annotated with the forward probabilities, the backward probabilities, and their products at each node; at every time step the products sum over the two states to the sequence likelihood $p(\mathbf{x} \mid \lambda) = 0.12$.]

Review: 6

Page 8: Sequential Data Modeling

Review: Example of Posterior Probabilities

For the observation sequence 起, 寝, 寝 (time n = 1, 2, 3):

State occupancy posteriors (state s at time n):
• State 1: 0.05824/0.12 (n=1), 0.0192/0.12 (n=2), 0.012/0.12 (n=3)
• State 2: 0.06176/0.12 (n=1), 0.1008/0.12 (n=2), 0.108/0.12 (n=3)

Transition posteriors (state s' → state s):
• 1→1: 0.01792/0.12 (n=1→2), 0.0084/0.12 (n=2→3)
• 1→2: 0.04032/0.12 (n=1→2), 0.0108/0.12 (n=2→3)
• 2→1: 0.00128/0.12 (n=1→2), 0.0036/0.12 (n=2→3)
• 2→2: 0.06048/0.12 (n=1→2), 0.0972/0.12 (n=2→3)

Calculate these posterior probabilities (= expected # of samples) sequence by sequence.

Review: 7

Page 9: Sequential Data Modeling

Review: Example of Sufficient Statistics

The posterior probabilities (= expected # of samples) of the previous slide are accumulated into sufficient statistics (= expected # of samples for each parameter):

• Initial state: $\gamma_1(n{=}1) = 0.05824/0.12$, $\gamma_2(n{=}1) = 0.06176/0.12$
• Transitions: $\xi_{1,1} = 0.02632/0.12$, $\xi_{1,2} = 0.05112/0.12$, $\xi_{2,1} = 0.00488/0.12$, $\xi_{2,2} = 0.15768/0.12$
• Outputs: $\gamma_1(起) = 0.05824/0.12$, $\gamma_1(寝) = 0.0312/0.12$, $\gamma_2(起) = 0.06176/0.12$, $\gamma_2(寝) = 0.2088/0.12$

Review: 8

Page 10: Sequential Data Modeling

Review: Example of ML Estimates

Sufficient statistics (expected # of samples for each parameter):
• Initial state: $\gamma_1(n{=}1) = 0.05824/0.12$, $\gamma_2(n{=}1) = 0.06176/0.12$
• Transitions: $\xi_{1,1} = 0.02632/0.12$, $\xi_{1,2} = 0.05112/0.12$, $\xi_{2,1} = 0.00488/0.12$, $\xi_{2,2} = 0.15768/0.12$
• Outputs: $\gamma_1(起) = 0.05824/0.12$, $\gamma_1(寝) = 0.0312/0.12$, $\gamma_2(起) = 0.06176/0.12$, $\gamma_2(寝) = 0.2088/0.12$

ML estimates:

$\hat{\pi}_1 = \dfrac{\gamma_1(n{=}1)}{\gamma_1(n{=}1) + \gamma_2(n{=}1)} \approx 0.49$, $\hat{\pi}_2 = \dfrac{\gamma_2(n{=}1)}{\gamma_1(n{=}1) + \gamma_2(n{=}1)} \approx 0.51$

$\hat{A}_{1,1} = \dfrac{\xi_{1,1}}{\xi_{1,1} + \xi_{1,2}} \approx 0.34$, $\hat{A}_{1,2} = \dfrac{\xi_{1,2}}{\xi_{1,1} + \xi_{1,2}} \approx 0.66$, $\hat{A}_{2,1} = \dfrac{\xi_{2,1}}{\xi_{2,1} + \xi_{2,2}} \approx 0.03$, $\hat{A}_{2,2} = \dfrac{\xi_{2,2}}{\xi_{2,1} + \xi_{2,2}} \approx 0.97$

$\hat{B}_1(起) = \dfrac{\gamma_1(起)}{\gamma_1(起) + \gamma_1(寝)} \approx 0.65$, $\hat{B}_1(寝) = \dfrac{\gamma_1(寝)}{\gamma_1(起) + \gamma_1(寝)} \approx 0.35$, $\hat{B}_2(起) = \dfrac{\gamma_2(起)}{\gamma_2(起) + \gamma_2(寝)} \approx 0.23$, $\hat{B}_2(寝) = \dfrac{\gamma_2(寝)}{\gamma_2(起) + \gamma_2(寝)} \approx 0.77$

Review: 9
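These estimates can be checked numerically. The short script below (added for this transcript) recomputes them from the sufficient statistics listed above; the common denominator 0.12 cancels in every ratio.

```python
# Sufficient statistics from the slide (all share the denominator p(x|lambda) = 0.12,
# which cancels in every ratio below).
gamma_init = {1: 0.05824, 2: 0.06176}
xi = {(1, 1): 0.02632, (1, 2): 0.05112, (2, 1): 0.00488, (2, 2): 0.15768}
gamma_out = {(1, "起"): 0.05824, (1, "寝"): 0.0312,
             (2, "起"): 0.06176, (2, "寝"): 0.2088}

pi = {s: gamma_init[s] / sum(gamma_init.values()) for s in (1, 2)}
A = {(i, j): xi[(i, j)] / (xi[(i, 1)] + xi[(i, 2)]) for i in (1, 2) for j in (1, 2)}
B = {(s, o): gamma_out[(s, o)] / (gamma_out[(s, "起")] + gamma_out[(s, "寝")])
     for s in (1, 2) for o in ("起", "寝")}

print(pi)  # pi_1 ~ 0.49, pi_2 ~ 0.51
print(A)   # A_11 ~ 0.34, A_12 ~ 0.66, A_21 ~ 0.03, A_22 ~ 0.97
print(B)   # B_1(起) ~ 0.65, B_1(寝) ~ 0.35, B_2(起) ~ 0.23, B_2(寝) ~ 0.77
```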

Page 11: Sequential Data Modeling

Sequential Data Modeling
5th class: "Continuous Latent Variable Model 1"

Tomoki Toda
Information Technology Center / Graduate School of Information Science, Nagoya University

Page 12: Sequential Data Modeling

Basic Techniques

• Discrete latent variables: the mixture model (e.g., GMM); adding a Markov dependency between the latent variables gives the hidden Markov model (HMM).
• Continuous latent variables: factor analysis (FA); adding a Markov dependency gives linear dynamical systems (LDS).

[Graphical models: in each case, latent variables $\mathbf{z}_1, \mathbf{z}_2, \mathbf{z}_3$ generate observations $\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3$; in the Markov models the latent variables additionally form a chain $\mathbf{z}_1 \rightarrow \mathbf{z}_2 \rightarrow \mathbf{z}_3$.]

Page 13: Sequential Data Modeling

Continuous Latent Variables (from PCA to FA)

Page 14: Sequential Data Modeling

Example of High-Dimensional Data

• Example: hand-written digits
• Each image has 100 x 100 = 10,000 pixels, i.e., it is represented as a 10,000-dimensional vector.
• Each image is represented as one point in the 10,000-dimensional space.

However, the number of degrees of freedom of the variability would be limited (e.g., only vertical and horizontal translations and rotations: 3 degrees).

Can we find a lower-dimensional subspace on which the data points live?

Page 15: Sequential Data Modeling

Extraction of Synthetic Variables

• Synthesize new variables by linearly combining observable variables, e.g., from 2-dimensional observation data to one-dimensional data:

$y_n = 0.5\, x_{n,1} + x_{n,2}$

represented by an inner product: $y_n = \mathbf{w}^\top \mathbf{x}_n$, where $\mathbf{w} = [0.5,\ 1]^\top$ and $\mathbf{x}_n = [x_{n,1},\ x_{n,2}]^\top$

[Figure: a scatter plot of 2-dimensional data (1st and 2nd dimensions), with each point projected onto the direction $\mathbf{w}$.]

Page 16: Sequential Data Modeling

Principal Component Analysis (PCA)

• How can we extract a synthetic variable that most effectively represents the observable variables?
• Determine a unit vector $\mathbf{u}$ by maximizing the variance of the synthetic variables.

Synthetic variable: $y_n = \mathbf{u}^\top (\mathbf{x}_n - \boldsymbol{\mu})$, where $\boldsymbol{\mu}$ is the mean vector, so the mean value of $y_n$ is 0.

Its variance: $\frac{1}{N} \sum_{n=1}^{N} y_n^2 = \mathbf{u}^\top \mathbf{S} \mathbf{u}$

Constraint $\mathbf{u}^\top \mathbf{u} = 1$: i.e., $\mathbf{u}$ is a unit vector (length = 1).

Page 17: Sequential Data Modeling

Eigenvalue Problem

• Maximization of the variance of the synthetic variable, with

$\boldsymbol{\mu} = \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}_n, \qquad \mathbf{S} = \frac{1}{N} \sum_{n=1}^{N} (\mathbf{x}_n - \boldsymbol{\mu})(\mathbf{x}_n - \boldsymbol{\mu})^\top$

Maximize the following objective function with respect to $\mathbf{u}$ (variance plus constraint, with Lagrange multiplier $\lambda$):

$F = \mathbf{u}^\top \mathbf{S} \mathbf{u} + \lambda \left(1 - \mathbf{u}^\top \mathbf{u}\right)$

Setting $\frac{\partial F}{\partial \mathbf{u}} = \mathbf{0}$ gives the eigenvalue problem: $\mathbf{S}\mathbf{u} = \lambda\mathbf{u}$ (eigenvector $\mathbf{u}$, eigenvalue $\lambda$)
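A small numpy sketch of this eigenvalue problem on made-up 2-D data; it also checks the statement on the next slide that the variance of the first principal component equals the largest eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([3.0, 5.0], [[4.0, 1.5], [1.5, 1.0]], size=500)  # toy 2-D data

mu = X.mean(axis=0)
S = np.cov(X - mu, rowvar=False, bias=True)      # S = (1/N) sum (x - mu)(x - mu)^T

# Solving S u = lambda u; eigh returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(S)
u1 = eigvecs[:, -1]                              # eigenvector of the largest eigenvalue

y = (X - mu) @ u1                                # 1st principal component y_n = u1^T (x_n - mu)
print(np.isclose(y.var(), eigvals[-1]))          # its variance equals the largest eigenvalue
```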

Page 18: Sequential Data Modeling

Eigenvector and Eigenvalue

• Eigenvalue problem $\mathbf{S}\mathbf{u} = \lambda\mathbf{u}$: the eigenvector gives the direction, and the variance of the corresponding synthetic variable equals the eigenvalue.
• The eigenvector $\mathbf{u}_1$ for the largest eigenvalue represents the direction that maximizes the variance of a synthetic variable.
• Synthetic variable with the eigenvector (= principal component): $y_{n,1} = \mathbf{u}_1^\top (\mathbf{x}_n - \boldsymbol{\mu})$; its mean = 0 and its variance = eigenvalue $\lambda_1$.

Page 19: Sequential Data Modeling

Projection onto Low-Dimensional Space

• Extraction of multiple eigenvectors, e.g., $\mathbf{u}_1, \mathbf{u}_2$: orthonormal vectors with constraints $\mathbf{u}_1^\top\mathbf{u}_1 = \mathbf{u}_2^\top\mathbf{u}_2 = 1$ and $\mathbf{u}_1^\top\mathbf{u}_2 = 0$.
• Represent high-dimensional data $\mathbf{x}_n$ with low-dimensional data $\mathbf{y}_n$ (i.e., principal components):

1st principal component: $y_{n,1} = \mathbf{u}_1^\top (\mathbf{x}_n - \boldsymbol{\mu})$, the synthetic variable with the eigenvector for the largest eigenvalue

2nd principal component: $y_{n,2} = \mathbf{u}_2^\top (\mathbf{x}_n - \boldsymbol{\mu})$, the synthetic variable with the eigenvector for the 2nd largest eigenvalue

Mean vector of $\mathbf{y}_n$: $\mathbf{0}$; covariance matrix: $\boldsymbol{\Lambda} = \begin{bmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{bmatrix}$

Page 20: Sequential Data Modeling

Whitening Transformation

$\mathbf{x}_n$: mean vector $\boldsymbol{\mu}$, covariance $\mathbf{S}$

$\mathbf{y}_n = \mathbf{U}^\top (\mathbf{x}_n - \boldsymbol{\mu})$: mean vector $\mathbf{0}$, covariance $\boldsymbol{\Lambda}$

Whitening: $\mathbf{z}_n = \boldsymbol{\Lambda}^{-1/2} \mathbf{y}_n = \boldsymbol{\Lambda}^{-1/2} \mathbf{U}^\top (\mathbf{x}_n - \boldsymbol{\mu})$: mean vector $\mathbf{0}$, covariance $\mathbf{I}$

Linear transform for whitening: $\boldsymbol{\Lambda}^{-1/2} \mathbf{U}^\top$
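A numpy sketch of the whitening transform, assuming made-up 3-D Gaussian data; after the transform the sample mean is (approximately) the zero vector and the sample covariance is (approximately) the identity.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal([1.0, -2.0, 0.5], np.diag([3.0, 1.0, 0.5]) + 0.4, size=2000)

mu = X.mean(axis=0)
S = np.cov(X, rowvar=False, bias=True)
lam, U = np.linalg.eigh(S)                       # S = U diag(lam) U^T

Z = (X - mu) @ U / np.sqrt(lam)                  # z_n = Lambda^{-1/2} U^T (x_n - mu)
print(np.round(Z.mean(axis=0), 3))               # ~ zero mean vector
print(np.round(np.cov(Z, rowvar=False, bias=True), 3))  # ~ identity covariance
```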

Page 21: Sequential Data Modeling

Continuous Latent Variables (from PCA to FA)

Page 22: Sequential Data Modeling

Whitening Process with PCA

Linear transformation for whitening, from the high-dimensional space to the low-dimensional space:

$z_n = \lambda_1^{-1/2}\, \mathbf{u}_1^\top (\mathbf{x}_n - \boldsymbol{\mu})$

Observation data $\mathbf{x}_n$: mean vector $\boldsymbol{\mu}$, covariance $\mathbf{S}$. Low-dimensional data $z_n$: mean 0, variance 1.

1. Dimension reduction
2. Processing of the low-dimensional data, e.g., probability density modeling

Regarding the low-dimensional data as the observation data ignores the errors caused by the linear transformation, i.e., we are unable to model the probability density of the original observation data.

Page 23: Sequential Data Modeling

Basic Idea of Factor Analysis (FA)

A linear transformation from the low-dimensional space to the high-dimensional space:

1. Low-dimensional data generation: $z_n$ (mean 0, variance 1)
2. Projection onto the subspace: $\hat{\mathbf{x}}_n = \lambda_1^{1/2}\, \mathbf{u}_1 z_n + \boldsymbol{\mu}$ (projected data $\hat{\mathbf{x}}_n$)
3. Random noise addition: $\mathbf{x}_n = \hat{\mathbf{x}}_n + \mathbf{e}_n$ (observation data $\mathbf{x}_n$)

Page 24: Sequential Data Modeling

Comparison between PCA and FA

• FA is capable of defining the p.d.f. of the observation data based on an inversion process of the whitening transformation.

Whitening with PCA: $z_n = \lambda_1^{-1/2}\, \mathbf{u}_1^\top (\mathbf{x}_n - \boldsymbol{\mu})$ (observation data $\mathbf{x}_n$: mean vector $\boldsymbol{\mu}$, covariance $\mathbf{S}$; low-dimensional data $z_n$: mean 0, variance 1)

Factor analysis (FA): $\hat{\mathbf{x}}_n = \lambda_1^{1/2}\, \mathbf{u}_1 z_n + \boldsymbol{\mu}$, where the low-dimensional data $z_n$ and the error $\mathbf{x}_n - \hat{\mathbf{x}}_n$ are both modeled as random variables.

Page 25: Sequential Data Modeling

Representation of Observation Data w/ FA

• Representation of the observation data: $\mathbf{x}_n = \mathbf{W}\mathbf{z}_n + \boldsymbol{\mu} + \mathbf{e}_n$, with loading matrix $\mathbf{W}$, factors (low-dimensional data) $\mathbf{z}_n$, and observation noise $\mathbf{e}_n$.

Observation model given the factors: $p(\mathbf{x}_n \mid \mathbf{z}_n, \lambda) = \mathcal{N}(\mathbf{x}_n;\, \mathbf{W}\mathbf{z}_n + \boldsymbol{\mu},\, \boldsymbol{\Sigma})$

Observation noise: $p(\mathbf{e}_n \mid \lambda) = \mathcal{N}(\mathbf{e}_n;\, \mathbf{0},\, \boldsymbol{\Sigma})$

Factors: $p(\mathbf{z}_n \mid \lambda) = \mathcal{N}(\mathbf{z}_n;\, \mathbf{0},\, \mathbf{I})$
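The generative process described by these densities can be written down directly. The sketch below samples from a hypothetical FA model (the loading matrix, mean, and noise covariance are invented for illustration); the same toy parameters are reused in the later sketches.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical FA parameters: D = 3 observed dims, M = 2 factors.
W = np.array([[1.0, 0.0],
              [0.5, 0.5],
              [0.0, 1.5]])              # loading matrix
mu = np.array([2.0, -1.0, 0.5])         # mean vector
Sigma = np.diag([0.3, 0.2, 0.4])        # diagonal noise covariance

N = 5000
Z = rng.standard_normal((N, 2))                          # z_n ~ N(0, I)
E = rng.multivariate_normal(np.zeros(3), Sigma, size=N)  # e_n ~ N(0, Sigma)
X = Z @ W.T + mu + E                                     # x_n = W z_n + mu + e_n
# X now holds N samples drawn from the FA generative model.
```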

Page 26: Sequential Data Modeling

Marginalization over Latent Variables

If one sample $z^{(1)} \sim p(z \mid \lambda)$ is generated: $p(\mathbf{x}_n \mid z^{(1)}, \lambda) = \mathcal{N}(\mathbf{x}_n;\, \mathbf{w}z^{(1)} + \boldsymbol{\mu},\, \boldsymbol{\Sigma})$

If $K$ samples $z^{(1)}, \dots, z^{(K)} \sim p(z \mid \lambda)$ are generated: $p(\mathbf{x}_n \mid \lambda) \approx \frac{1}{K} \sum_{k=1}^{K} \mathcal{N}(\mathbf{x}_n;\, \mathbf{w}z^{(k)} + \boldsymbol{\mu},\, \boldsymbol{\Sigma})$

If an infinite # of samples are generated: $p(\mathbf{x}_n \mid \lambda) = \int p(\mathbf{x}_n \mid z_n, \lambda)\, p(z_n \mid \lambda)\, \mathrm{d}z_n$

Page 27: Sequential Data Modeling

Derivation of p.d.f. of Observation Data

• Derived by marginalizing the joint p.d.f. over the factors, which are regarded as a latent variable:

$p(\mathbf{x}_n \mid \lambda) = \int p(\mathbf{x}_n \mid \mathbf{z}_n, \lambda)\, p(\mathbf{z}_n \mid \lambda)\, \mathrm{d}\mathbf{z}_n = \int \mathcal{N}(\mathbf{x}_n;\, \mathbf{W}\mathbf{z}_n + \boldsymbol{\mu},\, \boldsymbol{\Sigma})\, \mathcal{N}(\mathbf{z}_n;\, \mathbf{0},\, \mathbf{I})\, \mathrm{d}\mathbf{z}_n = \mathcal{N}(\mathbf{x}_n;\, \boldsymbol{\mu},\, \mathbf{W}\mathbf{W}^\top + \boldsymbol{\Sigma})$

Expectations:

$\langle \mathbf{x}_n \rangle = \int (\mathbf{W}\mathbf{z} + \boldsymbol{\mu})\, \mathcal{N}(\mathbf{z};\, \mathbf{0},\, \mathbf{I})\, \mathrm{d}\mathbf{z} = \boldsymbol{\mu}$ (= mean vector)

$\langle \mathbf{x}_n \mathbf{x}_n^\top \rangle = \int \left( \boldsymbol{\Sigma} + (\mathbf{W}\mathbf{z} + \boldsymbol{\mu})(\mathbf{W}\mathbf{z} + \boldsymbol{\mu})^\top \right) \mathcal{N}(\mathbf{z};\, \mathbf{0},\, \mathbf{I})\, \mathrm{d}\mathbf{z} = \boldsymbol{\Sigma} + \mathbf{W}\mathbf{W}^\top + \boldsymbol{\mu}\boldsymbol{\mu}^\top$ (= covariance matrix + squared mean vector)
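A quick Monte Carlo check of this marginal, using the same hypothetical FA parameters as above: the empirical mean and covariance of samples drawn from the generative model should approach $\boldsymbol{\mu}$ and $\mathbf{W}\mathbf{W}^\top + \boldsymbol{\Sigma}$.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical FA parameters (same shapes as before: 3 observed dims, 2 factors).
W = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.5]])
mu = np.array([2.0, -1.0, 0.5])
Sigma = np.diag([0.3, 0.2, 0.4])

# Sample from the generative model and compare the empirical moments with
# the analytical marginal N(x; mu, W W^T + Sigma).
N = 200000
Z = rng.standard_normal((N, 2))
E = rng.multivariate_normal(np.zeros(3), Sigma, size=N)
X = Z @ W.T + mu + E

print(np.round(X.mean(axis=0), 2))            # ~ mu
print(np.round(np.cov(X, rowvar=False), 2))   # ~ W @ W.T + Sigma
print(np.round(W @ W.T + Sigma, 2))
```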

Page 28: Sequential Data Modeling

Comparison between GMM and FA

GMM (discrete latent variables): $p(\mathbf{x}_n \mid \lambda) = \sum_{m=1}^{M} p(z_n = m \mid \lambda)\, p(\mathbf{x}_n \mid z_n = m, \lambda)$ (the prior is a discrete distribution)

FA (continuous latent variables): $p(\mathbf{x}_n \mid \lambda) = \int p(\mathbf{z}_n \mid \lambda)\, p(\mathbf{x}_n \mid \mathbf{z}_n, \lambda)\, \mathrm{d}\mathbf{z}_n$ (the prior is a Gaussian distribution)

Page 29: Sequential Data Modeling

Model Training (Parameter Optimization)

Page 30: Sequential Data Modeling

Maximum Likelihood (ML) Estimation

• Log-scaled likelihood function:

$\ln p(\mathbf{X} \mid \lambda) = \sum_{n=1}^{N} \ln \int p(\mathbf{x}_n, \mathbf{z}_n \mid \lambda)\, \mathrm{d}\mathbf{z}_n = \sum_{n=1}^{N} \ln \mathcal{N}(\mathbf{x}_n;\, \boldsymbol{\mu},\, \mathbf{W}\mathbf{W}^\top + \boldsymbol{\Sigma})$

• ML estimates of the model parameters:

Mean vector: $\left.\frac{\partial \ln p(\mathbf{X} \mid \lambda)}{\partial \boldsymbol{\mu}}\right|_{\lambda = \hat{\lambda}} = \mathbf{0}$ (a linear equation!) gives $\hat{\boldsymbol{\mu}} = \frac{1}{N}\sum_{n=1}^{N} \mathbf{x}_n$

Loading matrix: $\left.\frac{\partial \ln p(\mathbf{X} \mid \lambda)}{\partial \mathbf{W}}\right|_{\lambda = \hat{\lambda}} = \mathbf{0}$ (nonlinear equations…) $\hat{\mathbf{W}} = \,?$

Covariance matrix: $\left.\frac{\partial \ln p(\mathbf{X} \mid \lambda)}{\partial \boldsymbol{\Sigma}}\right|_{\lambda = \hat{\lambda}} = \mathbf{0}$ (nonlinear equations…) $\hat{\boldsymbol{\Sigma}} = \,?$

How to determine the ML estimates of these parameters?

Page 31: Sequential Data Modeling

Lower Bound of Likelihood Function

• Derivation of a lower bound of the log-scaled likelihood function, using a probability density function $q(\mathbf{z}_n)$ of the latent variables and Jensen's inequality:

$\ln p(\mathbf{X} \mid \lambda) = \sum_{n=1}^{N} \ln \int p(\mathbf{x}_n, \mathbf{z}_n \mid \lambda)\, \mathrm{d}\mathbf{z}_n = \sum_{n=1}^{N} \ln \int q(\mathbf{z}_n) \frac{p(\mathbf{x}_n, \mathbf{z}_n \mid \lambda)}{q(\mathbf{z}_n)}\, \mathrm{d}\mathbf{z}_n \geq \sum_{n=1}^{N} \int q(\mathbf{z}_n) \ln \frac{p(\mathbf{x}_n, \mathbf{z}_n \mid \lambda)}{q(\mathbf{z}_n)}\, \mathrm{d}\mathbf{z}_n = \mathcal{L}(q, \lambda)$

Lower bound: $\mathcal{L}(q, \lambda) = \sum_{n=1}^{N} \int q(\mathbf{z}_n) \ln \frac{p(\mathbf{x}_n, \mathbf{z}_n \mid \lambda)}{q(\mathbf{z}_n)}\, \mathrm{d}\mathbf{z}_n$

Page 32: Sequential Data Modeling

EM Algorithm

• Maximization of the lower bound (a functional of $q$ and a function of $\lambda$):

$\mathcal{L}(q, \lambda) = \ln p(\mathbf{X} \mid \lambda) - \sum_{n=1}^{N} \mathrm{KL}\left[\, q(\mathbf{z}_n) \,\|\, p(\mathbf{z}_n \mid \mathbf{x}_n, \lambda) \,\right]$

Maximize the lower bound with respect to $q$: drive the KL divergence to 0, so that the bound touches the log-scaled likelihood $\ln p(\mathbf{X} \mid \lambda)$.

Maximize the lower bound with respect to $\lambda$ (= {$\mathbf{W}$, $\boldsymbol{\Sigma}$}): only the auxiliary function

$\sum_{n=1}^{N} \int q(\mathbf{z}_n) \ln p(\mathbf{x}_n, \mathbf{z}_n \mid \lambda)\, \mathrm{d}\mathbf{z}_n$

depends on $\lambda$.

Page 33: Sequential Data Modeling

Review: Schematic Image of EM Algorithm

[Figure: the log-scaled likelihood function $\ln p(\mathbf{X} \mid \lambda)$ plotted over $\lambda$, with the lower bounds $\mathcal{L}(q^{(i)}, \lambda)$ and $\mathcal{L}(q^{(i+1)}, \lambda)$ touching it at $\lambda^{(i)}$ and $\lambda^{(i+1)}$.]

0. Current model parameter set
1. E-step: determine the lower bound based on the current model parameters
2. M-step: update the model parameters based on the lower bound
3. E-step: repeat

Page 34: Sequential Data Modeling

E-Step: Update q

• Set the KL divergence to 0 under the fixed model parameters $\lambda^{\mathrm{old}}$:

$\hat{q}(\mathbf{z}_n) = p(\mathbf{z}_n \mid \mathbf{x}_n, \lambda^{\mathrm{old}})$, so that $\sum_{n=1}^{N} \mathrm{KL}\left[\, \hat{q}(\mathbf{z}_n) \,\|\, p(\mathbf{z}_n \mid \mathbf{x}_n, \lambda^{\mathrm{old}}) \,\right] = 0$ and $\mathcal{L}(\hat{q}, \lambda^{\mathrm{old}}) = \ln p(\mathbf{X} \mid \lambda^{\mathrm{old}})$.

Calculate the posterior probability density of the latent variables for each sample:

$p(\mathbf{z}_n \mid \mathbf{x}_n, \lambda^{\mathrm{old}}) = \dfrac{p(\mathbf{x}_n \mid \mathbf{z}_n, \lambda^{\mathrm{old}})\, p(\mathbf{z}_n \mid \lambda^{\mathrm{old}})}{\int p(\mathbf{x}_n \mid \mathbf{z}_n, \lambda^{\mathrm{old}})\, p(\mathbf{z}_n \mid \lambda^{\mathrm{old}})\, \mathrm{d}\mathbf{z}_n} = \frac{1}{Z}\, \mathcal{N}(\mathbf{x}_n;\, \mathbf{W}\mathbf{z}_n + \boldsymbol{\mu},\, \boldsymbol{\Sigma})\, \mathcal{N}(\mathbf{z}_n;\, \mathbf{0},\, \mathbf{I}) = \mathcal{N}(\mathbf{z}_n;\, ?,\, ?)$

Page 35: Sequential Data Modeling

Posterior Probability Density Function

$p(\mathbf{z}_n \mid \mathbf{x}_n, \lambda^{\mathrm{old}}) \propto \exp\left\{ -\frac{1}{2}\left( \mathbf{z}_n^\top \boldsymbol{\Sigma}^{(z|x)-1} \mathbf{z}_n - 2\, \mathbf{z}_n^\top \boldsymbol{\Sigma}^{(z|x)-1} \boldsymbol{\mu}_n^{(z|x)} \right) + \mathrm{const} \right\} \propto \exp\left\{ -\frac{1}{2} \left( \mathbf{z}_n - \boldsymbol{\mu}_n^{(z|x)} \right)^\top \boldsymbol{\Sigma}^{(z|x)-1} \left( \mathbf{z}_n - \boldsymbol{\mu}_n^{(z|x)} \right) \right\}$

(* See Appendix 1 for the derivation.)

Posterior probability density function of the latent variables:

$p(\mathbf{z}_n \mid \mathbf{x}_n, \lambda^{\mathrm{old}}) = \mathcal{N}(\mathbf{z}_n;\, \boldsymbol{\mu}_n^{(z|x)},\, \boldsymbol{\Sigma}^{(z|x)})$

Covariance matrix: $\boldsymbol{\Sigma}^{(z|x)} = \left( \mathbf{W}^\top \boldsymbol{\Sigma}^{-1} \mathbf{W} + \mathbf{I} \right)^{-1}$ (sample-independent; a full matrix)

Mean vector: $\boldsymbol{\mu}_n^{(z|x)} = \boldsymbol{\Sigma}^{(z|x)} \mathbf{W}^\top \boldsymbol{\Sigma}^{-1} \hat{\mathbf{x}}_n = \mathbf{A}\hat{\mathbf{x}}_n$, where $\hat{\mathbf{x}}_n = \mathbf{x}_n - \boldsymbol{\mu}$ (sample-dependent; a linear transformation of $\hat{\mathbf{x}}_n$)
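A numpy sketch of this E-step for a whole data set at once, assuming the hypothetical FA parameters used earlier; `fa_posterior` is an illustrative name, not from the slides.

```python
import numpy as np

def fa_posterior(X, W, mu, Sigma):
    """E-step of factor analysis: posterior N(z_n; mean_post[n], cov_post)
    for every observation, following the formulas on this slide."""
    Sigma_inv = np.linalg.inv(Sigma)
    M = W.shape[1]
    cov_post = np.linalg.inv(W.T @ Sigma_inv @ W + np.eye(M))   # sample-independent
    A = cov_post @ W.T @ Sigma_inv                              # linear transform
    mean_post = (X - mu) @ A.T                                  # row n = posterior mean of z_n
    return mean_post, cov_post

# Illustrative parameters (same hypothetical model as in the earlier sketches).
W = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.5]])
mu = np.array([2.0, -1.0, 0.5])
Sigma = np.diag([0.3, 0.2, 0.4])
X = np.array([[2.5, -0.5, 1.0], [1.0, -2.0, 0.0]])
mean_post, cov_post = fa_posterior(X, W, mu, Sigma)
```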

Page 36: Sequential Data Modeling

Schematic Image of E-Step

Posterior p.d.f. of the latent variables: $p(\mathbf{z}_n \mid \mathbf{x}_n, \lambda^{\mathrm{old}}) = \mathcal{N}(\mathbf{z}_n;\, \boldsymbol{\mu}_n^{(z|x)},\, \boldsymbol{\Sigma}^{(z|x)})$ with covariance matrix $\left( \mathbf{W}^\top \boldsymbol{\Sigma}^{-1} \mathbf{W} + \mathbf{I} \right)^{-1}$ and mean vector $\boldsymbol{\Sigma}^{(z|x)} \mathbf{W}^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x}_n - \boldsymbol{\mu})$.

It is computed for every observation data sample:
• 1st data sample $\mathbf{x}_1$: $p(\mathbf{z}_1 \mid \mathbf{x}_1, \lambda^{\mathrm{old}}) = \mathcal{N}(\mathbf{z}_1;\, \boldsymbol{\mu}_1^{(z|x)},\, \boldsymbol{\Sigma}^{(z|x)})$
• 2nd data sample $\mathbf{x}_2$: $p(\mathbf{z}_2 \mid \mathbf{x}_2, \lambda^{\mathrm{old}}) = \mathcal{N}(\mathbf{z}_2;\, \boldsymbol{\mu}_2^{(z|x)},\, \boldsymbol{\Sigma}^{(z|x)})$
• …

Page 37: Sequential Data Modeling

M-Step: Update λ

• Maximize the auxiliary function with respect to the model parameters $\lambda^{\mathrm{new}}$.

With $\hat{q}(\mathbf{z}_n) = p(\mathbf{z}_n \mid \mathbf{x}_n, \lambda^{\mathrm{old}})$ fixed, maximizing the lower bound $\mathcal{L}(\hat{q}, \lambda^{\mathrm{new}})$ amounts to maximizing the auxiliary function

$\mathcal{Q}(\lambda^{\mathrm{new}}, \lambda^{\mathrm{old}}) = \sum_{n=1}^{N} \int p(\mathbf{z}_n \mid \mathbf{x}_n, \lambda^{\mathrm{old}}) \ln p(\mathbf{x}_n, \mathbf{z}_n \mid \lambda^{\mathrm{new}})\, \mathrm{d}\mathbf{z}_n = \sum_{n=1}^{N} \int \mathcal{N}(\mathbf{z}_n;\, \boldsymbol{\mu}_n^{(z|x)},\, \boldsymbol{\Sigma}^{(z|x)}) \ln \mathcal{N}(\hat{\mathbf{x}}_n;\, \mathbf{W}\mathbf{z}_n,\, \boldsymbol{\Sigma})\, \mathrm{d}\mathbf{z}_n + \mathrm{const} = \,?$

Page 38: Sequential Data Modeling

Expansion of Auxiliary Function

(* See Appendix 2 for more details.)

$\mathcal{Q}(\lambda^{\mathrm{new}}, \lambda^{\mathrm{old}}) = \sum_{n=1}^{N} \int \mathcal{N}(\mathbf{z}_n;\, \boldsymbol{\mu}_n^{(z|x)},\, \boldsymbol{\Sigma}^{(z|x)}) \ln \mathcal{N}(\hat{\mathbf{x}}_n;\, \mathbf{W}\mathbf{z}_n,\, \boldsymbol{\Sigma})\, \mathrm{d}\mathbf{z}_n$

$= \sum_{n=1}^{N} \left[ -\frac{1}{2}\ln|\boldsymbol{\Sigma}| - \frac{1}{2}\hat{\mathbf{x}}_n^\top \boldsymbol{\Sigma}^{-1} \hat{\mathbf{x}}_n + \hat{\mathbf{x}}_n^\top \boldsymbol{\Sigma}^{-1} \mathbf{W} \langle \mathbf{z}_n \rangle - \frac{1}{2}\mathrm{tr}\left( \mathbf{W}^\top \boldsymbol{\Sigma}^{-1} \mathbf{W} \langle \mathbf{z}_n \mathbf{z}_n^\top \rangle \right) \right] + \mathrm{const}$

where $\langle \mathbf{z}_n \rangle$ and $\langle \mathbf{z}_n \mathbf{z}_n^\top \rangle$ are the expectations of $\mathbf{z}_n$ and $\mathbf{z}_n \mathbf{z}_n^\top$ under $\mathcal{N}(\mathbf{z}_n;\, \boldsymbol{\mu}_n^{(z|x)},\, \boldsymbol{\Sigma}^{(z|x)})$

$= -\frac{N}{2}\ln|\boldsymbol{\Sigma}| - \frac{1}{2}\mathrm{tr}\left( \boldsymbol{\Sigma}^{-1} \sum_{n=1}^{N} \hat{\mathbf{x}}_n \hat{\mathbf{x}}_n^\top \right) + \mathrm{tr}\left( \boldsymbol{\Sigma}^{-1} \mathbf{W} \sum_{n=1}^{N} \langle \mathbf{z}_n \rangle \hat{\mathbf{x}}_n^\top \right) - \frac{1}{2}\mathrm{tr}\left( \mathbf{W}^\top \boldsymbol{\Sigma}^{-1} \mathbf{W} \sum_{n=1}^{N} \langle \mathbf{z}_n \mathbf{z}_n^\top \rangle \right) + \mathrm{const}$

i.e., it depends on the data only through the # of samples $N$ and the summations of $\hat{\mathbf{x}}_n \hat{\mathbf{x}}_n^\top$, $\langle \mathbf{z}_n \rangle \hat{\mathbf{x}}_n^\top$, and $\langle \mathbf{z}_n \mathbf{z}_n^\top \rangle$.

Page 39: Sequential Data Modeling

Sufficient Statistics

Analytical calculation of the expectations:

$\langle \mathbf{z}_n \rangle = \int \mathbf{z}_n\, \mathcal{N}(\mathbf{z}_n;\, \boldsymbol{\mu}_n^{(z|x)},\, \boldsymbol{\Sigma}^{(z|x)})\, \mathrm{d}\mathbf{z}_n = \boldsymbol{\mu}_n^{(z|x)}$

$\langle \mathbf{z}_n \mathbf{z}_n^\top \rangle = \int \mathbf{z}_n \mathbf{z}_n^\top\, \mathcal{N}(\mathbf{z}_n;\, \boldsymbol{\mu}_n^{(z|x)},\, \boldsymbol{\Sigma}^{(z|x)})\, \mathrm{d}\mathbf{z}_n = \boldsymbol{\Sigma}^{(z|x)} + \boldsymbol{\mu}_n^{(z|x)} \boldsymbol{\mu}_n^{(z|x)\top}$

Sufficient statistics:
• # of samples: $N$
• Sum of squared samples: $\sum_{n=1}^{N} \hat{\mathbf{x}}_n \hat{\mathbf{x}}_n^\top$ (only its diagonal is needed for a diagonal $\boldsymbol{\Sigma}$)
• Sum of expectations of squared latent variables: $\sum_{n=1}^{N} \langle \mathbf{z}_n \mathbf{z}_n^\top \rangle$
• Sum of cross terms: $\sum_{n=1}^{N} \hat{\mathbf{x}}_n \langle \mathbf{z}_n \rangle^\top$

Page 40: Sequential Data Modeling

ML Estimates

Auxiliary function:

$\mathcal{Q}(\lambda^{\mathrm{new}}, \lambda^{\mathrm{old}}) = -\frac{N}{2}\ln|\boldsymbol{\Sigma}| - \frac{1}{2}\mathrm{tr}\left( \boldsymbol{\Sigma}^{-1} \sum_n \hat{\mathbf{x}}_n \hat{\mathbf{x}}_n^\top \right) + \mathrm{tr}\left( \boldsymbol{\Sigma}^{-1} \mathbf{W} \sum_n \langle \mathbf{z}_n \rangle \hat{\mathbf{x}}_n^\top \right) - \frac{1}{2}\mathrm{tr}\left( \mathbf{W}^\top \boldsymbol{\Sigma}^{-1} \mathbf{W} \sum_n \langle \mathbf{z}_n \mathbf{z}_n^\top \rangle \right) + \mathrm{const}$

ML estimate of the loading matrix (setting $\partial \mathcal{Q} / \partial \mathbf{W} = \mathbf{0}$, a linear equation!):

$\boldsymbol{\Sigma}^{-1} \sum_n \hat{\mathbf{x}}_n \langle \mathbf{z}_n \rangle^\top - \boldsymbol{\Sigma}^{-1} \hat{\mathbf{W}} \sum_n \langle \mathbf{z}_n \mathbf{z}_n^\top \rangle = \mathbf{0} \quad\Rightarrow\quad \hat{\mathbf{W}} = \left( \sum_n \hat{\mathbf{x}}_n \langle \mathbf{z}_n \rangle^\top \right) \left( \sum_n \langle \mathbf{z}_n \mathbf{z}_n^\top \rangle \right)^{-1}$

ML estimate of the covariance matrix (setting $\partial \mathcal{Q} / \partial \boldsymbol{\Sigma}^{-1} = \mathbf{0}$):

$\hat{\boldsymbol{\Sigma}} = \frac{1}{N}\, \mathrm{diag}\left\{ \sum_n \hat{\mathbf{x}}_n \hat{\mathbf{x}}_n^\top - \hat{\mathbf{W}} \sum_n \langle \mathbf{z}_n \rangle \hat{\mathbf{x}}_n^\top \right\}$
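Putting the E-step and M-step formulas together gives the following EM sketch for factor analysis with a diagonal noise covariance (written for this transcript, with invented names and toy data, not the original implementation). The final print compares the learned marginal covariance $\mathbf{W}\mathbf{W}^\top + \boldsymbol{\Sigma}$ with the one used to generate the data, since $\mathbf{W}$ itself is only identifiable up to a rotation of the factors.

```python
import numpy as np

def fa_em(X, n_factors, n_iter=100, seed=0):
    """EM for factor analysis with a diagonal noise covariance, following the
    E-step/M-step formulas in these slides (a sketch, not the original code)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    mu = X.mean(axis=0)                      # closed-form ML estimate of the mean
    Xc = X - mu                              # x_hat_n = x_n - mu
    W = rng.standard_normal((D, n_factors)) * 0.1
    Sigma = np.diag(Xc.var(axis=0))
    for _ in range(n_iter):
        # E-step: posterior moments <z_n> and sum_n <z_n z_n^T>
        Sigma_inv = np.linalg.inv(Sigma)
        cov_post = np.linalg.inv(W.T @ Sigma_inv @ W + np.eye(n_factors))
        Ez = Xc @ (cov_post @ W.T @ Sigma_inv).T          # row n = <z_n>
        sum_zz = N * cov_post + Ez.T @ Ez                 # sum_n <z_n z_n^T>
        sum_xz = Xc.T @ Ez                                # sum_n x_hat_n <z_n>^T
        # M-step: update W and the diagonal Sigma
        W = sum_xz @ np.linalg.inv(sum_zz)
        Sigma = np.diag(np.diag(Xc.T @ Xc - W @ sum_xz.T)) / N
    return mu, W, Sigma

# Toy usage on data generated from the hypothetical FA model used earlier.
rng = np.random.default_rng(4)
W_true = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.5]])
X = rng.standard_normal((2000, 2)) @ W_true.T + np.array([2.0, -1.0, 0.5]) \
    + rng.standard_normal((2000, 3)) * np.sqrt([0.3, 0.2, 0.4])
mu, W, Sigma = fa_em(X, n_factors=2)
print(np.round(W @ W.T + Sigma, 2))   # ~ W_true @ W_true.T + diag([0.3, 0.2, 0.4])
```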

Page 41: Sequential Data Modeling

App. 1: Derivation of Posterior p.d.f.

$p(\mathbf{x}_n, \mathbf{z}_n \mid \lambda^{\mathrm{old}}) = \mathcal{N}(\hat{\mathbf{x}}_n;\, \mathbf{W}\mathbf{z}_n,\, \boldsymbol{\Sigma})\, \mathcal{N}(\mathbf{z}_n;\, \mathbf{0},\, \mathbf{I}) \propto \exp\left\{ -\frac{1}{2} (\hat{\mathbf{x}}_n - \mathbf{W}\mathbf{z}_n)^\top \boldsymbol{\Sigma}^{-1} (\hat{\mathbf{x}}_n - \mathbf{W}\mathbf{z}_n) \right\} \exp\left\{ -\frac{1}{2} \mathbf{z}_n^\top \mathbf{z}_n \right\}$, where $\hat{\mathbf{x}}_n = \mathbf{x}_n - \boldsymbol{\mu}$.

Inside of exp( ), collecting the terms in $\mathbf{z}_n$:

$-\frac{1}{2}\left( \mathbf{z}_n^\top \mathbf{W}^\top \boldsymbol{\Sigma}^{-1} \mathbf{W} \mathbf{z}_n + \mathbf{z}_n^\top \mathbf{z}_n - 2\, \mathbf{z}_n^\top \mathbf{W}^\top \boldsymbol{\Sigma}^{-1} \hat{\mathbf{x}}_n \right) + \mathrm{const}$

$= -\frac{1}{2}\left( \mathbf{z}_n^\top \left( \mathbf{W}^\top \boldsymbol{\Sigma}^{-1} \mathbf{W} + \mathbf{I} \right) \mathbf{z}_n - 2\, \mathbf{z}_n^\top \mathbf{W}^\top \boldsymbol{\Sigma}^{-1} \hat{\mathbf{x}}_n \right) + \mathrm{const}$  [$\boldsymbol{\Sigma}^{(z|x)-1} = \mathbf{W}^\top \boldsymbol{\Sigma}^{-1} \mathbf{W} + \mathbf{I}$]

$= -\frac{1}{2}\left( \mathbf{z}_n^\top \boldsymbol{\Sigma}^{(z|x)-1} \mathbf{z}_n - 2\, \mathbf{z}_n^\top \boldsymbol{\Sigma}^{(z|x)-1} \boldsymbol{\Sigma}^{(z|x)} \mathbf{W}^\top \boldsymbol{\Sigma}^{-1} \hat{\mathbf{x}}_n \right) + \mathrm{const}$  [$\boldsymbol{\mu}_n^{(z|x)} = \boldsymbol{\Sigma}^{(z|x)} \mathbf{W}^\top \boldsymbol{\Sigma}^{-1} \hat{\mathbf{x}}_n$]

$= -\frac{1}{2}\left( \mathbf{z}_n^\top \boldsymbol{\Sigma}^{(z|x)-1} \mathbf{z}_n - 2\, \mathbf{z}_n^\top \boldsymbol{\Sigma}^{(z|x)-1} \boldsymbol{\mu}_n^{(z|x)} \right) + \mathrm{const}$

Appendix: 1

Page 42: Sequential Data Modeling

App. 2: Expansion of Auxiliary Function (1)

Auxiliary function:

$\mathcal{Q}(\lambda^{\mathrm{new}}, \lambda^{\mathrm{old}}) = \sum_{n=1}^{N} \int \mathcal{N}(\mathbf{z}_n;\, \boldsymbol{\mu}_n^{(z|x)},\, \boldsymbol{\Sigma}^{(z|x)}) \ln \mathcal{N}(\hat{\mathbf{x}}_n;\, \mathbf{W}\mathbf{z}_n,\, \boldsymbol{\Sigma})\, \mathrm{d}\mathbf{z}_n$

$= \sum_{n=1}^{N} \int \mathcal{N}(\mathbf{z}_n;\, \boldsymbol{\mu}_n^{(z|x)},\, \boldsymbol{\Sigma}^{(z|x)}) \left[ -\frac{1}{2}\ln|\boldsymbol{\Sigma}| - \frac{1}{2}\hat{\mathbf{x}}_n^\top \boldsymbol{\Sigma}^{-1} \hat{\mathbf{x}}_n + \mathbf{z}_n^\top \mathbf{W}^\top \boldsymbol{\Sigma}^{-1} \hat{\mathbf{x}}_n - \frac{1}{2}\mathrm{tr}\left( \mathbf{W}^\top \boldsymbol{\Sigma}^{-1} \mathbf{W}\, \mathbf{z}_n \mathbf{z}_n^\top \right) \right] \mathrm{d}\mathbf{z}_n + \mathrm{const}$ (a quadratic form in $\mathbf{z}_n$)

$= \sum_{n=1}^{N} \left[ -\frac{1}{2}\ln|\boldsymbol{\Sigma}| - \frac{1}{2}\hat{\mathbf{x}}_n^\top \boldsymbol{\Sigma}^{-1} \hat{\mathbf{x}}_n + \langle \mathbf{z}_n \rangle^\top \mathbf{W}^\top \boldsymbol{\Sigma}^{-1} \hat{\mathbf{x}}_n - \frac{1}{2}\mathrm{tr}\left( \mathbf{W}^\top \boldsymbol{\Sigma}^{-1} \mathbf{W} \langle \mathbf{z}_n \mathbf{z}_n^\top \rangle \right) \right] + \mathrm{const}$

where $\langle \mathbf{z}_n \rangle$ is the expectation of $\mathbf{z}_n$ and $\langle \mathbf{z}_n \mathbf{z}_n^\top \rangle$ is the expectation of $\mathbf{z}_n \mathbf{z}_n^\top$ under $\mathcal{N}(\mathbf{z}_n;\, \boldsymbol{\mu}_n^{(z|x)},\, \boldsymbol{\Sigma}^{(z|x)})$.

Appendix: 2

Page 43: Sequential Data Modeling

App. 2: Expansion of Auxiliary Function (2)

Analytical calculation of the expectations:

$\langle \mathbf{z}_n \rangle = \int \mathbf{z}_n\, \mathcal{N}(\mathbf{z}_n;\, \boldsymbol{\mu}_n^{(z|x)},\, \boldsymbol{\Sigma}^{(z|x)})\, \mathrm{d}\mathbf{z}_n = \boldsymbol{\mu}_n^{(z|x)}, \qquad \langle \mathbf{z}_n \mathbf{z}_n^\top \rangle = \int \mathbf{z}_n \mathbf{z}_n^\top\, \mathcal{N}(\mathbf{z}_n;\, \boldsymbol{\mu}_n^{(z|x)},\, \boldsymbol{\Sigma}^{(z|x)})\, \mathrm{d}\mathbf{z}_n = \boldsymbol{\Sigma}^{(z|x)} + \boldsymbol{\mu}_n^{(z|x)} \boldsymbol{\mu}_n^{(z|x)\top}$

$\mathcal{Q}(\lambda^{\mathrm{new}}, \lambda^{\mathrm{old}}) = \sum_{n=1}^{N} \left[ -\frac{1}{2}\ln|\boldsymbol{\Sigma}| - \frac{1}{2}\hat{\mathbf{x}}_n^\top \boldsymbol{\Sigma}^{-1} \hat{\mathbf{x}}_n + \langle \mathbf{z}_n \rangle^\top \mathbf{W}^\top \boldsymbol{\Sigma}^{-1} \hat{\mathbf{x}}_n - \frac{1}{2}\mathrm{tr}\left( \mathbf{W}^\top \boldsymbol{\Sigma}^{-1} \mathbf{W} \langle \mathbf{z}_n \mathbf{z}_n^\top \rangle \right) \right] + \mathrm{const}$ (a quadratic form in $\hat{\mathbf{x}}_n$)

$= -\frac{N}{2}\ln|\boldsymbol{\Sigma}| - \frac{1}{2}\mathrm{tr}\left( \boldsymbol{\Sigma}^{-1} \sum_{n=1}^{N} \hat{\mathbf{x}}_n \hat{\mathbf{x}}_n^\top \right) + \mathrm{tr}\left( \boldsymbol{\Sigma}^{-1} \mathbf{W} \sum_{n=1}^{N} \langle \mathbf{z}_n \rangle \hat{\mathbf{x}}_n^\top \right) - \frac{1}{2}\mathrm{tr}\left( \mathbf{W}^\top \boldsymbol{\Sigma}^{-1} \mathbf{W} \sum_{n=1}^{N} \langle \mathbf{z}_n \mathbf{z}_n^\top \rangle \right) + \mathrm{const}$

i.e., expressed through the # of samples $N$ and the summations of $\hat{\mathbf{x}}_n \hat{\mathbf{x}}_n^\top$, $\langle \mathbf{z}_n \rangle \hat{\mathbf{x}}_n^\top$, and $\langle \mathbf{z}_n \mathbf{z}_n^\top \rangle$.

Appendix: 3

Page 44: Sequential Data Modeling

App. 2: Expansion of Auxiliary Function (3)

Sufficient statistics:
• # of samples: $N$
• Sum of squared samples: $\sum_{n=1}^{N} \hat{\mathbf{x}}_n \hat{\mathbf{x}}_n^\top$ (only its diagonal is needed for a diagonal $\boldsymbol{\Sigma}$)
• Sum of expectations of squared latent variables: $\sum_{n=1}^{N} \langle \mathbf{z}_n \mathbf{z}_n^\top \rangle$
• Sum of cross terms: $\sum_{n=1}^{N} \hat{\mathbf{x}}_n \langle \mathbf{z}_n \rangle^\top$

Auxiliary function:

$\mathcal{Q}(\lambda^{\mathrm{new}}, \lambda^{\mathrm{old}}) = -\frac{N}{2}\ln|\boldsymbol{\Sigma}| - \frac{1}{2}\mathrm{tr}\left( \boldsymbol{\Sigma}^{-1} \sum_n \hat{\mathbf{x}}_n \hat{\mathbf{x}}_n^\top \right) + \mathrm{tr}\left( \boldsymbol{\Sigma}^{-1} \mathbf{W} \sum_n \langle \mathbf{z}_n \rangle \hat{\mathbf{x}}_n^\top \right) - \frac{1}{2}\mathrm{tr}\left( \mathbf{W}^\top \boldsymbol{\Sigma}^{-1} \mathbf{W} \sum_n \langle \mathbf{z}_n \mathbf{z}_n^\top \rangle \right) + \mathrm{const}$

Appendix: 4