1
Modeling Long Distance Dependence in Language: Topic Mixtures Versus Dynamic Cache Models
Rukmini M. Iyer, Mari Ostendorf
2
Introduction
• Statistical language models – part of state-of-the-art speech recognizers
• Word sequence: <s> w1,w2,…,wT </s>
• Consider the word sequence to be a Markov process
• N-gram Probability:
P(w_1, w_2, \ldots, w_T) = \prod_{i=1}^{T} P(w_i \mid w_{i-1}, \ldots, w_{i-n+1})
3
Introduction
• Use bigram representation
  – Bigram probability
  – Trigram in experiments
• Although n-gram models are powerful, they are constrained to order n
  – What about dependencies longer than n?
• The problem of representing long distance dependencies has been explored.
P(w_1, w_2, \ldots, w_T) = \prod_{i=1}^{T} P(w_i \mid w_{i-1})
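• A minimal sketch of scoring a sentence with the bigram form above (the `bigram_prob` table and the flooring constant are illustrative assumptions, not part of the paper):

```python
import math

def sentence_logprob(words, bigram_prob, bos="<s>", eos="</s>"):
    """Sum of log P(w_i | w_{i-1}) over a sentence padded with <s> and </s>."""
    seq = [bos] + list(words) + [eos]
    logp = 0.0
    for prev, cur in zip(seq, seq[1:]):
        # bigram_prob: hypothetical dict mapping (prev, cur) -> probability;
        # a real model would back off to the unigram instead of a tiny floor.
        logp += math.log(bigram_prob.get((prev, cur), 1e-10))
    return logp
```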
4
Introduction
• Sentence-level:
  – Words can co-occur in a sentence due to topic or grammatical dependencies
• Article or conversation-level:
  – The subject of an article or a conversation
• Dynamic cache language models address the second issue
  – Increase the likelihood of a word given that it has been observed previously in the article
• Other methods:
  – Trigger language models
  – Context-free grammars
5
Mixture Model Framework
• N-gram level:
  P(w_1, \ldots, w_T) = \prod_{i=1}^{T} \sum_{k=1}^{m} \lambda_k P_k(w_i \mid w_{i-1})
• Sentence level:
  P(w_1, \ldots, w_T) = \sum_{k=1}^{m} \lambda_k \prod_{i=1}^{T} P_k(w_i \mid w_{i-1})
P_k: bigram model for the k-th class
\lambda_k: mixture weight
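• A minimal sketch of the two mixing levels (hypothetical per-topic bigram tables `topic_bigrams[k]` and weights `lam[k]` are assumed, not taken from the paper): the n-gram-level mixture interpolates inside the product over words, while the sentence-level mixture interpolates whole-sentence likelihoods.

```python
import math

def ngram_level_mixture_logprob(seq, topic_bigrams, lam):
    """log prod_i sum_k lam[k] * P_k(w_i | w_{i-1}) for a padded word sequence."""
    logp = 0.0
    for prev, cur in zip(seq, seq[1:]):
        logp += math.log(sum(l * p.get((prev, cur), 1e-10)
                             for l, p in zip(lam, topic_bigrams)))
    return logp

def sentence_level_mixture_logprob(seq, topic_bigrams, lam):
    """log sum_k lam[k] * prod_i P_k(w_i | w_{i-1}), computed with log-sum-exp."""
    per_topic = [sum(math.log(p.get((prev, cur), 1e-10))
                     for prev, cur in zip(seq, seq[1:]))
                 for p in topic_bigrams]
    mx = max(lp + math.log(l) for lp, l in zip(per_topic, lam))
    return mx + math.log(sum(math.exp(lp + math.log(l) - mx)
                             for lp, l in zip(per_topic, lam)))
```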
6
Mixture Model Framework
• Two main issues:
  – Automatic clustering to handle data not explicitly marked for topic dependence
  – Robust parameter estimation
7
Clustering
• Specified by hand or determined automatically
• Agglomerative clustering
  – Partition into the desired number of topics
• Performed at the article level
  – Reduces computation
• The initial clusters can be singleton units of data
8
Clustering
1. Let the desired number of clusters be C* and the initial number of clusters be C
2. Find the best pair of clusters, say Ai and Aj, that maximizes a similarity criterion Sij
3. Merge Ai and Aj and decrement C
4. If C = C*, then stop; otherwise go to Step 2
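• A minimal sketch of the merge loop above, parameterized by a `similarity` callback (an idf-based candidate follows the next slide); the structure and names are illustrative, not the authors' implementation:

```python
def agglomerative_cluster(articles, target_clusters, similarity):
    """Greedy agglomerative clustering: repeatedly merge the most similar pair.

    articles: list of token lists; each article starts as a singleton cluster.
    similarity: function(cluster_a, cluster_b, all_clusters) -> S_ij score.
    """
    clusters = [set(a) for a in articles]          # word set per cluster
    members = [[i] for i in range(len(articles))]  # article indices per cluster
    while len(clusters) > target_clusters:
        # Step 2: find the pair (i, j) maximizing the similarity criterion S_ij
        i, j = max(((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
                   key=lambda ab: similarity(clusters[ab[0]], clusters[ab[1]], clusters))
        # Step 3: merge A_i and A_j and decrement the number of clusters
        clusters[i] |= clusters[j]
        members[i] += members[j]
        del clusters[j], members[j]
    return members
```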
9
Clustering
• Similarity measure:
  – Set-intersection measure
  – Combination of inverse document frequencies
  S_{ij} = \frac{1}{N_{ij}} \sum_{w \in A_i \cap A_j} \frac{1}{|A_w|}
|A_i|: the number of unique words in cluster i
|A_w|: the number of clusters containing the word w
N_{ij}: the normalization factor
• Advantage of the idf measure:
  – High-frequency words, such as function words, are automatically discounted
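• One possible reading of the idf-weighted set-intersection measure as code (a sketch: taking N_ij as the product of the two cluster sizes is an assumption made here for illustration, not the paper's stated normalization). It plugs into the `similarity` argument of the clustering sketch above.

```python
def idf_set_similarity(cluster_a, cluster_b, all_clusters):
    """Sum of 1/|A_w| over words in the intersection, divided by a normalizer.

    cluster_a, cluster_b: word sets; all_clusters: every current word set,
    used to compute |A_w| = number of clusters containing word w.
    """
    common = cluster_a & cluster_b
    if not common:
        return 0.0
    score = sum(1.0 / sum(1 for c in all_clusters if w in c) for w in common)
    return score / (len(cluster_a) * len(cluster_b))   # assumed N_ij
```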
10
Robust Parameter Estimation
• Once the topic-dependent clusters are obtained, the parameters of the m components of the mixture can be estimated.
• Two issues:
  – Initial clusters may not be optimal
    • EM
    • Viterbi-style
  – Sparse data problems
    • Double mixtures:
      – Sentence level
      – N-gram level
11
Iterative Topic Model Reestimation
• Each component model in the sentence-level mixture is a conventional n-gram model
• Some back-off smoothing technique is used
  – Witten-Bell technique
• The clustering method is simple, so articles may be clustered incorrectly and parameters may not be robust
  – Iteratively reestimate the topic models
12
Iterative Topic Model Reestimation
• Viterbi-style training technique:
  – Analogous to vector quantization codebook design with the LBG algorithm
  – Maximum likelihood plays the role of the minimum-distance criterion
  – http://www.data-compression.com/vq.shtml#lbg
• Stopping criterion: the sizes of the m clusters become nearly constant
\hat{k} = \arg\max_k \lambda_k P_k(w_1, w_2, \ldots, w_T) = \arg\max_k \lambda_k \prod_{i=1}^{T} P_k(w_i \mid w_{i-1})
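• A rough sketch of the Viterbi-style loop (the helpers `train_bigram` and `article_logprob` are assumed, e.g. built from the bigram scoring sketch earlier); each article is reassigned to the topic maximizing its weighted likelihood, and training stops once the assignments stabilize:

```python
import math

def viterbi_style_training(articles, assignments, num_topics,
                           train_bigram, article_logprob, max_iters=20):
    """Iteratively retrain topic models and hard-reassign articles to topics.

    articles: list of articles (each a list of sentences).
    assignments: initial topic index per article (e.g. from clustering).
    train_bigram(list_of_articles) -> model; article_logprob(article, model) -> float.
    """
    for _ in range(max_iters):
        models, log_lams = [], []
        for k in range(num_topics):
            docs = [a for a, z in zip(articles, assignments) if z == k]
            models.append(train_bigram(docs))
            # mixture weight taken as the cluster fraction (guarded against 0)
            log_lams.append(math.log(max(len(docs), 1) / len(articles)))
        # reassign: k_hat = argmax_k lambda_k * P_k(article)
        new_assignments = [max(range(num_topics),
                               key=lambda k: log_lams[k] + article_logprob(a, models[k]))
                           for a in articles]
        if new_assignments == assignments:   # cluster membership stable -> stop
            break
        assignments = new_assignments
    return assignments, models
```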
13
Iterative Topic Model Reestimation
• EM algorithm:
  – Incomplete data: we do not know exactly which topic cluster a certain sentence belongs to
  – Theoretically better than the Viterbi algorithm, although there is little difference in practice
  – Potentially more robust
  – Complicated for language model training
    • Witten-Bell back-off scheme
14
Iterative Topic Model Reestimation
• E-step: compute the expected likelihood of every sentence in the training corpus as belonging to each of the m topics
• For a bigram model at the p-th iteration, the likelihood of the i-th training sentence belonging to the j-th topic is given by:
\hat{z}_{ij}^{(p)} = P\!\left(z(i) = j \mid y_i, \phi^{(p)}\right)
 = \frac{P\!\left(y_i \mid z(i) = j, \phi^{(p)}\right) P\!\left(z(i) = j \mid \phi^{(p)}\right)}{\sum_{k=1}^{m} P\!\left(y_i \mid z(i) = k, \phi^{(p)}\right) P\!\left(z(i) = k \mid \phi^{(p)}\right)}
 = \frac{\lambda_j^{(p)} P_j^{(p)}(y_i)}{\sum_{k=1}^{m} \lambda_k^{(p)} P_k^{(p)}(y_i)}
z(i): the class index of the i-th sentence y_i
\phi^{(p)}: the combined set of parameters for the class prior and conditional distributions at iteration p
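• A minimal sketch of this E-step for one sentence (the `sentence_logprob` helper is assumed, as sketched earlier); it returns the posterior \hat{z}_{ij} over all topics, computed in log space:

```python
import math

def topic_posteriors(sentence, topic_models, lam, sentence_logprob):
    """hat z_ij = lam_j P_j(y_i) / sum_k lam_k P_k(y_i) for one sentence y_i."""
    log_terms = [math.log(l) + sentence_logprob(sentence, m)
                 for l, m in zip(lam, topic_models)]
    mx = max(log_terms)                               # log-sum-exp normalization
    denom = sum(math.exp(t - mx) for t in log_terms)
    return [math.exp(t - mx) / denom for t in log_terms]
```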
15
Iterative Topic Model Reestimation
• Derivation:
( )
( )
( ) ( )
( ) ( )
( )( )
( ) ( ) ( ) ( )
( )( )
1
( )( ) ( )
( ) ( ) ( )
1 1
ˆ | ,
, | ,
|
| |
,
( )|
| ( )
p
p
p pij i i
p pi i i i
ppii
p p p pi i i i i i
mppi
i ik
pp pj i ji i i
m mp p p
i i i k i kk k
z P z j y
P z j y P z j y
P yP y
P y z j P z j P y z j P z j
P yP y z k
P yP y z j P z j
P y z k P z k P y
16
Iterative Topic Model Reestimation
• M-step:
• Bigram (w_b, w_c) probability for the j-th topic model:
  P_j(w_c \mid w_b) = \gamma_b^{j}\, P_j^{ML}(w_c \mid w_b) + \left(1 - \gamma_b^{j}\right) P_j^{ML}(w_c)
n_i(b,c): the number of occurrences of the bigram (w_b, w_c) in the i-th sentence
N: the number of training sentences
17
Iterative Topic Model Reestimation
• Unigram back-off:
\gamma_b^{j} = \frac{\sum_{q} \sum_{i=1}^{N} \hat{z}_{ij}^{(p)}\, n_i(b,q)}{\sum_{q} \sum_{i=1}^{N} \hat{z}_{ij}^{(p)}\, n_i(b,q) + \sum_{q} \mathbf{1}\!\left[\sum_{i=1}^{N} \hat{z}_{ij}^{(p)}\, n_i(b,q) > 0\right]}
n_i(b,c): the number of occurrences of the bigram (w_b, w_c) in the i-th sentence
N: the number of training sentences
18
Iterative Topic Model Reestimation
• Meaning: Witten-Bell
• If there is only one topic (eliminate \hat{z}_{ij}^{(p)} and the topic index j), the weight reduces to the standard Witten-Bell form:
  \gamma_b = \frac{\sum_{q} \sum_{i=1}^{N} n_i(b,q)}{\sum_{q} \sum_{i=1}^{N} n_i(b,q) + \sum_{q} \mathbf{1}\!\left[\sum_{i=1}^{N} n_i(b,q) > 0\right]}
  – i.e., the total count of bigrams starting with w_b, divided by that count plus the number of distinct words observed after w_b
19
Iterative Topic Model Reestimation
• MLE for the bigram (w_b, w_c):
  P_j^{ML}(w_c \mid w_b) = \frac{\sum_{i=1}^{N} \hat{z}_{ij}^{(p)}\, n_i(b,c)}{\sum_{q} \sum_{i=1}^{N} \hat{z}_{ij}^{(p)}\, n_i(b,q)}
• Unigram w_c:
  P_j(w_c) = \frac{\sum_{i=1}^{N} \hat{z}_{ij}^{(p)}\, n_i(c)}{\sum_{q} \sum_{i=1}^{N} \hat{z}_{ij}^{(p)}\, n_i(q)}
n_i(c): the number of occurrences of the word w_c in the i-th sentence
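• A minimal sketch of this M-step for a single topic j (the posteriors `z_hat[i][j]` come from an E-step like the sketch above; sentences are assumed to be padded token lists). It accumulates fractional counts and forms the ML estimates; the Witten-Bell interpolation of the previous slides would then combine them:

```python
from collections import defaultdict

def mstep_topic_bigram(sentences, z_hat, j):
    """Fractional-count ML bigram and unigram estimates for topic j."""
    big = defaultdict(float)    # sum_i z_ij * n_i(b, c)
    ctx = defaultdict(float)    # sum_q sum_i z_ij * n_i(b, q)
    uni = defaultdict(float)    # weighted unigram counts
    total = 0.0
    for i, sent in enumerate(sentences):
        w = z_hat[i][j]
        for prev, cur in zip(sent, sent[1:]):
            big[(prev, cur)] += w
            ctx[prev] += w
        for tok in sent:
            uni[tok] += w
            total += w
    bigram_ml = {bc: c / ctx[bc[0]] for bc, c in big.items()}
    unigram_ml = {tok: c / total for tok, c in uni.items()}
    # Witten-Bell smoothing would interpolate bigram_ml with unigram_ml using
    # the context-dependent weight gamma_b^j from the preceding slides.
    return bigram_ml, unigram_ml
```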
20
Topic Model Smoothing
• Smoothing due to sparse data:
  – Interpolate with a general model at the n-gram level
• To handle nontopic sentences:
  – Include a general model in addition to the m component models at the sentence level
P(w_1, \ldots, w_T) = \sum_{k=1,\ldots,m,G} \lambda_k \prod_{i=1}^{T} \left[\theta_k P_k(w_i \mid w_{i-1}) + (1 - \theta_k) P_G(w_i \mid w_{i-1})\right]
\theta_k: n-gram level smoothing weights
\lambda_k: sentence-level mixture weights
P_G: general model trained on all the available training data
21
Topic Model Smoothing
• Weights are estimated separately using held-out data
\theta_k^{new} = \frac{1}{\sum_{l=1}^{N} T_l} \sum_{l=1}^{N} \sum_{i=1}^{T_l} \frac{\theta_k^{old} P_k(w_{l,i} \mid w_{l,i-1})}{\theta_k^{old} P_k(w_{l,i} \mid w_{l,i-1}) + (1 - \theta_k^{old}) P_G(w_{l,i} \mid w_{l,i-1})}
\lambda_k^{new} = \frac{1}{N} \sum_{i=1}^{N} \frac{\lambda_k^{old} P_k(w_1, \ldots, w_{T_i})}{\sum_{j=1,\ldots,m,G} \lambda_j^{old} P_j(w_1, \ldots, w_{T_i})}
N: the number of held-out sentences; T_l: the length of the l-th sentence; w_{l,i}: the i-th word of the l-th sentence
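• A minimal sketch of the sentence-level weight update on held-out data (reusing an assumed `sentence_logprob` helper); each iteration averages the per-sentence topic posteriors, which is the \lambda update above:

```python
import math

def reestimate_mixture_weights(heldout, topic_models, lam, sentence_logprob, iters=10):
    """EM reestimation of sentence-level mixture weights on held-out sentences."""
    for _ in range(iters):
        acc = [0.0] * len(lam)
        for sent in heldout:
            log_terms = [math.log(l) + sentence_logprob(sent, m)
                         for l, m in zip(lam, topic_models)]
            mx = max(log_terms)
            denom = sum(math.exp(t - mx) for t in log_terms)
            for k, t in enumerate(log_terms):
                acc[k] += math.exp(t - mx) / denom   # posterior of component k
        lam = [a / len(heldout) for a in acc]
    return lam
```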
22
Dynamic adaptation
• The parameters (\lambda_k, P(w_i \mid w_{i-1})) of the sentence-level mixture model are not updated as we observe new sentences.
• Dynamic language modeling tries to capture short-term fluctuations in word frequencies.
  – Cache or trigger model (word frequency)
  – Dynamic mixture weight adaptation (topic)
23
Dynamic Cache Adaptation
• The cache maintains a list of counts of recently observed n-grams
• Static language model probabilities are interpolated with the cache n-gram probabilities.
• Specific topic-related fluctuations can be captured by using a dynamic cache model if the sublanguage reflects style.
P(w_1, \ldots, w_T) = \prod_{i=1}^{T} \left[(1 - \mu) P^{s}(w_i \mid w_{i-1}) + \mu P^{c}(w_i \mid w_{i-1})\right]
\mu: the weight for the cache model
P^{c}: the cache model parameters
P^{s}: the smoothed static model parameters
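• A minimal sketch of cache interpolation (a unigram cache is used here for simplicity, and the names are illustrative); recently observed words are counted, the cache is flushed at document boundaries, and the static probability is interpolated with the cache probability as in the equation above:

```python
from collections import Counter

class UnigramCache:
    """Cache of recently observed words, interpolated with a static model."""

    def __init__(self, mu):
        self.mu = mu              # weight for the cache model
        self.counts = Counter()
        self.total = 0

    def observe(self, word):
        self.counts[word] += 1
        self.total += 1

    def flush(self):              # call at document boundaries
        self.counts.clear()
        self.total = 0

    def prob(self, word, prev, static_bigram_prob):
        """(1 - mu) * P_s(word | prev) + mu * P_c(word)."""
        cache_p = self.counts[word] / self.total if self.total else 0.0
        return (1 - self.mu) * static_bigram_prob(word, prev) + self.mu * cache_p
```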
24
Dynamic Cache Adaptation
• Two important issues for dynamic language modeling with a cache:
  – The definition of the cache
  – The mechanism for choosing the interpolation weight
• The probabilities of content words vary with time within the global topic
• Small but consistent improvements over the frequency-based rare-word cache
25
Dynamic Cache Adaptation
• The equation for the adapted mixture model incorporating component caches:
  P(w_1, \ldots, w_T) = \sum_{k=1,\ldots,m,G} \lambda_k \prod_{i=1}^{T} \left[(1 - \mu) P_k^{s}(w_i \mid w_{i-1}) + \mu P_k^{c}(w_i \mid w_{i-1})\right]
P_k^{s}: the smoothed static model parameters of component k
P_k^{c}: the cache model parameters of component k
•μ: empirically estimated on a development test set to minimize WER
26
Dynamic Cache Adaptation
• Two approaches for updating the word counts:
  – The partially observed document is cached, and the cache is flushed between documents
  – A sliding window is used to determine the contents of the cache if the document or article is reasonably long
• The first approach is selected here
  – The articles are not long
27
Dynamic Cache Adaptation
• Extend cache-based n-gram adaptation to the sentence-level mixture model
• Fractional counts of words are assigned to each topic according to their relative likelihoods
• The likelihood of the k-th model given the i-th sentence y_i with words w_1, \ldots, w_{T_i}:
  \hat{z}_{ik} = P(z_i = k \mid y_i) = \frac{\lambda_k P_k(w_1, \ldots, w_{T_i})}{\sum_{j=1,\ldots,m,G} \lambda_j P_j(w_1, \ldots, w_{T_i})}
28
Dynamic Cache Adaptation
• Derivation:
\hat{z}_{ik} = P(z_i = k \mid y_i)
 = \frac{P(z_i = k, y_i)}{P(y_i)}
 = \frac{P(y_i \mid z_i = k)\, P(z_i = k)}{P(y_i)}
 = \frac{P(y_i \mid z_i = k)\, P(z_i = k)}{\sum_{j=1,\ldots,m,G} P(y_i \mid z_i = j)\, P(z_i = j)}
 = \frac{\lambda_k P_k(w_1, \ldots, w_{T_i})}{\sum_{j=1,\ldots,m,G} \lambda_j P_j(w_1, \ldots, w_{T_i})}
30
Mixture Weight Adaptation
• The sentence-level mixture weights are updated with each new observed sentence, to reflect the increased information on which to base the choice of weight.
\hat{\lambda}_k^{(N)} = \frac{1}{N} \sum_{i=1}^{N} \hat{z}_{ik} = \frac{(N-1)\, \hat{\lambda}_k^{(N-1)} + \hat{z}_{Nk}}{N}
\hat{z}_{ik} = P(z_i = k \mid y_i) = \frac{\lambda_k P_k(w_1, \ldots, w_{T_i})}{\sum_{j=1,\ldots,m,G} \lambda_j P_j(w_1, \ldots, w_{T_i})}
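• A minimal sketch of this running-average update (helper names are assumptions); after each newly observed sentence the weights move toward the sentence's topic posteriors:

```python
import math

def adapt_mixture_weights(lam, n_seen, sentence, topic_models, sentence_logprob):
    """lam_k^(N) = ((N - 1) * lam_k^(N-1) + z_hat_Nk) / N after sentence N."""
    log_terms = [math.log(l) + sentence_logprob(sentence, m)
                 for l, m in zip(lam, topic_models)]
    mx = max(log_terms)
    denom = sum(math.exp(t - mx) for t in log_terms)
    z_hat = [math.exp(t - mx) / denom for t in log_terms]
    n = n_seen + 1
    new_lam = [((n - 1) * l + z) / n for l, z in zip(lam, z_hat)]
    return new_lam, n
```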
31
Experiments
• Corpus:
  – NAB news, not marked for topics
  – Switchboard, 70 topics
• Recognition systems:
  – Boston University system
    • Non-parametric trajectory stochastic segment model
  – BBN Byblos system
    • Speaker-independent HMM system