1
Modeling Long Distance Dependence in Language: Topic Mixtures Versus Dynamic Cache Models
Rukmini M. Iyer, Mari Ostendorf
2
Introduction
• Statistical language models – part of state-of-the-art speech recognizers
• Word sequence: <s> w1,w2,…,wT </s>
• Consider the word sequence to be a Markov process
• N-gram Probability:
P(w_1, w_2, \ldots, w_T) = \prod_{i=1}^{T} P(w_i \mid w_{i-1}, \ldots, w_{i-n+1})
3
Introduction
• Use bigram representation
  – Bigram probability
  – Trigram in experiments
• Although n-gram models are powerful, they are constrained to order n
  – What about dependencies longer than n?
• The problem of representing long distance dependencies has been explored.
P(w_1, w_2, \ldots, w_T) = \prod_{i=1}^{T} P(w_i \mid w_{i-1})
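• A minimal sketch of scoring a sentence with the bigram form above (the `bigram_prob` table and the flooring constant are illustrative assumptions, not part of the paper):

```python
import math

def sentence_logprob(words, bigram_prob, bos="<s>", eos="</s>"):
    """Sum of log P(w_i | w_{i-1}) over a sentence padded with <s> and </s>."""
    seq = [bos] + list(words) + [eos]
    logp = 0.0
    for prev, cur in zip(seq, seq[1:]):
        # bigram_prob: hypothetical dict mapping (prev, cur) -> probability;
        # a real model would back off to the unigram instead of a tiny floor.
        logp += math.log(bigram_prob.get((prev, cur), 1e-10))
    return logp
```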
4
Introduction
• Sentence-level:
  – Words can co-occur in a sentence due to topic or grammatical dependencies
• Article or conversation-level:
  – The subject of an article or a conversation
• Dynamic cache language models address the second issue
  – Increase the likelihood of a word given that it has been observed previously in the article
• Other methods:
  – Trigger language models
  – Context-free grammars
5
Mixture Model Framework
• N-gram level:
  P(w_1, \ldots, w_T) = \prod_{i=1}^{T} \sum_{k=1}^{m} \lambda_k P_k(w_i \mid w_{i-1})
• Sentence level:
  P(w_1, \ldots, w_T) = \sum_{k=1}^{m} \lambda_k \prod_{i=1}^{T} P_k(w_i \mid w_{i-1})
P_k: bigram model for the k-th class
\lambda_k: mixture weight
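• A minimal sketch of the two mixing levels (hypothetical per-topic bigram tables `topic_bigrams[k]` and weights `lam[k]` are assumed, not taken from the paper): the n-gram-level mixture interpolates inside the product over words, while the sentence-level mixture interpolates whole-sentence likelihoods.

```python
import math

def ngram_level_mixture_logprob(seq, topic_bigrams, lam):
    """log prod_i sum_k lam[k] * P_k(w_i | w_{i-1}) for a padded word sequence."""
    logp = 0.0
    for prev, cur in zip(seq, seq[1:]):
        logp += math.log(sum(l * p.get((prev, cur), 1e-10)
                             for l, p in zip(lam, topic_bigrams)))
    return logp

def sentence_level_mixture_logprob(seq, topic_bigrams, lam):
    """log sum_k lam[k] * prod_i P_k(w_i | w_{i-1}), computed with log-sum-exp."""
    per_topic = [sum(math.log(p.get((prev, cur), 1e-10))
                     for prev, cur in zip(seq, seq[1:]))
                 for p in topic_bigrams]
    mx = max(lp + math.log(l) for lp, l in zip(per_topic, lam))
    return mx + math.log(sum(math.exp(lp + math.log(l) - mx)
                             for lp, l in zip(per_topic, lam)))
```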
6
Mixture Model Framework
• Two main issues:
  – Automatic clustering to handle data not explicitly marked for topic dependence
  – Robust parameter estimation
7
Clustering
• Specified by hand or determined automatically
• Agglomerative clustering
  – Partition into the desired number of topics
• Performed at the article level
  – Reduces computation
• The initial clusters can be singleton units of data
8
Clustering
1. Let the desired number of clusters be C* and the initial number of clusters be C
2. Find the best pair of clusters, say Ai and Aj, that maximizes a similarity criterion Sij
3. Merge Ai and Aj and decrement C
4. If C = C*, then stop; otherwise go to Step 2
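• A minimal sketch of the merge loop above, parameterized by a `similarity` callback (an idf-based candidate follows the next slide); the structure and names are illustrative, not the authors' implementation:

```python
def agglomerative_cluster(articles, target_clusters, similarity):
    """Greedy agglomerative clustering: repeatedly merge the most similar pair.

    articles: list of token lists; each article starts as a singleton cluster.
    similarity: function(cluster_a, cluster_b, all_clusters) -> S_ij score.
    """
    clusters = [set(a) for a in articles]          # word set per cluster
    members = [[i] for i in range(len(articles))]  # article indices per cluster
    while len(clusters) > target_clusters:
        # Step 2: find the pair (i, j) maximizing the similarity criterion S_ij
        i, j = max(((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
                   key=lambda ab: similarity(clusters[ab[0]], clusters[ab[1]], clusters))
        # Step 3: merge A_i and A_j and decrement the number of clusters
        clusters[i] |= clusters[j]
        members[i] += members[j]
        del clusters[j], members[j]
    return members
```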
9
Clustering
• Similarity measure:
  – Set-intersection measure
  – Combination of inverse document frequencies
  S_{ij} = \frac{1}{N_{ij}} \sum_{w \in A_i \cap A_j} \frac{1}{|A_w|}
|A_i|: the number of unique words in cluster i
|A_w|: the number of clusters containing the word w
N_{ij}: the normalization factor
• Advantage of the idf measure:
  – High-frequency words, such as function words, are automatically discounted
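• One possible reading of the idf-weighted set-intersection measure as code (a sketch: taking N_ij as the product of the two cluster sizes is an assumption made here for illustration, not the paper's stated normalization). It plugs into the `similarity` argument of the clustering sketch above.

```python
def idf_set_similarity(cluster_a, cluster_b, all_clusters):
    """Sum of 1/|A_w| over words in the intersection, divided by a normalizer.

    cluster_a, cluster_b: word sets; all_clusters: every current word set,
    used to compute |A_w| = number of clusters containing word w.
    """
    common = cluster_a & cluster_b
    if not common:
        return 0.0
    score = sum(1.0 / sum(1 for c in all_clusters if w in c) for w in common)
    return score / (len(cluster_a) * len(cluster_b))   # assumed N_ij
```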
10
Robust Parameter Estimation
• Once the topic-dependent clusters are obtained, the parameters of the m components of the mixture can be estimated.
• Two issues:
  – Initial clusters may not be optimal
    • EM
    • Viterbi-style
  – Sparse data problems
    • Double mixtures:
      – Sentence level
      – N-gram level
11
Iterative Topic Model Reestimation
• Each component model in the sentence-level mixture is a conventional n-gram model
• Some back-off smoothing technique is used
  – Witten-Bell technique
• The clustering method is simple, so articles may be clustered incorrectly and parameters may not be robust
  – Iteratively reestimate the topic models
12
Iterative Topic Model Reestimation
• Viterbi-style training technique:
  – Analogous to vector quantization codebook design with the LBG algorithm
  – Maximum likelihood plays the role of the minimum-distance criterion
  – http://www.data-compression.com/vq.shtml#lbg
• Stopping criterion: the sizes of the m clusters become nearly constant
\hat{k} = \arg\max_k \lambda_k P_k(w_1, w_2, \ldots, w_T) = \arg\max_k \lambda_k \prod_{i=1}^{T} P_k(w_i \mid w_{i-1})
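• A rough sketch of the Viterbi-style loop (the helpers `train_bigram` and `article_logprob` are assumed, e.g. built from the bigram scoring sketch earlier); each article is reassigned to the topic maximizing its weighted likelihood, and training stops once the assignments stabilize:

```python
import math

def viterbi_style_training(articles, assignments, num_topics,
                           train_bigram, article_logprob, max_iters=20):
    """Iteratively retrain topic models and hard-reassign articles to topics.

    articles: list of articles (each a list of sentences).
    assignments: initial topic index per article (e.g. from clustering).
    train_bigram(list_of_articles) -> model; article_logprob(article, model) -> float.
    """
    for _ in range(max_iters):
        models, log_lams = [], []
        for k in range(num_topics):
            docs = [a for a, z in zip(articles, assignments) if z == k]
            models.append(train_bigram(docs))
            # mixture weight taken as the cluster fraction (guarded against 0)
            log_lams.append(math.log(max(len(docs), 1) / len(articles)))
        # reassign: k_hat = argmax_k lambda_k * P_k(article)
        new_assignments = [max(range(num_topics),
                               key=lambda k: log_lams[k] + article_logprob(a, models[k]))
                           for a in articles]
        if new_assignments == assignments:   # cluster membership stable -> stop
            break
        assignments = new_assignments
    return assignments, models
```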
13
Iterative Topic Model Reestimation
• EM algorithm:
  – Incomplete data: we do not know exactly which topic cluster a certain sentence belongs to
  – Theoretically better than the Viterbi algorithm, although there is little difference in practice
  – Potentially more robust
  – Complicated for language model training
    • Witten-Bell back-off scheme
14
Iterative Topic Model Reestimation
• E-step: compute the expected likelihood of every sentence in the training corpus as belonging to each of the m topics
• For a bigram model at the p-th iteration, the likelihood of the i-th training sentence belonging to the j-th topic is given by:
\hat{z}_{ij}^{(p)} = P\!\left(z(i) = j \mid y_i, \phi^{(p)}\right)
 = \frac{P\!\left(y_i \mid z(i) = j, \phi^{(p)}\right) P\!\left(z(i) = j \mid \phi^{(p)}\right)}{\sum_{k=1}^{m} P\!\left(y_i \mid z(i) = k, \phi^{(p)}\right) P\!\left(z(i) = k \mid \phi^{(p)}\right)}
 = \frac{\lambda_j^{(p)} P_j^{(p)}(y_i)}{\sum_{k=1}^{m} \lambda_k^{(p)} P_k^{(p)}(y_i)}
z(i): the class index of the i-th sentence y_i
\phi^{(p)}: the combined set of parameters for the class prior and conditional distributions at iteration p
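• A minimal sketch of this E-step for one sentence (the `sentence_logprob` helper is assumed, as sketched earlier); it returns the posterior \hat{z}_{ij} over all topics, computed in log space:

```python
import math

def topic_posteriors(sentence, topic_models, lam, sentence_logprob):
    """hat z_ij = lam_j P_j(y_i) / sum_k lam_k P_k(y_i) for one sentence y_i."""
    log_terms = [math.log(l) + sentence_logprob(sentence, m)
                 for l, m in zip(lam, topic_models)]
    mx = max(log_terms)                               # log-sum-exp normalization
    denom = sum(math.exp(t - mx) for t in log_terms)
    return [math.exp(t - mx) / denom for t in log_terms]
```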
15
Iterative Topic Model Reestimation
• Derivation:
( )
( )
( ) ( )
( ) ( )
( )( )
( ) ( ) ( ) ( )
( )( )
1
( )( ) ( )
( ) ( ) ( )
1 1
ˆ | ,
, | ,
|
| |
,
( )|
| ( )
p
p
p pij i i
p pi i i i
ppii
p p p pi i i i i i
mppi
i ik
pp pj i ji i i
m mp p p
i i i k i kk k
z P z j y
P z j y P z j y
P yP y
P y z j P z j P y z j P z j
P yP y z k
P yP y z j P z j
P y z k P z k P y
16
Iterative Topic Model Reestimation
• M-step:
• Bigram (w_b, w_c) probability for the j-th topic model:
  P_j(w_c \mid w_b) = \gamma_b^{j}\, P_j^{ML}(w_c \mid w_b) + \left(1 - \gamma_b^{j}\right) P_j^{ML}(w_c)
n_i(b,c): the number of occurrences of the bigram (w_b, w_c) in the i-th sentence
N: the number of training sentences
17
Iterative Topic Model Reestimation
• Unigram back-off:
\gamma_b^{j} = \frac{\sum_{q} \sum_{i=1}^{N} \hat{z}_{ij}^{(p)}\, n_i(b,q)}{\sum_{q} \sum_{i=1}^{N} \hat{z}_{ij}^{(p)}\, n_i(b,q) + \sum_{q} \mathbf{1}\!\left[\sum_{i=1}^{N} \hat{z}_{ij}^{(p)}\, n_i(b,q) > 0\right]}
n_i(b,c): the number of occurrences of the bigram (w_b, w_c) in the i-th sentence
N: the number of training sentences
18
Iterative Topic Model Reestimation
• Meaning: Witten-Bell
• If there is only one topic (eliminate \hat{z}_{ij}^{(p)} and the topic index j), the weight reduces to the standard Witten-Bell form:
  \gamma_b = \frac{\sum_{q} \sum_{i=1}^{N} n_i(b,q)}{\sum_{q} \sum_{i=1}^{N} n_i(b,q) + \sum_{q} \mathbf{1}\!\left[\sum_{i=1}^{N} n_i(b,q) > 0\right]}
  – i.e., the total count of bigrams starting with w_b, divided by that count plus the number of distinct words observed after w_b
19
Iterative Topic Model Reestimation
• MLE for the bigram (w_b, w_c):
  P_j^{ML}(w_c \mid w_b) = \frac{\sum_{i=1}^{N} \hat{z}_{ij}^{(p)}\, n_i(b,c)}{\sum_{q} \sum_{i=1}^{N} \hat{z}_{ij}^{(p)}\, n_i(b,q)}
• Unigram w_c:
  P_j(w_c) = \frac{\sum_{i=1}^{N} \hat{z}_{ij}^{(p)}\, n_i(c)}{\sum_{q} \sum_{i=1}^{N} \hat{z}_{ij}^{(p)}\, n_i(q)}
n_i(c): the number of occurrences of the word w_c in the i-th sentence
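• A minimal sketch of this M-step for a single topic j (the posteriors `z_hat[i][j]` come from an E-step like the sketch above; sentences are assumed to be padded token lists). It accumulates fractional counts and forms the ML estimates; the Witten-Bell interpolation of the previous slides would then combine them:

```python
from collections import defaultdict

def mstep_topic_bigram(sentences, z_hat, j):
    """Fractional-count ML bigram and unigram estimates for topic j."""
    big = defaultdict(float)    # sum_i z_ij * n_i(b, c)
    ctx = defaultdict(float)    # sum_q sum_i z_ij * n_i(b, q)
    uni = defaultdict(float)    # weighted unigram counts
    total = 0.0
    for i, sent in enumerate(sentences):
        w = z_hat[i][j]
        for prev, cur in zip(sent, sent[1:]):
            big[(prev, cur)] += w
            ctx[prev] += w
        for tok in sent:
            uni[tok] += w
            total += w
    bigram_ml = {bc: c / ctx[bc[0]] for bc, c in big.items()}
    unigram_ml = {tok: c / total for tok, c in uni.items()}
    # Witten-Bell smoothing would interpolate bigram_ml with unigram_ml using
    # the context-dependent weight gamma_b^j from the preceding slides.
    return bigram_ml, unigram_ml
```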
20
Topic Model Smoothing
• Smoothing due to sparse data:
  – Interpolate with a general model at the n-gram level
• To handle nontopic sentences:
  – Include a general model in addition to the m component models at the sentence level
P(w_1, \ldots, w_T) = \sum_{k=1,\ldots,m,G} \lambda_k \prod_{i=1}^{T} \left[\theta_k P_k(w_i \mid w_{i-1}) + (1 - \theta_k) P_G(w_i \mid w_{i-1})\right]
\theta_k: n-gram level smoothing weights
\lambda_k: sentence-level mixture weights
P_G: general model trained on all the available training data
21
Topic Model Smoothing
• Weights are estimated separately using held-out data
\theta_k^{new} = \frac{1}{\sum_{l=1}^{N} T_l} \sum_{l=1}^{N} \sum_{i=1}^{T_l} \frac{\theta_k^{old} P_k(w_{l,i} \mid w_{l,i-1})}{\theta_k^{old} P_k(w_{l,i} \mid w_{l,i-1}) + (1 - \theta_k^{old}) P_G(w_{l,i} \mid w_{l,i-1})}
\lambda_k^{new} = \frac{1}{N} \sum_{i=1}^{N} \frac{\lambda_k^{old} P_k(w_1, \ldots, w_{T_i})}{\sum_{j=1,\ldots,m,G} \lambda_j^{old} P_j(w_1, \ldots, w_{T_i})}
N: the number of held-out sentences; T_l: the length of the l-th sentence; w_{l,i}: the i-th word of the l-th sentence
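• A minimal sketch of the sentence-level weight update on held-out data (reusing an assumed `sentence_logprob` helper); each iteration averages the per-sentence topic posteriors, which is the \lambda update above:

```python
import math

def reestimate_mixture_weights(heldout, topic_models, lam, sentence_logprob, iters=10):
    """EM reestimation of sentence-level mixture weights on held-out sentences."""
    for _ in range(iters):
        acc = [0.0] * len(lam)
        for sent in heldout:
            log_terms = [math.log(l) + sentence_logprob(sent, m)
                         for l, m in zip(lam, topic_models)]
            mx = max(log_terms)
            denom = sum(math.exp(t - mx) for t in log_terms)
            for k, t in enumerate(log_terms):
                acc[k] += math.exp(t - mx) / denom   # posterior of component k
        lam = [a / len(heldout) for a in acc]
    return lam
```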
22
Dynamic adaptation
• The parameters (\lambda_k, P(w_i \mid w_{i-1})) of the sentence-level mixture model are not updated as we observe new sentences.
• Dynamic language modeling tries to capture short-term fluctuations in word frequencies.
  – Cache or trigger model (word frequency)
  – Dynamic mixture weight adaptation (topic)
23
Dynamic Cache Adaptation
• The cache maintains a list of counts of recently observed n-grams
• Static language model probabilities are interpolated with the cache n-gram probabilities.
• Specific topic-related fluctuations can be captured by using a dynamic cache model if the sublanguage reflects style.
P(w_1, \ldots, w_T) = \prod_{i=1}^{T} \left[(1 - \mu) P^{s}(w_i \mid w_{i-1}) + \mu P^{c}(w_i \mid w_{i-1})\right]
\mu: the weight for the cache model
P^{c}: the cache model parameters
P^{s}: the smoothed static model parameters
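• A minimal sketch of cache interpolation (a unigram cache is used here for simplicity, and the names are illustrative); recently observed words are counted, the cache is flushed at document boundaries, and the static probability is interpolated with the cache probability as in the equation above:

```python
from collections import Counter

class UnigramCache:
    """Cache of recently observed words, interpolated with a static model."""

    def __init__(self, mu):
        self.mu = mu              # weight for the cache model
        self.counts = Counter()
        self.total = 0

    def observe(self, word):
        self.counts[word] += 1
        self.total += 1

    def flush(self):              # call at document boundaries
        self.counts.clear()
        self.total = 0

    def prob(self, word, prev, static_bigram_prob):
        """(1 - mu) * P_s(word | prev) + mu * P_c(word)."""
        cache_p = self.counts[word] / self.total if self.total else 0.0
        return (1 - self.mu) * static_bigram_prob(word, prev) + self.mu * cache_p
```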
24
Dynamic Cache Adaptation
• Two important issues for dynamic language modeling with a cache:
  – The definition of the cache
  – The mechanism for choosing the interpolation weight
• The probabilities of content words vary with time within the global topic
• Small but consistent improvements over the frequency-based rare-word cache
25
Dynamic Cache Adaptation
• The equation for the adapted mixture model incorporating component caches:
  P(w_1, \ldots, w_T) = \sum_{k=1,\ldots,m,G} \lambda_k \prod_{i=1}^{T} \left[(1 - \mu) P_k^{s}(w_i \mid w_{i-1}) + \mu P_k^{c}(w_i \mid w_{i-1})\right]
P_k^{s}: the smoothed static model parameters of component k
P_k^{c}: the cache model parameters of component k
•μ: empirically estimated on a development test set to minimize WER
26
Dynamic Cache Adaptation
• Two approaches for updating the word counts:
  – The partially observed document is cached, and the cache is flushed between documents
  – A sliding window is used to determine the contents of the cache if the document or article is reasonably long
• The first approach is selected here
  – The articles are not long
27
Dynamic Cache Adaptation
• Extend cache-based n-gram adaptation to the sentence-level mixture model
• Fractional counts of words are assigned to each topic according to their relative likelihoods
• The likelihood of the k-th model given the i-th sentence y_i with words w_1, \ldots, w_{T_i}:
  \hat{z}_{ik} = P(z_i = k \mid y_i) = \frac{\lambda_k P_k(w_1, \ldots, w_{T_i})}{\sum_{j=1,\ldots,m,G} \lambda_j P_j(w_1, \ldots, w_{T_i})}
28
Dynamic Cache Adaptation
• Derivation:
\hat{z}_{ik} = P(z_i = k \mid y_i)
 = \frac{P(z_i = k, y_i)}{P(y_i)}
 = \frac{P(y_i \mid z_i = k)\, P(z_i = k)}{P(y_i)}
 = \frac{P(y_i \mid z_i = k)\, P(z_i = k)}{\sum_{j=1,\ldots,m,G} P(y_i \mid z_i = j)\, P(z_i = j)}
 = \frac{\lambda_k P_k(w_1, \ldots, w_{T_i})}{\sum_{j=1,\ldots,m,G} \lambda_j P_j(w_1, \ldots, w_{T_i})}
30
Mixture Weight Adaptation
• The sentence-level mixture weights are updated with each new observed sentence, to reflect the increased information on which to base the choice of weight.
\hat{\lambda}_k^{(N)} = \frac{1}{N} \sum_{i=1}^{N} \hat{z}_{ik} = \frac{(N-1)\, \hat{\lambda}_k^{(N-1)} + \hat{z}_{Nk}}{N}
\hat{z}_{ik} = P(z_i = k \mid y_i) = \frac{\lambda_k P_k(w_1, \ldots, w_{T_i})}{\sum_{j=1,\ldots,m,G} \lambda_j P_j(w_1, \ldots, w_{T_i})}
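• A minimal sketch of this running-average update (helper names are assumptions); after each newly observed sentence the weights move toward the sentence's topic posteriors:

```python
import math

def adapt_mixture_weights(lam, n_seen, sentence, topic_models, sentence_logprob):
    """lam_k^(N) = ((N - 1) * lam_k^(N-1) + z_hat_Nk) / N after sentence N."""
    log_terms = [math.log(l) + sentence_logprob(sentence, m)
                 for l, m in zip(lam, topic_models)]
    mx = max(log_terms)
    denom = sum(math.exp(t - mx) for t in log_terms)
    z_hat = [math.exp(t - mx) / denom for t in log_terms]
    n = n_seen + 1
    new_lam = [((n - 1) * l + z) / n for l, z in zip(lam, z_hat)]
    return new_lam, n
```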
31
Experiments
• Corpus:
  – NAB news, not marked for topics
  – Switchboard, 70 topics
• Recognition systems:
  – Boston University system
    • Non-parametric trajectory stochastic segment model
  – BBN Byblos system
    • Speaker-independent HMM system