Statistical Language Modeling for Speech Recognition and Information Retrieval
Berlin Chen
Department of Computer Science & Information Engineering
National Taiwan Normal University
Outline
• What is Statistical Language Modeling
• Statistical Language Modeling for Speech Recognition, Information Retrieval, and Document Summarization
• Categorization of Statistical Language Models
• Main Issues for Statistical Language Models
• Conclusions
What is Statistical Language Modeling?
• Statistical language modeling (LM) aims to capture the regularities in human natural language and to quantify the acceptability of a given word sequence
.0""
,01.0""
,01.0""
roofing on theand nuts sP
truthg but the and nothinP
hiP
Adopted from Joshua Goodman’s public presentation file
What is Statistical LM Used for?
• It has continuously been a focus of active research in the speech and language community over the past three decades
• It has also been introduced to information retrieval (IR) problems, providing an effective and theoretically attractive probabilistic framework for building IR systems
• Other application domains
– Machine translation
– Input method editor (IME)
– Optical character recognition (OCR)
– Bioinformatics
– etc.
Statistical LM for Speech Recognition
• Speech recognition: finding a word sequence out of the many possible word sequences that has the maximum posterior probability given the input speech utterance
$$W^* = \arg\max_{W} P(W \mid X) = \arg\max_{W} \frac{P(X \mid W)\,P(W)}{P(X)} = \arg\max_{W} P(X \mid W)\,P(W)$$

where $P(X \mid W)$ is the acoustic model and $P(W)$ the language model, for a word sequence $W = w_1, w_2, \ldots, w_m$.
Statistical LM for Information Retrieval
• Information retrieval (IR): identifying information items or "documents" within a large collection that best match (are most relevant to) a "query" provided by a user to describe the user's information need

• Query-likelihood retrieval model: a query is considered to be generated from a "relevant" document that satisfies the information need
– Estimate the likelihood of each document in the collection being the relevant document, and rank the documents accordingly
$$P(D \mid Q) = \frac{P(Q \mid D)\,P(D)}{P(Q)} \propto P(Q \mid D)\,P(D)$$

where $P(Q \mid D)$ is the query likelihood and $P(D)$ the document prior probability.
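A minimal sketch of query-likelihood ranking with unigram document models. The smoothing weight `alpha`, the helper name `query_likelihood`, and the toy collection are illustrative assumptions, not from the slides:

```python
import math
from collections import Counter

def query_likelihood(query, doc, coll_counts, coll_len, alpha=0.5):
    """log P(Q|D): unigram document model linearly smoothed with the
    collection model P(w|C); assumes a uniform document prior P(D)."""
    doc_counts, doc_len = Counter(doc), len(doc)
    score = 0.0
    for w in query:
        p_doc = doc_counts[w] / doc_len if doc_len else 0.0
        p_coll = coll_counts[w] / coll_len   # nonzero for any word seen in the collection
        score += math.log(alpha * p_doc + (1 - alpha) * p_coll)
    return score

# Rank documents by log P(Q|D), highest first.
docs = [["truth", "and", "justice"], ["nuts", "on", "the", "roof"]]
coll = Counter(w for d in docs for w in d)
ranked = sorted(docs, key=lambda d: query_likelihood(["truth"], d, coll, sum(coll.values())),
                reverse=True)
```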
Statistical LM for Document Summarization
• Estimate the likelihood of each sentence of the document being in the summary and rank them accordingly
– The sentence generative probability can be taken as a relevance measure between the document and sentence
– The sentence prior probability is, to some extent, a measure of the importance of the sentence itself
$$P(S \mid D) = \frac{P(D \mid S)\,P(S)}{P(D)} \propto P(D \mid S)\,P(S)$$

where $P(D \mid S)$ is the sentence generative probability (or document likelihood) and $P(S)$ the sentence prior probability.
n-Gram Language Models (1/3)
• For a word sequence $W = w_1, w_2, \ldots, w_m$, $P(W)$ can be decomposed into a product of conditional probabilities (multiplication rule):

$$P(W) = P(w_1, w_2, \ldots, w_m) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1, w_2) \cdots P(w_m \mid w_1, \ldots, w_{m-1}) = P(w_1) \prod_{i=2}^{m} P(w_i \mid w_1, w_2, \ldots, w_{i-1})$$

where $w_1, w_2, \ldots, w_{i-1}$ is the history of $w_i$.

– E.g.:

P("and nothing but the truth") = P("and") × P("nothing" | "and") × P("but" | "and nothing") × P("the" | "and nothing but") × P("truth" | "and nothing but the")

– However, it is impossible to estimate and store $P(w_i \mid w_1, \ldots, w_{i-1})$ if $i$ is large (the curse of dimensionality)
n-Gram Language Models (2/3)
• n-gram approximation:

$$P(w_i \mid w_1, w_2, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) \quad \text{(history of length } n-1\text{)}$$

– Also called (n−1)-order Markov modeling
– The most prevailing language model

• E.g., trigram modeling:

P("the" | "whole truth and nothing but") ≈ P("the" | "nothing but")
P("truth" | "whole truth and nothing but the") ≈ P("truth" | "but the")

– How do we find the probabilities? Maximum likelihood estimation:
• Get real text, and start counting (empirically)

P("the" | "nothing but") = C("nothing but the") / C("nothing but")

where C(·) is a count in the training text; note the probability may be zero
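To make the counting concrete, here is a minimal maximum-likelihood trigram estimator in Python. The corpus, the padding convention, and the function name are illustrative assumptions:

```python
from collections import Counter

def train_trigram_ml(sentences):
    """Maximum-likelihood trigram estimates: P(z|x,y) = C(x y z) / C(x y)."""
    tri, bi = Counter(), Counter()
    for s in sentences:
        words = ["<s>", "<s>"] + s + ["</s>"]   # pad so every word has a two-word history
        for x, y, z in zip(words, words[1:], words[2:]):
            tri[(x, y, z)] += 1
            bi[(x, y)] += 1
    return lambda x, y, z: tri[(x, y, z)] / bi[(x, y)] if bi[(x, y)] else 0.0

p = train_trigram_ml([["and", "nothing", "but", "the", "truth"]])
print(p("nothing", "but", "the"))   # C("nothing but the") / C("nothing but") = 1.0
```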
n-Gram Language Models (3/3)
• Minimum Word Error (MWE) Discriminative Training
– Given a training set of observation sequences $\mathcal{X} = \{X_1, \ldots, X_u, \ldots, X_U\}$, the MWE criterion aims to minimize the expected word errors on these observation sequences using the following objective function
– The MWE objective function can be optimized with respect to the language model probabilities using the Extended Baum-Welch (EBW) algorithm
$$F_{MWE}(\mathcal{X}) = \sum_{u=1}^{U} \frac{\sum_{S_u} P(X_u \mid S_u)\,P(S_u)\,\mathrm{RawAcc}(S_u, R_u)}{\sum_{S'_u} P(X_u \mid S'_u)\,P(S'_u)}$$

where $S_u$ ranges over the recognition hypotheses for $X_u$ and $\mathrm{RawAcc}(S_u, R_u)$ is the raw word accuracy of $S_u$ against the reference transcript $R_u$.

EBW update of the language model probabilities:

$$\hat{P}(w_i \mid h_i) = \frac{\dfrac{\partial F_{MWE}(\mathcal{X})}{\partial P(w_i \mid h_i)}\,P(w_i \mid h_i) + C\,P(w_i \mid h_i)}{\sum_{w_k} \left[\dfrac{\partial F_{MWE}(\mathcal{X})}{\partial P(w_k \mid h_i)}\,P(w_k \mid h_i) + C\,P(w_k \mid h_i)\right]}$$

where $C$ is a constant chosen to keep the updated probabilities positive.
n-Gram-Based Retrieval Model (1/2)
• Each document is a probabilistic generative model consisting of a set of n-gram distributions for predicting the query
– Document models can be optimized by the expectation-maximization (EM) or minimum classification error (MCE) training algorithms, given a set of query and relevant document pairs
For a query $Q = w_1 w_2 \cdots w_L$:

$$P(Q \mid D_j) = \prod_{i=1}^{L} \Big[\, m_1\,P(w_i \mid D_j) + m_2\,P(w_i \mid C) + m_3\,P(w_i \mid w_{i-1}, D_j) + m_4\,P(w_i \mid w_{i-1}, C) \,\Big]$$

where $m_1 + m_2 + m_3 + m_4 = 1$ (for $i = 1$ only the unigram terms apply); $P(w_i \mid D_j)$ and $P(w_i \mid w_{i-1}, D_j)$ are estimated from the document $D_j$, while $P(w_i \mid C)$ and $P(w_i \mid w_{i-1}, C)$ are estimated from the whole collection $C$.

Features:
1. A formal mathematical framework
2. Uses collection statistics rather than heuristics
3. The retrieval system can be gradually improved through usage
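A sketch of the four-way mixture score above. In practice the weights m1..m4 would be learned with EM or MCE from query/relevant-document pairs; the values and probability-function interface here are placeholders:

```python
import math

def mixture_score(query, uni_d, uni_c, bi_d, bi_c, m=(0.4, 0.3, 0.2, 0.1)):
    """log P(Q|D) under the unigram/bigram document-collection mixture.
    uni_d(w), uni_c(w), bi_d(w, prev), bi_c(w, prev) return probabilities."""
    m1, m2, m3, m4 = m                           # m1 + m2 + m3 + m4 = 1
    score, prev = 0.0, None
    for w in query:
        p = m1 * uni_d(w) + m2 * uni_c(w)
        if prev is not None:                     # bigram terms need a previous word
            p += m3 * bi_d(w, prev) + m4 * bi_c(w, prev)
        score += math.log(p)
        prev = w
    return score
```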
n-Gram-Based Retrieval Model (2/2)
• MCE training
– Given a query $Q$ and a desired relevant document $D^*$, define the classification error function as:

$$E(Q, D^*) = -\log P(Q \mid D^*) + \log \max_{D' \text{ is not } D^*} P(Q \mid D')$$

">0" means misclassified; "≤0" means a correct decision. (One can also take all irrelevant documents in the answer set into consideration.)

– Transform the error function into the loss function:

$$L(Q, D^*) = \frac{1}{1 + \exp\big(-E(Q, D^*)\big)}$$

– Iteratively update the weighting parameters, e.g., for $m_1$:

$$\tilde{m}_1(i+1) = \tilde{m}_1(i) - \epsilon\,\frac{\partial L(Q, D^*)}{\partial \tilde{m}_1}\bigg|_{\tilde{m}_1 = \tilde{m}_1(i)}$$
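A sketch of one MCE update step for the mixture weights. The function names `logp_rel` and `logp_best_irrel`, the step size `eps`, and the use of a finite-difference gradient (instead of the analytic derivative of a transformed weight) are all illustrative assumptions:

```python
import math

def mce_step(weights, logp_rel, logp_best_irrel, eps=0.1, h=1e-5):
    """One MCE update of the mixture weights m1..m4.
    E = -log P(Q|D*) + log max_{D'} P(Q|D');  L = 1 / (1 + exp(-E))."""
    def loss(w):
        e = -logp_rel(w) + logp_best_irrel(w)    # > 0 means misclassified
        return 1.0 / (1.0 + math.exp(-e))
    base = loss(weights)
    grad = []
    for i in range(len(weights)):                # finite-difference gradient
        bumped = list(weights)
        bumped[i] += h
        grad.append((loss(bumped) - base) / h)
    new_w = [max(w - eps * g, 1e-6) for w, g in zip(weights, grad)]
    total = sum(new_w)
    return [w / total for w in new_w]            # keep weights positive, summing to 1
```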
n-Gram-Based Summarization Model (1/2)
• Each sentence of the spoken document is treated as a probabilistic generative model of n-grams, while the spoken document is the observation
– $P(w_j \mid S)$: the sentence model, estimated from the sentence $S$
– $P(w_j \mid C)$: the collection model, estimated from a large corpus
• Included so that every word in the vocabulary has some probability

$$P_{LM}(D \mid S) = \prod_{w_j \in D} \Big[\, \lambda\,P(w_j \mid S) + (1-\lambda)\,P(w_j \mid C) \,\Big]^{n(w_j, D)}$$

where $n(w_j, D)$ is the count of word $w_j$ in document $D$.
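A minimal sketch of scoring sentences by $\log P_{LM}(D \mid S)$ with the linear smoothing above. The weight `lam` and the function name are illustrative assumptions:

```python
import math
from collections import Counter

def sentence_score(doc_words, sent_words, coll_counts, coll_len, lam=0.7):
    """log P_LM(D|S) = sum over w in D of n(w,D) * log[lam*P(w|S) + (1-lam)*P(w|C)]."""
    s_counts, s_len = Counter(sent_words), len(sent_words)
    score = 0.0
    for w, n_wD in Counter(doc_words).items():
        p_s = s_counts[w] / s_len if s_len else 0.0
        p_c = coll_counts[w] / coll_len          # collection model keeps the mixture > 0
        score += n_wD * math.log(lam * p_s + (1 - lam) * p_c)
    return score

# Rank the sentences of a document by this score; take the top ones as the summary.
```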
n-Gram-Based Summarization Model (2/2)
• Relevance Model (RM)
– In order to improve the estimation of the sentence models
– Each sentence $S$ has its own associated relevance model $P(w_j \mid R_S)$, constructed from the subset $D_{Top}$ of documents in the collection that are relevant to the sentence:

$$P(w_j \mid R_S) = \sum_{D_l \in D_{Top}} P(D_l \mid S)\,P(w_j \mid D_l), \qquad \text{where } P(D_l \mid S) = \frac{P(S \mid D_l)\,P(D_l)}{\sum_{D_u \in D_{Top}} P(S \mid D_u)\,P(D_u)}$$

– The relevance model is then linearly combined with the original sentence model to form a more accurate sentence model:

$$\hat{P}(w_j \mid S) = \alpha\,P(w_j \mid S) + (1-\alpha)\,P(w_j \mid R_S)$$

$$P_{LM\text{-}RM}(D \mid S) = \prod_{w_j \in D} \Big[\, \lambda\,\hat{P}(w_j \mid S) + (1-\lambda)\,P(w_j \mid C) \,\Big]^{n(w_j, D)}$$
Categorization of Statistical Language Models (1/4)
Categorization of Statistical Language Models (2/4)
1. Word-based LMs
– The n-gram model is usually the basic model of this category
– Many other models of this category are designed to overcome the major drawback of n-gram models
• That is, to capture long-distance word dependence information without rapidly increasing the model complexity
– E.g., the mixed-order Markov model and the trigger-based language model

2. Word class (or topic)-based LMs
– These models are similar to the n-gram model, but the relationship among words is constructed via (latent) word classes
• Once the relationship is established, the probability of a decoded word given the history words can easily be found
– E.g., the class-based n-gram model, the aggregate Markov model, and the word topical mixture model (WTMM)
Categorization of Statistical Language Models (3/4)
3. Structure-based LMs
– Using the constraints of a grammar, rules for a sentence may be derived and represented as a parse tree
• We can then select among candidate words using the sentence patterns or the head words of the history
– E.g., the structured language model

4. Document class (or topic)-based LMs
– Words are aggregated in a document to represent some topics (or concepts). During speech recognition, the history is treated as an incomplete document, and the associated latent topic distributions can be discovered on the fly
• The decoded words related to most of the topics that the history probably belongs to can therefore be selected
– E.g., the mixture-based language model, probabilistic latent semantic analysis (PLSA), and latent Dirichlet allocation (LDA)
Categorization of Statistical Language Models (4/4)

• Ironically, the most successful statistical language modeling techniques use very little knowledge of what language is
– $W = w_1, w_2, \ldots, w_m$ may be a sequence of arbitrary symbols, with no deep structure, intention, or thought behind them

• F. Jelinek said, "put language back into language modeling"
– "Closing remarks" presented at the 1995 Language Modeling Summer Workshop, Baltimore
Main Issues for Statistical Language Models
• Evaluation
– How can you tell a good language model from a bad one?
– Run a speech recognizer, or adopt other statistical measurements

• Smoothing
– Deal with the data sparseness of real training data
– Various approaches have been proposed

• Adaptation
– The subject matter and lexical characteristics of the linguistic content of utterances or documents (e.g., news articles) can be very diverse and often change with time
• LMs should be adapted accordingly
– Caching: if you say something, you are likely to say it again later
• Adjust word frequencies observed in the current conversation
Evaluation (1/7)
• The two most common metrics for evaluating a language model
– Word Recognition Error Rate (WER)
– Perplexity (PP)

• Word Recognition Error Rate
– Requires the participation of a speech recognition system (slow!)
– Needs to deal with the combination of acoustic probabilities and language model probabilities (penalizing or weighting between them)
Evaluation (2/7)
• Perplexity
– Perplexity is the geometric average inverse language model probability (it measures language model difficulty, not acoustic difficulty/confusability)
– Can be roughly interpreted as the geometric mean of the branching factor of the text when presented to the language model
– For trigram modeling:
$$PP(W) = P(w_1, w_2, \ldots, w_m)^{-\frac{1}{m}} = \sqrt[m]{\prod_{i=1}^{m} \frac{1}{P(w_i \mid w_1, \ldots, w_{i-1})}}$$

$$PP(W) = \sqrt[m]{\frac{1}{P(w_1)} \cdot \frac{1}{P(w_2 \mid w_1)} \cdot \prod_{i=3}^{m} \frac{1}{P(w_i \mid w_{i-2}, w_{i-1})}}$$
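Test-set perplexity can be computed directly from the definition; a sketch in log space to avoid underflow. The interface (`cond_prob(word, history_tuple)`) and the function name are illustrative assumptions:

```python
import math

def perplexity(test_words, cond_prob, n=3):
    """PP(W) = [ prod_i 1 / P(w_i | history_i) ] ** (1/m), computed in log space;
    cond_prob(word, history_tuple) -> probability."""
    log_sum, m = 0.0, len(test_words)
    for i, w in enumerate(test_words):
        history = tuple(test_words[max(0, i - n + 1):i])   # at most n-1 previous words
        log_sum += math.log(cond_prob(w, history))
    return math.exp(-log_sum / m)

# A uniform model over a 10-word digit vocabulary gives perplexity 10:
print(perplexity(["three", "seven", "zero"], lambda w, h: 0.1))   # 10.0
```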
Evaluation (3/7)
• More about Perplexity
– Perplexity is an indication of the complexity of the language, if we have an accurate estimate of $P(W)$
– A language with higher perplexity means that the number of words branching from a previous word is larger on average
– A language model with perplexity $L$ has roughly the same difficulty as another language model in which every word can be followed by $L$ different words with equal probabilities
– Examples:
• Ask a speech recognizer to recognize the digits "0, 1, 2, 3, 4, 5, 6, 7, 8, 9" – easy – perplexity 10
• Ask a speech recognizer to recognize names at a large institute (10,000 persons) – hard – perplexity 10,000
Evaluation (4/7)
• More about Perplexity (cont.)
– Training-set perplexity: measures how well the language model fits the training data
– Test-set perplexity: evaluates the generalization capability of the language model
• When we say perplexity, we mean "test-set perplexity"
Evaluation (5/7)
• Is a language model with lower perplexity better?
– The true (optimal) model for data has the lowest possible perplexity
– The lower the perplexity, the closer we are to the true model
– Typically, perplexity correlates well with speech recognition word error rate
• Correlates better when both models are trained on same data
• Doesn’t correlate well when training data changes
– The 20,000-word continuous speech recognition for Wall Street Journal (WSJ) task has a perplexity about 128 ~ 176 (trigram)
– The 2,000-word conversational Air Travel Information System (ATIS) task has a perplexity less than 20
Evaluation (6/7)
• The perplexity of a bigram model with different vocabulary sizes [figure omitted]
Evaluation (7/7)
• A rough rule of thumb (recommended by Rosenfeld)
– Reduction of 5% in perplexity is usually not practically significant
– A 10% ~ 20% reduction is noteworthy, and usually translates into some improvement in application performance
– A perplexity improvement of 30% or more over a good baseline is quite significant
Tasks of recognizing 10 isolated words using IBM ViaVoice:

Vocabulary                                                              Perplexity   WER (%)
zero | one | two | three | four | five | six | seven | eight | nine        10           5
John | tom | sam | bon | ron | susan | sharon | carol | laura | sarah      10           7
bit | bite | boot | bait | bat | bet | beat | boat | burt | bart           10           9

Perplexity cannot always reflect the difficulty of a speech recognition task
Smoothing (1/3)
• Maximum likelihood (ML) estimates of language models were shown previously, e.g.:
– Trigram probabilities
– Bigram probabilities
$$P_{ML}(z \mid x, y) = \frac{C(x\,y\,z)}{\sum_{w} C(x\,y\,w)} = \frac{C(x\,y\,z)}{C(x\,y)}$$

$$P_{ML}(y \mid x) = \frac{C(x\,y)}{\sum_{w} C(x\,w)} = \frac{C(x\,y)}{C(x)}$$

where $C(\cdot)$ is a count in the training corpus.
Smoothing (2/3)
• Data Sparseness
– Many actually possible events (word successions) in the test set may not be well observed in the training set/data
• E.g., bigram modeling:

P(read | Mulan) = 0 ⇒ P(Mulan read a book) = 0
P(W) = 0 ⇒ P(X|W)P(W) = 0

– Whenever a string $W$ with $P(W) = 0$ occurs during a speech recognition task, an error will be made
Smoothing (3/3)
• Smoothing
– Assign all strings (or events/word successions) a nonzero probability, even if they never occur in the training data
– Tends to make distributions flatter, by adjusting low probabilities upward and high probabilities downward
Smoothing: Simple Models
• Add-one smoothing
– For example, pretend each trigram occurs once more than it actually does
• Add-delta smoothing
Add-one:

$$P_{smooth}(z \mid x, y) = \frac{C(x\,y\,z) + 1}{\sum_{w} \big(C(x\,y\,w) + 1\big)} = \frac{C(x\,y\,z) + 1}{C(x\,y) + V}, \qquad V: \text{total number of vocabulary words}$$

Add-delta:

$$P_{smooth}(z \mid x, y) = \frac{C(x\,y\,z) + \delta}{C(x\,y) + \delta V}$$

These work badly! DO NOT USE THESE TWO.
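The two estimators as code, mainly to show why they are poor: with a large vocabulary $V$, almost all probability mass moves to unseen events. The numbers and function name are illustrative:

```python
def add_delta_prob(c_xyz, c_xy, V, delta=1.0):
    """P_smooth(z|x,y) = (C(xyz) + delta) / (C(xy) + delta * V).
    delta = 1 is add-one smoothing; 0 < delta < 1 is add-delta."""
    return (c_xyz + delta) / (c_xy + delta * V)

# With V = 20,000 and a history seen 10 times, an unseen trigram gets
# 1/20010 while a trigram seen 10 times gets only 11/20010 -- far too flat.
print(add_delta_prob(0, 10, 20000), add_delta_prob(10, 10, 20000))
```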
Smoothing: Back-Off Models
• The general form of n-gram back-off:

$$P_{smooth}(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) = \begin{cases} \hat{P}(w_i \mid w_{i-n+1}, \ldots, w_{i-1}), & \text{if } C(w_{i-n+1}, \ldots, w_i) > 0 \\[4pt] \alpha(w_{i-n+1}, \ldots, w_{i-1})\, P_{smooth}(w_i \mid w_{i-n+2}, \ldots, w_{i-1}), & \text{if } C(w_{i-n+1}, \ldots, w_i) = 0 \end{cases}$$

(the first case is a smoothed n-gram estimate; the second backs off to the smoothed (n−1)-gram)

– $\alpha(w_{i-n+1}, \ldots, w_{i-1})$: a normalizing/scaling factor chosen to make the conditional probabilities sum to 1
• I.e., $\sum_{w_i} P_{smooth}(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) = 1$

– For example:

$$\alpha(w_{i-n+1}, \ldots, w_{i-1}) = \frac{1 - \sum_{w_i:\, C(w_{i-n+1}, \ldots, w_i) > 0} \hat{P}(w_i \mid w_{i-n+1}, \ldots, w_{i-1})}{\sum_{w_j:\, C(w_{i-n+1}, \ldots, w_j) = 0} P_{smooth}(w_j \mid w_{i-n+2}, \ldots, w_{i-1})}$$
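A runnable sketch of the back-off recursion for the bigram-to-unigram case, with alpha computed exactly as in the formula above. The use of absolute discounting to free probability mass (the constant `d`) is an illustrative choice, not the slide's method:

```python
from collections import Counter

class BackoffBigram:
    """Use a discounted bigram estimate when C(y z) > 0, otherwise back off
    to the unigram model scaled by alpha(y)."""
    def __init__(self, words, d=0.5):
        self.uni, self.N = Counter(words), len(words)
        self.bi = Counter(zip(words, words[1:]))
        self.after = {}                          # set of words seen following each y
        for (a, b) in self.bi:
            self.after.setdefault(a, set()).add(b)
        self.d = d

    def p_uni(self, z):
        return self.uni[z] / self.N

    def prob(self, z, y):
        c_y = sum(self.bi[(y, b)] for b in self.after.get(y, ()))
        if c_y == 0:
            return self.p_uni(z)                 # unseen history: pure unigram
        if self.bi[(y, z)] > 0:
            return (self.bi[(y, z)] - self.d) / c_y
        held_out = self.d * len(self.after[y]) / c_y            # mass freed by discounting
        unseen = 1.0 - sum(self.p_uni(b) for b in self.after[y])
        return held_out / unseen * self.p_uni(z)                # alpha(y) * P_smooth(z)
```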
Smoothing: Interpolated Models
• The general form of interpolated n-gram models:

$$P_{smooth}(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) = \lambda_{w_{i-n+1}, \ldots, w_{i-1}}\, P_{ML}(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) + \big(1 - \lambda_{w_{i-n+1}, \ldots, w_{i-1}}\big)\, P_{smooth}(w_i \mid w_{i-n+2}, \ldots, w_{i-1})$$

where

$$P_{ML}(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) = \frac{C(w_{i-n+1}, \ldots, w_i)}{C(w_{i-n+1}, \ldots, w_{i-1})}$$

• The key difference between back-off and interpolated models:
– For n-grams with nonzero counts, interpolated models use information from lower-order distributions while back-off models do not
– Moreover, in interpolated models, n-grams with the same counts can have different probability estimates
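The interpolated counterpart as a sketch, recursing down to a uniform distribution. Unlike back-off, it mixes in the lower-order estimate even when the count is nonzero. The fixed `lam` is illustrative; in practice the lambdas are tuned on held-out data:

```python
from collections import Counter

def ngram_counts(words, max_n=3):
    """counts[k] maps each k-gram tuple to its corpus count."""
    return {k: Counter(zip(*[words[i:] for i in range(k)]))
            for k in range(1, max_n + 1)}

def interp_prob(w, history, counts, V, lam=0.8):
    """P_smooth(w|h) = lam * P_ML(w|h) + (1 - lam) * P_smooth(w|shorter h),
    ending at a uniform 1/V distribution below the unigram level."""
    if not history:
        total = sum(counts[1].values())
        return lam * counts[1][(w,)] / total + (1 - lam) / V
    k = len(history) + 1
    c_hist = counts[k - 1][tuple(history)]
    p_ml = counts[k][tuple(history) + (w,)] / c_hist if c_hist else 0.0
    return lam * p_ml + (1 - lam) * interp_prob(w, history[1:], counts, V, lam)

counts = ngram_counts("and nothing but the truth".split())
print(interp_prob("the", ("nothing", "but"), counts, V=5))
```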
Caching (1/2)
• The basic idea of caching is to accumulate the n-grams dictated so far in the current document/conversation and use them to create a dynamic n-gram model
• Trigram interpolated with a unigram cache:

$$P_{cache}(z \mid \text{history}) = \frac{C(z \in \text{history})}{\text{length}(\text{history})}$$
$$P(z \mid x, y) = \lambda\, P_{smooth}(z \mid x, y) + (1 - \lambda)\, P_{cache}(z \mid \text{history})$$

• Trigram interpolated with a bigram cache:

$$P_{cache}(z \mid y, \text{history}) = \frac{C(y\,z \in \text{history})}{C(y \in \text{history})}$$
$$P(z \mid x, y) = \lambda\, P_{smooth}(z \mid x, y) + (1 - \lambda)\, P_{cache}(z \mid y, \text{history})$$

where "history" is the document/conversation dictated so far
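A sketch of the unigram-cache interpolation; `p_smooth` stands for whatever static trigram model is already in use, and the class/method names and weight are illustrative assumptions:

```python
from collections import Counter

class UnigramCache:
    """Accumulates the words dictated so far and mixes the cache estimate
    P_cache(z|history) = C(z in history) / len(history) into the static model."""
    def __init__(self, p_smooth, lam=0.9):
        self.p_smooth = p_smooth                 # static P(z|x,y)
        self.lam = lam
        self.history = Counter()
        self.n = 0

    def prob(self, z, x, y):
        p_cache = self.history[z] / self.n if self.n else 0.0
        return self.lam * self.p_smooth(z, x, y) + (1 - self.lam) * p_cache

    def observe(self, word):                     # call on each recognized word
        self.history[word] += 1
        self.n += 1
```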
Caching (2/2)
• Real Life of Caching
– Someone says "I swear to tell the truth"
– System hears "I swerve to smell the soup"
– Someone says "The whole truth", and, with the cache, the system hears "The toll booth" – errors are locked in (the cache remembers!)

• Caching works well when users correct as they go; it works poorly, or even hurts, without corrections

Adopted from Joshua Goodman's public presentation file
LM Integrated into Speech Recognition
• Theoretically:

$$\hat{W} = \arg\max_{W} P(X \mid W)\,P(W)$$

• Practically, the language model is a better predictor, while acoustic probabilities aren't "real" probabilities
– Penalize insertions
– E.g.:

$$\hat{W} = \arg\max_{W} P(X \mid W)\,P(W)^{\alpha}\,IP^{\,\text{length}(W)}$$

where $\alpha$ and the word insertion penalty $IP$ can be empirically decided, e.g., $\alpha = 8$
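In practice decoders combine the two scores in the log domain; a sketch with an LM scale and per-word insertion penalty. The function name and numeric values are illustrative assumptions:

```python
def decode_score(log_p_acoustic, log_p_lm, num_words, lm_scale=8.0, word_penalty=-0.5):
    """log P(X|W) + alpha * log P(W) + length(W) * log(IP): the LM scale
    compensates for underestimated acoustic likelihoods, and the per-word
    penalty discourages insertions."""
    return log_p_acoustic + lm_scale * log_p_lm + num_words * word_penalty

# Pick the hypothesis with the highest combined score:
hyps = [(-120.0, -9.2, 5), (-118.5, -13.1, 7)]   # (acoustic, LM, #words) per hypothesis
best = max(hyps, key=lambda h: decode_score(*h))
```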
n-Gram Language Model Adaptation (1/4)
• Count Merging
– The n-gram conditional probabilities form a multinomial distribution
• The parameters form sets of independent Dirichlet distributions with hyperparameters (one set per n-gram history, with dimension equal to the vocabulary size)
– The MAP estimate maximizes the posterior distribution of the parameters
n-Gram Language Model Adaptation (2/4)
• Count Merging (cont.)
– Maximize the posterior distribution of the parameters subject to the sum-to-one constraint
– Differentiate with respect to the parameters and the Lagrange multiplier
n-Gram Language Model Adaptation (3/4)
• Count Merging (cont.)
– Parameterization of the prior distribution (I)
– The adaptation formula for count merging
• E.g., merging the n-gram counts of the background corpus with those of the adaptation corpus
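The slide's count-merging equation is supplied here in a commonly used form, stated as an assumption (consistent with MAP adaptation of multinomials):

$$P_{CM}(w \mid h) = \frac{\beta\,C_B(h, w) + C_A(h, w)}{\beta\,C_B(h) + C_A(h)}$$

where $C_B(\cdot)$ and $C_A(\cdot)$ are counts from the background and adaptation corpora, and $\beta$ controls how strongly the background counts are weighted.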
n-Gram Language Model Adaptation (4/4)
• Model Interpolation
– Parameterization of the prior distribution (II)
– The adaptation formula for model interpolation
• E.g., linearly interpolating the background model with the model estimated from the adaptation corpus
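The model-interpolation formula, again in its standard form, stated as an assumption:

$$P_{MI}(w \mid h) = \lambda\,P_B(w \mid h) + (1 - \lambda)\,P_A(w \mid h), \qquad 0 \le \lambda \le 1$$

where $P_B$ is the background model, $P_A$ the model estimated from the adaptation data, and $\lambda$ can be tuned on held-out adaptation data.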
Known Weakness in Current n-Gram LM
• Brittleness Across Domains
– Current language models are extremely sensitive to changes in the style or topic of the text on which they are trained
• E.g., conversations vs. news broadcasts, fiction vs. politics
– Language model adaptation
• In-domain or contemporary text corpora/speech transcripts
• Static or dynamic adaptation
• Local contextual (n-gram) or global semantic/topical information

• False Independence Assumption
– In order to remain trainable, n-gram modeling assumes that the probability of the next word in a sentence depends only on the identity of the last n−1 words
• I.e., (n−1)-order Markov modeling
Conclusions
• Statistical language modeling has been demonstrated to be an effective probabilistic framework for NLP, ASR, and IR-related applications

• Many issues remain to be solved for statistical language modeling, e.g.:
– Unknown word (or spoken term) detection
– Discriminative training of language models
– Adaptation of language models across different domains and genres
– Fusion of various (or different levels of) features for language modeling
• Positional information?
• Rhetorical (structural) information?
References
• J. R. Bellegarda. Statistical language model adaptation: review and perspectives. Speech Communication 42(1), 93–108, 2004
• X. Liu, W. B. Croft. Statistical language modeling for information retrieval. Annual Review of Information Science and Technology 39, Chapter 1, 2005
• R. Rosenfeld. Two decades of statistical language modeling: where do we go from here? Proceedings of the IEEE, August 2000
• J. Goodman. A bit of progress in language modeling, extended version. Microsoft Research Technical Report MSR-TR-2001-72, 2001
• H. S. Chiu, B. Chen. Word topical mixture models for dynamic language model adaptation. ICASSP 2007
• J. W. Kuo, B. Chen. Minimum word error based discriminative training of language models. Interspeech 2005
• B. Chen, H. M. Wang, L. S. Lee. Spoken document retrieval and summarization. Advances in Chinese Spoken Language Processing, Chapter 13, 2006
Maximum Likelihood Estimate (MLE) for n-Grams (1/2)
• Given a training corpus $T$ and the language model $p(w \mid h)$
– Essentially, the sample counts with the same history $h$ follow a multinomial (polynomial) distribution
Corpus: $T = w^{(1)} w^{(2)} \cdots w^{(k)} \cdots$   Vocabulary: $\mathcal{V} = \{w_1, w_2, \ldots, w_V\}$

E.g., … 陳水扁 總統 訪問 美國 紐約 … 陳水扁 總統 在 巴拿馬 表示 …   P(總統 | 陳水扁) = ?

$$p(T) = \prod_{k} p\big(w^{(k)} \mid h^{(k)}\big) = \prod_{h} \prod_{w_i} p(w_i \mid h)^{N_{h,w_i}}$$

(n-grams with the same history $h$ are collected together)

The counts sharing a history follow a multinomial distribution:

$$P\big(N_{h,w_1}, \ldots, N_{h,w_V}\big) = \frac{N_h!}{\prod_{w_i} N_{h,w_i}!} \prod_{w_i} p(w_i \mid h)^{N_{h,w_i}}, \qquad N_h = \sum_{w_i} N_{h,w_i}, \quad \sum_{w_j} p(w_j \mid h) = 1$$

where $N_{h,w_i} = C(h\,w_i)$ and $N_h = C(h)$ are the n-gram counts in the corpus.
Maximum Likelihood Estimate (MLE) for n-Grams (2/2)
• Taking the logarithm of $p(T)$, we have

$$\log p(T) = \sum_{h} \sum_{w_i} N_{h,w_i} \log p(w_i \mid h)$$

• For each history $h$, maximize $\log p(T)$ subject to $\sum_{w_j} p(w_j \mid h) = 1$, using a Lagrange multiplier $l_h$:

$$\Phi = \sum_{h} \sum_{w_i} N_{h,w_i} \log p(w_i \mid h) + \sum_{h} l_h \Big( 1 - \sum_{w_j} p(w_j \mid h) \Big)$$

$$\frac{\partial \Phi}{\partial p(w_i \mid h)} = \frac{N_{h,w_i}}{p(w_i \mid h)} - l_h = 0 \quad \Rightarrow \quad p(w_i \mid h) = \frac{N_{h,w_i}}{l_h} \qquad \Big(\text{using } \frac{d \log x}{dx} = \frac{1}{x}\Big)$$

• Applying the constraint gives $l_h = \sum_{w_s} N_{h,w_s} = N_h$, so the ML estimate is

$$\hat{p}(w_i \mid h) = \frac{N_{h,w_i}}{\sum_{w_s} N_{h,w_s}} = \frac{C(h\,w_i)}{C(h)}$$