Factored Language Models
EE517 Presentation
April 19, 2005
Kevin Duh ([email protected])
Outline
1. Motivation
2. Factored Word Representation
3. Generalized Parallel Backoff
4. Model Selection Problem
5. Applications
6. Tools
Word-based Language Models

• Standard word-based language models:

$$p(w_1, w_2, \ldots, w_T) = \prod_{t=1}^{T} p(w_t \mid w_1, \ldots, w_{t-1}) \approx \prod_{t=1}^{T} p(w_t \mid w_{t-1}, w_{t-2})$$

• How to get robust n-gram estimates $p(w_t \mid w_{t-1}, w_{t-2})$?
  • Smoothing
    • E.g. Kneser-Ney, Good-Turing
  • Class-based language models:

$$p(w_t \mid w_{t-1}) \approx p(w_t \mid C(w_t))\, p(C(w_t) \mid C(w_{t-1}))$$
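To make these two estimates concrete, here is a minimal Python sketch of maximum-likelihood trigram counting and the class-based bigram decomposition above; the corpus format, the word-to-class map, and the function names are illustrative assumptions, not from the slides.

from collections import Counter

def trigram_mle(corpus):
    """Maximum-likelihood estimate p(w_t | w_{t-1}, w_{t-2}) from raw counts."""
    tri, hist = Counter(), Counter()
    for sent in corpus:                      # corpus: list of tokenized sentences
        words = ["<s>", "<s>"] + sent + ["</s>"]
        for i in range(2, len(words)):
            tri[(words[i-2], words[i-1], words[i])] += 1
            hist[(words[i-2], words[i-1])] += 1
    return lambda w, h2, h1: tri[(h2, h1, w)] / hist[(h2, h1)] if hist[(h2, h1)] else 0.0

def class_bigram(corpus, word2class):
    """Class-based estimate p(w_t | w_{t-1}) ~= p(w_t | C(w_t)) * p(C(w_t) | C(w_{t-1}))."""
    w_in_class, class_count, class_bi, class_hist = Counter(), Counter(), Counter(), Counter()
    for sent in corpus:
        for prev, cur in zip(sent, sent[1:]):
            cp, cc = word2class[prev], word2class[cur]
            w_in_class[(cur, cc)] += 1
            class_count[cc] += 1
            class_bi[(cp, cc)] += 1
            class_hist[cp] += 1
    def p(w, h):
        cw, ch = word2class[w], word2class[h]
        if not class_count[cw] or not class_hist[ch]:
            return 0.0
        return (w_in_class[(w, cw)] / class_count[cw]) * (class_bi[(ch, cw)] / class_hist[ch])
    return p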
Limitation of Word-based Language Models

• Words are inseparable whole units.
  • E.g. "book" and "books" are distinct vocabulary units
• Especially problematic in morphologically-rich languages:
  • E.g. Arabic, Finnish, Russian, Turkish
  • Many unseen word contexts
  • High out-of-vocabulary rate
  • High perplexity

Arabic words sharing the root k-t-b:
  Kitaab        A book
  Kitaab-iy     My book
  Kitaabu-hum   Their book
  Kutub         Books
Arabic Morphology

[Figure: an Arabic word decomposed into root, pattern, affixes, and particles; e.g. root sakan (LIVE) + past + 1st-sg + particle, built from the affix -tu and particle fa-: "so I lived"]

• ~5000 roots
• several hundred patterns
• dozens of affixes
Vocabulary Growth - full word forms

[Plot: vocabulary size (0–16,000) vs. number of word tokens (10k–120k) for CallHome English and Arabic, full word forms]
Source: K. Kirchhoff, et al., “Novel Approaches to Arabic Speech Recognition - Final Report from the JHU Summer Workshop 2002”, JHU Tech Report 2002
Vocabulary Growth - stemmed words

[Plot: vocabulary size (0–16,000) vs. number of word tokens (10k–120k) for CallHome, comparing full words (EN words, AR words) with stemmed words (EN stems, AR stems)]
Source: K. Kirchhoff, et al., “Novel Approaches to Arabic Speech Recognition - Final Report from the JHU Summer Workshop 2002”, JHU Tech Report 2002
Solution: Word as Factors

• Decompose words into "factors" (e.g. stems)
• Build a language model over factors: P(w | factors)
• Two approaches for decomposition (see the sketch below):
  • Linear [e.g. Geutner, 1995]: a word is split into a sequence of units (e.g. prefix, stem, suffix) and the model runs over that flattened stream
  • Parallel [Kirchhoff et al., JHU Workshop 2002], [Bilmes & Kirchhoff, NAACL/HLT 2003]: each word position t carries a bundle of parallel factor streams, e.g. words Wt, stems St, and morph tags Mt
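A minimal Python sketch contrasting the two decompositions; the morph segmentation and factor values for the Arabic example are illustrative assumptions, not taken from the slides.

# Linear decomposition: a word becomes a sequence of smaller units,
# and the language model is built over that flattened sequence.
def linear_decompose(word_to_morphs, sentence):
    return [m for w in sentence for m in word_to_morphs.get(w, [w])]

# Parallel decomposition: each word position carries a bundle of factors
# (word, stem, root, tag, ...), all aligned at the same time step t.
def parallel_decompose(word_to_factors, sentence):
    return [word_to_factors.get(w, {"word": w}) for w in sentence]

morphs = {"kitaabiy": ["kitaab", "-iy"]}                       # illustrative segmentation
factors = {"kitaabiy": {"word": "kitaabiy", "stem": "kitaab",
                        "root": "ktb", "tag": "noun+poss"}}    # illustrative factor bundle

print(linear_decompose(morphs, ["kitaabiy"]))     # ['kitaab', '-iy']
print(parallel_decompose(factors, ["kitaabiy"]))  # [{'word': ..., 'stem': ..., ...}]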
Factored Word Representations

• Factors may be any word feature. Here we use morphological features:
  • E.g. POS, stem, root, pattern, etc.

$$w \equiv \{f^1, f^2, \ldots, f^K\} \equiv f^{1:K}$$

$$p(w_1, w_2, \ldots, w_T) \equiv p(f_1^{1:K}, f_2^{1:K}, \ldots, f_T^{1:K}) \approx \prod_{t=1}^{T} p(f_t^{1:K} \mid f_{t-1}^{1:K}, f_{t-2}^{1:K})$$

E.g., with word (W), stem (S), and morph-tag (M) factor streams:

$$P(w_t \mid w_{t-1}, w_{t-2}, s_{t-1}, s_{t-2}, m_{t-1}, m_{t-2})$$
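As a rough illustration of the factored trigram above, the following sketch collects counts for a word conditioned on the word, stem, and morph-tag factors of the two previous positions; the factor names, sentence format, and helper names are assumptions for the example, not part of the slides.

from collections import Counter

FACTORS = ("word", "stem", "morph")   # K = 3 factors per position (illustrative)

def factored_trigram_counts(sentences):
    """Counts for P(w_t | w_{t-1}, w_{t-2}, s_{t-1}, s_{t-2}, m_{t-1}, m_{t-2})."""
    event, history = Counter(), Counter()
    for sent in sentences:                        # each sentence: a list of factor dicts
        for t in range(2, len(sent)):
            parents = (tuple(sent[t-1][f] for f in FACTORS)
                       + tuple(sent[t-2][f] for f in FACTORS))
            event[(sent[t]["word"],) + parents] += 1
            history[parents] += 1
    return event, history

def p_ml(w, parents, event, history):
    """Maximum-likelihood estimate of the factored conditional (no backoff yet)."""
    return event[(w,) + parents] / history[parents] if history[parents] else 0.0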
Advantage of Factored Word Representations

• Main advantage: allows robust estimation of probabilities (i.e. $p(f_t \mid f_{t-1}^{1:K}, f_{t-2}^{1:K})$) using backoff
  • Word combinations in context may not be observed in training data, but factor combinations are
• Simultaneous class assignment

Each word is a vector of factors (word, stem, root, tag):

  Word                        word          stem      root   tag
  Kutub (Books)               kutub         kutub     ktb    noun (pl.)
  Kitaab-iy (My book)         kitaab-iy     kitaab    ktb    noun+poss
  Kitaabu-hum (Their book)    kitaabu-hum   kitaabu   ktb    noun+poss
Example
• Training sentence: “lAzim tiqra kutubiy bi sorca”(You have to read my books quickly)
• Test sentence: “lAzim tiqra kitAbiy bi sorca” (You have to read my book quickly)
Count(tiqra, kitAbiy, bi) = 0
Count(tiqra, kutubiy, bi) > 0
Count(tiqra, ktb, bi) > 0
P(bi | kitAbiy, tiqra) can back off to
P(bi | ktb, tiqra) to obtain a more robust estimate.
=> This is better than backing off to P(bi | <unknown>, tiqra).
Language Model Backoff

• When the n-gram count is low, use the (n-1)-gram estimate
• Ensures more robust parameter estimation in sparse data

Word-based LM: a single backoff path, dropping the most distant word at each step:
  P(Wt | Wt-1 Wt-2 Wt-3) → P(Wt | Wt-1 Wt-2) → P(Wt | Wt-1) → P(Wt)

Factored LM: a backoff graph, so multiple backoff paths are possible (see the path-enumeration sketch below):
  F | F1 F2 F3
  F | F1 F2    F | F1 F3    F | F2 F3
  F | F1       F | F2       F | F3
  F
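The backoff graph over three conditioning factors is the lattice of subsets of {F1, F2, F3}. A small sketch that enumerates every backoff path from the full context down to the unconditioned node; the function names are mine, not part of the SRILM toolkit.

from itertools import combinations

def backoff_children(parents):
    """Children of a backoff-graph node: drop exactly one conditioning factor."""
    return [tuple(c) for c in combinations(parents, len(parents) - 1)]

def backoff_paths(parents):
    """All root-to-leaf backoff paths, e.g. (F1,F2,F3) -> (F1,F3) -> (F3,) -> ()."""
    if not parents:
        return [[()]]
    paths = []
    for child in backoff_children(parents):
        for tail in backoff_paths(child):
            paths.append([parents] + tail)
    return paths

print(len(backoff_paths(("F1", "F2", "F3"))))  # 3! = 6 distinct paths for 3 factors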
Choosing Backoff Paths

• Four methods for choosing a backoff path:
  1. Fixed path (chosen a priori)
  2. Choose the path dynamically during training
  3. Choose multiple paths dynamically during training and combine the results (Generalized Parallel Backoff)
  4. Constrained version of (2) or (3)
Generalized Backoff

• Katz Backoff:

$$P_{BO}(w_t \mid w_{t-1}, w_{t-2}) =
\begin{cases}
d_{N(w_t, w_{t-1}, w_{t-2})}\,\dfrac{N(w_t, w_{t-1}, w_{t-2})}{N(w_{t-1}, w_{t-2})} & \text{if } N(w_t, w_{t-1}, w_{t-2}) > 0 \\[2ex]
\alpha(w_{t-1}, w_{t-2})\, P_{BO}(w_t \mid w_{t-1}) & \text{otherwise}
\end{cases}$$

• Generalized Backoff:

$$P_{GBO}(f \mid f_{P_1}, f_{P_2}) =
\begin{cases}
d_{N(f, f_{P_1}, f_{P_2})}\,\dfrac{N(f, f_{P_1}, f_{P_2})}{N(f_{P_1}, f_{P_2})} & \text{if } N(f, f_{P_1}, f_{P_2}) > 0 \\[2ex]
\alpha(f_{P_1}, f_{P_2})\, g(f, f_{P_1}, f_{P_2}) & \text{otherwise}
\end{cases}$$

$$\alpha(f_{P_1}, f_{P_2}) =
\frac{1 - \sum_{f : N(f, f_{P_1}, f_{P_2}) > 0} d_{N(f, f_{P_1}, f_{P_2})}\,\frac{N(f, f_{P_1}, f_{P_2})}{N(f_{P_1}, f_{P_2})}}
{\sum_{f : N(f, f_{P_1}, f_{P_2}) = 0} g(f, f_{P_1}, f_{P_2})}$$

g() can be any positive function, but some choices of g() make the backoff weight computation difficult.
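A minimal Python sketch of the generalized backoff recursion above. This is a simplified toy, not the toolkit's implementation: it uses a single constant discount in place of the count-dependent $d_N$, and it takes g() as a plain callable over (f, parents).

from collections import Counter

class GeneralizedBackoff:
    def __init__(self, events, discount=0.7):
        # events: iterable of (f, parents) pairs observed in training, parents a tuple
        self.n_full = Counter()   # N(f, f_P1, f_P2)
        self.n_hist = Counter()   # N(f_P1, f_P2)
        self.vocab = set()
        for f, parents in events:
            self.n_full[(f,) + parents] += 1
            self.n_hist[parents] += 1
            self.vocab.add(f)
        self.d = discount         # constant discount (real models use count-dependent d_N)

    def p(self, f, parents, g):
        """P_GBO(f | parents): discounted ML if seen, else alpha(parents) * g(f, parents)."""
        if self.n_full[(f,) + parents] > 0:
            return self.d * self.n_full[(f,) + parents] / self.n_hist[parents]
        return self._alpha(parents, g) * g(f, parents)

    def _alpha(self, parents, g):
        """Backoff weight: the probability mass left by discounting, normalized over g() on unseen f."""
        if self.n_hist[parents] == 0:
            seen_mass = 0.0
        else:
            seen_mass = sum(self.d * self.n_full[(f,) + parents] / self.n_hist[parents]
                            for f in self.vocab if self.n_full[(f,) + parents] > 0)
        unseen_g = sum(g(f, parents) for f in self.vocab if self.n_full[(f,) + parents] == 0)
        return (1.0 - seen_mass) / unseen_g if unseen_g > 0 else 0.0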
g() functions

• A priori fixed path:

$$g(f, f_{P_1}, f_{P_2}) = P_{BO}(f \mid f_{P_1})$$

• Dynamic path, max counts (based on raw counts => favors robust estimation):

$$g(f, f_{P_1}, f_{P_2}) = P_{BO}(f \mid f_{P_{j^*}}), \qquad j^* = \arg\max_j N(f, f_{P_j})$$

• Dynamic path, max normalized counts (based on maximum likelihood => favors statistical predictability):

$$j^* = \arg\max_j \frac{N(f, f_{P_j})}{N(f_{P_j})}$$
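The three g() choices written out as small Python factories; a sketch under stated assumptions: `p_bo` is a lower-order backed-off model callable, and `pair_counts` / `parent_counts` are Counter-like count tables (missing keys count as 0). None of these names come from the slides or the toolkit.

def g_fixed(p_bo):
    """A priori fixed path: g(f, f_P1, f_P2) = P_BO(f | f_P1), i.e. always keep the first parent."""
    return lambda f, parents: p_bo(f, (parents[0],))

def g_max_counts(p_bo, pair_counts):
    """Dynamic path: back off toward j* = argmax_j N(f, f_Pj) (raw counts => robust estimation)."""
    def g(f, parents):
        j = max(range(len(parents)), key=lambda i: pair_counts[(f, parents[i])])
        return p_bo(f, (parents[j],))
    return g

def g_max_norm_counts(p_bo, pair_counts, parent_counts):
    """Dynamic path: j* = argmax_j N(f, f_Pj) / N(f_Pj) (maximum likelihood => predictability)."""
    def g(f, parents):
        j = max(range(len(parents)),
                key=lambda i: pair_counts[(f, parents[i])] / max(parent_counts[parents[i]], 1))
        return p_bo(f, (parents[j],))
    return g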
Dynamically Choosing Backoff Paths During Training

• Choose the backoff path based on g() and statistics of the data

[Figure: starting from the full node, one path through the backoff graph is selected step by step, e.g. Wt | Wt-1 St-1 Tt-1 → Wt | Wt-1 St-1 → Wt | St-1 → Wt]
Multiple Backoff Paths: Generalized Parallel Backoff

• Choose multiple paths during training and combine the probability estimates, e.g. backing off from Wt | Wt-1 St-1 Tt-1 to both Wt | Wt-1 St-1 and Wt | Wt-1 Tt-1:

$$p_{GBO}(w_t \mid w_{t-1}, s_{t-1}, t_{t-1}) =
\begin{cases}
d_c\, p_{ML}(w_t \mid w_{t-1}, s_{t-1}, t_{t-1}) & \text{if count} \ge \text{threshold} \\[1ex]
\alpha\,\tfrac{1}{2}\left[\, p_{GBO}(w_t \mid w_{t-1}, s_{t-1}) + p_{GBO}(w_t \mid w_{t-1}, t_{t-1}) \,\right] & \text{otherwise}
\end{cases}$$

Options for combination: average, sum, product, geometric mean, weighted mean.
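A toy sketch of the parallel combination step: above the count threshold the discounted maximum-likelihood estimate is used; below it, the lower-order backoff estimates from the parallel nodes are combined using one of the options listed on the slide. The function names and weighted-mean omission are my own simplifications, not the toolkit's code.

import math

def combine(estimates, how="mean"):
    """Combine parallel backoff estimates from several dropped-parent nodes."""
    if how == "mean":
        return sum(estimates) / len(estimates)
    if how == "sum":
        return sum(estimates)
    if how == "product":
        return math.prod(estimates)
    if how == "geometric_mean":
        return math.prod(estimates) ** (1.0 / len(estimates))
    raise ValueError(f"unknown combination: {how}")

def parallel_backoff(count, threshold, p_ml, lower_order_estimates, discount, alpha):
    """p_GBO(w | w_prev, s_prev, t_prev): discounted ML above threshold, else alpha * combined estimates."""
    if count >= threshold:
        return discount * p_ml
    return alpha * combine(lower_order_estimates, how="mean")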
Summary: Factored Language Models

FACTORED LANGUAGE MODEL = Factored Word Representation + Generalized Backoff

• Factored Word Representation
  • Allows a rich feature-set representation of words
• Generalized (Parallel) Backoff
  • Enables robust estimation of models with many conditioning variables
Model Selection Problem

• In n-gram models, choose, e.g., bigram vs. trigram vs. 4-gram
  => relatively easy search; just try each and note perplexity on a development set
• In a Factored LM, choose:
  • Initial Conditioning Factors
  • Backoff Graph
  • Smoothing Options
  => too many options; an automatic search is needed
• Tradeoff: the Factored LM is more general, but it is harder to select a good model that fits the data well (the genetic-algorithm sketch below illustrates one such search).
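The Turkish and Arabic results later in the deck use Genetic Algorithms for this search [Duh & Kirchhoff, Coling 2004]. A minimal sketch of the idea, encoding a candidate model as a bit string over possible conditioning factors and evolving it against dev-set perplexity; all names and the fitness hook are illustrative, and the actual system also searches over backoff graphs and smoothing options, which this sketch omits.

import random

FACTOR_POOL = ["W-1", "S-1", "R-1", "T-1", "W-2", "S-2"]   # candidate conditioning factors

def random_model():
    return [random.randint(0, 1) for _ in FACTOR_POOL]      # 1 = factor is used

def crossover(a, b):
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(model, rate=0.1):
    return [bit ^ (random.random() < rate) for bit in model]

def genetic_search(dev_perplexity, generations=20, pop_size=10):
    """dev_perplexity(model) -> float: trains the FLM described by `model` and scores it on the dev set."""
    population = [random_model() for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(population, key=dev_perplexity)      # lower perplexity = fitter
        parents = scored[: pop_size // 2]
        children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                    for _ in range(pop_size - len(parents))]
        population = parents + children
    return min(population, key=dev_perplexity)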
Example: a Factored LM

• Initial Conditioning Factors, Backoff Graph, and Smoothing parameters completely specify a Factored Language Model
• E.g. 3 factors total:

0. Begin with the full backoff-graph structure for 3 factors (from Wt | Wt-1 St-1 Tt-1 down to Wt)
1. The Initial Factors specify the start node, e.g. Wt | Wt-1 St-1
Example: a Factored LM (cont.)

3. Begin with the subgraph rooted at the new start node (Wt | Wt-1 St-1, Wt | Wt-1, Wt | St-1, Wt)
4. Specify the backoff graph: i.e. which backoff to use at each node
5. Specify smoothing for each edge

(An illustrative encoding of such a specification follows below.)
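One way to picture what steps 1–5 pin down: a factored LM is fully specified by the start node (initial conditioning factors), the backoff edges chosen below it, and a smoothing option per edge. A tiny illustrative Python encoding; this is not the SRILM file format, which appears on the fngram slide later.

from dataclasses import dataclass, field

@dataclass
class FLMSpec:
    child: str                       # factor being predicted, e.g. "Wt"
    initial_parents: tuple           # start node, e.g. ("Wt-1", "St-1")
    backoff_edges: dict = field(default_factory=dict)   # node -> list of nodes to back off to
    smoothing: dict = field(default_factory=dict)        # edge -> smoothing options

spec = FLMSpec(
    child="Wt",
    initial_parents=("Wt-1", "St-1"),
    backoff_edges={("Wt-1", "St-1"): [("Wt-1",), ("St-1",)],   # parallel backoff at the top node
                   ("Wt-1",): [()],
                   ("St-1",): [()]},
    smoothing={(("Wt-1", "St-1"), ("Wt-1",)): "kndiscount gtmin 1"},
)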
Applications for Factored LM

• Modeling of Arabic, Turkish, Finnish, German, and other morphologically-rich languages
  • [Kirchhoff, et al., JHU Summer Workshop 2002]
  • [Duh & Kirchhoff, Coling 2004], [Vergyri, et al., ICSLP 2004]
• Modeling of conversational speech
  • [Ji & Bilmes, HLT 2004]
• Applied in Speech Recognition and Machine Translation
• General Factored LM tools can also be used to obtain various smoothed conditional probability tables for other applications outside of language modeling (e.g. tagging)
• More possibilities (factors can be anything!)
To explore further…

• The Factored Language Model is now part of the standard SRI Language Modeling Toolkit distribution (v.1.4.1)
  • Thanks to Jeff Bilmes (UW) and Andreas Stolcke (SRI)
  • Downloadable at:
http://www.speech.sri.com/projects/srilm/
fngram Tools

fngram-count -factor-file my.flmspec -text train.txt
fngram -factor-file my.flmspec -ppl test.txt

train.txt ("Factored LM is fun"):
  W-Factored:P-adj W-LM:P-noun W-is:P-verb W-fun:P-adj

my.flmspec:
  W: 2 W(-1) P(-1) my.count my.lm 3
    W1,P1 W1 kndiscount gtmin 1 interpolate
    P1 P1 kndiscount gtmin 1
    0 0 kndiscount gtmin 1
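A rough reading of the spec fields (my gloss, not stated on the slide): the header line names the child factor (W), the number of conditioning parents (2), the parents themselves (the previous word W(-1) and previous POS tag P(-1)), the count and LM output files, and the number of backoff-graph nodes that follow (3). Each subsequent line appears to describe one node: the parents still active at that node, the parent(s) to drop when backing off from it, and the smoothing options for that node (here Kneser-Ney discounting with a minimum-count cutoff of 1, interpolated at the top node).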
Turkish Language Model

• Newspaper text from the web [Hakkani-Tür, 2000]
  • Train: 400K tokens / Dev: 100K / Test: 90K
• Factors from a morphological analyzer

Each word is a vector of factors (word, root, part-of-speech, number, case, other, inflection-group), e.g. for "yararmanlak":

  word:             yararmanlak
  root:             yarar
  part-of-speech:   Noun
  number:           singular (A3sg)
  case:             Nom
  other:            Pnon
  inflection-group: NounA3sgPnonNom+Verb+Acquire+Pos
Turkish: Dev Set Perplexity

  Ngram   Word-based LM   Hand FLM   Random FLM   Genetic FLM   ppl (%)
  2       593.8           555.0      556.4        539.2          -2.9
  3       534.9           533.5      497.1        444.5         -10.6
  4       534.8           549.7      566.5        522.2          -5.0

• Factored Language Models found by Genetic Algorithms perform best
• The poor performance of the high-order hand-designed FLM reflects the difficulty of manual search
Turkish: Eval Set Perplexity

  Ngram   Word-based LM   Hand FLM   Random FLM   Genetic FLM   ppl (%)
  2       609.8           558.7      525.5        487.8          -7.2
  3       545.4           583.5      509.8        452.7         -11.2
  4       543.9           559.8      574.6        527.6          -5.8

• The Dev Set results generalize to the Eval Set => the Genetic Algorithm did not overfit
• The best models used Word, POS, Case, and Root factors and parallel backoff
Arabic Language Model

• LDC CallHome Conversational Egyptian Arabic speech transcripts
  • Train: 170K words / Dev: 23K / Test: 18K
• Factors from a morphological analyzer
  • [LDC, 1996], [Darwish, 2002]

Each word is a vector of factors (word, root, morphological tag, stem, pattern), e.g. for "Il+dOr":

  word:               Il+dOr
  root:               dwr
  morphological tag:  Noun+masc-sg+article
  stem:               dOr
  pattern:            CCC
Arabic: Dev Set and Eval Set Perplexity

Dev Set perplexities:

  Ngram   Word-based LM   Hand FLM   Random FLM   Genetic FLM   ppl (%)
  2       229.9           229.6      229.9        222.9         -2.9
  3       229.3           226.1      230.3        212.6         -6.0

Eval Set perplexities:

  Ngram   Word-based LM   Hand FLM   Random FLM   Genetic FLM   ppl (%)
  2       249.9           230.1      239.2        223.6         -2.8
  3       285.4           217.1      224.3        206.2         -5.0

• The best models used all available factors (Word, Stem, Root, Pattern, Morph) and various parallel backoffs
Word Error Rate (WER) Results

Dev Set:

  Stage   Word LM Baseline   Factored LM
  1       57.3               56.2
  2a      54.8               52.7
  2b      54.3               52.5
  3       53.9               52.1

Eval Set (eval 97):

  Stage   Word LM Baseline   Factored LM
  1       61.7               61.0
  2a      58.2               56.5
  2b      58.8               57.4
  3       57.6               56.1

• Factored language models gave an absolute WER improvement of about 1.5% (e.g. 57.6% → 56.1% on the eval set at the final stage).