Lecture 10
Advanced Language Modeling
Michael Picheny, Bhuvana Ramabhadran, Stanley F. Chen, Markus Nussbaum-Thom
Watson Group, IBM T.J. Watson Research Center, Yorktown Heights, New York, USA
{picheny,bhuvana,stanchen,nussbaum}@us.ibm.com
30 March 2016
Administrivia
Lab 3 handed back today?
Lab 4 extension: due coming Monday, April 4, at 6pm.
No Lab 5.
Information on final projects to be announced imminently.
IBM Watson Astor Place Field Trip!
In two days: Friday, April 1, 11am-1pm.
51 Astor Place; meet in entrance lobby. Free lunch!
(Going to Watson Client Experience Center on 5th floor.)
Road Map
Review: Language Modeling
(answer) = arg maxω (language model) × (acoustic model)
         = arg maxω P(ω) × P(x|ω)
Homophones.
THIS IS OUR ROOM FOR A FOUR HOUR PERIOD .PERIOD
THIS IS HOUR ROOM FOUR A FOR OUR .PERIOD
Confusable sequences in general.
IT IS EASY TO RECOGNIZE SPEECH .
IT IS EASY TO WRECK A NICE PEACH .
Language Modeling: Goals
Assign high probabilities to the good stuff.
Assign low probabilities to the bad stuff.
Restrict choices given to AM.
Review: N-Gram Models
Decompose probability of sequence . . .
Into product of conditional probabilities.
e.g., trigram model ⇒ Markov order 2 ⇒ . . .
Remember last 2 words.
P(I LIKE TO BIKE) = P(I | . .) × P(LIKE | . I) × P(TO | I LIKE) ×
                    P(BIKE | LIKE TO) × P(/ | TO BIKE)
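The chain-rule decomposition above is easy to sketch in Python. A minimal scorer, assuming a hypothetical `prob` table of trigram probabilities and using `</s>` to stand in for the slide's sentence-end symbol:

```python
import math

def trigram_logprob(sentence, prob, bos="<s>", eos="</s>"):
    """Chain-rule score: sum of log P(w_i | w_{i-2} w_{i-1}), with the
    history padded by <s> tokens and the sentence closed by </s>."""
    words = [bos, bos] + sentence.split() + [eos]
    return sum(math.log(prob[(words[i - 2], words[i - 1], words[i])])
               for i in range(2, len(words)))
```

With five trigram factors of probability 0.5 each, the log probability is 5 log 0.5.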
Estimating N-Gram Models
Maximum likelihood estimation?
PMLE(TO | I LIKE) = c(I LIKE TO) / c(I LIKE)
Smoothing!
Psmooth(wi | wi−1) =
  Pprimary(wi | wi−1)       if count(wi−1 wi) > 0
  αwi−1 × Psmooth(wi)       otherwise
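The backoff scheme above can be sketched as follows; for illustration the "primary" estimate is the raw bigram MLE, whereas a real smoother (e.g., Katz) would discount it so that the backoff mass in `alpha` is properly reserved. All tables here are hypothetical:

```python
def backoff_prob(w_prev, w, bigram_counts, unigram_prob, alpha):
    """Backoff sketch: if the bigram was seen, use a primary estimate
    (raw MLE here, for simplicity); otherwise back off to the unigram
    model scaled by the backoff weight alpha[w_prev]."""
    if bigram_counts.get((w_prev, w), 0) > 0:
        total = sum(c for (p, _), c in bigram_counts.items() if p == w_prev)
        return bigram_counts[(w_prev, w)] / total
    return alpha[w_prev] * unigram_prob[w]
```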
N-Gram Models Are Great!
N-gram models are robust.
Assigns nonzero probs to all word sequences.
N-gram models are easy to build.
Can train on unannotated text; no iteration.
N-gram models are scalable.
Can build 1+ GW models, fast; can increase n.
Or Are They?
In fact, n-gram models are deeply flawed.
Let us count the ways.
What About Short-Distance Dependencies?
BUT THERE’S MORE .PERIOD
IT’S NOT LIMITED TO PROCTER .PERIOD
MR. ANDERS WRITES ON HEALTH CARE FOR THE JOURNAL .PERIOD
ALTHOUGH PEOPLE’S PURCHASING POWER HAS FALLEN AND SOME HEAVIER INDUSTRIES ARE SUFFERING ,COMMA FOOD SALES ARE GROWING .PERIOD
"DOUBLE-QUOTE THE FIGURES BASICALLY SHOW THAT MANAGERS HAVE BECOME MORE NEGATIVE TOWARD U. S. EQUITIES SINCE THE FIRST QUARTER ,COMMA "DOUBLE-QUOTE SAID ANDREW MILLIGAN ,COMMA AN ECONOMIST AT SMITH NEW COURT LIMITED .PERIOD
P. &ERSAND G. LIFTS PRICES AS OFTEN AS WEEKLY TO COMPENSATE FOR THE DECLINE OF THE RUBLE ,COMMA WHICH HAS FALLEN IN VALUE FROM THIRTY FIVE RUBLES TO THE DOLLAR IN SUMMER NINETEEN NINETY ONE TO THE CURRENT RATE OF SEVEN HUNDRED SIXTY SIX .PERIOD
Poor Generalization
Lots of unseen n-grams.
e.g., 350MW training ⇒ 15% trigrams unseen.
Seeing “nearby” n-grams doesn’t help.
LET’S EAT STEAK ON TUESDAYLET’S EAT SIRLOIN ON THURSDAY
Medium-Distance Dependencies?
“Medium-distance”⇔ within utterance.
FABIO WHO WAS NEXT ASKED IF THE TELLER . . .
Does a trigram model do the “right” thing?
Generating Text From A Language Model
Reveals what word sequences the model thinks are likely.
e.g., P(wi | IN THE)
Trigram Model Trained On WSJ, 20MW
AND WITH WHOM IT MATTERS AND IN THE SHORT -HYPHEN TERM
AT THE UNIVERSITY OF MICHIGAN IN A GENERALLY QUIET SESSION
THE STUDIO EXECUTIVES LAW
REVIEW WILL FOCUS ON INTERNATIONAL UNION OF THE STOCK MARKET
HOW FEDERAL LEGISLATION
"DOUBLE-QUOTE SPENDING
THE LOS ANGELES
THE TRADE PUBLICATION
SOME FORTY %PERCENT OF CASES ALLEGING GREEN PREPARING FORMS
NORTH AMERICAN FREE TRADE AGREEMENT (LEFT-PAREN NAFTA )RIGHT-PAREN ,COMMA WOULD MAKE STOCKS
A MORGAN STANLEY CAPITAL INTERNATIONAL PERSPECTIVE ,COMMA GENEVA
"DOUBLE-QUOTE THEY WILL STANDARD ENFORCEMENT
THE NEW YORK MISSILE FILINGS OF BUYERS
What’s wrong?
What Are Real Utterances Like?
Don’t end/start abruptly.
Have matching quotes.
Are about single subject.
May even be grammatical.
And make sense.
Why can’t n-gram models model this stuff?
Long-Distance Dependencies?
“Long-distance” ⇔ between utterances.
P(ω = w1 · · · wl) = frequency of utterance?
P(~ω = ω1 · · · ωL) = frequency of utterance sequence!
Trigram Model Trained On WSJ, 20MW
AND WITH WHOM IT MATTERS AND IN THE SHORT -HYPHEN TERM
AT THE UNIVERSITY OF MICHIGAN IN A GENERALLY QUIET SESSION
THE STUDIO EXECUTIVES LAW
REVIEW WILL FOCUS ON INTERNATIONAL UNION OF THE STOCK MARKET
HOW FEDERAL LEGISLATION
"DOUBLE-QUOTE SPENDING
THE LOS ANGELES
THE TRADE PUBLICATION
SOME FORTY %PERCENT OF CASES ALLEGING GREEN PREPARING FORMS
NORTH AMERICAN FREE TRADE AGREEMENT (LEFT-PAREN NAFTA )RIGHT-PAREN ,COMMA WOULD MAKE STOCKS
A MORGAN STANLEY CAPITAL INTERNATIONAL PERSPECTIVE ,COMMA GENEVA
"DOUBLE-QUOTE THEY WILL STANDARD ENFORCEMENT
THE NEW YORK MISSILE FILINGS OF BUYERS
What’s wrong?
What Is Real Text Like?
Adjacent utterances tend to be on same topic.
And refer to same entities, e.g., Clinton.
In a similar style, e.g., formal vs. conversational.
Why can’t n-gram models model this stuff?
Recap: Shortcomings of N-Gram Models
Not great at modeling short-distance dependencies.
Not great at modeling medium-distance dependencies.
Not great at modeling long-distance dependencies.
Basically, dumb idea.
Insult to language modeling researchers.
Great for me to poop on.
N-gram models, . . . you’re fired!
Part I
Language Modeling, Pre-2005-ish
Where Are We?
1 Short-Distance Dependencies: Word Classes
2 Medium-Distance Dependencies: Grammars
3 Long-Distance Dependencies: Adaptation
4 Decoding With Advanced Language Models
5 Discussion
Improving Short-Distance Modeling
Word n-gram models do not generalize well.LET’S EAT STEAK ON TUESDAY
LET’S EAT SIRLOIN ON THURSDAY
Idea: word n-gram⇒ class n-grams!?
PMLE([DAY] | [FOOD] [PREP]) = c([FOOD] [PREP] [DAY]) / c([FOOD] [PREP])
Any instance of class trigram increases . . .
Probs of all other instances of class trigram.
Getting From Class to Word Probabilities
What we have:
P([DAY] | [FOOD] [PREP])⇔ P(ci |ci−2ci−1)
What we want:
P(THURSDAY | SIRLOIN ON)⇔ P(wi |wi−2wi−1)
Predict current word given (non-hidden) class.
P(THURSDAY | SIRLOIN ON) = P([DAY] | [FOOD] [PREP]) × P(THURSDAY | [DAY])
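As a sketch, with hypothetical `cls`, `p_class`, and `p_word` tables standing in for trained class and emission models:

```python
def class_trigram_prob(w, h2, h1, cls, p_class, p_word):
    """P(w | h2 h1) via hard classes: the class trigram probability
    times the class-conditional word emission probability."""
    return p_class[(cls[h2], cls[h1], cls[w])] * p_word[(w, cls[w])]
```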
How To Assign Words To Classes?
For generalization to work sensibly . . .
Group “related” words in same class.
ROSE FELL DROPPED GAINED JUMPED CLIMBED SLIPPED
HEYDAY MINE’S STILL MACHINE NEWEST HORRIFIC BEECH
With vocab sizes of 50,000+, can’t do this manually.
⇒ Unsupervised clustering!
Word Clustering (Brown et al., 1992)
Class trigram model.
P(wi |wi−2wi−1) = P(ci |ci−2ci−1)× P(wi |ci)
Idea: choose classes to optimize training likelihood (MLE)!?
Simplification: use class bigram model.
P(wi |wi−1) = P(ci |ci−1)× P(wi |ci)
Fix number of classes, e.g., 1000; hill-climbing search.
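A deliberately naive sketch of the hill-climbing idea: score a class assignment by the MLE class bigram log-likelihood, then greedily move each word to whichever class most improves it. Real implementations (Brown et al.) update counts incrementally rather than rescoring the corpus from scratch; the corpus and class names below are illustrative only:

```python
import math
from collections import Counter

def class_bigram_ll(tokens, cls):
    """Training log-likelihood under an MLE class bigram model:
    P(w_i | w_{i-1}) = P(c_i | c_{i-1}) * P(w_i | c_i)."""
    pairs = Counter((cls[a], cls[b]) for a, b in zip(tokens, tokens[1:]))
    c_left = Counter(cls[a] for a in tokens[:-1])   # class counts as history
    c_tok = Counter(cls[w] for w in tokens)         # class token counts
    w_tok = Counter(tokens)                         # word token counts
    ll = 0.0
    for a, b in zip(tokens, tokens[1:]):
        ll += math.log(pairs[(cls[a], cls[b])] / c_left[cls[a]])
        ll += math.log(w_tok[b] / c_tok[cls[b]])
    return ll

def exchange_step(tokens, cls, classes):
    """One exchange pass: move each word to the class maximizing training
    likelihood (keeping its current class is always a candidate, so the
    likelihood never decreases)."""
    for w in set(tokens):
        cls[w] = max(classes,
                     key=lambda c: class_bigram_ll(tokens, {**cls, w: c}))
    return cls
```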
Example Classes, 900MW Training Data
OF
THE TONIGHT’S SARAJEVO’S JUPITER’S PLATO’S CHILDHOOD’S GRAVITY’S EVOLUTION’S
AS BODES AUGURS BODED AUGURED
HAVE HAVEN’T WHO’VE
DOLLARS BARRELS BUSHELS DOLLARS’ KILOLITERS
MR. MS. MRS. MESSRS. MRS
HIS SADDAM’S MOZART’S CHRIST’S LENIN’S NAPOLEON’S JESUS’ ARISTOTLE’S DUMMY’S APARTHEID’S FEMINISM’S
ROSE FELL DROPPED GAINED JUMPED CLIMBED SLIPPED TOTALED EASED PLUNGED SOARED SURGED TOTALING AVERAGED TUMBLED SLID SANK SLUMPED REBOUNDED PLUMMETED DIPPED FIRMED RETREATED TOTALLING LEAPED SHRANK SKIDDED ROCKETED SAGGED LEAPT ZOOMED SPURTED RALLIED TOTALLED NOSEDIVED
Class N-Gram Model Performance (WSJ)
[Figure: WER (20–36%) vs. training set size (20kw, 200kw, 2MW, 20MW), comparing word n-gram and class n-gram models.]
Combining Models: Linear Interpolation
A “hammer” for combining models.
Pcombine(·|·) = λ× P1(·|·) + (1− λ)× P2(·|·)
Combined model probabilities sum to 1 correctly.
Easy to train λ to maximize likelihood of data. (How?)
Pcombine(wi | wi−2wi−1) = λ × Pword(wi | wi−2wi−1) +
                          (1 − λ) × Pclass(wi | wi−2wi−1)
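One answer to the "(How?)": treat each held-out event as generated by a two-component mixture and fit λ with EM. A sketch, with `p1`/`p2` as hypothetical per-event probability tables:

```python
def tune_lambda(p1, p2, held_out, iters=50):
    """Fit the interpolation weight by EM on held-out data: the E-step
    computes, per event, the posterior that model 1 generated it; the
    M-step sets lambda to the average posterior."""
    lam = 0.5
    for _ in range(iters):
        post = [lam * p1[e] / (lam * p1[e] + (1 - lam) * p2[e])
                for e in held_out]
        lam = sum(post) / len(post)
    return lam
```

With `p1 = {a: 0.9, b: 0.1}`, `p2 = {a: 0.1, b: 0.9}` and held-out data `a a a b`, the maximum-likelihood weight works out to λ = 0.8125.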
Combining Word and Class N-Gram Models
[Figure: WER (20–36%) vs. training set size (20kw, 200kw, 2MW, 20MW), comparing word n-gram, class n-gram, and interpolated models.]
Discussion: Class N-Gram Models
Smaller than word n-gram models.
N-gram model over vocab of ∼1000, not ∼50000.
Interpolation ⇒ overall model larger.
Easy to add new words to vocabulary.
Only need to initialize P(wnew | cnew).
P(wi |wi−2wi−1) = P(ci |ci−2ci−1)× P(wi |ci)
Where Are We?
1 Short-Distance Dependencies: Word Classes
2 Medium-Distance Dependencies: Grammars
3 Long-Distance Dependencies: Adaptation
4 Decoding With Advanced Language Models
5 Discussion
Modeling Medium-Distance Dependencies
N-gram models predict identity of next word . . .
Based on identities of words in fixed positions in past.
Important words for prediction may occur elsewhere.
Important word for predicting SAW is DOG.
(S (NP (DET THE) (N DOG)) (VP (V SAW) (PN ROY)))
Modeling Medium-Distance Dependencies
Important words for prediction may occur elsewhere.
Important word for predicting SAW is DOG.
Instead of conditioning on fixed number of words back . . .
Condition on words in fixed positions in parse tree!?
(S (NP (NP (DET THE) (N DOG)) (PP (P ON) (A TOP))) (VP (V SAW) (PN ROY)))
Using Grammatical Structure
Each constituent has headword.
Condition on preceding exposed headwords?
(S:SAW (NP:DOG (NP:DOG (DET THE) (N DOG)) (PP:ON (P ON) (A TOP))) (VP:SAW (V SAW) (PN ROY)))
Using Grammatical Structure
Predict next word based on preceding exposed headwords.
P(THE | . .)
P(DOG | . THE)
P(ON | . DOG)
P(TOP | DOG ON)
P(SAW | . DOG)
P(ROY | DOG SAW)
Picks most relevant preceding words . . .
Regardless of position.
Structured language model (Chelba and Jelinek, 2000).
Hey, Where Do Parse Trees Come From?
Come up with grammar rules . . .
S → NP VP
NP → DET N | PN | NP PP
N → dog | cat
Come up with probabilistic parametrization.
PMLE(S → NP VP) = c(S → NP VP) / c(S)
Can extract rules and train probabilities using treebank.
e.g., Penn Treebank (Switchboard, WSJ text).
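Extracting MLE rule probabilities from a treebank can be sketched as a tree walk. The nested-tuple tree representation below is an assumption of this sketch, not the Penn Treebank format:

```python
from collections import Counter

def pcfg_from_treebank(trees):
    """MLE rule probabilities from parsed trees. A tree is a nested tuple
    like ("S", ("NP", ...), ("VP", ...)); leaves are plain strings.
    Returns P(A -> beta) = c(A -> beta) / c(A)."""
    rule_counts, lhs_counts = Counter(), Counter()
    def visit(t):
        if isinstance(t, str):          # leaf: terminal word
            return
        lhs, kids = t[0], t[1:]
        rhs = tuple(k if isinstance(k, str) else k[0] for k in kids)
        rule_counts[(lhs, rhs)] += 1
        lhs_counts[lhs] += 1
        for k in kids:
            visit(k)
    for t in trees:
        visit(t)
    return {r: c / lhs_counts[r[0]] for r, c in rule_counts.items()}
```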
So, Does It Work?
SLM trained on 20MW WSJ; trigram model: 40MW.
[Figure: WER (roughly 12–15%) for SLM, trigram (3g), and interpolated models.]
Recap: Structured Language Modeling
Grammatical language models not yet ready for prime time.
Need manually-parsed data to bootstrap parser.
Training/decoding is expensive, hard to implement.
If have exotic LM and need publishable results . . .
Interpolate with trigram model (“ROVER effect”).
Where Are We?
1 Short-Distance Dependencies: Word Classes
2 Medium-Distance Dependencies: Grammars
3 Long-Distance Dependencies: Adaptation
4 Decoding With Advanced Language Models
5 Discussion
Modeling Long-Distance Dependencies
A group including Phillip C. Friedman , a Gardena , California , investor , raised its stake in Genisco Technology Corporation to seven . five % of the common shares outstanding .
Neither officials of Compton , California - based Genisco , an electronics manufacturer , nor Mr. Friedman could be reached for comment .
In a Securities and Exchange Commission filing , the group said it bought thirty two thousand common shares between August twenty fourth and last Tuesday at four dollars and twenty five cents to five dollars each .
The group might buy more shares , its filing said .
According to the filing , a request by Mr. Friedman to be put on Genisco’s board was rejected by directors .
Mr. Friedman has requested that the board delay Genisco’s decision to sell its headquarters and consolidate several divisions until the decision can be " much more thoroughly examined to determine if it is in the company’s interests , " the filing said .
Modeling Long-Distance Dependencies
Observation: words and phrases in previous sentences . . .
Are more likely to occur in future sentences.
Language model adaptation.
Adapt language model to current style or topic.
P(wi |wi−2wi−1)⇒ Padapt(wi |wi−2wi−1)
P(wi |wi−2wi−1)⇒ P(wi |wi−2wi−1,Hi)
Distribution over utterances P(ω = w1 · · ·wl) . . .⇒ Utterance sequences P(~ω = ω1 · · ·ωL).
Cache Language Models
How to boost probabilities of recently-occurring words?
Idea: build language model on last k = 500 words, say.
How to combine with primary language model?
Pcache(wi | wi−2wi−1, wi−500 . . . wi−1) =
  λ × Pstatic(wi | wi−2wi−1) + (1 − λ) × Pwi−500...wi−1(wi | wi−2wi−1)
Cache language models (Kuhn and De Mori, 1990).
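A minimal cache LM sketch; for simplicity the cache component here is a unigram over the last k words rather than the trigram in the formula above, and `static_prob` stands in for any smoothed static model:

```python
from collections import deque

class CacheLM:
    """Interpolate a static LM with a model built on the last k words."""
    def __init__(self, static_prob, k=500, lam=0.9):
        self.static_prob, self.lam = static_prob, lam
        self.cache = deque(maxlen=k)   # sliding window of recent words

    def prob(self, w, history):
        p_cache = (self.cache.count(w) / len(self.cache)
                   if self.cache else 0.0)
        return self.lam * self.static_prob(w, history) + \
               (1 - self.lam) * p_cache

    def observe(self, w):
        self.cache.append(w)
```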
Beyond Cache Language Models
What’s the problem?
Does seeing THE boost the probability of THE?
Does seeing MATSUI boost the probability of YANKEES?
Can we induce which words trigger which other words?
How might one find trigger pairs?
HENSON MUPPETS
TELESCOPE ASTRONOMERS
CLOTS DISSOLVER
NODES LYMPH
SPINKS HEAVYWEIGHT
DYSTROPHY MUSCULAR
FEEDLOTS FEEDLOT
SCHWEPPES MOTT’S
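One simple way to find candidate trigger pairs is pointwise mutual information over document co-occurrence: words that share documents far more often than chance are mutual trigger candidates. A sketch (published systems used average mutual information over large corpora; the toy documents and threshold here are illustrative):

```python
import math
from collections import Counter

def trigger_pairs(documents, top_n=5):
    """Rank word pairs by PMI of document co-occurrence:
    log( P(a,b) / (P(a) P(b)) ) over documents."""
    n_docs = len(documents)
    df, co = Counter(), Counter()
    for doc in documents:
        words = sorted(set(doc))             # each word once per document
        df.update(words)
        co.update((a, b) for i, a in enumerate(words) for b in words[i+1:])
    def pmi(pair):
        a, b = pair
        return math.log((co[pair] * n_docs) / (df[a] * df[b]))
    scored = [(p, pmi(p)) for p in co if co[p] >= 2]   # prune rare pairs
    return sorted(scored, key=lambda x: -x[1])[:top_n]
```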
Trigger Language Models
How to combine with primary language model?
Linear interpolation with trigger unigram?
Ptrig(wi | wi−2wi−1, wi−500 . . . wi−1) =
  λ × Pstatic(wi | wi−2wi−1) + (1 − λ) × Pwi−500...wi−1(wi)
Another way: maximum entropy models (Lau et al., 1993).
Beyond Trigger Language Models
Some groups of words are mutual triggers.
e.g., IMMUNE, LIVER, TISSUE, TRANSPLANTS, etc.
Difficult to discover all pairwise relations: sparse.
May not want to trigger words based on single event.
Some words are ambiguous.
e.g., LIVER ⇒ TRANSPLANTS or CHICKEN?
⇒ Topic language models.
Example: Seymore and Rosenfeld (1997)
Assign topics to documents.
e.g., politics, medicine, Monica Lewinsky, cooking, etc.
Manual labels (e.g., BN) or unsupervised clustering.
For each topic, build topic-specific LM.
Decoding:
1st pass: use generic LM.
Select topic LM’s maximizing likelihood of 1st pass.
Re-decode using topic LM’s.
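The topic-selection step can be sketched as picking the topic LMs that assign the highest likelihood to the first-pass transcript; the unigram topic models and floor probability below are simplifications of this sketch:

```python
import math

def select_topics(first_pass_words, topic_lms, top_k=2):
    """Rank topic LMs by log-likelihood of the first-pass output and
    keep the top k. topic_lms: {topic: {word: prob}}; unseen words get
    a small floor probability."""
    def loglik(lm):
        return sum(math.log(lm.get(w, 1e-6)) for w in first_pass_words)
    return sorted(topic_lms, key=lambda t: -loglik(topic_lms[t]))[:top_k]
```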
Example: Seymore and Rosenfeld (1997)
Training (transcript); topics: conspiracy; JFK assassination.
THEY WERE RIDING THROUGH DALLAS WITH THE KENNEDYS WHEN THE FAMOUS SHOTS WERE FIRED
HE WAS GRAVELY WOUNDED
HEAR WHAT GOVERNOR AND MRS. JOHN CONNALLY THINK OF THE CONSPIRACY MOVIE J. F. K. . . .
Test (decoded); topics: ???
THE MURDER OF J. F. K. WAS IT A CONSPIRACY
SHOULD SECRET GOVERNMENT FILES BE OPENED TO THE PUBLIC
CAN THE TRAGIC MYSTERY EVER BE SATISFACTORILY RESOLVED . . .
Example: Seymore and Rosenfeld (1997)
Topic LM’s may be sparse.
Combine with general LM.
How to combine selected topic LM’s and general LM?
Linear interpolation!
Ptopic(wi | wi−2wi−1) = λ0 Pgeneral(wi | wi−2wi−1) + Σt=1..T λt Pt(wi | wi−2wi−1)
So, Do Cache Models Work?
Um, -cough-, kind of.
Good PP gains (up to ∼20%).
WER gains: little to none.
e.g., (Iyer and Ostendorf, 1999; Goodman, 2001).
What About Trigger and Topic Models?
Triggers.
Good PP gains (up to ∼30%).
WER gains: unclear; e.g., (Rosenfeld, 1996).
Topic models.
Good PP gains (up to ∼30%).
WER gains: up to 1% absolute.
e.g., (Iyer and Ostendorf, 1999; Goodman, 2001).
Recap: Adaptive Language Modeling
ASR errors can cause adaptation errors.
In lower WER domains, LM adaptation may help more.
Large PP gains, but small WER gains.
What’s the dillio?
Increases system complexity for ASR.
e.g., how to adapt LM scores with static decoding?
Unclear whether worth the effort.
Where Are We?
1 Short-Distance Dependencies: Word Classes
2 Medium-Distance Dependencies: Grammars
3 Long-Distance Dependencies: Adaptation
4 Decoding With Advanced Language Models
5 Discussion
Decoding, Class N-Gram Models
Can we build the one big HMM?
Start with class n-gram model as FSA.
Expand each class to all members.
Handling Complex Language Models
One big HMM: static graph expansion.
Heavily-pruned n-gram language model.
Another approach: dynamic graph expansion.
Don’t store whole graph in memory.
Build parts of graph with active states on the fly.
[Figure: two example graphs: a word-loop graph over the digits ONE, TWO, THREE, FOUR, FIVE, SIX, SEVEN, EIGHT, NINE, ZERO, and a word-to-phone transducer fragment with arcs THE:DH, ε:AH, ε:IY, DOG:D, ε:AO, ε:G.]
55 / 127
Dynamic Graph Expansion: The Basic Idea
Express graph as composition of two smaller graphs.
Composition is associative.

G_decode = L ◦ T_LM→CI ◦ T_CI→CD ◦ T_CD→GMM
         = L ◦ (T_LM→CI ◦ T_CI→CD ◦ T_CD→GMM)

Can do on-the-fly composition.
States in result correspond to state pairs (s1, s2).
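The pair-state idea can be sketched as follows. This is an illustrative toy, not the lecture's implementation: real decoders must also handle ε-labels and arc weights, and the transducers and labels below are invented.

```python
# Illustrative sketch of on-the-fly (lazy) composition: states of the
# result are pairs (s1, s2), expanded only when reached during search.
from collections import deque

def compose_lazily(arcs1, arcs2, start1, start2):
    """arcs1/arcs2: {state: [(in_label, out_label, next_state), ...]}.
    arcs1's output alphabet must match arcs2's input alphabet."""
    expanded = {}                          # (s1, s2) -> composed arcs
    queue = deque([(start1, start2)])
    while queue:
        s1, s2 = queue.popleft()
        if (s1, s2) in expanded:
            continue                       # already built this pair state
        expanded[(s1, s2)] = []
        for ilab, mid, n1 in arcs1.get(s1, []):
            for mid2, olab, n2 in arcs2.get(s2, []):
                if mid == mid2:            # middle labels must agree
                    expanded[(s1, s2)].append((ilab, olab, (n1, n2)))
                    queue.append((n1, n2))
    return expanded

# Toy word-to-phone FST composed with a phone-to-model FST.
words_to_phones = {0: [("THE", "DH", 1)], 1: [("", "AH", 0)]}
phones_to_models = {0: [("DH", "m_DH", 1)], 1: [("AH", "m_AH", 0)]}
graph = compose_lazily(words_to_phones, phones_to_models, 0, 0)
```

Only pair states reachable from the start are ever built, which is what lets a decoder avoid storing the full expanded graph in memory.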
56 / 127
Another Way: Two-Pass Decoding
First-pass decoding: use simpler model . . .
To find "likeliest" word sequences . . .
As lattice (WFSA) or flat list of hypotheses (N-best list).
Rescoring: use complex model . . .
To find best word sequence in lattice/list.
57 / 127
Lattice Generation and Rescoring
[Figure: word lattice with competing arcs THE/THIS/THUD, then DIG/DOG/DOGGY, then ATE/EIGHT, then MAY/MY.]
In Viterbi, store k-best tracebacks at each word-end cell.
To add in new LM scores to lattice . . .
What operation can we use?
58 / 127
N-Best List Rescoring
For exotic models, even lattice rescoring may be too slow.
Easy to generate N-best lists from lattices (A∗ algorithm).

THE DOG ATE MY
THE DIG ATE MY
THE DOG EIGHT MAY
THE DOGGY MAY
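N-best rescoring itself is simple; a hedged sketch with invented toy scores (the scoring function and weights are stand-ins, not from the lecture):

```python
# Illustrative N-best rescoring: swap in a stronger LM's score for each
# hypothesis and re-rank.
def rescore_nbest(nbest, new_lm_logprob, lm_weight=10.0):
    """nbest: list of (words, acoustic_logprob, firstpass_lm_logprob)."""
    rescored = []
    for words, ac_lp, _old_lm_lp in nbest:
        total = ac_lp + lm_weight * new_lm_logprob(words)  # replace LM score
        rescored.append((total, words))
    rescored.sort(reverse=True)            # highest combined score first
    return rescored[0][1]

# Toy "complex" LM that prefers ATE over EIGHT in this context.
def toy_lm(words):
    return -1.0 if "ATE" in words else -5.0

nbest = [(["THE", "DOG", "EIGHT", "MAY"], -100.0, -3.0),
         (["THE", "DOG", "ATE", "MY"], -101.0, -4.0)]
best = rescore_nbest(nbest, toy_lm)
```

The complex model only needs to score a handful of full word sequences, which is why even very slow models are usable in this setup.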
59 / 127
Discussion: A Tale of Two Decoding Styles
Approach 1: Dynamic graph expansion (since late 1980's).
Can handle more complex language models.
Decoders are incredibly complex beasts.
e.g., cross-word CD expansion without FST's.
Graph optimization difficult.
Approach 2: Static graph expansion (AT&T, late 1990's).
Enabled by optimization algorithms for WFSM's.
Much cleaner way of looking at everything!
FSM toolkits/libraries can do a lot of work for you.
Static graph expansion is complex, but offline!
Decoding is relatively simple.
60 / 127
Static or Dynamic? Two-Pass?
If speed is priority?
If flexibility is priority?
e.g., update LM vocabulary every night.
If need gigantic language model?
If latency is priority?
What can't we use?
If accuracy is priority (all the time in the world)?
If doing cutting-edge research?
61 / 127
Where Are We?
1 Short-Distance Dependencies: Word Classes
2 Medium-Distance Dependencies: Grammars
3 Long-Distance Dependencies: Adaptation
4 Decoding With Advanced Language Models
5 Discussion
62 / 127
Recap
Short-distance dependencies.
Interpolate class n-gram with word n-gram.
<1% absolute WER gain; pain to implement?
Medium-distance dependencies.
Interpolate grammatical LM with word n-gram.
<1% absolute WER gain; pain to implement.
Long-distance dependencies.
Interpolate adaptive LM with static n-gram.
<1% absolute WER gain; pain to implement.
PP ≠ WER.
63 / 127
Turning It Up To Eleven (Goodman, 2001)
If short, medium, and long-distance modeling . . .
All achieve ∼1% WER gain . . .
What if combine them all with linear interpolation?
"A Bit of Progress in Language Modeling".
Combined higher order n-grams, skip n-grams, . . .
Class n-grams, cache models, sentence mixtures.
Achieved 50% reduction in PP over word trigram.
⇒ ∼1% WER gain (WSJ N-best list rescoring).
64 / 127
State of the Art Circa 2005
Commercial systems.
Word n-gram models.
Research systems, e.g., government evaluations.
No time limits; tiny differences in WER matter.
Interpolation of word 4-gram models.
Why aren't people using ideas from LM research?
Too slow (1st pass decoding; rescoring?)
Gains not reproducible with largest data sets.
65 / 127
Time To Give Up?
. . . we argue that meaningful, practical reductions in word error rate are hopeless. We point out that trigrams remain the de facto standard not because we don't know how to beat them, but because no improvements justify the cost.
— Joshua Goodman (2001)
66 / 127
References
P.F. Brown, V.J. Della Pietra, P.V. deSouza, J.C. Lai, R.L. Mercer, "Class-based n-gram models of natural language", Computational Linguistics, vol. 18, no. 4, pp. 467–479, 1992.

C. Chelba and F. Jelinek, "Structured language modeling", Computer Speech and Language, vol. 14, pp. 283–332, 2000.

J.T. Goodman, "A Bit of Progress in Language Modeling", Microsoft Research technical report MSR-TR-2001-72, 2001.

R. Iyer and M. Ostendorf, "Modeling Long Distance Dependence in Language: Topic Mixtures vs. Dynamic Cache Models", IEEE Transactions on Speech and Audio Processing, vol. 7, no. 1, pp. 30–39, 1999.
67 / 127
References (cont’d)
R. Kuhn and R. De Mori, "A Cache-Based Natural Language Model for Speech Recognition", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 12, no. 6, pp. 570–583, 1990.

R. Lau, R. Rosenfeld, S. Roukos, "Trigger-based Language Models: A Maximum Entropy Approach", ICASSP, vol. 2, pp. 45–48, 1993.

R. Rosenfeld, "A Maximum Entropy Approach to Adaptive Statistical Language Modeling", Computer Speech and Language, vol. 10, pp. 187–228, 1996.

K. Seymore and R. Rosenfeld, "Using Story Topics for Language Model Adaptation", Eurospeech, 1997.
68 / 127
Part II
Language Modeling, Post-2005-ish
69 / 127
What Up?
Humans use short, medium, and long-distance info.
Sources of info seem complementary.
Yet, linear interpolation fails to yield cumulative gains.
View: interpolation ⇔ averaging; dilutes each model.
Maybe instead of hammer, need screwdriver?
70 / 127
Is There Another Way?
Can we combine multiple information sources . . .
Such that resulting language model . . .
Enforces each information source fully!?
71 / 127
Where Are We?
1 Introduction to Maximum Entropy Modeling
2 N-Gram Models and Smoothing, Revisited
3 Maximum Entropy Models, Part III
4 Neural Net Language Models
5 Discussion
72 / 127
Many Models Or One?
Old view.
Build one model for each information source.
Interpolate.
New view.
Build one model.
73 / 127
Types of Information To Combine
Word n-gram.
Class n-gram.
Grammatical information.
Cache, triggers.
Topic information.
Is there a common way to express all this?
74 / 127
The Basic Intuition
Say we have 1M utterances of training data D.

FEDERAL HOME LOAN MORTGAGE CORPORATION -DASH ONE .POINT FIVE BILLION DOLLARS OF REAL ESTATE MORTGAGE -HYPHEN INVESTMENT CONDUIT SECURITIES OFFERED BY MERRILL LYNCH &AMPERSAND COMPANY .PERIOD

NONCOMPETITIVE TENDERS MUST BE RECEIVED BY NOON EASTERN TIME THURSDAY AT THE TREASURY OR AT FEDERAL RESERVE BANKS OR BRANCHES .PERIOD . . .

Train LM P(ω) on D; generate 1M utterances D′.
If THE occurs 1.062 × 10^6 times in D . . .
How many times should it occur in D′?
75 / 127
Marginals
Frequency of THE in D and D′ should match:

c_D(THE) = c_D′(THE)
         = N Σ_{h,w : hw=...THE} P(h,w)
         = Σ_{h,w : hw=...THE} c_D(h) P(w|h)

Bigram: hw = . . . OF THE.
Class bigram: hw ends in classes [FOOD] [PREP].
Grammar: hw is prefix of grammatical sentence.
Trigger: HENSON in last 500 words of h and w = MUPPETS.
76 / 127
Marginal Constraints
Binary feature functions, e.g., hw = . . . THE:

f_THE(h,w) = { 1 if hw = . . . THE
             { 0 otherwise

c_D(THE) = Σ_{h,w : hw=...THE} c_D(h) P(w|h)
         = Σ_{h,w} c_D(h) P(w|h) f_THE(h,w)

Select model P(w|h) satisfying all marginals.
Which one!?
77 / 127
Maximum Entropy Principle (Jaynes, 1957)
The entropy H(P) of P(w|h) is

H(P) = −Σ_{h,w} P(h,w) log P(w|h)

Entropy ⇔ uniformness ⇔ least assumptions.
Of models satisfying constraints . . .
Pick one with highest entropy!
Capture constraints; assume nothing more!
78 / 127
Can We Find the Maximum Entropy Model?
Features f_1(x,y), . . . , f_F(x,y); parameters Λ = {λ_1, . . . , λ_F}.
ME model satisfying associated constraints has form:

P_Λ(w|h) = (1 / Z_Λ(h)) Π_{i : f_i(h,w)=1} e^{λ_i}

log P_Λ(w|h) = Σ_{i=1}^F λ_i f_i(h,w) − log Z_Λ(h)

Z_Λ(h) = normalizer = Σ_w exp(Σ_{i=1}^F λ_i f_i(h,w)).
a.k.a. exponential model, log-linear model.
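A minimal sketch of this model form, with invented toy features and λ values (real feature sets are n-gram-style marginal constraints as above):

```python
# Minimal exponential (log-linear) model sketch with toy features/weights.
import math

vocab = ["DOG", "CAT", "RAN"]
lambdas = {"bigram:THE_DOG": 1.0, "unigram:CAT": 0.5}

def features(h, w):
    return ["bigram:%s_%s" % (h, w), "unigram:%s" % w]

def p_lambda(word, history):
    def score(w):          # product of e^{lambda_i} over active features
        return math.exp(sum(lambdas.get(f, 0.0) for f in features(history, w)))
    z = sum(score(w) for w in vocab)       # normalizer Z_Lambda(h)
    return score(word) / z

p = p_lambda("DOG", "THE")                 # boosted by the bigram feature
```

Each active feature multiplies the unnormalized score by e^{λ_i}; the normalizer Z_Λ(h) sums those scores over the whole vocabulary, which is exactly the expensive part discussed later.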
79 / 127
How to Find the λi ’s?
P_Λ(w|h) = (1 / Z_Λ(h)) Π_{i : f_i(h,w)=1} e^{λ_i}

{λ_i}'s satisfying constraints are MLE's!
Training set likelihood is convex function of {λ_i}!
80 / 127
Recap: Maximum Entropy Modeling
Elegant as all hell.
Principled way to combine lots of information sources.
Design choice: which constraints to enforce?
Single global optimum when training parameters.
Interpolation: addition; ME: multiplication.
But does it blend?
81 / 127
Where Are We?
1 Introduction to Maximum Entropy Modeling
2 N-Gram Models and Smoothing, Revisited
3 Maximum Entropy Models, Part III
4 Neural Net Language Models
5 Discussion
82 / 127
Maximum Entropy N-gram Models?
Can ME help build a better word n-gram model?
One constraint per seen n-gram.

f_I LIKE BIG(h,w) = { 1 if hw = . . . I LIKE BIG
                    { 0 otherwise

Problem: MLE model is same as before!!!
83 / 127
Smoothing for Exponential Models
Point: don’t want to match training counts exactly!e.g., `2
2 regularization (e.g., Chen and Rosenfeld, 2000).
obj fn = LLtrain +1
(# train wds)
F∑i=1
λ2i
2σ2
The smaller |λi | is, the smaller its effect . . .And the smoother the model.
84 / 127
Smoothing for Exponential Models (WSJ 4g)
[Figure: WER (20–36%) vs. training set size (20kw, 200kw, 2MW, 20MW); curves: Katz, exp n-gram.]
85 / 127
Yay!?
Smoothed exponential n-gram models perform well.
Why don't people use them?
Conventional n-gram: count and normalize.
Exponential n-gram: 50 rounds of iterative scaling.
Is there a way to do constraint-based modeling . . .
Within conventional n-gram framework?
86 / 127
Kneser-Ney Smoothing (1995)
Back-off smoothing.

P_KN(w_i|w_{i−1}) = { P_primary(w_i|w_{i−1})  if c(w_{i−1} w_i) > 0
                    { α_{w_{i−1}} P_KN(w_i)   otherwise

P_KN(w_i) chosen such that . . .
Unigram constraints met exactly.
87 / 127
Kneser-Ney Smoothing
Unigram probabilities P_KN(w_i) . . .
Not proportional to how often unigram occurs.

P_KN(w_i) ≠ c(w_i) / Σ_{w_i} c(w_i)

Proportional to how many word types unigram follows!

N_{1+}(•w_i) ≡ |{w_{i−1} : c(w_{i−1} w_i) > 0}|

P_KN(w_i) = N_{1+}(•w_i) / Σ_{w_i} N_{1+}(•w_i)
† Check out "A Hierarchical Bayesian Language Model based on Pitman-Yor Processes", Teh, 2006, for a cool Bayesian interpretation and learn about Chinese restaurant processes.
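The continuation-count idea can be sketched on a toy corpus (illustrative data, not from the lecture; FRANCISCO is the classic example of a frequent word with few left contexts):

```python
# Toy Kneser-Ney continuation counts: the unigram probability is
# proportional to how many distinct words precede w, not to c(w).
from collections import defaultdict

bigrams = [("SAN", "FRANCISCO"), ("SAN", "FRANCISCO"), ("SAN", "FRANCISCO"),
           ("THE", "DOG"), ("A", "DOG"), ("MY", "DOG")]

preceders = defaultdict(set)
for w_prev, w in bigrams:
    preceders[w].add(w_prev)               # distinct left contexts of w

n1plus = {w: len(s) for w, s in preceders.items()}   # N_1+(. w)
total = sum(n1plus.values())
p_kn = {w: c / total for w, c in n1plus.items()}
```

DOG and FRANCISCO each occur three times, yet p_kn gives DOG 0.75 and FRANCISCO only 0.25: a backed-off FRANCISCO is unlikely outside the context SAN, which is exactly what the unigram should reflect.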
88 / 127
Kneser-Ney Smoothing
[Figure: WER (20–36%) vs. training set size (20kw, 200kw, 2MW, 20MW); curves: Katz, exp n-gram, modified Kneser-Ney.]
89 / 127
Recap: N-Gram Models and Smoothing
Best n-gram smoothing methods are all constraint-based.
Can express smoothed n-gram models as . . .
Exponential models with simple ℓ2² smoothing.
"Modified" interpolated Kneser-Ney smoothing† . . .
Yields similar model, but much faster training.
Standard in literature for last 15+ years.
Available in SRI LM toolkit.
http://www.speech.sri.com/projects/srilm/

†(Chen and Goodman, 1998).

90 / 127
Where Are We?
1 Introduction to Maximum Entropy Modeling
2 N-Gram Models and Smoothing, Revisited
3 Maximum Entropy Models, Part III
4 Neural Net Language Models
5 Discussion
91 / 127
What About Other Features?
Exponential models make slightly better n-gram models.
Snore.
Can we just toss in tons of cool features . . .
And get fabulous results?
92 / 127
Maybe!? (Rosenfeld, 1996)
38M words of WSJ training data.
Trained maximum entropy model with . . .
Word n-gram; skip n-gram; trigger features.
Interpolated with regular word n-gram and cache.
39% reduction in PP, 2% absolute reduction in WER.
Baseline: (pruned) Katz-smoothed(?) trigram model.
Contrast: Goodman (2001), -50% PP, -0.9% WER.
93 / 127
What’s the Catch?
200 computer-days to train.
Really slow training.
For each word, update O(|V|) counts.
Tens of passes through training data.
Really slow evaluation: evaluating Z_Λ(h).

P_Λ(w|h) = (1 / Z_Λ(h)) Π_{i : f_i(h,w)=1} e^{λ_i}

Z_Λ(h) = Σ_{w′} exp(Σ_{i=1}^F λ_i f_i(h,w′))
94 / 127
Newer Developments
Fast training: optimizations for simple feature sets.
e.g., train word n-gram model on 1GW in few hours.
Fast evaluation: unnormalized models.
Not much slower than regular word n-gram.

P_Λ(w|h) = Π_{i : f_i(h,w)=1} e^{λ_i}

Performance prediction.
How to intelligently select feature types.
95 / 127
Performance Prediction (Chen, 2008)
Given training set and test set from same distribution.
Desire: want to optimize performance on test set.
Reality: only have access to training set.
(test perf) = (training perf) + (overfitting penalty)
Can we estimate overfitting penalty?
96 / 127
Yes
log PP_test − log PP_train ≈ (0.938 / (# train wds)) Σ_{i=1}^F |λ_i|

[Figure: scatter plot of predicted vs. actual log PP_test − log PP_train (range 0–6); points lie near the diagonal.]

97 / 127
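The prediction rule itself is a one-liner; a sketch with invented λ values (the 0.938 constant is the slide's empirical coefficient):

```python
# Predict test log PP from training log PP plus the overfitting penalty
# 0.938 * sum_i |lambda_i| / (# training words).
def predicted_log_pp_test(log_pp_train, lambdas, n_train_words):
    return log_pp_train + 0.938 * sum(abs(l) for l in lambdas) / n_train_words

pred = predicted_log_pp_test(4.0, [1.0, -0.5, 2.5], n_train_words=10)
```

Because the penalty depends only on Σ|λ_i| and the training set size, one can compare candidate feature sets without touching the test set.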
A Tool for Good
Holds for many different types of data.
Different domains; languages; token types; . . .
Vocab sizes; training set sizes; n-gram orders.
Holds for many different types of exponential models.
Explains lots of diverse aspects of language modeling.
Can choose feature types . . .
To intentionally shrink Σ_{i=1}^F |λ_i|.
98 / 127
Model M (Chen, 2008; Chen and Chu, 2010)
Old-timey class-based model (Brown, 1992).
Class prediction features: c_{i−2} c_{i−1} c_i.
Word prediction features: c_i w_i.

P(w_i|w_{i−2} w_{i−1}) = P(c_i|c_{i−2} c_{i−1}) × P(w_i|c_i)

Start from word n-gram model; convert to class model . . .
And choose feature types to reduce overfitting.
Class prediction features: c_{i−2} c_{i−1} c_i, w_{i−2} w_{i−1} c_i.
Word prediction features: w_{i−2} w_{i−1} c_i w_i.
Without interpolation with word n-gram model.
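The class decomposition can be sketched with a bigram class history for brevity (hand-picked toy classes and probabilities, not a trained Model M):

```python
# Toy class-based decomposition: P(w | history) = P(class | class history)
#                                               * P(w | class).
classes = {"DOG": "[ANIMAL]", "CAT": "[ANIMAL]", "RAN": "[VERB]"}
p_class = {("[ANIMAL]", "[VERB]"): 0.7}            # P(c_i | c_{i-1})
p_word_given_class = {("RAN", "[VERB]"): 0.4}      # P(w_i | c_i)

def p_word(word, prev_word):
    c_prev, c = classes[prev_word], classes[word]
    return p_class.get((c_prev, c), 0.0) * p_word_given_class.get((word, c), 0.0)

p = p_word("RAN", "DOG")                           # 0.7 * 0.4
```

Sharing the class-level distribution across all words in a class is what lets the model generalize to unseen word pairs like CAT RAN.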
99 / 127
Model M (WSJ)
[Figure: WER (20–36%) vs. training set size (20kw, 200kw, 2MW, 20MW) on WSJ; curves: Katz 3g, model M 4g.]
100 / 127
Recap: Maximum Entropy
Some of best WER results in LM literature.
Gain of up to 3% absolute WER over trigram (not <1%).
Short-range dependencies only.
Can surpass linear interpolation in WER in many contexts.
Log-linear interpolation.
Each is appropriate in different situations. (When?)
Together, powerful tool set for model combination.
Performance prediction explains existing models . . .
And helps design new ones!
Training can be painful depending on features.
101 / 127
Where Are We?
1 Introduction to Maximum Entropy Modeling
2 N-Gram Models and Smoothing, Revisited
3 Maximum Entropy Models, Part III
4 Neural Net Language Models
5 Discussion
102 / 127
Introduction
Ways to combine information sources.
Linear interpolation.
Exponential/log-linear models.
Anything else?
Recently, good results with neural networks.
103 / 127
What is the Brain Like?
Layered; mostly "forward" connections.
Each neuron has firing rate.
104 / 127
What is a Basic Neural Network Like?
Layered; only "forward" connections; fully connected.
Each neuron has activation.
105 / 127
How Are Activations Computed?
Linear function of previous layer . . .
Then, non-linearity ⇒ [0,1] (maybe).
e.g., sigmoid: g(x) = 1 / (1 + e^{−x}).
Saturation; universal function approximator.

x_i^{l+1} = g(Σ_{j=1}^{N_l} w_{ij}^l x_j^l)
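A minimal sketch of this layer computation, with illustrative weights:

```python
# One layer's forward pass: linear combination of the previous layer's
# activations, followed by a sigmoid non-linearity.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer_forward(x_prev, weights):
    """weights[i][j] corresponds to w_ij; returns next-layer activations."""
    return [sigmoid(sum(w_ij * x_j for w_ij, x_j in zip(row, x_prev)))
            for row in weights]

out = layer_forward([1.0, 0.0], [[2.0, -1.0], [0.0, 0.0]])
```

Each output unit computes its own weighted sum of all previous-layer activations, then squashes it into (0, 1).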
106 / 127
Encoding Inputs and Outputs
e.g., how to model P(w_i|w_{i−1})?
1 of V coding: V nodes, one on and the rest off.
107 / 127
Probability Estimation
How to make final layer activations act like probs?
Use softmax function.

p_i = e^{x_i^L} / Σ_{j=1}^{N_L} e^{x_j^L}

Use log likelihood as training objective function.
a.k.a. cross-entropy.
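A sketch of the softmax output layer and the log-likelihood criterion, using toy final-layer scores:

```python
# Softmax turns arbitrary final-layer scores into a probability
# distribution; training maximizes the log prob of the correct word.
import math

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

probs = softmax([2.0, 1.0, 0.0])
target = 0                                # index of the correct next word
log_likelihood = math.log(probs[target])  # maximized during training
```

Note the structural similarity to the maximum entropy normalizer Z_Λ(h): softmax is the same exponentiate-and-normalize step.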
108 / 127
What’s the Big Deal?
Multi-layer models are better than single-layer models.
Can actually train them scalably!
Backpropagation ⇒ gradient descent.
Unlike graphical models.
Computers are fast enough now, e.g., GPU’s.
109 / 127
Perspective: Exponential Models
Last layer: log-linear + softmax ⇒ maxent!
Like exponential/ME model, but learn features!
110 / 127
NNLM’s 1.0 (Schwenk and Gauvain, 2005)
time consuming and the algorithms used with several tens of millions examples may be impracticable for larger amounts. Training back-off LMs on large amounts of data is not a problem, as long as powerful machines with enough memory are available in order to calculate the word statistics. Practice has also shown that back-off LMs seem to perform very well when large amounts of training data are available and it is not clear that the above mentioned new approaches are still of benefit in this situation.

In this paper we compare the neural network language model to n-gram models with modified Kneser-Ney smoothing using LM training corpora of up to 600M words. New algorithms are presented to effectively train the neural network on such amounts of data and the necessary capacity is analyzed. The LMs are evaluated in a real-time state-of-the-art speech recognizer for French Broadcast News. Word error reductions of up to 0.5% absolute are reported.
2 Architecture of the neural network LM
The basic idea of the neural network LM is to project the word indices onto a continuous space and to use a probability estimator operating on this space (Bengio and Ducharme, 2001; Bengio et al., 2003). Since the resulting probability functions are smooth functions of the word representation, better generalization to unknown n-grams can be expected. A neural network can be used to simultaneously learn the projection of the words onto the continuous space and to estimate the n-gram probabilities. This is still an n-gram approach, but the LM posterior probabilities are "interpolated" for any possible context of length n−1 instead of backing off to shorter contexts.

The architecture of the neural network n-gram LM is shown in Figure 1. A standard fully-connected multi-layer perceptron is used. The inputs to the neural network are the indices of the n−1 previous words in the vocabulary, hj = wj−n+1, ..., wj−2, wj−1, and the outputs are the posterior probabilities of all words of the vocabulary:

P(wj = i | hj) \quad \forall i \in [1, N] \qquad (1)

where N is the size of the vocabulary. The input uses the so-called 1-of-n coding, i.e., the i-th word of the vocabulary is coded by setting the i-th element of the vector to 1 and all the other elements to
[Figure 1: Architecture of the neural network language model. hj denotes the context wj−n+1, ..., wj−1. P is the size of one projection, and H and N are the sizes of the hidden and output layer respectively. When shortlists are used, the size of the output layer is much smaller than the size of the vocabulary.]
0. The i-th line of the N × P dimensional projection matrix corresponds to the continuous representation of the i-th word. Let us denote ck these projections, dj the hidden layer activities, oi the outputs, pi their softmax normalization, and mjl, bj, vij and ki the hidden and output layer weights and the corresponding biases. Using these notations the neural network performs the following operations:
d_j = \tanh\Bigl( \sum_{l} m_{jl}\, c_l + b_j \Bigr) \qquad (2)

o_i = \sum_{j} v_{ij}\, d_j + k_i \qquad (3)

p_i = e^{o_i} \Bigm/ \sum_{k=1}^{N} e^{o_k} \qquad (4)
The value of the output neuron p_i corresponds directly to the probability P(w_j = i | h_j). Training is performed with the standard back-propagation algorithm minimizing the following error function:
E = \sum_{i=1}^{N} t_i \log p_i + \beta \Bigl( \sum_{jl} m_{jl}^2 + \sum_{ij} v_{ij}^2 \Bigr) \qquad (5)
where t_i denotes the desired output, i.e., the probability should be 1.0 for the next word in the training
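The forward pass in equations (2)–(4) can be sketched directly in NumPy. This is a minimal sketch, not the paper's implementation: the sizes, random weights, and function name are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (not the paper's actual configuration):
N, P, H = 1000, 50, 200   # vocabulary, projection, and hidden-layer sizes
n = 4                     # n-gram order, so the history has n-1 = 3 words

C = rng.normal(0.0, 0.1, size=(N, P))            # shared projection matrix; row i = word i
M = rng.normal(0.0, 0.1, size=(H, (n - 1) * P))  # hidden-layer weights m_jl
b = np.zeros(H)                                  # hidden biases b_j
V = rng.normal(0.0, 0.1, size=(N, H))            # output weights v_ij
k = np.zeros(N)                                  # output biases k_i

def nnlm_probs(history):
    """Forward pass for one (n-1)-word history of word indices."""
    # 1-of-n coding times the projection matrix is just a row lookup,
    # so concatenate the P-dimensional projections c_k of the history words.
    c = np.concatenate([C[w] for w in history])
    d = np.tanh(M @ c + b)        # eq. (2): hidden layer
    o = V @ d + k                 # eq. (3): output layer
    e = np.exp(o - o.max())       # eq. (4): softmax, shifted for numerical stability
    return e / e.sum()

p = nnlm_probs([3, 17, 42])
print(p.shape, abs(p.sum() - 1.0) < 1e-9)  # (1000,) True
```

Note that the one-hot input never needs to be materialized: multiplying a 1-of-n vector by the N × P projection matrix selects a single row, which is why the projection layer is cheap even for large vocabularies.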
111 / 127
Recurrent NN LM’s (Mikolov et al., 2010)
EXTENSIONS OF RECURRENT NEURAL NETWORK LANGUAGE MODEL
Tomas Mikolov1,2, Stefan Kombrink1, Lukas Burget1, Jan “Honza” Cernocky1, Sanjeev Khudanpur2
1Brno University of Technology, Speech@FIT, Czech Republic
2Department of Electrical and Computer Engineering, Johns Hopkins University, USA
{imikolov,kombrink,burget,cernocky}@fit.vutbr.cz, [email protected]
ABSTRACT

We present several modifications of the original recurrent neural network language model (RNN LM). While this model has been shown to significantly outperform many competitive language modeling techniques in terms of accuracy, the remaining problem is the computational complexity. In this work, we show approaches that lead to more than 15 times speedup for both training and testing phases. Next, we show importance of using a backpropagation through time algorithm. An empirical comparison with feedforward networks is also provided. In the end, we discuss possibilities how to reduce the amount of parameters in the model. The resulting RNN model can thus be smaller, faster both during training and testing, and more accurate than the basic one.
Index Terms — language modeling, recurrent neural networks, speech recognition
1. INTRODUCTION
Statistical models of natural language are a key part of many systems today. The most widely known applications are automatic speech recognition (ASR), machine translation (MT) and optical character recognition (OCR). In the past, there was always a struggle between those who follow the statistical way, and those who claim that we need to adopt linguistics and expert knowledge to build models of natural language. The most serious criticism of statistical approaches is that there is no true understanding occurring in these models, which are typically limited by the Markov assumption and are represented by n-gram models. Prediction of the next word is often conditioned just on two preceding words, which is clearly insufficient to capture semantics. On the other hand, the criticism of linguistic approaches was even more straightforward: despite all the efforts of linguists, statistical approaches were dominating when performance in real world applications was a measure.

Thus, there has been a lot of research effort in the field of statistical language modeling. Among models of natural language, neural network based models seemed to outperform most of the competition [1] [2], and were also showing steady improvements in state of the art speech recognition systems [3]. The main power of neural network based language models seems to be in their simplicity: almost the same model can be used for prediction of many types of signals, not just language. These models perform implicitly clustering of words in low-dimensional space. Prediction based on this compact representation of words is then more robust. No additional smoothing of probabilities is required.
Fig. 1. Simple recurrent neural network.
Among many following modifications of the original model, the recurrent neural network based language model [4] provides further generalization: instead of considering just several preceding words, neurons with input from recurrent connections are assumed to represent short term memory. The model learns itself from the data how to represent memory. While shallow feedforward neural networks (those with just one hidden layer) can only cluster similar words, recurrent neural network (which can be considered as a deep architecture [5]) can perform clustering of similar histories. This allows for instance efficient representation of patterns with variable length.

In this work, we show the importance of the Backpropagation through time algorithm for learning appropriate short term memory. Then we show how to further improve the original RNN LM by decreasing its computational complexity. In the end, we briefly discuss possibilities of reducing the size of the resulting model.
2. MODEL DESCRIPTION
The recurrent neural network described in [4] is also called Elman network [6]. Its architecture is shown in Figure 1. The vector x(t) is formed by concatenating the vector w(t) that represents the current word while using 1-of-N coding (thus its size is equal to the size of the vocabulary) and vector s(t − 1) that represents output values in the hidden layer from the previous time step. The network is trained by using the standard backpropagation and contains input, hidden and output layers. Values in these layers are computed as follows:
x(t) = \bigl[\, w(t)^T \;\; s(t-1)^T \,\bigr]^T \qquad (1)

s_j(t) = f\Bigl( \sum_{i} x_i(t)\, u_{ji} \Bigr) \qquad (2)

y_k(t) = g\Bigl( \sum_{j} s_j(t)\, v_{kj} \Bigr) \qquad (3)
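One time step of the Elman network in equations (1)–(3) can be sketched as follows. The sizes and random weights are made-up illustrations, and the choice of f as a sigmoid and g as a softmax is the usual one for RNN LMs (the excerpt itself does not fix them here).

```python
import numpy as np

rng = np.random.default_rng(1)

N, Hdim = 1000, 100   # vocabulary and hidden-layer sizes (illustrative)
U = rng.normal(0.0, 0.1, size=(Hdim, N + Hdim))  # input weights u_ji acting on x(t)
V = rng.normal(0.0, 0.1, size=(N, Hdim))         # output weights v_kj

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_step(word_index, s_prev):
    w = np.zeros(N)
    w[word_index] = 1.0              # 1-of-N coding of the current word w(t)
    x = np.concatenate([w, s_prev])  # eq. (1): x(t) = [w(t)^T  s(t-1)^T]^T
    s = sigmoid(U @ x)               # eq. (2), taking f to be a sigmoid
    y = softmax(V @ s)               # eq. (3), taking g to be a softmax
    return s, y

s = np.zeros(Hdim)                   # empty short-term memory at t = 0
for w in [5, 12, 7]:                 # feed a short word-index sequence
    s, y = rnn_step(w, s)            # y is the next-word distribution after each word
```

The key difference from the feedforward NNLM is visible in eq. (1): the hidden state s(t − 1) is fed back as input, so the history is unbounded rather than fixed at n − 1 words.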
112 / 127
Results (Mikolov et al., 2011)

Table 6: Comparison of advanced language modeling techniques on the WSJ task with all training data.
Model                    Dev WER [%]   Eval WER [%]
Baseline - KN5           12.2          17.2
Discriminative LM [14]   11.5          16.9
Joint LM [7]             -             16.7
Static RNN               10.5          14.9
Static RNN + KN          10.2          14.6
Adapted RNN              9.8           14.5
Adapted RNN + KN         9.8           14.5
All RNN                  9.7           14.4
5. Conclusion and future work

On the Penn Treebank, we achieved a new state of the art result by using a combination of many advanced language modeling techniques, surpassing previous state of the art by a large margin - the obtained perplexity 79.4 is significantly better than 96 reported in our previous work [5]. Compared to the perplexity of Good-Turing smoothed trigram that is 165.2 on this setup, we have achieved 52% reduction of perplexity, and 14.3% reduction of entropy. Compared to the 5-gram with modified Kneser-Ney smoothing that has perplexity 141.2, we obtained 44% reduction of perplexity and 11.6% reduction of entropy.
On the WSJ task, we have shown that the possible improvements actually increase with more training data. Although we have used just two RNNLMs that were trained on all data, we observed similar gains as on the previous setup. Against Good-Turing smoothed trigram that has perplexity 246, our final result 108 is by more than 56% lower (entropy reduction 15.0%). The 5-gram with modified Kneser-Ney smoothing has on this task perplexity 212, thus our combined result is by 49% lower (entropy reduction 12.6%).
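The perplexity and entropy reductions quoted above follow from PP = 2^H (so H = log PP, with the base cancelling in the ratio): relative entropy reduction is 1 − log(PP_new)/log(PP_old). A quick check of the reported numbers:

```python
import math

def ppl_reduction(old, new):
    """Relative perplexity reduction."""
    return 1 - new / old

def entropy_reduction(old, new):
    """Relative entropy reduction; the log base cancels, so any base works."""
    return 1 - math.log(new) / math.log(old)

# Penn Treebank: GT trigram 165.2, KN5 141.2, combination 79.4
print(round(100 * ppl_reduction(165.2, 79.4)))         # 52
print(round(100 * entropy_reduction(165.2, 79.4), 1))  # 14.3
print(round(100 * ppl_reduction(141.2, 79.4)))         # 44
print(round(100 * entropy_reduction(141.2, 79.4), 1))  # 11.6

# WSJ: GT trigram 246, KN5 212, combination 108
print(round(100 * entropy_reduction(246, 108), 1))     # 15.0
print(round(100 * entropy_reduction(212, 108), 1))     # 12.6
```

These match the figures in the excerpt, and illustrate why large perplexity reductions correspond to much smaller entropy (and, roughly, WER) reductions.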
As far as we know, our work is the first attempt to combine many advanced language modeling techniques after the work done by Goodman [1], as usually combination of only two or three techniques is reported. We have found that many techniques are actually redundant and do not contribute significantly to the final combination - it seems that by using Recurrent neural network based language models and a standard n-gram model, we can obtain near-optimal results. However, this should not be interpreted as that further work on other techniques is useless. We are aware of several possibilities how to make better use of individual models - it was reported that log-linear interpolation of models [15] outperforms in some cases significantly the basic linear interpolation. While we have not seen any significant gains when we combined log-linearly individual RNNLMs, for combination of different techniques, this might be an interesting extension of our work in the future. However, it should be noted that log-linear interpolation is computationally very expensive.

As the final combination is dominated by the RNNLM, we believe that future work should focus on its further extension. We observed that combination of different RNNLMs works better than any individual RNNLM. Even if we combine models that are individually suboptimal, as was the case when we used large learning rate during adaptation, we observe further improvements. This points us towards investigating Bayesian neural networks, that consider all possible parameters and hyperparameters. We actually assume that combination of RNNLMs behaves as a crude approximation of a Bayesian neural network.
6. Acknowledgements

We would like to thank all who have helped us to reach the described results, directly by sharing their results with us or indirectly by providing freely available toolkits3: Ahmad Emami (Random Clusterings and Structured NNLM), Hai Son Le and Ilya Oparin (Feedforward NNLM and Log-bilinear model), Yi Su (Random forest LM), Denis Filimonov (Structured LM), Puyang Xu (ME model), Dietrich Klakow (Within and Across Sentence Boundary LM), Tanel Alumae (ME model), Andreas Stolcke (SRILM).
7. References

[1] Joshua T. Goodman (2001). A bit of progress in language modeling, extended version. Technical report MSR-TR-2001-72.
[2] Yoshua Bengio, Rejean Ducharme and Pascal Vincent. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137-1155.
[3] H. Schwenk. Continuous space language models. Computer Speech and Language, vol. 21, 2007.
[4] Tomas Mikolov, Martin Karafiat, Lukas Burget, Jan Cernocky, Sanjeev Khudanpur: Recurrent neural network based language model. In: Proc. INTERSPEECH 2010.
[5] Tomas Mikolov, Stefan Kombrink, Lukas Burget, Jan Cernocky, Sanjeev Khudanpur: Extensions of recurrent neural network language model. In: Proc. ICASSP 2011.
[6] Peng Xu. Random forests and the data sparseness problem in language modeling. Ph.D. thesis, Johns Hopkins University, 2005.
[7] Denis Filimonov and Mary Harper. A joint language model with fine-grain syntactic tags. EMNLP 2009.
[8] Ahmad Emami, Frederick Jelinek. Exact training of a neural syntactic language model. In: Proc. ICASSP, 2004.
[9] Tomas Mikolov, Jiri Kopecky, Lukas Burget, Ondrej Glembek and Jan Cernocky: Neural network based language models for highly inflective languages. In: Proc. ICASSP 2009.
[10] A. Emami, F. Jelinek. Random clusterings for language modeling. In: Proc. ICASSP 2005.
[11] S. F. Chen and J. T. Goodman. An empirical study of smoothing techniques for language modeling. In Proceedings of the 34th Annual Meeting of the ACL, 1996.
[12] Andreas Stolcke. SRILM - language modeling toolkit. http://www.speech.sri.com/projects/srilm/
[13] Tanel Alumae and Mikko Kurimo. Efficient Estimation of Maximum Entropy Language Models with N-gram features: an SRILM extension. In: Proc. INTERSPEECH 2010.
[14] Puyang Xu, D. Karakos, S. Khudanpur. Self-Supervised Discriminative Training of Statistical Language Models. ASRU 2009.
[15] D. Klakow. Log-linear interpolation of language models. In: Proc. Internat. Conf. Speech Language Process., 1998.
[16] S. Momtazi, F. Faubel, D. Klakow. Within and Across Sentence Boundary Language Model. In: Proc. INTERSPEECH 2010.
[17] W. Wang and M. Harper. The SuperARV language model: Investigating the effectiveness of tightly integrating multiple knowledge sources. In: Proc. EMNLP, 2002.
[18] H.-S. Le, I. Oparin, A. Allauzen, J.-L. Gauvain, F. Yvon. Structured Output Layer Neural Network Language Model. In: Proc. ICASSP 2011.
[19] A. Mnih and G. Hinton. Three new graphical models for statistical language modelling. Proceedings of the 24th international conference on Machine learning, 2007.
[20] D. E. Rumelhart, G. E. Hinton, R. J. Williams. 1986. Learning internal representations by back-propagating errors. Nature, 323, 533-536.

3Our own toolkit for training RNNLMs is available at http://www.fit.vutbr.cz/~imikolov/rnnlm/
113 / 127
Discussion
Some of the best WER results in the LM literature.
Gain of up to 3% absolute WER over trigram (not <1%).
Interpolation with word n-gram optional.
Can integrate arbitrary features, e.g., syntactic features.
Easy to condition on longer histories.
Training and evaluation are slow.
Optimizations: class-based modeling; reduced vocab.
Publicly-available toolkit: http://rnnlm.org
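"Interpolation with word n-gram" above usually means simple linear interpolation of the two models' next-word probabilities, with the weight tuned on held-out data. A minimal sketch; the probabilities and weight below are made up for illustration:

```python
def interpolate(p_rnn, p_ngram, lam=0.5):
    """Linear interpolation of two models' next-word probabilities.

    lam is the weight on the RNN LM, typically tuned on held-out data.
    """
    return lam * p_rnn + (1.0 - lam) * p_ngram

# Hypothetical probabilities for the same next word from each model:
p = interpolate(p_rnn=0.12, p_ngram=0.04, lam=0.75)
print(round(p, 6))  # 0.1
```

Since each component distribution sums to 1 over the vocabulary and the weights sum to 1, the interpolated distribution is automatically a valid probability distribution.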
114 / 127
Where Are We?
1 Introduction to Maximum Entropy Modeling
2 N-Gram Models and Smoothing, Revisited
3 Maximum Entropy Models, Part III
4 Neural Net Language Models
5 Discussion
115 / 127
Other Directions in Language Modeling
Discriminative training for LM's.
Super ARV LM.
LSA-based LM's.
Variable-length n-grams; skip n-grams.
Concatenating words to use in classing.
Context-dependent word classing.
Word classing at multiple granularities.
Alternate parametrizations of class n-grams.
Using part-of-speech tags.
Semantic structured LM.
Sentence-level mixtures.
Soft classing.
Hierarchical topic models.
Combining data/models from multiple domains.
Whole-sentence maximum entropy models.
116 / 127
State of the Art, Production Systems
Word n-gram models (1st pass static).
Modified Kneser-Ney smoothing?
Mostly word n-gram models (dynamic/rescoring).
Gain not worth the cost/complexity.
The more data, the less gain!
           train FLOPS/ev   eval FLOPS/ev
n-gram     5                1–3
Model M    20 × 100         5–10
NNLM       10^6 × 20        10^6
117 / 127
State of the Art, Research Systems
e.g., government evals.
Small differences in WER matter; may not have much data.
Interpolation of word n-gram models.
Rescoring w/ neural net LM's; Model M (-0.5% WER?)
Modeling medium-to-long-distance dependencies.
Almost no gain in combination with other techniques?
Not worth extra effort and complexity.
118 / 127
An Apology to N-Gram Models
I didn’t mean what I said about you.
You know I was kidding when I said . . .
You are great to poop on.
119 / 127
Lessons: Perplexity ≠ WER
e.g., One Billion Word Benchmark.
Vast perplexity improvements ⇒ small WER gains.
[Plot: WER vs. log PP]

120 / 127
What’s Important: Data!
[Plot: word error rate vs. training set size (K words)]
121 / 127
What’s Not As Important: Algorithms!
[Plot: WER gain vs. training set size (K words)]
122 / 127
Where Do We Go From Here?
N-gram models are just really easy to build.
Smarter LM's tend to be orders of magnitude slower.
Faster computers? Data sets also growing.
Need to effectively combine many sources of information.
Short, medium, and long distance.
Log-linear models, NN's promising, but slow to train.
Evidence that LM's will help more when WER's are lower.
Human rescoring of N-best lists (Brill et al., 1998).
123 / 127
References

A.L. Berger, S.A. Della Pietra, V.J. Della Pietra, “A maximum entropy approach to natural language processing”, Computational Linguistics, vol. 22, no. 1, pp. 39–71, 1996.

E. Brill, R. Florian, J.C. Henderson, L. Mangu, “Beyond N-Grams: Can Linguistic Sophistication Improve Language Modeling?”, ACL, pp. 186–190, 1998.

S.F. Chen, “Performance Prediction for Exponential Language Models”, IBM Research Division technical report RC 24671, 2008.

S.F. Chen and S.M. Chu, “Enhanced Word Classing for Model M”, Interspeech, 2010.

S.F. Chen and J. Goodman, “An Empirical Study of Smoothing Techniques for Language Modeling”, Harvard University technical report TR-10-98, 1998.
124 / 127
References

S.F. Chen and R. Rosenfeld, “A Survey of Smoothing Techniques for Maximum Entropy Models”, IEEE Transactions on Speech and Audio Processing, vol. 8, no. 1, pp. 37–50, 2000.

E.T. Jaynes, “Information Theory and Statistical Mechanics”, Physics Reviews, vol. 106, pp. 620–630, 1957.

R. Kneser and H. Ney, “Improved Backing-off for M-Gram Language Modeling”, ICASSP, vol. 1, pp. 181–184, 1995.

H. Schwenk and J.L. Gauvain, “Training Neural Network Language Models on Very Large Corpora”, HLT/EMNLP, pp. 201–208, 2005.

T. Mikolov, “Statistical Language Models based on Neural Networks”, Ph.D. thesis, Brno University of Technology, 2012.
125 / 127
Road Map
126 / 127
Course Feedback
1 Was this lecture mostly clear or unclear? What was the muddiest topic?
2 Other feedback (pace, content, atmosphere)?
127 / 127