Building a State-of-the-Art ASR System with Kaldi
TRANSCRIPT
Building Speech Recognition Systems with the Kaldi Toolkit
Sanjeev Khudanpur, Dan Povey and Jan Trmal, Johns Hopkins University
Center for Language and Speech Processing, June 13, 2016
In the beginning, there was nothing
• Then Kaldi was born in Baltimore, MD, in 2009.
Kaldi then grew up & became …
[Chart: postings to the Kaldi discussion list per month, Jan 2012 through May 2014, growing from near zero to roughly 250]
60+ contributors (icon from http://thumbs.gograph.com)
Meanwhile, Speech Search Went from "Solved" to "Unsolved" … Again
• NIST TREC SDR (1998)
– Spoken "document" retrieval from STT output was as good as retrieval from reference transcripts
– Speech search was declared a solved problem!
• NIST STD Pilot (2006)
– STT was found to be inadequate for spoken "term" detection in conversational telephone speech
• Limited language diversity in CTS corpora
– English Switchboard, CallHome and Fisher
– Arabic and Mandarin Chinese CallHome
In 2012, IARPA Launched BABEL
One month after Dan Povey returned to Kaldi's birthplace
• Automatic transcription of conversational telephone speech was still the core challenge.
• But with a few subtle, crucial changes
– Focused attention on low-resource conditions
– Required concurrent progress in multiple languages
• PY1: Cantonese, Tagalog, Pashto, Turkish and Vietnamese
• PY2: Assamese, Bengali, Haitian Creole, Lao, Zulu and Tamil
– Reduced system development time from year to year
– Used keyword search metrics to measure progress
Kaldi Today: A Community of Researchers Cooperatively Advancing STT
• C++ library, command-line tools, STT "recipes"
– Freely available via GitHub (Apache 2.0 license)
• Top STT performance in open benchmark tests
– E.g. NIST OpenKWS (2014) and IARPA ASpIRE (2015)
• Widely adopted in academia and industry
– 300+ citations in 2014 (based on Google Scholar data)
– 400+ citations in 2015 (based on Google Scholar data)
– Used by several US and non-US companies
• Main "trunk" maintained by Johns Hopkins
– Forks contain specializations by JHU and others
Co-PI’s,PhDStudentsandSponsors• SanjeevKhudanpur• DanielPovey• JanTrmal• GuoguoChen• PegahGhahremani• VimalManohar• VijayadityaPeddin0• HainanXu• XiaohuiZhang• andseveralothers
Building an STT System with Kaldi
• Data preparation
– Acoustic model training data
– Pronunciation lexicon
– Language model training data
• Basic GMM system building
– Acoustic model training
– Language model training
• Basic decoding
– Creating a static decoding graph
– Lattice rescoring
• Basic DNN system building
• Going beyond the basics
Setting up Paths, Queue Commands, …
Preparing Acoustic Training Data
data/train/text
data/train/wav.scp
data/train/(utt2spk|spk2utt)
data/train/(cmvn.scp|feats.scp)
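The files above can be sketched in a few lines. The utterance and speaker IDs below are made up for illustration; the spk2utt derivation mirrors what Kaldi's utils/utt2spk_to_spk2utt.pl does:

```python
# A minimal sketch (hypothetical utterance/speaker IDs) of the per-utterance
# files a Kaldi data directory needs, and how spk2utt is utt2spk inverted.

# data/train/text: <utterance-id> <transcription>
text = {
    "spk1_utt1": "hello world",
    "spk1_utt2": "good morning",
    "spk2_utt1": "thank you",
}
# data/train/wav.scp: <utterance-id> <path or command producing a wav>
wav_scp = {u: f"/corpus/audio/{u}.wav" for u in text}
# data/train/utt2spk: <utterance-id> <speaker-id>
utt2spk = {u: u.split("_")[0] for u in text}

def utt2spk_to_spk2utt(utt2spk):
    """Invert the utt2spk map, as utils/utt2spk_to_spk2utt.pl does."""
    spk2utt = {}
    for utt, spk in sorted(utt2spk.items()):
        spk2utt.setdefault(spk, []).append(utt)
    return spk2utt

spk2utt = utt2spk_to_spk2utt(utt2spk)
# Every file must be keyed by the same (sorted) utterance IDs.
assert sorted(text) == sorted(wav_scp) == sorted(utt2spk)
```

Kaldi additionally requires the keys in each file to be in sorted order, which is why the inversion iterates over sorted items.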
Preparing the Pronunciation Lexicon
data/local/dict/lexicon.txt
data/local/dict/*silence*.txt
data/local/lang
Word Boundary Tags
Disambiguation Symbols
data/lang
data/lang/(phones|words).txt
data/lang/topo
data/lang/phones/roots.txt
data/lang/phones/extra_questions.txt
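Why disambiguation symbols are needed can be shown with a homophone pair: if two words share a pronunciation, the lexicon FST cannot be determinized until the entries are made unique. The sketch below is a simplified version of what utils/add_lex_disambig.pl does (it ignores the prefix-pronunciation case), with an invented mini-lexicon:

```python
# Sketch of lexicon disambiguation: append #1, #2, ... to repeated
# pronunciations so every lexicon entry is unique. The words and
# pronunciations here are illustrative, not from a real lexicon.

lexicon = [
    ("read", ["r", "eh", "d"]),
    ("red",  ["r", "eh", "d"]),   # homophone of "read"
    ("reds", ["r", "eh", "d", "z"]),
]

def add_disambig(lexicon):
    """Append #k to the k-th repeat of each pronunciation
    (a simplified take on utils/add_lex_disambig.pl)."""
    seen = {}
    out = []
    for word, prons in lexicon:
        key = tuple(prons)
        seen[key] = seen.get(key, 0) + 1
        if seen[key] > 1:
            out.append((word, prons + [f"#{seen[key] - 1}"]))
        else:
            out.append((word, prons))
    return out
```

After this step "red" carries the extra symbol #1, so the lexicon transducer becomes determinizable; the symbols are removed again after graph composition.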
Preparing the Language Model
local/train_lms_srilm.sh
local/train_lms_srilm.sh (cont'd)
Interpolated Language Models
local/arpa2G.sh
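The interpolation step can be pictured with unigram toys: mix two models and pick the mixture weight that minimizes held-out perplexity, which is the same search SRILM's compute-best-mix performs for real n-gram models. All probabilities and the held-out text below are invented:

```python
import math

# Toy LM interpolation: P(w) = lam * P_A(w) + (1 - lam) * P_B(w),
# with lam chosen to minimize perplexity on held-out words.

model_a = {"the": 0.5, "cat": 0.3, "dog": 0.2}
model_b = {"the": 0.4, "cat": 0.1, "dog": 0.5}
heldout = ["the", "dog", "the", "dog", "cat"]

def perplexity(lam, a, b, words):
    logp = sum(math.log(lam * a[w] + (1 - lam) * b[w]) for w in words)
    return math.exp(-logp / len(words))

# Grid-search the interpolation weight on the held-out set.
best = min((perplexity(l / 10, model_a, model_b, heldout), l / 10)
           for l in range(11))
```

With this dog-heavy held-out text the search lands between the two models rather than at either endpoint, which is the whole point of interpolation.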
GMM Training (1)
GMM Training (2)
GMM Training (3): cluster-phones, compile-questions, build-tree
GMM Training (4)
GMM Training (5)
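The core of GMM acoustic model training is EM re-estimation, run per tied HMM state on high-dimensional features. As a hand-wavy illustration, the textbook E/M steps on one dimension with two components and synthetic data (everything here is a toy, not Kaldi's actual update):

```python
import math, random

# Toy EM for a 2-component, 1-D GMM on synthetic data drawn from
# two Gaussians at 0 and 5.
random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(200)] + \
       [random.gauss(5.0, 1.0) for _ in range(200)]

means, variances, weights = [-1.0, 6.0], [1.0, 1.0], [0.5, 0.5]

def gauss(x, m, v):
    return math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

for _ in range(20):                                   # EM iterations
    # E-step: soft responsibilities of each component for each frame.
    resp = []
    for x in data:
        p = [w * gauss(x, m, v)
             for w, m, v in zip(weights, means, variances)]
        z = sum(p)
        resp.append([pi / z for pi in p])
    # M-step: re-estimate parameters from responsibility-weighted counts.
    for k in range(2):
        nk = sum(r[k] for r in resp)
        means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
        variances[k] = sum(r[k] * (x - means[k]) ** 2
                           for r, x in zip(resp, data)) / nk
        weights[k] = nk / len(data)
```

Kaldi's gmm-est tools apply the same accumulate-then-update pattern, with diagonal covariances, mixture splitting, and variance flooring on top.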
Building HCLG (1)
Building HCLG (2)
Building HCLG (3)
Building HCLG (4)
Decoding and Lattice Rescoring
steps/decode_sgmm2.sh
Lattice Rescoring
steps/lmrescore_const_arpa.sh
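A minimal sketch of the rescoring idea behind steps/lmrescore_const_arpa.sh, shown on an n-best list rather than a lattice: keep each hypothesis's acoustic score, swap in a bigger LM's score, and re-rank. All hypotheses and scores below are invented:

```python
# Toy n-best rescoring: replace the decoding LM's score with a bigger LM's
# score and pick the new best hypothesis. Scores are illustrative
# log-probabilities, not from a real system.

nbest = [
    # (hypothesis, acoustic log-score, old-LM log-score)
    ("i red the book", -120.0, -9.0),
    ("i read the book", -121.0, -10.5),
]
big_lm = {"i red the book": -14.0, "i read the book": -7.0}

def rescore(nbest, big_lm, lm_weight=10.0):
    """Re-rank by acoustic score + scaled big-LM score."""
    scored = [(ac + lm_weight * big_lm[hyp], hyp) for hyp, ac, _old in nbest]
    return max(scored)[1]

best = rescore(nbest, big_lm)   # the bigger LM flips the decision
```

On a lattice the same substitution happens on every arc, which is why a compactly stored ("const ARPA") LM matters there.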
Basic DNN System Building
local/nnet3/run_ivector_common.sh
steps/nnet3/tdnn/make_configs.py
steps/nnet3/train_dnn.py
Advanced Methods: Staying Ahead in the STT Game
• STT technology is advancing very rapidly
– Amazon, Apple, Baidu, Facebook, Google, Microsoft
• Kaldi leads and keeps up with major innovations
– From SGMMs to DNNs (2012)
– From "English" to low-resource languages (2013)
– From CPUs to GPUs (2014)
– From close-talking to far-field microphones (2015)
– From well-curated to "wild type" corpora (2016)
• A preview of some upcoming developments
Deep Neural Networks for STT
DNN Acoustic Models for the Masses
• Nontrivial to get DNN models to work well
– Design decisions: # layers, # nodes, # outputs, type of nonlinearity, training criterion
– Training art: learning rates, regularization, update stability (max change), data randomization, # epochs
– Computational art: matrix libraries, memory management
• Kaldi recipes provide a robust starting point
Corpus           Training Speech   SGMM WER   DNN WER
BABEL Pashto     10 hours          69.2%      67.6%
BABEL Pashto     80 hours          50.2%      42.3%
Fisher English   2000 hours        15.4%      10.3%
Low-Resource STT for the Masses
• Kaldi provides language-independent recipes
– Typical BABEL Full LP condition
• 80 hours of transcribed speech, 800K words of LM text, 20K-word pronunciation lexicon
– Typical BABEL Limited LP condition
• 10 hours of transcribed speech, 100K words of LM text, 6K-word pronunciation lexicon
Language    Cantonese       Tagalog         Pashto          Turkish
Speech      80h     10h     80h     10h     80h     10h     80h     10h
CER/WER     48.5%   61.2%   46.3%   61.9%   50.7%   63.0%   51.3%   65.3%
ATWV        0.47    0.26    0.56    0.28    0.46    0.25    0.52    0.25
Parallel (GPU-based) Training
• Original neural network training algorithms were inherently sequential (e.g. SGD)
• Scaling up to "big data" becomes a challenge
• Several solutions have emerged recently
– 2009: Delayed SGD (Yahoo!)
– 2011: Lock-free SGD (Hogwild!, U Wisconsin)
– 2012: Gradient averaging (DistBelief, Google)
– 2014: Model averaging (NG-SGD, Kaldi)
Model Averaging with NG-SGD
• Train DNNs with large amounts of data
– Utilize a cluster of CPUs or GPUs
– Minimize network traffic (esp. for CPUs)
• Solution: exploit data parallelization
– Update the model in parallel over many mini-batches
– Infrequently average the models (parameters)
• Use "Natural-Gradient" SGD for model updating
– Approximates conditioning via the inverse Fisher matrix
– Improves convergence even without parallelization
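Leaving aside the natural-gradient conditioning, the model-averaging loop can be sketched on a toy quadratic objective; worker count, learning rate, and data are all illustrative:

```python
# Sketch of the model-averaging idea behind parallel SGD training:
# N workers each run SGD on their own data shard from the same starting
# point, and parameters are averaged only at infrequent sync points.

def sgd_steps(theta, shard, lr=0.1, steps=50):
    """One worker: minimize mean (theta - x)^2 over its shard by SGD."""
    for i in range(steps):
        x = shard[i % len(shard)]
        theta -= lr * 2 * (theta - x)
    return theta

shards = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]     # data split across 3 workers
theta = 0.0
for epoch in range(5):                            # infrequent averaging
    theta = sum(sgd_steps(theta, s) for s in shards) / len(shards)
# theta ends up near 3.5, the minimizer over the pooled data.
```

Each worker alone would converge to its shard's mean (1.5, 3.5 or 5.5); averaging pulls the shared model toward the global optimum. NG-SGD's contribution is making such averaged updates converge well for non-convex networks.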
Parallelization Matters!
• Typically, a GPU is 10x faster than a 16-core CPU
• Speed-up is linear up to ca. 4 GPUs, with diminishing returns thereafter
IARPA’sOpenChallenge• Automa0cspeechrecogni0onsodwarethatworksinavariety
ofacous0cenvironmentsandrecordingscenariosisaholygrailofthespeechresearchcommunity.
• IARPA’sAutoma0cSpeechrecogni0onInReverberantEnvironments(ASpIRE)Challengeisseekingthatgrail.
Rules of the ASpIRE Challenge
• 15 hours of speech data were posted on the IARPA website
– Multi-microphone recordings of conversational English
– 5h development set (dev), 10h development-test set (dev-test)
– Transcriptions provided for dev; only scoring for dev-test output
– For training data selection, system development and tuning
• 12 hours of new speech data during the evaluation period
– Far-field speech (eval) from noisy, reverberant rooms
– Single-microphone or multi-microphone conditions
• Word error rate is the measure of performance
– Single-microphone submissions were due on 02/18/2015
– Results were officially announced on 09/10/2015
Examples of ASpIRE Audio
• Typical sample
– Suggested by Dr. Mary Harper
• Almost manageable
– Easy for humans, 26% errors for ASR
• Somewhat hard
– Easy for humans, 41% errors for ASR
• Much harder
– Not easy for humans, 60% errors for ASR
• *#@!!#%#%^^
– Very hard for humans, no ASR output
Kaldi ASR Improvements for ASpIRE
• Time delay neural networks (TDNN)
– A way to deal with long acoustic-phonetic context
– A structured alternative to deep/recurrent neural nets
• Data augmentation with simulated reverberations
– A way to mitigate channel distortions not seen in training
– A form of multi-condition training of ASR models
• i-vector based speaker & environment adaptation
– A way to deal with speaker & channel variability
– Adapted [with a twist] from Speaker ID systems
Kaldi ASR Improvements, ASpIRE++
• Pronunciation and inter-word silence modeling
– Inspired by pronunciation-prosody interactions
– A simple context-dependent model of inter-word silence
• Recurrent neural network language models (RNNLM)
– A (known) way to model long-range word dependencies
– Incorporated post-submission into the JHU ASpIRE system
• Ongoing Kaldi investigations that hold promise
– Semi-supervised discriminative training of (T)DNNs
– Long short-term memory (LSTM) acoustic models
– Connectionist temporal classification (CTC) models
Time Delay Neural Networks (See our paper at INTERSPEECH 2015 for details)
A 28-Year-Old Idea, Resurrected
[Diagram: the original time delay neural network of Alex Waibel, Kevin Lang, et al. (1987); each layer splices a window of frames from the layer below, so the top layer sees input frames t-13 through t+9]
Our TDNN Architecture (2015)
[Diagram: layers 1-4 with splicing offsets of roughly {-2,+2}, {-1,+2}, {-3,+3} and {-7,+2}, again yielding an input context of t-13 .. t+9]
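The network's input context follows directly from the per-layer splicing offsets: left and right contexts simply add up across layers. The offsets below are illustrative and reproduce the t-13 .. t+9 context shown in the diagrams:

```python
# Compute a TDNN's receptive field from per-layer splice offsets.
# Offsets (lo, hi) mean: a frame at time t in this layer reads frames
# t+lo .. t+hi of the layer below. These particular offsets are
# illustrative, matching the t-13 .. t+9 context in the slides.

splice_offsets = [(-2, 2), (-1, 2), (-3, 3), (-7, 2)]   # layers 1..4

def receptive_field(offsets):
    """Total (left, right) input context seen by the top layer."""
    left = sum(-lo for lo, hi in offsets)
    right = sum(hi for lo, hi in offsets)
    return left, right

left, right = receptive_field(splice_offsets)   # -> (13, 9)
```

This additivity is why a TDNN can cover a long acoustic context with only a few layers, while sub-sampling the intermediate computations keeps it cheap.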
Improved ASR on Several Data Sets
• Consistent 5-10% reduction in word error rate (WER) over DNNs on most data sets, including conversational speech.
• TDNN training speeds are on par with DNNs, and nearly an order of magnitude faster than RNNs
Standard ASR Test Sets      Size       DNN     TDNN    Rel. Δ
Wall Street Journal         80 hrs     6.6%    6.2%    5%
TED-LIUM                    118 hrs    19.3%   17.9%   7%
Switchboard                 300 hrs    15.5%   14.0%   10%
LibriSpeech                 960 hrs    5.2%    4.8%    7%
Fisher English              1800 hrs   22.2%   21.0%   5%
ASpIRE (Fisher training)    1800 hrs   47.7%   47.6%   —
Data Augmentation for ASR Training (See our paper at INTERSPEECH 2015 for details)
Simulating Reverberant Speech for Multi-condition (T)DNN Training
• Simulate ca. 5500 hours of reverberant, noisy data from 1800 hours of the Fisher English CTS corpus
– Replicate each of the ca. 21,000 conversation sides 3 times
– Randomly change the sampling rate [up to ±10%]
– Convolve each conversation side with one of 320 real-life room impulse responses (RIR) chosen at random
– Add noise to the signal (when available with the RIR)
• Generate (T)DNN training labels from clean speech
– Align "pre-reverb" speech to ca. 7500 CD-HMM states
• Train DNN and TDNN acoustic models
– Cross-entropy training followed by sequence training
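The reverberation step amounts to convolving the clean signal with a room impulse response and optionally adding noise. A toy, dependency-free sketch; the signal, RIR, and noise values below are invented, and real recipes use measured RIRs at proper levels:

```python
# Toy reverberation for data augmentation: wet = clean * RIR (+ noise).

def convolve(signal, rir):
    """Full linear convolution of two sample lists."""
    out = [0.0] * (len(signal) + len(rir) - 1)
    for i, s in enumerate(signal):
        for j, r in enumerate(rir):
            out[i + j] += s * r
    return out

def reverberate(signal, rir, noise=None, noise_scale=0.1):
    wet = convolve(signal, rir)
    if noise is not None:            # add noise when the RIR comes with one
        wet = [w + noise_scale * n for w, n in zip(wet, noise)]
    return wet

clean = [1.0, 0.0, 0.0, 0.5]
rir = [1.0, 0.6, 0.3]                # direct path plus two decaying reflections
noisy_reverb = reverberate(clean, rir, noise=[0.1] * 6)
```

Because the convolution smears each sample across the RIR's length, the resulting training data exposes the network to the long reverberation tails it will meet in far-field test audio.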
Result of Data Augmentation
Acoustic Model     Data Augmentation                Dev WER
TDNN A (230ms)     None (1800h, clean speech)       47.6%
TDNN A (230ms)     + 3x (reverberation + noise)     31.7%
TDNN B (290ms)     + 3x (reverberation + noise)     30.8%
TDNN A (230ms)     + sampling rate perturbation     31.0%
TDNN B (290ms)     + sampling rate perturbation     31.1%
• Data augmentation with simulated reverberation is beneficial
– Likely to be a very important reason for relatively good performance
• Sampling rate perturbation didn't help much on ASpIRE data
• Sequence training helped reduce WER on the dev set
– Required modifying the sMBR training criterion to realize gains
– But the gains did not carry over to the dev-test set
i-vectors for Speaker Compensation (See our paper at INTERSPEECH 2015 for details)
Using i-vectors Instead of fMLLR, and Using Unnormalized MFCCs to Compute i-vectors
• 100-dim i-vectors are appended to the MFCC inputs of the TDNN
– i-vectors are computed from raw MFCCs (i.e. no mean subtraction etc.)
– UBM posteriors, however, use MFCCs normalized over a 6 sec window
• i-vectors are computed for each training utterance
– Increases the speaker and channel variability seen in training data
– May model transient distortions? e.g. moving speakers, passing cars
• i-vectors are calculated for every ca. 60 sec of test audio
– The UBM prior is weighted 10:1 to prevent overcompensation
– The weight of test statistics is capped at 75:1 relative to UBM statistics
Speaker Compensation Method                  Dev WER
TDNN without i-vectors                       34.8%
+ i-vectors (from all frames)                33.8%
+ i-vectors (from reliable speech frames)    30.8%
Pronunciation and Silence Probabilities (See our paper at INTERSPEECH 2015 for details)
Trigram-like Inter-word Silence Model

$$P(s \mid a\_b) = P(s \mid a\_)\,F(s\_b)$$

$$F(s\_b) = \frac{c(s\,b) + \lambda_3}{\sum_{\tilde a} c(\tilde a * b)\,P(s \mid \tilde a\_) + \lambda_3}$$

$$P(s \mid a\_) = \frac{c(a\,s) + \lambda_2 P(s)}{c(a) + \lambda_2}$$

(Here $s$ is silence, $a\_b$ the position between words $a$ and $b$, $c(\cdot)$ training counts, and $\lambda_2, \lambda_3$ smoothing constants.)
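The three estimates above translate directly into code; the counts and smoothing constants below are illustrative, not from a trained system:

```python
# Interpolated estimates for the inter-word silence model:
#   P(s|a_)  : silence after word a, smoothed toward the global rate P(s)
#   F(s_b)   : correction factor for silence before word b
#   P(s|a_b) = P(s|a_) * F(s_b)

lam2, lam3 = 2.0, 2.0      # smoothing constants (illustrative)
P_s = 0.2                  # overall probability of inter-word silence

def p_sil_after(c_a_s, c_a):
    """P(s | a _): c(a s) = count of word a followed by silence,
    c(a) = count of word a."""
    return (c_a_s + lam2 * P_s) / (c_a + lam2)

def f_sil_before(c_s_b, denom):
    """F(s _ b): c(s b) = count of silence before word b; `denom` is
    sum over words a~ of c(a~ * b) * P(s | a~ _)."""
    return (c_s_b + lam3) / (denom + lam3)

# P(s | a _ b): probability of silence between a particular word pair.
p = p_sil_after(c_a_s=8, c_a=20) * f_sil_before(c_s_b=5, denom=6.0)
```

The factorization keeps the model trigram-like (it looks at both neighboring words) while only requiring bigram-style counts.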
Is "Prosody" Finally Helping STT?
Task          Test Set     Baseline   + Sil/Pron
WSJ           Eval92       4.1        3.9
Switchboard   Eval2000     20.5       20.0
TED-LIUM      Test         18.1       17.9
LibriSpeech   Test Clean   6.6        6.6
LibriSpeech   Test Other   22.9       22.5
• Modeling pronunciation and silence probabilities yields modest but consistent improvements on many large-vocabulary ASR tasks
Pronunciation/Silence Probabilities     Dev WER
No probabilities in the lexicon         32.1%
+ pronunciation probabilities           31.6%
+ inter-word silence probabilities      30.8%
Recurrent Neural Network based Language Models
(See our paper at INTERSPEECH 2010 for the first "convincing" results)
RNNLM on ASpIRE Data
Language Model and Rescoring Method               Dev WER
4-gram LM and lattice rescoring                   30.8%
RNN-LM and 100-best rescoring                     30.2%
RNN-LM and 1000-best rescoring                    29.9%
RNN-LM (4-gram approximation) lattice rescoring   29.9%
RNN-LM (6-gram approximation) lattice rescoring   29.8%
• An RNNLM consistently outperforms the N-gram LM
• The Kaldi lattice rescoring appears to cause no loss in performance
– The approximation entails not "expanding" the lattice to represent each unique history separately
– When two paths merge in an N-gram lattice, only one s(t) is chosen at random and propagated forward
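One way to picture the n-gram approximation: hypotheses whose histories share their last n-1 words share a single cached RNN state instead of each keeping its own. The "RNN" below is a stand-in hash, not a real network; only the cache-key truncation is the point:

```python
# Sketch of the n-gram approximation used for RNNLM lattice rescoring:
# collapse RNN states whose histories agree on the last n-1 words.

class ApproxRescorer:
    def __init__(self, n=4):
        self.n = n
        self.cache = {}        # truncated history -> "RNN state"
        self.computed = 0      # how many distinct states we actually built

    def state_for(self, history):
        key = tuple(history[-(self.n - 1):])   # keep only the last n-1 words
        if key not in self.cache:
            self.computed += 1
            self.cache[key] = hash(key)        # stand-in for a real RNN state
        return self.cache[key]

r = ApproxRescorer(n=4)
s1 = r.state_for(["i", "think", "that", "the", "cat"])
s2 = r.state_for(["we", "know", "that", "the", "cat"])  # same last 3 words
# s1 and s2 are the same state: the two paths have merged, as in an
# N-gram lattice, and only one state is propagated forward.
```

This keeps the lattice from blowing up with one state per unique full history, at the cost of the (empirically harmless) approximation measured in the table above.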
The IARPA ASpIRE Leader Board
Rank   Participant     Dev WER   System Type
1      tsakilidis      27.2%     Combination
2      rhsiao          27.5%     Combination
3      vijaypeddinti   27.7%     Single System
http://www.dni.gov/index.php/newsroom/press-releases/210-press-releases-2015/1252-iarpa-announces-winners-of-its-aspire-challenge
Performance on Evaluation Data
Acoustic Model           Language Model   Dev WER   Test WER   Eval WER
TDNN B (CE training)     4-gram           30.8%     27.7%      44.3%
TDNN B (sMBR training)   4-gram           29.1%     28.9%      43.9%
TDNN B (CE training)     RNN              29.8%     26.5%      43.4%
TDNN B (sMBR training)   RNN              28.3%     28.2%      43.4%

Participant        Eval WER   System Type
Kaldi              44.3%      Single System
BBN (and others)   44.3%      Combination
I2R (Singapore)    44.8%      Combination
Keys to Good Performance on ASpIRE
• Time delay neural networks (TDNN)
– Deal well with long reverberation times
• i-vector based adaptation/compensation
– Deals with speaker & channel variability
• Data augmentation with simulated reverberations
– Deals with channel distortions not seen in training
• Pronunciation and inter-word silence probabilities
– Helpful in adverse acoustic conditions
The JHU ASpIRE System (See our ASRU 2015 paper for details)
Semi-supervised MMI Training (See our paper at INTERSPEECH 2015 for details)
Discriminative (MMI) Training: a hand-waving, mostly correct introduction

Maximum likelihood, i.e. "cross-entropy training":

$$\hat\theta_{ML} = \arg\max_\theta \sum_{t=1}^{T} \log P(O_t \mid W_t; \theta) \quad\Longleftrightarrow\quad \min_\theta \; KL\big(\hat P \,\|\, P_\theta\big)$$

Maximum mutual information, i.e. "sequence training":

$$\hat\theta_{MMI} = \arg\max_\theta \sum_{t=1}^{T} \log \frac{P(O_t \mid W_t; \theta)\,P(W_t)}{\sum_{\tilde W} P(O_t \mid \tilde W; \theta)\,P(\tilde W)} \quad\Longleftrightarrow\quad \max_\theta \; I\big(W \wedge O; \theta\big)$$
Semi-Supervised Sequence Training
• Sequence training improves substantially over basic cross-entropy training of DNN acoustic models
• Semi-supervised cross-entropy training – by adding unlabeled data – also improves substantially over basic cross-entropy training on labeled data
• But semi-supervised sequence training is "tricky"
– Sensitivity to incorrect transcriptions seems greater
– Confidence-based filtering or weighting must be applied
– Empirical results are not very satisfactory
Semi-supervised Sequence Training: without committing to a single transcription
• View MMI training as minimizing a conditional entropy

$$I(W \wedge O; \theta) = \frac{1}{T}\sum_{t=1}^{T} \log \frac{P(O_t \mid W_t; \theta)}{P(O_t; \theta)} = \frac{1}{T}\sum_{t=1}^{T} \log \frac{P(O_t \mid W_t; \theta)}{\sum_{\tilde W} P(O_t \mid \tilde W; \theta)\,P(\tilde W)}$$

$$I(W \wedge O; \theta) = H(W) - H(W \mid O; \theta) = H(W) - \frac{1}{T}\sum_{t=1}^{T} H(W \mid O_t; \theta)$$

• The latter form does not require committing to a single $W_t$
– Well suited for unlabeled speech
– Entails computing a sum over all $W$'s in the lattice
Computing Lattice Entropy Using Expectation Semi-rings
• How to efficiently compute

$$-H(W \mid O_t; \theta) = \sum_{\pi \in L} P(\pi) \log P(\pi)$$

• Take inspiration from the computation of

$$Z(O_t; \theta) = \sum_{\pi \in L} P(\pi)$$

• Replace arc probabilities $p_i$ with the pair $(p_i, p_i \log p_i)$

Semi-ring element and operators: $(p, p \log p)$
$(p_1, p_1 \log p_1) \oplus (p_2, p_2 \log p_2) = (p_1 + p_2,\; p_1 \log p_1 + p_2 \log p_2)$
$(p_1, p_1 \log p_1) \otimes (p_2, p_2 \log p_2) = (p_1 p_2,\; p_1 p_2 \log p_2 + p_2 p_1 \log p_1)$
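The semiring can be checked on a toy two-path lattice: lift each arc probability p to the pair (p, p log p), combine with ⊗ along paths and ⊕ across paths, and the second component comes out as Σ_π P(π) log P(π) while the first is Z:

```python
import math

# The (p, p*log p) expectation semiring, verified on a toy two-path lattice.

def splus(a, b):
    """Semiring 'plus': componentwise addition."""
    return (a[0] + b[0], a[1] + b[1])

def stimes(a, b):
    """Semiring 'times': (p1*p2, p1*q2 + p2*q1)."""
    p1, q1 = a
    p2, q2 = b
    return (p1 * p2, p1 * q2 + p2 * q1)

def lift(p):
    """Lift an arc probability into the semiring."""
    return (p, p * math.log(p))

# Two paths, each a product of two arc probabilities.
path1 = [0.5, 0.8]          # P(path1) = 0.4
path2 = [0.5, 0.4]          # P(path2) = 0.2
total = splus(
    stimes(lift(path1[0]), lift(path1[1])),
    stimes(lift(path2[0]), lift(path2[1])),
)
Z, plogp = total            # Z = 0.6; plogp = 0.4*log 0.4 + 0.2*log 0.2
```

Because ⊕ and ⊗ are associative and distribute correctly, the same forward (Baum-Welch style) pass that computes Z over a real lattice also yields the entropy term, which is what makes the gradient of H(W|O;θ) tractable.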
Semi-supervised Sequence Training: Key Details Needed to Make it Work
• View the training criterion as MCE instead of MMI
– i.e. arg min H(W|O;θ) instead of arg max I(W∧O;θ)
– Efficiently compute H(W|O;θ) for the lattice, and its gradient, via Baum-Welch with special semi-rings
• Use separate output (soft-max) layers in the DNN for labeled and unlabeled data
– Inspired by multilingual DNN training methods
• Use a slightly different "prior" for converting DNN posterior probabilities to acoustic likelihoods
Results for Semi-Supervised MMI on Fisher English CTS
DNN Training Method (hours of speech)           Dev WER   Test WER
Cross-Entropy Training (100h labeled)           32.0      31.2
CE (100h labeled + 250h self-labeled)           30.6      29.8
CE (100h labeled + 250h weighted)               30.5      29.8
Sequence Training (100h labeled)                29.6      28.5
Seq Training (100h labeled + 250h weighted)     29.9      28.8
Seq Training (100h labeled + 250h MCE)          29.4      28.1
Sequence Training (350h labeled)                28.5      27.5
• Recovers about 40% of the supervised training gain
– Investigation underway for 2000h of unlabeled speech
• Repeatable results on BABEL data sets with 10h supervised training + 50-70h unsupervised
(In the table above, the self-labeled and weighted rows are known uses of unlabeled data; the MCE row is the better use of unlabeled data.)
Heterogeneous Training Corpora
• Transcribed speech from different collections is not easy to merge for STT training
– Genre and speaking-style differences
– Different channel conditions
– Slightly different transcription conventions
• Typical result: the corpus matched to the test data gives the best STT results; others don't help, and sometimes hurt!
• SCALE 2015 case study with Pashto CTS
– Collected in country, and transcribed, by the same vendor
– Roughly 80 hours each in the Appen LILA corpus and the IARPA BABEL corpus
– Pronunciation lexicon to cover transcripts; same phone set
A Study in Pashto (A manuscript is in preparation for future publication)
A Study in Pashto
• Transcriptions require extensive cross-corpus normalization
• Even after that, language models don't benefit much from corpus pooling
• Simple corpus pooling doesn't improve acoustic modeling either
• DNNs with shared "inner" layers and corpus-specific input and output layers work best
Training Data   Single Model   Interpolation Weights (LM_A, LM_B, LM_T)   Interpolated Model
Text A          99.2           0.8, 0.2, 0.0                              92.9
Text B          141.9          0.1, 0.8, 0.1                              140.0
Text T          86.7           0.0, 0.0, 1.0                              86.7
Multi-corpus (A+B) Training Strategy
                                  STT Word Error Rates
                                  Test A   Test B   Test T
Shared DNN layers (except 1)      53.2%    47.4%    27.0%
Shared DNN layers (except 2)      51.2%    45.0%    25.4%
+ Optimized language model        50.8%    44.8%    25.4%
+ Duration modeling               50.4%    44.3%    24.8%

DNN Training Data                 Test A   Test B   Test T
Single corpus (matched)           55.4%    46.8%    24.8%
Two corpora (Pashto A+B)          51.9%    48.2%    52.6%
Other Additions and Innovations
• Semi-supervised (MMI) training
– Using unlabeled speech to augment a limited transcribed speech corpus
• Multilingual acoustic model training
– Using other-language speech to augment a limited transcribed speech corpus
• Removing reliance on pronunciation lexicons
– Grapheme-based models and acoustically aided G2P
• Chain models
– 10% more accurate STT, plus
– 3x faster decoding, and 5x-10x faster training
The Genesis of Chain Models
• Connectionist Temporal Classification
– The latest shiny toy in neural network-based acoustic modeling for STT (ICASSP and Interspeech 2015)
– Nice STT improvements shown on Google data sets
– We haven't seen STT gains on our data sets
• Chain Models
– Inspired by (but quite different from) CTC
– Sequence training of NNs without CE pre-training
– Nice STT improvements over previous best systems
– 3x decoding-time speed-up; 5x-10x training speed-up
2006: A New Kid on the NNet Block
2015: The New Kid Comes of Age
CTC, Explained … in Pictures (Figure from Graves et al., ICML 2006)
[Figure: framewise CTC alignment of the phone sequence "dh ax s aw n d", with blank symbols (β) interleaved between and around the phone labels]
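The collapsing map B that the picture illustrates (remove repeated labels, then remove blanks) takes only a few lines; this is the standard many-to-one map from framewise paths to label sequences in Graves et al. (2006):

```python
# CTC's collapsing function B: drop repeats, then drop blanks ("β").

BLANK = "β"

def ctc_collapse(path):
    """Map a framewise label path to its collapsed label sequence."""
    out = []
    prev = None
    for sym in path:
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return out

frames = ["β", "dh", "dh", "β", "ax", "s", "s", "β", "aw", "n", "β", "d"]
labels = ctc_collapse(frames)   # -> ["dh", "ax", "s", "aw", "n", "d"]
```

The blank symbol is what lets CTC emit the same label twice in a row: "β a β a" collapses to "a a", while "a a" alone would collapse to a single "a".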
DNN versus CTC: STT Performance (Figures and tables from Sak et al., ICASSP 2015)
DNN     Target   CE      sMBR
LSTM    Senone   10.0%   8.9%
BLSTM   Senone   9.7%    9.1%

CTC     Target   CE      sMBR
LSTM    Phone    10.5%   9.4%
BLSTM   Phone    9.5%    8.5%
First, the Bad News …
• We haven't been able to get CTC models to give us any noticeable improvement over our best (TDNN or LSTM-RNN) models on our data
– It appears to be easier to get them to work when one has several 1000 hours of labeled speech
– But we care about lower-resource scenarios
… and then the Good News
• We are able to get similar improvements using a different model, which is inspired by ideas from the CTC papers
– Use simple "1-state" HMMs for each CD phone
– Reduce the frame rate from 100 Hz to 33 Hz
– Permit slack in the frame-to-state alignment
Chain Models and LF-MMI Training
• A new class of acoustic models for hybrid STT
– "1-state" HMM for each context-dependent phone
– LSTMs/TDNNs compute state posterior probabilities
• MFCCs are down-sampled from 100 Hz to 33 Hz
– Inspired by CTC
• A new lattice-free MMI training method
– Improved parallelization, sequence training on GPUs
• Larger mini-batches, smaller I/O bandwidth
– Does not require CE training before MMI training
– Uses "flexible label alignment" inspired by CTC
Lattice-Free MMI Training
• Denominator (phone) graph creation
– Use a phone 4-gram language model, L
– Compose H, C and L to obtain the denominator graph
• This FSA is the same for all utterances; suits GPU training
• Use (heuristic) sentence-specific initial probabilities
• Numerator graph creation
– Generate a phone graph using the transcripts
• This FSA encodes the frame-by-frame alignment of HMM states
– Permit some alignment "slack" for each frame/label
– Intersect the slackened FSA with the denominator FSA
Lattice-free MMI Training (cont'd)
• LSTM-RNNs trained with this MMI training procedure are highly susceptible to over-fitting
• It is essential to regularize the NN training process
– A second output layer for CE training
– Output L2 regularization
– Use a "leaky" HMM

Regularization                          Hub-5 '00 Word Error Rate
Cross-Entropy   L2 Norm   Leaky HMM    Total    SWBD
N               N         N            16.8%    11.1%
Y               N         N            15.9%    10.5%
N               Y         N            15.9%    10.4%
N               N         Y            16.4%    10.9%
Y               Y         N            15.7%    10.3%
Y               N         Y            15.7%    10.3%
N               Y         Y            15.8%    10.4%
Y               Y         Y            15.6%    10.4%
STT Results for Chain Models (300 hours of SWBD training speech; Hub-5 '00 evaluation set)
Training Objective   Model (Size)     Total WER   SWBD WER
Cross-Entropy        TDNN A (16.6M)   18.2%       12.5%
CE + sMBR            TDNN A (16.6M)   16.9%       11.4%
Lattice-free MMI     TDNN A (9.8M)    16.1%       10.7%
Lattice-free MMI     TDNN B (9.9M)    15.6%       10.4%
Lattice-free MMI     TDNN C (11.2M)   15.5%       10.2%
LF-MMI + sMBR        TDNN C (11.2M)   15.1%       10.0%
• LF-MMI reduces WER by ca. 10%-15% relative
• LF-MMI is better than standard CE + sMBR training (ca. 8%)
• LF-MMI improves very slightly with additional sMBR training
Chain Models and LF-MMI Training: STT Performance on a Variety of Corpora
Corpus and Audio Type   Training Speech   CE+sMBR Error Rate   LF-MMI Error Rate
AMI IHM                 80 hours          23.8%                22.4%
AMI SDM                 80 hours          48.9%                46.1%
TED-LIUM                118 hours         11.3%                12.8%
Switchboard             300 hours         16.9%                15.5%
Fisher + SWBD           2100 hours        15.0%                13.3%
• Chain models with LF-MMI reduce WER by 6%-11% (relative)
• LF-MMI improves a bit further with additional sMBR training
• LF-MMI is 5x-10x faster to train, 3x faster to decode
A Recap of Chain Models
• A new class of acoustic models for hybrid STT
– "1-state" HMM for context-dependent phones
– LSTM-RNN acoustic models (TDNN also compatible)
• A new lattice-free MMI training method
– Better suited to using GPUs for parallelization
– Does not require CE training before MMI training
• Improved speed and STT performance
– 6%-8% relative WER reduction over previous best
– 5-10x improvement in training time; 3x in decoding time
Summary of Advanced Methods: Staying Ahead in the STT Game
• STT technology is advancing very rapidly
– Amazon, Apple, Baidu, Facebook, Google, Microsoft
• Kaldi leads and keeps up with major innovations
– From SGMMs to DNNs (2012)
– From "English" to low-resource languages (2013)
– From CPUs to GPUs (2014)
– From close-talking to far-field microphones (2015)
– From well-curated to "wild type" corpora (2016)
– Chain models for better STT, faster decoding (2017)
• and the list goes on …
Team Kaldi @ Johns Hopkins
• Sanjeev Khudanpur • Daniel Povey • Jan Trmal • Guoguo Chen • Pegah Ghahremani • Vimal Manohar • Vijayaditya Peddinti • Hainan Xu • Xiaohui Zhang • … and several others
Kaldi Points-of-Contact
• Kaldi mailing list – [email protected]
• Daniel Povey – [email protected]
• Jan "Yenda" Trmal – [email protected]
• Sanjeev Khudanpur – [email protected] – 410-516-7024