Building a State-of-the-Art ASR System with Kaldi
TRANSCRIPT
Building Speech Recognition Systems with the Kaldi Toolkit
Sanjeev Khudanpur, Dan Povey and Jan Trmal, Johns Hopkins University
Center for Language and Speech Processing, June 13, 2016
In the beginning, there was nothing
• Then Kaldi was born in Baltimore, MD, in 2009.
Kaldi then grew up & became …
[Chart: postings to the Kaldi discussion list per month, Jan 2012 through May 2014, growing from near zero to roughly 250]
60+ contributors (icon from http://thumbs.gograph.com)
Meanwhile, Speech Search Went from "Solved" to "Unsolved" … Again
• NIST TREC SDR (1998)
– Spoken "document" retrieval from STT output was as good as retrieval from reference transcripts
– Speech search was declared a solved problem!
• NIST STD Pilot (2006)
– STT was found to be inadequate for spoken "term" detection in conversational telephone speech
• Limited language diversity in CTS corpora
– English Switchboard, CallHome and Fisher
– Arabic and Mandarin Chinese CallHome
In 2012, IARPA Launched BABEL
One month after Dan Povey returned to Kaldi's birthplace
• Automatic transcription of conversational telephone speech was still the core challenge.
• But with a few subtle, crucial changes
– Focused attention on low-resource conditions
– Required concurrent progress in multiple languages
• PY1: Cantonese, Tagalog, Pashto, Turkish and Vietnamese
• PY2: Assamese, Bengali, Haitian Creole, Lao, Zulu and Tamil
– Reduced system development time from year to year
– Used keyword search metrics to measure progress
Kaldi Today: A Community of Researchers Cooperatively Advancing STT
• C++ library, command-line tools, STT "recipes"
– Freely available via GitHub (Apache 2.0 license)
• Top STT performance in open benchmark tests
– E.g. NIST OpenKWS (2014) and IARPA ASpIRE (2015)
• Widely adopted in academia and industry
– 300+ citations in 2014 (based on Google Scholar data)
– 400+ citations in 2015 (based on Google Scholar data)
– Used by several US and non-US companies
• Main "trunk" maintained by Johns Hopkins
– Forks contain specializations by JHU and others
Co-PI’s,PhDStudentsandSponsors• SanjeevKhudanpur• DanielPovey• JanTrmal• GuoguoChen• PegahGhahremani• VimalManohar• VijayadityaPeddin0• HainanXu• XiaohuiZhang• andseveralothers
Building an STT System with Kaldi
• Data preparation
– Acoustic model training data
– Pronunciation lexicon
– Language model training data
• Basic GMM system building
– Acoustic model training
– Language model training
• Basic decoding
– Creating a static decoding graph
– Lattice rescoring
• Basic DNN system building
• Going beyond the basics
Setting up Paths, Queue Commands, …
Preparing Acoustic Training Data
data/train/text
data/train/wav.scp
data/train/(utt2spk|spk2utt)
data/train/(cmvn.scp|feats.scp)
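The files above can be sketched in a few lines. The utterance and speaker IDs below are made up for illustration; the spk2utt derivation mirrors what Kaldi's utils/utt2spk_to_spk2utt.pl does:

```python
# A minimal sketch (hypothetical utterance/speaker IDs) of the per-utterance
# files a Kaldi data directory needs, and how spk2utt is utt2spk inverted.

# data/train/text: <utterance-id> <transcription>
text = {
    "spk1_utt1": "hello world",
    "spk1_utt2": "good morning",
    "spk2_utt1": "thank you",
}
# data/train/wav.scp: <utterance-id> <path or command producing a wav>
wav_scp = {u: f"/corpus/audio/{u}.wav" for u in text}
# data/train/utt2spk: <utterance-id> <speaker-id>
utt2spk = {u: u.split("_")[0] for u in text}

def utt2spk_to_spk2utt(utt2spk):
    """Invert the utt2spk map, as utils/utt2spk_to_spk2utt.pl does."""
    spk2utt = {}
    for utt, spk in sorted(utt2spk.items()):
        spk2utt.setdefault(spk, []).append(utt)
    return spk2utt

spk2utt = utt2spk_to_spk2utt(utt2spk)
# Every file must be keyed by the same (sorted) utterance IDs.
assert sorted(text) == sorted(wav_scp) == sorted(utt2spk)
```

Kaldi additionally requires the keys in each file to be in sorted order, which is why the inversion iterates over sorted items.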
Preparing the Pronunciation Lexicon
data/local/dict/lexicon.txt
data/local/dict/*silence*.txt
data/local/lang
Word Boundary Tags
Disambiguation Symbols
data/lang
data/lang/(phones|words).txt
data/lang/topo
data/lang/phones/roots.txt
data/lang/phones/extra_questions.txt
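Why disambiguation symbols are needed can be shown with a homophone pair: if two words share a pronunciation, the lexicon FST cannot be determinized until the entries are made unique. The sketch below is a simplified version of what utils/add_lex_disambig.pl does (it ignores the prefix-pronunciation case), with an invented mini-lexicon:

```python
# Sketch of lexicon disambiguation: append #1, #2, ... to repeated
# pronunciations so every lexicon entry is unique. The words and
# pronunciations here are illustrative, not from a real lexicon.

lexicon = [
    ("read", ["r", "eh", "d"]),
    ("red",  ["r", "eh", "d"]),   # homophone of "read"
    ("reds", ["r", "eh", "d", "z"]),
]

def add_disambig(lexicon):
    """Append #k to the k-th repeat of each pronunciation
    (a simplified take on utils/add_lex_disambig.pl)."""
    seen = {}
    out = []
    for word, prons in lexicon:
        key = tuple(prons)
        seen[key] = seen.get(key, 0) + 1
        if seen[key] > 1:
            out.append((word, prons + [f"#{seen[key] - 1}"]))
        else:
            out.append((word, prons))
    return out
```

After this step "red" carries the extra symbol #1, so the lexicon transducer becomes determinizable; the symbols are removed again after graph composition.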
Preparing the Language Model
local/train_lms_srilm.sh
local/train_lms_srilm.sh (cont'd)
Interpolated Language Models
local/arpa2G.sh
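The interpolation step can be pictured with unigram toys: mix two models and pick the mixture weight that minimizes held-out perplexity, which is the same search SRILM's compute-best-mix performs for real n-gram models. All probabilities and the held-out text below are invented:

```python
import math

# Toy LM interpolation: P(w) = lam * P_A(w) + (1 - lam) * P_B(w),
# with lam chosen to minimize perplexity on held-out words.

model_a = {"the": 0.5, "cat": 0.3, "dog": 0.2}
model_b = {"the": 0.4, "cat": 0.1, "dog": 0.5}
heldout = ["the", "dog", "the", "dog", "cat"]

def perplexity(lam, a, b, words):
    logp = sum(math.log(lam * a[w] + (1 - lam) * b[w]) for w in words)
    return math.exp(-logp / len(words))

# Grid-search the interpolation weight on the held-out set.
best = min((perplexity(l / 10, model_a, model_b, heldout), l / 10)
           for l in range(11))
```

With this dog-heavy held-out text the search lands between the two models rather than at either endpoint, which is the whole point of interpolation.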
GMM Training (1)
GMM Training (2)
GMM Training (3): cluster-phones, compile-questions, build-tree
GMM Training (4)
GMM Training (5)
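The core of GMM acoustic model training is EM re-estimation, run per tied HMM state on high-dimensional features. As a hand-wavy illustration, the textbook E/M steps on one dimension with two components and synthetic data (everything here is a toy, not Kaldi's actual update):

```python
import math, random

# Toy EM for a 2-component, 1-D GMM on synthetic data drawn from
# two Gaussians at 0 and 5.
random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(200)] + \
       [random.gauss(5.0, 1.0) for _ in range(200)]

means, variances, weights = [-1.0, 6.0], [1.0, 1.0], [0.5, 0.5]

def gauss(x, m, v):
    return math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

for _ in range(20):                                   # EM iterations
    # E-step: soft responsibilities of each component for each frame.
    resp = []
    for x in data:
        p = [w * gauss(x, m, v)
             for w, m, v in zip(weights, means, variances)]
        z = sum(p)
        resp.append([pi / z for pi in p])
    # M-step: re-estimate parameters from responsibility-weighted counts.
    for k in range(2):
        nk = sum(r[k] for r in resp)
        means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
        variances[k] = sum(r[k] * (x - means[k]) ** 2
                           for r, x in zip(resp, data)) / nk
        weights[k] = nk / len(data)
```

Kaldi's gmm-est tools apply the same accumulate-then-update pattern, with diagonal covariances, mixture splitting, and variance flooring on top.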
Building HCLG (1)
Building HCLG (2)
Building HCLG (3)
Building HCLG (4)
Decoding and Lattice Rescoring
steps/decode_sgmm2.sh
Lattice Rescoring
steps/lmrescore_const_arpa.sh
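A minimal sketch of the rescoring idea behind steps/lmrescore_const_arpa.sh, shown on an n-best list rather than a lattice: keep each hypothesis's acoustic score, swap in a bigger LM's score, and re-rank. All hypotheses and scores below are invented:

```python
# Toy n-best rescoring: replace the decoding LM's score with a bigger LM's
# score and pick the new best hypothesis. Scores are illustrative
# log-probabilities, not from a real system.

nbest = [
    # (hypothesis, acoustic log-score, old-LM log-score)
    ("i red the book", -120.0, -9.0),
    ("i read the book", -121.0, -10.5),
]
big_lm = {"i red the book": -14.0, "i read the book": -7.0}

def rescore(nbest, big_lm, lm_weight=10.0):
    """Re-rank by acoustic score + scaled big-LM score."""
    scored = [(ac + lm_weight * big_lm[hyp], hyp) for hyp, ac, _old in nbest]
    return max(scored)[1]

best = rescore(nbest, big_lm)   # the bigger LM flips the decision
```

On a lattice the same substitution happens on every arc, which is why a compactly stored ("const ARPA") LM matters there.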
Basic DNN System Building
local/nnet3/run_ivector_common.sh
steps/nnet3/tdnn/make_configs.py
steps/nnet3/train_dnn.py
Advanced Methods: Staying Ahead in the STT Game
• STT technology is advancing very rapidly
– Amazon, Apple, Baidu, Facebook, Google, Microsoft
• Kaldi leads and keeps up with major innovations
– From SGMMs to DNNs (2012)
– From "English" to low-resource languages (2013)
– From CPUs to GPUs (2014)
– From close-talking to far-field microphones (2015)
– From well-curated to "wild type" corpora (2016)
• A preview of some upcoming developments
Deep Neural Networks for STT
DNN Acoustic Models for the Masses
• Nontrivial to get DNN models to work well
– Design decisions: # layers, # nodes, # outputs, type of nonlinearity, training criterion
– Training art: learning rates, regularization, update stability (max change), data randomization, # epochs
– Computational art: matrix libraries, memory management
• Kaldi recipes provide a robust starting point
Corpus           Training Speech   SGMM WER   DNN WER
BABEL Pashto     10 hours          69.2%      67.6%
BABEL Pashto     80 hours          50.2%      42.3%
Fisher English   2000 hours        15.4%      10.3%
Low-Resource STT for the Masses
• Kaldi provides language-independent recipes
– Typical BABEL Full LP condition
• 80 hours of transcribed speech, 800K words of LM text, 20K-word pronunciation lexicon
– Typical BABEL Limited LP condition
• 10 hours of transcribed speech, 100K words of LM text, 6K-word pronunciation lexicon
Language    Cantonese       Tagalog         Pashto          Turkish
Speech      80h     10h     80h     10h     80h     10h     80h     10h
CER/WER     48.5%   61.2%   46.3%   61.9%   50.7%   63.0%   51.3%   65.3%
ATWV        0.47    0.26    0.56    0.28    0.46    0.25    0.52    0.25
Parallel (GPU-based) Training
• Original neural network training algorithms were inherently sequential (e.g. SGD)
• Scaling up to "big data" becomes a challenge
• Several solutions have emerged recently
– 2009: Delayed SGD (Yahoo!)
– 2011: Lock-free SGD (Hogwild!, U Wisconsin)
– 2012: Gradient averaging (DistBelief, Google)
– 2014: Model averaging (NG-SGD, Kaldi)
Model Averaging with NG-SGD
• Train DNNs with large amounts of data
– Utilize a cluster of CPUs or GPUs
– Minimize network traffic (esp. for CPUs)
• Solution: exploit data parallelization
– Update the model in parallel over many mini-batches
– Infrequently average the models (parameters)
• Use "Natural-Gradient" SGD for model updating
– Approximates conditioning via the inverse Fisher matrix
– Improves convergence even without parallelization
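Leaving aside the natural-gradient conditioning, the model-averaging loop can be sketched on a toy quadratic objective; worker count, learning rate, and data are all illustrative:

```python
# Sketch of the model-averaging idea behind parallel SGD training:
# N workers each run SGD on their own data shard from the same starting
# point, and parameters are averaged only at infrequent sync points.

def sgd_steps(theta, shard, lr=0.1, steps=50):
    """One worker: minimize mean (theta - x)^2 over its shard by SGD."""
    for i in range(steps):
        x = shard[i % len(shard)]
        theta -= lr * 2 * (theta - x)
    return theta

shards = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]     # data split across 3 workers
theta = 0.0
for epoch in range(5):                            # infrequent averaging
    theta = sum(sgd_steps(theta, s) for s in shards) / len(shards)
# theta ends up near 3.5, the minimizer over the pooled data.
```

Each worker alone would converge to its shard's mean (1.5, 3.5 or 5.5); averaging pulls the shared model toward the global optimum. NG-SGD's contribution is making such averaged updates converge well for non-convex networks.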
Parallelization Matters!
• Typically, a GPU is 10x faster than a 16-core CPU
• Speed-up is linear up to ca. 4 GPUs, with diminishing returns thereafter
IARPA’sOpenChallenge• Automa0cspeechrecogni0onsodwarethatworksinavariety
ofacous0cenvironmentsandrecordingscenariosisaholygrailofthespeechresearchcommunity.
• IARPA’sAutoma0cSpeechrecogni0onInReverberantEnvironments(ASpIRE)Challengeisseekingthatgrail.
Rules of the ASpIRE Challenge
• 15 hours of speech data were posted on the IARPA website
– Multi-microphone recordings of conversational English
– 5h development set (dev), 10h development-test set (dev-test)
– Transcriptions provided for dev; only scoring for dev-test output
– For training data selection, system development and tuning
• 12 hours of new speech data during the evaluation period
– Far-field speech (eval) from noisy, reverberant rooms
– Single-microphone or multi-microphone conditions
• Word error rate is the measure of performance
– Single-microphone submissions were due on 02/18/2015
– Results were officially announced on 09/10/2015
Examples of ASpIRE Audio
• Typical sample
– Suggested by Dr. Mary Harper
• Almost manageable
– Easy for humans, 26% errors for ASR
• Somewhat hard
– Easy for humans, 41% errors for ASR
• Much harder
– Not easy for humans, 60% errors for ASR
• *#@!!#%#%^^
– Very hard for humans, no ASR output
Kaldi ASR Improvements for ASpIRE
• Time delay neural networks (TDNN)
– A way to deal with long acoustic-phonetic context
– A structured alternative to deep/recurrent neural nets
• Data augmentation with simulated reverberations
– A way to mitigate channel distortions not seen in training
– A form of multi-condition training of ASR models
• i-vector based speaker & environment adaptation
– A way to deal with speaker & channel variability
– Adapted [with a twist] from Speaker ID systems
Kaldi ASR Improvements, ASpIRE++
• Pronunciation and inter-word silence modeling
– Inspired by pronunciation-prosody interactions
– A simple context-dependent model of inter-word silence
• Recurrent neural network language models (RNNLM)
– A (known) way to model long-range word dependencies
– Incorporated post-submission into the JHU ASpIRE system
• Ongoing Kaldi investigations that hold promise
– Semi-supervised discriminative training of (T)DNNs
– Long short-term memory (LSTM) acoustic models
– Connectionist temporal classification (CTC) models
Time Delay Neural Networks (See our paper at INTERSPEECH 2015 for details)
A 28-Year-Old Idea, Resurrected
[Diagram: the original time delay neural network of Alex Waibel, Kevin Lang, et al. (1987); each layer splices a window of frames from the layer below, so the top layer sees input frames t-13 through t+9]
Our TDNN Architecture (2015)
[Diagram: layers 1-4 with splicing offsets of roughly {-2,+2}, {-1,+2}, {-3,+3} and {-7,+2}, again yielding an input context of t-13 .. t+9]
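The network's input context follows directly from the per-layer splicing offsets: left and right contexts simply add up across layers. The offsets below are illustrative and reproduce the t-13 .. t+9 context shown in the diagrams:

```python
# Compute a TDNN's receptive field from per-layer splice offsets.
# Offsets (lo, hi) mean: a frame at time t in this layer reads frames
# t+lo .. t+hi of the layer below. These particular offsets are
# illustrative, matching the t-13 .. t+9 context in the slides.

splice_offsets = [(-2, 2), (-1, 2), (-3, 3), (-7, 2)]   # layers 1..4

def receptive_field(offsets):
    """Total (left, right) input context seen by the top layer."""
    left = sum(-lo for lo, hi in offsets)
    right = sum(hi for lo, hi in offsets)
    return left, right

left, right = receptive_field(splice_offsets)   # -> (13, 9)
```

This additivity is why a TDNN can cover a long acoustic context with only a few layers, while sub-sampling the intermediate computations keeps it cheap.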
Improved ASR on Several Data Sets
• Consistent 5-10% reduction in word error rate (WER) over DNNs on most data sets, including conversational speech.
• TDNN training speeds are on par with DNNs, and nearly an order of magnitude faster than RNNs
Standard ASR Test Sets      Size       DNN     TDNN    Rel. Δ
Wall Street Journal         80 hrs     6.6%    6.2%    5%
TED-LIUM                    118 hrs    19.3%   17.9%   7%
Switchboard                 300 hrs    15.5%   14.0%   10%
LibriSpeech                 960 hrs    5.2%    4.8%    7%
Fisher English              1800 hrs   22.2%   21.0%   5%
ASpIRE (Fisher training)    1800 hrs   47.7%   47.6%   —
Data Augmentation for ASR Training (See our paper at INTERSPEECH 2015 for details)
Simulating Reverberant Speech for Multi-condition (T)DNN Training
• Simulate ca. 5500 hours of reverberant, noisy data from 1800 hours of the Fisher English CTS corpus
– Replicate each of the ca. 21,000 conversation sides 3 times
– Randomly change the sampling rate [up to ±10%]
– Convolve each conversation side with one of 320 real-life room impulse responses (RIR) chosen at random
– Add noise to the signal (when available with the RIR)
• Generate (T)DNN training labels from clean speech
– Align "pre-reverb" speech to ca. 7500 CD-HMM states
• Train DNN and TDNN acoustic models
– Cross-entropy training followed by sequence training
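The reverberation step amounts to convolving the clean signal with a room impulse response and optionally adding noise. A toy, dependency-free sketch; the signal, RIR, and noise values below are invented, and real recipes use measured RIRs at proper levels:

```python
# Toy reverberation for data augmentation: wet = clean * RIR (+ noise).

def convolve(signal, rir):
    """Full linear convolution of two sample lists."""
    out = [0.0] * (len(signal) + len(rir) - 1)
    for i, s in enumerate(signal):
        for j, r in enumerate(rir):
            out[i + j] += s * r
    return out

def reverberate(signal, rir, noise=None, noise_scale=0.1):
    wet = convolve(signal, rir)
    if noise is not None:            # add noise when the RIR comes with one
        wet = [w + noise_scale * n for w, n in zip(wet, noise)]
    return wet

clean = [1.0, 0.0, 0.0, 0.5]
rir = [1.0, 0.6, 0.3]                # direct path plus two decaying reflections
noisy_reverb = reverberate(clean, rir, noise=[0.1] * 6)
```

Because the convolution smears each sample across the RIR's length, the resulting training data exposes the network to the long reverberation tails it will meet in far-field test audio.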
Result of Data Augmentation
Acoustic Model     Data Augmentation                Dev WER
TDNN A (230ms)     None (1800h, clean speech)       47.6%
TDNN A (230ms)     + 3x (reverberation + noise)     31.7%
TDNN B (290ms)     + 3x (reverberation + noise)     30.8%
TDNN A (230ms)     + sampling rate perturbation     31.0%
TDNN B (290ms)     + sampling rate perturbation     31.1%
• Data augmentation with simulated reverberation is beneficial
– Likely to be a very important reason for relatively good performance
• Sampling rate perturbation didn't help much on ASpIRE data
• Sequence training helped reduce WER on the dev set
– Required modifying the sMBR training criterion to realize gains
– But the gains did not carry over to the dev-test set
i-vectors for Speaker Compensation (See our paper at INTERSPEECH 2015 for details)
Using i-vectors Instead of fMLLR, and Using Unnormalized MFCCs to Compute i-vectors
• 100-dim i-vectors are appended to the MFCC inputs of the TDNN
– i-vectors are computed from raw MFCCs (i.e. no mean subtraction etc.)
– UBM posteriors, however, use MFCCs normalized over a 6 sec window
• i-vectors are computed for each training utterance
– Increases the speaker and channel variability seen in training data
– May model transient distortions? e.g. moving speakers, passing cars
• i-vectors are calculated for every ca. 60 sec of test audio
– The UBM prior is weighted 10:1 to prevent overcompensation
– The weight of test statistics is capped at 75:1 relative to UBM statistics
Speaker Compensation Method                  Dev WER
TDNN without i-vectors                       34.8%
+ i-vectors (from all frames)                33.8%
+ i-vectors (from reliable speech frames)    30.8%
Pronunciation and Silence Probabilities (See our paper at INTERSPEECH 2015 for details)
Trigram-like Inter-word Silence Model

$$P(s \mid a\_b) = P(s \mid a\_)\,F(s\_b)$$

$$F(s\_b) = \frac{c(s\,b) + \lambda_3}{\sum_{\tilde a} c(\tilde a * b)\,P(s \mid \tilde a\_) + \lambda_3}$$

$$P(s \mid a\_) = \frac{c(a\,s) + \lambda_2 P(s)}{c(a) + \lambda_2}$$

(Here $s$ is silence, $a\_b$ the position between words $a$ and $b$, $c(\cdot)$ training counts, and $\lambda_2, \lambda_3$ smoothing constants.)
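The three estimates above translate directly into code; the counts and smoothing constants below are illustrative, not from a trained system:

```python
# Interpolated estimates for the inter-word silence model:
#   P(s|a_)  : silence after word a, smoothed toward the global rate P(s)
#   F(s_b)   : correction factor for silence before word b
#   P(s|a_b) = P(s|a_) * F(s_b)

lam2, lam3 = 2.0, 2.0      # smoothing constants (illustrative)
P_s = 0.2                  # overall probability of inter-word silence

def p_sil_after(c_a_s, c_a):
    """P(s | a _): c(a s) = count of word a followed by silence,
    c(a) = count of word a."""
    return (c_a_s + lam2 * P_s) / (c_a + lam2)

def f_sil_before(c_s_b, denom):
    """F(s _ b): c(s b) = count of silence before word b; `denom` is
    sum over words a~ of c(a~ * b) * P(s | a~ _)."""
    return (c_s_b + lam3) / (denom + lam3)

# P(s | a _ b): probability of silence between a particular word pair.
p = p_sil_after(c_a_s=8, c_a=20) * f_sil_before(c_s_b=5, denom=6.0)
```

The factorization keeps the model trigram-like (it looks at both neighboring words) while only requiring bigram-style counts.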
Is "Prosody" Finally Helping STT?
Task          Test Set     Baseline   + Sil/Pron
WSJ           Eval92       4.1        3.9
Switchboard   Eval2000     20.5       20.0
TED-LIUM      Test         18.1       17.9
LibriSpeech   Test Clean   6.6        6.6
LibriSpeech   Test Other   22.9       22.5
• Modeling pronunciation and silence probabilities yields modest but consistent improvements on many large-vocabulary ASR tasks
Pronunciation/Silence Probabilities     Dev WER
No probabilities in the lexicon         32.1%
+ pronunciation probabilities           31.6%
+ inter-word silence probabilities      30.8%
Recurrent Neural Network based Language Models
(See our paper at INTERSPEECH 2010 for the first "convincing" results)
RNNLM on ASpIRE Data
Language Model and Rescoring Method               Dev WER
4-gram LM and lattice rescoring                   30.8%
RNN-LM and 100-best rescoring                     30.2%
RNN-LM and 1000-best rescoring                    29.9%
RNN-LM (4-gram approximation) lattice rescoring   29.9%
RNN-LM (6-gram approximation) lattice rescoring   29.8%
• An RNNLM consistently outperforms the N-gram LM
• The Kaldi lattice rescoring appears to cause no loss in performance
– The approximation entails not "expanding" the lattice to represent each unique history separately
– When two paths merge in an N-gram lattice, only one s(t) is chosen at random and propagated forward
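One way to picture the n-gram approximation: hypotheses whose histories share their last n-1 words share a single cached RNN state instead of each keeping its own. The "RNN" below is a stand-in hash, not a real network; only the cache-key truncation is the point:

```python
# Sketch of the n-gram approximation used for RNNLM lattice rescoring:
# collapse RNN states whose histories agree on the last n-1 words.

class ApproxRescorer:
    def __init__(self, n=4):
        self.n = n
        self.cache = {}        # truncated history -> "RNN state"
        self.computed = 0      # how many distinct states we actually built

    def state_for(self, history):
        key = tuple(history[-(self.n - 1):])   # keep only the last n-1 words
        if key not in self.cache:
            self.computed += 1
            self.cache[key] = hash(key)        # stand-in for a real RNN state
        return self.cache[key]

r = ApproxRescorer(n=4)
s1 = r.state_for(["i", "think", "that", "the", "cat"])
s2 = r.state_for(["we", "know", "that", "the", "cat"])  # same last 3 words
# s1 and s2 are the same state: the two paths have merged, as in an
# N-gram lattice, and only one state is propagated forward.
```

This keeps the lattice from blowing up with one state per unique full history, at the cost of the (empirically harmless) approximation measured in the table above.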
The IARPA ASpIRE Leader Board
Rank   Participant     Dev WER   System Type
1      tsakilidis      27.2%     Combination
2      rhsiao          27.5%     Combination
3      vijaypeddinti   27.7%     Single System
http://www.dni.gov/index.php/newsroom/press-releases/210-press-releases-2015/1252-iarpa-announces-winners-of-its-aspire-challenge
Performance on Evaluation Data
Acoustic Model           Language Model   Dev WER   Test WER   Eval WER
TDNN B (CE training)     4-gram           30.8%     27.7%      44.3%
TDNN B (sMBR training)   4-gram           29.1%     28.9%      43.9%
TDNN B (CE training)     RNN              29.8%     26.5%      43.4%
TDNN B (sMBR training)   RNN              28.3%     28.2%      43.4%

Participant        Eval WER   System Type
Kaldi              44.3%      Single System
BBN (and others)   44.3%      Combination
I2R (Singapore)    44.8%      Combination
Keys to Good Performance on ASpIRE
• Time delay neural networks (TDNN)
– Deal well with long reverberation times
• i-vector based adaptation/compensation
– Deals with speaker & channel variability
• Data augmentation with simulated reverberations
– Deals with channel distortions not seen in training
• Pronunciation and inter-word silence probabilities
– Helpful in adverse acoustic conditions
The JHU ASpIRE System (See our ASRU 2015 paper for details)
Semi-supervised MMI Training (See our paper at INTERSPEECH 2015 for details)
Discriminative (MMI) Training: a hand-waving, mostly correct introduction

Maximum likelihood, i.e. "cross-entropy training":

$$\hat\theta_{ML} = \arg\max_\theta \sum_{t=1}^{T} \log P(O_t \mid W_t; \theta) \quad\Longleftrightarrow\quad \min_\theta \; KL\big(\hat P \,\|\, P_\theta\big)$$

Maximum mutual information, i.e. "sequence training":

$$\hat\theta_{MMI} = \arg\max_\theta \sum_{t=1}^{T} \log \frac{P(O_t \mid W_t; \theta)\,P(W_t)}{\sum_{\tilde W} P(O_t \mid \tilde W; \theta)\,P(\tilde W)} \quad\Longleftrightarrow\quad \max_\theta \; I\big(W \wedge O; \theta\big)$$
Semi-Supervised Sequence Training
• Sequence training improves substantially over basic cross-entropy training of DNN acoustic models
• Semi-supervised cross-entropy training – by adding unlabeled data – also improves substantially over basic cross-entropy training on labeled data
• But semi-supervised sequence training is "tricky"
– Sensitivity to incorrect transcriptions seems greater
– Confidence-based filtering or weighting must be applied
– Empirical results are not very satisfactory
Semi-supervised Sequence Training: without committing to a single transcription
• View MMI training as minimizing a conditional entropy

$$I(W \wedge O; \theta) = \frac{1}{T}\sum_{t=1}^{T} \log \frac{P(O_t \mid W_t; \theta)}{P(O_t; \theta)} = \frac{1}{T}\sum_{t=1}^{T} \log \frac{P(O_t \mid W_t; \theta)}{\sum_{\tilde W} P(O_t \mid \tilde W; \theta)\,P(\tilde W)}$$

$$I(W \wedge O; \theta) = H(W) - H(W \mid O; \theta) = H(W) - \frac{1}{T}\sum_{t=1}^{T} H(W \mid O_t; \theta)$$

• The latter form does not require committing to a single $W_t$
– Well suited for unlabeled speech
– Entails computing a sum over all $W$'s in the lattice
Computing Lattice Entropy Using Expectation Semi-rings
• How to efficiently compute

$$-H(W \mid O_t; \theta) = \sum_{\pi \in L} P(\pi) \log P(\pi)$$

• Take inspiration from the computation of

$$Z(O_t; \theta) = \sum_{\pi \in L} P(\pi)$$

• Replace arc probabilities $p_i$ with the pair $(p_i, p_i \log p_i)$

Semi-ring element and operators: $(p, p \log p)$
$(p_1, p_1 \log p_1) \oplus (p_2, p_2 \log p_2) = (p_1 + p_2,\; p_1 \log p_1 + p_2 \log p_2)$
$(p_1, p_1 \log p_1) \otimes (p_2, p_2 \log p_2) = (p_1 p_2,\; p_1 p_2 \log p_2 + p_2 p_1 \log p_1)$
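The semiring can be checked on a toy two-path lattice: lift each arc probability p to the pair (p, p log p), combine with ⊗ along paths and ⊕ across paths, and the second component comes out as Σ_π P(π) log P(π) while the first is Z:

```python
import math

# The (p, p*log p) expectation semiring, verified on a toy two-path lattice.

def splus(a, b):
    """Semiring 'plus': componentwise addition."""
    return (a[0] + b[0], a[1] + b[1])

def stimes(a, b):
    """Semiring 'times': (p1*p2, p1*q2 + p2*q1)."""
    p1, q1 = a
    p2, q2 = b
    return (p1 * p2, p1 * q2 + p2 * q1)

def lift(p):
    """Lift an arc probability into the semiring."""
    return (p, p * math.log(p))

# Two paths, each a product of two arc probabilities.
path1 = [0.5, 0.8]          # P(path1) = 0.4
path2 = [0.5, 0.4]          # P(path2) = 0.2
total = splus(
    stimes(lift(path1[0]), lift(path1[1])),
    stimes(lift(path2[0]), lift(path2[1])),
)
Z, plogp = total            # Z = 0.6; plogp = 0.4*log 0.4 + 0.2*log 0.2
```

Because ⊕ and ⊗ are associative and distribute correctly, the same forward (Baum-Welch style) pass that computes Z over a real lattice also yields the entropy term, which is what makes the gradient of H(W|O;θ) tractable.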
Semi-supervised Sequence Training: Key Details Needed to Make it Work
• View the training criterion as MCE instead of MMI
– i.e. arg min H(W|O;θ) instead of arg max I(W∧O;θ)
– Efficiently compute H(W|O;θ) for the lattice, and its gradient, via Baum-Welch with special semi-rings
• Use separate output (soft-max) layers in the DNN for labeled and unlabeled data
– Inspired by multilingual DNN training methods
• Use a slightly different "prior" for converting DNN posterior probabilities to acoustic likelihoods
Results for Semi-Supervised MMI on Fisher English CTS
DNN Training Method (hours of speech)           Dev WER   Test WER
Cross-Entropy Training (100h labeled)           32.0      31.2
CE (100h labeled + 250h self-labeled)           30.6      29.8
CE (100h labeled + 250h weighted)               30.5      29.8
Sequence Training (100h labeled)                29.6      28.5
Seq Training (100h labeled + 250h weighted)     29.9      28.8
Seq Training (100h labeled + 250h MCE)          29.4      28.1
Sequence Training (350h labeled)                28.5      27.5
• Recovers about 40% of the supervised training gain
– Investigation underway for 2000h of unlabeled speech
• Repeatable results on BABEL data sets with 10h supervised training + 50-70h unsupervised
(In the table above, the self-labeled and weighted rows are known uses of unlabeled data; the MCE row is the better use of unlabeled data.)
Heterogeneous Training Corpora
• Transcribed speech from different collections is not easy to merge for STT training
– Genre and speaking-style differences
– Different channel conditions
– Slightly different transcription conventions
• Typical result: the corpus matched to the test data gives the best STT results; others don't help, and sometimes hurt!
• SCALE 2015 case study with Pashto CTS
– Collected in country, and transcribed, by the same vendor
– Roughly 80 hours each in the Appen LILA corpus and the IARPA BABEL corpus
– Pronunciation lexicon to cover transcripts; same phone set
A Study in Pashto (A manuscript is in preparation for future publication)
A Study in Pashto
• Transcriptions require extensive cross-corpus normalization
• Even after that, language models don't benefit much from corpus pooling
• Simple corpus pooling doesn't improve acoustic modeling either
• DNNs with shared "inner" layers and corpus-specific input and output layers work best
Training Data   Single Model   Interpolation Weights (LM_A, LM_B, LM_T)   Interpolated Model
Text A          99.2           0.8, 0.2, 0.0                              92.9
Text B          141.9          0.1, 0.8, 0.1                              140.0
Text T          86.7           0.0, 0.0, 1.0                              86.7
Multi-corpus (A+B) Training Strategy
                                  STT Word Error Rates
                                  Test A   Test B   Test T
Shared DNN layers (except 1)      53.2%    47.4%    27.0%
Shared DNN layers (except 2)      51.2%    45.0%    25.4%
+ Optimized language model        50.8%    44.8%    25.4%
+ Duration modeling               50.4%    44.3%    24.8%

DNN Training Data                 Test A   Test B   Test T
Single corpus (matched)           55.4%    46.8%    24.8%
Two corpora (Pashto A+B)          51.9%    48.2%    52.6%
Other Additions and Innovations
• Semi-supervised (MMI) training
– Using unlabeled speech to augment a limited transcribed speech corpus
• Multilingual acoustic model training
– Using other-language speech to augment a limited transcribed speech corpus
• Removing reliance on pronunciation lexicons
– Grapheme-based models and acoustically aided G2P
• Chain models
– 10% more accurate STT, plus
– 3x faster decoding, and 5x-10x faster training
The Genesis of Chain Models
• Connectionist Temporal Classification
– The latest shiny toy in neural network-based acoustic modeling for STT (ICASSP and Interspeech 2015)
– Nice STT improvements shown on Google data sets
– We haven't seen STT gains on our data sets
• Chain Models
– Inspired by (but quite different from) CTC
– Sequence training of NNs without CE pre-training
– Nice STT improvements over previous best systems
– 3x decoding-time speed-up; 5x-10x training speed-up
2006: A New Kid on the NNet Block
2015: The New Kid Comes of Age
CTC, Explained … in Pictures (Figure from Graves et al., ICML 2006)
[Figure: framewise CTC alignment of the phone sequence "dh ax s aw n d", with blank symbols (β) interleaved between and around the phone labels]
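The collapsing map B that the picture illustrates (remove repeated labels, then remove blanks) takes only a few lines; this is the standard many-to-one map from framewise paths to label sequences in Graves et al. (2006):

```python
# CTC's collapsing function B: drop repeats, then drop blanks ("β").

BLANK = "β"

def ctc_collapse(path):
    """Map a framewise label path to its collapsed label sequence."""
    out = []
    prev = None
    for sym in path:
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return out

frames = ["β", "dh", "dh", "β", "ax", "s", "s", "β", "aw", "n", "β", "d"]
labels = ctc_collapse(frames)   # -> ["dh", "ax", "s", "aw", "n", "d"]
```

The blank symbol is what lets CTC emit the same label twice in a row: "β a β a" collapses to "a a", while "a a" alone would collapse to a single "a".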
DNN versus CTC: STT Performance (Figures and tables from Sak et al., ICASSP 2015)
DNN     Target   CE      sMBR
LSTM    Senone   10.0%   8.9%
BLSTM   Senone   9.7%    9.1%

CTC     Target   CE      sMBR
LSTM    Phone    10.5%   9.4%
BLSTM   Phone    9.5%    8.5%
First, the Bad News …
• We haven't been able to get CTC models to give us any noticeable improvement over our best (TDNN or LSTM-RNN) models on our data
– It appears to be easier to get them to work when one has several 1000 hours of labeled speech
– But we care about lower-resource scenarios
… and then the Good News
• We are able to get similar improvements using a different model, which is inspired by ideas from the CTC papers
– Use simple "1-state" HMMs for each CD phone
– Reduce the frame rate from 100 Hz to 33 Hz
– Permit slack in the frame-to-state alignment
Chain Models and LF-MMI Training
• A new class of acoustic models for hybrid STT
– "1-state" HMM for each context-dependent phone
– LSTMs/TDNNs compute state posterior probabilities
• MFCCs are down-sampled from 100 Hz to 33 Hz
– Inspired by CTC
• A new lattice-free MMI training method
– Improved parallelization, sequence training on GPUs
• Larger mini-batches, smaller I/O bandwidth
– Does not require CE training before MMI training
– Uses "flexible label alignment" inspired by CTC
Lattice-Free MMI Training
• Denominator (phone) graph creation
– Use a phone 4-gram language model, L
– Compose H, C and L to obtain the denominator graph
• This FSA is the same for all utterances; suits GPU training
• Use (heuristic) sentence-specific initial probabilities
• Numerator graph creation
– Generate a phone graph using the transcripts
• This FSA encodes the frame-by-frame alignment of HMM states
– Permit some alignment "slack" for each frame/label
– Intersect the slackened FSA with the denominator FSA
Lattice-free MMI Training (cont'd)
• LSTM-RNNs trained with this MMI training procedure are highly susceptible to over-fitting
• It is essential to regularize the NN training process
– A second output layer for CE training
– Output L2 regularization
– Use a "leaky" HMM

Regularization                          Hub-5 '00 Word Error Rate
Cross-Entropy   L2 Norm   Leaky HMM    Total    SWBD
N               N         N            16.8%    11.1%
Y               N         N            15.9%    10.5%
N               Y         N            15.9%    10.4%
N               N         Y            16.4%    10.9%
Y               Y         N            15.7%    10.3%
Y               N         Y            15.7%    10.3%
N               Y         Y            15.8%    10.4%
Y               Y         Y            15.6%    10.4%
STT Results for Chain Models (300 hours of SWBD training speech; Hub-5 '00 evaluation set)
Training Objective   Model (Size)     Total WER   SWBD WER
Cross-Entropy        TDNN A (16.6M)   18.2%       12.5%
CE + sMBR            TDNN A (16.6M)   16.9%       11.4%
Lattice-free MMI     TDNN A (9.8M)    16.1%       10.7%
Lattice-free MMI     TDNN B (9.9M)    15.6%       10.4%
Lattice-free MMI     TDNN C (11.2M)   15.5%       10.2%
LF-MMI + sMBR        TDNN C (11.2M)   15.1%       10.0%
• LF-MMI reduces WER by ca. 10%-15% relative
• LF-MMI is better than standard CE + sMBR training (ca. 8%)
• LF-MMI improves very slightly with additional sMBR training
Chain Models and LF-MMI Training: STT Performance on a Variety of Corpora
Corpus and Audio Type   Training Speech   CE+sMBR Error Rate   LF-MMI Error Rate
AMI IHM                 80 hours          23.8%                22.4%
AMI SDM                 80 hours          48.9%                46.1%
TED-LIUM                118 hours         11.3%                12.8%
Switchboard             300 hours         16.9%                15.5%
Fisher + SWBD           2100 hours        15.0%                13.3%
• Chain models with LF-MMI reduce WER by 6%-11% (relative)
• LF-MMI improves a bit further with additional sMBR training
• LF-MMI is 5x-10x faster to train, 3x faster to decode
A Recap of Chain Models
• A new class of acoustic models for hybrid STT
– "1-state" HMM for context-dependent phones
– LSTM-RNN acoustic models (TDNN also compatible)
• A new lattice-free MMI training method
– Better suited to using GPUs for parallelization
– Does not require CE training before MMI training
• Improved speed and STT performance
– 6%-8% relative WER reduction over previous best
– 5-10x improvement in training time; 3x in decoding time
Summary of Advanced Methods: Staying Ahead in the STT Game
• STT technology is advancing very rapidly
– Amazon, Apple, Baidu, Facebook, Google, Microsoft
• Kaldi leads and keeps up with major innovations
– From SGMMs to DNNs (2012)
– From "English" to low-resource languages (2013)
– From CPUs to GPUs (2014)
– From close-talking to far-field microphones (2015)
– From well-curated to "wild type" corpora (2016)
– Chain models for better STT, faster decoding (2017)
• and the list goes on …
Team Kaldi @ Johns Hopkins
• Sanjeev Khudanpur • Daniel Povey • Jan Trmal • Guoguo Chen • Pegah Ghahremani • Vimal Manohar • Vijayaditya Peddinti • Hainan Xu • Xiaohui Zhang • … and several others
Kaldi Points-of-Contact
• Kaldi mailing list – [email protected]
• Daniel Povey – [email protected]
• Jan "Yenda" Trmal – [email protected]
• Sanjeev Khudanpur – [email protected] – 410-516-7024