Building a State-of-the-Art ASR System with Kaldi


Page 1: Building a State-of-the-Art ASR System with Kaldi

Building Speech Recognition Systems with the Kaldi Toolkit

Sanjeev Khudanpur, Dan Povey and Jan Trmal
Johns Hopkins University
Center for Language and Speech Processing
June 13, 2016

Page 2: Building a State-of-the-Art ASR System with Kaldi

In the beginning, there was nothing
•  Then Kaldi was born in Baltimore, MD, in 2009.

Page 3: Building a State-of-the-Art ASR System with Kaldi

Kaldi then grew up & became…

[Figure: bar chart of monthly postings to the Kaldi discussion list, January 2012 through May 2014 (y-axis 0-250).]

60+ Contributors
(Icon from http://thumbs.gograph.com)

Page 4: Building a State-of-the-Art ASR System with Kaldi

Meanwhile, Speech Search Went from "Solved" to "Unsolved"… Again

•  NIST TREC SDR (1998)
–  Spoken "document" retrieval from STT output as good as retrieval from reference transcripts
–  Speech search was declared a solved problem!

•  NIST STD Pilot (2006)
–  STT was found to be inadequate for spoken "term" detection in conversational telephone speech

•  Limited language diversity in CTS corpora
–  English Switchboard, CallHome and Fisher
–  Arabic and Mandarin Chinese CallHome

Page 5: Building a State-of-the-Art ASR System with Kaldi

In 2012, IARPA launched BABEL
One month after Dan Povey returned to Kaldi's birthplace

•  Automatic transcription of conversational telephone speech was still the core challenge.

•  But with a few subtle, crucial changes
–  Focused attention on low-resource conditions
–  Required concurrent progress in multiple languages
•  PY1: Cantonese, Tagalog, Pashto, Turkish and Vietnamese
•  PY2: Assamese, Bengali, Haitian Creole, Lao, Zulu and Tamil
–  Reduced system development time from year to year
–  Used keyword search metrics to measure progress

Page 6: Building a State-of-the-Art ASR System with Kaldi

Kaldi Today: A Community of Researchers Cooperatively Advancing STT

•  C++ library, command-line tools, STT "recipes"
–  Freely available via GitHub (Apache 2.0 license)

•  Top STT performance in open benchmark tests
–  E.g. NIST OpenKWS (2014) and IARPA ASpIRE (2015)

•  Widely adopted in academia and industry
–  300+ citations in 2014 (based on Google Scholar data)
–  400+ citations in 2015 (based on Google Scholar data)
–  Used by several US and non-US companies

•  Main "trunk" maintained by Johns Hopkins
–  Forks contain specializations by JHU and others

Page 7: Building a State-of-the-Art ASR System with Kaldi

Co-PI’s,PhDStudentsandSponsors•  SanjeevKhudanpur•  DanielPovey•  JanTrmal•  GuoguoChen•  PegahGhahremani•  VimalManohar•  VijayadityaPeddin0•  HainanXu•  XiaohuiZhang•  andseveralothers

Page 8: Building a State-of-the-Art ASR System with Kaldi

Building an STT System with Kaldi
•  Data preparation
–  Acoustic model training data
–  Pronunciation lexicon
–  Language model training data
•  Basic GMM system building
–  Acoustic model training
–  Language model training
•  Basic decoding
–  Creating a static decoding graph
–  Lattice rescoring
•  Basic DNN system building
•  Going beyond the basics

Page 9: Building a State-of-the-Art ASR System with Kaldi

Setting up Paths, Queue Commands, …
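The original slide shows a screenshot of the recipe's setup files. A hedged sketch of the usual path.sh / cmd.sh conventions (paths, tool directories and queue options below are examples, not fixed requirements):

```bash
# path.sh: make the Kaldi binaries and recipe utilities visible.
export KALDI_ROOT=/path/to/kaldi                      # example location
export PATH=$PWD/utils:$KALDI_ROOT/src/bin:$PATH      # plus the other compiled tool dirs

# cmd.sh: how parallel jobs are dispatched by the steps/ and utils/ scripts.
export train_cmd="run.pl"      # run locally; on a grid use e.g. "queue.pl -l mem_free=2G"
export decode_cmd="run.pl"
```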

Page 10: Building a State-of-the-Art ASR System with Kaldi

Building an STT System with Kaldi
•  Data preparation
–  Acoustic model training data (this section)
–  Pronunciation lexicon
–  Language model training data
•  Basic GMM system building
–  Acoustic model training
–  Language model training
•  Basic decoding
–  Creating a static decoding graph
–  Lattice rescoring
•  Basic DNN system building
•  Going beyond the basics

Page 11: Building a State-of-the-Art ASR System with Kaldi

Preparing Acoustic Training Data

Page 12: Building a State-of-the-Art ASR System with Kaldi

data/train/text
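The slide shows the contents of data/train/text, which maps each utterance ID to its transcript, one utterance per line. A minimal illustration (the IDs and transcripts below are invented for this sketch):

```bash
# Illustrative only: utterance IDs and transcripts are made up.
# Format: <utterance-id> <word1> <word2> ...
# By convention the utterance ID starts with the speaker ID so that sorting groups them.
head -3 data/train/text
# spk001_utt0001 HELLO HOW ARE YOU
# spk001_utt0002 I AM CALLING ABOUT MY ACCOUNT
# spk002_utt0001 THANK YOU VERY MUCH
```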

Page 13: Building a State-of-the-Art ASR System with Kaldi

data/train/wav.scp
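Likewise for data/train/wav.scp, which maps each recording ID either to a wav file or to a command that writes wav data to stdout. A hedged illustration (paths are invented):

```bash
# Illustrative only: recording IDs and paths are made up.
# Format: <recording-id> <wav path>   OR   <recording-id> <command producing wav on stdout> |
head -2 data/train/wav.scp
# spk001_utt0001 /export/corpora/mycorpus/audio/spk001_utt0001.wav
# spk002_utt0001 sph2pipe -f wav /export/corpora/mycorpus/audio/spk002_utt0001.sph |
```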

Page 14: Building a State-of-the-Art ASR System with Kaldi

data/train/(utt2spk|spk2utt)
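These two files record the utterance-to-speaker mapping in both directions. An invented example, plus the standard helper that converts one into the other:

```bash
# Illustrative only. utt2spk: one "<utterance-id> <speaker-id>" pair per line.
head -2 data/train/utt2spk
# spk001_utt0001 spk001
# spk001_utt0002 spk001
# spk2utt is the inverse: "<speaker-id> <utt-id-1> <utt-id-2> ...", one speaker per line.
utils/utt2spk_to_spk2utt.pl data/train/utt2spk > data/train/spk2utt
head -1 data/train/spk2utt
# spk001 spk001_utt0001 spk001_utt0002
```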

Page 15: Building a State-of-the-Art ASR System with Kaldi

data/train/(cmvn.scp|feats.scp)
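feats.scp and cmvn.scp are generated rather than hand-written. A minimal sketch of the usual commands (directory names, job counts and the $train_cmd variable are recipe conventions, not fixed requirements):

```bash
# Check the data directory, extract MFCCs, then compute per-speaker CMVN statistics.
utils/validate_data_dir.sh data/train
steps/make_mfcc.sh --cmd "$train_cmd" --nj 20 data/train exp/make_mfcc/train mfcc
steps/compute_cmvn_stats.sh data/train exp/make_mfcc/train mfcc
# feats.scp then points each utterance at an offset inside a feature archive, e.g.:
# spk001_utt0001 mfcc/raw_mfcc_train.1.ark:13
```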

Page 16: Building a State-of-the-Art ASR System with Kaldi

Building an STT System with Kaldi
•  Data preparation
–  Acoustic model training data
–  Pronunciation lexicon (this section)
–  Language model training data
•  Basic GMM system building
–  Acoustic model training
–  Language model training
•  Basic decoding
–  Creating a static decoding graph
–  Lattice rescoring
•  Basic DNN system building
•  Going beyond the basics

Page 17: Building a State-of-the-Art ASR System with Kaldi

Preparing the Pronunciation Lexicon

Page 18: Building a State-of-the-Art ASR System with Kaldi

data/local/dict/lexicon.txt
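The lexicon lists one pronunciation per line; a word may appear on several lines. An invented, ARPAbet-flavoured illustration:

```bash
# Illustrative only: words, phones and pronunciations are made up.
# Format: <word> <phone1> <phone2> ...
head -4 data/local/dict/lexicon.txt
# !SIL    SIL
# <unk>   SPN
# HELLO   HH AH L OW
# HELLO   HH EH L OW
```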

Page 19: Building a State-of-the-Art ASR System with Kaldi

data/local/dict/*silence*.txt
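These small phone-list files tell Kaldi which phones are silence or noise. Illustrative contents (the exact phone names depend on the lexicon above):

```bash
cat data/local/dict/silence_phones.txt    # all phones treated as silence/noise
# SIL
# SPN
cat data/local/dict/optional_silence.txt  # the one phone optionally inserted between words
# SIL
head -2 data/local/dict/nonsilence_phones.txt  # the "real" phones, one set per line
# AA
# AE
```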

Page 20: Building a State-of-the-Art ASR System with Kaldi

data/local/lang

Page 21: Building a State-of-the-Art ASR System with Kaldi

Word Boundary Tags

Page 22: Building a State-of-the-Art ASR System with Kaldi

Disambiguation Symbols

Page 23: Building a State-of-the-Art ASR System with Kaldi

data/lang
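data/lang is normally generated from the dictionary directory with utils/prepare_lang.sh rather than written by hand. A typical invocation (the OOV symbol and directory names are examples):

```bash
# Args: <dict-dir> <oov-word> <tmp-dir> <output-lang-dir>
utils/prepare_lang.sh data/local/dict "<unk>" data/local/lang data/lang
```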

Page 24: Building a State-of-the-Art ASR System with Kaldi

data/lang/(phones|words).txt
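Both files are OpenFst symbol tables mapping symbols to integer IDs. An invented illustration (the word-position-dependent phone suffixes are added by prepare_lang.sh by default):

```bash
# Illustrative only.
head -4 data/lang/words.txt
# <eps> 0
# !SIL 1
# <unk> 2
# HELLO 3
head -4 data/lang/phones.txt
# <eps> 0
# SIL 1
# SIL_B 2
# SIL_E 3
```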

Page 25: Building a State-of-the-Art ASR System with Kaldi

data/lang/topo

Page 26: Building a State-of-the-Art ASR System with Kaldi

data/lang/phones/roots.txt

Page 27: Building a State-of-the-Art ASR System with Kaldi

data/lang/phones/extra_questions.txt

Page 28: Building a State-of-the-Art ASR System with Kaldi

Building an STT System with Kaldi
•  Data preparation
–  Acoustic model training data
–  Pronunciation lexicon
–  Language model training data (this section)
•  Basic GMM system building
–  Acoustic model training
–  Language model training
•  Basic decoding
–  Creating a static decoding graph
–  Lattice rescoring
•  Basic DNN system building
•  Going beyond the basics

Page 29: Building a State-of-the-Art ASR System with Kaldi

Preparing the Language Model

Page 30: Building a State-of-the-Art ASR System with Kaldi

local/train_lms_srilm.sh
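local/train_lms_srilm.sh is a wrapper around SRILM. A hedged sketch of the kind of command it runs (options and file names are typical choices, not the script's exact ones):

```bash
# Train a 3-gram LM with modified Kneser-Ney smoothing on the training transcripts,
# restricted to the lexicon's word list, and write it in (gzipped) ARPA format.
ngram-count -order 3 -kndiscount -interpolate \
  -text data/local/lm/train_text.txt -vocab data/local/lm/wordlist \
  -unk -map-unk "<unk>" -lm data/local/lm/lm.3gram.kn.gz
```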

Page 31: Building a State-of-the-Art ASR System with Kaldi

local/train_lms_srilm.sh (cont'd)

Page 32: Building a State-of-the-Art ASR System with Kaldi

Interpolated Language Models
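When LM text comes from several sources, per-source LMs are interpolated. A hedged sketch with SRILM tools (weights, orders and file names are illustrative, not necessarily what the slide's script uses):

```bash
# Score a held-out dev set with each component LM, estimate mixture weights,
# then write the interpolated model.
ngram -order 3 -lm lm_corpusA.gz -ppl dev.txt -debug 2 > ppl.A
ngram -order 3 -lm lm_corpusB.gz -ppl dev.txt -debug 2 > ppl.B
compute-best-mix ppl.A ppl.B                      # prints the best mixture weights
ngram -order 3 -lm lm_corpusA.gz -mix-lm lm_corpusB.gz -lambda 0.6 \
  -write-lm lm_interpolated.gz
```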

Page 33: Building a State-of-the-Art ASR System with Kaldi

local/arpa2G.sh
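local/arpa2G.sh turns the ARPA LM into the grammar FST G.fst inside a decoding lang directory. A minimal sketch of its core steps (the real script adds OOV handling and more checks; directory names are examples):

```bash
cp -r data/lang data/lang_test
gunzip -c data/local/lm/lm.3gram.kn.gz | \
  arpa2fst --disambig-symbol=#0 --read-symbol-table=data/lang_test/words.txt \
           - data/lang_test/G.fst
fstisstochastic data/lang_test/G.fst   # sanity check; small deviations are normal
```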

Page 34: Building a State-of-the-Art ASR System with Kaldi

Building an STT System with Kaldi
•  Data preparation
–  Acoustic model training data
–  Pronunciation lexicon
–  Language model training data
•  Basic GMM system building
–  Acoustic model training (this section)
–  Language model training
•  Basic decoding
–  Creating a static decoding graph
–  Lattice rescoring
•  Basic DNN system building
•  Going beyond the basics

Page 35: Building a State-of-the-Art ASR System with Kaldi

GMM Training (1)

Page 36: Building a State-of-the-Art ASR System with Kaldi

GMM Training (2)

Page 37: Building a State-of-the-Art ASR System with Kaldi

cluster-phones, compile-questions, build-tree

Page 38: Building a State-of-the-Art ASR System with Kaldi

GMM Training (4)

Page 39: Building a State-of-the-Art ASR System with Kaldi

GMM Training (5)
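The five GMM training slides above show screenshots of a recipe; a condensed, hedged sketch of the usual progression (numbers of tree leaves and Gaussians are example values) looks like this:

```bash
# Monophone system, then alignments, then successively stronger triphone systems:
# delta features, LDA+MLLT, and speaker-adaptive training (SAT/fMLLR).
steps/train_mono.sh     --nj 20 --cmd "$train_cmd" data/train data/lang exp/mono
steps/align_si.sh       --nj 20 --cmd "$train_cmd" data/train data/lang exp/mono exp/mono_ali
steps/train_deltas.sh   --cmd "$train_cmd" 2500 30000 data/train data/lang exp/mono_ali exp/tri1
steps/align_si.sh       --nj 20 --cmd "$train_cmd" data/train data/lang exp/tri1 exp/tri1_ali
steps/train_lda_mllt.sh --cmd "$train_cmd" 3500 50000 data/train data/lang exp/tri1_ali exp/tri2
steps/align_si.sh       --nj 20 --cmd "$train_cmd" data/train data/lang exp/tri2 exp/tri2_ali
steps/train_sat.sh      --cmd "$train_cmd" 4200 70000 data/train data/lang exp/tri2_ali exp/tri3
```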

Page 40: Building a State-of-the-Art ASR System with Kaldi

Building an STT System with Kaldi
•  Data preparation
–  Acoustic model training data
–  Pronunciation lexicon
–  Language model training data
•  Basic GMM system building
–  Acoustic model training
–  Language model training
•  Basic decoding
–  Creating a static decoding graph (this section)
–  Lattice rescoring
•  Basic DNN system building
•  Going beyond the basics

Page 41: Building a State-of-the-Art ASR System with Kaldi

Building HCLG (1)

Page 42: Building a State-of-the-Art ASR System with Kaldi

Building HCLG (2)

Page 43: Building a State-of-the-Art ASR System with Kaldi

Building HCLG (3)

Page 44: Building a State-of-the-Art ASR System with Kaldi

Building HCLG (4)
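In the recipes, the composition steps shown on the four slides above are wrapped in a single script. A typical call (directory names are examples; data/lang_test must already contain G.fst):

```bash
# Compose H, C, L and G into the static decoding graph exp/tri3/graph/HCLG.fst.
utils/mkgraph.sh data/lang_test exp/tri3 exp/tri3/graph
```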

Page 45: Building a State-of-the-Art ASR System with Kaldi

Decoding and Lattice Rescoring

Page 46: Building a State-of-the-Art ASR System with Kaldi

steps/decode_sgmm2.sh
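A hedged sketch of a typical decoding run; the slide uses the SGMM2 decoder, while GMM+fMLLR systems would use steps/decode_fmllr.sh. Arguments are examples:

```bash
steps/decode_fmllr.sh --nj 30 --cmd "$decode_cmd" \
  exp/tri3/graph data/dev exp/tri3/decode_dev
# Scoring writes one WER file per LM weight; pick the best one.
grep WER exp/tri3/decode_dev/wer_* | utils/best_wer.sh
```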

Page 47: Building a State-of-the-Art ASR System with Kaldi

Building an STT System with Kaldi
•  Data preparation
–  Acoustic model training data
–  Pronunciation lexicon
–  Language model training data
•  Basic GMM system building
–  Acoustic model training
–  Language model training
•  Basic decoding
–  Creating a static decoding graph
–  Lattice rescoring (this section)
•  Basic DNN system building
•  Going beyond the basics

Page 48: Building a State-of-the-Art ASR System with Kaldi

steps/lmrescore_const_arpa.sh
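A hedged sketch of rescoring existing lattices with a larger LM stored in Kaldi's const-arpa format (file and directory names are examples); this avoids re-decoding:

```bash
# Build the const-arpa representation of a 4-gram LM once...
utils/build_const_arpa_lm.sh data/local/lm/lm.4gram.kn.gz data/lang_test data/lang_test_fg
# ...then rescore the lattices produced by the first decoding pass.
steps/lmrescore_const_arpa.sh --cmd "$decode_cmd" \
  data/lang_test data/lang_test_fg data/dev exp/tri3/decode_dev exp/tri3/decode_dev_fg
```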

Page 49: Building a State-of-the-Art ASR System with Kaldi

Building an STT System with Kaldi
•  Data preparation
–  Acoustic model training data
–  Pronunciation lexicon
–  Language model training data
•  Basic GMM system building
–  Acoustic model training
–  Language model training
•  Basic decoding
–  Creating a static decoding graph
–  Lattice rescoring
•  Basic DNN system building
•  Going beyond the basics

Page 50: Building a State-of-the-Art ASR System with Kaldi

local/nnet3/run_ivector_common.sh

Page 51: Building a State-of-the-Art ASR System with Kaldi

steps/nnet3/tdnn/make_configs.py

Page 52: Building a State-of-the-Art ASR System with Kaldi

steps/nnet3/train_dnn.py

Page 53: Building a State-of-the-Art ASR System with Kaldi

Building an STT System with Kaldi
•  Data preparation
–  Acoustic model training data
–  Pronunciation lexicon
–  Language model training data
•  Basic GMM system building
–  Acoustic model training
–  Language model training
•  Basic decoding
–  Creating a static decoding graph
–  Lattice rescoring
•  Basic DNN system building
•  Going beyond the basics

Page 54: Building a State-of-the-Art ASR System with Kaldi

Advanced Methods: Staying Ahead in the STT Game

•  STT technology is advancing very rapidly
–  Amazon, Apple, Baidu, Facebook, Google, Microsoft

•  Kaldi leads and keeps up with major innovations
–  From SGMMs to DNNs (2012)
–  From "English" to low-resource languages (2013)
–  From CPUs to GPUs (2014)
–  From close-talking to far-field microphones (2015)
–  From well-curated to "wild type" corpora (2016)

•  A preview of some upcoming developments

Page 55: Building a State-of-the-Art ASR System with Kaldi

Advanced Methods: Staying Ahead in the STT Game

•  STT technology is advancing very rapidly
–  Amazon, Apple, Baidu, Facebook, Google, Microsoft

•  Kaldi leads and keeps up with major innovations
–  From SGMMs to DNNs (2012)
–  From "English" to low-resource languages (2013)
–  From CPUs to GPUs (2014)
–  From close-talking to far-field microphones (2015)
–  From well-curated to "wild type" corpora (2016)

•  A preview of some upcoming developments

Page 56: Building a State-of-the-Art ASR System with Kaldi

Deep Neural Networks for STT

Page 57: Building a State-of-the-Art ASR System with Kaldi

DNN Acoustic Models for the Masses

•  Nontrivial to get the DNN models to work well
–  Design decisions: # layers, # nodes, # outputs, type of nonlinearity, training criterion
–  Training art: learning rates, regularization, update stability (max change), data randomization, # epochs
–  Computational art: matrix libraries, memory management

•  Kaldi recipes provide a robust starting point

  Corpus           Training Speech   SGMM WER   DNN WER
  BABEL Pashto     10 hours          69.2%      67.6%
  BABEL Pashto     80 hours          50.2%      42.3%
  Fisher English   2000 hours        15.4%      10.3%

Page 58: Building a State-of-the-Art ASR System with Kaldi

Advanced Methods: Staying Ahead in the STT Game

•  STT technology is advancing very rapidly
–  Amazon, Apple, Baidu, Facebook, Google, Microsoft

•  Kaldi leads and keeps up with major innovations
–  From SGMMs to DNNs (2012)
–  From "English" to low-resource languages (2013)
–  From CPUs to GPUs (2014)
–  From close-talking to far-field microphones (2015)
–  From well-curated to "wild type" corpora (2016)

•  A preview of some upcoming developments

Page 59: Building a State-of-the-Art ASR System with Kaldi

Low-Resource STT for the Masses

•  Kaldi provides language-independent recipes
–  Typical BABEL FullLP condition
•  80 hours of transcribed speech, 800K words of LM text, 20K-word pronunciation lexicon
–  Typical BABEL LimitedLP condition
•  10 hours of transcribed speech, 100K words of LM text, 6K-word pronunciation lexicon

  Language    Speech   CER/WER   ATWV
  Cantonese   80 h     48.5%     0.47
  Cantonese   10 h     61.2%     0.26
  Tagalog     80 h     46.3%     0.56
  Tagalog     10 h     61.9%     0.28
  Pashto      80 h     50.7%     0.46
  Pashto      10 h     63.0%     0.25
  Turkish     80 h     51.3%     0.52
  Turkish     10 h     65.3%     0.25

Page 60: Building a State-of-the-Art ASR System with Kaldi

Advanced Methods: Staying Ahead in the STT Game

•  STT technology is advancing very rapidly
–  Amazon, Apple, Baidu, Facebook, Google, Microsoft

•  Kaldi leads and keeps up with major innovations
–  From SGMMs to DNNs (2012)
–  From "English" to low-resource languages (2013)
–  From CPUs to GPUs (2014)
–  From close-talking to far-field microphones (2015)
–  From well-curated to "wild type" corpora (2016)

•  A preview of some upcoming developments

Page 61: Building a State-of-the-Art ASR System with Kaldi

Parallel (GPU-based) Training

•  Original neural network training algorithms were inherently sequential (e.g. SGD)
•  Scaling up to "big data" becomes a challenge
•  Several solutions have emerged recently
–  2009: Delayed SGD (Yahoo!)
–  2011: Lock-free SGD (Hogwild!, U. Wisconsin)
–  2012: Gradient averaging (DistBelief, Google)
–  2014: Model averaging (NG-SGD, Kaldi)

Page 62: Building a State-of-the-Art ASR System with Kaldi
Page 63: Building a State-of-the-Art ASR System with Kaldi

Model Averaging with NG-SGD

•  Train DNNs with large amounts of data
–  Utilize a cluster of CPUs or GPUs
–  Minimize network traffic (esp. for CPUs)

•  Solution: exploit data parallelization
–  Update the model in parallel over many mini-batches
–  Infrequently average the models (parameters)

•  Use "Natural-Gradient" SGD for model updating
–  Approximates conditioning via the inverse Fisher matrix
–  Improves convergence even without parallelization

Page 64: Building a State-of-the-Art ASR System with Kaldi

Parallelization Matters!

•  Typically, a GPU is 10x faster than a 16-core CPU
•  Linear speed-up till ca. 4 GPUs, then diminishing returns

Page 65: Building a State-of-the-Art ASR System with Kaldi

Advanced Methods: Staying Ahead in the STT Game

•  STT technology is advancing very rapidly
–  Amazon, Apple, Baidu, Facebook, Google, Microsoft

•  Kaldi leads and keeps up with major innovations
–  From SGMMs to DNNs (2012)
–  From "English" to low-resource languages (2013)
–  From CPUs to GPUs (2014)
–  From close-talking to far-field microphones (2015)
–  From well-curated to "wild type" corpora (2016)

•  A preview of some upcoming developments

Page 66: Building a State-of-the-Art ASR System with Kaldi

IARPA’sOpenChallenge•  Automa0cspeechrecogni0onsodwarethatworksinavariety

ofacous0cenvironmentsandrecordingscenariosisaholygrailofthespeechresearchcommunity.

•  IARPA’sAutoma0cSpeechrecogni0onInReverberantEnvironments(ASpIRE)Challengeisseekingthatgrail.

Page 67: Building a State-of-the-Art ASR System with Kaldi

Rules of the ASpIRE Challenge

•  15 hours of speech data were posted on the IARPA website
–  Multi-microphone recordings of conversational English
–  5 h development set (dev), 10 h development-test set (dev-test)
–  Transcriptions provided for dev, only scoring for dev-test output
–  For training data selection, system development and tuning

•  12 hours of new speech data during the evaluation period
–  Far-field speech (eval) from noisy, reverberant rooms
–  Single-microphone or multi-microphone conditions

•  Word error rate is the measure of performance
–  Single-microphone submissions were due on 02/18/2015
–  Results were officially announced on 09/10/2015

Page 68: Building a State-of-the-Art ASR System with Kaldi

Examples of ASpIRE Audio

•  Typical sample
–  Suggested by Dr. Mary Harper
•  Almost manageable
–  Easy for humans, 26% errors for ASR
•  Somewhat hard
–  Easy for humans, 41% errors for ASR
•  Much harder
–  Not easy for humans, 60% errors for ASR
•  *#@!!#%#%^^
–  Very hard for humans, no ASR output

Page 69: Building a State-of-the-Art ASR System with Kaldi

Kaldi ASR Improvements for ASpIRE

•  Time-delay neural networks (TDNN)
–  A way to deal with long acoustic-phonetic context
–  A structured alternative to deep/recurrent neural nets

•  Data augmentation with simulated reverberations
–  A way to mitigate channel distortions not seen in training
–  A form of multi-condition training of ASR models

•  i-vector based speaker & environment adaptation
–  A way to deal with speaker & channel variability
–  Adapted [with a twist] from speaker ID systems

Page 70: Building a State-of-the-Art ASR System with Kaldi

Kaldi ASR Improvements, ASpIRE++

•  Pronunciation and inter-word silence modeling
–  Inspired by pronunciation-prosody interactions
–  A simple context-dependent model of inter-word silence

•  Recurrent neural network language models (RNNLM)
–  A (known) way to model long-range word dependencies
–  Incorporated post-submission into the JHU ASpIRE system

•  Ongoing Kaldi investigations that hold promise
–  Semi-supervised discriminative training of (T)DNNs
–  Long short-term memory (LSTM) acoustic models
–  Connectionist temporal classification (CTC) models

Page 71: Building a State-of-the-Art ASR System with Kaldi

Time Delay Neural Networks
(See our paper at INTERSPEECH 2015 for details)

Page 72: Building a State-of-the-Art ASR System with Kaldi

A 28-Year-Old Idea, Resurrected

[Figure: the original TDNN of Alex Waibel, Kevin Lang, et al. (1987) alongside our TDNN architecture (2015); each panel shows the temporal splicing contexts used at each of the network's layers.]

Page 73: Building a State-of-the-Art ASR System with Kaldi

Improved ASR on Several Data Sets

•  Consistent 5-10% reduction in word error rate (WER) over DNNs on most data sets, including conversational speech.
•  TDNN training speeds are on par with DNNs, and nearly an order of magnitude faster than RNNs.

  Standard ASR Test Sets      Size       DNN     TDNN    Rel. Δ
  Wall Street Journal         80 hrs     6.6%    6.2%    5%
  TED-LIUM                    118 hrs    19.3%   17.9%   7%
  Switchboard                 300 hrs    15.5%   14.0%   10%
  LibriSpeech                 960 hrs    5.2%    4.8%    7%
  Fisher English              1800 hrs   22.2%   21.0%   5%
  ASpIRE (Fisher training)    1800 hrs   47.7%   47.6%   –

Page 74: Building a State-of-the-Art ASR System with Kaldi

Data Augmentation for ASR Training
(See our paper at INTERSPEECH 2015 for details)

Page 75: Building a State-of-the-Art ASR System with Kaldi

Simulating Reverberant Speech for Multi-condition (T)DNN Training

•  Simulate ca. 5500 hours of reverberant, noisy data from 1800 hours of the Fisher English CTS corpus
–  Replicate each of the ca. 21,000 conversation sides 3 times
–  Randomly change the sampling rate [up to ±10%]
–  Convolve each conversation side with one of 320 real-life room impulse responses (RIR) chosen at random
–  Add noise to the signal (when available with the RIR)

•  Generate (T)DNN training labels from clean speech
–  Align "pre-reverb" speech to ca. 7500 CD-HMM states

•  Train DNN and TDNN acoustic models
–  Cross-entropy training followed by sequence training

Page 76: Building a State-of-the-Art ASR System with Kaldi

Result of Data Augmentation

  Acoustic Model     Data Augmentation                Dev WER
  TDNN A (230 ms)    None (1800 h, clean speech)      47.6%
  TDNN A (230 ms)    + 3x (reverberation + noise)     31.7%
  TDNN B (290 ms)    + 3x (reverberation + noise)     30.8%
  TDNN A (230 ms)    + sampling rate perturbation     31.0%
  TDNN B (290 ms)    + sampling rate perturbation     31.1%

•  Data augmentation with simulated reverberation is beneficial
–  Likely to be a very important reason for the relatively good performance
•  Sampling rate perturbation didn't help much on ASpIRE data
•  Sequence training helped reduce WER on the dev set
–  Required modifying the sMBR training criterion to realize gains
–  But the gains did not carry over to the dev-test set

Page 77: Building a State-of-the-Art ASR System with Kaldi

i-vectors for Speaker Compensation
(See our paper at INTERSPEECH 2015 for details)

Page 78: Building a State-of-the-Art ASR System with Kaldi

Using i-vectors Instead of fMLLR, and Using Unnormalized MFCCs to Compute i-vectors

•  100-dim i-vectors are appended to the MFCC inputs of the TDNN
–  i-vectors are computed from raw MFCCs (i.e. no mean subtraction etc.)
–  UBM posteriors however use MFCCs normalized over a 6 sec window

•  i-vectors are computed for each training utterance
–  Increases the speaker and channel variability seen in the training data
–  May model transient distortions? e.g. moving speakers, passing cars

•  i-vectors are calculated for every ca. 60 sec of test audio
–  UBM prior is weighted 10:1 to prevent overcompensation
–  Weight of test statistics is capped at 75:1 relative to UBM statistics

  Speaker Compensation Method                 Dev WER
  TDNN without i-vectors                      34.8%
  + i-vectors (from all frames)               33.8%
  + i-vectors (from reliable speech frames)   30.8%

Page 79: Building a State-of-the-Art ASR System with Kaldi

Pronunciation and Silence Probabilities
(See our paper at INTERSPEECH 2015 for details)

Page 80: Building a State-of-the-Art ASR System with Kaldi

Trigram-like Inter-word Silence Model

$$P(s \mid a\_b) = P(s \mid a\_)\; F(s \mid \_b)$$

$$F(s \mid \_b) = \frac{c(s\,b) + \lambda_3}{\sum_{a'} c(a' * b)\, P(s \mid a'\_) + \lambda_3}$$

$$P(s \mid a\_) = \frac{c(a\,s) + \lambda_2\, P(s)}{c(a) + \lambda_2}$$

(Here $a\_b$ denotes the position between words $a$ and $b$, $s$ denotes silence, and $c(\cdot)$ are counts.)

Page 81: Building a State-of-the-Art ASR System with Kaldi

Is "Prosody" Finally Helping STT?

  Task          Test Set     Baseline   + Sil/Pron
  WSJ           Eval92       4.1        3.9
  Switchboard   Eval2000     20.5       20.0
  TED-LIUM      Test         18.1       17.9
  LibriSpeech   Test Clean   6.6        6.6
  LibriSpeech   Test Other   22.9       22.5

•  Modeling pronunciation and silence probabilities yields a modest but consistent improvement on many large-vocabulary ASR tasks

  Pronunciation/Silence Probabilities   Dev WER
  No probabilities in the lexicon       32.1%
  + pronunciation probabilities         31.6%
  + inter-word silence probabilities    30.8%

Page 82: Building a State-of-the-Art ASR System with Kaldi

Recurrent Neural Network based Language Models
(See our paper at INTERSPEECH 2010 for the first "convincing" results)

Page 83: Building a State-of-the-Art ASR System with Kaldi

RNNLM on ASpIRE Data

  Language Model and Rescoring Method                Dev WER
  4-gram LM and lattice rescoring                    30.8%
  RNN-LM and 100-best rescoring                      30.2%
  RNN-LM and 1000-best rescoring                     29.9%
  RNN-LM (4-gram approximation), lattice rescoring   29.9%
  RNN-LM (6-gram approximation), lattice rescoring   29.8%

•  An RNNLM consistently outperforms the N-gram LM
•  The Kaldi lattice rescoring appears to cause no loss in performance
–  The approximation entails not "expanding" the lattice to represent each unique history separately
–  When two paths merge in an N-gram lattice, only one s(t) is chosen at random and propagated forward

Page 84: Building a State-of-the-Art ASR System with Kaldi

The IARPA ASpIRE Leader Board

  Rank   Participant     Dev WER   System Type
  1      tsakilidis      27.2%     Combination
  2      rhsiao          27.5%     Combination
  3      vijaypeddinti   27.7%     Single System

Page 85: Building a State-of-the-Art ASR System with Kaldi

http://www.dni.gov/index.php/newsroom/press-releases/210-press-releases-2015/1252-iarpa-announces-winners-of-its-aspire-challenge

Page 86: Building a State-of-the-Art ASR System with Kaldi

http://www.dni.gov/index.php/newsroom/press-releases/210-press-releases-2015/1252-iarpa-announces-winners-of-its-aspire-challenge

Page 87: Building a State-of-the-Art ASR System with Kaldi

Performance on Evaluation Data

  Acoustic Model           Language Model   Dev WER   Test WER   Eval WER
  TDNN B (CE training)     4-gram           30.8%     27.7%      44.3%
  TDNN B (sMBR training)   4-gram           29.1%     28.9%      43.9%
  TDNN B (CE training)     RNN              29.8%     26.5%      43.4%
  TDNN B (sMBR training)   RNN              28.3%     28.2%      43.4%

  Participant        Test WER   System Type
  Kaldi              44.3%      Single System
  BBN (and others)   44.3%      Combination
  I2R (Singapore)    44.8%      Combination

Page 88: Building a State-of-the-Art ASR System with Kaldi

Keys to Good Performance on ASpIRE

•  Time-delay neural networks (TDNN)
–  Deal well with long reverberation times
•  i-vector based adaptation/compensation
–  Deals with speaker & channel variability
•  Data augmentation with simulated reverberations
–  Deals with channel distortions not seen in training
•  Pronunciation and inter-word silence probabilities
–  Helpful in adverse acoustic conditions

Page 89: Building a State-of-the-Art ASR System with Kaldi

The JHU ASpIRE System
(See our ASRU 2015 paper for details)

Page 90: Building a State-of-the-Art ASR System with Kaldi

Semi-supervised MMI Training
(See our paper at INTERSPEECH 2015 for details)

Page 91: Building a State-of-the-Art ASR System with Kaldi

Discriminative (MMI) Training:
a hand-waving, mostly correct introduction

$$\hat{\theta}_{\mathrm{ML}} = \arg\max_{\theta} \sum_{t=1}^{T} \log P(O_t \mid W_t; \theta)
\qquad \longleftrightarrow \quad \text{minimizes } \mathrm{KL}\big(\hat{P} \,\|\, P_\theta\big) \quad \text{(cross-entropy training)}$$

$$\hat{\theta}_{\mathrm{MMI}} = \arg\max_{\theta} \sum_{t=1}^{T} \log \frac{P(O_t \mid W_t; \theta)}{\sum_{W'} P(O_t \mid W'; \theta)\, P(W')}
\qquad \longleftrightarrow \quad \text{maximizes } I(W \wedge O; \theta) \quad \text{(sequence training)}$$

Page 92: Building a State-of-the-Art ASR System with Kaldi

Semi-Supervised Sequence Training

•  Sequence training improves substantially over basic cross-entropy training of DNN acoustic models
•  Semi-supervised cross-entropy training (adding unlabeled data) also improves substantially over basic cross-entropy training on labeled data
•  But semi-supervised sequence training is "tricky"
–  Sensitivity to incorrect transcriptions seems greater
–  Confidence-based filtering or weighting must be applied
–  Empirical results are not very satisfactory

Page 93: Building a State-of-the-Art ASR System with Kaldi

Semi-supervised Sequence Training: without committing to a single transcription

•  View MMI training as minimizing a conditional entropy

$$I(W \wedge O; \theta) = \frac{1}{T} \sum_{t=1}^{T} \log \frac{P(O_t \mid W_t; \theta)}{P(O_t; \theta)}
= \frac{1}{T} \sum_{t=1}^{T} \log \frac{P(O_t \mid W_t; \theta)}{\sum_{W'} P(O_t \mid W'; \theta)\, P(W')}$$

$$I(W \wedge O; \theta) = H(W) - H(W \mid O; \theta) = H(W) - \frac{1}{T} \sum_{t=1}^{T} H(W \mid O_t; \theta)$$

•  The latter does not require committing to a single W_t
–  Well suited for unlabeled speech
–  Entails computing a sum over all W's in the lattice

Page 94: Building a State-of-the-Art ASR System with Kaldi

Computing Lattice Entropy Using Expectation Semi-rings

•  How to efficiently compute
$$-H(W \mid O_t; \theta) = \sum_{\pi \in L} P(\pi) \log P(\pi)$$

•  Take inspiration from the computation of
$$Z(O_t; \theta) = \sum_{\pi \in L} P(\pi)$$

•  Replace arc probabilities $p_i$ with the pair $(p_i,\, p_i \log p_i)$; the semi-ring operators are
$$(p_1, p_1\log p_1) \oplus (p_2, p_2\log p_2) = \big(p_1 + p_2,\; p_1\log p_1 + p_2\log p_2\big)$$
$$(p_1, p_1\log p_1) \otimes (p_2, p_2\log p_2) = \big(p_1 p_2,\; p_1 p_2\log p_2 + p_2 p_1\log p_1\big)$$
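To spell out the step the slide leaves implicit (a small added derivation, using unnormalized path weights $p(\pi)$): running the forward algorithm over the lattice in this expectation semi-ring yields the pair $(Z, S)$ with $Z = \sum_{\pi\in L} p(\pi)$ and $S = \sum_{\pi\in L} p(\pi)\log p(\pi)$, and the entropy of the normalized path distribution then follows as

$$H(W \mid O_t; \theta) = -\sum_{\pi \in L} \frac{p(\pi)}{Z} \log \frac{p(\pi)}{Z} = \log Z - \frac{S}{Z}.$$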

Page 95: Building a State-of-the-Art ASR System with Kaldi

Semi-supervised Sequence Training: Key Details Needed to Make it Work

•  View the training criterion as MCE instead of MMI
–  i.e. argmin H(W|O;θ) instead of argmax I(W∧O;θ)
–  Efficiently compute H(W|O;θ) for the lattice, and its gradient, via Baum-Welch with special semi-rings

•  Use separate output (soft-max) layers in the DNN for labeled and unlabeled data
–  Inspired by multilingual DNN training methods

•  Use a slightly different "prior" for converting DNN posterior probabilities to acoustic likelihoods

Page 96: Building a State-of-the-Art ASR System with Kaldi

Results for Semi-Supervised MMI on Fisher English CTS

  DNN Training Method (hours of speech)             Dev WER   Test WER
  Cross-Entropy Training (100 h labeled)            32.0      31.2
  CE (100 h labeled + 250 h self-labeled)           30.6      29.8
  CE (100 h labeled + 250 h weighted)               30.5      29.8
  Sequence Training (100 h labeled)                 29.6      28.5
  Seq Training (100 h labeled + 250 h weighted)     29.9      28.8
  Seq Training (100 h labeled + 250 h MCE)          29.4      28.1
  Sequence Training (350 h labeled)                 28.5      27.5

  (Slide annotation: the rows that simply add self-labeled or weighted unlabeled data illustrate the known use of unlabeled data; the MCE-based sequence-training row, a better use of it.)

•  Recovers about 40% of the supervised training gain
–  Investigation underway for 2000 h of unlabeled speech
•  Repeatable results on BABEL data sets with 10 h of supervised training + 50-70 h unsupervised

Page 97: Building a State-of-the-Art ASR System with Kaldi

Advanced Methods: Staying Ahead in the STT Game

•  STT technology is advancing very rapidly
–  Amazon, Apple, Baidu, Facebook, Google, Microsoft

•  Kaldi leads and keeps up with major innovations
–  From SGMMs to DNNs (2012)
–  From "English" to low-resource languages (2013)
–  From CPUs to GPUs (2014)
–  From close-talking to far-field microphones (2015)
–  From well-curated to "wild type" corpora (2016)

•  A preview of some upcoming developments

Page 98: Building a State-of-the-Art ASR System with Kaldi

Heterogeneous Training Corpora

•  Transcribed speech from different collections is not easy to merge for STT training
–  Genre and speaking style differences
–  Different channel conditions
–  Slightly different transcription conventions

•  Typical result: the corpus matched to the test data gives the best STT results; others don't help, and sometimes hurt!

•  SCALE 2015 case study with Pashto CTS
–  Collected in country, and transcribed, by the same vendor
–  Roughly 80 hours each in the Appen LILA corpus and the IARPA BABEL corpus
–  Pronunciation lexicon covers the transcripts; same phone set

Page 99: Building a State-of-the-Art ASR System with Kaldi

A Study in Pashto
(A manuscript is in preparation for future publication)

Page 100: Building a State-of-the-Art ASR System with Kaldi

A Study in Pashto

•  Transcriptions require extensive cross-corpus normalization
•  Even after that, language models don't benefit much from corpus pooling
•  Simple corpus pooling doesn't improve acoustic modeling either
•  DNNs with shared "inner" layers and corpus-specific input and output layers work best

  Training Data   Single Model   Interpolation Weights (LM A / LM B / LM T)   Interpolated Model
  Text A          99.2           0.8 / 0.2 / 0.0                              92.9
  Text B          141.9          0.1 / 0.8 / 0.1                              140.0
  Text T          86.7           0.0 / 0.0 / 1.0                              86.7

  DNN Training Data          STT Word Error Rates (Test A / Test B / Test T)
  Single corpus (matched)    55.4% / 46.8% / 24.8%
  Two corpora (Pashto A+B)   51.9% / 48.2% / 52.6%

  Multi-corpus (A+B) Training Strategy   STT Word Error Rates (Test A / Test B / Test T)
  Shared DNN layers (except 1)           53.2% / 47.4% / 27.0%
  Shared DNN layers (except 2)           51.2% / 45.0% / 25.4%
  + Optimized Language Model             50.8% / 44.8% / 25.4%
  + Duration Modeling                    50.4% / 44.3% / 24.8%

Page 101: Building a State-of-the-Art ASR System with Kaldi

Advanced Methods: Staying Ahead in the STT Game

•  STT technology is advancing very rapidly
–  Amazon, Apple, Baidu, Facebook, Google, Microsoft

•  Kaldi leads and keeps up with major innovations
–  From SGMMs to DNNs (2012)
–  From "English" to low-resource languages (2013)
–  From CPUs to GPUs (2014)
–  From close-talking to far-field microphones (2015)
–  From well-curated to "wild type" corpora (2016)

•  A preview of some upcoming developments

Page 102: Building a State-of-the-Art ASR System with Kaldi

Other Additions and Innovations

•  Semi-supervised (MMI) training
–  Using unlabeled speech to augment a limited transcribed speech corpus
•  Multilingual acoustic model training
–  Using other-language speech to augment a limited transcribed speech corpus
•  Removing reliance on pronunciation lexicons
–  Grapheme-based models and acoustically aided G2P
•  Chain models
–  10% more accurate STT, plus
–  3x faster decoding, and 5x-10x faster training

Page 103: Building a State-of-the-Art ASR System with Kaldi

The Genesis of Chain Models

•  Connectionist Temporal Classification
–  The latest shiny toy in neural network-based acoustic modeling for STT (ICASSP and InterSpeech 2015)
–  Nice STT improvements shown on Google data sets
–  We haven't seen STT gains on our data sets

•  Chain Models
–  Inspired by (but quite different from) CTC
–  Sequence training of NNs without CE pre-training
–  Nice STT improvements over previous best systems
–  3x decoding-time speed-up; 5x-10x training speed-up

Page 104: Building a State-of-the-Art ASR System with Kaldi

2006: A New Kid on the NNet Block

Page 105: Building a State-of-the-Art ASR System with Kaldi

2015: The New Kid Comes of Age

Page 106: Building a State-of-the-Art ASR System with Kaldi

2015: The New Kid Comes of Age

Page 107: Building a State-of-the-Art ASR System with Kaldi

CTC, Explained… in Pictures
[Figure from Graves et al., ICML 2006: frame-level label posteriors for the phone sequence "dh ax s aw n d".]

Page 108: Building a State-of-the-Art ASR System with Kaldi

CTC, Explained… in Pictures
[Figure from Graves et al., ICML 2006: the same phone sequence "dh ax s aw n d", now with the CTC blank symbol β interleaved between the labels.]

Page 109: Building a State-of-the-Art ASR System with Kaldi

DNN versus CTC: STT Performance
(Figures and tables from Sak et al., ICASSP 2015)

  DNN     Target   CE      sMBR
  LSTM    Senone   10.0%   8.9%
  BLSTM   Senone   9.7%    9.1%

  CTC     Target   CE      sMBR
  LSTM    Phone    10.5%   9.4%
  BLSTM   Phone    9.5%    8.5%

Page 110: Building a State-of-the-Art ASR System with Kaldi

First, the Bad News…

•  We haven't been able to get CTC models to give us any noticeable improvement over our best (TDNN or LSTM-RNN) models on our data
–  It appears to be easier to get them to work when one has several 1000 hours of labeled speech
–  But we care about lower-resource scenarios

Page 111: Building a State-of-the-Art ASR System with Kaldi

…and then the Good News

•  We are able to get similar improvements using a different model, which is inspired by ideas from the CTC papers
–  Use simple "1-state" HMMs for each CD phone
–  Reduce the frame rate from 100 Hz to 33 Hz
–  Permit slack in the frame-to-state alignment

Page 112: Building a State-of-the-Art ASR System with Kaldi

Chain Models and LF-MMI Training

•  A new class of acoustic models for hybrid STT
–  "1-state" HMM for each context-dependent phone
–  LSTMs/TDNNs compute the state posterior probabilities
•  MFCCs are down-sampled from 100 Hz to 33 Hz
–  Inspired by CTC
•  A new lattice-free MMI training method
–  Improved parallelization, sequence training on GPUs
•  Larger mini-batches, smaller I/O bandwidth
–  Does not require CE training before MMI training
–  Uses "flexible label alignment" inspired by CTC

Page 113: Building a State-of-the-Art ASR System with Kaldi

Discriminative (MMI) Training:
a hand-waving, mostly correct introduction

$$\hat{\theta}_{\mathrm{ML}} = \arg\max_{\theta} \sum_{t=1}^{T} \log P(O_t \mid W_t; \theta)
\qquad \longleftrightarrow \quad \text{minimizes } \mathrm{KL}\big(\hat{P} \,\|\, P_\theta\big)$$

$$\hat{\theta}_{\mathrm{MMI}} = \arg\max_{\theta} \sum_{t=1}^{T} \log \frac{P(O_t \mid W_t; \theta)}{\sum_{W'} P(O_t \mid W'; \theta)\, P(W')}
\qquad \longleftrightarrow \quad \text{maximizes } I(W \wedge O; \theta)$$

Page 114: Building a State-of-the-Art ASR System with Kaldi

Lattice-Free MMI Training

•  Denominator (phone) graph creation
–  Use a phone 4-gram language model, L
–  Compose H, C and L to obtain the denominator graph
•  This FSA is the same for all utterances; suits GPU training
•  Use (heuristic) sentence-specific initial probabilities

•  Numerator graph creation
–  Generate a phone graph using the transcripts
•  This FSA encodes a frame-by-frame alignment of HMM states
–  Permit some alignment "slack" for each frame/label
–  Intersect the slackened FSA with the denominator FSA

Page 115: Building a State-of-the-Art ASR System with Kaldi

Lattice-free MMI Training (cont'd)

•  LSTM-RNNs trained with this MMI training procedure are highly susceptible to over-fitting
•  Essential to regularize the NN training process
–  A second output layer for CE training
–  Output L2 regularization
–  Use a leaky HMM

  Regularization                          Hub-5 '00 Word Error Rate
  Cross-Entropy   L2 Norm   Leaky HMM     Total    SWBD
  N               N         N             16.8%    11.1%
  Y               N         N             15.9%    10.5%
  N               Y         N             15.9%    10.4%
  N               N         Y             16.4%    10.9%
  Y               Y         N             15.7%    10.3%
  Y               N         Y             15.7%    10.3%
  N               Y         Y             15.8%    10.4%
  Y               Y         Y             15.6%    10.4%

Page 116: Building a State-of-the-Art ASR System with Kaldi

STT Results for Chain Models
300 hours of SWBD training speech; Hub-5 '00 evaluation set

  Training Objective   Model (Size)     Total WER   SWBD WER
  Cross-Entropy        TDNN A (16.6M)   18.2%       12.5%
  CE + sMBR            TDNN A (16.6M)   16.9%       11.4%
  Lattice-free MMI     TDNN A (9.8M)    16.1%       10.7%
  Lattice-free MMI     TDNN B (9.9M)    15.6%       10.4%
  Lattice-free MMI     TDNN C (11.2M)   15.5%       10.2%
  LF-MMI + sMBR        TDNN C (11.2M)   15.1%       10.0%

•  LF-MMI reduces WER by ca. 10%-15% relative
•  LF-MMI is better than standard CE + sMBR training (ca. 8%)
•  LF-MMI improves very slightly with additional sMBR training

Page 117: Building a State-of-the-Art ASR System with Kaldi

Chain Models and LF-MMI Training:
STT Performance on a Variety of Corpora

  Corpus and Audio Type   Training Speech   CE + sMBR Error Rate   LF-MMI Error Rate
  AMI IHM                 80 hours          23.8%                  22.4%
  AMI SDM                 80 hours          48.9%                  46.1%
  TED-LIUM                118 hours         11.3%                  12.8%
  Switchboard             300 hours         16.9%                  15.5%
  Fisher + SWBD           2100 hours        15.0%                  13.3%

•  Chain models with LF-MMI reduce WER by 6%-11% (relative)
•  LF-MMI improves a bit further with additional sMBR training
•  LF-MMI is 5x-10x faster to train, 3x faster to decode

Page 118: Building a State-of-the-Art ASR System with Kaldi

A Recap of Chain Models

•  A new class of acoustic models for hybrid STT
–  "1-state" HMM for context-dependent phones
–  LSTM-RNN acoustic models (TDNN also compatible)

•  A new lattice-free MMI training method
–  Better suited to using GPUs for parallelization
–  Does not require CE training before MMI training

•  Improved speed and STT performance
–  6%-8% relative WER reduction over previous best
–  5-10x improvement in training time; 3x in decoding time

Page 119: Building a State-of-the-Art ASR System with Kaldi

Summary of Advanced Methods: Staying Ahead in the STT Game

•  STT technology is advancing very rapidly
–  Amazon, Apple, Baidu, Facebook, Google, Microsoft

•  Kaldi leads and keeps up with major innovations
–  From SGMMs to DNNs (2012)
–  From "English" to low-resource languages (2013)
–  From CPUs to GPUs (2014)
–  From close-talking to far-field microphones (2015)
–  From well-curated to "wild type" corpora (2016)
–  Chain models for better STT, faster decoding (2017)

•  and the list goes on...

Page 120: Building a State-of-the-Art ASR System with Kaldi

Team Kaldi @ Johns Hopkins
•  Sanjeev Khudanpur
•  Daniel Povey
•  Jan Trmal
•  Guoguo Chen
•  Pegah Ghahremani
•  Vimal Manohar
•  Vijayaditya Peddinti
•  Hainan Xu
•  Xiaohui Zhang
•  … and several others

Page 121: Building a State-of-the-Art ASR System with Kaldi

Kaldi Points-of-Contact

•  Kaldi mailing list
–  [email protected]
•  Daniel Povey
–  [email protected]
•  Jan "Yenda" Trmal
–  [email protected]
•  Sanjeev Khudanpur
–  [email protected]
–  410-516-7024