Very Deep Convolutional Neural Networks for Noise Robust Speech Recognition

Yanmin Qian, et al. “Very Deep Convolutional Neural Networks for Noise Robust Speech Recognition.” IEEE Transactions on Audio, Speech, and Language Processing. Accepted for publication for a future issue.

Presented by Peidong Wang, 09/09/2016



Content

• Abstract
• Review of Convolutional Neural Networks
• Model Description
• Experiments
• Conclusion


Abstract

• ASR: Previous attempts to increase the number of CNN layers from 2 to 3 gave a degradation.
• CV: Recent work on images shows that the accuracy of image classification can be improved by increasing the number of convolutional layers with a carefully tuned architecture.
• ASR: Very Deep Convolutional Neural Networks use up to 10 convolutional layers and get a WER of 8.81% on Aurora4, which is the best published result.


Review of Convolutional Neural Networks

• A Conventional Convolutional Neural Network (CNN)

From: Slides in CSE 5526 Neural Networks

• Convolution and Pooling (Subsampling)


Model Description

• Context Window Extension
  • A typical size of input features in speech recognition is 11x40, where 11 denotes the number of frames in a window and 40 denotes the dimension of the FBank features. [*]
  • Using this context window size, convolutions can be performed in time 5 times with a filter size of 3, as in the following figure (vd6).

[*] added by the presenter


• Context Window Extension (cont’d)
  • In Very Deep Convolutional Neural Networks (VDCNNs), the context window size is extended to 17 (and further to 21), which allows 8 (and 10) convolutions to be performed in time, respectively.
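
The frame counts quoted for the context window follow from simple size arithmetic. As a minimal sketch, assuming unpadded stride-1 convolutions with a filter size of 3 in time (the helper name `max_time_convs` is hypothetical and not from the paper), each convolution shortens the window by 2 frames:

```python
def max_time_convs(num_frames: int, filter_size: int = 3) -> int:
    """Count how many unpadded, stride-1 convolutions fit along the time axis.

    Each convolution shrinks the axis by (filter_size - 1); counting stops
    once the axis is shorter than the filter.
    """
    count = 0
    size = num_frames
    while size >= filter_size:
        size -= filter_size - 1
        count += 1
    return count


# Context windows mentioned in the slides: 11, 17 and 21 frames.
for frames in (11, 17, 21):
    print(frames, max_time_convs(frames))
```

Under these assumptions the three window sizes admit 5, 8 and 10 convolutions in time, matching the numbers on the slides.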


• Feature Dimension Extension
  • Based on 40-dim FBank features, at most 6 convolutions and 2 poolings can be performed in frequency, leading to the vd6 model.
  • In VDCNN, the FBank features are extended to 64-dim, so that 4 more convolutions can be performed in frequency.
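
The slides do not give the exact layer ordering, so the following is only a sketch under stated assumptions: unpadded size-3 convolutions, non-overlapping size-2 pooling, and a hypothetical ordering with two convolutions before each pooling. It traces how a 40-dim frequency axis shrinks; `trace_sizes` and `HYPOTHETICAL_VD6_FREQ_OPS` are illustrative names, not the paper's.

```python
def trace_sizes(size, ops, filter_size=3, pool_size=2):
    """Return the axis length after each operation in ops ("conv" or "pool")."""
    sizes = [size]
    for op in ops:
        if op == "conv":
            size = size - (filter_size - 1)  # unpadded, stride-1 convolution
        elif op == "pool":
            size = size // pool_size         # non-overlapping pooling
        else:
            raise ValueError(f"unknown op: {op}")
        if size < 1:
            raise ValueError("axis exhausted; too many operations")
        sizes.append(size)
    return sizes


# Assumed ordering (not taken from the paper): 6 convolutions, 2 poolings.
HYPOTHETICAL_VD6_FREQ_OPS = [
    "conv", "conv", "pool",
    "conv", "conv", "pool",
    "conv", "conv",
]

print(trace_sizes(40, HYPOTHETICAL_VD6_FREQ_OPS))
```

Under this assumed ordering only 3 frequency bins remain after the sixth convolution, which illustrates why a wider 64-dim input leaves room for more layers.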


• Feature Dimension Extension (cont’d)
  • Finally, the input extension is performed in both time and frequency, leading to a 17x64 input. The resulting model is named vd10.


• Feature Dimension Extension (cont’d)
  • The full-ext model further extends the number of time frames to 21 so that 2 more convolution operations can be performed in time, giving 10 convolution operations in both time and frequency.


• Feature Dimension Extension (cont’d)
  • To confirm that the performance gain is not from the extended input features, a model with the same wider input features (17x64) but shallow convolutional layers is developed.


• Pooling in Time
  • You may have noticed that the VDCNN models all use pooling in frequency and do no pooling in time.
  • To investigate whether pooling in time is helpful, vd10-tpool is designed.


• Padding in Feature Maps
  • In most work on CNNs for speech recognition, the convolutions are performed without padding.
  • Padding can preserve the size of feature maps and better utilize the border information.
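
To make the effect of padding concrete, here is a small sketch of the standard output-size formula for a stride-1 convolution, out = in + 2 * pad - filter + 1. The function name and the example of four stacked layers on a 17-frame axis are illustrative assumptions.

```python
def conv_output_size(in_size: int, filter_size: int = 3, pad: int = 0) -> int:
    """Output length of a stride-1 convolution with `pad` zeros on each side."""
    return in_size + 2 * pad - filter_size + 1


size_no_pad = 17
size_padded = 17
for _ in range(4):  # four stacked size-3 convolutions (illustrative depth)
    size_no_pad = conv_output_size(size_no_pad, pad=0)
    size_padded = conv_output_size(size_padded, pad=1)

print(size_no_pad, size_padded)
```

With a filter size of 3 and one padded element per side, the map size is unchanged from layer to layer, whereas the unpadded axis loses 2 elements per layer.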


• Padding in Feature Maps (cont’d)
  • Model vd10-fpad pads only in frequency, allowing more pooling operations in frequency.


• Padding in Feature Maps (cont’d)
  • Padding in both dimensions is also applied; this model is denoted vd10-fpad-tpad.
  • In this model, since pooling is needed to reduce the feature map size, pooling in time is also applied.


• Complete Figure


• 1 Channel vs. 3 Channels Based Input Feature Maps
  • VDCNNs use a one-channel feature map as input, i.e. the static FBank feature.
  • Most work in speech recognition, however, uses three-channel features (static, ∆, and ∆∆).
  • The two choices for the number of input channels are compared for VDCNN.
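
For readers unfamiliar with the three-channel layout, the sketch below stacks static, ∆, and ∆∆ features into channels. It assumes the common regression formula for deltas with a window of N = 2 and edge replication; the slides do not state which delta computation was used, and all function names are hypothetical.

```python
def delta(frames, n=2):
    """Regression-based delta of a list of feature vectors (one per frame)."""
    denom = 2 * sum(k * k for k in range(1, n + 1))
    last = len(frames) - 1
    out = []
    for t in range(len(frames)):
        vec = []
        for d in range(len(frames[t])):
            acc = 0.0
            for k in range(1, n + 1):
                nxt = frames[min(t + k, last)][d]  # replicate last frame at the end
                prv = frames[max(t - k, 0)][d]     # replicate first frame at the start
                acc += k * (nxt - prv)
            vec.append(acc / denom)
        out.append(vec)
    return out


def three_channel_input(static):
    """Return [static, delta, delta-delta], each of shape frames x dims."""
    d1 = delta(static)
    d2 = delta(d1)
    return [static, d1, d2]


# Toy example: 5 frames of 2-dim features that grow linearly in time.
static = [[float(t), 2.0 * t] for t in range(5)]
channels = three_channel_input(static)
print(len(channels), len(channels[0]), len(channels[0][0]))
```

A one-channel input, as used by the VDCNNs in the slides, would correspond to passing only `static`.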


• 1 Channel vs. 3 Channels Based Input Feature Maps (cont’d)
  • It is interesting to find that 1-channel-based VDCNNs are better than the models using 3 channels.
  • One possible explanation is that the information in the dynamic features may be better extracted from the raw static features directly by the VDCNN.


• 1 Channel vs. 3 Channels Based Input Feature Maps (cont’d)
  • Another explanation may be as follows.


• Model Parameter Size
  • It is observed that although the number of convolutional layers is increased significantly in the proposed VDCNN, the total parameter size is smaller than that of the baseline CNN and DNN.
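
One way to see how a deeper network can still be smaller is to count weights per layer. The sketch below compares a 3x3 convolutional layer with a fully connected layer; the channel counts and layer widths are illustrative assumptions, not the paper's configuration.

```python
def conv_params(in_channels, out_channels, kernel_h=3, kernel_w=3):
    """Weights plus biases of one 2-D convolutional layer."""
    return in_channels * out_channels * kernel_h * kernel_w + out_channels


def dense_params(in_units, out_units):
    """Weights plus biases of one fully connected layer."""
    return in_units * out_units + out_units


conv = conv_params(128, 128)      # hypothetical 128-channel 3x3 layer
dense = dense_params(2048, 2048)  # hypothetical 2048-unit hidden layer
print(conv, dense, dense // conv)
```

Under these assumed sizes, one fully connected layer holds roughly 28 times as many parameters as one small-filter convolutional layer, so stacking many such convolutional layers need not inflate the total.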


• Convergence of Very Deep CNNs
  • The VDCNN converges faster than other model types, in terms of the number of epochs [*].
  • Accordingly, although VDCNNs need more computations in each iteration (9.5 times more computations compared to the baseline CNN), the VDCNNs take comparable time for model training.

[*] added by the presenter


• Noise Robustness of Very Deep CNNs
  • To better understand how VDCNN processes noisy speech, each condition (A, B, C or D) of this frame is propagated through the best performing model vd10-fpad-tpad.
  • The outputs of the 1st convolutional layer and the 6th convolutional layer for A, B, C and D are plotted in the next figures.


• Noise Robustness of Very Deep CNNs (cont’d)
  • To further verify the observation, the differences between noisy feature maps and clean feature maps are measured for all convolutional layers.
  • Using data in the test set, we compute the averaged mean square error (MSE) to evaluate the differences between the three noisy conditions and the clean condition.
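
The averaged MSE between noisy and clean feature maps can be sketched as follows. This is a minimal illustration assuming the feature maps are given as nested lists of equal shape, one noisy/clean pair per test frame; the function names and toy values are hypothetical.

```python
def mse(noisy_map, clean_map):
    """Mean square error between two 2-D feature maps of identical shape."""
    total = 0.0
    count = 0
    for noisy_row, clean_row in zip(noisy_map, clean_map):
        for a, b in zip(noisy_row, clean_row):
            total += (a - b) ** 2
            count += 1
    return total / count


def averaged_mse(noisy_maps, clean_maps):
    """Average the per-pair MSE over a set of (noisy, clean) feature-map pairs."""
    errors = [mse(n, c) for n, c in zip(noisy_maps, clean_maps)]
    return sum(errors) / len(errors)


# Two toy 2x2 feature-map pairs standing in for one layer's outputs.
clean = [[[0.0, 1.0], [2.0, 3.0]], [[1.0, 1.0], [1.0, 1.0]]]
noisy = [[[0.0, 2.0], [2.0, 5.0]], [[1.0, 1.0], [1.0, 3.0]]]
print(averaged_mse(noisy, clean))
```

A lower value at a given layer means the noisy representation is closer to the clean one at that layer.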


• Noise Robustness of Very Deep CNNs (cont’d)
  • The MSE values after all operations are shown below.


• Noise Robustness of Very Deep CNNs (cont’d)
  • The MSE values for different CNN models.


Experiments

• Experimental Setup
  • The GMM-HMM system is built with Kaldi.
  • All neural network models, including DNN/CNN/LSTM, are trained using CNTK.
  • The standard testing pipeline in Kaldi recipes is used for decoding and scoring.
  • A similar structure (IBM-VGG) designed by researchers at IBM and NYU is also constructed for comparison.


• Evaluation on Aurora4
  • Aurora4 is a medium vocabulary task based on the Wall Street Journal (WSJ0).
  • Training sets contain 14276 utterances.
  • Four conditions, A, B, C and D, as mentioned before.


• Evaluation on AMI
  • The AMI corpus contains around 100 hours of meeting recordings.
  • The signal was captured and synchronized with multiple microphones such as individual head microphones (IHM, close-talk) and microphone arrays (single distant microphone (SDM) and multiple distant microphones (MDM)).
  • MDM was processed by a standard beamforming algorithm to generate a single channel dataset.


• Evaluation on AMI (cont’d)
  • The size of input features is investigated.


• Evaluation on AMI (cont’d)
  • The effect of other designs is also investigated.


• Evaluation on AMI (cont’d)
  • To better explain the superiority of VDCNNs, we use some related feature maps.


• Evaluation on AMI (cont’d)
  • The same single synchronized frame is propagated.


Conclusion

• Features of VDCNN
  • The sizes of filters and pooling templates are small.
  • The input feature maps are large.
  • Other designs such as pooling in time, padding, and input feature map selection are adjusted.
  • On Aurora4, it achieves a WER of 8.81% (state of the art).
  • On AMI, its accuracy is competitive with an LSTM.


Thank You!
