Learning Latent Representations for Speech Generation and Transformation
Wei-Ning Hsu, Yu Zhang, James Glass
MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, USA
Interspeech 2017

What to Expect in This Talk
1. A convolutional variational autoencoder framework to model a generative process of speech
2. A method to associate learned latent representations with physical attributes, such as speaker identity and linguistic content
3. Simple latent space arithmetic operations to modify speech attributes
[Diagram: 𝒙 -> Encoder 𝑞(𝒛|𝒙) -> (𝝁𝒛, 𝝈𝒛) -> 𝒛 -> Decoder 𝑝(𝒙|𝒛) -> (𝝁𝒙, 𝝈𝒙)]

Outline
1. Motivation
2. Background and Models
3. Latent Attribute Representations and Operations
4. Experiments
5. Conclusion

Motivation
• We want to learn a generative process of speech
  1. What are the factors that affect speech generation?
  2. How do these factors play a role in speech generation?
  3. How can we infer these factors from observed speech?
• Why do we want to learn a generative process?
  • Synthesis (1, 2)
  • Recognition and verification (3)
  • Voice conversion and denoising (1, 2, 3)

Generative Model Backgrounds
• "Shallow" generative models
  • Hidden Markov model-Gaussian mixture models (HMM-GMMs)
• "Deep" generative models
  • Generative adversarial networks (GANs)
    • model 𝑝(𝒙|𝒛) and bypass the inference model (generator/discriminator)
  • Auto-regressive models (e.g., WaveNets)
    • model 𝑝(𝒙_t | 𝒙_{1:t-1}) and abstain from using latent variables
  • Variational autoencoders (VAEs)
    • learn an inference model and a generative model jointly

Variational Autoencoders (VAEs)
• Define a probabilistic generative process between observation 𝒙 and latent variable 𝒛
  • 𝑝(𝒛), 𝑝(𝒙|𝒛), and 𝑞(𝒛|𝒙) are defined to be in some parametric family
• We define 𝑝(𝒙|𝒛) (decoder) and 𝑞(𝒛|𝒙) (encoder) to be diagonal Gaussians
  • Parameters (mean and variance) are described using some NN
• 𝑝(𝒛) is defined to be isotropic Gaussian with unit variance
[Diagram: 𝒙 -> Encoder 𝑞(𝒛|𝒙) -> (𝝁𝒛, 𝝈𝒛) -> 𝒛 -> Decoder 𝑝(𝒙|𝒛) -> (𝝁𝒙, 𝝈𝒙)]
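
The diagonal-Gaussian choices on this slide make both terms of the variational lower bound simple to write down. The following is a minimal sketch in plain Python, assuming the encoder and decoder networks have already produced their means and log-variances. The function names and the log-variance parameterization are illustrative assumptions, not taken from the talk.

```python
import math
import random


def sample_z(mu_z, logvar_z, rng):
    """Reparameterized draw z = mu + sigma * eps, with eps ~ N(0, I)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu_z, logvar_z)]


def kl_to_unit_gaussian(mu_z, logvar_z):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu_z, logvar_z))


def gaussian_log_likelihood(x, mu_x, logvar_x):
    """log N(x; mu_x, diag(sigma_x^2)), summed over observed dimensions."""
    return -0.5 * sum(math.log(2.0 * math.pi) + lv + (xi - m) ** 2 / math.exp(lv)
                      for xi, m, lv in zip(x, mu_x, logvar_x))


def lower_bound(x, mu_x, logvar_x, mu_z, logvar_z):
    """Single-sample lower bound: log p(x|z) - KL(q(z|x) || p(z)).

    mu_x and logvar_x are assumed to be the decoder outputs for a z
    drawn with sample_z; the networks themselves are not modeled here.
    """
    return (gaussian_log_likelihood(x, mu_x, logvar_x)
            - kl_to_unit_gaussian(mu_z, logvar_z))
```

In training, the negative of `lower_bound` averaged over segments would serve as the loss; the KL term is zero exactly when the encoder output matches the unit-variance prior.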

Convolutional Neural Network Architecture
• Encoder 𝑞(𝒛|𝒙): 𝒙 -> Conv1 -> Conv2 -> Conv3 -> FC1 -> Gauss (𝝁𝒛, 𝝈𝒛) -> Sample 𝒛
• Decoder 𝑝(𝒙|𝒛): 𝒛 -> FC2 -> FC3 (+reshape) -> T-conv1 -> T-conv2 -> T-conv3 (Gauss) (𝝁𝒙, 𝝈𝒙)
* T-conv stands for transposed convolution

Experiment Setup
• Dataset: TIMIT (5.4 hr) (standard 462 speaker sx/si training set)
• Unsupervised training (i.e., no use of phonetic transcription)
• Speech Segment Dimension:
  • T = 20 frames (with shift of 8 frames)
  • F = 80 (FBank) or 200 (Log Magnitude Spectrogram)
• Training Objective: Variational Lower Bound
• Optimizer: Adam
[Figure: 20-frame segments taken with a shift of 8 frames]
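
The segment extraction described on this slide (T = 20 frames, shift of 8 frames) can be sketched as a sliding window. This is only an illustration: `segment_frames` is a hypothetical helper, and dropping the incomplete tail is an assumption, since the talk does not say how trailing frames are handled.

```python
def segment_frames(frames, seg_len=20, shift=8):
    """Cut a list of per-frame feature vectors into overlapping fixed-length segments.

    Trailing frames that do not fill a whole segment are dropped
    (an assumption; the talk does not specify this).
    """
    segments = []
    for start in range(0, len(frames) - seg_len + 1, shift):
        segments.append(frames[start:start + seg_len])
    return segments


# Toy utterance: 100 frames of 80-dimensional FBank-like features (all zeros).
utterance = [[0.0] * 80 for _ in range(100)]
segments = segment_frames(utterance)
print(len(segments), len(segments[0]), len(segments[0][0]))  # 11 20 80
```

Each segment is a T x F patch, which is the input size the convolutional encoder would see.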

Outline
1. Motivations
2. Background and Models
3. Latent Attribute Representations and Operations
4. Experiments
5. Audio Demo
6. Conclusion

Speech Reconstruction Illustration
• The trained VAE is able to reconstruct speech segments
• Examples from 10 instances of /aa/, /sh/, and /p/ (sampled at center of segment)
[Figure: panels for /aa/, /sh/, and /p/]

Latent Attribute Representations
• VAE is encouraged to model independent factors using different dimensions
  • Because the prior is assumed to be a diagonal Gaussian
• We want to associate particular dimensions with different physical attributes
[Figure: Encoder -> latent vector [0.3, -0.7, -0.2, 1.5, 0.4] -> Decoder; the first two dimensions are labeled Speaker Identity and the last three Phone]
• Latent Speaker Representation: [0.3, -0.7, 0, 0, 0]
• Latent Phone Representation: [0, 0, -0.2, 1.5, 0.4]

Latent Attribute Representations
• Factors have normal distributions along their associated dimensions
• For example, if we want to estimate the latent phone representation for /aa/:
  • We can estimate the latent attribute by taking the mean of the latent representations
[Figure: latent vectors of /aa/ segments from three speakers and their average]
  Speaker A, /aa/: [0.3, -0.7, -0.2, 1.5, 0.4]
  Speaker B, /aa/: [-0.4, 1.1, -0.2, 1.5, 0.4]
  Speaker C, /aa/: [0.1, 0.4, -0.2, 1.5, 0.4]
  Average (Latent Phone Representation for /aa/): [0, 0, -0.2, 1.5, 0.4]
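
The averaging step on this slide can be sketched in a few lines. The vectors are the toy values from the slide; `mean_latent` is a hypothetical helper name, and the inputs are assumed to be encoder means of segments that share the attribute of interest.

```python
def mean_latent(latents):
    """Element-wise mean of a list of equal-length latent vectors."""
    n = len(latents)
    return [sum(dim) / n for dim in zip(*latents)]


# Toy latent vectors for three /aa/ segments
# (2 speaker dimensions followed by 3 phone dimensions).
aa_latents = [
    [0.3, -0.7, -0.2, 1.5, 0.4],   # speaker A
    [-0.4, 1.1, -0.2, 1.5, 0.4],   # speaker B
    [0.1, 0.4, -0.2, 1.5, 0.4],    # speaker C
]
phone_aa = mean_latent(aa_latents)
```

With only three speakers the speaker dimensions do not average exactly to zero (the second comes out near 0.27). The zeros in the slide's averaged vector presumably depict the idealized case of many speakers, whose zero-mean speaker factors cancel out.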

Empirical Study of the Assumptions
• We compute latent attribute representations of two attributes:
  • Latent Speaker Attribute: [0.3, -0.7, 0, 0, 0], [-0.4, 1.1, 0, 0, 0], [0.1, 0.4, 0, 0, 0]
  • Latent Phone Attribute: [0, 0, -0.2, 1.5, 0.4], [0, 0, 0.8, -0.3, 0.2], [0, 0, -0.9, -0.2, -0.8]
• Compute the absolute cosine similarity between latent attribute representations
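
The similarity measure named on this slide can be sketched as follows, using the toy vectors shown there. `abs_cosine_similarity` is a hypothetical helper name; the only assumption is that both vectors are non-zero.

```python
import math


def abs_cosine_similarity(u, v):
    """|u . v| / (||u|| * ||v||) for two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return abs(dot) / (norm_u * norm_v)


speaker_a = [0.3, -0.7, 0.0, 0.0, 0.0]
speaker_b = [-0.4, 1.1, 0.0, 0.0, 0.0]
phone_aa = [0.0, 0.0, -0.2, 1.5, 0.4]

print(abs_cosine_similarity(speaker_a, phone_aa))   # 0.0: the vectors use disjoint dimensions
print(abs_cosine_similarity(speaker_a, speaker_b))  # close to 1: nearly collinear
```

A value near zero between a speaker representation and a phone representation is what one would expect if the two attributes occupy separate latent dimensions.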
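
Arithmetic Operations to Modify Attributes
• The result suggests that we can modify a specific attribute without altering the others
• Suppose we want to convert the voice from speaker A (light blue) to speaker B (dark blue)
• We can do the following operations:
[Figure: Encoder -> [0.3, -0.7, -0.2, 1.5, 0.4] (Speaker Identity: first two dimensions; Phone: last three) -> latent arithmetic -> Decoder]
  [0.3, -0.7, -0.2, 1.5, 0.4] - [0.3, -0.7, 0, 0, 0] = [0, 0, -0.2, 1.5, 0.4]
  [0, 0, -0.2, 1.5, 0.4] + [-0.4, 1.1, 0, 0, 0] = [-0.4, 1.1, -0.2, 1.5, 0.4]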
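
The subtract-then-add operation on this slide can be sketched as a single element-wise expression. The vectors are the slide's toy values; `convert_speaker` is a hypothetical name, and the encoder and decoder calls around it are left out.

```python
def convert_speaker(z, source_speaker, target_speaker):
    """Return z - source_speaker + target_speaker, element-wise."""
    return [zi - s + t for zi, s, t in zip(z, source_speaker, target_speaker)]


z_segment = [0.3, -0.7, -0.2, 1.5, 0.4]   # encoder output for a segment from speaker A
speaker_a = [0.3, -0.7, 0.0, 0.0, 0.0]    # latent speaker representation of A
speaker_b = [-0.4, 1.1, 0.0, 0.0, 0.0]    # latent speaker representation of B

z_converted = convert_speaker(z_segment, speaker_a, speaker_b)
# z_converted would then be passed to the decoder to generate the modified segment.
```

The phone dimensions pass through unchanged because both speaker representations are zero there, which is the property the cosine-similarity study is meant to support.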

Magnitude Spectrogram Reconstruction
• Griffin and Lim algorithm is used for waveform reconstruction
  • Iteratively estimate phase

Modify the Phoneme
• Modify /aa/ to /ae/, F2 goes up (back vowel -> front vowel)
[Figure: three pairs of panels labeled /aa/ and /ae/]

Modify the Phoneme
• Modify /s/ to /sh/, cutoff goes down (alveolar -> palatal strident)
[Figure: three pairs of panels labeled /s/ and /sh/]

Modify the Speaker
• Modify a female to a male, pitch decreases

Modify the Speaker
• Modify a male to a female, pitch increases

Modify the Speaker for an Entire Utterance
• We choose an utterance from a male speaker (madc0)
  • Modify to another male speaker (mabc0), and a female speaker (fajw0)
• Each speaker has only 8 utterances in the set
  • ~4 s/utterance
• Estimate the latent speaker representation using only 30 s of speech

Modify the Speaker for an Entire Utterance
• Original Speaker: (top) original spectrogram, (bottom) reconstructed spectrogram
• Convert to Speaker mabc0: (top) original spectrogram, (bottom) modified spectrogram
• Convert to Speaker fajw0: (top) original spectrogram, (bottom) modified spectrogram

Quantitative Evaluation
• We train discriminators for phone classification and speaker classification
• Posteriors as the quantitative metric
  • Discriminators' mean opinion score on the two attributes
  • Posterior of target attribute increases; posterior of source attribute decreases
  • Posteriors of irrelevant attributes unchanged
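
One way to read the posterior-based metric on this slide is as an average of classifier posteriors before and after modification. The sketch below assumes a trained speaker discriminator that returns one probability vector per segment. The helper name and all numbers are made up for illustration and are not results from the talk.

```python
def mean_posterior(posteriors, class_index):
    """Average posterior probability of one class over a list of segments."""
    return sum(p[class_index] for p in posteriors) / len(posteriors)


# Hypothetical speaker-discriminator outputs for three segments, given as
# [P(source speaker), P(target speaker), P(other)]. Made-up values.
before = [[0.90, 0.05, 0.05], [0.80, 0.10, 0.10], [0.85, 0.05, 0.10]]
after = [[0.10, 0.85, 0.05], [0.20, 0.70, 0.10], [0.15, 0.75, 0.10]]

source_shift = mean_posterior(after, 0) - mean_posterior(before, 0)
target_shift = mean_posterior(after, 1) - mean_posterior(before, 1)
```

A successful speaker modification would show a negative `source_shift` and a positive `target_shift`, while the same comparison on the phone discriminator should stay near zero.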

Conclusion and Future Work
• We present a CNN-VAE to model the generation process of speech segments
• The framework leverages vast quantities of unannotated data to learn a general speech analyzer and a general speech synthesizer.
• We demonstrate qualitatively and quantitatively the ability to modify speech attributes.
• We have applied the modification operation to data augmentation for ASR and achieved significant improvement for domain adaptation. (submitted to ASRU)
• For future work, we plan to investigate the use of VAE on voice conversion and speech de-noising under the setting of no parallel training data.

Thanks for Listening. Q&A?
Paper, slides, samples, and follow-up works can be found on http://people.csail.mit.edu/wnhsu/