![Page 1: An End-to-End Speech Synthesis Acceleration Solution Based on Tacotron2 and WaveGlow ...R. Prenger, R. Valle and B. Catanzaro, "Waveglow: A Flow-based Generative Network for Speech Synthesis," ICASSP 2019](https://reader034.vdocuments.site/reader034/viewer/2022050116/5f4d5d59a719f03fe841b7d5/html5/thumbnails/1.jpg)
Huang Zan (黄瓒), Deep Learning Solutions Architect @ NVIDIA
An End-to-End Speech Synthesis Acceleration Solution Based on Tacotron2 and WaveGlow
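AGENDA
Background
Overview of end-to-end speech synthesis based on Tacotron2 and WaveGlow
Vocoder
Introducing WaveGlow, a vocoder based on deep neural networks
Acceleration
Using TensorRT, together with Tacotron2, to accelerate model inference on NVIDIA GPUs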
Background
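Speech Synthesis (Text-to-Speech)
Speech recognition
• Smart home
• Meeting transcription
• Content retrieval
• Command recognition
• Real-time translation
• ...
Speech synthesis
• In-car navigation
• Telephone customer service
• Virtual idols
• Audiobooks
• Bedtime stories
• ...
A technology-driven, more natural and efficient way for humans and machines to interact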
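End-to-end?
Text: 苏州是个美丽的城市! (Suzhou is a beautiful city!)
A single model handles the complex processing pipeline, lowering the barrier to entry for speech synthesis => data + compute ≈ ?
Deep neural networks deliver better synthesis results => higher audio quality, reaching more use cases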
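Split in two
Feature prediction (Tacotron2)
• Characters/phonemes -> mel spectrogram
Vocoder (WaveGlow)
• Mel spectrogram -> waveform
Pipeline: character/phoneme sequence -> intermediate representation (mel spectrogram) -> waveform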
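The two-stage split can be made concrete by looking at the sizes that pass between the stages. The sketch below is illustrative only: the sample rate, hop length, and number of mel channels are assumed values that are common in open-source Tacotron2/WaveGlow setups, not figures stated on the slides, and the function names are hypothetical.

```python
# Assumed, illustrative settings (not stated on the slides).
SAMPLE_RATE = 22050   # audio samples per second
HOP_LENGTH = 256      # audio samples covered by one mel frame
N_MELS = 80           # mel channels per frame


def mel_shape_for(num_frames):
    """Shape of the intermediate representation: (mel channels, frames)."""
    return (N_MELS, num_frames)


def waveform_length_for(num_frames):
    """Number of audio samples the vocoder must emit for that many mel frames."""
    return num_frames * HOP_LENGTH


def duration_seconds(num_samples):
    """Playback duration of a waveform with the given number of samples."""
    return num_samples / SAMPLE_RATE


frames = 400  # hypothetical output length of the feature predictor
samples = waveform_length_for(frames)
print(mel_shape_for(frames), samples, round(duration_seconds(samples), 2))
```

Under these assumptions the vocoder has to produce hundreds of audio samples per mel frame, which is why it tends to dominate the cost of the pipeline.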
Vocoder
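Better audio quality + faster speed?
High sample rates: 16 kHz = OK, 22 kHz = GOOD, 24 kHz = BETTER
Strong temporal dependencies: some autoregressive neural network methods need several hours to generate a dozen or so seconds of speech
In the algorithm design, reduce autoregressive structure and increase parallelism -> let convolutional layers do more of the work
Make full use of the hardware: platform-specific optimization to lower latency and raise throughput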
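A rough way to see why sample rate and autoregression interact badly: a sample-by-sample autoregressive vocoder needs one sequential step per audio sample. The sketch below only does that arithmetic. The function names and the per-step latency are hypothetical, and the sample rates are the nominal values from the slide.

```python
def sequential_steps(duration_s, sample_rate_hz):
    """One dependent step per audio sample for a sample-level autoregressive model."""
    return int(duration_s * sample_rate_hz)


def generation_time_s(steps, seconds_per_step):
    """Total time when steps cannot overlap because each depends on the previous one."""
    return steps * seconds_per_step


for rate in (16000, 22000, 24000):
    steps = sequential_steps(10, rate)
    # 1 ms per step is a made-up latency used only to show the scaling.
    print(rate, steps, generation_time_s(steps, 0.001))
```

Higher sample rates raise the step count linearly, so the gain comes from removing the step-to-step dependency rather than from making each step slightly faster.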
WAVEGLOW
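Generative models?
• Generative adversarial networks (GAN)
• Variational autoencoders (VAE)
• Flow-based methods
Vocoders?
• Traditional signal processing methods
• Built on neural networks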
https://deepmind.com/blog/wavenet-generative-model-raw-audio/
https://openai.com/blog/glow/
Flow-based generative models
https://lilianweng.github.io/lil-log/2018/10/13/flow-based-deep-generative-models.html
Maximum likelihood
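For reference, the standard maximum-likelihood objective of a flow-based model can be written as follows. The notation is an assumption chosen for this note (the slide's own formula is not recoverable): $f$ is the invertible network mapping a latent $z$ to data $x$, $p_z$ is a simple prior such as a standard Gaussian, and $\theta$ denotes the network parameters.

```latex
z = f^{-1}(x), \qquad z \sim \mathcal{N}(0, I)

\log p_\theta(x) = \log p_z\big(f^{-1}(x)\big)
                 + \log \left| \det \frac{\partial f^{-1}(x)}{\partial x} \right|

\theta^{*} = \arg\max_{\theta} \sum_{i} \log p_\theta\big(x^{(i)}\big)
```

The second term is the log-determinant of the Jacobian of the inverse transform, which is why flow layers are designed so that this determinant is cheap to compute.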
Jacobian matrix
https://en.wikipedia.org/wiki/Jacobian_matrix_and_determinant
Computing the Jacobian determinant of the inverse transform
Flow-based Generative Model, by Hung-yi Lee (李宏毅)
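Affine coupling layers, used in flow models such as Glow and WaveGlow, are one common way to keep this determinant cheap. The following is a minimal pure-Python sketch of a generic coupling layer, not WaveGlow's actual layer: `scale_and_shift` is a hypothetical toy stand-in for the neural network that predicts the scale and shift, and inputs are assumed to have even length.

```python
import math


def scale_and_shift(x_a):
    """Toy stand-in for the network that predicts (log_s, t) from the untouched half."""
    log_s = [0.5 * math.tanh(v) for v in x_a]
    t = [0.1 * v for v in x_a]
    return log_s, t


def coupling_forward(x):
    """Transform x and return (z, log|det J|) for this mapping."""
    half = len(x) // 2
    x_a, x_b = x[:half], x[half:]
    log_s, t = scale_and_shift(x_a)
    z_b = [math.exp(ls) * xb + tt for ls, xb, tt in zip(log_s, x_b, t)]
    # The Jacobian is triangular: identity for x_a, diag(exp(log_s)) for x_b,
    # so its log-determinant is simply the sum of log_s.
    return x_a + z_b, sum(log_s)


def coupling_inverse(z):
    """Exactly undo coupling_forward; the first half is unchanged, so log_s and t are recomputable."""
    half = len(z) // 2
    z_a, z_b = z[:half], z[half:]
    log_s, t = scale_and_shift(z_a)
    x_b = [(zb - tt) * math.exp(-ls) for ls, zb, tt in zip(log_s, z_b, t)]
    return z_a + x_b


x = [0.3, -1.2, 0.7, 2.0]
z, log_det = coupling_forward(x)
print(z, log_det, coupling_inverse(z))
```

Because half of the input passes through unchanged, both directions cost a single evaluation of the inner network, and no explicit matrix determinant is ever formed.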
Training
Mixed-precision training; fine-tuning a pretrained model
Audio data
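One reason mixed-precision training needs care is that small gradients underflow in FP16, which is commonly handled with loss scaling. The sketch below illustrates the effect using only Python's `struct` module to round values to half precision. The gradient value and the scale factor are hypothetical, and the helper names are not from any training framework.

```python
import struct


def to_fp16(x):
    """Round a Python float to IEEE half precision and convert it back."""
    return struct.unpack("e", struct.pack("e", x))[0]


def scaled_gradient(grad, loss_scale):
    """Hold grad * loss_scale in FP16, then unscale in higher precision."""
    return to_fp16(grad * loss_scale) / loss_scale


tiny_grad = 1e-8  # hypothetical gradient magnitude, below the FP16 range
print(to_fp16(tiny_grad))                  # lost without scaling
print(scaled_gradient(tiny_grad, 1024.0))  # approximately recovered with scaling
```

In practice the scaling is applied to the loss before backpropagation and the gradients are unscaled before the weight update, which is what mixed-precision training utilities automate.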
Inference
https://docs.nvidia.com/deeplearning/sdk/tensorrt-api/python_api/
• Extract weights
• Build the network
• Generate the plan
• FP32 -> FP16
• Online inference
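The steps above can be sketched with the TensorRT Python API roughly as follows. This is a hedged illustration, not the presenter's code: it assumes a TensorRT 8.x-style API, builds a single matrix-multiply layer in place of a real model, and the function names `build_fp16_plan` and `load_engine` are hypothetical. The import is guarded so the file still loads on a machine without TensorRT.

```python
try:
    import tensorrt as trt  # optional; the functions below need a TensorRT install
except ImportError:
    trt = None


def build_fp16_plan(weight, in_features):
    """Offline: rebuild a one-layer network from an extracted weight and serialize a plan.

    `weight` is assumed to be a contiguous float32 NumPy array of shape
    (in_features, out_features) taken from the trained model.
    """
    if trt is None:
        raise RuntimeError("TensorRT is not installed")
    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    flags = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    network = builder.create_network(flags)

    # Build the network layer by layer, feeding in the extracted weights.
    x = network.add_input(name="x", dtype=trt.float32, shape=(1, in_features))
    w = network.add_constant(weight.shape, weight)
    y = network.add_matrix_multiply(
        x, trt.MatrixOperation.NONE, w.get_output(0), trt.MatrixOperation.NONE
    )
    network.mark_output(y.get_output(0))

    # FP32 -> FP16: allow the builder to pick half-precision kernels.
    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.FP16)
    return builder.build_serialized_network(network, config)


def load_engine(plan):
    """Online: deserialize the plan once, then reuse the engine for every request."""
    if trt is None:
        raise RuntimeError("TensorRT is not installed")
    runtime = trt.Runtime(trt.Logger(trt.Logger.WARNING))
    return runtime.deserialize_cuda_engine(plan)
```

The split matters for deployment: plan generation is slow and done once offline, while the online service only pays the cost of deserializing the plan and running it.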
Acceleration
TACOTRON2
In the decoder, the GPU kernels are too fine-grained and become the performance bottleneck
Accelerating TACOTRON2
Layers supported by TensorRT map directly to their corresponding implementations
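Accelerating TACOTRON2
The remaining layers of the model are implemented and plugged in as plugins
Layer fusion and targeted optimizations are done at the C++/CUDA code level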
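A back-of-the-envelope model shows why fusing layers helps when an autoregressive decoder launches many small GPU kernels per step. Every number below is hypothetical and chosen only for illustration, and the model deliberately ignores any overlap between launches and execution.

```python
def decoder_time_ms(steps, kernels_per_step, launch_overhead_us, compute_us_per_kernel):
    """Serial cost model: each kernel pays a fixed launch overhead plus its compute time."""
    per_step_us = kernels_per_step * (launch_overhead_us + compute_us_per_kernel)
    return steps * per_step_us / 1000.0


# Same total compute per step (100 us), split into many small or a few fused kernels.
unfused = decoder_time_ms(steps=500, kernels_per_step=50,
                          launch_overhead_us=5, compute_us_per_kernel=2)
fused = decoder_time_ms(steps=500, kernels_per_step=5,
                        launch_overhead_us=5, compute_us_per_kernel=20)
print(unfused, fused)  # 175.0 62.5
```

In this toy setting the arithmetic work is identical, and the entire difference comes from paying the launch overhead fewer times per decoder step.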
Credits to Nvidia DevTech Team for optimizing Tacotron2 on GPU
Speedup achieved so far
Tacotron2+WaveGlow on V100
• Original implementation: below 10x real time
• Accelerated: above 50x real time
Accelerate for Deployment
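To read these figures, it helps to spell out what "Nx real time" means. The sketch below assumes the usual interpretation, seconds of audio produced per second of wall-clock time. The function names and the 10-second example are hypothetical.

```python
def realtime_speedup(audio_seconds, wall_clock_seconds):
    """How many seconds of audio are produced per second of computation."""
    return audio_seconds / wall_clock_seconds


def synthesis_time(audio_seconds, speedup):
    """Wall-clock time needed to synthesize the audio at a given speedup."""
    return audio_seconds / speedup


# A hypothetical 10-second utterance at the two thresholds quoted on the slide.
for speedup in (10, 50):
    print(speedup, synthesis_time(10.0, speedup))
```

Under this reading, moving from 10x to 50x real time cuts the wait for a 10-second utterance from about one second to a fraction of a second.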
References
J. Shen et al., "Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions," 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, 2018, pp. 4779-4783.
R. Prenger, R. Valle and B. Catanzaro, "WaveGlow: A Flow-based Generative Network for Speech Synthesis," ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, United Kingdom, 2019, pp. 3617-3621.
A. van den Oord, S. Dieleman, H. Zen, et al., "WaveNet: A Generative Model for Raw Audio," arXiv preprint arXiv:1609.03499, 2016.
D. P. Kingma and P. Dhariwal, "Glow: Generative Flow with Invertible 1x1 Convolutions," Advances in Neural Information Processing Systems, 2018, pp. 10215-10224.
https://lilianweng.github.io/lil-log/2018/10/13/flow-based-deep-generative-models.html
https://docs.nvidia.com/deeplearning/sdk/tensorrt-api/python_api/index.html
https://www.youtube.com/watch?v=uXY18nzdSsM
Thank you!