3.5 Model Architecture
3.5.1 Text-to-Vec
The content encoder of the TTV consists of 16 layers of non-causal WaveNet with a hidden size of 256 and a kernel size of five. The content decoder consists of eight layers of non-causal WaveNet with a hidden size of 512 and a kernel size of five. The text encoder is composed of three unconditional Transformer networks and three prosody-conditional Transformer networks with a kernel size of nine, a hidden size of 256, and a filter size of 1024; we use a dropout rate of 0.2 for the text encoder. T-Flow consists of four residual coupling layers, each composed of a preConv, three Transformer blocks, and a postConv. Within the Transformer blocks, we adopt convolutional neural networks with a kernel size of five to encode adjacent information and AdaLN-Zero for better prosody style adaptation. For T-Flow, we use a hidden size of 256, a filter size of 1024, four attention heads, and a dropout rate of 0.1. For the pitch predictor, we use the source generator with the same structure as that of the HAG.
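The hyperparameters above can be collected into a single configuration object. The following is a minimal sketch, not the authors' code: the class and field names (`WaveNetConfig`, `TextEncoderConfig`, `TFlowConfig`, `TTVConfig`) are hypothetical, and it assumes that each of the four residual coupling layers contains its own three Transformer blocks.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class WaveNetConfig:
    """Hypothetical container for a non-causal WaveNet stack."""
    layers: int
    hidden_size: int
    kernel_size: int
    causal: bool = False  # both TTV WaveNets are described as non-causal


@dataclass(frozen=True)
class TextEncoderConfig:
    """Values as stated in the text; field names are illustrative."""
    unconditional_blocks: int = 3
    prosody_conditional_blocks: int = 3
    kernel_size: int = 9
    hidden_size: int = 256
    filter_size: int = 1024
    dropout: float = 0.2


@dataclass(frozen=True)
class TFlowConfig:
    """Values as stated in the text; field names are illustrative."""
    coupling_layers: int = 4
    transformer_blocks_per_layer: int = 3  # assumption: three blocks per coupling layer
    conv_kernel_size: int = 5
    hidden_size: int = 256
    filter_size: int = 1024
    attention_heads: int = 4
    dropout: float = 0.1

    def total_transformer_blocks(self) -> int:
        # Only valid under the per-layer assumption noted above.
        return self.coupling_layers * self.transformer_blocks_per_layer


@dataclass(frozen=True)
class TTVConfig:
    content_encoder: WaveNetConfig = field(
        default_factory=lambda: WaveNetConfig(layers=16, hidden_size=256, kernel_size=5)
    )
    content_decoder: WaveNetConfig = field(
        default_factory=lambda: WaveNetConfig(layers=8, hidden_size=512, kernel_size=5)
    )
    text_encoder: TextEncoderConfig = field(default_factory=TextEncoderConfig)
    t_flow: TFlowConfig = field(default_factory=TFlowConfig)


ttv_config = TTVConfig()
```

Keeping the dataclasses frozen makes the configuration immutable, so a component cannot silently change a shared hyperparameter at run time.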
3.5.2 SpeechSR
SpeechSR consists of a single AMP block with an initial channel size of 32 and no upsampling layer. Instead, we use a nearest-neighbor (NN) upsampler to upsample the hidden representations. For the discriminators, we use the MPD with periods of [2, 3, 5, 7, 11] and the MS-STFTD with six window sizes ([4096, 2048, 1024, 512, 256, 128]). Additionally, we use the DWTD, which has four sub-band discriminators.
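To illustrate the upsampling step, the sketch below reads "NN upsampler" as nearest-neighbor upsampling, which is an assumption here, and shows the idea on a plain Python list: each hidden frame is simply repeated. The function name `nn_upsample`, the constant names, and the factor used in the example are hypothetical; the discriminator settings are copied from the text.

```python
# Discriminator settings as listed in the text (constant names are illustrative).
MPD_PERIODS = [2, 3, 5, 7, 11]
MS_STFTD_WINDOW_SIZES = [4096, 2048, 1024, 512, 256, 128]
DWTD_SUB_BANDS = 4


def nn_upsample(frames, factor):
    """Nearest-neighbor upsampling along time: repeat every frame `factor` times.

    `frames` is any sequence of per-frame values (scalars or vectors).
    The factor is a free parameter here; the text does not state it.
    """
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    return [frame for frame in frames for _ in range(factor)]


# Hypothetical example with a factor of 2.
upsampled = nn_upsample([0.1, 0.5, 0.9], factor=2)
```

Nearest-neighbor upsampling has no learnable parameters. A real implementation would apply the same repetition to a tensor of hidden representations rather than a list.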
This paper is available on arXiv under the CC BY-NC-SA 4.0 DEED license.
Authors:
(1) Sang-Hoon Lee, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;
(2) Ha-Yeong Choi, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;
(3) Seung-Bin Kim, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;
(4) Seong-Whan Lee, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea (corresponding author).
