Abstract
In this paper, we investigate how the performance of a deep learning-based speech synthesis (DLSS) system varies with the configuration of its output acoustic parameters. Our method is mainly applicable to vocoding-based statistical parametric speech synthesis (SPSS), which has advantages in low-resource scenarios. Given the independence assumption of the source-filter model for the spectral and fundamental frequency (F0) parameters, we propose a reliable network architecture for training acoustic parameters. In particular, the F0 parameter suffers from high fluctuation and an extremely low number of dimensions. To alleviate these problems, we introduce a context-window approach. Furthermore, we apply data augmentation to the proposed structure to overcome a lack of training data, which is a frequent issue in multi-speaker text-to-speech (TTS) systems. Experimental results confirm the superiority of the proposed algorithm over conventional ones in both single-speaker and multi-speaker TTS setups.
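The abstract does not specify how the context window is built, so the following is only a minimal sketch of one common reading: each F0 frame is stacked with its neighbouring frames, which turns a one-dimensional target into a wider vector. The helper name `stack_context`, the window sizes, and the edge-padding choice are all assumptions made for illustration, not details taken from the paper.

```python
import numpy as np


def stack_context(f0, left=2, right=2):
    """Stack each F0 frame with its neighbours (hypothetical helper).

    Returns an array of shape (num_frames, left + right + 1) in which
    row t holds frames t-left .. t+right of the input track.
    """
    f0 = np.asarray(f0, dtype=np.float64).reshape(-1)
    # Assumption: repeat the edge frames so every frame gets a full window.
    padded = np.pad(f0, (left, right), mode="edge")
    width = left + right + 1
    # Column i is the track shifted by (i - left) frames.
    return np.stack([padded[i:i + len(f0)] for i in range(width)], axis=1)


# Toy F0 track in Hz (made-up values, not data from the paper).
f0_track = np.array([100.0, 102.0, 101.0, 105.0, 110.0])
windows = stack_context(f0_track, left=1, right=1)
```

With `left=1` and `right=1`, each frame becomes a three-dimensional vector whose centre column is the original F0 value. Whether the paper applies such a window to the network output, the input, or both is not stated in the abstract.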
Original language | English |
---|---|
Title of host publication | 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2019 |
Publisher | Institute of Electrical and Electronics Engineers Inc. |
Pages | 618-622 |
Number of pages | 5 |
ISBN (Electronic) | 9781728132488 |
Publication status | Published - 2019 Nov |
Event | 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2019 - Lanzhou, China; Duration: 2019 Nov 18 → 2019 Nov 21 |
Conference
Conference | 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2019 |
---|---|
Country/Territory | China |
City | Lanzhou |
Period | 2019 Nov 18 → 2019 Nov 21 |
Bibliographical note
Publisher Copyright: © 2019 IEEE.
All Science Journal Classification (ASJC) codes
- Information Systems