Controllability and Interpretability in Affective Speech Synthesis
Type of publication: | Thesis |
Citation: | Schnell_THESIS_2022 |
Year: | 2022 |
Month: | February |
School: | École polytechnique fédérale de Lausanne |
URL: | https://infoscience.epfl.ch/re... |
DOI: | 10.5075/epfl-thesis-8794 |
Abstract: | Thanks to Deep Learning, Text-To-Speech (TTS) has achieved high audio quality when trained on large databases. At the same time, however, the complex models have lost any ability to control or interpret the generation process. For the major challenge of affective TTS, it is infeasible to record databases covering all varieties of speech. We believe that affective TTS can only be enabled by models which generalise better to the variability in speech thanks to components which are interpretable by humans. In this thesis we aim to do so by incorporating prior knowledge about speech and its physiological production into the TTS framework, introducing well-established signal processing techniques into Neural Networks. Starting from emphasised speech, we investigate intonation production with a physiologically plausible intonation model previously developed at Idiap. In order to generalise the model to longer prosodic sequences, we emulate a Spiking Neural Network (SNN) with a Recurrent Neural Network with trainable second-order recurrent elements, trained with a learning function inspired by SNNs. The model synthesises neutral intonation with high naturalness and retains the physiological plausibility and controllability of the intonation model. After intonation, we examine spectral features, specifically formant frequencies, which have been shown to be indicators of certain emotions. Based on the speaker adaptation technique Vocal Tract Length Normalisation (VTLN), we propose a back-propagatable, time-varying All-Pass Warp (APW). Experiments with the APW in few- and zero-shot speaker adaptation show its effectiveness in low-data regimes. In emotional TTS it is not able to increase expressiveness or audio quality, but our analysis shows that the warping correlates with the level of valence in the emotion. We conjecture that localisation of emotion within an utterance is necessary for the warping (and for an affective TTS model in general). We therefore propose to extract frame-level emotion intensity with an emotion recogniser in an unsupervised manner. The emotion intensity input is a scalable and interpretable control and is able to increase the amount of correctly perceived emotion by humans. We also propose to increase the quantity of emotional data with a language-agnostic emotional voice conversion model. The model achieves high-quality emotion conversion in German by exploiting large amounts of emotional data in English. To train it we develop a novel contribution to gradient reversal. We demonstrate that Deep Learning TTS models can benefit from well-established signal processing techniques and interpretable low-dimensional controls to improve their generalisability in low-data regimes and/or allow simple controllability without losing quality. The developed models also allow interpretability for future physiological and linguistic analysis. While this thesis provides a toolbox of independent controls, their combination presents itself as a next step towards a comprehensive framework for affective TTS. |
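As background for the All-Pass Warp mentioned in the abstract, the following is a minimal sketch of the classical first-order all-pass (bilinear) warping used in VTLN, on which the APW is based. The thesis's actual contribution, making the warping parameter time-varying and back-propagatable, is not reproduced here.

```latex
% Classical first-order all-pass (bilinear) transform used in VTLN;
% the thesis's APW extends the warping parameter \alpha to a trainable,
% time-varying quantity.
\[
  \tilde{z}^{-1} \;=\; \frac{z^{-1} - \alpha}{1 - \alpha\, z^{-1}},
  \qquad |\alpha| < 1,
\]
% which warps the frequency axis according to
\[
  \tilde{\omega} \;=\; \omega \;+\;
  2 \arctan\!\left( \frac{\alpha \sin\omega}{1 - \alpha \cos\omega} \right).
\]
```

For $\alpha > 0$ the low-frequency region is stretched and the high-frequency region compressed (and vice versa for $\alpha < 0$), which is how a single scalar can approximate a change in vocal tract length.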
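The abstract also mentions a novel contribution to gradient reversal for training the language-agnostic emotional voice conversion model. Below is a minimal PyTorch sketch of the standard gradient reversal layer of Ganin & Lempitsky (2015), the baseline technique such adversarial training builds on; the thesis's specific modification is not shown, and the usage names are illustrative assumptions, not taken from the thesis.

```python
# Standard gradient reversal layer (GRL): identity in the forward pass,
# negated (and scaled) gradient in the backward pass.
import torch


class GradReverse(torch.autograd.Function):
    """Forward: identity. Backward: multiply the gradient by -lambd,
    so an adversarial classifier (e.g. a language discriminator) pushes
    the upstream encoder towards invariant features."""

    @staticmethod
    def forward(ctx, x, lambd: float):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse and scale the gradient flowing back into the encoder;
        # lambd itself receives no gradient.
        return -ctx.lambd * grad_output, None


def grad_reverse(x, lambd: float = 1.0):
    return GradReverse.apply(x, lambd)


# Usage sketch (hypothetical module names):
#   features = encoder(audio)                                  # shared features
#   lang_logits = language_classifier(grad_reverse(features))  # adversary
# The classifier learns to predict the language, while the reversed
# gradient trains the encoder to strip language information out.
```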
Projects: | MASS, NAST |
Authors: | Schnell, Bastian |