CONF
Marelli_ICASSP2019_2019/IDIAP
An End-to-end Network to Synthesize Intonation Using a Generalized Command Response Model
Marelli, François
Schnell, Bastian
Bourlard, Hervé
Dutoit, T.
Garner, Philip N.
Digital IIR Filters
Fujisaki Model
neural networks
Prosody Modelling
speech synthesis
EXTERNAL
https://publications.idiap.ch/attachments/papers/2019/Marelli_ICASSP2019_2019.pdf
PUBLIC
https://publications.idiap.ch/index.php/publications/showcite/Marelli_Idiap-RR-05-2019
Related documents
ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Brighton, United Kingdom
2019
IEEE
7040-7044
https://ieeexplore.ieee.org/document/8683815
URL
10.1109/ICASSP.2019.8683815
doi
The generalized command response (GCR) model represents intonation as a superposition of muscle responses to spike command signals. We have previously shown that the spikes can be predicted by a two-stage system, consisting of a recurrent neural network and a post-processing procedure, but the responses themselves were fixed dictionary atoms. We propose an end-to-end neural architecture that replaces the dictionary atoms with trainable second-order recurrent elements analogous to recursive filters. We demonstrate gradient stability under modest conditions, and show that the system can be trained by imposing temporal sparsity constraints. Subjective listening tests demonstrate that the system can synthesize intonation with high naturalness, comparable to state-of-the-art acoustic models, and retains the physiological plausibility of the GCR model.
REPORT
Marelli_Idiap-RR-05-2019/IDIAP
An End-to-end Network to Synthesize Intonation Using a Generalized Command Response Model
Marelli, François
Schnell, Bastian
Bourlard, Hervé
Dutoit, T.
Garner, Philip N.
Digital IIR Filters
Fujisaki Model
neural networks
Prosody Modelling
speech synthesis
EXTERNAL
https://publications.idiap.ch/attachments/reports/2018/Marelli_Idiap-RR-05-2019.pdf
PUBLIC
Idiap-RR-05-2019
2019
Idiap
May 2019
The generalized command response (GCR) model represents intonation as a superposition of muscle responses to spike command signals. We have previously shown that the spikes can be predicted by a two-stage system, consisting of a recurrent neural network and a post-processing procedure, but the responses themselves were fixed dictionary atoms. We propose an end-to-end neural architecture that replaces the dictionary atoms with trainable second-order recurrent elements analogous to recursive filters. We demonstrate gradient stability under modest conditions, and show that the system can be trained by imposing temporal sparsity constraints. Subjective listening tests demonstrate that the system can synthesize intonation with high naturalness, comparable to state-of-the-art acoustic models, and retains the physiological plausibility of the GCR model.
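The abstract describes intonation as a superposition of responses to spike commands, with each response produced by a trainable second-order recurrent element analogous to a recursive filter. The sketch below illustrates that idea only; it is not the authors' architecture. It assumes a critically damped filter (a double real pole), and the function names, pole values, and spike positions are hypothetical choices made for illustration.

```python
import numpy as np


def second_order_response(spikes, pole):
    """Run a spike train through a second-order recursive (IIR) filter.

    Assumed form (critically damped, double real pole at `pole`):
        y[n] = x[n] + 2*pole*y[n-1] - pole**2 * y[n-2]
    The recursion is stable when |pole| < 1.
    """
    if not 0.0 <= pole < 1.0:
        raise ValueError("pole must lie in [0, 1) for a stable response")
    spikes = np.asarray(spikes, dtype=float)
    y = np.zeros(len(spikes))
    for n in range(len(spikes)):
        y1 = y[n - 1] if n >= 1 else 0.0  # previous output
        y2 = y[n - 2] if n >= 2 else 0.0  # output two steps back
        y[n] = spikes[n] + 2.0 * pole * y1 - pole ** 2 * y2
    return y


def synthesize_contour(commands, poles):
    """Superpose the responses of several filters, one per command signal."""
    responses = [second_order_response(x, p) for x, p in zip(commands, poles)]
    return np.sum(responses, axis=0)


# Hypothetical example: two sparse command signals over 100 frames.
n_frames = 100
slow = np.zeros(n_frames)
slow[5] = 1.0            # one phrase-like spike
fast = np.zeros(n_frames)
fast[[20, 60]] = [0.5, -0.3]  # two accent-like spikes, one negative

contour = synthesize_contour([slow, fast], poles=[0.95, 0.7])
```

In a trainable version, the pole of each element would be a learned parameter and the spike trains would come from a network trained with a temporal sparsity constraint, as the abstract states. The same recursion can be computed with `scipy.signal.lfilter([1.0], [1.0, -2 * pole, pole ** 2], spikes)`.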