Idiap Research Institute
Diffusion transformer for adaptive text-to-speech
Type of publication: Conference paper
Citation: Chen_SSW12_2023
Publication status: Accepted
Booktitle: Proc. 12th ISCA Speech Synthesis Workshop (SSW 12)
Year: 2023
Month: August
DOI: 10.21437/SSW.2023-25
Abstract: Given the success of diffusion in synthesizing realistic speech, we investigate how diffusion can be included in adaptive text-to-speech systems. Inspired by the adaptable layer norm modules for the Transformer, we adapt a new backbone of diffusion models, the Diffusion Transformer, for acoustic modeling. Specifically, the adaptive layer norm in the architecture is used to condition the diffusion network on text representations, which further enables parameter-efficient adaptation. We show the new architecture to be a faster alternative to its convolutional counterpart for general text-to-speech, while demonstrating a clear advantage in naturalness and similarity over the Transformer for few-shot and few-parameter adaptation. In the zero-shot scenario, the new backbone is a decent alternative; its main benefit is that it enables high-quality parameter-efficient adaptation when fine-tuning is performed.
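To make the conditioning mechanism in the abstract concrete, here is a minimal sketch of an adaptive layer norm (adaLN) in the style of the Diffusion Transformer: the input is normalised without a learned affine, and a per-channel scale and shift are regressed from a conditioning vector by a linear map. This is an illustration under stated assumptions, not the authors' implementation: the conditioning vector `c` stands in for a text representation (possibly combined with a diffusion-step embedding), and the names `AdaLN`, `layer_norm`, `linear` and `num_params` are hypothetical. The small parameter count of the modulation map hints at why such modules are attractive for parameter-efficient adaptation; updating only these weights during fine-tuning is an assumed reading, not a detail the abstract specifies.

```python
import math
import random


def layer_norm(x, eps=1e-5):
    """Normalise a feature vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(weight, bias, c):
    """Compute weight @ c + bias, with the weight matrix stored as a list of rows."""
    return [sum(w * ci for w, ci in zip(row, c)) + b for row, b in zip(weight, bias)]


class AdaLN:
    """Hypothetical adaptive layer norm: scale and shift are predicted from c.

    Assumption: c is a conditioning vector such as a text representation;
    the real model's conditioning inputs and sizes may differ.
    """

    def __init__(self, dim, cond_dim, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        # One linear map yields 2*dim outputs: dim scales followed by dim shifts.
        self.weight = [[rng.gauss(0.0, 0.02) for _ in range(cond_dim)]
                       for _ in range(2 * dim)]
        self.bias = [0.0] * (2 * dim)

    def __call__(self, x, c):
        mod = linear(self.weight, self.bias, c)
        scale, shift = mod[:self.dim], mod[self.dim:]
        h = layer_norm(x)
        # Using (1 + scale) keeps the module close to plain layer norm when the
        # predicted modulation is near zero.
        return [hi * (1.0 + s) + b for hi, s, b in zip(h, scale, shift)]

    def num_params(self):
        """Count the parameters of the modulation map (weights plus biases)."""
        return len(self.weight) * len(self.weight[0]) + len(self.bias)


if __name__ == "__main__":
    block = AdaLN(dim=4, cond_dim=3)
    out = block([1.0, 2.0, 3.0, 4.0], [0.5, -0.5, 1.0])
    print(len(out), block.num_params())  # prints "4 32"
```

The design choice worth noting is that the text conditioning enters only through the scale and shift, so a model could in principle be adapted to a new speaker by updating just these small modulation maps while freezing the attention and feed-forward weights.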
Projects: Idiap
Authors: Chen, Haolin; Garner, Philip N.