Diffusion Transformer for Adaptive Text-to-Speech

Type of publication:	Conference paper
Citation:	Chen_SSW12_2023
Publication status:	Published
Booktitle:	Proc. 12th ISCA Speech Synthesis Workshop (SSW 12)
Year:	2023
Month:	August
DOI:	10.21437/SSW.2023-25
Abstract:	Given the success of diffusion in synthesizing realistic speech, we investigate how diffusion can be included in adaptive text-to-speech systems. Inspired by the adaptable layer norm modules for Transformer, we adapt a new backbone of diffusion models, Diffusion Transformer, for acoustic modeling. Specifically, the adaptive layer norm in the architecture is used to condition the diffusion network on text representations, which further enables parameter-efficient adaptation. We show the new architecture to be a faster alternative to its convolutional counterpart for general text-to-speech, while demonstrating a clear advantage on naturalness and similarity over the Transformer for few-shot and few-parameter adaptation. In the zero-shot scenario, while the new backbone is a decent alternative, the main benefit of such an architecture is to enable high-quality parameter-efficient adaptation when finetuning is performed.
Keywords:	adaptive layer norm, adaptive TTS, diffusion transformer, speech synthesis
Projects:	Idiap NAST
Authors:	Chen, Haolin; Garner, Philip N.
Attachments
Chen_SSW12_2023.pdf