Multimodal Prosody Modeling: A Use Case for Multilingual Sentence Mode Prediction
Type of publication: | Conference paper |
Citation: | Vlasenko_INTERSPEECH_2025 |
Publication status: | Accepted |
Booktitle: | Proceedings of Interspeech |
Year: | 2025 |
Abstract: | Prosody modeling has garnered significant attention from the speech processing community. Multilingual latent spaces for representing linguistic and acoustic information have recently become a trend across various research directions. We therefore evaluate the ability of multilingual acoustic neural embeddings and knowledge-based features to preserve sentence-mode-related information at the suprasegmental level. For linguistic information modeling, we selected neural embeddings based on word- and phoneme-level latent space representations. The experimental study was conducted on Italian, French, and German audiobook recordings, as well as emotional speech samples from EMO-DB. Both intra- and inter-language experimental protocols were used to assess classification performance for uni- and multimodal (early-fusion) features. For comparison, we used a sentence mode prediction system built on top of automatically generated Whisper transcripts. |
Main Research Program: | AI for Everyone |
Additional Research Programs: | Human-AI Teaming, AI for Everyone |
Keywords: | emotional prosody, multilingual, multimodal, sentence mode prediction |
Projects: | Idiap IICT |
Authors: | |
Added by: | [UNK] |