Multitask Speech Recognition and Speaker Change Detection for Unknown Number of Speakers


Type of publication:	Conference paper
Citation:	Kumar_ICASSP2024_2024
Publication status:	Accepted
Booktitle:	Proceedings of the 49th IEEE International Conference on Acoustics, Speech, & Signal Processing (ICASSP) 2024
Year:	2024
Month:	April
Pages:	12592-12596
Publisher:	IEEE
Location:	Seoul, Republic of Korea
ISSN:	2379-190X
ISBN:	979-8-3503-4485-1
URL:	https://ieeexplore.ieee.org/do...
DOI:	https://doi.org/10.1109/ICASSP48485.2024.10446130
Abstract:	Traditionally, automatic speech recognition (ASR) and speaker change detection (SCD) systems have been trained independently to generate comprehensive transcripts accompanied by speaker turns. Recently, joint training of ASR and SCD systems, by inserting speaker turn tokens in the ASR training text, has been shown to be successful. In this work, we present a multitask alternative to the joint training approach. Results obtained on the mix-headset audio of the AMI corpus show that the proposed multitask training yields an absolute improvement of 1.8% in coverage- and purity-based F1 score on the SCD task without ASR degradation. We also examine the trade-offs between ASR and SCD performance when training with multitask criteria. Additionally, we validate the speaker change information in the embedding spaces obtained after different transformer layers of a self-supervised pre-trained model, such as XLSR-53, by integrating an SCD classifier at the output of specific transformer layers. Results reveal that using different embedding spaces from the XLSR-53 model for multitask ASR and SCD is advantageous.
Keywords:	F1 score, multitask learning, Speaker change detection, speaker turn detection, speech recognition
Projects:	UNIPHORE, ELOQUENCE
Authors:	Kumar, Shashi; Madikeri, Srikanth; Nigmatulina, Iuliia; Villatoro-Tello, Esaú; Motlicek, Petr; S, Karthik Pandia D; Dubagunta, S. Pavankumar; Ganapathiraju, Aravind
Notes
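
The abstract reports SCD results as a coverage- and purity-based F1 score. As a rough illustration of how such a score is commonly computed, the sketch below takes the harmonic mean of segmentation coverage and purity over frame-level segment indices. The function names, the frame-level input format, and the toy data are assumptions made for this example; the record does not describe the scoring tool or settings used in the paper.

```python
from collections import Counter


def _coverage(reference, hypothesis):
    """Share of frames in which each reference segment overlaps its
    best-matching hypothesis segment (frame-level approximation)."""
    pair_counts = Counter(zip(reference, hypothesis))
    best_overlap = {}
    for (ref_id, _hyp_id), n_frames in pair_counts.items():
        best_overlap[ref_id] = max(best_overlap.get(ref_id, 0), n_frames)
    return sum(best_overlap.values()) / len(reference)


def coverage_purity_f1(reference, hypothesis):
    """Return (coverage, purity, F1) for two equal-length sequences that
    give, for every frame, the index of the segment it belongs to."""
    if not reference or len(reference) != len(hypothesis):
        raise ValueError("inputs must be non-empty and of equal length")
    coverage = _coverage(reference, hypothesis)
    # Purity is coverage with the roles of the two segmentations swapped.
    purity = _coverage(hypothesis, reference)
    f1 = 2 * coverage * purity / (coverage + purity)
    return coverage, purity, f1


# Toy example (hypothetical data): one true speaker change after frame 4,
# while the hypothesis places changes after frames 3 and 6.
reference = [0, 0, 0, 0, 1, 1, 1, 1]
hypothesis = [0, 0, 0, 1, 1, 1, 2, 2]
coverage, purity, f1 = coverage_purity_f1(reference, hypothesis)
print(f"coverage={coverage:.3f} purity={purity:.3f} f1={f1:.3f}")
```

Over-segmentation tends to lower coverage while leaving purity high, and under-segmentation does the reverse, which is why the two are combined into a single F1 value.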
