CONF Kumar_EMNLP2024_2024/IDIAP TokenVerse: Towards Unifying Speech and NLP Tasks via Transducer-based ASR Kumar, Shashi Madikeri, Srikanth Zuluaga-Gomez, Juan Thorbecke, Iuliia Villatoro-Tello, Esaú Burdisso, Sergio Motlicek, Petr S, Karthik Pandia D Ganapathiraju, Aravind multitask training named entity recognition Speaker change detection speech recognition XLSR-Transducer EXTERNAL http://publications.idiap.ch/attachments/papers/2024/Kumar_EMNLP2024_2024.pdf PUBLIC http://publications.idiap.ch/index.php/publications/showcite/Kumar_Idiap-RR-07-2024 Related documents Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing 2024 Association for Computational Linguistics (ACL) Miami, Florida, USA 20988–20995 https://aclanthology.org/2024.emnlp-main.1167 URL https://doi.org/10.18653/v1/2024.emnlp-main.1167 doi In traditional conversational intelligence from speech, a cascaded pipeline is used, involving tasks such as voice activity detection, diarization, transcription, and subsequent processing with different NLP models for tasks like semantic endpointing and named entity recognition (NER). Our paper introduces TokenVerse, a single Transducer-based model designed to handle multiple tasks. This is achieved by integrating task-specific tokens into the reference text during ASR model training, streamlining the inference and eliminating the need for separate NLP models. In addition to ASR, we conduct experiments on 3 different tasks: speaker change detection, endpointing, and NER. Our experiments on a public and a private dataset show that the proposed method improves ASR by up to 7.7% in relative WER while outperforming the cascaded pipeline approach in individual task performance. 
Our code is publicly available: https://github.com/idiap/tokenverse-unifying-speech-nlp
REPORT Kumar_Idiap-RR-07-2024/IDIAP TokenVerse: Unifying Speech and NLP Tasks via Transducer-based ASR Kumar, Shashi Madikeri, Srikanth Zuluaga-Gomez, Juan Nigmatulina, Iuliia Villatoro-Tello, Esaú Burdisso, Sergio Motlicek, Petr S, Karthik Pandia D Ganapathiraju, Aravind multitask training named entity recognition Speaker change detection speech recognition XLSR-Transducer EXTERNAL http://publications.idiap.ch/attachments/reports/2024/Kumar_Idiap-RR-07-2024.pdf PUBLIC Idiap-RR-07-2024 2024 Idiap August 2024 In traditional conversational intelligence from speech, a cascaded pipeline is used, involving tasks such as voice activity detection, diarization, transcription, and subsequent processing with different NLP models for tasks like semantic endpointing and named entity recognition (NER). Our paper introduces TokenVerse, a single Transducer-based model designed to handle multiple tasks. This is achieved by integrating task-specific tokens into the reference text during ASR model training, streamlining the inference and eliminating the need for separate NLP models. In addition to ASR, we conduct experiments on 3 different tasks: speaker change detection, endpointing, and NER. Our experiments on a public and a private dataset show that the proposed method improves ASR by up to 7.7% in relative WER while outperforming the cascaded pipeline approach in individual task performance. Additionally, we present task transfer learning to a new task within an existing TokenVerse. https://arxiv.org/abs/2407.04444 URL