CONF
Kumar_EMNLP2024_2024/IDIAP
TokenVerse: Towards Unifying Speech and NLP Tasks via Transducer-based ASR
Kumar, Shashi
Madikeri, Srikanth
Zuluaga-Gomez, Juan
Thorbecke, Iuliia
Villatoro-Tello, Esaú
Burdisso, Sergio
Motlicek, Petr
S, Karthik Pandia D
Ganapathiraju, Aravind
multitask training
named entity recognition
speaker change detection
speech recognition
XLSR-Transducer
EXTERNAL
https://publications.idiap.ch/attachments/papers/2024/Kumar_EMNLP2024_2024.pdf
PUBLIC
https://publications.idiap.ch/index.php/publications/showcite/Kumar_Idiap-RR-07-2024
Related documents
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
2024
Association for Computational Linguistics (ACL)
Miami, Florida, USA
20988–20995
https://aclanthology.org/2024.emnlp-main.1167
URL
https://doi.org/10.18653/v1/2024.emnlp-main.1167
doi
In traditional conversational intelligence from speech, a cascaded pipeline is used, involving tasks such as voice activity detection, diarization, transcription, and subsequent processing with different NLP models for tasks like semantic endpointing and named entity recognition (NER). Our paper introduces TokenVerse, a single Transducer-based model designed to handle multiple tasks. This is achieved by integrating task-specific tokens into the reference text during ASR model training, streamlining the inference and eliminating the need for separate NLP models. In addition to ASR, we conduct experiments on 3 different tasks: speaker change detection, endpointing, and NER. Our experiments on a public and a private dataset show that the proposed method improves ASR by up to 7.7% in relative WER while outperforming the cascaded pipeline approach in individual task performance. Our code is publicly available: https://github.com/idiap/tokenverse-unifying-speech-nlp
REPORT
Kumar_Idiap-RR-07-2024/IDIAP
TokenVerse: Unifying Speech and NLP Tasks via Transducer-based ASR
Kumar, Shashi
Madikeri, Srikanth
Zuluaga-Gomez, Juan
Nigmatulina, Iuliia
Villatoro-Tello, Esaú
Burdisso, Sergio
Motlicek, Petr
S, Karthik Pandia D
Ganapathiraju, Aravind
multitask training
named entity recognition
speaker change detection
speech recognition
XLSR-Transducer
EXTERNAL
https://publications.idiap.ch/attachments/reports/2024/Kumar_Idiap-RR-07-2024.pdf
PUBLIC
Idiap-RR-07-2024
2024
Idiap
August 2024
In traditional conversational intelligence from speech, a cascaded pipeline is used, involving tasks such as voice activity detection, diarization, transcription, and subsequent processing with different NLP models for tasks like semantic endpointing and named entity recognition (NER). Our paper introduces TokenVerse, a single Transducer-based model designed to handle multiple tasks. This is achieved by integrating task-specific tokens into the reference text during ASR model training, streamlining the inference and eliminating the need for separate NLP models. In addition to ASR, we conduct experiments on 3 different tasks: speaker change detection, endpointing, and NER. Our experiments on a public and a private dataset show that the proposed method improves ASR by up to 7.7% in relative WER while outperforming the cascaded pipeline approach in individual task performance. Additionally, we present task transfer learning to a new task within an existing TokenVerse.
https://arxiv.org/abs/2407.04444
URL