XLSR-Transducer: Streaming ASR for Self-Supervised Pretrained Models
| Type of publication: | Conference paper |
| Citation: | Kumar_ICASSP2025_2025 |
| Publication status: | Published |
| Booktitle: | Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) |
| Year: | 2025 |
| Month: | April |
| Publisher: | IEEE |
| Location: | Hyderabad, India |
| ISSN: | 2379-190X |
| ISBN: | 979-8-3503-6874-1 |
| Crossref: | Kumar_Idiap-RR-08-2024 |
| URL: | https://ieeexplore.ieee.org/do... |
| DOI: | https://doi.org/10.1109/ICASSP49660.2025.10888110 |
| Abstract: | Self-supervised pretrained models exhibit competitive performance in automatic speech recognition (ASR) upon fine-tuning, even with limited in-domain supervised data. However, popular pretrained models are not suitable for streaming ASR because they are trained with full attention context. In this paper, we introduce the XLSR-Transducer, in which the XLSR-53 model is used as the encoder in a transducer setup. Our experiments on the AMI dataset reveal that the XLSR-Transducer achieves a 4% absolute WER improvement over Whisper large-v2 and 8% over a Zipformer transducer trained from scratch. To enable streaming, we investigate different attention masking patterns in the self-attention computation of the transformer layers within the XLSR-53 model. We validate the XLSR-Transducer on AMI and on 5 languages from CommonVoice under low-resource scenarios. Finally, with the introduction of attention sinks, we reduce the left context by half while achieving a relative 12% improvement in WER. |
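
The streaming recipe sketched in the abstract, chunk-wise attention masks plus a few always-visible "attention sink" frames, can be illustrated with a short example. The snippet below is a minimal, hypothetical sketch in PyTorch, not the authors' implementation; the function name and its parameters (`chunk_size`, `left_chunks`, `num_sink_frames`) are assumptions made for exposition only.

```python
# Minimal sketch (not the paper's code) of a chunked self-attention mask with
# attention sinks: each frame attends to its own chunk, a limited number of
# left-context chunks, and optionally the first few frames (the "sinks").
import torch

def streaming_attention_mask(num_frames: int,
                             chunk_size: int,
                             left_chunks: int,
                             num_sink_frames: int = 0) -> torch.Tensor:
    """Boolean mask of shape (num_frames, num_frames); True = may attend."""
    mask = torch.zeros(num_frames, num_frames, dtype=torch.bool)
    for t in range(num_frames):
        chunk_idx = t // chunk_size
        # Left boundary: start of the allowed left-context window.
        start = max(0, (chunk_idx - left_chunks) * chunk_size)
        # Right boundary: end of the current chunk (no future chunks).
        end = min(num_frames, (chunk_idx + 1) * chunk_size)
        mask[t, start:end] = True
        # Attention sinks: every frame may also attend to the first frames.
        mask[t, :num_sink_frames] = True
    return mask

# Example: 12 frames, chunks of 4, one left-context chunk, 2 sink frames.
print(streaming_attention_mask(12, 4, 1, 2).int())
```

Under this reading, setting `num_sink_frames > 0` keeps the first few frames visible to every position, which is what would allow the left-context window to be shortened without degrading WER, as the abstract reports.
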
| Main Research Program: | Human-AI Teaming |
| Additional Research Programs: | AI for Everyone |
| Keywords: | self-supervised learning, streaming ASR, transformer transducer, XLSR |
| Projects: | UNIPHORE, ELOQUENCE |
| Authors: | |