XLSR-Transducer: Streaming ASR for Self-Supervised Pretrained Models
Type of publication: Conference paper
Citation: Kumar_ICASSP2025_2025
Publication status: Accepted
Booktitle: Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)
Year: 2025
Month: April
Publisher: IEEE
Location: Hyderabad, India
Crossref: Kumar_Idiap-RR-08-2024
Abstract: Self-supervised pretrained models achieve competitive performance in automatic speech recognition (ASR) when fine-tuned, even with limited in-domain supervised data. However, popular pretrained models are not suitable for streaming ASR because they are trained with full attention context. In this paper, we introduce XLSR-Transducer, in which the XLSR-53 model is used as the encoder in a transducer setup. Our experiments on the AMI dataset reveal that the XLSR-Transducer achieves a 4% absolute WER improvement over Whisper large-v2 and 8% over a Zipformer transducer model trained from scratch. To enable streaming, we investigate different attention masking patterns in the self-attention computation of the transformer layers within the XLSR-53 model. We validate XLSR-Transducer on AMI and 5 languages from CommonVoice under low-resource scenarios. Finally, with the introduction of attention sinks, we reduce the left context by half while achieving a relative 12% improvement in WER.
Keywords: self-supervised learning, streaming ASR, transformer transducer, XLSR
Projects: UNIPHORE, ELOQUENCE
Authors: Kumar, Shashi; Madikeri, Srikanth; Zuluaga-Gomez, Juan; Villatoro-Tello, Esaú; Thorbecke, Iuliia; Motlicek, Petr; E, Manjunath K; Ganapathiraju, Aravind
Attachments: Kumar_ICASSP2025_2025.pdf
Notes
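The abstract mentions attention masking patterns and attention sinks but does not spell them out. The following is a minimal illustrative sketch of one common streaming pattern, not the authors' implementation. It assumes chunk-wise masking (each frame attends to its own chunk plus a fixed number of preceding chunks) and assumes attention sinks are the first few frames of the utterance, kept visible to every later query. The function names and the parameters chunk_size, left_chunks, and num_sinks are hypothetical.

```python
import math


def streaming_attention_mask(num_frames, chunk_size, left_chunks, num_sinks=0):
    """Build a boolean mask; mask[q][k] is True if query frame q may attend to key frame k.

    Assumed pattern (not taken from the paper): frames are grouped into
    chunks of `chunk_size`; a query sees its own chunk and `left_chunks`
    earlier chunks, never a future chunk. The first `num_sinks` frames
    act as attention sinks and stay visible to all later queries.
    """
    if chunk_size <= 0 or left_chunks < 0 or num_sinks < 0:
        raise ValueError("chunk_size must be positive; other sizes non-negative")
    mask = []
    for q in range(num_frames):
        q_chunk = q // chunk_size
        row = []
        for k in range(num_frames):
            k_chunk = k // chunk_size
            not_future = k_chunk <= q_chunk               # no look-ahead beyond own chunk
            in_window = k_chunk >= q_chunk - left_chunks  # limited left context
            is_sink = k < num_sinks                       # always-visible initial frames
            row.append(not_future and (in_window or is_sink))
        mask.append(row)
    return mask


def masked_softmax(scores, allowed):
    """Softmax over one row of attention scores, restricted to allowed keys."""
    peak = max(s for s, ok in zip(scores, allowed) if ok)  # for numerical stability
    exps = [math.exp(s - peak) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    # 8 frames, chunks of 2 frames, 1 chunk of left context, 1 sink frame.
    for row in streaming_attention_mask(8, 2, left_chunks=1, num_sinks=1):
        print("".join("x" if ok else "." for ok in row))
```

In this sketch, shrinking left_chunks reduces the per-frame attention span, and a small num_sinks keeps a few early frames reachable regardless of the window. Whether this matches the configuration evaluated in the paper should be checked against the attached PDF.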