<subfield code="a">REPORT</subfield>

</datafield>

<subfield code="a">Kumar_Idiap-RR-08-2024/IDIAP</subfield>

</datafield>

<subfield code="a">XLSR-Transducer: Streaming ASR for Self-Supervised Pretrained Models</subfield>

</datafield>

<subfield code="a">Kumar, Shashi</subfield>

</datafield>

<subfield code="a">Madikeri, Srikanth</subfield>

</datafield>

<subfield code="a">Zuluaga-Gomez, Juan</subfield>

</datafield>

<subfield code="a">Villatoro-Tello, Esaú</subfield>

</datafield>

<subfield code="a">Nigmatulina, Iuliia</subfield>

</datafield>

<subfield code="a">Motlicek, Petr</subfield>

</datafield>

<subfield code="a">E, Manjunath K</subfield>

</datafield>

<subfield code="a">Ganapathiraju, Aravind</subfield>

</datafield>

<subfield code="i">EXTERNAL</subfield>

<subfield code="u">http://publications.idiap.ch/attachments/reports/2024/Kumar_Idiap-RR-08-2024.pdf</subfield>

<subfield code="x">PUBLIC</subfield>

</datafield>

<subfield code="a">Idiap-RR-08-2024</subfield>

</datafield>

<subfield code="b">Idiap</subfield>

</datafield>

<subfield code="d">August 2024</subfield>

</datafield>

<subfield code="a">Self-supervised pretrained models exhibit competitive performance in automatic speech recognition on finetuning, even with limited in-domain supervised data for training. However, popular pretrained models are not suitable for streaming ASR because they are trained with full attention context. In this paper, we introduce XLSR-Transducer, where the XLSR-53 model is used as encoder in transducer setup. Our experiments on the AMI dataset reveal that the XLSR-Transducer achieves 4% absolute WER improvement over Whisper large-v2 and 8% over a Zipformer transducer model trained from scratch. To enable streaming capabilities, we investigate different attention masking patterns in the self-attention computation of transformer layers within the XLSR-53 model. We validate XLSR-Transducer on AMI and 5 languages from CommonVoice under low-resource scenarios. Finally, with the introduction of attention sinks, we reduce the left context by half while achieving a relative 12% improvement in WER.</subfield>

</datafield>

<subfield code="u">https://arxiv.org/abs/2407.04439</subfield>

</datafield>

</record>

</collection>