CONF
Thorbecke_EMNLP_2024/IDIAP
Fast Streaming Transducer ASR Prototyping via Knowledge Distillation with Whisper
Thorbecke, Iuliia
Zuluaga-Gomez, Juan
Villatoro-Tello, Esaú
Kumar, Shashi
Rangappa, Pradeep
Burdisso, Sergio
Motlicek, Petr
S, Karthik Pandia D
Ganapathiraju, Aravind
pseudo-labelling
shallow fusion
streaming transducer
EXTERNAL
https://publications.idiap.ch/attachments/papers/2024/Thorbecke_EMNLP_2024.pdf
PUBLIC
https://publications.idiap.ch/index.php/publications/showcite/Iuliia_Idiap-RR-10-2024
Related documents
Findings of the Association for Computational Linguistics: EMNLP 2024
2024
Association for Computational Linguistics (ACL)
Miami, Florida, USA
16747–16762
https://aclanthology.org/2024.findings-emnlp.976/
URL
10.18653/v1/2024.findings-emnlp.976
doi
Training automatic speech recognition (ASR) models with little to no supervised data remains an open question. In this work, we demonstrate that streaming Transformer-Transducer (TT) models can be trained from scratch, in their entirety, on accessible consumer-grade GPUs using pseudo-labeled (PL) speech from foundational speech models (FSMs). This allows training a robust ASR model in just one stage and does not require the large data and computational budget of the two-step scenario with pre-training and fine-tuning. We perform a comprehensive ablation on different aspects of PL-based streaming TT models, such as the impact of (1) shallow fusion of n-gram LMs, (2) contextual biasing with named entities, (3) chunk-wise decoding for low-latency streaming applications, and (4) overall TT performance as a function of the FSM size. Our results demonstrate that TT models can be trained from scratch without supervised data, even with very noisy PLs. We validate the proposed framework on 6 languages from CommonVoice and propose multiple heuristics to filter out hallucinated PLs.
REPORT
Iuliia_Idiap-RR-10-2024/IDIAP
Fast Streaming Transducer ASR Prototyping via Knowledge Distillation with Whisper
Thorbecke, Iuliia
Zuluaga-Gomez, Juan
Villatoro-Tello, Esaú
Kumar, Shashi
Rangappa, Pradeep
Burdisso, Sergio
Motlicek, Petr
S, Karthik Pandia D
Ganapathiraju, Aravind
EXTERNAL
https://publications.idiap.ch/attachments/reports/2024/Iuliia_Idiap-RR-10-2024.pdf
PUBLIC
Idiap-RR-10-2024
2024
Idiap
October 2024
accepted to EMNLP
Training automatic speech recognition (ASR) models with little to no supervised data remains an open question. In this work, we demonstrate that streaming Transformer-Transducer (TT) models can be trained from scratch, in their entirety, on accessible consumer-grade GPUs using pseudo-labeled (PL) speech from foundational speech models (FSMs). This allows training a robust ASR model in just one stage and does not require the large data and computational budget of the two-step scenario with pre-training and fine-tuning. We perform a comprehensive ablation on different aspects of PL-based streaming TT models, such as the impact of (1) shallow fusion of n-gram LMs, (2) contextual biasing with named entities, (3) chunk-wise decoding for low-latency streaming applications, and (4) overall TT performance as a function of the FSM size. Our results demonstrate that TT models can be trained from scratch without supervised data, even with very noisy PLs. We validate the proposed framework on 6 languages from CommonVoice and propose multiple heuristics to filter out hallucinated PLs.