Low-Resource Speech Recognition and Understanding for Challenging Applications
Type of publication: | Thesis |
Citation: | Juan_THESIS_2024 |
Year: | 2024 |
School: | EPFL-EDEE |
Abstract: | Automatic speech recognition (ASR) and spoken language understanding (SLU) are core components of current voice-powered AI assistants such as Siri and Alexa. SLU involves transcribing speech with ASR and comprehending the transcripts with natural language understanding (NLU) systems. Traditionally, SLU runs in a cascaded setting, where an in-domain ASR system automatically generates transcripts from which valuable semantic information, e.g., named entities and intents, is extracted. These components have generally been based on statistical approaches with hand-crafted features. However, current trends have shifted towards large-scale end-to-end (E2E) deep neural networks (DNN), which have shown superior performance on a wide range of SLU tasks. For example, ASR has seen a rapid transition from traditional hybrid-based modeling to encoder-decoder and Transducer-based modeling. Even though the improvement in performance is undeniable, other major challenges have come into play: the urgent need for large-scale supervised datasets; the need for additional modalities, such as contextual knowledge; the massive GPU clusters required to train large models; and the demand for high-performance, robust large models for complex applications. This thesis explores solutions to the challenges that arise in such complex settings. Specifically, we propose approaches: (1) to overcome data scarcity in hybrid-based and E2E ASR models, i.e., low-resource applications; (2) to integrate contextual knowledge at decoding and training time, leading to improved model quality; (3) to rapidly develop streaming ASR models from scratch for challenging domains without supervised data; and (4) to reduce the computational budget required at training and inference time by proposing efficient alternatives to state-of-the-art E2E architectures. Similarly, we explore solutions in the SLU domain, including an analysis of the optimal representations for cascaded SLU and of SLU tasks beyond intent detection and slot filling that can be performed in an E2E fashion. Finally, the thesis closes by covering STAC-ST and TokenVerse, two novel architectures that handle ASR and SLU tasks seamlessly in a single model via special tokens. |
Keywords: | air traffic control communications, automatic speech recognition, conversational speech, end-to-end ASR, low-resource ASR, spoken language understanding |
Projects: | Idiap |
Authors: | |