Idiap Research Institute
Speech Data Selection for Efficient ASR Fine-Tuning using Domain Classifier and Pseudo-Label Filtering
Type of publication: Conference paper
Citation: Rangappa_ICASSP2025_2025
Booktitle: 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2025)
Year: 2025
Month: April
Abstract: In real-world speech data processing, annotated data is scarce while unlabelled speech is abundant, which presents a significant challenge. To address this, we propose an efficient data selection pipeline for fine-tuning ASR models: pseudo-labels are generated with the WhisperX pipeline, and a small, effective subset of them is selected for fine-tuning. We first build a domain classifier from computationally inexpensive TF-IDF features and a classical machine learning algorithm. We then filter the classifier output using a novel metric that assesses word ratio and perplexity distribution. The filtered pseudo-labels are used to fine-tune standard encoder-decoder Whisper models and Zipformer. The proposed pipeline reduces the dataset to approximately 1/100th of its original size while maintaining performance comparable to training on the full dataset, and it outperforms random, domain-independent selection strategies.
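The two selection stages named in the abstract (a TF-IDF domain classifier followed by a word-ratio and perplexity filter) can be illustrated with a minimal standard-library sketch. It is not the authors' implementation, and several parts are assumptions made for illustration: a nearest-centroid rule stands in for the unspecified "classical machine learning algorithm"; `word_ratio` is taken to be the fraction of a transcript's words found in an in-domain vocabulary; perplexity comes from an add-one smoothed unigram model; and the thresholds and toy sentences are invented.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def fit_idf(docs):
    # Smoothed inverse document frequency over tokenized documents.
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log((1 + n) / (1 + c)) + 1.0 for tok, c in df.items()}

def tfidf(doc, idf):
    # L2-normalised sparse TF-IDF vector; tokens unseen in training are dropped.
    tf = Counter(tok for tok in doc if tok in idf)
    vec = {tok: cnt * idf[tok] for tok, cnt in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {tok: v / norm for tok, v in vec.items()}

def centroid(vectors):
    total = {}
    for vec in vectors:
        for tok, v in vec.items():
            total[tok] = total.get(tok, 0.0) + v
    return {tok: v / len(vectors) for tok, v in total.items()}

def dot(a, b):
    return sum(v * b.get(tok, 0.0) for tok, v in a.items())

def train(labelled):
    # labelled: list of (text, domain) pairs. Returns IDF table and one centroid per domain.
    docs = [tokenize(text) for text, _ in labelled]
    idf = fit_idf(docs)
    by_domain = {}
    for doc, (_, domain) in zip(docs, labelled):
        by_domain.setdefault(domain, []).append(tfidf(doc, idf))
    return idf, {d: centroid(vs) for d, vs in by_domain.items()}

def classify(text, idf, centroids):
    # Assumption: nearest centroid by dot product, standing in for the paper's classifier.
    vec = tfidf(tokenize(text), idf)
    return max(centroids, key=lambda d: dot(vec, centroids[d]))

def word_ratio(tokens, vocab):
    # Assumed definition: share of tokens that belong to the in-domain vocabulary.
    return sum(tok in vocab for tok in tokens) / len(tokens)

def unigram_perplexity(tokens, counts):
    # Add-one smoothed unigram model; one extra vocabulary slot for unknown words.
    total = sum(counts.values())
    vocab_size = len(counts) + 1
    log_prob = sum(math.log((counts.get(t, 0) + 1) / (total + vocab_size)) for t in tokens)
    return math.exp(-log_prob / len(tokens))

def select(pseudo_labels, labelled, target, min_word_ratio=0.6, max_perplexity=20.0):
    # Keep transcripts classified as the target domain that also pass both filters.
    idf, centroids = train(labelled)
    counts = Counter(t for text, d in labelled if d == target for t in tokenize(text))
    kept = []
    for text in pseudo_labels:
        tokens = tokenize(text)
        if not tokens or classify(text, idf, centroids) != target:
            continue
        if word_ratio(tokens, counts) < min_word_ratio:
            continue
        if unigram_perplexity(tokens, counts) > max_perplexity:
            continue
        kept.append(text)
    return kept

# Toy data, invented for illustration only.
labelled = [
    ("i want to check my account balance", "banking"),
    ("please transfer money to my savings account", "banking"),
    ("my card payment was declined", "banking"),
    ("the weather is sunny today", "other"),
    ("the football match starts tonight", "other"),
    ("we watched a movie last weekend", "other"),
]
pseudo_labels = [
    "transfer money from my account",   # in-domain and clean
    "the football match is today",      # out of domain
    "check my card balance please",     # in-domain and clean
    "my account uh zzz qqq xxx blah",   # in-domain but noisy
]
selected = select(pseudo_labels, labelled, target="banking")
print(selected)
```

In this toy run the out-of-domain sentence is rejected by the classifier and the noisy one by the word-ratio filter. In a realistic setting, the thresholds would presumably be derived from the observed distributions of the two scores rather than fixed by hand.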
Authors: Rangappa, Pradeep
Zuluaga-Gomez, Juan
Madikeri, Srikanth
Carofilis, Andrés
Prakash, Jeena
Burdisso, Sergio
Kumar, Shashi
Villatoro-Tello, Esaú
Nigmatulina, Iuliia
Motlicek, Petr
S, Karthik Pandia D
Ganapathiraju, Aravind