CONF Rangappa_ICASSP2025_2025/IDIAP

Title: Speech Data Selection for Efficient ASR Fine-Tuning using Domain Classifier and Pseudo-Label Filtering

Authors: Rangappa, Pradeep; Zuluaga-Gomez, Juan; Madikeri, Srikanth; Carofilis, Andrés; Prakash, Jeena; Burdisso, Sergio; Kumar, Shashi; Villatoro-Tello, Esaú; Iuliia, Nigmatulina; Motlicek, Petr; S, Karthik Pandia D; Ganapathiraju, Aravind

Venue: 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2025), 2025

PDF (EXTERNAL, PUBLIC): https://publications.idiap.ch/attachments/papers/2025/Rangappa_ICASSP2025_2025.pdf
URL: https://ieeexplore.ieee.org/document/10888138
DOI: 10.1109/ICASSP49660.2025.10888138

Abstract: In real-world speech data processing, the scarcity of annotated data and the abundance of unlabelled speech data present a significant challenge. To address this, we propose an efficient data selection pipeline for fine-tuning ASR models: pseudo-labels are generated with the WhisperX pipeline, and a subset of them is selected for fine-tuning. We propose a domain classifier built from computationally inexpensive TF-IDF features and a classical machine learning algorithm. We then filter the classifier output using a novel metric that assesses word ratio and perplexity distribution. The filtered pseudo-labels are used to fine-tune standard encoder-decoder Whisper models and Zipformer. Our proposed data selection pipeline reduces the dataset to approximately 1/100th of its original size while maintaining performance comparable to the full dataset, outperforming random domain-independent selection strategies.
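
The domain classifier mentioned in the abstract (TF-IDF features combined with a classical machine learning algorithm) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not name the classifier, so a nearest-centroid rule with cosine similarity is swapped in here as one possible classical choice, and the labels `in_domain` / `out_of_domain`, the toy transcripts, and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: lowercased whitespace tokens are enough for a toy example.
    return text.lower().split()


def fit_idf(docs):
    # Smoothed inverse document frequency over the training transcripts.
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {w: math.log((1 + n) / (1 + c)) + 1.0 for w, c in df.items()}


def l2_normalize(vec):
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {w: v / norm for w, v in vec.items()} if norm else {}


def tfidf(text, idf):
    # Words unseen at training time are ignored.
    tf = Counter(w for w in tokenize(text) if w in idf)
    return l2_normalize({w: c * idf[w] for w, c in tf.items()})


def fit_centroids(docs, labels, idf):
    # One L2-normalized mean TF-IDF vector per domain label.
    sums = defaultdict(lambda: defaultdict(float))
    for doc, label in zip(docs, labels):
        for w, v in tfidf(doc, idf).items():
            sums[label][w] += v
    return {label: l2_normalize(dict(vec)) for label, vec in sums.items()}


def predict(text, idf, centroids):
    # Cosine similarity between the transcript and each domain centroid.
    vec = tfidf(text, idf)

    def score(centroid):
        return sum(v * centroid.get(w, 0.0) for w, v in vec.items())

    return max(centroids, key=lambda label: score(centroids[label]))


# Hypothetical pseudo-labelled transcripts, invented for illustration only.
train_texts = [
    "i want to check my account balance",
    "please transfer money to my savings account",
    "my credit card payment failed",
    "the weather is sunny today",
    "it will rain tomorrow in the city",
    "the football match was exciting",
]
train_labels = ["in_domain"] * 3 + ["out_of_domain"] * 3

idf = fit_idf(train_texts)
centroids = fit_centroids(train_texts, train_labels, idf)

print(predict("transfer money from my account", idf, centroids))
print(predict("will it rain today", idf, centroids))
```

In a pipeline like the one described, transcripts classified as in-domain would then go through the word-ratio and perplexity filtering step; that metric is not defined in the abstract, so it is not sketched here.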