Year: 2022
Submitted to ICASSP 2021
Abstract: Starting from a state-of-the-art speech recognition system as a baseline, we investigate the use of sparse autoencoders to further process the baseline acoustic model outputs (senone likelihoods) to obtain even more reliable senone and phone likelihoods with better classification performance. Inspired by dictionary learning, we use shallow overcomplete sparse autoencoders to project the senone likelihoods into a high-dimensional sparse space, which is expected to increase the separability among different speech units while suppressing distortions. On the Augmented Multiparty Interaction (AMI) dataset, under both the Individual Head-mounted Microphone (IHM) and Single Distant Microphone (SDM) conditions, we show that classifiers trained on the high-dimensional sparse representations from the sparse autoencoders improve senone and phone classification performance compared to the baseline, especially in the noisy (SDM) condition. For word recognition, we obtain promising improvements when a weighted combination of the senone likelihoods from the baseline acoustic model and from our senone classifier is used for decoding.
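The abstract does not spell out the autoencoder architecture, its training loss, or the exact combination rule used for decoding. The following is a minimal pure-Python sketch under stated assumptions: a single ReLU hidden layer wider than the input (overcomplete), a squared-error reconstruction loss plus an L1 penalty on the hidden code to encourage sparsity, and a per-senone linear interpolation of two log-likelihood vectors. The function names, the penalty weight `lam`, and the interpolation `weight` are hypothetical illustrations, not the authors' implementation, and training is omitted.

```python
# Hypothetical sketch: architecture, loss and combination rule are assumptions,
# not the method described in the paper.
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]


def encode(x, W_enc, b_enc):
    """ReLU encoder: maps a D-dim input to an H-dim code (H > D, overcomplete)."""
    return [max(0.0, v + b) for v, b in zip(matvec(W_enc, x), b_enc)]


def decode(h, W_dec, b_dec):
    """Linear decoder: reconstructs the D-dim input from the H-dim code."""
    return [v + b for v, b in zip(matvec(W_dec, h), b_dec)]


def sparse_ae_loss(x, params, lam=1e-3):
    """Mean squared reconstruction error plus an L1 sparsity penalty on the code."""
    W_enc, b_enc, W_dec, b_dec = params
    h = encode(x, W_enc, b_enc)
    x_hat = decode(h, W_dec, b_dec)
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    return recon + lam * sum(abs(v) for v in h), h


def init_params(d_in, d_hidden, seed=0):
    """Small random weights; training (e.g. by gradient descent) is not shown."""
    rng = random.Random(seed)
    W_enc = [[rng.gauss(0.0, 0.1) for _ in range(d_in)] for _ in range(d_hidden)]
    W_dec = [[rng.gauss(0.0, 0.1) for _ in range(d_hidden)] for _ in range(d_in)]
    return W_enc, [0.0] * d_hidden, W_dec, [0.0] * d_in


def combine_log_likelihoods(ll_base, ll_clf, weight=0.5):
    """Per-senone weighted sum of baseline and classifier log-likelihoods (assumed rule)."""
    return [weight * a + (1.0 - weight) * b for a, b in zip(ll_base, ll_clf)]


if __name__ == "__main__":
    D, H = 4, 16  # toy sizes; real senone inventories are far larger
    params = init_params(D, H)
    x = [0.7, 0.1, 0.15, 0.05]  # toy senone posterior vector
    loss, code = sparse_ae_loss(x, params)
    print(len(code), round(loss, 4))
    print(combine_log_likelihoods([-1.0, -2.0], [-3.0, -0.5]))  # [-2.0, -1.25]
```

In this sketch the L1 term pushes most hidden units toward zero, so each input activates only a few of the many code dimensions, which is the intuition behind the separability argument above; the interpolation `weight` would in practice be tuned on held-out data.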
Keywords: Deep neural network, high-dimensional sparse representations, sparse autoencoder, speech recognition
Selen Hande Kabil
Hervé Bourlard
