Inference in Switching Linear Dynamical Systems Applied to Noise Robust Speech Recognition of Isolated Digits

Type of publication:	Idiap-RR
Citation:	mesot:rr08-35
Number:	Idiap-RR-35-2008
Year:	2008
Institution:	IDIAP Research Institute
Abstract:	Real-world applications such as hands-free dialling in cars may have to recognise spoken digits in potentially very noisy environments. Existing state-of-the-art solutions to this problem use feature-based Hidden Markov Models (HMMs), with a preprocessing stage to clean the noisy signal. However, the effect that the noise has on the induced HMM features is difficult to model exactly and limits the performance of the HMM system. An alternative to feature-based HMMs is to model the clean speech waveform directly, which has the potential advantage that including an explicit model of additive noise is straightforward. One of the simplest models of the clean speech waveform is the autoregressive (AR) process. Being too simple to cope with the nonlinearity of the speech signal, the AR process is generally embedded into a more elaborate model, such as the Switching Autoregressive HMM (SAR-HMM). In this thesis, we extend the SAR-HMM to jointly model the clean speech waveform and additive Gaussian white noise. This is achieved by using a Switching Linear Dynamical System (SLDS) whose internal dynamics are autoregressive. On an isolated digit recognition task where utterances have been corrupted by additive Gaussian white noise, the proposed SLDS outperforms a state-of-the-art HMM system. For more natural noise sources, at low signal-to-noise ratios (SNRs), it is also significantly more accurate than a feature-based HMM system. Inferring the clean waveform from the observed noisy signal with an SLDS is formally intractable, resulting in many approximation strategies in the literature. In this thesis, we present the Expectation Correction (EC) approximation. The algorithm has excellent numerical performance compared to a wide range of competing techniques, and provides a stable and accurate linear-time approximation which scales well to long time series such as those found in acoustic modelling.
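As a rough illustration of the kind of generative model described above, the following sketch samples from a toy switching AR process observed through additive Gaussian white noise. It is not taken from the thesis: the function names (`companion`, `sample_switching_ar`), the parameter names, and the choice of a single shared innovation variance are assumptions made for the example. It covers only sampling, not the Expectation Correction inference.

```python
import random


def companion(coeffs):
    """Transition matrix of an AR(p) process written in companion form.

    The hidden state holds the last p clean samples, newest first.
    """
    p = len(coeffs)
    rows = [list(coeffs)]
    for i in range(p - 1):
        rows.append([1.0 if j == i else 0.0 for j in range(p)])
    return rows


def sample_switching_ar(ar_sets, trans, T, innov_std, noise_std, seed=0):
    """Sample (states, clean, noisy) from a toy switching AR model.

    ar_sets   : list of AR coefficient lists, one per switch state
                (all of the same order p; an assumption of this sketch)
    trans     : trans[i][j] = P(s_t = j | s_{t-1} = i)
    innov_std : standard deviation of the AR innovation
    noise_std : standard deviation of the additive white observation noise
    """
    rng = random.Random(seed)
    p = len(ar_sets[0])
    mats = [companion(c) for c in ar_sets]
    h = [0.0] * p  # hidden state: last p clean samples
    s = 0          # initial switch state (arbitrary choice)
    states, clean, noisy = [], [], []
    for _ in range(T):
        # Pick the next switch state from the Markov chain.
        s = rng.choices(range(len(ar_sets)), weights=trans[s])[0]
        # Linear dynamics: h_t = A(s_t) h_{t-1} + innovation on the first entry.
        h = [sum(a * x for a, x in zip(row, h)) for row in mats[s]]
        h[0] += rng.gauss(0.0, innov_std)
        states.append(s)
        clean.append(h[0])
        # Observation: clean sample plus white Gaussian noise.
        noisy.append(h[0] + rng.gauss(0.0, noise_std))
    return states, clean, noisy
```

Setting `noise_std` to zero makes the observation equal to the clean sample, so the sketch reduces to a plain switching AR process; a nonzero value gives the noisy-waveform setting in which the clean signal has to be inferred.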
A fundamental issue faced by models based on AR processes is that they are sensitive to variations in the amplitude of the signal. One way to overcome this limitation is to use Gain Adaptation (GA) to adjust the amplitude by maximising the likelihood of the observed signal. However, adjusting model parameters without constraint may lead to overfitting when the models are sufficiently flexible. In this thesis, we propose a statistically principled alternative based on an exact Bayesian procedure in which priors are explicitly defined on the parameters of the underlying AR process. Compared to GA, the Bayesian approach enhances recognition accuracy at high SNRs, but is slightly less accurate at low SNRs.
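To make the idea of adjusting the amplitude by maximum likelihood concrete, here is a minimal sketch for a simplified, noise-free case. It assumes a single AR model with fixed coefficients and innovation standard deviation, and an observed signal that is a scaled copy of a signal from that model. The function name `ml_gain` and this formulation are assumptions for illustration and may differ from the GA procedure used in the thesis. Under these assumptions the prediction residuals are Gaussian with standard deviation `gain * innov_std`, so the likelihood (conditioned on the first p samples) is maximised when the gain equals the root-mean-square residual divided by `innov_std`.

```python
import math


def ml_gain(signal, coeffs, innov_std):
    """Maximum-likelihood gain for a scaled AR signal (noise-free toy case).

    signal    : observed samples, assumed to be gain * (AR process sample)
    coeffs    : AR coefficients c_1..c_p of the unscaled model
    innov_std : innovation standard deviation of the unscaled model
    """
    p = len(coeffs)
    # One-step-ahead prediction residuals of the observed signal.
    residuals = [
        signal[t] - sum(c * signal[t - 1 - i] for i, c in enumerate(coeffs))
        for t in range(p, len(signal))
    ]
    mean_sq = sum(e * e for e in residuals) / len(residuals)
    return math.sqrt(mean_sq) / innov_std
```

Because the estimate is fitted freely to each signal, it shows in miniature the unconstrained adjustment that the abstract contrasts with placing explicit priors on the AR parameters.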
Userfields:	ipdmembership={speech},
Keywords:
Projects:	Idiap
Authors:	Mesot, Bertrand
Crossref by:	Mesot_THESIS_2008
Attachments
mesot-idiap-rr-08-35.pdf