REPORT stephenson03c/IDIAP Speech Recognition with Auxiliary Information Stephenson, Todd Andrew EXTERNAL http://publications.idiap.ch/attachments/reports/2003/rr03-28.pdf PUBLIC Idiap-RR-28-2003 2003 IDIAP To appear as: Stephenson, T. A. (2003). ``Speech Recognition with Auxiliary Information''. Docteur ès Sciences thesis, Swiss Federal Institute of Technology Lausanne (EPFL), Lausanne, Switzerland. Automatic speech recognition (ASR) is a very challenging problem because of the wide variety of data that it must handle. Hidden Markov models (HMMs), the standard tool for ASR, have proven to work well when the variety of the data is controlled. Dynamic Bayesian networks (DBNs), which are relatively new to ASR, are more generic models with algorithms that are more flexible than those of HMMs. Unlike in HMMs, various assumptions can be changed without modifying the underlying algorithm and code; these assumptions relate to the variables to be modeled, the statistical dependencies between these variables, and the observations that are available for certain of the variables. The main objective of this thesis, therefore, is to examine some areas where DBNs can be used to change HMMs' assumptions so as to obtain models that are more robust to the variety of data that ASR must deal with. HMMs model the standard observed features jointly with a hidden discrete state variable, with certain constraints placed upon the states and features. DBNs can generalize this modeling framework by incorporating additional ``auxiliary'' variables to help the modeling, whereas HMMs are typically limited to the two variables under those constraints. The DBN framework is also flexible in how such an auxiliary variable is introduced, in several ways. First, this auxiliary information aids the modeling due to its correlation with the standard features. 
As such, in the DBN framework, we can make it directly condition the distribution of the standard features. Second, some types of auxiliary information are not strongly correlated with the hidden state, so in the DBN framework we may want to treat the auxiliary variable as conditionally independent of the hidden state variable. Third, as auxiliary information tends to be strongly correlated with its own previous values in time, I show DBNs that use discretized auxiliary variables to model the evolution of the auxiliary information over time. Finally, as auxiliary information can be missing or noisy when a trained system is used, the DBNs can perform recognition using only its prior distribution, learned from observations of the auxiliary information during training. I investigate these different advantages of DBN-based ASR using auxiliary information involving articulator positions, estimated pitch, estimated rate-of-speech, and energy. I also show DBNs to be better at incorporating auxiliary information than hybrid HMM/ANN ASR, which uses artificial neural networks (ANNs). I show how auxiliary information is best introduced in a time-dependent manner. Finally, DBNs with auxiliary information are better able than standard HMM approaches to handle noisy speech; specifically, DBNs with hidden energy as auxiliary information (which conditions the distribution of the standard features and is conditionally independent of the state) are more robust to noisy speech than HMMs are.
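
The dependency structure described above can be illustrated with a minimal sketch. It is not the thesis's implementation, and it rests on simplifying assumptions: features and the auxiliary variable are both discrete, the auxiliary variable is independent of the state and of its own past (the time-dependent variant is not covered), and every name and number below (`forward_likelihood`, `emit`, `aux_prior`, the toy parameters) is hypothetical. An observed auxiliary value conditions the feature distribution directly; a missing one is summed out using its prior.

```python
def forward_likelihood(obs, aux, init, trans, emit, aux_prior):
    """Forward-algorithm likelihood for an HMM extended with an auxiliary variable.

    obs[t]        : index of the discrete feature observed at time t
    aux[t]        : index of the auxiliary value at time t, or None if missing
    init[q]       : P(q_1 = q)
    trans[p][q]   : P(q_t = q | q_{t-1} = p)
    emit[q][a][x] : P(x_t = x | q_t = q, a_t = a)
    aux_prior[a]  : P(a_t = a), assumed independent of the state
    """
    if not obs:
        return 1.0
    n_states = len(init)
    n_aux = len(aux_prior)
    alpha = None
    for x, a in zip(obs, aux):
        if a is None:
            # Missing auxiliary value: marginalize it out using its prior.
            local = [sum(aux_prior[k] * emit[q][k][x] for k in range(n_aux))
                     for q in range(n_states)]
        else:
            # Observed auxiliary value: it conditions the feature distribution.
            local = [aux_prior[a] * emit[q][a][x] for q in range(n_states)]
        if alpha is None:
            alpha = [init[q] * local[q] for q in range(n_states)]
        else:
            alpha = [local[q] * sum(alpha[p] * trans[p][q] for p in range(n_states))
                     for q in range(n_states)]
    return sum(alpha)


# Toy parameters (illustrative only): 2 states, 2 auxiliary values, 2 feature symbols.
init = [0.6, 0.4]
trans = [[0.7, 0.3], [0.2, 0.8]]
aux_prior = [0.5, 0.5]
emit = [[[0.9, 0.1], [0.4, 0.6]],
        [[0.3, 0.7], [0.5, 0.5]]]

obs = [0, 1, 1]
with_aux = forward_likelihood(obs, [0, 1, 1], init, trans, emit, aux_prior)
without_aux = forward_likelihood(obs, [None, None, None], init, trans, emit, aux_prior)
```

With all auxiliary values observed, the function returns the joint likelihood of features and auxiliary values; with all of them missing, it returns the likelihood of the features alone, which is the sense in which recognition can fall back on the prior.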