Towards Weakly Supervised Acoustic Subword Unit Discovery and Lexicon Development Using Hidden Markov Models

Type of publication:	Idiap-RR
Citation:	Razavi_Idiap-RR-15-2017
Number:	Idiap-RR-15-2017
Year:	2017
Month:	4
Institution:	Idiap
Abstract:	Developing a phonetic lexicon for a language requires linguistic knowledge as well as human effort, which may not be available, particularly for under-resourced languages. An alternative to developing a phonetic lexicon is to automatically derive subword units using acoustic information and generate the associated pronunciations. In the literature, this has mostly been studied from the pronunciation variation modeling perspective. In this article, we investigate automatic subword unit derivation from the under-resourced language point of view. To that end, we present a novel hidden Markov model (HMM) formalism for automatic derivation of subword units and pronunciation generation using only transcribed speech data. In this approach, the subword units are derived from the clustered context-dependent units in a grapheme-based system using the maximum-likelihood criterion. The subword-unit-based pronunciations are then generated by either deterministic or probabilistic learning of the relationship between the graphemes and the acoustic subword units (ASWUs). In this article, we first establish the proposed framework on a well-resourced language by comparing it against related approaches in the literature and investigating the transferability of the derived subword units to other domains. We then show the scalability of the proposed approach to real under-resourced scenarios by conducting studies on Scottish Gaelic, a genuinely minority and endangered language, and comparing the approach against state-of-the-art grapheme-based approaches in under-resourced scenarios. Our experimental studies on English show that the derived subword units not only lead to better ASR systems than graphemes, but can also be exploited to build out-of-domain ASR systems. The experimental studies on Scottish Gaelic show that the proposed ASWU-based lexicon development approach retains its advantage over the grapheme-based lexicon. In other words, the proposed approach yields significant gains in ASR performance, even when multilingual resources from resource-rich languages are exploited in the development of ASR systems.
Keywords:	Automatic Speech Recognition, automatic subword unit derivation, Hidden Markov Model, Kullback-Leibler divergence based hidden Markov model, pronunciation generation, under-resourced languages
Projects:	Idiap
Authors:	Razavi, Marzieh; Rasipuram, Ramya; Magimai-Doss, Mathew
Crossref by:	Razavi_SPECOM_2018
Attachments
Razavi_Idiap-RR-15-2017.pdf (MD5: c26d984b0fffa96a0e19a1264fabfc1e)