[CONF] Cernak_ICASSP15_2015/IDIAP
Title: Phonological Vocoding Using Artificial Neural Networks
Authors: Cernak, Milos; Potard, Blaise; Garner, Philip N.
Venue: IEEE 40th International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, Australia, 2015
Publisher: IEEE
Pages: 4844-4848
DOI: 10.1109/ICASSP.2015.7178891
PDF: https://publications.idiap.ch/attachments/papers/2015/Cernak_ICASSP15_2015.pdf (public)
Related document: https://publications.idiap.ch/index.php/publications/showcite/Cernak_Idiap-RR-04-2015

Abstract: We investigate a vocoder based on artificial neural networks, using a phonological speech representation. Speech decomposition is based on phonological encoders, realised as neural network classifiers trained for a particular language. Speech reconstruction uses a Deep Neural Network (DNN) to map phonological feature posteriors to speech parameters -- line spectra and glottal signal parameters -- followed by LPC resynthesis. This DNN is trained on a target voice, without transcriptions, in a semi-supervised manner. Both the encoder and the decoder are neural networks, so vocoding is achieved with a simple, fast forward pass. An experiment with French vocoding and a target male voice trained on a 21-hour audiobook is presented. An application of the phonological vocoder to low-bit-rate speech coding is also shown, in which the transmitted phonological posteriors are pruned and quantized. With scalar quantization the vocoder operates at 1 kbps, with potential for lower bit rates.

[REPORT] Cernak_Idiap-RR-04-2015/IDIAP
Title: Phonological vocoding using artificial neural networks
Authors: Cernak, Milos; Potard, Blaise; Garner, Philip N.
Keywords: low bit rate speech coding; parametric vocoding; phonology
Report: Idiap-RR-04-2015, Idiap, February 2015
PDF: https://publications.idiap.ch/attachments/reports/2014/Cernak_Idiap-RR-04-2015.pdf (public)
Abstract: identical to the ICASSP 2015 paper above.
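
The abstract describes a two-stage pipeline: a phonological encoder producing per-frame class posteriors, and a synthesis DNN mapping those posteriors to line spectra and glottal parameters, followed by LPC resynthesis. The following is a minimal NumPy/SciPy sketch of that forward pass; the feature dimensions, network sizes, activation functions, voicing/gain outputs and the impulse-train excitation are illustrative assumptions, not the architecture or glottal model reported in the paper.

"""
Minimal sketch of the phonological vocoder's forward pass (analysis -> synthesis).
Dimensions, layer shapes and the excitation model are illustrative assumptions.
"""
import numpy as np
from scipy.signal import lfilter

FRAME_SHIFT = 0.010          # assumed 10 ms frame shift
SAMPLE_RATE = 16000
N_PHONO = 24                 # assumed number of phonological classes
N_ACOUSTIC = 39              # assumed acoustic feature dimension
LPC_ORDER = 16               # assumed (even) order of the all-pole envelope


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def encode(acoustic_feats, w_enc, b_enc):
    """Phonological encoder: per-frame posteriors of phonological classes.

    The paper uses neural-network classifiers trained for a particular
    language; a single weight matrix stands in for that bank of classifiers.
    """
    return sigmoid(acoustic_feats @ w_enc + b_enc)         # (frames, N_PHONO)


def decode(posteriors, w_dec, b_dec):
    """Synthesis DNN: map phonological posteriors to speech parameters.

    Outputs per frame: LPC_ORDER line spectral frequencies plus a gain and a
    voicing decision standing in for the glottal signal parameters.
    """
    params = posteriors @ w_dec + b_dec                     # (frames, LPC_ORDER + 2)
    # Sorted, distinct LSFs in (0, pi) guarantee a stable synthesis filter.
    lsf = np.sort(np.clip(params[:, :LPC_ORDER], 1e-3, np.pi - 1e-3), axis=1)
    gain = np.abs(params[:, LPC_ORDER])
    voiced = params[:, LPC_ORDER + 1] > 0.0
    return lsf, gain, voiced


def lsf_to_lpc(lsf):
    """Standard LSF -> LPC conversion for an even-order filter."""
    p = np.poly(np.exp(1j * np.concatenate([lsf[0::2], -lsf[0::2]]))).real
    q = np.poly(np.exp(1j * np.concatenate([lsf[1::2], -lsf[1::2]]))).real
    p = np.convolve(p, [1.0, 1.0])       # symmetric polynomial: extra root at z = -1
    q = np.convolve(q, [1.0, -1.0])      # antisymmetric polynomial: extra root at z = +1
    return 0.5 * (p + q)[:-1]            # a[0] == 1, trailing coefficient cancels


def resynthesize(lsf, gain, voiced, f0=120.0):
    """LPC resynthesis: filter a crude excitation frame by frame (no state carry-over)."""
    n = int(FRAME_SHIFT * SAMPLE_RATE)
    out = []
    for i in range(len(lsf)):
        if voiced[i]:                    # impulse train for voiced frames, fixed F0
            exc = np.zeros(n)
            exc[:: int(SAMPLE_RATE / f0)] = 1.0
        else:                            # white noise for unvoiced frames
            exc = np.random.randn(n) * 0.1
        out.append(lfilter([gain[i]], lsf_to_lpc(lsf[i]), exc))
    return np.concatenate(out)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = 200                                            # 2 s of speech
    feats = rng.standard_normal((frames, N_ACOUSTIC))       # stand-in acoustic features
    w_enc = rng.standard_normal((N_ACOUSTIC, N_PHONO)) * 0.1
    w_dec = rng.standard_normal((N_PHONO, LPC_ORDER + 2)) * 0.1
    post = encode(feats, w_enc, rng.standard_normal(N_PHONO))
    lsf, gain, voiced = decode(post, w_dec, rng.standard_normal(LPC_ORDER + 2))
    audio = resynthesize(lsf, gain, voiced)
    print(audio.shape)

Both stages are plain forward passes, which is the point the abstract makes: once the encoder and the decoder DNN are trained, vocoding needs no search or iterative optimisation.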
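
The coding application prunes and scalar-quantizes the transmitted posteriors to reach roughly 1 kbps. A back-of-the-envelope calculation shows how frame rate, the number of posteriors kept after pruning, and quantization depth combine into a bit rate; the concrete numbers below are assumptions chosen only to land in the same regime, not the operating point used in the paper.

"""
Illustrative bit-rate arithmetic for pruned, scalar-quantized phonological
posteriors. Frame rate, class count, kept posteriors and bit depths are
assumptions, not the paper's settings.
"""
import math

def posterior_bit_rate(frame_rate, n_classes, kept_per_frame, bits_per_value):
    """Bits per second when only `kept_per_frame` posteriors survive pruning.

    Each transmitted posterior needs an index identifying its class plus a
    scalar-quantized value.
    """
    index_bits = math.ceil(math.log2(n_classes))
    return frame_rate * kept_per_frame * (index_bits + bits_per_value)

# E.g. 24 classes, 50 transmitted frames/s after pruning, 2 posteriors per
# frame, 5-bit values: 50 * 2 * (5 + 5) = 1000 bits/s, i.e. the ~1 kbps
# regime mentioned in the abstract.
print(posterior_bit_rate(frame_rate=50, n_classes=24, kept_per_frame=2, bits_per_value=5))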