CONF Dubagunta_ICMI’22COMPANION_2022/IDIAP Towards Automatic Prediction of Non-Expert Perceived Speech Fluency Ratings Dubagunta, S. Pavankumar Moneta, Edoardo Theocharopoulos, Eleni Magimai-Doss, Mathew EXTERNAL http://publications.idiap.ch/attachments/papers/2022/Dubagunta_ICMI?22COMPANION_2022.pdf PUBLIC http://publications.idiap.ch/index.php/publications/showcite/Dubagunta_Idiap-RR-11-2021 Related documents ACM International Conference on Multimodal Interaction (ICMI Companion) 2022 https://doi.org/10.1145/3536220.3563689 doi REPORT Dubagunta_Idiap-RR-11-2021/IDIAP Towards Automatic Prediction of Non-Expert Perceived Speech Fluency Ratings Dubagunta, S. Pavankumar Moneta, Edoardo Theocharopoulos, Eleni Magimai-Doss, Mathew articulatory features bag of audio words low level descriptors Perceived fluency raw waveform modelling speech assessment Zero frequency filtering EXTERNAL http://publications.idiap.ch/attachments/reports/2021/Dubagunta_Idiap-RR-11-2021.pdf PUBLIC Idiap-RR-11-2021 2021 Idiap August 2021 Automatic speech fluency prediction has mainly been approached from the perspective of computer-aided language learning, where the system aims to predict ratings similar to those of human experts. Speech fluency prediction, however, can also be considered in a more relaxed social setting, where the ratings arise mostly from non-experts. This paper explores the latter direction, i.e., prediction of non-expert perceived speech fluency ratings, which, to the best of our knowledge, has not been studied in the speech technology literature. To that end, we investigate three approaches, namely, (a) low-level descriptor feature functionals, (b) a bag-of-audio-words based approach and (c) a neural network based end-to-end acoustic modelling approach. 
Our investigations on speech data collected from 54 speakers and rated by seven non-experts demonstrate that non-expert speech fluency ratings can be systematically predicted, with the best-performing system yielding a Pearson's correlation coefficient of 0.66 and a Spearman's correlation coefficient of 0.67 with the median human scores.

</datafield>

<subfield code="a">Dubagunta_ICMI’22COMPANION_2022/IDIAP</subfield>

</datafield>

<subfield code="a">Towards Automatic Prediction of Non-Expert Perceived Speech Fluency Ratings</subfield>

</datafield>

<subfield code="a">Dubagunta, S. Pavankumar</subfield>

</datafield>

<subfield code="a">Moneta, Edoardo</subfield>

</datafield>

<subfield code="a">Theocharopoulos, Eleni</subfield>

</datafield>

<subfield code="a">Magimai-Doss, Mathew</subfield>

</datafield>

<subfield code="i">EXTERNAL</subfield>

<subfield code="u">http://publications.idiap.ch/attachments/papers/2022/Dubagunta_ICMI?22COMPANION_2022.pdf</subfield>

<subfield code="x">PUBLIC</subfield>

</datafield>

<subfield code="u">http://publications.idiap.ch/index.php/publications/showcite/Dubagunta_Idiap-RR-11-2021</subfield>

<subfield code="z">Related documents</subfield>

</datafield>

<subfield code="a">ACM International Conference on Multimodal Interaction (ICMI Companion)</subfield>

</datafield>

</datafield>

<subfield code="a">https://doi.org/10.1145/3536220.3563689</subfield>

</datafield>

</record>

<subfield code="a">REPORT</subfield>

</datafield>

<subfield code="a">Dubagunta_Idiap-RR-11-2021/IDIAP</subfield>

</datafield>

<subfield code="a">Towards Automatic Prediction of Non-Expert Perceived Speech Fluency Ratings</subfield>

</datafield>

<subfield code="a">Dubagunta, S. Pavankumar</subfield>

</datafield>

<subfield code="a">Moneta, Edoardo</subfield>

</datafield>

<subfield code="a">Theocharopoulos, Eleni</subfield>

</datafield>

<subfield code="a">Magimai-Doss, Mathew</subfield>

</datafield>

<subfield code="a">articulatory features</subfield>

</datafield>

<subfield code="a">bag of audio words</subfield>

</datafield>

<subfield code="a">low level descriptors</subfield>

</datafield>

<subfield code="a">Perceived fluency</subfield>

</datafield>

<subfield code="a">raw waveform modelling</subfield>

</datafield>

<subfield code="a">speech assessment</subfield>

</datafield>

<subfield code="a">Zero frequency filtering</subfield>

</datafield>

<subfield code="i">EXTERNAL</subfield>

<subfield code="u">http://publications.idiap.ch/attachments/reports/2021/Dubagunta_Idiap-RR-11-2021.pdf</subfield>

<subfield code="x">PUBLIC</subfield>

</datafield>

<subfield code="a">Idiap-RR-11-2021</subfield>

</datafield>

<subfield code="b">Idiap</subfield>

</datafield>

<subfield code="d">August 2021</subfield>

</datafield>

<subfield code="a">Automatic speech fluency prediction has mainly been approached from the perspective of computer-aided language learning, where the system aims to predict ratings similar to those of human experts. Speech fluency prediction, however, can also be considered in a more relaxed social setting, where the ratings arise mostly from non-experts. This paper explores the latter direction, i.e., prediction of non-expert perceived speech fluency ratings, which, to the best of our knowledge, has not been studied in the speech technology literature. To that end, we investigate three approaches, namely, (a) low-level descriptor feature functionals, (b) a bag-of-audio-words based approach and (c) a neural network based end-to-end acoustic modelling approach. Our investigations on speech data collected from 54 speakers and rated by seven non-experts demonstrate that non-expert speech fluency ratings can be systematically predicted, with the best-performing system yielding a Pearson's correlation coefficient of 0.66 and a Spearman's correlation coefficient of 0.67 with the median human scores.</subfield>

</datafield>

</record>

</collection>