<?xml version="1.0" encoding="UTF-8"?>
<collection xmlns="http://www.loc.gov/MARC21/slim">
	<record>
		<datafield tag="980" ind1=" " ind2=" ">
			<subfield code="a">CONF</subfield>
		</datafield>
		<datafield tag="970" ind1=" " ind2=" ">
			<subfield code="a">VILLATORO-TELLO_ICASSP2023-2_2023/IDIAP</subfield>
		</datafield>
		<datafield tag="245" ind1=" " ind2=" ">
			<subfield code="a">Effectiveness of Text, Acoustic, and Lattice-based representations in Spoken Language Understanding tasks</subfield>
		</datafield>
		<datafield tag="700" ind1=" " ind2=" ">
			<subfield code="a">Villatoro-Tello, Esaú</subfield>
		</datafield>
		<datafield tag="700" ind1=" " ind2=" ">
			<subfield code="a">Madikeri, Srikanth</subfield>
		</datafield>
		<datafield tag="700" ind1=" " ind2=" ">
			<subfield code="a">Zuluaga-Gomez, Juan</subfield>
		</datafield>
		<datafield tag="700" ind1=" " ind2=" ">
			<subfield code="a">Sharma, Bidisha</subfield>
		</datafield>
		<datafield tag="700" ind1=" " ind2=" ">
			<subfield code="a">Sarfjoo, Seyyed Saeed</subfield>
		</datafield>
		<datafield tag="700" ind1=" " ind2=" ">
			<subfield code="a">Iuliia, Nigmatulina</subfield>
		</datafield>
		<datafield tag="700" ind1=" " ind2=" ">
			<subfield code="a">Motlicek, Petr</subfield>
		</datafield>
		<datafield tag="700" ind1=" " ind2=" ">
			<subfield code="a">Ivanov, Alexei V.</subfield>
		</datafield>
		<datafield tag="700" ind1=" " ind2=" ">
			<subfield code="a">Ganapathiraju, Aravind</subfield>
		</datafield>
		<datafield tag="653" ind1="1" ind2=" ">
			<subfield code="a">Cross-modal Attention</subfield>
		</datafield>
		<datafield tag="653" ind1="1" ind2=" ">
			<subfield code="a">Human-Computer Interaction</subfield>
		</datafield>
		<datafield tag="653" ind1="1" ind2=" ">
			<subfield code="a">speech recognition</subfield>
		</datafield>
		<datafield tag="653" ind1="1" ind2=" ">
			<subfield code="a">Spoken Language Understanding</subfield>
		</datafield>
		<datafield tag="653" ind1="1" ind2=" ">
			<subfield code="a">Word Consensus Networks</subfield>
		</datafield>
		<datafield tag="856" ind1="4" ind2="0">
			<subfield code="i">EXTERNAL</subfield>
			<subfield code="u">http://publications.idiap.ch/attachments/papers/2023/VILLATORO-TELLO_ICASSP2023-2_2023.pdf</subfield>
			<subfield code="x">PUBLIC</subfield>
		</datafield>
		<datafield tag="711" ind1="2" ind2=" ">
			<subfield code="a">Proceedings of the 2023 IEEE International Conference on Acoustics, Speech and Signal Processing</subfield>
		</datafield>
		<datafield tag="260" ind1=" " ind2=" ">
			<subfield code="c">2023</subfield>
		</datafield>
		<datafield tag="520" ind1=" " ind2=" ">
			<subfield code="a">In this paper, we perform an exhaustive evaluation of different representations to address the intent classification problem in a Spoken Language Understanding (SLU) setup. We benchmark three types of systems to perform the SLU intent detection task: 1) text-based, 2) lattice-based, and a novel 3) multimodal approach. Our work provides a comprehensive analysis of what could be the achievable performance of different state-of-the-art SLU systems under different circumstances, e.g., automatically- vs. manually-generated transcripts. We evaluate the systems on the publicly available SLURP spoken language resource corpus. Our results indicate that using richer forms of Automatic Speech Recognition (ASR) outputs, namely word-consensus-networks, allows the SLU system to improve in comparison to the 1-best setup (5.5% relative improvement). However, crossmodal approaches, i.e., learning from acoustic and text embeddings, obtains performance similar to the oracle setup, a relative improvement of 17.8% over the 1-best configuration, being a recommended alternative to overcome the limitations of working with automatically generated transcripts.</subfield>
		</datafield>
	</record>
</collection>