
<subfield code="a">REPORT</subfield>

</datafield>

<subfield code="a">luettin-RR-98-02/IDIAP</subfield>

</datafield>

<subfield code="a">Continuous Audio-Visual Speech Recognition</subfield>

</datafield>

<subfield code="a">Luettin, Juergen</subfield>

</datafield>

<subfield code="a">Dupont, Stéphane</subfield>

</datafield>

<subfield code="i">EXTERNAL</subfield>

<subfield code="u">http://publications.idiap.ch/attachments/reports/1998/rr98-02.pdf</subfield>

<subfield code="x">PUBLIC</subfield>

</datafield>

<subfield code="a">Idiap-RR-02-1998</subfield>

</datafield>

<subfield code="b">IDIAP</subfield>

</datafield>

<subfield code="a">Published in Proc. 5th European Conference on Computer Vision, 1998</subfield>

</datafield>

<subfield code="a">We address the problem of robust lip tracking, visual speech feature extraction, and sensor integration for audio-visual speech recognition applications. An appearance-based model of the articulators, which represents linguistically important features, is learned from example images and is used to locate, track, and recover visual speech information. We tackle the problem of joint temporal modelling of the acoustic and visual speech signals by applying Multi-Stream hidden Markov models. This approach allows the use of different temporal topologies and levels of stream integration and hence enables temporal dependencies to be modelled more accurately. The system has been evaluated on a continuously spoken digit recognition task with 37 subjects.</subfield>

</datafield>

</record>

</collection>