REPORT Hung_Idiap-RR-20-2009/IDIAP Speech/Non-Speech Detection in Meetings from Automatically Extracted Low Resolution Visual Features Hung, Hayley Ba, Silèye O. EXTERNAL http://publications.idiap.ch/attachments/reports/2009/Hung_Idiap-RR-20-2009.pdf PUBLIC Idiap-RR-20-2009 2009 Idiap July 2009 submitted to icmi-mlmi

<subfield code="a">REPORT</subfield>

</datafield>

<subfield code="a">Hung_Idiap-RR-20-2009/IDIAP</subfield>

</datafield>

<subfield code="a">Speech/Non-Speech Detection in Meetings from Automatically Extracted Low Resolution Visual Features</subfield>

</datafield>

<subfield code="a">Hung, Hayley</subfield>

</datafield>

<subfield code="a">Ba, Silèye O.</subfield>

</datafield>

<subfield code="i">EXTERNAL</subfield>

<subfield code="u">http://publications.idiap.ch/attachments/reports/2009/Hung_Idiap-RR-20-2009.pdf</subfield>

<subfield code="x">PUBLIC</subfield>

</datafield>

<subfield code="a">Idiap-RR-20-2009</subfield>

</datafield>

<subfield code="b">Idiap</subfield>

</datafield>

</datafield>

<subfield code="a">submitted to icmi-mlmi</subfield>

</datafield>

<subfield code="a">In this paper we address the problem of estimating who is speaking from automatically extracted low resolution visual cues from group meetings. Traditionally, the task of speech/non-speech detection or speaker diarization tries to find who speaks and when from audio features only. Recent work has addressed the problem audio-visually but often with less emphasis on the visual component. Due to the high probability of losing the audio stream during video conferences, this work proposes methods for estimating speech using just low resolution visual cues. We carry out experiments to compare how context through the observation of group behaviour and task-oriented activities can help improve estimates of speaking status. We test on 105 minutes of natural meeting data with unconstrained conversations.</subfield>

</datafield>

</record>

</collection>