CONF
Friedland_ICASSP_2009/IDIAP
MULTI-MODAL SPEAKER DIARIZATION OF REAL-WORLD MEETINGS USING COMPRESSED-DOMAIN VIDEO FEATURES
Friedland, Gerald
Hung, Hayley
Yeo, Chuohao
EXTERNAL
https://publications.idiap.ch/attachments/papers/2009/Friedland_ICASSP_2009.pdf
PUBLIC
International Conference on Acoustics, Speech and Signal Processing
2009
Speaker diarization is originally defined as the task of
determining “who spoke when” given an audio track and no
other prior knowledge of any kind. This article presents
a multi-modal approach that improves a state-of-the-art
speaker diarization system by combining standard
acoustic features (MFCCs) with compressed-domain video
features. The approach is evaluated on over 4.5 hours of
the publicly available AMI meetings dataset, which contains
challenges such as people standing up and walking out of the
room. We show a consistent improvement of about 34 % relative
in speaker error rate (21 % DER) compared to a
state-of-the-art audio-only baseline.