ARTICLE
gatica05db/IDIAP
Audio-visual probabilistic tracking of multiple speakers in meetings
Gatica-Perez, Daniel
Lathoud, Guillaume
Odobez, Jean-Marc
McCowan, Iain A.
EXTERNAL
https://publications.idiap.ch/attachments/reports/2005/rr-05-27.pdf
PUBLIC
https://publications.idiap.ch/index.php/publications/showcite/gatica05d
Related documents
IEEE Trans. on Audio, Speech, and Language Processing, accepted for publication.
2006
March 2006
IDIAP-RR 05-27
Tracking speakers in multiparty conversations constitutes a fundamental task for automatic meeting analysis. In this paper, we present a probabilistic approach to jointly track the location and speaking activity of multiple speakers in a multisensor meeting room equipped with a small microphone array and multiple uncalibrated cameras. Our framework is based on a mixed-state dynamic graphical model defined on a multiperson state space, which includes the explicit definition of a proximity-based interaction model. The model integrates audio-visual (AV) data through a novel observation model. Audio observations are derived from a source localization algorithm. Visual observations are based on models of the shape and spatial structure of human heads. Approximate inference in our model, needed given its complexity, is performed with a Markov Chain Monte Carlo particle filter (MCMC-PF), which results in high sampling efficiency. We present results, based on an objective evaluation procedure, showing that our framework (1) is capable of locating and tracking the position and speaking activity of multiple meeting participants engaged in real conversations with good accuracy; (2) can deal with cases of visual clutter and partial occlusion; and (3) significantly outperforms a traditional sampling-based approach.
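To make the inference scheme named in the abstract concrete, the following is a minimal Python sketch of one filtering step of a joint-state MCMC particle filter. It is not the paper's implementation: the likelihood is a hypothetical Gaussian stand-in for the AV observation model, the interaction term is a simple pairwise-repulsion stand-in for the proximity-based interaction model, and all parameters (noise scales, chain length) are arbitrary. It only illustrates why MCMC-PF samples efficiently: each Metropolis-Hastings move perturbs one target of the joint state, so cost grows linearly rather than exponentially with the number of targets.

import numpy as np

rng = np.random.default_rng(0)

def likelihood(state, observation):
    # Illustrative stand-in: Gaussian score around observed positions.
    # The paper instead combines audio (source localization) and
    # visual (head shape/structure) terms.
    return np.exp(-0.5 * np.sum((state - observation) ** 2) / 4.0)

def interaction(state):
    # Illustrative proximity-based interaction: penalize joint states
    # in which two targets come too close (e.g., overlapping heads).
    penalty = 1.0
    n = state.shape[0]
    for i in range(n):
        for j in range(i + 1, n):
            d = np.linalg.norm(state[i] - state[j])
            penalty *= 1.0 - np.exp(-0.5 * (d / 10.0) ** 2)
    return penalty

def mcmc_pf_step(particles, observation, n_iters=500, sigma=2.0):
    """One step of a joint-state MCMC particle filter (sketch).

    particles: (N, n_targets, 2) array of joint samples from time t-1.
    Burn-in is omitted for brevity; a real chain would discard its
    first iterations before keeping samples.
    """
    N, n_targets, _ = particles.shape
    # Predict: diffuse each joint sample with random-walk dynamics.
    predicted = particles + rng.normal(0.0, sigma, particles.shape)
    # Start the chain from one randomly chosen predicted sample.
    current = predicted[rng.integers(N)].copy()
    p_current = likelihood(current, observation) * interaction(current)
    new_particles = np.empty_like(particles)
    kept = 0
    for it in range(n_iters):
        proposal = current.copy()
        k = rng.integers(n_targets)            # move one target at a time
        proposal[k] += rng.normal(0.0, sigma, 2)
        p_prop = likelihood(proposal, observation) * interaction(proposal)
        if rng.random() < p_prop / max(p_current, 1e-300):  # MH acceptance
            current, p_current = proposal, p_prop
        if it % (n_iters // N) == 0 and kept < N:  # thin the chain
            new_particles[kept] = current
            kept += 1
    new_particles[kept:] = current             # pad if thinning fell short
    return new_particles

# Toy usage: two targets observed near (10, 10) and (40, 40).
obs = np.array([[10.0, 10.0], [40.0, 40.0]])
particles = obs + rng.normal(0.0, 5.0, (50, 2, 2))
particles = mcmc_pf_step(particles, obs)
print(particles.mean(axis=0))                  # posterior mean per target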
REPORT
gatica05d/IDIAP
Audio-visual probabilistic tracking of multiple speakers in meetings
Gatica-Perez, Daniel
Lathoud, Guillaume
Odobez, Jean-Marc
McCowan, Iain A.
EXTERNAL
https://publications.idiap.ch/attachments/reports/2005/rr-05-27.pdf
PUBLIC
Idiap-RR-27-2005
2005
IDIAP
Martigny, Switzerland
submitted for publication
Tracking speakers in multiparty conversations constitutes a fundamental task for automatic meeting analysis. In this paper, we present a probabilistic approach to jointly track the location and speaking activity of multiple speakers in a multisensor meeting room equipped with a small microphone array and multiple uncalibrated cameras. Our framework is based on a mixed-state dynamic graphical model defined on a multiperson state space, which includes the explicit definition of a proximity-based interaction model. The model integrates audio-visual (AV) data through a novel observation model. Audio observations are derived from a source localization algorithm. Visual observations are based on models of the shape and spatial structure of human heads. Approximate inference in our model, needed given its complexity, is performed with a Markov Chain Monte Carlo particle filter (MCMC-PF), which results in high sampling efficiency. We present results, based on an objective evaluation procedure, showing that our framework (1) is capable of locating and tracking the position and speaking activity of multiple meeting participants engaged in real conversations with good accuracy; (2) can deal with cases of visual clutter and partial occlusion; and (3) significantly outperforms a traditional sampling-based approach.