Learning Multimodal Temporal Representation for Dubbing Detection in Broadcast Media

Type of publication:	Conference paper
Citation:	Le_ACMMM_2016
Booktitle:	ACM Multimedia
Year:	2016
Month:	October
Publisher:	ACM
Location:	Amsterdam
Abstract:	Person discovery in the absence of prior identity knowledge requires accurate association of visual and auditory cues. In broadcast data, multimodal analysis faces additional challenges due to narrated voices over muted scenes or dubbing in different languages. To address these challenges, we define and analyze the problem of dubbing detection in broadcast data, which has not been explored before. We propose a method to represent the temporal relationship between the auditory and visual streams. This method consists of canonical correlation analysis (CCA) to learn a joint multimodal space, and long short-term memory (LSTM) networks to model cross-modality temporal dependencies. Our contributions also include the introduction of a newly acquired dataset of face-speech segments from TV data, which we have made publicly available. The proposed method achieves promising performance on this real-world dataset compared with several baselines.
Keywords:
Projects	EUMSSI
Authors	Le, Nam; Odobez, Jean-Marc
Attachments
Le_ACMMM_2016.pdf
Notes
