ARTICLE quelhas:pami:2007/IDIAP
Title: A Thousand Words in a Scene
Authors: Pedro Quelhas, Jean-Marc Odobez, Daniel Gatica-Perez, Tinne Tuytelaars
Venue: IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007
PDF: https://publications.idiap.ch/attachments/papers/2007/quelhas-pami-2007.pdf
Related report: IDIAP-RR 05-40, https://publications.idiap.ch/index.php/publications/showcite/quelhas:rr05-40
Abstract: This paper presents a novel approach to visual scene modeling and classification that investigates the combined use of text modeling methods and local invariant features. Our work attempts to elucidate (1) whether a text-like bag-of-visterms representation (a histogram of quantized local visual features) is suitable for scene (rather than object) classification, (2) whether analogies between discrete scene representations and text documents exist, and (3) whether unsupervised latent-space models can serve both as feature extractors for the classification task and as tools for discovering patterns of visual co-occurrence. We validate our approach with experiments on several data sets, presenting and discussing results on each of these issues. First, extensive experiments on binary and multi-class scene classification tasks over a 9,500-image data set show that the bag-of-visterms representation consistently outperforms classical scene classification approaches. On other data sets, our approach matches or outperforms more recent, more complex methods. We also show that Probabilistic Latent Semantic Analysis (PLSA) generates a compact scene representation that is discriminative enough for accurate classification and more robust than the bag-of-visterms representation when less labeled training data is available. Finally, through aspect-based image ranking experiments, we show that PLSA automatically extracts visually meaningful scene patterns, making such a representation useful for browsing image collections.
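The bag-of-visterms representation mentioned in the abstract can be sketched as follows. This is an illustrative assumption, not the paper's implementation: a plain k-means (Lloyd's algorithm) quantizer stands in for the paper's learned codebook over local invariant descriptors, and all function names here are hypothetical.

```python
import numpy as np

def build_codebook(descriptors, n_visterms=1000, n_iter=20, seed=0):
    """Quantize local descriptors with plain k-means (Lloyd's algorithm).

    descriptors: (n_points, dim) array of local feature descriptors
    pooled from training images. Returns the (n_visterms, dim) codebook.
    """
    rng = np.random.default_rng(seed)
    # initialize centers from randomly chosen descriptors
    centers = descriptors[rng.choice(len(descriptors), n_visterms, replace=False)]
    for _ in range(n_iter):
        # assign each descriptor to its nearest center (squared Euclidean)
        dists = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(1)
        # move each center to the mean of its assigned descriptors
        for k in range(n_visterms):
            pts = descriptors[labels == k]
            if len(pts):
                centers[k] = pts.mean(0)
    return centers

def bag_of_visterms(descriptors, centers):
    """Histogram of quantized local features: the 'text document' for one image."""
    dists = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    labels = dists.argmin(1)
    hist = np.bincount(labels, minlength=len(centers)).astype(float)
    # normalize so images with different numbers of keypoints are comparable
    return hist / hist.sum()
```

The resulting per-image histogram is the discrete, text-like representation the abstract refers to; it can be fed directly to a classifier or to a latent-space model such as PLSA.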
REPORT quelhas:rr05-40/IDIAP
Title: A Thousand Words in a Scene
Authors: Pedro Quelhas, Jean-Marc Odobez, Daniel Gatica-Perez, Tinne Tuytelaars
Report number: Idiap-RR-40-2005, IDIAP, 2005
PDF: https://publications.idiap.ch/attachments/reports/2005/quelhas-idiap-rr-05-40.pdf
Abstract: Identical to the journal version above.
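The PLSA model used in both records above for compact scene representation and aspect-based ranking can be sketched with a minimal EM fit on a document-visterm count matrix. This is a generic textbook PLSA sketch under assumed shapes and names, not the paper's code; the dense responsibility tensor is only practical for small vocabularies.

```python
import numpy as np

def plsa(counts, n_aspects=4, n_iter=50, seed=0):
    """Fit PLSA by EM: P(w|d) = sum_z P(z|d) P(w|z).

    counts: (n_docs, n_words) matrix of visterm counts (one row per image).
    Returns p_z_d, the aspect mixture per document (the compact scene
    representation), and p_w_z, the visterm distribution per aspect
    (usable for aspect-based image ranking).
    """
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    p_z_d = rng.dirichlet(np.ones(n_aspects), size=n_docs)   # P(z|d), rows sum to 1
    p_w_z = rng.dirichlet(np.ones(n_words), size=n_aspects)  # P(w|z), rows sum to 1
    for _ in range(n_iter):
        # E-step: responsibilities P(z|d,w), shape (n_docs, n_words, n_aspects)
        joint = p_z_d[:, None, :] * p_w_z.T[None, :, :]
        joint /= joint.sum(2, keepdims=True) + 1e-12
        # M-step: re-estimate both distributions from count-weighted responsibilities
        weighted = counts[:, :, None] * joint
        p_w_z = weighted.sum(0).T
        p_w_z /= p_w_z.sum(1, keepdims=True)
        p_z_d = weighted.sum(1)
        p_z_d /= p_z_d.sum(1, keepdims=True)
    return p_z_d, p_w_z
```

Each row of `p_z_d` is a low-dimensional mixture over latent aspects, which is the compact, more label-efficient representation the abstract contrasts with the raw bag-of-visterms histogram; ranking images by a single column of `p_z_d` corresponds to the aspect-based image ranking experiments.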