Large numbers of descriptors and large codebooks are needed for good results and this becomes slow using k-means. We introduce Extremely Randomized Clustering Forests - ensembles of randomly created clustering trees - and show that these provide more accurate results, much faster training and testing and good resistance to background clutter in several state-of-the-art image classification tasks.
Department of Computer Science University of Rochester Rochester, NY 14627 Abstract One of the advantages of supervised learning is that the final error metric isavailable during training. For classifiers, the algorithm can directly reduce the number of misclassifications on the training set. Unfortunately, whenmodeling human learning or constructing classifiers for autonomous robots,supervisory labels are often not available or too expensive. In this paper we show that we can substitute for the labels by making use of structure between the pattern distributions to different sensory modalities.We show that minimizing the disagreement between the outputs of networks processing patterns from these different modalities is a sensible approximation to minimizing the number of misclassifications in each modality, and leads to similar results. Using the Peterson-Barney vowel dataset we show that the algorithm performs well in finding appropriate placementfor the codebook vectors particularly when the confuseable classes are different for the two modalities. 1 INTRODUCTION This paper addresses the question of how a human or autonomous robot can learn to classify new objects without experience with previous labeled examples.
Image clustering and visual codebook learning are two fundamental problems in computer vision and they are tightly related. On one hand, a good codebook can generate effective feature representations which largely affect clustering performance. On the other hand, class labels obtained from image clustering can serve as supervised information to guide codebook learning. Traditionally, these two processes are conducted separately and their correlation is generally ignored.In this paper, we propose a Double Layer Gaussian Mixture Model (DLGMM) to simultaneously perform image clustering and codebook learning. In DLGMM, two tasks are seamlessly coupled and can mutually promote each other. Cluster labels and codebook are jointly estimated to achieve the overall best performance. To incorporate the spatial coherence between neighboring visual patches, we propose a Spatially Coherent DLGMM which uses a Markov Random Field to encourage neighboring patches to share the same visual word label.We use variational inference to approximate the posterior of latent variables and learn model parameters.Experiments on two datasets demonstrate the effectiveness of two models.
Ji, Rongrong (Peking University and Harbin Institute of Technology) | Duan, Ling-Yu (Peking University) | Chen, Jie (Peking University) | Yao, Hongxun (Harbin Institute of Technology) | Huang, Tiejun (Peking University) | Gao, Wen (Peking University)
In this paper, we propose to extract a compact yet discriminative visual descriptor directly on the mobile device, which tackles the wireless query transmission latency in mobile landmark search. This descriptor is offline learnt from the location contexts of geo-tagged Web photos from both Flickr and Panoramio with two phrases: First, we segment the landmark photo collections into discrete geographical regions using a Gaussian Mixture Model [Stauffer et al., 2000]. Second, a ranking sensitive vocabulary boosting is introduced to learn a compact codebook within each region. To tackle the locally optimal descriptor learning caused by imprecise geographical segmentation, we further iterate above phrases by feedback an "entropy" based descriptor compactness into a prior distribution to constrain the Gaussian mixture modeling. Consequently, when entering a specific geographical region, the codebook in the mobile device is downstream adapted, which ensures efficient extraction of compact descriptor, its low bit rate transmission, as well as promising discrimination ability. We deploy our descriptor within both HTC and iPhone mobile phones, testing landmark search in typical areas included Beijing, New York, and Barcelona containing one million images. Our learning descriptor outperforms alternative compact descriptors [Chen et al., 2009][Chen et al., 2010][Chandrasekhar et al., 2009a][Chandrasekhar et al., 2009b] with a large margin.
Cross-modal similarity retrieval is a problem about designing a retrieval system that supports querying across content modalities, e.g., using an image to retrieve for texts. This paper presents a compact coding solution for efficient cross-modal retrieval, with a focus on the quantization approach which has already shown the superior performance over the hashing solutions in single-modal similarity retrieval. We propose a collective deep quantization (CDQ) approach, which is the first attempt to introduce quantization in end-to-end deep architecture for cross-modal retrieval. The major contribution lies in jointly learning deep representations and the quantizers for both modalities using carefully-crafted hybrid networks and well-specified loss functions. In addition, our approach simultaneously learns the common quantizer codebook for both modalities through which the cross-modal correlation can be substantially enhanced. CDQ enables efficient and effective cross-modal retrieval using inner product distance computed based on the common codebook with fast distance table lookup. Extensive experiments show that CDQ yields state of the art cross-modal retrieval results on standard benchmarks.