Image Processing
A Discriminative Latent Model of Image Region and Object Tag Correspondence
We propose a discriminative latent model for annotating images with unaligned object-level textual annotations. Instead of using the bag-of-words image representation currently popular in the computer vision community, our model explicitly captures more intricate relationships underlying visual and textual information. In particular, we model the mapping that translates image regions to annotations. This mapping allows us to relate image regions to their corresponding annotation terms. We also model the overall scene label as latent information, which allows us to cluster test images. Our training data consist of images and their associated annotations, but we do not have access to the ground-truth region-to-annotation mapping or the overall scene label. We develop a novel variant of the latent SVM framework to model both as latent variables. Our experimental results demonstrate the effectiveness of the proposed model compared with other baseline methods.
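To make the latent structure concrete, a latent-SVM scoring function of the kind described here can be written schematically (our notation, not the paper's) as

$$f_w(x, y) = \max_{h \in \mathcal{H}} \; w^\top \Phi(x, y, h),$$

where $x$ is the image, $y$ its set of annotation terms, and the latent variable $h$ bundles the region-to-annotation mapping and the scene label; training alternates between inferring $h$ and updating $w$ under the usual max-margin constraints.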
Predictive Subspace Learning for Multi-view Data: a Large Margin Approach
Chen, Ning, Zhu, Jun, Xing, Eric P.
Learning from multi-view data is important in many applications, such as image classification and annotation. In this paper, we present a large-margin learning framework to discover a predictive latent subspace representation shared by multiple views. Our approach is based on an undirected latent space Markov network that satisfies a weak conditional independence assumption: multi-view observations and response variables are independent given a set of latent variables. We provide efficient inference and parameter estimation methods for the latent subspace model. Finally, we demonstrate the advantages of large-margin learning on real video and web image data for discovering predictive latent representations and improving performance on image classification, annotation and retrieval.
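The weak conditional independence assumption can be stated schematically (our notation) as

$$p(\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(K)}, y \mid \mathbf{z}) = p(y \mid \mathbf{z}) \prod_{k=1}^{K} p(\mathbf{x}^{(k)} \mid \mathbf{z}),$$

i.e., the $K$ views $\mathbf{x}^{(k)}$ and the response $y$ decouple once the latent representation $\mathbf{z}$ is given, which is what makes $\mathbf{z}$ both shared across views and predictive.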
Size Matters: Metric Visual Search Constraints from Monocular Metadata
Fritz, Mario, Saenko, Kate, Darrell, Trevor
Metric constraints are known to be highly discriminative for many objects, but if training is limited to data captured from a particular 3-D sensor the quantity of training data may be severely limited. In this paper, we show how a crucial aspect of 3-D information (object and feature absolute size) can be added to models learned from commonly available online imagery, without use of any 3-D sensing or reconstruction at training time. Such models can be utilized at test time together with explicit 3-D sensing to perform robust search. Our model uses a '2.1D' local feature, which combines traditional appearance gradient statistics with an estimate of average absolute depth within the local window. We show how category size information can be obtained from online images by exploiting relatively ubiquitous metadata fields specifying camera intrinsics. We develop an efficient metric branch-and-bound algorithm for our search task, imposing 3-D size constraints as part of an optimal search for a set of features which indicate the presence of a category. Experiments on test scenes captured with a traditional stereo rig are shown, exploiting training data from purely monocular sources with associated EXIF metadata.
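As a minimal sketch of how camera intrinsics in image metadata yield metric size, the pinhole model below converts an EXIF-style focal length to pixels, and then a pixel extent plus depth to meters; the function names, sensor width, and numbers are ours and purely illustrative.

    # Pinhole-model sketch (our names); EXIF supplies focal_mm, and the
    # sensor width is looked up or assumed for the camera model.
    def focal_length_px(focal_mm, sensor_width_mm, image_width_px):
        """Convert an EXIF focal length (mm) into pixels."""
        return focal_mm * image_width_px / sensor_width_mm

    def metric_extent_m(extent_px, depth_m, focal_px):
        """Pinhole model: metric extent = pixel extent * depth / focal length."""
        return extent_px * depth_m / focal_px

    f_px = focal_length_px(35.0, 36.0, 4000)   # full-frame sensor, 4000 px wide
    print(metric_extent_m(300, 2.0, f_px))     # 300 px object at 2 m -> ~0.15 m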
Group Sparse Coding with a Laplacian Scale Mixture Prior
Garrigues, Pierre, Olshausen, Bruno A.
We propose a class of sparse coding models that utilizes a Laplacian Scale Mixture (LSM) prior to model dependencies among coefficients. Each coefficient is modeled as Laplacian-distributed with a variable scale parameter, with a Gamma distribution prior over the scale parameter. We show that, due to the conjugacy of the Gamma prior, it is possible to derive efficient inference procedures for both the coefficients and the scale parameter. When the scale parameters of a group of coefficients are combined into a single variable, it is possible to describe the dependencies that occur due to common amplitude fluctuations among coefficients, which have been shown to constitute a large fraction of the redundancy in natural images. We show that, as a consequence of this group sparse coding, the resulting inference of the coefficients follows a divisive normalization rule, and that this may be efficiently implemented in a network architecture similar to that which has been proposed to occur in primary visual cortex. We also demonstrate improvements in image coding and compressive sensing recovery using the LSM model.
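For concreteness, with a Laplacian likelihood $p(x \mid \lambda) = \frac{\lambda}{2} e^{-\lambda |x|}$ and a conjugate $\mathrm{Gamma}(\alpha, \beta)$ prior on the rate $\lambda$, the marginal over a coefficient integrates in closed form (standard LSM algebra; our symbols, not necessarily the paper's):

$$p(x) = \int_0^\infty \frac{\lambda}{2} e^{-\lambda|x|} \cdot \frac{\beta^\alpha \lambda^{\alpha-1} e^{-\beta\lambda}}{\Gamma(\alpha)} \, d\lambda = \frac{\alpha \beta^\alpha}{2\,(|x| + \beta)^{\alpha+1}},$$

a density with heavier tails than the Laplacian itself, which is what lets the model absorb shared amplitude fluctuations into the common scale variable.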
Epitome driven 3-D Diffusion Tensor image segmentation: on extracting specific structures
Motwani, Kamiya, Adluru, Nagesh, Hinrichs, Chris, Alexander, Andrew, Singh, Vikas
We study the problem of segmenting specific white matter structures of interest from Diffusion Tensor (DT-MR) images of the human brain. This is an important requirement in many neuroimaging studies: for instance, to evaluate whether a brain structure exhibits group-level differences as a function of disease in a set of images. Typically, interactive expert-guided segmentation has been the method of choice for such applications, but this is tedious for the large datasets common today. To address this problem, we endow an image segmentation algorithm with 'advice' encoding some global characteristics of the region(s) we want to extract. This is accomplished by constructing (using expert-segmented images) an epitome of a specific region as a histogram over a bag of 'words' (e.g., suitable feature descriptors). Given such a representation, the problem reduces to segmenting a new brain image with additional constraints that enforce consistency between the segmented foreground and the pre-specified histogram over features. We present combinatorial approximation algorithms to incorporate such domain-specific constraints into Markov Random Field (MRF) segmentation. Making use of recent results on image co-segmentation, we derive effective solution strategies for our problem. We provide an analysis of solution quality, and present promising experimental evidence showing that many structures of interest in neuroscience can be extracted reliably from 3-D brain image volumes using our algorithm.
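A schematic form of such an 'advice'-constrained segmentation objective (our notation) is

$$\min_S \; \sum_i \theta_i(s_i) + \sum_{(i,j) \in \mathcal{E}} \theta_{ij}(s_i, s_j) + \gamma\, d\big(H(S), \hat{H}\big),$$

where the first two terms form a standard MRF energy over voxel labels $s_i$, $H(S)$ is the bag-of-words histogram of the segmented foreground, $\hat{H}$ is the expert-built epitome, and $d(\cdot,\cdot)$ is a histogram distance whose influence is weighted by $\gamma$.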
Object Bank: A High-Level Image Representation for Scene Classification & Semantic Feature Sparsification
Li, Li-jia, Su, Hao, Fei-fei, Li, Xing, Eric P.
Robust low-level image features have been proven to be effective representations for a variety of visual recognition tasks such as object recognition and scene classification; but pixels, or even local image patches, carry little semantic meaning. For high-level visual tasks, such low-level image representations may be insufficient. In this paper, we propose a high-level image representation, called the Object Bank, where an image is represented as a scale-invariant response map of a large number of pre-trained generic object detectors, blind to the testing dataset or visual task. Leveraging the Object Bank representation, superior performance on high-level visual recognition tasks can be achieved with simple off-the-shelf classifiers such as logistic regression and linear SVM. Sparsity algorithms make our representation more efficient and scalable for large scene datasets, and reveal semantically meaningful feature patterns.
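As a sketch of the "off-the-shelf classifier" step, the toy code below fits an $\ell_1$-regularized logistic regression on pooled detector-response features; the feature dimensions and data are placeholders of our choosing, not the Object Bank itself.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.random((200, 177 * 12))    # placeholder: 177 detectors x 12 pooled responses
    y = rng.integers(0, 8, size=200)   # placeholder scene labels

    # The l1 penalty both classifies and sparsifies the detector responses.
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
    clf.fit(X, y)
    print(clf.score(X, y))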
Pose-Sensitive Embedding by Nonlinear NCA Regression
Taylor, Graham W., Fergus, Rob, Williams, George, Spiro, Ian, Bregler, Christoph
This paper tackles the complex problem of visually matching people in similar pose but with different clothes, background, and other appearance changes. We achieve this with a novel method for learning a nonlinear embedding based on several extensions to the Neighborhood Component Analysis (NCA) framework. Our method is convolutional, enabling it to scale to realistically-sized images. By cheaply labeling the head and hands in large video databases through Amazon Mechanical Turk (a crowd-sourcing service), we can use the task of localizing the head and hands as a proxy for determining body pose. We apply our method to challenging real-world data and show that it can generalize beyond hand localization to infer a more general notion of body pose. We evaluate our method quantitatively against other embedding methods. We also demonstrate that real-world performance can be improved through the use of synthetic data.
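The soft-neighbor weighting at the core of NCA-style objectives can be sketched in a few lines of NumPy (the paper's convolutional architecture and regression extension are not reproduced here; names are ours):

    import numpy as np

    def nca_neighbor_probs(Z):
        """Z: (n, d) embeddings; returns p[i, j], the soft probability
        that point i selects point j as its neighbor."""
        d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)  # pairwise sq. dists
        np.fill_diagonal(d2, np.inf)           # a point never picks itself
        logits = -d2
        logits -= logits.max(axis=1, keepdims=True)  # stabilize the softmax
        p = np.exp(logits)
        return p / p.sum(axis=1, keepdims=True)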
PADDLE: Proximal Algorithm for Dual Dictionaries LEarning
Basso, Curzio, Santoro, Matteo, Verri, Alessandro, Villa, Silvia
Recently, considerable research effort has been devoted to the design of methods for learning overcomplete dictionaries for sparse coding from data. However, learned dictionaries require the solution of an optimization problem for coding new data. In order to overcome this drawback, we propose an algorithm aimed at learning both a dictionary and its dual: a linear mapping that directly performs the coding. By leveraging proximal methods, our algorithm jointly minimizes the reconstruction error of the dictionary and the coding error of its dual; the sparsity of the representation is induced by an $\ell_1$-based penalty on its coefficients. The results obtained on synthetic data and real images show that the algorithm is capable of recovering the expected dictionaries. Furthermore, on a benchmark dataset, we show that the image features obtained from the dual matrix yield state-of-the-art classification performance while being much less computationally intensive.
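Schematically, the joint problem has the form (our notation, simplified)

$$\min_{D, C, U} \; \|X - DU\|_F^2 + \eta\,\|U - CX\|_F^2 + \tau\,\|U\|_1,$$

where $D$ is the (synthesis) dictionary, $C$ is its dual, so that coding new data reduces to the single matrix product $CX$, and $U$ holds the sparse codes; the non-smooth $\ell_1$ term is the part handled by proximal (soft-thresholding) steps.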
Tensor Product of Correlated Textual and Visual Features: A Quantum Theory Inspired Image Retrieval Framework
Wang, Jun, Song, Dawei, Kaliciak, Leszek (all Robert Gordon University)
In multimedia information retrieval, where a document may contain both textual and visual content features, the ranking of documents is often computed by heuristically combining the feature spaces of different media types or combining the ranking scores computed independently from different feature spaces. In this paper, we propose a principled approach inspired by quantum theory. Specifically, we propose a tensor-product-based model aiming to represent textual and visual content features of an image as a non-separable composite system. The ranking scores of the images are then computed in the form of a quantum measurement. In addition, the correlations between features of different media types are incorporated in the framework. Experiments on ImageClef2007 show promising performance of the tensor-based approach.
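As a toy sketch of tensor-product ranking (our construction, not the paper's exact measurement operator), the snippet below forms composite text-visual representations by an outer product and scores by their inner product:

    import numpy as np

    def tensor_score(q_text, q_vis, d_text, d_vis):
        Q = np.outer(q_text, q_vis)    # composite query representation
        D = np.outer(d_text, d_vis)    # composite document representation
        # <q_t (x) q_v, d_t (x) d_v> = (q_t . d_t) * (q_v . d_v); the paper's
        # framework additionally injects cross-modal correlations, so its
        # composite states need not be separable like this toy version.
        return float(np.vdot(Q, D))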
A Simple CW-SSIM Kernel-based Nearest Neighbor Method for Handwritten Digit Classification
Wang, Jiheng, Fan, Guangzhe, Wang, Zhou
We propose a simple kernel-based nearest neighbor approach for handwritten digit classification. The "distance" here is actually a kernel defining the similarity between two images. We carefully study the effects of different numbers of neighbors and weighting schemes and report the results. With only a few nearest neighbors (the most similar images) voting, the test-set error rate on the MNIST database reaches about 1.5%-2.0%, which is very close to that of many advanced models.
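A minimal sketch of similarity-weighted voting, with a generic kernel standing in for CW-SSIM (which is not implemented here; names are ours):

    import numpy as np

    def knn_predict(sims, train_labels, k=5):
        """sims: kernel similarities from one test image to all training images."""
        idx = np.argsort(sims)[::-1][:k]       # k most similar training images
        votes = {}
        for i in idx:
            votes[train_labels[i]] = votes.get(train_labels[i], 0.0) + sims[i]
        return max(votes, key=votes.get)       # similarity-weighted vote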