Vision
An Algorithmic Theory of Dependent Regularizers, Part 1: Submodular Structure
We present an exploration of the rich theoretical connections between several classes of regularized models, network flows, and recent results in submodular function theory. This work unifies key aspects of these problems under a common theory, leading to novel methods for working with several important models of interest in statistics, machine learning and computer vision. In Part 1, we review the concepts of network flows and submodular function optimization theory foundational to our results. We then examine the connections between network flows and the minimum-norm algorithm from submodular optimization, extending and improving several current results. This leads to a concise representation of the structure of a large class of pairwise regularized models important in machine learning, statistics and computer vision. In Part 2, we describe the full regularization path of a class of penalized regression problems with dependent variables that includes the graph-guided LASSO and total variation constrained models. This description also motivates a practical algorithm that efficiently finds the regularization path of the discretized version of TV-penalized models. Ultimately, our new algorithms scale up to high-dimensional problems with millions of variables.
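To make the phrase "pairwise regularized models" concrete, the sketch below evaluates a generic objective of that family: a squared-error fit term plus a weighted sum of absolute differences over graph edges, which is the usual form of graph-guided LASSO and discretized total variation penalties. The function name, the edge-list format, and the toy values are assumptions for illustration; the abstract does not state the exact objective the authors analyze.

```python
def pairwise_penalized_objective(x, y, edges, lam):
    """Evaluate 0.5 * ||x - y||^2 + lam * sum_{(i, j)} w_ij * |x_i - x_j|.

    x, y  : equal-length lists of floats (estimate and observations)
    edges : list of (i, j, w) tuples, one per dependent pair (hypothetical format)
    lam   : regularization strength
    """
    # Data-fit term: half the squared Euclidean distance to the observations.
    fit = 0.5 * sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    # Pairwise penalty: weighted absolute differences across graph edges.
    penalty = sum(w * abs(x[i] - x[j]) for i, j, w in edges)
    return fit + lam * penalty


# Toy chain graph 0 - 1 - 2, as in a 1-D total variation problem.
value = pairwise_penalized_objective(
    x=[1.0, 1.0, 3.0],
    y=[1.0, 2.0, 3.0],
    edges=[(0, 1, 1.0), (1, 2, 1.0)],
    lam=0.5,
)
```

Sweeping `lam` from zero upward and tracking the minimizer at each value is what the regularization path in Part 2 describes; this sketch only evaluates the objective and does not compute that path.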
Cross-Domain Sparse Coding
Sparse coding has shown its power as an effective data representation method. However, existing sparse coding approaches are limited to single-domain learning problems. In this paper, we extend sparse coding to the cross-domain learning problem, which tries to learn from a source domain to a target domain with a significantly different distribution. We impose the Maximum Mean Discrepancy (MMD) criterion to reduce the cross-domain distribution difference of the sparse codes, and also regularize the sparse codes by the class labels of the samples from both domains to increase their discriminative ability. Encouraging experimental results of the proposed cross-domain sparse coding algorithm on two challenging tasks (image classification across photograph and oil painting domains, and multi-user spam detection) show the advantage of the proposed method over other cross-domain data representation methods.
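As a rough illustration of the MMD criterion, the sketch below computes the empirical squared MMD between two sets of sparse codes under a linear kernel, where it reduces to the squared distance between the two domain means. The linear kernel, the function names, and the toy codes are assumptions for illustration; the abstract does not specify the kernel or the estimator used.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def mmd_linear(source_codes, target_codes):
    """Squared MMD with a linear kernel: ||mean(source) - mean(target)||^2."""
    mu_s = mean_vector(source_codes)
    mu_t = mean_vector(target_codes)
    return sum((a - b) ** 2 for a, b in zip(mu_s, mu_t))


# Hypothetical sparse codes for two samples per domain.
source = [[1.0, 0.0], [3.0, 0.0]]
target = [[0.0, 0.0], [0.0, 2.0]]
gap = mmd_linear(source, target)
```

Adding a term like this to the sparse coding objective penalizes codes whose domain means drift apart, which is the sense in which the criterion reduces the cross-domain distribution difference.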
From Maxout to Channel-Out: Encoding Information on Sparse Pathways
Motivated by an important insight from neuroscience, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called the "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
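The contrast between the two unit types can be sketched on a single group of activations. Maxout returns only the largest value of the group; under one reading of the description above, a channel-out group instead keeps the winning value in its own position and zeroes the rest, so the output also records which pathway was selected. The argmax selection rule and the function names are assumptions; the abstract does not define the selection function.

```python
def maxout(group):
    """Maxout: collapse a group of linear units to its largest activation."""
    return max(group)


def channel_out(group):
    """Channel-out (assumed argmax selection): keep the winning channel in
    place and zero the others, preserving the index of the winner."""
    winner = max(range(len(group)), key=lambda i: group[i])
    return [v if i == winner else 0.0 for i, v in enumerate(group)]


activations = [0.2, 1.5, -0.3]
m = maxout(activations)        # one scalar, winner index discarded
c = channel_out(activations)   # same width as the input, winner index kept
```

Because the channel-out output keeps its width, the next layer can apply different weights depending on which channel won, which is one way to read "pathways are actively selected".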
Towards a Neurocognitive Model of Visual Perception
Chakraborty, Arpan (North Carolina State University) | St. Amant, Robert (North Carolina State University)
Natural and artificial vision systems differ considerably in their underlying hardware and their method of information processing. Nevertheless, biological concepts are relevant, adaptable and useful in solving hard computer vision problems. This paper presents a biologically-inspired active vision framework that emulates early visual processing at the neuronal level to accomplish a range of visual tasks. Its emergent behavior is found to be qualitatively similar to that of humans in certain contexts, and its performance is shown to be comparable to that of computer vision algorithms on a saliency detection task. A neurocognitive model of visual perception based on this framework is motivated.
Boosting OCR Accuracy Using Crowdsourcing
Wang, Shuo-Yang (Academia Sinica) | Wang, Ming-Hung (National Taiwan University) | Chen, Kuan-Ta (Academia Sinica)
Book digitization is important for preserving ancient heritage. However, digitizing books involves a series of labor-intensive tasks, one of which is verifying optical character recognition (OCR) output. In this paper, we propose a crowdsourceable OCR verification method. Using our method, content holders are able to leverage the power of crowds to complete verification tasks while avoiding content leakage. Experimental results show that our method is more efficient and reliable than the traditional method.
Visual-Semantic Scene Understanding by Sharing Labels in a Context Network
Chakraborty, Ishani | Elgammal, Ahmed
We consider the problem of naming objects in complex, natural scenes containing widely varying object appearance and subtly different names. Informed by cognitive research, we propose an approach based on sharing context-based object hypotheses between visual and lexical spaces. To this end, we present the Visual Semantic Integration Model (VSIM), which represents object labels as entities shared between semantic and visual contexts and performs inference on a new image by updating labels through context switching. At the core of VSIM are a semantic Pachinko Allocation Model and a visual nearest neighbor Latent Dirichlet Allocation Model. For inference, we derive an iterative Data Augmentation algorithm that pools the label probabilities and maximizes the joint label posterior of an image. Our model surpasses the performance of state-of-the-art methods in several visual tasks on the challenging SUN09 dataset.
Geodesic-based Salient Object Detection
Saliency detection has been an intuitive way to provide useful cues for object detection and segmentation, as desired for many vision and graphics applications. In this paper, we present a robust method for salient object detection and segmentation. Rather than using various pixel-level contrast definitions, we exploit global image structures and propose a new geodesic method dedicated to salient object detection. In the proposed approach, a new geodesic scheme, geodesic tunneling, is introduced to handle textures and locally chaotic structures. With our new geodesic approach, a geodesic saliency map is estimated in correspondence with the spatial structures in an image. Experimental evaluation on a salient object benchmark dataset shows that our algorithm consistently outperforms a number of state-of-the-art saliency methods, yielding higher precision and better recall rates. Building on the robust saliency estimation, we also present an unsupervised hierarchical salient object cut scheme that simply uses adaptive saliency thresholding and attains the highest score in our F-measure test. We also apply our geodesic cut scheme to a number of image editing tasks, as demonstrated in additional experiments.
Towards Adapting ImageNet to Reality: Scalable Domain Adaptation with Implicit Low-rank Transformations
Rodner, Erik | Hoffman, Judy | Donahue, Jeff | Darrell, Trevor | Saenko, Kate
Images seen during test time are often not from the same distribution as images used for learning. This problem, known as domain shift, occurs when training classifiers from object-centric internet image databases and trying to apply them directly to scene understanding tasks. The consequence is often severe performance degradation, which is one of the major barriers to the application of classifiers in real-world systems. In this paper, we show how to learn transform-based domain adaptation classifiers in a scalable manner. The key idea is to exploit an implicit rank constraint, originating from a max-margin domain adaptation formulation, to make optimization tractable. Experiments show that the transformation between domains can be very efficiently learned from data and easily applied to new categories. This begins to bridge the gap between large-scale internet image collections and object images captured in everyday environments.
Learning Features and their Transformations by Spatial and Temporal Spherical Clustering
Dutta, Jayanta K. | Banerjee, Bonny
Learning features invariant to arbitrary transformations in the data is a requirement for any recognition system, biological or artificial. It is now widely accepted that simple cells in the primary visual cortex respond to features while complex cells respond to features invariant to different transformations. We present a novel two-layered feedforward neural model that learns features in the first layer by spatial spherical clustering and invariance to transformations in the second layer by temporal spherical clustering. Learning occurs in an online and unsupervised manner following the Hebbian rule. When exposed to natural videos acquired by a camera mounted on a cat's head, the first and second layer neurons in our model develop simple and complex cell-like receptive field properties. The model can also predict, by learning lateral connections among the first layer neurons. A topographic map of their spatial features emerges when the flow of activation between first layer neurons that fire in close temporal proximity decays exponentially with the distance between them, thereby minimizing the pooling length in an online manner simultaneously with feature learning.
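For readers unfamiliar with spherical clustering, the sketch below shows one online, winner-take-all update of generic spherical k-means: inputs and cluster centers live on the unit sphere, similarity is the dot product, and the winning center takes a Hebbian-style step toward the input before being renormalized. This is a standard textbook variant offered as an assumption; the learning rate, function names, and update rule are not taken from the paper.

```python
import math


def normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def online_spherical_step(centers, x, lr=0.1):
    """Update the best-matching center in place; return the winner's index."""
    x = normalize(x)
    # With unit-length vectors, cosine similarity is just the dot product.
    sims = [sum(c_i * x_i for c_i, x_i in zip(c, x)) for c in centers]
    winner = max(range(len(centers)), key=lambda k: sims[k])
    # Hebbian-style move toward the input, then project back onto the sphere.
    moved = [c_i + lr * x_i for c_i, x_i in zip(centers[winner], x)]
    centers[winner] = normalize(moved)
    return winner


centers = [[1.0, 0.0], [0.0, 1.0]]
winner = online_spherical_step(centers, [3.0, 1.0])
```

Applying the same step to consecutive first-layer outputs rather than to raw image patches is one plausible way to picture the temporal clustering stage, though the abstract does not spell out that procedure.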
Histogram of Oriented Displacements (HOD): Describing Trajectories of Human Joints for Action Recognition
Gowayyed, Mohammad Abdelaziz (Alexandria University) | Torki, Marwan (Alexandria University) | Hussein, Mohammed Elsayed (Alexandria University) | El-Saban, Motaz (Microsoft Research)
Creating descriptors for trajectories has many applications in robotics/human motion analysis and video copy detection. Here, we propose a novel descriptor for 2D trajectories: Histogram of Oriented Displacements (HOD). Each displacement in the trajectory votes with its length in a histogram of orientation angles. 3D trajectories are described by the HOD of their three projections. We use HOD to describe the 3D trajectories of body joints to recognize human actions, which is a challenging machine vision task with applications in human-robot/machine interaction, interactive entertainment, multimedia information retrieval, and surveillance. The descriptor is fixed-length, scale-invariant and speed-invariant. Experiments on the MSR-Action3D and HDM05 datasets show that the descriptor outperforms the state of the art when using off-the-shelf classification tools.
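The voting scheme described above can be sketched in a few lines. Each displacement between consecutive points adds its length to the histogram bin of its orientation angle, and a 3D trajectory is described by concatenating the histograms of its three planar projections. The number of bins, the normalization by total path length (one simple way to obtain scale invariance), and the function names are assumptions; the abstract does not give these details.

```python
import math


def hod(trajectory, bins=8):
    """Histogram of oriented displacements for a list of (x, y) points."""
    hist = [0.0] * bins
    for (x0, y0), (x1, y1) in zip(trajectory, trajectory[1:]):
        dx, dy = x1 - x0, y1 - y0
        length = math.hypot(dx, dy)
        if length == 0.0:
            continue  # a stationary step has no orientation to vote for
        angle = math.atan2(dy, dx) % (2 * math.pi)  # map to [0, 2*pi)
        idx = min(int(angle / (2 * math.pi) * bins), bins - 1)
        hist[idx] += length  # the displacement votes with its length
    total = sum(hist)
    # Assumed normalization: divide by total path length.
    return [h / total for h in hist] if total > 0 else hist


def hod_3d(trajectory, bins=8):
    """Concatenate the HODs of the xy, xz and yz projections of (x, y, z) points."""
    xy = [(x, y) for x, y, z in trajectory]
    xz = [(x, z) for x, y, z in trajectory]
    yz = [(y, z) for x, y, z in trajectory]
    return hod(xy, bins) + hod(xz, bins) + hod(yz, bins)


path = [(0.0, 0.0), (2.0, 1.0), (3.0, 3.0)]
descriptor = hod(path, bins=8)
joint_descriptor = hod_3d([(0.0, 0.0, 0.0), (1.0, 2.0, 3.0), (2.0, 2.0, 5.0)])
```

The output length depends only on `bins`, not on the number of points, which matches the fixed-length property claimed for the descriptor.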