mapping
Shape and Material from Sound
Hearing an object falling onto the ground, humans can recover rich information including its rough shape, material, and falling height. In this paper, we build machines to approximate such competency. We first mimic human knowledge of the physical world by building an efficient, physics-based simulation engine. Then, we present an analysis-by-synthesis approach to infer properties of the falling object. We further accelerate the process by learning a mapping from a sound wave to object properties, and using the predicted values to initialize the inference. This mapping can be viewed as an approximation of human commonsense learned from past experience. Our model performs well on both synthetic audio clips and real recordings without requiring any annotated data. We conduct behavior studies to compare human responses with ours on estimating object shape, material, and falling height from sound. Our model achieves near-human performance.
Gaussian process based nonlinear latent structure discovery in multivariate spike train data
A large body of recent work focuses on methods for extracting low-dimensional latent structure from multi-neuron spike train data. Most such methods employ either linear latent dynamics or linear mappings from latent space to log spike rates. Here we propose a doubly nonlinear latent variable model that can identify low-dimensional structure underlying apparently high-dimensional spike train data. We introduce the Poisson Gaussian-Process Latent Variable Model (P-GPLVM), which consists of Poisson spiking observations and two underlying Gaussian processes--one governing a temporal latent variable and another governing a set of nonlinear tuning curves. The use of nonlinear tuning curves enables discovery of low-dimensional latent structure even when spike responses exhibit high linear dimensionality (e.g., as found in hippocampal place cell codes). To learn the model from data, we introduce the decoupled Laplace approximation, a fast approximate inference method that allows us to efficiently optimize the latent path while marginalizing over tuning curves. We show that this method outperforms previous Laplace-approximation-based inference methods in both the speed of convergence and accuracy. We apply the model to spike trains recorded from hippocampal place cells and show that it compares favorably to a variety of previous methods for latent structure discovery, including variational auto-encoder (VAE) based methods that parametrize the nonlinear mapping from latent space to spike rates with a deep neural network.
Toward Multimodal Image-to-Image Translation
Many image-to-image translation problems are ambiguous, as a single input image may correspond to multiple possible outputs. In this work, we aim to model a distribution of possible outputs in a conditional generative modeling setting. The ambiguity of the mapping is distilled in a low-dimensional latent vector, which can be randomly sampled at test time. A generator learns to map the given input, combined with this latent code, to the output. We explicitly encourage the connection between output and the latent code to be invertible. This helps prevent a many-to-one mapping from the latent code to the output during training, also known as the problem of mode collapse, and produces more diverse results. We explore several variants of this approach by employing different training objectives, network architectures, and methods of injecting the latent code. Our proposed method encourages bijective consistency between the latent encoding and output modes. We present a systematic comparison of our method and other variants on both perceptual realism and diversity.
Contextual semibandits via supervised learning oracles
We study an online decision making problem where on each round a learner chooses a list of items based on some side information, receives a scalar feedback value for each individual item, and a reward that is linearly related to this feedback. These problems, known as contextual semibandits, arise in crowdsourcing, recommendation, and many other domains. This paper reduces contextual semibandits to supervised learning, allowing us to leverage powerful supervised learning methods in this partial-feedback setting. Our first reduction applies when the mapping from feedback to reward is known and leads to a computationally efficient algorithm with near-optimal regret. We show that this algorithm outperforms state-of-the-art approaches on real-world learning-to-rank datasets, demonstrating the advantage of oracle-based algorithms. Our second reduction applies to the previously unstudied setting when the linear mapping from feedback to reward is unknown. Our regret guarantees are superior to prior techniques that ignore the feedback.
Large Margin Discriminant Dimensionality Reduction in Prediction Space
In this paper we establish a duality between boosting and SVM, and use this to derive a novel discriminant dimensionality reduction algorithm. In particular, using the multiclass formulation of boosting and SVM we note that both use a combination of mapping and linear classification to maximize the multiclass margin. In SVM this is implemented using a pre-defined mapping (induced by the kernel) and optimizing the linear classifiers. In boosting the linear classifiers are pre-defined and the mapping (predictor) is learned through combination of weak learners. We argue that the intermediate mapping, e.g.
Geometry-Aware Recurrent Neural Networks for Active Visual Recognition
We present recurrent geometry-aware neural networks that integrate visual information across multiple views of a scene into 3D latent feature tensors, while maintaining an one-to-one mapping between 3D physical locations in the world scene and latent feature locations. Object detection, object segmentation, and 3D reconstruction is then carried out directly using the constructed 3D feature memory, as opposed to any of the input 2D images. The proposed models are equipped with differentiable egomotion-aware feature warping and (learned) depth-aware unprojection operations to achieve geometrically consistent mapping between the features in the input frame and the constructed latent model of the scene. We empirically show the proposed model generalizes much better than geometry-unaware LSTM/GRU networks, especially under the presence of multiple objects and cross-object occlusions. Combined with active view selection policies, our model learns to select informative viewpoints to integrate information from by "undoing cross-object occlusions, seamlessly combining geometry with learning from experience.
One-Shot Unsupervised Cross Domain Translation
We perform a wide variety of experiments and demonstrate that OST outperforms the existing algorithms in the low-shot scenario. On most datasets the method also presents a comparable accuracy with a single training example to the accuracy obtained by the other methods for the entire set of domainA images.
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > Canada > Quebec > Montreal (0.04)
- Europe > Netherlands > North Holland > Amsterdam (0.04)
- Asia > Middle East > Israel (0.04)
- Europe > Germany > Saarland (0.04)
- North America > United States > Ohio > Franklin County > Columbus (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- (4 more...)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > Poland > Masovia Province > Warsaw (0.04)
- (2 more...)
- North America > United States (0.14)
- Europe > United Kingdom (0.04)
- Asia > China > Guangdong Province > Zhuhai (0.04)
- Asia > China > Guangdong Province > Shenzhen (0.04)