Plotting

 Fergus, Rob


End-To-End Memory Networks

Neural Information Processing Systems

We introduce a neural network with a recurrent attention model over a possibly large external memory. The architecture is a form of Memory Network (Weston et al., 2015) but unlike the model in that work, it is trained end-to-end, and hence requires significantly less supervision during training, making it more generally applicable in realistic settings. It can also be seen as an extension of RNNsearch to the case where multiple computational steps (hops) are performed per output symbol. The flexibility of the model allows us to apply it to tasks as diverse as (synthetic) question answering and to language modeling. For the former our approach is competitive with Memory Networks, but with less supervision. For the latter, on the Penn TreeBank and Text8 datasets our approach demonstrates comparable performance to RNNs and LSTMs. In both cases we show that the key concept of multiple computational hops yields improved results.


Exploiting Linear Structure Within Convolutional Networks for Efficient Evaluation

Neural Information Processing Systems

We present techniques for speeding up the test-time evaluation of large convolutional networks, designed for object recognition tasks. These models deliver impressive accuracy, but each image evaluation requires millions of floating point operations, making their deployment on smartphones and Internet-scale clusters problematic. The computation is dominated by the convolution operations in the lower layers of the model. We exploit the redundancy present within the convolutional filters to derive approximations that significantly reduce the required computation. Using large state-of-the-art models, we demonstrate speedups of convolutional layers on both CPU and GPU by a factor of 2×, while keeping the accuracy within 1% of the original model.


Depth Map Prediction from a Single Image using a Multi-Scale Deep Network

Neural Information Processing Systems

Predicting depth is an essential component in understanding the 3D geometry of a scene. While for stereo images local correspondence suffices for estimation, finding depth relations from a single image is less straightforward, requiring integration of both global and local information from various cues. Moreover, the task is inherently ambiguous, with a large source of uncertainty coming from the overall scale. In this paper, we present a new method that addresses this task by employing two deep network stacks: one that makes a coarse global prediction based on the entire image, and another that refines this prediction locally. We also apply a scale-invariant error to help measure depth relations rather than scale. By leveraging the raw datasets as large sources of training data, our method achieves state-of-the-art results on both NYU Depth and KITTI, and matches detailed depth boundaries without the need for superpixelation.


Learning to Discover Efficient Mathematical Identities

Neural Information Processing Systems

In this paper we explore how machine learning techniques can be applied to the discovery of efficient mathematical identities. We introduce an attribute grammar framework for representing symbolic expressions. Given a grammar of math operators, we build trees that combine them in different ways, looking for compositions that are analytically equivalent to a target expression but of lower computational complexity. However, as the space of trees grows exponentially with the complexity of the target expression, brute force search is impractical for all but the simplest of expressions. Consequently, we introduce two novel learning approaches that are able to learn from simpler expressions to guide the tree search. The first of these is a simple n-gram model, the other being a recursive neural-network. We show how these approaches enable us to derive complex identities, beyond reach of brute-force search, or human derivation.


Exploiting Linear Structure Within Convolutional Networks for Efficient Evaluation

arXiv.org Artificial Intelligence

We present techniques for speeding up the test-time evaluation of large convolutional networks, designed for object recognition tasks. These models deliver impressive accuracy but each image evaluation requires millions of floating point operations, making their deployment on smartphones and Internet-scale clusters problematic. The computation is dominated by the convolution operations in the lower layers of the model. We exploit the linear structure present within the convolutional filters to derive approximations that significantly reduce the required computation. Using large state-of-the-art models, we demonstrate we demonstrate speedups of convolutional layers on both CPU and GPU by a factor of 2x, while keeping the accuracy within 1% of the original model.


Stochastic Pooling for Regularization of Deep Convolutional Neural Networks

arXiv.org Machine Learning

We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.


Facial Expression Transfer with Input-Output Temporal Restricted Boltzmann Machines

Neural Information Processing Systems

We present a type of Temporal Restricted Boltzmann Machine that defines a probability distribution over an output sequence conditional on an input sequence. It shares the desirable properties of RBMs: efficient exact inference, an exponentially more expressive latent state than HMMs, and the ability to model nonlinear structure and dynamics. We apply our model to a challenging real-world graphics problem: facial expression transfer. Our results demonstrate improved performance over several baselines modeling high-dimensional 2D and 3D data.


Pose-Sensitive Embedding by Nonlinear NCA Regression

Neural Information Processing Systems

This paper tackles the complex problem of visually matching people in similar pose but with different clothes, background, and other appearance changes. We achieve this with a novel method for learning a nonlinear embedding based on several extensions to the Neighborhood Component Analysis (NCA) framework. Our method is convolutional, enabling it to scale to realistically-sized images. By cheaply labeling the head and hands in large video databases through Amazon Mechanical Turk (a crowd-sourcing service), we can use the task of localizing the head and hands as a proxy for determining body pose. We apply our method to challenging real-world data and show that it can generalize beyond hand localization to infer a more general notion of body pose. We evaluate our method quantitatively against other embedding methods. We also demonstrate that real-world performance can be improved through the use of synthetic data.


Case for Automated Detection of Diabetic Retinopathy

AAAI Conferences

Diabetic retinopathy, an eye disorder caused by diabetes, is the primary cause of blindness in America and over 99% of cases in India. India and China currently account for over 90 million diabetic patients and are on the verge of an explosion of diabetic populations. This may result in an unprecedented number of persons becoming blind unless diabetic retinopathy can be detected early. Aravind Eye Hospitals is the largest eye care facility in the world, handling over 2 million patients per year. The hospital is on a massive drive throughout southern India to detect diabetic retinopathy at an early stage. To that end, a group of 10-15 physicians are responsible for manually diagnosing over 2 million retinal images per year to detect diabetic retinopathy. While the task is extremely laborious, a large fraction of cases turn out to be normal indicating that much of this time is spent diagnosing completely normal cases. This paper describes our early experiences working with Aravind Eye Hospitals to develop an automated system to detect diabetic retinopathy from retinal images. The automated diabetic retinopathy problem is a hard computer vision problem whose goal is to detect features of retinopathy, such as hemorrhages and exudates, in retinal color fundus images. We describe our initial efforts towards building such a system using a range of computer vision techniques and discuss the potential impact on early detection of diabetic retinopathy.


Semi-Supervised Learning in Gigantic Image Collections

Neural Information Processing Systems

With the advent of the Internet it is now possible to collect hundreds of millions of images. These images come with varying degrees of label information. ``Clean labels can be manually obtained on a small fraction, ``noisy labels may be extracted automatically from surrounding text, while for most images there are no labels at all. Semi-supervised learning is a principled framework for combining these different label sources. However, it scales polynomially with the number of images, making it impractical for use on gigantic collections with hundreds of millions of images and thousands of classes. In this paper we show how to utilize recent results in machine learning to obtain highly efficient approximations for semi-supervised learning that are linear in the number of images.  Specifically, we use the convergence of the eigenvectors of the normalized graph Laplacian to eigenfunctions of weighted Laplace-Beltrami operators. We combine this with a label sharing framework obtained from Wordnet to propagate label information to classes lacking manual annotations. Our algorithm enables us to apply semi-supervised learning to a database of 80 million images with 74 thousand classes.