Deep Learning
Maxout Networks
Goodfellow, Ian J., Warde-Farley, David, Mirza, Mehdi, Courville, Aaron, Bengio, Yoshua
We consider the problem of designing models to leverage a recently introduced approximate model averaging technique called dropout. We define a simple new model called maxout (so named because its output is the max of a set of inputs, and because it is a natural companion to dropout) designed to both facilitate optimization by dropout and improve the accuracy of dropout's fast approximate model averaging technique. We empirically verify that the model successfully accomplishes both of these tasks. We use maxout and dropout to demonstrate state of the art classification performance on four benchmark datasets: MNIST, CIFAR-10, CIFAR-100, and SVHN.
Temporal Autoencoding Improves Generative Models of Time Series
Häusler, Chris, Susemihl, Alex, Nawrot, Martin P, Opper, Manfred
Restricted Boltzmann Machines (RBMs) are generative models which can learn useful representations from samples of a dataset in an unsupervised fashion. They have been widely employed as an unsupervised pre-training method in machine learning. RBMs have been modified to model time series in two main ways: The Temporal RBM stacks a number of RBMs laterally and introduces temporal dependencies between the hidden layer units; The Conditional RBM, on the other hand, considers past samples of the dataset as a conditional bias and learns a representation which takes these into account. Here we propose a new training method for both the TRBM and the CRBM, which enforces the dynamic structure of temporal datasets. We do so by treating the temporal models as denoising autoencoders, considering past frames of the dataset as corrupted versions of the present frame and minimizing the reconstruction error of the present data by the model. We call this approach Temporal Autoencoding. This leads to a significant improvement in the performance of both models in a filling-in-frames task across a number of datasets. The error reduction for motion capture data is 56\% for the CRBM and 80\% for the TRBM. Taking the posterior mean prediction instead of single samples further improves the model's estimates, decreasing the error by as much as 91\% for the CRBM on motion capture data. We also trained the model to perform forecasting on a large number of datasets and have found TA pretraining to consistently improve the performance of the forecasts. Furthermore, by looking at the prediction error across time, we can see that this improvement reflects a better representation of the dynamics of the data as opposed to a bias towards reconstructing the observed data on a short time scale.
Guided Self-Organization of Input-Driven Recurrent Neural Networks
Obst, Oliver, Boedecker, Joschka
We review attempts that have been made towards understanding the computational properties and mechanisms of input-driven dynamical systems like RNNs, and reservoir computing networks in particular. We provide details on methods that have been developed to give quantitative answers to the questions above. Following this, we show how self-organization may be used to improve reservoirs for better performance, in some cases guided by the measures presented before. We also present a possible way to quantify task performance using an information-theoretic approach, and finally discuss promising future directions aimed at a better understanding of how these systems perform their computations and how to best guide self-organized processes for their optimization.
Pylearn2: a machine learning research library
Goodfellow, Ian J., Warde-Farley, David, Lamblin, Pascal, Dumoulin, Vincent, Mirza, Mehdi, Pascanu, Razvan, Bergstra, James, Bastien, Frédéric, Bengio, Yoshua
Pylearn2 is a machine learning research library. This does not just mean that it is a collection of machine learning algorithms that share a common API; it means that it has been designed for flexibility and extensibility in order to facilitate research projects that involve new or unusual use cases. In this paper we give a brief history of the library, an overview of its basic philosophy, a summary of the library's architecture, and a description of how the Pylearn2 community functions socially.
When are Overcomplete Topic Models Identifiable? Uniqueness of Tensor Tucker Decompositions with Structured Sparsity
Anandkumar, Animashree, Hsu, Daniel, Janzamin, Majid, Kakade, Sham
Overcomplete latent representations have been very popular for unsupervised feature learning in recent years. In this paper, we specify which overcomplete models can be identified given observable moments of a certain order. We consider probabilistic admixture or topic models in the overcomplete regime, where the number of latent topics can greatly exceed the size of the observed word vocabulary. While general overcomplete topic models are not identifiable, we establish generic identifiability under a constraint, referred to as topic persistence. Our sufficient conditions for identifiability involve a novel set of "higher order" expansion conditions on the topic-word matrix or the population structure of the model. This set of higher-order expansion conditions allow for overcomplete models, and require the existence of a perfect matching from latent topics to higher order observed words. We establish that random structured topic models are identifiable w.h.p. in the overcomplete regime. Our identifiability results allows for general (non-degenerate) distributions for modeling the topic proportions, and thus, we can handle arbitrarily correlated topics in our framework. Our identifiability results imply uniqueness of a class of tensor decompositions with structured sparsity which is contained in the class of Tucker decompositions, but is more general than the Candecomp/Parafac (CP) decomposition.
Learning Features and their Transformations by Spatial and Temporal Spherical Clustering
Dutta, Jayanta K., Banerjee, Bonny
Learning features invariant to arbitrary transformations in the data is a requirement for any recognition system, biological or artificial. It is now widely accepted that simple cells in the primary visual cortex respond to features while the complex cells respond to features invariant to different transformations. We present a novel two-layered feedforward neural model that learns features in the first layer by spatial spherical clustering and invariance to transformations in the second layer by temporal spherical clustering. Learning occurs in an online and unsupervised manner following the Hebbian rule. When exposed to natural videos acquired by a camera mounted on a cat's head, the first and second layer neurons in our model develop simple and complex cell-like receptive field properties. The model can predict by learning lateral connections among the first layer neurons. A topographic map to their spatial features emerges by exponentially decaying the flow of activation with distance from one neuron to another in the first layer that fire in close temporal proximity, thereby minimizing the pooling length in an online manner simultaneously with feature learning.
Deep Feature Learning Using Target Priors with Applications in ECoG Signal Decoding for BCI
Wang, Zuoguan (Rensselaer Polytechnic Institute) | Lyu, Siwei (University at Albany, SUNY) | Schalk, Gerwin (Wadsworth Center) | Ji, Qiang (Rensselaer Polytechnic Institute)
Recent years have seen a great interest in using deep architectures for feature learning from data. One drawback of the commonly used unsupervised deep feature learning methods is that for supervised or semi-supervised learning tasks, the information in the target variables are not used until the final stage when the classifier or regressor is trained on the learned features. This could lead to over-generalized features that are not competitive on the specific supervised or semi-supervised learning tasks. In this work, we describe a new learning method that combines deep feature learning on mixed labeled and unlabeled data sets. Specifically, we describe a weakly supervised learning method of a prior supervised convolutional stacked auto-encoders (PCSA), of which information in the target variables is represented probabilistically using a Gaussian Bernoulli restricted Boltzmann machine (RBM). We apply this method to the decoding problem of an ECoG based Brain Computer Interface (BCI) system. Our experimental results show that PCSA achieves significant improvement in decoding performance on benchmark data sets compared to the unsupervised feature learning as well as to the current state-of-the-art algorithms that are based on manually crafted features.
Conditional Restricted Boltzmann Machines for Negotiations in Highly Competitive and Complex Domains
Chen, Siqi (Maastricht University) | Ammar, Haitham Bou (Maastricht University) | Tuyls, Karl (Maastricht University) | Weiss, Gerhard (Maastricht University)
Learning in automated negotiations, while useful, is hard because of the indirect way the target function can be observed and the limited amount of experience available to learn from. This paper proposes two novel opponent modeling techniques based on deep learning methods. Moreover, to improve the learning efficacy of negotiating agents, the second approach is also capable of transferring knowledge efficiently between negotiation tasks. Transfer is conducted by automatically mapping the source knowledge to the target in a rich feature space. Experiments show that using these techniques the proposed strategies outperform existing state-of-the-art agents in highly competitive and complex negotiation domains. Furthermore, the empirical game theoretic analysis reveals the robustness of the proposed strategies.
Knowledge Matters: Importance of Prior Information for Optimization
Gülçehre, Çağlar, Bengio, Yoshua
We explore the effect of introducing prior information into the intermediate level of neural networks for a learning task on which all the state-of-the-art machine learning algorithms tested failed to learn. We motivate our work from the hypothesis that humans learn such intermediate concepts from other individuals via a form of supervision or guidance using a curriculum. The experiments we have conducted provide positive evidence in favor of this hypothesis. In our experiments, a two-tiered MLP architecture is trained on a dataset with 64x64 binary inputs images, each image with three sprites. The final task is to decide whether all the sprites are the same or one of them is different. Sprites are pentomino tetris shapes and they are placed in an image with different locations using scaling and rotation transformations. The first part of the two-tiered MLP is pre-trained with intermediate-level targets being the presence of sprites at each location, while the second part takes the output of the first part as input and predicts the final task's target binary event. The two-tiered MLP architecture, with a few tens of thousand examples, was able to learn the task perfectly, whereas all other algorithms (include unsupervised pre-training, but also traditional algorithms like SVMs, decision trees and boosting) all perform no better than chance. We hypothesize that the optimization difficulty involved when the intermediate pre-training is not performed is due to the {\em composition} of two highly non-linear tasks. Our findings are also consistent with hypotheses on cultural learning inspired by the observations of optimization problems with deep learning, presumably because of effective local minima.
Building on Deep Learning
Pickett, Marc (Naval Research Laboratory)
We propose using deep learning as the "workhorse" of a cognitive architecture. We show how deep learning can be leveraged to learn representations, such as a hierarchy of analogical schemas, from relational data. This approach to higher cognition drives some desiderata of deep learning, particularly modality independence and the ability to make top-down predictions. Finally, we consider the problem of how relational representations might be learned from sensor data that is not explicitly relational.