At the heart of deep learning lies a hard optimization problem. So hard that for several decades after the introduction of neural networks, the difficulty of optimization on deep neural networks was a barrier to their mainstream usage and contributed to their decline in the 1990s and 2000s. Since then, we have overcome this issue. In this post, I explore the "hardness" in optimizing neural networks and see what the theory has to say. In a nutshell: the deeper the network becomes, the harder the optimization problem becomes.
Traditionally, deep learning algorithms update the network weights whereas the network architecture is chosen manually, using a process of trial and error. In this work, we propose two novel approaches that automatically update the network structure while also learning its weights. The novelty of our approach lies in our parameterization where the depth, or additional complexity, is encapsulated continuously in the parameter space through control parameters that add additional complexity. We propose two methods: In tunnel networks, this selection is done at the level of a hidden unit, and in budding perceptrons, this is done at the level of a network layer; updating this control parameter introduces either another hidden unit or another hidden layer. We show the effectiveness of our methods on the synthetic two-spirals data and on two real data sets of MNIST and MIRFLICKR, where we see that our proposed methods, with the same set of hyperparameters, can correctly adjust the network complexity to the task complexity.
Training neural networks is traditionally done by providing a sequence of random mini-batches sampled uniformly from the entire training data. In this work, we analyze the effects of curriculum learning, which involves the dynamic non-uniform sampling of mini-batches, on the training of deep networks, and specifically CNNs trained on image recognition. To employ curriculum learning, the training algorithm must resolve 2 problems: (i) sort the training examples by difficulty; (ii) compute a series of mini-batches that exhibit an increasing level of difficulty. We address challenge (i) using two methods: transfer learning from some competitive "teacher" network, and bootstrapping. We show that both methods show similar benefits in terms of increased learning speed and improved final performance on test data. We address challenge (ii) by investigating different pacing functions to guide the sampling. The empirical investigation includes a variety of network architectures, using images from CIFAR-10, CIFAR-100 and subsets of ImageNet. We conclude with a novel theoretical analysis of curriculum learning, where we show how it effectively modifies the optimization landscape. We then define the concept of an ideal curriculum, and show that under mild conditions it does not change the corresponding global minimum of the optimization function.
The mushroom body is the key network for the representation of learned olfactory stimuli in Drosophila and insects. The sparse activity of Kenyon cells, the principal neurons in the mushroom body, plays a key role in the learned classification of different odours. In the specific case of the fruit fly, the sparseness of the network is enforced by an inhibitory feedback neuron called APL, and by an intrinsic high firing threshold of the Kenyon cells. In this work we took inspiration from the fruit fly brain to formulate a novel machine learning algorithm that is able to optimize the sparsity level of a reservoir by changing the firing thresholds of the nodes. The sparsity is only applied on the readout layer so as not to change the timescales of the reservoir and to allow the derivation of a one-layer update rule for the firing thresholds. The proposed algorithm is a combination of learning a neuron-specific sparsity threshold via gradient descent and a global sparsity threshold via a Markov chain Monte Carlo method. The proposed model outperforms the standard gradient descent, which is limited to the readout weights of the reservoir, on two example tasks. It demonstrates how the learnt sparse representation can lead to better classification performance, memorization ability and convergence time.
The inverse of the Fisher information matrix is used in the natural gradientdescent algorithm to train single-layer and multi-layer perceptrons. We have discovered a new scheme to represent the Fisher information matrix of a stochastic multi-layer perceptron. Based on this scheme, we have designed an algorithm to compute the natural gradient. When the input dimension n is much larger than the number of hidden neurons, the complexity of this algorithm isof order O(n). It is confirmed by simulations that the natural gradient descent learning rule is not only efficient but also robust. 1 INTRODUCTION The inverse of the Fisher information matrix is required to find the Cramer-Rae lower bound to analyze the performance of an unbiased estimator. It is also needed in the natural gradient learning framework (Amari, 1997) to design statistically efficient algorithms for estimating parameters in general and for training neural networks in particular. In this paper, we assume a stochastic model for multilayer perceptrons.Considering a Riemannian parameter space in which the Fisher information matrix is a metric tensor, we apply the natural gradient learning rule to train single-layer and multi-layer perceptrons. The main difficulty encountered is to compute the inverse of the Fisher information matrix of large dimensions when the input dimension is high. By exploring the structure of the Fisher information matrix and its inverse, we design a fast algorithm with lower complexity to implement the natural gradient learning algorithm.