Backpropagation
New Hinton Nature Paper Revisits Backpropagation, Offers Insights for Understanding Learning in…
Although Turing awardee and backpropagation pioneer Geoffrey Hinton's interests have largely shifted to unsupervised learning, he recently co-authored a paper that takes a look back at backpropagation and explores its potential to contribute to understanding how the human cortex learns. Hinton and a team of researchers from DeepMind, University College London, and University of Oxford published the paper last Friday on Nature Reviews Neuroscience. Their main idea is that biological brains could compute effective synaptic updates by using feedback connections to induce neuron activities whose locally computed differences encode backpropagation-like error signals. Backpropagation of errors, or backprop, is a widely used algorithm in training artificial neural networks using gradient descent for supervised learning. The basics of continuous backpropagation were proposed in the 1960s, and in 1986 a Nature paper co-authored by Hinton showed experimentally that backprop can generate useful internal representations for neural networks.
Making Backpropagation, Autograd, MNIST Classifier from scratch in Python
Backpropagation (backward propagation of errors) -- is a widely used algorithm in training feedforward networks. It computes the gradient of the loss function with respect to the weights of the network. The main idea of it is to break big functions in small parts and use partial derivatives to get function derivative with using the Chain Rule. And because solving this can be a very hard task, here comes backpropagation and gradient descent(updating weights by a small amount based on the gradient to move in the way of loss minimization). Let's say we have 3 variables x -2, y 5, z -4, the result will be f -12, and our target for training is -13.
Backpropagation from scratch on Mini-Batches
You must be thinking, another Backprop from scratch blog? Well kinda yes but I thought this through and came up with something that you can use to tinker around along with easy to understand equations that you usually write down to understand the algorithm. This blog will focus on implementing the Backpropagation algorithm step-by-step on mini-batches of the dataset. There are plenty of tutorials and blogs to demonstrate the backpropagation algorithm in detail and all the logic behind calculus and algebra happening. So I'll skip that part and cut to equations in math and implementation using Python (coz why not).
Dithered backprop: A sparse and quantized backpropagation algorithm for more efficient deep neural network training
Wiedemann, Simon, Mehari, Temesgen, Kepp, Kevin, Samek, Wojciech
Deep Neural Networks are successful but highly computationally expensive learning systems. One of the main sources of time and energy drains is the well known backpropagation (backprop) algorithm, which roughly accounts for 2/3 of the computational complexity of training. In this work we propose a method for reducing the computational cost of backprop, which we named dithered backprop. It consists in applying a stochastic quantization scheme to intermediate results of the method. The particular quantisation scheme, called non-subtractive dither (NSD), induces sparsity which can be exploited by computing efficient sparse matrix multiplications. Experiments on popular image classification tasks show that it induces 92% sparsity on average across a wide set of models at no or negligible accuracy drop in comparison to state-of-the-art approaches, thus significantly reducing the computational complexity of the backward pass. Moreover, we show that our method is fully compatible to state-of-the-art training methods that reduce the bit-precision of training down to 8-bits, as such being able to further reduce the computational requirements. Finally we discuss and show potential benefits of applying dithered backprop in a distributed training setting, where both communication as well as compute efficiency may increase simultaneously with the number of participant nodes.
A Bayesian approach for initialization of weights in backpropagation neural net with application to character recognition
Murru, Nadir, Rossini, Rosaria
Convergence rate of training algorithms for neural networks is heavily affected by initialization of weights. In this paper, an original algorithm for initialization of weights in backpropagation neural net is presented with application to character recognition. The initialization method is mainly based on a customization of the Kalman filter, translating it into Bayesian statistics terms. A metrological approach is used in this context considering weights as measurements modeled by mutually dependent normal random variables. The algorithm performance is demonstrated by reporting and discussing results of simulation trials. Results are compared with random weights initialization and other methods. The proposed method shows an improved convergence rate for the backpropagation training algorithm.
Pipelined Backpropagation at Scale: Training Large Models without Batches
Kosson, Atli, Chiley, Vitaliy, Venigalla, Abhinav, Hestness, Joel, Köster, Urs
Parallelism is crucial for accelerating the training of deep neural networks. Pipeline parallelism can provide an efficient alternative to traditional data parallelism by allowing workers to specialize. Performing mini-batch SGD using pipeline parallelism has the overhead of filling and draining the pipeline. Pipelined Backpropagation updates the model parameters without draining the pipeline. This removes the overhead but introduces stale gradients and inconsistency between the weights used on the forward and backward passes, reducing final accuracy and the stability of training. We introduce Spike Compensation and Linear Weight Prediction to mitigate these effects. Analysis on a convex quadratic shows that both methods effectively counteract staleness. We train multiple convolutional networks at a batch size of one, completely replacing batch parallelism with fine-grained pipeline parallelism. With our methods, Pipelined Backpropagation achieves full accuracy on CIFAR-10 and ImageNet without hyperparameter tuning.
Memory-Efficient Backpropagation Through Time
Gruslys, Audrunas, Munos, Remi, Danihelka, Ivo, Lanctot, Marc, Graves, Alex
We propose a novel approach to reduce memory consumption of the backpropagation through time (BPTT) algorithm when training recurrent neural networks (RNNs). Our approach uses dynamic programming to balance a trade-off between caching of intermediate results and recomputation. The algorithm is capable of tightly fitting within almost any user-set memory budget while finding an optimal execution policy minimizing the computational cost. Computational devices have limited memory capacity and maximizing a computational performance given a fixed memory budget is a practical use-case. We provide asymptotic computational upper bounds for various regimes.
Spike-Train Level Backpropagation for Training Deep Recurrent Spiking Neural Networks
As an important class of SNNs, recurrent spiking neural networks (RSNNs) possess great computational power. However, the practical application of RSNNs is severely limited by challenges in training. Biologically-inspired unsupervised learning has limited capability in boosting the performance of RSNNs. On the other hand, existing backpropagation (BP) methods suffer from high complexity of unrolling in time, vanishing and exploding gradients, and approximate differentiation of discontinuous spiking activities when applied to RSNNs. To enable supervised training of RSNNs under a well-defined loss function, we present a novel Spike-Train level RSNNs Backpropagation (ST-RSBP) algorithm for training deep RSNNs.
Backpropagation-Friendly Eigendecomposition
Wang, Wei, Dang, Zheng, Hu, Yinlin, Fua, Pascal, Salzmann, Mathieu
Eigendecomposition (ED) is widely used in deep networks. However, the backpropagation of its results tends to be numerically unstable, whether using ED directly or approximating it with the Power Iteration method, particularly when dealing with large matrices. While this can be mitigated by partitioning the data in small and arbitrary groups, doing so has no theoretical basis and makes its impossible to exploit the power of ED to the full. In this paper, we introduce a numerically stable and differentiable approach to leveraging eigenvectors in deep networks. It can handle large matrices without requiring to split them.
A Graph Theoretic Framework of Recomputation Algorithms for Memory-Efficient Backpropagation
Kusumoto, Mitsuru, Inoue, Takuya, Watanabe, Gentaro, Akiba, Takuya, Koyama, Masanori
Recomputation algorithms collectively refer to a family of methods that aims to reduce the memory consumption of the backpropagation by selectively discarding the intermediate results of the forward propagation and recomputing the discarded results as needed. In this paper, we will propose a novel and efficient recomputation method that can be applied to a wider range of neural nets than previous methods. We use the language of graph theory to formalize the general recomputation problem of minimizing the computational overhead under a fixed memory budget constraint, and provide a dynamic programming solution to the problem. Our method can reduce the peak memory consumption on various benchmark networks by $36\%\sim81\%$, which outperforms the reduction achieved by other methods. Papers published at the Neural Information Processing Systems Conference.