Universal Backpropagation in 15 lines of code - Ray Born - Medium


Backpropagation is the core of today's deep learning applications. If you've dabbled with deep learning there's a good chance you're aware of the concept. If you've ever implemented backpropagation manually, you're probably very grateful that deep learning libraries automatically do it for you. Implementing backprop by hand is arduous, yet the concept behind general backpropagation is very simple. Did you know that the eager execution-restricted backprop is almost trivial?
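To make the claim concrete, here is a minimal sketch of eager, define-by-run backpropagation on scalars. It is not the article's own code: the `Var` class and the `backward` function are hypothetical names chosen for illustration, and only addition and multiplication are supported.

```python
class Var:
    """A scalar value that remembers how it was computed (hypothetical helper)."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # pairs of (parent Var, local derivative)
        self.grad = 0.0

    def __add__(self, other):
        # d(a + b)/da = 1 and d(a + b)/db = 1
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        # d(a * b)/da = b and d(a * b)/db = a
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))


def backward(output):
    """Accumulate d(output)/d(node) into .grad for every node feeding output."""
    order, seen = [], set()

    def visit(node):
        # Depth-first walk that lists parents before the nodes that use them.
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)

    visit(output)
    output.grad = 1.0
    # Walk from the output back to the inputs, applying the chain rule.
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += local * node.grad


x, y = Var(2.0), Var(3.0)
z = x * y + x
backward(z)
print(z.value, x.grad, y.grad)
```

Because the graph is recorded while the forward pass runs, the backward pass only needs a topological ordering and one chain-rule update per edge.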

Backpropagation for people who are afraid of math


Backpropagation is one of the most important concepts in machine learning. There are many online resources that explain the intuition behind this algorithm (IMO the best of these is the backpropagation lecture in the Stanford cs231n video lectures. Another very good source is this), but getting from the intuition to practice can be, to put it gently, quite challenging. After spending more hours than I'd like to admit trying to get all the sizes of my layers and weights to fit, constantly forgetting what's what and what's connected where, I sat down and drew some diagrams that illustrate the entire process. Consider it visual pseudocode.

Deep Learning with TensorFlow 2.0 [2019]


But what is that one special thing they have in common? They are all masters of deep learning. We often hear about AI, or self-driving cars, or the 'algorithmic magic' at Google, Facebook, and Amazon. But it is not magic - it is deep learning. And more specifically, it is usually deep neural networks – the one algorithm to rule them all.

Biologically plausible deep learning -- but how far can we go with shallow networks? Machine Learning

Training deep neural networks with the error backpropagation algorithm is considered implausible from a biological perspective. Numerous recent publications suggest elaborate models for biologically plausible variants of deep learning, typically defining success as reaching around 98% test accuracy on the MNIST data set. Here, we investigate how far we can go on digit (MNIST) and object (CIFAR10) classification with biologically plausible, local learning rules in a network with one hidden layer and a single readout layer. The hidden layer weights are either fixed (random or random Gabor filters) or trained with unsupervised methods (PCA, ICA or Sparse Coding) that can be implemented by local learning rules. The readout layer is trained with a supervised, local learning rule. We first implement these models with rate neurons. This comparison reveals, first, that unsupervised learning does not lead to better performance than fixed random projections or Gabor filters for large hidden layers. Second, networks with localized receptive fields perform significantly better than networks with all-to-all connectivity and can reach backpropagation performance on MNIST. We then implement two of the networks - fixed, localized, random & random Gabor filters in the hidden layer - with spiking leaky integrate-and-fire neurons and spike timing dependent plasticity to train the readout layer. These spiking models achieve > 98.2% test accuracy on MNIST, which is close to the performance of rate networks with one hidden layer trained with backpropagation. The performance of our shallow network models is comparable to most current biologically plausible models of deep learning. Furthermore, our results with a shallow spiking network provide an important reference and suggest the use of datasets other than MNIST for testing the performance of future models of biologically plausible deep learning.

One LEGO at a Time Explaining the Math of how Neural Networks Learn with Implementation from…


A topic that is not always explained in-depth, despite its intuitive and modular nature, is the backpropagation technique responsible for updating trainable parameters. Let's build a neural network from scratch to see the internal functioning of a neural network using LEGO pieces as a modular analogy, one brick at a time. The above figure depicts some of the Math used for training a neural network. We will make sense of this during this article. At this point, these operations only compute a general linear system, which doesn't have the capacity to model non-linear interactions.

Associated Learning: Decomposing End-to-end Backpropagation based on Auto-encoders and Target Propagation Machine Learning

Backpropagation has been widely used in deep learning approaches, but it is inefficient and sometimes unstable because of backward locking and vanishing/exploding gradient problems, especially when the gradient flow is long. Additionally, updating all edge weights based on a single objective seems biologically implausible. In this paper, we introduce a novel biologically motivated learning structure called Associated Learning, which modularizes the network into smaller components, each of which has a local objective. Because the objectives are mutually independent, Associated Learning can learn the parameters independently and simultaneously when these parameters belong to different components. Surprisingly, training deep models by Associated Learning yields comparable accuracies to models trained using typical backpropagation methods, which aim at fitting the target variable directly. Moreover, probably because the gradient flow of each component is short, deep networks can still be trained with Associated Learning even when some of the activation functions are sigmoid, a situation that usually results in the vanishing gradient problem when using typical backpropagation. We also found that Associated Learning generates better metafeatures, which we demonstrated both quantitatively (via inter-class and intra-class distance comparisons in the hidden layers) and qualitatively (by visualizing the hidden layers using t-SNE).

Sampling-Free Variational Inference of Bayesian Neural Networks by Variance Backpropagation Machine Learning

We propose a new Bayesian Neural Net formulation that affords variational inference for which the evidence lower bound is analytically tractable subject to a tight approximation. We achieve this tractability by (i) decomposing ReLU nonlinearities into the product of an identity and a Heaviside step function, and (ii) introducing a separate path that decomposes the neural net expectation from its variance. We demonstrate formally that introducing separate latent binary variables to the activations allows representing the neural network likelihood as a chain of linear operations. Performing variational inference on this construction enables a sampling-free computation of the evidence lower bound, which is a more effective approximation than the widely applied Monte Carlo sampling and CLT-related techniques. We evaluate the model on a range of regression and classification tasks against BNN inference alternatives, showing competitive or improved performance over the current state-of-the-art.
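The decomposition in step (i) is easy to check numerically. The sketch below only verifies the identity relu(x) = x * H(x) on a few points; it does not reproduce the paper's variational treatment, and `heaviside_step` is a hypothetical helper name.

```python
import numpy as np


def heaviside_step(x):
    """Return 1.0 where x > 0 and 0.0 elsewhere (hypothetical helper)."""
    return (x > 0).astype(x.dtype)


x = np.linspace(-2.0, 2.0, 9)

relu = np.maximum(x, 0.0)            # the usual ReLU
decomposed = x * heaviside_step(x)   # identity times a binary gate

identity_matches = bool(np.allclose(relu, decomposed))
print(identity_matches)
```

Once the gate is held fixed, the remaining operation is linear in x, which is the property the abstract relies on when it describes the likelihood as a chain of linear operations.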

A Brief Summary of Maths Behind RNN (Recurrent Neural Networks)


In a feedforward neural network, we have X (input), H (hidden) and y (output). We can have as many hidden layers as we want, but the weights (W) for every hidden layer are different, and the weights for every neuron corresponding to the input are different. Above we have weights Wh0 and Wh1, which correspond to two different layers, while Wh00, Wh01 and so on represent different weights corresponding to different neurons with respect to the input. The RNN cell contains a set of feedforward neural networks because we have time steps. The RNN has sequential input, sequential output, multiple time steps, and multiple hidden layers. Unlike in an FFNN, here we calculate hidden layer values not only from input values but also from the previous time step's values, and the weights (W) at the hidden layers are the same across time steps. Here is the complete picture for the RNN and its math.
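The weight sharing across time steps can be sketched as follows. This is a minimal illustration, not the article's code: the names `rnn_forward`, `W_xh`, `W_hh` and `b_h`, the tanh activation, and the single hidden layer are all assumptions made for the example.

```python
import numpy as np


def rnn_forward(inputs, W_xh, W_hh, b_h, h0):
    """Run a vanilla RNN over a sequence, reusing the same weights at every step."""
    h = h0
    hidden_states = []
    for x_t in inputs:
        # The new hidden state depends on the current input AND the previous state.
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        hidden_states.append(h)
    return np.stack(hidden_states)


rng = np.random.default_rng(0)
T, input_size, hidden_size = 5, 3, 4

inputs = rng.normal(size=(T, input_size))
W_xh = rng.normal(size=(hidden_size, input_size)) * 0.1   # input-to-hidden weights
W_hh = rng.normal(size=(hidden_size, hidden_size)) * 0.1  # hidden-to-hidden weights
b_h = np.zeros(hidden_size)

states = rnn_forward(inputs, W_xh, W_hh, b_h, np.zeros(hidden_size))
print(states.shape)
```

The loop never creates new weight matrices: the same `W_xh` and `W_hh` are applied at each of the T steps, which is the difference from the per-layer weights of the feedforward case.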

Residual Flows for Invertible Generative Modeling Machine Learning

Flow-based generative models parameterize probability distributions through an invertible transformation and can be trained by maximum likelihood. Invertible residual networks provide a flexible family of transformations where only Lipschitz conditions rather than strict architectural constraints are needed for enforcing invertibility. However, prior work trained invertible residual networks for density estimation by relying on biased log-density estimates whose bias increased with the network's expressiveness. We give a tractable unbiased estimate of the log density, and reduce the memory required during training by a factor of ten. Furthermore, we improve invertible residual blocks by proposing the use of activation functions that avoid gradient saturation and generalizing the Lipschitz condition to induced mixed norms. The resulting approach, called Residual Flows, achieves state-of-the-art performance on density estimation amongst flow-based models, and outperforms networks that use coupling blocks at joint generative and discriminative modeling.

Efficient Subsampled Gauss-Newton and Natural Gradient Methods for Training Neural Networks Machine Learning

We present practical Levenberg-Marquardt variants of Gauss-Newton and natural gradient methods for solving non-convex optimization problems that arise in training deep neural networks involving enormous numbers of variables and huge data sets. Our methods use subsampled Gauss-Newton or Fisher information matrices and either subsampled gradient estimates (fully stochastic) or full gradients (semi-stochastic), which, in the latter case, we prove convergent to a stationary point. By using the Sherman-Morrison-Woodbury formula with automatic differentiation (backpropagation) we show how our methods can be implemented to perform efficiently. Finally, numerical results are presented to demonstrate the effectiveness of our proposed methods.
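As an illustration of why the Sherman-Morrison-Woodbury formula helps, consider a damped system (lam * I + J^T J) d = g, where J has few rows (a subsample) and many columns (the parameters). The sketch below is an assumed toy setup, not the paper's algorithm: `damped_gn_step`, the random `J` and `g`, and the damping value `lam` are all hypothetical.

```python
import numpy as np


def damped_gn_step(J, g, lam):
    """Solve (lam * I_n + J.T @ J) d = g using only an m x m linear solve.

    Woodbury identity for this special case:
    (lam*I_n + J^T J)^{-1} = (I_n - J^T (lam*I_m + J J^T)^{-1} J) / lam
    """
    m = J.shape[0]
    small = lam * np.eye(m) + J @ J.T          # m x m instead of n x n
    return (g - J.T @ np.linalg.solve(small, J @ g)) / lam


rng = np.random.default_rng(0)
m, n, lam = 8, 200, 0.5                        # few samples, many parameters
J = rng.normal(size=(m, n))
g = rng.normal(size=n)

fast = damped_gn_step(J, g, lam)
direct = np.linalg.solve(lam * np.eye(n) + J.T @ J, g)  # n x n reference solve

max_abs_diff = float(np.max(np.abs(fast - direct)))
print(max_abs_diff)
```

In a real network J would not be formed explicitly; the products with J and its transpose are what automatic differentiation would supply.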