Backpropagation
Reviews: Hybrid Macro/Micro Level Backpropagation for Training Deep Spiking Neural Networks
The paper presents a new approach for training spiking neural networks (SNNs) via backpropagation. The main contribution is a decomposition of the error gradient into a "macro" component that directly optimizes the rate-coded error, and a "micro" component that takes into account effects due to spike timing. The approach therefore creates a bridge between approaches that treat SNNs mainly like standard ANNs (optimizing the rate-coded error), and algorithms that work on the spike level such as SpikeProp, which have previously not been able to scale to larger problems. There are similarities to previously published methods such as from Lee et al. 2016 or Wu et al. 2017, which are discussed briefly, but there are novel contributions. Furthermore, the results on two tasks (MNIST and N-MNIST) show improvements over the state-of-the-art, although those improvements are small.
Reviews: Backpropagation with Callbacks: Foundations for Efficient and Expressive Differentiable Programming
The paper descries Lantern, a framework for automatic differentiation in Scala, based on callbacks and continuation passing style. It compares against PyTorch and TensorFlow on several benchmark tasks. There are two main aspects of the paper: Reverse-mode automatic differentiation with continuations, and code generation via multi-stage programming. The submission does not provide code for the proposed framework, which I don't find acceptable for a paper on a software package. It's unclear to me how the first is different from any other implementation of automatic differentiation via operator overloading.
Reviews: Dendritic cortical microcircuits approximate the backpropagation algorithm
Using two compartments allows errors and activities to be represented within the same neuron. The overall procedure is similar to contrastive Hebbian learning and relies on weak top down feedback from an initial'self-predicting' settled state, but unlike contrastive Hebbian learning does not require separate phases. Experimental results show that the method can attain reasonable results on MNIST. Major comments: This paper presents an interesting approach to approximately implementing backpropagation that relies on a mixture of dendritic compartments and specific circuitry motifs. This is a fundamentally important topic and the results would likely be of interest to many, even if the specific hypothesis turns out to be incorrect.
The Reversible Residual Network: Backpropagation Without Storing Activations
Aidan N. Gomez, Mengye Ren, Raquel Urtasun, Roger B. Grosse
Deep residual networks (ResNets) have significantly pushed forward the state-ofthe-art on image classification, increasing in performance as networks grow both deeper and wider. However, memory consumption becomes a bottleneck, as one needs to store the activations in order to calculate gradients using backpropagation. We present the Reversible Residual Network (RevNet), a variant of ResNets where each layer's activations can be reconstructed exactly from the next layer's. Therefore, the activations for most layers need not be stored in memory during backpropagation. We demonstrate the effectiveness of RevNets on CIFAR-10, CIFAR-100, and ImageNet, establishing nearly identical classification accuracy to equally-sized ResNets, even though the activation storage requirements are independent of depth.
On Blackbox Backpropagation and Jacobian Sensing
Krzysztof M. Choromanski, Vikas Sindhwani
From a small number of calls to a given "blackbox" on random input perturbations, we show how to efficiently recover its unknown Jacobian, or estimate the left action of its Jacobian on a given vector. Our methods are based on a novel combination of compressed sensing and graph coloring techniques, and provably exploit structural prior knowledge about the Jacobian such as sparsity and symmetry while being noise robust. We demonstrate efficient backpropagation through noisy blackbox layers in a deep neural net, improved data-efficiency in the task of linearizing the dynamics of a rigid body system, and the generic ability to handle a rich class of input-output dependency structures in Jacobian estimation problems.
FastLRNR and Sparse Physics Informed Backpropagation
Cho, Woojin, Lee, Kookjin, Park, Noseong, Rim, Donsub, Welper, Gerrit
We introduce Sparse Physics Informed Backpropagation (SPInProp), a new class of methods for accelerating backpropagation for a specialized neural network architecture called Low Rank Neural Representation (LRNR). The approach exploits the low rank structure within LRNR and constructs a reduced neural network approximation that is much smaller in size. We call the smaller network FastLRNR. We show that backpropagation of FastLRNR can be substituted for that of LRNR, enabling a significant reduction in complexity. We apply SPInProp to a physics informed neural networks framework and demonstrate how the solution of parametrized partial differential equations is accelerated.
Self-attention as an attractor network: transient memories without backpropagation
D'Amico, Francesco, Negri, Matteo
Transformers are one of the most successful architectures of modern neural networks. At their core there is the so-called attention mechanism, which recently interested the physics community as it can be written as the derivative of an energy function in certain cases: while it is possible to write the cross-attention layer as a modern Hopfield network, the same is not possible for the self-attention, which is used in the GPT architectures and other autoregressive models. In this work we show that it is possible to obtain the self-attention layer as the derivative of local energy terms, which resemble a pseudo-likelihood. We leverage the analogy with pseudo-likelihood to design a recurrent model that can be trained without backpropagation: the dynamics shows transient states that are strongly correlated with both train and test examples. Overall we present a novel framework to interpret self-attention as an attractor network, potentially paving the way for new theoretical approaches inspired from physics to understand transformers.
Less Memory Means smaller GPUs: Backpropagation with Compressed Activations
Barley, Daniel, Fröning, Holger
The ever-growing scale of deep neural networks (DNNs) has lead to an equally rapid growth in computational resource requirements. Many recent architectures, most prominently Large Language Models, have to be trained using supercomputers with thousands of accelerators, such as GPUs or TPUs. Next to the vast number of floating point operations the memory footprint of DNNs is also exploding. In contrast, GPU architectures are notoriously short on memory. Even comparatively small architectures like some EfficientNet variants cannot be trained on a single consumer-grade GPU at reasonable mini-batch sizes. During training, intermediate input activations have to be stored until backpropagation for gradient calculation. These make up the vast majority of the memory footprint. In this work we therefore consider compressing activation maps for the backward pass using pooling, which can reduce both the memory footprint and amount of data movement. The forward computation remains uncompressed. We empirically show convergence and study effects on feature detection at the example of the common vision architecture ResNet. With this approach we are able to reduce the peak memory consumption by 29% at the cost of a longer training schedule, while maintaining prediction accuracy compared to an uncompressed baseline.
Advancing On-Device Neural Network Training with TinyPropv2: Dynamic, Sparse, and Efficient Backpropagation
Rüb, Marcus, Sikora, Axel, Mueller-Gritschneder, Daniel
This study introduces TinyPropv2, an innovative algorithm optimized for on-device learning in deep neural networks, specifically designed for low-power microcontroller units. TinyPropv2 refines sparse backpropagation by dynamically adjusting the level of sparsity, including the ability to selectively skip training steps. This feature significantly lowers computational effort without substantially compromising accuracy. Our comprehensive evaluation across diverse datasets CIFAR 10, CIFAR100, Flower, Food, Speech Command, MNIST, HAR, and DCASE2020 reveals that TinyPropv2 achieves near-parity with full training methods, with an average accuracy drop of only around 1 percent in most cases. For instance, against full training, TinyPropv2's accuracy drop is minimal, for example, only 0.82 percent on CIFAR 10 and 1.07 percent on CIFAR100. In terms of computational effort, TinyPropv2 shows a marked reduction, requiring as little as 10 percent of the computational effort needed for full training in some scenarios, and consistently outperforms other sparse training methodologies. These findings underscore TinyPropv2's capacity to efficiently manage computational resources while maintaining high accuracy, positioning it as an advantageous solution for advanced embedded device applications in the IoT ecosystem.
Rethinking Deep Learning: Propagating Information in Neural Networks without Backpropagation and Statistical Optimization
Developing strong AI signifies the arrival of technological singularity, contributing greatly to advancing human civilization and resolving social issues. Neural networks (NNs) and deep learning, which utilize NNs, are expected to lead to strong AI due to their biological neural system-mimicking structures. However, the statistical weight optimization techniques commonly used, such as error backpropagation and loss functions, may hinder the mimicry of neural systems. This study discusses the information propagation capabilities and potential practical applications of NNs as neural system mimicking structures by solving the handwritten character recognition problem in the Modified National Institute of Standards and Technology (MNIST) database without using statistical weight optimization techniques like error backpropagation. In this study, the NNs architecture comprises fully connected layers using step functions as activation functions, with 0-15 hidden layers, and no weight updates. The accuracy is calculated by comparing the average output vectors of the training data for each label with the output vectors of the test data, based on vector similarity. The results showed that the maximum accuracy achieved is around 80%. This indicates that NNs can propagate information correctly without using statistical weight optimization. Additionally, the accuracy decreased with an increasing number of hidden layers. This is attributed to the decrease in the variance of the output vectors as the number of hidden layers increases, suggesting that the output data becomes smooth. This study's NNs and accuracy calculation methods are simple and have room for various improvements. Moreover, creating a feedforward NNs that repeatedly cycles through 'input -> processing -> output -> environmental response -> input -> ...' could pave the way for practical software applications.