Goto

Collaborating Authors

 Backpropagation


Efficient Low-rank Backpropagation for Vision Transformer Adaptation

Neural Information Processing Systems

The increasing scale of vision transformers (ViT) has made the efficient fine-tuning of these large models for specific needs a significant challenge in various applications. This issue originates from the computationally demanding matrix multiplications required during the backpropagation process through linear layers in ViT.In this paper, we tackle this problem by proposing a new Low-rank BackPropagation via Walsh-Hadamard Transformation (LBP-WHT) method. Intuitively, LBP-WHT projects the gradient into a low-rank space and carries out backpropagation. This approach substantially reduces the computation needed for adapting ViT, as matrix multiplication in the low-rank space is far less resource-intensive. We conduct extensive experiments with different models (ViT, hybrid convolution-ViT model) on multiple datasets to demonstrate the effectiveness of our method.


Temporal Spike Sequence Learning via Backpropagation for Deep Spiking Neural Networks

Neural Information Processing Systems

Spiking neural networks (SNNs) are well suited for spatio-temporal learning and implementations on energy-efficient event-driven neuromorphic processors. However, existing SNN error backpropagation (BP) methods lack proper handling of spiking discontinuities and suffer from low performance compared with the BP methods for traditional artificial neural networks. In addition, a large number of time steps are typically required to achieve decent performance, leading to high latency and rendering spike-based computation unscalable to deep architectures. We present a novel Temporal Spike Sequence Learning Backpropagation (TSSL-BP) method for training deep SNNs, which breaks down error backpropagation across two types of inter-neuron and intra-neuron dependencies and leads to improved temporal learning precision. TSSL-BP efficiently trains deep SNNs within a much shortened temporal window of a few steps while improving the accuracy for various image classification datasets including CIFAR10.


Bridging Discrete and Backpropagation: Straight-Through and Beyond

Neural Information Processing Systems

Backpropagation, the cornerstone of deep learning, is limited to computing gradients for continuous variables. This limitation poses challenges for problems involving discrete latent variables. To address this issue, we propose a novel approach to approximate the gradient of parameters involved in generating discrete latent variables. First, we examine the widely used Straight-Through (ST) heuristic and demonstrate that it works as a first-order approximation of the gradient. Guided by our findings, we propose ReinMax, which achieves second-order accuracy by integrating Heun's method, a second-order numerical method for solving ODEs.


GAIT-prop: A biologically plausible learning rule derived from backpropagation of error

Neural Information Processing Systems

Traditional backpropagation of error, though a highly successful algorithm for learning in artificial neural network models, includes features which are biologically implausible for learning in real neural circuits. An alternative called target propagation proposes to solve this implausibility by using a top-down model of neural activity to convert an error at the output of a neural network into layer-wise and plausible'targets' for every unit. These targets can then be used to produce weight updates for network training. However, thus far, target propagation has been heuristically proposed without demonstrable equivalence to backpropagation. Here, we derive an exact correspondence between backpropagation and a modified form of target propagation (GAIT-prop) where the target is a small perturbation of the forward pass.


Backpropagation-Friendly Eigendecomposition

Neural Information Processing Systems

Eigendecomposition (ED) is widely used in deep networks. However, the backpropagation of its results tends to be numerically unstable, whether using ED directly or approximating it with the Power Iteration method, particularly when dealing with large matrices. While this can be mitigated by partitioning the data in small and arbitrary groups, doing so has no theoretical basis and makes its impossible to exploit the power of ED to the full. In this paper, we introduce a numerically stable and differentiable approach to leveraging eigenvectors in deep networks. It can handle large matrices without requiring to split them. We demonstrate the better robustness of our approach over standard ED and PI for ZCA whitening, an alternative to batch normalization, and for PCA denoising, which we introduce as a new normalization strategy for deep networks, aiming to further denoise the network's features.


Attention-Gated Brain Propagation: How the brain can implement reward-based error backpropagation

Neural Information Processing Systems

Much recent work has focused on biologically plausible variants of supervised learning algorithms. However, there is no teacher in the motor cortex that instructs the motor neurons and learning in the brain depends on reward and punishment. We demonstrate a biologically plausible reinforcement learning scheme for deep networks with an arbitrary number of layers. The network chooses an action by selecting a unit in the output layer and uses feedback connections to assign credit to the units in successively lower layers that are responsible for this action. After the choice, the network receives reinforcement and there is no teacher correcting the errors.


Deep inference of latent dynamics with spatio-temporal super-resolution using selective backpropagation through time

Neural Information Processing Systems

Modern neural interfaces allow access to the activity of up to a million neurons within brain circuits. However, bandwidth limits often create a trade-off between greater spatial sampling (more channels or pixels) and the temporal frequency of sampling. Here we demonstrate that it is possible to obtain spatio-temporal super-resolution in neuronal time series by exploiting relationships among neurons, embedded in latent low-dimensional population dynamics. Our novel neural network training strategy, selective backpropagation through time (SBTT), enables learning of deep generative models of latent dynamics from data in which the set of observed variables changes at each time step. The resulting models are able to infer activity for missing samples by combining observations with learned latent dynamics.


Numerical influence of ReLU'(0) on backpropagation

Neural Information Processing Systems

In theory, the choice of ReLU(0) in [0, 1] for a neural network has a negligible influence both on backpropagation and training. Yet, in the real world, 32 bits default precision combined with the size of deep learning problems makes it a hyperparameter of training methods. We investigate the importance of the value of ReLU'(0) for several precision levels (16, 32, 64 bits), on various networks (fully connected, VGG, ResNet) and datasets (MNIST, CIFAR10, SVHN, ImageNet). We observe considerable variations of backpropagation outputs which occur around half of the time in 32 bits precision. The effect disappears with double precision, while it is systematic at 16 bits.


Reviews: The Reversible Residual Network: Backpropagation Without Storing Activations

Neural Information Processing Systems

The authors introduce "RevNets", which avoid storing (some) activations by utilizing computational blocks that are trivial to invert (i.e. Revnets match the performance of ResNets with the same number of parameters, and in practice RevNets appear to save 4X in storage at the cost of a 2X increase in computation. Interestingly, the reversible blocks are also volume preserving, which is not explicitly discussed, but should be, because this is a potential limitation. The approach of reconstructing activations rather than storing them is only applicable to invertible layers, and so while requiring only O(1) storage for invertible layers, succeeds in only a 4X gain in storage requirements (which is nevertheless impressive). One concern I have is that the recent work on decoupled neural interfaces (DNI) is not adequately discussed or compared to (DNI also requires O(1) storage, and estimates error signals [and optionally input values] analogously to how value functions are learned in reinforcement learning).


Reviews: On Blackbox Backpropagation and Jacobian Sensing

Neural Information Processing Systems

The paper focuses on automatic differentiation of multiple-variable vector-valued functioned in the context of training model parameters from input data. A standard approach in a number of popular environments rely on a propagation of the errors in the evaluation of the model parameters so far using backpropagation of the estimation error as computed of the (local) gradient of the loss function. In this paper, the emphasis is on settings where some of the operators of the model are exogenous blackboxes for which the gradient cannot be computed explicitly and one resorts to finite differencing of the function of interest. Such approach can be prohibitively expensive if the Jacobian does not have some special structure that can be exploited. The strategy pursued in this paper consists in exploiting the relationship between graph colouring and Jacobian estimation.