Backpropagation
Learning Discrete Directed Acyclic Graphs via Backpropagation
Wren, Andrew J., Minervini, Pasquale, Franceschi, Luca, Zantedeschi, Valentina
Recently, continuous relaxations have been proposed in order to learn Directed Acyclic Graphs (DAGs) from data by backpropagation, instead of using combinatorial optimization. However, a number of techniques for fully discrete backpropagation could instead be applied. In this paper, we explore that direction and propose DAG-DB, a framework for learning DAGs by Discrete Backpropagation. Based on the architecture of Implicit Maximum Likelihood Estimation [I-MLE, arXiv:2106.01798], DAG-DB adopts a probabilistic approach to the problem, sampling binary adjacency matrices from an implicit probability distribution. DAG-DB learns a parameter for the distribution from the loss incurred by each sample, performing competitively using either of two fully discrete backpropagation techniques, namely I-MLE and Straight-Through Estimation.
Scaling Laws Beyond Backpropagation
Filipovich, Matthew J., Cappelli, Alessandro, Hesslow, Daniel, Launay, Julien
Alternatives to backpropagation have long been studied to better understand how biological brains may learn. Recently, they have also garnered interest as a way to train neural networks more efficiently. By relaxing constraints inherent to backpropagation (e.g., symmetric feedforward and feedback weights, sequential updates), these methods enable promising prospects, such as local learning. However, the tradeoffs between different methods in terms of final task performance, convergence speed, and ultimately compute and data requirements are rarely outlined. In this work, we use scaling laws to study the ability of Direct Feedback Alignment (DFA) to train causal decoder-only Transformers efficiently. Scaling laws provide an overview of the tradeoffs implied by a modeling decision, up to extrapolating how it might transfer to increasingly large models. We find that DFA fails to offer more efficient scaling than backpropagation: there is never a regime for which the degradation in loss incurred by using DFA is worth the potential reduction in compute budget. Our finding comes at variance with previous beliefs in the alternative training methods community, and highlights the need for holistic empirical approaches to better understand modeling decisions.
Single-phase deep learning in cortico-cortical networks
Greedy, Will, Zhu, Heng Wei, Pemberton, Joseph, Mellor, Jack, Costa, Rui Ponte
The error-backpropagation (backprop) algorithm remains the most common solution to the credit assignment problem in artificial neural networks. In neuroscience, it is unclear whether the brain could adopt a similar strategy to correctly modify its synapses. Recent models have attempted to bridge this gap while being consistent with a range of experimental observations. However, these models are either unable to effectively backpropagate error signals across multiple layers or require a multi-phase learning process, neither of which is reminiscent of learning in the brain. Here, we introduce a new model, Bursting Cortico-Cortical Networks (BurstCCN), which solves these issues by integrating known properties of cortical networks, namely bursting activity, short-term plasticity (STP) and dendrite-targeting interneurons. BurstCCN relies on burst multiplexing via connection-type-specific STP to propagate backprop-like error signals within deep cortical networks. These error signals are encoded at distal dendrites and induce burst-dependent plasticity as a result of excitatory-inhibitory top-down inputs. First, we demonstrate that our model can effectively backpropagate errors through multiple layers using a single-phase learning process. Next, we show both empirically and analytically that learning in our model approximates backprop-derived gradients. Finally, we demonstrate that our model is capable of learning complex image classification tasks (MNIST and CIFAR-10). Overall, our results suggest that cortical features across sub-cellular, cellular, microcircuit and systems levels jointly underlie single-phase efficient deep learning in the brain.
Gradient Backpropagation based Feature Attribution to Enable Explainable-AI on the Edge
Bhat, Ashwin, Assoa, Adou Sangbone, Raychowdhury, Arijit
There has been a recent surge in the field of Explainable AI (XAI), which tackles the problem of providing insights into the behavior of black-box machine learning models. Within this field, feature attribution encompasses methods which assign relevance scores to input features and visualize them as a heatmap. Designing flexible accelerators for multiple such algorithms is challenging since the hardware mapping of these algorithms has not been studied yet. In this work, we first analyze the dataflow of gradient backpropagation based feature attribution algorithms to determine the resource overhead required over inference. The gradient computation is optimized to minimize the memory overhead. Second, we develop a High-Level Synthesis (HLS) based configurable FPGA design that is targeted for edge devices and supports three feature attribution algorithms. Tile based computation is employed to maximally use on-chip resources while adhering to the resource constraints. Representative CNNs are trained on the CIFAR-10 dataset and implemented on multiple Xilinx FPGAs using 16-bit fixed-point precision, demonstrating the flexibility of our library. Finally, through efficient reuse of allocated hardware resources, our design methodology demonstrates a pathway to repurpose inference accelerators to support feature attribution with minimal overhead, thereby enabling real-time XAI on the edge.
Belief propagation generalizes backpropagation
The two most important algorithms in artificial intelligence are backpropagation and belief propagation. In spite of their importance, the connection between them is poorly characterized. We show that when an input to backpropagation is converted into an input to belief propagation so that (loopy) belief propagation can be run on it, then the result of belief propagation encodes the result of backpropagation; thus backpropagation is recovered as a special case of belief propagation. In other words, we prove for apparently the first time that belief propagation generalizes backpropagation. Our analysis is a theoretical contribution, which we motivate with the expectation that it might reconcile our understandings of each of these algorithms, and serve as a guide to engineering researchers seeking to improve the behavior of systems that use one or the other.
Backpropagation of Simple Expression
Now let's do the forward pass using simple basic operations. If we want to check our forward pass using Digraph, we can simply use the function that we have already created. This code gives us a very easy-to-understand graph; anyone can follow what is happening just by looking at it. Now that we are done with the forward pass, it is the turn of the backward pass. Before that, let's add a backward function to our Digraph code and visualize the graph again. As soon as we hear the words "backward pass", we know that we are going to deal with backpropagation.
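The forward and backward pass described above can be sketched in a few lines of Python. This is a minimal illustration, not the original post's code: the `Value` class, its attribute names, and the example expression `a * b + c` are all assumptions made for this sketch, and the Digraph visualization is left out.

```python
class Value:
    """A scalar that records how it was computed (hypothetical helper)."""

    def __init__(self, data, parents=(), op=""):
        self.data = data
        self.grad = 0.0                  # d(output)/d(this node), filled in by backward()
        self._parents = parents
        self._op = op
        self._backward = lambda: None    # leaf nodes have nothing to propagate

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other), "+")

        def _backward():
            # Addition passes the upstream gradient through unchanged.
            self.grad += out.grad
            other.grad += out.grad

        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other), "*")

        def _backward():
            # Product rule: each input's local derivative is the other input.
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out._backward = _backward
        return out

    def backward(self):
        # Topologically order the graph, then apply the chain rule from the
        # output back to the leaves.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            node._backward()


# Forward pass on a simple expression.
a, b, c = Value(2.0), Value(-3.0), Value(10.0)
d = a * b + c          # d.data == 4.0

# Backward pass: fills in the gradient of d with respect to every node.
d.backward()
print(a.grad, b.grad, c.grad)   # -3.0 2.0 1.0
```

Each operation stores a small closure that knows its local derivative, so the backward pass only has to visit the nodes in reverse order and accumulate gradients.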
Optimization without Backpropagation
Forward gradients have been recently introduced to bypass backpropagation in autodifferentiation, while retaining unbiased estimators of true gradients. We derive an optimality condition to obtain best approximating forward gradients, which leads us to mathematical insights that suggest optimization in high dimension is challenging with forward gradients. Our extensive experiments on test functions support this claim.
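As a rough illustration of the basic forward-gradient estimator the abstract refers to (not the paper's optimality condition or experiments), the sketch below samples a random direction v, takes the directional derivative of f along v, and scales v by it. The test function, the hand-written `jvp_f` standing in for forward-mode autodifferentiation, and all function names are assumptions made for this example.

```python
import random


def f(x):
    # Illustrative test function (an assumption): sum of squares.
    return sum(xi * xi for xi in x)


def jvp_f(x, v):
    # Directional derivative of f at x along v. Written by hand here; in
    # practice this would come from forward-mode autodifferentiation.
    return sum(2.0 * xi * vi for xi, vi in zip(x, v))


def forward_gradient(x, rng):
    # Sample v ~ N(0, I) and return (grad f . v) * v, which equals the true
    # gradient in expectation.
    v = [rng.gauss(0.0, 1.0) for _ in x]
    d = jvp_f(x, v)
    return [d * vi for vi in v]


def averaged_forward_gradient(x, num_samples, seed=0):
    # Average many single-sample estimates to show they centre on the gradient.
    rng = random.Random(seed)
    total = [0.0] * len(x)
    for _ in range(num_samples):
        g = forward_gradient(x, rng)
        total = [t + gi for t, gi in zip(total, g)]
    return [t / num_samples for t in total]


x = [1.0, -2.0, 0.5]
true_grad = [2.0 * xi for xi in x]
estimate = averaged_forward_gradient(x, num_samples=20000)
```

A single sample is a noisy estimate, and the noise grows with the number of dimensions, which is in line with the abstract's claim that optimization in high dimension is challenging with forward gradients.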
Backpropagation Chain Rule
The chain rule is a fundamental result in calculus. Besides being a handy tool for computing derivatives in calculus homework, the chain rule is closely related to the backpropagation algorithm that is widely used for computing derivatives (gradients) in neural network training. This blog post by Boaz Barak is a beautiful tutorial on the chain rule and the backpropagation algorithm. As in Barak's post, the backpropagation algorithm is usually taught as an application of the chain rule in machine learning classes. This leads to a common belief that "backpropagation is just applying the chain rule repeatedly".
Backpropagation at the Infinitesimal Inference Limit of Energy-Based Models: Unifying Predictive Coding, Equilibrium Propagation, and Contrastive Hebbian Learning
Millidge, Beren, Song, Yuhang, Salvatori, Tommaso, Lukasiewicz, Thomas, Bogacz, Rafal
How the brain performs credit assignment is a fundamental unsolved problem in neuroscience. Many "biologically plausible" algorithms have been proposed, which compute gradients that approximate those computed by backpropagation (BP), and which operate in ways that more closely satisfy the constraints imposed by neural circuitry. Many such algorithms utilize the framework of energy-based models (EBMs), in which all free variables in the model are optimized to minimize a global energy function. However, in the literature, these algorithms exist in isolation and no unified theory exists linking them together. Here, we provide a comprehensive theory of the conditions under which EBMs can approximate BP, which lets us unify many of the BP approximation results in the literature (namely, predictive coding, equilibrium propagation, and contrastive Hebbian learning) and demonstrate that their approximation to BP arises from a simple and general mathematical property of EBMs at free-phase equilibrium. This property can then be exploited in different ways with different energy functions, and these specific choices yield a family of BP-approximating algorithms, which both includes the known results in the literature and can be used to derive new ones.
Replacing Backpropagation with Biological Plausible Top-down Credit Assignment in Deep Neural Networks Training
Chen, Jian-Hui, Wang, Zuoren, Liu, Cheng-Lin
Top-down connections in the biological brain have been shown to be important in high cognitive functions. However, the function of this mechanism in machine learning has not been defined clearly. In this study, we propose a framework consisting of a bottom-up and a top-down network. Here, we use a Top-down Credit Assignment Network (TDCA-network) to replace the loss function and backpropagation (BP), which serve as the feedback mechanism in the traditional bottom-up network training paradigm. Our results show that the credit given by a well-trained TDCA-network outperforms the gradient from backpropagation in classification tasks under different settings on multiple datasets. In addition, we successfully use a credit diffusing trick, which keeps training and testing performance unchanged, to reduce the parameter complexity of the TDCA-network. More importantly, by comparing their trajectories in the parameter landscape, we find that the TDCA-network directly achieves a global optimum, whereas backpropagation reaches only a local optimum. Thus, our results demonstrate that the TDCA-network not only provides a biologically plausible learning mechanism, but also has the potential to directly achieve a global optimum, indicating that top-down credit assignment can substitute for backpropagation and provide a better learning framework for Deep Neural Networks.