Goto

Collaborating Authors

 exact gradient


Biologically-plausiblebackpropagationthrough arbitrarytimespansvialocalneuromodulators

Neural Information Processing Systems

Here, we propose that extra-synaptic diffusion of local neuromodulators such as neuropeptides may afford an effective mode of backpropagation lying within the bounds of biological plausibility.


Symplectic Adjoint Method for Exact Gradient of Neural ODE with Minimal Memory

Neural Information Processing Systems

A neural network model of a differential equation, namely neural ODE, has enabled the learning of continuous-time dynamical systems and probabilistic distributions with high accuracy. The neural ODE uses the same network repeatedly during a numerical integration. The memory consumption of the backpropagation algorithm is proportional to the number of uses times the network size. This is true even if a checkpointing scheme divides the computation graph into sub-graphs.


On Training Implicit Models

Neural Information Processing Systems

This paper focuses on training implicit models of infinite layers. Specifically, previous works employ implicit differentiation and solve the exact gradient for the backward propagation. However, is it necessary to compute such an exact but expensive gradient for training?


Backward-Friendly Optimization: Training Large Language Models with Approximate Gradients under Memory Constraints

Yang, Jing, Cai, Kaitong, Fan, Yijia, Yang, Yufeng, Wang, Keze

arXiv.org Artificial Intelligence

Full fine-tuning of Large Language Models (LLMs) is notoriously memory-intensive, primarily because conventional optimizers such as SGD or Adam assume access to exact gradients derived from cached activations. Existing solutions either alter the model architecture (e.g., reversible networks) or trade memory for computation (e.g., activation checkpointing), but the optimizer itself remains untouched. In this work, we introduce GradLite, a backward-friendly optimizer that relaxes the requirement of exact gradients, enabling efficient training even when intermediate activations are aggressively discarded or approximated. GradLite leverages two key techniques: (i) low-rank Jacobian approximation, which reduces the dimensionality of backpropagated error signals, and (ii) error-feedback correction, which accumulates and compensates approximation errors across iterations to preserve convergence guarantees. We provide a theoretical analysis showing that GradLite maintains unbiased gradient estimates with bounded variance, ensuring convergence rates comparable to Adam. Empirically, GradLite reduces optimizer-state and activation memory consumption by up to 50\% without architectural changes, and achieves on-par or superior downstream performance on reasoning (MMLU, GSM8K), multilingual, and dialogue benchmarks compared to checkpointing and optimizer-centric baselines (LoMo, GaLore).


A Complete Pipeline for deploying SNNs with Synaptic Delays on Loihi 2

Mészáros, Balázs, Knight, James C., Timcheck, Jonathan, Nowotny, Thomas

arXiv.org Artificial Intelligence

Abstract--Spiking Neural Networks are attracting increased attention as a more energy-efficient alternative to traditional Artificial Neural Networks for edge computing. Neuromorphic computing can significantly reduce energy requirements. Here, we present a complete pipeline: efficient event-based training of SNNs with synaptic delays on GPUs and deployment on Intel's Loihi 2 neuromorphic chip. We evaluate our approach on keyword recognition tasks using the Spiking Heidelberg Digits and Spiking Speech Commands datasets, demonstrating that our algorithm can enhance classification accuracy compared to architectures without delays. Our benchmarking indicates almost no accuracy loss between GPU and Loihi 2 implementations, while classification on Loihi 2 is up to 18 faster and uses 250 less energy than on an NVIDIA Jetson Orin Nano.





Exact Gradients for Stochastic Spiking Neural Networks Driven by Rough Signals

Neural Information Processing Systems

We introduce a mathematically rigorous framework based on rough path theory to model stochastic spiking neural networks (SSNNs) as stochastic differential equations with event discontinuities (Event SDEs) and driven by càdlàg rough paths. Our formalism is general enough to allow for potential jumps to be present both in the solution trajectories as well as in the driving noise. We then identify a set of sufficient conditions ensuring the existence of pathwise gradients of solution trajectories and event times with respect to the network's parameters and show how these gradients satisfy a recursive relation. Furthermore, we introduce a general-purpose loss function defined by means of a new class of signature kernels indexed on càdlàg rough paths and use it to train SSNNs as generative models. We provide an end-to-end autodifferentiable solver for Event SDEs and make its implementation available as part of the \texttt{diffrax} library.


Accelerated Mini-batch Randomized Block Coordinate Descent Method

Tuo Zhao, Mo Yu, Yiming Wang, Raman Arora, Han Liu

Neural Information Processing Systems

We consider regularized empirical risk minimization problems. In particular, we minimize the sum of a smooth empirical risk function and a nonsmooth regularization function. When the regularization function is block separable, we can solve the minimization problems in a randomized block coordinate descent (RBCD) manner. Existing RBCD methods usually decrease the objective value by exploiting the partial gradient of a randomly selected block of coordinates in each iteration. Thus they need all data to be accessible so that the partial gradient of the block gradient can be exactly obtained.