exact gradient
Symplectic Adjoint Method for Exact Gradient of Neural ODE with Minimal Memory
A neural network model of a differential equation, the neural ODE, enables learning continuous-time dynamical systems and probability distributions with high accuracy. The neural ODE evaluates the same network repeatedly during numerical integration, so the memory consumption of the backpropagation algorithm is proportional to the number of evaluations times the network size. This holds even when a checkpointing scheme divides the computation graph into sub-graphs.
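To make the memory argument concrete, here is a minimal, hypothetical sketch (not the paper's symplectic adjoint method): a fixed-step Euler integration of a neural ODE in PyTorch, where torch.utils.checkpoint splits the computation graph into sub-graphs so that only the checkpointed states plus one sub-graph of activations are held at a time. The toy network, step counts, and segment sizes are illustrative assumptions.

```python
# Minimal sketch: gradient checkpointing over sub-graphs of a neural-ODE
# integration (illustrative only; not the symplectic adjoint method).
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

odefunc = nn.Sequential(nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 2))  # assumed toy network

def euler_segment(y, dt, n_inner):
    # One sub-graph: n_inner Euler steps whose activations are recomputed
    # during the backward pass instead of being stored.
    for _ in range(n_inner):
        y = y + dt * odefunc(y)
    return y

y = torch.randn(16, 2, requires_grad=True)
dt, n_segments, n_inner = 0.01, 10, 10  # 100 network evaluations, 10 checkpoints
for _ in range(n_segments):
    y = checkpoint(euler_segment, y, dt, n_inner, use_reentrant=False)
loss = y.pow(2).sum()
loss.backward()  # peak memory ~ checkpointed states + one sub-graph of activations,
                 # rather than activations for all 100 network evaluations
```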
Backward-Friendly Optimization: Training Large Language Models with Approximate Gradients under Memory Constraints
Yang, Jing, Cai, Kaitong, Fan, Yijia, Yang, Yufeng, Wang, Keze
Full fine-tuning of Large Language Models (LLMs) is notoriously memory-intensive, primarily because conventional optimizers such as SGD or Adam assume access to exact gradients derived from cached activations. Existing solutions either alter the model architecture (e.g., reversible networks) or trade memory for computation (e.g., activation checkpointing), but the optimizer itself remains untouched. In this work, we introduce GradLite, a backward-friendly optimizer that relaxes the requirement of exact gradients, enabling efficient training even when intermediate activations are aggressively discarded or approximated. GradLite leverages two key techniques: (i) low-rank Jacobian approximation, which reduces the dimensionality of backpropagated error signals, and (ii) error-feedback correction, which accumulates and compensates approximation errors across iterations to preserve convergence guarantees. We provide a theoretical analysis showing that GradLite maintains unbiased gradient estimates with bounded variance, ensuring convergence rates comparable to Adam. Empirically, GradLite reduces optimizer-state and activation memory consumption by up to 50\% without architectural changes, and achieves on-par or superior downstream performance on reasoning (MMLU, GSM8K), multilingual, and dialogue benchmarks compared to checkpointing and optimizer-centric baselines (LoMo, GaLore).
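As a rough illustration of the two ingredients named in the abstract, the following is a hedged sketch (not the authors' GradLite implementation): a rank-r SVD projection of a gradient matrix stands in for the low-rank Jacobian approximation, and an error-feedback buffer re-injects the discarded residual on the next step. The rank, the SVD-based projector, and the plain SGD update are illustrative assumptions.

```python
# Hedged sketch: low-rank gradient approximation with error feedback
# (illustrative only; not the GradLite optimizer itself).
import torch

def low_rank_step(param, grad, err_buf, lr=1e-3, r=8):
    g = grad + err_buf                     # add back previously discarded error
    U, S, Vh = torch.linalg.svd(g, full_matrices=False)
    g_lr = (U[:, :r] * S[:r]) @ Vh[:r, :]  # rank-r approximation of the gradient
    err_buf = g - g_lr                     # remember what the approximation dropped
    param = param - lr * g_lr              # simple SGD update with the approximate gradient
    return param, err_buf

W = torch.randn(256, 128)
err = torch.zeros_like(W)
for _ in range(3):
    grad = torch.randn_like(W)             # stand-in for a real backpropagated gradient
    W, err = low_rank_step(W, grad, err)
```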
A Complete Pipeline for deploying SNNs with Synaptic Delays on Loihi 2
Mészáros, Balázs, Knight, James C., Timcheck, Jonathan, Nowotny, Thomas
Spiking Neural Networks (SNNs) are attracting increased attention as a more energy-efficient alternative to traditional Artificial Neural Networks for edge computing, and neuromorphic hardware can significantly reduce their energy requirements. Here, we present a complete pipeline: efficient event-based training of SNNs with synaptic delays on GPUs and deployment on Intel's Loihi 2 neuromorphic chip. We evaluate our approach on keyword recognition tasks using the Spiking Heidelberg Digits and Spiking Speech Commands datasets, demonstrating that our algorithm can enhance classification accuracy compared to architectures without delays. Our benchmarking indicates almost no accuracy loss between the GPU and Loihi 2 implementations, while classification on Loihi 2 is up to 18× faster and uses 250× less energy than on an NVIDIA Jetson Orin Nano.
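For readers unfamiliar with per-synapse delays, the sketch below is a hypothetical, NumPy-only illustration (not the paper's GPU training code or its Loihi 2 deployment): a leaky integrate-and-fire layer in which each synapse carries an integer delay, realized as a circular buffer of future input currents. All sizes, the input spike rate, and the delay range are assumptions.

```python
# Illustrative sketch: leaky integrate-and-fire neurons with per-synapse
# integer delays, implemented via a circular buffer of scheduled currents.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, max_delay, T = 8, 4, 5, 100
w = rng.standard_normal((n_in, n_out)) * 0.5
delay = rng.integers(0, max_delay, size=(n_in, n_out))  # per-synapse delay in timesteps
buf = np.zeros((max_delay, n_out))                      # currents scheduled for future steps
v, tau, v_th = np.zeros(n_out), 20.0, 1.0

for t in range(T):
    spikes_in = (rng.random(n_in) < 0.05).astype(float)  # stand-in random input spikes
    # route each input spike to the buffer slot its synaptic delay points at
    for i in np.nonzero(spikes_in)[0]:
        for j in range(n_out):
            buf[(t + delay[i, j]) % max_delay, j] += w[i, j]
    v = v * np.exp(-1.0 / tau) + buf[t % max_delay]      # leak plus delayed input current
    buf[t % max_delay] = 0.0                             # clear the consumed slot
    spikes_out = v >= v_th
    v[spikes_out] = 0.0                                  # reset neurons that fired
```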
Exact Gradients for Stochastic Spiking Neural Networks Driven by Rough Signals
We introduce a mathematically rigorous framework based on rough path theory to model stochastic spiking neural networks (SSNNs) as stochastic differential equations with event discontinuities (Event SDEs) and driven by càdlàg rough paths. Our formalism is general enough to allow for potential jumps to be present both in the solution trajectories as well as in the driving noise. We then identify a set of sufficient conditions ensuring the existence of pathwise gradients of solution trajectories and event times with respect to the network's parameters and show how these gradients satisfy a recursive relation. Furthermore, we introduce a general-purpose loss function defined by means of a new class of signature kernels indexed on càdlàg rough paths and use it to train SSNNs as generative models. We provide an end-to-end autodifferentiable solver for Event SDEs and make its implementation available as part of the \texttt{diffrax} library.
Accelerated Mini-batch Randomized Block Coordinate Descent Method
Tuo Zhao, Mo Yu, Yiming Wang, Raman Arora, Han Liu
We consider regularized empirical risk minimization problems. In particular, we minimize the sum of a smooth empirical risk function and a nonsmooth regularization function. When the regularization function is block separable, we can solve the minimization problem in a randomized block coordinate descent (RBCD) manner. Existing RBCD methods usually decrease the objective value by exploiting the partial gradient of a randomly selected block of coordinates in each iteration. They therefore require all data to be accessible so that the partial gradient of the selected block can be computed exactly.
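To make the setting concrete, the following is a minimal sketch of plain randomized block coordinate descent with a proximal (soft-thresholding) step on an l1-regularized least-squares problem; it uses all data for the partial gradient and is not the paper's accelerated mini-batch variant. The block count, step size, and regularization weight are illustrative choices.

```python
# Minimal sketch: randomized block coordinate proximal gradient descent
# for min_x (1/2n)||Ax - y||^2 + lam * ||x||_1  (block-separable l1 term).
import numpy as np

rng = np.random.default_rng(0)
n, d, n_blocks = 200, 50, 10
A, y = rng.standard_normal((n, d)), rng.standard_normal(n)
lam, x = 0.1, np.zeros(d)
blocks = np.array_split(np.arange(d), n_blocks)
L = np.linalg.norm(A, 2) ** 2 / n                 # crude Lipschitz bound shared by all blocks

for _ in range(2000):
    b = blocks[rng.integers(n_blocks)]            # pick one block uniformly at random
    grad_b = A[:, b].T @ (A @ x - y) / n          # partial gradient w.r.t. that block only
    z = x[b] - grad_b / L                         # gradient step on the selected block
    x[b] = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # prox of l1 (soft-threshold)
```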