Adam on Local Time: Addressing Nonstationarity in RL with Relative Adam Timesteps
In reinforcement learning (RL), it is common to apply techniques used broadly in machine learning, such as neural network function approximators and momentum-based optimizers. However, such tools were largely developed for supervised learning rather than nonstationary RL, leading practitioners to adopt target networks, clipped policy updates, and other RL-specific implementation tricks to combat this mismatch, rather than directly adapting this toolchain for use in RL. In this paper, we take a different approach and instead address the effect of nonstationarity by adapting the widely used Adam optimiser. We first analyse the impact of nonstationary gradient magnitude, such as that caused by a change in target network, on Adam's update size, demonstrating that such a change can lead to large updates and hence sub-optimal performance. To address this, we introduce Adam-Rel. Rather than using the global timestep in the Adam update, Adam-Rel uses the timestep within an epoch, essentially resetting Adam's timestep to 0 after target changes. We demonstrate that this avoids large updates and reduces to learning rate annealing in the absence of such increases in gradient magnitude. Evaluating Adam-Rel in both on-policy and off-policy RL, we demonstrate improved performance in both Atari and Craftax. We then show that increases in gradient norm occur in RL in practice, and examine the differences between our theoretical model and the observed data.
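The timestep reset described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `adam_rel_step`, the default hyperparameters, and the convention of passing the post-reset step count explicitly are all our assumptions. The key point is that the bias-correction terms use a relative timestep that is reset after each target change, while the moment estimates themselves are kept.

```python
import numpy as np

def adam_rel_step(param, grad, m, v, t_rel,
                  lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update using a *relative* timestep.

    t_rel is the number of steps taken since the last target-network
    change (1 on the first step after a reset), rather than the global
    step count used by standard Adam.
    """
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction uses the local (relative) timestep, so right after
    # a reset the update magnitude is again bounded close to lr.
    m_hat = m / (1 - beta1 ** t_rel)
    v_hat = v / (1 - beta2 ** t_rel)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v
```

In this sketch the caller would set `t_rel = 1` on the first step after a target change and increment it on each subsequent step; the moments `m` and `v` carry over across resets, so only the bias correction restarts.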
Time-Reversed Dissipation Induces Duality Between Minimizing Gradient Norm and Function Value
In convex optimization, first-order optimization methods that efficiently minimize function values have been a central subject of study since Nesterov's seminal work of 1983. Recently, however, Kim and Fessler's OGM-G and Lee et al.'s FISTA-G have been presented as alternatives that efficiently minimize the gradient magnitude instead. In this paper, we present H-duality, which represents a surprising one-to-one correspondence between methods efficiently minimizing function values and methods efficiently minimizing gradient magnitude.
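The two regimes contrasted in the abstract can be summarized by the standard accelerated rates for an $L$-smooth convex function $f$ with minimizer $x_\star$ (indicative big-$O$ forms from the literature, not the paper's exact constants):

```latex
\[
  f(x_N) - f(x_\star) \le O\!\left(\frac{L\,\|x_0 - x_\star\|^2}{N^2}\right)
  \quad \text{(function-value minimization, e.g.\ Nesterov's method)}
\]
\[
  \|\nabla f(y_N)\|^2 \le O\!\left(\frac{L\,\bigl(f(x_0) - f(x_\star)\bigr)}{N^2}\right)
  \quad \text{(gradient-norm minimization, e.g.\ OGM-G)}
\]
```

Note the symmetry between the two bounds: the initial distance $\|x_0 - x_\star\|^2$ in the first is replaced by the initial suboptimality $f(x_0) - f(x_\star)$ in the second, which is the kind of correspondence the H-duality result formalizes.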
We thank the four reviewers for their constructive comments. The following are our responses to the reviewers' comments. We will rewrite the formulations in the revision. The classifier is trained with learning rate in {2, 5, 10} and batch size 256 for 50 epochs; the best accuracy of the classifier is reported. For NNP we test s = 25 and 64, and for MS we use the original authors' GitHub implementation.
A Observations in Local Memory Similarity
We observed local memory similarity through Q-Q (quantile-quantile) plots, as shown in Figure A1. In Figure A1(a), the linearity of the points in the Q-Q plot suggests that worker 1's local memory follows a distribution similar to the other workers'. This is consistent with our observations of pairwise cosine distance shown in Figure 2(a), and indicates that we can possibly use a local worker's top-k as a proxy for the true top-k. One variant of Young's inequality is ‖x + y‖² ≤ 2‖x‖² + 2‖y‖², and the quadrilateral identity is ⟨x, y⟩ = ½(‖x‖² + ‖y‖² − ‖x − y‖²), where x⋆ denotes a global minimum of f(x). We provide the following table to explain Section 3's main results and connect them to other parts of the paper; our Theorem 1 shows this and indicates its applicability in distributed training.
- Lemma 1 (contraction property). Intuition: higher correlation between workers brings CLT-k closer to the true top-k. Evidence: Figures 2 and 3 show high correlation, so our contraction is close to the true top-k.
- Lemma 2 (contraction in the distributed setting). Requirement: positive correlation between workers in the distributed setting. Evidence: Figures 2 and 3 show positive correlation between workers.
- Theorem 1 (ScaleCom's convergence rate is the same as SGD's, O(1/√T)). Evidence: Tables 1 and 2 (Figures 4 and 5) verify that ScaleCom's convergence matches the baseline.
Each node is equipped with 2 IBM POWER9 processors clocked at 3.15 GHz.
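The two elementary facts invoked above, Young's inequality in the form ‖x + y‖² ≤ 2‖x‖² + 2‖y‖² and the quadrilateral (polarization) identity, can be checked numerically. This is a quick sanity-check sketch using their standard textbook forms, which we have reconstructed here; it is not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=5)
y = rng.normal(size=5)

# Variant of Young's inequality: ||x + y||^2 <= 2||x||^2 + 2||y||^2
lhs = np.linalg.norm(x + y) ** 2
rhs = 2 * np.linalg.norm(x) ** 2 + 2 * np.linalg.norm(y) ** 2
assert lhs <= rhs

# Quadrilateral (polarization) identity:
# <x, y> = (||x||^2 + ||y||^2 - ||x - y||^2) / 2
inner = (np.linalg.norm(x) ** 2 + np.linalg.norm(y) ** 2
         - np.linalg.norm(x - y) ** 2) / 2
assert np.isclose(inner, x @ y)
```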
Adaptive Surrogate Gradients for Sequential Reinforcement Learning in Spiking Neural Networks
Van den Berghe, Korneel; Stroobants, Stein; Reddi, Vijay Janapa; de Croon, G. C. H. E.
Neuromorphic computing systems are set to revolutionize energy-constrained robotics by achieving orders-of-magnitude efficiency gains, while enabling native temporal processing. Spiking Neural Networks (SNNs) represent a promising algorithmic approach for these systems, yet their application to complex control tasks faces two critical challenges: (1) the non-differentiable nature of spiking neurons necessitates surrogate gradients with unclear optimization properties, and (2) the stateful dynamics of SNNs require training on sequences, which in reinforcement learning (RL) is hindered by limited sequence lengths during early training, preventing the network from bridging its warm-up period. We address these challenges by systematically analyzing surrogate gradient slope settings, showing that shallower slopes increase gradient magnitude in deeper layers but reduce alignment with true gradients. In supervised learning, we find no clear preference for fixed or scheduled slopes. The effect is much more pronounced in RL settings, where shallower slopes or scheduled slopes lead to a 2.1x improvement in both training and final deployed performance. Next, we propose a novel training approach that leverages a privileged guiding policy to bootstrap the learning process, while still exploiting online environment interactions with the spiking policy. Combining our method with an adaptive slope schedule for a real-world drone position control task, we achieve an average return of 400 points, substantially outperforming prior techniques, including Behavioral Cloning and TD3BC, which achieve at most -200 points under the same conditions. This work advances both the theoretical understanding of surrogate gradient learning in SNNs and practical training methodologies for neuromorphic controllers demonstrated in real-world robotic systems.
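The surrogate-gradient idea discussed in this abstract can be illustrated with a minimal sketch: the forward pass uses the non-differentiable Heaviside spike function, while the backward pass substitutes the derivative of a smooth approximation whose sharpness is set by a slope parameter. The sigmoid surrogate and the function names below are our assumptions for illustration, not necessarily the paper's exact surrogate.

```python
import numpy as np

def spike_forward(v, threshold=1.0):
    """Forward pass: non-differentiable Heaviside spike nonlinearity."""
    return (np.asarray(v) >= threshold).astype(float)

def surrogate_grad(v, threshold=1.0, slope=5.0):
    """Backward pass: derivative of a sigmoid surrogate.

    `slope` controls sharpness. A shallower (smaller) slope spreads
    nonzero gradient to membrane potentials far from the threshold,
    at the cost of a poorer local approximation of the true step.
    """
    s = 1.0 / (1.0 + np.exp(-slope * (np.asarray(v) - threshold)))
    return slope * s * (1.0 - s)
```

Consistent with the abstract's observation, far from the threshold a shallow slope passes more gradient than a steep one (e.g. `surrogate_grad(0.0, slope=1.0)` exceeds `surrogate_grad(0.0, slope=10.0)`), which is why slope settings interact with gradient magnitude in deeper layers.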