Goto

Collaborating Authors

 reset


Staggered Environment Resets Improve Massively Parallel On-Policy Reinforcement Learning

Neural Information Processing Systems

Massively parallel GPU simulation environments have accelerated reinforcement learning (RL) research by enabling fast data collection for on-policy RL algorithms like Proximal Policy Optimization (PPO). To maximize throughput, it is common to use short rollouts per policy update, increasing the update-to-data (UTD) ratio. However, we find that, in this setting, standard synchronous resets introduce harmful nonstationarity, skewing the learning signal and destabilizing training. We introduce staggered resets, a simple yet effective technique where environments are initialized and reset at varied points within the task horizon. This yields training batches with greater temporal diversity, reducing the nonstationarity induced by synchronized rollouts. We characterize dimensions along which RL environments can benefit significantly from staggered resets through illustrative toy environments. We then apply this technique to challenging high-dimensional robotics environments, achieving significantly higher sample efficiency, faster wall-clock convergence, and stronger final performance. Finally, this technique scales better with more parallel environments compared to naive synchronized rollouts.


Robust and Scalable Autonomous Reinforcement Learning in Irreversible Environments

Neural Information Processing Systems

Reinforcement learning (RL) typically assumes repetitive resets to provide an agent with diverse and unbiased experiences. These resets require significant human intervention and result in poor training efficiency in real-world settings.


Overleaf Example

Neural Information Processing Systems

Although Federated Learning (FL) is promising for privacy-preserving collaborative model training, it suffers from low inference performance due to heterogeneous client data. Due to heterogeneous data across clients, FL training easily learns client-specific overfitting features. Existing FL methods adopt coarsegrained averaging, which can easily cause the global model to get stuck in local optima, leading to poor generalization. Specifically, this paper presents a novel FL framework, FedPhoenix, to address this issue. It stochastically resets partial parameters in each round to destroy some features of the global model, guiding FL training to learn multiple generalized features for inference rather than specific overfitting features. Experimental results on various wellknown datasets demonstrate that compared to SOTAFL methods, FedPhoenix can achieve up to 20.73% higher accuracy. The implementation is publicly available at https://github.com/UniString/FedPhoenix.


Ridge Regression from Poisson Resetting: A Renewal Perspective on Spectral Regularization

arXiv.org Machine Learning

We connect stochastic resetting from non-equilibrium statistical physics with ridge regularization in statistical learning. For linear gradient flow, resetting to the origin at rate $r$ produces stationary mean $(X^\top X+rI)^{-1}X^\top y$, exactly the ridge estimator with penalty $λ=r$. This uses the known Laplace-transform relationship between ridge regression and exponential-time averaging of gradient flow, with the exponential time now interpreted as the stationary age associated with Poisson resetting. We then extend this identity to general renewal reset laws: the exponential reset time distribution is the unique renewal law whose stationary mean reproduces scalar ridge in every eigendirection as an exact filter identity for every positive curvature, while non-exponential renewal laws generate alternative spectral filters. At the fluctuation level, we study a separate additive Ornstein-Uhlenbeck extension with constant diffusion, interpreted as a stylized SGD approximation. In this setting, the equality holds only at the level of the mean, since the reset process has a nonzero stationary covariance from accumulated OU noise and reset-timing variance, whereas deterministic ridge is a fixed estimator with the same center. Stylized experiments compare the deterministic renewal-induced filters directly and illustrate when filters induced by non-exponential reset-time laws can differ predictively from ridge. The results for the stationary mean and the induced spectral filters are established for continuous-time gradient flow with isotropic resetting on quadratic objectives; the covariance and risk formulas additionally assume additive noise with state-independent covariance.




Resetting the Optimizer in Deep RL: An Empirical Study

Neural Information Processing Systems

We focus on the task of approximating the optimal value function in deep reinforcement learning. This iterative process is comprised of solving a sequence of optimization problems where the loss function changes per iteration. The common approach to solving this sequence of problems is to employ modern variants of the stochastic gradient descent algorithm such as Adam. These optimizers maintain their own internal parameters such as estimates of the first-order and the second-order moments of the gradient, and update them over time. Therefore, information obtained in previous iterations is used to solve the optimization problem in the current iteration. We demonstrate that this can contaminate the moment estimates because the optimization landscape can change arbitrarily from one iteration to the next one. To hedge against this negative effect, a simple idea is to reset the internal parameters of the optimizer when starting a new iteration. We empirically investigate this resetting idea by employing various optimizers in conjunction with the Rainbow algorithm. We demonstrate that this simple modification significantly improves the performance of deep RL on the Atari benchmark.


Appendix A Control algorithm The action-value function can be decomposed into two components as: Q (PT) (s, a) = Q (P) (s, a) + Q (T) w

Neural Information Processing Systems

We use induction to prove this statement. The penultimate step follows from the induction hypothesis completing the proof. Then, the fixed point of Eq.(5) is the value function of in f M . We focus on permanent value function in the next two theorems. The permanent value function is updated using Eq.