Goto

Collaborating Authors

 Reinforcement Learning


Mo' States Mo' Problems: Emergency Stop Mechanisms from Observation

Neural Information Processing Systems

In many environments, only a relatively small subset of the complete state space is necessary in order to accomplish a given task. We develop a simple technique using emergency stops (e-stops) to exploit this phenomenon. Using e-stops significantly improves sample complexity by reducing the amount of required exploration, while retaining a performance bound that efficiently trades off the rate of convergence with a small asymptotic sub-optimality gap. We analyze the regret behavior of e-stops and present empirical results in discrete and continuous settings demonstrating that our reset mechanism can provide order-of-magnitude speedups on top of existing reinforcement learning methods.


Discovery of Useful Questions as Auxiliary Tasks

Neural Information Processing Systems

Arguably, intelligent agents ought to be able to discover their own questions so that in learning answers for them they learn unanticipated useful knowledge and skills; this departs from the focus in much of machine learning on agents learning answers to externally defined questions. We present a novel method for a reinforcement learning (RL) agent to discover questions formulated as general value functions or GVFs, a fairly rich form of knowledge representation. Specifically, our method uses non-myopic meta-gradients to learn GVF-questions such that learning answers to them, as an auxiliary task, induces useful representations for the main task faced by the RL agent. We demonstrate that auxiliary tasks based on the discovered GVFs are sufficient, on their own, to build representations that support main task learning, and that they do so better than popular hand-designed auxiliary tasks from the literature. Furthermore, we show, in the context of Atari2600 videogames, how such auxiliary tasks, meta-learned alongside the main task, can improve the data efficiency of an actor-critic agent.


Optimism Stabilizes Thompson Sampling for Adaptive Inference

arXiv.org Machine Learning

Thompson sampling (TS) is widely used for stochastic multi-armed bandits, yet its inferential properties under adaptive data collection are subtle. Classical asymptotic theory for sample means can fail because arm-specific sample sizes are random and coupled with the rewards through the action-selection rule. We study this phenomenon in the $K$-armed Gaussian bandit and identify \emph{optimism} as a key mechanism for restoring \emph{stability}, a sufficient condition for valid asymptotic inference requiring each arm's pull count to concentrate around a deterministic scale. First, we prove that variance-inflated TS \citep{halder2025stable} is stable for any $K \ge 2$, including the challenging regime where multiple arms are optimal. This resolves the open question raised by \citet{halder2025stable} through extending their results from the two-armed setting to the general $K$-armed setting. Second, we analyze an alternative optimistic modification that keeps the posterior variance unchanged but adds an explicit mean bonus to posterior mean, and establish the same stability conclusion. In summary, suitably implemented optimism stabilizes Thompson sampling and enables asymptotically valid inference in multi-armed bandits, while incurring only a mild additional regret cost.


Variance Reduction Based Experience Replay for Policy Optimization

arXiv.org Machine Learning

Effective reinforcement learning (RL) for complex stochastic systems requires leveraging historical data collected in previous iterations to accelerate policy optimization. Classical experience replay treats all past observations uniformly and fails to account for their varying contributions to learning. To overcome this limitation, we propose Variance Reduction Experience Replay (VRER), a principled framework that selectively reuses informative samples to reduce variance in policy gradient estimation. VRER is algorithm-agnostic and integrates seamlessly with existing policy optimization methods, forming the basis of our sample-efficient off-policy algorithm, Policy Gradient with VRER (PG-VRER). Motivated by the lack of rigorous theoretical analysis of experience replay, we develop a novel framework that explicitly captures dependencies introduced by Markovian dynamics and behavior-policy interactions. Using this framework, we establish finite-time convergence guarantees for PG-VRER and reveal a fundamental bias-variance trade-off: reusing older experience increases bias but simultaneously reduces gradient variance. Extensive empirical experiments demonstrate that VRER consistently accelerates policy learning and improves performance over state-of-the-art policy optimization algorithms.


TorsionNet: A Reinforcement Learning Approach to Sequential Conformer Search

Neural Information Processing Systems

Molecular geometry prediction of flexible molecules, or conformer search, is a long-standing challenge in computational chemistry. This task is of great importance for predicting structure-activity relationships for a wide variety of substances ranging from biomolecules to ubiquitous materials. Substantial computational resources are invested in Monte Carlo and Molecular Dynamics methods to generate diverse and representative conformer sets for medium to large molecules, which are yet intractable to chemoinformatic conformer search methods. We present TorsionNet, an efficient sequential conformer search technique based on reinforcement learning under the rigid rotor approximation. The model is trained via curriculum learning, whose theoretical benefit is explored in detail, to maximize a novel metric grounded in thermodynamics called the Gibbs Score. Our experimental results show that TorsionNet outperforms the highest-scoring chemoinformatics method by 4x on large branched alkanes, and by several orders of magnitude on the previously unexplored biopolymer lignin, with applications in renewable energy. TorsionNet also outperforms the far more exhaustive but computationally intensive Self-Guided Molecular Dynamics sampling method.


Performance Bounds for Policy-Based Average Reward Reinforcement Learning Algorithms

Neural Information Processing Systems

Many policy-based reinforcement learning (RL) algorithms can be viewed as instantiations of approximate policy iteration (PI), i.e., where policy improvement and policy evaluation are both performed approximately. In applications where the average reward objective is the meaningful performance metric, often discounted reward formulations are used with the discount factor being close to $1,$ which is equivalent to making the expected horizon very large.


The Role of Target Update Frequencies in Q-Learning

arXiv.org Machine Learning

The target network update frequency (TUF) is a central stabilization mechanism in (deep) Q-learning. However, their selection remains poorly understood and is often treated merely as another tunable hyperparameter rather than as a principled design decision. This work provides a theoretical analysis of target fixing in tabular Q-learning through the lens of approximate dynamic programming. We formulate periodic target updates as a nested optimization scheme in which each outer iteration applies an inexact Bellman optimality operator, approximated by a generic inner loop optimizer. Rigorous theory yields a finite-time convergence analysis for the asynchronous sampling setting, specializing to stochastic gradient descent in the inner loop. Our results deliver an explicit characterization of the bias-variance trade-off induced by the target update period, showing how to optimally set this critical hyperparameter. We prove that constant target update schedules are suboptimal, incurring a logarithmic overhead in sample complexity that is entirely avoidable with adaptive schedules. Our analysis shows that the optimal target update frequency increases geometrically over the course of the learning process.


Self-Paced Deep Reinforcement Learning

Neural Information Processing Systems

Curriculum reinforcement learning (CRL) improves the learning speed and stability of an agent by exposing it to a tailored series of tasks throughout learning. Despite empirical successes, an open question in CRL is how to automatically generate a curriculum for a given reinforcement learning (RL) agent, avoiding manual design. In this paper, we propose an answer by interpreting the curriculum generation as an inference problem, where distributions over tasks are progressively learned to approach the target task. This approach leads to an automatic curriculum generation, whose pace is controlled by the agent, with solid theoretical motivation and easily integrated with deep RL algorithms. In the conducted experiments, the curricula generated with the proposed algorithm significantly improve learning performance across several environments and deep RL algorithms, matching or outperforming state-of-the-art existing CRL algorithms.


Notes on the Reward Representation of Posterior Updates

arXiv.org Machine Learning

Many ideas in modern control and reinforcement learning treat decision-making as inference: start from a baseline distribution and update it when a signal arrives. We ask when this can be made literal rather than metaphorical. We study the special case where a KL-regularized soft update is exactly a Bayesian posterior inside a single fixed probabilistic model, so the update variable is a genuine channel through which information is transmitted. In this regime, behavioral change is driven only by evidence carried by that channel: the update must be explainable as an evidence reweighing of the baseline. This yields a sharp identification result: posterior updates determine the relative, context-dependent incentive signal that shifts behavior, but they do not uniquely determine absolute rewards, which remain ambiguous up to context-specific baselines. Requiring one reusable continuation value across different update directions adds a further coherence constraint linking the reward descriptions associated with different conditioning orders.


Stabilizing Fixed-Point Iteration for Markov Chain Poisson Equations

arXiv.org Machine Learning

Poisson equations underpin average-reward reinforcement learning, but beyond ergodicity they can be ill-posed, meaning that solutions are non-unique and standard fixed point iterations can oscillate on reducible or periodic chains. We study finite-state Markov chains with $n$ states and transition matrix $P$. We show that all non-decaying modes are captured by a real peripheral invariant subspace $\mathcal{K}(P)$, and that the induced operator on the quotient space $\mathbb{R}^n/\mathcal{K}(P)$ is strictly contractive, yielding a unique quotient solution. Building on this viewpoint, we develop an end-to-end pipeline that learns the chain structure, estimates an anchor based gauge map, and runs projected stochastic approximation to estimate a gauge-fixed representative together with an associated peripheral residual. We prove $\widetilde{O}(T^{-1/2})$ convergence up to projection estimation error, enabling stable Poisson equation learning for multichain and periodic regimes with applications to performance evaluation of average-reward reinforcement learning beyond ergodicity.