
 continuous rl


Appendix A Continuous RL: Formulation and Well-Posedness A.1 Exploratory Stochastic-Control

Neural Information Processing Systems

Assumption 2. The following conditions are assumed throughout: A; (32) (iv) r has polynomial growth in x and a, i.e., there exists a constant C > 0 and µ 1 such that To do so, let's assume Theorem 6. Assume that for a policy π and for every x, Assumption 3. Assume the following conditions hold: Lemma 9. Let π, π̂ be two feedback policies. We need a lemma for the perturbation bounds. Here we present a detailed version of the CPPO algorithm. D.3 below, which clearly illustrates the advantage of square-root KL divergence.



A Temporal Difference Method for Stochastic Continuous Dynamics

arXiv.org Artificial Intelligence

For continuous systems modeled by dynamical equations such as ODEs and SDEs, Bellman's Principle of Optimality takes the form of the Hamilton-Jacobi-Bellman (HJB) equation, which provides the theoretical target of reinforcement learning (RL). Although recent advances in RL successfully leverage this formulation, the existing methods typically assume the underlying dynamics are known a priori because they need explicit access to the coefficient functions of dynamical equations to update the value function following the HJB equation. We address this inherent limitation of HJB-based RL; we propose a model-free approach still targeting the HJB equation and propose the corresponding temporal difference method. We establish exponential convergence of the idealized continuous-time dynamics and empirically demonstrate its potential advantages over transition-kernel-based formulations. The proposed formulation paves the way toward bridging stochastic control and model-free reinforcement learning.
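For orientation, the HJB equation mentioned in this abstract can be written out in one common textbook form. This is a sketch under assumed notation, not the paper's own statement: the controlled SDE $dX_t = b(X_t, a_t)\,dt + \sigma(X_t, a_t)\,dW_t$, the running reward $r$, the discount rate $\beta > 0$, and the infinite-horizon discounted objective are all assumptions introduced here for illustration.

```latex
% Assumed setting: infinite-horizon discounted control of
%   dX_t = b(X_t, a_t) dt + sigma(X_t, a_t) dW_t,
% with value function
%   V(x) = sup E[ \int_0^\infty e^{-\beta t} r(X_t, a_t) dt | X_0 = x ].
\[
  \beta V(x)
  = \sup_{a}
    \Big\{
      r(x, a)
      + b(x, a)^{\top} \nabla V(x)
      + \tfrac{1}{2} \operatorname{tr}\!\big(
          \sigma(x, a)\,\sigma(x, a)^{\top} \nabla^{2} V(x)
        \big)
    \Big\}
\]
```

The coefficient functions $b$ and $\sigma$ appear explicitly on the right-hand side, which is why a method that updates the value function directly from this equation would need them to be known.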



Adapting Double Q-Learning for Continuous Reinforcement Learning

arXiv.org Artificial Intelligence

The majority of off-policy reinforcement learning algorithms use techniques to control overestimation bias. Most of these techniques are rooted in heuristics and primarily address the consequences of overestimation rather than its fundamental origins. In this work we present a novel approach to bias correction, similar in spirit to Double Q-Learning. We propose using a policy in the form of a mixture with two components. Each policy component is maximized and assessed by separate networks, which removes any basis for the overestimation bias. Our approach shows promising near-SOTA results on a small set of MuJoCo environments.
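As background for the "similar in spirit to Double Q-Learning" remark, here is a minimal sketch of the classic tabular Double Q-learning update. It is not the mixture-policy method of this paper. The function name `double_q_update`, the `update_a` flag, and the default step size and discount are hypothetical choices made for illustration.

```python
import numpy as np


def double_q_update(qa, qb, s, a, r, s_next, update_a, alpha=0.1, gamma=0.99):
    """One tabular Double Q-learning step (illustrative sketch only).

    qa, qb   : arrays of shape (n_states, n_actions), updated in place
    update_a : if True update qa, else update qb (normally a coin flip)
    """
    if update_a:
        # Select the greedy action with qa, but evaluate it with qb.
        best = int(np.argmax(qa[s_next]))
        target = r + gamma * qb[s_next, best]
        qa[s, a] += alpha * (target - qa[s, a])
    else:
        # Symmetric case: select with qb, evaluate with qa.
        best = int(np.argmax(qb[s_next]))
        target = r + gamma * qa[s_next, best]
        qb[s, a] += alpha * (target - qb[s, a])


# Usage: two states, two actions; in practice update_a is drawn at random.
qa = np.zeros((2, 2))
qb = np.zeros((2, 2))
qa[1] = [0.5, 0.2]
qb[1] = [1.0, 3.0]
double_q_update(qa, qb, s=0, a=0, r=1.0, s_next=1, update_a=True,
                alpha=0.5, gamma=0.9)
```

Action selection and action evaluation use different tables, so an action that one table overrates is scored by the other table, which is the decoupling idea that the abstract's separate networks echo.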