Goto



Is Bang-Bang Control All You Need? Solving Continuous Control with Bernoulli Policies

Neural Information Processing Systems

Real-world robotics tasks commonly manifest as control problems over continuous action spaces. When learning to act in such settings, control policies are typically represented as continuous probability distributions that cover all feasible control inputs - often Gaussians. The underlying assumption is that this enables more refined decisions compared to crude policy choices such as discretized controllers, which limit the search space but induce abrupt changes. While switching controls can be undesirable in practice as they may challenge stability and accelerate system wear-down, they are theoretically feasible and even arise as optimal strategies in some settings.
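The contrast the abstract draws can be made concrete with a minimal sketch: a per-dimension Bernoulli "bang-bang" policy only ever emits the minimum or maximum control input, whereas a Gaussian policy samples anywhere in the range. The function names and the [-1, 1] action bounds below are illustrative assumptions, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_policy_action(mean, std):
    # Conventional continuous policy: samples anywhere around the mean.
    return rng.normal(mean, std)

def bang_bang_action(p, low=-1.0, high=1.0):
    # Bernoulli policy: each action dimension is either the minimum or the
    # maximum control input, chosen with per-dimension probability p.
    bits = rng.random(np.shape(p)) < p
    return np.where(bits, high, low)

p = np.array([0.9, 0.1, 0.5])   # per-dimension "push high" probabilities
a = bang_bang_action(p)         # every entry is exactly -1.0 or +1.0
```

The bang-bang policy has a far smaller effective search space (2 choices per dimension), which is the trade-off against the abrupt switching the abstract discusses.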



Diffusion-based Reinforcement Learning via Q-weighted Variational Policy Optimization

Neural Information Processing Systems

Diffusion models have garnered widespread attention in Reinforcement Learning (RL) for their powerful expressiveness and multimodality. It has been verified that utilizing diffusion policies can significantly improve the performance of RL algorithms in continuous control tasks by overcoming the limitations of unimodal policies, such as Gaussian policies. Furthermore, the multimodality of diffusion policies also shows the potential of providing the agent with enhanced exploration capabilities. However, existing works mainly focus on applying diffusion policies in offline RL, while their incorporation into online RL has been less investigated. The diffusion model's training objective, known as the variational lower bound, cannot be applied directly in online RL due to the unavailability of 'good' samples (actions).
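One common way to read "Q-weighted" here is reweighting the per-sample denoising (variational-bound) loss by the critic's valuation of each action, so high-Q actions stand in for the unavailable 'good' samples. The sketch below is a hypothetical illustration of that idea, not the paper's exact objective; `beta` and the softmax-style normalization are assumptions.

```python
import numpy as np

def q_weighted_denoising_loss(noise_pred, noise, q_values, beta=1.0):
    # Per-sample denoising error (the usual variational-bound surrogate).
    per_sample = np.mean((noise_pred - noise) ** 2, axis=-1)
    # Softmax-style weights from Q-values: actions the critic rates highly
    # contribute more to the diffusion policy's training signal.
    weights = np.exp(beta * (q_values - q_values.max()))
    weights = weights / weights.sum()
    return float(np.sum(weights * per_sample))
```

With equal Q-values this reduces to the ordinary mean denoising loss; as `beta` grows, training concentrates on the critic's favorite actions.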



Policy Transfer for Continuous-Time Reinforcement Learning: A (Rough) Differential Equation Approach

Guo, Xin, Lyu, Zijiu

arXiv.org Artificial Intelligence

This paper studies policy transfer, one of the well-known transfer learning techniques adopted in large language models, for two classes of continuous-time reinforcement learning problems. In the first class of continuous-time linear-quadratic systems with Shannon's entropy regularization (a.k.a. LQRs), we fully exploit the Gaussian structure of their optimal policy and the stability of their associated Riccati equations. In the second class where the system has possibly non-linear and bounded dynamics, the key technical component is the stability of diffusion SDEs which is established by invoking the rough path theory. Our work provides the first theoretical proof of policy transfer for continuous-time RL: an optimal policy learned for one RL problem can be used to initialize the search for a near-optimal policy in a closely related RL problem, while maintaining the convergence rate of the original algorithm. To illustrate the benefit of policy transfer for RL, we propose a novel policy learning algorithm for continuous-time LQRs, which achieves global linear convergence and local super-linear convergence. As a byproduct of our analysis, we derive the stability of a concrete class of continuous-time score-based diffusion models via their connection with LQRs.
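The policy-transfer idea for LQRs can be illustrated in miniature with a scalar discrete-time Riccati iteration: the solution of a source problem warm-starts the iteration for a nearby target problem and converges to the same fixed point the cold start would reach. This toy sketch is only illustrative; the paper's setting is continuous-time and entropy-regularized.

```python
def riccati_fixed_point(a, b, q, r, p0=1.0, tol=1e-12):
    # Scalar discrete-time Riccati iteration for dynamics x' = a*x + b*u
    # with stage cost q*x^2 + r*u^2. p0 may be warm-started from a nearby
    # problem's solution -- the policy-transfer idea in miniature.
    p = p0
    for _ in range(100_000):
        p_next = q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)
        if abs(p_next - p) < tol:
            return p_next
        p = p_next
    return p

p_src = riccati_fixed_point(0.90, 1.0, 1.0, 1.0)             # source problem
p_tgt = riccati_fixed_point(0.92, 1.0, 1.0, 1.0, p0=p_src)   # warm-started target
```

The stability of the Riccati map is what guarantees the warm start stays in the basin of attraction, mirroring the convergence-rate preservation the paper proves.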



convergence of several policy gradient methods, whose novelty is summarized in Lines 210-212 and further explained

Neural Information Processing Systems

R1.1 ...these analysis mainly come from the existing work...the novelty is very limited. Our proposed SRVR-NPG has a better complexity than SRVR-PG (Remark 4.13). We believe our theoretical contribution already has archival value. R1.3 Reproducibility: We believe that all of our theoretical claims have been proved. Please refer to [34] for a detailed proof.




Flow Matching Policy Gradients

McAllister, David, Ge, Songwei, Yi, Brent, Kim, Chung Min, Weber, Ethan, Choi, Hongsuk, Feng, Haiwen, Kanazawa, Angjoo

arXiv.org Artificial Intelligence

Flow-based generative models, including diffusion models, excel at modeling continuous distributions in high-dimensional spaces. In this work, we introduce Flow Policy Optimization (FPO), a simple on-policy reinforcement learning algorithm that brings flow matching into the policy gradient framework. FPO casts policy optimization as maximizing an advantage-weighted ratio computed from the conditional flow matching loss, in a manner compatible with the popular PPO-clip framework. It sidesteps the need for exact likelihood computation while preserving the generative capabilities of flow-based models. Unlike prior approaches for diffusion-based reinforcement learning that bind training to a specific sampling method, FPO is agnostic to the choice of diffusion or flow integration at both training and inference time. We show that FPO can train diffusion-style policies from scratch in a variety of continuous control tasks. We find that flow-based models can capture multimodal action distributions and achieve higher performance than Gaussian policies, particularly in under-conditioned settings.
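The surrogate the abstract describes can be sketched by swapping PPO-clip's likelihood ratio for one derived from per-sample conditional flow matching losses (lower loss standing in for higher likelihood). The `exp(old - new)` ratio form below is an assumption for illustration, not FPO's exact formula.

```python
import numpy as np

def fpo_clip_objective(cfm_loss_new, cfm_loss_old, advantages, eps=0.2):
    # Advantage-weighted ratio from CFM losses, plugged into the
    # familiar PPO-clip surrogate (no exact likelihoods needed).
    ratio = np.exp(cfm_loss_old - cfm_loss_new)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return float(np.mean(np.minimum(ratio * advantages, clipped * advantages)))
```

When the new and old losses agree, the ratio is 1 and the objective reduces to the mean advantage, matching PPO-clip's behavior at the trust-region center.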