gaussian policy
Is Bang-Bang Control All You Need? Solving Continuous Control with Bernoulli Policies
Real-world robotics tasks commonly manifest as control problems over continuous action spaces. When learning to act in such settings, control policies are typically represented as continuous probability distributions that cover all feasible control inputs - often Gaussians. The underlying assumption is that this enables more refined decisions compared to crude policy choices such as discretized controllers, which limit the search space but induce abrupt changes. While switching controls can be undesirable in practice as they may challenge stability and accelerate system wear, they are theoretically feasible and even arise as optimal strategies in some settings.
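To make the contrast in the abstract above concrete, here is a minimal sketch of sampling one action dimension from a clipped Gaussian policy and from a Bernoulli (bang-bang) policy. The function names, the clipping step, and the logit parameterization are illustrative assumptions, not the paper's implementation.

```python
import math
import random

def sample_gaussian_action(mean, std, low, high, rng):
    """Draw from a Gaussian policy, then clip to the feasible range [low, high]."""
    action = rng.gauss(mean, std)
    return min(max(action, low), high)

def sample_bang_bang_action(logit, low, high, rng):
    """Draw from a Bernoulli policy: `high` with probability sigmoid(logit), else `low`."""
    p_high = 1.0 / (1.0 + math.exp(-logit))
    return high if rng.random() < p_high else low

# Hypothetical single action dimension with bounds [-1, 1].
rng = random.Random(0)
gaussian_actions = [sample_gaussian_action(0.0, 0.5, -1.0, 1.0, rng) for _ in range(1000)]
bang_bang_actions = [sample_bang_bang_action(0.0, -1.0, 1.0, rng) for _ in range(1000)]
```

The Gaussian samples spread over the interior of the action range, whereas the bang-bang samples only ever take the two extreme values.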
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Europe > Russia > Central Federal District > Moscow Oblast > Moscow (0.04)
- Asia > Russia (0.04)
- Asia > Middle East > Republic of Türkiye > Karaman Province > Karaman (0.04)
Diffusion-based Reinforcement Learning via Q-weighted Variational Policy Optimization
Diffusion models have garnered widespread attention in Reinforcement Learning (RL) for their powerful expressiveness and multimodality. It has been verified that utilizing diffusion policies can significantly improve the performance of RL algorithms in continuous control tasks by overcoming the limitations of unimodal policies, such as Gaussian policies. Furthermore, the multimodality of diffusion policies also shows the potential of providing the agent with enhanced exploration capabilities. However, existing works mainly focus on applying diffusion policies in offline RL, while their incorporation into online RL has been less investigated. The diffusion model's training objective, known as the variational lower bound, cannot be applied directly in online RL due to the unavailability of 'good' samples (actions).
Policy Transfer for Continuous-Time Reinforcement Learning: A (Rough) Differential Equation Approach
This paper studies policy transfer, one of the well-known transfer learning techniques adopted in large language models, for two classes of continuous-time reinforcement learning problems. In the first class of continuous-time linear-quadratic systems with Shannon's entropy regularization (a.k.a. LQRs), we fully exploit the Gaussian structure of their optimal policy and the stability of their associated Riccati equations. In the second class where the system has possibly non-linear and bounded dynamics, the key technical component is the stability of diffusion SDEs which is established by invoking the rough path theory. Our work provides the first theoretical proof of policy transfer for continuous-time RL: an optimal policy learned for one RL problem can be used to initialize the search for a near-optimal policy in a closely related RL problem, while maintaining the convergence rate of the original algorithm. To illustrate the benefit of policy transfer for RL, we propose a novel policy learning algorithm for continuous-time LQRs, which achieves global linear convergence and local super-linear convergence. As a byproduct of our analysis, we derive the stability of a concrete class of continuous-time score-based diffusion models via their connection with LQRs.
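As a rough illustration of the Gaussian structure mentioned above, the following sketch solves a scalar continuous-time Riccati equation and returns the parameters of the resulting Gaussian policy. The setup is an assumption chosen for simplicity and may differ from the paper's formulation: dynamics dx = (a x + b u) dt, running cost q x^2 + r u^2 minus `temperature` times the policy entropy. Under that assumption the policy mean is linear in the state and the variance is temperature / (2 r). The function name is hypothetical.

```python
import math

def scalar_lqr_gaussian_policy(a, b, q, r, temperature):
    """Return (p, gain, variance) for the assumed scalar entropy-regularized LQR.

    p solves the algebraic Riccati equation 2*a*p - (b**2 / r) * p**2 + q = 0
    (stabilizing root). The policy is then N(-gain * x, variance).
    """
    p = r * (a + math.sqrt(a * a + b * b * q / r)) / (b * b)
    gain = b * p / r                      # feedback gain of the policy mean
    variance = temperature / (2.0 * r)    # exploration level, independent of x
    return p, gain, variance

p, gain, variance = scalar_lqr_gaussian_policy(a=0.0, b=1.0, q=1.0, r=1.0, temperature=0.5)
riccati_residual = 2 * 0.0 * p - (1.0 ** 2 / 1.0) * p ** 2 + 1.0
```

In this sketch the temperature only affects the variance, so the mean coincides with the classical LQR feedback law.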
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- North America > United States > New York (0.04)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
convergence of several policy gradient methods, whose novelty is summarized in Lines 210-212 and further explained
R1.1 ...these analysis mainly come from the existing work...the novelty is very limited. Our proposed SRVR-NPG has a better complexity than SRVR-PG (Remark 4.13). We believe our theoretical contribution already has archival value. R1.3 Reproducibility: We believe that all of our theoretical claims have been proved. Please refer to [34] for a detailed proof.
- North America > Canada > Ontario > Toronto (0.14)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > United States > California > Los Angeles County > Santa Monica (0.04)
Flow Matching Policy Gradients
McAllister, David, Ge, Songwei, Yi, Brent, Kim, Chung Min, Weber, Ethan, Choi, Hongsuk, Feng, Haiwen, Kanazawa, Angjoo
Flow-based generative models, including diffusion models, excel at modeling continuous distributions in high-dimensional spaces. In this work, we introduce Flow Policy Optimization (FPO), a simple on-policy reinforcement learning algorithm that brings flow matching into the policy gradient framework. FPO casts policy optimization as maximizing an advantage-weighted ratio computed from the conditional flow matching loss, in a manner compatible with the popular PPO-clip framework. It sidesteps the need for exact likelihood computation while preserving the generative capabilities of flow-based models. Unlike prior approaches for diffusion-based reinforcement learning that bind training to a specific sampling method, FPO is agnostic to the choice of diffusion or flow integration at both training and inference time. We show that FPO can train diffusion-style policies from scratch in a variety of continuous control tasks. We find that flow-based models can capture multimodal action distributions and achieve higher performance than Gaussian policies, particularly in under-conditioned settings.
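The advantage-weighted, clipped ratio described above can be sketched as follows. The clipped surrogate is the standard PPO-clip form. The ratio proxy built from the difference of two conditional flow matching losses is an assumption about how such a ratio could be formed without exact likelihoods. The function names and the default `clip_eps` are hypothetical.

```python
import math

def loss_based_ratio(cfm_loss_old, cfm_loss_new):
    """Assumed likelihood-ratio proxy: exp(old loss - new loss).

    A lower flow matching loss under the new policy gives a ratio above 1.
    """
    return math.exp(cfm_loss_old - cfm_loss_new)

def ppo_clip_objective(ratio, advantage, clip_eps=0.2):
    """Standard PPO-clip surrogate for one sample: min(r*A, clip(r)*A)."""
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped * advantage)

# Hypothetical per-sample losses and advantage estimate.
ratio = loss_based_ratio(cfm_loss_old=0.5, cfm_loss_new=0.5)
objective = ppo_clip_objective(ratio, advantage=2.0)
```

In a training loop the per-sample objectives would be averaged and maximized with respect to the new policy's parameters; only the scalar arithmetic is shown here.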
- Asia > Japan > Honshū > Chūbu > Ishikawa Prefecture > Kanazawa (0.04)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- Asia > Middle East > Jordan (0.04)