VPO: Leveraging the Number of Votes in Preference Optimization
Cho, Jae Hyeon, Park, Minkyung, Lee, Byung-Jun
Direct Preference Optimization (DPO) trains a language model using human preference data, bypassing the explicit reward modeling phase of Reinforcement Learning from Human Feedback (RLHF). By iterating over sentence pairs in a preference dataset, DPO enhances generation quality by increasing the likelihood of producing preferred sentences over less favored ones. Preference datasets are typically created by selecting preferred sentences through a voting process involving multiple individuals, as opinions can vary due to the subjective nature of human preferences. While the number of votes offers insight into whether a sentence pair is clearly preferable or controversial, current methods do not fully leverage this information. In this paper, we introduce a technique that leverages user voting data to better align with diverse subjective preferences. We employ the Bayesian Minimum Mean Square Error (Bayesian MMSE) estimator to model the probability that one generation is preferable to another. Using this estimated probability as a target, we develop the Vote-based Preference Optimization (VPO) framework, which incorporates the number of votes on both sides to distinguish between controversial and obvious generation pairs. We show that previous algorithms, such as DPO and Identity Preference Optimization (IPO), can be extended within the proposed framework, yielding VDPO and VIPO. Our experiments demonstrate that these proposed algorithms outperform various existing methods, including their base algorithms.
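To make the vote-weighting idea concrete, here is a minimal sketch of one plausible instantiation, not the paper's code: a Bayesian MMSE (posterior-mean) estimate of the preference probability under an assumed uniform Beta(1, 1) prior, used as a soft target in a DPO-style loss. The names `bayesian_mmse_target`, `vdpo_loss`, `votes_w`/`votes_l`, and the scale `beta` are illustrative.

```python
import torch
import torch.nn.functional as F

def bayesian_mmse_target(votes_w, votes_l, alpha=1.0, beta_prior=1.0):
    """Posterior-mean (MMSE) estimate of P(y_w preferred to y_l) under a
    Beta(alpha, beta_prior) prior on the preference probability."""
    return (votes_w + alpha) / (votes_w + votes_l + alpha + beta_prior)

def vdpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, votes_w, votes_l, beta=0.1):
    """DPO-style loss with the hard 1/0 label replaced by a vote-based soft target."""
    h = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))  # implicit reward margin
    p_hat = bayesian_mmse_target(votes_w, votes_l)
    # soft binary cross-entropy on the preference logit
    return -(p_hat * F.logsigmoid(h) + (1 - p_hat) * F.logsigmoid(-h)).mean()

# A clear 9-vs-1 pair gets a sharper target than a controversial 5-vs-4 pair:
votes_w, votes_l = torch.tensor([9.0, 5.0]), torch.tensor([1.0, 4.0])
print(bayesian_mmse_target(votes_w, votes_l))  # tensor([0.8333, 0.5455])
```

Under this reading, standard DPO is the limit where every pair is treated as a unanimous win (target 1); the soft target damps the gradient on controversial pairs instead of fitting them as hard labels.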
Variational Delayed Policy Optimization
Wu, Qingyuan, Zhan, Simon Sinong, Wang, Yixuan, Wang, Yuhui, Lin, Chung-Wei, Lv, Chen, Zhu, Qi, Huang, Chao
In environments with delayed observation, state augmentation that includes the actions within the delay window is adopted to restore the Markov property and enable reinforcement learning (RL). However, state-of-the-art (SOTA) RL techniques with Temporal-Difference (TD) learning frameworks often suffer from learning inefficiency, because the augmented state space expands significantly with the delay. To improve learning efficiency without sacrificing performance, this work introduces a novel framework called Variational Delayed Policy Optimization (VDPO), which reformulates delayed RL as a variational inference problem. This problem is further modelled as a two-step iterative optimization, where the first step is TD learning in the delay-free environment with a small state space, and the second step is behaviour cloning, which can be addressed much more efficiently than TD learning. We not only provide a theoretical analysis of VDPO in terms of sample complexity and performance, but also empirically demonstrate that VDPO matches the performance of SOTA methods with significantly better sample efficiency (approximately 50% fewer samples) on the MuJoCo benchmark.
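As a rough, runnable illustration of the second step described above (behaviour cloning onto a delay-free reference policy), and not the authors' implementation: the networks, dimensions, random stand-in data, and the L2 cloning loss are all assumptions, and the first step (TD learning of the reference policy in the delay-free MDP) is taken as already done.

```python
import torch
import torch.nn as nn

state_dim, action_dim, delay = 11, 3, 5
aug_dim = state_dim + delay * action_dim       # x_t = (s_{t-delay}, a_{t-delay}, ..., a_{t-1})

pi_ref = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(),
                       nn.Linear(64, action_dim))        # frozen delay-free reference policy
pi_delayed = nn.Sequential(nn.Linear(aug_dim, 64), nn.Tanh(),
                           nn.Linear(64, action_dim))    # policy over augmented states
opt = torch.optim.Adam(pi_delayed.parameters(), lr=3e-4)

for _ in range(200):                           # behaviour-cloning inner loop
    # In practice (x_aug, s_cur) pairs come from the same rollout, so the
    # augmented state corresponds to the true current state; random tensors
    # stand in here only to keep the sketch self-contained.
    s_cur = torch.randn(256, state_dim)        # true current states s_t
    x_aug = torch.randn(256, aug_dim)          # matching augmented states x_t
    with torch.no_grad():
        target = pi_ref(s_cur)                 # reference action pi_ref(s_t)
    loss = ((pi_delayed(x_aug) - target) ** 2).mean()    # L2 cloning loss
    opt.zero_grad(); loss.backward(); opt.step()
```

The efficiency argument is that only this supervised-regression step ever touches the large augmented state space, while the harder TD learning stays in the small delay-free space.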