Flow Q-Learning
Seohong Park, Qiyang Li, Sergey Levine
–arXiv.org Artificial Intelligence
Offline reinforcement learning (RL) enables training an effective decision-making policy from a previously collected dataset without costly environment interactions (Lange et al., 2012; Levine et al., 2020). The essence of offline RL is constrained optimization: the agent must maximize returns while staying within the dataset's state-action distribution (Levine et al., 2020). As datasets have grown larger and more diverse (Collaboration et al., 2024), their behavioral distributions have become more complex and multimodal, and this often necessitates an expressive policy class (Mandlekar et al., 2021) capable of capturing these complex distributions and implementing a more precise behavioral constraint.

However, leveraging flow or diffusion models to parameterize policies for offline RL is not a trivial problem. Unlike with simpler policy classes, such as Gaussian policies, there is no straightforward way to train flow or diffusion policies to maximize a learned value function, due to the iterative nature of these generative models. This is an example of a policy extraction problem, which is known to be a key challenge in offline RL in general (Park et al., 2024a). Previous works have devised diverse ways to extract an iterative generative policy from a learned value function, based on weighted regression, reparameterized policy gradients, rejection sampling, and other techniques. While they have …
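A minimal sketch of one of the extraction strategies listed above (rejection sampling), not necessarily the method proposed in the paper: a behavior-cloned flow policy generates candidate actions by integrating a learned velocity field from noise with Euler steps, and a learned critic picks the highest-value candidate. The names `velocity_field` and `q_fn` are hypothetical placeholders for such learned models.

```python
# Sketch only: assumes a learned velocity field v(s, a_t, t) for a flow policy
# and a learned critic Q(s, a); both are placeholders, not the paper's API.
import torch

def sample_flow_action(velocity_field, state, action_dim, num_steps=10):
    """Generate one action by integrating the velocity field from Gaussian noise."""
    a = torch.randn(state.shape[0], action_dim)       # a_0 ~ N(0, I)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = torch.full((state.shape[0], 1), k * dt)
        a = a + dt * velocity_field(state, a, t)      # Euler step: a <- a + dt * v(s, a, t)
    return a

def rejection_sample_policy(velocity_field, q_fn, state, action_dim, num_candidates=32):
    """Draw candidate actions from the flow policy and keep the one with the highest Q-value."""
    candidates = torch.stack(
        [sample_flow_action(velocity_field, state, action_dim)
         for _ in range(num_candidates)], dim=1)      # (batch, N, action_dim)
    s_rep = state.unsqueeze(1).expand(-1, num_candidates, -1)
    q_values = q_fn(s_rep, candidates).squeeze(-1)    # (batch, N)
    best = q_values.argmax(dim=1)                     # best candidate per state
    return candidates[torch.arange(state.shape[0]), best]
```

This illustrates why extraction is nontrivial: the value function only scores the final action, while the policy's samples come from an iterative integration procedure, so maximizing Q requires either sampling-based selection as above or backpropagating through (or around) the generation steps.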
Feb-4-2025