Learning a Diffusion Model Policy from Rewards via Q-Score Matching

Psenka, Michael, Escontrela, Alejandro, Abbeel, Pieter, Ma, Yi

Dec-18-2023–arXiv.org Artificial Intelligence

Diffusion models have become a popular choice for representing actor policies in behavior cloning and offline reinforcement learning, owing to their natural ability to optimize an expressive class of distributions over a continuous space. However, previous works fail to exploit the score-based structure of diffusion models and instead train the actor with a simple behavior cloning term, which limits their effectiveness in the actor-critic setting. In this paper, we focus on off-policy reinforcement learning and propose a new method for learning a diffusion model policy that exploits the link between the score of the policy and the action gradient of the Q-function. We call this method Q-score matching and provide theoretical justification for the approach. We conduct experiments in simulated environments to demonstrate the effectiveness of the proposed method and compare it against popular baselines.
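
The link between the policy score and the action gradient of the Q-function can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes a one-dimensional action, a hypothetical quadratic critic Q(s, a) = -(a - s)^2 with a known gradient, a linear score model, a squared-error matching loss with an assumed scale factor ALPHA, and a Langevin-style sampler. All function and parameter names are illustrative.

```python
import random

ALPHA = 1.0  # assumed scale factor between the policy score and grad_a Q


def q_action_grad(state, action):
    """Action gradient of a hypothetical toy critic Q(s, a) = -(a - s)**2."""
    return -2.0 * (action - state)


def score(params, state, action):
    """Linear score model psi(s, a) = w_s * s + w_a * a + b (illustrative)."""
    w_s, w_a, b = params
    return w_s * state + w_a * action + b


def fit_score(steps=5000, lr=0.05, seed=0):
    """Fit the score model to ALPHA * grad_a Q by SGD on a squared error."""
    rng = random.Random(seed)
    params = [0.0, 0.0, 0.0]
    for _ in range(steps):
        s = rng.uniform(-1.0, 1.0)
        a = rng.uniform(-1.0, 1.0)
        err = score(params, s, a) - ALPHA * q_action_grad(s, a)
        # Gradient of 0.5 * err**2 with respect to (w_s, w_a, b) is err * (s, a, 1).
        params[0] -= lr * err * s
        params[1] -= lr * err * a
        params[2] -= lr * err
    return params


def sample_action(params, state, n_steps=200, step_size=0.05,
                  noise_scale=1.0, seed=0):
    """Draw an action by following the learned score with Langevin-style noise."""
    rng = random.Random(seed)
    action = rng.gauss(0.0, 1.0)
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        action += step_size * score(params, state, action)
        action += noise_scale * (2.0 * step_size) ** 0.5 * noise
    return action


if __name__ == "__main__":
    learned = fit_score()
    print("learned score parameters:", learned)
    print("sampled action for state 0.5:", sample_action(learned, 0.5))
```

In this toy setting the fitted score points toward the action that maximizes the critic, so sampled actions concentrate around the state value. Setting `noise_scale=0.0` turns the sampler into plain gradient ascent on the critic, which is a convenient sanity check.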

diffusion model, machine learning, reinforcement learning

Country:
- North America > United States > California > Alameda County > Berkeley (0.14)

Genre:
- Research Report (0.50)

Technology:
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)