Aligning Diffusion Behaviors with Q-functions for Efficient Continuous Control
Huayu Chen
Drawing upon recent advances in language model alignment, we formulate offline reinforcement learning as a two-stage optimization problem: first pretraining expressive generative policies on reward-free behavior datasets, then fine-tuning these policies to align with task-specific annotations such as Q-values. This strategy allows us to leverage abundant and diverse behavior data to enhance generalization and enables rapid adaptation to downstream tasks using minimal annotations. In particular, we introduce Efficient Diffusion Alignment (EDA) for solving continuous control problems.
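The two-stage recipe in the abstract can be sketched in code. The following is a minimal illustrative sketch, not the authors' EDA implementation: the network sizes, the toy cosine noise schedule, and the Q-weighted denoising loss in `align_step` (a weighted-regression stand-in for the paper's alignment objective) are all assumptions, as is the placeholder `q_fn` critic.

```python
# Minimal sketch of the two-stage pipeline: (1) reward-free pretraining of a
# diffusion behavior policy, (2) fine-tuning it against a Q-function.
# Illustrative only; hyperparameters and losses are assumptions.
import torch
import torch.nn as nn

class DiffusionPolicy(nn.Module):
    """Tiny conditional denoiser: predicts the noise added to an action."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state, noisy_action, t):
        return self.net(torch.cat([state, noisy_action, t], dim=-1))

def pretrain_step(policy, opt, state, action, T=1000):
    """Stage 1: reward-free behavior pretraining via denoising score matching."""
    t = torch.randint(1, T, (state.shape[0], 1)).float() / T
    alpha_bar = torch.cos(t * torch.pi / 2) ** 2  # toy cosine noise schedule
    noise = torch.randn_like(action)
    noisy = alpha_bar.sqrt() * action + (1 - alpha_bar).sqrt() * noise
    loss = ((policy(state, noisy, t) - noise) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

def align_step(policy, opt, q_fn, state, action, T=1000, beta=1.0):
    """Stage 2 (illustrative): re-weight the denoising loss by Q-values so the
    policy concentrates on high-value actions while staying near the data."""
    with torch.no_grad():
        w = torch.softmax(q_fn(state, action).squeeze(-1) / beta, dim=0)
    t = torch.randint(1, T, (state.shape[0], 1)).float() / T
    alpha_bar = torch.cos(t * torch.pi / 2) ** 2
    noise = torch.randn_like(action)
    noisy = alpha_bar.sqrt() * action + (1 - alpha_bar).sqrt() * noise
    per_sample = ((policy(state, noisy, t) - noise) ** 2).mean(dim=-1)
    loss = (w * per_sample).sum()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Usage (toy shapes): pretrain on reward-free data, then align with a critic.
policy = DiffusionPolicy(state_dim=17, action_dim=6)
opt = torch.optim.Adam(policy.parameters(), lr=3e-4)
s, a = torch.randn(64, 17), torch.randn(64, 6)
pretrain_step(policy, opt, s, a)
q_fn = lambda s, a: torch.randn(s.shape[0], 1)  # placeholder Q-function
align_step(policy, opt, q_fn, s, a)
```

The softmax temperature `beta` plays the role of the annotation budget's trade-off knob here: small `beta` sharpens the weighting toward the highest-Q actions, while large `beta` keeps the fine-tuned policy close to the pretrained behavior distribution.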
Neural Information Processing Systems