Rethinking Inverse Reinforcement Learning: from Data Alignment to Task Alignment

Zhou, Weichao, Li, Wenchao

arXiv.org Artificial Intelligence

Many imitation learning (IL) algorithms use inverse reinforcement learning (IRL) to infer a reward function that aligns with the demonstrations. However, the inferred reward function often fails to capture the underlying task objective. In this paper, we propose a novel framework for IRL-based IL that prioritizes task alignment over conventional data alignment. Our framework is a semi-supervised approach that leverages expert demonstrations as weak supervision signals to derive a set of candidate reward functions that align with the task rather than only with the data. It adopts an adversarial mechanism to train a policy with this set of reward functions to gain a collective validation of the policy's ability to accomplish the task. We provide theoretical insights into this framework's ability to mitigate task-reward misalignment and present a practical implementation. Our experimental results show that our framework outperforms conventional IL baselines in complex and transfer learning scenarios. The complete code is available at https://github.com/zwc662/PAGAR.
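
To make the "collective validation" idea concrete, the following is a minimal, self-contained sketch of training a policy against a set of candidate reward functions by improving its worst-case return over that set. The toy dynamics, linear rewards, and random hill-climbing update are simplifying assumptions for illustration only; the authors' actual algorithm uses an adversarial IRL loop and is available at the linked repository.

# Hypothetical sketch: improve a policy's worst-case return over candidate rewards.
import numpy as np

rng = np.random.default_rng(0)

def candidate_rewards(num_candidates=4, obs_dim=3):
    # Stand-in for IRL: sample linear reward weights "consistent with demonstrations".
    return [rng.normal(size=obs_dim) for _ in range(num_candidates)]

def rollout_return(policy_w, reward_w, horizon=20, obs_dim=3):
    # Toy rollout: observations drift in the direction the policy weights point.
    obs, total = np.zeros(obs_dim), 0.0
    for _ in range(horizon):
        obs = obs + 0.1 * np.tanh(policy_w)      # toy deterministic dynamics
        total += float(reward_w @ obs)           # return under this candidate reward
    return total

def train_worst_case(rewards, iters=200, step=0.05, obs_dim=3):
    # Random hill-climbing on the worst-case return across all candidate rewards.
    policy = rng.normal(size=obs_dim)
    for _ in range(iters):
        worst = min(rollout_return(policy, r) for r in rewards)
        trial = policy + step * rng.normal(size=obs_dim)
        if min(rollout_return(trial, r) for r in rewards) > worst:
            policy = trial                       # keep updates that help the worst case
    return policy

if __name__ == "__main__":
    R = candidate_rewards()
    pi = train_worst_case(R)
    print("worst-case return:", min(rollout_return(pi, r) for r in R))

A policy that scores well under every candidate reward, rather than under a single inferred one, is less likely to exploit quirks of a misaligned reward.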


PAGAR: Taming Reward Misalignment in Inverse Reinforcement Learning-Based Imitation Learning with Protagonist Antagonist Guided Adversarial Reward

Zhou, Weichao, Li, Wenchao

arXiv.org Artificial Intelligence

Many imitation learning (IL) algorithms employ inverse reinforcement learning (IRL) to infer the intrinsic reward function that an expert is implicitly optimizing for based on their demonstrated behaviors. However, in practice, IRL-based IL can fail to accomplish the underlying task due to a misalignment between the inferred reward and the objective of the task. In this paper, we address the susceptibility of IL to such misalignment by introducing a semi-supervised reward design paradigm called Protagonist Antagonist Guided Adversarial Reward (PAGAR). PAGAR-based IL trains a policy to perform well under mixed reward functions instead of a single reward function as in IRL-based IL. We identify the theoretical conditions under which PAGAR-based IL can avoid the task failures caused by reward misalignment. We also present a practical on-and-off policy approach to implementing PAGAR-based IL. Experimental results show that our algorithm outperforms standard IL baselines in complex tasks and challenging transfer settings.
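
One way to read the protagonist-antagonist mechanism is as a minimax regret problem over a set of candidate reward functions derived from the demonstrations. The formulation below is a sketch in notation of our own choosing (the symbols U_r, R_E, \pi_P, \pi_A are assumptions, not necessarily the paper's):

\pi_P^{\star} \in \arg\min_{\pi_P \in \Pi} \; \max_{r \in R_E} \Big( \max_{\pi_A \in \Pi} U_r(\pi_A) - U_r(\pi_P) \Big)

where U_r(\pi) denotes the expected return of policy \pi under reward r, \pi_P is the protagonist policy being trained, \pi_A is the antagonist policy that exposes how much better one could do under each candidate reward, and R_E is a set of reward functions under which the expert demonstrations perform well. Minimizing this worst-case regret keeps the protagonist from failing the task under any reward in the candidate set.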