SafeDICE: Offline Safe Imitation Learning with Non-Preferred Demonstrations
Neural Information Processing Systems
We consider offline safe imitation learning (IL), where the agent aims to learn a safe policy that mimics preferred behavior while avoiding non-preferred behavior, learning from non-preferred demonstrations and unlabeled demonstrations. This problem setting corresponds to various real-world scenarios in which satisfying safety constraints is more important than maximizing the expected return. However, it is very challenging to learn a policy that avoids constraint-violating (i.e., non-preferred) behavior. In this paper, we present a hyperparameter-free offline safe IL algorithm, SafeDICE, that learns a safe policy by leveraging the non-preferred demonstrations in the space of stationary distributions. Our algorithm directly estimates the stationary distribution corrections of the policy that imitates the demonstrations while excluding the non-preferred behavior.
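The abstract does not spell out the estimator, so the following is only a rough tabular intuition for "stationary distribution corrections that exclude non-preferred behavior", not SafeDICE itself. It assumes (an assumption introduced here, not stated in the abstract) that the unlabeled data is a mixture d_U = alpha * d_N + (1 - alpha) * d_P of non-preferred and preferred behavior with a known weight `alpha`. It recovers d_P by subtracting the non-preferred component, forms corrections w(s, a) = d_P(s, a) / d_U(s, a) on the unlabeled data, and uses them as weights for behavior cloning. All function names are hypothetical. Note that a known `alpha` is itself a hyperparameter, whereas the abstract describes SafeDICE as hyperparameter-free and operating with learned estimators rather than counts.

```python
from collections import Counter, defaultdict


def empirical_distribution(pairs):
    """Empirical distribution d(s, a) over a list of (state, action) pairs."""
    counts = Counter(pairs)
    total = sum(counts.values())
    return {sa: c / total for sa, c in counts.items()}


def preferred_distribution(d_unlabeled, d_nonpref, alpha):
    """Recover d_P under an ASSUMED mixture d_U = alpha * d_N + (1 - alpha) * d_P.

    This mixture model is an illustrative assumption, not taken from the paper.
    Negative entries (e.g. from estimation noise) are clipped to zero and the
    result is renormalized so it stays a valid distribution.
    """
    assert 0.0 <= alpha < 1.0, "alpha must lie in [0, 1)"
    # Merge keys in a deterministic order: unlabeled support first, then extra non-preferred keys.
    keys = {**d_unlabeled, **d_nonpref}.keys()
    raw = {
        sa: max(d_unlabeled.get(sa, 0.0) - alpha * d_nonpref.get(sa, 0.0), 0.0) / (1.0 - alpha)
        for sa in keys
    }
    z = sum(raw.values())
    if z == 0.0:
        raise ValueError("no preferred mass left after removing non-preferred behavior")
    return {sa: p / z for sa, p in raw.items()}


def correction_ratios(d_target, d_data):
    """Stationary distribution correction w(s, a) = d_target(s, a) / d_data(s, a) on the data support."""
    return {sa: d_target.get(sa, 0.0) / p for sa, p in d_data.items()}


def weighted_bc_policy(pairs, w):
    """Weighted behavior cloning: pi(a | s) proportional to the summed weights w(s, a).

    States whose total weight is zero get no policy entry.
    """
    mass = defaultdict(lambda: defaultdict(float))
    for s, a in pairs:
        mass[s][a] += w[(s, a)]
    policy = {}
    for s, actions in mass.items():
        total = sum(actions.values())
        if total > 0:
            policy[s] = {a: m / total for a, m in actions.items()}
    return policy


# Toy data (hypothetical): unlabeled transitions mix safe and unsafe actions.
unlabeled = [("s0", "safe")] * 3 + [("s0", "unsafe")] + [("s1", "safe")] * 2 + [("s1", "unsafe")] * 2
nonpref = [("s0", "unsafe"), ("s1", "unsafe")]

d_U = empirical_distribution(unlabeled)
d_N = empirical_distribution(nonpref)
d_P = preferred_distribution(d_U, d_N, alpha=0.25)
w = correction_ratios(d_P, d_U)
pi = weighted_bc_policy(unlabeled, w)

print(pi["s0"])  # {'safe': 1.0, 'unsafe': 0.0}
print(pi["s1"])  # safe ~0.667, unsafe ~0.333
```

In this toy case, subtracting the non-preferred component removes all unsafe mass in `s0` and halves its relative weight in `s1`, so the reweighted policy shifts toward the safe action without discarding the unlabeled data.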
Jan-20-2025, 01:53:19 GMT