Reviews: Learning Safe Policies with Expert Guidance

Oct-7-2024, 22:52:53 GMT–Neural Information Processing Systems

Learning from demonstrations typically faces the ill-posed problem of inferring the expert's reward function. To facilitate safe learning from demonstrations, the paper formulates a maximin learning problem over a convex reward polytope, so that even under the worst-case reward consistent with the demonstrations, the learned policy is not much worse than optimal. The assumption is that the reward is linear in known features. The authors propose two methods: (i) the ellipsoid method and (ii) follow-the-perturbed-leader, both using separation oracles and a given MDP solver. The experiments are conducted in a grid-world setting and on a modified version of the cart-pole problem.
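The maximin idea summarized above can be illustrated with a toy sketch. It is not the paper's algorithm: it implements neither the ellipsoid method nor follow-the-perturbed-leader, and it only searches a small finite set of candidate policies. It assumes the reward is linear in the features, so the value of a policy is the dot product of a weight vector with the policy's feature expectations, and that the reward polytope is given by its vertices. Because that value is linear in the weights, the worst case over the polytope is attained at a vertex. All names and numbers below are hypothetical.

```python
# Toy maximin policy selection over a reward polytope (illustrative only).
# Assumption: reward is linear in features, so value(w, mu) = w . mu.
# Assumption: the polytope of consistent rewards is given by its vertices.

def worst_case_value(mu, reward_vertices):
    """Smallest value of the policy over all vertices of the reward polytope."""
    return min(
        sum(w_i * m_i for w_i, m_i in zip(w, mu))
        for w in reward_vertices
    )

def maximin_policy(feature_expectations, reward_vertices):
    """Return the candidate policy with the best worst-case value."""
    best_name, best_value = None, float("-inf")
    for name, mu in feature_expectations.items():
        value = worst_case_value(mu, reward_vertices)
        if value > best_value:
            best_name, best_value = name, value
    return best_name, best_value

# Hypothetical polytope: the true weights lie between "only feature 0
# matters" and "only feature 1 matters".
reward_vertices = [(1.0, 0.0), (0.0, 1.0)]

# Hypothetical feature expectations of three candidate policies.
feature_expectations = {
    "risky": (1.0, 0.0),
    "balanced": (0.6, 0.6),
    "cautious": (0.0, 0.9),
}

best_name, best_value = maximin_policy(feature_expectations, reward_vertices)
print(best_name, best_value)
```

In this toy instance the policy that hedges across both features wins, since each of the other two scores zero under one of the vertex rewards. The methods described in the review address the harder setting in which policies come from an MDP solver rather than from a fixed finite list.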
