Adversarial Soft Advantage Fitting: Imitation Learning without Policy Optimization

Neural Information Processing Systems

Adversarial Imitation Learning alternates between learning a discriminator -- which tells expert demonstrations apart from generated ones -- and a generator's policy that produces trajectories able to fool this discriminator. This alternating optimization is known to be delicate in practice since it compounds unstable adversarial training with brittle and sample-inefficient reinforcement learning. We propose to remove the burden of the policy optimization steps by leveraging a novel discriminator formulation. Specifically, our discriminator is explicitly conditioned on two policies: the one from the previous generator's iteration and a learnable policy. When optimized, this discriminator directly learns the optimal generator's policy. Consequently, our discriminator's update solves the generator's optimization problem for free: learning a policy that imitates the expert does not require an additional optimization loop. This formulation effectively halves the implementation and computational burden of Adversarial Imitation Learning algorithms by removing the Reinforcement Learning phase altogether. We show on a variety of tasks that our simpler approach is competitive with prevalent Imitation Learning methods.
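The idea of a discriminator conditioned on two policies can be made concrete with a small sketch. It assumes the discriminator is the ratio D = p_learnable / (p_learnable + p_generator) over trajectory probabilities, with the environment dynamics cancelling so that only per-step action probabilities remain. The abstract does not spell out this form, so the ratio, the tabular policies, the toy data, and every function name below are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def traj_log_prob(policy, traj):
    """Sum of log pi(a|s) over the (state, action) pairs of one trajectory.

    `policy` is a hypothetical tabular policy of shape (n_states, n_actions).
    """
    states, actions = zip(*traj)
    return float(np.sum(np.log(policy[list(states), list(actions)])))

def structured_discriminator(pi_tilde, pi_gen, traj):
    """Assumed form: D = p_tilde / (p_tilde + p_gen) = sigmoid(log-ratio)."""
    diff = traj_log_prob(pi_tilde, traj) - traj_log_prob(pi_gen, traj)
    return 1.0 / (1.0 + np.exp(-diff))

def discriminator_loss(pi_tilde, pi_gen, expert_trajs, gen_trajs):
    """Binary cross-entropy: expert trajectories labeled 1, generated ones 0."""
    expert_term = np.mean(
        [np.log(structured_discriminator(pi_tilde, pi_gen, t)) for t in expert_trajs]
    )
    gen_term = np.mean(
        [np.log(1.0 - structured_discriminator(pi_tilde, pi_gen, t)) for t in gen_trajs]
    )
    return -(expert_term + gen_term)

# Toy one-state, two-action problem; all numbers are made up for illustration.
pi_expert = np.array([[0.9, 0.1]])
pi_gen = np.array([[0.5, 0.5]])
# Single-step trajectories whose frequencies match the two policies exactly.
expert_trajs = [[(0, 0)]] * 9 + [[(0, 1)]]
gen_trajs = [[(0, 0)]] * 5 + [[(0, 1)]] * 5

# Loss when the learnable policy copies the generator versus the expert.
loss_at_gen = discriminator_loss(pi_gen, pi_gen, expert_trajs, gen_trajs)
loss_at_expert = discriminator_loss(pi_expert, pi_gen, expert_trajs, gen_trajs)
print(loss_at_gen, loss_at_expert)
```

In this toy setting the loss is lower when the learnable policy equals the expert policy than when it equals the generator, which is the sense in which fitting the discriminator can stand in for a separate policy optimization step. A practical version would replace the tabular arrays with a parameterized policy trained by gradient descent on the same loss.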



Review for NeurIPS paper: Adversarial Soft Advantage Fitting: Imitation Learning without Policy Optimization

Neural Information Processing Systems

Correctness: The claims and experiments seem mostly correct. While the analysis shows that the solution to the min-max problem (Eq. [...] I would increase my score if the paper were updated to include a proof that the proposed algorithm converges. One comment about the experiments is that they don't actually show that the proposed method mimics the expert, only that running the proposed algorithm with data generated from an expert results in high reward. I would increase my score if an experiment were added to show that the learned policy actually mimics the demonstrator.


Review for NeurIPS paper: Adversarial Soft Advantage Fitting: Imitation Learning without Policy Optimization

Neural Information Processing Systems

Even before the author response, the reviewers agreed that the results and approach were interesting. The response addressed the reviewers' remaining concerns about novelty, baseline strength, and positioning with respect to prior work. This led the reviewers to a consensus that the paper should be accepted.


Adversarial Soft Advantage Fitting: Imitation Learning without Policy Optimization

arXiv.org Artificial Intelligence

Adversarial imitation learning alternates between learning a discriminator -- which tells expert demonstrations apart from generated ones -- and a generator's policy that produces trajectories able to fool this discriminator. This alternating optimization is known to be delicate in practice since it compounds unstable adversarial training with brittle and sample-inefficient reinforcement learning. We propose to remove the burden of the policy optimization steps by leveraging a novel discriminator formulation. Specifically, our discriminator is explicitly conditioned on two policies: the one from the previous generator's iteration and a learnable policy. When optimized, this discriminator directly learns the optimal generator's policy. Consequently, our discriminator's update solves the generator's optimization problem for free: learning a policy that imitates the expert does not require an additional optimization loop. This formulation effectively halves the implementation and computational burden of adversarial imitation learning algorithms by removing the reinforcement learning phase altogether. We show on a variety of tasks that our simpler approach is competitive with prevalent imitation learning methods.