Review for NeurIPS paper: Error Bounds of Imitating Policies and Environments


Weaknesses: The comparison against BC seems out of date, as there are many more recent approaches that address the compounding-error problem and have been shown to achieve better results in imitation learning, such as MaxEnt IRL and subsequent IRL methods. Purely reactive policy matching can drive the agent into bad states, which is why value iteration and Q-learning are used to evaluate state-action pairs in terms of future cumulative reward. For example, MaxEnt IRL uses value iteration to capture the feature preferences of the expert and thereby model their behavior in unseen settings, with the explicit goal of matching the expert's expected feature counts under a maximum-entropy distribution over trajectories rather than matching the expert policy directly. The learned policies may differ from the expert demonstrations as long as the state-action features being optimized by the learner are similar to those of the demonstrator. In other words, there are multiple near-optimal policies that are acceptable and still indicative of the demonstrated behavior; learning over this set reduces the effect of compounding errors and allows for principled deviations from the expert state distribution.
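To make the feature-matching argument concrete, here is a toy sketch of MaxEnt-style IRL in the spirit of Ziebart et al.: soft value iteration under the current reward, a forward pass for state-visitation frequencies, and a gradient step toward the expert's feature counts. This is not the reviewed paper's method; the chain MDP, one-hot features, expert visitation counts, and hyperparameters below are all illustrative assumptions.

```python
import numpy as np

def maxent_irl(P, features, expert_svf, gamma=0.95, lr=0.1, iters=200):
    """Minimal MaxEnt IRL sketch (illustrative, not the paper's algorithm).

    P: transition tensor, shape (S, A, S)
    features: state feature matrix, shape (S, F)
    expert_svf: expert state-visitation frequencies, shape (S,)
    Returns learned reward weights, shape (F,).
    """
    S, A, _ = P.shape
    w = np.zeros(features.shape[1])
    for _ in range(iters):
        r = features @ w                           # per-state reward
        # Soft value iteration: log-sum-exp backup over actions.
        V = np.zeros(S)
        for _ in range(50):
            Q = r[:, None] + gamma * (P @ V)       # shape (S, A)
            Vmax = Q.max(axis=1)
            V = Vmax + np.log(np.exp(Q - Vmax[:, None]).sum(axis=1))
        pi = np.exp(Q - V[:, None])                # stochastic softmax policy
        # Forward pass: average state-visitation frequencies under pi.
        d = np.ones(S) / S
        svf = np.zeros(S)
        for _ in range(50):
            svf += d
            d = np.einsum("s,sa,sat->t", d, pi, P)
        svf /= 50
        # Gradient of the MaxEnt objective: expert minus learner feature counts.
        w += lr * (features.T @ (expert_svf - svf))
    return w

# Toy 3-state chain MDP (hypothetical): actions 0=left, 1=right, reflecting ends.
S, A = 3, 2
P = np.zeros((S, A, S))
for s in range(S):
    P[s, 0, max(s - 1, 0)] = 1.0
    P[s, 1, min(s + 1, S - 1)] = 1.0
features = np.eye(S)                               # one-hot state features
expert_svf = np.array([0.05, 0.15, 0.80])          # expert mostly occupies state 2
w = maxent_irl(P, features, expert_svf)
```

The learner never imitates actions directly: it only pushes its own visitation frequencies toward the expert's feature counts, so any policy with matching counts is acceptable, which is exactly the robustness-to-deviation point made above. With the expert concentrated on state 2, the learned weights rank that state's feature highest.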