Reviews: Compatible Reward Inverse Reinforcement Learning

Neural Information Processing Systems 

This paper proposes an approach for behavioral cloning that constructs a function space for a particular parametric policy model based on the null space of the policy gradient. I think a running example (e.g., for a discrete MDP) would help explain the approach. I found myself flipping back and forth between the Algorithm (page 6) and the description of each step. I have some lingering confusion about using Eq. I assume a similar estimator is employed for d(s,a).
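
To make the kind of running example I have in mind concrete, here is a minimal sketch for a small discrete MDP. It is my own illustration, not the paper's algorithm: I assume a tabular softmax policy, a known transition model, and an exactly computed occupancy d(s,a) instead of a sample-based estimator. All names (`softmax_policy`, `score_matrix`, `occupancy`, `null_space`) and the random MDP are hypothetical. By the policy gradient theorem, the gradient is A q with A = (∇ log π)ᵀ diag(d), so the Q-functions under which the given policy has zero gradient are exactly the null space of A.

```python
import numpy as np

def softmax_policy(theta):
    """Tabular softmax policy; theta has shape (n_states, n_actions)."""
    z = theta - theta.max(axis=1, keepdims=True)  # shift for numerical stability
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)

def score_matrix(pi):
    """Rows index (s, a) pairs, columns index parameters theta[s', a'].
    For tabular softmax: d log pi(a|s) / d theta[s', a'] = 1[s=s'] * (1[a=a'] - pi(a'|s))."""
    n_s, n_a = pi.shape
    G = np.zeros((n_s * n_a, n_s * n_a))
    for s in range(n_s):
        for a in range(n_a):
            for a2 in range(n_a):
                G[s * n_a + a, s * n_a + a2] = float(a == a2) - pi[s, a2]
    return G

def occupancy(P, pi, mu0, gamma):
    """Discounted state-action occupancy d(s,a), computed exactly (assumes a known model P)."""
    n_s, _ = pi.shape
    P_pi = np.einsum("sa,sat->st", pi, P)  # state-to-state transitions under pi
    d_s = (1 - gamma) * np.linalg.solve(np.eye(n_s) - gamma * P_pi.T, mu0)
    return (d_s[:, None] * pi).ravel()

def null_space(A, rtol=1e-10):
    """Orthonormal basis of the null space of A, obtained from the SVD."""
    _, sv, vt = np.linalg.svd(A)
    rank = int((sv > rtol * sv.max()).sum())
    return vt[rank:].T

# Hypothetical random MDP with 3 states and 2 actions.
rng = np.random.default_rng(0)
n_s, n_a, gamma = 3, 2, 0.9
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))  # P[s, a, s'], shape (n_s, n_a, n_s)
mu0 = np.full(n_s, 1.0 / n_s)                     # uniform initial state distribution
pi = softmax_policy(rng.normal(size=(n_s, n_a)))

G = score_matrix(pi)
d = occupancy(P, pi, mu0, gamma)
A = G.T @ np.diag(d)        # policy gradient for a Q-vector q is A @ q
basis = null_space(A)       # columns span the Q-functions with zero policy gradient

print("null-space dimension:", basis.shape[1])
print("max |A @ basis|:", np.abs(A @ basis).max())
```

In this fully parameterized tabular case the null space consists only of Q-functions that are constant across actions in each state (zero advantage), so the example mainly serves to fix notation. I would expect a restricted policy class to give a richer space, and a version of this example in the paper that also shows how d(s,a) is estimated from samples would resolve my question above.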