Identifiability in Inverse Reinforcement Learning: Supplementary Material

A Appendix: Proofs of Results
Neural Information Processing Systems
Applying Jensen's inequality for $s^* \in \arg\min_{s \in S} g(s)$, and combining the resulting inequalities with the fact that $\gamma < 1$, we conclude that $g(s) \geq 0$ for all $s \in S$. Again applying Jensen's inequality to (9), now for $s^* \in \arg\max_{s \in S} g(s)$, and using $\gamma < 1$, we conclude that $g(s) \leq 0$ for all $s \in S$. Combining these results, we conclude that $g \equiv 0$, that is, $V = V'$.

Proof of Theorem 2. From Theorem 1, if we can determine the value function of one of our agents, then the reward is uniquely identified. Given that we know both agents' policies $(\pi, \pi')$ and that our agents are optimizing their respective MDPs, for every $a \in A$ and $s \in S$ we know the value of
$$\lambda \log \frac{\pi(a|s)}{\pi'(a|s)} = \gamma \sum_{s' \in S} T(s'|s,a)\,v(s') - v(s) - \gamma' \sum_{s' \in S} T(s'|s,a)\,v'(s') + v'(s). \tag{10}$$
Therefore, the space of solutions to (10) is either empty (in which case no consistent reward exists) or determines $v$ up to the addition of a constant. Given that $v$ is determined up to a constant, we can use Theorem 1 to determine $f$, again up to the addition of a constant.

Let $R \subseteq \mathbb{N}$ be a set of natural numbers with the property that $R$ is closed under addition (if $a, b \in R$ then $a + b \in R$). Suppose $R$ has greatest common divisor 1 (i.e.
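The policy-ratio identity used in the proof of Theorem 2 can be checked numerically. The sketch below is illustrative only and not from the paper: the function names, the random MDP, and the simplifying choice of two agents sharing one reward $f$ and one transition kernel $T$ but with different discount factors $\gamma = 0.9$ and $\gamma' = 0.7$ are all assumptions of this sketch. It runs entropy-regularized (soft) value iteration for each discount and confirms that the observable quantity $\lambda \log(\pi(a|s)/\pi'(a|s))$ equals the stated linear expression in the two value functions.

```python
# Illustrative check (not from the paper): in an entropy-regularized MDP the
# optimal policy satisfies
#   lambda * log pi(a|s) = f(s,a) + gamma * E[v(s')] - v(s),
# so for two agents sharing f and T but with discounts gamma, gamma', the log
# policy ratio is a known linear expression in (v, v'). All names are local
# to this sketch.
import numpy as np

def soft_value_iteration(f, T, gamma, lam, iters=3000):
    """Entropy-regularized value iteration.
    f: (S, A) reward array, T: (S, A, S) transition kernel."""
    v = np.zeros(f.shape[0])
    for _ in range(iters):
        q = f + gamma * (T @ v)            # soft Q-values, shape (S, A)
        m = q.max(axis=1, keepdims=True)   # stabilized log-sum-exp
        v = (m + lam * np.log(np.exp((q - m) / lam).sum(axis=1, keepdims=True))).ravel()
    return v

def soft_policy(f, T, gamma, lam, v):
    """Optimal regularized policy pi(a|s) = exp((q(s,a) - v(s)) / lam)."""
    q = f + gamma * (T @ v)
    p = np.exp((q - v[:, None]) / lam)
    return p / p.sum(axis=1, keepdims=True)  # renormalize for safety

rng = np.random.default_rng(0)
S, A, lam = 5, 3, 0.5
f = rng.normal(size=(S, A))
T = rng.random((S, A, S))
T /= T.sum(axis=2, keepdims=True)          # make each T(.|s,a) a distribution

gamma1, gamma2 = 0.9, 0.7                  # the two agents' discount factors
v1 = soft_value_iteration(f, T, gamma1, lam)
v2 = soft_value_iteration(f, T, gamma2, lam)
pi1 = soft_policy(f, T, gamma1, lam, v1)
pi2 = soft_policy(f, T, gamma2, lam, v2)

lhs = lam * (np.log(pi1) - np.log(pi2))    # observable from the two policies
rhs = (gamma1 * (T @ v1) - v1[:, None]) - (gamma2 * (T @ v2) - v2[:, None])
assert np.allclose(lhs, rhs, atol=1e-6)
```

Note that the shared reward $f$ cancels when the two soft Bellman identities are subtracted, which is why the left-hand side, computable from the policies alone, constrains only the value functions; this is the linear system whose solution set is either empty or an affine family, matching the "up to a constant" conclusion.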