A Maximum-Entropy Approach to Off-Policy Evaluation in Average-Reward MDPs

Neural Information Processing Systems 

This work focuses on off-policy evaluation (OPE) with function approximation in infinite-horizon undiscounted Markov decision processes (MDPs). For MDPs that are ergodic and linear (i.e.