A maximum-entropy approach to off-policy evaluation in average-reward MDPs