Batch Policy Learning in Average Reward Markov Decision Processes