A Model-free Learning Algorithm for Infinite-horizon Average-reward MDPs with Near-optimal Regret
