Regret of exploratory policy improvement and $q$-learning
