Regret of exploratory policy improvement and $q$-learning