A Bandit Framework for Optimal Selection of Reinforcement Learning Agents