Exploration Policies for On-the-Fly Controller Synthesis: A Reinforcement Learning Approach