Single-partition adaptive Q-learning
Araújo, João Pedro, Figueiredo, Mário, Botto, Miguel Ayala
This paper introduces single-partition adaptive Q-learning (SPAQL), an algorithm for model-free episodic reinforcement learning (RL) that adaptively partitions the state-action space of a Markov decision process (MDP) while simultaneously learning a time-invariant policy (i.e., the mapping from states to actions does not depend explicitly on the episode time step) that maximizes the cumulative reward. The trade-off between exploration and exploitation is handled by mixing upper confidence bounds (UCB) and Boltzmann exploration during training, with a temperature parameter that is automatically tuned as training progresses. SPAQL improves on adaptive Q-learning (AQL): it converges faster to the optimal solution while using fewer arms. Tests on episodes with a large number of time steps show that SPAQL scales without difficulty, unlike AQL. Based on this empirical evidence, we claim that SPAQL may have higher sample efficiency than AQL, making it a relevant contribution to the field of efficient model-free RL methods.
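The abstract mentions that exploration mixes UCB with Boltzmann (softmax) exploration under an automatically tuned temperature. As a rough illustration of the Boltzmann component only, the sketch below samples an action with probability proportional to exp(Q/T); the function names are illustrative and not taken from the paper, and the temperature-tuning and UCB terms are omitted.

```python
import math
import random

def boltzmann_probs(q_values, temperature):
    """Softmax over Q-values; lower temperature means greedier choices."""
    m = max(q_values)  # subtract the max for numerical stability
    weights = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]

def boltzmann_action(q_values, temperature, rng=random):
    """Sample an action index according to the Boltzmann distribution."""
    probs = boltzmann_probs(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]
```

As the temperature is lowered during training, the distribution concentrates on the highest-valued action, shifting behavior from exploration toward exploitation.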
Jul-13-2020