Quantum policy gradient algorithms
Jerbi, Sofiene, Cornelissen, Arjan, Ozols, Māris, Dunjko, Vedran
–arXiv.org Artificial Intelligence
When studying the potential advantages of quantum computing in machine learning, a natural question that arises is whether quantum algorithms that exploit quantum access to data can speed up learning. In the context of supervised learning, this led to the development of algorithms based on quantum RAMs, which can achieve high-degree polynomial improvements over their classical analogs [1]. In reinforcement learning, where we consider learning agents interacting with task environments, the question becomes: can quantum interactions with an environment, and in particular the ability to explore several trajectories in superposition, be beneficial for a learning agent. In recent years, several works have approached this question from a variety of angles [2]: based on Grover's algorithm [3], some works have for instance shown that searching for an optimal sequence of actions in an environment can be done using quadratically fewer interactions given the appropriate oracular access to the environment [4, 5, 6]. Other works have considered the more general problem of finding the optimal policy in a Markov Decision Process (MDP), and have found that up to quadratic speed-ups in the number of interactions are also possible, again given the proper oracular access [7, 8, 9, 10]. Finally, tailored MDP environments (based, e.g., on Simon's problem) have also been introduced, which allow for exponential quantum speed-ups in learning times compared to the best classical agents [11]. Yet, all the quantum algorithms that have been proposed in this quantum-accessible setting remain inefficient in the most well-publicised use cases of reinforcement learning, such as Go [12], city navigation [13], and computer games [14]: environments with large state-action spaces. Indeed, aside from the task-specific algorithms of Ref. [11], the proposed algorithms scale at best as the square root of the size of the state-action space, which is intractable in most modern-day applications that deal for instance with image-based inputs. In the classical literature, modern approaches to reinforcement learning in large spaces commonly replace the explicit storage of a policy (and/or a value function) in a table of values by a parametrized model (e.g., a
arXiv.org Artificial Intelligence
Dec-19-2022
- Country:
- Europe
- Netherlands
- South Holland > Leiden (0.04)
- North Holland > Amsterdam (0.04)
- Austria > Tyrol
- Innsbruck (0.04)
- Netherlands
- Asia > Middle East
- Jordan (0.04)
- Europe
- Genre:
- Research Report (0.64)
- Industry:
- Leisure & Entertainment > Games (0.34)