On Double Descent in Reinforcement Learning with LSTD and Random Features

Brellmann, David, Berthier, Eloïse, Filliat, David, Frehse, Goran

arXiv.org Machine Learning

Temporal Difference (TD) algorithms are widely used in Deep Reinforcement Learning (RL). Their performance is heavily influenced by the size of the neural network. While in supervised learning, the regime of over-parameterization and its benefits are well understood, the situation in RL is much less clear. In this paper, we present a theoretical analysis of the influence of network size and $l_2$-regularization on performance. We identify the ratio between the number of parameters and the number of visited states as a crucial factor and define over-parameterization as the regime when it is larger than one. Furthermore, we observe a double descent phenomenon, i.e., a sudden drop in performance around the parameter/state ratio of one. Leveraging random features and the lazy training regime, we study the regularized Least-Square Temporal Difference (LSTD) algorithm in an asymptotic regime, as both the number of parameters and states go to infinity, maintaining a constant ratio. We derive deterministic limits of both the empirical and the true Mean-Square Bellman Error (MSBE) that feature correction terms responsible for the double-descent. Correction terms vanish when the $l_2$-regularization is increased or the number of unvisited states goes to zero. Numerical experiments with synthetic and small real-world environments closely match the theoretical predictions.
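
A minimal sketch of the regularized LSTD estimator with random features discussed above, assuming a generic dataset of transitions; the random ReLU feature map and the parameter names (`n_features`, `reg`) are illustrative choices, not the paper's exact setup:

```python
# Regularized LSTD with random (untrained) ReLU features -- a sketch, not the paper's code.
import numpy as np

def random_features(states, W, b):
    """Random ReLU features phi(s) = max(0, W s + b); W and b are fixed at random."""
    return np.maximum(0.0, states @ W.T + b)

def lstd(states, rewards, next_states, gamma, n_features, reg, seed=0):
    rng = np.random.default_rng(seed)
    d = states.shape[1]
    W = rng.normal(size=(n_features, d)) / np.sqrt(d)   # random first layer, never trained
    b = rng.normal(size=n_features)

    phi = random_features(states, W, b)                  # N x p feature matrix
    phi_next = random_features(next_states, W, b)

    # Regularized LSTD: (Phi^T (Phi - gamma Phi') + reg I) theta = Phi^T r
    A = phi.T @ (phi - gamma * phi_next) + reg * np.eye(n_features)
    theta = np.linalg.solve(A, phi.T @ rewards)

    # Empirical Mean-Square Bellman Error on the visited states
    td_err = rewards + gamma * phi_next @ theta - phi @ theta
    return theta, np.mean(td_err ** 2)
```

Sweeping `n_features` around the number of visited states with a small `reg` is the kind of experiment in which a peak in the empirical MSBE near a parameter/state ratio of one would be expected to appear.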


Toward Efficient Gradient-Based Value Estimation

Sharifnassab, Arsalan, Sutton, Richard

arXiv.org Artificial Intelligence

Gradient-based methods for value estimation in reinforcement learning have favorable stability properties, but they are typically much slower than Temporal Difference (TD) learning methods. We study the root causes of this slowness and show that the Mean Square Bellman Error (MSBE) is an ill-conditioned loss function, in the sense that its Hessian has a large condition number. To resolve the adverse effect of poor conditioning of the MSBE on gradient-based methods, we propose a low-complexity, batch-free proximal method that approximately follows the Gauss-Newton direction and is asymptotically robust to parameterization. Our main algorithm, called RANS, is efficient in the sense that it is significantly faster than residual-gradient methods while having almost the same computational complexity, and is competitive with TD on the classic problems that we tested.
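
As a small numerical illustration (not the paper's code) of the conditioning issue: with tabular features the MSBE is a quadratic whose Hessian is proportional to (I - gamma*P)^T (I - gamma*P), so in the random-walk example below its condition number grows roughly like ((1 + gamma) / (1 - gamma))^2 as the discount factor approaches one.

```python
# Condition number of the MSBE Hessian for a tabular random walk -- an illustrative sketch.
import numpy as np

def msbe_hessian_condition(P, gamma):
    n = P.shape[0]
    C = np.eye(n) - gamma * P          # Bellman residual matrix for tabular features
    H = (2.0 / n) * C.T @ C            # Hessian of the (expected) MSBE
    return np.linalg.cond(H)

# Random walk on a ring of 50 states (step left or right with probability 1/2).
n = 50
P = np.zeros((n, n))
for s in range(n):
    P[s, (s - 1) % n] = 0.5
    P[s, (s + 1) % n] = 0.5

for gamma in (0.9, 0.99, 0.999):
    print(gamma, msbe_hessian_condition(P, gamma))   # grows rapidly as gamma -> 1
```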


Revisiting Bellman Errors for Offline Model Selection

Zitovsky, Joshua P., de Marchi, Daniel, Agarwal, Rishabh, Kosorok, Michael R.

arXiv.org Machine Learning

Choosing the best policy from a set of many policies given only logged data is crucial for applying offline RL in real-world settings. One idea that has been extensively explored is to select policies based on the mean squared Bellman error (MSBE) of the associated Q-functions. However, previous work has struggled to obtain adequate OMS performance with Bellman errors, leading many researchers to abandon the idea: empirical Bellman errors have been found to be poor predictors of value model accuracy (Irpan et al., 2019; Paine et al., 2020), which has led to a belief that Bellman errors are not useful for OMS (Géron, 2019; Fujimoto et al., 2022). To this end, we elucidate why previous work has seen pessimistic results with Bellman errors and identify conditions under which OMS algorithms based on Bellman errors will perform well. Moreover, we develop a new estimator of the MSBE, Supervised Bellman Validation (SBV), that provides a better proxy for the true Bellman errors than empirical Bellman errors. SBV achieves strong performance on diverse tasks ranging from healthcare problems (Klasnja et al., 2015) to Atari games (Bellemare et al., 2013).
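
The baseline criterion the abstract refers to can be written down in a few lines. The sketch below (not the paper's SBV algorithm) ranks candidate Q-functions for offline model selection by their empirical MSBE on logged transitions, the very quantity that prior work found to be a poor predictor of value model accuracy:

```python
# Naive offline model selection by empirical MSBE -- an illustrative baseline, not SBV.
import numpy as np

def empirical_msbe(q_fn, transitions, gamma):
    """q_fn(states) returns an (N, n_actions) array; transitions is a dict of arrays."""
    s, a, r, s_next, done = (transitions[k] for k in ("s", "a", "r", "s_next", "done"))
    q_sa = q_fn(s)[np.arange(len(a)), a]                       # Q(s, a) for logged actions
    target = r + gamma * (1.0 - done) * q_fn(s_next).max(axis=1)
    return np.mean((q_sa - target) ** 2)

def select_policy(candidate_q_fns, transitions, gamma=0.99):
    """Return the index of the candidate Q-function with the smallest empirical MSBE."""
    scores = [empirical_msbe(q, transitions, gamma) for q in candidate_q_fns]
    return int(np.argmin(scores))
```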


Robust Losses for Learning Value Functions

Patterson, Andrew, Liao, Victor, White, Martha

arXiv.org Artificial Intelligence

Most value function learning algorithms in reinforcement learning are based on the mean squared (projected) Bellman error. However, squared errors are known to be sensitive to outliers, both skewing the solution of the objective and resulting in high-magnitude and high-variance gradients. To control these high-magnitude updates, typical strategies in RL involve clipping gradients, clipping rewards, rescaling rewards, or clipping errors. While these strategies appear to be related to robust losses -- like the Huber loss -- they are built on semi-gradient update rules which do not minimize a known loss. In this work, we build on recent insights reformulating squared Bellman errors as a saddlepoint optimization problem and propose a saddlepoint reformulation for a Huber Bellman error and Absolute Bellman error. We start from a formalization of robust losses, then derive sound gradient-based approaches to minimize these losses in both the online off-policy prediction and control settings. We characterize the solutions of the robust losses, providing insight into the problem settings where the robust losses define notably better solutions than the mean squared Bellman error. Finally, we show that the resulting gradient-based algorithms are more stable, for both prediction and control, with less sensitivity to meta-parameters.
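
A hedged sketch of the basic ingredient behind these robust losses: replacing the squared penalty on the Bellman error with a Huber penalty. The paper's actual algorithms work with a saddlepoint reformulation; the simplified linear-feature version below only illustrates how the Huber loss caps the influence of large, outlier errors (the threshold `kappa` is an illustrative choice):

```python
# Huber penalty on TD errors with linear features -- a simplified illustration.
import numpy as np

def huber(delta, kappa=1.0):
    """Quadratic for |delta| <= kappa, linear beyond, so outliers contribute less."""
    abs_d = np.abs(delta)
    return np.where(abs_d <= kappa,
                    0.5 * delta ** 2,
                    kappa * (abs_d - 0.5 * kappa))

def mean_huber_bellman_error(phi, phi_next, rewards, theta, gamma, kappa=1.0):
    td_errors = rewards + gamma * phi_next @ theta - phi @ theta
    return np.mean(huber(td_errors, kappa))
```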


Control of Continuous Quantum Systems with Many Degrees of Freedom based on Convergent Reinforcement Learning

Wang, Zhikang

arXiv.org Artificial Intelligence

With the development of experimental quantum technology, quantum control has attracted increasing attention due to the realization of controllable artificial quantum systems. However, because quantum-mechanical systems are often too difficult to deal with analytically, heuristic strategies and numerical algorithms are adopted to search for proper control protocols, and deep learning, especially deep reinforcement learning (RL), is a promising generic candidate for these control problems. Although there have been a few successful applications of deep RL to quantum control, most existing RL algorithms suffer from instabilities and poor reproducibility, and require extensive fine-tuning and a large computational budget, both of which limit their applicability. To address these instabilities, in this dissertation we first investigate the non-convergence issue of Q-learning. We then examine the weaknesses of existing convergent approaches and develop a new convergent Q-learning algorithm, the convergent deep Q network (C-DQN) algorithm, as an alternative to the conventional deep Q network (DQN) algorithm. We prove the convergence of C-DQN and apply it to the Atari 2600 benchmark, showing that C-DQN still learns successfully where DQN fails. We then apply the algorithm to measurement-feedback cooling problems for a quantum quartic oscillator and a trapped quantum rigid body. We establish the physical models and analyse their properties, and show that although both C-DQN and DQN can learn to cool these systems, C-DQN tends to behave more stably and achieves better performance when DQN suffers from instabilities. As the performance of DQN can have large variance and lack consistency, C-DQN can be a better choice for research on complicated control problems.


A Convergent and Efficient Deep Q Network Algorithm

Wang, Zhikang T., Ueda, Masahito

arXiv.org Artificial Intelligence

Despite the empirical success of the deep Q network (DQN) reinforcement learning algorithm and its variants, DQN is still not well understood and it does not guarantee convergence. In this work, we show that DQN can diverge and cease to operate in realistic settings. Although there exist gradient-based convergent methods, we show that they actually have inherent problems in learning behaviour and elucidate why they often fail in practice. To overcome these problems, we propose a convergent DQN algorithm (C-DQN) obtained by carefully modifying DQN, and we show that the algorithm is convergent and can work with large discount factors (0.9998). It learns robustly in difficult settings and can learn several difficult games in the Atari 2600 benchmark where DQN fails, within a moderate computational budget. Our code has been publicly released and can be used to reproduce our results.
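
For context, a simplified sketch (not the authors' C-DQN) of the two update rules the abstract contrasts: the standard semi-gradient DQN loss, which treats the bootstrap target as a constant and does not follow the gradient of any fixed objective, and the full-gradient (residual) MSBE loss, which is a genuine gradient-descent objective but has the learning-behaviour problems discussed in the paper:

```python
# Semi-gradient DQN loss vs. full-gradient residual MSBE loss -- an illustrative sketch.
import torch
import torch.nn.functional as F

def dqn_semi_gradient_loss(q_net, batch, gamma):
    s, a, r, s_next, done = batch                        # tensors from a replay batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                                # target treated as a constant
        target = r + gamma * (1 - done) * q_net(s_next).max(dim=1).values
    return F.mse_loss(q_sa, target)

def residual_msbe_loss(q_net, batch, gamma):
    s, a, r, s_next, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    target = r + gamma * (1 - done) * q_net(s_next).max(dim=1).values
    return F.mse_loss(q_sa, target)                      # gradient also flows through the target
```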