Reinforcement Learning
Variance-Aware Off-Policy Evaluation with Linear Function Approximation
We study the off-policy evaluation (OPE) problem in reinforcement learning with linear function approximation, which aims to estimate the value function of a target policy based on the offline data collected by a behavior policy. We propose to incorporate the variance information of the value function to improve the sample efficiency of OPE. More specifically, for time-inhomogeneous episodic linear Markov decision processes (MDPs), we propose an algorithm, VA-OPE, which uses the estimated variance of the value function to reweight the Bellman residual in Fitted Q-Iteration. We show that our algorithm achieves a tighter error bound than the best-known result. We also provide a fine-grained characterization of the distribution shift between the behavior policy and the target policy. Extensive numerical experiments corroborate our theory.
Supplementary Material for Uncertainty-Based Offline Reinforcement Learning with Diversified Q-Ensemble AEnsemble gradient diversification
Proposition 1. Suppose Qฯj(s,a) = Q(s,a) and Qฯj(s,) is locally linear in the neighborhood of a for all j [N]. Let ฮปmin and wmin be the smallest eigenvalue and the corresponding normalized eigenvector of the matrix Var aQฯj(s,a) and > 0 be the value such that mini6=j aQฯi(s,a), aQฯj(s,a) = 1 . We first prove that the smallest eigenvalue ฮปmin of Var aQฯj(s,a) is upper-bounded by some constant multiple of . By Lemma 1, the total variance of the matrix is less or equal to N 1N. Note that, using the fact that the Q-values coincide at the action a and the local linearity of the Q-functions, we have derived Var(Qฯj(s,a+ kw)) = k2w|Var aQฯj(s,a) w. (2) Plugging w = wmin in Equation (2) and using Equation (1), we have Var(Qฯj(s,a+ kwmin)) = k2w|minVar aQฯj(s,a) wmin = k2ฮปmin A.2 Relationship between maximizing the total variance and maximizing the smallest eigenvalue As we have shown in Section 4, maximizing the total variance of the matrix Var ( aQฯi(s,a)) is equivalent to minimizing the cosine similarity of all distinct pairs of the gradients aQฯi(s,a), 2 which makes the gradients uniformly distributed on the unit sphere S|A| 1.
Uncertainty-Based Offline Reinforcement Learning with Diversified Q-Ensemble
Offline reinforcement learning (offline RL), which aims to find an optimal policy from a previously collected static dataset, bears algorithmic difficulties due to function approximation errors from out-of-distribution (OOD) data points. To this end, offline RL algorithms adopt either a constraint or a penalty term that explicitly guides the policy to stay close to the given dataset. However, prior methods typically require accurate estimation of the behavior policy or sampling from OOD data points, which themselves can be a non-trivial problem. Moreover, these methods under-utilize the generalization ability of deep neural networks and often fall into suboptimal solutions too close to the given dataset. In this work, we propose an uncertainty-based offline RL method that takes into account the confidence of the Q-value prediction and does not require any estimation or sampling of the data distribution. We show that the clipped Q-learning, a technique widely used in online RL, can be leveraged to successfully penalize OOD data points with high prediction uncertainties. Surprisingly, we find that it is possible to substantially outperform existing offline RL methods on various tasks by simply increasing the number of Q-networks along with the clipped Q-learning. Based on this observation, we propose an ensemble-diversified actor-critic algorithm that reduces the number of required ensemble networks down to a tenth compared to the naive ensemble while achieving state-of-the-art performance on most of the D4RL benchmarks considered.
Faster Non-asymptotic Convergence for Double Q-learning
Double Q-learning (Hasselt, 2010) has gained significant success in practice due to its effectiveness in overcoming the overestimation issue of Q-learning. However, the theoretical understanding of double Q-learning is rather limited. The only existing finite-time analysis was recently established in (Xiong et al., 2020), where the polynomial learning rate adopted in the analysis typically yields a slower convergence rate. This paper tackles the more challenging case of a constant learning rate, and develops new analytical tools that improve the existing convergence rate by orders of magnitude.
boovi_camera
Despite the tremendous success of reinforcement learning (RL) with function approximation, efficient exploration remains a significant challenge, both practically and theoretically. In particular, existing theoretically grounded RL algorithms based on upper confidence bounds (UCBs), such as optimistic least-squares value iteration (LSVI), are often incompatible with practically powerful function approximators, such as neural networks. In this paper, we develop a variant of bootstrapped LSVI, namely BooVI, which bridges such a gap between practice and theory.