Goto

Collaborating Authors

 Reinforcement Learning


ARelatedWork

Neural Information Processing Systems

Incontrast,our work is concerned with an overall limit on the total amount of information an agent may acquire fromtheenvironment and,inturn,howthattranslates intoitsselection ofafeasible learning target.


Deciding WhattoModel: Value-EquivalentSampling forReinforcementLearning

Neural Information Processing Systems

Inthiswork,weconsider thescenario where agent limitations may entirely preclude identifying an exactly value-equivalent model, immediately giving rise to a trade-off between identifying a model that is simple enough to learn while only incurring bounded sub-optimality.


Variance-Aware Off-Policy Evaluation with Linear Function Approximation

Neural Information Processing Systems

We study the off-policy evaluation (OPE) problem in reinforcement learning with linear function approximation, which aims to estimate the value function of a target policy based on the offline data collected by a behavior policy. We propose to incorporate the variance information of the value function to improve the sample efficiency of OPE. More specifically, for time-inhomogeneous episodic linear Markov decision processes (MDPs), we propose an algorithm, VA-OPE, which uses the estimated variance of the value function to reweight the Bellman residual in Fitted Q-Iteration. We show that our algorithm achieves a tighter error bound than the best-known result. We also provide a fine-grained characterization of the distribution shift between the behavior policy and the target policy. Extensive numerical experiments corroborate our theory.



Supplementary Material for Uncertainty-Based Offline Reinforcement Learning with Diversified Q-Ensemble AEnsemble gradient diversification

Neural Information Processing Systems

Proposition 1. Suppose Qฯ†j(s,a) = Q(s,a) and Qฯ†j(s,) is locally linear in the neighborhood of a for all j [N]. Let ฮปmin and wmin be the smallest eigenvalue and the corresponding normalized eigenvector of the matrix Var aQฯ†j(s,a) and > 0 be the value such that mini6=j aQฯ†i(s,a), aQฯ†j(s,a) = 1 . We first prove that the smallest eigenvalue ฮปmin of Var aQฯ†j(s,a) is upper-bounded by some constant multiple of . By Lemma 1, the total variance of the matrix is less or equal to N 1N. Note that, using the fact that the Q-values coincide at the action a and the local linearity of the Q-functions, we have derived Var(Qฯ†j(s,a+ kw)) = k2w|Var aQฯ†j(s,a) w. (2) Plugging w = wmin in Equation (2) and using Equation (1), we have Var(Qฯ†j(s,a+ kwmin)) = k2w|minVar aQฯ†j(s,a) wmin = k2ฮปmin A.2 Relationship between maximizing the total variance and maximizing the smallest eigenvalue As we have shown in Section 4, maximizing the total variance of the matrix Var ( aQฯ†i(s,a)) is equivalent to minimizing the cosine similarity of all distinct pairs of the gradients aQฯ†i(s,a), 2 which makes the gradients uniformly distributed on the unit sphere S|A| 1.


Uncertainty-Based Offline Reinforcement Learning with Diversified Q-Ensemble

Neural Information Processing Systems

Offline reinforcement learning (offline RL), which aims to find an optimal policy from a previously collected static dataset, bears algorithmic difficulties due to function approximation errors from out-of-distribution (OOD) data points. To this end, offline RL algorithms adopt either a constraint or a penalty term that explicitly guides the policy to stay close to the given dataset. However, prior methods typically require accurate estimation of the behavior policy or sampling from OOD data points, which themselves can be a non-trivial problem. Moreover, these methods under-utilize the generalization ability of deep neural networks and often fall into suboptimal solutions too close to the given dataset. In this work, we propose an uncertainty-based offline RL method that takes into account the confidence of the Q-value prediction and does not require any estimation or sampling of the data distribution. We show that the clipped Q-learning, a technique widely used in online RL, can be leveraged to successfully penalize OOD data points with high prediction uncertainties. Surprisingly, we find that it is possible to substantially outperform existing offline RL methods on various tasks by simply increasing the number of Q-networks along with the clipped Q-learning. Based on this observation, we propose an ensemble-diversified actor-critic algorithm that reduces the number of required ensemble networks down to a tenth compared to the naive ensemble while achieving state-of-the-art performance on most of the D4RL benchmarks considered.



Faster Non-asymptotic Convergence for Double Q-learning

Neural Information Processing Systems

Double Q-learning (Hasselt, 2010) has gained significant success in practice due to its effectiveness in overcoming the overestimation issue of Q-learning. However, the theoretical understanding of double Q-learning is rather limited. The only existing finite-time analysis was recently established in (Xiong et al., 2020), where the polynomial learning rate adopted in the analysis typically yields a slower convergence rate. This paper tackles the more challenging case of a constant learning rate, and develops new analytical tools that improve the existing convergence rate by orders of magnitude.



boovi_camera

Neural Information Processing Systems

Despite the tremendous success of reinforcement learning (RL) with function approximation, efficient exploration remains a significant challenge, both practically and theoretically. In particular, existing theoretically grounded RL algorithms based on upper confidence bounds (UCBs), such as optimistic least-squares value iteration (LSVI), are often incompatible with practically powerful function approximators, such as neural networks. In this paper, we develop a variant of bootstrapped LSVI, namely BooVI, which bridges such a gap between practice and theory.