Collaborating Authors

 Agarwal, Alekh


VO$Q$L: Towards Optimal Regret in Model-free RL with Nonlinear Function Approximation

arXiv.org Artificial Intelligence

We study time-inhomogeneous episodic reinforcement learning (RL) under general function approximation and sparse rewards. We design a new algorithm, Variance-weighted Optimistic $Q$-Learning (VO$Q$L), based on $Q$-learning and bound its regret assuming completeness and bounded Eluder dimension for the regression function class. As a special case, VO$Q$L achieves $\tilde{O}(d\sqrt{HT}+d^6H^{5})$ regret over $T$ episodes for a horizon $H$ MDP under ($d$-dimensional) linear function approximation, which is asymptotically optimal. Our algorithm incorporates weighted regression-based upper and lower bounds on the optimal value function to obtain this improved regret. The algorithm is computationally efficient given a regression oracle over the function class, making this the first computationally tractable and statistically optimal approach for linear MDPs.
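As a rough illustration of the weighted-regression idea (the notation below is ours, not the paper's), the regression step at stage $h$ can be thought of as

$$\hat{f}_h \in \arg\min_{f \in \mathcal{F}} \sum_{k=1}^{K} \frac{\big(f(s_h^k, a_h^k) - r_h^k - \hat{V}_{h+1}(s_{h+1}^k)\big)^2}{\bar{\sigma}_k^2},$$

where $\bar{\sigma}_k^2$ is a (suitably clipped) estimate of the variance of the regression target in episode $k$. Optimistic and pessimistic value estimates are then formed by adding and subtracting a confidence width around $\hat{f}_h$, and weighting by the variance estimates is what sharpens the horizon dependence relative to unweighted least squares.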


Minimax Regret Optimization for Robust Machine Learning under Distribution Shift

arXiv.org Machine Learning

In this paper, we consider learning scenarios where the learned model is evaluated under an unknown test distribution which potentially differs from the training distribution (i.e., distribution shift). The learner has access to a family of weight functions such that the test distribution is a reweighting of the training distribution under one of these functions, a setting typically studied under the name of Distributionally Robust Optimization (DRO). We consider the problem of deriving regret bounds in the classical learning theory setting, and require that the resulting regret bounds hold uniformly for all potential test distributions. We show that the DRO formulation does not guarantee uniformly small regret under distribution shift. We instead propose an alternative method called Minimax Regret Optimization (MRO), and show that under suitable conditions this method achieves uniformly low regret across all test distributions. We also adapt our technique to have stronger guarantees when the test distributions are heterogeneous in their similarity to the training data. Given the widespread optimization of worst-case risks in current approaches to robust machine learning, we believe that MRO can be a strong alternative to address distribution shift scenarios.
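To make the contrast concrete (with $P$ the training distribution, $\ell$ the loss, and $\mathcal{W}$ the weight-function family; the display is our paraphrase of the two objectives), DRO minimizes the worst-case reweighted risk,

$$\min_{\theta} \max_{w \in \mathcal{W}} \; \mathbb{E}_{x \sim P}\big[w(x)\,\ell(\theta; x)\big],$$

whereas MRO minimizes the worst-case reweighted regret,

$$\min_{\theta} \max_{w \in \mathcal{W}} \; \Big( \mathbb{E}_{x \sim P}\big[w(x)\,\ell(\theta; x)\big] - \min_{\theta'} \mathbb{E}_{x \sim P}\big[w(x)\,\ell(\theta'; x)\big] \Big).$$

Subtracting the best achievable risk under each reweighting keeps the objective from being dominated by test distributions on which every model in the class performs poorly.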


Efficient Reinforcement Learning in Block MDPs: A Model-free Representation Learning Approach

arXiv.org Artificial Intelligence

Representation learning in Reinforcement Learning (RL) has gained increasing attention in recent years from both theoretical and empirical research communities (Schwarzer et al., 2020; Laskin et al., 2020) due to its potential for enabling sample-efficient non-linear function approximation, its benefits in multitask settings (Zhang et al., 2020; Yang et al., 2022; Sodhani et al., 2021), and the possibility of leveraging advances in representation learning from related areas such as computer vision and natural language processing. Despite this interest, there remains a gap between the theoretical and empirical literature: the theoretically sound methods are seldom evaluated or even implemented and often rely on strong assumptions, while the empirical techniques are not backed by any theoretical guarantees even under stylized assumptions. This leaves open the key challenge of designing representation learning methods that are both theoretically sound and empirically effective. In this work, we tackle this challenge for a special class of problems called Block MDPs, where the high-dimensional and rich observations of the agent are generated from certain latent states and there exists some fixed but unknown mapping from observations to the latent states (each observation is generated by exactly one latent state). Prior works (Dann et al., 2018; Du et al., 2019; Misra et al., 2020; Zhang et al., 2020; Sodhani et al., 2021) have motivated the Block MDP model through scenarios such as navigation tasks and image-based robotics tasks, where the observations can often be reasonably mapped to the latent physical location and states.
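For concreteness, the Block MDP assumption can be stated as follows (standard notation, not taken verbatim from the paper): each observation $x$ is emitted from a latent state $s$ via $x \sim q(\cdot \mid s)$, the supports of $q(\cdot \mid s)$ and $q(\cdot \mid s')$ are disjoint for $s \neq s'$, and hence there is a fixed but unknown decoder $\phi^\star$ with $\phi^\star(x) = s$ whenever $x$ lies in the support of $q(\cdot \mid s)$.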


Bellman-consistent Pessimism for Offline Reinforcement Learning

arXiv.org Machine Learning

Using past experiences to learn improved behavior for future interactions is a critical capability for a Reinforcement Learning (RL) agent. However, robustly extrapolating knowledge from a historical dataset for sequential decision making is highly challenging, particularly in settings where function approximation is employed to generalize across related observations. In this paper, we provide a systematic treatment of such scenarios with general function approximation, and devise algorithms that can provably leverage an arbitrary historical dataset to discover the policy that obtains the largest guaranteed rewards, amongst all possible scenarios consistent with the dataset. The problem of learning a good policy from historical datasets, typically called batch or offline RL, has a long history [see e.g., Precup et al., 2000; Antos et al., 2008; Levine et al., 2020, and references therein]. Many prior works [e.g., Precup et al., 2000; Antos et al., 2008; Chen and Jiang, 2019] make so-called coverage assumptions on the dataset, requiring the dataset to contain every possible state-action pair or trajectory with a lower-bounded probability. These assumptions are evidently prohibitive in practice, particularly for problems with large state and/or action spaces. Furthermore, the methods developed under these assumptions routinely display unstable behaviors, such as lack of convergence or error amplification, when coverage assumptions are violated [Wang et al., 2020, 2021]. Driven by these instabilities, a growing body of recent literature has pursued a so-called best-effort style of guarantee instead.
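Written informally (our notation, with $\pi$ treated as deterministic for simplicity), the max-min principle described above selects

$$\hat{\pi} \in \arg\max_{\pi} \; \min_{f \in \mathcal{F}_{\pi}(\mathcal{D})} \; \mathbb{E}_{s_0}\big[f(s_0, \pi(s_0))\big],$$

where $\mathcal{F}_{\pi}(\mathcal{D})$ denotes the value functions that are approximately Bellman-consistent for $\pi$ on the dataset $\mathcal{D}$: the returned policy maximizes the smallest value among all hypotheses the data cannot rule out.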


Provably Correct Optimization and Exploration with Non-linear Policies

arXiv.org Machine Learning

Policy optimization methods remain a powerful workhorse in empirical Reinforcement Learning (RL), with a focus on neural policies that can easily reason over complex and continuous state and/or action spaces. Theoretical understanding of strategic exploration in policy-based methods with non-linear function approximation, however, is largely missing. In this paper, we address this question by designing ENIAC, an actor-critic method that allows non-linear function approximation in the critic. We show that under certain assumptions, e.g., a bounded eluder dimension $d$ for the critic class, the learner finds a near-optimal policy in $O(\mathrm{poly}(d))$ exploration rounds. The method is robust to model misspecification and strictly extends existing works on linear function approximation. We also develop some computational optimizations of our approach, with slightly worse statistical guarantees, and an empirical adaptation building on existing deep RL tools. We empirically evaluate this adaptation and show that it outperforms prior heuristics inspired by linear methods, establishing the value of correctly reasoning about the agent's uncertainty under non-linear function approximation.
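One standard way to quantify the critic's uncertainty in this setting, given only as an illustration rather than the paper's exact construction, is the width of the version space at a state-action pair,

$$w(s, a) = \sup\Big\{\, f(s, a) - f'(s, a) \;:\; f, f' \in \mathcal{F},\; \textstyle\sum_{(s_i, a_i) \in \mathcal{D}} \big(f(s_i, a_i) - f'(s_i, a_i)\big)^2 \le \beta \,\Big\},$$

i.e., the largest disagreement between critics that fit the collected data equally well; the eluder dimension of $\mathcal{F}$ bounds how often this width can remain large, which is what drives $\mathrm{poly}(d)$-style exploration guarantees.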


Towards a Dimension-Free Understanding of Adaptive Linear Control

arXiv.org Machine Learning

We study the problem of adaptive control of the linear quadratic regulator for systems in very high, or even infinite dimension. We demonstrate that while sublinear regret requires finite dimensional inputs, the ambient state dimension of the system need not be bounded in order to perform online control. We provide the first regret bounds for LQR which hold for infinite dimensional systems, replacing dependence on ambient dimension with more natural notions of problem complexity. Our guarantees arise from a novel perturbation bound for certainty equivalence which scales with the prediction error in estimating the system parameters, without requiring consistent parameter recovery in more stringent measures like the operator norm. When specialized to finite dimensional settings, our bounds recover near optimal dimension and time horizon dependence.
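As a reminder of what certainty equivalence means in this context (standard LQR notation, not specific to this paper): given estimates $(\hat{A}, \hat{B})$ of the dynamics, one solves the discrete algebraic Riccati equation for the estimated system with costs $(Q, R)$ to obtain $\hat{P}$, and plays the resulting linear controller

$$u_t = \hat{K} x_t, \qquad \hat{K} = -\big(R + \hat{B}^\top \hat{P} \hat{B}\big)^{-1} \hat{B}^\top \hat{P} \hat{A}.$$

The perturbation bound referred to above controls the excess cost of $\hat{K}$ in terms of the prediction error of $(\hat{A}, \hat{B})$, rather than in terms of operator-norm parameter error.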


Model-free Representation Learning and Exploration in Low-rank MDPs

arXiv.org Machine Learning

The low rank MDP has emerged as an important model for studying representation learning and exploration in reinforcement learning. With a known representation, several model-free exploration strategies exist. In contrast, all algorithms for the unknown representation setting are model-based, thereby requiring the ability to model the full dynamics. In this work, we present the first model-free representation learning algorithms for low rank MDPs. The key algorithmic contribution is a new minimax representation learning objective, for which we provide variants with differing tradeoffs in their statistical and computational properties. We interleave this representation learning step with an exploration strategy to cover the state space in a reward-free manner. The resulting algorithms are provably sample efficient and can accommodate general function approximation to scale to complex environments.
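For reference, the low-rank structure in question is the factorization (standard notation)

$$P(s' \mid s, a) = \big\langle \phi^\star(s, a), \mu^\star(s') \big\rangle, \qquad \phi^\star(s, a) \in \mathbb{R}^d,$$

with both factors unknown. A model-based method must effectively learn $\phi^\star$ and $\mu^\star$ together, whereas the model-free approach here only needs to recover a feature map $\phi$ from a candidate class that is good enough to drive exploration.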


PC-PG: Policy Cover Directed Exploration for Provable Policy Gradient Learning

arXiv.org Artificial Intelligence

Direct policy gradient methods for reinforcement learning are a successful approach for a variety of reasons: they are model-free, they directly optimize the performance metric of interest, and they allow for richly parameterized policies. Their primary drawback is that, by being local in nature, they fail to adequately explore the environment. In contrast, while model-based approaches and Q-learning directly handle exploration through the use of optimism, their ability to handle model misspecification and function approximation is far less evident. This work introduces the Policy Cover-Policy Gradient (PC-PG) algorithm, which provably balances the exploration vs. exploitation tradeoff using an ensemble of learned policies (the policy cover). PC-PG enjoys polynomial sample complexity and run time for both tabular MDPs and, more generally, linear MDPs in an infinite-dimensional RKHS. Furthermore, PC-PG also has strong guarantees under model misspecification that go beyond the standard worst case $\ell_{\infty}$ assumptions; this includes approximation guarantees for state aggregation under an average case error assumption, along with guarantees under a more general assumption where the approximation error under distribution shift is controlled. We complement the theory with empirical evaluation across a variety of domains in both reward-free and reward-driven settings.
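To make the loop structure concrete, here is a minimal sketch in Python on a toy tabular MDP; the random chain environment, the count-based bonus (standing in for the paper's feature-coverage bonus), the plain REINFORCE update, and all constants are our own simplifications for illustration, not the paper's implementation.

import numpy as np

# Minimal sketch of a policy-cover-style loop on a toy tabular MDP.
# The environment, the count-based bonus, and the REINFORCE update are
# simplifications for exposition only.

n_states, n_actions, horizon = 10, 2, 10
rng = np.random.default_rng(0)
P = rng.integers(0, n_states, size=(n_states, n_actions))  # deterministic transitions
R = np.zeros((n_states, n_actions))
R[n_states - 1, :] = 1.0                                    # sparse reward at the last state

def rollout(logits, bonus=None):
    """Run one episode with a softmax policy; return the trajectory and its return."""
    s, traj, ret = 0, [], 0.0
    for _ in range(horizon):
        p = np.exp(logits[s] - logits[s].max()); p /= p.sum()
        a = rng.choice(n_actions, p=p)
        ret += R[s, a] + (bonus[s, a] if bonus is not None else 0.0)
        traj.append((s, a))
        s = P[s, a]
    return traj, ret

counts = np.ones((n_states, n_actions))   # visitation counts under the policy cover
policy_cover = []                         # the ensemble of policies learned so far

for epoch in range(5):
    bonus = 1.0 / np.sqrt(counts)         # large where the cover has rarely visited
    logits = np.zeros((n_states, n_actions))
    for _ in range(200):                  # policy gradient on bonus-augmented reward
        traj, ret = rollout(logits, bonus)
        for s, a in traj:
            p = np.exp(logits[s] - logits[s].max()); p /= p.sum()
            grad = -p
            grad[a] += 1.0
            logits[s] += 0.1 * ret * grad  # REINFORCE step
    policy_cover.append(logits.copy())
    for _ in range(50):                   # refresh cover visitation counts
        traj, _ = rollout(logits)
        for s, a in traj:
            counts[s, a] += 1

The pieces to note are the cover (an ensemble of previously learned policies), the exploration bonus computed from how well that cover has visited the state-action space, and the fact that each new policy is trained by policy gradient on reward plus bonus before being added to the cover; the paper's bonus is instead derived from how well the cover's state distribution covers the relevant feature directions.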


Provably Good Batch Reinforcement Learning Without Great Exploration

arXiv.org Artificial Intelligence

Batch reinforcement learning (RL) is important for applying RL algorithms to many high-stakes tasks. Doing batch RL in a way that yields a reliable new policy in large domains is challenging: a new decision policy may visit states and actions outside the support of the batch data, and function approximation and optimization with limited samples can further increase the potential of learning policies with overly optimistic estimates of their future performance. Recent algorithms have shown promise but can still be overly optimistic in their expected outcomes. Theoretical work that provides strong guarantees on the performance of the output policy relies on a strong concentrability assumption, which makes it unsuitable for cases where the ratio between the state-action distributions of the behavior policy and some candidate policies is large. This is because in the traditional analysis, the error bound scales with this ratio. We show that a small modification to the Bellman optimality and evaluation back-ups, yielding a more conservative update, can deliver much stronger guarantees. In certain settings, the resulting algorithms find the approximately best policy within the state-action space explored by the batch data, without requiring a priori assumptions of concentrability. We highlight the necessity of our conservative update and the limitations of previous algorithms and analyses through illustrative MDP examples, and demonstrate an empirical comparison of our algorithm against other state-of-the-art batch RL baselines on standard benchmarks.
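As one way to picture the conservative back-up (our notation; an illustration rather than the paper's exact operator), let $\hat{\mu}$ estimate the state-action density of the batch data and let $\zeta(s, a) = \mathbf{1}\{\hat{\mu}(s, a) \ge b\}$ flag sufficiently supported pairs. The modified optimality back-up then only trusts supported actions,

$$(\widehat{\mathcal{T}} f)(s, a) = r(s, a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\Big[\max_{a'} \zeta(s', a')\, f(s', a')\Big],$$

so the value of poorly supported successor actions is pessimistically truncated instead of being extrapolated from sparse data.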


FLAMBE: Structural Complexity and Representation Learning of Low Rank MDPs

arXiv.org Machine Learning

The ability to learn effective transformations of complex data sources, sometimes called representation learning, is an essential primitive in modern machine learning, leading to remarkable achievements in language modeling and vision, and serving as a partial explanation for the success of deep learning more broadly (Bengio et al., 2013). In Reinforcement Learning (RL), several works have shown empirically that learning succinct representations of perceptual inputs can accelerate the search for decision-making policies (Pathak et al., 2017; Tang et al., 2017; Oord et al., 2018; Srinivas et al., 2020). However, representation learning for RL is far more subtle than it is for supervised learning (Du et al., 2019a; Van Roy and Dong, 2019; Lattimore and Szepesvari, 2019), and the theoretical foundations of representation learning for RL are nascent. The first question that arises in this context is: what is a good representation? Intuitively, a good representation should help us achieve greater sample efficiency on downstream tasks.