Goto

Collaborating Authors

 Mhammedi, Zakaria


Is a Good Foundation Necessary for Efficient Reinforcement Learning? The Computational Role of the Base Model in Exploration

arXiv.org Artificial Intelligence

Language model alignment (or, reinforcement learning) techniques that leverage active exploration -- deliberately encouraging the model to produce diverse, informative responses -- offer the promise of super-human capabilities. However, current understanding of algorithm design primitives for computationally efficient exploration with language models is limited. To better understand how to leverage access to powerful pre-trained generative models to improve the efficiency of exploration, we introduce a new computational framework for RL with language models, in which the learner interacts with the model through a sampling oracle. Focusing on the linear softmax model parameterization, we provide new results that reveal the computational-statistical tradeoffs of efficient exploration: 1. Necessity of coverage: Coverage refers to the extent to which the pre-trained model covers near-optimal responses -- a form of hidden knowledge. We show that coverage, while not necessary for data efficiency, lower bounds the runtime of any algorithm in our framework. 2. Inference-time exploration: We introduce a new algorithm, SpannerSampling, which obtains optimal data efficiency and is computationally efficient whenever the pre-trained model enjoys sufficient coverage, matching our lower bound. SpannerSampling leverages inference-time computation with the pre-trained model to reduce the effective search space for exploration. 3. Insufficiency of training-time interventions: We contrast the result above by showing that training-time interventions that produce proper policies cannot achieve similar guarantees in polynomial time. 4. Computational benefits of multi-turn exploration: Finally, we show that under additional representational assumptions, one can achieve improved runtime (replacing sequence-level coverage with token-level coverage) through multi-turn exploration.


Beating Adversarial Low-Rank MDPs with Unknown Transition and Bandit Feedback

arXiv.org Artificial Intelligence

We consider regret minimization in low-rank MDPs with fixed transition and adversarial losses. Previous work has investigated this problem under either full-information loss feedback with unknown transitions (Zhao et al., 2024), or bandit loss feedback with known transition (Foster et al., 2022). First, we improve the $poly(d, A, H)T^{5/6}$ regret bound of Zhao et al. (2024) to $poly(d, A, H)T^{2/3}$ for the full-information unknown transition setting, where d is the rank of the transitions, A is the number of actions, H is the horizon length, and T is the number of episodes. Next, we initiate the study on the setting with bandit loss feedback and unknown transitions. Assuming that the loss has a linear structure, we propose both model based and model free algorithms achieving $poly(d, A, H)T^{2/3}$ regret, though they are computationally inefficient. We also propose oracle-efficient model-free algorithms with $poly(d, A, H)T^{4/5}$ regret. We show that the linear structure is necessary for the bandit case without structure on the reward function, the regret has to scale polynomially with the number of states. This is contrary to the full-information case (Zhao et al., 2024), where the regret can be independent of the number of states even for unstructured reward function.


Reinforcement Learning under Latent Dynamics: Toward Statistical and Algorithmic Modularity

arXiv.org Machine Learning

Real-world applications of reinforcement learning often involve environments where agents operate on complex, high-dimensional observations, but the underlying (''latent'') dynamics are comparatively simple. However, outside of restrictive settings such as small latent spaces, the fundamental statistical requirements and algorithmic principles for reinforcement learning under latent dynamics are poorly understood. This paper addresses the question of reinforcement learning under $\textit{general}$ latent dynamics from a statistical and algorithmic perspective. On the statistical side, our main negative result shows that most well-studied settings for reinforcement learning with function approximation become intractable when composed with rich observations; we complement this with a positive result, identifying latent pushforward coverability as a general condition that enables statistical tractability. Algorithmically, we develop provably efficient observable-to-latent reductions -- that is, reductions that transform an arbitrary algorithm for the latent MDP into an algorithm that can operate on rich observations -- in two settings: one where the agent has access to hindsight observations of the latent dynamics [LADZ23], and one where the agent can estimate self-predictive latent models [SAGHCB20]. Together, our results serve as a first step toward a unified statistical and algorithmic theory for reinforcement learning under latent dynamics.


Online Convex Optimization with a Separation Oracle

arXiv.org Artificial Intelligence

In this paper, we introduce a new projection-free algorithm for Online Convex Optimization (OCO) with a state-of-the-art regret guarantee among separation-based algorithms. Existing projection-free methods based on the classical Frank-Wolfe algorithm achieve a suboptimal regret bound of $O(T^{3/4})$, while more recent separation-based approaches guarantee a regret bound of $O(\kappa \sqrt{T})$, where $\kappa$ denotes the asphericity of the feasible set, defined as the ratio of the radii of the containing and contained balls. However, for ill-conditioned sets, $\kappa$ can be arbitrarily large, potentially leading to poor performance. Our algorithm achieves a regret bound of $\widetilde{O}(\sqrt{dT} + \kappa d)$, while requiring only $\widetilde{O}(1)$ calls to a separation oracle per round. Crucially, the main term in the bound, $\widetilde{O}(\sqrt{d T})$, is independent of $\kappa$, addressing the limitations of previous methods. Additionally, as a by-product of our analysis, we recover the $O(\kappa \sqrt{T})$ regret bound of existing OCO algorithms with a more straightforward analysis and improve the regret bound for projection-free online exp-concave optimization. Finally, for constrained stochastic convex optimization, we achieve a state-of-the-art convergence rate of $\widetilde{O}(\sigma/\sqrt{T} + \kappa d/T)$, where $\sigma$ represents the noise in the stochastic gradients, while requiring only $\widetilde{O}(1)$ calls to a separation oracle per iteration.


Improved Sample Complexity of Imitation Learning for Barrier Model Predictive Control

arXiv.org Artificial Intelligence

Imitation learning has emerged as a powerful tool in machine learning, enabling agents to learn complex behaviors by imitating expert demonstrations acquired either from a human demonstrator or a policy computed offline [3, 11, 12, 13]. Despite its significant success, imitation learning often suffers from a compounding error problem: Successive evaluations of the approximate policy could accumulate error, resulting in out-of-distribution failures [3]. Recent results in imitation learning [31, 32, 34] have identified smoothness (i.e., Lipschitzness of the derivative of the optimal controller with respect to the initial state) and stability of the expert as two key properties that circumvent this issue, thereby allowing for end-to-end performance guarantees for the final learned controller. In this paper, our focus is on enabling such guarantees when the expert being imitated is a Model Predictive Controller (MPC), a powerful class of control algorithms based on solving an optimization problem over a receding prediction horizon [23]. In some cases, the solution to this multiparametric optimization problem, known as the explicit MPC representation [6], can be pre-computed. For instance, in our setup -- linear systems with polytopic constraints -- the optimal control input is a piecewise affine (and, hence, highly non-smooth) function of the state [6].


Fully Unconstrained Online Learning

arXiv.org Machine Learning

We provide an online learning algorithm that obtains regret $G\|w_\star\|\sqrt{T\log(\|w_\star\|G\sqrt{T})} + \|w_\star\|^2 + G^2$ on $G$-Lipschitz convex losses for any comparison point $w_\star$ without knowing either $G$ or $\|w_\star\|$. Importantly, this matches the optimal bound $G\|w_\star\|\sqrt{T}$ available with such knowledge (up to logarithmic factors), unless either $\|w_\star\|$ or $G$ is so large that even $G\|w_\star\|\sqrt{T}$ is roughly linear in $T$. Thus, it matches the optimal bound in all cases in which one can achieve sublinear regret, which arguably most "interesting" scenarios.


The Power of Resets in Online Reinforcement Learning

arXiv.org Machine Learning

Simulators are a pervasive tool in reinforcement learning, but most existing algorithms cannot efficiently exploit simulator access -- particularly in high-dimensional domains that require general function approximation. We explore the power of simulators through online reinforcement learning with {local simulator access} (or, local planning), an RL protocol where the agent is allowed to reset to previously observed states and follow their dynamics during training. We use local simulator access to unlock new statistical guarantees that were previously out of reach: - We show that MDPs with low coverability (Xie et al. 2023) -- a general structural condition that subsumes Block MDPs and Low-Rank MDPs -- can be learned in a sample-efficient fashion with only $Q^{\star}$-realizability (realizability of the optimal state-value function); existing online RL algorithms require significantly stronger representation conditions. - As a consequence, we show that the notorious Exogenous Block MDP problem (Efroni et al. 2022) is tractable under local simulator access. The results above are achieved through a computationally inefficient algorithm. We complement them with a more computationally efficient algorithm, RVFS (Recursive Value Function Search), which achieves provable sample complexity guarantees under a strengthened statistical assumption known as pushforward coverability. RVFS can be viewed as a principled, provable counterpart to a successful empirical paradigm that combines recursive search (e.g., MCTS) with value function approximation.


Efficient Model-Free Exploration in Low-Rank MDPs

arXiv.org Artificial Intelligence

A major challenge in reinforcement learning is to develop practical, sample-efficient algorithms for exploration in high-dimensional domains where generalization and function approximation is required. Low-Rank Markov Decision Processes -- where transition probabilities admit a low-rank factorization based on an unknown feature embedding -- offer a simple, yet expressive framework for RL with function approximation, but existing algorithms are either (1) computationally intractable, or (2) reliant upon restrictive statistical assumptions such as latent variable structure, access to model-based function approximation, or reachability. In this work, we propose the first provably sample-efficient algorithm for exploration in Low-Rank MDPs that is both computationally efficient and model-free, allowing for general function approximation and requiring no additional structural assumptions. Our algorithm, VoX, uses the notion of a generalized optimal design for the feature embedding as an efficiently computable basis for exploration, performing efficient optimal design computation by interleaving representation learning and policy optimization. Our analysis -- which is appealingly simple and modular -- carefully combines several techniques, including a new reduction from optimal design computation to policy optimization based on the Frank-Wolfe method, and an improved analysis of a certain minimax representation learning objective found in prior work.


Projection-Free Online Convex Optimization via Efficient Newton Iterations

arXiv.org Artificial Intelligence

This paper presents new projection-free algorithms for Online Convex Optimization (OCO) over a convex domain $\mathcal{K} \subset \mathbb{R}^d$. Classical OCO algorithms (such as Online Gradient Descent) typically need to perform Euclidean projections onto the convex set $\cK$ to ensure feasibility of their iterates. Alternative algorithms, such as those based on the Frank-Wolfe method, swap potentially-expensive Euclidean projections onto $\mathcal{K}$ for linear optimization over $\mathcal{K}$. However, such algorithms have a sub-optimal regret in OCO compared to projection-based algorithms. In this paper, we look at a third type of algorithms that output approximate Newton iterates using a self-concordant barrier for the set of interest. The use of a self-concordant barrier automatically ensures feasibility without the need for projections. However, the computation of the Newton iterates requires a matrix inverse, which can still be expensive. As our main contribution, we show how the stability of the Newton iterates can be leveraged to compute the inverse Hessian only a vanishing fraction of the rounds, leading to a new efficient projection-free OCO algorithm with a state-of-the-art regret bound.


Representation Learning with Multi-Step Inverse Kinematics: An Efficient and Optimal Approach to Rich-Observation RL

arXiv.org Artificial Intelligence

We study the design of sample-efficient algorithms for reinforcement learning in the presence of rich, high-dimensional observations, formalized via the Block MDP problem. Existing algorithms suffer from either 1) computational intractability, 2) strong statistical assumptions that are not necessarily satisfied in practice, or 3) suboptimal sample complexity. We address these issues by providing the first computationally efficient algorithm that attains rate-optimal sample complexity with respect to the desired accuracy level, with minimal statistical assumptions. Our algorithm, MusIK, combines systematic exploration with representation learning based on multi-step inverse kinematics, a learning objective in which the aim is to predict the learner's own action from the current observation and observations in the (potentially distant) future. MusIK is simple and flexible, and can efficiently take advantage of general-purpose function approximation. Our analysis leverages several new techniques tailored to non-optimistic exploration algorithms, which we anticipate will find broader use.