Goto

Collaborating Authors

 realizability


Computational Hardness of Reinforcement Learning with Partial qπ-Realizability

Neural Information Processing Systems

This paper investigates the computational complexity of reinforcement learning within a novel linear function approximation regime, termed partial qπ-realizability. In this framework, the objective is to learn an ϵ-optimal policy with respect to a predefined policy set Π, under the assumption that all value functions corresponding to policies in Π are linearly realizable. This framework adopts assumptions that are weaker than those in the qπ-realizability setting yet stronger than those in the q -realizability setup. As a result, it provides a more practical model for reinforcement learning scenarios where function approximation naturally arise. We prove that learning an ϵ-optimal policy in this newly defined setting is computationally hard. More specifically, we establish NP-hardness under a parameterized greedy policy set (i.e., argmax) and, further, show that--unless NP = RP--an exponential lower bound (exponential in feature vector dimension) holds when the policy set contains softmax policies, under the Randomized Exponential Time Hypothesis. Our hardness results mirror those obtained in the q -realizability settings, and suggest that computational difficulty persists even when the policy class Πis expanded beyond the optimal policy, reinforcing the unbreakable nature of the computational hardness result regarding partial qπ-realizability under two important policy sets. To establish our negative result, our primary technical contribution is a reduction from two complexity problems, δ-MAX-3SAT and δ-MAX-3SAT(b), to instances of our problem settings: GLINEAR-κ-RL (under the greedy policy set) and SLINEAR-κ-RL (under the softmax policy set), respectively. Our findings indicate that positive computational results are generally unattainable in the context of partial qπ-realizability, in sharp contrast to the qπ-realizability setting under a generative access model.


Optimal Rates in Continual Linear Regression via Increasing Regularization

Neural Information Processing Systems

We study realizable continual linear regression under random task orderings, a common setting for developing continual learning theory. In this setup, the worstcase expected loss after k learning iterations admits a lower bound of Ω(1/k). However, prior work using an unregularized scheme has only established an upper bound of O(1/k1/4), leaving a significant gap. Our paper proves that this gap can be narrowed, or even closed, using two frequently used regularization schemes: (1) explicit isotropic ℓ2 regularization, and (2) implicit regularization via finite step budgets. We show that these approaches, which are used in practice to mitigate forgetting, reduce to stochastic gradient descent (SGD) on carefully defined surrogate losses. Through this lens, we identify a fixed regularization strength that yields a near-optimal rate of O(logk/k). Moreover, formalizing and analyzing a generalized variant of SGD for time-varying functions, we derive an increasing regularization strength schedule that provably achieves an optimal rate of O(1/k). This suggests that schedules that increase the regularization coefficient or decrease the number of steps per task are beneficial, at least in the worst case.


Q-MMR: Off-Policy Evaluation via Recursive Reweighting and Moment Matching

arXiv.org Machine Learning

We present a novel theoretical framework, Q-MMR, for off-policy evaluation in finite-horizon MDPs. Q-MMR learns a set of scalar weights, one for each data point, such that the reweighted rewards approximate the expected return under the target policy. The weights are learned inductively in a top-down manner via a moment matching objective against a value-function discriminator class. Notably, and perhaps surprisingly, a data-dependent finite-sample guarantee for general function approximation can be established under only the realizability of $Q^π$, with a dimension-free bound -- that is, the error does not depend on the statistical complexity of the function class. We also establish connections to several existing methods, such as importance sampling and linear FQE. Further theoretical analyses shed new light on the nature of coverage, a concept of fundamental importance to offline RL.


Adaptivity Under Realizability Constraints: Comparing In-Context and Agentic Learning

arXiv.org Machine Learning

We compare in-context learning with fixed queries and agentic learning with adaptive queries for uniform approximation of task families. We consider two settings: an unrestricted regime, where querying and approximation are arbitrary functions, and a realizable regime, where we require these operations to be implemented by ReLU neural networks. In both settings, adaptivity never hinders approximation performance. However, this advantage can change when one passes from the unrestricted regime to the realizable regime. We identify four distinct approximation scenarios, each witnessed by an explicit task family: (a) no advantage of adaptivity; (b) an advantage in the unrestricted regime that persists under ReLU realizability; (c) an advantage that arises only under realizability; and (d) an advantage that disappears under realizability. This demonstrates that representational constraints interact profoundly with the effect of adaptivity.


Bellman Residual Orthogonalization for Offline Reinforcement Learning Anonymous Author(s) Affiliation Address email

Neural Information Processing Systems

We propose and analyze a reinforcement learning principle that approximates the1 Bellman equations by enforcing their validity only along an user-defined space of2 test functions. Focusing on applications to model-free offline RL with function3 approximation, we exploit this principle to derive confidence intervals for off-policy4 evaluation, as well as to optimize over policies within a prescribed policy class.5 We prove an oracle inequality on our policy optimization procedure in terms of6 a trade-off between the value and uncertainty of an arbitrary comparator policy.7 Different choices of test function spaces allow us to tackle different problems8 within a common framework. We characterize the loss of efficiency in moving9 from on-policy to off-policy data using our procedures, and establish connections10 to concentrability coefficients studied in past work. We examine in depth the11 implementation of our methods with linear function approximation, and provide12 theoretical guarantees with polynomial-time implementations even when Bellman13 closure does not hold.14



AsymptoticallyExactErrorCharacterizationof OfflinePolicyEvaluationwithMisspecifiedLinear Models

Neural Information Processing Systems

Recently, theoretical understanding of OPE has been rapidly advanced under (approximate) realizability assumptions, i.e., where the environments of interest are well approximated with the given hypothetical models.




AnExponentialLowerBoundforLinearly-Realizable MDPswithConstantSuboptimalityGap

Neural Information Processing Systems

A fundamental question in the theory of reinforcement learning is: suppose the optimalQ-function lies inthe linear span ofagivenddimensional feature mapping, is sample-efficient reinforcement learning (RL) possible? The recent and remarkable result of Weisz et al. (2020) resolves this question in the negative, providinganexponential(ind)samplesizelowerbound,whichholdsevenifthe agent has access to a generative model of the environment. One may hope that such a lower can be circumvented with an even stronger assumption that there isaconstant gapbetween the optimalQ-value ofthe best action and that ofthe second-best action (for allstates); indeed, the construction inWeisz etal.