Goto

Collaborating Authors

 realizability


Q-MMR: Off-Policy Evaluation via Recursive Reweighting and Moment Matching

arXiv.org Machine Learning

We present a novel theoretical framework, Q-MMR, for off-policy evaluation in finite-horizon MDPs. Q-MMR learns a set of scalar weights, one for each data point, such that the reweighted rewards approximate the expected return under the target policy. The weights are learned inductively in a top-down manner via a moment matching objective against a value-function discriminator class. Notably, and perhaps surprisingly, a data-dependent finite-sample guarantee for general function approximation can be established under only the realizability of $Q^ฯ€$, with a dimension-free bound -- that is, the error does not depend on the statistical complexity of the function class. We also establish connections to several existing methods, such as importance sampling and linear FQE. Further theoretical analyses shed new light on the nature of coverage, a concept of fundamental importance to offline RL.


Adaptivity Under Realizability Constraints: Comparing In-Context and Agentic Learning

arXiv.org Machine Learning

We compare in-context learning with fixed queries and agentic learning with adaptive queries for uniform approximation of task families. We consider two settings: an unrestricted regime, where querying and approximation are arbitrary functions, and a realizable regime, where we require these operations to be implemented by ReLU neural networks. In both settings, adaptivity never hinders approximation performance. However, this advantage can change when one passes from the unrestricted regime to the realizable regime. We identify four distinct approximation scenarios, each witnessed by an explicit task family: (a) no advantage of adaptivity; (b) an advantage in the unrestricted regime that persists under ReLU realizability; (c) an advantage that arises only under realizability; and (d) an advantage that disappears under realizability. This demonstrates that representational constraints interact profoundly with the effect of adaptivity.


Bellman Residual Orthogonalization for Offline Reinforcement Learning Anonymous Author(s) Affiliation Address email

Neural Information Processing Systems

We propose and analyze a reinforcement learning principle that approximates the1 Bellman equations by enforcing their validity only along an user-defined space of2 test functions. Focusing on applications to model-free offline RL with function3 approximation, we exploit this principle to derive confidence intervals for off-policy4 evaluation, as well as to optimize over policies within a prescribed policy class.5 We prove an oracle inequality on our policy optimization procedure in terms of6 a trade-off between the value and uncertainty of an arbitrary comparator policy.7 Different choices of test function spaces allow us to tackle different problems8 within a common framework. We characterize the loss of efficiency in moving9 from on-policy to off-policy data using our procedures, and establish connections10 to concentrability coefficients studied in past work. We examine in depth the11 implementation of our methods with linear function approximation, and provide12 theoretical guarantees with polynomial-time implementations even when Bellman13 closure does not hold.14



AsymptoticallyExactErrorCharacterizationof OfflinePolicyEvaluationwithMisspecifiedLinear Models

Neural Information Processing Systems

Recently, theoretical understanding of OPE has been rapidly advanced under (approximate) realizability assumptions, i.e., where the environments of interest are well approximated with the given hypothetical models.




AnExponentialLowerBoundforLinearly-Realizable MDPswithConstantSuboptimalityGap

Neural Information Processing Systems

A fundamental question in the theory of reinforcement learning is: suppose the optimalQ-function lies inthe linear span ofagivenddimensional feature mapping, is sample-efficient reinforcement learning (RL) possible? The recent and remarkable result of Weisz et al. (2020) resolves this question in the negative, providinganexponential(ind)samplesizelowerbound,whichholdsevenifthe agent has access to a generative model of the environment. One may hope that such a lower can be circumvented with an even stronger assumption that there isaconstant gapbetween the optimalQ-value ofthe best action and that ofthe second-best action (for allstates); indeed, the construction inWeisz etal.