AITopics | Reinforcement Learning

Collaborating Authors

Reinforcement Learning

"Reinforcement learning is learning what to do – how to map situations to actions – so as to maximize a numerical reward signal. The learner is not told which actions to take, as in most forms of machine learning, but instead must discover which actions yield the most reward by trying them."
– Sutton, Richard S. and Andrew G. Barto. Reinforcement Learning: An Introduction. (1.1). MIT Press, Cambridge, MA, 1998.

News Overviews Instructional Materials AI-Alerts Classics

Provable Offline Reinforcement Learning for Structured Cyclic MDPs

Lee, Kyungbok, Sarteau, Angelica Cristello, Kosorok, Michael R.

arXiv.org Machine LearningFeb-13-2026

We introduce a novel cyclic Markov decision process (MDP) framework for multi-step decision problems with heterogeneous stage-specific dynamics, transitions, and discount factors across the cycle. In this setting, offline learning is challenging: optimizing a policy at any stage shifts the state distributions of subsequent stages, propagating mismatch across the cycle. To address this, we propose a modular structural framework that decomposes the cyclic process into stage-wise sub-problems. While generally applicable, we instantiate this principle as CycleFQI, an extension of fitted Q-iteration enabling theoretical analysis and interpretation. It uses a vector of stage-specific Q-functions, tailored to each stage, to capture within-stage sequences and transitions between stages. This modular design enables partial control, allowing some stages to be optimized while others follow predefined policies. We establish finite-sample suboptimality error bounds and derive global convergence rates under Besov regularity, demonstrating that CycleFQI mitigates the curse of dimensionality compared to monolithic baselines. Additionally, we propose a sieve-based method for asymptotic inference of optimal policy values under a margin condition. Experiments on simulated and real-world Type 1 Diabetes data sets demonstrate CycleFQI's effectiveness.

artificial intelligence, machine learning, provable offline reinforcement learning, (13 more...)

arXiv.org Machine Learning

2602.11679

Country:

North America > United States > North Carolina > Orange County > Chapel Hill (0.04)
North America > United States > Tennessee > Davidson County > Nashville (0.04)
Europe > Portugal > Porto > Porto (0.04)

Genre: Research Report > Experimental Study (0.45)

Industry:

Health & Medicine > Therapeutic Area > Endocrinology > Diabetes (1.00)
Education > Health & Safety > School Nutrition (0.92)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.92)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.34)

Add feedback

On the Complexity of Offline Reinforcement Learning with $Q^\star$-Approximation and Partial Coverage

Liu, Haolin, Snyder, Braham, Wei, Chen-Yu

arXiv.org Machine LearningFeb-13-2026

We study offline reinforcement learning under $Q^\star$-approximation and partial coverage, a setting that motivates practical algorithms such as Conservative $Q$-Learning (CQL; Kumar et al., 2020) but has received limited theoretical attention. Our work is inspired by the following open question: "Are $Q^\star$-realizability and Bellman completeness sufficient for sample-efficient offline RL under partial coverage?" We answer in the negative by establishing an information-theoretic lower bound. Going substantially beyond this, we introduce a general framework that characterizes the intrinsic complexity of a given $Q^\star$ function class, inspired by model-free decision-estimation coefficients (DEC) for online RL (Foster et al., 2023b; Liu et al., 2025b). This complexity recovers and improves the quantities underlying the guarantees of Chen and Jiang (2022) and Uehara et al. (2023), and extends to broader settings. Our decision-estimation decomposition can be combined with a wide range of $Q^\star$ estimation procedures, modularizing and generalizing existing approaches. Beyond the general framework, we make further contributions: By developing a novel second-order performance difference lemma, we obtain the first $ε^{-2}$ sample complexity under partial coverage for soft $Q$-learning, improving the $ε^{-4}$ bound of Uehara et al. (2023). We remove Chen and Jiang's (2022) need for additional online interaction when the value gap of $Q^\star$ is unknown. We also give the first characterization of offline learnability for general low-Bellman-rank MDPs without Bellman completeness (Jiang et al., 2017; Du et al., 2021; Jin et al., 2021), a canonical setting in online RL that remains unexplored in offline RL except for special cases. Finally, we provide the first analysis for CQL under $Q^\star$-realizability and Bellman completeness beyond the tabular case.

artificial intelligence, machine learning, reinforcement learning, (18 more...)

arXiv.org Machine Learning

2602.12107

Country:

North America > United States > Virginia (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

fa809df3ec53cc5781e5078b7d500a5d-Paper-Conference.pdf

Neural Information Processing SystemsFeb-12-2026, 23:54:25 GMT

algorithm, iteration, value function, (13 more...)

Neural Information Processing Systems

Country:

North America > Canada > Ontario > Toronto (0.14)
North America > Canada > Quebec > Montreal (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

MCP: LearningComposableHierarchicalControl withMultiplicativeCompositionalPolicies

Neural Information Processing SystemsFeb-12-2026, 23:53:51 GMT

Furthermore, whencontrollingcomplex high-dimensional morphologies, such as humanoid bodies, tasks often require coordination of multiple skills simultaneously.

machine learning, reinforcement learning, urlhttp, (17 more...)

Neural Information Processing Systems

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
Asia > Middle East > Jordan (0.04)
North America > United States > New York > New York County > New York City (0.04)
(5 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.30)

Add feedback

DoubleCheckYourStateBeforeTrustingIt: Confidence-AwareBidirectionalOfflineModel-Based Imagination

Neural Information Processing SystemsFeb-12-2026, 23:03:25 GMT

OfflineRLisdeemed to be promising [16, 14] as online learning requires the agent to continuously interact with the environment, which howevermaybecostly,time-consuming, orevendangerous.

artificial intelligence, machine learning, reinforcement learning, (18 more...)

Neural Information Processing Systems

Industry:

Health & Medicine (0.46)
Education (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.97)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.68)

Add feedback

Learn What Not to Learn: Action Elimination with Deep Reinforcement Learning

Tom Zahavy, Matan Haroush, Nadav Merlis, Daniel J. Mankowitz, Shie Mannor

Neural Information Processing SystemsFeb-12-2026, 23:02:52 GMT

Deep RL(DRL) tackles the curse of dimensionality due to large state spaces by utilizing a Deep Neural Network (DNN) to approximate the value function and/or the policy.

artificial intelligence, machine learning, reinforcement learning, (17 more...)

Neural Information Processing Systems

Country:

North America > Canada > Quebec > Montreal (0.04)
Europe > United Kingdom > England (0.04)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.34)

Add feedback

Limiting Extrapolation in Linear Approximate Value Iteration

Andrea Zanette, Alessandro Lazaric, Mykel J. Kochenderfer, Emma Brunskill

Neural Information Processing SystemsFeb-12-2026, 22:38:54 GMT

Neural Information Processing Systems http://nips.cc/

anchor point, iteration, sample complexity, (13 more...)

Neural Information Processing Systems

Country:

North America > United States > California > Santa Clara County > Palo Alto (0.04)
Oceania > Australia > New South Wales > Sydney (0.04)
North America > United States > Washington > King County > Bellevue (0.04)
(3 more...)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

908c9a564a86426585b29f5335b619bc-AuthorFeedback.pdf

Neural Information Processing SystemsFeb-12-2026, 22:38:38 GMT

This approach is often preferable when using "standard" parametric5 regression algorithms, as the weighted p-norm can be directly minimized at learning time (e.g., in least-squares6 regression,the`2-normisminimized). This is not surprising as in the worst case the inherent Bellman error may12 be unbounded and standard AVI tends to diverge. Recent work (Jinglin Chen, Nan Jiang,Information-Theoretic13 Considerations in Batch Reinforcement Learning, ICML 2019, Conjecture 8) has even conjectured an exponential14 lower bound in case of unbounded Bellman error. In the paper we propose afirst heuristic algorithm to automatically construct aset22 ofanchor points (see beginning ofpage 6). Akeybenefit ofourapproach isthat itdoes notmodify theunderlying (linear) feature51 representation, allowing theuser tousethelinear representation with, forexample, approximate value iteration, and52 should this fail, the user can switch toour algorithm and progressively increase the number of support points while53 keeping the same feature representation.

artificial intelligence, machine learning, reinforcement learning, (9 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.94)

Add feedback