AITopics

2505.00787

Country:

Europe (1.00)
North America > United States > Massachusetts (0.45)
North America > United States > California (0.28)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)
Research Report > Promising Solution (0.86)

Industry: Information Technology (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.93)

arXiv.org Machine Learning, Nov-14-2025

Operator Models for Continuous-Time Offline Reinforcement Learning

Hoischen, Nicolas, Bevanda, Petar, Beier, Max, Sosnowski, Stefan, Houska, Boris, Hirche, Sandra

Continuous-time stochastic processes underlie many natural and engineered systems. In healthcare, autonomous driving, and industrial control, direct interaction with the environment is often unsafe or impractical, motivating offline reinforcement learning from historical data. However, there is limited statistical understanding of the approximation errors inherent in learning policies from offline datasets. We address this by linking reinforcement learning to the Hamilton-Jacobi-Bellman equation and proposing an operator-theoretic algorithm based on a simple dynamic programming recursion. Specifically, we represent our world model in terms of the infinitesimal generator of controlled diffusion processes learned in a reproducing kernel Hilbert space. By integrating statistical learning methods and operator theory, we establish global convergence of the value function and derive finite-sample guarantees with bounds tied to system properties such as smoothness and stability. Our theoretical and numerical results indicate that operator-based approaches may hold promise in solving offline reinforcement learning using continuous-time optimal control.
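For orientation, one textbook form of the Hamilton-Jacobi-Bellman equation that the abstract refers to is written out below. It assumes a discounted, infinite-horizon cost-minimisation problem for a controlled diffusion; the symbols $b$, $\sigma$, $c$, $\rho$ and $\mathcal{L}^{u}$ are notation chosen here for illustration, and the paper's own formulation (horizon, sign convention, reward versus cost) may differ.

```latex
% Controlled diffusion (assumed form):
%   dX_t = b(X_t, u_t)\,dt + \sigma(X_t, u_t)\,dW_t
% Infinitesimal generator acting on a smooth test function f:
\[
  (\mathcal{L}^{u} f)(x)
  = b(x,u)^{\top} \nabla f(x)
  + \tfrac{1}{2}\,\operatorname{Tr}\!\bigl(\sigma(x,u)\,\sigma(x,u)^{\top}\,\nabla^{2} f(x)\bigr)
\]
% Stationary HJB equation for the value function V with discount rate \rho > 0
% and running cost c:
\[
  \rho\, V(x) = \min_{u}\,\bigl\{\, c(x,u) + (\mathcal{L}^{u} V)(x) \,\bigr\}
\]
```

In this reading, learning the generator $\mathcal{L}^{u}$ from data is what supplies the world model, and the value function is then obtained from the right-hand side by a dynamic programming recursion.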

artificial intelligence, machine learning, reinforcement learning, (15 more...)

2511.10383

Country:

Europe (0.67)
North America (0.46)

Genre: Research Report (0.81)

Industry:

Energy (0.46)
Transportation (0.34)
Information Technology > Robotics & Automation (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Neural Information Processing Systems, Nov-13-2025, 23:42:55 GMT

A Appendix

Confined trust regions are a stable way of making large updates and avoiding pessimistic coefficients.

artificial intelligence, machine learning, reinforcement learning, (17 more...)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)

Neural Information Processing Systems, Nov-13-2025, 23:33:14 GMT

1737656c4dc65027939e47e4587ce95e-Paper-Conference.pdf

large language model, machine learning, reinforcement learning, (21 more...)

Country: Europe > Austria (0.04)

Industry:

Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Health & Medicine > Therapeutic Area > Pulmonary/Respiratory Diseases (0.71)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.69)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.68)

Varricchione, Giovanni, Klassen, Toryn Q., Alechina, Natasha, Dastani, Mehdi, Logan, Brian, McIlraith, Sheila A.

Pushdown Reward Machines for Reinforcement Learning

Reward machines (RMs) are automata structures that encode (non-Markovian) reward functions for reinforcement learning (RL). RMs can reward any behaviour representable in regular languages and, when paired with RL algorithms that exploit RM structure, have been shown to significantly improve sample efficiency in many domains. In this work, we present pushdown reward machines (pdRMs), an extension of reward machines based on deterministic pushdown automata. pdRMs can recognise and reward temporally extended behaviours representable in deterministic context-free languages, making them more expressive than reward machines. We introduce two variants of pdRM-based policies, one which has access to the entire stack of the pdRM, and one which can only access the top $k$ symbols (for a given constant $k$) of the stack. We propose a procedure to check when the two kinds of policies (for a given environment, pdRM, and constant $k$) achieve the same optimal state values. We then provide theoretical results establishing the expressive power of pdRMs, and space complexity results for the proposed learning problems. Lastly, we propose an approach for off-policy RL algorithms that exploits counterfactual experiences with pdRMs. We conclude by providing experimental results showing how agents can be trained to perform tasks representable in deterministic context-free languages using pdRMs.
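The idea of a reward machine with a stack can be illustrated with a toy example. The sketch below is not the authors' implementation: the class name, the labels `"a"` (collect an item) and `"b"` (deliver an item), and the reward values are all hypothetical. It rewards the agent only when it delivers exactly as many items as it collected, a pattern of the form a^n b^n that a finite-state reward machine cannot track for unbounded n.

```python
from dataclasses import dataclass, field

BOTTOM = "Z"  # bottom-of-stack marker (hypothetical name)


@dataclass
class PushdownRewardMachine:
    """Toy deterministic pushdown reward machine for the pattern a^n b^n (n >= 1)."""

    state: str = "collect"
    stack: list = field(default_factory=lambda: [BOTTOM])

    def step(self, label: str) -> float:
        """Consume one environment label and return the reward for it."""
        top = self.stack[-1]
        if self.state == "collect" and label == "a":
            self.stack.append("A")  # remember one collected item
            return 0.0
        if self.state in ("collect", "deliver") and label == "b" and top == "A":
            self.stack.pop()  # match one delivery against one collected item
            self.state = "deliver"
            if self.stack[-1] == BOTTOM:  # every collected item was delivered
                self.state = "done"
                return 1.0
            return 0.0
        if self.state == "done":
            return 0.0  # task already completed; no further reward
        self.state = "fail"  # any other event breaks the pattern
        return 0.0

    def top_k(self, k: int) -> tuple:
        """Return the top k stack symbols, the restricted view a top-k policy would see."""
        return tuple(self.stack[-k:])


def run(labels: str) -> float:
    """Feed a label sequence to a fresh machine and return the total reward."""
    machine = PushdownRewardMachine()
    return sum(machine.step(label) for label in labels)
```

Here `run("aabb")` yields a total reward of 1.0, while `run("aab")` and `run("ba")` yield 0.0. A policy conditioned on `(state, stack)` corresponds to the full-stack variant described in the abstract, and one conditioned on `(state, top_k(k))` corresponds to the top-$k$ variant.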

artificial intelligence, machine learning, reinforcement learning, (21 more...)

2508.06894

Country:

North America > Canada > Ontario > Toronto (0.14)
Europe > Netherlands (0.04)
North America > United States > New Jersey > Hudson County > Hoboken (0.04)
(3 more...)

Genre: Research Report > New Finding (0.66)

Industry:

Government (0.46)
Education (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents > Agent Societies (0.46)

Gonzalez, Rachel T, Abbott, Madeline R, Nallamothu, Brahmajee, Hummel, Scott, Dorsch, Michael, Dempsey, Walter

Practical considerations when designing an online learning algorithm for an app-based mHealth intervention

arXiv.org Machine Learning, Nov-13-2025

The ubiquitous nature of mobile health (mHealth) technology has expanded opportunities for the integration of reinforcement learning into traditional clinical trial designs, allowing researchers to learn individualized treatment policies during the study. LowSalt4Life 2 (LS4L2) is a recent trial aimed at reducing sodium intake among hypertensive individuals through an app-based intervention. A reinforcement learning algorithm, which was deployed in one of the trial arms, was designed to send reminder notifications to promote app engagement in contexts where the notification would be effective, i.e., when a participant is likely to open the app in the next 30 minutes, and not when prior data suggested reduced effectiveness. Such an algorithm can improve app-based mHealth interventions by reducing participant burden and more effectively promoting behavior change. We encountered various challenges during the implementation of the learning algorithm, which we present as a template for solving challenges in future trials that deploy reinforcement learning algorithms. We provide template solutions based on LS4L2 for solving the key challenges of (i) defining a relevant reward, (ii) determining a meaningful timescale for optimization, (iii) specifying a robust statistical model that allows for automation, (iv) balancing model flexibility with computational cost, and (v) addressing missing values in gradually collected data.

artificial intelligence, machine learning, reinforcement learning, (19 more...)

2511.08719

Country:

North America > United States > Michigan > Washtenaw County > Ann Arbor (0.05)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report > Experimental Study (1.00)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Health & Medicine > Therapeutic Area > Psychiatry/Psychology (0.67)
Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Kleinman, Michael, Trager, Matthew, Achille, Alessandro, Xia, Wei, Soatto, Stefano

e1: Learning Adaptive Control of Reasoning Effort

Increasing the thinking budget of AI models can significantly improve accuracy, but not all questions warrant the same amount of reasoning. Users may prefer to allocate different amounts of reasoning effort depending on how they value output quality versus latency and cost. To leverage this tradeoff effectively, users need fine-grained control over the amount of thinking used for a particular query, but few approaches enable such control. Existing methods require users to specify the absolute number of desired tokens, but this requires knowing the difficulty of the problem beforehand to appropriately set the token budget for a query. To address these issues, we propose Adaptive Effort Control, a self-adaptive reinforcement learning method that trains models to use a user-specified fraction of tokens relative to the current average chain-of-thought length for each query. This approach eliminates dataset- and phase-specific tuning while producing better cost-accuracy tradeoff curves compared to standard methods. Users can dynamically adjust the cost-accuracy trade-off through a continuous effort parameter specified at inference time. We observe that the model automatically learns to allocate resources proportionally to the task difficulty and, across model scales ranging from 1.5B to 32B parameters, our approach enables a 2-3x reduction in chain-of-thought length while maintaining or improving performance relative to the base model used for RL training.
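The notion of a length target that is relative rather than absolute can be made concrete with a small sketch. The abstract does not give the reward used in training, so everything below is an assumption made for illustration: the function names, the linear overshoot penalty, the `penalty` weight, and the exponential moving average used to track the current mean chain-of-thought length.

```python
def effort_target(avg_length: float, effort: float) -> float:
    """Token target as a fraction `effort` of the current average chain-of-thought length."""
    return effort * avg_length


def shaped_reward(correct: bool, length: int, avg_length: float,
                  effort: float, penalty: float = 0.5) -> float:
    """Hypothetical reward: task correctness minus a penalty for exceeding the target."""
    target = effort_target(avg_length, effort)
    # Relative overshoot; zero when the response stays within the target.
    overshoot = max(0.0, length - target) / max(target, 1.0)
    return float(correct) - penalty * overshoot


def update_average(avg_length: float, length: int, momentum: float = 0.9) -> float:
    """Exponential moving average of observed chain-of-thought lengths (assumed tracker)."""
    return momentum * avg_length + (1.0 - momentum) * length
```

With an average length of 1000 tokens and `effort=0.5`, the target is 500 tokens: a correct 500-token answer scores 1.0 under this sketch, and a correct 1000-token answer scores 0.5. Because the target moves with the running average, the user never has to name an absolute token budget.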

artificial intelligence, machine learning, reinforcement learning, (16 more...)

2510.27042

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.69)

Trajectory Bellman Residual Minimization: A Simple Value-Based Method for LLM Reasoning

Yuan, Yurun, Chen, Fan, Jia, Zeyu, Rakhlin, Alexander, Xie, Tengyang

Policy-based methods currently dominate reinforcement learning (RL) pipelines for large language model (LLM) reasoning, leaving value-based approaches largely unexplored. We revisit the classical paradigm of Bellman Residual Minimization and introduce Trajectory Bellman Residual Minimization (TBRM), an algorithm that naturally adapts this idea to LLMs, yielding a simple yet effective off-policy algorithm that optimizes a single trajectory-level Bellman objective using the model's own logits as $Q$-values. TBRM removes the need for critics, importance-sampling ratios, or clipping, and operates with only one rollout per prompt. We prove convergence to the near-optimal KL-regularized policy from arbitrary off-policy data via an improved change-of-trajectory-measure analysis. Experiments on standard mathematical-reasoning benchmarks show that TBRM consistently outperforms policy-based baselines, like PPO and GRPO, with comparable or lower computational and memory overhead. Our results indicate that value-based RL might be a principled and efficient alternative for enhancing reasoning capabilities in LLMs.

large language model, machine learning, reinforcement learning, (15 more...)

2505.15311

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Rainbow Delay Compensation: A Multi-Agent Reinforcement Learning Framework for Mitigating Delayed Observation

Fu, Songchen, Chen, Siang, Zhao, Shaojing, Bai, Letian, Li, Ta, Yan, Yonghong

In real-world multi-agent systems (MASs), observation delays are ubiquitous, preventing agents from making decisions based on the environment's true state. An individual agent's local observation typically comprises multiple components from other agents or dynamic entities within the environment. These discrete observation components with varying delay characteristics pose significant challenges for multi-agent reinforcement learning (MARL). In this paper, we first formulate the decentralized stochastic individual delay partially observable Markov decision process (DSID-POMDP) by extending the standard Dec-POMDP. We then propose the Rainbow Delay Compensation (RDC), a MARL training framework for addressing stochastic individual delays, along with recommended implementations for its constituent modules. We implement the DSID-POMDP's observation generation pattern using standard MARL benchmarks, including MPE and SMAC. Experiments demonstrate that baseline MARL methods suffer severe performance degradation under fixed and unfixed delays. The RDC-enhanced approach mitigates this issue, remarkably achieving ideal delay-free performance in certain delay scenarios while maintaining generalizability. Our work provides a novel perspective on multi-agent delayed observation problems and offers an effective solution framework. The source code is available at https://github.com/linkjoker1006/RDC-pymarl.

artificial intelligence, machine learning, reinforcement learning, (15 more...)

2505.03586

Genre: Research Report > New Finding (0.46)

Industry: Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (1.00)

Kordabad, Arash Bahari, Brandner, Dean, Gros, Sebastien, Lucia, Sergio, Soudjani, Sadegh

Quasi-Newton Compatible Actor-Critic for Deterministic Policies

In this paper, we propose a second-order deterministic actor-critic framework in reinforcement learning that extends the classical deterministic policy gradient method to exploit curvature information of the performance function. Building on the concept of compatible function approximation for the critic, we introduce a quadratic critic that simultaneously preserves the true policy gradient and an approximation of the performance Hessian. A least-squares temporal difference learning scheme is then developed to estimate the quadratic critic parameters efficiently. This construction enables a quasi-Newton actor update using information learned by the critic, yielding faster convergence compared to first-order methods. The proposed approach is general and applicable to any differentiable policy class. Numerical examples demonstrate that the method achieves improved convergence and performance over standard deterministic actor-critic baselines.
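The benefit of curvature information in the actor update can be seen in a minimal sketch. This is not the paper's algorithm: there the gradient and the Hessian approximation come from a quadratic critic fitted by least-squares temporal difference learning, whereas here they are supplied analytically for a hypothetical two-parameter concave objective. The function names and the `damping` parameter are illustrative assumptions.

```python
def solve_2x2(m, v):
    """Solve the 2x2 linear system m @ x = v by Cramer's rule."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if abs(det) < 1e-12:
        raise ValueError("matrix is singular")
    return [(d * v[0] - b * v[1]) / det, (a * v[1] - c * v[0]) / det]


def quasi_newton_step(theta, grad, hess, step=1.0, damping=0.0):
    """Newton-type ascent step: theta - step * (H - damping * I)^{-1} grad.

    For maximisation, H is negative definite near an optimum, so subtracting
    `damping` on the diagonal keeps the system well conditioned.
    """
    h = [[hess[0][0] - damping, hess[0][1]],
         [hess[1][0], hess[1][1] - damping]]
    direction = solve_2x2(h, grad)
    return [t - step * d for t, d in zip(theta, direction)]


def gradient_step(theta, grad, step=0.1):
    """Plain first-order ascent step, for comparison."""
    return [t + step * g for t, g in zip(theta, grad)]


# Hypothetical performance function J(theta) = -(theta0 - 1)^2 - 2 * (theta1 + 3)^2
theta = [0.0, 0.0]
grad = [-2.0 * (theta[0] - 1.0), -4.0 * (theta[1] + 3.0)]
hess = [[-2.0, 0.0], [0.0, -4.0]]
newton_theta = quasi_newton_step(theta, grad, hess)
first_order_theta = gradient_step(theta, grad)
```

On this quadratic, the second-order step lands on the maximiser `[1.0, -3.0]` in a single update, while one first-order step with step size 0.1 only moves part of the way towards it. With an approximate Hessian, as in a quasi-Newton scheme, the step is no longer exact but still rescales the update by the estimated curvature.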

artificial intelligence, machine learning, reinforcement learning, (16 more...)

2511.09509

Country:

Europe > Germany (0.28)
Europe > United Kingdom (0.28)

Genre: Research Report (0.50)

Industry: Energy (0.96)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)