AITopics | Reinforcement Learning

Collaborating Authors

Reinforcement Learning

"Reinforcement learning is learning what to do – how to map situations to actions – so as to maximize a numerical reward signal. The learner is not told which actions to take, as in most forms of machine learning, but instead must discover which actions yield the most reward by trying them."
– Sutton, Richard S. and Andrew G. Barto. Reinforcement Learning: An Introduction. (1.1). MIT Press, Cambridge, MA, 1998.

News Overviews Instructional Materials AI-Alerts Classics

VFOG: Variance-Reduced Fast Optimistic Gradient Methods for a Class of Nonmonotone Generalized Equations

Tran-Dinh, Quoc, Nguyen-Trung, Nghia

arXiv.org Machine LearningAug-26-2025

We develop a novel optimistic gradient-type algorithmic framework, combining both Nesterov's acceleration and variance-reduction techniques, to solve a class of generalized equations involving possibly nonmonotone operators in data-driven applications. Our framework covers a wide class of stochastic variance-reduced schemes, including mini-batching, and control variate unbiased and biased estimators. We establish that our method achieves $\mathcal{O}(1/k^2)$ convergence rates in expectation on the squared norm of residual under the Lipschitz continuity and a ``co-hypomonotonicity-type'' assumptions, improving upon non-accelerated counterparts by a factor of $1/k$. We also prove faster $o(1/k^2)$ convergence rates, both in expectation and almost surely. In addition, we show that the sequence of iterates of our method almost surely converges to a solution of the underlying problem. We demonstrate the applicability of our method using general error bound criteria, covering mini-batch stochastic estimators as well as three well-known control variate estimators: loopless SVRG, SAGA, and loopless SARAH, for which the last three variants attain significantly better oracle complexity compared to existing methods. We validate our framework and theoretical results through two numerical examples. The preliminary results illustrate promising performance of our accelerated method over its non-accelerated counterparts.

artificial intelligence, machine learning, reinforcement learning, (18 more...)

arXiv.org Machine Learning

2508.16791

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > North Carolina > Orange County > Chapel Hill (0.04)
North America > United States > New York (0.04)
(2 more...)

Genre: Research Report > New Finding (0.45)

Industry:

Education (0.67)
Health & Medicine (0.45)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.67)
Information Technology > Mathematics of Computing (0.67)
Information Technology > Game Theory (0.67)
(2 more...)

Add feedback

Efficient Computation of Blackwell Optimal Policies using Rational Functions

Mukherjee, Dibyangshu, Kalyanakrishnan, Shivaram

arXiv.org Artificial IntelligenceAug-26-2025

Markov Decision Problems (MDPs) provide a founda-tional framework for modelling sequential decision-making across diverse domains, guided by optimality criteria such as discounted and average rewards. However, these criteria have inherent limitations: discounted optimality may overly prioritise short-term rewards, while average optimality relies on strong structural assumptions. Blackwell optimality addresses these challenges, offering a robust and comprehensive criterion that ensures optimality under both discounted and average reward frameworks. Despite its theoretical appeal, existing algorithms for computing Blackwell Optimal (BO) policies are computationally expensive or hard to implement. In this paper we describe procedures for computing BO policies using an ordering of rational functions in the vicinity of 1 . We adapt state-of-the-art algorithms for deterministic and general MDPs, replacing numerical evaluations with symbolic operations on rational functions to derive bounds independent of bit complexity. For deterministic MDPs, we give the first strongly polynomial-time algorithms for computing BO policies, and for general MDPs we obtain the first subexponential-time algorithm. We further generalise several policy iteration algorithms, extending the best known upper bounds from the discounted to the Blackwell criterion.

artificial intelligence, machine learning, reinforcement learning, (19 more...)

arXiv.org Artificial Intelligence

2508.18252

Genre: Research Report (0.40)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Mathematics of Computing (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.47)

Add feedback

Proximal Supervised Fine-Tuning

Zhu, Wenhong, Xie, Ruobing, Wang, Rui, Sun, Xingwu, Wang, Di, Liu, Pengfei

arXiv.org Artificial IntelligenceAug-26-2025

Supervised fine-tuning (SFT) of foundation models often leads to poor generalization, where prior capabilities deteriorate after tuning on new tasks or domains. Inspired by trust-region policy optimization (TRPO) and proximal policy optimization (PPO) in reinforcement learning (RL), we propose Proximal SFT (PSFT), a fine-tuning objective that incorporates the benefits of trust-region, effectively constraining policy drift during SFT while maintaining competitive tuning. By viewing SFT as a special case of policy gradient methods with constant positive advantages, we derive PSFT that stabilizes optimization and leads to generalization, while leaving room for further optimization in subsequent post-training stages . Experiments across mathematical and human-value domains show that PSFT matches SFT in-domain, outperforms it in out-of-domain generalization, remains stable under prolonged training without causing entropy collapse, and provides a stronger foundation for the subsequent optimization. Recently, post-training has become a crucial part of the overall training process. In particular, reinforcement learning (RL) algorithms, such as PPO (Schulman et al., 2017) and GRPO (Shao et al., 2024), have demonstrated significant effectiveness when applied to language models (LMs) focused on reasoning tasks. As RL is scaled over time, foundation models gain the capacity to address complex problems through more profound and extended reasoning (OpenAI, 2024; Guo et al., 2025). These reasoning models offer an abundant and valuable latent thoughts (Ruan et al., 2025) across the internet.

arxiv preprint arxiv, large language model, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2508.17784

Country: Asia (0.46)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.91)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

Add feedback

Multi-layer Abstraction for Nested Generation of Options (MANGO) in Hierarchical Reinforcement Learning

Arcudi, Alessio, Sartor, Davide, Sinigaglia, Alberto, François-Lavet, Vincent, Susto, Gian Antonio

arXiv.org Artificial IntelligenceAug-26-2025

This paper introduces MANGO (Multilayer Abstraction for Nested Generation of Options), a novel hierarchical reinforcement learning framework designed to address the challenges of long-term sparse reward environments. MANGO decomposes complex tasks into multiple layers of abstraction, where each layer defines an abstract state space and employs options to modularize trajectories into macro-actions. These options are nested across layers, allowing for efficient reuse of learned movements and improved sample efficiency. The framework introduces intra-layer policies that guide the agent's transitions within the abstract state space, and task actions that integrate task-specific components such as reward functions. Experiments conducted in procedurally-generated grid environments demonstrate substantial improvements in both sample efficiency and generalization capabilities compared to standard RL methods. MANGO also enhances interpretability by making the agent's decision-making process transparent across layers, which is particularly valuable in safety-critical and industrial applications. Future work will explore automated discovery of abstractions and abstract actions, adaptation to continuous or fuzzy environments, and more robust multi-layer training strategies.

abstraction, machine learning, reinforcement learning, (15 more...)

arXiv.org Artificial Intelligence

2508.17751

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

Fair Cooperation in Mixed-Motive Games via Conflict-Aware Gradient Adjustment

Kim, Woojun, Sycara, Katia

arXiv.org Artificial IntelligenceAug-26-2025

Multi-agent reinforcement learning in mixed-motive settings presents a fundamental challenge: agents must balance individual interests with collective goals, which are neither fully aligned nor strictly opposed. To address this, reward restructuring methods such as gifting and intrinsic motivation have been proposed. However, these approaches primarily focus on promoting cooperation by managing the trade-off between individual and collective returns, without explicitly addressing fairness with respect to the agents' task-specific rewards. In this paper, we propose an adaptive conflict-aware gradient adjustment method that promotes cooperation while ensuring fairness in individual rewards. The proposed method dynamically balances policy gradients derived from individual and collective objectives in situations where the two objectives are in conflict. By explicitly resolving such conflicts, our method improves collective performance while preserving fairness across agents. We provide theoretical results that guarantee monotonic non-decreasing improvement in both the collective and individual objectives and ensure fairness. Empirical results in sequential social dilemma environments demonstrate that our approach outperforms baselines in terms of social welfare while ensuring fairness among agents.

artificial intelligence, machine learning, reinforcement learning, (17 more...)

arXiv.org Artificial Intelligence

2508.17696

Genre: Research Report > Experimental Study (1.00)

Industry:

Information Technology (0.46)
Social Sector (0.35)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents > Agent Societies (0.46)

Add feedback

DSADF: Thinking Fast and Slow for Decision Making

Dou, Zhihao, Cui, Dongfei, Yan, Jun, Wang, Weida, Chen, Benteng, Wang, Haoming, Xie, Zeke, Zhang, Shufei

arXiv.org Artificial IntelligenceAug-26-2025

Although Reinforcement Learning (RL) agents are effective in well-defined environments, they often struggle to generalize their learned policies to dynamic settings due to their reliance on trial-and-error interactions. Recent work has explored applying Large Language Models (LLMs) or Vision Language Models (VLMs) to boost the generalization of RL agents through policy optimization guidance or prior knowledge. However, these approaches often lack seamless coordination between the RL agent and the foundation model, leading to unreasonable decision-making in unfamiliar environments and efficiency bottlenecks. Making full use of the inferential capabilities of foundation models and the rapid response capabilities of RL agents and enhancing the interaction between the two to form a dual system is still a lingering scientific question. To address this problem, we draw inspiration from Kahneman's theory of fast thinking (System 1) and slow thinking (System 2), demonstrating that balancing intuition and deep reasoning can achieve nimble decision-making in a complex world. In this study, we propose a Dual-System Adaptive Decision Framework (DSADF), integrating two complementary modules: System 1, comprising an RL agent and a memory space for fast and intuitive decision making, and System 2, driven by a VLM for deep and analytical reasoning. DSADF facilitates efficient and adaptive decision-making by combining the strengths of both systems. The empirical study in the video game environment: Crafter and Housekeep demonstrates the effectiveness of our proposed method, showing significant improvements in decision abilities for both unseen and known tasks.

large language model, machine learning, reinforcement learning, (18 more...)

arXiv.org Artificial Intelligence

2505.08189

Country: Asia > China (1.00)

Genre:

Workflow (1.00)
Research Report > New Finding (1.00)

Industry:

Materials > Metals & Mining (1.00)
Leisure & Entertainment > Games > Computer Games (0.86)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)

Add feedback

VIN-NBV: A View Introspection Network for Next-Best-View Selection

Frahm, Noah, Zhao, Dongxu, Beltran, Andrea Dunn, Alterovitz, Ron, Frahm, Jan-Michael, Oliva, Junier, Sengupta, Roni

arXiv.org Artificial IntelligenceAug-26-2025

Next Best View (NBV) algorithms aim to maximize 3D scene acquisition quality using minimal resources, e.g. number of acquisitions, time taken, or distance traversed. Prior methods often rely on coverage maximization as a proxy for reconstruction quality, but for complex scenes with occlusions and finer details, this is not always sufficient and leads to poor reconstructions. Our key insight is to train an acquisition policy that directly optimizes for reconstruction quality rather than just coverage. To achieve this, we introduce the View Introspection Network (VIN): a lightweight neural network that predicts the Relative Reconstruction Improvement (RRI) of a potential next viewpoint without making any new acquisitions. We use this network to power a simple, yet effective, sequential samplingbased greedy NBV policy. Our approach, VIN-NBV, generalizes to unseen object categories, operates without prior scene knowledge, is adaptable to resource constraints, and can handle occlusions. We show that our RRI fitness criterion leads to a ~30% gain in reconstruction quality over a coverage-based criterion using the same greedy strategy. Furthermore, VIN-NBV also outperforms deep reinforcement learning methods, Scan-RL and GenNBV, by ~40%.

artificial intelligence, machine learning, reinforcement learning, (16 more...)

arXiv.org Artificial Intelligence

2505.06219

Country: Europe (0.28)

Genre:

Research Report (0.64)
Workflow (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

USPR: Learning a Unified Solver for Profiled Routing

Hua, Chuanbo, Berto, Federico, Zhao, Zhikai, Son, Jiwoo, Kwon, Changhyun, Park, Jinkyoo

arXiv.org Artificial IntelligenceAug-26-2025

The Profiled V ehicle Routing Problem (PVRP) extends the classical VRP by incorporating vehicle-client-specific preferences and constraints, reflecting real-world requirements such as zone restrictions and service-level preferences. While recent reinforcement-learning solvers have shown promising performance, they require retraining for each new profile distribution, suffer from poor representation ability, and struggle to generalize to out-of-distribution instances. In this paper, we address these limitations by introducing U nified Solver for Profiled R outing (USPR), a novel framework that natively handles arbitrary profile types. USPR introduces on three key innovations: (i) Profile Embeddings (PE) to encode any combination of profile types; (ii) Multi-Head Profiled Attention (MHP A), an attention mechanism that models rich interactions between vehicles and clients; (iii) Profile-aware Score Reshaping (PSR), which dynamically adjusts decoder logits using profile scores to improve generalization. Empirical results on diverse PVRP benchmarks demonstrate that USPR achieves state-of-the-art results among learning-based methods while offering significant gains in flexibility and computational efficiency. We make our source code publicly available to foster future research.

machine learning, natural language, reinforcement learning, (16 more...)

arXiv.org Artificial Intelligence

2505.05119

Genre: Research Report (1.00)

Industry: Transportation (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Natural Language (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.93)
(4 more...)

Add feedback

Consciousness as a Functor

Mahadevan, Sridhar

arXiv.org Artificial IntelligenceAug-26-2025

We propose a novel theory of consciousness as a functor (CF) that receives and transmits contents from unconscious memory into conscious memory. Our CF framework can be seen as a categorial formulation of the Global Workspace Theory proposed by Baars. CF models the ensemble of unconscious processes as a topos category of coalgebras. The internal language of thought in CF is defined as a Multi-modal Universal Mitchell-Benabou Language Embedding (MUMBLE). We model the transmission of information from conscious short-term working memory to long-term unconscious memory using our recently proposed Universal Reinforcement Learning (URL) framework. To model the transmission of information from unconscious long-term memory into resource-constrained short-term memory, we propose a network economic model.

category, machine learning, reinforcement learning, (19 more...)

arXiv.org Artificial Intelligence

2508.17561

Country: North America > United States > California (0.28)

Genre: Research Report (0.64)

Industry: Health & Medicine (0.87)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Cognitive Science (1.00)

Add feedback

ReviBranch: Deep Reinforcement Learning for Branch-and-Bound with Revived Trajectories

Jiabao, Dou, Jiayi, Nie, Cheng, Yihang, Liu, Jinwei, Ji, Yingrui, Xiao, Canran, Du, Feixiang, Xiao, Jiaping

arXiv.org Artificial IntelligenceAug-26-2025

The Branch-and-bound (B&B) algorithm is the main solver for Mixed Integer Linear Programs (MILPs), where the selection of branching variable is essential to computational efficiency. However, traditional heuristics for branching often fail to generalize across heterogeneous problem instances, while existing learning-based methods such as imitation learning (IL) suffers from dependence on expert demonstration quality, and reinforcement learning (RL) struggles with limitations in sparse rewards and dynamic state representation challenges. To address these issues, we propose ReviBranch, a novel deep RL framework that constructs revived trajectories by reviving explicit historical correspondences between branching decisions and their corresponding graph states along search-tree paths. During training, ReviBranch enables agents to learn from complete structural evolution and temporal dependencies within the branching process. Additionally, we introduce an importance-weighted reward redistribution mechanism that transforms sparse terminal rewards into dense stepwise feedback, addressing the sparse reward challenge. Extensive experiments on different MILP benchmarks demonstrate that ReviBranch outperforms state-of-the-art RL methods, reducing B&B nodes by 4.0% and LP iterations by 2.2% on large-scale instances. The results highlight the robustness and generalizability of ReviBranch across heterogeneous MILP problem classes.

artificial intelligence, machine learning, reinforcement learning, (16 more...)

arXiv.org Artificial Intelligence

2508.17452

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (1.00)

Add feedback