AITopics | Reinforcement Learning

Collaborating Authors

Reinforcement Learning

"Reinforcement learning is learning what to do – how to map situations to actions – so as to maximize a numerical reward signal. The learner is not told which actions to take, as in most forms of machine learning, but instead must discover which actions yield the most reward by trying them."
– Sutton, Richard S. and Andrew G. Barto. Reinforcement Learning: An Introduction. (1.1). MIT Press, Cambridge, MA, 1998.

News Overviews Instructional Materials AI-Alerts Classics

Offline Reinforcement Learning in Large State Spaces: Algorithms and Guarantees

Jiang, Nan, Xie, Tengyang

arXiv.org Machine LearningOct-7-2025

This article introduces the theory of offline reinforcement learning in large state spaces, where good policies are learned from historical data without online interactions with the environment. Key concepts introduced include expressivity assumptions on function approximation (e.g., Bellman completeness vs. realizability) and data coverage (e.g., all-policy vs. single-policy coverage). A rich landscape of algorithms and results is described, depending on the assumptions one is willing to make and the sample and computational complexity guarantees one wishes to achieve. We also discuss open questions and connections to adjacent areas.

algorithm, assumption, international conference, (12 more...)

arXiv.org Machine Learning

2510.04088

Country:

North America > United States > Wisconsin > Dane County > Madison (0.04)
North America > United States > Massachusetts > Middlesex County > Belmont (0.04)
North America > United States > Illinois > Champaign County > Urbana (0.04)
Asia > Middle East > Jordan (0.04)

Genre:

Research Report (0.50)
Workflow (0.46)

Industry: Health & Medicine (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.46)

Add feedback

Balancing Interpretability and Performance in Reinforcement Learning: An Adaptive Spectral Based Linear Approach

Yi, Qianxin, Lin, Shao-Bo, Fan, Jun, Wang, Yao

arXiv.org Machine LearningOct-7-2025

Reinforcement learning (RL) has been widely applied to sequential decision making, where interpretability and performance are both critical for practical adoption. Current approaches typically focus on performance and rely on post hoc explanations to account for interpretability. Different from these approaches, we focus on designing an interpretability-oriented yet performance-enhanced RL approach. Specifically, we propose a spectral based linear RL method that extends the ridge regression-based approach through a spectral filter function. The proposed method clarifies the role of regularization in controlling estimation error and further enables the design of an adaptive regularization parameter selection strategy guided by the bias-variance trade-off principle. Theoretical analysis establishes near-optimal bounds for both parameter estimation and generalization error. Extensive experiments on simulated environments and real-world datasets from Kuaishou and Taobao demonstrate that our method either outperforms or matches existing baselines in decision quality. We also conduct interpretability analyses to illustrate how the learned policies make decisions, thereby enhancing user trust. These results highlight the potential of our approach to bridge the gap between RL theory and practical decision making, providing interpretability, accuracy, and adaptability in management contexts.

adaptive spectral, balancing interpretability and performance, reinforcement learning, (9 more...)

arXiv.org Machine Learning

2510.03722

Country:

North America > United States (0.27)
Asia > China > Shaanxi Province > Xi'an (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > China > Hong Kong > Kowloon (0.04)

Genre: Research Report (1.00)

Industry: Health & Medicine (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.34)

Add feedback

Trajectory Data Suffices for Statistically Efficient Policy Evaluation in Finite-Horizon Offline RL with Linear $q^π$-Realizability and Concentrability

Tkachuk, Volodymyr, Szepesvári, Csaba, Tan, Xiaoqi

arXiv.org Machine LearningOct-7-2025

We study finite-horizon offline reinforcement learning (RL) with function approximation for both policy evaluation and policy optimization. Prior work established that statistically efficient learning is impossible for either of these problems when the only assumptions are that the data has good coverage (concentrability) and the state-action value function of every policy is linearly realizable ($q^π$-realizability) (Foster et al., 2021). Recently, Tkachuk et al. (2024) gave a statistically efficient learner for policy optimization, if in addition the data is assumed to be given as trajectories. In this work we present a statistically efficient learner for policy evaluation under the same assumptions. Further, we show that the sample complexity of the learner used by Tkachuk et al. (2024) for policy optimization can be improved by a tighter analysis.

assumption, hnull, probability, (14 more...)

arXiv.org Machine Learning

2510.03494

Country:

North America > Canada > Alberta (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.48)

Add feedback

Post-training Large Language Models for Diverse High-Quality Responses

Chen, Yilei, Chakraborty, Souradip, Wolf, Lorenz, Paschalidis, Yannis, Pacchiano, Aldo

arXiv.org Artificial IntelligenceOct-7-2025

Reinforcement learning (RL) has emerged as a popular method for post-training large language models (LLMs). While improving the model's performance on downstream tasks, it often reduces the model's output diversity, leading to narrow, canonical responses. Existing methods to enhance diversity are limited, either by operating at inference time or by focusing on surface-level differences. We propose a novel training method named DQO (Diversity Quality Optimization) based on determinantal point processes (DPPs) to jointly optimize LLMs for quality and semantic diversity. Our approach samples and embeds a group of responses for each prompt, then uses the determinant of a kernel-based similarity matrix to measure diversity as the volume spanned by the embeddings of these responses. DQO is flexible and can be applied on top of existing RL algorithms. Experiments across instruction-following, summarization, story generation, and reasoning tasks demonstrate that our method substantially improves semantic diversity without sacrificing model quality.

arXiv.org Artificial Intelligence

2509.04784

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.73)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.53)

Add feedback

Tail-Safe Hedging: Explainable Risk-Sensitive Reinforcement Learning with a White-Box CBF--QP Safety Layer in Arbitrage-Free Markets

Zhang, Jian'an

arXiv.org Artificial IntelligenceOct-7-2025

We introduce Tail-Safe, a deployability-oriented framework for derivatives hedging that unifies distributional, risk-sensitive reinforcement learning with a white-box control-barrier-function (CBF) quadratic-program (QP) safety layer tailored to financial constraints. The learning component combines an IQN-based distributional critic with a CVaR objective (IQN--CVaR--PPO) and a Tail-Coverage Controller that regulates quantile sampling through temperature tilting and tail boosting to stabilize small-$α$ estimation. The safety component enforces discrete-time CBF inequalities together with domain-specific constraints -- ellipsoidal no-trade bands, box and rate limits, and a sign-consistency gate -- solved as a convex QP whose telemetry (active sets, tightness, rate utilization, gate scores, slack, and solver status) forms an auditable trail for governance. We provide guarantees of robust forward invariance of the safe set under bounded model mismatch, a minimal-deviation projection interpretation of the QP, a KL-to-DRO upper bound linking per-state KL regularization to worst-case CVaR, concentration and sample-complexity results for the temperature-tilted CVaR estimator, and a CVaR trust-region improvement inequality under KL limits, together with feasibility persistence under expiry-aware tightening. Empirically, in arbitrage-free, microstructure-aware synthetic markets (SSVI $\to$ Dupire $\to$ VIX with ABIDES/MockLOB execution), Tail-Safe improves left-tail risk without degrading central performance and yields zero hard-constraint violations whenever the QP is feasible with zero slack. Telemetry is mapped to governance dashboards and incident workflows to support explainability and auditability. Limitations include reliance on synthetic data and simplified execution to isolate methodological contributions.

artificial intelligence, machine learning, reinforcement learning, (18 more...)

arXiv.org Artificial Intelligence

2510.04555

Country: Asia (0.46)

Genre:

Workflow (0.66)
Research Report (0.64)

Industry: Banking & Finance > Trading (0.88)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Constraint-Based Reasoning (0.54)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.34)

Add feedback

Video Game Level Design as a Multi-Agent Reinforcement Learning Problem

Earle, Sam, Jiang, Zehua, Vinitsky, Eugene, Togelius, Julian

arXiv.org Artificial IntelligenceOct-7-2025

Procedural Content Generation via Reinforcement Learning (PCGRL) offers a method for training controllable level designer agents without the need for human datasets, using metrics that serve as proxies for level quality as rewards. Existing PCGRL research focuses on single generator agents, but are bottlenecked by the need to frequently recalculate heuristics of level quality and the agent's need to navigate around potentially large maps. By framing level generation as a multi-agent problem, we mitigate the efficiency bottleneck of single-agent PCGRL by reducing the number of reward calculations relative to the number of agent actions. We also find that multi-agent level generators are better able to generalize to out-of-distribution map shapes, which we argue is due to the generators' learning more local, modular design policies. We conclude that treating content generation as a distributed, multi-agent task is beneficial for generating functional artifacts at scale.

artificial intelligence, machine learning, reinforcement learning, (17 more...)

arXiv.org Artificial Intelligence

2510.04862

Genre: Research Report > New Finding (0.46)

Industry: Leisure & Entertainment > Games > Computer Games (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents > Agent Societies (0.89)

Add feedback

Chasing Moving Targets with Online Self-Play Reinforcement Learning for Safer Language Models

Liu, Mickel, Jiang, Liwei, Liang, Yancheng, Du, Simon Shaolei, Choi, Yejin, Althoff, Tim, Jaques, Natasha

arXiv.org Artificial IntelligenceOct-7-2025

Conventional language model (LM) safety alignment relies on a reactive, disjoint procedure: attackers exploit a static model, followed by defensive fine-tuning to patch exposed vulnerabilities. This sequential approach creates a mismatch -- attackers overfit to obsolete defenses, while defenders perpetually lag behind emerging threats. To address this, we propose Self-RedTeam, an online self-play reinforcement learning algorithm where an attacker and defender agent co-evolve through continuous interaction. We cast safety alignment as a two-player zero-sum game, where a single model alternates between attacker and defender roles -- generating adversarial prompts and safeguarding against them -- while a reward LM adjudicates outcomes. This enables dynamic co-adaptation. Grounded in the game-theoretic framework of zero-sum games, we establish a theoretical safety guarantee which motivates the design of our method: if self-play converges to a Nash Equilibrium, the defender will reliably produce safe responses to any adversarial input. Empirically, Self-RedTeam uncovers more diverse attacks (+21.8% SBERT) compared to attackers trained against static defenders and achieves higher robustness on safety benchmarks (e.g., +65.5% on WildJailBreak) than defenders trained against static attackers. We further propose hidden Chain-of-Thought, allowing agents to plan privately, which boosts adversarial diversity and reduces over-refusals. Our results motivate a shift from reactive patching to proactive co-evolution in LM safety training, enabling scalable, autonomous, and robust self-improvement of LMs via multi-agent reinforcement learning (MARL).

machine learning, natural language, reinforcement learning, (18 more...)

arXiv.org Artificial Intelligence

2506.07468

Country: North America > Mexico (0.28)

Genre: Research Report > New Finding (1.00)

Industry:

Leisure & Entertainment > Games (1.00)
Information Technology (0.93)
Government > Military (0.93)
Media (0.93)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Learning a Unified Policy for Position and Force Control in Legged Loco-Manipulation

Zhi, Peiyuan, Li, Peiyang, Yin, Jianqin, Jia, Baoxiong, Huang, Siyuan

arXiv.org Artificial IntelligenceOct-7-2025

Robotic loco-manipulation tasks often involve contact-rich interactions with the environment, requiring the joint modeling of contact force and robot position. However, recent visuomotor policies often focus solely on learning position or force control, overlooking their co-learning. In this work, we propose the first unified policy for legged robots that jointly models force and position control learned without reliance on force sensors. By simulating diverse combinations of position and force commands alongside external disturbance forces, we use reinforcement learning to learn a policy that estimates forces from historical robot states and compensates for them through position and velocity adjustments. This policy enables a wide range of manipulation behaviors under varying force and position inputs, including position tracking, force application, force tracking, and compliant interactions. Furthermore, we demonstrate that the learned policy enhances trajectory-based imitation learning pipelines by incorporating essential contact information through its force estimation module, achieving approximately 39.5% higher success rates across four challenging contact-rich manipulation tasks compared to position-control policies. Extensive experiments on both a quadrupedal manipulator and a humanoid robot validate the versatility and robustness of the proposed policy across diverse scenarios.

artificial intelligence, machine learning, reinforcement learning, (16 more...)

arXiv.org Artificial Intelligence

2505.20829

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.67)
Information Technology > Artificial Intelligence > Robots > Locomotion (0.49)
Information Technology > Artificial Intelligence > Robots > Manipulation (0.46)

Add feedback

A Snapshot of Influence: A Local Data Attribution Framework for Online Reinforcement Learning

Hu, Yuzheng, Wu, Fan, Ye, Haotian, Forsyth, David, Zou, James, Jiang, Nan, Ma, Jiaqi W., Zhao, Han

arXiv.org Artificial IntelligenceOct-7-2025

Online reinforcement learning (RL) excels in complex, safety-critical domains but suffers from sample inefficiency, training instability, and limited interpretability. Data attribution provides a principled way to trace model behavior back to training samples, yet existing methods assume fixed datasets, which is violated in online RL where each experience both updates the policy and shapes future data collection. In this paper, we initiate the study of data attribution for online RL, focusing on the widely used Proximal Policy Optimization (PPO) algorithm. We start by establishing a \emph{local} attribution framework, interpreting model checkpoints with respect to the records in the recent training buffer. We design two target functions, capturing agent action and cumulative return respectively, and measure each record's contribution through gradient similarity between its training loss and these targets. We demonstrate the power of this framework through three concrete applications: diagnosis of learning, temporal analysis of behavior formation, and targeted intervention during training. Leveraging this framework, we further propose an algorithm, iterative influence-based filtering (IIF), for online RL training that iteratively performs experience filtering to refine policy updates. Across standard RL benchmarks (classic control, navigation, locomotion) to RLHF for large language models, IIF reduces sample complexity, speeds up training, and achieves higher returns. Together, these results open a new direction for making online RL more interpretable, efficient, and effective.

machine learning, natural language, reinforcement learning, (18 more...)

arXiv.org Artificial Intelligence

2505.19281

Country: North America > United States > Illinois (0.14)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)
Instructional Material > Online (0.80)

Industry:

Information Technology (0.68)
Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Behavior Injection: Preparing Language Models for Reinforcement Learning

Cen, Zhepeng, Yao, Yihang, Han, William, Liu, Zuxin, Zhao, Ding

arXiv.org Artificial IntelligenceOct-7-2025

Reinforcement learning (RL) has emerged as a powerful post-training technique to incentivize the reasoning ability of large language models (LLMs). However, LLMs can respond very inconsistently to RL finetuning: some show substantial performance gains, while others plateau or even degrade. To understand this divergence, we analyze the per-step influence of the RL objective and identify two key conditions for effective post-training: (1) RL-informative rollout accuracy, and (2) strong data co-influence, which quantifies how much the training data affects performance on other samples. Guided by these insights, we propose behavior injection, a task-agnostic data augmentation scheme applied prior to RL. Behavior injection enriches the supervised finetuning (SFT) data by seeding exploratory and exploitative behaviors, effectively making the model more RL-ready. We evaluate our method across two reasoning benchmarks with multiple base models. The results demonstrate that our theoretically motivated augmentation can significantly increase the performance gain from RL over the pre-RL model.

arxiv preprint arxiv, large language model, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2505.18917

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Add feedback