AITopics

2507.21385

Genre: Research Report > New Finding (0.34)

Industry:

Telecommunications (0.94)
Energy (0.68)
Consumer Products & Services > Travel (0.54)
Information Technology > Security & Privacy (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

arXiv.org Machine LearningJul-29-2025

PERRY: Policy Evaluation with Confidence Intervals using Auxiliary Data

Mandyam, Aishwarya, Meng, Jason, Gao, Ge, Sun, Jiankai, Schwager, Mac, Engelhardt, Barbara E., Brunskill, Emma

Off-policy evaluation (OPE) methods aim to estimate the value of a new reinforcement learning (RL) policy prior to deployment. Recent advances have shown that leveraging auxiliary datasets, such as those synthesized by generative models, can improve the accuracy of these value estimates. Unfortunately, such auxiliary datasets may also be biased, and existing methods for using data augmentation for OPE in RL lack principled uncertainty quantification. In high stakes settings like healthcare, reliable uncertainty estimates are important for comparing policy value estimates. In this work, we propose two approaches to construct valid confidence intervals for OPE when using data augmentation. The first provides a confidence interval over the policy performance conditioned on a particular initial state $V^π(s_0)$-- such intervals are particularly important for human-centered applications. To do so we introduce a new conformal prediction method for high dimensional state MDPs. Second, we consider the more common task of estimating the average policy performance over many initial states; to do so we draw on ideas from doubly robust estimation and prediction powered inference. Across simulators spanning robotics, healthcare and inventory management, and a real healthcare dataset from MIMIC-IV, we find that our methods can use augmented data and still consistently produce intervals that cover the ground truth values, unlike previously proposed methods.

artificial intelligence, machine learning, reinforcement learning, (19 more...)

arXiv.org Machine Learning

2507.20068

Country:

North America > United States > California > Santa Clara County > Palo Alto (0.05)
North America > United States > Massachusetts > Suffolk County > Boston (0.04)
Asia > Middle East > Jordan (0.04)
Asia > Middle East > Israel (0.04)

Genre: Research Report (0.83)

Industry: Health & Medicine (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.34)

Javed, Khurram, Sutton, Richard S.

Swift-Sarsa: Fast and Robust Linear Control

arXiv.org Machine LearningJul-29-2025

Javed, Sharifnassab, and Sutton (2024) introduced a new algorithm for TD learning -- SwiftTD -- that augments True Online TD($λ$) with step-size optimization, a bound on the effective learning rate, and step-size decay. In their experiments SwiftTD outperformed True Online TD($λ$) and TD($λ$) on a variety of prediction tasks derived from Atari games, and its performance was robust to the choice of hyper-parameters. In this extended abstract we extend SwiftTD to work for control problems. We combine the key ideas behind SwiftTD with True Online Sarsa($λ$) to develop an on-policy reinforcement learning algorithm called $\textit{Swift-Sarsa}$. We propose a simple benchmark for linear on-policy control called the $\textit{operant conditioning benchmark}$. The key challenge in the operant conditioning benchmark is that a very small subset of input signals are relevant for decision making. The majority of the signals are noise sampled from a non-stationary distribution. To learn effectively, the agent must learn to differentiate between the relevant signals and the noisy signals, and minimize prediction errors by assigning credit to the weight parameters associated with the relevant signals. Swift-Sarsa, when applied to the operant conditioning benchmark, learned to assign credit to the relevant signals without any prior knowledge of the structure of the problem. It opens the door for solution methods that learn representations by searching over hundreds of millions of features in parallel without performance degradation due to noisy or bad features.

machine learning, reinforcement learning, swift-sarsa, (17 more...)

arXiv.org Machine Learning

2507.19539

Country: North America > Canada > Alberta (0.15)

Genre: Research Report (0.40)

Industry: Leisure & Entertainment > Games > Computer Games (0.55)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Tiwari, Saket, Gottesman, Omer, Konidaris, George

Geometry of Neural Reinforcement Learning in Continuous State and Action Spaces

Advances in reinforcement learning (RL) have led to its successful application in complex tasks with continuous state and action spaces. Despite these advances in practice, most theoretical work pertains to finite state and action spaces. We propose building a theoretical understanding of continuous state and action spaces by employing a geometric lens to understand the locally attained set of states. The set of all parametrised policies learnt through a semi-gradient based approach induces a set of attainable states in RL. We show that the training dynamics of a two-layer neural policy induce a low dimensional manifold of attainable states embedded in the high-dimensional nominal state space trained using an actor-critic algorithm. We prove that, under certain conditions, the dimensionality of this manifold is of the order of the dimensionality of the action space. This is the first result of its kind, linking the geometry of the state space to the dimensionality of the action space. We empirically corroborate this upper bound for four MuJoCo environments and also demonstrate the results in a toy environment with varying dimensionality. We also show the applicability of this theoretical result by introducing a local manifold learning layer to the policy and value function networks to improve the performance in control environments with very high degrees of freedom by changing one layer of the neural network to learn sparse representations.

artificial intelligence, machine learning, reinforcement learning, (17 more...)

2507.20853

Country:

Europe (0.46)
North America > United States (0.28)

Genre: Research Report (0.50)

Industry: Leisure & Entertainment > Games (0.45)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Group Sequence Policy Optimization

Zheng, Chujie, Liu, Shixuan, Li, Mingze, Chen, Xiong-Hui, Yu, Bowen, Gao, Chang, Dang, Kai, Liu, Yuqiong, Men, Rui, Yang, An, Zhou, Jingren, Lin, Junyang

This paper introduces Group Sequence Policy Optimization (GSPO), our stable, efficient, and performant reinforcement learning algorithm for training large language models. Unlike previous algorithms that adopt token-level importance ratios, GSPO defines the importance ratio based on sequence likelihood and performs sequence-level clipping, rewarding, and optimization. We demonstrate that GSPO achieves superior training efficiency and performance compared to the GRPO algorithm, notably stabilizes Mixture-of-Experts (MoE) RL training, and has the potential for simplifying the design of RL infrastructure. These merits of GSPO have contributed to the remarkable improvements in the latest Qwen3 models.

large language model, machine learning, reinforcement learning, (14 more...)

2507.18071

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.70)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)

GUI-G$^2$: Gaussian Reward Modeling for GUI Grounding

Tang, Fei, Gu, Zhangxuan, Lu, Zhengxi, Liu, Xuyang, Shen, Shuheng, Meng, Changhua, Wang, Wen, Zhang, Wenqi, Shen, Yongliang, Lu, Weiming, Xiao, Jun, Zhuang, Yueting

Graphical User Interface (GUI) grounding maps natural language instructions to precise interface locations for autonomous interaction. Current reinforcement learning approaches use binary rewards that treat elements as hit-or-miss targets, creating sparse signals that ignore the continuous nature of spatial interactions. Motivated by human clicking behavior that naturally forms Gaussian distributions centered on target elements, we introduce GUI Gaussian Grounding Rewards (GUI-G$^2$), a principled reward framework that models GUI elements as continuous Gaussian distributions across the interface plane. GUI-G$^2$ incorporates two synergistic mechanisms: Gaussian point rewards model precise localization through exponentially decaying distributions centered on element centroids, while coverage rewards assess spatial alignment by measuring the overlap between predicted Gaussian distributions and target regions. To handle diverse element scales, we develop an adaptive variance mechanism that calibrates reward distributions based on element dimensions. This framework transforms GUI grounding from sparse binary classification to dense continuous optimization, where Gaussian distributions generate rich gradient signals that guide models toward optimal interaction positions. Extensive experiments across ScreenSpot, ScreenSpot-v2, and ScreenSpot-Pro benchmarks demonstrate that GUI-G$^2$, substantially outperforms state-of-the-art method UI-TARS-72B, with the most significant improvement of 24.7% on ScreenSpot-Pro. Our analysis reveals that continuous modeling provides superior robustness to interface variations and enhanced generalization to unseen layouts, establishing a new paradigm for spatial reasoning in GUI interaction tasks.

large language model, machine learning, reinforcement learning, (20 more...)

2507.15846

Genre: Research Report > Promising Solution (0.48)

Technology:

Information Technology > Graphics (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(2 more...)

Viquerat, Jonathan, Garnier, Paul, Bateni, Amirhossein, Hachem, Elie

Dragonfly: a modular deep reinforcement learning library

Dragonfly is a deep reinforcement learning library focused on modularity, in order to ease experimentation and developments. It relies on a json serialization that allows to swap building blocks and perform parameter sweep, while minimizing code maintenance. Some of its features are specifically designed for CPU-intensive environments, such as numerical simulations. Its performance on standard agents using common benchmarks compares favorably with the literature.

artificial intelligence, machine learning, reinforcement learning, (19 more...)

2505.03778

Country: Europe (0.15)

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Zhang, Yuyou, Corcodel, Radu, Zhao, Ding

Bipedalism for Quadrupedal Robots: Versatile Loco-Manipulation through Risk-Adaptive Reinforcement Learning

Loco-manipulation of quadrupedal robots has broadened robotic applications, but using legs as manipulators often compromises locomotion, while mounting arms complicates the system. To mitigate this issue, we introduce bipedalism for quadrupedal robots, thus freeing the front legs for versatile interactions with the environment. We propose a risk-adaptive distributional Reinforcement Learning (RL) framework designed for quadrupedal robots walking on their hind legs, balancing worst-case conservativeness with optimal performance in this inherently unstable task. During training, the adaptive risk preference is dynamically adjusted based on the uncertainty of the return, measured by the coefficient of variation of the estimated return distribution. Extensive experiments in simulation show our method's superior performance over baselines. Real-world deployment on a Unitree Go2 robot further demonstrates the versatility of our policy, enabling tasks like cart pushing, obstacle probing, and payload transport, while showcasing robustness against challenging dynamics and external disturbances.

artificial intelligence, machine learning, reinforcement learning, (15 more...)

2507.20382

Country: North America > United States (0.28)

Genre: Research Report (0.40)

Industry: Energy (0.68)

Technology:

Information Technology > Artificial Intelligence > Robots > Locomotion (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.86)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)

Learning from Expert Factors: Trajectory-level Reward Shaping for Formulaic Alpha Mining

Zhao, Junjie, Zhang, Chengxi, Wang, Chenkai, Yang, Peng

Reinforcement learning (RL) has successfully automated the complex process of mining formulaic alpha factors, for creating interpretable and profitable investment strategies. However, existing methods are hampered by the sparse rewards given the underlying Markov Decision Process. This inefficiency limits the exploration of the vast symbolic search space and destabilizes the training process. To address this, Trajectory-level Reward Shaping (TLRS), a novel reward shaping method, is proposed. TLRS provides dense, intermediate rewards by measuring the subsequence-level similarity between partially generated expressions and a set of expert-designed formulas. Furthermore, a reward centering mechanism is introduced to reduce training variance. Extensive experiments on six major Chinese and U.S. stock indices show that TLRS significantly improves the predictive power of mined factors, boosting the Rank Information Coefficient by 9.29% over existing potential-based shaping algorithms. Notably, TLRS achieves a major leap in computational efficiency by reducing its time complexity with respect to the feature dimension from linear to constant, which is a significant improvement over distance-based baselines.

artificial intelligence, machine learning, reinforcement learning, (18 more...)

2507.20263

Country:

North America > United States (0.46)
Asia > China (0.28)

Genre: Research Report (1.00)

Industry: Banking & Finance > Trading (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

The Policy Cliff: A Theoretical Analysis of Reward-Policy Maps in Large Language Models

Xu, Xingcheng

Reinforcement learning (RL) plays a crucial role in shaping the behavior of large language and reasoning models (LLMs/LRMs). However, it often produces brittle and unstable policies, leading to critical failures such as spurious reasoning, deceptive alignment, and instruction disobedience that undermine the trustworthiness and safety of LLMs/LRMs. Currently, these issues lack a unified theoretical explanation and are typically addressed using ad-hoc heuristics. This paper presents a rigorous mathematical framework for analyzing the stability of the mapping from a reward function to the optimal policy. We show that policy brittleness often stems from non-unique optimal actions, a common occurrence when multiple valid traces exist in a reasoning task. This theoretical lens provides a unified explanation for a range of seemingly disparate failures, reframing them as rational outcomes of optimizing rewards that may be incomplete or noisy, especially in the presence of action degeneracy. We extend this analysis from the fundamental single-reward setting to the more realistic multi-reward RL across diverse domains, showing how stability is governed by an "effective reward" aggregation mechanism. We also prove that entropy regularization restores policy stability at the cost of increased stochasticity. Our framework provides a unified explanation for recent empirical findings on deceptive reasoning, instruction-following trade-offs, and RLHF-induced sophistry, and is further validated through perturbation experiments in multi-reward RL. This work advances policy-stability analysis from empirical heuristics towards a principled theory, offering essential insights for designing safer and more trustworthy AI systems.

large language model, machine learning, reinforcement learning, (19 more...)

2507.2015

Genre:

Research Report > New Finding (0.67)
Research Report > Experimental Study (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)