AITopics

2510.09222

Country: Asia > China (0.46)

Genre: Research Report (0.83)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

arXiv.org Artificial Intelligence, Oct-14-2025

Clip Your Sequences Fairly: Enforcing Length Fairness for Sequence-Level RL

Mao, Hanyi, Xiao, Quanjia, Pang, Lei, Liu, Haixiao

We propose FSPO (Fair Sequence Policy Optimization), a sequence-level reinforcement learning method for LLMs that enforces length-fair clipping on the importance-sampling (IS) weight. We study RL methods with sequence-level IS and identify a mismatch when PPO/GRPO-style clipping is transplanted to sequences: a fixed clip range systematically reweights short vs. long responses, distorting the optimization direction. FSPO introduces a simple remedy: we clip the sequence log-IS ratio with a band that scales as $\sqrt{L}$. Theoretically, we formalize length fairness via a Length Reweighting Error (LRE) and prove that small LRE yields a cosine directional guarantee between the clipped and true updates. Empirically, FSPO flattens clip rates across length bins, stabilizes training, and outperforms baselines across model sizes and evaluation datasets, with the largest gains on the Qwen3-8B-Base model.
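To make the clipping rule concrete, here is a minimal sketch, assuming a symmetric band of half-width c * sqrt(L) around zero in log space. The function name, the constant `c`, and its default value are illustrative assumptions and are not taken from the paper; the abstract states only that the band on the sequence log-IS ratio scales as $\sqrt{L}$.

```python
import math


def clip_sequence_log_ratio(log_ratio: float, length: int, c: float = 0.1) -> float:
    """Clip a sequence-level log-IS ratio to [-c*sqrt(L), +c*sqrt(L)].

    log_ratio: log pi_new(y|x) - log pi_old(y|x), summed over the response tokens.
    length:    number of tokens L in the response.
    c:         hypothetical band constant (an assumption, not from the paper).
    """
    if length <= 0:
        raise ValueError("length must be a positive token count")
    band = c * math.sqrt(length)  # the band widens with sqrt(L), not with L
    return max(-band, min(band, log_ratio))


# A fixed band would clip long responses far more often than short ones;
# here the allowed range grows with response length.
short = clip_sequence_log_ratio(1.0, length=4)     # band is 0.1 * 2  = 0.2
long_ = clip_sequence_log_ratio(1.0, length=100)   # band is 0.1 * 10 = 1.0
```

Under these assumptions, the same raw log-ratio of 1.0 is clipped to 0.2 for a 4-token response but left at 1.0 for a 100-token response.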

large language model, machine learning, reinforcement learning, (16 more...)

2509.09177

Genre:

Research Report (0.66)
Instructional Material > Course Syllabus & Notes (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.69)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.50)

PAC-Bayesian Reinforcement Learning Trains Generalizable Policies

Zitouni, Abdelkrim, Hennequin, Mehdi, Agoun, Juba, Horache, Ryan, Kabachi, Nadia, Rivasplata, Omar

We derive a novel PAC-Bayesian generalization bound for reinforcement learning that explicitly accounts for Markov dependencies in the data, through the chain's mixing time. This contributes to overcoming challenges in obtaining generalization guarantees for reinforcement learning, where the sequential nature of data breaks the independence assumptions underlying classical bounds. Our bound provides non-vacuous certificates for modern off-policy algorithms like Soft Actor-Critic. We demonstrate the bound's practical utility through PB-SAC, a novel algorithm that optimizes the bound during training to guide exploration. Experiments across continuous control tasks show that our approach provides meaningful confidence certificates while maintaining competitive performance.

artificial intelligence, machine learning, reinforcement learning, (16 more...)

2510.10544

Country: Europe > France (0.14)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (1.00)

Experience-Efficient Model-Free Deep Reinforcement Learning Using Pre-Training

Yang, Ruoxing

We introduce PPOPT (Proximal Policy Optimization using Pretraining), a novel, model-free deep-reinforcement-learning algorithm that leverages pretraining to achieve high training efficiency and stability on very small training samples in physics-based environments. Reinforcement learning agents typically rely on large samples of environment interactions to learn a policy. However, frequent interactions with a (computer-simulated) environment may incur high computational costs, especially when the environment is complex. Our main innovation is a new policy neural network architecture that consists of a pretrained neural network middle section sandwiched between two fully-connected networks. Pretraining part of the network on a different environment with similar physics will help the agent learn the target environment with high efficiency because it will leverage a general understanding of the transferable physics characteristics from the pretraining environment. We demonstrate that PPOPT outperforms baseline classic PPO on small training samples both in terms of rewards gained and general training stability. While PPOPT underperforms against classic model-based methods such as DYNA DDPG, the model-free nature of PPOPT allows it to train in significantly less time than its model-based counterparts. Finally, we present our implementation of PPOPT as open-source software, available at github.com/Davidrxyang/PPOPT.

artificial intelligence, machine learning, reinforcement learning, (13 more...)

2510.10029

Genre: Research Report > Promising Solution (0.34)

Industry: Education (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Syed, Shahbaz P Qadri, Bai, He

Structured Cooperative Multi-Agent Reinforcement Learning: a Bayesian Network Perspective

The empirical success of multi-agent reinforcement learning (MARL) has motivated the search for more efficient and scalable algorithms for large-scale multi-agent systems. However, existing state-of-the-art algorithms do not fully exploit inter-agent coupling information. In this paper, we propose a systematic approach to leverage structures in the inter-agent couplings for efficient model-free reinforcement learning. We model the cooperative MARL problem via a Bayesian network and characterize the subset of agents, termed the value dependency set, whose information is required by each agent to estimate its local action value function exactly. Moreover, we propose a partially decentralized training decentralized execution (P-DTDE) paradigm based on the value dependency set. We theoretically establish that the total variance of our P-DTDE policy gradient estimator is less than that of the centralized training decentralized execution (CTDE) policy gradient estimator. We derive a multi-agent policy gradient theorem based on the P-DTDE scheme and develop a scalable actor-critic algorithm. We demonstrate the efficiency and scalability of the proposed algorithm on multi-warehouse resource allocation and multi-zone temperature control examples. For dense value dependency sets, we propose an approximation scheme based on truncation of the Bayesian network and empirically show that it achieves faster convergence than the exact value dependency set for applications with a large number of agents.

artificial intelligence, machine learning, reinforcement learning, (15 more...)

2510.09937

Genre: Research Report (0.81)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents > Agent Societies (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.80)

Khadka, Sudip, Paudel, L. S.

A Multi-Component Reward Function with Policy Gradient for Automated Feature Selection with Dynamic Regularization and Bias Mitigation

Static feature exclusion strategies often fail to prevent bias when hidden dependencies influence the model predictions. To address this issue, we explore a reinforcement learning (RL) framework that integrates bias mitigation and automated feature selection within a single learning process. Unlike traditional heuristic-driven filter or wrapper approaches, our RL agent adaptively selects features using a reward signal that explicitly integrates predictive performance with fairness considerations. This dynamic formulation allows the model to balance generalization, accuracy, and equity throughout the training process, rather than rely exclusively on pre-processing adjustments or post hoc correction mechanisms. In this paper, we describe the construction of a multi-component reward function, the specification of the agent's action space over feature subsets, and the integration of this system with ensemble learning. We aim to provide a flexible and generalizable way to select features in environments where predictors are correlated and biases can inadvertently re-emerge.

feature selection, machine learning, reinforcement learning, (16 more...)

2510.09705

Country: North America > United States (0.68)

Genre: Research Report > New Finding (1.00)

Industry:

Law (1.00)
Health & Medicine (1.00)
Banking & Finance > Credit (1.00)
Government > Regional Government > North America Government > United States Government (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.70)

Spatiotemporal Forecasting as Planning: A Model-Based Reinforcement Learning Approach with Generative World Models

Wu, Hao, Gao, Yuan, Shi, Xingjian, Li, Shuaipeng, Xu, Fan, Zhang, Fan, Zhu, Zhihong, Wang, Weiyan, Luo, Xiao, Wang, Kun, Wu, Xian, Huang, Xiaomeng

To address the dual challenges of inherent stochasticity and non-differentiable metrics in physical spatiotemporal forecasting, we propose Spatiotemporal Forecasting as Planning (SFP), a new paradigm grounded in Model-Based Reinforcement Learning. SFP constructs a novel Generative World Model to simulate diverse, high-fidelity future states, enabling an "imagination-based" environmental simulation. Within this framework, a base forecasting model acts as an agent, guided by a beam search-based planning algorithm that leverages non-differentiable domain metrics as reward signals to explore high-return future sequences. These identified high-reward candidates then serve as pseudo-labels to continuously optimize the agent's policy through iterative self-training, significantly reducing prediction error and demonstrating exceptional performance on critical domain metrics like capturing extreme events.
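The planning step described above can be illustrated with a generic beam search over imagined futures. This is a sketch under stated assumptions rather than the authors' algorithm: `propose` stands in for sampling candidate next states from a world model, `reward` stands in for a non-differentiable domain metric, and both names, along with `beam_search_plan`, are hypothetical.

```python
from typing import Callable, List, Tuple, TypeVar

S = TypeVar("S")


def beam_search_plan(
    initial_state: S,
    propose: Callable[[S], List[S]],   # hypothetical: world-model candidates
    reward: Callable[[S], float],      # hypothetical: non-differentiable metric
    horizon: int,
    beam_width: int,
) -> Tuple[float, List[S]]:
    """Return the highest-scoring (cumulative reward, trajectory) found."""
    beams: List[Tuple[float, List[S]]] = [(0.0, [initial_state])]
    for _ in range(horizon):
        candidates: List[Tuple[float, List[S]]] = []
        for score, trajectory in beams:
            for nxt in propose(trajectory[-1]):
                candidates.append((score + reward(nxt), trajectory + [nxt]))
        if not candidates:
            break  # the world model proposed nothing; stop early
        # Keep only the beam_width best partial trajectories.
        candidates.sort(key=lambda item: item[0], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]


# Toy example: states are integers, the "metric" rewards being close to 3.
best_score, best_trajectory = beam_search_plan(
    initial_state=0,
    propose=lambda s: [s + 1, s - 1],
    reward=lambda s: -abs(s - 3),
    horizon=3,
    beam_width=2,
)
```

In a self-training loop of the kind the abstract describes, the returned high-reward trajectory would play the role of a pseudo-label for the forecasting model; how the paper constructs and uses those labels is not specified here.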

machine learning, reinforcement learning, world model, (16 more...)

2510.0402

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Behjati, Mehran, Nordin, Rosdiadee, Abdullah, Nor Fadzilah

Maximizing UAV Cellular Connectivity with Reinforcement Learning for BVLoS Path Planning

This paper presents a reinforcement learning (RL) based approach for path planning of cellular-connected unmanned aerial vehicles (UAVs) operating beyond visual line of sight (BVLoS). The objective is to minimize travel distance while maximizing the quality of cellular link connectivity by considering real-world aerial coverage constraints and employing an empirical aerial channel model. The proposed solution employs RL techniques to train an agent, using the quality of communication links between the UAV and base stations (BSs) as the reward function. Simulation results demonstrate the effectiveness of the proposed method in training the agent and generating feasible UAV path plans. The proposed approach addresses the challenges due to limitations in UAV cellular communications, highlighting the need for investigations and considerations in this area. The RL algorithm efficiently identifies optimal paths, maintaining maximum connectivity with ground BSs for safe and reliable BVLoS flight operation. Moreover, the solution can be deployed as an offline path planning module that can be integrated into future ground control systems (GCS) for UAV operations, enhancing their capabilities and safety. The method holds potential for complex long-range UAV applications, advancing the technology in the field of cellular-connected UAV path planning.

artificial intelligence, machine learning, reinforcement learning, (16 more...)

2509.13336

Genre: Research Report > New Finding (0.49)

Industry:

Telecommunications (1.00)
Transportation > Air (0.34)

Technology:

Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles > Drones (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Planning & Scheduling (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Taghavi, Mazyar, Farnoosh, Rahman

Latent Variable Modeling in Multi-Agent Reinforcement Learning via Expectation-Maximization for UAV-Based Wildlife Protection

Introduction: The Iranian leopard (Panthera pardus tulliana), a subspecies of the Persian leopard, is critically endangered due to illegal poaching, habitat fragmentation, and human-wildlife conflict. Conservation efforts are increasingly turning to technology for innovative monitoring and intervention methods.

Metric                             10 Agents   20 Agents   50 Agents
Training Time (hrs)                5.2         6.3         8.0
Memory Usage (GB)                  4.5         5.1         6.8
CPU Utilization (%)                65          75          85
GPU Utilization (%)                45          55          70
Training Time Increase (%)         -           20          53
Memory Usage Increase (%)          -           15          51

Table 4. Percentage of High-Risk Zones Covered by Each Method (Mean ± std)

Figure 3. Poacher Detection Rate Across Episodes.

Higher Entropy Indicates More Diverse Exploration

Table 5. KL Divergence between Inferred q(z) and Ground Truth Task Distribution

The EM-based policy exhibits an initially high entropy, encouraging diverse action sampling, and gradually anneals as the policy becomes confident.

Metric                         Cooperative Coverage   Poacher Detection Coordination   Conflict Avoidance
Number of Agents Involved      6                      8                                10
Coverage Efficiency (%)        85.3                   -                                -
Poacher Detection Rate (%)     -                      92.1                             -
Collision Incidents            0                      0                                0

It enables conservationists and security forces to allocate limited resources more effectively and act in real time based on actionable intelligence derived from autonomous agents.

artificial intelligence, machine learning, reinforcement learning, (13 more...)

2509.02579

Country: North America > United States (0.24)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.85)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.69)

Search-Based Credit Assignment for Offline Preference-Based Reinforcement Learning

Gao, Xiancheng, Shi, Yufeng, Zhou, Wengang, Li, Houqiang

Offline reinforcement learning refers to the process of learning policies from fixed datasets, without requiring additional environment interaction. However, it often relies on well-defined reward functions, which are difficult and expensive to design. Human feedback is an appealing alternative, but its two common forms, expert demonstrations and preferences, have complementary limitations. Demonstrations provide stepwise supervision, but they are costly to collect and often reflect limited expert behavior modes. In contrast, preferences are easier to collect, but it is unclear which parts of a behavior contribute most to a trajectory segment, leaving credit assignment unresolved. In this paper, we introduce a Search-Based Preference Weighting (SPW) scheme to unify these two feedback sources. For each transition in a preference-labeled trajectory, SPW searches for the most similar state-action pairs from expert demonstrations and directly derives stepwise importance weights based on their similarity scores. These weights are then used to guide standard preference learning, enabling more accurate credit assignment that traditional approaches struggle to achieve. We demonstrate that SPW enables effective joint learning from preferences and demonstrations, outperforming prior methods that leverage both feedback types on challenging robot manipulation tasks.
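The search-and-weight idea can be sketched as a nearest-neighbor lookup. Everything specific here is an assumption made for illustration: state-action pairs are flat vectors, similarity is exp(-Euclidean distance) to the single nearest demonstration pair, and weights are normalized to sum to one. The abstract does not state which similarity measure or normalization SPW uses, and the name `stepwise_weights` is hypothetical.

```python
import math
from typing import List, Sequence


def stepwise_weights(
    segment: Sequence[Sequence[float]],
    demos: Sequence[Sequence[float]],
) -> List[float]:
    """Weight each transition by its similarity to the nearest expert pair.

    segment: state-action vectors from one preference-labeled trajectory segment.
    demos:   state-action vectors pooled from expert demonstrations.
    """
    if not segment or not demos:
        raise ValueError("segment and demos must be non-empty")
    similarities = []
    for pair in segment:
        nearest = min(math.dist(pair, d) for d in demos)  # search step
        similarities.append(math.exp(-nearest))           # assumed similarity score
    total = sum(similarities)
    if total == 0.0:
        # Every transition is extremely far from all demos: fall back to uniform.
        return [1.0 / len(segment)] * len(segment)
    return [s / total for s in similarities]


# The first transition matches a demonstration exactly; the second is far away.
weights = stepwise_weights(
    segment=[[0.0, 0.0], [3.0, 4.0]],
    demos=[[0.0, 0.0], [10.0, 10.0]],
)
```

In this toy case the first transition receives almost all of the weight, which is the qualitative behavior the abstract describes: transitions that resemble expert behavior get more credit when the preference loss is computed.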

artificial intelligence, machine learning, reinforcement learning, (15 more...)

2508.15327

Country: Asia > China (0.14)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)