AITopics

While reinforcement learning from scratch has shown impressive results in solving sequential decision-making tasks with efficient simulators, real-world applications with expensive interactions require more sample-efficient agents. Foundation models (FMs) are natural candidates to improve sample efficiency as they possess broad knowledge and reasoning capabilities, but it is not yet clear how to effectively integrate them into the reinforcement learning framework. In this paper, we anticipate and, most importantly, evaluate two promising strategies. First, we consider the use of foundation world models (FWMs) that exploit the prior knowledge of FMs to enable training and evaluating agents with simulated interactions. Second, we consider the use of foundation agents (FAs) that exploit the reasoning capabilities of FMs for decision-making. We evaluate both approaches empirically in a family of grid-world environments that are suitable for the current generation of large language models (LLMs). Our results suggest that improvements in LLMs already translate into better FWMs and FAs; that FAs based on current LLMs can already provide excellent policies for sufficiently simple environments; and that the coupling of FWMs and reinforcement learning agents is highly promising for more complex settings with partial observability and stochastic elements.

large language model, machine learning, reinforcement learning, (18 more...)

2509.15915

Country: Europe (0.46)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Qin, Chuhao, Narayanan, Arun, Pournaras, Evangelos

Coordinated Multi-Drone Last-mile Delivery: Learning Strategies for Energy-aware and Timely Operations

Drones have recently emerged as a faster, safer, and cost-efficient way for last-mile deliveries of parcels, particularly for urgent medical deliveries highlighted during the pandemic. This paper addresses a new challenge of multi-parcel delivery with a swarm of energy-aware drones, accounting for time-sensitive customer requirements. Each drone plans an optimal multi-parcel route within its battery-restricted flight range to minimize delivery delays and reduce energy consumption. The problem is tackled by decomposing it into three sub-problems: (1) optimizing depot locations and service areas using K-means clustering; (2) determining the optimal flight range for drones through reinforcement learning; and (3) planning and selecting multi-parcel delivery routes via a new optimized plan selection approach. To integrate these solutions and enhance long-term efficiency, we propose a novel algorithm leveraging actor-critic-based multi-agent deep reinforcement learning. Extensive experimentation using realistic delivery datasets demonstrates the exceptional performance of the proposed algorithm. We provide new insights into economic efficiency (minimize energy consumption), rapid operations (reduce delivery delays and overall execution time), and strategic guidance on depot deployment for practical logistics applications.

Unmanned aerial vehicles (UAVs), commonly known as drones, have gained significant attention as a solution for last-mile delivery, especially in recent years [1]. For instance, the COVID-19 pandemic has highlighted the vulnerabilities of traditional delivery methods, as deliverymen risk spreading the virus. This was particularly problematic in quarantine zones, where customers faced difficulties in accessing logistics services [2], [3]. In contrast, drones offer a safer and more flexible alternative. Due to their high mobility, carrying capacity, and accurate GPS navigation, drones are able to deliver parcels directly to small places such as doorways and balconies, avoiding human contact and traffic congestion.
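
The first sub-problem named in the abstract, choosing depot locations and service areas with K-means clustering, can be illustrated with a minimal sketch. The customer coordinates, the number of depots, and the function name `kmeans_depots` are hypothetical and not taken from the paper; the code is plain Lloyd's algorithm in standard-library Python, not the authors' implementation.

```python
import random

def kmeans_depots(customers, k, iters=50, seed=0):
    """Plain Lloyd's algorithm: returns (depots, assignment).

    customers: list of (x, y) tuples (hypothetical delivery locations).
    depots: k centroids; assignment[i] is the depot index serving customer i.
    """
    rng = random.Random(seed)
    depots = rng.sample(customers, k)  # initialise depots at k sampled customers
    assignment = [0] * len(customers)
    for _ in range(iters):
        # Assignment step: each customer goes to the nearest depot.
        new_assignment = [
            min(range(k),
                key=lambda j: (x - depots[j][0]) ** 2 + (y - depots[j][1]) ** 2)
            for x, y in customers
        ]
        # Update step: move each depot to the mean of its service area.
        for j in range(k):
            members = [c for c, a in zip(customers, new_assignment) if a == j]
            if members:  # keep the old depot if its service area is empty
                depots[j] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
        if new_assignment == assignment:
            break  # assignments are stable, so the depots no longer move
        assignment = new_assignment
    return depots, assignment
```

Each resulting cluster plays the role of one depot's service area; the later sub-problems (flight range and route selection) are not covered by this sketch.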

artificial intelligence, machine learning, reinforcement learning, (18 more...)

2509.15830

Country: Europe > United Kingdom (0.67)

Genre: Research Report (1.00)

Industry:

Transportation > Freight & Logistics Services (1.00)
Health & Medicine (1.00)
Energy (1.00)

Technology:

Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles > Drones (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Learning to Optimize Capacity Planning in Semiconductor Manufacturing

Andelfinger, Philipp, Bi, Jieyi, Zhu, Qiuyu, Zhou, Jianan, Zhang, Bo, Zhang, Fei Fei, Chan, Chew Wye, Gan, Boon Ping, Cai, Wentong, Zhang, Jie

In manufacturing, capacity planning is the process of allocating production resources in accordance with variable demand. The current industry practice in semiconductor manufacturing typically applies heuristic rules to prioritize actions, such as future change lists that account for incoming machine and recipe dedications. However, while offering interpretability, heuristics cannot easily account for the complex interactions along the process flow that can gradually lead to the formation of bottlenecks. Here, we present a neural network-based model for capacity planning on the level of individual machines, trained using deep reinforcement learning. By representing the policy using a heterogeneous graph neural network, the model directly captures the diverse relationships among machines and processing steps, allowing for proactive decision-making. We describe several measures taken to achieve sufficient scalability to tackle the vast space of possible machine-level actions. Our evaluation results cover Intel's small-scale Minifab model and preliminary experiments using the popular SMT2020 testbed. In the largest tested scenario, our trained policy increases throughput and decreases cycle time by about 1.8% each.

artificial intelligence, machine learning, reinforcement learning, (15 more...)

2509.15767

Country: Asia > Singapore (0.15)

Genre: Research Report (0.82)

Industry:

Semiconductors & Electronics (1.00)
Information Technology > Hardware (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.88)

CCrepairBench: A High-Fidelity Benchmark and Reinforcement Learning Framework for C++ Compilation Repair

Sun, Weixuan, Zhai, Jucai, Liu, Dengfeng, Zhang, Xin, Wu, Xiaojun, Hao, Qiaobo, AIMgroup, Fang, Yang, Tang, Jiuyang

The automated repair of C++ compilation errors presents a significant challenge, the resolution of which is critical for developer productivity. Progress in this domain is constrained by two primary factors: the scarcity of large-scale, high-fidelity datasets and the limitations of conventional supervised methods, which often fail to generate semantically correct patches. This paper addresses these gaps by introducing a comprehensive framework with three core contributions. First, we present CCrepair, a novel, large-scale C++ compilation error dataset constructed through a sophisticated generate-and-verify pipeline. Second, we propose a Reinforcement Learning (RL) paradigm guided by a hybrid reward signal, shifting the focus from mere compilability to the semantic quality of the fix. Finally, we establish the robust, two-stage evaluation system providing this signal, centered on an LLM-as-a-Judge whose reliability has been rigorously validated against the collective judgments of a panel of human experts. This integrated approach aligns the training objective with generating high-quality, non-trivial patches that are both syntactically and semantically correct. The effectiveness of our approach was demonstrated experimentally. Our RL-trained Qwen2.5-1.5B-Instruct model achieved performance comparable to a Qwen2.5-14B-Instruct model, validating the efficiency of our training paradigm. Our work provides the research community with a valuable new dataset and a more effective paradigm for training and evaluating robust compilation repair models, paving the way for more practical and reliable automated programming assistants.

large language model, machine learning, reinforcement learning, (19 more...)

2509.15690

Country: Asia > China (0.14)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.87)

Suzuki, Kyohei, Slavakis, Konstantinos

Nonconvex Regularization for Feature Selection in Reinforcement Learning

The primary objective of RL is for an agent to learn an optimal policy to control a system by minimizing a long-term loss, represented by the Q-function. This learning occurs through interactions with the environment, which is typically modeled as a Markov decision process (MDP). In most high-dimensional, real-world problems, explicitly representing the Q-function for all possible states and actions is impractical due to the "curse of dimensionality." A common solution is to approximate the Q-function using a parametric (functional) representation. This, however, introduces a fundamental trade-off between approximation accuracy and computational complexity: reducing the approximation error generally requires a large number of features in the parametric model, which in turn increases computational demands. Feature selection, achieved via a sparse representation over a large basis of functions, is an effective way to alleviate this trade-off, mitigate overfitting, and improve sample efficiency.
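
To make the idea of feature selection through a sparse representation concrete, the sketch below fits a linear Q-function approximation, Q(s, a) ≈ φ(s, a)·w, with a sparsity-inducing penalty. It deliberately uses the convex L1 penalty with proximal gradient descent (ISTA) as a stand-in; the nonconvex regularizer that the title refers to is not reproduced here. The function names, the regression targets, and the step size are all assumptions made for illustration.

```python
def soft_threshold(v, t):
    """Proximal operator of t * |v|: shrinks v toward zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def sparse_q_weights(features, targets, lam=0.1, step=0.1, iters=500):
    """Proximal gradient (ISTA) for
        (1 / 2n) * sum_i (phi_i . w - q_i)^2 + lam * ||w||_1.

    features: list of feature vectors phi(s, a) (hypothetical basis functions).
    targets:  Q-value estimates q_i for the same state-action pairs.
    """
    n, d = len(features), len(features[0])
    w = [0.0] * d
    for _ in range(iters):
        # Gradient of the squared-error term.
        grad = [0.0] * d
        for phi, q in zip(features, targets):
            err = sum(p * wj for p, wj in zip(phi, w)) - q
            for j in range(d):
                grad[j] += err * phi[j] / n
        # Gradient step, then soft-thresholding, which zeroes weak features.
        w = [soft_threshold(wj - step * gj, step * lam) for wj, gj in zip(w, grad)]
    return w
```

Weights that end at exactly zero correspond to features dropped from the representation, which is how the penalty performs selection rather than mere shrinkage.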

artificial intelligence, machine learning, reinforcement learning, (16 more...)

2509.15652

Country: Asia > Japan (0.46)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.34)

Aref, Zahra, Mandayam, Narayan B.

Mental Accounts for Actions: EWA-Inspired Attention in Decision Transformers

Transformers have emerged as a compelling architecture for sequential decision-making by modeling trajectories via self-attention. In reinforcement learning (RL), they enable return-conditioned control without relying on value function approximation. Decision Transformers (DTs) exploit this by casting RL as supervised sequence modeling, but they are restricted to offline data and lack exploration. Online Decision Transformers (ODTs) address this limitation through entropy-regularized training on on-policy rollouts, offering a stable alternative to traditional RL methods like Soft Actor-Critic, which depend on bootstrapped targets and reward shaping. Despite these advantages, ODTs use standard attention, which lacks explicit memory of action-specific outcomes. This leads to inefficiencies in learning long-term action effectiveness. Inspired by cognitive models such as Experience-Weighted Attraction (EWA), we propose Experience-Weighted Attraction with Vector Quantization for Online Decision Transformers (EWA-VQ-ODT), a lightweight module that maintains per-action mental accounts summarizing recent successes and failures. Continuous actions are routed via direct grid lookup to a compact vector-quantized codebook, where each code stores a scalar attraction updated online through decay and reward-based reinforcement. These attractions modulate attention by biasing the columns associated with action tokens, requiring no change to the backbone or training objective. On standard continuous-control benchmarks, EWA-VQ-ODT improves sample efficiency and average return over ODT, particularly in early training. The module is computationally efficient, interpretable via per-code traces, and supported by theoretical guarantees that bound the attraction dynamics and its impact on attention drift.
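
The mechanism described above (grid lookup into a codebook, a scalar attraction per code updated by decay and reward, and a bias on the attention columns of action tokens) can be sketched roughly as follows. Everything specific here is an assumption for illustration: the 1-D action range, the uniform grid, the exact update rule, the `beta` scale, and the names `AttractionTable` and `biased_attention`. It is not the authors' implementation.

```python
import math

class AttractionTable:
    """Per-code scalar attractions over a uniform 1-D action grid (hypothetical layout)."""

    def __init__(self, low, high, n_codes, decay=0.9):
        self.low, self.high = low, high
        self.n_codes, self.decay = n_codes, decay
        self.attraction = [0.0] * n_codes

    def code_of(self, action):
        """Direct grid lookup: map a continuous action to its code index."""
        frac = (action - self.low) / (self.high - self.low)
        return min(self.n_codes - 1, max(0, int(frac * self.n_codes)))

    def update(self, action, reward):
        """Decay every attraction, then reinforce the code that was used."""
        self.attraction = [self.decay * a for a in self.attraction]
        self.attraction[self.code_of(action)] += reward

def biased_attention(logits, token_actions, table, beta=1.0):
    """Softmax over one row of attention logits with an additive column bias.

    token_actions[j] is the action behind key column j, or None for
    non-action tokens, which receive no bias.
    """
    biased = [
        logit + (beta * table.attraction[table.code_of(a)] if a is not None else 0.0)
        for logit, a in zip(logits, token_actions)
    ]
    m = max(biased)  # subtract the max for numerical stability
    exps = [math.exp(b - m) for b in biased]
    z = sum(exps)
    return [e / z for e in exps]
```

Because the bias is added to the logits before the softmax, the backbone weights and the training objective are left untouched, matching the abstract's description of the module as an add-on.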

machine learning, natural language, reinforcement learning, (14 more...)

2509.15498

Genre:

Instructional Material (0.48)
Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Cognitive Science (1.00)
Information Technology > Artificial Intelligence > Natural Language (0.93)
(2 more...)

Taschin, Federico, Lazaraq, Abderrahmane, Tonguz, Ozan K., Ozgunes, Inci

The Distribution Shift Problem in Transportation Networks using Reinforcement Learning and AI

The use of Machine Learning (ML) and Artificial Intelligence (AI) in smart transportation networks has increased significantly in the last few years. Among these ML and AI approaches, Reinforcement Learning (RL) has been shown to be a very promising approach by several authors. However, a problem with using Reinforcement Learning in Traffic Signal Control is the reliability of the trained RL agents due to the dynamically changing distribution of the input data with respect to the distribution of the data used for training. This presents a major challenge and a reliability problem for the trained network of AI agents and could have very undesirable and even detrimental consequences if a suitable solution is not found. Several researchers have tried to address this problem using different approaches. In particular, Meta Reinforcement Learning (Meta RL) promises to be an effective solution. In this paper, we evaluate and analyze a state-of-the-art Meta RL approach called MetaLight and show that, while under certain conditions MetaLight can indeed lead to reasonably good results, under some other conditions it might not perform well (with errors of up to 22%), suggesting that Meta RL schemes are often not robust enough and can even pose major reliability problems.

As cities become more populated and the number of vehicles on their roads increases, the problem of efficiently controlling the flow of vehicles to reduce travel times and CO2 emissions is becoming a top priority. For this reason, in recent years, research in Traffic Signal Control has gained significant momentum as the quest to develop better Traffic Signal Control algorithms has intensified. Specifically, Deep Reinforcement Learning (Deep RL) has gained much attention in the research community as it better captures the sequential decision-making nature of the problem.

artificial intelligence, machine learning, reinforcement learning, (15 more...)

2509.15291

Country: North America > United States > Utah (0.28)

Genre: Research Report > New Finding (0.93)

Industry:

Transportation > Infrastructure & Services (1.00)
Transportation > Ground > Road (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Fleming-R1: Toward Expert-Level Medical Reasoning via Reinforcement Learning

Liu, Chi, Li, Derek, Shu, Yan, Chen, Robin, Duan, Derek, Fang, Teng, Dai, Bryan

While large language models show promise in medical applications, achieving expert-level clinical reasoning remains challenging due to the need for both accurate answers and transparent reasoning processes. To address this challenge, we introduce Fleming-R1, a model designed for verifiable medical reasoning through three complementary innovations. First, our Reasoning-Oriented Data Strategy (RODS) combines curated medical QA datasets with knowledge-graph-guided synthesis to improve coverage of underrepresented diseases, drugs, and multi-hop reasoning chains. Second, we employ Chain-of-Thought (CoT) cold start to distill high-quality reasoning trajectories from teacher models, establishing robust inference priors. Third, we implement a two-stage Reinforcement Learning from Verifiable Rewards (RLVR) framework using Group Relative Policy Optimization, which consolidates core reasoning skills while targeting persistent failure modes through adaptive hard-sample mining. Across diverse medical benchmarks, Fleming-R1 delivers substantial parameter-efficient improvements: the 7B variant surpasses much larger baselines, while the 32B model achieves near-parity with GPT-4o and consistently outperforms strong open-source alternatives. These results demonstrate that structured data design, reasoning-oriented initialization, and verifiable reinforcement learning can advance clinical reasoning beyond simple accuracy optimization. We release Fleming-R1 publicly to promote transparent, reproducible, and auditable progress in medical AI, enabling safer deployment in high-stakes clinical environments.
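
Group Relative Policy Optimization, mentioned above, scores each sampled answer relative to the other answers drawn for the same prompt. The sketch below shows that normalization step together with a toy verifiable reward. The exact-match reward, the use of the population standard deviation, and the function names are assumptions for illustration; the abstract does not specify Fleming-R1's reward design or training code.

```python
import statistics

def exact_match_reward(answer, reference):
    """Toy verifiable reward: 1.0 if the normalized answer matches, else 0.0."""
    return 1.0 if answer.strip().lower() == reference.strip().lower() else 0.0

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt.

    advantage_i = (r_i - mean(r)) / (std(r) + eps)
    The population standard deviation is used here (an assumption);
    eps avoids division by zero when every sample earns the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Usage: four sampled answers to one multiple-choice question with key "B".
samples = ["B", "A", "C", " b "]
rewards = [exact_match_reward(s, "B") for s in samples]
advantages = group_relative_advantages(rewards)
```

Answers that beat the group average get positive advantages and the rest get negative ones, so no separate value network is needed to form the baseline.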

large language model, machine learning, reinforcement learning, (19 more...)

2509.15279

Country:

North America > United States (0.46)
Europe > Austria (0.28)

Genre: Research Report > New Finding (0.48)

Industry: Health & Medicine > Diagnostic Medicine (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (1.00)
(2 more...)

Shoresh, David, Loewenstein, Yonatan

Subject Matter Expertise vs Professional Management in Collective Sequential Decision Making

Your company's CEO is retiring. You search for a successor. You can promote an employee who is familiar with the company's operations, or recruit an external professional manager. Who should you prefer? It has not been clear how to address this question, the "subject matter expertise vs. professional manager debate", quantitatively and objectively. We note that a company's success depends on long sequences of interdependent decisions, with often-opposing recommendations of diverse board members. To model this task in a controlled environment, we utilize chess, a complex, sequential game with interdependent decisions that allows for quantitative analysis of performance and expertise (since the states, actions and game outcomes are well-defined). The availability of chess engines differing in style and expertise allows scalable experimentation. We considered a team of (computer) chess players. At each turn, team members recommend a move and a manager chooses a recommendation. We compared the performance of two manager types. To model a "subject matter expert" manager, we used another (computer) chess player that assesses the recommendations of the team members based on its own chess expertise. We examined the performance of such managers at different strength levels. To model a "professional manager", we used Reinforcement Learning (RL) to train a network that identifies the board positions in which different team members have a relative advantage, without any pretraining in chess. We further examined this network to see if any chess knowledge is acquired implicitly. We found that subject matter expertise beyond a minimal threshold does not significantly contribute to team synergy. Moreover, the performance of an RL-trained "professional" manager significantly exceeds that of even the best "expert" managers, while acquiring only limited understanding of chess.

artificial intelligence, machine learning, reinforcement learning, (13 more...)

2509.15263

Country: Asia (0.14)

Genre: Research Report (1.00)

Industry: Leisure & Entertainment > Games > Chess (1.00)

Technology:

Information Technology > Artificial Intelligence > Games > Chess (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.85)

Vulnerable Agent Identification in Large-Scale Multi-Agent Reinforcement Learning

Li, Simin, Yuwei, Zheng, Mao, Zihao, Wang, Linhao, Xu, Ruixiao, Ma, Chengdong, Yu, Xin, Ma, Yuqing, Dou, Qi, Wang, Xin, Luo, Jie, An, Bo, Yang, Yaodong, Lv, Weifeng, Liu, Xianglong

Partial agent failure becomes inevitable when systems scale up, making it crucial to identify the subset of agents whose compromise would most severely degrade overall performance. In this paper, we study this Vulnerable Agent Identification (VAI) problem in large-scale multi-agent reinforcement learning (MARL). We frame VAI as a Hierarchical Adversarial Decentralized Mean Field Control (HAD-MFC) problem, where the upper level involves an NP-hard combinatorial task of selecting the most vulnerable agents, and the lower level learns worst-case adversarial policies for these agents using mean-field MARL. The two problems are coupled together, making HAD-MFC difficult to solve. To solve this, we first decouple the hierarchical process via the Fenchel-Rockafellar transform, resulting in a regularized mean-field Bellman operator for the upper level that enables independent learning at each level, thus reducing computational complexity. We then reformulate the upper-level combinatorial problem as an MDP with dense rewards from our regularized mean-field Bellman operator, enabling us to sequentially identify the most vulnerable agents by greedy and RL algorithms. This decomposition provably preserves the optimal solution of the original HAD-MFC. Experiments show that our method effectively identifies more vulnerable agents in large-scale MARL and the rule-based system, fooling the system into worse failures, and learns a value function that reveals the vulnerability of each agent.

artificial intelligence, machine learning, reinforcement learning, (16 more...)

2509.15103

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.67)

Industry: Information Technology > Security & Privacy (0.93)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents > Agent Societies (0.48)