Reinforcement Learning
Autobidding Arena: unified evaluation of the classical and RL-based autobidding algorithms
Pudovikov, Andrey, Khirianova, Alexandra, Solodneva, Ekaterina, Katrutsa, Aleksandr, Samosvat, Egor, Dorn, Yuriy
Advertisement auctions play a crucial role in revenue generation for e-commerce companies. To make the bidding procedure scalable to thousands of auctions, the automatic bidding (autobidding) algorithms are actively developed in the industry. Therefore, the fair and reproducible evaluation of autobidding algorithms is an important problem. We present a standardized and transparent evaluation protocol for comparing classical and reinforcement learning (RL) autobidding algorithms. We consider the most efficient autobidding algorithms from different classes, e.g., ones based on the controllers, RL, optimal formulas, etc., and benchmark them in the bidding environment. We utilize the most recent open-source environment developed in the industry, which accurately emulates the bidding process. Our work demonstrates the most promising use cases for the considered autobidding algorithms, highlights their surprising drawbacks, and evaluates them according to multiple metrics. We select the evaluation metrics that illustrate the performance of the autobidding algorithms, the corresponding costs, and track the budget pacing. Such a choice of metrics makes our results applicable to the broad range of platforms where autobidding is effective. The presented comparison results help practitioners to evaluate the candidate autobidding algorithms from different perspectives and select ones that are efficient according to their companies' targets.
A Markov Decision Process for Variable Selection in Branch & Bound
Strang, Paul, Alรจs, Zacharie, Bissuel, Cรดme, Juan, Olivier, Kedad-Sidhoum, Safia, Rachelson, Emmanuel
Mixed-Integer Linear Programming (MILP) is a powerful framework used to address a wide range of NP-hard combinatorial optimization problems, often solved by Branch and Bound (B&B). A key factor influencing the performance of B&B solvers is the variable selection heuristic governing branching decisions. Recent contributions have sought to adapt reinforcement learning (RL) algorithms to the B&B setting to learn optimal branching policies, through Markov Decision Processes (MDP) inspired formulations, and ad hoc convergence theorems and algorithms. In this work, we introduce BBMDP, a principled vanilla MDP formulation for variable selection in B&B, allowing to leverage a broad range of RL algorithms for the purpose of learning optimal B\&B heuristics. Computational experiments validate our model empirically, as our branching agent outperforms prior state-of-the-art RL agents on four standard MILP benchmarks.
Hierarchical DLO Routing with Reinforcement Learning and In-Context Vision-language Models
Li, Mingen, Yu, Houjian, Huang, Yixuan, Hong, Youngjin, Choi, Changhyun
Abstract-- Long-horizon routing tasks of deformable linear objects (DLOs), such as cables and ropes, are common in industrial assembly lines and everyday life. These tasks are particularly challenging because they require robots to manipulate DLO with long-horizon planning and reliable skill execution. Successfully completing such tasks demands adapting to their nonlinear dynamics, decomposing abstract routing goals, and generating multi-step plans composed of multiple skills, all of which require accurate high-level reasoning during execution. In this paper, we propose a fully autonomous hierarchical framework for solving challenging DLO routing tasks. Given an implicit or explicit routing goal expressed in language, our framework leverages vision-language models (VLMs) for in-context high-level reasoning to synthesize feasible plans, which are then executed by low-level skills trained via reinforcement learning. T o improve robustness in long horizons, we further introduce a failure recovery mechanism that reorients the DLO into insertion-feasible states. Our approach generalizes to diverse scenes involving object attributes, spatial descriptions, as well as implicit language commands. It outperforms the next best baseline method by nearly 50% and achieves an overall success rate of 92.5% across long-horizon routing scenarios. Please refer to our project page: https:// icra2026-dloroute.github.io/DLORoute/
A Communication-Efficient Decentralized Actor-Critic Algorithm
Ren, Xiaoxing, Bastianello, Nicola, Parisini, Thomas, Malikopoulos, Andreas A.
In this paper, we study the problem of reinforcement learning in multi-agent systems where communication among agents is limited. We develop a decentralized actor-critic learning framework in which each agent performs several local updates of its policy and value function, where the latter is approximated by a multi-layer neural network, before exchanging information with its neighbors. This local training strategy substantially reduces the communication burden while maintaining coordination across the network. We establish finite-time convergence analysis for the algorithm under Markov-sampling. Specifically, to attain the $\varepsilon$-accurate stationary point, the sample complexity is of order $\mathcal{O}(\varepsilon^{-3})$ and the communication complexity is of order $\mathcal{O}(\varepsilon^{-1}ฯ^{-1})$, where tau denotes the number of local training steps. We also show how the final error bound depends on the neural network's approximation quality. Numerical experiments in a cooperative control setting illustrate and validate the theoretical findings.
News-Aware Direct Reinforcement Trading for Financial Markets
Lan, Qing-Yu, Wang, Zhan-He, Jiang, Jun-Qian, Wang, Yu-Tong, Piao, Yun-Song
The financial market is known to be highly sensitive to news. Therefore, effectively incorporating news data into quantitative trading remains an important challenge. Existing approaches typically rely on manually designed rules and/or handcrafted features. In this work, we directly use the news sentiment scores derived from large language models, together with raw price and volume data, as observable inputs for reinforcement learning. These inputs are processed by sequence models such as recurrent neural networks or Transformers to make end-to-end trading decisions. We conduct experiments using the cryptocurrency market as an example and evaluate two representative reinforcement learning algorithms, namely Double Deep Q-Network (DDQN) and Group Relative Policy Optimization (GRPO). The results demonstrate that our news-aware approach, which does not depend on handcrafted features or manually designed rules, can achieve performance superior to market benchmarks. We further highlight the critical role of time-series information in this process.
POLAR: Policy-based Layerwise Reinforcement Learning Method for Stealthy Backdoor Attacks in Federated Learning
Yu, Kuai, Wu, Xiaoyu, Yan, Peishen, Yang, Qingqian, Jiang, Linshan, Wang, Hao, Hua, Yang, Song, Tao, Guan, Haibing
Federated Learning (FL) enables decentralized model training across multiple clients without exposing local data, but its distributed feature makes it vulnerable to backdoor attacks. Despite early FL backdoor attacks modifying entire models, recent studies have explored the concept of backdoor-critical (BC) layers, which poison the chosen influential layers to maintain stealthiness while achieving high effectiveness. However, existing BC layers approaches rely on rule-based selection without consideration of the interrelations between layers, making them ineffective and prone to detection by advanced defenses. In this paper, we propose POLAR (POlicy-based LAyerwise Reinforcement learning), the first pipeline to creatively adopt RL to solve the BC layer selection problem in layer-wise backdoor attack. Different from other commonly used RL paradigm, POLAR is lightweight with Bernoulli sampling. POLAR dynamically learns an attack strategy, optimizing layer selection using policy gradient updates based on backdoor success rate (BSR) improvements. To ensure stealthiness, we introduce a regularization constraint that limits the number of modified layers by penalizing large attack footprints. Extensive experiments demonstrate that POLAR outperforms the latest attack methods by up to 40% against six state-of-the-art (SOTA) defenses.
REPAIR Approach for Social-based City Reconstruction Planning in case of natural disasters
Mudassir, Ghulam, Di Marco, Antinisca, d'Aloisio, Giordano
Natural disasters always have several effects on human lives. It is challenging for governments to tackle these incidents and to rebuild the economic, social and physical infrastructures and facilities with the available resources (mainly budget and time). Governments always define plans and policies according to the law and political strategies that should maximise social benefits. The severity of damage and the vast resources needed to bring life back to normality make such reconstruction a challenge. This article is the extension of our previously published work by conducting comprehensive comparative analysis by integrating additional deep learning models plus random agent which is used as a baseline. Our prior research introduced a decision support system by using the Deep Reinforcement Learning technique for the planning of post-disaster city reconstruction, maximizing the social benefit of the reconstruction process, considering available resources, meeting the needs of the broad community stakeholders (like citizens' social benefits and politicians' priorities) and keeping in consideration city's structural constraints (like dependencies among roads and buildings). The proposed approach, named post disaster REbuilding plAn ProvIdeR (REPAIR) is generic. It can determine a set of alternative plans for local administrators who select the ideal one to implement, and it can be applied to areas of any extension. We show the application of REPAIR in a real use case, i.e., to the L'Aquila reconstruction process, damaged in 2009 by a major earthquake.
Feature-driven reinforcement learning for photovoltaic in continuous intraday trading
Abate, Arega Getaneh, Liu, Xiufeng, Liu, Ruyu, Zhang, Xiaobing
Photovoltaic (PV) operators face substantial uncertainty in generation and short-term electricity prices. Continuous intraday markets enable producers to adjust their positions in real time, potentially improving revenues and reducing imbalance costs. We propose a feature-driven reinforcement learning (RL) approach for PV intraday trading that integrates data-driven features into the state and learns bidding policies in a sequential decision framework. The problem is cast as a Markov Decision Process with a reward that balances trading profit and imbalance penalties and is solved with Proximal Policy Optimization (PPO) using a predominantly linear, interpretable policy. Trained on historical market data and evaluated out-of-sample, the strategy consistently outperforms benchmark baselines across diverse scenarios. Extensive validation shows rapid convergence, real-time inference, and transparent decision rules. Learned weights highlight the central role of market microstructure and historical features. Taken together, these results indicate that feature-driven RL offers a practical, data-efficient, and operationally deployable pathway for active intraday participation by PV producers.
Dual-Weighted Reinforcement Learning for Generative Preference Modeling
Feng, Shengyu, He, Yun, Ma, Shuang, Li, Beibin, Xiong, Yuanhao, Li, Songlin, Mandyam, Karishma, Katz-Samuels, Julian, Bi, Shengjie, Yu, Licheng, Zhang, Hejia, Sankararaman, Karthik Abinav, Fang, Han, Mansour, Riham, Yang, Yiming, Faruqui, Manaal
Reinforcement learning (RL) has recently proven effective at scaling chain-of-thought (CoT) reasoning in large language models on tasks with verifiable answers. However, extending RL to more general non-verifiable tasks, typically in the format of human preference pairs, remains both challenging and underexplored. In this work, we propose Dual-Weighted Reinforcement Learning (DWRL), a new framework for preference modeling that integrates CoT reasoning with the Bradley-Terry (BT) model via a dual-weighted RL objective that preserves preference-modeling inductive bias. DWRL approximates the maximum-likelihood objective of the BT model with two complementary weights: an instance-wise misalignment weight, which emphasizes under-trained pairs misaligned with human preference, and a group-wise (self-normalized) conditional preference score, which promotes promising thoughts. In this paper, we apply DWRL to preference modeling by training generative preference models (GPMs) to first generate a thought and then predict the human preference score. Across multiple benchmarks and model scales (Llama3 and Qwen2.5), DWRL consistently outperforms both GPM baselines and scalar models, while producing coherent, interpretable thoughts. In summary, our results position DWRL as a general framework for reasoning-enhanced preference learning beyond verifiable tasks.
Coordinated Strategies in Realistic Air Combat by Hierarchical Multi-Agent Reinforcement Learning
Selmonaj, Ardian, Del Rio, Giacomo, Schneider, Adrian, Antonucci, Alessandro
We focus explicitly on multi-agent RL methods in 3D air combat environments, while the survey [4] also includes single-agent RL and 2D dynamics. Several existing works employ techniques that are relevant to multi-agent air combat, such as tactical reward shaping [5], heterogeneous agents [6], attention-based neural networks for situational awareness [7], or communication mechanisms [8] to improve mission strategies. Curriculum Learning (CL) with gradually increasing task difficulty is applied in [9], while enhanced coordination among agents is achieved by adapted training algorithms [10]. The application of HMARL in defense contexts is comparatively limited. An HMARL approach that employs attention mechanisms and self-play is introduced in [11]. Frameworks more closely related to ours appear in [12], [13], with the former integrating CL and the latter employing heterogeneous leader-follower agents together with JSBSim. In this work, we introduce a complex 3D air combat environment and a training framework to learn hierarchical policies using reward shaping and cascaded league-play that gradually increases mission complexity under realistic and heterogeneous conditions. In contrast to prior efforts that are built on established RL algorithms such as Proximal Policy Optimization (PPO) [14], we additionally adapt the recently presented SPO algorithm [3] to the hierarchical multi-agent domain. To the best of our knowledge, this adapted setup has not yet been studied in this context and represents a significant step toward enhancing the realism of such applications.