Reinforcement Learning
The Fair Game: Auditing & Debiasing AI Algorithms Over Time
An emerging field of AI, namely Fair Machine Learning (ML), aims to quantify different types of bias (also known as unfairness) exhibited in the predictions of ML algorithms, and to design new algorithms to mitigate them. Often, the definitions of bias used in the literature are observational, i.e. they use the input and output of a pre-trained algorithm to quantify a bias under concern. In reality,these definitions are often conflicting in nature and can only be deployed if either the ground truth is known or only in retrospect after deploying the algorithm. Thus,there is a gap between what we want Fair ML to achieve and what it does in a dynamic social environment. Hence, we propose an alternative dynamic mechanism,"Fair Game",to assure fairness in the predictions of an ML algorithm and to adapt its predictions as the society interacts with the algorithm over time. "Fair Game" puts together an Auditor and a Debiasing algorithm in a loop around an ML algorithm. The "Fair Game" puts these two components in a loop by leveraging Reinforcement Learning (RL). RL algorithms interact with an environment to take decisions, which yields new observations (also known as data/feedback) from the environment and in turn, adapts future decisions. RL is already used in algorithms with pre-fixed long-term fairness goals. "Fair Game" provides a unique framework where the fairness goals can be adapted over time by only modifying the auditor and the different biases it quantifies. Thus,"Fair Game" aims to simulate the evolution of ethical and legal frameworks in the society by creating an auditor which sends feedback to a debiasing algorithm deployed around an ML system. This allows us to develop a flexible and adaptive-over-time framework to build Fair ML systems pre- and post-deployment.
Lightweight Auto-bidding based on Traffic Prediction in Live Advertising
Yang, Bo, Luo, Ruixuan, Jin, Junqi, Zhu, Han
Internet live streaming is widely used in online entertainment and e-commerce, where live advertising is an important marketing tool for anchors. An advertising campaign hopes to maximize the effect (such as conversions) under constraints (such as budget and cost-per-click). The mainstream control of campaigns is auto-bidding, where the performance depends on the decision of the bidding algorithm in each request. The most widely used auto-bidding algorithms include Proportional-Integral-Derivative (PID) control, linear programming (LP), reinforcement learning (RL), etc. Existing methods either do not consider the entire time traffic, or have too high computational complexity. In this paper, the live advertising has high requirements for real-time bidding (second-level control) and faces the difficulty of unknown future traffic. Therefore, we propose a lightweight bidding algorithm Binary Constrained Bidding (BiCB), which neatly combines the optimal bidding formula given by mathematical analysis and the statistical method of future traffic estimation, and obtains good approximation to the optimal result through a low complexity solution. In addition, we complement the form of upper and lower bound constraints for traditional auto-bidding modeling and give theoretical analysis of BiCB. Sufficient offline and online experiments prove BiCB's good performance and low engineering cost.
Unsupervised Partner Design Enables Robust Ad-hoc Teamwork
Ruhdorfer, Constantin, Bortoletto, Matteo, Oei, Victor, Penzkofer, Anna, Bulling, Andreas
We introduce Unsupervised Partner Design (UPD) - a population-free, multi-agent reinforcement learning framework for robust ad-hoc teamwork that adaptively generates training partners without requiring pretrained partners or manual parameter tuning. UPD constructs diverse partners by stochastically mixing an ego agent's policy with biased random behaviours and scores them using a variance-based learnability metric that prioritises partners near the ego agent's current learning frontier. We show that UPD can be integrated with unsupervised environment design, resulting in the first method enabling fully unsupervised curricula over both level and partner distributions in a cooperative setting. Through extensive evaluations on Overcooked-AI and the Overcooked Generalisation Challenge, we demonstrate that this dynamic partner curriculum is highly effective: UPD consistently outperforms both population-based and population-free baselines as well as ablations. In a user study, we further show that UPD achieves higher returns than all baselines and was perceived as significantly more adaptive, more human-like, a better collaborator, and less frustrating.
L2Calib: $SE(3)$-Manifold Reinforcement Learning for Robust Extrinsic Calibration with Degenerate Motion Resilience
Li, Baorun, Zhu, Chengrui, Du, Siyi, Chen, Bingran, Ren, Jie, Wang, Wenfei, Liu, Yong, Lv, Jiajun
-- Extrinsic calibration is essential for multi-sensor fusion, existing methods rely on structured targets or fully-excited data, limiting real-world applicability. Online calibration further suffers from weak excitation, leading to unreliable estimates. T o address these limitations, we propose a reinforcement learning (RL)-based extrinsic calibration framework that formulates extrinsic calibration as a decision-making problem, directly optimizes SE (3) extrinsics to enhance odometry accuracy. Our approach leverages a probabilistic Bingham distribution to model 3D rotations, ensuring stable optimization while inherently retaining quaternion symmetry. A trajectory alignment reward mechanism enables robust calibration without structured targets by quantitatively evaluating estimated tightly-coupled trajectory against a reference trajectory. Additionally, an automated data selection module filters uninformative samples, significantly improving efficiency and scalability for large-scale datasets. Extensive experiments on UA Vs, UGVs, and handheld platforms demonstrate that our method outperforms traditional optimization-based approaches, achieving high-precision calibration even under weak excitation conditions. The code is available at https://github.com/
OM2P: Offline Multi-Agent Mean-Flow Policy
Li, Zhuoran, Wang, Xun, Zhong, Hai, Huang, Longbo
Generative models, especially diffusion and flow-based models, have been promising in offline multi-agent reinforcement learning. However, integrating powerful generative models into this framework poses unique challenges. In particular, diffusion and flow-based policies suffer from low sampling efficiency due to their iterative generation processes, making them impractical in time-sensitive or resource-constrained settings. To tackle these difficulties, we propose OM2P (Offline Multi-Agent Mean-Flow Policy), a novel offline MARL algorithm to achieve efficient one-step action sampling. To address the misalignment between generative objectives and reward maximization, we introduce a reward-aware optimization scheme that integrates a carefully-designed mean-flow matching loss with Q-function supervision. Additionally, we design a generalized timestep distribution and a derivative-free estimation strategy to reduce memory overhead and improve training stability. Empirical evaluations on Multi-Agent Particle and MuJoCo benchmarks demonstrate that OM2P achieves superior performance, with up to a 3.8x reduction in GPU memory usage and up to a 10.8x speed-up in training time. Our approach represents the first to successfully integrate mean-flow model into offline MARL, paving the way for practical and scalable generative policies in cooperative multi-agent settings.
SCAR: State-Space Compression for AI-Driven Resource Management in 6G-Enabled Vehicular Infotainment Systems
Comsa, Ioan-Sorin, Shah, Purav, Vaidhyanathan, Karthik, Gangadharan, Deepak, Imhof, Christof, Bergamin, Per, Kaushik, Aryan, Muntean, Gabriel-Miro, Trestian, Ramona
The advent of 6G networks opens new possibilities for connected infotainment services in vehicular environments. However, traditional Radio Resource Management (RRM) techniques struggle with the increasing volume and complexity of data such as Channel Quality Indicators (CQI) from autonomous vehicles. To address this, we propose SCAR (State-Space Compression for AI-Driven Resource Management), an Edge AI-assisted framework that optimizes scheduling and fairness in vehicular infotainment. SCAR employs ML-based compression techniques (e.g., clustering and RBF networks) to reduce CQI data size while preserving essential features. These compressed states are used to train 6G-enabled Reinforcement Learning policies that maximize throughput while meeting fairness objectives defined by the NGMN. Simulations show that SCAR increases time in feasible scheduling regions by 14\% and reduces unfair scheduling time by 15\% compared to RL baselines without CQI compression. Furthermore, Simulated Annealing with Stochastic Tunneling (SAST)-based clustering reduces CQI clustering distortion by 10\%, confirming its efficiency. These results demonstrate SCAR's scalability and fairness benefits for dynamic vehicular networks.
GCHR : Goal-Conditioned Hindsight Regularization for Sample-Efficient Reinforcement Learning
Lei, Xing, Yang, Wenyan, Ke, Kaiqiang, Yang, Shentao, Zhang, Xuetao, Pajarinen, Joni, Wang, Donglin
Goal-conditioned reinforcement learning (GCRL) with sparse rewards remains a fundamental challenge in reinforcement learning. While hindsight experience replay (HER) has shown promise by relabeling collected trajectories with achieved goals, we argue that trajectory relabeling alone does not fully exploit the available experiences in off-policy GCRL methods, resulting in limited sample efficiency. In this paper, we propose Hindsight Goal-conditioned Regularization (HGR), a technique that generates action regularization priors based on hindsight goals. When combined with hindsight self-imitation regularization (HSR), our approach enables off-policy RL algorithms to maximize experience utilization. Compared to existing GCRL methods that employ HER and self-imitation techniques, our hindsight regularizations achieve substantially more efficient sample reuse and the best performances, which we empirically demonstrate on a suite of navigation and manipulation tasks.
ME$^3$-BEV: Mamba-Enhanced Deep Reinforcement Learning for End-to-End Autonomous Driving with BEV-Perception
Lu, Siyi, Liu, Run, Yang, Dongsheng, He, Lei
Autonomous driving systems face significant challenges in perceiving complex environments and making real-time decisions. Traditional modular approaches, while offering interpretability, suffer from error propagation and coordination issues, whereas end-to-end learning systems can simplify the design but face computational bottlenecks. This paper presents a novel approach to autonomous driving using deep reinforcement learning (DRL) that integrates bird's-eye view (BEV) perception for enhanced real-time decision-making. We introduce the \texttt{Mamba-BEV} model, an efficient spatio-temporal feature extraction network that combines BEV-based perception with the Mamba framework for temporal feature modeling. This integration allows the system to encode vehicle surroundings and road features in a unified coordinate system and accurately model long-range dependencies. Building on this, we propose the \texttt{ME$^3$-BEV} framework, which utilizes the \texttt{Mamba-BEV} model as a feature input for end-to-end DRL, achieving superior performance in dynamic urban driving scenarios. We further enhance the interpretability of the model by visualizing high-dimensional features through semantic segmentation, providing insight into the learned representations. Extensive experiments on the CARLA simulator demonstrate that \texttt{ME$^3$-BEV} outperforms existing models across multiple metrics, including collision rate and trajectory accuracy, offering a promising solution for real-time autonomous driving.
Policy Optimization in Multi-Agent Settings under Partially Observable Environments
Zhaikhan, Ainur, Khammassi, Malek, Sayed, Ali H.
This work leverages adaptive social learning to estimate partially observable global states in multi-agent reinforcement learning (MARL) problems. Unlike existing methods, the proposed approach enables the concurrent operation of social learning and reinforcement learning. Specifically, it alternates between a single step of social learning and a single step of MARL, eliminating the need for the time- and computation-intensive two-timescale learning frameworks. Theoretical guarantees are provided to support the effectiveness of the proposed method. Simulation results verify that the performance of the proposed methodology can approach that of reinforcement learning when the true state is known.
Parameter-free Optimal Rates for Nonlinear Semi-Norm Contractions with Applications to $Q$-Learning
Naskar, Ankur, Thoppe, Gugan, Gupta, Vijay
Algorithms for solving nonlinear fixed-point equations-- such as average-reward Q-learning and TD-learning-- often involve semi-norm contractions. Achieving parameter-free optimal convergence rates for these methods via Polyak-Ruppert averaging has remained elusive, largely due to the non-monotonicity of such semi-norms. We close this gap by (i.) recasting the averaged error as a linear recursion involving a nonlinear perturbation, and (ii.) taming the nonlinearity by coupling the semi-norm's contraction with the monotonicity of a suitably induced norm. Our main result yields the first parameter-free O (1/ t) optimal rates for Q-learning in both average-reward and exponentially discounted settings, where t denotes the iteration index. The result applies within a broad framework that accommodates synchronous and asynchronous updates, single-agent and distributed deployments, and data streams obtained either from simulators or along Markovian trajectories.