AITopics

An emerging field of AI, namely Fair Machine Learning (ML), aims to quantify different types of bias (also known as unfairness) exhibited in the predictions of ML algorithms, and to design new algorithms to mitigate them. Often, the definitions of bias used in the literature are observational, i.e. they use the input and output of a pre-trained algorithm to quantify a bias under concern. In reality,these definitions are often conflicting in nature and can only be deployed if either the ground truth is known or only in retrospect after deploying the algorithm. Thus,there is a gap between what we want Fair ML to achieve and what it does in a dynamic social environment. Hence, we propose an alternative dynamic mechanism,"Fair Game",to assure fairness in the predictions of an ML algorithm and to adapt its predictions as the society interacts with the algorithm over time. "Fair Game" puts together an Auditor and a Debiasing algorithm in a loop around an ML algorithm. The "Fair Game" puts these two components in a loop by leveraging Reinforcement Learning (RL). RL algorithms interact with an environment to take decisions, which yields new observations (also known as data/feedback) from the environment and in turn, adapts future decisions. RL is already used in algorithms with pre-fixed long-term fairness goals. "Fair Game" provides a unique framework where the fairness goals can be adapted over time by only modifying the auditor and the different biases it quantifies. Thus,"Fair Game" aims to simulate the evolution of ethical and legal frameworks in the society by creating an auditor which sends feedback to a debiasing algorithm deployed around an ML system. This allows us to develop a flexible and adaptive-over-time framework to build Fair ML systems pre- and post-deployment.

artificial intelligence, machine learning, reinforcement learning, (16 more...)

doi: 10.1017/cfl.2025.8

2508.06443

Country:

North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
Oceania > New Zealand (0.04)
North America > United States > Rhode Island > Providence County > Providence (0.04)
(8 more...)

Genre: Overview (0.67)

Industry:

Leisure & Entertainment > Games (1.00)
Law > Statutes (1.00)
Information Technology > Security & Privacy (1.00)
(3 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.92)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

arXiv.org Machine LearningAug-11-2025

Lightweight Auto-bidding based on Traffic Prediction in Live Advertising

Yang, Bo, Luo, Ruixuan, Jin, Junqi, Zhu, Han

Internet live streaming is widely used in online entertainment and e-commerce, where live advertising is an important marketing tool for anchors. An advertising campaign hopes to maximize the effect (such as conversions) under constraints (such as budget and cost-per-click). The mainstream control of campaigns is auto-bidding, where the performance depends on the decision of the bidding algorithm in each request. The most widely used auto-bidding algorithms include Proportional-Integral-Derivative (PID) control, linear programming (LP), reinforcement learning (RL), etc. Existing methods either do not consider the entire time traffic, or have too high computational complexity. In this paper, the live advertising has high requirements for real-time bidding (second-level control) and faces the difficulty of unknown future traffic. Therefore, we propose a lightweight bidding algorithm Binary Constrained Bidding (BiCB), which neatly combines the optimal bidding formula given by mathematical analysis and the statistical method of future traffic estimation, and obtains good approximation to the optimal result through a low complexity solution. In addition, we complement the form of upper and lower bound constraints for traditional auto-bidding modeling and give theoretical analysis of BiCB. Sufficient offline and online experiments prove BiCB's good performance and low engineering cost.

data mining, machine learning, reinforcement learning, (18 more...)

arXiv.org Machine Learning

2508.06069

Country:

North America > Canada > Ontario > Toronto (0.05)
Asia > China > Beijing > Beijing (0.04)
North America > United States > New York > New York County > New York City (0.04)
Europe > Italy > Piedmont > Turin Province > Turin (0.04)

Genre: Research Report (1.00)

Industry:

Marketing (1.00)
Information Technology > Services > e-Commerce Services (0.34)

Technology:

Information Technology > Data Science > Data Mining (0.91)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.55)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.49)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Ruhdorfer, Constantin, Bortoletto, Matteo, Oei, Victor, Penzkofer, Anna, Bulling, Andreas

Unsupervised Partner Design Enables Robust Ad-hoc Teamwork

We introduce Unsupervised Partner Design (UPD) - a population-free, multi-agent reinforcement learning framework for robust ad-hoc teamwork that adaptively generates training partners without requiring pretrained partners or manual parameter tuning. UPD constructs diverse partners by stochastically mixing an ego agent's policy with biased random behaviours and scores them using a variance-based learnability metric that prioritises partners near the ego agent's current learning frontier. We show that UPD can be integrated with unsupervised environment design, resulting in the first method enabling fully unsupervised curricula over both level and partner distributions in a cooperative setting. Through extensive evaluations on Overcooked-AI and the Overcooked Generalisation Challenge, we demonstrate that this dynamic partner curriculum is highly effective: UPD consistently outperforms both population-based and population-free baselines as well as ablations. In a user study, we further show that UPD achieves higher returns than all baselines and was perceived as significantly more adaptive, more human-like, a better collaborator, and less frustrating.

artificial intelligence, machine learning, reinforcement learning, (18 more...)

2508.06336

Country: Europe > Austria (0.28)

Genre:

Research Report > New Finding (1.00)
Questionnaire & Opinion Survey (1.00)

Industry: Education (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.87)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)

L2Calib: $SE(3)$-Manifold Reinforcement Learning for Robust Extrinsic Calibration with Degenerate Motion Resilience

Li, Baorun, Zhu, Chengrui, Du, Siyi, Chen, Bingran, Ren, Jie, Wang, Wenfei, Liu, Yong, Lv, Jiajun

-- Extrinsic calibration is essential for multi-sensor fusion, existing methods rely on structured targets or fully-excited data, limiting real-world applicability. Online calibration further suffers from weak excitation, leading to unreliable estimates. T o address these limitations, we propose a reinforcement learning (RL)-based extrinsic calibration framework that formulates extrinsic calibration as a decision-making problem, directly optimizes SE (3) extrinsics to enhance odometry accuracy. Our approach leverages a probabilistic Bingham distribution to model 3D rotations, ensuring stable optimization while inherently retaining quaternion symmetry. A trajectory alignment reward mechanism enables robust calibration without structured targets by quantitatively evaluating estimated tightly-coupled trajectory against a reference trajectory. Additionally, an automated data selection module filters uninformative samples, significantly improving efficiency and scalability for large-scale datasets. Extensive experiments on UA Vs, UGVs, and handheld platforms demonstrate that our method outperforms traditional optimization-based approaches, achieving high-precision calibration even under weak excitation conditions. The code is available at https://github.com/

calibration, machine learning, reinforcement learning, (19 more...)

2508.0633

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Robots (0.98)

OM2P: Offline Multi-Agent Mean-Flow Policy

Li, Zhuoran, Wang, Xun, Zhong, Hai, Huang, Longbo

Generative models, especially diffusion and flow-based models, have been promising in offline multi-agent reinforcement learning. However, integrating powerful generative models into this framework poses unique challenges. In particular, diffusion and flow-based policies suffer from low sampling efficiency due to their iterative generation processes, making them impractical in time-sensitive or resource-constrained settings. To tackle these difficulties, we propose OM2P (Offline Multi-Agent Mean-Flow Policy), a novel offline MARL algorithm to achieve efficient one-step action sampling. To address the misalignment between generative objectives and reward maximization, we introduce a reward-aware optimization scheme that integrates a carefully-designed mean-flow matching loss with Q-function supervision. Additionally, we design a generalized timestep distribution and a derivative-free estimation strategy to reduce memory overhead and improve training stability. Empirical evaluations on Multi-Agent Particle and MuJoCo benchmarks demonstrate that OM2P achieves superior performance, with up to a 3.8x reduction in GPU memory usage and up to a 10.8x speed-up in training time. Our approach represents the first to successfully integrate mean-flow model into offline MARL, paving the way for practical and scalable generative policies in cooperative multi-agent settings.

artificial intelligence, machine learning, reinforcement learning, (12 more...)

2508.06269

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents > Agent Societies (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)

Comsa, Ioan-Sorin, Shah, Purav, Vaidhyanathan, Karthik, Gangadharan, Deepak, Imhof, Christof, Bergamin, Per, Kaushik, Aryan, Muntean, Gabriel-Miro, Trestian, Ramona

SCAR: State-Space Compression for AI-Driven Resource Management in 6G-Enabled Vehicular Infotainment Systems

The advent of 6G networks opens new possibilities for connected infotainment services in vehicular environments. However, traditional Radio Resource Management (RRM) techniques struggle with the increasing volume and complexity of data such as Channel Quality Indicators (CQI) from autonomous vehicles. To address this, we propose SCAR (State-Space Compression for AI-Driven Resource Management), an Edge AI-assisted framework that optimizes scheduling and fairness in vehicular infotainment. SCAR employs ML-based compression techniques (e.g., clustering and RBF networks) to reduce CQI data size while preserving essential features. These compressed states are used to train 6G-enabled Reinforcement Learning policies that maximize throughput while meeting fairness objectives defined by the NGMN. Simulations show that SCAR increases time in feasible scheduling regions by 14\% and reduces unfair scheduling time by 15\% compared to RL baselines without CQI compression. Furthermore, Simulated Annealing with Stochastic Tunneling (SAST)-based clustering reduces CQI clustering distortion by 10\%, confirming its efficiency. These results demonstrate SCAR's scalability and fairness benefits for dynamic vehicular networks.

artificial intelligence, machine learning, reinforcement learning, (17 more...)

2508.06243

Country: Europe > United Kingdom > England > Greater London > London (0.28)

Genre: Research Report > New Finding (0.34)

Industry:

Telecommunications (1.00)
Information Technology (1.00)
Education > Educational Setting > Online (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

GCHR : Goal-Conditioned Hindsight Regularization for Sample-Efficient Reinforcement Learning

Lei, Xing, Yang, Wenyan, Ke, Kaiqiang, Yang, Shentao, Zhang, Xuetao, Pajarinen, Joni, Wang, Donglin

Goal-conditioned reinforcement learning (GCRL) with sparse rewards remains a fundamental challenge in reinforcement learning. While hindsight experience replay (HER) has shown promise by relabeling collected trajectories with achieved goals, we argue that trajectory relabeling alone does not fully exploit the available experiences in off-policy GCRL methods, resulting in limited sample efficiency. In this paper, we propose Hindsight Goal-conditioned Regularization (HGR), a technique that generates action regularization priors based on hindsight goals. When combined with hindsight self-imitation regularization (HSR), our approach enables off-policy RL algorithms to maximize experience utilization. Compared to existing GCRL methods that employ HER and self-imitation techniques, our hindsight regularizations achieve substantially more efficient sample reuse and the best performances, which we empirically demonstrate on a suite of navigation and manipulation tasks.

machine learning, reinforcement learning, trajectory, (12 more...)

2508.06108

Genre: Research Report > New Finding (0.67)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

ME$^3$-BEV: Mamba-Enhanced Deep Reinforcement Learning for End-to-End Autonomous Driving with BEV-Perception

Lu, Siyi, Liu, Run, Yang, Dongsheng, He, Lei

Autonomous driving systems face significant challenges in perceiving complex environments and making real-time decisions. Traditional modular approaches, while offering interpretability, suffer from error propagation and coordination issues, whereas end-to-end learning systems can simplify the design but face computational bottlenecks. This paper presents a novel approach to autonomous driving using deep reinforcement learning (DRL) that integrates bird's-eye view (BEV) perception for enhanced real-time decision-making. We introduce the \texttt{Mamba-BEV} model, an efficient spatio-temporal feature extraction network that combines BEV-based perception with the Mamba framework for temporal feature modeling. This integration allows the system to encode vehicle surroundings and road features in a unified coordinate system and accurately model long-range dependencies. Building on this, we propose the \texttt{ME$^3$-BEV} framework, which utilizes the \texttt{Mamba-BEV} model as a feature input for end-to-end DRL, achieving superior performance in dynamic urban driving scenarios. We further enhance the interpretability of the model by visualizing high-dimensional features through semantic segmentation, providing insight into the learned representations. Extensive experiments on the CARLA simulator demonstrate that \texttt{ME$^3$-BEV} outperforms existing models across multiple metrics, including collision rate and trajectory accuracy, offering a promising solution for real-time autonomous driving.

artificial intelligence, machine learning, reinforcement learning, (14 more...)

2508.06074

Country: Asia > China (0.29)

Genre:

Research Report > Promising Solution (0.54)
Overview > Innovation (0.34)

Industry:

Transportation > Ground > Road (1.00)
Information Technology > Robotics & Automation (1.00)
Automobiles & Trucks (1.00)

Technology:

Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.47)

Zhaikhan, Ainur, Khammassi, Malek, Sayed, Ali H.

Policy Optimization in Multi-Agent Settings under Partially Observable Environments

This work leverages adaptive social learning to estimate partially observable global states in multi-agent reinforcement learning (MARL) problems. Unlike existing methods, the proposed approach enables the concurrent operation of social learning and reinforcement learning. Specifically, it alternates between a single step of social learning and a single step of MARL, eliminating the need for the time- and computation-intensive two-timescale learning frameworks. Theoretical guarantees are provided to support the effectiveness of the proposed method. Simulation results verify that the performance of the proposed methodology can approach that of reinforcement learning when the true state is known.

artificial intelligence, machine learning, reinforcement learning, (15 more...)

2508.06061

Genre: Research Report (0.50)

Industry: Education > Curriculum (0.90)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (1.00)

Naskar, Ankur, Thoppe, Gugan, Gupta, Vijay

Parameter-free Optimal Rates for Nonlinear Semi-Norm Contractions with Applications to $Q$-Learning

Algorithms for solving nonlinear fixed-point equations-- such as average-reward Q-learning and TD-learning-- often involve semi-norm contractions. Achieving parameter-free optimal convergence rates for these methods via Polyak-Ruppert averaging has remained elusive, largely due to the non-monotonicity of such semi-norms. We close this gap by (i.) recasting the averaged error as a linear recursion involving a nonlinear perturbation, and (ii.) taming the nonlinearity by coupling the semi-norm's contraction with the monotonicity of a suitably induced norm. Our main result yields the first parameter-free O (1/ t) optimal rates for Q-learning in both average-reward and exponentially discounted settings, where t denotes the iteration index. The result applies within a broad framework that accommodates synchronous and asynchronous updates, single-agent and distributed deployments, and data streams obtained either from simulators or along Markovian trajectories.

lemma 4, machine learning, reinforcement learning, (17 more...)

2508.05984

Country: Asia (0.28)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)