Discrete GCBF Proximal Policy Optimization for Multi-agent Safe Optimal Control
Zhang, Songyuan, So, Oswin, Black, Mitchell, Fan, Chuchu
–arXiv.org Artificial Intelligence
Control policies that can achieve high task performance and satisfy safety constraints are desirable for any system, including multi-agent systems (MAS). One promising technique for ensuring the safety of MAS is distributed control barrier functions (CBF). However, it is difficult to design distributed CBF-based policies for MAS that can tackle unknown discrete-time dynamics, partial observability, changing neighborhoods, and input constraints, especially when a distributed high-performance nominal policy that can achieve the task is unavailable. To tackle these challenges, we propose DGPPO, a new framework that simultaneously learns both a discrete graph CBF, which handles neighborhood changes and input constraints, and a distributed high-performance safe policy for MAS with unknown discrete-time dynamics. The results suggest that, compared with existing methods, our DGPPO framework obtains policies that achieve high task performance (matching baselines that ignore the safety constraints) and high safety rates (matching the most conservative baselines), with a constant set of hyperparameters across all environments.

Multi-agent systems (MAS) have gained significant attention in recent years due to their potential applications in various domains such as warehouse robotics (Kattepur et al., 2018), autonomous vehicles (Shalev-Shwartz et al., 2016), traffic routing (Wu et al., 2020), and power systems (Biagioni et al., 2022). However, a major challenge for MAS is designing distributed control policies that can achieve high task performance while ensuring safety, especially when the two are conflicting. In the single-agent continuous-time case, control barrier functions (CBF) are an effective tool for resolving this conflict via the solution of a safety-filter quadratic program (QP) (Xu et al., 2015; Ames et al., 2017), which minimally modifies a given performance-oriented nominal policy to be safe.
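To make the safety-filter idea concrete, the following is a minimal sketch of a continuous-time CBF-QP filter, not the DGPPO method itself. It assumes control-affine dynamics with a single affine CBF constraint a·u + b ≥ 0 (where a and b collect the Lie derivatives and the class-K term), in which case the QP min ‖u − u_nom‖² has a closed-form solution: keep the nominal input if it is already safe, otherwise project it onto the constraint boundary. The toy single-integrator example and the names `cbf_qp_filter`, `alpha` are illustrative assumptions, not from the paper.

```python
import numpy as np

def cbf_qp_filter(u_nom, a, b):
    """Solve  min ||u - u_nom||^2  s.t.  a @ u + b >= 0.

    With a single affine constraint the QP has a closed form:
    return u_nom if it already satisfies the constraint; otherwise
    project u_nom onto the hyperplane  a @ u + b = 0.
    """
    slack = a @ u_nom + b
    if slack >= 0.0:
        return u_nom
    return u_nom - (slack / (a @ a)) * a

# Toy example (illustrative): single integrator x_dot = u with safe set
# h(x) = 1 - x >= 0. The CBF condition  dh/dt + alpha * h(x) >= 0  becomes
# -u + alpha * (1 - x) >= 0, i.e. a = [-1], b = alpha * (1 - x).
alpha, x = 1.0, 0.5
u_nom = np.array([2.0])              # nominal policy drives toward the boundary
a, b = np.array([-1.0]), alpha * (1.0 - x)
u_safe = cbf_qp_filter(u_nom, a, b)  # minimally clipped to u <= 0.5
```

The filter only intervenes when the nominal input would violate the CBF condition, which is exactly the "minimal modification" property mentioned above.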
While distributed CBFs have been proposed for the multi-agent (Wang et al., 2017) and partially observable cases (Zhang et al., 2025), they require known continuous-time dynamics and a nominal policy that achieves high task performance (albeit not necessarily safely). Although these assumptions are reasonable for many applications, they do not hold when the dynamics are unknown and a performance-oriented nominal policy is unavailable. The need for a nominal policy has been addressed by approaches that combine CBFs with reinforcement learning (RL) (Cheng et al., 2019; Emam et al., 2022), where the nominal policy is learned via an unconstrained RL algorithm to maximize task performance while the CBF acts as a safety filter.
Feb-5-2025