NeurIPS20_SafeCL
–Neural Information Processing Systems
In this section, we report the hyperparameters that we use for the students, which are CMDP solvers based on an online version of [30], and for the teachers, which are based on the GP-UCB algorithm for multi-armed bandits [44]. A.1 Students The students comprise two components: an unconstrained RL solver and a no-regret online optimizer. The first component is used to solve the unconstrained RL problem that results from optimizing the Lagrangian of a given CMDP for a fixed value of the Lagrange multipliers. For this, we use the Stable Baselines [25] implementation of the Proximal Policy Optimization (PPO) algorithm [43]. The second component is used to adapt the Lagrangian multipliers online.
Neural Information Processing Systems
Jan-26-2025, 12:23:00 GMT