NeurIPS20_SafeCL

Matteo Turchetta

Neural Information Processing Systems 

In this section, we report the hyperparameters that we use for the students, which are CMDP solvers based on an online version of [30], and for the teachers, which are based on the GP-UCB algorithm for multi-armed bandits [44].

A.1 Students

The students comprise two components: an unconstrained RL solver and a no-regret online optimizer. The first component solves the unconstrained RL problem that results from optimizing the Lagrangian of a given CMDP for a fixed value of the Lagrange multipliers. For this, we use the Stable Baselines [25] implementation of the Proximal Policy Optimization (PPO) algorithm [43]. The second component adapts the Lagrange multipliers online.
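To make the interplay between the two components concrete, the following is a minimal sketch and not the implementation used in the experiments. It assumes a CMDP whose constraints take the form "expected cost at most a threshold", and it uses projected online gradient ascent as a stand-in for the no-regret optimizer, which may differ from the online version of [30]. The function names, the sign convention, and the step size are illustrative assumptions.

```python
from typing import List, Sequence


def lagrangian_reward(reward: float,
                      costs: Sequence[float],
                      multipliers: Sequence[float]) -> float:
    """Scalarized reward r - sum_i lambda_i * c_i for fixed multipliers.

    An unconstrained RL solver (PPO in the paper) would be trained on
    this signal while the multipliers are held fixed.
    """
    return reward - sum(lam * c for lam, c in zip(multipliers, costs))


def update_multipliers(multipliers: Sequence[float],
                       avg_costs: Sequence[float],
                       thresholds: Sequence[float],
                       step_size: float) -> List[float]:
    """One projected gradient ascent step on the multipliers.

    Assumption: constraint i reads avg_cost_i <= threshold_i, so the
    multiplier grows when the constraint is violated and shrinks
    otherwise. The max(0, .) projection keeps it non-negative.
    """
    return [max(0.0, lam + step_size * (c - d))
            for lam, c, d in zip(multipliers, avg_costs, thresholds)]
```

In such a scheme, the student would alternate between running the RL solver on `lagrangian_reward` for some number of steps and calling `update_multipliers` with the costs observed under the current policy.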