Plotting

 Matteo Turchetta


Safe Exploration for Interactive Machine Learning

Neural Information Processing Systems

In Interactive Machine Learning (IML), we iteratively make decisions and obtain noisy observations of an unknown function. While IML methods, e.g., Bayesian optimization and active learning, have been successful in applications, on real-world systems they must provably avoid unsafe decisions. To this end, safe IML algorithms must carefully learn about a priori unknown constraints without making unsafe decisions. Existing algorithms for this problem learn about the safety of all decisions to ensure convergence. This is sample-inefficient, as it explores decisions that are not relevant for the original IML objective.
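
The safe-decision rule that such algorithms build on can be pictured with a Gaussian process surrogate of the unknown constraint: a decision is only considered if its pessimistic (lower) confidence bound still satisfies the safety threshold. Below is a minimal sketch of that rule, assuming a scikit-learn GP, an illustrative RBF kernel, and placeholder values for the confidence scale beta and threshold h; it is not the paper's exact algorithm.

```python
# Minimal sketch (not the paper's algorithm): certify decisions as safe when the
# GP's pessimistic confidence bound on the unknown constraint clears a threshold.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

candidates = np.linspace(0.0, 1.0, 50).reshape(-1, 1)   # decision space
X_seed = np.array([[0.5]])                               # known-safe seed decision
y_seed = np.array([1.0])                                 # noisy constraint observation
h, beta = 0.0, 2.0                                       # illustrative threshold / scale

gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.2), alpha=1e-2)
gp.fit(X_seed, y_seed)

mean, std = gp.predict(candidates, return_std=True)
lower_bound = mean - beta * std            # pessimistic estimate of the constraint
safe_set = candidates[lower_bound >= h]    # decisions certified safe w.h.p.
print(f"{len(safe_set)} of {len(candidates)} candidates certified safe")
```

The sample-efficiency point in the abstract then amounts to exploring only the decisions within this certified set that matter for the original IML objective, rather than certifying every decision.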


NeurIPS22_data_benchmarks

Neural Information Processing Systems

This means that shorter time horizons train for more episodes. Regardless of the training setup, we evaluate on the random-weather setting. When evaluating trained policies on the test-time, test-location, and test-horizon generalization tasks, we use 20 repetitions. We report the performance on these generalization tasks for the final policy obtained at the end of training.
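
As a concrete illustration of this setup, the sketch below shows how a fixed interaction budget would translate into episode counts for different horizons and how a 20-repetition evaluation loop might look; the budget, horizons, and evaluate_policy stub are hypothetical placeholders, not the benchmark's actual values.

```python
# Hypothetical sketch: a fixed step budget yields more episodes for shorter
# horizons, and trained policies are evaluated with 20 repetitions.
import random
import statistics

budget = 100_000                      # illustrative total environment steps
for horizon in (50, 100, 200):
    print(f"horizon {horizon}: {budget // horizon} training episodes")

def evaluate_policy(policy, seed):
    # Placeholder for a rollout under the random-weather evaluation setting.
    random.seed(seed)
    return random.gauss(0.0, 1.0)

returns = [evaluate_policy(policy=None, seed=rep) for rep in range(20)]
print("mean return over 20 repetitions:", statistics.mean(returns))
```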



NeurIPS20_SafeCL

Neural Information Processing Systems

In this section, we report the hyperparameters that we use for the students, which are CMDP solvers based on an online version of [30], and for the teachers, which are based on the GP-UCB algorithm for multi-armed bandits [44].

A.1 Students

The students comprise two components: an unconstrained RL solver and a no-regret online optimizer. The first component is used to solve the unconstrained RL problem that results from optimizing the Lagrangian of a given CMDP for a fixed value of the Lagrange multipliers. For this, we use the Stable Baselines [25] implementation of the Proximal Policy Optimization (PPO) algorithm [43]. The second component is used to adapt the Lagrange multipliers online.
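
To make the two-component structure concrete, here is a minimal sketch, under stated assumptions, of alternating an unconstrained policy update on the Lagrangian with a projected-gradient update of the Lagrange multiplier. The functions ppo_update and estimate_constraint_cost are hypothetical placeholders rather than the Stable Baselines API, and the cost limit and step size are illustrative.

```python
# Sketch of the student's loop: an unconstrained RL step on the Lagrangian,
# followed by an online (projected gradient ascent) Lagrange-multiplier update.
import random

def ppo_update(policy, lam):
    # Placeholder for one PPO iteration on reward - lam * constraint cost.
    return policy

def estimate_constraint_cost(policy):
    # Placeholder for a Monte Carlo estimate of the expected constraint cost.
    return random.uniform(0.0, 1.0)

policy, lam = None, 0.0
cost_limit, step_size = 0.25, 0.05
for _ in range(100):
    policy = ppo_update(policy, lam)                 # unconstrained RL solver
    violation = estimate_constraint_cost(policy) - cost_limit
    lam = max(0.0, lam + step_size * violation)      # no-regret online optimizer
print("final Lagrange multiplier:", round(lam, 3))
```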



NeurIPS20_SafeCL

Neural Information Processing Systems

In safety-critical applications, autonomous agents may need to learn in an environment where mistakes can be very costly. In such settings, the agent needs to behave safely not only after but also while learning. To achieve this, existing safe reinforcement learning methods make an agent rely on priors that let it avoid dangerous situations during exploration with high probability, but both the probabilistic guarantees and the smoothness assumptions inherent in the priors are not viable in many scenarios of interest, such as autonomous driving. This paper presents an alternative approach inspired by human teaching, where an agent learns under the supervision of an automatic instructor that saves the agent from violating constraints during learning. In this new model, the instructor needs to know neither how to do well at the task the agent is learning nor how the environment works. Instead, it has a library of reset controllers that it activates when the agent starts behaving dangerously, preventing it from doing damage. Crucially, the choices of which reset controller to apply in which situation affect the speed of agent learning. Based on observing the agent's progress, the teacher itself learns a policy (a curriculum) for choosing the reset controllers so as to optimize the agent's final policy reward. Our experiments use this framework in two challenging environments to induce curricula for safe and efficient learning.
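
The teacher's selection step can be pictured as a bandit over intervention types. The sketch below uses a standard UCB rule in place of the paper's GP-UCB teacher, with hypothetical controller names and a placeholder progress signal; it only illustrates the idea of choosing interventions based on the student's observed progress.

```python
# Illustrative UCB-style teacher choosing among hypothetical reset controllers
# based on a placeholder "student progress" signal (the paper uses GP-UCB).
import math
import random

controllers = ["reset_to_start", "reset_to_safe_region", "slow_down"]
counts = {c: 0 for c in controllers}
totals = {c: 0.0 for c in controllers}

def observed_progress(controller):
    # Placeholder: noisy improvement of the student after this intervention.
    base = {"reset_to_start": 0.2, "reset_to_safe_region": 0.5, "slow_down": 0.3}
    return base[controller] + random.gauss(0.0, 0.1)

def ucb_score(c, t):
    # Empirical mean plus an exploration bonus (infinite if never tried).
    if counts[c] == 0:
        return float("inf")
    return totals[c] / counts[c] + math.sqrt(2 * math.log(t) / counts[c])

for t in range(1, 201):
    choice = max(controllers, key=lambda c: ucb_score(c, t))
    counts[choice] += 1
    totals[choice] += observed_progress(choice)

print("intervention counts:", counts)
```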


Safe Exploration in Finite Markov Decision Processes with Gaussian Processes

Neural Information Processing Systems

In classical reinforcement learning, agents accept arbitrary short-term loss for long-term gain when exploring their environment. This is infeasible for safety-critical applications, such as robotics, where even a single unsafe action may cause system failure or harm the environment. In this paper, we address the problem of safely exploring finite Markov decision processes (MDPs). We define safety in terms of an a priori unknown safety constraint that depends on states and actions and satisfies certain regularity conditions expressed via a Gaussian process prior.
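
A minimal sketch of the core certification step, under stated assumptions: a GP is fit to a few noisy observations of the safety function over state-action pairs, and a pair is marked safe only if its pessimistic confidence bound clears the threshold. The grid size, Matern kernel, and threshold are illustrative, and the sketch omits the reachability and returnability checks required by the full algorithm.

```python
# Sketch: certify state-action pairs of a finite MDP as safe when the GP's
# pessimistic bound on the unknown safety function exceeds the threshold.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

n_states, n_actions = 10, 4
sa_pairs = np.array([[s, a] for s in range(n_states) for a in range(n_actions)], dtype=float)

# A few noisy observations of the safety function at already-visited pairs.
X_obs = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
y_obs = np.array([0.8, 0.6, 0.7])

gp = GaussianProcessRegressor(kernel=Matern(length_scale=2.0, nu=2.5), alpha=1e-2)
gp.fit(X_obs, y_obs)

mean, std = gp.predict(sa_pairs, return_std=True)
beta, threshold = 2.0, 0.0
safe_mask = (mean - beta * std) >= threshold   # pessimistic safety certificate
print(f"certified safe pairs: {int(safe_mask.sum())} / {len(sa_pairs)}")
```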