initial state
Supplementary Material
Then each deterministic NN in {ฯw,b | (w,b) Wฯ}is safe if and only if the system of constraints ฮฆ(ฯ,X0,Xu,) is not satisfiable. We prove the equivalent claim that there exists a weight vector (w,b) Wฯ for which ฯw,b is unsafe if and only if ฮฆ(ฯ,X0,Xu,) is satisfiable. First, suppose that there exists a weight vector (w,b) Wฯ for which ฯw,b is unsafe and we want to show that ฮฆ(ฯ,X0,Xu,) is satisfiable. This direction of the proof is straightforward since values of the network's neurons on the unsafe input give rise to a solution of ฮฆ(ฯ,X0,Xu,). Indeed, by assumption there exists a vector of input neuron values x0 X0 for which the corresponding vector of output neuron values xl = ฯw,b(x0) is unsafe, i.e. xl Xu.
Beyond Uniform Sampling: Offline Reinforcement Learning with Imbalanced Datasets
Offline policy learning is aimed at learning decision-making policies using existing datasets of trajectories without collecting additional data. The primary motivation for using reinforcement learning (RL) instead of supervised learning techniques such as behavior cloning is to find a policy that achieves a higher average return than the trajectories constituting the dataset. However, we empirically find that when a dataset is dominated by suboptimal trajectories, state-of-the-art offline RL algorithms do not substantially improve over the average return of trajectories in the dataset. We argue this is due to an assumption made by current offline RL algorithms of staying close to the trajectories in the dataset. If the dataset primarily consists of sub-optimal trajectories, this assumption forces the policy to mimic the suboptimal actions. We overcome this issue by proposing a sampling strategy that enables the policy to only be constrained to "good data" rather than all actions in the dataset (i.e., uniform sampling). We present a realization of the sampling strategy and an algorithm that can be used as a plug-and-play module in standard offline RL algorithms. Our evaluation demonstrates significant performance gains in 72 imbalanced datasets, D4RL dataset, and across three different offline RL algorithms.
One-Shot Imitation Learning
Imitation learning has been commonly applied to solve different tasks in isolation. This usually requires either careful feature engineering, or a significant number of samples. This is far from what we desire: ideally, robots should be able to learn from very few demonstrations of any given task, and instantly generalize to new situations of the same task, without requiring task-specific engineering. In this paper, we propose a meta-learning framework for achieving such capability, which we call one-shot imitation learning. Specifically, we consider the setting where there is a very large (maybe infinite) set of tasks, and each task has many instantiations.