This paper considers the problem of efficient exploration of unseen environments, a key challenge in AI. We propose a learning to explore' framework where we learn a policy from a distribution of environments. At test time, presented with an unseen environment from the same distribution, the policy aims to generalize the exploration strategy to visit the maximum number of unique states in a limited number of steps. We particularly focus on environments with graph-structured state-spaces that are encountered in many important real-world applications like software testing and map building. We formulate this task as a reinforcement learning problem where the exploration' agent is rewarded for transitioning to previously unseen environment states and employ a graph-structured memory to encode the agent's past trajectory.
Niu, Sufeng (Clemson University) | Chen, Siheng (Uber Advanced Technologies Group ) | Guo, Hanyu (Clemson University) | Targonski, Colin (Clemson University) | Smith, Melissa C. (Clemson University) | Kovačević, Jelena (Carnegie Mellon University)
In this paper, we introduce a generalized value iteration network (GVIN), which is an end-to-end neural network planning module. GVIN emulates the value iteration algorithm by using a novel graph convolution operator, which enables GVIN to learn and plan on irregular spatial graphs. We propose three novel differentiable kernels as graph convolution operators and show that the embedding-based kernel achieves the best performance. Furthermore, we present episodic Q-learning, an improvement upon traditional n-step Q-learning that stabilizes training for VIN and GVIN. Lastly, we evaluate GVIN on planning problems in 2D mazes, irregular graphs, and real-world street networks, showing that GVIN generalizes well for both arbitrary graphs and unseen graphs of larger scaleand outperforms a naive generalization of VIN (discretizing a spatial graph into a 2D image).
We present Placeto, a reinforcement learning (RL) approach to efficiently find device placements for distributed neural network training. Unlike prior approaches that only find a device placement for a specific computation graph, Placeto can learn generalizable device placement policies that can be applied to any graph. We propose two key ideas in our approach: (1) we represent the policy as performing iterative placement improvements, rather than outputting a placement in one shot; (2) we use graph embeddings to capture relevant information about the structure of the computation graph, without relying on node labels for indexing. These ideas allow Placeto to train efficiently and generalize to unseen graphs. Our experiments show that Placeto requires up to 6.1x fewer training steps to find placements that are on par with or better than the best placements found by prior approaches.
We present a new graph-based approach for incorporating domain knowledge in reinforcement learning applications. The domain knowledge is given as a weighted graph, or a kernel matrix, that loosely indicates which states should have similar optimal actions. We first introduce a bias into the policy search process by deriving a distribution on policies such that policies that disagree with the provided graph have low probabilities. This distribution corresponds to a Markov Random Field. We then present a reinforcement and an apprenticeship learning algorithms for finding such policy distributions. We also illustrate the advantage of the proposed approach on three problems: swing-up cart-balancing with nonuniform and smooth frictions, gridworlds, and teaching a robot to grasp new objects.
Groshev, Edward (University of California, Berkeley) | Tamar, Aviv (University of California, Berkeley) | Goldstein, Maxwell (Princeton University) | Srivastava, Siddharth (Arizona State University) | Abbeel, Pieter (University of California Berkeley)
We present a new approach to learning for planning, where knowledge acquired while solving a given set of planning problems is used to plan faster in related, but new problem instances. We show that a deep neural network can be used to learn and represent a generalized reactive policy (GRP) that maps a problem instance and a state to an action, and that the learned GRPs efficiently solve large classes of challenging problem instances. In contrast to prior efforts in this direction, our approach significantly reduces the dependence of learning on handcrafted domain knowledge or feature selection. Instead, the GRP is trained from scratch using a set of successful execution traces. We show that our approach can also be used to automatically learn a heuristic function that can be used in directed search algorithms. We evaluate our approach using an extensive suite of experiments on two challenging planning problem domains and show that our approach facilitates learning complex decision making policies and powerful heuristic functions with minimal human input