Learning To Reach Goals Without Reinforcement Learning
Ghosh, Dibya, Gupta, Abhishek, Fu, Justin, Reddy, Ashwin, Devin, Coline, Eysenbach, Benjamin, Levine, Sergey
Dibya Ghosh* 1, Abhishek Gupta* 1, Justin Fu 1, Ashwin Reddy 1, Coline Devin 1, Benjamin Eysenbach 2, Sergey Levine 1
1 University of California, Berkeley; 2 Carnegie Mellon University
Abstract
Imitation learning algorithms provide a simple and straightforward approach for training control policies via supervised learning. By maximizing the likelihood of good actions provided by an expert demonstrator, supervised imitation learning can produce effective policies without the algorithmic complexities and optimization challenges of reinforcement learning, at the cost of requiring an expert demonstrator to provide the demonstrations. In this paper, we ask: can we take insights from imitation learning to design algorithms that can effectively acquire optimal policies from scratch without any expert demonstrations? The key observation that makes this possible is that, in the multi-task setting, trajectories that are generated by a suboptimal policy can still serve as optimal examples for other tasks. In particular, when tasks correspond to different goals, every trajectory is a successful demonstration for the goal state that it actually reaches. We propose a simple algorithm for learning goal-reaching behaviors without any demonstrations, complicated user-provided reward functions, or complex reinforcement learning methods. Our method simply maximizes the likelihood of actions the agent actually took in its own previous rollouts, conditioned on the goal being the state that it actually reached. Although related variants of this approach have been proposed previously in imitation learning with demonstrations, we show how this approach can effectively learn goal-reaching policies from scratch.
We present a theoretical result linking self-supervised imitation learning and reinforcement learning, and empirical results showing that it performs competitively with more complex reinforcement learning methods on a range of challenging goal-reaching problems, while yielding advantages in terms of stability and use of offline data.
1 Introduction
Reinforcement learning (RL) algorithms hold the promise of providing a broadly applicable tool for automating control, and the combination of high-capacity deep neural network models with RL extends their applicability to settings with complex observations and that require intricate policies. However, RL with function approximation, including deep RL, presents a challenging optimization problem. Despite years of research, current deep RL methods are far from a turnkey solution: most popular methods lack convergence guarantees (Baird, 1995; Tsitsiklis & Van Roy, 1997) or require prohibitive numbers of samples (Schulman et al., 2015; Lillicrap et al., 2015).
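The relabeling idea in the abstract above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `relabel_trajectory` and `fit_tabular_policy` are hypothetical, the policy is a simple lookup table rather than a neural network, and treating every later state in a rollout as a reached goal is an assumption made for illustration.

```python
from collections import Counter, defaultdict


def relabel_trajectory(states, actions):
    """Turn one rollout into goal-conditioned supervised examples.

    `states` has one more entry than `actions`. Each (state, action) pair is
    paired with every later state in the same rollout, which is treated as
    the goal that the action helped reach (an assumption of this sketch).
    """
    examples = []
    for t, action in enumerate(actions):
        for h in range(t + 1, len(states)):
            # (current state, relabeled goal, steps until goal, action taken)
            examples.append((states[t], states[h], h - t, action))
    return examples


def fit_tabular_policy(examples):
    """Maximum-likelihood tabular policy: most frequent action per (state, goal)."""
    counts = defaultdict(Counter)
    for state, goal, _horizon, action in examples:
        counts[(state, goal)][action] += 1
    return {key: c.most_common(1)[0][0] for key, c in counts.items()}


# A rollout on a 1-D line that only ever moved right.
rollout_states = [0, 1, 2]
rollout_actions = ["right", "right"]
data = relabel_trajectory(rollout_states, rollout_actions)
policy = fit_tabular_policy(data)
```

Here the rollout was never told to reach state 2, yet after relabeling it supplies a valid example of how to get there from state 0 or state 1.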
CLOSURE: Assessing Systematic Generalization of CLEVR Models
Bahdanau, Dzmitry, de Vries, Harm, O'Donnell, Timothy J., Murty, Shikhar, Beaudoin, Philippe, Bengio, Yoshua, Courville, Aaron
Dzmitry Bahdanau 1,2,3, Harm de Vries 2, Timothy J. O'Donnell 1,4, Shikhar Murty 5, Philippe Beaudoin 2, Yoshua Bengio 1,3,6, Aaron Courville 1,3,6
1 Mila, Quebec Artificial Intelligence Institute; 2 Element AI; 3 Université de Montréal; 4 McGill University; 5 Stanford University; 6 CIFAR Fellow
Abstract
The CLEVR dataset of natural-looking questions about 3D-rendered scenes has recently received much attention from the research community. A number of models have been proposed for this task, many of which achieved very high accuracies of around 97-99%. In this work, we study how systematic the generalization of such models is, that is, to what extent they are capable of handling novel combinations of known linguistic constructs. To this end, we test models' understanding of referring expressions based on matching object properties (such as "the object that is the same size as the red ball") in novel contexts. Our experiments on the thereby constructed CLOSURE benchmark show that state-of-the-art models often do not exhibit systematicity after being trained on CLEVR. Surprisingly, we find that an explicitly compositional Neural Module Network model also generalizes badly on CLOSURE, even when it has access to the ground-truth programs at test time. We improve the NMN's systematic generalization by developing a novel Vector-NMN module architecture with vector-valued inputs and outputs. Lastly, we investigate the extent to which few-shot transfer learning can help models that are pretrained on CLEVR to adapt to CLOSURE. Our few-shot learning experiments contrast the adaptation behavior of the models with intermediate discrete programs with that of the end-to-end continuous models.
1 Introduction
The ability to communicate in natural language and ground it effectively into our rich unstructured 3D reality is a crucial skill that we expect from artificial agents of the future.
A popular task to benchmark progress towards this goal is Visual Question Answering (VQA), in which one must give a (typically short) answer to a question about the content of an image.
The PlayStation Reinforcement Learning Environment (PSXLE)
Purves, Carlos, Cangea, Cătălina, Veličković, Petar
We propose a new benchmark environment for evaluating Reinforcement Learning (RL) algorithms: the PlayStation Learning Environment (PSXLE), a PlayStation emulator modified to expose a simple control API that enables rich game-state representations. We argue that the PlayStation serves as a suitable progression for agent evaluation and propose a framework for such an evaluation. We build an action-driven abstraction for a PlayStation game with support for the OpenAI Gym interface and demonstrate its use by running OpenAI Baselines.
Graph Neural Networks for Decentralized Multi-Robot Path Planning
Li, Qingbiao, Gama, Fernando, Ribeiro, Alejandro, Prorok, Amanda
Efficient and collision-free navigation in multi-robot systems is fundamental to advancing mobility. Scenarios where the robots are restricted in observation and communication range call for decentralized solutions, whereby robots execute localized planning policies. From the point of view of an individual robot, however, its local decision-making system is incomplete, since other agents' unobservable states affect future values. The manner in which information is shared is crucial to the system's performance, yet is not well addressed by current approaches. To address these challenges, we propose a combined architecture, with the goal of learning a decentralized sequential action policy that yields efficient path plans for all robots. Our framework is composed of a convolutional neural network (CNN) that extracts adequate features from local observations, and a graph neural network (GNN) that communicates these features among robots. We train the model to imitate an expert algorithm, and use the resulting model online in decentralized planning involving only local communication. We evaluate our method in simulations involving teams of robots in cluttered workspaces. We measure the success rates and sum of costs over the planned paths. The results show a performance close to that of our expert algorithm, demonstrating the validity of our approach. In particular, we show our model's capability to generalize to previously unseen cases (involving larger environments and larger robot teams).
Game Design for Eliciting Distinguishable Behavior
Yang, Fan, Leqi, Liu, Wu, Yifan, Lipton, Zachary C., Ravikumar, Pradeep, Cohen, William W., Mitchell, Tom
The ability to infer latent psychological traits from human behavior is key to developing personalized human-interacting machine learning systems. Approaches to infer such traits range from surveys to manually-constructed experiments and games. However, these traditional games are limited because they are typically designed based on heuristics. In this paper, we formulate the task of designing behavior diagnostic games that elicit distinguishable behavior as a mutual information maximization problem, which can be solved by optimizing a variational lower bound. Our framework is instantiated by using prospect theory to model varying player traits, and Markov Decision Processes to parameterize the games. We validate our approach empirically, showing that our designed games can successfully distinguish among players with different traits, outperforming manually-designed ones by a large margin.
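For readers unfamiliar with the term, a standard variational lower bound on mutual information has the form below. The symbols are assumptions chosen for illustration and are not taken from the paper: \theta stands for a player's latent trait, \tau for the behavior observed in the game, and q for any approximate posterior over traits. Whether the paper uses exactly this bound is not stated in the abstract.

```latex
\[
I(\theta; \tau) \;=\; H(\theta) - H(\theta \mid \tau)
\;\ge\; H(\theta) + \mathbb{E}_{p(\theta, \tau)}\!\left[ \log q(\theta \mid \tau) \right]
\]
```

The bound is tight when q(\theta \mid \tau) equals the true posterior p(\theta \mid \tau), so maximizing the right-hand side over both the game and q pushes the game toward eliciting behavior from which traits are easy to predict.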
Extending Machine Language Models toward Human-Level Language Understanding
McClelland, James L., Hill, Felix, Rudolph, Maja, Baldridge, Jason, Schütze, Hinrich
Language is central to human intelligence. We review recent breakthroughs in machine language processing and consider what remains to be achieved. Recent approaches rely on domain general principles of learning and representation captured in artificial neural networks. Most current models, however, focus too closely on language itself. In humans, language is part of a larger system for acquiring, representing, and communicating about objects and situations in the physical and social world, and future machine language models should emulate such a system. We describe existing machine models linking language to concrete situations, and point toward extensions to address more abstract cases. Human language processing exploits complementary learning systems, including a deep neural network-like learning system that learns gradually as machine systems do, as well as a fast-learning system that supports learning new information quickly. Adding such a system to machine language models will be an important further step toward truly human-like language understanding.
Formal Verification of Debates in Argumentation Theory
Jha, Ria, Belardinelli, Francesco, Toni, Francesca
Humans engage in informal debates on a daily basis. By expressing their opinions and ideas in an argumentative fashion, they are able to gain a deeper understanding of a given problem and, in some cases, find the best possible course of action towards resolving it. In this paper, we develop a methodology to verify debates formalised as abstract argumentation frameworks. We first present a translation from debates to transition systems. Such transition systems can model debates and represent their evolution over time using a finite set of states. We then formalise relevant debate properties using temporal and strategy logics. These formalisations, along with a debate transition system, allow us to verify whether a given debate satisfies certain properties. The verification process can be automated using model checkers. Therefore, we also measure their performance when verifying debates, and use the results to discuss the feasibility of model checking debates.
Automatic Layout Generation with Applications in Machine Learning Engine Evaluation
Yang, Haoyu, Chen, Wen, Pathak, Piyush, Gennari, Frank, Lai, Ya-Chieh, Yu, Bei
Machine learning-based lithography hotspot detection has been deeply studied recently, from various feature extraction techniques to efficient learning models. It has been observed that such machine learning-based frameworks are providing satisfactory metal layer hotspot prediction results on known public metal layer benchmarks. In this work, we seek to evaluate how these machine learning-based hotspot detectors generalize to complicated patterns. We first introduce an automatic layout generation tool that can synthesize various layout patterns given a set of design rules. The tool currently supports both metal layer and via layer generation. As a case study, we conduct hotspot detection on the generated via layer layouts with representative machine learning-based hotspot detectors, which shows that continuous study on model robustness and generality is necessary to prototype and integrate the learning engines in DFM flows. The source code of the layout generation tool will be available at https://github.com/phdyang007/layout-generation.
Learning Improvement Heuristics for Solving the Travelling Salesman Problem
Wu, Yaoxin, Song, Wen, Cao, Zhiguang, Zhang, Jie, Lim, Andrew
Recent studies in using deep learning to solve the Travelling Salesman Problem (TSP) focus on construction heuristics, the solution of which may still be far from optimality. To improve solution quality, additional procedures such as sampling or beam search are required. However, they are still based on the same construction policy, which is less effective in refining a solution. In this paper, we propose to directly learn the improvement heuristics for solving TSP based on deep reinforcement learning. We first present a reinforcement learning formulation for the improvement heuristic, where the policy guides selection of the next solution. Then, we propose a deep architecture as the policy network based on self-attention. Extensive experiments show that improvement policies learned by our approach yield better results than state-of-the-art methods, even from random initial solutions. Moreover, the learned policies are more effective than the traditional handcrafted ones, and robust to different initial solutions of either high or poor quality.
1 Introduction
The Travelling Salesman Problem (TSP) is a typical combinatorial optimization problem that has extensive applications in the real world. The problem statement is straightforward: given a set of locations, find a shortest tour for the salesman that visits each location exactly once and returns to the starting one. Although it has been widely studied for decades, achieving satisfactory performance is still challenging due to its NP-hard complexity.
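To make the notion of an improvement heuristic concrete, here is a minimal Python sketch of a handcrafted one. It assumes a 2-opt move (reversing a segment of the tour) as the local operator and a greedy acceptance rule; neither is taken from the abstract. In the paper's setting a learned policy would choose the next solution, which this sketch replaces with the simple "accept any shorter tour" rule. The function names are hypothetical.

```python
import math


def tour_length(points, tour):
    """Total length of the closed tour visiting points in the given order."""
    n = len(tour)
    return sum(math.dist(points[tour[i]], points[tour[(i + 1) % n]]) for i in range(n))


def two_opt_move(tour, i, j):
    """Return a new tour with the segment tour[i..j] (inclusive) reversed."""
    return tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]


def improve(points, tour):
    """Greedy 2-opt: keep applying any move that shortens the tour."""
    best = list(tour)
    best_len = tour_length(points, best)
    improved = True
    while improved:
        improved = False
        for i in range(1, len(best) - 1):
            for j in range(i + 1, len(best)):
                candidate = two_opt_move(best, i, j)
                candidate_len = tour_length(points, candidate)
                if candidate_len < best_len - 1e-12:  # strict improvement only
                    best, best_len = candidate, candidate_len
                    improved = True
    return best, best_len


# Four corners of a unit square, visited in a self-crossing order.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
start_tour = [0, 2, 1, 3]
final_tour, final_len = improve(square, start_tour)
```

Starting from the crossing tour, a single segment reversal removes the crossing and yields the perimeter tour of the square.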
A researcher in Japan designed an AI program for Othello that always loses to human players
A new online version of the game Othello has become a hit in Japan because the AI has been designed to always lose, and players love it. The game, called 'The Weakest AI Othello,' was released in August and has since attracted over 400,000 players for more than 1.29 million games. It was developed by Takuma Yoshida, who works at Avilen, a Tokyo firm that designs AI and machine learning tools for businesses. One day at work, Yoshida began to question why he was spending so much time trying to engineer software to outperform humans. He wondered whether human attitudes toward AI and robotics might be different if humans didn't always expect to be beaten by them, according to a report in the Asahi Shimbun.