Reinforcement Learning
Reinforcement Learning framework Dopamine opens up to new environments • DEVCLASS
Dopamine, a framework for experimenting with reinforcement learning (RL), has reached the 2.0 mark and now allows the use of custom environments, just half a year after its initial launch. The project is based on the popular numerical computation library TensorFlow and stems from a team of researchers at Google, though it isn't an official product of the company. It was meant for speculative research purposes and focuses on providing only a few heavily tested RL algorithms in an easy-to-use way. That is why the first iteration of the framework included only a single-GPU agent with implementations of n-step Bellman updates, prioritized experience replay, distributional reinforcement learning, and the Deep Q-Networks algorithm. According to a paper by members of the DeepMind team, which is also part of the Alphabet family, those approaches are among the most important components of state-of-the-art reinforcement learning systems.
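To make the term "n-step Bellman update" concrete, here is a minimal sketch of how an n-step target can be computed. It is not Dopamine's code: the function name `n_step_target` and its arguments are hypothetical, and the bootstrap value stands in for the max over Q-values at the state reached after n steps.

```python
from typing import Sequence


def n_step_target(rewards: Sequence[float], bootstrap_value: float, gamma: float) -> float:
    """Return r_0 + gamma*r_1 + ... + gamma^(n-1)*r_(n-1) + gamma^n * bootstrap_value.

    `rewards` holds the n rewards observed after the current state;
    `bootstrap_value` is an assumed estimate such as max_a Q(s_n, a).
    """
    target = bootstrap_value
    # Fold the rewards in from the last step back to the first.
    for reward in reversed(rewards):
        target = reward + gamma * target
    return target


# Three rewards of 1.0, a bootstrap estimate of 10.0, discount 0.5.
example = n_step_target([1.0, 1.0, 1.0], bootstrap_value=10.0, gamma=0.5)
print(example)  # 3.0
```

With n = 1 this reduces to the usual one-step target; larger n relies more on observed rewards and less on the bootstrap estimate.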
Microscopic Traffic Simulation by Cooperative Multi-agent Deep Reinforcement Learning
Bacchiani, Giulio, Molinari, Daniele, Patander, Marco
Expert human drivers perform actions relying on traffic laws and their previous experience. While traffic laws are easily embedded into an artificial brain, modeling complex human behaviors that come from past experience is a more challenging task. One of these behaviors is the capability of communicating intentions and negotiating the right of way through driving actions, as when a driver is entering a crowded roundabout and observes other cars' movements to guess the best time to merge in. In addition, each driver has their own unique driving style, which is conditioned both by personal characteristics, such as age and quality of sight, and by external factors, such as being late or in a bad mood. For these reasons, the interaction between different drivers is not trivial to simulate in a realistic manner. In this paper, this problem is addressed by developing a microscopic simulator using a Deep Reinforcement Learning algorithm based on a combination of visual frames, representing the perception around the vehicle, and a vector of numerical parameters. In particular, the Asynchronous Advantage Actor-Critic algorithm has been extended to a multi-agent scenario in which every agent needs to learn to interact with other similar agents. Moreover, the model includes a novel architecture in which the driving style of each vehicle is adjustable by tuning some of its input parameters, making it possible to simulate drivers with different levels of aggressiveness and desired cruising speeds.
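The "advantage" in Asynchronous Advantage Actor-Critic can be illustrated with a short sketch. It shows only the generic advantage estimate for a single rollout (discounted return minus the critic's value estimate), not the paper's multi-agent extension; the function name `rollout_advantages` and all inputs are hypothetical.

```python
from typing import List, Sequence


def rollout_advantages(
    rewards: Sequence[float],
    values: Sequence[float],
    bootstrap_value: float,
    gamma: float,
) -> List[float]:
    """Return A_t = R_t - V(s_t) for each step of one rollout.

    R_t is the discounted return from step t, bootstrapped with
    `bootstrap_value` (the critic's estimate for the state after the rollout).
    """
    if len(rewards) != len(values):
        raise ValueError("rewards and values must have the same length")
    advantages = [0.0] * len(rewards)
    running_return = bootstrap_value
    # Walk backwards so each return reuses the one computed after it.
    for t in range(len(rewards) - 1, -1, -1):
        running_return = rewards[t] + gamma * running_return
        advantages[t] = running_return - values[t]
    return advantages


print(rollout_advantages([1.0, 0.0, 2.0], [0.5, 0.5, 0.5], 0.0, 0.5))  # [1.0, 0.5, 1.5]
```

A positive advantage means the action did better than the critic expected, so the actor's probability of taking it is increased.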
Model Primitive Hierarchical Lifelong Reinforcement Learning
Wu, Bohan, Gupta, Jayesh K., Kochenderfer, Mykel J.
Learning interpretable and transferable subpolicies and performing task decomposition from a single, complex task is difficult. Some traditional hierarchical reinforcement learning techniques enforce this decomposition in a top-down manner, while meta-learning techniques require a task distribution at hand to learn such decompositions. This paper presents a framework for using diverse suboptimal world models to decompose complex task solutions into simpler modular subpolicies. This framework performs automatic decomposition of a single source task in a bottom-up manner, concurrently learning the required modular subpolicies as well as a controller to coordinate them. We perform a series of experiments on high-dimensional continuous-action control tasks to demonstrate the effectiveness of this approach at both complex single-task learning and lifelong learning. Finally, we perform ablation studies to understand the importance and robustness of different elements in the framework and the limitations of this approach.
Impossibility and Uncertainty Theorems in AI Value Alignment (or why your AGI should not have a utility function)
Utility functions or their equivalents (value functions, objective functions, loss functions, reward functions, preference orderings) are a central tool in most current machine learning systems. These mechanisms for defining goals and guiding optimization run into practical and conceptual difficulty when there are independent, multi-dimensional objectives that need to be pursued simultaneously and cannot be reduced to each other. Ethicists have proved several impossibility theorems that stem from this origin; those results appear to show that there is no way of formally specifying what it means for an outcome to be good for a population without violating strong human ethical intuitions (in such cases, the objective function is a social welfare function). We argue that this is a practical problem for any machine learning system (such as medical decision support systems or autonomous weapons) or rigidly rule-based bureaucracy that will make high-stakes decisions about human lives: such systems should not use objective functions in the strict mathematical sense. We explore the alternative of using uncertain objectives, represented for instance as partially ordered preferences, or as probability distributions over total orders. We show that previously known impossibility theorems can be transformed into uncertainty theorems in both of those settings, and prove lower bounds on how much uncertainty is implied by the impossibility results. We close by proposing two conjectures about the relationship between uncertainty in objectives and severe unintended consequences from AI systems.
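The two representations of uncertain objectives mentioned in the abstract can be sketched in a few lines. This is an illustration only, not the paper's construction: the outcomes "A", "B", "C", the probabilities, and the helper names `prob_prefers` and `certain_preferences` are all made up. A distribution over total orders yields a probability for each pairwise preference, and keeping only the pairs that hold with certainty gives a partial order.

```python
from itertools import permutations
from typing import Dict, Set, Tuple

Order = Tuple[str, ...]  # a total order, most preferred outcome first


def prob_prefers(dist: Dict[Order, float], a: str, b: str) -> float:
    """Probability that outcome `a` is ranked above outcome `b`."""
    return sum(p for order, p in dist.items() if order.index(a) < order.index(b))


def certain_preferences(dist: Dict[Order, float], tol: float = 1e-12) -> Set[Tuple[str, str]]:
    """Pairs (a, b) where a beats b in every order with nonzero probability.

    These pairs form a partial order; all other pairs stay incomparable.
    """
    outcomes = next(iter(dist))
    return {
        (a, b)
        for a, b in permutations(outcomes, 2)
        if prob_prefers(dist, a, b) >= 1.0 - tol
    }


# Hypothetical belief over how three outcomes should be ranked.
belief = {
    ("A", "B", "C"): 0.5,
    ("A", "C", "B"): 0.3,
    ("B", "A", "C"): 0.2,
}
print(prob_prefers(belief, "A", "B"))
print(certain_preferences(belief))
```

In this example only "A over C" holds in every candidate ranking, so a system acting on the partial order would treat the other comparisons as unresolved rather than forcing a single utility value.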
Hybrid Actor-Critic Reinforcement Learning in Parameterized Action Space
Fan, Zhou, Su, Rui, Zhang, Weinan, Yu, Yong
In this paper we propose a hybrid architecture of actor-critic algorithms for reinforcement learning in parameterized action space, which consists of multiple parallel sub-actor networks to decompose the structured action space into simpler action spaces, along with a critic network to guide the training of all sub-actor networks. While this paper is mainly focused on parameterized action space, the proposed architecture, which we call hybrid actor-critic, can be extended to more general action spaces that have a hierarchical structure. We present an instance of the hybrid actor-critic architecture based on proximal policy optimization (PPO), which we refer to as hybrid proximal policy optimization (H-PPO). Our experiments test H-PPO on a collection of tasks with parameterized action space, where H-PPO demonstrates superior performance over previous methods of parameterized action reinforcement learning.
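A parameterized action pairs a discrete choice with continuous parameters for that choice. The sketch below shows only how such an action might be sampled from the outputs of two parallel sub-actors; it is an assumption-laden illustration, not the H-PPO implementation. The fixed logits, means, and standard deviations stand in for network outputs, and `sample_hybrid_action` is a hypothetical name.

```python
import math
import random
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_hybrid_action(
    logits: Sequence[float],
    param_means: Sequence[Sequence[float]],
    param_stds: Sequence[Sequence[float]],
    rng: random.Random,
) -> Tuple[int, List[float]]:
    """Sample (discrete action index, continuous parameters for that action).

    `logits` plays the role of the discrete sub-actor's output;
    `param_means[k]` / `param_stds[k]` play the role of the continuous
    sub-actor's output for discrete action k.
    """
    probs = softmax(logits)
    k = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    params = [rng.gauss(mu, sd) for mu, sd in zip(param_means[k], param_stds[k])]
    return k, params


# Hypothetical task with three discrete actions, e.g. move(power, angle),
# turn(angle), and stop() with no parameters.
rng = random.Random(0)
means = [[0.5, 0.0], [1.0], []]
stds = [[0.1, 0.2], [0.3], []]
action, parameters = sample_hybrid_action([2.0, 1.0, 0.1], means, stds, rng)
print(action, parameters)
```

Only the parameters belonging to the sampled discrete action are used; the parameters the continuous sub-actor produced for the other actions are ignored at that step.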
Online Data Poisoning Attack
We study data poisoning attacks in the online learning setting, where the training items stream in one at a time and the adversary perturbs the current training item to manipulate present and future learning. In contrast, prior work on data poisoning attacks has focused either on batch learners in the offline setting or on online learners with full knowledge of the whole training sequence. We show that online poisoning attacks can be formulated as stochastic optimal control, and provide several practical attack algorithms based on control and deep reinforcement learning. Extensive experiments demonstrate the effectiveness of the attacks.
Optimizing Object-based Perception and Control by Free-Energy Principle
Li, Minne, Nashikkar, Pranav, Wang, Jun
One of the well-known formulations of human perception is a hierarchical inference model based on the interaction between conceptual knowledge and sensory stimuli from the partially observable environment. This model helps humans learn inductive biases and guides their behavior by minimizing their surprise at observations. However, most model-based reinforcement learning still lacks support for object-based physical reasoning. In this paper, we propose Object-based Perception Control (OPC). It uses the free-energy principle to combine learning to perceive objects in a scene with learning to control those objects in the perceived environment. Extensive experiments on high-dimensional pixel environments show that OPC outperforms several strong baselines in accumulated rewards and the quality of perceptual grouping.
Strong Asymptotic Optimality in General Environments
Cohen, Michael K., Catt, Elliot, Hutter, Marcus
Reinforcement Learning agents are expected to eventually perform well. Typically, this takes the form of a guarantee about the asymptotic behavior of an algorithm given some assumptions about the environment. We present an algorithm for a policy whose value approaches the optimal value with probability 1 in all computable probabilistic environments, provided the agent has a bounded horizon. This is known as strong asymptotic optimality, and it was previously unknown whether it was possible for a policy to be strongly asymptotically optimal in the class of all computable probabilistic environments. Our agent, Inquisitive Reinforcement Learner (Inq), is more likely to explore the more it expects an exploratory action to reduce its uncertainty about which environment it is in, hence the term inquisitive. Exploring inquisitively is a strategy that can be applied generally; for more manageable environment classes, inquisitiveness is tractable. We conducted experiments in "grid-worlds" to compare the Inquisitive Reinforcement Learner to other weakly asymptotically optimal agents.
NoRML: No-Reward Meta Learning
Yang, Yuxiang, Caluwaerts, Ken, Iscen, Atil, Tan, Jie, Finn, Chelsea
Efficiently adapting to new environments and changes in dynamics is critical for agents to successfully operate in the real world. Reinforcement learning (RL) based approaches typically rely on external reward feedback for adaptation. However, in many scenarios this reward signal might not be readily available for the target task, or the difference between the environments can be implicit and only observable from the dynamics. To this end, we introduce a method that allows for self-adaptation of learned policies: No-Reward Meta Learning (NoRML). NoRML extends Model Agnostic Meta Learning (MAML) for RL and uses observable dynamics of the environment instead of an explicit reward function in MAML's fine-tuning step. Our method has a more expressive update step than MAML, while maintaining MAML's gradient-based foundation. Additionally, in order to allow more targeted exploration, we implement an extension to MAML that effectively disconnects the meta-policy parameters from the fine-tuned policies' parameters. We first study our method on a number of synthetic control problems and then validate our method on common benchmark environments, showing that NoRML outperforms MAML when the dynamics change between tasks.
Translating Between Statistics and Machine Learning
I recently confronted this when I began reading about maximum causal entropy as part of a project on inverse reinforcement learning. Many of the terms were unfamiliar to me, but as I read more closely, I realized that the concepts were closely related to statistical concepts. This blog post presents a table of connections between terms that are standard in statistics and their counterparts in machine learning. Understanding a result in machine learning can help to avoid reinventing the wheel in statistics, and vice versa. My ability to understand inverse reinforcement learning benefited from my training in statistics because I was able to translate machine learning terminology into statistical terminology.