Reinforcement Learning
Reinforcement Learning of Control Policy for Linear Temporal Logic Specifications Using Limit-Deterministic Büchi Automata
Oura, Ryohei, Sakakibara, Ami, Ushio, Toshimitsu
This letter proposes a novel reinforcement learning method for the synthesis of a control policy satisfying a control specification described by a linear temporal logic formula. We assume that the controlled system is modeled by a Markov decision process (MDP). We transform the specification to a limit-deterministic Büchi automaton (LDBA) with several accepting sets that accepts all infinite sequences satisfying the formula. The LDBA is augmented so that it explicitly records the previous visits to accepting sets. We take a product of the augmented LDBA and the MDP, based on which we define a reward function. The agent gets rewards whenever state transitions are in an accepting set that has not been visited for a certain number of steps. Consequently, sparsity of rewards is relaxed and optimal circulations among the accepting sets are learned. We show that the proposed method can learn an optimal policy when the discount factor is sufficiently close to one.
Reinforcement Learning for the Enterprise - DZone AI
This article is featured in the new DZone Guide to Artificial Intelligence. Humans have a unique ability to adapt to dynamic environments and to learn from their surroundings and failures. Machines lack this ability, and artificial intelligence seeks to correct that deficiency. However, traditional supervised machine learning techniques require large amounts of suitable historical data to learn patterns and then act on them.
Reinforcement Learning and Its Implications for Enterprise Artificial Intelligence
Deep RL uses deep learning in conjunction with RL to cope with cases where the search space is very large, or the environment is very complicated with multi-dimensional states, actions, and rewards. A common form is deep Q-learning, in which a deep neural network serves as a function approximator for the Q function: it predicts the expected cumulative reward of each action for a given input state, rather than exploring and storing values for every state-action pair. In simulation environments, feeding the raw pixels of an environment through a neural network also lets the reinforcement algorithm learn directly from what it observes. For the most part, RL is being used to teach AI systems how to play games, as games provide a safe and bounded environment for learning. For example, AlphaGo uses RL (in combination with other techniques), and similar techniques have been used to have AI learn Atari games or become champions at poker.
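To make the role of the Q function concrete, the following is a minimal sketch of the tabular Q-learning update, the table that a deep network replaces when there are too many states to enumerate. The five-state chain environment, the function name `q_learning_chain`, and all hyperparameters are assumptions chosen for illustration; they do not come from the article.

```python
import random

def q_learning_chain(n_states=5, episodes=500, alpha=0.5, gamma=0.9,
                     epsilon=0.1, seed=0):
    """Tabular Q-learning on a hypothetical chain: start at state 0,
    reward 1.0 on reaching the last state. Actions: 0 = left, 1 = right."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_states)]  # q[state][action]
    goal = n_states - 1
    for _ in range(episodes):
        s = 0
        while s != goal:
            # Epsilon-greedy action choice (ties are broken towards "right").
            if rng.random() < epsilon:
                a = rng.randrange(2)
            else:
                a = 0 if q[s][0] > q[s][1] else 1
            s_next = max(0, s - 1) if a == 0 else s + 1
            r = 1.0 if s_next == goal else 0.0
            # Q-learning update: move q[s][a] towards the bootstrapped target.
            target = r + gamma * max(q[s_next])
            q[s][a] += alpha * (target - q[s][a])
            s = s_next
    return q

q_table = q_learning_chain()
greedy_policy = [0 if left > right else 1 for left, right in q_table[:-1]]
print(greedy_policy)
```

In a deep Q-learning setup the list `q` would be replaced by a network that maps a state (for example, a frame of pixels) to one estimated value per action, trained towards the same kind of target.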
Data Answers Enterprise Reinforcement Learning Challenges
The applicability of RL in the enterprise is vast and largely untapped. To date, most Deep Reinforcement Learning successes have focused on its application to games and robotics. In such cases, emulators and simulators are readily available and present the perfect environment in which to run trials without risk. By contrast, many of the problems that companies wish to solve do not come with a risk-free testing environment: It can be difficult and sometimes impossible to allow an AI agent to freely and rapidly explore the impact of its potential actions through trial and error. But the availability of a simulator is not essential to effectively applying RL techniques in enterprise settings.
Reinforcement Learning visualised with a predator prey ball game
This is a follow-up to a previous article, where we looked at a simple Reinforcement Learning (RL) game in which a green ball learnt to reach a small circle at the centre of a canvas within 200 steps. We wrote a Q-learning algorithm and visualised it using a Tkinter-based GUI. We will now give the green ball a slightly more complicated challenge. The aim is again to learn to reach the centre within 200 steps, but now there is another ball, a red ball, which the green ball must avoid. The red ball starts near the circle and moves randomly.
When Humans Aren't Optimal: Robots that Collaborate with Risk-Aware Humans
Kwon, Minae, Biyik, Erdem, Talati, Aditi, Bhasin, Karan, Losey, Dylan P., Sadigh, Dorsa
In order to collaborate safely and efficiently, robots need to anticipate how their human partners will behave. Some of today's robots model humans as if they were also robots, and assume users are always optimal. Other robots account for human limitations, and relax this assumption so that the human is noisily rational. Both of these models make sense when the human receives deterministic rewards: i.e., gaining either $100 or $130 with certainty. But in real world scenarios, rewards are rarely deterministic. Instead, we must make choices subject to risk and uncertainty--and in these settings, humans exhibit a cognitive bias towards suboptimal behavior. For example, when deciding between gaining $100 with certainty or $130 only 80% of the time, people tend to make the risk-averse choice--even though it leads to a lower expected gain! In this paper, we adopt a well-known Risk-Aware human model from behavioral economics called Cumulative Prospect Theory and enable robots to leverage this model during human-robot interaction (HRI). In our user studies, we offer supporting evidence that the Risk-Aware model more accurately predicts suboptimal human behavior. We find that this increased modeling accuracy results in safer and more efficient human-robot collaboration. Overall, we extend existing rational human models so that collaborative robots can anticipate and plan around suboptimal human behavior during HRI.
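The arithmetic behind the $100 versus $130 example can be checked with a short sketch. The expected-value part follows directly from the numbers in the abstract. The Cumulative Prospect Theory part is an assumption-laden illustration: it uses the power value function and probability weighting function with the commonly quoted parameter values 0.88 and 0.61 for gains, which are not taken from this paper, and it only handles lotteries with a single non-zero gain, where the cumulative weighting reduces to weighting that one probability.

```python
def expected_value(outcomes):
    """Expected gain of a lottery given as (amount, probability) pairs."""
    return sum(p * x for x, p in outcomes)

def cpt_value(outcomes, alpha=0.88, gamma=0.61):
    """Simplified CPT valuation for gains with at most one non-zero outcome.
    alpha and gamma are illustrative assumptions, not values from the paper."""
    def weight(p):
        # Inverse-S probability weighting: overweights small p, underweights large p.
        return p ** gamma / (p ** gamma + (1 - p) ** gamma) ** (1 / gamma)
    return sum(weight(p) * x ** alpha for x, p in outcomes if x > 0)

certain = [(100, 1.0)]
risky = [(130, 0.8), (0, 0.2)]

print(expected_value(certain), expected_value(risky))
print(cpt_value(certain) > cpt_value(risky))
```

Under these assumptions the risky option has the higher expected gain (0.8 × 130 = 104 versus 100), while the risk-aware valuation ranks the certain option higher, which matches the risk-averse choice described above.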
Statistical Inference of the Value Function for Reinforcement Learning in Infinite Horizon Settings
Shi, C., Zhang, S., Lu, W., Song, R.
Reinforcement learning is a general technique that allows an agent to learn an optimal policy and interact with an environment in sequential decision making problems. The goodness of a policy is measured by its value function starting from some initial state. The focus of this paper is to construct confidence intervals (CIs) for a policy's value in infinite horizon settings where the number of decision points diverges to infinity. We propose to model the action-value state function (Q-function) associated with a policy using the series/sieve method to derive its confidence interval. When the target policy depends on the observed data as well, we propose a SequentiAl Value Evaluation (SAVE) method to recursively update the estimated policy and its value estimator. As long as either the number of trajectories or the number of decision points diverges to infinity, we show that the proposed CI achieves nominal coverage even in cases where the optimal policy is not unique. Simulation studies are conducted to back up our theoretical findings. We apply the proposed method to a dataset from mobile health studies and find that reinforcement learning algorithms could help improve patients' health status.
Exploiting Language Instructions for Interpretable and Compositional Reinforcement Learning
van der Meer, Michiel, Pirotta, Matteo, Bruni, Elia
In this work, we present an alternative approach to making an agent compositional through the use of a diagnostic classifier. Because of the need for explainable agents in automated decision processes, we attempt to interpret the latent space from an RL agent to identify its current objective in a complex language instruction. Results show that the classification process causes changes in the hidden states which makes them more easily interpretable, but also causes a shift in zero-shot performance to novel instructions. Lastly, we limit the supervisory signal on the classification, and observe a similar but less notable effect.
Policy Poisoning in Batch Reinforcement Learning and Control
Ma, Yuzhe, Zhang, Xuezhou, Sun, Wen, Zhu, Jerry
We study a security threat to batch reinforcement learning and control where the attacker aims to poison the learned policy. The victim is a reinforcement learner / controller which first estimates the dynamics and the rewards from a batch data set, and then solves for the optimal policy with respect to the estimates. The attacker can modify the data set slightly before learning happens, and wants to force the learner into learning a target policy chosen by the attacker. We present a unified framework for solving batch policy poisoning attacks, and instantiate the attack on two standard victims: the tabular certainty equivalence learner in reinforcement learning and the linear quadratic regulator in control. We show that both instantiations result in a convex optimization problem on which global optimality is guaranteed, and provide analysis on attack feasibility and attack cost.
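The victim described above can be sketched as a tabular certainty-equivalence learner: estimate rewards and transition probabilities from the batch by simple averaging and counting, then run value iteration on the estimated model. This is a minimal illustration only. The function name, the treatment of unseen state-action pairs, and the toy batch are assumptions, and the sketch does not implement the paper's convex-optimization attack.

```python
from collections import defaultdict

def certainty_equivalence_policy(batch, n_states, n_actions,
                                 gamma=0.9, sweeps=200):
    """Learn a greedy policy from (state, action, reward, next_state) tuples."""
    counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s_next: count}
    reward_sum = defaultdict(float)                 # (s, a) -> total reward
    for s, a, r, s_next in batch:
        counts[(s, a)][s_next] += 1
        reward_sum[(s, a)] += r

    def q_value(s, a, v):
        n = sum(counts[(s, a)].values())
        if n == 0:
            return 0.0  # assumption: unseen pairs are valued at zero
        r_hat = reward_sum[(s, a)] / n              # estimated mean reward
        # Estimated transition probabilities are the empirical frequencies c / n.
        return r_hat + gamma * sum(c / n * v[s2]
                                   for s2, c in counts[(s, a)].items())

    v = [0.0] * n_states
    for _ in range(sweeps):  # value iteration on the estimated model
        v = [max(q_value(s, a, v) for a in range(n_actions))
             for s in range(n_states)]
    return [max(range(n_actions), key=lambda a: q_value(s, a, v))
            for s in range(n_states)]

clean_batch = [(0, 0, 0.0, 0), (0, 1, 1.0, 1), (1, 0, 0.0, 0), (1, 1, 2.0, 1)]
print(certainty_equivalence_policy(clean_batch, n_states=2, n_actions=2))
```

Because the policy depends on the data only through these estimates, hand-editing a single recorded reward in the toy batch (for example, changing the reward of the last tuple from 2.0 to -1.0) is enough to change the action learned in state 1. The attack studied in the paper looks for such modifications systematically and at minimal cost.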