Reinforcement Learning
A3C -- What It Is & What I Built
The basic actor-critic model stems from Deep Convolution Q-Learning, in which the agent implements Q-learning but, instead of taking a matrix of states as input, takes in images and feeds them into a deep convolutional neural network. Don't worry about the rectangles on the right side; they represent a deep neural network with all its nodes and connections. It's just easier to explain and understand A3C this way. In a regular Deep Convolution Q-Learning network, there would be only one output: the Q-values of the different actions. In A3C, however, there are two outputs: one for the Q-values of the different actions, and one for the value of being in the state the agent is actually in.
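The two-output idea can be sketched in a few lines. This is a minimal illustration, not the author's network: a single dense layer stands in for the convolutional layers, the input is assumed to be an already flattened feature vector, and all names and sizes (`init_params`, `forward`, 16 features, 4 actions) are made up for the example. The action head is turned into probabilities with a softmax, which is how actor-critic implementations usually read it; the text above describes that head as Q-values.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_params(n_features, n_hidden, n_actions):
    """Random weights for a shared trunk plus two output heads."""
    return {
        "W_shared": rng.normal(0.0, 0.1, (n_features, n_hidden)),
        "W_actor": rng.normal(0.0, 0.1, (n_hidden, n_actions)),
        "W_critic": rng.normal(0.0, 0.1, (n_hidden, 1)),
    }

def forward(params, features):
    """Return (action_probs, state_value) for one flattened feature vector."""
    hidden = np.tanh(features @ params["W_shared"])   # shared representation
    logits = hidden @ params["W_actor"]               # actor head: one score per action
    logits = logits - logits.max()                    # stabilise the softmax
    action_probs = np.exp(logits) / np.exp(logits).sum()
    state_value = float((hidden @ params["W_critic"])[0])  # critic head: scalar value of the state
    return action_probs, state_value

# Hypothetical sizes: 16 input features, 8 hidden units, 4 actions.
params = init_params(n_features=16, n_hidden=8, n_actions=4)
probs, value = forward(params, rng.normal(size=16))
print(probs, value)
```

Both heads read the same hidden vector, so anything the shared layer learns about the input serves the action output and the state-value output at once.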
New Game Theory Innovations that are Influencing Reinforcement Learning
Game theory plays a fundamental role in modern artificial intelligence (AI) solutions. Specifically, deep reinforcement learning (DRL) is an area of AI that has embraced game theory as a first-class citizen. From single-agent programs to complex multi-agent DRL environments, gamifying dynamics are present across the lifecycle of AI programs. The fascinating thing is that the rapid evolution of DRL has also triggered renewed interest in game theory research. The relationship between game theory and DRL seems trivial.
MAME: Model-Agnostic Meta-Exploration
Gurumurthy, Swaminathan, Kumar, Sumit, Sycara, Katia
Meta-reinforcement learning approaches aim to develop learning procedures that can adapt quickly to a distribution of tasks with the help of a few examples. Developing efficient exploration strategies capable of finding the most useful samples becomes critical in such settings. Existing approaches towards finding efficient exploration strategies add auxiliary objectives to promote exploration by the pre-update policy; however, this makes adaptation using a few gradient steps difficult, as the pre-update (exploration) and post-update (exploitation) policies are often quite different. Instead, we propose to explicitly model a separate exploration policy for the task distribution. Having two different policies gives more flexibility in training the exploration policy and also makes adaptation to any specific task easier. We show that using self-supervised or supervised learning objectives for adaptation allows for more efficient inner-loop updates, and we demonstrate the superior performance of our model compared to prior works in this domain.
Value-Added Chemical Discovery Using Reinforcement Learning
Jiang, Peihong, Doan, Hieu, Madireddy, Sandeep, Assary, Rajeev Surendran, Balaprakash, Prasanna
Computer-assisted synthesis planning aims to help chemists find better reaction pathways faster. Finding viable and short pathways from sugar molecules to value-added chemicals can be modeled as a retrosynthesis planning problem with a catalyst allowed. This is a crucial step in efficient biomass conversion. The traditional computational chemistry approach to identifying possible reaction pathways involves computing the reaction energies of hundreds of intermediates, which is a critical bottleneck in in silico reaction discovery. Deep reinforcement learning has shown in other domains that a well-trained agent with little or no prior human knowledge can surpass human performance. While some effort has been made to adapt machine learning techniques to the retrosynthesis planning problem, value-added chemical discovery presents unique challenges. Specifically, the reaction can occur at several different sites in a molecule, a subtle case that has never been treated in previous works. With a more versatile formulation of the problem as a Markov decision process, we address the problem using deep reinforcement learning techniques and present promising preliminary results.
Deep Reinforcement Learning Based Dynamic Trajectory Control for UAV-assisted Mobile Edge Computing
Wang, Liang, Wang, Kezhi, Pan, Cunhua, Xu, Wei, Aslam, Nauman, Nallanathan, Arumugam
In this paper, we consider a platform of flying mobile edge computing (F-MEC), where unmanned aerial vehicles (UAVs) serve as equipment providing computation resources and enable task offloading from user equipment (UE). We aim to minimize the energy consumption of all the UEs by optimizing the user association, resource allocation, and trajectory of the UAVs. To this end, we first propose a Convex optimizAtion based Trajectory control algorithm (CAT), which solves the problem iteratively using the block coordinate descent (BCD) method. Then, to make real-time decisions while taking into account the dynamics of the environment (i.e., UAVs may take off from different locations), we propose a deep Reinforcement leArning based Trajectory control algorithm (RAT). In RAT, we apply Prioritized Experience Replay (PER) to improve the convergence of the training procedure. Unlike the convex optimization based algorithm, which may be sensitive to the initial points and requires iterations, RAT can be adapted to any take-off points of the UAVs and can obtain the solution more rapidly than CAT once the training process has been completed. Simulation results show that the proposed CAT and RAT achieve similar performance and both outperform traditional algorithms.
Liang, Kezhi and Nauman are with the Department of Computer and Information Science, Northumbria University, Newcastle upon Tyne, UK, NE1 8ST. Cunhua and Arumugam are with School of Electronic Engineering and Computer Science, Queen Mary University of London, E1 4NS, U.K. Wei is with National Mobile Communications Research Lab, Southeast University, China.
INTRODUCTION
With the popularity of computationally intensive tasks, e.g., smart navigation and augmented reality, people are expecting to enjoy a more convenient life than ever before.
However, current smart devices and user equipment (UEs), due to their small size and limited resources, e.g., computation and battery, may not be able to provide satisfactory Quality of Service (QoS) and Quality of Experience (QoE) in executing those highly demanding tasks. Mobile edge computing (MEC) has been proposed to move the computation resource to the network edge, and it has been shown to greatly enhance a UE's ability to execute computation-hungry tasks [1].
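The abstract above says RAT uses Prioritized Experience Replay (PER) to improve convergence, without giving details. As a rough illustration of the general idea, here is a minimal sketch of the common proportional variant, where a transition is sampled with probability proportional to its priority raised to `alpha` and each sample carries an importance weight controlled by `beta`. The class name, the list-based storage, and the values of `alpha`, `beta` and `eps` are assumptions for the example and are not taken from the paper.

```python
import numpy as np

class PrioritizedReplay:
    """Toy proportional prioritized replay buffer (list-based, no sum-tree)."""

    def __init__(self, capacity, alpha=0.6, eps=1e-3):
        self.capacity = capacity
        self.alpha = alpha        # 0 gives uniform sampling, 1 gives fully proportional
        self.eps = eps            # keeps every priority strictly positive
        self.data = []
        self.priorities = []
        self.pos = 0

    def add(self, transition):
        # New transitions get the current maximum priority so they are likely to be sampled soon.
        p = max(self.priorities, default=1.0)
        if len(self.data) < self.capacity:
            self.data.append(transition)
            self.priorities.append(p)
        else:
            self.data[self.pos] = transition      # overwrite the oldest slot
            self.priorities[self.pos] = p
        self.pos = (self.pos + 1) % self.capacity

    def probabilities(self):
        scaled = np.asarray(self.priorities) ** self.alpha
        return scaled / scaled.sum()

    def sample(self, batch_size, rng, beta=0.4):
        probs = self.probabilities()
        idx = rng.choice(len(self.data), size=batch_size, p=probs)
        # Importance weights correct for the non-uniform sampling; normalised to at most 1.
        weights = (len(self.data) * probs[idx]) ** (-beta)
        weights /= weights.max()
        return idx, [self.data[i] for i in idx], weights

    def update(self, idx, td_errors):
        # After a learning step, the priority becomes the absolute TD error.
        for i, err in zip(idx, td_errors):
            self.priorities[i] = abs(float(err)) + self.eps

buf = PrioritizedReplay(capacity=4)
for t in range(4):
    buf.add(("state", t))
buf.update([0, 1, 2, 3], [0.0, 0.0, 0.0, 10.0])   # pretend transition 3 had a large TD error
probs = buf.probabilities()
idx, batch, weights = buf.sample(8, np.random.default_rng(0))
```

A production buffer would normally use a sum-tree so that sampling does not recompute all probabilities each time; the list version is kept here only for readability.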
Teaching machine learning through robot application development on AWS (Amazon Web Services)
Today, machine learning influences research and consumer products and is leading to breakthroughs across industries like healthcare, manufacturing, finance, and retail. In the field of reinforcement learning, machine learning meets the real world when applied to robotics. Knowing this, how can we ensure students are skilled and prepared to leverage the power of this technology? Intermind Co. is an education group bringing academic programs from leading universities on subjects like machine learning and artificial intelligence to international college students. We recently created a project-based learning experience around the use of Robot Operating System (ROS), the leading open-source framework for writing robot software, and AWS RoboMaker, a service that helps develop, test, and deploy intelligent robotics applications at scale.
Modelling Bahdanau Attention using Election methods aided by Q-Learning
Neural Machine Translation has lately gained a lot of "attention" with the advent of increasingly sophisticated and drastically improved models. The attention mechanism has proved to be a boon in this direction by providing weights to the input words, making it easy for the decoder to identify words representing the present context. But as newer, more complex attention models were developed, they involved large amounts of computation, making inference slow. In this paper, we model the attention network using techniques resonating with social choice theory. In addition, the attention mechanism, being a Markov Decision Process, is represented using reinforcement learning techniques. Thus, we propose to use an election method (k-Borda), fine-tuned using Q-learning, as a replacement for attention networks. The inference time for this network is less than that of a standard Bahdanau translator, and the translation results are comparable. This not only experimentally verifies the claims stated above but also provides faster inference.
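For readers unfamiliar with the election method named above, the k-Borda rule itself is easy to sketch: every voter ranks the candidates, a candidate in position j of an m-candidate ranking earns m - 1 - j points, and the k candidates with the highest totals win. How the paper maps words to voters and candidates is not stated in the abstract, so the ballots below are purely hypothetical, and the tie-breaking by name is an assumption added for determinism.

```python
from collections import defaultdict

def k_borda(rankings, k):
    """Return the k candidates with the highest Borda scores, plus all scores."""
    scores = defaultdict(int)
    for ranking in rankings:
        m = len(ranking)
        for position, candidate in enumerate(ranking):
            scores[candidate] += m - 1 - position   # top spot earns m-1 points, last earns 0
    # Sort by descending score; ties are broken alphabetically (an assumption).
    ordered = sorted(scores, key=lambda c: (-scores[c], c))
    return ordered[:k], dict(scores)

# Three hypothetical voters ranking four candidates.
ballots = [
    ["a", "b", "c", "d"],
    ["b", "a", "d", "c"],
    ["b", "c", "a", "d"],
]
committee, scores = k_borda(ballots, k=2)
print(committee, scores)
```

With these ballots the totals are a=6, b=8, c=3, d=1, so the two-member committee is b followed by a.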
Worst Cases Policy Gradients
Tang, Yichuan Charlie, Zhang, Jian, Salakhutdinov, Ruslan
Recent advances in deep reinforcement learning have demonstrated the capability of learning complex control policies from many types of environments. When learning policies for safety-critical applications, it is essential to be sensitive to risks and avoid catastrophic events. Towards this goal, we propose an actor-critic framework that models the uncertainty of the future and simultaneously learns a policy based on that uncertainty model. Specifically, given a distribution of the future return for any state and action, we optimize policies for varying levels of conditional Value-at-Risk. The learned policy can map the same state to different actions depending on the propensity for risk. We demonstrate the effectiveness of our approach in the domain of driving simulations, where we learn maneuvers in two scenarios. Our learned controller can dynamically select actions along a continuous axis, where safe and conservative behaviors are found at one end while riskier behaviors are found at the other. Finally, when testing with very different simulation parameters, our risk-averse policies generalize significantly better compared to other reinforcement learning approaches.
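The role of conditional Value-at-Risk (CVaR) in the abstract above can be illustrated with sampled returns. The sketch below is an assumption-laden toy, not the paper's method: it estimates CVaR at level `alpha` as the mean of the worst alpha-fraction of return samples and then picks the action with the best estimate. The action names and return samples are invented for the example.

```python
import numpy as np

def cvar(returns, alpha):
    """Mean of the worst alpha-fraction of sampled returns (lower tail)."""
    ordered = np.sort(np.asarray(returns, dtype=float))
    n_tail = max(1, int(np.ceil(alpha * len(ordered))))
    return float(ordered[:n_tail].mean())

def choose_action(samples, alpha):
    """Pick the action whose return samples have the highest CVaR at level alpha."""
    return max(samples, key=lambda action: cvar(samples[action], alpha))

# Hypothetical return samples for two actions available in the same state.
samples = {
    "safe": [1.0] * 10,               # always returns 1
    "risky": [3.0] * 9 + [-10.0],     # usually 3, occasionally a crash
}

risk_neutral = choose_action(samples, alpha=1.0)   # alpha = 1 reduces to the plain mean
risk_averse = choose_action(samples, alpha=0.2)    # only the worst 20% of outcomes count
print(risk_neutral, risk_averse)
```

At alpha = 1.0 the risky action has the higher mean (1.7 versus 1.0), while at alpha = 0.2 its worst-case tail averages -3.5, so the safe action wins. That is the sense in which one state can map to different actions depending on the propensity for risk.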
Scalable Efficient Deep-RL
Traditional scalable reinforcement learning frameworks, such as IMPALA and R2D2, run multiple agents in parallel to collect transitions, each with its own copy of the model from the parameter server (or learner). This architecture imposes high bandwidth requirements, since it demands transfers of model parameters, environment information, and so on. In this article, we discuss a modern scalable RL agent called SEED (Scalable Efficient Deep-RL), proposed by Espeholt, Marinier, Stanczyk et al. of the Google Brain team. Here we compare SEED with IMPALA. The IMPALA architecture, which is also used in various forms in Ape-X, OpenAI Rapid, and others, mainly consists of two parts: a large number of actors periodically copy model parameters from the learner and interact with environments to collect trajectories, while the learner(s) asynchronously receive transitions from the actors and optimize the model.
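The actor/learner split described above can be mimicked with a small sequential toy. It is only a sketch under strong assumptions: there is no real model, environment, or concurrency, an integer version number stands in for the model parameters, and the class names and `sync_every` parameter are invented for the example. What it does show is one consequence of periodic parameter copies: the learner ends up training on trajectories produced by an older copy of the parameters.

```python
from collections import deque

class Learner:
    def __init__(self):
        self.version = 0        # stands in for the model parameters
        self.inbox = deque()    # trajectories waiting to be consumed

    def step(self):
        """Consume one queued trajectory, if any, and return how stale it was."""
        if not self.inbox:
            return None
        behaviour_version = self.inbox.popleft()
        lag = self.version - behaviour_version
        self.version += 1       # one optimisation step changes the parameters
        return lag

class Actor:
    def __init__(self, learner, sync_every):
        self.learner = learner
        self.sync_every = sync_every
        self.local_version = learner.version
        self.steps = 0

    def act(self):
        """Collect one trajectory with the local copy and send it to the learner."""
        if self.steps % self.sync_every == 0:
            self.local_version = self.learner.version   # periodic parameter copy
        self.steps += 1
        self.learner.inbox.append(self.local_version)   # trajectory tagged with its version

learner = Learner()
actors = [Actor(learner, sync_every=2) for _ in range(3)]

lags = []
for _ in range(4):                      # four rounds of acting, then learning
    for actor in actors:
        actor.act()
    while (lag := learner.step()) is not None:
        lags.append(lag)

print(lags)
```

In a real deployment each parameter copy is a full model transfer from learner to actor, which is the bandwidth cost the paragraph above refers to.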