Goto

Collaborating Authors

 Reinforcement Learning


Thompson Sampling for Noncompliant Bandits

arXiv.org Machine Learning

Multi-Armed Bandit (MAB) (Sutton and Barto, 1998) problems are a class of sequential decision-making problems where an agent seeks to maximize rewards by acting in an unknown stationary environment. The MAB problem is often caricaturized using a set of slot machines with unknown payout distributions. The agent must decide which arm to pull in order to maximize earnings. Because the machines' reward distributions are initially unknown, the bandit must select actions that balance exploration (learning the reward distributions) with exploitation (playing the machine with highest expected reward). Contextual bandits (CB) (Li et al., 2010a) are a slightly modified MAB problem where the reward distributions are conditioned on an observation which is revealed to the agent prior to the selection of an action.


Mitigating Planner Overfitting in Model-Based Reinforcement Learning

arXiv.org Artificial Intelligence

An agent with an inaccurate model of its environment faces a difficult choice: it can ignore the errors in its model and act in the real world in whatever way it determines is optimal with respect to its model. Alternatively, it can take a more conservative stance and eschew its model in favor of optimizing its behavior solely via real-world interaction. This latter approach can be exceedingly slow to learn from experience, while the former can lead to "planner overfitting" - aspects of the agent's behavior are optimized to exploit errors in its model. This paper explores an intermediate position in which the planner seeks to avoid overfitting through a kind of regularization of the plans it considers. We present three different approaches that demonstrably mitigate planner overfitting in reinforcement-learning environments.


Learning to Learn How to Learn: Self-Adaptive Visual Navigation Using Meta-Learning

arXiv.org Artificial Intelligence

Learning is an inherently continuous phenomenon. When humans learn a new task there is no explicit distinction between training and inference. After we learn a task, we keep learning about it while performing the task. What we learn and how we learn it varies during different stages of learning. Learning how to learn and adapt is a key property that enables us to generalize effortlessly to new settings. This is in contrast with conventional settings in machine learning where a trained model is frozen during inference. In this paper we study the problem of learning to learn at both training and inference time in the context of visual navigation. A fundamental challenge in navigation is generalization to unseen scenes. In this paper we propose a self-adaptive visual navigation method (SAVN) which learns to adapt to new environments without any explicit supervision. Our solution is a meta-reinforcement learning approach where an agent learns a self-supervised interaction loss that encourages effective navigation. Our experiments, performed in the AI2-THOR framework, show major improvements in both success rate and SPL for visual navigation in novel scenes.


FoldingZero: Protein Folding from Scratch in Hydrophobic-Polar Model

arXiv.org Artificial Intelligence

De novo protein structure prediction from amino acid sequence is one of the most challenging problems in computational biology. As one of the extensively explored mathematical models for protein folding, Hydrophobic-Polar (HP) model enables thorough investigation of protein structure formation and evolution. Although HP model discretizes the conformational space and simplifies the folding energy function, it has been proven to be an NP-complete problem. In this paper, we propose a novel protein folding framework FoldingZero, self-folding a de novo protein 2D HP structure from scratch based on deep reinforcement learning. FoldingZero features the coupled approach of a two-head (policy and value heads) deep convolutional neural network (HPNet) and a regularized Upper Confidence Bounds for Trees (R-UCT). It is trained solely by a reinforcement learning algorithm, which improves HPNet and R-UCT iteratively through iterative policy optimization. Without any supervision and domain knowledge, FoldingZero not only achieves comparable results, but also learns the latent folding knowledge to stabilize the structure. Without exponential computation, FoldingZero shows promising potential to be adopted for real-world protein properties prediction.


Generative Adversarial Self-Imitation Learning

arXiv.org Artificial Intelligence

This paper explores a simple regularizer for reinforcement learning by proposing Generative Adversarial Self-Imitation Learning (GASIL), which encourages the agent to imitate past good trajectories via generative adversarial imitation learning framework. Instead of directly maximizing rewards, GASIL focuses on reproducing past good trajectories, which can potentially make long-term credit assignment easier when rewards are sparse and delayed. GASIL can be easily combined with any policy gradient objective by using GASIL as a learned shaped reward function. Our experimental results show that GASIL improves the performance of proximal policy optimization on 2D Point Mass and MuJoCo environments with delayed reward and stochastic dynamics.


Machine learning to optimize traffic and reduce pollution

#artificialintelligence

Applying artificial intelligence to self-driving cars to smooth traffic, reduce fuel consumption, and improve air quality predictions may sound like the stuff of science fiction, but researchers at the Department of Energy's Lawrence Berkeley National Laboratory (Berkeley Lab) have launched two research projects to do just that. In collaboration with UC Berkeley, Berkeley Lab scientists are using deep reinforcement learning, a computational tool for training controllers, to make transportation more sustainable. One project uses deep reinforcement learning to train autonomous vehicles to drive in ways to simultaneously improve traffic flow and reduce energy consumption. A second uses deep learning algorithms to analyze satellite images combined with traffic information from cell phones and data already being collected by environmental sensors to improve air quality predictions. "Thirty percent of energy use in the U.S. is to transport people and goods, and this energy consumption contributes to air pollution, including approximately half of all nitrogen oxide emissions, a precursor to particular matter and ozone โ€“ and black carbon (soot) emissions," said Tom Kirchstetter, director of Berkeley Lab's Energy Analysis and Environmental Impacts Division, an adjunct professor at UC Berkeley, and a member of the research team.


Macro action selection with deep reinforcement learning in StarCraft

arXiv.org Artificial Intelligence

StarCraft (SC) is one of the most popular and successful Real Time Strategy (RTS) games. In recent years, SC is also considered as a testbed for AI research, due to its enormous state space, hidden information, multi-agent collaboration and so on. Thanks to the annual AIIDE and CIG competitions, a growing number of bots are proposed and being continuously improved. However, a big gap still remains between the top bot and the professional human players. One vital reason is that current bots mainly rely on predefined rules to perform macro actions. These rules are not scalable and efficient enough to cope with the large but partially observed macro state space in SC. In this paper, we propose a DRL based framework to do macro action selection. Our framework combines the reinforcement learning approach Ape-X DQN with Long-Short-Term-Memory (LSTM) to improve the macro action selection in bot. We evaluate our bot, named as LastOrder, on the AIIDE 2017 StarCraft AI competition bots set. Our bot achieves overall 83% win-rate, outperforming 26 bots in total 28 entrants.


Learning to Learn without Forgetting By Maximizing Transfer and Minimizing Interference

arXiv.org Artificial Intelligence

Lack of performance when it comes to continual learning over non-stationary distributions of data remains a major challenge in scaling neural network learning to more human realistic settings. In this work we propose a new conceptualization of the continual learning problem in terms of a temporally symmetric trade-off between transfer and interference that can be optimized by enforcing gradient alignment across examples. We then propose a new algorithm, Meta-Experience Replay (MER), that directly exploits this view by combining experience replay with optimization based meta-learning. This method learns parameters that make interference based on future gradients less likely and transfer based on future gradients more likely. We conduct experiments across continual lifelong supervised learning benchmarks and non-stationary reinforcement learning environments demonstrating that our approach consistently outperforms recently proposed baselines for continual learning. Our experiments show that the gap between the performance of MER and baseline algorithms grows both as the environment gets more non-stationary and as the fraction of the total experiences stored gets smaller.


Dynamic Measurement Scheduling for Adverse Event Forecasting using Deep RL

arXiv.org Machine Learning

Current clinical practice to monitor patients' health follows either regular or heuristic-based lab test (e.g. blood test) scheduling. Such practice not only gives rise to redundant measurements accruing cost, but may even lead to unnecessary patient discomfort. From the computational perspective, heuristic-based test scheduling might lead to reduced accuracy of clinical forecasting models. Computationally learning an optimal clinical test scheduling and measurement collection, is likely to lead to both, better predictive models and patient outcome improvement. We address the scheduling problem using deep reinforcement learning (RL) to achieve high predictive gain and low measurement cost, by scheduling fewer, but strategically timed tests. We first show that in the simulation our policy outperforms heuristic-based measurement scheduling with higher predictive gain or lower cost measured by accumulated reward. We then learn a scheduling policy for mortality forecasting in the real-world clinical dataset (MIMIC3), our learned policy is able to provide useful clinical insights. To our knowledge, this is the first RL application on multi-measurement scheduling problem in the clinical setting.


Practical Deep Reinforcement Learning Approach for Stock Trading

arXiv.org Machine Learning

Stock trading strategy plays a crucial role in investment companies. However, it is challenging to obtain optimal strategy in the complex and dynamic stock market. We explore the potential of deep reinforcement learning to optimize stock trading strategy and thus maximize investment return. 30 stocks are selected as our trading stocks and their daily prices are used as the training and trading market environment. We train a deep reinforcement learning agent and obtain an adaptive trading strategy. The agent's performance is evaluated and compared with Dow Jones Industrial Average and the traditional min-variance portfolio allocation strategy. The proposed deep reinforcement learning approach is shown to outperform the two baselines in terms of both the Sharpe ratio and cumulative returns.