Reinforcement Learning
Continual Reinforcement Learning with Diversity Exploration and Adversarial Self-Correction
Zhu, Fengda, Chang, Xiaojun, Zeng, Runhao, Tan, Mingkui
Deep reinforcement learning has made significant progress in the field of continuous control, such as physical control and autonomous driving. However, it is challenging for a reinforcement model to learn a policy for each task sequentially due to catastrophic forgetting. Specifically, the model would forget knowledge it learned in the past when trained on a new task. We consider this challenge from two perspectives: i) acquiring task-specific skills is difficult since task information and rewards are not highly related; ii) learning knowledge from previous experience is difficult in continuous control domains. In this paper, we introduce an end-to-end framework namely Continual Diversity Adversarial Network (CDAN). We first develop an unsupervised diversity exploration method to learn task-specific skills using an unsupervised objective. Then, we propose an adversarial self-correction mechanism to learn knowledge by exploiting past experience. The two learning procedures are presumably reciprocal. To evaluate the proposed method, we propose a new continuous reinforcement learning environment named Continual Ant Maze (CAM) and a new metric termed Normalized Shorten Distance (NSD). The experimental results confirm the effectiveness of diversity exploration and self-correction. It is worthwhile noting that our final result outperforms baseline by 18.35% in terms of NSD, and 0.61 according to the average reward.
AI Adoption in the Enterprise
While O'Reilly has identified several trends among enterprise companies for adopting artificial intelligence, we decided to drill down further to learn just how businesses worldwide are planning and prioritizing this work. In a recent survey, we asked respondents about revenue-bearing AI projects their organizations have in production. How might their AI adoption patterns change over the course of the next year? In this report, data science experts Ben Lorica and Paco Nathan examine the survey's results to reveal how (and how many) respondents are ramping up AI projects. You'll also learn how the scope of AI use among companies is quickly expanding into deep learning, human in the loop, knowledge graphs, and reinforcement learning.
Google's AI picks which machine learning models will produce the best results
Leave it to the folks at Google to devise AI capable of predicting which machine learning models will produce the best results. In a newly-published paper ("Off-Policy Evaluation via Off-Policy Classification") and blog post, a team of Google AI researchers propose what they call "off-policy classification," or OPC, which evaluates the performance of AI-driven agents by treating evaluation as a classification problem. The team notes that their approach -- a variant of reinforcement learning, which employs rewards to drive software policies toward goals -- works with image inputs and scales to tasks including vision-based robotic grasping. "Fully off-policy reinforcement learning is a variant in which an agent learns entirely from older data, which is appealing because it enables model iteration without requiring a physical robot," writes Robotics at Google software engineer Alexa Irpan. "With fully off-policy RL, one can train several models on the same fixed dataset collected by previous agents, then select the best one."
Modeling AGI Safety Frameworks with Causal Influence Diagrams
Everitt, Tom, Kumar, Ramana, Krakovna, Victoria, Legg, Shane
One of the primary goals of AI research is the development of artificial agents that can exceed human performance on a wide range of cognitive tasks, in other words, artificial general intelligence (AGI). Although the development of AGI has many potential benefits, there are also many safety concerns that have been raised in the literature [Bostrom, 2014; Everitt et al., 2018; Amodei et al., 2016]. Various approaches for addressing AGI safety have been proposed [Leike et al., 2018; Christiano et al., 2018; Irving et al., 2018; Hadfield-Menell et al., 2016; Everitt, 2018], often presented as a modification of the reinforcement learning (RL) framework, or a new framework altogether. Understanding and comparing different frameworks for AGI safety can be difficult because they build on differing concepts and assumptions. For example, both reward modeling [Leike et al., 2018] and cooperative inverse RL [Hadfield-Menell et al., 2016] are frameworks for making an agent learn the preferences of a human user, but what are the key differences between them?
Cooperative Lane Changing via Deep Reinforcement Learning
Wang, Guan, Hu, Jianming, Li, Zhiheng, Li, Li
In this paper, we study how to learn an appropriate lane changing strategy for autonomous vehicles by using deep reinforcement learning. We show that the reward of the system should consider the overall traffic efficiency instead of the travel efficiency of an individual vehicle. In summary, cooperation leads to a more harmonic and efficient traffic system rather than competition
Adversarial Learning for Improved Onsets and Frames Music Transcription
Kim, Jong Wook, Bello, Juan Pablo
Automatic music transcription is considered to be one of the hardest problems in music information retrieval, yet recent deep learning approaches have achieved substantial improvements on transcription performance. These approaches commonly employ supervised learning models that predict various time-frequency representations, by minimizing element-wise losses such as the cross entropy function. However, applying the loss in this manner assumes conditional independence of each label given the input, and thus cannot accurately express inter-label dependencies. To address this issue, we introduce an adversarial training scheme that operates directly on the time-frequency representations and makes the output distribution closer to the ground-truth. Through adversarial learning, we achieve a consistent improvement in both frame-level and note-level metrics over Onsets and Frames, a state-of-the-art music transcription model. Our results show that adversarial learning can significantly reduce the error rate while increasing the confidence of the model estimations. Our approach is generic and applicable to any transcription model based on multi-label predictions, which are very common in music signal analysis.
Placeto: Learning Generalizable Device Placement Algorithms for Distributed Machine Learning
Addanki, Ravichandra, Venkatakrishnan, Shaileshh Bojja, Gupta, Shreyan, Mao, Hongzi, Alizadeh, Mohammad
We present Placeto, a reinforcement learning (RL) approach to efficiently find device placements for distributed neural network training. Unlike prior approaches that only find a device placement for a specific computation graph, Placeto can learn generalizable device placement policies that can be applied to any graph. We propose two key ideas in our approach: (1) we represent the policy as performing iterative placement improvements, rather than outputting a placement in one shot; (2) we use graph embeddings to capture relevant information about the structure of the computation graph, without relying on node labels for indexing. These ideas allow Placeto to train efficiently and generalize to unseen graphs. Our experiments show that Placeto requires up to 6.1x fewer training steps to find placements that are on par with or better than the best placements found by prior approaches. Moreover, Placeto is able to learn a generalizable placement policy for any given family of graphs, which can then be used without any retraining to predict optimized placements for unseen graphs from the same family. This eliminates the large overhead incurred by prior RL approaches whose lack of generalizability necessitates re-training from scratch every time a new graph is to be placed.
When Multiple Agents Learn to Schedule: A Distributed Radio Resource Management Framework
Naderializadeh, Navid, Sydir, Jaroslaw, Simsek, Meryem, Nikopour, Hosein, Talwar, Shilpa
Interference among concurrent transmissions in a wireless network is a key factor limiting the system performance. One way to alleviate this problem is to manage the radio resources in order to maximize either the average or the worst-case performance. However, joint consideration of both metrics is often neglected as they are competing in nature. In this article, a mechanism for radio resource management using multi-agent deep reinforcement learning (RL) is proposed, which strikes the right trade-off between maximizing the average and the $5^{th}$ percentile user throughput. Each transmitter in the network is equipped with a deep RL agent, receiving partial observations from the network (e.g., channel quality, interference level, etc.) and deciding whether to be active or inactive at each scheduling interval for given radio resources, a process referred to as link scheduling. Based on the actions of all agents, the network emits a reward to the agents, indicating how good their joint decisions were. The proposed framework enables the agents to make decisions in a distributed manner, and the reward is designed in such a way that the agents strive to guarantee a minimum performance, leading to a fair resource allocation among all users across the network. Simulation results demonstrate the superiority of our approach compared to decentralized baselines in terms of average and $5^{th}$ percentile user throughput, while achieving performance close to that of a centralized exhaustive search approach. Moreover, the proposed framework is robust to mismatches between training and testing scenarios. In particular, it is shown that an agent trained on a network with low transmitter density maintains its performance and outperforms the baselines when deployed in a network with a higher transmitter density.
Split Q Learning: Reinforcement Learning with Two-Stream Rewards
Lin, Baihan, Bouneffouf, Djallel, Cecchi, Guillermo
Drawing an inspiration from behavioral studies of human decision making, we propose here a general parametric framework for a reinforcement learning problem, which extends the standard Q-learning approach to incorporate a two-stream framework of reward processing with biases biologically associated with several neurological and psychiatric conditions, including Parkinson's and Alzheimer's diseases, attention-deficit/hyperactivity disorder (ADHD), addiction, and chronic pain. For AI community, the development of agents that react differently to different types of rewards can enable us to understand a wide spectrum of multi-agent interactions in complex real-world socioeconomic systems. Moreover, from the behavioral modeling perspective, our parametric framework can be viewed as a first step towards a unifying computational model capturing reward processing abnormalities across multiple mental conditions and user preferences in long-term recommendation systems.
Learning Reward Functions by Integrating Human Demonstrations and Preferences
Palan, Malayandi, Landolfi, Nicholas C., Shevchuk, Gleb, Sadigh, Dorsa
Our goal is to accurately and efficiently learn reward functions for autonomous robots. Current approaches to this problem include inverse reinforcement learning (IRL), which uses expert demonstrations, and preference-based learning, which iteratively queries the user for her preferences between trajectories. In robotics however, IRL often struggles because it is difficult to get high-quality demonstrations; conversely, preference-based learning is very inefficient since it attempts to learn a continuous, high-dimensional function from binary feedback. We propose a new framework for reward learning, DemPref, that uses both demonstrations and preference queries to learn a reward function. Specifically, we (1) use the demonstrations to learn a coarse prior over the space of reward functions, to reduce the effective size of the space from which queries are generated; and (2) use the demonstrations to ground the (active) query generation process, to improve the quality of the generated queries. Our method alleviates the efficiency issues faced by standard preference-based learning methods and does not exclusively depend on (possibly low-quality) demonstrations. In numerical experiments, we find that DemPref is significantly more efficient than a standard active preference-based learning method. In a user study, we compare our method to a standard IRL method; we find that users rated the robot trained with DemPref as being more successful at learning their desired behavior, and preferred to use the DemPref system (over IRL) to train the robot.