Reinforcement Learning
MQGrad: Reinforcement Learning of Gradient Quantization in Parameter Server
Cui, Guoxin, Xu, Jun, Zeng, Wei, Lan, Yanyan, Guo, Jiafeng, Cheng, Xueqi
One of the most significant bottlenecks in training large-scale machine learning models on a parameter server (PS) is the communication overhead, because the model gradients must be exchanged frequently between the workers and servers during the training iterations. Gradient quantization has been proposed as an effective approach to reducing the communication volume. One key issue in gradient quantization is setting the number of bits for quantizing the gradients. A small number of bits can significantly reduce the communication overhead but hurts the gradient accuracy, and vice versa. An ideal quantization method would dynamically balance the communication overhead and model accuracy by adjusting the number of bits according to the knowledge learned from the immediate past training iterations. Existing methods, however, quantize the gradients either with a fixed number of bits or with predefined heuristic rules. In this paper we propose a novel adaptive quantization method within the framework of reinforcement learning. The method, referred to as MQGrad, formalizes the selection of quantization bits as actions in a Markov decision process (MDP) where the MDP state records the information collected from the past optimization iterations (e.g., the sequence of the loss function values). During the training iterations of a machine learning algorithm, MQGrad continuously updates the MDP state according to the changes of the loss function. Based on this information, MQGrad learns to select the optimal actions (number of bits) to quantize the gradients. Experimental results on a benchmark dataset show that MQGrad can accelerate the learning of a large-scale deep neural network while maintaining its prediction accuracy.
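To make the bit-width trade-off concrete, here is a minimal sketch of uniform gradient quantization. It is not MQGrad's quantizer, which the abstract does not specify. The function names quantize and dequantize, the min/max scaling, and the round-to-nearest rule are all assumptions chosen for illustration.

```python
def quantize(grad, num_bits):
    """Map each gradient value to an integer code on a uniform grid.

    Illustrative assumption: the grid spans [min(grad), max(grad)] with
    2**num_bits levels, and values are rounded to the nearest level.
    """
    levels = 2 ** num_bits - 1
    lo, hi = min(grad), max(grad)
    if hi == lo:
        # All values identical: a single level represents them exactly.
        return [0] * len(grad), lo, 0.0
    scale = (hi - lo) / levels
    codes = []
    for g in grad:
        code = int((g - lo) / scale + 0.5)  # round to nearest grid point
        codes.append(min(code, levels))     # guard against float overshoot
    return codes, lo, scale


def dequantize(codes, lo, scale):
    """Recover approximate gradient values from integer codes."""
    return [lo + c * scale for c in codes]


def max_error(grad, num_bits):
    """Largest absolute reconstruction error for a given bit width."""
    codes, lo, scale = quantize(grad, num_bits)
    restored = dequantize(codes, lo, scale)
    return max(abs(a - b) for a, b in zip(grad, restored))


if __name__ == "__main__":
    gradient = [-0.5, -0.1, 0.0, 0.2, 0.7]
    for bits in (2, 4, 8):
        print(bits, "bits -> max error", max_error(gradient, bits))
```

Each code needs only num_bits bits on the wire, so fewer bits mean less traffic but a coarser grid and a larger reconstruction error. An adaptive method has to manage this tension from one iteration to the next.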
Feature-Based Aggregation and Deep Reinforcement Learning: A Survey and Some New Implementations
In this paper we discuss policy iteration methods for approximate solution of a finite-state discounted Markov decision problem, with a focus on feature-based aggregation methods and their connection with deep reinforcement learning schemes. We introduce features of the states of the original problem, and we formulate a smaller "aggregate" Markov decision problem, whose states relate to the features. The optimal cost function of the aggregate problem, a nonlinear function of the features, serves as an architecture for approximation in value space of the optimal cost function or the cost functions of policies of the original problem. We discuss properties and possible implementations of this type of aggregation, including a new approach to approximate policy iteration. In this approach the policy improvement operation combines feature-based aggregation with reinforcement learning based on deep neural networks, which is used to obtain the needed features. We argue that the cost function of a policy may be approximated much more accurately by the nonlinear function of the features provided by aggregation, than by the linear function of the features provided by deep reinforcement learning, thereby potentially leading to more effective policy improvement.
Hallucinogenic Deep Reinforcement Learning using Python and Keras
This post is a step-by-step guide through the paper. We'll cover the technical details and also walk through how you can get a version running on your own machine. As with my post on AlphaZero, I'm not associated with the authors of the paper; I just wanted to share my interpretation of their terrific work. We're going to build a reinforcement learning algorithm (an 'agent') that gets good at driving a car around a 2D racetrack. At each time-step, the algorithm is fed an observation (a 64 x 64 pixel colour image of the car and immediate surroundings) and needs to return the next set of actions to take -- specifically, the steering direction (-1 to 1), acceleration (0 to 1) and brake (0 to 1). This action is then passed to the environment, which returns the next observation, and the cycle starts again.
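The observe-act cycle described above can be sketched in a few lines of Python. StubRacetrack is a hypothetical stand-in, not the real racing environment from the post. Its reward rule and episode length are invented for illustration, and random_agent replaces the learned agent.

```python
import random


class StubRacetrack:
    """Hypothetical stand-in environment with the interface described above."""

    def __init__(self, episode_length=5):
        self.episode_length = episode_length
        self.t = 0

    def _observation(self):
        # A 64 x 64 colour image: rows x columns x (R, G, B), all black here.
        return [[[0, 0, 0] for _ in range(64)] for _ in range(64)]

    def reset(self):
        self.t = 0
        return self._observation()

    def step(self, action):
        steer, accel, brake = action
        self.t += 1
        reward = accel - brake  # toy reward, an assumption for this sketch
        done = self.t >= self.episode_length
        return self._observation(), reward, done


def random_agent(observation, rng):
    """Return (steering in [-1, 1], acceleration in [0, 1], brake in [0, 1])."""
    return (rng.uniform(-1, 1), rng.uniform(0, 1), rng.uniform(0, 1))


def run_episode(env, agent, seed=0):
    """Feed observations to the agent and actions to the environment until done."""
    rng = random.Random(seed)
    observation = env.reset()
    total_reward, steps, done = 0.0, 0, False
    while not done:
        action = agent(observation, rng)
        observation, reward, done = env.step(action)
        total_reward += reward
        steps += 1
    return steps, total_reward


if __name__ == "__main__":
    print(run_episode(StubRacetrack(), random_agent))
```

A learned agent would replace random_agent with a function that actually inspects the observation. The loop itself stays the same.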
Paper Repro: Deep Neuroevolution – Towards Data Science
In this post, we reproduce the recent Uber paper "Deep Neuroevolution: Genetic Algorithms are a Competitive Alternative for Training Deep Neural Networks for Reinforcement Learning", which amazingly showed that simple genetic algorithms sometimes performed better than apparently advanced reinforcement learning algorithms on well-studied problems such as Atari games. We will ourselves reach state-of-the-art performance on Frostbite, a game that had stumped reinforcement learning algorithms for years before Uber finally solved it with this paper. We will also learn about the dark art of training neural networks using genetic algorithms. In a way this could be considered part 3 of my deep reinforcement learning series, but I think this article can also stand alone. Note that unlike the previous tutorials, this post will be using PyTorch instead of Keras, mainly because this is what I personally have switched to, but also because PyTorch does happen to be more suited for this particular use case.
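For readers new to genetic algorithms, here is a minimal sketch of the general idea: keep the fittest individuals, copy them with random mutations, and repeat. It is not the paper's algorithm or the post's PyTorch code. The toy fitness function, the hyperparameters, and the names evolve and mutate are assumptions, and a plain list of floats stands in for network weights.

```python
import random


def fitness(params):
    """Toy objective (an assumption): highest, at 0, when every value is 1."""
    return -sum((p - 1.0) ** 2 for p in params)


def mutate(params, sigma, rng):
    """Return a copy of params with Gaussian noise added to each value."""
    return [p + rng.gauss(0.0, sigma) for p in params]


def evolve(pop_size=20, num_parents=5, dim=4, generations=50, sigma=0.1, seed=0):
    """Mutation-only genetic algorithm with truncation selection and one elite."""
    rng = random.Random(seed)
    population = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                  for _ in range(pop_size)]
    best_per_generation = []
    for _ in range(generations):
        # Truncation selection: only the top num_parents individuals reproduce.
        population.sort(key=fitness, reverse=True)
        parents = population[:num_parents]
        best_per_generation.append(fitness(parents[0]))
        # Elitism: the best individual survives unchanged, so the best
        # fitness can never decrease from one generation to the next.
        children = [mutate(rng.choice(parents), sigma, rng)
                    for _ in range(pop_size - 1)]
        population = [parents[0]] + children
    best = max(population, key=fitness)
    return best, best_per_generation


if __name__ == "__main__":
    best, history = evolve()
    print("first generation best:", history[0])
    print("final best:", fitness(best))
```

In a reinforcement learning setting, fitness would instead run the policy defined by params in the environment and return the episode score. No gradients are needed.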
A Study on Overfitting in Deep Reinforcement Learning
Zhang, Chiyuan, Vinyals, Oriol, Munos, Remi, Bengio, Samy
Deep neural networks have proved to be effective function approximators in Reinforcement Learning (RL). Significant progress is seen in many RL problems ranging from board games like Go (Silver et al., 2016, 2017b), Chess and Shogi (Silver et al., 2017a), video games like Atari (Mnih et al., 2015) and StarCraft (Vinyals et al., 2017), to real world robotics and control tasks (Lillicrap et al., 2016). Most of these successes are due to improved training algorithms, carefully designed neural network architectures and powerful hardware. For example, in AlphaZero (Silver et al., 2017a), 5,000 1st-generation TPUs and 64 2nd-generation TPUs are used during self-play based training of agents with deep residual networks (He et al., 2016). On the other hand, learning with high-capacity models and prolonged training time on powerful devices carries a potential risk of overfitting (Hardt et al., 2016; Lin et al., 2016). As a fundamental tradeoff in machine learning, preventing overfitting by properly controlling or regularizing the training is key to out-of-sample generalization. Studies of overfitting could be performed from the theory side, where generalization guarantees are derived for specific learning algorithms; or from the practice side, where carefully designed experimental protocols like cross validation are used as a proxy to certify the generalization performance. Unfortunately, in the regime of deep RL, systematic studies of generalization behavior from either theoretical or empirical perspectives are falling behind the rapid progress on the algorithm development and application side. The current situation not only makes it difficult to understand test behaviors such as vulnerability to potential adversarial attacks (Huang et al., 2017), but also renders some results difficult to reproduce or compare (Henderson et al., 2017; Machado et al., 2017).
Subgoal Discovery for Hierarchical Dialogue Policy Learning
Tang, Da, Li, Xiujun, Gao, Jianfeng, Wang, Chong, Li, Lihong, Jebara, Tony
Developing conversational agents to engage in complex dialogues is challenging partly because the dialogue policy needs to explore a large state-action space. In this paper, we propose a divide-and-conquer approach that discovers and exploits the hidden structure of the task to enable efficient policy learning. First, given a set of successful dialogue sessions, we present a Subgoal Discovery Network (SDN) to divide a complex goal-oriented task into a set of simpler subgoals in an unsupervised fashion. We then use these subgoals to learn a hierarchical policy which consists of 1) a top-level policy that selects among subgoals, and 2) a low-level policy that selects primitive actions to accomplish the subgoal. We exemplify our method by building a dialogue agent for the composite task of travel planning. Experiments with simulated and real users show that an agent trained with automatically discovered subgoals performs competitively against an agent with human-defined subgoals, and significantly outperforms an agent without subgoals. Moreover, we show that learned subgoals are human comprehensible.
PEORL: Integrating Symbolic Planning and Hierarchical Reinforcement Learning for Robust Decision-Making
Yang, Fangkai, Lyu, Daoming, Liu, Bo, Gustafson, Steven
Reinforcement learning and symbolic planning have both been used to build intelligent autonomous agents. Reinforcement learning relies on learning from interactions with the real world, which often requires an infeasibly large amount of experience. Symbolic planning relies on manually crafted symbolic knowledge, which may not be robust to domain uncertainties and changes. In this paper we present a unified framework, PEORL, that integrates symbolic planning with hierarchical reinforcement learning (HRL) to cope with decision-making in a dynamic environment with uncertainties. Symbolic plans are used to guide the agent's task execution and learning, and the learned experience is fed back to symbolic knowledge to improve planning. This method leads to rapid policy search and robust symbolic plans in complex domains. The framework is tested on benchmark domains of HRL.
Cross-domain Dialogue Policy Transfer via Simultaneous Speech-act and Slot Alignment
Mo, Kaixiang, Zhang, Yu, Yang, Qiang, Fung, Pascale
Dialogue policy transfer enables us to build dialogue policies in a target domain with little data by leveraging knowledge from a source domain with plenty of data. Dialogue sentences are usually represented by speech-acts and domain slots, and dialogue policy transfer is usually achieved by assigning a slot mapping matrix based on human heuristics. However, existing dialogue policy transfer methods cannot transfer across dialogue domains with different speech-acts, for example, between systems built by different companies. Also, they depend on either common slots or slot entropy, which are not available when the source and target slots are totally disjoint and no database is available to calculate the slot entropy. To solve this problem, we propose a Policy tRansfer across dOMaIns and SpEech-acts (PROMISE) model, which is able to transfer dialogue policies across domains with different speech-acts and disjoint slots. The PROMISE model can learn to align different speech-acts and slots simultaneously, and it does not require common slots or the calculation of the slot entropy. Experiments on both real-world dialogue data and simulations demonstrate that the PROMISE model can effectively transfer dialogue policies across domains with different speech-acts and disjoint slots.
Outline Objects using Deep Reinforcement Learning
Wang, Zhenxin, Sarcar, Sayan, Liu, Jingxin, Zheng, Yilin, Ren, Xiangshi
Image segmentation needs both local boundary position information and global object context information. The performance of the recent state-of-the-art method, fully convolutional networks, reaches a bottleneck because a single network, trained end-to-end, is limited in how well it can balance the two types of information simultaneously. To overcome this problem, we divide semantic image segmentation into temporal subtasks. First, we find a possible pixel position on some object boundary; then we trace the boundary in steps of limited length until the whole object is outlined. We present the first deep reinforcement learning approach to semantic image segmentation, called DeepOutline, which outperforms other algorithms on the Coco detection leaderboard in the middle and large size person category on the Coco val2017 dataset. It also offers insight into a divide-and-conquer approach to computer vision problems via reinforcement learning.
Optimising Traffic Using Reinforcement Learning – Becoming Human: Artificial Intelligence Magazine
Fundamentally, the root of the urban traffic distribution problem is multi-criteria decision making. The Reinforcement Learning framework, in which an agent learns an optimal policy through interaction with its environment, could provide an advantageous method for algorithmic development and network improvement. Each action that the agent takes leads to a reward or punishment, together with a new observation of the state. As learning progresses, the agent will learn a distributed routing policy that could maximise the capacity of an urban transport network. This process can be treated as a Markov Decision Process (MDP), which ultimately aims for the best solution by optimising a specific policy step by step.
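The reward-and-observation loop described above can be illustrated with tabular Q-learning on a tiny road network. The article does not name a specific algorithm, so Q-learning is a substitution made here for illustration. The four-node network ROADS, its travel costs, and the hyperparameters are all hypothetical.

```python
import random

# Hypothetical road network: node -> {next_node: travel_cost}.
ROADS = {0: {1: 1.0, 2: 5.0}, 1: {3: 1.0}, 2: {3: 1.0}, 3: {}}
GOAL = 3


def train(episodes=200, alpha=0.5, gamma=1.0, epsilon=0.2, seed=0):
    """Learn action values for routing from node 0 to GOAL."""
    rng = random.Random(seed)
    q = {state: {a: 0.0 for a in nxt} for state, nxt in ROADS.items()}
    for _ in range(episodes):
        state = 0
        while state != GOAL:
            actions = list(q[state])
            if rng.random() < epsilon:
                action = rng.choice(actions)  # explore
            else:
                action = max(actions, key=lambda a: q[state][a])  # exploit
            reward = -ROADS[state][action]  # "punishment": travel cost
            next_state = action             # taking a road leads to its end node
            future = max(q[next_state].values(), default=0.0)
            # Q-learning update: move the estimate toward reward + best future value.
            q[state][action] += alpha * (reward + gamma * future - q[state][action])
            state = next_state
    return q


def greedy_route(q, start=0):
    """Follow the highest-valued road from each node until GOAL is reached."""
    route = [start]
    while route[-1] != GOAL:
        state = route[-1]
        route.append(max(q[state], key=q[state].get))
    return route


if __name__ == "__main__":
    print(greedy_route(train()))
```

The learned values approximate the negative remaining travel cost from each node, so following the greedy choice at every node yields the cheapest route in this toy network.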