Reinforcement Learning
Automated Model Compression by Jointly Applied Pruning and Quantization
Tang, Wenting, Wei, Xingxing, Li, Bo
In the traditional deep compression framework, iteratively performing network pruning and quantization can reduce the model size and computation cost to meet the deployment requirements. However, such a step-wise application of pruning and quantization may lead to suboptimal solutions and unnecessary time consumption. In this paper, we tackle this issue by integrating network pruning and quantization as a unified joint compression problem and then use AutoML to automatically solve it. We find the pruning process can be regarded as the channel-wise quantization with 0 bit. Thus, the separate two-step pruning and quantization can be simplified as the one-step quantization with mixed precision. This unification not only simplifies the compression pipeline but also avoids the compression divergence. To implement this idea, we propose the automated model compression by jointly applied pruning and quantization (AJPQ). AJPQ is designed with a hierarchical architecture: the layer controller controls the layer sparsity, and the channel controller decides the bit-width for each kernel. Following the same importance criterion, the layer controller and the channel controller collaboratively decide the compression strategy. With the help of reinforcement learning, our one-step compression is automatically achieved. Compared with the state-of-the-art automated compression methods, our method obtains a better accuracy while reducing the storage considerably. For fixed precision quantization, AJPQ can reduce more than five times model size and two times computation with a slight performance increase for Skynet in remote sensing object detection. When mixed-precision is allowed, AJPQ can reduce five times model size with only 1.06% top-5 accuracy decline for MobileNet in the classification task.
Algorithms in Reinforcement Learning
To learn the optimal action in unknown environment, Q-learning is the simple algorithm in reinforcement learning. Without having a model of an environment, it can learn the optimal and long-term action. And there have two policies called target policy and behavior policy. Tabular methods give correct policies and functions in tables. In Q-learning, to find optimal action value function, behavior policy can be achieved using policy iteration.
Autonomous Particle Accelerators: Accelerate Smarter With Artificial Intelligence
At DESY s ARES accelerator, the research team wants to gain experience with autonomous operation. Particle accelerators are universal tools: They help in production processes in industry, in tumor therapy in hospitals and enable unique discoveries and insights in research. Growing demands on the stability and properties of particle beams make a manual operation of these complex devices increasingly challenging – and require the highest possible level of automation to support operators. A new project of DESY and KIT (Karlsruhe Institute of Technology) is now taking the first steps towards a fully autonomously operated accelerator. The cooperation "Autonomous Accelerator," which is supported by the Helmholtz Association and the two participating Helmholtz research centers within the framework of the Helmholtz Artificial Intelligence Cooperation Unit, brings "reinforcement learning" to the operation of two linear accelerators at DESY and KIT. Reinforcement learning involves measuring state values and adjusting control variables to determine their influence on each other, thus learning a control strategy that also takes into account its effects in the future.
Exploratory Grasping: Asymptotically Optimal Algorithms for Grasping Challenging Polyhedral Objects
Danielczuk, Michael, Balakrishna, Ashwin, Brown, Daniel S., Devgon, Shivin, Goldberg, Ken
There has been significant recent work on data-driven algorithms for learning general-purpose grasping policies. However, these policies can consistently fail to grasp challenging objects which are significantly out of the distribution of objects in the training data or which have very few high quality grasps. Motivated by such objects, we propose a novel problem setting, Exploratory Grasping, for efficiently discovering reliable grasps on an unknown polyhedral object via sequential grasping, releasing, and toppling. We formalize Exploratory Grasping as a Markov Decision Process, study the theoretical complexity of Exploratory Grasping in the context of reinforcement learning and present an efficient bandit-style algorithm, Bandits for Online Rapid Grasp Exploration Strategy (BORGES), which leverages the structure of the problem to efficiently discover high performing grasps for each object stable pose. BORGES can be used to complement any general-purpose grasping algorithm with any grasp modality (parallel-jaw, suction, multi-fingered, etc) to learn policies for objects in which they exhibit persistent failures. Simulation experiments suggest that BORGES can significantly outperform both general-purpose grasping pipelines and two other online learning algorithms and achieves performance within 5% of the optimal policy within 1000 and 8000 timesteps on average across 46 challenging objects from the Dex-Net adversarial and EGAD! object datasets, respectively. Initial physical experiments suggest that BORGES can improve grasp success rate by 45% over a Dex-Net baseline with just 200 grasp attempts in the real world. See https://tinyurl.com/exp-grasping for supplementary material and videos.
Adaptive Neural Architectures for Recommender Systems
Rafailidis, Dimitrios, Antaris, Stefanos
Deep learning has proved an effective means to capture the non-linear associations of user preferences. However, the main drawback of existing deep learning architectures is that they follow a fixed recommendation strategy, ignoring users' real time-feedback. Recent advances of deep reinforcement strategies showed that recommendation policies can be continuously updated while users interact with the system. In doing so, we can learn the optimal policy that fits to users' preferences over the recommendation sessions. The main drawback of deep reinforcement strategies is that are based on predefined and fixed neural architectures. To shed light on how to handle this issue, in this study we first present deep reinforcement learning strategies for recommendation and discuss the main limitations due to the fixed neural architectures. Then, we detail how recent advances on progressive neural architectures are used for consecutive tasks in other research domains. Finally, we present the key challenges to fill the gap between deep reinforcement learning and adaptive neural architectures. We provide guidelines for searching for the best neural architecture based on each user feedback via reinforcement learning, while considering the prediction performance on real-time recommendations and the model complexity.
pymgrid: An Open-Source Python Microgrid Simulator for Applied Artificial Intelligence Research
Henri, Gonzague, Levent, Tanguy, Halev, Avishai, Alami, Reda, Cordier, Philippe
Microgrids, self contained electrical grids that are capable of disconnecting from the main grid, hold potential in both tackling climate change mitigation via reducing CO2 emissions and adaptation by increasing infrastructure resiliency. Due to their distributed nature, microgrids are often idiosyncratic; as a result, control of these systems is nontrivial. While microgrid simulators exist, many are limited in scope and in the variety of microgrids they can simulate. We propose pymgrid, an open-source Python package to generate and simulate a large number of microgrids, and the first open-source tool that can generate more than 600 different microgrids. pymgrid abstracts most of the domain expertise, allowing users to focus on control algorithms. In particular, pymgrid is built to be a reinforcement learning (RL) platform, and includes the ability to model microgrids as Markov decision processes. pymgrid also introduces two pre-computed list of microgrids, intended to allow for research reproducibility in the microgrid setting.
A Primal Approach to Constrained Policy Optimization: Global Optimality and Finite-Time Analysis
Xu, Tengyu, Liang, Yingbin, Lan, Guanghui
Safe reinforcement learning (SRL) problems are typically modeled as constrained Markov Decision Process (CMDP), in which an agent explores the environment to maximize the expected total reward and meanwhile avoids violating certain constraints on a number of expected total costs. In general, such SRL problems have nonconvex objective functions subject to multiple nonconvex constraints, and hence are very challenging to solve, particularly to provide a globally optimal policy. Many popular SRL algorithms adopt a primal-dual structure which utilizes the updating of dual variables for satisfying the constraints. In contrast, we propose a primal approach, called constraint-rectified policy optimization (CRPO), which updates the policy alternatingly between objective improvement and constraint satisfaction. CRPO provides a primal-type algorithmic framework to solve SRL problems, where each policy update can take any variant of policy optimization step. To demonstrate the theoretical performance of CRPO, we adopt natural policy gradient (NPG) for each policy update step and show that CRPO achieves an $\mathcal{O}(1/\sqrt{T})$ convergence rate to the global optimal policy in the constrained policy set and an $\mathcal{O}(1/\sqrt{T})$ error bound on constraint satisfaction. This is the first finite-time analysis of SRL algorithms with global optimality guarantee. Our empirical results demonstrate that CRPO can outperform the existing primal-dual baseline algorithms significantly.
Offline Learning of Counterfactual Perception as Prediction for Real-World Robotic Reinforcement Learning
Jin, Jun, Graves, Daniel, Haigh, Cameron, Luo, Jun, Jagersand, Martin
We propose a method for offline learning of counterfactual predictions to address real world robotic reinforcement learning challenges. The proposed method encodes action-oriented visual observations as several "what if" questions learned offline from prior experience using reinforcement learning methods. These "what if" questions counterfactually predict how action-conditioned observation would evolve on multiple temporal scales if the agent were to stick to its current action. We show that combining these offline counterfactual predictions along with online in-situ observations (e.g. force feedback) allows efficient policy learning with only a sparse terminal (success/failure) reward. We argue that the learned predictions form an effective representation of the visual task, and guide the online exploration towards high-potential success interactions (e.g. contact-rich regions). Experiments were conducted in both simulation and real-world scenarios for evaluation. Our results demonstrate that it is practical to train a reinforcement learning agent to perform real-world fine manipulation in about half a day, without hand engineered perception systems or calibrated instrumentation. Recordings of the real robot training can be found via https://sites.google.com/view/realrl.
Reinforcement Learning Experiments and Benchmark for Solving Robotic Reaching Tasks
Aumjaud, Pierre, McAuliffe, David, Lera, Francisco Javier Rodríguez, Cardiff, Philip
Reinforcement learning has shown great promise in robotics thanks to its ability to develop efficient robotic control procedures through self-training. In particular, reinforcement learning has been successfully applied to solving the reaching task with robotic arms. In this paper, we define a robust, reproducible and systematic experimental procedure to compare the performance of various model-free algorithms at solving this task. The policies are trained in simulation and are then transferred to a physical robotic manipulator. It is shown that augmenting the reward signal with the Hindsight Experience Replay exploration technique increases the average return of off-policy agents between 7 and 9 folds when the target position is initialised randomly at the beginning of each episode.
Reinforcement Learning with Time-dependent Goals for Robotic Musicians
Fryen, Thilo, Eppe, Manfred, Nguyen, Phuong D. H., Gerkmann, Timo, Wermter, Stefan
Reinforcement learning is a promising method to accomplish robotic control tasks. The task of playing musical instruments is, however, largely unexplored because it involves the challenge of achieving sequential goals - melodies - that have a temporal dimension. In this paper, we address robotic musicianship by introducing a temporal extension to goal-conditioned reinforcement learning: Time-dependent goals. We demonstrate that these can be used to train a robotic musician to play the theremin instrument. We train the robotic agent in simulation and transfer the acquired policy to a real-world robotic thereminist. Supplemental video: https://youtu.be/jvC9mPzdQN4