acktr
- North America > Canada > Ontario > Toronto (0.15)
- Asia > Middle East > Jordan (0.04)
- North America > United States > California > Los Angeles County > Long Beach (0.04)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.30)
Reviews: Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation
The manuscript addresses an important topic: optimization in deep reinforcement learning. The authors extend the use of Kronecker-factored approximation to develop a second-order optimization method for deep reinforcement learning. The method uses a Kronecker-factored approximation to the Fisher matrix to estimate the curvature of the cost, resulting in a scalable approximation to natural gradients. The authors demonstrate the power of the method (termed ACKTR) in terms of the performance of agents in Atari and MuJoCo RL environments, and compare the proposed algorithm to two previous methods (A2C and TRPO). Overall, the manuscript is well written and, to my knowledge, the methodology is a novel application of Kronecker-factored approximation.
Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation
Yuhuai Wu, Elman Mansimov, Roger B. Grosse, Shun Liao, Jimmy Ba
In this work, we propose to apply trust region optimization to deep reinforcement learning using a recently proposed Kronecker-factored approximation to the curvature. We extend the framework of natural policy gradient and propose to optimize both the actor and the critic using Kronecker-factored approximate curvature (K-FAC) with trust region; hence we call our method Actor Critic using Kronecker-Factored Trust Region (ACKTR). To the best of our knowledge, this is the first scalable trust region natural gradient method for actor-critic methods. It is also a method that learns non-trivial tasks in continuous control as well as discrete control policies directly from raw pixel inputs. We tested our approach across discrete domains in Atari games as well as continuous domains in the MuJoCo environment. With the proposed methods, we are able to achieve higher rewards and a 2- to 3-fold improvement in sample efficiency on average, compared to previous state-of-the-art on-policy actor-critic methods.
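The core computational trick named in this abstract is the Kronecker-factored approximation of each layer's Fisher block, which turns the otherwise intractable natural-gradient solve into two small matrix inversions. Below is a minimal NumPy sketch of that update for one dense layer, together with ACKTR's trust-region rescaling of the step size; the variable names and the damping scheme are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def kfac_natural_gradient(a, g, grad_W, damping=1e-2):
    """One K-FAC natural-gradient direction for a single dense layer.

    a:      (batch, n_in)  layer inputs
    g:      (batch, n_out) gradients of the loss w.r.t. pre-activations
    grad_W: (n_out, n_in)  ordinary gradient of the loss w.r.t. the weights
    """
    batch = a.shape[0]
    # Kronecker factors of the layer's Fisher block: F ~= A (x) S
    A = a.T @ a / batch          # input second-moment matrix, (n_in, n_in)
    S = g.T @ g / batch          # pre-activation gradient covariance, (n_out, n_out)
    # Damp each factor so the small inverses stay well conditioned
    pi = np.sqrt(damping)
    A_inv = np.linalg.inv(A + pi * np.eye(A.shape[0]))
    S_inv = np.linalg.inv(S + pi * np.eye(S.shape[0]))
    # Kronecker identity: (A (x) S)^-1 vec(grad_W) == vec(S^-1 grad_W A^-1)
    return S_inv @ grad_W @ A_inv

def trust_region_scale(grad_W, nat_grad, lr_max=0.25, kl_budget=1e-3):
    # Shrink the step so the predicted KL change stays within the budget:
    # eta = min(lr_max, sqrt(2 * delta / (grad . natgrad)))
    quad = np.sum(grad_W * nat_grad)
    return min(lr_max, np.sqrt(2.0 * kl_budget / (quad + 1e-12)))
```

The two inversions are over n_in x n_in and n_out x n_out matrices rather than the full (n_in*n_out)-dimensional Fisher block, which is what makes the approach scalable.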
- North America > Canada > Ontario > Toronto (0.29)
- Asia > Middle East > Jordan (0.04)
- North America > United States > New York (0.04)
- North America > United States > California > Los Angeles County > Long Beach (0.04)
Emergence of Different Modes of Tool Use in a Reaching and Dragging Task
Nguyen, Khuong, Choe, Yoonsuck
Tool use is an important milestone in the evolution of intelligence. In this paper, we investigate different modes of tool use that emerge in a reaching and dragging task. In this task, a jointed arm with a gripper must grab a tool (T-, I-, or L-shaped) and drag an object down to the target location (the bottom of the arena). The simulated environment had realistic physics, such as gravity and friction. We trained a deep reinforcement learning based controller (with raw visual and proprioceptive input) with minimal reward shaping information to tackle this task. We observed the emergence of a wide range of unexpected behaviors not directly encoded in the motor primitives or reward functions. Examples include hitting the object toward the target location, correcting the error of an initial contact, and throwing the tool toward the object, as well as normal expected behavior such as a wide sweep. We further analyzed these behaviors based on the type of tool and the initial position of the target object. Our results show a rich repertoire of behaviors, beyond the basic built-in mechanisms of the deep reinforcement learning method we used.
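The task description above (grab a tool, drag the object to the bottom of the arena, minimal reward shaping) implies a largely sparse reward. A hypothetical minimal-shaping reward of that flavor is sketched below; the success margin and the small grasp bonus are invented for illustration, since the paper's exact terms are not reproduced here.

```python
def reward(object_y, target_y, tool_grasped,
           success_margin=0.05, grasp_bonus=0.01):
    """Hypothetical minimal-shaping reward for the reaching-and-dragging task.

    object_y / target_y: vertical positions; the target is the arena bottom.
    """
    if abs(object_y - target_y) < success_margin:
        return 1.0                                  # sparse success signal
    return grasp_bonus if tool_grasped else 0.0     # tiny shaping term for grasping
```

With so little shaping, the strategies the paper reports (hitting, throwing, error correction) are free to emerge rather than being prescribed.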
- North America > United States > Texas > Brazos County > College Station (0.14)
- Asia > Middle East > Jordan (0.04)
ROS2Learn: a reinforcement learning framework for ROS 2
Nuin, Yue Leire Erro, Lopez, Nestor Gonzalez, Moral, Elias Barba, Juan, Lander Usategui San, Rueda, Alejandro Solano, Vilches, Víctor Mayoral, Kojcev, Risto
We propose a novel framework for Deep Reinforcement Learning (DRL) in modular robotics to train a robot directly from joint states, using traditional robotic tools. We use state-of-the-art implementations of the Proximal Policy Optimization, Trust Region Policy Optimization, and Actor-Critic Kronecker-Factored Trust Region algorithms to learn policies in four different Modular Articulated Robotic Arm (MARA) environments. We support this process with a framework that communicates with typical tools used in robotics, such as Gazebo and the Robot Operating System 2 (ROS 2). We evaluate the algorithms on modular robots through an empirical study in simulation.
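The framework described here exposes Gazebo-simulated MARA arms to standard RL code through a gym-style interface over ROS 2. The sketch below shows the general shape of a rollout against such an environment; the environment id "MARA-v0" and the random-action stand-in for the learned policy are assumptions for illustration, not the framework's verified API.

```python
import gym  # the framework registers its Gazebo/ROS 2 environments behind a gym-style API

def collect_episode(env_id="MARA-v0", max_steps=500):    # env id is an assumption
    env = gym.make(env_id)    # under the hood: Gazebo simulation driven via ROS 2
    obs = env.reset()         # observation: the arm's joint states
    trajectory = []
    for _ in range(max_steps):
        action = env.action_space.sample()   # stand-in for a PPO/TRPO/ACKTR policy
        obs_next, reward, done, info = env.step(action)  # command joints, read back reward
        trajectory.append((obs, action, reward))
        obs = obs_next
        if done:
            break
    env.close()
    return trajectory
```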
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > United States > California > Los Angeles County > Long Beach (0.04)
- Asia > Middle East > Jordan (0.04)
On-Policy Trust Region Policy Optimisation with Replay Buffers
Kangin, Dmitry, Pugeault, Nicolas
Building upon the recent success of deep reinforcement learning methods, we investigate the possibility of improving on-policy reinforcement learning by reusing the data from several consecutive policies. On-policy methods bring many benefits, such as the ability to evaluate each resulting policy. However, they usually discard all information about the policies which existed before. In this work, we propose an adaptation of the replay buffer concept, borrowed from the off-policy learning setting, to create a method that combines the advantages of on- and off-policy learning. To achieve this, the proposed algorithm generalises the $Q$-, value and advantage functions for data from multiple policies. The method uses trust region optimisation while avoiding some common problems of algorithms such as TRPO or ACKTR: it uses hyperparameters to replace the trust region selection heuristics, as well as a trainable covariance matrix instead of a fixed one. In many cases, the method improves the results not only compared to state-of-the-art trust region on-policy learning algorithms such as PPO, ACKTR and TRPO, but also with respect to their off-policy counterpart, DDPG.
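The central mechanism in this abstract, reusing transitions gathered under several recent policies inside a single on-policy-style update, is typically realised with importance sampling. The sketch below shows one plausible surrogate of that kind; the ratio cap and the buffer layout are illustrative assumptions rather than the authors' exact formulation.

```python
import numpy as np

def replay_surrogate(logp_new, logp_gen, advantages, ratio_cap=10.0):
    """Importance-weighted surrogate over a buffer of several past policies.

    logp_new:   log pi_theta(a|s) under the current policy
    logp_gen:   log prob under whichever stored policy generated each sample
    advantages: advantage estimates generalised across the stored policies
    """
    ratio = np.exp(logp_new - logp_gen)    # per-sample importance weight
    ratio = np.minimum(ratio, ratio_cap)   # keep stale samples from dominating
    return -np.mean(ratio * advantages)    # negated so it can be minimised
```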
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
- Europe > United Kingdom > England > Devon > Exeter (0.04)
Improving On-policy Learning with Statistical Reward Accumulation
Deng, Yubin, Yu, Ke, Lin, Dahua, Tang, Xiaoou, Loy, Chen Change
Deep reinforcement learning has achieved significant breakthroughs in recent years. Most methods in deep RL achieve good results via the maximization of the reward signal provided by the environment, typically in the form of discounted cumulative returns. Such reward signals represent the immediate feedback for a particular action performed by an agent. However, tasks with sparse reward signals remain challenging for on-policy methods. In this paper, we introduce an effective characterization of past reward statistics (which can be seen as long-term feedback signals) to supplement this immediate reward feedback. In particular, value functions are learned with multi-critic supervision, enabling complex value functions to be more easily approximated in on-policy learning, even when the reward signals are sparse. We also introduce a novel exploration mechanism called "hot-wiring" that can give a boost to seemingly trapped agents. We demonstrate the effectiveness of our advantage actor multi-critic (A2MC) method across the discrete domains in Atari games as well as continuous domains in the MuJoCo environments. A video demo is provided at https://youtu.be/zBmpf3Yz8tc.
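The multi-critic supervision named in this abstract can be pictured as one shared trunk with an actor head and two value heads trained against different targets: the usual discounted return and a long-term reward statistic. The PyTorch sketch below is schematic; the head names, the specific auxiliary target, and the plain sum of the two regression losses are assumptions, since the paper's exact construction is not reproduced here.

```python
import torch
import torch.nn as nn

class A2MCNet(nn.Module):
    """Actor with two critics: one regressing discounted returns, one
    regressing a long-term reward statistic (illustrative layout)."""
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.policy = nn.Linear(hidden, n_actions)  # actor head
        self.v_ret = nn.Linear(hidden, 1)           # critic on discounted returns
        self.v_stat = nn.Linear(hidden, 1)          # critic on past-reward statistics

    def forward(self, obs):
        h = self.trunk(obs)
        return self.policy(h), self.v_ret(h), self.v_stat(h)

def multi_critic_loss(v_ret, v_stat, returns, reward_stats):
    # Each critic head regresses onto its own supervision signal
    return ((v_ret.squeeze(-1) - returns) ** 2).mean() + \
           ((v_stat.squeeze(-1) - reward_stats) ** 2).mean()
```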
An Empirical Analysis of Proximal Policy Optimization with Kronecker-factored Natural Gradients
In this technical report, we consider an approach that combines the PPO objective and K-FAC natural gradient optimization, which we call PPOKFAC. We perform a range of empirical analyses on various aspects of the algorithm, such as sample complexity, training speed, and sensitivity to batch size and training epochs. We observe that PPOKFAC is able to outperform PPO in terms of sample complexity and speed in a range of MuJoCo environments, while being scalable in terms of batch size. In spite of this, adding more epochs is not necessarily helpful for sample efficiency, and PPOKFAC seems to be worse than its A2C counterpart, ACKTR.
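Both ingredients of PPOKFAC are standard on their own: the clipped surrogate below follows the published PPO objective, and the report's combination is to precondition its gradient with a K-FAC natural-gradient step (as in ACKTR) instead of Adam. The NumPy rendering is a sketch of the objective only.

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Clipped PPO surrogate; in PPOKFAC its gradient would be fed to a
    K-FAC optimizer rather than a first-order method such as Adam."""
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -np.mean(np.minimum(unclipped, clipped))  # pessimistic bound, negated
```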
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- North America > Canada > Ontario > Toronto (0.04)
Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation
Wu, Yuhuai, Mansimov, Elman, Grosse, Roger B., Liao, Shun, Ba, Jimmy
In this work, we propose to apply trust region optimization to deep reinforcement learning using a recently proposed Kronecker-factored approximation to the curvature. We extend the framework of natural policy gradient and propose to optimize both the actor and the critic using Kronecker-factored approximate curvature (K-FAC) with trust region; hence we call our method Actor Critic using Kronecker-Factored Trust Region (ACKTR). To the best of our knowledge, this is the first scalable trust region natural gradient method for actor-critic methods. It is also a method that learns non-trivial tasks in continuous control as well as discrete control policies directly from raw pixel inputs. We tested our approach across discrete domains in Atari games as well as continuous domains in the MuJoCo environment. With the proposed methods, we are able to achieve higher rewards and a 2- to 3-fold improvement in sample efficiency on average, compared to previous state-of-the-art on-policy actor-critic methods. Code is available at https://github.com/openai/baselines
- North America > Canada > Ontario > Toronto (0.15)
- Asia > Middle East > Jordan (0.04)
- North America > United States > New York (0.04)
- North America > United States > California > Los Angeles County > Long Beach (0.04)
Elon Musk's Research Venture Has Trained AI To Teach Itself
As part of its effort to find better ways to develop and train "safe artificial general intelligence," OpenAI has been releasing its own versions of reinforcement learning algorithms. They call these OpenAI Baselines, and the most recent additions to these algorithms are two baselines that are meant to enhance machine learning performance by making it more efficient. The first is a baseline implementation called Actor Critic using Kronecker-factored Trust Region (ACKTR). Developed by researchers from the University of Toronto (UofT) and New York University (NYU), ACKTR improves on the way AI policies perform deep reinforcement learning -- learning that is accomplished only by trial and error, and obtained only through raw observation. In a paper published online, the UofT and NYU researchers used simulated robots and Atari games to test how ACKTR learns control policies.
- North America > Canada > Ontario > Toronto (0.57)
- North America > United States > New York (0.26)