Evolved Policy Gradients
We propose a metalearning approach for learning gradient-based reinforcement learning (RL) algorithms. The idea is to evolve a differentiable loss function, such that an agent, which optimizes its policy to minimize this loss, will achieve high rewards. The loss is parametrized via temporal convolutions over the agent's experience. Because this loss is highly flexible in its ability to take into account the agent's history, it enables fast task learning. Empirical results show that our evolved policy gradient algorithm (EPG) achieves faster learning on several randomized environments compared to an off-the-shelf policy gradient method. We also demonstrate that EPG's learned loss can generalize to out-of-distribution test time tasks, and exhibits qualitatively different behavior from other popular metalearning algorithms.
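One way to write down the idea in this abstract is as a two-level optimization problem. The notation below is an assumption introduced here for illustration and is not taken from the paper: \phi denotes the parameters of the learned loss L_\phi, \theta the policy parameters, \tau_k the experience collected by the policy at inner step k, \alpha an inner-loop step size, K the number of inner steps, and R_{\mathcal{M}} the return obtained on a task \mathcal{M} drawn from a task distribution p(\mathcal{M}).

```latex
% Inner loop: the agent minimizes the learned loss by gradient descent.
\theta_{k+1} = \theta_k - \alpha \, \nabla_{\theta} L_{\phi}(\theta_k; \tau_k),
\qquad k = 0, \dots, K-1

% Outer loop: the loss parameters are evolved so that the trained agent
% achieves high return on randomly sampled tasks.
\phi^{*} = \arg\max_{\phi} \;
  \mathbb{E}_{\mathcal{M} \sim p(\mathcal{M})}
  \left[ R_{\mathcal{M}}(\theta_K) \right]
```

Under this reading, \theta_K depends on \phi only through the inner-loop updates, which is why a gradient-free method such as evolution can be used for the outer problem.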
Reviews: Evolved Policy Gradients
The authors present an approach for learning loss functions for reinforcement learning via a combination of evolution strategies (ES) as an outer loop and a simple policy gradient algorithm in the inner loop. Overall I found this to be a very interesting paper. My one criticism is that I would have liked to see a bit more of a study of which parts of the algorithm and the loss architecture are important. The algorithm itself is relatively simple. Although I appreciate the detail of Algorithm 1, to some degree I feel that it obscures the algorithm. In essence, this approach corresponds to "use policy gradient in the inner loop, and ES in the outer loop". More interesting is the structure of the loss architecture.
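The reviewer's one-line summary ("policy gradient in the inner loop, ES in the outer loop") can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not the paper's Algorithm 1: the task is a 1-D problem with reward -(a - target)^2, the policy is a Gaussian with a fixed standard deviation, the "learned loss" has only two parameters phi that reweight a policy-gradient-style term, and the names inner_loop and evolve_loss are hypothetical. It uses only the Python standard library.

```python
import random

SIGMA_POLICY = 0.5  # fixed std of the toy Gaussian policy (assumption)


def inner_loop(phi, target, rng, steps=20, batch=16, lr=0.1):
    """Inner loop: train the policy mean `theta` by gradient descent on the
    toy loss L_phi = -mean((phi[0] * r + phi[1]) * log pi(a)).
    Returns the reward of the final mean action."""
    theta = 0.0
    for _ in range(steps):
        grad = 0.0
        for _ in range(batch):
            a = rng.gauss(theta, SIGMA_POLICY)   # sample an action
            r = -(a - target) ** 2               # toy reward
            weight = phi[0] * r + phi[1]         # loss parameters enter here
            # d/dtheta of -weight * log pi(a), with
            # log pi(a) = -(a - theta)^2 / (2 * sigma^2) + const
            grad += -weight * (a - theta) / SIGMA_POLICY ** 2
        theta -= lr * grad / batch
        theta = max(-10.0, min(10.0, theta))     # keep bad losses bounded
    return -(theta - target) ** 2


def evolve_loss(generations=30, population=16, noise_std=0.2, lr=0.1, seed=0):
    """Outer loop: evolution strategies over the loss parameters phi."""
    rng = random.Random(seed)
    phi = [0.0, 0.0]
    for _ in range(generations):
        noises, returns = [], []
        for _ in range(population):
            eps = [rng.gauss(0.0, 1.0) for _ in phi]
            candidate = [p + noise_std * e for p, e in zip(phi, eps)]
            target = rng.uniform(-2.0, 2.0)      # a freshly randomized task
            noises.append(eps)
            returns.append(inner_loop(candidate, target, rng))
        mean = sum(returns) / population
        std = (sum((r - mean) ** 2 for r in returns) / population) ** 0.5
        if std == 0.0:
            continue                             # no signal this generation
        for j in range(len(phi)):
            # ES gradient estimate using standardized returns
            step = sum((r - mean) / std * eps[j]
                       for r, eps in zip(returns, noises))
            phi[j] += lr * step / (population * noise_std)
    return phi


if __name__ == "__main__":
    print("evolved loss parameters:", evolve_loss())
```

Setting phi = [1.0, 0.0] recovers a plain policy-gradient surrogate in this toy, so the outer loop is effectively searching over a small family of losses that contains it. Standardizing the returns before the ES update is a common stabilizing choice and is also an assumption here.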
OpenAI Brings Introspection To Reinforcement Learning Agents - AI Summary
Recently, researchers from OpenAI published a new paper that proposes a method to address this challenge by creating RL models that know what it means to make progress on a new task, having experienced making progress on similar tasks in the past. Titled Evolved Policy Gradients (EPG), the OpenAI research paper introduces a new meta-learning technique based on the concept of a loss function that qualifies the learning progress. When used in RL models, the EPG method does not encode knowledge explicitly through memorized behaviors; instead, it encodes it implicitly through a learned loss function. The end goal of EPG is RL agents that can use this loss function to learn a novel task. In initial tests, EPG seems to improve on standard RL algorithms by allowing the loss function to adapt to the environment and agent history, leading to faster learning and the potential for learning without external rewards.
Evolved Policy Gradients
Houthooft, Rein; Chen, Yuhua; Isola, Phillip; Stadie, Bradly; Wolski, Filip; Ho, Jonathan (OpenAI); Abbeel, Pieter
We propose a metalearning approach for learning gradient-based reinforcement learning (RL) algorithms. The idea is to evolve a differentiable loss function, such that an agent, which optimizes its policy to minimize this loss, will achieve high rewards. The loss is parametrized via temporal convolutions over the agent's experience. Because this loss is highly flexible in its ability to take into account the agent's history, it enables fast task learning. Empirical results show that our evolved policy gradient algorithm (EPG) achieves faster learning on several randomized environments compared to an off-the-shelf policy gradient method.
Evolved Policy Gradients
We're releasing an experimental metalearning approach called Evolved Policy Gradients, a method that evolves the loss function of learning agents, which can enable fast training on novel tasks. Agents trained with EPG can succeed at basic tasks at test time that were outside their training regime, like learning to navigate to an object on a different side of the room from where it was placed during training. EPG trains agents to have a prior notion of what constitutes making progress on a novel task. Rather than encoding prior knowledge through a learned policy network, EPG encodes it as a learned loss function[1]. Agents are then able to use this loss function, defined as a temporal-convolutional neural network, to learn quickly on a novel task. We've shown that EPG can generalize to out-of-distribution test-time tasks, exhibiting behavior qualitatively different from other popular metalearning algorithms.
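To make "a loss function defined as a temporal-convolutional neural network" more concrete, here is a minimal forward-pass sketch in plain Python. It is not the architecture from the paper: the single convolution layer, the leaky-ReLU nonlinearity, the mean pooling over time, the linear readout, and the function names temporal_conv and learned_loss are all assumptions chosen for illustration, as is the choice of per-timestep features.

```python
def temporal_conv(seq, kernels, bias):
    """Valid 1-D convolution over the time axis.

    seq:     list of T feature vectors, each of length C_in
    kernels: list of C_out kernels; each kernel is K rows of C_in weights
    bias:    list of C_out biases
    Returns a list of T - K + 1 output vectors, each of length C_out.
    """
    k = len(kernels[0])
    out = []
    for t in range(len(seq) - k + 1):
        row = []
        for kern, b in zip(kernels, bias):
            s = b
            for dt in range(k):
                for c, w in enumerate(kern[dt]):
                    s += w * seq[t + dt][c]
            row.append(s)
        out.append(row)
    return out


def learned_loss(experience, kernels, bias, readout):
    """Map a window of experience to one scalar loss value."""
    feats = temporal_conv(experience, kernels, bias)
    # Leaky ReLU on every activation, then average each channel over time.
    pooled = [sum(max(0.01 * v, v) for v in col) / len(feats)
              for col in zip(*feats)]
    # Linear readout of the pooled channels gives the scalar loss.
    return sum(w * p for w, p in zip(readout, pooled))


if __name__ == "__main__":
    # Three timesteps with two hypothetical features per step.
    experience = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    kernels = [[[1.0, 0.0], [0.0, 1.0]]]  # one output channel, width 2
    print(learned_loss(experience, kernels, [0.0], [2.0]))  # prints 3.0
```

This computes only the value of the loss. For an agent to minimize it by gradient descent, at least one input feature would have to depend on the policy's outputs, and the whole computation would have to run inside an automatic-differentiation framework; both are left out of this sketch.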