Offline Meta-Reinforcement Learning with Advantage Weighting
Eric Mitchell, Rafael Rafailov, Xue Bin Peng, Sergey Levine, Chelsea Finn
arXiv.org Artificial Intelligence
This paper introduces the offline meta-reinforcement learning (offline meta-RL) problem setting and proposes an algorithm that performs well in this setting. Offline meta-RL is analogous to the widely successful supervised learning strategy of pretraining a model on a large batch of fixed, pre-collected data (possibly from various tasks) and fine-tuning the model to a new task with relatively little data. That is, in offline meta-RL, we meta-train on fixed, pre-collected data from several tasks and adapt to a new task with a very small amount of data (fewer than 5 trajectories) from the new task. Because it is offline, offline meta-RL can utilize the largest possible pool of training data available and eliminate potentially unsafe or costly data collection during meta-training. This setting inherits the challenges of offline RL, but it differs significantly because offline RL does not generally consider a) transfer to new tasks or b) limited data from the test task, both of which we face in offline meta-RL. Targeting the offline meta-RL setting, we propose Meta-Actor Critic with Advantage Weighting (MACAW). MACAW is an optimization-based meta-learning algorithm that uses simple, supervised regression objectives for both the inner and outer loop of meta-training. On offline variants of common meta-RL benchmarks, we empirically find that this approach enables fully offline meta-reinforcement learning and achieves notable gains over prior methods.

Meta-reinforcement learning (meta-RL) has emerged as a promising strategy for tackling the high sample complexity of reinforcement learning algorithms when the goal is to ultimately learn many tasks. Meta-RL algorithms exploit shared structure among tasks during meta-training, amortizing the cost of learning across tasks and enabling rapid adaptation to new tasks during meta-testing from only a small amount of experience. Yet unlike in supervised learning, where large amounts of pre-collected data can be pooled from many sources to train a single model, existing meta-RL algorithms assume the ability to collect millions of environment interactions online during meta-training.
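To make the "supervised regression in both loops" structure concrete, the sketch below shows a MACAW-style meta-training step in PyTorch. It is a minimal illustration under stated assumptions, not the authors' implementation: it uses a first-order approximation to the MAML update, a fixed-variance Gaussian policy, and hypothetical hyperparameters (`TEMP`, `INNER_LR`, `OUTER_LR`) and batch keys (`obs`, `act`, `ret`). The inner loop adapts a value function by regressing onto Monte Carlo returns and adapts the policy with advantage-weighted regression (AWR); the outer loop applies the same two losses on a disjoint batch from the same task.

```python
import copy
import torch
import torch.nn as nn

TEMP, INNER_LR, OUTER_LR = 1.0, 1e-2, 1e-3  # assumed hyperparameters

def make_nets(obs_dim=4, act_dim=2):
    value = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, 1))
    policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim))
    return value, policy

def value_loss(value, batch):
    # Supervised objective for the value function: plain regression of
    # V(s) onto Monte Carlo returns computed from the offline data.
    return ((value(batch["obs"]).squeeze(-1) - batch["ret"]) ** 2).mean()

def awr_loss(policy, value, batch):
    # Advantage-weighted regression: log-likelihood of the dataset's
    # actions, weighted by exp(advantage / temperature).
    with torch.no_grad():
        adv = batch["ret"] - value(batch["obs"]).squeeze(-1)
        w = torch.exp(adv / TEMP).clamp(max=20.0)  # clipped for stability
    mean = policy(batch["obs"])
    dist = torch.distributions.Normal(mean, torch.ones_like(mean))  # fixed std (simplification)
    return -(w * dist.log_prob(batch["act"]).sum(-1)).mean()

def adapt(net, loss_fn, batch):
    # Inner loop: one SGD step on a task-specific copy of the network.
    fast = copy.deepcopy(net)
    grads = torch.autograd.grad(loss_fn(fast, batch), fast.parameters())
    with torch.no_grad():
        for p, g in zip(fast.parameters(), grads):
            p -= INNER_LR * g
    return fast

def meta_step(value, policy, support, query, opt):
    # Outer loop for a single task (MACAW samples batches of tasks).
    # First-order simplification: gradients of the post-adaptation loss
    # on the query batch are applied directly to the meta-parameters.
    fast_v = adapt(value, value_loss, support)
    fast_pi = adapt(policy, lambda p, b: awr_loss(p, fast_v, b), support)
    outer = value_loss(fast_v, query) + awr_loss(fast_pi, fast_v, query)
    grads = torch.autograd.grad(
        outer, list(fast_v.parameters()) + list(fast_pi.parameters()))
    opt.zero_grad()
    for p, g in zip(list(value.parameters()) + list(policy.parameters()), grads):
        p.grad = g
    opt.step()

# Usage with toy shapes: support/query are dicts of tensors drawn from
# one task's offline dataset (random here purely for illustration).
value, policy = make_nets()
opt = torch.optim.Adam(list(value.parameters()) + list(policy.parameters()), lr=OUTER_LR)
support = {"obs": torch.randn(32, 4), "act": torch.randn(32, 2), "ret": torch.randn(32)}
query = {"obs": torch.randn(32, 4), "act": torch.randn(32, 2), "ret": torch.randn(32)}
meta_step(value, policy, support, query, opt)
```

The full second-order update would differentiate through the inner gradient step; the first-order variant above trades that for simplicity, a common approximation in MAML-style training. The key property the sketch preserves is that both loops are ordinary supervised regressions on fixed data, requiring no environment interaction.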
Aug-13-2020