PWM: Policy Learning with Large World Models
Georgiev, Ignat, Giridhar, Varun, Hansen, Nicklas, Garg, Animesh
arXiv.org Artificial Intelligence
Reinforcement Learning (RL) has achieved impressive results on complex tasks but struggles in multi-task settings with different embodiments. World models offer scalability by learning a simulation of the environment, yet they often rely on inefficient gradient-free optimization methods. We introduce Policy learning with large World Models (PWM), a novel model-based RL algorithm that learns continuous control policies from large multi-task world models. By pre-training the world model on offline data and using it for first-order gradient policy learning, PWM effectively solves tasks with up to 152 action dimensions and outperforms methods using ground-truth dynamics. Additionally, PWM scales to an 80-task setting, achieving up to 27% higher rewards than existing baselines without the need for expensive online planning. Visualizations and code are available at https://www.imgeorgiev.com/pwm
Jul-3-2024
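The abstract contrasts gradient-free optimization with first-order gradient policy learning through a world model. The toy sketch below illustrates only that general idea and is not PWM's implementation: a hand-written scalar linear model stands in for a pre-trained world model, the reward is a simple quadratic cost, and the policy is a single gain k. All names and constants here (rollout_with_gradient, train_policy, A, B, c) are hypothetical and chosen for illustration. The gradient of the imagined return with respect to k is obtained by differentiating through every model step, then used for plain gradient ascent.

```python
def rollout_with_gradient(k, s0=1.0, horizon=10, A=1.0, B=0.5, c=0.1):
    """Roll the policy a = k * s through a toy differentiable model.

    Returns (total_reward, d total_reward / dk). The model
    s' = A * s + B * a is a stand-in for a learned world model
    (an assumption for illustration only).
    """
    s, ds = s0, 0.0           # state and its sensitivity d s / d k
    total, dtotal = 0.0, 0.0  # imagined return and its gradient
    for _ in range(horizon):
        a = k * s                     # policy action
        da = s + k * ds               # d a / d k (product rule)
        total += -(s * s + c * a * a)                 # quadratic reward
        dtotal += -(2.0 * s * ds + 2.0 * c * a * da)  # d reward / d k
        # Step the model and propagate the sensitivity through it.
        s, ds = A * s + B * a, A * ds + B * da
    return total, dtotal


def train_policy(k=0.0, lr=0.01, steps=200):
    """First-order gradient ascent on the imagined return."""
    for _ in range(steps):
        _, grad = rollout_with_gradient(k)
        k += lr * grad
    return k


if __name__ == "__main__":
    before, _ = rollout_with_gradient(0.0)
    k_star = train_policy()
    after, _ = rollout_with_gradient(k_star)
    print(f"return before: {before:.3f}, after: {after:.3f}, gain: {k_star:.3f}")
```

Because the model is differentiable, each update uses a single imagined rollout and its exact gradient, rather than sampling and scoring many candidate action sequences as a gradient-free planner would. In practice an automatic differentiation library would replace the hand-derived sensitivities.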