RL Zero: Direct Policy Inference from Language Without In-Domain Supervision

Harshit Sikchi, Siddhant Agarwal, Pranaya Jajoo, Samyak Parajuli, Caleb Chuck, Max Rudolph, Peter Stone, Amy Zhang, Scott Niekum

Neural Information Processing Systems 

The reward hypothesis states that all goals and purposes can be understood as the maximization of a received scalar reward signal. In practice, however, defining such a reward signal is notoriously difficult, because humans are often unable to predict the optimal behavior corresponding to a reward function. Natural language offers an intuitive alternative for instructing reinforcement learning (RL) agents, yet previous language-conditioned approaches require either costly supervision or test-time training for each language instruction. In this work, we present a new approach that uses an RL agent pretrained only on unlabeled, offline interactions, without task-specific supervision or labeled trajectories, to perform zero-shot test-time policy inference from arbitrary natural language instructions. We introduce a framework comprising three steps: imagine, project, and imitate.