RL Zero: Direct Policy Inference from Language Without In-Domain Supervision
Harshit Sikchi, Siddhant Agarwal, Pranaya Jajoo, Samyak Parajuli, Caleb Chuck, Max Rudolph, Peter Stone, Amy Zhang, Scott Niekum
Neural Information Processing Systems
The reward hypothesis states that all goals and purposes can be understood as the maximization of a received scalar reward signal. In practice, however, defining such a reward signal is notoriously difficult, because humans are often unable to predict the optimal behavior that a given reward function induces. Natural language offers an intuitive alternative for instructing reinforcement learning (RL) agents, yet previous language-conditioned approaches require either costly supervision or test-time training given a language instruction. In this work, we present a new approach that uses an RL agent pretrained only on unlabeled, offline interactions, without task-specific supervision or labeled trajectories, to perform zero-shot test-time policy inference from arbitrary natural language instructions. We introduce a framework comprising three steps: imagine, project, and imitate.
Jun-18-2026, 17:01:03 GMT