RL Zero: Direct Policy Inference from Language Without In-Domain Supervision

Harshit Sikchi, Siddhant Agarwal, Pranaya Jajoo, Samyak Parajuli, Caleb Chuck, Max Rudolph, Peter Stone, Amy Zhang, Scott Niekum

Jun-18-2026, 17:01:03 GMT–Neural Information Processing Systems

The reward hypothesis states that all goals and purposes can be understood as the maximization of a received scalar reward signal. In practice, however, defining such a reward signal is notoriously difficult, as humans are often unable to predict the optimal behavior corresponding to a reward function. Natural language offers an intuitive alternative for instructing reinforcement learning (RL) agents, yet previous language-conditioned approaches require either costly supervision or test-time training once a language instruction is given. In this work, we present a new approach that uses an RL agent pretrained only on unlabeled, offline interactions, without task-specific supervision or labeled trajectories, to perform zero-shot test-time policy inference from arbitrary natural language instructions. We introduce a framework comprising three steps: imagine, project, and imitate.

large language model, machine learning, reinforcement learning, (17 more...)

Country:
- North America > United States (0.93)

Genre:
- Research Report > Experimental Study (1.00)

Industry:
- Health & Medicine > Consumer Health (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning (1.00)
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Reinforcement Learning (1.00)
