EPO: Hierarchical LLM Agents with Environment Preference Optimization
Qi Zhao, Haotian Fu, Chen Sun, George Konidaris
arXiv.org Artificial Intelligence
Long-horizon decision-making tasks present significant challenges for LLM-based agents due to the need for extensive planning over multiple steps. In this paper, we propose a hierarchical framework that decomposes complex tasks into manageable subgoals, utilizing separate LLMs for subgoal prediction and low-level action generation. To address the challenge of creating training signals for unannotated datasets, we develop a reward model that leverages multimodal environment feedback to automatically generate reward signals. We introduce Environment Preference Optimization (EPO), a novel method that generates preference signals from the environment's feedback and uses them to train LLM-based agents. Extensive experiments on ALFRED demonstrate the state-of-the-art performance of our framework, achieving first place on the ALFRED public leaderboard and showcasing its potential to improve long-horizon decision-making in diverse environments.
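The abstract does not spell out the EPO training objective, but preference-based methods of this kind typically optimize a DPO-style loss over pairs of agent outputs, where the "preferred" output is the one the environment's feedback ranks higher. The sketch below is a minimal, hypothetical illustration of such a loss in pure Python; the function name `epo_preference_loss`, the hyperparameter `beta`, and the use of a frozen reference policy are assumptions, not details taken from the paper.

```python
import math

def epo_preference_loss(logp_preferred: float,
                        logp_dispreferred: float,
                        ref_logp_preferred: float,
                        ref_logp_dispreferred: float,
                        beta: float = 0.1) -> float:
    """Hypothetical DPO-style preference loss for environment feedback.

    logp_* are the current policy's log-probabilities of the two candidate
    outputs (e.g. subgoal or action sequences); ref_logp_* come from a frozen
    reference policy. The environment's feedback determines which candidate
    is "preferred". Lower loss means the policy favors the preferred output
    more strongly (relative to the reference) than the dispreferred one.
    """
    margin = beta * ((logp_preferred - ref_logp_preferred)
                     - (logp_dispreferred - ref_logp_dispreferred))
    # Negative log-sigmoid of the margin, as in standard DPO.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

For example, if the policy has moved toward the environment-preferred output relative to the reference, the margin is positive and the loss falls below `log(2)`; if it has moved the wrong way, the loss rises above `log(2)`.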
Aug-28-2024