Growing with Your Embodied Agent: A Human-in-the-Loop Lifelong Code Generation Framework for Long-Horizon Manipulation Skills

Meng, Yuan, Sun, Zhenguo, Fest, Max, Li, Xukun, Bing, Zhenshan, Knoll, Alois

arXiv.org Artificial Intelligence 

Code generation based on large language models (LLMs) for robotic manipulation has recently shown promise by directly translating human instructions into executable code, but existing approaches are limited by language ambiguity, noisy outputs, and limited context windows, which make long-horizon tasks hard to solve. While closed-loop feedback has been explored, approaches that rely solely on LLM guidance frequently fail in extremely long-horizon scenarios due to LLMs' limited reasoning capability in the robotic domain, where such issues are often simple for humans to identify. Moreover, corrected knowledge is often stored in improper formats, restricting generalization and causing catastrophic forgetting, which highlights the need for learning reusable and extendable skills. To address these issues, we propose a human-in-the-loop lifelong skill learning and code generation framework that encodes feedback into reusable skills and extends their functionality over time. An external memory with Retrieval-Augmented Generation and a hint mechanism supports dynamic reuse, enabling robust performance on long-horizon tasks. Experiments on Ravens, Franka Kitchen, and MetaWorld, as well as in real-world settings, show that our framework achieves a 0.93 success rate (up to 27% higher than baselines) and a 42% efficiency improvement in feedback rounds. It robustly solves extremely long-horizon tasks such as "build a house", which requires planning over 20 primitives. Code will be open-sourced upon acceptance.

Large language models (LLMs) and vision-language models (VLMs) have become integral to robotic manipulation due to their robust commonsense knowledge and advanced reasoning capabilities. Early approaches Co-Reyes et al. (2018); Lynch et al. (2023); Liu et al. (2023) relied on conditioning reinforcement learning or imitation learning on language embeddings to align robot actions with human commands.
These methods often struggled with limited data efficiency and poor generalization. With the rapid progress of LLMs such as GPT, a natural direction has been to integrate them into the pipeline for task decomposition and language grounding Zhang et al. (2023); Huang et al. (2023); Guo et al. (2024). In this setting, an LLM decomposes a complex manipulation task into sub-tasks and invokes a pre-trained language-conditioned policy to execute low-level primitives. This approach assumes that the pre-trained policy can carry out each motion precisely, yet in practice, this is rarely possible due to environmental perturbations and imperfect policy design. Another direction for advancing human-level robotic manipulation is to adopt LLM or VLM backbones for large-scale pretraining on robotic data, creating end-to-end vision-language-action (VLA) foundation models Kim et al. (2024); Black et al. (2024); Bjorck et al. (2025).
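The abstract's idea of an external skill memory queried with Retrieval-Augmented Generation can be illustrated with a minimal sketch. The skill names, descriptions, and the bag-of-words similarity below are all illustrative assumptions, not the paper's implementation; a real system would use a learned sentence encoder and inject the retrieved code snippets into the LLM prompt as reusable primitives.

```python
from collections import Counter
from math import sqrt

# Hypothetical skill library: name -> (description, code snippet).
# Entries are illustrative placeholders, not skills from the paper.
SKILL_LIBRARY = {
    "pick_place": ("pick up an object and place it at a target pose",
                   "def pick_place(obj, pose): ..."),
    "open_drawer": ("grasp the drawer handle and pull it open",
                    "def open_drawer(handle): ..."),
    "stack_blocks": ("stack one block on top of another block",
                     "def stack_blocks(top, bottom): ..."),
}

def embed(text):
    """Stand-in embedding: bag-of-words term counts.
    A real RAG memory would use a learned sentence encoder."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_skills(instruction, k=2):
    """Return the k skills whose descriptions best match the
    instruction; their snippets would be added to the LLM prompt."""
    q = embed(instruction)
    scored = sorted(SKILL_LIBRARY.items(),
                    key=lambda kv: cosine(q, embed(kv[1][0])),
                    reverse=True)
    return [name for name, _ in scored[:k]]

print(retrieve_skills("stack the red block on the blue block"))
```

Feedback-driven lifelong learning would then amount to writing corrected or newly composed skills back into `SKILL_LIBRARY`, so later retrievals can reuse them without retraining.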