SELU: Self-Learning Embodied MLLMs in Unknown Environments

Boyu Li, Haobin Jiang, Ziluo Ding, Xinrun Xu, Haoran Li, Dongbin Zhao, Zongqing Lu

arXiv.org Artificial Intelligence 

Recently, multimodal large language models (MLLMs) have demonstrated strong visual understanding and decision-making capabilities, opening the door to autonomously improving MLLMs in unknown environments. However, external feedback, such as human or environmental feedback, is not always available. To address this challenge, existing methods primarily focus on enhancing the decision-making capabilities of MLLMs through voting and scoring mechanisms, while little attention has been paid to improving their environmental comprehension in unknown environments. To fully unleash the self-learning potential of MLLMs, we propose SELU, a novel self-learning paradigm inspired by the actor-critic paradigm in reinforcement learning. The critic employs self-asking and hindsight relabeling to extract knowledge from the interaction trajectories collected by the actor, thereby augmenting its environmental comprehension. Simultaneously, the actor is improved by the self-feedback provided by the critic, enhancing its decision-making. We evaluate our method in the AI2-THOR and VirtualHome environments; SELU achieves critic improvements of approximately 28% and 30%, and actor improvements of about 20% and 24%, via self-learning.

Thanks to their powerful capabilities, many works, e.g., Jarvis-1 (Wang et al., 2023b), STEVE-1 (Lifshitz et al., 2023), and Cradle (Tan et al., 2024b), directly use pre-trained MLLMs to complete various decision-making tasks in different embodied environments. However, the generalization ability of existing pre-trained MLLMs cannot meet the needs of all environments. In uncommon environments, embodied MLLMs often exhibit hallucinations and poor visual understanding (Huang et al., 2024; Jiang et al., 2024); for example, they may fail to distinguish left from right or to recognize where objects are located (Tan et al., 2024b). The root cause is that the MLLMs have not been grounded in these environments (Su et al., 2022; Sun et al., 2024). Grounding can be achieved by fine-tuning on the experiences gathered from interacting with the environment.
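To make the loop above concrete, here is a minimal Python sketch of one self-learning round. All names (Step, Trajectory, rollout, finetune) are hypothetical stand-ins, not the authors' implementation: in SELU the actor and critic are MLLMs, and finetune would be an actual supervised fine-tuning call. The sketch only illustrates the control flow of self-asking, hindsight relabeling, and self-feedback.

```python
# A minimal sketch of a SELU-style actor-critic self-learning round.
# Every component here is a placeholder for an MLLM-based counterpart.

from dataclasses import dataclass, field

@dataclass
class Step:
    observation: str  # e.g., a caption of the current egocentric frame
    action: str       # the action the actor chose at this step

@dataclass
class Trajectory:
    instruction: str                # the task the actor was asked to complete
    steps: list[Step] = field(default_factory=list)
    succeeded: bool = False

def rollout(instruction: str) -> Trajectory:
    # Placeholder: the actor MLLM interacting with AI2-THOR or VirtualHome.
    return Trajectory(instruction, [Step("a mug on the table", "pick up mug")])

def finetune(model_name: str, data: list) -> None:
    # Placeholder for a supervised fine-tuning call on the collected data.
    print(f"fine-tuning {model_name} on {len(data)} examples")

def self_ask(traj: Trajectory) -> list[tuple[str, str]]:
    # Self-asking: the critic generates QA pairs about the environment from a
    # trajectory; fine-tuning on them improves its comprehension. Real
    # questions would come from the critic MLLM; these are template stand-ins.
    return [(f"What was observed before '{s.action}'?", s.observation)
            for s in traj.steps]

def hindsight_relabel(traj: Trajectory) -> Trajectory:
    # Hindsight relabeling: a failed trajectory is relabeled with the goal it
    # actually achieved, turning a failure into a valid training pair.
    if traj.succeeded:
        return traj
    achieved = f"reach the state: {traj.steps[-1].observation}"
    return Trajectory(achieved, traj.steps, succeeded=True)

def self_learning_round(tasks: list[str]) -> None:
    trajectories = [rollout(t) for t in tasks]               # actor explores
    qa = [pair for t in trajectories for pair in self_ask(t)]
    finetune("critic", qa)                                   # critic update
    feedback = [hindsight_relabel(t) for t in trajectories]  # self-feedback
    finetune("actor", [t for t in feedback if t.succeeded])  # actor update

self_learning_round(["bring me a mug"])
```

The key design point this captures is that no external feedback enters the loop: the critic's relabeled trajectories serve as the actor's training signal, while the critic's own QA pairs ground its environmental comprehension.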