Towards End-to-End Embodied Decision Making via Multi-modal Large Language Model: Explorations with GPT4-Vision and Beyond
Liang Chen, Yichi Zhang, Shuhuai Ren, Haozhe Zhao, Zefan Cai, Yuchi Wang, Peiyi Wang, Tianyu Liu, Baobao Chang
arXiv.org Artificial Intelligence
In this study, we explore the potential of Multimodal Large Language Models (MLLMs) in improving embodied decision-making processes for agents. While Large Language Models (LLMs) have been widely used owing to their advanced reasoning skills and vast world knowledge, MLLMs like GPT4-Vision offer enhanced visual understanding and reasoning capabilities. We investigate whether state-of-the-art MLLMs can handle embodied decision-making in an end-to-end manner and whether collaboration between LLMs and MLLMs can enhance decision-making. To address these questions, we introduce a new benchmark called PCA-EVAL, which evaluates embodied decision-making from the perspectives of Perception, Cognition, and Action. In addition, we propose HOLMES, a multi-agent cooperation framework that allows LLMs to leverage MLLMs and APIs to gather the multimodal information needed for informed decision-making. We compare end-to-end embodied decision-making with HOLMES on our benchmark and find that GPT4-Vision exhibits strong end-to-end embodied decision-making abilities, outperforming GPT4-HOLMES in average decision accuracy (+3%). However, this capability is exclusive to the latest GPT4-Vision model, which surpasses the open-source state-of-the-art MLLM by 26%. Our results indicate that powerful MLLMs like GPT4-Vision hold promise for decision-making in embodied agents and open new avenues for MLLM research.

The capacity to make well-informed decisions is essential for the survival and success of living organisms in their environments. Similarly, a major goal of embodied artificial intelligence is to develop agents, such as robots, with sophisticated decision-making abilities. Recently, there has been a notable increase in leveraging the exceptional reasoning capabilities and world knowledge of Large Language Models (LLMs) to enhance decision-making in agents. However, LLMs are primarily designed to process textual context, creating a modality gap (Liang et al., 2022; Ren et al., 2023a) for LLM-powered agents when dealing with multimodal observations in real-world scenarios.
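The comparison above hinges on a simple architectural distinction: HOLMES is a tool-use loop in which a text-only LLM first calls vision models or APIs to turn the visual observation into a textual report and then reasons over that report to pick an action, whereas an end-to-end MLLM consumes the image directly. The sketch below illustrates that division of labor together with a PCA-EVAL-style action-accuracy metric; it is a minimal illustration, and every name in it (`PCAExample`, `describe_image`, `llm_decide`) is a hypothetical stand-in rather than the authors' implementation, prompts, or data format.

```python
# Illustrative sketch of a HOLMES-style LLM/MLLM collaboration loop and a
# PCA-EVAL-style action-accuracy metric. All names, prompts, and data fields
# below are hypothetical stand-ins, not the paper's actual implementation.

from dataclasses import dataclass


@dataclass
class PCAExample:
    image_path: str        # egocentric observation, e.g. a traffic scene
    question: str          # decision question posed to the agent
    choices: list[str]     # candidate actions
    answer_index: int      # index of the annotated correct action


def describe_image(image_path: str) -> str:
    """Stand-in for an MLLM or perception API (captioner, detector, ...)
    that the text-only LLM can call to turn pixels into a textual report."""
    return f"[caption of {image_path} produced by a vision model]"


def llm_decide(question: str, choices: list[str], perception: str) -> int:
    """Stand-in for the LLM decision-maker. It only sees text: the question,
    the candidate actions, and the perception report gathered via tools."""
    # A real system would prompt an LLM here; this placeholder picks choice 0.
    return 0


def holmes_style_decision(example: PCAExample) -> int:
    # Perception step: gather multimodal information through tool calls.
    report = describe_image(example.image_path)
    # Cognition + action steps: reason over the textual report and commit
    # to one of the candidate actions.
    return llm_decide(example.question, example.choices, report)


def action_accuracy(examples: list[PCAExample], decide) -> float:
    """Fraction of examples whose chosen action matches the annotation,
    analogous to the action axis of PCA-EVAL."""
    correct = sum(decide(ex) == ex.answer_index for ex in examples)
    return correct / len(examples)


if __name__ == "__main__":
    demo = [PCAExample("scene_001.jpg",
                       "What should the autonomous vehicle do next?",
                       ["stop", "go straight", "turn left"],
                       0)]
    print(action_accuracy(demo, holmes_style_decision))  # 1.0 on this toy example
```

By contrast, an end-to-end MLLM such as GPT4-Vision would replace the perception and decision stand-ins with a single model call that takes the image and question together, which is the setting in which the paper reports its strongest results.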
Nov-28-2023