VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents
Liu, Xiao; Zhang, Tianjie; Gu, Yu; Iong, Iat Long; Xu, Yifan; Song, Xixuan; Zhang, Shudan; Lai, Hanyu; Liu, Xinyi; Zhao, Hanlin; Sun, Jiadai; Yang, Xinyue; Yang, Yu; Qi, Zehan; Yao, Shuntian; Sun, Xueqiao; Cheng, Siyi; Zheng, Qinkai; Yu, Hao; Zhang, Hanchen; Hong, Wenyi; Ding, Ming; Pan, Lihang; Gu, Xiaotao; Zeng, Aohan; Du, Zhengxiao; Song, Chan Hee; Su, Yu; Dong, Yuxiao; Tang, Jie
arXiv.org Artificial Intelligence
Large Multimodal Models (LMMs) have ushered in a new era in artificial intelligence, merging capabilities in both language and vision to form highly capable Visual Foundation Agents. These agents are postulated to excel across a myriad of tasks, potentially approaching general artificial intelligence. However, existing benchmarks fail to sufficiently challenge or showcase the full potential of LMMs in complex, real-world environments. To address this gap, we introduce VisualAgentBench (VAB), a comprehensive and pioneering benchmark specifically designed to train and evaluate LMMs as visual foundation agents across diverse scenarios, including Embodied, Graphical User Interface, and Visual Design, with tasks formulated to probe the depth of LMMs' understanding and interaction capabilities. Through rigorous testing across nine proprietary LMM APIs and eight open models, we demonstrate the considerable yet still developing agent capabilities of these models. Additionally, VAB provides a trajectory training set constructed through hybrid methods, including Program-based Solvers, LMM Agent Bootstrapping, and Human Demonstrations, which yields substantial performance improvements in LMMs through behavior cloning. Our work not only aims to benchmark existing models but also provides a solid foundation for the future development of visual foundation agents.
Aug-12-2024
- Country:
- North America > United States
- Ohio (0.04)
- Pennsylvania > Allegheny County
- Pittsburgh (0.04)
- New York > New York County
- New York City (0.04)
- Massachusetts > Middlesex County
- Cambridge (0.04)
- Europe > Netherlands
- North Holland > Amsterdam (0.04)
- Genre:
- Research Report > New Finding (0.67)
- Industry:
- Leisure & Entertainment (0.94)
- Materials > Metals & Mining (0.68)
- Consumer Products & Services (0.67)
- Information Technology > Software (0.46)