GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation

An Yan, Zhengyuan Yang, Wanrong Zhu, Kevin Lin, Linjie Li, Jianfeng Wang, Jianwei Yang, Yiwu Zhong, Julian McAuley, Jianfeng Gao, Zicheng Liu, Lijuan Wang

arXiv.org Artificial Intelligence 

We present MM-Navigator, a GPT-4V-based agent for the smartphone graphical user interface (GUI) navigation task. MM-Navigator can interact with a smartphone screen as human users do, determining subsequent actions to fulfill given instructions. Our findings demonstrate that large multimodal models (LMMs), specifically GPT-4V, excel at zero-shot GUI navigation through their advanced screen interpretation, action reasoning, and precise action localization capabilities. We first benchmark MM-Navigator on our collected iOS screen dataset. According to human assessments, the system exhibits a 91% accuracy rate in generating reasonable action descriptions and a 75% accuracy rate in executing the correct actions for single-step instructions on iOS. Additionally, we evaluate the model on a subset of an Android screen navigation dataset, where it outperforms previous GUI navigators in a zero-shot fashion. Our benchmark and detailed analyses aim to lay robust groundwork for future research on the GUI navigation task. The project page is at https://github.com/zzxslp/MM-Navigator.
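The abstract gives no implementation details, but the core loop it describes — send a screenshot plus a natural-language instruction to a GPT-4V-style model and receive the next action — can be sketched as below. This is a minimal illustration under assumed choices, not the paper's actual prompts or code: the model name, prompt wording, and helper functions are hypothetical stand-ins.

```python
import base64
from openai import OpenAI  # official openai Python package (v1+ API)

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def encode_screenshot(path: str) -> str:
    """Base64-encode a screenshot so it can be sent inline to the API."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


def next_action(screenshot_path: str, instruction: str) -> str:
    """Ask a GPT-4V-style model for the next GUI action.

    The prompt format here is hypothetical, not the paper's exact one:
    the model is asked to reply with a single action description such
    as 'tap Settings'.
    """
    image_b64 = encode_screenshot(screenshot_path)
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed stand-in for GPT-4V
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Instruction: {instruction}\n"
                         "Given this smartphone screen, describe the single "
                         "next action to take (e.g., 'tap Settings')."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content


# Example: one step of the navigation loop.
# action = next_action("screen.png", "Turn on Wi-Fi")
# print(action)  # e.g., "tap Settings"
```

Note that the paper's localization step — grounding the described action to exact on-screen coordinates — is omitted from this sketch; the snippet only covers action description.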
