Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms

Zhangheng Li, Keen You, Haotian Zhang, Di Feng, Harsh Agrawal, Xiujun Li, Mohana Prasad Sathya Moorthy, Jeff Nichols, Yinfei Yang, Zhe Gan

Oct-24-2024, arXiv.org Artificial Intelligence

Building a generalist model for user interface (UI) understanding is challenging because of several foundational issues: platform diversity, resolution variation, and limited data. In this paper, we introduce Ferret-UI 2, a multimodal large language model (MLLM) designed for universal UI understanding across a wide range of platforms, including iPhone, Android, iPad, Webpage, and AppleTV. Building on the foundation of Ferret-UI, Ferret-UI 2 introduces three key innovations: support for multiple platform types, high-resolution perception through adaptive scaling, and advanced task training data generation powered by GPT-4o with set-of-mark visual prompting. These advancements enable Ferret-UI 2 to perform complex, user-centered interactions, making it versatile and adaptable across the expanding diversity of platform ecosystems. Extensive empirical experiments on referring, grounding, user-centric advanced tasks (comprising 9 subtasks $\times$ 5 platforms), the GUIDE next-action prediction dataset, and the GUI-World multi-platform benchmark demonstrate that Ferret-UI 2 significantly outperforms Ferret-UI and shows strong cross-platform transfer capabilities.

large language model, machine learning, natural language

Country:
- North America > United States (0.28)

Genre:
- Research Report (0.82)

Technology:
- Information Technology
  - Artificial Intelligence
    - Machine Learning > Neural Networks
      - Deep Learning (0.69)
    - Natural Language
      - Chatbot (0.91)
      - Large Language Model (1.00)
  - Communications > Mobile (1.00)
  - Human Computer Interaction (1.00)