MEGA-GUI: Multi-stage Enhanced Grounding Agents for GUI Elements
SeokJoo Kwak, Jihoon Kim, Boyoun Kim, Jung Jae Yoon, Wooseok Jang, Jeonghoon Hong, Jaeho Yang, Yeong-Dae Kwon
arXiv.org Artificial Intelligence
Graphical User Interface (GUI) grounding - the task of mapping natural language instructions to screen coordinates - is essential for autonomous agents and accessibility technologies. Existing systems rely on monolithic models or one-shot pipelines that lack modularity and fail under visual clutter and ambiguous instructions. We introduce MEGA-GUI, a multi-stage framework that separates grounding into coarse Region-of-Interest (ROI) selection and fine-grained element grounding, orchestrated by specialized vision-language agents. MEGA-GUI features a bidirectional ROI zoom algorithm that mitigates spatial dilution and a context-aware rewriting agent that reduces semantic ambiguity. Our analysis reveals complementary strengths and weaknesses across vision-language models at different visual scales, and we show that leveraging this modular structure achieves consistently higher accuracy than monolithic approaches. On the visually dense ScreenSpot-Pro benchmark, MEGA-GUI attains 73.18% accuracy, and on the semantically complex OSWorld-G benchmark it reaches 68.63%, surpassing previously reported results. Code and the Grounding Benchmark Toolkit (GBT) are available at https://github.com/samsungsds-research-papers/mega-gui.
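The two-stage structure described above (coarse ROI selection followed by fine-grained element grounding) can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the agent functions are stand-ins for the specialized vision-language agents, and the center-zoom heuristic is an assumption standing in for the bidirectional ROI zoom algorithm.

```python
# Illustrative sketch of a coarse-to-fine GUI grounding pipeline.
# All names and heuristics here are hypothetical stand-ins for the
# VLM agents described in the abstract.

from dataclasses import dataclass


@dataclass
class Box:
    left: int
    top: int
    right: int
    bottom: int

    def zoom(self, scale: float) -> "Box":
        # Shrink (scale < 1) or expand (scale > 1) around the center,
        # mimicking one step of a bidirectional zoom.
        cx = (self.left + self.right) / 2
        cy = (self.top + self.bottom) / 2
        hw = (self.right - self.left) * scale / 2
        hh = (self.bottom - self.top) * scale / 2
        return Box(int(cx - hw), int(cy - hh), int(cx + hw), int(cy + hh))


def select_roi(screen: Box, instruction: str) -> Box:
    # Stage 1 (stand-in): a VLM agent would pick the region likely to
    # contain the target element; here we simply zoom toward the center.
    return screen.zoom(0.5)


def ground_element(roi: Box, instruction: str) -> tuple[int, int]:
    # Stage 2 (stand-in): a VLM agent would localize the element inside
    # the ROI; here we return the ROI center as the click coordinate.
    return (roi.left + roi.right) // 2, (roi.top + roi.bottom) // 2


def grounding_pipeline(screen: Box, instruction: str) -> tuple[int, int]:
    # Decoupling the stages lets each agent operate at the visual scale
    # where it is strongest, which is the premise of the modular design.
    roi = select_roi(screen, instruction)
    return ground_element(roi, instruction)
```

For a 1920x1080 screen, `grounding_pipeline(Box(0, 0, 1920, 1080), "click Save")` returns the center point `(960, 540)` under these placeholder heuristics.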
Nov-18-2025
- Genre:
- Research Report > New Finding (1.00)
- Technology:
- Information Technology
- Artificial Intelligence
- Cognitive Science > Problem Solving (1.00)
- Machine Learning
- Neural Networks > Deep Learning (0.48)
- Performance Analysis > Accuracy (0.46)
- Natural Language > Large Language Model (1.00)
- Representation & Reasoning > Agents (1.00)
- Graphics (1.00)
- Human Computer Interaction (1.00)