SE GUI Enhancing Visual Grounding for GUI Agents via Self Evolutionary Reinforcement Learning

Jun-22-2026, 08:21:37 GMT–Neural Information Processing Systems

Graphical User Interface (GUI) agents have made substantial strides in understanding and executing user instructions across diverse platforms. Yet, grounding these instructions to precise interface elements remains challenging--especially in complex, high-resolution, professional environments. Traditional supervised fine-tuning (SFT) methods often require large volumes of diverse data and exhibit weak generalization. To overcome these limitations, we introduce a reinforcement learning (RL)-based framework that incorporates three core strategies: (1) seed data curation to ensure high-quality training samples, (2) a dense policy gradient that provides continuous feedback based on prediction accuracy, and (3) a self-evolutionary reinforcement finetuning mechanism that iteratively refines the model using attention maps. With only 3k training samples, our 7B-parameter model achieves state-of-the-art results among similarly sized models on three grounding benchmarks.

arxiv preprint arxiv, large language model, machine learning, (19 more...)

Neural Information Processing Systems

Jun-22-2026, 08:21:37 GMT

Conferences PDF

Add feedback

Genre:
- Research Report > Experimental Study (1.00)

Technology:
- Information Technology
  - Graphics (1.00)
  - Artificial Intelligence
    - Vision (1.00)
    - Representation & Reasoning (1.00)
    - Natural Language > Large Language Model (0.95)
    - Machine Learning
      - Reinforcement Learning (0.84)
      - Neural Networks (0.68)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found