ScreenLLM: Stateful Screen Schema for Efficient Action Understanding and Prediction

Jin, Yiqiao; Petrangeli, Stefano; Shen, Yu; Wu, Gang

Mar-26-2025–arXiv.org Artificial Intelligence

Graphical User Interface (GUI) agents are autonomous systems that interpret and generate actions, enabling intelligent user assistance and automation. Effective training of these agents presents unique challenges, such as sparsity in supervision signals, scalability for large datasets, and the need for nuanced user understanding. We propose the stateful screen schema, an efficient representation of GUI interactions that captures key user actions and intentions over time. Building on this foundation, we introduce ScreenLLM, a set of multimodal large language models (MLLMs) tailored for advanced UI understanding and action prediction. Extensive experiments on both open-source and proprietary models show that ScreenLLM accurately models user behavior and predicts actions. Our work lays the foundation for scalable, robust, and intelligent GUI agents that enhance user interaction in diverse software environments.

large language model, machine learning, natural language

Country:
- Oceania > Australia
  - New South Wales > Sydney (0.05)
- North America > United States
  - New York > New York County
    - New York City (0.04)
  - Georgia > Fulton County
    - Atlanta (0.04)
  - California > Santa Clara County
    - San Jose (0.05)

Genre:
- Research Report > New Finding (0.46)

Industry:
- Education > Educational Technology (0.68)

Technology:
- Information Technology
  - Human Computer Interaction (1.00)
  - Graphics (1.00)
  - Artificial Intelligence
    - Representation & Reasoning (1.00)
    - Natural Language > Large Language Model (1.00)
    - Vision (0.95)
    - Machine Learning > Neural Networks
      - Deep Learning (0.47)
