UITron-Speech: Towards Automated GUI Agents Based on Speech Instructions
Han, Wenkang, Zeng, Zhixiong, Huang, Jing, Jiang, Shu, Zheng, Liming, Yang, Longrong, Qiu, Haibo, Yao, Chang, Chen, Jingyuan, Ma, Lin
arXiv.org Artificial Intelligence
Autonomous agents for Graphical User Interfaces (GUIs) are revolutionizing human-computer interaction, yet their reliance on text-based instructions limits accessibility and convenience, particularly in hands-free scenarios. To address this issue, we propose replacing text with speech as the instruction input modality for GUI agents, and introduce UITron-Speech, the first end-to-end GUI agent that directly processes speech instructions and on-device screenshots to predict user actions. To address data scarcity, we synthesize high-quality speech instruction datasets using a random-speaker text-to-speech model. We also design a mixed-modality training strategy to mitigate the inherent modality imbalance in pre-trained foundation models. Furthermore, we conduct a statistical analysis of the distribution of GUI grounding prediction errors and propose a training-free two-step grounding refinement method to alleviate minor localization deviations. Extensive experiments on multiple benchmarks demonstrate that UITron-Speech achieves robust performance and superior adaptability, underscoring the feasibility and potential of speech-driven GUI agents for more accessible and intelligent human-computer interaction. Our code and datasets are available at https://github.com/UITron-hub/UITron-Speech.
Nov-27-2025
- Country:
  - Asia > China > Heilongjiang Province > Harbin (0.04)
- Genre:
  - Research Report (1.00)
- Technology:
  - Information Technology
    - Artificial Intelligence
      - Machine Learning (1.00)
      - Natural Language > Large Language Model (0.95)
      - Representation & Reasoning > Agents (0.70)
      - Speech (0.87)
    - Graphics (1.00)
    - Human Computer Interaction > Interfaces (1.00)