META-GUI: Towards Multi-modal Conversational Agents on Mobile GUI
Sun, Liangtai, Chen, Xingyu, Chen, Lu, Dai, Tianle, Zhu, Zichen, Yu, Kai
arXiv.org Artificial Intelligence
Task-oriented dialogue (TOD) systems are widely used by mobile phone intelligent assistants to accomplish tasks such as calendar scheduling or hotel reservation. Current TOD systems usually focus on multi-turn text or speech interaction and then call back-end APIs designed for TODs to perform the task. However, this API-based architecture greatly limits the information-searching capability of intelligent assistants and may even lead to task failure if TOD-specific APIs are not available or the task is too complicated to be executed by the provided APIs. In this paper, we propose a new TOD architecture: the GUI-based task-oriented dialogue system (GUI-TOD). A GUI-TOD system can directly perform GUI operations on real apps and execute tasks without invoking TOD-specific back-end APIs. Furthermore, we release META-GUI, a dataset for training a Multi-modal convErsaTional Agent on mobile GUI. We also propose a multi-modal action prediction and response model, which shows promising results on META-GUI. The dataset, code, and leaderboard are publicly available.
Nov-24-2022
- Genre:
- Research Report (0.64)
- Technology: