Lightweight Neural App Control
Filippos Christianos, Georgios Papoudakis, Thomas Coste, Jianye Hao, Jun Wang, Kun Shao
This paper introduces a novel mobile phone control architecture, termed "app agents", for efficient interaction and control across various Android apps. The proposed Lightweight Multi-modal App Control (LiMAC) takes as input a textual goal and a sequence of past mobile observations, such as screenshots and corresponding UI trees, to generate precise actions. To address the computational constraints inherent to smartphones, LiMAC introduces a small Action Transformer (AcT) integrated with a fine-tuned vision-language model (VLM) for real-time decision-making and task execution. We evaluate LiMAC on two open-source mobile control datasets, demonstrating the superior performance of our small-form-factor approach against fine-tuned versions of open-source VLMs, such as Florence2 and Qwen2-VL. It also significantly outperforms prompt-engineering baselines utilising closed-source foundation models like GPT-4o. More specifically, LiMAC increases overall action accuracy by up to 19% compared to fine-tuned VLMs, and by up to 42% compared to prompt-engineering baselines.

Smartphone application agents, commonly known as app agents, are expanding the potential applications of artificial intelligence to smartphones and other mobile devices. Such agents could allow users to accomplish a range of tasks, from scheduling appointments and sending messages to purchasing items and booking flights, with minimal effort. Fundamentally, app agents observe user instructions and progressively interact with the smartphone's user interface (by clicking, scrolling, inputting text, etc.) to accomplish the task. However, due to the limited computational resources of smartphones, these agents must be optimised for efficiency, employing lightweight models with minimal memory usage and fast processing speeds. Recent advancements have leveraged foundation models to develop app agents that understand natural language instructions and execute complex user commands within the smartphone's interface (e.g., Rawles et al., 2024; Bai et al., 2024; Wang et al., 2024a;b).
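To make the division of labour concrete, the sketch below illustrates one plausible inference step under this two-module design: the lightweight AcT proposes an action, and the heavier fine-tuned VLM is invoked only when the action requires generated text. The class names, action schema, and routing rule are illustrative assumptions rather than the paper's exact interface.

```python
from dataclasses import dataclass

@dataclass
class Observation:
    screenshot: bytes   # raw screen capture
    ui_tree: str        # serialised UI hierarchy for the current screen

@dataclass
class Action:
    kind: str           # e.g. "click", "scroll", "input-text", "open-app"
    target: int | None  # index of the chosen UI element, if any
    text: str | None    # free-form text, produced by the VLM when needed

# Assumption: only a minority of action types carry natural-language content.
TEXT_ACTIONS = {"input-text", "open-app"}

def step(goal: str, history: list[Observation], act_model, vlm) -> Action:
    """One LiMAC-style decision step (hypothetical interface).

    The small Action Transformer (AcT) runs on every step; the fine-tuned
    VLM is called only for text-bearing actions, keeping typical latency low.
    """
    # Lightweight on-device prediction of the action type and UI target.
    kind, target = act_model.predict(goal, history)
    text = None
    if kind in TEXT_ACTIONS:
        # Expensive VLM call, reserved for actions that need generated text.
        text = vlm.generate_text(goal, history[-1], kind, target)
    return Action(kind=kind, target=target, text=text)
```

Under this (assumed) routing, the per-step cost is dominated by the small transformer, which matches the paper's stated goal of fitting app-agent inference within smartphone compute budgets.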
arXiv.org Artificial Intelligence
Oct-23-2024