WAFFLE: Multi-Modal Model for Automated Front-End Development
Liang, Shanchao, Jiang, Nan, Qian, Shangshu, Tan, Lin
Web development involves turning UI designs into functional webpages, which can be difficult for both beginners and experienced developers due to the complexity of HTML's hierarchical structures and styles. While Large Language Models (LLMs) have shown promise in generating source code, two major challenges persist in UI-to-HTML code generation: (1) effectively representing HTML's hierarchical structure for LLMs, and (2) bridging the gap between the visual nature of UI designs and the text-based format of HTML code. To tackle these challenges, we introduce Waffle, a new fine-tuning strategy that uses a structure-aware attention mechanism to improve LLMs' understanding of HTML's structure and a contrastive fine-tuning approach to align LLMs' understanding of UI images and HTML code. Models fine-tuned with Waffle show up to 9.00 pp (percentage point) higher HTML match, 0.0982 higher CW-SSIM, 32.99 higher CLIP, and 27.12 pp higher LLEM on our new benchmark WebSight-Test and an existing benchmark Design2Code, outperforming current fine-tuning methods.
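The contrastive fine-tuning idea above can be illustrated with a symmetric InfoNCE-style loss that pulls each UI-image embedding toward the embedding of its matching HTML code and pushes it away from the other pairs in the batch. This is a minimal numpy sketch of that general technique, not WAFFLE's actual implementation; the function name and temperature value are illustrative.

```python
import numpy as np

def info_nce_loss(img_emb, html_emb, temperature=0.07):
    """Symmetric InfoNCE loss aligning UI-image and HTML embeddings.

    img_emb, html_emb: (batch, dim) arrays; row i of each is a matched pair.
    """
    # L2-normalize so dot products are cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    html = html_emb / np.linalg.norm(html_emb, axis=1, keepdims=True)
    logits = img @ html.T / temperature      # (batch, batch) similarity matrix
    labels = np.arange(len(logits))          # matched pair sits on the diagonal

    def xent(l):
        # cross-entropy with the diagonal as the target class
        l = l - l.max(axis=1, keepdims=True)             # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average of image->HTML and HTML->image directions
    return (xent(logits) + xent(logits.T)) / 2
```

Minimizing this loss drives matched image/HTML pairs to have higher similarity than mismatched ones, which is the alignment objective the abstract describes.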
PeriGuru: A Peripheral Robotic Mobile App Operation Assistant based on GUI Image Understanding and Prompting with LLM
Fu, Kelin, Tian, Yang, Bian, Kaigui
Smartphones have significantly enhanced our daily learning, communication, and entertainment, becoming an essential component of modern life. However, certain populations, including the elderly and individuals with disabilities, encounter challenges in using smartphones, necessitating mobile app operation assistants, a.k.a. mobile app agents. With considerations for privacy, permissions, and cross-platform compatibility issues, we devise and develop PeriGuru in this work, a peripheral robotic mobile app operation assistant based on GUI image understanding and prompting with a Large Language Model (LLM). PeriGuru leverages a suite of computer vision techniques to analyze GUI screenshot images and employs an LLM to inform action decisions, which are then executed by robotic arms. PeriGuru achieves a success rate of 81.94% on the test task set, more than double that of the same method without PeriGuru's GUI image interpretation and prompting design. Our code is available at https://github.com/Z2sJ4t/PeriGuru.
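The screenshot-to-action pipeline the abstract describes can be sketched in three steps: detect labeled elements on the screen, serialize them into an LLM prompt, and map the LLM's choice back to tap coordinates for the robotic arm. The class and function names below are hypothetical, and the element detection itself (OCR plus vision) is assumed to have already run.

```python
from dataclasses import dataclass

@dataclass
class UIElement:
    label: str   # text recognized on/near the element (e.g. by OCR)
    x: int       # element center on screen, in pixels
    y: int

def build_prompt(task, elements):
    """Serialize detected GUI elements into a numbered list for the LLM."""
    lines = [f"Task: {task}", "Visible elements:"]
    for i, e in enumerate(elements):
        lines.append(f"  [{i}] {e.label} at ({e.x}, {e.y})")
    lines.append("Reply with the index of the element to tap.")
    return "\n".join(lines)

def decide_tap(llm_reply, elements):
    """Map the LLM's chosen index back to screen coordinates for the arm."""
    idx = int(llm_reply.strip())
    e = elements[idx]
    return (e.x, e.y)
```

A physical tap at the returned coordinates is what distinguishes a peripheral assistant like this from a software-only agent: no root access or accessibility permissions on the phone are required.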
Model Of Information System Towards Harmonized Industry And Computer Science
Faith, Edafetanure-Ibeh, Tamarauefiye, Evah Patrick, Uyi, Mark Uwuoruya
The aim of attending an educational institution is learning, which is sought for independence of thought and ideology as well as physical and material independence. This physical and material independence comes from working in industry, that is, from joining the country's independent working population. There needs to be a way for students, upon graduation, to adapt easily to the real world with the necessary skills and knowledge. This has been a challenge in some computer science departments, whose after-effects become apparent only once the student begins to work in industry. The objectives of this project are to design, develop, and evaluate a web-based chat application connecting industry and the computer science department. The waterfall system development life cycle is used to establish the project plan, because it gives an overall list of the processes and sub-processes required to develop a system. The descriptive research method applied in this project is documentary analysis of previous articles. The result of the project is the design, development, and evaluation of a web-based chat application that aids communication between industry and the computer science department. The application is able to store this information for later use. Future work includes raising awareness of the software among companies and universities, implementing the industry's suggestions in the computer science curriculum, and using this software in universities across Nigeria and in fields of study beyond computer science.
Synapse: Trajectory-as-Exemplar Prompting with Memory for Computer Control
Zheng, Longtao, Wang, Rundong, Wang, Xinrun, An, Bo
Building agents with large language models (LLMs) for computer control is a burgeoning research area, where the agent receives computer states and performs actions to complete complex tasks. Previous computer agents have demonstrated the benefits of in-context learning (ICL); however, their performance is hindered by several issues. First, the limited context length of LLMs and complex computer states restrict the number of exemplars, as a single webpage can consume the entire context. Second, the exemplars in current methods, such as high-level plans and multi-choice questions, cannot represent complete trajectories, leading to suboptimal performance in long-horizon tasks. Third, existing computer agents rely on task-specific exemplars and overlook the similarity among tasks, resulting in poor generalization to novel tasks. To address these challenges, we introduce Synapse, a computer agent featuring three key components: i) state abstraction, which filters out task-irrelevant information from raw states, allowing more exemplars within the limited context, ii) trajectory-as-exemplar prompting, which prompts the LLM with complete trajectories of the abstracted states and actions to improve multi-step decision-making, and iii) exemplar memory, which stores the embeddings of exemplars and retrieves them via similarity search for generalization to novel tasks. We evaluate Synapse on MiniWoB++, a standard task suite, and Mind2Web, a real-world website benchmark. In MiniWoB++, Synapse achieves a 99.2% average success rate (a 10% relative improvement) across 64 tasks using demonstrations from only 48 tasks. Notably, Synapse is the first ICL method to solve the book-flight task in MiniWoB++. Synapse also exhibits a 56% relative improvement in average step success rate over the previous state-of-the-art prompting scheme in Mind2Web.
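The exemplar-memory component above stores embeddings of past trajectories and retrieves the most similar ones for a new task. A minimal sketch of that retrieval step, using cosine similarity over stored vectors (class and method names are illustrative, and real systems would use a learned embedding model and an approximate-nearest-neighbor index):

```python
import numpy as np

class ExemplarMemory:
    """Toy exemplar memory: stores trajectory embeddings, retrieves by cosine similarity."""
    def __init__(self):
        self.embeddings = []     # one vector per stored exemplar
        self.trajectories = []   # the payloads (abstracted state/action sequences)

    def add(self, embedding, trajectory):
        v = np.asarray(embedding, dtype=float)
        self.embeddings.append(v / np.linalg.norm(v))   # store unit vectors
        self.trajectories.append(trajectory)

    def retrieve(self, query_embedding, k=1):
        q = np.asarray(query_embedding, dtype=float)
        q = q / np.linalg.norm(q)
        sims = np.stack(self.embeddings) @ q            # cosine similarities
        top = np.argsort(-sims)[:k]                     # indices of the k best matches
        return [self.trajectories[i] for i in top]
```

Retrieved trajectories are then placed in the LLM's context as complete state/action exemplars, which is what lets the agent generalize to tasks it has no task-specific demonstrations for.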
Multimodal Web Navigation with Instruction-Finetuned Foundation Models
Furuta, Hiroki, Lee, Kuang-Huei, Nachum, Ofir, Matsuo, Yutaka, Faust, Aleksandra, Gu, Shixiang Shane, Gur, Izzeddin
The progress of autonomous web navigation has been hindered by the dependence on billions of exploratory interactions via online reinforcement learning, and domain-specific model designs that make it difficult to leverage generalization from rich out-of-domain data. In this work, we study data-driven offline training for web agents with vision-language foundation models. We propose an instruction-following multimodal agent, WebGUM, that observes both webpage screenshots and HTML pages and outputs web navigation actions, such as click and type. WebGUM is trained by jointly finetuning an instruction-finetuned language model and a vision encoder with temporal and local perception on a large corpus of demonstrations. We empirically demonstrate this recipe improves the agent's ability of grounded multimodal perception, HTML comprehension, and multi-step reasoning, outperforming prior works by a significant margin. On the MiniWoB, we improve over the previous best offline methods by more than 45.8%, even outperforming online-finetuned SoTA, humans, and GPT-4-based agent. On the WebShop benchmark, our 3-billion-parameter model achieves superior performance to the existing SoTA, PaLM-540B. Furthermore, WebGUM exhibits strong positive transfer to the real-world planning tasks on the Mind2Web. We also collect 347K high-quality demonstrations using our trained models, 38 times larger than prior work, and make them available to promote future research in this direction.

Web navigation is a class of sequential decision making problems where agents interact with web interfaces following user instructions (Shi et al., 2017; Liu et al., 2018; Gur et al., 2019). Common web navigation tasks include, for example, form filling (Diaz et al., 2013), information retrieval (Nogueira & Cho, 2016; Adolphs et al., 2022), or sending emails via a sequence of interactions with computer interface such as click or type (Figure 1).
Recently, there has been a growing interest in developing agents to automate these actions and free humans from repetitive interactions (Mazumder & Riva, 2020; Li et al., 2020; Shvo et al., 2021). Most prior works studied web navigation problems as online RL to learn the optimal action distribution with task-specific models from scratch (Liu et al., 2018; Gur et al., 2019; Jia et al., 2019; Humphreys et al., 2022).
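Agents like WebGUM emit actions such as click and type as text, which the environment must parse before executing. A small sketch of such a parser for a hypothetical action grammar (`click [element]`, `type [element] "value"`); the exact format WebGUM uses may differ:

```python
import re

# Hypothetical text-action grammar: 'click [submit-btn]' or 'type [q] "cheap flights"'.
ACTION_RE = re.compile(r'^(click|type)\s+\[([^\]]+)\](?:\s+"([^"]*)")?$')

def parse_action(text):
    """Parse one text action into a structured dict, rejecting malformed output."""
    m = ACTION_RE.match(text.strip())
    if not m:
        raise ValueError(f"unrecognized action: {text!r}")
    verb, element, value = m.groups()
    if verb == "type" and value is None:
        raise ValueError("type action needs a quoted value")
    return {"verb": verb, "element": element, "value": value}
```

Validating model output at this boundary matters in practice: a language model occasionally produces free-form text, and the environment needs to fail loudly rather than execute a garbled command.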
Mobile-Env: An Evaluation Platform and Benchmark for Interactive Agents in LLM Era
Zhang, Danyang, Chen, Lu, Zhao, Zihan, Cao, Ruisheng, Yu, Kai
Diverse evaluation benchmarks play a crucial role in assessing a wide range of capabilities of large language models (LLMs). Although plenty of effort has been dedicated to building valuable benchmarks, there is still little work on evaluating LLMs in multistep interactive environments. Noting that an LLM requires a text representation of the environment's observations to interact with it, we fill this gap by building a novel benchmark based on the information user interface (InfoUI). An InfoUI consists of rich text content and can be represented in several text formats, making it suitable for assessing the interaction ability of LLMs. Additionally, the complex structure of an InfoUI further challenges an LLM to understand structured text rather than plain text. An interaction platform is needed to evaluate an agent, yet there is still no satisfactory interaction platform dedicated to the InfoUI. Consequently, we build a novel, easily extendable, adaptable, and close-to-reality interaction platform, Mobile-Env, as a base for an appropriate benchmark. On top of Mobile-Env, an InfoUI task set, WikiHow, is then built to establish a benchmark for the multistep interaction capability of LLMs in structured text-based environments. Agents based on a series of LLMs are tested on the task set to gain insight into the potential and challenges of LLMs for InfoUI interaction. We sincerely welcome the community to contribute new environments and new task sets for Mobile-Env to provide better test benchmarks and facilitate the development of the corresponding domains.
Enabling Conversational Interaction with Mobile UI using Large Language Models
Wang, Bryan, Li, Gang, Li, Yang
Conversational agents show the promise to allow users to interact with mobile devices using language. However, to perform diverse UI tasks with natural language, developers typically need to create separate datasets and models for each specific task, which is expensive and effort-consuming. Recently, pre-trained large language models (LLMs) have been shown capable of generalizing to various downstream tasks when prompted with a handful of examples from the target task. This paper investigates the feasibility of enabling versatile conversational interactions with mobile UIs using a single LLM. We designed prompting techniques to adapt an LLM to mobile UIs. We experimented with four important modeling tasks that address various scenarios in conversational interaction. Our method achieved competitive performance on these challenging tasks without requiring dedicated datasets and training, offering a lightweight and generalizable approach to enable language-based mobile interaction.
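Adapting an LLM to a mobile UI hinges on turning the screen's view hierarchy into text the model can read. A minimal sketch of that serialization step, flattening a (hypothetical) UI tree into indented lines for a prompt; real view hierarchies carry many more attributes (bounds, clickability, resource IDs) that a production serializer would include:

```python
def serialize_ui(node, depth=0):
    """Flatten a mobile UI tree (nested dicts) into indented text lines."""
    kind = node.get("class", "view")
    text = node.get("text", "")
    line = "  " * depth + f"<{kind}>" + (f" {text}" if text else "")
    lines = [line]
    for child in node.get("children", []):
        lines.extend(serialize_ui(child, depth + 1))  # recurse, one indent deeper
    return lines

# Toy screen: a button and a text input under one root.
screen = {
    "class": "screen",
    "children": [
        {"class": "button", "text": "Send"},
        {"class": "input", "text": "Message"},
    ],
}
```

The resulting lines, plus a handful of few-shot examples, form the prompt; this is what lets a single general-purpose LLM handle multiple UI tasks without task-specific datasets or training.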