WEPO: Web Element Preference Optimization for LLM-based Web Navigation

Liu, Jiarun; Hao, Jia; Zhang, Chunhong; Hu, Zheng

arXiv.org Artificial Intelligence 

The field of autonomous web navigation has advanced significantly, driven by the capabilities of Large Language Models (LLMs) in both mobile and webpage interactions [Wang et al., 2024a, Mialon et al., 2023, Xi et al., 2023]. Preliminary attempts, such as the ChatGPT Plugin [OpenAI, 2023], have also begun to build practical applications of web knowledge-based chatbots. Web navigation can be described as a process in which agents perform specific tasks on behalf of human users within a web environment: interpreting high-level user instructions, decomposing them into basic operations, and interacting dynamically with complex web pages. To achieve this, agents must understand intricate web scenarios, adapt to dynamic changes such as noisy text and evolving HTML structures, and generalize successful operations to unseen tasks, thus freeing humans from repetitive interactions with computer interfaces. Traditional web agents trained through reinforcement learning [Shi et al., 2017, Yao et al., 2022] often mimic human behavior using predefined actions such as typing, searching, and navigating to a specific page.