WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models

He, Hongliang; Yao, Wenlin; Ma, Kaixin; Yu, Wenhao; Dai, Yong; Zhang, Hongming; Lan, Zhenzhong; Yu, Dong

Jan-28-2024, arXiv.org Artificial Intelligence

The advancement of large language models (LLMs) has led to a new era of autonomous real-world applications, which in turn drives innovation in advanced web-based agents. Existing web agents typically handle only one input modality and are evaluated only in simplified web simulators or on static web snapshots, which greatly limits their applicability in real-world scenarios. To bridge this gap, we introduce WebVoyager, a web agent powered by a Large Multimodal Model (LMM) that can complete user instructions end-to-end by interacting with real-world websites. We also propose a new evaluation protocol for web agents that addresses the challenges of automatically evaluating open-ended web agent tasks by leveraging the multimodal comprehension capabilities of GPT-4V. To evaluate our agents, we create a new benchmark by gathering real-world tasks from 15 widely used websites. WebVoyager achieves a 55.7% task success rate, significantly surpassing both GPT-4 (All Tools) and the WebVoyager (text-only) setup, which underscores its capability in practical applications. Our proposed automatic evaluation achieves 85.3% agreement with human judgment, paving the way for further development of web agents in real-world settings.
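The two headline numbers in the abstract are a task success rate and an agreement rate between automatic and human judgments. As a minimal sketch of how such figures are commonly computed, assuming per-task binary success labels (the function names and the toy labels below are hypothetical and are not taken from the paper):

```python
from typing import Sequence


def success_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of tasks judged successful."""
    if not outcomes:
        raise ValueError("outcomes must not be empty")
    return sum(outcomes) / len(outcomes)


def agreement_rate(auto: Sequence[bool], human: Sequence[bool]) -> float:
    """Fraction of tasks where the automatic and human verdicts match."""
    if len(auto) != len(human):
        raise ValueError("label sequences must have the same length")
    if not auto:
        raise ValueError("label sequences must not be empty")
    matches = sum(a == h for a, h in zip(auto, human))
    return matches / len(auto)


# Toy labels for illustration only; these are not results from the paper.
human_labels = [True, True, False, True, False]
auto_labels = [True, False, False, True, False]

print(success_rate(human_labels))                 # 0.6
print(agreement_rate(auto_labels, human_labels))  # 0.8
```

Simple percent agreement does not correct for agreement that would occur by chance, so it is only one of several ways to compare an automatic evaluator against human judgment.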

large language model, machine learning, natural language

Country:
- North America > United States (0.28)

Genre:
- Research Report (1.00)
- Workflow (0.78)

Industry:
- Leisure & Entertainment > Sports (0.68)

Technology:
- Information Technology
  - Artificial Intelligence
    - Machine Learning > Neural Networks
      - Deep Learning (0.50)
    - Natural Language > Large Language Model (1.00)
  - Communications > Web (1.00)