VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?

Junpeng Liu, Yifan Song, Bill Yuchen Lin, Wai Lam, Graham Neubig, Yuanzhi Li, Xiang Yue

arXiv.org (Artificial Intelligence)

Multimodal Large Language Models (MLLMs) have shown promise in web-related tasks, but evaluating their performance in the web domain remains a challenge due to the lack of comprehensive benchmarks. Existing benchmarks are either designed for general multimodal tasks, failing to capture the unique characteristics of web pages, or focused on end-to-end web agent tasks, leaving fine-grained abilities such as OCR, understanding, and grounding unmeasured. In this paper, we introduce VisualWebBench, a multimodal benchmark designed to assess the capabilities of MLLMs across a variety of web tasks. VisualWebBench consists of seven tasks and comprises 1.5K human-curated instances from 139 real websites covering 87 sub-domains. We evaluate 14 open-source MLLMs, Gemini Pro, the Claude-3 series, and GPT-4V(ision) on VisualWebBench, revealing significant challenges and performance gaps. Further analysis highlights the limitations of current MLLMs, including inadequate grounding in text-rich environments and subpar performance with low-resolution image inputs. We believe VisualWebBench will serve as a valuable resource for the research community and contribute to the creation of more powerful and versatile MLLMs for web-related applications.
