Harnessing Webpage UIs for Text-Rich Visual Understanding
Junpeng Liu, Tianyue Ou, Yifan Song, Yuxiao Qu, Wai Lam, Chenyan Xiong, Wenhu Chen, Graham Neubig, Xiang Yue
Text-rich visual understanding, the ability to process environments where dense textual content is integrated with visuals, is crucial for multimodal large language models (MLLMs) to interact effectively with structured environments. To enhance this capability, we propose synthesizing general multimodal instructions from webpage UIs using text-based large language models (LLMs). Despite lacking direct visual input, text-based LLMs can process structured text representations from webpage accessibility trees. These instructions are then paired with UI screenshots to train multimodal models. We introduce MultiUI, a dataset containing 7.3 million samples from 1 million websites, covering diverse multimodal tasks and UI layouts. Models trained on MultiUI not only excel in web UI tasks (achieving up to a 48% improvement on VisualWebBench and a 19.1% boost in element accuracy on the web agent dataset Mind2Web) but also generalize surprisingly well to non-web UI tasks and even to non-UI domains such as document understanding, OCR, and chart interpretation. These results highlight the broad applicability of web UI data for advancing text-rich visual understanding across various scenarios.
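The abstract describes a three-step pipeline: extract a structured text view of a page (its accessibility tree), have a text-only LLM write instruction data from that tree alone, then pair the resulting instructions with a screenshot of the same page for multimodal training. Below is a minimal sketch of one such loop, assuming Playwright for the accessibility tree and screenshot and an OpenAI-compatible chat API for the text model; the prompt wording, model name, and JSON record format are illustrative assumptions, not the authors' actual MultiUI pipeline.

```python
# Hypothetical sketch of the instruction-synthesis loop described above.
# The text-based LLM never sees pixels: it reads only the accessibility
# tree, and the generated Q/A pair is later joined with the screenshot.
import json

from openai import OpenAI
from playwright.sync_api import sync_playwright

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def synthesize_sample(url: str, out_png: str) -> dict:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        # Structured text representation of the UI (roles, names, values).
        axtree = page.accessibility.snapshot()
        page.screenshot(path=out_png, full_page=True)
        browser.close()

    # Ask the text-only LLM to write an instruction/answer pair from
    # the tree alone. Prompt and model choice are placeholders.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},  # force parseable JSON
        messages=[{
            "role": "user",
            "content": (
                "Given this webpage accessibility tree, write one question "
                "a user might ask about the page and its answer, as JSON "
                '{"question": ..., "answer": ...}.\n'
                + json.dumps(axtree)[:8000]  # truncate very large trees
            ),
        }],
    )
    qa = json.loads(resp.choices[0].message.content)

    # Pair the text-derived instruction with the screenshot so a
    # multimodal model can be trained on (image, question, answer).
    return {"image": out_png, **qa}
```

Scaled over roughly one million websites with many task templates, a loop of this shape would yield a MultiUI-style corpus; the key design point is that instruction generation needs only a cheap text model, while the visual grounding comes free from the paired screenshot.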
arXiv.org Artificial Intelligence
Nov-6-2024