VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search

Jia, Yiming, Li, Jiachen, Yue, Xiang, Li, Bo, Nie, Ping, Zou, Kai, Chen, Wenhu

Mar-14-2025–arXiv.org Artificial Intelligence

Vision-Language Models have made significant progress on many perception-focused tasks. However, their progress on reasoning-focused tasks remains limited due to the lack of high-quality and diverse training data. In this work, we aim to address the scarcity of reasoning-focused multimodal datasets. We propose VisualWebInstruct, a novel approach that leverages search engines to create a diverse and high-quality dataset spanning multiple disciplines, including mathematics, physics, finance, and chemistry, etc. Starting with a meticulously selected set of 30,000 seed images, we employ Google Image Search to identify websites containing similar images. We collect and process HTML data from over 700K unique URLs. Through a pipeline of content extraction, filtering, and synthesis, we construct a dataset of approximately 900K question-answer (QA) pairs, with 40% consisting of visual QA pairs and the remaining comprising text-based QA pairs. Models fine-tuned on VisualWebInstruct demonstrate significant performance improvements: (1) fine-tuning on Llava-OV results in 10-20 absolute points improvement across benchmarks, and (2) fine-tuning from MAmmoTH-VL yields a 5 absolute points gain across benchmarks. Our best model, MAmmoTH-VL2, achieves state-of-the-art performance within the 10B parameter class on MMMU-Pro (40.7), MathVerse (42.6), and DynaMath (55.7). These results highlight the effectiveness of our dataset in enhancing the reasoning capabilities of vision-language models for complex multimodal tasks.

dataset, reasoning, zhang, (15 more...)

arXiv.org Artificial Intelligence

Mar-14-2025

arXiv.org PDF

Add feedback

Country:
- Europe > Switzerland (0.04)
- North America
  - United States > California
    - Santa Barbara County > Santa Barbara (0.04)
  - Canada > Ontario
    - Toronto (0.14)

Genre:
- Research Report (1.00)

Industry:
- Education > Curriculum > Subject-Specific Education (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Vision (1.00)
  - Natural Language
    - Question Answering (1.00)
    - Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (0.70)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found