VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search
Jia, Yiming, Li, Jiachen, Yue, Xiang, Li, Bo, Nie, Ping, Zou, Kai, Chen, Wenhu
–arXiv.org Artificial Intelligence
Vision-Language Models have made significant progress on many perception-focused tasks. However, their progress on reasoning-focused tasks remains limited due to the lack of high-quality and diverse training data. In this work, we aim to address the scarcity of reasoning-focused multimodal datasets. We propose VisualWebInstruct, a novel approach that leverages search engines to create a diverse and high-quality dataset spanning multiple disciplines, including mathematics, physics, finance, and chemistry, etc. Starting with a meticulously selected set of 30,000 seed images, we employ Google Image Search to identify websites containing similar images. We collect and process HTML data from over 700K unique URLs. Through a pipeline of content extraction, filtering, and synthesis, we construct a dataset of approximately 900K question-answer (QA) pairs, with 40% consisting of visual QA pairs and the remaining comprising text-based QA pairs. Models fine-tuned on VisualWebInstruct demonstrate significant performance improvements: (1) fine-tuning on Llava-OV results in 10-20 absolute points improvement across benchmarks, and (2) fine-tuning from MAmmoTH-VL yields a 5 absolute points gain across benchmarks. Our best model, MAmmoTH-VL2, achieves state-of-the-art performance within the 10B parameter class on MMMU-Pro (40.7), MathVerse (42.6), and DynaMath (55.7). These results highlight the effectiveness of our dataset in enhancing the reasoning capabilities of vision-language models for complex multimodal tasks.
arXiv.org Artificial Intelligence
Mar-14-2025
- Country:
- North America
- Canada > Ontario
- Toronto (0.14)
- United States > California (0.14)
- Canada > Ontario
- North America
- Genre:
- Research Report (1.00)
- Industry:
- Education (0.46)
- Technology:
- Information Technology > Artificial Intelligence
- Machine Learning > Neural Networks
- Deep Learning (0.95)
- Natural Language
- Chatbot (0.70)
- Large Language Model (1.00)
- Question Answering (1.00)
- Vision (1.00)
- Machine Learning > Neural Networks
- Information Technology > Artificial Intelligence