MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale

Guo, Jarvis, Zheng, Tuney, Bai, Yuelin, Li, Bo, Wang, Yubo, Zhu, King, Li, Yizhi, Neubig, Graham, Chen, Wenhu, Yue, Xiang

Dec-6-2024–arXiv.org Artificial Intelligence

Open-source multimodal large language models (MLLMs) have shown significant potential in a broad range of multimodal tasks. However, their reasoning capabilities remain constrained by existing instruction-tuning datasets, which were predominately repurposed from academic datasets such as VQA, AI2D, and ChartQA. These datasets target simplistic tasks, and only provide phrase-level answers without any intermediate rationales. To address these challenges, we introduce a scalable and cost-effective method to construct a large-scale multimodal instruction-tuning dataset with rich intermediate rationales designed to elicit CoT reasoning. Using only open models, we create a dataset containing 12M instruction-response pairs to cover diverse, reasoning-intensive tasks with detailed and faithful rationales. Experiments demonstrate that training MLLMs on this dataset significantly improves reasoning capabilities, achieving state-of-the-art performance on benchmarks such as MathVerse (+8.1%), MMMU-Pro (+7%), and MuirBench (+13.3%). Additionally, the model demonstrates notable improvements of up to 4% on non-reasoning-based benchmarks. Ablation studies further highlight the importance of key components, such as rewriting and self-filtering, in the dataset construction process.

arxiv preprint, large language model, machine learning, (21 more...)

arXiv.org Artificial Intelligence

Dec-6-2024

arXiv.org PDF

Add feedback

Country:
- North America > United States (1.00)
- Europe (1.00)
- Asia (1.00)

Genre:
- Research Report > New Finding (0.67)

Industry:
- Health & Medicine (1.00)
- Government > Military (1.00)
- Banking & Finance (1.00)
- Media > Music (0.67)
- Leisure & Entertainment (0.67)
- Materials > Chemicals
  - Industrial Gases > Liquified Gas (0.46)
  - Commodity Chemicals > Petrochemicals
    - LNG (0.68)
- Energy > Oil & Gas
  - Midstream (0.69)

Technology:
- Information Technology > Artificial Intelligence
  - Vision (1.00)
  - Representation & Reasoning (1.00)
  - Cognitive Science > Problem Solving (0.85)
  - Natural Language
    - Large Language Model (1.00)
    - Chatbot (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)