ToolVQA: A Dataset for Multi-step Reasoning VQA with External Tools
Shaofeng Yin, Ting Lei, Yang Liu
Integrating external tools into Large Foundation Models (LFMs) has emerged as a promising approach to enhance their problem-solving capabilities. While existing studies have demonstrated strong performance in tool-augmented Visual Question Answering (VQA), recent benchmarks reveal significant gaps in real-world tool-use proficiency, particularly in functionally diverse multimodal settings requiring multi-step reasoning. In this work, we introduce ToolVQA, a large-scale multimodal dataset comprising 23K samples, designed to bridge this gap. Unlike previous datasets that rely on synthetic scenarios and simplified queries, ToolVQA features real-world visual contexts and challenging implicit multi-step reasoning tasks, better aligning with real user interactions. To construct this dataset, we propose ToolEngine, a novel data generation pipeline that employs image-guided Depth-First Search (DFS) with a Longest Common Subsequence (LCS)-based example matching mechanism to simulate human-like tool-use reasoning. ToolVQA encompasses 10 multimodal tools across 7 diverse domains, with an average inference length of 2.78 reasoning steps per sample. The LLaVA-7B model fine-tuned on ToolVQA not only achieves impressive performance on the ToolVQA test set, but also surpasses the large closed-source model GPT-3.5-turbo on five out-of-distribution (OOD) datasets, showing strong generalizability in real-world tool-use scenarios. Code is available at https://github.com/Fugtemypt123/T
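The abstract attributes ToolEngine's in-context example selection to a Longest Common Subsequence comparison over tool-use trajectories. Below is a minimal Python sketch of how such matching could work, assuming trajectories are represented as lists of tool names; all function and variable names here are hypothetical, and the actual ToolEngine implementation may differ.

    # Hypothetical sketch: retrieve the stored demonstration whose tool
    # sequence shares the longest common subsequence with the tool calls
    # made so far in a partially built trajectory.
    from typing import List, Sequence, Tuple

    def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
        """Classic O(len(a) * len(b)) dynamic-programming LCS length."""
        dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i, x in enumerate(a, 1):
            for j, y in enumerate(b, 1):
                if x == y:
                    dp[i][j] = dp[i - 1][j - 1] + 1
                else:
                    dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
        return dp[len(a)][len(b)]

    def match_example(
        partial_tools: List[str],
        demos: List[Tuple[List[str], str]],
    ) -> str:
        """Return the demo text whose tool sequence best matches the partial trajectory."""
        best_text, best_score = "", -1
        for demo_tools, demo_text in demos:
            score = lcs_length(partial_tools, demo_tools)
            if score > best_score:
                best_score, best_text = score, demo_text
        return best_text

    if __name__ == "__main__":
        demos = [
            (["OCR", "Calculator"], "Read the price tag, then compute the total."),
            (["ObjectDetector", "Search"], "Detect the landmark, then look it up."),
        ]
        # A trajectory that has already called OCR should retrieve the first demo.
        print(match_example(["OCR"], demos))

LCS (unlike exact prefix matching) tolerates extra or reordered tool calls in the partial trajectory, which plausibly suits the paper's goal of simulating human-like, multi-step tool-use reasoning; each DFS expansion step could re-run this matching to pick the demonstration guiding the next tool call.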
arXiv.org Artificial Intelligence
Aug-6-2025