Composition-Grounded Instruction Synthesis for Visual Reasoning
Gu, Xinyi, Mao, Jiayuan, Hong, Zhang-Wei, Yu, Zhuoran, Li, Pengyuan, Joshi, Dhiraj, Feris, Rogerio, He, Zexue
–arXiv.org Artificial Intelligence
Pretrained multi-modal large language models (MLLMs) demonstrate strong performance on diverse multimodal tasks, but remain limited in reasoning capabilities for domains where annotations are difficult to collect. In this work, we focus on artificial image domains such as charts, rendered documents, and webpages, which are abundant in practice yet lack large-scale human-annotated reasoning datasets. We introduce COGS (COmposition-Grounded instruction Synthesis), a data-efficient framework for equipping MLLMs with advanced reasoning abilities from a small set of seed questions. The key idea is to decompose each seed question into primitive perception and reasoning factors, which can then be systematically recomposed with new images to generate large collections of synthetic question-answer pairs. Each generated question is paired with subquestions and intermediate answers, enabling reinforcement learning with factor-level process rewards. Experiments on chart reasoning show that COGS substantially improves performance on unseen questions, with the largest gains on reasoning-heavy and compositional questions. Moreover, training with a factor-level mixture of different seed data yields better transfer across multiple datasets, suggesting that COGS induces generalizable capabilities rather than dataset-specific overfitting. We further demonstrate that the framework extends beyond charts to other domains such as webpages.

Pretrained multi-modal large language models (MLLMs) have achieved impressive performance across a wide range of multimodal tasks (Liu et al., 2023c; Bai et al., 2025; Wang et al., 2025a; Agrawal et al., 2024; OpenAI et al., 2024; Comanici et al., 2025; Anthropic, 2024), yet advanced reasoning capabilities remain underdeveloped, especially in domains where reasoning-intensive user query-answer data is difficult to collect.
In this work, we consider reasoning over artificial image domains, including charts, tables, information graphics, rendered documents, webpages, etc. While such images are abundant on the web, datasets containing reasoning questions over them are scarce.
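To make the factor-level process reward described above concrete, the following is a minimal illustrative sketch, not the paper's implementation: it assumes each synthesized question carries subquestions with expected intermediate answers (one per perception/reasoning factor), and blends the fraction of matched intermediate answers with the final-answer reward. The names (`FactorStep`, `factor_process_reward`, `step_weight`) and the exact-match scoring are hypothetical simplifications.

```python
from dataclasses import dataclass

@dataclass
class FactorStep:
    """One primitive perception or reasoning factor of a synthesized question."""
    subquestion: str
    expected: str  # intermediate answer paired with this subquestion

def factor_process_reward(steps, predictions, final_expected, final_pred,
                          step_weight=0.5):
    """Hypothetical factor-level process reward: average exact-match credit
    over intermediate factor answers, blended with the final-answer reward."""
    norm = lambda s: s.strip().lower()
    matched = sum(norm(p) == norm(s.expected)
                  for s, p in zip(steps, predictions))
    process = matched / len(steps) if steps else 0.0
    outcome = float(norm(final_pred) == norm(final_expected))
    return step_weight * process + (1 - step_weight) * outcome
```

Under this sketch, a model that answers every subquestion and the final question correctly receives reward 1.0, while one that gets the final answer right but misses an intermediate factor receives partial credit, giving the RL objective a denser training signal than outcome-only rewards.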
Oct-20-2025