Coherent Multimodal Reasoning with Iterative Self-Evaluation for Vision-Language Models

Luo, Wenjie, Li, Ruocheng, Zhu, Shanshan, Perry, Julian

Aug-6-2025–arXiv.org Artificial Intelligence

Despite significant advancements, current large language models (LLMs) and vision-language models (LVLMs) continue to struggle with complex, multi-step, cross-modal common sense reasoning tasks, often exhibiting a lack of "deliberative thinking." They tend to rely on superficial associations rather than deep, chained inference, particularly when integrating visual information with abstract concepts. To address this, we propose the Coherent Multimodal Reasoning Framework (CMRF), a novel approach that enhances LVLMs' common sense reasoning capabilities through an iterative, self-evaluating inference mechanism. CMRF mimics human problem-solving by decomposing complex queries, generating step-by-step inferences, and self-correcting errors. Coupled with an Adaptive Iterative Refinement strategy, CMRF systematically refines its reasoning paths. Built upon LLaVA-1.6-34B and trained on a novel Multimodal Daily Activity Reasoning (MDAR) dataset, CMRF achieves state-of-the-art performance among open-source LVLMs on challenging benchmarks like VCR, A-OKVQA, and DailyLife-MRC. Extensive ablation studies and human evaluations confirm the critical contributions of each module and the effectiveness of iterative refinement in fostering more coherent and accurate reasoning.

The remarkable advancements in large language models (LLMs) [1], [2] and vision-language models (LVLMs) have revolutionized various aspects of artificial intelligence, demonstrating unprecedented capabilities in understanding, generating, and processing information across modalities [3]. These models excel in tasks ranging from complex question answering to creative content generation, largely due to their extensive pre-training on vast amounts of data.
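The abstract describes CMRF only at a high level: decompose the query, generate step-by-step inferences, self-evaluate, and refine iteratively. The control flow of such a loop can be illustrated with a minimal sketch. This is not the authors' implementation. Every name here (`iterative_refinement`, `decompose`, `infer`, `evaluate`, `revise`, `threshold`, `max_rounds`) is hypothetical, and the four callables stand in for model calls that, in a real LVLM setting, would also condition on the image.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ReasoningResult:
    steps: List[str]   # final chain of inferences
    score: float       # self-assessed coherence of that chain
    rounds: int        # number of refinement rounds attempted


def iterative_refinement(
    query: str,
    decompose: Callable[[str], List[str]],       # hypothetical: query -> sub-questions
    infer: Callable[[str, List[str]], str],      # hypothetical: sub-question + prior steps -> inference
    evaluate: Callable[[List[str]], float],      # hypothetical: chain -> score in [0, 1]
    revise: Callable[[List[str]], List[str]],    # hypothetical: chain -> corrected chain
    threshold: float = 0.8,                      # assumed acceptance threshold
    max_rounds: int = 3,                         # assumed refinement budget
) -> ReasoningResult:
    # Answer each sub-question in order, conditioning on the steps so far.
    steps: List[str] = []
    for sub_question in decompose(query):
        steps.append(infer(sub_question, steps))

    score = evaluate(steps)
    rounds = 0
    # Revise until the self-assessed score clears the threshold
    # or the round budget is exhausted.
    while score < threshold and rounds < max_rounds:
        candidate = revise(steps)
        candidate_score = evaluate(candidate)
        rounds += 1
        # Keep a revision only if the evaluator rates it higher.
        if candidate_score > score:
            steps, score = candidate, candidate_score
    return ReasoningResult(steps, score, rounds)
```

Passing the four stages in as callables keeps the sketch runnable with toy stand-ins and makes no claim about how the paper's modules are actually built or trained.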

cmrf, large language model, natural language

Country:
- North America > United States (0.93)
- Asia > Middle East
  - UAE (0.28)

Genre:
- Research Report (1.00)
- Overview > Innovation (0.34)

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning > Commonsense Reasoning (1.00)
  - Natural Language > Large Language Model (1.00)
  - Cognitive Science > Problem Solving (1.00)
