R-CoT: Reverse Chain-of-Thought Problem Generation for Geometric Reasoning in Large Multimodal Models

Deng, Linger, Liu, Yuliang, Li, Bohan, Luo, Dongliang, Wu, Liang, Zhang, Chengquan, Lyu, Pengyuan, Zhang, Ziyang, Zhang, Gang, Ding, Errui, Zhu, Yingying, Bai, Xiang

Oct-27-2024–arXiv.org Artificial Intelligence

Existing Large Multimodal Models (LMMs) struggle with mathematical geometric reasoning due to a lack of high-quality image-text paired data. Current geometric data generation approaches, which either apply preset templates or use Large Language Models (LLMs) to rephrase questions and answers (Q&A), unavoidably limit data accuracy and diversity. To synthesize higher-quality data, we propose a two-stage Reverse Chain-of-Thought (R-CoT) geometry problem generation pipeline. First, we introduce GeoChain to produce high-fidelity geometric images and corresponding descriptions highlighting relations among geometric elements. We then design a Reverse A&Q method that reasons step-by-step based on the descriptions and generates questions in reverse from the reasoning results. Experiments demonstrate that the proposed method brings significant and consistent improvements on multiple LMM baselines, achieving new performance records in the 2B, 7B, and 8B settings. Notably, R-CoT-8B significantly outperforms previous state-of-the-art open-source mathematical models by 16.6% on MathVista and 9.2% on GeoQA, while also surpassing the closed-source model GPT-4o by an average of 13% across both datasets. The code is available at https://github.com/dle666/R-CoT.

Large Language Models (LLMs) exhibit excellent reasoning capabilities and have drawn extensive attention from the artificial intelligence research community (Lu et al., 2023b), particularly for mathematical problem-solving in textual form (Chen et al., 2024b; Liao et al., 2024; Zhou et al., 2024; Zhao et al., 2024b; Zhou & Zhao, 2024; Kim et al., 2024). However, LLMs still struggle to solve mathematical problems involving images that require visual comprehension. Geometry problems, as typical mathematical problems with images, play an important role in evaluating mathematical reasoning skills (Zhang et al., 2023c) and require a high level of visual comprehension. Moreover, some problems that are not geometric on the surface demand the same skills from models (e.g., fine-grained image comprehension and multi-step reasoning).

With the appearance of o1 (OpenAI, 2024), GPT-4o (Islam & Moushi, 2024), Gemini (Team et al., 2023), and numerous Large Multimodal Models (LMMs) (Li et al., 2024a; Liu et al., 2024; Chen et al., 2024d; Bai et al., 2023), recent research has increasingly investigated using LMMs to solve mathematical geometry problems. Although LMMs show impressive results in general visual question-answering (VQA) tasks (Fan et al., 2024; Liu et al., 2024), they still face challenges in solving mathematical geometry problems.
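The "answer first, question second" idea behind Reverse A&Q can be illustrated with a toy sketch. This is not the authors' implementation, and the abstract gives no implementation details: the names `Description`, `reason_forward`, and `questions_in_reverse` are hypothetical, and hand-written arithmetic on a single equilateral triangle stands in for whatever reasoning engine the real pipeline uses. The sketch only shows why generating the question from an already-derived result keeps each Q&A pair consistent with the figure description.

```python
import math
from dataclasses import dataclass


@dataclass
class Description:
    """Hypothetical stand-in for a structured description of one figure."""
    shape: str
    side: float  # side length of the equilateral triangle


def reason_forward(desc):
    """Derive facts step by step, using only the description.

    Returns a list of (quantity name, value, derivation text) tuples.
    """
    steps = []
    perimeter = 3 * desc.side
    steps.append(("perimeter", perimeter, f"3 * {desc.side:g}"))
    height = desc.side * math.sqrt(3) / 2
    steps.append(("height", height, f"{desc.side:g} * sqrt(3) / 2"))
    area = desc.side * height / 2
    steps.append(("area", area, f"{desc.side:g} * height / 2"))
    return steps


def questions_in_reverse(desc, steps):
    """Turn each derived result into a question whose answer is already known."""
    qa_pairs = []
    for name, value, derivation in steps:
        qa_pairs.append({
            "question": (f"In an {desc.shape} with side length "
                         f"{desc.side:g}, what is the {name}?"),
            "answer": value,
            "rationale": derivation,
        })
    return qa_pairs


if __name__ == "__main__":
    desc = Description(shape="equilateral triangle", side=4.0)
    for qa in questions_in_reverse(desc, reason_forward(desc)):
        print(qa["question"], "->", round(qa["answer"], 3))
```

Because every answer is computed before its question is written, a question can never ask for a quantity the description does not determine. That is the property the reverse ordering is meant to illustrate here.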

equilateral triangle, large language model, machine learning, (17 more...)

Genre:
- Research Report (0.64)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)
  - Natural Language > Large Language Model (1.00)