A Graph-Based Synthetic Data Pipeline for Scaling High-Quality Reasoning Instructions

Wang, Jiankang, Xu, Jianjun, Wang, Xiaorui, Wang, Yuxin, Xing, Mengting, Fang, Shancheng, Chen, Zhineng, Xie, Hongtao, Zhang, Yongdong

arXiv.org Artificial Intelligence 

Synthesizing high-quality reasoning data for continual training has been proven effective in enhancing the performance of Large Language Models (LLMs). However, previous synthetic approaches struggle to scale up data easily and incur high costs in the pursuit of high quality. In this paper, we propose the Graph-based Synthetic Data Pipeline (GSDP), an economical and scalable framework for high-quality reasoning data synthesis. Inspired by knowledge graphs, we extract knowledge points from seed data and construct a knowledge point relationship graph to explore their interconnections. By exploring the implicit relationships among knowledge points, our method achieves 255× data expansion. Furthermore, GSDP, led by open-source models, achieves synthesis quality comparable to GPT-4-0613 while maintaining 100× lower cost. To tackle the most challenging mathematical reasoning task, we present the GSDP-MATH dataset comprising over 1.91 million pairs of math problems and answers. After fine-tuning on GSDP-MATH, GSDP-7B based on Mistral-7B achieves 37.7% accuracy on MATH and 78.4% on GSM8K, demonstrating the effectiveness of our method. The dataset and models trained in this paper will be made available.

Despite the remarkable capabilities large language models (LLMs) have demonstrated in various linguistic tasks, significant gaps remain in their ability to comprehend and solve intricate reasoning tasks (e.g., mathematics, coding, physics, and chemistry). One effective approach to bridging these gaps is using large-scale, high-quality synthetic data. However, developing a low-cost and effective synthesis pipeline remains a challenge. Take mathematics as an example: the two main approaches for building high-quality mathematical reasoning datasets are data filtering and data synthesis.
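The graph-based expansion idea can be illustrated with a minimal sketch: annotate each seed problem with its knowledge points, link points that co-occur, and then enumerate both explicit edges and implicit two-hop relationships as candidate combinations for synthesizing new problems. All names and the toy data below are hypothetical; the actual GSDP pipeline uses LLMs to extract knowledge points and to synthesize the problems themselves.

```python
from itertools import combinations
from collections import defaultdict

# Toy seed data: each seed problem annotated with its knowledge points
# (in GSDP, this annotation would be produced by an open-source LLM).
seed_problems = [
    {"id": 0, "knowledge_points": {"quadratic equations", "discriminant"}},
    {"id": 1, "knowledge_points": {"discriminant", "inequalities"}},
    {"id": 2, "knowledge_points": {"inequalities", "absolute value"}},
]

# Build the knowledge point relationship graph: two points are linked
# when they co-occur in the same seed problem.
graph = defaultdict(set)
for prob in seed_problems:
    for a, b in combinations(sorted(prob["knowledge_points"]), 2):
        graph[a].add(b)
        graph[b].add(a)

def candidate_pairs(graph):
    """Yield knowledge-point pairs for synthesis: explicit edges plus
    implicit relationships (a shared neighbor but no direct edge)."""
    for a, b in combinations(sorted(graph), 2):
        if b in graph[a]:
            yield (a, b, "explicit")
        elif graph[a] & graph[b]:
            yield (a, b, "implicit")

for a, b, kind in candidate_pairs(graph):
    # In the real pipeline, each pair would seed an LLM prompt asking
    # for a new problem that combines both knowledge points.
    print(f"{kind}: combine '{a}' with '{b}'")
```

Even on this toy graph, three seed problems yield five candidate pairs (three explicit, two implicit); on a large seed set, mining implicit relationships is what drives the multiplicative data expansion reported in the abstract.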
Data filtering (Yue et al., 2024b; Shao et al., 2024; Ying et al., 2024) involves extracting data from pre-training corpora such as Common Crawl, and rewriting it using advanced commercial models or human annotation.