Key-Point-Driven Data Synthesis with its Enhancement on Mathematical Reasoning

Huang, Yiming, Liu, Xiao, Gong, Yeyun, Gou, Zhibin, Shen, Yelong, Duan, Nan, Chen, Weizhu

arXiv.org Artificial Intelligence 

Large language models (LLMs) have shown great potential in complex reasoning tasks, yet their performance is often hampered by the scarcity of high-quality and reasoning-focused training datasets. Addressing this challenge, we propose Key-Point-Driven Data Synthesis (KPDDS), a novel data synthesis framework that synthesizes question-answer pairs by leveraging key points and exemplar practices from authentic data sources. KPDDS ensures the generation of novel questions with rigorous quality control and substantial scalability. As a result, we present KPMath, an extensive synthetic dataset tailored for mathematical reasoning, comprising over 800K question-answer pairs. Utilizing KPMath and augmenting it with additional reasoning-intensive corpora, we create the comprehensive KPMath-Plus dataset. The Qwen1.5-72B model, fine-tuned on KPMath-Plus, achieves 87.0%

Duplicate Docs Excel Report

Title
None found

Similar Docs  Excel Report  more

TitleSimilaritySource
None found