LinkQA: Synthesizing Diverse QA from Multiple Seeds Strongly Linked by Knowledge Points

Zhang, Xuemiao, Ren, Can, Tu, Chengying, Weng, Rongxiang, Yan, Hongfei, Wang, Jingang, Cai, Xunliang

Aug-7-2025–arXiv.org Artificial Intelligence

The advancement of large language models (LLMs) is constrained by the scarcity of high-quality, diverse training data. To address this limitation, we propose LinkSyn, a novel knowledge point (KP) graph-based synthesis framework that enables flexible control over discipline and difficulty distributions while balancing KP coverage and popularity. LinkSyn extracts KPs from question-answering (QA) seed data and constructs a KP graph to synthesize diverse QA data from multiple seeds that are strongly linked by KPs and sampled from graph walks. Specifically, LinkSyn incorporates (1) a knowledge distribution value function that guides the adjustment of path sampling probability and balances KP coverage and popularity during graph walks; (2) diffusion-based synthesis via DeepSeek-R1 that leverages multiple seeds with dense logical associations along each path; and (3) high-difficulty QA enhancement within given disciplines through flexible difficulty adjustments. By executing LinkSyn, we synthesize LinkQA, a diverse multi-disciplinary QA dataset with 50B tokens. Extensive experiments on Llama-3 8B demonstrate that continual pre-training with LinkQA yields an average improvement of 11.51% on MMLU and CMMLU, establishing new SOTA results. LinkQA consistently enhances performance across model sizes and initial FLOPs scales.
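
The KP graph walk described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not define the knowledge distribution value function, so the weighting used here (edge co-occurrence count divided by a visit-count penalty) is an assumed stand-in, and the names build_kp_graph, walk, and alpha are hypothetical. The sketch only shows the general idea of linking seeds through shared KPs and steering a walk away from KPs that have already been sampled often.

```python
import random
from collections import defaultdict
from itertools import combinations


def build_kp_graph(seeds):
    """Build a KP co-occurrence graph from seed QA items.

    seeds: dict mapping a seed id to the list of KPs extracted from it.
    Returns (edges, kp_to_seeds), where edges[a][b] counts the seeds in
    which KPs a and b co-occur, and kp_to_seeds maps a KP to its seeds.
    """
    edges = defaultdict(lambda: defaultdict(int))
    kp_to_seeds = defaultdict(set)
    for seed_id, kps in seeds.items():
        unique = sorted(set(kps))
        for kp in unique:
            kp_to_seeds[kp].add(seed_id)
        for a, b in combinations(unique, 2):
            edges[a][b] += 1
            edges[b][a] += 1
    return edges, kp_to_seeds


def walk(edges, start, length, visits, rng, alpha=1.0):
    """Sample one KP path of at most `length` nodes.

    Assumed weighting (not from the paper): popular edges are favoured via
    the co-occurrence count, while KPs visited often in earlier walks are
    penalised by (1 + visits) ** alpha to encourage coverage.
    """
    path = [start]
    visits[start] += 1
    current = start
    for _ in range(length - 1):
        neighbors = sorted(n for n in edges[current] if n not in path)
        if not neighbors:
            break
        weights = [edges[current][n] / (1 + visits[n]) ** alpha for n in neighbors]
        current = rng.choices(neighbors, weights=weights, k=1)[0]
        path.append(current)
        visits[current] += 1
    return path


# Toy seeds: each QA item is tagged with the KPs extracted from it.
seeds = {
    "q1": ["derivative", "chain rule"],
    "q2": ["chain rule", "integration"],
    "q3": ["integration", "area"],
}
edges, kp_to_seeds = build_kp_graph(seeds)
visits = defaultdict(int)
path = walk(edges, "derivative", 4, visits, random.Random(0))
# Seeds linked along the path are the candidates for multi-seed synthesis.
linked_seeds = sorted(set().union(*(kp_to_seeds[kp] for kp in path)))
```

In this toy graph the KPs form a single chain, so the walk has only one unvisited neighbor at each step and the path covers all four KPs; on a real graph, raising alpha would shift sampling toward rarely visited KPs, and lowering it would favour popular ones.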

large language model, machine learning, natural language

Country:
- North America > United States (0.45)
- Europe > Austria (0.28)

Genre:
- Research Report > Experimental Study (0.46)

Industry:
- Health & Medicine > Therapeutic Area (1.00)
- Government > Military (0.93)
- Law > Business Law (0.67)
- Energy (0.67)
- Education
  - Educational Setting (1.00)
  - Health & Safety > School Nutrition (0.45)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Cognitive Science > Problem Solving (0.92)
  - Machine Learning > Neural Networks
    - Deep Learning (0.55)