CoT-Self-Instruct: Building high-quality synthetic prompts for reasoning and non-reasoning tasks
Ping Yu, Jack Lanchantin, Tianlu Wang, Weizhe Yuan, Olga Golovneva, Ilia Kulikov, Sainbayar Sukhbaatar, Jason Weston, Jing Xu
–arXiv.org Artificial Intelligence
We propose CoT-Self-Instruct, a synthetic data generation method that instructs LLMs to first reason and plan via Chain-of-Thought (CoT) based on given seed tasks, and then generate a new synthetic example of similar quality and complexity. This is followed by a filtering step to select high-quality data using automatic metrics, which are then used for LLM training. In verifiable reasoning, our synthetic data significantly outperforms existing training datasets, such as s1k and OpenMathReasoning, when evaluated on MATH500, AMC23, AIME24, and GPQA-Diamond.

The transformative rise of Large Language Models (LLMs) has initiated a substantial paradigm shift in the domain of deep learning (Zhang et al., 2023; Guo et al., 2023; Long et al., 2024). The development of such models emphasizes scale, and relies heavily on large volumes of high-quality data (Gandhi et al., 2024; Abdin et al., 2024). However, acquiring such data from human sources can often be challenging or even impractical due to factors such as high costs, data scarcity, and privacy concerns (Kurakin et al., 2023). Furthermore, several studies (Hosking et al., 2023; Singh et al., 2023; Gilardi et al., 2023) have pointed out that human-generated data, being inherently prone to biases and errors, may not always be ideal for model training or evaluation.

In this context, synthetic data emerges as a viable alternative for obtaining high-quality datasets. Synthetic data is artificially generated to replicate the characteristics and patterns of real-world data. One innovative approach to creating such data is the Self-Instruct method (Wang et al., 2022a), which utilizes LLMs themselves to generate instruction-following examples. This method begins by selecting a small set of seed instruction-following samples, which are then used to prompt LLMs to produce additional demonstrations in a similar format.
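The generate-then-filter pipeline described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: `call_llm` is a hypothetical stand-in for a real LLM API (stubbed here with a fixed response so the sketch runs), the `<reasoning>`/`<prompt>` tag format is an assumed output convention, and the scoring function is a placeholder for the paper's automatic quality metrics.

```python
import random

def call_llm(prompt: str) -> str:
    """Hypothetical LLM call, stubbed with a fixed CoT-style response."""
    return ("<reasoning>The seeds are competition-style math problems; a new task "
            "of similar difficulty could involve counting lattice paths.</reasoning>\n"
            "<prompt>How many lattice paths from (0,0) to (6,6) avoid the point (3,3)?</prompt>")

def cot_self_instruct(seed_tasks, n_samples=4):
    """Step 1: prompt the LLM to reason over seed tasks via CoT, then emit a new task."""
    candidates = []
    for _ in range(n_samples):
        seeds = random.sample(seed_tasks, k=min(2, len(seed_tasks)))
        instruction = (
            "First think step by step about the quality and complexity of these seed "
            "tasks, then write ONE new task of similar quality and complexity.\n\n"
            + "\n".join(f"Seed: {s}" for s in seeds)
        )
        response = call_llm(instruction)
        # Keep only the generated task, discarding the CoT plan.
        start = response.find("<prompt>")
        end = response.find("</prompt>")
        if start != -1 and end != -1:
            candidates.append(response[start + len("<prompt>"):end].strip())
    return candidates

def filter_candidates(candidates, score_fn, threshold=0.5):
    """Step 2: keep only candidates whose automatic quality score clears a threshold."""
    return [c for c in candidates if score_fn(c) >= threshold]

seeds = ["Prove that the sum of the first n odd numbers is n^2.",
         "How many ways can 8 rooks be placed on a chessboard so none attack each other?"]
generated = cot_self_instruct(seeds)
# Placeholder metric: a real pipeline would use e.g. a reward model or answer consistency.
kept = filter_candidates(generated, score_fn=lambda c: 1.0 if c.endswith("?") else 0.0)
```

The key design point is that the CoT plan is used only to steer generation quality and is stripped before training; only the filtered prompts survive.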
Since then, a number of variants have been introduced that increase the complexity of queries (Liu et al., 2023; Zeng et al., 2024), maintain semantic diversity (Ding et al., 2023), scale the synthetic data (Yuan et al., 2023), and use these methods in self-improvement loops (Yuan et al., 2024).
Sep-4-2025