LM-mixup: Text Data Augmentation via Language Model based Mixup
Deng, Zhijie, Shen, Zhouan, Li, Ling, Zhou, Yao, Zhu, Zhaowei, He, Yanji, Wang, Wei, Wei, Jiaheng
arXiv.org Artificial Intelligence
Instruction tuning is crucial for aligning Large Language Models (LLMs), yet the quality of instruction-following data varies significantly. While high-quality data is paramount, it is often scarce; conversely, abundant low-quality data is frequently discarded, leading to substantial information loss. Existing data augmentation methods struggle to augment this low-quality data effectively, and the evaluation of such techniques remains poorly defined. To address this, we formally define the task of Instruction Distillation: distilling multiple low-quality and redundant inputs into high-quality and coherent instruction-output pairs. We introduce LM-Mixup, a model trained for this task with three complementary reward signals, quality, semantic alignment, and format compliance, via Group Relative Policy Optimization (GRPO). We demonstrate that LM-Mixup effectively augments imperfect datasets: fine-tuning LLMs on its distilled data, which accounts for only about 3% of the entire dataset, not only surpasses full-dataset training but also competes with state-of-the-art high-quality data selection methods across multiple benchmarks. Our work establishes that low-quality data is a valuable resource when properly distilled and augmented with LM-Mixup, significantly enhancing the efficiency and performance of instruction-tuned LLMs. The code and the dataset are available at: https://github.com/yuu250/LM-mixup.

In recent years, large language models (LLMs) have achieved notable progress in natural language processing and multimodal understanding (Team et al., 2023; Singhal et al., 2023; Deng et al., 2025; Li et al., 2024b; 2025a; Pang et al., 2025b). This progress stems not only from improved architectures and larger scales but also from more efficient ways for models to learn and apply knowledge (Patil & Jadon, 2025; Dredze, 2025).
While the conventional view holds that high-quality human alignment requires massive annotated data (Kim et al., 2024; Köpf et al., 2023), recent studies show that LLMs acquire most knowledge during pre-training (Brown et al., 2020; Roberts et al., 2020). This shifts the research focus from "more data" to "better data", emphasizing efficient high-quality data selection for model improvement. However, high-quality samples are scarce and costly, while real-world datasets contain abundant redundant or low-quality data, leading to significant information waste.
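The GRPO-based training described above combines three reward signals and scores each sampled distillation against the others in its group. The following is a minimal illustrative sketch, not the paper's actual implementation: the reward weights, the scoring inputs, and the group contents are all assumptions made for illustration. Only the group-relative advantage formula (reward minus group mean, divided by group standard deviation) reflects GRPO as commonly defined.

```python
# Illustrative sketch (not LM-Mixup's released code): combine three
# reward signals and compute GRPO's group-relative advantages.
# Weights and example scores below are assumptions for illustration.

def combined_reward(quality: float, alignment: float, fmt: float,
                    weights=(1.0, 1.0, 1.0)) -> float:
    """Weighted sum of the quality, semantic-alignment, and
    format-compliance reward signals."""
    wq, wa, wf = weights
    return wq * quality + wa * alignment + wf * fmt

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO scores each sampled completion against its own group:
    advantage_i = (r_i - mean(rewards)) / std(rewards)."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5 or 1.0
    return [(r - mean) / std for r in rewards]

# Example: four sampled distillations of the same low-quality inputs,
# each scored on (quality, alignment, format compliance).
group = [(0.9, 0.8, 1.0), (0.4, 0.6, 1.0), (0.7, 0.9, 0.0), (0.5, 0.5, 1.0)]
rewards = [combined_reward(q, a, f) for q, a, f in group]
advantages = group_relative_advantages(rewards)
```

Completions scoring above the group mean receive positive advantages and are reinforced; those below are penalized, so the policy is pushed toward distillations that jointly satisfy all three reward criteria.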
Oct-24-2025