A Survey on Efficient Large Language Model Training: From Data-centric Perspectives

Luo, Junyu, Wu, Bohan, Luo, Xiao, Xiao, Zhiping, Jin, Yiqiao, Tu, Rong-Cheng, Yin, Nan, Wang, Yifan, Yuan, Jingyang, Ju, Wei, Zhang, Ming

Oct-31-2025–arXiv.org Artificial Intelligence

Post-training of Large Language Models (LLMs) is crucial for unlocking their task generalization potential and domain-specific capabilities. However, the current LLM post-training paradigm faces significant data challenges, including the high costs of manual annotation and diminishing marginal returns on data scales. Therefore, achieving data-efficient post-training has become a key research question. In this paper, we present the first systematic survey of data-efficient LLM post-training from a data-centric perspective. We propose a taxonomy of data-efficient LLM post-training methods, covering data selection, data quality enhancement, synthetic data generation, data distillation and compression, and self-evolving data ecosystems. We summarize representative approaches in each category and outline future research directions. By examining the challenges in data-efficient LLM post-training, we highlight open problems and propose potential research avenues. We hope our work inspires further exploration into maximizing the potential of data utilization in large-scale model training. Paper List: https://github.com/luo-junyu/Awesome-Data-Efficient-LLM

arxiv preprint arxiv, large language model, natural language, (15 more...)

arXiv.org Artificial Intelligence

Oct-31-2025

arXiv.org PDF

Add feedback

Country:
- North America > United States > California (0.28)

Genre:
- Research Report (1.00)
- Overview (1.00)

Industry:
- Information Technology (0.46)

Technology:
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found