Accelerate Scaling of LLM Finetuning via Quantifying the Coverage and Depth of Instruction Set

Wu, Chengwei, Du, Li, Zhao, Hanyu, Ju, Yiming, Wang, Jiapu, Chen, Tianyu, Zhou, Haoyi

Oct-29-2025–arXiv.org Artificial Intelligence

Scaling the amount of data used for supervied fine-tuning(SFT) does not guarantee the proportional gains in model performance, highlighting a critical need to understand what makes training samples effective. This work identifies two fundamental dataset properties that govern SFT scalability: \textbf{semantic coverage}, or the breadth of task domains, and \textbf{information depth}, or the richness of individual examples. We demonstrate that simple proxies for these properties explain the majority of validation loss variance in our experiments. In this work, we further propose the \textbf{Information Landscape Approximation (ILA)}, a model-agnostic data selection framework that jointly optimizes for these two factors. ILA constructs compact subsets that approximate the informational value of large datasets. Empirical results show that models tuned on ILA-selected data achieve faster and more sustained performance improvements across diverse tasks and model sizes compared to existing methods, a phenomenon we term \textbf{accelerated scaling}.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

Oct-29-2025

arXiv.org PDF

Add feedback

Genre:
- Research Report
  - New Finding (1.00)
  - Experimental Study (0.67)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning
    - Statistical Learning (0.68)
    - Neural Networks > Deep Learning (0.46)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found