A Scaling Law for Token Efficiency in LLM Fine-Tuning Under Fixed Compute Budgets
Lagasse, Ryan, Kierans, Aidan, Ghosh, Avijit, Dori-Hacohen, Shiri
–arXiv.org Artificial Intelligence
We introduce a scaling law for fine-tuning large language models (LLMs) under fixed compute budgets that explicitly accounts for data composition. Conventional approaches measure training data solely by total tokens, yet the number of examples and their average token length, which we term dataset volume, play a decisive role in model performance. Experiments on the BRICC dataset (Salavati et al., 2024) and subsets of the MMLU dataset (Hendrycks et al., 2021), evaluated under multiple subsampling strategies, reveal that data composition significantly affects token efficiency. These results motivate refined scaling laws for practical LLM fine-tuning in resource-constrained settings. Code will be made available upon publication.
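The abstract's notion of dataset volume (the number of examples together with their average token length) can be made concrete with a small sketch. The helper names and the three subsampling strategies below are illustrative assumptions, not the paper's actual implementation; the point is only that subsets matched on total token count can still differ sharply in composition.

```python
import random
from typing import List, Tuple

def dataset_volume(token_lengths: List[int]) -> Tuple[int, float]:
    """Return (number of examples, mean tokens per example)."""
    n = len(token_lengths)
    return n, (sum(token_lengths) / n if n else 0.0)

def subsample_to_budget(token_lengths: List[int], budget: int,
                        strategy: str = "random", seed: int = 0) -> List[int]:
    """Greedily pick examples until a fixed token budget is exhausted.

    Strategies are hypothetical stand-ins for the paper's subsampling schemes:
    'random'   -- shuffled order (mixed lengths)
    'longest'  -- prefer long examples (fewer examples, higher mean length)
    'shortest' -- prefer short examples (more examples, lower mean length)
    """
    order = list(token_lengths)
    if strategy == "random":
        random.Random(seed).shuffle(order)
    elif strategy == "longest":
        order.sort(reverse=True)
    elif strategy == "shortest":
        order.sort()
    else:
        raise ValueError(f"unknown strategy: {strategy}")

    chosen, used = [], 0
    for length in order:
        if used + length <= budget:
            chosen.append(length)
            used += length
    return chosen

if __name__ == "__main__":
    rng = random.Random(1)
    lengths = [rng.randint(50, 800) for _ in range(1000)]  # toy corpus of token lengths
    budget = 100_000                                        # fixed training-token budget
    for strategy in ("random", "longest", "shortest"):
        subset = subsample_to_budget(lengths, budget, strategy)
        n, mean_len = dataset_volume(subset)
        print(f"{strategy:>8}: {n:4d} examples, mean length {mean_len:6.1f}, "
              f"total tokens {sum(subset)}")
```

Under this toy setup, all three subsets consume roughly the same token budget, yet their (example count, mean length) pairs diverge, which is the kind of compositional difference a volume-aware scaling law would need to capture.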
Jun-4-2025