When Scaling Meets LLM Finetuning: The Effect of Data, Model and Finetuning Method

Zhang, Biao, Liu, Zhongtao, Cherry, Colin, Firat, Orhan

Feb-26-2024–arXiv.org Artificial Intelligence

While large language models (LLMs) often adopt finetuning to unlock their capabilities for downstream applications, our understanding on the inductive biases (especially the scaling properties) of different finetuning methods is still limited. To fill this gap, we conduct systematic experiments studying whether and how different scaling factors, including LLM model size, pretraining data size, new finetuning parameter size and finetuning data size, affect the finetuning performance. We consider two types of finetuning - full-model tuning (FMT) and parameter efficient tuning (PET, including prompt tuning and LoRA), and explore their scaling behaviors in the data-limited regime where the LLM model size substantially outweighs the finetuning data size. Based on two sets of pretrained bilingual LLMs from 1B to 16B and experiments on bilingual machine translation and multilingual summarization benchmarks, we find that 1) LLM finetuning follows a powerbased multiplicative joint scaling law between finetuning data size and each other scaling factor; 2) LLM finetuning benefits more from LLM model scaling than pretraining data scaling, and PET parameter scaling is generally ineffective; and 3) the optimal finetuning method is highly task-and finetuning data-dependent. We hope our findings could shed light on understanding, selecting and developing LLM finetuning methods. Advanced LLMs, such as GPT-4 (OpenAI, 2023) and PaLM 2 (Anil et al., 2023), often show emergent capabilities and allow for in-context learning that could use just a few demonstration examples to perform complex reasoning and generation tasks (Wei et al., 2022; Zhang et al., 2023; Fu et al., 2023; Shen et al., 2023). Still, LLM finetuning is required and widely adopted to unlock new and robust capabilities for creative tasks, get the most for focused downstream tasks, and align its value with human preferences (Ouyang et al., 2022; Yang et al., 2023; Gong et al., 2023; Schick et al., 2023). This becomes more significant in traditional industrial applications due to the existence of large-scale annotated task-specific data accumulated over years.

lora, sentence, wmt en zh, (12 more...)

arXiv.org Artificial Intelligence

Feb-26-2024

arXiv.org PDF

Add feedback

Country:
- North America
  - Dominican Republic (0.04)
  - United States > Minnesota
    - Hennepin County > Minneapolis (0.14)
- Europe
  - Spain > Catalonia
    - Barcelona Province > Barcelona (0.04)
  - Romania > Sud - Muntenia Development Region
    - Giurgiu County > Giurgiu (0.04)
- Asia > Middle East
  - Jordan (0.04)
  - UAE > Abu Dhabi Emirate
    - Abu Dhabi (0.04)

Genre:
- Research Report > New Finding (0.87)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found