Understanding Catastrophic Forgetting in Language Models via Implicit Inference
Suhas Kotha, Jacob Mitchell Springer, Aditi Raghunathan
Fine-tuning (via methods such as instruction tuning or reinforcement learning from human feedback) is a crucial step in training language models to robustly carry out tasks of interest. However, we lack a systematic understanding of the effects of fine-tuning, particularly on tasks outside the narrow fine-tuning distribution. In a simplified scenario, we demonstrate that improving performance on tasks within the fine-tuning data distribution comes at the expense of suppressing model capabilities on other tasks. This degradation is especially pronounced for tasks "closest" to the fine-tuning distribution. We hypothesize that language models implicitly infer the task that a prompt corresponds to, and that fine-tuning predominantly skews this task inference towards tasks in the fine-tuning distribution. To test this hypothesis, we propose conjugate prompting, which artificially makes a task look farther from the fine-tuning distribution while requiring the same capability, and ask whether it can recover pretrained capabilities. We find that conjugate prompting systematically recovers some of the pretraining capabilities in our synthetic setup. We then apply conjugate prompting to real-world LLMs, using the observation that fine-tuning distributions are typically heavily skewed towards English. We find that simply translating prompts into different languages can cause fine-tuned models to respond like their pretrained counterparts instead. This allows us to recover the in-context learning abilities lost via instruction tuning and, more concerningly, to recover harmful content generation suppressed by safety fine-tuning in chatbots like ChatGPT.

The development of large language models (LLMs) typically involves two stages: pretraining (next-token prediction) on vast text corpora and fine-tuning on carefully curated datasets to adapt the pretrained model to the application of interest. One fundamental concern is that fine-tuning datasets are considerably smaller and less diverse than web-scale pretraining datasets (Raffel et al., 2020; Arivazhagan et al., 2019; Gao et al., 2021), and there is always a risk that the fine-tuned model "catastrophically forgets" (McCloskey & Cohen, 1989) how to solve problems that the pretrained model could solve. Such a gap has been reported as an "alignment tax" in works such as Ouyang et al. (2022) and Bai et al. (2022), but there is no clear understanding of what these trade-offs are and how to mitigate them. Given the importance of the fine-tuning process, it is imperative to build a systematic understanding of its effects.
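The abstract's real-world instantiation of conjugate prompting is a translate-query-translate loop: move the prompt's surface form away from the (mostly English) fine-tuning distribution, query the model, and map the answer back. The sketch below illustrates that idea only; the helper functions `translate` and `query_model` are hypothetical placeholders and not the authors' released code.

```python
def translate(text: str, source_lang: str, target_lang: str) -> str:
    """Placeholder for any machine-translation call (e.g. an MT API)."""
    raise NotImplementedError


def query_model(prompt: str) -> str:
    """Placeholder for a call to the fine-tuned language model."""
    raise NotImplementedError


def conjugate_prompt(prompt: str, pivot_lang: str = "fr") -> str:
    """Apply the conjugate transform, query the model, then invert the transform.

    The transform should preserve the capability needed to solve the task
    while making the prompt look farther from the fine-tuning distribution,
    here by translating it into a pivot language.
    """
    transformed = translate(prompt, source_lang="en", target_lang=pivot_lang)
    response = query_model(transformed)
    return translate(response, source_lang=pivot_lang, target_lang="en")
```

Translation is just one convenient transform for real LLMs, chosen because fine-tuning data is heavily English-skewed; in the paper's synthetic setup the same principle is applied with a different task-preserving transformation.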
arXiv.org Artificial Intelligence
Sep-18-2023