Preventing Catastrophic Forgetting: Behavior-Aware Sampling for Safer Language Model Fine-Tuning
Anh Pham, Mihir Thalanki, Michael Sun, Aditya Chaloo, Ankita Gupta, Tian Xia, Aditya Mate, Ehimwenma Nosakhare, Soundararajan Srinivasan
arXiv.org Artificial Intelligence
Large language models often lose previously aligned safety behaviors when fine-tuned on benign data, a phenomenon known as catastrophic forgetting. Prior work shows that adding random safety examples can mitigate this effect, but it remains unclear which examples are most effective. We propose a behavior-aware sampling framework that selects safety examples based on two complementary factors: instruction-response behavior (e.g., refusal versus compliance) and semantic diversity across harm categories. Systematic evaluation shows that this approach substantially reduces harmful outputs while maintaining helpfulness, achieving up to a 41% reduction in harmfulness with only 0.5% additional training data. These results highlight how targeted data selection can improve the safety and efficiency of fine-tuning at scale.
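The abstract describes selecting safety examples along two axes: instruction-response behavior (refusal versus compliance) and diversity across harm categories. The paper's actual selection algorithm is not given here; the following is a minimal, hypothetical sketch of one way such behavior-aware sampling could work, using stratified round-robin selection over (behavior, harm category) groups. All function and field names are illustrative assumptions, not the authors' implementation.

```python
import random
from collections import defaultdict

def behavior_aware_sample(examples, budget, seed=0):
    """Select up to `budget` safety examples, balancing across
    (behavior, harm_category) strata.

    `examples`: list of dicts with "behavior" (e.g. "refusal" or
    "compliance") and "category" (a harm category) keys.
    This is an illustrative sketch, not the paper's method.
    """
    rng = random.Random(seed)
    # Group examples into strata keyed by behavior and harm category.
    strata = defaultdict(list)
    for ex in examples:
        strata[(ex["behavior"], ex["category"])].append(ex)
    for pool in strata.values():
        rng.shuffle(pool)
    # Round-robin over strata so every behavior/category pair is
    # represented before any stratum contributes a second example.
    selected = []
    keys = sorted(strata)
    i = 0
    while len(selected) < budget and any(strata[k] for k in keys):
        k = keys[i % len(keys)]
        if strata[k]:
            selected.append(strata[k].pop())
        i += 1
    return selected

# Usage: with four strata and a budget of 4, each stratum is
# represented exactly once.
examples = [
    {"behavior": b, "category": c}
    for b in ("refusal", "compliance")
    for c in ("violence", "fraud")
    for _ in range(5)
]
subset = behavior_aware_sample(examples, budget=4)
```

Round-robin is only one stand-in for the diversity criterion; the paper's semantic-diversity component would presumably operate on embeddings rather than discrete category labels.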
Oct-28-2025