Patching LLM Like Software: A Lightweight Method for Improving Safety Policy in Large Language Models

Huzaifa Arif, Keerthiram Murugesan, Ching-Yun Ko, Pin-Yu Chen, Payel Das, Alex Gittens

arXiv.org Artificial Intelligence 

We propose patching large language models (LLMs) like software versions: a lightweight, modular approach to addressing safety vulnerabilities. While vendors release improved LLM versions, major releases are costly, infrequent, and difficult to tailor to customer needs, leaving deployed models with known safety gaps. Unlike full-model fine-tuning or major version updates, our method enables rapid remediation by prepending a compact, learnable prefix to an existing model. This "patch" introduces only 0.003% additional parameters, yet reliably steers model behavior toward that of a safer reference model. Across three critical domains (toxicity mitigation, bias reduction, and harmfulness refusal), policy patches achieve safety improvements comparable to next-generation safety-aligned models while preserving fluency. Our results demonstrate that LLMs can be "patched" much like software, offering vendors and practitioners a practical mechanism for distributing scalable, efficient, and composable safety updates between major model releases.

Large language models (LLMs) have achieved remarkable advances in reasoning, generation, and multilingual capabilities (Brown et al., 2020; Wei et al., 2022; Conneau & Lample, 2019). Despite these capabilities, they continue to exhibit serious safety concerns, such as the generation of toxic language (Gehman et al., 2020), biased associations that reinforce stereotypes (Dong et al., 2024), and the production of harmful or dangerous content (Mazeika et al., 2024). Addressing these risks is central to the broader challenge of alignment, in which models are refined to better match human values and expectations. Conventional approaches to improving safety rely on alignment techniques such as Reinforcement Learning from Human Feedback (RLHF) (Christiano et al., 2017; Bai et al., 2022; Ouyang et al., 2022), preference-based fine-tuning (Rafailov et al., 2023), and domain-specific supervised fine-tuning (Li et al., 2024). While these methods have proven effective, they require substantial computational resources, large-scale data curation, and careful model retraining. In practice, model providers (vendors) often release major updates to their models (major versions) on a fixed schedule, typically once or twice a year.
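The mechanism described above is, in spirit, a form of soft-prompt or prefix tuning distilled against a safer reference model. The sketch below is a minimal illustration under that reading; the class name `PolicyPatch`, the prefix length, and the KL-distillation objective are our assumptions for exposition, not details confirmed by the paper.

```python
# Minimal sketch (PyTorch) of a "policy patch": a small bank of trainable
# prefix embeddings prepended to a frozen base model, trained so the patched
# model imitates a safer reference model's next-token distribution.
# PolicyPatch, patch_loss, and the KL objective are illustrative assumptions.
import torch
import torch.nn.functional as F

class PolicyPatch(torch.nn.Module):
    def __init__(self, base_model, prefix_len=16):
        super().__init__()
        self.base = base_model
        for p in self.base.parameters():
            p.requires_grad = False          # the base model stays frozen
        d = base_model.config.hidden_size
        self.prefix_len = prefix_len
        # The entire "patch": prefix_len * d trainable parameters.
        self.prefix = torch.nn.Parameter(0.02 * torch.randn(prefix_len, d))

    def forward(self, input_ids, attention_mask):
        embeds = self.base.get_input_embeddings()(input_ids)
        bsz = embeds.size(0)
        prefix = self.prefix.unsqueeze(0).expand(bsz, -1, -1)
        embeds = torch.cat([prefix, embeds], dim=1)
        mask = torch.cat(
            [torch.ones(bsz, self.prefix_len,
                        dtype=attention_mask.dtype,
                        device=attention_mask.device),
             attention_mask],
            dim=1,
        )
        return self.base(inputs_embeds=embeds, attention_mask=mask).logits

def patch_loss(patched_logits, ref_logits, prefix_len):
    """KL divergence from the safer reference model's next-token
    distribution to the patched model's, skipping the prefix slots."""
    patched = patched_logits[:, prefix_len:, :]
    return F.kl_div(F.log_softmax(patched, dim=-1),
                    F.softmax(ref_logits, dim=-1),
                    reduction="batchmean")
```

For scale (an illustrative calculation, not a figure from the paper): 0.003% of a 7B-parameter model is roughly 2x10^5 trainable parameters, e.g., a prefix of about 50 tokens at hidden size 4096, so the patch can be shipped as a tiny standalone artifact alongside the frozen base model.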
