Residual Stream Analysis of Overfitting and Structural Disruptions

Neural Information Processing Systems 

Ensuring that large language models (LLMs) remain both helpful and harmless poses a significant challenge: fine-tuning on repetitive safety datasets, where unsafe prompts are paired with standard refusal templates, often leads to \emph{false refusals}, in which benign queries are declined. We first quantify this effect, showing that safety data exhibits substantially lower token entropy ($H_{1}\approx9.18$)