Residual Stream Analysis of Overfitting And Structural Disruptions
Neural Information Processing Systems
Ensuring that large language models (LLMs) remain both helpful and harmless is a significant challenge: fine-tuning on repetitive safety datasets, in which unsafe prompts are paired with standard refusal templates, often leads to \emph{false refusals}, where benign queries are declined. We first quantify this effect, showing that safety data exhibits substantially lower token entropy ($H_{1}\approx9.18$)
Jun-12-2026, 00:48:42 GMT