Residual Stream Analysis of Overfitting And Structural Disruptions

Jun-12-2026, 00:48:42 GMT–Neural Information Processing Systems

Ensuring that large language models (LLMs) remain both helpful and harmless poses a significant challenge: fine-tuning on repetitive safety datasets, in which unsafe prompts are paired with standard refusal templates, often leads to false refusals, where benign queries are declined. We first quantify this effect, showing that safety data exhibits substantially lower token entropy ($H_{1}\approx9.18$)
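The token-entropy measurement mentioned above can be illustrated with a minimal sketch. It assumes that $H_{1}$ denotes the Shannon entropy of the unigram token distribution, measured in bits; the excerpt does not define the quantity, so this reading is an assumption. The function name `unigram_entropy`, the whitespace tokenization, and the two toy corpora are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def unigram_entropy(tokens):
    """Shannon entropy, in bits, of the empirical unigram distribution.

    Assumption: this is one plausible reading of H_1; the source excerpt
    does not define the quantity.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical toy corpora, whitespace-tokenized for illustration only.
# A real measurement would use the model's tokenizer and a full dataset.
refusals = (
    "I cannot help with that request . "
    "I cannot help with that request ."
).split()
varied = (
    "the quick brown fox jumps over the lazy dog "
    "near a quiet river bank"
).split()

print(f"templated refusals: {unigram_entropy(refusals):.3f} bits")
print(f"varied text:        {unigram_entropy(varied):.3f} bits")
```

Repeating a single refusal template adds tokens without adding new token types, so the templated corpus scores lower than the varied one. That is the kind of gap the entropy comparison is meant to expose.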

artificial intelligence, large language model, natural language
