Understanding and Mitigating Over-refusal for Large Language Models via Safety Representation
