Towards Inference-time Category-wise Safety Steering for Large Language Models
