Activation Steering for Bias Mitigation: An Interpretable Approach to Safer LLMs
