Steering Over-refusals Towards Safety in Retrieval Augmented Generation
Utsav Maskey, Mark Dras, Usman Naseem
Safety alignment in large language models (LLMs) induces over-refusals, where LLMs decline benign requests because of overly aggressive safety filters. We analyze this phenomenon in retrieval-augmented generation (RAG), where both the query intent and the properties of the retrieved context influence refusal behavior. We construct RagRefuse, a domain-stratified benchmark spanning medical, chemical, and open domains that pairs benign and harmful queries with controlled context contamination patterns and sizes. Our analysis shows that context arrangement and contamination, the domain of the query and context, and harmful-text density trigger refusals even on benign queries, with effects depending on model-specific alignment choices. To mitigate over-refusals, we introduce SafeRAG-Steering, a model-centric embedding intervention that, at inference time, steers embeddings toward regions associated with confirmed safe, non-refusing outputs. This reduces over-refusals in contaminated RAG pipelines while preserving legitimate refusals.
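The abstract describes an inference-time embedding intervention that shifts hidden representations toward a region associated with safe, non-refusing outputs. The sketch below is not the authors' released code; it is a minimal illustration of one common way such steering is implemented, assuming we have cached hidden states from prompts the model answered safely and from benign prompts it over-refused, a difference-of-means steering direction, and a forward hook that adds that direction at a chosen layer. The layer, hidden size, and scaling factor `alpha` are hypothetical.

```python
# Minimal sketch of an inference-time embedding-steering intervention
# (illustrative only; not the SafeRAG-Steering implementation).
import torch
import torch.nn as nn


def steering_vector(safe_hidden: torch.Tensor, refused_hidden: torch.Tensor) -> torch.Tensor:
    """Unit direction pointing from the over-refusal region toward the safe, non-refusing region."""
    direction = safe_hidden.mean(dim=0) - refused_hidden.mean(dim=0)
    return direction / direction.norm()


def add_steering_hook(layer: nn.Module, direction: torch.Tensor, alpha: float = 2.0):
    """Register a forward hook that nudges the layer's output along `direction` at inference time."""
    def hook(_module, _inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * direction.to(hidden.dtype)
        return (steered, *output[1:]) if isinstance(output, tuple) else steered
    return layer.register_forward_hook(hook)


if __name__ == "__main__":
    # Toy stand-in for one transformer block; in practice this would be a
    # layer of the aligned LLM and the hidden states would come from real runs.
    hidden_size = 64
    block = nn.Linear(hidden_size, hidden_size)

    safe_hidden = torch.randn(32, hidden_size) + 1.0     # hypothetical cached states
    refused_hidden = torch.randn(32, hidden_size) - 1.0  # hypothetical cached states

    direction = steering_vector(safe_hidden, refused_hidden)
    handle = add_steering_hook(block, direction, alpha=2.0)

    x = torch.randn(4, hidden_size)
    steered_out = block(x)      # hook applies the steering shift
    handle.remove()
    baseline_out = block(x)     # same input, no intervention
    print("mean shift along steering direction:",
          ((steered_out - baseline_out) @ direction).mean().item())
```

In an actual RAG pipeline the hook would be attached to an intermediate layer of the generator model, and the shift would be applied only when the retrieved context is contaminated, so that legitimate refusals on genuinely harmful queries are left intact.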
arXiv.org Artificial Intelligence
Oct-14-2025