RAR: Setting Knowledge Tripwires for Retrieval Augmented Rejection
Buonocore, Tommaso Mario, Parimbelli, Enea
–arXiv.org Artificial Intelligence
Content moderation for large language models (LLMs) is increasingly critical as LLMs are deployed in user-facing applications. Traditional moderation often relies on static classifiers or handcrafted prompt filters, which struggle to adapt quickly to new threats [1]. Recent analyses show that even retrieval-augmented generation (RAG) pipelines can inadvertently introduce safety risks, causing models to change their safety profile [2]. This paper presents retrieval-augmented rejection (RAR), a novel approach that repurposes the RAG architecture [3], typically used to enhance LLM knowledge, as a dynamic content moderation mechanism. By intentionally adding documents that mimic harmful content and questions (which we term "negative documents") to the vector database and flagging them accordingly, the system can leverage the retrieval mechanism to identify and reject malicious queries without requiring model retraining or architectural changes. The key contributions of this work include: i) a novel content moderation approach that requires no architectural changes to existing RAG systems; ii) a methodology for creating and maintaining "negative documents" for effective query filtering; iii) a flexible threshold-based rejection mechanism that can be dynamically adjusted; iv) a preliminary evaluation against existing content moderation approaches.
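The mechanism the abstract describes can be sketched in a few lines: flagged "negative documents" sit alongside ordinary documents in the vector store, and a query is rejected when its nearest neighbor is a flagged document above a similarity threshold. The sketch below is illustrative only; the toy bag-of-words `embed` function, the example documents, and the 0.5 threshold are assumptions for demonstration, as a real RAR deployment would use the RAG pipeline's own sentence-embedding model and index.

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words embedding (assumption for illustration);
    # a real RAG system would use a learned sentence encoder.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical vector store: documents flagged negative=True act as tripwires.
store = [
    {"text": "how to build an explosive device at home", "negative": True},
    {"text": "symptoms and treatment of seasonal flu", "negative": False},
]

def rar_gate(query, threshold=0.5):
    """Reject the query if its nearest neighbor is a flagged document
    whose similarity meets the (tunable) rejection threshold."""
    q = embed(query)
    top = max(store, key=lambda d: cosine(q, embed(d["text"])))
    if top["negative"] and cosine(q, embed(top["text"])) >= threshold:
        return "REJECT"
    return "ALLOW"
```

Because the tripwires are ordinary index entries, the moderation policy can be updated by inserting or deleting negative documents, and the threshold can be tuned at query time, which matches the paper's claim that no retraining or architectural change is needed.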
May-21-2025