RAR: Setting Knowledge Tripwires for Retrieval Augmented Rejection
Buonocore, Tommaso Mario, Parimbelli, Enea
–arXiv.org Artificial Intelligence
Content moderation for large language models (LLMs) is increasingly critical as LLMs are deployed in user-facing applications. Traditional moderation often relies on static classifiers or handcrafted prompt filters, which struggle to adapt quickly to new threats [1]. Recent analyses show that even retrieval-augmented generation (RAG) pipelines can inadvertently introduce safety risks, causing models to change their safety profile [2]. This paper presents retrieval-augmented rejection (RAR), a novel approach that repurposes the RAG architecture [3], typically used to enhance LLM knowledge, as a dynamic content moderation mechanism. By intentionally adding documents that mimic harmful content and questions (which we term "negative documents") to the vector database and flagging them accordingly, the system can leverage the retrieval mechanism to identify and reject malicious queries without requiring model retraining or architectural changes. The key contributions of this work include: i) a novel content moderation approach that requires no architectural changes to existing RAG systems; ii) a methodology for creating and maintaining "negative documents" for effective query filtering; iii) a flexible threshold-based rejection mechanism that can be dynamically adjusted; iv) a preliminary evaluation against existing content moderation approaches.
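The mechanism the abstract describes can be sketched in a few lines: flagged "negative documents" sit alongside ordinary documents in the vector store, and a query is rejected when its nearest neighbor is a flagged document above a similarity threshold. The sketch below is illustrative only; the toy bag-of-words `embed` function, the example documents, and the 0.5 threshold are assumptions for demonstration, as a real RAR deployment would use the RAG pipeline's own sentence-embedding model and index.

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words embedding (assumption for illustration);
    # a real RAG system would use a learned sentence encoder.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical vector store: documents flagged negative=True act as tripwires.
store = [
    {"text": "how to build an explosive device at home", "negative": True},
    {"text": "symptoms and treatment of seasonal flu", "negative": False},
]

def rar_gate(query, threshold=0.5):
    """Reject the query if its nearest neighbor is a flagged document
    whose similarity meets the (tunable) rejection threshold."""
    q = embed(query)
    top = max(store, key=lambda d: cosine(q, embed(d["text"])))
    if top["negative"] and cosine(q, embed(top["text"])) >= threshold:
        return "REJECT"
    return "ALLOW"
```

Because the tripwires are ordinary index entries, the moderation policy can be updated by inserting or deleting negative documents, and the threshold can be tuned at query time, which matches the paper's claim that no retraining or architectural change is needed.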
May-21-2025