Disabling Self-Correction in Retrieval-Augmented Generation via Stealthy Retriever Poisoning
Yanbo Dai, Zhenlan Ji, Zongjie Li, Kuan Li, Shuai Wang
–arXiv.org Artificial Intelligence
--Retrieval-Augmented Generation (RAG) has become a standard approach for improving the reliability of large language models (LLMs). Prior work demonstrates the vulnerability of RAG systems by misleading them into generating attacker-chosen outputs through poisoning the knowledge base. However, this paper uncovers that such attacks can be mitigated by the strong self-correction ability (SCA) of modern LLMs, which can reject false context once properly configured. This SCA poses a significant challenge for attackers aiming to manipulate RAG systems. To address this challenge, we propose DisarmRAG, a new poisoning paradigm that compromises the retriever itself to suppress the SCA and enforce attacker-chosen outputs. This compromise enables the attacker to directly embed anti-SCA instructions into the context provided to the generator, thereby bypassing the SCA. To this end, we present a contrastive-learning-based model editing technique that performs localized and stealthy edits, ensuring the retriever returns a malicious instruction only for specific victim queries while preserving benign retrieval behavior. To further strengthen the attack, we design an iterative co-optimization framework that automatically discovers robust instructions capable of bypassing prompt-based defenses. We extensively evaluate DisarmRAG across six LLMs and three QA benchmarks. Our results show near-perfect retrieval of malicious instructions, which successfully suppress the SCA and achieve attack success rates exceeding 90% under diverse defensive prompts. Moreover, the edited retriever remains stealthy under several detection methods, highlighting the urgent need for retriever-centric defenses.

Modern large language models (LLMs) achieve remarkable performance across a wide range of tasks [32], [26], [38]. Despite their success, LLMs are also well known for hallucination, i.e., generating fabricated content [25]. Such unreliability limits their deployment in critical domains, including healthcare [69] and law [10]. Retrieval-augmented generation (RAG) [37], [29] has emerged as a promising paradigm to mitigate these limitations. By integrating external knowledge, RAG enables LLMs to generate more reliable responses. A key component of RAG is the retriever [27], which encodes both user queries and documents from an external knowledge base [72], [11]. The retriever identifies the documents most relevant to the input query; these retrieved documents are then combined with the query to guide the LLM in producing grounded responses (see the sketch below). Although RAG systems enhance LLMs with external knowledge, their deployment introduces new attack surfaces. Prior work [84], [81], [41], [6] demonstrates the effectiveness of misleading the system into giving attacker-chosen outputs by injecting malicious content into the knowledge base.
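A minimal sketch of the retrieval-then-generate flow described above, assuming a generic dual-encoder retriever. The names `embed`, `llm`, `retrieve`, and `rag_answer` are illustrative placeholders, not the paper's implementation.

```python
import numpy as np

# Illustrative dual-encoder retrieval step of a RAG pipeline.
# `embed` stands in for any text encoder returning a vector; `llm` is any
# callable that maps a prompt string to an answer string. All names are
# hypothetical placeholders, not from the paper.

def retrieve(query: str, docs: list[str], embed, k: int = 3) -> list[str]:
    """Return the k documents most similar to the query under cosine similarity."""
    q = embed(query)                           # (d,)
    D = np.stack([embed(d) for d in docs])     # (n, d)
    sims = D @ q / (np.linalg.norm(D, axis=1) * np.linalg.norm(q) + 1e-9)
    top = np.argsort(-sims)[:k]
    return [docs[i] for i in top]

def rag_answer(query: str, docs: list[str], embed, llm) -> str:
    """Ground the LLM's answer in the retrieved context."""
    context = "\n".join(retrieve(query, docs, embed))
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return llm(prompt)
```

Because the generator only ever sees what the retriever returns, an attacker who controls the retriever can place an anti-SCA instruction directly into `context`; this is the attack surface the abstract describes.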
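The abstract's contrastive-learning-based model edit could, in spirit, look like the sketch below: an attack term pulls victim queries toward the malicious instruction, while an InfoNCE term over benign pairs keeps benign retrieval intact (the "localized and stealthy" property). This is a hypothetical reconstruction; `encode`, the loss form, and all hyperparameters are assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch of a contrastive editing objective for the retriever.
# `encode` is assumed to be a differentiable query/document encoder
# (e.g. the retriever's embedding model with gradients enabled).

def edit_loss(encode, victim_queries, malicious_doc,
              benign_queries, benign_docs, tau: float = 0.05):
    q_vic = encode(victim_queries)            # (V, d) victim query embeddings
    d_mal = encode([malicious_doc])           # (1, d) malicious instruction
    q_ben = encode(benign_queries)            # (B, d) held-out benign queries
    d_ben = encode(benign_docs)               # (B, d) their matching documents

    # Attack term: raise similarity of victim queries to the malicious doc,
    # so it is retrieved for exactly those queries.
    attack = -F.cosine_similarity(q_vic, d_mal).mean()

    # Locality term: InfoNCE over benign query-document pairs preserves
    # normal retrieval behavior, keeping the edit localized and stealthy.
    logits = F.normalize(q_ben, dim=1) @ F.normalize(d_ben, dim=1).T / tau
    labels = torch.arange(q_ben.size(0))
    preserve = F.cross_entropy(logits, labels)

    return attack + preserve
```

Minimizing such a loss over a small set of edit pairs is one way a localized retriever edit could be realized; the paper's actual editing procedure and its co-optimization of instruction text may differ.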
Aug-28-2025