DACL-RAG: Data Augmentation Strategy with Curriculum Learning for Retrieval-Augmented Generation

Shaohan Wang, Licheng Zhang, Zheren Fu, Zhendong Mao, Yongdong Zhang

arXiv.org Artificial Intelligence 

Abstract--Retrieval-Augmented Generation (RAG) is an effective method to enhance the capabilities of large language models (LLMs). Existing methods typically optimize the retriever or the generator in a RAG system by directly using the top-k retrieved documents. However, two key issues inherent in the training data constrain the effectiveness of this training paradigm: (1) across different queries, the top-k retrieved documents vary greatly in content quality, with some providing valuable knowledge while others lack critical information or are even misleading, and training on such data in a purely random manner may impair the generator's ability to extract key information; (2) for a given query, the limited set of k documents often exhibits low discriminability, and training solely on them makes it difficult for the retriever to learn how to distinguish between relevant and irrelevant documents. To address these issues, we introduce DACL-RAG, a multi-stage RAG training framework that combines a multi-level Data Augmentation strategy with a multi-stage Curriculum Learning paradigm. The data augmentation strategy constructs comprehensive and diverse training sets with controllable difficulty levels through sample evolution, while the curriculum learning paradigm organizes them into progressive stages for training, ensuring stable and consistent improvements, thereby optimizing the overall performance and generalization of the RAG system more effectively. Our DACL-RAG framework demonstrates consistent effectiveness across four open-domain QA datasets, achieving performance gains of 2% to 4% over multiple advanced methods.

Large language models (LLMs) have demonstrated remarkable capabilities in a wide range of Natural Language Processing (NLP) tasks [1]-[3], but they are still constrained by the limitations of the static knowledge embedded within their internal parameters [4]-[6].
Retrieval-Augmented Generation (RAG) addresses this limitation by supplementing LLMs with additional knowledge retrieved from external knowledge bases, and has significantly enhanced the capabilities of existing large models in tasks such as open-domain question answering [7]-[17] and dialogue systems [18]-[20]. The overall performance of a RAG system depends crucially on the quality of the retrieved documents and the LLMs' ability to use them effectively.

Shaohan Wang, Licheng Zhang and Zheren Fu are with the School of Information Science and Technology, University of Science and Technology of China, Hefei, Anhui 230022, China (e-mail: wsh2000@mail.ustc.edu.cn).

Zhendong Mao and Yongdong Zhang are with the School of Information Science and Technology, University of Science and Technology of China, and the Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei, Anhui 230022, China (e-mail: zhyd73@ustc.edu.cn).

[Figure legend: green denotes documents that support the model's responses, while red denotes documents that are useless or even harmful.]
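
The abstract describes training sets with controllable difficulty levels that are organized into progressive stages. A minimal sketch of that staging step, under stated assumptions, might look like the following. The `Sample` class, the integer `difficulty` field, and the `build_curriculum` helper are hypothetical names introduced here for illustration; this section does not specify how DACL-RAG actually assigns difficulty or how many stages it uses.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    query: str
    documents: list[str]
    difficulty: int  # hypothetical level: 0 = easiest, larger = harder


def build_curriculum(samples: list[Sample]) -> list[list[Sample]]:
    """Group samples into stages ordered from easiest to hardest."""
    levels = sorted({s.difficulty for s in samples})
    return [[s for s in samples if s.difficulty == level] for level in levels]


# Toy data: the difficulty labels are assigned by hand (an assumption).
samples = [
    Sample("q1", ["supporting passage"], difficulty=0),
    Sample("q2", ["supporting passage", "distractor"], difficulty=1),
    Sample("q3", ["distractor", "misleading passage"], difficulty=2),
    Sample("q4", ["supporting passage"], difficulty=0),
]

stages = build_curriculum(samples)
for stage_index, stage in enumerate(stages):
    # A real pipeline would run one training phase per stage here.
    print(stage_index, [s.query for s in stage])
```

In this toy setup, samples whose retrieved documents all support the answer form the first stage, and samples containing distractors or misleading passages are deferred to later stages, in contrast to the purely random ordering that the abstract argues against.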