Data Quality Challenges in Retrieval-Augmented Generation
Müller, Leopold, Holstein, Joshua, Bause, Sarah, Satzger, Gerhard, Kühl, Niklas
arXiv.org Artificial Intelligence
Organizations increasingly adopt Retrieval-Augmented Generation (RAG) to enhance Large Language Models with enterprise-specific knowledge. However, current data quality (DQ) frameworks were developed primarily for static datasets and inadequately address the dynamic, multi-stage nature of RAG systems. This study aims to develop DQ dimensions for this new type of AI-based system. We conduct 16 semi-structured interviews with practitioners at leading IT service companies. Through qualitative content analysis, we inductively derive 15 distinct DQ dimensions across the four processing stages of RAG systems: data extraction, data transformation, prompt & search, and generation. Our findings reveal that (1) new dimensions must be added to traditional DQ frameworks to cover RAG contexts; (2) these new dimensions are concentrated in the early RAG steps, suggesting the need for front-loaded quality management strategies; and (3) DQ issues transform and propagate through the RAG pipeline, necessitating a dynamic, step-aware approach to quality management.
Oct-2-2025
- Country:
- Europe (0.94)
- North America > United States > Tennessee (0.17)
- Genre:
- Research Report > New Finding (1.00)
- Personal > Interview (0.88)
- Industry:
- Information Technology > Services (0.86)
- Information Technology > Security & Privacy (0.68)
- Technology: