Data Quality Challenges in Retrieval-Augmented Generation
Müller, Leopold, Holstein, Joshua, Bause, Sarah, Satzger, Gerhard, Kühl, Niklas
arXiv.org Artificial Intelligence
Organizations increasingly adopt Retrieval-Augmented Generation (RAG) to enhance Large Language Models with enterprise-specific knowledge. However, current data quality (DQ) frameworks were developed primarily for static datasets and inadequately address the dynamic, multi-stage nature of RAG systems. This study aims to develop DQ dimensions for this new type of AI-based system. We conduct 16 semi-structured interviews with practitioners at leading IT service companies. Through qualitative content analysis, we inductively derive 15 distinct DQ dimensions across the four processing stages of RAG systems: data extraction, data transformation, prompt & search, and generation. Our findings reveal that (1) new dimensions must be added to traditional DQ frameworks to cover RAG contexts; (2) these new dimensions are concentrated in the early RAG steps, suggesting the need for front-loaded quality management strategies; and (3) DQ issues transform and propagate through the RAG pipeline, necessitating a dynamic, step-aware approach to quality management.
Oct-2-2025
- Country:
  - Europe
    - Germany
      - Baden-Württemberg > Karlsruhe Region
        - Karlsruhe (0.05)
      - Bavaria > Upper Franconia
        - Bayreuth (0.05)
    - Sweden
      - Stockholm > Stockholm (0.04)
      - Uppsala County > Uppsala (0.04)
  - North America > United States
    - Hawaii (0.04)
    - Massachusetts > Middlesex County
      - Cambridge (0.04)
    - Tennessee > Davidson County
      - Nashville (0.06)
- Genre:
  - Personal > Interview (0.88)
  - Research Report > New Finding (1.00)
- Industry:
  - Information Technology
    - Security & Privacy (0.68)
    - Services (0.86)
- Technology: