RAG-Driven Data Quality Governance for Enterprise ERP Systems
Vedat, Sedat Bin, Yarkan, Enes Kutay, Akarsu, Meftun, Karaman, Recep Kaan, Sar, Arda, Çelikbilek, Çağrı, Saygılı, Savaş
arXiv.org Artificial Intelligence
Abstract--Enterprise ERP systems managing hundreds of thousands of employee records face critical data quality challenges when human resources departments perform decentralized manual entry across multiple languages. We present an end-to-end pipeline combining automated data cleaning with LLM-driven SQL query generation, deployed on a production system managing 240,000 employee records over six months. The system operates in two integrated stages: a multistage cleaning pipeline that performs translation normalization, spelling correction, and entity deduplication during periodic synchronization from Microsoft SQL Server to PostgreSQL; and a retrieval-augmented generation framework powered by GPT-4o that translates natural-language questions in Turkish, Russian, and English into validated SQL queries. The query engine employs LangChain orchestration, FAISS vector similarity search, and few-shot learning with 500+ validated examples. Our evaluation demonstrates 92.5% query validity, 95.1% schema compliance, and 90.7% semantic accuracy on 2,847 production queries. The system reduces query turnaround time from 2.3 days to under 5 seconds while maintaining 99.2% uptime, with GPT-4o achieving 46% lower latency and 68% cost reduction versus GPT-3.5. This modular architecture provides a reproducible framework for AI-native enterprise data governance, demonstrating real-world viability at enterprise scale with 4.3/5.0
I. Introduction
When an HR analyst at a multinational construction company needs to answer "How many civil engineers are working on the GPP project in Moscow?", the seemingly simple question becomes a multi-day ordeal.
The analyst must contact the IT department, explain the request, and wait while IT staff navigate inconsistent data: "Moscow" appears as "Moskva," "Moscow," and "Москва" (in Cyrillic script); project codes are stored as "GPP," "Gpp," and "gpp project" and must be reconciled by hand; and payroll employees must be separated from contractors using undocumented business rules. Two days later, the answer arrives, potentially outdated.
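The kind of reconciliation described above can be illustrated with a minimal Python sketch. It is not the paper's cleaning pipeline: the alias table, the function names, and the rule that strips a trailing "project" are all assumptions made for illustration, and the Cyrillic spelling "Москва" is assumed to be the variant the text refers to.

```python
import re
import unicodedata

# Hypothetical alias table built from the variants named in the text.
CITY_ALIASES = {
    "moskva": "Moscow",
    "moscow": "Moscow",
    "москва": "Moscow",  # assumed Cyrillic spelling
}

# Assumed rule: a trailing " project" is noise around the project code.
PROJECT_SUFFIX = re.compile(r"\s+project$")


def _key(value: str) -> str:
    """Unicode-normalize, trim, and case-fold a raw field value."""
    return unicodedata.normalize("NFKC", value).strip().casefold()


def normalize_city(value: str) -> str:
    """Map a known spelling variant to one canonical city name."""
    return CITY_ALIASES.get(_key(value), value.strip())


def normalize_project_code(value: str) -> str:
    """Reduce variants such as 'Gpp' or 'gpp project' to 'GPP'."""
    return PROJECT_SUFFIX.sub("", _key(value)).upper()


# Illustrative records, not data from the production system.
records = [
    {"city": "Moskva", "project": "GPP"},
    {"city": "Москва", "project": "gpp project"},
    {"city": "Lisbon", "project": "Gpp"},
]

# Count records on the GPP project in Moscow after normalization.
moscow_gpp = sum(
    1
    for r in records
    if normalize_city(r["city"]) == "Moscow"
    and normalize_project_code(r["project"]) == "GPP"
)
```

A lookup table only handles variants that are already known. The spelling correction and entity deduplication stages mentioned in the abstract would be needed for unseen misspellings.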
Nov-24-2025
- Country:
  - Asia > Middle East
    - Republic of Türkiye
      - Karaman Province > Karaman (0.04)
      - Konya Province > Konya (0.04)
  - Europe
    - Portugal > Lisbon
      - Lisbon (0.04)
    - Russia > Central Federal District
      - Moscow Oblast > Moscow (0.65)
    - United Kingdom > North Sea
      - Southern North Sea (0.04)
  - North America > United States (0.04)
- Genre:
  - Research Report (0.82)
- Industry:
  - Information Technology > Security & Privacy (1.00)
  - Law (1.00)
- Technology: