Diagnosing and Addressing Pitfalls in KG-RAG Datasets: Toward More Reliable Benchmarking
–Neural Information Processing Systems
Knowledge Graph Question Answering (KGQA) systems rely on high-quality benchmarks to evaluate complex multi-hop reasoning. However, despite their widespread use, popular datasets such as WebQSP and CWQ suffer from critical quality issues, including inaccurate or incomplete ground-truth annotations, poorly constructed questions that are ambiguous, trivial, or unanswerable, and outdated or inconsistent knowledge. Through a manual audit of 16 popular KGQA datasets, including WebQSP and CWQ, we find that the average factual correctness rate is only 57%. To address these issues, we introduce KGQAGen, an LLM-in-the-loop framework that systematically resolves these pitfalls. KGQAGen combines structured knowledge grounding, LLM-guided generation, and symbolic verification to produce challenging and verifiable QA instances. Using KGQAGen, we construct KGQAGen-10k, a 10K-scale benchmark grounded in Wikidata, and evaluate a diverse set of KG-RAG models. Experimental results demonstrate that even state-of-the-art systems struggle on this benchmark, highlighting its ability to expose limitations of existing models. Our findings advocate for more rigorous benchmark construction and position KGQAGen as a scalable framework for advancing KGQA evaluation.
Jun-17-2026, 18:58:11 GMT
- Country:
- Europe (1.00)
- Asia (1.00)
- North America > United States
- Alabama (0.28)
- Genre:
- Research Report > New Finding (1.00)
- Personal (1.00)
- Industry:
- Leisure & Entertainment > Sports (1.00)
- Government > Regional Government (0.92)
- Media (0.92)
- Technology: