Making Dialogue Grounding Data Rich: A Three-Tier Data Synthesis Framework for Generalized Referring Expression Comprehension
Shao, Juexi, Li, Siyou, Gan, Yujian, Madge, Chris, Karan, Vanja, Poesio, Massimo
arXiv.org Artificial Intelligence
ABSTRACT
Dialogue-based Generalized Referring Expression Comprehension (GREC) requires models to ground expressions that may refer to an unrestricted number of targets in complex visual scenes, while resolving coreference across a long dialogue context. However, existing systems struggle under distribution shift between training and evaluation domains, a gap exacerbated by the scarcity of annotated dialogue grounding data. We address this challenge with a three-tier data-synthesis method that balances realism and controllability to produce scalable supervision for dialogue-conditioned grounding. Fine-tuning on the synthesized data yields consistent, substantial improvements over prior approaches across standard evaluation metrics.
Index Terms: Visual Grounding, Referring Expression Comprehension, Generalized Referring Expression Comprehension, Coreference, Data Synthesis
1. INTRODUCTION
Referring Expression Comprehension (REC) is the task of locating a target referred to by a natural language description.
Dec-3-2025