Grounding Language in Multi-Perspective Referential Communication
Zineng Tang, Lingjun Mao, Alane Suhr
We introduce a task and dataset for referring expression generation and comprehension in multi-agent embodied environments. In this task, two agents in a shared scene must take into account one another's visual perspective, which may be different from their own, to both produce and understand references to objects in a scene and the spatial relations between them. We collect a dataset of 2,970 human-written referring expressions, each paired with human comprehension judgments, and evaluate the performance of automated models as speakers and listeners paired with human partners, finding that model performance in both reference generation and comprehension lags behind that of pairs of human agents. Finally, we experiment with training an open-weight speaker model with evidence of communicative success when paired with a listener, resulting in an improvement from 58.9 to 69.3% in communicative success and even outperforming the strongest proprietary model.

Figure 1: Example scene from our environment and dataset. The center image shows the speaker on the left and the listener on the right with their respective fields of view (FOV). The speaker refers to the target object,
arXiv.org Artificial Intelligence
Oct-4-2024
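A minimal sketch of how training a speaker on evidence of communicative success might be set up, assuming a simple generate-then-verify loop: the speaker produces a referring expression, a listener picks a referent, and only episodes where the listener selects the target are kept as fine-tuning data. The Scene/Episode types and the speaker/listener interfaces below are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch: filter speaker outputs by communicative success.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Scene:
    scene_id: str
    candidate_objects: List[str]  # objects visible in the shared scene
    target: str                   # object the speaker should refer to

@dataclass
class Episode:
    scene_id: str
    expression: str
    success: bool  # did the listener pick the target?

def collect_episodes(
    scenes: List[Scene],
    speaker_generate: Callable[[Scene], str],      # assumed speaker interface
    listener_choose: Callable[[Scene, str], str],  # assumed listener interface
) -> List[Episode]:
    episodes = []
    for scene in scenes:
        expression = speaker_generate(scene)
        choice = listener_choose(scene, expression)
        episodes.append(Episode(scene.scene_id, expression, choice == scene.target))
    return episodes

def communicative_success_rate(episodes: List[Episode]) -> float:
    return sum(e.success for e in episodes) / max(len(episodes), 1)

if __name__ == "__main__":
    # Toy stand-ins for the speaker and listener, just to exercise the loop.
    scenes = [Scene("s0", ["red mug", "blue mug"], "red mug")]
    episodes = collect_episodes(
        scenes,
        speaker_generate=lambda s: f"the {s.target} closest to you",
        listener_choose=lambda s, expr: next(o for o in s.candidate_objects if o in expr),
    )
    # Successful episodes would form the fine-tuning set for the speaker.
    kept = [e for e in episodes if e.success]
    print(f"success rate: {communicative_success_rate(episodes):.1%}, kept {len(kept)}")
```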