Grounding Language in Multi-Perspective Referential Communication
Zineng Tang, Lingjun Mao, Alane Suhr
We introduce a task and dataset for referring expression generation and comprehension in multi-agent embodied environments. In this task, two agents in a shared scene must take into account one another's visual perspective, which may be different from their own, to both produce and understand references to objects in a scene and the spatial relations between them. We collect a dataset of 2,970 human-written referring expressions, each paired with human comprehension judgments, and evaluate the performance of automated models as speakers and listeners paired with human partners, finding that model performance in both reference generation and comprehension lags behind that of pairs of human agents. Finally, we experiment with training an open-weight speaker model with evidence of communicative success when paired with a listener, resulting in an improvement from 58.9 to 69.3% in communicative success and even outperforming the strongest proprietary model.

Figure 1: Example scene from our environment and dataset. The center image shows the speaker on the left and the listener on the right with their respective fields of view (FOV). The speaker refers to the target object,
arXiv.org Artificial Intelligence
Oct-4-2024
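A minimal sketch of how training a speaker on evidence of communicative success might be set up, assuming a simple generate-then-verify loop: the speaker produces a referring expression, a listener picks a referent, and only episodes where the listener selects the target are kept as fine-tuning data. The Scene/Episode types and the speaker/listener interfaces below are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch: filter speaker outputs by communicative success.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Scene:
    scene_id: str
    candidate_objects: List[str]  # objects visible in the shared scene
    target: str                   # object the speaker should refer to

@dataclass
class Episode:
    scene_id: str
    expression: str
    success: bool  # did the listener pick the target?

def collect_episodes(
    scenes: List[Scene],
    speaker_generate: Callable[[Scene], str],      # assumed speaker interface
    listener_choose: Callable[[Scene, str], str],  # assumed listener interface
) -> List[Episode]:
    episodes = []
    for scene in scenes:
        expression = speaker_generate(scene)
        choice = listener_choose(scene, expression)
        episodes.append(Episode(scene.scene_id, expression, choice == scene.target))
    return episodes

def communicative_success_rate(episodes: List[Episode]) -> float:
    return sum(e.success for e in episodes) / max(len(episodes), 1)

if __name__ == "__main__":
    # Toy stand-ins for the speaker and listener, just to exercise the loop.
    scenes = [Scene("s0", ["red mug", "blue mug"], "red mug")]
    episodes = collect_episodes(
        scenes,
        speaker_generate=lambda s: f"the {s.target} closest to you",
        listener_choose=lambda s, expr: next(o for o in s.candidate_objects if o in expr),
    )
    # Successful episodes would form the fine-tuning set for the speaker.
    kept = [e for e in episodes if e.success]
    print(f"success rate: {communicative_success_rate(episodes):.1%}, kept {len(kept)}")
```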