Comparative Multi-View Language Grounding

Chancharik Mitra, Abrar Anwar, Rodolfo Corona, Dan Klein, Trevor Darrell, Jesse Thomason

Nov-13-2023, arXiv.org Artificial Intelligence

In this work, we consider the task of resolving object referents given a comparative language description. We present a Multi-view Approach to Grounding in Context (MAGiC) that leverages transformers to reason pragmatically over both candidate objects, given multiple image views of each and a language description. In contrast to past efforts that attempt to connect vision and language for this task without fully considering the resulting referential context, MAGiC exploits comparative information by jointly reasoning over multiple views of both object referent candidates and the referring language expression. We present an analysis demonstrating that comparative reasoning contributes to state-of-the-art (SOTA) performance on the SNARE object reference task.

distractor, information, representation

Country:
- North America > United States
  - Illinois > Cook County
    - Chicago (0.04)
  - California > Alameda County
    - Berkeley (0.04)
- Europe > United Kingdom
  - England > Oxfordshire > Oxford (0.04)
- Asia > Japan
  - Honshū > Chūbu > Ishikawa Prefecture > Kanazawa (0.04)

Genre:
- Research Report > New Finding (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Vision (1.00)
  - Natural Language (1.00)
  - Robots (0.94)
  - Representation & Reasoning > Object-Oriented Architecture (0.68)
  - Machine Learning > Neural Networks
    - Deep Learning (0.46)