Learning to reason over visual objects

Mondal, Shanka Subhra, Webb, Taylor, Cohen, Jonathan D.

arXiv.org Artificial Intelligence 

Despite the centrality of objects in visual reasoning, previous works have so far not explored the use of object-centric representations in abstract visual reasoning tasks such as RAVEN and PGM, or at best have employed an imprecise approximation to object representations based on spatial location. Recently, a number of methods have been proposed for the extraction of precise object-centric representations directly from pixel-level inputs, without the need for veridical segmentation data (Greff et al., 2019; Burgess et al., 2019; Locatello et al., 2020; Engelcke et al., 2021). While these methods have been shown to improve performance in some visual reasoning tasks, including question answering from video (Ding et al., 2021) and prediction of physical interactions from video Wu et al. (2022), previous work has not addressed whether this approach is useful in the domain of abstract visual reasoning (i.e., visual analogy). To address this, we developed a model that combines an object-centric encoding method, slot attention (Locatello et al., 2020), with a generic transformer-based reasoning module (Vaswani et al., 2017). The combined system, termed the Slot Transformer Scoring Network (STSN, Figure 1) achieves state-of-the-art performance on both PGM and I-RAVEN (a more challenging variant of RAVEN), despite its general-purpose architecture, and lack of task-specific augmentations. Furthermore, we developed a novel benchmark, the CLEVR-Matrices (Figure 2), using a similar RPM-like problem structure, but with greater visual complexity, and found that STSN also achieves state-of-the-art performance on this task. These results suggest that object-centric encoding is an essential component for achieving strong abstract visual reasoning, and indeed may be even more important than some task-specific inductive biases.

Duplicate Docs Excel Report

Title
None found

Similar Docs  Excel Report  more

TitleSimilaritySource
None found