Supplementary Material for Grammar-Based Grounded Lexicon Learning

Neural Information Processing Systems

In the supplementary material, we describe the domain-specific languages used in our experiments (Section 1), demonstrate how the proposed CKY-E2 method works through a concrete example (Section 2.1), show formal properties of CKY-E2 (Section 2.2), present dataset setups and analyze model behaviors (Section 3), and list environmental details for experiments (Section ??). In this section, we present and discuss the domain-specific languages (DSLs) we use for two domains: visual reasoning and language-guided navigation. We also introduce the neuro-symbolic module we have designed for executing programs in these two domains. Each DSL contains a set of types and a set of deterministic modules that have been manually designed to realize the necessary operations in these domains. However, rather than implementing them as in standard programming languages (with for-loops and if-conditions), we use tensor operations (e.g., tensor additions and multiplications) so that the output of each program is differentiable with respect to all of its inputs. We refer readers to the original papers for a detailed introduction to the DSLs and neuro-symbolic program execution. Here we only highlight the key aspects of our language and its neuro-symbolic realization, and discuss the differences between our implementation and those in the original papers. Our visual reasoning DSL is a subset of the CLEVR DSL, containing 6 types and 8 primitive operations. Table 1 illustrates all 6 types and how they are internally represented in neuro-symbolic execution. Table 2 further shows all operations in the DSL. There are two main differences between the DSL used by G2L2 and the original CLEVR DSL.
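The idea of realizing DSL operations with tensor operations instead of loops and conditionals can be illustrated with a minimal sketch. It assumes that an object set is represented as a vector of per-object membership scores in [0, 1]. The function names `filter_set`, `exist`, and `count` and the example scores are hypothetical and are not taken from Tables 1 or 2.

```python
import numpy as np

# Assumption: an object set is a vector of soft membership scores in [0, 1],
# one entry per object in the scene. All names below are illustrative only.

def filter_set(objectset: np.ndarray, concept_prob: np.ndarray) -> np.ndarray:
    """Soft filter: an object stays in the set if it was in the set AND matches the concept."""
    return objectset * concept_prob  # elementwise product instead of an if-condition

def exist(objectset: np.ndarray) -> float:
    """Soft existence: score of the most likely member of the set."""
    return float(objectset.max())

def count(objectset: np.ndarray) -> float:
    """Soft count: expected number of members, computed without a for-loop."""
    return float(objectset.sum())

# A scene with three objects; every object is initially in the set.
scene = np.ones(3)
is_red = np.array([0.9, 0.2, 0.8])   # hypothetical classifier outputs
is_cube = np.array([1.0, 1.0, 0.0])  # hypothetical classifier outputs

red_cubes = filter_set(filter_set(scene, is_red), is_cube)
n_red_cubes = count(red_cubes)
any_red_cube = exist(red_cubes)
```

Because every step is a product, sum, or max, the same functions written in an automatic-differentiation framework would let gradients flow from the program output back to the concept scores.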




A Basic Functions

Neural Information Processing Systems

Each question in PTR is associated with a functional program built from a set of basic functions. A.1 Data Types. Our basic functional building blocks operate on values of the following types: Object: a single object in the scene. Part-level functions are listed in Table 4. B have certain spatial relationships. For NS-VQA, we first use Mask-RCNN to propose segmentations for objects and parts. If an object is unstable, possible changes (to_left, to_right, to_front, to_behind) are predicted.


CRAFT: A Benchmark for Causal Reasoning About Forces and inTeractions

arXiv.org Artificial Intelligence

Recent advances in Artificial Intelligence and deep learning have revived interest in studying the gap between the reasoning capabilities of humans and machines. In this ongoing work, we introduce CRAFT, a new visual question answering dataset that requires causal reasoning about physical forces and object interactions. It contains 38K video and question pairs that are generated from 3K videos from 10 different virtual environments, each containing a different number of objects in motion that interact with each other. Two question categories in CRAFT cover previously studied descriptive and counterfactual questions. In addition, inspired by the theory of force dynamics from the field of human cognitive psychology, we introduce new question categories that involve understanding the intentions of objects through the notions of cause, enable, and prevent. Our preliminary results demonstrate that even though these tasks are very intuitive for humans, the implemented baselines could not cope with the underlying challenges.


The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision

arXiv.org Artificial Intelligence

We propose the Neuro-Symbolic Concept Learner (NS-CL), a model that learns visual concepts, words, and semantic parsing of sentences without explicit supervision on any of them; instead, our model learns by simply looking at images and reading paired questions and answers. Our model builds an object-based scene representation and translates sentences into executable, symbolic programs. To bridge the learning of the two modules, we use a neuro-symbolic reasoning module that executes these programs on the latent scene representation. Analogous to human concept learning, the perception module learns visual concepts based on the language description of the object being referred to. Meanwhile, the learned visual concepts facilitate learning new words and parsing new sentences. We use curriculum learning to guide the search over the large compositional space of images and language. Extensive experiments demonstrate the accuracy and efficiency of our model in learning visual concepts, word representations, and semantic parsing of sentences. Further, our method allows easy generalization to new object attributes, compositions, language concepts, scenes and questions, and even new program domains. It also empowers applications including visual question answering and bidirectional image-text retrieval.
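To make the phrase "executable, symbolic programs" concrete, the following is a simplified sketch of running a parsed question as a sequence of operations over an object-based scene. It is not NS-CL's implementation: the scene here is a list of discrete attribute dictionaries rather than a latent representation, and `run_program`, the operation names, and the example scene are hypothetical.

```python
from typing import Dict, List, Optional, Tuple, Union

# Assumption: each object is a dict of discrete attributes. This stands in
# for the latent, learned scene representation described in the abstract.
Scene = List[Dict[str, str]]
Step = Tuple[str, Optional[str]]

def run_program(program: List[Step], scene: Scene) -> Union[List[int], int, bool]:
    """Execute a linear program of (operation, argument) steps on a scene."""
    value: Union[List[int], int, bool] = list(range(len(scene)))  # all object indices
    for op, arg in program:
        if op == "filter":
            # Keep the objects that have the requested attribute value.
            value = [i for i in value if arg in scene[i].values()]
        elif op == "count":
            value = len(value)
        elif op == "exist":
            value = len(value) > 0
        else:
            raise ValueError(f"unknown operation: {op}")
    return value

scene = [
    {"color": "red", "shape": "cube"},
    {"color": "blue", "shape": "cube"},
    {"color": "red", "shape": "sphere"},
]

# "How many red cubes are there?"
how_many_red_cubes = run_program(
    [("filter", "red"), ("filter", "cube"), ("count", None)], scene
)

# "Is there a blue sphere?"
is_there_blue_sphere = run_program(
    [("filter", "blue"), ("filter", "sphere"), ("exist", None)], scene
)
```

In this toy version the attribute lookup is exact. In a learned system, that lookup would be replaced by a comparison between object features and concept representations.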


A Dataset and Architecture for Visual Reasoning with a Working Memory

arXiv.org Artificial Intelligence

A vexing problem in artificial intelligence is reasoning about events that occur in complex, changing visual stimuli such as in video analysis or game play. Inspired by a rich tradition of visual reasoning and memory in cognitive psychology and neuroscience, we developed an artificial, configurable visual question and answer dataset (COG) to parallel experiments in humans and animals. COG is much simpler than the general problem of video analysis, yet it addresses many of the problems relating to visual and logical reasoning and memory -- problems that remain challenging for modern deep learning architectures. We additionally propose a deep learning architecture that performs competitively on other diagnostic VQA datasets (e.g., CLEVR) as well as on easy settings of the COG dataset. However, several settings of COG result in datasets that are progressively more challenging to learn. After training, the network can zero-shot generalize to many new tasks. Preliminary analyses of the network architectures trained on COG demonstrate that the network accomplishes the task in a manner interpretable to humans.