Supplemental: A Benchmark for Compositional Text-to-image Retrieval
GQA. GQA provides annotations of objects and attributes in images, which we use to construct compositional queries such as "square white plate". We train on the GQA train split, with the held-out test queries and their corresponding images removed. This leaves around 67K training images and 27K queries.

CLEVR. On CLEVR, we test on 96 classes over 22,500 images.
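The query construction described above could be sketched as follows. The annotation schema and the `build_queries` helper are illustrative assumptions, not the actual GQA format or the benchmark's pipeline.

```python
# Hypothetical sketch: building compositional queries ("square white plate")
# from GQA-style object/attribute annotations. The dict schema below is an
# assumption for illustration, not the real GQA annotation format.
from itertools import combinations

def build_queries(annotations, max_attrs=2):
    """annotations: list of dicts like
    {"name": "plate", "attributes": ["square", "white"]}."""
    queries = set()
    for obj in annotations:
        attrs = obj["attributes"]
        # combine up to max_attrs attributes with the object name
        for k in range(1, min(max_attrs, len(attrs)) + 1):
            for combo in combinations(attrs, k):
                queries.add(" ".join(combo) + " " + obj["name"])
    return sorted(queries)

anns = [{"name": "plate", "attributes": ["square", "white"]}]
print(build_queries(anns))
# → ['square plate', 'square white plate', 'white plate']
```

Each object thus yields one query per attribute subset, which is one simple way to obtain compositional queries of varying specificity from the same annotation.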
Reviews: Visual Concept-Metaconcept Learning
Overall, this is a really interesting idea: incorporating concrete visual concepts and more abstract metaconcepts in a joint space, and using the learning of one to guide the other. There are some issues below, mostly details of the training implementation, that I would like clarified. 1. Why not use pretrained word embeddings for the GRU model? The issue here is that the object proposal generator was trained on ImageNet, meaning it almost certainly had access to visual information about the held-out concepts in Ctest. The GRU baseline, even with significantly less training data, outperforms on instance-of.
Declarative Knowledge Distillation from Large Language Models for Visual Question Answering Datasets
Eiter, Thomas, Hadl, Jan, Higuera, Nelson, Oetsch, Johannes
Visual Question Answering (VQA) is the task of answering a question about an image; it requires processing multimodal input and reasoning to obtain the answer. Modular solutions that use declarative representations within the reasoning component have a clear advantage over end-to-end trained systems regarding interpretability. The downside is that crafting the rules for such a component can be an additional burden on the developer. We address this challenge by presenting an approach for declarative knowledge distillation from Large Language Models (LLMs). Our method is to prompt an LLM to extend an initial theory on VQA reasoning, given as an answer-set program, to meet the requirements of the VQA task. Examples from the VQA dataset are used to guide the LLM, validate the results, and mend incorrect rules using feedback from the ASP solver. We demonstrate that our approach works on the prominent CLEVR and GQA datasets. Our results confirm that distilling knowledge from LLMs is in fact a promising direction alongside data-driven rule learning approaches.
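The validate-and-mend loop sketched in the abstract might look like the following. Here `ask_llm` and `run_asp_solver` are hypothetical stand-ins (injected as callables) for the LLM call and an ASP solver such as clingo; the example format is also an assumption, not the paper's actual interface.

```python
def distill_rules(base_program, examples, ask_llm, run_asp_solver, max_rounds=3):
    """Iteratively validate an answer-set program on VQA examples and let the
    LLM mend it until every example passes or the round budget runs out."""
    program = base_program
    for _ in range(max_rounds):
        failures = [ex for ex in examples
                    if run_asp_solver(program, ex["scene"]) != ex["answer"]]
        if not failures:
            return program  # every example validated
        feedback = f"{len(failures)} failing example(s), e.g. {failures[0]}"
        program = ask_llm(program, feedback)  # LLM extends/mends the rules
    return program

# Toy stand-ins: the "solver" just checks for a rule, the "LLM" appends it.
toy_solver = lambda prog, scene: "yes" if "count_rule" in prog else "no"
toy_llm = lambda prog, feedback: prog + "\ncount_rule."
examples = [{"scene": {}, "answer": "yes"}]
print(distill_rules("base.", examples, toy_llm, toy_solver))
```

Passing the solver and LLM in as callables keeps the loop itself independent of any particular ASP system or LLM API.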
Reviews: A simple neural network module for relational reasoning
The paper proposes a plug-and-play module, called Relation Networks (RNs), specialized for relational reasoning. The module is composed of multilayer perceptrons and considers relations between all pairs of objects. When plugged into traditional networks, the proposed module achieves state-of-the-art performance on the CLEVR visual question answering dataset, state-of-the-art results (with joint training across all tasks) on the bAbI textual question answering dataset, and high performance (93% on one task and 95% on another) on a newly collected dataset of simulated physical mass-spring systems. The paper also collects a dataset similar to CLEVR to demonstrate the effectiveness of RNs on relational questions.
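The pairwise construction the review describes, relations computed over all object pairs and aggregated before a final network, can be sketched with NumPy. The toy `g` and `f` below are illustrative stand-ins, not the paper's MLP architecture.

```python
# Minimal sketch of a Relation Network-style aggregation:
# sum g(o_i, o_j) over all object pairs, then apply f.
# g and f stand in for the paper's MLPs (an assumption for illustration).
import numpy as np

def relation_network(objects, g, f):
    """objects: (n, d) array; g maps a concatenated pair (2d,) to a
    relation vector; f maps the summed relation vector to the output."""
    n = objects.shape[0]
    pair_sum = sum(g(np.concatenate([objects[i], objects[j]]))
                   for i in range(n) for j in range(n))
    return f(pair_sum)

# toy usage: identity "relation MLP" and a summing "output MLP"
objs = np.ones((3, 2))
g = lambda pair: pair       # (4,) relation vector per pair
f = lambda v: v.sum()
print(relation_network(objs, g, f))  # 9 pairs × sum of 4 ones → 36.0
```

Because the sum over pairs is permutation-invariant, the module is insensitive to object ordering, which is part of why it composes well with existing feature extractors.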
Reviews: Bias and Generalization in Deep Generative Models: An Empirical Study
After reading the author responses and discussing with the other reviewers, I have decided to raise my score. I think the authors did a good job in their response to the points I raised. However, I still think that there should be more emphasis in the paper on the significance of its observations, which was not clear to me at first. The study relies on probative experiments using synthetic image datasets (e.g. CLEVR, colored dots, pie shapes with various color proportions) in which observations can be explained by a few independent factors or features (e.g.