- Asia > South Korea > Seoul > Seoul (0.05)
- North America > Canada > Quebec > Montreal (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Vision (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
VQA$^2$: Visual Question Answering for Video Quality Assessment
Jia, Ziheng; Zhang, Zicheng; Qian, Jiaying; Wu, Haoning; Sun, Wei; Li, Chunyi; Liu, Xiaohong; Lin, Weisi; Zhai, Guangtao; Min, Xiongkuo
The advent and proliferation of large multi-modal models (LMMs) have introduced new paradigms to computer vision, transforming various tasks into a unified visual question answering framework. Video Quality Assessment (VQA), a classic field in low-level visual perception, initially focused on quantitative video quality scoring. However, driven by advances in LMMs, it is now progressing toward more holistic visual quality understanding tasks. Recent studies in the image domain have demonstrated that Visual Question Answering (VQA) can markedly enhance low-level visual quality evaluation. Nevertheless, this direction has not been explored in the video domain, leaving substantial room for improvement. To address this gap, we introduce the VQA$^2$ Instruction Dataset, the first visual question answering instruction dataset focused on video quality assessment. The dataset consists of three subsets, covers various video types, and contains 157,755 instruction question-answer pairs. Building on this foundation, we present the VQA$^2$ series models, which interleave visual and motion tokens to enhance the perception of spatial-temporal quality details in videos. We conduct extensive experiments on video quality scoring and understanding tasks, and the results demonstrate that the VQA$^2$ series models achieve excellent performance in both. Notably, our final model, the VQA$^2$-Assistant, exceeds the renowned GPT-4o in visual quality understanding tasks while remaining strongly competitive in quality scoring tasks. Our work provides a foundation and a feasible approach for integrating low-level video quality assessment and understanding with LMMs.
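The abstract does not spell out how visual and motion tokens are interleaved; below is a minimal sketch of one plausible per-frame interleaving scheme in PyTorch. The tensor shapes, the function name, and the frame-wise layout are all assumptions for illustration, not details from the paper:

```python
import torch

def interleave_tokens(visual_tokens, motion_tokens):
    """Interleave per-frame visual tokens with per-frame motion tokens.

    visual_tokens: (T, Nv, D) -- Nv spatial tokens for each of T frames
    motion_tokens: (T, Nm, D) -- Nm motion tokens per frame (e.g. derived
                                 from optical flow or frame differences)
    Returns a (T * (Nv + Nm), D) sequence in which each frame's visual
    tokens are immediately followed by that frame's motion tokens.
    """
    chunks = []
    for t in range(visual_tokens.shape[0]):
        chunks.append(visual_tokens[t])  # spatial appearance for frame t
        chunks.append(motion_tokens[t])  # temporal cues for frame t
    return torch.cat(chunks, dim=0)

# Example: 8 frames, 16 visual and 4 motion tokens per frame, 512-dim features
seq = interleave_tokens(torch.randn(8, 16, 512), torch.randn(8, 4, 512))
print(seq.shape)  # torch.Size([160, 512])
```

A single interleaved sequence like this lets one LMM token stream carry both appearance and motion cues frame by frame.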
ConVQG: Contrastive Visual Question Generation with Multimodal Guidance
Mi, Li; Montariol, Syrielle; Castillo-Navarro, Javiera; Dai, Xianjie; Bosselut, Antoine; Tuia, Devis
Asking questions about visual environments is a crucial way for intelligent agents to understand rich multi-faceted scenes, raising the importance of Visual Question Generation (VQG) systems. Apart from being grounded in the image, existing VQG systems can use textual constraints, such as expected answers or knowledge triplets, to generate focused questions. These constraints allow VQG systems to specify the question content or to leverage external commonsense knowledge that cannot be obtained from the image content alone. However, generating focused questions using textual constraints while enforcing high relevance to the image content remains a challenge, as VQG systems often ignore one or both forms of grounding. In this work, we propose Contrastive Visual Question Generation (ConVQG), a method that uses a dual contrastive objective to discriminate questions generated from both modalities from those based on a single one. Experiments on both knowledge-aware and standard VQG benchmarks demonstrate that ConVQG outperforms state-of-the-art methods and generates image-grounded, text-guided, and knowledge-rich questions. Our human evaluation results also show a preference for ConVQG questions over non-contrastive baselines.
- North America > United States (0.14)
- Europe > Switzerland (0.04)
- Asia (0.04)
- Information Technology > Sensing and Signal Processing > Image Processing (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Question Answering (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
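The dual contrastive objective in the ConVQG entry above can be sketched as an InfoNCE-style loss: the question generated from image + text should sit closer to a reference question than questions generated from either modality alone. The exact formulation below (names, temperature, and the use of a reference embedding) is an illustrative assumption, not the paper's loss:

```python
import torch
import torch.nn.functional as F

def dual_contrastive_loss(q_joint, q_img_only, q_txt_only, q_target, tau=0.1):
    """InfoNCE-style dual contrastive objective (an illustrative form, not
    the paper's exact loss): the question generated from image + text
    (q_joint) should be closer to the reference question (q_target) than
    the questions generated from either modality alone.

    All inputs: (B, D) sentence embeddings; tau is a temperature.
    """
    q_joint, q_img, q_txt, q_tgt = (
        F.normalize(x, dim=-1) for x in (q_joint, q_img_only, q_txt_only, q_target)
    )
    pos = (q_joint * q_tgt).sum(-1) / tau        # (B,) joint-vs-reference
    neg_img = (q_img * q_tgt).sum(-1) / tau      # image-only negative
    neg_txt = (q_txt * q_tgt).sum(-1) / tau      # text-only negative
    logits = torch.stack([pos, neg_img, neg_txt], dim=1)  # (B, 3)
    labels = torch.zeros(len(logits), dtype=torch.long)   # positive at index 0
    return F.cross_entropy(logits, labels)

loss = dual_contrastive_loss(*(torch.randn(4, 256) for _ in range(4)))
print(loss.item())
```

Minimizing such a loss penalizes the generator whenever a single-modality question matches the reference as well as the joint one does, pushing it to use both groundings.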
Image Captioning for Effective Use of Language Models in Knowledge-Based Visual Question Answering
Salaberria, Ander; Azkune, Gorka; de Lacalle, Oier Lopez; Soroa, Aitor; Agirre, Eneko
Integrating outside knowledge for reasoning in visio-linguistic tasks such as visual question answering (VQA) is an open problem. Given that pretrained language models have been shown to contain world knowledge, we propose a unimodal (text-only) training and inference procedure based on automatic off-the-shelf captioning of images and pretrained language models. Our results on a visual question answering task that requires external knowledge (OK-VQA) show that our text-only model outperforms pretrained multimodal (image-text) models with a comparable number of parameters. In contrast, our model is less effective on a standard VQA task (VQA 2.0), confirming that our text-only method is especially effective for tasks requiring external knowledge. In addition, we show that our unimodal model is complementary to multimodal models on both OK-VQA and VQA 2.0; it yields the best result to date on OK-VQA among systems not using external knowledge graphs, and is comparable to systems that do use them. Our qualitative analysis of OK-VQA reveals that automatic captions often fail to capture relevant information in the images, which seems to be balanced by the better inference ability of the text-only language models. Our work opens up possibilities for further improving inference in visio-linguistic tasks.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Africa (0.04)
- North America > United States > Illinois > Cook County > Chicago (0.04)
- Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.92)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Expert Systems (0.68)
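The caption-then-answer pipeline described in the entry above is straightforward to sketch with off-the-shelf Hugging Face components. The specific captioner and language model below are stand-ins chosen for illustration, not necessarily the models the authors used:

```python
from transformers import pipeline

# Off-the-shelf image captioner and text-only language model (both choices
# are illustrative assumptions, not the authors' exact setup).
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
qa = pipeline("text2text-generation", model="google/flan-t5-base")

def answer(image_path, question):
    # 1. Turn the image into text with an automatic caption.
    caption = captioner(image_path)[0]["generated_text"]
    # 2. Let a pretrained language model answer from the caption alone,
    #    relying on its world knowledge for anything the caption omits.
    prompt = f"Caption: {caption}\nQuestion: {question}\nAnswer:"
    return qa(prompt, max_new_tokens=16)[0]["generated_text"]

print(answer("kitchen.jpg", "What appliance is used to keep food cold?"))
```

Because the language model never sees pixels, any visual detail the caption drops is unrecoverable, which matches the qualitative finding above that world knowledge compensates for imperfect captions.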
Microsoft's AI learns to answer questions about scenes from image-text pairs
Machines struggle to make sense of scenes and language without detailed accompanying annotations. Unfortunately, labeling is generally time-consuming and expensive, and even the best labels convey an understanding only of scenes, not of language. In an attempt to remedy the problem, Microsoft researchers conceived of an AI system that trains on image-text pairs in a fashion mimicking the way humans improve their understanding of the world. They say that their single-model encoder-decoder Vision-Language Pre-training (VLP) model, which can both generate image descriptions and answer natural language questions about scenes, lays the groundwork for future frameworks that could reach human parity. A model pretrained on three million image-text pairs is available as open source on GitHub.
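A single-model encoder-decoder of this kind typically rests on attention masking: one transformer encodes image regions bidirectionally while generating text causally. The sketch below illustrates that masking idea (a simplified assumption about the mechanism, not code from the released VLP repository):

```python
import torch

def seq2seq_mask(n_img, n_txt):
    """Attention mask for a unified encoder-decoder: image tokens attend
    bidirectionally among themselves, while text tokens attend to all image
    tokens but only to earlier text tokens (causal), letting one transformer
    both encode the scene and generate text.
    """
    n = n_img + n_txt
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:n_img, :n_img] = True                           # image <-> image: full
    mask[n_img:, :n_img] = True                           # text -> image: full
    causal = torch.tril(torch.ones(n_txt, n_txt, dtype=torch.bool))
    mask[n_img:, n_img:] = causal                         # text -> text: causal
    return mask  # True = attention allowed

print(seq2seq_mask(3, 4).int())
```

Swapping the causal text block for a full one turns the same weights into a bidirectional encoder, which is how one model can serve both captioning and question answering.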
Relation-aware Graph Attention Network for Visual Question Answering
Li, Linjie; Gan, Zhe; Cheng, Yu; Liu, Jingjing
In order to answer semantically complicated questions about an image, a Visual Question Answering (VQA) model needs to fully understand the visual scene in the image, especially the interactive dynamics between different objects. We propose a Relation-aware Graph Attention Network (ReGAT), which encodes each image into a graph and models multi-type inter-object relations via a graph attention mechanism, to learn question-adaptive relation representations. Two types of visual object relations are explored: (i) Explicit Relations that represent geometric positions and semantic interactions between objects; and (ii) Implicit Relations that capture the hidden dynamics between image regions. Experiments demonstrate that ReGAT outperforms prior state-of-the-art approaches on both the VQA 2.0 and VQA-CP v2 datasets. We further show that ReGAT is compatible with existing VQA architectures and can be used as a generic relation encoder to boost model performance for VQA.
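The core mechanism, attention over detected objects whose logits are biased by the relation between each object pair, can be sketched in a few lines. This is a simplified, single-head rendering of the relation-aware idea (all hyperparameters are toy assumptions), not the full ReGAT architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationAwareAttention(nn.Module):
    """Single-head graph attention over object features, with attention
    logits biased by a learned embedding of the discrete relation label
    between each object pair (e.g. an explicit geometric relation).
    """
    def __init__(self, dim, num_relations):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.rel_bias = nn.Embedding(num_relations, 1)  # scalar bias per relation

    def forward(self, obj_feats, rel_labels):
        # obj_feats: (N, D) object features; rel_labels: (N, N) relation ids
        q, k, v = self.q(obj_feats), self.k(obj_feats), self.v(obj_feats)
        logits = q @ k.t() / obj_feats.shape[-1] ** 0.5   # (N, N) content term
        logits = logits + self.rel_bias(rel_labels).squeeze(-1)  # relation term
        return F.softmax(logits, dim=-1) @ v              # relation-aware features

# Toy numbers: 36 detected objects, 1024-dim features, 11 relation types
attn = RelationAwareAttention(1024, 11)
out = attn(torch.randn(36, 1024), torch.randint(0, 11, (36, 36)))
print(out.shape)  # torch.Size([36, 1024])
```

ReGAT additionally conditions the attention on the question to make the relation representations question-adaptive; that conditioning is omitted here for brevity.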
MUREL: Multimodal Relational Reasoning for Visual Question Answering
Cadene, Remi; Ben-younes, Hedi; Cord, Matthieu; Thome, Nicolas
Multimodal attentional networks are currently the state-of-the-art models for Visual Question Answering (VQA) tasks involving real images. Although attention allows the model to focus on the visual content relevant to the question, this simple mechanism is arguably insufficient to model the complex reasoning features required for VQA and other high-level tasks. In this paper, we propose MuRel, a multimodal relational network that is learned end-to-end to reason over real images. Our first contribution is the MuRel cell, an atomic reasoning primitive that represents interactions between the question and image regions with a rich vectorial representation and models region relations with pairwise combinations. Second, we incorporate the cell into a full MuRel network, which progressively refines visual and question interactions and can be leveraged to define visualization schemes finer than mere attention maps. We validate the relevance of our approach with various ablation studies and show its superiority over attention-based methods on three datasets: VQA 2.0, VQA-CP v2, and TDIUC. Our final MuRel network is competitive with or outperforms state-of-the-art results in this challenging context. Our code is available at https://github.com/Cadene/murel.bootstrap.pytorch
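A toy rendering of the MuRel cell idea: fuse the question into each region, then update every region through pairwise combinations with all the others. The fusion here is a plain linear layer for brevity (the paper uses richer bilinear fusion), so treat this as a schematic sketch, not the reference implementation:

```python
import torch
import torch.nn as nn

class MuRelCellSketch(nn.Module):
    """Simplified MuRel-style cell: question-region fusion followed by a
    max over pairwise region-region combinations, applied as a residual
    update on the region features.
    """
    def __init__(self, dim):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)      # question-region fusion
        self.pairwise = nn.Linear(2 * dim, dim)  # region-region interaction

    def forward(self, regions, question):
        # regions: (N, D), question: (D,)
        q = question.expand_as(regions)                          # (N, D)
        m = torch.relu(self.fuse(torch.cat([regions, q], -1)))   # (N, D)
        left = m.unsqueeze(1).expand(-1, m.size(0), -1)          # (N, N, D)
        right = m.unsqueeze(0).expand(m.size(0), -1, -1)         # (N, N, D)
        pair = torch.relu(self.pairwise(torch.cat([left, right], -1)))
        return regions + pair.max(dim=1).values                  # residual update

cell = MuRelCellSketch(512)
out = cell(torch.randn(36, 512), torch.randn(512))
print(out.shape)  # torch.Size([36, 512])
```

Stacking several such cells, each consuming the previous one's output, gives the progressive refinement the abstract describes.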
Chain of Reasoning for Visual Question Answering
Wu, Chenfei; Liu, Jinlai; Wang, Xiaojie; Dong, Xuan
Reasoning plays an essential role in Visual Question Answering (VQA). Multi-step and dynamic reasoning is often necessary for answering complex questions. For example, the question "What is placed next to the bus on the right of the picture?" refers to a compound object, "bus on the right," which is generated by the relation "on the right of" between the simple objects "bus" and "picture."
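To make the compound-object idea concrete, here is a schematic sketch (not the paper's model) of how chained relational lookups over a toy scene graph could answer the example question; the scene triples and helper function are invented for illustration:

```python
# Toy scene graph as (subject, relation, object) triples.
scene = {
    ("bus_1", "on the right of", "picture"),
    ("bench", "next to", "bus_1"),
}

def apply_relation(scene, relation, obj):
    """Return every subject that stands in `relation` to `obj`."""
    return {s for (s, r, o) in scene if r == relation and o == obj}

# Step 1: build the compound object "bus on the right"
#         = on-the-right-of(picture).
buses_on_right = apply_relation(scene, "on the right of", "picture")
# Step 2: reason over the compound object: what is placed next to it?
answer = {a for b in buses_on_right for a in apply_relation(scene, "next to", b)}
print(answer)  # {'bench'}
```

Each step consumes the objects derived by the previous one, which is the chained, multi-step reasoning the abstract motivates.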