Question Answering
ProtoVQA: An Adaptable Prototypical Framework for Explainable Fine-Grained Visual Question Answering
Diao, Xingjian, Wu, Weiyi, Kong, Keyi, Qing, Peijun, Xu, Xinwen, Cheng, Ming, Vosoughi, Soroush, Gui, Jiang
Visual Question Answering (VQA) is increasingly used in diverse applications ranging from general visual reasoning to safety-critical domains such as medical imaging and autonomous systems, where models must provide not only accurate answers but also explanations that humans can easily understand and verify. Prototype-based modeling has shown promise for interpretability by grounding predictions in semantically meaningful regions for purely visual reasoning tasks, yet remains underexplored in the context of VQA. We present ProtoVQA, a unified prototypical framework that (i) learns question-aware prototypes that serve as reasoning anchors, connecting answers to discriminative image regions, (ii) applies spatially constrained matching to ensure that the selected evidence is coherent and semantically relevant, and (iii) supports both answering and grounding tasks through a shared prototype backbone. To assess explanation quality, we propose the Visual-Linguistic Alignment Score (VLAS), which measures how well the model's attended regions align with ground-truth evidence. Experiments on Visual7W show that ProtoVQA yields faithful, fine-grained explanations while maintaining competitive accuracy, advancing the development of transparent and trustworthy VQA systems.
When Big Models Train Small Ones: Label-Free Model Parity Alignment for Efficient Visual Question Answering using Small VLMs
Penamakuri, Abhirama Subramanyam, Singh, Navlika, Arora, Piyush, Mishra, Anand
Large Vision-Language Models (L-VLMs) have demonstrated remarkable performance in various vision and language tasks, including visual question answering (VQA). However, their high computational cost makes them impractical for resource-constrained settings and inference-heavy applications. In contrast, Small Vision-Language Models (S-VLMs) offer efficiency but suffer from a significant performance gap compared to their larger counterparts. In this work, we introduce the Model Parity Aligner (MPA), a novel framework designed to systematically improve S-VLMs by leveraging unlabeled images and effective knowledge transfer from L-VLMs. Instead of traditional knowledge distillation methods that rely on labeled training data, MPA employs a strategic parity-based approach that precisely identifies the knowledge disparities between S-VLMs and L-VLMs, and optimizes training by targeting only these disparities. We conduct extensive experiments on four diverse VQA benchmarks, namely TextVQA, ST-VQA, ChartQA, and OKVQA, each of which requires specialized reasoning capabilities such as text recognition, chart interpretation, and commonsense and factual understanding. Our results demonstrate that MPA consistently enhances the performance of S-VLMs on all benchmarks, reducing the performance gap while maintaining computational efficiency. We make our code publicly available.
From Chat Logs to Collective Insights: Aggregative Question Answering
Zhang, Wentao, Kim, Woojeong, Deng, Yuntian
Conversational agents powered by large language models (LLMs) are rapidly becoming integral to our daily interactions, generating unprecedented amounts of conversational data. Such datasets offer a powerful lens into societal interests, trending topics, and collective concerns. Yet, existing approaches typically treat these interactions as independent and miss critical insights that could emerge from aggregating and reasoning across large-scale conversation logs. In this paper, we introduce Aggregative Question Answering, a novel task requiring models to reason explicitly over thousands of user-chatbot interactions to answer aggregative queries, such as identifying emerging concerns among specific demographics. To enable research in this direction, we construct a benchmark, WildChat-AQA, comprising 6,027 aggregative questions derived from 182,330 real-world chatbot conversations. Experiments show that existing methods either struggle to reason effectively or incur prohibitive computational costs, underscoring the need for new approaches capable of extracting collective insights from large-scale conversational data.
MMAPG: A Training-Free Framework for Multimodal Multi-hop Question Answering via Adaptive Planning Graphs
Hu, Yiheng, Wang, Xiaoyang, Liu, Qing, Xu, Xiwei, Fu, Qian, Zhang, Wenjie, Zhu, Liming
Multimodal Multi-hop question answering requires integrating information from diverse sources, such as images and texts, to derive answers. Existing methods typically rely on sequential retrieval and reasoning, where each step builds on the previous output. However, this single-path paradigm makes them vulnerable to errors due to misleading intermediate steps. Moreover, developing multimodal models can be computationally expensive, often requiring extensive training. To address these limitations, we propose a training-free framework guided by an Adaptive Planning Graph, which consists of planning, retrieval and reasoning modules. The planning module analyzes the current state of the Adaptive Planning Graph, determines the next action and where to expand the graph, which enables dynamic and flexible exploration of reasoning paths. To handle retrieval of text to unspecified target modalities, we devise modality-specific strategies that dynamically adapt to distinct data types. Our approach preserves the characteristics of multimodal information without costly task-specific training, enabling seamless integration with up-to-date models. Finally, the experiments on MultimodalQA and WebQA show that our approach matches or outperforms existing models that rely on training.
Omni-CLST: Error-aware Curriculum Learning with guided Selective chain-of-Thought for audio question answering
Zhao, Jinghua, Su, Hang, Fan, Lichun, Luo, Zhenbo, Wang, Hui, Sun, Haoqin, Qin, Yong
ABSTRACT With the rapid progress of large audio-language models (LALMs), audio question answering (AQA) has emerged as a challenging task requiring both fine-grained audio understanding and complex reasoning. While current methods mainly rely on constructing new datasets via captioning or reasoning traces, existing high-quality AQA data remains underutilized. To address this, we propose Omni-CLST, an error-aware Curriculum Learning framework with guided Selective Chain-of-Thought. The framework efficiently leverages existing high-quality dataset through two key strategies: an error-aware curriculum that organizes samples by difficulty, and a guided thought dropout mechanism that focuses reasoning on challenging cases. Experiments show that Omni-CLST achieves 73.80% on MMAU-mini and a new state of the art of 64.30% on MMAR, demonstrating robust generalization in multimodal audio-language understanding.
ChartGaze: Enhancing Chart Understanding in LVLMs with Eye-Tracking Guided Attention Refinement
Salamatian, Ali, Abaskohi, Amirhossein, Fan, Wan-Cyuan, Hossain, Mir Rayat Imtiaz, Sigal, Leonid, Carenini, Giuseppe
Charts are a crucial visual medium for communicating and representing information. While Large Vision-Language Models (LVLMs) have made progress on chart question answering (CQA), the task remains challenging, particularly when models attend to irrelevant regions of the chart. In this work, we present ChartGaze, a new eye-tracking dataset that captures human gaze patterns during chart reasoning tasks. Through a systematic comparison of human and model attention, we find that LVLMs often diverge from human gaze, leading to reduced interpretability and accuracy. To address this, we propose a gaze-guided attention refinement that aligns image-text attention with human fixations. Our approach improves both answer accuracy and attention alignment, yielding gains of up to 2.56 percentage points across multiple models. These results demonstrate the promise of incorporating human gaze to enhance both the reasoning quality and interpretability of chart-focused LVLMs.
Toward Ownership Understanding of Objects: Active Question Generation with Large Language Model and Probabilistic Generative Model
Hashimoto, Saki, Hasegawa, Shoichi, Ishikawa, Tomochika, Taniguchi, Akira, Hagiwara, Yoshinobu, Hafi, Lotfi El, Taniguchi, Tadahiro
Robots operating in daily life environments must understand object ownership to carry out instructions naturally given by users, such as "Bring me my cup." Without ownership knowledge, a robot cannot determine which object is being referred to when multiple similar objects exist. This problem is especially evident in kitchens, offices, or laboratories, where objects with similar appearances may belong to different individuals. Relying solely on perceptual features such as location or appearance is insufficient because ownership is inherently context-dependent and often determined by social conventions. Therefore, enabling robots to acquire ownership knowledge is a crucial step toward socially appropriate human-robot interaction. To enable robots to learn object ownership in daily life environments, it is essential to implement a question-generation mechanism that efficiently acquires necessary information. However, in real-world environments with large numbers of objects, this is impractical and imposes a heavy burden on users. Although robots can explore the environment to collect visual features of objects, it remains difficult to obtain ownership knowledge because it depends on users and context. Therefore, allowing robots to ask questions based on the current situation enables them to acquire ownership knowl-Saki Hashimoto is the presenter of this paper.
Teaching Vision-Language Models to Ask: Resolving Ambiguity in Visual Questions
Jian, Pu, Yu, Donglei, Yang, Wen, Ren, Shuo, Zhang, Jiajun
In visual question answering (VQA) context, users often pose ambiguous questions to visual language models (VLMs) due to varying expression habits. Existing research addresses such ambiguities primarily by rephrasing questions. These approaches neglect the inherently interactive nature of user interactions with VLMs, where ambiguities can be clarified through user feedback. However, research on interactive clarification faces two major challenges: (1) Benchmarks are absent to assess VLMs' capacity for resolving ambiguities through interaction; (2) VLMs are trained to prefer answering rather than asking, preventing them from seeking clarification. To overcome these challenges, we introduce \textbf{ClearVQA} benchmark, which targets three common categories of ambiguity in VQA context, and encompasses various VQA scenarios.
Bridging Vision Language Models and Symbolic Grounding for Video Question Answering
Ma, Haodi, Pathak, Vyom, Wang, Daisy Zhe
Video Question Answering (VQA) requires models to reason over spatial, temporal, and causal cues in videos. Recent vision language models (VLMs) achieve strong results but often rely on shallow correlations, leading to weak temporal grounding and limited interpretability. We study symbolic scene graphs (SGs) as intermediate grounding signals for VQA. SGs provide structured object-relation representations that complement VLMs holistic reasoning. We introduce SG-VLM, a modular framework that integrates frozen VLMs with scene graph grounding via prompting and visual localization. Across three benchmarks (NExT-QA, iVQA, ActivityNet-QA) and multiple VLMs (QwenVL, InternVL), SG-VLM improves causal and temporal reasoning and outperforms prior baselines, though gains over strong VLMs are limited. These findings highlight both the promise and current limitations of symbolic grounding, and offer guidance for future hybrid VLM-symbolic approaches in video understanding.
Towards Reliable and Interpretable Document Question Answering via VLMs
Chen, Alessio, Giovannini, Simone, Gemelli, Andrea, Coppini, Fabio, Marinai, Simone
Vision-Language Models (VLMs) have shown strong capabilities in document understanding, particularly in identifying and extracting textual information from complex documents. Despite this, accurately localizing answers within documents remains a major challenge, limiting both interpretability and real-world applicability. T o address this, we introduce DocExplainerV0, a plug-and-play bounding-box prediction module that decouples answer generation from spatial localization. This design makes it applicable to existing VLMs, including proprietary systems where fine-tuning is not feasible. Through systematic evaluation, we provide quantitative insights into the gap between textual accuracy and spatial grounding, showing that correct answers often lack reliable localization. Our standardized framework highlights these shortcomings and establishes a benchmark for future research toward more interpretable and robust document information extraction VLMs.