Spatial-ViLT: Enhancing Visual Spatial Reasoning through Multi-Task Learning
Islam, Chashi Mahiul, Mamo, Oteo, Chacko, Samuel Jacob, Liu, Xiuwen, Yu, Weikuan
Vision-language models (VLMs) have advanced multimodal reasoning but still face challenges in spatial reasoning for 3D scenes and complex object configurations. To address this, we introduce SpatialViLT, an enhanced VLM that integrates spatial features like depth maps, 3D coordinates, and edge maps through a multi-task learning framework. This approach enriches multimodal embeddings with spatial understanding. We propose two variants: SpatialViLT and MaskedSpatialViLT, focusing on full and masked object regions, respectively. Additionally, SpatialEnsemble combines both approaches, achieving state-of-the-art accuracy. Our models excel in spatial reasoning categories such as directional, topological, and proximity relations, as demonstrated on the challenging Visual Spatial Reasoning (VSR) dataset. This work represents a significant step in enhancing the spatial intelligence of AI systems, crucial for advanced multimodal understanding and real-world applications.
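The abstract describes enriching a ViLT-style encoder with auxiliary prediction targets (depth maps, 3D coordinates, edge maps) trained jointly with the main task. Below is a minimal multi-task sketch of that idea; the module names, feature dimensions, and loss weights are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class SpatialMultiTaskHead(nn.Module):
    """Toy multi-task wrapper: a shared encoder feeds one answer head and three
    auxiliary spatial heads (depth, 3D coordinates, edges). All names, sizes,
    and weights below are illustrative assumptions."""

    def __init__(self, hidden_dim=768, num_answers=2):
        super().__init__()
        self.encoder = nn.Linear(512, hidden_dim)        # stand-in for a ViLT-style encoder
        self.cls_head = nn.Linear(hidden_dim, num_answers)
        self.depth_head = nn.Linear(hidden_dim, 14 * 14)  # coarse depth map
        self.coord_head = nn.Linear(hidden_dim, 3)        # 3D object centroid
        self.edge_head = nn.Linear(hidden_dim, 14 * 14)   # coarse edge map

    def forward(self, feats):
        h = torch.relu(self.encoder(feats))
        return (self.cls_head(h), self.depth_head(h),
                self.coord_head(h), self.edge_head(h))

def multitask_loss(outputs, targets, weights=(1.0, 0.3, 0.3, 0.3)):
    """Weighted sum of the main task loss and the auxiliary spatial losses."""
    cls_out, depth_out, coord_out, edge_out = outputs
    cls_t, depth_t, coord_t, edge_t = targets
    return (weights[0] * nn.functional.cross_entropy(cls_out, cls_t)
            + weights[1] * nn.functional.mse_loss(depth_out, depth_t)
            + weights[2] * nn.functional.mse_loss(coord_out, coord_t)
            + weights[3] * nn.functional.binary_cross_entropy_with_logits(edge_out, edge_t))

# usage with random tensors
model = SpatialMultiTaskHead()
feats = torch.randn(4, 512)
targets = (torch.randint(0, 2, (4,)), torch.randn(4, 196),
           torch.randn(4, 3), torch.randint(0, 2, (4, 196)).float())
loss = multitask_loss(model(feats), targets)
loss.backward()
```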
Can Argus Judge Them All? Comparing VLMs Across Domains
Joshi, Harsh, Kashyap, Gautam Siddharth, Ali, Rafiq, Shabbir, Ebad, Jain, Niharika, Jain, Sarthak, Gao, Jiechao, Naseem, Usman
Vision-Language Models (VLMs) are advancing multimodal AI, yet their performance consistency across tasks is underexamined. We benchmark CLIP, BLIP, and LXMERT across diverse datasets spanning retrieval, captioning, and reasoning. Our evaluation includes task accuracy, generation quality, efficiency, and a novel Cross-Dataset Consistency (CDC) metric. CLIP shows strongest generalization (CDC: 0.92), BLIP excels on curated data, and LXMERT leads in structured reasoning. These results expose trade-offs between generalization and specialization, informing industrial deployment of VLMs and guiding development toward robust, task-flexible architectures.
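The abstract reports a Cross-Dataset Consistency (CDC) metric but does not give its formula. The sketch below shows one plausible way to score consistency of per-dataset results as a single number in [0, 1]; this definition is an assumption for illustration and may differ from the paper's.

```python
import numpy as np

def cross_dataset_consistency(scores):
    """Toy consistency score: 1.0 means identical performance on every dataset,
    lower values mean larger spread. Illustrative definition only; the CDC
    formula used in the paper may differ."""
    scores = np.asarray(scores, dtype=float)
    return 1.0 - scores.std() / (scores.mean() + 1e-8)

# hypothetical per-dataset accuracies for one model
print(cross_dataset_consistency([0.71, 0.68, 0.74, 0.70]))
```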
Mind the Uncertainty in Human Disagreement: Evaluating Discrepancies between Model Predictions and Human Responses in VQA
Lan, Jian, Frassinelli, Diego, Plank, Barbara
Large vision-language models frequently struggle to accurately predict responses provided by multiple human annotators, particularly when those responses exhibit human uncertainty. In this study, we focus on the Visual Question Answering (VQA) task, and we comprehensively evaluate how well the state-of-the-art vision-language models correlate with the distribution of human responses. To do so, we categorize our samples based on their levels (low, medium, high) of human uncertainty in disagreement (HUD) and employ not only accuracy but also three new human-correlated metrics in VQA, to investigate the impact of HUD. To better align models with humans, we also verify the effect of common calibration and human calibration. Our results show that even BEiT3, currently the best model for this task, struggles to capture the multi-label distribution inherent in diverse human responses. Additionally, we observe that the commonly used accuracy-oriented calibration technique adversely affects BEiT3's ability to capture HUD, further widening the gap between model predictions and human distributions. In contrast, we show the benefits of calibrating models towards human distributions for VQA, better aligning model confidence with human uncertainty. Our findings highlight that for VQA, the consistent alignment between human responses and model predictions is understudied and should become the next crucial target of future studies.
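The abstract compares model predictions against the distribution of answers given by multiple annotators and calibrates models toward that distribution. A minimal sketch of both ideas, assuming hypothetical metric and loss choices (total variation distance and a soft-label KL objective) rather than the paper's exact human-correlated metrics:

```python
import torch
import torch.nn.functional as F

def human_distribution(annotations, num_answers):
    """Empirical answer distribution from multiple annotators (answer indices)."""
    counts = torch.bincount(torch.tensor(annotations), minlength=num_answers).float()
    return counts / counts.sum()

def total_variation(p, q):
    """Total variation distance between model and human answer distributions.
    One simple human-correlated measure; the paper's metrics may differ."""
    return 0.5 * (p - q).abs().sum()

def human_calibration_loss(logits, human_dist):
    """Soft-label objective pulling the model's answer distribution toward the
    human distribution instead of a single majority label."""
    return F.kl_div(F.log_softmax(logits, dim=-1), human_dist, reduction="batchmean")

# toy example: 10 annotators, 4 candidate answers
human = human_distribution([0, 0, 0, 1, 1, 2, 0, 0, 1, 0], num_answers=4)
logits = torch.randn(1, 4, requires_grad=True)
print(total_variation(F.softmax(logits[0], dim=-1), human).item())
human_calibration_loss(logits, human.unsqueeze(0)).backward()
```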
Modelling Multimodal Integration in Human Concept Processing with Vision-and-Language Models
Bavaresco, Anna, Kloots, Marianne de Heer, Pezzelle, Sandro, Fernández, Raquel
Representations from deep neural networks (DNNs) have proven remarkably predictive of neural activity involved in both visual and linguistic processing. Despite these successes, most studies to date concern unimodal DNNs, encoding either visual or textual input but not both. Yet, there is growing evidence that human meaning representations integrate linguistic and sensory-motor information. Here we investigate whether the integration of multimodal information operated by current vision-and-language DNN models (VLMs) leads to representations that are more aligned with human brain activity than those obtained by language-only and vision-only DNNs. We focus on fMRI responses recorded while participants read concept words in the context of either a full sentence or an accompanying picture. Our results reveal that VLM representations correlate more strongly than language- and vision-only DNNs with activations in brain areas functionally related to language processing. A comparison between different types of visuo-linguistic architectures shows that recent generative VLMs tend to be less brain-aligned than previous architectures with lower performance on downstream applications. Moreover, through an additional analysis comparing brain vs. behavioural alignment across multiple VLMs, we show that -- with one remarkable exception -- representations that strongly align with behavioural judgments do not correlate highly with brain responses. This indicates that brain similarity does not go hand in hand with behavioural similarity, and vice versa.
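The abstract correlates DNN representations with fMRI responses to concept words. A standard recipe for this kind of comparison is representational similarity analysis (RSA): correlate the pairwise-distance structure of model embeddings with that of brain response patterns. The sketch below assumes this RSA setup for illustration; the paper's exact analysis pipeline may differ.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rsa_alignment(model_reps, brain_reps):
    """Representational similarity analysis: build one representational
    dissimilarity matrix (RDM) per space and correlate their entries."""
    model_rdm = pdist(model_reps, metric="correlation")   # condition-by-condition distances
    brain_rdm = pdist(brain_reps, metric="correlation")
    rho, _ = spearmanr(model_rdm, brain_rdm)
    return rho

# toy data: 50 concept words, 768-d model embeddings, 2000 voxels
rng = np.random.default_rng(0)
print(rsa_alignment(rng.normal(size=(50, 768)), rng.normal(size=(50, 2000))))
```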
Beyond Image-Text Matching: Verb Understanding in Multimodal Transformers Using Guided Masking
Beňová, Ivana, Košecká, Jana, Gregor, Michal, Tamajka, Martin, Veselý, Marcel, Šimko, Marián
The dominant probing approaches rely on the zero-shot performance of image-text matching tasks to gain a finer-grained understanding of the representations learned by recent multimodal image-language transformer models. The evaluation is carried out on carefully curated datasets focusing on counting, relations, attributes, and others. This work introduces an alternative probing strategy called guided masking. The proposed approach ablates different modalities using masking and assesses the model's ability to predict the masked word with high accuracy. We focus on studying multimodal models that consider regions of interest (ROI) features obtained by object detectors as input tokens. We probe the understanding of verbs using guided masking on ViLBERT, LXMERT, UNITER, and VisualBERT and show that these models can predict the correct verb with high accuracy.
[Figure 1: Image-caption pairs from the SVO-Probes dataset (Hendricks and Nematzadeh, 2021), where the sentence either correctly describes the image (positive example) or one aspect of the sentence (subject, verb, or object) does not match the image (negative example). These pairs are used to probe models through zero-shot image-text matching. Example of a positive caption: "A person walking on a trail."]
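The guided-masking probe described above amounts to masking one word (e.g., the verb) in the caption, running the multimodal masked-LM, and checking whether the correct word receives the highest probability at the masked position. The sketch below uses a toy stand-in model with a hypothetical call signature instead of a real LXMERT/ViLBERT checkpoint, so only the probing logic reflects the abstract.

```python
import torch
import torch.nn as nn

class ToyMultimodalMLM(nn.Module):
    """Hypothetical stand-in for a multimodal masked-LM (e.g. LXMERT-style):
    consumes token ids plus region-of-interest features and returns vocabulary
    logits for every text position."""
    def __init__(self, vocab_size=30522, hidden=64, roi_dim=2048):
        super().__init__()
        self.txt = nn.Embedding(vocab_size, hidden)
        self.vis = nn.Linear(roi_dim, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, input_ids, roi_feats):
        h = self.txt(input_ids) + self.vis(roi_feats).mean(1, keepdim=True)
        return self.out(h)                       # (batch, seq_len, vocab)

def guided_masking_probe(model, token_ids, mask_pos, mask_id, candidate_ids):
    """Mask the verb position, run the model, and return each candidate's
    masked-LM probability; the correct verb should rank highest."""
    masked = token_ids.clone()
    masked[0, mask_pos] = mask_id
    roi_feats = torch.randn(1, 36, 2048)          # stand-in object-detector features
    with torch.no_grad():
        probs = model(masked, roi_feats)[0, mask_pos].softmax(-1)
    return {cid: probs[cid].item() for cid in candidate_ids}

model = ToyMultimodalMLM()
caption_ids = torch.randint(0, 30522, (1, 8))     # toy ids for "A person [V] on a trail"
print(guided_masking_probe(model, caption_ids, mask_pos=2, mask_id=103,
                           candidate_ids=[2003, 3788, 2770]))  # toy candidate verb ids
```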
Object Attribute Matters in Visual Question Answering
Li, Peize, Si, Qingyi, Fu, Peng, Lin, Zheng, Wang, Yan
Visual question answering is a multimodal task that requires the joint comprehension of visual and textual information. However, integrating visual and textual semantics solely through attention layers is insufficient to comprehensively understand and align information from both modalities. Intuitively, object attributes can naturally serve as a bridge to unify them, which has been overlooked in previous research. In this paper, we propose a novel VQA approach from the perspective of utilizing object attributes, aiming to achieve better object-level visual-language alignment and multimodal scene understanding. Specifically, we design an attribute fusion module and a contrastive knowledge distillation module. The attribute fusion module constructs a multimodal graph neural network to fuse attributes and visual features through message passing. The enhanced object-level visual features contribute to solving fine-grained problems such as counting questions. The better object-level visual-language alignment aids in understanding multimodal scenes, thereby improving the model's robustness. Furthermore, to augment scene understanding and out-of-distribution performance, the contrastive knowledge distillation module introduces a series of implicit knowledge. We distill knowledge into attributes through contrastive loss, which further strengthens the representation learning of attribute features and facilitates visual-linguistic alignment. Extensive experiments on six datasets, COCO-QA, VQAv2, VQA-CPv2, VQA-CPv1, VQAvs and TDIUC, show the superiority of the proposed method.
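The two modules described in the abstract, message-passing attribute fusion and contrastive knowledge distillation, can be sketched as follows. The graph construction, dimensions, and gating scheme here are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttributeFusion(nn.Module):
    """One round of message passing on a toy object-attribute graph: each object
    node aggregates messages from its attribute nodes and updates its visual
    feature. Dimensions and the update rule are illustrative assumptions."""
    def __init__(self, dim=256):
        super().__init__()
        self.msg = nn.Linear(dim, dim)
        self.upd = nn.GRUCell(dim, dim)

    def forward(self, obj_feats, attr_feats, adj):
        # adj[i, j] = 1 if attribute j describes object i
        messages = adj @ self.msg(attr_feats) / adj.sum(-1, keepdim=True).clamp(min=1)
        return self.upd(messages, obj_feats)      # attribute-enhanced object features

def contrastive_distill(attr_emb, knowledge_emb, temperature=0.07):
    """InfoNCE-style loss pulling each attribute embedding toward its paired
    implicit-knowledge embedding and pushing apart mismatched pairs."""
    logits = F.normalize(attr_emb, dim=-1) @ F.normalize(knowledge_emb, dim=-1).T / temperature
    return F.cross_entropy(logits, torch.arange(len(attr_emb)))

# toy usage: 5 objects, 12 attribute nodes
objs, attrs = torch.randn(5, 256), torch.randn(12, 256)
adj = (torch.rand(5, 12) > 0.7).float()
fused = AttributeFusion()(objs, attrs, adj)
print(fused.shape, contrastive_distill(torch.randn(5, 256), torch.randn(5, 256)).item())
```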
Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models
Despite the pre-training of vision-and-language models (VLMs) on large-scale datasets of image-text pairs, several recent works have shown that these pre-trained models lack fine-grained understanding, such as the ability to count and to recognize verbs, attributes, or relationships. The focus of this work is to study the ability of these models to understand spatial relations. Previously, this has been tackled using image-text matching (e.g., the Visual Spatial Reasoning benchmark) or visual question answering (e.g., GQA or VQAv2), both showing poor performance and a large gap compared to human performance. In this work, we use explainability tools to better understand the causes of this poor performance and present an alternative fine-grained, compositional approach for ranking spatial clauses. We combine the evidence from grounding noun phrases corresponding to objects and their locations to compute the final rank of the spatial clause. We demonstrate the approach on representative VLMs (such as LXMERT, GPV, and MDETR) and compare and highlight their abilities to reason about spatial relationships.
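The abstract's ranking idea, combining grounding evidence for the two noun phrases with evidence for the spatial relation between them, can be illustrated with a small sketch. The geometric relation checks, thresholds, and score combination below are illustrative assumptions, not the paper's method.

```python
from dataclasses import dataclass

@dataclass
class Box:
    x1: float
    y1: float
    x2: float
    y2: float
    score: float = 1.0   # grounding confidence for the noun phrase in this box

def relation_score(subj: Box, obj: Box, relation: str) -> float:
    """Crude geometric check of a spatial relation between two grounded boxes
    (image coordinates: y grows downward, so 'above' means smaller y)."""
    sx, sy = (subj.x1 + subj.x2) / 2, (subj.y1 + subj.y2) / 2
    ox, oy = (obj.x1 + obj.x2) / 2, (obj.y1 + obj.y2) / 2
    geom = {
        "left of": float(sx < ox),
        "right of": float(sx > ox),
        "above": float(sy < oy),
        "below": float(sy > oy),
    }.get(relation, 0.0)
    # combine grounding evidence for both noun phrases with the geometric check
    return subj.score * obj.score * geom

def rank_clauses(subj: Box, obj: Box, candidate_relations):
    """Rank spatial clauses (e.g. 'cat left of dog') by combined evidence."""
    return sorted(candidate_relations,
                  key=lambda r: relation_score(subj, obj, r), reverse=True)

cat = Box(10, 40, 60, 90, score=0.9)
dog = Box(120, 50, 180, 95, score=0.8)
print(rank_clauses(cat, dog, ["left of", "right of", "above", "below"]))
```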