Spatial-ViLT: Enhancing Visual Spatial Reasoning through Multi-Task Learning
Islam, Chashi Mahiul, Mamo, Oteo, Chacko, Samuel Jacob, Liu, Xiuwen, Yu, Weikuan
Vision-language models (VLMs) have advanced multimodal reasoning but still face challenges in spatial reasoning for 3D scenes and complex object configurations. To address this, we introduce SpatialViLT, an enhanced VLM that integrates spatial features like depth maps, 3D coordinates, and edge maps through a multi-task learning framework. This approach enriches multimodal embeddings with spatial understanding. We propose two variants: SpatialViLT and MaskedSpatialViLT, focusing on full and masked object regions, respectively. Additionally, SpatialEnsemble combines both approaches, achieving state-of-the-art accuracy. Our models excel in spatial reasoning categories such as directional, topological, and proximity relations, as demonstrated on the challenging Visual Spatial Reasoning (VSR) dataset. This work represents a significant step in enhancing the spatial intelligence of AI systems, crucial for advanced multimodal understanding and real-world applications.
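The abstract describes enriching a ViLT-style encoder with auxiliary prediction targets (depth maps, 3D coordinates, edge maps) trained jointly with the main task. Below is a minimal multi-task sketch of that idea; the module names, feature dimensions, and loss weights are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class SpatialMultiTaskHead(nn.Module):
    """Toy multi-task wrapper: a shared encoder feeds one answer head and three
    auxiliary spatial heads (depth, 3D coordinates, edges). All names, sizes,
    and weights below are illustrative assumptions."""

    def __init__(self, hidden_dim=768, num_answers=2):
        super().__init__()
        self.encoder = nn.Linear(512, hidden_dim)        # stand-in for a ViLT-style encoder
        self.cls_head = nn.Linear(hidden_dim, num_answers)
        self.depth_head = nn.Linear(hidden_dim, 14 * 14)  # coarse depth map
        self.coord_head = nn.Linear(hidden_dim, 3)        # 3D object centroid
        self.edge_head = nn.Linear(hidden_dim, 14 * 14)   # coarse edge map

    def forward(self, feats):
        h = torch.relu(self.encoder(feats))
        return (self.cls_head(h), self.depth_head(h),
                self.coord_head(h), self.edge_head(h))

def multitask_loss(outputs, targets, weights=(1.0, 0.3, 0.3, 0.3)):
    """Weighted sum of the main task loss and the auxiliary spatial losses."""
    cls_out, depth_out, coord_out, edge_out = outputs
    cls_t, depth_t, coord_t, edge_t = targets
    return (weights[0] * nn.functional.cross_entropy(cls_out, cls_t)
            + weights[1] * nn.functional.mse_loss(depth_out, depth_t)
            + weights[2] * nn.functional.mse_loss(coord_out, coord_t)
            + weights[3] * nn.functional.binary_cross_entropy_with_logits(edge_out, edge_t))

# usage with random tensors
model = SpatialMultiTaskHead()
feats = torch.randn(4, 512)
targets = (torch.randint(0, 2, (4,)), torch.randn(4, 196),
           torch.randn(4, 3), torch.randint(0, 2, (4, 196)).float())
loss = multitask_loss(model(feats), targets)
loss.backward()
```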
Can Argus Judge Them All? Comparing VLMs Across Domains
Joshi, Harsh, Kashyap, Gautam Siddharth, Ali, Rafiq, Shabbir, Ebad, Jain, Niharika, Jain, Sarthak, Gao, Jiechao, Naseem, Usman
Vision-Language Models (VLMs) are advancing multimodal AI, yet their performance consistency across tasks is underexamined. We benchmark CLIP, BLIP, and LXMERT across diverse datasets spanning retrieval, captioning, and reasoning. Our evaluation includes task accuracy, generation quality, efficiency, and a novel Cross-Dataset Consistency (CDC) metric. CLIP shows strongest generalization (CDC: 0.92), BLIP excels on curated data, and LXMERT leads in structured reasoning. These results expose trade-offs between generalization and specialization, informing industrial deployment of VLMs and guiding development toward robust, task-flexible architectures.
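The abstract reports a Cross-Dataset Consistency (CDC) metric but does not give its formula. The sketch below shows one plausible way to score consistency of per-dataset results as a single number in [0, 1]; this definition is an assumption for illustration and may differ from the paper's.

```python
import numpy as np

def cross_dataset_consistency(scores):
    """Toy consistency score: 1.0 means identical performance on every dataset,
    lower values mean larger spread. Illustrative definition only; the CDC
    formula used in the paper may differ."""
    scores = np.asarray(scores, dtype=float)
    return 1.0 - scores.std() / (scores.mean() + 1e-8)

# hypothetical per-dataset accuracies for one model
print(cross_dataset_consistency([0.71, 0.68, 0.74, 0.70]))
```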
Mind the Uncertainty in Human Disagreement: Evaluating Discrepancies between Model Predictions and Human Responses in VQA
Lan, Jian, Frassinelli, Diego, Plank, Barbara
Large vision-language models frequently struggle to accurately predict responses provided by multiple human annotators, particularly when those responses exhibit human uncertainty. In this study, we focus on the Visual Question Answering (VQA) task, and we comprehensively evaluate how well the state-of-the-art vision-language models correlate with the distribution of human responses. To do so, we categorize our samples based on their levels (low, medium, high) of human uncertainty in disagreement (HUD) and employ not only accuracy but also three new human-correlated metrics in VQA, to investigate the impact of HUD. To better align models with humans, we also verify the effect of common calibration and human calibration. Our results show that even BEiT3, currently the best model for this task, struggles to capture the multi-label distribution inherent in diverse human responses. Additionally, we observe that the commonly used accuracy-oriented calibration technique adversely affects BEiT3's ability to capture HUD, further widening the gap between model predictions and human distributions. In contrast, we show the benefits of calibrating models towards human distributions for VQA, better aligning model confidence with human uncertainty. Our findings highlight that for VQA, the consistent alignment between human responses and model predictions is understudied and should become the next crucial target of future studies.
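The abstract compares model predictions against the distribution of answers given by multiple annotators and calibrates models toward that distribution. A minimal sketch of both ideas, assuming hypothetical metric and loss choices (total variation distance and a soft-label KL objective) rather than the paper's exact human-correlated metrics:

```python
import torch
import torch.nn.functional as F

def human_distribution(annotations, num_answers):
    """Empirical answer distribution from multiple annotators (answer indices)."""
    counts = torch.bincount(torch.tensor(annotations), minlength=num_answers).float()
    return counts / counts.sum()

def total_variation(p, q):
    """Total variation distance between model and human answer distributions.
    One simple human-correlated measure; the paper's metrics may differ."""
    return 0.5 * (p - q).abs().sum()

def human_calibration_loss(logits, human_dist):
    """Soft-label objective pulling the model's answer distribution toward the
    human distribution instead of a single majority label."""
    return F.kl_div(F.log_softmax(logits, dim=-1), human_dist, reduction="batchmean")

# toy example: 10 annotators, 4 candidate answers
human = human_distribution([0, 0, 0, 1, 1, 2, 0, 0, 1, 0], num_answers=4)
logits = torch.randn(1, 4, requires_grad=True)
print(total_variation(F.softmax(logits[0], dim=-1), human).item())
human_calibration_loss(logits, human.unsqueeze(0)).backward()
```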
Modelling Multimodal Integration in Human Concept Processing with Vision-and-Language Models
Bavaresco, Anna, Kloots, Marianne de Heer, Pezzelle, Sandro, Fernández, Raquel
Representations from deep neural networks (DNNs) have proven remarkably predictive of neural activity involved in both visual and linguistic processing. Despite these successes, most studies to date concern unimodal DNNs, encoding either visual or textual input but not both. Yet, there is growing evidence that human meaning representations integrate linguistic and sensory-motor information. Here we investigate whether the integration of multimodal information operated by current vision-and-language DNN models (VLMs) leads to representations that are more aligned with human brain activity than those obtained by language-only and vision-only DNNs. We focus on fMRI responses recorded while participants read concept words in the context of either a full sentence or an accompanying picture. Our results reveal that VLM representations correlate more strongly than language- and vision-only DNNs with activations in brain areas functionally related to language processing. A comparison between different types of visuo-linguistic architectures shows that recent generative VLMs tend to be less brain-aligned than previous architectures with lower performance on downstream applications. Moreover, through an additional analysis comparing brain vs. behavioural alignment across multiple VLMs, we show that -- with one remarkable exception -- representations that strongly align with behavioural judgments do not correlate highly with brain responses. This indicates that brain similarity does not go hand in hand with behavioural similarity, and vice versa.
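The abstract correlates DNN representations with fMRI responses to concept words. A standard recipe for this kind of comparison is representational similarity analysis (RSA): correlate the pairwise-distance structure of model embeddings with that of brain response patterns. The sketch below assumes this RSA setup for illustration; the paper's exact analysis pipeline may differ.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rsa_alignment(model_reps, brain_reps):
    """Representational similarity analysis: build one representational
    dissimilarity matrix (RDM) per space and correlate their entries."""
    model_rdm = pdist(model_reps, metric="correlation")   # condition-by-condition distances
    brain_rdm = pdist(brain_reps, metric="correlation")
    rho, _ = spearmanr(model_rdm, brain_rdm)
    return rho

# toy data: 50 concept words, 768-d model embeddings, 2000 voxels
rng = np.random.default_rng(0)
print(rsa_alignment(rng.normal(size=(50, 768)), rng.normal(size=(50, 2000))))
```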
Beyond Image-Text Matching: Verb Understanding in Multimodal Transformers Using Guided Masking
Beňová, Ivana, Košecká, Jana, Gregor, Michal, Tamajka, Martin, Veselý, Marcel, Šimko, Marián
The dominant probing approaches rely on the zero-shot performance of image-text matching tasks to gain a finer-grained understanding of the representations learned by recent multimodal image-language transformer models. The evaluation is carried out on carefully curated datasets focusing on counting, relations, attributes, and others. This work introduces an alternative probing strategy called guided masking. The proposed approach ablates different modalities using masking and assesses the model's ability to predict the masked word with high accuracy. We focus on studying multimodal models that consider regions of interest (ROI) features obtained by object detectors as input tokens. We probe the understanding of verbs using guided masking on ViLBERT, LXMERT, UNITER, and VisualBERT and show that these models can predict the correct verb with high accuracy.
[Figure 1: Image-caption pairs from the SVO-Probes dataset (Hendricks and Nematzadeh, 2021), where the sentence either correctly describes the image (positive example) or one aspect of the sentence (subject, verb, or object) does not match the image (negative example). These pairs are used to probe models through zero-shot image-text matching. Example of a positive caption: "A person walking on a trail."]
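The guided-masking probe described above amounts to masking one word (e.g., the verb) in the caption, running the multimodal masked-LM, and checking whether the correct word receives the highest probability at the masked position. The sketch below uses a toy stand-in model with a hypothetical call signature instead of a real LXMERT/ViLBERT checkpoint, so only the probing logic reflects the abstract.

```python
import torch
import torch.nn as nn

class ToyMultimodalMLM(nn.Module):
    """Hypothetical stand-in for a multimodal masked-LM (e.g. LXMERT-style):
    consumes token ids plus region-of-interest features and returns vocabulary
    logits for every text position."""
    def __init__(self, vocab_size=30522, hidden=64, roi_dim=2048):
        super().__init__()
        self.txt = nn.Embedding(vocab_size, hidden)
        self.vis = nn.Linear(roi_dim, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, input_ids, roi_feats):
        h = self.txt(input_ids) + self.vis(roi_feats).mean(1, keepdim=True)
        return self.out(h)                       # (batch, seq_len, vocab)

def guided_masking_probe(model, token_ids, mask_pos, mask_id, candidate_ids):
    """Mask the verb position, run the model, and return each candidate's
    masked-LM probability; the correct verb should rank highest."""
    masked = token_ids.clone()
    masked[0, mask_pos] = mask_id
    roi_feats = torch.randn(1, 36, 2048)          # stand-in object-detector features
    with torch.no_grad():
        probs = model(masked, roi_feats)[0, mask_pos].softmax(-1)
    return {cid: probs[cid].item() for cid in candidate_ids}

model = ToyMultimodalMLM()
caption_ids = torch.randint(0, 30522, (1, 8))     # toy ids for "A person [V] on a trail"
print(guided_masking_probe(model, caption_ids, mask_pos=2, mask_id=103,
                           candidate_ids=[2003, 3788, 2770]))  # toy candidate verb ids
```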
Object Attribute Matters in Visual Question Answering
Li, Peize, Si, Qingyi, Fu, Peng, Lin, Zheng, Wang, Yan
Visual question answering is a multimodal task that requires the joint comprehension of visual and textual information. However, integrating visual and textual semantics solely through attention layers is insufficient to comprehensively understand and align information from both modalities. Intuitively, object attributes can naturally serve as a bridge to unify them, which has been overlooked in previous research. In this paper, we propose a novel VQA approach from the perspective of utilizing object attributes, aiming to achieve better object-level visual-language alignment and multimodal scene understanding. Specifically, we design an attribute fusion module and a contrastive knowledge distillation module. The attribute fusion module constructs a multimodal graph neural network to fuse attributes and visual features through message passing. The enhanced object-level visual features contribute to solving fine-grained problems such as counting questions. The better object-level visual-language alignment aids in understanding multimodal scenes, thereby improving the model's robustness. Furthermore, to augment scene understanding and out-of-distribution performance, the contrastive knowledge distillation module introduces a series of implicit knowledge. We distill knowledge into attributes through contrastive loss, which further strengthens the representation learning of attribute features and facilitates visual-linguistic alignment. Extensive experiments on six datasets, COCO-QA, VQAv2, VQA-CPv2, VQA-CPv1, VQAvs and TDIUC, show the superiority of the proposed method.
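The two modules described in the abstract, message-passing attribute fusion and contrastive knowledge distillation, can be sketched as follows. The graph construction, dimensions, and gating scheme here are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttributeFusion(nn.Module):
    """One round of message passing on a toy object-attribute graph: each object
    node aggregates messages from its attribute nodes and updates its visual
    feature. Dimensions and the update rule are illustrative assumptions."""
    def __init__(self, dim=256):
        super().__init__()
        self.msg = nn.Linear(dim, dim)
        self.upd = nn.GRUCell(dim, dim)

    def forward(self, obj_feats, attr_feats, adj):
        # adj[i, j] = 1 if attribute j describes object i
        messages = adj @ self.msg(attr_feats) / adj.sum(-1, keepdim=True).clamp(min=1)
        return self.upd(messages, obj_feats)      # attribute-enhanced object features

def contrastive_distill(attr_emb, knowledge_emb, temperature=0.07):
    """InfoNCE-style loss pulling each attribute embedding toward its paired
    implicit-knowledge embedding and pushing apart mismatched pairs."""
    logits = F.normalize(attr_emb, dim=-1) @ F.normalize(knowledge_emb, dim=-1).T / temperature
    return F.cross_entropy(logits, torch.arange(len(attr_emb)))

# toy usage: 5 objects, 12 attribute nodes
objs, attrs = torch.randn(5, 256), torch.randn(12, 256)
adj = (torch.rand(5, 12) > 0.7).float()
fused = AttributeFusion()(objs, attrs, adj)
print(fused.shape, contrastive_distill(torch.randn(5, 256), torch.randn(5, 256)).item())
```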
Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models
Despite the pre-training of vision-and-language models (VLMs) on large-scale datasets of image-text pairs, several recent works have shown that these pre-trained models lack fine-grained understanding, such as the ability to count and to recognize verbs, attributes, or relationships. The focus of this work is to study the ability of these models to understand spatial relations. Previously, this has been tackled using image-text matching (e.g., the Visual Spatial Reasoning benchmark) or visual question answering (e.g., GQA or VQAv2), both showing poor performance and a large gap compared to human performance. In this work, we use explainability tools to better understand the causes of this poor performance and present an alternative fine-grained, compositional approach for ranking spatial clauses. We combine the evidence from grounding noun phrases corresponding to objects and their locations to compute the final rank of the spatial clause. We demonstrate the approach on representative VLMs (such as LXMERT, GPV, and MDETR) and compare and highlight their abilities to reason about spatial relationships.
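The abstract's ranking idea, combining grounding evidence for the two noun phrases with evidence for the spatial relation between them, can be illustrated with a small sketch. The geometric relation checks, thresholds, and score combination below are illustrative assumptions, not the paper's method.

```python
from dataclasses import dataclass

@dataclass
class Box:
    x1: float
    y1: float
    x2: float
    y2: float
    score: float = 1.0   # grounding confidence for the noun phrase in this box

def relation_score(subj: Box, obj: Box, relation: str) -> float:
    """Crude geometric check of a spatial relation between two grounded boxes
    (image coordinates: y grows downward, so 'above' means smaller y)."""
    sx, sy = (subj.x1 + subj.x2) / 2, (subj.y1 + subj.y2) / 2
    ox, oy = (obj.x1 + obj.x2) / 2, (obj.y1 + obj.y2) / 2
    geom = {
        "left of": float(sx < ox),
        "right of": float(sx > ox),
        "above": float(sy < oy),
        "below": float(sy > oy),
    }.get(relation, 0.0)
    # combine grounding evidence for both noun phrases with the geometric check
    return subj.score * obj.score * geom

def rank_clauses(subj: Box, obj: Box, candidate_relations):
    """Rank spatial clauses (e.g. 'cat left of dog') by combined evidence."""
    return sorted(candidate_relations,
                  key=lambda r: relation_score(subj, obj, r), reverse=True)

cat = Box(10, 40, 60, 90, score=0.9)
dog = Box(120, 50, 180, 95, score=0.8)
print(rank_clauses(cat, dog, ["left of", "right of", "above", "below"]))
```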