visual attribution
Explaining multimodal LLMs via intra-modal token interactions
Liang, Jiawei, Chen, Ruoyu, Jiao, Xianghao, Liang, Siyuan, Liu, Shiming, Zhang, Qunli, Hu, Zheng, Cao, Xiaochun
Multimodal Large Language Models (MLLMs) have achieved remarkable success across diverse vision-language tasks, yet their internal decision-making mechanisms remain insufficiently understood. Existing interpretability research has primarily focused on cross-modal attribution, identifying which image regions the model attends to during output generation. However, these approaches often overlook intra-modal dependencies. In the visual modality, attributing importance to isolated image patches ignores spatial context due to limited receptive fields, resulting in fragmented and noisy explanations. In the textual modality, reliance on preceding tokens introduces spurious activations. Failure to mitigate this interference compromises attribution fidelity. To address these limitations, we propose enhancing interpretability by leveraging intra-modal interactions. For the visual branch, we introduce Multi-Scale Explanation Aggregation (MSEA), which aggregates attributions over multi-scale inputs to dynamically adjust receptive fields, producing more holistic and spatially coherent visual explanations. For the textual branch, we propose Activation Ranking Correlation (ARC), which measures the relevance of contextual tokens to the current token via alignment of their top-k prediction rankings. ARC leverages this relevance to suppress spurious activations from irrelevant contexts while preserving semantically coherent ones. Extensive experiments across state-of-the-art MLLMs and benchmark datasets demonstrate that our approach consistently outperforms existing interpretability methods, yielding more faithful and fine-grained explanations of model behavior.
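Both components lend themselves to short sketches. The snippet below is a minimal illustration only, assuming a generic, user-supplied attribution function `attribute_fn(image)` and pre-computed top-k prediction lists per token (both hypothetical stand-ins, not the authors' implementation): a multi-scale aggregation of visual attributions in the spirit of MSEA, and an ARC-style alignment score between two top-k rankings.

```python
import numpy as np
from scipy.ndimage import zoom

def multi_scale_attribution(image, attribute_fn, scales=(0.75, 1.0, 1.25)):
    """MSEA-style aggregation: compute attributions at several input scales,
    resize each map back to the original resolution, and average them."""
    h, w = image.shape[:2]
    maps = []
    for s in scales:
        factors = (s, s, 1) if image.ndim == 3 else (s, s)
        heat = attribute_fn(zoom(image, factors, order=1))             # relevance map at scale s
        heat = zoom(heat, (h / heat.shape[0], w / heat.shape[1]), order=1)
        heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)  # per-scale normalization
        maps.append(heat)
    return np.mean(maps, axis=0)                                       # spatially coherent map

def arc_alignment(topk_a, topk_b):
    """ARC-style relevance proxy: average overlap of two top-k prediction
    rankings over all prefix depths 1..k (1.0 means identical rankings)."""
    k = min(len(topk_a), len(topk_b))
    return sum(len(set(topk_a[:d]) & set(topk_b[:d])) / d for d in range(1, k + 1)) / k
```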
- Europe > Switzerland > Zürich > Zürich (0.14)
- Asia > China > Guangdong Province > Shenzhen (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Artificial Intelligence > Vision (0.95)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.36)
ChartLens: Fine-grained Visual Attribution in Charts
Suri, Manan, Mathur, Puneet, Lipka, Nedim, Dernoncourt, Franck, Rossi, Ryan A., Manocha, Dinesh
The growing capabilities of multimodal large language models (MLLMs) have advanced tasks like chart understanding. However, these models often suffer from hallucinations, where generated text sequences conflict with the provided visual data. To address this, we introduce Post-Hoc Visual Attribution for Charts, which identifies fine-grained chart elements that validate a given chart-associated response. We propose ChartLens, a novel chart attribution algorithm that uses segmentation-based techniques to identify chart objects and employs set-of-marks prompting with MLLMs for fine-grained visual attribution. Additionally, we present ChartVA-Eval, a benchmark with synthetic and real-world charts from diverse domains like finance, policy, and economics, featuring fine-grained attribution annotations. Our evaluations show that ChartLens improves fine-grained attributions by 26-66%.
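A rough sketch of the set-of-marks prompting step is shown below. The region boxes, the `query_mllm` callable, and the prompt wording are placeholders standing in for the segmentation model, MLLM client, and prompt design described above.

```python
from PIL import Image, ImageDraw

def set_of_marks_attribution(chart: Image.Image, regions, response: str, query_mllm):
    """Overlay numbered marks on candidate chart elements and ask an MLLM which
    marks ground the given response. `regions` are (x0, y0, x1, y1) boxes from a
    segmentation step; `query_mllm(image, prompt)` is any image+text client."""
    marked = chart.copy()
    draw = ImageDraw.Draw(marked)
    for i, (x0, y0, x1, y1) in enumerate(regions, start=1):
        draw.rectangle((x0, y0, x1, y1), outline="red", width=2)   # mark the element
        draw.text((x0 + 3, y0 + 3), str(i), fill="red")            # numeric label
    prompt = (
        "Each chart element is labeled with a numeric mark. "
        f"Which marks support the statement: '{response}'? "
        "Answer with a comma-separated list of mark numbers."
    )
    return query_mllm(marked, prompt)   # e.g. "2, 5"
```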
- North America > United States > Maryland > Prince George's County > College Park (0.04)
- Asia (0.04)
- Health & Medicine (0.48)
- Transportation > Air (0.46)
- Information Technology > Visualization (1.00)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Ablation Based Counterfactuals
Diffusion models are a class of generative models that generate high-quality samples, but at present it is difficult to characterize how they depend upon their training data. This difficulty raises scientific and regulatory questions, and is a consequence of the complexity of diffusion models and their sampling process. To analyze this dependence, we introduce Ablation Based Counterfactuals (ABC), a method of performing counterfactual analysis that relies on model ablation rather than model retraining. In our approach, we train independent components of a model on different but overlapping splits of a training set. These components are then combined into a single model, from which the causal influence of any training sample can be removed by ablating a combination of model components. We demonstrate how we can construct a model like this using an ensemble of diffusion models. We then use this model to study the limits of training data attribution by enumerating full counterfactual landscapes, and show that single source attributability diminishes with increasing training data size. Finally, we demonstrate the existence of unattributable samples.
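The ablation bookkeeping behind ABC can be sketched in a few lines. Training is abstracted behind a placeholder `train_component`, and the ensemble and split sizes are illustrative; only the split-membership logic follows the description above.

```python
import random

def make_overlapping_splits(indices, n_components, fraction=0.5, seed=0):
    """Assign each ensemble component a different but overlapping subset of the data."""
    rng = random.Random(seed)
    size = int(fraction * len(indices))
    return [set(rng.sample(indices, size)) for _ in range(n_components)]

def ablate_sample(components, splits, sample_index):
    """Counterfactual 'remove this training sample': drop every component whose
    training split contained it, leaving a sub-ensemble with no causal dependence on it."""
    return [c for c, split in zip(components, splits) if sample_index not in split]

# Example: 8 components, each (elsewhere) trained on half of a 1000-sample set.
splits = make_overlapping_splits(list(range(1000)), n_components=8)
# components = [train_component(split) for split in splits]   # placeholder training step
```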
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
VALD-MD: Visual Attribution via Latent Diffusion for Medical Diagnostics
Siddiqui, Ammar A., Tirunagari, Santosh, Zia, Tehseen, Windridge, David
Visual attribution in medical imaging seeks to make evident the diagnostically relevant components of a medical image, in contrast to the more common detection of diseased tissue deployed in standard machine vision pipelines (which are less straightforwardly interpretable/explainable to clinicians). Here we present a novel generative visual attribution technique that leverages latent diffusion models in combination with domain-specific large language models to generate normal counterparts of abnormal images. The discrepancy between the two then yields a map indicating the diagnostically relevant image components. To achieve this, we deploy image priors in conjunction with appropriate conditioning mechanisms to control the image generative process, including natural language text prompts drawn from medical science and applied radiology. We perform experiments and quantitatively evaluate our results on the COVID-19 Radiography Database, which contains labelled chest X-rays with differing pathologies, using the Fréchet Inception Distance (FID), Structural Similarity (SSIM), and Multi-Scale Structural Similarity (MS-SSIM) metrics computed between real and generated images. The resulting system also exhibits a range of latent capabilities, including zero-shot localized disease induction, which are evaluated with real examples from the CheXpert dataset.
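The attribution step itself reduces to a discrepancy map between an abnormal image and its generated normal counterpart; a minimal sketch follows. The `generate_normal_counterpart` call is a placeholder for the conditioned latent-diffusion pipeline, grayscale inputs are assumed, and SSIM is shown only as one of the evaluation metrics listed above.

```python
import numpy as np
from skimage.metrics import structural_similarity

def visual_attribution_map(abnormal: np.ndarray, normal_counterpart: np.ndarray) -> np.ndarray:
    """Pixelwise discrepancy between an abnormal image and its generated normal
    counterpart, normalized to [0, 1]; high values mark diagnostically relevant regions."""
    diff = np.abs(abnormal.astype(np.float32) - normal_counterpart.astype(np.float32))
    return (diff - diff.min()) / (diff.max() - diff.min() + 1e-8)

def ssim_to_counterpart(real: np.ndarray, generated: np.ndarray) -> float:
    """One of the reported image-similarity metrics between real and generated images."""
    return structural_similarity(real, generated, data_range=float(real.max() - real.min()))

# normal = generate_normal_counterpart(abnormal, prompt="no acute findings")  # placeholder
# heatmap = visual_attribution_map(abnormal, normal)
```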
- North America > Canada > Quebec > Capitale-Nationale Region > Québec (0.04)
- North America > Canada > Quebec > Capitale-Nationale Region > Quebec City (0.04)
- Asia > Singapore (0.04)
- (7 more...)
- Health & Medicine > Therapeutic Area (1.00)
- Health & Medicine > Diagnostic Medicine > Imaging (1.00)
- Information Technology > Sensing and Signal Processing > Image Processing (1.00)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.69)
MAEA: Multimodal Attribution for Embodied AI
Jain, Vidhi, Tamarapalli, Jayant Sravan, Yerramilli, Sahiti, Bisk, Yonatan
Understanding multimodal perception for embodied AI (EAI) is an open question because multimodal inputs may contain information that is highly complementary as well as redundant for the task. A relevant direction for multimodal policies is understanding the global trends of each modality at the fusion layer. To this end, we disentangle the attributions for visual, language, and previous-action inputs across different policies trained on the ALFRED dataset. Attribution analysis can be used to rank and group failure scenarios, investigate modeling and dataset biases, and critically analyze multimodal EAI policies for robustness and user trust before deployment. We present MAEA, a framework for computing global attributions per modality for any differentiable policy. In addition, we show how language and visual attributions enable lower-level behavior analysis in EAI policies.
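A minimal sketch of per-modality global attribution for a differentiable policy is given below. The policy interface (separate vision, language, and previous-action feature tensors) and the gradient-magnitude aggregation are illustrative assumptions, not the MAEA implementation itself.

```python
import torch

def modality_attributions(policy, vision_feat, lang_feat, prev_action_feat):
    """Relative contribution of each modality to the policy's chosen action,
    measured by gradient magnitude at the (assumed) modality-specific inputs."""
    feats = {"vision": vision_feat, "language": lang_feat, "prev_action": prev_action_feat}
    for f in feats.values():
        f.requires_grad_(True)                     # leaf feature tensors assumed
    logits = policy(**feats)                       # policy fuses modalities internally
    score = logits.max(dim=-1).values.sum()        # attribute the argmax action
    grads = torch.autograd.grad(score, list(feats.values()))
    raw = {name: g.abs().sum().item() for name, g in zip(feats, grads)}
    total = sum(raw.values()) or 1.0
    return {name: v / total for name, v in raw.items()}   # sums to 1 across modalities
```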
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- Europe > Italy > Marche > Ancona Province > Ancona (0.04)
- Research Report (0.40)
- Overview (0.34)
- Information Technology > Artificial Intelligence > Robots (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.69)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)
Weakly-Supervised Action Localization and Action Recognition using Global-Local Attention of 3D CNN
Yudistira, Novanto, Kavitha, Muthu Subash, Kurita, Takio
A 3D Convolutional Neural Network (3D CNN) captures spatial and temporal information in 3D data such as video sequences. However, due to the convolution and pooling mechanisms, some information loss seems unavoidable. To improve visual explanations and classification in 3D CNNs, we propose two approaches: i) aggregating layer-wise global-to-local (global-local) discrete gradients using a trained 3DResNext network, and ii) implementing an attention gating network to improve the accuracy of action recognition. The proposed approach aims to show the usefulness of every layer, termed global-local attention, in 3D CNNs via visual attribution, weakly-supervised action localization, and action recognition. First, the 3DResNext is trained and applied to action classification using backpropagation with respect to the maximum predicted class. The gradients and activations of every layer are then up-sampled. Aggregation is then used to produce more nuanced attention that highlights the most critical parts of the input videos for the predicted class. We apply contour thresholding to the final attention map for localization. We evaluate spatial and temporal action localization in trimmed videos using fine-grained visual explanations via 3DCam. Experimental results show that the proposed approach produces informative visual explanations and discriminative attention. Furthermore, action recognition via attention gating on each layer produces better classification results than the baseline model.
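Read as described above, the layer-wise aggregation resembles Grad-CAM-style maps computed per layer, up-sampled to a common spatio-temporal size, and averaged; the sketch below follows that reading. Capturing activations and gradients with forward/backward hooks on the 3DResNext backbone is assumed and not shown.

```python
import torch
import torch.nn.functional as F

def global_local_attention(activations, gradients, out_size):
    """activations/gradients: per-layer lists of tensors shaped (1, C, T, H, W);
    out_size: target (T, H, W) of the input clip. Returns an aggregated attention volume."""
    maps = []
    for act, grad in zip(activations, gradients):
        weights = grad.mean(dim=(2, 3, 4), keepdim=True)         # global-average-pooled gradients
        cam = F.relu((weights * act).sum(dim=1, keepdim=True))   # (1, 1, T, H, W) per-layer map
        cam = F.interpolate(cam, size=out_size, mode="trilinear", align_corners=False)
        cam = cam / (cam.amax() + 1e-8)                          # per-layer normalization
        maps.append(cam)
    return torch.stack(maps).mean(dim=0)                         # layer-aggregated attention
```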
- Asia > Japan > Honshū > Chūgoku > Hiroshima Prefecture > Hiroshima (0.05)
- North America > United States > Utah > Salt Lake County > Salt Lake City (0.04)
- Asia > Indonesia (0.04)