visual attribution
Explaining multimodal LLMs via intra-modal token interactions
Liang, Jiawei, Chen, Ruoyu, Jiao, Xianghao, Liang, Siyuan, Liu, Shiming, Zhang, Qunli, Hu, Zheng, Cao, Xiaochun
Multimodal Large Language Models (MLLMs) have achieved remarkable success across diverse vision-language tasks, yet their internal decision-making mechanisms remain insufficiently understood. Existing interpretability research has primarily focused on cross-modal attribution, identifying which image regions the model attends to during output generation. However, these approaches often overlook intra-modal dependencies. In the visual modality, attributing importance to isolated image patches ignores spatial context due to limited receptive fields, resulting in fragmented and noisy explanations. In the textual modality, reliance on preceding tokens introduces spurious activations. Failure to mitigate this interference compromises attribution fidelity. To address these limitations, we propose enhancing interpretability by leveraging intra-modal interactions. For the visual branch, we introduce Multi-Scale Explanation Aggregation (MSEA), which aggregates attributions over multi-scale inputs to dynamically adjust receptive fields, producing more holistic and spatially coherent visual explanations. For the textual branch, we propose Activation Ranking Correlation (ARC), which measures the relevance of contextual tokens to the current token via alignment of their top-k prediction rankings. ARC leverages this relevance to suppress spurious activations from irrelevant contexts while preserving semantically coherent ones. Extensive experiments across state-of-the-art MLLMs and benchmark datasets demonstrate that our approach consistently outperforms existing interpretability methods, yielding more faithful and fine-grained explanations of model behavior.
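Both components lend themselves to short sketches. The snippet below is a minimal illustration only, assuming a generic, user-supplied attribution function `attribute_fn(image)` and pre-computed top-k prediction lists per token (both hypothetical stand-ins, not the authors' implementation): a multi-scale aggregation of visual attributions in the spirit of MSEA, and an ARC-style alignment score between two top-k rankings.

```python
import numpy as np
from scipy.ndimage import zoom

def multi_scale_attribution(image, attribute_fn, scales=(0.75, 1.0, 1.25)):
    """MSEA-style aggregation: compute attributions at several input scales,
    resize each map back to the original resolution, and average them."""
    h, w = image.shape[:2]
    maps = []
    for s in scales:
        factors = (s, s, 1) if image.ndim == 3 else (s, s)
        heat = attribute_fn(zoom(image, factors, order=1))             # relevance map at scale s
        heat = zoom(heat, (h / heat.shape[0], w / heat.shape[1]), order=1)
        heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)  # per-scale normalization
        maps.append(heat)
    return np.mean(maps, axis=0)                                       # spatially coherent map

def arc_alignment(topk_a, topk_b):
    """ARC-style relevance proxy: average overlap of two top-k prediction
    rankings over all prefix depths 1..k (1.0 means identical rankings)."""
    k = min(len(topk_a), len(topk_b))
    return sum(len(set(topk_a[:d]) & set(topk_b[:d])) / d for d in range(1, k + 1)) / k
```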
- Europe > Switzerland > Zürich > Zürich (0.14)
- Asia > China > Guangdong Province > Shenzhen (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Artificial Intelligence > Vision (0.95)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.36)
ChartLens: Fine-grained Visual Attribution in Charts
Suri, Manan, Mathur, Puneet, Lipka, Nedim, Dernoncourt, Franck, Rossi, Ryan A., Manocha, Dinesh
The growing capabilities of multimodal large language models (MLLMs) have advanced tasks like chart understanding. However, these models often suffer from hallucinations, where generated text sequences conflict with the provided visual data. To address this, we introduce Post-Hoc Visual Attribution for Charts, which identifies fine-grained chart elements that validate a given chart-associated response. We propose ChartLens, a novel chart attribution algorithm that uses segmentation-based techniques to identify chart objects and employs set-of-marks prompting with MLLMs for fine-grained visual attribution. Additionally, we present ChartVA-Eval, a benchmark with synthetic and real-world charts from diverse domains like finance, policy, and economics, featuring fine-grained attribution annotations. Our evaluations show that ChartLens improves fine-grained attributions by 26-66%.
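A rough sketch of the set-of-marks prompting step is shown below. The region boxes, the `query_mllm` callable, and the prompt wording are placeholders standing in for the segmentation model, MLLM client, and prompt design described above.

```python
from PIL import Image, ImageDraw

def set_of_marks_attribution(chart: Image.Image, regions, response: str, query_mllm):
    """Overlay numbered marks on candidate chart elements and ask an MLLM which
    marks ground the given response. `regions` are (x0, y0, x1, y1) boxes from a
    segmentation step; `query_mllm(image, prompt)` is any image+text client."""
    marked = chart.copy()
    draw = ImageDraw.Draw(marked)
    for i, (x0, y0, x1, y1) in enumerate(regions, start=1):
        draw.rectangle((x0, y0, x1, y1), outline="red", width=2)   # mark the element
        draw.text((x0 + 3, y0 + 3), str(i), fill="red")            # numeric label
    prompt = (
        "Each chart element is labeled with a numeric mark. "
        f"Which marks support the statement: '{response}'? "
        "Answer with a comma-separated list of mark numbers."
    )
    return query_mllm(marked, prompt)   # e.g. "2, 5"
```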
- North America > United States > Maryland > Prince George's County > College Park (0.04)
- Asia (0.04)
- Health & Medicine (0.48)
- Transportation > Air (0.46)
- Information Technology > Visualization (1.00)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Ablation Based Counterfactuals
Diffusion models are a class of generative models that generate high-quality samples, but at present it is difficult to characterize how they depend upon their training data. This difficulty raises scientific and regulatory questions, and is a consequence of the complexity of diffusion models and their sampling process. To analyze this dependence, we introduce Ablation Based Counterfactuals (ABC), a method of performing counterfactual analysis that relies on model ablation rather than model retraining. In our approach, we train independent components of a model on different but overlapping splits of a training set. These components are then combined into a single model, from which the causal influence of any training sample can be removed by ablating a combination of model components. We demonstrate how we can construct a model like this using an ensemble of diffusion models. We then use this model to study the limits of training data attribution by enumerating full counterfactual landscapes, and show that single source attributability diminishes with increasing training data size. Finally, we demonstrate the existence of unattributable samples.
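The ablation bookkeeping behind ABC can be sketched in a few lines. Training is abstracted behind a placeholder `train_component`, and the ensemble and split sizes are illustrative; only the split-membership logic follows the description above.

```python
import random

def make_overlapping_splits(indices, n_components, fraction=0.5, seed=0):
    """Assign each ensemble component a different but overlapping subset of the data."""
    rng = random.Random(seed)
    size = int(fraction * len(indices))
    return [set(rng.sample(indices, size)) for _ in range(n_components)]

def ablate_sample(components, splits, sample_index):
    """Counterfactual 'remove this training sample': drop every component whose
    training split contained it, leaving a sub-ensemble with no causal dependence on it."""
    return [c for c, split in zip(components, splits) if sample_index not in split]

# Example: 8 components, each (elsewhere) trained on half of a 1000-sample set.
splits = make_overlapping_splits(list(range(1000)), n_components=8)
# components = [train_component(split) for split in splits]   # placeholder training step
```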
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
VALD-MD: Visual Attribution via Latent Diffusion for Medical Diagnostics
Siddiqui, Ammar A., Tirunagari, Santosh, Zia, Tehseen, Windridge, David
Visual attribution in medical imaging seeks to make evident the diagnostically relevant components of a medical image, in contrast to the more common detection of diseased tissue deployed in standard machine vision pipelines (which are less straightforwardly interpretable/explainable to clinicians). Here we present a novel generative visual attribution technique that leverages latent diffusion models in combination with domain-specific large language models to generate normal counterparts of abnormal images. The discrepancy between the two then yields a map indicating the diagnostically relevant image components. To achieve this, we deploy image priors in conjunction with appropriate conditioning mechanisms to control the image generative process, including natural language text prompts drawn from medical science and applied radiology. We perform experiments and quantitatively evaluate our results on the COVID-19 Radiography Database, which contains labelled chest X-rays with differing pathologies, using the Fréchet Inception Distance (FID), Structural Similarity (SSIM), and Multi-Scale Structural Similarity (MS-SSIM) metrics computed between real and generated images. The resulting system also exhibits a range of latent capabilities, including zero-shot localized disease induction, which are evaluated with real examples from the CheXpert dataset.
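The attribution step itself reduces to a discrepancy map between an abnormal image and its generated normal counterpart; a minimal sketch follows. The `generate_normal_counterpart` call is a placeholder for the conditioned latent-diffusion pipeline, grayscale inputs are assumed, and SSIM is shown only as one of the evaluation metrics listed above.

```python
import numpy as np
from skimage.metrics import structural_similarity

def visual_attribution_map(abnormal: np.ndarray, normal_counterpart: np.ndarray) -> np.ndarray:
    """Pixelwise discrepancy between an abnormal image and its generated normal
    counterpart, normalized to [0, 1]; high values mark diagnostically relevant regions."""
    diff = np.abs(abnormal.astype(np.float32) - normal_counterpart.astype(np.float32))
    return (diff - diff.min()) / (diff.max() - diff.min() + 1e-8)

def ssim_to_counterpart(real: np.ndarray, generated: np.ndarray) -> float:
    """One of the reported image-similarity metrics between real and generated images."""
    return structural_similarity(real, generated, data_range=float(real.max() - real.min()))

# normal = generate_normal_counterpart(abnormal, prompt="no acute findings")  # placeholder
# heatmap = visual_attribution_map(abnormal, normal)
```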
- North America > Canada > Quebec > Capitale-Nationale Region > Québec (0.04)
- North America > Canada > Quebec > Capitale-Nationale Region > Quebec City (0.04)
- Asia > Singapore (0.04)
- (7 more...)
- Health & Medicine > Therapeutic Area (1.00)
- Health & Medicine > Diagnostic Medicine > Imaging (1.00)
- Information Technology > Sensing and Signal Processing > Image Processing (1.00)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.69)
MAEA: Multimodal Attribution for Embodied AI
Jain, Vidhi, Tamarapalli, Jayant Sravan, Yerramilli, Sahiti, Bisk, Yonatan
Understanding multimodal perception for embodied AI (EAI) is an open question because multimodal inputs may contain information that is highly complementary as well as redundant for the task. A relevant direction for multimodal policies is understanding the global trends of each modality at the fusion layer. To this end, we disentangle the attributions for visual, language, and previous-action inputs across different policies trained on the ALFRED dataset. Attribution analysis can be used to rank and group failure scenarios, investigate modeling and dataset biases, and critically analyze multimodal EAI policies for robustness and user trust before deployment. We present MAEA, a framework for computing global attributions per modality for any differentiable policy. In addition, we show how language and visual attributions enable lower-level behavior analysis in EAI policies.
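A minimal sketch of per-modality global attribution for a differentiable policy is given below. The policy interface (separate vision, language, and previous-action feature tensors) and the gradient-magnitude aggregation are illustrative assumptions, not the MAEA implementation itself.

```python
import torch

def modality_attributions(policy, vision_feat, lang_feat, prev_action_feat):
    """Relative contribution of each modality to the policy's chosen action,
    measured by gradient magnitude at the (assumed) modality-specific inputs."""
    feats = {"vision": vision_feat, "language": lang_feat, "prev_action": prev_action_feat}
    for f in feats.values():
        f.requires_grad_(True)                     # leaf feature tensors assumed
    logits = policy(**feats)                       # policy fuses modalities internally
    score = logits.max(dim=-1).values.sum()        # attribute the argmax action
    grads = torch.autograd.grad(score, list(feats.values()))
    raw = {name: g.abs().sum().item() for name, g in zip(feats, grads)}
    total = sum(raw.values()) or 1.0
    return {name: v / total for name, v in raw.items()}   # sums to 1 across modalities
```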
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- Europe > Italy > Marche > Ancona Province > Ancona (0.04)
- Research Report (0.40)
- Overview (0.34)
- Information Technology > Artificial Intelligence > Robots (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.69)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)
Weakly-Supervised Action Localization and Action Recognition using Global-Local Attention of 3D CNN
Yudistira, Novanto, Kavitha, Muthu Subash, Kurita, Takio
A 3D Convolutional Neural Network (3D CNN) captures spatial and temporal information in 3D data such as video sequences. However, due to the convolution and pooling mechanisms, some information loss seems unavoidable. To improve visual explanations and classification in 3D CNNs, we propose two approaches: i) aggregating layer-wise global-to-local (global-local) discrete gradients using a trained 3DResNext network, and ii) implementing an attention gating network to improve the accuracy of action recognition. The proposed approach aims to show the usefulness of every layer, termed global-local attention, in 3D CNNs via visual attribution, weakly-supervised action localization, and action recognition. First, the 3DResNext is trained and applied to action classification using backpropagation with respect to the maximum predicted class. The gradients and activations of every layer are then up-sampled. Aggregation is then used to produce more nuanced attention that highlights the most critical parts of the input videos for the predicted class. We apply contour thresholding to the final attention map for localization. We evaluate spatial and temporal action localization in trimmed videos using fine-grained visual explanations via 3DCam. Experimental results show that the proposed approach produces informative visual explanations and discriminative attention. Furthermore, action recognition via attention gating on each layer produces better classification results than the baseline model.
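Read as described above, the layer-wise aggregation resembles Grad-CAM-style maps computed per layer, up-sampled to a common spatio-temporal size, and averaged; the sketch below follows that reading. Capturing activations and gradients with forward/backward hooks on the 3DResNext backbone is assumed and not shown.

```python
import torch
import torch.nn.functional as F

def global_local_attention(activations, gradients, out_size):
    """activations/gradients: per-layer lists of tensors shaped (1, C, T, H, W);
    out_size: target (T, H, W) of the input clip. Returns an aggregated attention volume."""
    maps = []
    for act, grad in zip(activations, gradients):
        weights = grad.mean(dim=(2, 3, 4), keepdim=True)         # global-average-pooled gradients
        cam = F.relu((weights * act).sum(dim=1, keepdim=True))   # (1, 1, T, H, W) per-layer map
        cam = F.interpolate(cam, size=out_size, mode="trilinear", align_corners=False)
        cam = cam / (cam.amax() + 1e-8)                          # per-layer normalization
        maps.append(cam)
    return torch.stack(maps).mean(dim=0)                         # layer-aggregated attention
```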
- Asia > Japan > Honshū > Chūgoku > Hiroshima Prefecture > Hiroshima (0.05)
- North America > United States > Utah > Salt Lake County > Salt Lake City (0.04)
- Asia > Indonesia (0.04)