image information
- Research Report > Promising Solution (0.46)
- Research Report > New Finding (0.46)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.71)
Prune and distill: similar reformatting of image information along rat visual cortex and deep neural networks
Visual object recognition has been extensively studied in both neuroscience and computer vision. Recently, the most popular class of artificial systems for this task, deep convolutional neural networks (CNNs), has been shown to provide excellent models for its functional analogue in the brain, the ventral stream in visual cortex. This has prompted questions on what, if any, are the common principles underlying the reformatting of visual information as it flows through a CNN or the ventral stream. Here we consider some prominent statistical patterns that are known to exist in the internal representations of either CNNs or the visual cortex and look for them in the other system. We show that intrinsic dimensionality (ID) of object representations along the rat homologue of the ventral stream presents two distinct expansion-contraction phases, as previously shown for CNNs. Conversely, in CNNs, we show that training results in both distillation and active pruning (mirroring the increase in ID) of low- to middle-level image information in single units, as representations gain the ability to support invariant discrimination, in agreement with previous observations in rat visual cortex. Taken together, our findings suggest that CNNs and visual cortex share a similarly tight relationship between dimensionality expansion/reduction of object representations and reformatting of image information.
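The abstract does not spell out how intrinsic dimensionality is estimated; a minimal sketch of the TwoNN estimator, a common choice for this kind of analysis and shown here only as an illustrative stand-in, infers the ID of a point cloud from the ratio of each point's second to first nearest-neighbour distance:

```python
import numpy as np

def two_nn_id(X):
    """Estimate intrinsic dimensionality with the TwoNN method.

    For each point, compute mu = r2 / r1, the ratio of its second to
    first nearest-neighbour distance; under a locally uniform density,
    mu follows a Pareto law whose shape parameter is the ID, giving
    the maximum-likelihood estimate N / sum(log mu)."""
    # pairwise Euclidean distances between all rows of X
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)     # a point is not its own neighbour
    d.sort(axis=1)
    mu = d[:, 1] / d[:, 0]          # second-NN over first-NN distance
    return len(mu) / np.sum(np.log(mu))
```

Applied to the population activity of a visual area or a CNN layer (one row per stimulus), this yields the per-stage ID whose expansion-contraction profile the paper tracks.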
- Europe > Italy > Friuli Venezia Giulia > Trieste Province > Trieste (0.04)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- North America > United States > New Jersey > Mercer County > Princeton (0.04)
- Information Technology > Sensing and Signal Processing > Image Processing (1.00)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
- Information Technology > Artificial Intelligence > Cognitive Science > Neuroscience (0.68)
- North America > United States > North Carolina (0.05)
- Antarctica (0.04)
- Europe (0.04)
- Research Report > New Finding (0.67)
- Research Report > Promising Solution (0.45)
- Health & Medicine (0.46)
- Education (0.46)
- Information Technology > Security & Privacy (0.45)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.72)
Interpreting Attention Heads for Image-to-Text Information Flow in Large Vision-Language Models
Kim, Jinyeong; Kang, Seil; Park, Jiwoo; Kim, Junhyeok; Hwang, Seong Jae
Large Vision-Language Models (LVLMs) answer visual questions by transferring information from images to text through a series of attention heads. While this image-to-text information flow is central to visual question answering, its underlying mechanism remains difficult to interpret due to the simultaneous operation of numerous attention heads. To address this challenge, we propose head attribution, a technique inspired by component attribution methods, to identify consistent patterns among attention heads that play a key role in information transfer. Using head attribution, we investigate how LVLMs rely on specific attention heads to identify and answer questions about the main object in an image. Our analysis reveals that a distinct subset of attention heads facilitates the image-to-text information flow. Remarkably, we find that the selection of these heads is governed by the semantic content of the input image rather than its visual appearance. We further examine the flow of information at the token level and discover that (1) text information first propagates to role-related tokens and the final token before receiving image information, and (2) image information is embedded in both object-related and background tokens. Our work provides evidence that image-to-text information flow follows a structured process, and that analysis at the attention-head level offers a promising direction toward understanding the mechanisms of LVLMs.
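The abstract describes head attribution only at a high level; a minimal sketch under the assumption (not stated in the paper) that each head's contribution to the residual stream at the answer position is available, scores a head by how much ablating it lowers the answer-token logit:

```python
import numpy as np

def head_attribution(head_outputs, readout, answer_idx):
    """Score attention heads by their contribution to the answer logit.

    head_outputs: (n_heads, d_model) per-head additions to the residual
                  stream at the final token position.
    readout:      (d_model, vocab) unembedding matrix.
    Each head's score is the drop in the answer-token logit when that
    head's output is zeroed out (ablated)."""
    resid = head_outputs.sum(axis=0)            # combined residual update
    full_logit = (resid @ readout)[answer_idx]  # logit with all heads
    scores = []
    for h in range(head_outputs.shape[0]):
        ablated_logit = ((resid - head_outputs[h]) @ readout)[answer_idx]
        scores.append(full_logit - ablated_logit)
    return np.array(scores)  # large score = head carries answer information
```

Ranking heads by this score would surface the small subset that the paper finds responsible for the image-to-text information flow; the names and tensor shapes here are illustrative assumptions, not the authors' API.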
Line of Sight: On Linear Representations in VLLMs
Rajaram, Achyuta; Schwettmann, Sarah; Andreas, Jacob; Conmy, Arthur
Language models can be equipped with multimodal capabilities by fine-tuning on embeddings of visual inputs. But how do such multimodal models represent images in their hidden activations? We explore representations of image concepts within LLaVA-Next, a popular open-source VLLM. We find a diverse set of ImageNet classes represented via linearly decodable features in the residual stream. We show that the features are causal by performing targeted edits on the model output. In order to increase the diversity of the studied linear features, we train multimodal Sparse Autoencoders (SAEs), creating a highly interpretable dictionary of text and image features. We find that although model representations across modalities are quite disjoint, they become increasingly shared in deeper layers.
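"Linearly decodable" here means a linear probe trained on residual-stream activations can recover the image class; a minimal sketch of such a probe, using least-squares one-vs-rest regression as an illustrative stand-in for whatever probe the authors actually use:

```python
import numpy as np

def fit_linear_probe(acts, labels, n_classes):
    """Fit a one-vs-rest linear probe on residual-stream activations.

    acts:   (n_samples, d_model) activations at some layer/position.
    labels: (n_samples,) integer class labels (e.g. ImageNet classes).
    Returns a (d_model + 1, n_classes) weight matrix (last row = bias)."""
    X = np.hstack([acts, np.ones((len(acts), 1))])  # append bias column
    Y = np.eye(n_classes)[labels]                   # one-hot targets
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W

def probe_accuracy(W, acts, labels):
    """A concept counts as linearly decodable when this accuracy is
    high on held-out activations."""
    X = np.hstack([acts, np.ones((len(acts), 1))])
    return float((np.argmax(X @ W, axis=1) == labels).mean())
```

High held-out accuracy establishes decodability; the paper's further step, editing the model output along these directions, is what establishes that the features are causal rather than merely correlated.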
- Pacific Ocean > North Pacific Ocean > San Francisco Bay > Golden Gate (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Europe > Latvia > Lubāna Municipality > Lubāna (0.04)
SalM$^{2}$: An Extremely Lightweight Saliency Mamba Model for Real-Time Cognitive Awareness of Driver Attention
Zhao, Chunyu; Mu, Wentao; Zhou, Xian; Liu, Wenbo; Yan, Fei; Deng, Tao
Driver attention recognition in driving scenarios is a popular direction in traffic scene perception technology. It aims to understand which targets/objects in the driving scene human drivers focus their attention on. However, traffic scenes contain not only a large amount of visual information but also semantic information related to driving tasks, and existing methods pay little attention to the actual semantic information present in driving scenes. Additionally, a traffic scene is a complex and dynamic process that requires constant attention to the objects relevant to the current driving task. Existing models, constrained by their foundational frameworks, tend to have large parameter counts and complex structures. Therefore, this paper proposes a real-time saliency network based on the latest Mamba framework. As shown in Figure 1, our model uses very few parameters (0.08M, only 0.09%-11.16% of the parameter counts of other models), while maintaining SOTA performance or achieving over 98% of the SOTA model's performance.
- Oceania > Australia (0.14)
- Asia > China > Sichuan Province (0.14)
Reviews: Hierarchical Question-Image Co-Attention for Visual Question Answering
The paper presents an incremental contribution with respect to previous methods for VQA, which only exploit an image attention mechanism guided by question data. Here, the authors also consider a question attention mechanism guided by image information. The main hypothesis of this work is thus that jointly considering visual and question attention mechanisms can improve the performance of current VQA systems. I agree that this hypothesis can be relevant for the case of long questions, but I believe there is also a risk that question-based attention guided by image information can be misleading: an image usually includes several information sources, while the question is more focused. In Figure 3, the authors include a graph that shows the impact of question length on performance; while this figure seems to show a tendency, the effect is still weak, and a numerical analysis could help support this point. I also believe that an analysis of the differences (not only in question length) between the most common errors of previous works (image attention only) and those of the proposed approach (image and question attention) would help support the relevance of the proposed attention mechanism.