AITopics | multimodal transformer

Brain encoding models based on multimodal transformers can transfer across language and vision

Neural Information Processing Systems | May-1-2026, 03:37:22 GMT

Encoding models have been used to assess how the human brain represents concepts in language and vision. While language and vision rely on similar concept representations, current encoding models are typically trained and tested on brain responses to each modality in isolation. Recent advances in multimodal pretraining have produced transformers that can extract aligned representations of concepts in language and vision. In this work, we used representations from multimodal transformers to train encoding models that can transfer across fMRI responses to stories and movies. We found that encoding models trained on brain responses to one modality can successfully predict brain responses to the other modality, particularly in cortical regions that represent conceptual meaning. Further analysis of these encoding models revealed shared semantic dimensions that underlie concept representations in language and vision. Comparing encoding models trained using representations from multimodal and unimodal transformers, we found that multimodal transformers learn more aligned representations of concepts in language and vision. Our results demonstrate how multimodal transformers can provide insights into the brain's capacity for multimodal processing.
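The cross-modal transfer described in the abstract can be illustrated with a minimal sketch. It assumes a linear (ridge) encoding model and uses purely synthetic data: the feature matrices stand in for multimodal-transformer representations of stories and movies, and the "voxel" responses are simulated from a shared mapping. The function names, the regularization value, and the array sizes are all illustrative assumptions and are not taken from the paper.

```python
import numpy as np


def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^-1 X^T Y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ Y)


def voxelwise_correlation(Y_true, Y_pred):
    """Pearson correlation between measured and predicted responses, per voxel."""
    yt = Y_true - Y_true.mean(axis=0)
    yp = Y_pred - Y_pred.mean(axis=0)
    denom = np.linalg.norm(yt, axis=0) * np.linalg.norm(yp, axis=0)
    return (yt * yp).sum(axis=0) / denom


rng = np.random.default_rng(0)
n_dims, n_voxels = 16, 50

# Synthetic stand-ins (assumption): one shared feature-to-voxel mapping plays
# the role of a modality-independent concept representation.
W_shared = rng.normal(size=(n_dims, n_voxels))
X_story = rng.normal(size=(400, n_dims))   # hypothetical story features
X_movie = rng.normal(size=(200, n_dims))   # hypothetical movie features
Y_story = X_story @ W_shared + 0.5 * rng.normal(size=(400, n_voxels))
Y_movie = X_movie @ W_shared + 0.5 * rng.normal(size=(200, n_voxels))

# Train on responses to one modality, evaluate on responses to the other.
W_hat = fit_ridge(X_story, Y_story, alpha=10.0)
transfer_scores = voxelwise_correlation(Y_movie, X_movie @ W_hat)
```

In this toy setting transfer works by construction, because both response sets are generated from the same weights. With real data, the per-voxel transfer score is the quantity that would indicate where a shared representation is plausible.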

artificial intelligence, machine learning, natural language, (20 more...)

Neural Information Processing Systems

Genre: Research Report > New Finding (1.00)

Industry:

Health & Medicine > Therapeutic Area > Neurology (1.00)
Health & Medicine > Health Care Technology (0.92)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Vision (0.95)
Information Technology > Artificial Intelligence > Cognitive Science > Neuroscience (0.66)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.46)

3c8a49145944fed2bbcaade178a426c4-Paper.pdf

Neural Information Processing Systems | Apr-25-2026, 12:57:58 GMT

artificial intelligence, machine learning, natural language, (19 more...)

Neural Information Processing Systems

Country: North America > United States (0.28)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Vision (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

abb4847bbd60f38b1b7649d26c7a0067-Supplemental-Conference.pdf

Neural Information Processing Systems | Feb-16-2026, 11:43:25 GMT

modality, modality combination, unseen modality combination, (14 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Brain encoding models based on multimodal transformers can transfer across language and vision

Neural Information Processing Systems | Feb-12-2026, 11:25:35 GMT

Encoding models have been used to assess how the human brain represents concepts in language and vision.

artificial intelligence, machine learning, natural language, (20 more...)

Neural Information Processing Systems

Genre: Research Report > New Finding (0.94)

Industry:

Health & Medicine > Therapeutic Area > Neurology (1.00)
Health & Medicine > Health Care Technology (0.72)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Vision (0.95)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.68)
(2 more...)

3c8a49145944fed2bbcaade178a426c4-Paper.pdf

Neural Information Processing Systems | Feb-8-2026, 07:25:40 GMT

instruction, navigation, transformer, (15 more...)

Neural Information Processing Systems

Country:

North America > United States > Oregon (0.04)
Asia > India (0.04)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Vision (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

Brain encoding models based on multimodal transformers can transfer across language and vision

Neural Information Processing Systems | Dec-25-2025, 13:48:06 GMT

language and vision, multimodal transformer, representation, (5 more...)

Neural Information Processing Systems

Genre: Research Report > New Finding (0.60)

Industry: Health & Medicine (0.60)

Technology: Information Technology > Artificial Intelligence (0.40)

abb4847bbd60f38b1b7649d26c7a0067-Supplemental-Conference.pdf

Neural Information Processing Systems | Oct-9-2025, 04:24:37 GMT

In Table 4 in the main paper, we summarized the multimedia retrieval results with the Mean Rank (MnR) averaged between video-to-text and text-to-video. In the main paper, we divide the tokens in half for the dual branches. Here, we test the model's performance with different partition strategies on EPIC-Kitchens and report the results in Table 6c. Default settings are shaded in gray. We further verify this claim by computing the feature distance between modalities on EPIC-Kitchens with the variants of our model used in Table 2 of the main paper.

artificial intelligence, machine learning, modality combination, (16 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Multimodal Transformers are Hierarchical Modal-wise Heterogeneous Graphs

Jin, Yijie, Peng, Junjie, Lin, Xuanchao, Yuan, Haochen, Wang, Lan, Zheng, Cangzhi

arXiv.org Artificial Intelligence | Aug-25-2025

Multimodal Sentiment Analysis (MSA) is a rapidly developing field that integrates multimodal information to recognize sentiments, and existing models have made significant progress in this area. The central challenge in MSA is multimodal fusion, which is predominantly addressed by Multimodal Transformers (MulTs). Although they serve as the prevailing paradigm, MulTs suffer from efficiency concerns. In this work, from the perspective of efficiency optimization, we propose and prove that MulTs are hierarchical modal-wise heterogeneous graphs (HMHGs), and we introduce the graph-structured representation pattern of MulTs. Based on this pattern, we propose an Interlaced Mask (IM) mechanism to design the Graph-Structured and Interlaced-Masked Multimodal Transformer (GsiT). GsiT is formally equivalent to MulTs and, through IM, achieves an efficient weight-sharing mechanism without information disorder, enabling All-Modal-In-One fusion with only 1/3 of the parameters of pure MulTs. A Triton kernel called Decomposition is implemented to avoid additional computational overhead. Moreover, GsiT achieves significantly higher performance than traditional MulTs. To further validate the effectiveness of GsiT itself and the HMHG concept, we integrate them into multiple state-of-the-art models and demonstrate notable performance improvements and parameter reduction on widely used MSA datasets.

artificial intelligence, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2505.01068

Country:

Europe (0.93)
North America > United States > Minnesota (0.28)

Genre:

Research Report > New Finding (0.46)
Research Report > Promising Solution (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

IsoNet: Causal Analysis of Multimodal Transformers for Neuromuscular Gesture Classification

Tyacke, Eion, Gupta, Kunal, Patel, Jay, Li, Rui

arXiv.org Artificial Intelligence | Jun-23-2025

Hand gestures are a primary output of the human motor system, yet the decoding of their neuromuscular signatures remains a bottleneck for basic neuroscience and assistive technologies such as prosthetics. Traditional human-machine interface pipelines rely on a single biosignal modality, but multimodal fusion can exploit complementary information from sensors. We systematically compare linear and attention-based fusion strategies across three architectures: a Multimodal MLP, a Multimodal Transformer, and a Hierarchical Transformer, evaluating performance on scenarios with unimodal and multimodal inputs. Experiments use two publicly available datasets: NinaPro DB2 (sEMG and accelerometer) and HD-sEMG 65-Gesture (high-density sEMG and force). Across both datasets, the Hierarchical Transformer with attention-based fusion consistently achieved the highest accuracy, surpassing the multimodal and best single-modality linear-fusion MLP baseline by over 10% on NinaPro DB2 and 3.7% on HD-sEMG. To investigate how modalities interact, we introduce an Isolation Network that selectively silences unimodal or cross-modal attention pathways, quantifying the contribution of each group of token interactions to downstream decisions. Ablations reveal that cross-modal interactions contribute approximately 30% of the decision signal across transformer layers, highlighting the importance of attention-driven fusion in harnessing complementary modality information. Together, these findings reveal when and how multimodal fusion would enhance biosignal classification and also provide mechanistic insights into human muscle activities. The study would be beneficial in the design of sensor arrays for neurorobotic systems.
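The idea of silencing unimodal or cross-modal attention pathways can be sketched with a block mask over a single attention head. This is an illustrative assumption about how such an ablation might be set up, not the authors' Isolation Network: the function names, the `keep` options, and the toy token layout are hypothetical.

```python
import numpy as np


def pathway_mask(modality_ids, keep):
    """Boolean (query, key) mask selecting which attention pathways stay active."""
    same = modality_ids[:, None] == modality_ids[None, :]
    if keep == "unimodal":   # keep only within-modality interactions
        return same
    if keep == "cross":      # keep only cross-modality interactions
        return ~same
    if keep == "all":        # no ablation
        return np.ones_like(same, dtype=bool)
    raise ValueError(f"unknown pathway group: {keep!r}")


def masked_attention(Q, K, V, mask):
    """Scaled dot-product attention with disallowed pairs set to -inf.

    Assumes every query has at least one allowed key; otherwise the
    softmax row would be undefined.
    """
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights


rng = np.random.default_rng(0)
modality_ids = np.array([0, 0, 0, 1, 1])   # toy layout: 3 tokens of one modality, 2 of another
Q, K, V = (rng.normal(size=(5, 4)) for _ in range(3))

out_all, w_all = masked_attention(Q, K, V, pathway_mask(modality_ids, "all"))
out_uni, w_uni = masked_attention(Q, K, V, pathway_mask(modality_ids, "unimodal"))
out_cross, w_cross = masked_attention(Q, K, V, pathway_mask(modality_ids, "cross"))
```

Comparing a downstream output under `"all"` against the output under `"unimodal"` or `"cross"` is one simple way to estimate how much each group of token interactions contributes.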

artificial intelligence, machine learning, modality, (16 more...)

arXiv.org Artificial Intelligence

2506.16744

Genre: Research Report > New Finding (0.68)

Industry: Health & Medicine > Therapeutic Area > Neurology (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
Information Technology > Artificial Intelligence > Vision (0.89)
Information Technology > Artificial Intelligence > Assistive Technologies (0.88)
Information Technology > Human Computer Interaction > Interfaces (0.68)

RollingQ: Reviving the Cooperation Dynamics in Multimodal Transformer

Ni, Haotian, Wei, Yake, Liu, Hang, Chen, Gong, Peng, Chong, Lin, Hao, Hu, Di

arXiv.org Artificial Intelligence | Jun-16-2025

Multimodal learning faces challenges in effectively fusing information from diverse modalities, especially when modality quality varies across samples. Dynamic fusion strategies, such as the attention mechanism in Transformers, aim to address this challenge by adaptively emphasizing modalities based on the characteristics of the input data. However, through a number of carefully designed experiments, we surprisingly observed that the dynamic adaptability of widely used self-attention models diminishes. The model tends to prefer one modality regardless of data characteristics. This bias triggers a self-reinforcing cycle that progressively overemphasizes the favored modality, widening the distribution gap in attention keys across modalities and deactivating the attention mechanism's dynamic properties. To revive adaptability, we propose a simple yet effective method, Rolling Query (RollingQ), which balances attention allocation by rotating the query to break the self-reinforcing cycle and mitigate the key distribution gap. Extensive experiments on various multimodal scenarios validate the effectiveness of RollingQ and show that restoring cooperation dynamics is pivotal for enhancing the broader capabilities of widely deployed multimodal Transformers. The source code is available at https://github.com/GeWu-Lab/RollingQ_ICML2025.

artificial intelligence, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2506.11465

Country: Asia > China (0.47)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Vision (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
