AITopics | language and vision

Brain encoding models based on multimodal transformers can transfer across language and vision

Neural Information Processing Systems | May-1-2026, 03:37:22 GMT

Encoding models have been used to assess how the human brain represents concepts in language and vision. While language and vision rely on similar concept representations, current encoding models are typically trained and tested on brain responses to each modality in isolation. Recent advances in multimodal pretraining have produced transformers that can extract aligned representations of concepts in language and vision. In this work, we used representations from multimodal transformers to train encoding models that can transfer across fMRI responses to stories and movies. We found that encoding models trained on brain responses to one modality can successfully predict brain responses to the other modality, particularly in cortical regions that represent conceptual meaning. Further analysis of these encoding models revealed shared semantic dimensions that underlie concept representations in language and vision. Comparing encoding models trained using representations from multimodal and unimodal transformers, we found that multimodal transformers learn more aligned representations of concepts in language and vision. Our results demonstrate how multimodal transformers can provide insights into the brain's capacity for multimodal processing.
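The cross-modal transfer described in this abstract can be illustrated with a toy sketch. It is not the authors' pipeline: the features are synthetic stand-ins for aligned multimodal transformer representations, the single simulated voxel, the ridge penalty `lam`, and all function names are assumptions made for illustration. A linear model is fit on "story" data and scored by correlation on "movie" data.

```python
import math
import random

def ridge_fit(X, y, lam):
    """Solve (X^T X + lam * I) w = X^T y with Gauss-Jordan elimination."""
    d = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) + (lam if i == j else 0.0)
          for j in range(d)] for i in range(d)]
    b = [sum(row[i] * t for row, t in zip(X, y)) for i in range(d)]
    for col in range(d):
        # Partial pivoting for numerical stability.
        piv = max(range(col, d), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(d):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * p for a, p in zip(A[r], A[col])]
                b[r] -= f * b[col]
    return [b[i] / A[i][i] for i in range(d)]

def predict(X, w):
    return [sum(wi * xi for wi, xi in zip(w, row)) for row in X]

def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

rng = random.Random(0)
true_w = [0.8, -0.5, 0.3]  # hypothetical shared tuning of one voxel

def simulate(n):
    """Synthetic features in a shared space plus noisy voxel responses."""
    X = [[rng.gauss(0, 1) for _ in true_w] for _ in range(n)]
    y = [sum(t * x for t, x in zip(true_w, row)) + rng.gauss(0, 0.1) for row in X]
    return X, y

story_X, story_y = simulate(40)   # stand-in for story (language) data
movie_X, movie_y = simulate(30)   # stand-in for movie (vision) data

w = ridge_fit(story_X, story_y, lam=1.0)              # train on one modality
r_transfer = pearson(predict(movie_X, w), movie_y)    # test on the other
```

Transfer works in this toy only because both modalities share the same feature space and the same voxel weights, which is the condition the abstract attributes to aligned multimodal representations.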

artificial intelligence, machine learning, natural language, (20 more...)

Neural Information Processing Systems

Genre: Research Report > New Finding (1.00)

Industry:

Health & Medicine > Therapeutic Area > Neurology (1.00)
Health & Medicine > Health Care Technology (0.92)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Vision (0.95)
Information Technology > Artificial Intelligence > Cognitive Science > Neuroscience (0.66)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.46)


Brain encoding models based on multimodal transformers can transfer across language and vision

Neural Information Processing Systems | Feb-12-2026, 11:25:35 GMT

Encoding models have been used to assess how the human brain represents concepts in language and vision.

artificial intelligence, machine learning, natural language, (20 more...)

Neural Information Processing Systems

Genre: Research Report > New Finding (0.94)

Industry:

Health & Medicine > Therapeutic Area > Neurology (1.00)
Health & Medicine > Health Care Technology (0.72)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Vision (0.95)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.68)
(2 more...)


Brain encoding models based on multimodal transformers can transfer across language and vision

Neural Information Processing Systems | Dec-25-2025, 13:48:06 GMT

Encoding models have been used to assess how the human brain represents concepts in language and vision. While language and vision rely on similar concept representations, current encoding models are typically trained and tested on brain responses to each modality in isolation. Recent advances in multimodal pretraining have produced transformers that can extract aligned representations of concepts in language and vision. In this work, we used representations from multimodal transformers to train encoding models that can transfer across fMRI responses to stories and movies. We found that encoding models trained on brain responses to one modality can successfully predict brain responses to the other modality, particularly in cortical regions that represent conceptual meaning. Further analysis of these encoding models revealed shared semantic dimensions that underlie concept representations in language and vision. Comparing encoding models trained using representations from multimodal and unimodal transformers, we found that multimodal transformers learn more aligned representations of concepts in language and vision. Our results demonstrate how multimodal transformers can provide insights into the brain's capacity for multimodal processing.

language and vision, multimodal transformer, representation, (5 more...)

Neural Information Processing Systems

Genre: Research Report > New Finding (0.60)

Industry: Health & Medicine (0.60)

Technology: Information Technology > Artificial Intelligence (0.40)


Revisiting Neural Scaling Laws in Language and Vision

Neural Information Processing Systems | Dec-24-2025, 18:13:08 GMT

The remarkable progress in deep learning in recent years is largely driven by improvements in scale, where bigger models are trained on larger datasets for longer schedules. To predict the benefit of scale empirically, we argue for a more rigorous methodology based on the extrapolation loss, instead of reporting the best-fitting (interpolating) parameters. We then present a recipe for estimating scaling law parameters reliably from learning curves. We demonstrate that it extrapolates more accurately than previous methods in a wide range of architecture families across several domains, including image classification, neural machine translation (NMT) and language modeling, in addition to tasks from the BIG-Bench evaluation benchmark. Finally, we release a benchmark dataset comprising 90 evaluation tasks to facilitate research in this domain.
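The difference between interpolation fit and extrapolation loss can be sketched as follows. This is a minimal illustration, not the paper's recipe: it assumes a plain power law `loss = beta * x ** (-c)` fitted by least squares in log space, and the function names and the synthetic learning curve are hypothetical. The fit uses only the smallest scales and is scored on the held-out larger ones.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of log(y) = log(beta) - c * log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    sxx = sum((a - mx) ** 2 for a in lx)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    slope = sxy / sxx
    return math.exp(my - slope * mx), -slope  # (beta, c)

def log_rmse(beta, c, xs, ys):
    """Root-mean-square error between predicted and observed log-loss."""
    errs = [(math.log(beta * x ** (-c)) - math.log(y)) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(sum(errs) / len(errs))

def extrapolation_loss(xs, ys, n_fit):
    """Fit on the n_fit smallest scales, score on the remaining larger ones."""
    pairs = sorted(zip(xs, ys))
    fit, held = pairs[:n_fit], pairs[n_fit:]
    beta, c = fit_power_law([p[0] for p in fit], [p[1] for p in fit])
    return log_rmse(beta, c, [p[0] for p in held], [p[1] for p in held])

# Synthetic learning curve that follows an exact power law (illustrative only).
sizes = [1e3, 2e3, 4e3, 8e3, 1.6e4, 3.2e4]
losses = [2.0 * s ** -0.5 for s in sizes]

beta, c = fit_power_law(sizes[:4], losses[:4])
ext = extrapolation_loss(sizes, losses, n_fit=4)
```

On real learning curves, two fits with similar interpolation error can differ widely on the held-out larger scales, which is why the held-out score is the quantity of interest here.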

language and vision, name change, revisiting neural scaling law, (3 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)


Long-Short Transformer: Efficient Transformers for Language and Vision

Neural Information Processing Systems | Dec-24-2025, 12:07:58 GMT

Transformers have achieved success in both language and vision domains. However, it is prohibitively expensive to scale them to long sequences such as long documents or high-resolution images, because the self-attention mechanism has quadratic time and memory complexity with respect to the input sequence length. In this paper, we propose Long-Short Transformer (Transformer-LS), an efficient self-attention mechanism for modeling long sequences with linear complexity for both language and vision tasks. It aggregates a novel long-range attention with dynamic projection to model distant correlations and a short-term attention to capture fine-grained local correlations. We propose a dual normalization strategy to account for the scale mismatch between the two attention mechanisms. Transformer-LS can be applied to both autoregressive and bidirectional models without additional complexity. Our method outperforms the state-of-the-art models on multiple tasks in language and vision domains, including the Long Range Arena benchmark, autoregressive language modeling, and ImageNet classification. For instance, Transformer-LS achieves 0.97 test BPC on enwik8 using half the number of parameters of the previous method, while being faster and able to handle sequences 3x as long as its full-attention version on the same hardware. On ImageNet, it obtains state-of-the-art results (e.g., a moderately sized 55.8M-parameter model trained solely on 224x224 ImageNet-1K obtains 84.1% Top-1 accuracy), while being more scalable on high-resolution images.
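The quadratic-versus-linear contrast in this abstract can be made concrete with a rough count of query-key score computations. The sketch below is an assumption-laden illustration, not the Transformer-LS implementation: it assumes a symmetric sliding window of `window` positions on each side for the short-term part and `r` projected keys per query for the long-range part, and both parameter names are hypothetical.

```python
def full_attention_pairs(n):
    """Every query scores every key: n * n pairs."""
    return n * n

def long_short_pairs(n, window, r):
    """Each query scores its local window plus r projected long-range keys."""
    local = 0
    for i in range(n):
        lo = max(0, i - window)        # window is clipped at the sequence start
        hi = min(n - 1, i + window)    # and at the sequence end
        local += hi - lo + 1
    return local + n * r

for n in (1024, 2048, 4096):
    print(n, full_attention_pairs(n), long_short_pairs(n, window=64, r=32))
```

Doubling `n` quadruples the full-attention count but only about doubles the windowed-plus-projected count, since the per-query cost stays bounded by `2 * window + 1 + r`.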

efficient transformer, long-short transformer, transformer, (10 more...)

Neural Information Processing Systems

Country: Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.07)

Technology: Information Technology > Artificial Intelligence (0.75)


Long-Short Transformer: Efficient Transformers for Language and Vision (Appendix) A Details of Norm Comparisons

Neural Information Processing Systems | Aug-16-2025, 02:08:46 GMT

It is designed to measure the extent to which the model relies on the image background.

artificial intelligence, machine learning, natural language, (20 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.95)
Information Technology > Artificial Intelligence > Natural Language (0.69)


), addressing a hard problem that is important in robotics (R4), while also extremely relevant to NeurIPS (R1

Neural Information Processing Systems | Aug-15-2025, 07:22:19 GMT

We thank all reviewers for their constructive feedback! We collected 200 such descriptions (40 per annotator). New Users", users typed an instruction and saw the result in a physics-based simulation in real-time (line 263). R2 Faster R-CNN isn't trained on data that looks anything like this. FPFH would not be applicable since it is a 3D point-cloud approach requiring access to a depth camera.

hard problem, neurips, section 3, (8 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Robots (1.00)


Revisiting Neural Scaling Laws in Language and Vision

Neural Information Processing Systems | May-27-2025, 14:09:02 GMT

The remarkable progress in deep learning in recent years is largely driven by improvements in scale, where bigger models are trained on larger datasets for longer schedules. To predict the benefit of scale empirically, we argue for a more rigorous methodology based on the extrapolation loss, instead of reporting the best-fitting (interpolating) parameters. We then present a recipe for estimating scaling law parameters reliably from learning curves. We demonstrate that it extrapolates more accurately than previous methods in a wide range of architecture families across several domains, including image classification, neural machine translation (NMT) and language modeling, in addition to tasks from the BIG-Bench evaluation benchmark. Finally, we release a benchmark dataset comprising 90 evaluation tasks to facilitate research in this domain.

dataset, language and vision, revisiting neural scaling law

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.31)


Brain encoding models based on multimodal transformers can transfer across language and vision

Neural Information Processing Systems | Jan-18-2025, 16:57:08 GMT

Encoding models have been used to assess how the human brain represents concepts in language and vision. While language and vision rely on similar concept representations, current encoding models are typically trained and tested on brain responses to each modality in isolation. Recent advances in multimodal pretraining have produced transformers that can extract aligned representations of concepts in language and vision. In this work, we used representations from multimodal transformers to train encoding models that can transfer across fMRI responses to stories and movies. We found that encoding models trained on brain responses to one modality can successfully predict brain responses to the other modality, particularly in cortical regions that represent conceptual meaning.

language and vision, multimodal transformer, representation, (3 more...)

Neural Information Processing Systems

Industry: Health & Medicine (0.64)

Technology: Information Technology > Artificial Intelligence (0.44)


Revisiting Neural Scaling Laws in Language and Vision

Neural Information Processing Systems | Jan-17-2025, 15:44:09 GMT

The remarkable progress in deep learning in recent years is largely driven by improvements in scale, where bigger models are trained on larger datasets for longer schedules. To predict the benefit of scale empirically, we argue for a more rigorous methodology based on the extrapolation loss, instead of reporting the best-fitting (interpolating) parameters. We then present a recipe for estimating scaling law parameters reliably from learning curves. We demonstrate that it extrapolates more accurately than previous methods in a wide range of architecture families across several domains, including image classification, neural machine translation (NMT) and language modeling, in addition to tasks from the BIG-Bench evaluation benchmark. Finally, we release a benchmark dataset comprising 90 evaluation tasks to facilitate research in this domain.

dataset, language and vision, revisiting neural scaling law

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.31)

