Brain encoding models based on multimodal transformers can transfer across language and vision

Jan-18-2025, 16:57:08 GMT–Neural Information Processing Systems

Encoding models have been used to assess how the human brain represents concepts in language and vision. While language and vision rely on similar concept representations, current encoding models are typically trained and tested on brain responses to each modality in isolation. Recent advances in multimodal pretraining have produced transformers that can extract aligned representations of concepts in language and vision. In this work, we used representations from multimodal transformers to train encoding models that can transfer across fMRI responses to stories and movies. We found that encoding models trained on brain responses to one modality can successfully predict brain responses to the other modality, particularly in cortical regions that represent conceptual meaning.
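
The abstract does not say which regression method, features, or data were used. The block below is a minimal sketch of cross-modality transfer under stated assumptions. It assumes voxelwise ridge regression, a common choice for fMRI encoding models. Random Gaussian vectors stand in for aligned multimodal-transformer features of story and movie stimuli. The fMRI responses are synthetic, generated from a shared weight matrix plus noise. The names `fit_ridge`, `voxelwise_correlation`, and `W_true`, along with all sizes and the `alpha` value, are hypothetical illustrations, not the authors' setup. The model is fit only on "story" responses and scored by per-voxel Pearson correlation on held-out "movie" responses.

```python
import numpy as np


def fit_ridge(X, Y, alpha):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^-1 X^T Y."""
    n_features = X.shape[1]
    gram = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(gram, X.T @ Y)


def voxelwise_correlation(Y_true, Y_pred):
    """Pearson correlation between measured and predicted responses, one value per voxel (column)."""
    Yt = Y_true - Y_true.mean(axis=0)
    Yp = Y_pred - Y_pred.mean(axis=0)
    num = (Yt * Yp).sum(axis=0)
    den = np.sqrt((Yt ** 2).sum(axis=0) * (Yp ** 2).sum(axis=0))
    return num / den


rng = np.random.default_rng(0)
n_features, n_voxels = 64, 200

# Hypothetical assumption: each voxel responds to concepts in the aligned feature
# space with the same weights regardless of whether the input is language or vision.
W_true = rng.standard_normal((n_features, n_voxels))

# Hypothetical stand-ins for multimodal-transformer features of each stimulus type.
X_story = rng.standard_normal((500, n_features))  # language (stories)
X_movie = rng.standard_normal((300, n_features))  # vision (movies)

# Synthetic fMRI responses: shared linear mapping plus independent noise.
noise = 5.0
Y_story = X_story @ W_true + noise * rng.standard_normal((500, n_voxels))
Y_movie = X_movie @ W_true + noise * rng.standard_normal((300, n_voxels))

# Train on story responses only, then predict movie responses (cross-modality transfer).
W_story = fit_ridge(X_story, Y_story, alpha=10.0)
r_story_to_movie = voxelwise_correlation(Y_movie, X_movie @ W_story)
print(f"mean story->movie correlation: {r_story_to_movie.mean():.3f}")
```

Transfer works in this toy setting only because both modalities share one feature space and one set of voxel weights by construction. Whether real brain responses behave this way is the empirical question the paper addresses.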

language and vision, multimodal transformer, representation
