Algonauts 2025
Multimodal Recurrent Ensembles for Predicting Brain Responses to Naturalistic Movies (Algonauts 2025)
Eren, Semih, Kucukahmetler, Deniz, Scherf, Nico
Accurately predicting distributed cortical responses to naturalistic stimuli requires models that integrate visual, auditory and semantic information over time. We present a hierarchical multimodal recurrent ensemble that maps pretrained video, audio, and language embeddings to fMRI time series recorded while four subjects watched almost 80 hours of movies provided by the Algonauts 2025 challenge. Modality-specific bidirectional RNNs encode temporal dynamics; their hidden states are fused and passed to a second recurrent layer, and lightweight subject-specific heads output responses for 1000 cortical parcels. Training relies on a composite MSE-correlation loss and a curriculum that gradually shifts emphasis from early sensory to late association regions. Averaging 100 model variants further boosts robustness. The resulting system ranked third on the competition leaderboard, achieving an overall Pearson r = 0.2094 and the highest single-parcel peak score (mean r = 0.63) among all participants, with particularly strong gains for the most challenging subject (Subject 5). The approach establishes a simple, extensible baseline for future multimodal brain-encoding benchmarks.
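The abstract above names a composite MSE-correlation loss and ensemble averaging but does not give their exact form. The following is a minimal sketch, assuming the loss is a weighted sum of mean squared error and one minus the Pearson correlation. The weight `alpha` and the function names are hypothetical, chosen for illustration, and are not taken from the paper.

```python
from math import sqrt
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equal-length time series."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    denom = sqrt(var_x * var_y)
    # Correlation is undefined for a constant series; return 0.0 by convention.
    return cov / denom if denom > 0 else 0.0


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between prediction and target."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def composite_loss(pred: Sequence[float], target: Sequence[float],
                   alpha: float = 0.5) -> float:
    """Assumed form: alpha * MSE + (1 - alpha) * (1 - Pearson r)."""
    return alpha * mse(pred, target) + (1.0 - alpha) * (1.0 - pearson_r(pred, target))


def ensemble_mean(predictions: Sequence[Sequence[float]]) -> list:
    """Average the predicted time series of several models, step by step."""
    n_models = len(predictions)
    return [sum(step) / n_models for step in zip(*predictions)]


# Toy example for one parcel: two imperfect models whose errors cancel out.
target = [0.0, 1.0, 2.0, 3.0]
model_a = [0.5, 0.5, 2.5, 2.5]
model_b = [-0.5, 1.5, 1.5, 3.5]
averaged = ensemble_mean([model_a, model_b])

loss_a = composite_loss(model_a, target)
loss_avg = composite_loss(averaged, target)
```

In this toy case the two models err in opposite directions, so the averaged prediction has a lower loss than either model alone. Real ensembles only reduce the part of the error that is not shared across members.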
Stacked Regression using Off-the-shelf, Stimulus-tuned and Fine-tuned Neural Networks for Predicting fMRI Brain Responses to Movies (Algonauts 2025 Report)
Scholz, Robert, Bagga, Kunal, Ahrends, Christine, Barbano, Carlo Alberto
Encoding models predict brain responses to a set of given stimuli. Recently, deep neural networks have been used as encoding models to predict brain activity as recorded by functional MRI (fMRI) [1, 2, 3, 4, 5, 6]. These studies investigate whether representations in deep neural networks correspond to those in the human brain. This relationship is often assessed using linear models, with successful prediction taken as evidence of shared representational structure. Studies have investigated representations from both unimodal and multimodal deep neural networks, including large language models (LLMs) [2, 4, 7, 8], vision models [9, 10], audio models [1, 11], and video-language models (VLMs) [12], to predict brain activity. However, existing studies face challenges in generalizability and comparability. Differences in stimulus modality, quantity, and content, as well as in preprocessing and scoring, make cross-study comparisons difficult. The Algonauts 2025 Challenge [13] provides a framework to address these issues, offering an openly available, preprocessed dataset with a large amount of data per subject and aligned stimuli across modalities, including video, audio, and transcripts, along with a standardized evaluation procedure. The challenge places particular emphasis on generalizability, including both in-distribution and out-of-distribution test sets to rigorously evaluate how well models transfer to new stimuli.
TRIBE: TRImodal Brain Encoder for whole-brain fMRI response prediction
d'Ascoli, Stéphane, Rapin, Jérémy, Benchetrit, Yohann, Banville, Hubert, King, Jean-Rémi
Historically, neuroscience has progressed by fragmenting into specialized domains, each focusing on isolated modalities, tasks, or brain regions. While fruitful, this approach hinders the development of a unified model of cognition. Here, we introduce TRIBE, the first deep neural network trained to predict brain responses to stimuli across multiple modalities, cortical areas and individuals. By combining the pretrained representations of text, audio and video foundation models and handling their time-evolving nature with a transformer, our model can precisely model the spatial and temporal fMRI responses to videos, achieving first place in the Algonauts 2025 brain encoding competition by a significant margin over competitors. Ablations show that while unimodal models can reliably predict their corresponding cortical networks (e.g. visual or auditory networks), they are systematically outperformed by our multimodal model in high-level associative cortices. Currently applied to perception and comprehension, our approach paves the way towards building an integrative model of representations in the human brain. Our code is available at https://github.com/facebookresearch/algonauts-2025.