Stacked Regression using Off-the-shelf, Stimulus-tuned and Fine-tuned Neural Networks for Predicting fMRI Brain Responses to Movies (Algonauts 2025 Report)

Robert Scholz, Kunal Bagga, Christine Ahrends, Carlo Alberto Barbano

arXiv.org Artificial Intelligence 

Encoding models predict brain responses to given stimuli. Recently, deep neural networks have been used as encoding models to predict brain activity recorded with functional MRI (fMRI) [1, 2, 3, 4, 5, 6]. These studies investigate whether representations in deep neural networks correspond to those in the human brain. The relationship is typically assessed with linear models, with successful prediction taken as evidence of shared representational structure. Representations from both unimodal and multimodal deep neural networks have been used to predict brain activity, including large language models (LLMs) [2, 4, 7, 8], vision models [9, 10], audio models [1, 11], and video-language models (VLMs) [12].

However, existing studies face challenges in generalizability and comparability: differences in stimulus modality, quantity, and content, as well as in preprocessing and scoring, make cross-study comparisons difficult. The Algonauts 2025 Challenge [13] provides a framework to address these issues, offering an openly available, preprocessed dataset with a large amount of data per subject, stimuli aligned across modalities (video, audio, and transcripts), and a standardized evaluation procedure. The challenge places particular emphasis on generalizability, including both in-distribution and out-of-distribution test sets to rigorously evaluate how well models transfer to new stimuli.
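As a minimal sketch of the linear-encoding-model idea described above: a feature matrix extracted from a deep network is mapped to per-voxel fMRI responses with ridge regression, and prediction quality is scored as the per-voxel correlation between measured and predicted responses on held-out data. All names, shapes, and the synthetic data below are hypothetical illustrations, not the paper's actual pipeline.

```python
import numpy as np

def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression mapping stimulus features X
    (n_samples x n_features) to voxel responses Y (n_samples x n_voxels)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ Y)

def encoding_score(Y_true, Y_pred):
    """Per-voxel Pearson correlation between measured and predicted responses."""
    Yt = Y_true - Y_true.mean(axis=0)
    Yp = Y_pred - Y_pred.mean(axis=0)
    num = (Yt * Yp).sum(axis=0)
    den = np.sqrt((Yt ** 2).sum(axis=0) * (Yp ** 2).sum(axis=0))
    return num / den

# Synthetic stand-in for network features and fMRI data (hypothetical shapes).
rng = np.random.default_rng(0)
X_train = rng.standard_normal((200, 50))   # training stimuli features
X_test = rng.standard_normal((40, 50))     # held-out stimuli features
W_true = rng.standard_normal((50, 10))     # simulated feature-to-voxel mapping
Y_train = X_train @ W_true + 0.1 * rng.standard_normal((200, 10))
Y_test = X_test @ W_true + 0.1 * rng.standard_normal((40, 10))

W = fit_ridge(X_train, Y_train, alpha=1.0)
r = encoding_score(Y_test, X_test @ W)     # one correlation per simulated voxel
```

In practice the regularization strength would be chosen per voxel by cross-validation, and the features would come from intermediate layers of the pretrained networks rather than random draws.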