VIBE: Video-Input Brain Encoder for fMRI Response Modeling
Daniel Carlström Schad, Shrey Dixit, Janis Keck, Viktor Studenyak, Aleksandr Shpilevoi, Andrej Bicanski
arXiv.org Artificial Intelligence
We present VIBE, a two-stage Transformer that fuses multi-modal video, audio, and text features to predict fMRI activity. Representations from open-source models (Qwen2.5, BEATs, Whisper, SlowFast, V-JEPA) are merged by a modality-fusion transformer and temporally decoded by a prediction transformer with rotary embeddings. Trained on 65 hours of movie data from the CNeuroMod dataset and ensembled across 20 seeds, VIBE attains mean parcel-wise Pearson correlations of 0.3225 on in-distribution Friends S07 and 0.2125 on six out-of-distribution films. An earlier iteration of the same architecture obtained 0.3198 and 0.2096, respectively, winning Phase-1 and placing second overall in the Algonauts 2025 Challenge.
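The reported scores use the challenge's evaluation metric: the Pearson correlation between predicted and measured BOLD time series, computed per brain parcel and then averaged. A minimal NumPy sketch of that metric (function name and array shapes are my own assumptions, not from the paper):

```python
import numpy as np

def parcelwise_pearson(pred, true):
    """Pearson r per parcel, for (timepoints, parcels) arrays.

    Each column is one parcel's time series; returns one r per parcel.
    """
    pred = pred - pred.mean(axis=0)   # center each parcel's time series
    true = true - true.mean(axis=0)
    num = (pred * true).sum(axis=0)
    den = np.sqrt((pred ** 2).sum(axis=0) * (true ** 2).sum(axis=0))
    return num / den

# Toy example: one parcel predicted perfectly, one anti-correlated.
t = np.linspace(0.0, 1.0, 100)
true = np.stack([np.sin(2 * np.pi * t),  np.cos(2 * np.pi * t)], axis=1)
pred = np.stack([np.sin(2 * np.pi * t), -np.cos(2 * np.pi * t)], axis=1)

r = parcelwise_pearson(pred, true)
print(r)         # → [ 1., -1.]
print(r.mean())  # the mean parcel-wise score reported in the abstract
```

The headline numbers (0.3225 in-distribution, 0.2125 out-of-distribution) are this mean taken over all parcels, averaged after ensembling predictions across 20 seeds.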
Jul-28-2025