Correlating instruction-tuning (in multimodal models) with vision-language processing (in the brain)