On the generalization capacity of neural networks during generic multimodal reasoning