True Multimodal In-Context Learning Needs Attention to the Visual Context