What do vision-language models see in the context? Investigating multimodal in-context learning