What do vision-language models see in the context? Investigating multimodal in-context learning

Open in new window