What Makes Multimodal In-Context Learning Work?