How Visual Representations Map to Language Feature Space in Multimodal LLMs
