Explaining Multi-modal Large Language Models by Analyzing their Vision Perception

Open in new window