Explaining Multi-modal Large Language Models by Analyzing their Vision Perception