Explaining multimodal LLMs via intra-modal token interactions