Optimizing Multimodal Language Models through Attention-based Interpretability