Evaluating and Steering Modality Preferences in Multimodal Large Language Models
Yu Zhang, Jinlong Ma, Yongshuai Hou, Xuefeng Bai, Kehai Chen, Yang Xiang, Jun Yu, Min Zhang
Multimodal large language models (MLLMs) have achieved remarkable success on complex multimodal tasks. However, it remains insufficiently explored whether they exhibit modality preference, a tendency to favor one modality over another when processing multimodal contexts. Extensive experiments reveal that all 20 tested MLLMs generally demonstrate clear modality preferences, and such preferences can serve as a useful indicator of the downstream task performance of MLLMs. Further analysis shows that modality preference can be controlled by instruction guidance and is captured within the latent representations of MLLMs. Building on these insights, we propose a probing and steering method based on representation engineering to explicitly control modality preference without requiring additional fine-tuning. This method effectively amplifies modality preference toward a desired direction and yields promising improvements across multiple downstream applications, including multimodal visual understanding and multimodal machine translation.

Multimodal Large Language Models (MLLMs; Achiam et al., 2023; Team et al., 2023; Wang et al., 2024; Yin et al., 2024) have emerged as a powerful paradigm for processing and reasoning across heterogeneous data modalities (e.g., text, images, video). Recent advances demonstrate their exceptional capabilities on complex tasks with multimodal contexts, including autonomous web browsing (He et al., 2024), graphical user interface understanding (Hong et al., 2024b), and multimodal dialogue systems (Sun et al., 2022). Despite this impressive performance, fundamental questions remain about their modality preference: whether MLLMs tend to rely more heavily on one modality than others, and to what extent they favor a specific modality when resolving multimodal inputs. To investigate this, one line of work (Fu et al., 2024; Amara et al., 2024) compares model performance on unimodal inputs, providing either only the text or only the image for the same question. Another line of research analyzes the relative contributions of textual and visual context, typically by removing one modality and observing the change in downstream performance (Park et al., 2025) or by computing Shapley values (Alishahi et al., 2019; Parcalabescu & Frank, 2024; 2022).
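To make the described mechanism concrete, below is a minimal sketch of representation-level probing and steering in the spirit of the method summarized above; it is not the paper's actual implementation. It assumes a text-only GPT-2 backbone purely to keep the example runnable, and the contrast prompts, probed layer index, and steering coefficient are illustrative placeholders. The idea is to estimate a "modality preference" direction from hidden states under contrastive instructions, then add that direction to the residual stream at inference to shift the model toward the visual (or textual) modality without fine-tuning.

```python
"""
Minimal sketch of representation-level probing and steering for modality
preference. This is NOT the paper's released implementation: the GPT-2
stand-in backbone, the contrast prompts, the probed layer index, and the
steering coefficient are all illustrative assumptions.
"""
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"   # stand-in text-only backbone; the paper targets MLLMs
layer_idx = 6         # assumed layer whose residual stream encodes the preference
alpha = 4.0           # assumed steering strength

tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()


def last_token_hidden(prompt: str) -> torch.Tensor:
    """Residual-stream activation of the final token after block `layer_idx`."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[layer_idx + 1][0, -1]   # shape: (hidden_dim,)


# 1) Probing: contrast prompts that instruct the model to rely on the textual
#    vs. the visual context, and take the difference of mean activations as a
#    candidate "modality preference" direction.
text_pref_prompts = ["Answer using only the textual context: ..."]
image_pref_prompts = ["Answer using only the visual context: ..."]
mu_text = torch.stack([last_token_hidden(p) for p in text_pref_prompts]).mean(0)
mu_image = torch.stack([last_token_hidden(p) for p in image_pref_prompts]).mean(0)
direction = mu_image - mu_text
direction = direction / direction.norm()


# 2) Steering: add the direction to the residual stream at the same block
#    during inference (negate `alpha` to push toward the textual modality).
def steering_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + alpha * direction.to(hidden.dtype)
    if isinstance(output, tuple):
        return (hidden,) + output[1:]
    return hidden


handle = model.transformer.h[layer_idx].register_forward_hook(steering_hook)
ids = tok("Question about the image and its caption: ...", return_tensors="pt")
steered = model.generate(**ids, max_new_tokens=32, pad_token_id=tok.eos_token_id)
handle.remove()
print(tok.decode(steered[0], skip_special_tokens=True))
```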
arXiv.org Artificial Intelligence
Sep-30-2025