M$^2$IV: Towards Efficient and Fine-grained Multimodal In-Context Learning via Representation Engineering
Li, Yanshu; Cao, Yi; He, Hongyang; Cheng, Qisen; Fu, Xiang; Xiao, Xi; Wang, Tianyang; Tang, Ruixiang
arXiv.org Artificial Intelligence
Multimodal in-context learning (ICL) equips Large Vision-language Models (LVLMs) with the ability to adapt to new tasks via multiple user-provided demonstrations, without requiring any model parameter updates. However, its effectiveness is constrained by the token-intensive nature of multimodal inputs and the complexity of cross-modal few-shot reasoning, which together hinder LVLMs from extracting useful patterns from demonstrations. To address these challenges, we propose M$^2$IV, a novel representation engineering approach that replaces explicit token-level demonstrations with a set of learnable Multimodal In-context Vectors directly injected into the residual streams of LVLMs. By analyzing the distinct roles of multi-head attention (MHA) and multi-layer perceptrons (MLP) in the ICL process, we design a training strategy that enables M$^2$IV to perform fine-grained semantic distillation and robust cross-modal representation learning. M$^2$IV not only improves performance across diverse tasks and LVLMs but also significantly reduces token overhead, enabling graceful scaling to many-shot scenarios. To further enhance usability, we introduce VLibrary, a repository that stores trained M$^2$IVs for flexible retrieval and injection. With VLibrary, users can steer pre-trained LVLMs in a customized manner that meets diverse requirements. Extensive experiments demonstrate that M$^2$IV consistently outperforms vanilla ICL and prior representation engineering baselines, achieving an average accuracy gain of 3.74% with substantial improvements in overall efficiency.
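The idea of injecting vectors into the residual stream, rather than prepending demonstration tokens, can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not give the exact injection rule, so the per-layer form used here (a scaled vector added to the attention-branch output and another to the MLP-branch output) is an assumption, as are all names (`Injection`, `forward`, `toy_attn`, `toy_mlp`) and the stand-in sublayers. In a real setup the vectors and scales would be trained; here they are fixed by hand.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

Vector = List[float]
SubLayer = Callable[[Vector], Vector]


@dataclass
class Injection:
    """Hypothetical per-layer steering parameters (assumed form, not from the paper)."""
    alpha_attn: float   # scale for the attention-branch vector
    v_attn: Vector      # vector added to the attention-branch output
    alpha_mlp: float    # scale for the MLP-branch vector
    v_mlp: Vector       # vector added to the MLP-branch output


def add(a: Vector, b: Vector) -> Vector:
    return [x + y for x, y in zip(a, b)]


def scale(alpha: float, v: Vector) -> Vector:
    return [alpha * x for x in v]


def toy_attn(h: Vector) -> Vector:
    # Stand-in for multi-head attention; a real model would mix token positions.
    return scale(0.5, h)


def toy_mlp(h: Vector) -> Vector:
    # Stand-in for the feed-forward block.
    return scale(0.1, h)


def forward(
    h: Vector,
    layers: Sequence[Tuple[SubLayer, SubLayer]],
    injections: Optional[Sequence[Injection]] = None,
) -> Vector:
    """Run a residual stack, optionally shifting each branch output by a vector."""
    for i, (attn, mlp) in enumerate(layers):
        a_out = attn(h)
        if injections is not None:
            inj = injections[i]
            a_out = add(a_out, scale(inj.alpha_attn, inj.v_attn))
        h = add(h, a_out)  # residual connection around the attention branch

        m_out = mlp(h)
        if injections is not None:
            inj = injections[i]
            m_out = add(m_out, scale(inj.alpha_mlp, inj.v_mlp))
        h = add(h, m_out)  # residual connection around the MLP branch
    return h
```

With all-zero injections the output matches the plain forward pass, so the steering vectors are the only thing that changes the computation; no demonstration tokens enter the input sequence, which is where the token savings described in the abstract would come from.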
Aug-27-2025
- Country:
- Asia (0.46)
- Genre:
- Research Report > New Finding (0.46)
- Industry:
- Information Technology (0.46)
- Law (0.45)
- Technology:
- Information Technology > Artificial Intelligence
- Vision (1.00)
- Representation & Reasoning (1.00)
- Natural Language > Large Language Model (1.00)
- Machine Learning
- Statistical Learning (1.00)
- Neural Networks
- Deep Learning (0.67)
- Perceptrons (0.54)