M$^2$IV: Towards Efficient and Fine-grained Multimodal In-Context Learning via Representation Engineering

Li, Yanshu, Cao, Yi, He, Hongyang, Cheng, Qisen, Fu, Xiang, Xiao, Xi, Wang, Tianyang, Tang, Ruixiang

Aug-27-2025–arXiv.org Artificial Intelligence

Multimodal in-context learning (ICL) equips Large Vision-language Models (LVLMs) with the ability to adapt to new tasks via multiple user-provided demonstrations, without requiring any model parameter updates. However, its effectiveness is constrained by the token-intensive nature of multimodal inputs and the complexity of cross-modal few-shot reasoning, which together hinder LVLMs from extracting useful patterns from demonstrations. To address these challenges, we propose M$^2$IV, a novel representation engineering approach that replaces explicit token-level demonstrations with a set of learnable Multimodal In-context Vectors directly injected into the residual streams of LVLMs. By analyzing the distinct roles of multi-head attention (MHA) and multi-layer perceptrons (MLP) in the ICL process, we design a training strategy that enables M$^2$IV to perform fine-grained semantic distillation and robust cross-modal representation learning. M$^2$IV not only improves performance across diverse tasks and LVLMs but also significantly reduces token overhead, enabling graceful scaling to many-shot scenarios. To further enhance usability, we introduce VLibrary, a repository that stores trained M$^2$IVs for flexible retrieval and injection. With VLibrary, users can steer pre-trained LVLMs in a customized manner that meets diverse requirements. Extensive experiments demonstrate that M$^2$IV consistently outperforms vanilla ICL and prior representation engineering baselines, achieving an average accuracy gain of 3.74% with substantial improvements in overall efficiency.
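To make the idea of injecting vectors into the residual stream more concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify how the vectors are parameterized, so the per-layer MHA and MLP vectors, the per-layer scaling factors, the `InContextVectors` container, and the toy `toy_mha` and `toy_mlp` branches are all hypothetical. In this sketch each vector is added to the output of its branch before that output is added back to the residual stream, so no demonstration tokens are needed in the prompt.

```python
from dataclasses import dataclass
from typing import List, Optional

Vector = List[float]


def add(a: Vector, b: Vector) -> Vector:
    """Element-wise sum of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]


def scale(a: Vector, s: float) -> Vector:
    """Multiply every element of a vector by a scalar."""
    return [s * x for x in a]


@dataclass
class InContextVectors:
    """Hypothetical container (an assumption, not from the paper's abstract):
    one vector and one scaling factor per branch per layer."""
    mha: List[Vector]
    mlp: List[Vector]
    mha_scale: List[float]
    mlp_scale: List[float]


def toy_mha(h: Vector) -> Vector:
    """Stand-in for a real multi-head attention branch."""
    return scale(h, 0.5)


def toy_mlp(h: Vector) -> Vector:
    """Stand-in for a real MLP branch (a bare ReLU)."""
    return [max(0.0, x) for x in h]


def forward(h: Vector, num_layers: int,
            icv: Optional[InContextVectors] = None) -> Vector:
    """Run a toy residual stack, optionally shifting each branch output
    by its scaled in-context vector before the residual addition."""
    for layer in range(num_layers):
        attn_out = toy_mha(h)
        if icv is not None:
            attn_out = add(attn_out, scale(icv.mha[layer], icv.mha_scale[layer]))
        h = add(h, attn_out)

        mlp_out = toy_mlp(h)
        if icv is not None:
            mlp_out = add(mlp_out, scale(icv.mlp[layer], icv.mlp_scale[layer]))
        h = add(h, mlp_out)
    return h


if __name__ == "__main__":
    hidden = [1.0, -2.0]
    vectors = InContextVectors(
        mha=[[1.0, 1.0]], mlp=[[0.0, 0.0]],
        mha_scale=[0.5], mlp_scale=[1.0],
    )
    print("without injection:", forward(hidden, num_layers=1))
    print("with injection:   ", forward(hidden, num_layers=1, icv=vectors))
```

In a real LVLM the vectors would be trained while the model weights stay frozen, and the injection would typically be implemented with framework-level hooks rather than a hand-written loop; both points are assumptions about a plausible setup rather than details confirmed by the abstract.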

large language model, machine learning, natural language


Country:
- Asia (0.46)

Genre:
- Research Report > New Finding (0.46)

Industry:
- Information Technology (0.46)
- Law (0.45)

Technology:
- Information Technology > Artificial Intelligence
  - Vision (1.00)
  - Representation & Reasoning (1.00)
  - Natural Language > Large Language Model (1.00)
  - Machine Learning
    - Statistical Learning (1.00)
    - Neural Networks
      - Deep Learning (0.67)
      - Perceptrons (0.54)