Re-Imagining Multimodal Instruction Tuning: A Representation View

Liu, Yiyang, Liang, James Chenhao, Tang, Ruixiang, Lee, Yugyung, Rabbani, Majid, Dianat, Sohail, Rao, Raghuveer, Huang, Lifu, Liu, Dongfang, Wang, Qifan, Han, Cheng

Mar-20-2025, arXiv.org Artificial Intelligence

Multimodal instruction tuning has proven to be an effective strategy for achieving zero-shot generalization by fine-tuning pre-trained Large Multimodal Models (LMMs) with instruction-following data. However, as the scale of LMMs continues to grow, fully fine-tuning these models has become highly parameter-intensive. Although Parameter-Efficient Fine-Tuning (PEFT) methods have been introduced to reduce the number of tunable parameters, a significant performance gap remains compared to full fine-tuning. Furthermore, existing PEFT approaches are often highly parameterized, making them difficult to interpret and control. In light of this, we introduce Multimodal Representation Tuning (MRT), a novel approach that focuses on directly editing semantically rich multimodal representations to achieve strong performance and provide intuitive control over LMMs. Empirical results show that our method surpasses current state-of-the-art baselines with significant performance gains (e.g., 1580.40 MME score) while requiring substantially fewer tunable parameters (e.g., 0.03% parameters). Additionally, we conduct experiments on editing instrumental tokens within multimodal representations, demonstrating that direct manipulation of these representations enables simple yet effective control over network behavior.

large language model, machine learning, natural language


Country:
- North America > United States
  - Missouri > Jackson County
    - Kansas City (0.04)
  - California > Yolo County
    - Davis (0.04)
- Europe > Spain
  - Aragón (0.04)
- Asia > Myanmar
  - Tanintharyi Region > Dawei (0.04)

Genre:
- Research Report
  - New Finding (0.88)
  - Promising Solution (0.66)

Industry:
- Information Technology (0.87)
- Government
  - Military (0.46)
  - Regional Government > North America Government
    - United States Government (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Vision (1.00)
  - Representation & Reasoning (1.00)
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)
