Octavius: Mitigating Task Interference in MLLMs via LoRA-MoE
Zeren Chen, Ziqin Wang, Zhen Wang, Huayang Liu, Zhenfei Yin, Si Liu, Lu Sheng, Wanli Ouyang, Yu Qiao, Jing Shao
Recent studies have demonstrated that Large Language Models (LLMs) can extend their zero-shot generalization capabilities to multimodal learning through instruction tuning. As more modalities and downstream tasks are introduced, however, conflicts and interference among them degrade performance more severely. Although this phenomenon has been overlooked in previous work, we propose a novel and extensible framework, called Octavius, for comprehensive studies and experimentation on multimodal learning with Multimodal Large Language Models (MLLMs). Specifically, we combine the well-known Mixture-of-Experts (MoE) design with LoRA, a representative PEFT technique, to build a novel LLM-based decoder, called LoRA-MoE, for multimodal learning. To the best of our knowledge, this is one of the first efforts to introduce MoE into MLLMs to address this problem. Experimental results (about a 20% improvement) demonstrate the effectiveness and versatility of our design on various 2D and 3D downstream tasks. Code and datasets are available at https://openlamm.github.io/paper

Multimodal Large Language Models (MLLMs) (Alayrac et al., 2022; Huang et al., 2023; Liu et al., 2023; Li et al., 2023a; Zhu et al., 2023) have been considered promising general-purpose interfaces that can perform various multimodal tasks under few-/zero-shot settings. Beyond leveraging powerful Large Language Models (LLMs) (OpenAI, 2023; Touvron et al., 2023a) as universal interfaces that unify the responses to different types of tasks as task-specific textual sequences, the keys to the success of MLLMs are reliably perceiving more modalities and being efficiently fine-tuned to adapt to more downstream tasks. To achieve this goal, MLLMs rely on the instruction-tuning scheme (Ouyang et al., 2022), in which the model is fine-tuned on multimodal instruction-following dialogues orchestrated from various multimodal tasks. Moreover, thanks to Parameter-Efficient Fine-Tuning (PEFT) techniques (e.g., LoRA (Hu et al., 2021) and Adapter (Houlsby et al., 2019)), where only small trainable components are injected into the model and updated during fine-tuning, recent MLLMs (Zhang et al., 2023; Yin et al., 2023; Ye et al., 2023) can efficiently learn to solve downstream tasks with a small amount of annotated data, while preserving language proficiency and generalizability to novel situations. Remarkably, these models achieve comparable performance at low cost relative to LLaVA (Liu et al., 2023), the KOSMOS series (Huang et al., 2023; Peng et al., 2023), and Shikra (Chen et al., 2023), which are trained by full-model fine-tuning on large amounts of multimodal data.

[Figure example dialogue: Q1: What is this object? R1: There is a monitor in the image.]
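The abstract describes LoRA-MoE only at a high level. As a concrete illustration, below is a minimal PyTorch sketch of how a decoder linear layer might combine a frozen pretrained weight with several LoRA experts selected by a lightweight gate. All names (LoRAMoELinear, num_experts, rank, alpha) and the token-level soft routing are illustrative assumptions, not the paper's released implementation; Octavius's actual routing (e.g., instance- or modality-level gating) may differ.

```python
# Hypothetical sketch of a LoRA-MoE layer; names and routing strategy
# are assumptions for illustration, not the paper's official code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRAMoELinear(nn.Module):
    def __init__(self, in_features, out_features, num_experts=4, rank=8, alpha=16.0):
        super().__init__()
        # Frozen pretrained projection from the LLM decoder.
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)
        # One low-rank (A, B) pair per expert; B starts at zero so the
        # initial delta is zero, as in standard LoRA.
        self.lora_A = nn.Parameter(torch.randn(num_experts, rank, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(num_experts, out_features, rank))
        # Lightweight gate producing per-expert mixing weights.
        self.gate = nn.Linear(in_features, num_experts, bias=False)
        self.scaling = alpha / rank

    def forward(self, x):  # x: (batch, seq, in_features)
        out = self.base(x)
        weights = F.softmax(self.gate(x), dim=-1)                # (b, s, E)
        # Expert deltas: x -> A_e -> B_e for every expert e.
        a = torch.einsum('bsi,eri->bser', x, self.lora_A)        # (b, s, E, r)
        delta = torch.einsum('bser,eor->bseo', a, self.lora_B)   # (b, s, E, out)
        # Mix expert outputs with the gate weights and add to the frozen path.
        return out + self.scaling * torch.einsum('bseo,bse->bso', delta, weights)

# Usage: drop-in replacement for a 512-d decoder projection.
layer = LoRAMoELinear(512, 512)
y = layer(torch.randn(2, 16, 512))  # -> (2, 16, 512)
```

The intent of such a design is that different low-rank experts can specialize to different modalities or tasks, so updates for one task are less likely to interfere with another, which is exactly the interference problem the abstract highlights.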
arXiv.org Artificial Intelligence
Mar-13-2024