Yin, Zhenfei
Octavius: Mitigating Task Interference in MLLMs via LoRA-MoE
Chen, Zeren; Wang, Ziqin; Wang, Zhen; Liu, Huayang; Yin, Zhenfei; Liu, Si; Sheng, Lu; Ouyang, Wanli; Qiao, Yu; Shao, Jing
Recent studies have demonstrated that Large Language Models (LLMs) can extend their zero-shot generalization capabilities to multimodal learning through instruction tuning. As more modalities and downstream tasks are introduced, conflicts and interference among tasks can degrade performance increasingly severely. While this phenomenon has been overlooked in previous work, we propose a novel and extensible framework, called Octavius, for comprehensive studies and experimentation on multimodal learning with Multimodal Large Language Models (MLLMs). Specifically, we combine the well-known Mixture-of-Experts (MoE) approach with one of the representative PEFT techniques, i.e., LoRA, to design a novel LLM-based decoder, called LoRA-MoE, for multimodal learning. To the best of our knowledge, ours is one of the pioneering efforts to introduce MoE into MLLMs to address this problem. The experimental results (about a 20% improvement) show the effectiveness and versatility of our design across various 2D and 3D downstream tasks. Code and datasets are available at https://openlamm.github.io/paper

Multimodal Large Language Models (MLLMs) (Alayrac et al., 2022; Huang et al., 2023; Liu et al., 2023; Li et al., 2023a; Zhu et al., 2023) have been considered promising general-purpose interfaces that can perform various multimodal tasks under few-/zero-shot settings. Apart from leveraging powerful Large Language Models (LLMs) (OpenAI, 2023; Touvron et al., 2023a) as universal interfaces that unify the responses to different types of tasks as task-specific textual sequences, the keys to the success of MLLMs are to reliably perceive more modalities and to be efficiently fine-tuned to adapt to more downstream tasks. To achieve this goal, MLLMs rely on the instruction-tuning scheme (Ouyang et al., 2022), where the model is fine-tuned on multimodal instruction-following dialogues orchestrated from various multimodal tasks. Moreover, thanks to Parameter-Efficient Fine-Tuning (PEFT) techniques (e.g., LoRA (Hu et al., 2021) and Adapter (Houlsby et al., 2019)), where only small trainable components are injected into the model and updated during fine-tuning, recent MLLMs (Zhang et al., 2023; Yin et al., 2023; Ye et al., 2023) can efficiently learn to solve downstream tasks with a small amount of annotated data while preserving language proficiency and generalizability to novel situations. Remarkably, these models achieve comparable performance at low cost in comparison to LLaVA (Liu et al., 2023), the KOSMOS series (Huang et al., 2023; Peng et al., 2023), and Shikra (Chen et al., 2023), which are learned by full-model fine-tuning on a large amount of multimodal data.
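The LoRA-MoE idea described in the abstract, a set of low-rank LoRA experts attached to a frozen LLM projection and combined through a learned router, can be illustrated roughly as below. This is a minimal sketch under assumed details (instance-level soft routing, the rank, the number of experts, and all module names are assumptions), not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRAMoELinear(nn.Module):
    """Frozen linear layer augmented with a mixture of LoRA experts (illustrative sketch)."""

    def __init__(self, base: nn.Linear, num_experts: int = 4, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # the pretrained projection stays frozen
            p.requires_grad = False
        d_in, d_out = base.in_features, base.out_features
        self.A = nn.Parameter(torch.randn(num_experts, d_in, rank) * 0.01)  # per-expert "down" projections
        self.B = nn.Parameter(torch.zeros(num_experts, rank, d_out))        # per-expert "up" projections (zero init)
        self.gate = nn.Linear(d_in, num_experts)                            # router producing expert weights
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_in); one soft expert distribution per instance.
        weights = F.softmax(self.gate(x.mean(dim=1)), dim=-1)                 # (batch, num_experts)
        delta = torch.einsum('bsd,edr,ero->beso', x, self.A, self.B)          # each expert's LoRA update
        delta = torch.einsum('be,beso->bso', weights, delta)                  # router-weighted combination
        return self.base(x) + self.scaling * delta
```

With this kind of routing, instructions from different modalities or task groups can be handled by largely separate low-rank experts, which is the sort of separation the abstract credits with mitigating task interference.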
Towards Tracing Trustworthiness Dynamics: Revisiting Pre-training Period of Large Language Models
Qian, Chen; Zhang, Jie; Yao, Wei; Liu, Dongrui; Yin, Zhenfei; Qiao, Yu; Liu, Yong; Shao, Jing
Ensuring the trustworthiness of large language models (LLMs) is crucial. Most studies concentrate on fully pre-trained LLMs to better understand and improve LLMs' trustworthiness. In this paper, to reveal the untapped potential of pre-training, we pioneer the exploration of LLMs' trustworthiness during this period, focusing on five key dimensions: reliability, privacy, toxicity, fairness, and robustness. To begin with, we apply linear probing to LLMs. The high probing accuracy suggests that LLMs in early pre-training can already distinguish concepts in each trustworthiness dimension. Therefore, to further uncover the hidden possibilities of pre-training, we extract steering vectors from an LLM's pre-training checkpoints to enhance the LLM's trustworthiness. Finally, inspired by the result of Choi et al. (2023) that mutual information estimation is bounded by linear probing accuracy, we also probe LLMs with mutual information to investigate the dynamics of trustworthiness during pre-training. We are the first to observe a similar two-phase phenomenon of fitting and compression (Shwartz-Ziv and Tishby, 2017). This research provides an initial exploration of trustworthiness modeling during LLM pre-training, seeking to unveil new insights and spur further developments in the field. We will make our code publicly accessible at https://github.com/ChnQ/TracingLLM.
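The two probing tools named in the abstract, a linear probe over checkpoint activations and a steering vector built from those activations, admit a very small sketch. The difference-of-means construction, the strength parameter, and the toy data below are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

def linear_probe_accuracy(train_acts, train_labels, test_acts, test_labels) -> float:
    """Fit a linear probe on hidden-state activations and report held-out accuracy."""
    probe = LogisticRegression(max_iter=2000).fit(train_acts, train_labels)
    return probe.score(test_acts, test_labels)

def steering_vector(pos_acts: torch.Tensor, neg_acts: torch.Tensor) -> torch.Tensor:
    """Difference-of-means direction between, e.g., trustworthy and untrustworthy activations."""
    return pos_acts.mean(dim=0) - neg_acts.mean(dim=0)

def apply_steering(hidden_states: torch.Tensor, direction: torch.Tensor,
                   strength: float = 4.0) -> torch.Tensor:
    """Shift every token's hidden state along the steering direction at inference time."""
    return hidden_states + strength * direction / direction.norm()

# Toy usage: random arrays stand in for hidden states taken from pre-training checkpoints.
rng = np.random.default_rng(0)
train_acts = rng.normal(size=(200, 64)); train_labels = rng.integers(0, 2, size=200)
test_acts = rng.normal(size=(50, 64));   test_labels = rng.integers(0, 2, size=50)
print(linear_probe_accuracy(train_acts, train_labels, test_acts, test_labels))
```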
Uni3D-LLM: Unifying Point Cloud Perception, Generation and Editing with Large Language Models
Liu, Dingning; Huang, Xiaoshui; Hou, Yuenan; Wang, Zhihui; Yin, Zhenfei; Gong, Yongshun; Gao, Peng; Ouyang, Wanli
In this paper, we introduce Uni3D-LLM, a unified framework that leverages a Large Language Model (LLM) to integrate the tasks of 3D perception, generation, and editing within point cloud scenes. This framework empowers users to effortlessly generate and modify objects at specified locations within a scene, guided by the versatility of natural language descriptions. Uni3D-LLM harnesses the expressive power of natural language to allow precise control over the generation and editing of 3D objects, thereby significantly enhancing operational flexibility and controllability. By mapping point clouds into a unified representation space, Uni3D-LLM achieves cross-application functionality, enabling the seamless execution of a wide array of tasks, ranging from the accurate instantiation of 3D objects to the diverse requirements of interactive design. Through a comprehensive suite of rigorous experiments, the efficacy of Uni3D-LLM in the comprehension, generation, and editing of point clouds has been validated. Additionally, we have assessed the impact of integrating a point cloud perception module on the generation and editing processes, confirming the substantial potential of our approach for practical applications.
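The "unified representation space" step, encoding a point-cloud scene into a handful of tokens that live in the LLM's embedding space and prepending them to the text prompt, can be sketched as follows. The projector architecture, feature dimensions, and token count are assumptions made for illustration, not details taken from the paper.

```python
import torch
import torch.nn as nn

class PointCloudToLLMTokens(nn.Module):
    """Project point-cloud features into the LLM embedding space (illustrative sketch)."""

    def __init__(self, pc_feat_dim: int = 512, llm_dim: int = 4096, num_tokens: int = 32):
        super().__init__()
        self.num_tokens = num_tokens
        self.proj = nn.Sequential(          # small MLP projector, an assumed design choice
            nn.Linear(pc_feat_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, pc_features: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # pc_features: (batch, num_tokens, pc_feat_dim) from a frozen point-cloud encoder
        # text_embeds: (batch, text_len, llm_dim) embedded instruction tokens
        pc_tokens = self.proj(pc_features)                  # (batch, num_tokens, llm_dim)
        return torch.cat([pc_tokens, text_embeds], dim=1)   # prepend scene tokens to the prompt
```

Once the scene is expressed as ordinary prompt tokens, perception, generation, and editing instructions can share one language interface, which is the cross-application behavior the abstract describes.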
INTERN: A New Learning Paradigm Towards General Vision
Shao, Jing; Chen, Siyu; Li, Yangguang; Wang, Kun; Yin, Zhenfei; He, Yinan; Teng, Jianing; Sun, Qinghong; Gao, Mengya; Liu, Jihao; Huang, Gengshi; Song, Guanglu; Wu, Yichao; Huang, Yuming; Liu, Fenggang; Peng, Huan; Qin, Shuo; Wang, Chengyu; Wang, Yujie; He, Conghui; Liang, Ding; Liu, Yu; Yu, Fengwei; Yan, Junjie; Lin, Dahua; Wang, Xiaogang; Qiao, Yu
Enormous waves of technological innovation over the past several years, marked by advances in AI technologies, are profoundly reshaping industry and society. However, down the road, a key challenge awaits us: our capability to meet rapidly growing, scenario-specific demands is severely limited by the cost of acquiring a commensurate amount of training data. This difficult situation is in essence due to limitations of the mainstream learning paradigm: we need to train a new model for each new scenario, based on a large quantity of well-annotated data and commonly from scratch. In tackling this fundamental problem, we move beyond this paradigm and develop a new learning paradigm named INTERN. By learning with supervisory signals from multiple sources in multiple stages, the model being trained develops strong generalizability. We evaluate our model on 26 well-known datasets that cover four categories of tasks in computer vision. In most cases, our models, adapted with only 10% of the training data in the target domain, outperform their counterparts trained with the full set of data, often by a significant margin. This is an important step towards a promising prospect where such a model with general vision capability can dramatically reduce our reliance on data, thus expediting the adoption of AI technologies. Furthermore, revolving around our new paradigm, we also introduce a new data system, a new architecture, and a new benchmark, which together form a general vision ecosystem to support its future development in an open and inclusive manner.
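The evaluation protocol claimed above, adapting a general pretrained backbone to a target domain with only 10% of its training data, might look roughly like the following sketch. The frozen-backbone, linear-head setup and all hyperparameters are assumptions for illustration; INTERN's actual adaptation recipe is not specified in this abstract.

```python
import random
import torch
from torch import nn
from torch.utils.data import DataLoader, Subset

def adapt_with_fraction(backbone: nn.Module, feat_dim: int, num_classes: int,
                        dataset, fraction: float = 0.10, epochs: int = 10,
                        lr: float = 1e-3, device: str = "cpu") -> nn.Linear:
    """Fit a linear head on a frozen pretrained backbone using only a fraction of the target data."""
    subset = Subset(dataset, random.sample(range(len(dataset)), int(fraction * len(dataset))))
    loader = DataLoader(subset, batch_size=64, shuffle=True)

    for p in backbone.parameters():                        # the general-purpose backbone stays frozen
        p.requires_grad = False
    backbone.eval().to(device)

    head = nn.Linear(feat_dim, num_classes).to(device)     # lightweight task-specific head
    opt = torch.optim.AdamW(head.parameters(), lr=lr)

    for _ in range(epochs):
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            with torch.no_grad():
                feats = backbone(images)                   # assumes the backbone returns pooled features
            loss = nn.functional.cross_entropy(head(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head
```

The comparison reported across the 26 datasets is between a model adapted this way on the 10% subset and a counterpart trained on the full target-domain data.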