Chen, Zeren
HexPlane Representation for 3D Semantic Scene Understanding
Chen, Zeren, Hou, Yuenan, Chen, Yulin, Liu, Li, Sun, Xiao, Sheng, Lu
In this paper, we introduce the HexPlane representation for 3D semantic scene understanding. Specifically, we first design the View Projection Module (VPM) to project the 3D point cloud onto six planes so as to maximally retain the original spatial information. Features of the six planes are extracted by a 2D encoder and sent to the HexPlane Association Module (HAM), which adaptively fuses the most informative features for each point. The fused point features are then fed to the task head to yield the final predictions. Compared to the popular point and voxel representations, the HexPlane representation is efficient and can utilize highly optimized 2D operations to process sparse and unordered 3D point clouds. It can also leverage off-the-shelf 2D models, network weights, and training recipes to achieve accurate scene understanding in 3D space. On the ScanNet and SemanticKITTI benchmarks, our algorithm, dubbed HexNet3D, achieves performance competitive with previous algorithms. In particular, on the ScanNet 3D segmentation task, our method obtains 77.0 mIoU on the validation set, surpassing Point Transformer V2 by 1.6 mIoU. We also observe encouraging results on indoor 3D detection tasks. Note that our method can be seamlessly integrated into existing voxel-based, point-based, and range-based approaches and brings considerable gains without bells and whistles. The code will be made available upon publication.
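As a rough illustration of the projection-and-fusion pipeline described in this abstract, the following PyTorch sketch rasterizes a point cloud onto the six faces of its bounding box, encodes each plane with a shared 2D encoder, and fuses the per-point plane features with a learned attention gate. The module roles mirror VPM and HAM only in spirit; all shapes, the depth rasterization, and the attention gate are illustrative assumptions, not the released implementation.

import torch
import torch.nn as nn

class HexPlaneFusionSketch(nn.Module):
    # Sketch: project points onto six axis-aligned planes, encode each plane
    # in 2D, then fuse the six per-point features with a learned gate.
    def __init__(self, feat_dim=64, grid=128):
        super().__init__()
        self.grid = grid
        self.encoder2d = nn.Conv2d(1, feat_dim, 3, padding=1)  # toy 2D encoder
        self.gate = nn.Linear(feat_dim, 1)                     # per-plane score

    def forward(self, xyz):                        # xyz: (N, 3), normalized to [0, 1]
        point_feats = []
        for axis in range(3):                      # drop one axis per projection
            u, v = [a for a in range(3) if a != axis]
            uv = (xyz[:, [u, v]] * (self.grid - 1)).long()
            for depth in (xyz[:, axis], 1.0 - xyz[:, axis]):   # opposite faces
                canvas = torch.zeros(1, 1, self.grid, self.grid)
                canvas[0, 0, uv[:, 0], uv[:, 1]] = depth       # depth rasterization
                fmap = self.encoder2d(canvas)                  # (1, C, H, W)
                point_feats.append(fmap[0, :, uv[:, 0], uv[:, 1]].t())  # (N, C)
        feats = torch.stack(point_feats, dim=1)                # (N, 6, C)
        weights = torch.softmax(self.gate(feats), dim=1)       # (N, 6, 1)
        return (weights * feats).sum(dim=1)                    # fused (N, C)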
A Topic-level Self-Correctional Approach to Mitigate Hallucinations in MLLMs
He, Lehan, Chen, Zeren, Shi, Zhelun, Yu, Tianyu, Shao, Jing, Sheng, Lu
Aligning the behaviors of Multimodal Large Language Models (MLLMs) with human preferences is crucial for developing robust and trustworthy AI systems. While recent attempts have employed human experts or powerful auxiliary AI systems to provide more accurate preference feedback, such as determining the preferable responses from MLLMs or directly rewriting hallucination-free responses, the extensive resource overhead compromises the scalability of feedback collection. In this work, we introduce Topic-level Preference Overwriting (TPO), a self-correctional approach that guides the model itself to mitigate its own hallucinations at the topic level. Through a deconfounded strategy that replaces each topic within the response with the best or worst alternatives generated by the model itself, TPO creates more contrasting pairwise preference feedback, enhancing the feedback quality without human or proprietary model intervention. Notably, the experimental results demonstrate that the proposed TPO achieves state-of-the-art performance in trustworthiness, significantly reducing object hallucinations by 92% and overall hallucinations by 38%. Code, model, and dataset are now available.
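A minimal sketch of the topic-level overwriting step described above, assuming the caller supplies the model's topic resampling and self-scoring as callables; the function names and the pair format are assumptions for illustration, not the released pipeline.

def build_preference_pair(response_topics, resample_topic, score_topic, n_candidates=5):
    # Overwrite each topic with the best / worst of n model-resampled
    # alternatives (the deconfounded replacement described in the abstract).
    # `resample_topic(topics, i)` and `score_topic(candidate)` are hypothetical
    # callables standing in for the MLLM's generation and self-evaluation.
    best, worst = list(response_topics), list(response_topics)
    for i in range(len(response_topics)):
        candidates = [resample_topic(response_topics, i) for _ in range(n_candidates)]
        ranked = sorted(candidates, key=score_topic)
        best[i], worst[i] = ranked[-1], ranked[0]
    # The chosen / rejected responses differ only at the topic level, yielding
    # a more contrasting pairwise preference signal for alignment training.
    return {"chosen": " ".join(best), "rejected": " ".join(worst)}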
RH20T-P: A Primitive-Level Robotic Dataset Towards Composable Generalization Agents
Chen, Zeren, Shi, Zhelun, Lu, Xiaoya, He, Lehan, Qian, Sucheng, Fang, Hao Shu, Yin, Zhenfei, Ouyang, Wanli, Shao, Jing, Qiao, Yu, Lu, Cewu, Sheng, Lu
The ultimate goal of robotic learning is to acquire a comprehensive and generalizable robotic system capable of performing both seen skills within the training distribution and unseen skills in novel environments. Recent progress in utilizing language models as high-level planners has demonstrated that the complexity of tasks can be reduced by decomposing them into primitive-level plans, making it possible to generalize to novel robotic tasks in a composable manner. Despite this promising outlook, the community is not yet adequately prepared for composable generalization agents, particularly due to the lack of primitive-level real-world robotic datasets. In this paper, we propose a primitive-level robotic dataset, namely RH20T-P, which contains about 33,000 video clips covering 44 diverse and complicated robotic tasks. Each clip is manually annotated according to a set of meticulously designed primitive skills, facilitating the future development of composable generalization agents. To validate the effectiveness of RH20T-P, we also construct a promising and scalable agent based on RH20T-P, called RA-P. Equipped with two planners specialized in task decomposition and motion planning, RA-P can adapt to novel physical skills through composable generalization. Our website and videos can be found at https://sites.google.com/view/rh20t-primitive/main. The dataset and code will be made available soon.
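The two-planner loop described above can be pictured with a short sketch in which a task planner emits primitive-level steps and a motion planner turns each step into a trajectory; the primitive fields and interfaces below are assumptions for illustration, not the RH20T-P annotation schema.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Primitive:
    name: str          # e.g. "move_to", "grasp", "rotate" (hypothetical vocabulary)
    args: dict         # e.g. {"target": "red block"}

def run_agent(instruction: str,
              task_planner: Callable[[str], List[Primitive]],
              motion_planner: Callable[[Primitive], list],
              execute: Callable[[list], None]) -> None:
    # Compose seen primitives into an unseen task: decompose, then plan motion
    # for each primitive-level step and execute it.
    for primitive in task_planner(instruction):   # primitive-level plan
        trajectory = motion_planner(primitive)    # low-level motion for one step
        execute(trajectory)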
Octavius: Mitigating Task Interference in MLLMs via LoRA-MoE
Chen, Zeren, Wang, Ziqin, Wang, Zhen, Liu, Huayang, Yin, Zhenfei, Liu, Si, Sheng, Lu, Ouyang, Wanli, Qiao, Yu, Shao, Jing
Recent studies have demonstrated that Large Language Models (LLMs) can extend their zero-shot generalization capabilities to multimodal learning through instruction tuning. As more modalities and downstream tasks are introduced, negative conflicts and interference may increasingly degrade performance. While this phenomenon has been overlooked in previous work, we propose a novel and extensible framework, called Octavius, for comprehensive studies and experimentation on multimodal learning with Multimodal Large Language Models (MLLMs). Specifically, we combine the well-known Mixture-of-Experts (MoE) with one of the representative PEFT techniques, i.e., LoRA, to design a novel LLM-based decoder, called LoRA-MoE, for multimodal learning. To the best of our knowledge, ours is one of the first efforts to introduce MoE into MLLMs to address this problem. The experimental results (about 20% improvement) show the effectiveness and versatility of our design in various 2D and 3D downstream tasks. Code and datasets are available at https://openlamm.github.io/paper.
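A minimal PyTorch sketch of the LoRA-MoE idea named in the abstract, i.e., a frozen base projection augmented with a gated mixture of low-rank experts; the expert count, rank, and simple instance-level softmax gate are illustrative assumptions, not the Octavius implementation.

import torch
import torch.nn as nn

class LoRAMoESketch(nn.Module):
    # Illustrative LoRA-MoE layer: a frozen base projection plus a softmax-gated
    # mixture of low-rank (LoRA) experts; only the experts and gate are trained.
    def __init__(self, dim=512, rank=8, num_experts=4):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        for p in self.base.parameters():
            p.requires_grad_(False)                # frozen LLM weight
        self.down = nn.ModuleList([nn.Linear(dim, rank, bias=False) for _ in range(num_experts)])
        self.up = nn.ModuleList([nn.Linear(rank, dim, bias=False) for _ in range(num_experts)])
        self.gate = nn.Linear(dim, num_experts)    # instance-level router

    def forward(self, x):                          # x: (batch, seq, dim)
        weights = torch.softmax(self.gate(x.mean(dim=1)), dim=-1)   # (batch, E)
        delta = torch.stack([up(down(x)) for down, up in zip(self.down, self.up)], dim=-1)
        return self.base(x) + (delta * weights[:, None, None, :]).sum(dim=-1)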