moec
MoCHA: Advanced Vision-Language Reasoning with MoE Connector and Hierarchical Group Attention
Pang, Yuqi, Yang, Bowen, Cao, Yun, Fan, Rong, Li, Xiaoyu, He, Chen
Vision large language models (VLLMs) are focusing primarily on handling complex and fine-grained visual information by incorporating advanced vision encoders and scaling up visual models. However, these approaches face high training and inference costs, as well as challenges in extracting visual details, effectively bridging across modalities. In this work, we propose a novel visual framework, MoCHA, to address these issues. Our framework integrates four vision backbones (i.e., CLIP, SigLIP, DINOv2 and ConvNeXt) to extract complementary visual features and is equipped with a sparse Mixture of Experts Connectors (MoECs) module to dynamically select experts tailored to different visual dimensions. To mitigate redundant or insufficient use of the visual information encoded by the MoECs module, we further design a Hierarchical Group Attention (HGA) with intra- and inter-group operations and an adaptive gating strategy for encoded visual features. We train MoCHA on two mainstream LLMs (e.g., Phi2-2.7B and Vicuna-7B) and evaluate their performance across various benchmarks. Notably, MoCHA outperforms state-of-the-art open-weight models on various tasks. For example, compared to CuMo (Mistral-7B), our MoCHA (Phi2-2.7B) presents outstanding abilities to mitigate hallucination by showing improvements of 3.25% in POPE and to follow visual instructions by raising 153 points on MME. Finally, ablation studies further confirm the effectiveness and robustness of the proposed MoECs and HGA in improving the overall performance of MoCHA.
MoEC: Mixture of Experts Implicit Neural Compression
Zhao, Jianchen, Tseng, Cheng-Ching, Lu, Ming, An, Ruichuan, Wei, Xiaobao, Sun, He, Zhang, Shanghang
Emerging Implicit Neural Representation (INR) is a promising data compression technique, which represents the data using the parameters of a Deep Neural Network (DNN). Existing methods manually partition a complex scene into local regions and overfit the INRs into those regions. However, manually designing the partition scheme for a complex scene is very challenging and fails to jointly learn the partition and INRs. To solve the problem, we propose MoEC, a novel implicit neural compression method based on the theory of mixture of experts. Specifically, we use a gating network to automatically assign a specific INR to a 3D point in the scene. The gating network is trained jointly with the INRs of different local regions. Compared with block-wise and tree-structured partitions, our learnable partition can adaptively find the optimal partition in an end-to-end manner. We conduct detailed experiments on massive and diverse biomedical data to demonstrate the advantages of MoEC against existing approaches. In most of experiment settings, we have achieved state-of-the-art results. Especially in cases of extreme compression ratios, such as 6000x, we are able to uphold the PSNR of 48.16.
MoEC: Mixture of Expert Clusters
Xie, Yuan, Huang, Shaohan, Chen, Tianyu, Wei, Furu
Sparsely Mixture of Experts (MoE) has received great interest due to its promising scaling capability with affordable computational overhead. MoE converts dense layers into sparse experts, and utilizes a gated routing network to make experts conditionally activated. However, as the number of experts grows, MoE with outrageous parameters suffers from overfitting and sparse data allocation. Such problems are especially severe on tasks with limited data, thus hindering the progress for MoE models to improve performance by scaling up. In this work, we propose Mixture of Expert Clusters - a general approach to enable expert layers to learn more diverse and appropriate knowledge by imposing variance-based constraints on the routing stage. We further propose a cluster-level expert dropout strategy specifically designed for the expert cluster structure. Our experiments reveal that MoEC could improve performance on machine translation and natural language understanding tasks, and raise the performance upper bound for scaling up experts under limited data. We also verify that MoEC plays a positive role in mitigating overfitting and sparse data allocation.