EMMA: Empowering Multi-modal Mamba with Structural and Hierarchical Alignment

Yifei Xing, Xiangyuan Lan, Ruiping Wang, Dongmei Jiang, Wenjun Huang, Qingfang Zheng, Yaowei Wang

arXiv.org Artificial Intelligence 

Mamba-based architectures have emerged as a promising new direction for deep learning models owing to their competitive performance and sub-quadratic deployment cost. However, current Mamba-based multi-modal large language models (MLLMs) are insufficient at extracting visual features, leading to imbalanced cross-modal alignment between visual and textual latents that negatively impacts performance on multi-modal tasks. In this work, we propose Empowering Multi-modal Mamba with Structural and Hierarchical Alignment (EMMA), which enables the MLLM to extract fine-grained visual information. Specifically, we propose a pixel-wise alignment module that autoregressively optimizes the learning and processing of spatial image-level features alongside textual tokens, enabling structural alignment at the image level. In addition, to prevent the degradation of visual information during the cross-modal alignment process, we propose a multi-scale feature fusion (MFF) module that combines multi-scale visual features from intermediate layers, enabling hierarchical alignment at the feature level. Extensive experiments are conducted across a variety of multi-modal benchmarks. Our model shows lower latency than other Mamba-based MLLMs and is nearly four times faster than transformer-based MLLMs of similar scale during inference. Owing to better cross-modal alignment, our model exhibits lower degrees of hallucination and enhanced sensitivity to visual details, which manifests in superior performance across diverse multi-modal benchmarks.

Recently, there has been a notable increase in the development of domain-general AI agents Kalla et al. (2023); Zhao et al. (2023a) that can solve a diverse range of tasks simultaneously with strong performance. Among them, multi-modal large language models (MLLMs) Achiam et al. (2023); Team et al. (2023); Liu et al. (2024a) have emerged as a promising direction due to their effectiveness in visual perception and logical reasoning. MLLMs usually consist of an image encoder that converts images into visual tokens and a strong large language model (LLM) backbone that processes the visual and textual tokens concurrently. This integration of visual and textual information not only enhances the understanding of visual content but also provides a more comprehensive context for language understanding and generation.
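To make this pipeline concrete, the following is a minimal PyTorch sketch of a multi-modal forward pass with multi-scale feature fusion, under stated assumptions: the vision encoder is assumed to expose per-layer hidden states via an `output_hidden_states` flag, and the class name `MultiScaleFeatureFusion`, the selected layer indices, and the dimensions are illustrative placeholders rather than the authors' implementation.

```python
# Hypothetical sketch (not the paper's released code): fuse visual features from
# several intermediate encoder layers, then prepend them to the text tokens so the
# language backbone processes both modalities concurrently.
import torch
import torch.nn as nn


class MultiScaleFeatureFusion(nn.Module):
    """Combine visual features taken from several encoder layers into one sequence."""

    def __init__(self, d_vis: int, d_model: int, num_scales: int):
        super().__init__()
        # One lightweight projection per selected encoder layer.
        self.proj = nn.ModuleList(
            [nn.Linear(d_vis, d_model) for _ in range(num_scales)]
        )
        # Learnable weights for mixing the scales.
        self.scale_weights = nn.Parameter(torch.ones(num_scales))

    def forward(self, hidden_states: list) -> torch.Tensor:
        # hidden_states: list of (batch, num_patches, d_vis) tensors, one per layer.
        w = torch.softmax(self.scale_weights, dim=0)
        fused = sum(w[i] * self.proj[i](h) for i, h in enumerate(hidden_states))
        return fused  # (batch, num_patches, d_model)


def multimodal_forward(vision_encoder, fusion, llm_backbone, pixel_values, text_embeds):
    """Encode an image, fuse multi-scale visual tokens, and feed them with text tokens."""
    with torch.no_grad():
        # Assumed interface: the encoder returns a sequence of per-layer hidden states.
        layer_outputs = vision_encoder(pixel_values, output_hidden_states=True)
    # Pick a few intermediate layers (indices chosen purely for illustration).
    selected = [layer_outputs[i] for i in (8, 16, 24)]
    visual_tokens = fusion(selected)                          # (B, P, d_model)
    inputs = torch.cat([visual_tokens, text_embeds], dim=1)   # visual tokens first
    return llm_backbone(inputs)                               # joint processing by the LLM
```

The softmax-weighted sum is only one plausible way to realize the fusion; the key design point it illustrates is that shallower encoder layers retain spatial detail that the final layer alone may lose, so mixing several depths before projection preserves more visual information for alignment with the text tokens.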