3D-MoE: A Mixture-of-Experts Multi-modal LLM for 3D Vision and Pose Diffusion via Rectified Flow
Ma, Yueen; Zhuang, Yuzheng; Hao, Jianye; King, Irwin
–arXiv.org Artificial Intelligence
3D vision and spatial reasoning have long been recognized as preferable for accurately perceiving our three-dimensional world, especially when compared with traditional visual reasoning based on 2D images. Due to the difficulties in collecting high-quality 3D data, research in this area has only recently gained momentum. With the advent of powerful large language models (LLMs), multimodal LLMs for 3D vision have been developed.

In recent years, 3D instruction-following data has become more common, and with the advent of large language models (LLMs), a range of multi-modal LLMs (MLLMs) has emerged. Following the success of LLaVA (Liu et al., 2023a) for 2D images, recent approaches (e.g., LEO (Huang et al., 2024) and ShapeLLM (Qi et al., 2024)) also integrate 3D encoders into LLMs through simple linear projection layers. Although these models handle tasks such as 3D question answering, 3D dialogue, and some embodied tasks, they devote relatively little attention to optimizing the LLM itself for multi-modal data.
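The sketch below is a minimal, hypothetical illustration of the linear-projection bridging described above (not the released code of LEO, ShapeLLM, or this paper): a single linear layer maps features from a 3D encoder into the LLM's token-embedding space, and the projected features are concatenated with text embeddings. All class names, dimensions, and shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Linear3DProjector(nn.Module):
    """Hypothetical bridge: projects 3D encoder features into the LLM embedding space."""
    def __init__(self, enc_dim: int = 512, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(enc_dim, llm_dim)  # the "simple linear projection layer"

    def forward(self, feats_3d: torch.Tensor) -> torch.Tensor:
        # feats_3d: (batch, num_3d_tokens, enc_dim) from a (typically frozen) 3D encoder
        return self.proj(feats_3d)  # (batch, num_3d_tokens, llm_dim)

if __name__ == "__main__":
    projector = Linear3DProjector()
    feats_3d = torch.randn(2, 256, 512)   # assumed 3D encoder output
    text_emb = torch.randn(2, 32, 4096)   # assumed text token embeddings
    # Prepend projected 3D "tokens" so the LLM attends over 3D and language jointly.
    inputs = torch.cat([projector(feats_3d), text_emb], dim=1)
    print(inputs.shape)  # torch.Size([2, 288, 4096])
```

Under this scheme the LLM backbone itself is unchanged, which is the limitation the excerpt points to: the projection adapts the 3D features to the LLM, but little of the LLM is optimized for multi-modal data.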
Jan-27-2025