MOON2.0: Dynamic Modality-balanced Multimodal Representation Learning for E-commerce Product Understanding
Nie, Zhanheng, Fu, Chenghan, Zhang, Daoze, Wu, Junxian, Guan, Wanxian, Wang, Pengjie, Xu, Jian, Zheng, Bo
arXiv.org Artificial Intelligence
Although recent multimodal large language models (MLLMs) for product understanding show strong representation learning capability for e-commerce, they still face three challenges: (i) modality imbalance induced by modality-mixed training; (ii) underutilization of the intrinsic alignment between visual and textual information within a product; and (iii) limited handling of noise in e-commerce multimodal data. To address these, we propose MOON2.0, a dynamic modality-balanced multimodal representation learning framework for e-commerce product understanding. MOON2.0 comprises: (1) a Modality-driven Mixture-of-Experts (MoE) module that adaptively processes input samples according to their modality composition, enabling Multimodal Joint Learning to mitigate modality imbalance; (2) a Dual-level Alignment method that better leverages the semantic alignment properties within individual products; and (3) an MLLM-based Image-text Co-augmentation strategy that integrates textual enrichment with visual expansion, coupled with Dynamic Sample Filtering to improve training data quality. We further introduce MBE2.0, a co-augmented multimodal representation benchmark for e-commerce representation learning and evaluation. Experiments show that MOON2.0 delivers state-of-the-art zero-shot performance on MBE2.0 and multiple public datasets. Furthermore, attention-based heatmap visualization provides qualitative evidence of MOON2.0's improved multimodal alignment.
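The idea of processing samples according to their modality composition can be illustrated with a toy sketch. This is not the MOON2.0 implementation: the abstract does not specify the gating mechanism, so the hard routing rule, the placeholder experts, and every name below (`modality_of`, `EXPERTS`, `route`) are assumptions made purely for illustration. A real system would more likely use learned experts and possibly soft gating.

```python
from typing import Callable, Dict, List, Optional

Vector = List[float]


def modality_of(image: Optional[Vector], text: Optional[Vector]) -> str:
    """Classify a sample by which modalities are present (hypothetical rule)."""
    if image is not None and text is not None:
        return "multimodal"
    if image is not None:
        return "image"
    if text is not None:
        return "text"
    raise ValueError("sample has neither image nor text features")


def mean_pool(a: Vector, b: Vector) -> Vector:
    """Element-wise average of two equal-length feature vectors."""
    return [(x + y) / 2 for x, y in zip(a, b)]


# Placeholder experts: a real model would use learned networks here.
EXPERTS: Dict[str, Callable[[Optional[Vector], Optional[Vector]], Vector]] = {
    "image": lambda image, text: list(image),
    "text": lambda image, text: list(text),
    "multimodal": lambda image, text: mean_pool(image, text),
}


def route(image: Optional[Vector], text: Optional[Vector]) -> Vector:
    """Send the sample to the expert matching its modality composition."""
    return EXPERTS[modality_of(image, text)](image, text)


# An image-only sample, a text-only sample, and an image-text sample
# each reach a different expert.
print(route([1.0, 3.0], None))
print(route(None, [3.0, 5.0]))
print(route([1.0, 3.0], [3.0, 5.0]))
```

The point of the sketch is only that image-only, text-only, and image-text samples follow different processing paths inside one model, which is the property the abstract associates with mitigating modality imbalance.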
Nov-18-2025
- Genre:
- Research Report (0.50)
- Industry:
- Information Technology > Services > e-Commerce Services (1.00)
- Technology: