Dual-branch Prompting for Multimodal Machine Translation
Jie Wang, Zhendong Yang, Liansong Zong, Xiaobo Zhang, Dexian Wang, Ji Zhang
–arXiv.org Artificial Intelligence
Multimodal Machine Translation (MMT) typically enhances text-only translation by incorporating aligned visual features. Despite remarkable progress, state-of-the-art MMT approaches often rely on paired image-text inputs at inference time and are sensitive to irrelevant visual noise, which limits their robustness and practical applicability. To address these issues, we propose D2P-MMT, a diffusion-based dual-branch prompting framework for robust vision-guided translation. Specifically, D2P-MMT requires only the source text and a reconstructed image generated by a pre-trained diffusion model, which naturally filters out distracting visual details while preserving semantic cues. During training, the model jointly learns from both authentic and reconstructed images using a dual-branch prompting strategy, encouraging rich cross-modal interactions. To bridge the modality gap and mitigate the train-inference discrepancy, we introduce a distributional alignment loss that enforces consistency between the output distributions of the two branches. Extensive experiments on the Multi30K dataset demonstrate that D2P-MMT achieves superior translation performance compared to existing state-of-the-art approaches.
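The abstract does not specify the exact form of the distributional alignment loss; a common choice for enforcing consistency between two branches' output distributions is a symmetric KL divergence over the decoder's vocabulary probabilities. The sketch below illustrates that idea under this assumption; the function and variable names (`dual_branch_alignment_loss`, `logits_auth`, `logits_recon`) are illustrative, not from the paper.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) summed over all positions and vocabulary entries."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def dual_branch_alignment_loss(logits_auth, logits_recon):
    """Symmetric KL between the authentic-image branch and the
    reconstructed-image branch. This is a hypothetical instantiation
    of the paper's distributional alignment loss, not its exact form."""
    p = softmax(logits_auth)   # branch conditioned on the authentic image
    q = softmax(logits_recon)  # branch conditioned on the diffusion-reconstructed image
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))

# Toy usage: identical branch outputs give zero loss,
# diverging outputs give a positive penalty.
same = np.array([[2.0, 0.5, -1.0]])
diff = np.array([[0.1, 1.5, 0.0]])
print(dual_branch_alignment_loss(same, same))  # 0.0
print(dual_branch_alignment_loss(same, diff) > 0)  # True
```

Minimizing such a term during training pushes the reconstructed-image branch to match the authentic-image branch, so that at inference the model can drop the authentic image and rely on the reconstruction alone.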
Dec-5-2025