Align Where the Words Look: Cross-Attention-Guided Patch Alignment with Contrastive and Transport Regularization for Bengali Captioning
Anonto, Riad Ahmed; Zabin, Sardar Md. Saffat; Rahman, M. Saifur
arXiv.org Artificial Intelligence
Grounding vision-language models in low-resource languages remains challenging, as they often produce fluent text about the wrong objects. This stems from scarce paired data, translation pivots that break alignment, and English-centric pretraining that ignores target-language semantics. We address this with a compute-aware Bengali captioning pipeline trained on LaBSE-verified EN-BN pairs and 110k bilingual-prompted synthetic images. A frozen MaxViT yields stable visual patches, a Bengali-native mBART-50 decodes, and a lightweight bridge links the modalities. Our core novelty is a tri-loss objective: Patch-Alignment Loss (PAL) aligns real and synthetic patch descriptors using decoder cross-attention, InfoNCE enforces global real-synthetic separation, and Sinkhorn-based OT ensures balanced fine-grained patch correspondence. This PAL+InfoNCE+OT synergy improves grounding, reduces spurious matches, and drives strong gains on Flickr30k-1k (BLEU-4 12.29, METEOR 27.98, BERTScore-F1 71.20) and MSCOCO-1k (BLEU-4 12.00, METEOR 28.14, BERTScore-F1 75.40), outperforming strong CE baselines and narrowing the real-synthetic centroid gap by 41%.
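The abstract names the three loss terms but gives no formulas, so the following is only a minimal sketch of what such terms commonly look like, not the authors' implementation. All function names, the cosine-based cost, the temperature and epsilon values, the uniform marginals, and the choice to treat a real embedding and its paired synthetic embedding as the InfoNCE positive are assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-12)

def attention_pool(patches, attn):
    """Attention-weighted mean of patch vectors (hypothetical PAL descriptor)."""
    total = sum(attn)
    dim = len(patches[0])
    return [sum(w * p[d] for w, p in zip(attn, patches)) / total
            for d in range(dim)]

def pal_loss(real, attn_real, synth, attn_synth):
    """Assumed PAL form: cosine distance between attention-pooled descriptors."""
    return 1.0 - cosine(attention_pool(real, attn_real),
                        attention_pool(synth, attn_synth))

def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE: anchors[i] is scored against every entry of positives,
    and positives[i] is treated as the correct match (an assumption)."""
    total = 0.0
    for i, a in enumerate(anchors):
        logits = [cosine(a, p) / temperature for p in positives]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(anchors)

def sinkhorn_plan(cost, epsilon=0.1, n_iters=200):
    """Entropic OT plan with uniform marginals via Sinkhorn scaling."""
    n, m = len(cost), len(cost[0])
    K = [[math.exp(-c / epsilon) for c in row] for row in cost]
    r, c = 1.0 / n, 1.0 / m
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        u = [r / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [c / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

def ot_loss(real, synth, epsilon=0.1):
    """Transport cost between two patch sets under a cosine-distance cost."""
    cost = [[1.0 - cosine(p, q) for q in synth] for p in real]
    plan = sinkhorn_plan(cost, epsilon)
    return sum(plan[i][j] * cost[i][j]
               for i in range(len(real)) for j in range(len(synth)))
```

In a training loop these terms would typically be added to the captioning cross-entropy with scalar weights. The abstract does not state those weights, so none are suggested here.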
Sep-24-2025
- Country:
  - Asia
    - Bangladesh > Dhaka Division
      - Dhaka District > Dhaka (0.04)
    - Singapore (0.04)
  - Europe
    - Finland > Uusimaa
      - Helsinki (0.04)
    - Ireland > Leinster
      - County Dublin > Dublin (0.04)
    - Switzerland (0.04)
  - North America
    - Canada > Ontario
      - Toronto (0.04)
    - United States (0.04)
- Genre:
  - Research Report (0.50)
- Technology: