Rebalancing Contrastive Alignment with Bottlenecked Semantic Increments in Text-Video Retrieval
Neural Information Processing Systems
Recent progress in text-video retrieval has been largely driven by contrastive learning. However, existing methods often overlook the effect of the modality gap, which causes anchor representations to undergo in-place optimization (i.e., optimization tension) that limits their alignment capacity. Moreover, noisy hard negatives further distort the semantics of anchors. To address these issues, we propose GARE, a Gap-Aware Retrieval framework that introduces a learnable, pair-specific increment Δ_ij between text t_i and video v_j, redistributing gradients to relieve optimization tension and absorb noise. We derive Δ_ij via a multivariate first-order Taylor expansion of the InfoNCE loss under a trust-region constraint, showing that it guides updates along locally consistent descent directions. A lightweight neural module conditioned on the semantic gap couples increments across batches for structure-aware correction. Furthermore, we regularize Δ through a variational information bottleneck with relaxed compression, enhancing stability and semantic consistency. Experiments on four benchmarks demonstrate that GARE consistently improves alignment accuracy and robustness, validating the effectiveness of gap-aware tension mitigation.
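The first-order, trust-region idea in the abstract can be illustrated with a small numerical sketch. This is not the GARE implementation: it assumes the increment is added to the text embedding, that similarity is a temperature-scaled dot product, and that each increment is bounded by an L2 ball of radius `eps`. Under those assumptions the linearized InfoNCE loss is minimized by stepping a distance `eps` against the gradient, which is what the hypothetical helper `trust_region_increment` computes in closed form; the paper instead predicts the increment with a learned module.

```python
import numpy as np

def info_nce(t, v, delta, tau=0.1):
    """Mean text-to-video InfoNCE with logits <t_i + delta_ij, v_j> / tau.

    Assumed shapes: t and v are (B, D); delta is (B, B, D).
    """
    logits = np.einsum("ijd,jd->ij", t[:, None, :] + delta, v) / tau
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

def trust_region_increment(t, v, eps=0.05, tau=0.1):
    """Closed-form minimizer of the linearized loss subject to ||delta_ij|| <= eps.

    At delta = 0 the gradient is dL/d(delta_ij) = (p_ij - 1[i == j]) * v_j / (tau * B),
    where p_ij is the softmax over videos j for text i.
    """
    batch = t.shape[0]
    logits = t @ v.T / tau
    logits = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    p = p / p.sum(axis=1, keepdims=True)
    coeff = (p - np.eye(batch)) / (tau * batch)          # (B, B)
    grad = coeff[:, :, None] * v[None, :, :]             # (B, B, D)
    norm = np.linalg.norm(grad, axis=-1, keepdims=True)
    # Step to the trust-region boundary along the negative gradient.
    return -eps * grad / np.maximum(norm, 1e-30)

rng = np.random.default_rng(0)
t = rng.normal(size=(4, 8))
v = rng.normal(size=(4, 8))
t /= np.linalg.norm(t, axis=1, keepdims=True)
v /= np.linalg.norm(v, axis=1, keepdims=True)

delta = trust_region_increment(t, v)
loss_before = info_nce(t, v, np.zeros_like(delta))
loss_after = info_nce(t, v, delta)
print(loss_before, loss_after)
```

In this toy setting the increment for a matched pair points toward its video and the increment for a mismatched pair points away from it, so the loss after applying `delta` is lower than before while the anchor embeddings `t` themselves are left unchanged.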
Jun-23-2026, 05:15:26 GMT
- Genre:
  - Research Report > Experimental Study (1.00)
- Technology:
  - Information Technology > Artificial Intelligence
    - Representation & Reasoning (1.00)
    - Natural Language (1.00)
    - Vision (0.93)
    - Machine Learning
      - Statistical Learning (0.68)
      - Neural Networks (0.67)