Rebalancing Contrastive Alignment with Bottlenecked Semantic Increments in Text-Video Retrieval

Jun-23-2026, 05:15:26 GMT–Neural Information Processing Systems

Recent progress in text-video retrieval has been largely driven by contrastive learning. However, existing methods often overlook the effect of the modality gap, which causes anchor representations to undergo in-place optimization (i.e., optimization tension) that limits their alignment capacity. Moreover, noisy hard negatives further distort the semantics of anchors. To address these issues, we propose GARE, a Gap-Aware Retrieval framework that introduces a learnable, pair-specific increment ij between text ti and video vj, redistributing gradients to relieve optimization tension and absorb noise. We derive ij via a multivariate first-order Taylor expansion of the InfoNCE loss under a trust-region constraint, showing that it guides updates along locally consistent descent directions. A lightweight neural module conditioned on the semantic gap couples increments across batches for structure-aware correction. Furthermore, we regularize through a variational information bottleneck with relaxed compression, enhancing stability and semantic consistency. Experiments on four benchmarks demonstrate that GARE consistently improves alignment accuracy and robustness, validating the effectiveness of gap-aware tension mitigation.

artificial intelligence, machine learning, natural language, (17 more...)

Neural Information Processing Systems

Jun-23-2026, 05:15:26 GMT

Conferences PDF

Add feedback

Genre:
- Research Report > Experimental Study (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning (1.00)
  - Natural Language (1.00)
  - Vision (0.93)
  - Machine Learning
    - Statistical Learning (0.68)
    - Neural Networks (0.67)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found