Diffusion-Inspired Truncated Sampler for Text-Video Retrieval

May-28-2025, 08:43:35 GMT–Neural Information Processing Systems

Prevalent text-to-video retrieval methods represent multimodal text-video data in a joint embedding space, aiming to pull relevant text-video pairs together and push irrelevant ones apart. One main challenge for state-of-the-art retrieval methods is the modality gap, which stems from the substantial disparities between text and video and can persist in the joint space. In this work, we leverage the potential of diffusion models to address the text-video modality gap by progressively aligning text and video embeddings in a unified space.
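To make the terms above concrete, the following is a minimal sketch of retrieval in a joint embedding space and of one common proxy for the modality gap. It is not the sampler proposed in this work. The function names (`rank_videos`, `modality_gap`), the toy 3-dimensional embeddings, and the choice of centroid distance as the gap measure are all assumptions made for illustration.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_sim(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

def rank_videos(text_emb, video_embs):
    """Return video indices sorted by descending similarity to the text query."""
    scores = [cosine_sim(text_emb, v) for v in video_embs]
    return sorted(range(len(video_embs)), key=lambda i: scores[i], reverse=True)

def centroid(embs):
    """Component-wise mean of a list of vectors."""
    n = len(embs)
    return [sum(col) / n for col in zip(*embs)]

def modality_gap(text_embs, video_embs):
    """Assumed proxy: Euclidean distance between the centroids of the
    unit-normalized text embeddings and video embeddings."""
    ct = centroid([l2_normalize(e) for e in text_embs])
    cv = centroid([l2_normalize(e) for e in video_embs])
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(ct, cv)))

# Hypothetical toy embeddings: text i matches video i, but every video
# embedding carries a shared offset along the third axis.
text_embs = [[1.0, 0.1, 0.0], [0.1, 1.0, 0.0]]
video_embs = [[0.9, 0.0, 0.5], [0.0, 0.9, 0.5]]

ranking = rank_videos(text_embs[0], video_embs)
gap = modality_gap(text_embs, video_embs)
print(ranking, gap)
```

In this toy setup the matching video is still ranked first, yet the centroid distance is nonzero because of the shared offset. That illustrates how a gap between modalities can persist in a joint space even when the pairwise ranking is correct.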

artificial intelligence, machine learning, natural language

Genre:
- Research Report > Experimental Study (0.93)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning > Neural Networks
    - Deep Learning (0.67)
  - Natural Language (1.00)
  - Vision (1.00)