Overleaf Example
Neural Information Processing Systems
Recent advancements in image understanding have benefited from the extensive use of web image-text pairs. However, video understanding remains a challenge despite the availability of substantial web video-text data. This difficulty primarily arises from the inherent complexity of videos and the inefficient language supervision in recent web-collected video-text datasets. In this paper, we introduce Text-Only Pre-Alignment (TOPA), a novel approach to extend large language models (LLMs) for video understanding, without the need for pre-training on real video data. Specifically, we first employ an advanced LLM to automatically generate Textual Videos comprising continuous textual frames, along with corresponding annotations to simulate real video-text pairs.
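As a rough illustration of the data described above, a Textual Video can be thought of as an ordered list of textual frames paired with annotations. The sketch below is only an assumption about how such a record might be represented: the class name `TextualVideo`, its fields (`frames`, `caption`, `qa_pairs`), the helper `to_training_pair`, and the sample sentences are all hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class TextualVideo:
    """Hypothetical container for one LLM-generated Textual Video."""
    frames: List[str]  # consecutive textual frames, in temporal order
    caption: str = ""  # assumed annotation type: a global description
    qa_pairs: List[Tuple[str, str]] = field(default_factory=list)  # assumed annotation type


def to_training_pair(video: TextualVideo) -> dict:
    """Flatten a Textual Video into a (frame sequence, target text) pair."""
    if not video.frames:
        raise ValueError("a Textual Video needs at least one textual frame")
    return {"frames": list(video.frames), "target": video.caption}


# Made-up sample, for illustration only.
example = TextualVideo(
    frames=[
        "A chef places a pan on the stove.",
        "The chef cracks two eggs into the pan.",
        "The chef slides the eggs onto a plate.",
    ],
    caption="A chef fries eggs and plates them.",
)
pair = to_training_pair(example)
```

Keeping the frames as an ordered list preserves the temporal structure that the textual frames are meant to simulate, so the record can stand in for a real video-text pair.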
Mar-18-2025, 05:36:25 GMT
- Technology: