One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding

Jun-12-2026, 05:04:59 GMT–Neural Information Processing Systems

Long video understanding is inherently challenging for vision-language models (VLMs) because of the large number of frames. With each video frame typically expanding into tens or hundreds of tokens, the limited context length of large language models (LLMs) forces VLMs to perceive the frames sparsely and lose temporal information. To address this, we explore extreme video token compression towards one token per frame at the final LLM layer. Our key insight is that heuristic-based compression, widely adopted by previous methods, is prone to information loss, and this necessitates supervising the LLM layers themselves to act as compression modules (LP-Comp). Such compression enables our VLM to digest 2x-4x more frames with improved performance. To further increase token efficiency, we investigate frame selection, which picks the frames most relevant to the queries via the internal attention scores of the LLM layers, named QC-Comp. As a notable distinction from previous studies, we mitigate the position bias of LLM attention in long contexts, i.e., the over-concentration on the beginning and end of a sequence, by splitting long videos into short segments and employing local attention.
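The frame-selection idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the per-frame relevance logits, and the rule of keeping a fixed number of frames per segment are all assumptions made for illustration. The sketch only shows why scoring frames inside short segments keeps frames in the middle of a long video from being crowded out by attention that concentrates on the beginning and end of the sequence.

```python
import math


def segment_local_scores(logits, segment_len):
    """Softmax-normalize per-frame relevance logits inside each short segment.

    `logits` is a hypothetical list with one query-to-frame relevance value
    per frame (in a real VLM these would come from LLM attention; here they
    are plain numbers). Normalizing per segment stands in for local attention.
    """
    scores = []
    for start in range(0, len(logits), segment_len):
        seg = logits[start:start + segment_len]
        peak = max(seg)  # subtract the max for numerical stability
        exps = [math.exp(x - peak) for x in seg]
        total = sum(exps)
        scores.extend(e / total for e in exps)
    return scores


def select_frames(logits, segment_len, keep_per_segment):
    """Return sorted indices of the highest-scoring frames in each segment.

    Keeping a fixed budget per segment is an assumption of this sketch,
    not a rule taken from the paper.
    """
    scores = segment_local_scores(logits, segment_len)
    selected = []
    for start in range(0, len(scores), segment_len):
        idx = range(start, min(start + segment_len, len(scores)))
        ranked = sorted(idx, key=lambda i: scores[i], reverse=True)
        selected.extend(ranked[:keep_per_segment])
    return sorted(selected)


# Eight frames, two segments of four frames, one frame kept per segment.
example_logits = [0.1, 2.0, 0.3, 0.2, 0.0, 1.5, 0.4, 3.0]
print(select_frames(example_logits, segment_len=4, keep_per_segment=1))  # [1, 7]
```

Because each segment is scored on its own, every part of the video contributes candidates regardless of where it sits in the full sequence.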

artificial intelligence, large language model, natural language


