Don't Just Chase " Highlighted Tokens " in MLLMs: Revisiting Visual Holistic Context Retention
–Neural Information Processing Systems
Despite their powerful capabilities, Multimodal Large Language Models (MLLMs) suffer from considerable computational overhead due to their reliance on massive visual tokens. Recent studies have explored token pruning to alleviate this problem, which typically uses text-vision cross-attention or [CLS] attention to assess and discard redundant visual tokens. In this work, we identify a critical limitation of such attention-first pruning approaches, i.e., they tend to preserve semantically similar tokens, resulting in pronounced performance drops under high pruning ratios. To this end, we propose HoloV, a simple yet effective, plug-and-play visual token pruning framework for efficient inference.
Neural Information Processing Systems
Jun-16-2026, 09:37:29 GMT
- Genre:
- Research Report
- Experimental Study (1.00)
- New Finding (0.92)
- Research Report
- Industry:
- Information Technology (0.46)
- Health & Medicine (0.46)
- Education (0.45)
- Technology: