Beyond Attention or Similarity: Maximizing Conditional Diversity for Token Pruning in MLLMs

Jun-11-2026, 05:55:38 GMT–Neural Information Processing Systems

In multimodal large language models (MLLMs), input visual tokens often far outnumber their textual counterparts, leading to high inference cost. Many works aim to address this issue by removing redundant visual tokens. However, current approaches either rely on attention-based pruning, which retains numerous duplicate tokens, or use similarity-based pruning, which overlooks relevance to the instruction. In both cases the result is suboptimal performance.
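To make the contrast between the two existing families concrete, here is a minimal sketch on toy data. It is not the paper's method: the function names `attention_topk` and `similarity_greedy`, the toy features, and the attention scores are all hypothetical, and the greedy farthest-point rule is only one simple way a similarity-based pruner could work.

```python
import numpy as np


def attention_topk(attn_scores, k):
    """Attention-style pruning (assumed form): keep the k highest-scoring tokens."""
    keep = np.argsort(-attn_scores, kind="stable")[:k]
    return np.sort(keep)


def similarity_greedy(features, k):
    """Similarity-style pruning (assumed form): greedily pick mutually dissimilar tokens.

    Uses cosine similarity only, so the instruction plays no role in the choice.
    """
    x = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = x @ x.T
    selected = [0]            # arbitrary seed: the first token
    max_sim = sim[0].copy()   # highest similarity of each token to the selected set
    while len(selected) < k:
        max_sim[selected] = np.inf          # never re-pick a selected token
        nxt = int(np.argmin(max_sim))       # token least similar to the selected set
        selected.append(nxt)
        max_sim = np.maximum(max_sim, sim[nxt])
    return np.sort(np.array(selected))


# Toy data (hypothetical): tokens 0 and 1 are exact duplicates.
feats = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
attn = np.array([0.9, 0.8, 0.1, 0.05])

print(attention_topk(attn, 2))      # keeps both duplicates
print(similarity_greedy(feats, 2))  # keeps dissimilar tokens, ignoring attn
```

On this toy input the attention-based rule spends its whole budget on the two duplicate tokens, while the similarity-based rule picks dissimilar tokens without consulting any instruction-dependent signal, which mirrors the two shortcomings the abstract describes.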

artificial intelligence, natural language, proceedings

Technology:
- Information Technology > Artificial Intelligence > Natural Language (0.59)