DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs

Neural Information Processing Systems 

This inevitably introduces a tremendous memory and compute overhead into the LLMs, which is particularly significant for high-resolution images and multi-frame videos. Several previous works attempt to mitigate this issue by proposing various token compression strategies. A straightforward way is to reduce the number of tokens with spatial grouping [70, 47]. Instead of pooling vision tokens, a few works concatenate local tokens along the feature dimension to preserve visual information [11, 48]. Moreover, other works seek more sophisticated token resampling, such as Q-Former [43], Perceiver [4], Abstractor [8], etc.
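To make the first two compression strategies concrete, the following is a minimal sketch (not from the paper; the function names and toy grid are illustrative) contrasting 2x2 spatial pooling, which discards detail, with 2x2 feature-dimension concatenation, which keeps all values but widens each token:

```python
# Hedged sketch of two token-compression strategies on a toy
# H x W grid of C-dimensional vision tokens (row-major list).

def avg_pool_2x2(tokens, h, w):
    """Spatial grouping: average each 2x2 window -> 4x fewer tokens,
    same feature dimension (information is discarded)."""
    c = len(tokens[0])
    out = []
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            group = [tokens[(i + di) * w + (j + dj)]
                     for di in (0, 1) for dj in (0, 1)]
            out.append([sum(t[k] for t in group) / 4 for k in range(c)])
    return out

def concat_2x2(tokens, h, w):
    """Concatenate each 2x2 window along the feature dimension ->
    4x fewer tokens but 4x wider, so no values are discarded."""
    out = []
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            merged = []
            for di in (0, 1):
                for dj in (0, 1):
                    merged.extend(tokens[(i + di) * w + (j + dj)])
            out.append(merged)
    return out

# Toy 4x4 grid of 2-dim tokens.
h, w = 4, 4
tokens = [[float(n), float(n)] for n in range(h * w)]
pooled = avg_pool_2x2(tokens, h, w)
packed = concat_2x2(tokens, h, w)
print(len(pooled), len(pooled[0]))  # 4 tokens of dim 2
print(len(packed), len(packed[0]))  # 4 tokens of dim 8
```

Both variants cut the token count seen by the LLM by 4x; the concatenation variant trades that for a 4x wider projection into the language model's embedding space.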