DeepStack: DeeplyStackingVisualTokens isSurprisinglySimpleandEffectiveforLMMs
–Neural Information Processing Systems
This inevitably introduces a tremendous memory andcompute overheadintotheLLMs, whichisparticularly significant when it comes to high-resolution images and multi-frame videos. Several previous works attempt to mitigate this issue by proposing various token compression strategies. A straightforward way is to reduce the number of tokens with spatial grouping [70, 47]. Instead of pooling vision tokens, a few work instead to concatenate local tokens along the feature dimension to preserve visual information [11, 48]. Moreover, other works seek more sophisticated token resampling, such as Q-Former [43], Perceiver [4]and Abstractor [8],etc.
Neural Information Processing Systems
Feb-9-2026, 23:44:47 GMT
- Country:
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- Technology: