DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs

Neural Information Processing Systems 

This inevitably introduces a tremendous memory and compute overhead into the LLMs, which is particularly significant for high-resolution images and multi-frame videos. Several previous works attempt to mitigate this issue by proposing various token compression strategies. A straightforward way is to reduce the number of tokens with spatial grouping [70, 47]. Instead of pooling vision tokens, a few works concatenate local tokens along the feature dimension to preserve visual information [11, 48]. Moreover, other works seek more sophisticated token resampling, such as Q-Former [43], Perceiver [4], Abstractor [8], etc.
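To make the first two compression strategies concrete, the following is a minimal sketch (not from the paper; the function names and toy grid are illustrative) contrasting 2x2 spatial pooling, which discards detail, with 2x2 feature-dimension concatenation, which keeps all values but widens each token:

```python
# Hedged sketch of two token-compression strategies on a toy
# H x W grid of C-dimensional vision tokens (row-major list).

def avg_pool_2x2(tokens, h, w):
    """Spatial grouping: average each 2x2 window -> 4x fewer tokens,
    same feature dimension (information is discarded)."""
    c = len(tokens[0])
    out = []
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            group = [tokens[(i + di) * w + (j + dj)]
                     for di in (0, 1) for dj in (0, 1)]
            out.append([sum(t[k] for t in group) / 4 for k in range(c)])
    return out

def concat_2x2(tokens, h, w):
    """Concatenate each 2x2 window along the feature dimension ->
    4x fewer tokens but 4x wider, so no values are discarded."""
    out = []
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            merged = []
            for di in (0, 1):
                for dj in (0, 1):
                    merged.extend(tokens[(i + di) * w + (j + dj)])
            out.append(merged)
    return out

# Toy 4x4 grid of 2-dim tokens.
h, w = 4, 4
tokens = [[float(n), float(n)] for n in range(h * w)]
pooled = avg_pool_2x2(tokens, h, w)
packed = concat_2x2(tokens, h, w)
print(len(pooled), len(pooled[0]))  # 4 tokens of dim 2
print(len(packed), len(packed[0]))  # 4 tokens of dim 8
```

Both variants cut the token count seen by the LLM by 4x; the concatenation variant trades that for a 4x wider projection into the language model's embedding space.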