DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs
–Neural Information Processing Systems
Most large multimodal models (LMMs) are implemented by feeding visual tokens as a sequence into the first layer of a large language model (LLM). The resulting architecture is simple but significantly increases computation and memory costs, as it has to handle a large number of additional tokens in its input layer. This paper presents a new architecture *DeepStack* for LMMs. Considering N layers in the language and vision transformer of LMMs, we stack the visual tokens into N groups and feed each group to its aligned transformer layer from bottom to top. Surprisingly, this simple method greatly enhances the power of LMMs to model interactions among visual tokens across layers but with minimal additional cost.
Neural Information Processing Systems
May-26-2025, 19:33:33 GMT
- Technology: