DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs

May-26-2025, 19:33:33 GMT–Neural Information Processing Systems

Most large multimodal models (LMMs) are implemented by feeding visual tokens as a sequence into the first layer of a large language model (LLM). The resulting architecture is simple but significantly increases computation and memory costs, as it has to handle a large number of additional tokens in its input layer. This paper presents a new architecture *DeepStack* for LMMs. Considering N layers in the language and vision transformer of LMMs, we stack the visual tokens into N groups and feed each group to its aligned transformer layer from bottom to top. Surprisingly, this simple method greatly enhances the power of LMMs to model interactions among visual tokens across layers but with minimal additional cost.

artificial intelligence, large language model, natural language, (7 more...)

Neural Information Processing Systems

May-26-2025, 19:33:33 GMT

Conferences Web Page

Add feedback

Technology:
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.61)