Token Sequence Compression for Efficient Multimodal Computing
Yasmine Omri, Parth Shroff, Thierry Tambe
arXiv.org Artificial Intelligence
The exponential growth of Large Multimodal Models (LMMs) has driven advances in cross-modal reasoning, but at significant computational cost. In this work, we focus on vision-language models. We highlight the redundancy and inefficiency in current vision encoders, and seek to construct an adaptive compression method for multimodal data. We characterize a panoply of visual token selection and merging approaches through both benchmarking and qualitative analysis. In particular, we demonstrate that simple cluster-level token aggregation outperforms prior state-of-the-art work on token selection and merging, including merging at the vision-encoder level and attention-based approaches. We underline the redundancy in current vision encoders, and shed light on several puzzling trends regarding principles of visual token selection through cross-modal attention visualizations. This work is a first effort toward more effective encoding and processing of high-dimensional data, and paves the way for more scalable and sustainable multimodal systems.
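The abstract does not specify how cluster-level token aggregation is implemented. As a purely illustrative sketch (the function name, k-means choice, and all parameters are assumptions, not the paper's method), one could merge a vision encoder's output tokens by clustering their embeddings and keeping only the cluster centroids:

```python
import numpy as np

def cluster_merge_tokens(tokens, num_clusters, iters=10, seed=0):
    """Illustrative cluster-level token aggregation (hypothetical sketch).

    tokens: (N, D) array of visual token embeddings from a vision encoder.
    Returns a compressed (num_clusters, D) array: one merged token per cluster,
    computed with a plain k-means loop (not the paper's exact procedure).
    """
    rng = np.random.default_rng(seed)
    # Initialize centroids from randomly chosen tokens.
    init = rng.choice(len(tokens), size=num_clusters, replace=False)
    centroids = tokens[init].copy()
    for _ in range(iters):
        # Assign each token to its nearest centroid (squared L2 distance).
        dists = ((tokens[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(axis=1)
        # Replace each centroid with the mean of its assigned tokens.
        for k in range(num_clusters):
            members = tokens[assign == k]
            if len(members) > 0:
                centroids[k] = members.mean(axis=0)
    return centroids

# Example: compress 576 patch tokens (a common ViT grid size) down to 64.
patch_tokens = np.random.default_rng(1).normal(size=(576, 32)).astype(np.float32)
merged = cluster_merge_tokens(patch_tokens, num_clusters=64)
```

The merged tokens would then replace the full token sequence as input to the language model, shrinking the sequence length by roughly 9x in this example.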
Apr-28-2025