Token Sequence Compression for Efficient Multimodal Computing

Omri, Yasmine; Shroff, Parth; Tambe, Thierry

Apr-28-2025–arXiv.org Artificial Intelligence

The exponential growth of Large Multimodal Models (LMMs) has driven advancements in cross-modal reasoning, but at significant computational cost. In this work, we focus on visual language models. We highlight the redundancy and inefficiency in current vision encoders, and seek to construct an adaptive compression method for multimodal data. We characterize a panoply of visual token selection and merging approaches through both benchmarking and qualitative analysis. In particular, we demonstrate that simple cluster-level token aggregation outperforms prior state-of-the-art works in token selection and merging, including merging at the vision encoder level and attention-based approaches. We underline the redundancy in current vision encoders, and shed light on several puzzling trends regarding principles of visual token selection through cross-modal attention visualizations. This work is a first effort towards more effective encoding and processing of high-dimensional data, and paves the way for more scalable and sustainable multimodal systems.
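
To make the idea of cluster-level token aggregation concrete, the following is a minimal sketch and not the authors' implementation. It assumes plain k-means over the visual token embeddings, followed by averaging the tokens in each cluster; the abstract does not specify the clustering algorithm, so that choice, the function name cluster_aggregate, and its parameters are all hypothetical.

```python
import numpy as np


def cluster_aggregate(tokens, num_clusters, num_iters=10, seed=0):
    """Merge (N, D) visual tokens into at most `num_clusters` mean tokens.

    Assumption: plain k-means with Euclidean distance. This is an
    illustrative stand-in, not the method described in the paper.
    """
    tokens = np.asarray(tokens, dtype=float)
    rng = np.random.default_rng(seed)
    n = tokens.shape[0]
    k = min(num_clusters, n)

    def assign(centroids):
        # Squared Euclidean distance from every token to every centroid: (N, k).
        diffs = tokens[:, None, :] - centroids[None, :, :]
        return (diffs ** 2).sum(axis=-1).argmin(axis=1)

    # Initialise centroids from k distinct tokens.
    centroids = tokens[rng.choice(n, size=k, replace=False)].copy()
    for _ in range(num_iters):
        labels = assign(centroids)
        for j in range(k):
            members = tokens[labels == j]
            if len(members) > 0:  # leave empty clusters where they are
                centroids[j] = members.mean(axis=0)

    # Final assignment, then one aggregated token per non-empty cluster.
    labels = assign(centroids)
    kept = np.unique(labels)
    merged = np.stack([tokens[labels == j].mean(axis=0) for j in kept])
    return merged, labels


# Example: compress 576 patch tokens of width 64 down to at most 64 tokens.
patch_tokens = np.random.default_rng(1).normal(size=(576, 64))
compressed, assignment = cluster_aggregate(patch_tokens, num_clusters=64)
print(patch_tokens.shape, "->", compressed.shape)
```

In this sketch the compressed sequence would replace the full set of visual tokens before they reach the language model, so the sequence length falls from N to at most num_clusters. Where in the pipeline the aggregation is applied is likewise an assumption here.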

large language model, machine learning, natural language


Genre:
- Research Report (0.64)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning > Statistical Learning (0.47)
  - Natural Language
    - Text Processing (0.46)
    - Large Language Model (0.30)
