Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and the Empirical Findings

Neural Information Processing Systems 

In this paper, we study the visual redundancy problem of multimodal large language models (MLLMs) from the perspective of attention behaviors. Through extensive empirical experiments, we identify three main inference stages of MLLMs: (i) Early fusion between tokens is accomplished first and quickly.