FrameFusion: Combining Similarity and Importance for Video Token Reduction on Large Visual Language Models
