FrameFusion: Combining Similarity and Importance for Video Token Reduction on Large Visual Language Models