Integrating Video and Text: A Balanced Approach to Multimodal Summary Generation and Evaluation