Exploring Efficient Foundational Multi-modal Models for Video Summarization