Exploring the Role of Explicit Temporal Modeling in Multimodal Large Language Models for Video Understanding