Apollo: An Exploration of Video Understanding in Large Multimodal Models

Orr Zohar, Xiaohan Wang, Yann Dubois, Nikhil Mehta, Tong Xiao, Philippe Hansen-Estruch, Licheng Yu, Xiaofang Wang, Felix Juefei-Xu, Ning Zhang, Serena Yeung-Levy, Xide Xia

arXiv.org Artificial Intelligence 

Despite the rapid advancements in language and image-language modeling (Hoffmann et al., 2022; Brown, 2020; Yang et al., 2024; Liu et al., 2024a; Alayrac et al., 2022; Laurençon et al., 2024a; OpenAI, 2024), the development of video Large Multimodal Models (video-LMMs) has not kept pace. Videos provide a rich, dynamic source of information, capturing nuanced temporal and spatial features beyond the reach of static images. However, video-LMMs remain under-explored, hampered by unique challenges: notably higher computational demands and a broader, more complex design space than their image-based counterparts (Li et al., 2023a, 2025; Liu et al., 2024d; Li et al., 2024b; Xu et al., 2024a). Many fundamental questions about video-LMM design remain unanswered: How should videos be sampled? Which vision encoders yield optimal representations? What are the best practices for resampling video tokens? Early approaches primarily extended image-LMMs, either directly (Xu et al., 2024b; Kim et al., 2024; Wu, 2024; Zhang et al., 2024e) or through video-specific fine-tuning (Li et al., 2023a; Zhang et al., 2023; Maaz et al., 2023). More recent methods have introduced diverse design choices, such as longer context windows (Zhang et al., 2024e), multi-modality mixing (Li et al., 2024a,c), agent workflows (Wang et al., 2024c), and self-training (Zohar et al., 2024), among others. Despite these efforts, the impact of such design decisions on video-LMM performance remains poorly understood.
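To make the frame-sampling question above concrete, the sketch below contrasts the two strategies most commonly discussed in the video-LMM literature: selecting a fixed number of uniformly spaced frames versus sampling at a fixed temporal rate (fps sampling). This is an illustrative sketch only; the function names and the 60-second example clip are assumptions for exposition and do not describe Apollo's actual implementation.

```python
import numpy as np


def uniform_sample_indices(num_frames: int, num_samples: int) -> np.ndarray:
    """Pick a fixed number of frame indices, evenly spaced across the clip.

    The token budget stays constant regardless of video length, so longer
    videos are sampled more sparsely in time.
    """
    return np.linspace(0, num_frames - 1, num_samples).round().astype(int)


def fps_sample_indices(num_frames: int, native_fps: float, target_fps: float) -> np.ndarray:
    """Pick frame indices at a fixed temporal rate (frames per second).

    Temporal density stays constant, but the number of sampled frames
    (and hence vision tokens fed to the LMM) grows with video duration.
    """
    step = native_fps / target_fps
    return np.arange(0, num_frames, step).round().astype(int)


if __name__ == "__main__":
    # Hypothetical 60-second clip at 30 fps -> 1800 frames.
    n_frames, fps = 1800, 30.0
    print(uniform_sample_indices(n_frames, 8))     # 8 frames, evenly spaced
    print(fps_sample_indices(n_frames, fps, 0.5))  # one frame every 2 seconds
```

The trade-off the sketch highlights is the one the design question turns on: uniform sampling bounds compute but loses temporal resolution on long videos, while fps sampling preserves temporal density at the cost of a token count that scales with duration.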