A Culturally-diverse Multilingual Multimodal Video Benchmark & Model