Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation