Multimodal Cinematic Video Synthesis Using Text-to-Image and Audio Generation Models