STIV: Scalable Text and Image Conditioned Video Generation