Rethinking Video Tokenization: A Conditioned Diffusion-based Approach