CV-VAE: A Compatible Video VAE for Latent Generative Video Models
Neural Information Processing Systems
Spatio-temporal compression of videos, using networks such as variational autoencoders (VAEs), plays a crucial role in OpenAI's Sora and numerous other latent generative video models. For instance, many LLM-like video models learn the distribution of discrete tokens derived from 3D VAEs within the VQ-VAE framework, while most diffusion-based video models capture the distribution of continuous latents extracted by 2D VAEs without quantization. In the latter case, temporal compression is simply realized by uniform frame sampling, which results in unsmooth motion between consecutive frames. Currently, the research community lacks a commonly used continuous video (3D) VAE for latent diffusion-based video models. Moreover, since current diffusion-based approaches are often built on pre-trained text-to-image (T2I) models, directly training a video VAE without considering compatibility with existing T2I models leaves a latent space gap between them; bridging this gap demands substantial computational resources for training, even when the T2I models are used as initialization.
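To make the contrast concrete, below is a minimal, runnable sketch (an illustration under assumed tensor shapes, not the paper's implementation) of the two temporal-compression routes the abstract describes: per-frame 2D encoding after uniform frame sampling versus joint spatio-temporal 3D encoding. The average-pooling "encoders" are hypothetical stand-ins for learned VAE encoders.

```python
# Hedged sketch of the two temporal-compression routes. The pooling
# "encoders" below are placeholders for learned 2D/3D VAE encoders.
import numpy as np

def encode_frame_2d(frame: np.ndarray, s: int = 8) -> np.ndarray:
    # Stand-in for a 2D VAE encoder: s x s spatial block pooling.
    h, w, c = frame.shape
    return (frame[: h - h % s, : w - w % s]
            .reshape(h // s, s, w // s, s, c)
            .mean(axis=(1, 3)))

def two_d_vae_route(video: np.ndarray, stride: int = 4) -> np.ndarray:
    # 2D-VAE route: temporal compression via uniform frame sampling,
    # then independent per-frame encoding. Dropped frames contribute
    # nothing to the latents, the source of the unsmooth motion above.
    return np.stack([encode_frame_2d(f) for f in video[::stride]])

def three_d_vae_route(video: np.ndarray, stride: int = 4) -> np.ndarray:
    # 3D-VAE route: the encoder also pools along time, so every input
    # frame contributes to the compressed latent sequence.
    t = video.shape[0] - video.shape[0] % stride
    grouped = (video[:t]
               .reshape(t // stride, stride, *video.shape[1:])
               .mean(axis=1))
    return np.stack([encode_frame_2d(f) for f in grouped])

video = np.random.rand(16, 64, 64, 3).astype(np.float32)  # (T, H, W, C)
print(two_d_vae_route(video).shape)    # (4, 8, 8, 3)
print(three_d_vae_route(video).shape)  # (4, 8, 8, 3)
```

Both routes yield latents of the same shape, but only the 3D route lets every input frame influence the latent sequence; producing such 3D latents while staying compatible with an existing T2I latent space is the compatibility problem the paper addresses.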