Vision Foundation Models Can Be Good Tokenizers for Latent Diffusion Models

Open in new window