Image Understanding Makes for A Good Tokenizer for Image Generation — Luting Wang, Yang Zhao

Neural Information Processing Systems

Modern image generation (IG) models have been shown to capture rich semantics valuable for image understanding (IU) tasks. However, the potential of IU models to improve IG performance remains uncharted. We address this issue using a token-based IG framework, which relies on effective tokenizers to map images into token sequences. Currently, pixel reconstruction (e.g., VQGAN) dominates the training objective for tokenizers. In contrast, our approach adopts the feature reconstruction objective, where tokenizers are trained by distilling knowledge from pretrained IU encoders. Comprehensive comparisons indicate that tokenizers with strong IU capabilities achieve superior IG performance across a variety of metrics, datasets, tasks, and proposal networks.
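To make the feature reconstruction objective concrete, below is a minimal sketch of one training step for a VQ-style tokenizer distilled from a frozen pretrained IU encoder. The module names (`VQ-style tokenizer`, `iu_encoder`) and the exact loss composition are illustrative assumptions, not the paper's implementation; the point is only the contrast with VQGAN's pixel-reconstruction target.

```python
# Sketch: feature-reconstruction (distillation) step for a VQ-style tokenizer.
# The frozen IU encoder acts as a teacher; the tokenizer's decoder reconstructs
# its features rather than pixels. Names and loss weights are assumptions.
import torch
import torch.nn.functional as F

def feature_reconstruction_step(tokenizer, iu_encoder, images, optimizer):
    with torch.no_grad():
        target_feats = iu_encoder(images)      # teacher features (e.g., ViT patch tokens)

    recon_feats, vq_loss = tokenizer(images)   # encode -> quantize -> decode to feature space
    distill_loss = F.mse_loss(recon_feats, target_feats)

    loss = distill_loss + vq_loss              # vq_loss: codebook / commitment terms
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Under this view, swapping the reconstruction target from pixels to IU features is the only change to a standard VQ tokenizer training loop; the downstream token-based generator is trained on the resulting token sequences as usual.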



Extending Video Masked Autoencoders to 128 frames

Neural Information Processing Systems

Video understanding has witnessed significant progress, with recent video foundation models demonstrating strong performance owing to self-supervised pre-training objectives, Masked Autoencoders (MAE) being the design of choice.
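For readers unfamiliar with the MAE objective referenced here, the following is a minimal sketch of one masked-autoencoding step over flattened video patch tokens. The `encoder` and `decoder` callables, the masking ratio, and the tensor shapes are illustrative assumptions, not the paper's architecture.

```python
# Sketch: masked-autoencoder (MAE) pre-training step on video patch tokens.
# A high fraction of patches is hidden; only visible patches are encoded, and
# the loss is computed on reconstructions of the masked patches.
import torch
import torch.nn.functional as F

def mae_step(encoder, decoder, patches, mask_ratio=0.9):
    # patches: (B, N, D) flattened spatio-temporal video patches (assumed shape)
    B, N, D = patches.shape
    num_keep = int(N * (1 - mask_ratio))

    # Randomly select a small subset of visible patches per sample.
    perm = torch.rand(B, N, device=patches.device).argsort(dim=1)
    keep_idx = perm[:, :num_keep]
    visible = torch.gather(patches, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))

    latent = encoder(visible)               # encode visible patches only
    recon = decoder(latent, keep_idx, N)    # hypothetical decoder fills masked positions

    # Reconstruction loss restricted to masked positions.
    mask = torch.ones(B, N, dtype=torch.bool, device=patches.device)
    mask.scatter_(1, keep_idx, False)
    return F.mse_loss(recon[mask], patches[mask])
```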