Vision Foundation Models as Effective Visual Tokenizers for Autoregressive Generation
–Neural Information Processing Systems
In this work, we present a novel direction to build an image tokenizer directly on top of a frozen vision foundation model, which is a largely underexplored area. Specifically, we employ a frozen vision foundation model as the encoder of our tokenizer. To enhance its effectiveness, we introduce two key components: (1) a region-adaptive quantization framework that reduces redundancy in the pre-trained features on regular 2D grids, and (2) a semantic reconstruction objective that aligns the tokenizer's outputs with the foundation model's representations to preserve semantic fidelity. Based on these designs, our proposed image tokenizer, \textbf{\ours}, achieves substantial improvements in image reconstruction and generation quality, while also enhancing token efficiency.
Neural Information Processing Systems
Jun-12-2026, 08:19:47 GMT
- Technology: