Vision Foundation Models as Effective Visual Tokenizers for Autoregressive Generation

Jun-12-2026, 08:19:47 GMT–Neural Information Processing Systems

In this work, we present a novel direction to build an image tokenizer directly on top of a frozen vision foundation model, which is a largely underexplored area. Specifically, we employ a frozen vision foundation model as the encoder of our tokenizer. To enhance its effectiveness, we introduce two key components: (1) a region-adaptive quantization framework that reduces redundancy in the pre-trained features on regular 2D grids, and (2) a semantic reconstruction objective that aligns the tokenizer's outputs with the foundation model's representations to preserve semantic fidelity. Based on these designs, our proposed image tokenizer, \textbf{\ours}, achieves substantial improvements in image reconstruction and generation quality, while also enhancing token efficiency.

artificial intelligence, name change, proceedings, (8 more...)

Neural Information Processing Systems

Jun-12-2026, 08:19:47 GMT

Conferences Web Page

Add feedback

Technology:
- Information Technology > Artificial Intelligence (0.40)