AToken: A Unified Tokenizer for Vision
Jiasen Lu, Liangchen Song, Mingze Xu, Byeongjoo Ahn, Yanjun Wang, Chen Chen, Afshin Dehghan, Yinfei Yang
arXiv.org Artificial Intelligence
Specifically, we introduce a pure transformer architecture with 4D rotary position embeddings to process visual inputs of arbitrary resolutions and temporal durations. To ensure stable training, we introduce an adversarial-free training objective that combines perceptual and Gram matrix losses, achieving state-of-the-art reconstruction quality. These results shed light on the next-generation multimodal AI systems built upon unified visual tokenization.

Large Language Models (LLMs) (Chowdhery et al., 2023; Achiam et al., 2023; Touvron et al., 2023; Team et al., 2023; Guo et al., 2025) have achieved unprecedented generalization, with single models handling coding, reasoning, translation, and numerous other tasks that previously required specialized systems. This versatility largely stems from transformer architectures and simple tokenizers, such as BPE (Sennrich et al., 2015), which convert all text types (code, documents, tables, and multiple languages) into a unified token space. This shared representation enables efficient scaling and seamless knowledge transfer across language tasks.

In contrast, visual representations remain fragmented due to inherent complexities. Unlike text's discrete symbolic nature, visual tasks demand distinct levels of abstraction: generation requires tokenizers that preserve low-level visual details for reconstruction, while understanding requires encoders that extract high-level semantic features through text alignment. Moreover, visual data exists in disparate formats: 2D grids for images, temporal sequences for videos, and varied 3D representations (e.g., meshes, voxels, and Gaussian splats) (Mescheder et al., 2019; Achlioptas et al., 2018; Mildenhall et al., 2021; Kerbl et al., 2023). Without a shared representation, vision systems remain fundamentally limited, unable to achieve the generalization and transfer learning that characterizes modern language models.
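The idea behind a 4D rotary position embedding can be sketched by splitting each token's channels into four even chunks and applying a standard 1D rotary rotation per chunk, one for each coordinate axis. This is a generic illustration in numpy, not the paper's exact design: the axis assignment (e.g., time plus three spatial coordinates), the even channel split, and the frequency base are assumptions.

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Apply a standard 1D rotary embedding along the last dim of x.

    x:   (..., d) features, d even
    pos: (...)    scalar position per token along one axis
    """
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)          # (d/2,) geometric frequencies
    ang = pos[..., None] * freqs                       # (..., d/2) rotation angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]                # rotate channel pairs
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_4d(x, coords):
    """Multi-axis rotary embedding: one rotary rotation per coordinate axis.

    x:      (n, d) token features, d divisible by 8 (4 axes x even chunks)
    coords: (n, 4) per-token positions, e.g. (t, z, y, x) -- an assumed layout
    """
    n, d = x.shape
    chunk = d // 4
    parts = [rope_1d(x[:, a * chunk:(a + 1) * chunk], coords[:, a])
             for a in range(4)]
    return np.concatenate(parts, axis=-1)
```

Because each chunk undergoes a pure rotation, the embedding preserves token norms, and attention scores between rotated queries and keys depend only on relative offsets along each axis.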
Despite recent progress, unified visual tokenizers face three fundamental challenges. First, existing approaches optimize for either reconstruction or understanding, but not both: visual encoders (Radford et al., 2021; Zhai et al., 2023; Bolya et al., 2025) achieve semantic alignment but lack the low-level detail needed for faithful reconstruction. Second, architectural choices create different limitations: convolutional tokenizers exhibit diminishing returns when scaling model parameters (Xiong et al., 2025), while transformer tokenizers (Yu et al., 2021; Wang et al., 2024b; Hansen-Estruch et al., 2025) achieve better scaling but suffer from severe adversarial training instabilities.

(A description of each author's contribution is available in Appendix A. Correspondence to Jiasen Lu.)
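As context for the adversarial-free objective mentioned above, a Gram-matrix loss matches second-order statistics of feature maps between a reconstruction and its target, avoiding the instabilities of a GAN discriminator. The sketch below is a generic formulation in numpy over a single (C, H, W) feature map; the paper's exact feature extractor, normalization, and weighting are not specified here and are assumptions.

```python
import numpy as np

def gram_matrix(feats):
    """Channel-by-channel correlation of a (C, H, W) feature map."""
    c, h, w = feats.shape
    f = feats.reshape(c, h * w)
    return (f @ f.T) / (c * h * w)       # (C, C), normalized by total elements

def gram_loss(feats_recon, feats_target):
    """Mean squared difference between the two Gram matrices."""
    g_r = gram_matrix(feats_recon)
    g_t = gram_matrix(feats_target)
    return float(np.mean((g_r - g_t) ** 2))
```

In practice such a loss is computed on features from several layers of a frozen perceptual network and summed with a pixel or perceptual term; the Gram term penalizes texture-statistics mismatches that per-pixel losses miss.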
Sep-22-2025