Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations
–Neural Information Processing Systems
This paper presents a multimodal framework that attempts to unify visual understanding and generation within a shared discrete semantic representation. At its core is the Text-Aligned Tokenizer (TA-Tok), which converts images into discrete tokens using a text-aligned codebook projected from a large language model's (LLM) vocabulary.
Neural Information Processing Systems
Jun-14-2026, 06:40:57 GMT
- Technology: