Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations

Jun-14-2026, 06:40:57 GMT–Neural Information Processing Systems

This paper presents a multimodal framework that attempts to unify visual understanding and generation within a shared discrete semantic representation. At its core is the Text-Aligned Tokenizer (TA-Tok), which converts images into discrete tokens using a text-aligned codebook projected from a large language model's (LLM) vocabulary.

large language model, natural language, proceedings, (4 more...)

Neural Information Processing Systems

Jun-14-2026, 06:40:57 GMT

Conferences Web Page

Add feedback

Technology:
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.66)