MASC: Boosting Autoregressive Image Generation with a Manifold-Aligned Semantic Clustering
He, Lixuan, Zheng, Shikang, Zhang, Linfeng
–arXiv.org Artificial Intelligence
Autoregressive (AR) models have shown great promise in image generation, yet they face a fundamental inefficiency stemming from their core component: a vast, unstructured vocabulary of visual tokens. This conventional approach treats tokens as a flat vocabulary, disregarding the intrinsic structure of the token embedding space where proximity often correlates with semantic similarity. This oversight results in a highly complex prediction task, which hinders training efficiency and limits final generation quality. To resolve this, we propose Manifold-Aligned Semantic Clustering (MASC), a principled framework that constructs a hierarchical semantic tree directly from the codebook's intrinsic structure. MASC employs a novel geometry-aware distance metric and a density-driven agglomerative construction to model the underlying manifold of the token embeddings. By transforming the flat, high-dimensional prediction task into a structured, hierarchical one, MASC introduces a beneficial inductive bias that significantly simplifies the learning problem for the AR model. MASC is designed as a plug-and-play module, and our extensive experiments validate its effectiveness: it accelerates training by up to 57% and significantly improves generation quality, reducing the FID of LlamaGen-XL from 2.87 to 2.58. MASC elevates existing AR frameworks to be highly competitive with state-of-the-art methods, establishing that structuring the prediction space is as crucial as architectural innovation for scalable generative modeling.
arXiv.org Artificial Intelligence
Oct-7-2025
- Country:
- Asia
- China > Shanghai
- Shanghai (0.04)
- Middle East > Israel
- Jerusalem District > Jerusalem (0.04)
- China > Shanghai
- Asia
- Genre:
- Research Report > Promising Solution (0.48)
- Industry:
- Education (0.34)
- Technology:
- Information Technology > Artificial Intelligence
- Machine Learning
- Neural Networks (1.00)
- Statistical Learning > Clustering (0.94)
- Natural Language (1.00)
- Representation & Reasoning (1.00)
- Vision (1.00)
- Machine Learning
- Information Technology > Artificial Intelligence