HOFAR: High-Order Augmentation of Flow Autoregressive Transformers

Yingyu Liang, Zhizhou Sha, Zhenmei Shi, Zhao Song, Mingda Wan

arXiv.org Artificial Intelligence 

Several works have explored extending these models to generate images with an additional dimension, such as incorporating a temporal dimension for video generation [SPH+22, LCW+23] or a 3D spatial dimension for 3D object generation [XXMPM24, Mo24]. Even 4D generation [ZCW+25, LYX+24] has become feasible with diffusion models. Another prominent line of research focuses on auto-regressive models, where the Transformer framework has achieved groundbreaking success in natural language processing. Models such as GPT-4 [AAA+23], Gemini 2 [Dee24], and DeepSeek [GYZ+25] have significantly impacted millions of users worldwide. Given the success of the auto-regressive generation paradigm and the Transformer framework, recent works have explored bringing auto-regressive generation to image synthesis. A representative example is the Visual Auto-Regressive (VAR) model [TJY+25], which generates images hierarchically by predicting token maps at progressively finer scales, conditioning each scale on the coarser ones already generated.
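
To make the coarse-to-fine idea concrete, the following is a minimal illustrative sketch, not the authors' or VAR's actual implementation: a random stub stands in for the transformer, and all names (predict_next_scale, SCALES, VOCAB) are hypothetical. Each scale's token map is produced in one step, conditioned on the flattened tokens of all coarser scales, so the autoregression runs over scales rather than over individual patches.

    # Hypothetical sketch of next-scale autoregressive generation (not the paper's code).
    import numpy as np

    SCALES = [1, 2, 4, 8]   # hypothetical side lengths of the token maps, coarse to fine
    VOCAB = 4096            # hypothetical codebook size of the image tokenizer

    def predict_next_scale(context_tokens: np.ndarray, side: int) -> np.ndarray:
        """Stand-in for a transformer call: given the flattened tokens of all
        coarser scales, sample a side x side token map for the next scale."""
        rng = np.random.default_rng(len(context_tokens))        # deterministic stub
        return rng.integers(0, VOCAB, size=(side, side))

    def generate_image_tokens() -> list:
        """Coarse-to-fine autoregression: each scale is generated in one step,
        conditioned on the concatenation of all earlier (coarser) token maps."""
        maps = []
        context = np.empty(0, dtype=np.int64)
        for side in SCALES:
            next_map = predict_next_scale(context, side)
            maps.append(next_map)
            context = np.concatenate([context, next_map.ravel()])  # grow the prefix
        return maps   # the finest map would then be decoded to pixels by the tokenizer

    if __name__ == "__main__":
        token_maps = generate_image_tokens()
        print([m.shape for m in token_maps])   # [(1, 1), (2, 2), (4, 4), (8, 8)]

In a real system the stub would be a Transformer over the token prefix and the finest token map would be passed through the tokenizer's decoder; the sketch only shows how the autoregressive prefix grows scale by scale.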