High-Resolution Image Synthesis via Next-Token Prediction

Chen, Dengsheng; Hu, Jie; Yue, Tiezhu; Wei, Xiaoming

Nov-22-2024–arXiv.org Artificial Intelligence

Denoising with a Joint-Embedding Predictive Architecture (D-JEPA), an autoregressive model, has demonstrated outstanding performance in class-conditional image generation. However, the application of next-token prediction to high-resolution text-to-image generation remains underexplored. In this paper, we introduce D-JEPA·T2I, an extension of D-JEPA that incorporates a flow matching loss and is designed to enable data-efficient continuous resolution learning. D-JEPA·T2I leverages a multimodal visual transformer to integrate textual and visual features and adopts Visual Rotary Positional Embedding (VoPE) to facilitate continuous resolution learning. Furthermore, we devise a data feedback mechanism that significantly enhances data utilization efficiency. For the first time, we achieve state-of-the-art high-resolution image synthesis via next-token prediction. The experimental code and pretrained models will be open-sourced at https://d-jepa.github.io/t2i.
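
The abstract names a flow matching loss but does not spell it out. As a rough illustration only, the sketch below shows one common formulation: a straight-line path between Gaussian noise and a data token, with a mean-squared error on the predicted velocity. The linear path, the function names (flow_matching_pair, flow_matching_loss), and the predict_velocity callback are assumptions made for this example; they are not taken from the paper and may differ from what D-JEPA·T2I actually uses.

```python
import random


def flow_matching_pair(x1, rng):
    """Build one training example for a linear-path flow matching loss.

    x1  : list of floats, a clean data token (hypothetical latent patch).
    rng : random.Random instance, so the sampling is reproducible.
    Returns (x_t, t, target_velocity).
    """
    t = rng.random()                          # time sampled in [0, 1)
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]    # Gaussian noise sample
    # Assumed path: linear interpolation from noise (t=0) to data (t=1).
    x_t = [(1.0 - t) * n + t * d for n, d in zip(x0, x1)]
    # Along a straight line the velocity is constant: dx_t/dt = x1 - x0.
    target = [d - n for n, d in zip(x0, x1)]
    return x_t, t, target


def flow_matching_loss(predict_velocity, x1, rng):
    """Mean squared error between predicted and target velocity.

    predict_velocity is a hypothetical callable (x_t, t) -> list of floats
    that stands in for the network being trained.
    """
    x_t, t, target = flow_matching_pair(x1, rng)
    pred = predict_velocity(x_t, t)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(target)
```

In an autoregressive setting such a loss would typically be evaluated per token, with the velocity predictor also conditioned on features of the previously generated tokens and the text prompt; that conditioning is omitted here to keep the sketch short.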

