AITopics | visual generation

visual generation

Co-Reinforcement Learning for Unified Multimodal Understanding and Generation

Neural Information Processing Systems | Jun-14-2026, 13:02:59 GMT

This paper presents a pioneering exploration of reinforcement learning (RL) via group relative policy optimization for unified multimodal large language models (ULMs), aimed at simultaneously reinforcing generation and understanding capabilities. Through systematic pilot studies, we uncover the significant potential of ULMs to enable the synergistic co-evolution of dual capabilities within a shared policy optimization framework. Building on this insight, we introduce CoRL, a Co-Reinforcement Learning framework comprising a unified RL stage for joint optimization and a refined RL stage for task-specific enhancement. With the proposed CoRL, our resulting model, ULM-R1, achieves average improvements of 7% on three text-to-image generation datasets and 23% on nine multimodal understanding benchmarks. These results demonstrate the effectiveness of CoRL and highlight the substantial benefits of reinforcement learning in facilitating cross-task synergy and optimization for ULMs. Code is available at https://github.com/mm-vl/ULM-R1.
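The abstract above builds on group relative policy optimization (GRPO). As a rough illustration only, the following is a minimal sketch of the group-relative advantage as GRPO is commonly described: rewards for several samples drawn from the same prompt are normalized against their own group's mean and standard deviation. It is not taken from the CoRL code; the function name, the `eps` constant, the use of the population standard deviation, and the example rewards are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt.

    Hypothetical helper: each advantage is (reward - group mean) divided by
    the group's standard deviation, so no learned value function is needed.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; some codebases use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative rewards for four samples generated from one prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Samples that score above their group's mean receive positive advantages and the rest receive negative ones, which is what lets a single shared policy be updated from relative comparisons alone.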

large language model, machine learning, reinforcement learning, (22 more...)

Neural Information Processing Systems

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.91)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Selftok-Zero: Reinforcement Learning for Visual Generation via Discrete and Autoregressive Visual Tokens

Neural Information Processing Systems | Jun-13-2026, 19:07:23 GMT

Reinforcement learning (RL) has become an indispensable post-training step for unlocking the full potential of Large Language Models (LLMs). Its core motivation is to incentivize the model's inference trajectory via a reward model, effectively balancing the exploration-exploitation trade-off in scenarios where collecting exhaustive input-output ground-truth pairs is infeasible. This motivation naturally extends to visual generation, where perfect alignment between an image and a textual prompt is inherently ambiguous and often unattainable. However, existing visual generative models are not yet ready for RL due to the following two fundamental drawbacks that undermine the foundations of RL: 1) For diffusion-based models, the actual generation trajectories of sampled images cannot be reliably rewarded, as diffusion inversion is notoriously difficult.

large language model, natural language, proceedings, (5 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.59)

Janus-Pro-R1: Advancing Collaborative Visual Comprehension and Generation via Reinforcement Learning

Neural Information Processing Systems | Jun-12-2026, 06:38:03 GMT

Recent endeavors in Multimodal Large Language Models (MLLMs) aim to unify visual comprehension and generation. However, these two capabilities remain largely independent, as if they were two separate functions encapsulated within the same model. Consequently, visual comprehension does not enhance visual generation, and the reasoning mechanisms of LLMs have not been fully integrated to revolutionize image generation. In this paper, we propose to enable the collaborative co-evolution of visual comprehension and generation, advancing image generation into an iterative introspective process. We introduce a two-stage training approach: supervised fine-tuning teaches the MLLM the foundational ability to generate genuine CoT for visual generation, while reinforcement learning activates its full potential via an exploration-exploitation trade-off. Ultimately, we unlock the Aha moment in visual generation, advancing MLLMs from text-to-image tasks to unified image generation. Extensive experiments demonstrate that our model not only excels in text-to-image generation and image editing, but also functions as a superior image semantic evaluator with enhanced visual comprehension capabilities.

large language model, machine learning, natural language, (8 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.60)

OmniGen-AR: AutoRegressive Any-to-Image Generation

Neural Information Processing Systems | Jun-12-2026, 02:27:49 GMT

Autoregressive (AR) models have demonstrated strong potential in visual generation, offering competitive performance with simple architectures and optimization objectives. However, existing methods are typically limited to single-modality conditions, e.g., text or category labels, restricting their applicability in real-world scenarios that demand image synthesis from diverse forms of controls.

artificial intelligence, machine learning, proceedings, (6 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Vision (0.63)
Information Technology > Artificial Intelligence > Machine Learning (0.43)

OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation

Neural Information Processing Systems | Feb-10-2026, 19:51:18 GMT

Tokenizer, serving as a translator to map the intricate visual data into a compact latent space, lies at the core of visual generative models.

diffusion model, large language model, machine learning, (19 more...)

Neural Information Processing Systems

Country: Asia > China > Shanghai > Shanghai (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Sensing and Signal Processing > Image Processing (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)

SlotDiffusion: Object-Centric Generative Modeling with Diffusion Models

Neural Information Processing Systems | Dec-26-2025, 11:10:57 GMT

Object-centric learning aims to represent visual data with a set of object entities (a.k.a.

diffusion model, object-centric generative modeling, slotdiffusion, (3 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.62)

OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation

Neural Information Processing Systems | Dec-24-2025, 20:33:14 GMT

Tokenizer, serving as a translator to map the intricate visual data into a compact latent space, lies at the core of visual generative models. Based on the finding that existing tokenizers are tailored to either image or video inputs, this paper presents OmniTokenizer, a transformer-based tokenizer for joint image and video tokenization. OmniTokenizer is designed with a spatial-temporal decoupled architecture, which integrates window attention and causal attention for spatial and temporal modeling, respectively. To exploit the complementary nature of image and video data, we further propose a progressive training strategy, where OmniTokenizer is first trained on image data on a fixed resolution to develop the spatial encoding capacity and then jointly trained on image and video data on multiple resolutions to learn the temporal dynamics. OmniTokenizer, for the first time, handles both image and video inputs within a unified framework and proves the possibility of realizing their synergy. Extensive experiments demonstrate that OmniTokenizer achieves state-of-the-art (SOTA) reconstruction performance on various image and video datasets, e.g., 1.11 reconstruction FID on ImageNet and 42 reconstruction FVD on UCF-101, beating the previous SOTA methods by 13% and 26%, respectively. Additionally, we also show that when integrated with OmniTokenizer, both language model-based approaches and diffusion models can realize advanced visual synthesis performance, underscoring the superiority and versatility of our method.

artificial intelligence, machine learning, natural language, (8 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Natural Language (0.96)
Information Technology > Artificial Intelligence > Machine Learning (0.76)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.59)

Spanning Tree Autoregressive Visual Generation

Lee, Sangkyu, Lee, Changho, Han, Janghoon, Song, Hosung, You, Tackgeun, Lim, Hwasup, Choi, Stanley Jungkyu, Lee, Honglak, Yu, Youngjae

arXiv.org Artificial Intelligence | Nov-24-2025

We present Spanning Tree Autoregressive (STAR) modeling, which can incorporate prior knowledge of images, such as center bias and locality, to maintain sampling performance while also providing sufficiently flexible sequence orders to accommodate image editing at inference. Approaches that expose randomly permuted sequence orders to conventional autoregressive (AR) models in visual generation for bidirectional context either suffer from a decline in performance or compromise the flexibility in sequence order choice at inference. Instead, STAR utilizes traversal orders of uniform spanning trees sampled in a lattice defined by the positions of image patches. Traversal orders are obtained through breadth-first search, allowing us to efficiently construct a spanning tree whose traversal order ensures that the connected partial observation of the image appears as a prefix in the sequence through rejection sampling. Through a tailored yet structured randomized strategy, in contrast to random permutation, STAR preserves the capability of postfix completion while maintaining sampling performance without any significant changes to the model architecture widely adopted in language AR modeling.
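The idea of ordering image patches by a breadth-first traversal of a random spanning tree on the patch lattice can be sketched as follows. This is an illustrative sketch only, not the STAR implementation: the abstract does not name a sampler, so Wilson's algorithm is used here as one standard way to draw a uniform spanning tree, and the grid size, the center root, the shuffling of children, and all function names are assumptions. The rejection-sampling step mentioned in the abstract is omitted.

```python
import random
from collections import deque


def neighbors(cell, h, w):
    """Yield the 4-connected neighbors of a patch position inside an h x w grid."""
    r, c = cell
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < h and 0 <= nc < w:
            yield (nr, nc)


def sample_spanning_tree(h, w, root, rng):
    """Wilson's algorithm (an assumed sampler): loop-erased random walks."""
    tree = {(r, c): [] for r in range(h) for c in range(w)}  # adjacency lists
    in_tree = {root}
    for start in tree:
        # Walk until the tree is hit; overwriting nxt[cell] erases loops.
        nxt, cell = {}, start
        while cell not in in_tree:
            nxt[cell] = rng.choice(list(neighbors(cell, h, w)))
            cell = nxt[cell]
        # Retrace the loop-erased path and attach it to the tree.
        cell = start
        while cell not in in_tree:
            in_tree.add(cell)
            tree[cell].append(nxt[cell])
            tree[nxt[cell]].append(cell)
            cell = nxt[cell]
    return tree


def bfs_order(tree, root, rng):
    """Breadth-first traversal; every prefix is a connected set of patches."""
    order, seen, queue = [], {root}, deque([root])
    while queue:
        cell = queue.popleft()
        order.append(cell)
        children = [n for n in tree[cell] if n not in seen]
        rng.shuffle(children)  # assumption: random tie-breaking among children
        for child in children:
            seen.add(child)
            queue.append(child)
    return order


rng = random.Random(0)
h, w = 4, 4
root = (h // 2, w // 2)  # assumption: start near the center
tree = sample_spanning_tree(h, w, root, rng)
order = bfs_order(tree, root, rng)
```

Because each patch enters the sequence only after its parent in the tree, any prefix of `order` covers a connected region of the grid, which is the property the abstract relies on for treating a partial observation as a prefix.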

machine learning, natural language, sequence order, (17 more...)

arXiv.org Artificial Intelligence

arXiv:2511.17089

Genre: Research Report (0.64)

Technology:

Information Technology > Communications > Networks (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.96)

Thinking-while-Generating: Interleaving Textual Reasoning throughout Visual Generation

Guo, Ziyu, Zhang, Renrui, Li, Hongyu, Zhang, Manyuan, Chen, Xinyan, Wang, Sifan, Feng, Yan, Pei, Peng, Heng, Pheng-Ann

arXiv.org Artificial Intelligence | Nov-21-2025

Recent advances in visual generation have increasingly explored the integration of reasoning capabilities. They incorporate textual reasoning, i.e., think, either before (as pre-planning) or after (as post-refinement) the generation process, yet they lack on-the-fly multimodal interaction during the generation itself. In this preliminary study, we introduce Thinking-while-Generating (TwiG), the first interleaved framework that enables co-evolving textual reasoning throughout the visual generation process. As visual content is progressively generated, textual reasoning is interleaved to both guide upcoming local regions and reflect on previously synthesized ones. This dynamic interplay produces more context-aware and semantically rich visual outputs. To unveil the potential of this framework, we investigate three candidate strategies: zero-shot prompting, supervised fine-tuning (SFT) on our curated TwiG-50K dataset, and reinforcement learning (RL) via a customized TwiG-GRPO strategy, each offering unique insights into the dynamics of interleaved reasoning. We hope this work inspires further research into interleaving textual reasoning for enhanced visual generation. Code will be released at: https://github.com/ZiyuGuo99/Thinking-while-Generating.

arxiv preprint arxiv, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

arXiv:2511.16671

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Co-Reinforcement Learning for Unified Multimodal Understanding and Generation

Jiang, Jingjing, Si, Chongjie, Luo, Jun, Zhang, Hanwang, Ma, Chao

arXiv.org Artificial Intelligence | Nov-21-2025

This paper presents a pioneering exploration of reinforcement learning (RL) via group relative policy optimization for unified multimodal large language models (ULMs), aimed at simultaneously reinforcing generation and understanding capabilities. Through systematic pilot studies, we uncover the significant potential of ULMs to enable the synergistic co-evolution of dual capabilities within a shared policy optimization framework. Building on this insight, we introduce CoRL, a co-reinforcement learning framework comprising a unified RL stage for joint optimization and a refined RL stage for task-specific enhancement. With the proposed CoRL, our resulting model, ULM-R1, achieves average improvements of 7% on three text-to-image generation datasets and 23% on nine multimodal understanding benchmarks. These results demonstrate the effectiveness of CoRL and highlight the substantial benefit of reinforcement learning in facilitating cross-task synergy and optimization for ULMs. Code is available at https://github.com/mm-vl/ULM-R1.

large language model, machine learning, reinforcement learning, (22 more...)

arXiv.org Artificial Intelligence

arXiv:2505.17534

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.91)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
