Delving into RL for Image Generation with CoT: A Study on DPO vs. GRPO

Jun-13-2026, 17:45:14 GMT – Neural Information Processing Systems

Recent work underscores the significant role of Reinforcement Learning (RL) in enhancing the Chain-of-Thought (CoT) reasoning capabilities of large language models (LLMs). Two prominent RL algorithms, Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO), are central to these developments, each with distinct strengths and weaknesses. Autoregressive image generation, which can also be interpreted as a sequential CoT reasoning process, presents challenges distinct from those of LLM-based CoT reasoning. These include ensuring text-image consistency, improving image aesthetic quality, and designing sophisticated reward models rather than relying on simpler rule-based rewards. While recent efforts have extended RL to this domain, they typically lack an in-depth analysis of the domain-specific challenges and of the characteristics of different RL strategies.

large language model, machine learning, reinforcement learning

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning (0.95)
  - Natural Language > Large Language Model (0.81)
  - Machine Learning > Reinforcement Learning (0.64)
  - Cognitive Science > Problem Solving (0.59)