Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning
Neural Information Processing Systems
To this end, this paper proposes UNIFIEDREWARD-THINK, the first unified multimodal CoT-based reward model, capable of multi-dimensional, step-by-step long-chain reasoning for both visual understanding and generation reward tasks. Specifically, we adopt an exploration-driven reinforcement fine-tuning approach to elicit and incentivize the model's latent complex reasoning
Jun-23-2026, 00:59:38 GMT
- Genre:
- Research Report
- Experimental Study (1.00)
- New Finding (0.67)
- Industry:
- Leisure & Entertainment > Sports > Tennis (0.93)
- Technology: