Enhancing the Outcome Reward-based RL Training of MLLMs with Self-Consistency Sampling
