Enhancing the Outcome Reward-based RL Training of MLLMs with Self-Consistency Sampling

Jun-22-2026, 13:37:57 GMT–Neural Information Processing Systems

Outcome-reward reinforcement learning (RL) is a common, and increasingly significant, way to refine the step-by-step reasoning of multimodal large language models (MLLMs). In the multiple-choice setting, a dominant format for multimodal reasoning benchmarks, the paradigm faces a significant yet often overlooked obstacle: unfaithful trajectories that guess the correct option after a faulty chain of thought receive the same reward as genuine reasoning. We propose Self-Consistency Sampling (SCS) to correct this issue. For each question, SCS (i) introduces small visual perturbations and (ii) performs repeated truncation-and-resampling of an initial trajectory; agreement among the resulting trajectories yields a differentiable consistency score that down-weights unreliable traces during policy updates.
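The truncation-and-resampling idea can be illustrated with a toy sketch. This is not the paper's implementation: the function names, the `continue_fn` callable (a stand-in for sampling a continuation from the policy), the majority-agreement score, and the multiplicative reward weighting are all assumptions made for illustration. The visual-perturbation step is omitted, and the score here is a plain agreement fraction rather than the differentiable score the abstract describes.

```python
import random
from collections import Counter
from typing import Callable, List, Sequence


def truncate_and_resample(
    steps: Sequence[str],
    continue_fn: Callable[[Sequence[str]], str],
    num_resamples: int,
    rng: random.Random,
) -> List[str]:
    """Cut the initial trajectory at random points and re-sample an answer.

    `continue_fn` is a hypothetical stand-in for the policy: it receives a
    prefix of reasoning steps and returns the final answer it reaches.
    """
    if not steps:
        raise ValueError("the initial trajectory must contain at least one step")
    answers = []
    for _ in range(num_resamples):
        cut = rng.randint(1, len(steps))  # keep at least one step of the prefix
        answers.append(continue_fn(steps[:cut]))
    return answers


def consistency_score(answers: Sequence[str]) -> float:
    """Fraction of resampled answers that agree with the most common answer."""
    if not answers:
        return 0.0
    _, count = Counter(answers).most_common(1)[0]
    return count / len(answers)


def shaped_reward(outcome_reward: float, score: float) -> float:
    """Down-weight the outcome reward by the consistency score (assumed form)."""
    return outcome_reward * score
```

Under these assumptions, a trajectory whose resampled continuations split between options earns less than one whose continuations all agree, even when both initially selected the correct option.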



Genre:
- Research Report > Experimental Study (1.00)

Industry:
- Education (0.66)
- Information Technology (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Robots (1.00)
  - Representation & Reasoning (1.00)
  - Natural Language > Large Language Model (0.89)
  - Machine Learning
    - Reinforcement Learning (1.00)
    - Neural Networks > Deep Learning (0.93)
