Uncalibrated Reasoning: GRPO Induces Overconfidence for Stochastic Outcomes
