Random Policy Valuation is Enough for LLM Reasoning with Verifiable Rewards
