Low-probability Tokens Sustain Exploration in Reinforcement Learning with Verifiable Reward
