GVPO: Group Variance Policy Optimization for Large Language Model Post-Training
–Neural Information Processing Systems
Post-training plays a crucial role in refining and aligning large language models to meet specific tasks and human preferences. While recent advancements in post-training techniques, such as Group Relative Policy Optimization (GRPO), leverage increased sampling with relative reward scoring to achieve superior performance, these methods often suffer from training instability that limits their practical adoption.
Neural Information Processing Systems
Jun-14-2026, 07:22:15 GMT
- Technology: