Clip Your Sequences Fairly: Enforcing Length Fairness for Sequence-Level RL
Hanyi Mao, Quanjia Xiao, Lei Pang, Haixiao Liu
arXiv.org Artificial Intelligence
We propose FSPO (Fair Sequence Policy Optimization), a sequence-level reinforcement learning method for LLMs that enforces length-fair clipping on the importance-sampling (IS) weight. We study RL methods with sequence-level IS and identify a mismatch when PPO/GRPO-style clipping is transplanted to sequences: a fixed clip range systematically reweights short vs. long responses, distorting the optimization direction. FSPO introduces a simple remedy: we clip the sequence log-IS ratio with a band that scales as $\sqrt{L}$, where $L$ is the response length. Theoretically, we formalize length fairness via a Length Reweighting Error (LRE) and prove that small LRE yields a cosine directional guarantee between the clipped and true updates. Empirically, FSPO flattens clip rates across length bins, stabilizes training, and outperforms baselines across model sizes and evaluation datasets, with the largest gains on the Qwen3-8B-Base model.
Oct-14-2025
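To make the abstract's core mechanism concrete, here is a minimal PyTorch sketch of a length-fair clipped surrogate, not the paper's reference implementation: the function name `fspo_surrogate`, the hyperparameter `base_clip`, and the PPO-style pessimistic `min` are illustrative assumptions. The idea is to sum per-token log-ratios into a sequence log-IS ratio and clamp it to $\pm\,\text{base\_clip}\cdot\sqrt{L}$ before exponentiating.

```python
import torch

def fspo_surrogate(logp_new, logp_old, advantages, mask, base_clip=0.2):
    """Length-fair sequence-level clipped surrogate (illustrative sketch).

    logp_new, logp_old: (B, T) per-token log-probs under the current and
    behavior policies (logp_old should be detached / precomputed).
    advantages: (B,) sequence-level advantages.
    mask: (B, T) with 1.0 on response tokens, 0.0 on padding.
    base_clip: assumed half-width of the clip band per sqrt(token).
    """
    mask = mask.float()
    # Sequence log-IS ratio: sum of per-token log-ratios over response tokens.
    log_ratio = ((logp_new - logp_old) * mask).sum(dim=-1)   # (B,)
    lengths = mask.sum(dim=-1).clamp(min=1.0)                # (B,)

    # Length-fair band: +/- base_clip * sqrt(L) on the LOG ratio, so the
    # band widens with response length instead of staying fixed.
    band = base_clip * lengths.sqrt()
    clipped_log_ratio = torch.clamp(log_ratio, min=-band, max=band)

    ratio = log_ratio.exp()
    clipped_ratio = clipped_log_ratio.exp()

    # PPO-style pessimistic objective, applied at the sequence level.
    surrogate = torch.minimum(ratio * advantages, clipped_ratio * advantages)
    return -surrogate.mean()

# Toy usage: a batch of 4 responses with different lengths.
B, T = 4, 16
lengths = torch.tensor([5, 16, 9, 12])
mask = (torch.arange(T).expand(B, T) < lengths.unsqueeze(1)).float()
logp_new = (torch.randn(B, T) * 0.1).requires_grad_()
logp_old = torch.randn(B, T) * 0.1
advantages = torch.randn(B)
loss = fspo_surrogate(logp_new, logp_old, advantages, mask)
loss.backward()
print(loss.item())
```

One way to read the $\sqrt{L}$ scaling: for near-on-policy updates the sequence log-ratio is a sum of $L$ roughly independent per-token terms, so its standard deviation grows like $\sqrt{L}$. A fixed band would then clip long responses far more often than short ones, which is exactly the length bias the abstract describes; widening the band with $\sqrt{L}$ keeps clip rates comparable across length bins.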