Future Policy Aware Preference Learning for Mathematical Reasoning

Oh, Minjae, Choi, Yunho, Choi, Dongmin, Jo, Yohan

Sep-25-2025–arXiv.org Artificial Intelligence

Preference learning methods such as Direct Preference Optimization (DPO) have become standard for Large Language Model (LLM) post-training, yet they are often ineffective for mathematical reasoning. A key challenge is the large token overlap between preferred and dispreferred trajectories; lowering the probability of dispreferred trajectories also reduces the probability of shared useful tokens, leading to over-penalization and overall performance collapse. As a mitigation, existing algorithms include the probability of a trajectory under the current policy as a regularization term, which decreases the effect of the gradient when the probability is low. However, by the time this effect takes hold, useful tokens may have already been over-penalized as the model has begun to degrade. To address this, we propose Future Policy A ware (FPA) preference learning, which replaces the current policy with a future policy in the regularization term. This future policy is estimated via lightweight, logit-space extrapolation from a reference model toward the current model. FP A enables safer training by preemptively regularizing potentially problematic gradients. We apply FPA to DPO, RPO, and SimPER and evaluate them on the MA TH and GSM8K benchmarks. FP A yields consistent performance gains, with the largest improvements observed with SimPER, achieving gains of up to 5.75%. We demonstrate that FP A provides proactive regularization while preserving the probability of shared, useful mathematical tokens, and enables longer, degradation-free training with negligible computational overhead. We will release our code publicly upon publication. Preference learning methods such as Direct Preference Optimization (DPO) (Rafailov et al., 2023) have become a standard for LLM post-training, with success across various domains like instruction-following, summarization, and model safety (Tunstall et al., 2023; Lambert et al., 2024).

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

Sep-25-2025

arXiv.org PDF

Add feedback

Country:
- North America > United States (0.46)

Genre:
- Research Report (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (0.46)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found