Symmetric Behavior Regularized Policy Optimization

Zhu, Lingwei, Shah, Haseeb, Chen, Zheng, Nagai, Yukie, White, Martha

Dec-2-2025–arXiv.org Artificial Intelligence

Behavior Regularized Policy Optimization (BRPO) leverages asymmetric (divergence) regularization to mitigate the distribution shift in offline Reinforcement Learning. This paper is the first to study the open question of symmetric regularization. We show that symmetric regularization does not permit an analytic optimal policy $π^*$, posing a challenge to practical utility of symmetric BRPO. We approximate $π^*$ by the Taylor series of Pearson-Vajda $χ^n$ divergences and show that an analytic policy expression exists only when the series is capped at $n=5$. To compute the solution in a numerically stable manner, we propose to Taylor expand the conditional symmetry term of the symmetric divergence loss, leading to a novel algorithm: Symmetric $f$-Actor Critic (S$f$-AC). S$f$-AC achieves consistently strong results across various D4RL MuJoCo tasks. Additionally, S$f$-AC avoids per-environment failures observed in IQL, SQL, XQL and AWAC, opening up possibilities for more diverse and effective regularization choices for offline RL.

artificial intelligence, machine learning, reinforcement learning, (15 more...)

arXiv.org Artificial Intelligence

Dec-2-2025

arXiv.org PDF

Add feedback

Country:
- North America (0.28)

Genre:
- Research Report (0.64)

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning > Optimization (0.46)
  - Machine Learning
    - Reinforcement Learning (0.68)
    - Neural Networks (0.46)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found