AITopics | Reinforcement Learning

94739e5a5164b4d2396e253a11d57044-Paper.pdf

Neural Information Processing Systems, Feb-10-2026, 00:13:40 GMT

algorithm, arxiv preprint arxiv, q-learning, (11 more...)

Neural Information Processing Systems

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
Asia > Middle East > Jordan (0.04)
North America > United States > New Jersey > Mercer County > Princeton (0.04)
Asia > China > Beijing > Beijing (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Data Science > Data Mining (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)

Online Learning for Uninformed Markov Games: Empirical Nash-Value Regret and Non-Stationarity Adaptation

Liu, Junyan, Luo, Haipeng, Zhang, Zihan, Ratliff, Lillian J.

arXiv.org Machine Learning, Feb-10-2026

We study online learning in two-player uninformed Markov games, where the opponent's actions and policies are unobserved. In this setting, Tian et al. (2021) show that achieving no-external-regret is impossible without incurring an exponential dependence on the episode length $H$. They then turn to the weaker notion of Nash-value regret and propose a V-learning algorithm with regret $O(K^{2/3})$ after $K$ episodes. However, their algorithm and guarantee do not adapt to the difficulty of the problem: even in the case where the opponent follows a fixed policy and thus $O(\sqrt{K})$ external regret is well-known to be achievable, their result is still the worse rate $O(K^{2/3})$ on a weaker metric. In this work, we fully address both limitations. First, we introduce empirical Nash-value regret, a new regret notion that is strictly stronger than Nash-value regret and naturally reduces to external regret when the opponent follows a fixed policy. Moreover, under this new metric, we propose a parameter-free algorithm that achieves an $O(\min \{\sqrt{K} + (CK)^{1/3},\sqrt{LK}\})$ regret bound, where $C$ quantifies the variance of the opponent's policies and $L$ denotes the number of policy switches (both at most $O(K)$). Therefore, our results not only recover the two extremes -- $O(\sqrt{K})$ external regret when the opponent is fixed and $O(K^{2/3})$ Nash-value regret in the worst case -- but also smoothly interpolate between these extremes by automatically adapting to the opponent's non-stationarity. We achieve this by first providing a new analysis of the epoch-based V-learning algorithm by Mao et al. (2022), establishing an $O(ηC + \sqrt{K/η})$ regret bound, where $η$ is the epoch incremental factor. Next, we show how to adaptively restart this algorithm with an appropriate $η$ in response to the potential non-stationarity of the opponent, eventually achieving our final results.
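
As a rough sanity check on how the $(CK)^{1/3}$ term in the abstract can arise from the $O(ηC + \sqrt{K/η})$ bound, the two terms can be balanced over $η$. This is only a sketch: it ignores the constants and logarithmic factors hidden in the $O(\cdot)$ notation, and it assumes $C > 0$ and that $η$ may be chosen freely, which the abstract does not state.

```latex
\begin{align*}
g(\eta) &= \eta C + \sqrt{K/\eta}, \qquad
g'(\eta) = C - \tfrac{1}{2}\sqrt{K}\,\eta^{-3/2} = 0
\;\Longrightarrow\;
\eta^\star = \left(\frac{K}{4C^2}\right)^{1/3}, \\
g(\eta^\star) &= 2^{-2/3}(CK)^{1/3} + 2^{1/3}(CK)^{1/3}
= 3 \cdot 2^{-2/3}\,(CK)^{1/3}.
\end{align*}
```

With $C$ of order $K$ this gives the worst-case rate $K^{2/3}$, which matches the extreme quoted in the abstract.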

artificial intelligence, machine learning, reinforcement learning, (17 more...)

arXiv.org Machine Learning

2602.07205

Country:

North America > United States > California (0.14)
Asia > Middle East > Jordan (0.04)
Asia > China > Hong Kong (0.04)

Genre: Research Report > New Finding (0.48)

Industry:

Leisure & Entertainment > Games (0.67)
Education > Educational Setting > Online (0.60)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.68)
Information Technology > Enterprise Applications > Human Resources > Learning Management (0.60)

f-GRPO and Beyond: Divergence-Based Reinforcement Learning Algorithms for General LLM Alignment

Haldar, Rajdeep, Mei, Lantao, Lin, Guang, Xing, Yue, Song, Qifan

arXiv.org Machine Learning, Feb-10-2026

Recent research shows that Preference Alignment (PA) objectives act as divergence estimators between aligned (chosen) and unaligned (rejected) response distributions. In this work, we extend this divergence-based perspective to general alignment settings, such as reinforcement learning with verifiable rewards (RLVR), where only environmental rewards are available. Within this unified framework, we propose f-Group Relative Policy Optimization (f-GRPO), a class of on-policy reinforcement learning algorithms, and f-Hybrid Alignment Loss (f-HAL), a class of hybrid on/off-policy objectives, for general LLM alignment based on the variational representation of f-divergences. We provide theoretical guarantees that these classes of objectives improve the average reward after alignment. Empirically, we validate our framework on both RLVR (Math Reasoning) and PA tasks (Safety Alignment), demonstrating superior performance and flexibility compared to current methods.
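
The variational representation of f-divergences mentioned in the abstract can be illustrated on a small discrete example. The sketch below uses the standard identity $D_f(P \| Q) = \sup_T \, E_P[T] - E_Q[f^*(T)]$ for the KL case ($f(u) = u \log u$, $f^*(t) = e^{t-1}$). It is not the f-GRPO or f-HAL objective; the distributions `p` and `q` and all function names are hypothetical.

```python
import math

def kl(p, q):
    """KL divergence between two discrete distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def variational_value(p, q, T):
    """E_P[T] - E_Q[f*(T)], where f*(t) = exp(t - 1) is the conjugate of u*log(u)."""
    e_p = sum(pi * t for pi, t in zip(p, T))
    e_q = sum(qi * math.exp(t - 1) for qi, t in zip(q, T))
    return e_p - e_q

# Hypothetical distributions chosen only for illustration.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]

# The supremum is attained at T(x) = 1 + log(p(x) / q(x)).
T_opt = [1 + math.log(pi / qi) for pi, qi in zip(p, q)]
# Any other critic, e.g. a shifted one, gives a lower bound on the divergence.
T_other = [t + 0.3 for t in T_opt]

best = variational_value(p, q, T_opt)
lower = variational_value(p, q, T_other)
```

At the optimal critic the variational value equals the KL divergence; any other critic only yields a lower bound, which is what makes the representation usable as a training objective.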

large language model, machine learning, reinforcement learning, (16 more...)

arXiv.org Machine Learning

2602.05946

Country: North America > United States > Michigan (0.04)

Genre: Research Report > New Finding (0.34)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

b710915795b9e9c02cf10d6d2bdb688c-Paper.pdf

Neural Information Processing Systems, Feb-9-2026, 23:58:17 GMT

The most well-known work in the reward shaping domain is the potential-based reward shaping (PBRS) method [12], which is the first to show that policy invariance can be guaranteed if the shaping reward function is in the form of the difference of potential values.
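
The "difference of potential values" form can be made concrete with a small sketch. It assumes the usual discounted form $F(s, s') = γΦ(s') - Φ(s)$ and checks numerically that the shaping terms telescope along a trajectory, which is the intuition behind policy invariance. The potential `phi`, the trajectory `states`, and the discount `gamma` are hypothetical choices for illustration.

```python
def shaping_reward(phi, s, s_next, gamma):
    """Potential-based shaping term F(s, s') = gamma * phi(s') - phi(s)."""
    return gamma * phi(s_next) - phi(s)

def discounted_shaping_return(phi, states, gamma):
    """Discounted sum of shaping terms along a trajectory of states."""
    total = 0.0
    for t in range(len(states) - 1):
        total += gamma ** t * shaping_reward(phi, states[t], states[t + 1], gamma)
    return total

# Hypothetical potential on a 1-D chain: negative distance to a goal at state 5.
def phi(s):
    return -abs(5 - s)

states = [0, 1, 2, 1, 2, 3, 4, 5]   # hypothetical trajectory ending at the goal
gamma = 0.9

total = discounted_shaping_return(phi, states, gamma)
# Telescoping: the sum depends only on the first and last states.
telescoped = gamma ** (len(states) - 1) * phi(states[-1]) - phi(states[0])
```

Because the accumulated shaping reward depends only on the endpoints of the trajectory, it shifts the return of every policy from a given start state in the same way, apart from the discounted terminal potential.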

artificial intelligence, machine learning, reinforcement learning, (15 more...)

Neural Information Processing Systems

Country:

North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
Asia > China > Zhejiang Province > Hangzhou (0.04)
Asia > China > Tianjin Province > Tianjin (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.48)

b6f8dc086b2d60c5856e4ff517060392-Supplemental.pdf

Neural Information Processing Systems, Feb-9-2026, 23:44:17 GMT

lemma 1, quantile, state-action pair, (14 more...)

Neural Information Processing Systems

Country:

North America > Canada (0.04)
Asia > China > Shanghai > Shanghai (0.04)

Industry: Leisure & Entertainment (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.86)

b6f8dc086b2d60c5856e4ff517060392-Paper.pdf

Neural Information Processing Systems, Feb-9-2026, 23:44:10 GMT

algorithm, qr-dqn, quantile estimate, (12 more...)

Neural Information Processing Systems

Country:

Asia > China > Shanghai > Shanghai (0.04)
North America > Canada (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

78e36c70d5051e9e271b00289624d709-Paper-Conference.pdf

Neural Information Processing Systems, Feb-9-2026, 23:29:12 GMT

arxiv preprint arxiv, international conference, reinforcement learning, (11 more...)

Neural Information Processing Systems

Country:

North America > United States (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
(2 more...)

In this section, we present detailed proofs for the theoretical derivation of Thm. 1, which aims to solve the following optimization problem: min

Neural Information Processing Systems, Feb-9-2026, 23:15:29 GMT

These assumptions are not strong and can be satisfied in most environments, including MuJoCo, Atari games, and so on. Let $f$ be a Lebesgue-integrable function and let $P$ and $Q$ be two probability distributions. If $|f| \le C$, then $E_{P(x)} f(x) - E_{Q(x)} f(x) \le C\, D_{TV}(P, Q)$ (5). Proof. Suppose there are two actions $a_1$, $a_2$ under state $s$, and let $Q_1(s, a_1) = u$, $Q_1(s, a_2) = v$. In this way, we can derive the upper bound of $E_{a \sim π_2} Q_1(s, a) - E_{a \sim π_1} Q_1(s, a)$ as above. Since both sides of the above equation have the same minimum (here the minima are given by $Q_k = Q$), we can replace the objective in Problem 2 with the upper bound in Eq. (10) and solve the relaxed optimization problem.
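
The bound on the gap between two expectations of a bounded function can be checked numerically on discrete distributions. The sketch below assumes the distance is measured as the L1 distance $\sum_x |P(x) - Q(x)|$, under which the constant is $C$; under the alternative convention with a factor of $1/2$, the constant becomes $2C$. Which convention the excerpt uses is not visible here, and all names in the code are hypothetical.

```python
import random

def l1_distance(p, q):
    """L1 distance between two discrete distributions (assumed convention)."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

def expectation(p, f):
    """Expectation of f under the discrete distribution p."""
    return sum(pi * fi for pi, fi in zip(p, f))

def random_distribution(n, rng):
    """Random probability vector of length n."""
    weights = [rng.random() + 1e-12 for _ in range(n)]
    z = sum(weights)
    return [w / z for w in weights]

def check_bound(trials=1000, n=6, C=3.0, seed=0):
    """Check |E_P f - E_Q f| <= C * ||P - Q||_1 for random P, Q and |f| <= C."""
    rng = random.Random(seed)
    for _ in range(trials):
        p = random_distribution(n, rng)
        q = random_distribution(n, rng)
        f = [rng.uniform(-C, C) for _ in range(n)]
        gap = abs(expectation(p, f) - expectation(q, f))
        if gap > C * l1_distance(p, q) + 1e-12:
            return False
    return True

bound_holds = check_bound()
```

The bound is tight for point masses on different outcomes with $f$ taking the values $C$ and $-C$: the gap is $2C$ and the L1 distance is $2$.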

artificial intelligence, machine learning, reinforcement learning, (19 more...)

Neural Information Processing Systems

Country: