AITopics | Learning Management

We study online learning in two-player uninformed Markov games, where the opponent's actions and policies are unobserved. In this setting, Tian et al. (2021) show that achieving no-external-regret is impossible without incurring an exponential dependence on the episode length $H$. They then turn to the weaker notion of Nash-value regret and propose a V-learning algorithm with regret $O(K^{2/3})$ after $K$ episodes. However, their algorithm and guarantee do not adapt to the difficulty of the problem: even when the opponent follows a fixed policy, so that $O(\sqrt{K})$ external regret is well known to be achievable, their result still gives the worse rate $O(K^{2/3})$ on a weaker metric. In this work, we fully address both limitations. First, we introduce empirical Nash-value regret, a new regret notion that is strictly stronger than Nash-value regret and naturally reduces to external regret when the opponent follows a fixed policy. Moreover, under this new metric, we propose a parameter-free algorithm that achieves an $O(\min \{\sqrt{K} + (CK)^{1/3},\sqrt{LK}\})$ regret bound, where $C$ quantifies the variance of the opponent's policies and $L$ denotes the number of policy switches (both at most $O(K)$). Therefore, our results not only recover the two extremes -- $O(\sqrt{K})$ external regret when the opponent is fixed and $O(K^{2/3})$ Nash-value regret in the worst case -- but also smoothly interpolate between these extremes by automatically adapting to the opponent's non-stationarity. We achieve this by first providing a new analysis of the epoch-based V-learning algorithm by Mao et al. (2022), establishing an $O(\eta C + \sqrt{K/\eta})$ regret bound, where $\eta$ is the epoch incremental factor. Next, we show how to adaptively restart this algorithm with an appropriate $\eta$ in response to the potential non-stationarity of the opponent, eventually achieving our final results.
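The rates quoted in the abstract can be checked numerically. The sketch below is only an illustration, not the paper's algorithm: it sets every constant hidden by the $O(\cdot)$ notation to 1, and the function names `epoch_bound`, `balanced_eta`, and `adaptive_bound` are hypothetical. It shows that equating the two terms of $\eta C + \sqrt{K/\eta}$ gives $\eta = (K/C^2)^{1/3}$, at which point each term equals $(CK)^{1/3}$, and that the combined bound gives $\sqrt{K}$ when $C = 0$ and $\sqrt{K} + K^{2/3}$ when $C = K$.

```python
import math


def epoch_bound(eta, K, C):
    """Evaluate eta*C + sqrt(K/eta), with hidden constants assumed to be 1."""
    return eta * C + math.sqrt(K / eta)


def balanced_eta(K, C):
    """Return the eta that makes eta*C equal to sqrt(K/eta); requires C > 0."""
    if C <= 0:
        raise ValueError("balancing the two terms needs C > 0")
    return (K / C**2) ** (1.0 / 3.0)


def adaptive_bound(K, C, L):
    """Evaluate min{sqrt(K) + (C*K)^(1/3), sqrt(L*K)}, constants assumed to be 1."""
    variance_term = math.sqrt(K) + (C * K) ** (1.0 / 3.0)
    switch_term = math.sqrt(L * K)
    return min(variance_term, switch_term)


if __name__ == "__main__":
    K, C = 10**6, 10**3
    eta = balanced_eta(K, C)
    # At the balancing point both terms equal (C*K)^(1/3).
    print(eta, epoch_bound(eta, K, C), 2 * (C * K) ** (1.0 / 3.0))
    # C = 0: the variance term contributes nothing, leaving sqrt(K).
    print(adaptive_bound(10**4, 0, 10**4))
    # C = K and L = K: the worst case, sqrt(K) + K^(2/3).
    print(adaptive_bound(10**4, 10**4, 10**4))
```

Any admissible range for $\eta$ is left out of the sketch because the abstract does not state one.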

artificial intelligence, machine learning, reinforcement learning, (17 more...)

arXiv.org Machine Learning

2602.07205

Country:

North America > United States > California (0.14)
Asia > Middle East > Jordan (0.04)
Asia > China > Hong Kong (0.04)

Genre: Research Report > New Finding (0.48)

Industry:

Leisure & Entertainment > Games (0.67)
Education > Educational Setting > Online (0.60)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.68)
Information Technology > Enterprise Applications > Human Resources > Learning Management (0.60)

abb451a12cf1a9d93292e81f0d4fdd7a-Paper.pdf

Neural Information Processing Systems, Feb-9-2026, 18:53:03 GMT

algorithm, online, policy regret, (17 more...)

Neural Information Processing Systems

Country:

North America > Canada (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Industry: Education > Educational Setting > Online (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Enterprise Applications > Human Resources > Learning Management (0.67)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.43)

abb451a12cf1a9d93292e81f0d4fdd7a-AuthorFeedback.pdf

Neural Information Processing Systems, Feb-9-2026, 18:52:52 GMT

complexity, learnability, online, (17 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.73)
Information Technology > Enterprise Applications > Human Resources > Learning Management (0.47)

Revisiting Smoothed Online Learning

Neural Information Processing Systems, Feb-9-2026, 08:07:10 GMT

In this paper, we revisit the problem of smoothed online learning, in which the online learner suffers both a hitting cost and a switching cost, and target two performance metrics: competitive ratio and dynamic regret with switching cost. To bound the competitive ratio, we assume the hitting cost is known to the learner in each round, and investigate the simple idea of balancing the two costs by an optimization problem.
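To make the idea of balancing the two costs concrete, here is a minimal one-dimensional sketch. Everything specific in it is an assumption chosen for illustration and is not taken from the paper: the hitting cost is the quadratic $\frac{1}{2}(x - \theta_t)^2$, the switching cost is $|x_t - x_{t-1}|$, and the weight `lam` and the names `balance_step` and `total_cost` are hypothetical. Under these assumptions the per-round optimization problem has a closed-form soft-threshold solution.

```python
import math


def balance_step(prev_x, theta, lam=1.0):
    """Minimize 0.5*(x - theta)**2 + lam*|x - prev_x| over x (assumed costs).

    The minimizer moves from prev_x toward theta, but stops lam short of it,
    and does not move at all when theta is within lam of prev_x.
    """
    gap = theta - prev_x
    move = max(abs(gap) - lam, 0.0)
    return prev_x + math.copysign(move, gap)


def total_cost(xs, thetas, x0=0.0):
    """Sum hitting cost plus switching cost over all rounds, starting at x0."""
    cost, prev = 0.0, x0
    for x, theta in zip(xs, thetas):
        cost += 0.5 * (x - theta) ** 2 + abs(x - prev)
        prev = x
    return cost


if __name__ == "__main__":
    thetas = [3.0, 3.2, -1.0]  # hypothetical per-round minimizers
    xs, prev = [], 0.0
    for theta in thetas:
        prev = balance_step(prev, theta)
        xs.append(prev)
    print(xs, total_cost(xs, thetas))
```

The step needs the current round's hitting cost before choosing $x_t$, which matches the assumption stated above that the hitting cost is known to the learner in each round.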

artificial intelligence, dynamic regret, zhang, (18 more...)

Neural Information Processing Systems

Country: