dynamic regret
Small Gradient Norm Regret for Online Convex Optimization
Gao, Wenzhi, He, Chang, Udell, Madeleine
This paper introduces a new problem-dependent regret measure for online convex optimization with smooth losses. The notion, which we call the $G^\star$ regret, depends on the cumulative squared gradient norm evaluated at the decision in hindsight, $\sum_{t=1}^T \|\nabla \ell_t(x^\star)\|^2$. We show that the $G^\star$ regret strictly refines the existing $L^\star$ (small loss) regret, and that it can be arbitrarily sharper when the losses have vanishing curvature around the hindsight decision. We establish upper and lower bounds on the $G^\star$ regret and extend our results to dynamic regret and bandit settings. As a byproduct, we refine the existing convergence analysis of stochastic optimization algorithms in the interpolation regime. Experiments validate our theoretical findings.
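The relation between the two hindsight quantities can be illustrated numerically. The sketch below is not the paper's construction: it assumes one-dimensional quadratic losses $\ell_t(x) = \tfrac{1}{2} c_t (x - a_t)^2$ with hypothetical curvatures $c_t$ and centers $a_t$, and it relies on the standard self-bounding property of nonnegative $L$-smooth functions, $\|\nabla \ell(x)\|^2 \le 2L\,\ell(x)$, which gives $G^\star \le 2L \cdot L^\star$. The function name `hindsight_quantities` is made up for this example.

```python
import random


def hindsight_quantities(curvatures, centers):
    """Return (x_star, L_star, G_star) for losses 0.5 * c_t * (x - a_t)**2.

    Illustrative assumption: 1-D quadratics, so the hindsight minimizer is
    the curvature-weighted mean of the centers.
    """
    total_curvature = sum(curvatures)
    x_star = sum(c * a for c, a in zip(curvatures, centers)) / total_curvature
    # Cumulative loss at the hindsight decision (the small-loss quantity).
    l_star = sum(0.5 * c * (x_star - a) ** 2 for c, a in zip(curvatures, centers))
    # Cumulative squared gradient norm at the hindsight decision.
    g_star = sum((c * (x_star - a)) ** 2 for c, a in zip(curvatures, centers))
    return x_star, l_star, g_star


rng = random.Random(0)
T = 1000
# Most rounds have small curvature; the smoothness constant is the largest one.
curvatures = [rng.uniform(0.01, 1.0) for _ in range(T)]
centers = [rng.gauss(0.0, 1.0) for _ in range(T)]

x_star, l_star, g_star = hindsight_quantities(curvatures, centers)
smoothness = max(curvatures)

print("L* =", l_star)
print("G* =", g_star)
print("2 * L * L* =", 2 * smoothness * l_star)
```

In this toy family each round satisfies $\|\nabla \ell_t(x^\star)\|^2 = 2 c_t\, \ell_t(x^\star)$, so the gap between $G^\star$ and $2L \cdot L^\star$ widens as more rounds have curvature well below the global smoothness constant.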
Revisiting Weighted Strategy for Non-stationary Parametric Bandits and MDPs
Wang, Jing, Zhao, Peng, Zhou, Zhi-Hua
Abstract: Non-stationary parametric bandits have attracted much attention recently. There are three principled ways to deal with non-stationarity: sliding-window, weighted, and restart strategies. As many non-stationary environments exhibit gradual drifting patterns, the weighted strategy is commonly adopted in real-world applications. However, previous theoretical studies show that its analysis is more involved and the algorithms are either computationally less efficient or statistically suboptimal. This paper revisits the weighted strategy for non-stationary parametric bandits. In linear bandits (LB), we discover that this undesirable feature is due to an inadequate regret analysis, which results in an overly complex algorithm design. We propose a refined analysis framework, which simplifies the derivation and, importantly, produces a simpler weight-based algorithm that is as efficient as window/restart-based algorithms while retaining the same regret as previous studies. Furthermore, our new framework can be used to improve regret bounds of other parametric bandits, including Generalized Linear Bandits (GLB) and Self-Concordant Bandits (SCB). Moreover, we extend our framework to non-stationary Markov Decision Processes (MDPs) with function approximation, focusing on Linear Mixture MDP and Multinomial Logit (MNL) Mixture MDP. For both classes, we propose algorithms based on the weighted strategy and establish dynamic regret guarantees using our analysis framework.

Index Terms: dynamic regret, non-stationary bandits, discounted factor, online MDPs, function approximation.

Non-stationary parametric bandits model the sequential decision-making problems where the reward distribution of each arm is structured with an unknown time-varying parameter, which have been extensively studied in recent years [1]-[11] due to their significance in many real-world non-stationary online applications such as recommendation systems [12], [13].
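To make the weighted strategy concrete, here is a minimal sketch of a discounted least-squares estimator tracking a gradually drifting parameter. It is the textbook exponential-weighting recursion reduced to a scalar feature, not the refined algorithm proposed in the paper; the class name `WeightedLeastSquares1D`, the helper `final_tracking_error`, the discount value 0.95, and the linear drift from 1 to -1 are all assumptions made for illustration.

```python
import random


class WeightedLeastSquares1D:
    """Discounted ridge estimator for a scalar parameter (illustrative only)."""

    def __init__(self, discount, reg=1.0):
        self.discount = discount  # weight in (0, 1]; 1.0 means no forgetting
        self.reg = reg            # ridge regularization strength
        self.v = reg              # discounted sum of x^2, plus regularization
        self.b = 0.0              # discounted sum of reward * x

    def update(self, x, reward):
        g = self.discount
        # Old observations are down-weighted by g; the (1 - g) * reg term
        # keeps the regularization level constant across rounds.
        self.v = g * self.v + x * x + (1.0 - g) * self.reg
        self.b = g * self.b + reward * x

    def estimate(self):
        return self.b / self.v


def final_tracking_error(discount, horizon=1000, seed=0):
    """Absolute error of the estimate after `horizon` rounds of linear drift."""
    rng = random.Random(seed)
    estimator = WeightedLeastSquares1D(discount)
    theta = 1.0
    for t in range(horizon):
        theta = 1.0 - 2.0 * t / horizon      # parameter drifts from 1 to -1
        x = rng.uniform(0.5, 1.5)            # hypothetical scalar feature
        reward = theta * x + rng.gauss(0.0, 0.1)
        estimator.update(x, reward)
    return abs(estimator.estimate() - theta)


weighted_error = final_tracking_error(discount=0.95)
unweighted_error = final_tracking_error(discount=1.0)
print("weighted:", weighted_error, "unweighted:", unweighted_error)
```

With `discount=1.0` the estimator averages over the whole history and lags far behind the drifting parameter, while the discounted version follows it at the cost of a smaller effective sample size.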
Dynamic Regret of Adversarial Linear Mixture MDPs
We study reinforcement learning in episodic inhomogeneous MDPs with adversarial full-information rewards and an unknown transition kernel. We consider linear mixture MDPs, whose transition kernel is a linear mixture model, and choose the \emph{dynamic regret} as the performance measure. Denoting by $d$ the dimension of the feature mapping, $H$ the horizon, $K$ the number of episodes, and $P_T$ the non-stationarity measure, we propose a novel algorithm that enjoys an $\widetilde{\mathcal{O}}\big(\sqrt{d^2 H^3K} + \sqrt{H^4(K+P_T)(1+P_T)}\big)$ dynamic regret under the condition that $P_T$ is known, which improves on the previously best-known dynamic regret for adversarial linear mixture MDPs and adversarial tabular MDPs.
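To see how the stated rate behaves as the environment becomes less stationary, the following sketch evaluates the expression inside the $\widetilde{\mathcal{O}}(\cdot)$ for a few values of $P_T$. It drops constants and logarithmic factors, so the numbers indicate growth only and are not actual regret values; the function name `dynamic_regret_rate` and the chosen values of $d$, $H$, and $K$ are hypothetical.

```python
import math


def dynamic_regret_rate(d, H, K, P_T):
    """Evaluate sqrt(d^2 H^3 K) + sqrt(H^4 (K + P_T) (1 + P_T)).

    Constants and log factors hidden by the O-tilde notation are ignored.
    """
    model_term = math.sqrt(d**2 * H**3 * K)
    drift_term = math.sqrt(H**4 * (K + P_T) * (1 + P_T))
    return model_term + drift_term


d, H, K = 10, 5, 10_000  # hypothetical problem sizes
for P_T in (0, 10, 100, K):
    print(P_T, round(dynamic_regret_rate(d, H, K, P_T)))
```

At $P_T = 0$ the expression reduces to $\sqrt{d^2 H^3 K} + H^2\sqrt{K}$, a $\sqrt{K}$-type rate, and once $P_T$ is of order $K$ the second term grows linearly in $K$.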