ln 3
Clipped Gradient Methods for Nonsmooth Convex Optimization under Heavy-Tailed Noise: A Refined Analysis
Optimization under heavy-tailed noise has become popular recently, since it better fits many modern machine learning tasks, as captured by empirical observations. Concretely, instead of a finite second moment on gradient noise, a bounded ${\frak p}$-th moment where ${\frak p}\in(1,2]$ has been recognized to be more realistic (say being upper bounded by $ฯ_{\frak l}^{\frak p}$ for some $ฯ_{\frak l}\ge0$). A simple yet effective operation, gradient clipping, is known to handle this new challenge successfully. Specifically, Clipped Stochastic Gradient Descent (Clipped SGD) guarantees a high-probability rate ${\cal O}(ฯ_{\frak l}\ln(1/ฮด)T^{1/{\frak p}-1})$ (resp. ${\cal O}(ฯ_{\frak l}^2\ln^2(1/ฮด)T^{2/{\frak p}-2})$) for nonsmooth convex (resp. strongly convex) problems, where $ฮด\in(0,1]$ is the failure probability and $T\in\mathbb{N}$ is the time horizon. In this work, we provide a refined analysis for Clipped SGD and offer two faster rates, ${\cal O}(ฯ_{\frak l}d_{\rm eff}^{-1/2{\frak p}}\ln^{1-1/{\frak p}}(1/ฮด)T^{1/{\frak p}-1})$ and ${\cal O}(ฯ_{\frak l}^2d_{\rm eff}^{-1/{\frak p}}\ln^{2-2/{\frak p}}(1/ฮด)T^{2/{\frak p}-2})$, than the aforementioned best results, where $d_{\rm eff}\ge1$ is a quantity we call the $\textit{generalized effective dimension}$. Our analysis improves upon the existing approach on two sides: better utilization of Freedman's inequality and finer bounds for clipping error under heavy-tailed noise. In addition, we extend the refined analysis to convergence in expectation and obtain new rates that break the known lower bounds. Lastly, to complement the study, we establish new lower bounds for both high-probability and in-expectation convergence. Notably, the in-expectation lower bounds match our new upper bounds, indicating the optimality of our refined analysis for convergence in expectation.
Gradient Compressed Sensing: A Query-Efficient Gradient Estimator for High-Dimensional Zeroth-Order Optimization
We study nonconvex zeroth-order optimization (ZOO) in a high-dimensional space $\mathbb R^d$ for functions with approximately $s$-sparse gradients. To reduce the dependence on the dimensionality $d$ in the query complexity, high-dimensional ZOO methods seek to leverage gradient sparsity to design gradient estimators. The previous best method needs $O\big(s\log\frac ds\big)$ queries per step to achieve $O\big(\frac1T\big)$ rate of convergence w.r.t. the number T of steps. In this paper, we propose *Gradient Compressed Sensing* (GraCe), a query-efficient and accurate estimator for sparse gradients that uses only $O\big(s\log\log\frac ds\big)$ queries per step and still achieves $O\big(\frac1T\big)$ rate of convergence. To our best knowledge, we are the first to achieve a *double-logarithmic* dependence on $d$ in the query complexity under weaker assumptions. Our proposed GraCe generalizes the Indyk--Price--Woodruff (IPW) algorithm in compressed sensing from linear measurements to nonlinear functions. Furthermore, since the IPW algorithm is purely theoretical due to its impractically large constant, we improve the IPW algorithm via our *dependent random partition* technique together with our corresponding novel analysis and successfully reduce the constant by a factor of nearly 4300. Our GraCe is not only theoretically query-efficient but also achieves strong empirical performance. We benchmark our GraCe against 12 existing ZOO methods with 10000-dimensional functions and demonstrate that GraCe significantly outperforms existing methods.
Learning-based Scheduling for Information Accuracy and Freshness in Wireless Networks
We consider a system of multiple sources, a single communication channel, and a single monitoring station. Each source measures a time-varying quantity with varying levels of accuracy and one of them sends its update to the monitoring station via the channel. The probability of success of each attempted communication is a function of the source scheduled for transmitting its update. Both the probability of correct measurement and the probability of successful transmission of all the sources are unknown to the scheduler. The metric of interest is the reward received by the system which depends on the accuracy of the last update received by the destination and the Age-of-Information (AoI) of the system. We model our scheduling problem as a variant of the multi-arm bandit problem with sources as different arms. We compare the performance of all $4$ standard bandit policies, namely, ETC, $\epsilon$-greedy, UCB, and TS suitably adjusted to our system model via simulations. In addition, we provide analytical guarantees of $2$ of these policies, ETC, and $\epsilon$-greedy. Finally, we characterize the lower bound on the cumulative regret achievable by any policy.
Minimax Weight Learning for Absorbing MDPs
Li, Fengyin, Li, Yuqiang, Wu, Xianyi
Reinforcement learning policy evaluation problems are often modeled as finite or discounted/averaged infinite-horizon MDPs. In this paper, we study undiscounted off-policy policy evaluation for absorbing MDPs. Given the dataset consisting of the i.i.d episodes with a given truncation level, we propose a so-called MWLA algorithm to directly estimate the expected return via the importance ratio of the state-action occupancy measure. The Mean Square Error (MSE) bound for the MWLA method is investigated and the dependence of statistical errors on the data size and the truncation level are analyzed. With an episodic taxi environment, computational experiments illustrate the performance of the MWLA algorithm.
Multi-Step Greedy and Approximate Real Time Dynamic Programming
Efroni, Yonathan, Ghavamzadeh, Mohammad, Mannor, Shie
Real Time Dynamic Programming (RTDP) is a well-known Dynamic Programming (DP) based algorithm that combines planning and learning to find an optimal policy for an MDP. It is a planning algorithm because it uses the MDP's model (reward and transition functions) to calculate a 1-step greedy policy w.r.t.~an optimistic value function, by which it acts. It is a learning algorithm because it updates its value function only at the states it visits while interacting with the environment. As a result, unlike DP, RTDP does not require uniform access to the state space in each iteration, which makes it particularly appealing when the state space is large and simultaneously updating all the states is not computationally feasible. In this paper, we study a generalized multi-step greedy version of RTDP, which we call $h$-RTDP, in its exact form, as well as in three approximate settings: approximate model, approximate value updates, and approximate state abstraction. We analyze the sample, computation, and space complexities of $h$-RTDP and establish that increasing $h$ improves sample and space complexity, with the cost of additional offline computational operations. For the approximate cases, we prove that the asymptotic performance of $h$-RTDP is the same as that of a corresponding approximate DP -- the best one can hope for without further assumptions on the approximation errors. $h$-RTDP is the first algorithm with a provably improved sample complexity when increasing the lookahead horizon.