Allen-Zhu, Zeyuan, Simchi-Levi, David, Wang, Xinshang

Classically, the time complexity of a first-order method is estimated by its number of gradient computations. In this paper, we study a more refined complexity by taking into account the "lingering" of gradients: once a gradient is computed at $x_k$, the additional time to compute gradients at $x_{k+1}, x_{k+2}, \dots$ may be reduced. We show how this improves the running time of gradient descent and SVRG. For instance, if the "additional time" scales linearly with respect to the traveled distance, then the "convergence rate" of gradient descent can be improved from $1/T$ to $\exp(-T^{1/3})$. On the empirical side, we solve a hypothetical revenue management problem on the Yahoo!

Now, what was the gradient descent (GD) algorithm again? The algorithm above says that to perform GD, we need to calculate the gradient of the cost function J. And to calculate that gradient, we need to sum (yellow circle!) over every sample. If we have 3 million samples, each gradient computation has to loop 3 million times or use a dot product of the same size, that is, if you insist on using GD.
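To make the "loop or dot product" choice concrete, here is a minimal sketch. It assumes a linear model with a mean squared error cost, which the text above does not specify; the names `gradient_loop`, `gradient_vectorized`, `X`, `y`, and `theta` are illustrative only.

```python
import numpy as np

def gradient_loop(X, y, theta):
    """Gradient of J(theta) = (1/2m) * sum_i (x_i . theta - y_i)^2, one sample at a time."""
    m = X.shape[0]
    grad = np.zeros_like(theta)
    for i in range(m):                      # one pass over every sample
        error = X[i] @ theta - y[i]         # prediction error for sample i
        grad += error * X[i]                # this sample's contribution
    return grad / m

def gradient_vectorized(X, y, theta):
    """Same gradient, with the sum over samples expressed as a matrix product."""
    m = X.shape[0]
    return X.T @ (X @ theta - y) / m

# Small synthetic data set (assumed for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=1000)
theta = np.zeros(3)

g_loop = gradient_loop(X, y, theta)
g_vec = gradient_vectorized(X, y, theta)
print(np.allclose(g_loop, g_vec))
```

Both versions touch every sample once per gradient, so the cost of a single GD step still grows with the size of the data set. The vectorized form only moves the loop into optimized library code.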

Gradient descent is one of the most commonly used optimization methods in machine learning and deep learning. It trains a model by repeatedly stepping against the gradient of a given cost function until it reaches a local minimum; when the cost function is convex, that local minimum is also the global minimum. Gradient descent was proposed by the French mathematician Augustin-Louis Cauchy in 1847. Most machine learning and deep learning algorithms involve some sort of optimization.
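As a minimal sketch of the update rule, the snippet below runs gradient descent on the convex function $f(x) = (x - 3)^2$. The function, the learning rate, and the step count are assumptions chosen for illustration.

```python
def gradient_descent(grad, x0, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient: x <- x - learning_rate * grad(x)."""
    x = x0
    for _ in range(steps):
        x = x - learning_rate * grad(x)
    return x

# f(x) = (x - 3)^2 has gradient 2 * (x - 3) and its minimum at x = 3.
x_min = gradient_descent(lambda x: 2.0 * (x - 3.0), x0=0.0)
print(x_min)
```

With this learning rate, each step shrinks the distance to the minimizer by a constant factor of 0.8. On a non-convex cost, the same loop would only be expected to reach a local minimum.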

The goal of policy gradient approaches is to find a policy in a given class of policies which maximizes the expected return. Given a differentiable model of the policy, we want to apply a gradient-ascent technique to reach a local optimum. We mainly use gradient ascent, because it is theoretically well researched. The main issue is that the policy gradient with respect to the expected return is not available, thus we need to estimate it. As policy gradient algorithms also tend to require on-policy data for the gradient estimate, their biggest weakness is sample efficiency. For this reason, most research is focused on finding algorithms with improved sample efficiency. This paper provides a formal introduction to policy gradient that shows the development of policy gradient approaches, and should enable the reader to follow current research on the topic.
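The idea of estimating the policy gradient from on-policy samples can be sketched on a toy problem. The example below assumes a two-armed bandit with a softmax policy and uses the likelihood ratio (REINFORCE-style) estimator without a baseline. The reward means, noise level, batch size, and step size are hypothetical choices, not taken from the paper.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over action preferences."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_bandit(mean_rewards, steps=1000, batch=20, lr=0.1, seed=0):
    """Gradient ascent on expected reward using sampled gradient estimates."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(len(mean_rewards))          # policy parameters
    for _ in range(steps):
        probs = softmax(theta)
        grad = np.zeros_like(theta)
        for _ in range(batch):                   # fresh on-policy samples each step
            a = rng.choice(len(theta), p=probs)  # sample an action from the policy
            r = mean_rewards[a] + rng.normal(scale=0.1)  # noisy reward
            grad_log = -probs.copy()             # grad of log pi(a) for a softmax policy
            grad_log[a] += 1.0
            grad += r * grad_log                 # likelihood ratio estimate
        theta += lr * grad / batch               # ascent step on the averaged estimate
    return softmax(theta)

probs = reinforce_bandit(np.array([0.2, 1.0]))
print(probs)
```

Every update discards the previous batch and draws new samples from the current policy, which is the sample-efficiency weakness described above.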

Likelihood ratio policy gradient methods have been some of the most successful reinforcement learning algorithms, especially for learning on physical systems. We describe how the likelihood ratio policy gradient can be derived from an importance sampling perspective. This derivation highlights how likelihood ratio methods under-use past experience by (a) using the past experience to estimate {\em only} the gradient of the expected return $U(\theta)$ at the current policy parameterization $\theta$, rather than to obtain a more complete estimate of $U(\theta)$, and (b) using past experience under the current policy {\em only} rather than using all past experience to improve the estimates. We present a new policy search method, which leverages both of these observations as well as generalized baselines---a new technique which generalizes commonly used baseline techniques for policy gradient methods. Our algorithm outperforms standard likelihood ratio policy gradient algorithms on several testbeds.
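For reference, the standard importance sampling route to the likelihood ratio gradient can be written out as follows. The notation is an assumption of this sketch and may differ from the paper's: $\tau$ denotes a trajectory, $P(\tau \mid \theta)$ its probability under the policy with parameters $\theta$, and $R(\tau)$ its return.

```latex
\begin{align*}
U(\theta') &= \mathbb{E}_{\tau \sim P(\cdot \mid \theta')}\big[R(\tau)\big]
            = \mathbb{E}_{\tau \sim P(\cdot \mid \theta)}\left[\frac{P(\tau \mid \theta')}{P(\tau \mid \theta)}\, R(\tau)\right], \\
\nabla_{\theta'} U(\theta')\Big|_{\theta' = \theta}
           &= \mathbb{E}_{\tau \sim P(\cdot \mid \theta)}\left[\frac{\nabla_{\theta} P(\tau \mid \theta)}{P(\tau \mid \theta)}\, R(\tau)\right]
            = \mathbb{E}_{\tau \sim P(\cdot \mid \theta)}\big[\nabla_{\theta} \log P(\tau \mid \theta)\, R(\tau)\big].
\end{align*}
```

The first line is an estimate of $U(\theta')$ for any $\theta'$ built from samples drawn under $\theta$. The second line keeps only its gradient at $\theta' = \theta$, which corresponds to observation (a) above.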