mgf
- North America > Canada > Alberta (0.14)
- North America > Canada > Ontario > Toronto (0.14)
- North America > United States > California > Los Angeles County > Long Beach (0.04)
- (5 more...)
- Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
- Law (0.67)
- Materials > Chemicals > Commodity Chemicals > Petrochemicals (0.67)
- Asia > China > Shanghai > Shanghai (0.04)
- Europe > Netherlands > North Holland > Amsterdam (0.04)
- Europe > Italy > Tuscany > Florence (0.04)
- (2 more...)
Leveraging Continuous Time to Understand Momentum When Training Diagonal Linear Networks
Papazov, Hristo, Pesme, Scott, Flammarion, Nicolas
In this work, we investigate the effect of momentum on the optimisation trajectory of gradient descent. We leverage a continuous-time approach in the analysis of momentum gradient descent with step size $\gamma$ and momentum parameter $\beta$ that allows us to identify an intrinsic quantity $\lambda = \frac{ \gamma }{ (1 - \beta)^2 }$ which uniquely defines the optimisation path and provides a simple acceleration rule. When training a $2$-layer diagonal linear network in an overparametrised regression setting, we characterise the recovered solution through an implicit regularisation problem. We then prove that small values of $\lambda$ help to recover sparse solutions. Finally, we give similar but weaker results for stochastic momentum gradient descent. We provide numerical experiments which support our claims.
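A minimal NumPy sketch of the setting described in the abstract: heavy-ball momentum gradient descent on a $2$-layer diagonal linear network (predictions use the element-wise product $u \odot v$), with the intrinsic quantity $\lambda = \frac{\gamma}{(1-\beta)^2}$ computed from the chosen hyperparameters. The problem sizes, initialisation scale, and hyperparameter values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Overparametrised sparse regression problem (illustrative sizes, not from the paper).
n, d = 20, 50
X = rng.standard_normal((n, d))
w_star = np.zeros(d)
w_star[:3] = [2.0, -1.5, 1.0]           # sparse ground truth
y = X @ w_star

# 2-layer diagonal linear network: predictions use w = u * v (element-wise).
alpha = 0.1                              # initialisation scale (assumed)
u = alpha * np.ones(d)
v = alpha * np.ones(d)

gamma, beta = 1e-3, 0.9                  # step size and momentum parameter (assumed)
lam = gamma / (1.0 - beta) ** 2          # intrinsic quantity lambda = gamma / (1 - beta)^2
print(f"lambda = {lam:.4f}")

def grads(u, v):
    """Gradients of the squared loss 0.5/n * ||X (u*v) - y||^2 w.r.t. u and v."""
    r = X @ (u * v) - y
    g = X.T @ r / n
    return g * v, g * u                  # chain rule through the product u * v

# Heavy-ball (momentum) gradient descent on (u, v).
u_prev, v_prev = u.copy(), v.copy()
for _ in range(50_000):
    gu, gv = grads(u, v)
    u_new = u - gamma * gu + beta * (u - u_prev)
    v_new = v - gamma * gv + beta * (v - v_prev)
    u_prev, v_prev, u, v = u, v, u_new, v_new

w = u * v
print("support of recovered w:", np.nonzero(np.abs(w) > 1e-2)[0])
```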
- Asia > Middle East > Jordan (0.04)
- North America > United States > New York (0.04)
- North America > United States > Georgia > Fulton County > Atlanta (0.04)
- (6 more...)
Concentration of the Langevin Algorithm's Stationary Distribution
Altschuler, Jason M., Talwar, Kunal
A canonical algorithm for log-concave sampling is the Langevin Algorithm, aka the Langevin Diffusion run with some discretization stepsize $\eta > 0$. This discretization leads the Langevin Algorithm to have a stationary distribution $\pi_{\eta}$ which differs from the stationary distribution $\pi$ of the Langevin Diffusion, and it is an important challenge to understand whether the well-known properties of $\pi$ extend to $\pi_{\eta}$. In particular, while concentration properties such as isoperimetry and rapidly decaying tails are classically known for $\pi$, the analogous properties for $\pi_{\eta}$ are open questions with direct algorithmic implications. This note provides a first step in this direction by establishing concentration results for $\pi_{\eta}$ that mirror classical results for $\pi$. Specifically, we show that for any nontrivial stepsize $\eta > 0$, $\pi_{\eta}$ is sub-exponential (respectively, sub-Gaussian) when the potential is convex (respectively, strongly convex). Moreover, the concentration bounds we show are essentially tight. Key to our analysis is the use of a rotation-invariant moment generating function (aka Bessel function) to study the stationary dynamics of the Langevin Algorithm. This technique may be of independent interest because it enables directly analyzing the discrete-time stationary distribution $\pi_{\eta}$ without going through the continuous-time stationary distribution $\pi$ as an intermediary.
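For reference, a minimal sketch of the Langevin Algorithm itself: the iteration $x_{k+1} = x_k - \eta \nabla f(x_k) + \sqrt{2\eta}\,\xi_k$ with $\xi_k \sim \mathcal{N}(0, I)$, run on a strongly convex (quadratic) potential, followed by a crude empirical look at the tails of long-run samples from $\pi_{\eta}$. The potential, step size, and sample counts are illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Strongly convex potential f(x) = 0.5 * ||x||^2, so pi ∝ exp(-f) is standard Gaussian.
def grad_f(x):
    return x

d = 2
eta = 0.1                                # discretisation step size (illustrative value)
x = np.zeros(d)

# Langevin Algorithm: x_{k+1} = x_k - eta * grad f(x_k) + sqrt(2 * eta) * xi_k.
burn_in, n_samples = 10_000, 100_000
samples = np.empty((n_samples, d))
for k in range(burn_in + n_samples):
    x = x - eta * grad_f(x) + np.sqrt(2 * eta) * rng.standard_normal(d)
    if k >= burn_in:
        samples[k - burn_in] = x

# Crude check of the tails of pi_eta: empirical tail probabilities should decay
# roughly like exp(-c * t^2) in the threshold t for this strongly convex potential.
norms = np.linalg.norm(samples, axis=1)
for t in [1.0, 2.0, 3.0, 4.0]:
    print(f"P(||x|| > {t}) ~ {np.mean(norms > t):.4f}")
```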
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- Asia > Middle East > Jordan (0.04)
The Implicit Regularization of Momentum Gradient Descent with Early Stopping
Wang, Li, Zhou, Yingcong, Fu, Zhiguo
The study of the implicit regularization induced by gradient-based optimization is a longstanding pursuit. In the present paper, we characterize the implicit regularization of momentum gradient descent (MGD) with early stopping by comparing it with explicit $\ell_2$-regularization (ridge). In detail, we study MGD in the continuous-time view, the so-called momentum gradient flow (MGF), and show that it stays closer to ridge than gradient descent (GD) does [Ali et al., 2019] for least squares regression. Moreover, we prove that, under the calibration $t=\sqrt{2/\lambda}$, where $t$ is the time parameter in MGF and $\lambda$ is the tuning parameter in ridge regression, the risk of MGF is no more than 1.54 times that of ridge. In particular, the relative Bayes risk of MGF to ridge lies between 1 and 1.035 under optimal tuning. Numerical experiments strongly support our theoretical results.
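An illustrative sketch of this comparison, assuming a simple semi-implicit Euler discretisation of a momentum gradient flow of the form $\ddot{b}(t) + \dot{b}(t) = -\nabla L(b(t))$ (the paper's exact normalisation of MGF may differ): stop the flow at time $t$ and compare the iterate with the ridge estimate calibrated by $\lambda = 2/t^2$. Problem sizes, noise level, and the stopping time are made-up values, and the paper's result concerns risk rather than coincidence of the two estimates.

```python
import numpy as np

rng = np.random.default_rng(2)

# Least squares problem (illustrative sizes).
n, d = 100, 20
X = rng.standard_normal((n, d))
beta_true = rng.standard_normal(d)
y = X @ beta_true + 0.5 * rng.standard_normal(n)

def grad(b):
    return X.T @ (X @ b - y) / n

# Momentum gradient flow  b''(t) + b'(t) = -grad L(b(t)),  integrated with a small
# semi-implicit Euler step; started at zero, stopped early at time t_stop.
h = 1e-3
b, vel = np.zeros(d), np.zeros(d)
t_stop = 5.0                             # early-stopping time t (assumed)
t = 0.0
while t < t_stop:
    vel += h * (-vel - grad(b))
    b += h * vel
    t += h

# Ridge estimate under the calibration t = sqrt(2 / lambda), i.e. lambda = 2 / t^2.
lam = 2.0 / t_stop ** 2
b_ridge = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

print("||b_mgf - b_ridge|| / ||b_ridge|| =",
      np.linalg.norm(b - b_ridge) / np.linalg.norm(b_ridge))
```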
Moment Generating Function Tutorial
We generally use moments in statistics, machine learning, mathematics, and other fields to describe the characteristics of a distribution. If the variable of interest is X, its moments are the expected values of its powers, E[X^n]. The most familiar are the first moment (the mean) and the second central moment (the variance). The standardised third moment is the skewness and the standardised fourth moment is the kurtosis: skewness measures the asymmetry of the distribution, while kurtosis measures how heavy its tails are.
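A short sketch of these four summaries on a skewed sample, using NumPy and SciPy (scipy.stats.kurtosis returns excess kurtosis by default); the exponential sample and its size are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.exponential(scale=2.0, size=100_000)    # a skewed sample

print("mean     :", np.mean(x))                 # first moment
print("variance :", np.var(x))                  # second central moment
print("skewness :", stats.skew(x))              # third standardised moment
print("kurtosis :", stats.kurtosis(x))          # fourth standardised moment (excess)
```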
Concentration Inequalities for Statistical Inference
This paper reviews concentration inequalities, which are widely employed in mathematical statistics across a wide range of settings: from distribution-free to distribution-dependent, from sub-Gaussian to sub-exponential, sub-Gamma, and sub-Weibull random variables, and from concentration of the mean to concentration of the maximum. The review collects results in these settings together with some new ones. Given the increasing popularity of high-dimensional data and inference, results for high-dimensional linear and Poisson regressions are also provided. We aim to illustrate the concentration inequalities with known constants and to improve existing bounds with sharper constants.
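As a reminder of how the moment generating function drives such bounds, here is the standard Chernoff argument for a sub-Gaussian tail (a textbook derivation, not a result specific to this paper):

```latex
% Chernoff's method: concentration from the moment generating function.
% For any $t > 0$, Markov's inequality applied to $e^{tX}$ gives
\[
  \mathbb{P}(X \ge a)
  = \mathbb{P}\!\left(e^{tX} \ge e^{ta}\right)
  \le e^{-ta}\,\mathbb{E}\!\left[e^{tX}\right]
  = e^{-ta}\, M_X(t).
\]
% If $X$ is sub-Gaussian with mean zero and variance proxy $\sigma^2$,
% i.e. $M_X(t) \le e^{\sigma^2 t^2 / 2}$ for all $t$, optimising over $t$
% (take $t = a/\sigma^2$) yields the Gaussian-type tail bound
\[
  \mathbb{P}(X \ge a) \le \exp\!\left(-\frac{a^2}{2\sigma^2}\right),
  \qquad a > 0 .
\]
```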
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- (5 more...)
Moment Generating Function for Probability Distribution with Python
This tutorial's code is available on Github and its full implementation as well on Google Colab. Check out our editorial suggestions on the best data science books. We generally use moments in statistics, machine learning, mathematics, and other fields to describe the characteristics of a distribution. Let's say the variable of our interest is X then, moments are X's expected values. Now we are very familiar with the first moment(mean) and the second moment(variance).