
Improved Central Limit Theorem and Bootstrap Approximations for Linear Stochastic Approximation

Butyrin, Bogdan, Moulines, Eric, Naumov, Alexey, Samsonov, Sergey, Shao, Qi-Man, Zhang, Zhuo-Song

arXiv.org Machine Learning

In this paper, we refine the Berry-Esseen bounds for the multivariate normal approximation of Polyak-Ruppert averaged iterates arising from the linear stochastic approximation (LSA) algorithm with decreasing step size. We consider the normal approximation by the Gaussian distribution with the covariance matrix predicted by the Polyak-Juditsky central limit theorem and establish rates of order up to $n^{-1/3}$ in the convex distance, where $n$ is the number of samples used in the algorithm. We also prove the non-asymptotic validity of the multiplier bootstrap procedure for approximating the distribution of the rescaled error of the averaged LSA estimator. We establish approximation rates of order up to $1/\sqrt{n}$ for the latter distribution, which significantly improves upon the previous results obtained by Samsonov et al. (2024).
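
As a rough illustration of the objects involved (not the authors' exact algorithm), the sketch below runs LSA with a decreasing step size, forms the Polyak-Ruppert average, and produces one multiplier-bootstrap replication by reweighting the stochastic increments; the linear system, noise model, step-size schedule, and weight distribution are all assumptions made for the example.

```python
# Minimal sketch: LSA with decreasing steps, Polyak-Ruppert averaging, and one
# multiplier-bootstrap replication. All problem constants are illustrative.
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 10_000
A_bar = np.eye(d) + 0.1 * rng.standard_normal((d, d))
A_bar = A_bar @ A_bar.T + np.eye(d)               # positive definite "mean" matrix
theta_star = rng.standard_normal(d)
b_bar = A_bar @ theta_star                        # the target solves A_bar theta = b_bar

# One stream of noisy observations (A_k, b_k), shared by the bootstrap replications.
A_obs = A_bar + 0.1 * rng.standard_normal((n, d, d))
b_obs = b_bar + 0.1 * rng.standard_normal((n, d))

def lsa_pr(weights):
    """LSA with decreasing steps and Polyak-Ruppert averaging; `weights`
    multiply the stochastic increments (all ones gives the plain estimator)."""
    theta, avg = np.zeros(d), np.zeros(d)
    for k in range(n):
        gamma = 0.5 / (k + 1) ** 0.7              # decreasing step size gamma_k ~ k^{-0.7}
        theta = theta - gamma * weights[k] * (A_obs[k] @ theta - b_obs[k])
        avg += (theta - avg) / (k + 1)            # running Polyak-Ruppert average
    return avg

estimate = lsa_pr(np.ones(n))
boot = lsa_pr(rng.exponential(size=n))            # one multiplier-bootstrap draw
print(np.linalg.norm(estimate - theta_star), np.linalg.norm(boot - estimate))
```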


Gaussian Approximation for Two-Timescale Linear Stochastic Approximation

Butyrin, Bogdan, Rubtsov, Artemy, Naumov, Alexey, Ulyanov, Vladimir, Samsonov, Sergey

arXiv.org Machine Learning

In this paper, we establish non-asymptotic bounds on the accuracy of normal approximation for linear two-timescale stochastic approximation (TTSA) algorithms driven by martingale-difference or Markov noise. Focusing on both the last-iterate and Polyak-Ruppert averaging regimes, we derive bounds for normal approximation in terms of the convex distance between probability distributions. Our analysis reveals a non-trivial interaction between the fast and slow timescales: the normal approximation rate for the last iterate improves as the timescale separation increases, while it deteriorates in the Polyak-Ruppert averaged setting. We also provide high-order moment bounds for the error of the linear TTSA algorithm, which may be of independent interest.
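
For intuition, here is a minimal sketch of a linear two-timescale recursion with a slow iterate $\theta_k$ and a fast iterate $w_k$ updated with step sizes $\alpha_k \ll \beta_k$; the coefficients, noise level, and step-size schedules are illustrative assumptions rather than the paper's setting.

```python
# Schematic linear two-timescale SA: slow iterate theta, fast iterate w.
import numpy as np

rng = np.random.default_rng(1)
n = 50_000
# 1-d instance of the coupled linear system
A11, A12, b1 = 1.0, 0.5, 1.0
A21, A22, b2 = 0.3, 2.0, 0.5

theta, w, theta_avg = 0.0, 0.0, 0.0
for k in range(1, n + 1):
    alpha = 0.5 / k ** 0.9                       # slow timescale
    beta = 0.5 / k ** 0.6                        # fast timescale (alpha/beta -> 0)
    xi, eta = 0.1 * rng.standard_normal(2)       # martingale-difference noise
    theta -= alpha * (A11 * theta + A12 * w - b1 + xi)
    w -= beta * (A21 * theta + A22 * w - b2 + eta)
    theta_avg += (theta - theta_avg) / k         # Polyak-Ruppert average of the slow iterate

# The deterministic limit solves A11*theta + A12*w = b1 and A21*theta + A22*w = b2.
sol = np.linalg.solve([[A11, A12], [A21, A22]], [b1, b2])
print(theta_avg, sol[0])
```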


High-Order Error Bounds for Markovian LSA with Richardson-Romberg Extrapolation

Levin, Ilya, Naumov, Alexey, Samsonov, Sergey

arXiv.org Machine Learning

In this paper, we study the bias and high-order error bounds of the Linear Stochastic Approximation (LSA) algorithm with Polyak-Ruppert (PR) averaging under Markovian noise. We focus on the version of the algorithm with constant step size $\alpha$ and propose a novel decomposition of the bias via a linearization technique. We analyze the structure of the bias and show that the leading-order term is linear in $\alpha$ and cannot be eliminated by PR averaging. To address this, we apply the Richardson-Romberg (RR) extrapolation procedure, which effectively cancels the leading bias term. We derive high-order moment bounds for the RR iterates and show that the leading error term aligns with the asymptotically optimal covariance matrix of the vanilla averaged LSA iterates.
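
As a schematic illustration of the extrapolation step (notation assumed): if the averaged iterate computed with constant step size $\alpha$ has a bias expansion whose leading term is linear in $\alpha$,
$$\mathbb{E}\big[\bar\theta_n^{(\alpha)}\big] = \theta^\star + \alpha\, b_1 + O(\alpha^2), \qquad \mathbb{E}\big[\bar\theta_n^{(2\alpha)}\big] = \theta^\star + 2\alpha\, b_1 + O(\alpha^2),$$
then the Richardson-Romberg combination $\theta_n^{\mathrm{RR}} = 2\,\bar\theta_n^{(\alpha)} - \bar\theta_n^{(2\alpha)}$ satisfies $\mathbb{E}[\theta_n^{\mathrm{RR}}] = \theta^\star + O(\alpha^2)$, so the leading bias term cancels.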


A note on concentration inequalities for the overlapped batch mean variance estimators for Markov chains

Moulines, Eric, Naumov, Alexey, Samsonov, Sergey

arXiv.org Machine Learning

In this paper, we study the concentration properties of quadratic forms associated with Markov chains using the martingale decomposition method introduced by Atchadé and Cattaneo (2014). In particular, we derive concentration inequalities for the overlapped batch mean (OBM) estimators of the asymptotic variance for uniformly geometrically ergodic Markov chains. Our main result provides explicit control of the $p$-th moment of the difference between the OBM estimator and the asymptotic variance of the Markov chain, with explicit dependence upon $p$ and the mixing time of the underlying Markov chain.
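
To make the estimator concrete, a minimal sketch of the overlapped batch mean estimator for a scalar chain is given below; the normalization is one common convention, and the AR(1) test chain is an illustrative assumption.

```python
# Overlapped batch means (OBM) estimate of the asymptotic variance of a scalar
# Markov chain functional; normalization follows one common convention.
import numpy as np

def obm_variance(y, b):
    """OBM estimate of the asymptotic variance from samples y with batch size b."""
    y = np.asarray(y, dtype=float)
    n = y.size
    kernel = np.ones(b) / b
    batch_means = np.convolve(y, kernel, mode="valid")   # all n - b + 1 overlapping batch means
    centered = batch_means - y.mean()
    return n * b * np.sum(centered ** 2) / ((n - b) * (n - b + 1))

# Example: AR(1) chain x_{k+1} = rho * x_k + eps_k with standard normal noise;
# the asymptotic variance of the sample mean is 1 / (1 - rho)^2.
rng = np.random.default_rng(2)
rho, n = 0.5, 100_000
eps = rng.standard_normal(n)
x = np.zeros(n)
for k in range(1, n):
    x[k] = rho * x[k - 1] + eps[k]
print(obm_variance(x, b=int(n ** 0.5)))   # should be close to 1 / (1 - rho)^2 = 4.0
```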


Nonasymptotic Analysis of Stochastic Gradient Descent with the Richardson-Romberg Extrapolation

Sheshukova, Marina, Belomestny, Denis, Durmus, Alain, Moulines, Eric, Naumov, Alexey, Samsonov, Sergey

arXiv.org Machine Learning

We address the problem of solving strongly convex and smooth minimization problems using the stochastic gradient descent (SGD) algorithm with a constant step size. Previous works suggested combining the Polyak-Ruppert averaging procedure with the Richardson-Romberg extrapolation technique to reduce the asymptotic bias of SGD at the expense of a mild increase in the variance. We significantly extend previous results by providing an expansion of the mean-squared error of the resulting estimator with respect to the number of iterations $n$. More precisely, we show that the mean-squared error can be decomposed into the sum of two terms: a leading one of order $\mathcal{O}(n^{-1/2})$ with explicit dependence on a minimax-optimal asymptotic covariance matrix, and a second-order term of order $\mathcal{O}(n^{-3/4})$, where the power $3/4$ cannot be improved in general. We also extend this result to the $p$-th moment bound while keeping the optimal scaling of the remainders with respect to $n$. Our analysis relies on the properties of the SGD iterates viewed as a time-homogeneous Markov chain. In particular, we establish that this chain is geometrically ergodic with respect to a suitably defined weighted Wasserstein semimetric.
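
A minimal sketch of the procedure on a one-dimensional strongly convex objective is given below; the objective $f(x) = x^2/2 + \log(1+e^{2x})$, the noise level, and the step sizes are assumptions chosen only so that the $O(\alpha)$ bias of the averaged constant-step iterate is visible.

```python
# Sketch of constant-step SGD with Polyak-Ruppert averaging and the
# Richardson-Romberg two-step-size combination; all constants are illustrative.
import numpy as np

rng = np.random.default_rng(4)
n, alpha, noise_std = 200_000, 0.2, 1.0

def grad(x):                                   # f'(x) = x + 2*sigmoid(2x)
    return x + 2.0 / (1.0 + np.exp(-2.0 * x))

def averaged_sgd(step, noise):
    x, avg = 0.0, 0.0
    for k in range(n):
        x -= step * (grad(x) + noise[k])       # unbiased stochastic gradient
        avg += (x - avg) / (k + 1)             # Polyak-Ruppert average
    return avg

noise = noise_std * rng.standard_normal(n)     # reuse the same noise for both step sizes
avg_a, avg_2a = averaged_sgd(alpha, noise), averaged_sgd(2 * alpha, noise)
x_rr = 2 * avg_a - avg_2a                      # Richardson-Romberg extrapolation

# Minimizer of f, found by a few Newton steps, for reference.
x_star = 0.0
for _ in range(50):
    x_star -= grad(x_star) / (1.0 + 4 * np.exp(-2 * x_star) / (1 + np.exp(-2 * x_star)) ** 2)
print(abs(avg_a - x_star), abs(x_rr - x_star))  # RR typically removes most of the O(alpha) bias
```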


Rosenthal-type inequalities for linear statistics of Markov chains

Durmus, Alain, Moulines, Eric, Naumov, Alexey, Samsonov, Sergey, Sheshukova, Marina

arXiv.org Machine Learning

Probability and moment inequalities for sums of random variables are of paramount importance in the complexity analysis of numerous stochastic approximation algorithms and in the finite-time analysis of Monte Carlo estimators; see [20], [10], and references therein. The main focus in this area has been on concentration inequalities for sums of independent random variables or martingale difference sequences; see e.g. [4, 36]. However, the study of concentration inequalities for additive functionals of Markov chains is still relatively underdeveloped. For the technically simple case of uniformly ergodic Markov chains, there is extensive work on Hoeffding- and Bernstein-like inequalities, as found in [23, 34, 20, 38]. Nevertheless, the application of these results may be difficult due to a lack of quantitative data or the substitution of the asymptotic variance of the chain by surrogates; see Section 2.1 for relevant definitions. The present work aims to fill this gap by extending Rosenthal- and Bernstein-type inequalities to Markov chains that converge geometrically fast to a unique invariant distribution, with an explicit emphasis on the mixing time of the underlying Markov chain. An important tool for establishing deviation bounds for sums of random variables is based on moment inequalities.
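
For orientation, the classical Rosenthal inequality for independent, zero-mean random variables $X_1,\dots,X_n$ with finite $p$-th moments, $p \ge 2$, reads
$$\mathbb{E}\Big|\sum_{i=1}^n X_i\Big|^p \;\le\; C_p\Big(\sum_{i=1}^n \mathbb{E}|X_i|^p + \Big(\sum_{i=1}^n \mathbb{E} X_i^2\Big)^{p/2}\Big),$$
and the goal here is an analogue of this bound for additive functionals of geometrically ergodic Markov chains, with explicit dependence of the constants on $p$ and the mixing time.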


Finite-time High-probability Bounds for Polyak-Ruppert Averaged Iterates of Linear Stochastic Approximation

Durmus, Alain, Moulines, Eric, Naumov, Alexey, Samsonov, Sergey

arXiv.org Artificial Intelligence

The LSA algorithm is central in statistics, machine learning, and linear system identification; see e.g. Eweda and Macchi (1983); Widrow and Stearns (1985); Benveniste et al. (2012); Kushner and Yin (2003) and references therein. More recently, it has sparked renewed interest in machine learning, especially for high-dimensional least squares and reinforcement learning (RL) problems; see Bertsekas and Tsitsiklis (2003); Bottou et al. (2018); Sutton (1988); Bertsekas (2019); Watkins and Dayan (1992). The LSA and LSA-PR recursions (1) have been the subject of a wealth of work, and it is difficult to adequately acknowledge all contributions. Polyak and Juditsky (1992); Kushner and Yin (2003); Borkar (2008); Benveniste et al. (2012) provided asymptotic convergence guarantees (almost sure convergence, central limit theorem) under both i.i.d. and Markovian noise settings. In particular, it has been established that LSA-PR can accelerate LSA and satisfies a central limit theorem with an asymptotically minimax-optimal covariance matrix. Although asymptotic convergence analysis is of theoretical interest, the current trend is to obtain non-asymptotic guarantees that take into account both the limited sample size and the dimension of the parameter space. For these reasons, non-asymptotic analysis of both i.i.d. and Markovian SA procedures has recently attracted much attention.
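
Since recursion (1) is not reproduced in this excerpt, note that in the standard form used in this line of work (notation assumed) the LSA and LSA-PR iterates read
$$\theta_k = \theta_{k-1} - \gamma_k\big(\mathbf{A}(Z_k)\,\theta_{k-1} - \mathbf{b}(Z_k)\big), \qquad \bar\theta_n = \frac{1}{n}\sum_{k=1}^{n}\theta_k,$$
where $(Z_k)_{k\ge 1}$ is the observation sequence (i.i.d. or Markovian), $(\gamma_k)_{k\ge 1}$ are the step sizes, and the target $\theta^\star$ solves $\bar{\mathbf{A}}\,\theta^\star = \bar{\mathbf{b}}$, with $\bar{\mathbf{A}}$ and $\bar{\mathbf{b}}$ the expectations of $\mathbf{A}(Z)$ and $\mathbf{b}(Z)$ under the noise distribution.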


On learning parametric distributions from quantized samples

Sarbu, Septimia, Zaidi, Abdellatif

arXiv.org Artificial Intelligence

We consider the problem of learning parametric distributions from their quantized samples in a network. Specifically, $n$ agents or sensors observe independent samples of an unknown parametric distribution, and each of them uses $k$ bits to describe its observed sample to a central processor whose goal is to estimate the unknown distribution. First, we establish a generalization of the well-known van Trees inequality to general $L_p$-norms, with $p > 1$, in terms of generalized Fisher information. Then, we develop minimax lower bounds on the estimation error for two losses: general $L_p$-norms and the related Wasserstein loss from optimal transport.
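
For reference, the classical one-dimensional, quadratic-loss van Trees inequality that is being generalized states that, for a sufficiently regular prior density $\lambda$ on the parameter and any estimator $\hat\theta$,
$$\mathbb{E}\big[(\hat\theta(X) - \theta)^2\big] \;\ge\; \frac{1}{\mathbb{E}_\lambda[\mathcal{I}(\theta)] + \mathcal{I}(\lambda)},$$
where the expectation is taken over both $\theta \sim \lambda$ and the sample $X$, $\mathcal{I}(\theta)$ is the Fisher information of the model, and $\mathcal{I}(\lambda) = \int \lambda'(\theta)^2/\lambda(\theta)\,\mathrm{d}\theta$ is the Fisher information of the prior.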


8 Machine Learning Algorithms in Python - You Must Learn - DataFlair

#artificialintelligence

Previously, we discussed the techniques of machine learning with Python. Going deeper, today we will learn and implement 8 top Machine Learning Algorithms in Python. Let's begin the journey of Machine Learning Algorithms in Python programming. Linear regression is one of the supervised machine learning algorithms in Python; it observes continuous features and predicts a continuous outcome. Depending on whether it runs on a single variable or on many features, we can call it simple linear regression or multiple linear regression.
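
In the spirit of the post, a tiny simple linear regression example with scikit-learn (the data points are made up):

```python
# Simple linear regression on a single continuous feature; data are made up.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])   # one feature per sample
y = np.array([2.1, 4.0, 6.2, 8.1, 9.9])             # roughly y = 2x

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)                 # slope near 2, intercept near 0
print(model.predict([[6.0]]))                        # prediction for a new input
```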


The statistical Minkowski distances: Closed-form formula for Gaussian Mixture Models

Nielsen, Frank

arXiv.org Machine Learning

The traditional Minkowski distances are induced by the corresponding Minkowski norms in real-valued vector spaces. In this work, we propose novel statistical symmetric distances based on Minkowski's inequality for probability densities belonging to Lebesgue spaces. These statistical Minkowski distances admit closed-form formulas for Gaussian mixture models when parameterized by integer exponents. This result extends to arbitrary mixtures of exponential families whose natural parameter spaces are cones: this includes binomial, multinomial, zero-centered Laplacian, Gaussian, and Wishart mixtures, among others. We also derive a Minkowski diversity index of a normalized weighted set of probability distributions from Minkowski's inequality.
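
For reference, the underlying inequality is the classical Minkowski inequality in the Lebesgue space $L_p(\mu)$, $p \ge 1$:
$$\Big(\int |f+g|^p\,\mathrm{d}\mu\Big)^{1/p} \;\le\; \Big(\int |f|^p\,\mathrm{d}\mu\Big)^{1/p} + \Big(\int |g|^p\,\mathrm{d}\mu\Big)^{1/p};$$
the symmetric distances between densities proposed here are built from this inequality (the precise construction, and the closed-form expressions for mixtures, are given in the paper).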