Gradient Descent
Theoretical Convergence Guarantees for Variational Autoencoders
Surendran, Sobihan, Godichon-Baggioni, Antoine, Le Corff, Sylvain
Variational Autoencoders (VAE) are popular generative models used to sample from complex data distributions. Despite their empirical success in various machine learning tasks, significant gaps remain in understanding their theoretical properties, particularly regarding convergence guarantees. This paper aims to bridge that gap by providing non-asymptotic convergence guarantees for VAE trained using both Stochastic Gradient Descent and Adam algorithms. We derive a convergence rate of $\mathcal{O}(\log n / \sqrt{n})$, where $n$ is the number of iterations of the optimization algorithm, with explicit dependencies on the batch size, the number of variational samples, and other key hyperparameters. Our theoretical analysis applies to both Linear VAE and Deep Gaussian VAE, as well as several VAE variants, including $\beta$-VAE and IWAE. Additionally, we empirically illustrate the impact of hyperparameters on convergence, offering new insights into the theoretical understanding of VAE training.
Global Optimization of Gaussian Process Acquisition Functions Using a Piecewise-Linear Kernel Approximation
Xie, Yilin, Zhang, Shiqiang, Paulson, Joel, Tsay, Calvin
Bayesian optimization relies on iteratively constructing and optimizing an acquisition function. The latter turns out to be a challenging, non-convex optimization problem itself. Despite the relative importance of this step, most algorithms employ sampling- or gradient-based methods, which do not provably converge to global optima. This work investigates mixed-integer programming (MIP) as a paradigm for \textit{global} acquisition function optimization. Specifically, our Piecewise-linear Kernel Mixed Integer Quadratic Programming (PK-MIQP) formulation introduces a piecewise-linear approximation for Gaussian process kernels and admits a corresponding MIQP representation for acquisition functions. We analyze the theoretical regret bounds of the proposed approximation, and empirically demonstrate the framework on synthetic functions, constrained benchmarks, and a hyperparameter tuning task.
Error estimates between SGD with momentum and underdamped Langevin diffusion
Guillin, Arnaud, Wang, Yu, Xu, Lihu, Yang, Haoran
Stochastic gradient descent with momentum is a popular variant of stochastic gradient descent, which has recently been reported to have a close relationship with the underdamped Langevin diffusion. In this paper, we establish a quantitative error estimate between them in the 1-Wasserstein and total variation distances.
Error Feedback under $(L_0,L_1)$-Smoothness: Normalization and Momentum
Khirirat, Sarit, Sadiev, Abdurakhmon, Riabinin, Artem, Gorbunov, Eduard, Richtárik, Peter
We provide the first proof of convergence for normalized error feedback algorithms across a wide range of machine learning problems. Despite their popularity and efficiency in training deep neural networks, traditional analyses of error feedback algorithms rely on the smoothness assumption that does not capture the properties of objective functions in these problems. Rather, these problems have recently been shown to satisfy generalized smoothness assumptions, and the theoretical understanding of error feedback algorithms under these assumptions remains largely unexplored. Moreover, to the best of our knowledge, all existing analyses under generalized smoothness either i) focus on single-node settings or ii) make unrealistically strong assumptions for distributed settings, such as requiring data heterogeneity, and almost surely bounded stochastic gradient noise variance. In this paper, we propose distributed error feedback algorithms that utilize normalization to achieve the $O(1/\sqrt{K})$ convergence rate for nonconvex problems under generalized smoothness. Our analyses apply to distributed settings without data heterogeneity conditions, and enable stepsize tuning that is independent of problem parameters. Additionally, we provide strong convergence guarantees of normalized error feedback algorithms for stochastic settings. Finally, we show that due to their larger allowable stepsizes, our new normalized error feedback algorithms outperform their non-normalized counterparts on various tasks, including the minimization of polynomial functions, logistic regression, and ResNet-20 training.
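To make the idea of combining error feedback with normalization concrete, here is a minimal sketch. The abstract does not state the update rule, so everything below is an assumption for illustration: an EF21-style recursion in which each worker sends a Top-k compressed correction to its running gradient estimate, and the server moves along the normalized average estimate. The function names, the quadratic local objectives, and the parameter values are hypothetical and are not taken from the paper.

```python
import numpy as np

def top_k(v, k):
    """Contractive compressor: keep the k largest-magnitude entries, zero the rest."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def normalized_error_feedback(targets, x0, k=1, stepsize=0.01, steps=2000):
    """Illustrative EF21-style error feedback with a normalized server step.

    Worker i holds f_i(x) = 0.5 * ||x - targets[i]||^2, so its local gradient
    is x - targets[i]. The local objectives differ across workers.
    """
    x = np.array(x0, dtype=float)
    # Each worker's gradient estimate starts at its exact local gradient.
    g_local = np.array([x - a for a in targets])
    for _ in range(steps):
        g = g_local.mean(axis=0)           # server averages the estimates
        norm = np.linalg.norm(g)
        if norm > 0.0:
            x = x - stepsize * g / norm    # normalized step of fixed length
        for i, a in enumerate(targets):
            # Worker sends only a compressed correction to its estimate.
            g_local[i] += top_k((x - a) - g_local[i], k)
    return x

targets = np.array([[1.0, 2.0, 3.0], [3.0, 0.0, -1.0], [-1.0, 1.0, 1.0]])
x_start = np.array([10.0, -10.0, 10.0])
x_end = normalized_error_feedback(targets, x_start)
x_star = targets.mean(axis=0)  # minimizer of the average objective
```

With a fixed normalized stepsize the iterate can only settle into a small neighborhood of `x_star`, which is why a decaying or horizon-dependent stepsize is the usual choice in analyses of normalized methods.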
Large Deviations and Improved Mean-squared Error Rates of Nonlinear SGD: Heavy-tailed Noise and Power of Symmetry
Armacki, Aleksandar, Yu, Shuhua, Bajovic, Dragana, Jakovetic, Dusan, Kar, Soummya
We study large deviations and mean-squared error (MSE) guarantees of a general framework of nonlinear stochastic gradient methods in the online setting, in the presence of heavy-tailed noise. Unlike existing works that rely on the closed form of a nonlinearity (typically clipping), our framework treats the nonlinearity in a black-box manner, allowing us to provide unified guarantees for a broad class of bounded nonlinearities, including many popular ones, like sign, quantization, normalization, as well as component-wise and joint clipping. We provide several strong results for a broad range of step-sizes in the presence of heavy-tailed noise with symmetric probability density function, positive in a neighbourhood of zero and potentially unbounded moments. In particular, for non-convex costs we provide a large deviation upper bound for the minimum norm-squared of gradients, showing an asymptotic tail decay on an exponential scale, at a rate $\sqrt{t} / \log(t)$. We establish the accompanying rate function, showing an explicit dependence on the choice of step-size, nonlinearity, noise and problem parameters. Next, for non-convex costs and the minimum norm-squared of gradients, we derive the optimal MSE rate $\widetilde{\mathcal{O}}(t^{-1/2})$. Moreover, for strongly convex costs and the last iterate, we provide an MSE rate that can be made arbitrarily close to the optimal rate $\mathcal{O}(t^{-1})$, improving on the state-of-the-art results in the presence of heavy-tailed noise. Finally, we establish almost sure convergence of the minimum norm-squared of gradients, providing an explicit rate, which can be made arbitrarily close to $o(t^{-1/4})$.
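The black-box view of the nonlinearity can be illustrated with a short sketch: one update loop that accepts any bounded map, with sign, component-wise clipping, joint clipping, and normalization as interchangeable choices. This is an assumed form of the iteration (apply the nonlinearity to the noisy gradient, then take a decaying step); the step-size schedule `a / (t + 1) ** delta`, the Cauchy noise, and the quadratic test cost are illustrative choices and are not taken from the paper.

```python
import numpy as np

def sign(g):
    return np.sign(g)

def componentwise_clip(g, m=1.0):
    return np.clip(g, -m, m)

def joint_clip(g, m=1.0):
    norm = np.linalg.norm(g)
    return g if norm <= m else g * (m / norm)

def normalize(g):
    norm = np.linalg.norm(g)
    return g if norm == 0.0 else g / norm

def nonlinear_sgd(grad, x0, nonlinearity, steps=2000, a=1.0, delta=0.75, seed=0):
    """Run x <- x - step_t * nonlinearity(grad(x) + noise) with heavy-tailed noise."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for t in range(1, steps + 1):
        # Cauchy noise: symmetric density, positive near zero, no finite mean.
        noise = rng.standard_cauchy(size=x.shape)
        step = a / (t + 1) ** delta
        x = x - step * nonlinearity(grad(x) + noise)
    return x

# Strongly convex example: f(x) = 0.5 * ||x||^2, so grad f(x) = x.
x_final = nonlinear_sgd(lambda x: x, x0=[5.0, -5.0], nonlinearity=joint_clip)
```

Swapping `joint_clip` for `sign`, `componentwise_clip`, or `normalize` changes only the argument, which is the practical meaning of treating the nonlinearity as a black box.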
Implicit Regularization for Tubal Tensor Factorizations via Gradient Descent
Karnik, Santhosh, Veselovska, Anna, Iwen, Mark, Krahmer, Felix
We provide a rigorous analysis of implicit regularization in an overparametrized tensor factorization problem beyond the lazy training regime. For matrix factorization problems, this phenomenon has been studied in a number of works. A particular challenge has been to design universal initialization strategies which provably lead to implicit regularization in gradient-descent methods. At the same time, it has been argued by Cohen et al. (2016) that more general classes of neural networks can be captured by considering tensor factorizations. However, in the tensor case, implicit regularization has only been rigorously established for gradient flow or in the lazy training regime. In this paper, we prove the first tensor result of its kind for gradient descent rather than gradient flow. We focus on the tubal tensor product and the associated notion of low tubal rank, encouraged by the relevance of this model for image data. We establish that gradient descent in an overparametrized tensor factorization model with a small random initialization exhibits an implicit bias towards solutions of low tubal rank. Our theoretical findings are illustrated in an extensive set of numerical simulations showcasing the dynamics predicted by our theory as well as the crucial role of using a small random initialization.
Exact full-RSB SAT/UNSAT transition in infinitely wide two-layer neural networks
Annesi, Brandon L., Malatesta, Enrico M., Zamponi, Francesco
We analyze the problem of storing random pattern-label associations using two classes of continuous non-convex weight models, namely the perceptron with negative margin and an infinite-width two-layer neural network with non-overlapping receptive fields and generic activation function. Using a full-RSB ansatz we compute the exact value of the SAT/UNSAT transition. Furthermore, in the case of the negative perceptron we show that the overlap distribution of typical states displays an overlap gap (a disconnected support) in certain regions of the phase diagram defined by the value of the margin and the density of patterns to be stored. This implies that some recent theorems that ensure convergence of Approximate Message Passing (AMP) based algorithms to capacity are not applicable. Finally, we show that Gradient Descent is not able to reach the maximal capacity, irrespective of the presence of an overlap gap for typical states. This finding, similar to what occurs in binary weight models, suggests that gradient-based algorithms are biased towards highly atypical states, whose inaccessibility determines the algorithmic threshold.
Vulnerabilities in Machine Learning-Based Voice Disorder Detection Systems
Perelli, Gianpaolo, Panzino, Andrea, Casula, Roberto, Micheletto, Marco, Orrù, Giulia, Marcialis, Gian Luca
The impact of voice disorders is becoming more widely acknowledged as a public health issue. Several machine learning-based classifiers with the potential to identify disorders have been used in recent studies to differentiate between normal and pathological voices and sounds. In this paper, we focus on analyzing the vulnerabilities of these systems by exploring the possibility of attacks that can reverse classification and compromise their reliability. Given the critical nature of personal health information, understanding which types of attacks are effective is a necessary first step toward improving the security of such systems. Starting from the original audio recordings, we implement various attack methods, including adversarial, evasion, and pitching techniques, and evaluate how state-of-the-art disorder detection models respond to them. Our findings identify the most effective attack strategies, underscoring the need to address these vulnerabilities in machine-learning systems used in the healthcare domain.
A practical, fast method for solving sum-of-squares problems for very large polynomials
Keren, Daniel, Osadchy, Margarita, Poranne, Roi
Sum of squares (SOS) optimization is a powerful technique for solving problems where the positivity of a polynomial must be enforced. The common approach to solve an SOS problem is by relaxation to a Semidefinite Program (SDP). The main advantage of this transformation is that SDP is a convex problem for which efficient solvers are readily available. However, while considerable progress has been made in recent years, the standard approaches for solving SDPs are still known to scale poorly. Our goal is to devise an approach that can handle larger, more complex problems than is currently possible. The challenge indeed lies in how SDPs are commonly solved. State-of-the-art approaches rely on the interior point method, which requires the factorization of large matrices. We instead propose an approach inspired by polynomial neural networks, which exhibit excellent performance when optimized using techniques from the deep learning toolbox. In a somewhat counter-intuitive manner, we replace the convex SDP formulation with a non-convex, unconstrained, and \emph{over-parameterized} formulation, and solve it using a first-order optimization method. It turns out that this approach can handle very large problems, with polynomials having over four million coefficients, well beyond the range of current SDP-based approaches. Furthermore, we highlight theoretical and practical results supporting the experimental success of our approach in avoiding spurious local minima, which makes it amenable to simple and fast solutions based on gradient descent. In all the experiments, our approach always converged to a correct global minimum, on general (non-sparse) polynomials, with running time only slightly higher than linear in the number of polynomial coefficients, compared to higher than quadratic in the number of coefficients for SDP-based methods.
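The replacement of a convex SDP by an unconstrained non-convex problem can be sketched on a toy univariate case. In the SDP view, a polynomial $p$ is a sum of squares when $p(x) = z(x)^\top Q z(x)$ for some positive semidefinite $Q$, with $z(x) = [1, x, \dots, x^d]$. The sketch below assumes the factorization $Q = BB^\top$ with a free matrix $B$ and fits the coefficients of $p$ by plain gradient descent. This is a simplified illustration of the general idea, not the authors' formulation; the function names, rank, learning rate, and target polynomial are hypothetical choices.

```python
import numpy as np

def poly_from_factor(B):
    """Coefficients (ascending powers) of z(x)^T (B B^T) z(x), z(x) = [1, x, ..., x^d]."""
    Q = B @ B.T
    d = Q.shape[0] - 1
    coeffs = np.zeros(2 * d + 1)
    for i in range(d + 1):
        for j in range(d + 1):
            coeffs[i + j] += Q[i, j]
    return coeffs

def fit_sos(target, rank=4, lr=0.005, steps=5000, seed=0):
    """Gradient descent on the squared coefficient mismatch over the free factor B."""
    d = (len(target) - 1) // 2
    rng = np.random.default_rng(seed)
    B = 0.1 * rng.standard_normal((d + 1, rank))  # small random initialization
    losses = []
    for _ in range(steps):
        residual = poly_from_factor(B) - target
        losses.append(float(residual @ residual))
        # dLoss/dQ[i, j] = 2 * residual[i + j]; G is symmetric, so dLoss/dB = 2 * G @ B.
        G = np.array([[2.0 * residual[i + j] for j in range(d + 1)]
                      for i in range(d + 1)])
        B = B - lr * 2.0 * (G @ B)
    return B, losses

# Target: (x^2 + 1)^2 = 1 + 2 x^2 + x^4, which is a sum of squares.
target = np.array([1.0, 0.0, 2.0, 0.0, 1.0])
B, losses = fit_sos(target)
```

Because $BB^\top$ is positive semidefinite by construction, no constraint has to be enforced during optimization, and any `B` that drives the loss to zero writes the target as the sum of the squared polynomials whose coefficient vectors are the columns of `B`.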
On Divergence Measures for Training GFlowNets
da Silva, Tiago, da Silva, Eliezer de Souza, Mesquita, Diego
Generative Flow Networks (GFlowNets) are amortized inference models designed to sample from unnormalized distributions over composable objects, with applications in generative modeling for tasks in fields such as causal discovery, NLP, and drug discovery. Traditionally, the training procedure for GFlowNets seeks to minimize the expected log-squared difference between a proposal (forward policy) and a target (backward policy) distribution, which enforces certain flow-matching conditions. While this training procedure is closely related to variational inference (VI), directly attempting standard Kullback-Leibler (KL) divergence minimization can lead to provably biased and potentially high-variance estimators. Therefore, we first review four divergence measures, namely the Rényi-$\alpha$, Tsallis-$\alpha$, reverse KL, and forward KL divergences, and design statistically efficient estimators for their stochastic gradients in the context of training GFlowNets. Then, we verify that properly minimizing these divergences yields a provably correct and empirically effective training scheme, often leading to significantly faster convergence than previously proposed optimization. To achieve this, we design control variates based on the REINFORCE leave-one-out and score-matching estimators to reduce the variance of the learning objectives' gradients. Our work contributes by narrowing the gap between GFlowNets training and generalized variational approximations, paving the way for algorithmic ideas informed by the divergence minimization viewpoint.