Goto

Collaborating Authors

 Gradient Descent


Two minutes NLP -- Doc2Vec in a nutshell

#artificialintelligence

Doc2Vec is an unsupervised algorithm that learns embeddings from variable-length pieces of texts, such as sentences, paragraphs, and documents. It's originally presented in the paper Distributed Representations of Sentences and Documents. Let's review Word2Vec first, as it provides the inspiration for the Doc2Vec algorithm. Word2Vec learns word vectors by predicting a word in a sentence using the other words in the context. In this framework, every word is mapped to a unique vector, represented by a column in a matrix W. The concatenation or sum of the vectors is then used as features for the prediction of the next word in a sentence. The word vectors are trained using stochastic gradient descent.


Variational autoencoders in the presence of low-dimensional data: landscape and implicit bias

arXiv.org Machine Learning

Variational Autoencoders (VAEs) are one of the most commonly used generative models, particularly for image data. A prominent difficulty in training VAEs is data that is supported on a lower dimensional manifold. Recent work by Dai and Wipf (2019) suggests that on low-dimensional data, the generator will converge to a solution with 0 variance which is correctly supported on the ground truth manifold. In this paper, via a combination of theoretical and empirical results, we show that the story is more subtle. Precisely, we show that for linear encoders/decoders, the story is mostly true and VAE training does recover a generator with support equal to the ground truth manifold, but this is due to the implicit bias of gradient descent rather than merely the VAE loss itself. In the nonlinear case, we show that the VAE training frequently learns a higher-dimensional manifold which is a superset of the ground truth manifold.


Understanding Gradient Descent with simple mathematical intuition

#artificialintelligence

In simple language, the gradient descent is an ML optimization strategy which facilitates the ML model to find the minimum loss (cost) function which relates to the optimal variable parameters. Let's understand the concept in detail by applying it on one of the most regression algorithms that ML engineers and Data scientists use, that is Linear Regression. For ease of understanding Gradient descent, we will use simple or univariate linear regression. Here, we are focusing on determining the relationship between one independent variable and one dependent variable(target variable). A univariate linear regression is mathematically represented by y mx c, where'y' is the dependent variable and'x' is the independent or target variable.


Determinantal point processes based on orthogonal polynomials for sampling minibatches in SGD

arXiv.org Machine Learning

Stochastic gradient descent (SGD) is a cornerstone of machine learning. When the number N of data items is large, SGD relies on constructing an unbiased estimator of the gradient of the empirical risk using a small subset of the original dataset, called a minibatch. Default minibatch construction involves uniformly sampling a subset of the desired size, but alternatives have been explored for variance reduction. In particular, experimental evidence suggests drawing minibatches from determinantal point processes (DPPs), distributions over minibatches that favour diversity among selected items. However, like in recent work on DPPs for coresets, providing a systematic and principled understanding of how and why DPPs help has been difficult. In this work, we contribute an orthogonal polynomial-based DPP paradigm for minibatch sampling in SGD. Our approach leverages the specific data distribution at hand, which endows it with greater sensitivity and power over existing data-agnostic methods. We substantiate our method via a detailed theoretical analysis of its convergence properties, interweaving between the discrete data set and the underlying continuous domain. In particular, we show how specific DPPs and a string of controlled approximations can lead to gradient estimators with a variance that decays faster with the batchsize than under uniform sampling. Coupled with existing finite-time guarantees for SGD on convex objectives, this entails that, DPP minibatches lead to a smaller bound on the mean square approximation error than uniform minibatches. Moreover, our estimators are amenable to a recent algorithm that directly samples linear statistics of DPPs (i.e., the gradient estimator) without sampling the underlying DPP (i.e., the minibatch), thereby reducing computational overhead. We provide detailed synthetic as well as real data experiments to substantiate our theoretical claims.


Faster Single-loop Algorithms for Minimax Optimization without Strong Concavity

arXiv.org Machine Learning

Gradient descent ascent (GDA), the simplest single-loop algorithm for nonconvex minimax optimization, is widely used in practical applications such as generative adversarial networks (GANs) and adversarial training. Albeit its desirable simplicity, recent work shows inferior convergence rates of GDA in theory even assuming strong concavity of the objective on one side. This paper establishes new convergence results for two alternative single-loop algorithms -- alternating GDA and smoothed GDA -- under the mild assumption that the objective satisfies the Polyak-Lojasiewicz (PL) condition about one variable. We prove that, to find an $\epsilon$-stationary point, (i) alternating GDA and its stochastic variant (without mini batch) respectively require $O(\kappa^{2} \epsilon^{-2})$ and $O(\kappa^{4} \epsilon^{-4})$ iterations, while (ii) smoothed GDA and its stochastic variant (without mini batch) respectively require $O(\kappa \epsilon^{-2})$ and $O(\kappa^{2} \epsilon^{-4})$ iterations. The latter greatly improves over the vanilla GDA and gives the hitherto best known complexity results among single-loop algorithms under similar settings. We further showcase the empirical efficiency of these algorithms in training GANs and robust nonlinear regression.


On Convergence of Federated Averaging Langevin Dynamics

arXiv.org Machine Learning

We propose a federated averaging Langevin algorithm (FA-LD) for uncertainty quantification and mean predictions with distributed clients. In particular, we generalize beyond normal posterior distributions and consider a general class of models. We develop theoretical guarantees for FA-LD for strongly log-concave distributions with non-i.i.d data and study how the injected noise and the stochastic-gradient noise, the heterogeneity of data, and the varying learning rates affect the convergence. Such an analysis sheds light on the optimal choice of local updates to minimize communication costs. Important to our approach is that the communication efficiency does not deteriorate with the injected noise in the Langevin algorithms. In addition, we examine in our FA-LD algorithm both independent and correlated noise used over different clients. We observe that there is also a trade-off between federation and communication cost there. As local devices may become inactive in the federated network, we also show convergence results based on different averaging schemes where only partial device updates are available.


Asymptotic properties of one-layer artificial neural networks with sparse connectivity

arXiv.org Machine Learning

A law of large numbers for the empirical distribution of parameters of a one-layer artificial neural networks with sparse connectivity is derived for a simultaneously increasing number of both, neurons and training iterations of the stochastic gradient descent.


Convergence and Stability of the Stochastic Proximal Point Algorithm with Momentum

arXiv.org Machine Learning

Stochastic gradient descent with momentum (SGDM) is the dominant algorithm in many optimization scenarios, including convex optimization instances and non-convex neural network training. Yet, in the stochastic setting, momentum interferes with gradient noise, often leading to specific step size and momentum choices in order to guarantee convergence, set aside acceleration. Proximal point methods, on the other hand, have gained much attention due to their numerical stability and elasticity against imperfect tuning. Their stochastic accelerated variants though have received limited attention: how momentum interacts with the stability of (stochastic) proximal point methods remains largely unstudied. To address this, we focus on the convergence and stability of the stochastic proximal point algorithm with momentum (SPPAM), and show that SPPAM allows a faster linear convergence to a neighborhood compared to stochastic proximal point algorithm (SPPA) with a better contraction factor, under proper hyperparameter tuning. In terms of stability, we show that SPPAM depends on problem constants more favorably than SGDM, allowing a wider range of step size and momentum that lead to convergence.


A project of upgrading a simple chatbot

#artificialintelligence

I initially wanted to learn how chatbot works and created a simple one on my own. However, there's a surprising amount of available codes that create chatbots of all levels. My task was to learn from this very simple chatbot model that uses straightforward NN of x and y, using stochastic gradient descent to predict the intention of the writer and generate an answer accordingly. As x is the input text, and y is the intention, the author marks 0 or 1 to the input texts. 1 is close to the intention, and 0 is the opposite. My idea of improvement is instead of marking 0 or 1 to the texts, I would implement words to vectors to create one hot encoded (X, Y) matrices, then feed the encoded words into the model.


A Continuous-time Stochastic Gradient Descent Method for Continuous Data

arXiv.org Artificial Intelligence

Optimization problems with continuous data appear in, e.g., robust machine learning, functional data analysis, and variational inference. Here, the target function is given as an integral over a family of (continuously) indexed target functions - integrated with respect to a probability measure. Such problems can often be solved by stochastic optimization methods: performing optimization steps with respect to the indexed target function with randomly switched indices. In this work, we study a continuous-time variant of the stochastic gradient descent algorithm for optimization problems with continuous data. This so-called stochastic gradient process consists in a gradient flow minimizing an indexed target function that is coupled with a continuous-time index process determining the index. Index processes are, e.g., reflected diffusions, pure jump processes, or other L\'evy processes on compact spaces. Thus, we study multiple sampling patterns for the continuous data space and allow for data simulated or streamed at runtime of the algorithm. We analyze the approximation properties of the stochastic gradient process and study its longtime behavior and ergodicity under constant and decreasing learning rates. We end with illustrating the applicability of the stochastic gradient process in a polynomial regression problem with noisy functional data, as well as in a physics-informed neural network.