 Gradient Descent


What Can Machine Learning Teach Us about Communications?

arXiv.org Machine Learning

Rapid improvements in machine learning over the past decade are beginning to have far-reaching effects. For communications, engineers with limited domain expertise can now use off-the-shelf learning packages to design high-performance systems based on simulations. Prior to the current revolution in machine learning, the majority of communication engineers were quite aware that system parameters (such as filter coefficients) could be learned using stochastic gradient descent. It was not at all clear, however, that more complicated parts of the system architecture could be learned as well. In this paper, we discuss the application of machine-learning techniques to two communications problems and focus on what can be learned from the resulting systems. We were pleasantly surprised that the observed gains in one example have a simple explanation that only became clear in hindsight. In essence, deep learning discovered a simple and effective strategy that had not been considered earlier.


Stochastic Gradient Trees

arXiv.org Machine Learning

We present an online algorithm that induces decision trees using gradient information as the source of supervision. In contrast to previous approaches to gradient-based tree learning, we do not require soft splits or construction of a new tree for every update. In experiments, our method performs comparably to standard incremental classification trees and outperforms state-of-the-art incremental regression trees. We also show how the method can be used to construct a novel type of neural network layer suited to learning representations from tabular data, and find that it increases the accuracy of multiclass and multi-label classification.


How to Configure the Learning Rate Hyperparameter When Training Deep Learning Neural Networks

#artificialintelligence

The weights of a neural network cannot be calculated using an analytical method. Instead, the weights must be discovered via an empirical optimization procedure called stochastic gradient descent.


Difference between Batch Gradient Descent and Stochastic Gradient Descent

#artificialintelligence

Now, what was the Gradient Descent algorithm? It says that to perform GD, we need to calculate the gradient of the cost function J. And to calculate the gradient of the cost function, we need to sum the cost of each sample (the term circled in yellow in the original figure). If we have 3 million samples, we have to loop 3 million times or use the dot product. That is the role of np.dot(X.T, y_hat - y) in the article's code.
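A minimal sketch of the point made in this excerpt, assuming a linear model with a mean-squared-error cost (the excerpt does not say which model or cost the article uses). It shows that the per-sample loop and the np.dot expression compute the same batch gradient, while a stochastic step looks at a single sample. The data, shapes, and every name other than X, y, and y_hat are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 1000, 3                      # hypothetical sample count and feature count
X = rng.normal(size=(m, d))
y = rng.normal(size=m)
w = np.zeros(d)                     # current weights of an assumed linear model

y_hat = X @ w                       # predictions for every sample

# Batch gradient, written as an explicit loop over all m samples.
grad_loop = np.zeros(d)
for i in range(m):
    grad_loop += (y_hat[i] - y[i]) * X[i]
grad_loop /= m

# The same batch gradient as a single dot product.
grad_batch = np.dot(X.T, y_hat - y) / m

# Stochastic gradient: one randomly chosen sample stands in for the sum.
i = rng.integers(m)
grad_sgd = (y_hat[i] - y[i]) * X[i]

print(np.allclose(grad_loop, grad_batch))
```

The batch versions touch every sample before each update; the stochastic version trades a noisier direction for a much cheaper step.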


Non-Asymptotic Analysis of Fractional Langevin Monte Carlo for Non-Convex Optimization

arXiv.org Machine Learning

Recent studies on diffusion-based sampling methods have shown that Langevin Monte Carlo (LMC) algorithms can be beneficial for non-convex optimization, and rigorous theoretical guarantees have been proven for both asymptotic and finite-time regimes. Algorithmically, LMC-based algorithms resemble the well-known gradient descent (GD) algorithm, where the GD recursion is perturbed by an additive Gaussian noise whose variance has a particular form. Fractional Langevin Monte Carlo (FLMC) is a recently proposed extension of LMC, where the Gaussian noise is replaced by a heavy-tailed $\alpha$-stable noise. As opposed to their Gaussian counterpart, these heavy-tailed perturbations can incur large jumps, and it has been empirically demonstrated that the choice of $\alpha$-stable noise can provide several advantages in modern machine learning problems, both in optimization and sampling contexts. However, as opposed to LMC, only asymptotic convergence properties of FLMC have been established so far. In this study, we analyze the non-asymptotic behavior of FLMC for non-convex optimization and prove finite-time bounds for its expected suboptimality. Our results show that the weak-error of FLMC increases faster than that of LMC, which suggests using smaller step-sizes in FLMC. We finally extend our results to the case where the exact gradients are replaced by stochastic gradients and show that similar results hold in this setting as well.
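As a rough illustration of "GD perturbed by Gaussian noise", the sketch below uses the standard unadjusted Langevin update x ← x − η f′(x) + sqrt(2η/β) ξ with ξ ~ N(0, 1). The abstract does not spell out its noise scaling, so the inverse temperature beta, the step size eta, and the double-well test function are all assumptions chosen for the example. It covers only the Gaussian (LMC) case, not the $\alpha$-stable noise of FLMC.

```python
import random

def grad_f(x):
    # Derivative of the hypothetical non-convex test function f(x) = (x^2 - 1)^2.
    return 4.0 * x * (x * x - 1.0)

def gradient_descent(x, eta, steps):
    for _ in range(steps):
        x -= eta * grad_f(x)
    return x

def langevin_monte_carlo(x, eta, beta, steps, seed=0):
    rng = random.Random(seed)
    noise_std = (2.0 * eta / beta) ** 0.5   # assumed standard LMC noise scale
    for _ in range(steps):
        x = x - eta * grad_f(x) + noise_std * rng.gauss(0.0, 1.0)
    return x

# x = 0 is a stationary point (a local maximum) of f, so plain GD never moves.
x_gd = gradient_descent(0.0, eta=0.01, steps=5000)
# The injected noise lets LMC leave x = 0 and settle near a minimum at +1 or -1.
x_lmc = langevin_monte_carlo(0.0, eta=0.01, beta=1000.0, steps=5000)
print(x_gd, x_lmc)
```

With a large beta the noise is small and the iterate stays close to a minimum once it gets there; a smaller beta would make the chain explore more widely.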


An Exact Reformulation of Feature-Vector-based Radial-Basis-Function Networks for Graph-based Observations

arXiv.org Machine Learning

Radial-basis-function networks are traditionally defined for sets of vector-based observations. In this short paper, we reformulate such networks so that they can be applied to adjacency-matrix representations of weighted, directed graphs that represent the relationships between object pairs. We re-state the sum-of-squares objective function so that it is purely dependent on entries from the adjacency matrix. From this objective function, we derive a gradient descent update for the network weights. We also derive a gradient update that simulates the repositioning of the radial basis prototypes and changes in the radial basis prototype parameters. An important property of our radial basis function networks is that they are guaranteed to yield the same responses as conventional radial-basis networks trained on a corresponding vector realization of the relationships encoded by the adjacency-matrix. Such a vector realization only needs to provably exist for this property to hold, which occurs whenever the relationships correspond to distances from some arbitrary metric applied to a latent set of vectors. We therefore completely avoid needing to actually construct vectorial realizations via multi-dimensional scaling, which ensures that the underlying relationships are totally preserved.


r/MachineLearning - [D] Gradient Descent on (deterministic) Mean Absolute Error (L1 loss)

#artificialintelligence

Gradient-based optimization of absolute errors is tricky, since the gradient is "never" zero. In theory, adaptive methods should be able to damp oscillations so that the optimization converges to the minimum. However, I found that none of the 'standard' methods were able to do this "out of the box". Learning rate decay could alleviate the problem, but it needs manual tuning, which I would rather avoid. Does anyone know of a method that can do this?
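The behaviour described in the question can be reproduced with a small sketch. It fits a single constant to five made-up data points under mean absolute error, whose minimizer is the median (3.0 here). The data, the step sizes, and the 1/sqrt(t) decay schedule are assumptions for illustration, not something proposed in the thread.

```python
def mae_subgradient(theta, data):
    # Subgradient of mean(|theta - x|): the average of sign(theta - x).
    total = 0.0
    for x in data:
        if theta > x:
            total += 1.0
        elif theta < x:
            total -= 1.0
    return total / len(data)

def minimize_mae(data, lr0, steps, decay):
    theta = 0.0
    for t in range(steps):
        # Either a constant step or a step that shrinks like 1 / sqrt(t + 1).
        lr = lr0 / (t + 1) ** 0.5 if decay else lr0
        theta -= lr * mae_subgradient(theta, data)
    return theta

data = [1.0, 2.0, 3.0, 4.0, 100.0]          # hypothetical data, median = 3.0
theta_const = minimize_mae(data, lr0=0.7, steps=2000, decay=False)
theta_decay = minimize_mae(data, lr0=0.7, steps=2000, decay=True)
print(theta_const, theta_decay)
```

With a constant step the subgradient keeps magnitude 0.2 on either side of the median, so the iterate hops back and forth by 0.14 and never settles. The decaying step shrinks those hops, at the cost of the schedule tuning the poster wants to avoid.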


A Deterministic Approach to Avoid Saddle Points

arXiv.org Machine Learning

Loss functions with a large number of saddle points are one of the main obstacles to training many modern machine learning models. Gradient descent (GD) is a fundamental algorithm for machine learning and converges to a saddle point for certain initial data. We call the region formed by these initial values the "attraction region." For quadratic functions, GD converges to a saddle point if the initial data is in a subspace of up to n-1 dimensions. In this paper, we prove that a small modification of the recently proposed Laplacian smoothing gradient descent (LSGD) [Osher, et al., arXiv:1806.06317] contributes to avoiding saddle points without sacrificing the convergence rate of GD. In particular, we show that the dimension of the LSGD's attraction region is at most floor((n-1)/2) for a class of quadratic functions which is significantly smaller than GD's (n-1)-dimensional attraction region.
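A two-dimensional sketch of the "attraction region" for plain GD, using the assumed quadratic f(x, y) = (x^2 − y^2)/2, whose only stationary point is a saddle at the origin. This example is not taken from the paper and does not implement LSGD. It only shows that GD reaches the saddle when started on the one-dimensional subspace y = 0 (here n − 1 = 1) and escapes otherwise.

```python
def gradient_descent(x, y, lr=0.1, steps=200):
    # Gradient of the assumed quadratic f(x, y) = (x**2 - y**2) / 2 is (x, -y).
    for _ in range(steps):
        gx, gy = x, -y
        x, y = x - lr * gx, y - lr * gy
    return x, y

# Start exactly on the subspace y = 0: the iterates shrink toward the saddle.
on_subspace = gradient_descent(1.0, 0.0)
# Start a tiny distance off the subspace: the y-coordinate grows geometrically.
off_subspace = gradient_descent(1.0, 1e-6)
print(on_subspace, off_subspace)
```

Each step multiplies x by 0.9 and y by 1.1, so any nonzero starting y eventually dominates. Only initial points with y = 0 are pulled into the saddle.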


pushpull13/Gradient-Descent-Scratch

#artificialintelligence

It is a gradient descent algorithm for classification, implemented from scratch using the numpy library. It is good practice to shuffle the data first with numpy.random.shuffle().

Mini Batch Size is the amount of input data flowing through the network at a time, over which the error is calculated as a whole. Learning Rate Alpha decides the rate at which weights and biases are updated during back propagation. Number of Epochs decides the number of times the whole dataset is used to train the network.

Set Mini Batch Size to 1/10th of the total data available, and update it manually after every training run to find its optimum value. Alpha should be selected such that learning isn't very slow but also doesn't take long jumps, or else the network will start diverging from the local minimum. The number of epochs is selected such that the network doesn't overfit to noise.

In an ANN, the output depends on every neuron it passes through. For the output layer, we have a label from which it is possible to find its expected value. But for all other layers there is no single solution available, so finding the optimum value is a little harder for them.
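The hyperparameters described in this README can be sketched as follows. This is not the repository's code: it is a minimal illustration using a single-layer logistic classifier, and the function name train_minibatch_gd, the synthetic data, and the default values are all hypothetical. It shuffles indices with a NumPy Generator rather than calling numpy.random.shuffle() on the data, so that inputs and labels stay aligned.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_minibatch_gd(X, y, alpha=0.1, epochs=50, batch_size=None, seed=0):
    """Mini-batch gradient descent for a logistic classifier (illustrative)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    if batch_size is None:
        batch_size = max(1, n // 10)        # README's rule of thumb: 1/10th of the data
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):                 # one epoch = one pass over the whole dataset
        order = rng.permutation(n)          # reshuffle sample order every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            err = sigmoid(X[idx] @ w + b) - y[idx]   # error on this mini-batch
            w -= alpha * (X[idx].T @ err) / len(idx)
            b -= alpha * err.mean()
    return w, b

# Hypothetical linearly separable data, used only to exercise the function.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w, b = train_minibatch_gd(X, y)
accuracy = np.mean((sigmoid(X @ w + b) > 0.5) == (y == 1.0))
print(accuracy)
```

Raising alpha speeds up learning until the updates start to overshoot. Changing batch_size trades gradient noise against the cost of each update.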


Fitting ReLUs via SGD and Quantized SGD

arXiv.org Machine Learning

In this paper we focus on the problem of finding the optimal weights of the shallowest of neural networks consisting of a single Rectified Linear Unit (ReLU). These functions are of the form $\mathbf{x}\rightarrow \max(0,\langle\mathbf{w},\mathbf{x}\rangle)$ with $\mathbf{w}\in\mathbb{R}^d$ denoting the weight vector. We focus on a planted model where the inputs are chosen i.i.d. from a Gaussian distribution and the labels are generated according to a planted weight vector. We first show that mini-batch stochastic gradient descent, when suitably initialized, converges at a geometric rate to the planted model with a number of samples that is optimal up to numerical constants. Next we focus on a parallel implementation where in each iteration the mini-batch gradient is calculated in a distributed manner across multiple processors and then broadcast to a master or all other processors. To reduce the communication cost in this setting we utilize a Quantized Stochastic Gradient Scheme (QSGD) where the partial gradients are quantized. Perhaps unexpectedly, we show that QSGD maintains the fast convergence of SGD to a globally optimal model while significantly reducing the communication cost. We further corroborate our numerical findings via various experiments including distributed implementations over Amazon EC2.
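The planted-model setup can be sketched in a few lines. The abstract does not say what "suitably initialized" means, so the initialization below is an assumption: it relies on the identity E[y x] = w*/2 for standard Gaussian inputs. The dimension, sample count, step size, and batch size are likewise hypothetical, and no quantization is performed.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 2000                              # hypothetical dimension and sample count
w_star = rng.normal(size=d)                 # planted weight vector
X = rng.normal(size=(n, d))                 # i.i.d. Gaussian inputs
y = np.maximum(0.0, X @ w_star)             # noise-free labels from the planted ReLU

# Assumed initialization (not necessarily the paper's): for standard Gaussian
# inputs E[y * x] = w_star / 2, so twice the empirical mean is a rough estimate.
w = 2.0 * (X.T @ y) / n

step, batch = 0.5, 32
for _ in range(500):
    idx = rng.integers(0, n, size=batch)    # draw a mini-batch
    pre = X[idx] @ w
    residual = np.maximum(0.0, pre) - y[idx]
    # Gradient of 0.5 * residual**2; the ReLU passes gradient only where pre > 0.
    grad = X[idx].T @ (residual * (pre > 0)) / batch
    w -= step * grad

error = np.linalg.norm(w - w_star)
print(error)
```

Because the labels are noise-free, the stochastic gradient vanishes at the planted weights, which is why a constant step size is enough in this sketch.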