Gradient Descent
Guide to Gradient Descent in Machine Learning
Gradient descent is an iterative optimization algorithm used in machine learning to minimize a cost function. It repeatedly updates the model parameters in the direction that reduces the cost, continuing until the cost stops decreasing, ideally at or close to zero, in order to achieve the least possible error. For non-convex cost functions, the minimum it reaches may be a local rather than the global one. In this way, it helps machine learning models make more accurate predictions. The cost function measures the difference between the actual value and the predicted value (also known as the loss). It represents the error and tells us how far the predicted value is from the actual value.
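As a minimal sketch of the update loop described above, the following fits a line to a handful of points by gradient descent on the mean squared error. The function names, the learning rate `lr`, the step count, and the toy data are assumptions chosen for illustration, not part of the original guide.

```python
def gradient_descent(xs, ys, lr=0.05, steps=2000):
    """Fit y ~ w * x + b by minimizing mean squared error with gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Errors of the current predictions against the actual values.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Partial derivatives of the MSE cost with respect to w and b.
        grad_w = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2.0 / n) * sum(errors)
        # Move the parameters a small step against the gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def mse(xs, ys, w, b):
    """Cost function: mean squared difference between prediction and target."""
    return sum(((w * x + b) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy data generated from y = 2x + 1 (an illustrative assumption).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = gradient_descent(xs, ys)
final_cost = mse(xs, ys, w, b)
```

Because this cost is convex, the loop approaches the single global minimum; on a non-convex cost the same loop would only be expected to reach a nearby local minimum.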
TTOpt: A Maximum Volume Quantized Tensor Train-based Optimization and its Application to Reinforcement Learning
Sozykin, Konstantin, Chertkov, Andrei, Schutski, Roman, Phan, Anh-Huy, Cichocki, Andrzej, Oseledets, Ivan
We present a novel procedure for optimization based on the combination of efficient quantized tensor train representation and a generalized maximum matrix volume principle. We demonstrate the applicability of the new Tensor Train Optimizer (TTOpt) method for various tasks, ranging from minimization of multidimensional functions to reinforcement learning. Our algorithm compares favorably to popular evolutionary-based methods and outperforms them by the number of function evaluations or execution time, often by a significant margin.
Convergence of the mini-batch SIHT algorithm
The Iterative Hard Thresholding (IHT) algorithm has been considered extensively as an effective deterministic algorithm for solving sparse optimization problems. The IHT algorithm benefits from the information of the batch (full) gradient at each point, and this information is key to the convergence analysis of the generated sequence. However, this strength becomes a weakness when it comes to machine learning and high-dimensional statistical applications, because calculating the batch gradient at each iteration is computationally expensive or impractical. Fortunately, in these applications the objective function has a summation structure that can be exploited to approximate the batch gradient by the stochastic mini-batch gradient. In this paper, we study the mini-batch Stochastic IHT (SIHT) algorithm for solving sparse optimization problems. As opposed to previous works, where an increasing and variable mini-batch size is necessary for the derivation, we fix the mini-batch size according to a lower bound that we derive. To prove stochastic convergence of the objective value function, we first establish a critical sparse stochastic gradient descent property. Using this stochastic gradient descent property, we show that the sequence generated by the stochastic mini-batch SIHT is a supermartingale sequence and converges with probability one. Unlike previous work, we do not assume the function to be restricted strongly convex. To the best of our knowledge, in the regime of sparse optimization, this is the first time in the literature that the sequence of the stochastic function values is shown to converge with probability one with the mini-batch size fixed for all steps.
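The general shape of a mini-batch stochastic IHT iteration can be sketched as follows: take a gradient step on a randomly sampled mini-batch of fixed size, then keep only the s largest-magnitude coordinates. This is an illustrative sketch on a least-squares objective, not the paper's algorithm as analyzed. The function names, the step size `lr`, the batch size, and the synthetic data are all assumptions, and the batch size here is not chosen from the paper's lower bound.

```python
import random


def hard_threshold(x, s):
    """Keep the s largest-magnitude entries of x and set the rest to zero."""
    order = sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)
    keep = set(order[:s])
    return [v if i in keep else 0.0 for i, v in enumerate(x)]


def minibatch_gradient(A, y, x, batch):
    """Gradient of the average of 0.5 * (a_i . x - y_i)^2 over the sampled rows."""
    grad = [0.0] * len(x)
    for i in batch:
        residual = sum(a * xj for a, xj in zip(A[i], x)) - y[i]
        for j, a in enumerate(A[i]):
            grad[j] += residual * a / len(batch)
    return grad


def siht(A, y, s, batch_size, lr=0.1, steps=500, seed=0):
    """Mini-batch stochastic IHT sketch with a fixed mini-batch size."""
    rng = random.Random(seed)
    x = [0.0] * len(A[0])
    for _ in range(steps):
        batch = rng.sample(range(len(A)), batch_size)
        g = minibatch_gradient(A, y, x, batch)
        # Gradient step followed by projection onto s-sparse vectors.
        x = hard_threshold([xj - lr * gj for xj, gj in zip(x, g)], s)
    return x


# Synthetic noiseless data with a 2-sparse ground truth (illustrative only).
data_rng = random.Random(1)
x_true = [3.0, 0.0, 0.0, -2.0, 0.0, 0.0, 0.0, 0.0]
A = [[data_rng.gauss(0.0, 1.0) for _ in x_true] for _ in range(40)]
y = [sum(a * xj for a, xj in zip(row, x_true)) for row in A]
x_hat = siht(A, y, s=2, batch_size=10)
```

The thresholding step guarantees that every iterate has at most s nonzero entries; whether the iterates recover the true support depends on the data, the step size, and the batch size.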
A General Framework for Analyzing Stochastic Dynamics in Learning Algorithms
Chou, Chi-Ning, Sandhu, Juspreet Singh, Wang, Mien Brabeeba, Yu, Tiancheng
One of the challenges in analyzing learning algorithms is the circular entanglement between the objective value and the stochastic noise. This is also known as the "chicken and egg" phenomenon and traditionally, there is no principled way to tackle this issue. People solve the problem by utilizing the special structure of the dynamic, and hence the analysis would be difficult to generalize. In this work, we present a streamlined three-step recipe to tackle the "chicken and egg" problem and give a general framework for analyzing stochastic dynamics in learning algorithms. Our framework composes standard techniques from probability theory, such as stopping time and martingale concentration. We demonstrate the power and flexibility of our framework by giving a unifying analysis for three very different learning problems with the last iterate and the strong uniform high probability convergence guarantee. The problems are stochastic gradient descent for strongly convex functions, streaming principal component analysis, and linear bandit with stochastic gradient descent updates. We either improve or match the state-of-the-art bounds on all three dynamics.
Survey Descent: A Multipoint Generalization of Gradient Descent for Nonsmooth Optimization
For strongly convex objectives that are smooth, the classical theory of gradient descent ensures linear convergence relative to the number of gradient evaluations. An analogous nonsmooth theory is challenging. Even when the objective is smooth at every iterate, the corresponding local models are unstable and the number of cutting planes invoked by traditional remedies is difficult to bound, leading to convergence guarantees that are sublinear relative to the cumulative number of gradient evaluations. We instead propose a multipoint generalization of the gradient descent iteration for local optimization. While designed with general objectives in mind, we are motivated by a "max-of-smooth" model that captures the subdifferential dimension at optimality. We prove linear convergence when the objective is itself max-of-smooth, and experiments suggest a more general phenomenon.
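For reference, the classical smooth result mentioned at the start of the abstract can be stated as follows. The symbols are assumptions introduced here, not the paper's notation: f is μ-strongly convex with L-Lipschitz gradient, f^* is its minimum value, and the step size is 1/L.

```latex
x_{k+1} = x_k - \frac{1}{L}\,\nabla f(x_k)
\quad\Longrightarrow\quad
f(x_k) - f^{*} \;\le\; \left(1 - \frac{\mu}{L}\right)^{k} \bigl(f(x_0) - f^{*}\bigr).
```

The error shrinks by a constant factor per gradient evaluation, which is what "linear convergence" means in this context.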
What, Why, and How of SGD Momentum Optimizer in Deep Learning
In deep learning, stochastic gradient descent (SGD) is one of the optimizers we use, because its goal is to find the weights and biases at which the model loss is lowest. Plain SGD has some issues, however: in deep learning the cost function is non-convex, and simple SGD can lead to poor performance on it. We start at a random point and may end up at a local minimum, unable to reach the global minimum. A saddle point is a point where the surface goes upward in one direction and downward in another. Near such a point the slope changes very gradually, so the updates become small and, as a result, training slows down.
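To illustrate how momentum helps when the slope is gentle, here is a minimal sketch of the classical (heavy-ball) momentum update next to plain gradient descent. It uses exact gradients on a hypothetical, poorly conditioned quadratic rather than stochastic gradients on a real network. The function names, the test surface, and the values of `lr` and `beta` are assumptions for illustration.

```python
def sgd_momentum(grad, x0, lr=0.01, beta=0.9, steps=300):
    """Classical momentum: v <- beta * v + g, then x <- x - lr * v."""
    x = list(x0)
    v = [0.0] * len(x)
    for _ in range(steps):
        g = grad(x)
        # The velocity accumulates past gradients, so consistent
        # directions keep moving even where the slope is shallow.
        v = [beta * vi + gi for vi, gi in zip(v, g)]
        x = [xi - lr * vi for xi, vi in zip(x, v)]
    return x


def grad_f(p):
    """Gradient of the hypothetical surface f(x, y) = 0.5 * (x^2 + 50 * y^2)."""
    return [p[0], 50.0 * p[1]]


start = [5.0, 5.0]
with_momentum = sgd_momentum(grad_f, start, beta=0.9)
without_momentum = sgd_momentum(grad_f, start, beta=0.0)  # plain gradient descent
```

With `beta=0.0` the update reduces to plain gradient descent, which crawls along the shallow x direction. With `beta=0.9` the accumulated velocity carries the iterate much closer to the minimum at the origin in the same number of steps.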
Gradient Descent: Design Your First Machine Learning Model
The code is influenced by Jeremy Howard's deep learning course Paperspace Fast.AI. We will be using PyTorch for the model and Pyplot for visualization. Finally, we set a manual seed to reproduce the results. If you are using the Deepnote environment, you can copy and paste my code to get equivalent results. We can generate by using torch.arange(start, end)
Stochastic Gradient Descent Captures How Children Learn About Physics
Buschoff, Luca M. Schulze, Schulz, Eric, Binz, Marcel
As children grow older, they develop an intuitive understanding of the physical processes around them. They move along developmental trajectories, which have been mapped out extensively in previous empirical research. We investigate how children's developmental trajectories compare to the learning trajectories of artificial systems. Specifically, we examine the idea that cognitive development results from some form of stochastic optimization procedure. For this purpose, we train a modern generative neural network model using stochastic gradient descent. We then use methods from the developmental psychology literature to probe the physical understanding of this model at different degrees of optimization. We find that the model's learning trajectory captures the developmental trajectories of children, thereby providing support to the idea of development as stochastic optimization.
Capacity dependent analysis for functional online learning algorithms
Guo, Xin, Guo, Zheng-Chu, Shi, Lei
This article provides convergence analysis of online stochastic gradient descent algorithms for functional linear models. Adopting the characterizations of the slope function regularity, the kernel space capacity, and the capacity of the sampling process covariance operator, significant improvement on the convergence rates is achieved. Both prediction problems and estimation problems are studied, where we show that capacity assumption can alleviate the saturation of the convergence rate as the regularity of the target function increases. We show that with properly selected kernel, capacity assumptions can fully compensate for the regularity assumptions for prediction problems (but not for estimation problems). This demonstrates the significant difference between the prediction problems and the estimation problems in functional data analysis.
From Weakly Supervised Learning to Active Learning
Applied mathematics and machine computations have raised a lot of hope since the recent success of supervised learning. Many practitioners in industry have been trying to switch from their old paradigms to machine learning. Interestingly, those data scientists spend more time scraping, annotating and cleaning data than fine-tuning models. This thesis is motivated by the following question: can we derive a more generic framework than that of supervised learning in order to learn from cluttered data? This question is approached through the lens of weakly supervised learning, assuming that the bottleneck of data collection lies in annotation. We model weak supervision as giving, rather than a unique target, a set of target candidates. We argue that one should look for an "optimistic" function that matches most of the observations. This allows us to derive a principle to disambiguate partial labels. We also discuss the advantage of incorporating unsupervised learning techniques into our framework, in particular manifold regularization approached through diffusion techniques, for which we derive a new algorithm that scales better with input dimension than the baseline method. Finally, we switch from passive to active weakly supervised learning, introducing the "active labeling" framework, in which a practitioner can query weak information about chosen data. Among other things, we leverage the fact that one does not need full information to access stochastic gradients and perform stochastic gradient descent.