Gradient Descent
Reviews: Stochastic Gradient Methods for Distributionally Robust Optimization with f-divergences
However, this paper is not carefully written. For example, the references are missing on page 6, line 192 and page 7, line 206. The legend of red lines are missing for Figure 2c,d. The paper states only the necessary information but not sufficient for the readers to follow easily. I think the clarity of this paper could be greatly improved especially the authors did not use the full 8 pages.
Reviews: Dual Space Gradient Descent for Online Learning
The rebuttal answered all my technical questions, in particular the one about probability and expectations in Theorems 1(ii) and 3. My scores (which were given assuming the technicalities can be fixed) remain the same. The theoretical results do not feel that exciting. The regret bounds seem more or less what one might expect, and there is no comparison to bounds for other algorithms. I understand that there are no directly comparable bounds since you have both budget size and dimensionality of random projections as significant parameters, but still some discussion would be nice. The dependence on budget B in your bounds could perhaps be pointed out more explicitly.
Reviews: Provable Efficient Online Matrix Completion via Non-convex Stochastic Gradient Descent
The main contribution in this work is showing that in the context of matrix completion, when equipped with a good initialization, the sequence of the solutions produced by a widely used (but without theoretical guarantee) SGD algorithm converges linearly to the true matrix. The paper reads well and most of theoretical result looks sound. But I have some concerns as follows. Is it possible to access fewer samples, e.g., "O(mu * d * k * log(d))" while still ensuring global convergence? Theorem 3.1 says "for any fixed T 1, with probability at least 1 - T/(d 10), we have linear convergence". That means, if we run the algorithm (i.e., repeatedly reveal the samples) for a longer time, we have a LOWER confidence to obtain the true matrix.
Reviews: The Sound of APALM Clapping: Faster Nonsmooth Nonconvex Optimization with Stochastic Asynchronous PALM
I find the approach rather interesting, especially the broad and general definition of the problem makes the approach applicable to a wide range of problems. However, I was surprised by the absence of any reference to the seminal Robbins/Munro paper and also to the recent developments in stochastic gradient descent based sampling (see below). The authors do local gradient descent updates of coordinate blocks by computing partial gradients and adding noise in each asynchronous step. I was wondering, how this relates to the "usual" stochastic gradient descent update, i.e., given that the locally computed partial gradient will be based on delayed (noisy) variable states, a sequence of these noisy partial gradients would converge to the true partial gradient as well. Further, recent SGD based sampling has shown that adding noise to the variable states obtained by noisy gradient updates (as the authors do as well) provides good samples of the distribution underlying the optimal variable setting also in a non-convex setting. That being said, the work handed in remains valid, but it would have been interesting to compare the proposed approach to well established stochastic gradient methods.
Reviews: Byzantine Stochastic Gradient Descent
The paper studies stochastic convex optimization in a distributed master/workers framework, where on each round each machine out of m produces a stochastic gradient and sends it to the master, which aggregates these into a mini-batch. In this paper the authors allow a fraction of alpha of the machines to be Byzantine, i.e., they do not need to report valid stochastic gradients but may produce arbitrary vectors, even in an adversarial manner. The goal is to aggregate the reports of the machines and to converge to an optimal solution of the convex objective despite the malicious Byzantine machines. The authors present a novel variant of minibatch-SGD which tackles the difficulty the dealing with Byzantine machines. They prove upper-bounds on the convergence and nearly optimal matching lower-bounds on any algorithm working in such framework, and in this sense the results are quite satisfactory.
Reviews: Learning Overparameterized Neural Networks via Stochastic Gradient Descent on Structured Data
This paper studies learning over-parametrized single hidden layer ReLU neural networks for multi-class classification via SGD and the corresponding generalization error. They consider a mixture data distribution where each class has well-separated and compact support. The authors show SGD applied on the considered learning model achieves good prediction error with high probability under suitable assumptions. As a result even in severely over-parametrized models, SGD can generalize well although the network has enough capacity to fit arbitrary labels. The main insight in the theoretical analysis appears to be the observation that in the over-parametrized case, many ReLU neurons don't change their activation pattern when initialized randomly.
Faster Differentially Private Convex Optimization via Second-Order Methods
Differentially private (stochastic) gradient descent is the workhorse of DP private machine learning in both the convex and non-convex settings. In this work, we investigate the prospect of using the second-order information from the loss function to accelerate DP convex optimization. We first develop a private variant of the regularized cubic Newton method of Nesterov and Polyak, and show that for the class of strongly convex loss functions, our algorithm has quadratic convergence and achieves the optimal excess loss. We theoretically and empirically study the performance of our algorithm. Empirical results show our algorithm consistently achieves the best excess loss compared to other baselines and is 10-40x faster than DP-GD/DP-SGD for challenging datasets.
Alternating Gradient Descent and Mixture-of-Experts for Integrated Multimodal Perception
IMP makes use of a novel design that combines Alternating Gradient Descent (AGD) and Mixture-of-Experts (MoE) for efficient model & task scaling. We conduct extensive empirical studies and reveal the following key insights: 1) performing gradient descent updates by alternating on diverse modalities, loss functions, and tasks, with varying input resolutions, efficiently improves the model. IMP achieves competitive performance on a wide range of downstream tasks including video classification, image classification, image-text, and video-text retrieval. Most notably, we train a sparse IMP-MoE-L focusing on video tasks that achieves new state-of-the-art in zero-shot video classification: 77.0% on Kinetics-400, 76.8% on Kinetics-600, and 68.3% on Kinetics-700, improving the previous state-of-the-art by 5%, 6.7%, and 5.8%, respectively, while using only 15% of their total training computational cost.
Preconditioning Matters: Fast Global Convergence of Non-convex Matrix Factorization via Scaled Gradient Descent
Low-rank matrix factorization (LRMF) is a canonical problem in non-convex optimization, the objective function to be minimized is non-convex and even non-smooth, which makes the global convergence guarantee of gradient-based algorithm quite challenging. While the dependence of the convergence on the \textit{condition number} \kappa and \textit{small learning rate} makes it not practical especially for ill-conditioned LRMF problem.In this paper, we show that precondition helps in accelerating the convergence and prove that the scaled gradient descent (ScaledGD) and its variant, alternating scaled gradient descent (AltScaledGD) converge to an \varepsilon -global minima after O( {\rm ln} \frac{d}{\delta} {\rm ln} \frac{d}{\varepsilon}) iterations from general random initialization. Furthermore, we prove that as a proximity to the alternating minimization, AltScaledGD converges faster than ScaledGD, its global convergence does not rely on small learning rate and small initialization, which certificates the advantages of AltScaledGD in LRMF.
On the Convergence to a Global Solution of Shuffling-Type Gradient Algorithms
Stochastic gradient descent (SGD) algorithm is the method of choice in many machine learning tasks thanks to its scalability and efficiency in dealing with large-scale problems. In this paper, we focus on the shuffling version of SGD which matches the mainstream practical heuristics. We show the convergence to a global solution of shuffling SGD for a class of non-convex functions under over-parameterized settings. Our analysis employs more relaxed non-convex assumptions than previous literature. Nevertheless, we maintain the desired computational complexity as shuffling SGD has achieved in the general convex setting.