pl condition
On the Convergence to a Global Solution of Shuffling-Type Gradient Algorithms Anonymous Author(s) Affiliation Address email
Stochastic gradient descent (SGD) algorithm is the method of choice in many1 machine learning tasks thanks to its scalability and efficiency in dealing with2 large-scale problems. In this paper, we focus on the shuffling version of SGD3 which matches the mainstream practical heuristics. We show the convergence4 to a global solution of shuffling SGD for a class of non-convex functions un-5 der over-parameterized settings. Our analysis employs more relaxed non-convex6 assumptions than previous literature. Nevertheless, we maintain the desired compu-7 tational complexity as shuffling SGD has achieved in the general convex setting.8 1 Introduction9 In the last decade, neural network-based models have shown great success in many machine learning10 applications such as natural language processing [Collobert and Weston, 2008, Goldberg et al., 2018],11 computer vision and pattern recognition [Goodfellow et al., 2014, He and Sun, 2015].
An Online Method for AClass of Distributionally Robust Optimization with Non-Convex Objectives
In this paper, we propose a practical online method for solving a class of distributionally robust optimization (DRO) with non-convex objectives, which has important applications in machine learning for improving the robustness of neural networks. In the literature, most methods for solving DRO are based on stochastic primal-dual methods. However, primal-dual methods for DRO suffer from several drawbacks: (1) manipulating a high-dimensional dual variable corresponding to the size of data is time expensive; (2) they are not friendly to online learning where data is coming sequentially. To address these issues, we consider a class of DRO with an KL divergence regularization on the dual variables, transform the minmax problem into a compositional minimization problem, and propose practical duality-free online stochastic methods without requiring a large mini-batch size. We establish the state-of-the-art complexities of the proposed methods with and without a Polyak-ลojasiewicz (PL) condition of the objective. Empirical studies on large-scale deep learning tasks (i) demonstrate that our method can speed up the training by more than 2 times than baseline methods and save days of training time on a large-scale dataset with 265K images, and (ii) verify the supreme performance of DRO over Empirical Risk Minimization (ERM) on imbalanced datasets. Of independent interest, the proposed method can be also used for solving a family of stochastic compositional problems with state-of-the-art complexities.
Transition to Linearity of General Neural Networks with Directed Acyclic Graph Architecture
In this paper we show that feedforward neural networks corresponding to arbitrary directed acyclic graphs undergo transition to linearity as their "width" approaches infinity. The width of these general networks is characterized by the minimum indegree of their neurons, except for the input and first layers. Our results identify the mathematical structure underlying transition to linearity and generalize a number of recent works aimed at characterizing transition to linearity or constancy of the Neural Tangent Kernel for standard architectures.
On the Convergence to a Global Solution of Shuffling-Type Gradient Algorithms Lam M. Nguyen
Stochastic gradient descent (SGD) algorithm is the method of choice in many machine learning tasks thanks to its scalability and efficiency in dealing with large-scale problems. In this paper, we focus on the shuffling version of SGD which matches the mainstream practical heuristics. We show the convergence to a global solution of shuffling SGD for a class of non-convex functions under over-parameterized settings.
A Generalized Alternating Method for Bilevel
Bilevel optimization has recently regained interest owing to its applications in emerging machine learning fields such as hyperparameter optimization, meta-learning, and reinforcement learning. Recent results have shown that simple alternating (implicit) gradient-based algorithms can match the convergence rate of single-level gradient descent (GD) when addressing bilevel problems with a strongly convex lower-level objective. However, it remains unclear whether this result can be generalized to bilevel problems beyond this basic setting.