Optimization
Distributed Optimization for Overparameterized Problems: Achieving Optimal Dimension Independent Communication Complexity
Decentralized optimization are playing an important role in applications such as training large machine learning models, among others. Despite its superior practical performance, there has been some lack of fundamental understanding about its theoretical properties. In this work, we address the following open research question: To train an overparameterized model over a set of distributed nodes, what is the minimum communication overhead (in terms of the bits got exchanged) that the system needs to sustain, while still achieving (near) zero training loss? We show that for a class of overparameterized models where the number of parameters D is much larger than the total data samples N, the best possible communication complexity is (N), which is independent of the problem dimension D. Further, for a few specific overparameterized models (i.e., the linear regression, and certain multi-layer neural network with one wide layer), we develop a set of algorithms which uses certain linear compression followed by adaptive quantization, and show that they achieve dimension independent, near-optimal communication complexity. To our knowledge, this is the first time that dimension independent communication complexity has been shown for distributed optimization.
Projection-Free Methods for Stochastic Simple Bilevel Optimization with Convex Lower-level Problem
In this paper, we study a class of stochastic bilevel optimization problems, also known as stochastic simple bilevel optimization, where we minimize a smooth stochastic objective function over the optimal solution set of another stochastic convex optimization problem. We introduce novel stochastic bilevel optimization methods that locally approximate the solution set of the lower-level problem via a stochastic cutting plane, and then run a conditional gradient update with variance reduction techniques to control the error induced by using stochastic gradients. For the case that the upper-level function is convex, our method requires O(max{1/ฯต2f,1/ฯต2g}) stochastic oracle queries to obtain a solution that is ฯตfoptimal for the upper-level and ฯตg-optimal for the lower-level. This guarantee improves the previous best-known complexity of O(max{1/ฯต4f,1/ฯต4g}). Moreover, for the case that the upper-level function is non-convex, our method requires at most O(max{1/ฯต3f,1/ฯต3g})stochastic oracle queries to find an (ฯตf,ฯตg)-stationary point. In the finite-sum setting, we show that the number of stochastic oracle calls required by our method are O( n/ฯต) and O( n/ฯต2) for the convex and non-convex settings, respectively, where ฯต = min{ฯตf,ฯตg}.
Scale-invariant Learning by Physics Inversion
Solving inverse problems, such as parameter estimation and optimal control, is a vital part of science. Many experiments repeatedly collect data and rely on machine learning algorithms to quickly infer solutions to the associated inverse problems. We find that state-of-the-art training techniques are not well-suited to many problems that involve physical processes. The highly nonlinear behavior, common in physical processes, results in strongly varying gradients that lead first-order optimizers like SGD or Adam to compute suboptimal optimization directions. We propose a novel hybrid training approach that combines higherorder optimization methods with machine learning techniques. We take updates from a scale-invariant inverse problem solver and embed them into the gradientdescent-based learning pipeline, replacing the regular gradient of the physical process. We demonstrate the capabilities of our method on a variety of canonical physical systems, showing that it yields significant improvements on a wide range of optimization and learning problems.