Gradient Descent
When Expressivity Meets Trainability: Fewer than $n$ Neurons Can Work
Zhang, Jiawei, Zhang, Yushun, Hong, Mingyi, Sun, Ruoyu, Luo, Zhi-Quan
Modern neural networks are often quite wide, causing large memory and computation costs. It is thus of great interest to train a narrower network. However, training narrow neural nets remains a challenging task. We ask two theoretical questions: Can narrow networks have as strong expressivity as wide ones? If so, does the loss function exhibit a benign optimization landscape? In this work, we provide partially affirmative answers to both questions for 1-hidden-layer networks with fewer than $n$ (sample size) neurons when the activation is smooth. First, we prove that as long as the width $m \geq 2n/d$ (where $d$ is the input dimension), its expressivity is strong, i.e., there exists at least one global minimizer with zero training loss. Second, we identify a nice local region with no local-min or saddle points. Nevertheless, it is not clear whether gradient descent can stay in this nice region. Third, we consider a constrained optimization formulation where the feasible region is the nice local region, and prove that every KKT point is a nearly global minimizer. It is expected that projected gradient methods converge to KKT points under mild technical conditions, but we leave the rigorous convergence analysis to future work. Thorough numerical results show that projected gradient methods on this constrained formulation significantly outperform SGD for training narrow neural nets.
A sharp uniform-in-time error estimate for Stochastic Gradient Langevin Dynamics
The Stochastic Gradient Langevin Dynamics (SGLD) [49], first proposed by Welling and Teh, has drawn great attention of researchers when dealing with optimization or sampling tasks[2, 33, 40]. As a samplingalgorithm, SGLD canbe viewed asa"randombatch"version of the Unadjusted Langevin Algorithm (ULA), which is the Euler-Maruyama discretization of the Langevin diffusion, a stochastic process converging to a target Gibbs' distribution under suitable settings. As an optimization algorithm, SGLD can be viewed as a variant of the classical Stochastic Gradient Descent (SGD) [44], by adding independent Gaussian noise in each iteration of SGD. At recent decades, SGD and its variants [44, 25, 11, 37] have received a great deal of attention when solving high-dimensional tasks, ranging from computer vision, natural language processing, to high dimensional sampling, statistical optimization, etc. Also much theoretical analysis for SGD has been done by former researchers, including loss landscape of SGD iteration [46, 47], its dynamical stability [50] and diffusion approximation [32, 21, 17]. The combination of the SGD algorithm and the Langevin diffusion, can improve the behavior of both methods: for SGD, by taking another independent diffusion term into consideration, though not converging to a fixed point, the algorithm may be able to admit better ergodic properties and obtain better performance near saddle points [26, 52]. Besides, the application of the methodology of random mini-batch to Langevin diffusion could result in some efficient methods that could reduce computational cost while preserving the dynamical and statistical properties. Examples include the SGLD algorithm we study in the paper and the random batch methods for interacting particle systems [22, 23].
Sequential Gradient Descent and Quasi-Newton's Method for Change-Point Analysis
One common approach to detecting change-points is minimizing a cost function over possible numbers and locations of change-points. The framework includes several well-established procedures, such as the penalized likelihood and minimum description length. Such an approach requires finding the cost value repeatedly over different segments of the data set, which can be time-consuming when (i) the data sequence is long and (ii) obtaining the cost value involves solving a non-trivial optimization problem. This paper introduces a new sequential method (SE) that can be coupled with gradient descent (SeGD) and quasi-Newton's method (SeN) to find the cost value effectively. The core idea is to update the cost value using the information from previous steps without re-optimizing the objective function. The new method is applied to change-point detection in generalized linear models and penalized regression. Numerical studies show that the new approach can be orders of magnitude faster than the Pruned Exact Linear Time (PELT) method without sacrificing estimation accuracy.
Refined Convergence and Topology Learning for Decentralized SGD with Heterogeneous Data
Bars, Batiste Le, Bellet, Aurélien, Tommasi, Marc, Lavoie, Erick, Kermarrec, Anne-Marie
One of the key challenges in decentralized and federated learning is to design algorithms that efficiently deal with highly heterogeneous data distributions across agents. In this paper, we revisit the analysis of the popular Decentralized Stochastic Gradient Descent algorithm (D-SGD) under data heterogeneity. We exhibit the key role played by a new quantity, called neighborhood heterogeneity, on the convergence rate of D-SGD. By coupling the communication topology and the heterogeneity, our analysis sheds light on the poorly understood interplay between these two concepts. We then argue that neighborhood heterogeneity provides a natural criterion to learn data-dependent topologies that reduce (and can even eliminate) the otherwise detrimental effect of data heterogeneity on the convergence time of D-SGD. For the important case of classification with label skew, we formulate the problem of learning such a good topology as a tractable optimization problem that we solve with a Frank-Wolfe algorithm. As illustrated over a set of simulated and real-world experiments, our approach provides a principled way to design a sparse topology that balances the convergence speed and the per-iteration communication costs of D-SGD under data heterogeneity.
PSO-PINN: Physics-Informed Neural Networks Trained with Particle Swarm Optimization
Davi, Caio, Braga-Neto, Ulisses
Physics-informed neural networks (PINN) have recently emerged as a promising application of deep learning in a wide range of engineering and scientific problems based on partial differential equation (PDE) models. However, evidence shows that PINN training by gradient descent displays pathologies that often prevent convergence when solving PDEs with irregular solutions. In this paper, we propose the use of a particle swarm optimization (PSO) approach to train PINNs. The resulting PSO-PINN algorithm not only mitigates the undesired behaviors of PINNs trained with standard gradient descent but also presents an ensemble approach to PINN that affords the possibility of robust predictions with quantified uncertainty. We also propose PSO-BP-CD (PSO with Back-Propagation and Coefficient Decay), a hybrid PSO variant that combines swarm optimization with gradient descent, putting more weight on the latter as training progresses and the swarm zeros in on a good local optimum. Comprehensive experimental results show that PSO-PINN with the proposed PSO-BP-CD algorithm outperforms PINN ensembles trained with other PSO variants or with pure gradient descent.
TabNAS: Rejection Sampling for Neural Architecture Search on Tabular Datasets
Yang, Chengrun, Bender, Gabriel, Liu, Hanxiao, Kindermans, Pieter-Jan, Udell, Madeleine, Lu, Yifeng, Le, Quoc, Huang, Da
The best neural architecture for a given machine learning problem depends on many factors: not only the complexity and structure of the dataset, but also on resource constraints including latency, compute, energy consumption, etc. Neural architecture search (NAS) for tabular datasets is an important but under-explored problem. Previous NAS algorithms designed for image search spaces incorporate resource constraints directly into the reinforcement learning (RL) rewards. However, for NAS on tabular datasets, this protocol often discovers suboptimal architectures. This paper develops TabNAS, a new and more effective approach to handle resource constraints in tabular NAS using an RL controller motivated by the idea of rejection sampling. TabNAS immediately discards any architecture that violates the resource constraints without training or learning from that architecture. TabNAS uses a Monte-Carlo-based correction to the RL policy gradient update to account for this extra filtering step. Results on several tabular datasets demonstrate the superiority of TabNAS over previous reward-shaping methods: it finds better models that obey the constraints.
A note on diffusion limits for stochastic gradient descent
Lanconelli, Alberto, Lauria, Christopher S. A.
In the machine learning literature stochastic gradient descent has recently been widely discussed for its purported implicit regularization properties. Much of the theory, that attempts to clarify the role of noise in stochastic gradient algorithms, has widely approximated stochastic gradient descent by a stochastic differential equation with Gaussian noise. We provide a novel rigorous theoretical justification for this practice that showcases how the Gaussianity of the noise arises naturally.
Learning to Reason with Neural Networks: Generalization, Unseen Data and Boolean Measures
Abbe, Emmanuel, Bengio, Samy, Cornacchia, Elisabetta, Kleinberg, Jon, Lotfi, Aryo, Raghu, Maithra, Zhang, Chiyuan
This paper considers the Pointer Value Retrieval (PVR) benchmark introduced in [ZRKB21], where a 'reasoning' function acts on a string of digits to produce the label. More generally, the paper considers the learning of logical functions with gradient descent (GD) on neural networks. It is first shown that in order to learn logical functions with gradient descent on symmetric neural networks, the generalization error can be lower-bounded in terms of the noise-stability of the target function, supporting a conjecture made in [ZRKB21]. It is then shown that in the distribution shift setting, when the data withholding corresponds to freezing a single feature (referred to as canonical holdout), the generalization error of gradient descent admits a tight characterization in terms of the Boolean influence for several relevant architectures. This is shown on linear models and supported experimentally on other models such as MLPs and Transformers. In particular, this puts forward the hypothesis that for such architectures and for learning logical functions such as PVR functions, GD tends to have an implicit bias towards low-degree representations, which in turn gives the Boolean influence for the generalization error under quadratic loss.
A Unified Convergence Theorem for Stochastic Optimization Methods
In this work, we provide a fundamental unified convergence theorem used for deriving expected and almost sure convergence results for a series of stochastic optimization methods. Our unified theorem only requires to verify several representative conditions and is not tailored to any specific algorithm. As a direct application, we recover expected and almost sure convergence results of the stochastic gradient method (SGD) and random reshuffling (RR) under more general settings. Moreover, we establish new expected and almost sure convergence results for the stochastic proximal gradient method (prox-SGD) and stochastic model-based methods (SMM) for nonsmooth nonconvex optimization problems. These applications reveal that our unified theorem provides a plugin-type convergence analysis and strong convergence guarantees for a wide class of stochastic optimization methods.
Muffliato: Peer-to-Peer Privacy Amplification for Decentralized Optimization and Averaging
Cyffers, Edwige, Even, Mathieu, Bellet, Aurélien, Massoulié, Laurent
Decentralized optimization is increasingly popular in machine learning for its scalability and efficiency. Intuitively, it should also provide better privacy guarantees, as nodes only observe the messages sent by their neighbors in the network graph. But formalizing and quantifying this gain is challenging: existing results are typically limited to Local Differential Privacy (LDP) guarantees that overlook the advantages of decentralization. In this work, we introduce pairwise network differential privacy, a relaxation of LDP that captures the fact that the privacy leakage from a node $u$ to a node $v$ may depend on their relative position in the graph. We then analyze the combination of local noise injection with (simple or randomized) gossip averaging protocols on fixed and random communication graphs. We also derive a differentially private decentralized optimization algorithm that alternates between local gradient descent steps and gossip averaging. Our results show that our algorithms amplify privacy guarantees as a function of the distance between nodes in the graph, matching the privacy-utility trade-off of the trusted curator, up to factors that explicitly depend on the graph topology. Finally, we illustrate our privacy gains with experiments on synthetic and real-world datasets.