Goto

Collaborating Authors

 Gradient Descent


Adaptive Stochastic Gradient Descent for Fast and Communication-Efficient Distributed Learning

arXiv.org Artificial Intelligence

We consider the setting where a master wants to run a distributed stochastic gradient descent (SGD) algorithm on $n$ workers, each having a subset of the data. Distributed SGD may suffer from the effect of stragglers, i.e., slow or unresponsive workers who cause delays. One solution studied in the literature is to wait at each iteration for the responses of the fastest $k


Agnostic Learning of General ReLU Activation Using Gradient Descent

arXiv.org Artificial Intelligence

We provide a convergence analysis of gradient descent for the problem of agnostically learning a single ReLU function under Gaussian distributions. Unlike prior work that studies the setting of zero bias, we consider the more challenging scenario when the bias of the ReLU function is non-zero. Our main result establishes that starting from random initialization, in a polynomial number of iterations gradient descent outputs, with high probability, a ReLU function that achieves a competitive error guarantee when compared to the error of the best ReLU function. We also provide finite sample guarantees, and these techniques generalize to a broader class of marginal distributions beyond Gaussians.


A Nonlinear PID-Enhanced Adaptive Latent Factor Analysis Model

arXiv.org Artificial Intelligence

Abstract--High-dimensional and incomplete (HDI) data holds tremendous interactive information in various industrial applications. A latent factor (LF) model is remarkably effective in extracting valuable information from HDI data with stochastic gradient decent (SGD) algorithm. However, an SGD-based LFA model suffers from slow convergence since it only considers the current learning error. To address this critical issue, this paper proposes a Nonlinear PID-enhanced Adaptive Latent Factor (NPALF) model with two-fold ideas: 1) rebuilding the learning error via considering the past learning errors following the principle of a nonlinear PID controller; b) implementing all parameters adaptation effectively following the principle of a particle swarm optimization (PSO) algorithm. Experience results on four representative HDI datasets indicate that compared with five state-of-the-art LFA models, the NPALF model achieves better convergence rate and prediction accuracy for missing data of an HDI data.


Independent Policy Gradient for Large-Scale Markov Potential Games: Sharper Rates, Function Approximation, and Game-Agnostic Convergence

arXiv.org Artificial Intelligence

We examine global non-asymptotic convergence properties of policy gradient methods for multi-agent reinforcement learning (RL) problems in Markov potential games (MPG). To learn a Nash equilibrium of an MPG in which the size of state space and/or the number of players can be very large, we propose new independent policy gradient algorithms that are run by all players in tandem. When there is no uncertainty in the gradient evaluation, we show that our algorithm finds an $\epsilon$-Nash equilibrium with $O(1/\epsilon^2)$ iteration complexity which does not explicitly depend on the state space size. When the exact gradient is not available, we establish $O(1/\epsilon^5)$ sample complexity bound in a potentially infinitely large state space for a sample-based algorithm that utilizes function approximation. Moreover, we identify a class of independent policy gradient algorithms that enjoys convergence for both zero-sum Markov games and Markov cooperative games with the players that are oblivious to the types of games being played. Finally, we provide computational experiments to corroborate the merits and the effectiveness of our theoretical developments.


On the Learnability of Physical Concepts: Can a Neural Network Understand What's Real?

arXiv.org Artificial Intelligence

We revisit the classic signal-to-symbol barrier in light of the remarkable ability of deep neural networks to generate realistic synthetic data. DeepFakes and spoofing highlight the feebleness of the link between physical reality and its abstract representation, whether learned by a digital computer or a biological agent. Starting from a widely applicable definition of abstract concept, we show that standard feed-forward architectures cannot capture but trivial concepts, regardless of the number of weights and the amount of training data, despite being extremely effective classifiers. On the other hand, architectures that incorporate recursion can represent a significantly larger class of concepts, but may still be unable to learn them from a finite dataset. We qualitatively describe the class of concepts that can be "understood" by modern architectures trained with variants of stochastic gradient descent, using a (free energy) Lagrangian to measure information complexity. Even if a concept has been understood, however, a network has no means of communicating its understanding to an external agent, except through continuous interaction and validation. We then characterize physical objects as abstract concepts and use the previous analysis to show that physical objects can be encoded by finite architectures. However, to understand physical concepts, sensors must provide persistently exciting observations, for which the ability to control the data acquisition process is essential (active perception). The importance of control depends on the modality, benefiting visual more than acoustic or chemical perception. Finally, we conclude that binding physical entities to digital identities is possible in finite time with finite resources, solving in principle the signal-to-symbol barrier problem, but we highlight the need for continuous validation.


A Particle-Based Algorithm for Distributional Optimization on \textit{Constrained Domains} via Variational Transport and Mirror Descent

arXiv.org Artificial Intelligence

We consider the optimization problem of minimizing an objective functional, which admits a variational form and is defined over probability distributions on the constrained domain, which poses challenges to both theoretical analysis and algorithmic design. Inspired by the mirror descent algorithm for constrained optimization, we propose an iterative particle-based algorithm, named Mirrored Variational Transport (mirrorVT), extended from the Variational Transport framework [7] for dealing with the constrained domain. In particular, for each iteration, mirrorVT maps particles to an unconstrained dual domain induced by a mirror map and then approximately perform Wasserstein gradient descent on the manifold of distributions defined over the dual space by pushing particles. At the end of iteration, particles are mapped back to the original constrained domain. Through simulated experiments, we demonstrate the effectiveness of mirrorVT for minimizing the functionals over probability distributions on the simplex- and Euclidean ball-constrained domains. We also analyze its theoretical properties and characterize its convergence to the global minimum of the objective functional.


Adaptive Latent Factor Analysis via Generalized Momentum-Incorporated Particle Swarm Optimization

arXiv.org Artificial Intelligence

Stochastic gradient descent (SGD) algorithm is an effective learning strategy to build a latent factor analysis (LFA) model on a high-dimensional and incomplete (HDI) matrix. A particle swarm optimization (PSO) algorithm is commonly adopted to make an SGD-based LFA model's hyper-parameters, i.e, learning rate and regularization coefficient, self-adaptation. However, a standard PSO algorithm may suffer from accuracy loss caused by premature convergence. To address this issue, this paper incorporates more historical information into each particle's evolutionary process for avoiding premature convergence following the principle of a generalized-momentum (GM) method, thereby innovatively achieving a novel GM-incorporated PSO (GM-PSO). With it, a GM-PSO-based LFA (GMPL) model is further achieved to implement efficient self-adaptation of hyper-parameters. The experimental results on three HDI matrices demonstrate that the GMPL model achieves a higher prediction accuracy for missing data estimation in industrial applications.


SGEM: stochastic gradient with energy and momentum

arXiv.org Artificial Intelligence

In this paper, we propose SGEM, Stochastic Gradient with Energy and Momentum, to solve a large class of general non-convex stochastic optimization problems, based on the AEGD method that originated in the work [AEGD: Adaptive Gradient Descent with Energy. arXiv: 2010.05109]. SGEM incorporates both energy and momentum at the same time so as to inherit their dual advantages. We show that SGEM features an unconditional energy stability property, and derive energy-dependent convergence rates in the general nonconvex stochastic setting, as well as a regret bound in the online convex setting. A lower threshold for the energy variable is also provided. Our experimental results show that SGEM converges faster than AEGD and generalizes better or at least as well as SGDM in training some deep neural networks.


Gradient descent provably escapes saddle points in the training of shallow ReLU networks

arXiv.org Artificial Intelligence

Dynamical systems theory has recently been applied in optimization to prove that gradient descent algorithms avoid so-called strict saddle points of the loss function. However, in many modern machine learning applications, the required regularity conditions are not satisfied. In particular, this is the case for rectified linear unit (ReLU) networks. In this paper, we prove a variant of the relevant dynamical systems result, a center-stable manifold theorem, in which we relax some of the regularity requirements. Then, we verify that shallow ReLU networks fit into the new framework. Building on a classification of critical points of the square integral loss of shallow ReLU networks measured against an affine target function, we deduce that gradient descent avoids most saddle points. We proceed to prove convergence to global minima if the initialization is sufficiently good, which is expressed by an explicit threshold on the limiting loss.


The Power and Limitation of Pretraining-Finetuning for Linear Regression under Covariate Shift

arXiv.org Artificial Intelligence

In transfer learning (Pan and Yang, 2009; Sugiyama and Kawanabe, 2012), an algorithm is provided with abundant data from a source domain and scarce or no data from a target domain, and aims to train a model that generalizes well on the target domain. A simple yet effective approach is to pretrain a model with the rich source data and then finetune the model with the available target data via, e.g., stochastic gradient descent (SGD) (see, e.g., Yosinski et al. (2014)). Despite its wide applicability in practice, the power and limitation of the pretraining-finetuning based transfer learning framework is not fully understood in theory. The focus of this work is to consider this issue in a specific transfer learning setup known as covariate shift (Pan and Yang, 2009; Sugiyama and Kawanabe, 2012), where the source and target distributions differ in their marginal distributions over the input, but coincide in their conditional distribution of the output given the input. Regarding the theory of learning with covariate shift, there exists a rich set of results (Ben-David et al., 2010; Germain et al., 2013; Mansour et al., 2009; Mohri and Muñoz Medina, 2012; Cortes and