AITopics

2102.05274

Country: North America > United States > New York > Suffolk County > Stony Brook (0.04)

Genre: Research Report > New Finding (0.48)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)

Ambrogioni, Luca, Silvestri, Gianluigi, van Gerven, Marcel

Automatic variational inference with cascading flows

The automation of probabilistic reasoning is one of the primary aims of machine learning. Recently, the confluence of variational inference and deep learning has led to powerful and flexible automatic inference methods that can be trained by stochastic gradient descent. In particular, normalizing flows are highly parameterized deep models that can fit arbitrarily complex posterior densities. However, normalizing flows struggle in highly structured probabilistic programs as they need to relearn the forward-pass of the program. Automatic structured variational inference (ASVI) remedies this problem by constructing variational programs that embed the forward-pass. Here, we combine the flexibility of normalizing flows and the prior-embedding property of ASVI in a new family of variational programs, which we named cascading flows. A cascading flows program interposes a newly designed highway flow architecture in between the conditional distributions of the prior program such as to steer it toward the observed data. These programs can be constructed automatically from an input probabilistic program and can also be amortized automatically. We evaluate the performance of the new variational programs in a series of structured inference problems. We find that cascading flows have much higher performance than both normalizing flows and ASVI in a large set of structured inference problems.

deep learning, neural network, upstream oil & gas, (16 more...)

2102.04801

Genre: Research Report (0.40)

Industry: Energy > Oil & Gas > Upstream (0.35)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.54)

A Single-Timescale Stochastic Bilevel Optimization Method

Chen, Tianyi, Sun, Yuejiao, Yin, Wotao

Stochastic bilevel optimization generalizes the classic stochastic optimization from the minimization of a single objective to the minimization of an objective function that depends the solution of another optimization problem. Recently, stochastic bilevel optimization is regaining popularity in emerging machine learning applications such as hyper-parameter optimization and model-agnostic meta learning. To solve this class of stochastic optimization problems, existing methods require either double-loop or two-timescale updates, which are sometimes less efficient. This paper develops a new optimization method for a class of stochastic bilevel problems that we term Single-Timescale stochAstic BiLevEl optimization (STABLE) method. STABLE runs in a single loop fashion, and uses a single-timescale update with a fixed batch size. To achieve an $\epsilon$-stationary point of the bilevel problem, STABLE requires ${\cal O}(\epsilon^{-2})$ samples in total; and to achieve an $\epsilon$-optimal solution in the strongly convex case, STABLE requires ${\cal O}(\epsilon^{-1})$ samples. To the best of our knowledge, this is the first bilevel optimization algorithm achieving the same order of sample complexity as the stochastic gradient descent method for the single-level stochastic optimization.

algorithm, bilevel problem, optimization, (14 more...)

2102.04671

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.14)
Europe > Austria > Vienna (0.14)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
(6 more...)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)

Tang, Wenpin, Zhou, Xun Yu

Simulated annealing from continuum to discretization: a convergence analysis via the Eyring--Kramers law

We study the convergence rate of continuous-time simulated annealing $(X_t; \, t \ge 0)$ and its discretization $(x_k; \, k =0,1, \ldots)$ for approximating the global optimum of a given function $f$. We prove that the tail probability $\mathbb{P}(f(X_t) > \min f +\delta)$ (resp. $\mathbb{P}(f(x_k) > \min f +\delta)$) decays polynomial in time (resp. in cumulative step size), and provide an explicit rate as a function of the model parameters. Our argument applies the recent development on functional inequalities for the Gibbs measure at low temperatures -- the Eyring-Kramers law. In the discrete setting, we obtain a condition on the step size to ensure the convergence.

convergence rate, equation, inequality, (15 more...)

2102.02339

Country:

Asia > Middle East > Jordan (0.04)
North America > United States (0.04)
Europe > Germany > Saxony > Leipzig (0.04)
Europe > France > Occitanie > Haute-Garonne > Toulouse (0.04)

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.87)

Stability and Generalization of the Decentralized Stochastic Gradient Descent

Sun, Tao, Li, Dongsheng, Wang, Bao

The stability and generalization of stochastic gradient-based methods provide valuable insights into understanding the algorithmic performance of machine learning models. As the main workhorse for deep learning, stochastic gradient descent has received a considerable amount of studies. Nevertheless, the community paid little attention to its decentralized variants. In this paper, we provide a novel formulation of the decentralized stochastic gradient descent. Leveraging this formulation together with (non)convex optimization theory, we establish the first stability and generalization guarantees for the decentralized stochastic gradient descent. Our theoretical results are built on top of a few common and mild assumptions and reveal that the decentralization deteriorates the stability of SGD for the first time. We verify our theoretical findings by using a variety of decentralized settings and benchmark machine learning models.

d-sgd, generalization, stability, (11 more...)

2102.01302

Country:

Europe > Montenegro (0.04)
North America > United States (0.04)
Asia > China > Hunan Province (0.04)

Genre: Research Report > New Finding (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Chatterji, Niladri S., Long, Philip M., Bartlett, Peter L.

When does gradient descent with logistic loss interpolate using deep networks with smoothed ReLU activations?

arXiv.org Artificial IntelligenceFeb-9-2021

Interest in the properties of interpolating deep learning m odels trained with first-order optimization methods is surging [ Zha 17a; Bel 19 ]. One important question is to understand how gradient descent with appropriate random initialization r outinely finds interpolating (near-zero training loss) solutions to these non-convex optimization problems. In this paper our focus is to understand when gradient descen t drives the logistic loss to zero when applied to fixed-width deep networks using smooth a pproximations to the ReLU activation function. We derive upper bounds on the rate of co nvergence under two conditions. The first result only requires that the initial loss is small, but does not require any assumption about the width of the network. It guarantees that if the init ial loss is small then gradient descent drives the logistic loss down to zero.

nullnull, nullv, probability, (15 more...)

arXiv.org Artificial Intelligence

2102.04998

Country:

North America > United States > California > Alameda County > Berkeley (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
North America > United States > Montana (0.04)

Genre: Research Report (0.49)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.90)

Hu, Zhengmian, Huang, Feihu, Huang, Heng

A New Framework for Variance-Reduced Hamiltonian Monte Carlo

arXiv.org Machine LearningFeb-8-2021

We propose a new framework of variance-reduced Hamiltonian Monte Carlo (HMC) methods for sampling from an $L$-smooth and $m$-strongly log-concave distribution, based on a unified formulation of biased and unbiased variance reduction methods. We study the convergence properties for HMC with gradient estimators which satisfy the Mean-Squared-Error-Bias (MSEB) property. We show that the unbiased gradient estimators, including SAGA and SVRG, based HMC methods achieve highest gradient efficiency with small batch size under high precision regime, and require $\tilde{O}(N + \kappa^2 d^{\frac{1}{2}} \varepsilon^{-1} + N^{\frac{2}{3}} \kappa^{\frac{4}{3}} d^{\frac{1}{3}} \varepsilon^{-\frac{2}{3}} )$ gradient complexity to achieve $\epsilon$-accuracy in 2-Wasserstein distance. Moreover, our HMC methods with biased gradient estimators, such as SARAH and SARGE, require $\tilde{O}(N+\sqrt{N} \kappa^2 d^{\frac{1}{2}} \varepsilon^{-1})$ gradient complexity, which has the same dependency on condition number $\kappa$ and dimension $d$ as full gradient method, but improves the dependency of sample size $N$ for a factor of $N^\frac{1}{2}$. Experimental results on both synthetic and real-world benchmark data show that our new framework significantly outperforms the full gradient and stochastic gradient HMC approaches. The earliest version of this paper was submitted to ICML 2020 with three weak accept but was not finally accepted.

algorithm, gradient complexity, gradient estimator, (13 more...)

2102.04613

Country:

North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.90)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.50)

Jhunjhunwala, Divyansh, Gadhikar, Advait, Joshi, Gauri, Eldar, Yonina C.

Adaptive Quantization of Model Updates for Communication-Efficient Federated Learning

arXiv.org Machine LearningFeb-8-2021

Communication of model updates between client nodes and the central aggregating server is a major bottleneck in federated learning, especially in bandwidth-limited settings and high-dimensional models. Gradient quantization is an effective way of reducing the number of bits required to communicate each model update, albeit at the cost of having a higher error floor due to the higher variance of the stochastic gradients. In this work, we propose an adaptive quantization strategy called AdaQuantFL that aims to achieve communication efficiency as well as a low error floor by changing the number of quantization levels during the course of training. Experiments on training deep neural networks show that our method can converge in much fewer communicated bits as compared to fixed quantization level setups, with little or no impact on training and test accuracy.

adaquantfl, experiment, quantization level, (12 more...)

2102.04487

Country:

North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
Asia > Middle East > Israel (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

Paquette, Courtney, Lee, Kiwon, Pedregosa, Fabian, Paquette, Elliot

SGD in the Large: Average-case Analysis, Asymptotics, and Stepsize Criticality

arXiv.org Machine LearningFeb-8-2021

We propose a new framework, inspired by random matrix theory, for analyzing the dynamics of stochastic gradient descent (SGD) when both number of samples and dimensions are large. This framework applies to any fixed stepsize and the finite sum setting. Using this new framework, we show that the dynamics of SGD on a least squares problem with random data become deterministic in the large sample and dimensional limit. Furthermore, the limiting dynamics are governed by a Volterra integral equation. This model predicts that SGD undergoes a phase transition at an explicitly given critical stepsize that ultimately affects its convergence rate, which we also verify experimentally. Finally, when input data is isotropic, we provide explicit expressions for the dynamics and average-case convergence rates (i.e., the complexity of an algorithm averaged over all possible inputs). These rates show significant improvement over the worst-case complexities.

overwhelming probability, probability, volterra equation, (14 more...)

2102.04396

Country:

North America > Canada > Quebec > Montreal (0.14)
North America > United States > New York (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
(4 more...)

Genre: Research Report (0.50)

Technology:

Information Technology > Mathematics of Computing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.92)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.68)

arXiv.org Artificial IntelligenceFeb-8-2021

Last-iterate Convergence of Decentralized Optimistic Gradient Descent/Ascent in Infinite-horizon Competitive Markov Games

Wei, Chen-Yu, Lee, Chung-Wei, Zhang, Mengxiao, Luo, Haipeng

We study infinite-horizon discounted two-player zero-sum Markov games, and develop a decentralized algorithm that provably converges to the set of Nash equilibria under self-play. Our algorithm is based on running an Optimistic Gradient Descent Ascent algorithm on each state to learn the policies, with a critic that slowly learns the value of each state. To the best of our knowledge, this is the first algorithm in this setting that is simultaneously rational (converging to the opponent's best response when it uses a stationary policy), convergent (converging to the set of Nash equilibria under self-play), agnostic (no need to know the actions played by the opponent), symmetric (players taking symmetric roles in the algorithm), and enjoying a finite-time last-iterate convergence guarantee, all of which are desirable properties of decentralized algorithms.

algorithm, dist 2, onvergence, (16 more...)

arXiv.org Artificial Intelligence

2102.0454

Country:

North America > United States > California (0.14)
Europe > Russia (0.04)
Asia > Russia (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.40)

Industry: Leisure & Entertainment > Games (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.60)