AITopics

2301.13703

Country:

Europe > Switzerland > Vaud > Lausanne (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > United States > Hawaii > Honolulu County > Honolulu (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.85)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.75)

Mei, Jonathan, Moreno, Alexander, Walters, Luke

KrADagrad: Kronecker Approximation-Domination Gradient Preconditioned Stochastic Optimization

arXiv.org Artificial IntelligenceMay-30-2023

Second order stochastic optimizers allow parameter update step size and direction to adapt to loss curvature, but have traditionally required too much memory and compute for deep learning. Recently, Shampoo [Gupta et al., 2018] introduced a Kronecker factored preconditioner to reduce these requirements: it is used for large deep models [Anil et al., 2020] and in production [Anil et al., 2022]. However, it takes inverse matrix roots of ill-conditioned matrices. This requires 64-bit precision, imposing strong hardware constraints. In this paper, we propose a novel factorization, Kronecker Approximation-Domination (KrAD). Using KrAD, we update a matrix that directly approximates the inverse empirical Fisher matrix (like full matrix AdaGrad), avoiding inversion and hence 64-bit precision. We then propose KrADagrad$^\star$, with similar computational costs to Shampoo and the same regret. Synthetic ill-conditioned experiments show improved performance over Shampoo for 32-bit precision, while for several real datasets we have comparable or better generalization.

artificial intelligence, machine learning, matrix, (18 more...)

2305.19416

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Netherlands > South Holland > Dordrecht (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.46)

Pogodin, Roman, Cornford, Jonathan, Ghosh, Arna, Gidel, Gauthier, Lajoie, Guillaume, Richards, Blake

Synaptic Weight Distributions Depend on the Geometry of Plasticity

arXiv.org Artificial IntelligenceMay-30-2023

Most learning algorithms in machine learning rely on gradient descent to adjust model parameters, and a growing literature in computational neuroscience leverages these ideas to study synaptic plasticity in the brain. However, the vast majority of this work ignores a critical underlying assumption: the choice of distance for synaptic changes (i.e. the geometry of synaptic plasticity). Gradient descent assumes that the distance is Euclidean, but many other distances are possible, and there is no reason that biology necessarily uses Euclidean geometry. Here, using the theoretical tools provided by mirror descent, we show that, regardless of the loss being minimized, the distribution of synaptic weights will depend on the geometry of synaptic plasticity. We use these results to show that experimentally-observed log-normal weight distributions found in several brain areas are not consistent with standard gradient descent (i.e. a Euclidean geometry), but rather with non-Euclidean distances. Finally, we show that it should be possible to experimentally test for different synaptic geometries by comparing synaptic weight distributions before and after learning. Overall, this work shows that the current paradigm in theoretical work on synaptic plasticity that assumes Euclidean synaptic geometry may be misguided and that it should be possible to experimentally determine the true geometry of synaptic plasticity in the brain.

artificial intelligence, deep learning, machine learning, (17 more...)

2305.19394

Country:

North America > Canada > Quebec > Montreal (0.14)
North America > United States (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
(2 more...)

Genre: Research Report > New Finding (0.93)

Industry: Health & Medicine > Therapeutic Area > Neurology (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

A Novel Noise Injection-based Training Scheme for Better Model Robustness

Zhang, Zeliang, Jiang, Jinyang, Chen, Minjie, Wang, Zhiyuan, Peng, Yijie, Yu, Zhaofei

Noise injection-based method has been shown to be able to improve the robustness of artificial neural networks in previous work. In this work, we propose a novel noise injection-based training scheme for better model robustness. Specifically, we first develop a likelihood ratio method to estimate the gradient with respect to both synaptic weights and noise levels for stochastic gradient descent training. Then, we design an approximation for the vanilla noise injection-based training method to reduce memory and improve computational efficiency. Next, we apply our proposed scheme to spiking neural networks and evaluate the performance of classification accuracy and robustness on MNIST and Fashion-MNIST datasets. Experiment results show that our proposed method achieves a much better performance on adversarial robustness and slightly better performance on original accuracy, compared with the conventional gradient-based training method.

adversarial attack, artificial intelligence, machine learning, (19 more...)

2302.10802

Country: Asia > China (0.28)

Genre: Research Report > New Finding (0.34)

Industry:

Information Technology (0.96)
Energy > Oil & Gas (0.93)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Provable and Practical: Efficient Exploration in Reinforcement Learning via Langevin Monte Carlo

Ishfaq, Haque, Lan, Qingfeng, Xu, Pan, Mahmood, A. Rupam, Precup, Doina, Anandkumar, Anima, Azizzadenesheli, Kamyar

We present a scalable and effective exploration strategy based on Thompson sampling for reinforcement learning (RL). One of the key shortcomings of existing Thompson sampling algorithms is the need to perform a Gaussian approximation of the posterior distribution, which is not a good surrogate in most practical settings. We instead directly sample the Q function from its posterior distribution, by using Langevin Monte Carlo, an efficient type of Markov Chain Monte Carlo (MCMC) method. Our method only needs to perform noisy gradient descent updates to learn the exact posterior distribution of the Q function, which makes our approach easy to deploy in deep RL. We provide a rigorous theoretical analysis for the proposed method and demonstrate that, in the linear Markov decision process (linear MDP) setting, it has a regret bound of $\tilde{O}(d^{3/2}H^{5/2}\sqrt{T})$, where $d$ is the dimension of the feature mapping, $H$ is the planning horizon, and $T$ is the total number of steps. We apply this approach to deep RL, by using Adam optimizer to perform gradient updates. Our approach achieves better or similar results compared with state-of-the-art deep RL algorithms on several challenging exploration tasks from the Atari57 suite.

artificial intelligence, machine learning, reinforcement learning, (14 more...)

2305.18246

Country:

North America > Canada > Alberta (0.14)
North America > Canada > Quebec (0.14)
North America > United States (0.14)

Genre: Research Report (1.00)

Industry:

Leisure & Entertainment (0.46)
Energy > Oil & Gas > Upstream (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.54)

Arnaboldi, Luca, Krzakala, Florent, Loureiro, Bruno, Stephan, Ludovic

Escaping mediocrity: how two-layer networks learn hard single-index models with SGD

This study explores the sample complexity for two-layer neural networks to learn a single-index target function under Stochastic Gradient Descent (SGD), focusing on the challenging regime where many flat directions are present at initialization. It is well-established that in this scenario $n=O(d\log{d})$ samples are typically needed. However, we provide precise results concerning the pre-factors in high-dimensional contexts and for varying widths. Notably, our findings suggest that overparameterization can only enhance convergence by a constant factor within this problem class. These insights are grounded in the reduction of SGD dynamics to a stochastic process in lower dimensions, where escaping mediocrity equates to calculating an exit time. Yet, we demonstrate that a deterministic approximation of this process adequately represents the escape time, implying that the role of stochasticity may be minimal in this scenario.

artificial intelligence, initialization, machine learning, (17 more...)

2305.18502

Country:

North America > United States (0.14)
Africa > Middle East > Tunisia > Ben Arous Governorate > Ben Arous (0.04)
Europe > Switzerland > Vaud > Lausanne (0.04)
Europe > France (0.04)

Genre: Research Report > New Finding (0.86)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)

STSyn: Speeding Up Local SGD with Straggler-Tolerant Synchronization

Zhu, Feng, Zhang, Jingjing, Wang, Xin

Synchronous local stochastic gradient descent (local SGD) suffers from some workers being idle and random delays due to slow and straggling workers, as it waits for the workers to complete the same amount of local updates. In this paper, to mitigate stragglers and improve communication efficiency, a novel local SGD strategy, named STSyn, is developed. The key point is to wait for the $K$ fastest workers, while keeping all the workers computing continually at each synchronization round, and making full use of any effective (completed) local update of each worker regardless of stragglers. An analysis of the average wall-clock time, average number of local updates and average number of uploading workers per round is provided to gauge the performance of STSyn. The convergence of STSyn is also rigorously established even when the objective function is nonconvex. Experimental results show the superiority of the proposed STSyn against state-of-the-art schemes through utilization of the straggler-tolerant technique and additional effective local updates at each worker, and the influence of system parameters is studied. By waiting for faster workers and allowing heterogeneous synchronization with different numbers of local updates across workers, STSyn provides substantial improvements both in time and communication efficiency.

artificial intelligence, local update, machine learning, (16 more...)

2210.03521

Country:

Asia > China > Shanghai > Shanghai (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)

On the Global Convergence of Risk-Averse Policy Gradient Methods with Expected Conditional Risk Measures

Yu, Xian, Ying, Lei

Risk-sensitive reinforcement learning (RL) has become a popular tool to control the risk of uncertain outcomes and ensure reliable performance in various sequential decision-making problems. While policy gradient methods have been developed for risk-sensitive RL, it remains unclear if these methods enjoy the same global convergence guarantees as in the risk-neutral case. In this paper, we consider a class of dynamic time-consistent risk measures, called Expected Conditional Risk Measures (ECRMs), and derive policy gradient updates for ECRM-based objective functions. Under both constrained direct parameterization and unconstrained softmax parameterization, we provide global convergence and iteration complexities of the corresponding risk-averse policy gradient algorithms. We further test risk-averse variants of REINFORCE and actor-critic algorithms to demonstrate the efficacy of our method and the importance of risk control.

machine learning, parameterization, reinforcement learning, (16 more...)

2301.10932

Country:

North America > United States > Michigan > Washtenaw County > Ann Arbor (0.14)
North America > United States > Ohio > Franklin County > Columbus (0.04)
Europe > Russia (0.04)
Asia > Russia (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.49)

Becker, Evan, Pandit, Parthe, Rangan, Sundeep, Fletcher, Alyson K.

Local Convergence of Gradient Descent-Ascent for Training Generative Adversarial Networks

Generative Adversarial Networks (GANs) are a popular formulation to train generative models for complex high dimensional data. The standard method for training GANs involves a gradient descent-ascent (GDA) procedure on a minimax optimization problem. This procedure is hard to analyze in general due to the nonlinear nature of the dynamics. We study the local dynamics of GDA for training a GAN with a kernel-based discriminator. This convergence analysis is based on a linearization of a non-linear dynamical system that describes the GDA iterations, under an \textit{isolated points model} assumption from [Becker et al. 2022]. Our analysis brings out the effect of the learning rates, regularization, and the bandwidth of the kernel discriminator, on the local convergence rate of GDA. Importantly, we show phase transitions that indicate when the system converges, oscillates, or diverges. We also provide numerical simulations that verify our claims.

artificial intelligence, equation, machine learning, (17 more...)

2305.08277

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.85)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.61)
Information Technology > Artificial Intelligence > Machine Learning > Unsupervised or Indirectly Supervised Learning (0.61)

arXiv.org Machine LearningMay-29-2023

An Accelerated Stochastic Algorithm for Solving the Optimal Transport Problem

Xie, Yiling, Luo, Yiling, Huo, Xiaoming

A primal-dual accelerated stochastic gradient descent with variance reduction algorithm (PDASGD) is proposed to solve linear-constrained optimization problems. PDASGD could be applied to solve the discrete optimal transport (OT) problem and enjoys the best-known computational complexity -- $\widetilde{\mathcal{O}}(n^2/\epsilon)$, where $n$ is the number of atoms, and $\epsilon>0$ is the accuracy. In the literature, some primal-dual accelerated first-order algorithms, e.g., APDAGD, have been proposed and have the order of $\widetilde{\mathcal{O}}(n^{2.5}/\epsilon)$ for solving the OT problem. To understand why our proposed algorithm could improve the rate by a factor of $\widetilde{\mathcal{O}}(\sqrt{n})$, the conditions under which our stochastic algorithm has a lower order of computational complexity for solving linear-constrained optimization problems are discussed. It is demonstrated that the OT problem could satisfy the aforementioned conditions. Numerical experiments demonstrate superior practical performances of the proposed PDASGD algorithm for solving the OT problem.

algorithm, artificial intelligence, machine learning, (18 more...)

arXiv.org Machine Learning

2203.00813

Country:

Asia > Middle East > Jordan (0.04)
Europe > Russia (0.04)
Asia > Russia (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)