AITopics

2307.05906

Country:

North America > United States > Wisconsin > Dane County > Madison (0.04)
Europe > France > Grand Est > Bas-Rhin > Strasbourg (0.04)
Asia > South Korea > Seoul > Seoul (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.92)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.46)

arXiv.org Artificial IntelligenceJul-12-2023

spred: Solving $L_1$ Penalty with SGD

Ziyin, Liu, Wang, Zihao

We propose to minimize a generic differentiable objective with $L_1$ constraint using a simple reparametrization and straightforward stochastic gradient descent. Our proposal is the direct generalization of previous ideas that the $L_1$ penalty may be equivalent to a differentiable reparametrization with weight decay. We prove that the proposed method, \textit{spred}, is an exact differentiable solver of $L_1$ and that the reparametrization trick is completely ``benign" for a generic nonconvex function. Practically, we demonstrate the usefulness of the method in (1) training sparse neural networks to perform gene selection tasks, which involves finding relevant features in a very high dimensional space, and (2) neural network compression task, to which previous attempts at applying the $L_1$-penalty have been unsuccessful. Conceptually, our result bridges the gap between the sparsity in deep learning and conventional statistical learning.

artificial intelligence, deep learning, machine learning, (16 more...)

2210.01212

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.04)

Genre: Research Report > New Finding (0.48)

Industry: Health & Medicine > Therapeutic Area > Oncology (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.91)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)

Li, Shiying, Moosmueller, Caroline

Measure transfer via stochastic slicing and matching

arXiv.org Machine LearningJul-11-2023

This paper studies iterative schemes for measure transfer and approximation problems, which are defined through a slicing-and-matching procedure. Similar to the sliced Wasserstein distance, these schemes benefit from the availability of closed-form solutions for the one-dimensional optimal transport problem and the associated computational advantages. While such schemes have already been successfully utilized in data science applications, not too many results on their convergence are available. The main contribution of this paper is an almost sure convergence proof for stochastic slicing-and-matching schemes. The proof builds on an interpretation as a stochastic gradient descent scheme on the Wasserstein space. Numerical examples on step-wise image morphing are demonstrated as well.

artificial intelligence, machine learning, wasserstein distance, (16 more...)

arXiv.org Machine Learning

2307.05705

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
North America > United States > North Carolina > Orange County > Chapel Hill (0.04)
North America > United States > Michigan (0.04)
(2 more...)

Genre: Research Report (0.84)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.57)
Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.49)

Rosca, Mihaela, Deisenroth, Marc Peter

Implicit regularisation in stochastic gradient descent: from single-objective to two-player games

Recent years have seen many insights on deep learning optimisation being brought forward by finding implicit regularisation effects of commonly used gradient-based optimisers. Understanding implicit regularisation can not only shed light on optimisation dynamics, but it can also be used to improve performance and stability across problem domains, from supervised learning to two-player games such as Generative Adversarial Networks. An avenue for finding such implicit regularisation effects has been quantifying the discretisation errors of discrete optimisers via continuous-time flows constructed by backward error analysis (BEA). The current usage of BEA is not without limitations, since not all the vector fields of continuous-time flows obtained using BEA can be written as a gradient, hindering the construction of modified losses revealing implicit regularisers. In this work, we provide a novel approach to use BEA, and show how our approach can be used to construct continuous-time flows with vector fields that can be written as gradients. We then use this to find previously unknown implicit regularisation effects, such as those induced by multiple stochastic gradient descent steps while accounting for the exact data batches used in the updates, and in generally differentiable two-player games.

artificial intelligence, machine learning, mplicit regularisation, (13 more...)

2307.05789

Genre: Research Report (0.84)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Pointwise convergence theorem of gradient descent in sparse deep neural network

Yoneda, Tsuyoshi

The theoretical structure of deep neural network (DNN) has been clarified gradually. Imaizumi-Fukumizu (2019) and Suzuki (2019) clarified that the learning ability of DNN is superior to the previous theories when the target function is non-smooth functions. However, as far as the author is aware, none of the numerous works to date attempted to mathematically investigate what kind of DNN architectures really induce pointwise convergence of gradient descent (without any statistical argument), and this attempt seems to be closer to the practical DNNs. In this paper we restrict target functions to non-smooth indicator functions, and construct a deep neural network inducing pointwise convergence provided by gradient descent process in ReLU-DNN. The DNN has a sparse and a special shape, with certain variable transformations.

artificial intelligence, machine learning, target function, (16 more...)

2304.08172

Country:

Asia > Japan > Honshū > Tōhoku (0.04)
North America > United States > California > Los Angeles County > Long Beach (0.04)
Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.04)

Genre:

Research Report (0.64)
Instructional Material (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.94)

Borovykh, Anastasia, Kantas, Nikolas, Parpas, Panos, Pavliotis, Greg

Privacy Risk for anisotropic Langevin dynamics using relative entropy bounds

The privacy preserving properties of Langevin dynamics with additive isotropic noise have been extensively studied. However, the isotropic noise assumption is very restrictive: (a) when adding noise to existing learning algorithms to preserve privacy and maintain the best possible accuracy one should take into account the relative magnitude of the outputs and their correlations; (b) popular algorithms such as stochastic gradient descent (and their continuous time limits) appear to possess anisotropic covariance properties. To study the privacy risks for the anisotropic noise case, one requires general results on the relative entropy between the laws of two Stochastic Differential Equations with different drifts and diffusion coefficients. Our main contribution is to establish such a bound using stability estimates for solutions to the Fokker-Planck equations via functional inequalities. With additional assumptions, the relative entropy bound implies an $(\epsilon,\delta)$-differential privacy bound or translates to bounds on the membership inference attack success and we show how anisotropic noise can lead to better privacy-accuracy trade-offs. Finally, the benefits of anisotropic noise are illustrated using numerical results in quadratic loss and neural network setups.

artificial intelligence, machine learning, relative entropy, (14 more...)

2302.00766

Country:

Europe > United Kingdom > England > Greater London > London (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.54)

Chen, Xuxing, Balasubramanian, Krishnakumar, Ghadimi, Saeed

Stochastic Nested Compositional Bi-level Optimization for Robust Feature Learning

We develop and analyze stochastic approximation algorithms for solving nested compositional bi-level optimization problems. These problems involve a nested composition of $T$ potentially non-convex smooth functions in the upper-level, and a smooth and strongly convex function in the lower-level. Our proposed algorithm does not rely on matrix inversions or mini-batches and can achieve an $\epsilon$-stationary solution with an oracle complexity of approximately $\tilde{O}_T(1/\epsilon^{2})$, assuming the availability of stochastic first-order oracles for the individual functions in the composition and the lower-level, which are unbiased and have bounded moments. Here, $\tilde{O}_T$ hides polylog factors and constants that depend on $T$. The key challenge we address in establishing this result relates to handling three distinct sources of bias in the stochastic gradients. The first source arises from the compositional nature of the upper-level, the second stems from the bi-level structure, and the third emerges due to the utilization of Neumann series approximations to avoid matrix inversion. To demonstrate the effectiveness of our approach, we apply it to the problem of robust feature learning for deep neural networks under covariate shift, showcasing the benefits and advantages of our methodology in that context.

artificial intelligence, machine learning, optimization, (18 more...)

2307.05384

Country:

North America > United States > California > Yolo County > Davis (0.04)
Europe > Russia > Central Federal District > Moscow Oblast > Moscow (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

The Implicit Bias of Batch Normalization in Linear Models and Two-layer Linear Convolutional Neural Networks

Cao, Yuan, Zou, Difan, Li, Yuanzhi, Gu, Quanquan

We study the implicit bias of batch normalization trained by gradient descent. We show that when learning a linear model with batch normalization for binary classification, gradient descent converges to a uniform margin classifier on the training data with an $\exp(-\Omega(\log^2 t))$ convergence rate. This distinguishes linear models with batch normalization from those without batch normalization in terms of both the type of implicit bias and the convergence rate. We further extend our result to a class of two-layer, single-filter linear convolutional neural networks, and show that batch normalization has an implicit bias towards a patch-wise uniform margin. Based on two examples, we demonstrate that patch-wise uniform margin classifiers can outperform the maximum margin classifiers in certain learning problems. Our results contribute to a better theoretical understanding of batch normalization.

artificial intelligence, deep learning, machine learning, (15 more...)

2306.1168

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.27)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.14)
Asia > China > Hong Kong (0.04)

Genre: Research Report > New Finding (0.86)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.70)

arXiv.org Artificial IntelligenceJul-10-2023

Online Tensor Learning: Computational and Statistical Trade-offs, Adaptivity and Optimal Regret

Cai, Jian-Feng, Li, Jingyang, Xia, Dong

We investigate a generalized framework for estimating latent low-rank tensors in an online setting, encompassing both linear and generalized linear models. This framework offers a flexible approach for handling continuous or categorical variables. Additionally, we investigate two specific applications: online tensor completion and online binary tensor learning. To address these challenges, we propose the online Riemannian gradient descent algorithm, which demonstrates linear convergence and the ability to recover the low-rank component under appropriate conditions in all applications. Furthermore, we establish a precise entry-wise error bound for online tensor completion. Notably, our work represents the first attempt to incorporate noise in the online low-rank tensor recovery task. Intriguingly, we observe a surprising trade-off between computational and statistical aspects in the presence of noise. Increasing the step size accelerates convergence but leads to higher statistical error, whereas a smaller step size yields a statistically optimal estimator at the expense of slower convergence. Moreover, we conduct regret analysis for online tensor regression. Under the fixed step size regime, a fascinating trilemma concerning the convergence rate, statistical error rate, and regret is observed. With an optimal choice of step size we achieve an optimal regret of $O(\sqrt{T})$. Furthermore, we extend our analysis to the adaptive setting where the horizon T is unknown. In this case, we demonstrate that by employing different step sizes, we can attain a statistically optimal error rate along with a regret of $O(\log T)$. To validate our theoretical claims, we provide numerical results that corroborate our findings and support our assertions.

artificial intelligence, machine learning, nullt 0, (17 more...)

2306.03372

Country:

Africa > Senegal > Kolda Region > Kolda (0.04)
Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
Asia > China > Hong Kong (0.04)
Africa > Middle East > Tunisia > Ben Arous Governorate > Ben Arous (0.04)

Genre: Research Report > New Finding (1.00)

Industry:

Health & Medicine (1.00)
Education > Curriculum > Subject-Specific Education (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.54)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

Iordan, Rares, Deisenroth, Marc Peter, Rosca, Mihaela

Investigating the Edge of Stability Phenomenon in Reinforcement Learning

arXiv.org Artificial IntelligenceJul-9-2023

Recent progress has been made in understanding optimisation dynamics in neural networks trained with full-batch gradient descent with momentum with the uncovering of the edge of stability phenomenon in supervised learning. The edge of stability phenomenon occurs as the leading eigenvalue of the Hessian reaches the divergence threshold of the underlying optimisation algorithm for a quadratic loss, after which it starts oscillating around the threshold, and the loss starts to exhibit local instability but decreases over long time frames. In this work, we explore the edge of stability phenomenon in reinforcement learning (RL), specifically off-policy Q-learning algorithms across a variety of data regimes, from offline to online RL. Our experiments reveal that, despite significant differences to supervised learning, such as non-stationarity of the data distribution and the use of bootstrapping, the edge of stability phenomenon can be present in off-policy deep RL. Unlike supervised learning, however, we observe strong differences depending on the underlying loss, with DQN -- using a Huber loss -- showing a strong edge of stability effect that we do not observe with C51 -- using a cross entropy loss. Our results suggest that, while neural network structure can lead to optimisation dynamics that transfer between problem domains, certain aspects of deep RL optimisation can differentiate it from domains such as supervised learning.

artificial intelligence, machine learning, reinforcement learning, (15 more...)

2307.0421

Country:

North America > Canada > Alberta (0.14)
Europe > United Kingdom > England > Greater London > London (0.04)

Genre: Research Report > New Finding (1.00)

Industry: Leisure & Entertainment > Games (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.49)