Mukherjee, Anirbit
Langevin Monte-Carlo Provably Learns Depth Two Neural Nets at Any Size and Data
Kumar, Dibyakanti, Jha, Samyak, Mukherjee, Anirbit
In this work, we establish that the Langevin Monte-Carlo algorithm can learn depth-2 neural nets of any size and for any data, and we give non-asymptotic convergence rates for it. We achieve this by showing that, in Total Variation distance and q-Rényi divergence, the iterates of Langevin Monte-Carlo converge to the Gibbs distribution of the Frobenius-norm regularized loss of any such net, for smooth activations and in both the classification and the regression settings. Most critically, the amount of regularization needed for our results is independent of the size of the net. This result builds on several recent observations, such as our previous papers showing that two-layer neural loss functions can always be regularized by a certain constant amount so that they satisfy the Villani conditions, and hence their Gibbs measures satisfy a Poincaré inequality.
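A minimal sketch of the unadjusted Langevin Monte-Carlo iteration on a Frobenius-norm regularized squared loss of a depth-2 sigmoid net is given below; the activation, loss, step size, inverse temperature and regularization constant are illustrative choices, not values prescribed by the paper.

```python
import numpy as np

def depth2_loss_grad(W, a, X, y, lam):
    """Gradient of the Frobenius-norm regularized squared loss of a
    depth-2 sigmoid net f(x) = a . sigmoid(W x)."""
    H = 1.0 / (1.0 + np.exp(-X @ W.T))           # (n, width) hidden activations
    err = H @ a - y                              # (n,) residuals
    grad_a = H.T @ err / len(y) + lam * a
    grad_W = ((err[:, None] * a) * H * (1 - H)).T @ X / len(y) + lam * W
    return grad_W, grad_a

def langevin_monte_carlo(X, y, width, steps=5000, eta=1e-3, beta=10.0, lam=0.1, seed=0):
    """Unadjusted LMC: w_{k+1} = w_k - eta * grad(w_k) + sqrt(2*eta/beta) * xi_k,
    whose iterates are meant to approach the Gibbs measure exp(-beta * loss)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(width, X.shape[1]))
    a = rng.normal(size=width)
    for _ in range(steps):
        gW, ga = depth2_loss_grad(W, a, X, y, lam)
        W += -eta * gW + np.sqrt(2 * eta / beta) * rng.normal(size=W.shape)
        a += -eta * ga + np.sqrt(2 * eta / beta) * rng.normal(size=a.shape)
    return W, a
```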
Regularized Gradient Clipping Provably Trains Wide and Deep Neural Networks
Tucat, Matteo, Mukherjee, Anirbit
In this work, we instantiate a regularized form of the gradient clipping algorithm and prove that it can converge to the global minima of deep neural network loss functions provided that the net is of sufficient width. We present empirical evidence that our theoretically founded regularized gradient clipping algorithm is also competitive with state-of-the-art deep-learning heuristics. Hence the algorithm presented here constitutes a new approach to rigorous deep learning. The modification we make to standard gradient clipping is designed to leverage the PL* condition, a variant of the Polyak-Łojasiewicz inequality which was recently proven (Liu et al., 2020) to hold for various neural networks of any depth within a neighbourhood of the initialisation. In various disciplines, ranging from control theory to machine learning theory, there has been a long history of trying to understand the nature of convergence on non-convex objectives for first-order optimization algorithms, i.e. algorithms which only have access to (an estimate of) the gradient of the objective (Maryak & Chin, 2001; Fang et al., 1997).
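The abstract does not spell out the exact modification, so the sketch below only illustrates one plausible regularized variant of norm-based gradient clipping, in which the usual clipping factor min(1, h/||g||) is kept bounded away from zero by a constant delta; the threshold h, floor delta and learning rate are illustrative assumptions, not the paper's prescriptions.

```python
import torch

def regularized_clipped_sgd_step(params, loss, lr=0.1, h=1.0, delta=0.01):
    """One step of gradient clipping with a regularized (lower-bounded) clipping
    factor: scale = max(min(1, h / ||g||), delta), then w <- w - lr * scale * g.
    This is a sketch of the idea, not necessarily the paper's exact update."""
    grads = torch.autograd.grad(loss, params)
    grad_norm = torch.sqrt(sum((g ** 2).sum() for g in grads))
    scale = torch.clamp(h / (grad_norm + 1e-12), max=1.0)      # standard clipping factor
    scale = torch.maximum(scale, torch.tensor(delta))           # regularization: never fully kill the step
    with torch.no_grad():
        for p, g in zip(params, grads):
            p -= lr * scale * g
```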
Size Lowerbounds for Deep Operator Networks
Mukherjee, Anirbit, Roy, Amartya
Deep Operator Networks are an increasingly popular paradigm for solving regression in infinite dimensions and hence solving families of PDEs in one shot. In this work, we aim to establish a first-of-its-kind data-dependent lower bound on the size of DeepONets required for them to be able to reduce empirical error on noisy data. In particular, we show that for low training errors to be obtained on $n$ data points it is necessary that the common output dimension of the branch and the trunk net scale as $\Omega\left(\sqrt[6]{n}\right)$. This inspires our experiments with DeepONets solving the advection-diffusion-reaction PDE, where we demonstrate the possibility that, at a fixed model size, the training data might need to scale at least quadratically with this common output dimension for an increase in it to yield a monotonic lowering of training error.
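For context, the sketch below shows the standard DeepONet parameterization whose common branch/trunk output dimension $p$ is the quantity being lower-bounded; the hidden widths and the tanh activation are illustrative and not taken from the paper.

```python
import torch
import torch.nn as nn

class DeepONet(nn.Module):
    """G(u)(y) ~ sum_k branch_k(u) * trunk_k(y): the branch net encodes the input
    function u sampled at m sensor points, the trunk net encodes the query point y,
    and both share a common output dimension p."""
    def __init__(self, m_sensors, y_dim, p=32, hidden=64):
        super().__init__()
        self.branch = nn.Sequential(nn.Linear(m_sensors, hidden), nn.Tanh(), nn.Linear(hidden, p))
        self.trunk = nn.Sequential(nn.Linear(y_dim, hidden), nn.Tanh(), nn.Linear(hidden, p))

    def forward(self, u_sensors, y):
        b = self.branch(u_sensors)        # (batch, p) coefficients from the input function
        t = self.trunk(y)                 # (batch, p) basis values at the query point
        return (b * t).sum(dim=-1)        # scalar prediction of G(u)(y)
```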
LIPEx-Locally Interpretable Probabilistic Explanations-To Look Beyond The True Class
Zhu, Hongbo, Cangelosi, Angelo, Sen, Procheta, Mukherjee, Anirbit
In this work, we instantiate a novel perturbation-based multi-class explanation framework, LIPEx (Locally Interpretable Probabilistic Explanation). We demonstrate that LIPEx not only locally replicates the probability distributions output by widely used complex classification models but also provides insight into how every feature deemed important affects the prediction probability for each of the possible classes. We achieve this by defining the explanation as a matrix obtained via regression with respect to the Hellinger distance in the space of probability distributions. Ablation tests on text and image data show that LIPEx-guided removal of important features from the data causes a larger change in the underlying model's predictions than similar tests based on other saliency-based or feature-importance-based Explainable AI (XAI) methods. It is also shown that, compared to LIME, LIPEx is more data-efficient, in that it needs fewer perturbations of the data to obtain a reliable explanation. This data efficiency manifests as LIPEx computing its explanation matrix around 53% faster than all-class LIME in classification experiments with text data.
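A rough sketch of the kind of Hellinger-distance regression described above is given below; the softmax link, the optimizer and the use of unweighted perturbations are our assumptions for illustration, not necessarily the paper's exact setup.

```python
import torch

def lipex_explanation_matrix(Z, P, n_steps=500, lr=0.05):
    """Fit an explanation matrix W (n_classes x n_features) so that softmax(Z @ W.T)
    matches the complex model's class probabilities P under squared Hellinger distance.
    Z: (n_perturbations, n_features) feature masks of perturbations around the instance.
    P: (n_perturbations, n_classes) probabilities output by the model on those perturbations."""
    Z, P = Z.float(), P.float()
    W = torch.zeros(P.shape[1], Z.shape[1], requires_grad=True)
    opt = torch.optim.Adam([W], lr=lr)
    for _ in range(n_steps):
        Q = torch.softmax(Z @ W.T, dim=-1)                          # surrogate distributions
        hellinger_sq = 0.5 * ((Q.sqrt() - P.sqrt()) ** 2).sum(-1)   # squared Hellinger distance per perturbation
        loss = hellinger_sq.mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return W.detach()   # entry [c, f]: importance of feature f for class c
```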
Investigating the Ability of PINNs To Solve Burgers' PDE Near Finite-Time BlowUp
Kumar, Dibyakanti, Mukherjee, Anirbit
Physics Informed Neural Networks (PINNs) have been achieving ever-newer feats of numerically solving complicated PDEs while offering an attractive trade-off between accuracy and speed of inference. A particularly challenging aspect of PDEs is that there exist simple PDEs which can evolve into singular solutions in finite time starting from smooth initial conditions. In recent times some striking experiments have suggested that PINNs might be good at even detecting such finite-time blow-ups. In this work, we embark on a program to investigate this ability of PINNs from a rigorous theoretical viewpoint. Firstly, we derive generalization bounds for PINNs for Burgers' PDE, in arbitrary dimensions, under conditions that allow for a finite-time blow-up. Then we demonstrate via experiments that our bounds are significantly correlated with the $\ell_2$-distance of the neurally found surrogate from the true blow-up solution, when computed on sequences of PDEs that get increasingly close to a blow-up.
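For concreteness, a minimal sketch of the PDE residual a PINN would penalize for the 1-D viscous Burgers' equation $u_t + u\,u_x = \nu\,u_{xx}$ is given below (the paper treats arbitrary dimensions); the surrogate `net`, its input convention and the viscosity value are illustrative assumptions.

```python
import torch

def burgers_pinn_residual(net, x, t, nu=0.01):
    """PDE residual u_t + u * u_x - nu * u_xx for a 1-D viscous Burgers surrogate
    u = net(x, t); a PINN training loss would combine the mean square of this
    residual over collocation points with initial/boundary condition terms."""
    x = x.clone().requires_grad_(True)
    t = t.clone().requires_grad_(True)
    u = net(torch.stack([x, t], dim=-1)).squeeze(-1)
    u_x, u_t = torch.autograd.grad(u.sum(), (x, t), create_graph=True)
    u_xx = torch.autograd.grad(u_x.sum(), x, create_graph=True)[0]
    return u_t + u * u_x - nu * u_xx
```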
Global Convergence of SGD For Logistic Loss on Two Layer Neural Nets
Gopalani, Pulkit, Jha, Samyak, Mukherjee, Anirbit
In this note, we demonstrate a first-of-its-kind provable convergence of SGD to the global minima of appropriately regularized logistic empirical risk of depth $2$ nets -- for arbitrary data and with any number of gates with adequately smooth and bounded activations like sigmoid and tanh. We also prove an exponentially fast convergence rate for continuous-time SGD that also applies to smooth unbounded activations like SoftPlus. Our key idea is to show the existence of Frobenius-norm regularized logistic loss functions on constant-sized neural nets which are "Villani functions", which lets us build on recent progress in analyzing SGD on such objectives.
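Schematically, the object being analyzed is a Frobenius-norm regularized logistic empirical risk of the following kind; the notation (width $p$, outer weights $a$, inner weight rows $w_j$, regularization $\lambda$) is ours and only meant to illustrate the setup:

$$\widehat{L}_\lambda(W, a) = \frac{1}{n}\sum_{i=1}^{n} \log\Big(1 + \exp\big(-y_i\, f_{W,a}(x_i)\big)\Big) + \frac{\lambda}{2}\Big(\lVert W\rVert_F^2 + \lVert a\rVert_2^2\Big), \qquad f_{W,a}(x) = \sum_{j=1}^{p} a_j\, \sigma(\langle w_j, x\rangle),$$

with $\sigma$ a smooth bounded activation like sigmoid or tanh and labels $y_i \in \{-1,+1\}$.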
Global Convergence of SGD On Two Layer Neural Nets
Gopalani, Pulkit, Mukherjee, Anirbit
In this note, we demonstrate provable convergence of SGD to the global minima of appropriately regularized $\ell_2$-empirical risk of depth $2$ nets -- for arbitrary data and with any number of gates, if they use adequately smooth and bounded activations like sigmoid and tanh. We build on the results in [1] and leverage a constant amount of Frobenius-norm regularization on the weights, along with sampling of the initial weights from an appropriate distribution. We also give a continuous-time SGD convergence result that also applies to smooth unbounded activations like SoftPlus. Our key idea is to show the existence of loss functions on constant-sized neural nets which are "Villani functions". [1] Bin Shi, Weijie J. Su, and Michael I. Jordan. On learning rates and Schrödinger operators, 2020. arXiv:2004.06977
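The analogous object here is, schematically, the Frobenius-norm regularized squared-loss empirical risk; as before, the exact placement of the regularizer and the constants are illustrative rather than the paper's precise choices:

$$\widehat{L}_\lambda(W, a) = \frac{1}{n}\sum_{i=1}^{n} \big(y_i - f_{W,a}(x_i)\big)^2 + \frac{\lambda}{2}\Big(\lVert W\rVert_F^2 + \lVert a\rVert_2^2\Big), \qquad f_{W,a}(x) = \sum_{j=1}^{p} a_j\, \sigma(\langle w_j, x\rangle).$$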
Investigating the locality of neural network training dynamics
Dan, Soham, Gampa, Phanideep, Mukherjee, Anirbit
A fundamental quest in the theory of deep learning is to understand the properties of the trajectories in weight space that a learning algorithm takes. One such property that has recently been isolated is that of "local elasticity" ($S_{\rm rel}$), which quantifies the propagation of the influence of a sampled data point on the prediction at another data point. In this work, we perform a comprehensive study of local elasticity by providing new theoretical insights and more careful empirical evidence of this property in a variety of settings. Firstly, specific to the classification setting, we suggest a new definition of the original idea of $S_{\rm rel}$. Via experiments on state-of-the-art neural networks trained on SVHN, CIFAR-10 and CIFAR-100, we demonstrate how our new $S_{\rm rel}$ detects the weight updates' preference for making changes in predictions within the same class as the sampled data point. Next, we demonstrate via examples of neural nets doing regression that the original $S_{\rm rel}$ reveals a $2$-phase behaviour: their training proceeds via an initial elastic phase, when $S_{\rm rel}$ changes rapidly, and an eventual inelastic phase, when $S_{\rm rel}$ remains large. Lastly, we give multiple examples of learning via gradient flows for which one can get a closed-form expression for the original $S_{\rm rel}$ function. By studying the plots of these derived formulas we give a theoretical demonstration of some of the experimentally detected properties of $S_{\rm rel}$ in the regression setting.
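The relative-change idea behind the original $S_{\rm rel}$ can be sketched as follows: take one SGD step on a sampled point and compare the induced change in the prediction at a probe point to the change at the sampled point itself. The code below only illustrates this idea and is not the paper's exact definition (in particular, not the new classification-specific one).

```python
import torch

def relative_local_elasticity(net, loss_fn, x_update, y_update, x_probe, lr=0.01):
    """Ratio of the prediction change at a probe point x_probe to the change at the
    updated point x_update, after one in-place SGD step on (x_update, y_update)."""
    with torch.no_grad():
        before_update = net(x_update).clone()
        before_probe = net(x_probe).clone()
    loss = loss_fn(net(x_update), y_update)
    net.zero_grad()
    loss.backward()
    with torch.no_grad():
        for p in net.parameters():
            if p.grad is not None:
                p -= lr * p.grad                         # one SGD step on the sampled point
        change_update = (net(x_update) - before_update).norm()
        change_probe = (net(x_probe) - before_probe).norm()
    return (change_probe / (change_update + 1e-12)).item()
```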
A Study of the Mathematics of Deep Learning
Mukherjee, Anirbit
"Deep Learning"/"Deep Neural Nets" is a technological marvel that is now increasingly deployed at the cutting-edge of artificial intelligence tasks. This dramatic success of deep learning in the last few years has been hinged on an enormous amount of heuristics and it has turned out to be a serious mathematical challenge to be able to rigorously explain them. In this thesis, submitted to the Department of Applied Mathematics and Statistics, Johns Hopkins University we take several steps towards building strong theoretical foundations for these new paradigms of deep-learning. In chapter 2 we show new circuit complexity theorems for deep neural functions and prove classification theorems about these function spaces which in turn lead to exact algorithms for empirical risk minimization for depth 2 ReLU nets. We also motivate a measure of complexity of neural functions to constructively establish the existence of high-complexity neural functions. In chapter 3 we give the first algorithm which can train a ReLU gate in the realizable setting in linear time in an almost distribution free set up. In chapter 4 we give rigorous proofs towards explaining the phenomenon of autoencoders being able to do sparse-coding. In chapter 5 we give the first-of-its-kind proofs of convergence for stochastic and deterministic versions of the widely used adaptive gradient deep-learning algorithms, RMSProp and ADAM. This chapter also includes a detailed empirical study on autoencoders of the hyper-parameter values at which modern algorithms have a significant advantage over classical acceleration based methods. In the last chapter 6 we give new and improved PAC-Bayesian bounds for the risk of stochastic neural nets. This chapter also includes an experimental investigation revealing new geometric properties of the paths in weight space that are traced out by the net during the training.
A Study of Neural Training with Non-Gradient and Noise Assisted Gradient Methods
Mukherjee, Anirbit, Muthukumar, Ramchandran
Eventually this led to an explosion of literature obtaining linear-time training of various kinds of neural nets when their width is a high-degree polynomial in the training set size, inverse accuracy and inverse confidence parameters (a somewhat unrealistic regime) [26], [39], [11], [37], [22], [17], [3], [2], [4], [10], [42], [43], [7], [8], [29], [6]. The essential proximity of this regime to kernel methods has been considered separately in works like [1], [38]. Even in the wake of this progress, it remains unclear how any of this can help establish rigorous guarantees about smaller neural networks, or more pertinently about constant-size neural nets, which is a regime closer to what is implemented in the real world.