o-minimal structure
Directional convergence and alignment in deep learning
The above theories, with finite width networks, usually require the weights to stay close to initialization in certain norms. By contrast, practitioners run their optimization methods as long as their computational budget allows [Shallue et al., 2018], and if the data can be perfectly classified, the
- North America > United States > Illinois > Champaign County > Urbana (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- Asia > Japan (0.04)
7a674153c63cff1ad7f0e261c369ab2c-Supplemental.pdf
This is the appendix for "A mathematical model for automatic differentiation in machine learning". We propose to study the backward mode of AD, as implemented for nonsmooth functions by standard software (e.g., …). Our theoretical results model AD as implemented in current machine learning libraries. … The conclusion follows because f(y) = f(x) … For each i = 1,…,m and j = 1,…,l, consider the set U … We recall here the results of geometry that we use in the present work. The simplest o-minimal structure is given by the class of real semialgebraic objects. The following can be found, for example, in [21]. … D(x) = {grad f(x)}, (10) where grad f(x) is the gradient of f restricted to the active stratum M … Then the following are equivalent: D is conservative for f.
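The snippet above invokes the notion of a conservative field for f without stating it. For context, the standard definition from the conservative-field literature (recalled here for the reader, not quoted from the supplemental itself) is:

```latex
% A set-valued map D : \mathbb{R}^n \rightrightarrows \mathbb{R}^n is
% conservative for a locally Lipschitz f if, for every absolutely
% continuous curve x : [0,1] \to \mathbb{R}^n,
\frac{\mathrm{d}}{\mathrm{d}t} f(x(t))
  = \langle v, \dot{x}(t) \rangle
  \quad \text{for all } v \in D(x(t)), \text{ for a.e. } t \in [0,1].
```

With D(x) = {grad f(x)} on each active stratum, as in equation (10), conservativity is exactly what lets one propagate chain-rule arguments through nonsmooth AD.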
Deep Learning as the Disciplined Construction of Tame Objects
Bareilles, Gilles, Gehret, Allen, Aspman, Johannes, Lepšová, Jana, Mareček, Jakub
One can see deep-learning models as compositions of functions within the so-called tame geometry. In this expository note, we give an overview of some topics at the interface of tame geometry (also known as o-minimality), optimization theory, and deep learning theory and practice. To do so, we gradually introduce the concepts and tools used to build convergence guarantees for stochastic gradient descent in a general nonsmooth, nonconvex, but tame setting. This illustrates some ways in which tame geometry is a natural mathematical framework for the study of AI systems, especially within deep learning.
- North America > United States > Oklahoma (0.14)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
- North America > United States > New York (0.04)
- (16 more...)
- North America > United States > Illinois > Champaign County > Urbana (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
- Asia > Japan (0.04)
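The setting described in the abstract above — SGD with diminishing steps on a nonsmooth, nonconvex, but tame loss — can be illustrated with a minimal hypothetical sketch (not taken from the paper): subgradient SGD on the semialgebraic function f(x) = |x₀| + x₁², which is nonsmooth at x₀ = 0.

```python
import random

def loss(x):
    # Semialgebraic, nonsmooth at x[0] = 0: f(x) = |x0| + x1^2
    return abs(x[0]) + x[1] ** 2

def subgradient(x):
    # One element of the Clarke subdifferential; at the kink x0 = 0
    # we pick 0, a valid element of the subdifferential [-1, 1].
    g0 = 0.0 if x[0] == 0 else (1.0 if x[0] > 0 else -1.0)
    return [g0, 2.0 * x[1]]

def sgd(x, steps=5000, seed=0):
    rng = random.Random(seed)
    for t in range(1, steps + 1):
        g = subgradient(x)
        lr = 0.5 / t  # diminishing step size, as the convergence theory requires
        noise = [rng.gauss(0.0, 0.01) for _ in g]  # stochastic gradient noise
        x = [xi - lr * (gi + ni) for xi, gi, ni in zip(x, g, noise)]
    return x

x_final = sgd([1.0, 1.0])
print(x_final, loss(x_final))  # iterates approach the Clarke-critical point (0, 0)
```

Tameness rules out pathological oscillation of the iterates, which is what the convergence guarantees surveyed in the note exploit.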
70afbf2259b4449d8ae1429e054df1b1-Supplemental.pdf
This is the appendix for "Nonsmooth Implicit Differentiation for Machine Learning and Optimization". We recall basic definitions and results on definable sets and functions used in this work. The archetypal o-minimal structure is the collection of semialgebraic sets. … R is a polynomial function. Note that the collection of semialgebraic sets satisfies condition 3 of Definition 6 by the Tarski-Seidenberg theorem.
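The truncated snippet appeals to the definition of a semialgebraic set; for reference, the standard definition (not quoted from the supplemental itself) is that a semialgebraic subset of R^n is a finite union of polynomial equality/inequality sets:

```latex
A \;=\; \bigcup_{i=1}^{k}
  \bigl\{ x \in \mathbb{R}^n :\;
    p_i(x) = 0,\;
    q_{i,1}(x) > 0, \ldots, q_{i,\ell_i}(x) > 0
  \bigr\},
% where each p_i and q_{i,j} : \mathbb{R}^n \to \mathbb{R} is a polynomial.
```

The Tarski-Seidenberg theorem states that the projection of a semialgebraic set onto fewer coordinates is again semialgebraic, which is precisely the projection-stability axiom (condition 3 of Definition 6) required of an o-minimal structure.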
SAD Neural Networks: Divergent Gradient Flows and Asymptotic Optimality via o-minimal Structures
Kranz, Julian, Gallon, Davide, Dereich, Steffen, Jentzen, Arnulf
We study gradient flows for loss landscapes of fully connected feedforward neural networks with commonly used continuously differentiable activation functions such as the logistic, hyperbolic tangent, softplus or GELU function. We prove that the gradient flow either converges to a critical point or diverges to infinity while the loss converges to an asymptotic critical value. Moreover, we prove the existence of a threshold $\varepsilon>0$ such that the loss value of any gradient flow initialized at most $\varepsilon$ above the optimal level converges to it. For polynomial target functions and sufficiently large architectures and data sets, we prove that the optimal loss value is zero and can only be realized asymptotically. From this setting, we deduce our main result that any gradient flow with sufficiently good initialization diverges to infinity. Our proof heavily relies on the geometry of o-minimal structures. We confirm these theoretical findings with numerical experiments and extend our investigation to real-world scenarios, where we observe an analogous behavior.
- Europe > Germany (0.14)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- (3 more...)
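The divergence phenomenon in the abstract above — parameters escaping to infinity while the loss approaches an asymptotic critical value that is never attained — can be seen on a one-dimensional toy example (my own sketch, not the paper's setting): the loss f(w) = log(1 + e^{-w}) has infimum 0 reached only as w → ∞, so a forward-Euler discretization of the gradient flow must diverge while the loss decays.

```python
import math

def loss(w):
    # Logistic-type loss: inf = 0 is approached only as w -> infinity,
    # so any trajectory driving the loss to 0 must be unbounded.
    return math.log1p(math.exp(-w))

def grad(w):
    return -math.exp(-w) / (1.0 + math.exp(-w))

def euler_gradient_flow(w=0.0, dt=0.1, steps=20000):
    trajectory = [w]
    for _ in range(steps):
        w = w - dt * grad(w)  # -grad(w) > 0, so w increases monotonically
        trajectory.append(w)
    return trajectory

traj = euler_gradient_flow()
print(traj[-1], loss(traj[-1]))  # w is large, loss is near its asymptotic value 0
```

Since grad(w) ≈ -e^{-w} for large w, the iterate grows roughly like log(t): unbounded, but ever more slowly, mirroring the "divergent yet asymptotically optimal" behavior the paper proves for neural-network gradient flows.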