This is the appendix for "Nonsmooth Implicit Differentiation for Machine Learning and Optimization". We recall basic definitions and results on definable sets and functions used in this work. The archetypal o-minimal structure is the collection of semialgebraic sets, that is, finite unions of sets defined by finitely many polynomial equalities and inequalities. Note that the collection of semialgebraic sets satisfies condition 3 of Definition 6 by the Tarski-Seidenberg theorem.
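For completeness, here is the standard definition of a semialgebraic set in the usual notation; the indexing and symbols below follow the textbook convention and are not copied verbatim from the appendix.

```latex
% Standard definition of a semialgebraic subset of R^n (textbook convention;
% the appendix's exact indexing may differ).
A \subset \mathbb{R}^n \ \text{is semialgebraic if}\quad
A \;=\; \bigcup_{i=1}^{p} \bigcap_{j=1}^{q_i}
        \bigl\{ x \in \mathbb{R}^n : P_{ij}(x) \;\sigma_{ij}\; 0 \bigr\},
\qquad \sigma_{ij} \in \{<,\, =\},
```

where each $P_{ij} \colon \mathbb{R}^n \to \mathbb{R}$ is a polynomial; a function is semialgebraic when its graph is a semialgebraic set.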
Convergence of SGD for Training Neural Networks with Sliced Wasserstein Losses
Optimal Transport has sparked vivid interest in recent years, in particular thanks to the Wasserstein distance, which provides a geometrically sensible and intuitive way of comparing probability measures. For computational reasons, the Sliced Wasserstein (SW) distance was introduced as an alternative to the Wasserstein distance, and has been used to train generative Neural Networks (NNs). While convergence of Stochastic Gradient Descent (SGD) has been observed in practice in such a setting, there is to our knowledge no theoretical guarantee for this observation. Leveraging recent work by Bianchi et al. (2022) on the convergence of SGD for non-smooth and non-convex functions, we aim to bridge that knowledge gap and provide a realistic context under which fixed-step SGD trajectories for the SW loss on NN parameters converge. More precisely, we show that the trajectories approach the set of solutions of the (sub)gradient flow equation as the step decreases. Under stricter assumptions, we show a much stronger convergence result for noised and projected SGD schemes, namely that the long-run limits of the trajectories approach a set of generalised critical points of the loss function.
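As a concrete illustration of this setting, here is a minimal sketch, assuming PyTorch, of fixed-step SGD on a Monte Carlo Sliced Wasserstein loss; the linear generator, Gaussian target, number of projections, and step size are illustrative choices, not the paper's experimental setup.

```python
# Minimal sketch (PyTorch assumed): fixed-step SGD on a Monte Carlo estimate of
# the squared Sliced Wasserstein-2 distance between generated and target samples.
import torch

torch.manual_seed(0)
d, n, n_proj, step = 2, 256, 64, 1e-2
target = torch.randn(n, d) + torch.tensor([3.0, 0.0])   # target samples
W = torch.zeros(d, d, requires_grad=True)                # toy linear "generator"
b = torch.zeros(d, requires_grad=True)

def sw2_sq(x, y, n_proj):
    """Monte Carlo SW_2^2: project on random directions, sort the 1D
    projections, and average the squared differences."""
    theta = torch.randn(n_proj, x.shape[1])
    theta = theta / theta.norm(dim=1, keepdim=True)      # random unit directions
    px, _ = torch.sort(x @ theta.T, dim=0)
    py, _ = torch.sort(y @ theta.T, dim=0)
    return ((px - py) ** 2).mean()

for k in range(500):
    z = torch.randn(n, d)                                # latent noise
    loss = sw2_sq(z @ W + b, target, n_proj)             # SW loss in the NN parameters
    loss.backward()
    with torch.no_grad():                                # fixed-step SGD update
        W -= step * W.grad
        b -= step * b.grad
        W.grad.zero_(); b.grad.zero_()
```

Sorting solves the one-dimensional transport problems in closed form, so the loss is piecewise smooth in the parameters and backpropagation returns a (sub)gradient almost everywhere.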
Conservative set valued fields, automatic differentiation, stochastic gradient method and deep learning
Jérôme Bolte, Edouard Pauwels
The Clarke subdifferential is not suited to tackle nonsmooth deep learning issues: backpropagation, mini-batches and steady states are not properly modelled. As a remedy, we introduce set-valued conservative fields as surrogates to standard subdifferential mappings. We study their properties and provide elements of a calculus. Functions having a conservative field are called path differentiable. Convex/concave, semi-algebraic, or Clarke regular Lipschitz continuous functions are path differentiable, as their corresponding subdifferentials are conservative. Another concrete and important class of conservative fields, which are not subdifferential mappings, is given by automatic differentiation oracles, for instance the "subgradients" provided by the backpropagation algorithm in deep learning. Our differential model is eventually used to ensure subsequential convergence of nonsmooth stochastic gradient methods in the tame Lipschitz continuous setting, offering the possibility of using mini-batches, the actual backpropagation oracle, and $o(1/\log k)$ stepsizes.
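To make the automatic differentiation oracle tangible, here is a minimal sketch, assuming PyTorch: the function below equals the identity, yet backpropagating through its ReLU decomposition at $x = 0$ returns 0 instead of the true derivative 1, so the backpropagation "subgradient" is an element of a conservative field rather than of the Clarke subdifferential $\{1\}$.

```python
# Minimal sketch (PyTorch assumed). At x = 0, relu is nondifferentiable and
# backpropagation picks one fixed value in [0, 1]. Composing such rules can
# produce a "subgradient" outside the Clarke subdifferential, which is what
# conservative fields are designed to model.
import torch

x = torch.tensor(0.0, requires_grad=True)
y = torch.relu(x) - torch.relu(-x)   # equals x everywhere, true derivative is 1
y.backward()
print(x.grad)                        # 0.0 on current PyTorch builds, not 1
```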