Existence, uniqueness, and convergence rates for gradient flows in the training of artificial neural networks with ReLU activation
Eberle, Simon, Jentzen, Arnulf, Riekert, Adrian, Weiss, Georg S.
arXiv.org Artificial Intelligence
The training of artificial neural networks (ANNs) with rectified linear unit (ReLU) activation via gradient descent (GD) type optimization schemes is now a common, industrially relevant procedure which appears, for instance, in natural language processing, face recognition, fraud detection, and game intelligence. Although there exist a large number of numerical simulations in which GD type optimization schemes are used effectively to train ANNs with ReLU activation, to this day the scientific literature contains, in general, no mathematical convergence analysis which explains the success of GD type optimization schemes in the training of such ANNs. GD type optimization schemes can be regarded as temporal discretization methods for the gradient flow (GF) differential equations associated with the considered optimization problem. In view of this, it seems natural to first develop a mathematical convergence theory for time-continuous GF differential equations and, thereafter, to extend such a time-continuous convergence theory to implementable time-discrete GD type optimization methods.
Although the literature offers, in general, no theoretical analysis which explains the success of GD type optimization schemes in the training of ANNs, it does contain several promising analysis approaches and partial error analyses regarding the training of ANNs via GD type optimization schemes and GFs, respectively. For convex objective functions, the convergence of GF and GD processes to the global minimum in different settings has been proved, e.g., in [5, 23, 34, 35, 38]. For general non-convex objective functions, even under smoothness assumptions, GF and GD processes can show wild oscillations and admit infinitely many limit points, cf., e.g., [1].
A standard condition which excludes this undesirable behavior is the Łojasiewicz inequality, and we point to [1, 3, 4, 8, 16, 28, 29, 30, 31, 33, 36] for convergence results for GF and GD processes under Łojasiewicz type assumptions.
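The relation between GF and GD described above can be illustrated with a minimal sketch. It does not use the ANN setting of the paper: the objective is a hypothetical one-dimensional quadratic L(theta) = 0.5 * A * theta^2, chosen only because its gradient flow theta'(t) = -L'(theta(t)) has the closed-form solution theta(t) = theta_0 * exp(-A * t). The GD recursion theta_{n+1} = theta_n - step * L'(theta_n) is the explicit Euler discretization of that differential equation, so the gap to the GF solution at a fixed time should shrink as the step size decreases. The last function checks a Łojasiewicz type inequality in its usual form |L(theta) - L(0)|^alpha <= C * |L'(theta)|, which this toy objective satisfies with alpha = 1/2 and C = 1/sqrt(2A). All names and constants (`A`, `THETA0`, `euler_gap`, `lojasiewicz_ratio`) are assumptions made for this illustration.

```python
import math

A = 2.0        # curvature of the toy objective (illustrative assumption)
THETA0 = 1.0   # initial value theta_0 (illustrative assumption)


def loss(theta: float) -> float:
    """Toy objective L(theta) = 0.5 * A * theta**2 (not an ANN risk)."""
    return 0.5 * A * theta * theta


def grad(theta: float) -> float:
    """Derivative L'(theta) = A * theta."""
    return A * theta


def gradient_flow(t: float) -> float:
    """Exact solution of theta'(t) = -L'(theta(t)) with theta(0) = THETA0."""
    return THETA0 * math.exp(-A * t)


def gradient_descent(step: float, n_steps: int) -> float:
    """Explicit Euler scheme: theta_{n+1} = theta_n - step * L'(theta_n)."""
    theta = THETA0
    for _ in range(n_steps):
        theta -= step * grad(theta)
    return theta


def euler_gap(step: float, horizon: float = 1.0) -> float:
    """Distance between GD after n steps and GF at time n * step."""
    n_steps = round(horizon / step)
    return abs(gradient_descent(step, n_steps) - gradient_flow(n_steps * step))


def lojasiewicz_ratio(theta: float) -> float:
    """|L(theta) - L(0)|**0.5 / |L'(theta)| for theta != 0.

    For this quadratic the ratio is the constant 1 / sqrt(2 * A), so the
    inequality holds with exponent 1/2 and that constant.
    """
    return math.sqrt(loss(theta)) / abs(grad(theta))


if __name__ == "__main__":
    for step in (0.1, 0.01, 0.001):
        print(f"step={step:<6} gap to gradient flow: {euler_gap(step):.6f}")
    print(f"Lojasiewicz ratio at theta=0.3: {lojasiewicz_ratio(0.3):.4f}")
```

The quadratic is convex, so it only illustrates the discretization viewpoint and the shape of the inequality. It says nothing about the non-convex, non-smooth ReLU risk functions the paper is concerned with.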
Aug-18-2021