
Collaborating Authors

 Petersen, Philipp Christian


Regularized Gauss-Newton for Optimizing Overparameterized Neural Networks

arXiv.org Artificial Intelligence

Despite their superior convergence rates compared to first-order methods, (approximate) second-order methods are still rarely used -- and as such, underexplored -- for training large-scale machine learning and neural network (NN) models. This is due to their prohibitive computational and memory costs at each iteration. Some past and recent works have, however, made efforts to reduce this overhead by proposing different approximations to the Hessian of the loss function, which the methods ultimately exploit to achieve their impressive convergence properties (see e.g., [1, 2, 3, 4, 5, 6, 7, 8, 9]). One of the most appealing approximations to the Hessian matrix within the context of practical deep learning and nonlinear optimization in general is the generalized Gauss-Newton (GGN) approximation of [10], which uses a positive semi-definite (PSD) matrix to model the curvature of an arbitrary convex loss function. In fact, the Fisher information matrix (FIM) -- a curvature-approximating matrix which most other approximate second-order methods seek to estimate -- is shown to have direct connections with the GGN matrix in many practical cases [4, 11]. Despite its close connection with the GGN matrix, the FIM, unlike the GGN matrix, potentially over-approximates the second-order terms in more general loss functions, throwing away relevant curvature information [10]. In addition to the desirable property of maintaining positive-definiteness throughout the training procedure, other favorable properties of the GGN matrix, in comparison with the Hessian matrix, are discussed in [12, Section 8.1]; see also [13] for discussions in the context of nonlinear least-squares estimation and [14] for efficient training of (deep) recurrent neural networks with a GGN approach.
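
To make the flavor of such methods concrete, below is a minimal, hedged NumPy sketch of a damped (regularized) Gauss-Newton step for a least-squares loss; the residual model, the damping parameter `lam`, and the toy exponential-fitting problem are illustrative assumptions, not the algorithm proposed in the paper.

```python
import numpy as np

def regularized_gauss_newton_step(theta, residual_fn, jacobian_fn, lam=1e-3):
    """One damped Gauss-Newton update for L(theta) = 0.5 * ||r(theta)||^2.

    J^T J is the (PSD) Gauss-Newton surrogate for the Hessian; the
    regularizer lam * I keeps the linear system well conditioned.
    """
    r = residual_fn(theta)                    # residuals, shape (m,)
    J = jacobian_fn(theta)                    # Jacobian dr/dtheta, shape (m, n)
    G = J.T @ J + lam * np.eye(theta.size)    # regularized Gauss-Newton matrix
    g = J.T @ r                               # gradient of the loss
    return theta - np.linalg.solve(G, g)

# Toy usage (illustrative): fit y = a * exp(b * x) to noisy data.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
y = 2.0 * np.exp(0.5 * x) + 0.01 * rng.standard_normal(x.size)

def residual_fn(theta):
    a, b = theta
    return a * np.exp(b * x) - y

def jacobian_fn(theta):
    a, b = theta
    return np.stack([np.exp(b * x), a * x * np.exp(b * x)], axis=1)

theta = np.array([1.0, 0.0])
for _ in range(20):
    theta = regularized_gauss_newton_step(theta, residual_fn, jacobian_fn)
print(theta)   # should approach approximately [2.0, 0.5]
```

For a convex loss other than squared error, `J.T @ J` would be replaced by the GGN matrix `J.T @ H_L @ J`, where `H_L` denotes the Hessian of the loss with respect to the model output.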


Efficient Learning Using Spiking Neural Networks Equipped With Affine Encoders and Decoders

arXiv.org Machine Learning

Deep learning [6, 29] is a technology that has revolutionized many areas of modern life. The term describes the gradient-based training of deep neural networks. Since its breakthrough in image classification in 2012 [28], deep learning has been essentially the only viable technology for this application. Moreover, it is the basis of multiple recent breakthroughs in science [25] and even mathematical research [14]. Recently, deep learning has received wide public attention through the advent of generative AI in the form of large language models such as ChatGPT [39]. It is well documented that deep learning in modern applications can place extreme demands on computational resources, and that the hardware requirements scale in an unsustainable way [52]. In constrained settings, this can become a serious bottleneck that prevents the deployment of deep learning methods. In addition, these extensive computations come with an immense environmental cost.


Limitations of neural network training due to numerical instability of backpropagation

arXiv.org Machine Learning

Deep learning is a machine learning technique based on artificial neural networks that are trained by gradient-based methods and have a large number of layers. This technique has been tremendously successful in a wide range of applications [26, 24, 44, 41]. Of particular interest to applied mathematicians are recent developments in which deep neural networks are applied to tasks of numerical analysis, such as the numerical solution of inverse problems [1, 34, 27, 20, 38] or of (parametric) partial differential equations [7, 12, 39, 9, 40, 25, 29, 3]. The appeal of deep neural networks for these applications is due to their exceptional efficiency in representing functions from several approximation classes that underlie well-established numerical methods. In terms of approximation accuracy with respect to the number of approximation parameters, deep neural networks have been theoretically proven to achieve approximation rates that are at least as good as those of finite elements [15, 35, 30], local Taylor polynomials or splines [47, 11], wavelets [42] and, more generally, affine systems [5]. In the sequel, we consider neural networks with the rectified-linear-unit (ReLU) activation function, which is standard in most applications. In this case, the neural-network approximations are piecewise-affine functions. We point out that all state-of-the-art results on the rates of approximation with deep ReLU neural networks that achieve higher-order polynomial approximation rates are based on explicit constructions in which the number of affine pieces grows exponentially with the number of layers; see, e.g., [47, 46]. In this work, we argue that this central building block, functions with exponentially many affine pieces, cannot be learned with state-of-the-art techniques.
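
To make the notion of exponentially many affine pieces concrete, the classical sawtooth construction composes a two-piece ReLU "hat" function with itself: k compositions (depth growing linearly in k) produce 2^k affine pieces. The following self-contained NumPy sketch is an illustrative example in this spirit, not code from the paper; it counts the pieces numerically by detecting slope changes on a fine grid.

```python
import numpy as np

def hat(x):
    """Two-piece ReLU hat function g(x) = 2*relu(x) - 4*relu(x - 0.5) on [0, 1]."""
    relu = lambda t: np.maximum(t, 0.0)
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def sawtooth(x, depth):
    """Compose the hat function `depth` times; yields 2**depth affine pieces on [0, 1]."""
    for _ in range(depth):
        x = hat(x)
    return x

# Count affine pieces by counting slope changes on a fine grid.
xs = np.linspace(0.0, 1.0, 200001)
for depth in range(1, 6):
    ys = sawtooth(xs, depth)
    slopes = np.diff(ys) / np.diff(xs)
    pieces = 1 + np.count_nonzero(np.abs(np.diff(slopes)) > 1e-6)
    print(depth, pieces)   # prints 2, 4, 8, 16, 32
```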


Deep neural networks can stably solve high-dimensional, noisy, non-linear inverse problems

arXiv.org Machine Learning

We study the problem of reconstructing solutions of inverse problems when only noisy measurements are available. We assume that the problem can be modeled with an infinite-dimensional forward operator that is not continuously invertible. Then, we restrict this forward operator to finite-dimensional spaces so that the inverse is Lipschitz continuous. For the inverse operator, we demonstrate that there exists a neural network which is a robust-to-noise approximation of the operator. In addition, we show that these neural networks can be learned from appropriately perturbed training data. We demonstrate the admissibility of this approach to a wide range of inverse problems of practical interest. Numerical examples are given that support the theoretical findings.
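
As a rough illustration of learning a reconstruction map from perturbed training data, the sketch below restricts a hypothetical smoothing forward operator to a finite-dimensional matrix A, generates noisy measurements, and fits an off-the-shelf multilayer perceptron to map measurements back to signals. The operator, noise level, and network size are illustrative assumptions, not the construction analyzed in the paper.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Finite-dimensional restriction of a (hypothetical) smoothing forward operator.
n = 20
A = np.array([[np.exp(-abs(i - j) / 3.0) for j in range(n)] for i in range(n)])

# Training data: signals x and noisy ("perturbed") measurements y = A x + noise.
X_train = rng.standard_normal((2000, n))
Y_train = X_train @ A.T + 0.01 * rng.standard_normal((2000, n))

# Learn an approximate inverse map y -> x from the perturbed data.
net = MLPRegressor(hidden_layer_sizes=(128, 128), max_iter=500, random_state=0)
net.fit(Y_train, X_train)

# Check robustness on fresh noisy measurements.
X_test = rng.standard_normal((200, n))
Y_test = X_test @ A.T + 0.01 * rng.standard_normal((200, n))
err = np.linalg.norm(net.predict(Y_test) - X_test) / np.linalg.norm(X_test)
print(f"relative reconstruction error: {err:.3f}")
```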


Mathematical Capabilities of ChatGPT

arXiv.org Artificial Intelligence

We investigate the mathematical capabilities of two iterations of ChatGPT (released 9-January-2023 and 30-January-2023) and of GPT-4 by testing them on publicly available datasets, as well as hand-crafted ones, using a novel methodology. In contrast to formal mathematics, where large databases of formal proofs are available (e.g., the Lean Mathematical Library), current datasets of natural-language mathematics, used to benchmark language models, either cover only elementary mathematics or are very small. We address this by publicly releasing two new datasets: GHOSTS and miniGHOSTS. These are the first natural-language datasets curated by working researchers in mathematics that (1) aim to cover graduate-level mathematics, (2) provide a holistic overview of the mathematical capabilities of language models, and (3) distinguish multiple dimensions of mathematical reasoning. These datasets also test whether ChatGPT and GPT-4 can be helpful assistants to professional mathematicians by emulating use cases that arise in the daily professional activities of mathematicians. We benchmark the models on a range of fine-grained performance metrics. For advanced mathematics, this is the most detailed evaluation effort to date. We find that ChatGPT can be used most successfully as a mathematical assistant for querying facts, acting as a mathematical search engine and knowledge base interface. GPT-4 can additionally be used for undergraduate-level mathematics but fails on graduate-level difficulty. Contrary to many positive reports in the media about GPT-4 and ChatGPT's exam-solving abilities (a potential case of selection bias), their overall mathematical performance is well below the level of a graduate student. Hence, if your goal is to use ChatGPT to pass a graduate-level math exam, you would be better off copying from your average peer!


VC dimensions of group convolutional neural networks

arXiv.org Artificial Intelligence

Due to impressive results in image recognition, convolutional neural networks (CNNs) have become one of the most widely used neural network architectures [12, 13]. It is believed that one of the main reasons for the efficiency of CNNs is their ability to convert the translation symmetry of the data into a built-in translation-equivariance property of the neural network, without having to learn the equivariance from the data [4, 15]. Based on this intuition, other data symmetries have recently been incorporated into neural network architectures. Group convolutional neural networks (G-CNNs) are a natural generalization of CNNs that can be equivariant with respect to rotation [5, 24, 23, 9], scale [21, 20, 1], and other symmetries defined by matrix groups [7]. Moreover, every neural network that is equivariant to the action of a group on its input is a G-CNN, where the convolutions are taken with respect to the group [11] (see Theorem 2.10 below).
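
As a toy example of a group convolution, the sketch below lifts an image to the rotation group C4 by correlating it with the four rotated copies of a filter; rotating the input image then rotates each feature map and cyclically permutes the group (channel) axis, which is exactly the equivariance property described above. The choice of group and the NumPy/SciPy implementation are illustrative and not taken from [11].

```python
import numpy as np
from scipy.signal import correlate2d

def c4_lifting_convolution(image, base_filter):
    """Lift a 2D image to the rotation group C4: one output channel per
    group element, obtained by correlating with each rotated copy of the filter."""
    return np.stack([
        correlate2d(image, np.rot90(base_filter, k), mode="same")
        for k in range(4)   # group elements: rotations by 0, 90, 180, 270 degrees
    ])

rng = np.random.default_rng(0)
image = rng.standard_normal((8, 8))
filt = rng.standard_normal((3, 3))

out = c4_lifting_convolution(image, filt)
out_rot = c4_lifting_convolution(np.rot90(image), filt)

# Equivariance check: rotating the input rotates each feature map and
# cyclically shifts the group (channel) axis.
expected = np.stack([np.rot90(out[(k - 1) % 4]) for k in range(4)])
print(np.allclose(out_rot, expected))   # prints True
```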


The Oracle of DLphi

arXiv.org Machine Learning

This paper takes aim at achieving nothing less than the impossible. To be more precise, we seek to predict labels of unknown data from entirely uncorrelated labelled training data. This will be accomplished by an application of an algorithm based on deep learning, as well as by invoking one of the most fundamental concepts of set theory. Estimating the behaviour of a system in unknown situations is one of the central problems of humanity. Indeed, we are constantly trying to produce predictions for future events in order to prepare ourselves.