
### Gradient Descent for Machine Learning (ML) 101 with Python Tutorial

Gradient descent is one of the most widely used optimization algorithms in machine learning, with applications in neural networks [7], data science, and general optimization tasks. The algorithm and its variants appear in the training of almost every machine learning model. Its goal is to find the parameter values that minimize an error (loss) function, which it does by repeatedly updating the parameters in the direction of the negative gradient. Here, parameters refer to the coefficients in a regression model and the weights in a neural network.
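
As a minimal sketch of the update rule described above, the following fits a one-variable linear regression by batch gradient descent on the mean squared error. The function name `gradient_descent`, the learning rate, the step count, and the toy data are all illustrative assumptions, not taken from the tutorial.

```python
def gradient_descent(xs, ys, learning_rate=0.05, n_steps=2000):
    """Fit y ~ w * x + b by minimizing mean squared error (illustrative sketch)."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(n_steps):
        # Residuals of the current model on every sample.
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Gradients of MSE = (1/n) * sum(error^2) with respect to w and b.
        grad_w = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2.0 / n) * sum(errors)
        # Step against the gradient to reduce the error.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy data generated from y = 2x + 1 (assumed for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = gradient_descent(xs, ys)
print(w, b)
```

On this noise-free data the fitted coefficient and intercept should approach 2 and 1. A learning rate that is too large for the data scale would make the iterates diverge instead, which is why the step size is usually tuned.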

### Householder Dice: A Matrix-Free Algorithm for Simulating Dynamics on Gaussian and Random Orthogonal Ensembles

In the study of large random systems, researchers often need to simulate dynamics in the form of iterated matrix-vector multiplications interspersed with nonlinear operations. Examples include message passing algorithms, gradient descent, and matrix iterative methods for extremal eigenvalue calculations. This paper proposes a new algorithm, named Householder Dice (HD), for simulating such dynamics on several random matrix ensembles with translation-invariant properties. Examples include the Gaussian ensemble, the Haar-distributed random orthogonal ensemble, and their complex-valued counterparts. A "direct" approach to the simulation, where one first generates a dense $n \times n$ matrix from the ensemble, requires at least $\mathcal{O}(n^2)$ resources in space and time. The HD algorithm overcomes this $\mathcal{O}(n^2)$ bottleneck by using the principle of deferred decisions: rather than fixing the entire random matrix in advance, it lets the randomness unfold with the dynamics. Key to this matrix-free construction is an adaptive and recursive construction of (random) Householder reflectors. These orthogonal transformations exploit the group symmetry of the matrix ensembles, while simultaneously maintaining the statistical correlations induced by the dynamics. The memory and computation costs of the HD algorithm are $\mathcal{O}(nT)$ and $\mathcal{O}(nT^2)$, respectively, with $T$ being the number of iterations. When $T \ll n$, which is nearly always the case in practice, the HD algorithm leads to significant reductions in runtime and memory footprint. Numerical results demonstrate the promise of the new algorithm as a computational tool in the study of high-dimensional random systems.
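
The basic building block mentioned above, a Householder reflector, can be applied to a vector in $\mathcal{O}(n)$ time without ever forming an $n \times n$ matrix. The sketch below shows only that single ingredient; it is not the HD algorithm itself. The helper names `householder_vector` and `apply_reflector` are hypothetical, and the input vector is assumed to be nonzero.

```python
import math


def householder_vector(x):
    """Return (v, alpha) with v a unit vector such that the reflector
    H = I - 2 v v^T maps x to alpha * e_1, where |alpha| = ||x||.
    Assumes x is nonzero."""
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    # Pick the sign opposite to x[0] to avoid cancellation in v[0].
    alpha = -math.copysign(norm_x, x[0])
    v = list(x)
    v[0] -= alpha
    norm_v = math.sqrt(sum(vi * vi for vi in v))
    return [vi / norm_v for vi in v], alpha


def apply_reflector(v, y):
    """Compute H y = y - 2 v (v^T y) in O(n) time without forming H."""
    dot = sum(vi * yi for vi, yi in zip(v, y))
    return [yi - 2.0 * dot * vi for vi, yi in zip(v, y)]


x = [3.0, 4.0, 0.0, 12.0]  # Euclidean norm 13
v, alpha = householder_vector(x)
hx = apply_reflector(v, x)      # all mass moves to the first coordinate
y = [1.0, 2.0, 3.0, 4.0]
hy = apply_reflector(v, y)      # the norm of y is preserved
hhy = apply_reflector(v, hy)    # reflecting twice returns y
print(hx, hy, hhy)
```

Because each reflector is stored as a single length-$n$ vector, keeping one reflector per iteration is consistent with the $\mathcal{O}(nT)$ memory cost quoted above.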

### Provable Generalization of SGD-trained Neural Networks of Any Width in the Presence of Adversarial Label Noise

We consider a one-hidden-layer leaky ReLU network of arbitrary width trained by stochastic gradient descent following an arbitrary initialization. We prove that stochastic gradient descent (SGD) produces neural networks that have classification accuracy competitive with that of the best halfspace over the distribution for a broad class of distributions that includes log-concave isotropic and hard margin distributions. Equivalently, such networks can generalize when the data distribution is linearly separable but corrupted with adversarial label noise, despite the capacity to overfit. We conduct experiments which suggest that for some distributions our generalization bounds are nearly tight. This is the first result that shows that overparameterized neural networks trained by SGD can generalize when the data is corrupted with adversarial label noise.

### Learning with Gradient Descent and Weakly Convex Losses

We study the learning performance of gradient descent when the empirical risk is weakly convex, namely, the smallest negative eigenvalue of the empirical risk's Hessian is bounded in magnitude. By showing that this eigenvalue can control the stability of gradient descent, generalisation error bounds are proven that hold under a wider range of step sizes compared to previous work. Out-of-sample guarantees are then achieved by decomposing the test error into generalisation, optimisation and approximation errors, each of which can be bounded and traded off with respect to algorithmic parameters, sample size and magnitude of this eigenvalue. In the case of a two-layer neural network, we demonstrate that the empirical risk can satisfy a notion of local weak convexity; specifically, the Hessian's smallest eigenvalue during training can be controlled by the normalisation of the layers, i.e., network scaling. This allows test error guarantees to then be achieved when the population risk minimiser satisfies a complexity assumption. By trading off the network complexity and scaling, insights are gained into the implicit bias of neural network scaling, which are further supported by experimental findings.
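
For reference, one common way to formalise the weak convexity condition described above is the following. The symbols $\widehat{R}$ (empirical risk), $w$ (parameters) and $\epsilon$ (weak convexity parameter) are notation assumed here for illustration and may differ from the paper's own.

$$
\lambda_{\min}\!\left(\nabla^2 \widehat{R}(w)\right) \;\ge\; -\epsilon \quad \text{for all } w, \qquad \epsilon \ge 0 .
$$

For a twice-differentiable risk this is equivalent to saying that $w \mapsto \widehat{R}(w) + \frac{\epsilon}{2}\|w\|_2^2$ is convex, and $\epsilon = 0$ recovers ordinary convexity.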

### Solving Min-Max Optimization with Hidden Structure via Gradient Descent Ascent

Many recent AI architectures are inspired by zero-sum games; however, the behavior of their dynamics is still not well understood. Inspired by this, we study standard gradient descent ascent (GDA) dynamics in a specific class of non-convex non-concave zero-sum games that we call hidden zero-sum games. In this class, players control the inputs of smooth but possibly non-linear functions whose outputs are applied as inputs to a convex-concave game. Unlike general zero-sum games, these games have a well-defined notion of solution: outcomes that implement the von-Neumann equilibrium of the "hidden" convex-concave game. We prove that if the hidden game is strictly convex-concave then vanilla GDA converges not merely to local Nash, but typically to the von-Neumann solution. If the game lacks strict convexity properties, GDA may fail to converge to any equilibrium; however, by applying standard regularization techniques we can prove convergence to a von-Neumann solution of a slightly perturbed zero-sum game. Our convergence guarantees are non-local, which as far as we know is a first-of-its-kind type of result in non-convex non-concave games. Finally, we discuss connections of our framework with generative adversarial networks.
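
The role of strict convexity-concavity and of regularization can be illustrated on a two-variable toy game. The sketch below runs simultaneous GDA on the bilinear game $f(x, y) = xy$ and on a regularized variant. It is a simplified illustration under stated assumptions: the players control $x$ and $y$ directly, so the hidden non-linear mapping studied in the paper is omitted, and the function `gda`, the step size, and the regularization weight `lam` are hypothetical choices.

```python
import math


def gda(grad_x, grad_y, x, y, step=0.1, n_steps=500):
    """Simultaneous gradient descent ascent: x descends on f, y ascends on f."""
    for _ in range(n_steps):
        gx, gy = grad_x(x, y), grad_y(x, y)
        x, y = x - step * gx, y + step * gy
    return x, y


# Bilinear game f(x, y) = x * y: convex-concave but not strictly so.
# Its unique equilibrium is (0, 0).
bilinear = gda(lambda x, y: y, lambda x, y: x, 1.0, 1.0)

# Regularized game f(x, y) = x*y + (lam/2)*x**2 - (lam/2)*y**2,
# which is strictly convex in x and strictly concave in y.
lam = 0.5
regularized = gda(lambda x, y: y + lam * x, lambda x, y: x - lam * y, 1.0, 1.0)

print("bilinear distance from (0, 0):   ", math.hypot(*bilinear))
print("regularized distance from (0, 0):", math.hypot(*regularized))
```

With these settings the bilinear iterates spiral away from the equilibrium, while the regularized iterates contract toward it, matching the qualitative contrast described above.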

### Beyond Procrustes: Balancing-Free Gradient Descent for Asymmetric Low-Rank Matrix Sensing

Low-rank matrix estimation plays a central role in various applications across science and engineering. Recently, nonconvex formulations based on matrix factorization have been provably solved by simple gradient descent algorithms with strong computational and statistical guarantees. However, when the low-rank matrices are asymmetric, existing approaches rely on adding a regularization term to balance the scale of the two matrix factors, a term which in practice can be removed safely without hurting the performance when initialized via the spectral method. In this paper, we provide a theoretical justification for this in the matrix sensing problem, which aims to recover a low-rank matrix from a small number of linear measurements. As long as the measurement ensemble satisfies the restricted isometry property, gradient descent, in conjunction with spectral initialization, converges linearly without the need to explicitly promote balancedness of the factors; in fact, the factors stay balanced automatically throughout the execution of the algorithm. Our analysis is based on the evolution of a new distance metric that directly accounts for the ambiguity due to invertible transforms, and might be of independent interest.
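
To make the balancing term concrete, a factorized matrix sensing objective is often written in a form like the one below. The notation ($L$, $R$, $\mathcal{A}$, $y$, $\lambda$, and the dimensions $n_1$, $n_2$, $r$) is assumed here for illustration and is not taken from the paper.

$$
\min_{L \in \mathbb{R}^{n_1 \times r},\; R \in \mathbb{R}^{n_2 \times r}} \;\; \frac{1}{2}\,\bigl\| \mathcal{A}(L R^\top) - y \bigr\|_2^2 \;+\; \lambda\, \bigl\| L^\top L - R^\top R \bigr\|_F^2
$$

The second term penalizes imbalance between the two factors, and the balancing-free setting corresponds to $\lambda = 0$. The ambiguity due to invertible transforms arises because $(LQ)(RQ^{-\top})^\top = L R^\top$ for any invertible $Q \in \mathbb{R}^{r \times r}$, so the factors are determined only up to such a $Q$.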