Gradient Descent
Lower Bounds on the Generalization Error of Nonlinear Learning Models
Seroussi, Inbar, Zeitouni, Ofer
We study in this paper lower bounds for the generalization error of models derived from multi-layer neural networks, in the regime where the size of the layers is commensurate with the number of samples in the training data. We show that unbiased estimators have unacceptable performance for such nonlinear networks in this regime. We derive explicit generalization lower bounds for general biased estimators, in the cases of linear regression and of two-layered networks. In the linear case the bound is asymptotically tight. In the nonlinear case, we provide a comparison of our bounds with an empirical study of the stochastic gradient descent algorithm. The analysis uses elements from the theory of large random matrices.
8 Outstanding Papers At ICLR 2021
International Conference on Learning Representations (ICLR) recently announced the ICLR 2021 Outstanding Paper Awards winners. It recognised eight papers out of the 860 submitted this year. The papers were evaluated for both technical quality and the potential to create a practical impact. The committee was chaired by Ivan Titov (U. This paper deals with parameterising hypercomplex multiplications using arbitrarily learnable parameters compared with the fully-connected layer counterpart.
Relieving the Plateau: Active Semi-Supervised Learning for a Better Landscape
Kong, Seo Taek, Jeon, Soomin, Lee, Jaewon, Lee, Hongseok, Jung, Kyu-Hwan
Deep learning (DL) relies on massive amounts of labeled data, and improving its labeled sample-efficiency remains one of the most important problems since its advent. Semi-supervised learning (SSL) leverages unlabeled data that are more accessible than their labeled counterparts. Active learning (AL) selects unlabeled instances to be annotated by a human-in-the-loop in hopes of better performance with less labeled data. Given the accessible pool of unlabeled data in pool-based AL, it seems natural to use SSL when training and AL to update the labeled set; however, algorithms designed for their combination remain limited. In this work, we first prove that convergence of gradient descent on sufficiently wide ReLU networks can be expressed in terms of their Gram matrix' eigen-spectrum. Equipped with a few theoretical insights, we propose convergence rate control (CRC), an AL algorithm that selects unlabeled data to improve the problem conditioning upon inclusion to the labeled set, by formulating an acquisition step in terms of improving training dynamics. Extensive experiments show that SSL algorithms coupled with CRC can achieve high performance using very few labeled data.
Exponentiated Gradient Reweighting for Robust Training Under Label Noise and Beyond
Majidi, Negin, Amid, Ehsan, Talebi, Hossein, Warmuth, Manfred K.
Many learning tasks in machine learning can be viewed as taking a gradient step towards minimizing the average loss of a batch of examples in each training iteration. When noise is prevalent in the data, this uniform treatment of examples can lead to overfitting to noisy examples with larger loss values and result in poor generalization. Inspired by the expert setting in on-line learning, we present a flexible approach to learning from noisy examples. Specifically, we treat each training example as an expert and maintain a distribution over all examples. We alternate between updating the parameters of the model using gradient descent and updating the example weights using the exponentiated gradient update. Unlike other related methods, our approach handles a general class of loss functions and can be applied to a wide range of noise types and applications. We show the efficacy of our approach for multiple learning settings, namely noisy principal component analysis and a variety of noisy classification problems.
Neurons learn slower than they think
Recent studies revealed complex convergence dynamics in gradient-based methods, which has been little understood so far. Changing the step size to balance between high convergence rate and small generalization error may not be sufficient: maximizing the test accuracy usually requires a larger learning rate than minimizing the training loss. To explore the dynamic bounds of convergence rate, this study introduces \textit{differential capability} into an optimization process, which measures whether the test accuracy increases as fast as a model approaches the decision boundary in a classification problem. The convergence analysis showed that: 1) a higher convergence rate leads to slower capability growth; 2) a lower convergence rate results in faster capability growth and decay; 3) regulating a convergence rate in either direction reduces differential capability.
A proof of convergence for stochastic gradient descent in the training of artificial neural networks with ReLU activation for constant target functions
Jentzen, Arnulf, Riekert, Adrian
In this article we study the stochastic gradient descent (SGD) optimization method in the training of fully-connected feedforward artificial neural networks with ReLU activation. The main result of this work proves that the risk of the SGD process converges to zero if the target function under consideration is constant. In the established convergence result the considered artificial neural networks consist of one input layer, one hidden layer, and one output layer (with $d \in \mathbb{N}$ neurons on the input layer, $H \in \mathbb{N}$ neurons on the hidden layer, and one neuron on the output layer). The learning rates of the SGD process are assumed to be sufficiently small and the input data used in the SGD process to train the artificial neural networks is assumed to be independent and identically distributed.
GABO: Graph Augmentations with Bi-level Optimization
Chung, Heejung W., Datta, Avoy, Waites, Chris
Data augmentation refers to a wide range of techniques for improving model generalization by augmenting training examples. Oftentimes such methods require domain knowledge about the dataset at hand, spawning a plethora of recent literature surrounding automated techniques for data augmentation. In this work we apply one such method, bilevel optimization, to tackle the problem of graph classification on the ogbg-molhiv dataset. Our best performing augmentation achieved a test ROCAUC score of 77.77 % with a GIN+virtual classifier, which makes it the most effective augmenter for this classifier on the leaderboard. This framework combines a GIN layer augmentation generator with a bias transformation and outperforms the same classifier augmented using the state-of-the-art FLAG augmentation.
Individually Fair Gradient Boosting
Vargo, Alexander, Zhang, Fan, Yurochkin, Mikhail, Sun, Yuekai
We consider the task of enforcing individual fairness in gradient boosting. Gradient boosting is a popular method for machine learning from tabular data, which arise often in applications where algorithmic fairness is a concern. At a high level, our approach is a functional gradient descent on a (distributionally) robust loss function that encodes our intuition of algorithmic fairness for the ML task at hand. Unlike prior approaches to individual fairness that only work with smooth ML models, our approach also works with non-smooth models such as decision trees. We show that our algorithm converges globally and generalizes. We also demonstrate the efficacy of our algorithm on three ML problems susceptible to algorithmic bias.
The Complexity of Nonconvex-Strongly-Concave Minimax Optimization
Zhang, Siqi, Yang, Junchi, Guzmán, Cristóbal, Kiyavash, Negar, He, Niao
This paper studies the complexity for finding approximate stationary points of nonconvex-strongly-concave (NC-SC) smooth minimax problems, in both general and averaged smooth finite-sum settings. We establish nontrivial lower complexity bounds of $\Omega(\sqrt{\kappa}\Delta L\epsilon^{-2})$ and $\Omega(n+\sqrt{n\kappa}\Delta L\epsilon^{-2})$ for the two settings, respectively, where $\kappa$ is the condition number, $L$ is the smoothness constant, and $\Delta$ is the initial gap. Our result reveals substantial gaps between these limits and best-known upper bounds in the literature. To close these gaps, we introduce a generic acceleration scheme that deploys existing gradient-based methods to solve a sequence of crafted strongly-convex-strongly-concave subproblems. In the general setting, the complexity of our proposed algorithm nearly matches the lower bound; in particular, it removes an additional poly-logarithmic dependence on accuracy present in previous works. In the averaged smooth finite-sum setting, our proposed algorithm improves over previous algorithms by providing a nearly-tight dependence on the condition number.
Exploiting Adam-like Optimization Algorithms to Improve the Performance of Convolutional Neural Networks
Nanni, Loris, Maguolo, Gianluca, Lumini, Alessandra
Stochastic gradient descent (SGD) is the main approach for training deep networks: it moves towards the optimum of the cost function by iteratively updating the parameters of a model in the direction of the gradient of the loss evaluated on a minibatch. Several variants of SGD have been proposed to make adaptive step sizes for each parameter (adaptive gradient) and take into account the previous updates (momentum). Among several alternative of SGD the most popular are AdaGrad, AdaDelta, RMSProp and Adam which scale coordinates of the gradient by square roots of some form of averaging of the squared coordinates in the past gradients and automatically adjust the learning rate on a parameter basis. In this work, we compare Adam based variants based on the difference between the present and the past gradients, the step size is adjusted for each parameter. We run several tests benchmarking proposed methods using medical image data. The experiments are performed using ResNet50 architecture neural network. Moreover, we have tested ensemble of networks and the fusion with ResNet50 trained with stochastic gradient descent. To combine the set of ResNet50 the simple sum rule has been applied. Proposed ensemble obtains very high performance, it obtains accuracy comparable or better than actual state of the art. To improve reproducibility and research efficiency the MATLAB source code used for this research is available at GitHub: https://github.com/LorisNanni.