 Gradient Descent


How Goodhart's Law Can Save Machine Learning Research

#artificialintelligence

"When a measure becomes a target, it ceases to be a good measure." Stochastic Gradient Descent (SGD) has been responsible for many of the most outstanding achievements in machine learning. The objective of SGD is to optimise a target in the form of a loss function. But SGD fails in finding 'standard' loss functions in a few settings as it converges to the 'easy' solutions. As we see above, when classifying sheep, the network learns to use the green background to identify the sheep present.
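As a rough illustration of "optimising a target in the form of a loss function", here is a minimal mini-batch SGD sketch on a mean-squared-error loss. The function name `sgd_linear_regression`, the synthetic data, and all hyperparameters are assumptions chosen for the example, not taken from the article.

```python
import numpy as np

def sgd_linear_regression(X, y, lr=0.05, epochs=50, batch_size=8, seed=0):
    """Minimise the mean squared error of a linear model with mini-batch SGD."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(n)  # reshuffle the data every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            residual = X[idx] @ w - y[idx]
            # Gradient of mean((Xw - y)^2) over the current mini-batch
            grad = 2.0 * X[idx].T @ residual / len(idx)
            w -= lr * grad
    return w

# Hypothetical noise-free data generated from a known weight vector
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w
w_hat = sgd_linear_regression(X, y)
```

The loss here is the only thing SGD "sees": any weight vector that drives it down is acceptable to the optimiser, which is the sense in which the measure becomes the target.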


MG-GCN: Fast and Effective Learning with Mix-grained Aggregators for Training Large Graph Convolutional Networks

arXiv.org Artificial Intelligence

Graph convolutional networks (GCNs) have recently become an important tool in many graph-based applications. Inspired by convolutional neural networks (CNNs), GCNs generate the embeddings of nodes by aggregating the information of their neighbors layer by layer. However, the high computational and memory cost of GCNs, due to the recursive neighborhood expansion across GCN layers, makes training on large graphs infeasible. To tackle this issue, several methods that sample during information aggregation have been proposed to train GCNs in a mini-batch Stochastic Gradient Descent (SGD) manner. Nevertheless, these sampling strategies sometimes raise concerns about insufficient information collection, which may hinder learning performance in terms of accuracy and convergence. To tackle the dilemma between accuracy and efficiency, we propose to use aggregators with different granularities to gather neighborhood information in different layers. Then, a degree-based sampling strategy, which avoids the exponential complexity, is constructed for sampling a fixed number of nodes. Combining the above two mechanisms, the proposed model, named Mix-grained GCN (MG-GCN), achieves state-of-the-art performance in terms of accuracy, training speed, convergence speed, and memory cost in a comprehensive set of experiments on four commonly used benchmark datasets and a new Ethereum dataset.


Contrastive Weight Regularization for Large Minibatch SGD

arXiv.org Machine Learning

The minibatch stochastic gradient descent method (SGD) is widely applied in deep learning due to its efficiency and scalability, which enable training deep networks with a large volume of data. Particularly in the distributed setting, SGD is usually applied with a large batch size. However, as opposed to small-batch SGD, neural network models trained with large-batch SGD can hardly generalize well, i.e., the validation accuracy is low. In this work, we introduce a novel regularization technique, namely distinctive regularization (DReg), which replicates a certain layer of the deep network and encourages the parameters of both layers to be diverse. The DReg technique introduces very little computation overhead. Moreover, we empirically show that optimizing the neural network with DReg using large-batch SGD achieves a significant boost in convergence and improved generalization performance. We also demonstrate that DReg can boost the convergence of large-batch SGD with momentum. We believe that DReg can be used as a simple regularization trick to accelerate large-batch training in deep learning.


ZORB: A Derivative-Free Backpropagation Algorithm for Neural Networks

arXiv.org Machine Learning

Gradient descent and backpropagation have enabled neural networks to achieve remarkable results in many real-world applications. Despite ongoing success, training a neural network with gradient descent can be a slow and strenuous affair. We present a simple yet faster training algorithm called Zeroth-Order Relaxed Backpropagation (ZORB). Instead of calculating gradients, ZORB uses the pseudoinverse of targets to backpropagate information. ZORB is designed to reduce the time required to train deep neural networks without penalizing performance. To illustrate the speed up, we trained a feed-forward neural network with 11 layers on MNIST and observed that ZORB converged 300 times faster than Adam while achieving a comparable error rate, without any hyperparameter tuning. We also broaden the scope of ZORB to convolutional neural networks, and apply it to subsamples of the CIFAR-10 dataset. Experiments on standard classification and regression benchmarks demonstrate ZORB's advantage over traditional backpropagation with Gradient Descent.
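To give a feel for what "using the pseudoinverse of targets" instead of gradients can mean, the sketch below fits a single linear layer in closed form with the Moore-Penrose pseudoinverse. This is not the ZORB algorithm itself, only the basic derivative-free least-squares step; the data and the names `X`, `Y`, `W_true`, and `W_fit` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer inputs (100 samples, 4 features) and targets (2 outputs)
X = rng.normal(size=(100, 4))
W_true = rng.normal(size=(4, 2))
Y = X @ W_true

# Closed-form least-squares weights: no gradients and no learning rate
W_fit = np.linalg.pinv(X) @ Y
```

Because the solution is obtained in one linear-algebra step, there is no step size or iteration count to tune for this layer.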


Learning Regular Expressions for Interpretable Medical Text Classification Using a Pool-based Simulated Annealing and Word-vector Models

arXiv.org Artificial Intelligence

In this paper, we propose a rule-based engine composed of high-quality and interpretable regular expressions for medical text classification. The regular expressions are auto-generated by a constructive heuristic method and optimized using a Pool-based Simulated Annealing (PSA) approach. Although existing Deep Neural Network (DNN) methods present high-quality performance in most Natural Language Processing (NLP) applications, the solutions are regarded as uninterpretable black boxes to humans. Therefore, rule-based methods are often introduced when interpretable solutions are needed, especially in the medical field. However, the construction of regular expressions can be extremely labor-intensive for large data sets. This research aims to reduce the manual effort while maintaining high-quality solutions.


Regularized Mutual Information Neural Estimation

arXiv.org Machine Learning

With the variational lower bound of mutual information (MI), the estimation of MI can be understood as an optimization task via stochastic gradient descent. In this work, we start by showing how Mutual Information Neural Estimator (MINE) searches for the optimal function $T$ that maximizes the Donsker-Varadhan representation. With our synthetic dataset, we directly observe the neural network outputs during the optimization to investigate why MINE succeeds or fails: we discover the drifting phenomenon, where the constant term of $T$ is shifting through the optimization process, and analyze the instability caused by the interaction between the $\mathrm{logsumexp}$ and the insufficient batch size. Next, through theoretical and experimental evidence, we propose a novel lower bound that effectively regularizes the neural network to alleviate the problems of MINE. We also introduce an averaging strategy that produces an unbiased estimate by utilizing multiple batches to mitigate the batch size limitation. Finally, we show that $L^2$ regularization achieves significant improvements in both discrete and continuous settings.
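For orientation, the Donsker-Varadhan representation bounds MI from below by $\mathbb{E}_{P_{XZ}}[T] - \log \mathbb{E}_{P_X \otimes P_Z}[e^{T}]$. The sketch below evaluates that quantity on one batch of critic outputs using a numerically stable log-mean-exp. It is a minimal illustration, not the authors' estimator or their regularized bound; the helper names `log_mean_exp` and `dv_lower_bound` and the sample values are assumptions.

```python
import numpy as np

def log_mean_exp(values):
    """Stable log(mean(exp(values))) via the usual max-shift trick."""
    values = np.asarray(values, dtype=float)
    shift = values.max()
    return shift + np.log(np.mean(np.exp(values - shift)))

def dv_lower_bound(t_joint, t_marginal):
    """Batch estimate of E_joint[T] - log E_marginal[exp(T)]."""
    return float(np.mean(t_joint) - log_mean_exp(t_marginal))

# Hypothetical critic outputs on joint and shuffled (marginal) samples
t_joint = np.array([1.2, 0.8, 1.5, 1.1])
t_marginal = np.array([-0.3, 0.1, -0.2, 0.4])

estimate = dv_lower_bound(t_joint, t_marginal)
# Adding a constant to every output of T leaves the value unchanged
shifted = dv_lower_bound(t_joint + 5.0, t_marginal + 5.0)
```

The shift invariance in the last line means the bound does not pin down the constant term of $T$, which is consistent with the drifting behaviour the abstract describes.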


The Roadmap of Mathematics for Deep Learning

#artificialintelligence

Knowing the mathematics behind machine learning algorithms is a superpower. If you have ever built a model for a real-life problem, you probably experienced that being familiar with the details can go a long way if you want to move beyond baseline performance. This is especially true when you want to push the boundaries of state of the art. However, most of this knowledge is hidden behind layers of advanced mathematics. Understanding methods like stochastic gradient descent might seem difficult since it is built on top of multivariable calculus and probability theory.


Placement in Integrated Circuits using Cyclic Reinforcement Learning and Simulated Annealing

arXiv.org Artificial Intelligence

Physical design and production of Integrated Circuits (IC) is becoming increasingly challenging as the sophistication of IC technology steadily increases. Placement has been one of the most critical steps in IC physical design. Through decades of research, partition-based, analytical-based and annealing-based placers have been enriching the placement solution toolbox. However, open challenges including long run time and lack of ability to generalize continue to restrict wider applications of existing placement tools. We devise a learning-based placement tool based on cyclic application of Reinforcement Learning (RL) and Simulated Annealing (SA) by leveraging the advancement of RL. Results show that the RL module is able to provide a better initialization for SA and thus leads to a better final placement design. Compared to other recent learning-based placers, our method differs mainly in its combination of RL and SA. It leverages the RL model's ability to quickly get a good rough solution after training and the heuristic's ability to realize greedy improvements in the solution.


Coresets for Robust Training of Neural Networks against Noisy Labels

arXiv.org Machine Learning

Modern neural networks have the capacity to overfit noisy labels frequently found in real-world datasets. Although great progress has been made, existing techniques are limited in providing theoretical guarantees for the performance of the neural networks trained with noisy labels. Here we propose a novel approach with strong theoretical guarantees for robust training of deep networks trained with noisy labels. The key idea behind our method is to select weighted subsets (coresets) of clean data points that provide an approximately low-rank Jacobian matrix. We then prove that gradient descent applied to the subsets does not overfit the noisy labels. Our extensive experiments corroborate our theory and demonstrate that deep networks trained on our subsets achieve a significantly superior performance compared to state-of-the-art, e.g., 6% increase in accuracy on CIFAR-10 with 80% noisy labels, and 7% increase in accuracy on mini Webvision.


Implicit bias of gradient-descent: fast convergence rate

arXiv.org Machine Learning

We consider gradient-flow (GF) and gradient-descent (GD) on linear classification problems in possibly infinite-dimensional and non-Hilbertian Banach spaces. For exponential-tailed loss functions, including the usual exponential and logistic loss functions, we establish $\mathcal O (\log (n)/ t)$ convergence rate for the bias in case of GF, and $\widetilde{\mathcal O}(\log (n)/\sqrt{t})$ in case of GD. This is a net improvement on best known rates, namely $\mathcal O(\log (n) / \log (t))$. See Ji and Telgarsky (2019), for example. Up to logarithmic factors, our GD rate matches the very recent parallel work from Ji and Telgarsky (2020), which uses an aggressive stepsize schedule. Finally, using the aggressive stepsize schedule proposed by Ji and Telgarsky (2020), we are able to obtain a convergence rate of $\mathcal O(\log (n)/t)$ for the bias. Our methods of analysis are quite general and radically different from the usual techniques used in the literature: we use nonlinear error analysis for convex functions, in the spirit of Kurdyka-Łojasiewicz theory. One major advantage of our method is that it allows us to convert any convergence rate for the margin to a convergence rate on the bias, which is at least as good as the former. We believe our work will provide an alternative approach for analyzing the implicit bias of gradient-flow / gradient-descent in very general settings.