AITopics | Gradient Descent

Gradient Descent

How Goodhart's Law Can Save Machine Learning Research

#artificialintelligence, Nov-17-2020, 06:40:24 GMT

"When a measure becomes a target, it ceases to be a good measure." Stochastic Gradient Descent (SGD) has been responsible for many of the most outstanding achievements in machine learning. The objective of SGD is to optimise a target in the form of a loss function. But in a few settings SGD fails to find 'standard' loss functions, as it converges to the 'easy' solutions. For example, when classifying sheep, the network learns to use the green background to identify the sheep present.

algorithm, goodhart, save machine learning research, (12 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.59)
Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.36)

MG-GCN: Fast and Effective Learning with Mix-grained Aggregators for Training Large Graph Convolutional Networks

Huang, Tao, Zhang, Yihan, Wu, Jiajing, Fang, Junyuan, Zheng, Zibin

arXiv.org Artificial Intelligence, Nov-17-2020

Graph convolutional networks (GCNs) have recently become an important tool in many graph-based applications. Inspired by convolutional neural networks (CNNs), GCNs generate the embeddings of nodes by aggregating the information of their neighbors layer by layer. However, the high computational and memory cost of GCNs, caused by the recursive neighborhood expansion across GCN layers, makes training on large graphs infeasible. To tackle this issue, several methods that sample during information aggregation have been proposed to train GCNs in a mini-batch Stochastic Gradient Descent (SGD) manner. Nevertheless, these sampling strategies sometimes raise concerns about insufficient information collection, which may hinder learning performance in terms of accuracy and convergence. To tackle the dilemma between accuracy and efficiency, we propose to use aggregators with different granularities to gather neighborhood information in different layers. Then, a degree-based sampling strategy, which avoids the exponential complexity, is constructed for sampling a fixed number of nodes. Combining the above two mechanisms, the proposed model, named Mix-grained GCN (MG-GCN), achieves state-of-the-art performance in terms of accuracy, training speed, convergence speed, and memory cost in a comprehensive set of experiments on four commonly used benchmark datasets and a new Ethereum dataset.

aggregator, neighbor, node, (15 more...)

2011.099

Country:

North America > United States > California > San Francisco County > San Francisco (0.28)
North America > United States > California > Los Angeles County > Long Beach (0.14)
Asia > China > Guangdong Province > Guangzhou (0.05)
(10 more...)

Genre: Research Report (1.00)

Industry:

Information Technology > Security & Privacy (0.46)
Banking & Finance > Trading (0.37)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.54)

Contrastive Weight Regularization for Large Minibatch SGD

Yuan, Qiwei, Hua, Weizhe, Zhou, Yi, Yu, Cunxi

arXiv.org Machine Learning, Nov-17-2020

The minibatch stochastic gradient descent method (SGD) is widely applied in deep learning due to its efficiency and scalability that enable training deep networks with a large volume of data. Particularly in the distributed setting, SGD is usually applied with a large batch size. However, as opposed to small-batch SGD, neural network models trained with large-batch SGD can hardly generalize well, i.e., the validation accuracy is low. In this work, we introduce a novel regularization technique, namely distinctive regularization (DReg), which replicates a certain layer of the deep network and encourages the parameters of both layers to be diverse. The DReg technique introduces very little computation overhead. Moreover, we empirically show that optimizing the neural network with DReg using large-batch SGD achieves a significant boost in the convergence and an improved generalization performance. We also demonstrate that DReg can boost the convergence of large-batch SGD with momentum. We believe that DReg can be used as a simple regularization trick to accelerate large-batch training in deep learning.

dreg, sgd, validation acc, (16 more...)

2011.08968

Country:

North America > United States > Utah (0.04)
North America > Canada > Ontario > Toronto (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.70)

ZORB: A Derivative-Free Backpropagation Algorithm for Neural Networks

Ranganathan, Varun, Lewandowski, Alex

arXiv.org Machine Learning, Nov-17-2020

Gradient descent and backpropagation have enabled neural networks to achieve remarkable results in many real-world applications. Despite ongoing success, training a neural network with gradient descent can be a slow and strenuous affair. We present a simple yet faster training algorithm called Zeroth-Order Relaxed Backpropagation (ZORB). Instead of calculating gradients, ZORB uses the pseudoinverse of targets to backpropagate information. ZORB is designed to reduce the time required to train deep neural networks without penalizing performance. To illustrate the speedup, we trained a feed-forward neural network with 11 layers on MNIST and observed that ZORB converged 300 times faster than Adam while achieving a comparable error rate, without any hyperparameter tuning. We also broaden the scope of ZORB to convolutional neural networks, and apply it to subsamples of the CIFAR-10 dataset. Experiments on standard classification and regression benchmarks demonstrate ZORB's advantage over traditional backpropagation with Gradient Descent.
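To make the role of the pseudoinverse concrete, here is a minimal sketch that fits a single linear layer in one step, with no gradients. It is not the ZORB algorithm, whose details the abstract does not give; it only shows the closed-form least-squares fit that a Moore-Penrose pseudoinverse provides. The data, shapes, and names (`X`, `W_true`, `W_fit`) are assumptions made up for illustration.

```python
import numpy as np

# Synthetic data (an assumption for illustration): 100 samples, 5 inputs, 3 outputs.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
W_true = rng.normal(size=(5, 3))
Y = X @ W_true  # noiseless targets produced by a known linear layer

# One-step least-squares fit: W = pinv(X) @ Y minimises ||X W - Y||_F.
W_fit = np.linalg.pinv(X) @ Y

# With noiseless targets and full-column-rank X, the fit recovers W_true
# up to floating-point error.
max_abs_error = np.max(np.abs(W_fit - W_true))
```

A gradient-based optimiser would need many iterations to reach the same weights on this toy problem, which gives some intuition for why a pseudoinverse-based update can be fast.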

algorithm, matrix, zorb, (15 more...)

2011.08895

Country:

North America > Canada > Alberta > Census Division No. 11 > Edmonton Metropolitan Region > Edmonton (0.04)
North America > United States > California (0.04)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
(2 more...)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Backpropagation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.87)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.76)

Learning Regular Expressions for Interpretable Medical Text Classification Using a Pool-based Simulated Annealing and Word-vector Models

Tu, Chaofan, Bai, Ruibin, Lu, Zheng, Aickelin, Uwe, Ge, Peiming, Zhao, Jianshuang

arXiv.org Artificial Intelligence, Nov-16-2020

In this paper, we propose a rule-based engine composed of high-quality, interpretable regular expressions for medical text classification. The regular expressions are automatically generated by a constructive heuristic method and optimized using a Pool-based Simulated Annealing (PSA) approach. Although existing Deep Neural Network (DNN) methods present high-quality performance in most Natural Language Processing (NLP) applications, the solutions are regarded as uninterpretable black boxes to humans. Therefore, rule-based methods are often introduced when interpretable solutions are needed, especially in the medical field. However, the construction of regular expressions can be extremely labor-intensive for large data sets. This research aims to reduce manual effort while maintaining high-quality solutions.

classifier, expression, regular expression, (16 more...)

2011.09351

Country:

Asia > China > Zhejiang Province > Ningbo (0.05)
Europe > United Kingdom > England > Nottinghamshire > Nottingham (0.04)
Asia > China > Shanghai > Shanghai (0.04)
(2 more...)

Genre:

Research Report (0.64)
Overview (0.48)

Industry: Health & Medicine > Therapeutic Area > Oncology (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.88)
(2 more...)

Regularized Mutual Information Neural Estimation

Choi, Kwanghee, Lee, Siyeong

arXiv.org Machine Learning, Nov-16-2020

With the variational lower bound of mutual information (MI), the estimation of MI can be understood as an optimization task via stochastic gradient descent. In this work, we start by showing how the Mutual Information Neural Estimator (MINE) searches for the optimal function $T$ that maximizes the Donsker-Varadhan representation. With our synthetic dataset, we directly observe the neural network outputs during the optimization to investigate why MINE succeeds or fails: we discover the drifting phenomenon, where the constant term of $T$ is shifting through the optimization process, and analyze the instability caused by the interaction between the $\mathrm{logsumexp}$ and the insufficient batch size. Next, through theoretical and experimental evidence, we propose a novel lower bound that effectively regularizes the neural network to alleviate the problems of MINE. We also introduce an averaging strategy that produces an unbiased estimate by utilizing multiple batches to mitigate the batch size limitation. Finally, we show that $L^2$ regularization achieves significant improvements in both discrete and continuous settings.
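For readers unfamiliar with the Donsker-Varadhan representation, the following sketch evaluates the standard batch estimate $\mathbb{E}_{\text{joint}}[T] - \log \mathbb{E}_{\text{marginal}}[e^{T}]$ from two arrays of network outputs. It is a minimal illustration of that textbook formula, not the paper's regularized bound. The function name `dv_lower_bound` and the sample values are hypothetical.

```python
import numpy as np

def dv_lower_bound(t_joint, t_marginal):
    """Batch estimate of the Donsker-Varadhan bound.

    t_joint:    values of T on samples drawn from the joint distribution.
    t_marginal: values of T on samples drawn from the product of marginals.
    """
    t_joint = np.asarray(t_joint, dtype=float)
    t_marginal = np.asarray(t_marginal, dtype=float)
    # Numerically stable log-mean-exp: subtract the maximum before exponentiating.
    m = t_marginal.max()
    log_mean_exp = m + np.log(np.mean(np.exp(t_marginal - m)))
    return t_joint.mean() - log_mean_exp

# Hypothetical outputs of T on a small batch.
bound = dv_lower_bound([1.0, 2.0], [0.0, 1.0, 2.0])

# Adding the same constant to every output of T leaves the estimate unchanged.
shifted = dv_lower_bound([101.0, 102.0], [100.0, 101.0, 102.0])
```

Because the estimate is invariant to a constant offset in $T$, the objective alone does not pin down the constant term, which is consistent with the drifting phenomenon described in the abstract.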

batch size, network output, representation, (13 more...)

2011.07932

Country: Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.54)

The Roadmap of Mathematics for Deep Learning

#artificialintelligence, Nov-15-2020, 21:45:59 GMT

Knowing the mathematics behind machine learning algorithms is a superpower. If you have ever built a model for a real-life problem, you probably experienced that being familiar with the details can go a long way if you want to move beyond baseline performance. This is especially true when you want to push the boundaries of state of the art. However, most of this knowledge is hidden behind layers of advanced mathematics. Understanding methods like stochastic gradient descent might seem difficult since it is built on top of multivariable calculus and probability theory.

deep learning, mathematics, roadmap, (3 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.60)
Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.44)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.40)

Placement in Integrated Circuits using Cyclic Reinforcement Learning and Simulated Annealing

Vashisht, Dhruv, Rampal, Harshit, Liao, Haiguang, Lu, Yang, Shanbhag, Devika, Fallon, Elias, Kara, Levent Burak

arXiv.org Artificial Intelligence, Nov-15-2020

Physical design and production of Integrated Circuits (IC) are becoming increasingly challenging as the sophistication of IC technology steadily increases. Placement has been one of the most critical steps in IC physical design. Through decades of research, partition-based, analytical-based and annealing-based placers have been enriching the placement solution toolbox. However, open challenges including long run time and lack of ability to generalize continue to restrict wider applications of existing placement tools. We devise a learning-based placement tool based on cyclic application of Reinforcement Learning (RL) and Simulated Annealing (SA) by leveraging the advancement of RL. Results show that the RL module is able to provide a better initialization for SA and thus leads to a better final placement design. Compared to other recent learning-based placers, our method differs mainly in its combination of RL and SA. It leverages the RL model's ability to quickly get a good rough solution after training and the heuristic's ability to realize greedy improvements in the solution.
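As a rough illustration of the annealing half of such a pipeline, here is a generic simulated annealing loop that refines a supplied starting solution. It is a textbook sketch, not the authors' placer: the `initial` argument merely stands in for whatever an RL module would provide, and the toy cost function, swap move, and cooling schedule are all assumptions chosen for brevity.

```python
import math
import random

def simulated_annealing(initial, cost, neighbor, steps=2000,
                        t_start=1.0, t_end=1e-3, seed=0):
    """Generic simulated annealing; returns the best solution seen and its cost."""
    rng = random.Random(seed)
    current, current_cost = initial, cost(initial)
    best, best_cost = current, current_cost
    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        temperature = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        candidate = neighbor(current, rng)
        candidate_cost = cost(candidate)
        delta = candidate_cost - current_cost
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_cost = candidate, candidate_cost
            if current_cost < best_cost:
                best, best_cost = current, current_cost
    return best, best_cost

# Toy stand-in for a placement cost: total gap between consecutive labels.
def toy_cost(perm):
    return sum(abs(a - b) for a, b in zip(perm, perm[1:]))

# Move: swap two randomly chosen positions.
def swap_neighbor(perm, rng):
    i, j = rng.sample(range(len(perm)), 2)
    swapped = list(perm)
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return swapped

start = [3, 0, 4, 1, 5, 2]  # hypothetical rough starting solution
best, best_cost = simulated_annealing(start, toy_cost, swap_neighbor)
```

Since the loop tracks the best solution seen, its result is never worse than the starting point, so a stronger initialization can only help the final answer under this scheme.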

algorithm, initialization, placement, (14 more...)

2011.07577

Country:

North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.16)
North America > United States > North Carolina (0.04)
North America > United States > California > Santa Clara County > San Jose (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (0.34)

Industry: Semiconductors & Electronics (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.86)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.63)

Coresets for Robust Training of Neural Networks against Noisy Labels

Mirzasoleiman, Baharan, Cao, Kaidi, Leskovec, Jure

arXiv.org Machine Learning, Nov-14-2020

Modern neural networks have the capacity to overfit noisy labels frequently found in real-world datasets. Although great progress has been made, existing techniques are limited in providing theoretical guarantees for the performance of the neural networks trained with noisy labels. Here we propose a novel approach with strong theoretical guarantees for robust training of deep networks trained with noisy labels. The key idea behind our method is to select weighted subsets (coresets) of clean data points that provide an approximately low-rank Jacobian matrix. We then prove that gradient descent applied to the subsets does not overfit the noisy labels. Our extensive experiments corroborate our theory and demonstrate that deep networks trained on our subsets achieve a significantly superior performance compared to the state of the art, e.g., 6% increase in accuracy on CIFAR-10 with 80% noisy labels, and 7% increase in accuracy on mini Webvision.

neural network, noisy label, subset, (14 more...)

2011.07451

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.28)
North America > United States > California > Santa Clara County > Stanford (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)

Genre: Research Report > New Finding (0.93)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.51)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Implicit bias of gradient-descent: fast convergence rate

Dohmatob, Elvis

arXiv.org Machine Learning, Nov-12-2020

We consider gradient-flow (GF) and gradient-descent (GD) on linear classification problems in possibly infinite-dimensional and non-Hilbertian Banach spaces. For exponential-tailed loss functions, including the usual exponential and logistic loss functions, we establish $\mathcal O (\log (n)/ t)$ convergence rate for the bias in case of GF, and $\widetilde{\mathcal O}(\log (n)/\sqrt{t})$ in case of GD. This is a net improvement on best known rates, namely $\mathcal O(\log (n) / \log (t))$. See Ji and Telgarsky (2019), for example. Up to logarithmic factors, our GD rate matches the very recent parallel work from Ji and Telgarsky (2020) which uses an aggressive stepsize schedule. Finally, using the aggressive stepsize schedule proposed by Ji and Telgarsky (2020), we are able to obtain a convergence rate of $\mathcal O(\log (n)/t)$ for the bias. Our methods of analysis are quite general and radically different from the usual techniques used in the literature: we use nonlinear error analysis for convex functions, in the spirit of Kurdyka-\L{}ojasiewicz theory. One major advantage of our method is that it allows us to convert any convergence rate for the margin to a convergence rate on the bias, which is at least as good as the former. We believe our work will provide an alternative approach for analyzing the implicit bias of gradient-flow / gradient-descent in very general settings.

convergence rate, inequality, theorem 2, (12 more...)

2011.0655

Country: Europe > Austria > Styria > Graz (0.04)

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.84)
