AITopics

2002.0484

Country:

North America > United States > Arizona (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > Afghanistan > Parwan Province > Charikar (0.04)

Genre: Research Report (0.82)

Industry: Education > Educational Setting > Online (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.47)

Collins, Liam, Mokhtari, Aryan, Shakkottai, Sanjay

Distribution-Agnostic Model-Agnostic Meta-Learning

arXiv.org Machine LearningFeb-11-2020

The Model-Agnostic Meta-Learning (MAML) algorithm \citep{finn2017model} has been celebrated for its efficiency and generality, as it has demonstrated success in quickly learning the parameters of an arbitrary learning model. However, MAML implicitly assumes that the tasks come from a particular distribution, and optimizes the expected (or sample average) loss over tasks drawn from this distribution. Here, we amend this limitation of MAML by reformulating the objective function as a min-max problem, where the maximization is over the set of possible distributions over tasks. Our proposed algorithm is the first distribution-agnostic and model-agnostic meta-learning method, and we show that it converges to an $\epsilon$-accurate point at the rate of $\mathcal{O}(1/\epsilon^2)$ in the convex setting and to an $(\epsilon, \delta)$-stationary point at the rate of $\mathcal{O}(\max\{1/\epsilon^5, 1/\delta^5\})$ in nonconvex settings. We also provide numerical experiments that demonstrate the worst-case superiority of our algorithm in comparison to MAML.

algorithm, gradient, stochastic gradient, (14 more...)

2002.04766

Country:

North America > United States > Texas > Travis County > Austin (0.14)
North America > Canada > Quebec > Montreal (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.31)

Mulayoff, Rotem, Michaeli, Tomer

Unique Properties of Wide Minima in Deep Networks

arXiv.org Machine LearningFeb-11-2020

It is well known that (stochastic) gradient descent has an implicit bias towards wide minima. In deep neural network training, this mechanism serves to screen out minima. However, the precise effect that this has on the trained network is not yet fully understood. In this paper, we characterize the wide minima in linear neural networks trained with a quadratic loss. First, we show that linear ResNets with zero initialization necessarily converge to the widest of all minima. We then prove that these minima correspond to nearly balanced networks whereby the gain from the input to any intermediate representation does not change drastically from one layer to the next. Finally, we show that consecutive layers in wide minima solutions are coupled. That is, one of the left singular vectors of each weight matrix, equals one of the right singular vectors of the next matrix. This forms a distinct path from input to output, that, as we show, is dedicated to the signal that experiences the largest gain end-to-end. Experiments indicate that these properties are characteristic of both linear and nonlinear models trained in practice.

matrix, minima, unique property, (14 more...)

2002.0471

Country: Asia > Middle East > Israel > Haifa District > Haifa (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.54)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.34)

Gupta, Sidharth, Dube, Parijat, Verma, Ashish

Improving the affordability of robustness training for DNNs

arXiv.org Machine LearningFeb-11-2020

Projected Gradient Descent (PGD) based adversarial training has become one of the most prominent methods for building robust deep neural network models. However, the computational complexity associated with this approach, due to the maximization of the loss function when finding adversaries, is a longstanding problem and may be prohibitive when using larger and more complex models. In this paper, we propose a modification of the PGD method for adversarial training and demonstrate that models can be trained much more efficiently without any loss in accuracy on natural and adversarial samples. We argue that the initial phase of adversarial training is redundant and can be replaced with natural training thereby increasing the computational efficiency significantly. We support our argument with insights on the nature of the adversaries and their relative strength during the training process. We show that our proposed method can reduce the training time to up to 38\% of the original training time with comparable model accuracy and generalization on various strengths of adversarial attacks.

accuracy, adversarial sample, adversarial training, (15 more...)

2002.04237

Country: North America > United States > Illinois (0.04)

Genre: Research Report (0.82)

Industry: Information Technology (0.48)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.35)

Infinity Learning: Learning Markov Chains from Aggregate Steady-State Observations

Gao, Jianfei, Zahran, Mohamed A., Sheoran, Amit, Fahmy, Sonia, Ribeiro, Bruno

We consider the task of learning a parametric Continuous Time Markov Chain (CTMC) sequence model without examples of sequences, where the training data consists entirely of aggregate steady-state statistics. Making the problem harder, we assume that the states we wish to predict are unobserved in the training data. Specifically, given a parametric model over the transition rates of a CTMC and some known transition rates, we wish to extrapolate its steady state distribution to states that are unobserved. A technical roadblock to learn a CTMC from its steady state has been that the chain rule to compute gradients will not work over the arbitrarily long sequences necessary to reach steady state ---from where the aggregate statistics are sampled. To overcome this optimization challenge, we propose $\infty$-SGD, a principled stochastic gradient descent method that uses randomly-stopped estimators to avoid infinite sums required by the steady state computation, while learning even when only a subset of the CTMC states can be observed. We apply $\infty$-SGD to a real-world testbed and synthetic experiments showcasing its accuracy, ability to extrapolate the steady state distribution to unobserved states under unobserved conditions (heavy loads, when training under light loads), and succeeding in difficult scenarios where even a tailor-made extension of existing methods fails.

equation, parametric model, training data, (14 more...)

2002.04186

Country:

South America > Brazil (0.04)
North America > United States > Indiana > Tippecanoe County > West Lafayette (0.04)
North America > United States > Indiana > Tippecanoe County > Lafayette (0.04)
North America > United States > District of Columbia > Washington (0.04)

Genre: Research Report > New Finding (0.93)

Industry:

Information Technology (0.46)
Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)

Chu, Casey, Minami, Kentaro, Fukumizu, Kenji

Smoothness and Stability in GANs

In this work, we develop a principled theoretical framework for understanding the stability of various types of GANs. In particular, we derive conditions that guarantee eventual stationarity of the generator when it is trained with gradient descent, conditions that must be satisfied by the divergence that is minimized by the GAN and the generator's architecture. We find that existing GAN variants satisfy some, but not all, of these conditions. Using tools from convex analysis, optimal transport, and reproducing kernels, we construct a GAN that fulfills these conditions simultaneously. In the process, we explain and clarify the need for various existing GAN stabilization techniques, including Lipschitz constraints, gradient penalties, and smooth activation functions.

conference paper, discriminator, optimal discriminator, (14 more...)

2002.04185

Country:

North America > United States > New York (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.81)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.48)

A Fully Online Approach for Covariance Matrices Estimation of Stochastic Gradient Descent Solutions

Zhu, Wanrong, Chen, Xi, Wu, Wei Biao

Stochastic gradient descent (SGD) algorithm is widely used for parameter estimation especially in online setting. While this recursive algorithm is popular for computation and memory efficiency, the problem of quantifying variability and randomness of the solutions has been rarely studied. This paper aims at conducting statistical inference of SGD-based estimates in online setting. In particular, we propose a fully online estimator for the covariance matrix of averaged SGD iterates (ASGD). Based on the classic asymptotic normality results of ASGD, we construct asymptotically valid confidence intervals for model parameters. Upon receiving new observations, we can quickly update the covariance estimator and confidence intervals. This approach fits in online setting even if the total number of data is unknown and takes the full advantage of SGD: efficiency in both computation and memory.

confidence interval, estimator, lemma, (12 more...)

2002.03979

Country:

North America > United States > New York (0.04)
North America > United States > Illinois > Cook County > Chicago (0.04)
North America > United States > California (0.04)

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Ablin, Pierre, Peyré, Gabriel, Moreau, Thomas

Super-efficiency of automatic differentiation for functions defined as a minimum

In min-min optimization or max-min optimization, one has to compute the gradient of a function defined as a minimum. In most cases, the minimum has no closed-form, and an approximation is obtained via an iterative algorithm. There are two usual ways of estimating the gradient of the function: using either an analytic formula obtained by assuming exactness of the approximation, or automatic differentiation through the algorithm. In this paper, we study the asymptotic error made by these estimators as a function of the optimization error. We find that the error of the automatic estimator is close to the square of the error of the analytic estimator, reflecting a super-efficiency phenomenon. The convergence of the automatic estimator greatly depends on the convergence of the Jacobian of the algorithm. We analyze it for gradient descent and stochastic gradient descent and derive convergence rates for the estimators in these cases. Our analysis is backed by numerical experiments on toy problems and on Wasserstein barycenter computation. Finally, we discuss the computational complexity of these estimators and give practical guidelines to chose between them.

convergence, estimator, gradient descent, (13 more...)

2002.03722

Country:

Europe > France (0.14)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
Europe > Spain (0.04)
Asia > Japan > Kyūshū & Okinawa > Okinawa (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.77)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.68)

Tzen, Belinda, Raginsky, Maxim

A mean-field theory of lazy training in two-layer neural nets: entropic regularization and controlled McKean-Vlasov dynamics

We consider the problem of universal approximation of functions by two-layer neural nets with random weights that are "nearly Gaussian" in the sense of Kullback-Leibler divergence. This problem is motivated by recent works on lazy training, where the weight updates generated by stochastic gradient descent do not move appreciably from the i.i.d. Gaussian initialization. We first consider the mean-field limit, where the finite population of neurons in the hidden layer is replaced by a continual ensemble, and show that our problem can be phrased as global minimization of a free-energy functional on the space of probability measures over the weights. This functional trades off the $L^2$ approximation risk against the KL divergence with respect to a centered Gaussian prior. We characterize the unique global minimizer and then construct a controlled nonlinear dynamics in the space of probability measures over weights that solves a McKean--Vlasov optimal control problem. This control problem is closely related to the Schr\"odinger bridge (or entropic optimal transport) problem, and its value is proportional to the minimum of the free energy. Finally, we show that SGD in the lazy training regime (which can be ensured by jointly tuning the variance of the Gaussian prior and the entropic regularization parameter) serves as a greedy approximation to the optimal McKean--Vlasov distributional dynamics and provide quantitative guarantees on the $L^2$ approximation error.

approximation, mckean-vlasov dynamic, probability, (16 more...)

2002.01987

Country:

North America > United States > Illinois (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
North America > United States > Rhode Island > Providence County > Providence (0.04)
(4 more...)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.86)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)

arXiv.org Machine LearningFeb-9-2020

Semi-Implicit Back Propagation

Liu, Ren, Zhang, Xiaoqun

Neural network has attracted great attention for a long time and many researchers are devoted to improve the effectiveness of neural network training algorithms. Though stochastic gradient descent (SGD) and other explicit gradient-based methods are widely adopted, there are still many challenges such as gradient vanishing and small step sizes, which leads to slow convergence and instability of SGD algorithms. Motivated by error back propagation (BP) and proximal methods, we propose a semi-implicit back propagation method for neural network training. Similar to BP, the difference on the neurons are propagated in a backward fashion and the parameters are updated with proximal mapping. The implicit update for both hidden neurons and parameters allows to choose large step size in the training algorithm. Finally, we also show that any fixed point of convergent sequences produced by this algorithm is a stationary point of the objective loss function. The experiments on both MNIST and CIFAR-10 demonstrate that the proposed semi-implicit BP algorithm leads to better performance in terms of both loss decreasing and training/validation accuracy, compared to SGD and a similar algorithm ProxBP.

gradient, neural network, proxbp, (16 more...)

2002.03516

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.92)