AITopics

1907.086

Genre: Research Report > New Finding (0.46)

Industry: Health & Medicine > Therapeutic Area > Neurology (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.57)

Baldassi, Carlo, Malatesta, Enrico M., Zecchina, Riccardo

On the geometry of solutions and on the capacity of multi-layer neural networks with ReLU activations

Rectified Linear Units (ReLU) have become the main model for the neural units in current deep learning systems. This choice was originally suggested as a way to compensate for the so-called vanishing gradient problem, which can undercut stochastic gradient descent (SGD) learning in networks composed of multiple layers. Here we provide analytical results on the effects of ReLUs on the capacity and on the geometrical landscape of the solution space in two-layer neural networks with either binary or real-valued weights. We study the problem of storing an extensive number of random patterns and find that, quite unexpectedly, the capacity of the network remains finite as the number of neurons in the hidden layer increases, at odds with the case of threshold units, in which the capacity diverges. Possibly more important, a large deviation approach allows us to find that the geometrical landscape of the solution space has a peculiar structure: while the majority of solutions are close in distance but still isolated, there exist rare regions of solutions that are much denser than the corresponding regions in the case of threshold units. These solutions are robust to perturbations of the weights and can tolerate large perturbations of the inputs. The analytical results are corroborated by numerical findings.

artificial intelligence, deep learning, machine learning

1907.07578

Country:

North America > United States > Florida > Broward County > Fort Lauderdale (0.04)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
Europe > Italy > Lombardy > Milan (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.54)

Meta-descent for Online, Continual Prediction

Jacobsen, Andrew, Schlegel, Matthew, Linke, Cameron, Degris, Thomas, White, Adam, White, Martha

This paper investigates different vector step-size adaptation approaches for non-stationary online, continual prediction problems. Vanilla stochastic gradient descent can be considerably improved by scaling the update with a vector of appropriately chosen step-sizes. Many methods, including AdaGrad, RMSProp, and AMSGrad, keep statistics about the learning process to approximate a second-order update, that is, a vector approximation of the inverse Hessian. Another family of approaches uses meta-gradient descent to adapt the step-size parameters to minimize prediction error. These meta-descent strategies are promising for non-stationary problems, but have not been as extensively explored as quasi-second-order methods. We first derive a general, incremental meta-descent algorithm, called AdaGain, designed to be applicable to a much broader range of algorithms, including those with semi-gradient updates or even those with accelerations, such as RMSProp. We provide an empirical comparison of methods from both families. We conclude that methods from both families can perform well, but in non-stationary prediction problems the meta-descent methods exhibit advantages. Our method is particularly robust across several prediction problems, and is competitive with the state-of-the-art method on a large-scale, time-series prediction problem on real data from a mobile robot.
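As a rough illustration of what "scaling the update with a vector of step-sizes" means, the sketch below applies one RMSProp-style update in plain Python. It is not the paper's AdaGain algorithm; the function name `rmsprop_step` and the default values for `base_step`, `decay`, and `eps` are assumptions chosen for the example.

```python
import math


def rmsprop_step(weights, grads, avg_sq, base_step=0.01, decay=0.9, eps=1e-8):
    """Apply one RMSProp-style update with a per-coordinate step size."""
    # Running average of squared gradients, one entry per weight.
    new_avg_sq = [decay * a + (1.0 - decay) * g * g for a, g in zip(avg_sq, grads)]
    # Each coordinate is divided by its own root-mean-square gradient,
    # so the effective step size differs from coordinate to coordinate.
    new_weights = [
        w - base_step * g / (math.sqrt(a) + eps)
        for w, g, a in zip(weights, grads, new_avg_sq)
    ]
    return new_weights, new_avg_sq


# Two coordinates whose gradients differ by a factor of 100.
weights, avg_sq = rmsprop_step([0.0, 0.0], [1.0, 100.0], [0.0, 0.0])
print(weights)
```

With a single scalar step size the second coordinate would move 100 times farther than the first; with the per-coordinate scaling both move by nearly the same amount on this first step.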

algorithm, artificial intelligence, machine learning

1907.07751

Country:

North America > Canada > Alberta (0.46)
North America > United States > Massachusetts (0.28)

Genre: Research Report (1.00)

Industry: Education > Educational Setting (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Learning Privately over Distributed Features: An ADMM Sharing Approach

Hu, Yaochen, Liu, Peng, Kong, Linglong, Niu, Di

Distributed machine learning has been widely studied in order to handle the exploding amount of data. In this paper, we study an important yet less visited distributed learning problem where features are inherently distributed or vertically partitioned among multiple parties, and sharing of raw data or model parameters among parties is prohibited due to privacy concerns. We propose an ADMM sharing framework to approach risk minimization over distributed features, where each party only needs to share a single value for each sample in the training process, thus minimizing the data leakage risk. We establish convergence and iteration complexity results for the proposed parallel ADMM algorithm under non-convex loss. We further introduce a novel differentially private ADMM sharing algorithm and bound the privacy guarantee with carefully designed noise perturbation. The experiments based on a prototype system show that the proposed ADMM algorithms converge efficiently in a robust fashion, demonstrating an advantage over gradient-based methods, especially for data sets with high-dimensional feature spaces.

algorithm, artificial intelligence, machine learning

1907.07735

Country: North America > Canada > Alberta (0.14)

Genre: Research Report (1.00)

Industry: Information Technology > Security & Privacy (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.48)

DeepSqueeze: Parallel Stochastic Gradient Descent with Double-Pass Error-Compensated Compression

Tang, Hanlin, Lian, Xiangru, Qiu, Shuang, Yuan, Lei, Zhang, Ce, Zhang, Tong, Liu, Ji

Communication is a key bottleneck in distributed training. Recently, an error-compensated compression technique was designed specifically for centralized learning and has been very successful, showing significant advantages over state-of-the-art compression-based methods in reducing communication cost. Since decentralized training has been shown to be superior to traditional centralized training in communication-restricted scenarios, a natural question is how to apply the error-compensated technique to decentralized learning to further reduce the communication cost. However, a trivial extension of compression-based centralized training algorithms does not exist for the decentralized scenario. A key difference between centralized and decentralized training makes this extension extremely non-trivial. In this paper, we propose an elegant algorithmic design, named DeepSqueeze, that employs error-compensated stochastic gradient descent in the decentralized scenario. Both theoretical analysis and an empirical study are provided to show that the proposed DeepSqueeze algorithm outperforms existing compression-based decentralized learning algorithms. To the best of our knowledge, this is the first application of error-compensated compression to decentralized learning.
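The general idea of error compensation can be sketched for a single worker as follows. This is not the double-pass DeepSqueeze scheme itself, only the basic mechanism it builds on: whatever the compressor drops is kept in a local residual and added back before the next compression. The top-k compressor and the names `top_k` and `compress_with_error_feedback` are assumptions chosen for the example.

```python
def top_k(vec, k):
    """Keep the k largest-magnitude entries of vec and zero out the rest."""
    keep = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)[:k]
    out = [0.0] * len(vec)
    for i in keep:
        out[i] = vec[i]
    return out


def compress_with_error_feedback(grad, residual, k):
    """Compress grad plus the carried-over residual; return (sent, new_residual)."""
    # Add back what earlier rounds failed to transmit.
    corrected = [g + r for g, r in zip(grad, residual)]
    sent = top_k(corrected, k)
    # Whatever was dropped this round is remembered for the next one.
    new_residual = [c - s for c, s in zip(corrected, sent)]
    return sent, new_residual


residual = [0.0, 0.0, 0.0]
sent_1, residual = compress_with_error_feedback([0.5, -2.0, 0.1], residual, k=1)
sent_2, residual = compress_with_error_feedback([0.5, 0.0, 0.1], residual, k=1)
print(sent_1, sent_2, residual)
```

In this toy run the first coordinate is dropped in round one but transmitted in round two once its accumulated value becomes the largest, so no gradient mass is lost, only delayed.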

algorithm, artificial intelligence, machine learning

1907.07346

Country:

North America > United States (0.46)
Europe (0.28)

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Liu, Suyun, Vicente, Luis Nunes

The stochastic multi-gradient algorithm for multi-objective optimization and its application to supervised machine learning

Optimization of conflicting functions is of paramount importance in decision making, and real world applications frequently involve data that is uncertain or unknown, resulting in multi-objective optimization (MOO) problems of stochastic type. We study the stochastic multi-gradient (SMG) method, seen as an extension of the classical stochastic gradient method for single-objective optimization. At each iteration of the SMG method, a stochastic multi-gradient direction is calculated by solving a quadratic subproblem, and it is shown that this direction is biased even when all individual gradient estimators are unbiased. We establish rates to compute a point in the Pareto front, of order similar to what is known for stochastic gradient in both convex and strongly convex cases. The analysis handles the bias in the multi-gradient and the unknown a priori weights of the limiting Pareto point. The SMG method is framed into a Pareto-front type algorithm for the computation of the entire Pareto front. The Pareto-front SMG algorithm is capable of robustly determining Pareto fronts for a number of synthetic test problems. One can apply it to any stochastic MOO problem arising from supervised machine learning, and we report results for logistic binary classification where multiple objectives correspond to distinct-sources data groups.
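For two objectives, the quadratic subproblem has a closed form, which the sketch below works through. It assumes the subproblem is the usual minimum-norm convex combination of the objective gradients; the abstract does not spell out its exact form, and the function name `min_norm_direction` is hypothetical.

```python
def min_norm_direction(g1, g2):
    """Return (lam, d) where d = -(lam*g1 + (1-lam)*g2) has minimum norm over lam in [0, 1]."""
    diff = [a - b for a, b in zip(g1, g2)]
    denom = sum(x * x for x in diff)
    if denom == 0.0:
        # Identical gradients: every convex combination is the same vector.
        lam = 0.5
    else:
        # Unconstrained minimizer of ||g2 + lam*(g1 - g2)||^2, clipped to [0, 1].
        lam = sum((b - a) * b for a, b in zip(g1, g2)) / denom
        lam = min(1.0, max(0.0, lam))
    direction = [-(lam * a + (1.0 - lam) * b) for a, b in zip(g1, g2)]
    return lam, direction


lam_a, dir_a = min_norm_direction([1.0, 0.0], [0.0, 1.0])  # conflicting objectives
lam_b, dir_b = min_norm_direction([1.0, 0.0], [2.0, 0.0])  # aligned objectives
print(lam_a, dir_a, lam_b, dir_b)
```

The clipping step is nonlinear in the gradients, which gives some intuition for the bias noted in the abstract: averaging the direction computed from noisy gradients is not the same as computing the direction from averaged gradients.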

artificial intelligence, machine learning, pareto front

1907.04472

Country:

North America > United States (0.28)
Europe (0.28)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.89)

arXiv.org Machine Learning Jul-16-2019

SGD momentum optimizer with step estimation by online parabola model

Duda, Jarek

In stochastic gradient descent, especially for neural network training, first-order methods currently dominate; these do not model the local distance to the minimum. The information required for an optimal step size is provided by second-order methods; however, they have many difficulties, starting with the full Hessian, whose number of coefficients is the square of the dimension. This article proposes a minimal step from the successful first-order momentum method toward second order: online parabola modelling in just a single direction, the normalized $\hat{v}$ from the momentum method. It is done by estimating the linear trend of the gradients $\vec{g}=\nabla F(\vec{\theta})$ in the $\hat{v}$ direction, such that $g(\vec{\theta}_\bot+\theta\hat{v})\approx \lambda (\theta -p)$ for $\theta = \vec{\theta}\cdot \hat{v}$, $g= \vec{g}\cdot \hat{v}$, $\vec{\theta}_\bot=\vec{\theta}-\theta\hat{v}$. Using linear regression, $\lambda$ and $p$ are MSE-estimated by just updating four averages (of $g$, $\theta$, $g\theta$, $\theta^2$) in the considered direction. Exponential moving averages allow for inexpensive online estimation here, weakening the contribution of old gradients. By controlling the sign of the curvature $\lambda$, we can repel from saddles, in contrast to the attraction in the standard Newton method. In the remaining directions, which are not considered in the second-order model, we can simultaneously perform, e.g., gradient descent.
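The regression step can be sketched in a few lines. The version below tracks the four exponential moving averages named in the abstract and recovers $\lambda$ and $p$ from the ordinary least-squares formulas for the line $g \approx \lambda(\theta - p)$. The class name `ParabolaModel`, the decay `beta`, and the extra normalizing weight `w` (used to remove the start-up bias of the averages) are assumptions of this sketch, not details taken from the article.

```python
class ParabolaModel:
    """Online fit of g ~ lam * (theta - p) along one direction using moving averages."""

    def __init__(self, beta=0.9):
        self.beta = beta
        self.w = 0.0  # total weight seen so far, used to normalize the averages
        self.s_g = self.s_t = self.s_gt = self.s_tt = 0.0

    def update(self, theta, g):
        b = self.beta
        self.w = b * self.w + (1.0 - b)
        self.s_g = b * self.s_g + (1.0 - b) * g
        self.s_t = b * self.s_t + (1.0 - b) * theta
        self.s_gt = b * self.s_gt + (1.0 - b) * g * theta
        self.s_tt = b * self.s_tt + (1.0 - b) * theta * theta

    def estimate(self):
        """Return (lam, p), or None while the fit is still degenerate."""
        if self.w == 0.0:
            return None
        m_g, m_t = self.s_g / self.w, self.s_t / self.w
        m_gt, m_tt = self.s_gt / self.w, self.s_tt / self.w
        var_t = m_tt - m_t * m_t
        if var_t <= 1e-12:
            return None  # all positions coincide, slope is undefined
        lam = (m_gt - m_g * m_t) / var_t  # slope = cov(g, theta) / var(theta)
        if lam == 0.0:
            return None  # flat gradient trend, no finite extremum
        return lam, m_t - m_g / lam  # p is where the fitted line crosses zero


model = ParabolaModel()
for theta in (0.0, 1.0, 2.0):
    model.update(theta, 2.0 * (theta - 3.0))  # gradient of the parabola (theta - 3)^2
lam, p = model.estimate()
print(lam, p)
```

On noise-free gradients of a parabola the fit recovers the curvature and the position of the minimum; a negative `lam` would signal negative curvature along the direction, which is the case the abstract handles by repelling from `p` rather than moving toward it.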

artificial intelligence, gradient, machine learning

1907.07063

Country: Europe > Poland (0.14)

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.78)

Xia, Shuhao, Shi, Yuanming

Learning One-hidden-layer neural networks via Provable Gradient Descent with Random Initialization

arXiv.org Machine Learning Jul-15-2019

Although deep learning has shown powerful performance in many applications, the mathematical principles behind neural networks are still mysterious. In this paper, we consider the problem of learning a one-hidden-layer neural network with quadratic activations. We focus on the under-parameterized regime where the number of hidden units is smaller than the dimension of the inputs. We propose to solve the problem via a provable gradient-based method with random initialization. For the non-convex neural network training problem, we reveal that the gradient descent iterates are able to enter a local region that enjoys strong convexity and smoothness within a few iterations, and then provably converge to a globally optimal model at a linear rate with near-optimal sample complexity. We further corroborate our theoretical findings via various experiments.

artificial intelligence, learning one-hidden-layer neural network, machine learning

1907.06594

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.89)

Saito, Shota, Shirakawa, Shinichi

Controlling Model Complexity in Probabilistic Model-Based Dynamic Optimization of Neural Network Structures

arXiv.org Machine Learning Jul-15-2019

A method of simultaneously optimizing both the structure of neural networks and the connection weights in a single training loop can reduce the enormous computational cost of neural architecture search. We focus on probabilistic model-based dynamic neural network structure optimization, which considers the probability distribution of structure parameters and simultaneously optimizes both the distribution parameters and the connection weights using gradient methods. Since the existing algorithm searches only for structures that minimize the training loss, it might find overly complicated structures. In this paper, we propose the introduction of a penalty term to control the model complexity of the obtained structures. We formulate a penalty term using the number of weights or units and derive its analytical natural gradient. The proposed method minimizes the objective function with the added penalty term using stochastic gradient descent. We apply the proposed method to the unit selection of a fully-connected neural network and the connection selection of a convolutional neural network. The experimental results show that the proposed method can control model complexity while maintaining performance.

artificial intelligence, machine learning, selection

1907.06341

Country: Asia > Japan > Honshū > Kantō (0.46)

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.54)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Cheng, Xiang, Yin, Dong, Bartlett, Peter L., Jordan, Michael I.

Quantitative $W_1$ Convergence of Langevin-Like Stochastic Processes with Non-Convex Potential State-Dependent Noise

arXiv.org Machine Learning Jul-13-2019

Stochastic Gradient Descent (SGD) is one of the workhorses of modern day machine learning. In many nonconvex optimization problems, such as training deep neural networks, SGD is able to produce solutions with good generalization error. Further, there is evidence that the generalization error of an SGD solution can be significantly better than Gradient Descent (GD) [12]. This suggests that, to understand the behavior of SGD, it is not enough to consider the limiting cases (such as small step-size or large batch-size), when it degenerates to GD. We take an alternate view of SGD as a sampling algorithm, and aim to understand its convergence to an appropriate stationary distribution.

artificial intelligence, inequality, machine learning

1907.03215

Country: Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.63)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.74)
Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)