AITopics

1910.0872

Country: Asia > Middle East > Israel > Haifa District > Haifa (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.67)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

Malach, Eran, Shalev-Shwartz, Shai

Learning Boolean Circuits with Neural Networks

arXiv.org Machine Learning, Oct-25-2019

Training neural-networks is computationally hard. However, in practice they are trained efficiently using gradient-based algorithms, achieving remarkable performance on natural data. To bridge this gap, we observe the property of local correlation: correlation between small patterns of the input and the target label. We focus on learning deep neural-networks with a variant of gradient-descent, when the target function is a tree-structured Boolean circuit. We show that in this case, the existence of correlation between the gates of the circuit and the target label determines whether the optimization succeeds or fails. Using this result, we show that neural-networks can learn the (log n)-parity problem for most product distributions. These results hint that local correlation may play an important role in differentiating between distributions that are hard or easy to learn.
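As a toy illustration of local correlation (not the paper's construction), the sketch below computes the exact correlation between a single input bit and a parity label when the bits are independent and take values in {-1, +1}. The function name `bit_label_correlation` and the parameterization by per-bit means are assumptions made for this example. Under the uniform distribution the correlation vanishes, while a biased product distribution leaves a nonzero signal.

```python
from itertools import product


def bit_label_correlation(means, i):
    """Exact E[x_i * y] for y = prod_j x_j, with independent x_j in {-1, +1}
    and E[x_j] = means[j]. Enumerates all inputs, so keep len(means) small."""
    total = 0.0
    for x in product((-1, 1), repeat=len(means)):
        prob = 1.0
        label = 1
        for xj, mj in zip(x, means):
            prob *= (1 + xj * mj) / 2  # P(x_j = xj) for a +/-1 variable with mean mj
            label *= xj                # parity label as a product of +/-1 bits
        total += prob * x[i] * label
    return total


uniform = bit_label_correlation([0.0] * 4, 0)  # no single-bit correlation
biased = bit_label_correlation([0.5] * 4, 0)   # product of the other three means
```

Because x_i squared equals 1, the correlation reduces to the product of the means of the remaining bits, which is why it is zero exactly when some other bit is unbiased.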

boolean circuit, conference paper, (15 more...)

1910.11923

Country: Asia > Middle East > Israel > Jerusalem District > Jerusalem (0.04)

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.35)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.34)

arXiv.org Machine Learning, Oct-25-2019

Bias-Variance Tradeoff in a Sliding Window Implementation of the Stochastic Gradient Algorithm

Papo, Yakup Ceki

This paper provides a framework to analyze stochastic gradient algorithms in a mean squared error (MSE) sense using the asymptotic normality result of the stochastic gradient descent (SGD) iterates. We perform this analysis by taking the asymptotic normality result and applying it to the finite iteration case. Specifically, we look at problems where the gradient estimators are biased and have reduced variance, and compare the iterates generated by these gradient estimators to the iterates generated by the SGD algorithm. We use the work of Fabian to characterize the mean and the variance of the distribution of the iterates in terms of the bias and the covariance matrix of the gradient estimators. We introduce the sliding window SGD (SW-SGD) algorithm, with its proof of convergence, which incurs a lower MSE than the SGD algorithm on quadratic and convex problems. Lastly, we present some numerical results to show the effectiveness of this framework and the superiority of the SW-SGD algorithm over the SGD algorithm.
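The abstract does not spell out the SW-SGD update, so the following is only a sketch of one plausible reading: the step direction is the average of the most recent `window` stochastic gradients, some of which were evaluated at earlier iterates. That averaging makes the estimator biased but can reduce its variance, which is the tradeoff the abstract describes. The function name, its parameters, and the averaging rule itself are assumptions, not the paper's definition.

```python
from collections import deque

import numpy as np


def sliding_window_sgd(stoch_grad, x0, lr, window, n_iters, rng):
    """Assumed sliding-window variant of SGD: step along the mean of the
    last `window` stochastic gradient samples (hypothetical update rule)."""
    x = np.asarray(x0, dtype=float)
    recent = deque(maxlen=window)  # holds only the `window` newest samples
    for _ in range(n_iters):
        recent.append(stoch_grad(x, rng))
        x = x - lr * sum(recent) / len(recent)
    return x


def noisy_quadratic_grad(x, rng):
    # Gradient of 0.5 * (x - 3)^2 plus Gaussian noise (toy problem).
    return (x - 3.0) + 0.1 * rng.standard_normal()


x_final = sliding_window_sgd(noisy_quadratic_grad, 0.0, lr=0.05, window=5,
                             n_iters=2000, rng=np.random.default_rng(0))
```

Setting `window=1` recovers plain SGD, which gives a baseline for comparing the MSE of the two iterates on the same problem.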

algorithm, estimator, gradient estimator, (14 more...)

1910.11868

Country:

North America > United States > Maryland > Baltimore (0.04)
North America > United States > Georgia > Fulton County > Atlanta (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Sweden > Stockholm > Stockholm (0.04)

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Panigrahi, Abhishek, Somani, Raghav, Goyal, Navin, Netrapalli, Praneeth

Non-Gaussianity of Stochastic Gradient Noise

arXiv.org Machine Learning, Oct-25-2019

What enables Stochastic Gradient Descent (SGD) to achieve better generalization than Gradient Descent (GD) in Neural Network training? This question has attracted much attention. In this paper, we study the distribution of the Stochastic Gradient Noise (SGN) vectors during training. We observe that for batch sizes 256 and above, the distribution is best described as Gaussian, at least in the early phases of training. This holds across datasets, architectures, and other choices.
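To make the object of study concrete, the sketch below draws samples of the stochastic gradient noise, taken here to mean the mini-batch gradient minus the full-batch gradient, for a small least-squares problem. The model, the loss, and the function name `stochastic_gradient_noise` are assumptions chosen for brevity; the paper itself studies neural networks.

```python
import numpy as np


def stochastic_gradient_noise(X, y, w, batch_size, n_draws, rng):
    """Samples of (mini-batch gradient - full gradient) for the loss
    0.5 * mean((X @ w - y) ** 2), with batches drawn without replacement."""
    per_example = X * (X @ w - y)[:, None]  # per-example gradients, shape (n, d)
    full = per_example.mean(axis=0)         # full-batch gradient
    noise = np.empty((n_draws, X.shape[1]))
    for i in range(n_draws):
        idx = rng.choice(X.shape[0], size=batch_size, replace=False)
        noise[i] = per_example[idx].mean(axis=0) - full
    return noise


rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(1000)
noise = stochastic_gradient_noise(X, y, np.zeros(3), batch_size=256,
                                  n_draws=500, rng=rng)
```

Each row of `noise` is one noise vector; one-dimensional projections of these rows can then be passed to a normality test of the reader's choice.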

cifar10, gaussianity test experiment, mini batch-size, (10 more...)

1910.09626

Country:

Asia > India > Karnataka > Bengaluru (0.05)
Oceania > Australia > Western Australia > Perth (0.04)
North America > United States > Washington > King County > Seattle (0.04)
(4 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.85)

Preen, Richard J., Wilson, Stewart W., Bull, Larry

Autoencoding with XCSF

arXiv.org Artificial Intelligence, Oct-23-2019

Autoencoders enable data dimensionality reduction and are a key component of many (deep) learning systems. This article explores the use of the XCSF online evolutionary reinforcement learning system to perform autoencoding. Initial results, using a neural network representation and combining artificial evolution with stochastic gradient descent, suggest it is an effective approach to data reduction. The approach adaptively subdivides the input domain into local approximations that are simpler than a global neural network solution. By allowing the number of neurons in the autoencoders to evolve, this further enables the emergence of an ensemble of structurally heterogeneous solutions to cover the problem space. In this case, networks of differing complexity are typically seen to cover different areas of the problem space. Furthermore, the rate of gradient descent applied to each layer is tuned via self-adaptive mutation, thereby reducing the parameter optimisation task.

evolutionary computation, neural network, neuron, (14 more...)

arXiv.org Artificial Intelligence

1910.10579

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
North America > United States > New Jersey > Middlesex County > Piscataway (0.05)
Europe > Germany > Berlin (0.04)
(6 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.79)

#artificialintelligence, Oct-22-2019, 08:10:37 GMT

r/MachineLearning - [D] Machine Learning - WAYR (What Are You Reading) - Week 72

I've been idly wondering lately about the problem of identifying high value samples to obtain for improving models, which seems to get at something similar under uncertainty. It isn't necessarily going to be economical to do an exhaustive sampling of whatever you're interested in, but collecting a few strategic datapoints could be relatively affordable and help a lot with inference. I also was wondering if some kind of hypothesis falsification module could be stapled onto gradient descent algorithms somehow. In terms of simulated annealing, because that's mentally easier for me, the idea would be that we want the temperature of nonlocal jumps to be hotter when the machine is making failed guesses, and we want it to be cooler when the gradient is behaving like the falsification module expects. The motivation for this is just that for inference, a lot of the time it is easier to learn things if you go out of your way to test your assumptions. Just having those assumptions be consistent with your observations is only a weak test of their value.

falsification module, machine learning, machinelearning, (3 more...)

#artificialintelligence

Industry: Media > News (0.40)

Technology:

Information Technology > Communications > Social Media (0.76)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.62)

Faster Stochastic Algorithms via History-Gradient Aided Batch Size Adaptation

Ji, Kaiyi, Wang, Zhe, Zhou, Yi, Liang, Yingbin

Various schemes for adapting batch size have been recently proposed to accelerate stochastic algorithms. However, existing schemes either apply prescribed batch size adaptation or require additional backtracking and condition verification steps to exploit the information along the optimization path. In this paper, we propose an easy-to-implement scheme for adapting batch size by exploiting history stochastic gradients, based on which we propose the Adaptive batch size SGD (AbaSGD), AbaSVRG, and AbaSPIDER algorithms. To handle the dependence of the batch size on history stochastic gradients, we develop a new convergence analysis technique, and show that these algorithms achieve improved overall complexity over their vanilla counterparts. Moreover, their convergence rates are adaptive to the optimization landscape that the iterate experiences. Extensive experiments demonstrate that our algorithms substantially outperform existing competitive algorithms.
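The abstract does not give the batch-size formula, so the sketch below only illustrates the general idea of letting past stochastic gradients set the next batch size: when recent gradients are large, a small and noisy batch still yields progress, and when they shrink, a larger batch is requested. The inverse-proportional rule, the `scale` constant, and the function name are all assumptions, not the AbaSGD specification.

```python
import math


def adaptive_batch_size(history_sq_norms, scale, max_batch, min_batch=1):
    """Hypothetical rule: batch size inversely proportional to the average
    squared norm of recent stochastic gradients, clipped to a valid range."""
    avg = sum(history_sq_norms) / len(history_sq_norms)
    if avg == 0.0:
        return max_batch  # vanishing gradients: use the largest allowed batch
    proposed = math.ceil(scale / avg)
    return max(min_batch, min(max_batch, proposed))


early = adaptive_batch_size([4.0, 4.0], scale=64, max_batch=1000)  # large gradients
late = adaptive_batch_size([0.01], scale=64, max_batch=1000)       # small gradients
```

In a training loop, `history_sq_norms` would hold the squared norms of the last few update directions and `max_batch` would be the dataset size.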

algorithm, batch size, complexity, (14 more...)

1910.0967

Country:

North America > United States > Ohio (0.04)
North America > United States > Utah (0.04)
Asia > Middle East > Jordan (0.04)
(2 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)

Dong, Zhe, Seybold, Bryan A., Murphy, Kevin P., Bui, Hung H.

Collapsed Amortized Variational Inference for Switching Nonlinear Dynamical Systems

We propose an efficient inference method for switching nonlinear dynamical systems. The key idea is to learn an inference network which can be used as a proposal distribution for the continuous latent variables, while performing exact marginalization of the discrete latent variables. This allows us to use the reparameterization trick, and apply end-to-end training with stochastic gradient descent. We show that the proposed method can successfully segment time series data (including videos) into meaningful "regimes", by using the piece-wise nonlinear dynamics.
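Exact marginalization over discrete states is typically done with a forward recursion. The sketch below shows that recursion in log space, assuming the per-step log-likelihoods for each discrete state are already available and that the transition matrix does not change over time. These simplifications, along with the function and argument names, are assumptions for illustration; the paper's model may condition transitions on the continuous latent variables.

```python
import numpy as np


def log_marginal(log_init, log_trans, log_emit):
    """Log-probability of a sequence, summed over all discrete state paths.

    log_init:  (K,)   log p(s_0 = k)
    log_trans: (K, K) log p(s_t = j | s_{t-1} = i) at entry [i, j]
    log_emit:  (T, K) log-likelihood of step t under state k
    """
    alpha = log_init + log_emit[0]
    for t in range(1, log_emit.shape[0]):
        scores = alpha[:, None] + log_trans  # (previous state, next state)
        m = scores.max(axis=0)               # stabilizer for log-sum-exp
        alpha = m + np.log(np.exp(scores - m).sum(axis=0)) + log_emit[t]
    m = alpha.max()
    return m + np.log(np.exp(alpha - m).sum())
```

The recursion costs O(T K^2) instead of the O(K^T) needed to enumerate every state sequence, and every operation is differentiable, so in an autodiff framework gradients can flow through it.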

discrete state, inference network, preprint, (16 more...)

1910.09588

Country: Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)

Genre: Research Report (0.51)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.68)
(2 more...)

Malitsky, Yura, Mishchenko, Konstantin

Adaptive gradient descent without descent

Abstract: We present a strikingly simple proof that two rules are sufficient to automate gradient descent: 1) don't increase the stepsize too fast and 2) don't overstep the local curvature. By following these rules, you get a method adaptive to the local geometry, with convergence guarantees depending only on smoothness in a neighborhood of a solution. Given that the problem is convex, our method will converge even if the global smoothness constant is infinity. As an illustration, it can minimize an arbitrary continuously twice-differentiable convex function. We examine its performance on a range of convex and nonconvex problems, including matrix factorization and training of ResNet-18.

1 Introduction

Since the early days of optimization it was evident that there is a need for algorithms that are as independent from the user as possible. First-order methods have proven to be versatile and efficient in a wide range of applications, but one drawback has been present all that time: the stepsize. Despite certain success stories, line search procedures and adaptive online methods have not removed the need to manually tune the optimization parameters. Even in smooth convex optimization, which is often believed to be much simpler than the nonconvex counterpart, robust rules for stepsize selection have been elusive. The purpose of this work is to remedy this deficiency. The problem formulation that we consider is the basic unconstrained optimization problem $\min_{x \in \mathbb{R}^d} f(x)$, (1) where $f\colon \mathbb{R}^d \to \mathbb{R}$ is a differentiable function. Throughout the paper we assume that (1) has a solution and we denote its optimal value by $f_*$. The simplest and most known approach to this problem is the gradient descent method (GD), whose origin can be traced back to Cauchy [7,20]. Although it is probably the oldest optimization method, it continues to play a central role in modern algorithmic theory and applications. Its definition can be written in a mere one line, $x^{k+1} = x^k - \lambda \nabla f(x^k)$, $k \ge 0$, (2) where $x^0 \in \mathbb{R}^d$ is arbitrary and $\lambda > 0$.
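The two rules from the abstract can be sketched as a stepsize that is the minimum of a growth cap and a local curvature estimate. This excerpt does not state the exact constants, so the factor `sqrt(1 + theta)` in the growth cap, the factor 2 in the curvature estimate, the tiny initial step `lam0`, and all names below are assumptions rather than a confirmed transcription of the authors' algorithm.

```python
import numpy as np


def adaptive_gd(grad, x0, n_iters=500, lam0=1e-6):
    """Gradient descent with a stepsize set from past iterates (sketch)."""
    x_prev = np.asarray(x0, dtype=float)
    g_prev = grad(x_prev)
    lam_prev = lam0
    theta = np.inf  # first adaptive step is set by the curvature estimate alone
    x = x_prev - lam_prev * g_prev
    for _ in range(n_iters):
        g = grad(x)
        dx = np.linalg.norm(x - x_prev)
        dg = np.linalg.norm(g - g_prev)
        growth_cap = np.sqrt(1.0 + theta) * lam_prev            # rule 1: limit growth
        curvature_cap = dx / (2.0 * dg) if dg > 0 else np.inf   # rule 2: local curvature
        lam = min(growth_cap, curvature_cap)
        if not np.isfinite(lam):
            break  # gradient did not change, e.g. already at a stationary point
        x_prev, g_prev = x, g
        x = x - lam * g
        theta = lam / lam_prev
        lam_prev = lam
    return x


# Toy convex quadratic 0.5 * x^T A x - b^T x with minimizer A^{-1} b = (1, 0.1).
A = np.diag([1.0, 10.0])
b = np.array([1.0, 1.0])
x_star = adaptive_gd(lambda x: A @ x - b, np.zeros(2))
```

No smoothness constant is passed in: the ratio `dx / dg` plays the role of an inverse local Lipschitz estimate, which is what lets the stepsize in (2) be chosen without manual tuning.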

descent, inequality, stepsize, (15 more...)

1910.09529

Country:

North America > United States > California > Los Angeles County > Long Beach (0.04)
Europe > Switzerland > Vaud > Lausanne (0.04)
Asia > Middle East > Saudi Arabia > Mecca Province > Thuwal (0.04)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.91)

Candela, Rosa, Franzese, Giulio, Filippone, Maurizio, Michiardi, Pietro

Sparsification as a Remedy for Staleness in Distributed Asynchronous SGD

Large scale machine learning is increasingly relying on distributed optimization, whereby several machines contribute to the training process of a statistical model. While there exists a large literature on stochastic gradient descent (SGD) and variants, the study of countermeasures to mitigate problems arising in asynchronous distributed settings is still in its infancy. The key question of this work is whether sparsification, a technique predominantly used to reduce communication overheads, can also mitigate the staleness problem that affects asynchronous SGD. We study the role of sparsification both theoretically and empirically. Our theory indicates that, in an asynchronous, non-convex setting, the ergodic convergence rate of sparsified SGD matches the known result $\mathcal{O} \left( 1/\sqrt{T} \right)$ of non-convex SGD. We then carry out an empirical study to complement our theory and show that, in practice, sparsification consistently improves over vanilla SGD and current alternatives to mitigate the effects of staleness.
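For readers unfamiliar with gradient sparsification, the sketch below shows one common form, magnitude-based top-k selection, in which a worker transmits only the k largest-magnitude gradient entries and zeroes the rest. The abstract does not say which sparsification operator the paper uses, so the top-k choice and the function name are assumptions.

```python
import numpy as np


def top_k_sparsify(grad, k):
    """Keep the k largest-magnitude entries of `grad` and zero out the rest."""
    flat = grad.ravel()
    if k <= 0:
        return np.zeros_like(grad)
    if k >= flat.size:
        return grad.copy()
    keep = np.argpartition(np.abs(flat), -k)[-k:]  # indices of the k largest |g_i|
    out = np.zeros_like(flat)
    out[keep] = flat[keep]
    return out.reshape(grad.shape)


sparse = top_k_sparsify(np.array([0.1, -3.0, 2.0, 0.5]), k=2)
```

Only the kept indices and values need to be communicated, which is the source of the bandwidth savings the abstract refers to.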

gradient, sparsification, staleness, (17 more...)

1910.09466

Country:

North America > United States > Colorado > Broomfield County > Broomfield (0.04)
North America > United States > New York > New York County > New York City (0.04)
Europe > Sweden > Stockholm > Stockholm (0.04)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.72)