AITopics

1806.11382

Country:

Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
North America > United States > New York (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.84)

Industry: Leisure & Entertainment (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.46)

arXiv.org Machine LearningJun-28-2018

Direct Acceleration of SAGA using Sampled Negative Momentum

Zhou, Kaiwen

We focus on achieving very high accuracy for Problem (1), although for practical optimization tasks, such as supervised learning, low empirical risk may result in high generalization error. In this paper, we treat Problem (1) as a pure optimization problem. When F(·) in Problem (1) is strongly convex, traditional analysis shows that gradient descent (GD) yields a fast linear convergence rate but with a high per-iteration cost, and thus may not be suitable for problems with a very large n. As an alternative for large problems, SGD [Robbins and Monro, 1951] uses only one or a mini-batch of gradients in each iteration, and thus enjoys significantly lower per-iteration complexity than GD. However, due to the variance of gradient estimator, vanilla SGD is shown to yield only a sub-linear convergence rate. Recently, stochastic variance reduced methods (e.g., SAG [Roux et al., 2012], SVRG [Johnson and Zhang, 2013], SAGA [Defazio et al., 2014], and their proximal variants, such as [Schmidt et al., 2017], [Xiao and Zhang, 2014] and [Konečný et al., 2016]) were proposed to solve Problem (1). All these methods are equipped with various variance reduction techniques, which help them achieve low per-iteration complexities comparable with SGD and at the same time maintain the fast linear convergence rate of GD.

artificial intelligence, machine learning, saga, (16 more...)

1806.11048

Country:

Asia > Middle East > Jordan (0.04)
Asia > China > Hong Kong (0.04)

Genre: Research Report (0.65)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.36)

arXiv.org Machine LearningJun-27-2018

Empirical Risk Minimization and Stochastic Gradient Descent for Relational Data

Veitch, Victor, Austern, Morgane, Zhou, Wenda, Blei, David M., Orbanz, Peter

Empirical risk minimization is the principal tool for prediction problems, but its extension to relational data remains unsolved. We solve this problem using recent advances in graph sampling theory. We (i) define an empirical risk for relational data and (ii) obtain stochastic gradients for this risk that are automatically unbiased. The key ingredient is to consider the method by which data is sampled from a graph as an explicit component of model design. Theoretical results establish that the choice of sampling scheme is critical. By integrating fast implementations of graph sampling schemes with standard automatic differentiation tools, we are able to solve the risk minimization in a plug-and-play fashion even on large datasets. We demonstrate empirically that relational ERM models achieve state-of-the-art results on semi-supervised node classification tasks. The experiments also confirm the importance of the choice of sampling scheme.

artificial intelligence, machine learning, vertex, (19 more...)

1806.10701

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > France (0.04)
Africa > Senegal > Kolda Region > Kolda (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Alet, Ferran, Lozano-Pérez, Tomás, Kaelbling, Leslie P.

Modular meta-learning

In many situations, such as robot-learning, training experience is very expensive. One strategy for reducing the amount of training data needed for a new task is to learn some form of prior or bias using data from several related tasks. The objective of this process is to extract information that will substantially reduce the training-data requirements for a new task. This problem is a form of transfer learning, sometimes also called meta-learning or "learning to learn" [1, 2]. Previous approaches to meta-learning for robotics have focused on finding distributions over [3] or initial values of [4, 5] parameters, based on a set of "training tasks," that will enable a new "test task" to be learned with many fewer training examples. Our objective is similar, but rather than focusing on transferring information about parameter values, we focus on finding a reusable set of modules that can form components of a solution to a new task, possibly with a small amount of tuning. Modular approaches to learning have been very successful in structured tasks such as naturallanguage sentence interpretation [6], in which the input signal gives relatively direct information about a good structural decomposition of the problem. We wish to address problems that may benefit from a modular decomposition but do not provide any task-level input from which the structure of a solution may be derived. Nonetheless, we adopt a similar modular structure and parameteradaptation method for learning our reusable modules, but use a general-purpose simulated-annealing search strategy to find an appropriate structural decomposition for each new task.

artificial intelligence, machine learning, module, (17 more...)

1806.10166

Country:

North America > United States > New York > New York County > New York City (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
Europe > Germany > North Rhine-Westphalia > Upper Bavaria > Munich (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Transfer Learning (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

HaoChen, Jeffery Z., Sra, Suvrit

Random Shuffling Beats SGD after Finite Epochs

A long-standing problem in the theory of stochastic gradient descent (SGD) is to prove that its without-replacement version RandomShuffle converges faster than the usual with-replacement version. We present the first (to our knowledge) non-asymptotic solution to this problem, which shows that after a "reasonable" number of epochs RandomShuffle indeed converges faster than SGD. Specifically, we prove that under strong convexity and second-order smoothness, the sequence generated by RandomShuffle converges to the optimal solution at the rate O(1/T^2 + n^3/T^3), where n is the number of components in the objective, and T is the total number of iterations. This result shows that after a reasonable number of epochs RandomShuffle is strictly better than SGD (which converges as O(1/T)). The key step toward showing this better dependence on T is the introduction of n into the bound; and as our analysis will show, in general a dependence on n is unavoidable without further changes to the algorithm. We show that for sparse data RandomShuffle has the rate O(1/T^2), again strictly better than SGD. Furthermore, we discuss extensions to nonconvex gradient dominated functions, as well as non-strongly convex settings.

andomshuffle, artificial intelligence, machine learning, (14 more...)

1806.10077

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Russia (0.04)
(2 more...)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.66)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.65)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)

Arjevani, Yossi, Shamir, Ohad, Srebro, Nathan

A Tight Convergence Analysis for Stochastic Gradient Descent with Delayed Updates

We provide tight finite-time convergence bounds for gradient descent and stochastic gradient descent on quadratic functions, when the gradients are delayed and reflect iterates from $\tau$ rounds ago. First, we show that without stochastic noise, delays strongly affect the attainable optimization error: In fact, the error can be as bad as non-delayed gradient descent ran on only $1/\tau$ of the gradients. In sharp contrast, we quantify how stochastic noise makes the effect of delays negligible, improving on previous work which only showed this phenomenon asymptotically or for much smaller delays. Also, in the context of distributed optimization, the results indicate that the performance of gradient descent with delays is competitive with synchronous approaches such as mini-batching. Our results are based on a novel technique for analyzing convergence of optimization algorithms using generating functions.

artificial intelligence, gradient descent, machine learning, (15 more...)

1806.10188

Country:

North America > United States > Illinois > Cook County > Chicago (0.04)
North America > United States > New York (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
(2 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Qaadan, Sahar, Glasmachers, Tobias

Multi-Merge Budget Maintenance for Stochastic Gradient Descent SVM Training

Budgeted Stochastic Gradient Descent (BSGD) is a state-of-the-art technique for training large-scale kernelized support vector machines. The budget constraint is maintained incrementally by merging two points whenever the pre-defined budget is exceeded. The process of finding suitable merge partners is costly; it can account for up to 45% of the total training time. In this paper we investigate computationally more efficient schemes that merge more than two points at once. We obtain significant speed-ups without sacrificing accuracy.

artificial intelligence, machine learning, support vector, (12 more...)

1806.10179

Country: Europe > Germany (0.04)

Genre: Research Report (1.00)

Industry: Health & Medicine > Therapeutic Area > Psychiatry/Psychology (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Glasmachers, Tobias, Qaadan, Sahar

Speeding Up Budgeted Stochastic Gradient Descent SVM Training with Precomputed Golden Section Search

Limiting the model size of a kernel support vector machine to a pre-defined budget is a well-established technique that allows to scale SVM learning and prediction to large-scale data. Its core addition to simple stochastic gradient training is budget maintenance through merging of support vectors. This requires solving an inner optimization problem with an iterative method many times per gradient step. In this paper we replace the iterative procedure with a fast lookup. We manage to reduce the merging time by up to 65% and the total training time by 44% without any loss of accuracy.

artificial intelligence, machine learning, support vector, (11 more...)

1806.1018

Country: Europe > Germany (0.04)

Genre: Research Report > New Finding (0.93)

Industry: Health & Medicine > Therapeutic Area > Psychiatry/Psychology (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Smith, Samuel L., Duckworth, Daniel, Le, Quoc V., Sohl-Dickstein, Jascha

Stochastic natural gradient descent draws posterior samples in function space

arXiv.org Artificial IntelligenceJun-25-2018

Natural gradient descent (NGD) minimises the cost function on a Riemannian manifold whose metric is defined by the Fisher information. In this work, we prove that if the model predictions on the training set approach the true conditional distribution of labels given inputs, then the noise inherent in minibatch gradients causes the stationary distribution of NGD to approach a Bayesian posterior, whose temperature $T \approx \epsilon N/(2B)$ is controlled by the learning rate $\epsilon$, training set size $N$ and batch size $B$. The parameter-dependence of the Fisher metric introduces an implicit prior over the parameters, which we identify as the well-known Jeffreys prior. To support our claims, we show that the distribution of samples from NGD is close to the Laplace approximation to the posterior when $T = 1$. Furthermore, the test loss of ensembles drawn using NGD falls rapidly as we increase the batch size until $B \approx \epsilon N/2$, while above this point the test loss is constant or rises slowly.

artificial intelligence, machine learning, posterior, (15 more...)

arXiv.org Artificial Intelligence

1806.09597

Genre: Research Report > New Finding (0.94)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.87)

Zhou, Yichen, Hooker, Giles

Boulevard: Regularized Stochastic Gradient Boosted Trees and Their Limiting Distribution

arXiv.org Machine LearningJun-25-2018

This paper presents a theoretical study of gradient boosted trees (GBT: Friedman, 2001). Machine learning methods for prediction have generally been thought of as trading off both intelligibility and statistical uncertainty quantification in favor of accuracy. Recent results have started to provide a statistical understanding of methods based on ensembles of decision trees (Breiman et al., 1984). In particular, the consistency of methods related to Random Forests (RFs: Breiman, 2001) has been demonstrated in Biau (2012); Scornet et al. (2015) while Wager et al. (2014); Mentch and Hooker (2016); Wager and Athey (2017) and Athey et al. (2016) prove central limit theorems for RF predictions. These have then been used for tests of variable importance and nonparametric interactions in Mentch and Hooker (2017). In this paper, we extend this analysis to GBT. Analyses of RFs have relied on a subsampling structure to express the estimator in the form of a U-statistic from which central limit theorems can be derived. By contrast, GBT produces trees sequentially with the current tree depending on the values in those built previously, requiring a different analytical approach. While the algorithm proposed in Friedman (2001) is intended to be generally applicable to any loss function, in this paper we focus specifically on nonparametric regression (Stone, 1977, 1982).

artificial intelligence, imsart-ao ver, machine learning, (18 more...)

1806.09762

Country: North America > United States > New York > Tompkins County > Ithaca (0.04)

Genre: Research Report > New Finding (1.00)

Industry: Energy (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.51)