AITopics | Gradient Descent

Collaborating Authors

Gradient Descent

News Overviews Instructional Materials AI-Alerts Classics

Probabilistic low rank matrix completion on finite alphabets

Neural Information Processing SystemsMar-13-2024, 14:02:31 GMT

The task of reconstructing a matrix given a sample of observed entries is known as the matrix completion problem. It arises in a wide range of problems, including recommender systems, collaborative filtering, dimensionality reduction, image processing, quantum physics or multi-class classification to name a few. Most works have focused on recovering an unknown real-valued low-rank matrix from randomly sub-sampling its entries. Here, we investigate the case where the observations take a finite number of values, corresponding for examples to ratings in recommender systems or labels in multi-class classification. We also consider a general sampling scheme (not necessarily uniform) over the matrix entries. The performance of a nuclear-norm penalized estimator is analyzed theoretically. More precisely, we derive bounds for the Kullback-Leibler divergence between the true and estimated distributions. In practice, we have also proposed an efficient algorithm based on lifted coordinate gradient descent in order to tackle potentially high dimensional settings.

completion, matrix, matrix completion, (14 more...)

Neural Information Processing Systems

Country:

North America > United States > New York (0.04)
North America > United States > Maryland > Baltimore (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

Add feedback

Stochastic Gradient Descent, Weighted Sampling, and the Randomized Kaczmarz algorithm

Neural Information Processing SystemsMar-13-2024, 14:01:18 GMT

We improve a recent guarantee of Bach and Moulines on the linear convergence of SGD for smooth and strongly convex objectives, reducing a quadratic dependence on the strong convexity to a linear dependence. Furthermore, we show how reweighting the sampling distribution (i.e.

convergence, dependence, linear dependence, (16 more...)

Neural Information Processing Systems

Country:

North America > United States > Texas > Travis County > Austin (0.04)
North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
North America > United States > Illinois > Cook County > Chicago (0.04)
North America > United States > California > Los Angeles County > Claremont (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.87)

Add feedback

Smoothed Gradients for Stochastic Variational Inference

Neural Information Processing SystemsMar-13-2024, 13:22:50 GMT

Stochastic variational inference (SVI) lets us scale up Bayesian computation to massive data. It uses stochastic optimization to fit a variational distribution, following easy-to-compute noisy natural gradients. As with most traditional stochastic optimization methods, SVI takes precautions to use unbiased stochastic gradients whose expectations are equal to the true gradients. In this paper, we explore the idea of following biased stochastic gradients in SVI. Our method replaces the natural gradient with a similarly constructed vector that uses a fixed-window moving average of some of its previous terms. We will demonstrate the many advantages of this technique. First, its computational cost is the same as for SVI and storage requirements only multiply by a constant factor. Second, it enjoys significant variance reduction over the unbiased estimates, smaller bias than averaged gradients, and leads to smaller mean-squared error against the full gradient. We test our method on latent Dirichlet allocation with three large corpora.

gradient, stochastic gradient, sufficient statistics, (11 more...)

Neural Information Processing Systems

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > New York (0.04)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.72)

Add feedback

Stochastic Proximal Gradient Descent with Acceleration Techniques

Neural Information Processing SystemsMar-13-2024, 13:05:42 GMT

Proximal gradient descent (PGD) and stochastic proximal gradient descent (SPGD) are popular methods for solving regularized risk minimization problems in machine learning and statistics. In this paper, we propose and analyze an accelerated variant of these methods in the mini-batch setting. This method incorporates two acceleration techniques: one is Nesterov's acceleration method, and the other is a variance reduction for the stochastic gradient. Accelerated proximal gradient descent (APG) and proximal stochastic variance reduction gradient (Prox-SVRG) are in a trade-off relationship. We show that our method, with the appropriate mini-batch size, achieves lower overall complexity than both APG and Prox-SVRG.

acc-prox-svrg, complexity, gradient descent, (10 more...)

Neural Information Processing Systems

Country: Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Add feedback

Discriminative Metric Learning by Neighborhood Gerrymandering

Neural Information Processing SystemsMar-13-2024, 12:00:49 GMT

We formulate the problem of metric learning for k nearest neighbor classification as a large margin structured prediction problem, with a latent variable representing the choice of neighbors and the task loss directly corresponding to classification error. We describe an efficient algorithm for exact loss augmented inference, and a fast gradient descent algorithm for learning in this model. The objective drives the metric to establish neighborhood boundaries that benefit the true class labels for the training points. Our approach, reminiscent of gerrymandering (redrawing of political boundaries to provide advantage to certain parties), is more direct in its handling of optimizing classification accuracy than those previously proposed. In experiments on a variety of data sets our method is shown to achieve excellent results compared to current state of the art in metric learning.

inference, learning, neighbor, (14 more...)

Neural Information Processing Systems

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > United States > Illinois > Cook County > Chicago (0.04)
Asia > Middle East > Jordan (0.04)

Industry:

Government > Regional Government > North America Government > United States Government (0.71)
Government > Voting & Elections (0.61)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Nearest Neighbor Methods (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.49)

Add feedback

Large-Margin Convex Polytope Machine

Neural Information Processing SystemsMar-13-2024, 12:00:14 GMT

We present the Convex Polytope Machine (CPM), a novel non-linear learning algorithm for large-scale binary classification tasks. The CPM finds a large margin convex polytope separator which encloses one class. We develop a stochastic gradient descent based algorithm that is amenable to massive datasets, and augment it with a heuristic procedure to avoid sub-optimal local minima. Our experimental evaluations of the CPM on large-scale datasets from distinct domains (MNIST handwritten digit recognition, text topic, and web security) demonstrate that the CPM trains models faster, sometimes several orders of magnitude, than state-ofthe-art similar approaches and kernel-SVM methods while achieving comparable or better classification performance. Our empirical results suggest that, unlike prior similar approaches, we do not need to control the number of sub-classifiers (sides of the polytope) to avoid overfitting.

algorithm, classifier, dataset, (15 more...)

Neural Information Processing Systems

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
North America > United States > New York > New York County > New York City (0.05)
North America > United States > Virginia > Arlington County > Arlington (0.04)

Genre: Research Report > New Finding (0.48)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.56)

Add feedback

Scalable Kernel Methods via Doubly Stochastic Gradients

Neural Information Processing SystemsMar-13-2024, 11:01:24 GMT

The general perception is that kernel methods are not scalable, so neural nets become the choice for large-scale nonlinear learning problems. Have we tried hard enough for kernel methods? In this paper, we propose an approach that scales up kernel methods using a novel concept called "doubly stochastic functional gradients". Based on the fact that many kernel methods can be expressed as convex optimization problems, our approach solves the optimization problems by making two unbiased stochastic approximations to the functional gradient--one using random training points and another using random features associated with the kernel--and performing descent steps with this noisy functional gradient. Our algorithm is simple, need no commit to a preset number of random features, and allows the flexibility of the function class to grow as we see more incoming data in the streaming setting. We demonstrate that a function learned by this procedure after t iterations converges to the optimal function in the reproducing kernel Hilbert space in rate O(1/t), and achieves a generalization bound of O(1/ t). Our approach can readily scale kernel methods up to the regimes which are dominated by neural nets. We show competitive performances of our approach as compared to neural nets in datasets such as 2.3 million energy materials from MolecularSpace, 8 million handwritten digits from MNIST, and 1 million photos from ImageNet using convolution features.

algorithm, functional gradient, kernel method, (16 more...)

Neural Information Processing Systems

Country:

North America > Canada > Ontario > Toronto (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)

Genre: Research Report (0.35)

Industry: Education (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Kernel Methods (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.52)

Add feedback

The limits of squared Euclidean distance regularization

Neural Information Processing SystemsMar-13-2024, 11:00:55 GMT

Some of the simplest loss functions considered in Machine Learning are the square loss, the logistic loss and the hinge loss. The most common family of algorithms, including Gradient Descent (GD) with and without Weight Decay, always predict with a linear combination of the past instances. We give a random construction for sets of examples where the target linear weight vector is trivial to learn but any algorithm from the above family is drastically sub-optimal. Our lower bound on the latter algorithms holds even if the algorithms are enhanced with an arbitrary kernel function. This type of result was known for the square loss. However, we develop new techniques that let us prove such hardness results for any loss function satisfying some minimal requirements on the loss function (including the three listed above). We also show that algorithms that regularize with the squared Euclidean distance are easily confused by random features. Finally, we conclude by discussing related open problems regarding feed forward neural networks. We conjecture that our hardness results hold for any training algorithm that is based on the squared Euclidean distance regularization (i.e.

algorithm, linear combination, matrix, (14 more...)

Neural Information Processing Systems

Country:

North America > United States > California > Santa Cruz County > Santa Cruz (0.14)
Europe > United Kingdom > England > Greater London > London (0.04)
North America > United States > Oregon > Multnomah County > Portland (0.04)
(3 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.35)

Add feedback

Online and Stochastic Gradient Methods for Non-decomposable Loss Functions Microsoft Research, INDIA

Neural Information Processing SystemsMar-13-2024, 09:55:57 GMT

Modern applications in sensitive domains such as biometrics and medicine frequently require the use of non-decomposable loss functions such as precision@k, F-measure etc. Compared to point loss functions such as hinge-loss, these offer much more fine grained control over prediction, but at the same time present novel challenges in terms of algorithm design and analysis. In this work we initiate a study of online learning techniques for such non-decomposable loss functions with an aim to enable incremental learning as well as design scalable solvers for batch problems. To this end, we propose an online learning framework for such loss functions. Our model enjoys several nice properties, chief amongst them being the existence of efficient online learning algorithms with sublinear regret and online to batch conversion bounds. Our model is a provable extension of existing online learning models for point loss functions.

loss function, non-decomposable loss function, prec, (15 more...)

Neural Information Processing Systems

Country:

South America > Paraguay > Asunción > Asunción (0.04)
North America > United States > Massachusetts > Middlesex County > Belmont (0.04)
North America > United States > California > Orange County > Irvine (0.04)
Asia > India > Karnataka > Bengaluru (0.04)

Industry: Health & Medicine (0.94)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.84)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.69)

Add feedback

Delay-Tolerant Algorithms for Asynchronous Distributed Online Learning

Neural Information Processing SystemsMar-13-2024, 09:00:14 GMT

We analyze new online gradient descent algorithms for distributed systems with large delays between gradient computations and the corresponding updates. Using insights from adaptive gradient methods, we develop algorithms that adapt not only to the sequence of gradients, but also to the precise update delays that occur. We first give an impractical algorithm that achieves a regret bound that precisely quantifies the impact of the delays. We then analyze AdaptiveRevision, an algorithm that is efficiently implementable and achieves comparable guarantees. The key algorithmic technique is appropriately and efficiently revising the learning rate used for previous gradient steps. Experimental results show when the delays grow large (1000 updates or more), our new algorithms perform significantly better than standard adaptive gradient methods.

adaptiverevision, algorithm, gradient, (16 more...)

Neural Information Processing Systems

Country:

North America > United States > Washington > King County > Seattle (0.04)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (0.48)

Industry: Education > Educational Setting > Online (0.51)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.50)

Add feedback