AITopics | Gradient Descent

Gradient Descent

Mean Actor Critic

Asadi, Kavosh, Allen, Cameron, Roderick, Melrose, Mohamed, Abdel-rahman, Konidaris, George, Littman, Michael

arXiv.org Machine Learning | Sep-1-2017

We propose a new algorithm, Mean Actor-Critic (MAC), for discrete-action continuous-state reinforcement learning. MAC is a policy gradient algorithm that uses the agent's explicit representation of all action values to estimate the gradient of the policy, rather than using only the actions that were actually executed. This significantly reduces variance in the gradient updates and removes the need for a variance reduction baseline. We show empirical results on two control domains where MAC performs as well as or better than other policy gradient approaches, and on five Atari games, where MAC is competitive with state-of-the-art policy search algorithms.
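The contrast drawn in this abstract can be illustrated with a minimal sketch. It assumes a softmax policy over logits at a single state and a hypothetical table of action values `q`; the function names and toy numbers are illustrative and are not taken from the paper. The all-action gradient sums over every action weighted by its probability, while the sampled estimate uses only the one action that was drawn.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def all_action_gradient(logits, q):
    """Gradient of sum_a pi(a|s) * Q(s, a) w.r.t. the logits, using every action."""
    pi = softmax(logits)
    expected_q = sum(p * qa for p, qa in zip(pi, q))
    return [p * (qa - expected_q) for p, qa in zip(pi, q)]

def sampled_action_gradient(logits, q, rng):
    """Single-sample estimate Q(s, a) * grad log pi(a|s) for one sampled action a."""
    pi = softmax(logits)
    a = rng.choices(range(len(pi)), weights=pi)[0]
    return [q[a] * ((1.0 if k == a else 0.0) - pi[k]) for k in range(len(pi))]

# Hypothetical toy values for one state with three actions.
logits = [0.2, -0.1, 0.5]
q = [1.0, 0.0, 2.0]
rng = random.Random(0)

exact = all_action_gradient(logits, q)
one_sample = sampled_action_gradient(logits, q, rng)
```

Averaged over many draws, the sampled estimate approaches `exact`, but any individual draw varies with the action chosen; the all-action form has no such sampling noise over actions at a given state.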

artificial intelligence, machine learning, reinforcement learning, (13 more...)

arXiv.org Machine Learning

1709.00503

Country: North America > United States (0.28)

Genre: Research Report (0.82)

Industry: Leisure & Entertainment > Games > Computer Games (0.55)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

Hierarchical loss for classification

Wu, Cinna, Tygert, Mark, LeCun, Yann

arXiv.org Machine Learning | Sep-1-2017

Failing to distinguish between a sheepdog and a skyscraper should be worse and penalized more than failing to distinguish between a sheepdog and a poodle; after all, sheepdogs and poodles are both breeds of dogs. However, existing metrics of failure (so-called "loss" or "win") used in textual or visual classification/recognition via neural networks seldom view a sheepdog as more similar to a poodle than to a skyscraper. We define a metric that, inter alia, can penalize failure to distinguish between a sheepdog and a skyscraper more than failure to distinguish between a sheepdog and a poodle. Unlike previously employed possibilities, this metric is based on an ultrametric tree associated with any given tree organization into a semantically meaningful hierarchy of a classifier's classes.

artificial intelligence, hierarchy, machine learning, (16 more...)

arXiv.org Machine Learning

1709.01062

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.31)

Intuitive Machine Learning : Gradient Descent Simplified

@machinelearnbot | Aug-24-2017, 21:20:07 GMT

This article was written by Roopam Upadhyay. Roopam is a seasoned professional of advanced analytics with more than a decade of experience in statistical modeling, data science, predictive analytics, optimization, and business consulting. Machines learn the same way humans do. Humans learn from experience and so do machines. For machines, experience is in the form of data.

artificial intelligence, gradient descent simplified, intuitive machine learning, (9 more...)

@machinelearnbot

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.61)

Determinantal Point Processes for Mini-Batch Diversification

Zhang, Cheng, Kjellstrom, Hedvig, Mandt, Stephan

arXiv.org Machine Learning | Aug-23-2017

We study a mini-batch diversification scheme for stochastic gradient descent (SGD). While classical SGD relies on uniformly sampling data points to form a mini-batch, we propose a non-uniform sampling scheme based on the Determinantal Point Process (DPP). The DPP relies on a similarity measure between data points and gives low probabilities to mini-batches which contain redundant data, and higher probabilities to mini-batches with more diverse data. This simultaneously balances the data and leads to stochastic gradients with lower variance. We term this approach Diversified Mini-Batch SGD (DM-SGD). We show that regular SGD and a biased version of stratified sampling emerge as special cases. Furthermore, DM-SGD generalizes stratified sampling to cases where no discrete features exist to bin the data into groups. We show experimentally that our method results in more interpretable and diverse features in unsupervised setups, and in better classification accuracies in supervised setups.
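A small sketch can show why a DPP favors diverse mini-batches. In the L-ensemble form of a DPP, a subset S is selected with probability proportional to det(L_S), the determinant of the similarity kernel restricted to S. The example below assumes an RBF similarity on toy one-dimensional points; the kernel choice, bandwidth, and data are illustrative assumptions, not the setup used in the paper, and the code only scores batches rather than sampling from the DPP.

```python
import math

def rbf_kernel(x, y, bandwidth=1.0):
    """Assumed similarity measure: Gaussian (RBF) kernel on scalars."""
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))

def determinant(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-15:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det

def unnormalized_dpp_score(batch, bandwidth=1.0):
    """det(L_S) for the kernel submatrix indexed by the mini-batch S."""
    kernel = [[rbf_kernel(x, y, bandwidth) for y in batch] for x in batch]
    return determinant(kernel)

# Toy mini-batches: three nearly identical points vs. three well-separated points.
redundant_batch = [0.0, 0.05, 0.1]
diverse_batch = [0.0, 2.0, 4.0]

redundant_score = unnormalized_dpp_score(redundant_batch)
diverse_score = unnormalized_dpp_score(diverse_batch)
```

Near-duplicate points make the kernel submatrix almost singular, so `redundant_score` is close to zero, while the well-separated batch gives a kernel close to the identity and a much larger score.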

artificial intelligence, dataset, machine learning, (19 more...)

arXiv.org Machine Learning

1705.00607

Country: North America > United States (0.46)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.91)

Stochastic Gradient Descent for Relational Logistic Regression via Partial Network Crawls

Yang, Jiasen, Ribeiro, Bruno, Neville, Jennifer

arXiv.org Machine Learning | Aug-20-2017

Research in statistical relational learning has produced a number of methods for learning relational models from large-scale network data. While these methods have been successfully applied in various domains, they have been developed under the unrealistic assumption of full data access. In practice, however, the data are often collected by crawling the network, due to proprietary access, limited resources, and privacy concerns. Recently, we showed that the parameter estimates for relational Bayes classifiers computed from network samples collected by existing network crawlers can be quite inaccurate, and developed a crawl-aware estimation method for such models (Yang, Ribeiro, and Neville, 2017). In this work, we extend the methodology to learning relational logistic regression models via stochastic gradient descent from partial network crawls, and show that the proposed method yields accurate parameter estimates and confidence intervals.

artificial intelligence, confidence interval, machine learning, (16 more...)

arXiv.org Machine Learning

1707.07716

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry: Information Technology (0.87)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.91)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.34)

Weighted parallel SGD for distributed unbalanced-workload training system

Daning, Cheng, Shigang, Li, Yunquan, Zhang

arXiv.org Machine Learning | Aug-16-2017

Stochastic gradient descent (SGD) is a popular stochastic optimization method in machine learning. Traditional parallel SGD algorithms, e.g., SimuParallel SGD [1], often require all nodes to have the same performance or to consume equal quantities of data. However, these requirements are difficult to satisfy when the parallel SGD algorithms run in a heterogeneous computing environment; low-performance nodes will exert a negative influence on the final result. In this paper, we propose an algorithm called weighted parallel SGD (WP-SGD). WP-SGD combines weighted model parameters from different nodes in the system to produce the final output. WP-SGD makes use of the reduction in standard deviation to compensate for the loss from the inconsistency in performance of nodes in the cluster, which means that WP-SGD does not require that all nodes consume equal quantities of data. We also analyze the theoretical feasibility of running two other parallel SGD algorithms combined with WP-SGD in a heterogeneous environment. The experimental results show that WP-SGD significantly outperforms the traditional parallel SGD algorithms on distributed training systems with an unbalanced workload. The training process in machine learning can essentially be treated as solving a stochastic optimization problem.
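The step of combining weighted model parameters from different nodes can be sketched as a weighted average. The abstract does not state how the weights are chosen, so the example below simply assumes weights proportional to the number of samples each node consumed; that weighting, the function name `combine_weighted`, and the toy values are illustrative assumptions rather than the scheme derived in the paper.

```python
def combine_weighted(parameters_per_node, node_weights):
    """Weighted average of per-node parameter vectors (weights are normalized here)."""
    total = sum(node_weights)
    if total <= 0:
        raise ValueError("node weights must sum to a positive value")
    dim = len(parameters_per_node[0])
    return [
        sum(w * params[i] for w, params in zip(node_weights, parameters_per_node)) / total
        for i in range(dim)
    ]

# Hypothetical two-node cluster: the fast node processed three times as much data.
node_parameters = [[1.0, 2.0], [3.0, 6.0]]
samples_consumed = [300, 100]  # assumed weighting, not the paper's

combined = combine_weighted(node_parameters, samples_consumed)
```

With equal weights this reduces to plain parameter averaging; unequal weights let a node that saw more data contribute more to the final model.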

artificial intelligence, machine learning, wp-sgd, (15 more...)

arXiv.org Machine Learning

1708.04801

Country: North America > Canada > British Columbia (0.14)

Genre: Research Report (0.84)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)

A Fixed-Point of View on Gradient Methods for Big Data

Jung, Alexander

arXiv.org Machine Learning | Aug-15-2017

Interpreting gradient methods as fixed-point iterations, we provide a detailed analysis of those methods for minimizing convex objective functions. Due to their conceptual and algorithmic simplicity, gradient methods are widely used in machine learning for massive data sets (big data). In particular, stochastic gradient methods are considered the de facto standard for training deep neural networks. Studying gradient methods within the realm of fixed-point theory provides us with powerful tools to analyze their convergence properties. In particular, gradient methods using inexact or noisy gradients, such as stochastic gradient descent, can be studied conveniently using well-known results on inexact fixed-point iterations. Moreover, as we demonstrate in this paper, the fixed-point approach allows an elegant derivation of accelerations for basic gradient methods. In particular, we will show how gradient descent can be accelerated by a fixed-point preserving transformation of an operator associated with the objective function.
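The fixed-point reading can be made concrete with a short sketch. A gradient step x ← x − α∇f(x) is one application of the operator T(x) = x − α∇f(x), and a minimizer of a differentiable convex f is exactly a point where T(x) = x. The toy quadratic objective, step size, and function names below are assumptions chosen for illustration and do not come from the paper.

```python
def gradient(x):
    """Gradient of the assumed toy objective f(x) = 0.5 * (x - 3)^2."""
    return x - 3.0

def gradient_step_operator(x, step_size=0.5):
    """T(x) = x - step_size * grad f(x); its fixed points are the minimizers of f."""
    return x - step_size * gradient(x)

def fixed_point_iteration(operator, x0, tol=1e-10, max_iters=1000):
    """Repeatedly apply the operator until successive iterates stop moving."""
    x = x0
    for iteration in range(max_iters):
        x_next = operator(x)
        if abs(x_next - x) < tol:
            return x_next, iteration + 1
        x = x_next
    return x, max_iters

minimizer, iterations = fixed_point_iteration(gradient_step_operator, x0=0.0)
```

For this objective the operator is a contraction with factor 0.5, so the distance to the fixed point x = 3 halves at every step, which is the kind of convergence statement fixed-point theory makes available.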

artificial intelligence, iteration, machine learning, (17 more...)

arXiv.org Machine Learning

1706.0988

Country:

North America > United States (0.46)
Europe (0.28)

Genre:

Instructional Material (0.46)
Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.95)

Simulated Annealing with Levy Distribution for Fast Matrix Factorization-Based Collaborative Filtering

Shehata, Mostafa A., Nassef, Mohammad, Badr, Amr A.

arXiv.org Machine Learning | Aug-9-2017

Matrix factorization is one of the best approaches for collaborative filtering because of its high accuracy in presenting users' and items' latent factors. The main disadvantages of matrix factorization are its complexity and the difficulty of parallelizing it, especially for very large matrices. In this paper, we introduce a new method for collaborative filtering based on matrix factorization that combines simulated annealing with the Lévy distribution. With this method, good solutions are achieved in acceptable time and with little computation compared to other methods such as stochastic gradient descent, alternating least squares, and weighted non-negative matrix factorization.

artificial intelligence, machine learning, matrix, (14 more...)

arXiv.org Machine Learning

1708.02867

Country:

Europe (0.28)
Africa > Middle East > Egypt (0.28)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

When Not to Use Deep Learning

@machinelearnbot | Aug-6-2017, 03:40:05 GMT

There is also an aspect of deep learning models that I see gets sort of lost in translation when coming from other fields of machine learning. Most tutorials and introductory material on deep learning describe these models as composed of hierarchically connected layers of nodes, where the first layer is the input and the last layer is the output, and note that you can train them using some form of stochastic gradient descent. After maybe some brief mentions of how stochastic gradient descent works and what backpropagation is, the bulk of the explanation focuses on the rich landscape of neural network types (convolutional, recurrent, etc.). The optimization methods themselves receive little additional attention, which is unfortunate since it's likely that a big (if not the biggest) part of why deep learning works is because of those particular methods (check out, e.g., this post from Ferenc Huszár and this paper taken from that post), and knowing how to optimize their parameters and how to partition data to use them effectively is crucial to get good convergence in a reasonable amount of time. Exactly why stochastic gradients matter so much is still unknown, but some clues are emerging here and there.

artificial intelligence, machine learning, use deep learning, (4 more...)

@machinelearnbot

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Identifying global optimality for dictionary learning

Le, Lei, White, Martha

arXiv.org Machine Learning | Aug-6-2017

Learning new representations of input observations in machine learning is often tackled using a factorization of the data. For many such problems, including sparse coding and matrix completion, learning these factorizations can be difficult, both in terms of efficiency and in guaranteeing that the solution is a global minimum. Recently, a general class of objectives, which we term induced dictionary learning models (DLMs), has been introduced; these objectives have an induced convex form that enables global optimization. Though attractive theoretically, this induced form is impractical, particularly for large or growing datasets. In this work, we investigate the use of practical alternating minimization algorithms for induced DLMs that ensure convergence to global optima. We characterize the stationary points of these models and, using these insights, highlight practical choices for the objectives. We then provide theoretical and empirical evidence that alternating minimization, from a random initialization, converges to global minima for a large subclass of induced DLMs. In particular, we take advantage of the existence of the (potentially unknown) convex induced form to identify when stationary points are global minima for the dictionary learning objective. We then provide an empirical investigation into practical optimization choices for using alternating minimization for induced DLMs, for both batch and stochastic gradient descent.

artificial intelligence, machine learning, regularizer, (16 more...)

arXiv.org Machine Learning

1604.04942

Country: North America (0.46)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.69)
