AITopics

1901.07592

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.14)
Europe > Italy > Lazio > Rome (0.04)
North America > United States > Oregon > Multnomah County > Portland (0.04)
(4 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.90)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.54)

Gouk, Henry, Pfahringer, Bernhard, Frank, Eibe

Stochastic Gradient Trees

arXiv.org Machine Learning, Jan-23-2019

We present an online algorithm that induces decision trees using gradient information as the source of supervision. In contrast to previous approaches to gradient-based tree learning, we do not require soft splits or construction of a new tree for every update. In experiments, our method performs comparably to standard incremental classification trees and outperforms state of the art incremental regression trees. We also show how the method can be used to construct a novel type of neural network layer suited to learning representations from tabular data and find that it increases accuracy of multiclass and multi-label classification.

leaf node, loss function, neural network, (15 more...)

1901.07777

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
North America > United States > New York > New York County > New York City (0.04)
Europe > Switzerland > Zürich > Zürich (0.04)
(8 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.51)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

#artificialintelligence, Jan-22-2019, 18:59:19 GMT

How to Configure the Learning Rate Hyperparameter When Training Deep Learning Neural Networks Plow

The weights of a neural network cannot be calculated using an analytical method. Instead, the weights must be discovered via an empirical optimization procedure called stochastic gradient descent.

artificial intelligence, deep learning neural network plow, machine learning, (2 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

#artificialintelligence, Jan-22-2019, 04:38:16 GMT

Difference between Batch Gradient Descent and Stochastic Gradient Descent

Now, what was the Gradient Descent algorithm? The algorithm above says that, to perform GD, we need to calculate the gradient of the cost function J. To calculate that gradient, we need to sum (yellow circle!) the cost of each sample. With 3 million samples, we have to loop 3 million times or use the dot product. Do you see np.dot(X.T, y_hat-y) above?
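To make the contrast concrete, here is a minimal NumPy sketch. It assumes a logistic-regression cost, which the excerpt does not confirm; the function names and the toy data are illustrative and not taken from the article. The batch gradient touches every sample through np.dot(X.T, y_hat - y), while the stochastic gradient uses a single sample.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def batch_gradient(X, y, w):
    """Gradient of the mean logistic cost over ALL samples, via one dot product."""
    y_hat = sigmoid(X @ w)
    return np.dot(X.T, y_hat - y) / len(y)

def stochastic_gradient(X, y, w, i):
    """Gradient estimate that uses only sample i."""
    y_hat_i = sigmoid(X[i] @ w)
    return X[i] * (y_hat_i - y[i])

# Toy data (assumed for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = (sigmoid(X @ true_w) > 0.5).astype(float)

w = np.zeros(3)
g_batch = batch_gradient(X, y, w)

# Averaging the per-sample gradients over the whole dataset
# reproduces the batch gradient.
g_avg = np.mean(
    [stochastic_gradient(X, y, w, i) for i in range(len(y))], axis=0
)
```

Each stochastic gradient is a noisy but cheap estimate of the batch gradient, which is why SGD can take many more updates per pass over the data.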

artificial intelligence, gradient descent, machine learning, (5 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Nguyen, Thanh Huy, Şimşekli, Umut, Richard, Gaël

Non-Asymptotic Analysis of Fractional Langevin Monte Carlo for Non-Convex Optimization

arXiv.org Machine Learning, Jan-22-2019

Recent studies on diffusion-based sampling methods have shown that Langevin Monte Carlo (LMC) algorithms can be beneficial for non-convex optimization, and rigorous theoretical guarantees have been proven for both asymptotic and finite-time regimes. Algorithmically, LMC-based algorithms resemble the well-known gradient descent (GD) algorithm, where the GD recursion is perturbed by an additive Gaussian noise whose variance has a particular form. Fractional Langevin Monte Carlo (FLMC) is a recently proposed extension of LMC, where the Gaussian noise is replaced by a heavy-tailed $\alpha$-stable noise. As opposed to their Gaussian counterpart, these heavy-tailed perturbations can incur large jumps, and it has been empirically demonstrated that the choice of $\alpha$-stable noise can provide several advantages in modern machine learning problems, both in optimization and sampling contexts. However, as opposed to LMC, only asymptotic convergence properties of FLMC have yet been established. In this study, we analyze the non-asymptotic behavior of FLMC for non-convex optimization and prove finite-time bounds for its expected suboptimality. Our results show that the weak-error of FLMC increases faster than LMC, which suggests using smaller step-sizes in FLMC. We finally extend our results to the case where the exact gradients are replaced by stochastic gradients and show that similar results hold in this setting as well.
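The "GD recursion perturbed by Gaussian noise" can be sketched as below. This shows only the standard (unadjusted) LMC update x <- x - step * grad_f(x) + sqrt(2 * step / beta) * N(0, I), not the fractional variant studied in the paper. The double-well objective, the inverse temperature beta, and all parameter values are assumptions chosen for illustration.

```python
import numpy as np

def lmc(grad_f, x0, step=1e-2, beta=10.0, n_iters=5000, seed=0):
    """Unadjusted Langevin Monte Carlo: a GD step plus Gaussian noise.

    The noise scale sqrt(2 * step / beta) ties the noise variance
    to the step size and the inverse temperature beta.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(n_iters):
        noise = rng.normal(size=x.shape)
        x = x - step * grad_f(x) + np.sqrt(2.0 * step / beta) * noise
    return x

# Assumed non-convex toy objective: f(x) = (x^2 - 1)^2, with minima at x = -1 and x = +1.
def grad_double_well(x):
    return 4.0 * x * (x**2 - 1.0)

# Start at the local maximum x = 0, where plain GD would not move at all.
x_final = lmc(grad_double_well, x0=np.array([0.0]))
```

Plain GD started at x = 0 stays there because the gradient vanishes; the injected noise is what lets the iterate leave that point and settle near one of the wells.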

denote, inequality, monte carlo, (16 more...)

1901.07487

Country:

Europe > France > Île-de-France > Paris > Paris (0.04)
Asia > Afghanistan > Parwan Province > Charikar (0.04)

Genre: Research Report > New Finding (1.00)

Industry: Education (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)

Sledge, Isaac J., Principe, Jose C.

An Exact Reformulation of Feature-Vector-based Radial-Basis-Function Networks for Graph-based Observations

arXiv.org Machine Learning, Jan-22-2019

Radial-basis-function networks are traditionally defined for sets of vector-based observations. In this short paper, we reformulate such networks so that they can be applied to adjacency-matrix representations of weighted, directed graphs that represent the relationships between object pairs. We re-state the sum-of-squares objective function so that it is purely dependent on entries from the adjacency matrix. From this objective function, we derive a gradient descent update for the network weights. We also derive a gradient update that simulates the repositioning of the radial basis prototypes and changes in the radial basis prototype parameters. An important property of our radial basis function networks is that they are guaranteed to yield the same responses as conventional radial-basis networks trained on a corresponding vector realization of the relationships encoded by the adjacency-matrix. Such a vector realization only needs to provably exist for this property to hold, which occurs whenever the relationships correspond to distances from some arbitrary metric applied to a latent set of vectors. We therefore completely avoid needing to actually construct vectorial realizations via multi-dimensional scaling, which ensures that the underlying relationships are totally preserved.

graph-based rbf network, prototype, rbf network, (13 more...)

1901.07484

Country:

North America > United States > New York > New York County > New York City (0.14)
North America > United States > Florida > Alachua County > Gainesville (0.14)
North America > United States > Wisconsin (0.04)
(14 more...)

Genre: Research Report (0.64)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Government (0.93)
Health & Medicine > Therapeutic Area > Oncology (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.54)
Information Technology > Artificial Intelligence > Machine Learning > Supervised Learning > Representation Of Examples (0.50)

#artificialintelligence, Jan-21-2019, 14:28:35 GMT

r/MachineLearning - [D] Gradient Descent on (deterministic) Mean Absolute Error (L1 loss)

Gradient-based optimization of absolute errors is tricky, since the gradient is "never" zero. In theory, adaptive methods should be able to damp oscillations so that it converges to the minimum. However, I found none of the 'standard' methods were able to do this "out of the box". Learning rate decay could alleviate the problem, but needs manual tuning which I would rather avoid. Does anyone know of a method that can do this?
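The oscillation described in the post can be reproduced with a small NumPy sketch. It fits a single constant c to three assumed data points under mean absolute error, whose minimizer is the median. The data, the starting point, and the 1/sqrt(t + 1) decay schedule are illustrative assumptions, not something the post specifies.

```python
import numpy as np

y = np.array([-1.0, 0.0, 1.0])  # assumed toy data; the MAE minimizer is the median, 0

def mae_grad(c):
    """(Sub)gradient of mean(|c - y|) with respect to the constant c."""
    return np.mean(np.sign(c - y))

def descend(lr_at, c0=0.6, n_steps=2000):
    """Plain gradient descent on c with a step-dependent learning rate."""
    c = c0
    for t in range(n_steps):
        c -= lr_at(t) * mae_grad(c)
    return c

# Constant learning rate: the gradient magnitude never shrinks near the
# minimum, so the iterate keeps jumping back and forth across the median.
c_const = descend(lambda t: 0.75)

# Decaying learning rate: the jumps shrink over time, so the iterate
# closes in on the median.
c_decay = descend(lambda t: 0.75 / np.sqrt(t + 1))
```

With the constant rate, every step has the same length (0.25 here) no matter how close c is to the median, which is exactly the "gradient is never zero" problem; the decay schedule fixes it at the cost of a hand-picked schedule.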

artificial intelligence, machine learning, social media, (4 more...)

Industry: Media > News (0.40)

Technology:

Information Technology > Communications > Social Media (0.76)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.40)

Kreusser, Lisa Maria, Osher, Stanley J., Wang, Bao

A Deterministic Approach to Avoid Saddle Points

arXiv.org Machine Learning, Jan-21-2019

Loss functions with a large number of saddle points are one of the main obstacles to training many modern machine learning models. Gradient descent (GD) is a fundamental algorithm for machine learning and converges to a saddle point for certain initial data. We call the region formed by these initial values the "attraction region." For quadratic functions, GD converges to a saddle point if the initial data is in a subspace of up to $n-1$ dimensions. In this paper, we prove that a small modification of the recently proposed Laplacian smoothing gradient descent (LSGD) [Osher, et al., arXiv:1806.06317] contributes to avoiding saddle points without sacrificing the convergence rate of GD. In particular, we show that the dimension of the LSGD's attraction region is at most $\lfloor (n-1)/2 \rfloor$ for a class of quadratic functions, which is significantly smaller than GD's $(n-1)$-dimensional attraction region.
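The attraction region of plain GD is easy to see in two dimensions. The sketch below assumes the quadratic f(x) = 0.5 * x^T A x with A = diag(1, -1), which has a saddle at the origin; it illustrates only GD and does not implement LSGD or the paper's modification. The step size and iteration count are arbitrary choices.

```python
import numpy as np

# Assumed toy quadratic: f(x) = 0.5 * x^T A x, saddle point at the origin.
A = np.diag([1.0, -1.0])

def gd(x0, lr=0.1, n_iters=200):
    """Plain gradient descent; the gradient of f is A @ x."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        x = x - lr * (A @ x)
    return x

# Start inside the attraction region (the x1-axis, a 1-dimensional subspace):
# the iterate contracts toward the saddle and never leaves it.
x_on = gd([1.0, 0.0])

# Start with a tiny component along the negative-curvature direction:
# that component grows by a factor (1 + lr) per step and GD escapes.
x_off = gd([1.0, 1e-8])
```

Here n = 2 and the attraction region is the 1-dimensional x1-axis, matching the $n-1$ bound quoted in the abstract for GD.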

attraction region, eigenvalue, saddle point, (15 more...)

1901.06827

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.14)
Asia > Middle East > Jordan (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.40)

Industry: Government > Regional Government > North America Government > United States Government (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.56)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

#artificialintelligence, Jan-20-2019, 12:26:50 GMT

pushpull13/Gradient-Descent-Scratch

It is a gradient descent algorithm for classification, implemented from scratch using the numpy library. It is good practice to shuffle the data first with numpy.random.shuffle(). Mini Batch Size is the amount of input data that flows through the network at a time when calculating the error as a whole. Learning Rate Alpha decides the rate at which weights and biases are updated during backpropagation. Number of Epochs decides the number of times the whole dataset is used to train the network. Set Mini Batch Size to 1/10th of the total data available, and adjust it manually after every training run to find its optimum value. Alpha should be selected so that learning isn't very slow but also doesn't take long jumps, or else the network will start diverging from the local minimum. The number of epochs is selected so that the network doesn't overfit to noise. In an ANN, the output depends on every neuron it passes through. For the output layer, we have a label from which its expected value can be found. For all other layers, no single solution is available, so finding the optimum value is a little harder.
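The three knobs described above (mini-batch size, learning rate alpha, number of epochs) can be sketched as follows. This is not the repository's code: it assumes a single logistic unit instead of a multi-layer network to stay short, shuffles with a NumPy Generator permutation instead of numpy.random.shuffle(), and uses made-up toy data and a hypothetical function name.

```python
import numpy as np

def train_minibatch(X, y, alpha=0.1, n_epochs=50, batch_frac=0.1, seed=0):
    """Mini-batch gradient descent for one logistic unit (illustrative only)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    batch_size = max(1, int(n * batch_frac))  # 1/10th of the data by default
    w, b = np.zeros(d), 0.0
    for _ in range(n_epochs):                 # one epoch = one pass over the data
        order = rng.permutation(n)            # reshuffle the sample order each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            p = 1.0 / (1.0 + np.exp(-(X[idx] @ w + b)))  # predicted probabilities
            err = p - y[idx]
            w -= alpha * (X[idx].T @ err) / len(idx)     # weight update
            b -= alpha * err.mean()                      # bias update
    return w, b

# Assumed toy problem: two features, label 1 when their sum is positive.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w, b = train_minibatch(X, y)
accuracy = np.mean(((X @ w + b) > 0) == (y == 1))
```

Raising alpha speeds up learning until the updates start to overshoot, and raising n_epochs eventually fits noise, which is the trade-off the description warns about.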

artificial intelligence, gradient-descent-scratch, machine learning, (2 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.64)

Kalan, Seyed Mohammadreza Mousavi, Soltanolkotabi, Mahdi, Avestimehr, A. Salman

Fitting ReLUs via SGD and Quantized SGD

arXiv.org Machine Learning, Jan-19-2019

In this paper we focus on the problem of finding the optimal weights of the shallowest of neural networks consisting of a single Rectified Linear Unit (ReLU). These functions are of the form $\mathbf{x}\rightarrow \max(0,\langle\mathbf{w},\mathbf{x}\rangle)$ with $\mathbf{w}\in\mathbb{R}^d$ denoting the weight vector. We focus on a planted model where the inputs are chosen i.i.d. from a Gaussian distribution and the labels are generated according to a planted weight vector. We first show that mini-batch stochastic gradient descent, when suitably initialized, converges at a geometric rate to the planted model with a number of samples that is optimal up to numerical constants. Next we focus on a parallel implementation where in each iteration the mini-batch gradient is calculated in a distributed manner across multiple processors and then broadcast to a master or all other processors. To reduce the communication cost in this setting we utilize a Quantized Stochastic Gradient Scheme (QSGD) where the partial gradients are quantized. Perhaps unexpectedly, we show that QSGD maintains the fast convergence of SGD to a globally optimal model while significantly reducing the communication cost. We further corroborate our numerical findings via various experiments including distributed implementations over Amazon EC2.
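The planted ReLU setting can be sketched in NumPy as below. This covers only plain mini-batch SGD, not the quantized or distributed scheme. The initialization (the sample mean of y_i * x_i, which for standard Gaussian inputs is close to half the planted vector) is an assumption made here for illustration and may differ from the paper's; the dimensions, step size, and batch size are likewise arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 10, 2000
w_star = rng.normal(size=d)                # planted weight vector
X = rng.normal(size=(n, d))                # i.i.d. Gaussian inputs
y = np.maximum(0.0, X @ w_star)            # labels from the planted ReLU

def sgd_relu(X, y, w0, lr=0.5, batch=64, n_iters=500, seed=1):
    """Mini-batch SGD on the loss 0.5 * (max(0, <w, x>) - y)^2."""
    rng = np.random.default_rng(seed)
    w = w0.copy()
    for _ in range(n_iters):
        idx = rng.choice(len(y), size=batch, replace=False)
        z = X[idx] @ w
        resid = np.maximum(0.0, z) - y[idx]
        # The ReLU derivative is 1 where z > 0 and 0 elsewhere.
        grad = X[idx].T @ (resid * (z > 0)) / batch
        w -= lr * grad
    return w

# Assumed initialization: the mean of y_i * x_i points roughly along w_star.
w0 = (X * y[:, None]).mean(axis=0)
w_hat = sgd_relu(X, y, w0)
rel_err = np.linalg.norm(w_hat - w_star) / np.linalg.norm(w_star)
```

Because the labels are noiseless, the mini-batch gradient vanishes at the planted vector, so the iterates can keep contracting toward it without a decaying step size.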

convergence, probability, theorem 3, (14 more...)

1901.06587

Country:

North America > United States > California (0.14)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > Minnesota (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.92)