 Gradient Descent


Gumbel-softmax Optimization: A Simple General Framework for Combinatorial Optimization Problems on Graphs

arXiv.org Machine Learning

Many real-life problems can be cast as combinatorial optimization problems (COPs) on graphs: the task is to find the node state configuration or network structure that optimizes a designed objective function under some constraints. These problems are notoriously hard to solve because most of them are NP-hard or NP-complete. Although general-purpose methods such as simulated annealing (SA) and genetic algorithms (GA) have been devised for them, their accuracy and running time are unsatisfactory in practice. In this work, we propose a simple, fast, and general algorithmic framework for COPs called Gumbel-softmax Optimization (GSO). By introducing the Gumbel-softmax technique, developed in the machine learning community, we can optimize the objective function directly by gradient descent despite the discrete nature of the variables. We test our algorithm on four problems: the Sherrington-Kirkpatrick (SK) model, the maximum independent set (MIS) problem, modularity optimization, and a structural optimization problem. It obtains high-quality solutions in much less time than traditional approaches.
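To illustrate the relaxation the abstract refers to, here is a minimal sketch of drawing one Gumbel-softmax sample in plain Python. It is not the paper's GSO implementation: the function name `gumbel_softmax`, the example logits, and the temperatures are assumptions chosen for illustration, and the gradient step itself is omitted.

```python
import math
import random

def gumbel_softmax(logits, tau, rng):
    """Relaxed one-hot sample: softmax((logits + Gumbel noise) / tau)."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise by inverse transform: g = -log(-log(u)).
        # Clamp u away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / tau)
    # Numerically stable softmax over the perturbed, scaled logits.
    top = max(noisy)
    exps = [math.exp(v - top) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)  # fixed seed (an assumption) for repeatability
logits = [1.0, 0.5, -1.0]  # hypothetical unnormalized log-probabilities
soft_sample = gumbel_softmax(logits, tau=1.0, rng=rng)
sharp_sample = gumbel_softmax(logits, tau=0.05, rng=rng)
print(soft_sample)
print(sharp_sample)
```

Each sample is a probability vector over the discrete states. Lower temperatures `tau` tend to push samples toward one-hot vectors, while the sample stays a smooth function of `logits`, which is what allows gradient-based updates on a discrete problem.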


Classify Clothes Using Python and Artificial Neural Networks

#artificialintelligence

In this article, I will show you how to classify clothes from the Fashion MNIST data set using the Python programming language and a machine learning technique called Artificial Neural Networks! If you prefer not to read this article and would like a video version of it, you can check out the video below. It goes through everything in this article in a little more detail and will help you start programming your own Artificial Neural Network (ANN) model even if you don't have Python installed on your computer. Or you can use both the video and this article as supplementary materials for learning about ANNs! First I will write a description of what this program will do.


Riemannian batch normalization for SPD neural networks

arXiv.org Machine Learning

Covariance matrices have attracted attention for machine learning applications due to their capacity to capture interesting structure in the data. The main challenge is that one needs to take into account the particular geometry of the Riemannian manifold of symmetric positive definite (SPD) matrices they belong to. In the context of deep networks, several architectures for these matrices have recently been proposed. In our article, we introduce a Riemannian batch normalization (batchnorm) algorithm, which generalizes the one used in Euclidean nets. This novel layer makes use of geometric operations on the manifold, notably the Riemannian barycenter, parallel transport and non-linear structured matrix transformations. We derive a new manifold-constrained gradient descent algorithm working in the space of SPD matrices, which makes it possible to learn the batchnorm layer. We validate our proposed approach with experiments in three different contexts on diverse data types: a drone recognition dataset from radar observations, and emotion and action recognition datasets from video and motion capture data. Experiments show that the Riemannian batchnorm systematically gives better classification performance than leading methods and is remarkably robust to lack of data.


Regression Metrics' Guide - Open Source Leader in AI and ML

#artificialintelligence

The target's distribution is right-skewed, with some fairly high values compared to the mean. The Root Mean Squared Error (RMSE) or Mean Squared Error (MSE, which is basically the same as RMSE without the square root) is the most popular regression metric. If there were a king or queen of regression metrics, this would be it! It is defined as $\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2}$, where $\hat{y}_i$ is the prediction, $y_i$ the actual target value, and $n$ the total number of observations. In other words, you square all the errors (or residuals, as they are called) per sample/row, sum them, divide by the total number of observations, and take the square root to bring the metric back to the original space (or you don't, in MSE). It is also one of the oldest regression metrics. Smaller errors (for example, those less than 1) contribute even less to the overall error after being squared, whereas bigger errors carry much more weight. A large error in a single sample can have a huge impact on the overall result and make an optimizer focus on reducing the error for that sample, making the predictions for every other sample worse. The squaring is also what makes the metric easily differentiable, something that gradient-based algorithms (like Stochastic Gradient Descent) can leverage.
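As a small worked example of the description above, the following Python sketch computes MSE and RMSE and shows how one large residual dominates the score. The function names and the toy data are assumptions made for illustration; they are not taken from the original guide.

```python
import math

def mse(y_true, y_pred):
    """Mean of the squared residuals."""
    n = len(y_true)
    return sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / n

def rmse(y_true, y_pred):
    """Square root of MSE, back in the units of the target."""
    return math.sqrt(mse(y_true, y_pred))

# Hypothetical right-skewed target: three small values and one large one.
y_true = [1.0, 2.0, 3.0, 100.0]
small_errors = [1.5, 2.5, 3.5, 100.5]  # every prediction is off by 0.5
one_outlier = [1.0, 2.0, 3.0, 90.0]    # three exact, one off by 10

print(rmse(y_true, small_errors))  # residuals below 1 shrink when squared
print(rmse(y_true, one_outlier))   # a single large residual dominates
```

Here the first set of predictions is wrong on every row yet scores an RMSE of 0.5, while the second is exact on three rows out of four and scores 5.0, which is the sensitivity to large errors the text describes.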


The Error-Feedback Framework: Better Rates for SGD with Delayed Gradients and Compressed Communication

arXiv.org Machine Learning

We analyze (stochastic) gradient descent (SGD) with delayed updates on smooth quasi-convex and non-convex functions and derive concise, non-asymptotic convergence rates. We show that the rate of convergence in all cases consists of two terms: (i) a stochastic term which is not affected by the delay, and (ii) a higher order deterministic term which is only linearly slowed down by the delay. Thus, in the presence of noise, the effects of the delay become negligible after a few iterations and the algorithm converges at the same optimal rate as standard SGD. This result extends a line of research that showed similar results in the asymptotic regime or for strongly-convex quadratic functions only. We further show similar results for SGD with more intricate forms of delayed gradients: compressed gradients under error compensation, and localSGD, where multiple workers perform local steps before communicating with each other. In all of these settings, we improve upon the best known rates. These results show that SGD is robust to compressed and/or delayed stochastic gradient updates. This is particularly important for distributed parallel implementations, where asynchronous and communication-efficient methods are the key to achieving linear speedups for optimization with multiple devices.
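The idea of compressed gradients under error compensation can be sketched in a few lines of Python. This is an illustrative toy, not the paper's setup: it assumes a top-1 compressor, exact (noise-free) gradients of a small diagonal quadratic, and hypothetical names such as `error_feedback_gd` and `memory`.

```python
def loss(x, curvatures):
    """Diagonal quadratic f(x) = 0.5 * sum_i a_i * x_i^2."""
    return 0.5 * sum(a * v * v for a, v in zip(curvatures, x))

def gradient(x, curvatures):
    return [a * v for a, v in zip(curvatures, x)]

def top1(vec):
    """Compressor: keep only the largest-magnitude entry, zero the rest."""
    keep = max(range(len(vec)), key=lambda i: abs(vec[i]))
    return [v if i == keep else 0.0 for i, v in enumerate(vec)]

def error_feedback_gd(x0, curvatures, learning_rate, steps):
    x = list(x0)
    memory = [0.0] * len(x)  # update mass that has not been applied yet
    for _ in range(steps):
        g = gradient(x, curvatures)
        # Add the leftover error from earlier steps to the new scaled gradient.
        proposed = [learning_rate * gi + mi for gi, mi in zip(g, memory)]
        sent = top1(proposed)  # only this compressed update is applied
        x = [xi - si for xi, si in zip(x, sent)]
        # Whatever the compressor dropped is remembered for later steps.
        memory = [pi - si for pi, si in zip(proposed, sent)]
    return x

curvatures = [1.0, 2.0, 3.0]  # hypothetical problem data
x_start = [1.0, -1.0, 1.0]
x_final = error_feedback_gd(x_start, curvatures, learning_rate=0.01, steps=5000)
initial_loss = loss(x_start, curvatures)
final_loss = loss(x_final, curvatures)
print(initial_loss, final_loss)
```

Because the dropped coordinates accumulate in `memory` rather than being discarded, every coordinate is eventually updated even though only one entry is transmitted per step.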


Implicit Regularization for Optimal Sparse Recovery

arXiv.org Machine Learning

We investigate implicit regularization schemes for gradient descent methods applied to unpenalized least squares regression to solve the problem of reconstructing a sparse signal from an underdetermined system of linear measurements under the restricted isometry assumption. For a given parametrization yielding a non-convex optimization problem, we show that prescribed choices of initialization, step size and stopping time yield a statistically and computationally optimal algorithm that achieves the minimax rate with the same cost required to read the data up to poly-logarithmic factors. Beyond minimax optimality, we show that our algorithm adapts to instance difficulty and yields a dimension-independent rate when the signal-to-noise ratio is high enough. Key to the computational efficiency of our method is an increasing step size scheme that adapts to refined estimates of the true solution. We validate our findings with numerical experiments and compare our algorithm against explicit $\ell_{1}$ penalization. Going from hard instances to easy ones, our algorithm is seen to undergo a phase transition, eventually matching least squares with an oracle knowledge of the true support.


Logarithmic Regret for Online Control

arXiv.org Machine Learning

We study optimal regret bounds for control in linear dynamical systems under adversarially changing strongly convex cost functions, given the knowledge of transition dynamics. This includes several well studied and fundamental frameworks such as the Kalman filter and the linear quadratic regulator. State of the art methods achieve regret which scales as $O(\sqrt{T})$, where $T$ is the time horizon. We show that the optimal regret in this setting can be significantly smaller, scaling as $O(\text{poly}(\log T))$. This regret bound is achieved by two different efficient iterative methods, online gradient descent and online natural gradient.


Sparse and Imperceivable Adversarial Attacks

arXiv.org Machine Learning

Neural networks have been proven to be vulnerable to a variety of adversarial attacks. From a safety perspective, highly sparse adversarial attacks are particularly dangerous. On the other hand, the pixelwise perturbations of sparse attacks are typically large and thus can potentially be detected. We propose a new black-box technique to craft adversarial examples aiming at minimizing $l_0$-distance to the original image. Extensive experiments show that our attack is better than or competitive with the state of the art. Moreover, we can integrate additional bounds on the componentwise perturbation. Allowing pixels to change only in regions of high variation and avoiding changes along axis-aligned edges makes our adversarial examples almost non-perceivable. Moreover, we adapt the Projected Gradient Descent attack to the $l_0$-norm, integrating componentwise constraints. This allows us to do adversarial training to enhance the robustness of classifiers against sparse and imperceivable adversarial manipulations.


Learning Vector-valued Functions with Local Rademacher Complexity

arXiv.org Machine Learning

Abstract: We consider a general family of problems in which the output space admits vector-valued structure, covering a broad family of important domains, e.g. By using local Rademacher complexity and unlabeled data, we derive novel data-dependent excess risk bounds for vector-valued functions in both linear space and kernel space. The proposed bounds are much sharper than existing bounds and can be applied to specific vector-valued tasks with different hypothesis sets and loss functions. The theoretical analysis motivates us to devise a unified learning framework for vector-valued functions, which is solved by proximal gradient descent on the primal, achieving a much better tradeoff between accuracy and efficiency. Empirical results on several benchmark datasets show that the proposed algorithm significantly outperforms the compared methods, which coincides with our theoretical analysis.

Index Terms: Statistical Learning Theory, Local Rademacher Complexity, Vector-Valued Functions, Semi-Supervised Learning.

1 Introduction

In supervised learning, learning vector-valued functions means learning a predictive model from training data with vector-valued labels instead of scalar-valued labels. This covers a wide range of important tasks, such as multi-task learning [1], [2], [3], multi-label learning [4], [5], multi-class classification [6], [7], ranking [8], [9] and so on. The first unified learning framework for vector-valued functions in reproducing kernel Hilbert space (RKHS) was proposed in [10]. The unified framework was then further developed in [11], [12] and extended to semi-supervised learning by manifold regularization [13], [14], [15]. While current research on vector-valued functions mainly focuses on the algorithmic front, we study vector-valued functions from both a theoretical and an algorithmic perspective.

In this paper, we integrate our previous work on multi-class classification [7], [16] and generalize the idea to vector-valued settings. We make the paper a significant improvement on those two conference papers, with clearer and more general theoretical results, additional technical details, and a unified learning framework.

Wang are with Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China.

As the most common and successful data-dependent tool, Rademacher complexity was first used to analyze the generalization performance of multi-class tasks in [22] and further studied in [23], [24]. The convergence rate of Rademacher complexity based error bounds is usually $O(K/\sqrt{n})$, where $K$ and $n$ are the number of classes and the size of the labeled sample, respectively.


#003 D TF Gradient Descent in TensorFlow Master Data Science

#artificialintelligence

In this post we will see how to implement Gradient Descent using TensorFlow. Next, we will define our variable \(\omega\) and initialize it with \(-3\). With the following piece of code we will also define our cost function \(J(\omega) = (\omega - 3)^2\). With the next two lines of code, we specify the initialization of our variables (here we have just one variable, \(\omega\)) and the gradient descent for minimizing our cost function with a learning rate of \(0.01\). Then we will define a session as sess and run the init to initialize the variable \(\omega\).
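The update that the optimizer performs on this cost function can be written out by hand. The sketch below is plain Python rather than the post's TensorFlow code; it uses the same starting value \(-3\) and learning rate \(0.01\), while the number of steps and the function names are assumptions made for illustration.

```python
def cost(w):
    """J(w) = (w - 3)^2, minimized at w = 3."""
    return (w - 3.0) ** 2

def cost_gradient(w):
    """dJ/dw = 2 * (w - 3)."""
    return 2.0 * (w - 3.0)

w = -3.0              # initial value of the variable, as in the post
learning_rate = 0.01  # learning rate, as in the post
num_steps = 1000      # assumed number of iterations

for _ in range(num_steps):
    # Gradient descent step: move against the gradient of the cost.
    w -= learning_rate * cost_gradient(w)

print(w, cost(w))
```

Each step multiplies the distance to the minimizer by \(1 - 2 \cdot 0.01 = 0.98\), so \(\omega\) approaches \(3\) geometrically.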