 Gradient Descent


Gradient Update #9: Bias Bounties and Hierarchical Architectures for Computer Vision

#artificialintelligence

Welcome to the ninth update from the Gradient! If you were referred by a friend, subscribe and follow us on Twitter! This edition's news story is "Sharing learnings from the first algorithmic bias bounty challenge". Summary: Twitter's algorithmic bias bounty challenge, the first of its kind, recently concluded. While users had previously found that the algorithm had a racial bias, the bounty uncovered a number of other biases and potential harms.


Simulated annealing for optimization of graphs and sequences

arXiv.org Artificial Intelligence

Optimization of discrete structures aims at generating a new structure with a better property given an existing one, which is a fundamental problem in machine learning. Unlike continuous optimization, realistic applications of discrete optimization (e.g., text generation) are very challenging due to the complex and long-range constraints, including both syntax and semantics, in discrete structures. In this work, we present SAGS, a novel Simulated Annealing framework for Graph and Sequence optimization. The key idea is to integrate powerful neural networks into metaheuristics (e.g., simulated annealing, SA) to restrict the search space in discrete optimization. We start by defining a sophisticated objective function, involving the property of interest and pre-defined constraints (e.g., grammar validity). SAGS searches the discrete space towards this objective by performing a sequence of local edits, where deep generative neural networks propose the editing content and thus can control the quality of editing. We evaluate SAGS on paraphrase generation and molecule generation for sequence optimization and graph optimization, respectively. Extensive results show that our approach achieves state-of-the-art performance compared with existing paraphrase generation methods in terms of both automatic and human evaluations. Further, SAGS also significantly outperforms all the previous methods in molecule generation.
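The search loop described above (local edits accepted or rejected against an objective under a cooling temperature) can be illustrated with a minimal, generic simulated annealing sketch. This is not the SAGS implementation: the toy objective (character matches against a hypothetical `TARGET` string), the random `propose` function standing in for a neural proposal model, and the geometric cooling schedule are all assumptions made for illustration.

```python
import math
import random


def simulated_annealing(initial, objective, propose, steps=2000,
                        t_start=1.0, t_end=0.01, seed=0):
    """Maximize `objective` over discrete structures using local edits."""
    rng = random.Random(seed)
    current, current_score = initial, objective(initial)
    best, best_score = current, current_score
    for step in range(steps):
        # Geometric cooling from t_start down to t_end (an assumed schedule).
        temperature = t_start * (t_end / t_start) ** (step / max(1, steps - 1))
        candidate = propose(current, rng)
        candidate_score = objective(candidate)
        delta = candidate_score - current_score
        # Always accept improvements; accept a worse candidate with
        # probability exp(delta / temperature).
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current, current_score = candidate, candidate_score
            if current_score > best_score:
                best, best_score = current, current_score
    return best, best_score


# Hypothetical toy problem: edit a string towards a target property.
TARGET = "gradient"
ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def objective(s):
    # Toy "property of interest": number of positions that match TARGET.
    return sum(a == b for a, b in zip(s, TARGET))


def propose(s, rng):
    # Local edit: replace one character. A uniform random choice stands in
    # for the learned proposal model mentioned in the abstract.
    i = rng.randrange(len(s))
    return s[:i] + rng.choice(ALPHABET) + s[i + 1:]


best, score = simulated_annealing("aaaaaaaa", objective, propose)
```

In a realistic setting, `objective` would combine the property of interest with constraint terms, and `propose` would be the component that restricts the search space.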


Scalable Rule-Based Representation Learning for Interpretable Classification

arXiv.org Artificial Intelligence

Rule-based models, e.g., decision trees, are widely used in scenarios demanding high model interpretability for their transparent inner structures and good model expressivity. However, rule-based models are hard to optimize, especially on large data sets, due to their discrete parameters and structures. Ensemble methods and fuzzy/soft rules are commonly used to improve performance, but they sacrifice the model interpretability. To obtain both good scalability and interpretability, we propose a new classifier, named Rule-based Representation Learner (RRL), that automatically learns interpretable non-fuzzy rules for data representation and classification. To train the non-differentiable RRL effectively, we project it to a continuous space and propose a novel training method, called Gradient Grafting, that can directly optimize the discrete model using gradient descent. An improved design of logical activation functions is also devised to increase the scalability of RRL and enable it to discretize the continuous features end-to-end. Exhaustive experiments on nine small and four large data sets show that RRL outperforms the competitive interpretable approaches and can be easily adjusted to obtain a trade-off between classification accuracy and model complexity for different scenarios. Our code is available at: https://github.com/12wang3/rrl.


Gradient Descent: Taking a Different View

#artificialintelligence

I had my first encounter with the Gradient Descent algorithm when I was learning about Linear Regression for the very first time. I devoured as much information about Gradient Descent as I could, scouring the internet for an explanation that would satisfy me. The most common explanation I found was analogous to the "going downhill on a cliff" experience. While this was really intuitive and easily comprehensible, …


Adaptive Sampling Quasi-Newton Methods for Zeroth-Order Stochastic Optimization

arXiv.org Artificial Intelligence

Several methods have been proposed to solve such derivative-free stochastic optimization problems, and we refer the reader to [3, 38] for surveys of these methods. A popular class of these methods estimates the gradients using function values and employs standard gradient-based optimization methods using these estimators. Quasi-Newton methods are recognized as one of the most powerful methods for solving deterministic optimization problems. These methods build quadratic models of the objective function using only gradient information. Recently, researchers have been adapting these methods for stochastic settings when the gradient information is available. The empirical results in [15] indicate that a careful implementation of these methods can be efficient compared with the popular stochastic gradient methods. We adapt these methods to make them suitable for situations where the gradients are estimated using function values. We propose finite-difference derivative-free stochastic quasi-Newton methods for solving (1) by exploiting common random number (CRN) evaluations of f.
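To make the idea of estimating gradients from function values with common random numbers more concrete, here is a minimal sketch of a forward-difference estimator. It is an illustration only, not the authors' method: the objective `noisy_f`, the use of integer seeds to fix the randomness, and the step size `h` are all assumptions.

```python
import random


def noisy_f(x, seed):
    """Hypothetical stochastic objective: a quadratic scaled by random noise.

    The seed fully determines the noise, so calling with the same seed at two
    different points evaluates the same random realization.
    """
    rng = random.Random(seed)
    noise = 1.0 + 0.1 * rng.gauss(0.0, 1.0)
    return noise * sum(xi * xi for xi in x)


def fd_gradient_crn(f, x, seeds, h=1e-4):
    """Forward-difference gradient estimate averaged over a batch of seeds.

    Each seed is reused for f(x + h * e_i) and f(x), so the noise largely
    cancels in the difference (the common random number idea).
    """
    n = len(x)
    grad = [0.0] * n
    for seed in seeds:
        base = f(x, seed)
        for i in range(n):
            shifted = list(x)
            shifted[i] += h
            grad[i] += (f(shifted, seed) - base) / h
    return [g / len(seeds) for g in grad]


x = [1.0, -2.0]
grad = fd_gradient_crn(noisy_f, x, seeds=range(200))
```

For this toy objective the expected gradient is `2 * x`, so the estimate should land close to `[2.0, -4.0]`. An estimator like this could then feed a gradient-based or quasi-Newton update.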


Learning Generative Deception Strategies in Combinatorial Masking Games

arXiv.org Artificial Intelligence

Deception is a crucial tool in the cyberdefence repertoire, enabling defenders to leverage their informational advantage to reduce the likelihood of successful attacks. One way deception can be employed is through obscuring, or masking, some of the information about how systems are configured, increasing the attacker's uncertainty about their targets. We present a novel game-theoretic model of the resulting defender-attacker interaction, where the defender chooses a subset of attributes to mask, while the attacker responds by choosing an exploit to execute. The strategies of both players have combinatorial structure with complex informational dependencies, and therefore even representing these strategies is not trivial. First, we show that the problem of computing an equilibrium of the resulting zero-sum defender-attacker game can be represented as a linear program with a combinatorial number of system configuration variables and constraints, and develop a constraint generation approach for solving this problem. Next, we present a novel highly scalable approach for approximately solving such games by representing the strategies of both players as neural networks. The key idea is to represent the defender's mixed strategy using a deep neural network generator, and then to use an alternating gradient-descent-ascent algorithm, analogous to the training of Generative Adversarial Networks. Our experiments, as well as a case study, demonstrate the efficacy of the proposed approach.
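The alternating gradient-descent-ascent scheme mentioned above can be sketched on a toy zero-sum problem. This is a minimal illustration, not the paper's neural-network setup: the payoff `f(x, y) = 0.5*x**2 + x*y - 0.5*y**2`, the scalar strategies, and the learning rate are assumptions chosen so that the saddle point sits at `(0, 0)`.

```python
def grad_x(x, y):
    # Partial derivative of f(x, y) = 0.5*x**2 + x*y - 0.5*y**2 w.r.t. x.
    return x + y


def grad_y(x, y):
    # Partial derivative of the same f w.r.t. y.
    return x - y


def alternating_gda(x, y, lr=0.1, steps=500):
    """Alternate a descent step on x with an ascent step on y."""
    for _ in range(steps):
        x = x - lr * grad_x(x, y)  # minimizing player moves first
        y = y + lr * grad_y(x, y)  # maximizing player responds to the new x
    return x, y


x_star, y_star = alternating_gda(2.0, -1.0)
```

On this strongly convex-concave toy payoff the iterates contract towards the saddle point. With neural network strategies, as in GAN training, convergence is generally not guaranteed.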


Types of Multi Classification

#artificialintelligence

This blog introduces different types of multiclass classification systems. Unlike binary classifiers, multiclass classifiers can distinguish between more than two classes. Some algorithms, such as Stochastic Gradient Descent (SGD) classifiers, Random Forest classifiers, and naive Bayes classifiers, are capable of handling multiple classes natively. On the other hand, Logistic Regression and Support Vector Machine classifiers are, in their basic form, strictly binary classifiers. There are various strategies that you can use to perform multiclass classification with multiple binary classifiers.
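One widely used strategy of this kind is one-vs-rest: train one binary classifier per class and predict the class whose classifier gives the highest score. The sketch below is a minimal, dependency-free illustration. The choice of a perceptron as the binary classifier and the toy data set are assumptions, not something the post specifies.

```python
def train_perceptron(points, labels, max_epochs=1000):
    """Train a binary perceptron; labels are +1 or -1. Returns (weights, bias)."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, t in zip(points, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if t * score <= 0:  # misclassified (or on the boundary): update
                w = [wi + t * xi for wi, xi in zip(w, x)]
                b += t
                mistakes += 1
        if mistakes == 0:  # every training point is classified correctly
            break
    return w, b


def train_one_vs_rest(points, classes):
    """Train one binary classifier per class: that class vs. all others."""
    models = {}
    for c in sorted(set(classes)):
        binary_labels = [1 if k == c else -1 for k in classes]
        models[c] = train_perceptron(points, binary_labels)
    return models


def predict(models, x):
    """Return the class whose binary classifier gives the highest score."""
    def score(c):
        w, b = models[c]
        return sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(models, key=score)


# Hypothetical toy data: three well-separated clusters in 2D.
points = [(0, 0), (1, 0), (0, 1), (5, 0), (6, 1), (5, 1), (0, 5), (1, 6), (1, 5)]
classes = [0, 0, 0, 1, 1, 1, 2, 2, 2]
models = train_one_vs_rest(points, classes)
predictions = [predict(models, p) for p in points]
```

Another common strategy is one-vs-one, which trains a classifier for every pair of classes and therefore needs more models, each trained on less data.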


On the equivalence of different adaptive batch size selection strategies for stochastic gradient descent methods

arXiv.org Machine Learning

In this study, we demonstrate that the norm test and inner product/orthogonality test presented in \cite{Bol18} are equivalent in terms of the convergence rates associated with Stochastic Gradient Descent (SGD) methods if $\epsilon^2=\theta^2+\nu^2$ with specific choices of $\theta$ and $\nu$. Here, $\epsilon$ controls the relative statistical error of the norm of the gradient while $\theta$ and $\nu$ control the relative statistical error of the gradient in the direction of the gradient and in the direction orthogonal to the gradient, respectively. Furthermore, we demonstrate that the inner product/orthogonality test can be as inexpensive as the norm test in the best case scenario if $\theta$ and $\nu$ are optimally selected, but the inner product/orthogonality test will never be more computationally affordable than the norm test if $\epsilon^2=\theta^2+\nu^2$. Finally, we present two stochastic optimization problems to illustrate our results.
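As a rough illustration of what a norm test checks, the sketch below uses one common sample-based form: the estimated variance of the mini-batch gradient is compared against `epsilon**2` times the squared norm of the mini-batch gradient. The exact formulation in the cited work may differ. The function name, the batch-size suggestion rule, and the example gradients are assumptions.

```python
import math


def norm_test(sample_grads, epsilon):
    """Sample-based norm test on per-example gradients.

    Returns (passed, suggested_batch_size). Assumes at least two samples and
    a nonzero mean gradient.
    """
    n = len(sample_grads)
    dim = len(sample_grads[0])
    mean = [sum(g[j] for g in sample_grads) / n for j in range(dim)]
    mean_sq_norm = sum(m * m for m in mean)
    # Trace of the sample covariance of the per-example gradients.
    variance = sum(
        sum((g[j] - mean[j]) ** 2 for j in range(dim)) for g in sample_grads
    ) / (n - 1)
    # The variance of the batch mean is roughly variance / n.
    passed = variance / n <= epsilon ** 2 * mean_sq_norm
    if passed:
        return True, n
    # Smallest batch size for which the test would hold at this variance.
    return False, math.ceil(variance / (epsilon ** 2 * mean_sq_norm))


# Hypothetical per-example gradients for a batch of four samples.
grads = [(1.0, 0.0), (1.2, 0.1), (0.8, -0.1), (1.0, 0.0)]
loose = norm_test(grads, epsilon=0.5)
tight = norm_test(grads, epsilon=0.05)
```

With the loose tolerance the current batch passes. With the tight tolerance the test fails and a larger batch is suggested, which is how tests of this kind drive adaptive batch size selection.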


Optimization Strategies in Multi-Task Learning: Averaged or Separated Losses?

arXiv.org Artificial Intelligence

In Multi-Task Learning (MTL), it is common practice to train multi-task networks by optimizing an objective function that is a weighted average of the task-specific objective functions. Although the computational advantages of this strategy are clear, the complexity of the resulting loss landscape has not been studied in the literature. Arguably, its optimization may be more difficult than a separate optimization of the constituent task-specific objectives. In this work, we investigate the benefits of such an alternative by alternating independent gradient descent steps on the different task-specific objective functions, and we formulate a novel way to combine this approach with state-of-the-art optimizers. As the separation of task-specific objectives comes at the cost of increased computational time, we propose a random task grouping as a trade-off between better optimization and computational efficiency. Experimental results over three well-known visual MTL datasets show better overall absolute performance on losses and standard metrics compared to an averaged objective function and other state-of-the-art MTL methods. In particular, our method shows the most benefits when dealing with tasks of different natures, and it enables a wider exploration of the shared parameter space. We also show that our random grouping strategy makes it possible to trade off between these benefits and computational efficiency.
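The difference between the two optimization strategies can be sketched on a toy problem with a single shared parameter. This is only an illustration under stated assumptions: the two quadratic task losses, the plain gradient descent updates, and the learning rate are hypothetical, and the paper's combination with state-of-the-art optimizers and random task grouping is not shown.

```python
def averaged_descent(w, task_grads, lr=0.1, steps=200):
    """One gradient step per iteration on the average of the task losses."""
    for _ in range(steps):
        g = sum(grad(w) for grad in task_grads) / len(task_grads)
        w -= lr * g
    return w


def alternating_descent(w, task_grads, lr=0.1, rounds=200):
    """Independent gradient steps, one task-specific loss at a time."""
    for _ in range(rounds):
        for grad in task_grads:
            w -= lr * grad(w)
    return w


# Hypothetical task losses (w - 1)**2 and (w - 3)**2, given by their gradients.
task_grads = [
    lambda w: 2.0 * (w - 1.0),
    lambda w: 2.0 * (w - 3.0),
]

w_averaged = averaged_descent(0.0, task_grads)
w_alternating = alternating_descent(0.0, task_grads)
```

In this toy setting the averaged objective settles at the midpoint between the two task optima, while the alternating scheme keeps moving between them and ends each round slightly closer to the optimum of the task updated last. Alternation costs one gradient step per task per round, which is the computational overhead the abstract refers to.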


A Novel Structured Natural Gradient Descent for Deep Learning

arXiv.org Artificial Intelligence

In order to perform calculations faster, we need to find natural gradient algorithms with low computational complexity and low storage requirements. This paper proposes a structured natural gradient descent method (SNGD) for learning deep neural networks. SNGD first reconfigures the parameter layer of the deep network by adding a new processing layer (named the local Fisher layer), and then optimizes the reconstructed network model based on traditional gradient descent (GD), which is equivalent to optimizing the original network using natural gradient descent (NGD), thus effectively reducing the computational complexity of NGD. With the introduction of the local Fisher layer, the curvature information of the loss function space can be captured, and an adjustment related to the spatial curvature is added to the original gradient direction, which ensures that there is a reasonable parameter change in each update during optimization and improves the convergence speed of the parameters. We test the proposed approach on… The main contributions of this paper are as follows: 1) By adding a new local Fisher layer to reconstruct the network, the relevant calculation of the global Fisher matrix is decomposed and finally transformed into the use of traditional GD for optimization to achieve the effect of NGD. 2) A new layer, the local Fisher layer, and its efficient implementation scheme are proposed. Through the introduction of second-order information, the local Fisher layer considers the different attributes of different positions of the parameters and adds constraints to the transformation of the model parameters, so that the gradient update can be carried out stably and quickly.