Goto

Collaborating Authors

 Gradient Descent


Friesen

AAAI Conferences

Continuous optimization is an important problem in many areas of AI, including vision, robotics, probabilistic inference, and machine learning. Unfortunately, most real-world optimization problems are nonconvex, causing standard convex techniques to find only local optima, even with extensions like random restarts and simulated annealing. We observe that, in many cases, the local modes of the objective function have combinatorial structure, and thus ideas from combinatorial optimization can be brought to bear. Based on this, we propose a problem-decomposition approach to nonconvex optimization. Similarly to DPLL-style SAT solvers and recursive conditioning in probabilistic inference, our algorithm, RDIS, recursively sets variables so as to simplify and decompose the objective function into approximately independent sub-functions, until the remaining functions are simple enough to be optimized by standard techniques like gradient descent. The variables to set are chosen by graph partitioning, ensuring decomposition whenever possible. We show analytically that RDIS can solve a broad class of nonconvex optimization problems exponentially faster than gradient descent with random restarts. Experimentally, RDIS outperforms standard techniques on problems like structure from motion and protein folding.


Flenner

AAAI Conferences

Integrating information from many different data sources to provide better situational awareness is an essential Navy issue. Many data fusion models use statistical methods to reduce statistical errors. Machine learning and big data provide, on the other hand, provides a unique framework for information fusion through our ability to learn what added benefits a different modality can provide. In this work, we provide a novel data fusion method that integrates relational data, provided to us in the form of a graph, and image data. We build an energy model that learns a representation of the data where different data sources are assumed to be similar using a graphical model. The energy model is a non-convex function which we optimize using stochastic gradient descent with momentum. The effectiveness of the model is demonstrated in an automated target recognition example.


Differentially Private Graph Classification with GNNs

arXiv.org Artificial Intelligence

Graph Neural Networks (GNNs) have established themselves as the state-of-the-art models for many machine learning applications such as the analysis of social networks, protein interactions and molecules. Several among these datasets contain privacy-sensitive data. Machine learning with differential privacy is a promising technique to allow deriving insight from sensitive data while offering formal guarantees of privacy protection. However, the differentially private training of GNNs has so far remained under-explored due to the challenges presented by the intrinsic structural connectivity of graphs. In this work, we introduce differential privacy for graph-level classification, one of the key applications of machine learning on graphs. Our method is applicable to deep learning on multi-graph datasets and relies on differentially private stochastic gradient descent (DP-SGD). We show results on a variety of synthetic and public datasets and evaluate the impact of different GNN architectures and training hyperparameters on model performance for differentially private graph classification. Finally, we apply explainability techniques to assess whether similar representations are learned in the private and non-private settings and establish robust baselines for future work in this area.


Grassmann Stein Variational Gradient Descent

arXiv.org Machine Learning

Stein variational gradient descent (SVGD) is a deterministic particle inference algorithm that provides an efficient alternative to Markov chain Monte Carlo. However, SVGD has been found to suffer from variance underestimation when the dimensionality of the target distribution is high. Recent developments have advocated projecting both the score function and the data onto real lines to sidestep this issue, although this can severely overestimate the epistemic (model) uncertainty. In this work, we propose Grassmann Stein variational gradient descent (GSVGD) as an alternative approach, which permits projections onto arbitrary dimensional subspaces. Compared with other variants of SVGD that rely on dimensionality reduction, GSVGD updates the projectors simultaneously for the score function and the data, and the optimal projectors are determined through a coupled Grassmann-valued diffusion process which explores favourable subspaces. Both our theoretical and experimental results suggest that GSVGD enjoys efficient state-space exploration in high-dimensional problems that have an intrinsic low-dimensional structure.


Nesterov Accelerated Shuffling Gradient Method for Convex Optimization

arXiv.org Machine Learning

In this paper, we propose Nesterov Accelerated Shuffling Gradient (NASG), a new algorithm for the convex finite-sum minimization problems. Our method integrates the traditional Nesterov's acceleration momentum with different shuffling sampling schemes. We show that our algorithm has an improved rate of $\mathcal{O}(1/T)$ using unified shuffling schemes, where $T$ is the number of epochs. This rate is better than that of any other shuffling gradient methods in convex regime. Our convergence analysis does not require an assumption on bounded domain or a bounded gradient condition. For randomized shuffling schemes, we improve the convergence bound further. When employing some initial condition, we show that our method converges faster near the small neighborhood of the solution. Numerical simulations demonstrate the efficiency of our algorithm.


Bilevel Optimization with a Lower-level Contraction: Optimal Sample Complexity without Warm-Start

arXiv.org Machine Learning

We analyze a general class of bilevel problems, in which the upper-level problem consists in the minimization of a smooth objective function and the lower-level problem is to find the fixed point of a smooth contraction map. This type of problems include instances of meta-learning, hyperparameter optimization and data poisoning adversarial attacks. Several recent works have proposed algorithms which warm-start the lower-level problem, i.e. they use the previous lower-level approximate solution as a staring point for the lower-level solver. This warm-start procedure allows one to improve the sample complexity in both the stochastic and deterministic settings, achieving in some cases the order-wise optimal sample complexity. We show that without warm-start, it is still possible to achieve order-wise optimal and near-optimal sample complexity for the stochastic and deterministic settings, respectively. In particular, we propose a simple method which uses stochastic fixed point iterations at the lower-level and projected inexact gradient descent at the upper-level, that reaches an $\epsilon$-stationary point using $O(\epsilon^{-2})$ and $\tilde{O}(\epsilon^{-1})$ samples for the stochastic and the deterministic setting, respectively. Compared to methods using warm-start, ours is better suited for meta-learning and yields a simpler analysis that does not need to study the coupled interactions between the upper-level and lower-level iterates.


Anticorrelated Noise Injection for Improved Generalization

arXiv.org Machine Learning

Injecting artificial noise into gradient descent (GD) is commonly employed to improve the performance of machine learning models. Usually, uncorrelated noise is used in such perturbed gradient descent (PGD) methods. It is, however, not known if this is optimal or whether other types of noise could provide better generalization performance. In this paper, we zoom in on the problem of correlating the perturbations of consecutive PGD steps. We consider a variety of objective functions for which we find that GD with anticorrelated perturbations ("Anti-PGD") generalizes significantly better than GD and standard (uncorrelated) PGD. To support these experimental findings, we also derive a theoretical analysis that demonstrates that Anti-PGD moves to wider minima, while GD and PGD remain stuck in suboptimal regions or even diverge. This new connection between anticorrelated noise and generalization opens the field to novel ways to exploit noise for training machine learning models.


The Implicit Bias of Gradient Descent on Generalized Gated Linear Networks

arXiv.org Machine Learning

Understanding the asymptotic behavior of gradient-descent training of deep neural networks is essential for revealing inductive biases and improving network performance. We derive the infinite-time training limit of a mathematically tractable class of deep nonlinear neural networks, gated linear networks (GLNs), and generalize these results to gated networks described by general homogeneous polynomials. We study the implications of our results, focusing first on two-layer GLNs. We then apply our theoretical predictions to GLNs trained on MNIST and show how architectural constraints and the implicit bias of gradient descent affect performance. Finally, we show that our theory captures a substantial portion of the inductive bias of ReLU networks. By making the inductive bias explicit, our framework is poised to inform the development of more efficient, biologically plausible, and robust learning algorithms.


Differential Privacy Guarantees for Stochastic Gradient Langevin Dynamics

arXiv.org Machine Learning

We analyse the privacy leakage of noisy stochastic gradient descent by modeling R\'enyi divergence dynamics with Langevin diffusions. Inspired by recent work on non-stochastic algorithms, we derive similar desirable properties in the stochastic setting. In particular, we prove that the privacy loss converges exponentially fast for smooth and strongly convex objectives under constant step size, which is a significant improvement over previous DP-SGD analyses. We also extend our analysis to arbitrary sequences of varying step sizes and derive new utility bounds. Last, we propose an implementation and our experiments show the practical utility of our approach compared to classical DP-SGD libraries.


Backpropagation Neural Tree

arXiv.org Artificial Intelligence

We propose a novel algorithm called Backpropagation Neural Tree (BNeuralT), which is a stochastic computational dendritic tree. BNeuralT takes random repeated inputs through its leaves and imposes dendritic nonlinearities through its internal connections like a biological dendritic tree would do. Considering the dendritic-tree like plausible biological properties, BNeuralT is a single neuron neural tree model with its internal sub-trees resembling dendritic nonlinearities. BNeuralT algorithm produces an ad hoc neural tree which is trained using a stochastic gradient descent optimizer like gradient descent (GD), momentum GD, Nesterov accelerated GD, Adagrad, RMSprop, or Adam. BNeuralT training has two phases, each computed in a depth-first search manner: the forward pass computes neural tree's output in a post-order traversal, while the error backpropagation during the backward pass is performed recursively in a pre-order traversal. A BNeuralT model can be considered a minimal subset of a neural network (NN), meaning it is a "thinned" NN whose complexity is lower than an ordinary NN. Our algorithm produces high-performing and parsimonious models balancing the complexity with descriptive ability on a wide variety of machine learning problems: classification, regression, and pattern recognition.