Goto

Collaborating Authors

 Gradient Descent


Depth Without the Magic: Inductive Bias of Natural Gradient Descent

arXiv.org Machine Learning

In gradient descent, changing how we parametrize the model can lead to drastically different optimization trajectories, giving rise to a surprising range of meaningful inductive biases: identifying sparse classifiers or reconstructing low-rank matrices without explicit regularization. This implicit regularization has been hypothesised to be a contributing factor to good generalization in deep learning. However, natural gradient descent is approximately invariant to reparameterization, it always follows the same trajectory and finds the same optimum. The question naturally arises: What happens if we eliminate the role of parameterization, which solution will be found, what new properties occur? We characterize the behaviour of natural gradient flow in deep linear networks for separable classification under logistic loss and deep matrix factorization. Some of our findings extend to nonlinear neural networks with sufficient but finite over-parametrization. We demonstrate that there exist learning problems where natural gradient descent fails to generalize, while gradient descent with the right architecture performs well.


Gaussian Process Inference Using Mini-batch Stochastic Gradient Descent: Convergence Guarantees and Empirical Benefits

arXiv.org Machine Learning

Stochastic gradient descent (SGD) and its variants have established themselves as the go-to algorithms for large-scale machine learning problems with independent samples due to their generalization performance and intrinsic computational advantage. However, the fact that the stochastic gradient is a biased estimator of the full gradient with correlated samples has led to the lack of theoretical understanding of how SGD behaves under correlated settings and hindered its use in such cases. In this paper, we focus on hyperparameter estimation for the Gaussian process (GP) and take a step forward towards breaking the barrier by proving minibatch SGD converges to a critical point of the full log-likelihood loss function, and recovers model hyperparameters with rate $O(\frac{1}{K})$ for $K$ iterations, up to a statistical error term depending on the minibatch size. Our theoretical guarantees hold provided that the kernel functions exhibit exponential or polynomial eigendecay which is satisfied by a wide range of kernels commonly used in GPs. Numerical studies on both simulated and real datasets demonstrate that minibatch SGD has better generalization over state-of-the-art GP methods while reducing the computational burden and opening a new, previously unexplored, data size regime for GPs.


Finding Useful Predictions by Meta-gradient Descent to Improve Decision-making

arXiv.org Artificial Intelligence

In computational reinforcement learning, a growing body of work seeks to express an agent's model of the world through predictions about future sensations. In this manuscript we focus on predictions expressed as General Value Functions: temporally extended estimates of the accumulation of a future signal. One challenge is determining from the infinitely many predictions that the agent could possibly make which might support decision-making. In this work, we contribute a meta-gradient descent method by which an agent can directly specify what predictions it learns, independent of designer instruction. To that end, we introduce a partially observable domain suited to this investigation. We then demonstrate that through interaction with the environment an agent can independently select predictions that resolve the partial-observability, resulting in performance similar to expertly chosen value functions. By learning, rather than manually specifying these predictions, we enable the agent to identify useful predictions in a self-supervised manner, taking a step towards truly autonomous systems.


GFlowNet Foundations

arXiv.org Artificial Intelligence

Generative Flow Networks (GFlowNets) have been introduced as a method to sample a diverse set of candidates in an active learning context, with a training objective that makes them approximately sample in proportion to a given reward function. In this paper, we show a number of additional theoretical properties of GFlowNets. They can be used to estimate joint probability distributions and the corresponding marginal distributions where some variables are unspecified and, of particular interest, can represent distributions over composite objects like sets and graphs. GFlowNets amortize the work typically done by computationally expensive MCMC methods in a single but trained generative pass. They could also be used to estimate partition functions and free energies, conditional probabilities of supersets (supergraphs) given a subset (subgraph), as well as marginal distributions over all supersets (supergraphs) of a given set (graph). We introduce variations enabling the estimation of entropy and mutual information, sampling from a Pareto frontier, connections to reward-maximizing policies, and extensions to stochastic environments, continuous actions and modular energy functions.


Network Generation with Differential Privacy

arXiv.org Artificial Intelligence

We consider the problem of generating private synthetic versions of real-world graphs containing private information while maintaining the utility of generated graphs. Differential privacy is a gold standard for data privacy, and the introduction of the differentially private stochastic gradient descent (DP-SGD) algorithm has facilitated the training of private neural models in a number of domains. Recent advances in graph generation via deep generative networks have produced several high performing models. We evaluate and compare state-of-the-art models including adjacency matrix based models and edge based models, and show a practical implementation that favours the edge-list approach utilizing the Gaussian noise mechanism, when evaluated on commonly used graph datasets. Based on our findings, we propose a generative model that can reproduce the properties of real-world networks while maintaining edge-differential privacy. The proposed model is based on a stochastic neural network that generates discrete edge-list samples and is trained using the Wasserstein GAN objective with the DP-SGD optimizer. Being the first approach to combine these beneficial properties, our model contributes to further research on graph data privacy.


PredProp: Bidirectional Stochastic Optimization with Precision Weighted Predictive Coding

arXiv.org Artificial Intelligence

We present PredProp, a method for bidirectional, parallel and local optimisation of weights, activities and precision in neural networks. PredProp jointly addresses inference and learning, scales learning rates dynamically and weights gradients by the curvature of the loss function by optimizing prediction error precision. PredProp optimizes network parameters with Stochastic Gradient Descent and error forward propagation based strictly on prediction errors and variables locally available to each layer. Neighboring layers optimise shared activity variables so that prediction errors can propagate forward in the network, while predictions propagate backwards. This process minimises the negative Free Energy, or evidence lower bound of the entire network. We show that networks trained with PredProp resemble gradient based predictive coding when the number of weights between neighboring activity variables is one. In contrast to related work, PredProp generalizes towards backward connections of arbitrary depth and optimizes precision for any deep network architecture. Due to the analogy between prediction error precision and the Fisher information for each layer, PredProp implements a form of Natural Gradient Descent. When optimizing DNN models, layer-wise PredProp renders the model a bidirectional predictive coding network. Alternatively DNNs can parameterize the weights between two activity variables. We evaluate PredProp for dense DNNs on simple inference, learning and combined tasks. We show that, without an explicit sampling step in the network, PredProp implements a form of variational inference that allows to learn disentangled embeddings from low amounts of data and leave evaluation on more complex tasks and datasets to future work.


Stochastic Gradient Line Bayesian Optimization: Reducing Measurement Shots in Optimizing Parameterized Quantum Circuits

arXiv.org Machine Learning

Optimization of parameterized quantum circuits is indispensable for applications of near-term quantum devices to computational tasks with variational quantum algorithms (VQAs). However, the existing optimization algorithms for VQAs require an excessive number of quantum-measurement shots in estimating expectation values of observables or iterating updates of circuit parameters, whose cost has been a crucial obstacle for practical use. To address this problem, we develop an efficient framework, \textit{stochastic gradient line Bayesian optimization} (SGLBO), for the circuit optimization with fewer measurement shots. The SGLBO reduces the cost of measurement shots by estimating an appropriate direction of updating the parameters based on stochastic gradient descent (SGD) and further by utilizing Bayesian optimization (BO) to estimate the optimal step size in each iteration of the SGD. We formulate an adaptive measurement-shot strategy to achieve the optimization feasibly without relying on precise expectation-value estimation and many iterations; moreover, we show that a technique of suffix averaging can significantly reduce the effect of statistical and hardware noise in the optimization for the VQAs. Our numerical simulation demonstrates that the SGLBO augmented with these techniques can drastically reduce the required number of measurement shots, improve the accuracy in the optimization, and enhance the robustness against the noise compared to other state-of-art optimizers in representative tasks for the VQAs. These results establish a framework of quantum-circuit optimizers integrating two different optimization approaches, SGD and BO, to reduce the cost of measurement shots significantly.


Route Optimization via Environment-Aware Deep Network and Reinforcement Learning

arXiv.org Artificial Intelligence

Taxicab service plays an essential and irreplaceable role in urban traffic system [Ji et al., 2020]. For example, in New York City, there are more than 21,000 taxi drivers and more than 80,000 ride-sharing drivers. Compared to other means of daily transportation, such as bus and subway, taxis usually offers a better trip experience in terms of comfort, convenience, and travel time accommodation. Thus, it has been a long-standing central issue to improve the efficiency of vehicle mobility by optimizing the route recommendation for drivers for taxi services in big cities like New York, Tokyo, and Beijing [Yuan et al., 2011, Zheng et al., 2014]. Based on large-scale taxi trace data, there is an extensive literature on route recommendation systems. Some studies focus on the traditional optimization method. For example, Qu et al. [2014] proposed a cost-efficient objective function and developed a greedy method to maximize the potential net profit. Similar methods can be found in [Ding et al., 2013, Zhou et al., 2016]. Stochastic optimization methods (e.g., simulated annealing -SA-) and parallel computing techniques have also been applied to route recommendation problems to speed up the route searching tasks (see [Ye This manuscript has been accepted by ACM Transactions on Intelligent Systems and Technology on April 25, 2021.


How Parallelization and Large Batch Size Improve the Performance of Deep Neural Networks.

#artificialintelligence

Large Batch Size had till recently been viewed as a deterrent for good accuracy. However recent studies show that increasing the batch size can significantly reduce the training time while maintaining a considerable level of accuracy. In this blog, we draw on our inferences from four such technical papers. The RMSprop Warm-up phase is used to address the optimization difficulty at the start of the training. The update rule demonstrated below utilizes both the Stochastic Gradient Descent (SGD) along the RMSprop optimization algorithm.


The Role of Adaptive Optimizers for Honest Private Hyperparameter Selection

arXiv.org Machine Learning

Hyperparameter optimization is a ubiquitous challenge in machine learning, and the performance of a trained model depends crucially upon their effective selection. While a rich set of tools exist for this purpose, there are currently no practical hyperparameter selection methods under the constraint of differential privacy (DP). We study honest hyperparameter selection for differentially private machine learning, in which the process of hyperparameter tuning is accounted for in the overall privacy budget. To this end, we i) show that standard composition tools outperform more advanced techniques in many settings, ii) empirically and theoretically demonstrate an intrinsic connection between the learning rate and clipping norm hyperparameters, iii) show that adaptive optimizers like DPAdam enjoy a significant advantage in the process of honest hyperparameter tuning, and iv) draw upon novel limiting behaviour of Adam in the DP setting to design a new and more efficient optimizer.